Exploratory Data Analysis: Data Visualization

In this article, we’ll use data visualization to explore a dataset from Streeteasy which contains information about housing rentals in New York City.

Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and visualization techniques in order to bring important aspects of that data into focus for further analysis.

Univariate analysis
Univariate analysis focuses on a single variable at a time. Univariate data visualizations can help us answer questions like:

What is the typical price of a rental in New York City?
What proportion of NYC rentals have a gym?

Depending on the type of variable (quantitative or categorical) we want to visualize, we need to use slightly different visualizations.
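
The two questions above can also be answered numerically before any plotting. The following is a minimal sketch, not the article's own code: the values are invented for illustration, and it assumes has_gym is stored as a 0/1 indicator, which the original dataset may or may not do.

```python
import pandas as pd

# Hypothetical stand-in for the Streeteasy data: the values are invented;
# only the column names (rent, has_gym) come from the article.
rentals = pd.DataFrame({
    "rent": [2500, 3200, 4100, 2900, 12000, 3600],
    "has_gym": [0, 1, 1, 0, 1, 0],  # assumed 0/1 encoding
})

# Typical price: the median is less sensitive to extreme rents than the mean
typical_rent = rentals["rent"].median()

# Proportion of rentals with a gym: the mean of a 0/1 indicator column
gym_share = rentals["has_gym"].mean()

print(f"Median rent: ${typical_rent:,.0f}")
print(f"Share with a gym: {gym_share:.0%}")
```

Numbers like these complement the plots that follow: a plot shows the shape of a variable, while a summary statistic gives one value to report.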

Quantitative variables
Box plots (or violin plots) and histograms are common choices for visually summarizing a quantitative variable. These plots are useful because they simultaneously communicate information about minimum and maximum values, central location, and spread. Histograms can additionally reveal patterns that can affect an analysis (e.g., skew or multimodality).

For example, suppose we are interested in learning more about the price of apartments in NYC. A good starting place is a box plot of the rent variable. Assuming the data has been loaded into a pandas DataFrame named rentals, we could create the box plot as follows:

# Load libraries
import seaborn as sns
import matplotlib.pyplot as plt 

# Create the plot
sns.boxplot(x='rent', data=rentals)
plt.show()

We can see that most rental prices fall within a range of $2500-$5000; however, there are many outliers, particularly on the high end. For more detail, we can also plot a histogram of the rent variable.

# Create a histogram of the rent variable with 10 bins
sns.displot(data=rentals, x='rent', bins=10)
plt.show()

The histogram highlights the long right tail of rental prices. We can get a more detailed look at this distribution by increasing the number of bins:

# Create the same histogram with 50 bins for finer detail
sns.displot(data=rentals, x='rent', bins=50)
plt.show()
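
The outliers in a box plot and the right tail in a histogram can be checked with a few numbers. The sketch below uses invented rents, not the Streeteasy data. It assumes the common default of flagging points more than 1.5 times the interquartile range beyond the box.

```python
import pandas as pd

# Invented rents for illustration only; not taken from the Streeteasy data
rent = pd.Series([2500, 2900, 3200, 3600, 4100, 12000], name="rent")

# Quartiles and interquartile range (the edges and height of the box)
q1 = rent.quantile(0.25)
q3 = rent.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are the ones drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = rent[(rent < lower_fence) | (rent > upper_fence)]

# A mean well above the median and a positive skewness both
# point to a long right tail
mean_rent = rent.mean()
median_rent = rent.median()
skewness = rent.skew()

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
print(f"mean={mean_rent:.0f}, median={median_rent:.0f}, skew={skewness:.2f}")
```

With a right-skewed variable such as rent, the median is usually the safer summary of a typical value, since a handful of very expensive listings pulls the mean upward.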

Categorical variables
For categorical variables, we can use a bar plot (instead of a histogram) to quickly visualize the frequency (or proportion) of values in each category. For example, suppose we want to know how many apartments are available in each borough. We can visually represent that information as follows:

# Create a barplot of the counts in the borough variable
# The palette parameter will set the color scheme for the plot
sns.countplot(x='borough', data=rentals, palette='winter')
plt.show()
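
The counts behind a bar plot, and the corresponding proportions, can also be read off directly. This is a small sketch with made-up borough labels; the real dataset will have different counts.

```python
import pandas as pd

# Made-up borough labels for illustration only
rentals = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Manhattan",
                "Brooklyn", "Brooklyn", "Queens"],
})

# Frequency of each category (what the bar heights represent)
counts = rentals["borough"].value_counts()

# Proportion of each category (the same bars, rescaled to sum to 1)
proportions = rentals["borough"].value_counts(normalize=True)

print(counts)
print(proportions.round(2))
```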

Bivariate analysis
In many cases, a data analyst is interested in the relationship between two variables in a dataset. For example:

Do apartments in different boroughs tend to cost different amounts?
What is the relationship between the area of an apartment and how much it costs?
Depending on the types of variables we are interested in, we need to rely on different kinds of visualizations.

One quantitative variable and one categorical variable
Two good options for investigating the relationship between a quantitative variable and a categorical variable are side-by-side box plots and overlapping histograms.

For example, suppose we want to understand whether apartments in different boroughs cost different amounts. We could address this question by plotting side-by-side box plots of rent by borough:

# Create side-by-side box plots of rent, one box per borough
sns.boxplot(x='borough', y='rent', data=rentals, palette='Accent')
plt.show()

This plot indicates that rental prices in Manhattan tend to be higher and have more variation than rental prices in other boroughs. We could also investigate the same question in more detail by looking at overlapping histograms of rental prices by borough:

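# Overlay semi-transparent histograms of rent for three boroughs
# density=True rescales each histogram so boroughs with different numbers of listings are comparable
plt.hist(rentals.rent[rentals.borough=='Manhattan'], label='Manhattan', density=True, alpha=.5)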
plt.hist(rentals.rent[rentals.borough=='Queens'], label='Queens', density=True, alpha=.5)
plt.hist(rentals.rent[rentals.borough=='Brooklyn'], label='Brooklyn', density=True, alpha=.5)
plt.legend()
plt.show()
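
The differences in center and spread that these plots show can be summarized per group. The following sketch uses invented values and is only one way to do it: it groups rent by borough and reports the median and standard deviation of each group.

```python
import pandas as pd

# Invented values for illustration only
rentals = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Manhattan",
                "Brooklyn", "Brooklyn", "Queens", "Queens"],
    "rent": [4100, 5200, 12000, 2900, 3300, 2400, 2600],
})

# One row per borough: median as the center, standard deviation as the spread
summary = rentals.groupby("borough")["rent"].agg(["median", "std"])

print(summary.round(0))
```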

Two quantitative variables
A scatter plot is a great option for investigating the relationship between two quantitative variables. For example, if we want to explore the relationship between rent and size_sqft, we could create a scatter plot of these two variables:

# Create a scatter plot of rent against size_sqft
# (recent seaborn versions require x and y to be passed as keyword arguments)
sns.scatterplot(x='size_sqft', y='rent', data=rentals)
plt.show()

The plot indicates that there is a strong positive linear relationship between the cost to rent a property and its square footage. Larger properties tend to cost more money.
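
The strength of a linear relationship seen in a scatter plot can be quantified with the Pearson correlation coefficient, which ranges from -1 to 1. A minimal sketch on invented values (the coefficient for the real data will differ):

```python
import pandas as pd

# Invented values for illustration only
rentals = pd.DataFrame({
    "size_sqft": [450, 600, 750, 900, 1200, 2000],
    "rent": [2500, 2900, 3200, 3600, 4100, 12000],
})

# Pearson correlation between the two quantitative variables
r = rentals["size_sqft"].corr(rentals["rent"])

print(f"Pearson r between size_sqft and rent: {r:.2f}")
```

A value near 1 is consistent with a strong positive linear relationship, but the scatter plot is still worth inspecting, since outliers or curvature can distort a single coefficient.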

Two categorical variables
Side-by-side (or stacked) bar plots are useful for visualizing the relationship between two categorical variables. For example, suppose we want to know whether rentals that have an elevator are more likely to have a gym. We could create a side-by-side bar plot as follows:

# Count rentals by has_elevator, with separate bars for each value of has_gym
sns.countplot(x='has_elevator', hue='has_gym', data=rentals)
plt.show()

This plot tells us that rentals with an elevator are about equally likely to have a gym as not; meanwhile, rentals without an elevator are very unlikely to have a gym.
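
The same comparison can be expressed as a table of row proportions. The sketch below uses invented 0/1 indicators (the encoding is an assumption) chosen to mirror the pattern described above; it is not output from the real data.

```python
import pandas as pd

# Invented 0/1 indicators for illustration only
rentals = pd.DataFrame({
    "has_elevator": [1, 1, 1, 1, 0, 0, 0, 0],
    "has_gym":      [1, 1, 0, 0, 0, 0, 0, 0],
})

# normalize="index" makes each row sum to 1, so each row shows the share
# of rentals with and without a gym for one value of has_elevator
table = pd.crosstab(rentals["has_elevator"], rentals["has_gym"],
                    normalize="index")

print(table)
```

Row proportions make the two groups comparable even when one group has far more listings than the other, which raw counts in a bar plot can obscure.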

Multivariate analysis
Sometimes, a data analyst is interested in simultaneously exploring the relationship between three or more variables in a single visualization. Many of the visualization methods presented up to this point can include additional variables by using visual cues such as colors, shapes, and patterns. For example, we can investigate the relationship between rental price, square footage, and borough by using color to introduce our third variable:

# Scatter plot of rent against size_sqft, with points colored by borough
sns.scatterplot(x='size_sqft', y='rent', hue='borough', data=rentals, palette='bright')
plt.show()

Another common data visualization for multivariate analysis is a heat map of a correlation matrix for all quantitative variables:

# Define a diverging colormap with the diverging_palette function
colors = sns.diverging_palette(150, 275, s=80, l=55, n=9, as_cmap=True)

# Compute correlations for the numeric columns only (text columns such as borough
# cannot be correlated), then draw the heatmap with the colormap centered at 0
sns.heatmap(rentals.select_dtypes(include='number').corr(), center=0, cmap=colors, robust=True)
plt.show()
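
It can help to look at the correlation matrix itself before coloring it. The following sketch, on invented values, keeps only the numeric columns and computes their pairwise correlations; these are the numbers a heat map of this kind encodes as colors.

```python
import pandas as pd

# Invented values for illustration only
rentals = pd.DataFrame({
    "rent": [2500, 2900, 3200, 3600, 4100, 12000],
    "size_sqft": [450, 600, 750, 900, 1200, 2000],
    "has_gym": [0, 1, 0, 1, 1, 1],
    "borough": ["Queens", "Brooklyn", "Brooklyn",
                "Manhattan", "Manhattan", "Manhattan"],
})

# Drop non-numeric columns (borough), then compute pairwise Pearson correlations
numeric = rentals.select_dtypes(include="number")
corr_matrix = numeric.corr()

print(corr_matrix.round(2))
```

The matrix is symmetric with ones on the diagonal, so the heat map carries the same information above and below the diagonal.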

Conclusion
In this article, we’ve summarized some of the important considerations for choosing a data visualization based on the question a data analyst wants to answer and the type of data that is available.

Christopher Ambala
