Histogram: Your First Statistical Analysis
Madhuri Patil
Posted on June 3, 2024
The first step in any machine learning project is to analyze the data before building a model. This includes understanding the data, and visualizing its distribution is one of the main tools for that early analysis.
A visual representation can reveal a lot about the underlying distribution: whether it is approximately normal or skewed, whether it has a single peak or multiple peaks, where its central tendency lies, and whether there are potential outliers. All of these are crucial for understanding the underlying structure of the data.
There are several methods that you can use to visualize a distribution, and each has its own set of advantages and disadvantages.
The most common method for visualizing a distribution is the histogram.
In this article, let's study histograms and learn how to use them effectively to reveal valuable information in the data, and how to avoid common pitfalls such as an inappropriate choice of bin size.
What is a Histogram
A histogram is a graphical representation of the distribution of numerical data. It divides the data into bins, or intervals, and displays the frequency of occurrences within each bin using bars of varying heights.
Histograms are best suited to continuous data, because they represent how data points are distributed across continuous intervals.
Discrete numeric data, on the other hand, often takes only a finite number of fixed values, which can produce a misleading picture if forced into a histogram.
The figure above shows histograms for both continuous and discrete data values.
A histogram is a type of bar plot that shows the count of data points falling within each range of values, known as a bin.
The bins are typically of equal width, which we can observe in both graphs, and there should be no gaps between the bars, as in the plot for continuous data (left figure). In the histogram for discrete data (right figure), however, there are large gaps between the bars.
If you are using seaborn for plotting, you can specify the discrete=True parameter, but it does not always help. Alternative visualizations such as bar charts or frequency tables are typically more appropriate, as they accurately display the count or frequency of each unique value.
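For discrete values, a frequency table takes only a couple of lines. The sketch below assumes pandas is installed and uses a small made-up list of party sizes; the `party_sizes` values are hypothetical, chosen for illustration rather than taken from a real dataset.

```python
import pandas as pd

# Hypothetical discrete data: number of people at each table (illustrative values)
party_sizes = pd.Series([2, 2, 3, 4, 2, 3, 2, 6, 4, 2], name="party_size")

# Count how often each unique value occurs, ordered by the value itself
freq_table = party_sizes.value_counts().sort_index()
print(freq_table)
```

The same counts can be drawn as a bar chart, for example with freq_table.plot(kind="bar"), so that each unique value gets exactly one bar.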
Plotting a Histogram with Python's Seaborn Library
You can use the histplot function of the seaborn library to plot a histogram. It offers a range of options for visualizing data effectively.
# import seaborn library
import seaborn as sns
import matplotlib.pyplot as plt
# load dataset
tips = sns.load_dataset('tips')
# Univariate plotting of histogram
sns.histplot(tips, x='total_bill', bins=20)
plt.grid(ls="--", c='#000', alpha=0.3)
plt.show()
The above plot reveals a few insights about the total bill of customers' meals. For instance:
- The distribution has a single peak, with the most common total bill between $14 and $16.
- The distribution has a long positive tail, which indicates right skewness and some potential outliers.
You can evaluate the normality of the data further by comparing its mean and median values. In a right-skewed distribution, the mean is typically larger than the median.
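A minimal sketch of that mean-versus-median check is shown here. It assumes NumPy is available, and the `bills` array is a hypothetical set of amounts with one large value in the right tail, not the tips data.

```python
import numpy as np

# Hypothetical bill amounts with one large value in the right tail (illustrative only)
bills = np.array([10, 12, 14, 15, 15, 16, 18, 20, 25, 48], dtype=float)

mean_bill = bills.mean()
median_bill = np.median(bills)

# A mean noticeably above the median is consistent with a right-skewed distribution
print(f"mean = {mean_bill:.2f}, median = {median_bill:.2f}")
is_right_skewed_hint = mean_bill > median_bill
```

On the tips dataset, the same comparison would be tips['total_bill'].mean() against tips['total_bill'].median(). This is only a quick hint about skewness, not a formal normality test.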
Selection of Bin Size
The choice of bin size is very important, as a wrong bin size can lead to misleading conclusions drawn from the visualization.
A bin size that is too small leads to a histogram with many bins (plot 1), each containing a small number of observations, which can result in an overly complex and noisy distribution.
This granularity can obscure the underlying trends and make it difficult to identify the true distribution pattern.
On the other hand, choosing an overly large bin size (plot 3) can significantly reduce the histogram's ability to represent the underlying data distribution accurately.
Large bins may oversimplify the distribution: because multiple values are grouped into a single bin, variations in the data points may be lost, making it difficult to identify trends or anomalies.
Conversely, a well-chosen bin size, shown in the second plot, helps highlight the true distribution of the data, allowing for better insights and decisions based on the visualized information.
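The effect can also be seen numerically. The sketch below assumes NumPy and uses a synthetic right-skewed sample as a stand-in for a column such as total_bill. The sample, the seed, and the bin counts 100, 20, and 3 are all illustrative choices.

```python
import numpy as np

# Synthetic right-skewed sample standing in for a column such as total_bill (assumption)
rng = np.random.default_rng(seed=0)
data = rng.gamma(shape=4.0, scale=5.0, size=244)

summary = {}
for n_bins in (100, 20, 3):
    # np.histogram returns the count per bin and the bin edges
    counts, edges = np.histogram(data, bins=n_bins)
    summary[n_bins] = {
        "bin_width": edges[1] - edges[0],
        "mean_count": counts.mean(),
        "empty_bins": int((counts == 0).sum()),
    }
    print(n_bins, summary[n_bins])
```

With many narrow bins, each bin holds only a few observations and some may be empty, which is where the noise comes from. With very few wide bins, almost all detail about the shape is averaged away.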
Sometimes it is more appropriate to specify the number of bins instead of their size.
There are several methods for selecting the right bin size; each has its advantages and is suitable for different types of data sets. You can learn about the different methods for selecting the bin size here.
By default, seaborn determines the bin size using a reference rule that depends on the sample size and variance. This works well in many cases (i.e., with "well-behaved" data), but it fails in others.
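To get a feel for how reference rules differ, you can ask NumPy directly how many bins each rule would produce. According to the seaborn documentation, the bins argument of histplot is passed to numpy.histogram_bin_edges, so the rule names below should also be usable there. The data is a synthetic sample, and the choice of rules to compare is an assumption made for illustration.

```python
import numpy as np

# Synthetic right-skewed sample (assumption, used only to compare the rules)
rng = np.random.default_rng(seed=0)
data = rng.gamma(shape=4.0, scale=5.0, size=1000)

bin_counts = {}
for rule in ("sturges", "fd", "scott", "auto"):
    # Each rule name is a bin-width estimator built into NumPy
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

Different rules can suggest noticeably different bin counts for the same data, which is one reason it helps to look at more than one setting.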
It is always good to try different bin sizes to be sure that you are not missing something important.
Seaborn lets you specify bins in several different ways, such as by setting the total number of bins to use, the width of each bin, or the specific locations where the bins should break.
hue
After univariate analysis of a particular feature, you should analyze its distribution further across the groups of another variable.
For instance, here we can analyze the distribution of total_bill across different groups of people, such as male and female customers.
sns.histplot(tips, x='total_bill', hue='sex', bins=20);
element
In the above figure, it is a little difficult to see the shape of each group's distribution, because the histograms are drawn on top of each other by default.
You can draw step functions instead of bars by setting the element parameter.
sns.histplot(tips, x='total_bill', hue='sex', bins=20, element='step');
kde
You can also visualize a smooth version of the distribution to understand the shape of the data by setting kde=True, which overlays a continuous kernel density estimate.
sns.histplot(tips, x='total_bill', kde=True, bins=20);
As the data changes, so does the shape of the histogram. Histograms come in various shapes, each with a different meaning. Understanding the implications of a histogram's shape can guide further analysis and algorithm selection, and it is crucial for interpreting data correctly and making informed decisions based on statistical information.
For instance, a normal distribution might suggest different data preprocessing steps or model assumptions than a bimodal distribution.
Let's explore these types and learn how to transform them into normally distributed data in upcoming tutorials. For now, let's conclude this article.
I hope this article helps you understand histograms using Seaborn. Don't forget to visit the Seaborn documentation to learn more details.
Reference
Seaborn offers much more functionality for analyzing data distributions effectively. You can learn about it in the seaborn histogram plot documentation.