Concentration Check: KDEs and Hexbins
Zander Bailey
Posted on April 28, 2019
We all know histograms are useful for showing the distribution of a single variable, but it turns out that we can also look at histogram-like distributions of two variables using Hex-Bin and Multivariate KDE plots. Before we can compare these two types of plots, it’s important to understand exactly how each one works. Both are laid out like a scatter plot, and both map the density of points across two axes, but they have some important differences. A Multivariate KDE plot shows a smooth gradient of density, whereas a Hex-Bin plot divides the area into discrete cells, which makes the difference in density between neighboring areas more distinct. So let’s start by looking at each type of plot on its own.
Multivariate KDE Plot
KDE stands for Kernel Density Estimation. A normal (univariate) KDE plot is similar to a histogram: it visualizes the distribution of a single variable. But instead of showing the number of values in each range as a series of distinct bars or ‘bins’, a KDE displays a continuous, smooth curve that approximates the shape of the histogram.
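To make the ‘smoothed histogram’ idea concrete, here is a minimal sketch of a univariate KDE written by hand with NumPy. It assumes a Gaussian kernel and a hand-picked bandwidth, and it runs on synthetic numbers rather than the movie data; the function name `gaussian_kde_1d` and the variable names are made up for this illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j]: distance from grid point i to sample j, in bandwidth units
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in for a single variable such as run time (assumed values)
runtimes = rng.normal(loc=95, scale=12, size=500)

# The smooth curve: density evaluated on a fine grid
grid = np.linspace(40, 150, 221)
density = gaussian_kde_1d(runtimes, grid, bandwidth=4.0)

# The histogram it approximates, normalized so the bar areas sum to 1
hist_heights, bin_edges = np.histogram(runtimes, bins=20, density=True)
```

Plotting `density` against `grid` on top of the histogram bars shows the curve tracing the tops of the bars. A smaller bandwidth makes the curve bumpier, and a larger one makes it flatter, much like changing the bin width of a histogram.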
A Multivariate KDE shows the density of points across a graph, on the same axes that would be used for a scatter plot. In this way, a Multivariate KDE more closely resembles a scatter plot than a histogram, but it serves a different function and helps reveal other trends in the data. A Multivariate KDE plot doesn’t care about mapping the individual points; it cares about how those points are distributed across the plot. It shows a gradient of color across the area(s) where there are data points, with lighter areas indicating a lower concentration of points and darker areas indicating a higher concentration. This is useful because a scatter plot with many data points in close proximity can turn into a solid blob, making it difficult to tell how many points sit in any one area. A Multivariate KDE helps you understand larger sets of data as they would appear on a scatter plot, and see where the data is most tightly packed.
A scatter plot depicting the ratings of horror movies from the last 7 years, compared with their run time. Movie data from IMDB.
The same data on a KDE plot.
With a KDE plot you can see that a large number of points are concentrated in one area, even though on the scatter plot the points appear evenly spread. It’s also an easy way to show where the data is concentrated on plots that look more scattered.
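Here is a rough sketch of how a scatter plot and a bivariate KDE of the same points could be drawn side by side. This is not the code behind the figures above: it assumes SciPy’s `gaussian_kde` with its default bandwidth, Matplotlib’s `contourf` for the shading, and synthetic points standing in for the IMDB run times and ratings.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Synthetic stand-ins for run time (minutes) and rating; not the IMDB data
runtime = rng.normal(95, 10, size=2000)
rating = np.clip(rng.normal(5.5, 1.2, size=2000), 1, 10)

# Fit a 2-D KDE; gaussian_kde expects an array of shape (dimensions, points)
kde = gaussian_kde(np.vstack([runtime, rating]))

# Evaluate the estimated density on a regular grid covering the data
xs = np.linspace(runtime.min(), runtime.max(), 100)
ys = np.linspace(rating.min(), rating.max(), 100)
xx, yy = np.meshgrid(xs, ys)
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

fig, (ax_scatter, ax_kde) = plt.subplots(
    1, 2, figsize=(10, 4), sharex=True, sharey=True
)
ax_scatter.scatter(runtime, rating, s=5)
ax_scatter.set_title("Scatter")

# Darker shades of the "Blues" color map mark higher estimated density
contours = ax_kde.contourf(xx, yy, density, levels=10, cmap="Blues")
ax_kde.set_title("Bivariate KDE")
fig.colorbar(contours, ax=ax_kde, label="estimated density")
```

If you use seaborn, `seaborn.kdeplot` with both `x` and `y` given should produce a similar plot in one call. The explicit version is shown here only to make the steps visible: fit the estimate, evaluate it on a grid, then shade the grid.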
Hex-Bin
At first glance, a Hex-Bin plot appears very similar to a KDE, and indeed they serve a similar function. A Hex-Bin also shows the density and distribution of data on the axes of a scatter plot, and it also uses light and dark colors to indicate concentration, or lack thereof. But a Hex-Bin displays more distinct sections of concentration. In some ways, a Hex-Bin is like looking at a histogram from the top and seeing taller columns as darker and shorter columns as lighter. Instead of columns, each hexagon is a ‘bin’, hence Hex-Bin. The advantage over KDE plots is that with Hex-Bins you can more easily show concentrations in specific spots, instead of a general gradient. Like the bins of a histogram, the hexagons can be resized so that each one covers a larger or smaller range of data.
Plot with bigger hexes
Plot with really big hexes
Here we can see Hex-Bin plots with increasingly large hexes, which changes how the concentration of data appears. With smaller hexes there are more individual points of density, and more contrast between neighboring hexes. On the plot with medium hexes the distribution appears smoother, approaching the appearance of a Multivariate KDE plot. But when we increase the size of the hexes further, the drop-off in density from one hex to the next becomes more drastic, and we can see that the highest concentration is now in only one or two hexes.
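The effect of hex size can be tried out with Matplotlib’s `hexbin`, where `gridsize` sets the number of hexagons across the x-axis, so a smaller `gridsize` means bigger hexes. The sketch below uses synthetic points and arbitrarily chosen grid sizes, not the data or settings behind the plots above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-ins for run time and rating; not the IMDB data
runtime = rng.normal(95, 10, size=2000)
rating = rng.normal(5.5, 1.2, size=2000)

# Hexagons across the x-axis: a larger number means smaller hexes
gridsizes = [40, 20, 8]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
max_counts = []
for ax, gridsize in zip(axes, gridsizes):
    hexes = ax.hexbin(runtime, rating, gridsize=gridsize, cmap="Blues")
    # get_array() holds the number of points that fell into each hexagon
    max_counts.append(int(hexes.get_array().max()))
    ax.set_title(f"gridsize={gridsize}")
    fig.colorbar(hexes, ax=ax, label="points per hex")
```

Comparing the values in `max_counts` shows the same thing the plots do: as the hexes grow, each one swallows more points, so the busiest hex holds a larger share of the data and the color scale stretches accordingly.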
Purpose
Both types of plots are useful for looking at where the data on two axes is concentrated. The comparison to a histogram has come up multiple times, and for good reason. A histogram shows the distribution of data across one axis, showing the ranges that more or fewer data points fall into. Hex-Bins and Multivariate KDEs visualize a similar concept, but across two axes. A KDE is more useful for showing the overall area or areas of concentration, using a more topographical approach, while a Hex-Bin can zero in on more specific concentrations by dividing the area into discrete hexagonal cells. So it turns out that despite appearing very similar, these two types of plots each have their own uses.