The Histogram: Some Common and Not-So-Common Variants

What is it?

The traditional Histogram provides an approximate representation of the distribution of continuous data. This is typically done by binning, or discretizing, the data into ranges of equal width. These ranges are then represented as vertical bars, where the area of each bar corresponds to the number of data points falling within that bin. The bars are placed directly beside one another along a continuous horizontal axis. When the bin widths are equal across all bins, the height of each bar, in addition to its area, corresponds to the count, or frequency, of values falling within that bin’s range. See an example below.

To build in R with ggplot2 and plot.data:

ggplot2::ggplot(iris) +
  ggplot2::geom_histogram(ggplot2::aes(x = Sepal.Length),
                          binwidth = plot.data::findBinWidth(iris$Sepal.Length),
                          color = '#332288', fill = '#332288')

While a Histogram is usually constructed with bins of equal width, it is also possible to construct one using variable bin widths. In this relatively uncommon case, the bins are usually chosen to be equiprobable rather than equal in width. The objective then becomes to create bars which all represent roughly the same number of data points. Consequently, each bar in a variable bin width Histogram has roughly the same area, and the height of a bar reflects how densely the data are packed within its bin.
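As a rough sketch of the equiprobable case, the snippet below uses base R only (not plot.data) and takes sample quantiles as bin boundaries. The choice of five bins and the names `n_bins`, `breaks`, `widths` and `areas` are my own assumptions for illustration. Because Sepal.Length contains many tied values, the counts come out only approximately equal.

```r
# Equiprobable (roughly equal-count) bins for iris$Sepal.Length, base R only.
x <- iris$Sepal.Length
n_bins <- 5  # illustrative choice, not a recommendation

# Quantile-based break points: each interval holds roughly 1/n_bins of the data.
probs  <- seq(0, 1, length.out = n_bins + 1)
breaks <- unique(quantile(x, probs = probs, names = FALSE))

# plot = FALSE returns the bin summary without drawing anything.
h <- hist(x, breaks = breaks, plot = FALSE)

widths <- diff(h$breaks)
areas  <- h$density * widths  # bar area on the density scale = share of the data

# One row per bin: narrow bins get tall bars, wide bins get short ones.
data.frame(left    = head(h$breaks, -1),
           right   = tail(h$breaks, -1),
           count   = h$counts,
           density = round(h$density, 3),
           area    = round(areas, 3))
```

Drawing the result with `plot(h)` should show bars of unequal width whose areas are all close to 1 / `n_bins`.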

A Histogram can also represent relative frequencies, as opposed to raw counts. In this case the height of each bar represents a proportion of the whole, with the heights of all bars summing to 1. If the histogram has been stratified and the resulting groups are overlaid atop one another, then the proportions will sum to 1 within each group. If the histogram has been stratified and the resulting groups are stacked, then the proportions will sum to 1 within each bin range.
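The three normalizations described above can be sketched in base R as follows. The break points (width 0.4 starting at 4) and the variable names are assumptions made for this illustration, and Species is used as the stratifying variable.

```r
# Relative frequencies for iris$Sepal.Length, base R only.
x      <- iris$Sepal.Length
breaks <- seq(4, 8, by = 0.4)  # assumed bins, chosen only for this sketch

# Unstratified: proportions sum to 1 across all bars.
h    <- hist(x, breaks = breaks, plot = FALSE)
prop <- h$counts / sum(h$counts)

# Stratified counts: one row per bin, one column per species.
counts <- sapply(split(x, iris$Species), function(v) {
  hist(v, breaks = breaks, plot = FALSE)$counts
})

# Overlaid groups: proportions sum to 1 within each group (column).
overlaid <- prop.table(counts, margin = 2)

# Stacked groups: proportions sum to 1 within each bin (row).
# A bin with no data at all would give NaN here.
stacked <- prop.table(counts, margin = 1)

sum(prop)
colSums(overlaid)
rowSums(stacked)
```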

One final variation of the Histogram worth mentioning is the cumulative Histogram. In this case each bar represents not only the frequency of values falling within its own bin but also the frequencies of all previous bins along the range. It is not uncommon, in this case, for the Y-axis to show a percent rather than a raw count.
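A minimal base R sketch of the cumulative variant is below. The bins (width 0.4 starting at 4) are again an assumption for illustration. Each bar's value is the running total of the ordinary bin counts.

```r
# Cumulative histogram values for iris$Sepal.Length, base R only.
x <- iris$Sepal.Length
h <- hist(x, breaks = seq(4, 8, by = 0.4), plot = FALSE)  # assumed bins

cum_count <- cumsum(h$counts)             # running total of bin counts
cum_pct   <- 100 * cum_count / length(x)  # same values expressed as a percent

# The final bar always reaches the full sample: 150 points, or 100 percent.
data.frame(upper_edge = tail(h$breaks, -1),
           count      = h$counts,
           cum_count  = cum_count,
           cum_pct    = round(cum_pct, 1))
```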

When do I use it?

The Histogram is best used for becoming acquainted with a single variable, and is recommended as the first visualization you build for a quantitative variable of potential interest. Histograms are very effective at providing consistent approximations of distributions. They lack the granularity of the Stem and Leaf Plot, but are easier to construct. They lack the accuracy of the Density Plot, but are easier to describe statistically. They cannot compare multiple distributions as easily as the Frequency Polygon, but better support reads of ‘how much’ of something is present. They are even capable of partially supporting the investigation of part-to-whole relationships, by combining the stacked and proportional variations. Overall, they are a fairly strong general-purpose statistical tool.

How do I read it?

Ideally, the Histogram should be able to support reads of both length and area. The human brain is much better at comparing lengths than areas, so the read of bar length perhaps takes precedence. This makes it critically important that the Y-axis have an origin at 0 in order to avoid being misleading. However, Histograms inherently represent continuous data along the X-axis and are essentially a visual representation of a discretized approximation of the actual density curve. The area beneath that approximated density curve can be conceptually considered a single graphical element. Consequently, I generally prefer to draw Histograms without a border around the bars.

It is important to remember while reading a Histogram that no one bin width can really be said to be better than all others. It is possible to programmatically identify a reasonable default, but the reader remains responsible for modifying that default bin width and understanding what the perceived changes in the approximate distribution imply about the true distribution. In general, wider bins reduce noise, particularly in areas of low density. If the bins are too wide, however, they can hide important signals as well. What is ideal depends on a number of factors including, but not limited to, the total number of data points, their skewness and their density.

See the same data with varying bin widths below. The first image uses the programmatically identified default of 0.4. The second uses a bin width of 0.2 and the third uses 0.6. The smaller bin width lets us see a region of apparently lower density about a quarter of the way along the range. The larger bin width, however, makes it obvious very quickly where the majority of the data is located. The default provides a bit of the best of both worlds.
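To explore the same comparison numerically, here is a rough base R sketch that tabulates the counts for each of the three bin widths. The helper `bin_counts` is hypothetical, and its choice to start the first bin at `floor(min(x))` is my own assumption. ggplot2 and plot.data may place bin edges differently, so the exact counts need not match the images.

```r
# Bin counts for iris$Sepal.Length at three bin widths, base R only.
x <- iris$Sepal.Length

# Hypothetical helper: equal-width bins starting at `origin` (an assumption).
bin_counts <- function(x, width, origin = floor(min(x))) {
  # Extend the last edge far enough to cover the maximum value.
  upper  <- origin + width * ceiling((max(x) - origin) / width)
  breaks <- seq(origin, upper, by = width)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

widths <- c(0.2, 0.4, 0.6)
counts_by_width <- lapply(widths, function(w) bin_counts(x, w))
names(counts_by_width) <- paste0("width_", widths)

# Narrow bins give more, noisier bars; wide bins give fewer, smoother bars.
counts_by_width
```

Every set of counts adds up to the same 150 observations. Only the way they are divided among bars changes.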

Since the features that help identify informative bin widths change over the range of the data, it may in some cases be advantageous to use variable bin widths. It is important to note, though, that variable bin width Histograms must be read slightly differently. As the area of each bar is roughly equal, trying to read those areas is not likely to be very accurate or informative. However, the length of the bars may still be read as a rough estimate of the density distribution. It can be difficult to compare the widths of the bins in a Histogram constructed this way, as they are not plotted against a consistent baseline.

It’s worth emphasizing one final time that Histograms represent only an approximate distribution of a continuous variable. This is a consequence of binning, which can create perceived differences in the size of bars which are merely artifacts of the chosen bin width. A trivial example can be seen in the binning of integers. If we have integer values between 1 and 5 and a bin width of 2.75 (assuming the first bin starts at 0), then two bins will result. The first will hold the count of values 1 and 2. The second will hold the count of values 3, 4 and 5. Consequently, the second bin covers 1.5 times as many distinct values as the first, and the result is a Histogram with systematic false peaks.
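The integer example can be checked with a few lines of base R. The data here are made up for illustration: ten copies of each integer from 1 to 5, so the true distribution is perfectly flat. The bin edges assume the first bin starts at 0.

```r
# Binning artifact with integer data, base R only.
x <- rep(1:5, each = 10)    # made-up data: exactly ten of each value 1..5

breaks <- c(0, 2.75, 5.5)   # bin width 2.75, first bin assumed to start at 0
h <- hist(x, breaks = breaks, plot = FALSE)

h$counts                    # 20 30
h$counts[2] / h$counts[1]   # 1.5
```

The second bar is 1.5 times as tall as the first even though every value is equally common. The apparent peak comes from the bin edges alone.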

Resources

  1. Wikipedia
  2. Signal: Understanding What Matters in a World of Noise by Stephen Few
  3. Exploratory Data Analysis by Frederick Hartwig and Brian E. Dearing