Why does it matter?
It does. It really, really, really, really does. A Histogram is highly sensitive to its bin width, and may give different representations of the distribution of the underlying data depending on the bin width chosen. As discussed in my Graphical Literacy post on Histograms, it’s important to always adjust the bin width and see what features of the underlying distribution are revealed by different values.
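To make that concrete, here is a small, contrived sketch in Python. The data and the `bin_counts` helper are invented for illustration: two tight clusters are binned over the same range, once with wide bins and once with narrow ones.

```python
def bin_counts(values, lo, hi, k):
    """Count how many values fall into each of k equal-width bins on [lo, hi]."""
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # Clamp so that a value equal to hi lands in the last bin.
        i = min(int((v - lo) / width), k - 1)
        counts[i] += 1
    return counts


# Two made-up clusters of 50 points each, near 4.5 and near 6.5.
cluster_a = [4.1 + 0.8 * i / 49 for i in range(50)]
cluster_b = [6.1 + 0.8 * i / 49 for i in range(50)]
data = cluster_a + cluster_b

wide = bin_counts(data, lo=0, hi=12, k=3)     # bin width 4
narrow = bin_counts(data, lo=0, hi=12, k=12)  # bin width 1

print(wide)
print(narrow)
```

With a bin width of 4, both clusters land in the same bin and the data looks like a single lump. With a bin width of 1, the empty bin between the clusters shows up and the two groups are visible.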
Different bin widths may provide more or less truthful visual representations of the true distribution of the data. Even the best, the most truthful, the least misleading bin widths will likely represent some features of the distribution better than others. So how do you know when you have found a quality bin width? How do you find ideal bin widths without taking an eternity to do it? You don’t. The software does, or should.
What makes a bin width good depends on the features that bin width is meant to support visually. If it’s not an ideal bin width, it may be misleading you about those same features. As a reader of a Histogram, it’s a bit of a chicken-and-egg situation. Features are identified with good and informative bin widths, and good bin widths are identified based on the features they’re meant to represent.
Fortunately, we know some things about the types of features that influence optimal bin widths, such as the shape of the distribution, its sparsity and skewness. We also know how to programmatically and mathematically identify those features. A computer is much more able and willing to look at every value and do some arithmetic to identify features of particular interest to the bin width. We also have the hard work of a lot of statisticians throughout recent history to thank for the fact that we have some rules for identifying default bin widths that are known to accommodate data with particular features.
What software does it well?
Much of the software I’ve seen does it poorly, and perhaps I’m a particularly critical critic, but that discovery has left me disinclined to trust such software with automated statistics. I find it particularly disconcerting that the NumPy, matplotlib and pandas Python libraries all seem to simply default to 10 bins. I don’t really feel there is any justification for not using even the simplest of binning rules, and Python is awfully popular. That feels like a dangerous combination to me.
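For what it’s worth, the rules are available in NumPy if you ask for them by name; they just aren’t the default. The sketch below uses `numpy.histogram_bin_edges`, whose `bins` argument accepts strings such as "sqrt", "sturges", "doane", "scott" and "fd". The sample itself is made up for illustration.

```python
import numpy as np

# A made-up sample: 1,000 draws from a standard normal distribution.
rng = np.random.default_rng(seed=0)
sample = rng.normal(size=1_000)

# With no `bins` argument, NumPy falls back to 10 equal-width bins.
default_edges = np.histogram_bin_edges(sample)

# Asking for a rule by name lets NumPy choose the number of bins.
rule_bin_counts = {
    rule: len(np.histogram_bin_edges(sample, bins=rule)) - 1
    for rule in ("sqrt", "sturges", "doane", "scott", "fd")
}

print(len(default_edges) - 1)
print(rule_bin_counts)
```

So the complaint is about the default, not about what the library can do.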
Another noteworthy failure is Microsoft’s Excel. I’ve seen a number of references (including here) to their use of the square root rule in identifying the default bin width. By this rule the default bin width is found by first taking the square root of the sample size, then rounding up to the next integer to find the number of bins, and then dividing the range of the data by the number of bins. I’ve never seen anything to suggest why that should be an ideal default, or any metric to support it. The best I can say for it is that it’s better than not using any rule at all.
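The square root rule as described above fits in a few lines. This is a sketch of the rule as stated here, not of Excel’s actual implementation, and the function name is made up.

```python
import math


def sqrt_rule_bin_width(values):
    """Bin width from the square root rule: range / ceil(sqrt(n))."""
    n = len(values)
    k = math.ceil(math.sqrt(n))  # number of bins, rounded up
    return (max(values) - min(values)) / k


# 100 values from 0 to 99: sqrt(100) = 10 bins over a range of 99.
width = sqrt_rule_bin_width(list(range(100)))
print(width)
```

Note that nothing about the shape of the data enters the calculation, only the sample size and the range.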
There are other known rules, which are grounded in metric-based arguments and come with known and clearly stated assumptions about the underlying data and its features. One such example is Scott’s Rule, which identifies the bin width that minimizes the integrated mean squared error of the histogram as an estimate of the underlying density, and which is derived under the assumption that the data are normally distributed. The Freedman-Diaconis rule is a variant of Scott’s, which I rather like, that replaces the standard deviation with the interquartile range and is therefore less sensitive to outliers in the data.
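A quick sketch of the outlier point, using the usual textbook statements of the two rules (Scott: 3.49 times the sample standard deviation over the cube root of n; Freedman-Diaconis: twice the interquartile range over the cube root of n). The function names and the data are invented for illustration.

```python
import statistics


def scott_width(values):
    """Scott's rule: h = 3.49 * s / n^(1/3), with s the sample standard deviation."""
    n = len(values)
    return 3.49 * statistics.stdev(values) / n ** (1 / 3)


def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: h = 2 * IQR / n^(1/3)."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) / n ** (1 / 3)


base = list(range(1, 101))      # 1, 2, ..., 100
with_outlier = base + [10_000]  # the same data plus one extreme value

scott_ratio = scott_width(with_outlier) / scott_width(base)
fd_ratio = freedman_diaconis_width(with_outlier) / freedman_diaconis_width(base)

print(scott_ratio)
print(fd_ratio)
```

A single extreme value inflates the standard deviation, so Scott’s bin width grows many times over, while the interquartile range, and with it the Freedman-Diaconis width, barely moves.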
Another rule worth knowing about is Sturges’ rule, as it’s possibly the first method for identifying optimal default bin widths published in the literature. It’s also a common choice for statistical plotting software that actually does use binning rules. It doesn’t perform particularly well if there are fewer than 30 or more than 200 values, or if the data is heavily skewed. Fortunately we have Doane’s rule, which is a modification of Sturges’ intended to handle skewed data sets.
An example of software using Sturges’ rule by default is the graphics::hist function in R. The same function accepts Scott and FD (Freedman-Diaconis) as arguments, if the user would like to change that default. That is an example of doing it reasonably well, I think, and is much better than ggplot2’s default of 30 bins. In a way ggplot2’s failure is the saddest, since I know that the functions already exist in R for them to borrow.
How would you do it, Danielle?
It’s funny you should ask, because I’ve already done it. There is an R package called plot.data, belonging to the VEuPathDB organization on GitHub, which I created and for which I authored the binning utilities. That package has a function, plot.data::findBinWidth, which will return the optimal default bin width for a collection of data points. In implementing this function, I’ve done my best to minimize binning artifacts by using the following discretization rules. Each subsequent rule takes priority over the previous, in order to identify the optimal default bin width h:
- Use the Freedman-Diaconis Rule: $h = 2\,\frac{\operatorname{IQR}(x)}{\sqrt[3]{n}}$, where:
  - $\operatorname{IQR}(x)$ is the interquartile range of the data
- If there are fewer than 200 values, use Sturges’ Rule to find the number of bins: $k = \lceil \log_2 n \rceil + 1$
- If the absolute value of the skewness of the variable’s values is greater than 0.5, use Doane’s Rule: $k = 1 + \log_2(n) + \log_2\left(1 + \frac{|g_1|}{\sigma_{g_1}}\right)$, where:
  - $g_1$ is the estimated 3rd-moment skewness of the distribution
  - $\sigma_{g_1} = \sqrt{\frac{6(n-2)}{(n+1)(n+3)}}$
- Format according to data type:
  - If the variable contains only integer values, take the floor of the bin width that resulted from the previous rules. If that results in a bin width of 0, use 1 instead.
  - If the variable contains non-integer values, round the bin width to match the precision of the variable. If this results in a bin width of 0, make successive attempts with ever greater precision until a non-zero value results.
  - If the variable represents date values, these values are temporarily converted to numeric values representing the number of days from some origin, and the above rules are applied as usual. After the numeric bin width (in days) is determined, the following logic is applied:
    - If the bin width is greater than 365 days, return a bin width of “1 year”
    - If the bin width is greater than 31 days, return a bin width of “1 month”
    - If the bin width is greater than 7 days, return a bin width of “1 week”
    - Otherwise return the bin width in days as calculated from the steps above.
Note: In the above equations k is the number of bins and n is the number of data points. Additionally, from the number of bins k, the bin width h can be found where $h = \frac{\max x - \min x}{k}$.
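The rules above can be sketched roughly as follows. This is an illustrative Python translation written for this post, not the plot.data source: every function name is made up, the skewness is computed as the plain third-moment estimate, Sturges’ bin count is taken as the ceiling of log2(n) plus one, Doane’s k is left unrounded, and the `digits` argument (the number of decimal places in the variable) is assumed to be known by the caller. It also assumes at least a handful of values.

```python
import math
import statistics


def sample_skewness(values):
    """Third-moment skewness g1 = m3 / m2^1.5 (0.0 for constant data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def sketch_bin_width(values):
    """Apply the three rules in order; each later rule overrides the earlier ones."""
    n = len(values)
    data_range = max(values) - min(values)

    # Rule 1: Freedman-Diaconis gives a bin width directly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    h = 2 * (q3 - q1) / n ** (1 / 3)

    # Rule 2: with fewer than 200 values, use Sturges' bin count instead.
    if n < 200:
        k = math.ceil(math.log2(n)) + 1
        h = data_range / k

    # Rule 3: with |skewness| above 0.5, use Doane's bin count instead.
    g1 = sample_skewness(values)
    if abs(g1) > 0.5:
        sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
        k = 1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1)
        h = data_range / k  # k is not rounded in this sketch

    return h


def format_integer_width(h):
    """Integer data: floor the width, but never return 0."""
    return max(1, math.floor(h))


def format_float_width(h, digits):
    """Non-integer data: round to `digits` places, adding places until non-zero."""
    while h > 0 and round(h, digits) == 0:
        digits += 1
    return round(h, digits)


def format_date_width(h_days):
    """Date data: snap a width in days to a calendar unit where one applies."""
    if h_days > 365:
        return "1 year"
    if h_days > 31:
        return "1 month"
    if h_days > 7:
        return "1 week"
    return h_days


symmetric_large = list(range(1000))        # stays on Freedman-Diaconis
symmetric_small = list(range(100))         # falls through to Sturges
skewed_small = [i ** 3 for i in range(100)]  # falls through to Doane

print(sketch_bin_width(symmetric_large))
print(sketch_bin_width(symmetric_small))
print(sketch_bin_width(skewed_small))
```

Because each check simply overwrites h, the last rule whose condition holds wins, which is one way to read "each subsequent rule takes priority over the previous".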