
On Histograms in Decision Making

By Radosław Miernik


Intro

While planning yet another feature, one of the designers asked, “How many values do we expect to see in this dropdown?” You, the only person with database access at this meeting, run a trivial query and reply, “Six, on average.” Everyone’s happy, everyone continues with their day.

And on the first day it lands on production, you see this horror with more than a thousand options to choose from. “You said it’ll be six!” Well, you were right, but not right right. That’s where being “technically correct” is not enough.

Today, we’ll talk about the alternatives and their outcomes. Sort of a statistics 101, but with little-to-zero formal language. As the title suggests, we’ll arrive at histograms, but let’s not jump the gun.

Average

The major problem with averages is that they can’t handle outliers nicely. An outlier is something unusual and far from normal, e.g., this one enterprise client of yours who accounts for 20% of your total traffic.

The thing is that, given enough of a difference between an outlier and the rest, it will impact the average significantly. Remember that outliers don’t necessarily have to be large; the few clients who left last year and now produce zero traffic affect the average as well.
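To see how much a single outlier moves the mean, here’s a quick sketch in plain JavaScript (the traffic numbers are made up):

const average = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Nine ordinary clients… and one enterprise outlier.
const regular = [4, 5, 5, 6, 6, 6, 7, 7, 8];
console.log(average(regular));           // 6
console.log(average([...regular, 120])); // 17.4 – one client shifts the whole picture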

Standard deviation

Once you realize an average is not enough, the first Google search suggests using standard deviation. The idea is simple: instead of a single number, we’ll think about a range where most of the data fits. (Approximately 68%, assuming a normal distribution. If you’re not sure what that means, read it as “most”.)

How does it look in practice? If the standard deviation is low (compared to the average), then most of your data is close to each other; if it’s high, then the data is spread more. We just don’t know how it’s spread.
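To make this concrete, here’s a minimal sketch of the population standard deviation, reusing the made-up numbers from before:

const average = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
const stdDev = (xs) => {
  const mean = average(xs);
  return Math.sqrt(average(xs.map((x) => (x - mean) ** 2)));
};

console.log(stdDev([4, 5, 5, 6, 6, 6, 7, 7, 8]));      // ~1.15 – low, the data is tight
console.log(stdDev([4, 5, 5, 6, 6, 6, 7, 7, 8, 120])); // ~34.2 – high, but spread where?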

Percentiles

A different direction would be to look at certain percentiles instead. While there are multiple formal definitions (nitty-gritty details), the idea is simple: the 60th percentile is the value below which 60% of all values fall.

This allows us to put a number on the Pareto principle, which I’m guessing you heard a lot about while listing all the edge cases you came up with. If we go back to our dropdown example, we could say, “97.4% of our users have 27 options or fewer. However, the maximum is over a thousand.” A much more informed decision.
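One of the simpler definitions (nearest-rank) fits in a few lines; a sketch, not a replacement for a proper stats library:

// Nearest-rank: the smallest value with at least p% of the data at or below it.
const percentile = (xs, p) => {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.ceil((p / 100) * sorted.length) - 1];
};

const optionCounts = [1, 2, 2, 3, 4, 6, 9, 27, 1024]; // made-up dropdown sizes
console.log(percentile(optionCounts, 50)); // 4
console.log(percentile(optionCounts, 99)); // 1024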

Example #1

Let’s take a real-life example: API response times. The design team is wondering whether the loading state of this new bulk edit view will appear for just a moment or if we expect it to stay on for a while. Let’s get the numbers in:

                     Case A    Case B
Average               100ms     100ms
Standard deviation     80ms      10ms
90th percentile       300ms     120ms
95th percentile       500ms     160ms
99th percentile      2500ms     190ms

As you can see, the average is the same. The standard deviation is already telling us something – case A is far more spread out. Then, the percentiles show that case B is “fast enough” for “most” users. I’d say A needs a better loading state than B; the latter will be fine with just a spinner on the submit button.

Visualization

Even if you understand the shortcomings of all the aforementioned metrics, always remember to visualize the data anyway. Just a quick-n-dirty graph will do. Check out this lovely Anscombe’s quartet:

Anscombe's quartet

In short, these are four different datasets with almost identical statistics but completely different distributions. If you’re in your dinosaur phase (whether you like it or not), check out the Datasaurus dozen, too.

Histograms

Finally, let’s talk histograms. So far, we have thought about “summarizing” the data we’re given, but not really about what it “looks” like. Of course, we can chart it, but if one of the axes spans a large range, the chart may get unreadable.

Instead, we group the values into a handful of “buckets” (or “bins”), each representing some range. Of course, we can make them as granular as we need (or rather, as granular as we’re asked to). In my experience, just a handful of buckets is enough to make a decision (I usually start with ten).

You can also mix and match what you learned today! If, instead of counting only the current bucket, we also count all the previous ones, we end up with a kind of percentile-like chart. Look at these beauties:

Cumulative vs normal histogram
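If you’d like to play with these outside the database, here’s a minimal sketch in plain JavaScript. It uses fixed-width buckets (unlike $bucketAuto below, which tries to balance the counts), and the data is made up:

const histogram = (xs, buckets = 10) => {
  const min = Math.min(...xs);
  const width = (Math.max(...xs) - min) / buckets || 1;
  const counts = Array(buckets).fill(0);
  for (const x of xs) {
    counts[Math.min(buckets - 1, Math.floor((x - min) / width))]++;
  }
  return counts;
};

// Cumulative variant: every bucket also counts everything before it.
const cumulative = (counts) => {
  let sum = 0;
  return counts.map((count) => (sum += count));
};

const counts = histogram([1, 2, 2, 3, 4, 6, 9, 27, 50], 5);
console.log(counts);             // [7, 0, 1, 0, 1]
console.log(cumulative(counts)); // [7, 7, 8, 8, 9]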

Example #2

Let’s face the dropdown options once again. Knowing that a basic histogram would help us, let’s make one from the data we have. One way would be to throw it into your Excel-like tool of choice.

I’m fine with text form, and I’d like to have it right now (we’re still at that meeting, remember?), so let’s try to do it directly in the database. Luckily, MongoDB comes equipped with $bucketAuto:

// Data shape: {
//   _id: ObjectId(),
//   tags: ['Name 1', ...],
//   tenantId: ObjectId(),
// }
db.posts.aggregate([
  // One document per (post, tag) pair.
  { $unwind: '$tags' },
  // Collect the unique tags of each tenant.
  { $group: { _id: '$tenantId', tags: { $addToSet: '$tags' } } },
  // Count them.
  { $project: { count: { $size: '$tags' } } },
  // Distribute the counts into ten buckets with (roughly) even counts.
  { $bucketAuto: { groupBy: '$count', buckets: 10 } },
]);

The result looks like this:

{ _id: { min: 1, max: 2 }, count: 762 }
{ _id: { min: 2, max: 3 }, count: 2493 }
{ _id: { min: 3, max: 4 }, count: 593 }
{ _id: { min: 4, max: 6 }, count: 651 }
{ _id: { min: 6, max: 11 }, count: 645 }
{ _id: { min: 11, max: 28 }, count: 615 }
{ _id: { min: 28, max: 200 }, count: 159 }

We no longer see the average, standard deviation, or percentiles, but we can easily calculate them too, using $avg, $stdDevPop, and $percentile (I’ll leave that as an exercise for you, dear reader).

Of course, it’s not rocket science, so if you’re using any SQL database, a GROUP BY on a computed bucket will work just fine. And you don’t have to worry about performance – all we need is the single number we’re putting into the buckets, so you should be fine running it ad hoc without an index.

Conclusion

That’s it for today! Hopefully, it wasn’t too formal, and you’ll spread the love for histograms further. Or at least make informed decisions more often.

“Yeah, sorry, I zoned out. How many options did we have there again…?”