Creating human-readable scales for data visualization in charts.

If you have ever had to generate a chart from scratch, you have probably run into this problem: how do we pick a "round" number for the top of the chart's range, so that all of our data fits nicely?

Here's an example. Your app has a statistics section which displays a line chart of the number of entries versus time. At the beginning, when the number of entries is small - say 6 entries - your graphic could display a range from 0 to 10. But after a while you'll have some hundreds, or even thousands of entries.

Here's an example of a line chart with a range up to 500.
The maximum value in the dataset is 476.

How do you find which is the next "best display range" for any given maximum value $m$? For a maximum value of 332, which would be better: a range from 0 to 340, 400, 500 or 1000?

In this post, I'll show you how to create a function $\mbox{R}(m)$ that will give you that display range value, and we'll be using logarithms to achieve that.

A very brief introduction to logarithms

For those not too familiar with logarithms, they are the inverse function of exponentials. Say you have the equation $2^x = 5$. In order to find the value of x, you need to use logarithms.

A logarithm is the power to which a given base (in this case, 2) must be raised to produce another number (in this case, 5). This is written as $\log_2 5$, and in our example $\log_2 5 \approx 2.32192809...$ In short, logarithms and exponentials are related like this:

$b^x = a \Leftrightarrow \log_b a = x$
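As a quick check of this equivalence, here is a minimal Python sketch (the variable names are only illustrative) that solves $2^x = 5$ and then raises 2 back to that power:

```python
import math

# Solve 2**x = 5 for x using the base-2 logarithm.
x = math.log2(5)

# Raising the base to that power recovers the original number,
# up to floating-point rounding.
recovered = 2 ** x

print(x)
print(recovered)
```

The second value may differ from 5 in the last few decimal places, which is normal for floating-point arithmetic.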

When the base isn't specified for the logarithm, it is usually assumed to be $e \approx 2.71828...$, also known as Euler's number. This seemingly random number has several useful properties, but not for our particular problem here. Note, also, that a logarithm with this base is also called a natural logarithm, usually written unambiguously as $\ln x$.

A very useful property of logarithms is that you can easily convert a logarithm in a base to any other base you wish. Say you have the logarithm of $x$ in a base $a$, but you actually want it in base $b$. You can just do this:

$\log_b x = \frac{\log_a x}{\log_a b}$

So even if your logarithm function uses a base you don't know, you can still get a logarithm in any base you wish: divide the value the function gives for your number by the value it gives for the base you want.

This is useful - and worth pointing out - because a lot of programming languages give you only a natural logarithm function, and we'll be working with a base 10 logarithm from now on.
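The change of base can be sketched in a few lines of Python. The helper name `log_base` is made up for this example; it relies only on the natural logarithm, as if that were the only one available:

```python
import math

def log_base(x: float, base: float) -> float:
    """Logarithm of x in an arbitrary base, built from the natural log only."""
    return math.log(x) / math.log(base)

print(log_base(5, 2))      # same quantity as log2(5)
print(log_base(1452, 10))  # same quantity as log10(1452)
```

Because this is a quotient of two rounded floating-point values, the result can be off by a tiny amount from the mathematically exact answer.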

Finding the order of magnitude using base-10 logarithms

The decadic logarithm. Note how it crosses y = 1 when x = 10, and y = 2 when x = 100.

Here, we'll be using the base-10 logarithm. It is useful in our case because we happen to use a decimal system for our numbers.

In any positional numerical base, you can figure out how many digits a certain integer value will have by taking the logarithm of that value using the numerical base as the base for the logarithm, rounding down to the nearest integer and then adding one.

For example, the number 1452 has $\lfloor \log_{10} 1452 \rfloor + 1 = \lfloor 3.16\ldots \rfloor + 1 = 3 + 1 = 4$ digits. Here, $\lfloor x \rfloor$ is the floor function, which gives the largest integer less than or equal to $x$. It is a standard mathematical function in most programming languages.

Let's call the maximum value in our dataset $m$, and we'll create a function $\mbox{D}(m)$ which will give us the number of digits in the integer part of $m$:

$\mbox{D}(m) = \lfloor \log_{10} m \rfloor + 1$
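Here is a minimal sketch of $\mbox{D}(m)$ in Python. The function name and the error handling for non-positive input are my own additions, and it assumes a base-10 logarithm (`math.log10`) is available:

```python
import math

def num_digits(m: float) -> int:
    """D(m): for m >= 1, the number of digits in the integer part of m."""
    if m <= 0:
        # The logarithm is undefined for zero and negative numbers.
        raise ValueError("m must be positive")
    return math.floor(math.log10(m)) + 1

print(num_digits(1452))
print(num_digits(476))
```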

So for any $m > 0$, we can find the smallest power of ten strictly greater than $m$ by simply calculating $10^{D(m)}$. This will give us 1 for any $0.1 \leq m < 1$, 10 for any $1 \leq m < 10$, 100 for any $10 \leq m < 100$, etc. (The logarithm is undefined at $m = 0$, so an empty or all-zero dataset needs a default range.) In effect, the function $\mbox{D}(m)$ gives us the exponent for the next decimal power (or order of magnitude) after $m$.
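To make the intervals concrete, this small sketch computes $10^{D(m)}$ for a few values. The name `next_power_of_ten` is hypothetical, and the function assumes $m > 0$:

```python
import math

def next_power_of_ten(m: float) -> float:
    """10 ** D(m): the smallest power of ten strictly greater than m (m > 0)."""
    digits = math.floor(math.log10(m)) + 1
    return 10.0 ** digits

for m in (0.5, 6, 10, 476):
    print(m, next_power_of_ten(m))
```

Note that a value that is itself a power of ten, such as 10, is pushed up to the next one (100).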

Breaking our range into smaller steps

Now, when $m = 100$, our next order of magnitude is 3, that is, $10^3 = 1000$. If we use that for the range of our chart, we'll have all the data drawn in just the bottom 10% of the graphic.

By just using the order of magnitude for the range, you'll find yourself with graphs that look like this, with lots of unused space.

That doesn't look too good, does it? It's just too much empty space, and it makes the maximum value in our dataset look much smaller than it really is.

Each order of magnitude is a 10x increase, and we want to break it into $n$ equal parts. We'll do that by using a sibling of the floor function, the ceiling function, $\lceil x \rceil$, which returns the smallest integer greater than or equal to $x$. This function is also available in most programming languages, usually under the name "ceil".

We'll use the ceiling function to create a "step-factor" which will have steps of size $\tfrac{1}{n}$. You do this simply by evaluating $\tfrac{1}{n} \lceil x n \rceil$. By itself, this looks like this:

Here's what our step function looks like for n = 2 and n = 4.
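The step factor on its own can be sketched like this in Python (the function name `step_factor` is only illustrative). It rounds $x$ up to the next multiple of $\tfrac{1}{n}$:

```python
import math

def step_factor(x: float, n: int) -> float:
    """Round x up to the next multiple of 1/n."""
    return math.ceil(x * n) / n

# With n = 2 the output moves in halves, with n = 4 in quarters.
print(step_factor(0.3, 2))
print(step_factor(0.6, 4))
```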

But note that for each new order of magnitude we reach, it should take 10 times longer to step up to the next. To do that, we plug in our next power of ten inside the ceiling function, but as a division, so we end up with:

$\frac{1}{n} \left\lceil \frac{m n}{ 10^{D(m)} } \right\rceil$

Now, if you evaluate this as a function of $m$, you'll see that within each order of magnitude it climbs to 1 in steps of $\tfrac{1}{n}$ and then drops back to its lowest step, after which each step takes 10 times longer to reach.
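A short sketch of this scaled step factor, assuming $m > 0$ and using a made-up function name, shows the same pattern repeating across orders of magnitude:

```python
import math

def scaled_step_factor(m: float, n: int) -> float:
    """(1/n) * ceil(m * n / 10 ** D(m)) for m > 0."""
    next_power = 10 ** (math.floor(math.log10(m)) + 1)
    return math.ceil(m * n / next_power) / n

# With n = 2, the factor switches from 0.5 to 1.0 just past 5, and again
# just past 50: the same pattern, stretched by a factor of ten.
for m in (4, 6, 40, 60):
    print(m, scaled_step_factor(m, 2))
```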

We just have to make it step across powers of ten now, and to do that we multiply this step factor by $10^{D(m)}$. The result is our function $\mbox{R}(m)$, which looks like this:

$\mbox{R}(m) = \frac{10^{D(m)}}{n} \left\lceil \frac{m n}{ 10^{D(m)} } \right\rceil$

Where $n$ is the number of steps. Good values for $n$ are 2, 4, 5 and 10.

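Here's how this looks in pseudocode, with the appropriate base-10 logarithm conversion:

function getGraphRange(maxValue, numSteps):
    // Number of digits in the integer part; log(x) / log(10) is the base-10 logarithm
    numDigits = floor( log(maxValue) / log(10) ) + 1
    // Next power of ten above maxValue
    nextDecimalPower = pow(10, numDigits)
    // How many steps of size nextDecimalPower / numSteps are needed to cover maxValue
    step = ceil( maxValue * numSteps / nextDecimalPower )
    range = ( nextDecimalPower / numSteps ) * step
    return range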
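For reference, here is one way the pseudocode could be written as runnable Python. This is a sketch, not a definitive implementation: the snake_case names, the default of 4 steps, and the check for non-positive input are assumptions of mine. It uses `math.log10` rather than a quotient of natural logarithms, because that quotient can come out slightly below a whole number for exact powers of ten such as 1000, which would throw off the floor.

```python
import math

def graph_range(max_value: float, num_steps: int = 4) -> float:
    """R(m): round max_value up to the next multiple of
    (next power of ten) / num_steps."""
    if max_value <= 0:
        # log10 is undefined here; the caller should pick a default range.
        raise ValueError("max_value must be positive")
    num_digits = math.floor(math.log10(max_value)) + 1
    next_power = 10 ** num_digits
    steps = math.ceil(max_value * num_steps / next_power)
    return next_power * steps / num_steps

# The chart from the introduction: a maximum of 476 gives a range of 500.
print(graph_range(476, 2))
# The question from the introduction: a maximum of 332, for several n.
for n in (2, 4, 5, 10):
    print(n, graph_range(332, n))
```

For a maximum of 332, this gives 500 with $n = 2$ or $n = 4$, and 400 with $n = 5$ or $n = 10$.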


But what about the subdivisions and their labels?

Now that we have a round-ish range for our chart that our users can understand, we still have the problem of dividing that range into an appropriate number of parts so the labels will be easily understood. You don't really want the labels to mark 0, 110, 220, etc.

This can be accomplished using our function $\mbox{D}(m)$. First, we need to find the largest power of ten that does not exceed our range (the range $\mbox{R}(m)$, not the maximum $m$), and that's simply $10^{D(R(m))-1}$. We divide our range by this value, and then multiply by the number of steps $n$ we defined before. The result is the number of subdivisions you need.

$d = n \frac{\mbox{R}(m)}{ 10^{D(R(m))-1} }$
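The subdivision count can be sketched as follows, taking an already computed range as input. The function and parameter names are hypothetical, and the call to `round` is my own addition to absorb floating-point noise in the division:

```python
import math

def num_subdivisions(range_value: float, num_steps: int) -> int:
    """d = n * R / 10 ** (D(R) - 1), where D(R) - 1 = floor(log10(R))."""
    power_below = 10 ** math.floor(math.log10(range_value))
    return round(num_steps * range_value / power_below)

# A range of 500 with n = 2 is split into 10 parts of 50 each.
print(num_subdivisions(500, 2))
# A range of 250 with n = 4 is split into 10 parts of 25 each.
print(num_subdivisions(250, 4))
```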

Tip: If you're using $n = 10$ for the range, it's probably a good idea to use 5 for this equation in order to avoid clutter in your graphic, as 10 usually gives you too many subdivisions.

Conclusion

Data visualization is not as trivial a task as it may seem at first. Doing it improperly, without considering the quirks of human perception, can seriously distort the information you are trying to convey. So it is very important to consider both the mathematics and the psychology involved.

As others have mentioned, mathematical ignorance and the limitations of our instinctive notions of numbers can be dangerous and have serious political and personal consequences, so be careful with your data!