How to Design Scatter Plots



In our Data Visualization 101 series, we cover each chart type to help you sharpen your data visualization skills.

For a general data refresher, start here.

Scatter plots have been called the “most versatile, polymorphic, and generally useful invention in the history of statistical graphics” (Journal of the History of the Behavioral Sciences, 2005). That’s a big claim, but just as their name implies, they can take a confusing and scattered set of data and make sense of it. As such, these plots are much more than a visualization tool; they are a discovery tool. Let’s look at what makes the scatter plot so good.

What It Is

The scatter plot is simply a set of data points plotted against an x-axis and a y-axis, with each axis representing one variable. The shape those data points create tells the story, most often revealing correlation (positive or negative) in a large amount of data.

Here, a scatter plot reveals the pattern in different product families, showing how much they produce in revenue compared to their units sold.

[Figure: scatter plot of revenue versus units sold for different product families]

Where It Came From

If you’ve read our previous posts in this series, it might come as a shock that, while he did bring the line, bar, and pie charts to the world, data visualization pioneer William Playfair didn’t invent the scatter plot. In fact, the scatter plot’s history is much more, well, scattered.

One reason we don’t have a specific inventor for this visualization form is that people have been plotting data on maps and with Cartesian coordinates for centuries. It was only a matter of time before people independently realized the story hidden inside those clouds of data. (Playfair probably missed inventing the scatter plot because the data he was charting was almost exclusively time series, which his charts were already well suited to represent.)

Once the scatter plot did catch on, however, it made a splash in the world of science, precipitating a number of exciting discoveries.

In 1905, when Danish astronomer Ejnar Hertzsprung tabulated the luminosity (or absolute magnitude) of stars versus their colors (ranging through the color spectrum from blue-white to red), he noticed some correlations and trends. But it wasn’t until he and American astronomer Henry Norris Russell independently plotted that data between 1911 and 1913 that they noticed something that would change our understanding of the cosmos.

Here, they saw a distinct trend along a diagonal band from the top left (high luminosity, blue-white color) to the bottom right (low luminosity, red color). They also noticed a cluster of data in the top right of their charts. What the two had stumbled upon was a new understanding of the lives of stars and how they age, from newly formed blue-white stars to old red stars. That cluster off to the side was composed only of giant stars.

A newly plotted Hertzsprung-Russell diagram, showing 22,000 stars. (The Sun would be found on the main sequence at luminosity 1.)

For more about the exploration of the scatter plot’s origins, check out this paper by Michael Friendly and Daniel Denis.

When to Use It

It’s easy to see the usefulness of the scatter plot, but its advantages over other chart types are worth spelling out. More readily than most charts, scatter plots show trends, clusters, patterns, and relationships in a cloud of data points, especially a very large one.

Whether plotting lung capacity compared to free-diving depth, the magnitude of earthquakes compared to their duration, or profits compared to expenditures in a multitude of different business divisions, the data’s correlation can be interpreted in a number of different ways. Trends presented may include:

Positive correlation (both values increasing in unison)

Negative correlation (one value increases while the other decreases)

Null (no correlation between data)

Linear

Exponential

Outliers (data points or clusters far outside the norm of the data set)

Note: It’s important to remember that correlation does not imply causation, and other unnoticed variables could be influencing the data in a chart.
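If you want a number to go with what your eyes see, the Pearson correlation coefficient is one common way to summarize positive, negative, and null trends. The sketch below is only an illustration: the function names, the sample data, and the 0.3 cutoff are our own assumptions, not a standard. Keep in mind that Pearson's r measures linear association only, so an exponential pattern or a few outliers can make the number misleading.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


def describe_trend(r, threshold=0.3):
    """Label a coefficient; the threshold is an arbitrary, illustrative cutoff."""
    if r >= threshold:
        return "positive"
    if r <= -threshold:
        return "negative"
    return "null"


# Hypothetical data: units sold vs. revenue for four products.
units = [1, 2, 3, 4]
revenue = [2, 4, 6, 8]
print(describe_trend(pearson_r(units, revenue)))
```

A coefficient near +1 or -1 suggests a tight linear trend, while a value near 0 suggests no linear relationship. It is still worth looking at the plot itself before drawing conclusions.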

Best Practices for Designing Scatter Plots

Now that you have a basic understanding of scatter plots, let’s look at 4 tips to get the most out of using them.

1) Start Y-Axis Value at 0

Starting the axis above zero truncates the visualization of values.

2) Include More Variables


Use size and dot color to encode additional data variables.
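As a rough sketch of how this encoding might work in practice, the helpers below turn a third numeric variable into marker areas and a fourth categorical variable into colors. Everything here is hypothetical: the function names, the sample data, the size range of 20 to 200, and the palette are assumptions chosen for illustration.

```python
def scale_to_range(values, out_min=20.0, out_max=200.0):
    """Linearly rescale values to [out_min, out_max], e.g. for marker areas."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: use the midpoint so every dot is the same size.
        return [(out_min + out_max) / 2 for _ in values]
    return [out_min + (v - lo) * (out_max - out_min) / (hi - lo) for v in values]


def assign_colors(categories,
                  palette=("tab:blue", "tab:orange", "tab:green", "tab:red")):
    """Give each distinct category a color, in first-seen order, cycling the palette."""
    seen = {}
    for category in categories:
        if category not in seen:
            seen[category] = palette[len(seen) % len(palette)]
    return [seen[category] for category in categories]


# Hypothetical data: profit margin drives dot size, product family drives color.
margins = [0, 5, 10]
families = ["hardware", "software", "hardware"]

sizes = scale_to_range(margins)
colors = assign_colors(families)
print(sizes, colors)

# With matplotlib installed, these lists could be passed to a scatter call,
# for example: plt.scatter(units, revenue, s=sizes, c=colors)
```

Scaling to a fixed range keeps the smallest dots visible and the largest from swamping the chart. The range that works best depends on how many points you are plotting.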

3) Use Trend Lines

Trend lines make the correlation between the variables easier to see.
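One common way to compute a straight trend line is an ordinary least-squares fit. The sketch below assumes a linear relationship. The function name and sample data are hypothetical, and curved or exponential data would need a different model.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: expenditures vs. profits for four business divisions.
expenditures = [1, 2, 3, 4]
profits = [3, 5, 7, 9]

slope, intercept = fit_trend_line(expenditures, profits)

# Two endpoints are enough to draw the line across the plotted range.
x_start, x_end = min(expenditures), max(expenditures)
line = [(x_start, slope * x_start + intercept), (x_end, slope * x_end + intercept)]
print(slope, intercept, line)
```

Drawing the line only across the range of the data avoids implying that the trend holds beyond what was measured.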

4) Don’t Compare More Than Two Trend Lines


Too many lines make data difficult to interpret.

Want more? Read up on the pie chart, bar chart, line chart, and area chart.