In our Data Visualization 101 series, we cover each chart type to help you sharpen your data visualization skills.
For a general data refresher, start here.
Scatter plots have been called the “most versatile, polymorphic, and generally useful invention in the history of statistical graphics” (Journal of the History of the Behavioral Sciences, 2005). That’s a big claim, but just as their name implies, they can take a confusing and scattered set of data and make sense of it. As such, these plots are much more than a visualization tool; they are a discovery tool. Let’s look at what makes the scatter plot so good.
What It Is
The scatter plot is simply a set of data points plotted against an x-axis and a y-axis to represent two variables. The shape those data points create tells the story, most often revealing correlation (positive or negative) in a large amount of data.
Here, a scatter plot reveals the pattern in different product families, showing how much they produce in revenue compared to their units sold.
Where It Came From
If you’ve read our previous posts in this series, it might come as a shock that, while he did bring the line, bar, and pie charts to the world, data visualization pioneer William Playfair didn’t invent the scatter plot. In fact, the scatter plot’s history is much more, well, scattered.
One reason we don’t have a specific inventor for this visualization form is that people have been plotting data on maps and with Cartesian coordinates for centuries. It was only a matter of time before people independently realized the story hidden inside those clouds of data. (Playfair probably missed inventing the scatter plot because the data he was charting was almost exclusively time-series data, which his charts were already well-suited to represent.)
Once the scatter plot did catch on, however, it made a splash in the world of science, precipitating a number of exciting discoveries.
In 1905, when Danish astronomer Ejnar Hertzsprung tabulated the luminosity (or absolute magnitude) of stars versus their colors (ranging through the color spectrum from blue-white to red), he noticed some correlations and trends. But it wasn’t until he and American astronomer Henry Norris Russell independently plotted that data between 1911 and 1913 that they noticed something that would change our understanding of the cosmos.
Here, they saw a distinct trend along a diagonal band from the top left (high luminosity, blue-white color) to the bottom right (low luminosity, red color). They also noticed a cluster of data in the top right of their charts. What the two had stumbled upon was a new understanding of the life of stars and how they age, from newly formed blue-white stars to old red stars. That cluster off to the side was composed only of giant stars.
A newly plotted Hertzsprung-Russell Diagram, showing 22,000 stars. (The Sun would be found on the main sequence at luminosity 1.)
For more about the exploration of the scatter plot’s origins, check out this paper by Michael Friendly and Daniel Denis.
When to Use It
It’s easy to see the usefulness of the scatter plot, but it’s important to point out its unique advantages over other chart types. Unlike other charts, scatter plots have the ability to show trends, clusters, patterns, and relationships in a cloud of data points—especially a very large one.
Whether plotting lung capacity compared to free-diving depth, the magnitude of earthquakes compared to their duration, or profits compared to expenditures in a multitude of different business divisions, the data’s correlation can be interpreted in a number of different ways. Trends presented may include:
Positive correlation (both values increasing in unison):
Negative correlation (one value increases while the other decreases):
Null (no correlation between data):
Outliers (data points or clusters far outside the norm of the data set):
Note: It’s important to remember that correlation does not always equal causation, and other unnoticed variables could be influencing the data in a chart.
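To make the trend categories above concrete, here is a minimal Python sketch that computes the Pearson correlation coefficient for two variables and labels the result as positive, negative, or null. The function names and the 0.3 cutoff are illustrative assumptions, not a standard; the coefficient only measures linear relationships, and it does not flag outliers.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def describe_trend(r, threshold=0.3):
    """Label a coefficient; the threshold is an arbitrary illustrative cutoff."""
    if r >= threshold:
        return "positive"
    if r <= -threshold:
        return "negative"
    return "null"


# Hypothetical sample data: units sold vs. revenue for five products.
units = [1, 2, 3, 4, 5]
revenue = [2, 4, 6, 8, 10]
print(describe_trend(pearson_r(units, revenue)))
```

A coefficient near +1 or -1 corresponds to a tight diagonal band of points, while a value near 0 corresponds to a shapeless cloud.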
Best Practices for Designing Scatter Plots
Now that you have a basic understanding of scatter plots, let’s look at 4 tips to get the most out of using them.
1) Start Y-Axis Value at 0
Starting the axis above zero truncates the visualization of values.
2) Include More Variables
Use size and dot color to encode additional data variables.
3) Use Trend Lines
These help highlight the correlation between the variables and make trends easier to see.
4) Don’t Compare More Than Two Trend Lines
Too many lines make data difficult to interpret.
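The first three tips can be sketched in Python. This is one possible approach under stated assumptions, not a prescribed method: the trend line is an ordinary least-squares fit, the marker-size scaling is a simple linear mapping, and all function names and default values are hypothetical. The optional plotting helper assumes matplotlib is installed and that the y values are non-negative.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def scale_to_marker_area(values, min_area=20.0, max_area=400.0):
    """Linearly map a third variable onto marker areas (illustrative defaults)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: give every point the same mid-sized marker.
        return [(min_area + max_area) / 2 for _ in values]
    span = hi - lo
    return [min_area + (v - lo) / span * (max_area - min_area) for v in values]


def plot_with_trend_line(xs, ys, sizes=None):
    """Optional: draw the points and fitted line. Assumes matplotlib is installed."""
    import matplotlib.pyplot as plt

    slope, intercept = fit_trend_line(xs, ys)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, s=sizes)  # tip 2: marker size encodes an extra variable
    x_lo, x_hi = min(xs), max(xs)
    # Tip 3: draw the trend line across the range of the data.
    ax.plot([x_lo, x_hi], [slope * x_lo + intercept, slope * x_hi + intercept])
    ax.set_ylim(bottom=0)  # tip 1: start the y-axis at 0 (assumes y >= 0)
    return fig


# Hypothetical sample data.
slope, intercept = fit_trend_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Calling `plot_with_trend_line(xs, ys, sizes=scale_to_marker_area(third_variable))` would combine all three tips in a single chart; color could be encoded similarly through the `c` argument of `scatter`.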