By day, Randy Olson is a postdoctoral researcher at the University of Pennsylvania applying machine learning to biomedical problems. If you think of genes as a massive dataset, he’s the guy using that data along with machine learning to help detect diseases – among many other things. But we’re not here to talk about his day job. We’re here to talk about his hobby: creating the most interesting and compelling data visualizations on the web.
He’s done dataviz on:
- A data-driven guide to creating a successful reddit post
- 144 years of marriage and divorce data
- Percentages of bachelor’s degrees given to women by major from 1970 to 2012
- Revisiting the six degrees of Kevin Bacon
- Here’s Waldo: Computing the optimal search strategy for finding Waldo
- And, computing the optimal road trip across the U.S.
Before this interview, I really only had one question in mind: How do you come up with this stuff?! If your dataviz needs a creative shot in the arm, read on, because Randy Olson is the dataviz wiz, if ever a wiz there was.
For Olson, dataviz began as a side interest in grad school. As a longtime user and fan of reddit, he was curious how that community had changed and evolved over time. One spring break, he collected all of reddit’s posts back to the first day – an enormous dataset – and began analyzing and visualizing its evolution. From there, whenever an intriguing dataset crossed his path, he’d wonder, “Can I answer this question with that dataset?” He says, “I was very much fueled by my curiosity about the world.”
On telling stories
“In one of my favorite charts, I visualized the gender breakdown of college degrees that were awarded between 1975 and 2012 – 37 years of data. The reason I chose that as a topic was that I was curious about what was going on in computer science. I did my undergrad and PhD in computer science, and one of the first observations any CS student makes is that… it’s mostly guys. Had it always been like that? Was the landscape changing? My own anecdotal evidence by the end of my undergrad degree seemed to show that female students were almost at parity in my classes, and I wanted to check if this was a broader trend.”
But Olson didn’t stop at the question of “What has happened to computer science degrees over time?” A large part of what makes his dataviz so compelling is his conscious decision to deliver data in context. So he looked at a number of majors.
“The cool part about this chart is that it was about many more majors, and you get to see which majors were more male dominated or female dominated – and which ones switched. Not only did it answer my original question of what happened to women earning computer science degrees, it raised a lot of other questions. Why have computer science degrees waned since the 80s, and what caused that shift in culture?”
Olson says this data had always been available to the public, but as numbers sitting in a table, it couldn’t tell the stories it tells when visualized.
The ethics of dataviz: Content, context and caveats
“Data journalism has been abused quite a bit. Graphs come with an air of scientific authority – you see the numbers laid out before you, and they appear incontrovertible. But many articles don’t provide the necessary caveats, and I think a lot of people are quick to hop onto a certain data set that supports their views without checking to make sure that: A) the data was properly collected, and B) the data actually says what they think it says.
Recently, I’ve seen articles on divorce statistics that said ‘divorce rates are going down!’ What wasn’t reported in those articles is that marriage rates are at an all-time low – lower even than during the Great Depression. You can’t have high divorce rates without high marriage rates.
A few years ago, when global warming and climate change were first all over the news, you’d see poll results that said, ‘60% of the U.S. say global warming isn’t happening!’ But we’ve had data on people’s opinions on these topics for decades. Why aren’t we providing a historical context for this data? That frustrates me.
I’d like to see more of a balanced approach to data journalism where they say, ‘Here’s a data set on the topic. Here’s what it shows. This is how the data was gathered, and here are some potential flaws, and this is the historical context, and this is how we could do this better.’ It could be used to accomplish a meaningful debate rather than constantly trying to push our political agenda.”
How do you come up with this stuff?!
“When you asked this, I had to sit back and think about it. How do I actually brainstorm and come up with these project ideas?
I came to the conclusion that I do so many things broadly – what I learn about, what I write about, and the people I work with – I don’t focus on any one thing. I don’t put myself into a specific niche. I think that’s contributed to the success of my blog because I’m free to hop between topics as I see fit.
Take the Where’s Waldo? post that I wrote, which gained quite a bit of attention in the media. I was able to pull that off because I saw an article on Slate.com by someone who did a mathematical analysis of the Where’s Waldo? books. I applied machine learning to the same problem, crediting the Slate article, and was able to improve upon it because I had that extra tool in my tool belt.
I think a lot of people focus on just dataviz, or just analysis, which can really be quite limiting on what they can accomplish. I can easily transition between various topics, which allows me to look at problems in different ways and combine things in ways no one thought about. I’d even describe myself as an opportunist, looking for the low hanging fruit or a unique problem that no one has tackled before, or that they haven’t tackled in this way before.”
Randy Olson’s Go-To Data Sources
“I always try to find a dataset that’s provided by the government. The government-curated datasets are among the best and most reliable; they document and curate them very well. They’re also generally much broader. The U.S. Census is amazing. That’s not something you can find easily on a company website. I look at the Census and the CIA Factbook. Quandl.com collects a lot of datasets like that and makes them easily accessible through an API. They do a really great job of gathering these datasets and telling you where they’re from, and they visualize the data for you right there.”
Where can people find your latest data visualizations?
“My website, www.RandalOlson.com, and I’m always happy to have people follow me on Twitter – I’m constantly posting and critiquing dataviz on Twitter. And I’d love for people to join Data is Beautiful on reddit. It’s this massive online community I moderate with 3.5 million subscribers (as of this week) who are all interested in data visualizations.”