On election night, another data analyst and I sat at the tables on the first floor of PayScale’s offices in Seattle and watched the election results trickle in. We are data-oriented individuals, as you might expect given our job titles at a salary-data-driven software company. Like many of you, we had been following the polls and the swarm of election stats from a variety of sites, and saw what many of you saw: predictions ranging from small to sizable Clinton victories. And, like many of you, we were surprised when the results came in and, in state after state, Trump narrowly won out where the data said Clinton was supposed to win!
So, what went wrong with the predictions? Without getting too Stats 101-ey here, it’s worth understanding what so many polls being off tells us about how “data” works as a concept, so stick with me as I make a few points. And forgive me for using “data” as a plural noun – I know it’s annoying, but I can’t stop myself.
1. Moving targets are hard to measure
The things we’re trying to measure with data are generally moving targets, from public opinion to weather to compensation (what I spend my professional life thinking about and looking at). Consider a Martian who starts observing a farmer early in the year and collects data on whether the farmer kills a turkey that day. For each day, the Martian records a ‘one’ if a turkey is killed, and a ‘zero’ if not. Say the Martian stays for eight months, from February to October, and records a ‘zero’ every day: no turkeys killed, not ever.
But this isn’t right. You know something the Martian doesn’t: that in November there’s a big ‘one’ looming on the horizon. There’s just no indication in the data that this will happen. One of the problems with political polling is that things can change quickly. We start polling earlier and earlier, but that necessarily means we’re looking further and further into the future. Opinions changed (sometimes dramatically) around the presidential debates and in response to new events, like the release of the Trump audio tape and FBI Director Comey’s letter suggesting that Clinton was back under investigation. Real-time data collection helps address this issue by gathering data on the landscape even as it changes.
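The Martian’s mistake can be sketched in a few lines of Python. Everything here is illustrative – the day count and the string of zeros stand in for the story above, not for any real data:

```python
# Eight months (roughly 240 days, Feb-Oct) of daily observations,
# every one a zero: no turkey killed on any observed day.
observations = [0] * 240

# The naive "model": the probability of a kill tomorrow is the
# average of everything seen so far.
predicted_prob_kill = sum(observations) / len(observations)
print(predicted_prob_kill)  # 0.0 -- the data alone see no Thanksgiving coming
```

The estimate isn’t wrong about the past; it’s wrong about the future, because the target moved and the data had no way to say so.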
We see the same sort of thing in the compensation world – job titles appear and disappear in the workforce, the pay bumps certain skills command increase or decrease from quarter to quarter, etc. The way we deal with it is by constantly updating our dataset – pollsters tried to do this, too, but polling properly is expensive and time consuming.
2. Your data are only as good as your data
Sounds stupid, right? But it isn’t, and it’s critical. One of the problems with the polls is that a lot of people were under-polled, while other groups were simply easier to get in touch with or more likely to talk to you. Good pollsters take these things into account by considering which groups are over- or under-represented in the data, but (a) they’re facing an uphill battle, (b) see point 1, and (c) by definition you have no data on the people who are not in your survey. You can address this by using outside data sources (e.g. the US Census) and/or historical voting patterns to “weight” your data. The basic idea is that if you can form an expectation of what voter turnout will look like, you can assume that members of each demographic group (e.g. non-college-educated white men in Virginia, age 18-24) will vote the way that group’s poll respondents said they would, and arrive at a more accurate prediction than the raw data alone would give. However, if the people in your survey don’t represent their respective populations (which can happen for any number of reasons), you’ll still end up reaching the wrong conclusion. This brings us to our next point…
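To make the weighting idea concrete, here is a minimal re-weighting sketch in Python. The group names, turnout shares, and support numbers are all invented for illustration – they are not real 2016 figures:

```python
# Hypothetical poll: each group's share of respondents and the
# fraction of that group supporting candidate A.
poll = {
    "college_urban":     (0.50, 0.60),
    "non_college_rural": (0.50, 0.40),
}

# What outside sources (e.g. Census data, past elections) suggest
# the actual electorate looks like.
expected_turnout = {
    "college_urban":     0.40,
    "non_college_rural": 0.60,
}

# Raw estimate: just average support by who happened to answer the poll.
raw_estimate = sum(share * support for share, support in poll.values())

# Weighted estimate: re-weight each group's support by its expected
# share of actual voters, not its share of poll respondents.
weighted_estimate = sum(
    expected_turnout[group] * support
    for group, (share, support) in poll.items()
)

print(round(raw_estimate, 3))       # 0.5
print(round(weighted_estimate, 3))  # 0.48
```

Note what this cannot fix: if the college-educated urban voters who answered the poll aren’t representative of college-educated urban voters overall, no amount of re-weighting will save the estimate.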
3. Models do it better
Behind every dataset is a model, whether we state it explicitly or not. You can think of “the model” as the set of beliefs we hold about the data. For example, are there underlying differences between groups in the data? How are we taking those into account? We want to be sure that we validate our data, but we also need a high-quality, carefully constructed model that tells us what these data can do for us and what they can’t. One great thing about collecting data in real time in the digital age is that your model can “learn” from the data and become smarter. This is so much harder to do with presidential elections than with compensation data – each salary profile we collect and keep is someone’s validated current job salary or a job offer, so we get thousands of new “results” every day. Think about how comparatively rare presidential elections are: we only get to compare our predictions to results once every four years! So you (and your model) don’t get a shot to do it better for another four years – and think about how much the world changes in four years.
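As a hedged sketch of what “learning as the data arrive” can look like in its simplest form, here is a running-mean update in Python that folds in each new observation as it comes – the salary figures are made up, and a real model would be far richer than a single average:

```python
def update(estimate, n, new_value):
    """Fold one new data point into a running mean,
    without storing the full history."""
    n += 1
    estimate += (new_value - estimate) / n
    return estimate, n

estimate, n = 0.0, 0
for salary in [70000, 72000, 71000, 75000]:  # hypothetical new profiles
    estimate, n = update(estimate, n, salary)

print(estimate)  # 72000.0 -- the mean of the four values
```

Each new profile nudges the estimate immediately; there is no waiting four years for the next “result” before the model can improve.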
Back to the question – what went wrong with the predictions? It’s some combination of the three points above. We were using polling data (point 2, see this Nate Silver post for a deeper dive) to predict the future (point 1) with models that we only get to try out once every four years (point 3).
There is so much we can learn from data, but we should be responsible consumers of data or else we run the risk of reaching the wrong conclusions. Responsible consumption of data doesn’t mean you need to love or understand statistics – just consider carefully what you’re using your data for, where it comes from, and what it can (and cannot) do for you. If you’re not sure you have the right answers to these questions, tread carefully before you bet your house on a Clinton victory, and find someone who does love and understand statistics. There are dozens of us.