How do you forecast an election? Or more specifically, how do you write a model that tries to do that? This is the vast quest we embarked on at PollsPosition. A quest that may seem stupid at first glance, because, well, all models are wrong. Every model rests on simplifying assumptions about reality, hypotheses, decisions. But the goal is not to discover the truth - it's to get as close to it as possible.
In short, our model takes all polls and weights them based on recency, sample size and the pollster's historical performance. Then it simulates thousands of elections to get a distribution of possible results for each party. More precisely, it uses Bayesian methods and Markov chain Monte Carlo - if that sounds unclear, no worries, we detail all of this below.
The goal: using all the information, in a methodical way
During an election, we face information that is both numerous and imprecise. Numerous because, well, there are a lot of polls (#ThankYouCaptainObvious) - too many to remember all of them. Imprecise because, despite their good historical performance, polls are always error-prone - sampling error, sampling bias, measurement error...
Our brain has a lot of trouble digesting all these polls - and the errors that come with them - when assessing the balance of power. So it takes shortcuts. Rather than using all the available information, it throws some of it away and only takes into account the last poll. It compares the polls without asking if they are comparable. And it interprets polls deterministically, whereas the information they deliver is eminently probabilistic.
As a result, each new observation erases the old one, and the consensus is harder to spot than outliers - which catch our eye precisely because they seem to tell another story, even if it's not the right one.
The purpose of our model is therefore to introduce a scientific method of interpreting polls - all polls. A reproducible and systematic method, and no longer a rule-of-thumb analysis that confirms what we already thought. As a result, we get a more reliable, less volatile and less biased estimate - what else?
Step 1: Collecting polls
The idea is to use computers where they are better than us: to keep track of all polls, and to systematically discriminate according to recency, sample size and pollster quality. You can do it by hand - if you are better organized than us - but a computer will be faster and more reliable than you.
To do this, we must start by collecting as many polls as possible. We do it mainly thanks to the Europe Elects initiative, whose raw data are already clean and reliable. We also check that there is no obvious gap with, in particular, the Wikipedia page.
Thus all national polls are integrated into our database. If one is missing, it is because:
- It is very recent and we didn't have the time to add it yet.
- We added it under a different name than the one you are looking for: PollsPosition names polls after the pollster that conducted them, not the media outlet that commissioned them (for example, Ifop and not Paris Match).
- The poll was done by an unknown institute, on which we do not have enough methodological information.
- It was commissioned directly by a political party or a candidate's campaign.
Polls that do not respect the rules of the polling commission (Commission des sondages) are not included in our database. In particular, we exclude "polls" done on websites. First, these have nothing to do with polls per se - it boils down to measuring PSG's popularity by polling its fans. Second, these are not even polls in the eyes of the law: "surveys of this type, which are not carried out with representative samples of the population, do not constitute polls falling within the scope of the law of July 19, 1977".
Finally, each poll has a field date. Pollsters indicate start and end dates, and the date used in our model is the median of that period. For example, if a poll took place February 2-4 and was released on the 7th, it will be dated February 3rd. We do not take the release date into account - and we encourage you to do the same: what matters is when respondents answered the questions, not when their answers appeared in newspapers.
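To make the dating rule concrete, here is a minimal sketch in Python. The function name `field_date` is ours, the year 2019 is only there to build valid dates, and rounding down when the fieldwork spans an even number of days is an assumption - the text above does not say how that case is handled.

```python
from datetime import date, timedelta


def field_date(start: date, end: date) -> date:
    """Return the middle day of the fieldwork period.

    Assumption: when the period has an even number of days, we round
    down to the earlier of the two middle days.
    """
    return start + timedelta(days=(end - start).days // 2)


# The example from the text: fieldwork on February 2-4, released on the 7th.
# The release date plays no role in the computation.
print(field_date(date(2019, 2, 2), date(2019, 2, 4)))  # 2019-02-03
```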
Step 2: Aggregating and weighting polls
Rather than using only the latest poll, we use all available polls - why would you throw information away? By aggregating the data, the extreme values offset one another, statistical noise decreases and you are more likely to spot the signal.
Technically, aggregation is justified by the inevitable existence of statistical biases in pollsters' data and methods. Often, these biases are specific to each pollster, so that the aggregation of different polls, from different pollsters, using different methods, tends to counterbalance these imperfections. Aggregation is all the more useful when 1) many polls are available, 2) these polls come from different sources (methods, samples, pollsters), and 3) it is difficult to know a priori which pollster is the most accurate. The French political landscape matches these conditions.
Obviously, not all polls are born equal. So, like our popularity tracker, our model weights them according to several factors:
- Pollsters' historical performance. As our ranking shows, some pollsters perform better on left-wing parties, others on right-wing parties. Our model takes these differences into account and weights the polls accordingly, which you cannot do with the naked eye.
- Sample size: the larger the (random!) sample, the better - up to a certain point. After that, you enter a zone of diminishing returns. The sampling error decreases substantially between a sample of 200 adults and one of 1,000, but only marginally between 1,000 and 10,000. Above all, a random sample of 2,000 is much more useful than a biased sample of 20,000 responses. At some point, quality trumps quantity.
- Recency: the more recent a survey is, the more weight it has.
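The diminishing returns of sample size can be checked with the textbook margin-of-error formula for a simple random sample. This is only an illustration: it assumes pure random sampling, which real polls do not strictly follow, and the function name `margin_of_error` is ours.

```python
import math


def margin_of_error(sample_size, share=0.5, z=1.96):
    """Approximate 95% margin of error for a simple random sample.

    share=0.5 is the worst case; z=1.96 corresponds to a 95% interval.
    """
    return z * math.sqrt(share * (1.0 - share) / sample_size)


for n in (200, 1_000, 10_000):
    print(f"n = {n:>6}: +/- {100 * margin_of_error(n):.1f} points")
```

Going from 200 to 1,000 respondents cuts the margin from about 6.9 to about 3.1 points, while multiplying the sample by ten again only brings it down to about 1 point - and none of this says anything about bias, which no sample size can fix.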
Finally, note that our aggregation does not directly take methodology into account, mainly because most pollsters use the same one - self-administered online forms - which prevents us from discriminating objectively. But we hypothesize that methodology indirectly influences the ranking, insofar as a small error in the long run partly comes from a good methodology.
The differences between methodologies lie less in their intrinsic qualities than in the groups they manage to reach. Since each method has its blind spots, aggregating counterbalances the different biases. That is why we view a diversification of methodologies (landline and mobile phones, online forms, face-to-face...) as very attractive, with the aim of minimizing sampling bias.
In a nutshell, with this step we give more weight to sources that have proven to be more reliable historically, while giving newcomers a chance. This is exactly what you do when you give greater credence to the National Weather Service than to a weather-forecast enthusiast, who may be knowledgeable... or not. Why not do with polls what you do for any other information?
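To give an idea of what such a weighting can look like, here is a deliberately simple sketch. It is not our actual formula: the exponential decay with a 14-day half-life, the square root of the sample size, the `score` multiplier standing in for pollster quality, and the poll numbers themselves are all made up for illustration.

```python
import math


def poll_weight(sample_size, age_days, pollster_score, half_life_days=14.0):
    """Hypothetical weight: recent, large polls from reliable pollsters count more."""
    recency = 0.5 ** (age_days / half_life_days)  # weight halves every half-life
    size = math.sqrt(sample_size)                 # diminishing returns of sample size
    return recency * size * pollster_score


def weighted_average(polls):
    """Weighted mean of the 'share' field over a list of poll dicts."""
    weights = [poll_weight(p["n"], p["age_days"], p["score"]) for p in polls]
    return sum(w * p["share"] for w, p in zip(weights, polls)) / sum(weights)


# Made-up polls for a single party (share in %, score = pollster quality).
polls = [
    {"share": 22.0, "n": 1000, "age_days": 1, "score": 1.0},
    {"share": 24.5, "n": 2000, "age_days": 10, "score": 0.8},
    {"share": 19.0, "n": 800, "age_days": 30, "score": 1.2},
]
print(round(weighted_average(polls), 1))
```

The old poll at 19% still counts, but far less than the two recent ones, so the outlier nudges the average instead of replacing it.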
Step 3: Modeling uncertainty and simulating elections
Up until now, we got a weighted average for each party, but we did not take into account the uncertainties surrounding the election. If you give this average to your model, it should be happy: it has a lot of data and little uncertainty. As a result, it will be very sure of itself while being very wrong - except by chance.
Polls are actually an imperfect and temporary representation of each party's latent support in the population, which is only observed on election day. Yet this latent support, and the uncertainty around it, is precisely what we seek to estimate.
Thus, the model generates a random error for each party, simulating the fact that pollsters can make larger or smaller mistakes, or that a media event - involving for instance a letter to Congress from the director of the FBI - can arise at the last moment and influence results.
This is a crucial point: the errors we simulate come from the historical distribution of errors, to maximize the chances that they are realistic. Thus, the probabilities derived from the model are really calibrated to represent historical uncertainty. But this comes at a price: our results assume that polling errors in 2019 will not be significantly different from the past.
The model repeats this simulation hundreds of thousands of times - as if the election took place at the same time in thousands of different universes, each time with different errors. This gives a distribution of each party's possible latent support. And from there we can compute the corresponding number of seats, the probability of finishing first, the probability of finishing ahead of a given party, etc. We can in fact ask all the questions we want, since the simulations give us all the possibilities - again, conditional on the model and the data.
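Here is a toy version of that loop, to show the mechanics rather than our actual model. The party names, averages and list of past errors are made up, and the sketch makes two strong simplifications that we flag loudly: each party's error is drawn independently, and shares are not rescaled to sum to 100%.

```python
import random


def simulate(averages, historical_errors, n_sims=10_000, seed=42):
    """Estimate each party's probability of finishing first.

    averages: dict mapping party -> weighted polling average (in %).
    historical_errors: past polling errors (in points) to resample from.
    """
    rng = random.Random(seed)
    parties = list(averages)
    wins = {party: 0 for party in parties}
    for _ in range(n_sims):
        # One "universe": each average is shifted by an error drawn from history.
        draw = {p: averages[p] + rng.choice(historical_errors) for p in parties}
        wins[max(draw, key=draw.get)] += 1
    return {party: wins[party] / n_sims for party in parties}


# Hypothetical inputs, for illustration only.
averages = {"Party A": 24.0, "Party B": 22.5, "Party C": 14.0}
past_errors = [-3.0, -1.5, -0.5, 0.0, 0.5, 1.0, 2.5]
print(simulate(averages, past_errors))
```

Once the draws exist, any other question (seats, head-to-head rankings, thresholds) is just a different count over the same simulated universes.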
How do you get the distributions for each party?
This question takes us back to the 18th century, when British pastor and mathematician Thomas Bayes formulated the theorem that would bear his name. A basic and relatively simple formula in probability theory, it is the foundation of Bayesian inference, which characteristically expresses its results in probabilistic terms.
In plain English, the theorem answers this question: having a prior on the probability of an event, and observing information related to this event, how should I change my prior to take this information into account (and so build my posterior)? This is no more than a form of learning: the formula guarantees the most logical way of handling new observations, given our initial assumptions.
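In symbols, the posterior is proportional to the prior times the likelihood of the data. The sketch below applies that rule on a grid of candidate support levels for one party, after observing a single made-up poll (220 supporters out of 1,000 respondents). The flat prior, the binomial likelihood and the function name `grid_posterior` are illustrative assumptions, not our production model.

```python
import math


def grid_posterior(grid, prior, supporters, respondents):
    """Bayes' rule on a grid: posterior is proportional to prior x likelihood."""
    # Work in logs to avoid numerical underflow with large samples.
    log_post = [
        math.log(pr)
        + supporters * math.log(p)
        + (respondents - supporters) * math.log(1.0 - p)
        for p, pr in zip(grid, prior)
    ]
    top = max(log_post)
    unnormalized = [math.exp(lp - top) for lp in log_post]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


grid = [i / 100 for i in range(1, 100)]   # candidate shares: 1% to 99%
prior = [1.0 / len(grid)] * len(grid)     # flat prior: no initial preference
posterior = grid_posterior(grid, prior, supporters=220, respondents=1000)

most_likely = grid[posterior.index(max(posterior))]
print(most_likely)  # 0.22
```

With a flat prior the most probable share is simply the poll's own 22%, but the posterior also tells you how plausible 20% or 24% are - and a more informative prior would pull the answer toward what we believed before seeing the poll.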
It thus minimizes the cognitive biases that occur while processing information. But it does not guarantee that your model is good: if your assumptions are bad or biased, your model will be bad and you will have to rethink your prior. But that is what makes this method so useful: this formula requires you to integrate the facts logically into your reasoning and draw the necessary conclusions.
Another advantage over non-Bayesian methods: you get probability distributions rather than point estimates, which allows for more intuitive communication of uncertainty. Imagine two drugs that are each expected to cause side effects in 0.5% of patients on average. The first drug has a 5 in 6 chance of affecting between 0.3% and 0.7% of patients. The second one has the same probability, but of affecting between 0.01% and 1% of patients. Is your decision to market the drug the same in both cases? Probably not - and yet the average (the point estimate in this case) is the same...
Behind this simple formula, there are massive complications in calculating the posterior probabilities. Approximation methods such as Markov chain Monte Carlo have greatly contributed to the spread of Bayesian methods in recent years. The increase in computing power has done the rest.
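To demystify Markov chain Monte Carlo a little, here is a hand-rolled random-walk Metropolis sampler, the simplest member of the family. It estimates one party's support from a made-up poll (220 supporters out of 1,000) under a flat prior. Everything here is an illustrative assumption: the step size, the starting point, the burn-in length, and the algorithm itself, which is much cruder than the samplers found in dedicated libraries.

```python
import math
import random


def log_posterior(p, supporters, respondents):
    """Log of (flat prior x binomial likelihood), up to a constant."""
    if not 0.0 < p < 1.0:
        return -math.inf  # impossible share: zero posterior probability
    return supporters * math.log(p) + (respondents - supporters) * math.log(1.0 - p)


def metropolis(supporters, respondents, n_samples=20_000, step=0.02, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept it with
    probability min(1, posterior ratio), otherwise stay where we are."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, supporters, respondents)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, supporters, respondents)
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


samples = metropolis(supporters=220, respondents=1000)[2_000:]  # drop burn-in
print(round(sum(samples) / len(samples), 3))
```

The chain never computes the normalizing constant of the posterior: it only compares the posterior at two points, which is exactly what makes this family of methods practical when the full calculation is intractable.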
Concretely, we use the Python programming language and the PyMC3 open-source library to code our model and perform inference computations. So let us conclude with a huge thank you to all the contributors, who make our lives easier and help democratize these methods.