PollsPosition's Pollster Ratings

A statistical analysis of the historical accuracy and methodology of every French polling firm

By Alexandre Andorra, Alexis Bergès and Bérengère Patault

Updated August 16, 2018

PollsPosition calculates its ranking by analyzing the historical accuracy of each pollster in the first round of all elections with enough surveys 1 .

Why the first round? Mainly because the five major political parties are (almost) always represented, which provides stability and a basis for comparison. Incidentally, the large number of candidates in the first round makes the pollster's work considerably harder than in the second round. Our assumption is that this allows a closer evaluation of each institute's quality: "whoever can do more can do less".

With all these filters, our ranking is based on nearly 800 surveys and takes into account sample sizes, field dates, the seniority of the institute, and the accuracy of other pollsters during the same election.

If you are dying to discover our method, you will find a more detailed analysis below. If you have other things to do - it's a shame, but we understand - you can just look at the scores obtained by each institute for each party. They are designed to summarize our results, even though they cannot replace them.

Pollster     Far left   Left   Center   Right   Far right
Elabe        A          C      B        B       A
Harris       A          A      C        C       A
Ifop         C          D      D        B       C
Ipsos        B          C      A        A       A
Kantar       D          B      A        A       B
Odoxa        A          B      A        D       B
OpinionWay   B          D      C        A       C

Sources: Polls Commission, pollsters' websites, Sondothèque of Sciences Po, newspaper archives, open data sites, and search engines; PollsPosition calculations on 748 surveys conducted by nine polling institutes (Ifop, Kantar Sofres, Ipsos, BVA, CSA, Harris Interactive, OpinionWay, Odoxa, and Elabe).

This table shows the score (in decreasing order from A to D) of each pollster for each party in the first round of fifteen presidential, legislative, European and regional elections. The scores are regressed towards the mean in order to take into account the number of elections that each institute has covered.

An A rating indicates that the institute ranks, on average, within the top 25% for this party. The next 25% obtain a B, and so on down to a D grade. In other words, an institute with an A performed better than the average pollster, while a D shows lower than average accuracy. B and C are intermediate grades, signaling slightly higher or slightly lower than average accuracy.

For example, on average, Kantar measured the right-wing candidate more accurately than three quarters of the pollsters, but it underperformed the market for the far-left candidate.

The following is a more detailed presentation of the method we use to establish this ranking.

How does PollsPosition rank pollsters?

Unlike in the United States, the statistical treatment of elections in French media remains artisanal. The monitoring of voting intentions is treated like a horse race, where informed speculation takes up far more space than the systematic use of available empirical information. Not that the former is useless or that the latter is a cure-all. But it seems to us that the two are complementary.

The current approach, where every new observation chases the old one, does not fully take advantage of the information provided by polls. This prevents us from getting a proper overview. Consensus is then harder to spot than outliers, which catch our eye because they seem to tell another story - even if it's not the right one.

Secondly, the media treatment of polls in France implicitly puts all the institutes on the same level by not taking into account their historical accuracy, their methodologies, or the size of their samples. Our classification aims to change this by introducing a scientific method of interpreting surveys. A reproducible and systematic method, and no longer a haphazard analysis to confirm what we already thought.

The idea is simple: are some pollsters structurally better than others? If yes, why? Is it because of their methodology or exogenous factors? Are the best-known or oldest pollsters the best? Will today’s best be tomorrow’s best?

Overall, the goal is also to question the idea that "polls are always wrong": what does it mean for a pollster to be wrong? Can we do better with current methods? Are there other methods of measuring intentions that are more effective?

State of the art

As far as we know, there is no pollster ranking within the French landscape. By pollster ratings, we mean a classification based on their past accuracy. Such a classification takes into account the institutes’ years of experience, their methodology 2 , their historical precision, the performance of other pollsters analyzing the same election, etc.

At least two difficulties arise when one undertakes this task. First, the French political landscape is quite complex, in particular because of its two-round multi-party system, which leads to more diverse scenarios than, for example, the single-round two-party system of the United States.

More practically, France has only eleven active polling firms 3 in the field of voting intention studies, while the United States counts several hundred. This means fewer surveys in our database, but it does not call into question the principle or the value of a comparative ranking.

Step 1: Collect and sort all polls

It may sound trivial, but it's pretty complicated. We have spent hours looking for and collecting polls for presidential, legislative, European and regional elections, going back to 1965 (date of the first presidential election by popular vote). As a result, our database contains all the polls we've seen - approximately 1,500 of them.

Since data is not always perfect, we had to make some choices:

  • Polls are tied to their median date, not the date of publication. For example, a February 2-4 poll, released Feb. 7, will be dated Feb. 3.
  • We name surveys according to the institute, not the media that ordered it (for example, Ifop and not Paris Match). The goal is to associate the survey with the entity that has contributed the most to its methodology.
  • Old surveys do not always indicate their sample size. We worked around this by drawing random sample sizes from a range close to the average sample size of each institute.
  • In France, most pollsters publish results based on registered voters on the electoral lists. Nevertheless, when a poll was also published based on likely voters, we selected the latter version.
  • When a survey is published in two versions (one with one or more smaller candidates, the other without), we take the "with" version, because we consider that it is up to the voters to select the candidates, not the pollster.
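Two of the conventions above, the median field date and the stand-in sample sizes, can be sketched in a few lines of Python. This is only an illustration: the function names, the 10% range around the institute's average, and the choice to round the midpoint down to a whole day are our assumptions here, not a description of the actual pipeline. The year in the example is arbitrary.

```python
from datetime import date, timedelta
import random


def median_field_date(start: date, end: date) -> date:
    """Return the midpoint of the field period, rounded down to a whole day."""
    return start + timedelta(days=(end - start).days // 2)


def impute_sample_size(institute_mean: int, spread: float = 0.1, rng=None) -> int:
    """Draw a stand-in sample size within +/- spread of the institute's average.

    The spread of 10% is a hypothetical value chosen for illustration.
    """
    rng = rng or random.Random()
    low = round(institute_mean * (1 - spread))
    high = round(institute_mean * (1 + spread))
    return rng.randint(low, high)


# Fieldwork on February 2-4 is dated February 3, whatever the release date.
print(median_field_date(date(2017, 2, 2), date(2017, 2, 4)))  # 2017-02-03

# A poll with no reported sample size, from an institute averaging 1,000 respondents.
print(impute_sample_size(1000, rng=random.Random(0)))
```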

The distribution across polling institutes and elections is uneven: the older an institute, the more data we have on it; the more recent an election, the more polls are available. For example, the 1965 election was only analyzed by Ifop, while the 2012 one was followed by eight institutes. These limitations are inherent to the subject, and we have no control over them.

However, we believe that there are still some existing polls out there - especially for the old elections, for which the digitization was not systematic. Which is why we are working to expand our database. Despite the hours we dedicated to it, there is a possibility that our database contains some errors - introduced by us or by our sources.

Our sources are varied and public: the Polls Commission, the websites of pollsters who published their archives (in this respect, the former TNS Sofres did a remarkable job), the Sondothèque of Sciences Po 4 , newspaper archives, open data sites and, quite simply, search engines.

Let's end with a question we are often asked: why not include primary polls in the database? First, the US experience tells us that presidential primary polls are often much less accurate than general election polls. The reasons have less to do with pollsters or the United States than with the nature of primaries: turnout is much lower; the candidates are ideologically close, so voters switch more easily from one to another; and voters are slow to decide. It is therefore quite possible that these factors also come into play in France. Secondly, primaries are very recent in France, which reduces the significance of the pollsters' results. For these two reasons, we do not include primary polls in our rankings - for now.

Step 2: Calculate the relative error by election

Once the data are collected and filtered, we can begin to process them. The purpose of this step is to see by how much each institute was wrong, for each election and each party.

First and foremost, the ranking does not directly account for the method of collection, mainly because most pollsters use the same one - self-administered online forms - which prevents us from objectively distinguishing between them. But we assume that collection methods indirectly influence the ranking, in the sense that a good methodology is reflected by small errors in the long run. That being said, we strongly encourage the diversification of collection methods (landlines, mobiles, online forms, big data) 5 and will change our weighting system if necessary, to favor institutes which use diversified methods. This is what we currently do with our popularity tracker, where pollsters seem more open to experimentation.

  • The model starts by aggregating the polls of each institute, for each candidate and each election. For example, it calculates the average survey that we would have obtained for François Fillon in 2017 if we had only looked at Ifop.
  • The model then does the same thing for all pollsters – what we later call "the market". To go back to our previous example, it computes the average voting intention for Fillon in 2017 across all polls. As noted above, these aggregations take into account the field date and sample size of each survey.
  • Then, for each candidate and each election, the model looks at how far removed from the result each pollster is - what we refer to as "the raw error". It also calculates the market error - how far removed is the market from the outcome? For example, if the market indicates 25% for Fillon in 2017 and he gets 20%, then the market was wrong by 5 points - the error is the same if Fillon gets 30% on Election Day.
  • Finally, the model estimates the distance between the market’s forecasts and those of each pollster, thus obtaining what we call "relative error". In other words, the model observes, at each election, how far removed each institute is from the market.
  • This allows us to know who has underperformed or outperformed the market while controlling for the difficulty of polling a particular election. Indeed, knowing the error of a pollster is useful but does not place its performance in a historical context. Imagine that Institute X performed very well in 2017; if the market also performed well, we cannot then say that X did better than the others. But if the market was bad while X was doing very well, then we can conclude that X outperformed. Conversely, if X completely misses the target one year, but so does the market, then we can assume that this election was particularly difficult to poll and that X does not deserve to be penalized beyond measure. The relative error neutralizes these contextual effects and makes it possible to really compare pollsters with one another.
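To make the steps above concrete, here is a minimal Python sketch for one candidate in one election. It is not the actual model. The weighting function (square root of the sample size, with a 14-day half-life on recency) is a hypothetical choice, and "relative error" is read here as the pollster's raw error minus the market's error, which is one interpretation consistent with the outperform/underperform discussion. The poll figures are made up, loosely following the Fillon example (market at 25%, result at 20%).

```python
from dataclasses import dataclass


@dataclass
class Poll:
    pollster: str
    share: float       # voting intention for one candidate, in points
    sample_size: int
    days_before: int   # days between the median field date and election day


def poll_weight(poll: Poll, half_life: float = 14.0) -> float:
    """Hypothetical weight: larger samples and fresher polls count more."""
    recency = 0.5 ** (poll.days_before / half_life)
    return poll.sample_size ** 0.5 * recency


def weighted_average(polls) -> float:
    """Weighted mean of the candidate's share across the given polls."""
    total = sum(poll_weight(p) for p in polls)
    return sum(poll_weight(p) * p.share for p in polls) / total


def relative_errors(polls, result: float) -> dict:
    """Return {pollster: raw error - market error} for one candidate."""
    market_error = abs(weighted_average(polls) - result)
    errors = {}
    for name in sorted({p.pollster for p in polls}):
        own_polls = [p for p in polls if p.pollster == name]
        raw_error = abs(weighted_average(own_polls) - result)
        errors[name] = raw_error - market_error
    return errors


# Made-up polls: the market averages 25 points, the candidate gets 20.
polls = [
    Poll("X", 24.0, 1000, 7), Poll("X", 22.0, 1000, 7),
    Poll("Y", 26.0, 1000, 7), Poll("Y", 28.0, 1000, 7),
]
# X comes out negative (closer than the market), Y positive (further away).
print(relative_errors(polls, result=20.0))
```

Because the error is an absolute distance, the market is equally wrong whether the candidate ends up 5 points below or 5 points above its estimate, as in the example in the list above.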

Step 3: Compute the total relative error, and turn it into weights and grades

  • The model then aggregates these errors across all elections, giving more influence to presidential elections as well as the most recent ones (those taking place after 2006). In other words, we established each pollster’s accuracy over time, for each party (yay!).
  • The model then assigns a weight to each institute: the best one (the one with the smallest total relative error) weighs the most and the worst one weighs the least. But to represent a reasonable inference of the pollsters' future accuracy, these weights must take into account the number of polls conducted by each institute. Imagine that you have 10 surveys from pollster A and 100 surveys from pollster B; which can you evaluate with the greatest certainty? B of course, because your sample of its past accuracy is 10 times larger, and thus gives you more information on its future accuracy. By contrast, there is still a very good chance that A's results are due to luck, so you should stay close to your benchmark (for example the average accuracy of pollsters in France). You apply the same reasoning when you see a young football player score 2 goals in 2 games: you won’t conclude that he is the future Messi; you will wait to see his performance throughout an entire season - that is to say on a larger sample. In short: the smaller your sample is, the more the statistics you are studying will be subject to random variations – and therefore not attributable to specific causes 6 .
  • To take into account this statistical optical illusion, we regress the weights attributed to the institutes to the average weight, according to the number of polls they conducted. By doing this, the more data we have on an institute, the more its weight will reflect its past accuracy; the less we have, the more its weight will be close to the average. See this as a way of suspending your judgment on the quality of an institute, in order to have the time to acquire new data.
  • All of this allows us to use the ratings in our models and trackers, something that cannot be done with the naked eye. This is what we mean by “scientific method”: the pollster ratings provide a system of analysis, a way of sorting information that requires you to consider the data and realign your opinions if needed.
  • As a last step, more educational than functional, we transform each pollster’s weights into grades ranging from A to D, which you can see in the table at the top of this page. It should be noted that, thanks to regression to the mean, the grades take into account the seniority of the institutes: it compensates for the optical illusion created by the different number of polls conducted by each pollster. This shows that the hierarchy changes depending on the party we analyze, and that few pollsters are structurally in the top tier – hence the point of aggregating polls rather than hoping that one of them will hit the bull's eye.
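The regression to the mean and the grading described above can be sketched as follows. This is an illustration under our own assumptions, not the formula actually used: the shrinkage factor n / (n + k) is a common textbook form, the value k = 50 is arbitrary, and the function names are hypothetical.

```python
def shrink_to_mean(raw_weight: float, n_polls: int, mean_weight: float,
                   k: float = 50.0) -> float:
    """Pull a weight toward the average weight.

    The fewer polls an institute has, the closer its weight stays to the mean.
    k is a hypothetical constant: the number of polls at which the institute's
    own record and the average count equally.
    """
    reliability = n_polls / (n_polls + k)
    return mean_weight + reliability * (raw_weight - mean_weight)


def grades_by_quartile(weights: dict) -> dict:
    """Map {pollster: weight} to letters: top quarter A, then B, C and D."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    n = len(ranked)
    return {name: "ABCD"[4 * rank // n] for rank, name in enumerate(ranked)}


# Same raw weight, different amounts of evidence (10 polls versus 100 polls).
few = shrink_to_mean(1.5, n_polls=10, mean_weight=1.0)
many = shrink_to_mean(1.5, n_polls=100, mean_weight=1.0)
print(few < many)  # True: the institute with 10 polls stays closer to the mean

# Made-up weights for four hypothetical institutes.
print(grades_by_quartile({"P1": 1.3, "P2": 0.8, "P3": 1.1, "P4": 0.9}))
```

With this form, an institute with no polls at all would simply receive the average weight, which matches the idea of suspending judgment until new data arrive.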

There is a lot to say about this ranking, but the first point is that the hierarchy it establishes is not fixed. With new elections, the position of each pollster will evolve. Do not over-interpret the results of this ranking: Elabe and Odoxa have only followed three elections, while Harris Interactive's historical data are surprisingly inaccessible. The errors of these three pollsters contain more noise than signal. Even the older ones, such as Ifop or Kantar, have only followed about 20 elections, which is very little.

Each election will gradually dissipate this fog, but certainty is not of this world. That being said, our approach is fundamentally Bayesian. An approach that can be summed up by John Maynard Keynes' famous quote (probably apocryphal): "When the facts change, I change my mind. And you, sir, what do you do?"

Alexandre Andorra, Alexis Bergès and Bérengère Patault are the founders of PollsPosition.

1 : More specifically, our ranking is based on all the presidential elections from 2002 to 2017 plus those of 1974 and 1988, the legislative elections of 1997, 2012 and 2017, all European ones since 2004, and all regional ones since 2004. We only include elections covered by at least 2 pollsters, each of which published at least 1 survey.

2 : Sample size? Field dates? Respondents base (all adults, registered voters or likely voters)? Data collection method (telephone, internet ...)?

3 : In chronological order of creation: Ifop, Kantar Sofres (formerly TNS Sofres), Ipsos, BVA (which bought LH2 in 2014), CSA, Harris Interactive, OpinionWay, Viavoice, YouGov, Odoxa and Elabe.

4 : We thank them for this initiative, which is eminently valuable to researchers.

5 : The difference between the various methods comes less from their intrinsic quality than from the audiences they can reach: better support for respondents, less misunderstanding of the questions, easier expression for "shy voters", access to new audiences, etc.

6 : This statistical optical illusion is formalized by de Moivre's formula (a fairly simple formula but even easier to forget), which tells us that volatility increases when the sample size decreases. This implies that small samples are more likely to produce extreme results - the noise drowns out the signal. Thus, small schools show both the best and the worst results; the least populated departments have both the highest and the lowest cancer rates; small towns are at once the safest and the least safe, and so on.
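For reference, the formula mentioned in note 6 is usually written as the standard error of a mean. The notation below is ours: sigma is the standard deviation of individual observations and n is the sample size.

```latex
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\sigma_{\bar{x}}(n = 10)}{\sigma_{\bar{x}}(n = 100)} = \sqrt{\frac{100}{10}} = \sqrt{10} \approx 3.2
```

Applied to the example of 10 surveys versus 100 surveys, and assuming the same underlying spread for both pollsters, an average built on 10 observations is about three times noisier than one built on 100.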