by Dr Adrian Worton
Long-time readers of this website will know that for the past three General Elections (2015, 2017 and 2019) we created models of the General Election using bookmaker odds.
These let us build simulators that could run a simulation of each election based on the probabilities inferred from the odds, and also allowed us to analyse the elections in more depth.
Now there is to be a new election in July, coming slightly out of the blue, so we are going to be looking at the odds again. However, rather than diving straight in, I am going to take the opportunity to look back at the data collected from the three previous elections and fine-tune our model for this year.
Specifically, we are going to look at the odds and how accurately they predict the likelihood of a candidate winning a given seat. We can then use this to check whether our methodology matches this. Then we will be ready to start making predictions on the upcoming vote!
The data
We have odds for all 650 constituencies across the last three elections. That's 1,950 seats, most with odds for somewhere between two and seven parties. In some years we excluded parties whose odds were longer than a certain cut-off (say 500/1). In total, we have odds for 8,703 candidates in our dataset.
We will be using the latest odds collected for each election. In each case, these were taken on the morning of each vote.
Implied probability
For given decimal odds (x) we can infer the probability (p) of the outcome happening - the implied probability - by adding 1 to the odds and then dividing 1 by the result. Algebraically:
p = 1 / (x + 1)
(Side note: another way to calculate this for fractional odds is to divide the denominator by the sum of numerator and denominator.)
To pick a constituency at random: in Guildford the Conservatives have odds of 9/4 to win. We convert this to decimal (2.25), add one (3.25) and divide 1 by this (1/3.25). This gives 0.3077 - so the Conservatives have an implied probability of 30.8% of winning Guildford in July.
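As a quick illustration, that conversion is only a couple of lines of Python (the fraction is just the Guildford example above; any odds string can be substituted):

```python
from fractions import Fraction

def implied_probability(odds: str) -> float:
    """Convert fractional odds such as '9/4' to an implied probability."""
    x = float(Fraction(odds))  # decimal value of the fractional odds
    return 1 / (x + 1)         # p = 1 / (x + 1)

print(implied_probability("9/4"))  # 0.3077 -> 30.8% for the Conservatives in Guildford
```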
We apply this method to every set of odds in our dataset, then group the resulting probabilities into 2% ranges (so every candidate given between a 0% and 2% chance of winning is put in one group, those between 2% and 4% in the next - and so on).
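For anyone wanting to reproduce this, here is a minimal sketch of the grouping in Python (the shape of records - pairs of implied probability and whether the candidate won - is my assumption, not our actual pipeline):

```python
def calibration_bands(records, width=0.02):
    """Group (implied_probability, won) pairs into 2% bands and
    return each band's midpoint alongside its actual win rate."""
    bands = {}
    for p, won in records:
        band = min(int(p / width), int(round(1 / width)) - 1)  # clamp p = 1.0 into the top band
        bands.setdefault(band, []).append(won)
    return {
        round((band + 0.5) * width, 3): sum(wins) / len(wins)
        for band, wins in sorted(bands.items())
    }
```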
We can then see the percentage of each group that actually won. We would expect the percentages to roughly match - e.g. we would expect around 47% of those in the 46-48% bracket to win. Here is the result:
The green line is the linear regression model for the resulting data points, the grey line is one running through x = y. We would expect these lines to match. Specifically, we would expect the green line to be slightly below the grey line (for reasons explained in the next section) but following the same slope.
The results fit expectation pretty well. Many points fall on the grey line - for example, candidates with an implied probability in the 92-94% range won 93.7% of the time. But what we can see is that most of the points for the higher percentages are above the grey line (12 of 14 points above 72% do better than implied) whilst most of the points for lower percentages are below (only two points below 50% do better than implied).
To put it another way, strong favourites do even better than predicted, whilst there are fewer outsiders winning than you may expect.
Transforming the data
We will now consider how we can interpret the odds to better match reality.
The first issue is that the implied probability doesn't take the bookies' cut into account. Were the odds truly reflective of probability the bookmakers wouldn't make any money. Therefore, prices are slightly shorter than they otherwise would be (this makes the payout smaller, and therefore the implied probability higher).
For example, to return to Guildford: we know the Conservatives have a 30.8% chance of victory. But using the same method gives the Liberal Democrats (1/3) a 75% chance of winning. This adds to 105.8% even before we add in Labour (4.8%), Reform (1.5%) or the Greens (0.5%). In total we have five parties with a combined 112.5% chance of winning, which is obviously impossible.
This is fairly typical. For our 2015 data each seat gave an average total probability of 107.8%.
The way we have always dealt with this is to assume that the bookies' mark-up affects each odd proportionally, and so we divide every implied probability by their sum so that they do ultimately add up to 100%.
So for Guildford we divide the implied probabilities of the five parties by 1.125. This gives our final probabilities of:
- Liberal Democrats: 66.7%
- Conservatives: 27.3%
- Labour: 4.2%
- Reform: 1.3%
- Greens: 0.4%
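In code, that normalisation is a single division (using the rounded Guildford figures above for illustration):

```python
def normalise(implied: dict[str, float]) -> dict[str, float]:
    """Scale implied probabilities so they sum to 100%."""
    total = sum(implied.values())  # roughly 1.125 for Guildford
    return {party: p / total for party, p in implied.items()}

guildford = {"LD": 0.750, "Con": 0.308, "Lab": 0.048, "Ref": 0.015, "Grn": 0.005}
print(normalise(guildford))  # roughly the figures in the list above
```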
The clear problem with this is that it takes more off the favourite than the underdogs: the Lib Dems lost over eight percentage points from their likelihood whilst Labour, Reform and the Greens each lost under one.
If we apply this method to our dataset and plot it the same way as we did earlier, the problem can be seen below:
The yellow line and points are for the adjusted probabilities; the green line and points are the ones from the implied probabilities shown earlier.
The b numbers above are the slopes of the lines. The method using implied probabilities has a slope of 1.140, which means for every increase of 1% in predicted probability the actual win rate goes up by 1.14%. For our adjusted method this increases to 1.202. We want the slope to be as close to 1 as possible, so by adjusting the odds in this way we are getting further from what we want.
Looking at the yellow points against the grey line (which, as x=y, has a slope of 1, so is what we want our regression lines to match), only one point above 50% is below the line, and only three points below 50% are above the line. We would ideally want the points to be equally above and below the line.
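The slopes themselves come from a standard least-squares fit; something like this (using numpy on the band/win-rate pairs from the earlier sketch) would reproduce them:

```python
import numpy as np

def regression_slope(points: dict[float, float]) -> float:
    """Least-squares slope b of actual win rate against predicted probability."""
    x = np.array(list(points.keys()))
    y = np.array(list(points.values()))
    b, a = np.polyfit(x, y, 1)  # fit y = b*x + a
    return b
```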
Phi
So our adjusted method for calculating probability underestimates the favourites and overestimates the underdogs even more. We spotted this in our initial 2015 model and introduced a coefficient ϕ (phi, chosen because it's my favourite Greek letter). An explanation of this is in the linked article, but effectively it was an exponent applied to each implied probability (before the adjustment). This made the higher probabilities higher and the lower ones lower.
From said article, this is a little interactive tool to see how changing ϕ alters the resulting probabilities:
As ϕ increases, you can see the points above change to an s-shape distribution. This is similar to the distribution of the points in our scatter graphs above, which suggests that using ϕ does indeed make our probabilities match reality to some degree.
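In code terms the transformation is simple: raise each implied probability to the power ϕ, then renormalise as before. A minimal sketch:

```python
def apply_phi(implied: dict[str, float], phi: float) -> dict[str, float]:
    """Raise each implied probability to the power phi, then renormalise.
    With phi > 1, favourites gain share and outsiders lose it."""
    powered = {party: p ** phi for party, p in implied.items()}
    total = sum(powered.values())
    return {party: p / total for party, p in powered.items()}

# e.g. apply_phi(guildford, 1.5) lifts the Lib Dems from ~67% to ~78%
```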
To see which value of ϕ gives the best results, we will take the values 1.1, 1.2, 1.3 ... 2.0 for ϕ and apply them to our dataset. This gives us ten sets of probabilities which we can again plot against the real-life results:
As we would expect, the higher the value of ϕ the more the predictions favour the favourites. This makes the line flatter, and by ϕ = 2.0 the line is quite a long way away from the x = y line (still in grey). With a slope of 0.825 it is further away from 1 than our original probabilities.
We simply need to look at which value of ϕ has a slope as close to 1 as possible. ϕ = 1.4 is 0.027 away, ϕ = 1.5 is 0.011 away. Strictly speaking, a value of something like ϕ = 1.47 would be as close as possible to 1, but we will just take ϕ = 1.5 as our best value.
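Tying the sketches together, that search looks something like this (transform_dataset is a hypothetical helper, standing in for however the full dataset is re-scored, that returns (probability, won) pairs for every candidate at a given ϕ):

```python
def best_phi(transform_dataset, candidates=None):
    """Return the candidate phi whose calibration slope is closest to 1."""
    candidates = candidates or [round(1 + 0.1 * i, 1) for i in range(1, 11)]  # 1.1 ... 2.0
    def distance(phi):
        points = calibration_bands(transform_dataset(phi))
        return abs(regression_slope(points) - 1)
    return min(candidates, key=distance)
```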
This is lower than the value we have used in the past, but of course it is based on much more data.
This has been quite a dry technical review of how we calculate our probabilities based on bookmakers' odds, but now that is done we will soon be publishing our first seat projections! With not long to go until the vote (and me on holiday next week) it remains to be seen how much analysis we'll be able to do, but hopefully a fair bit.