Leng and I had a really interesting exchange with Simon Jackman on Twitter yesterday. Me and my whopping eight followers (hi mum) would probably agree that I suck at Twitter. It became difficult to continue the discussion with the 140 character limit, so Leng and I thought we’d try and outline it here.
My buddy Davis is an actual, real statistician, and a damn fine one at that. Leng and I want to be real statisticians one day when we grow up, so we were really pleased to get some feedback on the blog from him.
One thing he pointed out was that the PMFs in the previous post for the ‘maximum covariance’ case look a little funky. Here they are again:
These don’t look like distributions we’d normally see, like a normal distribution. They’re truncated at seemingly arbitrary values (a minimum bound at 25 seats for Labor and 48 seats for the Coalition) and reach their peaks at the edges. And they suggest that it’s roughly twice as likely Labor will only win 25 seats compared to a more believable 60ish, which doesn’t sound right.
Is something wrong here? We don’t think so.
The maximum covariance model doesn’t represent the real world. It’s only meant to be an upper bound, an outer limit of what could technically be possible given the probabilities we inferred on August 6. In the bizarre maximum covariance model universe, election outcomes between seats are very strongly correlated. This means that when Labor win, they win big, and similarly for the Coalition.
Consider an extreme case, where all 150 seats have 100% correlation with each other. Then in this case, there are only two possible outcomes: Labor wins 150 seats, or the Coalition wins 150 seats. A brutal environment for career politicians! In this case, the PMFs would be completely concentrated at the edges. In practice, the probabilities inferred from the betting odds constrain the correlations between many seats to be much less than 100%, resulting in some mass between the two extremes.
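As a toy illustration of that extreme case (hypothetical code, not our actual model — the function name and the 0.4 Labor probability are made up): with 100% correlation, a single shared draw decides all 150 seats at once.

```python
import random

def perfectly_correlated_election(n_seats=150, p_labor=0.4, rng=random):
    # With 100% correlation between seats, one shared random draw decides
    # every seat at once: either Labor sweeps all n_seats or the Coalition does.
    labor_sweep = rng.random() < p_labor
    return n_seats if labor_sweep else 0  # Labor's seat count

counts = {perfectly_correlated_election() for _ in range(1000)}
# only two outcomes ever appear: 0 or 150
```

However many simulations you run, the seat count never lands between the extremes — all the probability mass sits at the edges, just as described above.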
What causes the arbitrary truncation points? We make adjustments to the inferred probabilities to counter longshot bias: seats with less than a 0.1 probability of victory for Labor are rounded down to 0 (and similarly for all other parties). On August 6, this resulted in 25 seats with a 100% probability of Labor victory and 48 seats with a 100% probability of Coalition victory. So the bias-adjustment explains the truncation points.
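A sketch of that adjustment (the 0.1 cutoff is from the post; the function name is ours, and renormalising the surviving parties' probabilities to sum to one is just one simple choice for redistributing the removed mass — an assumption on our part):

```python
def adjust_longshot_bias(party_probs, cutoff=0.1):
    # party_probs: {party: implied win probability} for one seat.
    # Probabilities below the cutoff are rounded down to zero to counter
    # longshot bias; the survivors are renormalised to sum to one (one
    # simple way to redistribute the removed mass -- an assumption here).
    kept = {party: p for party, p in party_probs.items() if p >= cutoff}
    total = sum(kept.values())
    return {party: p / total for party, p in kept.items()}

adjust_longshot_bias({"Labor": 0.05, "Coalition": 0.90, "Green": 0.05})
# Labor and Green fall below the cutoff, leaving the Coalition at 1.0
```

Seats where only one party survives the cutoff end up at probability 1.0, which is exactly how the 25 "certain Labor" and 48 "certain Coalition" seats arise.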
The aim of this whole exercise has been to build covariance into the model. We don’t know what the true covariances are between seats, so instead we look at upper and lower bounds on the covariances (these bounds are set by the probabilities inferred from the betting odds). The main takeaway is that the overall probability of a Labor victory increases dramatically when you include covariance between seats; but even if you include unrealistically high covariance, the betting markets still believe Labor are likely to lose the election.
Leng and I have been working on a few improvements to our model that we wanted to share.
The story so far
We’ve been converting betting odds for each seat to probabilities of victory for each party. This is straightforward, and many others (such as the Guardian’s Simon Jackman) are doing this. Using these probabilities to estimate the most likely number of seats each party will win (and ultimately, who will win the election) is more subtle and difficult. In a previous post, we talked about a common way people get this wrong. So far, our approach has been to treat electorates as independent of each other, which makes the number of seats won by each party a Poisson-Binomial random variable.

But electorates aren’t independent. All politics might be local in some parts of the world, but in Australia people vote not just for their local representative, but for who they want to be Prime Minister. For example, people in seats A and B might consider both local and national issues when choosing who to vote for. If they only consider local issues (e.g., local planning decisions), then the electorates are independent. If they only consider national issues (e.g., federal issues such as border protection), the electorate results will be highly correlated. The truth is somewhere in between. But if Kevin Rudd were found to be secretly from New Zealand tomorrow, we think there would probably be a national swing against him in horror and disgust, and that national swing can’t be modelled correctly if we assume the seats are independent.
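Under independence, the exact Poisson-Binomial PMF of the seat total can be built with a simple dynamic program, adding one seat at a time. This is a sketch, not our production code, and the seat probabilities in the example are made up:

```python
def poisson_binomial_pmf(win_probs):
    # pmf[k] = P(party wins exactly k seats), assuming seats are independent.
    # Process one seat at a time: each seat either adds a win (prob p) or not.
    pmf = [1.0]
    for p in win_probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)      # this seat is lost
            nxt[k + 1] += mass * p        # this seat is won
        pmf = nxt
    return pmf

poisson_binomial_pmf([0.5, 0.5])
# two fair coin-flip seats: [0.25, 0.5, 0.25]
```

With 150 seat probabilities in place of the toy list, summing `pmf[76:]` gives the probability of winning a majority — but only under the independence assumption this post is questioning.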
How important is this covariance? Let’s take a look.
Updating the model
First, we needed to update our model. The problems here are i) including covariance in your model makes it much more complicated and ii) we don’t know what the actual covariances are between seats. We sucked it up and tackled the first part, so we can now include covariances in our model: our simulations are now samples from a correlated multivariate Bernoulli distribution.
The second part is harder: it’s not clear how to estimate covariances between seats. Instead, we’ve calculated an upper bound on the covariance between each seat. The maximum covariance is bounded by the probabilities inferred from the betting odds. You can derive an upper bound using the definition of covariance for a Bernoulli random variable and a bit of algebra.
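Concretely: for Bernoulli variables X and Y with P(X=1) = p and P(Y=1) = q, Cov(X, Y) = P(X=1, Y=1) − pq, and since P(X=1, Y=1) can be at most min(p, q), the covariance is bounded above by min(p, q) − pq. One standard way to realise that maximum is a comonotonic coupling, where a single shared uniform drives every seat — a sketch of the idea; whether this matches our sampler exactly is beside the point here:

```python
import random

def sample_max_covariance(win_probs, rng=random):
    # Comonotonic coupling: one shared uniform u drives every seat, so
    # P(X_i = 1, X_j = 1) = min(p_i, p_j), the largest joint probability
    # compatible with the marginals, which in turn attains the covariance
    # upper bound Cov(X_i, X_j) = min(p_i, p_j) - p_i * p_j.
    u = rng.random()
    return [1 if u < p else 0 for p in win_probs]
```

A side effect worth noticing: the samples are monotone in the win probabilities — a low-probability seat never falls to a party while one of its higher-probability seats stands. That is exactly why the maximum-covariance histograms pile up mass at the extremes.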
To get an idea of how covariances affect the seat totals, we plotted histograms of the number of seats won by Labor and the Coalition under two different models. In the first model, there is no covariance at all. This is a common assumption, and one we’ve been using so far: the electorates are treated as independent. Here’s the histogram using betting odds from August 6:
If we assume the seats are independent, we see Labor has practically zero (less than 1%) chance of getting more than 75 seats, and therefore winning the election. What happens if we introduce covariance? We rerun the analysis, but this time set all covariances to be the maximum possible for the given probabilities. Here are the histograms for the new model:
The difference is significant! Labor now has a 34% chance of winning enough seats to form government. The shape of the distribution also changes, with the biggest changes occurring at the extremes. This makes sense: if the seat outcomes are highly correlated, we would expect to get ‘election tsunamis’ happening frequently, where one party wins in a landslide each time.
Interestingly, the mean seat counts barely change at all between the models. Under the maximum covariance model, Labor is expected to win 64 seats to the Coalition’s 83; under the independent electorates model, Labor is expected to win 65 to the Coalition’s 83. This makes sense: by linearity of expectation, the expected seat total is just the sum of the individual seat probabilities, regardless of the covariances between seats, so the small difference here is simulation noise. This is why we’ve felt comfortable reporting these figures in the AFR articles despite assuming independent electorates up until now.
Getting our covariance on
What are the implications for our predictions? For now, we’re comfortable that our mean seat counts are reasonable, whether or not you include covariance in the model. Probabilities of overall election victory, however, are sensitive to covariances between electorates, which we’re unlikely to ever fully know. But we can bound the covariance, and therefore still get some useful information from the electorate betting odds. For instance, for August 6, the probability of Labor victory is somewhere between 0 and 34%. This is a wide range! But it’s still useful information: it tells us that the betting markets believe Labor are likely to lose the election. We’ll be keeping an eye on this over the coming weeks.
K and I are obviously not the only ones doing this prediction caper for the Australian election. Aside from the AFR coverage using our stuff, we enjoy reading Simon Jackman’s stuff in The Guardian. We don’t know Simon, but he’s got a stellar background – professor at Stanford in both the politics and statistics departments. Some of his research uses Bayesian techniques, which is stuff K and I have been reading.
On Aug 2, Simon Jackman had an article in The Guardian about betting markets. He uses effectively the same simulation we do – get implied probabilities from betting markets, run lots of simulations, and get the distribution of outcomes. Based on his 1 August simulation, he predicts the ALP will win 61 seats. This is a few fewer than our model predicts, but in the same ballpark, unlike the national polls, which are effectively predicting the ALP winning around 75 seats.
Why do Simon’s predictions differ from ours? From what we can tell, the only difference is that he is using data from both Centrebet and Sportsbet, and then averaging the implied probabilities. We’ve only got data from Sportsbet. It’s also possible that he isn’t correcting for the longshot bias that we discussed in a previous post. This might mean that in his model, seats that will likely end up in the ALP column are given to ‘other’ candidates. If you look at our predictions before we corrected for the longshot bias, we thought the ALP would win around 60-62 seats too. Anyway, after reading Simon’s article, it’s good to know K and I are not cray. We’re going to enjoy this song and the rest of our weekend. Hope you do too!
By now you have probably seen that the election will be on Sept 7 – four short weeks away! I think this means K and I will have to crank up our general rate of output. We’ve been working on different types of analyses beyond the current stuff, and hopefully we’ll execute them.
Re the election, it’s interesting to see the campaign slogans and tag lines the parties go with. Seems like we’ll be seeing a lot of ‘Hope, reward and opportunity’ via the Real Solutions Plan. And some stuff about a positive vision which is not the same old negativity. Three word slogans always remind me of ‘Peace, Bread, Land’. Not knocking it at all because they can be very effective.
Some exciting news for us is that we’ve agreed with the AFR to conduct our analysis for them on an exclusive basis. That was pretty good to nail down. But if anyone wants to put K’s face on radio, I’m sure everyone will enjoy that. Looking forward to the next four weeks!