Does covariance matter?

Leng and I had a really interesting exchange with Simon Jackman on Twitter yesterday. Me and my whopping eight followers (hi mum) would probably agree that I suck at Twitter. It became difficult to continue the discussion within the 140-character limit, so Leng and I thought we’d try to outline it here.

The discussion was about the importance or otherwise of modelling covariance between seats, in light of the extremely low probability of Labor victory (< 1%) implied by the seat-level betting data, when modelled assuming seats are independent. Leng and I have put a bit of effort into including this covariance in our model, and we think the very low implied probabilities of Labor victory are due to ignoring this covariance. Simon suggested, though, that the Efficient Market Hypothesis (EMH), were it to hold, implied the seats should be treated as independent.

Here’s what we think Simon’s saying:
1. Bettors can bet on any seat.
2. If they think that seats are NOT independent but move in certain directions together, then this will be reflected in how they bet.
3. How they bet affects the prices of each seat.
4. Assuming the EMH holds, the price of each seat then accurately represents the probability of a party winning that particular seat.
5. Built into that probability is any covariance that bettors think exists. Therefore, it is OK to assume independence because each seat’s probability already builds in any covariance.

We agree with 1-4, but disagree with 5. For a correlated multivariate Bernoulli distribution, the probabilities p_{1}, ..., p_{150} don’t uniquely define the covariances (although they do constrain the allowable choices of covariances). So even if the EMH holds, and p_{1}, ..., p_{150} are known exactly, these probabilities still imply a range for each covariance, rather than a single value. If the EMH holds, we would expect this range to include the true covariance, but we don’t have enough information to define it exactly.
With a little algebra, you can show that for any set of seat probabilities, choosing all covariances to be zero (i.e., consistent with assuming independence) is always within the valid range of choices for covariances. But this doesn’t mean it’s the right choice. Indeed, there are good reasons to think that there are significant correlations between many seats. And if we choose valid non-zero covariances, the seat-level betting data imply a much more realistic probability of Labor victory (as much as 34% using odds from August 6) that is in better agreement with national-level betting odds.

We hope we haven’t misrepresented Simon’s views about the role of EMH in making the independence assumption. Hopefully we can clear this up and better understand the proper role of covariance in making predictions using betting markets.
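The point that seat probabilities constrain the covariances without pinning them down can be checked for a single pair of seats. Here is a minimal sketch, not our model code: the function name `covariance_range` and the two probabilities are made up for illustration. It uses the standard bounds on the joint probability of two Bernoulli variables.

```python
def covariance_range(p_i, p_j):
    """Return (lowest, highest) covariance two Bernoulli variables can have,
    given only their marginal probabilities p_i and p_j."""
    # P(both seats won) is bounded below and above by the marginals alone.
    joint_min = max(0.0, p_i + p_j - 1.0)
    joint_max = min(p_i, p_j)
    # Cov(X_i, X_j) = P(both won) - p_i * p_j
    return joint_min - p_i * p_j, joint_max - p_i * p_j


# Hypothetical Labor win probabilities for two seats (not real data).
low, high = covariance_range(0.3, 0.4)
print(low, high)

# Zero covariance (independence) is always allowed, but so is a whole range.
assert low <= 0.0 <= high
```

The same marginals are compatible with every covariance between `low` and `high`, which is why knowing the seat probabilities exactly still leaves the covariance open.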

Election tsunamis

My buddy Davis is an actual, real statistician, and a damn fine one at that. Leng and I want to be real statisticians one day when we grow up, so we were really pleased to get some feedback on the blog from him.

One thing he pointed out was that the PMFs in the previous post for the ‘maximum covariance’ case look a little funky. Here they are again:

Funky PMFs from maximum covariance model

These don’t look like distributions we’d normally see, like a normal distribution. They’re truncated at seemingly arbitrary values (a minimum bound at 25 seats for Labor and 48 seats for the Coalition) and reach their peaks at the edges. And they suggest that it’s roughly twice as likely Labor will only win 25 seats compared to a more believable 60ish, which doesn’t sound right.

Is something wrong here? We don’t think so.

Winning big

The maximum covariance model doesn’t represent the real world. It’s only meant to be an upper bound, an outer limit of what could technically be possible given the probabilities we inferred on August 6. In the bizarre maximum covariance model universe, election outcomes between seats are very strongly correlated. This means that when Labor win, they win big, and similarly for the Coalition.

Consider an extreme case, where all 150 seats have 100% correlation with each other. Then there are only two possible outcomes: Labor wins 150 seats, or the Coalition wins 150 seats. A brutal environment for career politicians! Here, the PMFs would be completely concentrated at the edges. In practice, the probabilities inferred from the betting odds constrain the correlations between many seats to be much less than 100%, resulting in some mass between the two extremes.

What causes the arbitrary truncation points? We make adjustments to the inferred probabilities to counter longshot bias: seats with less than a 0.1 probability of victory for Labor are rounded down to 0 (and similarly for all other parties). On August 6, this resulted in 25 seats with a 100% probability of Labor victory and 48 seats with a 100% probability of Coalition victory. So the bias adjustment explains the truncation points.
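To make the rounding rule concrete, here is a rough sketch for a single seat. The function name, the example numbers, and in particular the step that rescales the surviving probabilities so they sum to 1 are assumptions for illustration; the text above only says that probabilities under 0.1 are rounded down to 0.

```python
def adjust_for_longshot_bias(probs, threshold=0.1):
    """Round probabilities below `threshold` down to 0 for one seat.

    `probs` maps party name -> win probability for that seat.
    Assumption: the remaining probabilities are rescaled to sum to 1.
    """
    kept = {party: (p if p >= threshold else 0.0) for party, p in probs.items()}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("every party fell below the threshold")
    return {party: p / total for party, p in kept.items()}


# Hypothetical seat where only Labor clears the threshold (not real data).
seat = {"Labor": 0.92, "Coalition": 0.05, "Other": 0.03}
print(adjust_for_longshot_bias(seat))
```

In a seat like this one, every party except Labor is rounded to 0, so Labor ends up at 100%. Seats like that are always won in every simulation, which is where a hard lower limit on the seat count would come from.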

The aim of this whole exercise has been to build covariance into the model. We don’t know what the true covariances are between seats, so instead we look at upper and lower bounds on the covariances (these bounds are set by the probabilities inferred from the betting odds). The main takeaway is that the overall probability of a Labor victory increases dramatically when you include covariance between seats; but even if you include unrealistically high covariance, the betting markets still believe Labor are likely to lose the election.

Covariance between electorates

Leng and I have been working on a few improvements to our model that we wanted to share.

The story so far

We’ve been converting betting odds for each seat to probabilities of victory for each party. This is straightforward, and many others (such as the Guardian’s Simon Jackman) are doing this. Using these probabilities to estimate the most likely number of seats each party will win (and ultimately, who will win the election) is more subtle and difficult. In a previous post, we talked about a common way people get this wrong.

So far, our approach has been to treat electorates as independent of each other (this makes the number of seats won by each party a Poisson-Binomial random variable). But electorates aren’t independent. All politics might be local in some parts of the world, but in Australia people vote not just for their local representative, but for who they want to be Prime Minister.

For example, people in seats A and B might consider both local and national issues in choosing who to vote for. If they only consider local issues (e.g., local planning decisions), then the electorates are independent. If they only consider national issues (e.g., federal political issues such as border protection), the electorate results will be highly correlated. The truth is somewhere in between. But if Kevin Rudd was found to be secretly from New Zealand tomorrow, we think there would probably be a national swing against him in horror and disgust, and this national swing can’t be modelled correctly if we assume seats are independent.
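For anyone who hasn’t seen the odds-to-probabilities step, here is one common way to do it, sketched under some assumptions: the odds are decimal odds, the bookmaker’s margin is removed by simple proportional rescaling, and the function name and numbers are made up. It is not necessarily the exact conversion we use.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal betting odds for one seat into win probabilities.

    `decimal_odds` maps party name -> decimal odds (payout per $1 staked).
    Assumption: the bookmaker's margin is removed by rescaling to sum to 1.
    """
    for party, odds in decimal_odds.items():
        if odds <= 1.0:
            raise ValueError(f"decimal odds for {party} must exceed 1.0")
    # Raw implied probability is the reciprocal of the decimal odds.
    raw = {party: 1.0 / odds for party, odds in decimal_odds.items()}
    # Raw values sum to more than 1 because of the bookmaker's margin.
    overround = sum(raw.values())
    return {party: r / overround for party, r in raw.items()}


# Hypothetical odds for a single seat (not real data).
odds = {"Labor": 1.50, "Coalition": 2.60, "Other": 21.00}
print(implied_probabilities(odds))
```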

"We're all individuals!"

How important is this covariance? Let’s take a look.

Updating the model

First, we needed to update our model. The problems here are i) including covariance in your model makes it much more complicated and ii) we don’t know what the actual covariances are between seats. We sucked it up and tackled the first part, so we can now include covariances in our model: our simulations are now samples from a correlated multivariate Bernoulli distribution.

The second part is harder: it’s not clear how to estimate covariances between seats. Instead, we’ve calculated an upper bound on the covariance between each pair of seats. This maximum covariance is set by the probabilities inferred from the betting odds. You can derive the upper bound using the definition of covariance for a Bernoulli random variable and a bit of algebra.
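For the curious, the algebra for one pair of seats goes roughly like this. The notation is ours for this sketch: X_i is 1 if the party wins seat i and 0 otherwise, and p_i is the probability that it does.

```latex
\begin{align*}
\operatorname{Cov}(X_i, X_j)
  &= \mathbb{E}[X_i X_j] - \mathbb{E}[X_i]\,\mathbb{E}[X_j]
   = P(X_i = 1, X_j = 1) - p_i p_j, \\
P(X_i = 1, X_j = 1) &\le \min(p_i, p_j), \\
\operatorname{Cov}(X_i, X_j) &\le \min(p_i, p_j) - p_i p_j .
\end{align*}
```

The middle line just says that winning both seats can’t be more likely than winning either one of them. This is a pairwise bound; it doesn’t by itself say how to build a joint distribution over all 150 seats.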

To get an idea of how covariances affect the seat totals, we plotted histograms of the number of seats won by Labor and the Coalition for two different models. In the first model, there is no covariance at all. This is a common assumption, and one we’ve been using so far: the electorates are treated as independent. Here’s the histogram using betting odds from August 6:

Independent electorates model
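Under independence, the seat-count distribution for one party is Poisson-Binomial and can be computed exactly instead of simulated. Below is a minimal sketch that adds one seat at a time; the function names and the 150 seat probabilities are invented for illustration and are not the August 6 data.

```python
def poisson_binomial_pmf(probs):
    """Exact PMF of the number of seats won when seats are independent.

    `probs` is a list of per-seat win probabilities for one party.
    Returns a list where entry k is P(party wins exactly k seats).
    """
    pmf = [1.0]  # with no seats counted yet, P(0 seats) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # party loses this seat
            nxt[k + 1] += mass * p       # party wins this seat
        pmf = nxt
    return pmf


def prob_at_least(pmf, seats):
    """P(party wins at least `seats` seats)."""
    return sum(pmf[seats:])


# Hypothetical 150-seat map: 40 safe seats, 50 marginals, 60 long shots.
labor_probs = [0.95] * 40 + [0.40] * 50 + [0.05] * 60
pmf = poisson_binomial_pmf(labor_probs)
print(prob_at_least(pmf, 76))  # chance of a majority (76 of 150 seats)
```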

If we assume the seats are independent, we see Labor has practically zero (less than 1%) chance of getting more than 75 seats, and therefore winning the election. What happens if we introduce covariance? We rerun the analysis, but this time set all covariances to be the maximum possible for the given probabilities. Here are the histograms for the new model:

Maximum covariance model

The difference is significant! Labor now has a 34% chance of winning enough seats to form government. The shape of the distribution also changes, with the biggest changes occurring at the extremes. This makes sense: if the seat outcomes are highly correlated, we would expect to get ‘election tsunamis’ happening frequently, where one party wins in a landslide each time.
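One simple way to see how a maximum covariance world behaves is to drive every seat from a single shared random draw, so that a seat is won whenever the draw falls below that seat’s probability. For any pair of seats this makes the chance of winning both equal to the smaller of the two probabilities, which is the pairwise upper bound. This is only an illustrative construction for one party’s win/lose outcome per seat, with made-up function names and probabilities; it is not necessarily how our simulations are built.

```python
import random


def simulate_max_covariance(probs, n_sims=10000, seed=0):
    """Simulate seat totals when all seats share one uniform draw.

    Seat i is won when the shared draw u is below probs[i], so each seat
    keeps its own win probability but seats move together as much as possible.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        u = rng.random()  # one draw per simulated election, shared by all seats
        totals.append(sum(1 for p in probs if u < p))
    return totals


def prob_at_least_shared_draw(probs, seats):
    """Exact P(at least `seats` seats won) under the shared-draw construction.

    The party reaches `seats` wins exactly when u is below the seats-th
    largest probability, so the answer is that probability.
    """
    return sorted(probs, reverse=True)[seats - 1]


# Hypothetical 150-seat map (not real data).
labor_probs = [1.0] * 25 + [0.60] * 40 + [0.30] * 40 + [0.0] * 45
totals = simulate_max_covariance(labor_probs)
print(min(totals), max(totals))
print(prob_at_least_shared_draw(labor_probs, 76))
```

Seats with probability 1 are won in every simulated election, so the totals never drop below the number of such seats, and the mass piles up at a few values instead of forming a bell shape.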

Interestingly, the mean seat counts barely change at all between models. Under the maximum covariance model, Labor is expected to win 64 to the Coalition’s 83 seats; under the independent electorates model, Labor is 65 to the Coalition’s 83. This suggests that assuming independence between seats shouldn’t change the expected number of seats much, which is why we’ve felt comfortable reporting these figures in the AFR articles despite assuming independent electorates up until now.
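There’s a textbook reason the means line up. Writing S for a party’s seat total and X_i for the win/lose indicator in seat i (our notation for this sketch), linearity of expectation holds whether or not the seats are correlated, while the variance picks up every covariance term:

```latex
\begin{align*}
S &= \sum_{i=1}^{150} X_i, \\
\mathbb{E}[S] &= \sum_{i=1}^{150} p_i, \\
\operatorname{Var}(S) &= \sum_{i=1}^{150} p_i (1 - p_i)
  + 2 \sum_{i < j} \operatorname{Cov}(X_i, X_j).
\end{align*}
```

So if both models are fed the same seat probabilities, covariance changes the spread of the seat totals but not their expected value, and any small gap between simulated means would be down to sampling noise or rounding.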

Getting our covariance on

What are the implications for our predictions? For now, we’re comfortable that our mean seat counts are reasonable, whether or not you include covariance in the model. Probabilities of overall election victory, however, are sensitive to covariances between electorates, which we’re unlikely to ever fully know. But we can bound the covariance, and therefore still get some useful information from the electorate betting odds. For instance, for August 6, the probability of Labor victory is somewhere between 0 and 34%. This is a wide range! But it’s still useful information: it tells us that the betting markets believe Labor are likely to lose the election. We’ll be keeping an eye on this over the coming weeks.

Not cray!

K and I are obviously not the only ones doing this prediction caper for the Australian election. Aside from the AFR coverage using our stuff, we enjoy reading Simon Jackman’s stuff in The Guardian. We don’t know Simon but he’s got a stellar background – professor at Stanford in both the politics and statistics departments. Some of his research uses Bayesian techniques, which is stuff K and I have been reading about.

On Aug 2, Simon Jackman had an article in The Guardian about betting markets. He uses effectively the same simulation we do – get implied probabilities from betting markets, run lots of simulations and get the distribution of outcomes. Based on his 1 August simulation, he predicts the ALP will win 61 seats. This is a few fewer than our model predicts but in the same ballpark, unlike the national polls, which are effectively predicting the ALP winning around 75 seats.

Why do Simon’s predictions differ from ours? From what we can tell, the only difference is that he is using data from both Centrebet and Sportsbet, and then averaging the implied probabilities. We’ve only got data from Sportsbet. It may also be possible that he is not correcting for the longshot bias that we discussed in a previous post. This might mean that in his model, seats that will likely end up in the ALP column are given to ‘other’ candidates. If you look at our predictions before we corrected for the longshot bias, we had thought the ALP would win around 60-62 seats too. Anyway, after reading Simon’s article, it’s good to know me and K are not cray. We’re going to enjoy this song and the rest of our weekend. Hope you do too!

Election Date & AFR exclusive

By now you have probably seen that the election will be on Sept 7 – four short weeks away! I think this means me and K will have to crank up our general rate of output. We’ve been looking at different types of analysis beyond the current stuff, and hopefully we’ll get to execute them.

Re the election, it’s interesting to see the campaign slogans and tag lines the parties go with. Seems like we’ll be seeing a lot of ‘Hope, reward and opportunity’ via the Real Solutions Plan. And some stuff about a positive vision which is not the same old negativity. Three word slogans always remind me of ‘Peace, Bread, Land’. Not knocking it at all because they can be very effective.

Some exciting news for us is that we’ve agreed with the AFR to conduct our analysis for them on an exclusive basis. That was pretty good to nail down. But if anyone wants to put K’s face on radio, I’m sure everyone will enjoy that. Looking forward to the next four weeks!