Unlike ODIs, the format and basic rules of Twenty20 matches have stayed the same since the very first official match back in 2003. The first 6 overs in each innings are Powerplay overs and bowlers are limited to 4 overs each. This, coupled with the condensed nature of the format, means that certain situations are repeated many times. Naturally, this allows us to estimate the probability of an event by looking back at how many times a given situation arose and what the outcome was. For example, how often do teams defend 10 runs off the last over? Or what do teams need to score in the Powerplay to stand a chance of chasing 200?

In this article, I develop some simple prediction models using a few machine learning algorithms. I use only a few factors to begin with, specifically the current score, number of runs required and wickets, to predict the final score in the 1st innings and the winner in the 2nd innings. The dataset consists of 877,319 balls from 3,700 T20 matches where there was an outright winner and no reduction in overs in either innings.

## Predicting 1st innings scores

Predicting the eventual score in the 1st innings is a regression problem, as the outcome is a continuous variable. I consider the 455,301 1st innings balls in the dataset to be independent of one another.

The table above shows a random sample of the data. The first three columns are the predictor variables and the last column is the target variable i.e. what we’re trying to predict.

The first model we can look at is **Linear Regression**. Here we are trying to fit a line (strictly, a plane) through our multi-dimensional data such that the sum of the squared vertical distances between the points and the line is as small as possible. In two dimensions this is simply the line of best fit that we are all familiar with.

Using the scikit-learn machine learning package in Python, we can build our model very easily as below.
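A minimal sketch of what fitting such a model might look like. The data here is synthetic and stands in for the ball-by-ball dataset, and the variable names and the relationship used to generate `final_score` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the ball-by-ball data (NOT the article's dataset).
rng = np.random.default_rng(0)
n = 2000
balls_remaining = rng.integers(0, 121, size=n)
wickets = rng.integers(0, 10, size=n)
current_score = np.round(
    (120 - balls_remaining) * 1.3 + rng.normal(0, 8, size=n)
).clip(0)
# Hypothetical relationship, chosen only so the example has something to fit.
final_score = (
    current_score + 1.2 * balls_remaining - 4 * wickets + rng.normal(0, 10, size=n)
)

# Feature matrix: one row per ball, columns in a fixed order.
X = np.column_stack([current_score, balls_remaining, wickets])
y = final_score

model = LinearRegression()
model.fit(X, y)

print("coefficients:", model.coef_)    # one per feature, same order as X's columns
print("intercept:", model.intercept_)
print("R^2:", model.score(X, y))       # R^2 measured on the training data
```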

Once fitted, the model gives the coefficients of the linear regression equation as follows:

`final_score = 1.080*current_score + 1.16*balls_remaining - 4.04*wickets + 17.1`

The R^{2} value is 0.547, which means about 55% of the variability in the data is explained by this model. The model tells us that the average score from the very beginning of the innings is about 156 (1.16 × 120 + 17.1). As the innings progresses, we can feed in more information regarding the current score and wickets to get a more refined prediction. The model also tells us that each wicket is worth about 4 runs to the bowling team. Being a linear model, it doesn’t tell us at what stage of the innings losing wickets is most costly. It’s also not particularly self-consistent – it can give predictions that are lower than the current score in extreme cases.
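To make the arithmetic concrete, the fitted equation can be evaluated directly. The helper name `predict_final_score` and the 200 for 9 scenario are hypothetical, picked only to illustrate the start-of-innings average and the kind of extreme case where the prediction falls below the current score.

```python
def predict_final_score(current_score, balls_remaining, wickets):
    """Evaluate the linear regression equation quoted in the text."""
    return 1.080 * current_score + 1.16 * balls_remaining - 4.04 * wickets + 17.1


# Start of the innings: 0 runs, 120 balls remaining, no wickets down.
start = predict_final_score(0, 120, 0)
print(round(start, 1))  # 156.3

# Hypothetical extreme case: 200 for 9 with no balls remaining.
late = predict_final_score(200, 0, 9)
print(round(late, 2))   # 196.74, lower than the 200 already on the board
```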

**KNeighborsRegressor** is an algorithm that I used when developing the Expected Runs and Wickets models. For any given ball, the algorithm searches for a specified number of the most similar balls and from those returns the average final score. The Python implementation is as below.

I found that 26 neighbours was the optimal number, giving the lowest error. Fewer than 26 and you’ll suffer from small sample sizes, whilst any more and the neighbours start to become a bit too dissimilar. This method had an R^{2} value of 0.580, which is slightly better than what we had for linear regression. The disadvantage, however, is that we don’t get an interpretable equation or rule – we merely input the details of the ball and it outputs a prediction.
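A rough sketch of how a nearest-neighbours model and a search over the number of neighbours might be set up. The data is again synthetic, and the candidate values of `k` and the use of 5-fold cross-validation are assumptions; the text does not say how the figure of 26 was arrived at.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the ball-by-ball data (NOT the article's dataset).
rng = np.random.default_rng(1)
n = 1500
balls_remaining = rng.integers(0, 121, size=n)
wickets = rng.integers(0, 10, size=n)
current_score = np.round(
    (120 - balls_remaining) * 1.3 + rng.normal(0, 8, size=n)
).clip(0)
final_score = (
    current_score + 1.2 * balls_remaining - 4 * wickets + rng.normal(0, 10, size=n)
)
X = np.column_stack([current_score, wickets, balls_remaining])
y = final_score

# Compare a few neighbour counts by 5-fold cross-validated R^2.
scores = {}
for k in (5, 26, 100):
    candidate = KNeighborsRegressor(n_neighbors=k)
    scores[k] = cross_val_score(candidate, X, y, cv=5, scoring="r2").mean()
    print(k, round(scores[k], 3))

# Fit with the neighbour count reported in the text and predict one ball:
# 49 for 2 with 77 balls remaining (i.e. after 7.1 overs).
knn = KNeighborsRegressor(n_neighbors=26).fit(X, y)
query = np.array([[49, 2, 77]])
print(knn.predict(query))  # mean final score of the 26 most similar balls
```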

A similar algorithm is **RadiusNeighborsRegressor**. Instead of searching for a fixed number of nearest neighbours, this algorithm finds all neighbours which are within a certain distance.

To give some perspective, the most popular unique score after the 6th over is 49 for 2 from 7.1 overs, which has occurred 62 times in total. If we take the average final score from those 62 innings we can be reasonably confident that this is a good prediction. If we loosen our requirements to allow any of current score, wickets and overs to be off by a maximum value of 1 (the radius I chose), e.g. 50 for 1 after 7.2 overs, we get 402 similar instances. In fact, of the 55,781 unique combinations of current score, wickets and balls remaining, 44,536 have at least 10 close neighbours.
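The "off by at most 1 in any feature" rule can be sketched with scikit-learn as follows. Using the Chebyshev metric (the largest single-feature difference) is my assumption about how to express that rule; the tiny dataset is made up, with columns for current score, wickets and balls remaining.

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

# Made-up sample: columns are current_score, wickets, balls_remaining.
X = np.array([
    [49, 2, 77],
    [49, 2, 77],
    [50, 1, 76],   # every feature off by exactly 1: still a neighbour
    [48, 3, 78],
    [60, 2, 77],   # score differs by 11: outside the radius
    [49, 5, 77],   # wickets differ by 3: outside the radius
])
y = np.array([150.0, 160.0, 170.0, 140.0, 190.0, 120.0])  # made-up final scores

# Chebyshev distance = largest absolute difference across the features,
# so radius=1 admits balls where no feature is off by more than 1.
model = RadiusNeighborsRegressor(radius=1.0, metric="chebyshev")
model.fit(X, y)

query = np.array([[49, 2, 77]])
neighbour_idx = model.radius_neighbors(query, return_distance=False)[0]
prediction = model.predict(query)[0]

print(sorted(neighbour_idx.tolist()))  # [0, 1, 2, 3]
print(prediction)                      # 155.0, the mean of the four neighbours
```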

This method is ever so slightly better with an R^{2} value of 0.607. It does, however, struggle to predict the more extreme cases with much accuracy. A score like 187 for 2 with 22 balls to go has only 4 close neighbours. *RadiusNeighborsRegressor* will take the average final score of those 4, while *KNeighborsRegressor* will find 26 quite dissimilar neighbours. Both methods will likely have high margins of error in this case.

The graph above shows how confident our predictions become as we progress through the innings using the *RadiusNeighborsRegressor* method. Early on, we have very little information to base our predictions on – the best we can do is give the historical average final score. We become 80% confident at around the 13th over and 95% confident with about 2 overs to go. It’s true that a lot can happen in the last couple of overs, but over a large number of games any differences tend to average out.

The final algorithm we’ll take a look at is **RandomForestRegressor**. This is an example of an ensemble method, which takes a number of slightly different prediction models and combines them in such a way that the overall performance is better than any one model on its own.

The algorithm combines 1,000 models or estimators, specifically decision trees, and considers all three features when looking for the best split. It had an R^{2} value of 0.607 – very similar to the *RadiusNeighborsRegressor* algorithm. However, this algorithm has the advantage that it can tell us about feature importance – which of the attributes is most significant in predicting the final score.
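A sketch of a random forest configured along these lines, fitted to synthetic data. The feature names and the data-generating relationship are assumptions, so the importances it prints say nothing about real matches.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the ball-by-ball data (NOT the article's dataset).
rng = np.random.default_rng(2)
n = 400
balls_remaining = rng.integers(0, 121, size=n)
wickets = rng.integers(0, 10, size=n)
current_score = np.round(
    (120 - balls_remaining) * 1.3 + rng.normal(0, 8, size=n)
).clip(0)
final_score = (
    current_score + 1.2 * balls_remaining - 4 * wickets + rng.normal(0, 10, size=n)
)

feature_names = ["current_score", "wickets", "balls_remaining"]
X = np.column_stack([current_score, wickets, balls_remaining])
y = final_score

forest = RandomForestRegressor(
    n_estimators=1000,   # number of decision trees, as described in the text
    max_features=None,   # consider all three features at every split
    random_state=0,      # fixed seed so the sketch is repeatable
)
forest.fit(X, y)

# Importances are normalised to sum to 1 across the features.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```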

The table above shows that the current score is the best predictor of the final score. It is about as informative as the wickets and balls remaining values combined. This of course makes sense – good luck predicting the final score from just the number of wickets and balls remaining. It also perhaps confirms the long-suspected notion that wickets in hand are overvalued. You’ll often hear that a team that scores 180-3 from 20 overs has left a fair few runs on the table.

We can take a look at how the predictions from the random forests model perform on a real game – an IPL match earlier this season between Gujarat Lions and Kolkata Knight Riders.

The graph above shows the model’s predictions for the Lions’ first innings. A rolling mean of 6 balls has been applied to smooth out the plot. The model’s predictions never vary from the final score by more than 8 runs. In contrast, the run rate projected score is a lot more erratic and only briefly predicts anything close to the final score. The model barely dropped below 180 even after the Lions’ slowdown between the 10th and 16th overs. It had faith in their ability to accelerate in the latter stage of the innings, something which the run rate projection does not take into account.
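For comparison, the run rate projection and the 6-ball smoothing can be sketched in a few lines of plain Python. The function names are hypothetical, the trailing (rather than centred) window is an assumption, and the ball-by-ball scores are made up.

```python
def run_rate_projection(current_score, balls_bowled, total_balls=120):
    """Project the final score by assuming the current run rate continues."""
    if balls_bowled == 0:
        return None  # no run rate exists before the first ball
    return current_score / balls_bowled * total_balls


def rolling_mean(values, window=6):
    """Trailing mean: each value averaged with up to window - 1 values before it."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up cumulative scores after each of the first 8 balls (illustration only).
cumulative = [0, 4, 5, 5, 11, 12, 12, 16]
projections = [
    run_rate_projection(score, ball)
    for ball, score in enumerate(cumulative, start=1)
]
smoothed = rolling_mean(projections, window=6)

for ball, (raw, smooth) in enumerate(zip(projections, smoothed), start=1):
    print(ball, round(raw, 1), round(smooth, 1))
```

The raw projection swings wildly early on because a single boundary changes the run rate a lot when few balls have been bowled, which is the erratic behaviour described above.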

To sum up, we’ve seen how we can take some simple facts about the current state of the game and look back in the vast array of ball-by-ball data to generate fairly accurate predictions of the final score of the 1st innings. In the next article I will be using a similar suite of machine learning algorithms to predict the winner of the match during the 2nd innings.