Chris Gayle – a statistical analysis

An excellent recent article in The Independent described how, in last year’s World T20 final against the West Indies, England decided to open the bowling with Joe Root because Chris Gayle had ‘a poor record against off-spin’.  Although the idea worked, it made me think that surely these plans are justified more rigorously than by simply seeing which type of bowler dismisses him most often.  Root is a right-arm off-spin bowler just like, say, India’s Ravichandran Ashwin, but over a long period of time Root would probably come off worse against Gayle.  I’m sure England and other teams develop more detailed bowling plans that include what kinds of lines and lengths to bowl to particular batsmen, when to vary their pace and where exactly to position fielders, among other things.  All of this can be derived from data: what kind of deliveries do batsmen generally get out to, what areas of the ground do they target, and so on.

Chris Gayle in T20’s

In this post, I want to use the metrics I’ve been developing and some visualisations to build up a statistical profile of a particular batsman in T20’s, in this case Chris Gayle himself.  It goes without saying that Gayle has a phenomenal record in this format: he is closing in on 10,000 runs from over 270 innings, with an average of over 40 and a strike rate of 150.  The plot below shows how Gayle’s average has fluctuated over the course of his T20 career.

gayle_average

Early on in his career, his average dipped to about 30 before there was a resurgence in his batting from his 50th match until his 100th.  He’s since been averaging in the mid 40’s.
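
A running career average like the one plotted above can be sketched in a few lines of Python.  This is only an illustration under the usual definition of a batting average (runs divided by dismissals); the function name `running_average` and the sample innings are made up, not taken from Gayle’s record.

```python
def running_average(innings):
    """Career batting average after each innings.

    innings: list of (runs, was_dismissed) tuples in chronological order.
    The average is undefined (None) until the first dismissal.
    """
    total_runs = 0
    dismissals = 0
    averages = []
    for runs, was_dismissed in innings:
        total_runs += runs
        if was_dismissed:
            dismissals += 1
        averages.append(total_runs / dismissals if dismissals else None)
    return averages


# Hypothetical innings, for illustration only
sample = [(10, True), (50, False), (0, True), (30, True)]
print(running_average(sample))  # [10.0, 60.0, 30.0, 30.0]
```

Not-out innings add runs without adding a dismissal, which is why the average can jump sharply early in a career.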

The histogram below shows how his career scores are distributed, split up into 5 run bins.

gayle_hist

It’s obvious that we’re not going to see a devastating Gayle innings in every match: he has made 78 scores of 50 or more in 273 innings.  In fact, it’s more likely that we’ll see a failure from Gayle, as he has been dismissed for single figures 84 times in his career.  This is not entirely surprising for an opener in T20’s, but it does show it’s not impossible to dismiss him cheaply.
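
For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch of grouping scores into 5 run bins and counting the 50+ and single-figure scores.  The name `bin_scores` and the scores themselves are hypothetical, and for simplicity the single-figure count ignores whether the innings ended in a dismissal.

```python
from collections import Counter


def bin_scores(scores, width=5):
    """Count innings per bin; each bin is labelled by its lower edge."""
    return Counter((score // width) * width for score in scores)


# Hypothetical scores, for illustration only
scores = [0, 4, 5, 9, 12, 50, 51, 100]
bins = bin_scores(scores)
fifty_plus = sum(score >= 50 for score in scores)
single_figures = sum(score < 10 for score in scores)  # ignores not-outs

print(sorted(bins.items()))
print(fifty_plus, single_figures)
```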

We can break this down further and see how he performs at different stages of the innings.  The plot below shows the total number of runs he’s scored in each over of a T20 innings.

overs.png

It’s evident that the bulk of his runs come in the Powerplay overs before dropping off through the rest of the overs.  This is due to the fact that it becomes more and more unlikely that he is actually still out in the middle in the later overs.  To account for this we can plot his average per over.

ave_over.png
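
The per-over totals and averages can be sketched as below.  I am assuming here that the "average" in an over means runs scored in that over divided by dismissals in that over; the function name `per_over_summary` and the ball records are invented for illustration.

```python
from collections import defaultdict


def per_over_summary(balls):
    """Total runs and average (runs per dismissal) for each over.

    balls: iterable of (over_number, runs_off_bat, was_dismissed).
    The average is None for overs with no dismissals.
    """
    runs = defaultdict(int)
    outs = defaultdict(int)
    for over, runs_off_bat, was_dismissed in balls:
        runs[over] += runs_off_bat
        outs[over] += int(was_dismissed)
    return {
        over: {
            "runs": runs[over],
            "average": runs[over] / outs[over] if outs[over] else None,
        }
        for over in sorted(runs)
    }


# Hypothetical ball-by-ball records, for illustration only
balls = [(1, 0, False), (1, 1, False), (3, 4, False), (3, 0, True), (3, 6, False)]
summary = per_over_summary(balls)
print(summary)
```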

Apart from the 11th over and the last 2 overs, Gayle’s lowest average comes in the 3rd over.  In fact, this is the over where he is most frequently dismissed – a total of 28 times.  It’s peculiar to see that he averages nearly 100 in the 10th over which then suddenly drops to about 30 in the next over.  I’m not really sure why.

sr_over

His strike rate generally increases through the course of the innings after the initial blitz in the Powerplay.  The 1st over is the only time when his strike rate is below 100, suggesting he tends to be quite circumspect at the start of his innings.

in.png

The graph above shows the average number of balls it takes Gayle to reach a particular score.  It illustrates his incredible ability to accelerate during an innings.  It takes him on average 10 balls to score his first 10 runs.  After that his 50 comes up in about 32 balls and if he gets to a century, it usually takes just over 50 balls.  Of course, sample sizes get quite small at that point so the graph becomes a little more erratic.
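
One way to compute a "balls to reach a score" curve is sketched below.  It assumes each innings is available as a list of runs per ball faced, and it averages only over the innings in which the target was actually reached, which is why the sample shrinks for big scores.  The name `balls_to_reach` and the data are hypothetical.

```python
def balls_to_reach(innings_balls, target):
    """Average number of balls taken to reach `target` runs.

    innings_balls: list of innings, each a list of runs scored per ball.
    Innings that never reach the target are left out of the average.
    Returns None if no innings reaches the target.
    """
    counts = []
    for balls in innings_balls:
        total = 0
        for ball_number, runs in enumerate(balls, start=1):
            total += runs
            if total >= target:
                counts.append(ball_number)
                break
    return sum(counts) / len(counts) if counts else None


# Hypothetical innings, for illustration only
innings = [[1, 0, 4, 6, 0, 1], [0, 0, 1, 1], [6, 4, 0, 2]]
print(balls_to_reach(innings, 10))  # 3.0
print(balls_to_reach(innings, 50))  # None
```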

Gayle’s xR and xW

Now, we can look at which types of bowler Gayle performs best and worst against.  This article describes exactly this quite well.  From a sample of 3 IPL seasons, it shows that Gayle thrives against left-arm slow bowling, striking at nearly 3 runs a ball.  However, his run-scoring is somewhat restricted by right-arm fast and off-spin bowling, going at just over a run a ball.

We can go further and see if there is any variation in Gayle’s expected runs and wickets against both spinners and seamers.  I use a dataset made up of the last 5 IPL seasons, consisting of 70,217 balls that produced 89,329 runs and 3,612 wickets.  Gayle’s IPL average and strike rate are not significantly off his career figures, so using just this data is sufficient for our analysis.  As before, I train a machine learning algorithm on the data to return an xR and xW figure for every ball.  The table below summarises the results.

seamspin

So Gayle over-performs against both seam and spin bowlers compared to the average batsman.  Although he performs better against spinners than seamers, he is dismissed by spinners more often than expected.  This suggests that he takes a more hit and miss approach against spinners but balances risk and reward well against the seamers.
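
The comparison above boils down to summing actual and expected figures per bowling type and taking ratios.  A minimal sketch follows, assuming each ball already carries an xR and xW value from the model; the field names (`type`, `runs`, `xr`, `wicket`, `xw`) and the numbers are hypothetical.

```python
from collections import defaultdict


def performance_by_type(balls):
    """runs/xR and wickets/xW for each bowling type."""
    totals = defaultdict(lambda: {"runs": 0, "xr": 0.0, "wickets": 0, "xw": 0.0})
    for ball in balls:
        bucket = totals[ball["type"]]
        bucket["runs"] += ball["runs"]
        bucket["xr"] += ball["xr"]
        bucket["wickets"] += ball["wicket"]
        bucket["xw"] += ball["xw"]
    return {
        bowling_type: {
            "runs_per_xr": t["runs"] / t["xr"],
            "wickets_per_xw": t["wickets"] / t["xw"],
        }
        for bowling_type, t in totals.items()
    }


# Hypothetical balls, for illustration only
balls = [
    {"type": "spin", "runs": 6, "xr": 1.5, "wicket": 0, "xw": 0.05},
    {"type": "spin", "runs": 0, "xr": 1.0, "wicket": 1, "xw": 0.05},
    {"type": "seam", "runs": 4, "xr": 1.25, "wicket": 0, "xw": 0.04},
    {"type": "seam", "runs": 1, "xr": 1.25, "wicket": 0, "xw": 0.04},
]
result = performance_by_type(balls)
print(result)
```

A runs/xR above 1 means the batsman scores more than the average batsman would from the same deliveries; a wickets/xW above 1 means he is dismissed more often than expected.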

Where to bowl (and not bowl) to Gayle

gayle_w.png

The beehive plot above shows Gayle’s dismissals in the IPL since 2012, split by seam and spin bowlers.  Spinners tend to dismiss him by bowling fairly straight while seamers tend to go wide of off stump or very short.  Of course this doesn’t tell you the full story.  We can also take a look at where to bowl to keep Gayle quiet.

seam_0

spin_0

The heat maps above show the distributions of 460 and 186 dot balls to Gayle from seamers and spinners respectively.  Seam bowlers give themselves the best chance by bowling back-of-a-length outside off, although the distribution is quite broad in terms of both line and length.  For spinners, keeping it very tight to the top of off stump is the way to go ensuring you’re not too full or too short.  Of course, these plans are fraught with risk as the next images show.

seam_4_6.png

The heat map above shows the distribution of 227 balls that have been hit to the boundary by Gayle.  It’s clear that if you bowl fuller and outside off, you’re very likely to be hit to the boundary.  If you’re wondering what happens if you bowl straight to Gayle as a seamer, then this is where he is mostly restricted to 1’s, 2’s and 3’s.

spin_4_6

For spinners the margin of error between dot balls and boundaries is even smaller, as a comparison with the spinners’ dot ball heat map above shows.  The figure above illustrates the 92 boundaries Gayle has hit off spin bowling.  If you bowl fractionally too full and outside off stump then you are in trouble.  This confirms the hit and miss nature of bowling spinners to Gayle implied by his xR and xW figures.  If you want to bowl spin to him then you have to be prepared to go for runs before he is dismissed.

Gayle vs particular bowlers

We can now look at how specific bowlers who have been successful (or otherwise) have bowled to Gayle.  The table below shows the 10 bowlers he has performed worst against in terms of runs/xR and who have bowled at least 18 legal deliveries to him.

gayle_bowlers

We get a mixture of both seam bowlers and spinners.  Gayle scores nearly 26 runs fewer than we would expect the average batsman to score when facing Lasith Malinga.  We have to be careful with the interpretation here, however.  The average bowler would expect to concede 44 runs if they bowled the exact same deliveries as Malinga has.  But if Malinga (and other bowlers on the list) concede fewer runs than expected against most other batsmen, then there is something about these bowlers our model doesn’t quite capture.

The number of dismissals per bowler isn’t really large enough to form concrete conclusions from the dismissals/xW ratios, but it should be noted that he is dismissed by the spinners Sunil Narine and Harbhajan Singh more often than expected.

Let’s look at the heat maps from some of these bowlers with their wickets shown in red.

malinga.png

Firstly, Malinga has two distinct areas that he bowls to Gayle mixed in with the occasional short ball.  He favours a good length quite wide of off stump and also some yorker length deliveries aiming for the base of middle stump.

steyn

Steyn bowls more or less evenly between back of a length on off stump and a bouncer length towards Gayle’s helmet which has gotten his wicket once.

narine.png

Narine’s heat map shows he bowls wide of off stump, unlike most spinners who target the top of off.  His length is also incredibly consistent, as shown by the very narrow contours, so he relies on variation in spin.

ashwin.png

Ashwin, on the other hand, varies his line and length a lot more to Gayle.  He predominantly bowls quite full on off stump which, if you remember, is where Gayle hits a lot of boundaries against spinners.  Perhaps this suggests Gayle is relatively more cautious when facing Ashwin.

Finally, we can take a look at a bowler who hasn’t fared quite as well.

bhuvi.png

Bhuvneshwar Kumar has gone for 78 runs from 51 balls with a runs/xR figure of 1.22 against Gayle.  He mostly bowls in that area where seamers typically keep Gayle quiet.  However, when he misses his length he goes for a lot of boundaries shown by the blue balls.  This again stresses how little margin for error you have when bowling to Gayle.

Setting a field to Gayle

Another component to restricting and ultimately dismissing Gayle is field placement.  The wagon wheel below shows 6,037 of his runs from 2,260 scoring shots.

gayle.png

It’s obvious from watching him that a lot of his boundaries come in the deep midwicket to long-off region.  He also doesn’t run many 2’s or 3’s.

gayle_dots.png

His dot balls mainly come from deliveries played back to the bowler.  Having a fielder close in on the off-side is also a source of quite a few dot balls, as well as conventional point and cover fielders.  I assume all those dot balls on the boundary are when he’s turned down singles although I’m not entirely sure he’s done it that many times.

Gayle is caught 64% of the time he is dismissed.  The wagon wheel below shows 85 instances of him being caught in the outfield separated by seam and spin bowling.

gayle_caught.png

He is caught behind and in the slips quite often to both seamers and spinners, suggesting it is worth having a slip in, especially early on in his innings, even in T20’s.  Given where he scores most of his runs, it’s only a matter of time before he mishits one and gets caught at long-on or deep midwicket.

Data needs context

Overall, we’ve seen how we can use data to identify the strengths and weaknesses of a batsman and thus formulate bowling plans.  However, it’s important to note that we haven’t found the silver bullet to dismissing Gayle cheaply every time we bowl to him.  Just because he gets out to a particular type of delivery really often, doesn’t mean teams should focus their entire pre-match training on hitting that one spot.  We know that variation is important in T20’s.  Looking deeper, we might find it’s a particular string of deliveries that set him up before dismissing him, or that he is dismissed this way after getting a big score.  As ever, more investigation is required.

Any comments/questions?  Tweet me here.

India v England T20I series analysis

In this blog I wanted to take a brief break from refining our Expected Runs and Expected Wickets models and instead use these metrics, in their current form, to analyse the recent India v England T20I series.  India came back from 1-0 down to win the series 2-1.  The table below shows a summary of the series in terms of the total xR and xW for each team in each match.  To give some context, batsmen outperformed expectation in this series overall, scoring 875 runs compared to a total xR of 828.8.  In contrast, there were 35 wickets lost (excluding run outs) compared to a total xW of 34.81.

indeng

In the first match, England bowled reasonably well to restrict India to 147-7 in their innings.  This was 6 runs below their total xR suggesting their relatively low score was due to poor batting more than good bowling.  In reply, England outscored their total xR by nearly 26 runs.  This, along with losing nearly 2 wickets fewer than expected suggests a pretty good batting performance overall.  In the next match, both teams performed near enough as expected in terms of both runs and wickets.

In the final match, India significantly outperformed their xR to post a total of 202-6.  While England did too, it was nowhere near enough to challenge India’s score.  Slightly outperforming an already low xR total won’t win you many games.  The most damning aspect however, is England losing more than double the expected number of wickets.  In fact, their xW for that match was the lowest of the series highlighting just how inept that batting collapse was.

Batsmen

We can look at how individual batsmen performed throughout the series.  The table below shows some statistics for the top 10 series run scorers.

batsmen

Although Joe Root top scored in the series, he did under-perform according to his xR.  However, he did have the lowest wickets/xW of any batsman suggesting he had to dig in at times.  He was dismissed only twice even though India bowled well enough to him to dismiss an average batsman more than 5 times.

root_runs.png

The beehive plot above shows how India bowled to Root in the series.  I think there was a definite plan to bowl very straight and wide of leg stump, with fielders out on the leg side boundary.  This seemed to work as he was mainly restricted to singles in this area and ended up with an overall strike rate of just over 100.

root_wickets

The heat map above shows the extent of India’s bowling plan along with Root’s 2 dismissals.  The darker the green the greater the number of balls bowled in that region.

MS Dhoni had a similar story to Root – scoring below xR, but batting well enough to survive periods of good bowling from England.

dhoni.png

This heat map shows an even more pronounced plan from England to bowl at the top of leg stump and occasionally full and outside off stump to Dhoni.  The purple balls show all of Dhoni’s boundaries, which accounted for 44 of his 97 runs in the series.  His overall strike rate of 139 suggests this plan didn’t quite work for England, in contrast to India’s plan to Root.

Also from the table above, it seems Sam Billings had a pretty poor series after giving his wicket away nearly 3.5 times more than expected.  The beehive plot below shows his dismissals and every other ball he faced in the series.

billings

It shows how he was undone by the surprise bouncer from Ashish Nehra in the 2nd T20I, which had an xW of 0.136 – the highest he faced.  His other 2 dismissals had xW’s of 0.045 and 0.034.  He faced 8 balls which had higher xW’s from which he scored 17 runs.  Although he is an opener, this perhaps shows his need to work on picking the right ball to hit.

England vs spinners

Another theme to come out of this series was, predictably, England’s poor batting performance against spinners.  We can see if this is borne out with our xR and xW metrics.  The table below compares England’s xR and xW for both seamers and spinners.

eng

England produced an average performance against seamers, scoring and losing wickets more or less in line with expectation.  Against spinners, they over-performed in terms of scoring runs but lost nearly 3 more wickets than expected.  If we exclude Suresh Raina from India’s list of spinners, England’s runs/xR drops to 1.064 and wickets/xW increases to 1.63, i.e. nearly 4.3 wickets more than expected.  This can only confirm how dreadfully poor some of England’s shot selection was in this series against spinners.

wick_r

wick_l

The heat maps above show how India’s spinners bowled to right and left-handed batsmen with their wickets in red.  The dark green patches tend to be wide but quite flat.  This suggests that their spinners get quite consistent bounce and rely on movement off the pitch for variation.  We can compare this to England’s spinners, again bowling to right and left-handed batsmen.

en_spin_r

en_spin_l

Interestingly, the heat maps are narrower and less flat.  This suggests the England spinners relied more on variation in length rather than line.

Overall, we’ve shown how the xR and xW metrics, together with some visualisations, can be useful in a post-match analysis.  We can make use of this data to confirm or challenge any conclusions that commentators or we as viewers may make about team or player performances.  As I refine these models further, I’ll be sure to do some more match analyses in between to make sure the metrics remain valid.

If there’s anything you think I could add, let me know here.

Adding game state to xR and xW

In previous blogs I have described how I used the speed of the ball, and its line and length to calculate the Expected Runs and Expected Wickets of that ball.  In this blog, I incorporate game state into these metrics, i.e. the over of the innings in which the ball was bowled.  The run rate, and indeed the probability of getting a wicket, is not constant throughout the course of a T20 innings.  Using data from nearly 4,000 T20 matches, we can calculate the average number of runs from each over of an innings, shown in the graph below.

run_rates.png

Teams generally have a bit of a go in the powerplay overs then completely start over in the 7th over – something I’ve always found a bit peculiar.  Usually, a spinner comes on and they knock him about in that over.  If teams target this over when the fielders have just recovered from the powerplay, perhaps they can increase their average scores by 2 or 3 runs – enough to win maybe 5% more games?  Anyway, the average number of runs then increases in a pretty linear fashion.

The point is that the run rate fluctuates and xR can be adjusted to reflect that.  The historical run rate in T20 matches is about 6.98.  This is the benchmark that will be used to set the ‘value’ of overs.  For example, on average there are 9.07 runs scored in the 20th over, so the xR of balls in this over will be multiplied by 9.07/6.98 = 1.30.  In other words you would expect 30% more runs to be scored from the exact same balls than in the 13th over where the multiple is about 1.  Similarly, the multiple of the 1st over is 0.71 i.e. 4.95/6.98.
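
The adjustment can be sketched as a simple lookup and multiplication.  Only the three figures quoted above (6.98, 4.95 and 9.07) come from the data; the names `HISTORICAL_RUN_RATE`, `avg_runs_by_over`, `over_multiplier` and `adjust_xr` are my own labels for illustration, and a full table would hold all 20 overs.

```python
HISTORICAL_RUN_RATE = 6.98  # average runs per over across all overs

# Average runs scored in each over; only the overs quoted in the text
avg_runs_by_over = {1: 4.95, 20: 9.07}


def over_multiplier(over):
    """Value of an over relative to the historical run rate."""
    return avg_runs_by_over[over] / HISTORICAL_RUN_RATE


def adjust_xr(raw_xr, over):
    """Scale a ball's xR by the multiplier of the over it was bowled in."""
    return raw_xr * over_multiplier(over)


print(round(over_multiplier(20), 2))  # 1.3
print(round(over_multiplier(1), 2))   # 0.71
```

So a ball with a raw xR of 1.0 would be worth about 1.30 expected runs in the 20th over but only about 0.71 in the 1st.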

The average number of wickets in an over follows a similar shape.

w_rates.png

Again we see that dip in the 7th over where both sides just drop in intensity a bit.  Toward the end of the innings the wickets tend to fall at an exponential rather than linear rate.  As before, we can adjust xW to account for this variation.  The historical average wickets per over is 0.318.  So the xW of each ball in the last over, for example, is multiplied by 0.80/0.318 = 2.52.  However, this never results in the xW of a particular ball exceeding 1.  In fact the highest xW of a single ball is 0.704.
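
A short sketch of the xW adjustment, which also shows how large a raw xW would need to be before the adjusted value could exceed 1.  The constants are the figures quoted above; the function names are hypothetical.

```python
HISTORICAL_WICKETS_PER_OVER = 0.318  # average wickets per over
LAST_OVER_WICKETS = 0.80             # average wickets in the 20th over


def wicket_multiplier(avg_wickets_in_over, baseline=HISTORICAL_WICKETS_PER_OVER):
    """Value of an over for wickets relative to the historical average."""
    return avg_wickets_in_over / baseline


def adjust_xw(raw_xw, multiplier):
    """Scale a ball's xW by the multiplier of its over."""
    return raw_xw * multiplier


multiplier = wicket_multiplier(LAST_OVER_WICKETS)
print(round(multiplier, 2))  # 2.52

# A last-over ball would need a raw xW above this threshold
# for its adjusted xW to exceed 1
threshold = 1 / multiplier
print(round(threshold, 4))
```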

We can now see how this affects our batsmen and bowler ratings.  The correlation between the old xR and updated xR is very strong, with an R² of 0.996, so we would expect few changes.  The table below shows the best and worst performing batsmen by runs/xR with over 300 runs.

xr2

The best batsman according to the updated metric is once again Glenn Maxwell, followed by Aaron Finch.  However, Darren Sammy, who was previously in 3rd position, has dropped all the way to 33rd.  This is a reflection of his under-performance in scoring runs near the end of the innings.  This may also be true of MS Dhoni, whose runs/xR figure drops from 1.00 to 0.889 – the equivalent of about 76 runs.  Dhoni is a great finisher of an innings but perhaps picks the right ball to hit a little too conservatively.

An example of a batsman who sees an increase in their runs/xR multiple is David Warner, from 1.14 to 1.21 – the equivalent of 56 extra runs.  This perhaps illustrates the fact that he performs better in the first few overs of the innings than other openers and top-order batsmen.  If you look at the first graph above, you’ll see that of the first 12 overs of an innings, only 3 (overs 4, 5 and 6) are above the average run rate of about 7.  This means batting in this period will result in your xR being discounted, as it were.  Other examples include Aaron Finch and Chris Gayle, whom this updated model rates more favourably.  On the other hand, Luke Wright’s runs/xR drops ever so slightly from 1.13 to 1.12.

For bowlers with at least 200 balls, the top and bottom 10, measured by xR/ball, are shown below.  It should be noted again that spinners historically have had a lower economy rate than pace bowlers in T20 cricket.  This, coupled with the fact that spinners tend to bowl in the middle overs, means that this updated model favours spinners slightly more than pace bowlers.

bowl

As expected, spinners make up the bulk of the top performers list.  The best performing fast bowler is South Africa’s Lonwabo Tsotsobe with an xR/ball of 1.16.  One important difference to the previous model is the fact that the xR/ball figures have all dropped for bowlers in the top 10 and increased for bowlers in the bottom 10.  Mohammad Hafeez’s xR/ball has gone from 1.08 to 1.00 while Dwayne Bravo’s has gone from 1.40 to 1.50.  This shows how the new model can be useful in identifying the best bowlers at certain stages of the innings.

Furthermore, we can see how the updated xW model affects both batsmen and bowlers.  You can see from the second graph above that the wicket rate for every over up to the 14th is below the average wicket rate, so the xW of any balls in that period will be discounted and vice versa.  The best and worst performing batsmen, with at least 15 dismissals, measured by dismissals/xW are shown below.

xw

This time JP Duminy is top after adjusting xW for game state.  At the other end, surprisingly Shahid Afridi drops down to 4th worst with Michael Lumb taking his place.  Interestingly there are quite a few openers in the bottom 10 compared to the previous model even though this model favours batsmen who bat in the opening overs.  This possibly highlights the hit-and-miss nature for openers in T20’s.  Perhaps an extension to this model would include adjustment factors for different positions in the batting order.

bowl-desc2

Expected Strike Rate (xSR) is calculated as the number of balls divided by xW.  In general, pace bowlers have had a lower strike rate than spinners in T20 cricket, so it is not surprising to see only fast bowlers in the top 10 and mostly spinners in the bottom 10.  Tsotsobe, however, whom we saw previously to be the best performing pace bowler in terms of runs/xR, has the highest xSR of any bowler.  This is consistent with his career stats, which suggest he is a bowler who keeps it tight in the latter overs without getting a load of wickets.
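
As a small illustration of the xSR definition, the sketch below divides the number of balls by the summed xW of those balls and sets it alongside the ordinary strike rate (balls per wicket).  The function names and the xW values are hypothetical.

```python
def expected_strike_rate(xw_values):
    """xSR: balls bowled divided by total expected wickets."""
    total_xw = sum(xw_values)
    return len(xw_values) / total_xw if total_xw else None


def actual_strike_rate(balls_bowled, wickets):
    """Conventional bowling strike rate: balls per wicket."""
    return balls_bowled / wickets if wickets else None


# Hypothetical per-ball xW values for one bowler, for illustration only
xw_values = [0.05, 0.03, 0.02] * 10  # 30 balls, total xW of about 1
print(expected_strike_rate(xw_values))
print(actual_strike_rate(30, 2))  # 15.0
```

A bowler whose actual strike rate is lower than their xSR is taking wickets more often than the quality of their deliveries alone would suggest.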

Overall, the intuition behind adding game state to the model is simple – if a player scores more runs or takes more wickets than average at a particular stage of the innings then they deserve some credit, and it tells us something about their game.  I realise this article consisted of a lot of tables of data, but I’ll be sure to include some more informative graphics in the future.  In the next post I hope to look into game state further, specifically which batsmen and bowlers perform the best in the powerplay and death overs and why they’re able to do so.

Questions/suggestions?  Tweet me here.

xR with machine learning

In my first blog I described my first attempt at a metric called Expected Runs or xR.  This calculated the average number of runs a batsman would score off a delivery based on the ball’s line and length as it reaches the batsman.  In this blog, I extend this model by including other factors such as the speed of the ball and where it pitched.

I recently used a machine learning algorithm called k-nearest neighbours to develop an Expected Goals model.  This worked out pretty well so I thought I’d apply this to my cricket data.  Put simply, this algorithm finds the k (I used 50) most similar deliveries to a particular delivery and calculates the average number of runs accrued from those.  This means that it is not necessary to split bowlers into spinners and fast bowlers, as balls bowled at a typical spinner’s or seamer’s pace are likely to be grouped together.

The algorithm works by randomly splitting the dataset into a training set (from which the patterns are identified) and a test set which is a fifth of the original dataset.  This is done 10 times to cross-validate the data and reduce the effect of any bias in the training set.  For anyone interested the Python code is below:

code
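
For readers who want something runnable, here is a minimal standard-library sketch of the procedure described above: k-nearest-neighbour averaging with 10 random splits that each hold out a fifth of the data.  It is not the code from the image above.  The function names, the use of Euclidean distance on features assumed to be pre-scaled to comparable ranges, the mean absolute error score and the synthetic data are all my own assumptions for illustration.

```python
import math
import random


def knn_predict(train, query, k=50):
    """Mean runs over the k training deliveries closest to `query`.

    train: list of (features, runs), where features is a tuple of floats
    (e.g. line, length, speed) assumed to be scaled to similar ranges.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(runs for _, runs in nearest) / len(nearest)


def repeated_holdout(data, k=50, n_repeats=10, test_fraction=0.2, seed=0):
    """Mean absolute error averaged over repeated random train/test splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        abs_errors = [abs(knn_predict(train, feats, k) - runs) for feats, runs in test]
        errors.append(sum(abs_errors) / len(abs_errors))
    return sum(errors) / len(errors)


# Synthetic deliveries (random features and outcomes), for illustration only
rng = random.Random(1)
data = [
    ((rng.random(), rng.random(), rng.random()), rng.choice([0, 0, 1, 1, 2, 4, 6]))
    for _ in range(300)
]
print(repeated_holdout(data, k=50))

# The xR of a single delivery is just the neighbour average
print(knn_predict(data, (0.5, 0.5, 0.5), k=50))
```

In practice a library implementation (for example a k-nearest-neighbours regressor with a fast neighbour search) would be far quicker on 40,000+ balls than this brute-force sort.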

From the ball-by-ball T20I data I removed any wides and erroneous deliveries.  In contrast to my first xR model I decided to mirror flip any deliveries to left-handers so that they can be equivalently compared to right-handers.  This left 41,104 balls from 208 matches.  After applying the algorithm as above, the total xR was 50,978 runs compared to 50,923 actual runs – a 0.1% error.  This is not surprising at all given that the algorithm learns on itself.  Almost by definition, the total actual runs and xR must more or less match.
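
The cleaning step can be sketched as follows.  I am assuming each ball is a record with a `line` coordinate measured from middle stump (so that mirroring is just a sign flip), a `hand` field for the batsman and a `wide` flag; these field names are hypothetical and the real data may be laid out differently.

```python
def clean_and_mirror(balls):
    """Drop wides and mirror the line of deliveries to left-handers."""
    cleaned = []
    for ball in balls:
        if ball["wide"]:
            continue  # wides are excluded from the dataset
        if ball["hand"] == "L":
            # Flip about middle stump so off side and leg side line up
            # with those of a right-hander; the input record is not mutated
            ball = {**ball, "line": -ball["line"]}
        cleaned.append(ball)
    return cleaned


# Hypothetical deliveries, for illustration only
sample = [
    {"line": 0.3, "hand": "R", "wide": False},
    {"line": 0.3, "hand": "L", "wide": False},
    {"line": 1.2, "hand": "R", "wide": True},
]
cleaned = clean_and_mirror(sample)
print([ball["line"] for ball in cleaned])  # [0.3, -0.3]
```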

As before, we can calculate the xR of individual batsmen.  The plot below shows each batsman’s career runs against their xR.  The straight line is y=x and acts as the dividing line between over and under-performing their expectation.

xr

Most of the highest scoring batsmen are over-performing, which perhaps means that there is an aspect of their game we are not capturing with this metric.  One possible factor is boundary hitting.  The runs/xR ratio for a particular ball can take values up to and above six.  As such, batsmen who hit a lot of boundaries can inflate their performance with this metric, not that boundary hitting is a bad thing.  The usefulness of xR comes from identifying batsmen who can score more runs from deliveries than average, and this includes rotating the strike and hitting into gaps for two’s and three’s.  The graph below shows the total number of boundaries for each player against their performance figure, measured by runs/xR.

blog.png

Although there is a slight positive correlation, the R² value is just 0.119.  This means that the number of boundaries explains only 11.9% of the variance in the performance measure.  The R² drops to 0.093 if we only include batsmen who have scored at least 20 boundaries, indicating that it is even less significant for the better batsmen.  The table below shows the top 20 performing batsmen with at least 300 runs (of which there are 52).
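
For a straight-line fit with one explanatory variable, R² is the square of the Pearson correlation between the two series.  A minimal sketch, with the hypothetical name `r_squared` and made-up numbers rather than the real boundary counts:

```python
def r_squared(xs, ys):
    """R² of a simple linear regression of ys on xs
    (equal to the squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up values, for illustration only
boundaries = [1, 2, 3, 4]
performance = [1, 3, 2, 4]
print(r_squared(boundaries, performance))  # 0.64
```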

batsmen

As before, Glenn Maxwell comes way out on top, over-performing his xR by 42%, followed by his Australian teammate Aaron Finch.  Finch and Sammy have swapped places from the list in my first xR model.  Some other notable changes include Shahid Afridi dropping from 5th to 12th in the list, and Mahela Jayawardene breaking into the top 10.

At the other end of the scale, the table is as below:

batsmen-xr-desc

This time Martin Guptill comes bottom of the pile instead of Mohammad Hafeez who is in second place.  It is reassuring that the runs/xR figures themselves have not changed drastically.  In this and the previous model, they are mostly between 0.9 and 1.2.

The tables below show the 10 best and worst performing bowlers who have bowled at least 200 balls (there are 60 such bowlers in the dataset).

bowlers

In my previous blog about xR, I questioned Hafeez’s place in the Pakistan T20 side based on his batting stats.  But it appears his bowling justifies his inclusion as he has the lowest xR/ball in the list.  As explained in that post, bowlers who concede more runs than expected are most likely unlucky enough to bowl to quality batsmen quite often.  Sunil Narine slips down to 8th, although he still concedes over 100 runs fewer than expected.

Spinners dominate the top of the list with the best performing pace bowler, Bhuvneshwar Kumar, coming in at 18th place with an xR/ball of 1.205.  This is not entirely unexpected, as spinners have an overall economy rate of 6.89 in T20I’s compared to 7.70 for pace bowlers.  The full lists for both batsmen and bowlers can be obtained here.

In future blogs I will be extending this analysis to ODI’s and Test matches and investigating xR at a match level.  I’ll also be experimenting with other machine learning algorithms to predict both runs and wickets.

If you have any suggestions or ideas of your own, please feel free to tweet me.