You’re on the bus. It’s early in the morning and difficult to remember what number comes after one. (Twelfty, isn’t it?) There seem to be some other people on the bus, people with coats, and hair, and bags, and mobile phones. When the early morning coffee starts to kick in, you start counting the colours of the coats and the hair and the bags. “Wow!” you think, “That’s amazing! There are seventeen people on this bus with blonde coats! That’s millions!” “So what?” says the bus driver when you try to tell him about your exciting discovery. Counting the things you see is relatively easy. Descriptive statistics is about giving some context to your observations, and starting you on the way to a genuine discovery.
There is a trend in sports analytics to count everything that moves. The Six Nations (that’s the rugby tournament that marks the start of Spring) has an official analytics partner. The “analytics” involves lots of counting, and helpful numbers pop up in the television coverage on “statistics” such as lineouts won on own throw. We can assume that they’re selling something more sophisticated to the teams, but what’s presented is probably better described as “metrics”. The closest they get to statistics is percentages of play in action areas, but even then we’re not told what percentages one would ordinarily expect. The real breakthrough in sports analytics has been in measurement and coding, and there are piles of raw numbers generated every week. Players wear GPS trackers that measure distance covered and the intensity of impacts. We now know, for example, that scrums are about 6G and tackles can be over 30G (thanks, the42.ie!). Video analysis has also progressed to classifying the contribution of each player arriving, but that’s data transformation, not analysis.
The first real analytical steps in building on the masses of data becoming available have to include looking at distributions and central tendency. Distribution has to do with positioning people on a scale, say from no tackles in a match to lots of tackles. There will be some players in a match who do lots of tackling, and some who do very little, and most who are somewhere in the middle, the group tending towards the centre, you see. There are three kinds of centre too: the mean is the arithmetic average, the median is the middle number of tackles if they’re arranged from least to most, and the mode is the most common number. The trend in some sports stats has been towards reporting the extremes, but with no clear rationale.
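For the curious, here is a minimal sketch of the three kinds of centre in Python. The tackle counts are invented purely for illustration (they are not real match data), and the variable names are my own.

```python
from statistics import mean, median, mode

# Hypothetical tackle counts for 15 players in one match (made-up numbers).
tackles = [2, 3, 5, 5, 6, 7, 7, 7, 8, 9, 10, 11, 12, 14, 29]

centre = {
    "mean": mean(tackles),      # arithmetic average
    "median": median(tackles),  # middle value once sorted
    "mode": mode(tackles),      # most common value
}

for name, value in centre.items():
    print(f"{name}: {value}")
```

With these made-up numbers the mean lands at 9 while the median and mode both sit at 7: one tackle-happy player (the 29) drags the average up without moving the middle of the pack.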
Out-liers (not outliers, which, if it were a word, would refer to things being more outly) are data points that look really big but do not appear to fit in a dataset; they might be genuine freaks, they might be measurement errors, or they might just indicate that the sample is too small. Hampel (1974. Thanks, Hampel!) helpfully came up with the concept of the influence curve of a data point, the degree to which one observation affects the pattern in a dataset: The influence of a single data point is approximately inversely proportional to the sample size, that is, the smaller the sample the greater the risk of influence. In a rugby match featuring a maximum sample of 46, the influence of out-liers is tight-head prop-esque.
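A rough way to see the sample-size effect is to watch how far a single extreme value drags the mean as the sample grows. This is only a sketch for the sample mean, not Hampel’s formal influence curve, and the "typical" and "extreme" values are assumptions chosen for illustration.

```python
from statistics import mean

def mean_shift(n, typical=5.0, extreme=30.0):
    """How far one extreme value moves the mean of n otherwise typical values."""
    clean = [typical] * n
    contaminated = [typical] * (n - 1) + [extreme]
    return mean(contaminated) - mean(clean)

# Compare a tiny sample, one rugby match (46 players), and a large sample.
for n in (10, 46, 1000):
    print(f"n = {n:4d}: mean shifts by {mean_shift(n):.3f}")
```

The shift works out to (extreme − typical) / n, so it shrinks in proportion to 1/n: sizeable at 10, still noticeable at 46, and close to nothing at 1000.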
The number of lineouts won on own throw is just as useless as the number of blonde coats without any context and without any expectation. If we were also told – and the data are certainly there – the average number of lineouts won in matches over the last 15 years, and whether today’s match was significantly above or below the average, that would be something to shout about. There’s a difference between saying, “Wow! There’s a really big number!” based on a single number and “Wow! There’s a surprisingly big number!” based on a comparison. The next step is to link differences in what can easily be counted to what actually counts, that is, the result of the match. It is possible, for a start, to correlate any performance metric of a team with the outcome of the match. It is possible to control for the influence of any other performance metric, and soon you’re on the way to a statistical model of how to win a rugby match.
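The first two of those steps might look something like the sketch below: compare today’s count with a historical average, then correlate the metric with the match outcome. All the numbers are invented, the points margin is just one possible stand-in for "the result", and a z-score is a rough yardstick of surprise rather than a formal significance test.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: lineouts won on own throw in past matches, and the
# points margin (positive = win) in those same matches. Made-up numbers.
lineouts = [10, 11, 12, 12, 13, 13, 14, 15]
margin = [-12, -3, -7, 4, 1, 10, 6, 15]
today = 17  # hypothetical count from today's match

# Step 1: how surprising is today's number, given the history?
z_today = (today - mean(lineouts)) / stdev(lineouts)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Step 2: does the metric move with the outcome?
r_lineouts_margin = pearson(lineouts, margin)

print(f"z-score for today: {z_today:.2f}")
print(f"correlation with points margin: {r_lineouts_margin:.2f}")
```

A z-score well above 2 is the "surprisingly big number" kind of wow; the correlation is the first hint of whether the thing being counted has anything to do with winning. Controlling for other metrics would need something more, such as a regression with several predictors, which this sketch does not attempt.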
Analysis of sport has taken several steps in a more numerically literate direction, but it’s still a long way from the appearance of terms like “kurtosis” or even “predictor”. There’s probably still too much reliance on raw, decontextualised data and on impressive-looking out-liers, but there’s also enormous potential to find out what really counts. The first step is descriptive statistics, but that’s still several steps away from masterminding a Six Nations win, so don’t mention anything to the bus driver quite yet.