Notes on Progress: The stats gap
Students understand just enough statistics to get by
Notes on Progress is a diary-style format from Works in Progress. If you only want to hear when we have a new issue of the magazine, you can opt out of Notes in Progress here.
In this issue, Ellen Pasternack writes about the issues with statistical education for scientists at universities. If you enjoy it, please share it on social media or by forwarding it to anyone you think might find it interesting.
Who here studied a STEM subject at university (or is studying one now!), and remembers going to lectures in statistics? The content of these is more or less the same wherever you are. They’ll start by making sure everyone is up to speed on averages – the difference between mean, median, and mode – and on measures of how spread out data is – like variance and standard deviation. Then you’ll learn about the Normal distribution, simple linear models, and how to carry out hypothesis testing via t-tests and ANOVA (which tells us whether groups have differences in some variable – e.g. whether one dog breed is larger than another).
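To make that concrete, here is a minimal sketch (in Python, with scipy and some invented dog-size numbers – the library and the data are my own choices for illustration, not anything a particular course prescribes) of the tools such a course covers:

```python
import numpy as np
from scipy import stats

# Invented shoulder heights (cm) for three dog breeds -- purely illustrative
rng = np.random.default_rng(0)
beagles = rng.normal(loc=38, scale=2, size=30)
collies = rng.normal(loc=56, scale=3, size=30)
spaniels = rng.normal(loc=46, scale=2.5, size=30)

# Averages and spread: the opening topics of any intro course
print("mean:", np.mean(beagles), "median:", np.median(beagles))
print("variance:", np.var(beagles, ddof=1), "sd:", np.std(beagles, ddof=1))

# A t-test: is the mean height of collies different from that of beagles?
t_stat, p_value = stats.ttest_ind(collies, beagles)
print("t-test p-value:", p_value)

# One-way ANOVA: do the three breeds differ in size at all?
f_stat, p_anova = stats.f_oneway(beagles, collies, spaniels)
print("ANOVA p-value:", p_anova)
```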
At the end of this course, if you’ve been following along and doing your homework, you should understand all of these concepts well enough to explain them to someone else, or to work out the maths by hand or sketch graphically what is going on.
Something like a t-test is useful if you want to ask a very basic question – is the mean in this group larger than the mean in this other group? – about neat, independent, Normally distributed data. In a classroom setting that’s fine, but in real life, the data you’re dealing with is almost never going to be as tidy as this; and you probably want to be able to answer a wider range of questions, too. So as well as this grounding in the basic principles of statistics and probability, you’ll hopefully be introduced to some more sophisticated ways of modelling and interrogating data.
But when I say introduced, I mean introduced. You’ll understand broadly the types of situation where you might want to use a particular method, and you’ll know which commands to type into your statistical software to put it into practice, but you’re very unlikely to get to a point where you can explain on a mathematical level what’s going on under the hood; there’s just not enough time on your standard course to go into that level of depth.
What you’ve been taught is something anthropologist Richard McElreath calls a ‘golem’. Golems – most famously the Golem of Prague – are, so the legend goes, powerful clay giants created to defend local Jewish populations from persecution; having no intelligence of their own, they will cause disaster if not carefully directed.
Statistical algorithms – whether a simple t-test or something more complex – are like golems. Whether calculated by hand or by computer, you put the data in, and it gives some numbers as output. Sometimes the output is ‘statistically significant’, which might be shown in statistical software by a little asterisk. But seeing a little asterisk is not a substitute for actually understanding what is going on. What ‘statistically significant’ means in the context of an algorithm like this is: given the data you’ve just fed me, I have performed some calculations, and the output of those calculations is a number which is lower than 0.05. The algorithm (aka the golem) can’t make real-world inferences for you, and it can’t tell you whether it was the correct algorithm to use in this instance. If the data is of the wrong sort, it will still blindly attempt to carry out its instructions. If the answer it gets is wacky for some reason, it won’t necessarily notice or care.
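To see that blindness in action, here is a small, entirely artificial sketch (Python and scipy are my choices; the fake data exists only to break the test’s assumptions). A t-test fed skewed, non-independent data still grinds through its calculation and hands back a p-value, with nothing in the output to warn you that it was the wrong golem for the job:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two made-up samples that clearly violate the t-test's assumptions:
# heavily skewed, and the second group is just a rescaled copy of the first,
# so the two groups are not even independent of each other.
group_a = rng.lognormal(mean=0, sigma=2, size=15)
group_b = group_a * 1.5

# The golem carries out its instructions and returns numbers regardless.
result = stats.ttest_ind(group_a, group_b)
print("t statistic:", result.statistic)
print("p-value:", result.pvalue)
```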
Armed with these golems, you’re let loose on real research as a graduate student. Let’s say you’re trying to apply a golem to your experimental data, and your statistical software throws back an error: the command you typed hasn’t worked. You don’t understand why it hasn’t worked, or for that matter what the jargon-filled error message even means, so you Google it, and see that someone has posted on StackExchange a few years ago describing what sounds like a similar problem. ‘You should be applying a Tischbein-Fischbein correction’, suggests one of the replies. ‘Actually, this analysis is probably worthless without it.’
A what?
The top Google result for ‘Tischbein-Fischbein correction’ is an academic paper published in a journal of statistics in 1982, which has been cited eleven thousand times. From the abstract of the paper, you think you understand in words what the procedure is meant to do, but you aren’t sure whether it’s applicable to the type of data you have. Reading on, you find the paper is dense with mathematical symbols, only half of which you even recognise. You don’t have a hope of gleaning useful information from this; you’re here to study the ecology of blue tits, for goodness’ sake. With a sinking heart, you return to the page of Google search results. All the other hits are research papers in a variety of fields that make a passing mention of having used the Tischbein-Fischbein procedure, citing the 1982 paper. Great.
This made-up example might be an exaggeration, but it’s not much of an exaggeration. If you’re anything like me, this process is incredibly frustrating: at a certain point you don’t care about understanding the problem any more, you just want to know how to make it go away. And it’s also demoralising, because everyone around you is talking as though they fully understand these things, yet you can’t find anything that explains them to you clearly and simply. Even when your analysis does seem to be working, you’re worried – and for good reason! – that you might’ve done something wrong that you’re not aware of.
This problem arises because of what I call ‘the statistics gap’. There’s a huge gap between the level of statistical understanding you get from university courses, which ideally give a thorough grounding in the basics, and the level of understanding required to parse most reference material at the next level up in complexity. And it’s within this gap that much academic research takes place.
Read any academic paper dealing with quantitative data. How often do such papers rely on nothing more complex than a t-test or a basic linear model (a model where outcome variables vary only with some multiple of input variables, and not their square, cube, inverse, or square root)? Almost never. But beyond this level of statistical analysis, the average researcher – in my field, at least – is far from having a watertight understanding.
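For concreteness, this is roughly what a ‘basic linear model’ amounts to in practice – a minimal sketch with statsmodels as my choice of library and some invented bird measurements, not anyone’s real analysis:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Invented data: wing length (mm) generated as a straight-line function
# of body mass (g), plus some noise
mass = rng.uniform(8, 14, size=50)
wing = 42 + 1.8 * mass + rng.normal(0, 1.5, size=50)

# Ordinary least squares: wing = intercept + slope * mass
X = sm.add_constant(mass)   # adds the intercept column
model = sm.OLS(wing, X).fit()

print(model.params)    # estimated intercept and slope
print(model.pvalues)   # p-values for each coefficient
```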
Learning statistics on the job as a junior researcher is a bit like being inducted into a secret society. Nobody expects you to fully understand the papers peppered with mathematical notation that explain how a certain statistical technique works. Instead, people understand in practice, to a greater or lesser extent, how to apply it, even if they don’t have that clear a picture of what they’re actually doing on a deep level. As a new graduate student, you might get pointers from older students on which golems to use, while they freely admit to having only a very superficial understanding of them – which means that troubleshooting any problems that arise is near impossible. Fitting models to, and making inferences from, your data become slightly alchemical processes, with everyone having their own approach, often based on a somewhat ramshackle grasp of the concepts involved.
You can see how, in this environment, statistics would be a source of insecurity for lots of researchers; in my experience, biologists sometimes talk about statistics in a slightly guarded or obfuscatory way, to avoid slipping up and revealing some ignorance they weren’t aware of. And a culture like this has implications for the quality of research being produced.
A clear illustration of this comes from ‘many analysts’ research projects, where multiple teams of researchers are asked to use the same dataset to answer the same question – often coming up with results that are wildly different from each other. One such study, which attracted a fair bit of media attention when it was published in 2018, asked twenty-nine teams to test whether darker-skinned players were more likely to get a red card in football matches than lighter-skinned players. You might think that if you were looking at the same data as someone else, reaching the same answer to this question ought to be reasonably straightforward, but no: twenty of the teams found there was a bias in the distribution of red cards, and nine of them said there wasn’t. There are a few other papers like this across different disciplines. In my field of biology, one group is working on a large project in which over a hundred teams answer questions about the growth of baby birds and of baby seedlings, and so far they have found substantial variation in the results reported.
Now, these differences don’t come about purely as a result of different analytical techniques. One of the main sources of discrepancy is which variables are chosen to be included in analysis (for instance, if you want to know about the size of birds, that doesn’t tell you whether to use their mass in grams, their wingspan in millimetres, or something else as your measure). But I think it demonstrates the importance of being able to have transparent and detailed conversations about the specifics of data analysis, and to understand the nuances where there is disagreement over which approach is better. We can have much more fruitful discussions when we understand a subject inside out and upside down than when we’re aware in the back of our minds that our understanding contains large gaps that have been papered over. And errors are much more likely to go unnoticed if peer reviewers and the people reading and citing research not only don’t have a fluent grasp of statistics, but also are insecure about their lack of fluency.
Lots has been written about how poor use of statistics can lead to research that doesn’t replicate, and there are a number of proposed remedies, from pre-registration (where you specify in advance how you’ll analyse the data) to multiverse analysis (where you carry out many different analyses and report all the results). With this article, I want to draw attention to something much simpler, which, as well as helping in its own right, is a prerequisite for many of the more involved solutions: we just need researchers who understand statistics better.
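As an aside, to make the multiverse idea concrete: the sketch below (in Python, with an invented dataset and variable names chosen purely for illustration) asks the same question once for each defensible choice of outcome variable and reports every answer, rather than quietly picking a favourite.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Hypothetical dataset with several defensible ways of measuring 'size'
df = pd.DataFrame({
    "mass_g": rng.normal(10, 1.5, 200),
    "wing_mm": rng.normal(60, 4, 200),
    "tarsus_mm": rng.normal(17, 0.8, 200),
    "food_supplement": rng.integers(0, 2, 200),
})

# A toy multiverse: fit the same question ('does the supplement affect size?')
# once per outcome variable, and report all of the results.
results = {}
for outcome in ["mass_g", "wing_mm", "tarsus_mm"]:
    fit = smf.ols(f"{outcome} ~ food_supplement", data=df).fit()
    results[outcome] = fit.pvalues["food_supplement"]

for outcome, p in results.items():
    print(f"{outcome}: p = {p:.3f}")
```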
How could we achieve this? One answer is to fill the statistics gap with more teaching. When someone at the bottom of the gap, i.e. with a good grasp of basic concepts but nothing more, wants to learn something more advanced, there should be materials that meet them where they are, whether those are taught courses at university or elsewhere, blog posts, online lectures, or some other format. Currently, I’d say there’s a dearth of resources pitched at a level just above an introductory stats course, with little prior knowledge assumed and with no jargon thrown in without explanation.
Such resources do exist! It’s just that there aren’t that many of them, compared to the abundance of clear teaching materials for basic statistics, so coverage is a bit patchy and they’re not necessarily easy to find. There’s Richard McElreath’s Statistical Rethinking course (where the term ‘golem’ for an undirected statistical algorithm was coined). I’m a fan of StatQuest, a YouTube channel created by former geneticist Josh Starmer, which explains a range of concepts with the kind of visual demonstration that lets people really see what is going on. (A benefit of YouTube as a medium is that you can go through an example verrry laboriously, and viewers can skip forwards and backwards to take it in at their own pace.) And this beautifully friendly introduction to mixed-effects models by linguist Bodo Winter was a godsend during my master’s research project.
All of these are the sort of material that makes a confused reader sigh with relief: ‘OH, so THAT’S what that’s all about’. And then, a bit later, with some annoyance: ‘this isn’t at all difficult to grasp – when it’s actually explained, it makes total sense!’ We desperately need more of this kind of thing! If any readers of this post have the knowledge and inclination to create teaching materials that hold people’s hands through intermediate statistical concepts, or perhaps a directory of those that already exist, I think this would be low-hanging fruit for making research across disciplines more reproducible.
Another solution could be more of a specialist role for statisticians within science. It’s already the norm in high-stakes, large-budget research like clinical trials that statisticians should be consulted as part of designing the study and analysing results. Perhaps this should be the norm in other areas of science, too? I suspect the statistics gap might be especially bad in my field of ecology and evolution, because although it is a STEM subject, people often go into it not because they are especially quantitatively minded but because of their passion for, and encyclopaedic knowledge of, wildlife. This in itself is a valuable attribute, and perhaps we ought to be selecting for this passion (and the equivalent in other fields) separately from the ability to thoroughly understand statistical concepts, rather than hoping for both to exist in a single person.
What I’d love to see is departments like my own biology department hiring a handful of dedicated people to provide statistics support for all the researchers who work there. Just as you can pop in to see the people at IT support for help with your computer, you ought to be able to fire off a quick email to the stats office for advice on the analysis of your data. We don’t expect researchers to be self-sufficient when it comes to IT problems. Perhaps we should recognise that a thorough understanding of statistics is also a specialist expertise in its own right, and not one that we need every individual researcher to possess. This would be an ideal job for the large number of smart people with a broad interest in science but little interest in advancing in a particular niche, for whom there’s currently not much place in academia. More importantly, it would be an efficient way to distribute people who have good quantitative skills, potentially dramatically improving the quality of all the research produced across a department.
The statistics gap is probably a consequence of science having moved on faster than some of its institutions, including, perhaps, the way it’s taught. The amount of data we are able to collect nowadays is enormous. The methods available to analyse it are dazzling. The rate at which new developments are published by researchers around the world is galloping. And into this increasingly quantitatively sophisticated world, students are sent unprepared, with the vague expectation that they’ll somehow just pick it up from people who are often also just winging it. This is something that’s quite likely holding back the quality of research we produce – but closing the gap wouldn’t require any great leaps forward in technology, or huge investments of money. All we need to do is take the knowledge we already have, and spread it more effectively.