COVID-19: Curve-Flattening, Models, and Garbage Data
The recent flood of news has pushed the field of epidemiology (the statistics of disease, applied to public health) into our everyday thoughts. Epidemiologists are experts specifically in the prevalence and spread of diseases, like COVID-19, among populations. And while epidemiologists are statisticians, not all statisticians are trained in epi stats or public health.
I am not an epidemiologist. But, like all statisticians, I’ve been trained to sniff out sound statistical measurement and good data, and, most of all, to find other experts who know what they’re talking about far better than I do.
I’ll share with you what I’ve been reading, who I’m paying attention to, and what perks my ears from the news.
The Ubiquitous Curve
If nothing else, the typical consumer of news has recently absorbed fantastic data-visualization lessons – both in “flattening the curve” and in guessing which letter of the alphabet the shape of our economic recovery will trace.
What is the curve? The curve we are eager to flatten represents the pace of deaths or disease from COVID-19. If the pace is too forceful, and the curve shoots up high very quickly, we overwhelm our healthcare system. To disperse demand for COVID-19-related healthcare, we’ve tried to flatten, or spread out, the curve. To be clear, flattening the curve doesn’t lessen the raw number of disease cases; it merely spreads the cases out, slowing the pace at which they arrive. In theory, even if we “flatten the curve,” we may have the same number of COVID-19 cases.
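The arithmetic behind that claim is easy to demonstrate. Below is a toy sketch (not a real epidemic model – just two bell-shaped daily case counts with made-up, illustrative numbers) showing that a flatter curve can hold the same total number of cases while cutting the peak demand on hospitals:

```python
import math

def epidemic_curve(total_cases, peak_day, spread, days=200):
    """Toy bell-shaped daily case counts; purely illustrative, not an epidemiological model."""
    raw = [math.exp(-((d - peak_day) ** 2) / (2 * spread ** 2)) for d in range(days)]
    scale = total_cases / sum(raw)  # normalize so the area under the curve equals total_cases
    return [scale * r for r in raw]

# Same total cases; the "flattened" curve peaks later and is spread wider.
unmitigated = epidemic_curve(total_cases=100_000, peak_day=50, spread=10)
flattened = epidemic_curve(total_cases=100_000, peak_day=80, spread=25)

print(round(sum(unmitigated)), round(sum(flattened)))  # equal totals
print(round(max(unmitigated)), round(max(flattened)))  # very different single-day peaks
```

Both curves contain 100,000 cases, but the flattened one’s worst single day is far below the unmitigated peak – which is the whole point when hospital capacity is a fixed horizontal line.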
Cathy O’Neil (mathematician worth reading #1) warns us here that this “curve,” depicted as an actual curve in news visualizations, may not be so neatly curve-shaped. To date, the second half of the curve is a guess, since it is still in the future. It may be asymmetrical. It may level off after it peaks. It may peak again. Importantly, we can’t expect diseases to lessen at the same rate that they increased. Hence, we can’t predict how long the COVID-19 crisis will last.
Models and Garbage Data
Another term news-consumers are being bombarded with: models. Statistical modeling is not the same as an algorithm or a formula by itself; it’s the application of an algorithm or formula to a set of data to predict what may happen. There are two pieces here. Modeling requires both a valid algorithm and valid data. Both must be correct for the model to produce a usable prediction.
Hence our challenge. Not only do we lack a known formula for the model (new virus – what shape will its path take?), we also lack valid data (disease case prevalence, death rates, etc.). In fact, we have the opposite – we have garbage data. Many models are dogs that won’t hunt. While we can know (with some validity) raw counts of hospitalizations and deaths, we can’t know the number of people infected because of a lack of tests (at least in the U.S.). And we need both of those numbers to calculate the illness rate and death rate. In one sense, the lack of testing for COVID-19’s prevalence cascades down into numerous other incorrect and incalculable statistics.
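To see how missing tests poison the death rate specifically, here’s a minimal sketch with hypothetical numbers (none of these figures are real COVID-19 data): if we only divide deaths by confirmed cases, undercounting infections inflates the apparent fatality rate.

```python
def naive_fatality_rate(deaths, known_infections):
    """Deaths divided by known infections: garbage in, garbage out."""
    return deaths / known_infections

# Hypothetical, illustrative numbers only.
deaths = 500
confirmed_cases = 10_000   # infections we actually detected via testing
true_infections = 50_000   # suppose most infections were never tested

print(naive_fatality_rate(deaths, confirmed_cases))  # 0.05 -> looks like 5%
print(naive_fatality_rate(deaths, true_infections))  # 0.01 -> actually 1%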
Who Should We Listen To?
Here’s my personal decision rule: listen to those that caveat themselves. Accurate models for COVID-19 don’t exist, but some models are better than others. If an expert or a model does not include caveats or a description of likely errors, ignore it. Johns Hopkins University’s map is clear what it is measuring – number tested, and number of deaths reported. Nigel Marriott has a fantastic compilation of credible COVID-19 data sources. Finally, the UK’s Royal Statistical Society spells out all of this clearly in their Statistician’s Guide to Coronavirus Numbers. Consume COVID-19 data numbers with an abundance of skepticism.
(I can’t help all of the dog analogies. I’ve just adopted a snuggly St. Bernard to be a sibling to my older lab.)