Some of you may remember the “Google Flu” effort, where the company was going to try to track outbreaks of influenza in the US by mining Google queries. There was never much clarification about what terms, exactly, they were going to flag as being indicative of someone coming down with the flu, but the hype (or hope) at the time was pretty strong:
Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. . .
So how’d that work out? Not so well. Despite a 2011 paper that seemed to suggest things were going well, the 2013 epidemic wrong-footed the Google Flu Trends (GFT) algorithms pretty thoroughly.
This article in Science finds that the real-world predictive power has been pretty unimpressive. And the reasons behind this failure are not hard to understand, nor were they hard to predict. Anyone who’s ever worked with clinical trial data will see this one coming:
The initial version of GFT was a particularly problematic marriage of big and small data. Essentially, the methodology was to find the best matches among 50 million search terms to fit 1152 data points. The odds of finding search terms that match the propensity of the flu but are structurally unrelated, and so do not predict the future, were quite high. GFT developers, in fact, report weeding out seasonal search terms unrelated to the flu but strongly correlated to the CDC data, such as those regarding high school basketball. This should have been a warning that the big data were overfitting the small number of cases—a standard concern in data analysis. This ad hoc method of throwing out peculiar search terms failed when GFT completely missed the nonseasonal 2009 influenza A–H1N1 pandemic.
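That 50-million-terms-versus-1152-points mismatch is the heart of the problem, and it's worth seeing how little it takes to fall into it. The short Python sketch below is my own illustration, not anything from the GFT pipeline; every "search term" in it is pure random noise, and the sizes are toy numbers scaled down so it runs in seconds. Screen a large enough pool of candidates against a short target series and you all but guarantee a "best match" that looks convincing in-sample and predicts nothing afterward:

```python
# A minimal sketch of the overfitting trap described above -- not GFT's
# actual method. Every candidate "search term" here is random noise, yet
# the best in-sample match still looks like a real signal, then falls
# apart on held-out weeks. All sizes are toy numbers for a quick demo.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 104, 52        # two "seasons" to fit, one to check
n_terms = 50_000                 # stand-in for the millions of candidate terms

flu = rng.normal(size=n_train + n_test)               # pretend CDC series
terms = rng.normal(size=(n_terms, n_train + n_test))  # candidate predictors: pure noise

def corr(target, candidates):
    """Pearson correlation of each row of `candidates` with `target`."""
    t = (target - target.mean()) / target.std()
    c = (candidates - candidates.mean(axis=1, keepdims=True)) / candidates.std(axis=1, keepdims=True)
    return c @ t / len(target)

train_r = corr(flu[:n_train], terms[:, :n_train])
best = np.abs(train_r).argmax()                       # keep the best-looking term

print(f"in-sample     |r| = {abs(train_r[best]):.2f}")   # looks like a genuine signal
test_r = corr(flu[n_train:], terms[[best], n_train:])
print(f"out-of-sample |r| = {abs(test_r[0]):.2f}")       # near zero: it predicted nothing
```

The in-sample winner comes out far more correlated than any single noise series has a right to be, and then evaporates on the held-out weeks; that's the high-school-basketball problem in miniature.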
The Science authors have a larger point to make as well:
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. Elsewhere, we have asserted that there are enormous scientific possibilities in big data. However, quantity of data does not mean that one can ignore foundational issues of measurement and construct validity and reliability and dependencies among data. The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.
The quality of the data matters very, very much, and quantity is no substitute. You can make a very large and complex structure out of toothpicks and scraps of wood, because those units are well-defined and solid. You cannot do the same with a pile of cotton balls and dryer lint, not even if you have an entire warehouse full of the stuff. If the individual data points are squishy, adding more of them will not fix your analysis problem; it will make it worse.
Since August 2011, GFT has missed (almost invariably on the high side) in 100 out of 108 weeks. As the authors show, even low-tech extrapolation from three-week-lagging CDC data would have done a better job. But then, the CDC data are a lot closer to being real numbers. Something to think about next time someone's trying to sell you on a Big Data project. Only trust the big data when the little data are trustworthy in turn.
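For the curious, here's roughly what that low-tech baseline amounts to. This is a hedged sketch, not the Science authors' code, and the weekly numbers in it are invented for illustration, not real CDC or GFT figures: carry the CDC value from three weeks back forward as the prediction, then compare errors against a model that consistently overshoots.

```python
# A sketch of the "low-tech extrapolation" baseline -- reuse the CDC figure
# from `lag` weeks back as this week's estimate, and compare mean absolute
# error against a model's estimates. All numbers below are made up.
import numpy as np

def lagged_baseline(series, lag=3):
    """Predict week t with the value observed at week t - lag."""
    return series[:-lag], series[lag:]        # (predictions, truth), aligned

def mae(pred, truth):
    return float(np.mean(np.abs(pred - truth)))

cdc = np.array([1.2, 1.4, 1.9, 2.6, 3.4, 4.1, 3.8, 3.0, 2.2, 1.6, 1.3, 1.1])
gft = 2.0 * cdc   # crude stand-in for a model that runs about twice too high

pred, truth = lagged_baseline(cdc, lag=3)
print("lagged-CDC baseline MAE:", round(mae(pred, truth), 2))
print("overshooting model MAE: ", round(mae(gft[3:], truth), 2))
```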
Update: a glass-half-full response in the comments.