From the Chronicle, a surprisingly good article on Big Data:
This month Mr. Lazer published a new Science article that seemed to dump a bucket of cold water on such data-mining excitement. The paper dissected the failures of Google Flu Trends, a flu-monitoring system that became a Big Data poster child. The technology, which mines people’s flu-related search queries to detect outbreaks, had been “persistently overestimating” flu prevalence, Mr. Lazer and three colleagues wrote. Its creators suffered from “Big Data hubris.” An onslaught of headlines and tweets followed. The reaction, from some, boiled down to this: Aha! Big Data has been overhyped. It’s bunk.
Not so, says Mr. Lazer, who remains “hugely” bullish on Big Data. “I would be quite distressed if this resulted in less resources being invested in Big Data,” he says in an interview. Mr. Lazer calls the episode “a good moment for Big Data, because it reflects the fact that there’s some degree of maturing. Saying ‘Big Data’ isn’t enough. You gotta be about doing Big Data right.”
I don’t know if I have to sketch out the parallels in education, but just in case: we have two really unhelpful parties in learning analytics. We have the “it’s all bunk” crowd, and we have the evangelists. And I don’t know which is worse.
Here’s the thing — saying “Big Data is bunk” is pretty close in ridiculousness to saying “Oceanography is bunk”. Seventy percent of the planet is ocean. Likewise, the “data exhaust” we emit on a daily basis is growing exponentially. There is no future where the study of this data is not going to play a large role in the research we do and the solutions we create. None. Nada.
How we do it is the issue. And the “science” in “data science” is supposed to bring an element of rigor to that.
But for various reasons, the Big Data world is surprisingly unscientific, surprisingly data-illiterate, and surprisingly uncritical of its own products. I hear supposed data scientists quoting the long-debunked claim that Obama won the 2012 election through use of Big Data (unclear and unlikely). They latch on to the same story about Target predicting pregnancy, which remains, years later, an anecdote that has never seen external scrutiny. They cite Netflix, even though Netflix has backed away from a purer data approach and now hand-tools micro-genres to make results more meaningful.
It gets worse. As the Chronicle article points out, Google statisticians published the original Nature article on using Google searches to predict flu outbreaks in 2009. Google Flu Trends, the result of that research, is used by public health professionals as one measure of likely flu incidence. That persistent overestimation that was just discovered? Flu Trends was overestimating physician visits by about 100%, that is, reporting roughly double the actual figure. That’s bad. But here’s the kicker: It’s been doing that for three years.
And this, unfortunately, is par for the course. A recent article by an Open University analytics expert cites the Purdue Course Signals experiment, apparently unaware that a large portion of those findings came under substantial critique last year, a critique that raised questions which remain unanswered to this day. Meanwhile the examples used by MOOC executives are either trivial or so naively interpreted that one has to assume that they are deliberately deceiving the public (the alternative, that professors at Stanford do not understand basic issues in research methods, is just too frightening to contemplate). Yet when they are called on these errors, it is generally by the “Big Data is bunk” crowd.
There are people, some of whom I know personally, who are in the happy middle here, believing not that we should support analytics or Big Data but that we should support better practice, period. But there are too few of them.
So here’s my proposal. We’ve all used these anecdotes — Google Flu Trends, Course Signals, Netflix recommendations, Obama’s election — to make a point in a presentation.
Early in this field, that was probably OK. We needed stories, and there wasn’t a whole lot of rigorous work to pull from. But it’s not OK anymore. I’m declaring a moratorium on poorly sourced anecdotes. If you are truly a data person, research the examples you plan to cite. See what has happened in the four or five years since you first saw them. Don’t cite stuff that is questionable or seemingly permanently anecdotal. And if you hear people cite this stuff in a keynote, call them on it. Be the skunk at the party. Because it’s intelligent skunks, not cheerleaders, that this field needs right now.