Taleb on the Problems of Big Data

February 20th, 2013

Nassim Taleb, no fan of big data, writes in Wired (of all places!) to highlight some of the shortcomings of putting too much faith in data.

But beyond that, big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal). It’s a property of sampling: In real life there is no cherry-picking, but on the researcher’s computer, there is. Large deviations are likely to be bogus.

We used to have protections in place for this kind of thing, but big data makes spurious claims even more tempting. And fewer and fewer papers today have results that replicate: Not only is it hard to get funding for repeat studies, but this kind of research doesn’t make anyone a hero. Despite claims to advance knowledge, you can hardly trust statistically oriented sciences or empirical studies these days.

This is not all bad news though: If such studies cannot be used to confirm, they can be effectively used to debunk — to tell us what’s wrong with a theory, not whether a theory is right.

I am not saying here that there is no information in big data. There is plenty of information. The problem — the central issue — is that the needle comes in an increasingly larger haystack.