Where to Draw the Line on 'Data Science'?

I completely agree with Tim O'Reilly. Mike Loukides' post on what is data science is a "seminal, important post." If it has managed to avoid your gaze over the past twenty-four hours, I highly recommend it; if nothing else, it is a 2,000-word massage of the data geek ego, and a nifty tool and who's-who reference to boot. The post is the latest in a recent series of blog posts and magazine/newspaper articles on the rise of the data scientist. In it, Loukides paints this emerging discipline in broad strokes, covering everything from where the data comes from, to how to manage it, to who is doing great work (kudos for getting quotes from so many excellent members of the data community).

While I think it is valuable to write about and discuss this field, I think it is equally important that we, the data science community, do not fall into a perpetual cycle of self-admiration and navel gazing. That is, when asking the question, "What is data science?" we should also be asking, "What is not data science?" Or, perhaps more appropriately, "What is good data science, and how do I become a good data scientist?" These questions have not been the focus of the discussion thus far, and it is time to start asking them.

Up to this point the discussion of what is data science has been rather inclusive. As Loukides notes:

In the last few years, there has been an explosion in the amount of data that's available. Whether we're talking about web server logs, tweet streams, online transaction records, "citizen science," data from sensors, government data, or some other source, the problem isn't finding data, it's figuring out what to do with it.

After reading the Loukides piece in the context of what has already been said, I was struck by what appears to be a gradual blurring between what is science and what is now being promoted as data science. As an example, consider the recent adjustment of the estimated amount of oil spilled into the Gulf of Mexico. Using the live video feeds of the spill and satellite imagery, FSU oceanographer Ian R. MacDonald performed "rough calculations" to find that the actual amount of oil being spilled may have been four or five times what the government had estimated. Now, were the calculations performed by Dr. MacDonald data science, or just science? His data came from the streaming ethers of the Internet pointed out by Loukides and others, the external spring from which data science flows, but his primary tools were his own eyes and decades of experience.

Before you accuse me of pedantic folly, my purpose with the MacDonald example is to highlight the fact that good data science is exactly the same as good science. The most meaningful analyses will be born from a thorough understanding of the data's context, and an acute sense of which questions are the most important to ask. The conversation up to this point, unfortunately, has been far too focused on the data resources themselves and the tools used to approach them. Good data science will never be measured by the terabytes in your Cassandra database, the number of EC2 nodes your jobs are using, or the volume of mappers you can send through a Hadoop instance. Having a lot of data does not license you to have a lot to say about it.

To that end, I have been disappointed by the lack of mention of how critical the social sciences are to good data science. Loukides quotes LinkedIn's Chief Scientist DJ Patil in reference to who makes the best data scientist:

...the best data scientists tend to be "hard scientists," particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data.

While I have the utmost respect for physicists like Patil, their discipline is unencumbered by such pesky matters as human free will and fallibility. I happen to know that DJ respects and understands the difference because I have had the great pleasure of discussing these issues with him, but imagine how much more difficult the so-called hard sciences would be if atoms got to decide their own charge. As data science is fundamentally about gleaning information from the data trail of humans, those with perspective on causality in this context are invaluable. While those with large data stores may be eager to run a regression over some set of variables, a good data scientist would first wonder what underlying process generated those observations, what is missing, and how that affects the interpretation of the results.

My assessment of the current state of data science is best described as cautious optimism. The tools needed to capture the data deluge (as Chris Anderson puts it) have developed at a truly astonishing rate. And though I think those leading the data science charge are brilliant and preeminently capable of continuing its surge, I fear our intuitions about what the data mean have not kept pace, and it may be sooner rather than later that our analyses suffer for it.