What Will 'Data Science' Teach Us?

If the level of online discourse is a good indicator of whether a topic has penetrated the collective nerd consciousness, then the notion of a burgeoning "data science" discipline has taken hold. A few weeks ago I discussed where to draw the line on this idea, but recently I again began thinking about the idea and term more critically. Yesterday, I had a wonderful discussion with a brilliant member of the data community here in New York, which focused on the delicate balance between keeping a human-friendly face on mass quantities of data—something data scientists are meant to do—and having this new discipline make formidable contributions to our general understanding of human behavior.

That is, up to this point, many of the great evangelists of data science have focused on telling stories with data. Science, however, is not about storytelling, but about discovery. Perhaps I am particularly cautious of the suffix "science" because of the awkward self-consciousness the word has imbued in my own discipline. At its roots, political science was a discipline that sought to construct narratives; equal parts history, philosophy, and personal experience. The name "political science," therefore, drew the ire of the "hard science" community, who felt (perhaps with reason) that the word "science" had been appended to the title erroneously, as there were no identifiably scientific aspects to the endeavor. While my discipline has come a long way in its application of the scientific method, and today can much more accurately be called a science, there remains a delicate balance between discovery and storytelling. What, then, can the data science community learn from this experience?

Broadly, all disciplines are measured by their contributions to our understanding of the universe. Data science—by design—is built on measured human activity, and therefore should seek to provide new insight into human behavior. Unfortunately, the current focus of many of the community's members has been a self-congratulatory appraisal of the tools developed to allow for this large-scale measurement and recording. To be a successful discipline, however, the focus must move away from tools and toward questions.

To paraphrase a famous nerd, with great data comes great responsibility; so to begin, the data science community must ask: what questions do mass quantities of measured human existence allow us to address that were never previously possible? The thought alone should be enough to inspire some to begin writing research proposals, but in an effort to contribute to this discussion, here are a few things I hope data science will teach us:

  • How do online discourses manifest in offline behavior? - I study terrorism, and one lingering problem in this area is the threat from so-called online radicalization. That is, to what extent does information obtained online influence individuals to join radical organizations or commit acts of terror? This question, however, applies to many other areas, such as voting and purchasing decisions. As our ability to analyze these discourses increases, perhaps data science will provide some answers.
  • How do we reach the "tipping point"? - Malcolm Gladwell did well to popularize the idea of the tipping point, but since then we have learned too little about how these culminations occur, and what—if any—consistent behavioral features lead to them. Often, these events occur online, where data science may be able to trace the tracks that lead to these phase shifts.
  • What are the ethical limits of personal data analysis? - The rise of massive stores of personal data online has been a boon to the data science community, but it has not come without some trepidation. With intimate knowledge of the tools and processes used to capture and analyze this information, this community is uniquely positioned to contribute to a discussion of the ethical limits of their own work.
  • Do we really consume things differently? - Every day, people make decisions about what they will consume; in terms of purchases, food, information, etc., and conventional wisdom states that these decisions are largely a function of birth cohort, geography, education, etc. Is this really the case? The vast amount of consumption data being generated online may be able to help us understand the most significant indicators of these differences.
  • Can more/better data explain rational irrationality? - Today we learned some of the limits of behavioral economics, which has helped explain instances of seemingly irrational behavior. As the op-ed points out, however, there continue to be many questions the discipline fails to answer. Perhaps, then, explanations of these anomalies can be borne out of data.

I welcome your own thoughts on what data science will teach us, and hope you will share them. Personally, I think this discipline has the potential to generate vast amounts of knowledge, but it must be cautious not to lose sight of the question in the sea of information.

Ten Reasons Why Grad Students Should Blog

Tomorrow is the two-year anniversary of ZIA. In keeping with the tradition started last year, there are some changes afoot for the website itself, but I will keep those under wraps until the actual birthday (wouldn't want to open your gift early, yes?). Rather, today I would like to be more reflective. A few days ago, as the anniversary approached, I began thinking back not just on what I accomplished this year at ZIA but also on how much the blog has provided me. Upon reflection, it occurred to me that this endeavor has been incredibly beneficial. As such, it seemed logical that this would also be the case for many other grad students; which immediately triggered the question: then why do so few do it?

There are a few notable exceptions, but for the most part it is the faculty that partake in blogging. Perhaps this is simply a function of my particular discipline, in which I—admittedly—do most of my blog reading. I have, however, been to many corners of the blogosphere, and at least within my N=1 sample this appears to be a common phenomenon. I welcome others to show me that this is not the case in other disciplines, but even so, more grad students should be publishing online.

As I thought longer about the vacant state of grad student blogging, I wondered if it could be explained as a "they don't know what they don't know" situation. Perhaps, standing on the outside looking in, my fellow grad students simply do not know all of the benefits that can come from participating in an online discourse. To remedy this informational problem, and in an attempt to encourage more grad students to begin blogging, I present (in no particular order) my ten reasons why grad students should blog:

  1. You actually have something to say - This is perhaps the best reason why you should be blogging. One of the most frustrating characteristics of the blogosphere is its inherently infinitesimal signal-to-noise ratio. As a grad student, especially one in a PhD program, you have already been deemed qualified to participate in the discussion at a very high level by a panel of distinguished scholars, i.e., the admissions committee. Why keep all that smart analysis to yourself?
  2. Honing your craft - At its core, graduate school is preparing you to be an active member of the academy. While we may struggle through our preliminary methods classes as we build our technical expertise, it is the application of these tools to interesting research questions that builds successful careers. A blog provides a wonderful lab for experimentation, both in terms of the technical application of methods and toying with research questions in sub-fields of your discipline you may not have otherwise tested.
  3. Establishing an identity - If you are in graduate school to be the "best kept secret in academia," you are making a fatal mistake. As in any other job market, getting the proverbial foot in the door for a job talk at a university is a critical first step. As a graduate student it can be incredibly difficult to navigate the sea of senior faculty, their research agendas, and how they fit into your career goals. Having a blog provides an independent beacon from which you can broadcast your own ideas. Consider this: ZIA is but a tiny blip within the academic blogosphere, but in the last year my CV has been downloaded by over 875 unique visitors, or more than twice a day.
  4. Extending your network outside of academia - Though it is often hard to imagine from within the cozy confines of the ivory tower, there are a lot of brilliant people outside of academia interested in exactly the same things you are. The difficulty, however, is connecting with them. The Internet is a powerful networking device, and if you are willing to put yourself out there these people will seek you out (Kevin Costner knows what I am talking about). Your bona fides are already largely taken care of (see reason #1); now you have to impress the Internet with your brilliant musings.
  5. The faculty in your department will not think less of you - I have been asked several times by fellow grad students some form of the following question: "Weren't you worried what your advisors would think about your blog?" Of course, I never even thought about this question, as I started blogging before actually matriculating to NYU (note that ZIA's anniversary is in early June and most universities begin the Fall semester in late August). This, however, is beside the point. No, I was never worried about what my advisors would think. The things I write about on ZIA are exactly the same kinds of things I say in seminar and write for term papers (in fact, these ideas often flow both ways). Furthermore, the faculty who might actually view blogging in a negative light are also those least likely to ever read your blog.
  6. Instant and broad criticism of your work - Part of the maturation process for any grad student is developing the ability to receive, absorb, and convert criticism. Much of this will come from rote academic traditions contained within the classroom and conferences, but a blog offers an alternative channel for this criticism. Not only will you get criticism from fellow academics, but criticism from non-academics can illuminate aspects of your research that can be improved to allow for broader understanding.
  7. Sharpening your own critical eye - What is the primary thrust of most graduate seminars? Read a series of papers, and spend the next 120 or so minutes tearing them apart. This is meant to help students recognize the difference between good and great work, but also begin to discover where the more fertile patches exist within the landscape of possible research agenda. There are, however, many more papers published in a semester than any one seminar could possibly hope to cover. Also, many seminars are focused on seminal works, not cutting edge research—the same cutting edge research you are most likely already reading in your free time. A blog provides you a platform upon which to criticize this new work, and if you are very lucky (as I have been on a few occasions) an opportunity to interact with the authors in a public forum.
  8. Oh, the places you'll go - A combined effect of reasons #1-5 is you will be given the opportunity to travel all over the world and participate in many conferences, seminars, panels, etc. Without a public voice on the Internet I would have never had the opportunity to present to the Bay Area R User's Group or the University of Michigan's Center for the Study of Complex Systems. As you extend your network outside of academia it will take you to places you could never have thought possible without the blog.
  9. Building technical expertise - Not all of the work you put into your blog will go toward writing noteworthy posts. Some of the effort, particularly at the outset, will be focused on building the actual site. This will require you to learn technical skills you would otherwise never have had the need or desire to learn. This is incredibly useful in and of itself, but these skills can be applied beyond blogging. Considering how charts from a paper will look in the online version (the version most readers will see), for example, is something you may only learn after several iterations of trial-and-error posting to your blog.
  10. It is just plain fun - You are a nerd. You enjoy writing. In many ways, a blog sells itself. But, the additional joy you will feel as you watch your daily hits go up, and the frequency of (non-SPAM) comments increases, will become a powerful motivating force in your day-to-day. A wonderful side effect of which is that the overall quality of your work will also increase, as you become a better writer, researcher and conveyer of complex ideas.

I realize that this will not motivate everyone to navigate over to WordPress and begin their own blogs, but I hope it has helped you understand some of the benefits of having your own presence on the Web. I welcome your own thoughts, either as a grad student blogger, or as someone unmoved by the above reasons.

Where to Draw the Line on 'Data Science'?

I completely agree with Tim O'Reilly: Mike Loukides' post on what is data science is a "seminal, important post." If it has managed to avoid your gaze over the past twenty-four hours, I highly recommend it; if nothing else, it is a 2,000-word massage of the data geek ego, and a nifty tool and who's-who reference to boot. As the latest in a recent series of blog posts and magazine/newspaper articles on the rise of the data scientist, Loukides draws broad strokes on this emerging discipline, covering everything from where the data comes from, to how to manage it, to who is doing great work (kudos for getting quotes from so many excellent members of the data community).

While I think it is important to write and discuss the importance of this field, I think it is equally important that we—the data science community—do not fall into a perpetual cycle of self-admiration and navel gazing. That is, when asking the question, "what is data science," we should also be asking, "what is not data science?" Or, perhaps more appropriately, "What is good data science, and how do I become a good data scientist?" These questions have not been the focus of the discussion thus far, and it is time to start asking them.

Up to this point, the discussion of what data science is has been rather inclusive. As Loukides notes:

In the last few years, there has been an explosion in the amount of data that's available. Whether we're talking about web server logs, tweet streams, online transaction records, "citizen science," data from sensors, government data, or some other source, the problem isn't finding data, it's figuring out what to do with it.

After reading the Loukides piece in the context of what has already been said, I was struck by what appears to be a gradual blurring between what is science and what is now being promoted as data science. As an example, consider the recent adjustment of the estimated amount of oil spilled into the Gulf of Mexico. Using the live video feeds of the spill and satellite imagery, FSU oceanographer Ian R. MacDonald performed "rough calculations" to find that the actual amount of oil being spilled may have been four or five times what the government had estimated. Now, were the calculations performed by Dr. MacDonald data science, or just science? His data came from the streaming ethers of the Internet pointed out by Loukides and others (the external spring from which data science flows), but his primary tools were his own eyes and decades of experience.

Before you accuse me of pedantic folly, my purpose with the MacDonald example is to highlight the fact that good data science is exactly the same as good science. The most meaningful analyses will be borne of a thorough understanding of the data's context, and an acute sense of which questions are the most important to ask. The conversation up to this point, unfortunately, has been far too focused on the data resources themselves and the tools used to approach them. Good data science will never be measured by the terabytes in your Cassandra database, the number of EC2 nodes your jobs are using, or the volume of mappers you can send through a Hadoop instance. Having a lot of data does not license you to have a lot to say about it.

To that end, I have been disappointed in the lack of discussion of how critical the social sciences are to good data science. Loukides quotes LinkedIn's Chief Scientist DJ Patil on who makes the best data scientists:

...the best data scientists tend to be "hard scientists," particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data.

While I have the utmost respect for physicists, like Patil, their discipline is unencumbered by such pesky matters as human free will and fallibility. I happen to know that DJ respects and understands the difference, because I have had the great pleasure of discussing these issues with him, but imagine how much more difficult the so-called hard sciences would be if atoms got to decide their own charge? As data science is fundamentally about gleaning information from the data trail of humans, those with perspective on causality in this context are invaluable. Whereas the keepers of large data stores may be content to run a regression over some set of variables, a good data scientist would first ask what underlying process generated those observations, what is missing, and how that affects the interpretation of results.

My assessment of the current state of data science is best described as cautious optimism. The tools needed to capture the data deluge (as Chris Anderson puts it) have developed at a truly astonishing rate. And though I think those leading the data science charge are brilliant and preeminently capable of continuing its surge, I fear our intuitions about what the data mean have not kept pace, and it may be sooner rather than later that our analyses suffer for it.

Homegrown Terrorism and the Small N Problem

I just finished the new RAND report on homegrown terrorism in the United States, entitled "Would-Be Warriors: Incidents of Jihadist Terrorist Radicalization in the United States Since September 11, 2001," and it is a fascinating analysis of the paths to radicalization taken by American citizens over the past near-decade. The paper is extremely timely given the seemingly sudden rise in domestic radicalization toward jihadism. As the report notes, "the 13 cases in 2009 did indicate a marked increase in radicalization leading to criminal activity, up from an average of about four cases a year from 2002 to 2008." Given this fact, the more recent Faisal Shahzad case, and the overall increase in attacks against the U.S., the salience of homegrown terrorism is as high as ever.

Previously, I have written skeptically about the notion that domestic terrorism is—in fact—on the rise. This apparent trend may be better described as a regression toward the mean level of this activity over a longer time period. To his credit, the author, Brian Michael Jenkins, assuages any alarmist notions of a sudden and abnormal rise in domestic terrorism by reviewing the extensive history of domestic terrorism incidents in the United States during the 1960s and '70s.

After reading the RAND report I was not necessarily dissuaded from my position that the current spike is nothing more than a mean regression; however, I was convinced that the stakes have changed considerably since previous decades and that the subject therefore deserves considerable attention going forward. What the RAND report suffers from, like many other reports on domestic terrorism, is a small-N problem, and in order to study this phenomenon more accurately, efforts must be made to overcome it.

To be clear, having a small number of observations with respect to domestic terrorism and radicalization is a "good" problem. National security benefits from the fact that these are rare events, and we are thankful that this is the case. That said, because the RAND analysis consists of only 46 observations over an 8-year period, any conclusions must be tempered by this fact. For example, when describing who the terrorists are, the author states (emphasis mine):

Information on national origin or ethnicity is available for 109 of the identified homegrown terrorists. The Arab and South Asian immigrant communities are statistically overrepresented in this small sample, but the number of recruits is still tiny. There are more than 3 million Muslims in the United States, and few more than 100 have joined jihad—about one out of every 30,000—suggesting an American Muslim population that remains hostile to jihadist ideology and its exhortations to violence.

We know, however, that this final assertion is not true; specifically, with regard to the numbers. The numbers, at best, support only the claim that domestic radicalization is very rarely observed. They do not suggest anything about the internal disposition of American Muslims. While that disposition may in fact be hostile to jihadist ideology, simply failing to observe a phenomenon cannot support the claim. The cliché "the absence of evidence is not evidence of absence" is particularly applicable to small-N problems. If we are actually interested in understanding the sentiment of American Muslims, then traditional survey work would be quite applicable.

Clearly, the primary problem is that because these are rare events we simply do not have enough data to build good statistical models. As such, whenever endeavoring to study this subject, every attempt should be made to retain as much applicable data as possible. In the case of the RAND report this was not done, as the data were thinned to include only those cases that resulted in indictments in the U.S. or abroad. While this may seem a minimal limitation, the underlying assumption is that the paths and intents of radicalization are somehow different for those who are indicted versus those who are not. This seems dubious at best; a better approach would be to include all possible observations, and then use a more theoretically unbiased method for data pruning (such as coarsened exact matching) to isolate the observations of interest. Thinning on indictments follows a troubling trend in terrorism studies of selecting on the dependent variable.
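To make the matching idea concrete, here is a minimal sketch of the coarsened-exact-matching logic in Python. The covariates (`age`, `years_online`), the cutpoints, and the toy records are all hypothetical, and real applications would use dedicated software (the CEM authors distribute an R package), but the core idea is just this: coarsen each covariate into bins, then retain only strata containing both indicted and non-indicted cases.

```python
from collections import defaultdict

def coarsen(value, cutpoints):
    """Map a numeric covariate to the index of its coarsened bin."""
    for i, cut in enumerate(cutpoints):
        if value < cut:
            return i
    return len(cutpoints)

def cem_prune(records, schema):
    """Keep only observations whose coarsened-covariate stratum
    contains both indicted and non-indicted cases."""
    strata = defaultdict(list)
    for rec in records:
        # Stratum key: tuple of bin indices, one per covariate.
        key = tuple(coarsen(rec[cov], cuts) for cov, cuts in schema.items())
        strata[key].append(rec)
    kept = []
    for group in strata.values():
        if {r["indicted"] for r in group} == {True, False}:
            kept.extend(group)
    return kept

# Hypothetical toy data: two covariates plus an indictment flag.
data = [
    {"age": 22, "years_online": 3, "indicted": True},
    {"age": 24, "years_online": 2, "indicted": False},
    {"age": 45, "years_online": 1, "indicted": True},  # no matched control
]
matched = cem_prune(data, {"age": [30], "years_online": [5]})
```

In this toy example the third record falls in a stratum with no non-indicted counterpart, so it is pruned; the point is that pruning is driven by covariate balance rather than by the outcome itself.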

PolNet 2010 and the Cult of ERGM

I returned to NYC on Friday from the Political Networks conference, but have only now had a chance to reflect. Charli Carpenter, of the always excellent Duck of Minerva, has already made many great points about what large conferences could learn from niche conferences through her experience at PolNets (who's that guy imbibing in that photo, anyway?). I agree with much of what Charli points out, and overall thoroughly enjoyed the conference. I think the combination of the low visibility of these methods within the discipline as a whole and the high energy among those actually interested in networks resulted in a very top-heavy set of presentations.

A clear advantage of a conference like PolNets is that rather than having a specific substantive focus at its core—like so many smaller conferences—the focus here was on a methodological technology. With that, there is less need during presentations for people to "sell" their method, because everyone in attendance has essentially signaled acceptance by being there. Therefore, more of the discussion centered on the substantive implications of applying network theory to some research agenda, or on specific methodological quibbles. This is all well and good; add to this the fact that a small number of attendees means graduate students and young scholars have ample opportunity to discuss their work with more established academics.

While I have studied networks for several years, this was actually my first conference on the subject. I do, however, try to stay rather current on the literature, and as such came to the conference expecting the topics covered to be wide-ranging, both in terms of network methods and their applications to political science. Perhaps due to my own naivety, or willful ignorance, I was disappointed to find that this was not the case.

On the former point, from what I observed at PolNets it seems that the social science networks community is rapidly forming into a cult of the exponential random graph model (ERGM) framework. In some ways this makes perfect sense. ERGMs are—for lack of a better term—statistical models that describe networks and allow some degree of inference to be drawn about their structure. This can be extremely useful for social scientists, as it describes networks in familiar statistical terms. What was surprising was the wholesale, and often unquestioning, commitment to these models for all types of analysis within the social sciences. In fact, one of the creators of ERGM went so far as to call it the lingua franca of all network models. To be clear, mathematically ERGMs can produce all possible networks; in practice, however, this is akin to saying that all the works of Shakespeare could be reproduced in Morse code. While technically possible, it would be a fool's errand. The ERGM framework has significant computational limitations, reinforced by several presenters' admissions that model estimation took weeks on even moderately sized networks.
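To see why estimation is so costly, recall the ERGM's basic form: the probability of a graph G is exp(theta * s(G)) divided by a normalizing constant that sums over every possible graph on n nodes. The toy Python sketch below brute-forces that constant for a single edge-count statistic; real ERGM software (such as the statnet suite in R) relies on MCMC precisely to avoid this enumeration, which requires 2^(n(n-1)/2) terms and is hopeless beyond a handful of nodes.

```python
import itertools
import math

def edge_ergm_normalizer(n, theta):
    """Brute-force the ERGM normalizing constant Z(theta) for an
    undirected n-node graph with a single edge-count statistic,
    by enumerating all 2**m possible graphs (m = possible edges)."""
    m = n * (n - 1) // 2
    return sum(math.exp(theta * sum(bits))
               for bits in itertools.product((0, 1), repeat=m))

n, theta = 5, 0.5
z_brute = edge_ergm_normalizer(n, theta)       # 2**10 = 1,024 graphs
# The edge-only model has a closed form, Z = (1 + e**theta)**m,
# which lets us check the enumeration on this tiny graph.
z_closed = (1 + math.exp(theta)) ** (n * (n - 1) // 2)
```

Five nodes means summing over 1,024 graphs; fifty nodes would mean 2^1225 of them. Richer statistics (triangles, k-stars) lose the closed form entirely, which is why even MCMC-based estimation can take days or weeks on moderately sized networks.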

While there were a few notable exceptions (best exemplified by the presenters on the Innovations in Network Measurement panel), I would have liked to see more research not just extending the ERGM framework, but also stepping outside of it to build models that describe the massively complex networks now commonplace in disciplines outside the social sciences. My fear is that networks in the social sciences will become a "one-trick pony," and a pony that is itself incredibly hampered by current technology.

With respect to the breadth of application in political science, I was impressed by the diversity of topics covered by the panels. I was disappointed, however, by the actual representation of political scientists at the conference. While I am fully aware that the study of networks is highly interdisciplinary, and that political science as a discipline is a very late adopter of this technology, it would have been encouraging to see more card-carrying APSA political scientists among the attendees. For example, on the second day of the conference a "panel of experts" convened to field questions from anyone who cared to pose one. The problem: there was not a political scientist among the experts, making it hard to ask pointed questions about networks in political science.

As I said, though, overall the conference was excellent, and I extend my thanks and congratulations to Mike Ward of Duke University for putting on such a great event. Next stop: Sunbelt 2010!