Data ownership, social science, and Twitter's folly

Last night I received the following courtesy email from the good folks at Infochimps:

We received a takedown request from Twitter to remove the dataset you posted to Infochimps which contained raw tweets. Previously this was located at http://www.infochimps.com/datasets/five-days-of-25bahman-tweets.
We're very sorry for the inconvenience this causes. We really value the contribution, but Twitter has asserted claims over the license of this data. Please let me know if you have any questions.

This is in reference to a data set collected by my friend Michael Bommarito on tweets with the hashtag #25bahman, which was being used during protests in Iran early this year. Both Michael and I performed some cursory analysis on the tweets, and later I posted the data to Infochimps because the real motivation for its collection and subsequent analysis was to pique other researchers' interest in this rich data set. So, I get this email and I am angry. I am not angry with Infochimps. I do not expect them to push back against Twitter's assertion on my behalf, especially when such a large portion of their business relies on ready access to Twitter data. What angers me is the hubris and shortsightedness of Twitter's policy, and its consequences for basic social science research.

Imagine you are an eager graduate student in the social sciences. You want to study human behavior; you are interested in communication, social structure, information dissemination, and crowd behavior. You schedule a meeting with your advisor to discuss research design options, and she launches into a diatribe about NSF funding, IRB qualifications, and collection methods. It will be months before you can even begin your project, and even if everything goes well your data will be fraught with inaccuracy, missing observations, and outright falsehoods. It's just the nature of the beast.

But wait. You're on Twitter, and you've watched with your own eyes the types of information cascades and communication dynamics that form the foundation of your research. Eureka! Twitter presents an unprecedented opportunity to study human beings, and its openness and technical hooks allow for the systematic capture of these dynamics with complete transparency. After the lengthy legal lecture your advisor just gave you, however, it seems like a good idea to review Twitter's terms of service.

You're not a lawyer, but as you scan through the TOS you see very encouraging road signs:

[Screenshots of encouraging language from Twitter's Terms of Service]

Fantastic! So, you set off to collect a heap of Twitter data. First, you apply to be whitelisted on the API, because you are sure your collection will exceed the pittance of API rate limits Twitter provides, and as an academic you need a more discerning flow of information than the firehose. Unfortunately, after a few weeks Twitter denies your API whitelisting request and gives no reason why. In fact, in the box labeled "Explanation" there is only whitespace staring back at you. Frustrated but not deterred, you spend the next few days designing a script that meets your analytical needs and stays within the bounds of Twitter's API throttling policies. You flip the switch, and shortly thereafter you are watching the rows of your database populate with golden nugget after golden nugget of data. This will make your dissertation!
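To make the scenario concrete, here is a minimal sketch of what such a collection script might look like. It is an illustration only: the endpoint reflects the Twitter search API of that era, and the query, page count, and sleep interval are hypothetical placeholders, not the script Michael or I actually ran.

    import time
    import requests  # third-party HTTP library

    # Historical Twitter search API endpoint; parameters below are
    # hypothetical placeholders for illustration.
    SEARCH_URL = "http://search.twitter.com/search.json"

    def collect_tweets(query="#25bahman", pages=10, pause=30):
        """Page through search results, sleeping between requests to
        stay under the published rate limits."""
        tweets = []
        for page in range(1, pages + 1):
            resp = requests.get(SEARCH_URL,
                                params={"q": query, "rpp": 100, "page": page})
            if resp.status_code != 200:
                break  # back off on throttling or errors
            tweets.extend(resp.json().get("results", []))
            time.sleep(pause)  # be a polite API citizen
        return tweets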

You go on to write a great paper. It gets accepted by a top journal, which requires that both the underlying data and supporting analytical code be supplied before publication. Peer review and reproducibility are a basic part of science, so you are happy to oblige. The paper appears in the next issue, with the data and code posted online. Just as you are reaching the apex of your young academic career, you receive an email from the journal's editor informing you that Twitter has demanded the data be taken down. Because your paper includes raw tweets, it must either be massively edited or withdrawn.

In fairness, to my knowledge this exact scenario has never played out in full. But in case it wasn't obvious from the tone, much of the hypothetical story above comes from my own experiences, and I have heard similar stories from researchers who have gone further with Twitter data for publication and had difficulty. Twitter's logo is a silhouette of a small harmless bird, but to researchers Twitter is a vulture, waiting until the hard work of collection and insight has been done just to swoop in and peck the eyes out of science.

Despite its desire to be portrayed as the engine of social change, Twitter's dirty secret is that it fights to prevent people from actually showing evidence of this. Through a combination of opaque adjudications for whitelist appeals, and contradictory and confusing language in the terms of service, Twitter has effectively locked researchers out of their data. Even if you're an academic with the inclination and ability to hack together scripts for collection, as Michael was with the #25bahman data, all of your effort is subject to the whim of Twitter's subjective approval.

In my view, this is a great tragedy of contemporary social science. The academy is very slowly beginning to understand the breadth of research topics to which Twitter data can be applied. In most cases this has been within technical disciplines, like computer science, but the real opportunity for knowledge building is in the social sciences. For it to be successful, however, Twitter needs to allow for reasonable fair use of its raw data in academic research, and for this data to be redistributed widely. A simple Google search reveals that such fair use claims have legal precedent, but Twitter needs to proactively move its usage terms to the right side of this argument.

There is literally no reason this should not be the case. By all accounts, Twitter as a company spends the vast majority of its effort and resources simply keeping the lights on. Given the level of scaling required to run a service that really should be an open Internet protocol (imagine if the only way you could send email was through America Online), this makes sense. As such, why would you want to impede a group of people who are volunteering to do free research for you? Rather than issue sporadic cease-and-desist notices, you should be forming a committee to dole out plaques and certificates thanking academics for providing humanity with whatever modicum of knowledge they were able to extract from the tweets, and in the process doing your job for you. Further, no academic is going to attempt to profit from this data. We as a group have no incentive to do so, nor do we have the capacity to store and redistribute the data at scale. Remember, you're the ones working to keep the lights on; we just want a little access to the light.

If you're from Twitter and you're reading this, I welcome a more detailed explanation of the logic. Further, I would be happy to work together to set up a more formal mechanism by which academics can apply for access to the data. It seems perfectly reasonable to require some explanation of the researcher's intent; academics do this all the time, it's part of our training. What is not reasonable is to have a vacuum into which we must ask for permission, only to be denied with no recourse and have our work randomly undermined by your mysterious policies.

Visualizing NetworkX graphs in the browser using D3

During one of our impromptu sprints at SciPy 2011, the NetworkX team decided it would be nice to add the ability to export networks for visualization with the D3 JavaScript library. This would allow people to post their visualizations online very easily. Mike Bostock, the creator and maintainer of D3, also has a wonderful example of how to render a network using a force-directed layout in the D3 examples gallery.

So, we decided to adapt a large portion of Mike's code into the development version of NetworkX in order to allow people to quickly export networks to JSON and visualize them in the browser. Unfortunately, I have not had the chance to write any tests for this code, so it is only available in my fork of the main NetworkX repository on GitHub. But if you clone this repository and install it, you will have the new features (along with an additional example file for building NetworkX networks from web APIs).
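For those curious about the shape of the export, the core idea is to serialize the graph into the node-link JSON format that D3's force layout consumes. The sketch below uses the json_graph helpers from NetworkX to show the idea; the actual function names in my fork may differ, and the graph is a toy stand-in.

    import json
    import networkx as nx
    from networkx.readwrite import json_graph

    # Build a toy graph and serialize it to the {"nodes": [...], "links": [...]}
    # structure that D3's force-directed layout example expects.
    G = nx.barbell_graph(5, 2)
    data = json_graph.node_link_data(G)

    with open("network.json", "w") as f:
        json.dump(data, f)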

As a quick example, I used the data I had collected on some of my friends' Twitter ego-networks to show some of the new features. Below is my friend Mike Dewar's Twitter network visualized in D3 using the new features, with the supporting code.

Mike is the red node at the center, and the rest are colored by a basic clustering based on geodesic distance. This coloring comes from the REC node attribute in the NetworkX object, which is just a series of integers used to color the nodes. The function also has the ability to size the edges based on some weighting, but that is not used in the above example.
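The REC attribute itself is easy to construct. As a hypothetical illustration (the toy graph, ego node, and attribute values here are assumptions, not the actual ego-network data), a geodesic-distance coloring amounts to something like:

    import networkx as nx

    # Toy stand-in for a Twitter ego network.
    G = nx.krackhardt_kite_graph()
    ego = 0  # hypothetical ego node at the center

    # Geodesic distance from the ego gives a simple integer "clustering"
    # that can be stored as the REC attribute and used to color nodes.
    dist = nx.single_source_shortest_path_length(G, ego)
    nx.set_node_attributes(G, dist, "REC")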

Most of this is pretty raw, so please do let me know when things blow up, or what additional customizations you would like. For better or worse, D3 lets you do almost anything, so we have to decide what can be edited from within the function calls, and what will require customization on the user's end.

Many of you may also be interested in seeing the output of these functions. To view the JavaScript and exported JSON go to the full code for the example here.

A year of Chicago's crime, in 30 seconds

Yesterday Brett Goldstein, the Chief Data Officer for the City of Chicago, announced on Twitter the release of Chicago's crime data for the past year. The data is very detailed, and a wonderful resource for criminologists and social scientists alike.

I have been playing around with the data a bit, and have produced an animation that explores the geospatial nature of the data. Similar to what Mike Dewar did with the Afghanistan War Logs, I wanted to show variation over time rather than simple aggregates. To do so, I decided to plot moving 10-day windows of the data on a map of Chicago's police districts. Moreover, I wanted to show the regional trends of different kinds of crime throughout Chicago.

The map below shows these 10-day windows, with the crime types color coded. The boundary lines on the map indicate police districts. Each dot represents a crime, and the opacity of each dot corresponds to the number of crimes of that type reported in that geographic location on that day. For example, a dark shaded pink dot would indicate an area of heavy theft. Because there are a large number of crime types in the data, I restricted this animation to the top 18 crime types: those with over 1,000 total incidents.
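For those who want to reproduce the windowing, the logic is straightforward. The animation itself was built with R and ggplot2 (see the postscript below), but here is a rough Python/pandas sketch of the same idea; the file name and column names are assumptions based on the City of Chicago data portal, not my actual code.

    import pandas as pd

    # Assumed file and column names from the Chicago crime data portal.
    crimes = pd.read_csv("chicago_crimes.csv", parse_dates=["Date"])

    # Keep only the crime types with more than 1,000 total incidents.
    counts = crimes["Primary Type"].value_counts()
    crimes = crimes[crimes["Primary Type"].isin(counts[counts > 1000].index)]

    # Slide a 10-day window over the data, one day at a time.
    window = pd.Timedelta(days=10)
    t = crimes["Date"].min()
    while t + window <= crimes["Date"].max():
        frame = crimes[(crimes["Date"] >= t) & (crimes["Date"] < t + window)]
        # ... plot `frame` by Longitude/Latitude, colored by "Primary Type",
        # with dot opacity scaled by local incident counts; save the plot
        # as one frame of the animation.
        t += pd.Timedelta(days=1)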

This visual technique provides insight into the intensity of various crime types across regions of Chicago. In addition, the timeline below the map highlights the current chronological window being plotted, as well as the total density of crimes for the entire data set. The color codes in this timeline correspond to those on the map.

There is a lot going on here, so it is best viewed in full-screen mode...

For me, there are lots of interesting observations:

  1. There actually appears to be very little variation in both the volume and location of crime. Downtown Chicago is consistently plagued by a high degree of theft, while burglary is much more frequent in the south of Chicago. Also, the density timeline shows very little change in volume over time.
  2. Crimes are symbiotic. That is, it seems certain types of crimes coexist quite well, such as narcotics and prostitution. This is exemplified by the prominent north-eastern ring.
  3. People do not like to commit crimes in the cold. Conventional wisdom supports this, and Chicago appears to be no different. Though there is very little variation, there is a slight dip in the overall number of crimes from December through February.
  4. If you look closely, you can actually see the formation of roads where crimes occur, particularly those that lead in and out of downtown.

Not being a Chicagoan, I would welcome alternative observations from those with a better understanding of the geography and crime trends.

P.S., The above animation was made entirely with open source tools: R, ggplot2, ImageMagick, and ffmpeg.

Code available here

The things that keep me from blogging

I was recently at a conference where my friend Pete Skomoroch confessed to the audience that he was a "bad blogger," because he had not blogged in several months. Pete has been reasonably distracted as of late, so his absence is completely understandable. His comment, however, caused a sudden wave of guilt to wash over me as I sat in the audience. I am a very bad blogger, as it has been over three weeks since my last post, which admittedly was more of an announcement than an actual blog post.

Though not to the level of Pete, I too have been preoccupied. Over the last several weeks I have worked through some large projects and milestones. Two of these may be of general interest. First, John Myles White and I submitted half of the chapters for our upcoming O'Reilly book Machine Learning for Hackers to our editor. At the risk of sounding self-congratulatory, I am really impressed by what we have managed to pull together thus far and am really looking forward to getting the text out. I think teaching machine learning concepts algorithmically and motivating each method with a case study will appeal to a broader audience. There may be a mini-eBook version of the first half of the book out before the full text is complete, so be on the lookout for that announcement.

The second big announcement is that as of the end of this semester I have fulfilled all requirements for my PhD other than the dissertation. In academic jargon, I am now "all but dissertation" (ABD), and am beginning the final leg of this scholarly adventure. I mention this for two reasons: first, so that you may shower me with the appropriate level of congratulations; and second, because part of my dissertation work will require the use of some new or novel social network data.

My most recent work on modeling network evolution with graph motifs has a serious deficiency: real data. To move this research from an abstract idea to something that makes a meaningful contribution to the social sciences I need to apply it to real data. Unfortunately, there is a dearth of real dynamic social network data—especially related to terrorist or criminal organizations. As such, I am putting out the call to my readership. If you, or someone you know, is working with a dynamic social network data set please contact me.

I know from my traffic logs that ZIA gets a fair amount of traffic from both academic and government readers. If any of you have data like this and are interested in working together I would love to chat. I am easy to get in touch with, so please let me know if you are interested.

Finally, I look forward to getting back into a regular blogging routine. There are so many fascinating things happening in the world related to conflict, terrorism, and data that it is hard to know where to begin. It should be an interesting summer!