Data use policies and social media: an appeal

After my post a few weeks ago lamenting Twitter's data use policies, many people reached out to me supporting my position and asking what they could do to help. One of them was Mark Huberty, a fellow political scientist at UC Berkeley. Mark mentioned that many other social scientists had had similar experiences and were worried about the ramifications for their research.

We decided the best way to proceed was to make an appeal to all researchers—not only social scientists—to gather examples of work, and stories, showing how many disciplines are using this data to uncover new aspects of human behavior. This morning, Mark wrote just such an appeal to the POLMETH mailing list, and in an effort to bring it to a larger audience I have reproduced it below:

Greetings,
One of us (Drew Conway) recently found that, although Twitter makes its data open to almost anyone via well-documented interfaces, and although it appears to encourage experimentation with its data, that openness doesn't extend to redistributing the data for replication. This poses serious questions about the use of Twitter-based data for academic research. Twitter has been shown to be an accurate predictor of vote and polling outcomes, and a novel way to measure partisan polarization and communication. But without clear data use policies, research taking advantage of this data may not pass muster with journals requiring the release of replication files, and research progress will be hindered.
We think there is an opportunity here to show interdisciplinary academic interest in the Twitter data, and open a conversation about reasonable data retention and release policies on their part. At present, there appears to be a disconnect between Twitter's analysts, who seem to encourage data use, and its legal and business arm, who are very conservative with Twitter's intellectual property rights. Given this disconnect, Twitter has been inconsistent in its demands on researchers using this data. We're hoping that by pointing out the inconsistency and seeking a reasonable resolution, we can find a suitable outcome for both Twitter's business model and researchers' interests. If done correctly, this might have the potential to become a model for other social media sites of interest to social scientists.
We would like to engage participation from anyone in the PolMeth community who has an interest in this outcome. If you might be interested in participating, please let one of us know. We're only in the early stages of working on this, so we welcome all inquiries, ideas, and concerns.
Thanks for your interest. We will look forward to hearing from you.

We hope that those of you using Twitter data for research will help us in this effort. Please feel free to contact me directly, either by email or in the comments section below. We look forward to hearing from you!

Data ownership, social science, and Twitter's folly

Last night I received the following courtesy email from the good folks at Infochimps:

We received a takedown request from Twitter to remove the dataset you posted to Infochimps which contained raw tweets. Previously this was located at http://www.infochimps.com/datasets/five-days-of-25bahman-tweets.
We're very sorry for the inconvenience this causes. We really value the contribution, but Twitter has asserted claims over the license of this data. Please let me know if you have any questions.

This is in reference to a data set collected by my friend Michael Bommarito of tweets with the hashtag #25bahman, which was being used during protests in Iran earlier this year. Both Michael and I performed some cursory analysis on the tweets, and later I posted the data to Infochimps because the real motivation for its collection and subsequent analysis was to pique other researchers' interest in this rich data set. So, I get this email and I am angry. I am not angry with Infochimps. I do not expect them to contest Twitter's assertion on my behalf, especially when such a large portion of their business seems to rely on ready access to Twitter data. What angers me is the hubris and shortsightedness of Twitter's policy, and its consequences for basic social science research.

Imagine you are an eager graduate student in the social sciences. You want to study human behavior; you are interested in communication, social structure, information dissemination, and crowd behavior. You schedule a meeting with your advisor to discuss research design options, and she launches off into a diatribe about NSF funding, IRB qualifications, and collection methods. It will be months before you can even begin your project, and even if everything goes well your data will be fraught with inaccuracy, missing observations, and outright falsehoods. It's just the nature of the beast.

But wait. You're on Twitter, and you've watched with your own eyes the types of information cascades and communication dynamics that form the foundation of your research. Eureka! Twitter presents an unprecedented opportunity to study human beings, and its openness and technical hooks allow for the systematic capture of this behavior with complete transparency. After the lengthy legal lecture your advisor just gave you, however, it seems like it would be a good idea to review Twitter's terms of service.

You're not a lawyer, but as you scan through the TOS you see very encouraging road signs:

[Screenshots: excerpts from Twitter's terms of service encouraging use of the data]

Fantastic! So, you set off to collect a heap of Twitter data. First, you apply to be whitelisted on the API because you are sure your collection will exceed the pittance of API rate-limits Twitter provides, and as an academic you need a more discerning flow of information than the firehose. Unfortunately, after a few weeks Twitter has denied your whitelisting request and given no reason why. In fact, in the box labeled "Explanation" there is only whitespace staring back at you. Frustrated but not deterred, you spend the next few days designing a script that meets your analytical needs and stays within the bounds of Twitter's API throttling policies. You flip the switch, and shortly thereafter you are watching the rows of your database populate with golden nugget after golden nugget of data. This will make your dissertation!
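The throttled collection script in this story can be sketched in a few lines. Here `fetch_page` is a hypothetical stand-in for a call to Twitter's search API (not a real client library), and the 150 requests/hour figure is roughly the unauthenticated rate limit at the time:

```python
import time

REQUESTS_PER_HOUR = 150  # approximate unauthenticated limit; an assumption
SLEEP_SECONDS = 3600.0 / REQUESTS_PER_HOUR

def collect(fetch_page, n_pages, sleep_seconds=SLEEP_SECONDS):
    """Fetch n_pages of results, pausing between calls so the
    collection stays within the API's throttling policy."""
    rows = []
    for page in range(n_pages):
        rows.extend(fetch_page(page))  # hypothetical API call
        time.sleep(sleep_seconds)
    return rows
```

The point of the sketch is only that respecting the published rate limits is trivial to do in code; the hard part, as the story goes on to show, is everything Twitter does after you've collected the data.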

You go on to write a great paper. It gets accepted by a top journal, which requires that both the underlying data and supporting analytical code be supplied before publication. Peer review and reproducibility are a basic part of science, so you are happy to oblige. The paper appears in the next issue, with the data and code posted online. Just as you are reaching the apex of your young academic career, you receive an email from the journal's editor informing you that Twitter has demanded the data be taken down. Because your paper includes raw tweets, it must either be heavily edited or withdrawn.

In fairness, to my knowledge this exact scenario has never played out in full. But in case it wasn't obvious from the tone, much of the hypothetical story above comes from my own experiences, and I have heard similar stories from researchers who have gone further with Twitter data toward publication and had difficulty. Twitter's logo is a silhouette of a small, harmless bird, but to researchers Twitter is a vulture, waiting until the work of collection and insight has been done only to swoop in and peck the eyes out of science.

Despite its desire to be portrayed as the engine of social change, Twitter's dirty secret is that it fights to prevent people from actually showing evidence of this. Through a combination of opaque adjudications for whitelist appeals, and contradictory and confusing language in the terms of service, Twitter has effectively locked researchers out of their data. Even if you're an academic with the inclination and ability to hack together scripts for collection, as Michael was with the #25bahman data, all of your effort is subject to the whim of Twitter's subjective approval.

In my view, this is a great tragedy of contemporary social science. The academy is very slowly beginning to understand the breadth of research topics that Twitter data can be applied to. In most cases this has been within technical disciplines, like computer science, but the real opportunity for knowledge building is in the social sciences. For it to be successful, however, Twitter needs to allow for reasonable fair use of their raw data in academic research and for this data to be redistributed widely. A simple Google search reveals that such fair use claims have legal precedent, but Twitter needs to proactively move their usage terms to the right side of this argument.

There is no good reason this should not be the case. By all accounts, Twitter as a company spends the vast majority of its effort and resources simply keeping the lights on. Given the level of scaling required to run a service that really should be an open Internet protocol (imagine if the only way you could email was through America Online), this makes sense. So why would you want to impede a group of people who are volunteering to do free research for you? Rather than issuing sporadic cease-and-desist notices, you should be forming a committee to dole out plaques and certificates thanking academics for providing humanity with whatever modicum of knowledge they were able to extract from the tweets, and in the process doing your job for you. Further, no academic is going to attempt to profit from this data: we as a group have no incentive to do so, nor do we have the capacity to store and redistribute the data. Remember, you're the ones working to keep the lights on; we just want a little access to the light.

If you're from Twitter and you're reading this, I welcome a more detailed explanation of the logic. Further, I would be happy to work together to set up a more formalized mechanism by which academics can apply for access to the data. It seems perfectly reasonable to require some explanation of the researcher's intent. Academics do this all the time; it's part of our training. What is not reasonable is a vacuum into which we must shout our requests for permission, only to be denied with no recourse and have our work arbitrarily undermined by your mysterious policies.

Visualizing NetworkX graphs in the browser using D3

During one of our impromptu sprints at SciPy 2011, the NetworkX team decided it would be nice to add the ability to export networks for visualization with the D3 JavaScript library. This would allow people to post their visualizations online very easily. Mike Bostock, the creator and maintainer of D3, also has a wonderful example of how to render a network using a force-directed layout in the D3 examples gallery.

So, we decided to insert a large portion of Mike's code into the development version of NetworkX in order to allow people to quickly export networks to JSON and visualize them in the browser. Unfortunately, I have not had the chance to write any tests for this code, so it is only available in my fork of the main NetworkX repository on Github. But, if you clone this repository and install it you will have the new features (along with an additional example file for building networks for web APIs in NX).

As a quick example, I used the data I had collected on some of my friends' Twitter ego-networks. Below is my friend Mike Dewar's Twitter network visualized in D3 using the new features, along with the supporting code.
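Since the export helpers currently live only in my fork, here is a minimal sketch of the underlying idea: serializing a NetworkX graph into the node-link JSON that D3's force-directed layout consumes, with a `REC` node attribute used for grouping as described below. The function name and details here are illustrative, not the fork's actual API:

```python
import json
import networkx as nx

def d3_json(G, group_attr="REC"):
    """Serialize G as {"nodes": [...], "links": [...]} for D3's
    force-directed layout, which references nodes by list position."""
    nodes = list(G.nodes())
    index = {n: i for i, n in enumerate(nodes)}
    return {
        "nodes": [{"name": str(n), "group": G.nodes[n].get(group_attr, 0)}
                  for n in nodes],
        "links": [{"source": index[u], "target": index[v]}
                  for u, v in G.edges()],
    }

# Toy stand-in for an ego-network: the ego plus two mutual followers
G = nx.Graph()
G.add_edges_from([("mdewar", "a"), ("mdewar", "b"), ("a", "b")])
nx.set_node_attributes(G, {"mdewar": 0, "a": 1, "b": 1}, "REC")
print(json.dumps(d3_json(G)))
```

The resulting JSON can be dropped straight into the `d3.layout.force()` example from the D3 gallery, binding `group` to node color.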

Mike is the red node at the center, and the rest are colored by a basic clustering based on geodesic distance. This coloring comes from the REC node attribute in the NetworkX object, which is just a series of integers used to color the nodes. The function also has the ability to size the edges based on some weighting, but that is not used in the above example.

Most of this is pretty raw, so please do let me know when things blow up, or what additional customizations you would like. For better or worse, D3 lets you do almost anything, so we have to decide what can be edited from within the function calls, and what will require customization on the user end.

Many of you may also be interested in seeing the output of these functions. To view the JavaScript and exported JSON go to the full code for the example here.