Data ownership, social science, and Twitter's folly

Last night I received the following courtesy email from the good folks at Infochimps:

We received a takedown request from Twitter to remove the dataset you posted to Infochimps which contained raw tweets. Previously this was located at http://www.infochimps.com/datasets/five-days-of-25bahman-tweets.
We're very sorry for the inconvenience this causes. We really value the contribution, but Twitter has asserted claims over the license of this data. Please let me know if you have any questions.

This is in reference to a data set collected by my friend Michael Bommarito on tweets with the hashtag #25bahman, which was being used during protests in Iran early this year. Both Michael and I performed some cursory analysis on the tweets, and later I posted the data to Infochimps because the real motivation for its collection and subsequent analysis was to pique other researchers' interest in this rich data set. So, I get this email and I am angry. I am not angry with Infochimps. I do not expect them to contest Twitter's assertion on my behalf, especially when it seems like such a large portion of their business relies on ready access to Twitter data. What angers me is the hubris and shortsightedness of Twitter's policy, and its consequences for basic social science research.

Imagine you are an eager graduate student in the social sciences. You want to study human behavior; you are interested in communication, social structure, information dissemination, and crowd behavior. You schedule a meeting with your advisor to discuss research design options, and she launches into a diatribe about NSF funding, IRB qualifications, and collection methods. It will be months before you can even begin your project, and even if everything goes well your data will be fraught with inaccuracy, missing observations, and outright falsehoods. It's just the nature of the beast.

But wait. You're on Twitter, and you've watched with your own eyes the types of information cascades and communication dynamics that form the foundation of your research. Eureka! Twitter presents an unprecedented opportunity to study human beings, and its openness and technical hooks allow for the systematic capture of this behavior with complete transparency. After the lengthy legal lecture your advisor just gave you, however, it seems like it would be a good idea to review Twitter's terms of service.

You're not a lawyer, but as you scan through the TOS you see very encouraging road signs:

[Images: twitter_tip.png, twitter_tip2.png]

Fantastic! So, you set off to collect a heap of Twitter data. First, you apply to be whitelisted on the API because you are sure your collection will exceed the pittance of API rate limits Twitter provides, and as an academic you need a more discerning flow of information than the firehose. Unfortunately, after a few weeks Twitter has denied your API whitelisting request and given you no reason why. In fact, in the box labeled "Explanation" there is only whitespace staring back at you. Frustrated but not deterred, you spend the next few days designing a script that meets your analytical needs and stays within the bounds of Twitter's API throttling policies. You flip the switch, and shortly thereafter you are watching the rows of your database populate with golden nugget after golden nugget of data. This will make your dissertation!

You go on to write a great paper. It gets accepted by a top journal, which has a requirement that both the underlying data and supporting analytical code be supplied before publication. Peer review and reproducibility are a basic part of science, so you are happy to oblige. The paper appears in the next issue, with the data and code posted online. Just as you are reaching the apex of your young academic career you receive an email from the journal's editor informing you that Twitter has demanded the data be taken down. Because your paper includes raw tweets, it must either be massively edited or withdrawn.

In fairness, to my knowledge this exact scenario has never played out in full. But in case it wasn't obvious from the tone, much of the hypothetical story above comes from my own experiences, and I have heard similar stories from researchers who have gone further with Twitter data for publication and had difficulty. Twitter's logo is a silhouette of a small harmless bird, but to researchers Twitter is a vulture waiting until the task of collection and insight has been done just to swoop in and peck the eyes out of science.

Despite its desire to be portrayed as the engine of social change, Twitter's dirty secret is that it fights to prevent people from actually showing evidence of this. Through a combination of opaque adjudications of whitelist requests and contradictory, confusing language in the terms of service, Twitter has effectively locked researchers out of its data. Even if you're an academic with the inclination and ability to hack together scripts for collection, as Michael was with the #25bahman data, all of your effort is subject to the whim of Twitter's subjective approval.

In my view, this is a great tragedy of contemporary social science. The academy is very slowly beginning to understand the breadth of research topics that Twitter data can be applied to. In most cases this has been within technical disciplines, like computer science, but the real opportunity for knowledge building is in the social sciences. For it to be successful, however, Twitter needs to allow for reasonable fair use of its raw data in academic research and for this data to be redistributed widely. A simple Google search reveals that such fair use claims have legal precedent, but Twitter needs to proactively move its usage terms to the right side of this argument.

There are literally no reasons this should not be the case. By all accounts, Twitter as a company spends the vast majority of its effort and resources simply keeping the lights on. Given the level of scaling required to run a service that really should be an open Internet protocol (imagine if the only way you could email was through America Online), this makes sense. As such, why would you want to impede a group of people who are volunteering to do free research for you? Rather than issue sporadic cease and desist notices, you should be forming a committee to dole out plaques and certificates thanking academics for providing humanity with whatever modicum of knowledge they were able to extract from the tweets and, in the process, doing your job for you. Further, no academic is going to attempt to profit from this data. We as a group have no incentive to do so, nor do we have the capacity to store and redistribute the data. Remember, you're the ones working to keep the lights on; we just want a little access to the light.

If you're from Twitter and you're reading this, I welcome a more detailed explanation of the logic. Further, I would be happy to work together to set up a more formalized mechanism by which academics can apply for access to the data. It seems perfectly reasonable to require some explanation of the researcher's intent. Academics do this all the time; it's part of our training. What is not reasonable is to have a vacuum into which we must ask for permission, only to be denied with no recourse and have our work randomly undermined by your mysterious policies.