Back in May I gave a short talk to the Data Driven NYC meetup entitled, "WARNING: Do not feed the wildebeests."
The focus of the talk was to encourage those in a position to hire data analysts and data scientists to consider social scientists, rather than the "PhD in math, statistics, computer science, or equivalent" that has become the de facto requirement in data science job listings.
Stand up and take the pledge: I will not feed the wildebeests!
I finally got around to uploading a paper on the research I have been doing for the past year on using large-scale, non-expert coders to derive data from political text. Enjoy!
Methods for Collecting Large-scale Non-expert Text Coding
The task of coding text for discrete categories or quantifiable scales is a classic problem in political science. Traditionally, this task is executed by qualified "experts". While productive, this method is time consuming, resource intensive, and can introduce bias. In the following paper I present the findings from a series of experiments developed to assess the viability of using crowd-sourcing platforms for political text coding, and how variations in the collection mechanism affect the quality of output. To do this, the labor pool available on Amazon's Mechanical Turk platform was asked to identify policy statements and positions from a text corpus of party manifestos. To evaluate the quality of the non-expert codings, this text corpus was also coded by multiple experts for comparison. The evidence from these experiments shows that crowd-sourcing is an effective alternative means of generating quantitative categorizations from text. The presence of a filter on workers increases the quality of output, but variations in that filter have little effect. The primary weakness of the non-experts participating in these experiments is their systematic inability to identify texts that contain no policy statement.
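For readers curious about the basic mechanics of comparing crowd output to an expert benchmark, here is a minimal sketch. The labels, the majority-vote aggregation rule, and the toy data are all illustrative assumptions, not the scheme used in the paper:

```python
from collections import Counter

def majority_label(codes):
    """Return the modal label among the non-expert codes for one sentence."""
    return Counter(codes).most_common(1)[0][0]

# Hypothetical toy data: sentence id -> list of non-expert labels,
# and sentence id -> consensus expert label.
crowd_codes = {
    "s1": ["economic", "economic", "social", "economic", "economic"],
    "s2": ["social", "social", "social", "economic", "social"],
    "s3": ["none", "economic", "economic", "none", "economic"],
}
expert_codes = {"s1": "economic", "s2": "social", "s3": "none"}

# Aggregate each sentence's crowd labels by majority vote, then measure
# simple percent agreement with the expert benchmark.
aggregated = {sid: majority_label(codes) for sid, codes in crowd_codes.items()}
agreement = sum(aggregated[sid] == expert_codes[sid]
                for sid in expert_codes) / len(expert_codes)
print(f"Crowd-expert agreement: {agreement:.2f}")
```

In practice one would use a chance-corrected reliability statistic rather than raw percent agreement, but the aggregate-then-compare structure is the same.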
A couple of months ago I experimented with a slope-graph, and did some basic cluster analysis to determine popularity tiers among the languages. I liked the tiers, but the visualization seemed to muddle things more than enhance them. I actually really like the simplicity of the original scatter-plot, but I wished I could get more information on the trends in languages from the chart.
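If you want to try the tiering yourself, the idea is just to cluster languages on their joint popularity ranks. This is a minimal sketch using k-means; the algorithm choice, the sample languages, and the rank values are my illustrative assumptions, not the exact analysis behind the original tiers:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical sample: (github_rank, stackoverflow_rank) per language.
languages = ["JavaScript", "Python", "Java", "Haskell", "COBOL"]
ranks = np.array([[1, 1], [3, 2], [2, 3], [25, 30], [60, 55]], dtype=float)

# Group languages into k popularity tiers based on their joint ranks.
k = 3
tiers = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(ranks)
for lang, tier in zip(languages, tiers):
    print(f"{lang}: tier {tier}")
```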
I have been toying with this for a few weeks, and am finally ready to add this project to The Lab. This new visualization is much closer in content to the original, but in this new version you can explore interactively. Along with being able to see more clearly where languages are in the chart, it also shows individual trends among the languages. Of course, I have only been collecting data for a few weeks, so there are not many distinct trends yet.
The chart updates weekly, and if people are interested in the data for their own experimentation just ping me and I can get it to you. Also, I am always open to suggestions on how to make these kinds of visualizations clearer, so if you have ideas let me know.
<iframe src="https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html" width="800px" height="720px" frameborder="0" scrolling="no"></iframe>