Methods for Collecting Large-scale Non-expert Text Coding

I finally got around to uploading a paper on the research I have been doing for the past year on using large-scale, non-expert coders to derive data from political text.  Enjoy!

Methods for Collecting Large-scale Non-expert Text Coding

The task of coding text for discrete categories or quantifiable scales is a classic problem in political science. Traditionally, this task is executed by qualified "experts". While productive, this method is time-consuming, resource-intensive, and introduces bias. In the following paper I present the findings from a series of experiments developed to assess the viability of using crowd-sourcing platforms for political text coding, and how variations in the collection mechanism affect the quality of output. To do this, workers from the labor pool available on Amazon's Mechanical Turk platform were asked to identify policy statements and positions from a text corpus of party manifestos. To evaluate the quality of the non-expert codings, this text corpus is also coded by multiple experts for comparison. The evidence from these experiments shows that crowd-sourcing is an effective alternative means of generating quantitative categorization from text. The presence of a filter on workers increases the quality of output, but variations on that filter have little effect. The primary weakness of the non-experts participating in these experiments is their systematic inability to identify texts that contain no policy statement.

Programming language trends

Ever since I started investigating the popularity of programming languages by looking at their usage on Github and StackOverflow I have wanted to improve on the original scatter-plot.

A couple of months ago I experimented with a slope-graph, and did some basic cluster analysis to determine popularity tiers for the languages. I liked the tiers, but the visualization seemed to muddle things more than enhance them. I actually really like the simplicity of the original scatter-plot, but I wished I could get more information on the trends in languages from the chart.

I have been toying with this for a few weeks, and am finally ready to add this project to The Lab. This new visualization is much closer in content to the original, but in this new version you can explore interactively. Along with being able to see more clearly where languages are in the chart, it also shows individual trends among the languages.  Of course, I have only been collecting data for a few weeks, so there are not many distinct trends yet.

The chart updates weekly, and if people are interested in the data for their own experimentation just ping me and I can get it to you. Also, I am always open to suggestions on how to make these kinds of visualizations clearer, so if you have ideas let me know.

The DS VD: now in d3.js!

As if you could not imagine seeing the Data Science Venn Diagram anymore, it now lives as a standalone interactive data visualization (if you click the link you'll find a version that includes context for each component).

Actually, I needed to practice using the transform function in Javascript and d3.js, and getting those components to tilt the right way was just the ticket. This does, however, present the opportunity for those that wish to embed the diagram to do so.

<iframe src="https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html" width="800px" height="720px" frameborder="0" scrolling="no"></iframe>

The Data Science Venn Diagram is Creative Commons licensed as Attribution-NonCommercial.

Revisiting "Ranking the popularity of programming languages": creating tiers

In a post on dataists almost two years ago, John Myles White and I posed the question: "How would you rank the popularity of a programming language?".

From the original post:

One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has its advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish among languages with a vocal minority versus those that actually have large communities.

So, we spent an evening at Princeton hacking around on Github and StackOverflow to get data on the number of projects and questions tagged, per programming language, respectively. The result was a scatter plot showing the linear relationship between these two measures. As with any post comparing programming languages, it was great bait for the Internet masses to poke holes in, and since then Stephen O'Grady at Redmonk has been re-running the analysis to show changes in the relative position of languages over time.

Today I am giving a talk at Monktoberfest on the importance of pursuing good questions in data science. As an example, I wanted to revisit the problem of ranking programming languages. For a long time I have been unsatisfied with the outcome of the original post, because the chart does not really address the original question about ranking.

Scatter plots are a really poor way of displaying rank correlation. In reality, the chart from the original post is a great first step in the exploration of the data. Clearly there is a strongly linear relationship between the two metrics, and we can see there are many outliers on either side, but it doesn't help us understand how to rank the languages based on these measures. So, I decided to pull the data again, and create a new visualization.
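For readers who want a number to go with the idea of rank correlation, here is a minimal sketch of Spearman's rank correlation in plain Python. It is not the analysis from the original post: the counts are made up for illustration, the function names are hypothetical, and giving tied values their average rank is one common convention among several.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical counts for five languages, NOT the data from the post.
github_projects = [5000, 3200, 3100, 400, 90]
stackoverflow_tags = [80000, 20000, 45000, 900, 900]
rho = spearman(github_projects, stackoverflow_tags)
print(round(rho, 3))
```

A single coefficient like this summarizes agreement between the two rankings, but it says nothing about which languages disagree, which is the gap a ranked visualization tries to fill.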

If you have not seen one before, this is a Tufte-style "slopegraph," which Tufte developed to show the change in gradients over time. Here I am using the technique to compare two independent rankings of the same list. A couple of things to note about the graph:

  • The languages that have been bumped out into the margins by brackets represent ties in the ranking. This happens much more often in the StackOverflow data because we must convert the highly skewed number of tags to a rank, unlike the Github data, where the rank is the value provided.
  • The colors of the lines correspond to a simple k-means clustering of the languages' ranks using five clusters. For the k-means, the distance is based on the distance among languages in the original scatter plot.
  • The descending grey bands represent the rank quintiles, e.g., the top band is the 100th percentile, then the 80th percentile, etc.
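The first two notes above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code behind the chart: the language names and numbers are invented, the tie rule (tied counts share the best available rank) is an assumption, and the k-means here clusters on rank pairs with a deterministic start, whereas the chart's clustering may have used different inputs and a different initialization.

```python
def competition_ranks(counts):
    """Rank 1 = largest count; tied counts share the same rank (1, 1, 3, ...)."""
    ordered = sorted(counts, reverse=True)
    return [ordered.index(c) + 1 for c in counts]


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    # Assumption: seed the centroids with k points spread evenly through
    # the sorted list, so that the result is reproducible.
    pts = sorted(points)
    step = (len(pts) - 1) / (k - 1) if k > 1 else 0
    centroids = [pts[round(i * step)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: Github already provides a rank, StackOverflow a tag count.
names = ["lang_a", "lang_b", "lang_c", "lang_d", "lang_e", "lang_f"]
github_rank = [1, 2, 3, 10, 11, 12]
so_tag_counts = [900, 900, 700, 40, 40, 5]

so_rank = competition_ranks(so_tag_counts)
points = list(zip(github_rank, so_rank))
labels, centroids = kmeans(points, k=2)
for name, point, label in zip(names, points, labels):
    print(name, point, "cluster", label)
```

With two clusters and this toy data the split is obvious; with five clusters on real data, different starting centroids can give different groupings, so the tiers are best read as suggestive.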

Not perfect, but I can get much more information on the ranking by looking at this chart:

  1. By adding the k-means clusters and the percentile bands, we can see a much clearer picture of how languages group by tiers.
  2. There is actually a large amount of agreement between the two rankings when we view the quintiles as popularity tiers. For example, the purple cluster is almost exclusively in the 100th percentile, or top tier, save for R, which Github ranks in the 80th percentile! Likewise, the teal cluster is primarily in the next grey band, or second tier.
  3. There is also a clear bottom tier for the least popular languages, represented here by the orange cluster.
  4. For some languages this method of ranking is completely useless. In this case, those languages are in the fuchsia cluster. Many of them are very highly ranked in Github, but have zero questions tagged on StackOverflow.
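One way to make the "agreement between the two rankings" observation concrete is to place each rank into one of five equal-width bands and count how often the two sources put a language in the same band. The sketch below assumes hypothetical ranks for ten languages and a hypothetical helper named `band`; it is not the computation used to draw the chart.

```python
def band(rank, n_items, n_bands=5):
    """Map a 1-based rank to a band: 1 is the top band, n_bands the bottom."""
    # Integer ceiling of rank * n_bands / n_items, capped at n_bands.
    return min(n_bands, (rank * n_bands + n_items - 1) // n_items)


# Hypothetical ranks for ten languages, NOT the data from the post.
github_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
stackoverflow_rank = [2, 1, 4, 3, 9, 6, 5, 8, 10, 7]

n = len(github_rank)
github_band = [band(r, n) for r in github_rank]
stackoverflow_band = [band(r, n) for r in stackoverflow_rank]

# Count the languages that both sources place in the same band.
agree = sum(g == s for g, s in zip(github_band, stackoverflow_band))
print(f"{agree} of {n} languages fall in the same band on both rankings")
```

Languages whose two bands differ widely are the natural candidates for a separate group, since neither ranking alone says much about them.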

To further simplify the chart, and again to get closer to answering the question of what makes a programming language popular, I redrew the chart given the observations mentioned above:

You'll notice that for the languages in the green cluster I simply pull them out of the vertical ordering and call them "High Variance." Unlike the fuchsia cluster, the rank comparison between Github and StackOverflow is meaningful, but much weaker. These languages fall anywhere between the 20th and 100th percentiles, but from the slope chart we can see that for many of them the two data sets do not rank them wildly differently.

Given the data, and the observations made in the slopegraph, we can make qualified statements about programming language popularity, such as, "The most popular programming languages are those that fall in the 20th percentile of rank between Github projects and StackOverflow tags." Statements about the second most, and least popular languages are equally valid, though, of course, not definitive. Finally, for completeness, here is the group membership of each language for the clusters:

Of course, all of this is merely an exercise in descriptive analysis. The real value would be in understanding why languages fall into these tiers. Hopefully the folks at Monktoberfest can help me with that one!

Oh, and here are the slides from my presentation at Monktoberfest.

Update: Below is the video of my talk, graciously provided by the good people at Redmonk.

Enter DataGotham

The New York City data community is a very jovial and tight-knit community. We go to Meetups together, visit each other at work to talk about projects, and even occasionally take over bars together. This is partly because all of us are crammed on this tiny island with everyone else; but it's mostly due to the effort of a group of extraordinary people from very different backgrounds committed to making NYC a great place to be doing data science.

One of the threads that has consistently come out of conversations I have had with members of the community is the desire to highlight the strength of NYC's data community vis a vis this diversity of backgrounds and interests. For example, while NYC has become a strong geographic complement to Silicon Valley as a hub for technology startups, the data community stretches far beyond the startups.

Many of our traditional stalwart industries, such as finance, entertainment, and media, have been building sophisticated analytics products and teams for some time, but have only recently started speaking more publicly about these efforts. At the same time, industries with less obvious connections to the data community, such as design, fashion, and non-profits, have thrust themselves into the NYC data community.

All this is to say, we believe that there is something special about New York that makes it a great place to be doing data science. We think and do things differently, and we have a very diverse and unique set of constituencies in the city that have cultivated a special community and culture. So, we want to show everyone just how awesome this community is.

Enter DataGotham: http://datagotham.com

This event is the first of its kind. Rather than focus on the tools and techniques people are using, DataGotham will bring together professionals from across the NYC data community for intense discussion, networking, and sharing of wisdom. The goal is to tell stories about what problems people are solving, and the highs and lows of that process. DataGotham is being organized by Hilary Mason, Mike Dewar, John Myles White, and myself, and will take place September 13th-14th at NYU Stern.

You can check out our great, and ever expanding, speaker list here: http://www.datagotham.com/speakers/. We also have four tutorials running on the afternoon of the 13th, followed by cocktails and The Great Data Extravaganza Show at the Tribeca Rooftop that evening.

Tickets are on sale now, and can be purchased at http://datagotham.eventbrite.com. Also, as a reader of this blog, you can get a special discount of 25% off registration by using the promo code "dataGothamist" when you register.

We are very excited about the opportunity to provide a platform for this great community, and we would love to see you there!