In a post on dataists almost two years ago, John Myles White and I posed the question: "How would you rank the popularity of a programming language?"

From the original post:

One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has its advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish between languages with a vocal minority and those that actually have large communities.

So, we spent an evening at Princeton hacking around on Github and StackOverflow to get data on the number of projects and questions tagged, per programming language, respectively. The result was a scatter plot showing the linear relationship between these two measures. As with any post comparing programming languages, it was great bait for the Internet masses to poke holes in, and since then Stephen O'Grady at Redmonk has been re-running the analysis to show changes in the relative position of languages over time.

Today I am giving a talk at Monktoberfest on the importance of pursuing good questions in data science. As an example, I wanted to revisit the problem of ranking programming languages. For a long time I have been unsatisfied with the outcome of the original post, because the chart does not really address the original question about ranking.

Scatter plots are a really poor way of displaying rank correlation. In reality, the chart from the original post is a great first step in the exploration of the data. Clearly there is a strongly linear relationship between the two metrics, and we can see there are many outliers on either side, but it doesn't help us understand how to rank the languages based on these measures. So, I decided to pull the data again, and create a new visualization.

If you have not seen one before, this is a Tufte-style "slopegraph," which Tufte developed to show the change in gradients over time. Here I am using the technique to compare two independent rankings of the same list. A couple of things to note about the graph:

The languages that have been bumped out into the margins by brackets represent ties in the ranking. This happens much more often in the StackOverflow data because we must convert the highly skewed number of tags to a rank, unlike the Github data, where the rank is the value provided.
The colors of the lines correspond to a simple k-means clustering of the languages' ranks using five clusters. For the k-means, the distance is based on the distance among languages in the original scatter plot.
The descending grey bands represent the rank quintiles, e.g., the top band is the 100th percentile, then the 80th percentile, etc.
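For anyone who wants to reproduce the first and third points, here is a minimal Python sketch of turning raw counts into ranks with ties and then placing each rank into one of five percentile bands. This is not the code used to build the chart. The function names, the choice of "competition" ranking for ties (1, 2, 2, 4), and the tag counts are all assumptions made for illustration.

```python
def rank_with_ties(counts):
    """Map {language: count} to {language: rank}, where rank 1 is the largest count.

    Languages with equal counts share a rank ("competition" ranking: 1, 2, 2, 4).
    The tie rule is an assumption; the original chart may have used another.
    """
    ordered = sorted(counts.values(), reverse=True)
    # The first position at which a count appears is the rank shared by all ties.
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    return {lang: first_position[count] for lang, count in counts.items()}


def percentile_band(rank, n_items, n_bands=5):
    """Return the upper percentile label of the band that contains `rank`.

    With five bands the labels are 100, 80, 60, 40, 20 (top band = 100).
    """
    band_index = (rank - 1) * n_bands // n_items  # 0 is the top band
    return 100 - band_index * (100 // n_bands)


# Hypothetical tag counts, invented for illustration only.
tag_counts = {"A": 900, "B": 400, "C": 400, "D": 50, "E": 0}

ranks = rank_with_ties(tag_counts)
bands = {lang: percentile_band(r, len(tag_counts)) for lang, r in ranks.items()}

for lang in tag_counts:
    print(lang, ranks[lang], bands[lang])
```

Here "B" and "C" have the same count, so they share a rank, which is the situation the brackets in the margins of the chart represent.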

Not perfect, but I can get much more information on the ranking by looking at this chart:

By adding the k-means clusters and the percentile bands, we can see a much clearer picture of how languages group by tiers.
There is actually a large amount of agreement between the two rankings when we view the quintiles as popularity tiers. For example, the purple cluster is almost exclusively in the 100th percentile, or top tier, save for R, which Github ranks in the 80th percentile! Likewise, the teal cluster is primarily in the next grey band, or second tier.
There is also a clear bottom tier for the least popular languages, represented here by the orange cluster.
For some languages this method of ranking is completely useless. In this case, those languages are in the fuchsia cluster. Many of them are very highly ranked in Github, but have zero questions tagged on StackOverflow.
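The clustering behind the line colors can be sketched in a few lines of Python. This is a plain implementation of Lloyd's k-means algorithm under stated assumptions, not the code behind the chart: the `kmeans` function, its deterministic initialization (the first k distinct points), and the coordinates are all hypothetical. The chart uses five clusters; the toy example uses two to keep it short.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on 2-D points; returns one cluster label per point.

    Initial centroids are the first k distinct points. That is a simplification
    chosen to keep the sketch deterministic; real analyses usually use several
    random restarts, because k-means can settle in a poor local optimum.
    """
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


# Hypothetical (Github, StackOverflow) coordinates, invented for illustration.
points = [(1, 2), (2, 1), (3, 4), (20, 22), (21, 19), (22, 21)]
labels = kmeans(points, k=2)
print(labels)
```

Languages that sit close together in the scatter plot end up with the same label, and that label is what would be mapped to a line color.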

To further simplify the chart, and again to get closer to answering the question of what makes a programming language popular, I redrew the chart given the observations mentioned above:

You'll notice that I simply pull the languages in the green cluster out of the vertical ordering and call them "High Variance." Unlike the fuchsia cluster, the rank comparison between Github and StackOverflow is meaningful, but much weaker. These languages fall anywhere between the 20th and 100th percentiles, but from the slope chart we can see that for many of them the two data sets do not rank them wildly differently.

Given the data, and the observations made from the slopegraph, we can make qualified statements about programming language popularity, such as, "The most popular programming languages are those that fall in the 20th percentile of rank between Github projects and StackOverflow tags." Statements about the second most and least popular languages are equally valid, though, of course, not definitive. Finally, for completeness, here is the group membership of each language for the clusters:

Of course, all of this is merely an exercise in descriptive analysis. The real value would be in understanding why languages fall into these tiers. Hopefully the folks at Monktoberfest can help me with that one!

Oh, and here are the slides from my presentation at Monktoberfest.

Update: Below is the video of my talk, graciously provided by the good people at Redmonk.

Questions first, then data: perspectives on data science from a social scientist. Drew Conway's talk at Monktoberfest 2012.