In a post on dataists almost two years ago, John Myles White and I posed the question: "How would you rank the popularity of a programming language?"
From the original post:
One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has its advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish languages with a vocal minority from those that actually have large communities.
So, we spent an evening at Princeton hacking around on GitHub and Stack Overflow to get data on the number of projects and the number of tagged questions, respectively, for each programming language. The result was a scatter plot showing the linear relationship between these two measures. As with any post comparing programming languages, it was great bait for the Internet masses to poke holes in, and since then Stephen O'Grady at RedMonk has been re-running the analysis to show changes in the relative position of languages over time.
Today I am giving a talk at Monktoberfest on the importance of pursuing good questions in data science. As an example, I wanted to revisit the problem of ranking programming languages. For a long time I have been unsatisfied with the outcome of the original post, because the chart does not really address the original question about ranking.
Scatter plots are a really poor way of displaying rank correlation. In reality, the chart from the original post is a great first step in exploring the data. Clearly there is a strongly linear relationship between the two metrics, and we can see there are many outliers on either side, but the chart doesn't help us understand how to rank the languages based on these measures. So, I decided to pull the data again and create a new visualization.
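To make the difference between a linear relationship and a ranking concrete, here is a minimal Python sketch. It is not the code behind either post: the language names and counts are made up for illustration, and the helper names (`ranks`, `spearman`) are my own. It converts each measure into ranks and then computes the Spearman rank correlation, which is just the Pearson correlation of those ranks.

```python
import math

def ranks(values):
    """Rank values from largest (rank 1) to smallest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical counts, NOT real GitHub or Stack Overflow data.
languages = ["lang_a", "lang_b", "lang_c", "lang_d"]
projects = [900, 500, 450, 100]    # stand-in for projects per language
questions = [800, 300, 600, 50]    # stand-in for tagged questions per language

project_rank = ranks(projects)
question_rank = ranks(questions)
rho = spearman(projects, questions)

for name, p, q in zip(languages, project_rank, question_rank):
    print(f"{name}: project rank {p:.0f}, question rank {q:.0f}")
print(f"Spearman rank correlation: {rho:.2f}")
```

Printing the two ranks side by side shows directly where the measures disagree about a language's position, which is exactly the information a scatter plot of raw counts makes hard to read.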