Python function to send email (via GMail) when script has completed

April 12, 2011 by Drew Conway

As I mentioned yesterday, I have been moving most of my computationally intensive work to Amazon's EC2 cloud computing service. One important thing to keep in mind when you are using EC2 is that ever minute counts, and Amazon is running a tab. In the interest of best-practices I decided to write a short Python function that would notify me an email via GMail when the script had finished. I also thought it would be useful to include the runtime of the script in the body of the email, both in terms of benchmarking and as a sort-of digital receipt.

For your enjoyment, here is that function:

Note: in the above example I am using GMail to send the email via SMTP, but it would be trivial to modify the above function to work with a different SMTP server.

The rise (and fall?) of firms in the 'government' 'data' space

April 06, 2011 by Drew Conway

Since the news broke that the current budget negotiations in the U.S. Congress placed many of the open government data initiatives squarely on the chopping block there has been much consternation within the data community. Much has been written in defense of these sites, which include data.gov, USASpending.gov, paymentaccurancy.gov, and others. The Sunlight Foundation has even started online petition to rally support. For my part, I believe this is yet another ridiculous attempt to frame the budget debate in terms of minor expenditures rather than focusing on the endemic problems in entitlement and military spending. But I digress.

Clearly these sites are of massive value to researchers, as they provide—in some cases—extremely granular information about government activity. If these sites were shutdown it would certainly affect the work of many scholars, journalist, and active citizens. Likewise, however, in that event the government is not going to suddenly stop collecting data. As the other side of this debate has pointed out, those interested in gathering this data can always make formal requests to receive it, and most agencies are bound by statute to provide it. For researchers working on relatively protracted timelines this will not be catastrophic loss, and journalists were breaking stories using government data long before these sites existed.

The question then is: who really suffers?

One argument I have seen and heard is that there are now many companies and startups using this data to provide tools and services that are a direct byproduct of this level of government transparency. Much of the budget debate has centered on returning the U.S. to prosperity. Politicians often pay lip service to the value of job creation, so perhaps a negative consequence of this proposal would be the loss of these newly created jobs. Anecdotally I have observed the rise of these firms, as I have seen presentations and participated in hack-a-thons that focused exclusively on government data. But is there actual evidence of such a trend?

One way to test this is to count the number of firms in the 'government' and 'data' space that have been founded over the last several years. Since I am primarily interested in technology companies the best source for this information is CrunchBase. This is an open database on all things related to technology firms, and provides a very convenient API for querying. One drawback of the API, as I was able to understand it, is that you cannot query it using Booleans. In my case, I was interested in companies that matched the terms 'government' and 'data,' but had to actually perform both searches separately and then take the intersection.

As such, the companies I focused on lie at the center of the above Venn diagram. That is, their description in the CrunchBase database include both the words 'government' and 'data.' I am perfectly aware of the limitations of this approach for the analysis. There are likely companies in the data set that are not representative of the trend I am attempting to analyze. Furthermore, the CrunchBase database is full of holes, and many companies that met the search criteria did not include founding date information and thus were ignored. Bearing all that in mind, however, the results remain quite interesting.

The above graph shows the frequencies of companies in the dataset founded each year between 1950 and 2010. The blue bars are the raw frequencies, and the smooth red line is a kernel density estimate. Clearly, starting in the late 1990's and through the mid-2000's there was a huge rise in the number of companies working in this space. Since then, however, there has been a decrease.

This result is in stark contrast to my assumptions coming into this analysis. Given the anecdotal evidence I mentioned, I assumed there would have been a steady rise over the past several years, rather than a decline. Perhaps someone who is more knowledgeable of the CrunchBase data can provide some insight as to why? Or, even better, someone in the government data space can provide alternative evidence.

As a final thought, regardless of whether the numbers are increasing or decreasing, this simple exercise shows one important thing: there are many companies already working with government data. It is very difficult to know whether shutting down open government sites would stymie the growth of firms in this space. What is clear is there are already many companies, those under the large curve from 1990-2010, that could be negatively affected by this decision. For the U.S. Congress the important question is: does the ends justify the means?

Code used for analysis

Happy Pi Day, Now Go Estimate It!

March 14, 2011 by Drew Conway

As you may know, today is Pi Day, when all good nerds take a moment to thank the geeks of antiquity for their painstaking work in estimating this marvelous mathematical constant.

It is also a great opportunity to thank contemporary geeks for the wonders of modern computing, which allow us to estimate pi to near infinite precision. One popular method for estimating pi is the so-called "random darts method," which uses the Monte Carlo method to simulate the act of throwing darts at a board centered inside a square. Suppose we have a dartboard ascribed in a square as pictured below.

If we randomly throw darts at this board such that each dart has an equal probability of landing on the board or outside it, then we can estimate pi using the ratio of darts that fall inside the circle. In this case, those would be the darts in the red shaded area below.

Specifically, our estimate for pi will be four times the number of darts on the board divided by the total number of throws. Again, we are assuming all darts hit the square and have equal probability of landing anywhere inside the square, i.e., a very bad dart thrower.

Using this method, it is extremely easy to estimate a value of pi using Monte Carlo in R. We simply need to make N number of draws in two dimensions from a uniform distributions, test which points are on the board, and then estimate pi with that ratio.

This can be accomplished in six lines of R code (ignoring comments):

If we want to have a lot of fun, we can test for convergence to pi as the number of dart throws gets big. Since the Monte Carlo method relies on the law of large numbers, we would expect the precision of our estimate to increase as the number of darts thrown increases. In other words, the more of the board we can potentially cover, the better our estimate will be.

I ran the simulation from 1 to 5,000 trials, and as you can see from the above chart the estimate quickly converges to a value within a fraction of pi. The circle diagrams I used above were taken from this great tutorial on estimating pi in Python, so you can have fun estimating pi in many languages.

Challenge: submit code for estimating pi using Monte Carlo in your favorite, or most esoteric, language. Bonus points for brevity and elegance—especially if you can improve on my above code.

Language used by Academics with the Protection of Anonymity

March 14, 2011 by Drew Conway

Those in the political science discipline probably remember their first encounter with poliscijobrumors.com. For those outside, you have probably never heard of this particular message board, and you would have no reason to. As the URL suggests, the board specializes in rumor, gossip, back-bitting, mudslinging, and the occasional lucid thread on the political science discipline. By browsing the posts one can quickly see how the protection of anonymity results in the lowest-common denominator of discourse—even among members of the Ivory Tower!

If you are unconvinced, simply test Godwin's law for yourself.

The convergence of specific topics within a discipline and the promise of anonymity, however, makes for a very interesting data set on the use of language in this context.

I have always been curious what patterns could be extracted from the particular forum. Specifically, given the ability of people to mask their identities, which often leads to a very low-quality in discourse, is it still possible to identify topic areas of interest by examining the data in aggregate? Furthermore, will any of them have anything to do with political science?

The answer: kind of...

The message board has been around for a long time, so it was infeasible to go out and scrape the entire corpus. Short of that, I decided to create a text corpus of the first 1,018 threads in the General Job Market Discussion. The 1,018 comes from that fact that several threads include multiple pages, so rather than strictly stopping at 1,000 pages I decided to try to be inclusive of full threads.

With all the data in hand, the analysis was very straightforward. I constructed a term-document matrix, with the usual linguistic noise removed, and performed a simple matrix multiplication to get the number of times each of the words were used in the same thread. The result is an N-by-N matrix, wherein the elements are the number of times a words were used in the same thread. We can think of this data as a weighting among words: the higher the number the "closer" the affiliation.

Another way to construct this is graphically, whereby the data is a weighted adjacency matrix. Then, the words become nodes and the edges are weighted by the co-occurence weighting in each element. This is helpful because we can now use force-directed methods to place words near each other in two-dimensional space. , both the x- and y-axis position of the word is directly relevant the how words relate to each other.

This positional data also gives us a sense of distance between words, i.e, the further apart words are the more unlikely it will be that they are used in the same thread. From this we can create "topic" clusters. That is, we can attempt to divide the words into clusters based on their distances, and these clusters can represent consistent topics within the entire corpus of data. To do this I use simply k-means clustering, and use 8 centers for the clusters. The choice of 8 was made because the "Dark2" Color Brewer palette only has 8 colors in it (art vs. science compromise).

Finally, because I think it is an immediately obvious way to convey this, words are sized by the log of their frequency in the entire corpus. The visualization above is the result of this analysis, which follows from previous thoughts on building better word clouds.

What can we say about this analysis? It seems—to me—that the topics are fairly similar. Moreover, despite the low-level of the overall discourse on the forum, in aggregate the topics are very relevant to the political science discipline and job market. That said, a non-negligible amount of profanity does make it into the visualization, though thankfully those words are not among the most frequently used. The placement of certain cities and universities into various topic clusters is also interesting.

Keep in mind that the method I use here is very different from the LDA topic modeling I have discussed in the past. Perhaps that would have produced better topic clusters, however, I do think one benefit of this method is the non-stochastic nature of the clusters.

Code available for download and inspection at the ZIA Code Repository.

R Packages Used

XML
tm
igraph
ggplot2

Building a Better Word Cloud

January 27, 2011 by Drew Conway

A few weeks ago I attended the NYC Data Visualization and Infographics meetup, which included a talk by Junk Charts blogger Kaiser Fung. Given the topic of his blog, I was a bit shocked that the central theme of his talk was comparing good and bad word clouds. He even stated that the word cloud was one of the best data visualizations of the last several years. I do not think there is such a thing as a good word cloud, and after the meetup I left unconvinced; as evidenced by the above tweet.

This tweet precipitated a brief Twitter debate about the value of word clouds, but from that straw poll it seemed the Nays had the majority. My primary gripe is that space is meaningless in word clouds. They are meant to summarize a single statistics—word frequency—yet they use a two dimensional space to express that. This is frustrating, since it is very easy to abuse the flexibility of these dimensions and conflate the position of a word with its frequency to convey dubious significance.

Since then, I have been inundated with word clouds. The State of the Union address always brings out a wave of word clouds. Even LinkedIn got in on the action with this spinning word cloud of doom. I'd had enough, and decided it was time to add spatial meaning to the lexical cauldron of word clouds, and see how far I could get.

First, I was struck by one of the comments at the NYVIZ meetup. An audience member mentioned to Kaiser that the word clouds he found most useful were those presented as comparison; i.e., comparing word clouds from two different texts on similar topics. This appealed to my statistical inclinations, wherein we are often interested in comparing data. As such, I decided that the best way to evolve the word cloud was to create one that compared two texts in a single visualization. Given the timeframe when this pet project started, the data I sought to compare were the speeches given by Pres. Obama and Sarah Palin in the wake of the Tucson shooting.

This is the result of my effort to build a better word cloud...

The best data visualizations should stand on their own, which is why I present the word cloud before explaining how I created it. What do you think?

To understand how these speeches compared I first needed to create a term-frequency matrix, which contained only words used in both speeches. After removing common English stop words and the word 'applause' (Obama's speech was in front of a live audience), and retaining only words contained in both speeches at least once, I was left with 103 words to visualize.

To show how the two speeches contrasted, I decided to use the x-axis position to pull words used more by one politician closer to either the left or right of the plot. Words used more by Palin are to the left, and likewise words to the right were used more by Obama. The color reinforces this information, making words Palin words darker red, and Obama darker blue. The scaling of the x-axis is also very important. Note that the "Said Equally" partition does not appear at the exact center of the graph—this is intentional. The word "people" is said more by Obama than any shared word in the corpus. The variance in distance between the equal partition and the edges of the plot is meant to convey this disparity.

Next, I wanted to improve the word cloud, but not redefine it. The base motivation of a word cloud is to convey term frequency by the word size, therefore, this remains true here. In this case, however, because there are two different frequency counts for each text some reduction in data must be used to fit the data into a single visualization. To accomplish this, those words used more often by either politician are sized based on the frequency of the word in that politician's speech. For example, both Obama and Palin use the word "violence," but Palin uses it more; therefore, that word is sized by its frequency in her speech. Logically, words at in the equal partition are sized by their frequency in both speeches.

For the y-axis I wanted to maximize readability. One of the most frustrating things about traditional word clouds is how hard they make you work to read the words. As such, I created a simple function that equally spaced the words on the y-axis given the number appearing in each vertical partition. It isn't perfect, but for this relatively small number of words all are reasonably visible. Also, the ordering from top to bottom is simply the descending alphabetical order of each vertical partition.

While this is a very simple extension of the traditional word cloud, much more can be learned from it. For example, both politicians used the words "congresswomen," and "america" equally but also frequently. While the word "tragedy" is used often in both speeches, but slightly more by Obama. The edges are most interesting. Palin repeated the shared terms "ideas," "debate," "victims," "values," and "strength," while Obama focused on "people," "lives," and "life."

A clear weakness in this approach is that words that are not in the intersection of two texts are ignored. In an effort for full democratization of methods for visualizing word frequency, below are the Wordle versions of the texts used to create the above visualization. They are presented side-by-side to allow for comparison.

Finally, as you may have guessed, this visualization was created in R using a combination of the tm and ggplot2 packages. The data and code are available in the ZIA Code Repository for your amusement.

So, which method do you like better?