Articles to Reconsider in the Wake of bin Laden's Death

For me the prevailing emotion after learning that Osama bin Laden had been killed was sadness. Certainly not for the man, but his death brought back many of the memories and emotions I felt as an undergraduate ten years ago on the morning of the attacks; how that single event entirely changed the world I lived in, and the direct impact it subsequently had on my life and career.

I also felt a sense of discomfort at the unabashed jubilation. Killing bin Laden is a significant milestone, but not a yardstick of progress. It is a great victory for the Obama White House, the U.S. intelligence community, and the special forces that executed the raid. But will it have any impact on transnational terrorism, or on radicalization more generally? Cautious optimism should rule the day, and by way of promoting that perspective, below are five articles that approach this question from very different angles.

If there are others you think should be included please add them in the comments.

My article in IQT Quarterly, "Data Science in the U.S. Intelligence Community"

I was asked to write a short piece for In-Q-Tel's journal, IQT Quarterly. The article attempts to address how the U.S. intelligence community can begin incorporating "data science" into the intelligence cycle, and some of the consequences of doing so. The issue was just published, and my article is entitled (appropriately) "Data Science in the U.S. Intelligence Community."

Readers of this blog will note several themes, ideas, and even graphics in the article that I have mentioned in previous posts. But I was very pleased by the sentence the editors decided to draw out in the print version.

Understanding how modeling assumptions impact the interpretations of analytical results is critical to data science, and this is particularly true in the IC.

I welcome your thoughts, especially if you are a practicing intelligence professional.

Python function to send email (via GMail) when script has completed

As I mentioned yesterday, I have been moving most of my computationally intensive work to Amazon's EC2 cloud computing service. One important thing to keep in mind when you are using EC2 is that every minute counts, and Amazon is running a tab. In the interest of best practices, I decided to write a short Python function that sends me a notification email via GMail when a script has finished. I also thought it would be useful to include the script's runtime in the body of the email, both for benchmarking and as a sort of digital receipt.

For your enjoyment, here is that function:
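A minimal sketch of such a function, using GMail's SMTP server on port 587 with TLS (the function and argument names here are illustrative placeholders, not a definitive implementation):

```python
import smtplib
import time
from email.mime.text import MIMEText

def format_runtime(seconds):
    """Return a human-readable H:MM:SS string for a runtime in seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return "%d:%02d:%02d" % (hours, minutes, secs)

def notify_on_completion(start_time, to_addr, gmail_user, gmail_password):
    """Email a completion notice via GMail's SMTP server, with the
    script's total runtime in the body as a digital receipt."""
    body = "Script complete.\nRuntime: " + format_runtime(time.time() - start_time)
    msg = MIMEText(body)
    msg["Subject"] = "Script finished"
    msg["From"] = gmail_user
    msg["To"] = to_addr
    # GMail requires STARTTLS on port 587 before authenticating
    server = smtplib.SMTP("smtp.gmail.com", 587)
    server.starttls()
    server.login(gmail_user, gmail_password)
    server.sendmail(gmail_user, [to_addr], msg.as_string())
    server.quit()
```

Typical usage is to record `time.time()` at the top of your script and call `notify_on_completion` as its last line.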

Note: in the above example I am using GMail to send the email via SMTP, but it would be trivial to modify the above function to work with a different SMTP server.

The rise (and fall?) of firms in the 'government' 'data' space

Since the news broke that the current budget negotiations in the U.S. Congress placed many of the open government data initiatives squarely on the chopping block, there has been much consternation within the data community. Much has been written in defense of these sites, which include data.gov, USASpending.gov, paymentaccuracy.gov, and others. The Sunlight Foundation has even started an online petition to rally support. For my part, I believe this is yet another ridiculous attempt to frame the budget debate in terms of minor expenditures rather than focusing on the endemic problems in entitlement and military spending. But I digress.

Clearly these sites are of massive value to researchers, as they provide—in some cases—extremely granular information about government activity. If these sites were shut down it would certainly affect the work of many scholars, journalists, and active citizens. Even in that event, however, the government is not going to suddenly stop collecting data. As the other side of this debate has pointed out, those interested in this data can always make formal requests to receive it, and most agencies are bound by statute to provide it. For researchers working on relatively protracted timelines this will not be a catastrophic loss, and journalists were breaking stories using government data long before these sites existed.

The question then is: who really suffers?

One argument I have seen and heard is that there are now many companies and startups using this data to provide tools and services that are a direct byproduct of this level of government transparency. Much of the budget debate has centered on returning the U.S. to prosperity. Politicians often pay lip service to the value of job creation, so perhaps a negative consequence of this proposal would be the loss of these newly created jobs. Anecdotally I have observed the rise of these firms, as I have seen presentations and participated in hack-a-thons that focused exclusively on government data. But is there actual evidence of such a trend?

One way to test this is to count the number of firms in the 'government' and 'data' space founded over the last several years. Since I am primarily interested in technology companies, the best source for this information is CrunchBase. This is an open database on all things related to technology firms, and it provides a very convenient API for querying. One drawback of the API, as best I could tell, is that it does not support Boolean queries. In my case I was interested in companies matching both 'government' and 'data,' so I had to perform the two searches separately and then take the intersection of the results.
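The intersection step can be sketched as follows; the `search` callable here is a stand-in for whatever CrunchBase client or endpoint is actually used, not part of any real API:

```python
def companies_matching_all(search, terms=("government", "data")):
    """Run one query per term via the supplied search(term) callable and
    return only the companies present in every result set, mimicking the
    Boolean AND that the API itself does not support."""
    result_sets = [set(search(term)) for term in terms]
    return set.intersection(*result_sets)
```

With real API calls plugged in for `search`, the returned set is exactly the center of the Venn diagram below.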

governemnt_data_vd.png

As such, the companies I focused on lie at the center of the above Venn diagram. That is, their descriptions in the CrunchBase database include both the words 'government' and 'data.' I am perfectly aware of the limitations of this approach. There are likely companies in the data set that are not representative of the trend I am attempting to analyze. Furthermore, the CrunchBase database is full of holes, and many companies that met the search criteria did not include founding-date information and thus had to be ignored. Bearing all that in mind, however, the results remain quite interesting.

crunchbase_density.png

The above graph shows the frequencies of companies in the dataset founded each year between 1950 and 2010. The blue bars are the raw frequencies, and the smooth red line is a kernel density estimate. Clearly, starting in the late 1990s and through the mid-2000s there was a huge rise in the number of companies working in this space. Since then, however, there has been a decrease.
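Both ingredients of the graph can be computed in a few lines; this is an illustrative sketch of the two quantities, not the actual code used to produce the chart:

```python
import math
from collections import Counter

def founding_frequencies(years):
    """Raw count of companies founded in each year (the blue bars)."""
    return Counter(years)

def kde(years, x, bandwidth=2.0):
    """Gaussian kernel density estimate evaluated at x (the red line).
    Each founding year contributes a Gaussian bump of the given bandwidth."""
    norm = 1.0 / (len(years) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in years)
```

Evaluating `kde` over a grid of years and plotting it alongside the frequency bars reproduces the shape of the figure.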

This result is in stark contrast to my assumptions coming into this analysis. Given the anecdotal evidence I mentioned, I assumed there would have been a steady rise over the past several years, rather than a decline. Perhaps someone who is more knowledgeable of the CrunchBase data can provide some insight as to why? Or, even better, someone in the government data space can provide alternative evidence.

As a final thought, regardless of whether the numbers are increasing or decreasing, this simple exercise shows one important thing: there are many companies already working with government data. It is very difficult to know whether shutting down open government sites would stymie the growth of firms in this space. What is clear is there are already many companies, those under the large curve from 1990-2010, that could be negatively affected by this decision. For the U.S. Congress the important question is: do the ends justify the means?

Code used for analysis

Happy Pi Day, Now Go Estimate It!

As you may know, today is Pi Day, when all good nerds take a moment to thank the geeks of antiquity for their painstaking work in estimating this marvelous mathematical constant.

It is also a great opportunity to thank contemporary geeks for the wonders of modern computing, which allow us to estimate pi to arbitrary precision. One popular method for estimating pi is the so-called "random darts method," which uses Monte Carlo simulation to mimic the act of throwing darts at a board centered inside a square. Suppose we have a dartboard inscribed in a square, as pictured below.

basic_example_area.png

If we randomly throw darts at this board such that each dart lands uniformly at random within the square, then we can estimate pi from the proportion of darts that fall inside the circle. In this case, those would be the darts in the red shaded area below.

basic_example_monte.gif

Specifically, our estimate of pi will be four times the number of darts on the board divided by the total number of throws. This works because the ratio of the circle's area to the square's area is pi*r^2 / (2r)^2 = pi/4, so the fraction of uniform throws landing on the board approaches pi/4. Again, we are assuming all darts hit the square and have equal probability of landing anywhere inside it, i.e., a very bad dart thrower.

Using this method, it is extremely easy to estimate pi with Monte Carlo in R. We simply need to make N draws in two dimensions from a uniform distribution, test which points land on the board, and then estimate pi from that ratio.

This can be accomplished in six lines of R code (ignoring comments):
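For comparison, the same Monte Carlo estimate can be sketched just as compactly in Python; by symmetry it suffices to throw darts at one quadrant, counting those inside the quarter circle:

```python
import random

def estimate_pi(n_darts, seed=42):
    """Throw n_darts uniformly at the unit square and estimate pi as
    four times the fraction landing inside the inscribed quarter circle."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_darts)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n_darts
```

With a few hundred thousand darts the estimate is already good to a couple of decimal places.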

If we want to have a lot of fun, we can test for convergence to pi as the number of dart throws gets big. Since the Monte Carlo method relies on the law of large numbers, we would expect the precision of our estimate to increase as the number of darts thrown increases. In other words, the more of the board we can potentially cover, the better our estimate will be.
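One way to sketch this convergence check (in Python here, as an illustrative stand-in for the R simulation) is to record the running estimate after every dart:

```python
import random

def running_pi_estimates(max_darts, seed=0):
    """Return the pi estimate after each successive dart, so that
    convergence under the law of large numbers can be plotted."""
    rng = random.Random(seed)
    hits = 0
    estimates = []
    for n in range(1, max_darts + 1):
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0:
            hits += 1
        estimates.append(4.0 * hits / n)
    return estimates
```

Plotting this list against the trial number produces the kind of convergence curve shown below.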

pi_est.png

I ran the simulation from 1 to 5,000 trials, and as you can see from the above chart the estimate quickly converges to within a small margin of pi. The circle diagrams I used above were taken from this great tutorial on estimating pi in Python, so you can have fun estimating pi in many languages.

Challenge: submit code for estimating pi using Monte Carlo in your favorite, or most esoteric, language. Bonus points for brevity and elegance—especially if you can improve on my above code.