Swallowing the Academic "Red Pill"

By now you have likely read the lengthy attack on the value of acquiring a PhD and entering an academic career in The Economist's end-of-the-year issue. As might be expected, this has generated much consternation within The Tower, and garnered some excellent responses from within my own discipline. What seems to be missing from the discussion is the perspective of the so-called "cheap, highly motivated and disposable labour," i.e., graduate students. To that end, as someone who has recently tasted and—at least partially—digested the jagged red pill that is entering the academic rabbit hole, allow me a moment to reflect on the decision.

To be honest, it is very difficult to argue with the numbers put forward by The Economist. The years immediately following college graduation are the most crucial in shaping an individual's lifetime earnings. The ancient tradition of apprenticeship used to train graduate students is deliberately slow, which forces people to forgo these crucial earning years in exchange for admittance into an extremely exclusive career path. To borrow Josh's sports metaphor, the 64,000 PhDs produced in the United States each year cited by The Economist are at least comparable to the tens of thousands of individuals drafted by the top professional sports teams in the U.S. each year. The pursuit of a dream often begins with a massive opportunity cost, and just as the individual deciding whether to chase a dream of playing major league baseball faces an extraordinarily low probability of reaching it while forgoing considerable future earnings (the first contract season for Minor League players pays $1,100/month), so too does the college senior filling out graduate school applications.

Part of the problem for academics is that the mythology of their career is not celebrated to anywhere near the degree of the professional athlete's. On my first day of graduate school one of my professors said, "Congratulations on being accepted to the program. While most people will not understand it, you have one of the greatest jobs on the planet. People are going to pay you to think, and I think that is pretty cool."

I think it is pretty cool too, and while at face value that statement reflects the reality of graduate school no better than Summer Catch reflects the realities of the Minor League baseball system, it is important to remember what an academic career is really about: being one of the world's best thinkers, period. The original article attacking academia never considers this point, and instead treats doctoral research as just another kind of on-the-job training. The fact is, very few people will successfully navigate their graduate program and be hired into a tenure-track faculty position, and even fewer will go on to be successful academics. It is an environment where a very specific set of goals blended with unique intellectual, interpersonal, and labor skills is needed to flourish, not unlike many other highly specialized careers.

Fortunately, as with other highly specialized careers, it is easy to recognize whether your own goals and skills exist at some sufficient intersection with what the academy expects. Having sat at this desk for countless hours considering my own career path, here are a few questions I have found most valuable upon reflection:

  • Do you love school? - To be clear, this is not the same question as, "Do you love learning?" Learning, for the most part, is accomplished either through natural ability or through force of will. Schools and universities, on the other hand, are the institutions established to train and judge your qualifications to become a member of academia. That system is both highly flawed and extremely difficult to change. Success in graduate school is as much about appreciating the system as it is about working within it.
  • Do you care about money? - Recall that the pursuit of a doctoral degree carries a massive opportunity cost. Are you prepared to pay it? All things being equal, having a PhD will make you poorer in the long run, so if you care about money then do not waste your time. You make money by earning it, and the currency paid in graduate school is not money.
  • Does your life adhere to a timeline? - Do you have personal goals on a calendar? Something like, "I want to be married by 20XX, and have children Y years later." Despite the popular notion that an academic life comes with unmatched freedom, it is in fact a heavily structured system. The apprenticeship is about devotion to that system, which often means delaying many other life goals.
  • Do you want your work to end? - Professional scholarship is a completely immersive existence. Your work never ends, and at least ostensibly everything you do is about building your academic footprint. The specifics differ among disciplines, but the point is that your work goes with you, everywhere. This is not the same as having a smart phone that "tethers" you to the office. Smart phones can be turned off, but the unending productivity expectations of academia cannot.

While the comparisons to Major League Baseball and The Matrix are meant to be fun, they also highlight the true uniqueness of academia as an employment choice; one that is almost impossible to appreciate from the outside. The Economist does it a disservice by placing it in the context of every other type of work as a means to diminish its value. For the dreamers, it can be the most challenging, rewarding, and satisfying experience imaginable, and there are no economic or social implications that could ever change that.

The Difference Between Relative and Absolute Comparison, and the NYT Op-Ed Page

I am not often drawn to comment on editorials, but a piece by Nicholas Kristof in this Sunday's New York Times seemed worthy of note. In the piece, "The Big (Military) Taboo," Kristof attempts to make the argument that as the United States faces difficult budgetary decisions the military should come under serious fiscal scrutiny.

While there is no doubt that such an investigation could result in significant savings, Kristof motivates his argument by citing several statistics related to the size of the U.S. military, with the assertion that they represent evidence of waste. The trouble with these figures is that they are absolute numbers, which are inappropriate in the context of this argument.

The United States spends nearly as much on military power as every other country in the world combined, according to the Stockholm International Peace Research Institute. It says that we spend more than six times as much as the country with the next highest budget, China.

The statement is true, but deceiving. While the United States does spend nearly as much as every other country combined, our collective wealth is equally disproportionate. The better comparison is how much each country spends on its military as a percent of GDP, which provides a more common reference across countries. As you can see, from this perspective the United States is not nearly as notable an outlier.

[Figure: absolute_military_spending.png — military spending by country in absolute dollars]
[Figure: relative_military_spending.png — military spending by country as a percent of GDP]
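
A minimal sketch of the two comparisons is below. The figures are rough, illustrative values in billions of USD that I have supplied for the example; they are not the SIPRI data behind the charts.

```python
# Rough, illustrative figures in billions of USD; not the SIPRI numbers.
spending = {"United States": 660, "China": 100, "France": 64, "United Kingdom": 58}
gdp = {"United States": 14300, "China": 5000, "France": 2650, "United Kingdom": 2250}

for country in spending:
    pct_gdp = 100 * spending[country] / gdp[country]
    print(f"{country}: ${spending[country]}B absolute, {pct_gdp:.1f}% of GDP")
```

The absolute column makes the U.S. look like a singular outlier; the percent-of-GDP column puts every country on the same footing.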
The intelligence community is so vast that more people have “top secret” clearance than live in Washington, D.C.

The estimated population of Washington, DC as of July 2009 was 599,657. First, that is not a very big number relative to the total population of the U.S., or even to small states (my home state of Connecticut boasts a population of over 3.5 million). Furthermore, the number of individuals with "top secret" clearances is unknown, but the Washington Post estimated it at 854,000. On its own that seems large—nearly one million. Consider, however, that number in the context of the size of the U.S. population, which the Census Bureau recently estimated at 308,745,538.

If these estimates are accurate then approximately 1-in-361 Americans holds a "top secret" clearance, or just under 0.3% of the population. For those in the New York City area, that proportion would equate to about 22,151 fellow New Yorkers, a slightly more than one-quarter-full New Meadowlands Stadium. Certainly not a staggering figure by any means.
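
For transparency, here is the back-of-the-envelope arithmetic. The clearance and census figures come from the sources above; the NYC population (2000 Census) and stadium capacity are my own assumed inputs, so expect trivial rounding differences.

```python
# Back-of-the-envelope check of the proportions cited above.
clearances = 854_000          # Washington Post estimate
us_population = 308_745_538   # Census Bureau estimate
nyc_population = 8_008_278    # assumed NYC population (2000 Census)
stadium_capacity = 82_500     # assumed New Meadowlands Stadium capacity

share = clearances / us_population
print(f"1 in {us_population // clearances} Americans ({100 * share:.2f}% of the population)")
print(f"NYC equivalent: {share * nyc_population:,.0f} people")
print(f"Share of a full stadium: {share * nyc_population / stadium_capacity:.2f}")
```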

Kristof also cites two other statistics. The first is that the U.S. maintains "more than 560 bases and other sites around the world." A quick search did not turn up historical figures, but a more interesting metric would be the growth in foreign bases rather than their total number. The wars in Iraq and Afghanistan have expanded the U.S. military's established reach, and it would be useful to see that expansion in the context of global growth.

His final point is that these contemporary wars will cost the U.S. more than the Revolutionary War, the War of 1812, the Mexican-American War, the Civil War and the Spanish-American War combined, even adjusted for inflation. Again, the better comparison is as a percent of historic GDP. Has the U.S.'s spending on wars increased as a percent of GDP since these wars, or decreased? It seems that the latter would be true, but a more extensive data collection effort is needed to confirm it.

Why I Will Not Analyze The New WikiLeaks Data

By now you surely have read about the latest massive disclosure of classified documents from WikiLeaks. Unlike the previous two disclosures, which were thousands of Significant Activity (SIGACT) reports from the Afghanistan and Iraq wars respectively, the latest leak consists of hundreds of thousands of cables communicated between the U.S. State Department and its many diplomats deployed around the world. As many of you know, after the first WikiLeaks disclosure on the Afghanistan War I—along with others—generated several analyses and visualizations based on this data.

I am a strong supporter of government transparency, and open data and analysis more generally. I viewed the first large WikiLeaks disclosure as an unprecedented opportunity to show the power of such openness. This, however, was not without reservation. Having worked inside the U.S. intelligence community, I was cognizant of the potential damage these data could do; first with respect to the U.S. government, but more importantly to those individuals working inside Afghanistan. Mindful of this, we focused on aggregate-level analyses of the data, and did not investigate individual reports or expose the names they contained.

I still believe that significant and meaningful discoveries are yet to be made from the Afghanistan disclosure about conflict, its effect on civilians, and the spatial-temporal nature of violence. I do not, however, believe that such discovery was ever the intent of the WikiLeaks organization. To the contrary, WikiLeaks's continued and reckless pursuit of classified document disclosures seems to have much more to do with the proclivities of the organization's founder, and very little to do with building knowledge or improving democratic discourse.

The latest leak typifies the identity and culture of WikiLeaks and by continuing to analyze new disclosures I am tacitly supporting this, which is something I will not do. WikiLeaks' motivation is that of a court jester, to mock and ridicule the contradictions of a state. However, they present themselves as a sage with the wisdom to adjudicate the public relevance of all information, which is the greatest contradiction of all.

To be clear, this is an entirely personal decision, and is not meant to discourage others from endeavoring to glean insight from this new data. The substantive value of the day-to-day machinations of diplomats, however, is dubious at best—even in aggregate.

Openness of information can lead to great things, not the least of which is the democratization of knowledge in ways never before possible. Shoving private messages into the public sphere without any context or care for the consequences can lead to misunderstanding, fear, and aggression. Unfortunately, WikiLeaks appears to be in the business of promoting the latter.

What Data Visualization Should Do: Simple Small Truth

Yesterday the good folks at IA Ventures asked me to lead off the discussion of data visualization at their Big Data Conference. I was rather misplaced among the high-profile venture capitalists and technologists in the room, but I welcome any opportunity to wax philosophical about the power and danger of conveying information visually.

I began my talk by referencing the infamous Afghanistan war PowerPoint slide because I believe it is a great example of spectacularly bad visualization, and of how good intentions can lead to disastrous results. As it turns out, the war in Afghanistan is actually very complicated, and by attempting to represent that complex problem in its entirety the slide loses much more than it gains. Sticking with that theme, yesterday I focused on three key things that I think data visualization should do:

  1. Make complex things simple
  2. Extract small information from large data
  3. Present truth, do not deceive

The emphasized words highlight the goal of all data visualization: to present an audience with simple, small truth about whatever the data are measuring. To explore these ideas further, I provided a few examples.

As the Afghanistan war slide illustrates, networks are often the most poorly visualized data. This is frequently because those visualizing network data think it is a good idea to include all nodes and edges in the visualization. This, however, is not making a complex thing simple; rather, it is making a complex thing ugly.

Below is an example of exactly this problem. On the left is a relatively small network (roughly 2,220 nodes and 4,400 weighted edges). I have used edge thickness to illustrate weights, and a basic force-directed algorithm in Gephi to position the nodes. This is a network hairball, and while it is possible to observe some structural properties in this example, many more subtle aspects of the data are lost in the mess.
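
The hairball above was laid out in Gephi; for those who prefer code, a minimal NetworkX/matplotlib sketch of the same "draw everything" approach might look like the following, with a random weighted graph standing in for the real data.

```python
import random
import networkx as nx
import matplotlib.pyplot as plt

# A random weighted graph stands in for the real data set.
random.seed(42)
G = nx.gnm_random_graph(2220, 4400, seed=42)
for u, v in G.edges():
    G[u][v]["weight"] = random.random()

pos = nx.spring_layout(G, seed=42)                      # force-directed layout
widths = [2 * G[u][v]["weight"] for u, v in G.edges()]  # edge thickness = weight
nx.draw_networkx_nodes(G, pos, node_size=5)
nx.draw_networkx_edges(G, pos, width=widths, alpha=0.3)
plt.axis("off")
plt.show()
```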

[Figure: Slide06.png — the full network drawn as a hairball]
[Figure: Slide07.png — the same network after simplification]

On the right are the same data, but I have used information contained in the data to simplify the visualization. First, I performed a k-core analysis to remove all pendants and pendant chains in the data, an extremely useful technique I have mentioned several times before. Next, I used the weighted in-degree of each node as a color scale for its edges, i.e., the darker the blue, the higher the in-degree of the node the edges connect to. Then, I simply dropped the nodes from the visualization entirely. Finally, I added a threshold weight for the edges so that any edge below the threshold is drawn in the lightest blue. Using these simple techniques the community structures are much more apparent, and more importantly, the means by which those communities are related are easily identified (note the single central node connecting nearly all communities).
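
A rough sketch of those simplification steps is below, again with a random weighted, directed graph standing in for the real data; the threshold value is arbitrary and only for illustration.

```python
import random
import networkx as nx
import matplotlib.pyplot as plt

# Stand-in for the real weighted, directed network.
random.seed(42)
G = nx.gnm_random_graph(2220, 4400, directed=True, seed=42)
for u, v in G.edges():
    G[u][v]["weight"] = random.random()

# 1. A k-core analysis (k=2) removes pendants and pendant chains.
core_nodes = nx.k_core(G.to_undirected(), k=2).nodes()
H = G.subgraph(core_nodes)

# 2. Color each edge by the weighted in-degree of the node it points to;
#    edges below the weight threshold fall to the lightest end of the scale.
in_strength = dict(H.in_degree(weight="weight"))
threshold = 0.25
edge_colors = [in_strength[v] if H[u][v]["weight"] >= threshold else 0.0
               for u, v in H.edges()]

# 3. Draw edges only; the nodes themselves are dropped from the picture.
pos = nx.spring_layout(H, seed=42)
nx.draw_networkx_edges(H, pos, edge_color=edge_colors,
                       edge_cmap=plt.cm.Blues, alpha=0.6, arrows=False)
plt.axis("off")
plt.show()
```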

To discuss the importance of extracting small information from large data I used the visualization of the WikiLeaks Afghanistan War Diaries that I worked on this past summer. The original visualization is on the left, and while many people found it useful, its primary weakness is the inability to distinguish among the various attack types represented on the map. It is clear that activity gradually increased in specific areas over time; however, it is entirely unclear what activity was driving that evolution. A better approach is to focus on one attack type and attempt to glean information from that single dimension.

[Figure: Slide08.png — original War Diaries visualization, all attack types]
[Figure: Slide09.png — 'Explosive Hazard' reports only]

On the right I have extracted only the 'Explosive Hazard' data from the full set and visualized it as before. Now it is easy to see that IEDs were a primary force in the war and, as has been observed before, that the main highway in Afghanistan significantly restricted the operations of forces.
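
The extraction step itself is simple. Here is a sketch with pandas; the file name and column labels ("Type", "Date") are my assumptions about a War Diary export, not a documented schema.

```python
import pandas as pd

# Hypothetical file and column names for the War Diary export.
diary = pd.read_csv("afg_war_diary.csv")

# Keep only one attack type, then summarize it over time.
explosive = diary[diary["Type"] == "Explosive Hazard"].copy()
explosive["Date"] = pd.to_datetime(explosive["Date"])
monthly_counts = explosive.groupby(explosive["Date"].dt.to_period("M")).size()
print(monthly_counts.head())
```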

Finally, to show the danger of data deception I replicated a chart published at the Monkey Cage a few months ago on the sagging job market for political science professors. On the left is my version of the original chart. At first glance, the decline in available assistant professorships over time is quite alarming. The steep slope conveys a message of general collapse in the job market. This, however, is not representative of the truth.

[Figure: Slide10.png — assistant professorships, y-axis scaled to the data limits]
[Figure: Slide11.png — the same data with the y-axis scaled from zero]

Note that in the visualization on the left the y-axis runs from 450 to 700, which happen to be the limits of the data. Many data visualization tools, including ggplot2, which is used here, scale their axes to the data limits by default. Often this is desirable, hence the default behavior, but in this case it conveys a dishonest perspective on the job market decline. As you can see from the visualization on the right, by scaling the y-axis from zero the decline is much less dramatic, though still relatively troubling for those of us who will be going on the job market in the not-too-distant future.
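
The charts above were built in ggplot2; as an illustration of the same point in Python/matplotlib (not the original code), here is a small sketch with made-up counts standing in for the placement data.

```python
import matplotlib.pyplot as plt

# Made-up numbers standing in for the assistant-professorship counts.
years = [2005, 2006, 2007, 2008, 2009, 2010]
openings = [690, 660, 645, 630, 540, 470]

fig, (ax_data, ax_zero) = plt.subplots(1, 2, figsize=(9, 4), sharex=True)

ax_data.plot(years, openings)
ax_data.set_ylim(450, 700)   # scaled to the data limits: the drop looks severe
ax_data.set_title("y-axis from 450")

ax_zero.plot(years, openings)
ax_zero.set_ylim(0, 700)     # scaled from zero: same data, a calmer picture
ax_zero.set_title("y-axis from 0")

plt.show()
```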

These ideas are very straightforward, which is why I think they are so important to consider when doing your own visualizations. Thanks again to IA Ventures for providing me a small soap box in front of such a formidable crowd yesterday. As always, I welcome any comments or criticisms.

Cross-posted at dataists

Should Researchers Share Their Code-in-Progress Online?

I am a huge fan of github. Not only do I think it is a great service, but I love the idea of having my work freely accessible for people to view, use, and critique. I have transitioned all of the code from the ZIA Code Repository there, used it to collaborate with Aric Hagberg on our NetworkX workshop, and even gave a presentation to a group of fellow graduate students in my agent-based modeling class a few weeks ago on the joys of version control with git.

I am also pushing the code associated with what I hope will be a large part of my dissertation work to my github account. There are, of course, inherent risks in "airing my dirty laundry" for all the world to see. Last night I had a conversation with a friend about these risks. He uses github as I do, but when he mentioned it to his advisor he was strongly advised to take down the code. Unfortunately, this advice came without an explanation, but clearly this seasoned academic viewed the risk of posting premature code as irreconcilable with any advantage.

Without a doubt, there are numerous bugs in my code on github, but I had a very hard time understanding why that was a problem. During the conversation last night we went back and forth trying to account for all of the risks of putting our code-in-progress online before it was fully developed. As new reasons came up, we seemed to easily find more compelling—at least to me—counter-arguments. Here are some of the reasons we came up with:

  • People will steal your ideas - This seems to be the most common reason for keeping code private, but also the easiest to counter. How can someone steal something that you have already publicly claimed as your own? I understand that graduate students may fear that senior academics with a higher profile could "borrow" their work and use their position to get it to publication faster, and while this may have at one time been a legitimate fear, code repositories like github date/time stamp everything. If someone steals your work, having a repository is your only recourse, and it actually acts as a much more effective guard against intellectual property theft than keeping things secret.
  • People will see all of your mistakes - This is absolutely true, but so what? In both the hard and social sciences there are strong traditions of posting "working" versions of papers online. Part of the reason for doing this is to get some response from the community about the work, whether solicited or unsolicited. It is quite common for ambitious graduate students to dig deeply into the appendix of a paper to check a proof or data coding, and to forward any errata to the author. This is precisely the same dynamic that occurs when bugs are flagged in code, and it is a good thing.
  • Incomplete projects make you seem fickle - Part of what I love about github is how easy it is to create a new repository. Every time I have a new coding idea I can fire through a few commands in the terminal and be ready to push code. This, however, can lead to many incomplete projects—the dreaded "abandonware." I think this is a fair criticism, but only if incomplete projects are all one ever posts. A better idea is to have one repository that you use as a sandbox, and to be explicit about its purpose. In the software development world this is standard operating procedure, and it should be for scientific research. One researcher's sandbox may be another's career, and allowing others to see your ideas in an area can spark brilliance.

After all this self-assurance, however, I am eager for someone to convince me otherwise. Has anyone had a particularly negative experience with posting code? Are there disadvantages that we could not come up with? Posting code seems like an obviously good thing to me, which makes me very suspicious that I am wrong. Please help!