Language used by Academics with the Protection of Anonymity

Those in the political science discipline probably remember their first encounter with poliscijobrumors.com. Those outside it have probably never heard of this particular message board, and would have no reason to. As the URL suggests, the board specializes in rumor, gossip, backbiting, mudslinging, and the occasional lucid thread on the political science discipline. By browsing the posts one can quickly see how the protection of anonymity results in the lowest common denominator of discourse, even among members of the Ivory Tower!

If you are unconvinced, simply test Godwin's law for yourself.

The convergence of specific topics within a discipline and the promise of anonymity, however, makes for a very interesting data set on the use of language in this context.

I have always been curious about what patterns could be extracted from this particular forum. Specifically, given that people can mask their identities, which often leads to very low-quality discourse, is it still possible to identify topic areas of interest by examining the data in aggregate? Furthermore, will any of them have anything to do with political science?

The answer: kind of...

polisci_words.png

The message board has been around for a long time, so it was infeasible to go out and scrape the entire corpus. Short of that, I decided to create a text corpus of the first 1,018 threads in the General Job Market Discussion. The 1,018 comes from the fact that several threads include multiple pages, so rather than strictly stopping at 1,000 pages I decided to try to be inclusive of full threads.

With all the data in hand, the analysis was very straightforward. I constructed a term-document matrix, with the usual linguistic noise removed, and multiplied it by its transpose to get the number of times each pair of words was used in the same thread. The result is an N-by-N matrix, where N is the number of distinct words and each element is the number of times a pair of words was used in the same thread. We can think of this data as a weighting among words: the higher the number, the "closer" the affiliation.
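The co-occurrence step can be sketched in a few lines. This is a minimal illustration in plain Python, not the R code behind the post: the function names, the toy threads, and the stopword list are all hypothetical, and the sketch assumes a binary term-document matrix (1 if a word appears in a thread, 0 otherwise), so each element counts the threads two words share.

```python
def term_document_matrix(threads, stopwords=frozenset()):
    """Return (terms, matrix); matrix[i][j] is 1 if terms[i] appears in threads[j]."""
    thread_sets = [set(thread) - set(stopwords) for thread in threads]
    terms = sorted(set().union(*thread_sets))
    matrix = [[1 if term in s else 0 for s in thread_sets] for term in terms]
    return terms, matrix


def cooccurrence(matrix):
    """Multiply the term-document matrix by its transpose, giving an N-by-N matrix."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
        for row_i in matrix
    ]


# Hypothetical toy corpus: each thread is already tokenized.
threads = [
    ["tenure", "job", "market"],
    ["job", "market", "the"],
    ["tenure", "committee"],
]
terms, tdm = term_document_matrix(threads, stopwords={"the"})
cooc = cooccurrence(tdm)

for term, row in zip(terms, cooc):
    print(term, row)
```

The diagonal holds the number of threads each word appears in; when the matrix is treated as a graph, those self-loops would typically be dropped.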

Another way to think about this is graphically, treating the data as a weighted adjacency matrix. The words become nodes, and the edges are weighted by the co-occurrence count in each element. This is helpful because we can now use force-directed methods to place words near each other in two-dimensional space. In such a layout, both the x- and y-axis positions of a word are directly relevant to how words relate to each other.
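To make the idea of force-directed placement concrete, here is a rough sketch in the style of the Fruchterman-Reingold algorithm, written in plain Python. It is not the igraph routine, and the function name, the parameter values, the circular starting positions, and the toy weights are assumptions chosen for illustration: every pair of words repels, and words that share an edge attract in proportion to their co-occurrence weight.

```python
import math


def force_layout(weights, iterations=300, k=1.0):
    """Place nodes in 2-D. weights maps (word_a, word_b) -> co-occurrence weight."""
    nodes = sorted({n for edge in weights for n in edge})
    # Deterministic start (an assumption): nodes evenly spaced on a unit circle.
    pos = {
        n: [math.cos(2 * math.pi * i / len(nodes)),
            math.sin(2 * math.pi * i / len(nodes))]
        for i, n in enumerate(nodes)
    }
    for step in range(iterations):
        temp = 0.1 * (1 - step / iterations)  # maximum move per step, cooling to 0
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = max(math.hypot(dx, dy), 1e-9)
                f = k * k / d
                disp[u][0] += dx / d * f
                disp[u][1] += dy / d * f
                disp[v][0] -= dx / d * f
                disp[v][1] -= dy / d * f
        # Attraction along edges, scaled by the edge weight.
        for (u, v), w in weights.items():
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = max(math.hypot(dx, dy), 1e-9)
            f = w * d * d / k
            disp[u][0] -= dx / d * f
            disp[u][1] -= dy / d * f
            disp[v][0] += dx / d * f
            disp[v][1] += dy / d * f
        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1])
            if length > 0:
                scale = min(length, temp) / length
                pos[n][0] += disp[n][0] * scale
                pos[n][1] += disp[n][1] * scale
    return pos


# Hypothetical weights: "job" and "market" co-occur far more often than "job" and "tenure".
layout = force_layout({("job", "market"): 9, ("job", "tenure"): 1})
d_strong = math.dist(layout["job"], layout["market"])
d_weak = math.dist(layout["job"], layout["tenure"])
print(d_strong < d_weak)
```

The heavily weighted pair should settle closer together than the lightly weighted pair, which is the property the visualization relies on.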

This positional data also gives us a sense of distance between words, i.e., the farther apart two words are, the less likely it is that they are used in the same thread. From this we can create "topic" clusters. That is, we can attempt to divide the words into clusters based on their distances, and these clusters can represent consistent topics within the entire corpus of data. To do this I simply use k-means clustering with 8 centers. The choice of 8 was made because the "Dark2" Color Brewer palette only has 8 colors in it (an art vs. science compromise).
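For readers unfamiliar with k-means, the clustering step can be sketched as follows. This is a bare-bones version of Lloyd's algorithm in plain Python, not the R routine used for the post; the function name and the toy coordinates are hypothetical, and the starting centers are picked deterministically from the input (an assumption made so the example is repeatable, whereas standard implementations usually start from random centers).

```python
import math


def kmeans(points, k, iterations=100):
    """Cluster 2-D points into k groups; returns (labels, centers)."""
    # Assumption: take evenly spaced input points as the initial centers.
    centers = [tuple(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centers


# Hypothetical word positions forming two obvious groups.
word_positions = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(word_positions, k=2)
print(labels)
```

With the layout from the post, the same call would use k=8 and the x- and y-positions of every word.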

Finally, because I think it is an immediately obvious way to convey this, words are sized by the log of their frequency in the entire corpus. The visualization above is the result of this analysis, which follows from previous thoughts on building better word clouds.
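The sizing rule can be written down directly. A small sketch in plain Python, where the word counts, the function name, and the choice to rescale the log frequencies into a fixed range of text sizes are all assumptions for illustration:

```python
import math


def word_sizes(frequencies, min_size=2.0, max_size=10.0):
    """Map raw word counts to text sizes that grow with the log of the count."""
    logs = {word: math.log(count) for word, count in frequencies.items()}
    low, high = min(logs.values()), max(logs.values())
    span = (high - low) or 1.0  # avoid dividing by zero when all counts match
    return {
        word: min_size + (max_size - min_size) * (value - low) / span
        for word, value in logs.items()
    }


# Hypothetical counts spanning two orders of magnitude.
sizes = word_sizes({"job": 1000, "tenure": 100, "rumor": 10})
print(sizes)
```

Using the log keeps a handful of very common words from dwarfing everything else: each tenfold increase in frequency adds the same increment in size.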

What can we say about this analysis? It seems to me that the topics are fairly similar. Moreover, despite the low level of the overall discourse on the forum, in aggregate the topics are very relevant to the political science discipline and job market. That said, a non-negligible amount of profanity does make it into the visualization, though thankfully those words are not among the most frequently used. The placement of certain cities and universities into various topic clusters is also interesting.

Keep in mind that the method I use here is very different from the LDA topic modeling I have discussed in the past. Perhaps that would have produced better topic clusters; however, I do think one benefit of this method is the non-stochastic nature of the clusters.

Code available for download and inspection at the ZIA Code Repository.

R Packages Used

  • XML
  • tm
  • igraph
  • ggplot2

Swallowing the Academic "Red Pill"

By now you have likely read the lengthy attack on the value of acquiring a PhD and entering an academic career in The Economist's end-of-the-year issue. As might be expected, this has generated much consternation within The Tower, and garnered some excellent responses from within my own discipline. What seems to be missing from the discussion is perspective from the so-called "cheap, highly motivated and disposable labour," i.e., graduate students. To that end, as someone who has recently tasted, and at least partially digested, the jagged red pill that is entering the academic rabbit hole, allow me a moment to reflect on the decision.

To be honest, it is very difficult to argue with the numbers put forward by The Economist. The years immediately following college graduation are the most crucial in shaping an individual's lifetime earnings. The ancient tradition of apprenticeship used to train graduate students is deliberately slow, which forces people to forgo these crucial earning years in exchange for admittance into an extremely exclusive career path. To borrow Josh's sports metaphor, the 64,000 PhDs produced in the United States each year, as cited by The Economist, are at least comparable to the tens of thousands of individuals drafted into the top professional sports teams in the U.S. each year. The pursuit of a dream often begins with a massive opportunity cost. The individual deciding whether to follow a dream to play Major League Baseball faces an extraordinarily low probability of reaching that dream and forgoes considerable future earnings (the first contract season for Minor League players pays $1,100/month), and so too does the college senior filling out graduate school applications.

Part of the problem for academics is that the mythology of their career is not celebrated to a degree even reasonably comparable to that of the professional athlete. On my first day of graduate school one of my professors said, "Congratulations on being accepted to the program. While most people will not understand it, you have one of the greatest jobs on the planet. People are going to pay you to think, and I think that is pretty cool."

I think it is pretty cool too. While at face value that statement no more reflects the reality of graduate school than Summer Catch reflects the realities of the Minor League baseball system, it is important to remember what an academic career is really about: to be one of the world's best thinkers, period. The original article attacking academia never considers this point, and instead treats doctoral research as any other kind of on-the-job training. The fact is, there are very few people who will successfully navigate their graduate program and be hired as tenure-track faculty, and even fewer who will go on to be successful academics. It is an environment where a very specific set of goals, blended with unique intellectual, interpersonal and labor skills, is needed to flourish, not unlike many other highly specialized careers.

Fortunately, as with other highly specialized careers, it is easy to recognize whether your own goals and skills exist at some sufficient intersection with what the academy expects. Having sat at this desk for countless hours considering my own career path, here are a few questions I have found most valuable upon reflection:

  • Do you love school? - To be clear, this is not the same question as, "Do you love learning?" Learning, for the most part, is accomplished either as the result of a natural ability or by force of will. Schools and universities, on the other hand, are the institutions established to train you and judge your qualifications to become a member of academia. That system is both highly flawed and extremely difficult to change. Success in graduate school is as much about appreciating the system as it is about working in it.
  • Do you care about money? - Recall that the pursuit of a doctoral degree is a massive opportunity cost. Are you prepared to pay it? All things being equal, having a PhD will make you poorer in the long run, so if you care about money then do not waste your time. You make money by earning it, and the currency paid in graduate school is not money.
  • Does your life adhere to a timeline? - Do you have personal goals on a calendar? Something like, "I want to be married by 20XX, and have children Y years later." Despite the popular notion that an academic life comes with unmatched freedom, it is in fact a heavily structured system. The apprenticeship is about devotion to that system, which often means delaying many other life goals.
  • Do you want your work to end? - Professional scholarship is a completely immersive existence. Your work never ends, and at least ostensibly everything you do is about building your academic footprint. The specifics of this differ among disciplines, but the point is your work goes with you, everywhere. This is not the same as having a smart phone that "tethers" you to the office. Smart phones can be turned off, but the unending productive expectations of academia cannot.

While the comparisons to Major League Baseball and The Matrix are meant to be fun, they also highlight the true uniqueness of academia as an employment choice, one that is almost impossible to appreciate from the outside. The Economist does a disservice by attempting to place it in the context of every other type of work as a means to diminish its value. For the dreamers, it can be the most challenging, rewarding, and satisfying experience imaginable, and there are no economic or social implications that could ever change that.

The Difference Between Relative and Absolute Comparison, and the NYT Op-Ed Page

I am not often drawn to comment on editorials, but a piece by Nicholas Kristof in this Sunday's New York Times seemed worthy of note. In the piece, "The Big (Military) Taboo," Kristof attempts to make the argument that as the United States faces difficult budgetary decisions the military should come under serious fiscal scrutiny.

While there is no doubt that such an investigation could result in significant savings, Kristof motivates his argument by citing several statistics related to the size of the U.S. military, with the assertion that they represent evidence of waste. The trouble with these figures is that they are absolute numbers, which are inappropriate in the context of this argument.

The United States spends nearly as much on military power as every other country in the world combined, according to the Stockholm International Peace Research Institute. It says that we spend more than six times as much as the country with the next highest budget, China.

The statement is true, but deceptive. While the United States does spend nearly as much as every other country combined, our collective wealth is equally disproportionate. The better comparison is to show how much each country spends on its military as a percent of GDP. This provides a more common reference among countries. As you can see, in this perspective the United States is not nearly as notable an outlier.

absolute_military_spending.png
relative_military_spending.png

The intelligence community is so vast that more people have “top secret” clearance than live in Washington, D.C.

The estimated population of Washington, DC as of July 2009 was 599,657. First, that is not a very big number relative to the total population of the U.S., or even to that of small states (my home state of Connecticut boasts a population of over 3.5 million). Furthermore, the number of individuals with "top secret" clearances is unknown, but the Washington Post estimated the number at 854,000. On its own that seems large, at nearly one million. Consider, however, that number in the context of the size of the U.S. population, which was recently estimated by the Census Bureau at 308,745,538.

If these estimates are accurate, then approximately 1 in 361 Americans holds a "top secret" clearance, or about 0.28% of the population. For those in the New York City area, those proportions would equate to about 22,151 fellow New Yorkers, enough to fill slightly more than one-quarter of New Meadowlands Stadium. Certainly not a staggering figure by any means.
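The arithmetic behind these proportions is easy to check. In the sketch below, the U.S. population and clearance estimate come from the text above; the New York City population (the 2000 Census count, which reproduces the 22,151 figure) and the stadium capacity are assumptions that do not appear in the post.

```python
us_population = 308_745_538   # Census Bureau estimate cited above
clearances = 854_000          # Washington Post estimate cited above

people_per_clearance = us_population / clearances   # roughly 361.5
share = clearances / us_population                  # roughly 0.0028, i.e. 0.28%

# Assumptions not stated in the post:
nyc_population = 8_008_278    # 2000 Census count for New York City
stadium_capacity = 82_500     # approximate New Meadowlands Stadium capacity

nyc_equivalent = nyc_population * share
stadium_fraction = nyc_equivalent / stadium_capacity

print(f"1 in {people_per_clearance:.1f} Americans ({share:.2%})")
print(f"NYC equivalent: {nyc_equivalent:,.0f} people")
print(f"Share of stadium filled: {stadium_fraction:.1%}")
```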

Kristof also cites two other statistics. The first is that the U.S. maintains "more than 560 bases and other sites around the world." A more interesting metric would be the growth in foreign bases, rather than the total number, though a quick search did not yield any data on it. While the wars in Iraq and Afghanistan have expanded the U.S. military's established reach, it would be useful to see this in the context of global growth.

His final point is that these contemporary wars will cost the U.S. more than the Revolutionary War, the War of 1812, the Mexican-American War, the Civil War and the Spanish-American War combined, even adjusted for inflation. Again, the better comparison is as a percent of historic GDP. Has the U.S.'s spending on wars increased as a percent of GDP since these wars, or decreased? It seems that the latter would be true, but a more extensive data collection effort is needed to confirm.

Why I Will Not Analyze The New WikiLeaks Data

By now you surely have read about the latest massive disclosure of classified documents from WikiLeaks. Unlike the previous two disclosures, which were thousands of Significant Activities reports (SIGACT) from the Afghanistan and Iraq wars respectively, the latest leak consists of hundreds of thousands of cables communicated between the U.S. State Department and its many diplomats deployed around the world. As many of you know, after the first WikiLeaks disclosure on the Afghanistan War I, along with others, generated several analyses and visualizations based on this data.

I am a strong supporter of government transparency, and open data and analysis more generally. I viewed the first large WikiLeaks disclosure as an unprecedented opportunity to show the power of such openness. This, however, was not without reservation. Having worked inside the U.S. intelligence community, I was cognizant of the potential damage these data could do; first with respect to the U.S. government, but more importantly to those individuals working inside Afghanistan. Mindful of this, we focused on aggregate-level analyses of the data, and did not investigate individual reports or expose the names they contained.

I still believe that significant and meaningful discoveries are yet to be made from the Afghanistan disclosure about conflict, its effect on civilians, and the spatial-temporal nature of violence. I do not, however, believe that such discovery was ever the intent of the WikiLeaks organization. To the contrary, WikiLeaks' continued and reckless pursuit of classified document disclosures seems to have much more to do with the proclivities of the organization's founder, and very little to do with building knowledge or improving democratic discourse.

The latest leak typifies the identity and culture of WikiLeaks, and by continuing to analyze new disclosures I would be tacitly supporting this, which is something I will not do. WikiLeaks' motivation is that of a court jester: to mock and ridicule the contradictions of a state. However, they present themselves as a sage with the wisdom to adjudicate the public relevance of all information, which is the greatest contradiction of all.

To be clear, this is an entirely personal decision, and is not meant to discourage others from endeavoring to glean insight from this new data. The substantive value of the day-to-day machinations of diplomats, however, is dubious at best, even in aggregate.

Openness of information can lead to great things, not the least of which is the democratization of knowledge in ways never before possible. Shoving private messages into the public sphere without any context or care for the consequences can lead to misunderstanding, fear, and aggression. Unfortunately, WikiLeaks appears to be in the business of promoting the latter.