In Search of Power-laws: WikiLeaks Edition

Yesterday, a commenter reminded me of the very popular hobby among scientists of searching for power-law distributions in large event data. While the commonality of scale invariance in event data is quite well known—particularly with respect to conflict data—this has not prevented many researchers from seeking and finding these patterns in data.

As the commenter notes, it is likely that the WikiLeaks data will soon be annexed into this line of research. Before other researchers examine the distributional properties of these data more thoroughly, it is worth doing a quick exploration to show some of the issues in power-law fishing, and how to avoid them. First, we begin by plotting the WikiLeaks casualty data using the traditional log-log transformation and fitting a linear regression.

kia_freq_mag.png
all_freq_mag.png

The search for power-law distributions often focuses on the scaling parameter: $latex -\alpha$. Scaling parameters where 2 < $latex \alpha$ < 3 are generally accepted as fitting a power-law, thus the search is for values in that range.

When using linear regression to fit the distribution, this parameter is calculated simply as the slope of the linear fit to the logged data. The above panels show this analysis for two different versions of the data. On the left, the data are restricted to only observations with, in the words of Lewis Richardson, "deadly quarrels": WikiLeaks events where a death occurred, counting friendly, enemy, host nation and civilian deaths. Interestingly, we find that for the KIA data the scaling parameter falls just outside the necessary range to be classified as a power-law.
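As a rough illustration of the regression step described above, here is a minimal Python sketch (not the R code used for the plots). It assumes the input is a plain list of per-event casualty counts; the function name `loglog_slope` is hypothetical. It tabulates how often each magnitude occurs, logs both axes, and fits an ordinary least-squares line, so the returned slope plays the role of $latex -\alpha$.

```python
import math
from collections import Counter


def loglog_slope(casualties):
    """Fit a least-squares line to log10(frequency) vs. log10(magnitude).

    `casualties` is assumed to be a list of per-event casualty counts.
    Events with zero casualties are dropped because log(0) is undefined.
    Returns (slope, intercept); the slope estimates -alpha.
    """
    freq = Counter(c for c in casualties if c > 0)
    xs = [math.log10(magnitude) for magnitude in freq]
    ys = [math.log10(freq[magnitude]) for magnitude in freq]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Toy data whose frequencies fall off exactly as magnitude**-2.
    toy = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
    slope, intercept = loglog_slope(toy)
    print("estimated slope:", slope)
```

The toy data are constructed so that a straight line fits perfectly; real event data are far noisier in the tail, which is one reason this approach can mislead.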

At this point we might conclude that the data do not fit our assumptions and move on to test other distributions. If we were particularly motivated to find a power-law in these data, however, one option would be to go back and loosen our restriction on the data to include not just KIAs but all casualties, i.e., non-deadly quarrels. The assumption is that with more data points the "tail" of the distribution would be longer and thus more likely to fit a power-law. The right panel above illustrates this analysis, and as you can see in this case we find that the data do fit a power-law, with $latex -\alpha = 2.08$.

Unfortunately, even if we suspend disbelief enough to accept the altogether dubious inclusion of more data points to force-fit a power-law, everything we have done up to this point is wrong. As was brilliantly detailed by Clauset et al. in "Power-law distributions in empirical data," linear fits to log-transformed data are extremely error-prone. As such, rather than rely on the above findings, we will use the method detailed by these authors for properly fitting a power-law to the WikiLeaks data.

In this case we need to do three things: 1) find the appropriate lower-bound for the value of $latex x$ for our data, which in this case are events with casualties; 2) fit the scaling parameter with $latex x_{min}$; 3) perform a goodness-of-fit test to test whether our empirical observations actually fit the parameterization of the distribution.

For the first step we are fortunate, as we know the appropriate minimum value $latex x_{min}=1$, since these are discrete event data and we are counting the number of observed casualties in the data. Equally convenient, this allows for a straightforward maximum-likelihood estimation of the scaling parameter via a variant of the well-known Hill estimator. This functionality is built into R's igraph package, so we can compute the new scaling parameters easily.
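To make the maximum-likelihood step concrete, here is a minimal Python sketch. It is not igraph's implementation, and the function names are hypothetical. It assumes the discrete power-law form from Clauset et al., where the probability of observing $latex x$ is $latex x^{-\alpha} / \zeta(\alpha, x_{min})$, so the log-likelihood for $latex n$ observations is $latex -n \ln \zeta(\alpha, x_{min}) - \alpha \sum \ln x_i$. The zeta function is approximated with a partial sum plus an Euler-Maclaurin tail correction, and the likelihood is maximized with a simple golden-section search over an assumed range of $latex \alpha$.

```python
import math


def hurwitz_zeta(alpha, xmin=1, terms=1000):
    """Approximate sum_{k >= xmin} k**(-alpha) for alpha > 1.

    Sums the first `terms` values directly and adds an Euler-Maclaurin
    estimate of the remaining tail.
    """
    head = sum(k ** -alpha for k in range(xmin, xmin + terms))
    n = xmin + terms
    tail = (n ** (1 - alpha) / (alpha - 1)
            + 0.5 * n ** -alpha
            + alpha * n ** (-alpha - 1) / 12)
    return head + tail


def fit_discrete_power_law(values, xmin=1, lo=1.01, hi=6.0, tol=1e-6):
    """Maximum-likelihood estimate of alpha for discrete data >= xmin.

    The search interval [lo, hi] is an assumption of this sketch.
    """
    tail = [v for v in values if v >= xmin]
    n = len(tail)
    sum_log = sum(math.log(v) for v in tail)

    def log_likelihood(alpha):
        return -n * math.log(hurwitz_zeta(alpha, xmin)) - alpha * sum_log

    # Golden-section search; the log-likelihood is concave in alpha.
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if log_likelihood(c) > log_likelihood(d):
            hi = d
        else:
            lo = c
    return (lo + hi) / 2


if __name__ == "__main__":
    # Synthetic counts that follow k**-2.5 up to rounding and truncation.
    data = []
    for k in range(1, 61):
        data.extend([k] * round(1e4 * k ** -2.5))
    print("alpha estimate:", fit_discrete_power_law(data))
```

Unlike the regression slope, this estimate uses every observation directly rather than a handful of binned frequencies, which is the main reason it is less sensitive to noise in the tail.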

Using this more accurate method for estimating the scaling parameter reveals that, in fact, neither set of data on the frequency and magnitude of violent events in Afghanistan fits a power-law. As a result, goodness-of-fit tests for a power-law with these data are unnecessary, but as described in Clauset et al., using a Kolmogorov–Smirnov test to measure the distance between theorized and observed distributions is a useful tool for checking fits to other distributions. There are several alternative distributions that may better fit these data, many of which are specified for simulation in the degreenet R package, but I leave that as an exercise to the reader.
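For readers who want to try the exercise, the Kolmogorov–Smirnov distance itself is simple to compute. The Python sketch below is an illustration under stated assumptions: the data are discrete, the candidate model is supplied as a cumulative distribution function, and the name `ks_distance` is hypothetical. It returns only the distance; turning that distance into a p-value requires the bootstrap procedure described by Clauset et al., which is not shown here.

```python
from collections import Counter


def ks_distance(values, model_cdf):
    """Largest gap between the empirical CDF and a model CDF.

    `values` is a list of discrete observations and `model_cdf(x)` must
    return P(X <= x) under the candidate distribution.
    """
    counts = Counter(values)
    n = len(values)
    cumulative = 0
    largest_gap = 0.0
    for x in sorted(counts):
        cumulative += counts[x]
        empirical = cumulative / n
        largest_gap = max(largest_gap, abs(empirical - model_cdf(x)))
    return largest_gap


if __name__ == "__main__":
    # Compare four observations against a uniform model on {1, 2, 3, 4}.
    observed = [1, 1, 2, 3]
    print(ks_distance(observed, lambda x: x / 4))
```

A smaller distance means the candidate distribution tracks the observed data more closely, so the same function can be used to compare several alternatives on equal footing.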

There are two primary things to take away from this exercise: 1) power-laws are much less frequently observed than is commonly thought, and careful estimation of scaling parameters and goodness-of-fit should be performed to check; 2) it appears that the WikiLeaks data fall well short of proving, or even reinforcing, previous conclusions about the underlying dynamics of violent conflict.

As always, the code used to generate this analysis is available on GitHub.

The Evolution of Report Summaries in WikiLeaks Data Over Time

Last week I posted a visualization of the reports in the WikiLeaks data projected onto a map of Afghanistan in yearly slices. Much to my delight, many people found this visualization of the data helpful, with special thanks to Wired's Danger Room blog for picking up the post and giving some well-deserved press to the power of open-source tools. While this visualization was powerful in how it conveyed the spread and increase of the military's activity in Afghanistan over time, it did not give a sense of what was actually going on in the reports.

In today's visualization I would like to dig deeper in the WikiLeaks data by analyzing the contents of report summaries, and attempt to visualize the evolution of topics in these summaries over time. To achieve this, it will be necessary to distill the tens of thousands of report summaries into general terms that best represent the topics of those reports. Using a subset of the data for reports from only the NORTH, SOUTH, EAST and WEST regions, i.e., excluding reports labeled as CAPITAL and UNKNOWN, I have performed an analysis on the report summary data using common text-mining techniques. Below is a visualization of the results, followed by a brief explanation of how the images were produced.

First, given the detail in the maps, the image was rendered in high resolution and is being shared using Microsoft's zoom.it. The image is best viewed in fullscreen mode, and I recommend you explore it that way.

To generate the terms above, I used a technique called latent Dirichlet allocation (LDA). The method is far too nuanced to describe in detail here, but generally it is used to measure similarity among disparate pieces of data and create partitions of these data that can be thought of as most representative of the overall set. It is most often used in generating topic models from large text corpora, such as the WikiLeaks report summary data. By ignoring common English stop words, and adding a few specific to this data (click here to view the code for the LDA model), I specified the model to produce ten topics, with five terms in each, for every year/region pair in the data. For every pair, therefore, the model produced 50 terms that "best fit" the topics that pair represents.
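To give a feel for what the model is doing, here is a small, self-contained Python sketch of LDA fitted with collapsed Gibbs sampling. This is not the code linked above, and it may differ from the estimation method used there. The stop-word list, the hyperparameters `alpha` and `beta`, the iteration count, and all function names are assumptions chosen for illustration. Each word is repeatedly reassigned to a topic in proportion to how much its document already uses that topic and how much that topic already uses the word; the top terms per topic are then read off the final counts.

```python
import random

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "at", "on", "with"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, keep alphabetic non-stop words."""
    return [w for w in text.lower().split()
            if w.isalpha() and w not in stop_words]


def lda_top_terms(docs, n_topics=10, n_terms=5, n_iter=200,
                  alpha=0.1, beta=0.01, seed=0):
    """Return the top `n_terms` words for each of `n_topics` topics.

    `docs` is a list of token lists, e.g. one per report summary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the word's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    topics = []
    for k in range(n_topics):
        ranked = sorted(range(n_vocab),
                        key=lambda wid: topic_word[k][wid], reverse=True)
        topics.append([vocab[wid] for wid in ranked[:n_terms]])
    return topics
```

Calling `lda_top_terms` once per year/region subset with the defaults of ten topics and five terms would yield the 50 terms per pair described above, with repeats possible across topics.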

To visualize these results, I projected the terms onto their corresponding region in Afghanistan, and sized them by the frequency they appeared in topics for each pair. It is common in LDA models for words to appear in multiple topics, so it was logical to make those that appeared more than once more prominent in the visualization. The terms were then colored by region to make it easier to distinguish among them at the regional borders. As you can see, the results are quite interesting.
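The sizing rule is easy to sketch. The Python snippet below is an illustration only: the function name `term_sizes` and the `min_size` and `step` values are hypothetical rather than the settings used for the image. It counts how many topics each term appears in for one year/region pair and maps that count to a font size, so repeated terms are drawn larger.

```python
from collections import Counter


def term_sizes(topics, min_size=8.0, step=4.0):
    """Map each term to a font size based on how many topics contain it.

    `topics` is a list of term lists for a single year/region pair.
    A term found in one topic gets `min_size`; each additional topic
    adds `step`.
    """
    counts = Counter(term for topic in topics for term in set(topic))
    return {term: min_size + step * (n - 1) for term, n in counts.items()}


if __name__ == "__main__":
    example_topics = [["ied", "patrol"], ["ied", "fire"]]
    for term, size in sorted(term_sizes(example_topics).items()):
        print(term, size)
```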

The frequency of terms like "IED" and "EOD" increases in the northern region, while "friendly," "forces" and "engage" become very prominent in the south in 2007. I have spent a lot of time exploring the image and found it to be quite fascinating, and I hope you will too. Also, many of the terms appear to be military acronyms that I am unfamiliar with, so I hope one of my more astute readers may be able to decode some of what is being represented by the topic models. UPDATE: With the help of Josh Foust and Greg Hannah, I have started an acronym reference guide below. If you know the meaning of one of the blank acronyms, please let me know!

Finally, if you have an idea for a future analysis I am happy to take suggestions. I, along with an esteemed team of data hacks, will be presenting some of our findings in this data at the next NYC R meetup and we welcome all ideas!

Acronym Guide

41r: MGRS Coordinate
41s: MGRS Coordinate
42s: MGRS Coordinate
aaf: Anti-Afghan fighters
abp: Afghan Border Police
acm: Anti-coalition militia (disc)
amf: Afghan Militia Force (disc)
ana: Afghan National Army
anp: Afghan National Police
att: At this time
baf: Bagram Air Field
bcp: (blank)
bda: Battle damage assessment
cas: Close air support
cexc: (blank)
cjsotf: Combined Joint Special Operations Task Force
cjtf76: Combined Joint Task Force - 76
cop: Combat outpost
coy: Company
cstc-a: (blank)
dcg: (blank)
eod: Explosive ordnance disposal
esoc: (blank)
fob: Forward Operating Base
fra: French
frago: Fragmentary order
idf: Indirect fire
ied: Improvised explosive device
inf: Infiltrate
ins: Insurgents
ivo: In vicinity of
kaf: Kandahar Air Field
kaia: Kabul International Airport
kdz: Kunduz
kpf: (blank)
loc: (blank)
ltc: Lieutenant Colonel
mes: (blank)
mey: Meymaneh
nds: National Directorate of Security (Afghan intel service)
ngo: Non-governmental organization
nov: (blank)
oda: Operational Detachment (Alpha), special forces
opord: Operational order
pak: Pakistan
pakmil: Pakistan Military
pax: Persons/personnel
pek: (blank)
phq: Police headquarters
plt: Platoon
poo: Point of origin
prt: Provincial reconstruction team
qrf: Quick reaction force
rcp: Route clearance patrol
rpg: Rocket propelled grenade
saf: Special Operations Forces
salt: Size, Activity, Location, Time
tgt: Tactical Group T
tic: Troops in Contact, i.e. combat
usmc: United States Marine Corps
uxo: Unexploded ordnance
vbied: Vehicle-borne IED
vino: (blank)
w/d: Wheels down
w/u: Wheels up
wia: Wounded in action

WikiLeaks Attack Data by Year and Type Projected on Afghanistan Regional Map

Below is a visualization of the WikiLeaks data produced in collaboration with Michael Dewar. This plot shows attacks in the data set by year and type, projected onto a map of Afghanistan with district boundaries.

events_by_year_map.png

This visualization is certainly not perfect, e.g., some colors are difficult to discern, but it does provide added insight into the level and location of fighting over the six years of the war represented by the data.

R code available here.

Graphs, Disease and Mental Health

During my last visit to DC I had the great pleasure of meeting CPT Benjamin Kirkup, a Microbiologist at Walter Reed Army Institute of Research's Department of Wound Infections. CPT Kirkup had invited me to his laboratory to discuss how "big data" could help their mission: combat new and virulent strains of drug resistant infectious diseases. Having not taken a biology class since high school (truly), I was both surprised and honored by the invitation.

As we began to discuss more of the lab's work, CPT Kirkup described the changing strategies and challenges for combatting infection. He noted that while the care received by soldiers has greatly improved over the last several decades with respect to first aid, surgery and rehabilitation, there is serious concern about the spread of drug-resistant diseases, both at the individual level (a patient) and the group level (platoon, etc.). These diseases have much longer-term effects, often lasting past the point of physical recovery.

As an example, CPT Kirkup described to me a scenario whereby a soldier is treated after being the victim of an IED attack. Given the nature of these attacks, this frequently requires partial amputation of a limb, which is treated to prevent infection. The danger from drug-resistant strains is that they can remain present at the point of amputation well beyond surgery. As a result, follow-up procedures must be performed in the future to address the new infection, often resulting in additional amputation and further time in a hospital recovering. The physical toll from this process is obvious, but what is less obvious is the mental toll. As CPT Kirkup explained, after these types of surgeries soldiers have to go through the mental process of accepting a new definition of who they are and their limitations. For an organization that attempts to instill a sense of invincibility in its warriors, this sudden and violent change can have extraordinary effects on a soldier's mental well-being. When drug-resistant infections cause multiple surgery/recovery procedures, this mental damage is compounded in ways that are difficult to imagine.

To help combat this cycle, CPT Kirkup's team has very recently started a program to collect data from the entire Army hospital network. They have set up sensors and procedures in these facilities to collect and parse infectious disease data at a very granular level. Moving beyond traditional data, such as cultures and blood samples, the project hopes to add spatio-temporal data, e.g., two soldiers were wounded in the same attack at time t in location z, then rode the same ambulance to some hospital, and were then separated into different recovery rooms. It is through these new bits of data that CPT Kirkup and his team hope to understand the dynamics of disease diffusion within a military population, and also how they might be able to identify "early warning" signs as to when these infections might occur and how to prevent them.

It quickly became clear that there was a graphical component to this problem. The data form a natural graph in a high-dimensional space, and a graph-based approach may help provide the military with the insight it hopes to gain from this project.
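One way to picture that graph is as a co-exposure network: soldiers are nodes, and two soldiers are linked whenever they shared a context (an attack, an ambulance, a recovery room) at the same time. The Python sketch below is purely illustrative. The record format, the identifiers, and the function name `co_exposure_graph` are all hypothetical and do not describe the lab's actual data.

```python
from collections import defaultdict
from itertools import combinations


def co_exposure_graph(records):
    """Link soldiers who shared a context at the same time.

    `records` is assumed to be an iterable of (soldier, context, time)
    tuples. Returns a dict mapping each (soldier_a, soldier_b) pair to
    the list of (context, time) exposures the two had in common.
    """
    members = defaultdict(set)
    for soldier, context, time in records:
        members[(context, time)].add(soldier)

    edges = defaultdict(list)
    for exposure, soldiers in members.items():
        for a, b in combinations(sorted(soldiers), 2):
            edges[(a, b)].append(exposure)
    return dict(edges)


if __name__ == "__main__":
    # Two soldiers share an attack and an ambulance, then separate rooms.
    records = [
        ("A", "attack_z", 1), ("B", "attack_z", 1),
        ("A", "ambulance_7", 2), ("B", "ambulance_7", 2),
        ("A", "room_1", 3), ("B", "room_2", 3),
    ]
    for pair, shared in co_exposure_graph(records).items():
        print(pair, shared)
```

Keeping the list of shared exposures on each edge preserves the spatio-temporal detail, which is what would let an analysis trace where and when an infection might have passed between patients.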

I am very excited about having the opportunity to work with CPT Kirkup and his team on this project, and I will provide periodic updates on discoveries, failures and general progress related to my participation.

Photo: BBC News