The Evolution of Report Summaries in WikiLeaks Data Over Time

Last week I posted a visualization of the reports in the WikiLeaks data projected onto a map of Afghanistan in yearly slices. Much to my delight, many people found this visualization of the data helpful, with special thanks to Wired's Danger Room blog for picking up the post and giving some well-deserved press to the power of open-source tools. While this visualization was powerful in how it conveyed the spread and increase of the military's activity in Afghanistan over time, it did not give a sense of what was actually going on in the reports.

In today's visualization I would like to dig deeper into the WikiLeaks data by analyzing the contents of the report summaries and attempting to visualize how the topics in those summaries evolve over time. To achieve this, it is necessary to distill the tens of thousands of report summaries into general terms that best represent the topics of those reports. Using a subset of the data covering only reports from the NORTH, SOUTH, EAST and WEST regions, i.e., excluding reports labeled as CAPITAL and UNKNOWN, I analyzed the report summaries using common text-mining techniques. Below is a visualization of the results, followed by a brief explanation of how the images were produced.

Given the detail in the maps, the image was rendered in high resolution and is being shared using Microsoft's zoom.it. The image is best viewed in fullscreen mode, and I recommend you explore it that way.

To generate the terms above, I used a technique called latent Dirichlet allocation (LDA). The method is far too nuanced to describe in detail here, but in general it treats each document as a mixture of topics and each topic as a distribution over words, which lets it partition a collection into groups of terms that can be thought of as most representative of the overall set. It is most often used to generate topic models from large text corpora, such as the WikiLeaks report summary data. After removing common English stop words, plus a few specific to this data (click here to view the code for the LDA model), I specified the model to produce ten topics, with five terms in each, for every year/region pair in the data. For every pair, therefore, the model produced 50 terms that "best fit" the topics that pair represents.
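For readers who want to see the mechanics, here is a minimal sketch of the procedure described above, written in plain Python with no third-party libraries. It is not the code linked above: the collapsed Gibbs sampler, the function names `tokenize` and `lda_top_terms`, the tiny stop-word list, the hyperparameters `alpha` and `beta`, and the toy report strings are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Illustrative stop words only; the real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "at", "on", "was", "were"}

def tokenize(text, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, and drop stop words."""
    tokens = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [t for t in tokens if t and t not in stop_words]

def lda_top_terms(docs, n_topics=10, n_terms=5, n_iter=200,
                  alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the top terms per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]        # topic counts per document
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                      # total words per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Rank the vocabulary by count within each topic (ties broken alphabetically).
    return [
        sorted(vocab, key=lambda w: (-topic_word[k][w], w))[:n_terms]
        for k in range(n_topics)
    ]

# Toy, made-up summaries standing in for the real reports.
reports = [
    (2006, "SOUTH", "Patrol found an IED on the road"),
    (2006, "SOUTH", "EOD team cleared the IED"),
    (2007, "SOUTH", "Friendly forces engaged insurgents"),
    (2007, "SOUTH", "Insurgents attacked the patrol"),
]

# One model per year/region pair, as in the post (smaller settings for the toy data).
by_pair = defaultdict(list)
for year, region, summary in reports:
    by_pair[(year, region)].append(tokenize(summary))

terms_by_pair = {
    pair: lda_top_terms(docs, n_topics=2, n_terms=3, n_iter=50)
    for pair, docs in by_pair.items()
}
print(terms_by_pair)
```

With the defaults of ten topics and five terms, each year/region pair yields the 50 terms mentioned above. On a corpus of real size, a dedicated topic-modeling library would be far faster than this loop.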

To visualize these results, I projected the terms onto their corresponding region in Afghanistan and sized each one by the number of topics it appeared in for that pair. It is common in LDA models for words to appear in multiple topics, so it was logical to make those that appeared more than once more prominent in the visualization. The terms were then colored by region to make it easier to distinguish among them at the regional borders. As you can see, the results are quite interesting.
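The sizing step can be sketched in a few lines of Python. This is only an illustration: the function name `term_sizes`, the linear scaling, and the 8 to 32 point range are assumptions, not the settings used to render the maps, and the example topics are made up.

```python
from collections import Counter

def term_sizes(topics, min_size=8.0, max_size=32.0):
    """Count how many topics each term appears in and map counts to font sizes."""
    # set(topic) ensures a term is counted at most once per topic.
    counts = Counter(term for topic in topics for term in set(topic))
    top = max(counts.values(), default=1)
    sizes = {}
    for term, n in counts.items():
        # Linear scale: one appearance -> min_size, the most frequent -> max_size.
        scale = 0.0 if top == 1 else (n - 1) / (top - 1)
        sizes[term] = min_size + scale * (max_size - min_size)
    return counts, sizes

# Made-up topics for a single year/region pair.
example_topics = [
    ["ied", "eod", "found"],
    ["ied", "patrol", "fire"],
    ["ied", "eod", "cache"],
]
counts, sizes = term_sizes(example_topics)
print(counts.most_common(3))
print(sizes)
```

Here "ied" appears in all three topics and is drawn largest, "eod" falls in the middle, and terms that appear once are drawn at the minimum size.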

The frequency of terms like "IED" and "EOD" increases in the northern region, while "friendly," "forces" and "engage" become very prominent in the south in 2007. I have spent a lot of time exploring the image and found it quite fascinating, and I hope you will too. Many of the terms appear to be military acronyms that I am unfamiliar with, so I hope some of my more astute readers will be able to decode what the topic models are representing. UPDATE: With the help of Josh Foust and Greg Hannah, I have started an acronym reference guide below. If you know the meaning of one of the blank acronyms, please let me know!

Finally, if you have an idea for a future analysis, I am happy to take suggestions. I, along with an esteemed team of data hacks, will be presenting some of our findings from this data at the next NYC R meetup, and we welcome all ideas!

Acronym Guide

41r: MGRS Coordinate
41s: MGRS Coordinate
42s: MGRS Coordinate
aaf: Anti-Afghan fighters
abp: Afghan Border Police
acm: Anti-coalition militia (disc)
amf: Afghan Militia Force (disc)
ana: Afghan National Army
anp: Afghan National Police
att: At this time
baf: Bagram Air Field
bcp:
bda: Battle damage assessment
cas: Close air support
cexc:
cjsotf: Combined Joint Special Operations Task Force
cjtf76: Combined Joint Task Force - 76
cop: Combat outpost
coy: Company
cstc-a:
dcg:
eod: Explosive ordnance disposal
esoc:
fob: Forward Operating Base
fra: French
frago: Fragmentary order
idf: Indirect fire
ied: Improvised explosive device
inf: Infiltrate
ins: Insurgents
ivo: In vicinity of
kaf: Kandahar Air Field
kaia: Kabul international airport
kdz: Kunduz
kpf:
loc:
ltc: Lieutenant Colonel
mes:
mey: Meymaneh
nds: National Directorate of Security (Afghan Intel service)
ngo: Non-governmental organization
nov:
oda: Operational Detachment (Alpha) – special forces
opord: Operational order
pak: Pakistan
pakmil: Pakistan Military
pax: Persons/personnel
pek:
phq: Police headquarters
plt: Platoon
poo: Point of origin
prt: Provincial reconstruction team
qrf: Quick reaction force
rcp: Route clearance patrol
rpg: Rocket propelled grenade
saf: Special Operations Forces
salt: Size, Activity, Location, Time
tgt: Tactical Group T
tic: Troops in Contact, i.e. combat
usmc: United States Marine Corps
uxo: Unexploded ordnance
vbied: Vehicle-borne IED
vino:
w/d: Wheels down
w/u: Wheels up
wia: Wounded in action