The Evolution of Report Summaries in WikiLeaks Data Over Time
Last week I posted a visualization of the reports in the WikiLeaks data projected onto a map of Afghanistan in yearly slices. Much to my delight, many people found this visualization of the data helpful, with special thanks to Wired's Danger Room blog for picking up the post and giving some well-deserved press to the power of open-source tools. While this visualization was powerful in how it conveyed the spread and increase of the military's activity in Afghanistan over time, it did not give a sense of what was actually going on in the reports.
In today's visualization I would like to dig deeper into the WikiLeaks data by analyzing the contents of the report summaries and attempting to visualize how the topics in these summaries evolve over time. To achieve this, it is necessary to distill the tens of thousands of report summaries into general terms that best represent the topics of those reports. Using a subset of the data for reports from only the NORTH, SOUTH, EAST and WEST regions, i.e., excluding reports labeled as CAPITAL and UNKNOWN, I have performed an analysis on the report summary data using common text-mining techniques. Below is a visualization of the results, followed by a brief explanation of how the images were produced.
Given the detail in the maps, the image was rendered in high resolution and is being shared using Microsoft's zoom.it. The image is best viewed in fullscreen mode, and I recommend you explore it that way.
I used a technique called latent Dirichlet allocation (LDA) to generate the terms above. The method is far too nuanced to describe in detail here, but generally it treats each document as a mixture of a small number of topics and each topic as a set of words that tend to occur together, so the topics it finds can be thought of as most representative of the overall set. It is most often used to generate topic models from large text corpora, such as the WikiLeaks report summary data. After ignoring common English stop words and adding a few specific to this data (click here to view the code for the LDA model), I specified the model to produce ten topics, with five terms in each, for every year/region pair in the data. For every pair, therefore, the model produced 50 terms that "best fit" the topics that pair represents.
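For readers who want a feel for the mechanics, here is a minimal sketch in Python of the pipeline described above: keep only the four regions, group summaries by year/region pair, drop stop words, and fit a ten-topic LDA model per pair. This is not the code used for this post. The collapsed Gibbs sampler, the tiny stop-word list, the hyperparameters `alpha` and `beta`, and the function names `tokenize`, `lda_top_terms` and `topics_by_pair` are all assumptions made for illustration, as is the `(year, region, summary)` input format.

```python
import random
from collections import defaultdict

# Illustrative stop words only; a real run would use a full English list
# plus terms specific to the report data.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "at", "was", "were"}
KEEP_REGIONS = {"NORTH", "SOUTH", "EAST", "WEST"}  # CAPITAL and UNKNOWN are excluded


def tokenize(summary, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = (w.strip(".,;:()\"'").lower() for w in summary.split())
    return [t for t in tokens if t and t not in stop_words]


def lda_top_terms(docs, n_topics=10, n_terms=5, n_iter=200,
                  alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the top terms of each topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Rank words within each topic by how many tokens were assigned to it.
    # Ties (including topics that ended up empty) fall back to alphabetical order.
    topics = []
    for k in range(n_topics):
        ranked = sorted(range(n_vocab), key=lambda v: topic_word[k][v], reverse=True)
        topics.append([vocab[v] for v in ranked[:n_terms]])
    return topics


def topics_by_pair(reports, **lda_kwargs):
    """Map each (year, region) pair to its list of topics (lists of terms)."""
    grouped = defaultdict(list)
    for year, region, summary in reports:
        if region in KEEP_REGIONS:
            grouped[(year, region)].append(tokenize(summary))
    return {pair: lda_top_terms(docs, **lda_kwargs) for pair, docs in grouped.items()}
```

With the default settings each pair yields ten topics of five terms, i.e. the 50 terms mentioned above, provided the pair's vocabulary has at least five distinct words. A sampler written in pure Python like this one is only practical on small inputs; on tens of thousands of summaries a dedicated LDA library would be the sensible choice.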
To visualize these results, I projected the terms onto their corresponding region in Afghanistan and sized them by how often they appeared in the topics for each pair. It is common in LDA models for words to appear in multiple topics, so it was logical to make those that appeared more than once more prominent in the visualization. The terms were then colored by region to make it easier to distinguish among them at the regional borders. As you can see, the results are quite interesting.
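The sizing step can be sketched in a few lines of Python. This is an illustration under my own assumptions rather than the plotting code behind the image: the function name `term_sizes`, the base size of 10 and the step of 6 per extra topic are hypothetical choices, and the input is assumed to be one pair's topics as a list of term lists.

```python
from collections import Counter


def term_sizes(topics, min_size=10.0, step=6.0):
    """Give each term a text size that grows with the number of topics it is in."""
    # Count each term once per topic, even if a topic lists it twice.
    counts = Counter(term for topic in topics for term in set(topic))
    return {term: min_size + step * (n - 1) for term, n in counts.items()}


# Hypothetical two-topic example: "ied" appears in both topics, so it is drawn larger.
sizes = term_sizes([["ied", "eod"], ["ied", "patrol"]])
```

A linear scale is the simplest option; any increasing function of the count would serve the same purpose of making repeated terms stand out.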
The frequency of terms like "IED" and "EOD" increases in the northern region, while "friendly," "forces" and "engage" become very prominent in the south in 2007. I have spent a lot of time exploring the image and found it to be quite fascinating, and I hope you will too. Also, many of the terms appear to be military acronyms that I am unfamiliar with, so I hope one of my more astute readers may be able to decode some of what is being represented by the topic models. UPDATE: With the help of Josh Foust and Greg Hannah, I have started an acronym reference guide below. If you know the meaning of one of the blank acronyms, please let me know!
Finally, if you have an idea for a future analysis, I am happy to take suggestions. I, along with an esteemed team of data hacks, will be presenting some of our findings in this data at the next NYC R meetup, and we welcome all ideas!
Acronym Guide
41r: MGRS coordinate
41s: MGRS coordinate
42s: MGRS coordinate
aaf: Anti-Afghan fighters
abp: Afghan Border Police
acm: Anti-coalition militia (disc)
amf: Afghan Militia Force (disc)
ana: Afghan National Army
anp: Afghan National Police
att: At this time
baf: Bagram Air Field
bcp:
bda: Battle damage assessment
cas: Close air support
cexc:
cjsotf: Combined Joint Special Operations Task Force
cjtf76: Combined Joint Task Force - 76
cop: Combat outpost
coy: Company
cstc-a:
dcg:
eod: Explosive ordnance disposal
esoc:
fob: Forward operating base
fra: French
frago: Fragmentary order
idf: Indirect fire
ied: Improvised explosive device
inf: Infiltrate
ins: Insurgents
ivo: In vicinity of
kaf: Kandahar Air Field
kaia: Kabul International Airport
kdz: Kunduz
kpf:
loc:
ltc: Lieutenant Colonel
mes:
mey: Meymaneh
nds: National Directorate of Security (Afghan intel service)
ngo: Non-governmental organization
nov:
oda: Operational Detachment (Alpha), special forces
opord: Operational order
pak: Pakistan
pakmil: Pakistan military
pax: Persons/personnel
pek:
phq: Police headquarters
plt: Platoon
poo: Point of origin
prt: Provincial reconstruction team
qrf: Quick reaction force
rcp: Route clearance patrol
rpg: Rocket-propelled grenade
saf: Special Operations Forces
salt: Size, Activity, Location, Time
tgt: Tactical Group T
tic: Troops in contact, i.e., combat
usmc: United States Marine Corps
uxo: Unexploded ordnance
vbied: Vehicle-borne IED
vino:
w/d: Wheels down
w/u: Wheels up
wia: Wounded in action