Counting the NSA Documents

28 Aug 2013 — Last updated: 29 Aug 2013, 09:07AM

import pandas as pd
import matplotlib as mpl
import requests
import lxml.html
import datetime
import mplstyle
import mplstyle.styles.simple
mplstyle.set(mplstyle.styles.simple)
mplstyle.set({
    "figure.figsize": (10, 8),
    "lines.markersize": 0
})
from IPython.display import HTML
from markdown import markdown
md = lambda x: HTML(markdown(x))

The ACLU has started keeping a list of NSA documents that the government or press has released since June 2013. The list includes each document's title, its original date, the date it was released, who released it, its length in pages, and a short description.

Helpfully, the ACLU has formatted this list as a structured HTML table, making it possible to write a computer script to parse this information.
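Before pointing a script at the live page, it helps to see the approach on a miniature stand-in for the ACLU's table. (The snippet below uses XPath rather than CSS selectors so that it needs nothing beyond lxml itself; the post's own code assumes the separate cssselect package.)

```python
import lxml.html

# A miniature stand-in for the ACLU's table; the real one has more columns
html = """
<table id="tableOne">
  <tr><th>Release Date</th><th>Pages</th></tr>
  <tr><td>6/5/13</td><td>4</td></tr>
</table>
"""
doc = lxml.html.fromstring(html)
rows = doc.xpath("//table[@id='tableOne']//tr")[1:]  # skip the header row
cells = rows[0].xpath("./td")
print(cells[0].text, cells[1].text)
```

The same pattern — grab the table by its id, skip the header row, then read each row cell-by-cell — carries over directly to the real page.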

ACLU_BASE = "https://www.aclu.org"
# Path of the ACLU's NSA-documents page
ACLU_URL = ACLU_BASE + "/nsa-documents-released-public-june-2013"

aclu_text = requests.get(ACLU_URL).text
aclu_dom = lxml.html.fromstring(aclu_text)
table = aclu_dom.cssselect("#tableOne")[0]
rows = table.cssselect("tr")[1:]  # Skip the header row

def parse_date(date_str):
    # The table's dates are formatted as M/D/YY
    split = date_str.split("/")
    return datetime.date(2000 + int(split[-1]), int(split[0]), int(split[1]))

def extract_data_from_row(row):
    cells = row.cssselect("td")
    title_link = cells[2].cssselect("a")[0]
    return {
        "release_date": parse_date(cells[0].text), 
        "document_date": cells[1].text,
        "title": title_link.text,
        "url": ACLU_BASE + title_link.attrib["href"], 
        "description": cells[3].text, 
        "releaser": cells[4].text,
        "pages": int(cells[5].text)
    }
# Read the parsed data into a pandas DataFrame
documents = pd.DataFrame(map(extract_data_from_row, rows))

Though the information in the table is just metadata, not the content of the documents themselves, we can still learn something from it. For example: Who has released more documents — the government or the press? Not all documents are created equal, of course, especially when you take redactions into account. But charting this statistic can give us a sense of the tempo of releases, if not their melody.
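The counting itself is a short pandas pipeline: group the rows by date and releaser, count them, pivot the releasers into columns, and take a running total. A minimal sketch, using made-up dates and releasers:

```python
import pandas as pd

# Toy release log; the dates and releasers are invented for illustration
toy = pd.DataFrame({
    "release_date": ["2013-06-05", "2013-06-05", "2013-06-09", "2013-06-20"],
    "releaser": ["Press", "Press", "Government", "Press"],
})

# Count documents per (date, releaser), pivot releasers into columns,
# then take a running total down the date axis
counts = toy.groupby(["release_date", "releaser"]).size().unstack().fillna(0)
cumulative = counts.cumsum()
print(cumulative)
```

Plotting that cumulative frame with a step style gives exactly the kind of "tempo" chart below.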

In the chart below, you can see the Washington Post and the Guardian go on a document-publishing spree in June. Then it's All Quiet on the NSA Front for nearly a month, until late July, when they begin releasing more documents. The government started releasing documents more slowly, publishing about half as many as the press through early August. But, at the risk of making this sound like a horse race, by late August it had nearly caught up.

months = pd.date_range(
    start=documents["release_date"].min() - datetime.timedelta(days=1), 
    end=documents["release_date"].max() + datetime.timedelta(days=1),
    freq="MS")

def set_month_ticks(ax):
    ax.set_xticks(months)
    ax.set_xticklabels([ x.strftime("%b %Y") for x in months ])

docs_by_date = documents.groupby([ "release_date", "releaser" ])
cumulative_docs = docs_by_date.size().unstack().fillna(0).cumsum()
ax = cumulative_docs.plot(drawstyle='steps', grid=True)
set_month_ticks(ax)

ax.legend(loc="upper left", title="Releasing Entity")
ax.set_title("Cumulative Documents Released, by Entity\n")
ax.set_xlabel("\nRelease Date")
ax.set_ylabel("Cumulative Number of Documents Released\n")
ax.set_ylim(0, cumulative_docs.stack().max() * 1.25)

It might also be interesting to look at the number of pages contained in those documents. You should take these numbers with an even grainier grain of salt, particularly since a page of redactions counts the same as a dense PowerPoint slide.

In the chart below, you can see that for the first few weeks in June, the press and the government mostly dribbled out short documents. Since then, they've started to release larger documents — or at least more pages at a time. You can also see that the government's late-August release more than tripled its previous page count.
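Summing pages works the same way as counting documents, with sum in place of size. A quick sketch, again with invented numbers:

```python
import pandas as pd

# Invented page counts, just to show the shape of the transform
toy = pd.DataFrame({
    "release_date": ["2013-06-05", "2013-06-05", "2013-06-09"],
    "releaser": ["Press", "Government", "Press"],
    "pages": [4, 18, 3],
})

# Sum pages per (date, releaser), pivot, and take a running total
pages = toy.groupby(["release_date", "releaser"])["pages"].sum().unstack().fillna(0)
cumulative_pages_toy = pages.cumsum()
print(cumulative_pages_toy)
```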

pages_by_date = documents.groupby([ "release_date", "releaser" ])["pages"]
cumulative_pages = pages_by_date.sum().unstack().fillna(0).cumsum()
ax = cumulative_pages.plot(drawstyle="steps", grid=True)
set_month_ticks(ax)

ax.legend(loc="upper left", title="Releasing Entity")
ax.set_title("Cumulative Pages Released, by Entity\n")
ax.set_xlabel("\nRelease Date")
ax.set_ylabel("Cumulative Number of Pages Released\n")

We can see another pattern if we arrange the documents by length instead of date released. In the chart below, you can see that the government has released most of the longest documents, while the press has released most of the shorter documents.

mplstyle.set({ "figure.figsize": (10, 18) })
color_cycle = mplstyle.get("axes.color_cycle")
page_sort = documents.sort("pages")
ax = page_sort["pages"].plot(
    kind="barh", grid=True, alpha=0.25,
    color=[ color_cycle[0] if r == "Government" else color_cycle[1] 
        for r in page_sort["releaser"] ])

# Label each bar with its document's title
for i in range(len(ax.patches)):
    p = ax.patches[i]
    width = p.get_width()
    # Put labels for short bars just past the bar's end,
    # and labels for long bars near the start of the bar
    x = (width < 10) * width + 1
    y = p.get_y() + p.get_height() / 3
    ax.text(x, y, page_sort["title"].iget(i))

ax.set_title("Documents by Length and Releasing Entity\n")
ax.set_xlabel("\nNumber of Pages")
ax.set_ylabel("Document Title")
leg_patches = [ mpl.patches.Rectangle((0, 0), 1, 1, fc=c, alpha=0.25)
    for c in color_cycle[:2] ]
ax.legend(leg_patches, [ "Government", "Press" ], loc="lower right")
mplstyle.set({ "figure.figsize": (10, 8) })
page_avg_by_entity = documents.groupby("releaser")["pages"].mean()
gov_avg, press_avg = (page_avg_by_entity[x] for x in ("Government", "Press"))
md("""Indeed, NSA documents released by the government have averaged 
%.1f pages, compared to %.1f pages for NSA documents released by the press.""" 
% (gov_avg, press_avg))

Indeed, NSA documents released by the government have averaged 20.6 pages, compared to 9.6 pages for NSA documents released by the press.

I'm wary of extrapolating from such a small sample. The government and the press have different — or polar opposite, at least in the short term — motives and incentives for releasing documents. So it shouldn't be surprising that we see differences in the tempo and shape of the documents they've released. As both estates continue their releases, it'll be interesting to see whether these patterns hold up.
