Reverse Engineering xkcd's 'Frequency'
import requests
import lxml.html
import gifparse
import pandas as pd
import numpy as np
from IPython.display import HTML
from markdown import markdown
def md(x, *args, **kwargs):
return HTML(markdown(x.format(*args, **kwargs)))
def commafy(n):
if type(n) == int:
return "{:,d}".format(n)
return str(n)
import matplotlib as mpl
import matplotlib.pyplot as plt
import mplstyle
%matplotlib inline
def gif_to_html(raw_bytes):
"""Convert the raw bytes of a GIF into an HTML <img>."""
b64 = raw_bytes.encode("base64")
tmpl = '<img src="data:image/gif;base64,%s">'
return tmpl % b64
About a week ago, Randall Munroe published Frequency, the 1331st entry in his xkcd webcomic series. Unlike most xkcd comics, it's animated — actually, 5-by-10 grid of animated GIFs. Each describes a scenario (e.g., "one birth," or "someone locks their keys in their care") and blinks at the scenario's purported frequency. The result: a beautiful, pulsating composition. If you haven't taken a look yet, do so now before reading on.
What makes Frequency tick? Literally, it's the Graphics Interchange Format 89a specification, the image format that makes animated GIFs possible. (Unpresciently, the authors of the in 1990 spec wrote: "The Graphics Interchange Format is not intended as a platform for
animation, even though it can be done in a limited way.") The metadata associated with each frame of an animated GIF tells your browser how long to wait before showing the next one. This data lets us calculate the total loop time for an animated GIF and helps us piece together how Frequency ticks, in the figurative sense.
To reverse-engineer Frequency, I began by downloading each of its 50 GIFs, and calculated the total loop time for each. (Scroll to the bottom of the page to see a table containing the full results.) From the loop time, I calculated the number of times each scenario is purported to occur per second, per hour, and per heartbeat. I checked these numbers against published figures (such as the USGS's earthquake statistics), with reassuring results.
def get_image_urls():
"""Fetch the URLs for all GIFs from the xkcd's 'Frequency' piece."""
URL = "http://xkcd.com/1331/"
html = requests.get(URL).content
dom = lxml.html.fromstring(html)
images = dom.cssselect("#comic td img")
srcs = [ i.attrib["src"] for i in images ]
return srcs
### NOTE
# This analysis uses a custom GIF parser, which you can see here:
#
# https://github.com/jsvine/gifparse
#
# I wrote the parser in part because the PIL library — the standard
# tool for this sort of thing — seemed to be having trouble with
# the particular format of the GIFs in "Frequency." (I also just wanted
# to learn the spec.) As a result, `gifparse` is an incomplete library.
# I'm hoping to beef it up soon, but feel free to send pull requests
# in the meantime.
###
class Scenario(object):
"""A scenario from xkcd's 'Frequency' piece.
The corresponding GIF is downloaded upon instantiation.
"""
def __init__(self, url):
self.url = url
self.name = url.split("/")[-1].split(".")[0]
self.raw = requests.get(url).content
self.gif = gifparse.parse(self.raw)
def __repr__(self):
return "<Scenario:{}>".format(self.name)
# Instantiate 50 scenarios, one for each GIF in the comic
image_urls = get_image_urls()
scenarios = map(Scenario, image_urls)
# Note: GIF delays are, by spec, multiples of 1/100th of a second.
freq = pd.DataFrame({
"scenario": s,
"name": s.name,
"loop_seconds": round(float(s.gif.total_delay) / 100, 2)
} for s in scenarios).set_index("name")
freq["per_second"] = (1 / freq["loop_seconds"]).apply(lambda x: round(x, 4))
freq["per_hour"] = (60*60 / freq["loop_seconds"]).apply(lambda x: round(x, 2))
freq["heartbeats"] = (freq["loop_seconds"] / freq.ix["heartbeat"]["loop_seconds"])\
.apply(lambda x: round(x, 4))
Heatmap
To get a sense of the Frequency's spatio-temporal rhythm, see the heatmap below, which recreates the comic's grid. Instead animated GIFs, we'll fill each cell with a color corresponding to its frequency. The darker the blue the more frequent the scenario.
grid = pd.DataFrame(np.split(freq["per_hour"].values, 10))
grid_log = grid.apply(np.log10)
def create_heatmap():
pixel_coeff = 1.5 / mplstyle.get("figure.dpi")
grid_dim = (120 * 5 + 120, 50 * 10)
mplstyle.set({
"figure.figsize": tuple(pixel_coeff * x for x in grid_dim)
})
fig, ax = plt.subplots()
heatmap = ax.pcolor(grid_log,
cmap=mpl.cm.Blues,
edgecolors="white")
ax.invert_yaxis()
ax.set_xticks([])
ax.set_yticks([])
for row_i in range(len(grid)):
for col_i in range(len(grid.ix[row_i])):
index = row_i * 5 + col_i
name = freq.index[index]
per_hour = freq["per_hour"][index]
ax.text(col_i + 0.5, row_i + 0.5, name,
ha="center",
va="center",
fontsize=12,
color="white" if per_hour > 50 else "black")
max_freq = int(np.log10(freq["per_hour"].max())) + 1
ticks = range(0, max_freq)
cbar = plt.colorbar(heatmap, ticks=ticks)
cbar.set_ticklabels([ commafy(pow(10, int(x))) for x in ticks ])
cbar.set_label("Occurances per Hour",
rotation=270,
fontsize=12)
create_heatmap()
Two notes:
-
Because there are so many magnitudes of difference between the most frequent and least frequent scenarios, we're using a logarithmic color scale. Each doubling in blueness corresponds to a ten-fold increase in frequency.
-
The abbreviations in each cell come from each GIF's xkcd-provided filename.
Even with this rudimentary analysis, you can see a few patterns, including the descending frequency of magnitude-1, magnitude-2, magnitude-3, and magnitude-4 earthquakes; that the most frequent occurance is a birth; that the least frequent is Old Faithful erupting; that someone gets married about as often as a fishing boat catches a shark.
Distribution of Frequencies
Clearly, there are a broad range of frequencies in Frequency. But how are the scenarios distributed along this range? The histogram below groups the scenarios into a handful of bins, each group an order of magnitude more frequent.
def plot_distribution():
mplstyle.set({ "figure.figsize": (8, 6) })
ticks = range(-1, 6)
ax = grid_log.unstack().hist(bins=ticks, color=mpl.cm.Blues(mpl.cm.Blues.N / 2))
ax.set_xticklabels([ commafy(pow(10, x)) for x in ticks ])
ax.set_title("Distribution of Frequencies\n", fontweight="bold")
ax.set_xlabel("\nOccurances per Hour")
ax.set_ylabel("Number of Scenarios\n")
pass
plot_distribution()
We can see that scenarios that occur between 1,000 and 10,000 times per hour are the most common — accounting for more than half of the GIFs. Less-frequent scenarios appear in the Frequency less often. This makes sense: Who'd want to see 30 super-slow GIFs?
Sorting Frequencies
We can also sort the GIFs by frequency. Here are the five most frequent scenarios:
HTML("".join(gif_to_html(freq.ix[name]["scenario"].raw)
for name in freq.sort("loop_seconds").index[:5]))
And the five least frequent:
HTML("".join(gif_to_html(freq.ix[name]["scenario"].raw)
for name in freq.sort("loop_seconds").index[-5:]))