Reverse Engineering xkcd's 'Frequency'

import requests
import lxml.html
import gifparse

import pandas as pd
import numpy as np

from IPython.display import HTML
from markdown import markdown

def md(x, *args, **kwargs):
    return HTML(markdown(x.format(*args, **kwargs)))

def commafy(n):
    if type(n) == int:
        return "{:,d}".format(n)
    return str(n)

import matplotlib as mpl
import matplotlib.pyplot as plt
import mplstyle
%matplotlib inline

def gif_to_html(raw_bytes):
    """Convert the raw bytes of a GIF into an HTML <img>."""
    b64 = raw_bytes.encode("base64")
    tmpl = '<img src="data:image/gif;base64,%s">'
    return tmpl % b64

About a week ago, Randall Munroe published Frequency, the 1331st entry in his xkcd webcomic series. Unlike most xkcd comics, it's animated — actually, 5-by-10 grid of animated GIFs. Each describes a scenario (e.g., "one birth," or "someone locks their keys in their care") and blinks at the scenario's purported frequency. The result: a beautiful, pulsating composition. If you haven't taken a look yet, do so now before reading on.

What makes Frequency tick? Literally, it's the Graphics Interchange Format 89a specification, the image format that makes animated GIFs possible. (Unpresciently, the authors of the in 1990 spec wrote: "The Graphics Interchange Format is not intended as a platform for

animation, even though it can be done in a limited way.") The metadata associated with each frame of an animated GIF tells your browser how long to wait before showing the next one. This data lets us calculate the total loop time for an animated GIF and helps us piece together how Frequency ticks, in the figurative sense.

To reverse-engineer Frequency, I began by downloading each of its 50 GIFs, and calculated the total loop time for each. (Scroll to the bottom of the page to see a table containing the full results.) From the loop time, I calculated the number of times each scenario is purported to occur per second, per hour, and per heartbeat. I checked these numbers against published figures (such as the USGS's earthquake statistics), with reassuring results.

def get_image_urls():
    """Fetch the URLs for all GIFs from the xkcd's 'Frequency' piece."""
    URL = "http://xkcd.com/1331/"
    html = requests.get(URL).content
    dom = lxml.html.fromstring(html)
    images = dom.cssselect("#comic td img")
    srcs = [ i.attrib["src"] for i in images ]
    return srcs

### NOTE
# This analysis uses a custom GIF parser, which you can see here: 
#
#   https://github.com/jsvine/gifparse
# 
# I wrote the parser in part because the PIL library — the standard 
# tool for this sort of thing — seemed to be having trouble with 
# the particular format of the GIFs in "Frequency." (I also just wanted
# to learn the spec.) As a result, `gifparse` is an incomplete library.
# I'm hoping to beef it up soon, but feel free to send pull requests
# in the meantime.
###
class Scenario(object):
    """A scenario from xkcd's 'Frequency' piece.
    
    The corresponding GIF is downloaded upon instantiation.
    """
    def __init__(self, url):
        self.url = url
        self.name = url.split("/")[-1].split(".")[0]
        self.raw = requests.get(url).content
        self.gif = gifparse.parse(self.raw)
    def __repr__(self):
        return "<Scenario:{}>".format(self.name)

# Instantiate 50 scenarios, one for each GIF in the comic
image_urls = get_image_urls()
scenarios = map(Scenario, image_urls)

# Note: GIF delays are, by spec, multiples of 1/100th of a second.
freq = pd.DataFrame({
    "scenario": s,
    "name": s.name,
    "loop_seconds": round(float(s.gif.total_delay) / 100, 2)
} for s in scenarios).set_index("name")
freq["per_second"] = (1 / freq["loop_seconds"]).apply(lambda x: round(x, 4))
freq["per_hour"] = (60*60 / freq["loop_seconds"]).apply(lambda x: round(x, 2))
freq["heartbeats"] = (freq["loop_seconds"] / freq.ix["heartbeat"]["loop_seconds"])\
    .apply(lambda x: round(x, 4))

Heatmap

To get a sense of the Frequency's spatio-temporal rhythm, see the heatmap below, which recreates the comic's grid. Instead animated GIFs, we'll fill each cell with a color corresponding to its frequency. The darker the blue the more frequent the scenario.

grid = pd.DataFrame(np.split(freq["per_hour"].values, 10))
grid_log = grid.apply(np.log10)

def create_heatmap():
    pixel_coeff = 1.5 / mplstyle.get("figure.dpi")
    grid_dim = (120 * 5 + 120, 50 * 10)
    mplstyle.set({
    "figure.figsize": tuple(pixel_coeff * x for x in grid_dim)
    })
    fig, ax = plt.subplots()
    heatmap = ax.pcolor(grid_log, 
        cmap=mpl.cm.Blues,
        edgecolors="white")
    ax.invert_yaxis()
    ax.set_xticks([])
    ax.set_yticks([])
    for row_i in range(len(grid)):
        for col_i in range(len(grid.ix[row_i])):
            index = row_i * 5 + col_i
            name = freq.index[index]
            per_hour = freq["per_hour"][index]
            ax.text(col_i + 0.5, row_i + 0.5, name, 
                ha="center", 
                va="center", 
                fontsize=12,
                color="white" if per_hour > 50 else "black")
    max_freq = int(np.log10(freq["per_hour"].max())) + 1
    ticks = range(0, max_freq)
    cbar = plt.colorbar(heatmap, ticks=ticks)
    cbar.set_ticklabels([ commafy(pow(10, int(x))) for x in ticks ])
    cbar.set_label("Occurances per Hour",
        rotation=270,
        fontsize=12)
create_heatmap()

Two notes:

Because there are so many magnitudes of difference between the most frequent and least frequent scenarios, we're using a logarithmic color scale. Each doubling in blueness corresponds to a ten-fold increase in frequency.
The abbreviations in each cell come from each GIF's xkcd-provided filename.

Even with this rudimentary analysis, you can see a few patterns, including the descending frequency of magnitude-1, magnitude-2, magnitude-3, and magnitude-4 earthquakes; that the most frequent occurance is a birth; that the least frequent is Old Faithful erupting; that someone gets married about as often as a fishing boat catches a shark.

Distribution of Frequencies

Clearly, there are a broad range of frequencies in Frequency. But how are the scenarios distributed along this range? The histogram below groups the scenarios into a handful of bins, each group an order of magnitude more frequent.

def plot_distribution():
    mplstyle.set({ "figure.figsize": (8, 6) })
    ticks = range(-1, 6)
    ax = grid_log.unstack().hist(bins=ticks, color=mpl.cm.Blues(mpl.cm.Blues.N / 2))
    ax.set_xticklabels([ commafy(pow(10, x)) for x in ticks ])
    ax.set_title("Distribution of Frequencies\n", fontweight="bold")
    ax.set_xlabel("\nOccurances per Hour")
    ax.set_ylabel("Number of Scenarios\n")
    pass
plot_distribution()

We can see that scenarios that occur between 1,000 and 10,000 times per hour are the most common — accounting for more than half of the GIFs. Less-frequent scenarios appear in the Frequency less often. This makes sense: Who'd want to see 30 super-slow GIFs?

Sorting Frequencies

We can also sort the GIFs by frequency. Here are the five most frequent scenarios:

HTML("".join(gif_to_html(freq.ix[name]["scenario"].raw) 
             for name in freq.sort("loop_seconds").index[:5]))

And the five least frequent:

HTML("".join(gif_to_html(freq.ix[name]["scenario"].raw) 
             for name in freq.sort("loop_seconds").index[-5:]))

old_faithful_mins = freq.ix["oldfaithful"]["loop_seconds"] / 60
md("""(You'll have to wait {} minutes for Old Faithful.)""".format(int(old_faithful_mins)))

(You'll have to wait 94 minutes for Old Faithful.)

We can also find scenarios that share the same loop time, and thus blink at the same interval.

def display_twinned_gifs():
    time_grouped = freq.groupby("loop_seconds")
    twinned_times = time_grouped.size()[time_grouped.size() > 1].index
    html = ""
    for time in twinned_times:
        matching_scenarios = freq[freq["loop_seconds"] == time]
        html += "<h4>Every {} seconds:</h4>".format(time)
        html += "".join(gif_to_html(x.raw) for x in matching_scenarios["scenario"])
    return HTML(html)
display_twinned_gifs()

Every 0.93 seconds:

Every 1.27 seconds:

Every 2.43 seconds:

The Turn Signal Problem

Frequency alludes to the empirically-proven fact that no two cars' turn signals blink at the same rate:

def get_delays(gif):
    return [ x["gce"].delay for x in gif.frames ]
freq["zeros"] = freq["scenario"].apply(lambda x: sum(pd.Series(get_delays(x.gif)) == 0))
freq["loop_seconds_adj"] = freq["loop_seconds"] + freq["zeros"] * 0.1
freq["per_hour_adj"] = (60*60 / freq["loop_seconds_adj"]).apply(lambda x: round(x, 2))

signals = freq.ix[[ "turnsignal1", "turnsignal2" ]]
t1, t2, = signals["loop_seconds"]
t1_adj, t2_adj = signals["loop_seconds_adj"]

HTML("".join(gif_to_html(x.raw) for x in signals["scenario"]))

signal_beat = 1 / np.abs((1.0 / t1) - (1.0 / t2))
md(u"""
According to the data embedded *Frequency*'s GIFs, 
Randall Munroe's turn signal blinks once every {:.2f} seconds, 
while the turn signal in front of him blinks once every {:.2f} seconds — just out of sync.
How often will the two signals blink together? Using the equation for 
[beat frequency](http://hyperphysics.phy-astr.gsu.edu/hbase/sound/beat.html), we can calculate this.
The two signals should blink together approximately every {:.2f} seconds.
""", t1, t2, signal_beat)

According to the data embedded Frequency's GIFs, Randall Munroe's turn signal blinks once every 0.94 seconds, while the turn signal in front of him blinks once every 0.90 seconds — just out of sync. How often will the two signals blink together? Using the equation for beat frequency, we can calculate this. The two signals should blink together approximately every 21.15 seconds.

signal_beat_adj = 1 / np.abs((1.0 / t1_adj) - (1.0 / t2_adj))
md(u"""
But you'll see something different. If you're using a modern browser in 2014, 
you'll observe a beat frequency closer to {:.2f} seconds — about {:.2f} seconds
longer than expected. Why is that?
""", signal_beat_adj, signal_beat_adj - signal_beat)

But you'll see something different. If you're using a modern browser in 2014, you'll observe a beat frequency closer to 26.00 seconds — about 4.85 seconds longer than expected. Why is that?

The Zero-Delay Problem

A quirk in how browsers display animated GIFs is distorting Frequency's timing. Remember how the metadata associated with each frame of an animated GIF tells the browser how long to wait before displaying the next frame? It turns out that "no modern browser surveyed supports frame delays below 0.02 seconds." When browsers see a delay shorter than 0.02 seconds, they replace it with a delay of 0.1 seconds. In practice, the only sub–0.02 second delays in Frequency are the zero-second delays in the first frame of certain GIFs.

min_loop_seconds = freq["loop_seconds"].min()
max_loop_seconds = freq[freq["zeros"] > 0]["loop_seconds"].max()
md("""
The __zero-delay problem affects {} of the 50 GIFs__ in *Frequency*. 
When the browser renders them, the loops will take 1/10th-second longer than
it seems Munroe intended. The affected GIFs happen to be all those shorter than 
two seconds, which might be an artifact of how Munroe created them. 
(I emailed Munroe early in my research, before discovering this issue, but never heard back.
I'll update this post if I do hear from him.) For the fastest affected GIF, that's a 
{:.1f}% lengthening. For the slowest, the zero-delay problem adds {:.1f}% to its encoded loop time.
""",
    sum(freq["zeros"] > 0),
    100 * (((0.1 + min_loop_seconds) / min_loop_seconds) - 1),
    100 * (((0.1 + max_loop_seconds) / max_loop_seconds) - 1)
)

The zero-delay problem affects 21 of the 50 GIFs in Frequency. When the browser renders them, the loops will take 1/10th-second longer than it seems Munroe intended. The affected GIFs happen to be all those shorter than two seconds, which might be an artifact of how Munroe created them. (I emailed Munroe early in my research, before discovering this issue, but never heard back. I'll update this post if I do hear from him.) For the fastest affected GIF, that's a 41.7% lengthening. For the slowest, the zero-delay problem adds 5.3% to its encoded loop time.

Full Results

In all the analyses above, I've used the encoded loop times, given that they appear to be Munroe's intention. But in the table below, you can find both the encoded and adjusted timing data. The GIFs are sorted by their encoded loop time.

freq["needs_adj"] = freq["zeros"].apply(lambda x: "*" if x else "")
freq.sort("loop_seconds")[["loop_seconds", "per_hour", "needs_adj", "loop_seconds_adj", "per_hour_adj"]]

	loop_seconds	per_hour	needs_adj	loop_seconds_adj	per_hour_adj
name
birth	0.24	15000.00	*	0.34	10588.24
death	0.56	6428.57	*	0.66	5454.55
domain	0.64	5625.00	*	0.74	4864.86
wikipedia	0.67	5373.13	*	0.77	4675.32
wedding	0.75	4800.00	*	0.85	4235.29
shark	0.83	4337.35	*	0.93	3870.97
heartbeat	0.86	4186.05	*	0.96	3750.00
turnsignal2	0.90	4000.00	*	1.00	3600.00
iphone	0.93	3870.97	*	1.03	3495.15
flight	0.93	3870.97	*	1.03	3495.15
turnsignal1	0.94	3829.79	*	1.04	3461.54
car_elsewhere	1.03	3495.15	*	1.13	3185.84
phoenixshoes	1.08	3333.33	*	1.18	3050.85
meteor	1.15	3130.43	*	1.25	2880.00
littleleague	1.23	2926.83	*	1.33	2706.77
bottles	1.27	2834.65	*	1.37	2627.74
denverpizza	1.27	2834.65	*	1.37	2627.74
ndsex	1.38	2608.70	*	1.48	2432.43
pulsar	1.40	2571.43	*	1.50	2400.00
cat_mockingbird	1.82	1978.02	*	1.92	1875.00
car_china	1.89	1904.76	*	1.99	1809.05
phoenix	2.05	1756.10		2.05	1756.10
tattoo	2.06	1747.57		2.06	1747.57
earthquake1	2.43	1481.48		2.43	1481.48
keys	2.43	1481.48		2.43	1481.48
eagle	2.69	1338.29		2.69	1338.29
vibrator	2.99	1204.01		2.99	1204.01
car_japan	4.01	897.76		4.01	897.76
facebook	4.32	833.33		4.32	833.33
recycled	4.64	775.86		4.64	775.86
bieber	4.73	761.10		4.73	761.10
kiss	5.53	650.99		5.53	650.99
car_germany	5.80	620.69		5.80	620.69
house	6.22	578.78		6.22	578.78
car_us	6.95	517.99		6.95	517.99
dogbite	7.01	513.55		7.01	513.55
amelia	7.79	462.13		7.79	462.13
parliament_toilet	10.06	357.85		10.06	357.85
dog	15.60	230.77		15.60	230.77
us_cancer	18.99	189.57		18.99	189.57
cat	21.30	169.01		21.30	169.01
fire_dept	23.00	156.52		23.00	156.52
earthquake2	24.26	148.39		24.26	148.39
bike	24.93	144.40		24.93	144.40
book_mockingbird	42.05	85.61		42.05	85.61
us_cancer_death	54.34	66.25		54.34	66.25
holeinone	180.00	20.00		180.00	20.00
earthquake3	242.60	14.84		242.60	14.84
earthquake4	2426.00	1.48		2426.00	1.48
oldfaithful	5640.00	0.64		5640.00	0.64

50 rows × 5 columns

Notebooks