Notebooks

What's the Most Forgotten State?

29 Aug 2013 — Last updated: 29 Aug 2013, 11:03AM

import pandas as pd
import matplotlib as mpl
import requests
import lxml.html
import itertools
from StringIO import StringIO
import mplstyle
import mplstyle.styles.simple
mplstyle.set(mplstyle.styles.simple)
mplstyle.set({ 
    "figure.figsize": (10, 8),
    "lines.markersize": 0
})
from IPython.display import HTML
from markdown import markdown
md = lambda x: HTML(markdown(x))

A friend recently challenged me to name all 50 U.S. states. I named 49, forgetting Michigan. Which made me wonder: What's the most forgotten state? And what's the most memorable?

QUIZ_RESULTS_URL = "http://www.sporcle.com/games/g/states/results"
results_html = requests.get(QUIZ_RESULTS_URL).text
results_dom = lxml.html.fromstring(results_html)
times_played_str = results_dom.cssselect(".dividerText strong")[0].text
answer_tables = results_dom.cssselect("table.answer")
result_rows_nested = map(lambda x: x.cssselect("tr")[1:], answer_tables)
result_rows = list(itertools.chain.from_iterable(result_rows_nested))
def data_from_row(tr):
    return {
        "answer": tr.cssselect("td.answer a")[0].text,
        "pct_correct": float(tr.cssselect("td.percent")[0].text[:-1])
    }
sporcle_results = pd.DataFrame(map(data_from_row, result_rows))
md(u"""The answer, according to [Sporcle's "US States Quiz"](http://www.sporcle.com/games/g/states/results)
— which, the last time I checked, has been played %s times — is __%s__.""" % (
    times_played_str,
    sporcle_results.irow(-1)["answer"]
))

The answer, according to Sporcle's "US States Quiz" — which, the last time I checked, has been played 10,983,146 times — is Missouri.

Because of selection bias among people who decide to play the quiz, and the fact that people often retake these types of quizzes, it's hard to say just how forgotten Missouri is. But it's certainly the most forgotten among people who've taken that quiz.

Even so, you'd expect larger states to be, generally, more memorable than smaller states. Using U.S. Census population data for 2010, we can find out which states are punching above their weight. The simplest, crudest approach is to subtract states' Sporcle rank from their population rank. By this measure:

CENSUS_URL = "http://www.census.gov/popest/data/national/totals/2012/files/NST_EST2012_ALLDATA.csv"
census_csv = requests.get(CENSUS_URL).text
populations = pd.read_csv(StringIO(census_csv))
# The Census data also includes population figures for 
# the nation at-large, as well as regional groupings of states.
#
# Let's select just the states.
in_sporcle = populations["Name"].apply(lambda x: x in sporcle_results["answer"].values)
state_populations = populations[in_sporcle]
state_populations["Pop. Rank"] = state_populations["CENSUS2010POP"].rank(ascending=False)
sporcle_results["Sporcle Rank"] = sporcle_results["pct_correct"].rank(ascending=False)
rankings = sporcle_results.merge(
    state_populations[["Name", "Pop. Rank"]], 
    how="left", 
    left_on="answer", 
    right_on="Name")[["answer", "Sporcle Rank", "Pop. Rank"]]
rankings["diff"] = rankings["Pop. Rank"] - rankings["Sporcle Rank"]
def get_state(rows, name):
    return rows[rows["answer"] == name].irow(0).to_dict()

def rank_to_string(x):
    mod_10 = int(x) % 10
    ths = [ "th" for i in range(6) ]
    if x not in [ 11, 12, 13 ]:
        suffix = (["th", "st", "nd", "rd"] + ths)[mod_10]
    else:
        suffix = ths[mod_10]
    return "%d%s" % (x, suffix)
missouri = get_state(rankings, "Missouri")
md(u"""- __Missouri still does terribly__, ranking %s in population, 
for a difference of __%d__.""" % (rank_to_string(missouri["Pop. Rank"]), missouri["diff"]))
  • Missouri still does terribly, ranking 18th in population, for a difference of -32.
top_state = rankings.sort("diff", ascending=False).irow(0).to_dict()
md(u"""- __%s is the most over-performing state__, ranking %s in population but
%s in the Sporcle quiz, for a difference of __+%d__.""" % (
    top_state["answer"],
    rank_to_string(top_state["Pop. Rank"]),
    rank_to_string(top_state["Sporcle Rank"]), 
    top_state["diff"])
)
  • North Dakota is the most over-performing state, ranking 48th in population but 6th in the Sporcle quiz, for a difference of +42.

You can see the full list below. The over-performing states tend to have memorable quirks to them. In the top five, you have the two states outside the Lower 48, a North-South pairing, and the state furthest northeast. Only one of the bottom five touches an ocean.

rankings.sort("diff", ascending=False).reset_index(drop=True)
answer Sporcle Rank Pop. Rank diff
0 North Dakota 6.0 48 42.0
1 Alaska 7.5 47 39.5
2 South Dakota 7.5 46 38.5
3 Hawaii 12.0 40 28.0
4 Maine 14.0 41 27.0
5 Nevada 10.5 35 24.5
6 New Mexico 13.0 36 23.0
7 Idaho 21.0 39 18.0
8 Rhode Island 27.5 43 15.5
9 South Carolina 10.5 24 13.5
10 Montana 31.0 44 13.0
11 Vermont 38.0 49 11.0
12 Wyoming 39.5 50 10.5
13 Washington 3.5 13 9.5
14 West Virginia 27.5 37 9.5
15 Utah 25.5 34 8.5
16 Alabama 15.0 23 8.0
17 Mississippi 24.0 31 7.0
18 New Hampshire 37.0 42 5.0
19 Oregon 22.0 27 5.0
20 Delaware 42.0 45 3.0
21 North Carolina 9.0 10 1.0
22 Texas 1.5 2 0.5
23 New York 3.5 3 -0.5
24 California 1.5 1 -0.5
25 Florida 5.0 4 -1.0
26 Kansas 34.5 33 -1.5
27 Arizona 18.0 16 -2.0
28 Colorado 25.5 22 -3.5
29 Virginia 16.0 12 -4.0
30 Arkansas 36.0 32 -4.0
31 Louisiana 30.0 25 -5.0
32 Kentucky 33.0 26 -7.0
33 Georgia 17.0 9 -8.0
34 Nebraska 47.0 38 -9.0
35 Michigan 19.5 8 -11.5
36 Oklahoma 39.5 28 -11.5
37 New Jersey 23.0 11 -12.0
38 Ohio 19.5 7 -12.5
39 Iowa 43.0 30 -13.0
40 Tennessee 32.0 17 -15.0
41 Connecticut 48.0 29 -19.0
42 Illinois 29.0 5 -24.0
43 Maryland 44.0 19 -25.0
44 Wisconsin 45.0 20 -25.0
45 Indiana 41.0 15 -26.0
46 Minnesota 49.0 21 -28.0
47 Pennsylvania 34.5 6 -28.5
48 Massachusetts 46.0 14 -32.0
49 Missouri 50.0 18 -32.0

In the slopegraph below, you can get a more visceral, visual sense of state forgettableness. States with a absolute rank-difference less than 10 are in gray. Of states with larger differences, over-performers are in blue, while underperformers are in red.

slope = rankings.set_index("answer")[["Pop. Rank", "Sporcle Rank"]]\
    .sort("Pop. Rank", ascending=False)\
    .apply(lambda x: 51 - x)

# Make the chart more vertical
prev_styles = mplstyle.get()
mplstyle.set({ "figure.figsize": (6, 12) })

# Color the lines based on direction and magnitude
color_cycle = ["gray"] + mplstyle.get("axes.color_cycle")
pos_neg = (slope["Pop. Rank"] < slope["Sporcle Rank"])
major = (slope["Pop. Rank"] - slope["Sporcle Rank"]).apply(abs) > 9
colors = (major * (pos_neg + 1))\
    .apply(lambda x: color_cycle[int(x)])

# Plot the slopegraph
ax = slope.T.plot(
    colors=colors,
    alpha=0.7,
    legend=False, 
    xticks=[0,1],
    ylim=(0, 51))

intflo = lambda x: x if x - int(x) else int(x)
yticks = [ x + 1 for x in range(50) ]

# Set the left y-axis
ax.set_yticks(yticks)
ax.set_yticklabels([ "%s (%s)" % (
    slope.index[i],
    51 - intflo(slope["Pop. Rank"][i])
) for i in range(50) ])

# Set the right y-axis
ax2 = ax.twinx()
ax2.set_yticks(yticks)
ax2.set_ylim((0, 51))
ax2.set_yticklabels([ "%s (%s)" % (
    slope.sort("Sporcle Rank").index[i], 
    51 - intflo(slope.sort("Sporcle Rank")["Sporcle Rank"][i])
) for i in range(50) ])

mplstyle.set(prev_styles)
pass

Updates:

  • 29 Aug. 2013: Fixed erroneous claim that none of the five most underperforming states touch an ocean.

Show CodeHide Code