Facebook Government Requests Per User
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot  # needed so that mpl.pyplot resolves in the plotting code
import numpy as np
import requests
import lxml.html
import json
from io import StringIO  # available on Python 2.6+ and Python 3
from IPython.display import HTML
from markdown import markdown
md = lambda x: HTML(markdown(x))
import mplstyle, mplstyle.styles.simple
mplstyle.set(mplstyle.styles.simple)
mplstyle.set({
    "figure.figsize": (10, 15),
})
base_styles = mplstyle.get()
Facebook released its first Global Government Requests Report in late August 2013. The report tallies the requests for user data that each national government made during the first six months of 2013.
FB_GOV_URL = "https://www.facebook.com/about/government_requests/download.php"
fb_gov_csv = requests.get(FB_GOV_URL).text
fb_gov = pd.read_csv(StringIO(fb_gov_csv),
    names=["country", "requests", "users_requested", "percent_complied"],
    header=0).set_index("country")  # header=0: replace the CSV's own header row with `names`
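def parse_fb_counts(s):
    # Counts may be a single number or, for ranges, two numbers separated by " - ".
    commaless = str(s).replace(",", "")
    split = commaless.split(" - ")
    # Average the endpoints of a range; a single number comes back unchanged.
    mean = np.mean([float(x) for x in split])
    return mean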
fb_gov["requests"] = fb_gov["requests"].apply(parse_fb_counts)
fb_gov["users_requested"] = fb_gov["users_requested"].apply(parse_fb_counts)
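The range handling above can be checked in isolation. What follows is a minimal, standard-library sketch of the same idea, not the notebook's own code: the function name `midpoint_of_range` is hypothetical, and the sample strings assume that ranges are written as two comma-formatted numbers separated by " - ".

```python
def midpoint_of_range(s):
    """Return a plain count as a float, or the average of a "low - high" range."""
    # Strip thousands separators, then split on the range delimiter (assumed " - ").
    parts = str(s).replace(",", "").split(" - ")
    values = [float(p) for p in parts]
    return sum(values) / len(values)

print(midpoint_of_range("20,000 - 21,000"))  # 20500.0
print(midpoint_of_range("4,144"))            # 4144.0
```

A single number passes through unchanged because it is treated as a one-element list, so the same function can be applied to every row.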
During that time, the U.S. government requested data on far more accounts than any other country. (The government hasn't allowed Facebook to publish the specific numbers, only ranges, for the U.S. requests. For the purposes of this analysis, we'll use the average across those ranges.) Here are the top 10 countries, by number of user accounts requested:
fb_gov.sort_values("users_requested", ascending=False)["users_requested"].head(10).reset_index()
| | country | users_requested |
---|---|---|
0 | United States | 20500 |
1 | India | 4144 |
2 | United Kingdom | 2337 |
3 | Italy | 2306 |
4 | Germany | 2068 |
5 | France | 1598 |
6 | Brazil | 857 |
7 | Spain | 715 |
8 | Australia | 601 |
9 | Chile | 340 |
But there's a crucial flaw with this metric. The countries at the top of this list tend to be large and relatively developed: the sorts of countries we'd expect to have the most Facebook users overall. It's not particularly interesting that they're also the countries requesting the most user data. So let's try controlling for the number of Facebook users per country, scraped from Wikipedia's Facebook statistics page, which has end-of-2012 data.
FB_USERS_URL = "http://en.wikipedia.org/wiki/Facebook_statistics"
fb_users_text = requests.get(FB_USERS_URL).text
fb_users_dom = lxml.html.fromstring(fb_users_text)
fb_user_rows = fb_users_dom.cssselect("table.wikitable")[0].cssselect("tr")[1:]
def parse_user_row(row):
    cells = row.cssselect("td")
    return {
        # The second cell holds the linked country name; the fifth holds the user count.
        "country": cells[1].cssselect("a")[0].text,
        "fb_users": int(cells[4].text.replace(",", ""))
    }
fb_users = pd.DataFrame([parse_user_row(row) for row in fb_user_rows]).set_index("country")
fb = fb_gov.join(fb_users, how="left")
missing_usage = fb[fb["fb_users"].isnull()]
md("The two datasets match up well; there are only %d countries in the government-requests dataset missing from the usage dataset:\n\n %s" %
(len(missing_usage), "\n".join("- %s" % x for x in missing_usage.index)))
The two datasets match up well; there are only 2 countries in the government-requests dataset missing from the usage dataset:
- Kosovo
- Ivory Coast
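To make the matching step concrete, here is a plain-Python sketch of what a left join does, using a small hand-picked subset of countries. It is an illustration only and not the pandas call used in this notebook; the variable names are assumptions, and the two user counts are the end-of-2012 figures that appear in the data table at the end of this notebook.

```python
# Country names from the requests report (a small, hand-picked subset).
gov_countries = ["United States", "Malta", "Kosovo", "Ivory Coast"]

# End-of-2012 Facebook user counts for the countries present in the usage data.
usage = {"United States": 166029240, "Malta": 217040}

# A left join keeps every country from the left side; unmatched ones get None.
joined = {country: usage.get(country) for country in gov_countries}

# Countries with no usage figure are the ones that failed to match.
missing = [country for country in gov_countries if joined[country] is None]
print(missing)  # ['Kosovo', 'Ivory Coast']
```

Countries without a usage figure stay in the joined data rather than being dropped, which is why they have to be filtered out explicitly before computing any per-user rate.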
To check our previous hunch, we can plot (the base-10 logarithms of) each country's number of Facebook users vs. the number of accounts that country requested.
def plot_fb_scatter():
    mplstyle.set({ "figure.figsize": (8, 8) })
    # Transform both series with base-10 logs before plotting.
    ax = mpl.pyplot.scatter(fb["fb_users"].apply(np.log10), fb["users_requested"].apply(np.log10),
        s=20,
        alpha=0.7).axes
    ax.set_ylim(0, ax.get_ylim()[1] * 1.1)
    ax.set_title("Facebook Users vs. Accounts Requested, by Country\n")
    ax.set_xlabel("\nLog10 of Facebook Users, End of 2012")
    ax.set_ylabel("Log10 of Accounts Requested, First Half of 2013\n")
    mplstyle.reset(base_styles)
plot_fb_scatter()
pearsons_r = fb["fb_users"].apply(np.log10).corr(fb["users_requested"].apply(np.log10))
md("""
Indeed, it appears that countries with more Facebook users tend to request data on a greater number of user accounts.
(The *Pearson's r* is %.2f, which suggests a fairly strong, positive correlation.)
""" % round(pearsons_r, 2))
Indeed, it appears that countries with more Facebook users tend to request data on a greater number of user accounts. (The Pearson's r is 0.65, which suggests a fairly strong, positive correlation.)
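For readers who want to see what that correlation computes, here is a standard-library sketch of Pearson's r applied to log-transformed values. The helper `pearson_r` is a hypothetical name, and the six (users, accounts requested) pairs are a hand-picked subset of the figures in the data table at the end of this notebook, so the result will not match the 0.65 computed over the full dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# (fb_users, users_requested): United States, India, United Kingdom,
# Ireland, Cyprus, Montenegro. An illustrative subset, not the full data.
sample = [
    (166029240, 20500),
    (62713680, 4144),
    (32950400, 2337),
    (2183760, 40),
    (582600, 4),
    (306260, 2),
]

log_users = [math.log10(users) for users, _ in sample]
log_requested = [math.log10(requested) for _, requested in sample]

r = pearson_r(log_users, log_requested)
print(round(r, 2))
```

Taking logarithms first keeps the few very large countries from dominating the calculation, which is the same reason the scatter plot uses log axes.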
Let's adjust for this by creating a new metric: accounts requested per million users. The chart below plots this number for all countries that requested data on more than 10 users.
fb["per_m_users"] = fb["users_requested"] * 1e6 / fb["fb_users"]
def plot_request_rate():
    # Keep countries with known usage figures and more than 10 accounts requested.
    fb_sort_rate = fb[fb["fb_users"].notnull() & (fb["users_requested"] > 10)].sort_values("per_m_users")
    ax = fb_sort_rate["per_m_users"].plot(kind="barh", color="teal", alpha=0.5)
    ax.grid(axis="y")
    # Label each bar with its rate, rounded to one decimal place.
    for i in range(len(fb_sort_rate)):
        row = fb_sort_rate.iloc[i]
        ax.text(row["per_m_users"], i+0.6, "% 0.1f" % row["per_m_users"], va="center")
    ax.set_xlim(0, ax.get_xlim()[1] * 1.1)
    ax.set_title("Facebook Accounts Requested Per Million Users\nDuring the First Six Months of 2013, by Country\n")
    ax.set_ylabel("")
    ax.set_xlabel("\nAccounts Requested Per Million Users")
plot_request_rate()
Even by this metric, the U.S. still out-requests all other major countries, though by a smaller margin than it appeared before accounting for the number of Facebook users. But the U.S. ranks only second overall, dwarfed by a tiny Mediterranean island nation.
commafy = lambda x: "{:,}".format(int(round(x)))
malta = fb.loc["Malta"]
next_highest = fb["per_m_users"].sort_values(ascending=False).iloc[1]
md("""
__Malta requested data on %d Facebook accounts per million users, more than %d times the next-highest rate.__
Flipping the denominator, that's 1 in every %s users.
""" % (
round(malta["per_m_users"]),
int(malta["per_m_users"] / next_highest),
commafy(round(malta["fb_users"] / malta["users_requested"])))
)
Malta requested data on 447 Facebook accounts per million users, more than 3 times the next-highest rate. Flipping the denominator, that's 1 in every 2,238 users.
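The arithmetic behind those three figures can be reproduced by hand. This sketch uses only the Malta and United States rows from the data table at the end of this notebook; the variable names are illustrative, not taken from the notebook's code.

```python
# Accounts requested and end-of-2012 Facebook users, from the data table.
malta_requested, malta_users = 97, 217040
us_requested, us_users = 20500, 166029240

# Accounts requested per million users.
malta_rate = malta_requested * 1e6 / malta_users
us_rate = us_requested * 1e6 / us_users

print(round(malta_rate))          # 447
print(int(malta_rate / us_rate))  # 3 (truncated, hence "more than 3 times")

# Flipping the denominator: users per account requested.
print("{:,}".format(int(round(malta_users / malta_requested))))  # 2,238
```

Truncating the ratio rather than rounding it is what makes "more than 3 times" accurate: the unrounded ratio is roughly 3.6.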
It's possible that the usage statistics for Malta are wrong or outdated; if not, the country is a remarkable outlier. A cursory search of Maltese news doesn't reveal any obvious explanations. Have any hypotheses?
Here's the data behind the chart, if you're curious:
cols = ["country", "users_requested", "fb_users", "per_m_users"]
fb[fb["fb_users"].notnull()].sort_values("per_m_users", ascending=False).reset_index()[cols]
| | country | users_requested | fb_users | per_m_users |
---|---|---|---|---|
0 | Malta | 97 | 217040 | 446.922226 |
1 | United States | 20500 | 166029240 | 123.472227 |
2 | Italy | 2306 | 23202640 | 99.385242 |
3 | Germany | 2068 | 25332440 | 81.634458 |
4 | United Kingdom | 2337 | 32950400 | 70.924784 |
5 | India | 4144 | 62713680 | 66.078087 |
6 | France | 1598 | 25624760 | 62.361560 |
7 | New Zealand | 119 | 2256040 | 52.747292 |
8 | Australia | 601 | 11680640 | 51.452660 |
9 | Portugal | 213 | 4663060 | 45.678160 |
10 | Spain | 715 | 17590500 | 40.646940 |
11 | Singapore | 117 | 2915640 | 40.128411 |
12 | Greece | 141 | 3845820 | 36.663182 |
13 | Chile | 340 | 9687720 | 35.095977 |
14 | Israel | 132 | 3792820 | 34.802601 |
15 | Belgium | 169 | 4922260 | 34.333822 |
16 | Taiwan | 329 | 13240660 | 24.847704 |
17 | Barbados | 3 | 121620 | 24.666996 |
18 | Botswana | 7 | 294000 | 23.809524 |
19 | Ireland | 40 | 2183760 | 18.317031 |
20 | Poland | 158 | 9863380 | 16.018850 |
21 | Brazil | 857 | 58565700 | 14.633139 |
22 | Malaysia | 197 | 13589520 | 14.496465 |
23 | Austria | 41 | 2915240 | 14.064022 |
24 | Sweden | 66 | 4950160 | 13.332902 |
25 | Canada | 219 | 18090640 | 12.105708 |
26 | Switzerland | 36 | 3055800 | 11.780876 |
27 | Macedonia | 11 | 962780 | 11.425248 |
28 | Slovenia | 8 | 730160 | 10.956503 |
29 | Albania | 12 | 1097800 | 10.930953 |
30 | Argentina | 218 | 20048100 | 10.873848 |
31 | Bosnia and Herzegovina | 11 | 1345020 | 8.178317 |
32 | Cyprus | 4 | 582600 | 6.865774 |
33 | Romania | 36 | 5374980 | 6.697699 |
34 | Finland | 15 | 2287960 | 6.556059 |
35 | Montenegro | 2 | 306260 | 6.530399 |
36 | Lithuania | 7 | 1118500 | 6.258382 |
37 | Pakistan | 47 | 7984880 | 5.886125 |
38 | Norway | 16 | 2771480 | 5.773089 |
39 | Hungary | 24 | 4265960 | 5.625932 |
40 | Turkey | 170 | 32131260 | 5.290798 |
41 | Qatar | 3 | 671720 | 4.466147 |
42 | Iceland | 1 | 227000 | 4.405286 |
43 | Mongolia | 2 | 515080 | 3.882892 |
44 | Denmark | 11 | 3037700 | 3.621161 |
45 | Czech Republic | 13 | 3834620 | 3.390166 |
46 | Mexico | 127 | 38463860 | 3.301801 |
47 | Costa Rica | 6 | 1889620 | 3.175242 |
48 | Colombia | 41 | 17322000 | 2.366932 |
49 | Netherlands | 15 | 7554940 | 1.985456 |
50 | Panama | 2 | 1014160 | 1.972075 |
51 | Uganda | 1 | 562240 | 1.778600 |
52 | Nepal | 3 | 1940820 | 1.545738 |
53 | South Korea | 15 | 10012400 | 1.498142 |
54 | Peru | 14 | 9351460 | 1.497092 |
55 | South Africa | 9 | 6269600 | 1.435498 |
56 | Cambodia | 1 | 742220 | 1.347309 |
57 | El Salvador | 2 | 1491480 | 1.340950 |
58 | Croatia | 2 | 1595760 | 1.253321 |
59 | Egypt | 11 | 12173540 | 0.903599 |
60 | Bangladesh | 12 | 14352680 | 0.836081 |
61 | Ecuador | 3 | 4970680 | 0.603539 |
62 | Bulgaria | 1 | 2522120 | 0.396492 |
63 | Serbia | 1 | 3377340 | 0.296091 |
64 | Thailand | 5 | 17721480 | 0.282143 |
65 | Hong Kong | 1 | 4034560 | 0.247859 |
66 | Philippines | 4 | 29890900 | 0.133820 |
67 | Russia | 1 | 7963400 | 0.125575 |
68 | Japan | 1 | 17196080 | 0.058153 |