Week 7: Starting the Final Project

I’ve realized there’s no reason to constrain everyone to having both a real user and a real data set right out of the gate. Once you settle on one constraint, we’ll find or simulate the other.

There are a few considerations about the data source:
1) It’s updated at least daily
2) It can be made completely publicly available, even in granular form
3) It can be used to make a decision with important consequences

Also about the user:
1) They’re able to answer questions by email or phone over the course of the next few weeks
2) They would be able to gather this sort of data
3) They are either the decision maker, or directly talking to the decision maker

Once you’ve nailed those down, you’ll step through the data pipeline that we’ve set up repeatedly in class. The first part, which we only touched on briefly, is the collection of data. We looked at web scraping options, but you can also use APIs or batch-processed data extracts. Whichever intake method you choose, your first step will be writing the data into an SQLite database in its most granular form (the operational database). The reason for this is that you will probably make changes to your data warehouse design as you iterate, and you’ll need to be able to recreate the whole thing from raw data.
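As a rough sketch of that first step, here is one way to land raw records in an SQLite operational database before doing any cleaning. The table name `raw_records`, its columns, and the sample row are assumptions made up for illustration; your intake method will dictate the real ones.

```python
import json
import sqlite3

def store_raw(conn, source, records):
    """Append raw records, one JSON blob per row, without transforming them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_records ("
        "id INTEGER PRIMARY KEY, source TEXT, "
        "fetched_at TEXT DEFAULT CURRENT_TIMESTAMP, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_records (source, payload) VALUES (?, ?)",
        [(source, json.dumps(r)) for r in records],
    )
    conn.commit()

# Hypothetical example row; pass a file path instead of ':memory:' to persist.
conn = sqlite3.connect(":memory:")
store_raw(conn, "example_api", [{"zipcode": "02139", "price": 500000}])
row_count = conn.execute("SELECT COUNT(*) FROM raw_records").fetchone()[0]
payload = conn.execute("SELECT payload FROM raw_records").fetchone()[0]
print(row_count, payload)
```

Storing the untouched payload means the warehouse tables can be dropped and rebuilt from this one table whenever the design changes.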

Once you’ve built a data warehouse, you have several options for your data front end. In all, there should be three interfaces on your dashboard:
1) A weekly report
2) An every-morning status update
3) An alert

Of course, this is going to be an iterative process, and what you’ll enter into the spreadsheet now is your best guess. But it pays to be bold and specific with these guesses, because they are the platform on which your next, more refined, draft will be built.

In the spreadsheet below, add at least one data source and one potential user (someone that you know personally, and suspect would be open to trading emails with some students). It doesn’t have to be one that you’re interested in personally. Once we get to class, we’ll go through the list, each of you will pick an idea, and we’ll get going!

Add ideas to the spreadsheet

Boston housing prices and accessing public data through Enigma’s APIs

It’s exciting to see how much data is available for public consumption, often for free, and I wanted to play around with building a connection between my Jupyter notebook and one of the most user friendly repositories that I’ve found, Enigma.

The first challenge of using any API is that it makes each test of your code feel expensive. There are often limitations to how often you can request information, so each piece of information that you pull down feels precious. When you’re just developing something simple locally, an effective testing strategy is to wipe everything and run it again, but you can find yourself stuck without data if you’re not careful about this.

One of the first things that I do is pull down the minimum amount of data (often called one ‘page’ of data, in this case 100 rows). I make it clear in the variable name that this isn’t the full thing:

import json
import urllib
# Format: 
#<version>/<endpoint>/<api key>/<datapath>/<parameters>
url1 = "<your api key>/"
response1 = urllib.urlopen(url1)
data1 = json.loads(response1.read())

Then take a look to understand the structure:

print data1.keys() 
for key in ['info','datapath','success']:
    print "\n\n== {} == \n{}".format(key,data1[key])
print type(data1['result'])
print data1['result'][0] #Check out the first row of actual data

print df.columns
def is_number(x):
    # True if x can be converted to a float
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False


Once you’re satisfied with that, it’s time to download a little more systematically:



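import pickle
import random
datalist = []

# Pages fetched on earlier runs, one page number per line
with open('api_pages_downloaded.txt','r') as f:
    already_used = [int(x) for x in f.read().split('\n') if x]

for i in [x for x in random.sample(set(range(326))-set(already_used),100)]:
    with open('api_pages_downloaded.txt','r') as f:
        if str(i) in f.read().split('\n'):
            print "Page {} already in datalist".format(i)
            continue
    url = "<your api key>/{}".format(i)
    response = urllib.urlopen(url)
    data = json.loads(response.read())
    datalist.append(data)
    # Save after every page so an interrupted run keeps what it fetched
    with open('datalist.pkl','w') as f:
        pickle.dump(datalist, f)
    with open('api_pages_downloaded.txt','a') as f:
        f.write('{}\n'.format(i))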

After all of that work, getting the data into a Pandas DataFrame is mercifully easy:

import pandas as pd
all_data = []
for x in datalist: all_data.extend(x['result'])
df = pd.DataFrame(all_data)

There are a lot of columns. On balance, this is great! But you may want to arrange them into id, categorical, and numeric lists, if only for your own sanity. Fortunately, there are not going to be issues with Python time series data structures, since this is a static data set. That means we can just clean everything up so that there are no nulls to cause issues with plotting later:

temp = df.fillna(0).applymap(is_number).all()
id_var = ['pid','serialid']
cat_var = ['ptype','zipcode','mail_zip']
num_var = [x for x in temp[temp].index if x not in id_var+cat_var]
cat_var += [x for x in df.columns if x not in id_var+num_var+cat_var]
print id_var
print num_var
print cat_var

df[num_var] = df[num_var].fillna(1).astype(float)
df[cat_var] = df[cat_var].fillna(1).astype(str)
temp = df[[x for x in df.columns if x not in ('cm_id','unit_num','st_num')]]
print temp.isnull().mean().order(ascending=False).head()
df = temp

This post isn’t primarily about the analysis, but I couldn’t miss a chance to show how much pricier houses are in Boston than in the immediately surrounding area:

import seaborn as sns
by_count = df[ == 'R1'].groupby('mail_cs').size()
sufficient_listings = by_count[by_count > 100].index
temp = df[( == 'R1')& (df.mail_cs.isin(sufficient_listings))].copy()
temp['av_total_log10'] = temp.av_total.apply(np.log10)
g = sns.FacetGrid(
    data=temp, # 


I hope that this was helpful – a surprising amount of work goes into the conceptually simple task of downloading some data and getting it into a form that you can work with. Almost all of this generalizes beyond the housing data set, and much of it beyond Enigma, so I hope that it ends up being a good resource as you explore. Let me know what you do with it!

How to make a Python Histogram (Using Python to Excel, part 1)

Excel is the perfect tool for many applications – the problem is that it’s used for about 5 billion more on top of those.

Fortunately, I’ve found many things that are complex to accomplish in Excel are extremely simple in Python. More important, there’s no copy/pasting of data, or unlabeled cells with quick calculations. Both of these ad hoc methods invariably leave me confused when I re-open the workbook to update my numbers for next month’s report. In this post, I’ll walk through many of the basic functions you’d use Excel for, and show that they’re just as simple in Python.

Let’s walk you through an example analysis. We’ll be looking at nutritional facts about a few different breakfast cereals:


Download and import data

This data is drawn from a site that is a wonderful source of examples, but makes the puzzling decision to keep data in a tab-separated format, as opposed to the comma-separated standard. For that reason, I’m attaching a .csv of the data here to save you a bit of reformatting work.

Original data

Suggested Download


In Excel, importing the data is a three step process:

1. Find Data > From Text:



2. Find the file, then choose “Delimited” and hit “Next”

[Screenshot: Text Import Wizard, Step 1 of 3]

3. Make sure you change the delimiter to “Comma” (what else it imagines .csv to stand for is beyond me). Then you can hit ‘finish’

[Screenshot: Text Import Wizard, Step 2 of 3]


In Python, you just use ‘read_csv’ (I’ll cover general setup in a later post, but the basics are that you should start with the Anaconda distribution)

[Screenshot: cereals notebook]


Filter the data

To keep things to a manageable size, we’ll filter to the Kellogg’s cereals

This is accomplished with a simple dropdown in Excel:

[Screenshot: cereals workbook in Excel]

Python’s approach requires a bit of explanation. As you can see, ‘data’ (which is basically the same as a worksheet in Excel) is being filtered by the data.mfr == ‘K’ in the square brackets.

This simply says “Take all of the rows in this sheet where the mfr column exactly equals (that’s what the == means!) the letter K”. Then, we simply save it as ‘kelloggs’ so we can remember (just like naming an Excel sheet).

I’m using a useful function called .head(), which lets you see the first few rows of a sheet, so that you don’t get overwhelmed with output

[Screenshot: cereals notebook]
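Since the screenshot doesn’t survive as text, here is a minimal sketch of the import and filter steps, written for Python 3 and current pandas. The three rows are made up, and only a few of the real file’s columns are included; `mfr` is the column named in the text, the others are assumed from the rest of the post.

```python
import io
import pandas as pd

# Made-up sample rows in the original tab-separated layout.
raw = io.StringIO(
    "name\tmfr\tcalories\tcups\trating\n"
    "Cereal A\tK\t110\t1.0\t40.1\n"
    "Cereal B\tG\t100\t0.75\t55.3\n"
    "Cereal C\tK\t120\t0.5\t33.9\n"
)
data = pd.read_csv(raw, sep="\t")   # for a comma-separated file, drop sep
kelloggs = data[data.mfr == "K"]    # keep rows where mfr exactly equals 'K'
print(kelloggs.head())              # first few rows only
```

With a file on disk you would pass its path to `read_csv` instead of the in-memory text used here.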

Sort the data

The final step of this introduction is sorting the data.

In Excel, this is accomplished with the same dropdown as before:

[Screenshot: cereals workbook in Excel]

In Python, you call the aptly named ‘sort’ function, and tell it what to sort on

[Screenshot: cereals notebook]
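A small sketch of the sort step, with made-up rows. Two assumptions: the sort column (`rating`) is my guess, and I use `sort_values`, which is the name current pandas gives to what older versions called `sort` on a DataFrame.

```python
import pandas as pd

# Made-up rows standing in for the filtered Kellogg's sheet.
kelloggs = pd.DataFrame({
    "name": ["Cereal A", "Cereal C", "Cereal D"],
    "rating": [40.1, 33.9, 59.4],
})
# Highest rating first; drop ascending=False for lowest first.
kelloggs_sorted = kelloggs.sort_values("rating", ascending=False)
print(kelloggs_sorted)
```

Like filtering, sorting returns a new sheet, so the original `kelloggs` is left untouched.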


Make a Histogram

Making a histogram in Python is extremely easy. Simply use the .hist() function. If you’ve got the plotting package seaborn enabled, it will even look nice too!


[Figure: example Python histogram]



That’s it for part 1! Stay tuned for calculated columns and pivots in part 2

Using Python to Excel (Part 2 – Calculate Columns and Pivot)

Note: If you’re new here, you’ll probably want to start with Part 1

Python’s Excel pivot table

We’ve made great progress, and things are only just getting started. We’ll mostly be talking about pivot tables, though we’ll start by creating the calculated column that we’ll pivot on shortly:

Calculated Column

The cereal’s rating (apparently given by Consumer Reports) is bizarrely precise, so I thought it would be valuable to make bins of 10. This starts to get a little complex, but isn’t too bad:

[Screenshot: cereals workbook in Excel]

Python actually uses the same function name. Note that I’m telling it to only display three columns this time, but all of the rest are still in the ‘kellogs_sorted’ sheet:

[Screenshot: cereals notebook]



This is where things get really intense. I’ve squished Excel’s pivot table a bit to fit it, but you can get a sense of the simple stuff that we’re doing here, just summing the cups and calories for each rating category:

[Screenshot: PivotTable Field List]

Python uses a function called pivot_table to achieve this same thing:

[Screenshot: cereals notebook]
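Here is a sketch of the calculated column and the pivot together, on made-up rows. The screenshot’s exact binning function isn’t recoverable, so rounding down with floor division is an assumption, as is the column name `rating_bin`.

```python
import pandas as pd

# Made-up rows; the real sheet has more cereals and more columns.
cereals = pd.DataFrame({
    "name": ["Cereal C", "Cereal A", "Cereal E", "Cereal D"],
    "rating": [33.9, 40.1, 45.8, 59.4],
    "cups": [0.5, 1.0, 0.75, 1.0],
    "calories": [120, 110, 100, 90],
})
# Calculated column: round each rating down to the nearest multiple of 10.
cereals["rating_bin"] = (cereals.rating // 10 * 10).astype(int)

# Pivot: one row per rating bin, summing cups and calories.
pivot = pd.pivot_table(
    cereals, index="rating_bin", values=["cups", "calories"], aggfunc="sum"
)
print(pivot)
```

The `index` argument plays the role of Excel’s “Rows” box and `values` the role of its “Values” box.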

Calculating Values in Pivots

Excel has you add equation columns (calculated fields) to the pivot, then bring those into the table

[Screenshot: cereals workbook in Excel]

[Screenshot: Insert Calculated Field]

[Screenshot]


In python you can take the table that you made before and simply add the column directly (and then round it, to keep the numbers reasonable)

[Screenshot: cereals notebook]
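As a sketch of adding a column straight onto the pivoted table: the numbers are made up, and “calories per cup” is my guess at the kind of value being calculated, not necessarily the one in the screenshot.

```python
import pandas as pd

# A small pivot-style table with made-up sums per rating bin.
pivot = pd.DataFrame(
    {"calories": [120, 210, 90], "cups": [0.5, 1.75, 1.0]},
    index=pd.Index([30, 40, 50], name="rating_bin"),
)
# Add the calculated column directly, then round to keep it readable.
pivot["calories_per_cup"] = (pivot.calories / pivot.cups).round(1)
print(pivot)
```

Because the pivot is just another sheet, no separate “calculated field” dialog is needed.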


Pivot with column categories

In Excel, you add the additional category (in this case, which shelf the item is on) to the “Columns” list. Note that we’re now counting the number of products, instead of summing cups or calories

[Screenshot: PivotTable Field List]

In Python, you add a new argument to the pivot_table function called “columns”, and tell it that you want “shelf” in there

You can also tell it to put 0 in the blank cells, so that the table makes more visual sense

[Figure: Python pivot table]
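A sketch of the pivot with column categories, counting products instead of summing. The rows are made up, and `rating_bin` is an assumed name for the binned rating column; `columns` and `fill_value` are the two arguments described above.

```python
import pandas as pd

# Made-up rows: one product per row, with its rating bin and shelf.
cereals = pd.DataFrame({
    "name": ["Cereal C", "Cereal A", "Cereal E", "Cereal D"],
    "rating_bin": [30, 40, 40, 50],
    "shelf": [1, 3, 3, 2],
})
counts = pd.pivot_table(
    cereals,
    index="rating_bin",   # rows
    columns="shelf",      # one column per shelf
    values="name",
    aggfunc="count",      # count products rather than summing
    fill_value=0,         # show 0 instead of a blank cell
)
print(counts)
```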


When it correctly guesses at your intentions, Excel’s graphs are pretty magical, laying everything out in essentially one click:

[Screenshot: cereals workbook in Excel]


Python is pretty magical too, simply requiring you to specify that you’d like a bar plot, and then allowing you to set as many or as few labels as you’d like. The legend does have an unfortunate tendency to default to the worst corner of the graph, but it’s easy enough to move around:

[Screenshot: cereals notebook]


It’s clearly less attractive by default, though. Here’s an example of what you can do with a little more specification:


From the matplotlib gallery



Histograms are the first of the functions that Excel doesn’t have a button for. Excel recommends a 6 step process outlined here:

Python has a one-step histogram function, .hist(). Here’s an example histogram of the Consumer Reports ratings of all of the cereals:

[Screenshot: cereals notebook]


Histogram Comparisons

You can extend the functionality even further by showing two histograms over each other. By default, the function groups each data set into 10 bins of its own, which can be misaligned if you are comparing two data sets.

I used a new function called ‘range’ that simply gives me the numbers from 0 (the first argument) up to, but not including, 100 (the second), counting by 10 (the third argument). Then I tell .hist() that it needs to use those as bins for both the overall data, and the data filtered to Kellogg’s

[Screenshot: cereals notebook]
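To show what shared bins buy you without drawing anything, this sketch counts two made-up sets of ratings against the same bin edges using NumPy’s `histogram`. Using 101 as the stop value is my choice so that 100 is included as the last edge; the ratings themselves are invented.

```python
import numpy as np

# Bin edges 0, 10, ..., 100; range() excludes its stop value, hence 101.
bins = list(range(0, 101, 10))

all_ratings = [18.0, 33.9, 40.1, 45.8, 59.4, 68.4, 93.7]   # made-up
kelloggs_ratings = [33.9, 40.1, 59.4]                      # made-up

all_counts, edges = np.histogram(all_ratings, bins=bins)
k_counts, _ = np.histogram(kelloggs_ratings, bins=bins)
print(all_counts)
print(k_counts)
```

Passing the same `bins` list to both `.hist()` calls lines the bars up in the same way, so the two histograms can be compared bar for bar.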

This allows us to see that Kellogg’s is doing about as well as the group overall in terms of ratings.



That’s it for the basics! I hope you’ve found it useful, and that you give Python a try next time you want to explore a data set.

Using Bayes’ theorem on two-way categorical data

Still revising, but I figured that in true Bayesian fashion, I’d update this dynamically as more information came in

It’s said that the best way to understand something is to teach it, and the huge number of explanations of Bayes’ Theorem suggest that many (like me!) have struggled to learn it. Here is my short description of the approach that ultimately led to some clarity for me:

Let’s do the “draw a cookie from the jar” example from Allen Downey’s Think Bayes, with a bit more of a plausible backstory (one cannot be an actor, or act like a Bayesian, without understanding their character’s motivation). I made two 100-cookie batches, one with cilantro (Bowl 1) and one without (Bowl 2). Because my cilantro-hating friend prefers vanilla cookies, I made 75 of those (plus 25 chocolate to round out the batch) and put them in the cilantro-free bowl. I made 50 vanilla and 50 chocolate for the cilantro-added bowl.

Everyone comes over for my weird cookie party, but I forget to tell my cilantro-hating friend that they should only choose out of one of the bowls. <EDIT – have the bowls be mixed together, so that each cookie has an equal probability. Doesn’t change the problem, but removes need for an assumption of equal bowl probability> Being entirely too trusting, they just grab a cookie randomly from one of the bowls in the kitchen, and walk back to the living room. I stop them and say “Wait! Do you know which bowl that came out of?” and they say “Oh, no I wasn’t paying attention, but if you made different numbers of vanilla, which this cookie is, that should at least give us a probability of whether it came from the cilantro bowl. It wouldn’t be catastrophic to accidentally take a cilantro bite, so I’ll go for it if my chance of it being cilantro-free is greater than 55%”

Here’s what the situation would look like as a table

               | Vanilla | Chocolate | Total |
Cilantro-free  |   75    |    25     |  100  |
Cilantro-add   |   50    |    50     |  100  |
Total          |   125   |    75     |  200  |

Now, we already know that they had a 75% chance of getting a vanilla cookie if they chose the CF bowl. But that’s NOT the question at hand. The question at hand is related, but different: What is the chance that they chose the CF bowl if they got a vanilla cookie? Let’s watch a replay:

We know: Chance they have a vanilla cookie if cookie from CF bowl

We want:  Chance cookie from CF bowl if they have a vanilla cookie

The reason that I stress this is that the conventional method of “Null Hypothesis Significance Testing” (the whole concept of “the null hypothesis was rejected at p<.05” that you see in most papers) is analogous to the first statement, but we almost always want to make decisions based on the value of the second statement. To be even more direct: Most statistical analysis that we see leaves us with a number (p) that is one step short of what we can make decisions on.

Fortunately, there is an equation that can take us from what we know to what we actually want. Unfortunately, it requires additional variables to solve. In this case, we have the additional information. In other cases, we would have to estimate those values, without any method of checking our estimate (until we get more data). Painfully, after all of this work to design a clean experiment, accurately measure results, and methodically run the numbers, our very last step requires us to irrevocably taint our objective results with an estimate that is picked out of the air. It’s so frustrating, and you can start to understand why people try to use the first number, whose definition is so close to sounding like what they need, but that’s just how the math works out. Let me know if you find an alternative.

Let’s try it on this case. Here’s the simple derivation of Bayes’ Theorem:

p(A and B) = p(B and A)

p(A if B) x p(B) = p(B if A) x p(A)

Therefore: p(A if B) = p(B if A) x p(A) / p(B)

Remembering where we left off:

We know: Chance they have a vanilla cookie if cookie from CF bowl

We want:  Chance cookie from CF bowl if they have a vanilla cookie

If A = cookie from CF bowl

and B = they have a vanilla cookie

Then p(cookie from CF bowl if they have a vanilla cookie) = p(they have a vanilla cookie if cookie from CF bowl) x p(cookie from CF bowl) / p(they have a vanilla cookie)

Note that the first term to the right of the equals sign is ‘we know’, and the final result is ‘we want’. Unfortunately, there are those two other unknowns to calculate, which is where a bit of subjectivity comes in:

p(cookie from CF bowl) means “The overall chance that any cookie (vanilla or chocolate) would be drawn from the CF bowl”. Since there are two bowls, and we don’t know of any reason one would be picked over another, we assume this is 50%. But this is an assumption, and many real-life problems would give alternatives that are clearly not 50/50, without giving clear guidance on whether they should be considered 45/55, 15/85 or .015/99.985. Note that, if you assume each cookie was equally likely to be selected, this number could be calculated from the total number of cookies in each bowl in the far right column (i.e. 100 of the 200 total cookies are in the CF bowl)

p(they have a vanilla cookie) means “The overall chance that any cookie would be vanilla”. In this case, simply look at the total number of cookies of each type (the totals on the bottom row of the table) and see that vanilla makes up 125/200 of the total. (NOTE: does this change if the bowls are not equally likely to be selected?)

Once you’ve gotten over the implications, the final calculation is easy:

.75 * (100/200) / (125/200) = .6
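The same arithmetic as a short Python 3 sketch, using the counts from the table and exact fractions so that no rounding creeps in. The variable names are mine.

```python
from fractions import Fraction

# Counts from the cookie table.
vanilla_cf, total_cf = 75, 100      # vanilla in the CF bowl, all CF cookies
vanilla_all, total_all = 125, 200   # all vanilla cookies, all cookies

p_v_given_cf = Fraction(vanilla_cf, total_cf)   # what we know
p_cf = Fraction(total_cf, total_all)            # prior for the CF bowl
p_v = Fraction(vanilla_all, total_all)          # overall chance of vanilla

p_cf_given_v = p_v_given_cf * p_cf / p_v        # what we want
print(p_cf_given_v, float(p_cf_given_v))
```

The result clears the friend’s 55% threshold, so they would go ahead and eat the cookie.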

It’s also interesting to see that the .75 could be calculated in much the same way as the other two variables (percentage of the total in their row or column), along the top row. Specifically, “within the Cilantro-free row, what portion of cookies are vanilla?” is simply the intersection of CF and Vanilla, divided by the total of the row.

               | Vanilla | Chocolate | Total |
Cilantro-free  |   75    |    25     |  100  |
Cilantro-add   |   50    |    50     |  100  |
Total          |   125   |    75     |  200  |

Let’s look at all the factors again in that light:

p(they have a vanilla cookie if cookie from CF bowl):

  • Numerator: # in the intersection of CF and Vanilla
  • Denominator: # of CF

p(cookie from CF bowl)

  • Numerator: # of CF
  • Denominator: # of total cookies

p(they have a vanilla cookie)

  • Numerator: # of Vanilla
  • Denominator: # of total cookies

This is all very symmetric with the definition of our result:

p(cookie from CF bowl if they have a vanilla cookie)

  • Numerator: # in the intersection of CF and Vanilla
  • Denominator: # of Vanilla



Bayes in Excel

Why Excel?

Solving the Cookie Problem

As always, the row labels (Cilantro-Free and Cilantro-Add) are your hypotheses. You can’t know them directly, but the whole goal of this exercise is to use data (that’s the columns) to increase the difference in probability until you can feel comfortable that one hypothesis is likely enough to act on.

p(H|D) = p(D|H)p(H) / p(D)

p(D) = ∑p(D|Hi)p(Hi), or p(D & Hi) for all i, assuming hypotheses are MECE

∑: Sum of all. (In this case, we’re saying that the probability of the data is the sum of p(D & H) over all possible hypotheses, assuming they are MECE.)

MECE: Mutually Exclusive and Collectively Exhaustive. This is a great term for a concept that we all intuit but don’t have a good word for. Generally, we should be formulating our set of hypotheses so that there’s no chance that the data could have come any other way. So I’m not saying that the cookie could have come from the Cilantro-Free, Cilantro-Add, or “an unknown variety of other sources”. If there are other sources, I need to make that a definite hypothesis, with a prior, and everything that my first two have.

In this simple case, ∑p(D|Hi)p(Hi) just means p(V & CF) = 75/200 plus p(V & CA) = 50/200, which is 125/200

               | Vanilla | Chocolate | Total |
Cilantro-Free  |   75    |    25     |  100  |
Cilantro-Add   |   50    |    50     |  100  |
Total          |   125   |    75     |  200  |

So we have all of this information, and BAM – we draw a Vanilla cookie. That means we have a piece of data. Whereas our previous best guess at where a given cookie came from was 100/200 for each container (from the ‘total’ column, since we had no more specific information), we can leave those totals behind and focus on the ‘Vanilla’ column, where the probability is the percentage of vanilla cookies from each jar.

Cilantro-Free 0.6
Cilantro-Add 0.4

I did all of this with a count of cookies, to keep it intuitive. Let’s replay the exact same thing with probabilities, which will allow us to take the critical leap in our next section:

               | Vanilla | Chocolate |  Total  |
Cilantro-Free  | 37.50%  |  12.50%   | 50.00%  |
Cilantro-Add   | 25.00%  |  25.00%   | 50.00%  |
Total          | 62.50%  |  37.50%   | 100.00% |

This is exactly what we saw above, except that I’ve divided everything by 200 (the total number of cookies). Look at the lower right corner first. The probability that you draw a cookie of either type from either jar is 100%. Awesome. Look at the Total column – the probability that you draw it from the CF jar is 50%, and the same from the CA jar, because they had an equal number of cookies. That’s p(H), by the way. Similarly, p(D) is the probability that you drew that vanilla. It’s 62.5%, because there are more Vanilla cookies overall. So, returning to the central equation:

p(H|D) = p(D|H)p(H)/p(D)

p(H|D) = (37.5%/50%) x 50% / 62.5%

p(H|D) = 60%

What we really just did was say “Multiply the priors of each hypothesis by how likely they were to give the data, then divide everything by the sum to make them sum to one again”. In this case, you’re saying “Now that we know we have a vanilla cookie, the chances of the cookie coming from CF are 75 (number of vanilla cookies in CF) over 75 + 50 (number of vanilla cookies anywhere)”. That’s pretty intuitive when visualizing cookies, but it feels a little weirder when talking about probabilities. But those numbers are just the same, divided by the total number of cookies.
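That “multiply, then divide by the sum” recipe fits in a few lines of Python 3. This is only a sketch; the function name `update` and the dictionary layout are my own choices.

```python
def update(priors, likelihoods):
    """Multiply each prior by its likelihood, then rescale so they sum to one."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())   # this sum is p(D)
    return {h: value / total for h, value in unnormalized.items()}

priors = {"CF": 0.5, "CA": 0.5}          # p(H): equal-sized jars
p_vanilla = {"CF": 0.75, "CA": 0.50}     # p(D|H): share of vanilla in each jar
posterior = update(priors, p_vanilla)
print(posterior)
```

Nothing in the function is specific to cookies, so the same three lines work for any MECE set of hypotheses.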

Where it gets really interesting is if you did this cookie draw again. Imagine that the vanilla cookie was put back in its original jar, and then another cookie was randomly drawn from that same jar (I can’t think of a story that would justify this, but work with me here). Here’s the deal. You can actually think of those two draws as one piece of data: Vanilla-Vanilla, Vanilla-Chocolate, Chocolate-Vanilla, or Chocolate-Chocolate. You simply multiply the original table entries together (so the Vanilla-Chocolate entry for CF is 37.5% x 12.5%). Here’s how that plays out:

First Draw     |       Vanilla       |      Chocolate      |        |
Second Draw    | Vanilla | Chocolate | Vanilla | Chocolate | Total  |
Cilantro-Free  | 14.06%  |   4.69%   |  4.69%  |   1.56%   | 25.00% |
Cilantro-Add   |  6.25%  |   6.25%   |  6.25%  |   6.25%   | 25.00% |
Total          | 20.31%  |  10.94%   | 10.94%  |   7.81%   | 50.00% |

Note that this doesn’t sum to 100%. Each entry multiplies two numbers that already include the 50% chance of picking that jar, so the jar’s prior is counted twice and each row totals 50% x 50% = 25%.

Or, dividing everything by the total-total again:

First Draw     |       Vanilla       |      Chocolate      |         |
Second Draw    | Vanilla | Chocolate | Vanilla | Chocolate |  Total  |
Cilantro-Free  | 28.13%  |   9.38%   |  9.38%  |   3.13%   | 50.00%  |
Cilantro-Add   | 12.50%  |  12.50%   | 12.50%  |  12.50%   | 50.00%  |
Total          | 40.63%  |  21.88%   | 21.88%  |  15.63%   | 100.00% |

So if you drew Vanilla-Vanilla from the same jar, your probability of it coming from CF is 28.13% / 40.63% = 69.23%. Note that this is higher than the 60% certainty you had with one draw, and any further straight vanilla draws would increase the chance even further.

But we’ve already tortured this analogy as far as we can. Cookies in a jar are easy to visualize for one draw, but let’s move on to something that lends itself more easily to repeated experimentation.

Solving the Coin Problem (Simplified)

One confusing part about going from the cookie example to this coin problem is that the hypotheses are now numeric, which is convenient for calculation, but  a little confusing. Let’s say that we encounter a weird-shaped coin, and we have literally no idea what the odds are of ‘heads’ versus ‘tails’ (however that’s defined on this coin). Instead of the hypotheses being that you’re drawing from one jar instead of another (both of which exist and have a definite number of cookies of each type), it’s that you’re flipping one of several possible coins (only one of which exists, and has a single attribute “probability of heads”). To make this somewhat more concrete, here’s a table like the one we saw for cookies, but for a set of “probability of heads” possibilities:

0 Heads, 0 Tails
Probability of Heads | Heads | Tails | Total |
0%                   | 0     | 0.2   | 0.2   |
25%                  | 0.05  | 0.15  | 0.2   |
50%                  | 0.1   | 0.1   | 0.2   |
75%                  | 0.15  | 0.05  | 0.2   |
100%                 | 0.2   | 0     | 0.2   |
Total                | 0.5   | 0.5   | 1     |

It’s critical that we keep in mind that the left most column is for hypotheses, not results. So “25%” is just like “Cilantro Free” or “Cilantro Added”, in that it’s the thing we want to figure out, and which we’re going to build confidence in based on data.

Speaking of data, we still have two possible outcomes, now Heads and Tails instead of Vanilla and Chocolate.

We start by looking at our chances if we’ve had 0 flips of the coin. We’re assuming that all hypotheses are equally likely, including the possibility that the coin could never turn up Heads, as well as the possibility that it will always turn up Heads.

Let’s look down the ‘Heads’ column (remember, each of these numbers represents the probability that both the hypothesis and the data are true). We shouldn’t be surprised to see that there’s no chance that our hypothesis that Heads is impossible is true and that our flip also ends up Heads. To look at it the other way, if our flip ends up Heads, we have eliminated this hypothesis forever, even if we get straight Tails after that.

What’s interesting is that the opposite is not true. Toward the bottom of the column, we see that the assumption that the coin will always flip Heads does not necessarily go to 100% if we get a Heads. That would be true even if we got 1,000 Heads in a row (or a million, or a billion). As long as there are other hypotheses that could also explain that string of Heads (even the hypothesis that Heads is 50% likely, and you’ve just gotten exceedingly lucky), you can’t say for sure that the 100% Heads hypothesis is true. This is the mathematical reasoning behind the statement “No amount of data can prove a theory, but one data point can refute it”, or the story that people assumed that all swans were white, until someone saw a single black swan and *poof*, obliterated the theory that had seemed clear for hundreds of years.

Let’s say we got Heads on that first flip. That’s like filtering out the Tails column, leaving just the Heads (note that there’s now a probability of 1 at the bottom of the Heads column, since that’s what actually happened):

1 Heads, 0 Tails (end of first flip)
Probability of Heads | Heads | Tails | Total |
0%                   | 0     |       | 0     |
25%                  | 0.1   |       | 0.1   |
50%                  | 0.2   |       | 0.2   |
75%                  | 0.3   |       | 0.3   |
100%                 | 0.4   |       | 0.4   |
Total                | 1     |       | 1     |

So what if we got a Heads on our first flip, and we wanted to flip again? This is analogous to the end of the Cookie section, where we drew a second cookie from the same jar. Here, we’re getting a second result from the same coin – another output of the same unknown system.

We constructed a somewhat awkward table for the Cookie, with a two layered column header, which showed the probability of every pair of outcomes. For simplicity’s sake, I’m going to ignore the columns where Tails came up first, since we’ve already gotten one Heads:

1 Heads, 0 Tails
Probability of Heads | Heads  | Tails  | Total |
0%                   | 0      | 0      | 0     |
25%                  | 0.0125 | 0.0375 | 0.05  |
50%                  | 0.05   | 0.05   | 0.1   |
75%                  | 0.1125 | 0.0375 | 0.15  |
100%                 | 0.2    | 0      | 0.2   |
Total                | 0.375  | 0.125  | 0.5   |

What’s interesting here is that the equations are exactly the same, but the prior was copy-pasted from the Heads column of the last table. That’s because, when we flipped that first Heads, we learned more about the coin. For example, we learned that it was impossible that it could have a 0% chance of heads. Now I take that prior, multiply it by the probability of getting another Heads for each hypothesis, and get the probability of Heads-Heads for each of them. Note that the Total-Total is just .5, instead of 1. That’s because we’re looking at a sub-universe, in which we got heads for that first flip (which had a probability of .5, given our priors). I didn’t do the re-balancing that we saw in the one-column table before, so each box here is the probability of Heads-Heads, as judged before we’ve flipped anything.

Here’s where we start to see the future – we can just as easily do this a third time:

2 Heads, 0 Tails
Probability of Heads | Heads    | Tails    | Total  |
0%                   | 0        | 0        | 0      |
25%                  | 0.003125 | 0.009375 | 0.0125 |
50%                  | 0.025    | 0.025    | 0.05   |
75%                  | 0.084375 | 0.028125 | 0.1125 |
100%                 | 0.2      | 0        | 0.2    |
Total                | 0.3125   | 0.0625   | 0.375  |

Again, the Heads column here represents Heads-Heads-Heads, as judged from the standpoint of not having flipped the coin at all. We can basically say, before flipping anything, “There is a .025 chance that the coin is 50% likely to be heads, and we get Heads-Heads-Heads”.

Let’s say that we did get Heads-Heads-Heads. Then the Total at the bottom will be 1, and everything else in the column scales up by the same factor. So it would look like this:

3 Heads, 0 Tails
0% 0
25% 0.01
50% 0.08
75% 0.27
100% 0.64

So, if we got Heads-Heads-Heads, we’d say there was a .64 chance that the coin was 100% likely to come up heads.
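The flip-by-flip bookkeeping above can be sketched as a loop in Python 3, where each posterior becomes the next prior. Unlike the tables, this version rescales after every flip rather than once at the end; the final numbers come out the same. The function name `flip_update` is mine.

```python
# Five hypotheses about the coin's probability of Heads, equally likely at first.
hypotheses = [0.0, 0.25, 0.5, 0.75, 1.0]
belief = {h: 0.2 for h in hypotheses}

def flip_update(belief, outcome):
    """Update belief in each hypothesis after one flip ('H' or 'T')."""
    unnormalized = {
        h: p * (h if outcome == "H" else 1 - h) for h, p in belief.items()
    }
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

for outcome in "HHH":
    belief = flip_update(belief, outcome)   # old posterior is the new prior

rounded = {h: round(p, 2) for h, p in belief.items()}
print(rounded)
```

Swapping `"HHH"` for any other string of H and T handles mixed results without building a new table by hand.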

Solving the Coin Problem

This ‘copy-paste the old posterior as the new prior’ thing makes some intuitive sense (each flip is like a new test, starting with the results from the last), but it gets awkward. Particularly if you end up with a mixture of heads and tails. But if you remember the fundamental equation that we’ve got going here:

p(H|D) = p(D|H)p(H)/p(D)

We can connect it with the extremely large set of tools for calculating p(D|H). In this case, it would be the binomial distribution.

Probability of first Heads, as calculated using =BINOMDIST(1,1,HYPOTHESIS,False)*PRIOR

Probability of Heads | Heads | Tails | Total |
0                    | 0     | 0.2   | 0.2   |
0.25                 | 0.05  | 0.15  | 0.2   |
0.5                  | 0.1   | 0.1   | 0.2   |
0.75                 | 0.15  | 0.05  | 0.2   |
1                    | 0.2   | 0     | 0.2   |

This matches the other method. Doing the same for three Heads in three flips (3 successes in 3 trials, times the prior) gives:

0 0
0.25 0.003125
0.5 0.025
0.75 0.084375
1 0.2

Then, normalizing

0% 0
25% 0.01
50% 0.08
75% 0.27
100% 0.64

BAM – exactly what we got the other way

Let’s extend it to 15 Heads and 1 Tails. Because we can!

0% 0
25% 0.0000002081245439
50% 0.004546550037
75% 0.9954532418
100% 0

100%, which had been in the lead, is thrown out by that single Tails. 75% is now the clear favorite.
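Here is a Python 3 sketch of the binomial version of the calculation, standing in for the spreadsheet’s BINOMDIST. The helper names are mine. It reproduces the three-Heads column, and the last table’s numbers come out of 15 Heads and 1 Tails.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k Heads in n flips if p(Heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior(hypotheses, prior, heads, tails):
    """p(H|D) for each hypothesis, starting from the same prior for all."""
    n = heads + tails
    unnormalized = [binom_pmf(heads, n, h) * prior for h in hypotheses]
    total = sum(unnormalized)            # p(D)
    return [u / total for u in unnormalized]

hypotheses = [0.0, 0.25, 0.5, 0.75, 1.0]
three_heads = posterior(hypotheses, 0.2, 3, 0)
mixed = posterior(hypotheses, 0.2, 15, 1)
print([round(p, 2) for p in three_heads])
print(mixed)
```

One call handles any mix of Heads and Tails, with no copy-pasting of posteriors between tables.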


The history of Bayes

Still revising, but I figured that in true Bayesian fashion, I’d update this dynamically as more information came in

We come to understand most things when they are connected to the things that we already know. In an effort to better understand the meaning of Bayes’ theorem relative to frequentist interpretations, I’m reading about the history of its development.

The world was a different place back then. The very concept of ‘cause and effect’ was politically charged in a way that would be difficult to imagine today. The basic issue was that people had, until recently, considered the immediate cause of everything to be God’s will. That’s why you could impress people so much by predicting an eclipse – it wasn’t that you were smart, it’s that you had an inside line to God. (Nate Silver would have done quite well back then). This may help explain how the prediction of Halley’s Comet was huge, because Halley didn’t claim that a divine connection allowed his accurate forecast of a highly improbable event. He laid out all the mathematics, and showed that natural laws were the cause. Suddenly, a whole class of celestial events were no longer the direct result of heavenly handiwork, but of predictable natural laws. The boundaries of God’s will, which had previously enveloped the observable reality in an all-encompassing embrace, began to recede.

Early experimentation focused on situations where one or two causes repeatedly led to an effect. If you did the same experimental steps (cause) again, you’d get pretty much the exact same result (effect). That’s awesome when it works, but there are plenty of situations in which the flow is:

Major Cause (consistently repeated many times) + many minor causes (randomly repeated many times) = Range of Effects (semi-randomly distributed)

Fortunately, those many random minor causes could sometimes be modelled as a probability distribution that was as stable as the gravitational constant. That’s where statistics comes in – creating that model of those minor causes, so that we can come as close as possible to a consistent link between the Major Cause and Range of Effects.

That’s where Bayes comes in, and why his work had political importance – in a world where people were freaking out about the first few steps towards a mechanical universe, he created tools that would allow for a dramatically larger set of effects to be predicted and explained without appeal to a higher power.