Using Bayes’ Theorem on two-way categorical data

Still revising, but I figured that, in true Bayesian fashion, I’d update this dynamically as more information came in.

It’s said that the best way to understand something is to teach it, and the huge number of explanations of Bayes’ Theorem suggests that many (like me!) have struggled to learn it. Here is my short description of the approach that ultimately led to some clarity for me:

Let’s do the “draw a cookie from the jar” example from Allen Downey’s Think Bayes, with a bit more of a plausible backstory (one cannot be an actor, or act like a Bayesian, without understanding their character’s motivation). I made two 100-cookie batches, one with cilantro (Bowl 1) and one without (Bowl 2). Because my cilantro-hating friend prefers vanilla cookies, I made 75 of those (plus 25 chocolate to round out the batch) and put them in the cilantro-free bowl. I made 50 vanilla and 50 chocolate for the cilantro-added bowl.

Everyone comes over for my weird cookie party, but I forget to tell my cilantro-hating friend that they should choose from only one of the bowls. <EDIT – have the bowls be mixed together, so that each cookie has an equal probability. Doesn’t change the problem, but removes need for an assumption of equal bowl probability> Being entirely too trusting, they just grab a cookie randomly from one of the bowls in the kitchen and walk back to the living room. I stop them and say “Wait! Do you know which bowl that came out of?” and they say “Oh, no, I wasn’t paying attention, but if you made different numbers of vanilla, which this cookie is, that should at least give us a probability of whether it came from the cilantro bowl. It wouldn’t be catastrophic to accidentally take a cilantro bite, so I’ll go for it if my chance of it being cilantro-free is greater than 55%.”


Here’s what the situation would look like as a table

               | Vanilla | Chocolate | Total |
Cilantro-free  |   75    |    25     |  100  |
Cilantro-add   |   50    |    50     |  100  |
Total          |   125   |    75     |  200  |
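If you’d rather let a computer do the adding, here is a minimal sketch of the same table in Python. The dictionary layout and the names (counts, row_totals, col_totals) are my own choices for illustration, not anything from Think Bayes.

```python
# Cookie counts from the table: rows are bowls, columns are flavours.
counts = {
    "cilantro_free": {"vanilla": 75, "chocolate": 25},
    "cilantro_added": {"vanilla": 50, "chocolate": 50},
}

# Row totals: number of cookies in each bowl.
row_totals = {bowl: sum(flavours.values()) for bowl, flavours in counts.items()}

# Column totals: number of cookies of each flavour, summed across both bowls.
col_totals = {}
for flavours in counts.values():
    for flavour, n in flavours.items():
        col_totals[flavour] = col_totals.get(flavour, 0) + n

# Grand total: every cookie at the party.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The row and column totals are the “Total” margins of the table, and they are the only extra ingredients the rest of the post needs.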

Now, we already know that they had a 75% chance of getting a vanilla cookie if they chose the cilantro-free (CF) bowl. But that’s NOT the question at hand. The question at hand is related, but different: what is the chance that they chose the CF bowl if they got a vanilla cookie? Let’s watch a replay:

We know: Chance they have a vanilla cookie if cookie from CF bowl

We want:  Chance cookie from CF bowl if they have a vanilla cookie

The reason that I stress this is that the conventional method of “Null Hypothesis Significance Testing” (the whole concept of “the null hypothesis was rejected at p<.05” that you see in most papers) is analogous to the first statement, but we almost always want to make decisions based on the value of the second statement. To be even more direct: Most statistical analysis that we see leaves us with a number (p) that is one step short of what we can make decisions on.

Fortunately, there is an equation that can take us from what we know to what we actually want. Unfortunately, it requires additional variables to solve. In this case, we have the additional information. In other cases, we would have to estimate those values, without any method of checking our estimate (until we get more data). Painfully, after all of this work to design a clean experiment, accurately measure results, and methodically run the numbers, our very last step requires us to irrevocably taint our objective results with an estimate that is picked out of the air. It’s so frustrating, and you can start to understand why people try to use the first number, whose definition is so close to sounding like what they need, but that’s just how the math works out. Let me know if you find an alternative.

Let’s try it on this case. Here’s the simple derivation of Bayes’ Theorem:

p(A and B) = p(B and A)

p(A if B) x p(B) = p(B if A) x p(A)

Therefore: p(A if B) = p(B if A) x p(A) / p(B)
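To convince myself that the first two lines of the derivation really hold, here is a small sketch that checks them on the cookie counts, taking “cookie from CF bowl” and “cookie is vanilla” as the two events. The variable names are mine, and the check assumes every one of the 200 cookies is equally likely to be picked.

```python
from fractions import Fraction

# Counts read off the cookie table.
n_total = 200
n_cf = 100             # cookies in the cilantro-free bowl
n_vanilla = 125        # vanilla cookies across both bowls
n_cf_and_vanilla = 75  # cookies that are both cilantro-free and vanilla

# Marginal and joint probabilities, assuming every cookie is equally likely.
p_cf = Fraction(n_cf, n_total)
p_vanilla = Fraction(n_vanilla, n_total)
p_joint = Fraction(n_cf_and_vanilla, n_total)

# Conditional probabilities, each restricted to its own row or column.
p_cf_if_vanilla = Fraction(n_cf_and_vanilla, n_vanilla)
p_vanilla_if_cf = Fraction(n_cf_and_vanilla, n_cf)

# Both ways of factoring the joint probability give the same number.
joint_via_vanilla = p_cf_if_vanilla * p_vanilla
joint_via_cf = p_vanilla_if_cf * p_cf

print(p_joint, joint_via_vanilla, joint_via_cf)
```

Using exact fractions avoids any floating-point fuzz, so the three printed values can be compared directly.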

Remembering where we left off:

We know: Chance they have a vanilla cookie if cookie from CF bowl

We want:  Chance cookie from CF bowl if they have a vanilla cookie

If A = cookie from CF bowl

and B = they have a vanilla cookie

Then p(cookie from CF bowl if they have a vanilla cookie) = p(they have a vanilla cookie if cookie from CF bowl) x p(cookie from CF bowl) / p(they have a vanilla cookie)

Note that the first term to the right of the equals sign is ‘we know’, and the final result is ‘we want’. Unfortunately, there are those two other unknowns to calculate, which is where a bit of subjectivity comes in:

p(cookie from CF bowl) means “The overall chance that any cookie (vanilla or chocolate) would be drawn from the CF bowl”. Since there are two bowls, and we don’t know of any reason one would be picked over the other, we assume this is 50%. But this is an assumption, and many real-life problems would give alternatives that are clearly not 50/50, without giving clear guidance on whether they should be considered 45/55, 15/85 or .015/99.985. Note that, if you assume each cookie was equally likely to be selected, this number could be calculated from the total number of cookies in each bowl in the far-right column (i.e. 100 of the 200 total cookies are in the CF bowl).

p(they have a vanilla cookie) means “The overall chance that any cookie would be vanilla”. In this case, simply look at the total number of cookies of each type (the totals on the bottom row of the table) and see that vanilla makes up 125/200 of the total. (NOTE: does this change if the bowls are not equally likely to be selected?)
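On the NOTE above: as far as I can tell, yes, it does change. Here is a minimal sketch, assuming the friend first picks a bowl with some probability and then grabs a cookie uniformly from within that bowl. The function name is made up for illustration, and the priors tried are the 50/50 default plus the 45/55, 15/85 and .015/99.985 splits mentioned earlier.

```python
def vanilla_and_posterior(prior_cf, p_vanilla_if_cf=0.75, p_vanilla_if_ca=0.50):
    """Return (p(vanilla), p(CF bowl if vanilla)) for a given prior on the CF bowl.

    Assumes a bowl is chosen first (CF with probability prior_cf), then a
    cookie is drawn uniformly from that bowl.
    """
    prior_ca = 1.0 - prior_cf
    # Weight each bowl's vanilla share by how likely that bowl is to be chosen.
    p_vanilla = p_vanilla_if_cf * prior_cf + p_vanilla_if_ca * prior_ca
    # Bayes' Theorem with that p(vanilla) as the denominator.
    posterior_cf = p_vanilla_if_cf * prior_cf / p_vanilla
    return p_vanilla, posterior_cf


for prior_cf in (0.50, 0.45, 0.15, 0.00015):
    p_vanilla, posterior_cf = vanilla_and_posterior(prior_cf)
    print(f"prior={prior_cf:.5f}  p(vanilla)={p_vanilla:.5f}  posterior={posterior_cf:.5f}")
```

With a 50/50 prior this reproduces the 125/200 from the table; with any other prior, p(vanilla) can no longer be read off the bottom row, because the bottom row implicitly weights the bowls by their cookie counts.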

Once you’ve gotten over the implications, the final calculation is easy:

.75 * (100/200) / (125/200) = .6

So there is a 60% chance that the cookie is cilantro-free, which clears my friend’s 55% threshold.
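As a sanity check on the .6, here is a rough simulation sketch. It assumes the mixed-bowls version from my EDIT (all 200 cookies equally likely), and the function name, seed, and number of draws are arbitrary choices of mine.

```python
import random


def simulate_cf_share_among_vanilla(n_draws=100_000, seed=0):
    """Estimate p(CF bowl if vanilla) by repeatedly drawing from all 200 cookies."""
    rng = random.Random(seed)
    # Every cookie tagged with (bowl, flavour); CA = cilantro-added.
    cookies = (
        [("CF", "vanilla")] * 75
        + [("CF", "chocolate")] * 25
        + [("CA", "vanilla")] * 50
        + [("CA", "chocolate")] * 50
    )
    vanilla_draws = 0
    vanilla_from_cf = 0
    for _ in range(n_draws):
        bowl, flavour = rng.choice(cookies)
        # Only vanilla draws count, since that is what my friend is holding.
        if flavour == "vanilla":
            vanilla_draws += 1
            if bowl == "CF":
                vanilla_from_cf += 1
    return vanilla_from_cf / vanilla_draws


estimate = simulate_cf_share_among_vanilla()
print(round(estimate, 3))
```

The estimate should land close to .6, and the important design choice is the filter: chocolate draws are thrown away rather than counted as misses, which is exactly what “if they have a vanilla cookie” means.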

It’s also interesting to see that the .75 could be calculated in much the same way as the other two variables (percentage of the total in their row or column), along the top row. Specifically, for “within the Cilantro-free row, what portion of cookies are vanilla?”, it’s simply the intersection of CF and Vanilla, divided by the total of the row.

               | Vanilla | Chocolate | Total |
Cilantro-free  |   75    |    25     |  100  |
Cilantro-add   |   50    |    50     |  100  |
Total          |   125   |    75     |  200  |

Let’s look at all the factors again in that light:

p(they have a vanilla cookie if cookie from CF bowl):

  • Numerator: # in the intersection of CF and Vanilla
  • Denominator: # of CF

p(cookie from CF bowl)

  • Numerator: # of CF
  • Denominator: # of total cookies

p(they have a vanilla cookie)

  • Numerator: # of Vanilla
  • Denominator: # of total cookies

This is all very symmetric with the definition of our result:

p(cookie from CF bowl if they have a vanilla cookie)

  • Numerator: # in the intersection of CF and Vanilla
  • Denominator: # of Vanilla
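The symmetry can be checked mechanically. In this short sketch (variable names are mine), the # of CF and the # of total cookies cancel out of the Bayes calculation, leaving just the intersection divided by the # of Vanilla.

```python
from fractions import Fraction

# The four counts used in the numerators and denominators above.
cf_and_vanilla, cf, vanilla, total = 75, 100, 125, 200

likelihood = Fraction(cf_and_vanilla, cf)  # p(vanilla if CF)
prior = Fraction(cf, total)                # p(CF)
evidence = Fraction(vanilla, total)        # p(vanilla)

# Bayes' Theorem: the cf and total counts cancel between the three factors.
via_bayes = likelihood * prior / evidence

# The same quantity read straight off the Vanilla column of the table.
direct = Fraction(cf_and_vanilla, vanilla)

print(via_bayes, direct)  # prints: 3/5 3/5
```

This shortcut only works because the table hands us every count; when the prior has to be estimated, there is no column to read the answer from.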


Inspiration: ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf