Assorted Additional Resources

  • PatientsLikeMe: Organized around medical diagnoses, which makes a lot of sense, but can be a little difficult if the problem is a lack of firm diagnosis
  • EverlyWell: I got so excited when I read what they do, but some background research (always do the background research!) turned up some very credible concerns. Stat News is especially reliable, so I’ll link to that, but there are others: https://www.statnews.com/2018/01/23/everlywell-food-sensitivity-test/
  • cochrane.org: Considered the gold standard in aggregating and summarizing medical research. They’re pretty conservative, which is appropriate in a field that’s inclined towards hype, so it should be your first but not necessarily last stop if you’re trying to find relevant scientific articles
  • examine.com: Another aggregation site. Make sure to actually look at the studies, because you might realize that there’s a twist which gets a little lost in the broad categories. For example, you might see “4 studies suggest that X reduces Y”, but upon inspection, 3 of the studies were on pregnant women and 1 was related to surgical complications. If that doesn’t describe you, downgrade your confidence appropriately
  • Doctor/Nurse On Call through your insurance: I’m not sure how prevalent this is, or how useful, but it’s often free (or 1/10th the cost of even a basic office visit), so it’s worth looking into. They often don’t give you a definitive diagnosis, but they can allay fears and/or suggest some low-impact lifestyle changes that might help

Is Quantified Self Worth It? (Post #1)

I’m inclined to be a huge fan of Quantified Self. Taking entrepreneurial initiative, using numbers, even Meetup groups – it’s right up my alley! I’ve been doing it for years, including many months of daily 30 second recording, a few months of tracking every meal, and some self-designed Randomized Double Blind Controlled Trials (No, that’s not a typo. Yes, double blinding by yourself is tough). I’ve also seen several people try personal things at various stages of complexity, talked to people at the meetup groups, and watched many of the highlighted QS videos.

What I’ve taken away from it is that the human body is complex. Moreover, a human life is complex. So you can try to say “I ate carrots from days 1 to 10 and felt terrible, then avoided them from days 11 to 20 and felt great”, and infer that you’re allergic to carrots. But think about all of the things that could go wrong in this minimal example:

  • Maybe some nutrient that’s relatively high in carrots was accumulating in those first 10 days, and was actually the only reason you felt so good in the last 10
  • Maybe you ate celery instead of carrots in the last 10 days, and that’s actually the only factor that matters
  • 10-day cycles mean that one of your samples probably had two weekends in it, while the other had just one. Maybe the difference in averages is simply because you feel different on weekends
  • Maybe there was a party at work, and you ate something you don’t usually, which contained the actual allergen
  • Maybe it has nothing to do with allergies, and responds to the weather
  • Maybe it has nothing to do with anything external, and relates to hormonal cycles (which also happen in men)
  • Maybe your definition of “felt terrible” shifted as you acclimated, or relative to other events in your life during the experiment

There’s also the issue that we often can’t directly measure the thing we’re interested in. Energy levels, focus, even “pain” are hard to put a consistent number on. If the true effect is small, it can easily be swamped in the difference between your definition of a 6 and 7 on whatever scale over time.

Finally, there are a near infinite number of interventions you can try, and just randomly guess-and-checking them is going to make for slow progress. There are some where your priors can be high enough to make them worthwhile (lactose intolerance is a good example), but once you get down into cutting out trace ingredients, or unconventional sleep routines, or anything like that, you’re looking for a needle in a haystack. Also: https://www.xkcd.com/882/

Also, very few doctors know what to do with your data. It makes sense, since few of their patients are arriving with anything like it, and those who are probably bring totally different formats to address different problems.

Still, you’re going to live your life today anyway, so you might as well take some quick notes about how it went. At the very least, they might form the baseline if you decide to try a more significant intervention down the road.

Should I go to a Nutritionist? (Post #1)

One way to narrow down the list of medical professionals that I could go to is by looking at what my insurance will cover. This is pretty compelling, since it offers a chance that my out of pocket expenses will be lower. It’s a perverse indication about the healthcare system that this is just a chance, between random denials of coverage, and things like 10% co-insurance on services that cost 15x as much as they should.

But, when I look at “Find Care” under my Aetna website (which is quite a bit better designed than I would have expected, so check it out if you happen to have Aetna), there are 5 categories under “Alternative”:

  1. Acupuncture: Not relevant
  2. Chiropractor: Not relevant
  3. Massage Therapist: Wish I’d checked here before paying for one last year!
  4. Naturopathy: I’m surprised that Aetna is covering this. For all the (often justified) flak that insurance companies get, there’s something to be said for having a voice in the system that says “I really need to see some statistical proof of effectiveness before subsidizing this treatment”. My feelings about naturopathy are here.
  5. Dietician: That’s what we’ll be digging into today

The problem is that the list gives me dozens of choices, but no relevant information to select on. They’re all listed simply as ‘registered dietician’, with no indication of whether they specialize in weight loss (which is not my goal) or something else. And there’s even less information about what I’m really looking for, whether they would work with me over time as an active partner. That’s the sort of thing that would come out in patient reviews, but less than 10% of providers have any, and they’re incredibly focused on the patient experience, not the clinical approach or outcomes. Some examples:

  • Not only do they never postpone my appointment, they always try to get me in as fast as possible.
  • They had plenty of staff members to help me whenever I needed assistance
  • They’ve never used foul language, which bothered me at some other places I’ve been to

This isn’t going to help me find the relatively rare approach that I’m looking for. The next step is going to see if any of them have blogs or other outlets where they discuss their philosophy and see if they think about things besides weight loss, regularly consider the medical literature, and generally appear to think critically and flexibly.

UPDATE: I’ve done that for everyone within a 3mi radius. People are pretty non-specific, saying things like “helping others develop a healthy relationship with food and reach their health and wellness goals” (LinkedIn). While I’d love to find someone who’s tweeting an in-depth literature review around a relevant set of symptoms, the realistic signal that I’m looking for is that they self-identify as focusing on scientific and quantitative approaches. Barring that, just some specificity would be great. Here’s a good example: “My expertise & passion is in the prevention and treatment of diabetes.” (LinkedIn) Not a fit for me, but I much prefer coming to a crisp ‘yes/no’ than the endless ‘meh’ that the previous example provokes!

One of them had a full website, though it was last updated in 2015. She referenced going to a conference at Harvard, which is the best signal that I’ve found so far. I wish that I had something better to go on, but that puts her well ahead of the pack, so I’ve reached out.

Should you go to a Naturopath?

This ended up being shorter than originally envisioned. My answer is: No, unless a specific course of treatment has proven effective for someone whose judgement you really trust, is extremely low cost/risk, and is not replacing more widely accepted treatment.

What is a naturopath?

Broadly, anyone who defines themselves as one. 17 US states also have a licensing system, and there are a handful of schools that are accredited to teach it.

What’s worrisome?

These schools teach some things which I am willing to have an open mind about, like massage therapy and dietary advice, but they also teach things which are widely discredited, like homeopathy. Worse, they are linked to the anti-vaccine movement (more here). The schools and professional bodies have also reportedly behaved in ways that suggest less of a “Let’s rationally investigate and debate these alternative approaches which could have value” and more of a “Let’s squash the non-believers” mentality. I came in willing to give them the benefit of the doubt about being honest, if sometimes credulous, truth-seekers. More details here.

Sermon Review: Truth and Empathy [2018.04.22]

My wife has been encouraging me to find ways to engage with the FUUSN community, as we’re relatively new to checking it out. Most of the things which appear to compel longtime worshipers (regardless of specific denomination/religion) have never resonated with me, but Samuel Foster’s rookie sermon provoked a bunch of thoughts. And since writing a sermon review can’t be considered weirder than putting Barney in your TensorFlow installation guide, I figured this was as good a first step as any towards experimenting with building a relationship with the church.

Theme 1: Coming To Agreement

Sam started with a recurring dream about finding his sister in a big city. She was also looking for him, but they had no way of communicating, so each needed to guess where the other was likely to guess they would go. I almost yelled “Aha, a classic Schelling Point!” but I’m still not clear whether this is one of those congregations where people yell about spirits or game-theoretic concepts.

The sermon deftly wove in the morning’s children’s story, about the blind men coming to different conclusions based on feeling different parts of an elephant (summary here). This had a twist which I hadn’t heard before: they argued angrily until the Prince asserted his comprehensive understanding of its nature, and gave them a pleasant elephant ride home. This segued into the next theme:

Theme 2: Objective Reality as Power

The moment I decided I loved the sermon was when Sam pulled apart the layers of the children’s story and questioned the implied superiority of the Prince’s viewpoint, even pointing out the political power dynamics both within the story and around its creation: “After all, storytellers must also be paid.” Fortunately, he was respectful enough to do so after the children left, so that they did not become upset or postmodernists.

This is a good illustration of his third theme, which I’m going to summarize with a word that I don’t think he actually used:

Theme 3: Empathy

The final major touchpoint was a poem which had been read, about a gas station and its visitor’s transition from judgement to humanization of the dirty grease-covered owners. That transition, based on no new objective information, started with the recognition of details that suggested home. In particular, a home that somebody loved and cared for, just as much as the visitor presumably cared about her own.

Throughout, there were references to the church community. Like the elephant, it’s bigger than any one of us can touch. Like the urban search, the ‘right answer’ depends entirely on what everyone agrees on. And like the gas station, we’re going to have different initial reactions to the same space (physical, social, political, etc.), but we need to at least consider each other’s viewpoints when passing judgement.

In fact, particularly in the elephant segment, Sam celebrates the diversity of truths that the blind men initially came to, and suggests that their abandonment in favor of the Prince’s was a loss. I’d argue that the acceptance and rejection of divergent truths needs a more nuanced treatment.

My Thoughts

In the elephant story, it’s not clear what the blind men plan to do with their understanding of the elephant. Are they trying to decide whether to be scared of it? Because then the one who felt the trunk, and declared it like a snake, would sound awfully alarmist to the one who felt its rope-like tail. If they planned on cooling off with it, the one who felt its ears could never coordinate with the one who pushed on its side. In fact, the very existence of differences caused them to get into a loud argument that attracted the Prince. It was only when he aligned them on a completely different model of elephants (as things you could comfortably ride) that they got any benefit from the whole exercise.

Many situations are like this. There’s no moral reason to drive on the left or right side of the road, but it’s very important that we all agree. The book Sapiens argues that the primary advantage that humans had over the apes was our ability to create ‘myths’, which started with specific variants of “God will kill you if you are not pro-social” and have evolved into “You can trust that this dollar bill will have value tomorrow” and “voting is a thing you should probably do”. They allow us to coordinate at a level that would be unthinkable in an I-only-trust-my-family-and-friends culture. And if you think that organization based strictly on mutually held ideas is unstable, try changing one.

There are certainly situations where a diversity of opinions is valuable. The average tourist’s obsession with the Mona Lisa is probably more of a headache than benefit to the Louvre. Science is driven forward when one person thinks “Maybe things work a little differently” and proves it. But what’s critical is how they act in the meantime.

Specifically, respecting other people’s dignity. It’s easy to see how telling someone that one of their beliefs is wrong could disrespect their dignity. Worse still is not telling them, and simply disregarding both their opinions (which might be invalid, if based on flawed assumptions) and their needs (which are quite likely still valid!).

There are actually plenty of examples of the opposite, where everyone has a similar mental model, but some are being massively victimized anyway. The Stanford Prison Experiment is a case in point. The participants knew that their assignment into prisoner and guard roles was totally random, and the prisoners knew that they were getting the short end of the stick. They certainly disagreed on things, but largely did so within the shared model of how prisons worked. There didn’t need to be a breakdown in either shared facts or beliefs for one group to be systematically deprived of dignity.

So I guess what I’m saying is that, in general, we should be striving to converge our understanding of reality. We should also acknowledge that we’re far from it at the moment, and need to have meta agreements for managing our disagreements in the meantime (voting is a common one, though not without drawbacks). But, critically, it should all be in the context of trying to lift the common denominator, and for the love of god prevent our children from becoming postmodernists.

Lessons from installing TensorFlow 1.7 for NVIDIA GPU on a Samsung Odyssey running Ubuntu 17.10

I’ve never been so jubilant to see custard apple (score = 0.00147) in my terminal window. It meant that I had finally classified an image using TensorFlow on my brand new GPU. Despite my confidence as I sat down with the visually appealing official guide, I found the process to be time consuming and frustrating. Based on the number and diversity of issues I saw others having as I Googled (actually DDGed) around, it looks like I’m not alone. As the beneficiary of their hard won experience, I wanted to contribute some of the things that I learned in the process.

I’m going to experiment a bit with the structure, alternating between abstract and specific thoughts. The value of specific thoughts is intuitive, but worth illuminating: None of this article has any value if it doesn’t help you, the reader, do something differently. Not “change your viewpoint” or “deepen your understanding”, but literally tap a different sequence of keys on your keyboard than you would have otherwise. Directly saying “Type this, not that” is the shortest path to this goal, and shorter paths are less likely to be waylaid.

Unfortunately, as Barney the Purple Dinosaur tried to warn us, we’re all unique in our own way (You’re special!). This is mostly a good thing, but it can make it difficult to share advice. If nothing else, simply copying my .bash_history would start to fail as soon as you got to paths starting with `/home/mritter/`. You’re smart enough to trivially take that, abstract it up to “He means his home directory”, and granularize it back to `/home/jsmith/` or whatever. You could do this, but there’s no reason I should make all of my readers perform that same first step, particularly for less obvious situations.

Specific: Ensure the right graphics driver is being used by blacklisting the default

Even after going through the installation steps, my Samsung Odyssey laptop wasn’t recognizing the existence of my GPU. The final step to fixing this was editing my /etc/modprobe.d/blacklist-nouveau.conf file to contain:

blacklist nouveau
options nouveau modeset=0

then running sudo update-initramfs -u

and restarting. I could then confirm the recognition of the GPU with `lshw -C video`. I tried other things beforehand (the probably-relevant parts of which will be detailed below), but I can’t know whether they were critical to this final bug or totally separate.

Abstract: The state space with positive outcomes was much smaller than I expected

Because the guides that I was reading were a few months old (which is years in internet time, and centuries in Deep Learning time), I assumed that I should just use the latest version of each suggested library or driver. This assumption has served me well for dozens of previous installation processes, but it failed this time. Perhaps I should have been more suspicious because of the unusual cross-corporate nature of the situation, or maybe you just win some and lose some. I won’t get into all of the other instances where a minor deviation from the advice caused cascading issues, but it was an important reminder that “extremely similar” configurations are not always good enough.

Specific: Be careful with CUDA 9.1!

The first major issue that I identified after trying to follow this comprehensive guide is that I’d installed CUDA 9.1 instead of 9.0. I assumed that since it wasn’t a major version number change, it would be fully backwards compatible. To its credit, the official documentation mentions the correct version number, but some of the commands it suggests default to the more recent version of various libraries, which have presumably changed since it was published. This short video does a good job of outlining the small changes you need to make for it to work.

Note that you can get away with 9.1 if you build TensorFlow from source. But that sounded like opening up a shipping container of boxes of cans of worms, so I didn’t go down that route.

General: This stuff is still bleeding edge

I’ve always had a romantic notion of what it would have been like to work with steam engines during the Victorian age, or airplanes when they were new. New records being set every day! Limitless opportunity! …And frustrating setbacks caused by obscure parts!

The Wright Brothers, for example, had attempted a flight before the one which went down in history. It took two whole days to repair the ‘minor’ damage that the machine suffered, so that they could make their successful attempt. Their inspiration, a world famous glider pilot named Lilienthal, had (over the course of his 5 years in the spotlight) spent just 5 hours in the air. About half a workday actually doing the thing he was world famous for, the rest of the time handling logistics.

Good user experience fades into the background, and it’s easy to forget how hard and complex things are. When you’re at the bleeding edge, there’s nobody in charge of making your experience pleasant, or guaranteeing that what you want to do is even possible. When you’re lucky enough to find a guide, it usually assumes that you have considerable experience, which will let you fill in the gaps. For example, when was the last time someone digressed from their Stack Overflow answer to clarify “sudo means that you have to type your admin password”? That’s just a common denominator on that website, as are hundreds of other little bits of knowledge. Somehow our computing culture has come together on some de facto curriculum that lets most people understand each other, most of the time. But on the bleeding edge, when you’re talking about graphics drivers and rapidly updating libraries, those gaps can become impossible to bridge.

Specific: These commands are your friend

sudo apt-get purge <package> # Completely remove an installed system package, including drivers (sudo dpkg --purge works too)
apt list --installed | grep <package> # Search through installed packages (make sure they're all the right version!)
sudo dpkg -l | grep "cuda" # Search through installed packages (make sure they're all the right version!)
lshw -C video # See whether the GPU is visible to the machine
lsmod | grep nvidia # See list of relevant drivers (Make sure none are of the wrong version)
cat /proc/driver/nvidia/version # See Driver information
/usr/lib/nvidia-384/bin/nvidia-smi # See GPU details

The hardest part of the project was not doing things, but UNdoing them. Followed closely by knowing whether I had to undo them in the first place.

General: Learn to quickly Create, Read, Update and Delete in the system you’re debugging

Because I was largely operating in a space that I’m unfamiliar with, I didn’t know how to verify that I was on track until the end of the installation process. That would not have been as bad if the errors I got there had been more specific, but I was left with a diagnosis that boiled down to “One (or more!) of the 10 steps that you took is interacting with one (or more!) of your unknowable number of system configurations incorrectly”. That, combined with my lack of fluency with the basic CRUD operations around drivers, made debugging by elimination extremely slow.

Working through it with a friend who both had this background, plus a running system to validate against, was critical for getting mine set up. THANK YOU STAN!!!

Specific: CUDNN doesn’t usually seem to be the root of issues, and CUDA versions often are

The download page is mercifully specific about which CUDA version each CUDNN option requires. I didn’t have to re-install it after moving a bunch of other things around – it really is just a few files, which you can see with `ls /usr/local/cuda-9.0/lib64/libcudnn*`.

Make sure that you’ve got the right CUDA version (denoted by the three-digit number) on your PATH, and its parent directory on your LD_LIBRARY_PATH (for example, /usr/lib/nvidia-384/bin).
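As a sketch, that might look like the lines below in your ~/.bashrc (the /usr/local/cuda-9.0 locations are an assumption based on the default installer paths; check where yours actually landed before copying):

```shell
# Hypothetical install locations -- verify yours before copying
export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH
```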

That’s what I’ve got for you. If you’re trying to get TensorFlow set up, I wish you the best of luck – it’s definitely possible (as long as you actually have a GPU!), and actually doesn’t take too long if you’re lucky enough to find a guide that aligns with your needs perfectly 🙂 If you run into issues, I definitely recommend finding someone who’s been through it recently. In this and so many things, there’s a lot to be said for good friends! (I’ll take this opportunity to thank Stan again – I couldn’t have done it without the 150+ chat messages that we shared while debugging everything.)

Choosing a Medical Provider – Overview (Post #1)

I’ve decided that I need to find a new doctor.

The problem isn’t with my current doctor, who I like quite a bit, but with the entire system around her. All of the ancillary stuff, like getting a referral, moving my records, or getting significant time from specialists, has been disastrous. Because moving medical records is so difficult, I figure that I need to switch sooner rather than later: every visit with her is further investment into a system that I don’t want to be in long term.

My first option is finding another MD in another medical system. That’s absolutely on the table, but there are significant drawbacks: 1) It will probably take me months to get the first appointment, and then weeks for a follow-up. It’s like getting my medical advice by snail mail from England. 2) It costs at least $400 per basic appointment. Sometimes I pay, sometimes my insurance pays, but it’s certainly insane. 3) It takes at least 2 hours of my time for 15 minutes of actual doctor time.

To be clear, I believe in medical science, in the sense that I don’t think there’s any other process that reliably produces better understanding and advice on human ailments. On the other hand, doctors are not doing a complete literature review before each diagnosis. They’re listening to me for about 7 minutes, glancing through my medical record, and coming up with their best guess on the spot. They don’t submit a case study document with citations for peer review, and they don’t necessarily follow up after a week for detailed feedback on their treatment’s impact. Medical school was absolutely based on science, but the actual clinical process simply doesn’t allow time for a hypothesis-experimentation-analysis cycle.

So I have the highest respect for doctors and medical researchers. The problem is in the business. With the exception of some specific chronic diseases like diabetes, the model is designed for point diagnosis, not to work with patients over time (especially not with their active participation). That is why I started considering (emphasis on considering) other options. For what it’s worth, I’m hardly alone: about 1/3 of American adults were actively using a Complementary/Alternative Medicine technique according to the NIH (though it’s worth noting that they’re grouping together things like “Deep Breathing” and “Meditation” with “Chiropractics”).

One of the attractions of CAM is that many of the interventions are extremely low risk, and low cost. I already meditate, which almost certainly isn’t hurting anything, and has cost less than $50 for books and an app. In the last year, I’ve twice gotten therapeutic massages in response to acute muscle pain. The first time, it led to an immediate and lasting improvement. The second time, it didn’t work, but felt nice. ¯\_(ツ)_/¯ The benefit of the first absolutely justified the second. (1hr/$90 each)

So I’m doing some research into nutritionists, naturopaths (UPDATE: no), and anything else that people might find valuable. Email me if you’ve got ideas or personal experiences in this space!

UPDATE: I’m getting particularly interested in finding people who are more like guides than experts. A loose example would be a fitness guide, who might give their client some suggestions, some things to read, and check back in on a weekly basis. At the outset, they can’t know what will work, but by pairing objective research with the client’s ongoing results, they help iterate towards the right solution. Most of these people are focused on weight loss, though; I’m not sure how to find anyone who helps with anything else!

UPDATE 2: I’ve looked into nutritionists, with mixed results. But I’ve scheduled an appointment with the one who seemed most scientifically grounded.

How to make a Python Histogram (Using Python to Excel, part 1)

Excel is the perfect tool for many applications – the problem is that it’s used for about 5 billion more on top of those.

Fortunately, I’ve found that many things that are complex to accomplish in Excel are extremely simple in Python. More importantly, there’s no copy/pasting of data, or unlabeled cells with quick calculations. Both of these ad hoc methods invariably leave me confused when I re-open the workbook to update my numbers for next month’s report. In this post, I’ll walk through many of the basic functions you’d use Excel for, and show that they’re just as simple in Python.

Let me walk you through an example analysis. We’ll be looking at nutritional facts about a few different breakfast cereals:

Download and import data

This data is drawn from a site that is a wonderful source of examples, but makes the puzzling decision to keep data in a tab-separated format, as opposed to the comma-separated standard. For that reason, I’m attaching a .csv of the data here to save you a bit of reformatting work.

Original data: http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html

Suggested Download: http://www.matthewritter.net/?attachment_id=256

In Excel, importing the data is a three step process:

1. Find Data > From Text:

[Screenshot: Data > From Text]

2. Find the file, then choose “Delimited” and hit “Next”

[Screenshot: Text Import Wizard – Step 1 of 3]

3. Make sure you change the delimiter to “Comma” (what else it imagines .csv to stand for is beyond me). Then you can hit ‘finish’

[Screenshot: Text Import Wizard – Step 2 of 3]

In Python, you just use ‘read_csv’ (I’ll cover general setup in a later post, but the basics are that you should start with the Anaconda distribution)

[Screenshot: reading the data with read_csv in pandas]
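In case the screenshot is hard to read, the call looks roughly like this. The inline CSV text is a hypothetical stand-in for the downloaded file so the snippet runs on its own; with the real file you’d just pass the filename:

```python
import io
import pandas as pd

# Hypothetical stand-in rows for the cereals file; with the real
# download you'd simply write: data = pd.read_csv("cereals.csv")
csv_text = """name,mfr,calories,cups,shelf,rating
Corn Flakes,K,100,1.0,1,45.86
Special K,K,110,1.0,1,53.13
Cheerios,G,110,1.25,1,50.76
"""
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)  # (3, 6)
```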

Filter the data

To keep things to a manageable size, we’ll filter to the Kellogg’s cereals

This is accomplished with a simple dropdown in Excel:

[Screenshot: Excel filter dropdown]

Python’s approach requires a bit of explanation. As you can see, ‘data’ (which is basically the same as a worksheet in Excel) is being filtered by the data.mfr == ‘K’ in the square brackets.

This simply says “Take all of the rows in this sheet where the mfr column exactly equals (that’s what the == means!) the letter K”. Then, we simply save it as ‘kelloggs’ so we can remember (just like naming an Excel sheet).

I’m using a useful function called .head(), which lets you see the first few rows of a sheet, so that you don’t get overwhelmed with output

[Screenshot: filtering the data in pandas]
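In code, that filter looks roughly like this (the tiny data frame here is a hypothetical stand-in for the full cereals sheet):

```python
import pandas as pd

# Hypothetical stand-in for the cereals data
data = pd.DataFrame({
    "name": ["Corn Flakes", "Special K", "Cheerios"],
    "mfr": ["K", "K", "G"],
    "calories": [100, 110, 110],
})

# Keep only the rows where the mfr column exactly equals 'K'
kelloggs = data[data.mfr == "K"]
print(len(kelloggs))  # 2
```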

Sort the data

The final step of this introduction is sorting the data.

In Excel, this is accomplished with the same dropdown as before:

[Screenshot: Excel sort dropdown]

In python, you call the aptly named ‘sort’ function, and tell it what to sort on

[Screenshot: sorting the data in pandas]
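A sketch of the sort call, with made-up rows. Note that the .sort() method from the era of these screenshots was removed in later pandas releases; current versions use .sort_values():

```python
import pandas as pd

# Hypothetical stand-in rows
data = pd.DataFrame({
    "name": ["Corn Flakes", "Special K", "Froot Loops"],
    "rating": [45.86, 53.13, 32.21],
})

# Older pandas: data.sort("rating"); current pandas:
by_rating = data.sort_values("rating", ascending=False)
print(by_rating.name.iloc[0])  # Special K
```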

Make a Histogram

Making a histogram in Python is extremely easy. Simply use the .hist() function. If you’ve got the plotting package seaborn enabled, it will even look nice too!

[Example Python histogram]
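A minimal sketch with made-up ratings (the Agg backend line is only needed when running outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import pandas as pd

# Hypothetical ratings standing in for a real numeric column
data = pd.DataFrame({"rating": [45.9, 53.1, 32.2, 50.8, 29.5]})

# One call makes the histogram
ax = data["rating"].hist()
ax.set_title("Cereal ratings")
```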

Conclusion

That’s it for part 1! Stay tuned for calculated columns and pivots in part 2

Using Python to Excel (Part 2 – Calculate Columns and Pivot)

Note: If you’re new here, you’ll probably want to start with Part 1

Python’s Excel pivot table

We’ve made great progress, and things are only just getting started. We’ll mostly be talking about pivot tables, though we’ll start by creating the calculated column that we’ll pivot on shortly:

Calculated Column

The cereal’s rating (apparently given by Consumer Reports) is bizarrely precise, so I thought it would be valuable to make bins of 10. This starts to get a little complex, but isn’t too bad:

[Screenshot: calculated column in Excel]

Python actually uses the same function name. Note that I’m telling it to only display three columns this time, but all of the rest are still in the ‘kellogs_sorted’ sheet:

[Screenshot: calculated column in pandas]
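One simple way to make the width-10 bins is floor division (an assumption on my part; the screenshots may have used a different formula):

```python
import pandas as pd

# Hypothetical stand-in rows
data = pd.DataFrame({
    "name": ["Corn Flakes", "Special K", "Froot Loops"],
    "rating": [45.86, 53.13, 32.21],
})

# Floor-divide by 10, then multiply back: 45.86 -> 40.0
data["rating_bin"] = (data["rating"] // 10) * 10
print(list(data["rating_bin"]))  # [40.0, 50.0, 30.0]
```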

Pivot

This is where things get really intense. I’ve squished Excel’s pivot table a bit to fit it, but you can get a sense of the simple stuff that we’re doing here, just summing the cups and calories for each rating category:

2014-10-11 20_00_40-PivotTable Field List

Python uses a function called pivot_table to achieve this same thing:

2014-10-11 20_03_26-cereals
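The pivot_table call in that screenshot looks roughly like this (data made up; the point is the index/values/aggfunc arguments):

```python
import pandas as pd

cereals = pd.DataFrame({
    'rating_bin': [30, 30, 40, 50],
    'cups':       [1.0, 0.75, 1.0, 0.5],
    'calories':   [110, 120, 100, 90],
})

# Sum the cups and calories for each rating category
pivot = cereals.pivot_table(index='rating_bin',
                            values=['cups', 'calories'],
                            aggfunc='sum')
print(pivot)
```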

Calculating Values in Pivots

Excel has you add equation columns (“calculated fields”) to the pivot, then bring those into the table:

2014-10-11 20_05_01-cereals - Excel

 

2014-10-11 20_06_30-Insert Calculated Field

 

2014-10-11 20_07_41-

 

In python you can take the table that you made before and simply add the column directly (and then round it, to keep the numbers reasonable)

2014-10-11 20_09_35-cereals
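A sketch of adding the calculated column straight onto the pivot (the ‘calories_per_cup’ name and the data are my own illustration):

```python
import pandas as pd

cereals = pd.DataFrame({
    'rating_bin': [30, 30, 40, 50],
    'cups':       [1.0, 0.75, 1.0, 0.5],
    'calories':   [110, 120, 100, 90],
})

pivot = cereals.pivot_table(index='rating_bin',
                            values=['cups', 'calories'], aggfunc='sum')

# Add the calculated column directly, then round to keep it readable
pivot['calories_per_cup'] = (pivot['calories'] / pivot['cups']).round(1)
print(pivot)
```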

 

Pivot with column categories

In Excel, you add the additional category (in this case, which shelf the item is on) to the “Columns” list. Note that we’re now counting the number of products, instead of summing cups or calories.

2014-10-11 20_11_38-PivotTable Field List

In python, you add a new argument to the pivot_table function called “columns”, and tell it that you want “shelf” in there.

You can also tell it to put 0 in the blank cells, so that the table makes more visual sense:

Python Excel pivot table
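Putting those two tweaks together, the call probably looks something like this (rows are illustrative; fill_value=0 handles the blanks):

```python
import pandas as pd

cereals = pd.DataFrame({
    'name':       ['A', 'B', 'C', 'D'],
    'rating_bin': [30, 30, 40, 50],
    'shelf':      [1, 3, 3, 2],
})

# Count products per rating bin, broken out by shelf;
# fill_value=0 puts 0 in the empty cells instead of NaN
pivot = cereals.pivot_table(index='rating_bin', columns='shelf',
                            values='name', aggfunc='count', fill_value=0)
print(pivot)
```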

Plotting

When it correctly guesses at your intentions, Excel’s graphs are pretty magical, laying everything out in essentially one click:

2014-10-11 20_20_55-cereals - Excel

 

Python is pretty magical too, simply requiring you to specify that you’d like a bar plot, and then allowing you to set as many or as few labels as you’d like. The legend does have an unfortunate tendency to default to the worst corner of the graph, but it’s easy enough to move around:

2014-10-11 20_18_25-cereals
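A minimal bar-plot sketch, including moving the legend out of that unfortunate default corner (the counts table is made up to mirror the pivot above):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import pandas as pd

# Product counts per rating bin (rows), broken out by shelf (columns)
counts = pd.DataFrame({1: [1, 0, 0], 2: [0, 0, 1], 3: [1, 1, 0]},
                      index=[30, 40, 50])

ax = counts.plot(kind='bar')
ax.set_xlabel('rating bin')
ax.set_ylabel('product count')
ax.legend(title='shelf', loc='upper center')  # relocate the legend
ax.figure.savefig('shelf_bars.png')
```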

 

It’s clearly less attractive by default, though. Here’s an example of what you can do with a little more specification:

../../_images/plot_bmh.png

From the matplotlib gallery

 

Histograms

Histograms are the first of the functions that Excel doesn’t have a button for. Excel recommends a 6-step process outlined here: http://support2.microsoft.com/kb/214269

Python has a one-step histogram function, .hist(). Here’s an example histogram of the Consumer Reports ratings of all of the cereals:

2014-10-11 20_32_27-cereals

 

Histogram Comparisons

You can extend the functionality even further by showing two histograms over each other. The function naturally groups the data into 10 bins, which can be misaligned if you are comparing two data sets.

I used a new function called ‘range’ that simply gives me all the numbers from 0 (the first argument) up to, but not including, 100 (the second), counting by 10 (the third argument). Then I tell .hist() that it needs to use those as bins for both the overall data and the data filtered to Kellogg’s:

2014-10-11 20_28_27-cereals

This allows us to see that Kellogg’s is doing about as well as the group overall in terms of ratings.
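The overlay could be sketched like this (data invented; note that because range stops short of its second argument, I pass 110 so the bin edges run all the way through 100):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import pandas as pd

cereals = pd.DataFrame({
    'mfr':    ['K', 'K', 'G', 'G', 'P', 'K'],
    'rating': [53.1, 45.9, 50.8, 39.3, 29.5, 61.0],
})

# Shared bin edges 0, 10, ..., 100 keep the two histograms aligned
bins = range(0, 110, 10)
ax = cereals['rating'].hist(bins=bins, label='all cereals')
cereals[cereals['mfr'] == 'K']['rating'].hist(bins=bins, ax=ax,
                                              label="Kellogg's")
ax.legend()
ax.figure.savefig('rating_overlay.png')
```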

 

 

That’s it for the basics! I hope you’ve found it useful, and that you give Python a try next time you want to explore a data set.

Using Bayes’ theorem on two-way categorical data

Still revising, but I figured that in true Bayesian fashion, I’d update this dynamically as more information came in

It’s said that the best way to understand something is to teach it, and the huge number of explanations of Bayes’ Theorem suggests that many (like me!) have struggled to learn it. Here is my short description of the approach that ultimately led to some clarity for me:

Let’s do the “draw a cookie from the jar” example from Allen Downey’s Think Bayes, with a bit more of a plausible backstory (one cannot be an actor, or act like a Bayesian, without understanding their character’s motivation). I made two 100-cookie batches, one with cilantro (Bowl 1) and one without (Bowl 2). Because my cilantro-hating friend prefers vanilla cookies, I made 75 of those (plus 25 chocolate to round out the batch) and put them in the cilantro-free bowl. I made 50 vanilla and 50 chocolate for the cilantro-added bowl.

Everyone comes over for my weird cookie party, but I forget to tell my cilantro-hating friend that they should only choose out of one of the bowls. <EDIT – have the bowls be mixed together, so that each cookie has an equal probability. Doesn’t change the problem, but removes need for an assumption of equal bowl probability> Being entirely too trusting, they just grab a cookie randomly from one of the bowls in the kitchen, and walk back to the living room. I stop them and say “Wait! Do you know which bowl that came out of?” and they say “Oh, no I wasn’t paying attention, but if you made different numbers of vanilla, which this cookie is, that should at least give us a probability of whether it came from the cilantro bowl. It wouldn’t be catastrophic to accidentally take a cilantro bite, so I’ll go for it if my chance of it being cilantro-free is greater than 55%”


Here’s what the situation would look like as a table

               | Vanilla | Chocolate | Total |
Cilantro-free  |   75    |    25     |  100  |
Cilantro-add   |   50    |    50     |  100  |
Total          |   125   |    75     |  200  |

Now, we already know that they had a 75% chance of getting a vanilla cookie if they chose the CF bowl. But that’s NOT the question at hand. The question at hand is related, but different: What is the chance that they chose the CF bowl if they got a vanilla cookie? Let’s watch a replay:

We know: Chance they have a vanilla cookie if cookie from CF bowl

We want:  Chance cookie from CF bowl if they have a vanilla cookie

The reason that I stress this is that the conventional method of “Null Hypothesis Significance Testing” (the whole concept of “the null hypothesis was rejected at p<.05” that you see in most papers) is analogous to the first statement, but we almost always want to make decisions based on the value of the second statement. To be even more direct: Most statistical analysis that we see leaves us with a number (p) that is one step short of what we can make decisions on.

Fortunately, there is an equation that can take us from what we know to what we actually want. Unfortunately, it requires additional variables to solve. In this case, we have the additional information. In other cases, we would have to estimate those values, without any method of checking our estimate (until we get more data). Painfully, after all of this work to design a clean experiment, accurately measure results, and methodically run the numbers, our very last step requires us to irrevocably taint our objective results with an estimate that is picked out of the air. It’s so frustrating, and you can start to understand why people try to use the first number, whose definition is so close to sounding like what they need, but that’s just how the math works out. Let me know if you find an alternative.

Let’s try it on this case. Here’s the simple derivation of Bayes’ Theorem:

p(A and B) = p(B and A)

p(A if B) x p(B) = p(B if A) x p(A)

Therefore: p(A if B) = p(B if A) x p(A) / p(B)

Remembering where we left off:

We know: Chance they have a vanilla cookie if cookie from CF bowl

We want:  Chance cookie from CF bowl if they have a vanilla cookie

If A = they have a vanilla cookie

and B =  cookie from CF bowl

Then p(cookie from CF bowl if they have a vanilla cookie) = p(they have a vanilla cookie if cookie from CF bowl) x p(cookie from CF bowl) / p(they have a vanilla cookie)

Note that the first term to the right of the equals sign is ‘we know’, and the final result is ‘we want’. Unfortunately, there are those two other unknowns to calculate, which is where a bit of subjectivity comes in:

p(cookie from CF bowl) means “The overall chance that any cookie (vanilla or chocolate) would be drawn from the CF bowl”. Since there are two bowls, and we don’t know of any reason one would be picked over another, we assume this is 50%. But this is an assumption, and many real-life problems would give alternatives that are clearly not 50/50, without giving clear guidance on whether they should be considered 45/55, 15/85 or .015/99.985. Note that, if you assume each cookie was equally likely to be selected, this number could be calculated from the total number of cookies in each bowl in the far right column (i.e. 100 of the 200 total cookies are in the CF bowl)

p(they have a vanilla cookie) means “The overall chance that any cookie would be vanilla”. In this case, simply look at the total number of cookies of each type (the totals on the bottom row of the table) and see that vanilla makes up 125/200 of the total. (NOTE: does this change if the bowls are not equally likely to be selected?)

Once you’ve gotten over the implications, the final calculation is easy:

.75 * (100/200) / (125/200) = .6

It’s also interesting to see that the .75 could be calculated in much the same way as the other two variables (percentage of the total in their row or column), along the top row. Specifically, “within the Cilantro-free row, what portion of cookies are vanilla?”: it’s simply the intersection of CF and Vanilla, divided by the total of the row.

               | Vanilla | Chocolate | Total |
Cilantro-free  |   75    |    25     |  100  |   <--
Cilantro-add   |   50    |    50     |  100  |
Total          |   125   |    75     |  200  |

Let’s look at all the factors again in that light:

p(they have a vanilla cookie if cookie from CF bowl):

  • Numerator: # in the intersection of CF and Vanilla
  • Denominator: # of CF

p(cookie from CF bowl)

  • Numerator: # of CF
  • Denominator: # of total cookies

p(they have a vanilla cookie)

  • Numerator: # of Vanilla
  • Denominator: # of total cookies

This is all very symmetric with the definition of our result:

p(cookie from CF bowl if they have a vanilla cookie)

  • Numerator: # in the intersection of CF and Vanilla
  • Denominator: # of Vanilla
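All of those numerators and denominators come straight off the table, so the whole calculation can be checked in a few lines:

```python
# Counts straight from the cookie table
vanilla_cf, choc_cf = 75, 25   # cilantro-free bowl
vanilla_ca, choc_ca = 50, 50   # cilantro-added bowl

total         = vanilla_cf + choc_cf + vanilla_ca + choc_ca  # 200
total_cf      = vanilla_cf + choc_cf                         # 100
total_vanilla = vanilla_cf + vanilla_ca                      # 125

p_vanilla_if_cf = vanilla_cf / total_cf   # 0.75 -- "we know"
p_cf            = total_cf / total        # 0.5  -- prior on the bowl
p_vanilla       = total_vanilla / total   # 0.625

# Bayes: p(CF if vanilla) = p(vanilla if CF) * p(CF) / p(vanilla)
p_cf_if_vanilla = p_vanilla_if_cf * p_cf / p_vanilla
print(p_cf_if_vanilla)  # 0.6 -- over the friend's 55% threshold

# The table shortcut gives the same answer: intersection / column total
shortcut = vanilla_cf / total_vanilla     # 75 / 125 = 0.6
```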


Inspiration: ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf