Note: If you’re new here, you’ll probably want to start with Part 1
Python’s Excel pivot table
We’ve made great progress, and things are only just getting started. We’ll mostly be talking about pivot tables, though we’ll start by creating the calculated column that we’ll pivot on shortly:
The cereal’s rating (apparently given by Consumer Reports) is bizarrely precise, so I thought it would be valuable to make bins of 10. This starts to get a little complex, but isn’t too bad:
Python actually uses the same function name. Note that I’m telling it to only display three columns this time, but all of the rest are still in the ‘kellogs_sorted’ sheet:
This is where things get really intense. I’ve squished Excel’s pivot table a bit to fit it, but you can get a sense of the simple stuff that we’re doing here, just summing the cups and calories for each rating category:
Python uses a function called pivot_table to achieve this same thing:
Calculating Values in Pivots
Excel has you add equations columns to the pivot, then bring those into the table
In python you can take the table that you made before and simply add the column directly (and then round it, to keep the numbers reasonable)
Pivot with column categories
In Excel, you add the additional category (in this case, which shelf the item is on) to the “Columns” list. Note that we’re now counting the number of products, instead of summing cups or calories
In python, you add a new argument to the pivot_table function called “columns”, and tell it that you what “shelf” in there
You can also tell it to put 0 in the blank cells, so that the table is makes more visual sense
When it correctly guesses at your intentions, Excel’s graphs are pretty magical, laying everything out in essentially one click:
Python is pretty magical too, simply requiring you to specify that you’d like a bar plot, and then allowing you to set as many or as few labels as you’d like. The legend does have an unfortunate tendency to default to the worst corner of the graph, but it’s easy enough to move around:
It’s clearly less attractive by default, though. Here’s an example of what you can do with a little more specification:
From the matplotlib gallery
Histograms are the first of the functions that Excel doesn’t have a button for. Excel recommends a 6 step process outlined here: http://support2.microsoft.com/kb/214269
Python has a one-step histogram function, .hist() Here’s an example histogram of the consumer reports ratings of all of the cereals:
You can extend the functionality even further by showing two histograms over each other. The function naturally groups the data into 10 bins, which can be misaligned if you are comparing two data sets.
I used a new function called ‘range’ that simply gives me all the numbers between 0 (the first argument) and 100 (the second), counting by 10 (the third argument). Then I tell .hist() that it needs to use those as bins for both the overall data, and the data filtered to Kellogg’s
This allows us to see that Kellogg’s is doing about as well as the group overall in terms of ratings.
That’s it for the basics! I hope you’ve found it useful, and that you give Python a try next time you want to explore a data set.