Images

How to make a Python Histogram (Using Python to Excel, part 1)

Excel is the perfect tool for many applications – the problem is that it’s used for about 5 billion more on top of those.

Fortunately, I’ve found many things that are complex to accomplish in Excel are extremely simple in Python. More important, there’s no copy/pasting of data, or unlabeled cells with quick calculations. Both of these ad hoc methods invariably leave me confused when I re-open the workbook to update my numbers for next month’s report. In this post, I’ll walk through many of the basic functions you’d use Excel for, and show that they’re just as simple in Python.

Let’s walk you through an example analysis. We’ll be looking at nutritional facts about a few different breakfast cereals:

 

Download and import data

This data is drawn from a site that is a wonderful source of examples, but makes the puzzling decision to keep data in a tab-separated format, as opposed to the comma-separated standard. For that reason, I’m attaching a .csv of the data here to save you a bit of reformatting work.

Original datahttp://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html

Suggested Downloadhttp://www.matthewritter.net/?attachment_id=256

 

In Excel, importing the data is a three step process:

1. Find Data > From Text:

 

excel01

2. Find the file, then choose “Delimited” and hit “Next”

2014-10-11 19_22_34-Text Import Wizard - Step 1 of 3

3. Make sure you change the delimiter to “Comma” (what else it imagines .csv to stand for is beyond me). Then you can hit ‘finish’

2014-10-11 19_24_37-Text Import Wizard - Step 2 of 3

 

In Python, you just use ‘read_csv’ (I’ll cover general setup in a later post, but the basics are that you should start with the Anaconda distribution)

2014-10-11 19_19_41-cereals

 

Filter the data

To keep things to a manageable size, we’ll filter to the Kellogg’s cereals

This is accomplished with a simple dropdown in Excel:

2014-10-11 19_28_45-cereals - Excel

Python’s approach requires a bit of explanation. As you can see, ‘data’ (which is basically the same as a worksheet in Excel) is being filtered by the data.mfr == ‘K’ in the square brackets.

This simply says “Take all of the rows in this sheet where the mfr column exactly equals (that’s what the == means!) the letter K”. Then, we simply save it as ‘kelloggs’ so we can remember (just like naming an Excel sheet).

I’m using a useful function called .head(), which lets you see the first few rows of a sheet, so that you don’t get overwhelmed with output

2014-10-11 19_30_08-cereals

 Sort the data

The final step of this introduction is sorting the data.

In Excel, this is accomplished with the same dropdown as before:

2014-10-11 19_40_43-cereals - Excel

In python, you call the aptly named ‘sort’ function, and tell it what to sort on

2014-10-11 19_42_20-cereals

 

Make a Histogram

Making a histogram is in python is extremely easy. Simply use the .hist() function. If you’ve got the plotting package seaborn enabled, it will even look nice too!

 

example python histogram

 

Conclusion

That’s it for part 1! Stay tuned for calculated columns and pivots in part 2