Excel is the perfect tool for many applications – the problem is that it’s used for about 5 billion more on top of those.
Fortunately, I’ve found many things that are complex to accomplish in Excel are extremely simple in Python. More important, there’s no copy/pasting of data, or unlabeled cells with quick calculations. Both of these ad hoc methods invariably leave me confused when I re-open the workbook to update my numbers for next month’s report. In this post, I’ll walk through many of the basic functions you’d use Excel for, and show that they’re just as simple in Python.
Let’s walk you through an example analysis. We’ll be looking at nutritional facts about a few different breakfast cereals:
Download and import data
This data is drawn from a site that is a wonderful source of examples, but makes the puzzling decision to keep data in a tab-separated format, as opposed to the comma-separated standard. For that reason, I’m attaching a .csv of the data here to save you a bit of reformatting work.
Original data: http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
Suggested Download: http://www.matthewritter.net/?attachment_id=256
In Excel, importing the data is a three step process:
1. Find Data > From Text:
2. Find the file, then choose “Delimited” and hit “Next”
3. Make sure you change the delimiter to “Comma” (what else it imagines .csv to stand for is beyond me). Then you can hit ‘finish’
In Python, you just use ‘read_csv’ (I’ll cover general setup in a later post, but the basics are that you should start with the Anaconda distribution)
Filter the data
To keep things to a manageable size, we’ll filter to the Kellogg’s cereals
This is accomplished with a simple dropdown in Excel:
Python’s approach requires a bit of explanation. As you can see, ‘data’ (which is basically the same as a worksheet in Excel) is being filtered by the data.mfr == ‘K’ in the square brackets.
This simply says “Take all of the rows in this sheet where the mfr column exactly equals (that’s what the == means!) the letter K”. Then, we simply save it as ‘kelloggs’ so we can remember (just like naming an Excel sheet).
I’m using a useful function called .head(), which lets you see the first few rows of a sheet, so that you don’t get overwhelmed with output
Sort the data
The final step of this introduction is sorting the data.
In Excel, this is accomplished with the same dropdown as before:
In python, you call the aptly named ‘sort’ function, and tell it what to sort on
Make a Histogram
Making a histogram is in python is extremely easy. Simply use the .hist() function. If you’ve got the plotting package seaborn enabled, it will even look nice too!
That’s it for part 1! Stay tuned for calculated columns and pivots in part 2