Week 6: Excel

Why learn Excel?

Excel is expensive, memory hogging, and crash-prone. It’s also one of the most used analysis tools in the world. I believe it’s because doing simple things in Excel is very easy, doing intermediate things in Excel is also fairly easy, and doing complex things is (while extremely awkward) possible. In contrast, have you experienced, SQL is tricky. When you open it, you’re faced with a blank line, and not even given a hint about what data is available to you. Once you memorize a few commands, and remember the basic structure of the query, it is quite powerful. Most users don’t have the patience to travel up the frustrating learning curve. It’s comforting to have your numbers right in front of you, if you do with Excel, and to have all the commands as tangible buttons and menu options. So why should you, who have already put in the hard work to learn SQL, also work with Excel? Almost any project you do in industry will be in collaboration with others. It’s likely your teammates will not be data specialists (that’s your job!) And, given that, extremely likely that haven’t invested the time to become fluent in SQL. If you want to collaborate meaningfully, you’re going to need to speak the same language, a language will almost certainly be Excel. They may even come to you with any half finished or broken Excel sheet, which (and this is terrifying) they have been using to make decisions for years. Some projects are less about finding and creating a brand new data extract, and more about fixing one in active use ASAP. And we shouldn’t forget that Excel is very good at a few of the tasks that it was actually designed for. Pivot tables, which are a Group By queries in SQL, are fast and relatively foolproof. Building a decent graph off of them takes just two or three more keystrokes. Making that graph aesthetically pleasing takes only a little bit more tweaking than that. Sorting and filtering the data, like Where statements and Order By statements, are also lightening fast operations. One of my frustrations with Excel is how much repetitive pointing and clicking you end up doing. But there is a solution for this, utilizing the extensive array of keyboard commands that it offers. I’ve heard rumors that in the major consulting firms the senior partners come around in week 2 and cut the cords on interns’ computer mice, forcing them to use the keyboard exclusively. In Excel, to an even greater extent than many programs, a power user who has invested some time in learning the shortcuts can go tremendously faster than a casual user. That’s why I’ll be pushing you straight into power user territory (with a carefully selected subset of the functionality) on day one.

The Excel Philosophy

I’m actually going to have us do similar operations to the ones that we did SQL, just to show you the parallels between the two. As I showed on the projector a few weeks ago (using Google’s cloud version of Excel), it can actually be conceptually helpful to prototype your SQL queries in a spreadsheet, where you can see some example values and really visualize the reshaping that you’re doing. Being able to see all of the numbers is also valuable for quickly catching errors and mistaken assumptions. An obvious example would be a table that is far smaller than it should be. If you are using SQL and just jumped straight to taking the average, you might never realize that there are only 10 rows in what should be a table that is updated every minute. When you glance at the data in Excel, you’re automatically checking for all sorts of basic quality issues. Maybe a column is completely blank, or there are rows with poorly formatted values. Maybe when you sort by column that you think is numeric, you see a bunch of random letters appearing at the top. Being able to do quick graphs is also valuable, allowing you to see trends over time or gaps in the data. What is Excel not good for? Its fundamental weakness is rooted in the same “see all the numbers” philosophy as its fundamental strength – it gets very awkward with larger data sets, multiple data sets, and multi-step analyses. You have to remember that it was originally designed to be a slightly better version of “writing a bunch of numbers on graph paper with a pencil”. You can imagine that this starts to fall apart as you get to tens of thousands of rows, or dozens of columns. From a performance prospective, you’ll often start to notice issues when your row count gets into the high 10s of thousands range (assuming 5-10 columns), but the user interface awkwardness happens much earlier. Another carryover from the pencil and paper days is that a value is only accessible by referencing its physical location. In SQL, you say “select herd_id from cow where cow_id = 45”, and this will work no matter how things have been sorted. In Excel, you say “=C355” and hope that nothing moves. You’ll often find that making the first draft of an analysis is blazing fast, but connections flagrantly (or worse, quietly) break once you add more data, a new column, or make other changes.

Horrible Youtube Videos

With great power comes great responsibilty
The final thing that Excel is good for is giving you a user interface right off the bat. At its most basic, it’s summary statistics and bar graphs, but people have done amazing things with it. Take a look at some examples on this Youtube channel: https://www.youtube.com/channel/UCU8I1s7_rxd-GyB-socETDQ Hopefully, you will walk away as I did, thinking “I can’t believe something could be both so impressive and so ugly”. Take it as both inspiration and warning.