Week 3: SQL Building Blocks

Getting started

  1. Log into the server using ssh
    1. As we discussed at the beginning of Tuesday’s class, any public website gets quite a few suspicious visitors probing for basic security holes, so refer to the “Week 2 Logistics” notes for details on how to log in
  2. Start screen. The command below will reattach your session immediately if you only have one, or list your available sessions if you have more than one
    1. screen -r
  3. If you accidentally closed your last screen session, go to your home directory and open up a new one
    1. cd ~
    2. screen -S default
  4. Once you’re in screen, make sure that you’re in your home directory, and make your own copy of the SQL database (for reasons that I’ll explain below)
    1. cd ~
    2. cp /usr/local/bin/farm_data.db .
    3. sqlite3 farm_data.db
    4. You should see the “sqlite>” prompt
  5. Get your bearings: Set the output to be nicer, remind yourself which tables are available, and check out the schema of milking (note that I’ve made some changes!)
    1. .headers on
    2. .mode column
    3. .tables
    4. .schema milking
  6. Glance at the first few rows of milking, and count how big it is in total (the last step of this item shows an equivalent way to write the first query)
    1. select * from milking where rowid < 6;
    2. select count(*) countstar from milking;
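    3. For reference, here is another way to glance at the first few rows, using SQLite’s limit clause (limit is standard SQLite; nothing here is specific to our database):
    4. select * from milking limit 5;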
  7. Challenge: Using modified versions of the above queries, find the last 5 rows of the cow_temp table, and post on Slack
    1. Realizing the power of being able to tie all of these data sets together, Jenny set up a wireless feed from the temperature tracking sensors she’d placed on each cow, as an early warning sign of disease spreading through the herd.
  8. Let’s take a look at that holiday outlier from last week’s reading, by doing monthly average prices:
    1. select strftime('%Y-%m',timestamp) month, round(avg(price_dollars_per_kg),2) avg_price from sale group by strftime('%Y-%m',timestamp) order by strftime('%Y-%m',timestamp);
    2. Group By creates one output row for each unique value of the grouping expression (here, the timestamp truncated to the month). All of the rows in the sale table that fall in a given month have their prices averaged to make that month’s avg_price. The next step shows the same machinery with a different aggregate
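    3. As a small example, here is a similar rollup that counts sales per month instead of averaging prices (it assumes only the same sale table and timestamp column used above):
    4. select strftime('%Y-%m',timestamp) month, count(*) num_sales from sale group by strftime('%Y-%m',timestamp) order by strftime('%Y-%m',timestamp);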
  9. Challenge: Using filtering and some modifications to the date string, figure out exactly when the high-priced order starts coming in, and post your query and SQL output on Slack
  10. Now we’re going to actually store the outputs of our queries as our own rolled-up tables. These will be the ‘data warehouses’ that I mentioned in class. As a proof of concept, do a simple query:
    1. nano my_first_query.py
    2. import sqlite3
      # connect to your copy of the database and get a cursor for running queries
      conn = sqlite3.connect('farm_data.db')
      c = conn.cursor()
      # execute() runs the query and returns the cursor itself
      r = c.execute("select count(*) from cow")
      # r.description holds one tuple per output column; the first element is the column name
      columns = [d[0] for d in r.description]
      # fetchall() returns every result row as a list of tuples
      rows = r.fetchall()
      print(columns)
      print(rows)
    3. python my_first_query.py
  11. Now, we’ll create tables. Substitute your last name for ‘ritter’ in the code below:
    1. nano my_first_data_warehouse.py
    2. import sqlite3
      conn = sqlite3.connect('farm_data.db')
      c = conn.cursor()
      c.execute("create table warehouse_ritter as select count(*) countstar from cow")
    3. python my_first_data_warehouse.py
    4. Here we’re simply creating a table with the output of a query, rather than printing it. In SQL, everything is just rows and columns, so it’s easy (and common) to create tables from other tables. You can confirm it worked as shown in the next step
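    5. If the script ran without errors, confirm the new table exists by opening the database again with sqlite3 farm_data.db and running the two commands below (with your own last name in place of ritter):
    6. .tables
    7. select * from warehouse_ritter;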
  12. If that worked, we can update it to meet our final goal: a data warehouse table housing the metric “average herd temperature”. Note that the “drop table” command will throw an error (“no such table: warehouse_ritter”) if the table doesn’t exist yet, which can happen if there’s an error in the table creation query. If that bites you, just comment out the drop line until the creation query succeeds, then turn it back on:
    1. import sqlite3
      conn = sqlite3.connect('farm_data.db')
      c = conn.cursor()
      c.execute("drop table warehouse_ritter")  # if this throws "no such table", comment it out until the create below succeeds
      c.execute("create table warehouse_ritter as select strftime('%Y-%m-%d %H:%M', timestamp) minute, avg(temp_f) avg_temp from cow_temp group by strftime('%Y-%m-%d %H:%M',timestamp) order by strftime('%Y-%m-%d %H:%M', timestamp)")
    2. Note that we’re grouping to the minute, not the month, because we used a more precise format string in strftime. Before moving on, spot-check the new table as shown in the next step
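    3. A quick spot-check from the sqlite3 prompt, reusing the patterns from earlier steps (again with your own last name in place of ritter):
    4. .schema warehouse_ritter
    5. select * from warehouse_ritter where rowid < 6;
    6. select count(*) countstar from warehouse_ritter;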
  13. Finally, we’re going to add that script to our crontab, which you should be pretty good at by now!
    1. crontab -e
    2. * * * * * python ~/my_first_data_warehouse.py
    3. In this case, cron is simply erasing and re-writing your table every minute. That’s fine here, because the query runs nearly instantly. In real-world applications, you’d add only the new data to the table on each run, because re-building the whole thing from scratch would take hours (weeks, in some cases); a rough sketch of that pattern is in the next step
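    4. For the curious, the “only new data” pattern would look roughly like the query below, which inserts just the minutes newer than anything already in warehouse_ritter rather than dropping and re-creating the table. This is only a sketch (for one thing, it assumes warehouse_ritter already has at least one row), and you don’t need it for this week’s exercise:
    5. insert into warehouse_ritter select strftime('%Y-%m-%d %H:%M',timestamp) minute, avg(temp_f) avg_temp from cow_temp where strftime('%Y-%m-%d %H:%M',timestamp) > (select max(minute) from warehouse_ritter) group by strftime('%Y-%m-%d %H:%M',timestamp);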

With that, we’ve walked all the way through the first two steps of our three-step flow. First, we accessed and explored raw operational data. We then rolled one piece of it up into a metric, average herd temperature, and set that up to load automatically into our data warehouse. If you want to see a basic example of the final step, front-end display, check out: http://olindatasystems.matthewritter.net/ritter/rollup

 

Next week, I’ll introduce you to Tableau, and other pieces of software that try to make the front end of dashboarding easier.