Week 3 Lab

Instructions for building a basic Tableau pipeline

 

  • mkdir /var/www/ritter/FlaskApp/static/

 

        • Make it writable, so scripts running on the server can update files in it

 

  • chmod +w /var/www/ritter/FlaskApp/static/

 

        • Tell Flask that anyone should be able to access these files  (in /var/www/ritter/FlaskApp/__init__.py)

 

  • from flask import send_from_directory
  • @app.route("/static/<path:path>")
  • def send_static(path):
  •    return send_from_directory('static', path)
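For context, here is a minimal standalone sketch of how that route might sit inside __init__.py. Your file's existing app setup will differ; the static_folder=None argument is an assumption of this sketch, added so that the route defined here (rather than Flask's built-in static handler) serves the files:

```python
# Hypothetical minimal __init__.py sketch; your actual file will differ.
from flask import Flask, send_from_directory

# static_folder=None disables Flask's built-in /static/ route so that
# the route defined below handles those requests instead
app = Flask(__name__, static_folder=None)

@app.route("/static/<path:path>")
def send_static(path):
    # Look the requested file up in the app's static/ directory
    return send_from_directory("static", path)
```

With this in place, a request for /static/test.txt maps to static/test.txt on disk, and a missing file comes back as a 404.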

 

        • Create a test file, and try downloading it

 

  • echo "hello world" > /var/www/ritter/FlaskApp/static/test.txt

 

 

  • cp /home/ritter/my_first_data_extract.py ~/my_first_data_extract.py

 

        • Open up your copied version with nano, and look at how it differs from other scripts we’ve written
          • Look in the file, and update the path on the last line
          • Note that we’re grouping by minute, so multiple rows from the original table (which was recorded to the second) will roll together into one
          • herd_id and cow_type are not in the cow_temp table. They come from the ‘cow’ table, a static table of additional information about each cow. The join tells SQL: “For each row in cow_temp, find the cow row whose rowid equals cow_id, and let us include values from its columns as well”
          • We’re using Python to export a CSV, which we’ll download like the “test.txt” file we experimented with earlier
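The actual script lives on the server, but its likely shape can be sketched end to end. In this self-contained version an in-memory SQLite database with made-up seed data stands in for the shared database; the table and column names (cow_temp, cow, herd_id, cow_type) come from the lab, while the seed rows and output path are assumptions:

```python
# Sketch of my_first_data_extract.py's shape (assumed: SQLite; seed
# data here is made up for illustration).
import csv
import sqlite3

conn = sqlite3.connect(":memory:")  # the real script points at the shared DB
conn.executescript("""
CREATE TABLE cow (herd_id INTEGER, cow_type TEXT);
CREATE TABLE cow_temp (cow_id INTEGER, temp_f REAL, timestamp TEXT);
INSERT INTO cow (rowid, herd_id, cow_type) VALUES (1, 2, 'holstein'),
                                                  (2, 3, 'jersey');
INSERT INTO cow_temp VALUES
  (1,  99.3, '2015-09-20 08:36:21'),
  (2,  98.5, '2015-09-20 08:36:45'),
  (1, 101.0, '2015-09-20 08:37:02');
""")

# Roll second-level readings up to one row per herd per minute,
# pulling herd_id in from the static cow table via the join
rows = conn.execute("""
    SELECT cow.herd_id,
           AVG(cow_temp.temp_f) AS avg_temp,
           STRFTIME('%Y-%m-%d %H:%M', cow_temp.timestamp) AS minute
    FROM cow_temp
    JOIN cow ON cow.rowid = cow_temp.cow_id
    GROUP BY cow.herd_id, minute
""").fetchall()

# The real script's last line would name a path under .../FlaskApp/static/
with open("my_first_data_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["herd_id", "avg_temp", "minute"])
    writer.writerows(rows)
```

The GROUP BY is what rolls per-second rows into per-minute ones: the three seed readings above collapse to three herd/minute rows, with the two 08:36 readings split across herds.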
        • Put that in crontab

 

  • * * * * * python ~/my_first_data_extract.py

 

        • Test that it’s updating the file

 

  • ls -l /var/www/ritter/FlaskApp/static/

 

      • Use curl to download the file (and make it repeat automatically)
        • Open a new terminal that is not in ssh
        • Edit your local crontab

 

  • * * * * * curl -s http://olindatasystems.matthewritter.net/ritter/static/my_first_data_extract.csv > ~/Downloads/my_first_data_extract.csv

 

      • On your local terminal, look at the latest data with tail ~/Downloads/my_first_data_extract.csv
        • Post the output on Slack
  • Load into Tableau
    • Open a new project
    • Go to ‘Data Source’ tab
    • ‘Text File’
    • Find my_first_data_extract.csv
    • Check out the column types to make sure they are logical
      • May want to change Herd Id to a string, since the numbers are just designations
      • Also may want to remove f1, the index column
  • Make a line graph over time
    • Put Minute in Columns
      • Switch to More>Minute (the full version)
    • Avg Temp in rows
    • Marks > Circle
    • You’re probably seeing values as high as 400F
      • That’s because you’re summing, not averaging, the temperatures
      • Try making that edit
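The sum-vs-average mixup is easy to see with made-up numbers: a one-minute bucket holds several per-second readings near 100 °F, and SUM stacks them on top of each other where AVG would not:

```python
# Four ~100 °F readings that fall into the same minute bucket
readings = [99.3, 100.1, 101.2, 98.9]

total = sum(readings)                     # what SUM(Temp F) shows
average = sum(readings) / len(readings)   # what AVG(Temp F) shows

print(total)    # roughly 399.5: an impossible cow temperature
print(average)  # roughly 99.9: the sensible per-minute value
```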
  • Make a multi-line graph
    • Herd Id over Color
    • Click on size to make smaller
    • Click on color to make partially transparent
  • Try filtering
    • Drag Minute into Filters
    • Choose Range of Dates
    • Choose ‘Starting Date’
    • Hit OK
    • Click the down arrow and choose “Show Quick Filter”
    • Drag way over until you’re seeing data from the last few hours
    • You may want to turn down transparency
  • Make it live
    • Keep an eye on the most recent data point
    • Ctrl-R to refresh
  • Publish to web
    • Just hit Ctrl-s
    • You may need to make a free account
    • Post URL on Slack
    • Marvel at what you have created!

 

Taking it further – Theory

  • Let’s zoom out and sketch out the three big technical components of any data pipeline, along with the valuable action that would not have happened without access to the data:
    • Operational Data: The cow_temp table
cow_id  temp_f  timestamp
1       99.3    9/20/2015 08:36:21
2       97.9    9/20/2015 08:36:22
3       101.2   9/20/2015 08:36:23

 

    • Data Warehouse: Your my_first_data_extract.csv
herd_id  avg_temp  minute
2        98.9      9/20/2015 08:36
3        102.3     9/20/2015 08:36
0        95.9      9/20/2015 08:37

 

    • Front End: Tableau
    • Action: John and Jenny’s action in light of your findings
      • “Those cows in herd 1 look like they’re getting worse – I should probably quarantine them”
  • Your turn! Each of you has a different business to build a data pipeline for. I’d suggest working backwards – what sort of problem might they have, particularly one that they wouldn’t otherwise realize they had? What sort of visualization would help them see that?
    • Businesses (assigned by distance from podium side of the room):
  1. New Years novelty glasses factory
  2. Guided mountaineering tours
  3. Travelling three ring circus
  4. High speed refrigerated trucking
  5. Custom candle design
  6. Software for managing pogo-stick racing leagues
  7. Lifestyle magazine for students taking a gap year
  8. Freelance video game final boss designer
  9. Minor league baseball team PR company
    • Operational Data:

 

    • Data Warehouse:

 

    • Front End:

 

 

  • Action:

 

 

Taking it further – Technical

 

Reference

      • ssh ritter@olindatasystems.matthewritter.net: Open your terminal
        • I can now reset your password if you’ve forgotten
      • screen -r: reattach to your screen session (always work inside a screen session!)
      • Ctrl-a, then d: Exit out of screen without killing it
      • exit: Close your ssh connection (be sure you’ve left screen first!)
      • /usr/local/bin: Where the shared database lives
      • ~/: Your home directory
      • crontab -e: Access your cron scheduler
        • Here’s a simple example that runs every minute:

 

  • * * * * * echo "test" > ~/temp.txt