Business Data Analytics: User Needs

Question:

For how many hours per week is the average listener unproductive?

Data:

Survey of 466 listeners of Stever’s podcast

This data set required some algorithmic cleaning before I could start my analysis.  Responses were entered into a free-text box, and many people entered a range, such as '30-35'.  I wrote a function to average these ranges, and pass specific numbers through unchanged

import re
def avg_input_range(r):
  try:
    return (float(re.sub(r"-.*", '', r)) + float(re.sub(r".*-", '', r)))/2.0
  except (TypeError, ValueError): #anything that isn't a number or a range becomes None
    return None
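A quick check of the function's behavior on the three kinds of input it has to handle (the function is restated here so the snippet runs on its own):

```python
import re

def avg_input_range(r):
    """Average a '30-35'-style range; pass single numbers through."""
    try:
        low = float(re.sub(r"-.*", '', r))
        high = float(re.sub(r".*-", '', r))
        return (low + high) / 2.0
    except (TypeError, ValueError):  # anything unparseable becomes None
        return None

print(avg_input_range('30-35'))  # → 32.5
print(avg_input_range('40'))     # → 40.0
print(avg_input_range('n/a'))    # → None
```

Note that a single number like '40' works because both regexes leave it untouched, so it is averaged with itself.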

I was then able to create a DataFrame with the number of hours worked and the number of productive hours

import pandas
#pFile is the raw survey DataFrame, loaded from the response spreadsheet
worked = pFile['Number of Hours You Work Per Week'].apply(avg_input_range)
productive = pFile['Number of Hours You Feel are Productive'].apply(avg_input_range)
data = pandas.DataFrame({'worked':worked, 'productive':productive})

I then had to filter that DataFrame to remove impossible values, and to compute the 'delta' that would form the core of the analysis

#Clean the data, removing impossible values
data = data.dropna(axis=0)
#Remove responses claiming that they worked more than 168 hours per week
data = data[data['worked'] <= 24*7]

#Remove responses claiming that they were productive more than 168 hours per week
data = data[data['productive'] <= 24*7]

#Remove responses claiming that they worked less than 0 hours per week
data = data[data['worked'] >= 0]

#Remove responses claiming that they were productive less than 0 hours per week
data = data[data['productive'] >= 0]

#Take the difference, to make this a single column data analysis
delta = data['productive'] - data['worked']

#Keep only responses that were productive for strictly fewer hours than they worked
delta = delta[delta<0]
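The whole filtering chain can be seen end-to-end on a toy example (the four responses below are hypothetical, not from the survey):

```python
import pandas as pd

# Illustrative responses: one normal, one impossible, one over-productive, one normal
sample = pd.DataFrame({
    'worked':     [40.0, 200.0, 35.0, 50.0],
    'productive': [25.0,  10.0, 40.0, 30.0],
})

# Same filters as above: keep only 0..168 hours for both columns
sample = sample[(sample['worked'] <= 24 * 7) & (sample['worked'] >= 0)]
sample = sample[(sample['productive'] <= 24 * 7) & (sample['productive'] >= 0)]

delta = sample['productive'] - sample['worked']
delta = delta[delta < 0]  # drop respondents claiming more productive hours than worked

print(delta.tolist())  # → [-15.0, -20.0]
```

The 200-hour week is dropped by the range filter, and the respondent claiming 40 productive hours out of 35 worked is dropped by the delta filter, leaving only the two plausible rows.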

At this point, I wanted to get a sense of how well the data approximated a normal distribution, so I made a probability plot

#Test that the data can be approximated as a normal distribution
from scipy import stats
from matplotlib import pyplot as plt
(osm, osr), (slope, intercept, r) = stats.probplot(delta, fit=True, plot=plt)

[Probability plot: ordered delta values against normal quantiles, with the straight-line fit]

A perfectly Gaussian data set would follow the straight line.  This one falls below the r value I like to consider a "good fit", but we'll press on to the finish with that caveat in mind
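To calibrate what that r value means, it helps to compare a sample that really is Gaussian against a skewed one. A minimal sketch with synthetic samples of the same size as the survey (the distributions and parameters here are illustrative, not the survey data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=-20, scale=10, size=466)      # truly Gaussian
skewed_sample = -rng.exponential(scale=20, size=466)         # heavily skewed

# With fit=True, probplot returns the fit's correlation coefficient r
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, fit=True)
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, fit=True)

print(r_normal)  # very close to 1.0 for Gaussian data
print(r_skewed)  # noticeably lower for skewed data
```

The closer r is to 1, the better the points hug the straight line; a skewed sample bends away at the tails and drags r down.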

standardError = numpy.std(delta, ddof=1)/len(delta)**.5
tstatistic = stats.t.ppf(1 - .01/2, len(delta)-1) #two-sided 99% critical value
print "Minimum number of unproductive hours: %s" % abs(numpy.mean(delta) + tstatistic*standardError)
print "Maximum number of unproductive hours: %s" % abs(numpy.mean(delta) - tstatistic*standardError)

Which returns

"Minimum number of unproductive hours: 19.5226502972"
"Maximum number of unproductive hours: 19.9336631607"

The large number of data points allows us to have a nice and tight confidence interval!
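That tightness comes directly from the 1/√n scaling of the standard error. A quick sketch on synthetic data makes the point (the sample below is illustrative, with a mean and spread loosely inspired by the survey, not the survey itself):

```python
import numpy as np
from scipy import stats

def ci_half_width(sample, alpha=0.01):
    """Half-width of the two-sided (1 - alpha) t confidence interval."""
    n = len(sample)
    se = np.std(sample, ddof=1) / np.sqrt(n)
    return stats.t.ppf(1 - alpha / 2, n - 1) * se

rng = np.random.default_rng(0)
population = rng.normal(loc=-19.7, scale=11.0, size=10000)

small = ci_half_width(population[:50])    # a 50-response survey
large = ci_half_width(population[:466])   # a 466-response survey

print(small, large)  # the 466-response interval is roughly 3x tighter
```

Going from 50 to 466 responses shrinks the interval by about √(466/50) ≈ 3, which is what turns a vague estimate into a usable one.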

As in my user demographics analysis, I’ll plot it on a number line

from matplotlib import pyplot
from matplotlib import font_manager
import matplotlib.ticker as ticker

f1 = pyplot.figure(1, figsize=(10,2.1), facecolor = 'white')

ax1 = pyplot.subplot(1,1,1)
for loc, spine in ax1.spines.iteritems():
  if loc not in ['bottom']: #I generally only want the bottom of the bounding box
    spine.set_color('none')

ax1.set_position((.1, .09, .8, .75)) #leaves some room for the description

font = '/usr/share/fonts/truetype/msttcorefonts/georgia.ttf' #My favorite
prop = font_manager.FontProperties(fname = font)

#interval99: the CI endpoints computed above, as positive unproductive hours
interval99 = [abs(numpy.mean(delta) + tstatistic*standardError), abs(numpy.mean(delta) - tstatistic*standardError)]
plot99 = ax1.plot(interval99, [-.5,-.5], 'b.-', label="99% confidence interval")
#plotHiggs = ax1.plot(intervalHiggs, [-.25,-.25], 'r.-', label="99% confidence interval")

ax1.set_xlim(-3, 22) #leave some buffer on the left and right
ax1.set_ylim(-1,1) #data should take up about 2/3 of the vertical space
#Bold the numbers of the ticks
ticklocs = numpy.arange(0,22,5)
ax1.get_xaxis().set_ticklabels(ticklocs, weight = 'bold')

#Turn off ticks on right and top
#ax1.yaxis.tick_left()
ax1.xaxis.tick_bottom()

formatter = ticker.FormatStrFormatter('%1.2f')
ax1.yaxis.set_major_formatter(formatter)
ax1.xaxis.set_major_formatter(formatter)
f1.axes[0].yaxis.set_ticks([])
f1.axes[0].xaxis.set_ticks(ticklocs)
ax1.vlines(0,-10,4, color = 'grey', linestyles='--')
f1.text(.34,.63, "Unproductive Hours Per Week,\n99% Confidence Interval", fontsize=15 , fontweight='bold')
pyplot.show()

[Number line: Unproductive Hours Per Week, 99% Confidence Interval]
This is a good example of how formatting decisions can have a big impact on the story that a graph tells.  In this case, I chose to highlight how far the result is from the ideal of zero unproductive hours per week (the vertical dashed line).  I could instead have focused more on the confidence interval itself.  For example, I could have pointed out that people wasting more than 15 hours per week is demonstrable at the same level of confidence that CERN used to validate the discovery of the Higgs boson.

Conclusion:

Stever’s listeners are, on average, unproductive for about 19.75 hours per week.  This represents a big opportunity for improvement.