Boston housing prices and accessing public data through Enigma’s APIs

It’s exciting to see how much data is available for public consumption, often for free, and I wanted to play around with building a connection between my Jupyter notebook and one of the most user-friendly repositories I’ve found, Enigma.

The first challenge of using any API is that it makes each test of your code feel expensive. There are often limits on how frequently you can request information, so each piece of data you pull down feels precious. When you’re developing something simple locally, the usual testing strategy is to wipe everything and run it again from scratch, but with rate-limited data you can find yourself stuck with nothing to work with if you’re not careful about this.
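
One way to take the pressure off (this is just a minimal sketch with a hypothetical fetch_page helper and cache filename, not anything Enigma provides) is to write each response to disk the first time you fetch it, so re-running a cell replays the saved copy instead of spending another request:

import os
import json
import urllib

def fetch_page(url, cache_path):
    # Hypothetical helper: reuse the saved response if we've fetched this URL before
    if os.path.exists(cache_path):
        with open(cache_path, 'r') as f:
            return json.load(f)
    data = json.loads(urllib.urlopen(url).read())
    with open(cache_path, 'w') as f:
        json.dump(data, f)
    return data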

One of the first things that I do is pull down the minimum amount of data (often called one ‘page’ of data, in this case 100 rows). I make it clear in the variable name that this isn’t the full thing:

import json
import urllib

# Format:
# https://api.enigma.io/<version>/<endpoint>/<api key>/<datapath>/<parameters>
url1 = "https://api.enigma.io/v2/data/<your api key>/us.states.ma.cities.boston.real-estate-assessment13?page=1"
response1 = urllib.urlopen(url1)
data1 = json.loads(response1.read())

Then take a look to understand the structure:

print data1.keys()
for key in ['info', 'datapath', 'success']:
    print "\n\n== {} ==\n{}".format(key, data1[key])
print type(data1['result'])
print data1['result'][0]  # Check out the first row of actual data
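
One thing worth pulling out of that metadata is how many pages the dataset spans, so you don’t have to hard-code the count later. I’m assuming here that the 'info' block exposes a field along the lines of 'total_pages' – check the printout above for the actual name in your response:

# Hypothetical field name -- verify against the 'info' printout above
total_pages = data1['info'].get('total_pages')
print "The dataset spans {} pages".format(total_pages)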

While we’re at it, define a small helper that we’ll use later to decide which columns are numeric:

def is_number(x):
    # True if x can be coerced to a float (used below to split out numeric columns)
    try:
        float(str(x))
        return True
    except (ValueError, TypeError):
        return False

Once you’re satisfied with that, it’s time to download a little more systematically:

import random
import pickle

datalist = []

# Pages already downloaded in previous runs
# (api_pages_downloaded.txt needs to exist, even if empty, before the first run)
with open('api_pages_downloaded.txt', 'r') as f:
    already_used = [int(x) for x in f.read().split('\n') if x]

# Sample up to 100 pages we haven't seen yet (this dataset spans 326 pages)
remaining = set(range(326)) - set(already_used)
for i in random.sample(remaining, min(100, len(remaining))):
    # Belt-and-suspenders: skip anything already recorded on disk
    with open('api_pages_downloaded.txt', 'r') as f:
        if str(i) in f.read().split('\n'):
            print "Page {} already in datalist".format(i)
            continue
    url = "https://api.enigma.io/v2/data/<your api key>/us.states.ma.cities.boston.real-estate-assessment13?page={}".format(i)
    response = urllib.urlopen(url)
    data = json.loads(response.read())
    datalist.append(data)
    # Checkpoint after every page so a failed run doesn't lose anything
    with open('datalist.pkl', 'w') as f:
        pickle.dump(datalist, f)
    with open('api_pages_downloaded.txt', 'a') as f:
        f.write('{}\n'.format(i))
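
If the kernel dies partway through, or you come back in a fresh session, that checkpoint file means you don’t have to touch the API at all. A minimal sketch of picking the work back up:

import pickle

# Restore the pages we already downloaded instead of re-requesting them
with open('datalist.pkl', 'r') as f:
    datalist = pickle.load(f)
print "Recovered {} pages from the checkpoint".format(len(datalist))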

After all of that work, getting the data into a Pandas DataFrame is mercifully easy:

import pandas as pd

# Flatten the per-page responses into one list of row dicts
all_data = []
for x in datalist:
    all_data.extend(x['result'])
df = pd.DataFrame(all_data)
df.head().transpose()

There are a lot of columns. On balance, this is great! But you may want to arrange them into id, categorical, and numeric lists, if only for your own sanity. Fortunately, since this is a static snapshot rather than a time series, there are no date-parsing or indexing issues to worry about, and we can simply clean everything up so that there are no nulls to cause problems with plotting later:

# Columns where every value parses as a number
temp = df.fillna(0).applymap(is_number).all()
id_var = ['pid', 'serialid']
cat_var = ['ptype', 'zipcode', 'mail_zip']
# Numeric columns are the all-number columns that aren't ids or categories
num_var = [x for x in temp[temp].index if x not in id_var + cat_var]
# Everything left over is treated as categorical
cat_var += [x for x in df.columns if x not in id_var + num_var + cat_var]
print id_var
print num_var
print cat_var

# Fraction of nulls per column, before filling anything in
print df.isnull().mean()
df[num_var] = df[num_var].fillna(1).astype(float)
df[cat_var] = df[cat_var].fillna(1).astype(str)
# Drop a few columns we don't need downstream
temp = df[[x for x in df.columns if x not in ('cm_id', 'unit_num', 'st_num')]]
# .order() is the older pandas spelling of sort_values()
print temp.isnull().mean().order(ascending=False).head()
df = temp
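
As a quick sanity check (just a sketch, reusing the num_var list built above), you can confirm that the numeric columns really did come out as floats before moving on to plotting:

# Every surviving numeric column should now be float64
print df[[c for c in num_var if c in df.columns]].dtypes.value_counts()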

This post isn’t primarily about the analysis, but I couldn’t miss a chance to show how much pricier houses are in Boston proper than in the immediately surrounding area:

import numpy as np
import seaborn as sns

# Keep single-family ('R1') properties in mailing cities with more than 100 listings
by_count = df[df.lu == 'R1'].groupby('mail_cs').size()
sufficient_listings = by_count[by_count > 100].index
temp = df[(df.lu == 'R1') & (df.mail_cs.isin(sufficient_listings))].copy()
temp['av_total_log10'] = temp.av_total.apply(np.log10)

# One histogram of log10(assessed value) per mailing city
g = sns.FacetGrid(
    data=temp,
    row='mail_cs',
    size=2,
    aspect=2,
)
g.map(sns.distplot, "av_total_log10", kde=False)

[Figure: seaborn FacetGrid of log10 assessed value (av_total), one histogram per mailing city]

I hope that this was helpful – a surprising amount of work goes into the conceptually simple task of downloading some data and getting it into a form that you can work with. Almost all of this generalizes beyond the housing data set, and much of it beyond Enigma, so I hope that it ends up being a good resource as you explore. Let me know what you do with it!
