Project 2 - Ames Housing Data

For this project (full repo here) I created a model to predict price based on a dataset on the housing stock of Ames, Iowa, a small midwestern town north of Des Moines that is home to Iowa State University. There were 81 columns with about an even split between categorical data and numerical data. Some of the columns, such as “Overall Quality,” could be initially assumed to have a strong correlation with price, while other more obscure qualities such as “Garage Type” and “Masonry Veneer Type” occupied most of the columns. I only had a limited amount of time to complete this project, so efficiency was critical. My goal was to create a model that accurately predicted price without combing over every minute detail.

There was a lot of missing data in the dataset. Some columns were overwhelmingly full of NaN values, such as “Pool QC,” in which only 9 out of 2051 rows were not NaN. I decided to drop this column, along with “Alley” and “Misc Feature,” because of their overwhelming percentage of NaNs. In numerical columns, I replaced NaN values with the mean of the column using scikit-learn’s Imputer. In categorical columns, I replaced NaNs with the most frequent value of the column. The reasoning was that most of the categorical columns had one overwhelming mode, so filling in missing values with it would not alter the data very much. With more time, I would have analyzed the missing values further to determine whether they were truly missing or whether NaN simply meant the house lacked the feature (as with “Pool QC” and “Fence”). In order to fill missing values differently for numerical and categorical data, I split the DataFrame into a numerical piece and a categorical piece and then rejoined them with pd.concat(). I also converted the “Year Built” and “Year Remod/Add” columns to reflect the time since the property was built or remodeled, since a raw year is really a label rather than a meaningful magnitude, despite being stored as an integer.
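
As a minimal sketch of that fill strategy, condensed into plain pandas (the notebook below uses scikit-learn’s Imputer instead; df is a hypothetical mixed-type DataFrame like the raw training data):

import numpy as np
import pandas as pd

# df: a mixed-type DataFrame like the raw training data loaded below (hypothetical name)
num_cols = df.select_dtypes(include=[np.number]).columns
cat_cols = df.select_dtypes(exclude=[np.number]).columns
nums_filled = df[num_cols].fillna(df[num_cols].mean())              # fill with each column's mean
cats_filled = df[cat_cols].apply(lambda s: s.fillna(s.mode()[0]))   # fill with each column's mode
df_filled = pd.concat([cats_filled, nums_filled], axis=1)           # rejoin the two halves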

The first step I took to modify my data was to create dummy columns for all of the categorical data using drop_first=True, then drop the original categorical columns from my DataFrame. I then scaled the data using StandardScaler(). To select features, I used Lasso to find the ten strongest coefficients right off the bat. With so many variables of uncertain value, I wanted the regularization to tell me what mattered: as noted above, some variables had obvious predictive value, while the more obscure ones might or might not correlate strongly with price.
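
That scale-then-select sequence can also be expressed as a single scikit-learn pipeline; a minimal sketch, assuming a dummy-encoded feature matrix X_dummies and a target y (hypothetical names, not part of the notebook below):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# X_dummies: dummy-encoded features; y: sale prices (hypothetical names)
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('lasso', LassoCV(n_alphas=100, cv=10)),
])
pipe.fit(X_dummies, y)
# The ten largest absolute coefficients point to the strongest candidate features
top_ten = np.abs(pipe.named_steps['lasso'].coef_).argsort()[::-1][:10]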

After selecting the ten strongest coefficients, I ran PolynomialFeatures(interaction_only=True) on those ten columns, bringing the number of columns up to 55 (the 10 original features plus their 45 pairwise interaction terms). I scaled the result again using StandardScaler(), and then ran Lasso again to rein in any overfitting. This two-step process of filtering out unnecessary variables made sense for this kind of dataset.
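
The feature count can be sanity-checked directly (a small illustrative snippet, not part of the original notebook):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(interaction_only=True, include_bias=False)
poly.fit(np.zeros((1, 10)))       # any array with 10 feature columns
print(poly.n_output_features_)    # 55 = 10 originals + 45 pairwise interactions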

Scoring on my training data, I got an R-squared value of 0.83. On my test data, it decreased slightly to 0.81. Given the time constraints of the project, this was good enough for me. With more time, I would have looked at the categorical columns much more closely. Some of the columns I turned into dummy variables could instead have been mapped to integers, since their categories are ordinal (a sketch of this recoding follows below). Furthermore, Ames has two distinct centers: the older downtown area and another hub around the Iowa State University campus. This geography could have a significant impact on house prices, and it would be worthwhile to compare against other midwestern college towns such as Madison, Wisconsin.
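
A minimal sketch of that ordinal recoding, assuming the cleaned train DataFrame from the notebook below and the Po < Fa < TA < Gd < Ex quality scale these columns use (illustrative only, not part of the model above):

# Map ordered quality categories to integers instead of one-hot dummies
qual_scale = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
for col in ['Exter Qual', 'Kitchen Qual', 'Bsmt Qual', 'Garage Qual']:
    train[col] = train[col].map(qual_scale)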

import numpy as np
import pandas as pd
import patsy

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import Imputer, StandardScaler, PolynomialFeatures

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Load the Data

train = pd.read_csv('./train.csv')
train.head()
Id PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... Screen Porch Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type SalePrice
0 109 533352170 60 RL NaN 13517 Pave NaN IR1 Lvl ... 0 0 NaN NaN NaN 0 3 2010 WD 130500
1 544 531379050 60 RL 43.0 11492 Pave NaN IR1 Lvl ... 0 0 NaN NaN NaN 0 4 2009 WD 220000
2 153 535304180 20 RL 68.0 7922 Pave NaN Reg Lvl ... 0 0 NaN NaN NaN 0 1 2010 WD 109000
3 318 916386060 60 RL 73.0 9802 Pave NaN Reg Lvl ... 0 0 NaN NaN NaN 0 4 2010 WD 174000
4 255 906425045 50 RL 82.0 14235 Pave NaN IR1 Lvl ... 0 0 NaN NaN NaN 0 3 2010 WD 138500

5 rows × 81 columns

Clean the Data

Are there any NaNs? Where? How many?

print(train.shape)
train.isna().sum()[train.isna().sum() !=0]
(2051, 81)





Lot Frontage       330
Alley             1911
Mas Vnr Type        22
Mas Vnr Area        22
Bsmt Qual           55
Bsmt Cond           55
Bsmt Exposure       58
BsmtFin Type 1      55
BsmtFin SF 1         1
BsmtFin Type 2      56
BsmtFin SF 2         1
Bsmt Unf SF          1
Total Bsmt SF        1
Bsmt Full Bath       2
Bsmt Half Bath       2
Fireplace Qu      1000
Garage Type        113
Garage Yr Blt      114
Garage Finish      114
Garage Cars          1
Garage Area          1
Garage Qual        114
Garage Cond        114
Pool QC           2042
Fence             1651
Misc Feature      1986
dtype: int64
drop_cols = ['Alley', 'Pool QC', 'Misc Feature']
train.drop(axis=1, columns=drop_cols, inplace=True)
# dropping these columns because they were almost entirely NaNs
# Convert years into ages relative to 2018. Hard-coding the current year is brittle, but the relative values help the model.
train['Year Built'] = 2018 - train['Year Built']
train['Year Remod/Add'] = 2018 - train['Year Remod/Add']
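# A less date-sensitive variant (hypothetical, not what was run here) would measure age
# relative to the recorded sale year instead of hard-coding 2018:
#     train['Year Built'] = train['Yr Sold'] - train['Year Built']
#     train['Year Remod/Add'] = train['Yr Sold'] - train['Year Remod/Add']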
cats = train.select_dtypes(exclude=[np.number])
nums = train.select_dtypes(include=[np.number])
cat_mask = list(cats.columns)
nums_mask = list(nums.columns)
# making a mask to separate categorical and numerical data
cat_nans_mask = list(train[cat_mask].isna().sum()[train.isna().sum() !=0].index)
cat_nans_mask
# making a mask for categorical columns with NaNs
['Mas Vnr Type',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin Type 2',
 'Fireplace Qu',
 'Garage Type',
 'Garage Finish',
 'Garage Qual',
 'Garage Cond',
 'Fence']
train[cat_nans_mask].head()
# Visualizing how to replace NaNs in categorical columns
Mas Vnr Type Bsmt Qual Bsmt Cond Bsmt Exposure BsmtFin Type 1 BsmtFin Type 2 Fireplace Qu Garage Type Garage Finish Garage Qual Garage Cond Fence
0 BrkFace TA TA No GLQ Unf NaN Attchd RFn TA TA NaN
1 BrkFace Gd TA No GLQ Unf TA Attchd RFn TA TA NaN
2 None TA TA No GLQ Unf NaN Detchd Unf TA TA NaN
3 None Gd TA No Unf Unf NaN BuiltIn Fin TA TA NaN
4 None Fa Gd No Unf Unf NaN Detchd Unf TA TA NaN
for col in train[cat_nans_mask]:
    print(col, train[cat_nans_mask][col].unique())
# I wanted to see how many categories there were for each of these
Mas Vnr Type ['BrkFace' 'None' nan 'Stone' 'BrkCmn']
Bsmt Qual ['TA' 'Gd' 'Fa' nan 'Ex' 'Po']
Bsmt Cond ['TA' 'Gd' nan 'Fa' 'Po' 'Ex']
Bsmt Exposure ['No' 'Gd' 'Av' nan 'Mn']
BsmtFin Type 1 ['GLQ' 'Unf' 'ALQ' 'Rec' nan 'BLQ' 'LwQ']
BsmtFin Type 2 ['Unf' 'Rec' nan 'BLQ' 'GLQ' 'LwQ' 'ALQ']
Fireplace Qu [nan 'TA' 'Gd' 'Po' 'Ex' 'Fa']
Garage Type ['Attchd' 'Detchd' 'BuiltIn' 'Basment' nan '2Types' 'CarPort']
Garage Finish ['RFn' 'Unf' 'Fin' nan]
Garage Qual ['TA' 'Fa' nan 'Gd' 'Ex' 'Po']
Garage Cond ['TA' 'Fa' nan 'Po' 'Gd' 'Ex']
Fence [nan 'MnPrv' 'GdPrv' 'GdWo' 'MnWw']

Fill in NaNs

  • Above, I dropped three columns because they were almost entirely NaNs.
  • I filled missing values in numerical columns with the mean of the column, and missing values in categorical columns with the most common value of that column. Looking at the categorical columns, most of them had one overwhelming mode.
imp = Imputer(strategy='mean', axis=0)  # axis=0 computes the mean of each column
numerical_cols = imp.fit_transform(train[nums_mask])
# Replaces NaNs with the column mean in numerical columns
numerical_cols = pd.DataFrame(numerical_cols, columns=nums_mask)
numerical_cols.isna().sum().sum()
0
cat_cols = train[cat_mask].copy()  # copy to avoid chained-assignment warnings
for col in cat_cols:
    cat_cols[col].fillna(cat_cols[col].value_counts().index[0], inplace=True)
# replacing NaNs with the mode of each categorical column
# most categorical columns had one overwhelming mode
cat_cols.isna().sum().sum()
0
Rejoin the categorical columns and numerical columns.
train = pd.concat([cat_cols,numerical_cols], axis=1)
train.shape
# Shape matches original before split and concat.
(2051, 78)
train.isna().sum().sum()
0
train.drop(columns=['Id','PID'],axis=1, inplace=True)
# We don't need these ID numbers; they are arbitrary identifiers and would just muddy up the model.

Feature Engineering

# Make dummy columns for categorical columns.
dummies = pd.get_dummies(train[cat_mask], drop_first=True)
full_train = pd.concat([train, dummies], axis=1)
full_train.shape
(2051, 279)
# Get rid of the original categorical columns.
full_train.drop(columns=cat_mask, axis=1, inplace=True)
full_train.shape
(2051, 240)

Test Train Split

# Let's TTS
feature_cols = [col for col in full_train.columns if col !='SalePrice']
X = full_train[feature_cols]
y = full_train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(1640, 239) (1640,)
(411, 239) (411,)
Xcols = list(X_train.columns)

Scale the Data

(Feature Engineering, Continued)

# Scale X Train
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
# Scale X Test with the scaler fit on the training data, so no test information leaks in
X_test = ss.transform(X_test)

Feature Selection

  • I used Lasso initially to find the biggest ten coefficients, ran PolynomialFeatures on them, and then ran Lasso on that.
X_test = pd.DataFrame(data=X_test, columns=Xcols)
X_train = pd.DataFrame(data=X_train, columns=Xcols)
opt_lasso = LassoCV(n_alphas=500, cv=10)
opt_lasso.fit(X_train, y_train)
LassoCV(alphas=None, copy_X=True, cv=10, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=500, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
lasso = LassoCV(n_alphas=100, cv=10)
lasso.fit(X_train, y_train)
lasso_scores = cross_val_score(lasso, X, y, cv=10)

print(lasso_scores)
print(np.mean(lasso_scores))
[0.11369192 0.05790766 0.08708682 0.11539319 0.10243234 0.06335296
 0.11494424 0.09696049 0.07977338 0.13083558]
0.09623786000708608
l_coefs = pd.DataFrame({'variable':X_train.columns,
                            'coef':lasso.coef_,
                            'abs_coef':np.abs(lasso.coef_)})

l_coefs.sort_values('abs_coef', inplace=True, ascending=False)
l_coefs.head(15)
abs_coef coef variable
15 19288.499129 19288.499129 Gr Liv Area
3 14324.895927 14324.895927 Overall Qual
115 10864.213367 10864.213367 Roof Matl_CompShg
153 10068.528391 -10068.528391 Exter Qual_TA
75 9849.815613 9849.815613 Neighborhood_NridgHt
199 8781.847169 -8781.847169 Kitchen Qual_TA
152 8202.189732 -8202.189732 Exter Qual_Gd
198 7747.772376 -7747.772376 Kitchen Qual_Gd
81 7136.796409 7136.796409 Neighborhood_StoneBr
117 7065.016241 7065.016241 Roof Matl_Tar&Grv
119 6919.552382 6919.552382 Roof Matl_WdShngl
5 6908.026325 -6908.026325 Year Built
164 5980.771638 -5980.771638 Bsmt Qual_Gd
74 5958.515873 5958.515873 Neighborhood_NoRidge
166 5467.374692 -5467.374692 Bsmt Qual_TA
top_ten_lasso = list(l_coefs.variable[:10])
# top_ten_lasso = ['Gr Liv Area','Overall Qual','Roof Matl_CompShg','Exter Qual_TA','Neighborhood_NridgHt','Kitchen Qual_TA','Exter Qual_Gd','Kitchen Qual_Gd','Neighborhood_StoneBr','Roof Matl_Tar&Grv']
X_train_ten = X_train[top_ten_lasso]
X_test_ten = X_test[top_ten_lasso]
X_train_ten.head()
# Let's try just using these variables
Gr Liv Area Overall Qual Roof Matl_CompShg Exter Qual_TA Neighborhood_NridgHt Kitchen Qual_TA Exter Qual_Gd Kitchen Qual_Gd Neighborhood_StoneBr Roof Matl_Tar&Grv
0 -0.279707 -0.081747 0.116606 -1.242029 4.010403 -1.009804 1.384973 1.226302 -0.136505 -0.089388
1 0.936761 1.322072 0.116606 -1.242029 4.010403 -1.009804 1.384973 1.226302 -0.136505 -0.089388
2 -0.283663 -0.081747 0.116606 0.805134 -0.249351 0.990291 -0.722036 -0.815460 -0.136505 -0.089388
3 1.510380 1.322072 0.116606 -1.242029 -0.249351 -1.009804 1.384973 1.226302 -0.136505 -0.089388
4 0.703357 -0.081747 0.116606 -1.242029 -0.249351 0.990291 1.384973 -0.815460 -0.136505 -0.089388

More Feature Engineering

  • I ran PolynomialFeatures on the ten columns with the biggest coefficients from Lasso to generate interactions and more features, bringing the number of features from 10 to 55 (the 10 originals plus their 45 pairwise interaction terms).
poly = PolynomialFeatures(interaction_only=True, include_bias=False)

X_train_poly = poly.fit_transform(X_train_ten)
X_test_poly = poly.transform(X_test_ten)  # reuse the transformer fit on the training data
ss = StandardScaler()
ss.fit(X_train_poly)
X_train_poly = ss.transform(X_train_poly)
X_test_poly = ss.transform(X_test_poly)   # scale the test data with the train-fit scaler

Model

  • Lasso will help clean up any collinearity or overfitting introduced by PolynomialFeatures, but this acts as a kind of double filter, since I also used Lasso to select the features in the first place.
lasso = LassoCV(n_alphas=100,  cv=10)
lasso.fit(X_train_poly, y_train)
lasso.score(X_train_poly, y_train)
0.838918280626909
lasso.score(X_test_poly, y_test)
0.8168833753624594
plot_predictions = lasso.predict(X_test_poly)
plt.figure(figsize=(10,8))
fig = sns.regplot(plot_predictions, y_test)
sns.set(font_scale=2, font='Verdana')
fig.set_title('Predicted Price vs Y Test Prices')
fig.set_xlabel('Predicted Sale Price')
fig.set_ylabel('Y Test Sale Price');

[Figure: regression plot of predicted sale prices vs. actual test-set sale prices ("Predicted Price vs Y Test Prices")]

Load in Test Data

Below I apply the same preprocessing steps to the holdout test data and generate the final predictions for submission.

test = pd.read_csv('./test.csv')

drop_cols = ['Alley', 'Pool QC', 'Misc Feature']
test.drop(axis=1, columns=drop_cols, inplace=True)

test['Year Built'] = 2018 - test['Year Built']
test['Year Remod/Add'] = 2018 - test['Year Remod/Add']

tst_cats = test.select_dtypes(exclude=[np.number])
tst_nums = test.select_dtypes(include=[np.number])
tst_cat_mask = list(tst_cats.columns)
tst_nums_mask = list(tst_nums.columns)

tst_cat_nans_mask = list(test[tst_cat_mask].isna().sum()[test.isna().sum() !=0].index)

imp = Imputer(strategy='mean', axis=0)  # fill each numerical column with its own mean
tst_numerical_cols = imp.fit_transform(test[tst_nums_mask])

tst_numerical_cols = pd.DataFrame(tst_numerical_cols, columns=tst_nums_mask)

tst_cat_cols = test[tst_cat_mask].copy()  # copy to avoid chained-assignment warnings

for col in tst_cat_cols:
    tst_cat_cols[col].fillna(tst_cat_cols[col].value_counts().index[0], inplace=True)

test = pd.concat([tst_cat_cols,tst_numerical_cols], axis=1)

test.drop(columns=['Id','PID'],axis=1, inplace=True)

tst_dummies = pd.get_dummies(test[tst_cat_mask], drop_first=True)
full_test = pd.concat([test, tst_dummies], axis=1)

full_test.drop(columns=tst_cat_mask, axis=1, inplace=True)

test_cols = list(full_test.columns)

ss = StandardScaler()
ss.fit(full_test)
full_test = ss.transform(full_test)

full_test = pd.DataFrame(data=full_test, columns=test_cols)
full_test['Roof Matl_CompShg'] = 0
# This dummy column did not appear in the test data, so add it as all zeros.
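# A more general guard against train/test dummy mismatches (a hypothetical alternative,
# not what was run here) would be to reindex the test frame onto the training feature
# columns, creating any missing dummy columns as zeros:
#     full_test = full_test.reindex(columns=Xcols, fill_value=0)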
top_ten_lasso = ['Gr Liv Area','Overall Qual','Roof Matl_CompShg','Exter Qual_TA','Neighborhood_NridgHt','Kitchen Qual_TA','Exter Qual_Gd','Kitchen Qual_Gd','Neighborhood_StoneBr','Roof Matl_Tar&Grv']

full_test_ten = full_test[top_ten_lasso]
poly = PolynomialFeatures(interaction_only=True, include_bias=False)

full_test_poly = poly.fit_transform(full_test_ten)
prediction = lasso.predict(full_test_poly)
test2 = pd.read_csv('./test.csv')
submission = pd.DataFrame(data = prediction, index=test2['Id'], columns = ['SalePrice'])
submission.to_csv('./confoley_submission.csv')
Written on May 18, 2018