Project 2 - Ames Housing Data
For this project (full repo here) I created a model to predict sale price from a dataset describing the housing stock of Ames, Iowa, a small midwestern town north of Des Moines that is home to Iowa State University. There were 81 columns, split roughly evenly between categorical and numerical data. Some columns, such as “Overall Quality,” could reasonably be assumed to correlate strongly with price, while most of the others described more obscure attributes such as “Garage Type” and “Masonry Veneer Type.” I had only a limited amount of time to complete this project, so efficiency was critical: my goal was a model that predicted price accurately without combing over every minute detail.
There was a lot of missing data. Some columns were overwhelmingly NaN: in “Pool QC,” for example, only 9 of the 2051 rows were not NaN. I dropped that column, along with “Alley” and “Misc Feature,” because of the overwhelming percentage of NaN values. In numerical columns I replaced NaNs with the column mean using scikit-learn's Imputer; in categorical columns I replaced them with the most frequent value of the column. The reasoning was that most categorical columns had one overwhelming mode, so filling missing values with it would not alter the data very much. With more time, I would have analyzed the missing values further to determine whether they were truly missing or actually encoded something else (for instance, a NaN in the basement columns may simply mean the house has no basement). To fill missing values differently for numerical versus categorical data, I split the DataFrame in two and rejoined the pieces with pd.concat(). In addition to filling missing values, I converted the “Year Built” and “Year Remod/Add” columns into the number of years since the house was built or remodeled, since a raw year acts more like a categorical label than a meaningful numeric quantity.
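For reference, the same fill strategy can be sketched with the newer scikit-learn SimpleImputer API (the notebook below uses the older Imputer class; column names come from the Ames dataset, and this is a simplified sketch rather than the notebook's exact code):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('./train.csv')
df = df.drop(columns=['Alley', 'Pool QC', 'Misc Feature'])  # almost entirely NaN

nums = df.select_dtypes(include=[np.number])
cats = df.select_dtypes(exclude=[np.number])

# Numerical columns: fill NaNs with the column mean
num_filled = pd.DataFrame(SimpleImputer(strategy='mean').fit_transform(nums),
                          columns=nums.columns)
# Categorical columns: fill NaNs with the most frequent value (the mode)
cat_filled = pd.DataFrame(SimpleImputer(strategy='most_frequent').fit_transform(cats),
                          columns=cats.columns)

# Rejoin the two halves
df = pd.concat([cat_filled, num_filled], axis=1)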
The first step I took to modify my data was to create dummy columns for all of the categorical data using pd.get_dummies with drop_first=True, then drop the original categorical columns from the DataFrame. I scaled the data using StandardScaler(). To select my features, I used Lasso to find the ten strongest coefficients right off the bat. With so many variables, and from a more qualitative perspective, Lasso seemed like the right tool to figure out what was important. As noted above, some variables had an obvious predictive value while others were more ambiguous, possibly correlating strongly with price and possibly not.
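A rough sketch of that selection step, continuing from the filled DataFrame df in the previous sketch (simplified: it fits on the full dataset rather than a train split, and assumes the SalePrice, Id, and PID column names from the Ames data):
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Dummy-code the categorical columns (numeric columns pass through unchanged)
X = pd.get_dummies(df.drop(columns=['SalePrice', 'Id', 'PID']), drop_first=True)
y = df['SalePrice']

# Scale, fit Lasso, and rank features by absolute coefficient
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
lasso = LassoCV(n_alphas=100, cv=10).fit(X_scaled, y)
top_ten = (pd.Series(np.abs(lasso.coef_), index=X.columns)
           .sort_values(ascending=False)
           .head(10).index.tolist())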
After selecting the ten strongest coefficients, I ran PolynomialFeatures(interaction_only=True) on those ten columns, bringing the number of columns up to 55 (the 10 original features plus 45 pairwise interaction terms). I scaled the result again using StandardScaler(), then ran Lasso a second time to rein in any overfitting. This two-step process of filtering out unnecessary variables made sense for this kind of dataset.
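That two-step process could also be written compactly as a scikit-learn Pipeline. This is a sketch of the idea, reusing X_scaled, y, and top_ten from the previous sketch; the notebook's actual step-by-step code appears in full below.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LassoCV

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled[top_ten], y, test_size=0.2, random_state=42)

model = make_pipeline(
    PolynomialFeatures(interaction_only=True, include_bias=False),  # 10 -> 55 columns
    StandardScaler(),                                               # rescale the interaction terms
    LassoCV(n_alphas=100, cv=10),                                   # second Lasso pass
)
model.fit(X_train, y_train)
print(model.score(X_train, y_train), model.score(X_test, y_test))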
Scoring on my training data, I got an R-squared of 0.83; on my test data it decreased slightly to 0.81. Given the time constraints of the project, this was good enough for me. With more time, I would have looked at the categorical columns much more closely: some of the columns I turned into dummy variables could have been recoded as integers, since their categories are ordinal. Furthermore, Ames is geographically bimodal, with an older downtown area and a second hub around the Iowa State University campus. That distinct geography could have a significant impact on house prices, and it would be worthwhile to compare against other midwestern college towns such as Madison, Wisconsin.
import numpy as np
import pandas as pd
import patsy
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import Imputer, StandardScaler, PolynomialFeatures
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load the Data
train = pd.read_csv('./train.csv')
train.head()
Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 109 | 533352170 | 60 | RL | NaN | 13517 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | WD | 130500 |
1 | 544 | 531379050 | 60 | RL | 43.0 | 11492 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2009 | WD | 220000 |
2 | 153 | 535304180 | 20 | RL | 68.0 | 7922 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | 109000 |
3 | 318 | 916386060 | 60 | RL | 73.0 | 9802 | Pave | NaN | Reg | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | 174000 |
4 | 255 | 906425045 | 50 | RL | 82.0 | 14235 | Pave | NaN | IR1 | Lvl | ... | 0 | 0 | NaN | NaN | NaN | 0 | 3 | 2010 | WD | 138500 |
5 rows × 81 columns
Clean the Data
Are there any NaNs? Where? How many?
print(train.shape)
train.isna().sum()[train.isna().sum() !=0]
(2051, 81)
Lot Frontage 330
Alley 1911
Mas Vnr Type 22
Mas Vnr Area 22
Bsmt Qual 55
Bsmt Cond 55
Bsmt Exposure 58
BsmtFin Type 1 55
BsmtFin SF 1 1
BsmtFin Type 2 56
BsmtFin SF 2 1
Bsmt Unf SF 1
Total Bsmt SF 1
Bsmt Full Bath 2
Bsmt Half Bath 2
Fireplace Qu 1000
Garage Type 113
Garage Yr Blt 114
Garage Finish 114
Garage Cars 1
Garage Area 1
Garage Qual 114
Garage Cond 114
Pool QC 2042
Fence 1651
Misc Feature 1986
dtype: int64
drop_cols = ['Alley', 'Pool QC', 'Misc Feature']
train.drop(axis=1, columns=drop_cols, inplace=True)
# dropping these columns because they were almost entirely NaNs
# Convert year columns to years elapsed (age). Hard-coding 2018 makes this time-sensitive, but it helps the model.
train['Year Built'] = 2018 - train['Year Built']
train['Year Remod/Add'] = 2018 - train['Year Remod/Add']
cats = train.select_dtypes(exclude=[np.number])
nums = train.select_dtypes(include=[np.number])
cat_mask = list(cats.columns)
nums_mask = list(nums.columns)
# making a mask to separate categorical and numerical data
cat_nans_mask = list(train[cat_mask].isna().sum()[train.isna().sum() !=0].index)
cat_nans_mask
# making a mask for categorical columns with NaNs
['Mas Vnr Type',
'Bsmt Qual',
'Bsmt Cond',
'Bsmt Exposure',
'BsmtFin Type 1',
'BsmtFin Type 2',
'Fireplace Qu',
'Garage Type',
'Garage Finish',
'Garage Qual',
'Garage Cond',
'Fence']
train[cat_nans_mask].head()
# Visualizing how to replace NaNs in categorical columns
Mas Vnr Type | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin Type 2 | Fireplace Qu | Garage Type | Garage Finish | Garage Qual | Garage Cond | Fence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BrkFace | TA | TA | No | GLQ | Unf | NaN | Attchd | RFn | TA | TA | NaN |
1 | BrkFace | Gd | TA | No | GLQ | Unf | TA | Attchd | RFn | TA | TA | NaN |
2 | None | TA | TA | No | GLQ | Unf | NaN | Detchd | Unf | TA | TA | NaN |
3 | None | Gd | TA | No | Unf | Unf | NaN | BuiltIn | Fin | TA | TA | NaN |
4 | None | Fa | Gd | No | Unf | Unf | NaN | Detchd | Unf | TA | TA | NaN |
for col in train[cat_nans_mask]:
print(col, train[cat_nans_mask][col].unique())
# I wanted to see how many categories there were for each of these
Mas Vnr Type ['BrkFace' 'None' nan 'Stone' 'BrkCmn']
Bsmt Qual ['TA' 'Gd' 'Fa' nan 'Ex' 'Po']
Bsmt Cond ['TA' 'Gd' nan 'Fa' 'Po' 'Ex']
Bsmt Exposure ['No' 'Gd' 'Av' nan 'Mn']
BsmtFin Type 1 ['GLQ' 'Unf' 'ALQ' 'Rec' nan 'BLQ' 'LwQ']
BsmtFin Type 2 ['Unf' 'Rec' nan 'BLQ' 'GLQ' 'LwQ' 'ALQ']
Fireplace Qu [nan 'TA' 'Gd' 'Po' 'Ex' 'Fa']
Garage Type ['Attchd' 'Detchd' 'BuiltIn' 'Basment' nan '2Types' 'CarPort']
Garage Finish ['RFn' 'Unf' 'Fin' nan]
Garage Qual ['TA' 'Fa' nan 'Gd' 'Ex' 'Po']
Garage Cond ['TA' 'Fa' nan 'Po' 'Gd' 'Ex']
Fence [nan 'MnPrv' 'GdPrv' 'GdWo' 'MnWw']
Fill in NaNs
- I dropped three columns because they were almost entirely NaNs above.
- I filled in missing values in numerical columns with the mean of the column and missing values in categorical columns with the most common value of that column. Looking at the categorical columns, most of them had one overwhelming mode.
imp = Imputer(strategy='mean', axis=0)  # axis=0 imputes along columns (column means)
numerical_cols = imp.fit_transform(train[nums_mask])
# Replaces NaNs with mean for numerical columns
numerical_cols = pd.DataFrame(numerical_cols, columns=nums_mask)
numerical_cols.isna().sum().sum()
0
cat_cols = train[cat_mask].copy()  # copy to avoid SettingWithCopyWarning on fillna
for col in cat_cols:
cat_cols[col].fillna(cat_cols[col].value_counts().index[0], inplace=True)
# replacing NaNs with the categorical mode for each categorical column
# most categorical columns had one overwhelming mode
cat_cols.isna().sum().sum()
0
Rejoin the categorical columns and numerical columns.
train = pd.concat([cat_cols,numerical_cols], axis=1)
train.shape
# 78 columns: the original 81 minus the 3 dropped; the row count is unchanged by the split and concat.
(2051, 78)
train.isna().sum().sum()
0
train.drop(columns=['Id','PID'],axis=1, inplace=True)
# We don't need these ID numbers; they are arbitrary identifiers and would just muddy up our model.
Feature Engineering
# Make dummy columns for categorical columns.
dummies = pd.get_dummies(train[cat_mask], drop_first=True)
full_train = pd.concat([train, dummies], axis=1)
full_train.shape
(2051, 279)
# Get rid of the original categorical columns.
full_train.drop(columns=cat_mask, axis=1, inplace=True)
full_train.shape
(2051, 240)
Test Train Split
# Let's TTS
feature_cols = [col for col in full_train.columns if col !='SalePrice']
X = full_train[feature_cols]
y = full_train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(1640, 239) (1640,)
(411, 239) (411,)
Xcols = list(X_train.columns)
Scale the Data
(Feature Engineering, Continued)
# Scale X Train
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
# Scale X Test with the scaler fit on X Train (avoid fitting on the test set)
X_test = ss.transform(X_test)
Feature Selection
- I used Lasso initially to find the biggest ten coefficients, ran PolynomialFeatures on them, and then ran Lasso on that.
X_test = pd.DataFrame(data=X_test, columns=Xcols)
X_train = pd.DataFrame(data=X_train, columns=Xcols)
opt_lasso = LassoCV(n_alphas=500, cv=10)
opt_lasso.fit(X_train, y_train)
LassoCV(alphas=None, copy_X=True, cv=10, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=500, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
lasso = LassoCV(n_alphas=100, cv=10)
lasso.fit(X_train, y_train)
lasso_scores = cross_val_score(lasso, X, y, cv=10)
print(lasso_scores)
print(np.mean(lasso_scores))
[0.11369192 0.05790766 0.08708682 0.11539319 0.10243234 0.06335296
0.11494424 0.09696049 0.07977338 0.13083558]
0.09623786000708608
l_coefs = pd.DataFrame({'variable':X_train.columns,
'coef':lasso.coef_,
'abs_coef':np.abs(lasso.coef_)})
l_coefs.sort_values('abs_coef', inplace=True, ascending=False)
l_coefs.head(15)
abs_coef | coef | variable | |
---|---|---|---|
15 | 19288.499129 | 19288.499129 | Gr Liv Area |
3 | 14324.895927 | 14324.895927 | Overall Qual |
115 | 10864.213367 | 10864.213367 | Roof Matl_CompShg |
153 | 10068.528391 | -10068.528391 | Exter Qual_TA |
75 | 9849.815613 | 9849.815613 | Neighborhood_NridgHt |
199 | 8781.847169 | -8781.847169 | Kitchen Qual_TA |
152 | 8202.189732 | -8202.189732 | Exter Qual_Gd |
198 | 7747.772376 | -7747.772376 | Kitchen Qual_Gd |
81 | 7136.796409 | 7136.796409 | Neighborhood_StoneBr |
117 | 7065.016241 | 7065.016241 | Roof Matl_Tar&Grv |
119 | 6919.552382 | 6919.552382 | Roof Matl_WdShngl |
5 | 6908.026325 | -6908.026325 | Year Built |
164 | 5980.771638 | -5980.771638 | Bsmt Qual_Gd |
74 | 5958.515873 | 5958.515873 | Neighborhood_NoRidge |
166 | 5467.374692 | -5467.374692 | Bsmt Qual_TA |
top_ten_lasso = list(l_coefs.variable[:10])
# top_ten_lasso = ['Gr Liv Area','Overall Qual','Roof Matl_CompShg','Exter Qual_TA','Neighborhood_NridgHt','Kitchen Qual_TA','Exter Qual_Gd','Kitchen Qual_Gd','Neighborhood_StoneBr','Roof Matl_Tar&Grv']
X_train_ten = X_train[top_ten_lasso]
X_test_ten = X_test[top_ten_lasso]
X_train_ten.head()
# Let's try just using these variables
Gr Liv Area | Overall Qual | Roof Matl_CompShg | Exter Qual_TA | Neighborhood_NridgHt | Kitchen Qual_TA | Exter Qual_Gd | Kitchen Qual_Gd | Neighborhood_StoneBr | Roof Matl_Tar&Grv | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -0.279707 | -0.081747 | 0.116606 | -1.242029 | 4.010403 | -1.009804 | 1.384973 | 1.226302 | -0.136505 | -0.089388 |
1 | 0.936761 | 1.322072 | 0.116606 | -1.242029 | 4.010403 | -1.009804 | 1.384973 | 1.226302 | -0.136505 | -0.089388 |
2 | -0.283663 | -0.081747 | 0.116606 | 0.805134 | -0.249351 | 0.990291 | -0.722036 | -0.815460 | -0.136505 | -0.089388 |
3 | 1.510380 | 1.322072 | 0.116606 | -1.242029 | -0.249351 | -1.009804 | 1.384973 | 1.226302 | -0.136505 | -0.089388 |
4 | 0.703357 | -0.081747 | 0.116606 | -1.242029 | -0.249351 | 0.990291 | 1.384973 | -0.815460 | -0.136505 | -0.089388 |
More Feature Engineering
- I ran PolynomialFeatures on the ten columns with the biggest Lasso coefficients to generate interaction terms, bringing the number of features from 10 to 55 (the 10 original features plus 45 pairwise interactions).
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train_ten)
X_test_poly = poly.transform(X_test_ten)  # transform only; poly was already fit on the training set
ss = StandardScaler()
ss.fit(X_train_poly)
X_train_poly = ss.transform(X_train_poly)
# Transform the test interactions with the scaler fit on the training interactions
X_test_poly = ss.transform(X_test_poly)
Model
- Lasso will help clean up any collinearity or overfitting introduced by PolynomialFeatures; it also acts as a kind of double filter, since Lasso was used to select the features in the first place.
lasso = LassoCV(n_alphas=100, cv=10)
lasso.fit(X_train_poly, y_train)
lasso.score(X_train_poly, y_train)
0.838918280626909
lasso.score(X_test_poly, y_test)
0.8168833753624594
plot_predictions = lasso.predict(X_test_poly)
plt.figure(figsize=(10,8))
fig = sns.regplot(plot_predictions, y_test)
sns.set(font_scale=2, font='Verdana')
fig.set_title('Predicted Price vs Y Test Prices')
fig.set_xlabel('Predicted Sale Price')
fig.set_ylabel('Y Test Sale Price');
Load in Test Data
Below I apply the same preprocessing steps to the held-out test data and generate the final predictions.
test = pd.read_csv('./test.csv')
drop_cols = ['Alley', 'Pool QC', 'Misc Feature']
test.drop(axis=1, columns=drop_cols, inplace=True)
test['Year Built'] = 2018 - test['Year Built']
test['Year Remod/Add'] = 2018 - test['Year Remod/Add']
tst_cats = test.select_dtypes(exclude=[np.number])
tst_nums = test.select_dtypes(include=[np.number])
tst_cat_mask = list(tst_cats.columns)
tst_nums_mask = list(tst_nums.columns)
tst_cat_nans_mask = list(test[tst_cat_mask].isna().sum()[test.isna().sum() !=0].index)
imp = Imputer(strategy='mean', axis=0)
tst_numerical_cols = imp.fit_transform(test[tst_nums_mask])
tst_numerical_cols = pd.DataFrame(tst_numerical_cols, columns=tst_nums_mask)
tst_cat_cols = test[tst_cat_mask].copy()  # copy to avoid SettingWithCopyWarning on fillna
for col in tst_cat_cols:
tst_cat_cols[col].fillna(tst_cat_cols[col].value_counts().index[0], inplace=True)
test = pd.concat([tst_cat_cols,tst_numerical_cols], axis=1)
test.drop(columns=['Id','PID'],axis=1, inplace=True)
tst_dummies = pd.get_dummies(test[tst_cat_mask], drop_first=True)
full_test = pd.concat([test, tst_dummies], axis=1)
full_test.drop(columns=tst_cat_mask, axis=1, inplace=True)
test_cols = list(full_test.columns)
ss = StandardScaler()
ss.fit(full_test)
full_test = ss.transform(full_test)
full_test = pd.DataFrame(data=full_test, columns=test_cols)
full_test['Roof Matl_CompShg'] = 0
# This dummy column did not appear in the test data, so add it as a column of zeros.
top_ten_lasso = ['Gr Liv Area','Overall Qual','Roof Matl_CompShg','Exter Qual_TA','Neighborhood_NridgHt','Kitchen Qual_TA','Exter Qual_Gd','Kitchen Qual_Gd','Neighborhood_StoneBr','Roof Matl_Tar&Grv']
full_test_ten = full_test[top_ten_lasso]
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
full_test_poly = poly.fit_transform(full_test_ten)
prediction = lasso.predict(full_test_poly)
test2 = pd.read_csv('./test.csv')
submission = pd.DataFrame(data = prediction, index=test2['Id'], columns = ['SalePrice'])
submission.to_csv('./confoley_submission.csv')