Notice! This page is still in progress!

Before we can start

Load Packages

First we have to import the most important packages: NumPy, pandas, and scikit-learn.

import pandas as pd
import numpy as np
import sklearn

from pandas.api.types import is_string_dtype
from IPython.display import display

Load Data

Now let's load the data and check its size:

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print(f"Size Trainingset: {train.shape}, Size Testset: {test.shape}")
Size Trainingset: (1460, 81), Size Testset: (1459, 80)

The target variable is the sale price of the houses, so essentially we have a regression problem: we want our predictions to be as close as possible to the actual sale prices.

target = train["SalePrice"]
train.drop("SalePrice", axis=1, inplace=True)

target now holds the dependent variable, while train contains the features we will use to predict it.
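
As a quick sanity check, we can confirm that the feature matrix and the target still line up and peek at the range of sale prices (purely illustrative, not required for the later steps):

# Features and target should have the same number of rows.
print(train.shape, target.shape)

# Summary statistics give a feel for the scale of the target variable.
print(target.describe())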

Take a look at the data

def show_df(df):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(df)

show_df(train.head(10))
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal
5 6 50 RL 85.0 14115 Pave NaN IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0.0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NaN Attchd 1993.0 Unf 2 480 TA TA Y 40 30 0 320 0 0 NaN MnPrv Shed 700 10 2009 WD Normal
6 7 20 RL 75.0 10084 Pave NaN Reg Lvl AllPub Inside Gtl Somerst Norm Norm 1Fam 1Story 8 5 2004 2005 Gable CompShg VinylSd VinylSd Stone 186.0 Gd TA PConc Ex TA Av GLQ 1369 Unf 0 317 1686 GasA Ex Y SBrkr 1694 0 0 1694 1 0 2 0 3 1 Gd 7 Typ 1 Gd Attchd 2004.0 RFn 2 636 TA TA Y 255 57 0 0 0 0 NaN NaN NaN 0 8 2007 WD Normal
7 8 60 RL NaN 10382 Pave NaN IR1 Lvl AllPub Corner Gtl NWAmes PosN Norm 1Fam 2Story 7 6 1973 1973 Gable CompShg HdBoard HdBoard Stone 240.0 TA TA CBlock Gd TA Mn ALQ 859 BLQ 32 216 1107 GasA Ex Y SBrkr 1107 983 0 2090 1 0 2 1 3 1 TA 7 Typ 2 TA Attchd 1973.0 RFn 2 484 TA TA Y 235 204 228 0 0 0 NaN NaN Shed 350 11 2009 WD Normal
8 9 50 RM 51.0 6120 Pave NaN Reg Lvl AllPub Inside Gtl OldTown Artery Norm 1Fam 1.5Fin 7 5 1931 1950 Gable CompShg BrkFace Wd Shng None 0.0 TA TA BrkTil TA TA No Unf 0 Unf 0 952 952 GasA Gd Y FuseF 1022 752 0 1774 0 0 2 0 2 2 TA 8 Min1 2 TA Detchd 1931.0 Unf 2 468 Fa TA Y 90 0 205 0 0 0 NaN NaN NaN 0 4 2008 WD Abnorml
9 10 190 RL 50.0 7420 Pave NaN Reg Lvl AllPub Corner Gtl BrkSide Artery Artery 2fmCon 1.5Unf 5 6 1939 1950 Gable CompShg MetalSd MetalSd None 0.0 TA TA BrkTil TA TA No GLQ 851 Unf 0 140 991 GasA Ex Y SBrkr 1077 0 0 1077 1 0 1 0 2 2 TA 5 Typ 2 TA Attchd 1939.0 RFn 1 205 Gd TA Y 0 4 0 0 0 0 NaN NaN NaN 0 1 2008 WD Normal
nans = train.isnull().sum().sort_values(ascending=False)
nans = nans.where(nans > 0).dropna().astype(int)
nans
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageCond        81
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
BsmtExposure      38
BsmtFinType2      38
BsmtCond          37
BsmtQual          37
BsmtFinType1      37
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64

Let's check which categories occur in the string-type columns that contain NaN entries.

for idx in nans.index:
    if is_string_dtype(train[idx]):
        print(f"{idx} : {train[idx].dropna().value_counts().to_dict()}")
PoolQC : {'Gd': 3, 'Ex': 2, 'Fa': 2}
MiscFeature : {'Shed': 49, 'Othr': 2, 'Gar2': 2, 'TenC': 1}
Alley : {'Grvl': 50, 'Pave': 41}
Fence : {'MnPrv': 157, 'GdPrv': 59, 'GdWo': 54, 'MnWw': 11}
FireplaceQu : {'Gd': 380, 'TA': 313, 'Fa': 33, 'Ex': 24, 'Po': 20}
GarageCond : {'TA': 1326, 'Fa': 35, 'Gd': 9, 'Po': 7, 'Ex': 2}
GarageType : {'Attchd': 870, 'Detchd': 387, 'BuiltIn': 88, 'Basment': 19, 'CarPort': 9, '2Types': 6}
GarageFinish : {'Unf': 605, 'RFn': 422, 'Fin': 352}
GarageQual : {'TA': 1311, 'Fa': 48, 'Gd': 14, 'Po': 3, 'Ex': 3}
BsmtExposure : {'No': 953, 'Av': 221, 'Gd': 134, 'Mn': 114}
BsmtFinType2 : {'Unf': 1256, 'Rec': 54, 'LwQ': 46, 'BLQ': 33, 'ALQ': 19, 'GLQ': 14}
BsmtCond : {'TA': 1311, 'Gd': 65, 'Fa': 45, 'Po': 2}
BsmtQual : {'TA': 649, 'Gd': 618, 'Ex': 121, 'Fa': 35}
BsmtFinType1 : {'Unf': 430, 'GLQ': 418, 'ALQ': 220, 'BLQ': 148, 'Rec': 133, 'LwQ': 74}
MasVnrType : {'None': 864, 'BrkFace': 445, 'Stone': 128, 'BrkCmn': 15}
Electrical : {'SBrkr': 1334, 'FuseA': 94, 'FuseF': 27, 'FuseP': 3, 'Mix': 1}

Some of the columns with lots of NaNs are not surprising: for example, if a house doesn't have a garage or a pool, the related columns are simply NaN.

Nevertheless, we have to deal with these missing values before we can train a model.
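
There are many ways to do this. A minimal sketch, assuming we simply fill categorical NaNs with an explicit "Missing" category and numeric NaNs with the column median (the fill_missing helper and the train_filled / test_filled names are just illustrative; column-specific imputation would likely work better):

def fill_missing(df):
    """Return a copy of df with NaNs replaced by simple placeholder values."""
    df = df.copy()
    for col in df.columns:
        if is_string_dtype(df[col]):
            # Treat "missing" as its own category for string columns.
            df[col] = df[col].fillna("Missing")
        else:
            # Fall back to the column median for numeric columns.
            df[col] = df[col].fillna(df[col].median())
    return df

train_filled = fill_missing(train)
test_filled = fill_missing(test)

# Both frames should now be free of NaNs.
print(train_filled.isnull().sum().sum(), test_filled.isnull().sum().sum())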

Let's start analysing the dataset using decision trees 🌲. A Random Forest is an ensemble of decision trees 🌲🌲🌲 and generally performs better than a single tree.
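
As a first baseline, here is a minimal sketch of fitting a Random Forest on the cleaned data. The crude ordinal encoding of the categorical columns, the hold-out split, and the hyperparameters are placeholder assumptions, and it reuses the train_filled frame from the imputation sketch above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Crude ordinal encoding: map every string column to integer category codes.
encoded = train_filled.copy()
for col in encoded.columns:
    if is_string_dtype(encoded[col]):
        encoded[col] = encoded[col].astype("category").cat.codes

# Hold out 20% of the rows to estimate how close the predictions get.
X_train, X_valid, y_train, y_valid = train_test_split(
    encoded, target, test_size=0.2, random_state=42
)

rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Root mean squared error on the validation set.
preds = rf.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, preds))
print(f"Validation RMSE: {rmse:.0f}")

From here we could tune the forest or compare it against other models; for now this is just a starting point.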