Predicting the sale price of bulldozers using machine learning¶
Machine Learning Modelling Framework¶
1. Problem definition¶
How well can we predict the future sale price of a bulldozer, given its characteristics and previous sale prices of similar bulldozers?
2. Data¶
The data is downloaded from Kaggle. There are three main datasets: Train.csv, Valid.csv and Test.csv
3. Evaluation¶
The evaluation metric is RMSLE (root mean squared log error) between the actual and predicted auction prices. The goal is to build an ML model that minimizes RMSLE
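As a quick illustration (using made-up prices), RMSLE can be computed with Scikit-Learn's `mean_squared_log_error` and a square root:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical actual and predicted auction prices
y_true = np.array([20000, 30000, 40000])
y_pred = np.array([21000, 29000, 43000])

# RMSLE is the square root of the mean squared log error
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(rmsle)
```

Because the error is taken on the log scale, being off by $1,000 on a cheap machine is penalized more than being off by $1,000 on an expensive one.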
4. Features¶
https://docs.google.com/spreadsheets/d/1mSuoDrYWLgc01eh1Opy9PjijQ4uevvnNUCk3eHw51JI/edit?usp=sharing
5. Modelling¶
6. Experiments¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
# Import training and validation sets
df = pd.read_csv("bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 412698 entries, 0 to 412697 Data columns (total 53 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SalesID 412698 non-null int64 1 SalePrice 412698 non-null float64 2 MachineID 412698 non-null int64 3 ModelID 412698 non-null int64 4 datasource 412698 non-null int64 5 auctioneerID 392562 non-null float64 6 YearMade 412698 non-null int64 7 MachineHoursCurrentMeter 147504 non-null float64 8 UsageBand 73670 non-null object 9 saledate 412698 non-null object 10 fiModelDesc 412698 non-null object 11 fiBaseModel 412698 non-null object 12 fiSecondaryDesc 271971 non-null object 13 fiModelSeries 58667 non-null object 14 fiModelDescriptor 74816 non-null object 15 ProductSize 196093 non-null object 16 fiProductClassDesc 412698 non-null object 17 state 412698 non-null object 18 ProductGroup 412698 non-null object 19 ProductGroupDesc 412698 non-null object 20 Drive_System 107087 non-null object 21 Enclosure 412364 non-null object 22 Forks 197715 non-null object 23 Pad_Type 81096 non-null object 24 Ride_Control 152728 non-null object 25 Stick 81096 non-null object 26 Transmission 188007 non-null object 27 Turbocharged 81096 non-null object 28 Blade_Extension 25983 non-null object 29 Blade_Width 25983 non-null object 30 Enclosure_Type 25983 non-null object 31 Engine_Horsepower 25983 non-null object 32 Hydraulics 330133 non-null object 33 Pushblock 25983 non-null object 34 Ripper 106945 non-null object 35 Scarifier 25994 non-null object 36 Tip_Control 25983 non-null object 37 Tire_Size 97638 non-null object 38 Coupler 220679 non-null object 39 Coupler_System 44974 non-null object 40 Grouser_Tracks 44875 non-null object 41 Hydraulics_Flow 44875 non-null object 42 Track_Type 102193 non-null object 43 Undercarriage_Pad_Width 102916 non-null object 44 Stick_Length 102261 non-null object 45 Thumb 102332 non-null object 46 Pattern_Changer 102261 non-null object 47 Grouser_Type 102193 non-null object 48 
Backhoe_Mounting 80712 non-null object 49 Blade_Type 81875 non-null object 50 Travel_Controls 81877 non-null object 51 Differential_Type 71564 non-null object 52 Steering_Controls 71522 non-null object dtypes: float64(3), int64(5), object(45) memory usage: 166.9+ MB
df.isna().sum()
SalesID 0 SalePrice 0 MachineID 0 ModelID 0 datasource 0 auctioneerID 20136 YearMade 0 MachineHoursCurrentMeter 265194 UsageBand 339028 saledate 0 fiModelDesc 0 fiBaseModel 0 fiSecondaryDesc 140727 fiModelSeries 354031 fiModelDescriptor 337882 ProductSize 216605 fiProductClassDesc 0 state 0 ProductGroup 0 ProductGroupDesc 0 Drive_System 305611 Enclosure 334 Forks 214983 Pad_Type 331602 Ride_Control 259970 Stick 331602 Transmission 224691 Turbocharged 331602 Blade_Extension 386715 Blade_Width 386715 Enclosure_Type 386715 Engine_Horsepower 386715 Hydraulics 82565 Pushblock 386715 Ripper 305753 Scarifier 386704 Tip_Control 386715 Tire_Size 315060 Coupler 192019 Coupler_System 367724 Grouser_Tracks 367823 Hydraulics_Flow 367823 Track_Type 310505 Undercarriage_Pad_Width 309782 Stick_Length 310437 Thumb 310366 Pattern_Changer 310437 Grouser_Type 310505 Backhoe_Mounting 331986 Blade_Type 330823 Travel_Controls 330821 Differential_Type 341134 Steering_Controls 341176 dtype: int64
fig, ax = plt.subplots()
ax.scatter(df["saledate"][:1000],df["SalePrice"][:1000]);
df.SalePrice.plot.hist();
Parsing dates¶
When working with time series data, enrich the time and date component as much as possible.
We can tell pandas which columns contain dates using the parse_dates parameter
# Import data again but this time parse dates
df = pd.read_csv("bluebook-for-bulldozers/TrainAndValid.csv",
low_memory=False,
parse_dates = ["saledate"]);
df.saledate.dtype
dtype('<M8[ns]')
df.saledate[:1000]
0 2006-11-16 1 2004-03-26 2 2004-02-26 3 2011-05-19 4 2009-07-23 ... 995 2009-07-16 996 2007-06-14 997 2005-09-22 998 2005-07-28 999 2011-06-16 Name: saledate, Length: 1000, dtype: datetime64[ns]
fig, ax = plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000]);
df.head()
SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1139246 | 66000.0 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | Low | 2006-11-16 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
1 | 1139248 | 57000.0 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | Low | 2004-03-26 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
2 | 1139249 | 10000.0 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | High | 2004-02-26 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1139251 | 38500.0 | 1026470 | 332 | 121 | 3.0 | 2001 | 3486.0 | High | 2011-05-19 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1139253 | 11000.0 | 1057373 | 17311 | 121 | 3.0 | 2007 | 722.0 | Medium | 2009-07-23 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 53 columns
# Transpose to see all columns
df.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
SalesID | 1139246 | 1139248 | 1139249 | 1139251 | 1139253 |
SalePrice | 66000.0 | 57000.0 | 10000.0 | 38500.0 | 11000.0 |
MachineID | 999089 | 117657 | 434808 | 1026470 | 1057373 |
ModelID | 3157 | 77 | 7009 | 332 | 17311 |
datasource | 121 | 121 | 121 | 121 | 121 |
auctioneerID | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
YearMade | 2004 | 1996 | 2001 | 2001 | 2007 |
MachineHoursCurrentMeter | 68.0 | 4640.0 | 2838.0 | 3486.0 | 722.0 |
UsageBand | Low | Low | High | High | Medium |
saledate | 2006-11-16 00:00:00 | 2004-03-26 00:00:00 | 2004-02-26 00:00:00 | 2011-05-19 00:00:00 | 2009-07-23 00:00:00 |
fiModelDesc | 521D | 950FII | 226 | PC120-6E | S175 |
fiBaseModel | 521 | 950 | 226 | PC120 | S175 |
fiSecondaryDesc | D | F | NaN | NaN | NaN |
fiModelSeries | NaN | II | NaN | -6E | NaN |
fiModelDescriptor | NaN | NaN | NaN | NaN | NaN |
ProductSize | NaN | Medium | NaN | Small | NaN |
fiProductClassDesc | Wheel Loader - 110.0 to 120.0 Horsepower | Wheel Loader - 150.0 to 175.0 Horsepower | Skid Steer Loader - 1351.0 to 1601.0 Lb Operat... | Hydraulic Excavator, Track - 12.0 to 14.0 Metr... | Skid Steer Loader - 1601.0 to 1751.0 Lb Operat... |
state | Alabama | North Carolina | New York | Texas | New York |
ProductGroup | WL | WL | SSL | TEX | SSL |
ProductGroupDesc | Wheel Loader | Wheel Loader | Skid Steer Loaders | Track Excavators | Skid Steer Loaders |
Drive_System | NaN | NaN | NaN | NaN | NaN |
Enclosure | EROPS w AC | EROPS w AC | OROPS | EROPS w AC | EROPS |
Forks | None or Unspecified | None or Unspecified | None or Unspecified | NaN | None or Unspecified |
Pad_Type | NaN | NaN | NaN | NaN | NaN |
Ride_Control | None or Unspecified | None or Unspecified | NaN | NaN | NaN |
Stick | NaN | NaN | NaN | NaN | NaN |
Transmission | NaN | NaN | NaN | NaN | NaN |
Turbocharged | NaN | NaN | NaN | NaN | NaN |
Blade_Extension | NaN | NaN | NaN | NaN | NaN |
Blade_Width | NaN | NaN | NaN | NaN | NaN |
Enclosure_Type | NaN | NaN | NaN | NaN | NaN |
Engine_Horsepower | NaN | NaN | NaN | NaN | NaN |
Hydraulics | 2 Valve | 2 Valve | Auxiliary | 2 Valve | Auxiliary |
Pushblock | NaN | NaN | NaN | NaN | NaN |
Ripper | NaN | NaN | NaN | NaN | NaN |
Scarifier | NaN | NaN | NaN | NaN | NaN |
Tip_Control | NaN | NaN | NaN | NaN | NaN |
Tire_Size | None or Unspecified | 23.5 | NaN | NaN | NaN |
Coupler | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified |
Coupler_System | NaN | NaN | None or Unspecified | NaN | None or Unspecified |
Grouser_Tracks | NaN | NaN | None or Unspecified | NaN | None or Unspecified |
Hydraulics_Flow | NaN | NaN | Standard | NaN | Standard |
Track_Type | NaN | NaN | NaN | NaN | NaN |
Undercarriage_Pad_Width | NaN | NaN | NaN | NaN | NaN |
Stick_Length | NaN | NaN | NaN | NaN | NaN |
Thumb | NaN | NaN | NaN | NaN | NaN |
Pattern_Changer | NaN | NaN | NaN | NaN | NaN |
Grouser_Type | NaN | NaN | NaN | NaN | NaN |
Backhoe_Mounting | NaN | NaN | NaN | NaN | NaN |
Blade_Type | NaN | NaN | NaN | NaN | NaN |
Travel_Controls | NaN | NaN | NaN | NaN | NaN |
Differential_Type | Standard | Standard | NaN | NaN | NaN |
Steering_Controls | Conventional | Conventional | NaN | NaN | NaN |
df.saledate.head(20)
0 2006-11-16 1 2004-03-26 2 2004-02-26 3 2011-05-19 4 2009-07-23 5 2008-12-18 6 2004-08-26 7 2005-11-17 8 2009-08-27 9 2007-08-09 10 2008-08-21 11 2006-08-24 12 2005-10-20 13 2006-01-26 14 2006-01-03 15 2006-11-16 16 2007-06-14 17 2010-01-28 18 2006-03-09 19 2005-11-17 Name: saledate, dtype: datetime64[ns]
Sort DataFrame by saledate¶
When working with time series data, it's a good idea to sort it by date
# Sort DataFrame by date
df.sort_values(by=["saledate"], inplace=True, ascending=True)
df.saledate.head(20)
205615 1989-01-17 274835 1989-01-31 141296 1989-01-31 212552 1989-01-31 62755 1989-01-31 54653 1989-01-31 81383 1989-01-31 204924 1989-01-31 135376 1989-01-31 113390 1989-01-31 113394 1989-01-31 116419 1989-01-31 32138 1989-01-31 127610 1989-01-31 76171 1989-01-31 127000 1989-01-31 128130 1989-01-31 127626 1989-01-31 55455 1989-01-31 55454 1989-01-31 Name: saledate, dtype: datetime64[ns]
Make copy of original DataFrame¶
We make a copy of the original DataFrame so that when we manipulate the copy, the original data is preserved
# Make a copy
df_tmp = df.copy()
Add datetime parameters for the saledate column¶
df_tmp["saleYear"] = df_tmp.saledate.dt.year
df_tmp["saleMonth"] = df_tmp.saledate.dt.month
df_tmp["saleDay"] = df_tmp.saledate.dt.day
df_tmp["saleDayOfWeek"] = df_tmp.saledate.dt.dayofweek
df_tmp["saleDayOfYear"] = df_tmp.saledate.dt.dayofyear
df_tmp.head().T
205615 | 274835 | 141296 | 212552 | 62755 | |
---|---|---|---|---|---|
SalesID | 1646770 | 1821514 | 1505138 | 1671174 | 1329056 |
SalePrice | 9500.0 | 14000.0 | 50000.0 | 16000.0 | 22000.0 |
MachineID | 1126363 | 1194089 | 1473654 | 1327630 | 1336053 |
ModelID | 8434 | 10150 | 4139 | 8591 | 4089 |
datasource | 132 | 132 | 132 | 132 | 132 |
auctioneerID | 18.0 | 99.0 | 99.0 | 99.0 | 99.0 |
YearMade | 1974 | 1980 | 1978 | 1980 | 1984 |
MachineHoursCurrentMeter | NaN | NaN | NaN | NaN | NaN |
UsageBand | NaN | NaN | NaN | NaN | NaN |
saledate | 1989-01-17 00:00:00 | 1989-01-31 00:00:00 | 1989-01-31 00:00:00 | 1989-01-31 00:00:00 | 1989-01-31 00:00:00 |
fiModelDesc | TD20 | A66 | D7G | A62 | D3B |
fiBaseModel | TD20 | A66 | D7 | A62 | D3 |
fiSecondaryDesc | NaN | NaN | G | NaN | B |
fiModelSeries | NaN | NaN | NaN | NaN | NaN |
fiModelDescriptor | NaN | NaN | NaN | NaN | NaN |
ProductSize | Medium | NaN | Large | NaN | NaN |
fiProductClassDesc | Track Type Tractor, Dozer - 105.0 to 130.0 Hor... | Wheel Loader - 120.0 to 135.0 Horsepower | Track Type Tractor, Dozer - 190.0 to 260.0 Hor... | Wheel Loader - Unidentified | Track Type Tractor, Dozer - 20.0 to 75.0 Horse... |
state | Texas | Florida | Florida | Florida | Florida |
ProductGroup | TTT | WL | TTT | WL | TTT |
ProductGroupDesc | Track Type Tractors | Wheel Loader | Track Type Tractors | Wheel Loader | Track Type Tractors |
Drive_System | NaN | NaN | NaN | NaN | NaN |
Enclosure | OROPS | OROPS | OROPS | EROPS | OROPS |
Forks | NaN | None or Unspecified | NaN | None or Unspecified | NaN |
Pad_Type | NaN | NaN | NaN | NaN | NaN |
Ride_Control | NaN | None or Unspecified | NaN | None or Unspecified | NaN |
Stick | NaN | NaN | NaN | NaN | NaN |
Transmission | Direct Drive | NaN | Standard | NaN | Standard |
Turbocharged | NaN | NaN | NaN | NaN | NaN |
Blade_Extension | NaN | NaN | NaN | NaN | NaN |
Blade_Width | NaN | NaN | NaN | NaN | NaN |
Enclosure_Type | NaN | NaN | NaN | NaN | NaN |
Engine_Horsepower | NaN | NaN | NaN | NaN | NaN |
Hydraulics | 2 Valve | 2 Valve | 2 Valve | 2 Valve | 2 Valve |
Pushblock | NaN | NaN | NaN | NaN | NaN |
Ripper | None or Unspecified | NaN | None or Unspecified | NaN | None or Unspecified |
Scarifier | NaN | NaN | NaN | NaN | NaN |
Tip_Control | NaN | NaN | NaN | NaN | NaN |
Tire_Size | NaN | None or Unspecified | NaN | None or Unspecified | NaN |
Coupler | NaN | None or Unspecified | NaN | None or Unspecified | NaN |
Coupler_System | NaN | NaN | NaN | NaN | NaN |
Grouser_Tracks | NaN | NaN | NaN | NaN | NaN |
Hydraulics_Flow | NaN | NaN | NaN | NaN | NaN |
Track_Type | NaN | NaN | NaN | NaN | NaN |
Undercarriage_Pad_Width | NaN | NaN | NaN | NaN | NaN |
Stick_Length | NaN | NaN | NaN | NaN | NaN |
Thumb | NaN | NaN | NaN | NaN | NaN |
Pattern_Changer | NaN | NaN | NaN | NaN | NaN |
Grouser_Type | NaN | NaN | NaN | NaN | NaN |
Backhoe_Mounting | None or Unspecified | NaN | None or Unspecified | NaN | None or Unspecified |
Blade_Type | Straight | NaN | Straight | NaN | PAT |
Travel_Controls | None or Unspecified | NaN | None or Unspecified | NaN | Lever |
Differential_Type | NaN | Standard | NaN | Standard | NaN |
Steering_Controls | NaN | Conventional | NaN | Conventional | NaN |
saleYear | 1989 | 1989 | 1989 | 1989 | 1989 |
saleMonth | 1 | 1 | 1 | 1 | 1 |
saleDay | 17 | 31 | 31 | 31 | 31 |
saleDayOfWeek | 1 | 1 | 1 | 1 | 1 |
saleDayOfYear | 17 | 31 | 31 | 31 | 31 |
# Now that we've enriched our DataFrame with datetime features, we can remove `saledate`
# inplace=True means we don't have to reassign df_tmp = df_tmp.drop(...)
df_tmp.drop("saledate", axis=1, inplace=True)
# Check the values of different columns
df_tmp.state.value_counts()
state Florida 67320 Texas 53110 California 29761 Washington 16222 Georgia 14633 Maryland 13322 Mississippi 13240 Ohio 12369 Illinois 11540 Colorado 11529 New Jersey 11156 North Carolina 10636 Tennessee 10298 Alabama 10292 Pennsylvania 10234 South Carolina 9951 Arizona 9364 New York 8639 Connecticut 8276 Minnesota 7885 Missouri 7178 Nevada 6932 Louisiana 6627 Kentucky 5351 Maine 5096 Indiana 4124 Arkansas 3933 New Mexico 3631 Utah 3046 Unspecified 2801 Wisconsin 2745 New Hampshire 2738 Virginia 2353 Idaho 2025 Oregon 1911 Michigan 1831 Wyoming 1672 Montana 1336 Iowa 1336 Oklahoma 1326 Nebraska 866 West Virginia 840 Kansas 667 Delaware 510 North Dakota 480 Alaska 430 Massachusetts 347 Vermont 300 South Dakota 244 Hawaii 118 Rhode Island 83 Puerto Rico 42 Washington DC 2 Name: count, dtype: int64
5. Modelling¶
We've done enough EDA (we could always do more) - let's start doing some model-driven EDA
# Build an ML model - consult the Scikit-Learn machine learning map to choose an estimator
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1,
random_state=42)
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) ... (traceback truncated) ... ValueError: could not convert string to float: 'Low'
df_tmp.info()
<class 'pandas.core.frame.DataFrame'> Index: 412698 entries, 205615 to 409203 Data columns (total 57 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SalesID 412698 non-null int64 1 SalePrice 412698 non-null float64 2 MachineID 412698 non-null int64 3 ModelID 412698 non-null int64 4 datasource 412698 non-null int64 5 auctioneerID 392562 non-null float64 6 YearMade 412698 non-null int64 7 MachineHoursCurrentMeter 147504 non-null float64 8 UsageBand 73670 non-null object 9 fiModelDesc 412698 non-null object 10 fiBaseModel 412698 non-null object 11 fiSecondaryDesc 271971 non-null object 12 fiModelSeries 58667 non-null object 13 fiModelDescriptor 74816 non-null object 14 ProductSize 196093 non-null object 15 fiProductClassDesc 412698 non-null object 16 state 412698 non-null object 17 ProductGroup 412698 non-null object 18 ProductGroupDesc 412698 non-null object 19 Drive_System 107087 non-null object 20 Enclosure 412364 non-null object 21 Forks 197715 non-null object 22 Pad_Type 81096 non-null object 23 Ride_Control 152728 non-null object 24 Stick 81096 non-null object 25 Transmission 188007 non-null object 26 Turbocharged 81096 non-null object 27 Blade_Extension 25983 non-null object 28 Blade_Width 25983 non-null object 29 Enclosure_Type 25983 non-null object 30 Engine_Horsepower 25983 non-null object 31 Hydraulics 330133 non-null object 32 Pushblock 25983 non-null object 33 Ripper 106945 non-null object 34 Scarifier 25994 non-null object 35 Tip_Control 25983 non-null object 36 Tire_Size 97638 non-null object 37 Coupler 220679 non-null object 38 Coupler_System 44974 non-null object 39 Grouser_Tracks 44875 non-null object 40 Hydraulics_Flow 44875 non-null object 41 Track_Type 102193 non-null object 42 Undercarriage_Pad_Width 102916 non-null object 43 Stick_Length 102261 non-null object 44 Thumb 102332 non-null object 45 Pattern_Changer 102261 non-null object 46 Grouser_Type 102193 non-null object 47 Backhoe_Mounting 80712 non-null object 48 
Blade_Type 81875 non-null object 49 Travel_Controls 81877 non-null object 50 Differential_Type 71564 non-null object 51 Steering_Controls 71522 non-null object 52 saleYear 412698 non-null int32 53 saleMonth 412698 non-null int32 54 saleDay 412698 non-null int32 55 saleDayOfWeek 412698 non-null int32 56 saleDayOfYear 412698 non-null int32 dtypes: float64(3), int32(5), int64(5), object(44) memory usage: 174.7+ MB
Convert strings to categories¶
One way to turn our data into numbers is by converting the string columns into pandas categories
df_tmp.head()
SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | saleYear | saleMonth | saleDay | saleDayOfWeek | saleDayOfYear | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
205615 | 1646770 | 9500.0 | 1126363 | 8434 | 132 | 18.0 | 1974 | NaN | NaN | TD20 | ... | None or Unspecified | Straight | None or Unspecified | NaN | NaN | 1989 | 1 | 17 | 1 | 17 |
274835 | 1821514 | 14000.0 | 1194089 | 10150 | 132 | 99.0 | 1980 | NaN | NaN | A66 | ... | NaN | NaN | NaN | Standard | Conventional | 1989 | 1 | 31 | 1 | 31 |
141296 | 1505138 | 50000.0 | 1473654 | 4139 | 132 | 99.0 | 1978 | NaN | NaN | D7G | ... | None or Unspecified | Straight | None or Unspecified | NaN | NaN | 1989 | 1 | 31 | 1 | 31 |
212552 | 1671174 | 16000.0 | 1327630 | 8591 | 132 | 99.0 | 1980 | NaN | NaN | A62 | ... | NaN | NaN | NaN | Standard | Conventional | 1989 | 1 | 31 | 1 | 31 |
62755 | 1329056 | 22000.0 | 1336053 | 4089 | 132 | 99.0 | 1984 | NaN | NaN | D3B | ... | None or Unspecified | PAT | Lever | NaN | NaN | 1989 | 1 | 31 | 1 | 31 |
5 rows × 57 columns
pd.api.types.is_string_dtype(df_tmp["UsageBand"])
False
# Find columns which contain strings
for label, content in df_tmp.items():
if pd.api.types.is_string_dtype(content):
print(label)
fiModelDesc fiBaseModel fiProductClassDesc state ProductGroup ProductGroupDesc
# Example of what .items() does on a dictionary
random_dict = {"key1" : "hello",
"key2" : "world!"}
for key, value in random_dict.items():
print(f"this is a key: {key}",
f"this is a value: {value}")
this is a key: key1 this is a value: hello this is a key: key2 this is a value: world!
# This will turn all of the string values into category values
for label, content in df_tmp.items():
if pd.api.types.is_string_dtype(content):
df_tmp[label] = content.astype("category").cat.as_ordered()
df_tmp.info()
<class 'pandas.core.frame.DataFrame'> Index: 412698 entries, 205615 to 409203 Data columns (total 57 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SalesID 412698 non-null int64 1 SalePrice 412698 non-null float64 2 MachineID 412698 non-null int64 3 ModelID 412698 non-null int64 4 datasource 412698 non-null int64 5 auctioneerID 392562 non-null float64 6 YearMade 412698 non-null int64 7 MachineHoursCurrentMeter 147504 non-null float64 8 UsageBand 73670 non-null object 9 fiModelDesc 412698 non-null category 10 fiBaseModel 412698 non-null category 11 fiSecondaryDesc 271971 non-null object 12 fiModelSeries 58667 non-null object 13 fiModelDescriptor 74816 non-null object 14 ProductSize 196093 non-null object 15 fiProductClassDesc 412698 non-null category 16 state 412698 non-null category 17 ProductGroup 412698 non-null category 18 ProductGroupDesc 412698 non-null category 19 Drive_System 107087 non-null object 20 Enclosure 412364 non-null object 21 Forks 197715 non-null object 22 Pad_Type 81096 non-null object 23 Ride_Control 152728 non-null object 24 Stick 81096 non-null object 25 Transmission 188007 non-null object 26 Turbocharged 81096 non-null object 27 Blade_Extension 25983 non-null object 28 Blade_Width 25983 non-null object 29 Enclosure_Type 25983 non-null object 30 Engine_Horsepower 25983 non-null object 31 Hydraulics 330133 non-null object 32 Pushblock 25983 non-null object 33 Ripper 106945 non-null object 34 Scarifier 25994 non-null object 35 Tip_Control 25983 non-null object 36 Tire_Size 97638 non-null object 37 Coupler 220679 non-null object 38 Coupler_System 44974 non-null object 39 Grouser_Tracks 44875 non-null object 40 Hydraulics_Flow 44875 non-null object 41 Track_Type 102193 non-null object 42 Undercarriage_Pad_Width 102916 non-null object 43 Stick_Length 102261 non-null object 44 Thumb 102332 non-null object 45 Pattern_Changer 102261 non-null object 46 Grouser_Type 102193 non-null object 47 Backhoe_Mounting 80712 non-null 
object 48 Blade_Type 81875 non-null object 49 Travel_Controls 81877 non-null object 50 Differential_Type 71564 non-null object 51 Steering_Controls 71522 non-null object 52 saleYear 412698 non-null int32 53 saleMonth 412698 non-null int32 54 saleDay 412698 non-null int32 55 saleDayOfWeek 412698 non-null int32 56 saleDayOfYear 412698 non-null int32 dtypes: category(6), float64(3), int32(5), int64(5), object(38) memory usage: 159.3+ MB
df_tmp.state.cat.categories
Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Unspecified', 'Utah', 'Vermont', 'Virginia', 'Washington', 'Washington DC', 'West Virginia', 'Wisconsin', 'Wyoming'], dtype='object')
# String values (States) have been converted to categories and turned into numerical values
df_tmp.state.cat.codes
205615 43 274835 8 141296 8 212552 8 62755 8 .. 410879 4 412476 4 411927 4 407124 4 409203 4 Length: 412698, dtype: int8
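Note that pandas encodes missing category values as -1 in cat.codes. A small sketch with made-up values shows this:

```python
import pandas as pd

# Categories are sorted alphabetically (High=0, Low=1, Medium=2);
# missing values get the code -1
s = pd.Series(["Low", "High", None, "Medium"]).astype("category")
print(s.cat.codes.tolist())  # [1, 0, -1, 2]
```

This -1 convention becomes useful later when filling missing categorical values.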
# Check missing data
df_tmp.isnull().sum()/len(df_tmp)
SalesID 0.000000 SalePrice 0.000000 MachineID 0.000000 ModelID 0.000000 datasource 0.000000 auctioneerID 0.048791 YearMade 0.000000 MachineHoursCurrentMeter 0.642586 UsageBand 0.821492 fiModelDesc 0.000000 fiBaseModel 0.000000 fiSecondaryDesc 0.340993 fiModelSeries 0.857845 fiModelDescriptor 0.818715 ProductSize 0.524851 fiProductClassDesc 0.000000 state 0.000000 ProductGroup 0.000000 ProductGroupDesc 0.000000 Drive_System 0.740520 Enclosure 0.000809 Forks 0.520921 Pad_Type 0.803498 Ride_Control 0.629928 Stick 0.803498 Transmission 0.544444 Turbocharged 0.803498 Blade_Extension 0.937041 Blade_Width 0.937041 Enclosure_Type 0.937041 Engine_Horsepower 0.937041 Hydraulics 0.200062 Pushblock 0.937041 Ripper 0.740864 Scarifier 0.937014 Tip_Control 0.937041 Tire_Size 0.763415 Coupler 0.465277 Coupler_System 0.891024 Grouser_Tracks 0.891264 Hydraulics_Flow 0.891264 Track_Type 0.752378 Undercarriage_Pad_Width 0.750626 Stick_Length 0.752213 Thumb 0.752041 Pattern_Changer 0.752213 Grouser_Type 0.752378 Backhoe_Mounting 0.804428 Blade_Type 0.801610 Travel_Controls 0.801606 Differential_Type 0.826595 Steering_Controls 0.826697 saleYear 0.000000 saleMonth 0.000000 saleDay 0.000000 saleDayOfWeek 0.000000 saleDayOfYear 0.000000 dtype: float64
Save preprocessed data to a new CSV¶
# Export current tmp dataframe
df_tmp.to_csv("bluebook-for-bulldozers/train_tmp.csv",
index=False)
# Import preprocessed data
df_tmp = pd.read_csv("bluebook-for-bulldozers/train_tmp.csv",
low_memory=False)
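One caveat worth knowing: CSV files store no dtype metadata, so the category dtypes we created above come back as plain object columns after re-import. A minimal round-trip (with made-up data) demonstrates this:

```python
import io
import pandas as pd

# Build a tiny DataFrame with a category column and round-trip it through CSV
df_demo = pd.DataFrame({"state": pd.Categorical(["Texas", "Florida"])})
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

print(df_demo["state"].dtype)  # category
print(df_back["state"].dtype)  # object - the category dtype is lost
```

Formats such as Parquet or Feather preserve dtypes, so the string-to-category conversion wouldn't need to be repeated; with CSV it has to be redone after every import.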
df_tmp.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
SalesID | 1646770 | 1821514 | 1505138 | 1671174 | 1329056 |
SalePrice | 9500.0 | 14000.0 | 50000.0 | 16000.0 | 22000.0 |
MachineID | 1126363 | 1194089 | 1473654 | 1327630 | 1336053 |
ModelID | 8434 | 10150 | 4139 | 8591 | 4089 |
datasource | 132 | 132 | 132 | 132 | 132 |
auctioneerID | 18.0 | 99.0 | 99.0 | 99.0 | 99.0 |
YearMade | 1974 | 1980 | 1978 | 1980 | 1984 |
MachineHoursCurrentMeter | NaN | NaN | NaN | NaN | NaN |
UsageBand | NaN | NaN | NaN | NaN | NaN |
fiModelDesc | TD20 | A66 | D7G | A62 | D3B |
fiBaseModel | TD20 | A66 | D7 | A62 | D3 |
fiSecondaryDesc | NaN | NaN | G | NaN | B |
fiModelSeries | NaN | NaN | NaN | NaN | NaN |
fiModelDescriptor | NaN | NaN | NaN | NaN | NaN |
ProductSize | Medium | NaN | Large | NaN | NaN |
fiProductClassDesc | Track Type Tractor, Dozer - 105.0 to 130.0 Hor... | Wheel Loader - 120.0 to 135.0 Horsepower | Track Type Tractor, Dozer - 190.0 to 260.0 Hor... | Wheel Loader - Unidentified | Track Type Tractor, Dozer - 20.0 to 75.0 Horse... |
state | Texas | Florida | Florida | Florida | Florida |
ProductGroup | TTT | WL | TTT | WL | TTT |
ProductGroupDesc | Track Type Tractors | Wheel Loader | Track Type Tractors | Wheel Loader | Track Type Tractors |
Drive_System | NaN | NaN | NaN | NaN | NaN |
Enclosure | OROPS | OROPS | OROPS | EROPS | OROPS |
Forks | NaN | None or Unspecified | NaN | None or Unspecified | NaN |
Pad_Type | NaN | NaN | NaN | NaN | NaN |
Ride_Control | NaN | None or Unspecified | NaN | None or Unspecified | NaN |
Stick | NaN | NaN | NaN | NaN | NaN |
Transmission | Direct Drive | NaN | Standard | NaN | Standard |
Turbocharged | NaN | NaN | NaN | NaN | NaN |
Blade_Extension | NaN | NaN | NaN | NaN | NaN |
Blade_Width | NaN | NaN | NaN | NaN | NaN |
Enclosure_Type | NaN | NaN | NaN | NaN | NaN |
Engine_Horsepower | NaN | NaN | NaN | NaN | NaN |
Hydraulics | 2 Valve | 2 Valve | 2 Valve | 2 Valve | 2 Valve |
Pushblock | NaN | NaN | NaN | NaN | NaN |
Ripper | None or Unspecified | NaN | None or Unspecified | NaN | None or Unspecified |
Scarifier | NaN | NaN | NaN | NaN | NaN |
Tip_Control | NaN | NaN | NaN | NaN | NaN |
Tire_Size | NaN | None or Unspecified | NaN | None or Unspecified | NaN |
Coupler | NaN | None or Unspecified | NaN | None or Unspecified | NaN |
Coupler_System | NaN | NaN | NaN | NaN | NaN |
Grouser_Tracks | NaN | NaN | NaN | NaN | NaN |
Hydraulics_Flow | NaN | NaN | NaN | NaN | NaN |
Track_Type | NaN | NaN | NaN | NaN | NaN |
Undercarriage_Pad_Width | NaN | NaN | NaN | NaN | NaN |
Stick_Length | NaN | NaN | NaN | NaN | NaN |
Thumb | NaN | NaN | NaN | NaN | NaN |
Pattern_Changer | NaN | NaN | NaN | NaN | NaN |
Grouser_Type | NaN | NaN | NaN | NaN | NaN |
Backhoe_Mounting | None or Unspecified | NaN | None or Unspecified | NaN | None or Unspecified |
Blade_Type | Straight | NaN | Straight | NaN | PAT |
Travel_Controls | None or Unspecified | NaN | None or Unspecified | NaN | Lever |
Differential_Type | NaN | Standard | NaN | Standard | NaN |
Steering_Controls | NaN | Conventional | NaN | Conventional | NaN |
saleYear | 1989 | 1989 | 1989 | 1989 | 1989 |
saleMonth | 1 | 1 | 1 | 1 | 1 |
saleDay | 17 | 31 | 31 | 31 | 31 |
saleDayOfWeek | 1 | 1 | 1 | 1 | 1 |
saleDayOfYear | 17 | 31 | 31 | 31 | 31 |
# Still lots of missing values
df_tmp.isna().sum()
SalesID 0 SalePrice 0 MachineID 0 ModelID 0 datasource 0 auctioneerID 20136 YearMade 0 MachineHoursCurrentMeter 265194 UsageBand 339028 fiModelDesc 0 fiBaseModel 0 fiSecondaryDesc 140727 fiModelSeries 354031 fiModelDescriptor 337882 ProductSize 216605 fiProductClassDesc 0 state 0 ProductGroup 0 ProductGroupDesc 0 Drive_System 305611 Enclosure 334 Forks 214983 Pad_Type 331602 Ride_Control 259970 Stick 331602 Transmission 224691 Turbocharged 331602 Blade_Extension 386715 Blade_Width 386715 Enclosure_Type 386715 Engine_Horsepower 386715 Hydraulics 82565 Pushblock 386715 Ripper 305753 Scarifier 386704 Tip_Control 386715 Tire_Size 315060 Coupler 192019 Coupler_System 367724 Grouser_Tracks 367823 Hydraulics_Flow 367823 Track_Type 310505 Undercarriage_Pad_Width 309782 Stick_Length 310437 Thumb 310366 Pattern_Changer 310437 Grouser_Type 310505 Backhoe_Mounting 331986 Blade_Type 330823 Travel_Controls 330821 Differential_Type 341134 Steering_Controls 341176 saleYear 0 saleMonth 0 saleDay 0 saleDayOfWeek 0 saleDayOfYear 0 dtype: int64
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)
SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter saleYear saleMonth saleDay saleDayOfWeek saleDayOfYear
df_tmp.ModelID
0 8434 1 10150 2 4139 3 8591 4 4089 ... 412693 5266 412694 19330 412695 17244 412696 3357 412697 4701 Name: ModelID, Length: 412698, dtype: int64
# Check for which numeric columns have null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)
auctioneerID MachineHoursCurrentMeter
# Fill numeric columns with median values (MEDIAN is more robust than MEAN)
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a binary column which tells us if the data was missing
            df_tmp[label + "_is_missing"] = pd.isnull(content)
            # Fill missing numeric values with the median value
            df_tmp[label] = content.fillna(content.median())
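The median-fill pattern above can be sketched on a toy DataFrame (the column name and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing numeric value
toy = pd.DataFrame({"hours": [10.0, np.nan, 30.0]})

for label, content in toy.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Flag which rows were missing before filling
            toy[label + "_is_missing"] = pd.isnull(content)
            # Median of [10, 30] is 20, so the NaN becomes 20.0
            toy[label] = content.fillna(content.median())
```

The `_is_missing` flag is added before filling, so the model can still see which values were originally absent.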
# Demonstrate how MEDIAN is more robust than MEAN to outliers
hundreds = np.full((1000,), 100)
hundreds_billion = np.append(hundreds, 1000000000)
np.mean(hundreds), np.mean(hundreds_billion), np.median(hundreds), np.median(hundreds_billion)
(100.0, 999100.8991008991, 100.0, 100.0)
# Check if there are any null numeric values left
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)
# Check to see how many examples we're missing
df_tmp.auctioneerID_is_missing.value_counts()
auctioneerID_is_missing False 392562 True 20136 Name: count, dtype: int64
df_tmp.isna().sum()
SalesID 0 SalePrice 0 MachineID 0 ModelID 0 datasource 0 auctioneerID 0 YearMade 0 MachineHoursCurrentMeter 0 UsageBand 339028 fiModelDesc 0 fiBaseModel 0 fiSecondaryDesc 140727 fiModelSeries 354031 fiModelDescriptor 337882 ProductSize 216605 fiProductClassDesc 0 state 0 ProductGroup 0 ProductGroupDesc 0 Drive_System 305611 Enclosure 334 Forks 214983 Pad_Type 331602 Ride_Control 259970 Stick 331602 Transmission 224691 Turbocharged 331602 Blade_Extension 386715 Blade_Width 386715 Enclosure_Type 386715 Engine_Horsepower 386715 Hydraulics 82565 Pushblock 386715 Ripper 305753 Scarifier 386704 Tip_Control 386715 Tire_Size 315060 Coupler 192019 Coupler_System 367724 Grouser_Tracks 367823 Hydraulics_Flow 367823 Track_Type 310505 Undercarriage_Pad_Width 309782 Stick_Length 310437 Thumb 310366 Pattern_Changer 310437 Grouser_Type 310505 Backhoe_Mounting 331986 Blade_Type 330823 Travel_Controls 330821 Differential_Type 341134 Steering_Controls 341176 saleYear 0 saleMonth 0 saleDay 0 saleDayOfWeek 0 saleDayOfYear 0 auctioneerID_is_missing 0 MachineHoursCurrentMeter_is_missing 0 dtype: int64
Filling and turning categorical variables into numbers¶
# Check for columns which aren't numeric
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc fiModelSeries fiModelDescriptor ProductSize fiProductClassDesc state ProductGroup ProductGroupDesc Drive_System Enclosure Forks Pad_Type Ride_Control Stick Transmission Turbocharged Blade_Extension Blade_Width Enclosure_Type Engine_Horsepower Hydraulics Pushblock Ripper Scarifier Tip_Control Tire_Size Coupler Coupler_System Grouser_Tracks Hydraulics_Flow Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls
# States (string) into numbers
pd.Categorical(df_tmp["state"]).codes
array([43, 8, 8, ..., 4, 4, 4], dtype=int8)
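The `+ 1` used below matters because pandas assigns missing values the category code `-1`; a tiny sketch with made-up values:

```python
import pandas as pd

# Missing values get code -1, so adding 1 maps them to 0 (hypothetical values)
codes = pd.Categorical(["Low", None, "High"]).codes + 1
# Categories are sorted alphabetically: "High"->0, "Low"->1, missing->-1
# After +1: "Low"->2, missing->0, "High"->1
```

This keeps every encoded value non-negative, with 0 reserved for missing.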
# Turn categorical variables into numbers and fill missing values
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # ADD BINARY COLUMN TO INDICATE WHETHER SAMPLE HAD MISSING VALUES
        df_tmp[label + "_is_missing"] = pd.isnull(content)
        # TURN CATEGORIES INTO NUMBERS AND ADD +1 - KEEPS VALUES POSITIVE, OTHERWISE MISSING BECOMES -1
        df_tmp[label] = pd.Categorical(content).codes + 1
df_tmp.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 412698 entries, 0 to 412697 Columns: 103 entries, SalesID to Steering_Controls_is_missing dtypes: bool(46), float64(3), int16(4), int64(10), int8(40) memory usage: 77.9 MB
df_tmp.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
SalesID | 1646770 | 1821514 | 1505138 | 1671174 | 1329056 |
SalePrice | 9500.0 | 14000.0 | 50000.0 | 16000.0 | 22000.0 |
MachineID | 1126363 | 1194089 | 1473654 | 1327630 | 1336053 |
ModelID | 8434 | 10150 | 4139 | 8591 | 4089 |
datasource | 132 | 132 | 132 | 132 | 132 |
... | ... | ... | ... | ... | ... |
Backhoe_Mounting_is_missing | False | True | False | True | False |
Blade_Type_is_missing | False | True | False | True | False |
Travel_Controls_is_missing | False | True | False | True | False |
Differential_Type_is_missing | True | False | True | False | True |
Steering_Controls_is_missing | True | False | True | False | True |
103 rows × 5 columns
# NO MISSING VALUES AND DATA IS ALL NUMERIC
df_tmp.isna().sum()
SalesID 0 SalePrice 0 MachineID 0 ModelID 0 datasource 0 .. Backhoe_Mounting_is_missing 0 Blade_Type_is_missing 0 Travel_Controls_is_missing 0 Differential_Type_is_missing 0 Steering_Controls_is_missing 0 Length: 103, dtype: int64
BUILD THE MACHINE LEARNING MODEL¶
df_tmp.head()
SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | Undercarriage_Pad_Width_is_missing | Stick_Length_is_missing | Thumb_is_missing | Pattern_Changer_is_missing | Grouser_Type_is_missing | Backhoe_Mounting_is_missing | Blade_Type_is_missing | Travel_Controls_is_missing | Differential_Type_is_missing | Steering_Controls_is_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1646770 | 9500.0 | 1126363 | 8434 | 132 | 18.0 | 1974 | 0.0 | 0 | 4593 | ... | True | True | True | True | True | False | False | False | True | True |
1 | 1821514 | 14000.0 | 1194089 | 10150 | 132 | 99.0 | 1980 | 0.0 | 0 | 1820 | ... | True | True | True | True | True | True | True | True | False | False |
2 | 1505138 | 50000.0 | 1473654 | 4139 | 132 | 99.0 | 1978 | 0.0 | 0 | 2348 | ... | True | True | True | True | True | False | False | False | True | True |
3 | 1671174 | 16000.0 | 1327630 | 8591 | 132 | 99.0 | 1980 | 0.0 | 0 | 1819 | ... | True | True | True | True | True | True | True | True | False | False |
4 | 1329056 | 22000.0 | 1336053 | 4089 | 132 | 99.0 | 1984 | 0.0 | 0 | 2119 | ... | True | True | True | True | True | False | False | False | True | True |
5 rows × 103 columns
%%time
# HOW LONG THIS CELL TAKES TO RUN
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# FIT THE MODEL
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])
CPU times: user 26min 58s, sys: 17 s, total: 27min 15s Wall time: 1min 58s
RandomForestRegressor(n_jobs=-1, random_state=42)
# SCORE THE MODEL
model.score(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])
0.9875468079970562
WHY IS THIS RESULT NOT RELIABLE?¶
We evaluated the model on the same data it was trained on, so the score is overly optimistic; we never split the data into training and validation sets.
SPLITTING DATA INTO TRAIN / VALIDATION SETS¶
df_val = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]
len(df_val), len(df_train)
(11573, 401125)
# SPLIT DATA INTO X & Y
x_train, y_train = df_train.drop("SalePrice", axis=1), df_train.SalePrice
x_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val.SalePrice
x_train.shape, y_train.shape, x_valid.shape, y_valid.shape
((401125, 102), (401125,), (11573, 102), (11573,))
y_train
0 9500.0 1 14000.0 2 50000.0 3 16000.0 4 22000.0 ... 401120 29000.0 401121 11000.0 401122 11000.0 401123 18000.0 401124 13500.0 Name: SalePrice, Length: 401125, dtype: float64
BUILDING AN EVALUATION FUNCTION¶
# CREATE AN EVALUATION FUNCTION
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score
def rmsle(y_test, y_preds):
    """
    CALCULATES ROOT MEAN SQUARED LOG ERROR BETWEEN PREDICTIONS AND TRUE LABELS.
    """
    return np.sqrt(mean_squared_log_error(y_test, y_preds))
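A quick sanity check of `rmsle` (the prices below are made up): perfect predictions score 0, and because the metric works on log values, a 10% overshoot costs the same whether the bulldozer sold for $10,000 or $20,000.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_test, y_preds):
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

y_true = np.array([10000.0, 20000.0])
perfect = rmsle(y_true, y_true)             # 0.0 for perfect predictions
off_by_10pct = rmsle(y_true, y_true * 1.1)  # roughly log(1.1) ~ 0.095
```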
# CREATE FUNCTION TO EVALUATE MODEL ON A FEW DIFFERENT LEVELS
def show_scores(model):
    train_preds = model.predict(x_train)
    val_preds = model.predict(x_valid)
    scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
              "Valid MAE": mean_absolute_error(y_valid, val_preds),
              "Training RMSLE": rmsle(y_train, train_preds),
              "Valid RMSLE": rmsle(y_valid, val_preds),
              "Training R^2": r2_score(y_train, train_preds),
              "Valid R^2": r2_score(y_valid, val_preds)}
    return scores
TESTING MODEL ON SUBSET (TO TUNE HYPERPARAMETERS)¶
# TAKES TOO LONG FOR EXPERIMENTING
# %%time
# model = RandomForestRegressor(n_jobs=-1,
# random_state=42)
# model.fit(x_train, y_train)
# CHANGE MAX SAMPLES VALUE
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42,
                              max_samples=10000)
%%time
# CUTTING DOWN MAX NUMBER OF SAMPLES EACH ESTIMATOR CAN SEE CAN IMPROVE CALCULATION TIME
model.fit(x_train, y_train)
CPU times: user 49.5 s, sys: 876 ms, total: 50.4 s Wall time: 3.96 s
RandomForestRegressor(max_samples=10000, n_jobs=-1, random_state=42)
show_scores(model)
{'Training MAE': 5561.2988092240585, 'Valid MAE': 7177.26365505919, 'Training RMSLE': 0.257745378256977, 'Valid RMSLE': 0.29362638671089003, 'Training R^2': 0.8606658995199189, 'Valid R^2': 0.8320374995090507}
HYPERPARAMETER TUNING WITH RANDOMIZEDSEARCHCV¶
%%time
from sklearn.model_selection import RandomizedSearchCV
# DIFFERENT RANDOMFORESTREGRESSOR HYPERPARAMETERS
# NOTE: "auto" is no longer a valid max_features value in recent
# scikit-learn versions, which is why some fits fail below
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           "max_features": [0.5, 1, "sqrt", "auto"],
           "max_samples": [10000]}
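RandomizedSearchCV draws `n_iter` random combinations from this grid rather than trying every one; the sampling can be sketched directly with scikit-learn's `ParameterSampler` (the small grid below is illustrative only):

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# A cut-down hypothetical grid for demonstration
demo_grid = {"n_estimators": np.arange(10, 100, 10),
             "max_depth": [None, 3, 5, 10]}

# Draw 3 random hyperparameter combinations from the grid
combos = list(ParameterSampler(demo_grid, n_iter=3, random_state=42))
```

Each entry in `combos` is a dict that could be unpacked straight into `RandomForestRegressor(**combo)`.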
# INSTANTIATE RANDOMIZEDSEARCHCV MODEL
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=6,
                              cv=5,
                              verbose=True)

# FIT RANDOMIZEDSEARCHCV MODEL
rs_model.fit(x_train, y_train)
Fitting 5 folds for each of 6 candidates, totalling 30 fits
/Users/michaelbradley/Documents/Udemy/Data_Science_ML_Bootcamp/bulldozer-sales/env/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:425: FitFailedWarning: 15 fits failed out of a total of 30. The score on these train-test partitions for these parameters will be set to nan. If these failures are not expected, you can try to debug them by setting error_score='raise'. Below are more details about the failures: -------------------------------------------------------------------------------- 15 fits failed with the following error: Traceback (most recent call last): File "/Users/michaelbradley/Documents/Udemy/Data_Science_ML_Bootcamp/bulldozer-sales/env/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score estimator.fit(X_train, y_train, **fit_params) File "/Users/michaelbradley/Documents/Udemy/Data_Science_ML_Bootcamp/bulldozer-sales/env/lib/python3.10/site-packages/sklearn/base.py", line 1144, in wrapper estimator._validate_params() File "/Users/michaelbradley/Documents/Udemy/Data_Science_ML_Bootcamp/bulldozer-sales/env/lib/python3.10/site-packages/sklearn/base.py", line 637, in _validate_params validate_parameter_constraints( File "/Users/michaelbradley/Documents/Udemy/Data_Science_ML_Bootcamp/bulldozer-sales/env/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints raise InvalidParameterError( sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestRegressor must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'auto' instead. 
warnings.warn(some_fits_failed_message, FitFailedWarning) /Users/michaelbradley/Documents/Udemy/Data_Science_ML_Bootcamp/bulldozer-sales/env/lib/python3.10/site-packages/sklearn/model_selection/_search.py:976: UserWarning: One or more of the test scores are non-finite: [0.33423792 nan nan 0.29693349 nan 0.33435722] warnings.warn(
CPU times: user 13.3 s, sys: 4 s, total: 17.3 s Wall time: 20.6 s
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(n_jobs=-1, random_state=42), n_iter=6, param_distributions={'max_depth': [None, 3, 5, 10], 'max_features': [0.5, 1, 'sqrt', 'auto'], 'max_samples': [10000], 'min_samples_leaf': array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]), 'min_samples_split': array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]), 'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])}, verbose=True)
# FIND BEST MODEL HYPERPARAMETERS
rs_model.best_params_
{'n_estimators': 80, 'min_samples_split': 12, 'min_samples_leaf': 9, 'max_samples': 10000, 'max_features': 'sqrt', 'max_depth': 3}
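Once found, the best hyperparameters can be unpacked straight into a fresh estimator; the dict below simply mirrors the `best_params_` output above:

```python
from sklearn.ensemble import RandomForestRegressor

# Values copied from rs_model.best_params_ above
best_params = {"n_estimators": 80, "min_samples_split": 12,
               "min_samples_leaf": 9, "max_samples": 10000,
               "max_features": "sqrt", "max_depth": 3}

# ** unpacks the dict into keyword arguments
best_model = RandomForestRegressor(n_jobs=-1, random_state=42, **best_params)
```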
# EVALUATE RANDOMIZED SEARCH MODEL
show_scores(rs_model)
{'Training MAE': 13375.550442891074, 'Valid MAE': 15301.80754559165, 'Training RMSLE': 0.5656098092633168, 'Valid RMSLE': 0.5695858066162913, 'Training R^2': 0.34898657343636397, 'Valid R^2': 0.3393057252778159}
TRAIN MODEL WITH BEST HYPERPARAMETERS¶
These were found after 100 iterations of RandomizedSearchCV
%%time
# MOST IDEAL HYPERPARAMETERS
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None,
                                    random_state=42)

# FIT THE MODEL
ideal_model.fit(x_train, y_train)
CPU times: user 5min 9s, sys: 3.62 s, total: 5min 12s Wall time: 24.5 s
RandomForestRegressor(max_features=0.5, min_samples_split=14, n_estimators=40, n_jobs=-1, random_state=42)
# SCORES FOR IDEAL MODEL TRAINED ON ALL DATA
show_scores(ideal_model)
{'Training MAE': 2953.8161137163484, 'Valid MAE': 5951.247761444453, 'Training RMSLE': 0.14469006962371858, 'Valid RMSLE': 0.24524163989538328, 'Training R^2': 0.9588145522577225, 'Valid R^2': 0.8818019502450094}
# SCORES FOR RS MODEL TRAINED ON ~10,000 EXAMPLES
show_scores(rs_model)
{'Training MAE': 13375.550442891074, 'Valid MAE': 15301.807545591648, 'Training RMSLE': 0.5656098092633168, 'Valid RMSLE': 0.5695858066162913, 'Training R^2': 0.34898657343636397, 'Valid R^2': 0.3393057252778159}
MAKE PREDICTIONS ON TEST DATA¶
# IMPORT TEST DATA
df_test = pd.read_csv("bluebook-for-bulldozers/Test.csv",
                      low_memory=False,
                      parse_dates=["saledate"])
df_test.head()
SalesID | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | fiModelDesc | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1227829 | 1006309 | 3168 | 121 | 3 | 1999 | 3688.0 | Low | 2012-05-03 | 580G | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1227844 | 1022817 | 7271 | 121 | 3 | 1000 | 28555.0 | High | 2012-05-10 | 936 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
2 | 1227847 | 1031560 | 22805 | 121 | 3 | 2004 | 6038.0 | Medium | 2012-05-10 | EC210BLC | ... | None or Unspecified | 9' 6" | Manual | None or Unspecified | Double | NaN | NaN | NaN | NaN | NaN |
3 | 1227848 | 56204 | 1269 | 121 | 3 | 2006 | 8940.0 | High | 2012-05-10 | 330CL | ... | None or Unspecified | None or Unspecified | Manual | Yes | Triple | NaN | NaN | NaN | NaN | NaN |
4 | 1227863 | 1053887 | 22312 | 121 | 3 | 2005 | 2286.0 | Low | 2012-05-10 | 650K | ... | NaN | NaN | NaN | NaN | NaN | None or Unspecified | PAT | None or Unspecified | NaN | NaN |
5 rows × 52 columns
# MAKE PREDICTIONS ON TEST DATASET - not in the same format as the model was trained on
# test_preds = ideal_model.predict(df_test)
PREPROCESSING THE DATA TO GET THE SAME FORMAT AS TRAINING DATASET¶
def preprocess_data(df):
    """
    PERFORMS TRANSFORMATIONS ON DF AND RETURNS THE TRANSFORMED DF.
    """
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear

    df.drop("saledate", axis=1, inplace=True)

    # Fill numeric columns with median values (MEDIAN is more robust than MEAN)
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us if the data was missing
                df[label + "_is_missing"] = pd.isnull(content)
                # Fill missing numeric values with the median value
                df[label] = content.fillna(content.median())

    # Turn categorical variables into numbers and fill missing values
    for label, content in df.items():
        if not pd.api.types.is_numeric_dtype(content):
            # ADD BINARY COLUMN TO INDICATE WHETHER SAMPLE HAD MISSING VALUES
            df[label + "_is_missing"] = pd.isnull(content)
            # TURN CATEGORIES INTO NUMBERS AND ADD +1 - KEEPS VALUES POSITIVE, OTHERWISE MISSING BECOMES -1
            df[label] = pd.Categorical(content).codes + 1

    return df
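The date-part extraction at the top of `preprocess_data` can be sketched on a toy frame (the single sale date below is made up):

```python
import pandas as pd

# Hypothetical single-row frame with a sale date
toy = pd.DataFrame({"saledate": pd.to_datetime(["2012-05-03"])})

toy["saleYear"] = toy.saledate.dt.year
toy["saleDayOfWeek"] = toy.saledate.dt.dayofweek  # Monday=0, so Thursday=3
toy = toy.drop("saledate", axis=1)
```

Splitting the datetime into separate numeric parts gives the model features it can actually use, since tree-based models can't consume raw timestamps directly.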
# PROCESS THE TEST DATA
df_test = preprocess_data(df_test)
df_test.head()
SalesID | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | fiBaseModel | ... | Undercarriage_Pad_Width_is_missing | Stick_Length_is_missing | Thumb_is_missing | Pattern_Changer_is_missing | Grouser_Type_is_missing | Backhoe_Mounting_is_missing | Blade_Type_is_missing | Travel_Controls_is_missing | Differential_Type_is_missing | Steering_Controls_is_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1227829 | 1006309 | 3168 | 121 | 3 | 1999 | 3688.0 | 2 | 499 | 180 | ... | True | True | True | True | True | True | True | True | True | True |
1 | 1227844 | 1022817 | 7271 | 121 | 3 | 1000 | 28555.0 | 1 | 831 | 292 | ... | True | True | True | True | True | True | True | True | False | False |
2 | 1227847 | 1031560 | 22805 | 121 | 3 | 2004 | 6038.0 | 3 | 1177 | 404 | ... | False | False | False | False | False | True | True | True | True | True |
3 | 1227848 | 56204 | 1269 | 121 | 3 | 2006 | 8940.0 | 1 | 287 | 113 | ... | False | False | False | False | False | True | True | True | True | True |
4 | 1227863 | 1053887 | 22312 | 121 | 3 | 2005 | 2286.0 | 2 | 566 | 196 | ... | True | True | True | True | True | False | False | False | True | True |
5 rows × 101 columns
# FIND HOW THE COLUMNS DIFFER USING SETS
set(x_train.columns) - set(df_test.columns)
set()
# MANUALLY ADJUST DF_TEST TO HAVE AUCTIONEERID_IS_MISSING COLUMN
df_test["auctioneerID_is_missing"] = False
df_test.head()
SalesID | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | fiBaseModel | ... | Stick_Length_is_missing | Thumb_is_missing | Pattern_Changer_is_missing | Grouser_Type_is_missing | Backhoe_Mounting_is_missing | Blade_Type_is_missing | Travel_Controls_is_missing | Differential_Type_is_missing | Steering_Controls_is_missing | auctioneerID_is_missing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1227829 | 1006309 | 3168 | 121 | 3 | 1999 | 3688.0 | 2 | 499 | 180 | ... | True | True | True | True | True | True | True | True | True | False |
1 | 1227844 | 1022817 | 7271 | 121 | 3 | 1000 | 28555.0 | 1 | 831 | 292 | ... | True | True | True | True | True | True | True | False | False | False |
2 | 1227847 | 1031560 | 22805 | 121 | 3 | 2004 | 6038.0 | 3 | 1177 | 404 | ... | False | False | False | False | True | True | True | True | True | False |
3 | 1227848 | 56204 | 1269 | 121 | 3 | 2006 | 8940.0 | 1 | 287 | 113 | ... | False | False | False | False | True | True | True | True | True | False |
4 | 1227863 | 1053887 | 22312 | 121 | 3 | 2005 | 2286.0 | 2 | 566 | 196 | ... | True | True | True | True | False | False | False | True | True | False |
5 rows × 102 columns
# Match column order from x_train to df_test (to predict, the columns must be in the same order the model was fit on)
df_test = df_test[x_train.columns]
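An alternative to adding the missing column by hand is pandas' `reindex`, which inserts any absent columns and fixes the order in one step; a sketch with hypothetical column names, assuming `False` is the right fill value for a missing `_is_missing` indicator:

```python
import pandas as pd

train_cols = ["a", "b", "c"]               # hypothetical training column order
test = pd.DataFrame({"c": [1], "a": [2]})  # missing "b", wrong order

# Insert missing columns (filled with False) and reorder in one call
aligned = test.reindex(columns=train_cols, fill_value=False)
```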
# MAKE PREDICTIONS ON TEST DATA
test_preds = ideal_model.predict(df_test)
FORMAT PREDICTIONS FOR A KAGGLE SUBMISSION¶
When looking at the Kaggle submission requirements, we see that if we wanted to make a submission, the data is required to be in a certain format. Namely, a DataFrame containing the SalesID and the predicted SalePrice of the bulldozer.
Let's make it.
# Create DataFrame compatible with Kaggle submission requirements
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds
SalesID | SalePrice | |
---|---|---|
0 | 1227829 | 17030.009274 |
1 | 1227844 | 14355.535652 |
2 | 1227847 | 46623.087743 |
3 | 1227848 | 71680.261335 |
4 | 1227863 | 61762.999424 |
... | ... | ... |
12452 | 6643171 | 39966.363007 |
12453 | 6643173 | 12049.704433 |
12454 | 6643184 | 11964.850733 |
12455 | 6643186 | 16496.710793 |
12456 | 6643196 | 27119.990440 |
12457 rows × 2 columns
# Export to csv...
#df_preds.to_csv("../data/bluebook-for-bulldozers/predictions.csv",
# index=False)
Feature Importance¶
Now that we've built a model which is able to make predictions, the people you share those predictions with (or you yourself) might be curious about which parts of the data led to them.
This is where feature importance comes in. Feature importance seeks to figure out which different attributes of the data were most important when it comes to predicting the target variable.
In our case, after our model learned the patterns in the data, which bulldozer sale attributes were most important for predicting its overall sale price?
Beware: the default feature importances for random forests can lead to non-ideal results.
To find which features were most important to a machine learning model, a good idea is to search for something like "[MODEL NAME] feature importance".
Doing this for our RandomForestRegressor leads us to find the feature_importances_ attribute.
Let's check it out.
# Find feature importance of our best model
ideal_model.feature_importances_
array([3.39445533e-02, 1.81148281e-02, 4.09167072e-02, 1.70752171e-03, 3.40797459e-03, 2.08200698e-01, 2.95067052e-03, 1.10113725e-03, 4.16122668e-02, 4.71911805e-02, 6.23815431e-02, 4.67433955e-03, 1.52524442e-02, 1.52517337e-01, 4.72224713e-02, 5.96817956e-03, 1.29351899e-03, 2.78088439e-03, 2.37248769e-03, 6.17114453e-02, 8.13525488e-04, 3.61873268e-05, 9.19098115e-04, 2.23170993e-04, 1.28102678e-03, 2.06519636e-05, 2.01477316e-03, 6.63364759e-03, 2.15274492e-03, 2.50178165e-03, 4.63902393e-03, 3.85873985e-03, 2.76062667e-03, 1.00782454e-03, 2.47969268e-04, 6.04239818e-03, 7.64997072e-04, 1.57100537e-02, 2.29716203e-03, 2.58372272e-03, 8.07637426e-04, 9.18548690e-04, 1.35656446e-03, 5.81458569e-04, 4.96716928e-04, 3.79552257e-04, 5.31712788e-04, 2.71823509e-03, 8.34294376e-04, 3.12136841e-04, 2.14075157e-04, 7.42422919e-02, 3.80158492e-03, 5.67641024e-03, 2.87154703e-03, 9.83349904e-03, 2.65470837e-04, 1.57946459e-03, 3.10058108e-04, 0.00000000e+00, 0.00000000e+00, 2.27421721e-03, 1.05632062e-03, 5.42819222e-03, 3.48484864e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.90858845e-05, 9.09490682e-06, 1.31265147e-04, 5.29163902e-06, 1.11952381e-04, 4.78452431e-06, 3.43582863e-04, 5.57068428e-06, 1.07167376e-03, 3.99179008e-03, 4.07753410e-03, 1.05749617e-04, 2.76528927e-03, 2.59244312e-05, 3.51888176e-04, 2.31519337e-03, 1.99211177e-03, 4.02034629e-03, 2.03778082e-04, 1.13483313e-02, 9.02551628e-04, 1.58182497e-03, 4.63243398e-05, 2.92071004e-04, 3.11923094e-05, 1.56873538e-04, 2.87205987e-05, 3.80543083e-05, 2.55045807e-04, 1.66878572e-04, 2.10341792e-04, 1.26024842e-04, 9.40663015e-05])
import seaborn as sns

# Helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importance": importances})
          .sort_values("feature_importance", ascending=False)
          .reset_index(drop=True))
    sns.barplot(x="feature_importance",
                y="features",
                data=df[:n],
                orient="h")
plot_features(x_train.columns, ideal_model.feature_importances_)
sum(ideal_model.feature_importances_)
1.0
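Instead of plotting, the top features can also be read off numerically with a pandas Series; the feature names and importance values below are made up for illustration:

```python
import pandas as pd

# Hypothetical importances keyed by feature name
importances = pd.Series([0.2, 0.05, 0.6, 0.15],
                        index=["YearMade", "state", "ProductSize", "ModelID"])

# Largest first: ProductSize (0.6), then YearMade (0.2)
top2 = importances.nlargest(2)
```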
df.ProductSize.isna().sum()
216605
df.ProductSize.value_counts()
ProductSize Medium 64342 Large / Medium 51297 Small 27057 Mini 25721 Large 21396 Compact 6280 Name: count, dtype: int64
df.Turbocharged.value_counts()
Turbocharged None or Unspecified 77111 Yes 3985 Name: count, dtype: int64
df.Thumb.value_counts()
Thumb None or Unspecified 85074 Manual 9678 Hydraulic 7580 Name: count, dtype: int64
import os
os.system('jupyter nbconvert --execute --to html Bulldozer_Sales.ipynb')
[NbConvertApp] Converting notebook Bulldozer_Sales.ipynb to html