Quick start to machine learning¶

This is condensed from the Intro to Machine Learning Kaggle course.

Build a model
Validate a model
Overfitting and underfitting
A more sophisticated model: Random forests

Python

#[1]
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

How do we know which machine learning models to use?

Images (source)

Our Scenario¶

Your cousin has made millions of dollars speculating on real estate. He’s offered to become business partners with you because of your interest in data science. He’ll supply the money, and you’ll supply models that predict how much various houses are worth.

You ask your cousin how he’s predicted real estate values in the past. and he says it is just intuition. But more questioning reveals that he’s identified price patterns from houses he has seen in the past, and he uses those patterns to make predictions for new houses he is considering.

Machine learning works the same way. We’ll start with a model called the Decision Tree. There are fancier models that give more accurate predictions, but decision trees are easy to understand.

For simplicity, we’ll start with the simplest possible decision tree.

It divides houses into only two categories. The predicted price for any house under consideration is the historical average price of houses in the same category.

We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.

We will get to the details of how the model is fit (e.g. how to split up the data). After the model has been fit, you can apply it to new data to predict prices of additional homes.

Improving the Decision Tree¶

You can capture more factors affecting home price using a tree that has more “splits.” These are called “deeper” trees. A decision tree that also considers the total size of each house’s lot might look like this:

You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house’s characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.

The splits and values at the leaves will be determined by the data, so it’s time for you to check out the data you will be working with.

Load the data!¶

Python

#[2]
# save filepath to variable for easier access
melbourne_file_path = 'data/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)

melbourne_data.head()
# print a summary of the data in Melbourne data
# melbourne_data.describe()

Output

	Suburb	Address	Rooms	Type	Price	Method	SellerG	Date	Distance	Postcode	...	Bathroom	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount
0	Abbotsford	85 Turner St	2	h	1480000.0	S	Biggin	3/12/2016	2.5	3067.0	...	1.0	1.0	202.0	NaN	NaN	Yarra	-37.7996	144.9984	Northern Metropolitan	4019.0
1	Abbotsford	25 Bloomburg St	2	h	1035000.0	S	Biggin	4/02/2016	2.5	3067.0	...	1.0	0.0	156.0	79.0	1900.0	Yarra	-37.8079	144.9934	Northern Metropolitan	4019.0
2	Abbotsford	5 Charles St	3	h	1465000.0	SP	Biggin	4/03/2017	2.5	3067.0	...	2.0	0.0	134.0	150.0	1900.0	Yarra	-37.8093	144.9944	Northern Metropolitan	4019.0
3	Abbotsford	40 Federation La	3	h	850000.0	PI	Biggin	4/03/2017	2.5	3067.0	...	2.0	1.0	94.0	NaN	NaN	Yarra	-37.7969	144.9969	Northern Metropolitan	4019.0
4	Abbotsford	55a Park St	4	h	1600000.0	VB	Nelson	4/06/2016	2.5	3067.0	...	1.0	2.0	120.0	142.0	2014.0	Yarra	-37.8072	144.9941	Northern Metropolitan	4019.0

Python

#[3]
melbourne_data.shape

Output

(13580, 21)

Selecting Data for Modeling¶

The dataset has too many variables to wrap your head around, or even to print out nicely. How can you pare down this overwhelming amount of data to something you can understand?

We’ll start by picking a few variables using our intuition. There are also statistical techniques to automatically prioritize variables that we are not covering.

To choose variables/columns, we’ll need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame (the bottom line of code below).

Python

#[4]
melbourne_data.columns

Output

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

The Melbourne data has some missing values (some houses for which some variables weren’t recorded). We will take the simplest option for now, and drop houses from our data.

Python

#[5]
# dropna drops missing values
melbourne_data = melbourne_data.dropna(axis=0)

Selecting The Prediction Target¶

You can pull out a variable using square brackets: ['col_name']. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

We’ll use this bracket notation to select the column we want to predict (Price), which is called the prediction target. By convention, the prediction target is called y. Therefore, the code we need to save the house prices in the Melbourne data is the following:

Python

#[6]
y = melbourne_data['Price']

Choosing “Features”¶

The columns that are inputted into our model (and later used to make predictions) are called features. In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you’ll be better off with fewer features.

For now, we’ll build a model with only a few features. Later on you’ll see how to iterate and compare models built with different features.

By convention, this data is called X.

Python

#[7]
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.head()

Output

	Rooms	Bathroom	Landsize	Lattitude	Longtitude
1	2	1.0	156.0	-37.8079	144.9934
2	3	2.0	134.0	-37.8093	144.9944
4	4	1.0	120.0	-37.8072	144.9941
6	3	2.0	245.0	-37.8024	144.9993
7	2	1.0	256.0	-37.8060	144.9954

Building Your Model¶

You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

Define: What question are you asking? What type of model bests answers the question? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like
Evaluate: Determine how accurate the model’s predictionsare.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

Python

#[8]
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

Output

DecisionTreeRegressor(random_state=1)

We set random_state=1 in order to remove variability from run to run.

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run.

Global optima

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

We now have a fitted model that we can use to make predictions.

In practice, you’ll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we’ll make predictions for the first few rows of the training data to see how the predict function works.

Python

#[9]
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
y_pred = melbourne_model.predict(X.head())
print(y_pred)

Output

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]

You’ve built a model. But how good is it?

Experimenting With Different Models, Underfitting and Overfitting¶

Now that you have a reliable way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?

You can see in scikit-learn’s documentation that the decision tree model has many options. The most important options determine the tree’s depth. A tree’s depth is a measure of how many splits it makes before coming to a prediction.

If a tree has 10 splits, data will be split into up to \(2^{10}\) groups of houses, or 1024 leaves.

Overfitting: When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses may make very unreliable predictions for new data.

Underfitting: If our tree very shallow, it doesn’t divide up the houses into very distinct groups. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.

We want to find the model between underfitting and overfitting. Visually, we want the low point of the (red) validation curve.

../../_images/underfitting-overfitting.png

Example¶

Let’s use max_depth argument to control overfitting vs underfitting. We can use a utility function to help compare MAE scores from different values for max_depth:

Python

#[16]
def mae(max_depth, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return(mae)

We can use a for-loop to compare the accuracy of models built with different values for max_depth.

Python

#[17]
# compare MAE with differing values of max_depth
for max_depth in [1, 5, 10, 20, 50]:
    my_mae = mae(max_depth, train_X, val_X, train_y, val_y)
    print("Max depth: %d  \t\t Mean Absolute Error:  %d" %(max_depth, my_mae))

Output

Max depth: 1             Mean Absolute Error:  439583
Max depth: 5             Mean Absolute Error:  299691
Max depth: 10            Mean Absolute Error:  252176
Max depth: 20            Mean Absolute Error:  272924
Max depth: 50            Mean Absolute Error:  271598

Of the options listed, what is the optimal max depth?

Python

#[18]
# Let's run the optimal model with 500 notes
melbourne_model = DecisionTreeRegressor(max_depth=10, random_state=0)
melbourne_model.fit(train_X, train_y)

Output

DecisionTreeRegressor(max_depth=10, random_state=0)

Python

#[19]
# Can we get a visual?
import graphviz
from sklearn.tree import export_graphviz


dot_data = export_graphviz(melbourne_model, out_file=None,
                            feature_names=melbourne_features)
graph = graphviz.Source(dot_data)
graph.render("DecisionTree")

Output

'DecisionTree.pdf'

Here’s the takeaway: Models can suffer from either: - Overfitting: capturing spurious patterns that won’t recur in the future, leading to less accurate predictions, or - Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn’t used in model training, to measure a candidate model’s accuracy. This lets us try many candidate models and keep the best one.

Random Forests ¶

Decision trees leave you with a difficult choice. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Even today’s most sophisticated modeling techniques face this tension between underfitting and overfitting. But, many models have clever ideas that can lead to better performance. We’ll look at the random forest as an example.

The random forest algorithm uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.

A random forest is a meta estimator that fits a number of decision tree models on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

We build a random forest model similarly to how we built a decision tree in scikit-learn - this time using the RandomForestRegressor class instead of DecisionTreeRegressor.

Python

#[20]
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

Output

207190.6873773146

Conclusion¶

There is likely room for further improvement, but this is a big improvement over the best decision tree error of 250,000. There are parameters which allow you to change the performance of the Random Forest much as we changed the maximum depth of the single decision tree. But one of the best features of Random Forest models is that they generally work reasonably even without this tuning.

Your Turn¶

Try Using a Random Forest model yourself and see how much it improves your model.
Visit kaggle for Intermediate ML and Intro to Deep Learning courses.

Quick start to machine learning¶

Our Scenario¶

Improving the Decision Tree¶

Load the data!¶

Selecting Data for Modeling¶

Selecting The Prediction Target¶

Choosing “Features”¶

Building Your Model¶

Model Validation¶

The Problem with “In-Sample” Scores¶

Coding It¶

Experimenting With Different Models, Underfitting and Overfitting¶

Example¶

Random Forests ¶

Conclusion¶

Your Turn¶

Quick start to machine learning¶

Our Scenario¶

Improving the Decision Tree¶

Load the data!¶

Selecting Data for Modeling¶

Selecting The Prediction Target¶

Choosing “Features”¶

Building Your Model¶

Model Validation¶

The Problem with “In-Sample” Scores¶

Coding It¶

Experimenting With Different Models, Underfitting and Overfitting¶

Example¶

Random Forests¶

Conclusion¶

Your Turn¶

Random Forests ¶