
Machine Learning Forecast (part 2): Your first tree

Author: Nicolas Vandeput
Book: Data Science for Supply Chain Forecast

This article is part of a series on how to apply machine learning to forecasting. We advise you to read the first article first.

Recap

In our first article we:

  • discussed how to apply supervised machine learning to forecasting
  • cleaned the Norwegian car sales data set
  • created our X_train, Y_train, X_test and Y_test data sets
  • created a first benchmark (a linear regression that achieved an 18.1% forecast error).

In this article we will continue where we left off and create our first regression tree to predict car sales in Norway.

Regression Tree

Description

Regression trees are a class of machine learning algorithms that build a map of questions (actually a tree) to make a prediction. The tree starts at its root with a first Yes/No question. Based on the answer, it continues down a specific road (branch), asking new Yes/No questions until it reaches a final prediction (what we call a leaf).

You can see below an example from Wikipedia with the famous Titanic case:

Example of a decision tree (Wikipedia)
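
To make this concrete, below is a minimal toy sketch (all the data in it is made up for illustration) that fits a tiny regression tree with scikit-learn and prints its map of Yes/No questions:

# - A toy example: fit a tiny tree on made-up data and print its questions
from sklearn.tree import DecisionTreeRegressor, export_text

X = [[1], [2], [3], [10], [11], [12]]  # a single made-up feature
Y = [5, 6, 5, 50, 52, 51]              # made-up target values

toy_tree = DecisionTreeRegressor(max_depth=2)
toy_tree.fit(X, Y)

# - export_text prints the Yes/No questions (splits) and the leaves
print(export_text(toy_tree, feature_names=["feature"]))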

Actually, there are many different supervised machine learning algorithms that we could use to forecast our Norwegian car sales. Nevertheless, we will start with a regression tree, as these are easy to understand and use.


Limitations

Decision trees (regression trees and classification trees) have difficulty learning complex relationships or specific logic. For such complex relationships, neural networks could work better.

Note also that the results of regression trees are partly random: when the tree is allowed to use fewer features than the data set contains, the features considered at each split are selected at random. This is not really a problem, but you have to be aware that each time you refit your tree it may be slightly different, unless you fix the random seed.
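
Here is a minimal sketch of this behaviour on made-up data; random_state is the standard scikit-learn parameter that fixes the seed:

# - Illustration on made-up data with 11 features
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 11))
Y = rng.normal(size=100)
X_new = rng.normal(size=(20, 11))  # made-up unseen data

# - With max_features < 11, each split samples features at random,
#   so two fits can produce different trees
t1 = DecisionTreeRegressor(max_depth=3, max_features=5).fit(X, Y)
t2 = DecisionTreeRegressor(max_depth=3, max_features=5).fit(X, Y)
print(np.allclose(t1.predict(X_new), t2.predict(X_new)))  # often False

# - Fixing random_state makes the tree reproducible
t3 = DecisionTreeRegressor(max_depth=3, max_features=5, random_state=42).fit(X, Y)
t4 = DecisionTreeRegressor(max_depth=3, max_features=5, random_state=42).fit(X, Y)
print(np.allclose(t3.predict(X_new), t4.predict(X_new)))  # True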

Parameters

Each tree is defined by multiple parameters. We will only look into the most important ones:

  • Max depth: the maximum number of consecutive questions (splits) the tree can ask.
  • Max features: the maximum number of features (columns of the X data set) the tree can consider when looking for each split.
  • Min samples leaf: the minimum number of observations that need to be in a leaf. This is a very important parameter: the closer it is to 0, the higher the risk of overfitting (see the sketch below).

Depending on your data set, you might want to give different values to these parameters. We will see in another article how to optimize them.
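
To illustrate why min samples leaf matters, here is a minimal sketch on made-up data: the smaller the value, the more the tree can memorize its training set, and a training error close to zero is a typical sign of overfitting:

# - Illustration on made-up data: signal plus noise
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Y = X[:, 0] + rng.normal(scale=0.5, size=200)

# - The smaller min_samples_leaf, the lower the training error
for leaf in (1, 5, 20):
    t = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0).fit(X, Y)
    train_mae = np.mean(np.abs(Y - t.predict(X)))
    print("min_samples_leaf =", leaf, "training MAE =", round(train_mae, 3))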

In Python

Step #1 – Model creation

We will simply use the sklearn library and create an instance of DecisionTreeRegressor:

# - Import module from sklearn
from sklearn.tree import DecisionTreeRegressor

# - Instantiate a Decision Tree Regressor
tree = DecisionTreeRegressor(max_depth=5, max_features=11, min_samples_leaf=5)

# - Fit the tree to the training data
tree.fit(X_train, Y_train)

We then use the .fit() method, which trains the model on the training data set we made in the first article.

Step #2 – Test

Let’s now compare this model against the benchmark.

# Import NumPy (used to compute the error)
import numpy as np

# Create a prediction based on our model
Y_pred_tree = tree.predict(X_test)

# Compute the Mean Absolute Error of the model,
# scaled by the average demand (MAE%)
MAE_tree = np.mean(abs(Y_test - Y_pred_tree)) / np.mean(Y_test)

# Print the results
print("Tree MAE%:", round(MAE_tree * 100, 1))

The .predict() method takes an X data set (the input) and outputs a prediction (a Y data set).
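
As a quick sketch of this contract (reusing the tree and X_test from above): .predict() expects one row per observation and returns one forecast per row.

# - Predicting a single observation: slicing keeps the 2-D shape
single_row = X_test[:1]
print(tree.predict(single_row))  # an array with a single forecast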

I got this result: Tree MAE%: 21.3. As mentioned above, trees are built with some randomness, so you could get slightly different results.

Next steps

Do you remember that the benchmark (a linear regression) was 18.1%? It seems that our tree is not yet able to beat it. So should we stop here and conclude that machine learning can’t beat linear regression?

Well. No.
