Machine Learning Forecast (part 2): Your first tree
Data Science for Supply Chain Forecast
This article is part of a series on how to apply machine learning to forecasting. We advise you to read the first article first.
In our first article we:
- discussed how to apply supervised machine learning to forecasting
- cleaned the Norwegian car sales data set
- created our training and test data sets
- created a first benchmark (a linear regression that achieved an 18.1% forecast error).
In this article we will continue where we left off and create our first regression tree to predict car sales in Norway.
Regression trees are a class of machine learning algorithms that build a map (a tree, actually) of questions to make a prediction. The tree starts at its root with a first yes/no question. Based on the answer, it continues down a specific branch, asking new yes/no questions, until it reaches a final prediction (what we call a leaf).
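To make this concrete, here is a minimal hand-written sketch of a tree as nested yes/no questions. The questions, thresholds, and leaf values below are all made up for illustration; a real regression tree learns them from the data.

```python
# Hypothetical hand-written "tree": predict monthly car sales from
# (month, last_month_sales). All thresholds and leaf values are invented.
def tiny_tree_predict(month, last_month_sales):
    # Question 1 (root): are we in the first half of the year?
    if month <= 6:
        # Question 2 on this branch: were last month's sales strong?
        if last_month_sales > 10000:
            return 11500  # leaf: final prediction
        return 9000       # leaf
    # Other branch of question 1
    if last_month_sales > 12000:
        return 12500      # leaf
    return 10500          # leaf

# Follow the branches: month 3 <= 6, then sales 11000 > 10000
print(tiny_tree_predict(3, 11000))  # 11500
```

A fitted regression tree is exactly this kind of nested if/else structure, except that the algorithm chooses the questions and leaf values automatically.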
There are actually many different supervised machine learning algorithms we could use to forecast our Norwegian car sales. Nevertheless, we will start with a regression tree, as these are easy to understand and use.
Decision trees (regression trees and classification trees) have difficulty learning complex relationships or specific logic. For these complex relationships, neural networks can work better.
Note also that the results of regression trees are partly random: the generation of the tree relies on some random criteria selection (for example, which features are considered at each split). This is not really a problem, but you have to be aware that each time you refit your tree, it will be slightly different.
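If you need reproducible results, scikit-learn lets you fix this randomness with the random_state parameter. Here is a small sketch on synthetic data (the data set and parameter values are placeholders, not the Norwegian car sales data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 200 rows, 11 features, a simple linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))
Y = 3 * X[:, 0] + rng.normal(size=200)

# With max_features < number of features, the tree samples a random
# feature subset at each split; fixing random_state makes this repeatable.
tree_a = DecisionTreeRegressor(max_depth=5, max_features=5, random_state=42)
tree_b = DecisionTreeRegressor(max_depth=5, max_features=5, random_state=42)
tree_a.fit(X, Y)
tree_b.fit(X, Y)

# Same seed, same data -> identical trees, identical predictions
print(np.allclose(tree_a.predict(X), tree_b.predict(X)))  # True
```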
Each tree is defined based on multiple parameters. We will only look into the most important:
- Max depth: the maximum number of consecutive questions the tree can ask.
- Max features: the maximum number of features (columns of the X data set) the tree can use.
- Min samples leaf: the minimum number of observations that need to be in a leaf. This is a very important parameter. The closer this is to 0, the higher the risk of overfitting.
Depending on your data set you might want to give different values to these parameters. We will see in another article how to optimize this.
Step #1 – Model creation
We will simply use the scikit-learn library and create an instance of DecisionTreeRegressor.
# - Import module from sklearn
from sklearn.tree import DecisionTreeRegressor
# - Instantiate a Decision Tree Regressor
tree = DecisionTreeRegressor(max_depth=5, max_features=11, min_samples_leaf=5)
# - Fit the tree to the training data
tree.fit(X_train, Y_train)
We then use the .fit() method to train the model on the training data set we created in the first article.
Step #2 – Test
Let’s now compare this model against the benchmark.
# Create a prediction based on our model
Y_pred_tree = tree.predict(X_test)
# Compute the scaled Mean Absolute Error (MAE%) of the model
MAE_tree = np.mean(abs(Y_test - Y_pred_tree)) / np.mean(Y_test)
# Print the results
print("Tree MAE%:", round(MAE_tree * 100, 1))
The .predict() method takes an X data set (the input features) and returns the corresponding predictions (a Y data set).
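As a side note, scikit-learn also ships a ready-made mean_absolute_error function that can replace the manual numpy computation. A small stand-alone sketch with made-up numbers (in the article, the actuals and predictions come from Y_test and tree.predict(X_test)):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actuals and predictions, for illustration only
y_actual = np.array([100.0, 120.0, 90.0])
y_forecast = np.array([110.0, 115.0, 95.0])

# Scaled MAE (MAE%) as in the article: mean absolute error / mean actuals
mae_pct = mean_absolute_error(y_actual, y_forecast) / np.mean(y_actual)
print(round(mae_pct * 100, 1))  # 6.5
```

Both formulations give the same number; the library function simply makes the intent explicit.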
I got this result:
Tree MAE%: 21.3
As mentioned, trees are grown with some randomness, so you could get slightly different results.
Do you remember that the benchmark (a linear regression) was 18.1%? It seems that our tree is not yet able to beat it. So should we stop here and conclude that machine learning can’t beat linear regression?