### Project Description

## Machine Learning Forecast (part 1): Data Cleaning & Benchmark

#### Author

Nicolas Vandeput

#### BOOK

Data Science for Supply Chain Forecast

In this article, I will show you how to apply supervised learning algorithms to predict sales. Sales forecasting (or demand planning) is an important task for many supply chain practitioners, and machine learning models can help.

## How to use supervised learning to predict sales?

We will have to convert a time series prediction problem (i.e. predicting demand over time) into a supervised machine learning problem. Remember that with supervised machine learning, the model learns the relationship between the **inputs** (denoted X) and the **outputs** (denoted Y). The question we will ask the model is:

Based on the last n months of demand, what will be the demand next month?

We will train the model by providing the data in a specific layout:

- n consecutive months of demand as the input (X data).
- the demand in the very next month as the output (Y data).

Basically we will train our model to predict the future demand based on the previous n months of demand.

Here is an example of a data set with 4 months of demand as an input (X) to predict the next month of demand (Y):

## Data set: number of cars sold in Norway

For this example we will work with the sales of cars per brand per month in Norway. The data set is available here. Most likely, this typical example will closely resemble the demand data you could be using yourself at work.

## Model in Python

### Step #1 – Data Cleaning

Before we go into data cleaning and preparation, let’s first import our favorite libraries (**Pandas** & **Numpy**) and then define the first global variables.

```python
import pandas as pd
import numpy as np

x_len = 12       # How many previous months we will use as inputs
y_len = 2        # How many months we want to forecast
y_test_len = 12  # How many months we want to keep as a test
```

As you can see we define 3 global variables:

- **x_len**: number of months we will use to predict the next one.
- **y_len**: number of months we want to predict (basically, do you want to forecast the very next month or the next 3 months?).
- **y_test_len**: number of months we want to keep aside to validate our model.

It is a best practice in Data Science to keep some data aside (i.e. the test data set) for the final validation check of the model.

In this example I will use the last 12 months (`x_len = 12`) of demand to predict the next 2 months (`y_len = 2`). I will also keep 12 months aside (`y_test_len = 12`) to test my model on these unseen data.

Let’s now move on to the data preparation. We will first extract the data from the CSV file (you can find it here). Then we will transform this data set to get the periods as columns and the brands as rows (I’ll use the pandas `pivot_table` function to do this).

As a last step, I save the dataframe as an Excel file. I like to do this to double-check that everything was done properly.

```python
# - Load the CSV file (should be in the same directory)
df = pd.read_csv("norway_new_car_sales_by_make.csv")

# - Small function to print month numbers in a format such as 01, 02, 03, etc.
def month_str(x):
    if x < 10:
        return "0" + str(x)
    else:
        return str(x)

# - Create one column with the period (format: YYYY MM)
df["Period"] = df["Year"].astype(str) + " " + df["Month"].apply(month_str)

# - Create a pivot of the data to show the periods as columns and the car makers as rows
df = pd.pivot_table(data=df, values="Quantity", index="Make", columns="Period", aggfunc='sum', fill_value=0)

# - Save the data to Excel for reference
df.to_excel("Clean Data Set.xlsx")
```

This is our df dataframe now:
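To make the layout concrete, here is a minimal sketch of the same reshaping on a tiny, made-up extract. The makes (`Make A`, `Make B`), the quantities and the names `raw` and `wide` are all hypothetical and are not taken from the real file; only the column names and the `pivot_table` call mirror the code above.

```python
import pandas as pd

# Hypothetical long-format extract: one row per make and month (made-up numbers)
raw = pd.DataFrame({
    "Year":     [2007, 2007, 2007, 2007],
    "Month":    [1, 2, 1, 2],
    "Make":     ["Make A", "Make A", "Make B", "Make B"],
    "Quantity": [100, 120, 50, 60],
})

# Same period format as above (YYYY MM); zfill(2) pads single digits with a zero
raw["Period"] = raw["Year"].astype(str) + " " + raw["Month"].astype(str).str.zfill(2)

# One row per make, one column per period
wide = pd.pivot_table(data=raw, values="Quantity", index="Make",
                      columns="Period", aggfunc="sum", fill_value=0)
print(wide)
```

Each row is one make and each column is one period, which is the shape the next step relies on when it slices the columns with `df.iloc`.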

### Step #2 – Data set creation

Now that we have a clean data set, we need to create the training data set in the proper format. This is actually not straightforward. Before we jump into the Python code, let me explain how we will create the training data set (`X_train` and `Y_train`) and the test data set (`X_test` & `Y_test`).

#### Training data set

For the training data set we will run through each possibility of X & Y data, running through the time series each time looking at a specific time window. Let’s imagine that you will use 4 months to predict the next one: in this example you use Month 1, 2, 3 & 4 data to predict Month 5, then you can offset the data set by 1 month and use Month 2, 3, 4 & 5 to predict Month 6, and so on.

Let me show you an example for a time series of 9 months. Here we can make 5 different combinations of X & Y data. In the example, each line shows the X data (regular numbers) and the Y data (bold number).

*In this example, our model will learn that if it receives a demand of 10, 12, 13 and 9, it should predict 11.*
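The sliding window can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the window 10, 12, 13, 9 → 11 comes from the text, but placing it at the start of the series and the last four values of `demand` are made up, as are the variable names.

```python
import numpy as np

# The first window (10, 12, 13, 9 -> 11) comes from the text;
# its position and the last four values are made up for illustration
demand = np.array([10, 12, 13, 9, 11, 14, 8, 12, 10])
x_len, y_len = 4, 1

# Slide a window of x_len + y_len months over the series, one month at a time
windows = []
for start in range(len(demand) - x_len - y_len + 1):
    windows.append(demand[start:start + x_len + y_len])
windows = np.array(windows)

X = windows[:, :x_len]          # the 4 input months of each window
Y = windows[:, x_len:].ravel()  # the month to predict

for x_row, y in zip(X, Y):
    print(x_row, "->", y)
```

A 9-month series with 4 input months and 1 output month gives 9 - 4 - 1 + 1 = 5 windows, which matches the 5 combinations mentioned above.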

We will have to do the same for our test data set now.

#### Test set

The test set (`X_test` & `Y_test`) will be easier to populate: we just have to set aside the final months of the data set that we want to use to test the accuracy of the model (this will be the `Y_test` dataframe) and put the previous n months in the `X_test` dataframe.

#### Python Code

If you have difficulty understanding this Python code (I did!), I advise you to use the `print()` function a lot. For example, don’t hesitate to print the df columns we are currently using: `print(df.iloc[:,col:col+x_len+y_len].columns)`

```python
# - Create the data sets

# - Training set creation: run through all the possible time windows
result = []
for col in range(df.shape[1]-x_len-y_len-y_test_len):
    x = df.iloc[:,col:col+x_len+y_len]
    result = result + x.values.tolist()
result = np.array(result)
X_train = result[:,:x_len]
Y_train = result[:,x_len:]

# - This data formatting is needed for the regression tree if we only want to predict a single month.
if y_len == 1:
    Y_train = Y_train.ravel()

# - Test set creation: unseen "future" data together with the demand just before
result = []
for col in range(df.shape[1]-x_len-y_len-y_test_len,df.shape[1]-x_len-y_len):
    x = df.iloc[:,col:col+x_len+y_len]
    result = result + x.values.tolist()
result = np.array(result)
X_test = result[:,:x_len]
Y_test = result[:,x_len:]

# - This data formatting is needed for the regression tree if we only want to predict a single month.
if y_len == 1:
    Y_test = Y_test.ravel()
```
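To see how the two `range()` calls count the windows, the arithmetic can be checked on its own without loading any data. In this sketch `n_periods` is a hypothetical stand-in for `df.shape[1]` (the real number of monthly columns may differ), and `window`, `train_starts` and `test_starts` are names introduced only for illustration.

```python
# Hypothetical number of monthly columns; in the article this is df.shape[1]
n_periods = 120
x_len, y_len, y_test_len = 12, 2, 12
window = x_len + y_len  # each sample spans 14 consecutive columns

# Same ranges as in the two loops above
train_starts = range(n_periods - window - y_test_len)
test_starts = range(n_periods - window - y_test_len, n_periods - window)

print("training windows per make:", len(train_starts))
print("test windows per make:", len(test_starts))
print("last training window covers columns",
      train_starts[-1], "to", train_starts[-1] + window - 1)
print("last test window covers columns",
      test_starts[-1], "to", test_starts[-1] + window - 1)
```

Each value of `col` is the first column of a window of `x_len + y_len` months. Subtracting `y_test_len` from the upper bound of the first range keeps the last `y_test_len` starting positions out of the training set, and the second range picks up exactly those positions. With the ranges as written, the last test window ends at column `n_periods - 2`, so the final column of the table is not part of any window.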

### Step #3 – Benchmark Creation

Before we jump into using our regression tree, let’s take some time to create a forecast benchmark. I know we want to move quickly to advanced machine learning, but it is important to have a benchmark against which we can measure the accuracy of our model.

As a benchmark we will use a linear regression, fitted with the **sklearn** library. Many Python libraries can handle linear regressions, but we will use **sklearn** because it is the same library (and logic) that we will use for the other machine learning models. If you are a fan of linear regressions, I would also recommend the **Statsmodels** library.

```python
# Import the necessary module
from sklearn.linear_model import LinearRegression

# Create a linear regression object: reg
reg = LinearRegression()

# Fit it to the training data
reg = reg.fit(X_train,Y_train)
```

Now that we have created and fitted our linear regression, let’s test this on our test data set:

```python
# Create a prediction based on our model
Y_pred_reg = reg.predict(X_test)

# Computes the Mean Absolute Error of the model
MAE_reg = np.mean(abs(Y_test - Y_pred_reg))/np.mean(Y_test)

# Print the results
print("Regression MAE%:",round(MAE_reg*100,1))
```

And we get this result: `Regression MAE%: 18.1`. If you don’t remember what the MAE is, feel free to check our article about forecast error measurement.
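As a reminder of what the code above computes, here is a small sketch of the same MAE% calculation on made-up numbers. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the Norway data; the formula is the one used for `MAE_reg`, i.e. the mean absolute error divided by the mean demand.

```python
import numpy as np

# Made-up actuals and forecasts, for illustration only
y_true = np.array([[100, 110], [20, 30]])
y_pred = np.array([[90, 120], [25, 30]])

mae = np.mean(np.abs(y_true - y_pred))  # average absolute error, in units sold
mae_pct = mae / np.mean(y_true)         # scaled by the average demand

print("MAE:", mae)
print("MAE%:", round(mae_pct * 100, 1))
```

Dividing by the average demand turns the error into a percentage, which makes it easier to compare across data sets with different sales volumes.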

## Next steps

Now that we have a benchmark and a proper data set, you can continue and use a regression tree to predict the sales.


Andres Garcia Echeverría (11 July 2019 at 23 h 02 min): The dataset is no longer available.

Lucy (4 October 2019 at 5 h 55 min): I’m trying to download the number of cars sold in Norway dataset. However, the link seems to be broken. Could you update the link? Thanks.

Lucy (6 October 2019 at 6 h 49 min): Can you elaborate a bit more about why we use “range(df.shape[1]-x_len-y_len-y_test_len)” to loop through all possible windows (Step #2, Python code line 5), i.e. how does the counting work here? Thanks.