Project Description

Machine Learning Forecast (part 1): Data Cleaning & Benchmark


Nicolas Vandeput


Data Science for Supply Chain Forecast

In this article, I will show you how to apply supervised learning algorithms to predict sales. Sales forecasting (or demand planning) is an important task for many supply chain practitioners, and machine learning models can help.

How to use supervised learning to predict sales?

We will have to convert a time series prediction problem (i.e. predicting demand over time) into a supervised machine learning problem. Remember that in supervised machine learning, the model learns the relationship between the inputs (denoted X) and the outputs (denoted Y). The question we will ask the model is:

Based on the last n months of demand, what will the demand be next month?

We will train the model by providing it with data in a specific layout:

  • n consecutive months of demand as the input (X data).
  • the demand in the very next month as the output (Y data).

Basically we will train our model to predict the future demand based on the previous n months of demand.

Here is an example of a data set with 4 months of demand as an input (X) to predict the next month of demand (Y):

    \[
    \begin{tabular}{|c|c|c|c|c|}
    \hline
    \multicolumn{4}{|c|}{X (input)} & Y (output) \\
    \hline
    10 & 12 & 13 & 9 & 11 \\
    1 & 1 & 3 & 1 & 2 \\
    15 & 17 & 21 & 25 & 28 \\
    5 & 12 & 7 & 13 & 8 \\
    \hline
    \end{tabular}
    \]

Data set: number of cars sold in Norway

For this example we will work with the monthly car sales per brand in Norway. The data set is available here. Most likely, this typical example will closely resemble the demand data you use yourself at work.

Extract of the data set

Model in Python

Step #1 – Data Cleaning

Before we go into data cleaning and preparation, let’s first import our favorite libraries (Pandas & Numpy) and then define the first global variables.

 import pandas as pd
 import numpy as np
 x_len = 12 # How many previous months we will use as inputs
 y_len = 2 # How many months we want to forecast
 y_test_len = 12 # How many months we want to keep as a test

As you can see we define 3 global variables:

  • x_len: number of previous months we will use as inputs.
  • y_len: number of months we want to predict (i.e. do you want to forecast only the very next month, or the next 3 months?).
  • y_test_len: number of months we want to keep aside to validate our model.

It is a best practice in Data Science to keep some data aside (i.e. the test data set) for the final validation check of the model.

In this example I will use the last 12 months (x_len = 12) of demand to predict the next 2 months (y_len = 2). I will also keep 12 months aside (y_test_len = 12) to test my model on this unseen data.

Let’s now move on to the data preparation. We will first extract the data from the CSV file (you can find it here). Then we will transform this data set to get the periods as columns and the brands as rows (I’ll use the pandas pivot_table function to do this).

As a last step, I save the dataframe as an Excel file. I like to do this to double-check that everything has been done properly.

# - Load the CSV file (should be in the same directory)
df = pd.read_csv("norway_new_car_sales_by_make.csv")

# - Small function to format month numbers with a leading zero (01, 02, 03, etc.)
def month_str(x):
    if x < 10:
        return "0" + str(x)
    # Months 10, 11 and 12 need no padding
    return str(x)

# - Create one column with the period (format: YYYY MM)
df["Period"] = df["Year"].astype(str) +" " + df["Month"].apply(month_str)

# - Create a pivot of the data to show the periods as the column and the car makers on the rows
df = pd.pivot_table(data=df,values="Quantity",index="Make",columns="Period",aggfunc='sum',fill_value=0)

# - Print data to excel for reference
df.to_excel("Clean Data Set.xlsx")
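
To see what the period column and the pivot do, here is a minimal sketch on a few made-up rows. The column names (Year, Month, Make, Quantity) follow the code above; the quantities are invented for illustration and are not taken from the real file. It also uses the pandas string method zfill as an alternative way to zero-pad the month.

```python
import pandas as pd

# Hypothetical rows mimicking the layout of the CSV file (made-up quantities)
raw = pd.DataFrame({
    "Year":     [2016, 2016, 2016, 2016],
    "Month":    [9, 10, 9, 10],
    "Make":     ["Audi", "Audi", "Volvo", "Volvo"],
    "Quantity": [100, 120, 80, 90],
})

# Zero-pad the month so that periods sort chronologically ("2016 09" < "2016 10")
raw["Period"] = raw["Year"].astype(str) + " " + raw["Month"].astype(str).str.zfill(2)

# One row per car maker, one column per period
pivot = pd.pivot_table(data=raw, values="Quantity", index="Make",
                       columns="Period", aggfunc="sum", fill_value=0)
print(pivot)
```

Without the leading zero, "2016 10" would sort before "2016 9", and the columns of the pivot would no longer be in chronological order.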

This is our df dataframe now:

Dataframe after data cleaning process

Step #2 – Data set creation

Now that we have a clean data set, we need to create the training data set in the proper format. This is actually not straightforward. Before we jump into the Python code, let me explain how we will create the training data set (X_train & Y_train) and the test data set (X_test & Y_test).

Training data set

For the training data set we will go through every possible combination of X & Y data by sliding a time window along the time series. Let’s imagine that you use 4 months to predict the next one: you use the data of Months 1, 2, 3 & 4 to predict Month 5, then you shift the window by 1 month and use Months 2, 3, 4 & 5 to predict Month 6, and so on.

Let me show you an example below for a time series of 9 months. Here we can make 5 different combinations of X & Y data. In the example below, each line shows the X data (regular numbers) and the Y data (bold number).

    \[
    \begin{tabular}{|c|c|c|c|c|c|c|c|c|}
    \hline
    M1 & M2 & M3 & M4 & M5 & M6 & M7 & M8 & M9 \\
    \hline
    10 & 12 & 13 & 9 & \textbf{11} & & & & \\
    & 12 & 13 & 9 & 11 & \textbf{13} & & & \\
    & & 13 & 9 & 11 & 13 & \textbf{14} & & \\
    & & & 9 & 11 & 13 & 14 & \textbf{8} & \\
    & & & & 11 & 13 & 14 & 8 & \textbf{10} \\
    \hline
    \end{tabular}
    \]

In this example, our model will learn that if it receives a demand of 10, 12, 13 and 9, it should predict 11.
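
As a minimal sketch of this sliding window, the snippet below rebuilds the 5 combinations from the 9-month series in the table, using 4 months as input and 1 month as output. The variable names are only illustrative; the full data set creation on the real dataframe comes later in the article.

```python
import numpy as np

# The 9 months of demand from the table above
demand = np.array([10, 12, 13, 9, 11, 13, 14, 8, 10])
x_len = 4  # months used as input
y_len = 1  # months to predict

# Slide a window of x_len + y_len months over the series, one month at a time
n_windows = len(demand) - x_len - y_len + 1
windows = np.array([demand[i:i + x_len + y_len] for i in range(n_windows)])

X = windows[:, :x_len]  # first x_len values of each window
Y = windows[:, x_len:]  # remaining y_len values of each window

print(X)
print(Y.ravel())
```

Each row of X matches the regular numbers on one line of the table, and the corresponding entry of Y is the bold number on that line.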

We will have to do the same for our test data set now.

Test set

The test set (X_test & Y_test) is easier to populate: we keep aside the final months of the data set to test the accuracy of the model (these go into Y_test) and put the n months that precede them into X_test.

Python Code

If you have difficulty understanding this Python code (I did!), I advise you to use the print() function a lot. For example, don’t hesitate to print the df columns we are currently using: print(df.iloc[:,col:col+x_len+y_len].columns)

# - Create the data sets

# - Training set creation: run through all the possible time windows
result = []
for col in range(df.shape[1]-x_len-y_len-y_test_len):
 x = df.iloc[:,col:col+x_len+y_len]
 result = result + x.values.tolist()
result = np.array(result)
X_train = result[:,:x_len]
Y_train = result[:,x_len:]

# - This data formatting is needed for the regression tree if we only want to predict a single month.
if y_len == 1:
 Y_train = Y_train.ravel()

# - Test set creation: unseen "future" data together with the demand just before
result = []
for col in range(df.shape[1]-x_len-y_len-y_test_len,df.shape[1]-x_len-y_len):
 x = df.iloc[:,col:col+x_len+y_len]
 result = result + x.values.tolist()
result = np.array(result)
X_test = result[:,:x_len]
Y_test = result[:,x_len:]

# - This data formatting is needed for the regression tree if we only want to predict a single month.
if y_len == 1:
 Y_test = Y_test.ravel()
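
A quick way to check the logic is to run the same window arithmetic on a tiny dataframe and look at the resulting shapes. The sketch below wraps the two loops in a helper, make_windows, which is a hypothetical name introduced here for illustration; the toy values and the small x_len, y_len and y_test_len are arbitrary.

```python
import numpy as np
import pandas as pd

def make_windows(df, start, stop, x_len, y_len):
    """Stack the windows starting at columns start..stop-1 into X and Y arrays."""
    rows = []
    for col in range(start, stop):
        rows.extend(df.iloc[:, col:col + x_len + y_len].values.tolist())
    rows = np.array(rows)
    return rows[:, :x_len], rows[:, x_len:]

# Toy data: 2 brands, 10 periods (values are arbitrary)
toy = pd.DataFrame(np.arange(20).reshape(2, 10), index=["A", "B"])
x_len, y_len, y_test_len = 3, 1, 2

# Same index arithmetic as the loops above: column of the first test window
split = toy.shape[1] - x_len - y_len - y_test_len
X_train, Y_train = make_windows(toy, 0, split, x_len, y_len)
X_test, Y_test = make_windows(toy, split, split + y_test_len, x_len, y_len)

print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
```

Every window contributes one row per brand, so the number of rows is the number of windows multiplied by the number of brands.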

Step #3 – Benchmark Creation

Before we jump into using our regression tree, let’s take some time to create a forecast benchmark. I know we want to move on quickly to advanced machine learning, but it is important to have a benchmark against which we can compare the accuracy of our model.

As a benchmark we will use a linear regression, which we will fit with the sklearn library. Many Python libraries can deal with linear regressions, but we will use sklearn because it is the same library (and logic) that we will use for the other machine learning models. If you are a fan of linear regressions, I would also recommend the Statsmodels library.

# Import the necessary module
from sklearn.linear_model import LinearRegression

# Create a linear regression object: reg
reg = LinearRegression()

# Fit it to the training data
reg = reg.fit(X_train,Y_train)
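
Since y_len = 2, Y_train has two columns. LinearRegression accepts such a two-dimensional target and returns one prediction per output month. Here is a small sketch on made-up toy data (a demand that grows by 1 each month), with illustrative variable names:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: 3 input months, 2 output months, demand growing by 1 per month
X_toy = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]])
Y_toy = np.array([[4, 5], [5, 6], [6, 7], [8, 9]])

# Fit one linear model with two outputs
toy_reg = LinearRegression().fit(X_toy, Y_toy)

# Predict the 2 months that follow 10, 11, 12
Y_hat = toy_reg.predict(np.array([[10, 11, 12]]))
print(Y_hat.shape)
print(Y_hat.round(1))
```

The prediction has one row per input row and one column per forecasted month.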

Now that we have created and fitted our linear regression, let’s test this on our test data set:

# Create a prediction based on our model
Y_pred_reg = reg.predict(X_test)

# Compute the Mean Absolute Error of the model, scaled by the average demand (MAE%)
MAE_reg = np.mean(np.abs(Y_test - Y_pred_reg))/np.mean(Y_test)

# Print the results
print("Regression MAE%:",round(MAE_reg*100,1))

And we get this result: Regression MAE%: 18.1. If you don’t remember what the MAE is, feel free to check our article about forecast error measurement.
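
As a reminder of what this figure measures, here is a small worked example of the same MAE% computation. The demand and forecast values are made up for illustration only.

```python
import numpy as np

# Made-up demand and forecast values, for illustration only
y_true = np.array([10, 20, 30, 40])
y_pred = np.array([12, 18, 33, 37])

mae = np.mean(np.abs(y_true - y_pred))  # average absolute error, in units
mae_pct = mae / np.mean(y_true)         # scaled by the average demand

print("MAE:", mae)
print("MAE%:", round(mae_pct * 100, 1))
```

The absolute errors are 2, 2, 3 and 3, so the MAE is 2.5 units. Divided by the average demand of 25, this gives an MAE% of 10.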

Next steps

Now that we have a benchmark and a proper data set, you can continue and use a regression tree to predict the sales.
