Forecasting hierarchical time series using deep learning in a commercial setting

Note: This article was originally written for another purpose but I thought I’d publish a copy on my blog too.

A recent paper by Mancuso, Piccialli, and Sudoso, titled "A machine learning approach for forecasting hierarchical time series" (May 2021), showed success in hierarchical time-series forecasting in the Italian retail sector.

Experiment setup

The purpose of the research was to build a model for forecasting the sales volume of 4 different brands of pasta, and the 118 item-level SKUs that roll up into them, 1 week ahead, for a single Italian grocery store. Crucially, the predictions had to be at the brand and item level, and not just the aggregate level. Traditionally this has been done by producing an aggregate-level forecast first, then disaggregating it down to the item or brand level based on historical proportions.
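To make the traditional approach concrete, here is a minimal sketch of top-down disaggregation using historical proportions. It shows two common ways of computing the shares, which this article lists later as AHP (the average of the per-period ratios) and PHA (the ratio of the averages). The item names, sales figures, and the 100-unit aggregate forecast are made up for illustration and do not come from the paper.

```python
def average_historical_proportions(item_sales, total_sales):
    """AHP-style share: the mean of the per-period ratios item / total."""
    # Periods with zero total sales are skipped to avoid dividing by zero
    # (an assumption made for this sketch).
    ratios = [i / t for i, t in zip(item_sales, total_sales) if t > 0]
    return sum(ratios) / len(ratios)


def proportions_of_historical_averages(item_sales, total_sales):
    """PHA-style share: the item's average sales over the average total sales."""
    # Both averages are taken over the same periods, so the 1/T factors cancel.
    return sum(item_sales) / sum(total_sales)


# Hypothetical history: four periods of sales for two items.
history = {"item_a": [10, 20, 30, 40], "item_b": [30, 20, 30, 0]}
totals = [sum(period) for period in zip(*history.values())]  # [40, 40, 60, 40]

ahp = {name: average_historical_proportions(s, totals) for name, s in history.items()}
pha = {name: proportions_of_historical_averages(s, totals) for name, s in history.items()}

# Split a hypothetical aggregate forecast using the AHP shares.
aggregate_forecast = 100.0
item_forecasts = {name: aggregate_forecast * share for name, share in ahp.items()}
print(ahp, pha, item_forecasts)
```

Both sets of shares are fixed once computed from history, which is why these methods cannot react when the relationship between the total and an individual item changes.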

This research followed a similar two-step process but utilized a different method for disaggregation. Here they used a collection of deep neural network models, and this is what makes their approach a novel one.

The aggregate (store level) forecast model

Here the researchers built a composite model that combined several established techniques: ARIMAX, NARX, and ETS. Note that composite models, which combine results from several other models, are often found to perform better than relying on a single model for forecasting.

ARIMAX: Autoregressive Integrated Moving Average (ARIMA) models are an established method for time-series forecasting, with the ARIMAX extension allowing for the inclusion of exogenous variables. In this case, these were things like day of the week or promotions.

NARX: This is a form of non-linear autoregression built using a neural network, and it includes exogenous variables. It has been shown to be quite effective at time-series forecasting and converges on a solution more quickly than the deep learning models we will discuss later.

ETS: Exponential smoothing – a form of time series forecasting in which the model uses exponentially decreasing weights for past observations such that more recent sales will have a higher weighting.
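As a rough illustration of the exponentially decreasing weights, here is a sketch of simple exponential smoothing, the most basic member of the exponential smoothing family (no trend or seasonality). The smoothing parameter `alpha` and the sample series are arbitrary choices for the example, and the paper's ETS model may well be a richer variant.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast after processing every observation."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    for y in series[1:]:
        # New level = weighted blend of the latest observation and the old level.
        level = alpha * y + (1 - alpha) * level
    return level


def recent_weights(n, alpha):
    """Weights placed on the n most recent observations, newest first.

    These hold for every observation except the initial one, which only
    enters through the starting level.
    """
    return [alpha * (1 - alpha) ** k for k in range(n)]


forecast = simple_exponential_smoothing([10, 12, 14], alpha=0.5)
weights = recent_weights(3, alpha=0.5)
print(forecast, weights)
```

A larger `alpha` makes the forecast react faster to recent sales, while a smaller one smooths over more history.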

Note that the choice of forecasting model here is independent of the use of deep neural networks. High quality forecasts produced through other means can also be used for disaggregation.
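This article does not describe how the three component forecasts were combined, so the sketch below assumes the simplest option, an equal-weight average, with optional custom weights. The model names are only dictionary keys and the numbers are invented.

```python
def combine_forecasts(forecasts, weights=None):
    """Weighted average of several equal-length forecasts.

    `forecasts` maps a model name to its list of predictions. If no weights
    are given, every model gets the same weight (an assumption of this sketch).
    """
    names = list(forecasts)
    if weights is None:
        weights = {name: 1 / len(names) for name in names}
    horizon = len(forecasts[names[0]])
    return [
        sum(weights[name] * forecasts[name][h] for name in names)
        for h in range(horizon)
    ]


# Hypothetical 3-day store-level forecasts from the three component models.
component_forecasts = {
    "arimax": [100.0, 110.0, 120.0],
    "narx": [90.0, 100.0, 130.0],
    "ets": [110.0, 120.0, 110.0],
}
combined = combine_forecasts(component_forecasts)
print(combined)
```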

The disaggregation model

The 7-day forecast was disaggregated down to the brand and item level by three different neural net sub-models, as shown in the diagram below. The first two (MLP and CNN) generate a set of features that feed into the final model, which performs the disaggregation of the forecast down to the brand and item level.

Each of the models serves a different purpose. The CNN model seeks to learn how the aggregate store-level sales from the last 30 days determine what the sales volume will be at the brand and item level for the next 7 days. The MLP model seeks to determine the impact that exogenous variables, such as promotions or day of the week, have on these estimates. The outputs from both are concatenated and fed into a multi-output regression model that forms the final prediction of sales.

Note that CNNs are more often used in image processing, where the input is a square pixel grid taken from a portion of an image. However, a CNN's ability to detect hidden patterns while filtering out noise makes them popular in other applications, in this case time-series disaggregation.

When used like this, the input data is fed as a single row of time-series data instead of a grid of data. Here the training data consisted of a row of the prior 30 days of sales data, and a row of the next 7 days of sales data.
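The pairing of a 30-day input row with a 7-day target row can be sketched as a sliding window over a daily series. This is a simplified illustration on a single made-up series; the one-day stride and the function name are assumptions, and in the paper's setting the targets would be the lower-level (brand or item) series rather than the same series as the input.

```python
def make_windows(series, input_len=30, output_len=7):
    """Slice a daily series into (previous input_len days, next output_len days) pairs."""
    pairs = []
    last_start = len(series) - input_len - output_len
    for start in range(last_start + 1):  # stride of one day (assumed)
        x = series[start:start + input_len]
        y = series[start + input_len:start + input_len + output_len]
        pairs.append((x, y))
    return pairs


# Hypothetical 40 days of sales; the values are just day indices for clarity.
daily_sales = list(range(40))
pairs = make_windows(daily_sales)
first_x, first_y = pairs[0]
print(len(pairs), first_x[0], first_x[-1], first_y)
```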

Training and Testing

The result is a forecasting method that can generate a high-level forecast of store sales based on the last 30 days and variables such as day of the week and item-level promotions, and then break that forecast down to the brand and item level. To train the model, the researchers started with a training dataset spanning 3 years of daily sales data, and then iteratively added a further week of data to the training set, simulating real-world use, until the training data spanned 4 years.
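The iterative scheme described above resembles an expanding-window evaluation. The sketch below generates the index boundaries for such a scheme, assuming 365-day years, a weekly step, and that each newly added week is the one being forecast before it joins the training set. These details are assumptions for illustration and are not confirmed by the article.

```python
def expanding_window_splits(n_days, initial_train_days, step_days=7):
    """Yield (train_end, test_end) day indices for an expanding training window.

    For each pair, train on days [0, train_end) and evaluate on
    days [train_end, test_end).
    """
    train_end = initial_train_days
    while train_end + step_days <= n_days:
        yield train_end, train_end + step_days
        train_end += step_days  # the tested week is absorbed into training


splits = list(expanding_window_splits(n_days=4 * 365, initial_train_days=3 * 365))
print(len(splits), splits[0], splits[-1])
```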

To compare effectiveness of their solution, the researchers compared their result to several benchmarks built on classical approaches. Note that the terms here refer to how each model disaggregates the forecast from the high level to the item level.

  • BU: Bottom-up approach
  • AHP: Average Historical Proportions
  • PHA: Proportions of the Historical Averages
  • FP: Forecasted Proportion
  • OPT: Optimal Reconciliation
  • NND1: Neural Net model 1
  • NND2: Neural Net model 2

Most of the benchmark models perform the disaggregation using some form of proportional breakout based on either historical sales (AHP and PHA) or forecasted sales (FP) at the item level. The historical models suffer from not taking into account changes in relationships between top level sales and individual items. The FP model attempts to rectify this and the OPT model is an extension of the FP model in which the

The two neural net models differ in their disaggregation process. NND1 disaggregates directly down to the item level, while NND2 has one model for disaggregating down to the brand level and another to go from the brand level to the item level.

Discussion of results

As stated earlier, the main difference between the models is the method of disaggregating the store-level forecasts. The authors state that there was no real difference between the methods in terms of the top-level (store-level) forecasts. So the real crux of this research is the ability of deep neural nets to disaggregate high-level forecasts down to the item level.


Fig. 1 below shows the resulting mean error at the brand level (B1-B4), with lower being better:

Fig. 1

We see that both NND models scored quite well at the brand level, with only the OPT model being somewhat competitive among the benchmark models.

Fig. 2

Fig. 2 shows the item-level results, with 1 representing the best-ranked model. We see that the NND models were significantly superior to the other approaches. As stated by the authors themselves, “the majority of the item level sales is sporadic sales including zeroes and the promotion of an item does not always correspond to an increase in sales”. This means that a bottom-up approach ends up producing a flat-line forecast representing average sales volume. The top-down approaches, however, also fail to account for the sparsity of individual items being sold on a particular day.

Why do deep neural network models outperform the classical models?

It should be noted that the task the deep learning model is being asked to perform here is quite narrow. That is to say, given a forecast for the next seven days for four brands of product, break the forecast down to the item level for the 118 items that the model was trained on. The model would not perform well if the set of products, or the brands they roll up into, were to change.

Also remember that the aggregate forecast at store level was produced using a composite model consisting of classical (not deep neural net) models. Therefore, the solution presented in the paper can be considered a hybrid solution.

Even so, we have seen that deep neural networks do outperform other standard models when disaggregating the forecasts down to the brand and item level. Deep learning models, such as a CNN model, allow non-linear and hidden relationships to be discovered while filtering out noise. In this case 12 filters were found to be optimal for the CNN model. Each filter can be thought of as representing a different hidden pattern or feature of the data that was considered important for prediction. The drawback of deep learning models is that it is generally hard to know what these hidden features are, but an example might be detecting an increase in sales of one item as the beginning of a shift in season that then impacts other items.
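To give a feel for what a single filter does, here is a sketch of a one-dimensional convolution over a short sales series. In a real CNN the filter values are learned during training; the two filters below are hand-picked purely for illustration (one smooths the series, one responds to rising sales), and the sales numbers are invented.

```python
def conv1d(series, kernel):
    """Slide the kernel over the series (no padding, stride 1), as a CNN layer does."""
    k = len(kernel)
    return [
        sum(series[i + j] * kernel[j] for j in range(k))
        for i in range(len(series) - k + 1)
    ]


def relu(values):
    """Common CNN activation: keep positive responses, zero out the rest."""
    return [max(0.0, v) for v in values]


# Hypothetical week of sales with a jump in the middle.
sales = [5, 5, 5, 9, 14, 14, 14]

# Hand-picked filters standing in for learned ones.
filters = {
    "smooth": [1 / 3, 1 / 3, 1 / 3],  # local average
    "rise": [-1.0, 0.0, 1.0],         # positive when sales are increasing
}
feature_maps = {name: relu(conv1d(sales, kernel)) for name, kernel in filters.items()}
print(feature_maps["rise"])
```

The "rise" filter only produces non-zero output around the jump, which is the sense in which a filter can be said to pick out one pattern in the data. A layer with 12 filters would produce 12 such feature maps.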


Fig. 3 shows that the model maintained its accuracy across the test period of a year. This suggests the model can detect which season it is in based on the sales seen in the last 30 days. Furthermore, it also indicates that the hidden patterns it has found are consistent across the time of year being disaggregated.

Fig 3.


Given that product sales at the item level are sporadic, we may wish to structure the hierarchy by product group instead of by brand (so that similar-selling items are grouped together) and forecast at that level, or iteratively at both brand and item level as is the case with the NND2 model (the Walmart results indicate that product groupings also produce similarly good results). While the neural net models do appear to have dealt well with sparsity, having a product group could help in dealing with new and dropped items.

New items

As mentioned earlier, a deep neural net model can identify the relationship between aggregate sales and item-level sales provided the hierarchical structure remains fixed. Or, to put it differently, the neural net will have been optimized for a specific set of items and hierarchy. A change in products, such as new products being added or products being dropped, could significantly reduce the accuracy of the model. Such changes are quite common in production systems within retail.

If a product group is used in the hierarchy, then new products could be excluded from the model and assigned the same forecast value as the average for their product group.
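A minimal sketch of that fallback, assuming item forecasts and group membership are held in plain dictionaries. The item and group names are invented, and the function is a hypothetical helper for this post rather than anything from the paper.

```python
def forecast_new_items(item_forecasts, item_groups, new_items):
    """Give each new item the mean forecast of the existing items in its group."""
    result = dict(item_forecasts)
    for item in new_items:
        group = item_groups[item]
        peers = [
            value for name, value in item_forecasts.items()
            if item_groups[name] == group
        ]
        if peers:  # leave the item out if its group has no existing items
            result[item] = sum(peers) / len(peers)
    return result


# Hypothetical model output for existing items, plus one item the model never saw.
model_forecasts = {"penne_a": 10.0, "penne_b": 20.0, "fusilli_a": 6.0}
groups = {
    "penne_a": "penne", "penne_b": "penne",
    "fusilli_a": "fusilli", "penne_new": "penne",
}
with_new = forecast_new_items(model_forecasts, groups, new_items=["penne_new"])
print(with_new)
```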

Dropped items

Dealing with products that are dropped is just as tricky, as in reality it could mean an increase in sales of other similar products. In this case we might assume that sales will remain consistent at the product group level (buyers will switch to other similar items), and therefore forecasting at the product group level may be sufficient. Alternatively, a final step might be to reassign the predicted sales from the dropped item evenly across the other items in its product group.
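The even reassignment could look something like the sketch below, again using plain dictionaries with invented item and group names. It keeps the product group total unchanged, which matches the assumption that buyers switch to similar items.

```python
def reassign_dropped(item_forecasts, item_groups, dropped):
    """Spread each dropped item's forecast evenly over the remaining items in its group."""
    result = {k: v for k, v in item_forecasts.items() if k not in dropped}
    for item in dropped:
        group = item_groups[item]
        peers = [name for name in result if item_groups[name] == group]
        if not peers:
            continue  # nothing left in the group to absorb the sales
        share = item_forecasts[item] / len(peers)
        for name in peers:
            result[name] += share
    return result


# Hypothetical model output that still includes a discontinued item.
raw_forecasts = {"penne_a": 10.0, "penne_b": 20.0, "penne_c": 6.0, "fusilli_a": 4.0}
product_groups = {
    "penne_a": "penne", "penne_b": "penne",
    "penne_c": "penne", "fusilli_a": "fusilli",
}
adjusted = reassign_dropped(raw_forecasts, product_groups, dropped={"penne_c"})
print(adjusted)
```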

A note about real world application

While this research focussed on the retail sector, hierarchical time-series forecasting can be applied to a wide range of sectors, each with their own hierarchy to consider.

In the educational sector the hierarchy may look like test results > module results > overall passes at the subject level > school or university. Here education authorities could first forecast out educational achievement for the next year, then build a stretch target on top of that, which can then be disaggregated back to the individual school and subject level to help educators understand what it means for them.

Within the retail sector, a supply chain hierarchy may look like raw produce (e.g. corn) > farm > manufacturer > retailer. Having the full picture of the flow of corn recorded on the blockchain could allow retailers to build a forecast model for the year ahead, then disaggregate that down to how much each farmer would need to produce in order to meet these needs. This could then provide an early warning to manufacturers in case they need to find additional sources for the raw produce.

Within telecoms, the hierarchy could be customer > zip code > town > region, and it could be used to help forecast demand on the network and support better capacity management. The impact of one-off occurrences, such as sporting events that place a temporary high demand on the network, could be forecast in a similar way to “promotions” in the research discussed.

Different situations can call for different methods of dealing with forecasting, sparsity, and new and dropped items. For example, the researchers chose to build a complex composite forecasting model for the aggregate-level forecasts. In practice, one may opt to use a simpler, single model at the expense of lower accuracy at the item level.

Key takeaways from the research

  • A mix of standard techniques combined with deep neural nets can be effective at short term forecasting
  • Deep neural nets work well for disaggregating high level forecasts down to the item level, but changes to hierarchy structure can pose a challenge
  • Depending on the sector / nature of the forecasting problem, different methods for dealing with new or removed items can be considered
  • Blockchain technology is opening up new opportunities for organizations to analyse their data as well as that of the ecosystem in which they operate

Published by ReddSpark
