Machine learning with parallel neural networks for analyzing and forecasting electricity demand

(1)

https://doi.org/10.1007/s10614-019-09960-5

Machine learning with parallel neural networks for analyzing and forecasting electricity demand

Yi-Ting Chen¹· Edward W. Sun²· Yi-Bing Lin¹

Accepted: 6 December 2019

Abstract

Traditional methods applied in electricity demand forecasting have been challenged by the course of dimensionality arisen with a growing number of distributed or decentral- ized energy systems are employing. Without manually operated data preprocessing, classic models are not well-calibrated for their robustness when dealing with the dis- ruptive elements (e.g., demand changes in holidays and extreme weather). Based on the application of big data driven analytics, we propose a novel machine learning method originating from the parallel neural networks for robust monitoring and forecasting power demand to enhance supervisory control and data acquisition for new industrial tendency such as Industry 4.0 and Energy IoT. Through our approach, we generalize the implementation of machine learning by using classic feed-forward neural networks, for parallelization in order to let the proposed method achieve superior performance when dealing with high dimensionality and disruptiveness. With the high-frequency data of consumption in Australia from January 2009 to December 2015, the overall empirical results confirm that our proposed method performs significantly better for dynamic monitoring and forecasting of power demand comparing with the classic methods.

Keywords Big data· Energy · Forecasting · Machine learning · Neural networks (PNNs)

JEL Classification C02· C10 · C63

B

Edward W. Sun

edward.sun@kedgebs.com

1 College of Computer Science, National Chiao Tung University, Hsinchu, Taiwan 2 KEDGE Business School, 680 Cours de la Libèration, 33405 Talance Cedex, France

(2)

1 Introduction

Electricity demand (utilization or load¹) forecasting is essential for effective power system operation and attracts major attention from the restructured electricity market. Ghadimi et al. (2018) point out electricity load is non-linear with a high level of volatility and predicting such complex signals requires suitable prediction tools.

The volatility illustrates strong seasonality in different time frames (i.e., the intra-day, intra-week, intra-month, and intra-year seasonality). Meantime, anomalies constantly occur in the power consumption. There are many days (e.g., public holidays or extreme weather days) exhibit load profiles that differ noticeably from the repeating patterns observed in normal days. Anomalous utilization deviates significantly from normal load, and is therefore not straightforward to model. For example, Abedini et al. (2019) pointed out that the participation of large consumer in the market is a challenging issue due to their considerable load demand. Some special operations also lead to anomalous loads, see Saeedi et al. (2019) for example. The relatively infrequent occurrence of anomalies results in a lack of training observations for adequately training the model.

Different anomalies exhibit different load profiles, which require each of them to be modeled uniquely. These make statistical modeling of anomalous load very challenging, see Siddharth and Taylor (2018) and references therein.

In this article, we implement a novel framework for processing big data collections, analyses, and management procedures to facilitate robust monitoring, forecasting, and economic assessment based on the scope of the Energy IoT. This framework relies on an architecture of a embedded system built on parallelizing neural networks (PNNs), which is a task-oriented group that parallelizes several individual neural networks together to create a new, more complex system that dedicates more functionality and superior performance than simply the sum of many signal neural networks. In this framework, the decision-making process based on forecasting the short-term demand is conducted on an optimization that is generally implemented with mathematical programming of genetic algorithms coupled with neural networks. In this article, the decision-making algorithm applied in our framework is based on a simple decision rule to minimize the forecasting error.

Neural network (NN) methods have a property of universal approximation and can achieve high predictive accuracy, therefore, they have received a lot of attention in decision-support systems. Although being considered as a singular approach,² NN method lose their advantages in interpretability in comparison to parametric models because they clarify decision making of the NN by explanatory rules that extract the patterns and capture the learned knowledge embedded in the networks, which can still help us illustrate the explanatory capability in response to why a specific implementa-

1 The end use of the generation, transmission, and distribution of electric power. Load requires specific amount of energy over a predetermined period of time. The total load of the power system is the sum of the total power consumed by all the electrical equipment in the system.

2 The non-identifiability of the NN model is similar to other methods, such as the wavelet method, Bayesian networks, hidden Markov model, and topological data mining (see Sun et al.2015, and references therein).

The model is referred to as being singular because its Fisher information matrix is unusual, and knowledge to be learned and patterns to be discovered from the sample space correspond to an individual instance; hence, difficulties arise when developing a mathematical tool that enables us to understand statistical estimation and interpret learning processes.

(3)

tion is classified as either bad or good. Tam and Kiang (1992) introduce an approach that can perform discriminant analysis in business research with neural networks.

Compared with a linear classifier, logistics regression, kNN, and an ID 3 algorithm, they show that the NN approach is a promising method in terms of predictive accuracy, adaptability, and robustness. Wang (1995) suggests a possible way of improving the performance of NN approach in managerial applications by offering an extension of Tam and Kiang (1992). Hill et al. (1996) compare six statistical time-series models with NN approach for forecasting and report that across monthly and quarterly time series, the neural networks perform significantly better than these traditional models.

In order to improve the classification accuracy and reduce computational time of neural network, Piramuthu et al. (1998) develop a methodology of feature construction to efficiently transform the training data. Baesens et al. (2003) use neural network for rule extraction in credit-risk evaluation and conclude that the neural network method is powerful as a management tool for decision-support systems. Kim et al. (2005) propose a new approach of neural networks with genetic algorithm in principal com- ponent analysis (PCA) and indicate that the underlying method is more accurate and parsimonious than traditional PCA methods. Wong et al. (2010) propose an adaptive neural network (ADNN) with the adaptive metrics of inputs for time-series predictions, and Kiani (2011) employed artificial neural networks (ANNs) to predict fluctuations in economic activity in several members of the Commonwealth of Independent States (CIS) using macroeconomic time series. Venkatesh et al. (2014) perform prediction of cash demand for groups of ATMs with different neural networks. Etemadi et al.

(2015) examined earnings per share forecasting using multi-layer perceptron (MLP) neural network and determined an optimal model based on evaluating the forecasting accuracy. Stasinakis et al. (2016) investigated the efficiency of the radial basis function neural networks in forecasting the US unemployment and reported that the neural network based method statistically outperforms all competing models’ performances.

Katris (2019) explored and analyzed time series with feed forward neural networks for prediction of unemployment. Levendovszky et al. (2019) performed a feed-forward neural network to obtain an efficient estimation of the forward conditional probability distribution in electronic trading. Ramyar and Kianfar (2019) investigates predictabil- ity of oil prices using a multilayer perceptron neural network that considers exhaustible nature of crude oil and impact of monetary policy along with other major drivers of crude oil prices. Gao et al. (2019) proposed a forecast engine that is comprised of multi- block neural network (NN) and optimized by an intelligent algorithm to increase the training mechanism and forecasting abilities. Therefore, our study will enrich the NN approaches currently applied in management science.

Using parallelization applies naturally to the time-consuming algorithms for improving the efficiency. Swann (2002) takes advantage of parallel computing to solve a maximum likelihood problem using the MPI message passing library. Creel (2008) parallelizes the parameterized expectations algorithm (PEA) to reduce the time needed to solve a simple model by more than 80%. Cai et al. (2015) imple- ments parallel dynamic programming methods in HTCondor Master-Worker system to solve demanding high-dimensional dynamic programming problems and reports the computational productivity can be increased by at least two orders of magnitude.

Muresano and Pagano (2016) apply a multi-core architecture in order to improve the

(4)

execution time for considering different key points (e.g., core communications, data locality, dependencies, and memory size). Creel (2016) illustrates the possibility for the programming language Julia with the message passing interface (MPI) for parallel computing on multicore computers or clusters of computers.

Dynamic measures for forecasting and monitoring update the information available at the time of evaluation based on error occurrence; that is, the assessment adapts to some filtered probability space. In a dynamic setting, error measures for processes can be identified with error measures for random variables in an appropriate product space, see Sun et al. (2015). The interrelation of error assessments at different times can be characterized by the property of time consistency, which can be described either by the super-martingale property of the error process or by the principle of prudence from the viewpoint of a decision maker and requires that if the error is tolerant for any scenario tomorrow then it should be tolerant today. In order to assess the performance of our system, we propose an efficient measure in our algorithm based on the tail error, which is an extreme error outside an ordinary distribution of outcomes (or simply, higher-than-expected), which causes massive inaccuracy. It has come to signify any big downward movement in a system’s accuracy. There are different ways to adjust the system to minimize the tail error, but a popular one is to optimize the algorithm with respect to its entropy, that is, the trade-off between complexity and accuracy. In our algorithm, we conduct conventional convex optimization and a simulation (with stylized data) and empirical results with real big data confirm the superior performance of our proposed method in comparison to alternative models.

In this paper, we first discuss the properties of the newly proposed parallel neural networks (PNNs) and run a simulation comparing with three alternative methods, i.e, the linear seasonal dummy model (LSDM), the autoregressive moving average model (ARMA), and the ARMA model embedded an autoregressive conditional heteroskedasticity (GARCH) to highlight its performance in term of error controlling with three conventional criteria for error measurement, i.e., the root mean squared error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). We then conduct an empirical study with the proposed PNN for electricity demand forecasting based on the data from Australia. Based on the feature engineer- ing method proposed by Chen et al. (2018), in our empirical work we recognize four features: (1) hourly demand in a week day, (2) hourly demand in a weekend, (3) hourly demand in a day chosen from a period of four weeks around that day, and (4) hourly demand in a day chosen from a period of four weeks before that day. We perform the PNNs simultaneously on the four features to train the parameter for each method and make the continuous forecasting with a moving window. We compare the performance of PNNs with LSDM, ARMA, and the ARMA-GARCH. We show the proposed PNNs method has superiority both in simulation and empirical study comparing with other methods in terms of error minimization.

We organize this article as follows. In Sect.2, we describe the proposed methodology in details based on the parallel neural networks (PNNs) for power demand monitoring and forecasting in order to deal with anomalies efficiently. In Sect.3, we present an empirical study that applies our method to conduct modeling and forecasting power demands with real big data from Australia. We conclude in Sect.4.

(5)

2 The novel methodology

In this section, we describe the proposed parallel neural networks (PNNs) approach and analytically show its computational properties.

2.1 Simple neural networks

A neural network (NN) is a mathematical or computational model based on biolog- ical NNs. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases, an NN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Similar to the linear and polynomial approximation methods, an NN relates input variables xi, i = 1, . . . , k to output variables yj, j= 1, . . . , k, which is essentially a mathematical model defining a function f : X → Y . Each type of NN model corresponds to a class of such functions.

The difference between a NN model and other approximation methods is that NN take advantage of one or more hidden layers, in which the input variables are transformed by a special function known as a logistic or log-sigmoid transformation; that is, the function f(x) is a composition of other functions gi(x) that can be further defined as a composition of other functions. Functions f(x) and gi(x) are composed of a set of elementary computational units called neurons,³which are connected through weighted connections. These units are organized in layers so that every neuron in a layer is exclu- sively connected to the neurons of the proceeding layer and the subsequent layer. Every neuron represents an autonomous computational unit and receives inputs as a series of signals that dictate its activation. All the input signals reach the neuron simultaneously, and the neurons can receive more than one input signal. Following the activation dictated from the input signals, the neurons produce the output signals. Every input signal is associated with a connection weight that determines the relative importance of the input signals in generating the final impulse transmitted by the neuron.

Formally, the described algorithm can be expressed as follows:

nk,t = wk,0 +

i^∗

i=1

wk,ixi,t, (1)

Nk,t = G(nk,t), (2)

yt = γ0 +

k^∗

k=1

γkNk,t, (3)

where G(·) represents the activation function and Nk,t stands for the neurons. In this system, there are i^∗input variables x and k^∗neurons. A linear combination of these input variables observed at time t, that is, xi,t, i = 1, . . . , i^∗, with the coefficient vector (i.e., a set of input weights)wk,i, i = 1, . . . , i^∗, and a constant termwk,0form the variable nk,t. This variable nk,tis transformed by the activation function G(·) to a

3 Sometimes such elementary computational units are called nodes, neuron nodes, units, or processing elements.

(6)

neuron Nk,t at time (or observation) t. The set of k^∗neurons at time (or observation) index t are combined in a linear way with the coefficient vectorγk, k= 1, . . . , k^∗and taken with a constant termγ0to form the output value yt at time t. In defining a NN model, the activation function G(·) is typically one of the elements to specify. There are three commonly employed types of activation functions: linear, stepwise, and sigmoid.

Neurons of the NN model are organized in layers. There are three types of layers:

input, output, and hidden. The input layer receives information only from the external information, that is, an explanatory variable xi. There is no any calculation performed for the input layer. It only transmits information to the next level. The output layer only produces the final results sent by the network to outside of the system, that is, response variable yj. Between the input and output layers there can be one or more intermediate layers, called hidden layers, so named because these layers are not directly connected with the external information. Giudici points out that the architecture of an NN model refers to the network’s organization: (1) the number of layers, (2) the number of neurons belonging to each layer, (3) the manners in which the neurons are connected, and (4) the direction of flow for the computation.

Different information flows lead to different types of network. The NN model can by divided into two types based on the information flow: feed-forward networks and recurrent networks. In the feed-forward network, the information moves in only one direction: from the input layer through the hidden layer and to the output layer. There are no cycles or loops in this type of network. Equations (1)–(3) describe the feed- forward networks. In contrast to feed-forward networks, recurrent networks are models with bi-directional information flows that enable the neurons to depend not only on the input variable xi but also on their own lagged values nk,t−pat order p. The recurrent network illustrate dependence in the evolution of the neurons. Replacing Eq. (1) with the following Eq. (4), a recurrent network can be formed by

nk,t = wk,0 +

i^∗

i=1

wk,ixi,t +

k^∗

k=1

φknk,t−p, (4)

where φk stands for the weight of lagged value nk,t−p at order p. The recurrent network has an indirect feedback effect from the lagged unsquashed neurons to the current neurons, not a direct feedback from lagged neurons to the level of output. In our study, we focus on the feed-forward network.

An NN model modifies its interconnection weights by applying a set of learning (training) samples. The learning process leads to parameters of a network that represent implicitly stored knowledge from the data. More generally, given a specific task to solve and a class of functions F , learning means using a set of observations (learning/training samples) in order to find f^∗ ∈ F which solves the task in an optimal sense. This entails defining a cost function C : F → such that, for the optimal solution f^∗, C( f^∗) ≤ C( f ) ∀ f ∈ F; that is, no solution has a cost less than that of the optimal solution. Figure1shows a simple NN, that is, a three-layer fully connected NN with M neurons. It comprises an input layer, one hidden layer and an output layer. The hidden layer is made up of a number of neurons um (m = 1, . . . , M), which act as feature detectors. As the learning progresses, the hidden neurons begin to gradually discover the salient relationship between input historical uses and the forecasted use.

(7)

Fig. 1 A basic neural network model

Fig. 2 A neuron

Figure2demonstrates how a neuron works. At the beginning of the training process, each neuron connects to the input vector (i.e., historical similar power usages xiin our example) with initial weights and performs inner product. As we know, inner products usually are used to quantify the similarity of two vectors. Because similarity ranges vary, the activation functionA(·) is introduced to scale the range and make it easily to compare similarities. In our case, we choose the sigmoid function as an activation function for its range limits in [0,1] because of its differentiable and non-decreasing properties. Each neuron can be expressed as in Eq. (5), where am is the output of a neuron um,βm denotes the inner product result, wm,i are weights, andA(·) is the sigmoid function.

am = A(βm) = A

_D

d=1

wm,ixi,d

(5)

In the output layer, all weights (i.e.,w1,1, . . . , w1,M) between the hidden layer and the output zi evaluate each neuron’s importance. The output layer reports the result of the inner product between all hidden neurons’ outputs a1, . . . , aM and the corresponding weights, and applies an activation function to derive the output˜z. Therefore, the overall network function can be formally expressed as Eq. (6).

˜z(xi, w) = A

_M

m=1

w1,mam

= A

_M

m=1

w1,mA

_D

d=1

wm,ixi,d

(6)

(8)

Fig. 3 A parallel neural network

2.2 Parallel neural networks

Figure3shows a parallel neural network (PNN) that consists of two parts. The first part is a number of simple three-layer NNs: NNq (q = 1, . . . , n). Each NN tries to capture the most salient features with a distinct and predefined number of neurons.

The second part is the evaluation module, which determines the optimal weight w^∗ of the neural network. Training basically consists of determining the weights, which achieve the desired target values by minimizing the sum of the squared estimated error E(x, w) over all training samples, where

E(x, w) =

N i=1

e(xi, w). (7)

In Eq. (7), function e(xi, w) is the estimated error of the training sample xi. To make the expression simple, we omit the corresponding input vector and expressE(x, w) as E(w). The training process is shown in Eq. (8):

e(xi, w) =

1 2

zi− ˜z(xi, w)2

. (8)

To take loading balance into consideration, we equally partition the optimization work to each processor. At the beginning of the training process, NNqis assigned a distinct and predefined neuron number and initialized with initial weights wq.

(9)

2.2.1 Feed-forward computation and backward error propagation

The input will be sent to each NN simultaneously for training, and NNqtries to find its optimal weight vector such thatE(·) takes its global minimum. As E(·) is a continuous function, its global minimum can be reached whenE(·) = 0. Therefore we follow conventional NN techniques to train each NN by iteratively applying batch algorithms, which are made up of feed-forward computations and backward error propagation. In the feed-forward computation phase, the information flows from the input layer to the output layer by applying the given wqto all the input vectors. We will get the estimate

˜zi corresponding to each input vector xi according to Eq. (6) and together with the estimated error e(xi, wq).

In the backward error propagation phase, we need to be able to compute the gradient of the error function with respect to each weight in the network. It tells us how a small change in that weight will affect the overall errorE(·). Applying the chain rule derives the gradient of the error function with respect to all weights between the hidden layer and the output layer as expressed in Eq. (9).

∂e

∂w1,m = −(z − ˜z)

∂ ˜z

∂w1,m

= −(z − ˜z)

∂A(β1)

∂w1,m

= −(z − ˜z)A(β1)

∂β1

∂w1,m

= −(z − ˜z)A(β1)am. (9)

We also need to adjust all weights connecting input nodes and hidden neurons. To obtain the weight update vector betweeen the input layer and the hidden layer, we keep propagating the errors to the previous layer, as follows:

∂e

∂wm,i = − (z − ˜z)

∂ ˜z

∂wm,i

= −(z − ˜z)

∂A(β1)

∂wm,i

= −(z − ˜z)A(β1)

∂β1

∂wm,i

= −(z − ˜z)A(β1) ∂

∂wm,i

_M

m=1

w1,mA(βm)

= −(z − ˜z)A(β1)w1,m

∂A(βm)

∂wm,i

= −(z − ˜z)A(β1)w1,mA(βm)

∂βm

∂wm,i

= −(z − ˜z)A(β1)w1,mA(βm)xi (10) In the weight adjustment phase, we resort to a gradient descent approach. The weight adjustmentwqis derived from propagating the error backward, see Eqs. (9)–

(10). When the stopping criterion is satisfied (i.e., wq = 0), the weight wq and corresponding errorE(wq) will be sent to the model evaluation module for comparing the actual performance, then the training process of NNq is terminated. Otherwise, the weight wqis updated as follows:

(10)

wq= wq+ wq (11) and the training process of NNqproceeds to the next epoch.

2.2.2 Model evaluation

In order to avoid over-fitting, we design a regularization approach, which involves adding a penalty term to the error function E(w) for discouraging the coefficients from reaching large values, as the form shown in following Eq. (12):

E(w) =

N i=1

e(xi, w) + α w 1 (12)

where w 1 =

| w |, and the parameter α governs the relative importance of the regularization term compared with the estimated error term. From all wqderived from NNq, the evaluation module returns the weight as the optimal weight w^∗ for which the error function E(·) attains its minimum value.

2.2.3 Summary

Algorithm 1 Parallel Neural Network

Input: observations: a set of observations, each with input vector x and target value z;

networks: a set of three-layer neural networks, each with a predefined number of neurons and activation functionA(·);

Output: w^∗ (in parallel do)

for q:= 1 to the number of elements in networks do initialize weights wqin the network NNq; repeat

perform a feed-forward operation. (see Eq. (6))

compute the error between the target value and the estimate. (see Eq. (7)) propagate the error backwardly. (see Eqs. (9)–(10))

update weights wqin the network. (see Eq. (11)) until satisfying stopping criterion

end for

Evaluate networks. w^∗← arg min(E(wq))

We summarize the complete flowchart of PNN in Fig.4with the pseudocode shown by Algorithm 1.

3 Simulation study

Let y1, y2, . . . , yt be a time series. At time t for t ≥ 1, the next value yt+1 will be predicted based on the observed realizations (for training or monitoring) of

(11)

Fig. 4 Flowchart of the parallel neural network (PNN)

yt, yt−1, yt−2, . . . , y1. In order to confirm the properties we derived, we perform a Monte Carlo simulation with stylized data, that is, the data patterns are predetermined.

3.1 The data

The simulated power demand series, yt, is generated according to an additive model:

yt = xt+εt, ε ∼ N(0, σ²), where xtis the deterministic term that considers seasonality that observed in practice andεtis the nondeterministic part, that is, the random term that stands for the events not being perfectly predicted and that follow a normal distribution N(0,σ²).

We simulate two data patterns in our study. For the Pattern 1 data, we use the trigonometric model suggested by Chen et al. (2010) for the seasonality of power demand as follows:

xt = s0+

2 i=1

sisin( fit+ θi),

where si, and θi, i = 0, . . . , 2 are constants. For simulating the Pattern 2 data, we extend the preceding equation with an additional seasonality term, as follows:

xt = s0+

2 i=1

sisin( fit+ θi) +

2 j=1

sjcos( fjt+ θj), where si,θi, andθj, i, j = 0, . . . , 2 are constants.

(12)

Fig. 5 Comparison of in-sample modeling performances of PNN with other alternative methods measured by mean and variance of RMSE for two different stylized data patterns

Letting N denote the length of the sample, we set N = 5000, and run 20,000 replica- tions for each simulation. The subsample series that are used for the in-sample study are randomly selected by a moving window with length T . Replacement is allowed in the sampling. Letting TF denote the length of the forecasting series, we perform the one day ahead out-of-sample forecasting (1≤ T ≤ T + TF≤ N). In our analysis, the subsample length (i.e., the window length) of T = 672 (approximately two weeks) was chosen for the in-sample simulation and TF = 48 (approximately one day) for the out-of-sample forecasting. A total of 9060 sub-samples (1812 sub-samples for each region during five years) were randomly created.

3.2 Alternative models

In the simulation study, we compare the performance of the PNN framework with three competing models: the linearly seasonal dummy model (LSDM), the Autoregressive Moving Average (ARMA) model, and the ARMA model with the noise that illustrates the autoregressive conditional heteroskedasticity (GARCH) effect. The analysis is

(13)

Fig. 6 Comparison of out-of-sample forecasting performances of PNN with other alternative methods measured by mean and variance of RMSE for two different stylized data patterns

performed for both in-sample (monitoring or training) and out-of-sample (forecasting) investigations with two data patterns described in the previous section. We compare the performance of model fitting by using root mean squared error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE), as the goodness- of-fit measures.

The LSDM with s (s≥ 1) seasonality dummy variables is described as follows:

yt = α0 +

s i=1

αi Di t + εt, (13)

where Di t stands for the seasonal dummy, Di t = 1 if s is the ith period, otherwise Di t = 0.

The ARMA(r,m) model is defined with the following formula:

yt = α0 +

r i=1

αi yt−i + εt +

m j=1

βjεt− j (14)

(14)

Fig. 7 Comparison of in-sample modeling performances of PNN with other alternative methods measured by mean and variance of MAPE for two different stylized data patterns

whereε ∼ N(0, 1). Equation (14) also serves as the conditional mean for the ARMA- GARCH model. Letεt = σtut, where the conditional variance of the innovations, σt², is by definition

V art−1(yt) = Et−1(ε²t) = σt²

The general GARCH(p,q) processes for the conditional variance of the innovation is then

σt²= κ +

p i=1

γiσt²−i +

q j=1

θjε²t− j (15)

In our study, all models, that is, Eqs. (13)–(15), have been parameterized with εt ∼ N(0, 1), see Sun et al. (2007) for the details about modeling and parameterizing residuals under different conditions.

(15)

Fig. 8 Comparison of out-of-sample forecasting performances of PNN with other alternative methods measured by mean and variance of MAPE for two different stylized data patterns

3.3 Results

We conduct a simulation study to investigate the performance of PNNs. The purpose of this simulation study is twofold. First, we show that for any arbitrary data, a PNN will result in a better performance not only in the in-sample modeling but also in the out-of-sample forecasting compared to others. Second, we illustrate the properties of PNNs in particular by showing the consistency of our algorithm, that is, the error generated by a PNN is bounded and less than those of alternative methods. For each pattern, we compute RMSE, which measures the distance between the true signal and its approximation. In Figs.5and6, we display the mean and variance of RMSE with respect to the in-sample and out-of-sample performance. The smaller these values are, the better the performance of the underlying model will be. We then can identify that the mean and variance of RMSE for PNNs are obviously smaller than other models for simulated patterns 1 and 2 data. For the in-sample modeling, we can observe error divergency for alternative methods based on the variance of RMSE in Panels 1 and 2 of Fig.5. It shows that when sample size increases, the errors of alternative methods exhibit a growing tendency. On the contrary, the error of PNN illustrates stability, that

(16)

Fig. 9 Comparison of in-sample modeling performances of PNN with other alternative methods measured by mean and variance of MAE for two different stylized data patterns

is, the error tends not to increase when enlarging the sample size. We could observe a similar profile for the out-of-sample forecasting, see Panels 2 and 4 in Fig.6.

We compare the performance of all methods measured by the mean of MAPE in Fig.7and variance of MAPE in Fig.8. Panels 1 and 3 in Fig.7exhibit similar profile and show the PNN has smallest value of MAPE for Data Pattern 1 and 2 in the in- sample modeling. For the out-of-sample forecasting, Panels 1 and 3 in Fig.8show that the error altitude of all methods increases but the value of PNN is still the smallest. For the variance of MAPE, the PNN has the smallest value, see Panels 2 and 4 in Figs.7 and8. Particularly, the variances of MAPE in both the in-sample and out-of-sample illustrate consistency in value fluctuation.

We also compute MAE for each pattern. In Figs.9and10, we display the mean and variance of MAE for the in-sample and out-of-sample performance respectively.

Similar to the results based on RMSE and MAPE, we can find out that the mean and variance of MAE for PNNs are obviously smaller than other models for simulated patterns 1 and 2 data. Comparing with the results based on RMSE and MAPE, the MAE measurement illustrates similar profile.

We therefore can conclude that the proposed PNN procedure performs better than other models for the two data patterns we investigated in our simulation study.

(17)

Fig. 10 Comparison of out-of-sample forecasting performances of PNN with other alternative methods measured by mean and variance of MAE for two different stylized data patterns

4 Empirical study

We then apply the real data from the Australian electricity market whose patterns were originally preserved for an empirical study. We use the same procedure we have used in our simulation study for the in-sample (monitoring-training) and out-of- sample (forecasting) investigations. In addition, we apply a robust test based on two transformation-invariant metrics to verify our empirical results.

4.1 The data

In our empirical study, we use the high-frequency (i.e., half-hourly) data of electricity demand in Australia collected from January 1, 2009, to December 31, 2014, released by the National Electricity Market (NEM), which is the Australian wholesale electricity market. The NEM operates in five interconnected regions: New South Wales (NSW), Queensland (QA), South Australia (SA), Tasmania (TAS), and Victoria (VIC). Sun and Meinl (2012) highlight some features of high-frequency data and suggest powerful denoising tools to smooth the data. Dealing with the abnormal and unusual seasonality

(18)

is the primary concern in demand forecasting; therefore, we maintain the originality of the data without moving any outliers and cleaning the data in our study.

4.2 Sequence alignment for demand similarity

A similar day method stemming from sequence alignment is based on the fractal assumption of the historical data that have similar characteristics to the day being forecasted, see Sun et al. (2007), and references therein. Many characteristics can be extracted for demand patterns, such as time of day, nature of the day (weekday, weekend, or holiday), business day variation across months, and seasonal anomalies.

In our study, t refers to time, which stands for a day or a minute, and we uset to identify a specific time period. For example, given t as a day,t = 0 means the current day (or “today”) we consider; whent ∈ [−14, 14], it means the period of two weeks before and after today. We then define four indicators with capital letters, that is, H_t, D_t, W_t, and Y_t, in order to classify easily the similar days with respect to our target day, and H refers to the hour indicator, D is the date indicator, W indicates day of week in a given year, and Y shows the year. As we have learned that the subscript t shows the length of the period of the considered indicators. Therefore, any power demand can be expressed as x_(H_t_,D_t_,Y_t₎.

We introduce four types of similar demand days with respect to our target day’s demand x_(H₀_,D₀_,Y₀₎:

• xi,1is the first type of the similar demand day and defined as x_(H₀_,D₀_,Y₋₁₎, which means the demand is in the same hour on the same day one year ago.

• xi,2is the second type of the similar demand day and defined as x_(H₀_,W₀_,Y₋₁₎, which means the demand is in the same hour on the same day of the week one year ago.

• xi,3is the third type of the similar demand day and defined as x_(H₀_,D_[−14,+14]_,Y₋₁₎, which means the demand is in the same hour in a day chosen from a period of four weeks around the same day one year ago.

• xi,4 is the fourth type of the similar demand day and defined as x_(H₀_,D_[−28,0]_,Y₀₎, which means the demand is in the same hour in a day chosen from a period from four weeks before the target day.

When considering xi,2, we face the anomaly of day of the week because of the irregular switching between working days and nonworking days. Electricity demand on working days usually follows a regular pattern, but on nonworking days it does not.

Nonworking days (i.e., weekend and public holidays) can be separated into two groups based on which day of the week in the calendar: fixed position or floating position. All weekends and some public holidays for a specific day of week, for example, Easter Monday and the Queen’s birthday, belong to the first group. Other public holidays whose positions float to different days of the week are in the second group.

The work week and weekend are two complementary parts of a week devoted to work and rest, respectively. The legal work week is the part of the seven-day week devoted to labor. In most Western countries it is Monday through Friday; the weekend comprises Saturday and Sunday.⁴A weekday is any day of the work week. If a weekday

4 Some people extend the weekend to Friday nights as well. In some Christian traditions, Sunday is the

“Lord’s Day” and the day of rest and worship. Jewish Shabbat or Biblical Sabbath lasts from sunset on

(19)

coincides with a public holiday, we can observe different power demand patterns. We also observe some patterns when public holidays overlap the weekend. We then classify non-weekend public holidays (i.e., public holidays that do not overlap the weekend) as two types: adhered with weekends and adhered with weekdays.

4.3 The result

We use the same setting and moving window design as we did in the simulation. For the moving window design, we set the in-sample size as one-month and the out-of- sample size as one-day-ahead forecasting. The moving window shifts 338 times for the data of one year. The moving window shifts once, we get a set of parameters for all methods. We have the data sampled for five year and the moving window totally shifts 1690 times. We provide descriptive statistics of parameters for each method in Table1.

Table2reports the in-sample results for the ARMA(2,1), ARMA(2,1)-GARCH(1,1) and LSMD models. We see that the measure (i.e., MAPE, MAE, and RMSE) of in- sample error is reduced when applying a PNN compared with the error of other models.

In addition, the PNN reduces the MAPE by 39%, 44%, and 44% compared with the results from the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSMD models. The PNN also reduces MAE on average by 42.3%, 46.9%, and 49.2% when compared with the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSMD models, respectively.

The PNN reduces RMSE by 46.7%, 52.6%, and 49.5% on average when compared with the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSMD models as well. We report the results of out-of-sample forecasting performances of the PNN, ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSMD models in Table 2. The PNN increases the average accuracy of the MAPE by 45.6%, 45%, and 45% when compared to the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LMSD models, respectively. Based on the measure of MAE, the PNN also increases the average accuracy by 50.8%, 50.2%, and 39.1% when compared with the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LMSD models, respectively. Using RMSE as the accuracy measure, the PNN improves on average the accuracy by 48.4%, 47.9%, and 36.4% when compared with the ARMA(2,1), ARMA(2,1)- GARCH(1,1), and LMSD models as well. We find a generally significant improvement provided by the PNN when compared with the other three alternative models.

4.3.1 Robustness testing

We perform a robustness test based on two metrics: Kolmogorov–Smirnov and Kuiper metrics applied by Chen et al. (2019) for goodness-of-fit evaluation. Baucells and Bor- gonovo (2013) illustrate the advantages of using Kolmogorov–Smirnov and Kuiper metrics because they have good invariant probabilistic features when conducting monotonic transformations and are easy to interpret.

Friday to the fall of full darkness on Saturday, leading to a Friday-Saturday weekend in Israel. Some Muslim- majority countries historically had a Thursday-Friday or Friday-Saturday weekend; however, recently many such countries have shifted from Thursday-Friday to Friday-Saturday or to Saturday-Sunday; see Wikipedia.

(20)

Table1DescriptivestatisticsforparametersofARMA(2,1),ARMA(2,1)-GARCH(1,1),LSDM,andPNN(1690iterations) ARMA(2,1)ARMA(2,1)-GARCH(1,1)LSDM∗PNN α1α2βα1α2βγθκα0α1ωφ NSWMean0.57740.0131−0.00120.6428−0.0140−0.00810.08930.79840.00006.70062.1897−0.12660.1892 Var0.09260.08200.01470.08750.09570.01920.08210.22100.00000.33261.21050.87711.2773 Min0.3234−0.2195−0.05240.4666−0.1806−0.06860.04980.01700.00006.1961−0.2148−2.1109−2.5782 Max0.81820.18810.02650.82890.19800.03690.48090.90000.00007.36465.45981.88503.2677 QLDMean0.52810.05720.00440.52930.04890.00280.12880.66370.00004.61481.4002−0.10180.3129 Var0.07130.07060.00950.09600.07680.00940.08260.27920.00000.20420.81610.88221.3182 Min0.3283−0.0829−0.02880.2245−0.1151−0.02700.04980.00020.00004.3329−0.0268−2.1270−2.6662 Max0.67920.26440.02750.70950.23300.02740.38190.90000.00005.15452.85521.81413.4625 SAMean0.32330.1258−0.00460.33180.1308−0.00510.13540.58390.00011.16360.3780−0.2173−0.2265 Var0.09790.05220.01210.09030.05570.01280.06760.29210.00010.07590.23910.85981.0891 Min0.04390.0071−0.05510.12170.0200−0.05600.04910.01580.00001.0430−0.0655−2.2193−2.6279 Max0.55580.28460.02700.55880.26680.02670.27780.90010.00021.52091.25851.89392.5255 TASMean0.22680.00550.00150.23190.00950.00220.14370.59390.00010.96110.1715−0.09720.3809 Var0.10670.05770.00490.10620.05440.00740.09660.31580.00010.03920.12060.84281.1896 Min0.0004−0.1350−0.0143−0.0213−0.1030−0.01220.04930.00770.00000.8419−0.0169−2.0721−2.5617 Max0.50310.16130.02450.45020.15790.07920.35360.90000.00021.04720.50151.84543.5102 VICMean0.51260.10750.00760.52690.10580.00510.14050.57520.00004.76881.1065−0.16450.0125 Var0.07160.07050.00780.08360.08720.01120.08670.31420.00000.19030.66970.80791.1936 Min0.3347−0.0444−0.01260.3536−0.1074−0.03920.04980.00760.00004.4345−0.1830−2.0228−2.5357 Max0.67350.27210.02840.71530.32380.02820.36320.90000.00015.21553.36541.87272.9129 *×103

(21)

Table2Comparisonofgoodness-of-fitofthePNNwithotheralternativemethods,thatis,ARMA(2,1),ARMA(2,1)-GARCH(1,1),andlinearseasonaldummymodel (LSDM)forthein-sample(monitoring/training)performancebytheaveragevalueofthemeanabsoluteerror(MAE),meanabsolutepercentageerror(MAPE),androot meansquarederror(RMSE) YearRegionPNNARMA(2,1)ARMA(2,1)-GARCH(1,1)LSDM MAPEMAERMSEMAPEMAERMSEMAPEMAERMSEMAPEMAERMSE (×10−2)(×102)(×102)(×10−2)(×102)(×102)(×10−2)(×102)(×102)(×10−2)(×102)(×102) 2010NSW2.91292.49933.49544.65794.25026.00365.09574.65496.58985.32364.71866.2886 QLD2.33681.37091.83513.42062.13563.24543.75512.34543.56704.25742.59803.5150 SA5.57380.82561.09819.58341.49582.22179.39291.47512.163510.38201.59442.1996 TAS3.22090.36700.47935.60130.65690.88985.54950.65160.88024.71040.54230.7035 VIC3.07401.74382.38545.47213.29284.92275.94703.58075.34977.23304.22525.6198 2011NSW2.95712.52383.36214.70284.23496.06395.11584.61856.65385.35234.71796.4181 QLD2.45181.40291.89613.45502.10313.19073.80422.31593.49654.30332.56163.4785 SA5.98410.86501.15589.36371.42762.08429.73101.48772.14619.64611.43941.9744 TAS3.36480.37800.49935.48190.63380.85175.57780.64460.86375.08000.58090.7425 VIC2.95801.65682.20535.60583.28154.83306.11903.58585.27656.77873.86315.1005 2012NSW2.63142.12942.83354.60773.89445.46605.10014.31256.04435.25074.31865.7555 QLD2.39191.37011.83353.54922.14953.23753.87632.34663.52804.30402.55743.5475 SA5.45960.77821.03879.54901.44842.13969.62421.45952.16799.68721.42952.0365 TAS3.23800.35150.45925.63970.62710.84635.88540.65210.87704.80040.52260.6796 VIC3.09711.71282.35395.62133.28204.89456.09133.55615.29576.67233.78865.1638