Annals of Operations Research, ISSN 0254-5330, Volume 277, Number 1
Ann Oper Res (2019) 277:3–32, DOI 10.1007/s10479-018-2795-1


S.I.: RELIABILITY AND QUALITY MANAGEMENT IN STOCHASTIC SYSTEMS

Coherent quality management for big data systems: a dynamic approach for stochastic time consistency

Yi-Ting Chen¹,² · Edward W. Sun³ · Yi-Bing Lin²

Published online: 14 March 2018

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract Big data systems for reinforcement learning have often exhibited problems (e.g., failures or errors) when their components involve a stochastic nature together with continuous control actions for reliability and quality. The complexity of big data systems and their stochastic features raise the challenge of uncertainty. This article proposes a dynamic coherent quality measure built on an axiomatic framework that characterizes the probability of critical errors and can be used to evaluate whether the conveyed information of big data interacts efficiently with the integrated system (i.e., system of systems) to achieve the desired performance. Herein, we consider two new measures that compute the higher-than-expected error, that is, the tail error and the conditional expectation of the excessive error (conditional tail error), as quality measures of a big data system. We illustrate several properties (which satisfy stochastic time-invariance) of the proposed dynamic coherent quality measure for a big data system. We apply the proposed measures in an empirical study with three wavelet-based big data systems for monitoring and forecasting electricity demand in order to conduct reliability and quality management in terms of minimizing decision-making errors. The performance of our approach in the assessment illustrates its superiority and confirms the efficiency and robustness of the proposed method.

Keywords Big data · Dynamic coherent measure · Optimal decision · Quality management · Time consistency

JEL Classification C02 · C10 · C63

Corresponding author: Edward W. Sun
edward.sun@kedgebs.com; edward.sun@bem.edu

1 School of Business Informatics and Mathematics, University of Mannheim, Mannheim, Germany
2 College of Computer Science, National Chiao Tung University (NCTU), Hsinchu, Taiwan
3 KEDGE Business School, 680 Cours de la Libération, 33405 Talence Cedex, France


1 Introduction

Big data have become a torrent flowing into every area of business and service along with the integration of information and communication technology and the increasing development of decentralized and self-organizing infrastructures (see Sun et al. 2015, for example). A big data system (BDS), which in our article refers to the big data-initiated system of systems, is an incorporation of task-oriented or dedicated systems that interoperates with big data resources, integrates their proprietary infrastructural components, and improves their analytic capabilities to create a new and more sophisticated system for more functional and superior performance. For example, the Internet of Things (IoT), as described by Deichmann et al. (2015), is a BDS; it networks physical objects through the application of embedded sensors, actuators, and other devices that can collect or transmit data about the objects. A BDS generates an enormous amount of data through devices that are networked through computer systems.

Challenges to BDS are therefore unavoidable with the tremendous, ever-increasing scale and complexity of systems. Hazen et al. (2014) address the data quality problem in the context of supply chain management and propose methods for monitoring and controlling data quality. The fundamental challenge of BDS is that there exists an increasing number of software and hardware failures when the scale of an application increases. Pham (2006) discusses system software reliability in theory and practice. Component failure is the norm rather than the exception, and applications must be designed to be resilient to failures and diligently handle them to ensure continued operations. The second challenge is that complexity increases with an increase in scale. There are more component interactions, increasingly unpredictable requests on the processing of each component, and greater competition for shared resources among interconnected components. This inherent complexity and non-deterministic behavior make diagnosing aberrant behavior an immense challenge. Identifying whether an interruption is caused by the processing implementation itself or by unexpected interactions with other components turns out to be rather ambiguous. The third challenge is that scale makes a thorough testing of big data applications before deployment impractical and infeasible.

BDS pools all resources and capabilities together and offers more functionality and performance with sophisticated technologies to ensure the efficiency of such an integrated system.

Therefore, the efficiency of BDS should be higher than the merged efficiency of all constituent systems. Deichmann et al. (2015) point out that the replacement cycle for networked data may be longer than the innovation cycle for the algorithms embedded within those systems. BDS therefore should consider ways to upgrade its computational capabilities to enable continuous delivery of data (information) updates. Modular designs are required so the system can refresh discrete components (e.g., data) of a BDS on a rolling basis without having to upgrade the whole database. To pursue a continuous-delivery model of information or decision, BDS must be enabled to review its computational processes and be focused simultaneously on supporting data flows that must be updated quickly and frequently, such as auto-maintenance. In addition, BDS has to ensure speed and fidelity in computation and stability and security in operation. The efficiency gains from networking are substantial, but so is the uncertainty surrounding them. Questions thus arise about how to accurately assess the performance of BDS and how to build a technology stack1 to support it.

Quality measures quantify a system's noncompliance with the desirable or intended standard or objective. Such noncompliance may occur for a process, outcome, component, structure, product, service, procedure, or system, and its occurrence usually causes loss.

1 These are layers of hardware, firmware, software applications, operating platforms, and networks that make up the IT architecture.

Enlightened by the literature on managing extreme events and disasters, in this article we work on an axiomatic framework of new measures (based on the tail distribution of critical defects or errors) to address how to conduct optimal quality management for BDS with respect to a dynamic coherent measurement that has been well established particularly for managing risk or loss—see, for example, Riedel (2004), Artzner et al. (2007), Sun et al. (2009), Bion-Nadal (2009) and Chun et al. (2012) as well as references therein. Conventional quality measures focus on quantifying either the central moment (i.e., the mean) or the extreme value (e.g., six sigma) of error (or failure). The newly proposed measure focuses on a specific range of the distribution instead of a specific point, which modulates the overall exposure of defects for quality management.

Conventional measurement based on average performance, such as the mean absolute error (MAE), is particularly susceptible to the influence of outliers. When we conduct system engineering, systems have their individual uncertainties that lead to error propagation, which distorts the overall efficiency. When these individual uncertainties are correlated with an unknown pattern, simply aggregating a central tendency measure gives rise to erroneous judgment. The six sigma approach that investigates the extremum2 has received considerable attention in the quality literature and in the general business and management literature. However, this approach has shown various deficiencies such as rigidity, inadequacy for complex systems, and non-subadditivity; in particular, it is not suitable for an integrated system or a system of dynamic systems. Alternatively, Baucells and Borgonovo (2013) and Sun et al. (2007) suggested the Kolmogorov–Smirnov (KS) distance, Cramér–von Mises (CVM) distance, and Kuiper (K) distance as metrics for evaluation. These measures are based on the probability distribution of malfunction and only provide information for a relative comparison of extrema. They fail to identify the absolute difference when we need to know the substantiality of the difference. In order to overcome these shortcomings, we propose a dynamic measurement applying the tail error (TE) and its derived measure, the conditional tail error (CTE), to quantify the higher-than-expected failure (or critical error) of a BDS.

Tail error (TE) describes the performance of a system in a worst-case scenario, or the critical error of a really bad performance at a given confidence level. TE measures the error threshold that is exceeded with probability at most α, where α can be treated as a percentage of all errors: the probability of an error greater than TE is less than or equal to α, whereas the probability of an error below TE is at least 1 − α. For example, one can state that it is 95% (i.e., α = 5%) sure for a given system processing that the highest defect rate is no more than 15% (which is the tail error that can be tolerated). TE can also be technically defined as a higher-than-expected error of an outcome moving more than a certain tolerance level, for example, a latency three standard deviations away from the mean, or even six standard deviations away from the mean (i.e., the six sigma). For mere mortals, it has come to signify any big downward move in dependability. Because an expectation is fairly subjective, a decision maker can appropriately adjust it with respect to different severities.

When we zoom in on the tail end (far from zero) of the error (or failure) distribution, the conditional tail error (CTE) is a type of sensitivity measure derived from TE that describes the shape of the error (or failure) distribution in the far-end tail beyond TE. CTE also quantifies the likelihood (at a specific confidence level) that a specific error will exceed the tail error (i.e., the worst case for a given confidence level), and it is mathematically derived by taking a weighted average between the tail error and the errors exceeding it. CTE evaluates the value of an error in a conservative way, focusing on the most severe (i.e., intolerable) outcomes above the tail error. The dynamic coherent version of CTE is the upper bound of the exponentially decreasing bounds for TE and CTE. CTE is convex and can be additive for an independent system. A distinctive character of these measures is that they quantify the density of errors by choosing different α values (or percentages). Therefore, a subjective tolerance (determined by α) of error can be predetermined by a specific utility function of the decision maker. Intrinsic features of the conditional probability guarantee its consistency and coherence when applying it to dynamic quality management throughout the entire process. We show several properties of these measures in our study and explain how to compute them for quality management of a BDS.

2 A six sigma process is one in which 99.99966% of all opportunities are statistically expected to be free of defects (i.e., 3.4 defective features per million opportunities).

We organize the article as follows. Section 2 introduces some background information focusing on BDS and quality management. We describe the prevailing paradigm for the quality management of a big data system. We then introduce the methodology in more detail and illustrate several properties of the proposed framework. Section 3 presents three big data systems (i.e., SOWDA, GOWDA, and WRASA) based on wavelet algorithms. Section 4 conducts an empirical study to evaluate these three systems' performances on big electricity demand data from France through analysis and forecasting. It also provides a discussion with simulation for several special cases that have not been observed in the empirical study.

Our empirical and simulation results confirm the efficiency and robustness of the method proposed herein. We summarize our conclusions in Sect. 5.

2 Quality management for big data systems

Optimal quality management refers to the methodologies, strategies, technologies, systems, and applications that analyze critical data to help an institution or individual better understand the viability, stability, and performance of systems with timely beneficial decisions in order to maintain a desired level of excellence. In addition to the data processing and analytical technologies, quality management for BDS assesses macroscopic and microscopic aspects of system processing and then combines all relevant information to conduct various high-impact operations to ensure the system's consistency. Therefore, the foundation of quality management for BDS is how to efficiently evaluate the interoperations of all components dealing with the big data through robust methodologies (see Sun et al. 2015, for example).

In this section we first classify BDS—that is, the fundamentals for the system of systems engineering. We then discuss quality lifecycle management (QLM) for systematic quality management. We propose the framework of dynamic coherent quality management, which is a continuous process to consistently identify critical errors. We then set up the quantitative framework of metrics to evaluate the decision-making quality of BDS in terms of error control.

2.1 Big data system

The integration and assessment of a broad spectrum of data—where such data usually illustrate the typical "6Vs" (i.e., variety, velocity, volume, veracity, viability, and value)—affect a big data system. Sun et al. (2015) summarize four features of big data as follows: (1) autonomous and heterogeneous resources, (2) diverse dimensionality, (3) unconventional processing, and (4) dynamic complexity. Monitoring and controlling data quality turn out to be more critical (see Hazen et al. 2014, for example). In our article we refer to the big data-initiated system of systems as BDS, which is an incorporation of task-oriented or dedicated systems that interoperate their big data resources, integrate their proprietary infrastructural components, and improve their analytic capabilities to create a new and more sophisticated system for more functional and superior performance. A complete BDS contains (1) data engineering, which intends to design, develop, and integrate data from various resources and run ETL (extract, transform, and load) on top of big datasets, and (2) data science, which applies data mining, machine learning, and statistical approaches to discover knowledge and develop intelligence (such as decision making).

Fig. 1 Fundamental architectures of a big data system (BDS). Dots indicate components (processors and terminals) and edges are links. The component could be a system or a unit. a Centralized BDS that has a central location, using terminals that are attached to the center. b Fragmented (or decentralized) BDS that allocates one central operator to many local clusters that communicate with associated processors (or terminals). c Distributed BDS that spreads out computer programming and data on many equally weighted processors (or terminals) that directly communicate with each other

When integrating all resources and capabilities together into a BDS for more functionality and performance, such processing is called system of systems engineering (SoSE) (see Keating and Katina 2011, for example). SoSE for BDS integrates various, distributed, and autonomous systems into a large sophisticated system that pools interacting, interrelated, and interdependent components as an incorporation. We illustrate three fundamental architectures of BDS in Fig. 1: centralized, fragmented, and distributed.

Centralized BDS has a central operator that serves as the acting agent for all communications with associated processors (or terminals). Fragmented (or decentralized) BDS allocates one central operator to many local clusters that communicate with associated processors (or terminals). Distributed BDS spreads out computer programming and data on many equally weighted processors (or terminals) that directly communicate with each other. For more details about data center network topologies, see Liu et al. (2013) and references therein.

We can identify five properties of BDS following "Maier's criteria" (see Maier 1998): (1) operational independence, i.e., each subsystem of BDS is independent and achieves its purposes by itself; (2) managerial independence, that is, each subsystem of BDS is managed in large part for its own purposes rather than the purposes of the whole; (3) heterogeneity, that is, different technologies and implementation media are involved and a large geographic extent will be considered; (4) emergent behavior, that is, BDS has capabilities and properties such that the interaction of BDS inevitably results in behaviors that are not predictable in advance from the component systems; and (5) evolutionary development, that is, a BDS evolves with time and experience. This classification shares some joint properties that can be used to characterize a complex system, for example, the last three properties of heterogeneity, emergence, and evolution.


2.2 Systematic quality management

Modeling a system's quality features helps characterize the availability and reliability of the underlying system. The unavailability and unreliability of a system present an overshoot with respect to its asymptotic metrics of performance, with the hazard characterizing the failure rate associated with design inadequacy. Levels of uncertainty in modeling and control are likely to be significantly higher with BDS. Greater attention must be paid to the stochastic aspects of learning and identification. Uncertainty and risk must be rigorously managed.

These arguments push for a renewed mechanism for system monitoring, fault detection and diagnosis, and fault-tolerant control that includes protective redundancies at the hardware and software level (see Shooman 2002). Rigorous, scalable, and sophisticated algorithms shall be applied for assuring the quality, reliability, and safety of the BDS. Achieving consistent quality in the intelligent decisions of the BDS is a multidimensional challenge with increasingly global data flows, a widespread network of subsystems, technical complexity, and highly demanding computational efficiency.

To be most effective, quality management should be considered as a continuous improvement program with learning as its core. Wu and Zhang (2013) point out that exploitative-oriented and explorative-oriented quality management need to consistently ensure quality control over the operation lifecycle throughout the entire process. Parast and Adams (2012) highlight two theoretical perspectives (i.e., convergence theory and institutional theory) of quality management over time. Both of them emphasize that quality management evolves over time by using cross-sectional, adaptive, flexible, and collaborative methods so that the quality information obtained in one lifecycle stage can be transferred to relevant processes in other lifecycle stages. Contingency theory also recommends that the operation of quality control be continually adapted (see Agarwal et al. 2013, for example), and quality information must therefore be highly available throughout the whole system to ensure that any and all decisions that may require quality data or affect performance quality are informed in a timely, efficient, and accurate fashion.

Quality lifecycle management (QLM)3 is a system-wide, cross-functional solution to ensure that processing performance, reliability, and safety are aligned with the requirements set for them over the course of the system performance. Mellat-Parast and Digman (2008) extend the concept of quality beyond the scope of a firm by providing a network perspective of quality.

QLM is used to build quality, reliability, and risk planning into every part of the processing lifecycle by aligning functional needs with performance requirements, ensuring that these requirements are met by specific characteristics, and tracking these characteristics systematically throughout design, testing, development, and operation to ensure the requirements are met at every lifecycle stage. Outputs from each lifecycle stage, including analysis results, performance failures, corrective actions, systemic calibrations, and algorithmic verifications, are compiled within a single database platform using QLM. They are made accessible to other relevant lifecycle stages using automated processes. This guarantees the continuous improvement of systemic performance over the course of development and further improvement. QLM links together the quality, reliability, and safety activities that take place across every stage of the product development lifecycle. Through QLM, one lifecycle stage informs the next, and the feedback from each stage is automatically fed into the other stages to which it relates; we then obtain a unified and holistic view of the overall quality of performance.

3 PTC white paper. PTC is a global provider of technology platforms and solutions that transform how companies create, operate, and service the "things" in the Internet of Things (IoT). See www.ptc.com.


2.3 The methodology

In this section, we describe the axiomatic framework for QLM with respect to the dynamic coherent error measures. Based on the Kullback–Leibler distance, we propose two concepts—that is, the tail error (TE) and conditional tail error (CTE)—and their dynamic setting to guarantee the feature of time consistency. These two error measures can be applied dynamically to control performance errors at each stage of BDS in order to conduct QLM.

2.3.1 Notation and definitions

The uncertainty of future states shall be represented by a probability space (Ω, F, P), which is the domain of all random variables. At time t a big data system is employed to make a decision about a future scenario after a given time horizon Δ, for example, to predict the price change from t to t + Δ. We denote a quantitative measure of the future scenario at time t by Ṽ(t), which is a random variable and observable through time. The difference between the scenario predicted at time t and its true observation at time t + Δ is then measured by V(t + Δ) − Ṽ(t). If we denote Ṽ(·) as the prediction and V(·) as the observation, then the Kullback–Leibler distance or divergence of the prediction error is given by the following definition:

Definition 1 For given probability density functions Ṽ(x) and V(x) on an open set A ⊆ ℝ^N, the Kullback–Leibler distance or divergence of Ṽ(x) from V(x) is defined by

\[
K(V \,\|\, \tilde V) = \int_{A} V(x)\,\log\frac{V(x)}{\tilde V(x)}\,dx. \tag{1}
\]

Axiom 1 Assume that Ṽ(x) and V(x) are continuous on an open set A. Then for arbitrary Ṽ(x) and V(x), K(V‖Ṽ) ≥ 0, and K(V‖Ṽ) = 0 if and only if V(x) = Ṽ(x) for any x ∈ A. Because Ṽ(x) and V(x) are density functions, we have ∫ V(x) dx = 1 and ∫ Ṽ(x) dx = 1, which shows Axiom 1.
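For illustration, the following sketch estimates the Kullback–Leibler distance of Definition 1 from two samples by discretizing both densities on a common histogram grid. The function name, the binning scheme, and the small regularization constant are illustrative assumptions of this sketch, not part of the original framework.

```python
import numpy as np

def kl_divergence(v_obs, v_pred, bins=50, eps=1e-12):
    """Histogram estimate of K(V || V~) = int_A V(x) log(V(x)/V~(x)) dx.

    v_obs, v_pred: samples of the observed and predicted quantities.
    The common binning and the epsilon guard against log(0) are practical
    choices for this sketch, not part of Definition 1 itself.
    """
    lo = min(v_obs.min(), v_pred.min())
    hi = max(v_obs.max(), v_pred.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(v_obs, bins=edges, density=True)   # V(x) on the grid
    q, _ = np.histogram(v_pred, bins=edges, density=True)  # V~(x) on the grid
    width = edges[1] - edges[0]
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) * width)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.normal(0.0, 1.0, 10_000)    # stands in for the observation V
    pred = rng.normal(0.2, 1.1, 10_000)   # stands in for the prediction V~
    print(kl_divergence(obs, pred))       # nonnegative, in line with Axiom 1
```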

Axiom 2 Consider a set D of real-valued random variables. A functional ρ : D → ℝ is referred to as a consistent error measure for D if it satisfies all of the following axioms:

(1) ∀ K1, K2 ∈ D such that K2 ≥ K1 a.s., we have ρ(K1) ≤ ρ(K2).
(2) ∀ K1, K2 ∈ D such that K1 + K2 ∈ D, we have ρ(K1 + K2) ≤ ρ(K1) + ρ(K2); if K1 and K2 are comonotonic random variables, that is, for every ω1, ω2 ∈ Ω: (K1(ω2) − K1(ω1))(K2(ω2) − K2(ω1)) ≥ 0, we have ρ(K1 + K2) = ρ(K1) + ρ(K2).
(3) ∀ K ∈ D and γ ∈ ℝ+ such that γK ∈ D, we have ρ(γK) ≤ γρ(K).
(4) ∀ K ∈ D and a ∈ ℝ such that K + a ∈ D, we have ρ(K + a) = ρ(K) + a.
(5) ∀ K1, K2 ∈ D with cumulative distribution functions F_{K1} and F_{K2}, respectively, if F_{K1} = F_{K2} then ρ(K1) = ρ(K2).

This axiom is motivated by the coherent measure proposed by Artzner et al. (2007). Here, we modify the axiomatic framework of consistency in describing the error function. By Axiom 2-(1), a monotonicity of the error function is proposed: the larger the Kullback–Leibler distance is, the larger the error metric will be. Axiom 2-(2) says that if we combine two tasks, then their total error is not greater than the sum of the errors associated with each of them; when the two tasks are comonotonic, the total error equals the sum of the individual errors. This property is called subadditivity in risk management and is also applied in systematic computing such that an integration does not create any extra error. Axiom 2-(3) can be derived if Axiom 2-(2) holds, and it reflects positive homogeneity. It shows that error reduction requires that processing uncertainty should linearly influence task loads. Axiom 2-(4) shows that the error will not be altered by adding or subtracting a deterministic quantity a to a task. Axiom 2-(5) states the probabilistic equivalence that is derived directly from Axiom 1.

2.3.2 Dynamic error measures

A dynamic error measure induces for each operation a quality control process describing the error associated with the operation over time. O'Neill et al. (2015) point out that quality management helps maintain the inherent characteristics (distinguishing features) in order to fulfill certain requirements. The inherence of quality requires a dynamic measure to conduct continuous processing of error assessment in the rapidly changing circumstances of an operational system. During the dynamic phase, the decision-making process involves investigating the errors generated by the big data systems, selecting an appropriate response (system of work), and making a judgment on whether the errors indicate a systematic failure.

How to interrelate quality control process over different periods of time is important for the dynamic measure. Consequently, a dynamic error measure should take into account the information available at the time of assessment and update the new information continuously over time. Time consistency then plays a crucial role in evaluating the performance of dynamic measures, particularly for big data systems that run everlastingly before stopping. Therefore, the quality measure for errors shall preserve the property of consistency.

Bion-Nadal (2008) defines dynamic risk measures based on a continuous-time filtered probability space (Ω, F, (F_t)_{t∈ℝ+}, P), where the filtration (F_t)_{t∈ℝ+} is right continuous. Similarly, following Axiom 2, we can extend the framework to a dynamic error measure following Bion-Nadal (2008), Chen et al. (2017), and references therein.

Definition 2 For any stopping time τ, we define the σ-algebra F_τ by F_τ = {K ∈ F | ∀ t ∈ ℝ+ : K ∩ {τ ≤ t} ∈ F_t}. Then L^∞(Ω, F_τ, P) is the Banach algebra of essentially bounded real-valued F_τ-measurable functions.

The Kullback–Leibler distance for the system error will be identified as an essentially bounded F_τ-measurable function with its class in L^∞(Ω, F_τ, P). Therefore, an error at a stopping time τ is an element of L^∞(Ω, F_τ, P). We can distinguish a finite time horizon as the case defined by F_t = F_T for all t ≥ T.

Axiom 3 For all K ∈ L^∞(F_{τ2}), a dynamic error measure (ρ_{τ1,τ2})_{0≤τ1≤τ2} on (Ω, F, (F_t)_{t∈ℝ+}, P), where τ1 ≤ τ2 are two stopping times, is a family of maps defined on L^∞(Ω, F_{τ2}, P) with values in L^∞(Ω, F_{τ1}, P). (ρ_{τ1,τ2}) is consistent if it satisfies the following properties:

(1) ∀ K1, K2 ∈ L^∞(F_{τ2}) such that K2 ≥ K1 a.s., we have ρ_{τ1,τ2}(K1) ≤ ρ_{τ1,τ2}(K2).
(2) ∀ K1, K2 ∈ L^∞(F_{τ2}) such that K1 + K2 ∈ L^∞(F_{τ2}), we have ρ_{τ1,τ2}(K1 + K2) ≤ ρ_{τ1,τ2}(K1) + ρ_{τ1,τ2}(K2); if K1 and K2 are comonotonic random variables, i.e., for every ω1, ω2 ∈ Ω: (K1(ω2) − K1(ω1))(K2(ω2) − K2(ω1)) ≥ 0, we have ρ_{τ1,τ2}(K1 + K2) = ρ_{τ1,τ2}(K1) + ρ_{τ1,τ2}(K2).
(3) ∀ K ∈ L^∞(F_{τ2}) and γ ∈ ℝ+ such that γK ∈ L^∞(F_{τ2}), we have ρ_{τ1,τ2}(γK) ≤ γ ρ_{τ1,τ2}(K).
(4) ∀ K ∈ L^∞(F_{τ2}) and a ∈ L^∞(F_{τ1}) such that K + a ∈ L^∞(F_{τ2}), we have ρ_{τ1,τ2}(K + a) = ρ_{τ1,τ2}(K) + a.
(5) For any increasing (resp. decreasing) sequence K_n ∈ L^∞(F_{τ2}) such that K = lim K_n, the decreasing (resp. increasing) sequence ρ_{τ1,τ2}(K_n) has the limit ρ_{τ1,τ2}(K), i.e., the measure is continuous from below (resp. above).

This axiom is motivated by Axiom 2 and adapts consistency to the dynamic setting for error assessment. Following Bion-Nadal (2008), Chen et al. (2017), and references therein, we can define time invariance as follows:

Definition 3 For all K ∈ L^∞(F_τ), a dynamic error measure (ρ_τ) on (Ω, F, (F_t)_{t∈ℝ+}, P), where τ1, τ2 ∈ {0, 1, . . . , T − 1} are consecutive stopping times and s is a constant of time, is said to be time invariant if it satisfies the following properties:

(1) ∀ K1, K2 ∈ L^∞(F_τ): ρ_{τ1+s}(K1) = ρ_{τ1+s}(K2) ⟹ ρ_{τ1}(K1) = ρ_{τ1}(K2), and ρ_{τ2+s}(K1) = ρ_{τ2+s}(K2) ⟹ ρ_{τ2}(K1) = ρ_{τ2}(K2).
(2) ∀ K ∈ L^∞(F_τ): ρ_{τ1+s}(K) = ρ_{τ2+s}(K) ⟹ ρ_{τ1}(K) = ρ_{τ2}(K).
(3) ∀ K ∈ L^∞(F_τ): ρ_{τ1}(K) = ρ_{τ1}(−ρ_{τ1+s}(K)) and ρ_{τ2}(K) = ρ_{τ2}(−ρ_{τ2+s}(K)).

Based on the equality and recursive property in Definition 3, we can construct a time-invariant error measure by composing one-period (i.e., s = 1) measures over time such that for all t < T − 1 and K ∈ L^∞(F_τ) the composed measure ρ̃_τ satisfies (1) ρ̃_{T−1}(K) = ρ_{T−1}(K) and (2) ρ̃_τ(K) = ρ_τ(−ρ̃_{τ+1}(K)) (see Cheridito and Stadje 2009, for example).

2.3.3 Tail error (TE) and conditional tail error (CTE)

Enlightened by the literature on extreme value theory, we define the tail error (TE) and its coherent extension, the conditional tail error (CTE). Considering some outcomes generated from a system, an error or distance is measured by the Kullback–Leibler distance K, and for any arbitrary value ε we denote by F_K(ε) = P(K ≤ ε) the distribution of the corresponding error. We intend to define a statistic based on F_K that can measure the severity of the erroneous outcome over a fixed time period. We could directly use the maximum error, that is, inf{ε ∈ ℝ : F_K(ε) = 1}. When the support of F_K is not bounded, the maximum error becomes infinity. Moreover, the maximum error does not provide us the probabilistic information of F_K. The tail error therefore extends it to the maximum error that will not be exceeded with a given probability.

Definition 4 Given some confidence level α ∈ (0, 1), the tail error of outcomes from a system at the confidence level α is given by the smallest value ε such that the probability that the error K exceeds ε is not larger than 1 − α; that is, the tail error is a quantile of the error distribution:

\[
\mathrm{TE}_\alpha(K) = \inf\{\varepsilon \in \mathbb{R} : P(K > \varepsilon) \le 1 - \alpha\} = \inf\{\varepsilon \in \mathbb{R} : F_K(\varepsilon) \ge \alpha\}. \tag{2}
\]

Definition 5 For an error measured by the Kullback–Leibler distance K with E(K) < ∞, and a confidence level α ∈ (0, 1), the conditional tail error (CTE) of the system's outcomes at confidence level α is defined as

\[
\mathrm{CTE}_\alpha(K) = \frac{1}{1-\alpha}\int_{\alpha}^{1}\mathrm{TE}_\beta(K)\,d\beta, \tag{3}
\]

where β ∈ (0, 1).


We can formally present TE_α(K) in Eq. (2) as a minimum contrast (discrepancy) parameter as follows:

\[
\mathrm{TE}_\alpha(K) = \arg\min_{\mathrm{TE}_\alpha(K)}\left\{(1-\alpha)\int_{-\infty}^{\mathrm{TE}_\alpha(K)}\bigl|K-\mathrm{TE}_\alpha(K)\bigr|\,f(K)\,dK \;+\; \alpha\int_{\mathrm{TE}_\alpha(K)}^{\infty}\bigl|K-\mathrm{TE}_\alpha(K)\bigr|\,f(K)\,dK\right\}, \tag{4}
\]

where α ∈ (0, 1) identifies the location of K on the distribution of errors. Compared with TE, CTE averages TE over all levels β ≥ α and thus investigates further into the tail of the error distribution. We can see that CTE_α is determined only by the distribution of K and that CTE_α ≥ TE_α. TE is monotonic, as given by Axiom 2-(1), positively homogeneous by Axiom 2-(3), translation invariant as shown by Axiom 2-(4), and equivalent in probability by Axiom 2-(5).

However, the subadditivity property cannot be satisfied in general for TE, and it is thus not a coherent error measure. The conditional tail error satisfies all properties given by Axiom 2, and it is a coherent error measure. In addition, under the dynamic setting, CTE satisfies all properties given by Axiom 3, so it is also a dynamic coherent error measure.

In other words, TE measures the error that will not be exceeded at a given confidence level α: the probability of an error greater than TE is less than or equal to 1 − α, whereas the probability of an error not exceeding TE is at least α. CTE measures the average error of the performance in the given share of the worst cases (above TE). Let CTE_α(K)^+ = E[K | K > TE_α(K)] denote the expected error strictly exceeding TE and CTE_α(K)^− = E[K | K ≥ TE_α(K)] the expected error weakly exceeding TE. Then CTE_α(K) can be treated as the weighted average of CTE_α(K)^+ and TE_α(K):

\[
\mathrm{CTE}_\alpha(K) =
\begin{cases}
\lambda_\alpha(K)\,\mathrm{TE}_\alpha(K) + \bigl(1-\lambda_\alpha(K)\bigr)\,\mathrm{CTE}_\alpha(K)^{+}, & \text{if } F_K\bigl(\mathrm{TE}_\alpha(K)\bigr) < 1;\\[1ex]
\mathrm{TE}_\alpha(K), & \text{if } F_K\bigl(\mathrm{TE}_\alpha(K)\bigr) = 1,
\end{cases} \tag{5}
\]

where

\[
\lambda_\alpha(K) = \frac{F_K\bigl(\mathrm{TE}_\alpha(K)\bigr) - \alpha}{1-\alpha}.
\]

We can see that TE_α(K) and CTE_α(K)^+ are discontinuous functions for the error distributions of K, whereas CTE_α(K) is continuous with respect to α; see Sun et al. (2009) and references therein. We then obtain TE_α(K) ≤ CTE_α(K)^− ≤ CTE_α(K) ≤ CTE_α(K)^+. It is clear that CTE evaluates the performance quality in a conservative way by focusing on the less accurate outcomes. The level α for TE and CTE can be used to scale the user's tolerance of critical errors.

2.3.4 Computational methods

The TE_α represents the error that, with probability α, will not be exceeded. For continuous distributions it is an α-quantile, such that TE_α(K) = F_K^{-1}(α), where F_K(·) is the cumulative distribution function of the error random variable K. For the discrete case there might not be a unique defined value but a probability mass around the value; we can then compute TE_α(K) with TE_α(K) = min{TE(K) : P(K ≤ TE(K)) ≥ α}.

The CTE_α(K) represents the worst (1 − α) part of the error distribution of K that lies above TE_α(K) (i.e., it is conditional on the error exceeding TE_α). When TE_α(K) falls in a continuous part of the error distribution, we can derive the α-th CTE of K as CTE_α(K) = E(K | K > TE_α(K)).

However, when TE_α falls in a probability mass, we compute CTE_α(K) following Eq. (5):

\[
\mathrm{CTE}_\alpha(K) = \frac{(\hat\beta - \alpha)\,\mathrm{TE}_\alpha(K) + (1-\hat\beta)\,E\bigl(K \mid K > \mathrm{TE}_\alpha(K)\bigr)}{1-\alpha},
\]

where β̂ = max{β : TE_α(K) = TE_β(K)}.

It is obvious that the α-th quantile of K is critical. The classical method for computing it is based on the order statistics; see David and Nagaraja (2003) and references therein. Given a set of values K_1, K_2, . . . , K_n, we can define the quantile for any fraction p as follows:

1. Sort the values in order, such that K_(1) ≤ K_(2) ≤ · · · ≤ K_(n). The values K_(1), . . . , K_(n) are called the order statistics of the original sample.
2. Take the order statistics to be the quantiles that correspond to the fractions p_i = (i − 1)/(n − 1), for i = 1, 2, . . . , n.
3. Use linear interpolation between two consecutive p_i to define the p-th quantile; if p lies a fraction f of the way from p_i to p_{i+1}, then TE_p(K) = (1 − f) TE_{p_i}(K) + f TE_{p_{i+1}}(K).

K in Axiom 1 measures the distance (i.e., error) between Ṽ(x) and V(x). If the errors illustrate homogeneity and isotropy, for example, when the distribution of errors is characterized by a known probability function, then a parametric method can be applied; see Sun et al. (2009) and references therein.
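As a concrete companion to the computational steps above, here is a minimal sketch that computes TE_α as the empirical α-quantile via order statistics with linear interpolation and CTE_α via the weighted average of Eq. (5). The function names are hypothetical helpers, and the lognormal sample merely stands in for a set of observed Kullback–Leibler errors.

```python
import numpy as np

def tail_error(k, alpha):
    """TE_alpha: empirical alpha-quantile via order statistics with the
    linear interpolation of steps 1-3 above."""
    k = np.sort(np.asarray(k, dtype=float))
    n = len(k)
    p = np.arange(n) / (n - 1)            # p_i = (i - 1)/(n - 1), 1-based i
    return float(np.interp(alpha, p, k))

def conditional_tail_error(k, alpha):
    """CTE_alpha via the weighted average of Eq. (5)."""
    k = np.asarray(k, dtype=float)
    te = tail_error(k, alpha)
    f_te = np.mean(k <= te)               # empirical F_K(TE_alpha)
    if f_te >= 1.0:                       # no probability mass above TE
        return te
    cte_plus = float(k[k > te].mean())    # E[K | K > TE_alpha]
    lam = max(0.0, (f_te - alpha) / (1.0 - alpha))  # clip tiny sampling noise
    return lam * te + (1.0 - lam) * cte_plus

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    errors = rng.lognormal(mean=-1.0, sigma=0.8, size=5_000)  # simulated KL errors
    for a in (0.50, 0.75, 0.90, 0.95, 0.99):
        print(a, round(tail_error(errors, a), 4),
              round(conditional_tail_error(errors, a), 4))
```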

3 Wavelet-based big data systems

We are now going to evaluate big data systems under the framework we proposed in Sect. 2.

We first describe three big data systems that have been used for intelligent decision making with different functions in this section before we conduct a numerical investigation. These three big data systems have been built with wavelet algorithms that were applied in big data analytics; see Sun and Meinl (2012) and references therein. The first big data system is named SOWDA (smoothness-oriented wavelet denoising algorithm), which optimizes a decision function with two control variables as proposed by Chen et al. (2015). It is originally used to preserve the smoothness (low-frequency information) when processing big data. The second is GOWDA (generalized optimal wavelet decomposition algorithm) introduced by Sun et al. (2015), which is an efficient system for data processing with multiple criteria (i.e., six equally weighted control variables). Because GOWDA has a more sophisticated decision function, an equal weight framework significantly improves computational efficiency. It is particularly suitable for decomposing multiple-frequency information with heterogeneity.

The third one is called WRASA (wavelet recurrently adaptive separation algorithm), an extension of GOWDA as suggested by Chen et al. (2015), which determines the decision function of multiple criteria with convex optimization to optimally weight the criteria. Chen and Sun (2018) illustrate how to incorporate these systems with a reinforcement learning framework. All three big data systems have open access to adopt components of wavelet analysis.

As we have discussed in Sect. 2, a complete BDS contains two technology stacks: data engineering and data science. The former intends to design, develop, and integrate data from various resources and run ETL on top of big datasets, and the latter applies data mining, machine learning, and statistics approaches to discovering knowledge and developing intelligence (such as decision making). In this section, we briefly introduce the three BDSs focusing on their engineering and learning.

3.1 Data engineering with wavelet

Big data has several unique characteristics (see Sect. 2.1) and requires special processing before mining it. When dealing with heterogeneity, a multi-resolution transformation of the input data enables us to analyze the information simultaneously on different dimensions. We are going to decompose the input data X as follows:

\[
X_t = S_t + N_t, \tag{6}
\]

where S_t is the trend, which we estimate by S̃_t, and N_t is the additive random feature sampled at time t. Let W(·; ω, ζ) denote the wavelet transform operator with specific wavelet function ω for ζ levels of decomposition, and let W^{-1}(·) be its corresponding inverse transform. Let D(·; γ) denote the denoising operator with thresholding rule γ. Wavelet data engineering then extracts S̃_t from X_t as an optimal approximation of S_t after removing the randomness. We summarize the processing as follows:

\[
Y = \mathcal{W}(X; \omega, \zeta), \qquad Z = \mathcal{D}(Y; \gamma), \qquad \tilde S = \mathcal{W}^{-1}(Z).
\]

The complete procedure shall implement the operators W(·) and D(·) after careful selection of ω, ζ, and γ; see Sun et al. (2015).
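To make the W–D–W^{-1} pipeline concrete, the sketch below uses PyWavelets with an ordinary DWT (pywt.wavedec/waverec) and a universal soft threshold. This is a simplified stand-in for the MODWT-based procedures of SOWDA, GOWDA, and WRASA, so the wavelet name, decomposition level, and threshold rule are assumptions chosen only for illustration.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=3):
    """Extract a smooth trend estimate S~ from X = S + N.

    Stand-in for the W(.; omega, zeta), D(.; gamma), W^(-1)(.) pipeline:
    ordinary DWT plus a universal soft threshold instead of the MODWT-based
    procedures of SOWDA/GOWDA/WRASA.
    """
    coeffs = pywt.wavedec(x, wavelet, level=level)        # Y = W(X; omega, zeta)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise scale from finest detail
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))           # universal threshold
    shrunk = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    s_tilde = pywt.waverec(shrunk, wavelet)               # S~ = W^(-1)(Z)
    return s_tilde[: len(x)]                              # waverec may pad one sample

if __name__ == "__main__":
    t = np.linspace(0, 4 * np.pi, 2048)
    s = np.sin(t) + 0.5 * np.sin(5 * t)                                  # trend S_t
    x = s + 0.3 * np.random.default_rng(2).standard_normal(t.size)       # X_t = S_t + N_t
    print(np.mean((wavelet_denoise(x) - s) ** 2))                        # squared error of S~
```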

Wavelets are bases of L^2(ℝ) and enable a localized time-frequency analysis with wavelet functions that usually have either compact support or decay exponentially fast to zero. The wavelet function (or mother wavelet) ψ(t) is defined such that

\[
\int_{-\infty}^{\infty} \psi(t)\,dt = 0.
\]

Given α, β ∈ ℝ, α ≠ 0, by compressing and expanding ψ(t) we obtain a successive wavelet ψ_{α,β}(t), such that

\[
\psi_{\alpha,\beta}(t) = \frac{1}{\sqrt{|\alpha|}}\,\psi\!\left(\frac{t-\beta}{\alpha}\right), \tag{7}
\]

where we call α the scale or frequency factor and β the time factor. Given ψ̄(t) as the complex conjugate of ψ(t), a successive wavelet transform of a time series or finite signal f(t) ∈ L^2(ℝ) can be defined as

\[
\mathcal{W}_\psi f(\alpha, \beta) = \frac{1}{\sqrt{|\alpha|}} \int_{\mathbb{R}} f(t)\,\bar\psi\!\left(\frac{t-\beta}{\alpha}\right) dt. \tag{8}
\]

Equation (8) shows that the wavelet transform is the decomposition of f(t) under different resolution levels (scales). In other words, the idea of the wavelet transform W is to translate and dilate a single function (the mother wavelet) ψ(t). The degree of compression is determined by the scaling parameter α, and the time location of the wavelet is set by the translation parameter β. Here, |α| < 1 leads to compression and thus higher frequencies. The opposite (lower frequencies) is true for |α| > 1, leading to time widths adapted to their frequencies; see Chen et al. (2017) and references therein.


The continuous wavelet transform (CWT) is defined as the integral over all time of the data multiplied by scaled (stretched or compressed) and shifted (moved forward or backward) versions of the wavelet function. Because most big data applications have a finite length of time, only a finite range of scales and shifts is practically meaningful. Therefore, we consider the discrete wavelet transform (DWT), which is an orthogonal transform of a vector (discrete big data) X of length N (which must be a multiple of 2^J) into J wavelet coefficient vectors W_j ∈ ℝ^{N/2^j}, 1 ≤ j ≤ J, and one scaling coefficient vector V_J ∈ ℝ^{N/2^J}. The maximal overlap discrete wavelet transform (MODWT) has also been applied to reduce the sensitivity of the DWT to the choice of an initial value. The MODWT follows the same pyramid algorithm as the DWT by using rescaled filters; see Sun et al. (2015) for details. All three BDSs can apply the DWT and MODWT. We focus on the MODWT in this study.

Sun et al. (2015) pointed out that the wavelet transform can lead to different outcomes when choosing different input variables. Since wavelets are oscillating functions and transform the signal orthogonally in the Hilbert space, there are many wavelet functions that satisfy the requirement for the transform. Chen et al. (2015) show that different wavelets are used for different reasons. For example, the Haar wavelet has very small support, whereas wavelets of higher orders such as Daubechies DB(4) and least asymmetric (LA) wavelets have bigger support. Bigger support can ensure a smoother shape of S_j, for 1 ≤ j ≤ J, with each wavelet, and the scaling coefficient carries more information because of the increased filter width. Sun et al. (2015) highlighted three factors that influence the quality of the wavelet transform: the wavelet function (or mother wavelet), the number of maximal iterations (or level of decomposition), and the thresholding rule. However, there is no straightforward method to determine these three factors simultaneously. Therefore, different BDSs apply different decision functions when conducting machine learning.

3.2 Machine learning

In order to perform the wavelet transform that optimizes the analysis, that is, to see how close S̃_t is to S_t, the three big data systems apply different decision processing. First, let us define a separation factor θ, which is the combination of wavelet function, number of maximal iterations, and thresholding rule, as well as a factor space Θ ⊆ ℝ^p, p ≥ 0; then θ ∈ Θ. Different θ will lead to different wavelet performances. We then consider a random variable K with K = S − S̃. Here, K is in fact the approximation error and is determined by a real-valued approximating function m(K, θ) ∈ ℝ with E(m(K, θ)) = 0.

Definition 6 Let K_t ∈ ℝ^d be i.i.d. random vectors, and suppose that θ ∈ ℝ^p is uniquely determined by E(m(K, θ)) = 0, where m(K, θ) is called the approximating function and takes values in ℝ^{p+q} for q ≥ 0. The algorithm determines θ̃ such that θ̃ = arg max_θ R(θ), where

\[
R(\theta) = \min\left\{\sum_{i=1}^{n} \omega_i\, m_i(K, \theta)\right\}, \qquad \text{s.t. } \omega_i \ge 0,\ \ \sum_{i=1}^{n} \omega_i = 1, \tag{9}
\]

where ω_i is the weight for the i-th approximating function.

If we use only one type of approximation—that is, a single criterion—then the weight is one.

If we apply multiple criteria for the approximation, then we have to determine the associated weight for each approximating function.


3.2.1 Decision criteria

Six model selection criteria (i.e., m_1, . . . , m_6) have been suggested by Sun et al. (2015) to evaluate m(·). Given K_t = S_t − S̃_t for all t = 1, 2, . . . , T, where T is the data length and p the number of parameters estimated, the first three criteria are as follows:

\[
m_1 = \frac{\sum_{t=1}^{T} K_t^2}{T}, \tag{10}
\]

\[
m_2 = \ln\left(\frac{\sum_{t=1}^{T} K_t^2}{T}\right) + \frac{2p}{T}, \tag{11}
\]

and

\[
m_3 = \ln\left(\frac{\sum_{t=1}^{T} K_t^2}{T}\right) + \frac{p \ln T}{T}. \tag{12}
\]

The other three criteria are based on specific indicator functions. They are given as follows:

\[
m_4 = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}_{\{C(K)=1\}}, \tag{13}
\]

where

\[
C(K) =
\begin{cases}
1, & \text{if } \dfrac{\max_t \bigl|K_t - \frac{1}{T}\sum_{t=1}^{T} K_t\bigr|}{\sqrt{\frac{1}{T-1}\sum_{t=1}^{T}\bigl(K_t - \frac{1}{T}\sum_{t=1}^{T} K_t\bigr)^2}} > z_\alpha,\\[2ex]
0, & \text{otherwise},
\end{cases}
\]

and z_α is the α-th quantile of a probability distribution (e.g., α = 0.05) for K.

\[
m_5 = \frac{1}{T}\sum_{t=1}^{T}\left(\mathbf{1}_{\{\bar D(K)=1\}} + \mathbf{1}_{\{\underline D(K)=1\}}\right), \tag{14}
\]

where

\[
\bar D(K) =
\begin{cases}
1, & \text{if } K_t \le \bar\lambda,\\
0, & \text{otherwise},
\end{cases}
\qquad
\underline D(K) =
\begin{cases}
1, & \text{if } K_t \ge \underline\lambda,\\
0, & \text{otherwise},
\end{cases}
\]

with λ̄ the local maximum and λ̲ the local minimum.

\[
m_6 = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}_{\{C(K)=1\}}, \tag{15}
\]

where

\[
C(K) =
\begin{cases}
1, & \text{if } (S_{t+1} - S_t)(\tilde S_{t+1} - S_t) \le 0,\\
0, & \text{otherwise}.
\end{cases}
\]

As summarized by Sun et al. (2015), m_4 indicates the global extrema and m_5 the local extrema. Both of them have the ability to detect boundary problems—that is, an inefficient approximation at the beginning and the end of the signal. In addition, m_6 detects the severity of K focusing on directional consistency.

For the three big data systems we compare in our study, SOWDA uses m_1 and m_2, whereas GOWDA and WRASA use all six criteria when formulating their decision-making processes.
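For concreteness, the sketch below computes the first three selection criteria of Eqs. (10)–(12) and the directional criterion of Eq. (15) from an actual series and its approximation. The function name is a hypothetical helper, and the remaining criteria (which require the chosen extrema thresholds and quantile) are omitted.

```python
import numpy as np

def selection_criteria(s, s_tilde, p):
    """m1-m3 of Eqs. (10)-(12) and the directional criterion m6 of Eq. (15),
    given the actual series s, its approximation s_tilde, and the number of
    estimated parameters p."""
    s, s_tilde = np.asarray(s, float), np.asarray(s_tilde, float)
    k = s - s_tilde                                   # K_t = S_t - S~_t
    T = len(k)
    mse = np.sum(k ** 2) / T
    m1 = mse                                          # Eq. (10)
    m2 = np.log(mse) + 2.0 * p / T                    # Eq. (11), AIC-type penalty
    m3 = np.log(mse) + p * np.log(T) / T              # Eq. (12), BIC-type penalty
    # Eq. (15): share of steps where the approximation at t+1 sits on the
    # opposite side of S_t from the actual S_{t+1}
    m6 = np.mean((s[1:] - s[:-1]) * (s_tilde[1:] - s[:-1]) <= 0)
    return m1, m2, m3, m6
```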

3.2.2 Decision making

With the approximation error vector m = [m_1, m_2, . . . , m_6], the next step is to determine the weights of these factors, that is, ω ∈ ℝ^6. Because the approximation error m is random, for any given combination of the wavelet, level of decomposition, and thresholding function, the weighted error m^T ω can be expressed as a random variable with mean E[m^T ω] and variance V(m^T ω), where E[m^T ω] = m̄^T ω and V(m^T ω) = E[m^T ω − E[m^T ω]]^2 = ω^T Σ ω, with Σ the covariance matrix of m.

In our work, we apply the convex optimization suggested by Chen et al. (2017) to determine the weight ω that minimizes a linear combination of the expected value and the variance of the approximation error, that is,

\[
\begin{aligned}
&\underset{\omega}{\text{minimize}} && \bar m^{\mathsf T}\omega + \gamma\,\omega^{\mathsf T}\Sigma\,\omega\\
&\text{subject to} && \mathbf{1}^{\mathsf T}\omega = 1,\quad \omega \succeq 0,
\end{aligned}
\]

where γ is a sensitivity measure that trades off between a small expected approximation error and a small error variance because of computational efficiency. In this article, we allow γ > 0, which ensures a sufficiently large decrease in error variance when the expected approximation error is trivial with respect to the computational tolerance.
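A minimal sketch of this weighting step using CVXPY follows. The simulated criteria sample, the value of γ, and the function name are illustrative assumptions rather than the exact WRASA calibration; the covariance enters through its Cholesky factor so the quadratic term stays in standard convex form.

```python
import cvxpy as cp
import numpy as np

def optimal_weights(m_samples, gamma=0.5):
    """Solve  min_w  m_bar'w + gamma * w'Sigma w   s.t.  1'w = 1, w >= 0.

    m_samples: (n_runs, 6) array of realized criteria vectors [m1, ..., m6].
    gamma is an illustrative trade-off value, not the paper's calibration.
    """
    m_bar = m_samples.mean(axis=0)
    sigma = np.cov(m_samples, rowvar=False) + 1e-8 * np.eye(m_samples.shape[1])
    chol = np.linalg.cholesky(sigma)        # Sigma = L L', so w'Sigma w = ||L'w||^2
    w = cp.Variable(m_samples.shape[1], nonneg=True)
    objective = cp.Minimize(m_bar @ w + gamma * cp.sum_squares(chol.T @ w))
    cp.Problem(objective, [cp.sum(w) == 1]).solve()
    return w.value

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    samples = np.abs(rng.normal(1.0, 0.3, size=(200, 6)))   # simulated criteria values
    print(optimal_weights(samples).round(3))                # nonnegative, sums to 1
```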

Figure 2 illustrates the workflow of the three big data systems (i.e., SOWDA, GOWDA, and WRASA) that we investigate in this article. As we can see, the major difference among these three big data systems lies in the decision-making process. WRASA minimizes the resulting difference sequence between S̃_t and S_t by choosing an optimal combination of approximating functions m(·). Therefore, WRASA determines the approximating function m(·) and its corresponding weight with convex optimization. SOWDA and GOWDA simply take predetermined values of ω_i (i.e., equal weights in our study) rather than optimizing them. There has to be a trade-off between accuracy and computational efficiency if we want to keep the latency as low as possible.

Fig. 2 Workflow of the three big data systems (SOWDA, GOWDA, and WRASA), which have different decision processing paths at the evaluation stage

4 Empirical study

4.1 The data

In our empirical study, we use electricity power consumption in France from 01 January 2009 to 31 December 2014. The consumption data are taken from real-time measurements of production units. The National Control Center (Centre National d'Exploitation du Système - CNES) constantly matches electricity generation with consumers' power demands, covering the total power consumed in France nationwide (except Corsica). The input data for our empirical study are provided by RTE (Réseau de Transport d'Électricité—Electricity Transmission Network), which is the electricity transmission system operator of France. RTE is in charge of operating, maintaining, and developing the high-voltage transmission system in France, which is also Europe's largest at approximately 100,000 kilometers in length. The data supplied are calculated on the basis of load curves at 10-min frequency. RTE's regional units validate the data by adding missing values and by correcting erroneous data, and then provide the information data at 30-min frequency. These data are completed by estimated values for the power consumed from electricity generation on the distribution networks and private industrial networks.4

4.2 The method

In our empirical study, we apply the three big data systems (i.e., SOWDA, GOWDA, and WRASA) to complete the whole decision-making process, that is, to process the original power demand data and conduct predictive analytics. We evaluate the quality of the decision-making process (forecasting accuracy) with the TE and CTE discussed in Sect. 2.3.

Following Sun et al. (2015) and Chen et al. (2017), we chose Haar, Daubechies (DB), Symlet (LA), and Coiflet (Coif) wavelet functions. We apply the MODWT with the pyramid algorithm in our empirical study. Several threshold rules have been considered as well. The components used for the three systems in our study are listed as follows:

• ω ∈ {Haar, DB(2), DB(4), DB(8), LA(2), LA(4), LA(8), Coif(4), Coif(6), Coif(8)};
• ζ ∈ {i : i = 1, 2, 3};
• γ ∈ {Heuristic SURE, Minimax, SURE, Universal}.

4 See www.rte-france.com.

The subsample series that are used for the in-sample study are randomly selected by a moving window with length T. Replacement is allowed in the sampling. Letting T_F denote the length of the forecasting series, we perform the 1-day-ahead out-of-sample forecasting (1 ≤ T ≤ T + T_F ≤ N). In our analysis, the total sample length is 17,520 observations for a normal year and 17,568 for a leap year. The subsample length (i.e., the window length) T = 672 (2 weeks) was chosen for the in-sample simulation and T_F = 48 (1 day) for the out-of-sample forecasting.
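The sampling scheme described above can be organized as a simple rolling evaluation; the sketch below draws random two-week windows (T = 672 half-hourly observations) and collects 1-day-ahead (T_F = 48) absolute errors. The callable forecast_one_day is a placeholder for whichever big data system (SOWDA, GOWDA, or WRASA) is being evaluated.

```python
import numpy as np

T_WINDOW, T_FORECAST = 672, 48   # 2 weeks in-sample, 1 day ahead (30-min data)

def rolling_evaluation(series, forecast_one_day, n_draws=100, seed=4):
    """Draw in-sample windows at random (with replacement) and collect the
    absolute 1-day-ahead forecast errors. `forecast_one_day` is a placeholder
    for the BDS under evaluation and must return T_FORECAST values."""
    rng = np.random.default_rng(seed)
    last_start = len(series) - T_WINDOW - T_FORECAST
    errors = []
    for _ in range(n_draws):
        start = rng.integers(0, last_start + 1)
        window = series[start : start + T_WINDOW]
        actual = series[start + T_WINDOW : start + T_WINDOW + T_FORECAST]
        predicted = np.asarray(forecast_one_day(window))
        errors.append(np.abs(actual - predicted))
    return np.concatenate(errors)
```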

We compare the systems' performance by using the root mean squared error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) as the goodness-of-fit measures:

\[
\mathrm{RMSE} := \sqrt{\frac{\sum_{t=1}^{T}\bigl(S_t - \tilde S_t\bigr)^2}{T}}, \qquad
\mathrm{MAE} := \frac{\sum_{t=1}^{T}\bigl|S_t - \tilde S_t\bigr|}{T}, \qquad
\mathrm{MAPE} := \frac{100}{T}\sum_{t=1}^{T}\left|\frac{S_t - \tilde S_t}{S_t}\right|,
\]

where S_t denotes the actual value at time t, and S̃_t is its corresponding predicted value generated by the big data system.
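These three measures translate directly into a few lines of NumPy; the helper below is hypothetical and assumes the actual series contains no zeros (otherwise MAPE is undefined).

```python
import numpy as np

def goodness_of_fit(s, s_tilde):
    """RMSE, MAE, and MAPE between the actual values s and forecasts s_tilde."""
    s, s_tilde = np.asarray(s, float), np.asarray(s_tilde, float)
    err = s - s_tilde
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = 100.0 * np.mean(np.abs(err / s))   # assumes s has no zero entries
    return rmse, mae, mape
```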

Baucells and Borgonovo (2013) and Sun et al. (2007) suggested the Kolmogorov–Smirnov (KS) distance, the Cramér–von Mises (CVM) distance, and the Kuiper (K) distance as the metrics for evaluation. We use them here to compare the quality performance of the big data systems in our study.

Let F_n(S) denote the actual sample distribution of S, and let F(S̃) be the distribution function of the approximation or forecasts (i.e., the output of the BDS). The Kolmogorov–Smirnov (KS) distance, Cramér–von Mises (CVM) distance, and Kuiper (K) distance are defined as follows:

\[
\mathrm{KS} := \sup_{x\in\mathbb{R}} \bigl|F_n(S) - F(\tilde S)\bigr|,
\]
\[
\mathrm{CVM} := \int_{-\infty}^{\infty} \bigl(F_n(S) - F(\tilde S)\bigr)^2\, dF(\tilde S), \quad \text{and}
\]
\[
\mathrm{K} := \sup_{x\in\mathbb{R}} \bigl(F_n(S) - F(\tilde S)\bigr) + \sup_{x\in\mathbb{R}} \bigl(F(\tilde S) - F_n(S)\bigr).
\]

We conclude that the smaller these distances are, the better the performance of the system.

The KS distance focuses on deviations near the median of the distribution. Thus, it tends to be more sensitive near the center of the error distribution than at the tails, whereas the CVM distance measures the sensitivity of dispersion between the output and its trend with respect to the change of the data filter of BDS. The Kuiper distance considers the extreme errors.
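The three distances can be estimated from the empirical distribution functions of the actual and forecast samples. The sketch below evaluates both ECDFs on the pooled sample and approximates the CVM integral on that grid; these discretization choices are ours, not prescribed by the paper.

```python
import numpy as np

def ecdf(sample, points):
    """Empirical CDF of `sample` evaluated at `points`."""
    sample = np.sort(np.asarray(sample, float))
    return np.searchsorted(sample, points, side="right") / len(sample)

def distribution_distances(s, s_tilde):
    """KS, CVM, and Kuiper distances between the empirical distributions of
    the actual values and the forecasts."""
    grid = np.sort(np.concatenate([np.asarray(s, float), np.asarray(s_tilde, float)]))
    fn, f = ecdf(s, grid), ecdf(s_tilde, grid)
    diff = fn - f
    ks = np.max(np.abs(diff))
    kuiper = max(np.max(diff), 0.0) + max(np.max(-diff), 0.0)
    cvm = np.sum(diff[:-1] ** 2 * np.diff(f))   # integrate (Fn - F)^2 dF on the grid
    return ks, cvm, kuiper
```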

4.3 Results

We show the forecasting performance of the three big data systems (i.e., SOWDA, GOWDA, and WRASA) in Fig. 3. Panel (a) illustrates the predictive power demand generated from the three big data systems compared with the actual power demand. Panel (b) illustrates the forecasting errors versus the actual power demand. It is not easy to identify which one performs better than the others. Therefore, we report different error evaluation results.

Fig. 3 Comparison of the forecasting performance of the three big data systems (i.e., SOWDA, GOWDA, and WRASA). a Illustrates the predictive electricity power demand generated from the three systems compared with the actual electricity power demand. b Illustrates the forecasting errors compared with the actual electricity power demand

Table 1 reports the in-sample (training) and out-of-sample (forecasting) performances of SOWDA, GOWDA, and WRASA measured by MAPE, MAE, and RMSE as the goodness-of-fit criteria. For the in-sample modeling performances, we see that the values of MAPE, MAE, and RMSE on average are all reduced after applying WRASA, compared with the results of SOWDA and GOWDA. Compared to SOWDA, WRASA reduces the error on average by 27.75, 27.87, and 24.17% for the MAPE, MAE, and RMSE criteria, respectively. Compared to GOWDA, WRASA reduces the error on average by 24.55, 24.56, and 21.47% for the MAPE, MAE, and RMSE, respectively.

We also focus on the results of the out-of-sample forecasting performances of SOWDA, GOWDA, and WRASA and evaluate them by using MAPE, MAE, and RMSE as the quality criteria. For the forecasting results, WRASA increases the accuracy on average by 28.06, 28.13, and 28.95% compared with SOWDA, and by 10.49, 10.12, and 10.22% compared with GOWDA, based on the MAPE, MAE, and RMSE criteria, respectively.

Table 2 reports the in-sample and out-of-sample performance of SOWDA, GOWDA, and WRASA when we take the Cramér–von Mises (CVM), Kolmogorov–Smirnov (KS), and Kuiper statistics into consideration as the goodness-of-fit criteria. In comparison with the results of SOWDA, WRASA improves the performance by 56.77, 38.26, and 32.94% measured by the CVM, KS, and Kuiper criteria, respectively. In addition, WRASA reduces the error by 49.84, 19.79, and 24.54% compared with GOWDA when evaluated by the CVM, KS, and Kuiper distance, respectively. From the CVM, KS, and Kuiper criteria, WRASA achieves better forecasting performance than SOWDA by 54.99, 34.58, and 30.80%, respectively. Under the same measurement, WRASA outperforms GOWDA by 31.41, 10.88, and 6.84%, respectively.

We next evaluate the performances of the three big data systems with the methods we have discussed in Sect. 2.3—that is, TE and CTE. With these tools, we can identify the performance quality with respect to their cumulative absolute errors. We choose the significance level α at 50%, 25%, 10%, 5%, and 1%. When α = 50%, TE is then the median of the absolute

