Clustering and Symbolic Regression for Power Consumption Estimation on Smartphone Hardware Subsystems


JOURNAL OF LaTeX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015


Ekarat Rattagan, Ying-Dar Lin, Fellow, IEEE, Yuan-Cheng Lai, Edward T.-H. Chu, and Kate Ching-Ju Lin


Abstract—The subsystems of a smartphone are its hardware components, such as the CPU, GPU, and screen. Accurately estimating the subsystem power consumption of commercial smartphones is necessary for a wide range of research areas. Current subsystem power estimation techniques are mostly based on power models, which produce considerable errors for several types of power consumption behavior. These include (1) asynchrony between the measured power consumption and the corresponding workload statistics, and (2) nonlinearity concerning CPU idle states, pixel colors of AMOLED screens, and GPU workload statistics.

In this study we propose a novel utilization-based subsystem power estimation method for smartphones, namely Clustering and Symbolic Regression (CSR), which takes these power consumption behaviors into account so as to increase power estimation accuracy. To address asynchrony, we cluster the subsystem workload statistics into synchronous and asynchronous groups by employing affinity propagation clustering. To address nonlinearity, we employ symbolic regression to fit the measured power consumption to the subsystem workload statistics. We compare our approach with various power estimation methods: Linear Regression Model (LM), Genetic Programming (GP), and Support Vector Regression (SVR). The results show a Mean Absolute Percentage Error (MAPE) reduction of between 23.61% and 42.55% in the estimated power consumption of simple (Nexus S) and complex (Galaxy S4) smartphone subsystems.

Index Terms—Smartphone subsystems, evolutionary computation, clustering methods, power consumption modeling and estimation

1 INTRODUCTION


Fast battery draining is the most critical issue for today's smartphones, and smartphone applications play an important role in causing it [1]. Without clearly understanding the power consumption behaviors of smartphone applications, developers may inadvertently cause their applications to drain excessive power. It is thus necessary for application developers to be aware of the power usage of the applications they develop. Developers can monitor the power consumption of applications using a power profile, such as the Android power profile [2], in which the manufacturer provides the power consumption information of the smartphone subsystems. However, the provided power profile causes significant errors in old smartphones. Dong and Zhong [3] determined the causes of inaccuracies in generated power profiles and suggested that the power profile of each smartphone should be reconstructed frequently to reduce its inaccuracy.

• E. Rattagan is with the Faculty of Information Science and Technology, Mahanakorn University of Technology, Thailand.

E-mail: rekara40@mut.ac.th

• Y. D. Lin and K. C.-J. Lin are with Department of Computer Science, National Chiao Tung University, Taiwan.

E-mail: {ydlin, katelin}@cs.nctu.edu.tw

• Y. C. Lai is with Department of Information Management, National Taiwan University of Science and Technology, Taiwan.

E-mail: laiyc@cs.ntust.edu.tw

• E. T.-H. Chu is with Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Taiwan.

E-mail: edwardchu@yuntech.edu.tw

Manuscript received ; revised .

In general, a power profile generated by smartphone vendors contains a list of correlations between subsystem workload statistics and the associated measured power consumption. However, reconstructing a power profile on a commercial smartphone is labor-intensive, especially for the power measurement task, as the manufacturer provides neither schematics for power measurement nor a way of measuring the consumption of each subsystem. Hence, to obtain the power consumption of the target subsystems, most existing works employ a subtractive method, which works by subtracting the power consumption of the other subsystems (obtained from the generated subsystem power models) from the total system power consumption (obtained from an external power meter).
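The subtractive method described above amounts to a single subtraction. The following minimal sketch illustrates it; the function name, the subsystem names, and all power values are hypothetical and chosen only for illustration.

```python
def subtractive_power(total_mw, other_subsystems_mw):
    """Power attributed to the target subsystem, in mW.

    total_mw: total system power from an external power meter.
    other_subsystems_mw: modeled power of every other active subsystem.
    """
    return total_mw - sum(other_subsystems_mw.values())


# Hypothetical readings, for illustration only.
total = 1500.0
others = {"cpu": 600.0, "screen": 450.0, "wifi": 150.0}
gpu_mw = subtractive_power(total, others)
```

Any error in the models of the other subsystems propagates directly into the value attributed to the target subsystem, which is why the accuracy of each subsystem model matters.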

Recently, several studies have proposed subsystem power estimation models for commercial smartphones. These studies can be classified into two categories: utilization-based methods and instrumentation-based methods. Utilization-based methods are profilers that collect statistical data at a regular interval, whereas instrumentation-based methods collect the required information when a specific event occurs. Among the utilization-based methods, Shy et al. [4] and Zhang et al. [5] used linear regression to build a power model of all major subsystems. Kjærgaard and Blunck [6] applied a genetic algorithm, whereas Ma et al. [7] used support vector regression to build a power model of the GPU, a subsystem that consumes power nonlinearly. Among the instrumentation-based methods, Pathak et al. [8] stated that tail energy, a type of asynchronous power, consumes significant power, and proposed system-call tracing to detect tail energy on subsystems such as GPS, SD card, Wi-Fi, and 3G.


2377-3782 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Asynchronous power (ASP) is defined as the consumed power that is uncorrelated with the corresponding subsystem workload statistics. Cao et al. [9] instrumented the browser in order to use Web page load activities and resource information for modeling and predicting the energy consumption of mobile Web pages.

Although existing power modeling techniques work in general cases, they still have limitations in handling the power consumption behavior of complex subsystems, leading to inaccurate power estimates. For example, linear regression techniques may generate good results for a CPU when only two parameters (utilization and frequency) are considered. However, they cause significant errors when additional CPU parameters (idle time and entries) are considered [10]. These errors result from the nonlinear power consumption behaviors of CPU idle time and entries. Nonlinear power (NLP) is defined as the consumed power that is not linearly proportional to the workload statistics. Moreover, ASP occurs in some subsystems, such as the GPU, which are too complex for it to be detected by instrumentation-based approaches. Furthermore, most existing works rely on power models whose forms (mathematical equations) are automatically discovered by mathematical modeling methods from machine learning or artificial intelligence, e.g., linear and nonlinear regression, genetic programming, and support vector regression (named a machine-defined model form). Compared with a model whose form is manually defined by a human (named a human-defined model form), the traditional machine-defined model form approach is only suitable for expressing simple power consumption behaviors. It is not applicable to the complex power consumption behavior of modern smartphones, whose subsystems are associated with many parameters, e.g., a GPU is associated with about 19 parameters.

Generally speaking, a human-defined model form requires manual effort but provides a more accurate model, while a traditional machine-defined model form is generated automatically but yields a less accurate model.

In this paper, we propose a novel utilization-based power estimation method, called Clustering and Symbolic Regression (CSR), which deals with ASP and NLP to improve the estimation accuracy. To determine ASP, a clustering method is first applied to classify the data samples as correlated with either synchronous or asynchronous power. In this work, the Affinity Propagation (AP) clustering algorithm [11] is applied, since AP does not need the number of clusters to be specified in advance. This property of AP is suitable for discovering ASP, whose number of occurrences is difficult to predict. CSR aims to detect ASP on various subsystems, especially on a complex subsystem such as the GPU, for which instrumentation-based approaches are impractical because the GPU source code is closed. Next, the obtained data samples, excluding ASP, are passed to the symbolic regression (SR) method of the Eureqa software [12] in order to build a power model.

Unlike traditional SR, Eureqa addresses the relationship among all parameters. We chose Eureqa because it is a machine-defined model form approach which can automatically discover power models whose mathematical forms are similar to those generated by human-defined model form approaches. With CSR, application developers can simply and quickly build a power profile (power model) for each smartphone being tested, especially for a smartphone which includes complex subsystems, such as a GPU and multiple CPUs. Our main contributions in this paper are as follows:

—We characterize two major behaviors of power consumption in smartphones, i.e., nonlinear and ASP consumption behaviors, which are significant causes of estimation errors.

—To the best of our knowledge, we are the first to use a clustering algorithm, Affinity Propagation, together with Eureqa to improve the accuracy of the estimated power consumption of smartphone subsystems.

—We investigate the impacts of these two power consumption behaviors, nonlinear and ASP consumption, on real applications running on two different generations of smartphone devices, Nexus S (single CPU core) and Galaxy S4 (Exynos 5 Octa CPU cores).

The remainder of this paper is organized as follows: Section 2 provides the background to our work, including brief introductions to Affinity Propagation and Eureqa; Section 3 gives the problem statement and definitions; Section 4 presents our CSR approach; Sections 5 and 6 show the experimental setup and experimental results, respectively; and Section 7 concludes this work.

2 BACKGROUND

In this section, we describe smartphone subsystems including workload statistics and the aspects of power consumption behaviors, such as ASP and NLP.

2.1 Smartphone Subsystems

A smartphone device is comprised of several hardware subsystems, such as the CPU, GPU, screen, 3G interface, Wi-Fi interface, GPS interface, and so on. Each subsystem is associated with a variety of features. In the scope of this paper, a CPU is associated with four parameters: utilization, frequency, the total time that the CPU stays in the idle state per second (CPU idle time), and the total number of times that the CPU enters the idle state per second (CPU idle entries).

Each subsystem operates in various operating states. Each operating state is represented as a vector storing all parameter values of a subsystem (workload statistics) and the associated power consumption. For example, the workload statistics of a busy CPU can be represented by four parameters, i.e., utilization = 100%, frequency = 1000 MHz, idle time = 0 ms, and idle entries = 0, with an associated power consumption of 600 mW.
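As an illustration of this representation, the busy-CPU operating state from the example can be written as a small record. This is only a sketch: the class and field names are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class CpuState(NamedTuple):
    """One CPU operating state: workload statistics plus measured power."""
    utilization_pct: float
    frequency_mhz: float
    idle_time_ms: float
    idle_entries: int
    power_mw: float

    def workload(self):
        """The workload-statistics vector, without the power reading."""
        return (self.utilization_pct, self.frequency_mhz,
                self.idle_time_ms, self.idle_entries)


# The busy-CPU example from the text.
busy = CpuState(100.0, 1000.0, 0.0, 0, 600.0)
```

Keeping the workload vector separate from the power reading mirrors the way the later sections treat a trained state and its paired power value.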

2.2 Power Consumption Behaviors

ASP is defined as the consumed power, which is uncorrelated with its corresponding workload statistics. In this paper, we classify ASP into two types:

2.2.1 Predictable power

Predictable power is ASP whose occurrences can be determined in advance, e.g., tail power, which is the power that a subsystem continues to consume while its utilization is low, as seen with GPS, Wi-Fi [8], and 3G [13]. Fig. 1 shows the high tail power of 3G occurring during the FACH (Forward Access Channel) state of the Radio Resource Control (RRC) protocol, which lasts about 6 seconds, i.e., from 9 to 15 sec.

Fig. 1: Asynchronous power behavior: Predictable power such as tail power in 3G. (Plot of Power (mW) and Packet rates (Tx+Rx) against Time (sec); series: Measured power, Packet rates (Tx+Rx); annotation: Tail power.)

Fig. 2: Asynchronous power behavior: Unpredictable power such as hidden power in GPU. (Plot of Power (mW) against Time (sec); series: Measured power, Hidden energy.)

In more detail, FACH is the cellular transmission state where smartphones share transmission channels with other phones to reduce the battery power consumption when there is not much traffic to transmit.

2.2.2 Unpredictable power

Unpredictable power is ASP whose occurrences are difficult to determine in advance, namely the hidden power. Fig. 2 shows an example of hidden power, detected by CSR in our experiment, occurring on the GPU. Most current works have proposed solutions which handle predictable power only, but not unpredictable power. Pathak et al. [8], for example, proposed an instrumentation-based power modeling technique which probes the smartphone system calls and application frameworks to capture the tail power of network HW components, but the GPU power remains hidden. Cao et al. [9] also took only the CPU and networks into account, but ignored the GPU hidden power usage caused by activities such as web-based games. Maghazeh et al. [14] sampled GPU workloads at low sample rates and built a power model with linear regression, which is also not enough to detect GPU hidden power. Although an instrumentation-based technique is more accurate than a utilization-based technique [15], it requires significant development time and system knowledge to handle smartphone software systems, especially for commercial smartphones. In particular, it is a laborious task to apply the instrumentation-based technique to a subsystem such as the GPU because of the lack of OS support for GPU abstractions [16].

Fig. 3: Nonlinear power consumption behavior of the CPU idle time and entry in state C0-C2 of the CPU core0 on Galaxy S4. (Plot of Normalized data against Time (s); series: Total system power, Idle time of states 0, 1, and 2, Idle entry of states 0, 1, and 2.)

Nonlinear power (NLP) is defined as the power which is not linearly proportional to workload statistics. For example, Fig. 3 illustrates the normalized power consumption of a CPU to show the nonlinearity of the three levels of CPU idle time and entries. Power nonlinearity also appears for pixel colors on an AMOLED screen, the HW component with the highest power consumption ratio compared to other HW components [17] [18] [19]. To model the NLP of CPU idle states, Zhang et al. [10] proposed a weighted linear model to fit the power consumption of multi-core CPU idle states. For AMOLED screens, Radhika et al. [17] and Xu et al. [18] proposed an exponential power model to fit a subset of pixel colors to the associated measured power. To model the NLP of the GPU, Ma et al. [7] proposed a statistical method which selects 5 out of 39 GPU parameters to build a Support Vector Regression (SVR) model. Another statistical method uses a tree-based random forest to build a GPU power estimation model [19].

However, most existing GPU power models are based on studies of desktop GPUs such as those from Nvidia. Also, the proposed solutions of Zhang et al. [10], Radhika et al. [17], and Xu et al. [18] are all examples of a human-defined model form, whereas those of Ma et al. [7] and Chen et al. [20] are not.

2.3 Affinity Propagation Clustering

The Affinity Propagation (AP) clustering algorithm [11] is based on message passing between data samples i and j. Every data sample is initially considered a potential exemplar, i.e., the data sample at the center of a cluster. Data samples keep exchanging two kinds of messages, responsibility and availability, with one another until a good set of exemplars and corresponding clusters emerges. The responsibility message, r(i, j), is sent from data sample i to data sample j, a candidate exemplar, and reflects the accumulated evidence for how appropriate data sample j is to serve as the exemplar of data sample i. Meanwhile, the availability message, a(i, j), is sent from data sample j, a candidate exemplar, to data sample i, and reflects the accumulated evidence for how appropriate it would be for data sample i to choose data sample j as its exemplar. Finally, a set of clusters $\{c_1, c_2, \dots, c_n\}$ is selected to maximize the fitness function $E = \sum_{i=1}^{n} s(i, j)$, where s is the similarity function of i and j. We use AP because it does not require the number of clusters to be specified in advance. This property is useful for detecting an unknown number of ASP occurrences, especially for hidden power.
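The message passing described above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard responsibility and availability updates from [11] with damping, not the implementation used in this work. The damping factor, iteration count, toy one-dimensional data, negative squared distance as similarity, and the median preference on the diagonal are all assumptions made for the example.

```python
import statistics


def affinity_propagation(S, damping=0.5, iterations=200):
    """Return, for each sample i, the index of its chosen exemplar.

    S is an n x n similarity matrix (n >= 2); S[k][k] holds the
    'preference' of sample k for becoming an exemplar.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            for k in range(n):
                best = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best)
        # a(i,k) <- min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        # a(k,k) <- sum of positive r(i',k), i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]


# Toy data: two well-separated groups of hypothetical 1-D workload values.
points = [1.0, 1.1, 1.3, 5.0, 5.2, 5.3]
S = [[-(p - q) ** 2 for q in points] for p in points]
off_diagonal = [S[i][j] for i in range(6) for j in range(6) if i != j]
preference = statistics.median(off_diagonal)  # assumed preference choice
for k in range(6):
    S[k][k] = preference
labels = affinity_propagation(S)
```

Note that the number of clusters is never passed in; it emerges from the preference values and the similarities, which is the property the text relies on.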

2.4 Eureqa Symbolic Regression

Symbolic regression (SR) is a method for searching for mathematical equations that fit given data samples. Unlike traditional linear and nonlinear regression methods, which fit parameters to an equation of a given model form, SR searches for both the parameters and the equation form concurrently. The result is represented as a tree structure composed of inner nodes containing the mathematical operators (+, −, ×, ÷, sin, cos) and outer nodes containing either subsystem parameters or a constant value. Traditional SR, because of its random nature, outputs dissimilar model forms for the same input data when trained at different times. In contrast, Eureqa [12] generates mathematical model forms that are similar across training runs.
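The tree representation described above can be illustrated with nested tuples. The sketch below only shows how such a tree is evaluated and how its size can be counted; it does not perform any search, and the example model, its coefficients, and the parameter name `utilization` are hypothetical.

```python
import math
import operator

# Inner nodes: operators. Outer nodes: parameter names or constants.
BINARY = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}
UNARY = {"sin": math.sin, "cos": math.cos}


def evaluate(node, params):
    """Evaluate an expression tree for one set of parameter values."""
    if isinstance(node, (int, float)):
        return float(node)            # constant leaf
    if isinstance(node, str):
        return params[node]           # subsystem-parameter leaf
    op, *children = node
    args = [evaluate(child, params) for child in children]
    if op in UNARY:
        return UNARY[op](args[0])
    return BINARY[op](args[0], args[1])


def size(node):
    """Number of nodes in the tree, a simple complexity measure."""
    if not isinstance(node, tuple):
        return 1
    return 1 + sum(size(child) for child in node[1:])


# Hypothetical model: power = 50 + 5.5 * utilization
tree = ("+", 50.0, ("*", 5.5, "utilization"))
power_mw = evaluate(tree, {"utilization": 100.0})
```

A symbolic regression search mutates and recombines such trees, so both the structure and the constants change during training.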

Fig. 4a shows the process of Eureqa operations. In step 1, Eureqa calculates partial derivatives between variables from the given data. In step 2, it generates candidate symbolic functions that describe the behaviors of the given data. In step 3, Eureqa derives symbolic partial derivatives of the pairs of variables for each candidate function. In step 4, the results of step 3 are compared with those of step 1. If the best candidates are not satisfactory, steps 2, 3, and 4 are repeated until the best candidates are found (see [12] for more details).

Fig. 4b illustrates an example of applying Eureqa to find an equation which describes the behavior of a swinging pendulum. Because Eureqa uses partial derivatives as the key to its model search, it can produce a list of equations that trade off accuracy against simplicity. The list of equations thus generated gives users various choices for picking the most suitable equation, e.g., number 2 in Fig. 4b, which avoids both overfitting and underfitting. Unlike traditional linear and nonlinear regressions, which fit parameters to an equation of a given form (e.g., linear regression just gives f(t) = 0.02, shown as number 4 in Fig. 4b), Eureqa can find both the parameters and the form of the equations at the same time.
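The trade-off between accuracy and simplicity can be made concrete with a Pareto filter over candidate equations. The sketch below is an illustration only and is not Eureqa's internal procedure: a candidate is kept if no other candidate is at least as simple and at least as accurate while being strictly better in one of the two. The candidate names, complexity values, and error values are hypothetical.

```python
def mean_absolute_error(predicted, measured):
    """MAE between two equally long sequences of values."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)


def pareto_front(candidates):
    """Keep (name, complexity, error) tuples not dominated by another one."""
    front = []
    for name, complexity, error in candidates:
        dominated = any(
            c2 <= complexity and e2 <= error and (c2 < complexity or e2 < error)
            for _, c2, e2 in candidates
        )
        if not dominated:
            front.append((name, complexity, error))
    return sorted(front, key=lambda cand: cand[1])


# Hypothetical candidates: (label, complexity, error on the data).
candidates = [
    ("constant", 1, 0.40),
    ("sine", 7, 0.12),
    ("bloated sine", 11, 0.15),   # more complex and less accurate than "sine"
    ("damped sine", 15, 0.03),
]
front = pareto_front(candidates)
```

The surviving list is what a user would choose from: moving along it buys accuracy at the cost of complexity.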

3 PROBLEM STATEMENT AND DEFINITION

This section first introduces the basic definition of smartphone subsystem power estimations, followed by formal definitions of the main problem and subproblems I and II.

3.1 Basic Definitions

The notations used in this paper are as follows. Let $S = \{s^1, \dots, s^M\}$ denote a set of M subsystems. Each subsystem $s^i \in S$ is characterized by $N^i$ parameters, where $s^i_j$ denotes parameter j of subsystem i. Each subsystem $s^i$ has been trained in different operating states (a 'trained state' for short). Each trained state k of subsystem i, denoted by $r^i_k = (s^i_{1,k}, \dots, s^i_{N^i,k})$, is a vector that stores the $N^i$ parameter workloads $s^i_{j,k}$. Each trained state $r^i_k$ is also paired with the total system power $p^i_k$, forming a training sample $v^i_k = (r^i_k, p^i_k)$. Let $V^i = \{v^i_1, \dots, v^i_Z\}$ denote a set of Z training samples $v^i_k$, where $1 \le k \le Z$. The set $V^i$ is used for building a power model $M^i$, which represents the relationship between $r^i_k$ and $p^i_k$. To be accurate, the power model $M^i$ needs to take ASP and NLP into account. Table 1 lists all notations and their definitions.

Fig. 4: Eureqa symbolic regression. (a) The process of Eureqa symbolic regression. (b) Example of Eureqa modeling and a list of equations. (Plot of Cost against Time; series: Data, Eureqa, Linear Regression.) Equations listed in (b):

1. f(t) = 1.24×exp(0.19(0.36×t))×sin((9.76×t)×exp(0.19(0.36×t)))
2. f(t) = exp(0.34(0.34 × t)) × sin((9.64 × t)5.46)
3. f(t) = 0.26 × sin(0.42 + (9.77 × t))
4. f(t) = 0.02

TABLE 1: Notations and their definitions

| Notation | Definition |
| --- | --- |
| $s^i$ | The subsystem i. |
| $s^i_j$ | The parameter j of subsystem i. |
| $s^i_{j,k}$ | The workload of parameter j of subsystem i at a trained state k. |
| $r^i_k$ | The trained state k of subsystem i. |
| $p^i_k$ | Total system power of subsystem i at trained state k. |
| $v^i_k$ | The pair of a trained state $r^i_k$ and $p^i_k$. |
| $V^i$ | The set of Z trained states $v^i_k$. |
| $M^i$ | A power model for subsystem i. |
| $a^i_k$ | Asynchronous power consumption of a trained state k of subsystem i. |
| $b^i_k$ | $p^i_k - a^i_k$. |
| $W^i$ | The set of Z trained states $w^i_k = (r^i_k, b^i_k)$. |
| $A^i$ | Asynchronous power table for subsystem i. |
| $Y^i$ | A mathematical equation for subsystem i. |
| $X^i$ | In estimation, the set of T data samples of subsystem i. Note that the data samples do not include power information. |
| $x^i_{j,t}$ | In estimation, the workload of parameter j of subsystem i at time slot t. |
| $x^i_t$ | In estimation, the vector of all parameters j of subsystem i at time slot t. |

Definition 1. ASP. Consider two data samples $v^i_k \in V^i$ and $v^i_l \in V^i$, where $k \neq l$. If the trained state $r^i_k$ is similar to $r^i_l$, but the total system power $p^i_k$ is not similar to the total system power $p^i_l$, then either $p^i_k$ or $p^i_l$ includes asynchronous power, $a^i_k$ or $a^i_l$, respectively.

Definition 2. NLP. NLP refers to the nonlinear behavior between the trained states $r^i_k$ and their associated power $b^i_k$, where $b^i_k = p^i_k - a^i_k$ is the synchronous power. Because of this nonlinearity, $b^i_k$ is nonlinear power.

3.2 Problem Statement

The problem statement is as follows. Given a set of training samples $V^i$, find a power model $M^i$, which comprises a mathematical equation $Y^i$ and an ASP table $A^i$ containing the information about ASP, so that the estimated total system power, obtained as the sum over all subsystems $s^i \in S$ under given workload statistics, is approximately equal to the measured total system power obtained from an external power meter. To find $Y^i$ and $A^i$, we need to solve the two subproblems below.

Subproblem 1. ASP problem. Given a set $V^i = \{v^i_1, \dots, v^i_Z\}$, where $v^i_k = (r^i_k, p^i_k)$, find the ASP $a^i_k$, where $a^i_k \le p^i_k$, for each $v^i_k \in V^i$.

Subproblem 2. NLP problem. Given a set $W^i = \{(r^i_k, b^i_k)\}$, find the equation $Y^i$ which best fits the relationship between the trained states $r^i_k$ and the associated power consumption $b^i_k$.

4 CLUSTERING AND SYMBOLIC REGRESSION

In this section we describe the design of CSR, which has three components: (1) ASP analysis, (2) NLP modeling, and (3) subsystem power estimation, as shown in Fig. 5. The first two components are for training on the data samples, while the third estimates the power consumption of the subsystems.

Fig. 5: Workflow diagram of CSR components, (a) ASP analysis and Nonlinear power modeling, and (b) Subsystem power estimation.

4.1 ASP Analysis Component

This component uses Algorithm I, which determines whether a given set $V^i$ contains ASP or not. The algorithm is based on the assumption that if any trained states in $V^i$ are similar, they should be associated with similar amounts of power consumption $p^i_k$ in $V^i$. If not, there exist some trained states $r^i_k$ which correlate with ASP. Note that this assumption may be too strong when the number of collected parameters is small. We thus collect as many parameters as possible to reduce the errors caused by this assumption.

To explain more clearly how Algorithm I works, we illustrate it with a simple example, shown in Fig. 6. In Fig. 6(a), a set of data samples $V^i = \{v^i_1, \dots, v^i_8\}$, where $v^i_k = (r^i_k, p^i_k)$, is given. We assume that all trained states $r^i_k$ are divided into two groups, based on their similarity values. The first group contains $r^i_k$ for k = 1, 2, and 3, whereas the second group contains $r^i_k$ for k = 4, 5, 6, 7, and 8. The power values $p^i_k$ are also divided into two groups. The first group contains $p^i_k$ for k = 1, 2, 3, 4, and 5, whereas the second group contains $p^i_k$ for k = 6, 7, and 8. Based on the assumption mentioned above, it can thus be seen that $p^i_4$ and $p^i_5$ contain ASP.

Fig. 6: Asynchronous power analysis.

To find $p^i_4$ and $p^i_5$ with Algorithm I, we start by partitioning all $v^i_k \in V^i$ based on the similarity function and AP clustering (lines 2-3 in Algorithm I). The similarity function is $sim = \exp(-(d/w)^r)$, where d is the distance between data samples, w = 1 is a radius, and r = 2 is an exponent. More details on the use of this function can be found in [21]. All data samples $v^i_k \in V^i$ are partitioned into two clusters, $C_1 = \{v^i_1, v^i_2, v^i_3, v^i_4, v^i_5\}$ and $C_2 = \{v^i_6, v^i_7, v^i_8\}$. Next, all data samples $v^i_k \in V^i$ are clustered again with the same similarity function and AP clustering, but using the trained states $r^i_k$ only (lines 4-5 in Algorithm I). This time, the data samples are partitioned into two clusters, $D_1 = \{v^i_1, v^i_2, v^i_3\}$ and $D_2 = \{v^i_4, v^i_5, v^i_6, v^i_7, v^i_8\}$. Finally, the groups of clusters C and D are compared in order to find ASP. Our assumption is that all members of a cluster $D_y$ should be in the same cluster $C_x$. If not, some members of $D_y$ contain ASP.
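The similarity function quoted above, with w = 1 and r = 2, can be written directly. In this sketch the distance d is taken to be the Euclidean distance between two vectors, which is an assumption; the text only states that d is a distance between data samples.

```python
import math


def sim(u, v, w=1.0, r=2.0):
    """Similarity exp(-(d / w) ** r) between two equally long vectors.

    d is the Euclidean distance here (an assumption for this sketch).
    """
    d = math.dist(u, v)
    return math.exp(-((d / w) ** r))


identical = sim((0.5, 0.5), (0.5, 0.5))   # distance 0
unit_apart = sim((0.0,), (1.0,))          # distance 1
far_apart = sim((0.0,), (3.0,))           # distance 3
```

The value is 1 for identical samples and decays quickly toward 0 as the distance grows beyond the radius w, so parameters measured on very different scales would usually need to be normalized first.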

In our example, the algorithm finds that the members of $D_2$ are not all in the same cluster, unlike the members of $D_1$, which are all in cluster $C_1$. Hence, it can be concluded that ASP exists in some $p^i_k \in D_2$. Next, the members of $D_2$ are partitioned into subclusters in order to find the ASP (line 8 in Algorithm I). The number of subclusters depends on the number of clusters $C_x$ to which the members of $D_2$ belong.

In this example, the cluster $D_2$ is partitioned into two subclusters, i.e., $D_{2,1} = \{v^i_4, v^i_5\}$ and $D_{2,2} = \{v^i_6, v^i_7, v^i_8\}$, because $v^i_4$ and $v^i_5$ are in $C_1$, and $v^i_6$, $v^i_7$, and $v^i_8$ are in $C_2$. In lines 9-11, the algorithm computes the average power of all members of $D_{2,1}$ and $D_{2,2}$, denoted $avgP(D_{2,1})$ and $avgP(D_{2,2})$, respectively. These power averages are used to assess which subcluster $D_{y,d}$ contains ASP: every subcluster $D_{y,d}$ whose average power is not the minimum contains ASP. In this example, Fig. 6(b) (lines 9-12) shows that $D_{2,1} = \{v^i_4, v^i_5\}$ contains ASP $a^i_k = avgP(D_{2,1}) - avgP(D_{2,2})$. Algorithm I (lines 13-19) then produces an ASP table $A^i = \{A^i_0, \dots, A^i_p, \dots, A^i_q\}$, where $A^i_p = \langle r^i_k, r^i_{k-1}, d, a^i_k \rangle$ is a row of the table. Here, $r^i_k$ and $r^i_{k-1}$ are the pair of training states triggering the ASP, i.e., $r^i_k$ contains asynchronous power if its previous training state is $r^i_{k-1}$; d is the asynchronous duration; and $a^i_k$ is the amount of ASP. Next, a set $W^i = \{(r^i_1, b^i_1), \dots, (r^i_8, b^i_8)\}$ is created, in which the powers $b^i_4$ and $b^i_5$ are modified as $b^i_4 = p^i_4 - a^i_4$ and $b^i_5 = p^i_5 - a^i_5$, respectively. Finally, Algorithm I (line 19) returns $W^i$ and $A^i$, as shown in Fig. 6(c), which are later used by the NLP modeling and subsystem power estimation components, respectively.
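The comparison of the two clusterings in this worked example can be sketched as follows. The code takes the cluster labels as given (it does not run AP) and follows the prose description, in which the ASP of a subcluster is the difference between its average power and the minimum average power. The function name and the power values are hypothetical, and building the ASP table rows is omitted.

```python
from collections import defaultdict


def asp_analysis(power, c_label, d_label):
    """Map sample index -> asynchronous power, from two clusterings.

    c_label: cluster of each sample when clustering state and power.
    d_label: cluster of each sample when clustering the state only.
    """
    d_clusters = defaultdict(list)
    for k, y in enumerate(d_label):
        d_clusters[y].append(k)

    asp = {}
    for members in d_clusters.values():
        if len({c_label[k] for k in members}) == 1:
            continue  # similar states, similar power: no ASP here
        # Split D_y into subclusters by the C cluster of each member.
        sub = defaultdict(list)
        for k in members:
            sub[c_label[k]].append(k)
        avg = {x: sum(power[k] for k in ks) / len(ks) for x, ks in sub.items()}
        base = min(avg.values())
        for x, ks in sub.items():
            if avg[x] > base:  # not the minimum average: contains ASP
                for k in ks:
                    asp[k] = avg[x] - base
    return asp


# Eight samples as in Fig. 6 (0-based indices), hypothetical power in mW.
power = [900.0, 910.0, 890.0, 880.0, 920.0, 400.0, 410.0, 390.0]
c_label = [0, 0, 0, 0, 0, 1, 1, 1]   # C1 = {v1..v5}, C2 = {v6..v8}
d_label = [0, 0, 0, 1, 1, 1, 1, 1]   # D1 = {v1..v3}, D2 = {v4..v8}
asp = asp_analysis(power, c_label, d_label)
sync = [p - asp.get(k, 0.0) for k, p in enumerate(power)]  # b = p - a
```

With these numbers, only the fourth and fifth samples are flagged, which matches the outcome described for Fig. 6.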

4.2 Nonlinear Power Modeling Component

Algorithm II is applied to this component. The algorithm uses Eureqa to produce a mathematical equation representing the relationship between the trained states $r^i_k$ and the associated synchronous power $b^i_k$.

Algorithm 1: Asynchronous Power Analysis
Input: A set $V^i$.
Output: A set $W^i$; an asynchronous power table $A^i$.
1  s1, s2 ← 2D matrix; C, D and A ← ∅; $W^i$ ← $V^i$;
2  s1 ← sim(∀$v^i_k \in V^i$);
3  C ← AP(s1); // Clustering
4  s2 ← sim((∀$r^i_k \in v^i_k$) ∈ $V^i$);
5  D ← AP(s2); // Clustering
6  foreach $D_y \in D$ do
7      if ∀$v^i_k \in D_y$ are not in the same cluster $C_x$ then
8          F ← partition(∀$v^i_k \in D_y$);
9          foreach $D_{y,d} \in F$ do
10             // Add to list
11             avgP.add($\sum((\forall p^i_k \in v^i_k) \in D_{y,d}) / b$);
12         minP ← argmin{avgP};
13         // Number of members of $D_{y,d}$
14         d ← $|D_{y,d}|$;
15         foreach ($p^i_k \in v^i_k$) ∈ $D_{y,d}$ do
16             $a^i_k$ ← abs($p^i_k$ − minP);
17             $w^i_k$ ← ($r^i_k$, minP);
18             $A^i$.add(⟨$r^i_k$, $r^i_{k-1}$, d, $a^i_k$⟩);
19 return $W^i$, $A^i$;

Algorithm II requires two input parameters, i.e., the set $W^i$ obtained from Algorithm I, and the target expression E, a string expression that tells Eureqa which type of model to search for. For instance, the target expression "$y = f(x_1, x_2, \dots, x_n)$" is an equation where y is modeled as a function of the variables $x_1, x_2, \dots, x_n$, and "$y = f_1()x_1 + f_2()x_2 + \dots + f_n()x_n$" is an equation where $f_i()$ is the coefficient of the variable $x_i$. Algorithm II (line 1) starts by defining the empty lists L and H.

Algorithm 2: Nonlinear Power Model Generation
Input: A set $W^i$; a target expression E ← "$y = f(x_1, \dots, x_k)$".
Output: An equation $Y^i$.
1  L, H ← an empty list;
2  L ← Eureqa.build($W^i$, E);
3  M ← "y = ∀$x_i \in L$";
4  H ← Eureqa.build($W^i$, M);
5  G ← {∀$h^i \in H$ | MAE($h^i$) is 10% larger than MAE($\arg\max_{h^i \in H} Com(h^i)$)};
6  $Y^i$ ← $\arg\max_{g^i \in G} Com(g^i)$;
7  return $Y^i$;

Eureqa then applies the two input parameters to build the equation (line 2). Since Eureqa uses symbolic regression (SR) to build an equation, termination criteria have to be set, i.e., a number of generations and a stability value. The stability value is used in the sensitivity analysis of all parameters to find the set of parameters that most affect the subsystem power consumption. After completing the construction process, Eureqa returns a list L of the most influential subsystem parameters. Next, the target expression E is modified by replacing the existing variables with the new set of variables from the list L; the result is denoted as M (line 3). Eureqa is run again with $W^i$ and M, and returns a list of candidate equations H (line 4). The candidate equations in H are ranked based on the trade-off between the accuracy (fit) and complexity (size) of an equation. The accuracy is measured with the Mean Absolute Error (MAE) metric, and the complexity is the size of an equation. We define the process of choosing an equation $Y^i$ from H in two steps: (1) picking a group of candidates whose MAEs are 10% larger than the MAE of the candidate with the greatest complexity, and (2) selecting the equation which has the greatest complexity from the group picked in the first step. To create a generic model, we propose this picking process, which picks an equation that has the least complexity but the highest accuracy.

Fig. 7 illustrates the process of Algorithm II. At step 1, let $W^i$ be a set of pairs of trained states and power consumption, $w^i_k = (r^i_k, b^i_k)$, of the GPU. The GPU subsystem has 19 parameters, $x_1, \dots, x_{19}$. The set $W^i$ and the target expression E are submitted to Eureqa, which applies sensitivity analysis to generate the set of GPU parameters that most affect its power consumption. In this example, the sensitivity analysis yields the four most influential parameters, $x_1$, $x_5$, $x_8$, and $x_{12}$. At step 3, the target expression E is replaced by "$Y = f(x_1, x_5, x_8, x_{12})$". At step 4, the expression E and $W^i$ are submitted to Eureqa to build a model again, but without applying the sensitivity analysis. At step 5, Eureqa returns a list of equations which trade off accuracy against complexity. Finally, at step 6, Algorithm II uses the picking criteria mentioned above to pick the equation $Y^i$, which is the equation whose complexity and MAE are 15 and 3, respectively, shown as the red spot in step 5.

4.3 Subsystem Power Estimation Component

This component uses Algorithm III, as well as the ASP table $A^i$ and the equation $Y^i$ obtained from Algorithms I and II, respectively, to estimate the total subsystem power of the subsystem $s^i$, given a set of T data samples $X^i$. Note that the data samples $X^i$ only include the workload statistics of the subsystems, not their power information.

Fig. 7: Nonlinear power modeling.

Algorithm III (line 1) starts by initializing an empty list P. Next, for each data sample xⁱ_t ∈ Xⁱ, Algorithm III (line 3) checks whether xⁱ_t is associated with ASP. This is done by measuring the similarity, sim (line 4), between xⁱ_t and rⁱ_k ∈ Aⁱ_p, and between xⁱ_{t−1} and rⁱ_{k−1} ∈ Aⁱ_p. If the similarity is greater than zero, xⁱ_t is associated with ASP; we then estimate the total system power P_t of xⁱ_t by applying xⁱ_t to the equation Yⁱ and adding aⁱ_k ∈ Aⁱ_p, and we set the variable len to t plus the asynchronous duration d. Otherwise, P_t is estimated by applying xⁱ_t to Yⁱ only. If xⁱ_t is associated with ASP, we continue by checking the duration d ∈ Aⁱ_p, as shown in lines 7–13. In this iterative process, the similarity between xⁱ_t and xⁱ_{t−1} is measured. If both data samples are similar, the total system power consumption Pⁱ_t is again estimated by applying xⁱ_t as in line 5. Otherwise, xⁱ_t is checked against the other entries Aⁱ_p ∈ Aⁱ. Finally, this component returns a set of T total system power estimates Pⁱ associated with the data samples Xⁱ.
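The control flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the `AspEntry` fields, the tolerance-based `similar` function (standing in for sim), the duration expressed as a number of samples, and the toy model passed as `model` (standing in for Yⁱ) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Sample = Sequence[float]


@dataclass
class AspEntry:
    r_prev: Sample      # state before the transition (r_{k-1})
    r_curr: Sample      # state after the transition (r_k)
    extra_power: float  # asynchronous power a_k added on top of the model
    duration: int       # asynchronous duration d, in samples (assumption)


def similar(a: Sample, b: Sample, tol: float = 1e-6) -> bool:
    """Hypothetical stand-in for sim: element-wise match within a tolerance."""
    return len(a) == len(b) and all(abs(u - v) <= tol for u, v in zip(a, b))


def estimate_power(samples: List[Sample], asp_table: List[AspEntry],
                   model: Callable[[Sample], float]) -> List[float]:
    # Synchronous estimate for every sample: P_t = Y(x_t).
    power = [model(x) for x in samples]
    t = 1
    while t < len(samples):
        # Look for an ASP entry whose state transition matches (x_{t-1}, x_t).
        entry = next((e for e in asp_table
                      if similar(samples[t], e.r_curr)
                      and similar(samples[t - 1], e.r_prev)), None)
        if entry is None:
            t += 1
            continue
        end = min(t + entry.duration, len(samples))
        power[t] += entry.extra_power
        t += 1
        # Keep adding a_k while the workload stays the same, up to d samples.
        while t < end and similar(samples[t], samples[t - 1]):
            power[t] += entry.extra_power
            t += 1
    return power


# Toy GPS-like trace: one "on" flag per sample; tail power after switching off.
trace = [[1.0], [1.0], [0.0], [0.0], [0.0], [0.0], [1.0]]
table = [AspEntry(r_prev=[1.0], r_curr=[0.0], extra_power=50.0, duration=3)]
print(estimate_power(trace, table, lambda x: 100.0 * x[0] + 10.0))
# [110.0, 110.0, 60.0, 60.0, 60.0, 10.0, 110.0]
```

In this toy trace the extra 50 mW is applied to the three samples that follow the on-to-off transition, after which the estimate falls back to the model alone.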

5 EXPERIMENTAL SETUP

In this section, we first describe the hardware and software experimental setup. We then describe the subsystems and their training process.

5.1 Hardware and Software Setup

5.1.1 Hardware setup

The hardware test bed consisted of a host computer, a Monsoon power monitor [22], and two DUTs. The host computer was a standard desktop computer with a 3.10 GHz Intel Core i3-2100 processor, 6 GB of RAM, and 64-bit Windows 7.

The Monsoon power monitor, which sampled at a rate of 5 kHz, was used to measure the total system power. To test the efficiency of the proposed technique, we ran our experiments on an older and a newer DUT, the Nexus S and the Galaxy S4 (S4 for short). More details of both DUTs, including the target subsystems (CPU, screen, GPU, audio, GPS, 3G, and Wi-Fi), are listed in Table II.

Algorithm 3: Subsystem Power Estimation
Input: A set of T data samples Xⁱ = {x_1, ..., x_T}.
Output: A set of T power estimates Pⁱ = {p_1, ..., p_T}.
1  Pⁱ ← () // an empty list
2  for t = 1, ..., |Xⁱ| do
3      foreach Aⁱ_p ∈ Aⁱ do
4          if sim(xⁱ_t, rⁱ_k ∈ Aⁱ_p) ∧ sim(xⁱ_{t−1}, rⁱ_{k−1} ∈ Aⁱ_p) then
5              len ← t + d, d ∈ Aⁱ_p;
6              p_t ← Yⁱ(xⁱ_t) + aⁱ_k;
7              while t < len do
8                  if sim(xⁱ_t, xⁱ_{t+1}) then
9                      p_t ← Yⁱ(xⁱ_{t+1}) + aⁱ_k;
10                     t++;
11                 else
12                     t−−;
13                     break;
14         else
15             p_t ← Yⁱ(xⁱ_t)
16 return Pⁱ;

5.1.2 Software setup

The software tools consisted of Eureqa desktop version 0.99.8 [23]; the statistical tool R [24] with the apcluster package [25]; the system call tracing tool Strace [26]; and our training application running on both DUTs. To validate the accuracy of CSR, we tested it with four real applications in several scenarios. The applications included GPU benchmark, Google Maps, Firefox web, Firefox YouTube, Chrome web, and Chrome YouTube. For the Eureqa parameter setup, in each training process we ran the evolutionary process until the number of generations reached 30,000 or the stability value exceeded 80%. Each training process took around 4 minutes to complete.

5.2 Subsystem Training Process

To acquire subsystem workload statistics, we created two applications, an operating application and a monitoring application [27], both running on the DUT. The operating application put a target subsystem into different operating states, while the monitoring application periodically collected the workload statistics of the subsystems being trained. To synchronize the subsystem workload statistics with the associated power measurements, we used the instantaneous power change caused by suddenly turning the screen brightness on and off as the synchronization point. After completing the training process, we stored the collected workload statistics of the subsystems in the DUT's storage, and the power trace on the host computer. For each operating state of a subsystem, we repeated the test five times and took the average. The details of each subsystem training process are as follows:

1) CPU. We disabled other subsystems when training the CPU. However, since the GPU is on the same System-on-Chip as the CPU, we also monitored the workload statistics of the GPU. The training parameters of the CPU were set as described in Zhang et al. [9] for training CPU idle states. While it was simple to train the single CPU core on Nexus S, training the 8 CPU cores on S4 was an intensive task. The 8 CPU cores, in a big.LITTLE [28] configuration, were partitioned into 2 groups: one group of 4 big cores and one group of 4 little cores. Each big core had a frequency range of 800 to 1600 MHz, and each little core had a frequency range of 200 to 600 MHz. Although the S4's CPU was composed of 8 cores, only four cores, i.e., one group, were active at a time because of a limitation of the scheduler technology at the time, the In-kernel switcher [28]. Developers could in fact only see 4 logical CPU cores: core0, core1, core2, and core3. The OS kernel allows developers to turn off core1 to core3, but not core0. Thus, to train the S4's CPU, we started by training core0 alone, with the other cores disabled. We next trained core1 by enabling it and letting it operate along with core0. The power consumption of core1 was then estimated by subtraction. We subsequently trained core2 and core3 using the same procedure as for core1.

TABLE 2: The subsystems of Nexus S and Galaxy S4

| Subsystem | Nexus S parameter | Range | Galaxy S4 parameter | Range |
|---|---|---|---|---|
| CPU | util0 (%) | 1-100 | util0, ..., util7 | 1-100 |
| | freq0 (MHz) | {200, ..., 1000} | freq0, ..., freq7 | {200, ..., 1600} |
| | it0 (idle time (ms)) | 0 | it0,0, ..., it7,0 | 0 |
| | | | it0,1, ..., it7,1 | 0 |
| | | | it0,2, ..., it7,2 | 0 |
| | ie0 (idle time (ms)) | 0 | ie0,0, ..., ie7,0 | 0 |
| | | | ie0,1, ..., ie7,1 | 0 |
| | | | ie0,2, ..., ie7,2 | 0 |
| Screen | brightness | 0-255 | bright | 0-255 |
| | | | red | 0-255 |
| | | | green | 0-255 |
| | | | blue | 0-255 |
| GPU (the four most influential parameters) | tal | 0 | tal | |
| | fps | 0 | fps | 0 |
| | usseccpp | 0 | usseccpp | 0 |
| | gtt3d | 0 | gtt3d | 0 |
| Audio | volume | [0, 1] | volume | [0, 1] |
| GPS | on | [0, 1] | on | [0, 1] |
| Wi-Fi | on | [0, 1] | on | [0, 1] |
| | channel | {11, 36, 48, 54} | channel | {11, 36, 48, 54} |
| | packet rate | 0 | packet rate | 0 |
| 3G | on | [0, 1] | on | [0, 1] |
| | packet rate | 0 | packet rate | 0 |

Parameter descriptions: tal: tile accelerator utilization; fps: frames per second; usseccpp: Universal Scalable Shader Engine clock cycles per pixel; gtt3d: GPU task time 3D utilization.

2) Screen. It was simple to train the screen of Nexus S because it was a Super LCD [29], whose power consumption is affected only by the brightness level. However, for the screen of S4, we had to address the red, green, and blue pixel colors at each brightness level, from 0 to 255 in increments of 25. Therefore, at each brightness level, the 24-bit color values, 8 bits for each color, were varied from 0 (black) to 255 (white). Since the OS kernel does not provide the pixel color data, for real application testing we used another Android application, SCR screen record [30], to record the pixel colors of the whole screen while the real application was running. The pixel color data was then saved as an MP4 file on the DUT and later processed on the host computer. To reduce the power consumption overhead caused by the SCR screen record app, it was set to record the screen at eight frames per second. Moreover, to reduce the time spent on pixel processing on the host computer, we processed only 135x240 of the total 1080x1920 pixels. The power consumption of the screen was obtained by subtracting the power consumption of the CPU from the total system power. We also observed that there were no GPU operations while pixel colors were being trained by our training application.

3) GPU. Nineteen GPU parameters were trained, and controlling all combinations of these parameters was an intensive task. We thus used the GPU benchmark 0xbench [31] to stress test the GPU. 0xbench consists of 2D training apps, such as canvas, shape, and image drawing; 3D training programs, such as Cube and Teapot rotation; and other miscellaneous apps, e.g., Math, VM, and Native. The power consumption of the GPU was obtained by subtracting the power consumption of the CPU and the screen from the total system power.

4) GPS. We used the GPS Test application to train only the GPS on and off states in order to capture its ASP consumption. The power consumption of GPS was estimated by subtracting the CPU and screen power consumption.

5) Audio. We trained the audio subsystem by running the built-in music player, Apollo, and only the maximum and minimum volumes were trained. Its power consumption was estimated by subtracting the CPU power because the screen was off.

6) Wi-Fi. We set up the host computer as a server connected to a TP-Link TL-WR1043ND router. We developed a client-server application, as described in PowerTutor [5], to train the Wi-Fi workload statistics. To reduce the variability of the experiment, we controlled the Wi-Fi channel rates at 11, 36, 48, 54, and 72 Mbps, while files of different sizes were exchanged between the DUT and the server at each channel rate. We found that the Wi-Fi subsystem of our DUTs resulted in a short duration of ASP consumption, i.e., less than 1 second. We thus ignored the asynchronous analysis for the Wi-Fi subsystem.

7) 3G. We experimented with 3G in a way similar to the Wi-Fi experiment. However, we estimated its power consumption by transferring multiple files of various sizes between the DUT and a server over FTP.
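The subtraction used throughout the training steps above (for example, GPU power as total power minus CPU and screen power) can be sketched as follows. This is a minimal illustration, assuming the traces are already time-aligned and expressed in mW at the same sampling rate; the function name `isolate_subsystem_power` and the numbers are hypothetical.

```python
from typing import Dict, List


def isolate_subsystem_power(total_mw: List[float],
                            others_mw: Dict[str, List[float]]) -> List[float]:
    """Per-sample power of the target subsystem: total minus the known subsystems."""
    for name, trace in others_mw.items():
        if len(trace) != len(total_mw):
            raise ValueError(f"trace '{name}' is not aligned with the total power trace")
    return [total - sum(trace[i] for trace in others_mw.values())
            for i, total in enumerate(total_mw)]


# Hypothetical two-sample traces (mW): total system, CPU, and screen power.
total = [900.0, 950.0]
known = {"cpu": [300.0, 320.0], "screen": [400.0, 400.0]}
gpu = isolate_subsystem_power(total, known)
print(gpu)  # [200.0, 230.0]
```

The same helper covers the per-core case: passing the core0-only trace as the single known subsystem and the core0-plus-core1 trace as the total leaves the contribution attributed to core1.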

6 EXPERIMENTAL RESULTS

6.1 ASP Detection

We report the results of CSR in detecting ASP as it occurred on GPS, GPU, and 3G. We ignored Wi-Fi as its duration of ASP was trivial, i.e., about 1 second. Fig. 8(a) shows the ASP of GPS, which lasts for about 5 seconds after GPS is disabled. We compared CSR's results with those of the instrumentation-based approach, which uses Strace to determine the operating states of GPS. Our experiment found that the accuracy of CSR in detecting ASP is close to that of the instrumentation-based approach. However, CSR obtains the time duration of ASP automatically, whereas the instrumentation-based approach requires the duration of asynchronous power to be determined manually. Moreover, we found that a sampling rate of 1 Hz was sufficient for detecting the ASP of GPS. Fig. 8(b) shows the ASP of GPU on S4. CSR revealed several portions of ASP occurring on GPU. After determining the GPU workload statistics that are correlated with ASP, we found that most of the ASP on GPU is hidden power. For example, between 12 and 14 seconds all captured workload statistics are similar, but the associated power consumption changes instantaneously.

Moreover, we found that the ASP of GPU on Nexus S was caused by the GPU's parameters correlated with negative values, e.g., the negative values of Universal Scalable Shader Engine (USSE) load stall and load pixel. We also found that a sampling rate of at least 10 Hz was suitable for detecting the ASP of GPU. Since it was difficult to instrument the GPU's system calls for comparison purposes, we validated the accuracy of CSR in identifying the ASP of GPU with the real applications. Fig. 8(c) compares the techniques with and without ASP detection on 3G. The techniques that detect ASP are CSR and system call (FSM), whereas the technique without ASP detection is LM. It can be seen that the techniques that detect ASP achieve lower errors than the technique that does not.
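The notion of hidden power described in this subsection, where the workload statistics stay similar while the measured power changes abruptly, can be illustrated with a simple check. This is not CSR's clustering procedure; it is a hypothetical threshold-based sketch, and the function name `find_hidden_power`, the tolerances, and the sample values are assumptions chosen for illustration.

```python
from typing import List, Sequence


def find_hidden_power(samples: List[Sequence[float]], power_mw: List[float],
                      stat_tol: float = 0.05,
                      power_jump_mw: float = 100.0) -> List[int]:
    """Return indices t where the workload matches sample t-1 but power jumps."""
    flagged = []
    for t in range(1, len(samples)):
        same_workload = all(abs(a - b) <= stat_tol
                            for a, b in zip(samples[t], samples[t - 1]))
        if same_workload and abs(power_mw[t] - power_mw[t - 1]) >= power_jump_mw:
            flagged.append(t)
    return flagged


# Hypothetical GPU trace: (utilization, fps) per sample and measured power in mW.
stats = [[0.5, 30.0], [0.5, 30.0], [0.5, 30.0], [0.9, 60.0]]
power = [500.0, 510.0, 900.0, 1500.0]
print(find_hidden_power(stats, power))  # [2]
```

Only index 2 is flagged: the workload there equals that of the previous sample while the power rises by 390 mW, whereas the jump at index 3 coincides with a workload change and so counts as synchronous.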

6.2 Nonlinear Power Models

To validate the efficacy of CSR in using machine-defined model forms for discovering NLP models, we compared the forms of the models generated by CSR with those generated using human-defined model forms. To do so, we built a power model of the Nexus S CPU using the Weighted Linear Model (WLM) described in Zhang et al. [10], whose form was defined by a human. Table III shows that, without human intervention, CSR can determine a model whose form is very similar to the one produced by WLM. Moreover, the Mean Absolute Errors (MAEs) of the two are approximately the same, 17.95 and 18.49, respectively. We also built a CPU power model using GP, a traditional symbolic regression method. As shown in Table III, the model form obtained from the GP approach is completely different from those obtained from WLM and CSR. The difference arises from the random nature of genetic programming: the model forms built by GP differ from one model-building run to the next even though the input data is the same, whereas CSR produces very similar model forms in each run.

To assess accuracy, we compared the NLP models built for the GPU of Galaxy S4, the most complex hardware subsystem. The results showed that the MAE of the GPU power model built by LM is very high: the MAEs of LM and CSR differ significantly, at 8387.01 and 956.67, respectively. Furthermore, we compared the results of CSR with another nonlinear model building method, SVR [7]. The results show that the error of CSR is about 15% lower than that of SVR on the GPU.

Fig. 8: Asynchronous power detection on (a) GPS, (b) GPU, and (c) 3G. Panels: (a) GPS tail energy, (b) GPU hidden energy, (c) 3G tail energy; panels (a) and (b) plot power (mW) against time (sec), showing the measured power together with the tail or hidden energy.

6.3 Real Application Validation

To evaluate the accuracy of CSR-based power estimation, we validated it on five well-known applications (seven scenarios) and then compared its results with those of the mixed model. In this paper, we refer to the mixed model as the list of subsystem power modeling techniques that gave the best results for each subsystem we tested. For the mixed model, we applied WLM for the CPU, an exponential model for the AMOLED screen, SVR for the GPU, and LR for the other subsystems. We used the Mean Absolute Percentage Error (MAPE) as the error metric for evaluation.

Fig. 9: Asynchronous power detection. Panels: (a) GPS tail energy, (b) GPU hidden energy.

Table IV shows the MAPE of CSR and the mixed model, tested with the five applications running on the two DUTs. On Nexus S, the average MAPE of CSR was about 10.58%, whereas that of the mixed model was about 13.85%, i.e., a 23.61% improvement of CSR over the mixed model. We found that the accuracy of the estimated GPU power consumption has the greatest impact on that of the total system power estimation, whereas the estimated power consumption of the other major subsystems, such as CPU, screen, and Wi-Fi, was similar. As shown in Fig. 9(a), CSR is more accurate than the mixed model in estimating the total system power consumption when browsing the whole CNN website for about 2 minutes on Chrome with Wi-Fi and 3G. The higher accuracy results from CSR being able to correctly capture the power consumption behavior of the GPU, whereas SVR cannot. On S4, the average MAPE of CSR was about 23.41%, whereas that of the mixed model was about 40.75%, i.e., an improvement of about 42.55%. Fig. 9(b) shows that CSR is capable of accurately capturing GPU power consumption behavior, improving by around 70.05% on the GPU bench application. The improvement is due to CSR's capability of detecting ASP on the GPU.
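The improvement figures quoted above follow from the relative reduction in MAPE, (baseline − proposed) / baseline. The short sketch below reproduces that arithmetic and, for reference, gives the standard definition of MAPE; the function names are illustrative and the sample measurements in the last call are hypothetical.

```python
from typing import Sequence


def mape(measured: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent (standard definition)."""
    return 100.0 * sum(abs((m - e) / m)
                       for m, e in zip(measured, estimated)) / len(measured)


def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative error reduction of `proposed` over `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Average MAPEs reported in the text: mixed model versus CSR.
print(round(relative_improvement(13.85, 10.58), 2))  # Nexus S: 23.61
print(round(relative_improvement(40.75, 23.41), 2))  # Galaxy S4: 42.55

# Hypothetical measured and estimated power values (mW), each 10% off.
print(round(mape([100.0, 200.0], [90.0, 220.0]), 2))  # 10.0
```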

It is worth noting that all MAPEs in this work were sig-