
Academic year: 2022


JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

Clustering and Symbolic Regression for Power Consumption Estimation on Smartphone Hardware Subsystems

Ekarat Rattagan, Ying-Dar Lin, Fellow, IEEE, Yuan-Cheng Lai, Edward T.-H. Chu, and Kate Ching-Ju Lin


Abstract—The subsystems of a smartphone are its hardware components, such as the CPU, GPU, and screen. Accurately estimating the subsystem power consumption of commercial smartphones is necessary for a wide range of research areas. Current subsystem power estimation techniques are mostly based on power models, which result in considerable errors for various types of power consumption behaviors. These include (1) asynchrony between the measured power consumption and the corresponding workload statistics, and (2) nonlinearity concerning CPU idle states, pixel colors of AMOLED screens, and GPU workload statistics. In this study, we propose a novel utilization-based subsystem power estimation method for smartphones, namely Clustering and Symbolic Regression (CSR), that takes these power consumption behaviors into account so as to increase power estimation accuracy. To address asynchrony, we cluster the subsystem workload statistics into synchronous and asynchronous groups by employing affinity propagation clustering. To address nonlinearity, we employ symbolic regression to fit the measured power consumption to the subsystem workload statistics. We compare our approach with various power estimation methods: Linear Regression Model (LM), Genetic Programming (GP), and Support Vector Regression (SVR). The results show a Mean Absolute Percentage Error (MAPE) reduction of between 23.61% and 42.55% in the estimated power consumption of simple (Nexus S) and complex (Galaxy S4) smartphone subsystems.

Index Terms—Smartphone subsystems, evolutionary computation, clustering methods, power consumption modeling and estimation

E. Rattagan is with the Faculty of Information Science and Technology, Mahanakorn University of Technology, Thailand.
E-mail: rekara40@mut.ac.th

Y. D. Lin and K. C.-J. Lin are with the Department of Computer Science, National Chiao Tung University, Taiwan.
E-mail: {ydlin, katelin}@cs.nctu.edu.tw

Y. C. Lai is with the Department of Information Management, National Taiwan University of Science and Technology, Taiwan.
E-mail: laiyc@cs.ntust.edu.tw

E. T.-H. Chu is with the Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Taiwan.
E-mail: edwardchu@yuntech.edu.tw

Manuscript received ; revised .

1 INTRODUCTION

Fast battery draining is the most critical issue for today's smartphones, and smartphone applications play an important role in causing it [1]. Without clearly understanding the power consumption behaviors of smartphone applications, developers may inadvertently cause their applications to drain excessive power. It is thus necessary for application developers to be aware of the power usage of the applications they develop. Developers can monitor the power consumption of applications using a power profile, which gives the power consumption information of smartphone subsystems as provided by manufacturers, such as an Android power profile [2]. However, the provided power profile causes significant errors on old smartphones. Dong and Zhong [3] determined the causes of inaccuracies in generated power profiles and suggested that the power profile of each smartphone should be reconstructed frequently to reduce its inaccuracy.

In general, a power profile generated by smartphone vendors contains a list of correlations between subsystem workload statistics and the associated measured power consumption. However, reconstructing a power profile on a commercial smartphone is labor-intensive, especially the power measurement task, as the manufacturer does not provide any schematics for power measurement or a way of measuring the consumption of each subsystem. Hence, to obtain the power consumption of the target subsystems, most existing works employ a subtractive method, which subtracts the power consumption of the other subsystems (obtained from the generated subsystem power models) from the total system power consumption (obtained from an external power meter).
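The subtractive method described above can be written compactly. The symbols below are introduced here only for illustration and are not the paper's notation: $P_{\mathrm{total}}$ is the power measured by the external meter, $\hat{P}_{j}$ is the power of subsystem $j$ as estimated by its existing model, and $\hat{P}_{\mathrm{target}}$ is the resulting estimate for the subsystem of interest.

```latex
\hat{P}_{\mathrm{target}} \;=\; P_{\mathrm{total}} \;-\; \sum_{j \neq \mathrm{target}} \hat{P}_{j}
```

Under this formulation, any error in the other subsystems' models is carried directly into the estimate for the target subsystem.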

Recently, several studies have proposed subsystem power estimation models for commercial smartphones. These studies can be classified into two categories: utilization-based methods and instrumentation-based methods. Utilization-based methods refer to profilers that collect statistical data at regular intervals, whereas instrumentation-based methods collect the required information when a specific event occurs. Among utilization-based methods, Shy et al. [4] and Zhang et al. [5] used linear regression to build power models of all major subsystems. Kjærgaard and Blunck [6] applied a genetic algorithm, whereas Ma et al. [7] used support vector regression to build a power model of the GPU, a subsystem that consumes power nonlinearly. Among instrumentation-based methods, Pathak et al. [8] stated that tail energy, a type of asynchronous power, accounts for significant power consumption, and proposed system call tracing to detect tail energy on subsystems such as GPS, SD card, Wi-Fi, and 3G.


2377-3782 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Asynchronous power (ASP) is defined as consumed power that is uncorrelated with the corresponding subsystem workload statistics. Cao et al. [9] instrumented the browser in order to use web page load activities and resource information for modeling and predicting the energy consumption of mobile web pages.

Although existing power modeling techniques work in general cases, they still have limitations in handling the power consumption behavior of complex subsystems, which leads to inaccurate power estimates. For example, linear regression techniques may generate good results for a CPU when only two parameters (utilization and frequency) are considered. However, they cause significant errors when additional CPU parameters (idle time and idle entries) are considered [10]. These errors result from the nonlinear power consumption behaviors of CPU idle time and entries. Nonlinear power (NLP) is defined as consumed power that is not linearly proportional to the workload statistics. Moreover, ASP occurs in some complex subsystems, such as the GPU, which are too complex for ASP to be detected by instrumentation-based approaches. Furthermore, most existing works rely on power models whose forms (mathematical equations) are automatically discovered by mathematical modeling methods from machine learning or artificial intelligence, e.g., linear and nonlinear regression, genetic programming, and support vector regression (referred to as a machine-defined model form). Compared with a model whose form is manually defined by a human (referred to as a human-defined model form), the traditional machine-defined model form approach is only suitable for expressing simple power consumption behaviors. It is not applicable to the complex power consumption behavior of modern smartphones, whose subsystems are associated with many parameters; e.g., a GPU is associated with about 19 parameters. Generally speaking, the human-defined model form can provide a more accurate model through manual effort, while the traditional machine-defined model form can automatically generate a less accurate model.

In this paper, we propose a novel utilization-based power estimation method, called Clustering and Symbolic Regression (CSR), that deals with ASP and NLP to improve estimation accuracy. To determine ASP, a clustering method is first used to classify the data samples as correlated with synchronous or asynchronous power. In this work, the Affinity Propagation (AP) clustering algorithm [11] is applied since AP does not require the number of clusters to be specified in advance. This property of AP is suitable for discovering ASP, whose number of occurrences is difficult to predict. CSR aims to detect ASP on various subsystems, especially on a complex subsystem such as the GPU, where it is impractical to apply instrumentation-based approaches because the GPU source code is closed. Next, the obtained data samples, excluding ASP, are passed through the symbolic regression (SR) method of the Eureqa software [12] in order to build a power model.

Unlike traditional SR, Eureqa addresses the relationships among all parameters. We chose Eureqa because it is a machine-defined model form approach that can automatically discover power models whose mathematical forms are similar to those generated by human-defined model form approaches. With CSR, application developers can simply and quickly build a power profile (power model) for each smartphone being tested, especially for a smartphone that includes complex subsystems, such as a GPU and multiple CPUs. Our main contributions in this paper are as follows:

• We characterize two major power consumption behaviors in smartphones, i.e., nonlinear and asynchronous power consumption, which cause significant estimation errors.

• To the best of our knowledge, we are the first to use the Affinity Propagation clustering algorithm together with Eureqa to improve the accuracy of the estimated power consumption of smartphone subsystems.

• We investigate the impact of these two power consumption behaviors, nonlinear and asynchronous power consumption, on real applications running on two different generations of smartphone devices, Nexus S (single CPU core) and Galaxy S4 (Exynos 5 Octa CPU cores).

The remainder of this paper is organized as follows: Section 2 provides the background to our work, including brief introductions to Affinity Propagation and Eureqa; Section 3 gives the problem statement and definitions; Section 4 presents our CSR approach; Sections 5 and 6 show the experimental setup and experimental results, respectively; and Section 7 concludes this work.

2 BACKGROUND

In this section, we describe smartphone subsystems, including their workload statistics, and aspects of power consumption behavior such as ASP and NLP.

2.1 Smartphone Subsystems

A smartphone comprises several hardware subsystems, such as the CPU, GPU, screen, 3G interface, Wi-Fi interface, GPS interface, and so on. Each subsystem is associated with a variety of features. In the scope of this paper, a CPU is associated with four parameters: utilization, frequency, the total time the CPU stays in the idle state per second (CPU idle time), and the total number of times the CPU enters the idle state per second (CPU idle entries). Each subsystem operates in various operating states. Each operating state is represented as a vector storing all parameter values of a subsystem (the workload statistics) and the associated power consumption. For example, the workload statistics of a busy CPU can be represented by four parameters, i.e., utilization = 100%, frequency = 1000 MHz, idle time = 0 ms, and idle entries = 0, with an associated power consumption of 600 mW.
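As an illustration of this representation, the following Python sketch stores one operating state as a record holding the workload statistics and the associated power. The class name `CpuOperatingState` and its field names are hypothetical; only the four CPU parameters and the example values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuOperatingState:
    """One CPU operating state: workload statistics plus measured power.

    Field names are illustrative assumptions, not the paper's identifiers.
    """
    utilization_pct: float
    frequency_mhz: float
    idle_time_ms: float
    idle_entries: int
    power_mw: float

    def workload(self):
        # The workload-statistics vector, i.e. the state without its power.
        return (self.utilization_pct, self.frequency_mhz,
                self.idle_time_ms, self.idle_entries)


# The busy-CPU example from the text.
busy = CpuOperatingState(100.0, 1000.0, 0.0, 0, 600.0)
```

Keeping the workload vector separate from the power value mirrors the later distinction between a trained state and its paired total system power.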

2.2 Power Consumption Behaviors

ASP is defined as consumed power that is uncorrelated with its corresponding workload statistics. In this paper, we classify ASP into two types:

2.2.1 Predictable power

Predictable power is ASP whose occurrences can be determined in advance, e.g., tail power, which is the power consumption that remains on a subsystem while its utilization is low, as observed for GPS, Wi-Fi [8], and 3G [13]. Fig. 1 shows the high tail power of 3G occurring during the FACH (Forward Access Channel) state of the Radio Resource Control (RRC) protocol, which lasts about 6 seconds, i.e., from 9 to 15 sec. In more detail, FACH is the cellular transmission state in which smartphones share transmission channels with other phones to reduce battery power consumption when there is not much traffic to transmit.

Fig. 1: Asynchronous power behavior: Predictable power such as tail power in 3G. (Plot of measured power (mW) and packet rates (Tx+Rx) versus time (sec), with the tail power region marked.)

Fig. 2: Asynchronous power behavior: Unpredictable power such as hidden power in GPU. (Plot of measured power (mW) versus time (sec), with the hidden energy marked.)

2.2.2 Unpredictable power

Unpredictable power is ASP whose occurrences are difficult to determine in advance, namely hidden power. Fig. 2 shows an example of hidden power, detected by CSR in our experiment, occurring on the GPU. Most current works have proposed solutions that handle predictable power only, not unpredictable power. Pathak et al. [8], for example, proposed an instrumentation-based power modeling technique that probes smartphone system calls and application frameworks to capture the tail power of network hardware components, but the GPU power remains hidden. Cao et al. [9] also took only the CPU and networks into account and ignored the hidden GPU power usage caused by activities such as web-based games. Maghazeh et al. [14] sampled GPU workloads at low sample rates and built a power model with linear regression, which is also not enough to detect GPU hidden power. Although an instrumentation-based technique is more accurate than a utilization-based technique [15], it requires significant development time and system knowledge to handle smartphone software systems, especially for commercial smartphones. In particular, it is laborious to apply the instrumentation-based technique to a subsystem such as the GPU because of the lack of OS support for GPU abstractions [16].

Fig. 3: Nonlinear power consumption behavior of the CPU idle time and entry in states C0-C2 of CPU core 0 on Galaxy S4. (Plot of normalized data versus time (s): total system power, idle time of states 0-2, and idle entries of states 0-2.)

Nonlinear power is defined as power that is not linearly proportional to the workload statistics. For example, Fig. 3 illustrates the normalized power consumption of a CPU to show the nonlinearity of the three levels of CPU idle time and entries. Power nonlinearity also appears for pixel colors on an AMOLED screen, the hardware component with the highest power consumption ratio compared to other hardware components [17] [18] [19]. To model the NLP of CPU idle states, Zhang et al. [10] proposed a weighted linear model to fit the power consumption of multi-core CPU idle states. For AMOLED screens, Radhika et al. [17] and Xu et al. [18] proposed an exponential power model to fit a subset of pixel colors to the associated measured power. To model the NLP of the GPU, Ma et al. [7] proposed a statistical method that selects 5 out of 39 GPU parameters to build a Support Vector Regression (SVR) model. Another statistical method uses a tree-based random forest to build a GPU power estimation model [19]. However, most existing GPU power models are based on studies of desktop GPUs, such as those from Nvidia. Also, the solutions proposed by Zhang et al. [10], Radhika et al. [17], and Xu et al. [18] are all examples of a human-defined model form, whereas those of Ma et al. [7] and Chen et al. [20] are not.

2.3 Affinity Propagation Clustering

The Affinity Propagation (AP) clustering algorithm [11] is based on a process of message passing between data samples i and j. Every data sample is initially considered as a potential exemplar, i.e., a data sample that serves as the center of a cluster. Data samples continue exchanging two kinds of messages, responsibility and availability, with one another until a good set of exemplars and corresponding clusters emerges. The responsibility message, r(i, j), is sent from data sample i to data sample j, a candidate exemplar, and reflects the accumulated evidence for how appropriate data sample j is to serve as the exemplar of data sample i. Meanwhile, the availability message, a(i, j), is sent from data sample j, a candidate exemplar, to data sample i, and reflects the accumulated evidence for how appropriate it would be for data sample i to choose data sample j as its exemplar. Finally, a set of clusters $\{c_1, c_2, \dots, c_n\}$ is selected to maximize the fitness function $E$, where $E = \sum_{i=1}^{n} s(i, j)$ and $s$ is the similarity function of i and j. We use AP because it does not require the number of clusters to be specified in advance. This property is useful for detecting an unknown number of ASP occurrences, especially for hidden power.
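For reference, the message updates of AP, as they are commonly stated for the algorithm of [11], can be written as follows. The notation r(i, j), a(i, j), and s(i, j) follows the text; the primed indices i' and j' are introduced here only for illustration, and the damping that is usually applied to the updates is omitted.

```latex
\begin{align}
r(i,j) &\leftarrow s(i,j) - \max_{j' \neq j}\bigl\{ a(i,j') + s(i,j') \bigr\} \\
a(i,j) &\leftarrow \min\Bigl\{ 0,\; r(j,j) + \sum_{i' \notin \{i,j\}} \max\{0,\, r(i',j)\} \Bigr\}, \qquad i \neq j \\
a(j,j) &\leftarrow \sum_{i' \neq j} \max\{0,\, r(i',j)\}
\end{align}
```

After the messages stabilize, data sample i takes as its exemplar the j that maximizes a(i, j) + r(i, j). The self-similarity s(j, j), often called the preference, influences how many exemplars emerge, which is why the number of clusters need not be fixed beforehand.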

2.4 Eureqa Symbolic Regression

Symbolic regression (SR) is a method for searching for mathematical equations that fit given data samples. Unlike traditional linear and nonlinear regression methods, which fit parameters to an equation of a given model form, SR searches for both the parameters and the equation form concurrently. The result is represented as a tree structure composed of inner nodes containing mathematical operators (+, −, ×, ÷, sin, cos) and outer nodes containing either subsystem parameters or a constant value. Traditional SR, because of its random nature, produces dissimilar model forms for the same input data when trained at different times; in contrast, Eureqa [12] can generate a mathematical model form that is similar across training runs.
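To make the tree representation concrete, the following Python sketch evaluates such an expression tree. It is a minimal illustration, not Eureqa's internal representation: the nested-tuple encoding, the `OPS` table, and the `evaluate` function are assumptions introduced here. The example tree encodes the third equation listed in Fig. 4b, f(t) = 0.26 × sin(0.42 + (9.77 × t)).

```python
import math

# Inner nodes: operators. Outer nodes: parameter names (str) or constants.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "sin": math.sin,
    "cos": math.cos,
}


def evaluate(node, params):
    """Recursively evaluate an expression tree for given parameter values."""
    if isinstance(node, (int, float)):
        return float(node)            # constant leaf
    if isinstance(node, str):
        return params[node]           # parameter leaf
    op, *children = node              # inner node: (operator, child, ...)
    return OPS[op](*(evaluate(child, params) for child in children))


# f(t) = 0.26 * sin(0.42 + (9.77 * t))
tree = ("*", 0.26, ("sin", ("+", 0.42, ("*", 9.77, "t"))))
value = evaluate(tree, {"t": 0.0})
```

A symbolic regression search mutates and recombines trees of this kind, so both the structure (equation form) and the constants (parameters) change during the search.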

Fig. 4a shows the process of Eureqa operations. In step 1, Eureqa calculates the partial derivatives between variables from the given data. In step 2, it generates candidate symbolic functions that accurately describe the behavior of the given data. In step 3, Eureqa derives the symbolic partial derivatives of the pairs of variables for each candidate function. In step 4, the results of step 3 are compared with those of step 1. If the best candidates are not satisfactory, steps 2, 3, and 4 are repeated until the best candidates are found (see [12] for more details).

Fig. 4b illustrates an example of applying Eureqa to find an equation that describes the behavior of a swinging pendulum. Based on Eureqa's operation, which uses partial derivatives as the key to model searching, Eureqa can produce a list of equations that trade off accuracy against simplicity. The list of equations thus generated gives users various choices for picking the most suitable equation, e.g., number 2 in Fig. 4b, which avoids both overfitting and underfitting. Unlike traditional linear and nonlinear regressions, which fit parameters to an equation of a given form (e.g., linear regression just gives f(t) = 0.02, shown as number 4 in Fig. 4b), Eureqa can find both the parameters and the form of the equations at the same time.
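The accuracy versus simplicity trade-off can be illustrated with a small Pareto filter. This is a generic sketch and not Eureqa's implementation: the function `pareto_front` and the candidate values are hypothetical. A candidate is kept only if no other candidate is at least as simple and at least as accurate, and strictly better in one of the two.

```python
def pareto_front(candidates):
    """Keep non-dominated (name, complexity, error) candidates.

    Lower complexity and lower error are both better.
    """
    front = []
    for name, size, err in candidates:
        dominated = any(
            (s <= size and e <= err) and (s < size or e < err)
            for _, s, e in candidates
        )
        if not dominated:
            front.append((name, size, err))
    return sorted(front, key=lambda c: c[1])  # order by complexity


# Hypothetical candidates: (label, complexity, error).
candidates = [("A", 1, 0.50), ("B", 5, 0.20), ("C", 6, 0.25), ("D", 12, 0.05)]
front = pareto_front(candidates)
```

In this made-up data, candidate C is dropped because B is both simpler and more accurate; the remaining list is the kind of menu from which a user picks one equation.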

3 PROBLEM STATEMENT AND DEFINITION

This section first introduces the basic definitions of smartphone subsystem power estimation, followed by formal definitions of the main problem and subproblems 1 and 2.

3.1 Basic Definitions

The notations used in this paper are as follows. Let $S = \{s^1, \dots, s^M\}$ denote a set of $M$ subsystems. Each subsystem $s^i \in S$ is characterized by $N_i$ parameters, where $s^i_j$ denotes parameter $j$ of subsystem $i$. Each subsystem $s^i$ has been trained in different operating states (a 'trained state' for short). Each trained state $k$ of subsystem $i$, denoted by $r^i_k = (s^i_{1,k}, \dots, s^i_{N_i,k})$, is a vector that stores the $N_i$ parameter workloads $s^i_{j,k}$. Each trained state $r^i_k$ is also paired with the total system power $p^i_k$, and the pair is denoted by a training sample $v^i_k = (r^i_k, p^i_k)$. Let $V^i = \{v^i_1, \dots, v^i_Z\}$ denote a set of $Z$ trained states $v^i_k$, where $1 \le k \le Z$. The set $V^i$ is used for building a power model $M^i$, which represents the relationship between $r^i_k$ and $p^i_k$. To build an efficient power model $M^i$, ASP and NLP must be taken into account. Table 1 lists all notations and their definitions.

Fig. 4: Eureqa symbolic regression. (a) The process of Eureqa symbolic regression. (b) Example of Eureqa modeling (data, Eureqa fit, and linear regression fit, plotted as cost versus time) and a list of equations:
1. f(t) = 1.24×exp(0.19(0.36×t))×sin((9.76×t)×exp(0.19(0.36×t)))
2. f(t) = exp(0.34(0.34 × t)) × sin((9.64 × t)5.46)
3. f(t) = 0.26 × sin(0.42 + (9.77 × t))
4. f(t) = 0.02

TABLE 1: Notations and their definitions

Notation | Definition
$s^i$ | The subsystem $i$.
$s^i_j$ | The parameter $j$ of subsystem $i$.
$s^i_{j,k}$ | The workload of parameter $j$ of subsystem $i$ at trained state $k$.
$r^i_k$ | The trained state $k$ of subsystem $i$.
$p^i_k$ | Total system power of subsystem $i$ at trained state $k$.
$v^i_k$ | The pair of a trained state $r^i_k$ and $p^i_k$.
$V^i$ | The set of $Z$ trained states $v^i_k$.
$M^i$ | A power model for subsystem $i$.
$a^i_k$ | Asynchronous power consumption of trained state $k$ of subsystem $i$.
$b^i_k$ | $p^i_k - a^i_k$.
$W^i$ | The set of $Z$ trained states $w^i_k = (r^i_k, b^i_k)$.
$A^i$ | Asynchronous power table for subsystem $i$.
$Y^i$ | A mathematical equation for subsystem $i$.
$X^i$ | In estimation, the set of $T$ data samples of subsystem $i$. Note that the data samples do not include power information.
$x^i_{j,t}$ | In estimation, the workload of parameter $j$ of subsystem $i$ at time slot $t$.
$x^i_t$ | In estimation, the vector of all parameters $j$ of subsystem $i$ at time slot $t$.

Definition 1. ASP. Given two data samples $v^i_k \in V^i$ and $v^i_l \in V^i$, where $k \neq l$, if the trained state $r^i_k$ is similar to $r^i_l$ but the total system power $p^i_k$ is not similar to the total system power $p^i_l$, then either $p^i_k$ or $p^i_l$ includes asynchronous power $a^i_k$ or $a^i_l$, respectively.

Definition 2. NLP. NLP is the nonlinear behavior between the trained states $r^i_k$ and their associated power $b^i_k$, where $b^i_k = p^i_k - a^i_k$ is the synchronous power. Owing to this nonlinearity, $b^i_k$ is nonlinear power.

3.2 Problem Statement

The problem statement is as follows. Given a set of training samples $V^i$, find a power model $M^i$, which comprises a mathematical equation $Y^i$ and an ASP table $A^i$ (the table that contains the information on ASP), so that the estimated total system power obtained from the sum over all subsystems $s^i \in S$, under given workload statistics, is approximately equal to the measured total system power obtained from an external power meter. To find $Y^i$ and $A^i$, we need to solve the two subproblems below.

Subproblem 1. ASP problem. Given a set $V^i = \{v^i_1, \dots, v^i_Z\}$, where $v^i_k = (r^i_k, p^i_k)$, find the ASP $a^i_k$, where $a^i_k \le p^i_k$, for each $v^i_k \in V^i$.

Subproblem 2. NLP problem. Given a set $W^i = \{(r^i_k, b^i_k)\}$, find $Y^i$, which is the best fit between the trained states $r^i_k$ and the associated power consumption $b^i_k$.

4 CLUSTERING AND SYMBOLIC REGRESSION

In this section, we describe the design of CSR, which has three components: (1) ASP analysis, (2) NLP modeling, and (3) subsystem power estimation, as shown in Fig. 5. The first two components are for training on the data samples, while the third estimates the power consumption of the subsystems.

Fig. 5: Workflow diagram of CSR components, (a) ASP analysis and Nonlinear power modeling, and (b) Subsystem power estimation.

4.1 ASP Analysis Component

This component uses Algorithm 1, which determines whether a given set $V^i$ contains ASP. The algorithm is based on the assumption that if any trained states in $V^i$ are similar, they are associated with similar amounts of power consumption $p^i_k$ in $V^i$. If not, there exist some trained states $r^i_k$ that correlate with ASP. Note that this assumption may be too strong when the number of collected parameters is small. We thus collect as many parameters as possible to reduce the errors caused by this assumption.

To show more clearly how Algorithm 1 works, we illustrate it with a simple example, as shown in Fig. 6. In Fig. 6(a), a set of data samples $V^i = \{v^i_1, \dots, v^i_8\}$, where $v^i_k = (r^i_k, p^i_k)$, is given. We assume that all trained states $r^i_k$ are divided into two groups based on their similarity values. The first group contains $r^i_k$ for $k = 1, 2,$ and $3$, whereas the second group contains $r^i_k$ for $k = 4, 5, 6, 7,$ and $8$. All power values $p^i_k$ are also divided into two groups. The first group contains $p^i_k$ for $k = 1, 2, 3, 4,$ and $5$, whereas the second group contains $p^i_k$ for $k = 6, 7,$ and $8$. Based on the assumption mentioned above, it can thus be seen that $p^i_4$ and $p^i_5$ contain ASP.

To find $p^i_4$ and $p^i_5$ with Algorithm 1, we start by partitioning all $v^i_k \in V^i$ based on the similarity function and AP clustering (lines 2-3 in Algorithm 1). The similarity function is $sim = \exp(-(d/w)^r)$, where $d$ is the distance between data samples, $w = 1$ is a radius, and $r = 2$ is an exponent. More details on the use of this function can be found in [21]. All data samples $v^i_k \in V^i$ are partitioned into two clusters, $C_1 = \{v^i_1, v^i_2, v^i_3, v^i_4, v^i_5\}$ and $C_2 = \{v^i_6, v^i_7, v^i_8\}$. Next, all data samples $v^i_k \in V^i$ are clustered again with the same similarity function and AP clustering, but using the trained states $r^i_k$ only (lines 4-5 in Algorithm 1). As a result, all data samples $v^i_k \in V^i$ are partitioned into two clusters, $D_1 = \{v^i_1, v^i_2, v^i_3\}$ and $D_2 = \{v^i_4, v^i_5, v^i_6, v^i_7, v^i_8\}$. Finally, the groups of clusters $C$ and $D$ are compared in order to find ASP. Our assumption is that all members of a cluster $D_y$ should be in the same cluster $C_x$. If not, some of the members of $D_y$ contain ASP.

Fig. 6: Asynchronous power analysis.

In our example, the algorithm finds that the members of $D_2$ are not all in the same cluster $C_x$, unlike the members of $D_1$, which are all in cluster $C_1$. Hence, it can be concluded that ASP exists in some $p^i_k \in D_2$. Next, the members of $D_2$ are partitioned into subclusters in order to find ASP (line 8 in Algorithm 1). The number of subclusters depends on the number of clusters $C_x$ to which the members $v^i_k \in D_2$ belong.

In this example, the cluster $D_2$ is partitioned into two subclusters, i.e., $D_{2,1} = \{v^i_4, v^i_5\}$ and $D_{2,2} = \{v^i_6, v^i_7, v^i_8\}$, because $v^i_4$ and $v^i_5$ are in $C_1$, and $v^i_6$, $v^i_7$, and $v^i_8$ are in $C_2$. In lines 9-11, the algorithm computes the average power of all members of $D_{2,1}$ and $D_{2,2}$, which are $avgP(D_{2,1})$ and $avgP(D_{2,2})$, respectively. These power averages are used to assess which subcluster $D_{y,d}$ contains ASP: every subcluster $D_{y,d}$ whose average power is not the minimum contains ASP. In this example, Fig. 6(b) (lines 9-12) shows that $D_{2,1} = \{v^i_4, v^i_5\}$ contains ASP $a^i_k = avgP(D_{2,1}) - avgP(D_{2,2})$. Algorithm 1 (lines 13-19) then produces an ASP table $A^i = \{A^i_0, \dots, A^i_p, \dots, A^i_q\}$, where $A^i_p = \langle r^i_k, r^i_{k-1}, d, a^i_k \rangle$ is a row of the table; $r^i_k$ and $r^i_{k-1}$ are the pair of training states triggering the ASP, i.e., $r^i_k$ contains asynchronous power if its previous training state is $r^i_{k-1}$; $d$ is the asynchronous duration; and $a^i_k$ is the amount of ASP. A set $W^i = \{(r^i_1, b^i_1), \dots, (r^i_8, b^i_8)\}$ is also created, in which the powers $b^i_4$ and $b^i_5$ are modified as $b^i_4 = p^i_4 - a^i_4$ and $b^i_5 = p^i_5 - a^i_5$, respectively. Finally, Algorithm 1 (line 19) returns $W^i$ and $A^i$, as shown in Fig. 6(c), which are later used by the NLP modeling and subsystem power estimation components, respectively.
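The comparison of the two clusterings in the Fig. 6 walk-through can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the cluster labels are taken as given (the AP clustering step itself is not run), the ASP of a subcluster is computed as the difference of subcluster averages as in the text above, and the power values and the function name `asp_analysis` are hypothetical.

```python
from collections import defaultdict


def asp_analysis(powers, c_labels, d_labels):
    """Return (synchronous powers b_k, asynchronous powers a_k).

    c_labels: cluster of each sample when clustering on (state, power).
    d_labels: cluster of each sample when clustering on the state only.
    """
    sync = list(powers)
    asp = [0.0] * len(powers)
    by_d = defaultdict(list)
    for k, d in enumerate(d_labels):
        by_d[d].append(k)
    for members in by_d.values():
        # Partition the members of D_y by the C cluster they fall into.
        sub = defaultdict(list)
        for k in members:
            sub[c_labels[k]].append(k)
        if len(sub) < 2:
            continue  # all members share one C cluster: no ASP suspected
        avg = {c: sum(powers[k] for k in ks) / len(ks) for c, ks in sub.items()}
        min_p = min(avg.values())
        for c, ks in sub.items():
            extra = avg[c] - min_p  # zero for the minimum-power subcluster
            for k in ks:
                asp[k] = extra
                sync[k] = powers[k] - extra
    return sync, asp


# Hypothetical powers (mW) for samples 1..8, with the labels of Fig. 6.
powers = [900, 910, 890, 900, 920, 600, 610, 590]
c_labels = [1, 1, 1, 1, 1, 2, 2, 2]   # C1 = {1..5}, C2 = {6, 7, 8}
d_labels = [1, 1, 1, 2, 2, 2, 2, 2]   # D1 = {1, 2, 3}, D2 = {4..8}
sync, asp = asp_analysis(powers, c_labels, d_labels)
```

With these made-up values, only samples 4 and 5 are flagged, and their power is reduced by the difference between the two subcluster averages.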

Algorithm 1: Asynchronous Power Analysis
Input: A set $V^i$.
Output: A set $W^i$ and an asynchronous power table $A^i$.
1  $s_1, s_2 \leftarrow$ 2D matrix; $C$, $D$, and $A \leftarrow \emptyset$; $W^i \leftarrow V^i$;
2  $s_1 \leftarrow sim(\forall v^i_k \in V^i)$;
3  $C \leftarrow AP(s_1)$; // Clustering
4  $s_2 \leftarrow sim((\forall r^i_k \in v^i_k) \in V^i)$;
5  $D \leftarrow AP(s_2)$; // Clustering
6  foreach $D_y \in D$ do
7      if $\forall v^i_k \in D_y$ are not in the same cluster $C_x$ then
8          $F \leftarrow partition(\forall v^i_k \in D_y)$;
9          foreach $D_{y,d} \in F$ do
10             // Add to list
11             $avgP.add\bigl(\sum((\forall p^i_k \in v^i_k) \in D_{y,d}) \,/\, b\bigr)$;
12         $minP \leftarrow \arg\min\{avgP\}$;
13         // Number of members of $D_{y,d}$
14         $d \leftarrow |D_{y,d}|$;
15         foreach $(p^i_k \in v^i_k) \in D_{y,d}$ do
16             $a^i_k \leftarrow abs(p^i_k - minP)$;
17             $w^i_k \leftarrow (r^i_k, minP)$;
18             $A^i.add(\langle r^i_k, r^i_{k-1}, d, a^i_k \rangle)$;
19 return $W^i$, $A^i$;

4.2 Nonlinear Power Modeling Component

Algorithm 2 is applied to this component. The algorithm uses Eureqa to produce a mathematical equation representing the relationship between the trained states $r^i_k$ and the associated synchronous power $b^i_k$.

Algorithm 2 requires two input parameters, i.e., the set $W^i$ obtained from Algorithm 1, and the target expression $E$, which is a string expression that tells Eureqa which type of model to search for. For instance, the target expression "y = f(x1, x2, ..., xn)" is an equation in which y is modeled as a function of the variables x1, x2, ..., xn, and "y = f1()x1 + f2()x2 + ... + fn()xn" is an equation in which fi() is the coefficient of the variable xi. Algorithm 2 (line 1) works by initially defining the empty lists $L$ and $H$.

Eureqa then applies the two input parameters to build the equation (line 2). Since Eureqa uses Symbolic Regression (SR) to build an equation, termination criteria must be set, i.e., a number of generations and a stability value. The stability value is used in the sensitivity analysis of all parameters to find the set of parameters that most affect the subsystem power consumption. After completing the construction process, Eureqa returns a list $L$ of the most impactful subsystem parameters. Next, the target expression $E$ is modified by replacing the existing variables with the new set of variables from the list $L$; the result is denoted as $M$ (line 3). Eureqa is run again with $W^i$ and $M$, and then returns a list of candidate equations $H$ (line 4). The candidate equations in $H$ are ranked based on the trade-off between the accuracy (fit) and complexity (size) of an equation. The accuracy is measured using the Mean Absolute Error (MAE) metric, and the complexity is the size of an equation. We define the process of choosing an equation $Y^i$ from $H$ in two steps: (1) picking a group of candidates whose MAEs are 10% larger than the MAE of the equation with the greatest complexity, and (2) selecting the equation that has the greatest complexity from the group picked in the first step. To create a generic model, we propose a picking process that picks an equation with the least complexity but the highest accuracy.

Algorithm 2: Nonlinear Power Model Generation
Input: A set $W^i$;
       a target expression $E \leftarrow$ "y = f(x1, ..., xk)".
Output: An equation $Y^i$.
1 $L, H \leftarrow$ an empty list;
2 $L \leftarrow Eureqa.build(W^i, E)$;
3 $M \leftarrow$ "y = $\forall x_i \in L$";
4 $H \leftarrow Eureqa.build(W^i, M)$;
5 $G \leftarrow \{\forall h_i \in H \mid MAE(h_i)$ is 10% larger than $MAE(\arg\max_{h_i \in H} Com(h_i))\}$;
6 $Y^i \leftarrow \arg\max_{g_i \in G} Com(g_i)$;
7 return $Y^i$;
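The picking step can be sketched in Python. The description above can be read in more than one way, so this is only one possible interpretation, stated as an assumption: the group contains candidates whose MAE is at most 10% larger than the MAE of the most complex candidate, and the final choice within the group is exposed as a parameter because the text mentions both the greatest and the least complexity. The function `pick_equation` and the candidate values are hypothetical.

```python
def pick_equation(candidates, tolerance=0.10, prefer="least"):
    """Pick one candidate from a list of dicts with 'complexity' and 'mae'.

    Assumed reading: keep candidates whose MAE is within `tolerance` of the
    MAE of the most complex candidate, then take the least (or greatest)
    complexity among them.
    """
    most_complex = max(candidates, key=lambda c: c["complexity"])
    limit = most_complex["mae"] * (1.0 + tolerance)
    group = [c for c in candidates if c["mae"] <= limit]
    chooser = min if prefer == "least" else max
    return chooser(group, key=lambda c: c["complexity"])


# Hypothetical candidate list (complexity, MAE).
H = [
    {"complexity": 3, "mae": 9.0},
    {"complexity": 15, "mae": 3.0},
    {"complexity": 22, "mae": 2.9},
    {"complexity": 30, "mae": 2.8},
]
chosen = pick_equation(H)
```

Under the "least complexity" reading, the sketch returns the simplest equation whose accuracy is close to the best available, which matches the stated goal of a generic model.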

Fig. 7 illustrates the process of Algorithm 2. At step 1, let $W^i$ be a set of pairs of trained states and power consumption, $w^i_k = (r^i_k, b^i_k)$, of the GPU. The GPU subsystem is composed of 19 parameters, $x_1, \dots, x_{19}$. The set $W^i$ and the target expression $E$ are submitted to Eureqa, which applies sensitivity analysis to generate the set of GPU parameters that most affect its power consumption. In this example, the sensitivity analysis yields the four most impactful parameters, $x_1, x_5, x_8, x_{12}$. At step 3, the target expression $E$ is replaced by "Y = f(x1, x5, x8, x12)". At step 4, the expression $E$ and $W^i$ are submitted to Eureqa to build a model again, but without applying the sensitivity analysis. At step 5, Eureqa returns a list of equations that trade off accuracy against complexity. Finally, at step 6, Algorithm 2 uses the picking criteria mentioned above to pick the equation $Y^i$, which is the equation whose complexity and MAE are 15 and 3, respectively, shown as the red spot in step 5.

4.3 Subsystem Power Estimation Component

This component uses Algorithm III, as well as the ASP table Ai and the equation Yi, obtained from Algorithm I and II, respectively, to estimate the total subsystem power of the subsystem si, given a set of T data samples Xi. Note that the data samples Xi include only the workload statistics of the subsystems, not their power information.

Fig. 7: Nonlinear power modeling.

Algorithm III starts by initializing an empty list Pi (line 1). Next, for each data sample xit ∈ Xi, Algorithm III checks whether xit is associated with ASP (line 3). This is done by measuring the similarity, sim (line 4), between xit and rik ∈ Aip, and between xit−1 and rik−1 ∈ Aip. If the similarity is greater than zero, xit is associated with ASP; we then estimate the total system power pt of xit by applying xit to the equation Yi plus aik ∈ Aip, and set the variable len to t plus the asynchronous duration d. Otherwise, pt is estimated by applying xit to Yi only. If xit is associated with ASP, we continue checking over the duration d ∈ Aip, as shown in lines 7–13. In this iterative process, the similarity between xit and xit−1 is measured. If both data samples are similar, the total system power consumption pt is again estimated by applying xit to the same process, as shown in line 5. Otherwise, xit is checked against the other Aip ∈ Ai. Finally, this component returns a set of T total system power estimates Pi associated with the data samples Xi.
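The estimation loop described above can be sketched in Python as follows. This is a simplified restructuring for illustration, not the published algorithm verbatim: the `AspEntry` fields, the tolerance-based `similar` function, and the choice to express the duration d as a number of samples are all assumptions made here, and the model Yi is passed in as an ordinary callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Sample = Sequence[float]


@dataclass(frozen=True)
class AspEntry:
    """One row of the ASP table (hypothetical layout for this sketch)."""
    prev_state: Sample   # workload state before the transition
    state: Sample        # workload state that triggers ASP
    extra_power: float   # asynchronous power added on top of the model
    duration: int        # ASP duration, assumed to be in samples


def similar(a: Sample, b: Sample, tol: float = 1e-6) -> bool:
    """Assumed similarity test: element-wise equality within a tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def estimate_power(samples: List[Sample],
                   model: Callable[[Sample], float],
                   asp_table: List[AspEntry]) -> List[float]:
    # Synchronous estimate for every sample.
    power = [model(x) for x in samples]
    t = 1
    while t < len(samples):
        entry = next((e for e in asp_table
                      if similar(samples[t], e.state)
                      and similar(samples[t - 1], e.prev_state)), None)
        if entry is None:
            t += 1
            continue
        end = min(t + entry.duration, len(samples))
        power[t] += entry.extra_power
        t += 1
        # Keep adding ASP while the workload stays unchanged, up to the duration.
        while t < end and similar(samples[t], samples[t - 1]):
            power[t] += entry.extra_power
            t += 1
    return power


# Hypothetical one-parameter subsystem: "on" flag, 100 mW when on,
# with a 50 mW tail lasting 2 samples after switching off.
model = lambda x: 100.0 * x[0]
table = [AspEntry(prev_state=[1.0], state=[0.0], extra_power=50.0, duration=2)]
trace = [[1.0], [0.0], [0.0], [0.0], [1.0]]
estimates = estimate_power(trace, model, table)
print(estimates)
```

A sample that ends an ASP span early is left for the outer loop to re-examine, which plays the role of the decrement-and-break step in the pseudocode.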

5 EXPERIMENTAL SETUP

In this section, we first describe the hardware and software experimental setup. We then elaborate on all subsystems and their training process.

5.1 Hardware and Software Setup

5.1.1 Hardware setup

The hardware test bed consisted of a host computer, a Monsoon power monitor [22], and two DUTs. The host computer was a normal desktop computer with a 3.10 GHz Intel Core i3-2100 processor, 6 GB of RAM, and 64-bit Windows 7. The Monsoon power monitor, which sampled at 5 kHz, was used to measure the total system power. To test the efficiency of the proposed technique, we ran our


2377-3782 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Algorithm 3: Subsystem Power Estimation

Input: A set of T data samples Xi = {x1, ..., xT}.
Output: A set of T power estimates Pi = {p1, ..., pT}.

1   Pi ← ()   // an empty list
2   for t = 1, ..., |Xi| do
3       foreach Aip ∈ Ai do
4           if sim(xit, rik ∈ Aip) ∧ sim(xit−1, rik−1 ∈ Aip) then
5               len ← t + d ∈ Aip
6               pt ← Yi(xit) + aik
7               while t < len do
8                   if sim(xit, xit+1) then
9                       pt+1 ← Yi(xit+1) + aik
10                      t++
11                  else
12                      t−−
13                      break
14          else
15              pt ← Yi(xit)
16  return Pi


experiments on an older and a newer DUT, Nexus S and Galaxy S4 (S4 for short). More details of both DUTs, including the target subsystems (CPU, screen, GPU, audio, GPS, 3G, and Wi-Fi), are listed in Table II.

5.1.2 Software setup

The software tools consisted of Eureqa desktop version 0.99.8 [23], a statistical tool, R [24], with the apcluster package [25], a system call tracing tool, Strace [26], and our training application running on both DUTs. To validate the accuracy of CSR, we tested it with four real applications in several scenarios. The applications included GPU benchmark, Google Maps, Firefox web, Firefox YouTube, Chrome web, and Chrome YouTube. For the Eureqa parameter setup, in each training process we ran the evolutionary process until the number of generations reached 30,000 or the stability value exceeded 80%. Each training process took around 4 minutes to complete.

5.2 Subsystem Training Process

To acquire subsystem workload statistics, we created two applications, an operating application and a monitoring application [27], both running on the DUT. The operating application put a target subsystem into different operating states, while the monitoring application periodically collected the workload statistics of the subsystems being trained. To synchronize the subsystem workload statistics with the associated power measurements, we used the instantaneous power change caused by suddenly turning the screen brightness on and off as the synchronization point. After completing the training process, we stored the collected workload statistics of the subsystems in the DUT's storage, and the power trace on the host computer. We tested each operating state of a subsystem five times and took the average. The details of each subsystem training process are as follows:

1) CPU. We disabled the other subsystems when training the CPU. However, since the GPU is on the same System-on-Chip as the CPU, we also monitored the workload statistics of the GPU. The training parameters of the CPU were set as described in Zhang et al. [9] for training the CPU idle state. While it was simple to train the single CPU core on Nexus S, it was an intensive task to train the 8 CPU cores on S4. The 8 CPU cores, big.LITTLE [28], were partitioned into 2 groups: one group of 4 big cores and the other of 4 little cores. Each big core had a frequency range of 800 to 1600 MHz, and each little core had a frequency range of 200 to 600 MHz. Although the S4's CPU was composed of 8 cores, only four cores, or one group, were active at a time because of the limitations of the scheduler technology at the time, the In-kernel switcher [28]. Developers can in fact only view 4 logical CPU cores: core0, core1, core2, and core3. The OS kernel allows developers to turn off core1 to core3, but not core0. Thus, to train the S4's CPU, we started by training core0 alone, with the other cores disabled. We next trained core1 by enabling it and letting it operate along with core0. The power consumption of core1 was then estimated by subtraction. We subsequently trained core2 and core3 using the same procedure as for core1.

TABLE 2: The subsystems of Nexus S and Galaxy S4

Subsystem   Nexus S parameter       Range               Galaxy S4 parameter     Range
CPU         util0 (%)               1–100               util0, ..., util7       1–100
            freq0 (MHz)             {200, ..., 1000}    freq0, ..., freq7       {200, ..., 1600}
            it0 (idle time (ms))    0                   it0,0, ..., it7,0       0
                                                        it0,1, ..., it7,1       0
                                                        it0,2, ..., it7,2       0
            ie0 (idle time (ms))    0                   ie0,0, ..., ie7,0       0
                                                        ie0,1, ..., ie7,1       0
                                                        ie0,2, ..., ie7,2       0
Screen      brightness              0–255               bright                  0–255
                                                        red                     0–255
                                                        green                   0–255
                                                        blue                    0–255
GPU (a)     tal                     0                   tal
            fps                     0                   fps                     0
            usseccpp                0                   usseccpp                0
            gtt3d                   0                   gtt3d                   0
Audio       volume                  [0, 1]              volume                  [0, 1]
GPS         on                      [0, 1]              on                      [0, 1]
Wi-Fi       on                      [0, 1]              on                      [0, 1]
            channel                 {11, 36, 48, 54}    channel                 {11, 36, 48, 54}
            packet rate             0                   packet rate             0
3G          on                      [0, 1]              on                      [0, 1]
            packet rate             0                   packet rate             0

(a) Only the four most impactful GPU parameters are shown.
tal: tile accelerator utilization. fps: frames per second. usseccpp: Universal Scalable Shader Engine clock cycles per pixel. gtt3d: GPU task time 3D utilization.

2) Screen. It was simple to train the screen of Nexus S because it was a Super LCD [29], whose power consumption is affected only by the brightness level. However, for the screen of S4, we had to address the red, green, and blue pixel colors for each brightness level, from 0 to 255 in increments of 25. At each brightness level, the 24-bit color values, 8 bits for each color, were therefore varied from 0 (black) to 255 (white). Since the OS kernel does not provide the pixel color data, for real application testing we used another Android application, SCR screen record [30], to record the pixel colors of the whole screen while the real application was running. The pixel color data was then saved as an MP4 file on the DUT and later processed on the host computer. To reduce the power consumption overhead caused by the SCR screen record app, it was set to record the screen at eight frames per second. Moreover, to reduce the time spent on pixel processing on the host computer, we processed only 135x240 of the total 1080x1920 pixels. The power consumption of the screen was obtained by subtracting the power consumption of the CPU from the total system power. We also observed that there were no GPU operations while pixel colors were being trained by our training application.

3) GPU. Nineteen GPU parameters were trained, and it was an intensive task to control all combinations of these parameters. We thus used the GPU benchmark 0xbench [31] to stress test the GPU. 0xbench consists of 2D training apps, such as canvas, shape, and image drawing; 3D training programs, such as Cube and Teapot rotation; and other miscellaneous apps, e.g., Math, VM, and Native. The power consumption of the GPU was obtained by subtracting the power consumption of the CPU and the screen.

4) GPS. We used the GPS Test application to train only the GPS on and off states to capture its ASP consumption. The power consumption of GPS was estimated by subtracting the CPU and screen power consumption.

5) Audio. We trained the audio subsystem by running the built-in music player, Apollo, and only the maximum and minimum volumes were trained. Its power consumption was estimated by subtracting the CPU power only, because the screen was off.

6) Wi-Fi. We set up the host computer as a server connected to a TP-Link TL-WR1043ND router. We developed a client-server application, as described in PowerTutor [5], to train the Wi-Fi workload statistics. To reduce the variability of the experiment, we controlled the Wi-Fi channel rates at 11, 36, 48, 54, and 72 Mbps, while files of different sizes were exchanged between the DUT and the server at each channel rate. We found that the Wi-Fi subsystem of our DUTs resulted in a short duration of ASP consumption, i.e., less than 1 second. We thus ignored the asynchronous analysis for the Wi-Fi subsystem.

7) 3G. The 3G experiment was similar to the Wi-Fi experiment. However, we estimated its power consumption by transferring multiple files of various sizes between the DUT and a server over FTP.
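The subtraction used throughout the training steps above can be written out as a short Python sketch. It is illustrative only: the function names and all power values are hypothetical, and the per-core routine assumes that cores are enabled cumulatively (core0, then core0 and core1, and so on), which is one reading of "the same procedure as for core1".

```python
from typing import List, Sequence


def isolate_power(total_mw: float, known_mw: Sequence[float]) -> float:
    """Power attributed to the subsystem under training: the measured total
    minus the power of subsystems that are already modeled."""
    return total_mw - sum(known_mw)


def per_core_power(cumulative_mw: Sequence[float]) -> List[float]:
    """Given the CPU power measured with core0 only, core0+core1, ...,
    return the increment attributed to each newly enabled core."""
    if not cumulative_mw:
        return []
    per_core = [cumulative_mw[0]]
    for previous, current in zip(cumulative_mw, cumulative_mw[1:]):
        per_core.append(current - previous)
    return per_core


# Hypothetical measurements in mW.
gpu_mw = isolate_power(1500, [400, 600])       # total minus CPU and screen
cores_mw = per_core_power([300, 520, 730, 930])
print(gpu_mw, cores_mw)
```

The same `isolate_power` call covers the screen (total minus CPU), the GPU and GPS (total minus CPU and screen), and audio (total minus CPU, with the screen off).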

6 EXPERIMENTAL RESULTS

6.1 ASP Detection

We give the results of CSR in detecting ASP as it occurred on GPS, GPU, and 3G. We ignored Wi-Fi as its duration of ASP was trivial, i.e., about 1 second. Fig. 8(a) shows the ASP of GPS, which lasts for about 5 seconds after GPS is disabled. We compared CSR's results with those of the instrumentation-based approach, which uses Strace to determine the operating states of GPS. Our experiment found that the accuracy of CSR in detecting ASP is close to that of the instrumentation-based approach. However, CSR can automatically obtain the time duration of ASP, whereas the instrumentation-based approach requires the time duration of asynchronous power to be detected manually. Moreover, we found that a sampling rate of 1 Hz was sufficient for detecting the ASP of GPS. Fig. 8(b) shows the ASP of the GPU on S4. CSR revealed several portions of ASP occurring on the GPU. After determining the GPU workload statistics that are correlated with ASP, we found that most of the ASP on the GPU is hidden power. For example, between 12 and 14 seconds all captured workload statistics are similar, but the associated power consumption changes instantaneously.

Moreover, we found that the ASP of the GPU on Nexus S was caused by the GPU's parameters correlated with negative values, e.g., the negative values of Universal Scalable Shader Engine (USSE) load stall and load pixel. We also found that a sampling rate of at least 10 Hz was suitable for detecting the ASP of the GPU. Since it was difficult to instrument the GPU's system calls for comparison purposes, we validated the accuracy of CSR in identifying the ASP of the GPU with the real applications. Fig. 8(c) shows the comparison between the techniques with and without ASP detection on 3G. The techniques that detect ASP are CSR and system call (FSM), whereas the technique that does not detect ASP is LM. It can be seen that the techniques with ASP detection reduce errors more than the technique without it.
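To relate the sampling rates mentioned here to the 5 kHz power trace, the sketch below block-averages a power trace down to a lower rate such as 1 Hz or 10 Hz. The paper does not state how the trace was reduced, so the averaging scheme, the function name, and the sample values are assumptions made for illustration.

```python
from typing import List, Sequence


def downsample_mean(trace_mw: Sequence[float], source_hz: int, target_hz: int) -> List[float]:
    """Average consecutive blocks of a power trace so that it matches a
    lower workload-sampling rate. Trailing samples that do not fill a
    whole block are dropped."""
    if target_hz <= 0 or source_hz % target_hz != 0:
        raise ValueError("source_hz must be a positive multiple of target_hz")
    block = source_hz // target_hz
    n_blocks = len(trace_mw) // block
    return [sum(trace_mw[i * block:(i + 1) * block]) / block for i in range(n_blocks)]


# Hypothetical 2-second trace at 5 kHz: 100 mW for 1 s, then 300 mW for 1 s.
trace = [100.0] * 5000 + [300.0] * 5000
at_1hz = downsample_mean(trace, 5000, 1)     # rate found sufficient for GPS
at_10hz = downsample_mean(trace, 5000, 10)   # rate found suitable for GPU
print(at_1hz, len(at_10hz))
```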

6.2 Nonlinear Power Models

To validate the efficacy of CSR in using machine-defined model forms to discover NLP models, we compare the model forms generated by CSR with those generated using human-defined model forms. To do so, we built a power model of the Nexus S CPU using the Weighted Linear Model (WLM) described in Zhang et al. [10], whose form was defined by a human. Table III shows that, without human intervention, CSR can determine a model whose form is very similar to the one produced by WLM. Moreover, the Mean Absolute Errors (MAEs) of the two are approximately the same, 17.95 and 18.49, respectively. We also built a CPU power model using GP, a traditional symbolic regression method. As shown in Table III, the model form obtained from the GP approach is completely different from those obtained from WLM and CSR. The difference is due to the random nature of genetic programming, which means that the model forms built by GP differ from one model build to the next,


Fig. 8: Asynchronous power detection on (a) GPS, (b) GPU, and (c) 3G. Panels: (a) GPS tail energy, (b) GPU hidden energy, (c) 3G tail energy. Axes: power (mW) against time (sec); legend: measured power and tail or hidden energy.

even though the input data is the same, whereas CSR produces very similar model forms in each model build. To show the accuracy, we compared the NLP models built for the GPU of Galaxy S4, the most complex hardware subsystem. The results showed that the MAE of the GPU power model built by LM is very high. The MAEs of LM and CSR are significantly different, at 8387.01 and 956.67, respectively. Furthermore, we also compared the results of CSR with another nonlinear model building method, SVR [7]. The results show that the error of CSR is lower than that of SVR by about 15% on the GPU.

6.3 Real Application Validation

To evaluate the accuracy of CSR-based power estimation, we validated it on five well-known applications (seven scenarios) and then compared its results with those of the mixed model. In this paper, we refer to the mixed model as the list of subsystem power modeling techniques that gave the best results for each subsystem we tested. For the mixed model, we applied WLM to the CPU, an Exponential model to AMOLED, SVR to the GPU, and LR to the other subsystems. We used the Mean Absolute Percentage Error (MAPE) as the error metric for evaluation.

Fig. 9: Asynchronous power detection. (a) GPS tail energy. (b) GPU hidden energy.

Table IV shows the MAPE of CSR and the mixed model, tested with the five applications running on the two DUTs. On Nexus S, the average MAPE of CSR was about 10.58%, whereas that of the mixed model was about 13.85%, i.e., a 23.61% improvement of CSR over the mixed model. We found that the accuracy of the estimated GPU power consumption has the most impact on that of the total system power estimation, while the estimated power consumption of the other major subsystems, such as CPU, screen, and Wi-Fi, was similar. As shown in Fig. 9(a), CSR is more accurate than the mixed model in estimating the total system power consumption of browsing the whole CNN website for about 2 minutes on Chrome with Wi-Fi and 3G. The higher accuracy results from CSR being able to correctly capture the power consumption behavior of the GPU, which SVR cannot do. On S4, the average MAPE of CSR was about 23.41%, whereas that of the mixed model was about 40.75%, i.e., about a 42.55% improvement. Fig. 9(b) shows that CSR is capable of capturing accurate GPU power consumption behavior, improving by around 70.05% on the GPU benchmark application. The improvement is due to the ASP detection capability of CSR on the GPU.
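The reported improvements follow from the average MAPEs as a relative error reduction, which the short Python check below reproduces. The `mape` helper uses the standard definition of the metric; the function names are introduced here for illustration, and the sample measurements passed to `mape` are hypothetical.

```python
from typing import Sequence


def mape(measured: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent (standard definition)."""
    if not measured or len(measured) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((m - e) / m) for m, e in zip(measured, estimated))
    return 100.0 * total / len(measured)


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Error reduction of the new method relative to the baseline, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Average MAPEs quoted in the text (mixed model vs. CSR).
nexus_s = relative_improvement(13.85, 10.58)
galaxy_s4 = relative_improvement(40.75, 23.41)
print(round(nexus_s, 2), round(galaxy_s4, 2))

# Hypothetical power samples (mW) to exercise the metric itself.
example = mape([100.0, 200.0], [110.0, 180.0])
```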

It is worth noting that all MAPEs in this work were sig-
