
GUNG-YU PAN, National Chiao Tung University
JING-YANG JOU, National Central University and National Chiao Tung University

BO-CHENG LAI, National Chiao Tung University

Dynamic power management has become an imperative design factor to attain energy efficiency in
modern systems. Among various power management schemes, learning-based policies that adapt
to different environments and applications have demonstrated superior performance to other approaches.
However, they suffer from a scalability problem on multiprocessors due to the increasing number of cores in
a system. In this article, we propose a scalable and effective online policy called MultiLevel Reinforcement
Learning (MLRL). By exploiting the hierarchical paradigm, the time complexity of MLRL is O(n lg n) for n
cores, and the convergence rate is greatly raised by compressing the redundant searching space. Some advanced
techniques, such as function approximation and the action selection scheme, are included to enhance the
generality and stability of the proposed policy. Simulated on the SPLASH-2 benchmarks, MLRL runs
53% faster and outperforms the state-of-the-art work with 13.6% energy saving and 2.7% latency penalty on
average. The generality and the scalability of MLRL are also validated through extensive simulations.
**Categories and Subject Descriptors:** D.4.7 [**Operating Systems**]: Organization and Design—*Real-time systems and embedded systems*; I.2.6 [**Artificial Intelligence**]: Learning—*Parameter learning*; J.6 [**Computer-Aided Engineering**]: Computer-Aided Design (CAD)

General Terms: Design, Algorithms, Performance, Management

Additional Key Words and Phrases: Dynamic power management, multiprocessors, reinforcement learning
**ACM Reference Format:**

Gung-Yu Pan, Jing-Yang Jou, and Bo-Cheng Lai. 2014. Scalable power management using multilevel reinforcement learning for multiprocessors. ACM Trans. Des. Autom. Electron. Syst. 19, 4, Article 33 (August 2014), 23 pages.

DOI: http://dx.doi.org/10.1145/2629486

**1. INTRODUCTION**

Power consumption has become the bottleneck for digital designs in the past decade [Pedram 1996]. Among many low-power techniques, online power saving is widely applied to appliances that pose stringent energy requirements, such as battery-powered embedded devices [Jha 2001]. Dynamic power management (DPM) is one of the most popular power saving approaches. When a device is idle, it can be switched into inactive states to avoid wasting power [Benini et al. 2000]. The power states are defined in the Advanced Configuration and Power Interface (ACPI) [2011] and supported by many modern commercial systems, such as Enhanced Intel SpeedStep Technology (EIST) [Intel 2013], AMD PowerNow! [AMD 2013], ARM Intelligent Energy Controller (IEC)

This work is supported by the National Science Council under grant NSC101-2221-E-008-137-MY3. Authors’ addresses: G.-Y. Pan (corresponding author), Department of Electronics Engineering, National Chiao Tung University, 1001 University Road, Hsinchu City, Taiwan; email: astro@eda.ee.nctu.edu.tw; J.-Y. Jou, Department of Electrical Engineering, National Central University, Taoyuan County, Taiwan; B.-C. Lai, Institute of Electronics, National Chiao Tung University, 1001 University Road, Hsinchu City, Taiwan.

© 2014 ACM 1084-4309/2014/08-ART33 $15.00

[ARM 2005], and MIPS Cluster Power Controller (CPC) [Knoth 2009]. With the platforms supporting the interfaces, software-based power managers are able to determine the power states for the devices.

Recently, multiprocessor systems have become the mainstream in both general-purpose and embedded computers. The number of processors in a system grows as time advances [Held et al. 2006]; it is expected that there will be hundreds of processors in a system in the near future. Since traditional DPM policies are mostly proposed for uniprocessor systems, it is imperative to design an effective policy suitable for multiprocessor systems.

**1.1. Dynamic Power Management for Multiprocessor Systems**

For a multiprocessor system, some of the cores may be idle or lightly utilized rather than fully
loaded, hence applying power management is crucial to avoid wasting
energy. Different from traditional approaches, both the efficiency and the effectiveness of a
power manager must be carefully considered for a multiprocessor system. For
complex policies, the execution overheads become serious when there are more cores
in a system. Even worse, the policies become less effective as the solution space grows
larger. Thus, we need to design a low-complexity policy to ensure scalability and
enhance solution quality for multiprocessor systems.

The reinforcement learning technique is applied to multiprocessors in Ye and Xu
[2012] and shown to be superior to the distributed power managers of Tan et al.
[2009]. All the core states and actions are encoded, and the learning function is
approximated using a back-propagation neural network (BPNN). Tasks are assumed
independent, thus their policy is able to not only determine the power mode but also
assign each task to a specific core. However, task assignment (context scheduling) has
other considerations, such as data locality, that greatly affect the performance in real
environments. Besides, scalability is still a problem given its O(n²) time complexity.
**1.2. Our Contributions**

In this article, we propose a policy called MultiLevel Reinforcement Learning (MLRL) that is much more scalable in efficiency and effectiveness for multiprocessors. The hierarchical approach performs better than both the distributed policies, which lack global views, and the centralized policies, which suffer from an exponentially increasing solution space. Besides, we choose another low-overhead network for function approximation and an adaptive approach for action selection. The proposed policy is general and applicable in multiprogrammed or multithreaded environments without prior knowledge to train the policy beforehand. It is evaluated using real benchmarks on a cycle-accurate simulator.

The main novelty of this article is the carefully designed multilevel framework that resolves the scalability issue by turning the exponential decision problem into a linear one, and provides the knobs for further optimizations. Both the solution quality and the policy efficiency are greatly raised. Moreover, the problem formulation is more general and the effectiveness is evaluated through real benchmarks. In short, we make the following contributions.

—The multilevel paradigm is exploited to compress the searching space, speed up the convergence rate, and achieve O(n lg n) time complexity for n cores.

—The proposed online policy is independent of context scheduling; it neither needs preliminary information of the workload nor assumes the tasks are independent.

—The simulation results show that our policy runs 53% faster and outperforms the state-of-the-art work for the SPLASH-2 benchmarks; the performance penalty is close to zero while the energy saving is close to the upper bound (14.7%).

The remainder of this article is organized as follows. Section 2 explains the background approaches of our policy and Section 3 formulates the problem. Then Section 4 illustrates the basic framework of our policy and Section 5 provides several enhancement techniques. Afterward, Section 6 shows the simulation results and examines both the basic and advanced approaches. Lastly, Section 7 surveys the related works and Section 8 summarizes this article.

**2. BACKGROUND**

In this section, the background of the two key concepts is introduced before describing our policy. The notations of these works are inherited throughout this article.

**2.1. Reinforcement Learning**

Reinforcement learning (RL) [Barto and Mahadevan 2003] is applied in some previous
works [Tan et al. 2009; Ye and Xu 2012], outperforming other DPM approaches. There
are strong similarities between DPM and RL: the manager of DPM decides the
next power state according to the current system status and the past statistics, while
the agent of RL observes the environment state s_t at time t (called an epoch), takes an
action a_t, and accordingly receives the reward r_t. Because the trial-and-error processes
are similar, the power manager can be implemented using an RL-based agent.

One of the most effective RL algorithms is Q-learning, which keeps a Q-value for
every state-action pair. The Q-value Q(s_t, a_t) is the average reward prediction of the
pair (s_t, a_t). During the decide phase, the agent selects the action a_t based on the
Q-values of the current state s_t, and then updates Q(s_t, a_t) after receiving the reward r_t in
the update phase. Note that the agent may not receive r_t right at time t but
in the future.

There are three basic methods to decide a_t: greedy, ε-greedy, and softmax. The greedy
method picks the action with the highest Q-value. The ε-greedy strategy also selects
the action with the highest Q-value most of the time, but may select other actions
with a prespecified small probability ε. The softmax method assigns the probability of
selecting each action proportional to exp(Q(s_t, a_t)/τ), where τ is the temperature that
decreases as time advances.

The Q-value is updated according to the equation

Q(s_t, a_t) ← Q(s_t, a_t) + μ · ( r_t + γ · max_{a′} Q(s′, a′) − Q(s_t, a_t) ),   (1)

where 0 ≤ μ ≤ 1 is the learning rate, 0 ≤ γ < 1 is the discount rate that accounts for
future rewards, and s′ and a′ are the next state and action after taking a_t, respectively.

Since an RL agent learns online, the convergence time (the number of epochs for the
Q-values to become stable) should be minimized, otherwise the agent would make inferior
decisions. Besides, the agent should be efficient to minimize the runtime overhead (the
time to execute the agent). When the state space (all the possible states) is large, more
samples are needed for training before convergence. When the action space (all the
possible actions with respect to a specific state) is large, the determination overhead
increases proportionally. Therefore, policy designers need to avoid the explosion of
the state and action spaces.
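As a concrete illustration, the decide/update cycle described above can be sketched as a minimal tabular Q-learning agent. The class name, state encoding, and parameter values below are our own illustrative choices, not the paper's implementation.

```python
import random
from collections import defaultdict

# Illustrative parameter values (learning rate, discount rate, exploration).
MU, GAMMA, EPSILON = 0.1, 0.9, 0.05

class QAgent:
    def __init__(self, actions):
        self.actions = actions
        self.q = defaultdict(float)  # Q(s, a), zero-initialized

    def decide(self, state):
        """Epsilon-greedy action selection; ties break toward earlier actions."""
        if random.random() < EPSILON:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        """Q-value update, Eq. (1): Q(s,a) += mu*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += MU * (r + GAMMA * best_next - self.q[(s, a)])
```

With zero-initialized Q-values, one update with reward 1.0 moves Q(s, a) to μ · 1.0 = 0.1, after which the greedy branch of `decide` prefers that action.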

**2.2. MultiLevel Paradigm**

The multilevel paradigm was first proposed in Karypis et al. [1999] for circuit partitioning
and then applied to other VLSI problems. During the coarsening phase, the elements
are recursively clustered until the problem size is small enough to be solved. Then, in
the uncoarsening phase, the coarsened elements are declustered while the refinement
algorithm is applied. A coarse solution is produced between the two phases.

Algorithms based on the multilevel paradigm are able to produce high-quality solutions in a small amount of time, while flat algorithms face lots of small elements and thus lack global views over the problems. The coarsening phase generates a good approximation of the original problem, so better initial solutions can be obtained. Then, in the uncoarsening phase, the refinement algorithms are more effective because they are able to focus on local problems of smaller sizes.

**3. PROBLEM FORMULATION**
**3.1. System Model**

The target architecture is a multiprocessor system containing n homogeneous cores
and m threads per core. It is able to simultaneously execute n × m contexts. Since
our focus is on the scalability issues of multiprocessors instead of other peripherals or
subsystems, only the multiprocessor cores are considered in the rest of this article.

The power manager is implemented in the operating system and activated
periodically with period T [Isci et al. 2006; Winter et al. 2010]. It is able to switch the cores into
different power modes that are given statically according to ACPI [2011]. The mode
switching is done at per-core granularity due to higher power saving ratios [Sharkey
et al. 2007; Kim et al. 2008]; because the power modes cannot be switched
instantaneously, some transition overheads, including extra delay and energy, are induced.
Physical information is available to the power manager, measured by performance
counters and power sensors [Isci et al. 2006; Ye and Xu 2012].

Since the environments and applications are diverse for different multiprocessor
systems, the power manager should be designed for general applications. The workload
characteristics are assumed unknown in advance; the workload may comprise several single- or
multithreaded programs. To tackle various workloads, the context schedulers have
other complex considerations, such as data locality and load balancing. Hence, the
power manager is assumed to be general and independent of the scheduling policies.
**3.2. Problem Statement**

In general, DPM policies have three optimization goals [Benini et al. 2000]. First, the total energy saving should be maximized. Note that lowering power consumption does not imply energy saving in some circumstances, because the latency may be longer. Second, the latency penalty is minimized, even when the energy is lowered. Third, the runtime of the manager should be minimized to avoid prolonging the kernel time.

The focus of this article is designing the DPM policy for multiprocessor systems
according to the preceding system model and optimization goals. Since the proposed
policy is based on reinforcement learning, the terminologies are described as follows.
**3.3. Learning-Based Power Management**

The overall flow of learning-based power management is shown in Figure 1, with the
timeline in Figure 2. At each epoch t, according to the current state s_t, the agent
(power manager) decides the action a_t based on the Q-values. In addition, Q(s_{t−1}, a_{t−1})
is updated using (1) with the past reward r_{t−1} received in [t − 1, t] (the current reward r_t
cannot be received until t + 1). When the agent is activated at t = 1, Q(s_0, a_0) is
updated using r_0 and then the action a_1 is taken.

To apply the Q-learning algorithm for dynamic power management, the current state
is a pair s = (sp, sq), with sp for the current power mode and sq for the number of tasks

Fig. 1. Overall flow of power management based on Q-learning.

Fig. 2. Timeline (epochs) of power management based on Q-learning.

Table I. List of Notations

| Notation | Meaning |
|---|---|
| n | Number of total processors |
| m | Maximum number of tasks per processor |
| T | Power management period |
| s_t | System state at time t |
| sp_t | Power state (computation capability) at time t |
| sq_t | Queue state (number of tasks in queues) at time t |
| a_t | Target power state (computation capability) at time t |
| r_t | Lagrangian figure-of-merit for the state-action pair (s_t, a_t) |
| β | Trade-off parameter between performance and power |
| μ | Learning rate of reinforcement learning |
| γ | Discount rate of reinforcement learning |
| ε | The probability threshold of escaping from the greedy choice |
| φ_k | The kth hidden node in a Radial Basis Function (RBF) network |
| u_k | The position vector of φ_k in an RBF network |
| σ_k | The width of φ_k in an RBF network |
| α_k | The weight of φ_k in an RBF network |
| K | The number of hidden nodes in an RBF network |
| e_min | Desired accuracy of an RBF network |
| δ | Distance threshold between nodes in an RBF network |
| λ | The decay constant of δ in an RBF network |
| κ | The overlap factor between hidden nodes in an RBF network |
in queue(s), and the action a is the target power mode. The reward function

r = β · throughput − (1 − β) · power   (2)

is the Lagrangian figure-of-merit for instantaneous throughput and power, where
0 ≤ β ≤ 1 is the trade-off parameter. Although optimizing instantaneous throughput
does not guarantee optimum overall latency, this greedy strategy is a must when
the incoming workloads are unknown [Mariani et al. 2011].
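The reward of Eq. (2) is straightforward to compute; the sketch below is ours, with the function name and the sample throughput/power values chosen for illustration.

```python
# Lagrangian reward of Eq. (2): r = beta * throughput - (1 - beta) * power.
# beta = 1 optimizes throughput only; beta = 0 optimizes power only.
def reward(throughput, power, beta=0.5):
    return beta * throughput - (1.0 - beta) * power
```

For example, with β = 0.5, a throughput of 10 and a power of 4 yield a reward of 3.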

The important notations are listed in Table I. The first and second sectors list the notations and parameters for the learning-based power management described in Section 4, respectively. The third sector lists the parameters for the enhancement techniques defined in Section 5. The values of the parameters, obtained from previous works, are set for the simulations in Section 6.

Fig. 3. A binary tree for the multilevel framework on top of a multiprocessor.

**4. MULTILEVEL REINFORCEMENT LEARNING**

In this section, the basic power management framework is described, while some ad-vanced techniques are provided in the next section to further enhance the performance of the proposed policy.

**4.1. The MultiLevel Framework**

Since there are at least 2^n states and actions for per-core DPM, directly applying Q-learning to multiprocessors is not scalable due to the explosion of policy runtime. Moreover, the agent may find inferior solutions because of the huge searching space [Barto and Mahadevan 2003]. Therefore, our learning algorithm is based on the multilevel paradigm to reduce the overhead and shrink the searching space, thereby enhancing the solution quality.

In order to implement the coarsening and uncoarsening phases of the multilevel
paradigm, a binary tree is built on top of the cores, as shown in Figure 3, where the
number of cores is denoted as n and D = lg n is the depth of the tree. The vertexes are
numbered from the root to the rightmost leaf. For a general multiprocessor system, the
number of cores n may or may not be a power of two; a complete binary tree is built
on top of it for general cases (five cores and eleven vertexes in Figure 3), while the tree
becomes full for power-of-two cores (eight cores and fourteen vertexes in Figure 3). In
general, there are three cases for a vertex: containing two children, containing only the
left child, or containing no child.

In the proposed MultiLevel Reinforcement Learning (MLRL) framework, a vertex
v contains six attributes (sp_{t−1}, sp_t, sq_{t−1}, sq_t, a_t, r_{t−1}): sp_{t−1}[v], sp_t[v], and a_t[v] represent
the past, current, and target aggregate power state, respectively; sq_{t−1}[v] and sq_t[v]
represent the past and current queue state, respectively; and r_{t−1}[v] represents the
received Lagrangian reward, calculated according to (2) with respect to the past state-action pair. A parent vertex represents the coarse version of its children, thus the root represents the coarse version of the entire multiprocessor system. For a homogeneous system, the attributes of a parent vertex are the summation of the attributes of its children.

**ALGORITHM 1: MultiLevel-Reinforcement-Learning**

1   for d ← D to 1 do
2       foreach vertex v with depth(v) = d do
3           Update-Node(v);
4       end
5   end
6   Update-Root();
7   Decide-Root();
8   for d ← 0 to D − 1 do
9       foreach vertex v with depth(v) = d do
10          Decide-Node(v);
11      end
12  end

The overall flow of the proposed policy is described in Algorithm 1. There are three steps in our algorithm: the coarsening phase (lines 1–5), the coarse solution phase (lines 6–7), and the uncoarsening phase (lines 8–12). First, the attributes are collected from the leaves to the root and the Q-values are updated in Update-Node. Then the coarse solution is made on the root in Update-Root and Decide-Root; the basic approaches are first described for the coarse solution phase in this section, while some enhancement techniques are introduced in the next section. At last, the fine-grained solutions are decided hierarchically in Decide-Node, according to the coarse-grained solution of each node and the Q-values of its children.

Overall, the total numbers of active cores and tasks (ready or running) are first obtained and the Q-values are updated simultaneously. Then, the number of active cores for the next period is determined accordingly. At last, these resources are distributed to the individual cores. The details of the three phases are explained in the following three subsections, and the time complexity is analyzed in the last one.
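The control flow of Algorithm 1 can be sketched over an array-indexed binary tree (root at vertex 0, children of v at 2v + 1 and 2v + 2, matching the vertex numbering used later). The Update-/Decide- callbacks are placeholders for the routines detailed in the following subsections; this skeleton is our own sketch.

```python
import math

# Structural sketch of Algorithm 1: one power-management epoch.
def mlrl_epoch(n, update_node, update_root, decide_root, decide_node):
    D = math.ceil(math.log2(n))              # tree depth for n cores
    def vertices_at(d):                      # vertexes with depth d
        return range(2**d - 1, 2**(d + 1) - 1)
    for d in range(D, 0, -1):                # coarsening: leaves up to depth 1
        for v in vertices_at(d):
            update_node(v)
    update_root()                            # coarse solution on the root
    decide_root()
    for d in range(0, D):                    # uncoarsening: root downward
        for v in vertices_at(d):
            decide_node(v)
```

For n = 4 (D = 2), the coarsening pass visits vertexes 3–6 and then 1–2, the coarse solution is made on the root, and the uncoarsening pass visits vertexes 0, 1, 2, in that order.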

**4.2. The Coarsening Phase**

The coarsening phase starts from the leaves (with depth D) to the branches under
the root (with depth 1). There are two missions in this phase: collecting the attributes
(sp_{t−1}, sp_t, sq_{t−1}, sq_t, r_{t−1}) and updating the Q-values. Note that the attribute value of
the action a_t is determined in the uncoarsening phase.

For each leaf node v, the current power and queue states are sampled from the
corresponding core as sp_t[v] and sq_t[v], respectively; the past states sp_{t−1}[v] and sq_{t−1}[v]
were stored in the last epoch; and the reward r_{t−1}[v] is calculated according to (2) using
the received power and performance values.

Otherwise, for each branch node v, the attributes (sp_{t−1}, sp_t, sq_{t−1}, sq_t, r_{t−1}) are
summed up from its left child vl and right child vr, so that the attributes represent the
aggregate behavior of the descendant leaf nodes. For example, sp_t[v] represents the current
number of active processors in the subtree of v. Note that this coarsening methodology
greatly reduces the data size from exponential to linear while keeping the principal
information. The coarsened granularity is restored in the uncoarsening phase.

Then the Q-values are updated using the attributes (sp_{t−1}, sp_t, sq_{t−1}, sq_t, r_{t−1}). For
each vertex v on depth d, one of its Q-values (denoted as Q^(v)) is updated according
to (1) for the state-action pair (s_{t−1}, a_{t−1}). The state s_{t−1} = (sp_{t−1}, sq_{t−1}) and the action
a_{t−1} = sp_t were stored in the last epoch. The maximum possible future Q-value

maxQ ← max_{0 ≤ a′ ≤ n/2^d} Q^(v)(sp_t[v], sq_t[v], a′)   (3)

is calculated using the sampled current state (sp_t[v], sq_t[v]) instead of estimating s′ as in
(1). Then the Q-value is updated as

Q^(v)(sp_{t−1}[v], sq_{t−1}[v], sp_t[v]) ← Q^(v)(sp_{t−1}[v], sq_{t−1}[v], sp_t[v]) + μ · ( r_{t−1}[v] + γ · maxQ − Q^(v)(sp_{t−1}[v], sq_{t−1}[v], sp_t[v]) ).   (4)

In general non-power-of-two architectures, we need to ensure that the collected
attributes (sp_{t−1}, sp_t, sq_{t−1}, sq_t, r_{t−1}) of any node represent the aggregate behavior of its descendant subtree. For a parent vertex v with only the left child vl = 2v + 1 (such as node 5 in Figure 3 for the five-core system), its attributes are copied from the child. For a parent vertex with no child (such as node 6 in Figure 3 for the five-core system), its attributes are all zero. For a parent vertex with two children, the updating process is the same as that of the power-of-two architectures. Using these rules, any vertex v still represents the aggregate behavior of its descendant subtree.
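The three aggregation cases above can be sketched as follows. The dictionary-based attribute layout, key names, and function name are our own illustrative choices; vertexes are 0-indexed with children at 2v + 1 and 2v + 2.

```python
# Attribute keys: past/current power state, past/current queue state, past reward.
KEYS = ('sp_prev', 'sp', 'sq_prev', 'sq', 'r_prev')

def coarsen(attrs, v, num_vertices):
    """Aggregate the attributes of branch vertex v from its children.

    Two children -> elementwise sum; left child only -> copy; no child -> zeros.
    """
    left, right = 2 * v + 1, 2 * v + 2
    if left >= num_vertices:                      # no child: all zeros
        return dict.fromkeys(KEYS, 0)
    if right >= num_vertices:                     # only the left child: copy
        return dict(attrs[left])
    return {k: attrs[left][k] + attrs[right][k] for k in KEYS}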

**4.3. The Coarse Solution Phase**

There are two main steps in this phase. Update-Root is similar to the coarsening phase:
the attributes of the overall system are accumulated and one of the Q-values Q^(0)
is updated. Then the target number of active processors is determined by the action
selection scheme in Decide-Root, according to the Q-values Q^(0).

The Update-Root function first obtains the attributes (sp_{t−1}, sp_t, sq_{t−1}, sq_t, r_{t−1}) on
the root node. These attributes represent the coarsened behavior of the overall system;
they provide the global view (aggregate active resources, loading, and reward) but lack
details (the distribution of these attributes over the cores). Note that, besides the contexts
allocated to cores, the number of unallocated (ready) contexts is added into sq_t[0] as
well to represent the overall system loading.

Since the candidate actions on the root can be an arbitrary number of active
processors 0 ≤ a′ ≤ n regardless of the current power state sp, the state s = (sp, sq) is
approximated using the number of tasks sq to focus on steady-state rewards. In other
words, the amount of computation resources is determined according to the system
loading in this phase, while the transition costs are optimized in the uncoarsening
phase. In fact, the accumulated transition costs are the same regardless of the switching
orders. For example, if allocating three active cores is the best solution for the current
system loading while there is only one active core currently, the accumulated transition
costs are the same between turning on two more cores directly and turning on one more
core in each of two epochs.

Thus, the Q-values on the root contain only two dimensions. The updating process
reduces to

Q^(0)(sq_{t−1}[0], sp_t[0]) ← Q^(0)(sq_{t−1}[0], sp_t[0]) + μ · ( r_{t−1}[0] + γ · max_{0 ≤ a′ ≤ n} Q^(0)(sq_t[0], a′) − Q^(0)(sq_{t−1}[0], sp_t[0]) ).   (5)

Since the state space shrinks from (n + 1) × (mn + 1) to (mn + 1) while the action space
remains (n + 1), the convergence time is greatly reduced. The deciding process is also
based on the values of Q(sq, a).

The ε-greedy method is taken as the basic action selection scheme:

a_t[0] ← arg max_{0 ≤ a ≤ n} Q^(0)(sq_t[0], a)  if ξ > ε;  a random action otherwise,   (6)

where 0 ≤ ε ≤ 1 is the small probability of escaping from the local optimum and 0 ≤ ξ ≤ 1
is a uniform random number that changes every time. Because the initial Q-values
are the same, the smallest value of a_t[0] (among actions with the same Q-value) is selected to break
ties. Although the softmax method is more delicate in action selection than ε-greedy,
it is difficult to preset a proper value for the parameter τ without knowledge of the
application [Sutton and Barto 1998].

The system behavior is strongly related to the value of ε. When ε is small, the system
is stable but often trapped in the local optimum. In the extreme case, the system may be
stuck in the initial full-on state (a_t[0] = n) if the pure greedy (ε = 0) strategy is taken.
On the other hand, the system may often jump to random inferior states when ε is
large. This is known as the dilemma of exploration and exploitation [Tokic and Palm
2011]. Note that the ε-greedy action selection scheme is only our basic strategy, used
in Ye and Xu [2012] as well, while some advanced action selection schemes
are explained in the next section.

**4.4. The Uncoarsening Phase**

The uncoarsening phase starts from the root (with depth 0) to the deepest branches
(with depth D − 1). The mission is to hierarchically distribute the aggregate resources
(the target number of active cores a_t[0]) to the distinct cores and restore the granularity. The
decisions are made according to the attributes and Q-values updated in the coarsening phase.

For each node v with depth d, the number of active cores a_t[v] is distributed to a_t[vl]
and a_t[vr], where vl and vr are the left and right child nodes, respectively, such that
a_t[vl] + a_t[vr] = a_t[v]. Let l[v] denote the number of leaves in the descendant subtree
of each vertex v (where l[0] = n); thus the target number of active cores is limited
by a_t[v] ≤ l[v]. According to the current power modes sp_t[vl], sp_t[vr] and numbers of
running tasks sq_t[vl], sq_t[vr], all of the combinations are tried and evaluated by the
summation of their Q-values as

(a_t[vl], a_t[vr]) ← arg max_{0 ≤ al ≤ l[vl], 0 ≤ ar ≤ l[vr], al + ar = a_t[v]} ( Q^(vl)(sp_t[vl], sq_t[vl], al) + Q^(vr)(sp_t[vr], sq_t[vr], ar) ).   (7)

The pair with the largest summed Q-value is chosen.
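The split of Eq. (7) can be sketched as a small exhaustive search over the feasible (al, ar) pairs. The function signature and the callable Q-value interface are our own illustrative choices for the Decide-Node step.

```python
# Sketch of Decide-Node, Eq. (7): split the target number of active cores a_v
# between the two children so that the summed Q-value is maximal.
# q_left / q_right map (sp, sq, a) to a Q-value.
def decide_node(a_v, leaves_l, leaves_r, sp_l, sq_l, sp_r, sq_r, q_left, q_right):
    best, best_pair = float('-inf'), (0, 0)
    for a_l in range(0, min(a_v, leaves_l) + 1):
        a_r = a_v - a_l
        if a_r < 0 or a_r > leaves_r:
            continue                           # infeasible split, skip
        score = q_left(sp_l, sq_l, a_l) + q_right(sp_r, sq_r, a_r)
        if score > best:
            best, best_pair = score, (a_l, a_r)
    return best_pair
```

For example, with toy Q-values that peak at two active cores on the left child and one on the right, distributing a_t[v] = 3 yields the split (2, 1).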

Note that the transition costs are considered implicitly in (7). The transition delay
of switching the power modes lowers the number of committed instructions, which is
revealed in the rewards and Q-values. For example, it is expected that Q^(vl)(1, 1, 1) + Q^(vr)(0, 0, 0) ≥ Q^(vl)(1, 1, 0) + Q^(vr)(0, 0, 1) after convergence, because of the transition
power of turning off the cores and the transition delay of turning on the cores. Additional performance overheads, such as context (data) migration and refilling of cache and pipeline, are counted in the rewards and Q-values as well.

For systems supporting multiple power-down modes, the distributed learning policy [Tan et al. 2009] is applied at the end of this phase. For each core, if it is turned off in the uncoarsening phase, then one of the off states is chosen according to the static power and transition cost.

**4.5. Time Complexity Analysis**

In the coarsening phase, the max operation in (3) of Update-Node requires n/2^d + 1
comparisons for each node on depth d. The coarsening process goes from d = lg(n)
to d = 1 with 2^d nodes in depth d, so the time complexity of the coarsening phase is
Σ_{d=1}^{lg(n)} 2^d (n/2^d + 1) = O(n lg n).

In the coarse solution phase, both the max operation in (5) of Update-Root and
the arg max operation in (6) of Decide-Root require n + 1 comparisons. Therefore, the
time complexity of the coarse solution phase is O(n).

In the uncoarsening phase, the arg max operation in (7) of Decide-Node requires
n/2^{d+1} + 1 comparisons and there are 2^d nodes in depth d, so the time complexity of
the uncoarsening phase, going from d = 0 to d = lg(n) − 1, is Σ_{d=0}^{lg(n)−1} 2^d (n/2^{d+1} + 1) = O(n lg n).

According to the previous analysis, the overall time complexity is O(n lg n) for
multiprocessors with power-of-two cores. In general cases, the time complexity is bounded
by O(n′ lg n′), where n′ = 2^⌈lg n⌉. Because n ≤ n′ < 2n, the time complexity of general
architectures is still O(n lg n).
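As a numeric sanity check of this analysis (our own, not from the paper), the per-epoch comparison counts of the three phases can be summed directly and compared against n lg n for power-of-two core counts.

```python
import math

def comparisons(n):
    """Total per-epoch comparison count for power-of-two n, per Section 4.5."""
    D = int(math.log2(n))
    coarsen = sum(2**d * (n // 2**d + 1) for d in range(1, D + 1))   # Eq. (3)
    coarse_solution = 2 * (n + 1)                                    # Eqs. (5)-(6)
    uncoarsen = sum(2**d * (n // 2**(d + 1) + 1) for d in range(D))  # Eq. (7)
    return coarsen + coarse_solution + uncoarsen

# The ratio to n * lg(n) stays bounded (and shrinks) as n grows.
ratios = [comparisons(n) / (n * math.log2(n)) for n in (2**6, 2**10, 2**14)]
```

Closed-form, the total is roughly 1.5 n lg n + 5n, so the ratio approaches 1.5 from above as n grows.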

**5. ENHANCEMENT TECHNIQUES**

In the previous section, the multilevel paradigm is applied to traditional reinforcement learning so that the searching space is greatly reduced while preserving the global view. There are some characteristics of power management on multiprocessors that can be exploited to further enhance the proposed policy.

First, the agent is able to estimate the reward of a state-action pair without visiting
it several times, because the relation between active resources and instantaneous power
(and performance) is continuous. Thus, the coarse solution is estimated on the root
using modified Q-learning with function approximation (interpolation) in Update-Root.
Second, the Q-values converge when the system is stable, while they change rapidly
at the beginning or when the system is unstable. Using a smart action selection
scheme to detect the system stability in Decide-Root, a random action is selected
when the system is unstable, while the greedy choice is taken after convergence.
**5.1. Function Approximation with the Radial Basis Function Network**

There are (mn + 1) × (n + 1) state-action pairs in total for Q-learning, and each pair
requires several training samples. This becomes a scalability problem when the
system has many cores (large n) and deep multithreading (large m). A large searching
space implies slow convergence and poor generality in learning, and leads to inferior
results. Therefore, function approximation (by supervised learning) is widely used in
reinforcement learning schemes [Tham 1994].

The radial basis function (RBF) is chosen to approximate our reinforcement learning scheme due to faster convergence rate with simpler structure [Wu et al. 2012], while multilayer perceptron (MLP) is chosen in the previous work [Ye and Xu 2012]. The output of an RBF network is

$$\sum_{k=1}^{K} \alpha_k\,\phi_k(\mathbf{I}), \tag{8}$$

where **I** is the input vector, φ_k is the kth RBF (hidden node) in the network with corresponding weight α_k, and K is the total number of hidden nodes in the network. Among many candidates, the Gaussian function is widely used as the RBF:

$$\phi_k(\mathbf{I}) = \exp\!\left(-\frac{\|\mathbf{I}-\mathbf{u}_k\|^2}{\sigma_k^2}\right), \tag{9}$$

where u_k and σ_k are the position vector and the width of the kth node, respectively, and ‖·‖ denotes the Euclidean norm.
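Equations (8) and (9) can be sketched in a few lines of Python (a minimal illustration with our own names; plain lists stand in for the network):

```python
import math

def rbf_output(I, centers, widths, weights):
    """Eq. (8)-(9): weighted sum of Gaussian RBFs over input vector I."""
    total = 0.0
    for u, sigma, alpha in zip(centers, widths, weights):
        sq_dist = sum((x - c) ** 2 for x, c in zip(I, u))  # ||I - u_k||^2
        total += alpha * math.exp(-sq_dist / sigma ** 2)   # alpha_k * phi_k(I)
    return total

# a single node centered exactly on the input: phi = 1, so output = alpha
print(rbf_output([0.5, 0.5], [[0.5, 0.5]], [0.3], [2.0]))  # -> 2.0
```

Inputs far from every center contribute almost nothing, which is what makes the network a local interpolator.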

Fig. 4. Approximating Q-values by the Radial Basis Function (RBF) network.

Since the number of hidden nodes K is hard to predefine so as to minimize the approximation error without unnecessarily prolonging the runtime, the Resource Allocating Network (RAN) [Platt 1991] is exploited, which is based on the RBF network with the ability to dynamically allocate hidden nodes. When the output error Δ is larger than the desired accuracy e_min and the current hidden nodes are all far from the input vector (‖**I** − u_k‖ > δ for all k, where δ is the threshold of distances between nodes), a new hidden node φ_{K+1} is allocated with

$$\alpha_{K+1} \leftarrow \Delta, \qquad \mathbf{u}_{K+1} \leftarrow \mathbf{I}, \qquad \sigma_{K+1} \leftarrow \kappa\,\min_k\|\mathbf{I}-\mathbf{u}_k\|, \tag{10}$$

where κ is the overlapping factor of hidden nodes. Otherwise, the network is adjusted by performing gradient descent on the output error Δ,

$$\omega \leftarrow \omega - \mu \frac{\partial \Delta}{\partial \omega}, \tag{11}$$

where ω is the target parameter for adjustment and μ is the learning rate (the same as in Q-learning); the weight α and the location u are adjusted in the RBF network by performing partial differentiation on the output error.
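The grow-or-adjust rule of the RAN can be sketched as follows (our own simplified rendering: the descent branch takes one LMS-style step on the nearest weight only, and the width fallback for the very first node is arbitrary):

```python
import math

def ran_step(net, I, error, e_min, delta, kappa, mu):
    """One RAN step: grow a node (Eq. 10) when the error is large and I is
    far from every center; otherwise take an LMS-style step on the nearest
    node's weight (a simplified form of Eq. 11)."""
    if net["centers"]:
        dists = [math.dist(I, u) for u in net["centers"]]
        min_d = min(dists)
    else:
        dists, min_d = [], float("inf")
    if abs(error) > e_min and min_d > delta:
        net["centers"].append(list(I))                           # u_{K+1} <- I
        net["widths"].append(kappa * min_d if dists else kappa)  # sigma_{K+1}
        net["weights"].append(error)                             # alpha_{K+1} <- error
    else:
        nr = dists.index(min_d)                                  # nearest node
        sq = sum((x - c) ** 2 for x, c in zip(I, net["centers"][nr]))
        phi = math.exp(-sq / net["widths"][nr] ** 2)
        net["weights"][nr] += mu * error * phi                   # adjust alpha only
    return net
```

Starting from an empty network, the first update always allocates a node; later updates near an existing center only nudge its weight.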

Furthermore, the Generalized Growing and Pruning RBF (GGAP-RBF) network
[Huang et al. 2005] is able to eliminate insignificant hidden nodes in the RAN. The
node nearest to the input vector is eliminated when its significance is smaller than
the approximation accuracy e_min, and a new hidden node is allocated only when it is
significant enough for the input vector. Besides, the updating process (gradient descent)
is restricted to the hidden node nearest to the input vector in order to speed up the learning process.

To approximate the Q-values on the root node using the GGAP-RBF network, the
input **I** is set to the normalized state-action pair (s̄ = sq/mn, ā = a/n), and each position vector u_k contains two dimensions, u_k = (u_sk, u_ak). The approximated Q-value on the root node is calculated by

$$\tilde{Q}^{(0)}(\bar{s},\bar{a}) = \sum_{k=1}^{K}\alpha_k \exp\!\left(-\frac{(\bar{s}-u_{sk})^2+(\bar{a}-u_{ak})^2}{\sigma_k^2}\right), \tag{12}$$

as shown in Figure 4. The updating of the Q-values is similar to (5), where Q̃^(0)(sq_{t−1}[0]/mn, sp_t[0]/n) is updated toward the training value r_{t−1}[0] + γ · maxQ. Note that all the approximated Q-values of each state-action pair are changed on every single update, providing implicit interpolation for the other Q-values.

**ALGORITHM 2: Update-Root**

**1** Update (sp_{t−1}[0], sp_t[0], sq_{t−1}[0], sq_t[0], r_{t−1}[0]) as in the coarsening phase;
**2** maxQ ← max_{0≤a*≤n} Q̃^(0)(sq_t[0]/mn, a*/n);
**3** Δ ← (r_{t−1}[0] + γ · maxQ) − Q̃^(0)(sq_{t−1}[0]/mn, sp_t[0]/n);
**4** nr ← argmin_k ‖(sq_{t−1}[0]/mn, sp_t[0]/n) − (u_sk, u_ak)‖;
**5** minD ← ‖(sq_{t−1}[0]/mn, sp_t[0]/n) − u_nr‖;
**6** **if** minD > δ **and** |Δ · κ · minD · √(π/2)| > e_min **then**
**7** &nbsp;&nbsp;Allocate a new hidden node with u_{K+1} ← (sq_{t−1}[0]/mn, sp_t[0]/n), σ_{K+1} ← κ · minD, α_{K+1} ← Δ;
**8** **end**
**9** **else**
**10** &nbsp;&nbsp;Perform gradient descent on α_nr;
**11** &nbsp;&nbsp;**if** |α_nr · σ_nr · √(π/2)| < e_min **then**
**12** &nbsp;&nbsp;&nbsp;&nbsp;remove φ_nr and the corresponding parameters u_nr, σ_nr, α_nr;
**13** &nbsp;&nbsp;**end**
**14** **end**
**15** δ ← max(δ_min, λδ);

The details of approximating Q̃^(0) using the GGAP-RBF network are shown in
Algorithm 2. The approximation error Δ and the minimum distance minD between the
input vector and the hidden nodes are calculated in lines 2–5. If the distance is
larger than the threshold δ and the error is significant against the desired accuracy
e_min, a new hidden node is allocated in lines 6–8; otherwise, the network is adjusted
by performing gradient descent on the nearest node φ_nr in line 10. Because the input
vectors are discrete, the gradient descent adjustment is only performed on the weight
α without moving the position. Then the significance of the nearest node φ_nr is checked
in lines 11–13. The distance threshold δ is initialized to δ_max and then decays as time
advances toward δ_min with the decay constant λ in line 15.
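Algorithm 2 can be paraphrased in Python roughly as follows (a self-contained sketch under our own naming; the parameter dictionary `p`, the simplified gradient step on the nearest weight, and the bootstrap distance for an empty network are our assumptions, not the paper's exact implementation):

```python
import math

def q_value(net, s_bar, a_bar):
    """Eq. (12): approximated root Q-value for a normalized (state, action)."""
    return sum(w * math.exp(-((s_bar - us) ** 2 + (a_bar - ua) ** 2) / sg ** 2)
               for (us, ua), sg, w in zip(net["centers"], net["widths"],
                                          net["weights"]))

def update_root(net, p, s_prev, a_prev, s_now, reward, n, m=1):
    """Sketch of Algorithm 2; `net` holds parallel lists of node parameters."""
    x = (s_prev / (m * n), a_prev / n)                       # normalized input
    # lines 2-3: TD error against the best action in the new state
    max_q = max(q_value(net, s_now / (m * n), a / n) for a in range(n + 1))
    err = (reward + p["gamma"] * max_q) - q_value(net, *x)
    # lines 4-5: nearest hidden node and its distance
    if net["centers"]:
        dists = [math.dist(x, u) for u in net["centers"]]
        nr = dists.index(min(dists))
        min_d = dists[nr]
    else:
        nr, min_d = -1, 1.0       # bootstrap distance for the very first node
    sig_new = abs(err) * p["kappa"] * min_d * math.sqrt(math.pi / 2)
    if min_d > p["delta"] and sig_new > p["e_min"]:
        # lines 6-8: allocate a significant, sufficiently distant node
        net["centers"].append(x)
        net["widths"].append(p["kappa"] * min_d)
        net["weights"].append(err)
    elif net["centers"]:
        # line 10: simplified gradient step on the nearest node's weight only
        net["weights"][nr] += p["mu"] * err
        # lines 11-13: prune the nearest node if it has become insignificant
        sig_nr = abs(net["weights"][nr] * net["widths"][nr]) * math.sqrt(math.pi / 2)
        if sig_nr < p["e_min"]:
            for key in ("centers", "widths", "weights"):
                net[key].pop(nr)
    # line 15: decay the distance threshold
    p["delta"] = max(p["delta_min"], p["lam"] * p["delta"])
    return err
```

With the paper's parameter values (e_min = 0.05, δ_max = 0.7, κ = 0.87), the first update on an empty network allocates one node whose weight is the full TD error.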

The most time-consuming operation in updating the RBF network requires n + 1 comparisons,
and each Q̃^(0) requires O(K) time to evaluate in line 2. Note that the number of hidden
nodes K is limited by the minimum distance parameter δ_min and the desired accuracy
e_min [Platt 1991; Huang et al. 2005]. In implementation, K does not grow with n because
the input vector is normalized to the range [0, 1]; thus the complexity of
Update-Root using the GGAP-RBF network is O(n).

**5.2. Smart Action Selection**

Since the solution space is too large to try all of the state-action combinations, the agent faces the dilemma of exploration versus exploitation: should it select the best action within its current knowledge, or try other unknown actions? In the traditional ε-greedy scheme, the trade-off parameter ε is a fixed value for all states, which needs careful manual tuning. Hence, we choose another action selection scheme: the Value-Difference-Based Exploration (VDBE)-softmax method [Tokic and Palm 2011], which outperforms other schemes when combined with Q-learning. The probability of jumping out of the local optimum is calculated according to the value difference of the (approximated) Q-values; that is, the agent takes the greedy action when the system is stable and tries other actions when it is not. It is similar to ε-greedy but replaces the uniform random selection by the softmax probability

$$\frac{\exp(Q(s,a)/\tau)}{\sum_b \exp(Q(s,b)/\tau)}. \tag{13}$$

**ALGORITHM 3: Decide-Root**

**1** ε(sq_{t−1}[0]) ← (1/(n+1)) · (1 − exp(−μ|Δ|))/(1 + exp(−μ|Δ|)) + (1 − 1/(n+1)) · ε(sq_{t−1}[0]);
**2** **if** ξ < ε(sq_t[0]) **then**
**3** &nbsp;&nbsp;a_t[0] ← arg softmax_{0≤a*≤n} Q̃^(0)(sq_t[0]/mn, a*/n);
**4** **end**
**5** **else**
**6** &nbsp;&nbsp;a_t[0] ← arg max_{0≤a*≤n} Q̃^(0)(sq_t[0]/mn, a*/n);
**7** **end**

The temperature is kept constant at τ = 1 in VDBE-softmax so that other actions have
higher probabilities of being tried, while the greedy action is taken most of the time with
probability 1 − ε(s). Besides, a per-state threshold ε(s) is kept instead of using a global
value, so the agent can explore different actions when the environment changes.

The details of VDBE-softmax are shown in Algorithm 3. The per-state exploration
probability ε(s) is updated according to the value difference in the Boltzmann
distribution, where the learning rate is the inverse of the number of actions. The value of
Δ is the same as in Algorithm 2; it is the difference between the target Q-value and
the current (approximated) Q-value. When the system is unstable (at the beginning
or when the environment changes), ε(s) is large, so the agent may explore other
actions; otherwise, the agent takes the greedy choice to avoid jumping to other inferior
states.
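A compact sketch of Algorithm 3 in Python (names are ours; `q_of` stands in for the approximated root Q-function, the initial per-state threshold defaults to 1, and τ = 1 as in VDBE-softmax):

```python
import math
import random

def decide_root(q_of, eps, s_prev, s_now, td_error, n, mu):
    """Sketch of Algorithm 3: update eps(s_prev), then pick an action.
    `eps` is a dict of per-state exploration probabilities, initially 1."""
    lr = 1.0 / (n + 1)                       # learning rate = 1 / |actions|
    boltz = ((1 - math.exp(-mu * abs(td_error))) /
             (1 + math.exp(-mu * abs(td_error))))
    eps[s_prev] = lr * boltz + (1 - lr) * eps.get(s_prev, 1.0)
    qs = [q_of(s_now, a) for a in range(n + 1)]
    if random.random() < eps.get(s_now, 1.0):
        # explore: softmax draw with temperature tau = 1 (Eq. 13)
        weights = [math.exp(q) for q in qs]
        return random.choices(range(n + 1), weights=weights)[0]
    return max(range(n + 1), key=lambda a: qs[a])  # exploit: greedy action
```

A large |Δ| pushes ε(s) up, so a changing environment re-enables exploration for exactly the states whose value estimates are in flux.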

**6. SIMULATION RESULTS**

The proposed policy is compared with the previous work by simulation. First, the simulation environment and the parameters of the policies are described. Then, the policies are compared and analyzed in terms of energy saving and performance penalty using open-source benchmarks. Finally, the runtime overheads in executing the policies are analyzed. Because of the randomness in the simulator as well as the policies, all the results are averaged from at least ten runs.

**6.1. Environment Settings**

The multiprocessor performance simulator Multi2Sim [Ubal et al. 2007] and the power
simulator McPAT [Li et al. 2009] are combined to support closed-loop dynamic power
management. The transition overheads are implicitly modeled, including the
hardware costs of mode switching and of refilling the pipeline or cache. The power manager
is activated with period T = 1ms (between 0.5ms in Isci et al. [2006] and 10ms in
Winter et al. [2010]). During each period, the performance statistics of the last period
are collected from Multi2Sim and sent to McPAT to obtain the corresponding power
statistics; both sets of feedback information are sent to the power manager to determine the
target power mode of the multiprocessor cores.
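The closed loop described above can be outlined as follows (a hypothetical skeleton; the three helper functions are stand-ins for the Multi2Sim/McPAT interfaces, not real entry points, and return canned values here):

```python
# Hypothetical closed-loop driver; the helpers below are placeholders only.
def collect_perf(period_ms):
    return {"ipc": 1.0}                # stats of the last period

def estimate_power(perf):
    return 0.3 * perf["ipc"]           # made-up linear power model (watts)

def apply_modes(modes):
    pass                               # would switch per-core power modes

def manage(policy, steps):
    """Invoke the power manager once per 1 ms period for `steps` periods."""
    log = []
    for t in range(steps):
        perf = collect_perf(1.0)       # performance feedback
        power = estimate_power(perf)   # power feedback
        modes = policy(perf, power)    # policy returns target mode per core
        apply_modes(modes)
        log.append(modes)
    return log

# e.g. a trivial policy that keeps all four cores on
print(len(manage(lambda perf, power: [1, 1, 1, 1], 5)))  # -> 5
```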

The target architecture is the ARM Cortex-A9 MPCore [ARM 2012], whose configuration parameters are listed in Table II, obtained from its manual [ARM 2012] and McPAT [Li et al. 2009]. The technology parameters are supported by McPAT as well.

The workloads for simulations are the SPLASH-2 benchmarks [Woo et al. 1995] as listed in Table III, which are widely used to evaluate multiprocessor systems. The executables, arguments, and input files are obtained from Multi2Sim [Ubal et al. 2007].

Table II. Configuration Parameters of ARM Cortex A9 [ARM 2012]

| Parameter name | Parameter value |
|---|---|
| Number of cores | 4 |
| Number of threads per core | 1 |
| Technology node | 40nm |
| Operating frequency | 2000MHz |
| Supply voltage | 0.66V |
| Threshold voltage | 0.23V |
| Decode width | 2 |
| Issue width | 4 |
| Commit width | 4 |
| Number of ALUs per core | 3 |
| Number of MULs per core | 1 |
| Number of FPUs per core | 1 |
| Branch predictor | Two-level, 1024-set 2-way BTB |
| L1 data cache | 32KB, 4-way, 10-cycle latency |
| L1 instruction cache | 32KB, 4-way, 10-cycle latency |
| L2 unified cache | 1MB, 8-way, 23-cycle latency |

Table III. SPLASH-2 Benchmarks

| Benchmark | Problem Size | Cycles (B) | Instructions (B) | Power (W) |
|---|---|---|---|---|
| Barnes | 2048 particles | 0.305 | 0.433 | 0.294 |
| Cholesky | tk14.O | 0.288 | 0.130 | 0.192 |
| FFT | 65536 points | 0.817 | 0.426 | 0.202 |
| FMM | 2048 particles | 0.319 | 0.542 | 0.333 |
| LU | 512 × 512 matrix, 16 × 16 blocks | 0.902 | 0.949 | 0.259 |
| Ocean | 130 × 130 ocean | 0.364 | 0.305 | 0.234 |
| Radiosity | -batch -en 0.5 | 0.940 | 1.104 | 0.271 |
| Radix | 256k keys, max-value 524288, radix 4096 | 0.154 | 0.227 | 0.298 |
| Raytrace | balls4 | 1.493 | 0.613 | 0.190 |
| Water-Nsq | 512 molecules, 1 timestep | 0.492 | 0.623 | 0.283 |
| Water-Sp | 512 molecules, 1 timestep | 0.421 | 0.550 | 0.287 |

The dynamic context scheduler is provided by Multi2Sim [Ubal et al. 2007]. Each program dynamically forks at most four parallel contexts during runtime. The context binding overheads are inherently modeled in the simulator.

Our policy (MLRL) is compared with the state-of-the-art work (BPNN) [Ye and Xu
2012]. The results are normalized to the baseline (BASE) without power management;
the measured data of the baseline are shown in Table III. In addition, the oracle
policy (GOLD) with maximum energy saving but zero performance penalty is listed
as the reference upper bound. The parameters for Q-learning are set the same as in
Ye and Xu [2012], with γ = 0.5, μ = 0.5, and ε = 0.1 for BPNN. The parameters for
the GGAP-RBF network are set according to Platt [1991] and Huang et al. [2005] as
e_min = 0.05, δ_max = 0.7, δ_min = 0.07, κ = 0.87, λ = exp(−1/17), and the inputs (s, a)
are normalized to [0, 1]. Both policies are initialized without workload knowledge or
pretraining. For the reward calculated according to (2), the throughput is in units
of instructions per cycle (IPC) and the power is in watts (W).

**6.2. Comparisons of Performance and Energy**

6.2.1. Analyses on Different Policies. The statistics of running the different policies with β = 0.9 for SPLASH-2 are shown in Table IV. The values of BPNN and MLRL are normalized percentages compared with BASE. The average power saving of MLRL is close to that of BPNN, but the performance penalty is much lower, resulting in much more energy saving. The main reason is that the rapid convergence of MLRL avoids trying lots of inferior actions. Moreover, VDBE-softmax ensures the optimal action is selected after convergence, while ε-greedy occasionally jumps to other states. For MLRL, the energy saving is 13.58% on average and up to 49.54%, while the performance penalty is 2.66% on average and within 8.03% at worst.

Table IV. Simulations of SPLASH-2 Benchmarks with β = 0.9 (all values in %)

| Benchmark | Power BPNN | Power MLRL | Power GOLD | Penalty BPNN | Penalty MLRL | Penalty GOLD | Energy BPNN | Energy MLRL | Energy GOLD |
|---|---|---|---|---|---|---|---|---|---|
| Barnes | 15.09 | 2.81 | 2.73 | 17.78 | 1.25 | 0.00 | 0.01 | 1.61 | 2.73 |
| Cholesky | 26.92 | 40.14 | 43.64 | 9.08 | 0.86 | 0.00 | 20.31 | 39.65 | 43.64 |
| FFT | 24.72 | 28.07 | 25.25 | 11.77 | 5.93 | 0.00 | 15.85 | 23.80 | 25.25 |
| FMM | 9.43 | 5.65 | 3.23 | 10.29 | 3.74 | 0.00 | 0.11 | 2.11 | 3.23 |
| LU | 14.85 | 14.63 | 11.98 | 13.79 | 3.96 | 0.00 | 3.10 | 11.24 | 11.98 |
| Ocean | 30.60 | 4.72 | 0.00 | 42.61 | 8.03 | 0.00 | −9.68 | −2.34 | 0.00 |
| Radiosity | 10.90 | 8.81 | 8.91 | 6.90 | 0.01 | 0.00 | 4.76 | 8.83 | 8.91 |
| Radix | 8.77 | 4.06 | 0.00 | 10.06 | 4.19 | 0.00 | −0.50 | −0.08 | 0.00 |
| Raytrace | 24.15 | 49.54 | 49.72 | 2.40 | 0.02 | 0.00 | 22.34 | 49.54 | 49.72 |
| Water-Nsq | 21.37 | 8.54 | 8.34 | 22.49 | 0.72 | 0.00 | 3.69 | 7.89 | 8.34 |
| Water-Sp | 18.63 | 7.60 | 7.48 | 18.72 | 0.59 | 0.00 | 2.85 | 7.07 | 7.48 |
| Average | 18.63 | 15.87 | 14.66 | 15.08 | 2.66 | 0.00 | 5.71 | 13.58 | 14.66 |

Fig. 5. Transient results of Cholesky with β = 0.9.

For some extreme cases (Barnes, FMM, Ocean, Radix), both policies achieve low or even negative energy saving. Because the cores are busy almost all the time, turning off cores cannot save energy but does lengthen the latency. Furthermore, some overheads are induced when the cores switch, such as filling the pipelines and binding (or ejecting) contexts to cores. In these cases, the performance penalty and energy saving of MLRL are still better than those of BPNN, because our policy with VDBE-softmax seldom erroneously turns off busy cores.

On the contrary, some programs (Cholesky, FFT, Raytrace) run serially for a long time, yielding lots of idle periods for most cores. As the transient diagrams of Cholesky in Figure 5 show, the fast convergence of MLRL at the beginning (0–40ms) results in higher energy saving, and the rapid reaction to environment changes (120–130ms) leads to a shorter latency penalty.

Fig. 6. The comparisons on performance penalty and energy saving of different techniques.

6.2.2. Analyses on the Enhancement Techniques. The performance penalty and energy saving with the different enhancement techniques described in Section 5 are shown in Figure 6 for analysis. Six policies are compared: the original reinforcement learning policy (RL), MLRL without any enhancement technique (MLRL[basic]), MLRL with GGAP-RBF but without VDBE-softmax (MLRL[RBF]), MLRL with all the enhancement techniques (MLRL[RBF+VDBE]), the golden reference (GOLD), and the BPNN policy (BPNN). In order to show the figures clearly, the three benchmarks (Cholesky, FFT, Raytrace) with much larger energy saving are shown at different scales.

The RL policy that directly encodes the power states results in the largest
performance penalty but the least energy saving due to its large state space (at least 2^n). For
MLRL[basic], the performance penalty is reduced but still large, because the state space
is still large on the root node; many samples are needed for n states and n actions, but
the agent seldom tries other actions using ε-greedy. The average performance penalty
for MLRL[basic] (16.14%) is comparable to BPNN (15.08%) with higher energy saving
(10.99% > 5.71%), demonstrating that the proposed multilevel paradigm is more effective
than directly applying function approximation as in BPNN.

Then the function approximation MLRL[RBF] greatly reduces the average performance penalty (16.14% → 4.98%), because the agent does not need to visit all the solutions with sufficient samples to select appropriate actions. The energy saving is similar to MLRL[basic] (10.99% → 11.31%) and close to the upper bound.

Finally, including the VDBE-softmax scheme further reduces the average performance penalty (4.98% → 2.66%) and saves more energy (11.31% → 13.58%). The performance penalty of MLRL[RBF+VDBE] is close to zero, while the energy saving is close to GOLD (14.66%). The difference (about 1% on average) is due to the nature of online learning, where the agent still needs to try some actions before convergence.

Table V. Simulations of SPLASH-2 Benchmarks with β = 0.9 and Three Cores (all values in %)

| Benchmark | Power BPNN | Power MLRL | Power GOLD | Penalty BPNN | Penalty MLRL | Penalty GOLD | Energy BPNN | Energy MLRL | Energy GOLD |
|---|---|---|---|---|---|---|---|---|---|
| Barnes | 8.11 | 2.76 | 1.43 | 8.76 | 1.57 | 0.00 | 0.06 | 1.23 | 1.43 |
| Cholesky | 17.77 | 33.79 | 34.86 | 5.09 | 1.01 | 0.00 | 13.59 | 33.12 | 34.86 |
| FFT | – | – | – | – | – | – | – | – | – |
| FMM | 7.57 | 4.25 | 1.49 | 8.18 | 3.16 | 0.00 | 0.01 | 1.23 | 1.49 |
| LU | 3.64 | 10.07 | 9.02 | 3.24 | 2.10 | 0.00 | 0.52 | 8.18 | 9.02 |
| Ocean | – | – | – | – | – | – | – | – | – |
| Radiosity | 6.61 | 6.10 | 6.18 | 4.87 | 0.02 | 0.00 | 2.05 | 6.13 | 6.18 |
| Radix | – | – | – | – | – | – | – | – | – |
| Raytrace | 22.19 | 40.04 | 40.06 | 2.72 | 0.22 | 0.00 | 20.08 | 39.90 | 40.06 |
| Water-Nsq | 9.16 | 6.32 | 6.06 | 5.22 | 0.50 | 0.00 | 4.42 | 5.85 | 6.06 |
| Water-Sp | 13.79 | 20.69 | 20.33 | 4.37 | 1.01 | 0.00 | 10.02 | 19.89 | 20.33 |
| Average | 11.11 | 15.50 | 14.93 | 5.31 | 1.19 | 0.00 | 6.35 | 14.44 | 14.93 |

Table VI. Simulations with Different Trade-Off Parameters (β) (all values in %)

| β | Power BPNN | Power MLRL | Penalty BPNN | Penalty MLRL | Energy BPNN | Energy MLRL |
|---|---|---|---|---|---|---|
| 0.1 | 40.43 | 37.47 | 98.21 | 63.78 | 9.79 | 10.73 |
| 0.3 | 33.64 | 29.69 | 27.57 | 35.76 | 8.29 | 13.65 |
| 0.5 | 19.69 | 21.29 | 19.62 | 12.85 | 5.60 | 13.08 |
| 0.7 | 19.26 | 17.72 | 15.93 | 4.27 | 7.03 | 13.94 |
| 0.9 | 18.63 | 15.87 | 15.08 | 2.66 | 5.71 | 13.58 |

6.2.3. Non-Power-of-Two Architectures. In order to examine whether the proposed policy
can be applied to general architectures, the ARM MPCore platform is equipped with
three cores. Similar to Table IV, the statistics with β = 0.9 are shown in Table V, where
the values of BPNN and MLRL are the improvement percentages compared with BASE.
Some benchmarks (FFT, Ocean, and Radix) are skipped here because they can only be
run on power-of-two architectures. For MLRL, the average energy saving (14.44%) is
very close to the upper bound (14.93%), while the average performance penalty is only
1.19%. MLRL still works effectively for three cores and outperforms BPNN in both
energy saving and performance penalty, showing the proposed policy is general and
applicable to multiprocessor systems with non-power-of-two cores.

Compared with four cores, the room for energy saving is smaller (GOLD: 17.00% →
14.93% without FFT, Ocean, and Radix) because the number of cores that can be turned
off is reduced by one when the application runs in serial segments (Water-Spatial is
an exception, where the program leaves more idle cores on the three-core system). As
a result, the power saving is lower for both policies (BPNN: 17.61% → 11.11%, MLRL:
17.22% → 15.50%). The performance penalty is also lower for both policies (BPNN:
12.68% → 5.31%, MLRL: 1.39% → 1.19%) because the solution spaces are reduced due
to smaller n, so the agents can more quickly find adequate solutions.

6.2.4. Analyses on Different Values of the Trade-Off Parameter. The simulation results of
varying the trade-off parameter β are shown in Table VI. All these numbers are
averaged across the 11 benchmarks and normalized to BASE.

For smaller β, the power saving is greater but the performance penalty is higher
for both policies, as expected. The energy saving is relatively stable with respect to β
because the energy truly needed to complete execution cannot be saved. The energy saving of
MLRL is greater than that of BPNN for any value of β, showing the superiority and stability of
our policy. Because the energy saving is stable while the performance penalty is lower,
higher β values are suitable for MLRL.

Fig. 7. The scalability in performance and energy consumption of different policies for SPLASH-2 benchmarks.

Varying the other two parameters μ and γ in Q-learning does not affect MLRL
significantly; hence those results are not shown. The parameters e_min, δ, λ, and κ in
GGAP-RBF are mutually related; they cannot be changed arbitrarily but must be set to the original values.

6.2.5. Analyses on the Scalability of Different Policies. In order to compare the scalability of the different policies, the architecture is scaled to more cores. Note that architectures with more than four cores were not real systems (for ARM Cortex A9) at the time this work was submitted; they are virtually created only to analyze the scalability of the proposed MLRL policy.

The scalability diagrams are depicted in Figure 7 for different benchmarks. The performance is normalized to uniprocessors, so the lines are the speedups of the different policies with varying numbers of cores. The energy consumption shown in the bar chart is normalized to uniprocessors as well. In order to show the speedup clearly, the scales of the normalized performance are different; the benchmarks Cholesky, FFT, and Raytrace, which contain long serial parts, result in poor performance scalability and large room for energy saving.

Table VII. Runtime Analysis

| Benchmark | BPNN (μs) | MLRL (μs) | Improvement |
|---|---|---|---|
| Barnes | 28.96 | 15.42 | 47% |
| Cholesky | 19.27 | 8.22 | 57% |
| FFT | 16.89 | 10.77 | 36% |
| FMM | 29.12 | 15.67 | 46% |
| LU | 23.24 | 11.73 | 50% |
| Ocean | 27.23 | 14.31 | 47% |
| Radiosity | 13.93 | 10.56 | 24% |
| Radix | 29.34 | 8.80 | 70% |
| Raytrace | 23.00 | 5.68 | 75% |
| Water-Nsq | 26.10 | 9.73 | 63% |
| Water-Sp | 27.68 | 8.87 | 68% |
| Average | 24.07 | 10.89 | 53% |

In general, MLRL is scalable in performance speedup and energy saving,
outperforming BPNN for eight or sixteen cores. Since the number of epochs decreases while the
searching space becomes larger for more cores, the performance and energy gaps with
respect to GOLD are accordingly widened for both MLRL and BPNN. For Cholesky,
which has the poorest scalability [Woo et al. 1995], the speedup goes downward for
eight and sixteen cores even for GOLD, due to the resource (cache) contention problem
in our target embedded architectures (the memory subsystem in Woo et al. [1995] is
assumed perfect). The performance speedup values of MLRL decrease for FFT and
Raytrace because the trade-off parameter β in (2) is constant; when the performance
speedup faces diminishing returns while the energy consumption becomes larger for
more cores, MLRL chooses to turn off more cores to save energy.

**6.3. Comparisons of Policy Overhead**

Finally, the runtime overheads of executing the policies are evaluated and compared, while the transition overheads between power modes are implicitly included in the previous performance analysis. Because our instruction-set simulator Multi2Sim [Ubal et al. 2007] does not contain a full operating system, the policy runtime is measured separately using the statistics from previous sections as inputs. The runtime analysis is conducted on an Intel Xeon CPU E5420 running at 2.5 GHz using the gettimeofday function.

6.3.1. Overhead Comparisons. The runtime comparisons of the different policies with
β = 0.9 are shown in Table VII. For all of the benchmarks, MLRL is faster than BPNN, with
53% improvement on average, showing the efficiency of the proposed policy. Note that
both policies are fast enough (on the order of microseconds) for four cores.

Among the different benchmarks, the runtime variation of MLRL mainly comes
from the (dynamic) number of hidden nodes in the GGAP-RBF architecture. Taking
Cholesky as an example, the input state s̄ remains 1/n for a long time at the beginning
due to the serial segment of the program, so the GGAP-RBF network contains only a
few hidden nodes. The same situation occurs for FFT and Raytrace, while the
GGAP-RBF network is more complex for Barnes, FMM, and Ocean, where the parallelism of
the workloads varies rapidly.

Fig. 8. The runtime of the proposed policy with varying numbers of cores for different benchmarks.

6.3.2. Analysis on Scalability. Then the number of cores is scaled up to examine the complexity of the proposed policy. The number of cores and the parallelism of the benchmarks are set to two, four, eight, and sixteen.

The runtime of MLRL is shown in Figure 8 with differing numbers of cores n for the
eleven SPLASH-2 benchmarks. Although the runtime is affected by the O(Kn) time for
updating the GGAP-RBF network, the number of hidden nodes K saturates due to
the constant desired accuracy e_min and minimum distance δ_min in the normalized
space; the resulting number of hidden nodes is around 6–8 on average, regardless of the
number of cores n. Therefore, the overall runtime is dominated by O(n lg n), as analyzed
in the previous sections.

**7. RELATED WORKS**

Several power management policies have been proposed in the past decades [Benini et al. 2000]; they can be classified into four categories, namely timeout, predictive, stochastic, and learning based. The timeout policies [Karlin et al. 1994; Golding et al. 1996] are simple, but the waiting time wastes power. The predictive [Hwang and Wu 2000; Augustine et al. 2008] and stochastic policies [Qiu et al. 2007; Jung and Pedram 2009] improve on this drawback using heuristic- and model-based approaches, respectively; however, these policies either require some preknowledge or assume task models. Machine learning techniques that consider the variability of working environments have been widely applied recently [Dhiman and Simunic Rosing 2009; Tan et al. 2009] and outperform previous approaches.

The aforesaid policies designed for uniprocessor systems can be applied to multiprocessors using either chip-wide DPM (treating the whole multiprocessor system as one core) [Kveton et al. 2007] or distributed power management (treating each core as an independent uniprocessor) [Shen et al. 2013]. Both are less effective than centralized policies, because the former approach cannot switch at per-core granularity while the latter lacks a global view. In Madan et al. [2011], two static heuristics are proposed for datacenters, but the power manager is turned off when it encounters problems. In Mariani et al. [2011], an application-specific framework is proposed with design-time characterization of applications.

For multiprocessors, other types of policies exploit Dynamic Voltage and Frequency Scaling (DVFS) [Maggio et al. 2012], such as MaxBIPS [Isci et al. 2006], Steepest Drop [Winter et al. 2010], supervised learning [Jung and Pedram 2010], and hierarchical gradient ascent [Sartori and Kumar 2009]. The Thread Motion (TM) technique [Rangan et al. 2009] also exploits MaxBIPS and migrates applications between cores. The main considerations of DVFS are transient power and thermal issues, while the focus of DPM is overall energy saving. Moreover, the transition overheads that are ignored in most DVFS policies should be taken into consideration in DPM. Because DPM and DVFS outperform each other in different situations, they can be designed separately and then combined to provide comprehensive power management by static characterization [Srivastav et al. 2012], by utilizing different states and rewards for the learning agent [Shen et al. 2013], or by policy (expert) selection [Dhiman and Simunic Rosing 2009; Bhatti et al. 2010].

**8. CONCLUSION**

In this article, we propose a learning-based power management policy called MLRL for multiprocessors. Using the multilevel paradigm, the time complexity is reduced to O(n lg n) and the searching space becomes linearly proportional to the number of cores. Moreover, the scalability is further enhanced using function approximation by the GGAP-RBF network, and the convergence quality is raised by the VDBE-softmax action selection technique. The hierarchical approach generalizes to multiprocessor systems with non-power-of-two cores.

The simulations are conducted on an instruction-set simulator using the SPLASH-2
benchmarks. The results show that MLRL requires a shorter runtime and outperforms
the state-of-the-art policy: MLRL runs 53% faster and achieves 13.6% energy saving
with only 2.7% performance penalty on average. The energy saving of MLRL is close to
that of the oracle policy, with only 1.1% difference on average. The effects of the enhancement
techniques are evaluated as well. In addition, the generality and scalability of the
proposed policy are examined on architectures with two to sixteen cores.

In the future, MLRL can be extended to DVFS or even single-ISA heterogeneous
systems by generalizing the definitions of state and action in the multilevel framework
such that the number of active cores becomes the aggregate computational capability.
The heterogeneity of and correlation between different contexts can be further taken
into consideration, where the power manager can be combined with the task scheduler
to provide a comprehensive policy for complicated workloads. Moreover, the power
consumption or performance penalty can be controlled by adding a controller on the
value of β.

**REFERENCES**

ACPI. 2011. ACPI - Advanced configuration and power interface specification. http://www.acpi.info/. AMD. 2013. AMD powernow! technology.

http://www.amd.com/us/products/technologies/amd-powernow-technology/Pages/amd-powernow-technology.aspx.

ARM. 2005. ARM intelligent energy controller technical overview. http://www.arm.com/.

ARM. 2012. Cortex-a9 mpcore technical reference manual. http://infocenter.arm.com/help/topic/com.arm.doc. ddi0407i/DDI0407I cortex a9 mpcore r4p1 trm.pdf.

*John Augustine, Sandy Irani, and Chaitanya Swamy. 2008. Optimal power-down strategies. SIAM J. Comput.*
37, 5, 1499–1516.

Andrew G. Barto and Sridhar Mahadevan. 2003. Recent advances in hierarchical reinforcement learning.

*Discr. Event Dynam. Syst. 13, 1–2, 41–77.*

Luca Benini, Alessandro Bogliolo, and Giovanni De Micheli. 2000. A survey of design techniques for
*system-level dynamic power management. IEEE Trans. VLSI Syst. 8, 3, 299–316.*

Khurram Bhatti, Cecile Belleudy, and Michel Auguin. 2010. Power management in real time embedded
*systems through online and adaptive interplay of dpm and dvfs policies. In Proceedings of the IEEE/IFIP*

*International Conference on Embedded and Ubiquitous Computing.*

Gaurav Dhiman and Tajana Simunic Rosing. 2009. System-level power management using online learning.

*IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. 28, 5, 676–689.*

Richard Golding, Peter Bosch, Carl Staelin, Tim Sullivan, and John Wilkes. 1996. Idleness is not sloth. Tech. rep. Hewlett-Packard Laboratories, Palo Alto, CA.

Jim Held, Jerry Bautista, and Sean Koehi. 2006. From a few cores to many: A tera-scale computing research overview. Tech. rep., Intel Corporation.

Guang-Bin Huang, P. Saratchandran, and Narasimhan Sundararajan. 2005. A generalized growing and
*pruning rbf (ggap-rbf) neural network for function approximation. IEEE Trans. Neural Netw. 16, 1,*
57–67.

Chi-Hong Hwang and Allen C.-H. Wu. 2000. A predictive system shutdown method for energy saving of
*event-driven computation. ACM Trans. Des. Autom. Electron. Syst. 5, 2, 226–241.*

Intel. 2013. Enhanced intel speedstep technology. http://www3.intel.com/cd/channel/reseller/asmo-na/eng/ 203838.htm.

Canturk Isci, Alper Buyuktosunoglu, Chen-Yong Cher, Pradip Bose, and Margaret Martonosi. 2006. An
analysis of efficient multi-core global power management policies: Maximizing performance for a given
*power budget. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture.*
Niraj K. Jha. 2001. Low power system scheduling and synthesis. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design.

Hwisung Jung and Massoud Pedram. 2009. Uncertainty-aware dynamic power management in partially
*observable domains. IEEE Trans. VLSI Syst. 17, 7, 929–942.*

Hwisung Jung and Massoud Pedram. 2010. Supervised learning based power management for multicore
*processors. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. 29, 9, 1395–1408.*

Anna R. Karlin, Mark S. Manasse, Lyle A. Mcgeoch, and Susan Owicki. 1994. Competitive randomized
*algorithms for non-uniform problems. Algorithmica 11, 6, 542–571.*

George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekhar. 1999. Multilevel hypergraph partitioning: Application in VLSI domain. IEEE Trans. VLSI Syst. 7, 1, 69–79.

Wonyoung Kim, Meeta S. Gupta, Gu-Yeon Wei, and David Brooks. 2008. System level analysis of fast, per-core DVFS using on-chip switching regulators. In Proceedings of the 14th IEEE International Symposium on High Performance Computer Architecture. 123–134.

*Matthias Knoth. 2009. Power management in an embedded multiprocessor cluster. In Proceedings of the*

*Embedded World Conference.*

Branislav Kveton, Prashant Gandhi, Georgios Theocharous, Shie Mannor, Barbara Rosario, and Nilesh Shah. 2007. Adaptive timeout policies for fast fine-grained power management. In Proceedings of the 19th National Conference on Innovative Applications of Artificial Intelligence.

Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture.
Niti Madan, Alper Buyuktosunoglu, Pradip Bose, and Murali Annavaram. 2011. A case for guarded power

*gating for multi-core processors. In Proceedings of the International Symposium on High-Performance*

*Computer Architecture.*

Martina Maggio, Henry Hoffmann, Alessandro V. Papadopoulos, Jacopo Panerati, Marco D. Santambrogio, Anant Agarwal, and Alberto Leva. 2012. Comparison of decision-making strategies for self-optimization in autonomic computing systems. ACM Trans. Auton. Adapt. Syst. 7, 4, 36:1–36:32.

Giovanni Mariani, Gianluca Palermo, Cristina Silvano, and Vittorio Zaccaria. 2011. ARTE: An application-specific run-time management framework for multi-core systems. In Proceedings of the IEEE Symposium on Application Specific Processors.

Massoud Pedram. 1996. Power minimization in IC design: Principles and applications. ACM Trans. Des. Autom. Electron. Syst. 1, 1, 3–56.

John Platt. 1991. A resource-allocating network for function interpolation. Neural Comput. 3, 2, 213–225.

Qinru Qiu, Ying Tan, and Qing Wu. 2007. Stochastic modeling and optimization for robust power management in a partially observable system. In Proceedings of the Design, Automation, and Test in Europe Conference. 779–784.

Krishna K. Rangan, Gu-Yeon Wei, and David Brooks. 2009. Thread motion: Fine-grained power management for multi-core systems. In Proceedings of the International Symposium on Computer Architecture.

John Sartori and Rakesh Kumar. 2009. Distributed peak power management for many-core architectures. In Proceedings of the Design, Automation, and Test in Europe Conference.

Joseph Sharkey, Alper Buyuktosunoglu, and Pradip Bose. 2007. Evaluating design tradeoffs in on-chip power management for CMPs. In Proceedings of the International Symposium on Low Power Electronics and Design.

Hao Shen, Ying Tan, Jun Lu, Qing Wu, and Qinru Qiu. 2013. Achieving autonomous power management using reinforcement learning. ACM Trans. Des. Autom. Electron. Syst. 18, 2.

Meeta Srivastav, Michael B. Henry, and Leyla Nazhandali. 2012. Design of energy-efficient, adaptable throughput systems at near/sub-threshold voltage. ACM Trans. Des. Autom. Electron. Syst. 18, 1.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. The MIT Press.

Ying Tan, Wei Liu, and Qinru Qiu. 2009. Adaptive power management using reinforcement learning. In Proceedings of the International Conference on Computer Aided Design.

Chen Khong Tham. 1994. Modular on-line function approximation for scaling up reinforcement learning. Ph.D. dissertation, Jesus College, Cambridge, UK.

Michel Tokic and Gunther Palm. 2011. Value-difference based exploration: Adaptive control between epsilon-greedy and softmax. In Proceedings of the 34th Annual German Conference on Advances in Artificial Intelligence.

Rafael Ubal, Julio Sahuquillo, Salvador Petit, and Pedro Lopez. 2007. Multi2Sim: A simulation framework to evaluate multicore-multithreaded processors. In Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing.

Jonathan A. Winter, David H. Albonesi, and Christine A. Shoemaker. 2010. Scalable thread scheduling and global power management for heterogeneous many-core architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques.

Steven C. Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder P. Singh, and Anoop Gupta. 1995. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture.

Yue Wu, Hui Wang, Biaobiao Zhang, and Ke-Lin Du. 2012. Using radial basis function networks for function approximation and classification. ISRN Appl. Math. 2012.

Rong Ye and Qiang Xu. 2012. Learning-based power management for multi-core processors via idle period manipulation. In Proceedings of the Asia and South Pacific Design Automation Conference.