
Chung Hua University Master's Thesis

Title: Power Consumption Optimization of MPI Programs on Multi-Core Clusters

Department: Master Program, Department of Computer Science and Information Engineering
Student ID / Name: M09602048, Yen-Jun Chen
Advisor: Dr. Ching-Hsien Hsu

July 2009


Power Consumption Optimization of MPI Programs on Multi-Core Clusters

By

Yen-Jun Chen

Advisor: Prof. Ching-Hsien Hsu

Department of Computer Science and Information Engineering

Chung Hua University, Hsinchu 30067, Taiwan

June 2009


Abstract (Chinese)

As the energy crisis and environmental pollution have become issues of common concern worldwide, research related to energy consumption has also entered the field of computer science. In the current generation, high-speed CPU architectures, including multi-core CPUs, provide more cycles for computation, but they also require more efficient power management. At the same time, clusters of symmetric multiprocessor (SMP) systems and multi-core CPUs deliver higher execution performance within a single computer, at the cost of more needless energy consumption under light workloads.

In a cluster environment, parallel programs must exchange data between nodes at run time. The network architectures in widespread use today are mainly 10/100Mbit Fast Ethernet and 1Gbit Gigabit Ethernet; although slower, they are much cheaper than InfiniBand or 10G Ethernet. Two problems arise during data exchange in a multi-core cluster. First, when data are transferred between two nodes over the network, the packet latency is considerably higher than that of the multi-core CPU's internal bus, which means that large numbers of packets waiting to be sent are blocked in the cache. Second, if any core of a multi-core CPU is waiting to receive data, the unpredictable waiting time raises the CPU's load. Both situations consume extra energy without contributing to performance. In this thesis, we propose a new approach that relieves bandwidth congestion on commodity network architectures while controlling energy consumption. The approach incorporates hardware power-saving technology, keeps the data transmission time roughly unchanged, and in the general case completes the work with less energy than the baseline and the cases described above.

Keywords: power consumption, multi-core processor, cluster computing, MPI


Abstract

As the energy crisis and environmental pollution have become important global issues, research on power consumption has been brought into the world of computer science. In this generation, high-speed CPU structures, including multi-core CPUs, provide more computational cycles, yet the system needs to manage power more efficiently. Clusters of SMPs and multi-core CPUs are designed to bring more computational cycles to a single computing platform, but unavoidable extra energy consumption is incurred under light workloads.

Data exchange among nodes is essential during the execution of parallel applications in cluster environments. The networking technologies in popular use are Fast Ethernet and Gigabit Ethernet, which are much cheaper, though much slower, than InfiniBand or 10G Ethernet. Two problems with data exchange among nodes arise in multi-core CPU cluster environments. The first is that when data are sent between two nodes, the network latency is much higher than that of the system bus inside a multi-core CPU, and thus data waiting to be sent are blocked in the cache. The second is that if a core stays in a waiting state, the unpredictable waiting time raises the core's load. Both situations consume extra power without contributing to overall speed. In this thesis, we present a novel approach that tackles the congestion problem while taking energy into consideration in general network environments: by combining a hardware power-saving function, it keeps the transmission time nearly unchanged while saving more energy than the general and previously studied cases.

Keywords: Power consumption, multi-core processor, cluster computing, MPI


Acknowledgements

My deepest gratitude goes first to the Lord God Almighty: he made me, approved this dream of studying at graduate school, gave me the ability to write this thesis, guarded my heart along the way, and accompanied me through these two years. Without his amazing grace, I could not have completed this dream.

Prof. Ching-Hsien Hsu is my adviser and best consultant. During this period of thesis research, his suggestions pushed the study forward; he examined the detail of every word in the thesis and pointed out the mistakes in my logic. Beyond the research itself, I am grateful for Prof. Hsu's care for my life, and his open-minded attitude gave me a sky to fly in.

My wife, Sappho Hsieh, is the best supporter in my life. She made the hard decision to marry me when I had no work and no income, pushed me to complete the research, held the home together when I had no time to care for it, and crossed many hard times with me. She is my treasure, my always love, who protects my heart.

Finally, I deeply appreciate Prof. Kuan-Ching Li, who worked overnight with me to correct the English grammar of this thesis, and gave me important suggestions about its structure and directions of thought on several related issues.


Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
List of Figures
List of Tables

Chapter 1. Introduction
    1.1. Motivation
    1.2. Contribution
    1.3. Thesis Organization

Chapter 2. Related Work

Chapter 3. Challenges of Power Saving in Multi-Core Based Cluster
    3.1. CPU Power Control Structure
    3.2. Network Bandwidth and Cache Structure
        3.2.1. Advantages
        3.2.2. Problems
    3.3. MPI Environment Support

Chapter 4. The Proposed Approach
    4.1. Transmission Speed Reduction
    4.2. Core Frequency Reduction
    4.3. Loading-Aware Dispatching
        Loading-Aware Dispatching (LAD) Algorithm

Chapter 5. Performance Evaluation and Analysis
    5.1. Experiment Setup
        5.1.1. Data frame size
        5.1.2. CPU frequency and packet delay
        5.1.3. Rank number
    5.2. Experiment Results
        5.2.1. One byte frame
        5.2.2. 1460 byte frame
        5.2.3. 8000 byte frame
        5.2.4. Load balance effect
        5.2.5. Remarks

Chapter 6. Conclusions and Future Work

References


List of Figures

Figure 1-1: Intel Quad-Core CPU system structure

Figure 1-2: AMD Quad-Core CPU system structure

Figure 1-3: Multi-core based cluster structure

Figure 2-1: General NoC system structure

Figure 4-1: LAD Algorithm structure diagram

Figure 5-1: Test environment

Figure 5-2: Time effect of TD on TT (Frame = 1 Byte)

Figure 5-3: Power effect of TD on PC (Frame = 1 Byte)

Figure 5-4: Time effect of TD on TT (Frame = 1460 Byte)

Figure 5-5: Power effect of TD on PC (Frame = 1460 Byte)

Figure 5-6: Time effect of TD on TT (Frame = 8000 Byte)

Figure 5-7: Power effect of TD on PC (Frame = 8000 Byte)

Figure 5-8: Average core loading of OnDemand mode and LAD algorithm


List of Tables

Table 1-1: InfiniBand transmission mode list

Table 5-1: Host specification

Table 5-2: Detail results of time effect of TD on TT (Frame = 1 Byte)

Table 5-3: Detail results of power effect of TD on PC (Frame = 1 Byte)

Table 5-4: Detail results of time effect of TD on TT (Frame = 1460 Byte)

Table 5-5: Detail results of power effect of TD on PC (Frame = 1460 Byte)

Table 5-6: Detail results of time effect of TD on TT (Frame = 8000 Byte)

Table 5-7: Detail results of power effect of TD on PC (Frame = 8000 Byte)


Chapter 1.

Introduction

1.1. Motivation

Reducing the power consumption of computer systems has become a hot issue recently, since many CPUs and other computer hardware are produced and operating everywhere. As single-core CPUs have reached the physical limits of current semiconductor technology, computing performance has met a bottleneck. Multi-core CPUs have become a simple yet efficient way to increase performance and speed, since they bring the SMP concept into a single chip, in effect making up a small cluster that executes inside one host. Additionally, they reduce the amount of context switching seen on single-core CPUs, which straightforwardly increases overall performance. Some CPU technologies and our target platform are introduced below.

Figure 1-1: Intel Quad-Core CPU system structure [11]

Figure 1-1 illustrates the architecture of the Intel quad-core CPU, which looks like a combination of two dual-core CPUs. It has four individual execution engines, where each pair of cores shares one set of L2 cache and one system bus interface and connects to the fixed system bus. The advantages of this architecture are twofold: first, a core can fully utilize the L2 cache when it needs more memory; second, each core accesses the L2 cache through an individual hub [7], simplifying the system bus and cache memory structures. Intel CPUs provide the "SpeedStep" [3] technology, which helps control the CPU frequency and voltage, but it must change all cores' frequency at the same time.

Figure 1-2: AMD Quad-Core CPU system structure [12]

The AMD quad-core CPU, as shown in Figure 1-2, has an individual L2 cache in each core and a shared L3 cache (a special design), and integrates the DDR2 memory controller into the CPU, helping to increase memory access speed. Each core has an individual channel for accessing the system bus, and reaches the L3 cache and peripheral chips through a crossbar switch. AMD provides the "PowerNow!" [4] technology to adjust each core's working frequency and voltage individually.

A cluster platform is built by interconnecting a number of single-CPU hosts, and a message passing library such as MPI is needed for data exchange among computing nodes in this distributed computing environment. In addition, a high speed network such as InfiniBand is commonly used to interconnect the computing nodes. As multi-core CPUs are introduced into cluster environments, the architecture of such a cluster is as presented in Figure 1-3. The main advantage is that data exchange between cores inside a CPU is much faster than passing through a network and the South/North bridge chips.


Figure 1-3: Multi-core based cluster structure [13]

Developed since 1999, InfiniBand [16] is a point-to-point architecture whose original design focused on supporting high-performance computing, so a bidirectional serial interface, a failover mechanism, and scalability are necessary functions. InfiniBand supports at least 2.5Gbit/s of bandwidth in each direction in single data rate (SDR) mode; the transmitted signal carries 2Gbit/s of useful data and 0.5Gbit/s of control overhead. Besides, InfiniBand supports DDR (Double Data Rate) and QDR (Quad Data Rate) transmission modes, and each mode supports three link-width configurations (1x, 4x and 12x), so the maximum bandwidth is 96Gbit/s. The detailed specification is given in Table 1-1.

Table 1-1: InfiniBand transmission mode list

    Link width   Single (SDR)   Double (DDR)   Quad (QDR)
    1X           2 Gbit/s       4 Gbit/s       8 Gbit/s
    4X           8 Gbit/s       16 Gbit/s      32 Gbit/s
    12X          24 Gbit/s      48 Gbit/s      96 Gbit/s

InfiniBand is a good and fast enough solution for connecting all computing nodes of a cluster platform, but it is expensive. Gigabit Ethernet is a cheaper solution and widely deployed in general network environments, though its lower transmission speed definitely drags down data exchange performance, and sending data to a core inside a different host consumes extra energy while waiting for the data.


"SpeedStep" and "PowerNow!" are good solutions for reducing power consumption, since they adjust the CPU's frequency and voltage dynamically to save energy. The power consumption can be calculated by the function:

    P = IV = V²f = J/s                               (1)

where P is power in watts, V is the voltage, I is the current, f is the working frequency of the CPU, J is the joule and s is time in seconds. It means that a lower voltage under the same current saves more energy. How and when to reduce voltage and frequency become important issues, since one of the main targets of cluster computing is to increase performance, while slowing down the CPU frequency conflicts with performance. Considering the data latency of the network and the CPU load under current CPU technologies, we would like to create a low-energy cluster platform based on a general network architecture that keeps almost the same data transmission time yet consumes less energy than running the CPU at full speed.
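To make the scaling in Equation (1) concrete, here is a worked comparison (our illustration, reading the formula as stated): halving the frequency alone halves the power, while halving the voltage alone quarters it, which is why voltage reduction is the more valuable lever.

    P(V, f)   = V²f
    P(V, f/2) = V²f / 2        (half the power)
    P(V/2, f) = V²f / 4        (one quarter of the power)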

1.2. Contribution

To address the above questions, we use OpenMPI and multi-core CPUs to build a Linux-based, low-energy cluster and implement three solutions in this environment.

 CPU power consumption reduction

Drive the CPU power saving technology to reduce the working frequency under low load; this method cuts needless power consumption.

 CPU internal bus congestion reduction

Add a waiting time between data frames before they are sent out; this method slows down the data transmission speed and reduces core load.

 Loading-Aware Dispatching (LAD) Algorithm

The lowest-load core is given the highest priority to receive data frames; this method increases core working efficiency.


1.3. Thesis Organization

The remainder of this thesis is organized as follows. Related work is discussed in Chapter 2, and the challenges we found in our experiments are listed in Chapter 3. In Chapter 4, the proposed approach to reducing energy consumption is presented, while the testing environment and performance results are given in Chapter 5. Finally, conclusions and future work are discussed in Chapter 6.


Chapter 2.

Related Work

Based on the concept of reducing computing time, the job scheduling methodologies introduced in [8] and [21] were designed to complete data transmission faster. Alternatively, [13] studied adjusting the cache block size to find the fastest speed for transmitting data with MPI between MPI nodes, and a similar method using OpenMP was observed in [14]. Another investigation [10] focused on a compiler that analyzes a program's semantics and inserts special hardware control commands that automatically adjust a simulation board's working frequency and voltage; that research combines both hardware and software resources.

Several papers designed their methodologies or solutions on a simulation board, also called an NoC system, whose general structure is shown in Figure 2-1.

Figure 2-1: General NoC system structure

Based on a simulation board, researchers have designed routing-path algorithms that try to find the shortest path for transmitting data in Networks-on-Chip [15], in order to reduce the data transmission time between CPUs; such methods also have the opportunity to be realistically ported to a cluster environment.


Other researchers have applied Genetic Algorithms to improve a power saving methodology dynamically and continuously [9]. Through a software-based methodology, routing paths are modified while link speed and working voltage are monitored and adjusted at the same time to reduce the power consumption of the whole simulation board, although the voltage detection information requires hardware support.

Considering the higher power density and thermal hotspots that occur in NoCs, [18] provided a compiler-based approach to balance the processor workload: the researchers partition an NoC system into several areas and dispatch jobs to them by node remapping. The strategy reduces the chance of thermal concentration at run time and brings a small performance increase. [20] and [24] studied the same point of thermal control.

Modern operating systems such as Linux and Windows provide hardware power saving functions, as introduced in [1] and [2]: they drive "SpeedStep" [3] and "PowerNow!" [4] through special drivers to control the CPU voltage and frequency. Hardware support is of course necessary; depending on the CPU load, a lower frequency and voltage are selected automatically. Besides, others add a management system into the OS kernel to control energy consumption directly [22].

Among the peripheral devices of a computer, the disk subsystem consumes a large amount of energy, and [17] studied how to implement disk energy optimization in a compiler. The researchers considered disk start time, idle time, spindle speed, the disk access frequency of the program, and the CPU/core count of each host, and built a benchmark system and a real test environment to verify the results.

Some groups study the implementation of power saving strategies in data centers, such as database or search engine servers, since this kind of application wastes huge amounts of energy when it has no work. A more recent study [19] provides a hardware-based solution that detects idle-time power waste and designs a novel power supply operation method; the approach was applied in enterprise-scale commercial deployments and saved 74% of power consumption.


Besides, some researchers have studied OS resource management, power consumption evaluation, and task scheduling methods, as in [23] and [25]; this kind of study provides a direction for tuning computer operation and reducing the system idle time caused by resource waiting.


Chapter 3.

Challenges of Power Saving in Multi-Core Based Cluster

In earlier single-core CPU based cluster environments, data distribution with CPU energy control was easier to implement through isolated CPU frequency control on each host. In a multi-core based cluster, the CPU internal bus architecture, bandwidth, and power control structure bring different challenges to this issue. When we built a cluster platform combining the key technologies listed in Chapter 1 for experimental purposes, their advantages brought higher data transmission speed, yet only between cores inside a CPU; a CPU core kept at high load means the CPU speed cannot be decreased. Analysis and reasoning on these situations are discussed next.

3.1. CPU Power Control Structure

The "SpeedStep" and "PowerNow!" units are not shown in Figures 1-1 and 1-2. "SpeedStep" provides only whole-CPU frequency and voltage adjustment. This design makes power control easier, though it consumes extra energy: if only one core works at high load, the power control mechanism can neither reduce the other cores' frequency and voltage nor drop the busy core's performance.

This inefficient energy consumption raises the temperature, since a low-load core generates the same heat as a high-load one and drives the CPU temperature up at the same time.

AMD's "PowerNow!" shows an advantage on this issue, since we can reduce the frequency of a core working at low load without considering the other cores' situation; heat reduction is another benefit.

3.2. Network Bandwidth and Cache Structure


As described for Figure 1-1, Intel's CPU architecture shares the L2 cache among cores through individual hubs; all packets between a core and the cache must pass through its hub. The architecture has two advantages and two problems:

3.2.1. Advantages

 Flexible cache allocation

Every core is allowed to use the whole L2 cache through the cache hub; the hub provides a single memory access channel for each core and assigns cache space to the requesting core. This method simplifies the internal cache access structure.

 Decreased cache miss rate

When a core suddenly issues massive cache requests, flexible cache allocation provides larger space to hold data frames and decreases page swapping from main memory at the same time.

3.2.2. Problems

 Cache Hub Congestion

If a huge number of data requests or send commands arrive suddenly, the individual cache hub blocks data frames in cache memory or holds commands in its queue. All cores and the hub stay busy and thus consume extra energy.

 Network Bandwidth Condition

Lower network bandwidth makes the previous situation more serious in a cluster with many nodes, since the network cannot be as fast as the internal CPU bus; when cross-node data frames appear, the delivery time is longer than for intra-node data exchange.

Compared with Intel, when a flood of data frames is sent to the CPU, the AMD structure does not have enough cache to hold them, yet each core's individual bus and memory access channel provides isolated bandwidth, and the L2 cache built into each core reduces data-flow interference. Each CPU structure has its advantages, and the weaknesses appear when the two are compared with each other.


3.3. MPI Environment Support

In the general situation, each computing node is executed on a core or host randomly chosen by the cluster software, which means the programmer cannot obtain the core's load from the node's code section. For our purpose, finding system information about the thread/node location works, but it is a hard method, since the program would spend a large amount of time on device I/O, including opening system state files, analyzing the information, and obtaining the node's location.

An easier alternative is to make the cluster platform fix each node's location to an indicated core or host; this function helps obtain the core loading from the node's code. OpenMPI is selected for this reason.
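As an illustration of the "open system state file" approach mentioned above, the following C sketch (our illustration, not code from the thesis) samples /proc/stat twice and derives each core's utilization from the jiffy counters; the field layout follows the Linux 2.6 kernels used in Chapter 5.

    #include <ctype.h>
    #include <stdio.h>
    #include <unistd.h>

    #define CORES 4

    static void read_jiffies(unsigned long long busy[], unsigned long long total[])
    {
        char line[256];
        FILE *f = fopen("/proc/stat", "r");
        if (!f) return;
        while (fgets(line, sizeof line, f)) {
            int cpu;
            unsigned long long u, n, s, idle, io, irq, sirq;
            /* per-core lines look like: "cpu0 user nice system idle iowait irq softirq ..." */
            if (line[0] == 'c' && line[1] == 'p' && line[2] == 'u' &&
                isdigit((unsigned char)line[3]) &&
                sscanf(line + 3, "%d %llu %llu %llu %llu %llu %llu %llu",
                       &cpu, &u, &n, &s, &idle, &io, &irq, &sirq) == 8 &&
                cpu >= 0 && cpu < CORES) {
                busy[cpu]  = u + n + s + irq + sirq;
                total[cpu] = busy[cpu] + idle + io;
            }
        }
        fclose(f);
    }

    int main(void)
    {
        unsigned long long b0[CORES], t0[CORES], b1[CORES], t1[CORES];
        read_jiffies(b0, t0);
        usleep(100000);                      /* 100 ms sampling window */
        read_jiffies(b1, t1);
        for (int i = 0; i < CORES; i++)      /* load = busy delta / total delta */
            printf("core %d load: %.1f%%\n", i,
                   100.0 * (b1[i] - b0[i]) / (double)(t1[i] - t0[i]));
        return 0;
    }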


Chapter 4.

The Proposed Approach

Given the CPU specifications, the CPU power control interface, and the network structure, we provide a novel data dispatching strategy to solve the challenges described in Chapter 3. It combines data flow limitation and core frequency control, and transmits data frames according to core working load; the detailed operation is described below.

4.1. Transmission Speed Reduction

Slowing the transmission down does not sound like a good way to keep performance. In fact, we add a 1µs delay between every two packets in a real environment, and the added transmission time follows:

    T = N × D                                        (2)

where T is the added total time, N is the total number of packets, and D is the delay between packets. We found that the total time increased by only one to four seconds on average when 100K data frames were transmitted across two hosts connected via Gigabit Ethernet. Additionally, the load of the central node that sends data to the other nodes decreased by almost 50%. On the other hand, the load of the data-receiving cores decreased by 15% on average when we added a 10µs delay on those nodes; following Function 2, the added delay should amount to 1s, yet in the experiment the total transmission time increased by less than 0.5s. These results mean that the core load is driven up by the massive flow of data frames, not by a CPU-bound process.

This method reduces core load and helps the following method operate, as sketched below.
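A minimal sketch of this throttled sender, assuming the 100K-frame, 1460-byte workload of Chapter 5; the tag, buffer contents, and the DELAY_US value are illustrative only, not the thesis's exact test code.

    #include <mpi.h>
    #include <string.h>
    #include <unistd.h>

    #define FRAMES     100000   /* 100K data frames, as in the experiments */
    #define FRAME_SIZE 1460     /* bytes per frame                         */
    #define DELAY_US   10       /* D in Equation (2), in microseconds      */

    int main(int argc, char **argv)
    {
        int rank;
        char frame[FRAME_SIZE];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(frame, 0, sizeof frame);
        if (rank == 0) {
            for (int i = 0; i < FRAMES; i++) {
                MPI_Send(frame, FRAME_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                usleep(DELAY_US);        /* throttle: adds T = N * D total time */
            }
        } else if (rank == 1) {
            for (int i = 0; i < FRAMES; i++)
                MPI_Recv(frame, FRAME_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }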

4.2. Core Frequency Reduction

Although the challenge presented in Section 3.1 exists, for the power saving issue we use an AMD system and "PowerNow!" to slow down the frequency of low-load cores. The given CPU supports two frequency steps, which work at different voltages and currents; thus we focus on frequency adjustment and calculate the power consumption of each core as:

    P = Vmax × Imax × T                              (3)

where Vmax and Imax are taken from the AMD CPU technical specification [6], and T is the program execution time. Since time enters the function, the unit of P is the joule.
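For illustration, Equation (3) can be evaluated directly; the voltage and current below are placeholders, not the datasheet figures from [6], while the time is the rank-8 PM-mode TT from Table 5-2.

    /* Worked illustration of Equation (3): E = Vmax * Imax * T.
       Vmax and Imax are PLACEHOLDER values; substitute the real
       numbers from the AMD datasheet [6] when reproducing. */
    #include <stdio.h>

    int main(void)
    {
        const double v_max = 1.30;   /* volts   (placeholder)                  */
        const double i_max = 60.0;   /* amperes (placeholder)                  */
        const double t     = 18.63;  /* seconds, e.g. the TT from Table 5-2    */
        printf("energy = %.1f J\n", v_max * i_max * t);
        return 0;
    }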

CPUFreq is a CPU frequency control utility; it provides simple commands to change the CPU work state and four default operation modes:

 Performance mode: the CPU always works at the highest frequency

 Powersave mode: the CPU always works at the lowest frequency

 OnDemand mode: the CPU frequency is adjusted according to the CPU load

 UserSpace mode: the user may change the CPU frequency manually within the CPU specification

Using UserSpace mode, we found the best CPU load threshold range for changing the frequency to be 75%~80%: if the CPU load is lower than this range we reduce the frequency, and if higher we increase it. Since the default threshold of OnDemand mode happens to be 80%, we simply use OnDemand mode to control the CPU frequency when our data dispatching method is executed.
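A minimal sketch of driving CPUFreq from a program through the standard Linux sysfs interface; the paths follow the stock kernel 2.6 cpufreq layout, frequencies are written in kHz, and root privileges are required. This is our illustration of the mechanism, not the thesis's test harness.

    #include <stdio.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) return -1;
        fprintf(f, "%s\n", val);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        /* switch core 0 to manual control, then drop it to 1.15 GHz */
        write_str("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                  "userspace");
        write_str("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                  "1150000");
        /* or hand control back to the 80%-threshold policy used with LAD */
        write_str("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                  "ondemand");
        return 0;
    }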

4.3. Loading-Aware Dispatching

Following the previous results and working with the OnDemand mode of CPUFreq, we provide the Loading-Aware Dispatching (LAD) method. Based on the AMD "PowerNow!" hardware structure, keeping the same load on all cores is necessary for efficient energy consumption, so sending data from the central node to the lowest-load node makes sense. If the load on a core can be reduced, then reducing the CPU frequency is permitted, saving energy.


Figure 4-1: LAD Algorithm structure diagram

In the LAD algorithm, as indicated in Figure 4-1, data frames are still sent from Host 1 Core 0 to the other cores. This pattern is often used to distribute blocks of data awaiting calculation in complex parallel mathematical computations. MPI provides a broadcast command to distribute data and a reduce command to collect results; in order to change the data frame transmission path dynamically, we instead use point-to-point commands to exchange data, since this type of command can designate the sending and receiving nodes.

The detail of operation flow is as below:

 Step 1: Detect core loading

 Step 2: Find lowest loading core

 Step 3: Send several data frames to the lowest loading core

 Step 4: Repeat the previous two steps until all data frames have been transmitted.

The data distribution algorithm is given below, followed by a compilable sketch.

Loading-Aware Dispatching (LAD) Algorithm

    generate wait-for-send data frames
    if (node 0) {
        // send data following the sorting result
        while (!DataSendingFinish) {
            // detect the nodes' loading from system information and save it in TargetNode
            OpenCPUState;
            CalculateCPULoading;
            // sort TargetNode from low to high loading
            CPULoadingSorting;
            // send 1000 data frames
            for (i = 1; i < 1000; i++)
                SendData(TargetNode[i]);
            if (whole data transmitted)
                DataSendingFinish = true;
        }
        // send a finish message to the receiving nodes
        for (i = 1; i < NodeNumber; i++)
            SendData(i);
    }
    if (other nodes) {
        // receive data from node 0
        ReceiveData(0);
        usleep();
    }
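The skeleton below turns the pseudocode into a compilable MPI program, as a sketch rather than the thesis's exact implementation: the node_load() helper is a placeholder for the /proc/stat sampling of Chapter 3 (gathering remote loads over MPI is elided), and the batch and frame sizes follow the experiments.

    #include <mpi.h>
    #include <string.h>

    #define BATCH      1000      /* frames per dispatch decision */
    #define FRAMES     100000
    #define FRAME_SIZE 1460
    #define TAG_DATA   0
    #define TAG_DONE   1

    /* Placeholder: in the real system this would come from sampling
       /proc/stat on each host and shipping the numbers back over MPI. */
    static double node_load(int rank)
    {
        (void)rank;
        return 0.5;
    }

    static int lowest_loaded(int nranks)
    {
        int best = 1;
        for (int r = 2; r < nranks; r++)
            if (node_load(r) < node_load(best))
                best = r;
        return best;
    }

    int main(int argc, char **argv)
    {
        int rank, nranks, frames_left = FRAMES;
        char frame[FRAME_SIZE];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        memset(frame, 0, sizeof frame);
        if (rank == 0) {
            while (frames_left > 0) {
                int target = lowest_loaded(nranks);       /* steps 1 and 2 */
                for (int i = 0; i < BATCH && frames_left > 0; i++, frames_left--)
                    MPI_Send(frame, FRAME_SIZE, MPI_CHAR, target,
                             TAG_DATA, MPI_COMM_WORLD);    /* step 3 */
            }
            for (int r = 1; r < nranks; r++)               /* finish message */
                MPI_Send(frame, 0, MPI_CHAR, r, TAG_DONE, MPI_COMM_WORLD);
        } else {
            MPI_Status st;
            do {                                           /* receive from node 0 */
                MPI_Recv(frame, FRAME_SIZE, MPI_CHAR, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
            } while (st.MPI_TAG != TAG_DONE);
        }
        MPI_Finalize();
        return 0;
    }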


Chapter 5.

Performance Evaluation and Analysis

In this chapter, the experimental environment and the results of the LAD algorithm are presented. The cluster platform includes two computing nodes connected via Gigabit Ethernet; each node runs Ubuntu Linux 8.10 with kernel 2.6.27-9, and the OpenMPI message passing library is selected for its thread execution affinity function. The hardware specification is listed next.

Table 5-1: Host specification

    CPU             AMD Phenom X4 9650 Quad-Core, 2.3GHz
    Layer 1 cache   64K instruction cache and 64K data cache per core
    Layer 2 cache   512K per core
    Layer 3 cache   2M shared by 4 cores
    Main memory     DDR2-800, 4GB

Figure 5-1: Test environment (two Ubuntu 8.10 hosts running OpenMPI 1.2.7, connected via a Gigabit switch)

5.1. Experiment Setup


5.1.1. Data frame size

Three different sizes of data frames are transmitted between nodes: one byte, 1460 bytes, and 8000 bytes. The one-byte frame is the smallest not only in MPI but also on the network; to complete the data transmission in the shortest time, the source node generates a huge number of one-byte frames, and these packets congest the CPU internal bus and the network.

The 1518-byte frame is the largest on Ethernet, but considering that network headers must be inserted into each packet (a 1500-byte MTU minus 40 bytes of TCP/IP headers), we select the 1460-byte frame for testing; this size carries the largest amount of data in a single network packet and triggers the fewest network driver interrupts to the CPU. Finally, the 8000-byte frame is set for large-frame testing, since it must be fragmented into several packets by the network driver for transmission, though not for intra-node exchange, and thus needs the longest time for data transmission.

While the experiment is executed, we send 100K data frames between the two nodes and calculate the power consumption.

5.1.2. CPU frequency and packet delay

Each of the experimental figures and tables that follow has four blocks. The first is executed in Performance Mode (PM, CPU works at 2.3GHz), the second in PowerSave Mode (PS, 1.15GHz), the third in OnDemand Mode (OD, which slows the frequency down while the CPU load is lower than 80%), and the last is the LAD algorithm working with OnDemand Mode.

Besides, each block has four delay configurations: no delay between data frames, a 5µs delay, a 10µs delay, and a 20µs delay. In the figures that follow, TD stands for Transmission Delay, TT for Transmission Time, and PC for Power Consumption. Following Function 2, a 5µs packet delay should increase the TT by 0.5s over no delay, 10µs by 1s, and 20µs by 2s.
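As a quick sanity check, substituting the 100K-frame count of Section 5.1.1 into Function 2 gives these expected additions:

    T = N × D = 100,000 × 5µs  = 0.5s
    T = N × D = 100,000 × 10µs = 1.0s
    T = N × D = 100,000 × 20µs = 2.0s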

5.1.3. Rank number


The "Rank Number" in the figures and tables means the number of nodes/cores joining data dispatching. For example, rank 2 means rank 0 dispatches data to rank 1, and rank 4 means rank 0 dispatches data to ranks 1, 2, and 3. Since each host has four cores, rank numbers 2~4 involve intra-node data transmission only, while ranks 5~8 involve cross-node data transmission.

Although only four cores join the work in rank numbers 2~4, the other cores consume energy at the same time, and we still add in the energy they consume.

5.2. Experiment Results

5.2.1. One byte frame

Table 5-2 and Figure 5-2 show the TT for the one-byte frame. Comparing PM, PS, and OD modes, we find that TD increases the TT by over 3 seconds in ranks 2~4 at every frequency level, but by less than one second in ranks 5~8. Table 5-3 and Figure 5-3 display the one-byte-frame PC. Clearly, PS mode spends the longest time transmitting data, though it consumes the least energy. OD mode shows no remarkable power saving in ranks 7~8, but it uses on average 100J less than PM mode in ranks 2~6, and keeps the TT increase below 0.4s in the cross-node situation. The LAD algorithm shows its advantage in the no-delay situation: less than 1s of TT increase while consuming almost the same energy in ranks 7~8. In the other situations, LAD spends at most 4s longer than OD mode and saves 400J.


Table 5-2: Detail results of time effect of TD on TT (Frame = 1 Byte)

Mode & TD (µs)    Rank 2    Rank 3    Rank 4    Rank 5    Rank 6    Rank 7    Rank 8

PM mode

0 0.160 0.333 0.501 6.262 10.646 15.072 18.630

5 0.928 1.152 1.286 6.813 11.276 15.867 19.384

10 2.292 2.271 1.775 6.655 11.076 15.419 19.238

20 3.251 3.229 3.216 6.924 11.083 15.603 19.032

PS mode

0 0.285 0.576 0.909 9.984 16.537 23.151 28.976

5 1.326 1.689 1.935 10.599 17.311 23.679 29.518

10 2.637 2.174 2.429 10.580 17.848 24.157 29.598

20 3.612 6.651 3.850 10.767 17.470 24.165 29.651

OD mode

0 0.216 0.372 0.531 6.625 11.388 16.025 18.664

5 1.330 1.625 1.824 7.143 11.973 16.863 19.503

10 2.630 2.126 2.256 6.898 11.693 16.456 19.456

20 3.489 3.683 3.756 7.343 11.723 16.615 19.161

LAD

0 0.288 0.577 0.918 7.182 12.018 16.699 19.524

5 1.367 1.423 1.623 8.716 14.587 20.181 20.704

10 2.659 1.960 2.028 8.718 14.508 20.221 21.355

20 3.598 3.813 3.716 9.254 15.129 20.253 22.943

Figure 5-2: Time effect of TD on TT (Frame = 1 Byte)


Table 5-3: Detail results of power effect of TD on PC (Frame = 1 Byte)

Mode & TD (µs)    Rank 2    Rank 3    Rank 4    Rank 5    Rank 6    Rank 7    Rank 8

PM mode

0 25.400 52.864 79.535 994.105 1690.074 2392.710 2957.550
5 147.322 182.882 204.155 1081.577 1790.088 2518.918 3077.249
10 363.860 360.526 281.785 1056.495 1758.337 2447.797 3054.071
20 516.103 512.610 510.546 1099.199 1759.448 2477.007 3021.368

PS mode

0 20.349 41.126 64.903 712.858 1180.742 1652.981 2068.886
5 94.676 120.595 138.159 756.769 1236.005 1690.681 2107.585
10 188.282 155.224 173.431 755.412 1274.347 1724.810 2113.297
20 257.897 474.881 274.890 768.764 1247.358 1725.381 2117.081

OD mode

0 15.422 26.561 43.711 807.412 1516.140 2220.892 2926.660
5 105.881 133.768 161.069 767.735 1733.671 2433.350 3063.280
10 216.499 175.010 182.916 869.106 1578.218 2407.817 3072.742
20 232.068 220.862 300.934 799.179 1557.660 2429.356 3029.503

LAD

0 20.563 41.198 95.615 776.907 1475.156 2291.333 2901.684
5 112.530 106.221 126.801 957.703 1557.115 2203.301 2980.904
10 207.967 161.345 166.637 870.518 1631.205 2207.031 2976.349
20 213.865 302.963 294.978 676.686 1370.538 2073.595 2787.763

Figure 5-3: Power effect of TD on PC (Frame = 1 Byte)

5.2.2. 1460 byte frame

Table 5-4 and Figure 5-4 show the TT for the 1460-byte frame. Comparing PM mode and OD mode, the completion time is longer than for the one-byte frame in all situations. In Table 5-5 and Figure 5-5, OD mode uses on average over 200J less than PM mode. Our LAD algorithm takes 24~25s to complete the data transmission, like OD mode, yet consumes 200~600J less than OD mode at 8 ranks. In the other situations, LAD keeps nearly the same performance, spending 3s longer than OD mode while consuming 200~600J less.

Table 5-4: Detail results of time effect of TD on TT (Frame = 1460 Byte)

Mode & TD (µs)    Rank 2    Rank 3    Rank 4    Rank 5    Rank 6    Rank 7    Rank 8

PM mode

0 0.353 0.525 0.721 8.969 12.421 17.518 25.286

5 0.996 1.188 1.321 11.687 13.115 18.323 24.394

10 2.481 2.267 1.818 9.811 12.892 17.752 23.960

20 3.330 3.281 3.245 14.031 12.760 17.835 24.511

PS mode

0 0.621 0.913 1.254 11.391 18.925 25.933 31.738

5 1.448 1.825 2.004 10.599 19.379 26.427 32.430

10 2.947 2.252 2.442 12.100 19.803 26.545 32.802

20 3.708 3.749 3.941 11.890 19.405 26.641 32.827

OD mode

0 0.408 0.548 0.738 7.769 13.033 18.427 24.356

5 1.394 1.707 1.931 8.435 13.749 19.017 25.097

10 2.818 2.221 2.285 8.329 13.512 18.971 24.542

20 3.723 3.720 3.874 8.352 13.584 18.841 24.547

LAD

0 0.630 0.940 1.271 10.855 16.063 21.985 24.732

5 1.403 1.500 1.646 10.192 16.104 21.200 24.741

10 2.861 1.993 2.080 9.852 16.307 21.228 25.182

20 3.742 3.871 3.611 10.482 17.143 21.356 25.566

Figure 5-4: Time effect of TD on TT (Frame = 1460 Byte)


Table 5-5: Detail results of power effect of TD on PC (Frame = 1460 Byte)

Mode & TD (µs)    Rank 2    Rank 3    Rank 4    Rank 5    Rank 6    Rank 7    Rank 8

PM mode

0 56.039 83.345 114.460 1423.847 1971.859 2781.018 4014.203
5 158.117 188.597 209.711 1855.335 2082.032 2908.813 3872.596
10 393.864 359.891 288.611 1557.516 2046.631 2818.166 3803.698
20 528.644 520.865 515.150 2227.449 2025.676 2831.342 3891.170

PS mode

0 44.339 65.188 89.536 813.317 1351.245 1851.616 2266.093
5 103.387 130.305 143.086 756.769 1383.661 1886.888 2315.502
10 210.416 160.793 174.359 863.940 1413.934 1895.313 2342.063
20 264.751 267.679 281.387 848.946 1385.517 1902.167 2343.848

OD mode

0 33.586 39.127 52.693 945.259 1645.667 2710.100 3839.306
5 110.451 132.799 158.958 1032.840 1849.698 2663.291 3900.304
10 223.043 182.830 195.905 1049.544 1827.601 2729.659 3861.452
20 306.473 306.226 309.360 1022.164 1827.937 2739.376 3834.217

LAD

0 44.982 87.644 123.505 1028.737 1705.181 2431.432 3650.212
5 104.575 118.019 124.578 1044.754 1715.120 2455.331 3608.556
10 235.515 153.219 171.224 976.402 1847.987 2431.883 3525.145
20 308.037 307.738 290.581 890.359 1654.873 2355.386 3209.088

Figure 5-5: Power effect of TD on PC (Frame = 1460 Byte)

5.2.3. 8000 byte frame

Although the 8000-byte frame is the longest, the PS-mode TT stays 6s longer than for the other frame sizes, as shown in Table 5-6 and Figure 5-6. Comparing OD and PM modes, OD mode spends less than 1s longer than PM mode, yet saves 200~400J in the other cases, as shown in Table 5-7 and Figure 5-7. Comparing the LAD algorithm and OD mode, LAD still keeps its advantages at the longest frame size: it spends almost the same TT at 8 ranks and on average 2~3s longer in the other cross-node situations, while consuming 100~400J less than OD mode.

Table 5-6: Detail results of time effect of TD on TT (Frame = 8000 Byte)

Mode & TD (µs)    Rank 2    Rank 3    Rank 4    Rank 5    Rank 6    Rank 7    Rank 8

PM mode

0 1.220 1.409 1.597 11.158 20.343 26.952 31.241

5 1.484 1.710 1.783 13.364 21.986 27.664 34.053

10 1.993 2.171 2.260 11.857 21.455 27.397 33.398

20 3.824 3.753 3.732 11.247 21.604 27.513 33.178

PS mode

0 2.333 2.619 2.812 16.480 27.429 34.139 38.755

5 2.240 2.684 2.964 16.716 27.728 35.219 39.884

10 2.774 3.088 3.245 17.336 19.803 35.387 41.127

20 4.685 4.678 4.244 16.700 27.613 35.219 39.930

OD mode

0 1.274 1.464 1.610 10.648 22.022 27.752 31.210

5 2.045 2.407 2.226 14.377 21.778 27.739 34.744

10 2.546 2.810 2.769 13.856 22.079 27.932 34.379

20 4.338 4.380 4.448 14.037 21.957 27.603 34.878

LAD

0 1.917 2.125 2.200 12.242 23.169 28.553 33.991

5 2.234 2.399 2.579 13.487 22.152 29.627 34.864

10 2.672 2.779 2.740 13.018 24.838 30.845 34.518

20 4.456 4.663 4.163 12.754 22.897 29.118 36.666

Figure 5-6: Time effect of TD on TT (Frame = 8000 Byte)


Table 5-7: Detail results of power effect of TD on PC (Frame = 8000 Byte)

Mode & TD (µs)    Rank 2    Rank 3    Rank 4    Rank 5    Rank 6    Rank 7    Rank 8

PM mode

0 193.677 223.682 253.527 1771.355 3229.492 4278.684 4959.571
5 235.588 271.466 283.055 2121.562 3490.321 4391.715 5405.982
10 316.393 344.651 358.780 1882.322 3406.024 4349.329 5301.999
20 607.068 595.796 592.462 1785.484 3429.678 4367.744 5267.074

PS mode

0 166.576 186.997 200.777 1176.672 1958.431 2437.525 2767.107
5 159.936 191.638 211.630 1193.522 1979.779 2514.637 2847.718
10 198.064 220.483 231.693 1237.790 1413.934 2526.632 2936.468
20 334.509 334.009 303.022 1192.380 1971.568 2514.637 2851.002

OD mode

0 107.866 147.418 185.271 1320.356 2877.695 4069.769 4903.488
5 156.932 193.698 202.611 1652.663 2925.864 4059.867 5447.829
10 203.622 231.316 238.859 1706.812 2862.416 4076.114 5432.837
20 353.408 367.326 361.262 1664.418 3014.893 4030.163 5487.617

LAD

0 167.818 201.826 224.777 1355.342 2522.337 3977.022 4695.212
5 183.581 197.163 205.979 1436.942 2499.579 3962.008 4708.029
10 212.619 220.259 225.554 1301.254 2938.202 4316.622 5198.511
20 284.493 376.613 329.994 1265.924 2629.309 4060.896 5061.468

Figure 5-7: Power effect of TD on PC (Frame = 8000 Byte)


5.2.4. Load balance effect

The LAD algorithm sends data frames to the lowest-load core, which means every core should be kept at almost the same working load. The average core loads under OnDemand mode and the LAD algorithm are shown below: in OnDemand mode, the core loads on the first host (cores 0, 1, 2, and 3) are higher than on the remote host (cores 4, 5, 6, and 7). Comparing OnDemand mode and LAD, core 0 has almost the same load in both, because it is the root node and must send data frames to the other nodes, but the other nodes keep their load near 80%, which improves the unfair situation of OnDemand mode.

Figure 5-8: Average core loading of OnDemand mode and LAD algorithm


5.2.5. Remarks

In this research, the LAD algorithm keeps the average TT increase at 17% for the 1-byte and 1460-byte data frames and 8% for the 8000-byte frame, yet saves 18% of the energy compared with OD mode in the cross-node situation. Limited to only two experimental CPU frequency steps (2.3GHz and 1.15GHz), we cannot keep the CPU load on a smooth curve. Desktop and server CPUs do not stay at high load for long once a concurrent job completes and before the next one starts; power saving technology helps decrease host energy consumption, reducing both the energy cost and carbon dioxide emissions.

Tables 5-2 and 5-4 display the effect of the transmission speed reduction. Following 5.1.2, a 5µs packet delay should increase the TT by 0.5s over no delay, yet in the 8-rank LAD test of Table 5-4, the no-delay result is 24.732s and the 5µs result is 24.741s; the 10µs and 20µs results also increase the TT by less than 1s. This means that a too-fast transmission speed raises the CPU bus load without any performance benefit. Besides, because of network protocol and driver issues, the 8000-byte data frame is fragmented, which adds a little load to the CPU, so its test results are not as stable as those of the 1-byte and 1460-byte frames.

Core frequency reduction helps reduce the power consumption of I/O-bound jobs; if the tested application is CPU-bound, this strategy is useless, because the cores always stay at high load and there is no chance to reduce the frequency.


Chapter 6.

Conclusions and Future Work

The one-byte data frame is the smallest, and in the cross-node situation its transmission time is 5 seconds shorter than the 1460-byte frame's and 14 seconds shorter than the 8000-byte frame's. This means two kinds of application that have no huge data to transmit are suited to small data frames:

 Mathematical calculation

 Operation command sending in any application

Small data frames help reduce transmission time and energy consumption, so more core calculation cycles can be released for CPU-bound jobs.

Besides, two kinds of application are suited to large data frames:

 Database server that is sending data back

 Distributed file transmission

A larger data frame reduces the frame generation time and transmits more data in a single frame because of its larger content space.

There are many directions in which to continue this investigation and develop methods to save energy. If hardware and software provide voltage or speed control functions for the motherboard or other peripheral devices, then hardware drivers, power-aware job scheduling, and data distribution algorithms can be combined and implemented, targeting the construction of a low-energy cluster computing platform in the future.

Also, the scale of this experiment is only two hosts; a larger experimental environment of 4 or more hosts is necessary to verify the approach in the future.


References

1. "Power Management Guide", http://www.gentoo.com/doc/en/power-management-guide.xml
2. "Enabling CPU Frequency Scaling", http://ubuntu.wordpress.com/2005/11/04/enabling-cpu-frequency-scaling/
3. "Enhanced Intel SpeedStep Technology for the Intel Pentium M Processor", ftp://download.intel.com/design/network/papers/30117401.pdf
4. "AMD PowerNow! Technology Platform Design Guide for Embedded Processors", http://www.amd.com/epd/processors/6.32bitproc/8.amdk6fami/x24267/24267a.pdf
5. "AMD / Intel CPU voltage control driver download", http://www.linux-phc.org/viewtopic.php?f=13&t=2
6. "AMD Family 10h Desktop Processor Power and Thermal Data Sheet", http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/GH_43375_10h_DT_PTDS_PUB_3.14.pdf
7. "AMD Opteron Processor with Direct Connect Architecture", http://enterprise.amd.com/downloads/4P_Power_PID_41498.pdf
8. Chao-Yang Lan, Ching-Hsien Hsu and Shih-Chang Chen, "Scheduling Contention-Free Irregular Redistributions in Parallelizing Compilers", The Journal of Supercomputing, Vol. 40, No. 3, June 2007, pp. 229-247.
9. Dongkun Shin and Jihong Kim, "Power-Aware Communication Optimization for Networks-on-Chips with Voltage Scalable Links", Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, 2004, pp. 170-175.
10. Guangyu Chen, Feihui Li and Mahmut Kandemir, "Reducing Energy Consumption of On-Chip Networks Through a Hybrid Compiler-Runtime Approach", 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), pp. 163-174.
11. "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1", http://download.intel.com/design/processor/manuals/253665.pdf
12. "Key Architectural Features of AMD Phenom X4 Quad-Core Processors", http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_15331_15332%5E15334,00.html
13. Lei Chai, Albert Hartono and Dhabaleswar K. Panda, "Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters", 2006 IEEE International Conference on Cluster Computing, 25-28 Sept. 2006, pp. 1-10.
14. Ranjit Noronha and D. K. Panda, "Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support", 2007 IEEE International Parallel and Distributed Processing Symposium, 26-30 March 2007, pp. 1-8.
15. Umit Y. Ogras, Radu Marculescu, Hyung Gyu Lee and Naehyuck Chang, "Communication Architecture Optimization: Making the Shortest Path Shorter in Regular Networks-on-Chip", Proceedings of the Conference on Design, Automation and Test in Europe (DATE 2006), Munich, Germany, March 2006, Vol. 1, pp. 712-717.
16. "InfiniBand", http://en.wikipedia.org/wiki/InfiniBand
17. Seung Woo Son, Guangyu Chen, Ozcan Ozturk, Mahmut Kandemir and Alok Choudhary, "Compiler-Directed Energy Optimization for Parallel-Disk-Based Systems", IEEE Transactions on Parallel and Distributed Systems, Vol. 18, No. 9, September 2007, pp. 1241-1257.
18. Sri Hari Krishna Narayanan, Mahmut Kandemir and Ozcan Ozturk, "Compiler-Directed Power Density Reduction in NoC-Based Multi-Core Designs", Proceedings of the 7th International Symposium on Quality Electronic Design, 2006, pp. 570-575.
19. David Meisner, Brian T. Gold and Thomas F. Wenisch, "PowerNap: Eliminating Server Idle Power", Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009, pp. 205-216.
20. Michael B. Healy, Hsien-Hsin S. Lee, Gabriel H. Loh and Sung Kyu Lim, "Thermal Optimization in Multi-Granularity Multi-Core Floorplanning", Proceedings of the 2009 Conference on Asia and South Pacific Design Automation, 2009, pp. 43-48.
21. M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi and Yale N. Patt, "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures", Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009, pp. 253-264.
22. David C. Snowdon, Etienne Le Sueur, Stefan M. Petters and Gernot Heiser, "Koala: A Platform for OS-Level Power Management", Proceedings of the Fourth ACM European Conference on Computer Systems, 2009, pp. 289-302.
23. Alexander S. van Amesfoort, Ana Lucia Varbanescu, Henk J. Sips and Rob V. van Nieuwpoort, "Evaluating Multi-core Platforms for HPC Data-Intensive Kernels", Proceedings of the 6th ACM Conference on Computing Frontiers, 2009, pp. 207-216.
24. Xiangrong Zhou, Chenjie Yu and Peter Petrov, "Temperature-Aware Register Reallocation for Register File Power-Density Minimization", ACM Transactions on Design Automation of Electronic Systems, Vol. 14, No. 2, Article 26, March 2009.
25. Radha Guha, Nader Bagherzadeh and Pai Chou, "Resource Management and Task Partitioning and Scheduling on a Run-Time Reconfigurable Embedded System", Computers and Electrical Engineering, Vol. 35, No. 2, March 2009, pp. 258-285.
