Power Consumption Optimization of MPI Programs on Multi-Core Clusters

5. Performance Evaluation and Analysis

        // sort TargetNode from low to high
        CPULoadingSorting;
        // send 1000 data frames
        for (i = 1; i < 1000; i++)
            SendData(TargetNode[i]);
        if (whole data transmitted)
            DataSendingFinish = true;
    }
    // send finish message to receiving nodes
    for (i = 1; i < NodeNumber; i++)
        SendData(i);
}
if (other nodes)
{
    // receive data from node 0
    ReceiveData(0);
    usleep();
}
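The listing above is only a fragment, so the following Python sketch is one possible reading of the sender side rather than the authors' implementation. It orders the receiving ranks from lowest to highest CPU load, as the `CPULoadingSorting` step suggests, and then hands out frames. The round-robin assignment, the `cpu_load` dictionary and all function names are assumptions made for illustration; the fragment does not state how frames are mapped to targets.

```python
def sort_targets_by_load(cpu_load):
    """Return receiver ranks ordered from lowest to highest CPU load.

    cpu_load maps a rank to its load in [0, 1] (hypothetical input format).
    Ties are broken by rank number so the order is deterministic.
    """
    return sorted(cpu_load, key=lambda rank: (cpu_load[rank], rank))


def dispatch(frames, cpu_load):
    """Assign frames to receivers, starting with the least-loaded rank.

    Round-robin assignment is an assumption, not taken from the listing.
    """
    targets = sort_targets_by_load(cpu_load)
    assignment = {rank: [] for rank in targets}
    for index, frame in enumerate(frames):
        assignment[targets[index % len(targets)]].append(frame)
    return assignment


# Example: three receivers with made-up loads and six frames.
loads = {1: 0.70, 2: 0.10, 3: 0.40}
plan = dispatch(list(range(6)), loads)
```

In this example the least-loaded rank (rank 2) receives the first frame, and every frame is assigned to exactly one receiver.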

Figure 6: Test environment

Data frame size

Three different sizes of data frames are transmitted between nodes: one byte, 1460 bytes and 8000 bytes.

The one-byte frame is the smallest frame both in MPI and on the network. To complete the data transmission in the shortest time, the source node generates a huge number of one-byte frames, and these packets congest the CPU internal bus and the network.

The 1518-byte frame is the largest one on the network, but because network headers must be inserted into each network packet, we select a 1460-byte frame for testing. This size carries the largest amount of data in a single network packet and triggers the fewest network driver interrupts to the CPU. Finally, the 8000-byte frame is used for large data frame testing: the network driver must split it into several packets for transmission between nodes, whereas no splitting is needed within a node, and it therefore needs the longest time for data transmission.
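The frame-size reasoning can be checked with a few lines of arithmetic. The sketch below assumes that each network packet carries at most 1460 bytes of MPI data, as chosen in the text; the function name is hypothetical. The 58-byte difference between 1518 and 1460 would match an Ethernet header and checksum plus IPv4 and TCP headers, but the transport protocol is not stated in the text, so that reading is an assumption.

```python
MAX_NETWORK_FRAME = 1518   # largest frame on the network (from the text)
PAYLOAD_PER_PACKET = 1460  # data bytes per packet selected for testing


def packets_needed(frame_bytes, payload=PAYLOAD_PER_PACKET):
    """Number of network packets needed to carry one MPI data frame."""
    if frame_bytes <= 0:
        raise ValueError("frame size must be positive")
    # Integer ceiling division: round up to a whole number of packets.
    return -(-frame_bytes // payload)


header_overhead = MAX_NETWORK_FRAME - PAYLOAD_PER_PACKET
for size in (1, 1460, 8000):
    print(size, "bytes ->", packets_needed(size), "packet(s)")
```

Under these assumptions the one-byte and 1460-byte frames each fit in a single packet, while the 8000-byte frame has to be split into six packets for cross-node transmission.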

In each experiment, we send 100K data frames between two nodes and calculate the power consumption.

CPU frequency and packet delay

Each of the result figures and tables that follow has four blocks. The first is executed in Performance Mode (PM, CPU runs at 2.3GHz), the second in PowerSave Mode (PS, 1.15GHz), the third in OnDemand Mode (OD, which lowers the frequency while the CPU load is below 80%), and the last uses the LAD algorithm running on top of OnDemand Mode.
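As a rough illustration of the three baseline modes, the sketch below maps a mode and a CPU load to a clock frequency. It assumes only the two frequency steps used in the experiments (2.3GHz and 1.15GHz) and a simple 80% threshold; a real OnDemand governor samples the load periodically and is more elaborate. The function and constant names are hypothetical, and LAD is not modelled because its rules are not described in this section.

```python
HIGH_GHZ = 2.3            # Performance Mode frequency
LOW_GHZ = 1.15            # PowerSave Mode frequency
ONDEMAND_THRESHOLD = 0.80  # load below this lets OD lower the frequency


def select_frequency(mode, cpu_load):
    """Return the CPU frequency in GHz for a mode and a load in [0, 1]."""
    if not 0.0 <= cpu_load <= 1.0:
        raise ValueError("cpu_load must be between 0 and 1")
    if mode == "PM":
        return HIGH_GHZ
    if mode == "PS":
        return LOW_GHZ
    if mode == "OD":
        # Simplified two-step rule: slow down only while load is below 80%.
        return LOW_GHZ if cpu_load < ONDEMAND_THRESHOLD else HIGH_GHZ
    raise ValueError("unknown mode: " + mode)
```

For example, `select_frequency("OD", 0.5)` selects the low step, while the same load in PM stays at the high step.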

In addition, each block has four delay configurations: no delay between data frames, a 5µs delay, a 10µs delay, and a 20µs delay. In the figures that follow, TD stands for Transmission Delay, TT for Transmission Time, and PC for Power Consumption.
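To put the delay settings in perspective, the following sketch computes the nominal time that TD alone would add to one experiment. It assumes that the delay is inserted once per data frame and that 100K frames are sent, as stated earlier; the function name is hypothetical.

```python
FRAMES_PER_EXPERIMENT = 100_000  # from the text: 100K data frames


def nominal_delay_seconds(td_microseconds, frames=FRAMES_PER_EXPERIMENT):
    """Total delay in seconds if every frame waits td_microseconds once."""
    return frames * td_microseconds / 1_000_000


for td in (0, 5, 10, 20):
    print("TD =", td, "us ->", nominal_delay_seconds(td), "s")
```

Under these assumptions the three non-zero settings add 0.5s, 1s and 2s. Measured values can be higher, since a sleep call waits at least the requested interval and usually somewhat longer.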

Rank number

The “Rank Number” in each figure and table is the number of nodes/cores that take part in data dispatching.

For example, rank number 2 means that rank 0 dispatches data to rank 1, and rank number 4 means that rank 0 dispatches data to ranks 1, 2, and 3. Since each host has four cores, rank numbers 2~4 are intra-node data transmission, and rank numbers 5~8 are cross-node data transmission.

Although at most four cores take part in the work for rank numbers 2~4, the other cores consume energy at the same time, so their energy consumption must still be included.
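The rank-number convention can be written down compactly. The sketch below assumes four cores per host, as stated above, with rank 0 always acting as the sender; the function names are hypothetical.

```python
CORES_PER_HOST = 4  # from the text: each host has four cores


def receivers(rank_number):
    """Ranks that receive data from rank 0 for a given rank number."""
    if rank_number < 2:
        raise ValueError("at least one sender and one receiver are needed")
    return list(range(1, rank_number))


def transmission_scope(rank_number, cores_per_host=CORES_PER_HOST):
    """Classify a rank number as intra-node or cross-node transmission."""
    if rank_number < 2:
        raise ValueError("at least one sender and one receiver are needed")
    return "intra-node" if rank_number <= cores_per_host else "cross-node"
```

For instance, `receivers(4)` lists ranks 1, 2 and 3, all on the sender's host, while any rank number from 5 upward involves a second host.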

One byte frame

Table 4 and Figure 7 show the TT for the one-byte frame, and Table 5 and Figure 8 show the PC. Comparing PM, PS and OD modes, we find that TD increases the TT by over 3 seconds for rank numbers 2~4 at every frequency level, but by less than one second for rank numbers 5~8. Clearly, PS mode takes the longest time to transmit data, though it consumes the least energy. OD mode shows no remarkable power saving for rank numbers 7~8, but it uses on average 100J less than PM mode for rank numbers 2~6, and keeps the TT increase below 0.4s in the cross-node situation. The LAD algorithm shows its advantage in the no-delay case: its TT increases by less than 1s, yet it consumes almost the same energy for rank numbers 7~8. In the other cases, LAD takes at most 4s longer than OD mode and saves 400J.

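Table 4: Detailed results of the time effect of TD on TT (Frame = 1 Byte; TT in seconds, TD in µs)

| Mode | TD | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 |
|------|----|--------|--------|--------|--------|--------|--------|--------|
| PM   | 0  | 0.160  | 0.333  | 0.501  | 6.262  | 10.646 | 15.072 | 18.630 |
| PM   | 5  | 0.928  | 1.152  | 1.286  | 6.813  | 11.276 | 15.867 | 19.384 |
| PM   | 10 | 2.292  | 2.271  | 1.775  | 6.655  | 11.076 | 15.419 | 19.238 |
| PM   | 20 | 3.251  | 3.229  | 3.216  | 6.924  | 11.083 | 15.603 | 19.032 |
| PS   | 0  | 0.285  | 0.576  | 0.909  | 9.984  | 16.537 | 23.151 | 28.976 |
| PS   | 5  | 1.326  | 1.689  | 1.935  | 10.599 | 17.311 | 23.679 | 29.518 |
| PS   | 10 | 2.637  | 2.174  | 2.429  | 10.580 | 17.848 | 24.157 | 29.598 |
| PS   | 20 | 3.612  | 6.651  | 3.850  | 10.767 | 17.470 | 24.165 | 29.651 |
| OD   | 0  | 0.216  | 0.372  | 0.531  | 6.625  | 11.388 | 16.025 | 18.664 |
| OD   | 5  | 1.330  | 1.625  | 1.824  | 7.143  | 11.973 | 16.863 | 19.503 |
| OD   | 10 | 2.630  | 2.126  | 2.256  | 6.898  | 11.693 | 16.456 | 19.456 |
| OD   | 20 | 3.489  | 3.683  | 3.756  | 7.343  | 11.723 | 16.615 | 19.161 |
| LAD  | 0  | 0.288  | 0.577  | 0.918  | 7.182  | 12.018 | 16.699 | 19.524 |
| LAD  | 5  | 1.367  | 1.423  | 1.623  | 8.716  | 14.587 | 20.181 | 20.704 |
| LAD  | 10 | 2.659  | 1.960  | 2.028  | 8.718  | 14.508 | 20.221 | 21.355 |
| LAD  | 20 | 3.598  | 3.813  | 3.716  | 9.254  | 15.129 | 20.253 | 22.943 |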

Figure 7: Time effect of TD on TT (Frame = 1 Byte)

Table 5: Detailed results of the power effect of TD on PC (Frame = 1 Byte; PC in joules, TD in µs)

| Mode | TD | Rank 2  | Rank 3  | Rank 4  | Rank 5   | Rank 6   | Rank 7   | Rank 8   |
|------|----|---------|---------|---------|----------|----------|----------|----------|
| PM   | 0  | 25.400  | 52.864  | 79.535  | 994.105  | 1690.074 | 2392.710 | 2957.550 |
| PM   | 5  | 147.322 | 182.882 | 204.155 | 1081.577 | 1790.088 | 2518.918 | 3077.249 |
| PM   | 10 | 363.860 | 360.526 | 281.785 | 1056.495 | 1758.337 | 2447.797 | 3054.071 |
| PM   | 20 | 516.103 | 512.610 | 510.546 | 1099.199 | 1759.448 | 2477.007 | 3021.368 |
| PS   | 0  | 20.349  | 41.126  | 64.903  | 712.858  | 1180.742 | 1652.981 | 2068.886 |
| PS   | 5  | 94.676  | 120.595 | 138.159 | 756.769  | 1236.005 | 1690.681 | 2107.585 |
| PS   | 10 | 188.282 | 155.224 | 173.431 | 755.412  | 1274.347 | 1724.810 | 2113.297 |
| PS   | 20 | 257.897 | 474.881 | 274.890 | 768.764  | 1247.358 | 1725.381 | 2117.081 |
| OD   | 0  | 15.422  | 26.561  | 43.711  | 807.412  | 1516.140 | 2220.892 | 2926.660 |
| OD   | 5  | 105.881 | 133.768 | 161.069 | 767.735  | 1733.671 | 2433.350 | 3063.280 |
| OD   | 10 | 216.499 | 175.010 | 182.916 | 869.106  | 1578.218 | 2407.817 | 3072.742 |
| OD   | 20 | 232.068 | 220.862 | 300.934 | 799.179  | 1557.660 | 2429.356 | 3029.503 |
| LAD  | 0  | 20.563  | 41.198  | 95.615  | 776.907  | 1475.156 | 2291.333 | 2901.684 |
| LAD  | 5  | 112.530 | 106.221 | 126.801 | 957.703  | 1557.115 | 2203.301 | 2980.904 |
| LAD  | 10 | 207.967 | 161.345 | 166.637 | 870.518  | 1631.205 | 2207.031 | 2976.349 |
| LAD  | 20 | 213.865 | 302.963 | 294.978 | 676.686  | 1370.538 | 2073.595 | 2787.763 |

Figure 8: Power effect of TD on PC (Frame = 1 Byte)
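Since PC is an energy in joules and TT a time in seconds, dividing one by the other gives the average power drawn during a run. The sketch below does this for rank number 2 with no delay, using the one-byte values from Tables 4 and 5; the function name and the dictionary layout are choices made for illustration only.

```python
def average_power_watts(energy_joules, time_seconds):
    """Average power over a run: energy divided by elapsed time."""
    if time_seconds <= 0:
        raise ValueError("time must be positive")
    return energy_joules / time_seconds


# (TT in seconds, PC in joules) for rank number 2, TD = 0, 1-byte frame.
rank2_no_delay = {
    "PM": (0.160, 25.400),
    "PS": (0.285, 20.349),
    "OD": (0.216, 15.422),
    "LAD": (0.288, 20.563),
}

power = {mode: average_power_watts(pc, tt)
         for mode, (tt, pc) in rank2_no_delay.items()}
for mode, watts in power.items():
    print(mode, round(watts, 2), "W")
```

For these rows, PM averages roughly 159W, while PS, OD and LAD all average roughly 71W, which is why the slower modes can use less energy even when they take longer.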

1460 byte frame

Table 6 and Figure 9 show the TT for the 1460-byte frame. Comparing PM mode and OD mode, the completion time is longer than for the one-byte frame in all situations. In Figure 10, OD mode uses on average over 200J less than PM mode. With 8 ranks, our LAD algorithm takes 24~25s to complete the data transmission, the same as OD mode, yet consumes 200~600J less than OD mode. In the other situations, LAD keeps nearly the same performance, taking 3s longer than OD mode and consuming 200~600J less.

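Table 6: Detailed results of the time effect of TD on TT (Frame = 1460 Byte; TT in seconds, TD in µs)

| Mode | TD | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 |
|------|----|--------|--------|--------|--------|--------|--------|--------|
| PM   | 0  | 0.353  | 0.525  | 0.721  | 8.969  | 12.421 | 17.518 | 25.286 |
| PM   | 5  | 0.996  | 1.188  | 1.321  | 11.687 | 13.115 | 18.323 | 24.394 |
| PM   | 10 | 2.481  | 2.267  | 1.818  | 9.811  | 12.892 | 17.752 | 23.960 |
| PM   | 20 | 3.330  | 3.281  | 3.245  | 14.031 | 12.760 | 17.835 | 24.511 |
| PS   | 0  | 0.621  | 0.913  | 1.254  | 11.391 | 18.925 | 25.933 | 31.738 |
| PS   | 5  | 1.448  | 1.825  | 2.004  | 10.599 | 19.379 | 26.427 | 32.430 |
| PS   | 10 | 2.947  | 2.252  | 2.442  | 12.100 | 19.803 | 26.545 | 32.802 |
| PS   | 20 | 3.708  | 3.749  | 3.941  | 11.890 | 19.405 | 26.641 | 32.827 |
| OD   | 0  | 0.408  | 0.548  | 0.738  | 7.769  | 13.033 | 18.427 | 24.356 |
| OD   | 5  | 1.394  | 1.707  | 1.931  | 8.435  | 13.749 | 19.017 | 25.097 |
| OD   | 10 | 2.818  | 2.221  | 2.285  | 8.329  | 13.512 | 18.971 | 24.542 |
| OD   | 20 | 3.723  | 3.720  | 3.874  | 8.352  | 13.584 | 18.841 | 24.547 |
| LAD  | 0  | 0.630  | 0.940  | 1.271  | 10.855 | 16.063 | 21.985 | 24.732 |
| LAD  | 5  | 1.403  | 1.500  | 1.646  | 10.192 | 16.104 | 21.200 | 24.741 |
| LAD  | 10 | 2.861  | 1.993  | 2.080  | 9.852  | 16.307 | 21.228 | 25.182 |
| LAD  | 20 | 3.742  | 3.871  | 3.611  | 10.482 | 17.143 | 21.356 | 25.566 |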

Figure 9: Time effect of TD on TT (Frame = 1460 Byte)

Table 7: Detailed results of the power effect of TD on PC (Frame = 1460 Byte; PC in joules, TD in µs)

| Mode | TD | Rank 2  | Rank 3  | Rank 4  | Rank 5   | Rank 6   | Rank 7   | Rank 8   |
|------|----|---------|---------|---------|----------|----------|----------|----------|
| PM   | 0  | 56.039  | 83.345  | 114.460 | 1423.847 | 1971.859 | 2781.018 | 4014.203 |
| PM   | 5  | 158.117 | 188.597 | 209.711 | 1855.335 | 2082.032 | 2908.813 | 3872.596 |
| PM   | 10 | 393.864 | 359.891 | 288.611 | 1557.516 | 2046.631 | 2818.166 | 3803.698 |
| PM   | 20 | 528.644 | 520.865 | 515.150 | 2227.449 | 2025.676 | 2831.342 | 3891.170 |
| PS   | 0  | 44.339  | 65.188  | 89.536  | 813.317  | 1351.245 | 1851.616 | 2266.093 |
| PS   | 5  | 103.387 | 130.305 | 143.086 | 756.769  | 1383.661 | 1886.888 | 2315.502 |
| PS   | 10 | 210.416 | 160.793 | 174.359 | 863.940  | 1413.934 | 1895.313 | 2342.063 |
| PS   | 20 | 264.751 | 267.679 | 281.387 | 848.946  | 1385.517 | 1902.167 | 2343.848 |
| OD   | 0  | 33.586  | 39.127  | 52.693  | 945.259  | 1645.667 | 2710.100 | 3839.306 |
| OD   | 5  | 110.451 | 132.799 | 158.958 | 1032.840 | 1849.698 | 2663.291 | 3900.304 |
| OD   | 10 | 223.043 | 182.830 | 195.905 | 1049.544 | 1827.601 | 2729.659 | 3861.452 |
| OD   | 20 | 306.473 | 306.226 | 309.360 | 1022.164 | 1827.937 | 2739.376 | 3834.217 |
| LAD  | 0  | 44.982  | 87.644  | 123.505 | 1028.737 | 1705.181 | 2431.432 | 3650.212 |
| LAD  | 5  | 104.575 | 118.019 | 124.578 | 1044.754 | 1715.120 | 2455.331 | 3608.556 |
| LAD  | 10 | 235.515 | 153.219 | 171.224 | 976.402  | 1847.987 | 2431.883 | 3525.145 |
| LAD  | 20 | 308.037 | 307.738 | 290.581 | 890.359  | 1654.873 | 2355.386 | 3209.088 |

Figure 10: Power effect of TD on PC (Frame = 1460 Byte)
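The comparison between LAD and OD mode with 8 ranks can be reproduced directly from the tabulated values. The sketch below takes the rank number 8 entries for the 1460-byte frame from Tables 6 and 7 and computes, for each delay setting, how much longer LAD takes and how much energy it saves; the variable and function names are chosen for illustration only.

```python
TD_SETTINGS = (0, 5, 10, 20)  # transmission delay in microseconds

# Rank number 8, 1460-byte frame: TT in seconds, PC in joules.
OD_TT = (24.356, 25.097, 24.542, 24.547)
LAD_TT = (24.732, 24.741, 25.182, 25.566)
OD_PC = (3839.306, 3900.304, 3861.452, 3834.217)
LAD_PC = (3650.212, 3608.556, 3525.145, 3209.088)


def lad_versus_od():
    """Return {TD: (extra seconds for LAD, joules saved by LAD)}."""
    result = {}
    for td, od_t, lad_t, od_e, lad_e in zip(
            TD_SETTINGS, OD_TT, LAD_TT, OD_PC, LAD_PC):
        result[td] = (lad_t - od_t, od_e - lad_e)
    return result


comparison = lad_versus_od()
for td, (extra_s, saved_j) in comparison.items():
    print("TD", td, "us:", round(extra_s, 3), "s longer,",
          round(saved_j, 3), "J saved")
```

For these entries the energy saved by LAD ranges from roughly 189J with no delay to roughly 625J at a 20µs delay, while the TT difference stays within about one second.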

8000 byte frame

Although the 8000-byte frame is the largest one, the PS mode TT stays about 6s longer than for the other frame sizes, as shown in Figure 11. Comparing OD and PM modes, OD mode takes less than 1s longer than PM mode, yet saves 200~400J in the other cases. Comparing the LAD algorithm and OD mode, LAD keeps its advantage at the largest frame size: it takes almost the same TT with 8 ranks and on average 2~3s longer in the other cross-node situations, while consuming 100~400J less than OD mode.

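Table 8: Detailed results of the time effect of TD on TT (Frame = 8000 Byte; TT in seconds, TD in µs)

| Mode | TD | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 |
|------|----|--------|--------|--------|--------|--------|--------|--------|
| PM   | 0  | 1.220  | 1.409  | 1.597  | 11.158 | 20.343 | 26.952 | 31.241 |
| PM   | 5  | 1.484  | 1.710  | 1.783  | 13.364 | 21.986 | 27.664 | 34.053 |
| PM   | 10 | 1.993  | 2.171  | 2.260  | 11.857 | 21.455 | 27.397 | 33.398 |
| PM   | 20 | 3.824  | 3.753  | 3.732  | 11.247 | 21.604 | 27.513 | 33.178 |
| PS   | 0  | 2.333  | 2.619  | 2.812  | 16.480 | 27.429 | 34.139 | 38.755 |
| PS   | 5  | 2.240  | 2.684  | 2.964  | 16.716 | 27.728 | 35.219 | 39.884 |
| PS   | 10 | 2.774  | 3.088  | 3.245  | 17.336 | 19.803 | 35.387 | 41.127 |
| PS   | 20 | 4.685  | 4.678  | 4.244  | 16.700 | 27.613 | 35.219 | 39.930 |
| OD   | 0  | 1.274  | 1.464  | 1.610  | 10.648 | 22.022 | 27.752 | 31.210 |
| OD   | 5  | 2.045  | 2.407  | 2.226  | 14.377 | 21.778 | 27.739 | 34.744 |
| OD   | 10 | 2.546  | 2.810  | 2.769  | 13.856 | 22.079 | 27.932 | 34.379 |
| OD   | 20 | 4.338  | 4.380  | 4.448  | 14.037 | 21.957 | 27.603 | 34.878 |
| LAD  | 0  | 1.917  | 2.125  | 2.200  | 12.242 | 23.169 | 28.553 | 33.991 |
| LAD  | 5  | 2.234  | 2.399  | 2.579  | 13.487 | 22.152 | 29.627 | 34.864 |
| LAD  | 10 | 2.672  | 2.779  | 2.740  | 13.018 | 24.838 | 30.845 | 34.518 |
| LAD  | 20 | 4.456  | 4.663  | 4.163  | 12.754 | 22.897 | 29.118 | 36.666 |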

Figure 11: Time effect of TD on TT (Frame = 8000 Byte)

Table 9: Detailed results of the power effect of TD on PC (Frame = 8000 Byte; PC in joules, TD in µs)

| Mode | TD | Rank 2  | Rank 3  | Rank 4  | Rank 5   | Rank 6   | Rank 7   | Rank 8   |
|------|----|---------|---------|---------|----------|----------|----------|----------|
| PM   | 0  | 193.677 | 223.682 | 253.527 | 1771.355 | 3229.492 | 4278.684 | 4959.571 |
| PM   | 5  | 235.588 | 271.466 | 283.055 | 2121.562 | 3490.321 | 4391.715 | 5405.982 |
| PM   | 10 | 316.393 | 344.651 | 358.780 | 1882.322 | 3406.024 | 4349.329 | 5301.999 |
| PM   | 20 | 607.068 | 595.796 | 592.462 | 1785.484 | 3429.678 | 4367.744 | 5267.074 |
| PS   | 0  | 166.576 | 186.997 | 200.777 | 1176.672 | 1958.431 | 2437.525 | 2767.107 |
| PS   | 5  | 159.936 | 191.638 | 211.630 | 1193.522 | 1979.779 | 2514.637 | 2847.718 |
| PS   | 10 | 198.064 | 220.483 | 231.693 | 1237.790 | 1413.934 | 2526.632 | 2936.468 |
| PS   | 20 | 334.509 | 334.009 | 303.022 | 1192.380 | 1971.568 | 2514.637 | 2851.002 |
| OD   | 0  | 107.866 | 147.418 | 185.271 | 1320.356 | 2877.695 | 4069.769 | 4903.488 |
| OD   | 5  | 156.932 | 193.698 | 202.611 | 1652.663 | 2925.864 | 4059.867 | 5447.829 |
| OD   | 10 | 203.622 | 231.316 | 238.859 | 1706.812 | 2862.416 | 4076.114 | 5432.837 |
| OD   | 20 | 353.408 | 367.326 | 361.262 | 1664.418 | 3014.893 | 4030.163 | 5487.617 |
| LAD  | 0  | 167.818 | 201.826 | 224.777 | 1355.342 | 2522.337 | 3977.022 | 4695.212 |
| LAD  | 5  | 183.581 | 197.163 | 205.979 | 1436.942 | 2499.579 | 3962.008 | 4708.029 |
| LAD  | 10 | 212.619 | 220.259 | 225.554 | 1301.254 | 2938.202 | 4316.622 | 5198.511 |
| LAD  | 20 | 284.493 | 376.613 | 329.994 | 1265.924 | 2629.309 | 4060.896 | 5061.468 |

Figure 12: Power effect of TD on PC (Frame = 8000 Byte)

Remarks

In this research, the LAD algorithm increases TT by 4s on average, yet saves 200~600J compared with OD mode in cross-node situations. Because the experiments are limited to only two CPU frequency steps (2.3GHz and 1.15GHz), we cannot keep the CPU load on a smooth curve. Desktop and server CPUs do not stay at high load for long periods, for example when one concurrent job has completed and the next one has not yet started. Power-saving technology helps to decrease host energy consumption, which in turn reduces energy cost and carbon dioxide emissions.

6. Conclusions

The one-byte data frame is the smallest one, and in the cross-node situation its transmission time is 5 seconds shorter than that of the 1460-byte frame and 14 seconds shorter than that of the 8000-byte frame. This means that two kinds of applications, which do not need to transmit large amounts of data, are well suited to small data frames:

• Mathematical calculation

• Operation command sending in any application

Small data frames help to reduce transmission time and energy consumption, so more core calculation cycles can be released for CPU-bound jobs.

In addition, two kinds of applications are well suited to large data frames:

• Database servers that send data back

• Distributed file transmission

A larger data frame reduces frame generation time and transmits more data in a single frame because of its larger content space.
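One way to see the benefit of larger frames is to relate the measured energy to the amount of data moved. The sketch below uses the PM mode, rank number 8, no-delay entries from Tables 5, 7 and 9 and assumes that each run sends 100K frames of the given size, as stated in the experiment description. The function name is hypothetical, and the absolute figures depend on that frame-count assumption; only the trend across frame sizes is of interest.

```python
FRAMES_PER_EXPERIMENT = 100_000  # assumption taken from the text (100K frames)

# PM mode, rank number 8, TD = 0: frame size in bytes -> PC in joules.
PM_RANK8_PC = {1: 2957.550, 1460: 4014.203, 8000: 4959.571}


def joules_per_byte(energy_joules, frame_bytes, frames=FRAMES_PER_EXPERIMENT):
    """Energy spent per payload byte, assuming `frames` frames per run."""
    return energy_joules / (frame_bytes * frames)


cost = {size: joules_per_byte(pc, size) for size, pc in PM_RANK8_PC.items()}
for size, value in cost.items():
    print(size, "bytes:", format(value, ".3e"), "J per byte")
```

Under these assumptions the energy per payload byte falls by several orders of magnitude from the one-byte frame to the 8000-byte frame, even though the total energy of a run is higher for the larger frames.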

There are many directions in which to continue this investigation and develop methods to save energy. If hardware and software provide functions for voltage or speed control of the motherboard or any other type of peripheral device, then a hardware driver, power-aware job scheduling, and data distribution algorithms can be combined and implemented, with the aim of constructing a low-energy-cost cluster computing platform in the future.

In document: Executive Yuan National Science Council Research Project Results Report (pages 64-76)