In this chapter we evaluate the performance of TBMM and ATBMM and compare the results with other methods via simulation. The simulator is built on top of an event-driven simulator PeerSim [PeerSim]. The simulation model is summarized as follows. Each server is pre-configured to serve some of the applications, and the assignment will not change during the entire run. An application consists of several request types, each of which is associated with multi-resource requirements. When a request of a given type is generated, its actual requirement for each of the resources is determined randomly according to the request type’s multi-resource requirements. The request can be executed on a server only when the available resources of the server can satisfy all of the request’s resource requirements; otherwise the request is put in a queue waiting for execution. Once the request is put in the wait queue, it cannot be rescheduled to another server. To investigate the effectiveness of our load balancing method, we also adapt existing multi-resource load balancing methods BL, BB to fit our model, and compare them with our method.
We conduct the simulation for clusters of 32 servers. Each server has three kinds of resources: CPU (measured in Gflops - giga floating points operations per second), Memory (measured in GBs), and Network Bandwidth (measured in gb/s). There are two independent parameters used to configure the degree of heterogeneity of server resource capacities:
average server resource (Srm) and server resource variance (Srv). Srv is used to specify range of capacity for the resources within a server. A resource variance of zero implies the resource capacities of CPU, memory, and network bandwidth is the same for all the servers. We refer to this configuration as homogeneous configuration. Otherwise, the configurations are heterogeneous. A resource variance of Srv = ±X implies that the capacity of the resource will be assigned the value randomly within the range (Srm – X, Srm + X). In our simulation we set
the value of X to be 0, 20, 40, 60, 80 and 100, respectively.
CPU capacity is specified in terms of Gflops. We assume each machine uses time sharing to allocate the CPU resources among the requests it is processing. In other words, all requests in execution share the CPU resource, and they are executed by the CPU in a round-robin manner.
The workload model is described as follows. All the methods are simulated using static workload, except ATBMM which is simulated with time-varying workload. For each case, a suite of synthesized workload was generated. Two main settings of the simulated workload are the request arrival rate and the generation of multi-resource requirements. For both static workload and changing workload, we have experimented with different combinations of changing workloads. We allocate 1 to 3 workloads out of 6 workloads to each server. A workload can be served by more than one server. When a server is configured to be responsible for a workload, the corresponding request will arrive at the server’s local wait queue depending on the specified arrival rate following the Poisson distribution with mean λk
(0 ≤ k ≤ 5). The arrival rates for all workloads are adjusted such that resource utilization for the baseline algorithm (without load balancing) is around 80%. The multi-resource requirement of requests is generated as weak correlation among CPU, Memory and Network bandwidth requirements [Leinberger, et al].
The settings of 6 workloads, along with the initial configuration to each server are shown in Table 6.1 and Table 6.2.
Table 6.1: The multi-resource workloads used in the simulation:
Workload (jobs/
time unit)
CPU (gigaflop) Memory(GB) Network (gb/s)
W0 (λ = 2) 0.95~1.05 10~20 1~2
W1 (λ = 0.8) 0.35~0.45 10~20 0.5~1.5
W2 (λ = 1.21) 0.55~0.65 5~15 1~2
W3 (λ = 0.29) 0.37~0.47 1~9 2~3
W4 (λ = 0.57) 0.79~0.89 20~30 0.1~0.9
W5 (λ = 1.14) 1.63~1.73 10~20 1~2
Table 6.2: The 4 combination of 6 workload in the simulation:
Workload (jobs/
Figure 6.1: Initial configuration of workloads on servers Four Combinations of Workloads
W4 (λ = 0.57) N/A N/A N/A ○
W5 (λ = 1.14) N/A N/A ○ ○
To simulate demands surge with different duration and magnitude, we first referenced a 24-hour e-commerce trace from a large scale e-commerce site [Ranjan, et al]. In the study, the workload consists of 18 percent static requests, 57 percent dynamic request, 8 percent that are un-cacheable, and 17 percent request of other types. Static request usually incurs resource bottleneck of single resource, such as network transmission latency, while dynamic page request incurs a combination of multiple resources. The un-cacheable request incurs 500 and 100 times more CPU demand than a cache hit request. To parameterize and characterize the workload changing phenomena, we synthesize our changing workload using sine waves. The baseline of the sine wave is chosen to be workload without changing, that is, the baseline workload without surging user demands. We use two parameters, Wamplitude, Wcycle_time to control the “shapes” of the changing workload.
The amplitude of the sine wave (Wamplitude) is taken from 0.3 to 1, and the cycle time is 14 to 128. Figure 6.3 and Table 6.3 shows our synthesized changing workload pattern and the configuration for each parameter. All simulation is conducted on a time line of 700 units. We set Wamplitude to be ±0.3 to ±1.0 request per time unit. The baseline (±0 request per time unit) incurs no change to the workload. In Figure 6.3, for example, the cycle-time for the sine wave is 106 time units, with amplitude ±1.0 request per time unit. For the entire simulation run, there are about 6.6 cycles. As the cycle time increases, the number of cycles decreases (6.5 to 120 cycles in our experiments). Finally, the average token size is configured such that number of tokens each server has initially ranges 12 to 50.
Table 6.3: The changing workload configurations of simulation:
Amplitude ±0.3~1.0 request per time unit
Cycle time (Number of cycles per run) 14(50), 26(27), 66(10.6), 80(8.75), 106(6.6), 128(5.5) (unit: time unit (number of cycles per run)
Token size (Use token size and number of requests per 10 time units to
calculate number of token)
1.68 (unit: gigaflop): used in all
experiment except for token size effect comparison
Token size (Use token size and number of requests per 100 time units to
calculate number of token)
4, 6, 8, 10, 12, 14, 16, 16.80 (unit:
gigaflop): used in token size effect comparison
The major performance metrics measured in our simulation are response time, server utilization, queue length, and standard deviation of queue length. For server utilization, we measure CPU utilization as percentage of CPU in busy state, and use it as the server utilization. For memory and network utilization, it is measured as percentages of occupied capacities. The utilization of both can be affected by the job’s CPU requirement. In our model, CPU is busy as long as there are jobs in execution, and jobs have different execution time
Figure 6.2: Our synthesized workload
Cycle inter-arrival unit: 106; Amplitude=±1.0request (per time unit)
depending on CPU capacity of each server. Even if there is only one job, the CPU is busy but the other resources may be under-utilized. Therefore we separately observe 3 resources utilization and show only the CPU utilization as server utilization. We will discuss memory and network utilization shortly.
The following sections provide performance evaluation of TBMM and ATBMM, which are compared with other algorithms BL and BB as well as the baseline algorithm where no load balancing is performed.
First, we consider the case of static workload with varying server capacities. The results are shown in Figure 6.3 - 6.7. In Figure 6.3 and 6.4, as the heterogeneity of server capacity grows, all the methods except our TBMM have high server variance and low server utilization, which suggests that the inter-server imbalance degree is high, and there exist bottlenecks for some resources while the other resources remain idle. This causes more jobs to wait in the wait queue, and the overall average utilization becomes low. TBMM achieves the lowest server variance and the highest server utilization in all the cases, suggesting that a request has more chance to be redirected to servers with under-utilized resources, and thus more jobs can get executed instead of waiting in the queue. This is also confirmed in Figure 6.5 and 6.6, in which TBMM achieves the lowest response time and queue length. Figure 6.7 further shows that TBMM has lowest variance for queue length, showing that the quality of service seen from each request is more uniform.
Figures 6.8 - 6.11 show the average number of servers required against response times under a fix load. The load for each of the 32 servers is fixed load in this experiment. For a targeted response time such as 10 sec, TBMM uses 34 servers, while TBBL and TBBB uses
Figure 6.3, 6.4: Average standard deviation of server utilization and average server utilization
Figure 6.7: Average standard deviation of queue length
X-axis: Server Heterogeneity Degree Y: Average Response Time (unit)
Y: Average Server Utilization (100%)
Y: Queue Length
Y: Average Standard Deviation of Queue length
Y: Average Standard Deviation of Server Utilization
Figure 6.5, 6.6: Average response time and average queue length
about 38 servers, and the baseline method uses about 41 servers. Moreover, for a server number of 32, TBMM outperforms all of the other resource balancing algorithms Note that as the number of servers grows, all methods have lower server utilization. This means allocating more servers can improve performance, but utilization will drop, wasting resources.
Increasing the number of servers also lowers average queue length and variance, which reflects the fact that with more servers, jobs has more opportunities to be executed on any server, therefore fewer jobs will be queued and the performance improves.
Figure 6.8, 6.9: Average response time and average server utilization Y: Average Response Time (unit) Y: Average Server Utilization (100%)
X-axis: Number of servers
We examine the performance of ATBMM with changing workload as follows. Figures 6.12 - 6.15 show the results when increasing the cycle times. As cycle time increases, the average queue length increases. For shorter cycle time, although the job arrival rate surges during short period, it drops down quickly, too. The chances of forming a system bottleneck due to unhandled jobs increase as cycle time increases. We observe from the response time that ATBMM achieves lowest wait queue length. Moreover, it achieves lowest queue variance as cycle time increases. This shows that our method outperforms other multi-resource token-based algorithms.
Figure 6.10, 6.11: Average queue length and standard deviation Y: Queue Length Y: Average Standard Deviation of
Queue length
X-axis: Number of servers
We also evaluate the number-of-servers effect on the response time for changing workload. Figures 6.16 - 6.19 show the results for high magnitude (±1.0 request per time unit) and long cycle time. Again, the load of each server is fixed. For a fixed response time constraint such as 10 sec, ATBMM needs 34 servers, while TBMM needs 40 servers. For TBBL and TBBB, it’s about 50 servers. For baseline method, a number of about 54 servers is required. The queue length results have similar trend. Note that in the case of average standard deviation of queue length, ATBMM achieves a relatively low standard deviation, and remains low even if number of servers decreases. This shows that the queue lengths are
Figure 6.14, 6.15: Average queue length and standard deviation Figure 6.12, 6.13: Average response time and average server utilization Y: Average Response Time (unit) Y: Average Server Utilization (100%)
Y: Queue Length Y: Average Standard Deviation of Queue length
X-axis: Cycle-inter-arrival time (unit)
balanced for each of the servers, while the other methods have high imbalance degree for the queue lengths. Another noticeable phenomenon in the simulation is the dramatic improvement due to increasing number of servers. However, this significant improvement does not imply that the larger number of servers, the better the performance, because the server utilization will be lowered. In addition, we configure the system with a fix workload that is about 70% to 80% server utilization and average queue length about 90 jobs for baseline method. This already saturated workload may cause a high response time, while increasing more servers help distribute the workload to more servers, and average queue length decreases, and thus lower response time.
Figure 6.16, 6.17: Average response time and average server utilization Y: Average Response Time (unit) Y: Average Server Utilization (100%)
X-axis: Number of servers
We also investigate the effect of increasing token size. Figures 6.20 - 6.23 show the simulation results. As size of token increases, the number of tokens for each server decrease.
All the load balancing methods except ATBMM have long response times when the token size is small. The reason is that, when the token size is small, it becomes more difficult for other methods to balance the surging workload. In this case, our algorithm can adapt to changing workload, adding more tokens when workload surges, and can balance more system imbalance. Although not shown here, we have also observed that using small token size does not incur too much overhead because it does not induce excessive token exchanges when compared to the cases where tokens are larger.
Figure 6.18, 6.19: Average queue length and standard deviation Y: Queue Length Y: Average Standard Deviation of
Queue length
X-axis: Number of servers
Figure 6.20, 6.21: Average response time and average server utilization
Figure 6.22, 6.23: Average queue length and standard deviation
Y: Average Response Time (unit) Y: Average Server Utilization (100%)
Y: Queue Length Y: Average Standard Deviation of Queue length
X-axis: Token size (unit: 100 gigaflop)