Pre-order Deficit Round Robin: a new scheduling algorithm for packet-switched networks



Shih-Chiang Tsao (a,1), Ying-Dar Lin (b,*)

(a) Telecommunication Laboratories, Chunghwa Telecom Co., Ltd, Taoyuan, Taiwan

(b) Department of Computer and Information Science, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu 30050, Taiwan

Received 11 February 2000; accepted 31 July 2000

Responsible Editor: E. Knightly

Abstract

In recent years, many packet fair queueing algorithms have been proposed to approximate generalized processor sharing (GPS). Most of them provide a low end-to-end delay bound and ensure that all connections share the link in a fair manner. However, scalability and simplicity are two significant issues in practice. Deficit Round Robin (DRR) requires only O(1) work to process a packet and is simple enough to be implemented in hardware. However, its large latency and unfair behavior are not tolerable. In this work, a new scheme, Pre-order Deficit Round Robin (PDRR), is described, which overcomes the problems of DRR. A limited number, Z, of priority queues are placed behind the DRR structure to reorder the transmission sequence so as to approximate packet-by-packet generalized processor sharing (PGPS). We provide an analysis of latency and fairness, which shows our scheme to be a better alternative to DRR. In most cases PDRR has a per-packet time complexity of O(1), and O(log Z) in other specific cases. Simulation results are also provided to further illustrate its average behavior. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Packet scheduling; Fair queueing; Round robin; Deficit

1. Introduction

In recent years, many packet scheduling algorithms that aim to approximate generalized processor sharing (GPS) [1] have been proposed [2–7]. Stiliadis and Varma [8] presented a broad class of schedulers, described their common architecture and provided a systematic analysis of their latency and fairness properties. Generally, two major operations in this class of schedulers determine the implementation complexity: one is the maintenance of the virtual time, and the other is the selection of the most eligible packet to send out next among all active flow queues.


☆ This work was supported in part by the MOE Program of Excellence Research, 89-E-FA04-1-4.

* Corresponding author. Tel.: +886-3-573-1899; fax: +886-3-572-1490.

E-mail addresses: [email protected] (S.-C. Tsao), [email protected] (Y.-D. Lin).

1 Tel.: +886-3-424-5085.


To reduce the complexity of these two major operations in this class of schedulers, many scheduling algorithms have been proposed [3–7,9–11]. Some algorithms reduce the complexity to O(1) [9–11]. However, their schemes and analytical results can only be applied to fixed packet sizes and, hence, are only suitable for ATM networks. Others can handle variable-length packets and offer good scheduling properties such as fairness and latency. However, their complexities still depend on N, e.g., O(log N), where N is the number of flows [3,4,7]. Among them, Deficit Round Robin (DRR) [5] (a credit-based version of Round Robin) is an extreme case. It requires only O(1) work to process a packet and is amenable to variable-length packets. However, its latency and fairness bounds [8] are larger than those of other schedulers, such as self-clocked fair queueing (SCFQ) [3], which has O(log N) per-packet time complexity. How DRR performs these two operations with only O(1) work is of great interest; at the same time, the causes of its intolerable latency and fairness are what we have attempted to overcome.

Herein, a new scheduling algorithm, Pre-order Deficit Round Robin (PDRR), is proposed. Its objective is to solve the problems of DRR while keeping, in most cases, the advantage of O(1) per-packet time complexity and remaining amenable to variable-length packets. This goal is achieved through two improvements. First, a Pre-order Queueing module is appended to the original architecture of DRR. The new design solves the problem of small packets waiting too long to be transmitted, which is caused by sending large packets at inappropriate times. Second, the quantum update operation is separated from the dequeue operation of DRR, which enables a packet to be considered for sending out in the current round upon arrival. In the original DRR, a packet is not sent out until its flow's turn arrives; in the worst case, it may just miss the opportunity and therefore not be sent out in this round. These two improvements enable the server to reorder the transmission sequence of DRR within one round and send out a packet as early as its eligible time allows.

Next, we analyze PDRR with respect to three measures: latency, fairness and per-packet time complexity. The analysis confirms that PDRR offers better latency and fairness than DRR and lower time complexity than SCFQ. Finally, simulation results demonstrate the average behavior of PDRR.

The rest of the paper is organized as follows. Section 2 illustrates the problems of DRR. Section 3 presents a new scheme, PDRR, to resolve these problems. Section 4 derives analytical results for PDRR on latency, fairness, and per-packet time complexity, and Section 5 presents simulation results demonstrating its average behavior. Section 6 describes related work, and finally, Section 7 summarizes our work and outlines future directions.

2. Motivation

2.1. Deficit Round Robin

DRR [5], proposed by Shreedhar and Varghese, is a simple scheduling algorithm. Its structure is described conceptually here; for additional details, readers are referred to [5]. Examples illustrating its problems and our solutions are also provided.

The server in DRR selects packets to send out, in rotation, from all flows that have queued packets. DRR maintains a service list to keep the sequence of flows being served in a round and to avoid examining empty queues. If a flow has no packets in its queue, its identifier is deleted from the service list. The next time a packet arrives at a flow with an empty queue, the identifier of the flow is added to the tail of the list.

We now consider how many packets can be sent once a flow is served. For each flow, two variables, Quantum and DeficitCounter, are maintained. Quantum is the amount of credit, in bytes, allocated to a flow within the period of one round. Quantum_i for flow i can be derived as


$$\mathrm{Quantum}_i = \frac{r_i}{C}\,F, \qquad (1)$$

where r_i is the rate allocated to flow i, C is the link service rate, and F is the frame size, i.e., the sum of the Quantums of all flows. DeficitCounter_i^{j-1} accumulates the residual Quantum_i of flow i in the (j−1)th round. The next time flow i is served, it can send out an additional DeficitCounter_i^{j-1} bytes of data in the jth round. Under this rule, each time flow i is served, two steps are performed. First, the server updates DeficitCounter_i^j as

$$\mathrm{DeficitCounter}_i^{j} = \mathrm{DeficitCounter}_i^{j-1} + \mathrm{Quantum}_i.$$

Second, it checks the size of the packet at the head of the queue of flow i. If the size is no larger than DeficitCounter_i^j, DeficitCounter_i^j is decreased by this packet size and the packet is sent out. The server repeats this operation until either the size of the head packet is larger than DeficitCounter_i^j, that is, there are insufficient credits to serve the next packet, or no packets remain in the queue of flow i. In the former case, the transmission of this packet is deferred and the residual value in DeficitCounter_i^j is held until the flow's next turn. The next time flow i gets its turn, it can send out DeficitCounter_i^j bytes of data in addition to Quantum_i bytes. In the latter case, DeficitCounter_i^j is reset to zero; that is, the residual credits from the previous round cannot be carried over to serve a later burst, since doing so could delay the service of other flows.
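The per-flow service rule above can be sketched in Python as follows. This is a minimal illustration that assumes packets are represented only by their sizes in bytes and that a round visits the active flows in service-list order; the function names `drr_serve_flow` and `drr_round` are hypothetical, not taken from [5].

```python
from collections import deque


def drr_serve_flow(queue, deficit, quantum):
    """Serve one DRR turn of a single flow (sizes in bytes).

    queue   -- deque of packet sizes, head on the left
    deficit -- DeficitCounter left over from the previous round
    quantum -- Quantum of the flow
    Returns (sizes of packets sent, updated DeficitCounter).
    """
    deficit += quantum                     # DeficitCounter^j = DeficitCounter^{j-1} + Quantum
    sent = []
    while queue and queue[0] <= deficit:   # head packet fits in the remaining credit
        size = queue.popleft()
        deficit -= size
        sent.append(size)
    if not queue:
        deficit = 0                        # queue emptied: residual credit is not carried over
    return sent, deficit


def drr_round(flows):
    """Visit every flow (dicts with 'queue', 'deficit', 'quantum') once, in list order."""
    output = []
    for flow in flows:
        sent, flow["deficit"] = drr_serve_flow(flow["queue"], flow["deficit"], flow["quantum"])
        output.extend(sent)
    return output
```

For instance, a flow with a Quantum of 400 bytes and queued packets of 300 and 500 bytes sends only the 300-byte packet in its first turn and carries a DeficitCounter of 100 into the next round, where the 500-byte packet finally fits.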

2.2. Quantum size

Choosing the Quantum size is an important issue. According to Shreedhar and Varghese [5], if the work for DRR is to be O(1) per packet, the Quantum of a flow should be no smaller than the maximum packet size of the flow, so that at least one packet per backlogged flow can be served in a round. Besides, by the meaning of Quantum, for any two flows i and j, we have

$$\frac{\mathrm{Quantum}_i}{\mathrm{Quantum}_j} = \frac{r_i}{r_j}. \qquad (2)$$

To satisfy both constraints for all flows, the Quantum of a flow may have to be very large, many times its maximum packet size.

An example further illustrates this problem. Assume that four flows, whose traffic parameters are shown in Table 1, share the same link, whose capacity is 160 Mbps (megabits per second). Consider how the Quantums are determined. Under the above two constraints, flow B, which has the largest ratio of maximum packet size to reserved rate, was selected as the base, and its Quantum was set to its maximum packet size, 640 bytes. The Quantums of the other flows are set according to (2). Notably, the Quantum size of flow D is 25.6 times its maximum packet size.

Table 1

The traffic parameters and quantum sizes of four flows

Flow ID | Reserved rate (Mbps) | Traffic type | Maximum packet size (bytes) | Ratio of max. packet size to reserved rate (μs) | Quantum size (bytes)
A | 12.8 | CBR | 400 | 250 | 512
B | 16 | CBR | 640 | 320 | 640
C | 64 | CBR | 800 | 100 | 2560
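The Quantum selection rule used for Table 1 can be reproduced with a short Python sketch, assuming rates are given in Mbps and packet sizes in bytes; `quantums_from_rates` is a hypothetical helper. Only flows A, B and C are included, because only their rows appear in the table.

```python
def quantums_from_rates(rates_mbps, max_pkt_bytes):
    """Choose DRR Quantums (bytes) that satisfy both constraints of Section 2.2.

    1. Quantum_i >= L_i (maximum packet size) for every flow.
    2. Quantum_i / Quantum_j = r_i / r_j (Eq. (2)).
    The flow with the largest L_i / r_i ratio becomes the base, with Quantum = L_i;
    every other Quantum is scaled in proportion to its reserved rate.
    """
    base = max(rates_mbps, key=lambda f: max_pkt_bytes[f] / rates_mbps[f])
    bytes_per_mbps = max_pkt_bytes[base] / rates_mbps[base]
    return {f: rate * bytes_per_mbps for f, rate in rates_mbps.items()}


rates = {"A": 12.8, "B": 16.0, "C": 64.0}     # reserved rates from Table 1 (Mbps)
max_pkt = {"A": 400, "B": 640, "C": 800}      # maximum packet sizes from Table 1 (bytes)
quantums = quantums_from_rates(rates, max_pkt)
```

Flow B (40 bytes of maximum packet per Mbps, the largest ratio) becomes the base, which yields Quantums of 512, 640 and 2560 bytes for A, B and C, matching the table; flow C's Quantum is already 3.2 times its maximum packet size.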


2.3. Problems and causality of DRR

Fig. 1 depicts the problems that arise when the Quantum size is many times the maximum packet size. Suppose that four flows request the same amount of bandwidth and have fixed but heterogeneous packet sizes. The same Quantum is assigned to all of them and, according to the rules in Section 2.2, the Quantum should be equal to the largest maximum packet size among all flows. Assume that packets 1, 4, 6, and B arrive at the same time and that all flows have greedy sources, i.e., all flows are heavily backlogged. Comparing the output pattern of DRR with that of WFQ reveals three problems. First, packets 1, 4 and 6 should be transmitted according to the sequence of service completions in the fluid server, namely 6, 1 and 4. However, DRR only considers whether a packet can be sent out in a round and ignores the proper transmission order. Second, packets 6, 7, 8 and 9 are sent out in a batch, which is undesirable in terms of latency and fairness. Third, packet B, whose size is slightly greater than the residual credits of this round, is delayed until the flow's next turn, after all other flows finish their transmissions in the second round. Compared with the result in WFQ, it may wait too long. This delay increases with the frame size, and a larger Quantum produces a larger frame size.

Thus, all three problems above stem from the Quantum size being large compared with the maximum packet size. Reducing the Quantum size is an intuitive solution. However, it is ineffective, as it would often cause the server to find nothing to send out after querying all flows. Thus, we do not reduce the Quantum size. Instead, a Pre-order Queueing module is appended to the original DRR scheduler to divide the Quantum into several parts, which lets a flow spend its Quantum in pieces within a round. Furthermore, for the packet described in the third problem, its position in the transmission sequence is decided only by the portion of its size that exceeds the residual credit of the previous round, which enables the server to treat it as fairly as packets smaller than the Quantum.

3. Pre-order Deficit Round Robin

3.1. Architecture

A new algorithm, Pre-order Deficit Round Robin (PDRR), is presented herein. Fig. 2 illustrates an overview of the PDRR architecture. An additional module, Pre-order Queueing, which consists of a classifier sub-module and a priority queueing sub-module, is appended to the original architecture of DRR to pre-order the service order of DRR and overcome the problems described in Section 2.3. As Fig. 2 shows, the priority queueing sub-module has a limited number, Z, of priority queues. Let Pq_1, Pq_2, ..., Pq_Z denote the priority queues. When the server is ready to transmit, it picks one packet from Pq_j with the smallest j among all nonempty priority queues. Within one round, even if the server has already sent out all packets in Pq_{j+1}, a packet that arrives at Pq_j is sent out before any packet in Pq_l, where l is larger than j.

Fig. 1. Input and output patterns of packets in DRR and WFQ. Packets on the right-hand side of the broken line are allowed to be sent out in the first round. Packet B is excluded from the first round.

The classifier sub-module decides which priority queue a packet should enter. Since the server picks packets from Pq_j with the smallest j among all nonempty ones, Pre-order Queueing in effect determines the order in which the packets of a round appear on the output link. Before describing our classifier, we show why PDRR can approximate PGPS. Assume that all flows become heavily backlogged at time t, and review the output pattern of WFQ (Fig. 1). The server in WFQ selects packets to send out according to the sequence of service completions in the fluid server. That is, the packet with the smallest virtual finishing timestamp among the packets at the heads of all flow queues is sent out first. Under the above heavy-backlog assumption, the virtual finishing timestamp of a packet is computed as

$$TS_i^m = TS_i^{m-1} + \frac{L_i^m}{r_i}, \qquad (3)$$

where TS_i^m denotes the timestamp of the mth packet of flow i after time t and, for all i, TS_i^0 is set to zero at time t; r_i denotes the allocated rate of flow i; and L_i^m denotes the size of the mth packet of flow i after time t.

By substituting Acc_i^m for TS_i^m r_i, Eq. (3) is equivalent to

$$\frac{Acc_i^m}{\mathrm{Quantum}_i} = \frac{Acc_i^{m-1} + L_i^m}{\mathrm{Quantum}_i}, \qquad (4)$$

where Acc_i^m denotes the accumulated amount of data, in bytes, that flow i has sent out after transmitting its mth packet after time t. Assume that all m packets can be transmitted in the kth round. By replacing Acc_i^m with DeficitCounter_i^0 − DeficitCounter_i^m, Eq. (4) is equivalent to

$$\frac{\mathrm{DeficitCounter}_i^m}{\mathrm{Quantum}_i} = \frac{\mathrm{DeficitCounter}_i^{m-1} - L_i^m}{\mathrm{Quantum}_i}, \qquad (5)$$

where DeficitCounter_i^m denotes the residual credits of flow i in this round after it puts its mth packet into the Pre-order Queueing. To further illustrate the equivalence, the following definition is required.

Definition 1. The Quantum Availability, QA_i^m, of the packet P_i^m is the ratio of its DeficitCounter_i^m to Quantum_i, i.e.,

$$QA_i^m = \frac{\mathrm{DeficitCounter}_i^m}{\mathrm{Quantum}_i}. \qquad (6)$$

Fig. 2. The architecture of PDRR. Fq_i denotes the queue of flow i. Pq_j denotes the priority queue j in Pre-order Queueing. D_i denotes …

Lemma 1. For any packet P_i^m, its Quantum Availability, QA_i^m, satisfies

$$0 \le QA_i^m < 1.$$

Proof. Readers are referred to Appendix A.

Lemma 2. For the packet with the smallest timestamp in one round, its QA is the largest.

Proof. Readers are referred to Appendix A.
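Lemma 2 can be checked numerically for the first round after time t. The Python sketch below is a hedged illustration, not the proof in Appendix A: it assumes every flow is backlogged at t with DeficitCounter_i^0 = Quantum_i, computes TS_i^m with Eq. (3) and QA_i^m with Eqs. (5) and (6), and uses exact fractions so that ties compare correctly. The helper `first_round_tags` and the sample flows are hypothetical.

```python
from fractions import Fraction


def first_round_tags(flows, link_rate, frame):
    """Return (flow, m, TS, QA) for each packet that fits into the first round.

    flows     -- dict flow -> (allocated rate r_i, list of packet sizes in bytes)
    link_rate -- C; frame -- F, so Quantum_i = r_i * F / C as in Eq. (1)
    """
    tags = []
    for flow, (rate, sizes) in flows.items():
        quantum = Fraction(rate) * frame / link_rate
        ts = Fraction(0)          # TS_i^0 = 0 at time t
        deficit = quantum         # DeficitCounter_i^0 in the first round
        for m, size in enumerate(sizes, start=1):
            if size > deficit:
                break             # this packet must wait for a later round
            ts += Fraction(size) / rate                    # Eq. (3)
            deficit -= size                                # Eq. (5), scaled by Quantum_i
            tags.append((flow, m, ts, deficit / quantum))  # QA_i^m, Eq. (6)
    return tags


# Rates sum to C = 5, so F = 2000 bytes gives Quantums of 400, 400, 800 and 400.
flows = {"1": (1, [100, 100, 100, 100]), "2": (1, [150, 150]),
         "3": (2, [300, 300]), "4": (1, [500])}
tags = first_round_tags(flows, link_rate=5, frame=2000)
```

In this sample every first-round packet satisfies TS_i^m = (F/C)(1 − QA_i^m), so sorting by increasing timestamp and by decreasing QA gives the same order; the 500-byte packet of flow 4 does not fit into its 400-byte Quantum and is excluded, like packet B in Fig. 1.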

According to the Smallest virtual Finishing time First policy [12], to approximate the sequence of service completions in GPS, the server should select the packet with the smallest timestamp to send out. By Lemma 2, our architecture therefore selects the packet with the largest QA within one round. However, finding the packet with the largest QA among all packets that can be sent out in this round takes too much time. Thus, our classifier only classifies packets into several classes according to their QA and places them into the corresponding priority queues. Assume that there are Z priority queues and, hence, Z classes. For the mth packet of flow i that can be sent out in this round, its class n_i^m is derived as

$$n_i^m = Z - \left\lfloor QA_i^m \cdot Z \right\rfloor = Z - \left\lfloor \frac{\mathrm{DeficitCounter}_i^m}{Pqg_i} \right\rfloor, \qquad (7)$$

where DeficitCounter_i^m denotes the residual credits, in bytes, of flow i in the kth round after the mth packet is placed into a priority queue, and Pqg_i denotes the granularity of the priority queues for flow i, derived as

$$Pqg_i = \frac{\mathrm{Quantum}_i}{Z}. \qquad (8)$$
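Eqs. (7) and (8) reduce to integer arithmetic, because ⌊DeficitCounter_i^m / Pqg_i⌋ = ⌊DeficitCounter_i^m · Z / Quantum_i⌋. The Python sketch below uses that form to avoid floating-point rounding; `classify` is a hypothetical name, and credits are assumed to be whole bytes.

```python
def classify(deficit_after, quantum, z):
    """Class n_i^m of a packet, per Eqs. (7)-(8); class 1 is served first.

    deficit_after -- DeficitCounter_i^m after the packet is placed (bytes)
    quantum       -- Quantum_i (bytes); z -- number of priority queues Z
    """
    if not 0 <= deficit_after < quantum:
        # Lemma 1 gives 0 <= QA < 1, so any other value indicates a bookkeeping error
        raise ValueError("expected 0 <= DeficitCounter < Quantum")
    # Z - floor(DeficitCounter / (Quantum / Z)), computed exactly with integers
    return z - (deficit_after * z) // quantum
```

With the numbers of Section 3.3 (Quantum 400, Z = 4, residual credit 300 after packet B), `classify(300, 400, 4)` returns 1, while a packet that exhausts its flow's credit (`classify(0, 400, 4)`) lands in class Z = 4.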

3.2. Algorithm

Our scheduling algorithm is described formally herein. It repeatedly performs three modules: PKT_Pass, PKT_Arrival and PKT_Departure. PKT_Pass manages the update of the DeficitCounter of each flow and classifies eligible packets from their Fq into the priority queueing sub-module. PKT_Arrival simply places each arriving packet into its corresponding Fq. PKT_Departure declares the beginning of a new round and notifies PKT_Pass to process non-empty Fq's. Once there are packets in the priority queueing sub-module, PKT_Departure repeatedly picks a packet to serve from Pq_j with the smallest j among non-empty priority queues. A min heap, a complete binary tree in which the key of each node is no larger than the keys of its children (if any), is used to let PKT_Departure efficiently obtain the smallest j.
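The interplay between the priority queues and the min heap can be sketched in Python with the standard `heapq` module. This is a simplification under stated assumptions: a single thread, no MH_Lock/MH_Unlock or events, and a hypothetical class name `PreOrderQueues`. The heap holds exactly the indices j of non-empty Pq_j, so `heap[0]` plays the role of MinHeapRoot.

```python
import heapq
from collections import deque


class PreOrderQueues:
    """Z FIFO priority queues plus a min heap of the indices of non-empty ones."""

    def __init__(self, z):
        self.pq = [deque() for _ in range(z + 1)]   # Pq_1..Pq_Z; index 0 unused
        self.heap = []                               # keys j of non-empty Pq_j

    def put(self, j, packet):
        if not self.pq[j]:
            heapq.heappush(self.heap, j)   # MH_Insert(j): Pq_j was empty, O(log Z)
        self.pq[j].append(packet)

    def get(self):
        """Dequeue from Pq_MinHeapRoot; return None when every Pq_j is empty."""
        if not self.heap:
            return None
        j = self.heap[0]                   # MinHeapRoot, read in O(1)
        packet = self.pq[j].popleft()
        if not self.pq[j]:
            heapq.heappop(self.heap)       # MH_Delete(): root removed, O(log Z)
        return packet
```

A packet placed into a lower-numbered queue is served first even if other packets were already waiting: after `put(3, "a")`, `put(1, "b")` and `put(3, "c")`, the first `get()` returns "b", and a later `put(2, "d")` is then served before "a" and "c".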

Before the pseudocode of the three operations is examined, Tables 2 and 3 give the necessary definitions.

The PKT_Arrival and PKT_Pass procedures work as follows.

On the arrival of packet p, PKT_Arrival obtains its flow identifier i and places p into Fq_i. It then notifies PKT_Pass if there are no other packets in Fq_i. For each packet that can be sent out in this round, PKT_Pass decides to which class j it belongs and moves it from its Fq into the corresponding priority queue Pq_j. If it is the first packet in Pq_j, the function MH_Insert is called with parameter j to indicate that Pq_j is no longer empty. Conversely, if a flow has packets that cannot be sent out in this round because of an insufficient DeficitCounter, the flow identifier is inserted into the AckList. From the AckList, PKT_Departure knows which flow queues still have packets in this round and accumulates their residual credits for the next round.

Table 2

The operations used in the PKT_Arrival, PKT_Departure and PKT_Pass procedures

Operation | Description
Enqueue, Dequeue, NonEmpty, Empty, Head | The standard queue operations
NumItem(A) | Returns the number of entries in A
MH_Insert(x), MH_Delete(), MH_Empty() | The standard min heap operations
MinHeapRoot | The minimum value among the nodes of the min heap
MH_Lock, MH_Unlock | After locking, only one module can access the min heap
SetEvent(ev) | Signals the event ev; the event remains signaled until someone releases it
EventCase = WaitEvents(ev1, ev2, ...) | Once any event is in the signaled state, returns it to EventCase and releases it; ev1 has priority over ev2
SendMsg(x, y) | Sends a message with value y to x

The PKT_Departure procedure works as follows.

Once PKT_Pass pushes packets into the priority queueing sub-module, the event EV_minheap is also signaled to notify PKT_Departure to send out a packet drawn from Pq_MinHeapRoot. After the transmission is complete, ServerIdle is signaled and PKT_Departure repeats the last action until Pq_MinHeapRoot is empty. At this point, the function MH_Delete deletes the root node of the min heap and sets MinHeapRoot to the smallest j among the remaining nodes. When the min heap is empty, i.e., all Pq_j's are empty, PKT_Departure declares the arrival of a new round by adding 1 to Round_sys. For all non-empty Fq's, i.e., flows with packets remaining from the last round, it updates their DeficitCounters according to their order in the AckList and requests that PKT_Pass classify the eligible packets into the priority queueing sub-module. Then, it waits again until new packets are inserted into the priority queueing sub-module.

Table 3

The variables used in the PKT_Arrival, PKT_Departure and PKT_Pass procedures

Variable | Description
Fq_i | The queue of flow i, i = 1, ..., N
Pq_j | Priority queue j, j = 1, ..., Z
Quantum_i | The credit allocated to flow i in one round
DC_i | The DeficitCounter of flow i
Pqg_i | The granularity of the priority queues for flow i
Z | The number of priority queues
Round_sys | The identifier of the current system round
Round_i | The identifier of the round of flow i
EV_minheap | An event signaled when a key is placed into the empty min heap
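The three procedures above can be condensed into a single-threaded Python sketch for the heavy-backlog case discussed in Section 3.1. It is a simplification under stated assumptions, not the authors' pseudocode: all packets are already queued at time 0, no new arrivals occur, and the event signaling among PKT_Arrival, PKT_Pass and PKT_Departure is replaced by sequential loops. `pdrr_order` and its arguments are hypothetical names.

```python
import heapq
from collections import deque


def pdrr_order(flows, quantums, z):
    """Return packet ids in PDRR transmission order (heavy-backlog sketch).

    flows    -- dict flow -> list of (packet_id, size_bytes) in arrival order
    quantums -- dict flow -> Quantum_i in bytes
    z        -- number of priority queues Z
    """
    fq = {f: deque(pkts) for f, pkts in flows.items()}   # Fq_i
    deficit = {f: 0 for f in flows}                       # DC_i
    pq = [deque() for _ in range(z + 1)]                  # Pq_1..Pq_Z (index 0 unused)
    heap = []                                             # min heap of non-empty j
    order = []
    ack_list = [f for f in flows if fq[f]]                # flows still holding packets
    while ack_list:
        # Start of a round: credit each listed flow, then classify (PKT_Pass).
        next_ack_list = []
        for f in ack_list:
            deficit[f] += quantums[f]
            while fq[f] and fq[f][0][1] <= deficit[f]:
                pid, size = fq[f].popleft()
                deficit[f] -= size
                j = z - (deficit[f] * z) // quantums[f]   # Eqs. (7)-(8)
                if not pq[j]:
                    heapq.heappush(heap, j)               # MH_Insert(j)
                pq[j].append(pid)
            if fq[f]:
                next_ack_list.append(f)   # keep residual credit for the next round
            else:
                deficit[f] = 0            # emptied queue: credit reset as in DRR
        # PKT_Departure: serve Pq_MinHeapRoot until empty, then delete the root.
        while heap:
            j = heap[0]
            order.append(pq[j].popleft())
            if not pq[j]:
                heapq.heappop(heap)       # MH_Delete()
        ack_list = next_ack_list
    return order
```

For four equal-rate flows with Quantum 400 bytes and Z = 4 (flow 1: p1, p2 of 200 bytes; flow 2: p4 of 100 and p5 of 300 bytes; flow 3: pB of 500 bytes; flow 4: p6, p7 of 400 bytes), plain DRR would emit p1, p2, p4, p5, p6 in the first round. The sketch instead returns p4, p1, p2, p5, p6 and then serves pB ahead of p7 in the second round, which is consistent with the nondecreasing order of the timestamps of Eq. (3) for this input.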

3.3. Example

An example that uses the same input pattern and assumptions as in Fig. 1 is given to show that our algorithm can overcome the problems of DRR described in Section 2.3. In this example, packets are classified into 4 classes, i.e., Z = 4. Suppose that the Quantums of all flows are equal to 400 and the size of packet B is 500. Clearly, packet B cannot be sent out in the first round. However, in the next round, DeficitCounter_B would be equal to 300, i.e., 400 + 400 − 500. According to (7), the packet would be classified into the first class, i.e., 4 − ⌊300/(400/4)⌋ = 1, and could be sent out at the beginning of the next round. Other packets are put into the priority queueing sub-module according to the same rule. Fig. 3 presents the result, which is the same as that of WFQ (Fig. 1) and unlike that of DRR. Under the assumption that all flows are heavily backlogged and there are enough priority queues, our algorithm is expected to behave similarly to WFQ.

4. Analytical results

In this section, the performance of PDRR is analyzed in terms of delay bound and throughput fairness. Finally, PDRR is shown to have an O(1) per-packet time complexity in most cases. In high-speed networks, low per-packet time complexity is an important criterion for routers and switches.

4.1. Delay bound

Consider a queueing system with a single server of rate C.

Definition 2. A backlogged period for flow i is any period of time during which flow i is continuously backlogged in the system.

Let t_0 be the beginning of a backlogged period of flow i and t_k the time at which the kth round in PDRR is completed. W_i(τ, t_k) denotes the service offered to flow i by the server in the interval (τ, t_k], and L_i is the maximum packet size of flow i.


Lemma 3. Under PDRR, if flow i is continuously backlogged in the interval (t_0, t_k], then at the end of the kth round,

$$W_i(t_0, t_k) \ge k\phi_i - D_i^k, \qquad (9)$$

where D_i^k is the value of DeficitCounter_i at the end of the kth round and φ_i is Quantum_i.

Lemma 4. Let t_0 be the beginning of a backlogged period of flow i in PDRR. At any time t during the backlogged period,

$$W_i(t_0, t) \ge \max\left\{0,\; r_i\left(t - t_0 - \frac{(2 + 1/Z)F - \phi_i}{C}\right)\right\}, \qquad (10)$$

where F is equal to $\sum_{i=1}^{\#\mathrm{flow}} \phi_i$, Z is the number of priority queues, and r_i is the rate allocated to flow i.

Theorem 1 (PDRR belongs to LR). The PDRR server belongs to the class of LR servers, with latency Θ^{PDRR} satisfying

$$\Theta^{PDRR} \le \frac{(2 + 1/Z)F - \phi_i}{C}. \qquad (11)$$

Proof. The proof follows directly from Lemma 4 and the definition of an LR server [8]. According to (1), replacing F with φ_i C/r_i in (11), we also have

$$\Theta^{PDRR} \le \left(2 + \frac{1}{Z}\right)\frac{\phi_i}{r_i} - \frac{\phi_i}{C}. \qquad (12)$$

Eq. (11) shows that PDRR evidently improves latency compared with DRR, whose latency is (3F − 2φ_i)/C. Furthermore, in the worst case, if the form of Θ^{SCFQ} [8] is translated into the form of Θ^{PDRR}, the latency of PDRR is revealed to be similar to that of SCFQ, (2F − φ_i)/C. Eq. (12) shows that the latency of PDRR is inversely dependent on the allocated bandwidth and independent of the number of active flows. The major reason for DRR's worse latency is that it only considers whether a packet can be sent out in a round and ignores the information provided by the quantum consumed by a packet. The latter information can be used to reorder the packets sent out in a round.
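The three latency expressions can be compared numerically. The Python sketch below assumes F and φ_i in bytes and C in bytes per second; the sample values (four flows with φ_i = 500 bytes, so F = 2000 bytes, on a 10 Mbps link) are hypothetical.

```python
def latency_drr(F, phi, C):
    return (3 * F - 2 * phi) / C            # DRR latency [8]


def latency_pdrr(F, phi, C, Z):
    return ((2 + 1 / Z) * F - phi) / C      # Eq. (11)


def latency_pdrr_by_rate(phi, r, C, Z):
    return (2 + 1 / Z) * phi / r - phi / C  # Eq. (12), with F = phi * C / r


def latency_scfq(F, phi, C):
    return (2 * F - phi) / C                # SCFQ latency in the form of Eq. (11)


F, phi, C = 2000, 500, 1.25e6               # bytes, bytes, bytes/s (10 Mbps)
r = phi * C / F                             # rate implied by Eq. (1): 312,500 bytes/s
```

These values give 4.0 ms for DRR, 3.2 ms for PDRR with Z = 4 and 2.8 ms for SCFQ; as Z grows, the PDRR bound approaches the SCFQ value.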

Theorem 2 (Delay bound). Suppose that the scheduling algorithm at the server is PDRR and the traffic of flow i conforms to a leaky bucket with parameters (σ_i, ρ_i), where σ_i and ρ_i denote the burstiness and the average rate of flow i, respectively. Assume that the rate allocated to flow i is equal to ρ_i. If Delay_i is the delay of any packet of flow i, then

$$\mathrm{Delay}_i \le \frac{\sigma_i}{\rho_i} + \frac{(2 + 1/Z)F - \phi_i}{C}. \qquad (13)$$
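Eq. (13) is straightforward to evaluate. The Python sketch below assumes a hypothetical link with F = 2000 bytes and C = 1.25 × 10^6 bytes/s (10 Mbps), Z = 4, and a flow with σ_i = 5000 bytes and ρ_i = 312,500 bytes/s, whose Quantum under Eq. (1) is φ_i = 500 bytes; `pdrr_delay_bound` is a hypothetical name.

```python
def pdrr_delay_bound(sigma, rho, F, phi, C, Z):
    """Eq. (13): leaky-bucket delay bound of flow i under PDRR, with r_i = rho_i.

    sigma, F, phi in bytes; rho and C in bytes per second; result in seconds.
    """
    return sigma / rho + ((2 + 1 / Z) * F - phi) / C


F, C, Z = 2000, 1.25e6, 4
rho = 312_500                    # bytes/s allocated to flow i
phi = rho * F / C                # Quantum_i from Eq. (1): 500 bytes
bound = pdrr_delay_bound(5000, rho, F, phi, C, Z)
```

The burst term σ_i/ρ_i contributes 16 ms and the latency term 3.2 ms, for a bound of 19.2 ms.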


4.2. Throughput fairness

The fairness parameter herein is based on the definition presented by Golestani [3]. However, we followed the derivation techniques of [8] to obtain our result.

Theorem 3. For a PDRR scheduler,

$$\mathrm{Fairness}^{PDRR} = \frac{(2 + 1/Z)F}{C}, \qquad (14)$$

where Fairness^{PDRR} is the fairness of the PDRR server.

Thus, Fairness^{PDRR} is smaller than Fairness^{DRR}, which is 3F/C [8]. The fairness and latency of other scheduling algorithms have been analyzed in [8].

4.3. Per-packet time complexity

Table 4 summarizes the complexity of PKT_Arrival, PKT_Pass and PKT_Departure when a non-empty or an empty priority queue is accessed. For each packet, PKT_Arrival inserts it into its corresponding Fq, and PKT_Pass moves it from its Fq to Pq_j, where j is found in a constant number of operations.

PKT_Departure simply picks packets repeatedly from Pq_MinHeapRoot, where MinHeapRoot always holds the smallest j among non-empty Pq_j's. Because no min heap operation is invoked as long as each accessed Pq is non-empty, all of the above operations take O(1) time. When Pq_MinHeapRoot becomes empty, PKT_Departure invokes a delete operation on the min heap to obtain the new MinHeapRoot. Basically, packets cannot be sent out until the re-heapification loop ends, and that loop has time complexity O(log Z), where Z is the maximum number of keys present in the min heap, i.e., the number of non-empty Pq's at that moment. A similar situation also occurs when PKT_Pass must insert a new j into the min heap.

However, there are sufficient reasons to say that min heap operations rarely delay the transmission of packets. First, the min heap is involved in our algorithm only when the Pq being accessed is empty, unlike sorted-priority algorithms such as SCFQ and WFQ, where the insert and delete operations are invoked every time the server sends out a packet. Second, the min heap is small: its maximum number of keys is the number of Pq's instead of the number of flows. Moreover, the approach presented in [13] allows concurrent insertions and deletions on the heap. Following that approach, PKT_Departure can obtain the next smallest value j immediately via one comparison between the two child keys of the root, and then start sending out the packets in Pq_j concurrently while the re-heapification loop runs. In brief, the per-packet time complexity of PDRR is O(1) in most cases and O(log Z) in some special cases, which is lower than that of SCFQ, O(log N), where N is the number of flows.

Table 4

The complexity of PDRR

Operation case | PKT_Arrival | PKT_Pass | PKT_Departure
Pq non-empty | O(1) | O(1) | O(1)


5. Simulation results

Simulation results are shown in this section to compare the delay performance of PDRR with that of DRR and SCFQ. The link bandwidth, i.e., the server capacity, is assumed to be 80 Mbps, shared by 20 flows. These 20 flows are divided into two groups, GA and GB. Since the goal of this simulation is not to compare sorted-priority scheduling algorithms, which belong to a different class from ours, only SCFQ, the easiest to implement among all sorted-priority algorithms, was chosen, in addition to DRR, for comparison with PDRR.

5.1. Bursty transmission

The following experiment was performed to show the bursty transmission problem of DRR and the improvement of PDRR. Assume that all traffic sources are constant bit rate (CBR) with fixed packet sizes. Table 5 lists the traffic parameters of the two groups, GA and GB. As the bit rates of GA and GB are equal and the maximum packet size in both groups is 500 bytes, the same Quantum size of 500 bytes was allocated to them. Notably, the packet arrival rate of GA flows is 10 times that of GB flows. In this experiment, 10 priority queues were placed in the Pre-order Queueing of PDRR.

The delays of a particular GA flow under DRR, PDRR, and SCFQ were measured. Fig. 4 shows that, in DRR, the packets of the flow are sent out in a batch once the flow is served, and the packets that just missed their transmission chance receive the largest delay. In PDRR, however, the flow is able to spend its Quantum in several pieces, so that packets are sent out uniformly. Another observation in this special case is that packets in SCFQ suffer high delay jitter. This is because GA flows have a packet arrival rate 10 times that of GB flows: GA packets that arrive while the server is serving a large GB packet receive equal virtual arrival times, which causes the server not to send them out in their actual arrival order.

5.2. Packet size

To further examine the average behavior of PDRR, the traffic source is assumed to be an MMPP shaped by a leaky bucket, whose on/off rates are both 1000 1/μs and whose bit rate is 4 Mbps. In this experiment, GA flows are assigned a larger packet size than GB flows; however, all flows request the same bandwidth. Fig. 5 shows that a GA flow in DRR has a larger average delay than in PDRR and SCFQ. The result for GB flows is similar. Additionally, as GA's packet size increases relative to GB's, the curve of PDRR in Fig. 5 approaches that of SCFQ, which means that PDRR works like SCFQ, especially with heterogeneous traffic sources. This is because the more heterogeneous the traffic sources are, the burstier the transmission behavior of GA flows becomes. Fig. 6 shows that as the ratio of GA's packet size to GB's increases, small packets in PDRR and SCFQ perform better than large packets. However, this is not so evident for DRR. The reason is that the server in PDRR considers the information provided by the quantum consumed by a packet and reorders the transmission sequence of packets within one round. This is similar to a server in SCFQ considering the timestamp of a packet and then sending out the packet with the smallest timestamp. In DRR, however, the server only considers whether a packet can be sent out and ignores the transmission order of packets within one round.

Table 5

The traffic parameters and quantum sizes of two groups

Group | Traffic type | Bit rate (Mbps) | Packet size (bytes) | Quantum size (bytes)
GA | CBR | 4 | 50 | 500
GB | CBR | 4 | 500 | 500

5.3. Bit rate

When flows have heterogeneous bandwidth requirements, a flow with a high bit rate in DRR still exhibits bursty transmission. Our algorithm can therefore also perform better than DRR in this case, because the Pre-order Queueing module mitigates the aforementioned problem by transmitting packets more uniformly within a round. GA flows are assumed to have a higher bit rate than GB flows. Fig. 7 shows that the average delay of GA flows in PDRR is lower than that in DRR. Furthermore, the result for GB flows is similar to that in Fig. 7.

Fig. 5. The average delay of GA flows in DRR, PDRR, and SCFQ.

Fig. 6. The performance ratio of GA to GB flows in DRR, PDRR, and SCFQ.

5.4. Number of priority queues

The ideal number of priority queues for a specific traffic environment is discussed herein. As described in Section 3.1, Pre-order Queueing enables a flow to use its Quantum uniformly within a round, especially when its Quantum is several times its maximum packet size. For flow i, Pre-order Queueing with Quantum_i/L_i priority queues can reach this goal. Thus, for a specific traffic environment, Z priority queues are enough, where Z is the maximum value of Quantum_i/L_i among all flows, i.e., max_i(Quantum_i/L_i). Fig. 8 shows, through an experiment, the relationship between Z and max_i(Quantum_i/L_i). For easier observation, all traffic sources are assumed to be CBR with fixed packet sizes. Each curve in Fig. 8 shows the average delay of the flow whose Quantum_i/L_i equals max_i(Quantum_i/L_i). For each curve, this flow receives the smallest average delay when Z = max_i(Quantum_i/L_i), which confirms our hypothesis.
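The rule Z = max_i(Quantum_i/L_i) can be evaluated directly. In the Python sketch below, rounding up to an integer is our assumption, since a ratio such as 3.2 does not give a whole number of queues; `suggested_z` is a hypothetical name.

```python
import math


def suggested_z(quantums, max_pkt):
    """Number of priority queues suggested in Section 5.4, rounded up.

    quantums -- dict flow -> Quantum_i (bytes); max_pkt -- dict flow -> L_i (bytes)
    """
    return math.ceil(max(quantums[f] / max_pkt[f] for f in quantums))
```

For the GA and GB groups of Table 5 (Quantum 500 bytes, packet sizes 50 and 500 bytes), `suggested_z` returns 10, the number of priority queues used in Section 5.1. Considering only flows A, B and C of Table 1, the ratios are 1.28, 1 and 3.2, giving Z = 4; flow D, whose ratio is 25.6, would raise this to 26 under the same rounding.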

Fig. 7. Average delay of GA flows under heterogeneous bandwidth requirements in DRR, PDRR, and SCFQ.


6. Related works

Uniform Round Robin (URR) [9] is a scheduling algorithm for ATM networks, improved from Weighted Round Robin (WRR) [10]. Some of its goals are the same as those of our algorithm. However, its frame and packet sizes are fixed; that is, a frame contains a fixed number of slots. If the flow to which a slot is allocated has no data waiting to be sent, the slot is wasted. Namely, URR is not a work-conserving scheduling algorithm, and the residual bandwidth cannot be shared by the active flows. Several work-conserving versions of WRR are designed in [11]; however, their current schemes and analytical results are only applicable to fixed-length packets. In fact, latency and fairness, the two key properties of a scheduler, are seriously affected when packet lengths are variable. That is what PDRR attempts to overcome.

Calendar queue [14] is a technique for reducing the implementation complexity of GPS-like schedulers. Like our scheme, it classifies packets; however, it does so according to their timestamps. For the packets currently in the system, the maximum interval between any two timestamps could be very large, depending on the fairness behavior of the underlying scheduler. To cover this large interval while keeping each class fine-grained, a calendar queue must maintain a large number of queues. In PDRR, by contrast, the range of the classified value is fixed and small, so less memory is needed to obtain the same classification granularity as in a calendar queue. When a GPS-like scheme is combined with a calendar queue, its complexity is reduced to O(1), but its latency and fairness are affected. A detailed comparison, however, is beyond the scope of this work.

7. Conclusion and future work

In this work, a rule for deciding the Quantum size of a flow was offered, and several problems in DRR were presented, such as bursty transmission and an inappropriate transmission sequence, which cause a large delay bound and seriously unfair behavior. A new algorithm, PDRR, was proposed, which adds a new structure, Pre-order Queueing, to the DRR scheduler. It reorders the transmission sequence of the packets that can be sent out in one round according to the quantum consumption status of each packet's flow within that round. This enables PDRR to approximate PGPS and overcome the DRR problems described above.
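The reordering idea summarized above can be sketched in Python as a simplified, batch-per-round model. This is not the paper's implementation: the function name is hypothetical, the class rule $n = Z - \lfloor \mathit{DeficitCounter}_i \cdot Z / \mathit{Quantum}_i \rfloor$ (with DeficitCounter taken after the packet's length is deducted) is our reading of (A.5) in Appendix A, and a real PDRR scheduler pipelines the DRR stage and the priority queues rather than finishing one stage before starting the other.

```python
from collections import deque


def pdrr_round(flows, quantum, deficit, Z):
    """Serve one round of a simplified PDRR scheduler (hypothetical sketch).

    flows:   dict flow_id -> deque of packet sizes (FIFO flow queues)
    quantum: dict flow_id -> Quantum_i
    deficit: dict flow_id -> DeficitCounter_i, updated in place
    Returns the transmission order for this round as (flow_id, size) pairs.
    """
    pq = [deque() for _ in range(Z + 1)]  # priority queues Pq_1..Pq_Z; index 0 unused
    for fid, q in flows.items():          # DRR visits the flows round-robin
        if not q:
            continue
        deficit[fid] += quantum[fid]
        while q and q[0] <= deficit[fid]:
            size = q.popleft()
            deficit[fid] -= size
            # More quantum left after this packet -> smaller class -> served earlier.
            n = Z - (deficit[fid] * Z) // quantum[fid]
            pq[n].append((fid, size))
        if not q:
            deficit[fid] = 0              # DRR resets the counter of an emptied flow
    order = []
    for n in range(1, Z + 1):             # drain Pq_1 first, Pq_Z last
        order.extend(pq[n])
    return order


flows = {"B": deque([4000]), "A": deque([1000, 1000, 1000, 1000])}
quantum = {"B": 4000, "A": 4000}
deficit = {"B": 0, "A": 0}
print(pdrr_round(flows, quantum, deficit, Z=4))
# [('A', 1000), ('A', 1000), ('A', 1000), ('B', 4000), ('A', 1000)]
```

With the same visiting order, plain DRR would send B's 4000-byte packet and then A's four packets back to back; in this sketch A's packets are spread around B's packet according to how much of A's quantum each one consumes.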

The delay and fairness analysis of PDRR and the simulation results show that our algorithm avoids bursty transmissions and overcomes a serious problem of DRR, namely that packets which cannot be sent out within the current round may receive too high a penalty. Meanwhile, PDRR retains the advantages of DRR, i.e., a per-packet time complexity independent of the number of flows and the ability to handle variable-length packets. Furthermore, we have shown that PDRR treats packets of different sizes similarly to sorted-priority schedulers such as SCFQ.

Work remains to be done. First, more simulation cases, such as flows with different allocated bandwidths but the same packet-size distribution, are being investigated. Second, traffic types other than CBR and MMPP could be applied to PDRR to observe its general behavior. Finally, we plan to implement our scheduling algorithm on a Linux-based router to further evaluate its performance in a real environment.

Appendix A. Proofs of primary results

Proof of Lemma 1. This is proved by showing that, for any packet $P_i^m$, $\mathit{DeficitCounter}_i^m$ is non-negative and smaller than $\mathit{Quantum}_i$. From the description in Section 3.2, the value of DeficitCounter cannot be negative and can only increase by $\mathit{Quantum}_i$ in the UpdateOneFlow procedure. Assume that there is insufficient credit to send out the packet $P_i^m$, i.e., $\mathit{DeficitCounter}_i^{m-1} < L_i^m$. After updating and sending out the packet $P_i^m$,

$$\mathit{DeficitCounter}_i^m = \mathit{DeficitCounter}_i^{m-1} + \mathit{Quantum}_i - L_i^m.$$

As $\mathit{DeficitCounter}_i^{m-1} < L_i^m$, $\mathit{DeficitCounter}_i^m$ must be smaller than $\mathit{Quantum}_i$. From the above description, the lemma is proved.
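The invariant in Lemma 1 can be checked empirically with a small Python simulation of a single, continuously backlogged DRR flow. The packet-size distribution, the seed, and the quantum are hypothetical choices; the check records DeficitCounter right after every transmission and confirms that it stays in $[0, \mathit{Quantum}_i)$.

```python
import random
from collections import deque


def deficits_after_send(sizes, quantum):
    """Simulate DRR for one backlogged flow; return DeficitCounter after each send."""
    q, dc, trace = deque(sizes), 0, []
    while q:                       # one iteration per DRR round
        dc += quantum              # UpdateOneFlow adds Quantum_i
        while q and q[0] <= dc:    # send while the credit suffices
            dc -= q.popleft()
            trace.append(dc)
        if not q:
            dc = 0                 # the counter is reset when the flow empties
    return trace


random.seed(1)
QUANTUM = 1500
sizes = [random.randint(64, 1500) for _ in range(10_000)]
trace = deficits_after_send(sizes, QUANTUM)
print(len(trace), all(0 <= d < QUANTUM for d in trace))  # 10000 True
```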

Proof of Lemma 2. According to (6), Eq. (5) is equivalent to

$$QA_i^m = QA_i^{m-1} - \frac{L_i^m}{\mathit{Quantum}_i}. \tag{A.1}$$

Since $L_i^m > 0$, $\mathit{Quantum}_i > 0$ and $r_i > 0$, it follows from (3) and (A.1) that, for any m,

$$TS_i^m > TS_i^{m-1} \quad \text{and} \quad QA_i^{m-1} > QA_i^m.$$

Hence, within the same round, the packet $P_i^m$ with the smallest timestamp has the smallest m among the packets of flow i, and the packet with the smallest m has the largest $QA_i^m$. Thus, the packet with the smallest timestamp in a round has the largest QA.
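Recursion (A.1) can be traced numerically to see why the earliest packet of a flow in a round has the largest QA. In this Python sketch the starting value $QA^0 = 1$, the packet sizes, and the quantum are hypothetical; exact fractions are used so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction


def qa_sequence(qa0, sizes, quantum):
    """Apply (A.1), QA^m = QA^{m-1} - L^m / Quantum_i, for m = 1, 2, ..."""
    qa, seq = Fraction(qa0), []
    for size in sizes:
        qa -= Fraction(size, quantum)
        seq.append(qa)
    return seq


seq = qa_sequence(1, [300, 700, 500], quantum=1500)
print([str(x) for x in seq])  # ['4/5', '1/3', '0']
# Strictly decreasing in m, so the first packet (smallest timestamp) has the largest QA.
print(all(a > b for a, b in zip(seq, seq[1:])))  # True
```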

Proof of Lemma 3. Because our algorithm only modifies the service order of the packets within a DRR round, any packet that DRR would send out in a given round during a backlogged period is still sent out in the same PDRR round. Thus, PDRR also has this property of DRR, which has been shown in [5].

Proof of Lemma 4. For each time interval $(t_{k-1}, t_k]$, we can write

$$t_k - t_{k-1} \le \frac{1}{C}\left(F + \sum_{j=1}^{N} D_j^{k-1} - \sum_{j=1}^{N} D_j^{k}\right). \tag{A.2}$$

By summing over the first k − 1 rounds,

$$t_{k-1} - t_0 \le \frac{(k-1)F}{C} + \frac{1}{C}\sum_{j=1}^{N} D_j^{0} - \frac{1}{C}\sum_{j=1}^{N} D_j^{k-1}. \tag{A.3}$$

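Assume there are two packets, $P_i^A$ and $P_i^B$, in $Fq_i$ whose sizes are $L_i^A$ and $L_i^B$ ($L_i^B = \phi_i$), respectively, and only $P_i^A$ can be sent out in the (k − 1)th round. All other flows exhaust their DeficitCounter. Thus, $D_i^{k-1} = \phi_i - \Delta$, where $0 < \Delta \le \phi_i$, and $D_j^{k-1} = 0$ for $j \ne i$. Since $D_i^0 = 0$ and, by Lemma 1, the deficit counter of every other flow is smaller than its quantum, $\sum_{j=1}^{N} D_j^0 \le F - \phi_i$, and (A.3) gives

$$t_{k-1} - t_0 \le \frac{(k-1)F}{C} + \frac{F - 2\phi_i + \Delta}{C}. \tag{A.4}$$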

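Under this assumption, in the kth round $P_i^B$ would be placed into $Pq_n$, where

$$n = Z - \left\lfloor \frac{D_i^k}{Pqg_i} \right\rfloor = Z - \left\lfloor \frac{D_i^{k-1} + \phi_i - L_i^B}{\phi_i / Z} \right\rfloor = Z - \left\lfloor \frac{D_i^{k-1}}{\phi_i} Z \right\rfloor = Z - \left\lfloor \frac{\phi_i - \Delta}{\phi_i} Z \right\rfloor = Z - \left\lfloor Z - \frac{\Delta}{\phi_i} Z \right\rfloor = \left\lceil \frac{\Delta}{\phi_i} Z \right\rceil, \tag{A.5}$$

and the maximum amount of data that could be served before $P_i^B$ is $\frac{n}{Z}(F - \phi_i)$. Thus, for any time t from the beginning of the kth round until $P_i^B$ is served,

$$t - t_0 \le \frac{n}{Z}\cdot\frac{F - \phi_i}{C} + (t_{k-1} - t_0) \le \frac{(k-1)F}{C} + \frac{F - 2\phi_i + \Delta}{C} + \frac{n}{Z}\cdot\frac{F - \phi_i}{C}, \tag{A.6}$$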


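which implies, after dropping the non-negative term $n\phi_i/(ZF)$,

$$k - 1 \ge \frac{(t - t_0)C}{F} + \frac{2\phi_i - \Delta}{F} - \frac{n}{Z} - 1. \tag{A.7}$$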

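Replacing k with k − 1 in (9) and defining $r_i$, the reserved rate of flow i, as $\phi_i C/F$, we have

$$\begin{aligned} W_i(t_0, t_{k-1}) &\ge \phi_i\left(\frac{(t - t_0)C}{F} + \frac{2\phi_i - \Delta}{F} - \frac{n}{Z} - 1\right) - D_i^{k-1} \\ &= r_i\left(t - t_0 + \frac{2\phi_i - \Delta}{C} - \frac{nF}{ZC} - \frac{F}{C} - \frac{D_i^{k-1}}{r_i}\right) \\ &= r_i\left(t - t_0 - \left(1 + \frac{n}{Z} + \frac{D_i^{k-1}}{\phi_i}\right)\frac{F}{C} + \frac{2\phi_i - \Delta}{C}\right) \\ &= r_i\left(t - t_0 - \frac{(1 + n/Z + D_i^{k-1}/\phi_i)F - 2\phi_i + \Delta}{C}\right). \end{aligned} \tag{A.8}$$

Replacing $D_i^{k-1}$ with $\phi_i - \Delta$, and using

$$\frac{n}{Z} - \frac{\Delta}{\phi_i} = \frac{1}{Z}\left\lceil \frac{\Delta}{\phi_i} Z \right\rceil - \frac{\Delta}{\phi_i} \le \frac{1}{Z}\left(\frac{\Delta}{\phi_i} Z + 1\right) - \frac{\Delta}{\phi_i} = \frac{1}{Z}$$

for any Δ, together with $\Delta \le \phi_i$, we obtain

$$W_i(t_0, t_{k-1}) \ge r_i\left(t - t_0 - \frac{(2 + n/Z - \Delta/\phi_i)F - 2\phi_i + \Delta}{C}\right) \ge r_i\left(t - t_0 - \frac{(2 + 1/Z)F - \phi_i}{C}\right). \tag{A.9}$$

In the worst case, flow i was the last to be updated during the kth round and its packet $P_i^B$ is inserted into the tail of $Pq_n$. We distinguish two cases: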

Case 1. At time t before flow i is served in the kth round, i.e., $t_{k-1} < t \le t_{k-1} + \frac{n}{Z}\cdot\frac{F - \phi_i}{C}$, we obtain

$$W_i(t_0, t) = W_i(t_0, t_{k-1}). \tag{A.10}$$

Case 2. At time t after flow i starts being served in the kth round, i.e., $t_{k-1} + \frac{n}{Z}\cdot\frac{F - \phi_i}{C} < t \le t_k$, we obtain

$$W_i(t_0, t) = W_i(t_0, t_{k-1}) + W_i(t_{k-1}, t) \ge W_i(t_0, t_{k-1}). \tag{A.11}$$

Thus, for any time t,

$$W_i(t_0, t) \ge \max\left(0,\; r_i\left(t - t_0 - \frac{(2 + 1/Z)F - \phi_i}{C}\right)\right). \tag{A.12}$$
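To see how the number of priority queues affects the latency term in (A.12), the following Python sketch evaluates $((2 + 1/Z)F - \phi_i)/C$ for a few values of Z. The frame size, quantum, and link rate are hypothetical numbers (bits and bits per second), and exact fractions are used.

```python
from fractions import Fraction


def pdrr_latency(F, phi, C, Z):
    """Latency term of (A.12): ((2 + 1/Z) * F - phi_i) / C, in seconds."""
    return ((2 + Fraction(1, Z)) * F - phi) / C


F, PHI, C = 96_000, 12_000, 10_000_000   # hypothetical: bits, bits, bit/s
for Z in (1, 2, 4, 8):
    print(Z, float(pdrr_latency(F, PHI, C, Z)))
```

The term shrinks as Z grows and approaches $(2F - \phi_i)/C$ in the limit, so most of the gain is obtained with a modest number of priority queues.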


Proof of Theorem 3. Assume that it is the beginning of the kth round and there are two packets, $P_i^A$ and $P_i^B$, in $Fq_i$ whose sizes are $L_i^A$ and $\phi_i$, respectively, and only $P_i^A$ can be sent out in the kth round. The class of packet $P_i^B$ is n, which implies that it will enter the nth priority queue. In the (k + 1)th round, no packet of another flow j whose class is larger than n can be sent out before the time t when $P_i^B$ is sent out, as the server always selects packets from the nonempty priority queue with the smallest index. Thus, before t, for another flow j,

$$W_j(t_0, t) \le k\phi_j + \frac{n}{Z}\phi_j. \tag{A.13}$$

Also, as $t_k < t < t_{k+1}$, from (9) for flow i,

$$W_i(t_0, t) \ge (k - 1)\phi_i + \Delta. \tag{A.14}$$

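From (A.5), (A.13) and (A.14) we can conclude that

$$\left|\frac{W_i(t_0, t)}{r_i} - \frac{W_j(t_0, t)}{r_j}\right| \le \frac{(k + n/Z)\phi_j}{r_j} - \frac{(k - 1)\phi_i}{r_i} - \frac{\Delta}{r_i} = \left(1 + \frac{n}{Z} - \frac{\Delta}{\phi_i}\right)\frac{F}{C} \le \left(1 + \frac{1}{Z}\right)\frac{F}{C}. \tag{A.15}$$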

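This bound applies to time intervals that begin at $t_0$. For any arbitrary interval,

$$\left|\frac{W_i(t_1, t_2)}{r_i} - \frac{W_j(t_1, t_2)}{r_j}\right| \le \left(2 + \frac{1}{Z}\right)\frac{F}{C}. \tag{A.16}$$

Thus, for any two flows i and j,

$$\mathit{Fairness}_{PDRR} = \left(2 + \frac{1}{Z}\right)\frac{F}{C}. \tag{A.17}$$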
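The chain of bounds in (A.15) and the constant in (A.17) can be checked numerically. This Python sketch takes the class n of $P_i^B$ from the rule in (A.5) (our reading, with $Pqg_i = \phi_i/Z$), computes the middle term $(1 + n/Z - \Delta/\phi_i)F/C$ of (A.15) for every integer Δ in $(0, \phi_i]$, and confirms that it never exceeds $(1 + 1/Z)F/C$; it also evaluates (A.17). All numeric values are hypothetical.

```python
import math
from fractions import Fraction


def pdrr_class(deficit, phi, Z):
    """Class n = Z - floor(deficit / Pqg_i) with Pqg_i = phi_i / Z, as read from (A.5)."""
    return Z - math.floor(Fraction(deficit * Z, phi))


def pdrr_fairness(F, C, Z):
    """Fairness bound (A.17): (2 + 1/Z) * F / C."""
    return (2 + Fraction(1, Z)) * F / C


F, PHI, C, Z = 96_000, 12_000, 10_000_000, 4    # hypothetical values
worst = Fraction(0)
for delta in range(1, PHI + 1):                  # 0 < Delta <= phi_i
    n = pdrr_class(PHI - delta, PHI, Z)          # deficit left for P_i^B is phi_i - Delta
    assert n == math.ceil(Fraction(delta * Z, PHI))              # last step of (A.5)
    gap = (1 + Fraction(n, Z) - Fraction(delta, PHI)) * F / C    # middle term of (A.15)
    worst = max(worst, gap)

print(worst <= (1 + Fraction(1, Z)) * F / C)    # True
print(pdrr_fairness(F, C, Z))                   # 27/1250
```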

References

[1] A.K. Parekh, R.G. Gallager, A generalized processor sharing approach to flow control in integrated services networks: The single-node case, IEEE/ACM Trans. Networking (1993) 344–357.

[2] J.C.R. Bennett, H. Zhang, WF2Q: Worst-case fair weighted fair queueing, in: Proceedings of the IEEE INFOCOM'96, San Francisco, CA, March 1996, pp. 120–128.

[3] S.J. Golestani, A self-clocked fair queueing scheme for broadband applications, in: Proceedings of the INFOCOM'94, April 1994, pp. 636–646.

[4] L. Zhang, Virtual Clock: A new traffic control algorithm for packet switching, ACM Trans. Comput. Systems 9 (2) (1991) 101–124.

[5] M. Shreedhar, G. Varghese, Efficient fair queueing using Deficit Round Robin, IEEE/ACM Trans. Networking 4 (3) (1996) 375–385.

[6] D. Stiliadis, A. Varma, Efficient fair queueing algorithms for packet-switched networks, IEEE/ACM Trans. Networking 6 (2) (1998) 175–185.

[7] S. Suri, G. Varghese, G. Chandranmenon, Leap forward virtual clock: A new fair queuing scheme with guaranteed delays and throughput fairness, in: Proceedings of the INFOCOM'97, 1997, pp. 557–562.

[8] D. Stiliadis, A. Varma, Latency-rate servers: A general model for analysis of traffic scheduling algorithms, IEEE/ACM Trans. Networking 6 (5) (1998) 611–624.

[9] N. Matsufuru, R. Aibara, Efficient fair queueing for ATM networks using Uniform Round Robin, in: Proceedings of the INFOCOM'99, 1999, pp. 389–397.


[10] M. Katevenis, S. Sidiropoulos, C. Courcoubetis, Weighted Round-Robin cell multiplexing in a general-purpose ATM switch chip, IEEE J. Selected Areas Commun. 9 (8) (1991) 1265–1279.

[11] H.M. Chaskar, U. Madhow, Fair scheduling with tunable latency: A Round Robin approach, in: IEEE Globecom'99, 1999, pp. 1328–1333.

[12] J.C.R. Bennett, D.C. Stephens, H. Zhang, High speed, scalable, and accurate implementation of fair queueing algorithms in ATM networks, in: Proceedings of the ICNP'97, 1997, pp. 7–14.

[13] V. Nageshwara Rao, V. Kumar, Concurrent access of priority queues, IEEE Trans. Comput. 37 (12) (1988) 1657–1665.

[14] J.L. Rexford, A.G. Greenberg, F.G. Bonomi, Hardware-efficient fair queueing architectures for high-speed networks, in: Proceedings of the INFOCOM'96, 1996, pp. 638–646.

Shih-Chiang Tsao was born in Hsin-chu, Taiwan, in 1975. He received the B.S. and M.S. degrees in Computer and Information Science from National Chiao Tung University in 1997 and 1999, respectively. Since 1999, he has worked in the Telecommunication Laboratories at Chunghwa Telecom Co., Ltd, mainly capturing and analyzing switch performance. His research interests include scheduling algorithms, quality of service, and TCP bandwidth control. He has implemented QoS scheduling mechanisms in a remote access server, and is currently involved in a project on enforcing bandwidth policies.

Ying-Dar Lin was born in Hsi-Lo, South Taiwan, in 1965. He received the Bachelor's degree in Computer Science and Information Engineering from National Taiwan University in 1988, and the M.S. and Ph.D. degrees in Computer Science from the University of California, Los Angeles, in 1990 and 1993, respectively. At the UCLA Computer Science Department, he worked as a Research Assistant from 1989 to 1993 and as a Teaching Assistant from 1991 to 1992. In the summers of 1987 and 1991, he was a technical staff member at IBM Taiwan and Bell Communications Research, respectively. He joined the faculty of the Department of Computer and Information Science at National Chiao Tung University in August 1993 and has been a Professor since 1999. His research interests include design, analysis, and implementation of network protocols and algorithms, wire-speed switching and routing, quality of service, and intranet services. He has authored two books. He is a member of ACM and IEEE. He can be reached at http://www.cis.nctu.edu.tw/ydlin.