Chung Hua University Master's Thesis

Title: Developing the Hardware of a Worst-Case Fair Weighted Fair Queuing Scheduling Algorithm for High-Speed Networks

Develop the Hardware of the WF2Q Algorithm (Worst-Case Fair Weighted Fair Queuing) for High-Speed Networks

Department: Graduate Program, Department of Computer Science and Information Engineering
Student ID / Name: E09102007, Lin Chung-Chih
Advisor: Dr. Hsu Wen-Lung

June 2005
Abstract in Chinese

With the prevalence of the Internet, the rapid growth of upper-layer applications has brought an explosion of data traffic (the demand from video conferencing and Internet telephony being especially high). The lower-layer packet scheduler must let packets leave the router fairly and quickly to satisfy routers on backbone networks that must forward millions of packets per second.

Among fair packet scheduling algorithms, GPS, proposed in [4], gives the fairest answer to the packet scheduling problem. In a real network environment, however, packets are not fluid and cannot be divided arbitrarily, so a fair packet scheduling algorithm, Weighted Fair Queuing, was proposed that satisfactorily approximates the behavior of GPS. As network traffic grows rapidly, packet scheduling algorithms executed in software often cannot process the excessive number of packets efficiently; this overloads the router's CPU, overflows buffers, and causes packets that wait too long to be dropped, and TCP's retransmission of the lost packets further increases network congestion.

In this thesis we develop WF2Q [2] hardware to improve on the execution speed of solving the packet scheduling problem with WF2Q in software. We design and functionally verify the hardware in the Verilog HDL. Our comparative analysis shows that the hardware we designed is roughly one hundred times faster than the software.
Abstract

In today's network environment, upper-layer applications are increasing sharply, causing network traffic to grow fast. The lower-layer packet scheduler must be fair in letting packets leave the router, in order to keep up with the millions of packets transmitted per second on the backbone.

GPS [4] gives the fairest answer to packet scheduling; however, packets are not fluid in a real network environment. Therefore Weighted Fair Queuing (WFQ) was proposed as a packet-by-packet approximation of GPS, and J.C.R. Bennett and H. Zhang later proposed WF2Q to tighten this approximation [2].

Due to growing network traffic, a packet scheduling algorithm executed in software often cannot process excessive packets effectively; this overloads the system, overflows buffers, and drops packets, and the resulting retransmissions further increase network congestion.

We develop the hardware of WF2Q to overcome this software drawback and compare the performance of the two approaches. Our model shows that the hardware WF2Q exceeds the software approach by almost a hundred times.
誌 謝
本論文順利完成,首先要感謝我的指導教授 許文龍博士,在此期間敦敦 教悔與辛勤教導,對於教授教導及鼓勵,謹此表達心中最誠致的敬意及感謝。
此外要感同學幸洲、志卿、廣嶺、國松的互相鼓勵,還有關心我的同事及 好友們,感謝您們平時的協助及鼓勵,在我心情低落而有所倦怠時,能夠適時 的激發我的鬥志與積極的精神。
最後要感謝我的家人,在這段期間的支持及照顧,雖然有時感到疲憊與沮 喪,但溫柔的親情總能消除一切的疲勞,繼續努力不懈。
僅以本文獻給我最敬愛的師長、最親愛的家人及所有關心支持我的人,願 將這份研究成果與喜悅與您們分享。
林忠志 九十四年六月
Table of Contents

Abstract in Chinese
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Motivations
  1.2 Thesis Organization
Chapter 2 Background and Related Researches
  2.1 Packet process flow in router
  2.2 Requirements of a scheduling algorithm
  2.3 Overview of several queue scheduling algorithms
  2.4 GPS and WFQ
  2.5 The design flow
Chapter 3 The Overall Design of Our HWF2Q
  3.1 Explain WF2Q from C code
  3.2 Describe the detailed Verilog HDL code for our design
Chapter 4 Simulation and Performance Analysis
  4.1 Simulation Results
  4.2 Synthesis results and analysis
Chapter 5 Conclusions and Future Works
References
Appendix 1 Detailed Verilog code
Appendix 2 DC Script File
List of Figures

Figure 2.1: The General Block Diagram of Packet Flow in a Router
Figure 2.2: First-in, First-out (FIFO)
Figure 2.3: Priority Queuing
Figure 2.4: Fair Queuing
Figure 2.5: An example of generalized processor sharing
Figure 2.6: Example of WFQ and WF2Q
Figure 2.7: Design Flow
Figure 3.1: Main procedures of the WF2Q Algorithm
Figure 3.2: Block Diagram of HWF2Q
Figure 3.3: Circuit block diagram of our Verilog code
Figure 4.1: Test 6 result --- WF2Q
Figure 4.2: Waveform of test 6 result --- WF2Q
Figure 4.3: Test 5 result --- WF2Q
Figure 4.4: Test result --- WFQ
Figure 4.5: Test result --- WF2Q
List of Tables

Table 2.1: Compare GPS with WFQ
Table 4.1: Test models for HWF2Q
Table 4.2: The comparison between packet size and outline speed for HWF2Q
Table 4.3: Analyze procedures of WF2Q
Chapter 1 Introduction
In this chapter, we first give the research motivations and then describe the organization of this thesis.
1.1 Research Motivations
Along with advances in technology, communication systems and computer software and hardware have developed rapidly, causing a sudden increase in the data carried by networks as communication data are transmitted ever faster. Packets must therefore leave the router quickly to avoid congestion.

It is important to minimize the amount of congestion in the network, because congestion decreases overall packet throughput, increases end-to-end delay, and can lead to packet loss [1] if there is insufficient buffer memory to store all of the incoming packets. The transmitter retransmits the packets that have been dropped by the router.

There are many different queuing algorithms, each trying to find the correct balance between complexity, fairness, and control.

The scheduler does its best effort to select packets from the queues fairly. Since fairness is a problem, many researchers have devoted themselves to it. We try to improve the execution efficiency of the scheduling algorithm by implementing it in hardware.
1.2 Thesis Organization
This thesis is organized as follows: In chapter 2, we introduce background knowledge and related researches. In chapter 3, we will explain the overall design of our HWF2Q.
In chapter 4, our HWF2Q is simulated to verify its functional correctness and to analyze the performance of our design.
Finally, conclusions and future works are given in chapter 5.
Chapter 2
Background and Related Researches
In this chapter, an overview of the packet processing flow in a router is given in section 2.1. In section 2.2, we describe the requirements of a scheduling algorithm. In section 2.3, we introduce a number of scheduling algorithms and analyze them. Then we introduce GPS and WFQ in section 2.4. Finally, we describe the design flow in section 2.5.
2.1 Packet process flow in router
The packet processing flow in a router is illustrated in Figure 2.1.
Figure2.1: The General Block Diagram of Packet Flow in a Router
Classifying: every arriving packet is classified into a different priority queue according to the weight in its header field.

PHY: for incoming packets, this stage performs analog-to-digital conversion, translating electrical pulses into a digital bit stream.

MAC: the MAC performs access control for all the attached nodes according to whichever MAC algorithm it uses.

Scheduling: a scheduling algorithm selects packets according to each packet's service weight.
2.2 The requirements of a packet scheduling algorithm

1. Support the fair distribution of bandwidth among the different service classes competing for bandwidth on the output port.

2. Furnish protection between the different service classes on an output port, so that a poorly behaved queue cannot take bandwidth from, or delay, the other service classes assigned to that output port.

3. Provide an algorithm that can be implemented in hardware, so it can arbitrate access to bandwidth on the highest-speed router interfaces without negatively impacting system performance.

2.3 Overview of several queuing algorithms
In this section, we describe a number of queue scheduling disciplines and analyze them.
2.3.1 First-in, First-out (FIFO) Queuing
FIFO queuing is the most basic scheduling algorithm. In FIFO queuing, all packets are treated equally by placing them into a single queue and then servicing them in the same order in which they were placed into the queue. As illustrated in Figure 2.2, eight flows arrive at the router; the multiplexer sends their packets into the FIFO queue in order of arrival.
■ FIFO Analysis
1. Low computational load on the system.

2. Packets are not reordered, and the maximum delay is determined by the maximum depth of the queue [1].

3. A single FIFO queue does not allow routers to organize buffered packets and then service one class of traffic differently from other classes of traffic [3].

4. A bursting flow can consume the entire buffer space of a FIFO queue, which causes all other flows to be denied service until the burst is serviced. This can result in increased delay, jitter, and loss for the other well-behaved TCP and UDP flows traversing the queue.
Figure 2.2: First-in, First-out (FIFO)
2.3.2 Priority Queuing (PQ)
Priority queuing (PQ) is the basis for a class of queue scheduling algorithms that are designed to provide a relatively simple method of supporting differentiated service classes. As illustrated in Figure 2.3, packets are scheduled from the head of a given queue only if all queues of higher priority are empty.
■ Priority Queuing Analysis
If the amount of high-priority traffic is not policed, lower-priority traffic may experience excessive delay.
2.3.3 Fair Queuing
Fair queuing (FQ) was proposed by John Nagle in 1987 [3]. As illustrated in Figure 2.4, packets are first classified into flows by the system and then assigned to a queue that is dedicated to that flow. Queues are then serviced one packet at a time in round-robin order.

Figure 2.3: Priority Queuing

Figure 2.4: Fair Queuing
■ Fair Queuing Analysis
1. An extremely bursting or misbehaving flow does not degrade the quality of service delivered to other flows, because each flow is isolated into its own queue.

2. FQ is typically applied at the edges of the network, where subscribers connect to their service provider. Vendor implementations of FQ typically classify packets into 256, 512, or 1024 queues using a hash calculated across the source/destination address pair, the source/destination UDP/TCP port numbers, and the IP ToS byte.

2.4 GPS and WFQ
In this section, we describe Generalized Processor Sharing (GPS) and Weighted Fair Queuing (WFQ), and then discuss how WF2Q improves on WFQ.
2.4.1 GPS characteristics
Generalized Processor Sharing (GPS) allows different queues to have different service shares [4] by assigning a different weight to each queue. GPS has several nice properties. Since each flow has its own queue, a bursty flow (one that is sending a lot of data) will only distort itself and not the other flows. In [4] Parekh showed that in a network of GPS switches with flows that are leaky-bucket constrained, an end-to-end delay bound can be guaranteed.

GPS is work conserving (the server must be busy if there are packets waiting in the system) and operates at a fixed rate γ. It is characterized by weights (positive real numbers) given to the flows.
Let S_i(τ, t) be the amount of session i traffic served in the interval (τ, t]. A GPS server is then defined as one for which

    S_i(τ, t) / S_j(τ, t) >= φ_i / φ_j,   j = 1, 2, ..., N

for any session i that is backlogged throughout the interval [τ, t]. Summing over all sessions j,

    S_i(τ, t) · Σ_j φ_j >= (t − τ) γ φ_i,

and session i is guaranteed a rate of

    g_i = γ φ_i / Σ_j φ_j.

Figure 2.5 illustrates generalized processor sharing.
Variable-length packets arrive from both sessions over infinite-capacity links. For i = 1, 2, let A_i(0, t) be the amount of session i traffic that arrives at the system in the interval (0, t], and S_i(0, t) the amount of session i traffic served in that interval.
Figure 2.5 An example of generalized processor sharing
2.4.2 Weighted Fair Queuing (WFQ)
Because GPS is an idealized fluid model, it is impractical. Therefore the WFQ algorithm (Weighted Fair Queuing) was proposed to approximate the GPS behavior [4].

Weighted Fair Queuing (WFQ) is a packet scheduling technique allowing guaranteed-bandwidth services. The purpose of WFQ is to let several queues share the same link.
In WFQ, one packet at a time is selected and output from among the active queues. This works as follows. Each arriving packet is given virtual start and finish times. The virtual start time S(k, i) and the virtual finish time F(k, i) of the k-th packet in queue i are computed as

    S(k, i) = max{ F(k−1, i), V(a(k, i)) }
    F(k, i) = S(k, i) + L(k, i) / r(i)

with F(0, i) = 0, where a(k, i) and L(k, i) are the arrival time and the length of the packet, respectively, V is the system virtual time, and r(i) is queue i's service rate.
The packet selected for output is the packet with the smallest virtual finish time. In [4] Parekh describes a Packet GPS (PGPS) algorithm which is identical to the WFQ algorithm described here. He derives several relationships between a fluid GPS system and a corresponding packet WFQ system: the finish time of a packet in the WFQ system will be later than in the GPS system by at most the transmission time of a maximum-sized packet, and the number of bits serviced in a queue by the WFQ system will not fall behind the GPS system by more than one maximum-sized packet. The result is shown in Table 2.1.
Session 1 packets (arrival time, size): (1, 1), (2, 1), (3, 2), (11, 2)
Session 2 packets (arrival time, size): (0, 3), (5, 2), (9, 2)

Queue weights Ø1 = Ø2:
  Session 1 finish times: GPS 3, 5, 9, 13; PGPS 4, 5, 7, 13
  Session 2 finish times: GPS 5, 9, 11; PGPS 3, 9, 11

Queue weights Ø1 = 2·Ø2:
  Session 1 finish times: GPS 4, 5, 9, 13; PGPS 4, 5, 9, 13
  Session 2 finish times: GPS 4, 8, 11; PGPS 3, 7, 11

Table 2.1: Compare GPS with WFQ
2.4.3 Worst Case Fair Weighted Fair Queuing (WF2Q)
Although WFQ and WF2Q both take approaching GPS as their goal, WF2Q additionally checks whether the virtual clock has passed a packet's virtual start time. Without this check, WFQ can transmit packets too early. For example, in Figure 2.6 the first queue is assigned 50% of the bandwidth and each of the other queues 5%. Assume the packet size is 1 and the output port rate is 1. At time 0, eleven packets enter the first queue and one packet enters each of queues 2 to 11. In this situation, the k-th packet of the first queue has virtual start time 2(k−1) and virtual finish time 2k, while the first packet of each of queues 2 to 11 has virtual start time 0 and virtual finish time 20.

Because WFQ considers only the packets' virtual finish times, it transmits the first ten packets of the first queue before time 10 and only then starts to transmit the packets of queues 2 to 11. As Figure 2.6 shows, WFQ therefore has two shortcomings:

1. The delay jitter is too big: the first ten packets of the first queue leave the system back to back within the first 10 seconds, while the queue's 11th packet leaves much later; the difference between these departure times is too large.

2. WFQ can transmit too quickly: within 10 seconds, WFQ has already transmitted 10 packets of the first queue, whereas GPS would have transmitted only 5. In a system with many queues this can be even more serious.

Therefore WF2Q gives a better packet transmission order than WFQ.
Figure 2.6 An Example of WFQ and WF2Q
2.5 The Design Flow
Our design flow is shown in Figure 2.7: algorithm analysis, hardware modeling, system design, and simulation/synthesis.

Figure 2.7: Design Flow
Chapter 3
The Overall Design of Our HWF2Q
In section 3.1, we first show the four main procedures of the WF2Q algorithm, as in Figure 3.1, and then explain their functions with a C program. In section 3.2, we explain the overall design of our hardware WF2Q (HWF2Q).
3.1 Explain WF2Q from C program
In section 3.1, we first show the main procedures of the WF2Q algorithm, as in Figure 3.1.
Figure 3.1 Main procedures of WF2Q Algorithm
The overall procedure of the WF2Q algorithm is shown below.
[Procedure: WF2Q Algorithm]
★ Step 0: Setting the parameters
● Flow quantity
● Flow’s size: It is FIFO size of each flow.
● Each flow weight
● Initial virtual time: it contains the global virtual time and each flow's virtual start and virtual finish times.
★ Step 1: Look for the candidate flow with the earliest finish time.
In this procedure, we first find the eligible queues and then compare the virtual finish times among them.
[C code: Look for the candidate flow with the earliest finish time]

packet* Look_earliest_time( )
{
    Packet *pkt = NULL;     /* will hold the head-of-line packet of the winner */
    int i;
    double minF = DBL_MAX;  /* start from "infinity"; needs <float.h> */
    int flow = -1;

    for (i = 0; i < MAXFLOWS; i++) {
        if (!fs_[i].qcrtSize_) continue;           /* skip empty flows */
        if (fs_[i].S_ <= V && fs_[i].F_ < minF) {  /* eligible (S_ <= V) with the earliest finish */
            flow = i;
            minF = fs_[i].F_;
        }
    }
    /* dequeueing the head-of-line packet of the chosen flow is omitted in this excerpt */
    return pkt;
}
★ Step 2: Set the start and the finish times of the remaining packets in the queue.
This procedure calculates the transmission time and updates the virtual start time and the virtual finish time. The process is below:
★ Step 3: Update the global virtual clock
In this procedure, we first find the minimum virtual start time and then compare the global virtual clock with it. The process is below:
[C code: Set the start and the finish times of the remaining packets in the queue]

void Set_start_finish_time(Packet* nextPkt)
{
    fs_[flow].S_ = fs_[flow].F_;  /* the next packet inherits the previous finish time */
    fs_[flow].F_ = fs_[flow].S_ + (nextPkt->size() / fs_[flow].weight_);
    /* the weight parameter must not be 0 */
}
[C code: Update the virtual clock]

void Update_vt(double dequeuedS)  /* called with fs_[flow].S_ of the flow just served */
{
    double minS = dequeuedS;
    double W = 0;  /* must start from 0 */
    int i;

    for (i = 0; i < MAXFLOWS; ++i) {
        W += fs_[i].weight_;        /* compute the total flow weight */
        if (fs_[i].qcrtSize_)       /* find the minimum virtual start time among backlogged flows */
            if (fs_[i].S_ < minS) minS = fs_[i].S_;
    }
    V = max(minS, (V + ((double)pktSize / W)));  /* pktSize: size of the packet just served */
}
3.2 Overall Design
In this section, we first introduce the circuit block diagram of our design (HWF2Q). Then we describe the design in Verilog HDL.
3.2.1 Circuit Block Diagram
Figure 3.2 is a block diagram of the HWF2Q. In this diagram there is one circuit block, the Time Stamp generator (TS_generator). The FIFOs are buffers that temporarily store the packet headers; the Time Stamp generator calculates and updates the head-of-line packets' virtual times and then sends selection signals (sel_flow) to the queuing manager. The chosen packet is dequeued according to the signal received from the Timestamp Generator.

The TS_monitor is a 64-bit signal that indicates the content of the head-of-line packet. It provides the latest information, including the packet's size and arrival time, for the TS_generator to compute the timestamp.
Figure 3.2 Block Diagram of Hardware WF²Q
3.2.2 Describe our design by verilog HDL
Our HWF2Q includes the four main procedures shown in Figure 3.1. We describe the four main procedures in Verilog HDL. (The detailed code is in Appendix 1.)
★ Step 0: Setting the parameters
● Flow quantity
● Flow’s size: It is FIFO size of each flow.
● Each flow weight
● Initial virtual time: it contains the global virtual time and each flow's virtual start and virtual finish times.
module TS (clk, rst,
           linehead_1, linehead_2, linehead_3, linehead_4,
           sel_flow1, sel_flow2, sel_flow3, sel_flow4);  // flow quantity

input  [63:0] linehead_1, linehead_2, linehead_3, linehead_4;  // flow size (FIFO's input)
input  rst, clk;
output sel_flow1, sel_flow2, sel_flow3, sel_flow4;  // output to QM for selection
...
Txtime = linehead_3[15:0] * 5;  // flow3's weight is 20% (1/0.2 = 5)
★ Step 1: Look for the candidate flow with the earliest finish time.
st0: begin  // compare each flow's FT_vir with the other flows
    ...
    if (temp_FT_vir1 < temp_FT_vir2)  // the masked temp_FT_vir values are compared directly
    begin
        min_FT_vir = temp_FT_vir1;
        sel_flow1 = 1;
        sel_flow2 = 0;
        sel_flow3 = 0;
        sel_flow4 = 0;
        ...
    end
    else
    begin
        min_FT_vir = temp_FT_vir2;
        sel_flow1 = 0;
        sel_flow2 = 1;
        sel_flow3 = 0;
        sel_flow4 = 0;
        ...
★ Step 2: Set the start and the finish times of the remaining packets in the queue

st1: begin  // update the dequeued flow's start & finish time
    nextst = st2;
    if ({sel_flow1, sel_flow2, sel_flow3, sel_flow4} == 4'b1000)  // for flow1
    begin
        ST_vir1 = FT_vir1;               // propagate FT_vir to the next packet's ST_vir
        txtime  = linehead_1[15:0] * 5;  // flow1's weight is 20%
        FT_vir1 = ST_vir1 + txtime;      // update the finish time
    end
    else if ({sel_flow1, sel_flow2, sel_flow3, sel_flow4} == 4'b0100)
    ...
★ Step 3: Update the global virtual clock

First find the minimum virtual start time, and then compare the global virtual clock with it.

st0: begin  // compare each flow's FT_vir with the other flows
    ...
    pre_cal_vt = vt + linehead_1[15:0];  // pre-computed for use in st3
    ...
end

st2: begin  // find the minimum start time after updating the S, F times: min(S)
    nextst = st3;
    if (ST_vir1 < ST_vir2)
        min_ST_vir = ST_vir1;
    else
        min_ST_vir = ST_vir2;
    if (ST_vir3 < min_ST_vir)
        min_ST_vir = ST_vir3;
    if (ST_vir4 < min_ST_vir)
        min_ST_vir = ST_vir4;
end

st3: begin  // update the global virtual clock
    nextst = st0;
    if (min_ST_vir > pre_cal_vt)
        vt = min_ST_vir;
    else
        vt = pre_cal_vt;
    ...
3.2.3 Circuit Block diagram of our verilog code
Figure 3.3 Circuit Block diagram of our verilog code
Chapter 4
Simulation and Performance Analysis
In this chapter, we verify the functional correctness and efficiency of our design. Our design was written in Verilog HDL; the results of each test model are presented in section 4.1. The performance comparison between hardware and software is discussed in section 4.2.
4.1 Simulation Results
We set up different test models (Table 4.1) to verify that WF2Q properly handles packets under different weight and length situations. We describe several test models below.

Test    Session weights (Q1/Q2/Q3/Q4)   Packet lengths (Q1/Q2/Q3/Q4)   Check
Test 1  25% / 25% / 25% / 25%           same size                      Pass
Test 2  25% / 25% / 25% / 25%           random                         Pass
Test 3  50% / 20% / 20% / 10%           same size                      Pass
Test 4  50% / 20% / 20% / 10%           random                         Pass
Test 5  25% / 25% / 25% / 25%           gradually increasing size      Pass
Test 6  50% / 20% / 20% / 10%           50 / 10 / 10 / 10 bytes        Pass
Test 7  25% / 25% / 25% / 25%           50 / 10 / 10 / 10 bytes        Pass

Table 4.1: Test models
4.1.1 Test 6 result :
In this test, we use weights of 50% for Q1, 20% for Q2 and Q3, and 10% for Q4. Q1 has a fixed packet size of 50 bytes, while Q2, Q3, and Q4 all have a packet size of 10 bytes. The result is shown in Figure 4.1 and Figure 4.2. As we can see, all queues follow their weighted slopes and Q4 still gets its service, even though it has only a 10% bandwidth share. The larger step on Q1 is due to the bigger packet.
Figure 4.1: Test 6 result --- WF2Q (Q1: 50% BW; Q2, Q3: 20% BW; Q4: 10% BW)

Figure 4.2: Waveform of test 6 result --- WF2Q
4.1.2 Test 5 result :
The result is shown in Figure 4.3. In this test we use a weight of 25% for each of Q1, Q2, Q3, and Q4. The packet sizes in all four queues gradually increase starting from 10 bytes, and each flow receives an equal amount of traffic.

Figure 4.3: Test 5 result --- WF2Q
4.1.3 Compare test 3 result with WFQ
Figure 4.4 and Figure 4.5 use exactly the same testing condition: each flow's weight is Q1 50%, Q2 and Q3 20%, Q4 10%, and all packets have the same size. In this test we can see that both WFQ and WF2Q follow the weights' slopes; however, if we zoom in to a smaller time scale, we can see that WFQ cannot really handle the burst packets.

Figure 4.4: Test result (same packet size) --- WFQ (Q1: 50% BW; Q2, Q3: 20% BW; Q4: 10% BW)

Figure 4.5: Test result (same packet size) --- WF2Q
4.2 Synthesis results

For the entire test we assume that all packets have already been classified into four queuing buffers, that the scheduler extracts packets from those buffers, and that the packet header field indicates the class level of each packet. We have simulated and synthesized WF2Q and WFQ using Toshiba TC240C as our target library. It is a standard-cell library in a 0.25 µm technology with 2.5/3.3 V supplies [11]. In our synthesis results, the total grid area is 28001.5 to 33466.1, and the gate count with respect to the Toshiba standard cell CND2xL(4) is approximately 7000 gates.

We tried several different clock speeds in our script file, and the scheduler can run with a clock period down to 5 ns, which is 200 MHz. In this case the most critical path has a 3.42 ns data required time and a -2.07 ns data arrival time, meaning the slack is met. We also checked the difference in grid area when running at different clock speeds and found that increasing the clock speed does not affect the grid area much.
4.3 Performance Analysis
In this section we present the performance analysis of our HWF2Q.
4.3.1 Scheduler requirements for high-speed networks

The complexity of scheduling algorithms inhibits their use in high-speed switches, which must select a packet to transmit within a few microseconds. In addition, performance must scale to tens of thousands of connections multiplexed onto a single link, with connection throughput requirements varying over a wide range [12].

Internet backbone speeds now surpass the gigabit range, so inside the router the scheduler must keep up with the outline speed. The packet size of multimedia data (MPEG-4) is approximately 50 bytes, so the timing works out as follows:

(1) To = packet size / outline speed = 50 * 8 bits / 1 Gbps = 4 * 10^-7 s
(2) Ti = scheduling time

Ti is derived in section 4.3.2; the comparison is shown in Table 4.2.
Outline speed   Packet size (bytes)   To (s)       Ti software (s)   Ti hardware (s)   To > Ti (software)   To > Ti (hardware)
10 Giga         50                    4 * 10^-8    1 * 10^-6         5 * 10^-9         No                   Yes
10 Giga         100                   8 * 10^-8    1 * 10^-6         5 * 10^-9         No                   Yes
10 Giga         1000                  8 * 10^-7    1 * 10^-6         5 * 10^-9         No                   Yes
10 Giga         10000                 1 * 10^-6    1 * 10^-6         5 * 10^-9         No                   Yes

Table 4.2: The comparison between packet size and outline speed for HWF2Q (Ti in clocks at 1 GHz)
4.3.2 Methodology
We generated an X86 assembly code listing from the C code. From the detailed assembly code we were able to count instruction path lengths through the major code sections. By referring to the instruction-set table of the Pentium II CPU, we know the clock-cycle count of every instruction. The count is not perfectly accurate, however, because some instructions do not take a fixed number of clocks; for example, CMP reg,reg takes 2 or 3 clocks.
We compare the execution cycles of the WF2Q operations between hardware and software. The performance comparison between the hardware (refer to Figure 4.6) and software approaches is given in Table 4.3.
Function                                                     Hardware (our HWF2Q)   Software             Speed-up
                                                             total clock cycles     total clock cycles
Look for the candidate flow with the earliest finish time    1                      256                  256
Set the start and the finish times of the remaining
packets in the queue                                         1                      360                  360
Update the global virtual clock                              3                      501                  166
Total clocks                                                 5                      1117                 223

Table 4.3: Analyze procedures of WF2Q
Chapter 5
Conclusions and Future Works
In this thesis, we develop hardware for a weighted fair queuing algorithm for high-speed networks, to speed up the solving of the packet scheduling problem.

The primary benefit of WFQ is that an extremely bursty or misbehaving flow does not degrade the QoS delivered to other flows [3], because each flow is isolated into its own queue. If a flow attempts to consume more than its fair share of bandwidth, only its own queue is affected, so there is no impact on the performance of the other queues on the shared output port. However, WFQ does not pass certain test conditions, such as worst-case scheduling [2]. By adding two more steps, which compare each queue's virtual start time with the global virtual time, WF2Q avoids the issues that WFQ has.

There is related research that can be pursued in the future:

(1) Operating in coordination with bandwidth management to revise the WF2Q algorithm's virtual time, so that it conforms to Integrated Services.

(2) Implementing a packet classification algorithm is also an important direction.
References:

[1] L. L. Peterson and B. S. Davie, Computer Networks: A Systems Approach, Second Edition, 2000.
[2] J. C. R. Bennett and H. Zhang, "WF2Q: Worst-case fair weighted fair queueing," IEEE INFOCOM, pp. 120-126.
[3] Juniper Networks, "Supporting Differentiated Service Classes: Queue Scheduling Disciplines," Dec. 2001.
[4] A. Parekh, "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks," PhD dissertation, Massachusetts Institute of Technology, February 1992.
[5] K. Nichols et al., "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers," IETF RFC 2474, December 1998.
[6] L. Louis Zhang and Brent Beacham, "A Scheduler ASIC for a Programmable Packet Switch," IEEE, 2000.
[7] Massoud R. Hashemi, "A Multicast Single-Queue Switch With a Novel Mechanism."
[8] S. Golestani, "A self-clocked fair queueing scheme for broadband applications," in Proceedings of IEEE INFOCOM, pp. 636-646, April 1994.
[9] L. Kleinrock, Queueing Systems, Volume 2: Computer Applications, Wiley, 1976.
[10] A. Demers, S. Keshav, and S. Shenker, "Analysis and simulation of a fair queueing algorithm," Journal of Internetworking: Research and Experience, pp. 3-26, October 1990. Also in Proceedings of ACM SIGCOMM '89, pp. 3-12.
[11] Toshiba Corporation, "Toshiba Data Book: TC240C Series Primitive Cells, I/O Cells (Non-Linear Delay Models)," March 1999.
[12] J. L. Rexford, A. G. Greenberg, and F. G. Bonomi, "Hardware-Efficient Fair Queueing Architectures for High-Speed Networks," in IEEE INFOCOM '96, San Francisco, March 1996.
Appendix 1
Verilog code
module TS (clk,rst,
linehead_1, linehead_2, linehead_3, linehead_4, sel_flow1, sel_flow2, sel_flow3, sel_flow4);
input [63:0] linehead_1, linehead_2, linehead_3, linehead_4;  // FIFO's input
input rst, clk;
output sel_flow1, sel_flow2, sel_flow3, sel_flow4;
// output to QM for selection
reg sel_flow1, sel_flow2, sel_flow3, sel_flow4;
reg [63:0] ST_vir1, ST_vir2, ST_vir3, ST_vir4, //Flow's virtual start time
FT_vir1, FT_vir2, FT_vir3, FT_vir4, //Flow's virtual finish time
vt,  // global virtual time
pre_cal_vt, txtime, min_FT_vir, min_ST_vir;
parameter st0=0, st1=1, st2=2, st3=3;
reg[1:0] currentst, nextst;
wire c1,c2,c3,c4;
wire [63:0] temp_FT_vir1, temp_FT_vir2, temp_FT_vir3, temp_FT_vir4;
assign c1 = (ST_vir1 < vt);  // compare each flow's ST_vir with vt
assign c2 = (ST_vir2 < vt);
assign c3 = (ST_vir3 < vt);
assign c4 = (ST_vir4 < vt);
assign temp_FT_vir1 = (c1 == 1) ? FT_vir1 :
(64'b1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111);
assign temp_FT_vir2 = (c2 == 1) ? FT_vir2 :
(64'b1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111);
assign temp_FT_vir3 = (c3 == 1) ? FT_vir3 :  // this assign was missing in the original listing
(64'b1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111);
assign temp_FT_vir4 = (c4 == 1) ? FT_vir4 :
(64'b1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111_1111);
always@(posedge clk or negedge rst) begin
if (!rst)  // rst is active low (negedge in the sensitivity list)
currentst <= st0;
else
currentst <= nextst;
end
always@(currentst or linehead_1 or linehead_2 or linehead_3 or linehead_4
        or temp_FT_vir1 or temp_FT_vir2 or temp_FT_vir3 or temp_FT_vir4
        or vt or ST_vir1 or ST_vir2 or ST_vir3 or ST_vir4) begin
case(currentst)
st0: begin // compare each flow's FT_Vir with each other //dequeue
nextst= st1;
if (temp_FT_vir1 < temp_FT_vir2)  // the masked temp_FT_vir values are compared directly
begin
min_FT_vir = temp_FT_vir1;
sel_flow1 =1;
sel_flow2 =0;
sel_flow3 =0;
sel_flow4 =0;
pre_cal_vt = vt + linehead_1[15:0];  // for st3 use
end
else
begin
min_FT_vir = temp_FT_vir2;
sel_flow1 =0;
sel_flow2 =1;
sel_flow3 =0;
sel_flow4 =0;
pre_cal_vt = vt + linehead_2[15:0];
// for st3 use
end
if (temp_FT_vir3 < min_FT_vir) begin
min_FT_vir = temp_FT_vir3;
sel_flow1 =0;
sel_flow2 =0;
sel_flow3 =1;
sel_flow4 =0;
pre_cal_vt = vt + linehead_3[15:0];
// for st3 use
end
if (temp_FT_vir4 < min_FT_vir) begin
min_FT_vir = temp_FT_vir4;
sel_flow1 =0;
sel_flow2 =0;
sel_flow3 =0;
sel_flow4 =1;
pre_cal_vt = vt + linehead_4[15:0];
// for st3 use
end
end
st1: begin  // update the dequeued flow's start & finish time
nextst = st2;
if ({sel_flow1, sel_flow2, sel_flow3, sel_flow4} ==4'b1000 ) // For flow1
begin
ST_vir1 = FT_vir1;
// propagate the FT_vir to the next packet's ST_vir
txtime = linehead_1[15:0] * 5;
// flow1's weight is 20%
FT_vir1 = ST_vir1 + txtime;  // update the finish time (missing in the original listing; cf. flows 2-4)
end
else if ({sel_flow1, sel_flow2, sel_flow3, sel_flow4} ==4'b0100 ) // For flow2
begin
ST_vir2 = FT_vir2 ;
// propagate the FT_vir to the next packet's ST_vir
txtime = linehead_2[15:0] * 5;
// flow2's weight is 20%
FT_vir2 = ST_vir2 + txtime;
end
else if ({sel_flow1, sel_flow2, sel_flow3, sel_flow4} ==4'b0010 ) // For flow3
begin
ST_vir3 = FT_vir3 ;
// propagate the FT_vir to the next packet's ST_vir
txtime = linehead_3[15:0] * 5;
// flow3's weight is 20%
FT_vir3 = ST_vir3 + txtime;
end
else if ({sel_flow1, sel_flow2, sel_flow3, sel_flow4} ==4'b0001) // For flow4
begin
ST_vir4 = FT_vir4 ;
// propagate the FT_vir to the next packet's ST_vir
txtime = linehead_4[15:0] * 5;
// flow4's weight is 20%
FT_vir4 = ST_vir4 + txtime;
end
end
st2: begin  // find the minimum start time after updating the S, F times: min(S)
nextst= st3;
if (ST_vir1 < ST_vir2)
min_ST_vir = ST_vir1;
else
min_ST_vir = ST_vir2;
if (ST_vir3 < min_ST_vir)
min_ST_vir = ST_vir3;
if (ST_vir4 < min_ST_vir)
min_ST_vir = ST_vir4;
end
st3: begin // update the virtual time
nextst= st0;
if(min_ST_vir > pre_cal_vt)
vt = min_ST_vir;
else
vt = pre_cal_vt;
end
endcase
end
endmodule