Research on Design Optimization of Network-on-Chip System Platforms (單晶片網路系統平台設計最佳化之研究)

Figure 2.10: Bandwidth allocation of a physical channel using a weighted round-robin scheduler.

[Figure labels: Buffer A, Buffer B, and Buffer C feed a MUX; the time axis shows the slots A1, A2, B1, B2, A3, A4, C1, C2, A5, B3.]

be blocked until the buffers are released. In this work, the messages can be delivered rather than blocked by dividing the physical channel into several virtual channels. The waiting time of the message transfer is reduced, and the average latency of this channel is decreased. Thus, the physical channel gets higher utilization and the network obtains a larger throughput.

Second, to share the bandwidth of a physical channel among all of its virtual channels, we use a weighted round-robin scheduling scheme to grant the physical channel to each virtual channel in turn. Instead of plain time division, the weighted round-robin scheduler shown in Fig. 2.10 allocates a different bandwidth to each virtual channel by assigning it a different number of time slots. A higher channel weight means that more communication bandwidth is available to that channel.

2.2. ARCHITECTURE MODELS AND PLATFORM DESIGN

Figure 2.11: Interface transactions between two switches.

[Figure labels: MUX; Controller; buffers E1,sw1, E2,sw1, E3,sw1, E4,sw1; status bits 1 0 1 0; buffers E1,sw2, E2,sw2, S1,sw2, S2,sw2; LENGTH−Full.]

Third, the data exchange protocol between two switches, or between a switch and the network interface of the local processor, executes within four clock cycles. Fig. 2.11 shows the interface transaction between two adjacent switches, SW1 and SW2. The address mapping table records the destination address to which the messages are transferred. In the first cycle, if the buffer $E_{1,sw1}$ of SW1 holds data, the system controller grants the channel priority to this data. In the second cycle, $E_{1,sw1}$ sends the address of $E_{1,sw2}$ through the Address-line to indicate that this transaction tries to deliver data to the buffer $E_{1,sw2}$ of SW2. In the third cycle, $E_{1,sw2}$ sends the acknowledge signal, true or false, back to SW1 through the Ack-line according to its buffer status, available or full. Meanwhile, $E_{1,sw1}$ sends the data through the Data-line, and $E_{1,sw2}$ stores this data if it has spare space; the data is discarded if $E_{1,sw2}$ is already full. During the fourth cycle, $E_{1,sw1}$ keeps the data until the transaction has completed successfully.
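The four-cycle transaction can be illustrated with a small behavioral sketch. This is not the switch's hardware description: the `Buffer` class, the `transfer_word` function, and the modeling of each buffer as a fixed-capacity FIFO are assumptions made only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Buffer:
    """Hypothetical fixed-capacity FIFO standing in for one switch buffer."""
    name: str
    capacity: int
    words: deque = field(default_factory=deque)

    def is_full(self) -> bool:
        return len(self.words) >= self.capacity


def transfer_word(src: Buffer, dst: Buffer) -> bool:
    """Model one four-cycle transaction; return True if the word was delivered."""
    # Cycle 1: the controller grants the channel only if the source holds data.
    if not src.words:
        return False
    # Cycle 2: the source announces the destination buffer on the Address-line
    # (represented here simply by passing `dst`).
    # Cycle 3: the destination answers on the Ack-line while the source drives
    # the Data-line; a full destination discards the word.
    ack = not dst.is_full()
    if ack:
        dst.words.append(src.words[0])
    # Cycle 4: the source releases its copy only after a successful transfer;
    # otherwise it keeps the word so the transaction can be retried.
    if ack:
        src.words.popleft()
    return ack
```

In this sketch a failed transaction leaves the word at the head of the source buffer, which mirrors the statement that the sender keeps the data until the transaction succeeds.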

Fourth, our switch provides different memory configurations to improve the local memory utilization. The first reason is that not all buffers of a switch are reserved when the number of connection paths is smaller than the number of designed buffers. The second reason is that memory is a critical component for buffering data in a network. Therefore, in the memory implementation, we use two-port SRAM instead of registers when the number of virtual channels in a physical channel is large. In the switch, the memory is divided into buffers of several different sizes to optimize the utilization. The memory in a switch port can be partitioned into eight 8-word blocks, sixteen 4-word blocks, or thirty-two 2-word blocks.

Finally, to support real-time applications, our switches can establish dedicated connection paths in advance by reserving the corresponding virtual channels, since the communication behavior and the number of nodes can be determined at an early stage of system design.

Although both traditional circuit switching and the proposed switching configuration provide a latency guarantee, the proposed one has a smaller average latency and higher hardware utilization. Compared with wormhole packet switching, the proposed switch additionally provides a worst-case guarantee, while both switches have a small buffer size and high hardware utilization.

Figure 2.12: Real-time QoS modeling.

[Figure labels: switches SW 1, SW 2, ..., SW K, each with buffers, a MUX, and a scheduler (Scheduler 1, Scheduler 2, ..., Scheduler K); the communication path runs through Buffer 1, Buffer 2, ..., Buffer K.]

2.2.3 Quality of Service Modeling and Property

In a real-time system, a latency guarantee is the essential quality-of-service (QoS) requirement, and the scheduling algorithm must schedule tasks so that the real-time requirement is satisfied even under worst-case conditions. QoS also plays a critical role in non-real-time systems. When the communication fabric offers no performance guarantee, designers must spend additional effort estimating the communication latency to make sure that the communication load is not underestimated for the given on-chip network communication architecture. As a consequence, the communication infrastructure is usually over-designed to avoid congestion. In this work, we use weighted round-robin scheduling, a scheduling scheme with minimal resource requirements, for our QoS model, as shown in Fig. 2.12. Each master has a weight $N_i$ in the scheduler that controls it. The scheduler grants a master when that master issues a request. The granted master can transmit at most $N_i$ words of data in a round; the scheduler then grants the next master, until the round is complete.
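The granting rule described above can be sketched in a few lines of Python. The function name `weighted_round_robin`, the dictionary-based queues, and the choice to skip a master with no pending request (rather than leave its slots idle) are assumptions of this sketch, not details taken from the switch design.

```python
from collections import deque


def weighted_round_robin(queues, weights, rounds):
    """Return the order in which words leave the masters' queues.

    queues  -- dict: master name -> deque of pending words
    weights -- dict: master name -> N_i, the word budget per round
    rounds  -- number of scheduler rounds to simulate
    """
    sent = []
    for _round in range(rounds):
        for master, budget in weights.items():
            queue = queues[master]
            # A master without a pending request is skipped; otherwise it
            # may send at most N_i words before the grant moves on.
            for _slot in range(min(budget, len(queue))):
                sent.append(queue.popleft())
    return sent
```

For example, with weights 2, 1, and 1 for masters A, B, and C, master A can send two words in every round while B and C send one word each, so A receives half of the channel bandwidth when all masters are busy.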

Our switches make the communication quality of the NoC platform predictable and also provide a simple communication model that reduces the design complexity. As shown in Fig. 2.12, a communication path from Buffer 1 to Buffer K is established, and the transactions from Buffer i are granted by the weighted round-robin Scheduler i. Before analyzing the properties of the QoS model shown in Fig. 2.12, we introduce the following definitions:

1) $w_{i,j}$ is the weight of Buffer i in the weighted round-robin Scheduler j.

2) $W_j$ is the sum of the weights of the buffers controlled by Scheduler j.

3) $D_{max}$ denotes the maximum delay of a 1-word transmission.

4) $R$ denotes the provided throughput rate of a buffer.

5) $L_{max}$ denotes the maximum communication path latency of a 1-word transmission.

6) $R_{path}$ denotes the throughput rate of a path.

7) $L_{burst}$ denotes the maximum burst data latency.

Using the above definitions, the proposed network switch design has the following six properties that guarantee QoS:

• Property 1: If Buffer i is empty, the maximum delay from data arrival to transfer is the period of one round of the round-robin scheduler, i.e., $D_{max} = W_j$.

• Property 2: If there are data in Buffer i, the $D_{max}$ between transactions is $W_j$.

• Property 3: If the buffer size is at least twice the buffer's weight, the provided throughput rate is bounded below by the ratio of the weight to the sum of the weights in the round-robin scheduler, i.e., $R \ge \frac{w_{i,j}}{W_j}$.

• Property 4: The maximum path latency of a 1-word transmission is the sum of the maximum node latencies of a 1-word transmission, i.e., $L_{max} = \sum_{j=1}^{k} W_j$.

• Property 5: $R_{path}$ is dominated by the minimum throughput of the buffers in the path, i.e., $R_{path} = \min_{j} \left\{ \frac{w_{i,j}}{W_j} \right\}$, where $j = 1, 2, 3, \cdots, k$.

• Property 6: Using Property 4 and Property 5, the burst data delay can be obtained as $L_{burst} = L_{max} + \frac{N}{R_{path}}$, where $N$ is the data size.

Figure 2.13: NoC platform model.

[Figure labels: PE, SW, Processor, Memory, Buffer.]

Property 6 means that our switch has an upper bound on the burst data delay, so system designers can design target systems that meet real-time constraints.
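Properties 4 to 6 translate directly into arithmetic. The sketch below evaluates them for a path given as two parallel lists; the function names and the list-based representation are assumptions for illustration, and time is assumed to be counted in slots in which one word is transferred.

```python
def path_latency_bound(round_lengths):
    """Property 4: L_max is the sum of W_j over the schedulers on the path."""
    return sum(round_lengths)


def path_throughput(weights, round_lengths):
    """Property 5: R_path is the minimum of w_{i,j} / W_j along the path."""
    return min(w / W for w, W in zip(weights, round_lengths))


def burst_latency_bound(weights, round_lengths, n_words):
    """Property 6: L_burst = L_max + N / R_path."""
    l_max = path_latency_bound(round_lengths)
    return l_max + n_words / path_throughput(weights, round_lengths)
```

As a worked example with made-up numbers, a path through three schedulers with weights 2, 1, and 4 and round lengths 8, 4, and 8 has $L_{max} = 20$ and $R_{path} = 0.25$, so a 16-word burst is bounded by $20 + 16 / 0.25 = 84$ slots.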

2.3 Task Binding Methodology

In this section, we present a communication-aware methodology for solving the task binding problem on the NoC platform constructed from the proposed switch described above.


Figure 2.14: The task graph of MPEG-4 encoder.

2.3.1 NoC Platform Modeling

Without loss of generality, the proposed NoC platform, which uses an efficient switch architecture, can be modeled as in Fig. 2.13. The behavior of the NoC platform model is described as follows:

1) The platform is a mesh-based communication architecture composed of processing elements (PEs) and switches (SWs). Each PE contains one processor, memory, and a network interface (NI).

2) All processors in this platform are identical.

3) The processors have limited buffers to store input and output data.

4) Each PE has local memory to store the execution code and the data.

5) The local memory of a PE is adequate for a task.

We employ a task graph to model applications and assume that, owing to their parallelism, applications can be partitioned into many communicating tasks. Fig. 2.14 shows the task graph of an MPEG-4 encoder [16]. A vertex represents a task, and its label gives the task's functionality. For example, task C, denoted MVMVD, performs the motion vector to motion vector difference calculation. An edge represents a data transmission together with the corresponding communication amount. After task B finishes, it transmits C2 units of data to task C and C5 units of data to task E. An edge also indicates a data dependency: a task cannot be executed until it has received the data from its predecessors. For example, task H cannot be executed until it receives C4 units of data from task D and C8 units of data from task G.
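The dependency rule can be made concrete with a small sketch. Only the four edges named in the text are included, so this is a fragment of the MPEG-4 task graph rather than the full graph of Fig. 2.14; the names `EDGES`, `predecessors`, and `is_ready` are hypothetical, and the communication amounts are kept as the symbolic labels C2, C5, C4, and C8.

```python
# Edges named in the text: (producer, consumer) -> symbolic data amount.
EDGES = {
    ("B", "C"): "C2",
    ("B", "E"): "C5",
    ("D", "H"): "C4",
    ("G", "H"): "C8",
}


def predecessors(task, edges=EDGES):
    """Tasks whose output `task` must receive before it can start."""
    return {src for (src, dst) in edges if dst == task}


def is_ready(task, finished, edges=EDGES):
    """A task may execute once every predecessor has finished and sent its data."""
    return predecessors(task, edges) <= set(finished)
```

With this fragment, `is_ready("H", {"D"})` is false because task G has not yet delivered its data, while `is_ready("H", {"D", "G"})` is true.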

2.3.2 Task Binding Problem

The task binding problem is formulated as follows:

• Given:

1) The application, modeled as a task graph G(V, E), where each vertex in V is a task whose weight denotes its computation amount, and each edge in E represents a dependence between two tasks together with the communication amount between them.

2) The NoC platform (PE, S), where PE and S contain the position information of the processors and the communication architecture information, respectively.

• Goal:

To bind each task in V to a PE and to assign the routing paths between tasks such that the overall system throughput is maximized.

Figure 2.15: Illustration of binding methodology.

An example of a binding result is shown in Fig. 2.15, in which the tasks are bound to PEs and the communication paths are established. Path A and path B overlap and therefore contend for the shared channel in the time domain.

In this work, motivated by the situation in Fig. 2.15, we propose a more accurate communication cost calculation scheme that is used by the task mapping and routing path assignment processes. The more complete communication cost function ξ is described as follows.

$$\xi = \sum_{X \in \text{each path}} \; \sum_{\text{each channel in } X} [\,\ldots\,] \qquad (2.1)$$

where $C_{amount}$, $\rho_X$, and $B_{penalty}$ denote the communication amount, the contention factor, which models the total effect of communication contention on a physical channel, and the bandwidth penalty, respectively. The detailed expressions of $\rho_X$ and $B_{penalty}$ are as follows:

$$\rho_X = \sum_{Y \in \text{other paths in the channel}} C_{amount,Y} \times \Gamma_{Y \to X} \qquad (2.2)$$

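$$B_{penalty} = \begin{cases} \alpha \times \dfrac{B_{demanded} - B_{provided}}{B_{provided}}, & \text{if } B_{demanded} > B_{provided} \\ 0, & \text{if } B_{demanded} \le B_{provided} \end{cases} \qquad (2.3)$$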

where $\Gamma$, $\alpha$, $B_{demanded}$, and $B_{provided}$ denote the contention density, the penalty weight, the bandwidth demanded by the communication paths in a physical channel, and the bandwidth provided by that physical channel, respectively. The contention density is formulated as

$$\Gamma_{b \to a} = \frac{t^{(a,b)}_{overlap}}{t_{commun.time,a}} \qquad (2.4)$$

The density of each pair of communication paths can be derived from the communication profile in the time domain, as shown in Fig. 2.16.

The communication cost proposed in (2.1) means that the efficiency of a communication path is affected by the length of the path, the number of paths sharing each channel, and the bandwidth usage.
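The contention terms in (2.2) to (2.4) can be evaluated as in the following sketch, which covers only those three equations. It assumes that each path occupies a single contiguous time interval `(start, end)` on the channel, which is a simplification of a real communication profile; the dictionary keys `amount` and `interval` and all function names are hypothetical.

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def contention_density(interval_a, interval_b):
    """Eq. (2.4): Gamma_{b->a} = overlap time / communication time of a."""
    return overlap(interval_a, interval_b) / (interval_a[1] - interval_a[0])


def contention_factor(path_x, others):
    """Eq. (2.2): sum of C_amount,Y * Gamma_{Y->X} over the other paths Y."""
    return sum(
        y["amount"] * contention_density(path_x["interval"], y["interval"])
        for y in others
    )


def bandwidth_penalty(demanded, provided, alpha=1.0):
    """Eq. (2.3): alpha times the relative bandwidth overshoot, or 0 if none."""
    if demanded > provided:
        return alpha * (demanded - provided) / provided
    return 0.0
```

For instance, if path X communicates during the interval (0, 4) and another path Y with a communication amount of 6 uses the same channel during (2, 6), the two overlap for half of X's communication time, so Y contributes 6 × 0.5 = 3 to the contention factor of X.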

Therefore, we propose a design methodology, shown in Fig. 2.17, to solve the task binding problem. The first block, Task Graph, contains the computation and communi-

