Ping Pong Buffers for Transmission - Cost-Effective Sandwich Ping-Pong Memory

Chapter 2 Background

2.2 Ping Pong Buffers for Transmission

Memory bandwidth is frequently a limiting factor in the design of high-speed switches and routers. A buffering scheme called ping-pong buffering increases the effective memory bandwidth by a factor of two. Ping-pong buffering halves the number of memory operations per unit time, allowing faster buffers to be built from a given type of memory.

Figure 2.4(a) shows a memory buffer with the arrival (An) and departure (Dn) processes of cells.

In each cell time, which we call a time-slot, zero (An = 0) or one (An = 1) new cell may arrive, and zero (Dn = 0) or one (Dn = 1) cell may depart from the buffer. This means that two independent memory operations are required per cell time: one write and one read. If dual-ported memory is used, both operations can take place simultaneously. However, commercial considerations generally dictate that conventional single-ported memory be used. As a result, the total memory bandwidth must be at least twice the line rate.

Figure 2.4(b) shows a ping-pong buffer of total capacity M (cells), with the arrival and departure processes denoted An and Dn, respectively. The main benefit of a ping-pong buffer is that, using conventional memory devices, it allows the design of buffers that operate twice as fast. This benefit comes with a penalty: if the amount of memory is not increased, the overflow rate of a ping-pong buffer is larger than that of a conventional buffer, and in the worst case half of the memory is wasted. Fortunately, simulations indicate that the problem is eliminated by adding just 5% more memory [9].
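The overflow penalty can be illustrated with a minimal sketch. It assumes that each of the two single-ported memories holds half of the total capacity, that a read always goes to the memory holding the oldest cell, and that a write in the same time-slot must use the other memory. The class name `PingPongBuffer` and the tie-breaking rule are hypothetical choices made for illustration, not the design of [9].

```python
from collections import deque

class PingPongBuffer:
    """Two single-ported memories; each serves at most one operation per time-slot."""

    def __init__(self, capacity):
        self.half = capacity // 2          # cells per memory (assumed equal split)
        self.count = [0, 0]                # occupancy of each memory
        self.fifo = deque()                # (cell, memory index) in arrival order
        self.dropped = 0

    def time_slot(self, arriving_cell=None, depart=False):
        """Process one time-slot and return the departing cell, or None."""
        read_mem, out = None, None
        if depart and self.fifo:
            out, read_mem = self.fifo.popleft()
            self.count[read_mem] -= 1
        if arriving_cell is not None:
            if read_mem is None:
                # No read this slot: pick the emptier memory (ties go to memory 0).
                target = 0 if self.count[0] <= self.count[1] else 1
            else:
                # The read occupies one memory, so the write must use the other.
                target = 1 - read_mem
            if self.count[target] < self.half:
                self.count[target] += 1
                self.fifo.append((arriving_cell, target))
            else:
                self.dropped += 1          # overflow, even if the other memory has room
        return out


buf = PingPongBuffer(capacity=4)
for cell in "abcd":                        # fill both memories
    buf.time_slot(arriving_cell=cell)
out = buf.time_slot(arriving_cell="e", depart=True)
```

In the last time-slot the read frees a location in one memory, but the arriving cell is still dropped because the other memory is full. This is the kind of wasted space the text refers to.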

Figure 2.4 (a) A buffer of capacity M. (b) A Ping-Pong Memory

In network communication, ping-pong buffering is a transmission buffering method that involves two separate buffers: while one buffer receives new transmission data, the other deletes the previous transmission. The two buffers alternate functions, which helps to keep transmission close to continuous.

We can also find ping-pong buffers in a front end system. The system view is illustrated in figure 2.5. For example, a front end interface consists of a number of ping-pong buffers, one for each LAN attachment; an internal bus with an arbitrator and a bus interface; a microprocessor; and main memory. An incoming frame is placed in one of the ping-pong buffers if a buffer is available. Once the buffer is full or the last bit of the frame is received, a signal is raised to inform the bus arbitrator that a buffer is ready to be emptied.

Various scheduling policies, such as First Come First Served, Served in Fixed Order, or a non-preemptive priority scheme, can be used to serve the multiple ping-pong buffers associated with the various attachments. If a buffer is not available, an incoming frame is assumed to be lost. An algorithm is developed and used to investigate the performance characteristics of the ping-pong buffering scheme [10]. Once a ping-pong buffer receives permission to transfer, the frame is transferred from the ping-pong buffer to the main memory via the internal bus. It is further assumed that the main memory is large enough that it does not cause any loss of segments. Finally, the transfers from the memory to the front end processor are not explicitly considered here.
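The behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: each attachment has exactly two buffers, a whole frame fits in one buffer, and the arbitrator serves ready buffers in First Come First Served order. The names `PingPongPort` and `serve_fcfs` are hypothetical and do not come from [10].

```python
from collections import deque

class PingPongPort:
    """One LAN attachment with two frame buffers (hypothetical model)."""

    def __init__(self):
        self.buffers = [None, None]   # None marks a free buffer
        self.lost = 0

    def receive(self, frame):
        """Store an incoming frame; return the buffer index, or None if lost."""
        for index, held in enumerate(self.buffers):
            if held is None:
                self.buffers[index] = frame
                return index
        self.lost += 1                # no buffer available: the frame is lost
        return None

    def release(self, index):
        """Empty a buffer after the bus arbitrator grants the transfer."""
        frame, self.buffers[index] = self.buffers[index], None
        return frame


def serve_fcfs(ready, ports, main_memory):
    """Grant the bus to the oldest ready buffer (First Come First Served)."""
    if ready:
        port_id, index = ready.popleft()
        main_memory.append(ports[port_id].release(index))


ports = [PingPongPort(), PingPongPort()]
ready = deque()                       # (port id, buffer index) pairs awaiting the bus
main_memory = []

for frame in ["f1", "f2", "f3"]:      # three back-to-back frames on port 0
    index = ports[0].receive(frame)
    if index is not None:
        ready.append((0, index))      # signal the arbitrator

serve_fcfs(ready, ports, main_memory) # f1 moves to main memory
refill = ports[0].receive("f4")       # the freed buffer accepts a new frame
```

The third frame is lost because both buffers of the port are still occupied. After one transfer to main memory, the port can accept a frame again.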

Figure 2.5 System View [4]

Chapter 3 Sandwich Ping-Pong Memory

In this chapter, we propose our design. First, we show the architecture and operation of a double buffer (Ping-Pong or transpose buffer). We identify the two idle areas in a Ping-Pong Memory and develop a Common Bar to replace them; the result is the Sandwich Ping-Pong Memory. Second, we introduce the operations of the Sandwich Ping-Pong Memory and derive the formulas for the Initially Idle Time and the Idle Time. Finally, we obtain some conditions from the timing analysis of this design.

3.1 The use of Sandwich Ping-Pong Memory

3.1.1 The Operation of Ping Pong Memory

In chapter 2 we briefly presented the double buffer and saw that it is applied in the 2-D DCT architecture. We now introduce the architecture and operation of the double buffer, also known as the Ping-Pong buffer, which is shown in figure 3.1.

Figure 3.1 The architecture of a Ping-Pong buffer

Figure 3.1 contains two RAMs together with signals such as input data, output data, and address signals for writing and reading. The Control signal selects the write and read operations of the Ping-Pong Memory: when RAM1/RAM2 is in write operation, RAM2/RAM1 is in read operation. The address sequences for writing and reading are different, as shown in figure 3.2: we write data into RAM1/RAM2 row by row and simultaneously read data from RAM2/RAM1 column by column. Two questions concern us. Is any memory cell idle during the write or read operation? If so, what can we do about it?
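The alternation of the two RAMs can be sketched in software. The sketch assumes square n x n blocks stored as flat row-major lists and one write plus one read per step; the function name `ping_pong_transpose` and the `filled` flags are hypothetical and only serve the illustration.

```python
def ping_pong_transpose(blocks, n):
    """Transpose a stream of n x n blocks (flat, row-major) with two RAMs."""
    rams = [[None] * (n * n), [None] * (n * n)]
    filled = [False, False]
    write_sel = 0                          # control signal: RAM currently written
    outputs = []
    for block in list(blocks) + [None]:    # extra pass drains the last block
        read_sel = 1 - write_sel
        out = []
        for t in range(n * n):
            if block is not None:
                rams[write_sel][t] = block[t]          # write address: row by row
            col, row = divmod(t, n)
            out.append(rams[read_sel][row * n + col])  # read address: column by column
        if filled[read_sel]:
            outputs.append(out)
        filled[write_sel] = block is not None
        write_sel = read_sel               # swap the roles of the two RAMs
    return outputs


a = [0, 1, 2, 3]        # rows [0, 1] and [2, 3]
b = [4, 5, 6, 7]        # rows [4, 5] and [6, 7]
result = ping_pong_transpose([a, b], n=2)
```

Each output block is the transpose of the block written during the previous pass, which is the behaviour the two address sequences are meant to produce.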

Figure 3.2 Ping Pong Memory is on write/read operation

3.1.2 Common Bar

There are some idle memory cells in a Ping-Pong Memory, shown in figure 3.3. There is one block marked by a dotted line in the Ping Memory and one in the Pong Memory; each dotted-line area marks the idle memory cells.

We combine the two dotted-line areas into one and name it the “Common Bar.” As a result, there is a block of memory, the Common Bar, between the Ping and Pong Memory. This is the Sandwich Ping-Pong Memory.

Figure 3.3 Common Bar

3.1.3 Sandwich Ping Pong Memory

We develop a Sandwich Ping-Pong Memory based on the double buffer. Figure 3.4 shows the architecture of the Sandwich Ping-Pong Memory, which is built by adding one single-port memory between the ping memory and the pong memory.

The architecture of the Common Bar is exactly the same as that of the ping or pong memory, because they are all of the same type of construction. Since the double buffer is used in the 2-D DCT architecture, so is the Sandwich Ping-Pong Memory. Simulation and verification on an FPGA show that the Sandwich Ping-Pong Memory works as a transpose buffer connecting the two 1-D DCT architectures, where the outputs of the first 1-D DCT are row-wise and the inputs of the second 1-D DCT must be column-wise.

Figure 3.4 The architecture of Sandwich Ping Pong Memory

3.2 Read / Write Operation

Figure 3.5 shows that when the first 1-D DCT architecture writes its results row by row into one memory (ping or pong memory), the second 1-D DCT architecture reads its input values column by column from the other memory (pong or ping memory). The read and write addresses are generated by a control block, which also defines, through the control signal, which memory is read and which is written at each memory access step.

Figure 3.5 The architecture of 2-D DCT

3.2.1 Row-column block memory

The forward and inverse transforms are merely mappings from the spatial domain to the transform domain and vice versa. The DCT is a separable transform and as such, the row-column decomposition can be used to evaluate (3-1).

With the cosine terms of (3-1) denoted compactly, the column transform can be expressed as (3-2), and the row transform can be expressed as (3-3).

In order to compute an N x N-point DCT (where N is even), N row transforms and N column transforms need to be performed. However, by exploiting the symmetries of the cosine function, the number of multiplications can be reduced from N² to N²/2. In this case, each row transform given by (3-3) can be written as matrix-vector multiplications via (3-4). Using matrix notation, for N=8, (3-4) can be written as (3-5) and (3-6).

Equations (3-5) and (3-6) describe the computation of the even and odd coefficients, respectively, of the row transform for N=8. The second 1-D DCT, i.e. the column transform described by (3-2), can also be computed using matrix-vector multipliers similar to those described by (3-4). Hence both the row and column transforms can be performed using the same architecture.
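The saving from the cosine symmetry can be checked numerically. The sketch below assumes the common orthonormal scaling of the DCT, which may differ from the scaling used in (3-1) to (3-6), and the function names are hypothetical. It folds the input into sums and differences of mirrored samples, so each coefficient needs N/2 instead of N multiplications by a cosine.

```python
import math

def dct_1d_direct(x):
    """N-point DCT computed term by term: N cosine products per coefficient."""
    n = len(x)
    scale = math.sqrt(2.0 / n)
    return [
        scale * (math.sqrt(0.5) if u == 0 else 1.0)
        * sum(x[i] * math.cos((2 * i + 1) * u * math.pi / (2 * n)) for i in range(n))
        for u in range(n)
    ]

def dct_1d_even_odd(x):
    """Same DCT using cosine symmetry: N/2 cosine products per coefficient."""
    n = len(x)
    half = n // 2
    scale = math.sqrt(2.0 / n)
    sums = [x[i] + x[n - 1 - i] for i in range(half)]    # feeds even coefficients
    diffs = [x[i] - x[n - 1 - i] for i in range(half)]   # feeds odd coefficients
    out = []
    for u in range(n):
        folded = sums if u % 2 == 0 else diffs
        acc = sum(folded[i] * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                  for i in range(half))
        out.append(scale * (math.sqrt(0.5) if u == 0 else 1.0) * acc)
    return out

samples = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]   # arbitrary test row
direct = dct_1d_direct(samples)
fast = dct_1d_even_odd(samples)
max_error = max(abs(p - q) for p, q in zip(direct, fast))
```

Both functions return the same coefficients up to rounding, which is why one matrix-vector structure of half the size can serve the even outputs and another the odd outputs.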

According to the 2-D DCT algorithm, a row-column block ping-pong memory is needed to access the data. For example, the results of computing the even and odd coefficients must be stored in memory. In the next section, we present the scan line in the Sandwich Ping-Pong Memory.
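The row-column evaluation described in this subsection can be sketched as two passes of a 1-D DCT with a transposition in between, which is the role the ping-pong memory plays in hardware. The orthonormal scaling and all function names are assumptions made for the sketch; the direct double sum is included only to check that the two routes agree.

```python
import math

def dct_1d(x):
    """Orthonormal N-point DCT of one row or column (assumed scaling)."""
    n = len(x)
    return [
        math.sqrt(2.0 / n) * (math.sqrt(0.5) if u == 0 else 1.0)
        * sum(x[i] * math.cos((2 * i + 1) * u * math.pi / (2 * n)) for i in range(n))
        for u in range(n)
    ]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def dct_2d_row_column(block):
    """First 1-D DCT on rows, transposition, then second 1-D DCT."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)

def dct_2d_direct(block):
    """Direct evaluation of the separable 2-D DCT, used as a reference."""
    n = len(block)
    result = []
    for u in range(n):
        row = []
        for v in range(n):
            eu = math.sqrt(0.5) if u == 0 else 1.0
            ev = math.sqrt(0.5) if v == 0 else 1.0
            total = 0.0
            for i in range(n):
                for j in range(n):
                    total += (block[i][j]
                              * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * j + 1) * v * math.pi / (2 * n)))
            row.append((2.0 / n) * eu * ev * total)
        result.append(row)
    return result

block = [[(3 * r + 5 * c) % 7 for c in range(4)] for r in range(4)]   # arbitrary data
separable = dct_2d_row_column(block)
reference = dct_2d_direct(block)
max_error = max(abs(p - q)
                for ra, rb in zip(separable, reference)
                for p, q in zip(ra, rb))
```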

3.2.2 Scan line of the Sandwich Ping-Pong memory

According to the 2-D DCT algorithm, the scan lines of the write and read operations in the Sandwich Ping-Pong Memory are row by row and column by column, as shown in figures 3.6 and 3.7, respectively.

Figure 3.6 Row by row on write operation

Figure 3.7 Column by column on read operation

The scan line of writing proceeds in the following steps. When the write operation starts, data is written into the Ping/Pong memory and the Common Bar in sequence. First, data is written into the Ping/Pong memory row by row until the Ping/Pong memory is full. After the Ping/Pong memory is fully occupied, data is written into the single-port memory, also known as the Common Bar. Finally, data is written into the Pong/Ping memory row by row.

On the other hand, when the read operation starts, data is read from the Pong/Ping memory and the Common Bar, column by column. The reading scan line differs slightly from the writing scan line. On the write operation, we do not write data into the Common Bar until we have finished writing into the Ping/Pong memory. On the read operation, however, we read from the Pong/Ping memory and the Common Bar by turns: data is read from the Pong/Ping memory, then the Common Bar, and then back to the Pong/Ping memory. When we read the last data from the last address of the Common Bar, the read operation is complete. Note that while data is being written into the memory, data is being read from it at the same moment; the write and read operations take place simultaneously.
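The two scan lines can be written out as access sequences. The sketch assumes that the Common Bar rows sit directly below the Ping (or Pong) rows so that every column spans both memories. The parameters `m`, `n` and `p` (total rows, columns, and Common Bar rows) and the labels `"ping"` and `"bar"` are hypothetical names chosen for illustration.

```python
def write_order(m, n, p):
    """Row-by-row write: the Ping (or Pong) rows first, then the Common Bar rows."""
    order = [("ping", r, c) for r in range(m - p) for c in range(n)]
    order += [("bar", r, c) for r in range(p) for c in range(n)]
    return order

def read_order(m, n, p):
    """Column-by-column read, alternating between Ping (or Pong) and Common Bar."""
    order = []
    for c in range(n):
        order += [("ping", r, c) for r in range(m - p)]   # upper part of column c
        order += [("bar", r, c) for r in range(p)]        # lower part of column c
    return order

writes = write_order(4, 4, 1)
reads = read_order(4, 4, 1)
```

With these parameters the write sequence reaches the Common Bar only after all twelve Ping cells, while the read sequence visits the Common Bar once per column and finishes at its last address.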

The operation of the transpose memory can be explained if we visualize it as an 8 x 8 array. It is actually implemented as a 64-byte SRAM. The first eight bytes of the SRAM correspond to the first row of the array, the second eight bytes to the second row, and so on. Let mode 1 be a sequence of accesses to locations {0, 1, 2, 3, 4, 5, 6, 7, 8 ...} in that order. This corresponds to scanning rows starting at the top left corner. Let mode 2 be accesses to locations {0, 8, 16, 24, 32, 40, 48, 56, 1, 9 ...} in that order. This corresponds to scanning columns starting at the top left corner.
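The two access modes can be generated as address lists. This is a small sketch; the function names are hypothetical, and `n=8` matches the 8 x 8 array described above.

```python
def mode1_addresses(n=8):
    """Scan rows starting at the top left corner."""
    return list(range(n * n))

def mode2_addresses(n=8):
    """Scan columns starting at the top left corner."""
    return [row * n + col for col in range(n) for row in range(n)]

mode1 = mode1_addresses()
mode2 = mode2_addresses()
```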

The transposition occurs as follows. Data is read out according to mode 1 for the first 64 clock cycles. New data (that needs to be transposed) is also written according to mode 1. A write always follows a read; i.e., a read from a location is always followed by a write to that location. For the next 64 clock cycles, reads and writes occur according to mode 2. The data which is read out is the transpose of the data which was written in during the previous 64 clock cycles. As a result, the latency of the transpose operation is 64 clock cycles.
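The read-then-write scheme can be simulated to confirm that alternating the two modes yields the transpose. The sketch models the SRAM as a Python list and feeds one value per clock cycle; the function names and the dummy block used to flush the last result are assumptions of the sketch.

```python
def addresses(mode, n=8):
    """Address sequence for mode 1 (rows) or mode 2 (columns)."""
    if mode == 1:
        return list(range(n * n))
    return [row * n + col for col in range(n) for row in range(n)]

def run_transpose_memory(blocks, n=8):
    """Stream blocks through a single SRAM, alternating the access mode."""
    sram = [None] * (n * n)
    outputs = []
    mode = 1
    for block in list(blocks) + [[None] * (n * n)]:   # dummy block flushes the last one
        out = []
        for addr, value in zip(addresses(mode, n), block):
            out.append(sram[addr])    # read the old value first
            sram[addr] = value        # then write the new value to the same location
        outputs.append(out)
        mode = 2 if mode == 1 else 1
    return outputs[1:]                # the first pass reads uninitialised data

block_a = list(range(64))
block_b = list(range(100, 164))
out_a, out_b = run_transpose_memory([block_a, block_b])
```

Each output block appears one pass after its input, which corresponds to the 64-cycle latency stated above.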

3.3 Timing Analysis

In this section, we derive the Initially Idle Time and the Idle Time, together with some conditions or constraints for the Sandwich Ping-Pong Memory.

3.3.1 The Initially Idle Time

When can we start to read data from the Sandwich Ping-Pong Memory? The answer is the Initially Idle Time, which we derive below.

We take as an example a Sandwich Ping-Pong Memory that is an N x M rectangle in size, shown in figure 3.8. There are three coefficients in the derivation: N, M and P. The coefficient N represents the number of cell columns of the Sandwich Ping-Pong Memory. The coefficient M represents the number of cell rows of the Ping (or Pong) Memory together with the Common Bar. The coefficient P represents the number of cell rows of the Common Bar. Hence, the number of cell rows of the Ping or Pong Memory is M−P. The individual sizes of the Ping (or Pong) memory and the Common Bar are (M−P)×N and P×N, respectively.

In addition, we assume that the access time of one memory cell is one unit time. Therefore, the time to write data into the Ping/Pong memory is (M−P)×N, and the time to write data into the Common Bar is P×N.

The scan line of the write operation is row by row, through the Ping/Pong memory and the Common Bar in sequence. We then wait for a period of time, named the Idle Time, before continuing with the next write operation. On the other hand, the scan line of the read operation is column by column, alternating between the Pong/Ping memory and the Common Bar. In the same manner, we wait a moment before continuing with the next read operation. We first derive the Initially Idle Time, followed by the Idle Time.

Figure 3.8 Coefficients for Sandwich Ping Pong Memory

From the time schedule and figure 3.8, we know that the data is read column by column, alternating between the Ping/Pong memory and the Common Bar. The first step in operating the Sandwich Ping-Pong Memory is to write data into part of the Ping memory and the Common Bar. After writing, we start to read the data from them. The time that elapses before the first data is read is defined as the “Initially Idle Time”. After the Initially Idle Time we can read data from the Ping memory and the Common Bar. We use a time schedule to explain the derivation, which yields several constraints.

After filling the Ping memory and Common Bar with data, we start to read the data. This gives the first constraint, shown in Fig. 3.9.

Condition 1:

Initial nonzero utilization constraint: $T_1^W \ge T_1^R$

Figure 3.9 The first constraint

In Fig. 3.9, the time to fill the data in the Ping memory and Common Bar ($T_1^W$) must be greater than the Initially Idle time ($T_1^R$). Because we have to write the data into the Ping memory and Common Bar first, we read the data from them after the Initially Idle time.

After observing the write operation in detail, we found that data are written into the Ping memory first and into the Common Bar later. The second constraint comes up in Fig. 3.10.

Condition 2:

Ping memory read contention constraint: $T_2^W \le T_2^R$

Figure 3.10 The second constraint

In Fig. 3.10, the time to fill the data in the Ping memory ($T_2^W$) must be shorter than the Initially Idle time ($T_2^R$). That is because we want to read the data earlier: we can read the data from the Ping memory and write the remaining data into the Common Bar simultaneously.

In addition, the data is read column by column. Before the data is read from the Common Bar, it must already have been written into the Ping memory and Common Bar. Therefore, we have the third constraint, shown in Fig. 3.11.

Condition 3:

Common Bar read memory contention constraint: $T_3^W \le T_3^R$

Figure 3.11 The third constraint

In Fig. 3.11, the time to fill the data in the Ping memory and Common Bar ($T_3^W$) must be shorter than the time to read the data from the first column in the Ping memory ($T_3^R$), because only after the data has been filled into the Common Bar can we read it.

In conclusion, from conditions 1, 2 and 3, we derive formula (3-7).

$M \times N - (M - P) \le \mathit{Initially} \le M \times N$    (3-7)
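The three conditions can be checked by enumeration. The sketch below rests on an assumed reading of them, with one unit time per cell: condition 1 as M×N ≥ Initially, condition 2 as (M−P)×N ≤ Initially, and condition 3 as M×N ≤ Initially + (M−P), where M−P is the number of cells in the first Ping column. The function name is hypothetical.

```python
def initially_idle_range(m, n, p):
    """Enumerate Initially Idle Times that satisfy conditions 1 to 3 (assumed reading)."""
    fill_ping = (m - p) * n          # T2^W: time to fill the Ping memory
    fill_all = m * n                 # T1^W and T3^W: Ping memory plus Common Bar
    first_column = m - p             # cells in the first Ping column
    feasible = [
        t for t in range(0, 2 * m * n + 1)
        if fill_all >= t                      # condition 1
        and fill_ping <= t                    # condition 2
        and fill_all <= t + first_column      # condition 3
    ]
    return feasible[0], feasible[-1]

low, high = initially_idle_range(4, 4, 1)
```

For M=4, N=4 and P=1 the feasible range runs from M×N − (M−P) to M×N under this reading.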

After deriving the Initially Idle Time, we present the Idle Time. Again, from the time schedule and figure 3.8, we know that data is written row by row, first into the Ping memory and then into the Common Bar. After the Ping memory is fully occupied by data, we wait for a period of time, the Idle Time, because of the Common Bar. The Idle Time is presented below, together with some conditions.

On the write schedule, we write data into the Ping memory and Common Bar and wait for a while, the “Idle Time.” Then we write data into the Pong memory and Common Bar, and so on.

Simultaneously, on the read schedule, after the Initially Idle Time, we read data from the Ping memory and Common Bar and wait for a while, the “Idle Time.” There must be a constraint that prevents the Sandwich Ping-Pong Memory from performing a null operation.

Therefore, we have the fourth constraint, shown in Fig. 3.12.

Condition 4:

Run-time nonzero-utilization constraint: $T_4^W \le T_4^R$

Write schedule: (M×N)ping, Idle, (M×N)pong, Idle, ...

Read schedule: Initially, (M×N)ping, Idle, (M×N)pong, ...

$T_4^W \le T_4^R \Rightarrow M \times N + \mathit{Idle} \le \mathit{Initially} + M \times N$

Figure 3.12 The fourth constraint

In Fig. 3.12, the time to write the data into the Ping memory and Common Bar, plus the Idle time ($T_4^W$), must be shorter than the time to read the data from the Ping memory and Common Bar ($T_4^R$). This prevents the Idle time of the write operation and that of the read operation from occurring at the same moment. Without this constraint, the memory would be in null operation.

Before we write the data into the Common Bar, the data should already be read from the Ping memory and Common Bar. We have the fifth constraint in Fig. 3.13.

Condition 5:

Common Bar write memory contention constraint: $T_5^W \ge T_5^R$

Figure 3.13 The fifth constraint

In Fig. 3.13, the time to write the data into the Ping memory and Common Bar, plus the Idle time and the time to write the Pong memory ($T_5^W$), must be greater than the time to read the data from the Ping memory and Common Bar ($T_5^R$), because only after the data has been read from the Common Bar can we write new data into it.

Therefore, from conditions 4 and 5, we derive formula (3-8).

$\mathit{Initially} - (M - P) \times N \le \mathit{Idle} \le \mathit{Initially}$    (3-8)

Taking the minimum of the Initially Idle Time in formula (3-7), we derive formula (3-9).

$P \times N - (M - P) \le \mathit{Idle} \le M \times N - (M - P)$    (3-9)

In addition, taking the maximum of the Initially Idle Time in formula (3-7), we derive formula (3-10).

$P \times N \le \mathit{Idle} \le M \times N$    (3-10)

No matter what the data is, it is accessed in the Sandwich Ping-Pong Memory one cell at a time.
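The substitution behind (3-9) and (3-10) can be written out step by step. The derivation assumes that (3-8) reads Initially − (M−P)×N ≤ Idle ≤ Initially and that the bounds of the Initially Idle Time in (3-7) are M×N − (M−P) and M×N. Both assumptions are consistent with the bounds stated in (3-9) and (3-10).

```latex
\begin{align*}
\text{(3-8):}\quad
  & \mathit{Initially} - (M-P)N \;\le\; \mathit{Idle} \;\le\; \mathit{Initially} \\[4pt]
\text{minimum } \mathit{Initially} = MN - (M-P):\quad
  & MN - (M-P) - (M-P)N \;\le\; \mathit{Idle} \;\le\; MN - (M-P) \\
  & PN - (M-P) \;\le\; \mathit{Idle} \;\le\; MN - (M-P) && \text{(3-9)} \\[4pt]
\text{maximum } \mathit{Initially} = MN:\quad
  & MN - (M-P)N \;\le\; \mathit{Idle} \;\le\; MN \\
  & PN \;\le\; \mathit{Idle} \;\le\; MN && \text{(3-10)}
\end{align*}
```

The simplification in both cases uses MN − (M−P)N = PN.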

In practice, we usually use an 8 x 8 block matrix in the 2-D Discrete Cosine Transform. Hence, we should fix some conditions and coefficients for formulas (3-7) and (3-9).

Here is an example for M=4, N=4 and P=1, shown in figure 3.14.

Figure 3.14 Example for Initial and Idle Time

There are twelve memory cells in each of the Ping and Pong Memories, and four cells in the Common Bar. We write data into the Ping Memory, then the Common Bar and the Pong Memory, row by row. We read from the Ping Memory, then the Common Bar and the Pong Memory, column by column. According to formulas (3-7) and (3-9), the Initially Idle Time and the Idle Time are 13 and 1 unit times, respectively.
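The numbers in this example can be reproduced with a short calculation. It assumes that the smallest Initially Idle Time allowed by (3-7) is M×N − (M−P) and uses the bounds of (3-9) for the Idle Time. The function name and the dictionary keys are hypothetical.

```python
def sandwich_timing(m, n, p):
    """Cell counts and smallest timing values under the assumed bounds."""
    return {
        "ping_cells": (m - p) * n,               # size of the Ping (or Pong) memory
        "bar_cells": p * n,                      # size of the Common Bar
        "initially_min": m * n - (m - p),        # assumed lower bound of (3-7)
        "idle_min": p * n - (m - p),             # lower bound of (3-9)
        "idle_max": m * n - (m - p),             # upper bound of (3-9)
    }

example = sandwich_timing(m=4, n=4, p=1)
```

For M=4, N=4 and P=1 this gives twelve Ping cells, four Common Bar cells, an Initially Idle Time of 13 and a smallest Idle Time of 1, matching the values quoted above.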

3.3.3 Line Buffer

Many algorithms and VLSI architectures for the fast computation of the one-dimensional (1-D) and two-dimensional (2-D) DCT have been proposed [11]. For an effective VLSI implementation of an orthogonal transform, the corresponding algorithm should be numerically stable, and its computational structure should be regular (recursive and repetitive). Experience with VLSI implementations shows that the regularity of the algorithm is a prime concern [7]. Almost all VLSI chips are implemented for fixed 8x8 or 16x16 square block sizes [8].

Figure 3.15 Coefficients for line buffer

Therefore, we introduce one more coefficient to represent the square block size. We choose X as this coefficient and modify the two formulas (3-7) and (3-9) accordingly. An example of the Sandwich Ping-Pong Memory, called the line buffer, is shown in figure 3.15. We explain the derivation of the Initially Idle Time formula in general form (3-11) with a general time schedule as below. After filling the Ping memory and Common Bar with data, we start to read the data. From Fig. 3.16, we have the first constraint.

Condition 1 for general form:

Initial nonzero utilization constraint: $T_1^W \ge T_1^R$

Figure 3.16 The first constraint for general form

In Fig. 3.16, the time to fill the data in the Ping memory and Common Bar ($T_1^W$) must be
