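Serialization Technique For Link Wires - Low-Power and High-Reliability Transmission Architecture for Network-on-Chip Based on Self-Corrected Green Coding and Self-Calibrated Voltage Scaling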

The physical transfer unit (phit) is the unit into which a packet is divided for transmission through the micro-network. Simply put, the phit size is the bit-width of the link wires, the I/O and the switch. A large phit size increases network area and energy consumption, especially in the switching circuits and buffering units of the switch fabrics. Some approaches address signal integrity by protecting the NoC interconnection infrastructure against transient malfunctions [50,51]. However, these approaches cannot decode the codewords in each switch fabric because decoding adds significant delay, and the critical path depth grows rapidly as the bit-width increases. The codewords therefore travel through the network undecoded, and their extra bits cost considerable area and energy in the switching circuits and buffers of the switch fabrics.

Joint coding schemes have been considered an effective way to reduce power consumption while providing a reliable interconnect. However, both crosstalk avoidance codes and error correction codes enlarge the physical transfer unit (phit) in a network-on-chip. To address the disadvantages mentioned above, the joint bus and error correction coding scheme can be combined with serialization and deserialization. Figure 2.5 shows a K-to-N serialization with an N:1 ratio.

The serialization ratio is defined as the I/O bit-width divided by the phit size. Without a serializer, K bits are sent each cycle; with an N:1 serializer, K/N bits are sent each cycle. The serializer and deserializer reduce the phit size and thereby reduce the area and energy consumption of the switch fabrics. However, to achieve the same throughput, serialization requires a higher operating frequency in the interconnection network. On-chip serialization is nevertheless a crucial technique for NoC implementation: it reduces overall network area and optimizes power consumption, as explained in [52,53].

Figure 2.5: K-to-N serialization with N:1 ratio
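The word-to-phit splitting described above can be sketched in a few lines of Python. This is a minimal behavioural illustration, not the circuit used in this work: the function names `serialize` and `deserialize` are hypothetical, and sending the least-significant phit first is an assumption.

```python
def serialize(word: int, k: int, n: int) -> list:
    """Split a k-bit word into n phits of k // n bits each.

    Assumption: the least-significant phit is sent first.
    """
    if k % n != 0:
        raise ValueError("k must be a multiple of n")
    phit_bits = k // n
    mask = (1 << phit_bits) - 1
    return [(word >> (i * phit_bits)) & mask for i in range(n)]


def deserialize(phits: list, k: int, n: int) -> int:
    """Reassemble the k-bit word from n phits received in the same order."""
    phit_bits = k // n
    word = 0
    for i, phit in enumerate(phits):
        word |= phit << (i * phit_bits)
    return word


# A 32-bit word over a 4:1 serializer travels as four 8-bit phits.
phits = serialize(0xDEADBEEF, 32, 4)
restored = deserialize(phits, 32, 4)
```

With a 4:1 ratio the link carries 8 wires instead of 32, at the cost of four link cycles per word, which is why the link must run faster to keep the same throughput.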

Figures 2.6(a) and 2.6(b) show simulations under high and low wire loading, respectively. Regardless of loading, power consumption decreases as the serialization ratio increases at low operating frequencies. At higher operating frequencies, however, power consumption increases with the serialization ratio because a large driver is needed to provide sufficient driving ability.

Figure 2.6: Average power versus different ratio of serializer and frequency (a) in high loading (b) in low loading of wires

Figure 2.7 shows that switch and link energy consumption decrease as the serialization ratio increases, whereas queuing buffer energy consumption increases in proportion to the serialization ratio. According to the simulation results in [53], a 4:1 serialization ratio is the optimum for energy saving in network-on-chip interconnects.

Figure 2.7: Energy variation in relation to serialization ratio when the number of processing units (N) = 16 under Mesh and Star NoC topology [53].

We implemented the serializer and deserializer with the all-digital self-calibrated multi-phase delay-locked loop in [54]. A shift-register-based serializer/deserializer architecture is adopted in this thesis, implemented with low-swing edge-triggered flip-flops. The 4-to-1 serializer circuit and waveform are shown in Figure 2.8(a), and the 4-to-1 deserializer circuit and waveform in Figure 2.8(b). The data can operate at a quarter of the clock frequency, and this lower operating frequency saves power.

Figure 2.8: (a) Shift Register Based Serializer and waveform (b) Shift Register Based Deserializer and waveform

Chapter 3 Self-Corrected Green Coding Scheme

3.1 Preliminary

Joint coding schemes based on the unified framework provide better communication performance. However, the schemes mentioned above simply combine different kinds of codes directly. The intrinsic properties of crosstalk avoidance coding and error correction coding are mutually exclusive, except in duplicate-add-parity (DAP). Previous works also suffer from encoder/decoder hardware overhead, and the encoder/decoder propagation delay grows significantly as the number of un-coded bits increases. Moreover, without serialization and deserialization of the link wires, a large phit size increases network area and energy consumption. To achieve a low-power and reliable interconnect, we propose a joint bus and error correction coding scheme with 4-to-1 serializers and deserializers, as shown in Figure 3.1.


Figure 3.1: A joint bus and error correction coding scheme with serializers/deserializer in network-on-chip

In this chapter, we focus on the joint bus and error correction coding scheme, called the self-corrected green coding scheme, which realizes a reliable and green interconnection for NoC platforms. The scheme is constructed from two stages: the green bus coding stage and the triplication ECC stage. The green bus coding is developed from the joint triplication bus power model to achieve further energy reduction for the triplication ECC. The details of the scheme are described in the following sections. It offers a shorter ECC delay, greater energy reduction and smaller area.

3.2 A Unified Framework of Joint Coding Scheme

For on-chip interconnection, three main problems have to be considered: delay, power and reliability. Delay arises from the large propagation delay caused by capacitive loading; on long global lines in particular, charging the capacitance with a low-swing voltage takes a long time. The high power consumption of interconnects is due to both parasitic and coupling capacitance. Finally, reliability suffers from an increased susceptibility to errors caused by noise. In advanced technologies, circuits and interconnects become more sensitive to noise because of the lower operating voltage. In addition, increasing coupling noise, soft-error rate and bouncing noise also decrease reliability. In view of this, self-calibration circuitry is essential in today's SoC design. Coding theory is therefore an effective solution to these three challenges. Joint bus and error correction coding has been an elegant and effective technique to mitigate the crosstalk effect and to provide a reliability bound for on-chip interconnect.

Different coding schemes address different problems:

(1) LPC (Low-Power Codes): Reduce transition activity to achieve a low-power interconnect.

(2) CAC (Crosstalk Avoidance Codes): Avoid specific code patterns or code transitions to reduce the delay and power dissipation produced by the crosstalk effect.

(3) ECC (Error Control Codes): To guarantee error-free transmission, the code has to provide a reliability bound. The code is able to detect or correct the error bits.

LPC and CAC are hard to separate completely because they share some properties: avoiding crosstalk between lines often lowers the power consumption as well. In short, LPC and CAC reduce transition activity and forbid transitions that consume a large amount of power.

A joint code architecture was proposed in [14]. The unified coding framework is shown in Figure 3.2; its rules are:

(1) CAC needs to be the outermost code

(2) LPC can follow CAC

(3) ECC needs to be systematic

(4) The additional information bits generated by LPC (p) and ECC (m) need to be encoded through a linear crosstalk code (LXC1/LXC2)

Figure 3.2: Unified coding framework [14]

3.2.1 Related Work On Crosstalk Avoidance Codes

Crosstalk avoidance codes (CACs) can improve signal integrity and reduce the coupling capacitance effect, and hence the energy dissipation of wire segments. CACs avoid the worst-case switching patterns of a wire by ensuring that a transition from one codeword to another does not cause adjacent wires to switch in opposite directions. According to the analysis in [18] for on-chip buses, the bus lines must be longer than 20 mm for these encoding schemes to be energy efficient in practical implementations. In an NoC, the wire segments between two routers, or between a router and an IP block, are significantly shorter than this limit [55].

The purpose of a crosstalk avoidance code is to reduce the line delay to (1+pλ)τ, where p = 1, 2 or 3 depending on the maximum coupling (the worst-case delay is (1+4λ)τ). The following considers four CACs: Forbidden Overlap Codes, Forbidden Transition Codes, Forbidden Pattern Codes and One Lambda Codes.

These CACs achieve different degrees of delay reduction.
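To make the coupling factor p concrete, the sketch below computes it for each wire of a bus transition. It assumes the commonly used model in which a switching wire sees p = |Δi − Δi−1| + |Δi − Δi+1|, with Δ equal to +1, 0 or −1 for a rising, quiet or falling wire; this model, the treatment of the bus edges as quiet neighbours, and the function names are assumptions for illustration and are not taken from the text above.

```python
def crosstalk_class(prev: str, nxt: str, i: int) -> int:
    """Coupling factor p of wire i in the delay (1 + p*lambda)*tau
    for the bus transition prev -> nxt (bit strings of equal length)."""

    def delta(j: int) -> int:
        # Assumption: positions beyond the bus edge behave like quiet wires.
        if j < 0 or j >= len(prev):
            return 0
        return int(nxt[j]) - int(prev[j])

    d = delta(i)
    if d == 0:
        return 0  # a wire that does not switch has no transition delay
    return abs(d - delta(i - 1)) + abs(d - delta(i + 1))


def worst_class(prev: str, nxt: str) -> int:
    """Largest coupling factor over all wires of the bus."""
    return max(crosstalk_class(prev, nxt, i) for i in range(len(prev)))


# The 010 -> 101 transition gives the worst case p = 4 on the middle wire.
worst = worst_class("010", "101")
```

Under this model, forbidding 010 to 101 overlaps caps the middle wire at p = 3, and forbidding opposite transitions on adjacent wires caps it at p = 2, which matches the ordering of the codes discussed next.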

First, we define three conditions that help us analyze the switching activity of codewords: the Forbidden Overlap condition, the Forbidden Transition condition and the Forbidden Pattern condition. The Forbidden Overlap condition describes a codeword transition from 010 to 101 or from 101 to 010, as shown in Figure 3.3(a).

The Forbidden Transition condition describes a codeword transition from 01 to 10 or from 10 to 01, as shown in Figure 3.3(b). The Forbidden Pattern condition describes a codeword containing the pattern 010 or 101.

Figure 3.3: (a) Forbidden Overlap condition (b) Forbidden Transition condition
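The three conditions defined above translate directly into checks on bit strings. The following is a minimal sketch; the function names are hypothetical and codewords are represented as strings of '0' and '1' purely for readability.

```python
def violates_fo(prev: str, nxt: str) -> bool:
    """Forbidden Overlap: 010 -> 101 or 101 -> 010 at the same bit positions."""
    for i in range(len(prev) - 2):
        pair = (prev[i:i + 3], nxt[i:i + 3])
        if pair in (("010", "101"), ("101", "010")):
            return True
    return False


def violates_ft(prev: str, nxt: str) -> bool:
    """Forbidden Transition: 01 -> 10 or 10 -> 01 at the same bit positions."""
    for i in range(len(prev) - 1):
        pair = (prev[i:i + 2], nxt[i:i + 2])
        if pair in (("01", "10"), ("10", "01")):
            return True
    return False


def violates_fp(word: str) -> bool:
    """Forbidden Pattern: the codeword itself contains 010 or 101."""
    return "010" in word or "101" in word
```

Note that the first two checks look at a pair of successive codewords, while the third looks at a single codeword.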

(1) Forbidden Overlap Codes (FOC)

Maximum coupling can be reduced to p=3. The FO condition is satisfied if and only if a codeword having the bit pattern 010 (or 101) does not make a transition to a codeword having the pattern 101 (or 010) at the same bit positions. Encoding all the bits at once is not feasible for wide links because of the size and complexity of the codec hardware. For a 4-bit sub-channel, the coding scheme is shown in Table 1(a).

For coding 32 bits, eight FOC4-5 blocks are needed, and a 32-bit un-coded link is converted to a 40-bit coded link. In this case, two sub-channels can be placed next to each other without any shielding and without violating the FO condition.

(2) Forbidden Transition Codes (FTC)

Maximum coupling can be reduced to p=2. The FT condition is satisfied by ensuring that the transitions between two successive codewords do not cause adjacent wires to switch in opposite directions (i.e., if a codeword has a 01 bit pattern, the subsequent codeword cannot have a 10 pattern at the same bit position). For a 3-bit sub-channel, the coding scheme is given in Table 1(b). In this case, too, the sub-channels are combined in such a way that there is no forbidden transition at the boundaries between them. Consequently, a 32-bit un-coded link is converted to a 53-bit coded link.

(3) Forbidden Pattern Codes (FPC)

Maximum coupling can be reduced to p=2. FPC is achieved by avoiding the 010 and 101 bit patterns in every codeword. For a 4-bit sub-channel, the coding scheme is given in Table 1(c). Consequently, a 32-bit un-coded link is converted to a 52-bit coded link.

(4) One Lambda Codes (OLC)

Maximum coupling can be reduced to p=1. OLC satisfies the forbidden adjacent boundary pattern condition: two adjacent bit boundaries in a codeword cannot both be of 01-type or 10-type. OLC also avoids the FT and FP conditions.

The simplest OLC is duplication and shielding, where every bit is duplicated and shield wires are inserted between adjacent pairs of duplicated bits [17]. However, OLC encodes k un-coded bits into $l = \frac{11}{4}k - 3$ bits. For example, 85 wires are required for 32 un-coded bits.

Table 1: (a) FOC4-5 coding schemes (b) FTC3-4 coding schemes (c) FPC4-5 coding schemes (d) OLC4-8 coding schemes

3.2.2 Related Work On Error Control Codes

Incorporating different coding schemes in SoC design is being investigated as a means to increase system reliability. CACs reduce the worst-case switching capacitance of a wire by ensuring that specific codeword transitions do not happen.

However, an NoC is sensitive to internal noise sources (power supply noise, crosstalk noise, inter-symbol interference) and external noise sources (electromagnetic interference, thermal noise, noise from alpha particles) because of lower supply voltages, smaller node capacitances, decreasing inter-wire spacing, the increasing role of coupling capacitances, higher clock frequencies, and so on.

CACs do not protect against these noise sources. To make the system robust, a CAC can be combined with forward error correction coding. Joint CAC and single error correction (SEC) codes such as Duplicate-add-parity (DAP) and Modified Dual Rail (MDR) [14,50], Boundary Shift Code (BSC) [16] and Hamming codes [56] give the on-chip interconnect better reliability.

(1) Duplicate-add-parity (DAP) and Modified Dual Rail (MDR):

The encoder and decoder of the Duplicate-add-parity code are shown in Figure 3.4. The encoder duplicates the data (x0, x1, x2, x3) onto (y0, y2, y4, y6) and (y1, y3, y5, y7). y8 is the parity bit generated from x0 ⊕ x1 ⊕ x2 ⊕ x3: if the data contains an odd number of 1s then y8 = 1, otherwise (an even number of 1s) y8 = 0. The decoder receives the data y0 to y7 and the parity bit y8 from the encoder.

On the decoder side, y8 is compared with the new parity y1 ⊕ y3 ⊕ y5 ⊕ y7. If the two parities are identical, the multiplexer select is 0 and the decoded data is (y1, y3, y5, y7). If the two parities differ, the multiplexer select is 1 and the decoded data is (y0, y2, y4, y6). This scheme can correct a one-bit error.
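The DAP encode and decode steps above can be sketched behaviourally as follows. The function names are hypothetical, bits are held in Python lists, and the sketch models only the logic, not the gate-level circuit of Figure 3.4.

```python
from functools import reduce
from operator import xor


def dap_encode(x: list) -> list:
    """Duplicate every data bit and append one parity bit.

    For x = (x0..x3) the result is (y0..y8) with y(2i) = y(2i+1) = x(i)
    and y8 = x0 ^ x1 ^ x2 ^ x3.
    """
    y = []
    for bit in x:
        y += [bit, bit]
    y.append(reduce(xor, x, 0))
    return y


def dap_decode(y: list) -> list:
    """Select the copy whose parity agrees with the received parity bit."""
    even_copy = y[0:-1:2]   # y0, y2, y4, y6
    odd_copy = y[1:-1:2]    # y1, y3, y5, y7
    parity = y[-1]          # y8
    if reduce(xor, odd_copy, 0) == parity:
        return odd_copy     # multiplexer select = 0
    return even_copy        # multiplexer select = 1


codeword = dap_encode([1, 0, 1, 1])
codeword[3] ^= 1                 # inject a single-bit error on y3
recovered = dap_decode(codeword)
```

Any single flipped wire, whether in one of the two copies or in the parity bit, leaves at least one intact copy, and the parity comparison picks it.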

The Modified Dual Rail (MDR) code is very similar to the DAP. In the Dual Rail (DR) code, considering a link of k information bits, m = k + 1 check bits are added, leading to a code word length of n = k + m = 2k + 1. We define the k + 1 check bits with Equation (3.1). In the MDR two copies of parity bit Ck are placed adjacent to the other codeword bits, to reduce crosstalk.

(3.1)

Figure 3.4: Duplicate-add-parity code (a) Encoder (b) Decoder

(2) Boundary Shift Code (BSC):

This part introduces the Boundary-Shift Code and gives an example.

A Boundary-Shift codeword is generated by copying each bit and adding a parity bit that indicates whether the input bits contain an odd or even number of 1s. In addition, the parity bit alternates between the first and the last bit position of the output on each transition cycle, as shown in Table 2.

Table 2: Example for Boundary-Shift Code.

Boundary-Shift Code decoding is done by majority vote among three estimates of each bit: its two copies and a third value formed as the mod-2 sum of the parity bit and one copy of each of the other information bits. For example, the first bit is decided from y0, y1 and (y2+y4+y6+y8) mod 2, marked in blue in Table 3. The red marks in Table 3 are erroneous output bits. Table 3 shows that BSC can correct one error in cycles 1 to 3 (whether the error occurs in a data bit or in the parity bit), but fails in cycle 4, where two or more errors occur.

Table 3: Example of Boundary-Shift Code error correct ability
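A behavioural sketch of Boundary-Shift encoding and majority-vote decoding is given below. It rests on two assumptions that are not fixed by the description above: the parity bit is placed last on even cycles and first on odd cycles, and the third estimate always uses the first copy of each other bit. The function names are hypothetical.

```python
from functools import reduce
from operator import xor


def bsc_encode(x: list, cycle: int) -> list:
    """Duplicate every bit and add a parity bit whose position alternates."""
    dup = [b for bit in x for b in (bit, bit)]
    parity = reduce(xor, x, 0)
    # Assumption: parity is the last wire on even cycles, the first on odd ones.
    return dup + [parity] if cycle % 2 == 0 else [parity] + dup


def bsc_decode(y: list, cycle: int) -> list:
    """Majority vote over two copies and one parity-derived estimate per bit."""
    if cycle % 2 == 0:
        dup, parity = y[:-1], y[-1]
    else:
        parity, dup = y[0], y[1:]
    first, second = dup[0::2], dup[1::2]
    total = reduce(xor, first, parity)   # parity XOR first copy of every bit
    out = []
    for a, b in zip(first, second):
        third = total ^ a                # removes this bit's own copy from the sum
        out.append(1 if a + b + third >= 2 else 0)
    return out
```

For the even-cycle layout the third estimate of the first bit is y2 ⊕ y4 ⊕ y6 ⊕ y8, in line with the example in the text. A single error corrupts at most one of the three estimates of any bit, so the vote still returns the right value.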

Figure 3.5 shows the Boundary-Shift Code encoder and decoder. BSC has the disadvantages of a large gate count and a critical path that depends on the number of transmitted bits. An n-bit un-coded word is encoded into (2n+1) bits, and the circuit depths of the encoder and decoder are $[\log_2 n] + 1$ and $[\log_2 (n+1)] + 1$, respectively [16].

Figure 3.5: Boundary Shift Code (a) Encoder (b) Decoder

(3) Hamming code [56]

A traditional error control code is the binary (7,4) Hamming code. To transfer 4 data bits (m1, m2, m3, m4), it needs 3 redundant parity bits to identify which bit is in error, and it can correct one error. Each parity bit Pi is set to 0 or 1 so that the number of 1s in the set (Pi, mx, my, mz) is even, so the parity is given by Pi = mx ⊕ my ⊕ mz. The complexity of the (7,4) Hamming encoder is 5 XOR2 gates, and its propagation delay is 2 XOR2 delays. The complexity of the (7,4) Hamming decoder is 12 XOR2 + 4 NAND3 + 3 inverters, and its propagation delay is 5 XOR2 delays. Figure 3.6 shows the encoder, syndrome generator and decoder of the (7,4) Hamming code.

At system level, a 32-bit word uses the binary systematic (38, 32, 3) code, known as an extended Hamming code, to correct a single error. The parity bits of the binary systematic (38, 32, 3) code are (P1, P2, P3, P4, P5, P6), as given in Equation (3.2), where mi denotes the data bits and Pi the parity bits. The complexity of the (38, 32, 3) encoder is 70 XOR2 gates, and its propagation delay is 5 XOR2 delays. The complexity of the (38, 32, 3) decoder is 108 XOR2 + 96 NAND3 + 6 inverters, and its propagation delay is 8.5 XOR2 delays. These results show that the Hamming code has a large hardware overhead and propagation delay, which may degrade the performance of the on-chip interconnect.

Figure 3.6: Hamming Code (a) Encoder (b) Syndrome generator (c) Decoder

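\[
\begin{aligned}
P_1 &= m_1 \oplus m_2 \oplus m_4 \oplus m_5 \oplus m_7 \oplus m_9 \oplus m_{11} \oplus m_{12} \oplus m_{14} \oplus m_{16} \oplus m_{18} \oplus m_{20} \oplus m_{22} \\
    &\quad \oplus m_{24} \oplus m_{26} \oplus m_{27} \oplus m_{29} \oplus m_{31} \\
P_2 &= m_1 \oplus m_3 \oplus m_4 \oplus m_6 \oplus m_7 \oplus m_{10} \oplus m_{11} \oplus m_{13} \oplus m_{14} \oplus m_{17} \oplus m_{18} \oplus m_{21} \oplus m_{22} \\
    &\quad \oplus m_{25} \oplus m_{26} \oplus m_{28} \oplus m_{29} \oplus m_{32} \\
P_3 &= m_2 \oplus m_3 \oplus m_4 \oplus m_8 \oplus m_9 \oplus m_{10} \oplus m_{11} \oplus m_{15} \oplus m_{16} \oplus m_{17} \oplus m_{18} \oplus m_{23} \\
    &\quad \oplus m_{24} \oplus m_{25} \oplus m_{26} \oplus m_{30} \oplus m_{31} \oplus m_{32} \\
P_4 &= m_5 \oplus m_6 \oplus m_7 \oplus m_8 \oplus m_9 \oplus m_{10} \oplus m_{11} \oplus m_{19} \oplus m_{20} \oplus m_{21} \oplus m_{22} \oplus m_{23} \\
    &\quad \oplus m_{24} \oplus m_{25} \oplus m_{26} \\
P_5 &= m_{12} \oplus m_{13} \oplus m_{14} \oplus m_{15} \oplus m_{16} \oplus m_{17} \oplus m_{18} \oplus m_{19} \oplus m_{20} \oplus m_{21} \oplus m_{22} \\
    &\quad \oplus m_{23} \oplus m_{24} \oplus m_{25} \oplus m_{26} \\
P_6 &= m_{27} \oplus m_{28} \oplus m_{29} \oplus m_{30} \oplus m_{31} \oplus m_{32}
\end{aligned}
\tag{3.2}
\]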
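The parity sets of Equation (3.2) follow the usual positional construction of a Hamming code: data bits fill the codeword positions that are not powers of two, and parity bit Pj covers every position whose binary index has bit j set. The sketch below derives the sets this way and uses them for systematic encoding and single-error correction. The function names and the data-first codeword layout are assumptions for illustration, not the circuit of Figure 3.6.

```python
def data_positions(k: int):
    """Return (r, positions): number of parity bits and the Hamming
    positions (1-based, non-powers of two) occupied by the k data bits."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    positions = [p for p in range(1, k + r + 1) if p & (p - 1)]
    return r, positions


def parity_sets(k: int) -> list:
    """For each parity bit Pj, the 1-based data indices it covers."""
    r, positions = data_positions(k)
    return [[m + 1 for m, pos in enumerate(positions) if (pos >> j) & 1]
            for j in range(r)]


def hamming_encode(data: list) -> list:
    """Systematic codeword: the data bits followed by P1..Pr (assumed layout)."""
    parity = [sum(data[m - 1] for m in covered) % 2
              for covered in parity_sets(len(data))]
    return list(data) + parity


def hamming_correct(word: list, k: int) -> list:
    """Return the data bits after correcting at most one flipped bit."""
    r, positions = data_positions(k)
    data, parity = list(word[:k]), word[k:]
    syndrome = 0
    for j, covered in enumerate(parity_sets(k)):
        check = parity[j]
        for m in covered:
            check ^= data[m - 1]
        if check:
            syndrome |= 1 << j
    # A syndrome that names a data position points at the flipped data bit;
    # a power-of-two syndrome means a parity bit was hit and data is intact.
    if syndrome in positions:
        data[positions.index(syndrome)] ^= 1
    return data
```

Calling `parity_sets(32)` reproduces the six index lists of Equation (3.2), and `parity_sets(4)` gives the three parity groups of the (7,4) code.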

With aggressive supply voltage scaling and the increase in deep-submicron noise, single error correcting codes will no longer satisfy the reliability requirements. More powerful ECCs (such as multiple error correcting codes) will be needed in future NoC designs.

3.3 Proposed Self-Corrected Green Coding Scheme

3.3.1 Triplication Error Correction Coding Stage

The triplication error correction coding scheme shown in Figure 3.7 is a single error correcting code that triplicates each bit. From information theory, it is well known that a code set with a Hamming distance of h can detect h−1 errors and correct ⌊(h−1)/2⌋ errors. For triplication error correction coding, the Hamming distance for each bit is 3. Therefore, each bit can be corrected by itself if there is no more than one error among its three triplicated bits. The error bit is corrected by a majority gate, whose function is shown in Figure 3.7. Compared with other error correction mechanisms, the critical delay of the decoder is the constant delay of one majority gate, which is much smaller than that of other ECCs. In other words, it corrects rapidly through self-correction at the bit level. Triplication error correction coding is therefore well suited to network-on-chip because of its small encode/decode propagation delay.

Figure 3.7: Triplication error correction coding scheme
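The triplication stage and its majority-gate decoder can be sketched behaviourally as follows; the function names are hypothetical and the sketch models the logic only.

```python
def triplicate(bits: list) -> list:
    """Send every bit three times on three adjacent wires."""
    return [b for bit in bits for b in (bit, bit, bit)]


def majority_decode(coded: list) -> list:
    """Recover each bit with a 3-input majority vote."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]


coded = triplicate([1, 0, 1, 1])
coded[4] ^= 1                      # one error inside the second triple
decoded = majority_decode(coded)
```

Each triple is voted on independently, so one error per triple is tolerated even when several triples are hit at once, while two errors inside the same triple flip the decoded bit.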

In addition, one advantage of incorporating error correction mechanisms in the NoC data stream is that the supply voltage of the channels can be reduced without compromising the reliability of the system. Reducing the supply voltage Vdd increases the bit error probability. To simplify the error sources, we assume that the bit error probability ε is given by Equation (3.3) when a Gaussian-distributed noise voltage V_N with variance σ_N² is added to the signal waveform.

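Each triplication set is decoded correctly if and only if it is received with no error or with exactly one bit in error. For each triplication set, therefore, the probability of correct decoding is

\[
P_{1\text{-bit-correct}} = (1-\varepsilon)^3 + 3\varepsilon(1-\varepsilon)^2 = 1 - 3\varepsilon^2 + 2\varepsilon^3
\]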

For k-bit data, transmission is error-free if and only if all k triplication sets are correct. The probability that all k bits are correct is given by

\[
P_{k\text{-bits-correct}} = \prod_{i=1}^{k} P_{i\text{-bit-correct}} = \left(1 - 3\varepsilon^2 + 2\varepsilon^3\right)^k
\tag{3.6}
\]

Hence, the word-error probability is

\[
P_{\text{triplication}} = 1 - \left(1 - 3\varepsilon^2 + 2\varepsilon^3\right)^k
\tag{3.7}
\]

For a small bit error probability ε, Equation (3.7) simplifies to

\[
P_{\text{self-correct}} \approx 3k\varepsilon^2 - 2k\varepsilon^3
\tag{3.8}
\]

This word-error probability grows only linearly with k and is much smaller than those of the Hamming code and DAP, which are proportional to k²ε². Moreover, triplication error correction coding avoids the forbidden overlap (FO) condition and the forbidden pattern (FP) condition, which would otherwise induce large energy dissipation through the coupling effect. The FO condition is satisfied when the bit pattern (y2,y1,y0) makes no transition from 010 to 101 or from 101 to 010, and the FP condition is satisfied when the bit patterns 010 and 101 never occur in (y2,y1,y0).
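The word-error expressions in Equations (3.7) and (3.8) can be compared numerically with a short script. The bit error probability is modelled here with the Gaussian tail function Q, assuming ε = Q(Vdd / 2σN); this is a form commonly used for additive Gaussian noise and should be read as an assumption, not as a restatement of Equation (3.3). All function names are hypothetical.

```python
import math


def bit_error_probability(vdd: float, sigma_n: float) -> float:
    """Assumed model: eps = Q(Vdd / (2 * sigma_N)), with Q(x) = erfc(x / sqrt(2)) / 2."""
    x = vdd / (2.0 * sigma_n)
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def triple_correct(eps: float) -> float:
    """Probability that one triplicated bit is decoded correctly:
    no error, or exactly one error among the three wires."""
    return (1 - eps) ** 3 + 3 * eps * (1 - eps) ** 2


def word_error_exact(eps: float, k: int) -> float:
    """Equation (3.7): 1 - (1 - 3*eps^2 + 2*eps^3)^k."""
    return 1 - (1 - 3 * eps ** 2 + 2 * eps ** 3) ** k


def word_error_approx(eps: float, k: int) -> float:
    """Equation (3.8): small-eps approximation 3k*eps^2 - 2k*eps^3."""
    return 3 * k * eps ** 2 - 2 * k * eps ** 3


eps = 1e-3
exact = word_error_exact(eps, 32)
approx = word_error_approx(eps, 32)
relative_gap = abs(exact - approx) / approx
```

For ε = 10⁻³ and k = 32 the two expressions agree to well within 0.1 percent, which illustrates why the approximation is adequate at small ε.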

3.3.2 Joint Triplication Bus Power Model

The bus model proposed in [57] considers both loading capacitances and coupling capacitances. Figure 3.8(a) shows the model modified for a four-bit bus. C_ii denotes the loading capacitance of line i and C_ij the coupling capacitance between line i and line j. The bus lines are laid parallel and coplanar, so most of the electric field is trapped between adjacent lines and ground. Figure 3.8(b) shows an approximate deep-submicron bus power model that ignores the parasitics between nonadjacent lines.

Figure 3.8: (a)Bus model for four bits (b)The approximate bus model

We assume that all grounded capacitors have the same value and do not consider the fringing effect of the boundary lines, because the fringing capacitances are much smaller than the loading and coupling capacitances, especially for wide buses. From Figure 3.8(b), we can define the capacitance matrix C^t as in Equation (3.9):

(3.9)

The parameter λ is defined as the ratio of coupling capacitance to loading capacitance. It therefore depends on the technology as well as on the specific geometry, metal layer and shielding of the bus, and it tends to increase with technology scaling. For instance, λ is between 3 and 6 for a standard 0.13 µm CMOS technology with minimum distance between the wires, depending on the metal layer. λ is expected to be much larger in more advanced technologies.
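Under the stated assumptions (the same grounded capacitance on every line and coupling only between adjacent lines), a capacitance matrix for a four-bit bus would be expected to be tridiagonal. The form below is a sketch of that structure, with C_L introduced here as a hypothetical symbol for the common loading capacitance; it is an assumed form and not necessarily identical to Equation (3.9).

```latex
\[
C^{t} = C_L
\begin{bmatrix}
1+\lambda & -\lambda   & 0          & 0 \\
-\lambda  & 1+2\lambda & -\lambda   & 0 \\
0         & -\lambda   & 1+2\lambda & -\lambda \\
0         & 0          & -\lambda   & 1+\lambda
\end{bmatrix}
\]
```

Each diagonal entry sums the grounded capacitance of a line and its coupling to its neighbours, so the two boundary lines, which have a single neighbour, carry 1+λ while the inner lines carry 1+2λ.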

Between two adjacent lines there are five types of transition, four of which are mentioned in [58]. The five types fall into two cases. The first case is static transitions: type I (single line switching), type II (two lines switching in opposite directions) and type III (no switching) in Figure 3.9. The second case is dynamic transitions: type IV and type V, with signal aliasing, in Figure 3.9. A static transition is one in which the two adjacent lines switch at the same time, without noise or differing delays. A dynamic transition is one in which the two adjacent lines may be misaligned.

Figure 3.9: Five transition types for two adjacent wires

Although triplication error correction coding avoids some forbidden conditions, some power-hungry transition patterns cannot be avoided completely. These patterns mainly arise from the Forbidden Transition condition and self-switching activity.
