The Design of the PAMP3 - An Asynchronous MP3 Decoder Designed with Balsa

This chapter describes the design of the PAMP3. The PAMP3 consists of eight main parts: the synchronizer&Huffman, the requantizer, the reorder, the anti-alias, the IMDCT, the BUFF, the filterbank and the PCM_out. These eight parts work in parallel and are connected by communication channels. The following sections first present the top-level view of the PAMP3 and then describe the eight main components.

3-1 The architecture of the PAMP3

The architecture of the PAMP3 is shown in Figure 23. All operations are completed in the eight stages, after which the PAMP3 outputs a series of PCM data. The synchronizer&Huffman stage takes the MP3 music data from the Main Memory and puts the header, the side information and the main data into the buffers. The buffers are the source of the information used for decoding the scale factors and the Huffman data. The requantizer stage is responsible for inverse quantization: it converts the Huffman-decoded values back to their spectral values using a power law. The reorder stage and the anti-alias stage reorder the frequency values produced by the MDCT and the quantization, and reduce the aliasing effects introduced by the poly-phase filter bank during encoding. The IMDCT stage and the filterbank stage perform the IMDCT and the poly-phase synthesis, although both are implemented with the fast algorithms described later in this chapter. The BUFF stage holds the data until all of the samples are complete and then outputs them to the next stage. The PCM_out stage checks the channel mode before outputting data.

Figure 23: The architecture view of the PAMP3

The following code is the top-level procedure of the PAMP3. All components are connected by internal channels.

procedure pamp3_decoder(
  input mem_out : 64 bits;
  input mem_boundary : 20 bits;
  output mem_reset : bit;
  output mem_addr : 20 bits;
  output data_out : 16 bits
) is . . .
  scale_huffman_top(mem_out, mem_boundary, mem_reset, mem_addr, index, freq1, nonzero, global_gain, subgain, scale, preflag, scale_l, scale_s, t_l, t_s, data) ||
  requantization(index, freq1, global_gain, subgain, scale, preflag, nonzero, data, scale_l, scale_s, t_s, t_l, req_index_out, freq2, req_out) ||
  reorder(req_index_out, freq2, req_out, reorder_out, order_index) ||
  alias_reduction(order_index, reorder_out, alias_out, index_out) ||
  imdct_top(index_out, alias_out, imdct_out, ch_in) ||
  imdct_filterbank(ch_in, imdct_out, filterbank_in, ch_out) ||
  filterbank_top(ch_out, filterbank_in, pcm_in, pcm_ch) ||
  pcm_out(pcm_ch, pcm_in, data_out)
end
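As a rough software analogy of this parallel composition, the sketch below runs each stage in its own thread and links neighbouring stages with single-slot queues that stand in for the internal channels. It is written in Python rather than Balsa and only models the dataflow, not the handshake circuits. The names `stage` and `run_pipeline`, and the use of `None` as an end-of-stream marker, are illustrative assumptions and do not come from the PAMP3 source.

```python
import queue
import threading


def stage(func, inbox, outbox):
    """Run one pipeline stage: read a token, process it, pass it on."""
    while True:
        token = inbox.get()
        if token is None:  # end-of-stream marker (an assumption of this model)
            outbox.put(None)
            return
        outbox.put(func(token))


def run_pipeline(funcs, tokens):
    """Connect the stages with single-slot queues and run them concurrently."""
    # One queue in front of every stage plus one after the last stage.
    channels = [queue.Queue(maxsize=1) for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, channels[i], channels[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()

    def feed():
        # The feeder plays the role of the environment driving the first channel.
        for token in tokens:
            channels[0].put(token)
        channels[0].put(None)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        item = channels[-1].get()
        if item is None:
            break
        results.append(item)

    feeder.join()
    for worker in workers:
        worker.join()
    return results
```

Because every queue holds at most one token, a stage blocks until its neighbour has taken the previous token, which loosely mirrors how a channel communication synchronizes two Balsa procedures.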

3-2 The synchronizer&Huffman stage

This stage contains three buffers and three modules: the synchronizer, the SCALE&HUFFMAN and the BUFF_RW_ARBITOR. The three buffers are the header buffer, the side information buffer and the main data buffer. The synchronizer module retrieves the data from the main memory and decodes the header data and side information data into two buffers. After decoding the header and the side information, the synchronizer sends the side information data and some header data to the SCALE&HUFFMAN module. Then, the synchronizer writes the main data fetched from the main memory into the main data buffer.

While the synchronizer is writing the data, the SCALE&HUFFMAN module also reads the data from the main data buffer to decode the scale factors and the Huffman data. Therefore, the BUFF_RW_ARBITOR controller arbitrates between the write request from the synchronizer and the read request from the SCALE&HUFFMAN module, which may arrive at the same time, and checks whether the buffer data is valid. The SCALE&HUFFMAN module performs the scale factor decoding and the Huffman decoding. We use the direct table lookup method for Huffman decoding. All the Huffman table data are stored in a ROM that the SCALE&HUFFMAN module can read.

Figure 24 shows the modules of the Synchronizer&HUFFMAN stage. As soon as the SCALE&HUFFMAN module decodes one value (13 bits) from the data, it transfers the value to the next stage. As a result, the next stage does not have to wait for the SCALE&HUFFMAN module to decode all 576 values before it starts processing.

Figure 24: The Synchronizer&HUFFMAN stage
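The direct table lookup method mentioned above can be illustrated with a small software model. The sketch assumes a toy prefix code (`TOY_CODES`), which is hypothetical and is not one of the MP3 Huffman tables. Every code word is expanded to all bit patterns of the maximum code length that begin with it, so one table read returns both the symbol and the number of bits to consume.

```python
# Toy prefix code (hypothetical, NOT one of the MP3 Huffman tables).
TOY_CODES = {"0": "A", "10": "B", "110": "C", "111": "D"}
MAX_LEN = max(len(code) for code in TOY_CODES)


def build_lookup(codes, max_len):
    """Expand every code to all max_len-bit patterns that start with it."""
    table = [None] * (1 << max_len)
    for code, symbol in codes.items():
        free_bits = max_len - len(code)
        base = int(code, 2) << free_bits
        for suffix in range(1 << free_bits):
            table[base + suffix] = (symbol, len(code))
    return table


# Plays the role of the Huffman table ROM in this model.
LOOKUP_ROM = build_lookup(TOY_CODES, MAX_LEN)


def decode(bits):
    """Decode a string of '0'/'1' characters with one table read per symbol."""
    symbols = []
    pos = 0
    while pos < len(bits):
        # Peek MAX_LEN bits, zero-padded at the end of the stream.
        window = bits[pos:pos + MAX_LEN].ljust(MAX_LEN, "0")
        symbol, length = LOOKUP_ROM[int(window, 2)]
        symbols.append(symbol)
        pos += length  # consume only the bits that belong to this code word
    return symbols
```

The trade-off of this method is memory for speed: the table has 2^MAX_LEN entries, but each symbol costs a single read instead of a bit-by-bit tree walk.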

3-3 The re-quantizer stage

The re-quantizer stage (Figure 25) contains four modules: requant_ctrl, fras_l, fras_s and fras. The requant_ctrl module controls the other modules to compute the output data. The fras_l and fras_s modules calculate the factor on the right side of the multiplication sign in eq(1) for long blocks and short blocks, respectively. The fras module then evaluates the right side of eq(1) from the input data, “ISi”, and the previously calculated value, “a”. Because computing the 4/3 power directly is expensive, a ROM stores all the 4/3-power values so that they can be looked up. Finally, the requant_ctrl module outputs a value whenever the fras module completes its calculation.

The input of the requantizer stage, ISi, is a 14-bit value that is used to look up the table. The output of the requantizer stage, req_out, is a 32-bit fixed-point value with a 4-bit integer part and a 28-bit fractional part.

Figure 25: The Requantizer stage
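The table-based 4/3 power and the 4.28 fixed-point output format can be sketched in software as follows. This is a minimal model under stated assumptions: the table depth `POW43_SIZE` is assumed, the 32-bit word is assumed to be two's complement (so the representable range is -8 to just under 8), and the gain factor `a` is simply passed in as a floating-point number. How the hardware scales values to fit the 4-bit integer part is not described in this chapter and is not modelled here.

```python
POW43_SIZE = 8192  # assumed table depth; not stated in the text

# Software stand-in for the 4/3-power ROM: entry i holds i ** (4/3).
POW43_ROM = [i ** (4.0 / 3.0) for i in range(POW43_SIZE)]


def requantize(is_i, a):
    """Return sign(is_i) * |is_i|^(4/3) * a using the lookup table."""
    magnitude = POW43_ROM[abs(is_i)] * a
    return -magnitude if is_i < 0 else magnitude


def to_q4_28(x):
    """Encode x as a 32-bit word with 28 fractional bits (assumed signed)."""
    raw = int(round(x * (1 << 28)))
    if not -(1 << 31) <= raw < (1 << 31):
        raise OverflowError("value does not fit in the 4.28 format")
    return raw & 0xFFFFFFFF


def from_q4_28(word):
    """Decode a 32-bit 4.28 word back to a float."""
    if word & 0x80000000:  # negative value in two's complement
        word -= 1 << 32
    return word / float(1 << 28)
```

For example, `to_q4_28(1.0)` yields the word `0x10000000`, since the value 1 sits 28 bits above the least significant fractional bit.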

3-4 The reorder stage and anti-alias stage

The reorder stage (Figure 26) immediately assigns each input value to its correct position according to the frequency mode in the side information. The order_ctrl module controls the receiving and the output of a buffer (576 x 32 bits). The output method depends on the block type of the current frame. For the long block type, each value can be sent out directly, because the anti-alias stage performs eight butterfly multiplications for every two subbands (2 x 18 values). For the short and mixed block types, a counter counts the received data of the current frame, and the reordered outputs are sent out when the counter reaches a specified number, which indicates that 18 reordered values have been received. (Only MP3 streams with a 44.1 kHz sampling frequency are supported.)

Figure 26: The reorder stage.
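For reference, the index arithmetic of short-block reordering can be sketched as below. The sketch follows the conventional ordering used by software MPEG-1 Layer III decoders for a granule that consists only of short blocks at 44.1 kHz; whether order_ctrl computes its addresses in exactly this way is an assumption, and mixed blocks are not handled. The function name `reorder_short` is hypothetical.

```python
# Short-block scalefactor band boundaries for 44.1 kHz (per window).
SFB_SHORT_44K = [0, 4, 8, 12, 16, 22, 30, 40, 52, 66, 84, 106, 136, 192]


def reorder_short(xr):
    """Interleave the three short windows of every scalefactor band.

    Input order inside a band:  window 0 lines, window 1 lines, window 2 lines.
    Output order inside a band: line 0 of windows 0..2, line 1 of windows 0..2, ...
    """
    if len(xr) != 576:
        raise ValueError("a granule holds 576 values")
    out = [0] * 576
    for band in range(len(SFB_SHORT_44K) - 1):
        start = SFB_SHORT_44K[band]
        width = SFB_SHORT_44K[band + 1] - start
        for window in range(3):
            for freq in range(width):
                src = 3 * start + window * width + freq
                dst = 3 * start + 3 * freq + window
                out[dst] = xr[src]
    return out
```

Feeding the indices 0..575 through the function makes the permutation visible: the first output values come from input positions 0, 4, 8, 1, that is, the first frequency line of each of the three windows followed by the second line of window 0.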

The anti-alias stage (Figure 27) uses two register banks (2*18*32 bits) to store two subbands from the reorder stage and performs an 8-butterfly multiplication as shown in Figure 28. After the 8-butterfly multiplication, the anti-alias stage outputs the first half of the results (18*32 bits) to the next stage. It then moves the remaining half of the results forward into the previous register bank while waiting for the next 18 reordered values from the previous stage.

Figure 27: The anti-alias stage.

Figure 28: The register banks of the anti-alias stage.
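The two-bank scheme can be modelled in software as a small streaming object: it keeps one subband in a bank, applies the eight butterflies when the next subband arrives, releases the finished lower subband and keeps the upper one for the next round. The butterfly coefficients are the alias-reduction constants of the MPEG-1 Layer III standard; the class name `AntiAlias` and its methods are hypothetical, and floating point is used instead of the 32-bit fixed-point data path.

```python
import math

# Alias-reduction constants c_i of MPEG-1 Layer III and the derived factors.
CI = [-0.6, -0.535, -0.33, -0.185, -0.095, -0.041, -0.0142, -0.0037]
CS = [1.0 / math.sqrt(1.0 + c * c) for c in CI]
CA = [c / math.sqrt(1.0 + c * c) for c in CI]


class AntiAlias:
    """Streaming model of the two register banks (names are illustrative)."""

    def __init__(self):
        self.bank = None  # subband still waiting for its upper neighbour

    def push(self, subband):
        """Accept 18 new values; return the finished previous subband, if any."""
        new = list(subband)
        if self.bank is None:
            self.bank = new
            return None
        low = self.bank
        for i in range(8):
            a = low[17 - i]  # counts down from the top of the lower subband
            b = new[i]       # counts up from the bottom of the upper subband
            low[17 - i] = a * CS[i] - b * CA[i]
            new[i] = b * CS[i] + a * CA[i]
        self.bank = new  # the upper half moves forward and waits
        return low

    def flush(self):
        """Release the last subband, which has no upper neighbour."""
        last, self.bank = self.bank, None
        return last
```

In this model a granule is processed by calling `push` once per subband and `flush` after the last one, so that the butterflies never cross a granule boundary.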

3-5 The IMDCT stage

The IMDCT stage performs the inverse modified discrete cosine transform, the windowing and the overlapping processes. For the IMDCT, we use the method proposed by Szu-Wei Lee [16]. Figure 29 shows the computational flow of Szu-Wei Lee’s algorithm. The N-point inverse MDCT is first converted to an N/2-point DCT-IV, which is then converted to an N/2-point SDCT-II. Finally, the N/2-point SDCT-II is divided into two identical N/4-point SDCT-IIs. The algorithm therefore reduces to 3-point and 9-point SDCT-II modules, which compute the inverse MDCT for MPEG Layer III. With this algorithm, only 43 multiplications and 115 additions are needed when the length N = 36.

Figure 29: The IMDCT processing flow of Szu-Wei Lee’s algorithm.

Based on this algorithm, the IMDCT stage is built as a pipeline of 5 sub-stages, as shown in Figure 30: scaling&butterfly, SDCT-II, post-process, windowing and overlapping. The first three sub-stages follow the computing flow of Szu-Wei Lee’s algorithm. The other two multiply the inputs by the long or short window table data and then overlap the result with the previous frame. In the scaling&butterfly sub-stage, the input values from the anti-alias stage are multiplied by constants and then pass through butterfly additions with each other. The SDCT-II sub-stage is decomposed into two blocks, the N/4-point SDCT-II and the N/4-point DCT-IV. The first half of the outputs from the previous sub-stage goes through the N/4-point SDCT-II immediately. The second half is reordered first and then goes through the N/4-point SDCT-II. The 3-point and 9-point SDCT-II can be used directly. The 3-point SDCT-II requires one multiplication and 5 additions, and the 9-point SDCT-II requires 8 multiplications and 36 additions. After the post-process sub-stage, the 36 outputs of the IMDCT are available. In the windowing sub-stage, the windowing_cntrl controller controls the multiplication (mult_long_short) between the window table data and the input from the previous sub-stage. The window data differ according to the long or short block type. Finally, the overlap sub-stage overlaps half of the data from the current block with the data from the previous block held in the overlapping memory, and then outputs 18 overlapped values to the next stage.

Figure 30: The sub-pipeline of the IMDCT stage
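A direct-form software reference is useful for checking the outputs of the fast sub-pipeline. The sketch below is such a reference for long blocks only: it evaluates the IMDCT definition of MPEG-1 Layer III term by term (it is not Szu-Wei Lee’s fast algorithm), applies the normal long-block sine window and performs the overlap-add with the saved half of the previous block. The function names and the use of floating point are assumptions of this model.

```python
import math

N = 36  # long-block IMDCT length


def imdct_direct(spectrum):
    """Direct-form IMDCT: 18 frequency lines in, 36 time samples out."""
    half = N // 2
    return [
        sum(
            spectrum[k]
            * math.cos(math.pi / (2 * N) * (2 * i + 1 + half) * (2 * k + 1))
            for k in range(half)
        )
        for i in range(N)
    ]


# Normal long-block sine window (block type 0).
LONG_WINDOW = [math.sin(math.pi / N * (i + 0.5)) for i in range(N)]


def window_and_overlap(samples, overlap_mem):
    """Window 36 samples and add the saved half of the previous block.

    Returns the 18 output samples and the new 18-sample overlap memory.
    """
    windowed = [s * w for s, w in zip(samples, LONG_WINDOW)]
    out = [windowed[i] + overlap_mem[i] for i in range(18)]
    return out, windowed[18:]
```

The direct form needs 18 multiplications per output sample, which is the cost that the fast algorithm with 43 multiplications for N = 36 avoids.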

The following code is the top-level procedure of the IMDCT stage. The whole IMDCT stage is decomposed into the 5 sub-stages of the sub-pipeline.

procedure imdct_top(
  input index : 10 bits ;
  array 0 .. S-1 of input data_in : data_type ;
  array 0 .. S-1 of output data_out : data_type ;
  output ch_out : 3 bits
) is
  … …
  imdct_stage1(index, data_in, reg1, index1) ||
  imdct_stage2(index1, reg1, reg2, index2) ||
  imdct_stage3(index2, reg2, reg3, index3) ||
  IMDCT_windowing(index3, reg3, index_out, imdct_over) ||
  IMDCT_overlap(index_out, imdct_over, data_out, ch_out)
end

3-6 The BUFF stage

The outputs of the IMDCT stage are 18 time-domain samples, but the inputs of the poly-phase filterbank stage are 32 subband samples. The BUFF stage (Figure 31) therefore buffers the outputs of the IMDCT stage until all 576 samples have been received, and then delivers them from the buffer to the poly-phase filterbank stage 32 subband samples at a time.

The pipeline must keep running continuously during data buffering. Therefore, the buffer is divided into two blocks that are read and written in turn. While buffer0 is being written with the output of the IMDCT stage, buffer1 is being read to feed the filterbank stage; this continues until 576 samples have been received. For the next 576 samples, buffer0 is read to feed the filterbank stage while buffer1 is written with the output of the IMDCT stage, and so on.

Figure 31: The BUFF stage
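The double-buffering scheme, together with the change of access order between writing and reading, can be sketched as follows. Each write stores the 18 samples of one subband, and once 32 subbands (576 samples) are present, the completed bank is read out as 18 vectors of 32 subband samples while new data goes to the other bank. The class name `PingPongBuffer` and its methods are hypothetical, and the model is sequential, so it shows the bank switching but not the true concurrency of reads and writes.

```python
class PingPongBuffer:
    """Software model of a two-bank buffer (names are illustrative)."""

    SUBBANDS = 32
    SLOTS = 18

    def __init__(self):
        self.banks = [[], []]
        self.write_sel = 0   # index of the bank currently being written
        self.ready = None    # last completed bank, available for reading

    def write_subband(self, samples):
        """Store the 18 samples of one subband; return True when 576 are buffered."""
        bank = self.banks[self.write_sel]
        bank.append(list(samples))
        if len(bank) == self.SUBBANDS:
            self.ready = bank
            self.write_sel ^= 1                # switch to the other bank
            self.banks[self.write_sel] = []    # start it empty
            return True
        return False

    def read_slots(self):
        """Yield 18 vectors of 32 subband samples from the completed bank."""
        bank = self.ready
        for t in range(self.SLOTS):
            yield [bank[sb][t] for sb in range(self.SUBBANDS)]
```

A typical use is to call `write_subband` for every IMDCT output and, whenever it returns `True`, to pass each vector from `read_slots()` to the filterbank model.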

3-7 The poly-phase filterbank stage

The poly-phase filterbank converts the time-domain samples from the IMDCT in each subband to PCM samples. As mentioned in the previous chapter, the poly-phase synthesis filterbank can be decomposed into four parts: moving, DCT, matrix multiplication and overall adding. Konstantinides’ algorithm [9] and B.G. Lee’s algorithm [4] are used together to obtain an efficient implementation of the 32-point DCT. The 32 subband samples are the inputs of the DCT and are transformed with B.G. Lee’s algorithm. Then, using the symmetry shown in Konstantinides’ algorithm (Figure 32), the 32-point results are expanded into the 64-point final results. Figure 33 shows the 8-point DCT using B.G. Lee’s algorithm [11].

Figure 32: The DCT simplification of Konstantinides’ algorithm

Figure 33: The 8-point DCT simplification of B.G. Lee’s algorithm [11]
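The expansion from 32 DCT outputs to the 64-point vector can be checked numerically. The sketch below computes a 32-point DCT in direct form (it is not B.G. Lee’s fast algorithm), builds the 64-point vector purely by copying and sign changes, and provides the direct matrixing formula of the synthesis filterbank as a reference. The index mapping in `expand_to_64` is derived here from the symmetry of the cosine; how it corresponds to the blocks labelled in Figure 32 is an assumption.

```python
import math


def dct32(s):
    """Direct 32-point DCT: D[j] = sum_k s[k] * cos((2k + 1) * j * pi / 64)."""
    return [
        sum(s[k] * math.cos((2 * k + 1) * j * math.pi / 64) for k in range(32))
        for j in range(32)
    ]


def expand_to_64(d):
    """Build the 64-point vector V from 32 DCT outputs by copy and sign change."""
    v = [0.0] * 64
    for i in range(16):
        v[i] = d[i + 16]       # upper half of the DCT result
    v[16] = 0.0                # cos((2k + 1) * pi / 2) is zero for every k
    for i in range(17, 49):
        v[i] = -d[48 - i]      # mirrored and negated
    for i in range(49, 64):
        v[i] = -d[i - 48]      # lower half, negated
    return v


def matrixing_direct(s):
    """Reference: V[i] = sum_k s[k] * cos((16 + i) * (2k + 1) * pi / 64)."""
    return [
        sum(s[k] * math.cos((16 + i) * (2 * k + 1) * math.pi / 64) for k in range(32))
        for i in range(64)
    ]
```

The direct matrixing needs 64 x 32 multiplications per set of subband samples, whereas the expansion itself costs none, so all of the arithmetic is concentrated in the 32-point DCT.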

This pipeline stage is divided into 7 sub-stages as shown in Figure 34. The first 6 sub-stages perform the 32-point DCT by the method described above. The remaining sub-stage performs the FIFO moving, the window table multiplication and the final scaling. B.G. Lee’s fast DCT algorithm is recursive, and for a 32-point DCT it requires only 80 multiplications and 209 additions. Therefore, the first 5 sub-stages recursively decompose the 32 subband samples into smaller units. The next sub-stage then performs the corresponding post-processing part, as shown in Figure 34. The windowing_cntrl controller in the windowing sub-stage takes the inputs from the previous sub-stage and performs the data copying shown in Figure 32. The controller then pushes the inputs into the FIFO and multiplies the data from the FIFO by the constants from the window table ROM. After the overall adding, the windowing_cntrl outputs the 32 PCM data for scaling. Finally, the windowing and scaling sub-stage outputs 32 scaled PCM data to the next stage.

Figure 34: The sub-pipeline of the synthesis filterbank stage

The following code is the top-level procedure of the poly-phase filterbank stage.

procedure filterbank_top(
  input ch : 3 bits ;
  input data_in : data_type ;
  output data_out : 16 bits ;
  output ch_out : 3 bits
) is
  . . .
  filterbank_butterfly_5(ch, data_in, reg1, ch1) ||
  filterbank_butterfly_4(ch1, reg1, reg2, ch2) ||
  filterbank_butterfly_3(ch2, reg2, reg3, ch3) ||
  filterbank_butterfly_2(ch3, reg3, reg4, ch4) ||
  filterbank_butterfly_1(ch4, reg4, reg5, ch5) ||
  filterbank_GHmake(ch5, reg5, reg6, ch6) ||
  filterbank_windowing(reg6, ch6, data_out, ch_out)
end

3-8 The PCM_out stage

Because MP3 decoding may operate in a mono channel mode or a dual channel mode, the PCM_out stage outputs 16-bit PCM data according to the mode bits transferred from the previous stage. When the mode bits are equal to 0 or 2, this stage must store the entire 576 samples of one channel until the data of the other channel is received, and then it outputs the data alternately from channel 0 and channel 1. When the mode bits are equal to 3, this stage passes its input directly to the output. When the mode bits are equal to 1, the MP3 music is compressed in the joint-stereo mode. Joint-stereo decoding is not discussed here because it is not implemented.
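The output rule described above amounts to a small dispatch on the mode bits, sketched here as a software model. The constant names follow the mode field of the MPEG-1 audio header (0 stereo, 1 joint stereo, 2 dual channel, 3 single channel); the function name `pcm_out` mirrors the stage name, but its argument list is an assumption made for this illustration.

```python
MODE_STEREO, MODE_JOINT_STEREO, MODE_DUAL, MODE_MONO = 0, 1, 2, 3


def pcm_out(mode, channel0, channel1=None):
    """Return the PCM output order for one block of samples per channel."""
    if mode == MODE_MONO:
        # Single channel: samples pass straight through.
        return list(channel0)
    if mode in (MODE_STEREO, MODE_DUAL):
        # Both channels must be complete before alternating between them.
        if channel1 is None or len(channel0) != len(channel1):
            raise ValueError("both channels must be buffered completely")
        interleaved = []
        for left, right in zip(channel0, channel1):
            interleaved.append(left)
            interleaved.append(right)
        return interleaved
    if mode == MODE_JOINT_STEREO:
        # Matches the text: joint-stereo decoding is not implemented.
        raise NotImplementedError("joint-stereo decoding is not implemented")
    raise ValueError("mode must be in the range 0..3")
```

The need to hold one full channel before the first output sample is what makes the stage store all 576 samples in modes 0 and 2.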
