DSP implementation of successive interference cancellation (SIC) receiver for 3GPP WCDMA uplink transmission


Wirel. Commun. Mob. Comput. 2003; 3:789–800 (DOI: 10.1002/wcm.157)


Yu-jung Chang¹, Yu-Nan Lin² and David W. Lin²,*,†

¹Computer and Communications Research Laboratories, Industrial Technology Research Institute, Chutung, Hsinchu, Taiwan 310, R.O.C.

²Department of Electronics Engineering and Center for Telecommunications Research, National Chiao Tung University, Hsinchu, Taiwan 30010, R.O.C.

Summary

The 3GPP WCDMA is a widely accepted third-generation cellular system standard. Because it uses nonorthogonal codes for different users, multiple access interference (MAI) can be a limiting factor for system performance, as in other CDMA systems. Multiuser detection (MUD) is known to reduce MAI and improve CDMA system performance, but many such techniques have high complexity. Successive interference cancellation (SIC) is an effective MUD technique with relatively low complexity. We consider the software implementation of an SIC receiver for WCDMA uplink transmission on a commercially available general-purpose multi-DSP (digital signal processor) platform. This is also in line with the recent interest in software-defined radio. Issues addressed in this work include job partitioning and signal routing for multiprocessor implementation, design of SIC components (especially the channel estimator and the signal regenerator), determination of the precision of fixed-point computations, consideration of the receiver's error performance and analysis of the implementation's complexity and efficiency. These issues are tightly coupled with the 3GPP WCDMA specifications. Because the employed platform contains only four DSPs, the implementation considers up to three users. But this is sufficient for us to appreciate various DSP implementation issues of an SIC receiver. Moreover, by the nature of SIC, it is easy to extend the implementation to handle more users with an enlarged platform. Our present implementation achieves real-time speed in the RAKE receiver part of the complete receiver. Due to the complexity of signal regeneration, the overall SIC receiver still falls short of the real-time requirement when interference cancellation is activated. In fact, the platform employed presently cannot support real-time processing when the number of multipaths is four or more, unless either the system architecture or the SIC algorithm is redesigned. These and other ways of improvement are relegated to potential future work. Copyright © 2003 John Wiley & Sons, Ltd.

KEY WORDS: CDMA; Third-generation partnership project (3GPP); Multiuser detection (MUD); Successive interference cancellation (SIC); Multipath channels; Digital signal processors (DSPs)

1. Introduction

The WCDMA standard developed by 3GPP is a widely accepted third-generation cellular system standard, providing high data rate transmission at a chip rate of 3.84 Mcps. It employs the direct-sequence code division multiple access (DS-CDMA) scheme. DS-CDMA systems assign different codes to different users to discriminate their signals. Ideally, interference-free transmission is possible if the codes are orthogonal. In practice, due to multipath propagation and implementation reasons, code orthogonality cannot be attained. This results in interference among user signals, known as multiple access interference (MAI), that constitutes a major limiting factor to CDMA system performance.

*Correspondence to: David W. Lin, Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan 30010, R.O.C.

†E-mail: dwlin@mail.nctu.edu.tw

The work was supported by National Science Council of R.O.C. under grant no. NSC 90-2219-E-009-004.

The adverse effects of MAI can be mitigated by multiuser detection (MUD) and a variety of MUD techniques have been developed [1]. Among the MUD techniques, one group known as the subtractive interference cancellation detectors has proven to be effective in combatting MAI and yet possesses relatively low complexity. Subtractive interference cancellation detectors are based on a simple heuristic idea: if an interferer's signal has been detected, then the detector can use it to regenerate the interfering signal and subtract it from the received signal for better detection of other user signals. Parallel interference cancellation (PIC) and successive interference cancellation (SIC) are two main types of subtractive interference cancellation techniques. As the names indicate, PIC attempts to cancel the interference from every user to every other user at the same time, while SIC cancels the contributions from different users in a successive way. We consider SIC in this work.

Although SIC is among the less complex multiuser detection techniques, at the high chip rate of 3GPP WCDMA it still requires a large amount of computation that is beyond the capability of many general-purpose digital signal processor (DSP) chips commercially available today. Hence we consider a multiprocessor implementation. Besides complexity and performance considerations, the successive manner in which user signals are detected in SIC lends itself to easier job partitioning and easier signal routing in a multiprocessor implementation. Incidentally, a DSP implementation of PIC is reported in Reference [2].

This paper is organized as follows. Section 2 describes the 3GPP WCDMA uplink transmission system. Section 3 introduces the SIC receiver. Section 4 discusses the DSP implementation and the resulting performance. And finally, Section 5 gives a brief conclusion.

2. The WCDMA Uplink Transmission System

In 3GPP WCDMA, the spectrum spreading procedure consists of two successive multiplications with two different kinds of codes, namely, the channelization codes and a complex scrambling code. The channelization codes are orthogonal and the set of codes is the same for all users. By using more than one channelization code, a user can transmit data on more than one 'channel'. The complex scrambling codes are used to distinguish one user from another. The scrambling codes are not orthogonal. The output chip rate is 3.84 Mcps.

Figure 1 illustrates the principle of the uplink spreading function [3], where the dedicated physical data channels (DPDCHs) carry user data and the dedicated physical control channel (DPCCH) carries pilot bits and some other control information. As shown, the DPCCH is spread to the chip rate by the channelization code $C_c$, while the $m$th DPDCH, denoted DPDCH$_m$, is spread to the chip rate by the channelization code $C_{d,m}$. One DPCCH and up to six parallel DPDCHs can be transmitted simultaneously. The spreading factor of the DPCCH is fixed at 256. If there is only one DPDCH, then the spreading factor can vary between 4 and 256. If more than one DPDCH is used, then all use 4 as the spreading factor. After channelization, the spread signals are weighted by the gain factors $\beta_c$ (for DPCCH) and $\beta_d$ (for all DPDCHs). The allowed values of $\beta_c$ and $\beta_d$ are given in Reference [3] and some example assignments can be found in Reference [4, Annex A]. After the weighting, the I- and Q-branches are summed and treated as a complex-valued stream of chips. This complex-valued signal is scrambled by the complex-valued scrambling code $C_s$.

Fig. 1. Spreading for uplink DPCCH and DPDCHs (based on Figure 1 in Reference [3]).


Let SF denote the spreading factor for the DPDCHs. Then the $k$th user's signal after spreading and scrambling is given by

$$s_k[n] = \left\{ \sum_{m=1,3,5} b_m^{(k)}[n]\, C_{d,m}[n]\, \beta_d + j \left( \sum_{m=2,4,6} b_m^{(k)}[n]\, C_{d,m}[n]\, \beta_d + b_p^{(k)}[n]\, C_c[n]\, \beta_c \right) \right\} C_{s,k}[n] \tag{1}$$

where $n$ is the chip index, $b_m^{(k)}[n]$ is the SF-times repeated user data bit on DPDCH$_m$ and $b_p^{(k)}[n]$ is the 256-times repeated control channel bit. The notations $C_{d,m}[n]$ and $C_c[n]$ should be self-evident, and we have added a second subscript in $C_{s,k}[n]$ to designate the $k$th user. If one DPDCH is sufficient for the required transmission rate (as is the case considered in our implementation), then the modulation becomes BPSK on both the I- and the Q-branches. (It is not conventional QPSK because the two branches may have different gains.) In this case, Equation (1) reduces to

$$s_k[n] = \left\{ b^{(k)}[n]\, C_d[n]\, \beta_d + j\, b_p^{(k)}[n]\, C_c[n]\, \beta_c \right\} C_{s,k}[n] \tag{2}$$

For channel transmission, the scrambled chip stream must be filtered by a pulse shaping filter, $g_T(t)$. For this, the 3GPP WCDMA employs a root-raised-cosine (RRC) filter with roll-off factor 0.22. The receiver front-end employs a similar RRC filter $g_R(t)$. For convenience in simulating the wireless channel, we use the equivalent lowpass representation sampled at four times the chip rate. Also for convenience, we shall distinguish a discrete-time signal at four times the chip rate and its corresponding version at the chip rate by the presence and absence of the superscript $'$. Accordingly, let $s'_k[n]$ denote the four-times upsampled version of $s_k[n]$, and $g'_T[n]$ and $g'_R[n]$ be $g_T(t)$ and $g_R(t)$ sampled at four times the chip rate, respectively. Let $L_k$ be the number of resolvable propagation paths of the channel for user $k$. Then the signal after the receiver filter, sampled at four times the chip rate, is

$$r'_k[n] = \sum_{l=1}^{L_k} \alpha'_{k,l}[n] \left( g'_T[n] * s'_k[n - \tau_{k,l}] * g'_R[n] \right) \tag{3}$$

where $*$ denotes convolution, and $\tau_{k,l}$ and $\alpha'_{k,l}$ are, respectively, the delay (in units of one-fourth the chip period) and the attenuation of the $l$th path associated with user $k$. If there are $K$ users in the system, then the signal after receiver filtering at the base station is

$$r'[n] = \sum_{k=0}^{K-1} r'_k[n] + \eta'[n] * g'_R[n] \tag{4}$$

where $\eta'[n]$ is the additive channel noise sampled at four times the chip rate.

For practical implementation, $g'_T[n]$ and $g'_R[n]$ have to be of finite lengths. We find that the spectrum emission mask of 3GPP WCDMA can be satisfied with $g'_T[n]$ truncated to 33 taps [5]. The receiver filter $g'_R[n]$ may be truncated to the same or a different length depending on complexity and performance considerations.
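The single-DPDCH spreading and scrambling of Equation (2) can be illustrated with a minimal Python sketch. The function name, the argument names and the short toy codes below are illustrative assumptions only; in particular, the toy spreading factors (4 and 8) and the toy scrambling chips are not the code sets defined in the standard.

```python
# Sketch of single-DPDCH uplink spreading and scrambling, after Equation (2).
# All names and the toy code values below are illustrative assumptions.

def spread_and_scramble(data_bits, pilot_bits, c_d, c_c, beta_d, beta_c, c_s):
    """Return chips s[n] = (b[n] C_d[n] beta_d + j b_p[n] C_c[n] beta_c) C_s[n].

    data_bits:  +/-1 data bits, one per symbol (bit repetition over the
                spreading factor is done here by integer division of n)
    pilot_bits: +/-1 control-channel bits, one per symbol
    c_d, c_c:   +/-1 channelization codes for the DPDCH and the DPCCH
    c_s:        complex scrambling chips, one per output chip
    """
    sf_d, sf_c = len(c_d), len(c_c)
    n_chips = len(data_bits) * sf_d
    assert n_chips == len(pilot_bits) * sf_c == len(c_s)
    chips = []
    for n in range(n_chips):
        i_branch = data_bits[n // sf_d] * c_d[n % sf_d] * beta_d   # DPDCH on I
        q_branch = pilot_bits[n // sf_c] * c_c[n % sf_c] * beta_c  # DPCCH on Q
        chips.append(complex(i_branch, q_branch) * c_s[n])
    return chips

# Toy example: SF = 4 for data and 8 for control (the standard fixes the
# control-channel spreading factor at 256).
c_d = [1, 1, -1, -1]
c_c = [1] * 8
c_s = [complex(1, 1), complex(1, -1), complex(-1, 1), complex(-1, -1)] * 2
s = spread_and_scramble([1, -1], [1], c_d, c_c, beta_d=1.0, beta_c=0.5, c_s=c_s)
```

Because the two branches carry different gains, the chips before scrambling do not lie on a conventional QPSK constellation, which is the point made in the text.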

3. Structure of the SIC Receiver

As mentioned earlier, SIC attempts to eliminate the interference from different users in a successive way. More exactly, it detects a user's signal, regenerates the contribution of this user signal in the received signal and subtracts the regenerated signal from the received signal. It then detects another user's signal and repeats the above process, until all user signals are detected. If the detections are correct, then each iteration through the process reduces the number of interferers by one and thereby improves the performance of later detections. Also as mentioned, the sequential manner in which signal detections are carried out in SIC is beneficial to our DSP implementation employing multiple processors.

In the following subsections, we describe the key features of our SIC receiver.

3.1. Overall Structure

While in principle there can be many users in the system, our hardware platform (described in more detail in Section 4) contains only four DSP chips. We therefore consider three-user SIC. From the perspective of practical system implementation, this should not be severely limiting because the SIC receiver can be modified, replicated and connected to handle the case of more users. On the other hand, consideration of three users is sufficient for us to appreciate various DSP implementation issues.


The block diagram of a three-user SIC receiver is shown in Figure 2. U0, U1 and U2 denote the three users respectively, in the order they are processed. A RAKE receiver is used to detect each user signal, after interference cancellation except for U0. The receiver starts with detection of U0’s signal. The result is passed through a signal regenerator to reproduce its contribution in the received signal. The regenerator output is subtracted from the received signal to form the input to the next RAKE receiver to detect U1’s signal. Mathematically, the subtraction yields

$$r'[n] - \hat{r}'_0[n] = \sum_{k=0}^{K-1} r'_k[n] - \hat{r}'_0[n] + \eta'[n] * g'_R[n] \tag{5}$$

where $K = 3$ in our implementation. If U0's signal is correctly detected, then $\hat{r}'_0[n] = r'_0[n]$ and the residual signal will only contain the contributions of U1 and U2. The signal detection, regeneration and subtraction process for U1 is similar to that for U0. Because U2 is the last user whose signal is detected, there is no need to regenerate its contribution in the received signal. In fact, for a system with $K$ users, signal regeneration and subtraction may not need to be carried out to the penultimate user. They can be omitted as soon as the interference has been reduced to an acceptable level. For SIC, it is known that proper power ranking can improve the error performance, so that the user signals are detected in the order of decreasing signal-to-interference-plus-noise ratio (SINR). This is not included in the present implementation for complexity reasons.
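The detect, regenerate and subtract order described above can be sketched in Python under strong simplifying assumptions: a single-path, chip-rate model with real ±1 codes, known amplitudes, no noise and no pulse shaping. The function names, codes and amplitudes are hypothetical and chosen only so that the weakest user is detected wrongly without cancellation and correctly with it.

```python
# Toy successive interference cancellation over a single-path, chip-rate model.
# Assumptions: real +/-1 codes, known amplitudes, no noise, no pulse shaping.

def despread(signal, code):
    """Correlate each symbol-length block of the signal with the code."""
    sf = len(code)
    return [sum(signal[i * sf + n] * code[n] for n in range(sf))
            for i in range(len(signal) // sf)]

def regenerate(bits, code, amplitude):
    """Rebuild one user's chip-rate contribution from its bit decisions."""
    return [amplitude * b * c for b in bits for c in code]

def sic_detect(received, codes, amplitudes):
    """Detect users in the given order, cancelling each one after detection."""
    residual = list(received)
    decisions = []
    for stage, (code, amp) in enumerate(zip(codes, amplitudes)):
        bits = [1 if q >= 0 else -1 for q in despread(residual, code)]
        decisions.append(bits)
        if stage < len(codes) - 1:      # the last user needs no regeneration
            replica = regenerate(bits, code, amp)
            residual = [r - x for r, x in zip(residual, replica)]
    return decisions

# Three users with nonorthogonal length-8 codes, strongest user first.
codes = [[1, 1, 1, 1, 1, 1, 1, 1],
         [1, -1, 1, -1, 1, -1, 1, 1],
         [1, 1, -1, -1, 1, 1, 1, -1]]
bits = [[1, -1], [-1, -1], [1, 1]]
amplitudes = [5, 2, 1]
parts = [regenerate(b, c, a) for b, c, a in zip(bits, codes, amplitudes)]
received = [sum(chips) for chips in zip(*parts)]

with_sic = sic_detect(received, codes, amplitudes)
without_sic = [1 if q >= 0 else -1 for q in despread(received, codes[2])]
```

In this toy case the third user's second bit is decided wrongly by plain despreading because of the strong first user, while the SIC chain recovers all bits.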

3.2. The RAKE Receiver

Single-user RAKE receivers have been used widely in today’s DS-CDMA systems. Essentially, a RAKE receiver performs maximal-ratio combining (MRC) of the received signal in its span to yield maximum SINR. For illustration, a four-finger RAKE receiver is shown in Figure 3 for an arbitrary user k. The channel estimator employs the pilot symbols in the uplink DPCCH of 3GPP WCDMA and is discussed further in the next subsection.

Fig. 2. Block diagram of a three-user SIC receiver.

For each finger, after proper delay as determined by the path searcher-tracker, the received signal is sampled at the chip rate. Techniques for path searching and tracking are not the focus of this work, although we do include a path searcher-tracker in our implementation. Details concerning the searching and tracking method can be found in Reference [6]. The descrambled signal in finger $l$ is given by

$$p_{k,l}[n] = r'[4n + \hat{\tau}_{k,l}]\, C^*_{s,k}[n] \tag{6}$$

where $\hat{\tau}_{k,l}$ is the estimated path delay. The RAKE receiver then performs despreading in each finger and the despread signal of user $k$ in finger $l$ is, for the $i$th bit,

$$q_{k,l}[i] = \sum_{n = i \cdot SF}^{(i+1) SF - 1} p_{k,l}[n]\, C_d[n] \tag{7}$$

The despread signal in each finger is weighted by the complex conjugate of the estimated channel coefficient corresponding to that path. The results are summed and thresholded to yield a decision of the $i$th bit as

$$\hat{b}^{(k)}[i] = \operatorname{sgn} \left\{ \sum_{l=1}^{L_k} \Re \left( q_{k,l}[i]\, \hat{\alpha}^*_{k,l}[\lfloor i \cdot SF/256 \rfloor] \right) \right\} \tag{8}$$

where $\hat{\alpha}_{k,l}[\lfloor i \cdot SF/256 \rfloor]$ is the estimated channel coefficient of the $l$th path of user $k$ in the period of bit $i$. Because the spreading factor for DPCCH is 256, for simplicity the channel coefficients are assumed to remain constant in the period of a pilot bit (a total of $256/SF$ user data bits) at least. Therefore we have $\lfloor i \cdot SF/256 \rfloor$ as its time index. The above assumption is also reflected in the channel estimator shown in Figure 3.
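The three finger operations of Equations (6) to (8) can be sketched as follows. This is a minimal illustration, not the fixed-point DSP code: it assumes the channel estimates stay constant over the block, that descrambling uses the complex conjugate of the scrambling chips, and that no pulse shaping is present. All names and toy values are hypothetical.

```python
# Sketch of RAKE detection: descrambling, despreading and maximal-ratio
# combining. Assumes constant channel estimates and no pulse shaping.

def rake_detect(r4, delays, channel, c_s, c_d):
    """r4:      complex samples at four times the chip rate
    delays:  per-finger delay in quarter-chip units
    channel: per-finger complex channel estimates
    c_s:     complex scrambling chips; c_d: +/-1 channelization code"""
    sf = len(c_d)
    bits = []
    for i in range(len(c_s) // sf):
        metric = 0.0
        for tau, alpha in zip(delays, channel):
            q = 0j
            for n in range(i * sf, (i + 1) * sf):
                p = r4[4 * n + tau] * c_s[n].conjugate()   # Eq. (6)
                q += p * c_d[n % sf]                        # Eq. (7)
            metric += (q * alpha.conjugate()).real          # Eq. (8)
        bits.append(1 if metric >= 0 else -1)
    return bits

# Toy two-path channel with delays of 0 and 6 quarter chips.
c_d = [1, -1, 1, -1]
c_s = [complex(1, 1), complex(1, -1), complex(-1, 1), complex(-1, -1)] * 3
tx_bits = [1, -1, 1]
delays = [0, 6]
channel = [complex(0.8, 0.3), complex(-0.2, 0.5)]
chips = [tx_bits[n // 4] * c_d[n % 4] * c_s[n] for n in range(12)]
r4 = [0j] * (4 * len(chips) + max(delays))
for tau, alpha in zip(delays, channel):
    for n, chip in enumerate(chips):
        r4[4 * n + tau] += alpha * chip

detected = rake_detect(r4, delays, channel, c_s, c_d)
```

Taking only the real part after weighting by the conjugate channel estimate co-phases the fingers, so each path contributes in proportion to its estimated power.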

3.3. The Channel Estimator

In WCDMA uplink transmission, the DPCCH carries some pilot symbols known to the receiver. We use these symbols to estimate the channel coefficients. Essentially, this estimate is obtained by correlating the locally generated pilot with the received pilot. By using a simple autoregressive moving average (ARMA) filter, as shown in the upper part of Figure 3, the correlator's time constant is made longer than the period of a single pilot bit so that the noise effects may be reduced. The ARMA filter is given by

$$\hat{\alpha}_{k,l}[m] = W\, \hat{\alpha}_{k,l}[m-2] + (1 - W) \left( \tilde{\alpha}_{k,l}[m] + \tilde{\alpha}_{k,l}[m-1] \right) \tag{9}$$

where $\tilde{\alpha}_{k,l}[m]$ is the filter input and is given by

$$\tilde{\alpha}_{k,l}[m] = -j\, b_p^{(k)}[m] \sum_{n = 256m}^{256(m+1) - 1} p_{k,l}[n] \tag{10}$$

The channel estimate is updated only once every two pilot bits. It is also held constant in the period when the DPCCH contains nonpilot symbols. The summation on the right-hand side of Equation (10) comes from the fact that $C_c[n]$ in 3GPP contains all 1s. Thus despreading becomes simple accumulation. The weight $W$ is set somewhere between 0 and 1 and may change according to the channel condition. For example, it may be smaller when the channel is fast-changing and larger conversely.
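A minimal Python sketch of the pilot correlation and ARMA smoothing follows. It rests on two stated assumptions about Equations (9) and (10): that the correlation carries a factor of $-j$ to rotate the Q-branch pilot back onto the real axis, and that the weight $(1 - W)$ applies to the sum of the two newest raw correlations. The function names are hypothetical.

```python
# Sketch of the pilot-based channel estimator, under the assumptions stated
# in the lead-in (sign of the j factor, scope of the (1 - W) weight).

def pilot_correlate(descrambled_chips, pilot_bit):
    """Accumulate the 256 descrambled chips of one pilot bit and remove the
    known pilot bit; accumulation suffices because C_c is all ones."""
    return -1j * pilot_bit * sum(descrambled_chips)

def arma_update(estimate_two_bits_ago, raw_now, raw_prev, w):
    """Blend the estimate from two pilot bits ago with the two newest raw
    correlations; a larger w gives a longer time constant."""
    return w * estimate_two_bits_ago + (1.0 - w) * (raw_now + raw_prev)

# Toy check: a noiseless pilot on the Q-branch through a channel gain alpha.
alpha = complex(0.5, -0.25)
pilot_bit = -1
chips = [alpha * 1j * pilot_bit] * 256
raw = pilot_correlate(chips, pilot_bit)        # equals 256 * alpha here
estimate = arma_update(0j, raw, raw, w=0.5)
```

The raw correlation is scaled by the accumulation length (and, in the receiver, by the DPCCH gain), which is why the regenerator later needs only the gain ratio between the data and control channels.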

3.4. The Signal Regenerator

The purpose of a signal regenerator is to duplicate the contribution in the received signal of a specific user. The more accurate the duplication, the better performance the SIC can achieve. A signal regenerator that takes finite-precision computation into consideration is shown in Figure 4.

Note that, while in the transmitter the DPCCH and the DPDCH are multiplied by two different gains $\beta_c$ and $\beta_d$, in the signal regenerator we only multiply the DPDCH by the gain ratio $\beta_d/\beta_c$. This is because the channel estimator, in its computation using the pilot symbols on DPCCH, will supply the factor $\beta_c$. For user $k$ therefore, after scrambling we have

$$\hat{s}_k[n] = \left( \hat{b}^{(k)}[n]\, C_d[n]\, \frac{\beta_d}{\beta_c} + j\, \hat{b}_p^{(k)}[n]\, C_c[n] \right) C_{s,k}[n] \tag{11}$$

To regenerate a user signal's contribution in the received signal, the effect of the transmitter and the receiver filters must also be mimicked. We mentioned that a four-times oversampled, 33-tap RRC filter can satisfy the 3GPP WCDMA's spectrum emission mask. If a similar receiver filter is employed for matched filtering, then their combination is a 65-tap filter. To implement such a filter in the signal regenerator is a severe computational burden. Since the cascade of the two filters should approximate a raised cosine (RC) filter, we may use a shorter RC filter to lower the complexity. This may potentially lead to some performance loss compared to perfect reproduction of the transmitter and the receiver filters. Nevertheless, it will be able to approximately regenerate the contribution of the given user in the received signal to achieve interference reduction in SIC. In addition, when the transmitter and the receiver are designed and manufactured by two different entities, perfect matching may not be possible anyway. Our analysis shows that a four-times oversampled, 9-tap RC filter contains about 94.5% of the energy in the original 65-tap filter. We thus employ this filter in place of the cascaded RRC and denote it $g'[n]$.

After the RC filtering and the simulated multipath propagation based on the estimated channel response, the signal regenerator output for a specific user $k$ can be expressed as

$$\hat{r}'_k[n] = \sum_{l=1}^{L_k} \hat{\alpha}_{k,l}\!\left[ \left\lfloor \frac{n/4}{256} \right\rfloor \right] \left( \hat{s}'_k[n - \hat{\tau}_{k,l}] * g'[n] \right) \tag{12}$$

Equation (12) can be compared with Equation (3) to appreciate and analyze the effects of various inaccuracies on SIC performance.
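The regeneration chain (upsampling by four, short RC filtering, then delaying and weighting per path) can be sketched as follows. The tap formula is the textbook raised-cosine response sampled at four times the chip rate with roll-off 0.22; the function names are hypothetical, and the channel estimates are held constant over the block rather than indexed per pilot bit as in Equation (12).

```python
import math

# Sketch of signal regeneration with a 9-tap raised-cosine filter.
# Assumption: channel estimates are constant over the block.

def rc_taps(num_taps=9, oversample=4, rolloff=0.22):
    """Raised-cosine impulse response sampled at `oversample` times the chip
    rate and truncated symmetrically to `num_taps` taps."""
    half = num_taps // 2
    taps = []
    for k in range(-half, half + 1):
        t = k / oversample                       # time in chip periods
        sinc = 1.0 if k == 0 else math.sin(math.pi * t) / (math.pi * t)
        taps.append(sinc * math.cos(math.pi * rolloff * t)
                    / (1.0 - (2.0 * rolloff * t) ** 2))
    return taps

def regenerate(chips, delays, channel, taps, oversample=4):
    """Upsample the chips, apply the RC filter, then sum delayed and weighted
    copies, one per estimated path (delays in quarter-chip units)."""
    up = [0j] * (len(chips) * oversample)
    for n, c in enumerate(chips):
        up[n * oversample] = c
    half = len(taps) // 2
    shaped = [sum(taps[j] * up[m - j + half]
                  for j in range(len(taps)) if 0 <= m - j + half < len(up))
              for m in range(len(up))]
    out = [0j] * (len(shaped) + max(delays))
    for tau, alpha in zip(delays, channel):
        for m, x in enumerate(shaped):
            out[m + tau] += alpha * x
    return out

taps = rc_taps()
single_path = regenerate([1 + 1j, -1 + 1j, 1 - 1j], [0], [1 + 0j], taps)
```

With a single unit-gain path, the chip-aligned output samples reproduce the input chips, because the raised-cosine response is zero at nonzero multiples of the chip period.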

4. DSP Implementation

4.1. Architecture

Our implementation employs the Quatro6x DSP card made by Innovative Integration (I.I.) [7], with an IBM-compatible PC as the host. The card houses four Texas Instruments DSP chips, where the DSPs may be the fixed-point TMS320C6201 (200 MHz) or the floating-point TMS320C6701 (166 MHz). The card lays out the four DSPs in a symmetric multiprocessing relationship with high-bandwidth interprocessor communication links, called the FIFO links. The network of FIFO links allows any of the DSPs to transmit to and receive from any other DSP via a high speed 32-bit-wide FIFO buffered interface. The so-called FIFOPorts provide buffered 16-bit interfaces which allow the card to communicate with other I.I. DSP cards or external hardware at high data rates. The PCI interface allows transfer of data to and from the host PC.

We employ Quatro62 (Quatro6x with TMS320C6201) to implement the three-user SIC receiver. Each SIC stage takes one DSP, as shown in Figure 5. The received signal is first passed to the path searcher-tracker implemented on CPU_4, which finds the path delays and sends the information to CPU_3, CPU_2 and CPU_1 via the FIFO links. As mentioned, we refer to Reference [6] for details of the path searching and tracking method. Besides the path delays, the received signal is also passed to CPU_3. The RAKE receiver on CPU_3 detects U0's data, performs signal regeneration and interference cancellation, and delivers the resulting signal to CPU_2. CPU_2 performs similar operations, except that the user of concern is U1. Finally, in CPU_1, the last RAKE is performed for U2, completing the entire task. At present, the synchronization between the DSPs is asynchronous and event-driven: a downstream DSP halts until its upstream FIFO buffer is filled.
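The event-driven hand-off between stages can be mimicked on a host computer with threads and blocking queues. This is only an analogy to the FIFO links, not code for the Quatro62 card: the queue objects, the `stage` function and the frame representation are hypothetical, and the path searcher on CPU_4 is left out.

```python
import queue
import threading

# Host-side analogy of the three-stage pipeline with blocking FIFO links.

def stage(name, inbox, outbox):
    """Wait for a frame from upstream, tag it and pass it downstream."""
    while True:
        frame = inbox.get()            # blocks until the upstream FIFO has data
        if frame is None:              # sentinel: forward shutdown and stop
            outbox.put(None)
            break
        outbox.put(frame + [name])     # stand-in for RAKE and cancellation

def run_pipeline(frames):
    fifo_3, fifo_2, fifo_1, done = (queue.Queue() for _ in range(4))
    links = [("CPU_3", fifo_3, fifo_2),
             ("CPU_2", fifo_2, fifo_1),
             ("CPU_1", fifo_1, done)]
    threads = [threading.Thread(target=stage, args=link) for link in links]
    for t in threads:
        t.start()
    for frame in frames:
        fifo_3.put(frame)
    fifo_3.put(None)
    for t in threads:
        t.join()
    results = []
    while True:
        item = done.get()
        if item is None:
            break
        results.append(item)
    return results

processed = run_pipeline([["frame0"], ["frame1"]])
```

Each stage idles in `inbox.get()` until its predecessor delivers data, which mirrors the behaviour of a downstream DSP waiting on its upstream FIFO buffer.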


The receiver performance can be examined and verified in several ways. One is to compare the transmitted and the detected user bits at the host PC and another is to compare them using one DSP. Both ways are illustrated in Figure 5. To do it at the host PC requires routing of the RAKE receiver outputs for U0 and U1 to CPU_1 because only CPU_1 has access to the PCI. On the other hand, to do it on a DSP requires loading of the transmitted user bits into the DSP card’s memory. The latter approach is simpler.

4.2. Error Performance

We examine the bit-error-rate (BER) performance in this subsection. We also compare the BER performance of the DSP implementation with the simulation results. Although wireless channels are often characterized by multipath fading, for verification purposes we first consider the AWGN condition. Due to considerations outside the scope of the present work, the spreading factor of user data is set to 16 [8]. Figure 6 plots the performance of the three-user SIC receiver in AWGN at $E_b/N_0 = 7$ dB. The user indexes correspond to those in Figure 2. The BERs, both with and without SIC, are given. Theoretically, the single-user BER is $Q(\sqrt{2E_b/N_0})$ for BPSK in AWGN with coherent demodulation. With $E_b/N_0 = 7$ dB, it is approximately $10^{-3}$. This sets the lower bound of any detection technique.

Fig. 5. DSP implementation of three-user SIC receiver.

First, consider the floating-point simulation results. With SIC, the performance of the second and the third users improves significantly. To verify the performance of a 9-tap RC filter in the signal regenerator, we compare it with that of a 33-tap RC filter. The difference is minute, which justifies the use of the 9-tap filter for complexity reasons. In both cases with SIC, the BERs for U2 are also close to the theoretical minimum of $10^{-3}$. The differences should be caused mainly by the residual MAI, signifying that the interference subtractions are not perfect. Next, consider the fixed-point simulation results. A comparison with the floating-point results shows that our design preserves the accuracy well. Finally, consider the results of DSP implementation. The results are obtained using the development tool Code Composer Studio (CCS) from Texas Instruments. We see that the results agree well with those of the simulations.
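The single-user bound used as the reference level can be checked numerically. The sketch below evaluates the standard Q-function through the complementary error function; the function names are illustrative.

```python
import math

# Single-user BPSK bit error rate in AWGN with coherent demodulation.

def q_function(x):
    """Gaussian tail probability Q(x), written with erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bpsk_ber_awgn(ebn0_db):
    """Return Q(sqrt(2 Eb/N0)) for Eb/N0 given in dB."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return q_function(math.sqrt(2.0 * ebn0))

ber_7db = bpsk_ber_awgn(7.0)
```

At 7 dB the value comes out slightly below $10^{-3}$, consistent with the approximate figure quoted for the lower bound.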

We now turn to the condition of fading channels. Some results for a single-path fading channel are shown in Figure 7. Here we only compare the result of the DSP implementation with that of fixed-point simulation on a general-purpose computer. Their closeness reaffirms the correctness of the DSP implementation. As a side remark, we note that in comparison to the AWGN channel results, the performance of both the RAKE and the SIC receivers suffers due to fading.

Lastly, we turn to the condition of multipath channels. For this we consider the channel models given in Reference [4], which are shown in Table I. Experimental results again show that a 9-tap RC filter in the signal regenerator performs similarly to a 33-tap RC filter and that our fixed-point design performs similarly to floating-point simulation.

4.3. Execution Speed and Computational Complexity

Table II lists the measured execution time of SIC components in four multipath conditions for the DSP implementation. Note that the signal regenerator is the most time-consuming component of all. This is due to the RC filtering and the simulation of multipath propagation. Table III shows the measured execution time of different stages of the SIC receiver. Since the scrambling code generation has to be done only once, it is performed offline. The results for U0 and U1 are exactly the sum of the results for RAKE and signal regenerator in Table II. Therefore, there is imperceptible overhead in combining the SIC components. The results for U2 are exactly the same as the RAKE results in Table II, as for this user only the RAKE is performed. The results also show that we fall behind the real-time requirement by about three times in CPU_3 and CPU_2 under the 4-path condition. If only conventional RAKE receiving is needed, then Table II shows that real-time processing is achievable in all conditions.

To see how efficiently we have used the DSP resources in the implementation, we analyze the computational complexity of the SIC receiver. For simplicity, we only calculate the number of multiplications required. This is because in the RAKE receiver and the signal regenerator, multiplications and additions are the primary operations. Since the signal flow is quite regular in these components, the amount of multiplications and additions should be indicative of the overall complexity. However, on the TMS320C6201 DSP, there are six adders but only two multipliers [10]. We thus measure the complexity using the number of multiplications.

A multiplication on the TMS320C6201 requires two clock cycles. But the multipliers are pipelined, so that, with proper sequencing of data, a throughput of one multiplication per cycle per multiplier is possible. At a 200 MHz clock therefore, the DSP can perform up to $400 \times 10^6$ multiplications per second. This will be our base in gauging the efficiency of the implementation.

Table I. Multipath channel examples (based on Table B.1 of Reference [4]).

| Case | Relative delay [ns] | Average power [dB] |
|---|---|---|
| Case 1 (3 km/h), Case 5 (50 km/h) | 0, 976 | 0, −10 |
| Case 2 (3 km/h) | 0, 976, 20000 | 0, 0, 0 |
| Case 3 (120 km/h), Case 6 (250 km/h) | 0, 260, 521 | 0, −3, −6 |
| Case 4 (3 km/h) | 0, 976 | 0, 0 |

Table II. Processing time of SIC components for one WCDMA signal frame (10 ms) on DSP, in ms.

| Path number | 4 | 3 | 2 | 1 |
|---|---|---|---|---|
| One scrambling code generator | 7.0 | 7.0 | 7.0 | 7.0 |
| One RAKE | 7.8 | 5.9 | 3.9 | 2.0 |
| One signal regenerator | 21.0 | 16.1 | 11.7 | 9.9 |

Table III. Processing time of SIC stages for one WCDMA signal frame (10 ms) on DSP, in ms.

| Path number | 4 | 3 | 2 | 1 |
|---|---|---|---|---|
| U0 on CPU_3 | 28.8 | 22.0 | 15.6 | 11.9 |
| U1 on CPU_2 | 28.8 | 22.0 | 15.6 | 11.9 |
| U2 on CPU_1 | 7.8 | 5.9 | 3.9 | 2.0 |

Consider first the RAKE receiver. The amount of (real-number) multiplications required, per finger, for a 10 ms frame is estimated below.

1. Descrambling: 38 400 (number of chips) × 4 (real multiplications per complex multiplication).
2. Channel estimation: An upper bound is 150 (number of DPCCH bits) × [2 (multiplication with pilot) + 1/2 (frequency of ARMA) × 4 (ARMA filtering)].
3. MRC: 38 400 × 2 (I- and Q-branches) for despreading, plus 38 400/16 (number of data bits) × 2 (for real part of complex multiplication) for the cophased gain.

The total complexity for RAKE receiving is summarized in Table IV. The table also shows the efficiency of our RAKE implementation, where the efficiency is defined as the fraction of multiplier resource used in the time spent for RAKE receiving as given in Table II. For example, in the case of 4 paths the efficiency is calculated as $943\,200 / (400 \times 10^6 \times 7.8 \times 10^{-3}) = 30.2\%$. The efficiency figures are rather consistent over different path numbers, which can also be appreciated from the processing time figures in Table II.
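The per-finger count and the efficiency definition above can be reproduced with a few lines of Python. The constant and function names are illustrative; the processing times are the RAKE figures from Table II.

```python
# Multiplication count per RAKE finger for one 10 ms frame, and the
# resulting multiplier efficiency for the measured processing times.

CHIPS_PER_FRAME = 38_400
DPCCH_BITS_PER_FRAME = 150
SF_DATA = 16
PEAK_MULTS_PER_SECOND = 400e6      # two pipelined multipliers at 200 MHz

def rake_mults_per_finger():
    descrambling = CHIPS_PER_FRAME * 4                       # complex multiply
    channel_estimation = DPCCH_BITS_PER_FRAME * (2 + 0.5 * 4)
    despreading = CHIPS_PER_FRAME * 2                        # I and Q branches
    cophased_gain = (CHIPS_PER_FRAME // SF_DATA) * 2
    return int(descrambling + channel_estimation + despreading + cophased_gain)

def efficiency(mults, seconds):
    """Fraction of the peak multiplier throughput actually used."""
    return mults / (PEAK_MULTS_PER_SECOND * seconds)

rake_time_ms = {4: 7.8, 3: 5.9, 2: 3.9, 1: 2.0}             # Table II
rake_table = {paths: (paths * rake_mults_per_finger(),
                      100.0 * efficiency(paths * rake_mults_per_finger(),
                                         t_ms * 1e-3))
              for paths, t_ms in rake_time_ms.items()}
```

The counts scale linearly with the number of fingers, so a nearly constant efficiency simply reflects processing times that also scale almost linearly with the path number.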

The amount of real multiplications required for the signal regenerator is estimated as follows.

1. Spreading: 38 400 × [1 (DPDCH spreading) + 1 (DPDCH gain)].
2. Scrambling: 38 400 × 4 (real multiplications per complex multiplication).
3. Pulse-shaping filtering: 38 400 × 9 (RC filtering) × 2 (I- and Q-branches).
4. Simulated multipath propagation: For each finger, 38 400 × 4 (oversampling factor) × 4 (real multiplications per complex multiplication).

The total complexity for signal regeneration and the efficiency of our DSP implementation are summarized in Table V. The efficiency figures for different path numbers vary over a greater range than those of the RAKE receiver. This may be due to uneven overhead in memory accesses.
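The four contributions listed for the signal regenerator can be totalled in the same way. The function name and default arguments are illustrative.

```python
# Multiplication count of the signal regenerator for one 10 ms frame.

CHIPS_PER_FRAME = 38_400

def regenerator_mults(num_paths, rc_taps=9, oversample=4):
    spreading = CHIPS_PER_FRAME * (1 + 1)           # DPDCH spreading and gain
    scrambling = CHIPS_PER_FRAME * 4                # complex multiply
    pulse_shaping = CHIPS_PER_FRAME * rc_taps * 2   # I and Q branches
    multipath = num_paths * CHIPS_PER_FRAME * oversample * 4
    return spreading + scrambling + pulse_shaping + multipath

regen_counts = {paths: regenerator_mults(paths) for paths in (4, 3, 2, 1)}
```

Only the multipath term grows with the number of paths; the fixed part is dominated by the pulse-shaping filter, which is why shortening the RC filter to 9 taps matters.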

Taken together, the overall efficiency of the complete SIC receiver, for each user in each multipath condition, is given in Table VI.

Note from Tables IV and V that the required total number of multiplications in the case of four paths is beyond the capability of one TMS320C6201 chip. Therefore, real-time processing cannot be achieved without re-engineering the system architecture (such as distributing the load of one signal regenerator to more than one DSP chip) or the SIC algorithm (to reduce the number of multiplications). Nevertheless, the efficiency figures indicate that there may be room for improvement even under the present system and algorithm structure. These are relegated to potential future work.
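The capacity argument can be made explicit by comparing the per-frame multiplication counts of Tables IV and V with the number of multiplications one DSP can perform in a 10 ms frame at peak throughput. The names below are illustrative.

```python
# Per-frame multiplication budget of one TMS320C6201 versus the combined
# RAKE and signal regenerator counts from Tables IV and V.

PEAK_MULTS_PER_SECOND = 400e6
FRAME_SECONDS = 10e-3
BUDGET = PEAK_MULTS_PER_SECOND * FRAME_SECONDS     # about 4 million

RAKE_MULTS = {4: 943_200, 3: 707_400, 2: 471_600, 1: 235_800}
REGEN_MULTS = {4: 3_379_200, 3: 2_764_800, 2: 2_150_400, 1: 1_536_000}

def fits_in_one_frame(paths):
    """True if the count fits even at 100% multiplier efficiency."""
    return RAKE_MULTS[paths] + REGEN_MULTS[paths] <= BUDGET

feasible = {paths: fits_in_one_frame(paths) for paths in (4, 3, 2, 1)}
```

With four paths the combined count exceeds the budget even at ideal efficiency, whereas for three or fewer paths it fits in principle, so the shortfall there is a matter of implementation efficiency.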

4.4. Memory Usage

Table VII gives the memory usage in the three CPUs carrying out the SIC function, where IPRAM and IDRAM are, respectively, the internal program and data RAMs on the DSP chip, and SDRAM is the external synchronous DRAM on the Quatro62 card. Both the IPRAM and the IDRAM are 64 kbytes in size and the SDRAM contains 64 Mbytes. CPU_2 and CPU_3 have similar usage of the IPRAM and the IDRAM, since both of them implement a RAKE and a signal regenerator. CPU_3 uses a significantly larger amount of the SDRAM because it has to get the received signal samples from CPU_4 and store them for processing use. On the other hand, CPU_1 uses a comparatively smaller amount of the internal memories, especially the IDRAM, because it does no signal regeneration.

Table IV. Computational complexity of RAKE receiver for one WCDMA signal frame and efficiency of DSP implementation.

| Path number | 4 | 3 | 2 | 1 |
|---|---|---|---|---|
| Multiplications required | 943 200 | 707 400 | 471 600 | 235 800 |
| Efficiency (%) | 30.2 | 30.0 | 30.2 | 29.5 |

Table V. Computational complexity of signal regenerator for one WCDMA signal frame and efficiency of DSP implementation.

| Path number | 4 | 3 | 2 | 1 |
|---|---|---|---|---|
| Multiplications required | 3 379 200 | 2 764 800 | 2 150 400 | 1 536 000 |
| Efficiency (%) | 40.2 | 42.9 | 45.9 | 38.8 |

Table VI. Efficiency of DSP processing for three-user SIC receiver (all data are in percentage).

| Path number | 4 | 3 | 2 | 1 |
|---|---|---|---|---|
| U0 on CPU_3 | 37.5 | 39.5 | 42.0 | 37.2 |
| U1 on CPU_2 | 37.5 | 39.5 | 42.0 | 37.2 |
| U2 on CPU_1 | 30.2 | 30.0 | 30.2 | 29.5 |

Table VII. Memory usage of the DSP implementation of three-user SIC receiver, in kbytes.

| Processor | CPU_1 | CPU_2 | CPU_3 |
|---|---|---|---|
| IPRAM | 28.5 | 31.5 | 31.094 |
| IDRAM | 30.861 | 60.188 | 62.164 |

5. Conclusion

In the area of wireless communication, two subjects of much recent interest are software-defined radio and multiuser detection of CDMA signals. We conducted a study on DSP implementation of an SIC receiver for 3GPP WCDMA uplink transmission. The implementation employed a commercially available general-purpose multi-DSP platform. Issues addressed in the work included system-level design for multiprocessor implementation, design of the channel estimator, design of the signal regenerator, determination of the precision of fixed-point computations, consideration of the receiver's error performance and analysis of the implementation's complexity and efficiency. These issues are tightly coupled with the 3GPP WCDMA specifications.

Due to the features of the DSP platform employed, the implementation only considered up to three users. But this has been sufficient for us to appreciate various DSP implementation issues of an SIC receiver. In addition, by the nature of SIC, it is easy to extend the implementation to handle more users with an enlarged platform.

Our present implementation can achieve real-time processing speed if RAKE receivers alone are activated. Due to the complexity in signal regeneration, it still falls short of the real-time requirement when interference cancellation is enacted. Indeed, when the number of multipaths is four or more, either the system architecture or the SIC algorithm needs to be redesigned for real-time processing to be possible with the present platform. These and other ways of improvement are relegated to potential future work.

References

1. Verdú S. Multiuser Detection. Cambridge University Press: Cambridge, UK, 1998.

2. Correal NS, Buehrer RM, Woerner BD. A DSP-based DS-CDMA multiuser receiver employing partial parallel interference cancellation. IEEE Journal on Selected Areas in Communications 1999; 17(4): 613–630.

3. 3GPP. Technical Specification Group Radio Access Network; Spreading and Modulation (FDD). Doc. 3G TS 25.213 ver. 4.1.0, June 2001.

4. 3GPP. Technical Specification Group Radio Access Networks; UE Radio Transmission and Reception (FDD). Doc. 3G TS 25.101 ver. 4.1.0, June 2001.

5. Tsai SL, Lin YN, Lin DW. Study and DSP implementation of 3GPP wideband-CDMA transmission signal processing and wireless channel simulation. In Proceedings of National Symposium on Telecommunications, paper no. PCOM-1-11, Puli, Nantou, Taiwan, ROC, December 2002.

6. Lin JC, Wei CH. DSP implementation of uplink code synchronization for WCDMA wireless system. In Proceedings of National Symposium on Telecommunications, paper no. PCOM-1-1, Puli, Nantou, Taiwan, ROC, December 2002.

7. Innovative Integration. Quatro6x Development Package Manual, 16 January 2001.

8. Chen WY, Lin YN, Lin DW. Study and DSP implementation of 3GPP WCDMA uplink multiplexing and channel coding methods. In Proceedings of National Symposium on Telecommunications, paper no. COM-6-5, Puli, Nantou, Taiwan, ROC, December 2002.

9. Stüber GL. Principles of Mobile Communication, 2nd edn. Kluwer Academic: Boston, 2001.

10. Texas Instruments. TMS320C6000 CPU and Instruction Set Reference Guide. Literature no. SPRU189F, October 2000.

Authors’ Biographies

Yu-jung Chang was born in Taiwan on 1 July 1978. He received the B.S. degree in Electrical Engineering from National Tsing Hua University, Taiwan, in 2000, and the M.S. degree in Electronics Engineering from National Chiao Tung University, Taiwan, in 2002, where he conducted research in wireless communication, especially in multiuser detection techniques applied to WCDMA.

He is currently with the Industrial Technology Research Institute, Hsinchu, Taiwan, where he works on intelligent transportation systems (ITS). His research interests include spread-spectrum transmission systems, multiuser detection, detection and estimation with application to wireless communication, and implementation of digital transceivers.

Yu-Nan Lin was born in Taichung, Taiwan, R.O.C., in 1975. He received the B.S. degree in Electronics Engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1998. On the basis of outstanding academic performance, he was admitted to the M.S. program in 1998 and subsequently to the Ph.D. program in 2000, both in the same Department of Electronics Engineering of National Chiao Tung University. His current research interests include CDMA systems, multiuser detection and channel estimation.

David W. Lin received the B.S. degree from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1975, and the M.S. degree and Ph.D. from the University of Southern California, Los Angeles, in 1979 and 1981 respectively, all in Electrical Engineering.

He was with Bell Laboratories during 1981–1983, and with Bellcore during 1984–1990 and again during 1993–1994. Since 1990, he has been a professor in the Department of Electronics Engineering and the Center for Telecommunications Research, National Chiao Tung University. He has conducted research in digital adaptive filtering and telephone echo cancellation, digital subscriber line and coaxial network transmission, speech and video coding, and wireless communication. His research interests include various topics in communication engineering and signal processing.