

Multi-Stream Transmission System and Quality Prediction Model

2.2 Multi-Stream Voice Quality Prediction Model

In Section 1.2 we identified two limitations of the E-model in predicting conversational speech quality in the third scenario. First, it may fail to register impairments caused by reconstruction based on information from a single path as opposed to both paths, when no packets from either path are lost. Second, the detrimental effects that accompany changes in the playout scenario may be ignored, which harms its prediction of the overall quality. Recognizing this, we propose a new objective method for predicting the perceived quality of multi-stream voice transmission. In addition to delay and packet loss, the model takes into account the quality impairments due to frequent switching between playout scenarios.

Conceptually, the proposed model follows the commonly used ITU E-model [18] in defining the factors that affect the perceptual quality of MD voice transmission. As an analytical model of conversational speech quality used for network planning purposes, the E-model combines individual impairments due to the signal's properties and the network characteristics into a single R-factor, ranging from 0 to 100. In VoIP applications [24], the R-factor may be simplified as R = 94.2 − Id − Ie, where Id represents the delay impairment and Ie, known as the equipment impairment, accounts for impairments due to speech coding and packet loss. The delay impairment can be approximated by the simplified fit given in [24], which has the following form:

Id(d) = 0.024d + 0.11(d − 177.3)H(d − 177.3) (2.1)

where d is the end-to-end delay and H(x) is the step function. The E-model, originally proposed for single-stream transmission, is applicable only to a limited number of speech codecs and network conditions, since deriving the Ie model requires time-consuming subjective tests. With multiple voice streams, any subset of the streams can be used for signal reconstruction, and the transmission quality improves with the size of the subset. In addition to delay and packet loss, a good quality prediction model should therefore take into account the impairments caused by dynamic changes in the subset size during speech playout.
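As a minimal sketch of Eq. (2.1) and the simplified R-factor, the following Python snippet evaluates the delay impairment and the resulting rating. The function names are illustrative, and the delay d is assumed to be expressed in milliseconds, which the text does not state explicitly.

```python
def delay_impairment(d_ms: float) -> float:
    """Eq. (2.1): Id(d) = 0.024 d + 0.11 (d - 177.3) H(d - 177.3).

    d_ms is the end-to-end delay, assumed here to be in milliseconds.
    """
    # H(x) is the step function: the second term is active only above 177.3.
    step = 1.0 if d_ms > 177.3 else 0.0
    return 0.024 * d_ms + 0.11 * (d_ms - 177.3) * step


def r_factor(d_ms: float, ie: float) -> float:
    """Simplified VoIP rating R = 94.2 - Id - Ie."""
    return 94.2 - delay_impairment(d_ms) - ie


if __name__ == "__main__":
    # Illustrative inputs only; the Ie value of 20 is not taken from the text.
    for delay in (50.0, 150.0, 250.0):
        print(delay, round(delay_impairment(delay), 3), round(r_factor(delay, 20.0), 3))
```

Below 177.3 the impairment grows linearly with slope 0.024; above that threshold the slope increases to 0.134, so longer delays are penalized more heavily.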

For two-path transmission, each channel can either deliver or erase the transmitted description, so the two channels will always be in one of four possible states: no loss, loss in channel 1, loss in channel 2, and loss in both channels (packet erasure). Among them, only the speech resulting from the packet-erasure state is not affected by playout buffer operations. The receiver deals with the loss of both descriptions by using the error concealment algorithm of the G.729 codec to conceal the erased packet. If, additionally, speech decoded from either MD-G.729 description is assumed to be of similar quality, we only need to consider two kinds of playout scenarios at the receiver end. Specifically, a packet is 1) fully restored with two descriptions and thus played with high quality, or 2) partially restored with one description and thus played with degraded quality.
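The grouping of the four channel states into playout outcomes can be sketched as follows. The snippet assumes that the two channels lose descriptions independently with probabilities p1 and p2; this independence is an assumption made for illustration and is not stated in the text. The function and key names are hypothetical.

```python
def scenario_probabilities(p1: float, p2: float) -> dict:
    """Map the four two-channel states onto three playout outcomes.

    p1, p2: probability that the description on channel 1 / channel 2 is
    unavailable at playout time. Independence of the two channels is an
    illustrative assumption.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("loss probabilities must lie in [0, 1]")
    return {
        # No loss: both descriptions available, full-quality playout.
        "two_descriptions": (1.0 - p1) * (1.0 - p2),
        # Loss in channel 1 only, or in channel 2 only: degraded playout.
        "one_description": p1 * (1.0 - p2) + (1.0 - p1) * p2,
        # Loss in both channels: packet erasure, handled by concealment.
        "erasure": p1 * p2,
    }


if __name__ == "__main__":
    print(scenario_probabilities(0.1, 0.2))
```

Merging the two single-loss states into one outcome relies on the assumption above that speech decoded from either description is of similar quality.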

Figure 2.3: Schematic diagram for prediction of Ie model.

For brevity, let Sk denote the scenario in which k descriptions are received before the playout time. Conditioned on the event that the packet can be restored, we let qk be the probability of playing out the packet using k descriptions. Formally, it is given by qk = P(Sk)/(P(S1) + P(S2)). It is important to note that the quality degradations resulting from S1 and S2 are different perceptual experiences. For scenario S2, the standard G.729 decoding process is carried out after combining the two descriptions into one bitstream. Let Ie,k denote the equipment impairment that results from playing out k received descriptions. From the perceived QoS perspective, the MD-G.729 codec may be viewed as operating at two coding rates: 4.6 kbps for S1 and 8 kbps for S2. Taking the frequent switching between coding rates into account, we define the average equipment impairment due to MD-G.729 coding as follows:

Ie(e) = q1 Ie,1(e) + q2 Ie,2(e). (2.2)
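A short sketch of the conditional weights qk and of Eq. (2.2) is given below. The function names and the numerical inputs in the usage example are illustrative only and do not come from the measurements.

```python
def playout_weights(p_s1: float, p_s2: float) -> tuple:
    """Return (q1, q2) with qk = P(Sk) / (P(S1) + P(S2))."""
    total = p_s1 + p_s2
    if total <= 0.0:
        raise ValueError("the packet is never restorable: P(S1) + P(S2) must be positive")
    return p_s1 / total, p_s2 / total


def average_equipment_impairment(p_s1: float, p_s2: float,
                                 ie_1: float, ie_2: float) -> float:
    """Eq. (2.2): Ie = q1 * Ie,1 + q2 * Ie,2, both evaluated at the same e."""
    q1, q2 = playout_weights(p_s1, p_s2)
    return q1 * ie_1 + q2 * ie_2


if __name__ == "__main__":
    # Illustrative values: P(S1) = 0.26, P(S2) = 0.72, Ie,1 = 50, Ie,2 = 20.
    print(playout_weights(0.26, 0.72))
    print(average_equipment_impairment(0.26, 0.72, 50.0, 20.0))
```

Because q1 + q2 = 1, the average always lies between Ie,1 and Ie,2 and moves toward Ie,1 as single-description playout becomes more frequent.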

The next issue to be addressed is how to derive an equipment impairment Ie,k corresponding to each playout scenario Sk. We followed the work of [12], which describes an objective method for deriving a regression model of Ie,k using the PESQ algorithm [25]. As shown in Fig. 2.3, each single measurement consists of three steps and is repeated several times with different transmission configurations. First, a speech sample is selected from an English speech database that contains 16 sentential utterances spoken by eight males and eight females. Each sample has a duration of 8 seconds and is sampled at 8 kHz. Second, the speech sample is encoded using the MD-G.729 codec and then processed in accordance with the simulated loss model to generate the degraded speech. In our experiments, the decoder deals with packet erasure by using the error concealment algorithm of G.729 [9] to conceal erased packets, while in other scenarios speech packets are reconstructed depending on how many descriptions are received by the playout deadline. Third, the reference speech and the degraded speech are processed by PESQ to obtain a mean opinion score (MOS). For each speech sample, a MOS value for one packet-erasure rate is obtained by averaging over 30 different erasure locations in order to remove the influence of erasure location. Further, these MOS values are averaged over all speech samples and then converted to a rating R to give an equipment impairment value Ie,k = 94.2 − R. The R-factor can be obtained from the average MOS with the following conversion formula:

R = 3.026 MOS^3 − 25.314 MOS^2 + 87.06 MOS − 57.336. (2.3)
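The averaging and conversion steps can be sketched as follows, assuming the PESQ MOS values have already been computed and are supplied as one list of trial scores per speech sample. The function names and the input layout are hypothetical; the PESQ computation itself is not reproduced here.

```python
from statistics import mean


def mos_to_r(mos: float) -> float:
    """Eq. (2.3): cubic conversion from average MOS to the rating R."""
    return 3.026 * mos**3 - 25.314 * mos**2 + 87.06 * mos - 57.336


def equipment_impairment_from_mos(mos_per_sample) -> float:
    """Average MOS over erasure locations, then over samples; return 94.2 - R.

    mos_per_sample: one inner list per speech sample, holding the MOS values
    obtained for the different erasure locations at one packet-erasure rate.
    """
    per_sample = [mean(trials) for trials in mos_per_sample]
    average_mos = mean(per_sample)
    return 94.2 - mos_to_r(average_mos)


if __name__ == "__main__":
    # Made-up scores for two samples with three erasure locations each.
    scores = [[3.1, 3.0, 2.9], [3.2, 3.0, 2.8]]
    print(equipment_impairment_from_mos(scores))
```

Note that the conversion is applied once to the averaged MOS, following the order described in the text, rather than to each MOS value separately.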

Fig. 2.4 shows the impact of transmission scenario Sk and packet-erasure rate e on the equipment impairment Ie,k with a packetization of one frame per packet. The Ie,k value for zero packet-erasure rate represents the codec impairment itself. It is evident that the speech playout resulting from S2 has a lower codec impairment and a high robustness to packet loss. By inspecting Fig. 2.4, we observe that our measured Ie,2 value for zero packet erasure, 21.96, is inconsistent with the ITU-published Ie value, 10, for codec G.729 [9]. One possible reason for this discrepancy may lie in the codec algorithm. As G.729 is a CELP-based codec, its use of a linear predictive model of speech production can lead to variations in codec performance with different talkers or languages [26]. Support for this speculation can be found in at least two studies using the same codec [12][27], which, in the case of zero packet loss and using different speech samples from the ITU-T data set [28], reported measured Ie values of 21.14 and 17.128, respectively, similar to the value obtained in this study. From the curves, a nonlinear regression model can be derived for each Ie,k by least-squares data fitting. The fitted curves are also shown in Fig. 2.4. The derived Ie,k model for scenario Sk has the following form: Ie,k(e) = γ1,k + γ2,k ln(1 + γ3,k e), where e is the packet-erasure rate in percent. The fitted regression parameters (γ1,k, γ2,k, γ3,k) are (52.61, 7.52, 10) for S1 and (21.96, 17.02, 16.09) for S2.
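The fitted regression model can be evaluated with a few lines of Python. The sketch below uses the parameter values reported above; the dictionary and function names are illustrative.

```python
import math

# Fitted (gamma1, gamma2, gamma3) per playout scenario, as reported in the text.
GAMMA = {
    1: (52.61, 7.52, 10.0),    # S1: one description played out
    2: (21.96, 17.02, 16.09),  # S2: two descriptions played out
}


def ie_k(k: int, e_percent: float) -> float:
    """Ie,k(e) = gamma1 + gamma2 * ln(1 + gamma3 * e), with e in percent."""
    if e_percent < 0.0:
        raise ValueError("packet-erasure rate must be non-negative")
    g1, g2, g3 = GAMMA[k]
    return g1 + g2 * math.log(1.0 + g3 * e_percent)


if __name__ == "__main__":
    for e in (0.0, 1.0, 5.0):
        print(e, round(ie_k(1, e), 2), round(ie_k(2, e), 2))
```

At e = 0 the logarithm vanishes, so each model reduces to γ1,k, the codec impairment of the corresponding scenario.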