QoS Control for Multi-Stream Voice over Mobile IP Networks

Student: Chun-Feng Wu

Advisor: Dr. Wen-Whei Chang

A Dissertation

Submitted to Institute of Communication Engineering

College of Electrical and Computer Engineering

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy

in

Communication Engineering

Hsinchu, Taiwan

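Abstract (translated from the Chinese)

The quality of service of wireless communication depends on many factors, including packet loss, delay, background noise, and speech coding distortion. This dissertation investigates quality management for wireless communication systems based on multiple description coding theory. It adopts a multiple description transmission system, which uses path diversity to increase the robustness of the transmission system and exploits the correlation between descriptions to design the error concealment mechanism. For the concealment of transmission bit errors, previous studies, weighing robustness against fast implementation, developed iterative source-channel decoding algorithms based on turbo-code theory, whose key components are a soft-output channel decoder and a soft-bit source decoder. The problem is that the commonly used bit-level channel decoding algorithm has limitations: it cannot effectively incorporate the correlation between adjacent indexes into the source a priori information, and it is not fully compatible with source decoding algorithms derived at the index level. To address these issues, this dissertation focuses on index-level iterative source-channel decoding. We first develop an index-level BCJR channel decoding algorithm that effectively integrates the source a priori information into its soft-output decoding process. By further cross-using the correlated information carried by the multiple descriptions, the a posteriori probabilities of the transmitted index values are accurately estimated, and the optimal decoder output for multiple description vector quantization is obtained under the minimum mean-square error criterion. Another important issue is the design of the playout buffer at the receiver. The system design should consider the best combination of the key components as a whole and adapt to time-varying network transmission characteristics. First, we extend the E-model, defined by the International Telecommunication Union (ITU) for single-path transmission systems, to develop a new voice quality metric that is broadly applicable to the planning of multiple description transmission systems. Unlike previous studies, which design the playout buffer and forward error control separately, this work proposes an adaptive joint control algorithm based on voice quality optimization. With the new voice quality metric, the design of a multiple description transmission system becomes a problem of minimizing voice quality impairment, in which forward error control and playout scheduling are adjusted according to network dynamics to reach the best balance between delay and packet loss.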

ABSTRACT

Packet loss and network delay are two essential problems for real-time voice communication over mobile IP networks. The purpose of this dissertation is to develop a multi-stream voice communication system with quality of service (QoS) control for increased channel robustness. The first part focuses on the error concealment of packet erasures as well as channel bit errors. The basic strategy is a multiple description scalar quantization (MDSQ) system, in which multiple correlated indexes of the source are assigned and transmitted over channels to take advantage of largely uncorrelated loss and delay characteristics. We propose the use of the turbo principle to develop a symbol-based iterative source-channel decoding algorithm for better decoding of multiple descriptions over a noisy channel. We first modify the BCJR algorithm based on a sectionalized trellis so that symbol a posteriori probabilities can be derived and used as the extrinsic information to improve the iterative decoding between the source and channel decoders. The residual source redundancies are exploited as a priori information, and joint source decoding is formulated as a maximum a posteriori estimation problem. We also formulate a recursive implementation for the source decoder that processes reliability information received on different channels and combines it with inter-description correlation to estimate the transmitted quantizer indexes. Another important issue is the design of the playout buffer, which is used at the receiver to smooth out the jitter. As a further step toward perceptual optimization, the error concealing capabilities of multiple description coding can be improved by including a forward error control (FEC) mechanism. We present an objective model for multi-stream voice quality prediction. Based on the new prediction model, we propose the use of minimum overall impairment as a perceptually motivated optimization criterion for joint playout buffer and FEC control. Joint playout and FEC adjustment is then formulated as an optimization problem leading to a better balance between end-to-end delay and packet loss.

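Acknowledgments

Looking back on my doctoral studies, I would first like to thank my advisor, Professor Wen-Whei Chang, for his attentive guidance, which helped me grasp research techniques and priorities effectively; I hope in turn to contribute a little more to the field of telecommunications. I also thank him for the many suggestions and the encouragement he offered in daily life. I am very grateful to National Chiao Tung University for providing an excellent academic environment in which students can grow in both technological and humanistic literacy. I especially thank Professor 江源泉 of National Hsinchu University of Education for his assistance and guidance with the research. During my time in the Speech Communication Laboratory, my senior labmate 李承龍 and my classmates 傅泰魁, 曹正宏, 林宜德, 許忠安, 蔡知鑑, 何依信, 顏廣儀, 曾啟翔, 潘彥璋, 張永樂, 陳亞民, 戴玲玲, 葉葉誠, and 吳鴻材 supported and encouraged one another. I would also like to pay my respects to all members of the 強大隊 team, who spent holidays eating, drinking, and having fun with me and made my doctoral life colorful. I thank 顏雅慧 for accompanying me all the way. Finally, I especially thank my parents, my aunt, and my younger brother, who are my material and spiritual source of energy. Thanks to all of them, I was able to complete this doctoral dissertation.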

List of Tables

2.1 $I_e$ comparison for different prediction models.

3.1 MOS comparison for different playout algorithms.

3.2 Basic Notation.

List of Figures

1.1 Block diagram of MD voice transmission system.

1.2 Block diagram of multiple description coding system.

2.1 A two-channel VoIP simulation system.

2.2 A multi-hop transmission model for network simulations.

2.3 Schematic diagram for prediction of $I_e$ model.

2.4 $I_{e,k}$ vs. packet erasure rate $e$.

3.1 Performance comparison for different playout algorithms.

3.2 A multi-description voice transmission system.

3.3 Performance comparison for different playout algorithms.

4.1 Block diagram of a two-channel MD communication system.

4.2 MD-ISCD scheme for the concatenation of MDSQ and convolutional codes.

4.3 Bit-level and merged trellis diagrams.

4.4 MD-ISCD3 performance for Gauss-Markov sources with ρ = 0.95 and (M, R) = (5, 3).

4.5 SNR performance of different decoders for (M, R) = (4, 3) and Gauss-Markov sources (ρ = 0.8, 0.95).

4.6 SNR performance of different decoders for (M, R) = (5, 3) and Gauss-Markov sources (ρ = 0.8, 0.95).

5.2 VLC trellis representation for T = 4, N = 10 and C = {c(0) = 11, c(1) = 00, c(2) = 101, c(3) = 010}.

5.3 Bit-level and merged trellis for C = {c(0) = 11, c(1) = 00, c(2) = 101, c(3) = 010}.

5.4 Three-dimensional trellis diagram for ν = 2, N = 5 and T = 2.

Chapter 1 Introduction

Quality of Service (QoS) has been one of the major concerns in the context of real-time multimedia communication over unreliable IP networks. Interactive real-time applications such as telephony and audio/video conferencing impose stringent constraints on packet loss and end-to-end delay. When the packet loss rate exceeds 10% and the one-way delay exceeds 150 ms, the perceived conversational speech quality can be quite poor. There has been much interest in the use of packet-level forward error correction (FEC) [1] to mitigate the impact of packet losses. Most current FEC mechanisms send additional information along with the media stream so that the lost data can be recovered in part from the redundant information. In FEC schemes, however, loss recovery is performed at the cost of increased end-to-end delay. Multiple description (MD) coding [2]-[4] is another method to gain robustness by taking advantage of the largely uncorrelated loss and delay characteristics on different network paths. In MD coding, multiple descriptions of the speech are created in such a way that each description can be individually decoded for a reduced quality reconstruction, but if all descriptions are available, they can be jointly decoded for a better quality reconstruction. With multiple voice streams, the network delay experienced may vary with each packet depending on the paths taken by different streams and on the level of congestion along the path. The variation in network delay, referred to as jitter, must be smoothed out since it obstructs the proper and timely reconstruction of the speech signal at the receiver end. The most common approach is to store recently arrived packets in a buffer before playing them out at scheduled intervals. By increasing the buffer size, the late loss rate is reduced, but the resulting improvement in voice transmission is offset by the accompanying increase in the end-to-end delay.

Figure 1.1: Block diagram of MD voice transmission system.

This dissertation focuses on two important issues in the MD voice transmission system shown in Figure 1.1: (1) From the viewpoint of QoS, we develop a multi-stream playout scheduling technique to improve the delay-loss tradeoff as well as the speech reconstruction quality. (2) We also consider an iterative source-channel decoding algorithm to increase the error robustness of the MD transmission system. In Section 1.1, the MD transmission system and some MD coding techniques are first reviewed. In Section 1.2, some commonly used playout scheduling schemes are reviewed. The concept of iterative source-channel decoding is introduced in Section 1.3, and finally Section 1.4 outlines the structure of the dissertation.

1.1 Multiple Description Coding

MD coding [5] is a method of representing a source with multiple correlated descriptions such that any subset of the descriptions can be used to decode the source with a fidelity that increases with the number of received descriptions. The output symbols of an MD encoder exhibit considerable residual redundancy in terms of both the nonuniformity of their distribution and their dependencies. This redundancy is due to the nonoptimality of the practically designed source encoder in the presence of complexity and delay constraints, or arises from path diversity as a result of MD coding. The ability to exploit path diversity and source residual redundancy for error robustness makes MD coding an attractive option for multimedia transmission over unreliable IP networks.

In multimedia communication, MD coding has been applied to efficient compression of voice and image/video signals. For example, Ingle [6] proposed to separate speech samples into odd and even samples for DPCM encoding. Jiang and Ortega [7] proposed a method which quantizes the even samples in PCM (8 bits/sample), encodes the difference between even and odd samples in ADPCM (2 bits/sample), and then packetizes them into stream 1. Similarly, the odd samples are quantized in PCM (8 bits/sample) and the difference between odd and even samples is encoded in ADPCM (2 bits/sample), and both are packetized into stream 2. In [4], Gibson proposed two MD-based speech coding approaches, denoted by MD-AMR and MD-G.729, which are extensions of the AMR-WB codec [8] and the G.729 codec [9], respectively. These two MD coders are designed to create balanced descriptions in such a way that a lost description can be recovered through interpolation from the received description.
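The odd/even sample-splitting idea behind these schemes can be sketched as follows. This is a minimal illustration only: it omits the DPCM, PCM, and ADPCM quantization stages of [6] and [7], and the function names `split_descriptions` and `reconstruct` are hypothetical.

```python
def split_descriptions(samples):
    """Split a sample sequence into even-indexed and odd-indexed descriptions."""
    return samples[0::2], samples[1::2]


def reconstruct(even, odd, length):
    """Rebuild the signal from whichever descriptions arrived.

    A lost description is passed as None; its samples are estimated by
    averaging the neighbouring samples of the received description.
    """
    if even is None and odd is None:
        raise ValueError("both descriptions lost")
    out = [0.0] * length
    if even is not None and odd is not None:
        out[0::2] = even
        out[1::2] = odd
        return out
    received, start = (even, 0) if even is not None else (odd, 1)
    out[start::2] = received
    for n in range(1 - start, length, 2):  # positions of the lost samples
        neighbours = [out[m] for m in (n - 1, n + 1) if 0 <= m < length]
        out[n] = sum(neighbours) / len(neighbours) if neighbours else 0.0
    return out
```

With both descriptions the signal is restored exactly; with one description the missing samples are only interpolated, which mirrors the reduced-quality side reconstruction described above.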

The block diagram of an MD coding system is shown in Figure 1.2. The system has two major components: the MD encoder and the MD decoder. The MD encoder [1][10] splits source samples into two descriptions by using a scalar quantizer (SQ) followed by index assignment. The index assignment can be represented by a mapping of each reproduction level of the SQ to a unique element in an index assignment matrix. The choice of the index assignment matrix determines the correlation between the descriptions and is the key to realizing an MDSQ. Design algorithms for good index assignments are presented in [10]. In these, the inter-description correlation is controlled by choosing the number of diagonals covered by the index assignment. The MDSQ has been extensively studied for noiseless channels with packet loss, assuming that there exist multiple independent channels that either provide error-free transmission or experience packet erasure. In many practical situations, however, multiple descriptions of the source signals are transmitted over channels that are subject to noise as well as packet loss.

Figure 1.2: Block diagram of multiple description coding system.
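To make the notion of an index assignment matrix concrete, the sketch below fills the main diagonal and the first sub-diagonal of an M x M matrix. It is an illustrative two-diagonal assignment chosen for simplicity, not the design algorithm of [10], and the function names are hypothetical.

```python
def staircase_index_assignment(num_side_levels):
    """Map each central index n to a pair (i, j) of side indexes.

    Cells are filled along the main diagonal and the first sub-diagonal
    of an M x M matrix, giving 2M - 1 central levels.
    """
    return {n: ((n + 1) // 2, n // 2) for n in range(2 * num_side_levels - 1)}


def side_cells(assignment, description, side_index):
    """Central indexes compatible with one received side index.

    description is 0 for the first description and 1 for the second.
    """
    return sorted(n for n, pair in assignment.items()
                  if pair[description] == side_index)
```

When only one description arrives, the side decoder knows that the central index lies in a small set of adjacent cells; covering more diagonals would enlarge that set and lower the inter-description correlation.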

1.2 Joint Playout and FEC Control

Packet loss and delay are the major network impairments for transporting real-time voice over IP networks. The network delay experienced may vary for each packet depending on the level of congestion along the path. The variation in network delay, referred to as jitter, must be smoothed out since it obstructs the proper and timely reconstruction of the speech signal at the receiver end. The most common approach is to store recently arrived packets in a buffer before playing them out at scheduled intervals. By increasing the buffer size, the late loss rate is reduced, but the resulting improvement in voice transmission is offset by the accompanying increase in the end-to-end delay. In balancing the impairments due to delay and packet loss, two current coding strategies, single and multiple description transmissions, have used different playout buffer algorithms. In single description (SD) coding, a number of adaptive playout buffer algorithms have been proposed that react to changing network conditions by dynamically adjusting the playout delay. Most of them work by taking measurements of the network delays and either compressing or expanding silent periods between consecutive talkspurts. Although there are methods which focus on the delay-loss performance [11], better algorithms have been proposed along with voice quality prediction models for perceptual optimization of the playout buffer [12][13]. Taking a different approach, MD coding [2][3][4] exploits packet path diversity such that each description can be individually decoded for a reduced quality reconstruction, but if all descriptions are available, they can be jointly decoded for a better quality reconstruction. For multi-stream voice transmission, Liang et al. [3] proposed an algorithm which uses a Lagrangian cost function to trade delay versus loss by following a play-first strategy; that is, it plays out early-arriving descriptions while discarding the later ones. Such a design was based on the assumption that human perceptual experience is more strongly impaired by high latency than by packet loss. They neither consider the quality degradation due to frequent switching among playout scenarios nor try to optimize the perceived speech quality by way of a prediction model.

Packet loss in MD voice transmission is a result of not only network loss, but also late loss, which greatly impairs communication quality. Due to the stringent delay budget and the need to output speech continuously, packets experiencing sudden high delay have to be discarded at the receiver end if they arrive later than the scheduled playout deadline. There has been much interest in the use of packet-level forward error control (FEC) to mitigate the impact of packet losses [14]. Most current FEC mechanisms send additional information along with the media stream so that the lost data can be recovered in part from the redundant information. In many applications, however, the losses of successive packets are correlated and a packet loss may be followed by a burst of packet losses, which significantly decreases the efficiency of FEC. Furthermore, the loss recovery of FEC is performed at the cost of increased end-to-end delay. This has motivated our investigation into exploiting the largely uncorrelated characteristics of packet loss and delay variation on multiple network paths using a joint control of MD and FEC. With an MD scheme coded with FEC, we have more freedom to trade off delay, late loss, and speech reconstruction quality.

Traditionally, the studies of FEC for loss recovery and of playout buffer adaptation for jitter compensation have proceeded independently. Most packet-level FEC mechanisms send some redundant information along with the media stream so that the lost data can be recovered in part from the redundant information embedded in the later arriving packets. In waiting for the arrival of a minimum required number of packets at the receiving end, loss recovery is performed at the cost of increased end-to-end delay. In view of this potential limitation and the coupling between FEC and playout buffer adaptation [15][16], there is a need to develop a joint FEC and playout control scheme such that the additional delay due to FEC application is dealt with within the same optimization framework as for regular MD schemes. Previous efforts toward linking FEC with the playout buffer for single-stream transmission can be found in [16], but the assumption on which their algorithm was based may limit its applicability. Specifically, it was assumed that the single-stream network over which the voice packets are sent delivers packets in sequence, and thus if a given packet arrives after its playout time, then all the following packets will also arrive after the playout time of the given packet. This line of reasoning has been challenged by a number of related studies [17] that addressed the possibility of packets being delivered out of sequence because of network jitter. As such, the joint FEC and playout control scheme proposed in this work will ignore the constraints imposed by the no-reordering assumption made in [16].

The concept of perceptual optimization is usually realized through the use of the E-model [18] to predict the conversational speech quality. However, the E-model does not consider the dynamics of transmission impairments because it relies on static transmission parameters such as average packet loss and average end-to-end delay. Thus, the E-model may make invalid predictions in dealing with the overall quality issues that MD transmission is focused on. For example, the E-model may only suit single-path transmission with two conceivable playout scenarios; i.e., total loss vs. no loss of packets. A third scenario, partial loss, however, arises with MD transmission. That is, with multiple streams sent along two paths, if packets from one path experience erasure or excessive delay, packets from the other path can often be used to conceal the lost packets. Although the partial loss is concealed, the resulting degradation in playout quality may not be. In dealing with such a reconstruction scheme, the E-model is expected to show two limitations. First, it may fail to register impairments due to reconstruction based on information from a single path as opposed to from both paths, when no packets from either path are lost. Moreover, the resulting detrimental effects that accompany the change in the playout scenarios may thus be ignored and harm its prediction of the overall quality.

In this work, we propose a new objective method for predicting the perceived quality of multi-stream voice transmission. In addition to delay and packet loss, the model also takes into account the quality impairments due to frequent switching of playout scenarios. Based on the new model, we then propose the use of minimum overall impairment as a criterion for perceptual optimization of joint playout buffer and FEC adjustment.

1.3 Iterative Source-Channel Decoding

For MD communication over noisy channels with packet loss, a channel encoder may be used on each description to deal with random bit errors. When the MDSQ is concatenated with convolutional codes, iterative source-channel decoding (ISCD) [19][20] inspired by the turbo principle has been shown to be effective by using the source residual redundancy, assisted by the reliability information provided by the soft-output channel decoder. In the so-called MD-ISCD schemes [21][22], source residual redundancy and channel-code redundancy are exploited alternately by exchanging extrinsic information between the constituent decoders. An iterative decoder consisting of two maximum a posteriori probability (MAP) detectors is proposed in [21] for joint decoding of MDSQ and convolutional codes. In [22], a cross decoding strategy was presented that exploits not only the reliability information of every bit in one description but also the extrinsic information from the other description according to the chosen index assignment. In the decoding procedure, MAP detectors operating on soft channel outputs were used for each of the two descriptions in such a way that the output of one MAP detector is combined with inter-description correlation to compute the a priori information for the other detector.

The MD-ISCD schemes in previous works [21][22] are expected to show two limitations. Firstly, as the source decoder uses two separate MAP detectors with each detector operating on one description, it may report invalid codeword combinations corresponding to the empty cells of the index assignment matrix. In dealing with such situations, an invalid codeword combination is treated as an uncorrectable error and the mean of the source is reconstructed. Secondly, the major part of the iterative decoding process runs on bit level, but the source decoder itself is realized on symbol level. This is in part due to the fact that binary convolutional codes are commonly used, so the soft-output channel decoding can be implemented efficiently by the BCJR algorithm [14][23]. As a consequence, only bitwise source a priori knowledge can be exploited by the channel decoder, since the BCJR algorithm is derived based on a bit-level code trellis. In practice, this requires symbol-to-bit and bit-to-symbol probability conversion each time extrinsic information is passed between the source and channel decoders. This processing step destroys the bit correlations within a symbol, thus reducing the effectiveness of iterative decoding.
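The loss of intra-symbol bit correlation can be illustrated with a small numeric sketch. The two functions below are hypothetical helpers, not part of any cited decoder: the first reduces a symbol distribution to bit marginals, and the second rebuilds symbol probabilities under the independence assumption that a bit-level interface implies.

```python
from itertools import product


def bit_marginals(symbol_probs, bits_per_symbol):
    """P(bit_k = 1) for each bit position, from a symbol distribution.

    symbol_probs maps bit tuples such as (0, 1) to probabilities.
    """
    return [sum(p for sym, p in symbol_probs.items() if sym[k] == 1)
            for k in range(bits_per_symbol)]


def symbols_from_bits(p_one):
    """Rebuild symbol probabilities assuming the bits are independent."""
    rebuilt = {}
    for sym in product((0, 1), repeat=len(p_one)):
        prob = 1.0
        for bit, p in zip(sym, p_one):
            prob *= p if bit == 1 else 1.0 - p
        rebuilt[sym] = prob
    return rebuilt
```

For a two-bit symbol that is either 00 or 11 with equal probability, both bit marginals equal 0.5, so the rebuilt distribution assigns 0.25 to each of the four symbols, including 01 and 10, which the original distribution ruled out.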

Recognizing this, we will focus on symbol-based trellis decoding algorithms throughout this dissertation, since they allow the whole symbol extrinsic information to be exchanged between the source and channel decoders. The first step toward realization is to use sectionalized code trellises rather than bit-level trellises as the basis for soft-output channel decoding of binary convolutional codes. Performance is further improved by using a joint MAP source decoder that processes reliability information received on different channels and combines it with inter-description correlation to provide a better estimate of the transmitted quantizer index.
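A much simplified, non-iterative sketch of joint decoding over the valid cells of an index assignment is given below. It assumes a memoryless source, conditionally independent channels, and strictly positive likelihoods; the function names and the dictionary-based interface are hypothetical and do not reproduce the recursive decoder developed later in the dissertation.

```python
def joint_posterior(assignment, prior, like1, like2):
    """Posterior over central indexes given per-description likelihoods.

    assignment: central index -> (i, j) side-index pair (valid cells only)
    prior:      central index -> a priori probability
    like1/2:    side index -> channel likelihood for description 1 / 2
    """
    weights = {n: prior[n] * like1[i] * like2[j]
               for n, (i, j) in assignment.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def mmse_estimate(posterior, levels):
    """Posterior-mean reconstruction from the central reproduction levels."""
    return sum(p * levels[n] for n, p in posterior.items())
```

Because the posterior is computed only over cells that exist in the assignment, an invalid pair of side indexes can never be selected, in contrast to running two separate detectors and combining their hard decisions.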

1.4 Dissertation Organization

The rest of this dissertation is organized as follows. Chapter 2 introduces some multiple description coding schemes. The source is encoded into multiple redundant descriptions that are separately transmitted over independent network paths. Also proposed is a multi-stream voice quality prediction model. In Chapter 3, we propose the use of minimum overall impairment as a criterion for perceptual optimization of joint playout buffer and FEC adjustment. When the MDSQ is concatenated with channel codes, the concept of extrinsic information from turbo decoding can be adopted for MD iterative source-channel decoding (MD-ISCD) [21]-[22]. Unlike previous works, which focused on bit-level ISCD, we present in Chapter 4 a symbol-based iterative decoding of convolutionally encoded multiple descriptions. Finally, Chapter 5 summarizes this dissertation and outlines some directions for future research.

Chapter 2 Multi-Stream Transmission System and Quality Prediction Model

MD coding is a technique that generates two or more descriptions, which are sent separately over multiple independent channels. When two or more descriptions are received at the receiver, they can be decoded for acceptable quality reconstruction of the source. A number of MD coding techniques have been proposed for voice communication over mobile IP networks. In this chapter, the MD-G.729 based speech packetization scheme described in [4] is considered for the development of joint playout and FEC control. This chapter also presents a new objective method for predicting the perceived quality of multi-stream voice transmission.

2.1 Multi-Stream Voice Transmission over a Packet-Erasure Channel

A block diagram of the proposed multi-stream VoIP simulation system is shown in Figure 2.1. The system has four major components: MD speech coder, Internet traffic simulator, delay distribution modelling, and adaptive playout buffer. The implementation procedure consisted of description generation and description transmission over two independent network paths. For description generation, the MD-G.729 based speech packetization scheme described in [4] was used to generate two descriptions from the bitstream of the ITU-T G.729 codec [9]. G.729 is a conjugate-structure algebraic code-excited linear prediction (CS-ACELP) codec for encoding narrowband speech at the rate of 8 kbps. It operates on 10-ms speech frames; each speech frame is divided into two subframes, and all the parameters except the LPC coefficients are determined once per subframe. The MD-G.729 coder is designed to create two balanced descriptions; i.e., each description is of equal rate 4.6 kbps and speech decoded from either description is of similar quality. During description transmission, the best-effort nature of IP networks results in packets experiencing varying amounts of delay and loss due to different levels of network congestion. To characterize this, we used the ns-2 network simulator [14] to generate traces of VoIP traffic for different network topologies and varying network load. The traces were also extended for varying link loss rates, with values ranging from 0 to 30% used to simulate losses of different degrees of severity. Figure 2.2 shows a two-path multi-hop network topology for our simulation, with transmission control protocol (TCP) data traffic on both paths contending simultaneously for network resources. The three nodes situated between source and destination on each path (N1 through N3 on the top path and N4 through N6 on the bottom) represent the data access points, each with a number of data sources attached, thus channelling in a large amount of incoming TCP traffic heading for different destinations. On each path a constant bit rate (CBR) voice stream is transmitted in 10-ms UDP packets at a rate of 4.6 kbps. The running time for each simulation is 15 seconds.

At the receiver, a playout buffer is employed to improve the tradeoff among delay, late loss rate, and speech reconstruction quality. We focused on adaptive algorithms which adjust the playout buffer at the beginning of each talkspurt; subsequent packets of that talkspurt are played out at the generation rate of the sender. Scheduling the playout of multiple voice streams is formulated as an optimization problem on the basis of a minimum overall impairment criterion. In addition to packet loss and delay, it takes into account the dynamics of transmission impairments due to frequent switching of playout scenarios. To proceed with this, it is a prerequisite to establish a delay distribution model, as it provides a direct link to the late loss rate in the presence of jitter. Previous work in [13] has found that the delay characteristics of VoIP traffic can be represented by statistical models which follow Pareto, Normal, and Exponential distributions depending on applications. Finally, the MD-G.729 bit stream is decoded to generate the degraded speech. In our experiments, the decoder deals with the loss of two descriptions by using the error concealment algorithm of G.729 [9], while in other situations speech packets are reconstructed depending on how many descriptions are received by the playout deadline. If both descriptions are received, the central decoder performs the standard G.729 decoding process after combining the two descriptions into one bitstream. If only one description is lost, the side decoder substitutes the missing information by using received parameters from the other description or information from the most recent correctly received frame [4].

Figure 2.1: A two-channel VoIP simulation system. (Blocks: voice input, MD-G729 speech encoder, network simulator, delay distribution modelling, playout buffer, MD-G729 speech decoder, voice output.)

Figure 2.2: A multi-hop transmission model for network simulations. (Labels: S, D, nodes N1 to N6, path 1, path 2, CBR source, CBR sink, TCP sources.)
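The link between a delay distribution model and the late loss rate can be sketched for the Pareto case, whose tail probability is P(D > d) = (k/d)^α for d ≥ k. The function names and the numeric parameters in the usage note are hypothetical and are not fitted to the traces used in this work; network erasures are not included.

```python
def pareto_late_loss(playout_delay_ms, scale_ms, shape):
    """P(network delay > playout delay) under a Pareto delay model."""
    if playout_delay_ms <= scale_ms:
        return 1.0
    return (scale_ms / playout_delay_ms) ** shape


def playout_delay_for_target(target_late_loss, scale_ms, shape):
    """Smallest playout delay whose late-loss rate meets the target.

    Expects 0 < target_late_loss <= 1.
    """
    return scale_ms / target_late_loss ** (1.0 / shape)
```

For instance, with a scale of 50 ms and a shape of 2, a playout delay of 100 ms corresponds to a late loss rate of 0.25 under this model; a longer playout delay lowers the late loss at the price of a larger end-to-end delay.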

2.2 Multi-Stream Voice Quality Prediction Model

In Section 1.2 we stated two limitations of the E-model in predicting the conversational speech quality in the third (partial-loss) scenario. First, it may fail to register impairments due to reconstruction based on information from a single path as opposed to from both paths, when no packets from either path are lost. Moreover, the resulting detrimental effects that accompany the change in the playout scenarios may thus be ignored and harm its prediction of the overall quality. Recognizing this, we propose a new objective method for predicting the perceived quality of multi-stream voice transmission. In addition to delay and packet loss, the model also takes into account the quality impairments due to frequent switching of playout scenarios.

Conceptually, the proposed model follows the commonly used ITU E-model [18] in defining the factors that affect the perceptual quality of MD voice transmission. As an analytical model of conversational speech quality used for network planning purposes, the E-model combines individual impairments due to the signal's properties and the network characteristics into a single R-factor, ranging from 0 to 100. In VoIP applications [24], the R-factor may be simplified as $R = 94.2 - I_d - I_e$, where $I_d$ represents the delay impairment and $I_e$, known as the equipment impairment, accounts for impairments due to speech coding and packet loss. The delay impairment can be derived by a simplified fitting process in [24] with the following form

$$I_d = 0.024\,d + 0.11\,(d - 177.3)\,H(d - 177.3) \qquad (2.1)$$

where $d$ is the end-to-end delay in milliseconds and $H(x)$ is the step function. The E-model, originally proposed for single-stream transmission, is only applicable to a limited number of speech codecs and network conditions, since it requires time-consuming subjective tests to derive the $I_e$ model. With multiple voice streams, any subset can be used for signal reconstruction, and the transmission quality improves with the size of the subset. In addition to delay and packet loss, a good quality prediction model should take into account the impairments due to dynamic size allocations during the speech playout.
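The simplified R-factor can be evaluated with a few lines of code. The sketch assumes the widely used fitted delay-impairment form I_d = 0.024 d + 0.11 (d − 177.3) H(d − 177.3) with d in milliseconds; the function names are hypothetical and the equipment impairment is passed in as a given number.

```python
def delay_impairment(delay_ms):
    """I_d = 0.024 d + 0.11 (d - 177.3) H(d - 177.3), with d in ms."""
    excess = delay_ms - 177.3
    return 0.024 * delay_ms + (0.11 * excess if excess > 0 else 0.0)


def r_factor(delay_ms, equipment_impairment):
    """Simplified E-model rating R = 94.2 - I_d - I_e."""
    return 94.2 - delay_impairment(delay_ms) - equipment_impairment
```

Below 177.3 ms the delay impairment grows slowly and linearly; beyond that point the second term adds a steeper penalty, which is why playout delay cannot be increased freely to reduce late loss.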

For two-path transmission, each channel can either deliver or erase the transmitted description, so the two channels will always be in one of four possible states: no loss, loss in channel 1, loss in channel 2, and loss in both channels (packet erasure). Among them, only the speech resulting from the packet-erasure state is not affected by playout buffer operations. The receiver deals with the loss of both descriptions by using the error concealment algorithm of the G.729 codec to conceal the erased packet. If, additionally, speech decoded from either MD-G.729 description is assumed to be of similar quality, we only need to consider two kinds of playout scenarios at the receiver end. Specifically, a packet is 1) fully restored with two descriptions and thus played with high quality; or 2) partially restored with one description and thus played with degraded quality. For brevity, let $S_k$ denote the scenario in which $k$ descriptions are received before the playout time. Conditioned on the event that the packet can be restored, we let $q_k$ be the probability of playing out the packet using $k$ descriptions. Formally, it is given by $q_k = P(S_k)/(P(S_1) + P(S_2))$. It is important to notice that the quality degradations resulting from $S_1$ and $S_2$ are different perceptual experiences. For scenario $S_2$, the standard G.729 decoding process is carried out after combining the two descriptions into one bitstream. Let $I_{e,k}$ denote the equipment impairment as a result of playing out $k$ received descriptions. From the perceived QoS perspective, the MD-G.729 codec may be viewed as operating at two coding rates: 4.6 kbps for $S_1$ and 8 kbps for $S_2$. By weighting each scenario with its probability $q_k$, we express the equipment impairment due to MD-G.729 coding as follows:

$$I_e(e) = q_1 I_{e,1}(e) + q_2 I_{e,2}(e). \qquad (2.2)$$

Figure 2.3: Schematic diagram for prediction of $I_e$ model.
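The scenario weights and the mixture in Eq. (2.2) can be computed as sketched below. The function names are hypothetical, and the probabilities and impairment values used in the usage note are made-up numbers for illustration, not measurements from this work.

```python
def scenario_weights(p_s1, p_s2):
    """q_k = P(S_k) / (P(S_1) + P(S_2)) for k = 1, 2."""
    total = p_s1 + p_s2
    if total <= 0:
        raise ValueError("packet is never restorable")
    return p_s1 / total, p_s2 / total


def equipment_impairment(p_s1, p_s2, ie1, ie2):
    """I_e = q_1 I_{e,1} + q_2 I_{e,2}, as in Eq. (2.2)."""
    q1, q2 = scenario_weights(p_s1, p_s2)
    return q1 * ie1 + q2 * ie2
```

For example, with P(S_1) = 0.2 and P(S_2) = 0.6 the weights are 0.25 and 0.75, so impairments of 40 and 20 for the two scenarios would combine to an overall value of 25.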

The next issue to be addressed is how to derive an equipment impairment $I_{e,k}$ corresponding to each playout scenario $S_k$. We followed the work of [12], which describes an objective method for deriving an $I_{e,k}$ regression model using the PESQ algorithm [25]. As shown in Fig. 2.3, each single measurement consists of three steps and is repeated several times with different transmission configurations. First, a speech sample is selected from an English speech database that contains 16 sentential utterances spoken by eight males and eight females. Each sample has a duration of 8 seconds and is sampled at 8 kHz. Second, the speech sample is encoded using the MD-G.729 codec and then processed in accordance with the simulated loss model to generate the degraded speech. In our experiments, the decoder deals with packet erasure by using the error concealment algorithm of G.729 [9] to conceal erased packets, while in other scenarios speech packets are reconstructed depending on how many descriptions are received by the playout deadline. Third, the reference speech and degraded speech are processed by PESQ to obtain a mean opinion score (MOS). For each speech sample, the MOS value for a given packet-erasure rate is obtained by averaging over 30 different erasure locations in order to remove the influence of erasure location. Further, these MOS values are averaged over all speech samples and then converted to a rating $R$ to give an equipment impairment value $I_{e,k} = 94.2 - R$. The R-factor can be obtained from the average MOS with a conversion formula as follows:

$$R = 3.026\,\mathrm{MOS}^3 - 25.314\,\mathrm{MOS}^2 + 87.06\,\mathrm{MOS} - 57.336. \tag{2.3}$$
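The conversion chain from an averaged MOS to an equipment impairment can be sketched in a few lines of Python. The function names `mos_to_r` and `equipment_impairment` are hypothetical and only illustrate Eq. (2.3) together with the relation Ie,k = 94.2 − R; the MOS values in the loop are arbitrary sample inputs, not measurements from this study.

```python
def mos_to_r(mos: float) -> float:
    """Convert an averaged MOS to an R-factor with the cubic of Eq. (2.3)."""
    return 3.026 * mos**3 - 25.314 * mos**2 + 87.06 * mos - 57.336


def equipment_impairment(mos: float, r_ref: float = 94.2) -> float:
    """Equipment impairment as the gap between the reference rating and R."""
    return r_ref - mos_to_r(mos)


# Arbitrary sample MOS values, used only to show the direction of the mapping.
for mos in (2.5, 3.0, 3.5, 4.0):
    print(f"MOS={mos:.1f}  R={mos_to_r(mos):6.2f}  Ie={equipment_impairment(mos):6.2f}")
```

A higher MOS maps to a higher R and therefore to a smaller impairment, which is the direction the regression in the following paragraphs relies on.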

Fig. 2.4 shows the impact of transmission scenario $S_k$ and packet-erasure rate $e$ on the equipment impairment $I_{e,k}$ with a packetization of one frame per packet. The $I_{e,k}$ value for zero packet-erasure rate represents the codec impairment itself. The speech playout resulting from $S_2$ clearly has a lower codec impairment and a high robustness to packet loss. By inspecting Figure 2.4, we observe that our measured $I_{e,2}$ value for zero packet erasure, 21.96, is inconsistent with the ITU-published $I_e$ value, 10, for codec G.729 [9]. One possible reason for this discrepancy may lie in the codec algorithm. As G.729 is a CELP-based codec, the use of a linear predictive model of speech production can lead to variations in codec performance with different talkers or languages [26]. Support for such a speculation can be found in at least two studies using the same codec [12][27], which, in the case of zero packet loss and using different speech samples from the ITU-T data set [28], rendered measured $I_e$ values of 21.14 and 17.128, respectively, similar to the value obtained for this study. From the curves, a nonlinear regression model can be derived for each $I_{e,k}$ by the least-squares data fitting method. The fitting curves are also shown in Figure 2.4. The derived $I_{e,k}$ model for scenario $S_k$ has the following form: $I_{e,k}(e) = \gamma_{1,k} + \gamma_{2,k}\ln(1 + \gamma_{3,k} e)$, where $e$ is the packet-erasure rate in percentage. Our findings indicate that the regression model parameters $(\gamma_{1,k}, \gamma_{2,k}, \gamma_{3,k})$ are (52.61, 7.52, 10) for $S_1$ and (21.96, 17.02, 16.09) for $S_2$.
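As a minimal sketch, the fitted regression model and the scenario-weighted impairment of Eq. (2.2) can be evaluated as follows. The names `GAMMA`, `ie_k` and `ie_mixture` are hypothetical. The unit of `e` is an assumption: the sketch passes `e` as a fraction (0.10 for 10%), because that keeps the fitted curves inside the 20 to 70 range of the vertical axis of Figure 2.4, whereas the text states the rate in percentage; rescale the argument if the parameters were fitted against percent values.

```python
import math

# Fitted (gamma1, gamma2, gamma3) reported in the text for scenarios S1 and S2.
GAMMA = {1: (52.61, 7.52, 10.0), 2: (21.96, 17.02, 16.09)}


def ie_k(e: float, k: int) -> float:
    """Regression model Ie,k(e) = g1 + g2 * ln(1 + g3 * e) for scenario Sk."""
    g1, g2, g3 = GAMMA[k]
    return g1 + g2 * math.log(1.0 + g3 * e)


def ie_mixture(e: float, q1: float) -> float:
    """Eq. (2.2) with q2 = 1 - q1: impairment averaged over both scenarios."""
    return q1 * ie_k(e, 1) + (1.0 - q1) * ie_k(e, 2)


# Illustrative inputs only: 5% erasure (as a fraction), 20% one-description playout.
print(round(ie_mixture(0.05, 0.20), 2))
```

At zero erasure the model returns the codec impairment itself (52.61 for S1 and 21.96 for S2), matching the intercepts quoted above.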

2.3 Experimental Results

A set of experimental conditions was designed to verify, using artificially degraded speech samples, the detrimental effects estimated by the proposed $I_e$ regression model. The proposed model and the traditional model, both including packet loss as a main impairment factor, differ in how reconstruction under partial packet loss is treated. The proposed model differentiates partial reconstruction with one description from full reconstruction with two descriptions. The three states of frame reconstruction dictated by the model are 1) fully restored, when both descriptions are available and thus played with high quality, 2) partially restored, when only one description is available and thus played with less than optimal quality, and 3) restored by the G.729 error concealment algorithm, when both descriptions are lost during transmission. In contrast, the traditional model treats the full and the partial reconstruction states uniformly as the no-loss state, without differentiating the processes that lead to no loss at the receiver end. It is thus reasonable to hypothesize that the traditional model fails to register any quality impairment due to partial reconstruction. As such, if the $I_e$ values estimated with the two models show significant differences in their closeness to the measured $I_e$ values, then adding such a differentiation scheme to the modelling process should prove a valid approach.

The speech samples considered here were one male and one female utterance. The G.729 speech codec and the proposed MD coding scheme were used sequentially, which turned each utterance into a bitstream of frames with two identical descriptions to be transmitted along separate dynamically-changing paths. At the receiver end, each utterance was artificially degraded to render two tokens, each with its own composition of frames of the three reconstruction states. Since the proposed model diverges from the traditional model by treating the loss of one packet as a separate state from either total loss or no loss, the underlying variable being manipulated in the frame composition was the rate $q_1$ of partial loss. Thus, there was a total of four test conditions.

Figure 2.4: $I_{e,k}$ vs. packet erasure rate $e$. (Axes: $I_e$ versus packet erasure rate (%), 0 to 30; curves: measured and fitted $I_{e,1}$ and $I_{e,2}$.)

Table 2.1: $I_e$ comparison for different prediction models.

Speech   e (%)   q1 (%)   Traditional Ie   Proposed Ie   Measured Ie
Female   9.88    6.48     43.16            44.35         44.54
Female   4.93    22       34.41            39.48         40.84
Male     4.84    14       31.78            34.97         36.57
Male     12      31       40.37            45.67         46.27

Table 2.1 lists for each condition the percentages of frames that are erased and restored with only one description, followed by the three corresponding $I_e$ values as estimated by the traditional model, as estimated by the proposed model, and as measured with PESQ and then converted. The results showed that, unlike the traditional model, which yielded poorer estimations for samples containing higher percentages of one-description loss, the proposed model gave estimations that are robust regardless of the sample frame composition. For example, as the percentage of partially restored frames increased from 6.48% to 22% in the female utterance and from 14% to 31% in the male utterance, the deviations of the traditional model from the measured $I_e$ values increased from 1.38 to 6.43 and from 4.79 to 5.90, respectively, whereas the proposed model showed stable and smaller deviations that ranged from 0.6 to 1.6. Taken together, these comparison data suggest that independent evaluation of impairments due to loss of one vs. both descriptions adds to the robustness of the proposed model.

Chapter 3 QoS Control for Multi-Stream Voice over IP Networks

In this chapter, we study QoS control for multi-stream voice over IP networks. The proposed MD system model was presented in Chapter 2. Section 3.1 presents a perceptual-based playout mechanism whose optimization criterion is based on the quality prediction model proposed in Section 2.2. As a further step toward perceptual optimization, a packet-level FEC channel encoder is added to the proposed MD system to strengthen its error concealment capability, and a joint playout and FEC control scheme is proposed in Section 3.2.

3.1 Perceptual-Based Playout Mechanism

Packet loss and delay are the major network impairments for transporting real-time voice over IP networks. In the proposed system, multiple descriptions of the speech are used to take advantage of packet path diversity. Our goal is to develop a multi-stream playout buffer algorithm, together with an adaptive parameter adjustment scheme, that maximizes the perceived speech quality via delay-loss trading. Experimental results showed that, compared to FEC-protected single-path transmission, the proposed multi-stream transmission scheme achieves significant reductions in delay and packet loss rates as well as improved speech quality.

3.1.1 Adaptive Playout Scheduling Algorithm

The main attraction of multi-stream transmission arises from its flexibility to trade off different sources of impairment against each other. Waiting for the arrival of both descriptions results in lower equipment impairment $I_e$, but at the cost of higher delay impairment $I_d$. On the other hand, playing out the voice description with lower delay avoids latency, but increases the equipment impairment. Since playout scheduling aims to improve the overall conversational speech quality, which depends on the balance between delay and packet loss, full reconstruction from both descriptions may not always be the priority if the overall impairment does not justify the extra delay from waiting. Given that, a playout buffer must be able to switch between different playout scenarios in order to maximize the benefits of packet path diversity. To accomplish this goal, the proposed voice quality prediction model is applied to adaptive control of the multi-stream playout buffer. Prior to the arrival of each packet $i$, the playout delay for that packet is determined according to the past recorded delays. The playout delay of packet $i$ is denoted by $d_{play,i}$, which is defined as the time from the moment that the packet is delivered to the network until it has to be played out. A packet is lost due to late arrival if its network delay is larger than the playout delay.

The basic adaptive playout algorithm estimates two statistics characterizing the network delay, and uses them to calculate the playout delay as follows:

$$d_{play,i} = \hat{d}_i + \beta \hat{v}_i, \tag{3.1}$$

where $\hat{d}_i$ and $\hat{v}_i$ are running estimates of the mean and variation of the network delay observed up to packet $i$. The parameter $\beta$ has a critical impact on the tradeoff between delay and late packet loss, which in turn influences the conversational speech quality. From (3.1) it can be deduced that increasing $\beta$ leads to a lower late loss rate as more packets arrive in time, but the end-to-end delay increases. The algorithms in [11-13] all used a fixed value of $\beta$, e.g., $\beta = 4$, to set the buffer size, so that only a small fraction of the arriving packets should be lost due to late arrival. In this work, a $\beta$-adaptive algorithm is instead used to control the playout buffer so that the reconstructed voice quality is maximized in terms of delay and loss. The idea behind our algorithm is to adaptively adjust the value of $\beta$ with each incoming talkspurt, depending on the variation in the network delays.

3.1.2 Perceptually Motivated Optimization Criterion

Next, we formulate the parameter adjustment as a perceptually motivated optimization problem whose criterion relies on the proposed multi-stream voice quality prediction model. Let $d_i$ be the end-to-end delay experienced by the $i$th packet, which consists of the encoding delay $d_c$ and the playout delay $d_{play,i}$. Let $e_i$ be the packet-erasure probability, i.e., the probability that both descriptions are lost, whether a description is dropped by the network or discarded due to its late arrival. It is given by

$$e_i = e_n^{(1)} e_n^{(2)} + e_n^{(1)} (1 - e_n^{(2)})\, e_{b,i}^{(2)} + e_n^{(2)} (1 - e_n^{(1)})\, e_{b,i}^{(1)} + (1 - e_n^{(1)})(1 - e_n^{(2)})\, e_{b,i}^{(1)} e_{b,i}^{(2)}, \tag{3.2}$$

where $e_n^{(l)}$ and $e_{b,i}^{(l)}$ represent the link loss probability and the estimated late loss probability of packet $i$ in stream $l$, respectively. Now, we define an overall impairment function $I_m$ of both $d_i$ and $e_i$, with $I_m(d_i, e_i) = I_d(d_i) + I_e(e_i)$. Using (2.2) and (2.3), $I_m$ can be expressed as

$$I_m(d_i, e_i) = 0.024\, d_i + 0.11\,(d_i - 177.3)\, H(d_i - 177.3) + \sum_{k=1,2} q_k I_{e,k}(e_i), \tag{3.3}$$

where $q_1 + q_2 = 1$ and the probability to receive both descriptions is given by

$$q_2 = \frac{1}{1 - e_i}\,(1 - e_n^{(1)})(1 - e_n^{(2)})(1 - e_{b,i}^{(1)})(1 - e_{b,i}^{(2)}). \tag{3.4}$$
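Equations (3.2) to (3.4) can be combined into a small numerical sketch. All function names are hypothetical, the delay is assumed to be in milliseconds, the erasure rate is assumed to be passed to the regression model as a fraction, and H is taken to be the unit step function. Note that the four terms of (3.2) factor into a product of per-stream miss probabilities, which the usage lines check on arbitrary sample values.

```python
import math

# Regression parameters quoted in Section 2.2 for scenarios S1 and S2.
GAMMA = {1: (52.61, 7.52, 10.0), 2: (21.96, 17.02, 16.09)}


def ie_k(e, k):
    g1, g2, g3 = GAMMA[k]
    return g1 + g2 * math.log(1.0 + g3 * e)


def erasure_prob(en1, en2, eb1, eb2):
    """Eq. (3.2): both descriptions are missing (dropped or late)."""
    return (en1 * en2
            + en1 * (1 - en2) * eb2
            + en2 * (1 - en1) * eb1
            + (1 - en1) * (1 - en2) * eb1 * eb2)


def q2_prob(en1, en2, eb1, eb2):
    """Eq. (3.4): both descriptions arrive, given the packet is restorable."""
    e = erasure_prob(en1, en2, eb1, eb2)
    return (1 - en1) * (1 - en2) * (1 - eb1) * (1 - eb2) / (1 - e)


def delay_impairment(d_ms):
    """Delay term of Eq. (3.3); max(., 0) implements the unit step H."""
    return 0.024 * d_ms + 0.11 * max(d_ms - 177.3, 0.0)


def overall_impairment(d_ms, en1, en2, eb1, eb2):
    """Eq. (3.3) with q1 = 1 - q2."""
    e = erasure_prob(en1, en2, eb1, eb2)
    q2 = q2_prob(en1, en2, eb1, eb2)
    return delay_impairment(d_ms) + (1 - q2) * ie_k(e, 1) + q2 * ie_k(e, 2)


# Arbitrary sample values: the sum in (3.2) equals the factorised form.
args = (0.05, 0.10, 0.02, 0.03)
factorised = (0.05 + 0.95 * 0.02) * (0.10 + 0.90 * 0.03)
print(abs(erasure_prob(*args) - factorised) < 1e-12)
```

The factorised form reads as "stream 1 misses and stream 2 misses", where a stream misses either through a link loss or through a late arrival of a packet that was not dropped.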


Our optimization framework requires an analytic expression for the packet-erasure probability $e_i$ as a function of the single parameter $\beta_i$. Notice that $e_{b,i}^{(l)}$ and the playout delay $d_{play,i}$ are strongly correlated. To find their relationship, the network delays of stream $l$ are assumed to follow a Pareto distribution, defined as $F_l(x) = 1 - (g_l/x)^{\alpha_l}$. The parameters $\alpha_l$ and $g_l$ of the Pareto distribution can be estimated from past recorded delays using the maximum likelihood estimation method [8]. Then, the late loss probability of packet $i$ in stream $l$ can be computed as follows:

$$e_{b,i}^{(l)} = 1 - F_l(d_{play,i}) = (g_l/d_{play,i})^{\alpha_l}. \tag{3.5}$$
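A minimal sketch of the Pareto fit and the resulting late loss probability is given below. The function names are hypothetical; the estimator follows the maximum likelihood form used later in the algorithm summary (scale equal to the smallest recorded delay, shape equal to the sample count divided by the sum of log ratios). Clamping the probability to 1 when the playout delay does not exceed the scale parameter is an added assumption, since the Pareto CDF is only defined above `g`.

```python
import math
from typing import Sequence, Tuple


def pareto_mle(delays: Sequence[float]) -> Tuple[float, float]:
    """Return (alpha, g) fitted to past network delays of one stream."""
    g = min(delays)
    log_sum = sum(math.log(n / g) for n in delays)
    if log_sum == 0.0:
        raise ValueError("all delays are identical; alpha is undefined")
    return len(delays) / log_sum, g


def late_loss_prob(d_play: float, alpha: float, g: float) -> float:
    """Eq. (3.5): probability that the network delay exceeds d_play."""
    if d_play <= g:
        return 1.0  # assumption: every packet is late below the minimum delay
    return (g / d_play) ** alpha


# Toy delay record (arbitrary units), not trace data from the study.
alpha, g = pareto_mle([10.0, 20.0, 40.0])
print(alpha, g, late_loss_prob(20.0, alpha, g))
```

Because the tail decays as a power of `d_play`, each additional unit of playout delay buys a progressively smaller reduction in late loss, which is what makes the delay-loss tradeoff non-trivial.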

This reduces the expression of the packet-erasure probability $e_i$ to a function of the playout delay $d_{play,i}$, which in turn is a function of the parameter $\beta_i$. Its gradient with respect to $\beta_i$ is given by

$$\frac{de_i}{d\beta_i} = -\frac{\hat{v}_i}{d_{play,i}} \Big\{ (1 - e_n^{(1)})(1 - e_n^{(2)})\, e_{b,i}^{(1)} e_{b,i}^{(2)} (\alpha_1 + \alpha_2) + e_n^{(1)} (1 - e_n^{(2)})\, e_{b,i}^{(2)} \alpha_2 + e_n^{(2)} (1 - e_n^{(1)})\, e_{b,i}^{(1)} \alpha_1 \Big\}. \tag{3.6}$$

The overall impairment function $I_m$ is a function of the playout delay $d_{play,i}$ and the probability $q_k$ as well as the packet-erasure probability $e_i$. Since these parameters are all functions of the parameter $\beta_i$, the overall impairment $I_m$ is also a function of $\beta_i$, i.e., $I_m(d_i, e_i) = I_m(\beta_i)$. By differentiating it with respect to $\beta_i$, we get the following equation for the gradient:

$$I_m'(\beta_i) = c\,\hat{v}_i + \sum_{k=1,2} \left\{ q_k \frac{\gamma_{2,k}\gamma_{3,k}}{1 + \gamma_{3,k} e_i} \frac{de_i}{d\beta_i} + \frac{dq_k}{d\beta_i} I_{e,k}(e_i) \right\}, \tag{3.7}$$

where

$$c = \begin{cases} 0.024, & \beta_i < (177.3 - d_c - \hat{d}_i)/\hat{v}_i \\ 0.134, & \beta_i > (177.3 - d_c - \hat{d}_i)/\hat{v}_i \end{cases} \tag{3.8}$$

$$\frac{dq_2}{d\beta_i} = \frac{\hat{v}_i}{d_{play,i}(1 - e_i)} (1 - e_n^{(1)})(1 - e_n^{(2)}) \left[ \alpha_1 e_{b,i}^{(1)} (1 - e_{b,i}^{(2)}) + \alpha_2 e_{b,i}^{(2)} (1 - e_{b,i}^{(1)}) \right] + \frac{1}{(1 - e_i)^2} \frac{de_i}{d\beta_i} (1 - e_n^{(1)})(1 - e_n^{(2)})(1 - e_{b,i}^{(1)})(1 - e_{b,i}^{(2)}). \tag{3.9}$$

3.1.3 Perceptual Optimization of Playout Buffer

Our general problem can be stated as follows: Given estimates of the parameters characterizing the delay distribution and the $I_e$ regression model, find the optimal value of $\beta_i$ so as to minimize the overall impairment function $I_m(\beta_i)$. This task belongs to the class of set-constrained optimization problems, which can be solved efficiently by means of one-dimensional search methods [29]. For computational purposes, we applied the secant method [29] to search for the minimizer $\hat{\beta}_i$ of $I_m$ over the constraint set $\{\beta_i \in \mathbb{R}, \beta_i > 0\}$. Starting with two initial values $\beta_i(-1)$ and $\beta_i(0)$, the iterative formula for the secant algorithm at the $j$-th iteration has the form

$$\beta_i(j+1) = \beta_i(j) - \frac{\beta_i(j) - \beta_i(j-1)}{I_m'(\beta_i(j)) - I_m'(\beta_i(j-1))}\, I_m'(\beta_i(j)). \tag{3.10}$$

The new value $\beta_i(j+1)$ is then used in the next iteration and the estimation process is repeated until the difference $|\beta_i(j+1) - \beta_i(j)|$ is smaller than a threshold. Finally, we summarize the proposed multi-stream playout buffer algorithm as below.

1. Apply an autoregressive algorithm [11] to estimate the delay mean $\hat{d}_i^{(l)}$ and variance $\hat{v}_i^{(l)}$ for individual stream $l$ ($l = 1, 2$) as follows:

$$\hat{d}_i^{(l)} = \mu\, \hat{d}_{i-1}^{(l)} + (1 - \mu)\, n_i^{(l)}, \tag{3.11}$$

$$\hat{v}_i^{(l)} = \mu\, \hat{v}_{i-1}^{(l)} + (1 - \mu)\, |n_i^{(l)} - \hat{d}_i^{(l)}|, \tag{3.12}$$

where $n_i^{(l)}$ is the network delay of packet $i$ in stream $l$ and $\mu = 0.998002$ is a weighting factor for convergence control.

2. At the beginning of each talkspurt, update network delay records for the past $L = 200$ packets in every stream $l$ ($l = 1, 2$), and use them to calculate the Pareto distribution parameters $(\alpha_l, g_l)$ by the maximum likelihood estimation method. Given a set of past network delays $\{n_{i-1}^{(l)}, n_{i-2}^{(l)}, \ldots, n_{i-L}^{(l)}\}$, we compute

$$g_l = \min\{n_{i-1}^{(l)}, n_{i-2}^{(l)}, \ldots, n_{i-L}^{(l)}\}, \tag{3.13}$$

$$\alpha_l = L \Big/ \sum_{j=i-L}^{i-1} \log\!\left(\frac{n_j^{(l)}}{g_l}\right). \tag{3.14}$$

3. Use the values of $(\alpha_l, g_l)$ in the secant method to determine the minimizer $\hat{\beta}_i^{(l)}$ of the utility function,

$$I_m(\beta_i^{(l)}) = I_d(d_c + \hat{d}_i^{(l)} + \beta_i^{(l)} \hat{v}_i^{(l)}) + I_e(e_i(\beta_i^{(l)})). \tag{3.15}$$

4. Set the playout delay to

$$d_{play,i} = \hat{d}_i^{(l^*)} + \hat{\beta}_i^{(l^*)} \hat{v}_i^{(l^*)}, \quad l^* = \arg\min\{I_m(\hat{\beta}_i^{(l)}),\ l = 1, 2\}. \tag{3.16}$$
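The control loop summarized in the four steps above can be sketched as follows. This is an illustrative skeleton under stated assumptions rather than the implementation used in the experiments: the helper names are hypothetical, the gradient is supplied by the caller (a toy quadratic stands in for the gradient of the overall impairment in the usage lines), and clamping the iterate to a small positive floor is an added safeguard to respect the constraint that the parameter stays positive.

```python
from typing import Callable, Sequence, Tuple

MU = 0.998002  # weighting factor quoted in step 1


def update_delay_estimates(d_hat: float, v_hat: float, n: float,
                           mu: float = MU) -> Tuple[float, float]:
    """Eqs. (3.11)-(3.12): autoregressive update with a new network delay n."""
    d_new = mu * d_hat + (1 - mu) * n
    v_new = mu * v_hat + (1 - mu) * abs(n - d_new)
    return d_new, v_new


def secant_minimize(grad: Callable[[float], float], b_prev: float, b_curr: float,
                    tol: float = 1e-6, max_iter: int = 50,
                    b_min: float = 1e-3) -> float:
    """Eq. (3.10): secant iteration on the gradient, kept above b_min."""
    for _ in range(max_iter):
        g_prev, g_curr = grad(b_prev), grad(b_curr)
        denom = g_curr - g_prev
        if denom == 0.0:
            break  # flat secant: keep the current iterate
        b_next = max(b_curr - (b_curr - b_prev) / denom * g_curr, b_min)
        if abs(b_next - b_curr) < tol:
            return b_next
        b_prev, b_curr = b_curr, b_next
    return b_curr


def select_playout_delay(streams: Sequence[Tuple[float, float, float, float]]) -> float:
    """Eq. (3.16): streams holds (d_hat, v_hat, beta_hat, impairment) per stream."""
    d_hat, v_hat, beta_hat, _ = min(streams, key=lambda s: s[3])
    return d_hat + beta_hat * v_hat


# Toy gradient of (beta - 3)^2, standing in for the gradient of Eq. (3.7).
print(secant_minimize(lambda b: 2.0 * (b - 3.0), 1.0, 2.0))
```

Passing the gradient as a callable keeps the search independent of how the late loss and erasure probabilities are modelled, so the same routine can be reused if the delay distribution is changed.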

3.1.4 Experimental Results

Computer simulations were carried out to evaluate the performance of three variants, MD1-3, of the MD voice transmission scheme, which used the 9.2 kbps MD-G.729 codec to generate two balanced descriptions. An FEC-protected single description (SD) transmission scheme was also tested for comparison. The SD scheme applied the 8 kbps G.729 codec and a packet-level (9,8) Reed-Solomon channel code: an FEC packet was generated for every 8 packets, and whenever any 8 of the 9 packets (the 8 speech packets plus the resulting FEC packet) had been received over a period of time, the 8 speech packets were fully recovered at the receiver end. It was hypothesized that the performances of the four schemes being tested would be set apart mainly by the value of $\beta$ (fixed or dynamically changing) each assumed during the test period, and that the best performance should come with $\beta$ values whose calculation was based on link loss, packet-erasure loss and various transmission scenarios. MD1 had a fixed $\beta = 4$, and MD2 took values of $\beta$ that were dynamically adjusted by the playout buffer according to the proposed voice quality prediction model. MD3 differed from the previous two by having its $\beta$ set following the play-first strategy proposed by Liang et al. [3]. The SD scheme, with the FEC feature, assumed a dynamic $\beta$ value as determined by the E-model [30].

The speech data fed into the simulations were two sentential utterances spoken by one male and one female, each sampled at 8 kHz and 8 seconds in duration. Both samples were encoded and then processed in accordance with the delay and loss characteristics of the trace data to degrade the speech. Figure 3.1 plots the perceived speech quality for the SD and the three MD schemes as a function of the link loss rate. As described in Section 2.3, the perceived quality was gauged by calculating the predicted average R-factor according to the E-model, and the link loss rate was varied from 0% to 30%. It can be seen that, although the quality deteriorated for all four schemes as the link loss rate increased, the three MD schemes yielded better speech quality than the SD scheme, especially at higher link loss rates. At rates slightly beyond the minimum (e.g., 5%), the SD scheme, despite its FEC feature, started to fail to recover the packets lost to link losses. Among the three MD schemes, which showed three levels of dynamics in making decisions about delay, MD1, with its fixed $\beta$ value, yielded the worst quality at 0% link loss rate, yet showed better results than MD3 at rates above 20%, suggesting a limitation of the Lagrangian cost function in predicting the actual perceived speech quality. The best results, as hypothesized, were obtained with the currently proposed scheme MD2, which can be attributed to its overall impairment function that takes into account the impacts of delay, packet-erasure loss and various transmission scenarios. To elaborate further, MOS performances of the playout algorithms were examined for MD transmission with link loss rates ranging from 0% to 30%. As shown in Table 3.1, the results indicate that the proposed playout algorithm is preferable to the other algorithms in all the tests and its performance gain tends to increase with increasing link loss rates.

Figure 3.1: Performance comparison for different playout algorithms. (Axes: R-factor versus link loss rate (%); curves: SD (dynamic $\beta$) with RS-FEC(9,8), MD1 ($\beta = 4$), MD2 (dynamic $\beta$), MD3 (Liang).)

Table 3.1: MOS comparison for different playout algorithms.

Link loss rate (%)     0       5       10      15      20      25      30
SD with RS-FEC(9,8)    3.305   3.028   2.678   2.396   2.147   1.979   1.833
MD1                    3.226   3.040   2.832   2.623   2.439   2.274   2.128
MD2                    3.481   3.305   3.059   2.860   2.627   2.441   2.254
MD3                    3.473   3.288   2.942   2.677   2.399   2.126   1.893

3.2 Joint FEC and Playout Control Mechanisms

This section presents a joint playout buffer and FEC adjustment scheme that maximizes the perceived speech quality via delay-loss trading. Figure 3.2 shows a block diagram of the simulation system, in which the first two components, the MD speech coder and the channel coder, are responsible for description generation and the rest for transmission and signal reconstruction. After source coding, packet-level Reed-Solomon $(N, K)$ codes [14] are used for channel coding of individual descriptions. The channel encoder takes a codeword of $K$ speech packets and generates $N - K$ additional FEC check packets for the transmission of $N$ packets over the network. Such a code, denoted as an RS $(N, K)$ code, is able to recover all losses in the block if and only if at least $K$ out of $N$ packets are received correctly. The receiver end features an adaptive playout buffer that smooths out the network jitter. The algorithm adjusts the playout buffer at the beginning of each talkspurt, and subsequent packets of that talkspurt are played out at the generation rate of the sender. A joint design of FEC and playout buffer adaptation was further formulated as an optimization problem on the basis of a minimum overall impairment criterion. In addition to the packet loss and delay that traditional systems sought to control, this design takes into account the dynamics of transmission impairments due to frequent switching of playout scenarios. Experimental results showed that the proposed multi-stream voice transmission scheme achieves significant reductions in delay and packet loss rates as well as improved speech quality.

Figure 3.2: A multi-description voice transmission system.

3.2.1 FEC in a Gilbert-Model Loss Process

In Section 1.2 we stated the rationale for combining FEC into the playout buffer algorithm without following the no-reordering assumption underlying the work in [16]. Assume that multiple descriptions of the speech are transmitted over independent network paths and each path is characterized by a Gilbert-model loss process. The Gilbert model is a two-state Markov chain model in which state B represents a network loss and state G represents a packet reaching the destination. For each stream $l$, the parameters $p^{(l)}$ and $q^{(l)}$ denote respectively the probabilities of transitions from state G to state B and from state B to state G. A packet is said to be missing if it is either dropped in the network or discarded due to its late arrival. For the sake of clarity, every packet $i$ is assigned a variable $W_i \in \{0, 1, 2\}$, corresponding to the following three arrival scenarios: $W_i = 0$, arriving before its playout time; $W_i = 1$, a network loss; and $W_i = 2$, arriving after its playout time. Following the development of (Boutremans and Boudec, 2003), let $R^{(l)}(m, n, D_{F,i})$ denote the probability that $m - 1$ packets are missing (dropped or received late) in the next $n - 1$ packets following the network loss of packet $i$, and let $S^{(l)}(m, n, D_{F,i})$ denote the probability that $m - 1$ packets are missing in the next $n - 1$ packets following the late loss of packet $i$. Similarly, let $\tilde{R}^{(l)}(m, n, D_{F,i})$ and $\tilde{S}^{(l)}(m, n, D_{F,i})$ denote the probability that $m - 1$ missing packets occur in the last $n - 1$ packets preceding packet $i$, which is dropped and received late, respectively. As shown in Appendix A, these probabilities can be computed by recurrence as follows:

$$R^{(l)}(m, n, D_{F,i}) = \begin{cases} q^{(l)} (1 - p^{(l)})^{n-2} \cdot \prod_{h=1}^{n-1} (1 - e_{b,i+h}^{(l)}), & m = 1,\ n \ge 1 \\[4pt] (1 - q^{(l)})\, R^{(l)}(m-1, n-1, D_{F,i+1}) + \sum_{j=1}^{n-m} \Big\{ q^{(l)} (1 - p^{(l)})^{j-1} \prod_{h=1}^{j} (1 - e_{b,i+h}^{(l)}) \cdot \big\{ p^{(l)} R^{(l)}(m-1, n-j-1, D_{F,i+j+1}) + (1 - p^{(l)})\, e_{b,i+j+1}^{(l)}\, S^{(l)}(m-1, n-j-1, D_{F,i+j+1}) \big\} \Big\}, & 2 \le m \le n \end{cases} \tag{3.17}$$

and

$$S^{(l)}(m, n, D_{F,i}) = \begin{cases} e_{b,i}^{(l)} (1 - p^{(l)})^{n-1} \cdot \prod_{h=1}^{n-1} (1 - e_{b,i+h}^{(l)}), & m = 1,\ n \ge 1 \\[4pt] \sum_{j=0}^{n-m} \Big\{ e_{b,i}^{(l)} (1 - p^{(l)})^{j} \prod_{h=1}^{j} (1 - e_{b,i+h}^{(l)}) \cdot \big\{ p^{(l)} R^{(l)}(m-1, n-j-1, D_{F,i+j+1}) + (1 - p^{(l)})\, e_{b,i+j+1}^{(l)}\, S^{(l)}(m-1, n-j-1, D_{F,i+j+1}) \big\} \Big\}, & 2 \le m \le n \end{cases} \tag{3.18}$$

where $D_{F,i}$ is the FEC delay and $e_{b,i}^{(l)}$ is the estimated late loss probability of packet $i$ in stream $l$. Table 3.2 summarizes the basic notation used in the Appendix.

Table 3.2: Basic Notation.

Notation                          Description
$D_{F,i}$                         FEC delay of packet $i$
$e_i$                             Packet-erasure probability of packet $i$
$e_{b,i}^{(l)}$                   Late loss probability of packet $i$ in stream $l$
$P_L^{(l)}(i)$                    Residual loss probability of packet $i$ in stream $l$ after FEC is used
$P_{R1}^{(l)}(i)$                 Probability to recover a dropped packet $i$ in stream $l$
$P_{R2}^{(l)}(i)$                 Probability to recover a late lost packet $i$ in stream $l$
$R^{(l)}(m, n, D_{F,i})$          Probability that $m - 1$ packets are missing in the next $n - 1$ packets following the network loss of packet $i$
$S^{(l)}(m, n, D_{F,i})$          Probability that $m - 1$ packets are missing in the next $n - 1$ packets following the late loss of packet $i$
$\tilde{R}^{(l)}(m, n, D_{F,i})$  Probability that $m - 1$ packets are missing in the last $n - 1$ packets preceding the network loss of packet $i$
$\tilde{S}^{(l)}(m, n, D_{F,i})$  Probability that $m - 1$ packets are missing in the last $n - 1$ packets preceding the late loss of packet $i$
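The Gilbert loss process and the block recovery rule of an RS (N, K) code can also be explored empirically. The Monte Carlo sketch below is not the recursive computation of (3.17) and (3.18); it ignores late losses, assumes a systematic code whose first K packets in each block are the speech packets, and uses hypothetical function names. It requires p + q > 0.

```python
import random


def simulate_gilbert(num_packets, p, q, rng):
    """Return a list of booleans, True meaning the packet is lost (state B).

    p is the G->B transition probability and q the B->G probability; the chain
    starts from its stationary loss probability p / (p + q).
    """
    lost = []
    bad = rng.random() < p / (p + q)
    for _ in range(num_packets):
        lost.append(bad)
        if bad:
            bad = rng.random() >= q   # stay in B with probability 1 - q
        else:
            bad = rng.random() < p    # move to B with probability p
    return lost


def residual_loss_rate(lost, n, k):
    """Fraction of speech packets still missing after RS(n, k) decoding.

    A block is fully recovered iff at least k of its n packets arrived;
    otherwise the lost speech packets (assumed to be the first k) stay lost.
    """
    blocks = len(lost) // n
    missing = 0
    for b in range(blocks):
        block = lost[b * n:(b + 1) * n]
        if n - sum(block) < k:
            missing += sum(block[:k])
    return missing / (blocks * k) if blocks else 0.0


rng = random.Random(1)
trace = simulate_gilbert(30000, p=0.05, q=0.5, rng=rng)  # illustrative parameters
print(sum(trace) / len(trace), residual_loss_rate(trace, 3, 2))
```

Because Gilbert losses arrive in bursts, short blocks such as RS(3,2) recover fewer packets than an independent-loss model with the same average loss rate would suggest, which is one motivation for computing the recovery probabilities conditionally on the state of packet i.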

With an RS $(N, K)$ code, the encoder takes a codeword of $K$ voice packets and generates $N - K$ additional FEC packets for the transmission of $N$ packets over the network. Such a code is able to recover any missing packet in the block if and only if at least $K$ out of $N$ packets in this block are received before their playout time. Viewed from this perspective, the probability to recover a dropped packet is given by

$$\begin{aligned} P_{R1}^{(l)}(i) &= \Pr(\text{packet } i \text{ can be recovered} \mid \text{packet } i \text{ is dropped in the network}) \\ &= \sum_{L=1}^{N-K} \Pr(L \text{ packets are missing in } W_1^N \mid W_i = 1) \\ &= \sum_{L=1}^{N-K} \sum_{m=0}^{\min(L-1,\, i-1)} \Pr(m \text{ packets are missing in } W_1^{i-1} \mid W_i = 1) \cdot \Pr(L - m - 1 \text{ packets are missing in } W_{i+1}^N \mid W_i = 1) \\ &= \sum_{L=1}^{N-K} \sum_{m=0}^{\min(L-1,\, i-1)} \tilde{R}^{(l)}(m+1, i, D_{F,i}) \cdot R^{(l)}(L - m, N - i + 1, D_{F,i}) \end{aligned} \tag{3.19}$$

and the probability to recover a late lost packet is given by

$$\begin{aligned} P_{R2}^{(l)}(i) &= \Pr(\text{packet } i \text{ can be recovered} \mid \text{packet } i \text{ is received late}) \\ &= \sum_{L=1}^{N-K} \Pr(L \text{ packets are missing in } W_1^N \mid W_i = 2) \\ &= \sum_{L=1}^{N-K} \sum_{m=0}^{\min(L-1,\, i-1)} \Pr(m \text{ packets are missing in } W_1^{i-1} \mid W_i = 2) \cdot \Pr(L - m - 1 \text{ packets are missing in } W_{i+1}^N \mid W_i = 2) \\ &= \sum_{L=1}^{N-K} \sum_{m=0}^{\min(L-1,\, i-1)} \tilde{S}^{(l)}(m+1, i, D_{F,i}) \cdot S^{(l)}(L - m, N - i + 1, D_{F,i}). \end{aligned} \tag{3.20}$$

Using these probabilities, we can compute the residual loss probability (after FEC is used) as follows:

$$P_L^{(l)}(i) = e_n^{(l)} \big(1 - P_{R1}^{(l)}(i)\big) + (1 - e_n^{(l)})\, e_{b,i}^{(l)} \big(1 - P_{R2}^{(l)}(i)\big), \tag{3.21}$$

where $e_n^{(l)}$ represents the network loss probability measured in stream $l$. The packet-erasure probability $e_i$ is defined as the probability that none of the descriptions of packet $i$ arrives on time, and is given by

$$e_i = \prod_{l=1}^{2} P_L^{(l)}(i). \tag{3.22}$$
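A short sketch of the residual loss computation follows, taking the per-stream residual loss probabilities as the factors of the product, which is how the definition of the packet-erasure probability reads. The function names are hypothetical and the recovery probabilities are passed in as plain numbers rather than computed from the recurrences. With both recovery probabilities set to zero (no FEC), the result reduces to the erasure probability of Eq. (3.2), which gives a convenient consistency check.

```python
from typing import Iterable


def residual_loss(e_n: float, e_b: float, p_r1: float, p_r2: float) -> float:
    """Eq. (3.21): loss probability of one stream after FEC recovery."""
    return e_n * (1 - p_r1) + (1 - e_n) * e_b * (1 - p_r2)


def erasure_prob(residuals: Iterable[float]) -> float:
    """Product of per-stream residual losses: no description arrives on time."""
    prod = 1.0
    for p_l in residuals:
        prod *= p_l
    return prod


# Arbitrary sample values with FEC switched off (recovery probabilities of 0).
streams = [residual_loss(0.05, 0.02, 0.0, 0.0), residual_loss(0.10, 0.03, 0.0, 0.0)]
print(erasure_prob(streams))
```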

3.2.2 Joint FEC and Playout Control

The main attraction of multi-stream transmission arises from its flexibility in trading different sources of impairment against each other. Waiting for the arrival of both descriptions results in lower equipment impairment, but at the cost of higher delay impairment. On the other hand, playing out the voice description with lower delay avoids latency, but increases the equipment impairment. Since playout scheduling aims to improve the overall conversational speech quality, which depends on the balance between delay and packet loss, full reconstruction from both descriptions may not always be the priority if the overall impairment does not justify the extra delay from waiting. Given that, the joint playout and FEC control must be able to switch between different playout scenarios in order to maximize the benefits of packet path diversity. To accomplish this goal, we formulated the system design as a perceptually motivated optimization problem whose criterion relies on the proposed multi-stream voice quality prediction model. Our efforts began by estimating the playout delay, which is defined as the time from the moment that a packet is delivered to the network until it has to be played out. We applied an autoregressive algorithm (Moon et al. 1998) to estimate the mean $\hat{d}$ and variance $\hat{v}$ of the network delay, and used them to calculate the buffer delay $d_b = \hat{d} + \beta\hat{v}$. Waiting for the FEC check packets results in additional delay and, consequently, the playout delay is given by

$$d_{play} = \hat{d} + \beta\hat{v} + (N - 1)\, T_p, \tag{3.23}$$

where $T_p$ is the packet generation interval. The parameter $\beta$ has a critical impact on the tradeoff between delay and late packet loss, which in turn influences the conversational speech quality. From (3.23) it can be deduced that increasing $\beta$ leads to a lower late loss rate as more packets arrive in time, and yet the end-to-end delay also increases. Most playout buffer algorithms [11][12][13] used a fixed value of $\beta$, e.g., $\beta = 4$, to set the buffer size, so that only a small fraction of the arriving packets should be lost due to late arrival. In this work, a $\beta$-adaptive algorithm is instead used to control the buffer size so that the reconstructed voice quality is maximized in terms of delay and loss.

Our general problem can be stated as follows: Given estimates of the parameters characterizing the packet loss and delay distribution, find the optimal values of $\beta$ and $\{N, K\}$ so as to minimize the overall impairment function subject to the rate constraint. Let $d_i$ be the end-to-end delay experienced by the $i$th packet, which consists of the encoding delay $d_c$ and the playout delay $d_{play}$. Now, we define an overall impairment function $I_m$ as a function of both $d_i$ and $e_1^K = (e_1, \cdots, e_K)$ with the following form

$$I_m(d_i, e_1^K) = I_d(d_i) + \frac{1}{K} \sum_{j=1}^{K} \sum_{l=1,2} r_l\, I_{e,l}(e_j), \tag{3.24}$$

where $r_1 + r_2 = 1$ and the probability to receive both descriptions is given by

$$r_2 = \frac{1}{1 - e_i} \prod_{l=1}^{2} \big(1 - P_L^{(l)}(i)\big). \tag{3.25}$$

Our optimization framework requires an analytic expression for the packet-erasure probability $e_i$ as a function of the parameter $\beta$. Notice that $e_{b,i}^{(l)}$ and the playout delay $d_{play}$ are strongly correlated. To find their relationship, the network delays of stream $l$ are assumed to follow a Pareto distribution, defined as $F_D^{(l)}(d) = 1 - (g_l/d)^{\alpha_l}$. The parameters $\alpha_l$ and $g_l$ of the Pareto distribution can be estimated from past recorded delays using the maximum likelihood estimation method [13]. More specifically, given a set of past network delays $\{n_{i-1}^{(l)}, n_{i-2}^{(l)}, \ldots, n_{i-M}^{(l)}\}$, we compute $g_l = \min\{n_{i-1}^{(l)}, n_{i-2}^{(l)}, \ldots, n_{i-M}^{(l)}\}$ and $\alpha_l = M \big/ \sum_{j=i-M}^{i-1} \log(n_j^{(l)}/g_l)$. Then, the late loss probability of packet $i$ in stream $l$ can be computed as follows:

$$e_{b,i}^{(l)} = 1 - F_D^{(l)}(D_{F,i}) = (g_l/D_{F,i})^{\alpha_l}, \tag{3.26}$$

where $D_{F,i} = d_{play} - (i - 1)\, T_p$. This reduces the expression of the packet-erasure probability $e_i$ to a function of the playout delay $d_{play}$, which in turn is a function of the parameter $\beta$.
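The effect of the FEC wait on the playout deadline, and of the packet position on the late loss probability, can be illustrated with a short sketch. The function names are hypothetical, the position index is 1-based within the block, the numbers in the usage lines are arbitrary, and returning 1 when the FEC delay does not exceed the Pareto scale parameter is an added assumption.

```python
def playout_delay(d_hat: float, v_hat: float, beta: float,
                  n_block: int, t_p: float) -> float:
    """Eq. (3.23): buffer delay plus the wait for the FEC check packets."""
    return d_hat + beta * v_hat + (n_block - 1) * t_p


def fec_late_loss(d_play: float, position: int, t_p: float,
                  alpha: float, g: float) -> float:
    """Eq. (3.26) with D_F,i = d_play - (i - 1) * T_p for block position i."""
    d_f = d_play - (position - 1) * t_p
    if d_f <= g:
        return 1.0  # assumption: the deadline is below the minimum delay
    return (g / d_f) ** alpha


# Arbitrary example: RS block of N = 3 packets generated every 10 time units.
d_play = playout_delay(d_hat=50.0, v_hat=5.0, beta=4.0, n_block=3, t_p=10.0)
for i in (1, 2, 3):
    print(i, fec_late_loss(d_play, i, 10.0, alpha=2.0, g=30.0))
```

Packets later in the block have less slack before the common deadline, so their late loss probability is higher; this is why longer blocks raise both the delay and the exposure of the trailing packets.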

Finally, we summarize the proposed multi-stream joint playout and FEC adjustment algorithm as below.

1. Apply an autoregressive algorithm [11] to estimate the delay mean $\hat{d}_i^{(l)}$ and variance $\hat{v}_i^{(l)}$ for individual stream $l$ ($l = 1, 2$) as follows:

$$\hat{d}_i^{(l)} = \mu\, \hat{d}_{i-1}^{(l)} + (1 - \mu)\, n_i^{(l)}, \tag{3.27}$$

$$\hat{v}_i^{(l)} = \mu\, \hat{v}_{i-1}^{(l)} + (1 - \mu)\, |n_i^{(l)} - \hat{d}_i^{(l)}|, \tag{3.28}$$

where $n_i^{(l)}$ is the network delay of packet $i$ in stream $l$ and $\mu = 0.998002$ is a weighting factor for convergence control.

2. At the beginning of each talkspurt, update network delay records for the past $M = 200$ packets in every stream $l$ ($l = 1, 2$), and use them to calculate the Pareto distribution parameters $(\alpha_l, g_l)$ by the maximum likelihood estimation method.

3. Use the values of $(\alpha_l, g_l)$ to compute the late loss probability in (3.26) and the packet-erasure probability $e_i$ in (3.22). Apply an exhaustive search method to determine the minimizer $(\hat{\beta}_i^{(l)}, \hat{N}^{(l)}, \hat{K}^{(l)})$ of the overall impairment function in (3.24) subject to the code rate constraint $\frac{N}{K} \times \frac{9.2}{8} \le R_{max}$. Here, the maximum overall code rate $R_{max}$ is chosen to be 2.

4. Set the playout delay and RS code parameters to

$$d_{play} = \hat{d}^{(l^*)} + \hat{\beta}_i^{(l^*)} \hat{v}^{(l^*)} + (\hat{N}^{(l^*)} - 1)\, T_p, \quad (N, K) = (\hat{N}^{(l^*)}, \hat{K}^{(l^*)}), \tag{3.29}$$

with $l^* = \arg\min\{I_m(\hat{\beta}^{(l)}, \hat{N}^{(l)}, \hat{K}^{(l)}),\ l = 1, 2\}$.
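The exhaustive search in step 3 can be sketched as an enumeration of the code parameters that satisfy the rate constraint, followed by a brute-force minimization. The function names, the upper bound of 10 on the block length, the inclusion of the no-FEC case K = N, and the candidate grid for the buffer parameter are all assumptions; the impairment itself is supplied by the caller, and a toy cost function stands in for Eq. (3.24) in the usage lines.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def feasible_codes(max_n: int = 10, r_max: float = 2.0,
                   md_rate: float = 9.2, sd_rate: float = 8.0) -> List[Tuple[int, int]]:
    """All (N, K) with 1 <= K <= N <= max_n and (N/K) * (9.2/8) <= R_max."""
    return [(n, k)
            for n in range(1, max_n + 1)
            for k in range(1, n + 1)
            if (n / k) * (md_rate / sd_rate) <= r_max]


def exhaustive_search(impairment: Callable[[float, int, int], float],
                      betas: Sequence[float],
                      codes: Sequence[Tuple[int, int]]) -> Tuple[float, float, int, int]:
    """Return (I_m, beta, N, K) with the smallest impairment over the grid."""
    return min((impairment(b, n, k), b, n, k)
               for b, (n, k) in product(betas, codes))


codes = feasible_codes()
print((3, 2) in codes, (5, 3) in codes, (10, 6) in codes, (2, 1) in codes)

# Toy cost, used only to exercise the search; not the impairment of Eq. (3.24).
print(exhaustive_search(lambda b, n, k: (b - 2) ** 2 + n + k,
                        [1, 2, 3], [(3, 2), (5, 3)]))
```

Under this constraint the three fixed codes used in the experiments, RS(3,2), RS(5,3) and RS(10,6), are all feasible, while a full repetition code such as RS(2,1) exceeds the rate budget.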

3.2.3 Experimental Results

Computer simulations were carried out to evaluate the performance of four MD voice transmission schemes, MD1-4, all of which used MD-G.729 for source coding and an RS(N, K) code for channel coding. The speech data fed into the simulations were two sentential utterances, one spoken by a male and one by a female, each sampled at 8 kHz and 8 seconds in duration. Both samples were encoded and then degraded in accordance with the delay and loss characteristics of the trace data. Among the four schemes, MD1 had its parameters {β, N, K} dynamically adjusted according to the proposed voice quality prediction model, while MD2-4 shared a fixed β = 4 with (N, K) set to (3,2), (5,3), and (10,6), respectively. The last two (N, K) sets let MD3 and MD4 operate at the same FEC coding ratio but with different delays, which allowed us to evaluate the effect of packet loss versus delay in our test environment. It was hypothesized that the performance of these schemes would be set apart mainly by the values of {β, N, K} each assumed, and that the best performance would come from the adaptive parameter adjustment scheme, MD1 in the current case, whose calculation was based on link loss, packet-erasure loss, and various transmission scenarios.

Figure 3.3: Performance comparison for different playout algorithms. (R-factor versus link loss rate (%); curves: MD1 dynamic {N, K, β}; MD2 RS(3,2), β = 4; MD3 RS(5,3), β = 4; MD4 RS(10,6), β = 4; SD dynamic {N, K, d_play}.)
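The statement that MD3 and MD4 share the same FEC coding ratio but differ in delay can be checked with a few lines of arithmetic. The sketch below reuses the factor 9.2/8 from the code rate constraint in step 3 and the term $(N-1)T_p$ from (3.29). The helper names are illustrative only, and delay is reported in units of the packet interval $T_p$ because its value is not given in this section.

```python
def overall_code_rate(n, k, source_factor=9.2 / 8.0):
    """Overall code rate (N/K) * (9.2/8), as used in the step 3 constraint."""
    return (n / k) * source_factor


def fec_delay_in_packets(n):
    """FEC contribution (N - 1) to the playout delay of (3.29), in units of T_p."""
    return n - 1


# Fixed-parameter schemes described in the text.
schemes = {"MD2": (3, 2), "MD3": (5, 3), "MD4": (10, 6)}

for name, (n, k) in schemes.items():
    print(name, round(overall_code_rate(n, k), 4), fec_delay_in_packets(n))
```

RS(5,3) and RS(10,6) both give an overall code rate of about 1.917, within the limit of 2, while their FEC delay terms are 4 and 9 packet intervals. RS(3,2) gives 1.725 with a delay term of 2 packet intervals.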


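Acknowledgments

Looking back on my doctoral studies, I would first like to thank my advisor, Professor Wen-Whei Chang, whose attentive guidance enabled me to grasp research techniques and priorities effectively; I hope in turn to contribute something more to the field of telecommunications. I also thank Professor Chang for the advice and encouragement given to me in daily life. I am very grateful to National Chiao Tung University for providing an excellent academic environment in which students can grow in both technological and humanistic literacy. I especially thank Professor 江源泉 of National Hsinchu University of Education for assistance and guidance with this research. During my research life in the Speech Communication Laboratory, my senior labmate 李承龍 and my fellow students 傅泰魁, 曹正宏, 林宜德, 許忠安, 蔡知鑑, 何依信, 顏廣儀, 曾啟翔, 潘彥璋, 張永樂, 陳亞民, 戴玲玲, 葉葉誠, and 吳鴻材 supported and encouraged one another. I also pay tribute to all the members of 強大隊, who spent holidays eating, drinking, and having fun with me and made my doctoral life colorful. I thank 顏雅慧 for accompanying me along the way. Finally, I especially thank my parents, my aunt, and my younger brother, who are my source of material and spiritual strength. It is thanks to all of them that I was able to complete this dissertation.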

Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Multiple Description Coding
1.2 Joint Playout and FEC Control
1.3 Iterative Source-Channel Decoding
1.4 Dissertation Organization
Chapter 2 Multi-Stream Transmission System and Quality Prediction Model
2.1 Multi-Stream Voice Transmission over a Packet-Erasure Channel
2.2 Multi-Stream Voice Quality Prediction Model
2.3 Experimental Results
Chapter 3 QoS Control for Multi-Stream
