Solution Strategies for Linear Inverse Problems in Spatial Audio Signal Processing

(1)

applied

sciences

Article

Solution Strategies for Linear Inverse Problems in

Spatial Audio Signal Processing

Mingsian R. Bai1,*, Chun Chung1, Po-Chen Wu1, Yi-Hao Chiang1and Chun-May Yang2 1 _{Department of Power Mechanical Engineering, National Tsing Hua University, No. 101, Section 2,}

Kuang-Fu Road, Hsinchu 30013, Taiwan; [email protected] (C.C.); [email protected] (P.-C.W.); [email protected] (Y.-H.C.)

2 _{Department of Electrical Engineering, National Chiao Tung University, No. 1001, Ta-Hsueh Road,} Hsinchu 30013, Taiwan; [email protected]

* Correspondence: [email protected]; Tel.: +886-3-5742915

Academic Editors: Woon Seng Gan and Jung-Woo Choi

Received: 30 March 2017; Accepted: 26 May 2017; Published: 5 June 2017

Abstract:The aim of this study was to compare algorithms for solving inverse problems generally encountered in spatial audio signal processing. Tikhonov regularization is typically utilized to solve overdetermined linear systems in which the regularization parameter is selected by the golden section search (GSS) algorithm. For underdetermined problems with sparse solutions, several iterative compressive sampling (CS) methods are suggested as alternatives to traditional convex optimization (CVX) methods that are computationally expensive. The focal underdetermined system solver (FOCUSS), the steepest descent (SD) method, Newton’s (NT) method, and the conjugate gradient (CG) method were developed to solve CS problems more efficiently in this study. These algorithms were compared in terms of problems, including source localization and separation, noise source identification, and analysis and synthesis of sound fields, by using a uniform linear array (ULA), a uniform circular array (UCA), and a random array. The derived results are discussed herein and guidelines for the application of these algorithms are summarized.

Keywords: inverse problems; Tikhonov regularization; compressive sensing (CS); convex optimization (CVX); focal underdetermined system solver (FOCUSS); steepest descent (SD); Newton’s method (NT); conjugate gradient (CG); golden section search (GSS)

1. Introduction

Numerous inverse problems exist in the field of acoustics. For example, nearfield acoustic holography (NAH) is a noise source identification method that reconstructs a surface field of the source on the basis of sound pressure measured in the nearfield of the source [1–5]. The deconvolution approach for the mapping of acoustic sources (DAMAS) is also a method for noise source identification [6]. Another example is the source signal separation problem, where an individual source signal is to be extracted from a mixed array of signals [7]. Inverse problems can also be found in source sound field synthesis (SFS) problems, where the sound field produced by secondary sources is to be matched with a target field [8,9]. Other examples include sound field control [10,11], crosstalk cancellation in binaural audio rendering [12], noise reduction in speech enhancement [13], room response equalization, and dereverberation [14,15]. In linear range acoustics, each of these problems can be formulated as a linear system (Ax = b). The current study focused on the solutions of farfield noise source identification problems, sound source localization and separation problems, and sound field analysis (SFA) and synthesis (SFS) problems. Although inverse solutions of acoustic problems have long been investigated by researchers, according to our review of the literature, no conclusive results that give solution strategies and parameter choice guidelines can be found in the

(2)

Appl. Sci. 2017, 7, 582 2 of 29

literature. Furthermore, although audio quality is the chief concern in practical audio reproduction, most previous research has examined general numerical accuracy and stability. This study explored these problems from the perspective of reproduced signal quality. Solution strategies were compared in a unified framework, and guidelines of parameter choice are summarized herein.

In general, inverse problems can be divided into two categories: overdetermined and square systems, and underdetermined systems. To solve overdetermined systems, least-squares methods, Tikhonov regularization (TIKR) [16], and truncated singular value decomposition (TSVD) are commonly used. Traditionally, Morozov’s discrepancy principle, generalized cross-validation (GCV), and the L-curve method can be used to choose the regularization parameter in the TIKR method [17–20]. However, solution methods that are better suited to audio applications than conventional approaches are proposed in this work. In particular, golden section search (GSS) [21] is applied to find optimal regularization parameters. To solve underdetermined problems, compressive sampling (CS) [22,23] solved by using convex optimization (CVX) [24–26] is a widely used approach that is known to be computationally expensive. In the present study, computationally efficient iterative approaches that incorporate sparsity constraints, including the focal underdetermined system solver (FOCUSS) [27], steepest descent (SD), Newton’s (NT) method, and conjugate gradient (CG), were developed. These algorithms were compared for several audio application scenarios. The first scenario was sound source localization and separation using a uniform linear array (ULA) and a uniform circular array (UCA). To assess the separation quality, perceptual evaluation of audio quality (PEAQ), perceptual evaluation of speech quality (PESQ), and segmental signal-to-noise ratio (segSNR) were adopted [13,28]. The second scenario was concerned with analyses and syntheses of sound fields. Recently, an integrated array system was developed on the basis of a freefield model for spatial audio recording and reproduction [8,9]. This study extended the previous work to a reverberant environment; a live room was fitted with reflective walls. For the SFA, a 24-element circular microphone array was utilized to encode the sound field based on plane-wave decomposition, whereas in the SFS, a 32-element rectangular loudspeaker array was employed to decode the encoded sound field using three approaches. The third scenario was sound source localization and separation using a random array.

2. Inverse Solution Algorithms

In this section, an array model is given, along with its assumptions. Assume that the sound sources are at the farfield of the microphone array such that sound waves impinging on the array can be regarded as plane waves. In the following array model, time-harmonic dependence ejωt, j=√−1 and ω as angular frequency, is assumed so that the model is essentially formulated in the frequency domain. M microphones and N sources are considered. The array pressure vector can be expressed as [29].

p=As+v, (1)

where p ∈ CM _{is the sound pressure vector received by the microphone, s} _{∈ C}N _{is the source}

amplitude vector, v is the noise vector, and A ∈ CM×N_{is the steering matrix associated with the}

sources. Therefore, given the pressure measurement p and the steering matrix A, solving the problem of Equation (1) for the unknown source amplitude vector s is a linear inverse problem. Linear inverse problems can be divided into three categories: square systems (M = N), overdetermined systems (M > N), and underdetermined systems (M < N). In the following, solution strategies are presented for these categories.

2.1. Overdetermined or Square Systems 2.1.1. TSVD and Least-Squares Problems

The most basic approach [16–18] to solve linear inverse problems is the least-squares method in which the following cost function is minimized:

(3)

J= kek2₂=eHe, (2) where e = p−Asdenotes the error vector and “k · k₂” denotes the vector 2-norm. If matrix A is of full-column rank, the least-squares solution can be written as

ˆs= [AHA]−1AHp. (3)

More generally, by the TSVD of

A=UΣVH

ˆs=A+p, (4)

whereΣ represents a diagonal matrix with singular values at its diagonal entries, U and V represent unitary matrices, and A+represents the pseudoinverse of the matrix A defined as

A+=VΣ+UH, (5)

whereΣ+=diaghσ₁−1, . . . , σr−1, 0, . . . , 0 i

∈ CN×M_{, r}₌_rank(_A₎_[₃₀_,₃₁_{]. Note that the expression} of Equation (4) is sufficiently general that it always provides the minimum-norm least-squares solution, even when the matrix A is rank-deficient. The square of the residual error becomes

e2LS= kp−Aˆsk22= k(I−AA+)pk 2 2= M

∑

i=r+1 u H i p 2 (6)

with uibeing the ith column of U and I being the identity matrix.

In practice, the matrix A can contain small singular values and can be very ill-conditioned, which leads to numerical instability during the inversion of A. Two common methods to cope with the ill-posedness of the problem are TSVD and TIKR. Briefly, the TSVD method is simply to discard small singular values of the matrix A, whereas TIKR involves attempts to minimize the following cost function [16]:

J= kAˆs−pk2₂+β2ksk2₂, (7)

where the regularization parameter β weights the residual norm and the solution norm. After some manipulation, we derive with the following optimal solution:

ˆs=AHA+β2I −1

AHp (8)

This result can be rewritten in terms of the TSVD of A as follows:

ˆs= (AHA+β2I)−1AHp= N

∑

i=1 fiσ_i−1αivi, (9) where A=USVH = _∑N i=1

σiuivH_i , with uiand vibeing the ith column partition of the matrices U and V,

αi =u_iHp, where fi(β) = σ 2 i σ_i2+β2 = 1 1+ (β/σi)2 (10)

denotes the filter function.

It can also be shown that the minimum residual vector can be written as

p−Aˆs=

N

∑

i=1

(4)

Appl. Sci. 2017, 7, 582 4 of 29

where r⊥= (I−AA+)p = M

∑

i=M+1

αiui is the residual vector of the components of p orthogonal to

{u1· · ·un}. The residual norm can be written as

kp−Aˆxk2₂= n

∑

i=1 (1−fi)2|αi|2+ m

∑

i=n+1 |αi|2= n

∑

i=1 (1−fi)2|αi|2+ kr⊥k22 (12)

From Equation (9), the solution 2-norm can be written as

kˆsk2₂=

N

∑

i=1

f_i2σ_i−2|αi|2 (13)

2.1.2. Choice of Regularization Parameter β

In traditional solution strategies for inverse problems, methods are available for choosing regularization parameters, such as Morozov’s discrepancy principle, the Generalized Cross-Validation (GCV) method, and the L-curve method [19,20]. The first two methods have been discussed extensively in the literature. Therefore, we subsequently focus on only the L-curve method for brevity.

The L-curve method is widely used for choosing regularization parameters in inverse solutions. In the curve, the solution norm is plotted versus the residual norm by varying the regularization parameter. From Equations (9) and (11), it is straightforward to find the solution norm

kˆsk2₂=

N

∑

i=1

f_i2σ_i−2|αi|2 (14)

and the residual norm

kp−Aˆsk2₂= N

∑

i=1 (1−fi)2|αi|2+ M

∑

i=n+1 |r⊥|2 (15)

Regularization helps to improve robustness against system perturbation and measurement noise. Insights can be gained by writing the solution error as

s−ˆs=s−A#b= (s−A#p) −A#e= "_N

∑

i=1 (1− fi) uH_i p σi vi # − N

∑

i=1 fi u_iHe σi vi, (16)

where A#= (AHA+β2I)−1AH,(s−A#p)is the regularization error and A#eis the perturbation error. Hence, when β→0, fi→1 for very well-conditioned problems with high signal-to-noise ratio

(SNR) measurements, the solution error is dominated by the perturbation error and a few high-order modes are filtered out. In this case, the solution norm is sensitive to the choice of β. The solution tends to be undersmoothed and susceptible to measurement noise. Conversely, when β→∞, fi →0

for very ill-conditioned problems with low SNR measurements, the solution error is dominated by regularization error and numerous high-order modes are filtered out. The solution tends to be oversmoothed and fine details such as resolution are thus lost. In this case, the residual norm is sensitive to the choice of β.

The parameter β acts as a weighting factor between the residual norms and the solution norm. Choosing an appropriate β to strike the balance between these two terms is vital. However, conventional approaches such as GCV and the L-curve method do not always provide satisfactory results in this situation. In this paper, a new method is proposed to facilitate the choice of the regular parameter β for the TIKR method.

Setting the gradient of the cost function of the TIKR method, J= kp−Ask2₂+β2ksk2₂, to zero leads to the normal equation

(5)

Without loss of generality, assume A is of full-column rank. Note that AHA+β2I=VhΣHΣ+β2I i VH =Vdiagh(σ₁2+β2), . . . ,(σ_N2 +β2) i VH

which has an effective condition numberq(σ₁2+β2)/(σ2_N+β2). Therefore, if we want the condition number to be τ after regularization, we must require

τ2= σ

2 1+β2

σ2_N+β2 (18)

Let κ be the condition number of A; that is, κ = σ1/σM. In general, for very ill-conditioned

systems, κτ1, and β= s σ₁2−τ2σ2_M τ2−1 ≈ s σ₁2−τ2(σ₁2/κ2) τ2 = s σ₁2 κ2τ2(κ 2₋_τ2_{) ≈} σ1 τ (19)

Therefore, the regularization parameter β can be chosen to be the maximal singular value σ1

of A divided by the regularized condition number τ. For example, one may choose that τ = 100, which causes 40-dB potential loss of SNR in the inverse solution. Normally, A tends to be ill-conditioned at low frequencies. Choosing a frequency-independent β may suffice for the worst-case scenario. Thus, the regularization parameter β is chosen according to the maximal condition number at a selected low frequency (100 Hz is selected in the following simulation). Next, a coarse search is performed by varying β in orders of 10. A potential interval in which an optimal β may exist is located by observing how an objective function, such as Perceptual Evaluation of Speech Quality (PESQ) [29], varies with β. Finally, a fine-grained search is performed in the potential interval by using the Golden Section Search (GSS) algorithm [21].

GSS is an optimization technique suited for finding the extremum of a unimodal function. It is a simple bracketing method that does not require evaluation of the gradient of the cost function. In each search, a probe point must be selected within the left and right brackets according to the golden ratio. The golden ratio can be defined as

ϕ= 1+ √

5

2 =1.618 , ϕ is the golden ratio

Let f (β) be the objective function (PESQ in our case) for which we wish to find the optimal β that maximizes f (β). First, f (β) has been evaluated at two points, β1and β3. The maximizing value is

between β1and β3. The golden ratio can be used to find β2and β4. β2and β4can be written as

β2 = β3− (β3− β1)/ϕ, β4 = β1+ (β3 − β1)/ϕ

If f (β2) is larger than f (β4), a maximum clearly lies in the interval between β1and β4. Therefore,

the new β3is equal to β4.If f (β2) is smaller than f (β4), a maximum lies in the interval between β2

and β3. Therefore, the new β1is equal to β2. Figure1shows the Schematic of golden section search.

The process is repeated until the gap between β2and β4is small. The iteration process stops when the

beta converges within a prespecified tolerance window (0.0001 in our case). The optimal beta βoptcan

be written as

βopt= (β2+β4)/2

(6)

Appl. Sci. 2017, 7, 582 6 of 29 β2 = β3− (β3− β1)/ϕ β4 = β1+ (β3 − β1)/ϕ while|β2 − β4|>ε i f f(β4) < f(β2) β3 = β4 else β1 = β2 end βopt = (β2+β4)/2

Let f(β) be the objective function (PESQ in our case) for which we wish to find the optimal β that maximizes f(β). First, f(β) has been evaluated at two points, β1 and β3. The maximizing value is

between β1 and β3. The golden ratio can be used to find β2 and β4. β2 and β4 can be written as

2 3 ( 3 1) /

      , 4= + 1 ( ) 3  1 /

If f(β2) is larger than f(β4), a maximum clearly lies in the interval between β1 and β4. Therefore,

the new β3_{is equal to β}4._{If f(β}2_{) is smaller than f(β}4_{), a maximum lies in the interval between β}2_{and β}3_.

Therefore, the new β1 is equal to β2. Figure 1 shows the Schematic of golden section search. The

process is repeated until the gap between β2 and β4 is small. The iteration process stops when the beta

converges within a prespecified tolerance window (0.0001 in our case). The optimal beta opt can

be written as

2 4

( ) / 2 opt

    The algorithm is summarized as the following pseudocode:

2 3 3 1 4 1 3 1 2 4 4 2 3 4 1 2 2 4 ( ) / = + ( ) / > ( ) ( ) = = ( ) / 2 opt while if f f else end                              

Figure 1. Schematic of golden section search. If f(β2) is higher than f(β4), ₃ is equal to β4 in the next

iteration.

Therefore, this study developed a procedure for choosing optimal regularization parameter beta. The procedure involves four steps as follows:

Figure 1. Schematic of golden section search. If f (β2) is higher than f (β4), β03is equal to β4 in the next iteration.

Therefore, this study developed a procedure for choosing optimal regularization parameter beta. The procedure involves four steps as follows:

• Step 1. Select τ according to a condition number threshold.

• Step 2. Select a constant β=σ1/τ as an initial guess. For a frequency-domain design, it may be

necessary to choose a frequency-independent β for the worst-case scenario.

• Step 3. Perform a coarse search by running the simulation forward and backward with 10 s powers of β. Locate a potential interval in which an optimal β may exist by observing the trend of an objective function, such as PESQ, with respect to β.

• Step 4. Perform a fine search by using optimization methods such as the GSS algorithm to find the optimal regularization parameter β.

2.2. Underdetermined Systems

In this section, algorithms are presented for underdetermined problems, where the number of microphones (M) is lower than the number of potential sources (N). In this case, the solution is generally not unique unless we impose constraints. Although a pseudoinverse gives a unique minimum-norm least-squares solution, the resolution of the solution is generally not favorable because the solution error tends to be evenly distributed among all entries. Instead, we impose sparsity as the constraint to limit the cardinality (nonzero entries) of the solutions in the study, which suggests that pruning procedures of some sort must be incorporated into the iteration process.

2.2.1. CVX Algorithms

An underdetermined problem with sparse solutions can be written as the following CS problem: min

s ksk1 st. kAs−pk2≤ε, (20)

wherek · k₁denotes the vector 1-norm. Numerous methods are available for solving this constrained optimization problem [22,23]. Suppose that the noise energy is constrained within a threshold ε that

(7)

can be selected with reference to the aforementioned least-squares solution. This problem can be solved numerically by CVX. Freeware was adopted to conduct CVX in this study [24–26].

2.2.2. Iterative Approaches

In the following, we apply four iterative algorithms to solve underdetermined problems. The first method is Focal Underdetermined System Solver (FOCUSS) [27], which is an iterative technique well suited for finding sparse solutions to underdetermined linear systems. The algorithm has two integral parts: a low-resolution initial estimate of the real signal and the iteration process that refines the initial solution to the final localized solution. Because the system is underdetermined, the sensors are more numerous than the sources. In this case, we assume our dictionary contains 36 sources. These sources are located at 5◦intervals from the x label. Actually, this case has only three sources. Therefore, if the result is perfect, our final solution has only three nonzero solutions.

The FOCUSS algorithm can be summarized in three steps,

Wk= [diag(sk−1)] (21)

qk= (AWk)+p (22)

sk =Wkqk (23)

In Equation (21),[diag(sk−1)]converts the vector sk−1into a diagonal weight matrix. The TIKR

solution is used as the initial condition. Similar to other fixed-point iteration methods, the algorithms converge within finite numbers of iterations to the sparse solution with appropriate initial conditions.

The large term in the weighting reduces the 2-norm of q

kW+spk2= kqk2= n

∑

i=1,ωi6=0 (spi wi ) 2 (24)

The relatively large entries in W reduce the contributions of the corresponding elements of spto

the cost, and the solution is nonzero in the source direction. The pseudoinverse in Equation (22) can also be implemented by using the TIKR method. Therefore, it can also be written as

qk = (WHk−1AHAWk−1+β2I)−1W_k−1H AHp (25) The FOCUSS-TIKR method is robust to noise because of the regularization parameter β. The iteration process stops when the solution converges within a prespecified tolerance window (0.0001 in our case).

2.2.3. Iterative Approaches: Promote Sparsity by Pruning

CVX algorithms can solve CS problems, but these algorithms are known to be very computationally expensive, which prevents their use in real-time processing. To address this challenge, several iterative approaches are proposed as follows.

In these iterative techniques, the quadratic residual function

J(s) =1 2kAs−pk 2 2= 1 2 h sHAHAs−pHAs−sHAHp+pHpi (26)

is to be minimized. The key step that executes the “compressive sampling” is a pruning process that must be incorporated into each iteration to promote sparsity, as illustrated in Figure2. First, several main peaks as well as sidelobes may appear in the source diagram. We reset all elements in the source vector s below a prespecified threshold (smax−D) to zero. D is initially set to a very small value D0;

(8)

Appl. Sci. 2017, 7, 582 8 of 29 2 2 ₂ 1, 0

(

)

i n pi p i 

w

i   





s

W s

q

(24)

The relatively large entries in W reduce the contributions of the corresponding elements of

s

p

to the cost, and the solution is nonzero in the source direction. The pseudoinverse in Equation (22) can also be implemented by using the TIKR method. Therefore, it can also be written as

2 1

1 1 1

(

H H

)

H H

k



k k





 k

q

W A AW

I W A p

(25)

The FOCUSS-TIKR method is robust to noise because of the regularization parameter β. The iteration process stops when the solution converges within a prespecified tolerance window (0.0001 in our case).

2.2.3. Iterative Approaches: Promote Sparsity by Pruning

CVX algorithms can solve CS problems, but these algorithms are known to be very computationally expensive, which prevents their use in real-time processing. To address this challenge, several iterative approaches are proposed as follows.

In these iterative techniques, the quadratic residual function 2 2

1

2

H H H H H H

J

(s) As p









_

s A As p As s A p p p







_

(26)

is to be minimized. The key step that executes the “compressive sampling” is a pruning process that must be incorporated into each iteration to promote sparsity, as illustrated in Figure 2. First, several main peaks as well as sidelobes may appear in the source diagram. We reset all elements in the source vector s below a prespecified threshold (

s

max



D

) to zero. D is initially set to a very small value

D

0

; it is then increased incrementally by



D

in each iteration, typically by the same



D

in every step.

Figure 2. Pruning process to promote sparsity of inverse solution. Figure 2.Pruning process to promote sparsity of inverse solution.

The iterative pruning process is summarized with a flowchart in Figure 3. The stopping condition is

D>Dmaxork∇J(s)k22≤0.001 (27)

in which∇J(s)is the gradient of the quadratic residual function J(s). Three approaches were employed in this study to update the solutions in the iterative CS algorithms such that each quadratic residual function J(s)is minimized.

Appl. Sci. 2017, 7, 582 9 of 30

The iterative pruning process is summarized with a flowchart in Figure 3. The stopping condition is

2

max or .002 0 1

D  D J(s)  (27)

in which

 (s)

J

is the gradient of the quadratic residual function

J(s)

. Three approaches were employed in this study to update the solutions in the iterative CS algorithms such that each quadratic residual function

J(s)

is minimized.

Figure 3. Flowchart of iterative compressive sampling algorithms (adapted from reference [5]). Steepest Decent (SD) Method

The SD algorithm is based on the notion that the search direction at each iteration is the negative gradient of the cost function in Equation (26) for minimization problems. The gradient vector of the quadratic residual function is given by

( )

(

H H

)

H

J

 

 





w

s

A As A p

A r

, (28)

where

r p As

 

is the residual vector. The new solution s is updated as



     

s

w

(29)

where μ is the step size. To determine the optimal step size



, let the vector g be



g

A w

(30)

It can be shown after some algebraic manipulations that 2

1 ( )

(

2 )

2

H H H

F

s

 



g g





g r r r



(31)

(9)

Steepest Decent (SD) Method

The SD algorithm is based on the notion that the search direction at each iteration is the negative gradient of the cost function in Equation (26) for minimization problems. The gradient vector of the quadratic residual function is given by

w= −∇J(s) = −(AHAs−AHp) =AHr, (28) where r=p−Asis the residual vector.

The new solution s0is updated as

s0=s+∆s=s+µw (29)

where µ is the step size. To determine the optimal step size µ, let the vector g be

g=Aw (30)

It can be shown after some algebraic manipulations that

F(s0) = 1 2(µ

2_gH_g₋_2µgH_r₊_rH_r₎ ₍₃₁₎

From Equation (31), the step size µ to minimize F(s0)along the direction w can be found by setting the derivative of F(s0)with respect to µ to zero. Consequently, we obtain

µ= kwk

2 2

kgk2₂ (32)

Newton’s Method

The NT method is also an iterative method gradient search. Recall the gradient of the cost function is

∇F(s) =v= (AHAs−AHp) = −AH(p−As) = −AHr, (33) where r is as defined before. Setting this gradient to zero leads to the optimal solution

s0= (AHA)−1AHp=A+p, (34)

where A+= (AHA)−1AH. Next, multiplying(AHA)−1on both sides of v yields

(AHA)−1v= (AHA)−1(AHAs−AHp) =s− (AHA)−1AHp=s−A+p = −(AHA)−1AHr= −A+r

or

−A+r=s−A+p=s−s0 Hence, the solution can be updated as

s0 =s+A+r (35)

In practical implementation, the pseudoinverse A+ is usually of the form of TIKR, namely A+= (AHA+β2I)−1AH, because AHAis singular for underdetermined problems. As a refinement of the algorithm, a step size µ can be used to rewrite the update equation as

(10)

Appl. Sci. 2017, 7, 582 10 of 29

Conjugate Gradient (CG) Method

The CG method is an iterative algorithm well suited for the numerical solution of systems of linear equations, associated with symmetric and positive-definite matrices. Instead of the negative gradient used in the SD method, which occasionally causes zigzag convergence, the search direction of the CG method is a linear combination of the current negative gradient and the previous search direction. Development of the CG algorithm is based on nested Krylov subspaces. For details, we refer the interested readers to Ref. [32–34]. For brevity, we only summarize the CG algorithm with the following pseudocode: r0=p−As0 p0=r0 k=0 f or α_k= r T krk pT kApk sk+1 =sk+αkpk rk+1 =rk−αkApk βk= rT k+1rk+1 rT krk pk+1=rk+1+βkpk k=k+1 end

To put the system of equations in Equation (1) into a more tractable form, we multiply by AHon both sides, which leads to the normal equations

AHp=AHAsp (37)

This equation is equivalent to finding the vector s for which the gradient of F(s) equals zero.

∇F(s) = −AH(p−As) =0 (38)

3. Comparison of Algorithms

This section presents the application of the preceding algorithms to acoustic source localization and separation problems through numerical simulation. These algorithms were compared for three example problems, with the aid of a uniform linear array (ULA) and a random array. In addition, an inverse solution involved in sound field analysis (SFA) and sound field synthesis (SFS) in spatial audio was investigated. Microphone data are synthetic and generated by the model of Equation (1). 3.1. Uniform Linear Array

In the numerical simulation shown in Figure4, 10-microphone ULA was utilized to separate the signals emitted by three sources. The sources were located at the far field such that the plane wave assumption was valid. The spacing between adjacent microphones was 10 cm. This simulated underdetermined system contained 36 sources as our dictionary. We separated the signals and localized the signals in one stage. TIKR, FOCUSS, and CS-CVX algorithms were used to solve these inverse problems. Figure5shows the condition numbers of different frequencies. The problem is ill-conditioned at low frequencies. Source localization results are shown in Figure6.

(11)

Appl. Sci. 2017, 7, 582 11 of 29

Appl. Sci. 2017, 7, 582 12 of 30

Figure 4. Numerical simulation of 10-microphone uniform linear array utilized to separate the signals

emitted by three sources. The first source, located at





45

, was broadcasting a male speech signal. The second source, located at





90

, was broadcasting a female speech signal. The third source, located at





135

, was broadcasting a music signal.

Figure 5. Condition number versus frequency.

(a) 0 1000 2000 3000 4000 5000 6000 7000 8000 101 102 103 104 105 Frequency(Hz) C o nd it io n n u m ber TIKR spectrum Angle(degree) Fr eq ue nc y( H z) 20 40 60 80 100 120 140 160 180 0 1000 2000 3000 4000 5000 6000 7000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Figure 4.Numerical simulation of 10-microphone uniform linear array utilized to separate the signals emitted by three sources. The first source, located at θ=45◦, was broadcasting a male speech signal. The second source, located at θ = 90◦, was broadcasting a female speech signal. The third source, located at θ=135◦, was broadcasting a music signal.





45 



90 



135

(a) 0 1000 2000 3000 4000 5000 6000 7000 8000 101 102 103 104 105 Frequency(Hz) C o nd it io n n u m ber TIKR spectrum Angle(degree) Fr eq ue nc y( H z) 20 40 60 80 100 120 140 160 180 0 1000 2000 3000 4000 5000 6000 7000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Figure 5.Condition number versus frequency.





45 



90 



135

(a) 0 1000 2000 3000 4000 5000 6000 7000 8000 101 102 103 104 105 Frequency(Hz) C o nd it io n n u m ber TIKR spectrum Angle(degree) Fr eq ue nc y( H z) 20 40 60 80 100 120 140 160 180 0 1000 2000 3000 4000 5000 6000 7000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Figure 6. Cont.

(12)

Appl. Sci. 2017, 7, 582 12 of 29 Appl. Sci. 2017, 7, 582 13 of 30 (b) (c) (d) 0 20 40 60 80 100 120 140 160 180 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1 TIKR frequency averaged spectrum

Angle(degree) N o rm ali zed m ag n itu d e FOCUSS spectrum Angle(degree) F req uen cy(H z) 20 40 60 80 100 120 140 160 180 0 1000 2000 3000 4000 5000 6000 7000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 20 40 60 80 100 120 140 160 180 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 X: 135 Y: 0.8152

FOCUSS frequency averaged spectrum

Angle(degree) No rm alized m ag n it u d e X: 45 Y: 0.7183 X: 90 Y: 1 Figure 6. Cont.

(13)

Appl. Sci. 2017, 7, 582 14 of 30

(e)

(f)

Figure 6. Source localization results obtained using the focal underdetermined system solver (FOCUSS) algorithm and a uniform linear array. (a) Tikhonov regularization (TIKR) spectrum (b) TIKR averaged spectrum (c) FOCUSS spectrum (d) FOCUSS frequency-averaged spectrum. (e) Compressive sampling-convex optimization (CS-CVX) spectrum (f) CS-CVX frequency-averaged spectrum.

The separation results obtained using TIKR, FOCUSS, and CS-CVX are summarized in Table 1. PESQ is an objective test measure for speech quality evaluation. It is a full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal. The mean opinion score (MOS) is calculated on the basis of PESQ ranging from 1 to 5; MOS signifies the difference in speech quality between the clean and the separated signals, which is affected by separation performance and signal distortion. The segmental SNR (segSNR) is defined as 2 10 2 1

(n)

1 10log

ˆ(n)

(n)

k k N n frame seg k n frame

SNR

N

  











_

_













_

s

(39)

Angle(degree)

Fr

eque

nc

y(

H

z)

CS spectrum

20 40 60 80 100 120 140 160 180 0 1000 2000 3000 4000 5000 6000 7000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 20 40 60 80 100 120 140 160 180 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Angle(degree) N o rm al iz ed m a g n it u de

CS-CVX frequency averaged spectrum

Figure 6.Source localization results obtained using the focal underdetermined system solver (FOCUSS) algorithm and a uniform linear array. (a) Tikhonov regularization (TIKR) spectrum (b) TIKR averaged spectrum (c) FOCUSS spectrum (d) FOCUSS frequency-averaged spectrum. (e) Compressive sampling-convex optimization (CS-CVX) spectrum (f) CS-CVX frequency-averaged spectrum.

The separation results obtained using TIKR, FOCUSS, and CS-CVX are summarized in Table1. PESQ is an objective test measure for speech quality evaluation. It is a full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal. The mean opinion score (MOS) is calculated on the basis of PESQ ranging from 1 to 5; MOS signifies the difference in speech quality between the clean and the separated signals, which is affected by separation performance and signal distortion. The segmental SNR (segSNR) is defined as SNRseg= 1 N N

∑

k=1 10 log₁₀     ∑ n∈ f ramek |s(n)|2 ∑ n∈ f rame_k |ˆs(n) −s(n)|2     (39)

(14)

Appl. Sci. 2017, 7, 582 14 of 29

The segSNR correlates with the effect of noise reduction. The FOCUSS-PINV algorithm was observed to achieve the highest score in PESQ and segSNR (Table1), although it required more computation time than TIKR and FOCUSS-TIKR.

Table 1.Separation performance of the Tikhonov regularization (TIKR) and focal underdetermined system solver (FOCUSS) methods for three sources in the underdetermined system.

Methods TIKR FOCUSS-PINV FOCUSS-TIKR CS-CVX

PESQ Source 1 2.034 3.966 3.350 3.251 Source 2 1.696 3.146 2.879 2.783 Source 3 1.912 3.818 3.394 3.347 segSNR Source 1 0.554 11.92 11.27 8.817 CPU time (s) 53 810 487 16302

In our previous simulation, we simulated microphones that had no noise; therefore, our regularization parameter was very close to zero. In the current simulation, our microphones did have white noise with a magnitude equal to the magnitude of the microphone signals divided by 100. Therefore, the potential loss of SNR was 40 dB.

The regularization parameter is chosen by the maximal singular σ1 value dividing 100 in

100 Hz. In our case, the maximal singular value was equal to 5.32. Next, a coarse search was performed by varying β in orders of 10 and using the GSS algorithm to find the optimal regularization parameter. In this case, the optimal regularization parameter was 0.0174, and the FOCUSS-PINV of the robustness to the noise was very ineffective because PINV artifacts caused discontinuities in regularization. CS-CVX was sensitive to the noise, and the segSNR and PESQ achieved lower scores than FOCUSS-TIKR did, as listed in Tables 2and 3(20dB). Therefore, noise was present, and FOCUSS-TIKR was the best choice in this case. It outperformed PESQ and segSNR.

Table 2.Separation performance of the Tikhonov regularization (TIKR), focal underdetermined system solver (FOCUSS), and compressive sampling-convex optimization (CS-CVX) methods for three sources with additive white noises (40dB SNR).

Methods TIKR FOCUSS-PINV FOCUSS-TIKR CS-CVX

PESQ Source 1 2.135 1.170 2.941 2.412 Source 2 1.810 1.121 2.234 1.321 Source 3 1.723 1.168 2.841 2.241 segSNR Source 1 0.52 −10.00 1.558 0.946 CPU time (s) 58 817 497 16431

Table 3.Separation performance of the TIKR, FOCUSS and CS-CVX methods for three sources with additive white noises (20dB SNR).

Methods TIKR FOCUSS-PINV FOCUSS-TIKR CS-cvx

PESQ Source 1 1.512 1.168 2.512 1.436 Source 2 1.401 1.118 2.012 1.712 Source 3 1.489 1.044 2.487 1.324 segSNR Source 1 −0.12 −13.11 0.12 −1.112 CPU time (s) 57 815 494 16421 3.2. Random Array

A simulation was conducted for localization and separation of two point sources located at (0, 0,−1 m) and (0.8, 0.3,−1 m), both of which were emitting clean speech signals, as illustrated

(15)

in Figure7. A 30-element random array with aperture dimension 0.48 m×0.4 m situated at z = 0 was utilized to capture the signals emitted by these two sources. To set up the propagation matrix, 100 (10×10) equivalent sources were distributed on the image plane located 1 m away from the array. Figure8shows the condition numbers of different frequencies. The problem is ill-conditioned at low frequencies.

Appl. Sci. 2017, 7, 582 16 of 30

equivalent sources were distributed on the image plane located 1 m away from the array. Figure 8 shows the condition numbers of different frequencies. The problem is ill-conditioned at low frequencies.

Figure 7. Arrangement for simulation of localization and separation of two sources.

Figure 9a–f show the source localization results obtained using six approaches. Two sources were correctly located on the noise map with varying degrees of resolution by all methods. A conventional method delay and sum (DAS) method (a) gave the poorest resolution, whereas the CS-CVX method provided the highest resolution. The SD method (d) and CG method (f) were acceptable but did not perform quite as well as the CS-CVX method. The NT method (e) yielded accurate source locations with a slightly increased sidelobe level, but it was the most computationally efficient (Table 2). In general, 0 1000 2000 3000 4000 5000 6000 7000 8000 100 102 104 106 108 1010 Frequency(Hz) C on di tion num be r

Figure 7.Arrangement for simulation of localization and separation of two sources.

Appl. Sci. 2017, 7, 582 16 of 30

equivalent sources were distributed on the image plane located 1 m away from the array. Figure 8 shows the condition numbers of different frequencies. The problem is ill-conditioned at low frequencies.

Figure 7. Arrangement for simulation of localization and separation of two sources.

Figure 9a–f show the source localization results obtained using six approaches. Two sources were correctly located on the noise map with varying degrees of resolution by all methods. A conventional method delay and sum (DAS) method (a) gave the poorest resolution, whereas the CS-CVX method provided the highest resolution. The SD method (d) and CG method (f) were acceptable but did not perform quite as well as the CS-CVX method. The NT method (e) yielded accurate source locations with a slightly increased sidelobe level, but it was the most computationally efficient (Table 2). In general, 0 1000 2000 3000 4000 5000 6000 7000 8000 100 102 104 106 108 1010 Frequency(Hz) C on di tion num be r

Figure 8.Condition number versus frequency.

Figure9a–f show the source localization results obtained using six approaches. Two sources were correctly located on the noise map with varying degrees of resolution by all methods. A conventional method delay and sum (DAS) method (a) gave the poorest resolution, whereas the CS-CVX method provided the highest resolution. The SD method (d) and CG method (f) were acceptable but did not perform quite as well as the CS-CVX method. The NT method (e) yielded accurate source locations with a slightly increased sidelobe level, but it was the most computationally efficient (Table2).

(16)

Appl. Sci. 2017, 7, 582 16 of 29 Appl. Sci. 2017, 7, 582 17 of 30 (a) (b) (c) Figure 9. Cont.

(17)

(d)

(e)

(f)

Figure 9. Localization results of two point sources. (a) Delay and sum algorithm, (b) Tikhonov

regularization algorithm, (c) compressive sampling-convex optimization algorithm, (d) steepest descent algorithm, (e) Newton’s algorithm, and (f) conjugate gradient algorithm.

Figure 9. Localization results of two point sources. (a) Delay and sum algorithm, (b) Tikhonov regularization algorithm, (c) compressive sampling-convex optimization algorithm, (d) steepest descent algorithm, (e) Newton’s algorithm, and (f) conjugate gradient algorithm.

(18)

Appl. Sci. 2017, 7, 582 18 of 29

Table4presents a comparison of the separation results obtained using five methods. CS-CVX displayed the highest scores for PESQ and segSNR, despite being extremely time-consuming. Iterative CS approaches were determined to be far more computationally efficient than the CS-CVX method. The CG method attained the highest PESQ, but the lowest segSNR. This suggests that the favorable separation performance of the CG method comes at the price of signal distortion. The SD and NT methods demonstrated acceptable PESQ and high segSNR. Although signals were not perfectly separated by using these two methods, the incurred distortion was minor. In general, methods present a trade-off between separation performance and signal distortion. Table5shows the separation results with additive noise (SNR = 28 dB). All the methods were observed to suffer from the interference of noise; consequently, the values of PESQ and segSNR were notably low. However, all the methods were determined to be robust to noise. The present study also considered mismatches between equivalent sources (dictionaries) and real sources. Table6shows the separation results of the NT method with different levels of mismatch. Mismatch means that the real source is not precisely on the deployed source location (dictionary). Extreme mismatch means the real source is exactly on the center of four nearest the deployed source location. More descriptions are added to the revised manuscript. Unless the source was just at the center of the near dictionaries, the separation performance was high and not influenced by noise.

Table 4.Separation performance of five algorithms for two speech sources.

Methods TIKR CS-CVX SD NT CG PESQ Source 1 1.99 3.12 2.39 2.48 2.76 Source 2 2.60 3.31 2.77 2.83 3.13 segSNR Source 1 2.15 7.54 5.08 6.04 0.50 Source 2 2.73 8.24 7.02 7.26 1.53 CPU time (s) 201 31065 377 296 386

Table 5.Separation performance of five algorithms for two speech sources with additive noise.

Methods TIKR CS-CVX SD NT CG PESQ Source 1 1.82 2.83 2.22 2.38 2.59 Source 2 2.12 3.20 2.53 2.53 2.88 segSNR Source 1 1.40 2.11 2.03 4.13 −0.50 Source 2 2.30 6.51 2.09 1.72 1.23 CPU time (s) 211 31348 358 290 307

Table 6. Separation performance of Newton’s algorithm for two speech sources with different mismatch conditions.

Methods without Mismatch with Mismatch Extreme Mismatch

PESQ Source 1 2.52 2.48 2.14

Source 2 2.81 2.83 1.70

segSNR Source 1 6.90 6.04 −0.40

Source 2 6.33 7.26 0.18

3.3. SFA and SFS

Depending on the sparsity of the sound sources, the SFA stage can be implemented in several manners [35–38]. For the sparse-source scenario, a two-stage algorithm is utilized; the source bearings are estimated using the minimum power distortionless response (MPDR) [7] and the associated amplitudes of plane waves are estimated using the TIKR algorithm. For the nonsparse-source scenario, a one-stage algorithm based on the CS-CVX algorithm or the FOCUSS algorithm is employed.

(19)

Appl. Sci. 2017, 7, 582 19 of 29

The SFS stage is carried out using a loudspeaker array to reconstruct the sound field with the source bearing and amplitude obtained in the SFA stage. Pressure matching was employed for the SFS purpose in this study by sampling a large number of virtual control points in the interior area surrounded by the loudspeakers. The pressure matching procedure can be described as the following optimization problem: min ss(ω) kB(ω)sp(ω) −H(ω)ss(ω)k, (40) where sp(ω) = h s1(ω) · · · sP(ω) iT

is the amplitude vector of the Pth primary plane-wave component, ss(ω) =

h

s1(ω) · · · sL(ω) iT

denotes the amplitude vector of the input signals to the L secondary loudspeaker sources, H(ω) ∈ CK×L denotes the room response matrix, and b_d = h

e−jkd·y1 _{· · ·} _e−jkd·yK iT

is the steering vector for the dth primary plane-wave component to the nth control point, yn, n = 1, . . . , K. B(ω) =

h

b1 · · · bd

i

∈ CK×P _{is the steering matrix from}

the plane-wave components obtained in the preceding SFA stage to the control points. Therefore, the optimal solution can be written as

ss(ω) =H#(ω)B(ω)sp(ω), (41)

where “#” symbolizes some type of inverse operation on the matrix H(ω). In this study, TIKR was utilized to calculate the input signal amplitudes to the secondary sources. GSS can be used to find the optimal regularization parameter.

Experiments were conducted to validate the proposed audio analysis and synthesis system. In the SFA stage, a 24-element circular microphone array with a radius of 12 cm was utilized to capture and parameterize the sound field in an anechoic chamber (the recording room), as illustrated in Figure10. In the SFS stage, a rectangular, 32-loudspeaker array was employed to reproduce in a live room (the reproduction room) the sound field previously encoded in the SFA stage. The walls of the room were lined with acoustically reflective boards (Figure11).

The SFS stage is carried out using a loudspeaker array to reconstruct the sound field with the source bearing and amplitude obtained in the SFA stage. Pressure matching was employed for the SFS purpose in this study by sampling a large number of virtual control points in the interior area surrounded by the loudspeakers. The pressure matching procedure can be described as the following optimization problem:  

   

m in , s  p s







s B s H s (40)

where

s

_p

 







s

₁

( )





s

_P

( )





T is the amplitude vector of the Pth primary plane-wave component,

s

_s

 







s

₁

( )





s

_L

( )





T denotes the amplitude vector of the input signals to the L secondary loudspeaker sources,

H

 







K L denotes the room response matrix, and

1 d d K T j j d

e

 





 

k y k y



b



_



is the steering vector for the dth primary plane-wave component to the nth control point,

y

n

,

n



1, ,



K

.

 



1



K P d



_

_



B

b



b



is the steering matrix from the plane-wave components obtained in the preceding SFA stage to the control points. Therefore, the optimal solution can be written as

 

#

     

_,

s







p



s

H

B

s

(41)

where “#” symbolizes some type of inverse operation on the matrix

H

 



. In this study, TIKR was utilized to calculate the input signal amplitudes to the secondary sources. GSS can be used to find the optimal regularization parameter.

Experiments were conducted to validate the proposed audio analysis and synthesis system. In the SFA stage, a 24-element circular microphone array with a radius of 12 cm was utilized to capture and parameterize the sound field in an anechoic chamber (the recording room), as illustrated in Figure 10. In the SFS stage, a rectangular, 32-loudspeaker array was employed to reproduce in a live room (the reproduction room) the sound field previously encoded in the SFA stage. The walls of the room were lined with acoustically reflective boards (Figure 11).

Figure 10. Sound field analysis experimental arrangement in a 5.4 m × 3.5 m × 2 m anechoic room. Figure 10.Sound field analysis experimental arrangement in a 5.4 m×3.5 m×2 m anechoic room.

(20)

Appl. Sci. 2017, 7, 582 20 of 29

Appl. Sci. 2017, 7, 582 21 of 30

Figure 11. Sound field synthesis experimental arrangement in a 3.6 m × 3.6 m × 2 m live room fitted

with reflective walls.

To process microphone output and loudspeaker input signals, multichannel analog-to-digital converters (M-32 AD) and digital-to-analog converters (M-32 DA) (RME, Haimhausen, Germany) were used with a sampling frequency of 16 kHz.

An audio codec system involves three inverse problems, namely the SFA stage, room response modeling, and the SFS stage. The condition numbers are plotted against the frequencies of three steering matrices in Figure 12a–c. Figure 12c indicates that the ill-posedness encountered in the room response modeling procedures must be addressed, with the aid of appropriate regularization methods. Large regularization parameters can increase the robustness of the inverse problem. In this study, we set the regularization parameter to 10.

(a)

Figure 11.Sound field synthesis experimental arrangement in a 3.6 m×3.6 m×2 m live room fitted with reflective walls.

An audio codec system involves three inverse problems, namely the SFA stage, room response modeling, and the SFS stage. The condition numbers are plotted against the frequencies of three steering matrices in Figure12a–c. Figure12c indicates that the ill-posedness encountered in the room response modeling procedures must be addressed, with the aid of appropriate regularization methods. Large regularization parameters can increase the robustness of the inverse problem. In this study, we set the regularization parameter to 10.

Appl. Sci. 2017, 7, 582 21 of 30

Figure 11. Sound field synthesis experimental arrangement in a 3.6 m × 3.6 m × 2 m live room fitted

with reflective walls.

An audio codec system involves three inverse problems, namely the SFA stage, room response modeling, and the SFS stage. The condition numbers are plotted against the frequencies of three steering matrices in Figure 12a–c. Figure 12c indicates that the ill-posedness encountered in the room response modeling procedures must be addressed, with the aid of appropriate regularization methods. Large regularization parameters can increase the robustness of the inverse problem. In this study, we set the regularization parameter to 10.

(a)

(21)

Appl. Sci. 2017, 7, 582 22 of 30

(b)

(c)

Figure 12. Plots of condition numbers versus frequencies of each steering matrix in three inverse problems: (a) sound field analysis stage; (b) room response modeling; and (c) sound field synthesis stage.

In the SFA experiment, loudspeaker sources positioned at the angles





60 , 240



played two 10-s speech clips. After recording the source by CMA, we used three algorithms to extract the source signals. First, we applied the two-stage MPDR and TIKR algorithms. The MPDR spectrum is plotted as a function of angle and frequency in Figure 13a. The resulting frequency-averaged and normalized MPDR spectrum is illustrated in Figure 13b, which peaks at the angles





60 , 240



as desired. The results show that the source was accurately localized using MPDR. Next, the source signals were extracted using the TIKR algorithm. We also applied one-stage CS algorithms and one-stage FOCUSS algorithms, which located sources and separated their amplitudes in a single calculation.

Condition N

umber

Figure 12. Plots of condition numbers versus frequencies of each steering matrix in three inverse problems: (a) sound field analysis stage; (b) room response modeling; and (c) sound field synthesis stage.

In the SFA experiment, loudspeaker sources positioned at the angles θ=60◦, 240◦played two 10-s speech clips. After recording the source by CMA, we used three algorithms to extract the source signals. First, we applied the two-stage MPDR and TIKR algorithms. The MPDR spectrum is plotted as a function of angle and frequency in Figure13a. The resulting frequency-averaged and normalized MPDR spectrum is illustrated in Figure 13b, which peaks at the angles θ = 60◦, 240◦ as desired. The results show that the source was accurately localized using MPDR. Next, the source signals were extracted using the TIKR algorithm. We also applied one-stage CS algorithms and one-stage FOCUSS algorithms, which located sources and separated their amplitudes in a single calculation.

(22)

Appl. Sci. 2017, 7, 582 22 of 29

Appl. Sci. 2017, 7, 582 23 of 30

(a)

(b)

Figure 13. Localization of one music source signal with sampling rate 16 kHz, located at the angles 60 , 240  in an anechoic chamber. By using a uniform circular array with a radius of 12 cm, the source direction can be identified by the peak in the angular spectrum. (a) Minimum power distortionless response (MPDR) spectrum plotted versus angle and frequency. (b) Frequency-averaged and normalized MPDR spectrum.

The signals extracted using different methods were evaluated by using the MOS of the PESQ test. Results confirmed that the TIKR performed well in signal separation with satisfactory audio quality. The results are summarized in Table 7.

Table 7. Mean opinion score of perceptual evaluation of speech quality for source signal separation

at the angles 60 , 240  with speech signals using Tikhonov regularization (TIKR), compressive sampling-convex optimization (CS-CVX), and focal underdetermined system solver (FOCUSS).

Methods Two-Stage TIKR One-Stagecs-CVX One-Stagefocuss

PESQ Source 1 2.84 3.11 1.56

Source 2 2.79 2.99 1.61

segSNR Source 1 17.40 20.74 13.54

Source 2 16.59 17.34 12.68

CPU time (s) 8 27588 275

Figure 13.Localization of one music source signal with sampling rate 16 kHz, located at the angles 60◦, 240◦ in an anechoic chamber. By using a uniform circular array with a radius of 12 cm, the source direction can be identified by the peak in the angular spectrum. (a) Minimum power distortionless response (MPDR) spectrum plotted versus angle and frequency. (b) Frequency-averaged and normalized MPDR spectrum.

The signals extracted using different methods were evaluated by using the MOS of the PESQ test. Results confirmed that the TIKR performed well in signal separation with satisfactory audio quality. The results are summarized in Table7.

Table 7.Mean opinion score of perceptual evaluation of speech quality for source signal separation at the angles 60◦, 240◦ with speech signals using Tikhonov regularization (TIKR), compressive sampling-convex optimization (CS-CVX), and focal underdetermined system solver (FOCUSS).

Methods Two-Stage TIKR One-Stagecs-CVX One-Stagefocuss

PESQ Source 1 2.84 3.11 1.56

Source 2 2.79 2.99 1.61

segSNR Source 1 17.40 20.74 13.54

Source 2 16.59 17.34 12.68

(23)

One sample coherence function between one loudspeaker and one microphone is shown in Figure14, indicating the signal quality to be poor below 200 Hz. Therefore, band-limited processing was applied for all frequencies up to 200 Hz in the SFS stage. In this frequency range, pressure matching was used on the basis of the room response model.

One sample coherence function between one loudspeaker and one microphone is shown in Figure 14, indicating the signal quality to be poor below 200 Hz. Therefore, band-limited processing was applied for all frequencies up to 200 Hz in the SFS stage. In this frequency range, pressure matching was used on the basis of the room response model.

Figure 14. Coherence curve measured from one loudspeaker to one microphone in a live room. A PULSE analysis platform made from Brüel & Kjær was arranged to measure the coherence curve. A white noise signal with sampling rate 16 kHz was used as the driving signal of the loudspeaker.

The SFS stage was conducted for three different methods. The coherence between the loudspeaker and the microphone was poor below 200 Hz; therefore, the signals below 200 Hz were not processed. Method 1, band-limited processing, was applied from 200 Hz to the spatial aliasing frequency, 952 Hz, in the SFS stage. In this frequency range, pressure matching was performed on the basis of the room response model. Below 200 Hz, unprocessed audio signals were fed directly to the loudspeakers. Above 952 Hz, a simple vector panning [39] approach was adopted. The optimal regularization parameter β achieving the highest MOS in room response modeling was calculated using GSS [21] as β = 0.0008634.

In the second method, instead of a vector panning method, we used DAS to process signals above 952 Hz. In the third method, we used pressure matching to obtain signals above 200 Hz. The use of different regularization parameters in pressure matching results in different levels of localization performance and audio quality.

Figure 15a,c shows the MPDR spectrum and the normalized MPDR spectrum obtained using the third method for β = 0.01 and β = 10, respectively. Low values of the regularization parameter β yielded higher localization performance than high values did. These two signals were compared with the clean signal through the PESQ test. The results showed that the high β ensured satisfactory voice quality, whereas the lowβ impaired voice quality.

The results of three localization methods are presented in Figure 16a–f. The MPDR spectra are plotted as functions of angle and frequency in Figure 16a,c,f. The resulting frequency-averaged and normalized MPDR spectra are shown in Figure 16b,d,f.

Figure 14. Coherence curve measured from one loudspeaker to one microphone in a live room. A PULSE analysis platform made from Brüel & Kjær was arranged to measure the coherence curve. A white noise signal with sampling rate 16 kHz was used as the driving signal of the loudspeaker.

The SFS stage was conducted for three different methods. The coherence between the loudspeaker and the microphone was poor below 200 Hz; therefore, the signals below 200 Hz were not processed. Method 1, band-limited processing, was applied from 200 Hz to the spatial aliasing frequency, 952 Hz, in the SFS stage. In this frequency range, pressure matching was performed on the basis of the room response model. Below 200 Hz, unprocessed audio signals were fed directly to the loudspeakers. Above 952 Hz, a simple vector panning [39] approach was adopted. The optimal regularization parameter β achieving the highest MOS in room response modeling was calculated using GSS [21] as β= 0.0008634.

In the second method, instead of a vector panning method, we used DAS to process signals above 952 Hz. In the third method, we used pressure matching to obtain signals above 200 Hz. The use of different regularization parameters in pressure matching results in different levels of localization performance and audio quality.

Figure15a,c shows the MPDR spectrum and the normalized MPDR spectrum obtained using the third method for β = 0.01 and β = 10, respectively. Low values of the regularization parameter β yielded higher localization performance than high values did. These two signals were compared with the clean signal through the PESQ test. The results showed that the high β ensured satisfactory voice quality, whereas the low β impaired voice quality.

The results of three localization methods are presented in Figure16a–f. The MPDR spectra are plotted as functions of angle and frequency in Figure16a,c,f. The resulting frequency-averaged and normalized MPDR spectra are shown in Figure16b,d,f.

(24)

Appl. Sci. 2017, 7, 582 24 of 29 Appl. Sci. 2017, 7, 582 25 of 30 (a) (b) (c) Figure 15. Cont.

(25)

Appl. Sci. 2017, 7, 582 26 of 30

(d)

Figure 15. Localization results in the sound field synthesis experiment by the third method with

different regularization parameters. β = 10 (a,b) and β = 0.01 (c,d).

(a)

(b)

Figure 15. Localization results in the sound field synthesis experiment by the third method with different regularization parameters. β = 10 (a,b) and β = 0.01 (c,d).

Appl. Sci. 2017, 7, 582 26 of 30

(d)

Figure 15. Localization results in the sound field synthesis experiment by the third method with

different regularization parameters. β = 10 (a,b) and β = 0.01 (c,d).

(a)

(b) Figure 16. Cont.

(26)

Appl. Sci. 2017, 7, 582 26 of 29 Appl. Sci. 2017, 7, 582 27 of 30 (c) (d) (e) Figure 16. Cont.