Psychoacoustic Model - Basic Concepts of Audio Coding

Chapter 2 Basic Concepts of Audio Coding

2.1 Psychoacoustic Model

The goal of audio coding is to represent audio signals with as few bits as possible while keeping the quality perceptually lossless for human listeners. In an ideal audio coding system, listeners cannot tell the coded signal from the original. To approach this goal, a psychoacoustic model is introduced to simulate the human auditory system. Human hearing covers a frequency range from about 20 Hz to 20,000 Hz and responds to sound intensities that vary over many orders of magnitude. The hearing system may thus seem to be a very wide-range instrument, but this is not altogether true. Because no existing model simulates the human hearing process precisely, current audio coders rely on only a few well-understood properties of the auditory system. Even so, these psychoacoustic properties strongly affect the listening quality of coded audio. By applying them, developers can remove perceptually irrelevant information, shape the quantization noise so that it is inaudible, and improve the listening quality significantly. This section briefly introduces these psychoacoustic principles: the absolute threshold of hearing, critical bands, and masking effects.

2.1.1 Absolute Threshold of Hearing

In a noiseless environment, the amount of energy that a pure tone needs in order to be detected by a listener is called the absolute threshold of hearing (ATH). The absolute threshold of hearing varies with frequency. In 1979, Terhardt [4] proposed a nonlinear function that approximates it well:

$$T_q(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\,e^{-0.6\left(\frac{f}{1000} - 3.3\right)^{2}} + 10^{-3}\left(\frac{f}{1000}\right)^{4} \quad \text{(dB SPL)} \tag{1}$$

where f is the frequency in Hz and T_q(f) is the absolute threshold of hearing at frequency f. The curve of this function is shown in Figure 2. In practice, the threshold at a given frequency is measured by raising the sound pressure level (SPL) of a test tone until the listener just detects it. Repeating the measurement while stepping the test-tone frequency from low to high yields the threshold over the whole frequency range. Because signal components below the absolute threshold of hearing are inaudible, they can be removed from the input samples to improve the coding efficiency. Furthermore, T_q(f) can also be regarded as the maximum allowable energy level of coding distortion. That is, quantization error that stays below the absolute threshold of hearing is inaudible and can be ignored.
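As a rough illustration of how the threshold in quiet can be used, the following minimal sketch evaluates Terhardt's approximation and tests whether a component would be inaudible in quiet. The function names `ath_db_spl` and `is_inaudible` are hypothetical, and the sketch assumes that frequencies are given in Hz and levels in dB SPL.

```python
import math


def ath_db_spl(f_hz: float) -> float:
    """Terhardt's approximation of the threshold in quiet, in dB SPL.

    f_hz is the frequency in Hz (assumed > 0).
    """
    k = f_hz / 1000.0  # frequency in kHz
    return (
        3.64 * k ** -0.8
        - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
        + 1e-3 * k ** 4
    )


def is_inaudible(level_db_spl: float, f_hz: float) -> bool:
    """True if a component at f_hz lies below the threshold in quiet."""
    return level_db_spl < ath_db_spl(f_hz)


if __name__ == "__main__":
    # Print the threshold at a few illustrative frequencies.
    for f in (100, 1000, 3300, 10000):
        print(f"{f:>6} Hz: {ath_db_spl(f):7.2f} dB SPL")
```

The same comparison applies to quantization noise: under this simplified view, noise whose level stays below `ath_db_spl(f)` at every frequency would not be heard even without any masker present.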

Figure 2: The absolute threshold of hearing in quiet [4]

2.1.2 Critical Bands

In perceptual audio coding, the absolute threshold of hearing provides only a basic tool for shaping coding distortion. To understand and simulate the human auditory system, critical bands must also be taken into account. The critical band structure and the related analysis can be used to describe many aspects of the behavior of the auditory system. A basic definition of the critical band is “the bandwidth at which subjective response changes abruptly” [5]. That is, human perception of signals within the same critical band is similar and does not change rapidly. The unit of critical-band rate is the Bark. One Bark corresponds to a length of about 1.3 mm on the basilar membrane [6]. Experiments have revealed that 25 critical bands exist over the frequency range of human hearing, and the bandwidth of a critical band can be approximated by

$$\Delta f_G = 25 + 75\left(1 + 1.4 f^{2}\right)^{0.69} \quad \text{(Hz)} \tag{2}$$

where f is frequency, expressed in kHz. The center and edge frequencies of these 25 critical bands are shown in Table 1 [7].
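The approximation in Eq. (2) is easy to check numerically. The sketch below evaluates it at the center frequencies of several bands from Table 1 and compares the result with the width implied by the table's edge frequencies. The function name `critical_bandwidth_hz` is hypothetical, and reading each table row as lower edge, center, upper edge is an assumption based on the ordering of the values.

```python
def critical_bandwidth_hz(f_khz: float) -> float:
    """Critical bandwidth in Hz from Eq. (2); f_khz is the frequency in kHz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


# (band, lower edge, center, upper edge) in Hz, taken from rows of Table 1.
# The column interpretation is an assumption (values are in ascending order).
TABLE_ROWS = [
    (10, 1080, 1170, 1270),
    (11, 1270, 1370, 1480),
    (12, 1480, 1600, 1720),
    (13, 1720, 1850, 2000),
    (23, 9500, 10500, 12000),
    (24, 12000, 13500, 15500),
]

if __name__ == "__main__":
    for band, lower, center, upper in TABLE_ROWS:
        approx = critical_bandwidth_hz(center / 1000.0)
        print(f"band {band}: table width {upper - lower} Hz, "
              f"Eq. (2) gives about {approx:.0f} Hz")
```

Note that the formula expects kHz while the table lists Hz, so the center frequency is divided by 1000 before the call.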

For a given frequency, the critical band is the smallest band of frequencies around it that activates the same part of the basilar membrane. Whereas the difference threshold is the just-noticeable difference of a single frequency, the critical bandwidth represents the ear’s resolving power for simultaneous tones or partials. In a complex tone, the critical bandwidth corresponds to the smallest frequency difference between two partials such that each can still be heard separately. It may also be measured by taking a sine tone barely masked by a band of white noise around it: when the noise band is narrowed to the point where the sine tone becomes audible, its width at that point is the critical bandwidth. Simultaneous tones lying within a critical bandwidth do not increase the perceived loudness beyond that of a single tone, provided the sound pressure level remains constant. For tones lying more than a critical bandwidth apart, the combination results in increased loudness. When two tones are close together in frequency, they are heard as a single tone in which the two frequencies cannot be told apart. As the frequency difference increases, roughness appears in the tone, because both frequencies are still activating the same part of the basilar membrane. Further apart, the two frequencies can be discriminated separately, and the roughness disappears only when the frequency separation reaches the critical bandwidth.
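The distinction between tones that share a critical band and tones that do not can be sketched with the bandwidth approximation of Eq. (2). This is only a simplified illustration: the helper name `share_critical_band` is hypothetical, and centering the band on the mean of the two frequencies is an assumption made for the example, not a rule stated in the text.

```python
def critical_bandwidth_hz(f_khz: float) -> float:
    """Critical bandwidth in Hz from Eq. (2); f_khz is the frequency in kHz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


def share_critical_band(f1_hz: float, f2_hz: float) -> bool:
    """True if two tones are closer together than one critical bandwidth.

    Assumption: the band is centered on the mean of the two frequencies.
    """
    center_khz = 0.5 * (f1_hz + f2_hz) / 1000.0
    return abs(f1_hz - f2_hz) < critical_bandwidth_hz(center_khz)


if __name__ == "__main__":
    for pair in ((1000.0, 1050.0), (1000.0, 1500.0)):
        verdict = "same band" if share_critical_band(*pair) else "separate bands"
        print(pair, "->", verdict)
```

Under this simplification, a pair that returns `True` would be expected to interact (no loudness increase, possible roughness), while a pair that returns `False` would be heard as two separate tones with increased combined loudness.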

Table 1 Critical Band Center and Edge Frequencies [7]

Band  Lower edge (Hz)  Center (Hz)  Upper edge (Hz)
10    1080             1170         1270
11    1270             1370         1480
12    1480             1600         1720
13    1720             1850         2000
23    9500             10500        12000
24    12000            13500        15500
25    15500            19500

2.1.3 Simultaneous Masking Effect

Masking is one of the most important concepts in perceptual audio coding because it allows coding distortion to be hidden so that it is inaudible. Masking is a phenomenon of the human auditory system in which the threshold of audibility of one sound is raised by the presence of another sound. It occurs in both the frequency and time domains, and its strength depends on the frequency, structure, and energy level of both the masker and the maskee.

Simultaneous masking is masking that occurs in the frequency domain. It occurs when two stimuli are presented to the auditory system at the same time and one of them is made inaudible by the other. Physiological evidence indicates that simultaneous masking is caused by the way the basilar membrane and the hair cells function. Many researchers believe that the masker produces so much activity on the basilar membrane that any activity caused by the weaker signal becomes undetectable. Physiologically, the hair cells detect the strongest vibration in any critical band along the basilar membrane.

Simultaneous masking is typically characterized for four cases: noise masking noise (NMN), tone masking tone (TMT), noise masking tone (NMT), and tone masking noise (TMN). In practice, however, usually only TMN and NMT are considered when determining the simultaneous masking effect, in order to reduce complexity. Figure 3 shows the noise-masking-tone effect and the energy level of a test tone that is just masked by narrow-band noise [8]. The tone-masking-noise effect is shown in Figure 4 for a narrow-band noise and a tonal masker [9].

Figure 3: The broken line is the absolute threshold of hearing [8]. (a) The noise level is 60 dB and the center frequency varies. (b) The center frequency of the noise is 1 kHz and the energy level varies

Figure 4: The energy level required for a narrow-band noise to be audible in the presence of a tonal masker [9]. The center frequency of the noise is 250 Hz and the energy level of the tone is 80 dB

2.1.4 Non-simultaneous Masking Effect

Non-simultaneous masking is masking that occurs in the time domain, when the masker and the maskee are not presented to the hearing system at the same time. A signal can be masked by a sound preceding it, called post-masking, or even by a sound following it, called pre-masking. Non-simultaneous masking in the human hearing system is asymmetric: the pre-masking effect is much weaker than the post-masking effect. Pre-masking occurs before the onset of the masker and lasts approximately 20 milliseconds, although significant pre-masking tends to last only 1-2 milliseconds. Post-masking occurs after the masker has ended and lasts more than 100 milliseconds [6]. Figure 5 illustrates the non-simultaneous masking effect, including pre-masking and post-masking. For the purposes of perceptual audio coding, abrupt audio signal transients (e.g., the onset of a percussive musical instrument) create pre-masking and post-masking regions in time. During these regions, a listener will not perceive a signal that lies beneath the raised audibility threshold produced by the masker. Non-simultaneous masking has been used in several audio coding algorithms [10]-[14]. Pre-masking in particular has been exploited in conjunction with adaptive block size transform coding to compensate for pre-echo distortions.

Figure 5: Illustration of the non-simultaneous masking effect [6]
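The time regions described above can be sketched as a simple classifier. This is a toy model, not part of any coder: the function name `masking_region` is hypothetical, the window lengths are the approximate durations quoted in the text (about 20 ms of pre-masking, and 100 ms used as a conservative stand-in for "more than 100 ms" of post-masking), and the sketch only labels the time region. Whether a signal is actually masked also depends on its level relative to the raised threshold.

```python
PRE_MASKING_MS = 20.0    # approximate pre-masking duration quoted in the text
POST_MASKING_MS = 100.0  # text says "more than 100 ms"; conservative assumption


def masking_region(offset_ms: float, masker_duration_ms: float) -> str:
    """Label a time offset relative to the masker onset (t = 0 ms).

    Negative offsets are before the masker starts; offsets larger than
    masker_duration_ms are after it has ended.
    """
    if -PRE_MASKING_MS <= offset_ms < 0.0:
        return "pre-masking"
    if 0.0 <= offset_ms <= masker_duration_ms:
        return "simultaneous"
    if masker_duration_ms < offset_ms <= masker_duration_ms + POST_MASKING_MS:
        return "post-masking"
    return "unmasked"


if __name__ == "__main__":
    # A 50 ms masker, probed at several offsets (in ms).
    for t in (-30.0, -5.0, 25.0, 120.0, 200.0):
        print(f"{t:7.1f} ms -> {masking_region(t, 50.0)}")
```

The asymmetry of the two windows mirrors the text: the region after the masker is several times longer than the region before it.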
