Chapter 2 Basic Concepts of Audio Coding
2.2 Quality Measurement
The audio quality of a coding system can be described as the perceived difference between the output of the system and the original signal. The most direct way to assess audio quality is a listening test, also called subjective quality measurement, in which human listeners judge the quality by ear. However, subjective quality measurement is expensive and time consuming. The alternative is objective quality measurement, which predicts the basic audio quality from objective measurements that incorporate psychoacoustic principles. Objective measurement is cheaper and more efficient, but it is not yet accurate enough to replace subjective measurement and is not yet generally accepted. This section introduces both approaches.
2.2.1 Subjective Quality Measurement
ITU-R BS.1116 [15] is a standard for subjective audio quality measurement that is very effective in evaluating high-quality audio systems with small impairments.
The grading scale used in the BS.1116 listening test is based on the five-grade impairment scale defined by ITU-R BS.562-3 [16] and shown in Figure 6. According to BS.562-3, any perceived difference between the reference signal and the test signal is assigned to one of five discrete grades according to the degree of impairment. In BS.1116, the ratings are given on a continuous scale from 1.0 to 5.0, where 1.0 stands for a very annoying impairment and 5.0 for transparent coding.
5.0  Imperceptible
4.0  Perceptible but Not Annoying
3.0  Slightly Annoying
2.0  Annoying
1.0  Very Annoying
Figure 6: ITU-R five-grade impairment scale [16]
The most widely accepted test method for subjective listening tests is the so-called “double-blind, triple-stimulus with hidden reference” method. In this method, the listener is presented with three signals: the reference signal “R” and the test signals “A” and “B”. One of A and B is identical to the reference signal and the other is the coded signal. The assignment to A and B is made randomly so that no listener can predict which signal is identical to the reference.
The listeners are asked to assess the impairment of A compared to R, and of B compared to R according to the grading scale in Figure 6. Since one of the test signals is actually the reference signal, one of them should receive a grade of 5.0 while the other may receive a grade that describes the listener’s assessment of the impairment.
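The trial procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of BS.1116: the function names `make_trial` and `difference_grade` are hypothetical, and the difference grade is taken to be the grade given to the coded signal minus the grade given to the hidden reference, which is the usual definition of the subjective difference grade (SDG).

```python
import random

def make_trial(rng):
    """Randomly assign the hidden reference and the coded signal to A and B."""
    labels = ["reference", "coded"]
    rng.shuffle(labels)
    return {"A": labels[0], "B": labels[1]}

def difference_grade(assignment, grades):
    """Return the grade of the coded signal minus the grade of the hidden reference.

    assignment maps "A"/"B" to "reference"/"coded" (hypothetical layout);
    grades maps "A"/"B" to the listener's grade on the 1.0 to 5.0 scale.
    """
    for label in ("A", "B"):
        if not 1.0 <= grades[label] <= 5.0:
            raise ValueError("grades must lie between 1.0 and 5.0")
    coded = next(k for k, v in assignment.items() if v == "coded")
    hidden = next(k for k, v in assignment.items() if v == "reference")
    return grades[coded] - grades[hidden]

rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
assignment = make_trial(rng)
```

With this convention, a negative difference grade indicates a perceived impairment of the coded signal. A positive value means the listener graded the hidden reference below the coded signal, that is, the listener failed to identify the reference in that trial.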
The double-blind, triple-stimulus with hidden reference method has been implemented in various ways. For example, the system under test can be a real-time hardware implementation or a software simulation of the system. The stimuli can be presented with a tape-based reproduction or with a playback system from a computer hard disk. The listener is allowed to switch between R, A, and B and to loop through the test. The inclusion of the hidden reference in each trial provides an easy means to check that the listener does not consistently make mistakes, and therefore provides a control condition on the expertise of the listener. The method has been employed worldwide for many formal listening tests of perceptual audio codecs. The consensus is that it provides a very sensitive, accurate, and stable way of assessing small impairments in audio systems.
In general, formal listening tests have shown very good reliability in the evaluation of audio coding systems and high correlation in their results; see, for example, [17]-[20].
2.2.2 Objective Quality Measurement
The purpose of objective quality measurement is to predict the basic audio quality by using objective measurements that incorporate psychoacoustic principles. Objective quality measurements that incorporate perceptual models have been in use since the late 1970s [21]. More recently, psychoacoustic models have been exploited in measuring the perceived quality of audio coding systems; see, for example, [23]-[26]. The effectiveness of objective quality measurements can only be assessed by comparison with the corresponding scores obtained from subjective listening tests. One of the first global opportunities for correlating the results of these different objective evaluations with informal subjective listening test results arose in 1995, in the early stages of the development of the MPEG-2 AAC codec. The need to test different reference models during that development led to the study of objective tests as a supplement and as an alternative to listening tests. Unfortunately, none of the objective quality measurement techniques examined at that time showed reliable correlation with the results of the listening tests [27].
The recent adoption of PEAQ by ITU-R in BS.1387 [28] was accompanied by data showing the correlation between the PEAQ objective difference grades (ODGs) and the subjective difference grades (SDGs) obtained by averaging the results of previous formal subjective listening tests [29]. While PEAQ is based on a refinement of generally accepted psychoacoustic models, it also includes new cognitive components to account for high-level processes that play a role in the judgment of audio quality.
PEAQ was used to generate objective quality measurements for audio data previously utilized in formal listening tests of state-of-the-art perceptual audio codecs. The performance of PEAQ was evaluated in different ways. First, the objective and mean subjective ratings were compared for each critical audio item used in the formal tests. Then the objective and subjective overall system quality measurements were compared by averaging codec quality measurements over the critical items. The correlation between subjective and objective results proved very good, and analysis of the SDGs and ODGs showed no statistically significant differences [29]. The accuracy of the ODGs demonstrated the capacity of PEAQ to correctly predict the outcome of the formal listening tests, including the ranking of the codecs in terms of measured quality.
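As an illustration of how agreement between objective and subjective grades can be quantified, the sketch below computes the Pearson correlation coefficient between per-item ODGs and mean SDGs. This is an assumption made for illustration: the text does not state which statistic was used in [29], the helper name `pearson` is hypothetical, and the grade values are invented placeholders rather than results from any listening test.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Placeholder per-item grades, invented for illustration only.
mean_sdg = [-0.2, -0.8, -1.5, -2.4, -3.1]  # averaged over listeners
odg = [-0.3, -0.6, -1.7, -2.2, -3.3]       # objective prediction per item

r = pearson(mean_sdg, odg)
```

A value of `r` close to 1 indicates that the objective grades track the subjective ones closely across items. A high correlation alone does not show that the two sets of grades agree in absolute level, which is why a separate comparison of the SDG and ODG values is also needed.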
PEAQ was also tested as a tool to aid the selection of critical material for formal listening tests. Based on its quality measurements, the set of critical material selected by PEAQ included more than half of the critical sequences used in the formal listening test under examination [29].