4. Discussion
4.1 Experiment 1
In this study, overall performance with Bimodal hearing was better than with CI-alone for both Taiwanese Mandarin sentences and words. Although it does not account for the elevated thresholds of actual CI users, the low-pass speech provides a theoretical basis for evaluating the usefulness of low frequency acoustic cues in our experiment. These results demonstrate that low frequency acoustic information can be an important cue for subjects under simulated Bimodal hearing. Although the low-pass speech itself was unintelligible, it nevertheless improved recognition of Taiwanese Mandarin speech in noise.
The current findings are consistent with previous studies. With speech and noise both presented from the front, Zhang et al. showed that adding low frequency acoustic information to electrically stimulated information led to a significant improvement in sentence recognition in noise. They proposed that the robust representation of voicing allows access to low frequency acoustic landmarks that mark syllable structure and word boundaries in noise. These landmarks can bootstrap word and sentence recognition [58]. Berrettini et al. evaluated the benefits of Bimodal hearing when speech and noise were presented from different spatial directions and demonstrated improvements in speech perception with Bimodal hearing in comparison to the CI alone [100]. Our findings also extend the previous results in the following ways. First, recognition of Taiwanese Mandarin sentences and words can improve with additional low frequency information regardless of the spatial direction of the noise. Second, the improvement was larger when the noise interfered significantly with the CI. Third, Bimodal stimulation can lead to a binaural effect that keeps performance at an average level.
Most current CI speech processing strategies do not explicitly encode the F0 and its harmonics and thus provide insufficient spectral or temporal cues for tone recognition; the temporal pitch cues in electric hearing are generally weak. Several studies have tried to improve pitch and tone perception in CI users, for example by increasing the low frequency spectral resolution [101], enhancing the temporal envelope cues associated with pitch and tonal patterns [102-104], or varying the stimulation rate according to the F0 extracted from the speech signal to encode tonal information [105]. However, these approaches achieve only limited improvements in tone recognition. Even enhanced temporal envelope cues presumably remain relatively weak in electric hearing. In contrast, because residual acoustic hearing provides more salient low frequency pitch cues, Bimodal stimulation may greatly improve CI users' tone recognition and, in turn, tonal language recognition.
Previous studies have evaluated the importance of tonal information to Mandarin speech recognition by CI users [106-109]. For example, Fu et al. showed that native speakers of Mandarin could identify lexical tones even when pitch information was limited by simulated CI processing [106]. The authors concluded that, along with the tonal cues for these sounds, distinguishing temporal envelope cues were also present, and that listeners were able to rely on these. Xu et al. also found that, for NH listeners in CI simulations with varying degrees of temporal and spectral resolution, the reduced spectral cues in Mandarin tonal patterns could be compensated by temporal cues [110]. In contrast, other studies have found that for Cantonese [107] and Mandarin [108-109], where pitch cues are not typically supplemented by temporal cues, CI users showed a distinct deficit in recognizing lexical tones, with only a few users able to score above chance levels. Similarly, intonation cues for speech recognition are also present in other languages such as English, but at the supra-segmental level, for example the distinction between a statement and a question in a sentence. Peng et al. found that, on average, the CI users
could distinguish only half as many of the question-statement pairs as NH listeners [111]. This result is a reminder that the speech understanding required in many real-world situations goes beyond simple identification of the words themselves.
On the other hand, the spectral details contained in low-pass speech may have helped listeners to better separate speech from noise, thereby improving speech recognition relative to CI-alone. Previous studies showed that improved reception of F0 information, mostly below 500 Hz, with low-pass speech provided significantly better tone recognition, while improved reception of first formant (F1) information, from around 500 to 1000 Hz, provided significantly better vowel recognition [34, 51]. Another consequence of poor frequency resolution in CIs is seen when the primary task involves the perception of pitch, such as listening to music, where CI users have been shown to have severe perceptual deficits [112, 113]. For example, Gfeller et al. found that while NH listeners had no difficulty discriminating piano notes one semitone apart (a frequency difference of approximately 6%), CI users' ability to discriminate pitches was generally much poorer: the typical threshold was 1/2 octave (6 semitones), with some listeners requiring as much as 2 octaves between notes for discrimination [112]. This poor pitch perception certainly affects the typical implant patient's ability to recognize melodies.
In summary, poor frequency resolution appears to be a major remaining hurdle to improving the listening abilities of traditional CI users. Some of the deficits that implant patients suffer from in everyday life cannot be revealed simply by testing in quiet. It is also interesting that the frequency resolution provided by acoustic hearing in noise, even when there is a severe hearing loss, is often better than that provided by a CI. For these reasons, preserving residual acoustic hearing in CI users has potential advantages.
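The semitone, half-octave and two-octave figures quoted above follow from equal-tempered tuning, in which each semitone multiplies frequency by 2^(1/12). The short sketch below works through that arithmetic. The function names are illustrative only and are not taken from the cited studies.

```python
def interval_ratio(semitones: float) -> float:
    """Frequency ratio spanned by an interval of the given size (equal temperament)."""
    return 2.0 ** (semitones / 12.0)


def percent_difference(semitones: float) -> float:
    """Frequency difference between two notes, as a percentage of the lower note."""
    return (interval_ratio(semitones) - 1.0) * 100.0


# Intervals mentioned in the text: 1 semitone, 1/2 octave, 2 octaves.
for label, semitones in [("1 semitone", 1), ("1/2 octave", 6), ("2 octaves", 24)]:
    print(f"{label}: ratio {interval_ratio(semitones):.3f}, "
          f"difference {percent_difference(semitones):.1f}%")
```

One semitone comes out at about 5.9%, consistent with the "approximately 6%" figure, whereas a half-octave threshold corresponds to a frequency difference of about 41% and two octaves to a factor of four.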
Accurate pitch perception requires that the listener extract F0 information from the acoustic signal. In both acoustic and electric hearing this information may be extracted by temporal coding
which is based on extracting information regarding the F0 from the temporal waveform output of auditory filters or CI bandpass filters, and/or by place coding, which is based on resolving the individual harmonics in a signal (see [114] for a review). In acoustic hearing, place coding for complex sounds involves the resolution of individual low-order harmonics by narrow-bandwidth auditory filters. These harmonics appear as individual peaks in the basilar membrane excitation pattern, which can be compared with pre-formed "harmonic templates" to determine the F0 [115]. The filterbanks involved in CI processing strategies comprise a smaller number of wide band-pass filters (24 for HiRes, 22 for ACE, 20 for SPEAK) with fixed centre frequencies that cover a more restricted frequency range [116]. In contrast, the auditory filters of a normal cochlea are non-linear and level-dependent and have continuous centre frequencies (see [114] for a review). Thus, low-order harmonics may not be fully resolved by the wide bandpass filters in CI devices, making it difficult for users to derive the pitch of harmonics and/or extract the F0 of complex sounds [116]. Even if individual harmonics are resolved, the user may only be able to determine that a harmonic falls into one filter or pair of filters, as this would result in the activation of the corresponding electrode [9, 19]. While there is some evidence that CI users are able to use place cues to determine the position of pure tones within a filterband or pair of filterbands [117], this may not be true for the harmonics of complex sounds. Laneau et al. [118] showed that adult CI users using the ACE strategy were unable, after the removal of temporal pitch cues, to rank the F0 of pairs of synthetic vowels, even for F0 differences as large as 1.7 octaves. This may suggest that, for the F0s of the synthetic vowels used in their study, the ACE filterbank does not allow adequate transmission of place-based pitch information.
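The place-coding argument above can be made concrete by counting how many harmonics of a voice land in each band of a fixed filterbank. The following is a minimal sketch under stated assumptions: the band edges are hypothetical (22 contiguous bands, logarithmically spaced between 188 Hz and 7938 Hz) and are not the frequency table of ACE or any other real strategy, and the 150 Hz F0 is an arbitrary example value. In this simplified picture, harmonics that share a band cannot be separated by place, and a harmonic alone in its band is located only to within that band's width.

```python
def log_spaced_edges(low_hz, high_hz, n_bands):
    """Band edges of n_bands contiguous filters, equally spaced on a log-frequency axis."""
    ratio = (high_hz / low_hz) ** (1.0 / n_bands)
    return [low_hz * ratio ** i for i in range(n_bands + 1)]


def harmonics_per_band(f0_hz, edges):
    """Count how many harmonics of f0_hz fall inside each band [edges[b], edges[b+1])."""
    counts = [0] * (len(edges) - 1)
    k = 1
    while k * f0_hz < edges[-1]:
        freq = k * f0_hz
        for b in range(len(counts)):
            if edges[b] <= freq < edges[b + 1]:
                counts[b] += 1
                break
        k += 1
    return counts


# Hypothetical values: 22 bands from 188 Hz to 7938 Hz and a 150 Hz voice pitch.
edges = log_spaced_edges(188.0, 7938.0, 22)
counts = harmonics_per_band(150.0, edges)
for b, n in enumerate(counts):
    print(f"band {b + 1:2d}: {edges[b]:7.1f}-{edges[b + 1]:7.1f} Hz, {n} harmonic(s)")
```

The outcome depends strongly on the assumed band edges, so the counts should be read as an illustration of the mechanism rather than as a prediction for a particular device.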
The delivery of place-pitch cues is also limited by the nature of electrical stimulation. The number of independent sites of stimulation is physically limited by the number of, size of and spacing between intracochlear electrodes. Pitch perception is further limited by overlaps in the
electrical currents generated at adjacent, and more distant, electrodes [119]. Overlapping electrical currents occur because intracochlear electrode arrays are surrounded by the highly conductive fluid that fills the scala tympani [8]. Some evidence suggests that only 4-8 independent sites of stimulation are available, even for arrays with 22 electrodes [120-121]. A host of biological variables can also limit the ability of CI users to use place cues, including the density and distribution of spiral ganglion neurons relative to the electrode array and/or the pathophysiology of the hearing loss [122-123].
In a normal cochlea, pitch is coded by the place of excitation or the temporal fine structure of the neural discharge [124]. Almost all current CI strategies extract only the temporal envelope of incoming signals; the fine structure information is discarded because a fixed-rate pulsatile carrier is used. Although some methods have been proposed to enhance pitch representations in CIs, e.g., using a varying pulse rate to restore the temporal cue [125] or using virtual channels to increase spectral resolution [126], all have shown limited functional benefit. The limited number of electrodes and the broad spread of electrical current also result in a very coarse place pitch cue. The relatively shallow insertion depth of current electrode arrays limits the transmission of low frequency information. In the current study, the addition of low-pass speech provided low frequency acoustic information, such as F0, low-order harmonics, consonant voicing, lexical boundaries and contextual emphasis, as well as manner. This information is important for speech recognition in complex auditory scenes [127], but a CI device cannot transmit it. With those low frequency cues, Bimodal hearing significantly enhanced performance, with improvements ranging from 4 to 64%. In particular, when noise was presented on the same side as the CI and interfered significantly, Bimodal hearing improved performance by up to an average of 74%. We suggest that this is because lexical tone is critical for recognizing Taiwanese Mandarin, and tone recognition requires accurate pitch perception, which is determined by F0.
In summary, the pitch information provided to CI users contains only crude representations
of the pitch cues present in the original acoustic signal. Even if place and temporal pitch cues are available to a recipient, the interaction between the cues may impair accurate pitch perception [116].
These factors may account for the relatively poor pitch ranking accuracy of the unilateral CI users relative to the NH and Bimodal users.
Speech cues can be roughly divided into three features that correspond to various actions in the vocal cords or vocal tract when a speech sound is produced. The three features are voicing, manner and place. Voicing cues are the presence or absence of periodic vocal fold vibrations and are signaled by the overall intensity of the speech sound, which is dominated by the low frequency regions of the spectrum. An example of a voiced versus unvoiced consonant pair is /z/ versus /s/. Manner cues involve the timing, intensity and frequency of speech sounds, and signal the general mechanism of producing the speech sound. For example, some different “manner” categories are fricatives, stop consonants, and nasals. Manner cues are generally located across the broad frequency spectrum of speech. Place cues, otherwise known as “place of articulation” cues, are specifically related to the position of the tongue and other movable parts of the vocal tract. These movable articulators determine the shape of the short-term speech spectrum (i.e. formants, bursts, etc.) and are signaled by the specific frequency locations of spectral peaks. Two consonant sounds that differ in the place feature are /b/ and /d/. Place of articulation cues are generally located in the mid- to high-frequency regions of the spectrum.
In a recent study, Spitzer et al. analyzed lexical word boundary errors from simulated Bimodal stimulation and actual Bimodal users [128] to evaluate whether adding the acoustic signal to the electric signal better defines word onsets for Bimodal users. In English, words are more likely to begin with a stressed or strong syllable than a weak syllable [129]. Strong syllables in English are characterized acoustically by relatively more pitch variation, higher intensity and longer duration. Strong and weak syllables also differ in vowel quality: vowels tend to be reduced
towards schwa in weak syllables. If the acoustic signal better defined strong versus weak syllables, then lexical boundary errors should be reduced in Bimodal stimulation versus electric-only. Spitzer et al. found fewer lexical boundary errors with Bimodal stimulation than with electric-only stimulation. Thus, it appears that, when segmental information is reduced, the acoustic signal aids the recognition of strong and weak syllables, which in turn leads to better perception of word boundaries in continuous speech.
Binaural hearing is a fundamental property of the normal auditory system. In real-life listening situations, conversation often occurs in the presence of background noise, primarily in rooms where other people are talking. If there is more than one talker, understanding speech requires the ability to locate and follow each one. Many individuals with hearing loss in one or both ears have difficulty in this situation. For restoring hearing symmetry, it is reasonable to hypothesize that the novel information provided by low frequency acoustic hearing would provide more benefit than the redundant information added by a second implant. Any increase in performance with bilateral stimulation vs. unilateral stimulation, when signals are presented from a single loudspeaker, likely comes from the “binaural summation effect”, also known as “binaural redundancy” [130]. In many of the studies described above, the signals were presented from a single loudspeaker. To realize other potential binaural benefits for speech understanding, such as squelch and head shadow, signals must be presented from multiple, spatially separated sources, as is more typically encountered in the real world. Evidence for binaural benefits when both ears are stimulated, compared with stimulation of one ear alone, is well documented in normal listeners [131].
For example, when subjects listen to monaural compared with binaural pure tones at supra-threshold levels, the stimulus in the monaural ear must be 6 to 10 dB higher than the stimulus during binaural presentation to result in equal loudness judgments [132-133]. When speech and noise are spatially separated, the head shadow effect creates another binaural benefit, which can
result in 8 to 10 dB of improvement in NH subjects [134-135]. The head shadow effect is a physical phenomenon due to the placement of the head. It refers to the benefit in speech recognition when the noise is moved from the ipsilateral side to the contralateral side of the CI, so that the noise is shielded by the head.
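For readers less used to decibel figures, the level differences quoted above (6 to 10 dB for binaural summation, 8 to 10 dB for head shadow) can be converted to linear ratios with the standard definitions of the decibel. The sketch below is a plain unit conversion, not part of the cited experiments, and the function names are illustrative.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Sound-pressure (amplitude) ratio corresponding to a level difference in dB."""
    return 10.0 ** (db / 20.0)


def db_to_power_ratio(db: float) -> float:
    """Power (intensity) ratio corresponding to a level difference in dB."""
    return 10.0 ** (db / 10.0)


# Level differences mentioned in the text.
for db in (6, 8, 10):
    print(f"{db} dB: amplitude x{db_to_amplitude_ratio(db):.2f}, "
          f"power x{db_to_power_ratio(db):.2f}")
```

A 6 dB difference is roughly a doubling of sound pressure, and a 10 dB difference is a tenfold change in power.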
In the current study, when noise was presented from the left, CI-alone yielded the worst scores at each SNR level because of significant interference from the noise. There was a significant main effect of noise source angle of incidence in the CI-alone condition, which indicates that the head shadow effect was observed. Furthermore, the superior speech recognition performance with binaurally combined hearing over the CI alone may arise from the benefits of binaural processing, including the binaural squelch effect [136-137] and/or the binaural summation effect [130]. The squelch effect describes the case in which speech and noise come from different spatial directions: adding an ear closer to the noise can significantly improve speech recognition because the listener uses interaural level difference (ILD) and interaural time difference (ITD) cues to segregate the speech from the noise in the auditory system. The binaural summation effect describes the advantage that a signal heard with two ears is louder than with one ear. When speech and noise were both presented from the front, there was a significant benefit from Bimodal relative to CI-alone. However, we argue that the presently observed improvement in speech recognition in noise with binaurally combined acoustic and electric hearing cannot be due to these two binaural advantages. First, the advantage from the binaural summation effect is small [138] and results mainly in better speech recognition in quiet [139].
This cannot account for the considerably large improvement in speech recognition in noise with Bimodal hearing in this study. Second, previous studies showed that a similar improvement was obtained with combined acoustic and electric hearing on the same side with the short-electrode implant, providing strong evidence against the binaural advantage hypothesis. Thus we suggest that Bimodal hearing may provide binaural information to the central auditory nervous system, enabling the
utilization of a binaural effect that assists speech segregation in a monaurally based grouping and segregation mechanism in background noise. Similarly, no significant difference was found between Bimodal and CI-alone at any SNR level when noise was presented from the right, indicating that there was no squelch effect for the simulated Bimodal subjects. Because Bimodal subjects use different strategies between ears, the ITD and ILD cues cannot be reliably coded, and their speech recognition cannot benefit from the squelch effect.
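To give a sense of how small the ITD cue discussed above is, the sketch below uses Woodworth's spherical-head approximation, ITD = (a / c)(θ + sin θ), for a distant source at azimuth θ between 0 and 90 degrees. The formula, the head radius of 8.75 cm and the sound speed of 343 m/s are assumptions brought in for illustration; none of them comes from the present experiments.

```python
import math


def woodworth_itd_seconds(azimuth_deg, head_radius_m=0.0875, speed_of_sound_m_s=343.0):
    """Approximate ITD for a far-field source, spherical-head model (0 to 90 degrees)."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound_m_s) * (theta + math.sin(theta))


# Source straight ahead, at 45 degrees, and directly to one side.
for azimuth in (0, 45, 90):
    itd_us = woodworth_itd_seconds(azimuth) * 1e6
    print(f"azimuth {azimuth:2d} deg: ITD = {itd_us:.0f} microseconds")
```

Under these assumptions the ITD is zero for a frontal source and only about 0.66 ms for a source directly to one side, which illustrates the fine interaural timing that must be coded consistently in both ears for the cue to be usable.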
The present results are somewhat different from [34], who found that English spondee word recognition in steady-state, speech-shaped noise was not significantly improved by adding low frequency (500 Hz) acoustic information to the CI simulations. Two factors may have contributed to this difference in results. First, different languages and materials were used in these studies to evaluate speech recognition. Tonal language recognition, as in the current study, depends more strongly on tonal information and pitch cues than English speech recognition does. Thus, additional low-pass filtered speech might significantly improve Taiwanese Mandarin speech intelligibility in steady-state, speech-shaped noise, as in the present study, while having little effect on English speech recognition. Second, different speech processing conditions were used to simulate the CI in these studies. Turner et al. used 16-channel CI simulations, which produced a much lower SRT (-15 dB) than is typically observed with CI users. The data obtained from their study may have underestimated the contribution of low frequency acoustic information to speech recognition in noise because the simulation overestimated the spectral resolution of real CI users. Within the overlapping frequency ranges between the CI simulation and low-pass filtered speech, the reduced spectro-temporal resolution in the simulated CI was presumably insufficient to be susceptible to interference by high-resolution acoustic information on the contralateral side. It is also possible that listeners attend to the better signal and ignore the poorer representation when bilateral speech cues overlap in frequency.
4.2 Experiment 2
While the addition of low frequency information did not return speech reception performance to normal levels in Experiment 1, the improvement was nevertheless significant when compared with CI-alone. As stated above, with some level of residual acoustic hearing, the benefit to speech perception may be attributed to more accurate low-order formant information and spectral resolution, rather than to improvements in F0 perception alone. However, some previous studies have demonstrated that, with very low frequency acoustic information (125 to 150 Hz low pass), speech reception benefits of residual hearing could be directly attributable to an improvement in F0 representation [55, 58]. It is currently unclear exactly how F0 information from the low frequency region is used to help the listener to segregate