CHAPTER 2 LITERATURE REVIEW
2.1 Models of spoken word recognition
Cohort theory was revised by Marslen-Wilson (1987) so that the system chooses the
best match to fit an incoming word. Under this revised Cohort theory, word
recognition depends less on the initial auditory input: a word can be recognized as
long as its phonological representation shares enough features with the incoming
stimulus. Nevertheless, Marslen-Wilson (1989) reemphasized the importance of
word-initial information, because lexical activation would be obstructed even if all
information other than the word-initial information is consistent with the target word.
2.1.2 Merge (2000)
The Merge model (Norris, McQueen, & Cutler, 2000) is an autonomous model. The
network of Merge is a simple competition-activation network with the same basic
dynamics as Shortlist (Norris, 1994b). In Merge, there are three types of nodes: input
nodes, lexical nodes, and phoneme decision nodes. As shown in Figure 2, the input
nodes are connected by facilitatory links to the appropriate lexical nodes and phoneme
decision nodes. The lexical nodes are also connected by facilitatory links to the
suitable phoneme decision nodes. However, unlike the TRACE model (McClelland &
Elman, 1986), an interactive model, Merge has no feedback from the lexical nodes to
the prelexical phoneme nodes. Inhibitory connections operate between lexical nodes
as well as between phoneme decision nodes, but not between input nodes.
Figure 2. The basic architecture of Merge. The facilitatory connections, which are unidirectional, are displayed by bold lines with arrows; the inhibitory connections, which are bidirectional, are illustrated by fine lines with circles (Norris, McQueen, & Cutler, 2000)
Figure 2 also displays the simulation of the subcategorical mismatch in the
architecture of Merge. The network was designed with merely 14 nodes: 6 input
nodes (/dʒ/, /ɒ/, /g/, /b/, /v/, and /z/), 4 phoneme decision nodes, and 4 possible
word nodes, job, jog, jov, joz. The latter two word nodes stand for possible phoneme
combinations rather than real words.
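The competition-activation dynamics of this small network can be sketched in a few lines of Python. This is a minimal illustration only: the excitation, inhibition, and decay parameters below are invented for the sketch, not the values used in Norris, McQueen, and Cutler's simulations, and the vowel input node is written `Q` as a stand-in label for /ɒ/.

```python
# Minimal sketch of Merge-style competition-activation dynamics.
# Parameter values are illustrative assumptions, not those of the
# published Merge simulations.

WORDS = {"job": ["dZ", "Q", "b"], "jog": ["dZ", "Q", "g"],
         "jov": ["dZ", "Q", "v"], "joz": ["dZ", "Q", "z"]}
PHONEMES = ["b", "g", "v", "z"]          # final-position decision nodes

EXCITE, INHIBIT, DECAY = 0.1, 0.1, 0.05  # assumed parameters

def step(inputs, lex, dec):
    """One update cycle: bottom-up facilitation plus lateral inhibition."""
    new_lex = {}
    for w, phones in WORDS.items():
        bottom_up = sum(inputs.get(p, 0.0) for p in phones) * EXCITE
        rivals = sum(a for v, a in lex.items() if v != w)
        new_lex[w] = max(0.0, lex[w] * (1 - DECAY) + bottom_up - INHIBIT * rivals)
    new_dec = {}
    for p in PHONEMES:
        # Decision nodes merge prelexical input and lexical support;
        # crucially there is no feedback to the input nodes themselves.
        top_down = sum(new_lex[w] for w in WORDS if p in WORDS[w]) * EXCITE
        bottom_up = inputs.get(p, 0.0) * EXCITE
        rivals = sum(a for q, a in dec.items() if q != p)
        new_dec[p] = max(0.0, dec[p] * (1 - DECAY) + bottom_up + top_down
                         - INHIBIT * rivals)
    return new_lex, new_dec

# Unambiguous input consistent with "job": /dZ/, /Q/, and /b/.
inputs = {"dZ": 1.0, "Q": 1.0, "b": 1.0}
lex = {w: 0.0 for w in WORDS}
dec = {p: 0.0 for p in PHONEMES}
for _ in range(20):
    lex, dec = step(inputs, lex, dec)

print(max(lex, key=lex.get))  # job suppresses its rivals
print(max(dec, key=dec.get))  # /b/ wins the phoneme decision
```

Running the loop for a handful of cycles lets *job* suppress its rivals through lateral inhibition while the /b/ decision node receives converging lexical and prelexical support, which is the point of the merged information flow.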
The Merge model, which is faithful to the basic principles of autonomy, was
designed to explain data that were not compatible with TRACE (McClelland &
Elman, 1986). Merge, whose phoneme decision nodes merge the lexical and
phonemic information flows, provides a simple and appropriate account for the data
reported by Marslen-Wilson and Warren (1994), McQueen et al. (1999a), Connine et al.
(1997), and Frauenfelder et al. (1990), which TRACE cannot explain appropriately.
Therefore, Merge can give a full explanation of the known empirical findings in
phonemic decision making.
2.2 The role of acoustic onsets and acoustic offsets
A basic feature of the speech signal is its intrinsic directionality in time. As an
utterance proceeds, the speech signal unfolds along the time line from the beginning
to the end of the utterance. This fundamental property strongly implies that the initial
acoustic signal is of paramount importance, which is in accordance with the claims of
Cohort theory (Marslen-Wilson, 1984).
Auditory word recognition is a very complicated language processing task
because many linguistic and non-linguistic factors may disrupt the acoustic cues of
the speech signal. These disruptive factors include speech errors, acoustic-phonetic
variability under different phonological conditions, and auditory obstruction due to
surrounding noise. Such acoustic disruptions can happen at any moment of auditory
word recognition, yet human brains can still recognize words with little difficulty
most of the time. Moreover, the speech input is a continuous stream of acoustic signal:
hearers do not know exactly whether a particular input is in the initial, medial, or
final position of a word.
Given the above-mentioned difficulties of speech processing, it is clear that the
Cohort model, first proposed by Marslen-Wilson and Welsh (1978), which puts great
emphasis on the importance of initial acoustic cues, cannot account for the fact that
the speech signal is more or less varied or disrupted under different circumstances.
Therefore, the Cohort model was revised by Marslen-Wilson (1987) to reject total
dependence on word-initial cues for auditory word recognition. Unlike the old Cohort
model (1978), the revised Cohort theory claims that disruption of the word-initial
signal is not fatal to recognition, because non-word-initial information can still bring
about the activation of candidates. Thus, in the later Cohort theory, a 100 percent
match between the word-initial acoustic signal and the phonological representation of
a given word is not as crucial as the original Cohort theory claims. In addition, other
experiments indicate that auditory word recognition is not blocked even when the
word-initial signal is distorted. In one such experiment, an ambiguous phoneme
between /d/ and /t/ was presented before the sequence /ajp/, and the subjects, after
listening to the stimulus, had to decide what phoneme they had heard (Connine &
Clifton, 1987). The results show that subjects tend to label the ambiguous phoneme
as /t/ when it is followed by the sequence /ajp/ because /tʰajp/ ‘type’ is a word. This
indicates that the word-initial signal is not absolutely crucial in auditory word
recognition; otherwise, the word ‘type’ could not be recognized given the ambiguous
word-initial acoustic cues.
Marslen-Wilson and Zwitserlood (1989) conducted experiments to investigate
whether a nonword can activate a real word if the nonword differs from the real word
only in the initial phoneme. The results of their study indicated that such a nonword
generally cannot activate the real word. Accordingly, Marslen-Wilson and
Zwitserlood reemphasized the importance of word-initial information: lexical
activation would be barred even if all information other than the initial phoneme is
consistent with the hypothesized word. Therefore, mispronunciation of the initial
phoneme of a word does not facilitate the base word but rather precludes its
activation. From the finding that a nonword derived from a real word by merely
changing its initial phoneme cannot facilitate the real word, it is clear that
word-initial information is very important in auditory word recognition. In addition
to the studies regarding the initial segment of the input, Nooteboom and van der
Vlugt (1988) compared the importance of word onsets and offsets. Their results
indicated that words can be recognized equally well whether the inputs are heard
from the beginning or from the end, as long as the hearers know which part of the
words they have heard. However, they still claimed that a word-beginning priority
exists, because word-initial information is more easily associated correctly with the
lexical representation than word-final information.
Another relevant study concerning the role of the initial segment of a nonword
was done by Connine, Blasko, and Titone (1993). The purpose of their research was
to determine whether phonetically similar initial phonemes in a derived nonword
would be sufficient to produce activation of a base word. They designed nonwords
that differed from the base words by only one or two phonetic features, with the
altered segments in either initial or medial position. The results of the study indicated
that a base word can still be activated by a nonword with a similar initial phoneme,
and that the position of the altered segment does not influence the priming effect.
Connine, Blasko, and Titone (1993) concluded that the relative similarity of elements
in the input to a lexical representation is critical for auditory word recognition. In
other words, it is not the exact positional acoustic information of a particular lexical
item that matters in spoken word recognition, but the overall acoustic-phonetic
similarity between the input and the lexical representation. Therefore, the findings of
their study contradict the Cohort theory, which claims that the initial segment serves
to determine the activated word candidates.
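A toy similarity measure makes the contrast concrete between overall goodness of fit and a strict initial-segment criterion. The feature sets and items below are simplified assumptions for illustration, not the materials of Connine, Blasko, and Titone (1993).

```python
# Toy illustration of "overall goodness of fit": activation depends on
# summed feature similarity, with no special weight on the initial segment.
# The phoneme feature sets here are simplified assumptions.

FEATURES = {
    "d": {"alveolar", "stop", "voiced"},
    "t": {"alveolar", "stop", "voiceless"},
    "g": {"velar", "stop", "voiced"},
    "s": {"alveolar", "fricative", "voiceless"},
    "o": {"vowel", "mid", "back"},
    "a": {"vowel", "low"},
}

def similarity(input_phones, word_phones):
    """Mean Jaccard feature overlap, position by position."""
    scores = [len(FEATURES[a] & FEATURES[b]) / len(FEATURES[a] | FEATURES[b])
              for a, b in zip(input_phones, word_phones)]
    return sum(scores) / len(scores)

word = ["d", "o", "g"]
initial_mismatch = ["t", "o", "g"]   # nonword whose onset /t/ is close to /d/
distant_onset   = ["s", "a", "g"]    # onset and vowel both dissimilar

print(similarity(initial_mismatch, word))  # high overlap despite onset change
print(similarity(distant_onset, word))     # much lower overlap
```

Under this measure the nonword with a phonetically similar onset still scores close to the base word, mirroring the finding that such nonwords prime their bases, whereas an overall dissimilar input does not.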
Wingfield et al. (1997) used the gating technique to investigate the interaction
among acoustic onsets and offsets, cohort size, and syllabic stress in English. Their
analysis of the cohort sizes from both forward and backward gating showed that, for
two- and three-syllable words and for all stress patterns, the cohort size at the
recognition point is significantly larger from forward gating than from backward
gating. This finding demonstrated a clear advantage of forward gating over backward
gating, indicating that acoustic onset information is much more important than
acoustic offset information for all stress patterns, even though words can be identified
from both directions. However, Wingfield et al. (1997) qualified the absolute
word-onset priority principle when taking stress patterns into consideration. They
assumed that stress patterns can restrict the cohort size, and showed that the cohort
sizes at the recognition point were then not only significantly reduced but also equal
in the forward and backward gating directions. This analysis supported the claim that
cohort reduction is a very crucial mechanism in auditory word recognition regardless
of the direction of gating, favoring the overall goodness-of-fit hypothesis rather than
the absolute word-onset priority principle. Nevertheless, Wingfield et al. did not deny
that more acoustic information is needed for word recognition when a word is gated
from its end: for any given cohort size, a longer gate duration is needed in the
backward-gating condition than in the forward-gating condition in English.
2.3 Mandarin phonological system
There are 12 combinations of Mandarin syllable structure, including V, CV, GV,
VG, VN, CVG, CVN, CGV, GVG, GVC, CGVG, and CGVN. In Mandarin, a syllable
is traditionally divided into three parts, including an optional initial, a final and a tone
(C. Cheng, 1973). The initial part can be a nasal or a consonant. The final part
contains an optional prenuclear glide, a vowel, and an optional postnuclear glide or a
nasal. However, during the past two decades, the status of the prenuclear glides in
Mandarin syllable has raised many debates (Bao, 2002; Yip, 2002; Duanmu, 2002;
Wan, 2002a). Under the study, since the status of the prenuclear glide is not the focus,
the prenuclear glide was not grouped with the onset or the rhyme and was replaced by
the hiccup noise alone just as the initial consonant and the vowel. Last but not least, in
order not to let the duration of the rime be much longer than that of the prenuclear
glide and that of the initial consonant, the rime was further divided into a vowel plus a
postnuclear glide, or a vowel plus a final nasal. Each part of the rime could be
replaced by the hiccup noise individually.
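The twelve licit shapes above lend themselves to a small checker. The sketch below is a hypothetical illustration in which C, G, V, and N are segment-class labels rather than actual phonemes:

```python
# Sketch of the 12 Mandarin syllable shapes listed above.
# C = initial consonant, G = glide, V = vowel, N = final nasal.

SHAPES = {"V", "CV", "GV", "VG", "VN", "CVG", "CVN", "CGV",
          "GVG", "GVC", "CGVG", "CGVN"}

def classify(segments):
    """Map a sequence of segment-class labels onto a licit shape, or None."""
    shape = "".join(segments)
    return shape if shape in SHAPES else None

print(classify(["C", "G", "V", "N"]))  # 'CGVN', e.g. a C + glide + V + nasal syllable
print(classify(["N", "V"]))            # None: not among the 12 shapes
```

A checker like this is only a stand-in for the traditional initial/final analysis, but it makes clear that the 12 shapes form a closed set against which any segmented syllable can be matched.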
2.4 The acoustic-phonetic cues of the consonants in Taiwan Mandarin
In Taiwan Mandarin, there are 21 onset consonants in total: six oral stops /p/, /pʰ/,
/t/, /tʰ/, /k/, /kʰ/; two nasal stops /m/, /n/; six fricatives /f/, /s/, /ʂ/, /ɕ/, /x/, /ʐ/; six
affricates /ts/, /tsʰ/, /tʂ/, /tʂʰ/, /tɕ/, /tɕʰ/; and one liquid /l/. In the following sections,
the acoustic-phonetic characteristics of these onset consonants are introduced. These
characteristics serve as the segmentation criteria in Experiments 1 and 2.
2.4.1 The acoustic-phonetic cues of stops
There are three acoustic-phonetic cues for distinguishing stops: formant transitions,
burst amplitude, and voice onset time (VOT).
First, formant transitions are crucial for detecting the place of articulation of
stops. The F2 and F3 transitions from bilabial stops into the following vowels are
rising; those from alveolar stops are almost flat; and those from velar stops converge.
Second, previous research (Repp, 1984) indicated that the burst amplitude of labial
stops is weaker than that of alveolar and velar stops. Perceptual experiments have
shown that burst amplitude can influence the identification of labial and alveolar
stops, and this effect is stronger for voiceless stops than for voiced stops. Third, VOT
is of paramount importance for the detection of voicing. Stops with relatively long
VOTs tend to be perceived as voiceless, whereas stops with relatively short VOTs
tend to be perceived as voiced. In addition, voiceless aspirated stops have longer
VOTs than both voiced stops and voiceless unaspirated stops. In Mandarin, the mean
VOTs for /p/, /pʰ/, /t/, /tʰ/, /k/, and /kʰ/ are 14 ms, 82 ms, 16 ms, 81 ms, 27 ms, and
92 ms, respectively (Chao et al., 2006).
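These mean VOTs can be used to illustrate a simple aspiration decision. The sketch below takes the means from Chao et al. (2006) as given; the 50 ms threshold is an assumption chosen for illustration, not a value from that study.

```python
# Sketch of VOT-based aspiration detection for Mandarin stops, using
# the mean VOTs reported by Chao et al. (2006). "ph" etc. stand for
# the aspirated series. The 50 ms threshold is an illustrative assumption.

MEAN_VOT_MS = {"p": 14, "ph": 82, "t": 16, "th": 81, "k": 27, "kh": 92}

def aspiration(vot_ms, threshold=50.0):
    """Long VOT -> aspirated; short VOT -> unaspirated."""
    return "aspirated" if vot_ms >= threshold else "unaspirated"

for stop, vot in MEAN_VOT_MS.items():
    print(stop, vot, aspiration(vot))
# The aspirated series /ph, th, kh/ falls above the threshold,
# the unaspirated series /p, t, k/ below it.
```

The wide gap between the two series (14–27 ms versus 81–92 ms) is what makes such a one-dimensional cue reliable for Mandarin stops.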
2.4.2 The acoustic-phonetic cues of nasals
According to Ladefoged (2000), there are four acoustic-phonetic cues for
recognizing nasals. First, there is a sharp change in the spectrogram at the time the
articulatory closure is formed. Second, the formant bands of the nasal are fainter than
those of the vowel, indicating that the intensity of the nasal is weaker than that of the
vowel. Third, the F1 of the nasal is often very low, centered at around 250 Hz.
Fourth, there is a large region above the F1 with no energy. Based on these
acoustic-phonetic cues, nasals can be identified.
2.4.3 The acoustic-phonetic cues of fricatives
The most crucial acoustic-phonetic cue for separating voiceless fricatives from
voiced fricatives is the extended period of noise (Borden et al., 1994), which can be
easily detected on the spectrogram. Voiceless fricatives have longer duration and
stronger intensity. In contrast, voiced fricatives (i.e., /ʐ/ in Mandarin) are shorter in
duration and weaker in intensity, but their formant frequencies are clearer than those
of voiceless fricatives.
Fricatives are known for their high-frequency noise in the spectrum, which is an
acoustic-phonetic cue for distinguishing their place of articulation. Another
acoustic-phonetic cue for distinguishing the place of articulation of fricatives is the
intensity of frication. Sibilants (i.e., /s/, /ʂ/, /ɕ/, /ʐ/, /ts/, /tsʰ/, /tʂ/, /tʂʰ/, /tɕ/, and
/tɕʰ/ in Mandarin) are noted for relatively steep, high-frequency spectral peaks,
whereas nonsibilants (i.e., /f/ and /x/ in Mandarin) show relatively flat and
wider-band spectra. Moreover, alveolar sibilants (i.e., /s/ in Mandarin) can be
distinguished from palatal sibilants by the location of the lowest spectral peak: it is
around 4000 Hz for alveolar sibilants but around 2500 Hz for palatal sibilants.
Furthermore, the intensity shown on the spectrogram can also differentiate the place
of articulation of fricatives: stronger intensity characterizes sibilants, weaker intensity
nonsibilants. This is because the resonating cavity in front of the alveolar or palatal
constriction produces high intensity, whereas there is no resonating cavity in front of
the labio-dental constriction, which brings about relatively weak intensity. The
acoustic-phonetic characterization of fricatives is illustrated in Figure 3.
Figure 3. Acoustic-phonetic characteristics of fricatives (Borden et al., 1994)
Figure 3 shows how listeners perceive fricatives. When the listener hears an
input, it enters the first filter, which asks whether the input contains noise with
relatively long duration. If the answer is yes, the input is regarded as a fricative and
sent to the next filter. The second filter asks whether the intensity of the input is
relatively high; if so, the input is considered a sibilant and sent on. The third filter
examines the first spectral peak: if it is around 4 kHz, the input is viewed as /s/ or /z/
and sent to the next filter. The fourth filter asks whether phonation is present and
whether the duration and intensity are small enough. If the answer is yes, the input is
perceived as /z/; if the answer is no, it is perceived as /s/. Through these filters, the
input is examined step by step.
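The filter cascade just described is effectively a decision tree, which can be sketched directly. All numeric thresholds below are invented placeholders, since Figure 3 gives only qualitative criteria.

```python
# Sketch of the four-filter decision cascade read off Figure 3.
# The thresholds (80 ms, 40 dB, 3500 Hz) are illustrative assumptions;
# the figure itself gives only qualitative criteria.

def classify_fricative(duration_ms, intensity_db, first_peak_hz, phonation):
    if duration_ms < 80:            # filter 1: long noise -> fricative
        return "not a fricative"
    if intensity_db < 40:           # filter 2: high intensity -> sibilant
        return "nonsibilant"
    if first_peak_hz < 3500:        # filter 3: ~4 kHz first peak -> /s/ or /z/
        return "palatal sibilant"
    return "z" if phonation else "s"  # filter 4: phonation -> voiced /z/

print(classify_fricative(120, 55, 4000, phonation=False))  # 's'
print(classify_fricative(100, 55, 4000, phonation=True))   # 'z'
print(classify_fricative(110, 55, 2500, phonation=False))  # 'palatal sibilant'
```

Each `return` corresponds to exiting the cascade at one of the figure's filters, so the function mirrors the step-by-step examination described above.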
2.4.4 The acoustic-phonetic cues of affricates
There are three pairs of affricates in Mandarin: /ts/, /tsʰ/, /tʂ/, /tʂʰ/, /tɕ/, and /tɕʰ/.
According to Ladefoged (2000), an affricate is simply a sequence of a stop followed
by a homorganic fricative. Therefore, it can be inferred that affricates have the
acoustic-phonetic characteristics of both stops and fricatives.
2.5 The acoustic-phonetic cues of the vowels in Taiwan Mandarin
Phonetically speaking, there are 12 vowels in Taiwan Mandarin in total: 4 high
vowels ([i], [u], [y], and [ɨ]), 2 low vowels ([a] and [ɑ]), and 6 mid vowels ([e], [ɛ],
[ə], [ɤ], [o], and [ɔ]). Vowels have very different acoustic-phonetic cues from
consonants. First of all, vowels have much longer duration than consonants. Second,
the formants of vowels are much clearer than those of consonants. Third, the energy
of vowels is stronger than that of consonants, producing darker bands on the
spectrogram. Fourth, the F0 of vowels carries the tones of Mandarin. By these
acoustic-phonetic cues, vowels can be distinguished from consonants.
2.6 Mandarin tone
2.6.1 The perception of Mandarin Chinese tones
Lexical tones are pitch patterns that can distinguish lexical meanings in a given
language. In Mandarin Chinese, tones, like the aspirated and unaspirated stops, are
phonemic features that can differentiate word meanings. Mandarin Chinese
phonemically distinguishes four tones: Tone 1, with a high-level pitch; Tone 2, with a
high-rising pitch; Tone 3, with a low falling-rising pitch; and Tone 4, with a
high-falling pitch (Chao, 1948). The same syllable can have different meanings if it
carries different tones. For instance, ma with Tone 1 means ‘mother’; ma with Tone 2
means ‘numbness’; ma with Tone 3 means ‘horse’; and ma with Tone 4 means
‘scold’.
There are several factors that can affect the perception of Mandarin Chinese
tones. First, fundamental frequency plays a role. Previous acoustic studies have found
that F0 height and F0 contour are the acoustic cues for Mandarin Chinese tone
perception. Howie (1976) performed tone perception experiments to test whether
participants could identify the correct tones of the stimuli, using three contrasting
conditions: synthetic speech with natural F0 patterns, synthetic speech with a
monotonic F0 contour, and synthetic speech sounding like a whisper. The results
showed that subjects easily recognized the synthetic speech whose F0 patterns were
maintained.
Gandour (1984) and Tseng and Cohen (1985) indicated that both F0 height and F0
contour are crucial acoustic cues for Mandarin tone perception; neither can be missed.
Moore and Jongman (1997) differentiated Tone 2 from Tone 3 in terms of two
characteristics: the turning point, which is the point in time at which the tone changes
from falling to rising, and ∆F0, which is the F0 change from the onset to the turning
point. Moore and Jongman found that the turning point of Tone 2 is earlier than that
of Tone 3, and the ∆F0 of Tone 2 is smaller than that of Tone 3. Comparing the
acoustic cues of Tone 3 and Tone 4, Garding et al. (1986) found that stimuli with an
early pitch peak and a dramatic fall after the turning point tend to be perceived as
Tone 4, whereas stimuli that stay in a low F0 range and have long duration tend to be
recognized as Tone 3. These studies demonstrate that F0 contour is of paramount
importance for Mandarin tone perception.
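Moore and Jongman's two measures are easy to state computationally. The sketch below treats an F0 track as a list of Hz values sampled over time; the contours are invented illustrative values, not measurements from their study.

```python
# Sketch of the two Tone 2 vs. Tone 3 measures described above:
# the turning point (time of the falling-to-rising reversal) and
# delta-F0 (F0 change from onset to turning point).
# The contours below are invented illustrative values.

def turning_point(f0_track):
    """Index of the F0 minimum, where the contour turns from falling to rising."""
    return min(range(len(f0_track)), key=lambda i: f0_track[i])

def delta_f0(f0_track):
    """F0 drop from onset to the turning point, in Hz."""
    return f0_track[0] - f0_track[turning_point(f0_track)]

tone2 = [120, 115, 118, 130, 145, 160]   # early turn, small fall
tone3 = [120, 110, 100, 92, 95, 105]     # late turn, larger fall

print(turning_point(tone2), delta_f0(tone2))  # earlier turn, smaller delta-F0
print(turning_point(tone3), delta_f0(tone3))  # later turn, larger delta-F0
```

On these invented contours the Tone 2 track turns earlier and falls less than the Tone 3 track, which is exactly the direction of the difference Moore and Jongman reported.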
The second factor that can influence the perception of Mandarin Chinese tones is
the temporal properties of tones. Based on production data, Nordenhake and
Svantesson (1983) found that the duration of Tone 3 is the longest, only slightly
longer than that of Tone 2, while the duration of Tone 4 is the shortest. Given that the
F0 contours of Tone 2 and Tone 3 are similar, Nordenhake and
Svantesson (1983) further indicated that Tone 2 could be perceived as Tone 3 if it is