
CHAPTER 2 LITERATURE REVIEW

2.1 Models of spoken word recognition

2.1.1 The revised Cohort theory (1987)

Cohort theory was revised by Marslen-Wilson (1987) so that the system chooses the best match for an incoming word. Under this revised Cohort theory, word recognition depends less on the initial auditory input: a word can be recognized as long as the phonological representation of that word shares enough features with the incoming stimulus. Nevertheless, Marslen-Wilson and Zwitserlood (1989) reemphasized the importance of word-initial information, because lexical activation would be obstructed even if all the information except the word-initial information is consistent with the target word.

2.1.2 Merge (2000)

The Merge model, an autonomous model, was proposed by Norris, McQueen, and Cutler (2000). The network of Merge is a simple competition-activation network that shares the same basic dynamics as Shortlist (Norris, 1994b). In Merge, there are three types of nodes: input nodes, lexical nodes, and phoneme decision nodes. As shown in Figure 2, the input nodes are connected by facilitatory links to the appropriate lexical nodes and phoneme decision nodes. The lexical nodes are also connected by facilitatory links to the suitable phoneme decision nodes. However, unlike the TRACE model (McClelland & Elman, 1986), an interactive model, there is no feedback from the lexical nodes to the prelexical phoneme nodes. Inhibitory connections exist between lexical nodes as well as between phoneme decision nodes, but not between input nodes.

Figure 2. The basic architecture of Merge. The facilitatory connections, which are unidirectional, are displayed by bold lines with arrows; the inhibitory connections, which are bidirectional, are illustrated by fine lines with circles (Norris, McQueen, & Cutler, 2000)

Figure 2 displays the simulation of subcategorical mismatch in the architecture of Merge. The network was designed with merely 14 nodes: 6 input nodes (/dʒ/, / /, /g/, /b/, /v/, and /z/), 4 phoneme decision nodes, and 4 word nodes, job, jog, jov, and joz. The latter two word nodes represent only possible phoneme combinations rather than real words.
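The dynamics described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the connection pattern (facilitatory input-to-lexical and input-to-decision links, facilitatory lexical-to-decision links, within-layer inhibition, and no lexical-to-prelexical feedback) follows the description, but the ASCII node labels, weights, and update rule are all invented here.

```python
# Illustrative sketch of the 14-node Merge network described above.
# All numeric parameters are assumptions; "dZ" etc. stand in for IPA symbols.
WORDS = {"job": ("dZ", "b"), "jog": ("dZ", "g"),
         "jov": ("dZ", "v"), "joz": ("dZ", "z")}
DECISIONS = ["b", "g", "v", "z"]          # the 4 phoneme decision nodes

EXCITE, INHIBIT, DECAY = 0.4, 0.2, 0.1    # hypothetical parameters

def step(input_act, lex_act, dec_act):
    """One cycle of the competition-activation dynamics."""
    new_lex = {}
    for w, phones in WORDS.items():
        bottom_up = EXCITE * sum(input_act.get(p, 0.0) for p in phones)
        rivals = sum(a for v, a in lex_act.items() if v != w)  # within-layer inhibition
        new_lex[w] = max(0.0, (1 - DECAY) * lex_act[w] + bottom_up - INHIBIT * rivals)
    new_dec = {}
    for p in DECISIONS:
        bottom_up = EXCITE * input_act.get(p, 0.0)
        # Lexical nodes feed the decision nodes; crucially, there is no
        # feedback from lexical nodes back to the prelexical input nodes.
        lexical = EXCITE * sum(a for w, a in lex_act.items() if p in WORDS[w])
        rivals = sum(a for q, a in dec_act.items() if q != p)
        new_dec[p] = max(0.0, (1 - DECAY) * dec_act[p] + bottom_up + lexical
                         - INHIBIT * rivals)
    return new_lex, new_dec

# Present an input consistent with "job": strong /dZ/ evidence, weaker /b/.
input_act = {"dZ": 1.0, "b": 0.5}
lex = {w: 0.0 for w in WORDS}
dec = {p: 0.0 for p in DECISIONS}
for _ in range(10):
    lex, dec = step(input_act, lex, dec)
```

Under these assumed parameters, the lexical node for job and the decision node for /b/ come to dominate their layers, because /b/ receives converging bottom-up and lexical support while its competitors receive only inhibition.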

The Merge model, which is faithful to the basic principles of autonomy, was designed to explain data that were not compatible with TRACE (McClelland & Elman, 1986). Merge, with phoneme decision nodes that combine the lexical and phonemic information flows, provides a simple and appropriate account of the data reported by Marslen-Wilson and Warren (1994), McQueen et al. (1999a), Connine et al. (1997), and Frauenfelder et al. (1990), which cannot be explained appropriately by TRACE. Therefore, Merge can give a full explanation of the known empirical findings in phonemic decision making.

2.2 The role of acoustic onsets and acoustic offsets

A basic feature of the speech signal is its intrinsic directionality in time: as an utterance proceeds, the speech signal moves along the time line from the beginning to the end of the utterance. This fundamental property strongly implies that the initial acoustic signal is of paramount importance, in accordance with the claims of Cohort theory (Marslen-Wilson, 1984).

Auditory word recognition is a very complicated language-processing task because many linguistic and non-linguistic factors may disrupt the acoustic cues of the speech signal. These disruptive factors include speech errors, acoustic-phonetic variability under different phonological conditions, and auditory obstructions due to surrounding noise. Such acoustic disruptions can happen at any moment during auditory word recognition. However, human brains can still recognize words with little difficulty most of the time. Moreover, the speech input is a continuous stream of acoustic signal: hearers do not know exactly whether a particular input is in the initial, medial, or final position of a word.

Given the above-mentioned difficulties of speech processing, it is clear that the Cohort model, first proposed by Marslen-Wilson and Welsh (1978), which puts great emphasis on the importance of initial acoustic cues, cannot account for the fact that the speech signal is more or less varied or disrupted under different circumstances.

Therefore, the Cohort model was revised by Marslen-Wilson (1987), who rejected total dependence on word-initial cues for auditory word recognition. Unlike the old Cohort model (1978), the revised Cohort theory claims that disruptions of the word-initial signal are not fatal, because non-word-initial information can still bring about the activation of candidates. Therefore, in the later Cohort theory, a 100-percent match between the word-initial acoustic signal and the phonological representation of a given word is not as crucial as the original Cohort theory claims. In addition, other experiments indicate that auditory word recognition is not blocked even when the word-initial signal is distorted. In one such experiment, an ambiguous phoneme between /d/ and /t/ was presented before the sequence /ajp/, and the subjects, after listening to the stimulus, had to decide which phoneme they had heard (Connine & Clifton, 1987). The results showed that subjects tend to label the ambiguous phoneme as /t/ when it is followed by the sequence /ajp/, because /tʰajp/ 'type' is a word. This indicates that the word-initial signal is not absolutely crucial in auditory word recognition; otherwise the word 'type' could not have been recognized, given the ambiguous word-initial acoustic cues.


Marslen-Wilson and Zwitserlood (1989) conducted experiments to investigate whether a nonword can activate a real word if the nonword differs from the real word only in the initial phoneme. The results of their study indicated that such a nonword generally cannot activate the real word. Based on these results, Marslen-Wilson and Zwitserlood reemphasized the importance of word-initial information. They claimed that lexical activation would be barred even if all the information except the initial phoneme is consistent with the hypothesized word. Therefore, mispronouncing the initial phoneme of a word does not facilitate the base word but rather precludes its activation. Since a nonword derived from a real word merely by changing its initial phoneme cannot prime the real word, it is clear that word-initial information is very important in auditory word recognition. In addition to studies on the initial segment of the input, Nooteboom and van der Vlugt (1988) compared the importance of word onsets and offsets. Their results indicated that words can be recognized equally well whether the inputs are heard from the beginnings or from the endings, as long as the hearers know which part of the words they have heard. However, they still claimed that a word-beginning priority exists, because word-initial information is more easily associated correctly with the lexical representation than word-final information.


Another relevant study concerning the role of the initial segment of a nonword was done by Connine, Blasko, and Titone (1993). The purpose of their research was to determine whether phonetically similar initial phonemes in a derived nonword would be sufficient to produce activation of a base word. They designed nonwords that differed from the base words by only one or two phonetic features, with the altered segments in either initial or medial position. The results of the study indicated that a base word can still be activated by a nonword with a similar initial phoneme, and that the position of the altered segment does not influence the priming effect. Connine, Blasko, and Titone (1993) concluded that the relative similarity of elements in the input to a lexical representation is critical for auditory word recognition. Furthermore, it is not the exact positional acoustic information of a particular lexical item that matters in spoken word recognition, but the overall acoustic-phonetic similarity between the input and the lexical representation. Therefore, the findings of their study contradict Cohort theory, which claims that the initial segment serves to determine the activated word candidates.

Wingfield et al. (1997) used the gating technique to investigate the interaction among acoustic onsets and offsets, cohort size, and syllabic stress in English. Their analysis of cohort sizes from both forward and backward gating showed that the cohort size at the recognition point is significantly larger in forward gating than in backward gating for two- and three-syllable words and for all stress patterns. This finding demonstrated a great advantage of forward gating over backward gating, indicating that acoustic onset information is much more important than acoustic offset information for all stress patterns, although words can be identified from both directions. However, Wingfield et al. (1997) weakened the absolute word-onset priority principle when taking stress patterns into consideration. They assumed that stress patterns can restrict cohort size, and they showed that, once stress pattern was taken into account, the cohort sizes at the recognition point were not only significantly reduced but also equal in the forward and backward gating directions. This analysis supported the claim that cohort reduction is a crucial mechanism in auditory word recognition regardless of gating direction, which favors the overall goodness-of-fit hypothesis rather than the absolute word-onset priority principle. Nevertheless, Wingfield et al. did not deny that more acoustic information is needed for word recognition when a word is gated from its ending, possibly because, for any given cohort size, a longer gate duration is needed in the backward-gating condition than in the forward-gating condition in English.


2.3 Mandarin phonological system

There are 12 Mandarin syllable structures: V, CV, GV, VG, VN, CVG, CVN, CGV, GVG, GVC, CGVG, and CGVN. In Mandarin, a syllable is traditionally divided into three parts: an optional initial, a final, and a tone (C. Cheng, 1973). The initial can be a nasal or another consonant. The final contains an optional prenuclear glide, a vowel, and an optional postnuclear glide or nasal. During the past two decades, however, the status of the prenuclear glide in the Mandarin syllable has raised much debate (Bao, 2002; Yip, 2002; Duanmu, 2002; Wan, 2002a). Since the status of the prenuclear glide is not the focus of the present study, the prenuclear glide was not grouped with the onset or the rime; it was replaced by the hiccup noise on its own, just like the initial consonant and the vowel. Last but not least, in order to keep the duration of the rime from being much longer than that of the prenuclear glide or the initial consonant, the rime was further divided into a vowel plus a postnuclear glide, or a vowel plus a final nasal, and each part of the rime could be replaced by the hiccup noise individually.
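The 12 templates above can be encoded directly as a lookup set. This small helper (its name is hypothetical, not from the thesis) simply checks a C/G/V/N skeleton against that list, where C = initial consonant, G = glide, V = vowel, and N = nasal coda:

```python
# The 12 Mandarin syllable templates listed above (C. Cheng, 1973).
MANDARIN_TEMPLATES = {
    "V", "CV", "GV", "VG", "VN", "CVG",
    "CVN", "CGV", "GVG", "GVC", "CGVG", "CGVN",
}

def is_mandarin_skeleton(skeleton: str) -> bool:
    """Return True if a C/G/V/N skeleton matches one of the 12 templates."""
    return skeleton.upper() in MANDARIN_TEMPLATES
```

For example, a CGVN syllable such as tian passes the check, while a CVC skeleton, which is not among the 12 templates, does not.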

2.4 The acoustic-phonetic cues of the consonants in Taiwan Mandarin

In Taiwan Mandarin, there are 21 onset consonants in total: six oral stops /p/, /pʰ/, /t/, /tʰ/, /k/, /kʰ/; two nasal stops /m/, /n/; six fricatives /f/, /s/, /ʂ/, /ɕ/, /x/, /ʐ/; six affricates /tʂ/, /tʂʰ/, /tɕ/, /tɕʰ/, /ts/, /tsʰ/; and one liquid /l/. In the following sections, the acoustic-phonetic characteristics of these onset consonants are introduced. These characteristics serve as the segmentation criteria in Experiments 1 and 2.

2.4.1 The acoustic-phonetic cues of stops

There are three acoustic-phonetic cues for distinguishing stops: formant transitions, burst amplitude, and voice onset time (VOT).

First, formant transitions are crucial for detecting the place of articulation of stops. The F2 and F3 transitions from bilabial stops into the following vowels are rising; those from alveolar stops are almost flat; and those from velar stops converge. Second, previous research (Repp, 1984) indicated that the burst amplitude of labial stops is weaker than that of alveolar and velar stops. Perceptual experiments have shown that burst amplitude can influence the identification of labial and alveolar stops, and this effect is stronger for voiceless stops than for voiced stops. Third, VOT is of paramount importance for the detection of voicing. Stops with relatively long VOT tend to be perceived as voiceless, whereas stops with relatively short VOT tend to be perceived as voiced. In addition, voiceless aspirated stops have the longest VOT, compared with voiced stops and voiceless unaspirated stops. In Mandarin, the mean VOTs for /p/, /pʰ/, /t/, /tʰ/, /k/, and /kʰ/ are 14 ms, 82 ms, 16 ms, 81 ms, 27 ms, and 92 ms, respectively (Chao et al., 2006).
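The role of VOT as a voicing/aspiration cue can be illustrated with a toy classifier. Only the mean VOTs come from the cited study; the 40 ms decision boundary is my assumption for illustration, not a value from the thesis:

```python
# Mean VOTs (ms) for Mandarin stops cited above (Chao et al., 2006).
MEAN_VOT_MS = {"p": 14, "pʰ": 82, "t": 16, "tʰ": 81, "k": 27, "kʰ": 92}

ASPIRATION_BOUNDARY_MS = 40  # hypothetical boundary, not from the thesis

def classify_aspiration(vot_ms: float) -> str:
    """Long VOT tends to be perceived as aspirated; short VOT as unaspirated."""
    return "aspirated" if vot_ms >= ASPIRATION_BOUNDARY_MS else "unaspirated"
```

With this assumed boundary, the three aspirated means (82, 81, 92 ms) and the three unaspirated means (14, 16, 27 ms) fall cleanly on opposite sides.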

2.4.2 The acoustic-phonetic cues of nasals

According to Ladefoged (2000), there are four acoustic-phonetic cues for

recognizing nasals. First, there is a sharp change in the spectrogram at the time of the

formation of the articulatory closure. Second, the bands of the nasal are lighter than

those of the vowel, which indicates that the intensity of the nasal is weaker than that

of the vowel. Third, the F1 of the nasal is often very low, centered at around 250 Hz.

Fourth, there is a large space above the F1 with no energy. Based on these

acoustic-phonetic cues, nasals can be identified.

2.4.3 The acoustic-phonetic cues of fricatives

The most crucial acoustic-phonetic cue for separating voiceless fricatives from

voiced fricatives is by examining the extended period of noise (Borden et al., 1994).

The extended period of noise can be easily detected on the spectrogram. Voiceless

fricatives have longer duration and stronger intensity. To the contrary, voiced

fricatives (i.e., / / in Mandarin) are shorter in duration and weaker in intensity, but

their formant frequencies are clearer than those of voiceless fricatives.

Fricatives are known for the high-frequency noise in their spectra, which is one acoustic-phonetic cue for distinguishing their place of articulation. Another such cue is the intensity of frication. Sibilants (i.e., /s/, /ʂ/, /ɕ/, /ʐ/, /tʂ/, /tʂʰ/, /tɕ/, /tɕʰ/, /ts/, and /tsʰ/ in Mandarin) show relatively steep, high-frequency spectral peaks, whereas nonsibilants (i.e., /f/ and /x/ in Mandarin) show relatively flat, wide-band spectra. Moreover, alveolar sibilants (i.e., /s/ in Mandarin) can be distinguished from palatal sibilants by the location of the lowest spectral peak: around 4000 Hz for the alveolar sibilants and around 2500 Hz for the palatal sibilants. Furthermore, the intensity shown on the spectrogram can also differentiate the place of articulation of fricatives: stronger intensity characterizes sibilants, weaker intensity nonsibilants. This is because the resonating cavity in front of the alveolar or palatal constriction results in high intensity, whereas there is no resonating cavity in front of the labio-dental constriction, which brings about relatively weak intensity. The acoustic-phonetic characterization of fricatives is illustrated in Figure 3.


Figure 3. Acoustic-phonetic characteristics of fricatives (Borden et al., 1994)

Figure 3 shows how listeners perceive fricatives. When the listener hears an input, it enters the first filter, which judges whether the sound is noisy with a relatively long duration. If the answer is yes, the input is regarded as a fricative and sent to the next filter. The second filter examines whether the intensity is relatively high; if so, the input is considered a sibilant and sent on. The third filter inspects the first spectral peak: if it is around 4 kHz, the input is viewed as /s/ or /z/ and sent to the next filter. The fourth filter asks whether phonation exists, or whether duration and intensity are small enough: if yes, the input is perceived as /z/; if no, it is perceived as /s/. Through these filters, the input is examined step by step.
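The successive filters for /s/ versus /z/ described above can be sketched as a small decision function. The inputs here are hand-labelled acoustic judgments rather than measured signal, and the 3.5–4.5 kHz band standing in for "around 4 kHz" is my assumption; the branching itself mirrors the four steps in the text (Borden et al., 1994):

```python
def classify_fricative(long_noise: bool, high_intensity: bool,
                       first_peak_hz: float, phonation: bool) -> str:
    """Apply the four successive filters described for Figure 3."""
    if not long_noise:
        return "not a fricative"           # filter 1: extended noise present?
    if not high_intensity:
        return "nonsibilant fricative"     # filter 2: sibilant?
    if not (3500 <= first_peak_hz <= 4500):
        return "other sibilant"            # filter 3: first peak near 4 kHz?
    return "z" if phonation else "s"       # filter 4: phonation present?
```

For instance, a long, intense noise with its first spectral peak near 4 kHz and no phonation comes out as /s/, while the same input with phonation comes out as /z/.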

2.4.4 The acoustic-phonetic cues of affricates

There are three pairs of affricates in Mandarin: /tʂ/, /tʂʰ/, /tɕ/, /tɕʰ/, /ts/, and /tsʰ/. According to Ladefoged (2000), an affricate is simply a sequence of a stop followed by a homorganic fricative. Therefore, it can be inferred that affricates have the acoustic-phonetic characteristics of both stops and fricatives.

2.5 The acoustic-phonetic cues of the vowels in Taiwan Mandarin

Phonetically speaking, there are 12 vowels in total in Taiwan Mandarin: 4 high vowels ([i], [u], [y], and [ ]), 2 low vowels ([a] and [ ]), and 6 mid vowels ([e], [ ], [ə], [ ], [o], and [ ]). Vowels have very different acoustic cues from consonants. First of all, vowels have much longer duration than consonants. Second, the formants of vowels are much clearer than those of consonants. Third, the energy of vowels is stronger than that of consonants, resulting in a darker spectrogram. Fourth, the F0 of vowels carries the Mandarin tones. From these acoustic-phonetic cues, vowels can be distinguished from consonants.

2.6 Mandarin tone

2.6.1 The perception of Mandarin Chinese tones

Lexical tones are pitch patterns that distinguish lexical meanings in a given language. In Mandarin Chinese, tones, like the aspirated and unaspirated stops, are phonemic features that can differentiate word meanings. Mandarin Chinese phonemically distinguishes four tones: Tone 1, with a high-level pitch; Tone 2, with a high-rising pitch; Tone 3, with a low falling-rising pitch; and Tone 4, with a high-falling pitch (Chao, 1948). The same syllable can have different meanings if it carries different tones. For instance, ma with Tone 1 means 'mother'; ma with Tone 2 means 'numbness'; ma with Tone 3 means 'horse'; and ma with Tone 4 means 'scold'.

Several factors can affect the perception of Mandarin Chinese tones. First, fundamental frequency plays a role. Previous acoustic studies have found that F0 height and F0 contour are the acoustic cues for Mandarin Chinese tone perception. Howie (1976) conducted tone perception experiments to test whether participants could identify the correct tones of the stimuli, using three contrasting conditions: synthetic speech with natural F0 patterns, synthetic speech with a monotonic F0 contour, and synthetic speech sounding like a whisper. The results showed that subjects easily recognized the synthetic speech whose F0 patterns were maintained. Gandour (1984) and Tseng and Cohen (1985) indicated that both F0 height and F0 contour are crucial acoustic cues for Mandarin tone perception; neither can be missed. Moore and Jongman (1997) differentiated Tone 2 from Tone 3 in terms of two characteristics: the turning point, which is the point in time at which the tone changes from falling to rising, and ∆F0, which is the F0 change from the onset to the turning point. Moore and Jongman found that the turning point of Tone 2 is earlier than that of Tone 3, and that the ∆F0 of Tone 2 is smaller than that of Tone 3. Comparing the acoustic cues of Tone 3 and Tone 4, Gårding et al. (1986) found that stimuli with an early pitch peak and a dramatic fall after the turning point tend to be perceived as Tone 4, whereas stimuli that stay in a low F0 range and have a long duration tend to be recognized as Tone 3. These studies demonstrate that F0 contour is of paramount importance for Mandarin tone perception.
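The Tone 2 / Tone 3 distinction above lends itself to a simple two-cue sketch. The cue directions (earlier turning point and smaller ∆F0 for Tone 2) follow the summary of Moore and Jongman (1997); the numeric thresholds and the both-cues decision rule are purely illustrative assumptions:

```python
# Illustrative thresholds -- assumed values, not from Moore and Jongman (1997).
TURNING_POINT_THRESHOLD = 0.35   # turning point as a proportion of syllable duration
DELTA_F0_THRESHOLD_HZ = 20.0     # size of the fall from onset to turning point

def classify_tone2_vs_tone3(turning_point: float, delta_f0_hz: float) -> int:
    """Vote over the two cues: an earlier turning point and a smaller
    onset-to-turning-point fall both favour Tone 2; otherwise default to Tone 3."""
    tone2_votes = (turning_point < TURNING_POINT_THRESHOLD) \
        + (delta_f0_hz < DELTA_F0_THRESHOLD_HZ)
    return 2 if tone2_votes == 2 else 3
```

A token with an early turning point and a small fall is classified as Tone 2; a late turning point with a deep fall yields Tone 3.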

The second factor that can influence the perception of Mandarin Chinese tones is the temporal properties of the tones. Based on production data, Nordenhake and Svantesson (1983) found that the duration of Tone 3 is the longest, only slightly longer than that of Tone 2, while the duration of Tone 4 is the shortest. Given that the F0 contours of Tone 2 and Tone 3 are similar, Nordenhake and Svantesson (1983) further indicated that Tone 2 could be perceived as Tone 3 if it is