Research questions - The Time Course of Tone Processing in Chinese Spoken Lexical Access: Evidence from Eye Movements

National Chengchi University (國立政治大學)

the behavioral task, recent studies have applied experimental techniques such as event-related potentials (ERPs) and eye-tracking to examine on-line auditory processing. These results suggest that tonal and segmental information are accessed at a similar point in time. Therefore, tone and segment information might play a comparable role during spoken word recognition (Malins & Joanisse, 2010; Schirmer, Tang, Penney, Gunter, & Chen, 2005; Tsang, Jia, Huang, & Chen, 2011).

1.2 Research questions

The present study conducts two eye movement experiments to examine the time

course of tone information processing during spoken character recognition. Specific

research questions to be addressed are as follows:

(1) When is tonal information processed during spoken character recognition? Is it processed in an early phase of spoken character recognition, or is it accessed at a relatively late stage of spoken character processing?

(2) In what way does tone interact with segmental information in lexical processing? Does lexical tone affect spoken character processing independently, or does its influence on lexical processing depend on segmental information?

Literature Review

2.1 Processing the spoken language signal

Speech perception roughly involves three levels of processing: the auditory level, the

phonetic level, and the phonological level (Carroll, 2008; Frauenfelder & Tyler, 1987;

Lass, 1976; Studdert-Kennedy, 1976). At the auditory level, the signal is represented

in terms of its frequency, intensity, and temporal attributes, which could be shown on

a spectrogram. At the phonetic level, the individual phones are identified by a

combination of acoustic cues such as the formant transitions. At the phonological

level, the phonetic segment is converted into a phoneme, and phonological rules are

applied to the sound sequence. These levels are successively processed by listeners

when decoding speech signals (Carroll, 2008). Listeners first discriminate auditory signals from other sensory signals and decide whether the auditory stimuli are something they have heard. They then identify the particular properties of the signal and qualify it as speech. Lastly, these properties are recognized as the meaningful speech of a particular language (Carroll, 2008).

2.1.1 Perception of phonetic segments

Concerning speech perception, many researchers are interested in how listeners manage to decode speech signals into phonetic units and derive meaningful words. The properties that help listeners identify phonetic segments, such as those of vowels and consonants, are tightly intertwined and overlapping (Gleason & Ratner, 1998). One issue for speech perception is therefore how individual words are separated from the complex speech input and then identified appropriately.

Moreover, there is no one-to-one correspondence between phonemes and their acoustic realizations. This problem is termed the lack of invariance, and it results from context-conditioned variation (Carroll, 2008; Frauenfelder & Tyler, 1987; Gleason & Ratner, 1998). Context-conditioned variation refers to the fact that the production of the same phonetic segment varies depending on the environment in which the segment is produced. However, some studies suggest that speech perception relies on both invariant and context-conditioned cues (Cole & Scott, 1974).

Another issue in segmental perception is the phenomenon of categorical perception, which is typically found for contrasts between many different pairs of consonants. In categorical perception, perceptual systems transform relatively linear sensory signals into absolute or categorical non-linear

mental representations. In speech, listeners convert continuous auditory signals into discrete, meaningful words. According to Liberman, Harris, Hoffman, and Griffith (1957), listeners’ ultimate task is to identify a sound such as [p] or [b] as belonging to one or another category of speech sounds. The feature that minimally distinguishes [p] from [b] is voicing. To notice the difference between voiced [b] and voiceless [p], the interval between the release of the sound at the lips and the onset of vocal fold vibration is crucial. This interval is termed voice onset time (VOT): for voiced [b], vibration begins almost immediately, whereas for voiceless [p] it begins after a short lag. Some categorical perception studies construct synthesized speech syllables to examine whether categorical perception also holds for nonspeech sounds such as chirps or only for speech (Jusczyk & Luce, 2002; Liberman et al., 1957). These researchers found that categorical perception applied to speech rather than nonspeech. However, there is still no firm conclusion as to whether there is a special mode of speech perception (Jusczyk & Luce, 2002; Liberman et al., 1957).
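The mapping from a continuous VOT value to a discrete [b]/[p] label can be illustrated with a minimal sketch. It assumes a logistic identification function; the 25 ms boundary and the slope value are hypothetical parameters chosen for illustration and are not taken from Liberman et al. (1957) or any other cited study.

```python
import math

# Hypothetical parameters (assumptions for illustration, not measured values).
BOUNDARY_MS = 25.0   # assumed category boundary on the VOT continuum
SLOPE = 0.5          # assumed steepness of the identification function


def prob_voiceless(vot_ms, boundary_ms=BOUNDARY_MS, slope=SLOPE):
    """Probability of labeling a stimulus as voiceless [p], given its VOT."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))


def label(vot_ms):
    """Discrete category label assigned to a continuous VOT value."""
    return "p" if prob_voiceless(vot_ms) >= 0.5 else "b"


# The same 10 ms physical step changes labeling very little within a
# category and a great deal across the assumed boundary.
within_category = abs(prob_voiceless(10) - prob_voiceless(0))
across_boundary = abs(prob_voiceless(30) - prob_voiceless(20))

for vot in (0, 10, 20, 30, 40, 60):
    print(vot, label(vot), round(prob_voiceless(vot), 3))
```

Under these assumptions, equal steps along the VOT continuum produce a sharp change in labeling only near the boundary, which is the signature pattern described as categorical perception.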

Because vowels are continuous and noncategorical, vowel perception differs from consonant perception (Fry, Abramson, Eimas, & Liberman, 1962). Vowels have longer and larger formants, whereas consonants are signaled by formant transitions; these transient cues force listeners to impose a categorical identity on the stimuli more rapidly than they do for vowels. Therefore, after the

stimuli have been identified, the cues for the consonants are lost, and only the coded

stimuli remain. Additionally, because of the relatively longer duration of vowels, the

perception course suggests that vowels are processed longer at the auditory level than

consonants (Carroll, 2008; Frauenfelder & Tyler, 1987; Garman, 1990).

2.1.2 Lexical access and models

In addition to the discrimination and categorization of phonetic segments, many researchers are interested in extending the inquiry to the processes by which spoken words are recognized and their meanings retrieved. Psycholinguists are eager to understand how listeners use phonological and prosodic knowledge to parse the sensory input during word recognition (Grosjean & Gee, 1987; Lyn, 1987; Frauenfelder & Tyler, 1987).

Models of spoken word recognition generally assume that phonological

information is continuously integrated during spoken word recognition. When the

speech is unfolding, lexical candidates compete for recognition as a function of

phonological similarity with the speech input (Foss & Hakes, 1978; Garman, 1990;

Gleason & Ratner, 1998; Myers, Laver, & Anderson, 1981). The models differ in how they explain the temporal dynamics that relate the incoming speech stimuli to potential lexical representations.

One influential model is the Cohort model (Marslen-Wilson & Welsh, 1978; Marslen-Wilson, 1987). The Cohort model proposes that the onset of a word activates a set of lexical candidates that compete for recognition. In the first, autonomous stage, when the first phoneme of a word is heard, all candidates that share that initial phoneme are activated. For example, if the phoneme /d/ in the word “drive” is heard, words beginning with /d/, such as “dive,” “drink,” “date,” and “dunk,” are activated. This set of activated words is called the “cohort.” The words in the cohort are not assumed to affect one another’s activation levels, which means that at this stage word recognition is a completely data-driven, bottom-up process. In the second stage, once a cohort is activated, all possible sources of information may begin to influence the selection of the target word from the cohort. Additional acoustic-phonetic information may eliminate some of the cohort words, and this incoming phonetic information is assumed to operate in a strictly left-to-right fashion. In this stage, however, higher-level sources of information may also help to eliminate members of the cohort. For instance, if the phoneme /r/ follows the phoneme /d/, this further acoustic-phonetic information eliminates cohort words such as “date,” “dunk,” and “dive.” Higher-level information may then eliminate remaining cohort words such as “drink” when they do not fit the available semantic or syntactic context. Spoken word recognition is finally achieved when a single candidate remains in the cohort. A later, revised Cohort model also considers other sources of information, such as the word frequency effect (Frauenfelder & Tyler, 1987; Gleason & Ratner, 1998; Jusczyk & Luce, 2002; Marslen-Wilson & Tyler, 1980; Marslen-Wilson, 1987).
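The two stages of the Cohort model can be sketched as a left-to-right filter over a small lexicon followed by a contextual filter. This is only an illustrative sketch: the phoneme transcriptions are simplified, and the set of context-compatible words is a hypothetical stand-in for semantic or syntactic constraints, not part of the published model.

```python
# Simplified phoneme transcriptions (assumptions for illustration only).
LEXICON = {
    "drive": ("d", "r", "aɪ", "v"),
    "dive": ("d", "aɪ", "v"),
    "drink": ("d", "r", "ɪ", "ŋ", "k"),
    "date": ("d", "eɪ", "t"),
    "dunk": ("d", "ʌ", "ŋ", "k"),
}


def cohort_after(heard, lexicon=LEXICON):
    """Words whose initial phonemes match everything heard so far."""
    heard = tuple(heard)
    return sorted(
        word for word, phonemes in lexicon.items()
        if phonemes[:len(heard)] == heard
    )


def apply_context(cohort, compatible_words):
    """Keep only cohort members compatible with higher-level context."""
    return [word for word in cohort if word in compatible_words]


# Stage 1: the first phoneme activates the whole cohort (bottom-up only).
stage1 = cohort_after(["d"])
# Stage 2a: further acoustic-phonetic input prunes the cohort left to right.
stage2 = cohort_after(["d", "r"])
# Stage 2b: a hypothetical sentence context that only "drive" fits.
final = apply_context(stage2, compatible_words={"drive"})

print(stage1)  # ['date', 'dive', 'drink', 'drive', 'dunk']
print(stage2)  # ['drink', 'drive']
print(final)   # ['drive']
```

Recognition in this sketch corresponds to the point at which the list shrinks to a single candidate.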

The TRACE model (McClelland & Elman, 1986) is an interactive model that assumes three levels of primitive processing units: features, phonemes, and words (Figure 2). These processing units have excitatory connections between levels and inhibitory connections within levels. Through these connections, nodes excite and inhibit one another’s activation according to the stimulus input and the activity in the system. For example, voiced stimuli such as the consonants /b/, /d/, or /g/ activate the voicing feature at the feature level of the model. This unit in turn passes its activation to all voiced phonemes at the next level, which in turn activate the words containing those phonemes. Furthermore, via lateral inhibition among units within a level, the most activated unit may come to dominate competing units that are also temporarily consistent with the input. For example, the word unit cat at the lexical level will inhibit similar, competing lexical units (e.g., pat). This inhibition helps to ensure that the best

candidate word will win the competition in the process (Gleason & Ratner, 1998;

Jusczyk & Luce, 2002; McClelland & Elman, 1986).
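The effect of within-level lateral inhibition can be illustrated with a toy competition between two word units. The sketch below is not the TRACE update rule of McClelland and Elman (1986); the update equation, the parameter values, and the bottom-up support values for cat and pat are all assumptions made for illustration.

```python
def compete(bottom_up, inhibition=0.5, rate=0.2, decay=0.1, steps=50):
    """Iterate a toy within-level competition and return final activations.

    bottom_up maps each word unit to its (assumed) support from the input.
    Each unit is excited by its own support and inhibited by its rivals.
    """
    act = {word: 0.0 for word in bottom_up}
    for _ in range(steps):
        updated = {}
        for word, a in act.items():
            rivals = sum(v for other, v in act.items() if other != word)
            net_input = bottom_up[word] - inhibition * rivals
            a_next = a + rate * net_input - decay * a
            updated[word] = min(1.0, max(0.0, a_next))  # keep within [0, 1]
        act = updated
    return act


# Hypothetical support: the input is "cat", so "pat" matches only partially.
support = {"cat": 1.0, "pat": 0.6}
with_inhibition = compete(support)
without_inhibition = compete(support, inhibition=0.0)

print(with_inhibition)
print(without_inhibition)
```

Comparing the two runs shows the design point: without inhibition the partially matching unit also becomes highly active, whereas with inhibition the better-supported unit suppresses its competitor.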

Figure 2. A subset of the units in the TRACE. Each rectangle represents a different unit. The labels indicate the item for which the unit stands, and the horizontal edges of the rectangle indicate the portion of the TRACE spanned by each unit. The input feature specifications for the phrase “tea cup,” preceded and followed by silence, are indicated for the three illustrated dimensions by the blackening of the corresponding feature units (McClelland & Elman, 1986).

There are differences between the Cohort and the TRACE models (Table 1). First, the Cohort model emphasizes the temporal dynamics of spoken word recognition.

The Cohort model stresses the significance of word onsets, which means that spoken words may be identified before their offsets if similar competitors are not active. The TRACE model, in contrast, duplicates the nodes and connections of its system across successive time slices of input. This treatment of the temporal dynamics of spoken word recognition might be questionable, because the time-slice solution results in an extremely complex structure. Second, although the TRACE model is relatively complex, its highly interactive architecture gives it computational specificity, which makes it relatively easy to test directly through simulations of behavior. This feature therefore helps it account for a broad range of phenomena. In contrast, the lack of interaction leaves the Cohort model computationally underspecified. Last, the Cohort model emphasizes the exact match between auditory input and lexical representations rather than sublexical representations, whereas the TRACE model has a phoneme level between the word level and the feature level (Jusczyk & Luce, 2002).

Table 1. The features for the Cohort and the TRACE models (Jusczyk & Luce, 2002)

Feature                                          Cohort                        TRACE
Activation                                       Constrained                   Radical
Units and levels lateral inhibition              No                            Yes
Sublexical-to-lexical interaction (bottom-up)    Facilitative and inhibitory   Facilitative
Lexical-to-sublexical interaction (top-down)     No                            Facilitative
Distinguishing features                          1. Focus on time-course       3. Attempts to account for
                                                    of recognition                broad range of phenomena

2.2 Prosody in spoken word recognition

According to Cutler, Dahan, and van Donselaar (1997), prosody is an intrinsic

determinant of the spoken form in languages. This intrinsic determinant is realized as

an effect on the timing, amplitude, and frequency spectrum of the utterance. Prosody

includes intonation, duration, stress, and tone. One of the important features is that it

spans over long segments such as syllables, words, and the utterances in speaking

style, sentence type, and so on. Prosodic cues can convey lexical and nonlexical information: for example, tone distinguishes lexical meaning, stress marks prominence, and sentence intonation expresses emotion. Any part of speech has duration, amplitude, and fundamental frequency. Therefore, when listeners recognize speech, they are processing the variation determined by prosody (Cutler et al., 1997; Leena, 2012).

When and how might prosodic information play a role in processing? Early findings suggested that prosody plays an organizing role in speech. For example, strings of nonsense syllables are recalled better when they are presented with sentence prosody (Epstein, 1961). In addition, Cutler et al. (1997) suggested that the processing of speech input is facilitated by a coherent prosodic structure appropriate for sentences. Studies of such facilitation effects have established a significant role for temporal patterning: temporal envelopes of spoken utterances, which preserve amplitude information but contain virtually no spectral variation, allow listeners to recognize short utterances and even nonsense syllables almost perfectly (Cutler et al., 1997; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). Moreover, listeners use relevant acoustic information as soon as it becomes available. For instance, listeners efficiently use coarticulatory information that carries over from one segment to another (Whalen, 1991). Thus, some researchers propose that whenever the prosodic

information could constrain initial lexical activation, it is important to see what prosodic information is processed by listeners and how (Cutler et al., 1997).

Because prosodic information has varied characteristics, the roles of different kinds of prosodic information, such as stress and tone, in spoken word recognition have been investigated. Most research on lexical access has been carried out in English; hence, the prosodic structure that has been investigated is stress (Cutler, 1986; van Donselaar et al., 2005). In English, stress patterns can be contrasted only in a multisyllabic domain, not within a monosyllable as in tone languages. Tone languages such as Cantonese and Mandarin are therefore good test cases, because tone contrasts may be realized in a monosyllable (Cutler & Chen, 1997; Taft & Chen, 1992).

2.3 Stress in lexical processing

Studies of English vocabulary structure suggest that listeners could use stress-pattern information in word recognition. However, some studies have shown that stress information does not help English listeners in auditory lexical decision or in grammatical category judgment. In Cutler and Clifton (1984), participants made grammatical category judgments on bisyllabic words with or without the standard stress pattern (for example, initial stress for nouns or final stress

for verbs). The results showed that reaction time was not affected by the stress pattern. Cutler (1986) used a cross-modal priming task with minimal stress pairs such as OBject-obJECT and FORbear-forBEAR. If listeners used stress information, the two members of a pair would not be treated as homophones and no homophonic priming effect would be expected. Subjects listened to a sentence containing a prime whose meaning was related to the target and then performed a lexical decision task. The results showed that the members of a pair could prime each other. Subjects treated the stress minimal pairs as homophones, suggesting that the access code was not influenced by prosodic stress information. Listeners did not discriminate between the two words in the initial access to the lexicon.

In Dutch, stress information is involved in spoken word recognition (van Donselaar et al., 2005). van Donselaar et al. (2005) also used cross-modal priming experiments to examine the role of suprasegmental information in processing Dutch. The results showed that inappropriate stress could prevent lexical activation. The authors also suggested that suprasegmental information constrained processing within a single syllable in Dutch, indicating that it began to modulate the activation of potential candidate words as soon as the relevant acoustic information was available. The inconsistent results between English and Dutch are

probably due to the rarity of minimal stress pairs in English. Although English is a lexical-stress language, the stress cue might be redundant in lexical processing, because stress information in English can nearly always be derived from segmental information (Cutler et al., 1997).

2.4 Tonal processing

2.4.1 Tone perception

Acoustic analysis of tone typically focuses on the fundamental frequency (F0), which quantifies the rate of vocal fold vibration and is usually expressed in Hertz (Hz). According to Jongman et al. (2006), tone is a function of the rate of vocal fold vibration. F0 height and F0 contour are the crucial acoustic parameters for characterizing Mandarin tones.
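The two parameters can be made concrete with a minimal sketch that summarizes an F0 track by its mean (as a stand-in for height) and its least-squares slope over time (as a stand-in for contour). Both operationalizations are simplifying assumptions, and the F0 values and the 50 ms sampling step below are hypothetical numbers, not measurements from any cited study.

```python
def f0_height(f0_hz):
    """Mean F0 in Hz, used here as a simple index of pitch height."""
    return sum(f0_hz) / len(f0_hz)


def f0_contour_slope(f0_hz, step_ms=50.0):
    """Least-squares slope of F0 over time (Hz per ms), a crude contour index."""
    n = len(f0_hz)
    if n < 2:
        raise ValueError("need at least two F0 samples to estimate a slope")
    times = [i * step_ms for i in range(n)]
    t_mean = sum(times) / n
    f_mean = sum(f0_hz) / n
    num = sum((t - t_mean) * (f - f_mean) for t, f in zip(times, f0_hz))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


# Hypothetical F0 tracks sampled every 50 ms (illustrative values only).
level_track = [220.0, 220.0, 220.0, 220.0, 220.0]    # level, like Tone 1
falling_track = [260.0, 235.0, 210.0, 185.0, 160.0]  # falling, like Tone 4

print(f0_height(level_track), f0_contour_slope(level_track))
print(f0_height(falling_track), f0_contour_slope(falling_track))
```

In this sketch the two tracks have similar average heights but clearly different slopes, which is why height alone would not separate them.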

Researchers have explored the contribution of F0 height and F0 contour to tonal perception. Some studies suggested that, for Mandarin listeners, both F0 height and F0 contour are important. However, other studies claimed a more crucial role for F0 contour (Gandour, 1984; Jongman et al., 2006).

Recently, Tsang et al. (2011) used event-related potentials (ERPs) to examine

how pitch contour and pitch height contributed to early tonal processing in an auditory passive oddball paradigm in Cantonese. Classifying the six Cantonese tones by pitch height and contour, the authors manipulated four conditions: height-large difference (Tone 6/Tone 1), height-small difference (Tone 6/Tone 3), contour-early difference (Tone 1/Tone 2), and contour-late difference (Tone 6/Tone 2). In the experiment, the stimuli (e.g., /ji1/, /ji2/, /ji3/, and /ji6/) were presented while the participant watched a self-chosen silent movie with closed captions. The results indicated that the turning point on the pitch contour could modulate the effect of pitch height, suggesting that Cantonese speakers did not process pitch contour and pitch height as totally unrelated dimensions. In Mandarin Chinese, Lai and Zhang (2008) used a gating paradigm to examine the amount of tonal information needed to correctly identify the four tones. In the experiment, there were eight tone quadruplets, each containing the same segmental structure but different tones. Subjects were asked to identify the tone of each gated stimulus (in 40-msec increments) and to provide a confidence rating for their response on a scale of one to seven by pressing the corresponding button. The stimuli were presented in a duration-blocked fashion, in which participants heard the stimuli from the first gate to the last gate, which always contained the entire syllable. The isolation point, defined as the size of the segment needed to correctly identify the tone, was examined. The results showed that the

isolation point differed among the four tones. The earliest isolation point was for Tone 1, followed by Tone 4, and then by Tone 2 and Tone 3. To sum up, the acoustic features of the four Mandarin tones affect tone perception.
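How an isolation point might be derived from gated responses can be sketched as follows. The sketch assumes the common convention that the isolation point is the earliest gate from which the response is correct and does not change afterwards; this criterion, the function name, and the response sequences are assumptions for illustration and are not taken from Lai and Zhang (2008). Only the 40 ms gate increment comes from the description above.

```python
GATE_MS = 40  # gate increment described in the text


def isolation_point(responses, target, gate_ms=GATE_MS):
    """Duration (ms) of the earliest gate from which responses stay correct.

    responses is the list of tone labels given at successive gates.
    Returns None if the listener never settles on the target tone.
    """
    point = None
    for gate_index, response in enumerate(responses, start=1):
        if response == target:
            if point is None:
                point = gate_index * gate_ms  # start of a correct run
        else:
            point = None  # a later error resets the candidate point
    return point


# Hypothetical response sequences for two trials.
print(isolation_point(["T2", "T1", "T1", "T1"], target="T1"))        # 80
print(isolation_point(["T3", "T2", "T3", "T3", "T3"], target="T3"))  # 120
print(isolation_point(["T2", "T2", "T4"], target="T3"))              # None
```

Under this convention, an early correct guess that is later abandoned does not count, so the second trial yields 120 ms rather than 40 ms.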

Whether the acoustic similarity among tones affects tonal perception has been examined in both Mandarin Chinese and Cantonese. There are six tones in Cantonese. Tone 1 is the most distinct from the other tones, whereas the other tones occupy similar points on the F0 scale. In a lexical decision task, Cutler and Chen (1997) manipulated the match and mismatch of phonological structure between prime and target. In this study, the Cantonese tones were divided into an “easy” group, consisting of Tone 1, and a “hard” group, consisting of the remaining tones. Tone 1 was placed in the easy group because of its acoustic distinctiveness from the other Cantonese tones, whereas the “hard” group comprised the tones with similar acoustic contours. The results showed that the “easy” group had a lower error rate and faster response times than the “hard” group. As for Mandarin Chinese, Ye and Connine (1999) explored the role of tonal similarity in a vowel and tone monitoring task. In their Experiment 3, Tone 2 and Tone 3 were grouped as “close” tones, while Tone 2 and Tone 4 were labeled as “far” tones. The results showed that the reaction time to the “far” tones was longer than that to the “close” tones in both the vowel and tone
