National Chengchi University (國立政治大學)
the behavioral task, recent studies have applied experimental techniques such as
event-related potentials (ERPs) or eye-tracking to examine on-line auditory processing.
These results suggest that tonal and segmental information is accessed at a
similar point in time. Therefore, tonal and segmental information might play a
comparable role during spoken word recognition (Malins & Joanisse, 2010; Schirmer,
Tang, Penney, Gunter, & Chen, 2005; Tsang, Jia, Huang, & Chen, 2011).
1.2 Research questions
The present study conducts two eye movement experiments to examine the time
course of tone information processing during spoken character recognition. Specific
research questions to be addressed are as follows:
(1) When is tonal information processed during spoken character recognition? Is tonal
information processed in an early phase of spoken character recognition? Or is it
accessed at a relatively late stage of spoken character processing?
(2) In what way does tone interact with segmental information in lexical processing? Does
lexical tone affect spoken word processing independently? Or does the influence of tonal
information on lexical processing depend on segmental information?
Literature Review
2.1 Processing the spoken language signal
Speech perception roughly involves three levels of processing: the auditory level, the
phonetic level, and the phonological level (Carroll, 2008; Frauenfelder & Tyler, 1987;
Lass, 1976; Studdert-Kennedy, 1976). At the auditory level, the signal is represented
in terms of its frequency, intensity, and temporal attributes, which could be shown on
a spectrogram. At the phonetic level, the individual phones are identified by a
combination of acoustic cues such as the formant transitions. At the phonological
level, the phonetic segment is converted into a phoneme, and phonological rules are
applied to the sound sequence. Listeners process these levels successively
when decoding speech signals (Carroll, 2008). Listeners first discriminate auditory
signals from other sensory signals and decide whether the auditory stimuli are
something they have heard. They then identify the particular properties of the signal
and qualify it as speech. Lastly, these properties are recognized as the meaningful
speech of a particular language (Carroll, 2008).
2.1.1 Perception of phonetic segments
With regard to speech perception, many researchers are interested in how
listeners manage to decode speech signals into phonetic units and derive meaningful
words. The properties that help listeners identify phonetic segments, such as those of
vowels and consonants, are tightly intertwined and overlapping (Gleason & Ratner, 1998).
One of the issues for speech perception is how individual words are separated from the
complex speech input and then identified appropriately.
Moreover, there is no one-to-one correspondence between phonemes and their
acoustic realizations. This problem, termed the lack of invariance, results
from the phenomenon of context-conditioned variation (Carroll, 2008; Frauenfelder &
Tyler, 1987; Gleason & Ratner, 1998). Context-conditioned variation refers to the
fact that the production of the same phonetic segment varies depending on the
environment in which the segment is produced. However, some studies suggest that speech
perception relies on both invariant and context-conditioned cues (Cole & Scott,
1974).
Another issue about the segmental perception is the phenomenon of categorical
perception. Categorical perception is typically found for contrasts between many
different pairs of consonants. In categorical perception, perceptual systems
transform relatively linear sensory signals into absolute or categorical non-linear
mental representations. In speech, listeners convert continuous auditory signals
into discrete, meaningful words. According to Liberman, Harris, Hoffman, and
Griffith (1957), listeners’ ultimate task is to identify a sound such as [p] or [b] as
belonging to one or another category of speech sounds. The minimal feature
distinguishing [p] from [b] is voicing. To notice the difference between voiced [b] and
voiceless [p], the interval between the release of the sound at the lips and the onset
of vocal cord vibration is crucial. This interval is termed voice onset time (VOT):
vibration begins almost immediately for voiced [b] but only after a short lag for
voiceless [p]. Some categorical perception studies have constructed synthesized speech
syllables to examine whether categorical perception holds for nonspeech sounds such as
chirps or only for speech (Jusczyk & Luce, 2002; Liberman et al., 1957). The
researchers found that categorical perception held for speech rather than for
nonspeech. However, there is still no firm conclusion regarding whether there is a
special mode of speech perception (Jusczyk & Luce, 2002; Liberman et al., 1957).
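The categorical labeling of a VOT continuum described above can be illustrated with a minimal sketch. It assumes a logistic identification function; the 25 ms category boundary, the slope value, and the function names are hypothetical choices made for illustration and are not taken from Liberman et al. (1957).

```python
import math

def prob_voiceless(vot_ms, boundary_ms=25.0, slope=0.5):
    """Probability of labeling a stimulus as voiceless [p].

    Assumption: identification follows a logistic curve centered on a
    hypothetical category boundary (boundary_ms).
    """
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))

def label(vot_ms, boundary_ms=25.0):
    """Assign the category with the higher identification probability."""
    return "p" if prob_voiceless(vot_ms, boundary_ms) >= 0.5 else "b"

# A synthetic continuum from 0 to 60 ms VOT in equal 10 ms steps.
continuum = list(range(0, 61, 10))
labels = [label(v) for v in continuum]

# Equal physical steps produce unequal perceptual steps:
within_category = prob_voiceless(10) - prob_voiceless(0)    # both heard as [b]
across_boundary = prob_voiceless(30) - prob_voiceless(20)   # [b] versus [p]
```

Under these assumptions, a 10 ms step that crosses the boundary changes the labeling probability far more than a 10 ms step within a category, which is the signature of categorical perception.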
Because vowels are continuous and noncategorical in character, vowel
perception differs from consonant perception (Fry, Abramson, Eimas, &
Liberman, 1962). Vowels have longer and larger formants, whereas consonants are signaled
by formant transitions; these transient cues force listeners to impose a
categorical identity on consonants more rapidly than on vowels. Therefore, after the
stimuli have been identified, the cues for the consonants are lost, and only the coded
stimuli remain. Additionally, because vowels are relatively long in duration,
they appear to be processed longer at the auditory level than
consonants (Carroll, 2008; Frauenfelder & Tyler, 1987; Garman, 1990).
2.1.2 Lexical access and models
In addition to the issues of discriminating and categorizing phonetic segments,
many researchers are interested in extending the inquiry to the processes by which
spoken words are recognized and their meanings retrieved. Psycholinguists are eager to
understand how listeners use phonological and prosodic knowledge to parse the
sensory input during word recognition (Grosjean & Gee, 1987; Lyn, 1987; Frauenfelder &
Tyler, 1987).
Models of spoken word recognition generally assume that phonological
information is continuously integrated during spoken word recognition. When the
speech is unfolding, lexical candidates compete for recognition as a function of
phonological similarity with the speech input (Foss & Hakes, 1978; Garman, 1990;
Gleason & Ratner, 1998; Myers, Laver, & Anderson, 1981). The models differ
in how they explain the temporal dynamics of spoken word recognition, that is, the
mapping between the incoming speech stimuli and potential lexical representations.
One influential model is the Cohort model (Marslen-Wilson & Welsh,
1978; Marslen-Wilson, 1987). The Cohort model proposes that the onset of a
word activates a set of lexical candidates that compete for recognition. In the first,
autonomous stage, when the first phoneme of a word is heard, all candidates
that share that initial phoneme are activated. For example, if the phoneme /d/ in the
word “drive” is heard, many candidates beginning with /d/, such as “dive,” “drink,”
“date,” and “dunk,” may be activated. This set
of activated words is called the “cohort.” The words in the cohort are not assumed to
affect the activation levels of one another, which means that at this stage word
recognition is a completely data-driven, bottom-up process. In the second stage,
once a cohort has been activated, all possible sources of information
may begin to influence the selection of the target word from the cohort. Additional
acoustic-phonetic information may eliminate some of the cohort words. The incoming
phonetic information is assumed to work in a strictly left-to-right fashion. However,
in this stage, higher-level sources of information may also help to eliminate the
hypothesized cohort members. For instance, if the phoneme /r/ follows
the phoneme /d/, this further acoustic-phonetic information may eliminate cohort words
such as “dive,” “date,” and “dunk.” Higher-level sources of
information may then eliminate other words in the cohort, such as “drink,”
which might not fit the available semantic or syntactic
information. Spoken word recognition is finally achieved when a single candidate
remains in the cohort. A later, revised Cohort model also considers other sources
of information such as the word frequency effect (Frauenfelder & Tyler, 1987; Gleason &
Ratner, 1998; Jusczyk & Luce, 2002; Marslen-Wilson & Tyler, 1980;
Marslen-Wilson, 1987).
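The two stages of the Cohort model can be sketched as a left-to-right filter over a toy lexicon. This is only an illustrative sketch: the lexicon, the rough phonemic transcriptions, the function names, and the `fits_context` predicate standing in for higher-level semantic or syntactic information are all assumptions introduced here and are not part of the original model specification.

```python
# Toy lexicon with rough, hypothetical phonemic transcriptions.
LEXICON = {
    "drive": ("d", "r", "ay", "v"),
    "dive": ("d", "ay", "v"),
    "drink": ("d", "r", "ih", "ng", "k"),
    "date": ("d", "ey", "t"),
    "dunk": ("d", "ah", "ng", "k"),
    "tea": ("t", "iy"),
}

def cohort_after(heard, lexicon=LEXICON):
    """Bottom-up stage: words whose onset matches the phonemes heard so far."""
    heard = tuple(heard)
    n = len(heard)
    return sorted(w for w, phones in lexicon.items() if phones[:n] == heard)

def recognize(phonemes, fits_context, lexicon=LEXICON):
    """Return (word, number of phonemes heard) once one candidate remains.

    The first phoneme activates the cohort purely bottom-up; from the
    second phoneme onward, the context predicate may also remove candidates.
    """
    for i in range(1, len(phonemes) + 1):
        cohort = cohort_after(phonemes[:i], lexicon)
        if i > 1:
            cohort = [w for w in cohort if fits_context(w)]
        if len(cohort) == 1:
            return cohort[0], i
    return None, len(phonemes)

drive = LEXICON["drive"]
initial_cohort = cohort_after(["d"])
after_r = cohort_after(["d", "r"])
# Hypothetical context in which "drink" does not fit the sentence.
with_context = recognize(drive, lambda w: w != "drink")
without_context = recognize(drive, lambda w: True)
```

In this sketch the word is isolated after two phonemes when context is available and after three phonemes otherwise, in both cases before the word's offset.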
The TRACE model is an interactive model (McClelland & Elman, 1986) that
assumes three levels of primitive processing units: features, phonemes, and
words (Figure 2). These processing units have excitatory connections between
levels and inhibitory connections within levels. These connections can raise or
lower the activation levels of the nodes according to the stimulus input and the
activity in the system. For example, voiced stimuli such as the consonants
/b/, /d/, or /g/ will activate the voiced feature at the feature level of the model.
This feature unit in turn passes its activation to all voiced phonemes at the next
level, which in turn activate the words containing those phonemes. Furthermore, via
lateral inhibition among units within a level, the most activated unit may come to
dominate competing units that are also temporarily consistent with the input.
For example, the word unit cat at the lexical level will inhibit similar,
competing lexical units (e.g., pat). This inhibition helps to ensure that the best
candidate word will win the competition in the process (Gleason & Ratner, 1998;
Jusczyk & Luce, 2002; McClelland & Elman, 1986).
Figure 2. A subset of the units in the TRACE. Each rectangle represents a different unit. The labels indicate the item for which the unit stands, and the horizontal edges of the rectangle indicate the portion of the TRACE spanned by each unit. The input feature specifications for the phrase “tea cup,” preceded and followed by silence, are indicated for the three illustrated dimensions by the blackening of the corresponding feature units (McClelland & Elman, 1986).
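The effect of lateral inhibition among word units can be illustrated with a minimal sketch. It is a strongly simplified stand-in, not the activation function of McClelland and Elman (1986): the update rule, the parameter values, and the bottom-up support assigned to cat and pat are assumptions chosen for illustration.

```python
def compete(bottom_up, inhibition=0.5, rate=0.1, steps=200):
    """Iterate a simplified within-level competition among word units.

    bottom_up maps each word unit to its (hypothetical) support from the
    phoneme level. Each unit is pushed toward its net input, which is its
    bottom-up support minus inhibition from the other word units.
    """
    act = {word: 0.0 for word in bottom_up}
    for _ in range(steps):
        updated = {}
        for word, a in act.items():
            rivals = sum(v for other, v in act.items() if other != word)
            net = bottom_up[word] - inhibition * rivals
            # Move toward the net input; keep activation within [0, 1].
            updated[word] = min(1.0, max(0.0, a + rate * (net - a)))
        act = updated
    return act

# Assumed input: "cat" matches the signal better than its competitor "pat".
support = {"cat": 1.0, "pat": 0.6}
with_inhibition = compete(support)
without_inhibition = compete(support, inhibition=0.0)
```

With these assumed values, the competitor pat settles at a much lower activation when lateral inhibition is present than when it is switched off, while cat remains dominant.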
There are differences between the Cohort and the TRACE models (Table 1). First,
the Cohort model emphasizes the temporal dynamics of spoken word recognition.
The Cohort model stresses the significance of word-initial information, which means
that spoken words may be identified before their offsets if similar competitors are not
active. The TRACE model, by contrast, duplicates the nodes and connections of its
system across successive time slices of the input. This treatment of the temporal
dynamics of spoken word recognition might be questionable, because the time-slice
solution results in an extremely complex structure. Second, although the TRACE model is
relatively complex, its highly interactive architecture gives it computational
specificity, which makes it relatively easy to test the model directly against behavior
through simulation. This feature therefore helps the model account for a broad range of
phenomena. By contrast, the lack of interaction leaves the Cohort model with little
computational specificity. Last, the Cohort model emphasizes the exact match between
auditory input and lexical representations rather than sublexical representations,
whereas the TRACE model has a phoneme level between the word
level and the feature level (Jusczyk & Luce, 2002).
Table 1. The features of the Cohort and the TRACE models (Jusczyk & Luce, 2002)

Feature                                         Cohort                                  TRACE
Activation                                      Constrained                             Radical
Units and levels
Lateral inhibition                              No                                      Yes
Sublexical-to-lexical interaction (bottom-up)   Facilitative and inhibitory             Facilitative
Lexical-to-sublexical interaction (top-down)    No                                      Facilitative
Distinguishing features                         1. Focus on time-course of recognition  3. Attempts to account for broad range of phenomena
2.2 Prosody in spoken word recognition
According to Cutler, Dahan, and van Donselaar (1997), prosody is an intrinsic
determinant of the spoken form in languages. This intrinsic determinant is realized as
an effect on the timing, amplitude, and frequency spectrum of the utterance. Prosody
includes intonation, duration, stress, and tone. One of its important features is that it
spans long stretches of speech such as syllables, words, and utterances, and it varies
with speaking style, sentence type, and so on. Prosodic cues can convey lexical and
nonlexical information: for example, tone distinguishes lexical meaning, stress marks
prominence, and sentence intonation expresses emotion.
Any stretch of speech has duration, amplitude, and fundamental frequency. Therefore,
when listeners recognize speech, they are processing variation determined by
prosody (Cutler et al., 1997; Leena, 2012).
When and how might prosodic information play a role in processing? Early
findings suggested that prosody plays an organizing role in speech. For example,
strings of nonsense syllables are recalled better only if they are
presented with sentence prosody (Epstein, 1961). In addition, Cutler et al. (1997)
suggested that the processing of speech input is facilitated by coherent prosodic
structure appropriate for sentences. Studies of such facilitatory effects have
established a significant role for temporal patterning. Thus the temporal envelopes of
spoken utterances, which preserve amplitude information but contain virtually no
spectral variation, allow listeners to recognize short utterances and even nonsense
syllables almost perfectly
(Cutler et al., 1997; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). Moreover,
listeners use relevant acoustic information as soon as it becomes available. For
instance, listeners efficiently use coarticulatory information that carries over from
one segment to another (Whalen, 1991). Thus, some researchers propose that whenever the prosodic
information could constrain initial lexical activation, it is important to determine
which prosodic information listeners process and how they process it (Cutler et al., 1997).
Because prosodic information varies in its characteristics, the role of different types
of prosodic information, such as stress and tone, in spoken word recognition has been
investigated. Most research on lexical access has been carried out in English; hence,
the prosodic structure most often investigated is stress (Cutler, 1986; van Donselaar
et al., 2005). In English, stress patterns can only be contrasted across multiple
syllables, whereas in tone languages a contrast can be carried by a single syllable.
Tone languages such as Cantonese or Mandarin are good illustrations because tone
contrasts may be realized within a monosyllable (Cutler & Chen, 1997; Taft & Chen, 1992).
2.3 Stress in lexical processing
Studies of English vocabulary structure suggest that listeners could use
stress-pattern information in word recognition. However, some studies showed that
stress information does not help English listeners in auditory lexical decision
or in grammatical category judgment. In Cutler and Clifton (1984), participants
performed a grammatical category judgment on bisyllabic words with or
without the standard stress pattern (for example, initial stress for nouns or final stress
for verbs). The results showed that reaction time was not affected by the
stress pattern. Cutler (1986) used a cross-modal priming task to examine
stress-contrasting pairs such as OBject-obJECT and FORbear-forBEAR. If
listeners used stress information, the two members of a pair would not be
treated as homophones and no homophonic priming effect would be expected.
Subjects listened to a sentence containing a prime whose meaning was
related to the target and then performed a lexical decision task. The results showed
that the members of a pair could prime each other. Subjects treated the stress minimal
pairs as homophones, suggesting that the access code was not influenced by stress
information. Listeners did not discriminate between the two words during
initial access to the lexicon.
In Dutch, stress information is involved in spoken word recognition (van
Donselaar et al., 2005). Van Donselaar et al. (2005) also used cross-modal priming
experiments to examine the role of suprasegmental information in processing Dutch.
The Dutch results showed that inappropriate stress could prevent lexical
activation. The authors also suggested that suprasegmental information constrained
processing within a single syllable in Dutch, indicating that the constraint began as
soon as the relevant acoustic information was available to modulate the activation of
potential candidate words. The inconsistent results between English and Dutch are
probably due to the rarity of minimal stress pairs in English. Although English is a
lexical-stress language, the stress cue might be redundant in lexical processing,
because stress information in English can nearly always be derived from segmental
information (Cutler et al., 1997).
2.4 Tonal processing
2.4.1 Tone perception
Acoustic analysis of tone typically focuses on the fundamental frequency
(F0), which quantifies the rate of vocal fold vibration and is usually
expressed in Hertz (Hz). According to Jongman et al. (2006), tone is a function of the
rate of vocal fold vibration. F0 height and F0 contour are the crucial acoustic
parameters for characterizing Mandarin tones.
Researchers have explored the contribution of F0 height and F0 contour to tonal
perception. Some studies suggested that, for Mandarin listeners, both F0 height
and F0 contour are important. However, other studies claimed a more crucial role for F0
contour (Gandour, 1984; Jongman et al., 2006).
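The distinction between F0 height and F0 contour can be made concrete with a minimal sketch that summarizes an F0 track by its mean (height) and its least-squares slope (a crude linear proxy for contour). The function name, the 10 ms sampling step, and the F0 values are hypothetical; a linear slope would not capture a dipping contour such as that of Mandarin Tone 3.

```python
def f0_height_and_contour(f0_track_hz, step_ms=10.0):
    """Return (mean F0 in Hz, linear F0 slope in Hz per ms).

    Assumption: the track is sampled at equal intervals of step_ms.
    """
    n = len(f0_track_hz)
    if n < 2:
        raise ValueError("need at least two F0 samples")
    times = [i * step_ms for i in range(n)]
    mean_t = sum(times) / n
    mean_f0 = sum(f0_track_hz) / n
    # Ordinary least-squares slope of F0 over time.
    cov = sum((t - mean_t) * (f - mean_f0) for t, f in zip(times, f0_track_hz))
    var = sum((t - mean_t) ** 2 for t in times)
    return mean_f0, cov / var

# Hypothetical tracks: a high level tone and a falling tone.
level_height, level_slope = f0_height_and_contour([220.0] * 5)
fall_height, fall_slope = f0_height_and_contour([240.0, 220.0, 200.0, 180.0, 160.0])
```

In this illustration the two tracks differ in both dimensions: the level track has the higher mean F0 and a zero slope, whereas the falling track has a negative slope.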
Recently, Tsang et al. (2011) used event-related potentials (ERPs) to examine
how pitch contour and pitch height contributed to early tonal processing in an
auditory passive oddball paradigm in Cantonese. Classifying the six Cantonese tones
by pitch height and contour, the authors manipulated four conditions: height-large
difference (Tone 6/Tone 1), height-small difference (Tone 6/Tone 3), contour-early
difference (Tone 1/Tone 2), and contour-late difference (Tone 6/Tone 2). In the
experiment, the stimuli (e.g., /ji1/, /ji2/, /ji3/, and /ji6/) were presented while the
participant watched a self-chosen silent movie with closed captions. The results
indicated that the turning point of the pitch contour could modulate the effect of pitch
height, suggesting that Cantonese speakers did not process pitch contour and pitch
height as totally unrelated dimensions. In Mandarin Chinese, Lai and Zhang (2008)
used a gating paradigm to examine the amount of tonal information needed to
correctly identify the four tones. The experiment used eight tone
quadruplets, each containing the same segmental structure but different tones.
Subjects were asked to identify the tone of each gated stimulus (gates increased in
40-msec increments) and to provide a confidence rating on a scale of one to seven for
their response by pressing the corresponding button. The stimuli were presented in a
duration-blocked fashion, in which participants heard the stimuli from the first gate
to the last gate, which always contained the entire syllable. The isolation point, that
is, the amount of the signal needed to identify the tone correctly, was examined. The
results showed that the
isolation point differed among the four tones. Tone 1 had the earliest isolation
point, followed by Tone 4, and then by Tone 2 and Tone 3. To sum up, the
acoustic features of the four Mandarin tones affect tone perception.
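The logic of the gating procedure can be sketched as follows. The sketch assumes that the isolation point is scored as the earliest gate from which the response is correct and stays correct through the final gate, a common scoring convention that is not necessarily the criterion used by Lai and Zhang (2008). The function names, the 200 msec syllable duration, and the response sequence are hypothetical.

```python
def gate_durations(total_ms, step_ms=40):
    """Gate end points in msec: fixed increments, with the last gate
    always containing the entire syllable."""
    gates = list(range(step_ms, total_ms, step_ms))
    gates.append(total_ms)
    return gates

def isolation_point(gates, responses, target):
    """Earliest gate from which every response matches the target tone.

    Returns None if the response at the final gate is still incorrect.
    """
    point = None
    for gate, response in zip(reversed(gates), reversed(responses)):
        if response != target:
            break
        point = gate
    return point

# Hypothetical 200 msec syllable carrying Tone 4; the listener first
# reports Tone 1 and then settles on Tone 4 from the second gate onward.
gates = gate_durations(200)
point = isolation_point(gates, [1, 4, 4, 4, 4], target=4)
```

Comparing such isolation points across tones is what allows the amount of signal required for each tone to be ranked.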
Whether the acoustic similarity among tones affects tonal perception has
been examined in both Mandarin and Cantonese. There are six tones in Cantonese.
Tone 1 is the most distinct from the other tones, while the other tones occupy similar
positions on the F0 scale. In a lexical decision task, Cutler and Chen (1997)
manipulated the match and mismatch of phonological structure between prime and target.
In this study, Cantonese tones were divided into an “easy” group, consisting of Tone 1,
and a “hard” group, consisting of the remaining tones. Tone 1 was placed in the “easy”
group because of its acoustic distinctiveness from the other Cantonese tones. The
“hard” group comprised the tones with similar acoustic contours. The results showed
that the “easy” group had a lower error rate and faster response times than the “hard”
group. As for Mandarin, Ye and Connine
(1999) explored the role of tonal similarity in a vowel and tone monitoring task. In
their Experiment 3, Tone 2 and Tone 3 were grouped as “close” tones, while
Tone 2 and Tone 4 were labeled as “far” tones. The results showed that the reaction
time to the “far” tones was longer than that to the “close” tones in both the vowel and tone