Journal of Phonetics


Special Issue: Integrating Phonetics and Phonology, eds. Cangemi & Baumann

Phonology, phonetics, and signal-extrinsic factors in the perception of prosodic prominence: Evidence from Rapid Prosody Transcription

Jason Bishop^a,b,*, Grace Kuo^c, Boram Kim^a,d

^a The CUNY Graduate Center, New York, NY 10016, USA

^b The College of Staten Island-CUNY, Staten Island, NY 10314, USA

^c National Taiwan University, Taipei City 10617, Taiwan

^d Haskins Laboratories, New Haven, CT 06511, USA

Article info

Article history:

Received 3 April 2018

Received in revised form 15 March 2020

Accepted 17 March 2020

Available online 28 July 2020

Abstract

The present study investigated the perception of phrase-level prosodic prominence in American English, using the Rapid Prosody Transcription (RPT) task. We had two basic goals. First, we sought to examine how listeners’ subjective impressions of prominence relate to phonology, defined in terms of Autosegmental-Metrical distinctions in (a) pitch accent status and (b) pitch accent type. Second, and in line with this special issue, we sought to explore how phonology might mediate the effects of other cues to prominence, both signal-based (acoustic) and signal-extrinsic (stimulus and listener properties) in nature. Findings from a large-scale RPT experiment (N = 158) show prominence perception in this task to vary significantly as a function of phonology; a word’s perceived prominence is significantly dependent on its accent status (unaccented, prenuclear accented, or nuclear accented) and, to a slightly lesser extent, on pitch accent type (L*, !H*, H*, or L+H*). In addition, the effects of other known cues to prominence (both signal-based acoustic factors and more “top-down” signal-extrinsic factors) were found to vary systematically depending on accent status and accent type. Taken together, the results of the present study provide further evidence for the complex nature of prominence perception, with implications for our knowledge of prosody perception and for the use of tasks like RPT as a method for crowdsourcing prosodic annotation.

1. Introduction

1.1. Overview

In the present study we investigated the perception of phrase-level prosodic prominence in American English, using the Rapid Prosody Transcription task (Cole, Mo, & Hasegawa-Johnson, 2010; Cole, Mo, & Baek, 2010). Of particular interest to us was how listeners’ subjective impressions of prominence at this relatively macroscopic level relate to (a) phonology, (b) phonetic realization, and (c) signal-extrinsic factors. By phonology, we mean the linguistic contrasts related to accentuation within the Autosegmental-Metrical framework (Pierrehumbert, 1980; Gussenhoven, 1984; Beckman & Pierrehumbert, 1986; Ladd, 1996, 2008; see Arvaniti, to appear, for a recent overview), and more specifically, the categories available within the Tones and Break Indices (ToBI) conventions for Mainstream American English (MAE_ToBI; Beckman & Hirschberg, 1994; Beckman & Ayers Elam, 1997). By phonetic realization, we mean the gradient acoustic correlates of prominence found in physical speech output. Finally, by signal-extrinsic factors, we mean the (non-phonological) “top-down” properties of stimuli and of listeners themselves (i.e., individual differences) that serve as predictors of perceived prominence.

The paper proceeds as follows. First, we discuss some preliminaries to the study of prominence perception in English, as well as the motivations for our study in particular (Section 1.2). We then describe some of the features of the methodology utilized in our investigation, and the specific research questions to be explored (Section 1.3). Following these introductory sections, we present an RPT experiment in English (Section 2), which is followed by a discussion of the findings (Section 3) and concluding remarks (Section 4).

*Corresponding author at: City University of New York, 365 Fifth Avenue, New York, NY 10016, USA. Fax: +1 212 817 1526.

E-mail address: [email protected] (J. Bishop).



1.2. Predicting perceived prominence

A wealth of studies in recent years have attempted to identify the factors that predict, for English and closely related languages, listeners’ impressions that some words stand out as stronger than others (e.g., Baumann, 2006, 2014; Bishop, 2012a, 2013, 2016; Kimball & Cole, 2014, 2016; Streefkerk, Pols, & ten Bosch, 1997; Eriksson, Thunberg, & Traunmüller, 2001; Kochanski, Grabe, Coleman, & Rosner, 2005; Wagner, 2005; Mo, 2008; Jagdfeld & Baumann, 2011; Röhr & Baumann, 2011; Arnold, Möbius, & Wagner, 2011; Baumann & Riester, 2012; Mahrt, Cole, Fleck, & Hasegawa-Johnson, 2012; Cole, Mahrt, & Hualde, 2014; Pintér, Mizuguchi, & Tateishi, 2014; Baumann & Röhr, 2015; Cole, Hualde, Eager, & Mahrt, 2015; Erickson, Kim, Kawahara, Wilson, Menezes, Suemitsu, & Moore, 2015; Mixdorff et al., 2015; Baumann, Niebuhr, & Schroeter, 2016; Hualde, Cole, Smith, Eager, Mahrt, & de Souza, 2016 (see also Cole et al., 2019, this Special Issue); Cole, Mahrt, & Roy, 2017; Niebuhr & Winkler, 2017; Roy, Cole, & Mahrt, 2017; Turnbull, Royer, Ito, & Speer, 2017; see also relevant earlier research that had somewhat different goals; Terken, 1991, 1994; Rietveld & Gussenhoven, 1985; Turk & Sawusch, 1996; Gussenhoven, Repp, Rietveld, Rump, & Terken, 1997; Cambier-Langeveld & Turk, 1999). In this lively body of work, one surprising lack of consensus has persisted that we think helps motivate an investigation into the role that phonology plays in how people perceive prominence. Across studies of English and closely related Dutch and German, we find strong agreement that fundamental frequency (F0), duration, and intensity (the latter two both contributing to the percept of “loudness”) are the primary correlates of both phonological and perceived prominence, but considerable disagreement regarding their importance relative to each other.¹ On the one hand, many studies, especially those in which F0 was manipulated experimentally, have reported changes in perceived prominence to be tightly tied to changes in F0 (e.g., Ladd, Verhoeven, & Jacobs, 1994, and Ladd & Morton, 1997, for English; Rietveld & Gussenhoven, 1985, for Dutch; and Niebuhr & Winkler, 2017, for German). Others, all of which seem to utilize corpus materials, have found F0 to be weakly related to perceived prominence, if related at all (Kochanski et al., 2005; Cole, Mo, & Hasegawa-Johnson, 2010; Cole et al., 2017; but see Mahrt, Cole, Fleck, & Hasegawa-Johnson, 2012). Thus, despite high levels of interest in the matter, confusion remains regarding one of the most basic questions about prominence perception that can be asked.

We point to this particular situation because it highlights a fundamental problem with attempting to model prominence perception directly from quantitative acoustic measures, as many studies have attempted to do, and the fact that corpus-style analyses are the ones that fail to detect listeners’ sensitivity to F0 is perhaps not surprising. To see why, consider the case of a regression model designed to predict prominence perception for words on the basis of particular F0 values (rather than F0 turning points or some other more discrete event) applied to the utterances in (1) through (3). The utterance in (1) features a well-known ambiguity regarding the accent status of a word positioned between two H* accents (Beckman, 1996), a familiar challenge to human ToBI transcribers who must make a categorical decision about the word’s phonological prominence in the absence of a distinctive F0 target (Beckman & Ayers Elam, 1997). However, it is also problematic for a statistical model designed to predict prominence perception as a function of F0 values at particular time points, since (even allowing for some “sagging” in the transition between the two pitch accents) unaccented bought and unaccented a will have F0 values very close to those of accented Marty and accented motorcycle. Yet it seems unlikely that this roughly equal F0 will result in unaccented bought being as perceptually prominent as the two accented words flanking it. The utterance in (2) presents similar difficulties for prediction based on acoustic measures of F0, but for slightly different reasons. Here, the unaccented article a will necessarily have a higher F0 value than nuclear accented motorcycle, though this article is specifically not accented, and thus is also unlikely to be perceived as particularly prominent. Finally, while F0 does correlate with the accentual status (and most likely, perceived prominence) of motorcycle in (3a) versus (3b), this is not the case for motorcycle in (3b) versus (3c). While motorcycle is marked by similarly low F0 in both (3b) and (3c), this is for different reasons: interpolation in (3b) but pitch accent in (3c).

The point here is that there are many prosodic structures—perhaps most—that serve to weaken an overall correlation between particular F0 values and accentual structure. This, in turn, would have the effect of also weakening the relationship between particular F0 values and perceived prominence in statistical analyses of corpora.
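The interpolation problem described for (1) can be made concrete with a small sketch. The code below linearly interpolates F0 between two H* targets and reads off the value at each word's midpoint. Every number in it (target times, F0 values in Hz, word midpoints) is hypothetical and chosen only for illustration, and strictly linear interpolation is a simplifying assumption (the discussion above itself allows for some "sagging" between accents).

```python
# Illustrative sketch only: all times and F0 values below are hypothetical.

def interpolate_f0(targets, t):
    """Linearly interpolate F0 (Hz) at time t between (time, F0) tonal targets."""
    targets = sorted(targets)
    if t <= targets[0][0]:
        return targets[0][1]
    if t >= targets[-1][0]:
        return targets[-1][1]
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)

# Two H* targets: one on "Marty", one on "motorcycle" (hypothetical values).
h_star_targets = [(0.20, 200.0), (1.00, 190.0)]

# Hypothetical word midpoints (seconds) and accent status for the utterance in (1).
words = [
    ("Marty", 0.20, True),
    ("bought", 0.50, False),
    ("a", 0.70, False),
    ("motorcycle", 1.00, True),
]

f0_by_word = {word: interpolate_f0(h_star_targets, t) for word, t, _ in words}

for word, _, accented in words:
    status = "accented" if accented else "unaccented"
    print(f"{word:<11} {status:<11} {f0_by_word[word]:.2f} Hz")
```

Under these assumed values, the unaccented words fall within a few Hz of the accented ones, so a model that sees only F0 at these time points has little basis for separating them.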

(1)

(2)

(3)

1 See also work on other phonetic cues to prominence in Western Germanic languages in Sluijter and Van Heuven (1996), Terken and Hermes (2000) and Epstein (2002). Cues such as voice quality, for example, while surely more marginal in their importance to perceived prominence, are also far less well studied.


In most situations, we think the listener is probably much more like a ToBI transcriber than like the regression model just described, at least in the sense that she interprets most fine-grained phonetic variation only after having performed a parse of the signal into a coarser, more discrete sequence of categorical events. Once a basic phonological structure is thus established (based on, for example, identification of F0 targets and meaningful alignments with the text), the listener can then identify residual acoustic variation. This acoustic variation may be very rich and informative indeed (Cangemi & Grice, 2016), but it is assigned significance mostly in relation to the structure that the listener has already constructed. One consequence of this alternative, and we think much more plausible, scenario is that gradient variation along acoustic dimensions like F0, duration, and intensity will influence prominence perception, but do so in largely category-specific ways. That is, acoustic variation will be treated differently by the listener depending on whether the associated word is accented or unaccented, whether it bears a H* or a L*, and so on. This implies that the best models of prominence perception will need to either (a) be applied to data for which the phonological structure is already known, or (b) include algorithms for assigning that structure automatically (e.g., Rosenberg, 2009). Despite the important and multifaceted role that phonology likely plays in prominence perception, it has received relatively little attention. For this reason, investigating its role was a primary goal for us, which we return to in Section 1.3, where we discuss our research questions more specifically.

In addition to phonology, however, subjective judgments of prominence by human listeners reflect other information not found directly in the acoustic signal, and the effect of these additional signal-extrinsic (or “top-down”) factors should also be of interest. For example, a clear finding from previous work is that impressions of prominence are inversely related to a word’s lexical frequency (Cole, Mo, & Hasegawa-Johnson, 2010; Bishop, 2013; Baumann, 2014; Cole et al., 2017; see also Nenkova et al., 2007) and, to a lesser extent, its number of previous occurrences in the discourse (Cole, Mo, & Hasegawa-Johnson, 2010). It has also been found that words tend to be perceived as more prominent if they occur finally in a prosodic phrase (e.g., Rosenberg, Hirschberg, & Manis, 2010; Jagdfeld & Baumann, 2011; Cole et al., 2017). While it is somewhat unclear what the underlying mechanisms for these effects are, they have more to do with listeners’ expectations about the signal than with the signal itself.

Properties of individual listeners (i.e., individual differences) are another source of signal-extrinsic effects. While very little is known about the factors that underlie individual differences in prosody perception, such differences are certainly known to exist (e.g., Cole, Mo, & Hasegawa-Johnson, 2010; Bishop, 2016; Roy, Cole, & Mahrt, 2017; Cole et al., 2017; Baumann & Winter, 2018). Recently, Jun and Bishop (2015) have suggested that at least some such cross-listener variation in prominence perception may be related to individual differences in “cognitive processing styles” (e.g., Ausburn & Ausburn, 1978; for a recent introduction in the context of phonetic research, see Yu, 2013). Following preliminary findings reported by Bishop (2012b; superseded by Bishop, 2017), Jun and Bishop argue that listeners’ attention to prosodic prominence may be in part affected by individual differences in so-called “autistic traits”, an aspect of cognitive processing style that has been implicated in other well-established speech perception phenomena (e.g., Yu, 2010, 2016; Stewart & Ota, 2008; Ujiie, Asai, & Wakabayashi, 2015). In particular, Jun and Bishop found that listeners who gave more autistic-like responses on the “communication” subscale of the Autism Spectrum Quotient (AQ; Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001) were less likely to comprehend syntactically ambiguous sentences based on accentual patterns (i.e., a smaller influence of what Schafer, Carter, Clifton, & Frazier, 1996, referred to as “Focus Attraction”). Consistent with this, English-speaking listeners who give more autistic-like responses appear to be less sensitive to the presence of prenuclear prominences in online lexical processing (Bishop, 2017) and in off-line sentence completion tasks (Hurley & Bishop, 2016). Notably, the communication subscale of the AQ has been used in other work to estimate individual differences in “pragmatic skill” (Kulakova & Nieuwland, 2016; Nieuwland, Ditman, & Kuperberg, 2010; Xiang, Grove, & Giannakidou, 2013; Yang, Minai, & Fiorentino, 2018). Although we acknowledge that the relationship between this measure and an underlying construct like “pragmatic skill” (and related ones, like “Theory of Mind”) remains a hypothesis, to the extent that such a relationship exists, the communication subscale of the AQ may serve as a rough proxy for listeners’ sensitivity to the relation between prosody and meaning-in-context. This is relevant to prominence perception, since most off-line tasks have shown listeners’ responses to reflect, in part, expectations based on contextual meaning (e.g., Vainio & Järvikivi, 2006; Bishop, 2012; Cole, Mahrt, & Hualde, 2014; Turnbull et al., 2017; see also Gussenhoven, 2015, for an even stronger claim). However, it is likely that these meaning-dependent cues have their effects in phonologically-dependent ways, much as the effects of other types of signal-extrinsic cues have been argued to (Turnbull et al., 2017).² Thus signal-extrinsic factors tied to the stimulus and to the listener can be identified as another way in which prominence perception is rich and complex, and importantly for our purposes, best understood in the context of a phonological model of the to-be-perceived speech material.

1.3. Present study: exploring prominence perception using Rapid Prosody Transcription

As is clear from the above discussion, many different cues, and types of cues, contribute to prominence perception by human listeners. The overarching goal of the present study was to explore the role of phonology in prominence perception, and to do so (a) in the context of AM theory, and (b) using Rapid Prosody Transcription. Rapid Prosody Transcription (RPT) is a promising new method for exploring the perception of prosody (Cole, Mo, & Hasegawa-Johnson, 2010; Cole et al., 2010b) and for “crowdsourcing” prosodic annotation (Buhmann et al., 2002; Mahrt, 2016; Cole & Shattuck-Hufnagel, 2016; Hasegawa-Johnson, Cole, & Jyothi, 2015; Cole et al., 2017). RPT involves the speeded identification of

2 On this point, see also Calhoun’s (2006) statistical modeling of prenuclear versus nuclear accents in English, although the focus there is on production rather than perception.

J. Bishop et al. / Journal of Phonetics 82 (2020) 100977


coarsely-defined prosodic events (namely “prominence” and “juncture”) carried out by groups of linguistically-untrained listeners. While RPT has been used to study prominence perception in a number of languages/language varieties (e.g., Smith, 2009; Smith & Edmunds, 2013; Luchkina, Puri, Jyothi, & Cole, 2015; Hualde et al., 2016; see also Cole et al., 2019, this special issue) and under various listening conditions (Cole et al., 2014, 2017), this work has primarily focused on the role that acoustic and non-phonological signal-extrinsic factors play. We therefore asked two basic questions regarding how listeners’ prominence perception in RPT relates to phonology, neither of which has been fully explored in English previously:³

(1) How do the following phonological distinctions relate to patterns of perceived prominence?

a. Accent status (a word’s status as unaccented, prenuclear pitch accented, or nuclear pitch accented)

b. Pitch accent type (the particular tone assigned to a pitch accented syllable, e.g., L*, !H*, H*, L+H*, etc.)

(2) Do accent status and accent type mediate the effects of other (signal-based and signal-extrinsic) cues?

The first question is largely confirmatory, in that hypotheses can be straightforwardly derived from theory, and, to some extent, from previous empirical findings from closely-related German (Baumann & Röhr, 2015; Baumann & Winter, 2018; Baumann, 2014). In particular, we predicted that perceived prominence would pattern much like metrical/phonological prominence; listeners should be significantly more likely to judge nuclear accented words as prominent than prenuclear accented words, which in turn should be significantly more likely to be judged as prominent than unaccented words. Similarly, we predicted that perceived prominence should vary as a function of pitch accent type, and our prediction here was based on a characterization of accent type that emphasizes relative pitch level/height. We predicted that words bearing L+H* (which are known to have increased F0; e.g., Bartels & Kingston, 1994) should be most likely to be perceived as prominent by listeners, followed by those with a H* target, followed by those with a !H* target, followed by those with a L* target. Our basic justification for elevating the importance of level in the present study is based in part on work related to intonational meaning and discourse function in English (Hirschberg, Gravano, Nenkova, Sneed, & Ward, 2007; Pierrehumbert & Hirschberg, 1990) and on previous perception work that seems to show level-based perceptual differences between some MAE_ToBI pitch accent categories (Turnbull et al., 2017; see also discussion in Ladd, 1994, Ayers, 1996, and Ladd & Schepman, 2003).⁴

The second question acknowledges the possibility that phonology interacts with other cues (Turnbull et al., 2017) and this question has both confirmatory and exploratory aspects to it. On the confirmatory side, and given our discussion in Section 1.2, we predict that the effect of phonetic cues should not be spread evenly across phonological contrasts. We present as more exploratory the details of how different cues might be weighted in relation to these contrasts, though even here there are some general predictions that can nonetheless be pointed out. For example, in the case of accent status we might assume that F0 will be most useful to cueing perceived prominence for words that are accented, that is, for words that are aligned with an F0 target rather than on the interpolation line (recall our discussion of example (1), above). Conversely, it seems plausible that signal-extrinsic factors like lexical frequency or individual differences in pragmatic skill might have their strongest effects on the perceived prominence of unaccented words, given the more ambiguous phonetic and phonological cues to words parsed into metrically weak positions. However, we know of no study that has demonstrated any such phonologically-dependent asymmetries in the importance of various cues to perceived prominence, so we regard the details of our question here to be largely exploratory in nature. We now turn to the RPT experiment that investigated these issues.

2. Experiment

2.1. Methods

2.1.1. Stimuli materials

Speech Corpora. Materials were selected for use in an RPT experiment with native English-speaking listeners from the United States. The stimuli to be presented to listeners consisted of samples of connected speech from four weekly public addresses recorded by United States President Barack Obama, currently in the public domain and stored on a United States Government web archive (Obama, 2013, 2014a, 2014b, 2014c). These recordings had a political purpose, and thus it was assumed that the speech therein would be fairly careful (i.e., likely read and rehearsed). We chose speech of this sort (and by this speaker) for the following reasons. First and foremost, we assumed that (relative to samples of other speech styles) these samples would contain connected speech with fewer disfluencies, and with fewer reduced and ambiguous instances of intonational categories. This was desirable since the goal of the study was to predict prominence perception based on ToBI-defined categories, and other things being equal, it was assumed that spontaneous speech (such as that found in the Buckeye Corpus, Pitt et al., 2007, used in some previous work; Cole, Mo, & Hasegawa-Johnson, 2010) would contain far more disfluencies, more reduction, and in general, more ambiguity. Second, speech produced by Barack Obama in particular was chosen because it was assumed that listeners would be approximately equally familiar with it; since these listeners would be drawn from a population of mostly native New Yorkers, and because variation within this population is considerable (Newman, 2015), we chose a speaker with a dialect we assumed to be equally different from that of most of our listeners, but also equally familiar to them

3 As referenced below, closely-related questions have recently been investigated for German, but the only study on English we are aware of is Hualde et al. (2016; see also Cole et al., this volume). However, their dataset was primarily designed to explore cross-language differences, and the statistical comparisons in their brief report only directly address our first question (and only with respect to accent status).

4 We acknowledge, however, that this is not the only way that the inventory of American English pitch accents could be reduced. For their German data, for example, Baumann and his colleagues (Baumann, 2014; Baumann & Röhr, 2015; Baumann & Winter, 2018) group GToBI pitch accents into a combination of levels and movements (e.g., low vs. rising vs. high vs. falling). Although our statistical tests are based on accent level, we also report patterns for individual pitch accents in the results section.


(see Cole et al., 2017 for recent evidence regarding cross-dialectal prosody perception).⁵ The four samples selected as stimuli (henceforth Samples A, B, C, and D) came from the “Your Weekly Address” series, which contains commentary by President Obama on current events. Each sample was chosen for its quality (sound quality and recording environment vary widely across the many Your Weekly Addresses that President Obama recorded during his two terms) and similar length (approximately 3–4 min). An example excerpt from Sample A is shown in (4):

(4) . . . Today our economy is growing and our businesses are consistently generating new jobs. But decades-long trends still threaten the middle class. While those at the top are doing better than ever, too many Americans are working harder than ever, but feel like they can’t get ahead. That’s why the budget I sent Congress earlier this year is built on the idea of opportunity for all. It will grow the middle class and shrink the deficits we’ve already cut in half since I took office . . .

These samples were downloaded as MP3 files (versions in uncompressed file formats not being available), suitable for analysis of intonation and the basic phonetic correlates of prominence that we were interested in. The MP3 files were converted to WAV files for the practical purpose of making them loadable in Praat (Boersma & Weenink, 2017) for later phonological annotation by trained MAE_ToBI labelers, and for later presentation to linguistically-untrained listeners in the RPT experiment. The only editing carried out on the files was the removal of brief salutations (e.g., “Hi everybody.” at the beginning of all four speech samples, and similar messages at the end, such as “Thanks everybody, and have a good weekend.”). After this editing, the result was four speech samples, containing a total of 1,821 words, or approximately 10 min of speech material (Sample A: 470 words/2.6 min; Sample B: 448 words/2.35 min; Sample C: 445 words/2.7 min; Sample D: 458 words/2.4 min). Finally, orthographic transcripts of the resulting speech materials were produced, with all punctuation and capitalization removed, as in previous RPT work (e.g., Cole, Mo, & Hasegawa-Johnson, 2010; Cole, Mo, & Baek, 2010). These transcripts were set aside, to be used by listeners in the RPT experiments to make their real-time identifications of prosodic events.
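The transcript preparation step (removing punctuation and capitalization) can be sketched as follows. This is a minimal illustration, not the procedure actually used for the study materials: the function name is hypothetical, and the decision to keep word-internal apostrophes and hyphens (so that "can't" and "decades-long" remain single words) is an assumption that the description above does not settle.

```python
import re

def normalize_transcript(text):
    """Lowercase text and strip punctuation for an RPT-style transcript.

    Assumption: word-internal apostrophes and hyphens are kept.
    """
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    # Replace everything except letters, digits, apostrophes, hyphens, whitespace.
    text = re.sub(r"[^a-z0-9'\-\s]", " ", text)
    # Drop apostrophes/hyphens that are not flanked by alphanumerics on both sides.
    text = re.sub(r"(?<![a-z0-9])['\-]+|['\-]+(?![a-z0-9])", " ", text)
    return " ".join(text.split())

print(normalize_transcript("But decades-long trends still threaten the middle class."))
```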

Phonological annotation. All four speech samples were phonologically annotated by two labelers with extensive training in the MAE_ToBI conventions. Recall from above that we chose these careful/performance-style speech materials with the goal of having stimuli on the lower end of reduction/phonetic ambiguity. This was because we wished to investigate the extent to which RPT judgments of prominence are related to phonological categories, and so we utilized realizations of those categories that were clearer/more canonical. Our annotation procedure and our use of the MAE_ToBI annotations were also intended to minimize the number of ambiguous instances of phonological categories that would ultimately be analyzed. First, the two ToBI labelers worked independently, not communicating with each other to resolve disagreements about the labels they considered assigning. Second, disagreements that occurred in each labeler’s final annotations were left unresolved; rather than using a third “tie-breaking” annotator or some other method to force a decision, we took the two annotators’ inability to agree on a label as evidence that the word’s realization was sufficiently ambiguous or otherwise unclear, and simply excluded such words from the analysis. Although agreement rates between the ToBI annotators were not critical to our questions, we report them since they provide additional data on the MAE_ToBI system’s interrater reliability (see also Pitrelli, Beckman, & Hirschberg, 1994; Syrdal & McGory, 2000; Yoon, Chavarria, Cole, & Hasegawa-Johnson, 2004; Breen, Dilley, Kraemer, & Gibson, 2012). Agreement rates are shown in Table 1, for both the presence versus absence of a pitch accent (disregarding pitch accent type) and pitch accent type (where the possible categories were: unaccented, L*, L*+H, L*+!H, !H*, H+!H*, H*, L+H*, and L+!H*). Rates are displayed both in terms of raw percent agreement and chance-corrected Cohen’s kappa values, but here and throughout the paper, we rely primarily on kappa (κ) values for interpretation.⁶ It is difficult to make cross-study comparisons with precision, since studies vary considerably with respect to the kinds of speech materials annotated, the level of training of annotators, how categories are defined, and whether chance-corrected measures of agreement are reported. Generally, however, the agreement levels for pitch accent presence and type in the present study were consistent with the range that has been reported previously, and on the higher end of that range (somewhat higher than found by Breen et al., 2012, and more similar to Pitrelli and colleagues’ (1994) findings). This was a desirable outcome, since, as described above, only agreed-upon labels could be used for our analyses. Further, although it is important to stress that such cutoffs are rather arbitrary, in practice many researchers follow Landis and Koch (1977), who recommended interpreting kappa values of 0.01–0.20 as “slight agreement”, 0.21–0.40 as “fair agreement”, 0.41–0.60 as “moderate agreement”, 0.61–0.80 as “substantial agreement”, and 0.81–1.0 as “near perfect agreement”. By these standards, our ToBI annotators agreed at rates squarely in the “substantial” or higher categories.⁷ Once MAE_ToBI annotations had been obtained for the speech materials, these annotations (again, limited to those that both annotators agreed upon) served as our definition of the materials’ phonological structure, which would later be used to model linguistically-untrained listeners’ perception of prominence in the RPT experiment.
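For readers less familiar with chance-corrected agreement, the sketch below computes Cohen's kappa for two annotators, maps it onto the Landis and Koch (1977) bands quoted above, and retains only the words on which the annotators agree. The label sequences are hypothetical and are not drawn from the study's annotations; the function names are likewise illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Verbal label for a kappa value, using the cutoffs quoted in the text."""
    bands = [(0.81, "near perfect"), (0.61, "substantial"), (0.41, "moderate"),
             (0.21, "fair"), (0.01, "slight")]
    for lower, name in bands:
        if kappa >= lower:
            return name
    return "below slight"

# Hypothetical labels from two annotators for eight words.
annotator_1 = ["H*", "unaccented", "L+H*", "H*", "unaccented", "!H*", "unaccented", "H*"]
annotator_2 = ["H*", "unaccented", "H*", "H*", "unaccented", "!H*", "unaccented", "L+H*"]

kappa = cohens_kappa(annotator_1, annotator_2)
# Keep only agreed-upon labels, mirroring the exclusion of disputed words.
agreed = [a for a, b in zip(annotator_1, annotator_2) if a == b]
print(round(kappa, 3), landis_koch(kappa), len(agreed))
```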

Signal-based and signal-extrinsic factors. The acoustic measures collected from the speech samples were those common to many previous studies using the RPT method (e.g., Mo, 2008; Cole et al., 2017), and were extracted automatically

5 As pointed out by a reviewer, we did not actually attempt to collect any measure from individual participants regarding their familiarity with Obama’s voice, and so it is conceivable that some differences among them might exist. While we doubt that the magnitude of any such differences would likely explain the patterns we were interested in exploring, we acknowledge some possibility and discuss the issue further in Section 3.4.

6 We suppress the results of significance tests of Cohen’s kappa (and, later in the paper, those for Fleiss’s kappa). All kappa statistics we present in this paper had a z-score that was significant at the 0.001 level, indicating that agreement among listeners was always above chance level, even when the kappa was relatively low.

7 We do not have an explanation for why agreement between the annotators was higher for Sample C than for the other samples (especially on pitch accent type). However, we note that this was also true of agreement among RPT listeners, as will be shown later in Section 2.2.2. Indeed, the relative rates of agreement across the four speech samples generally turned out to be quite similar for the ToBI annotators and RPT listeners, which suggests relevant differences internal to the speech samples rather than some aspect of the methodology or task.


in Praat. First, the four speech samples were forced-aligned to word and phone tiers using the Montreal Forced Aligner (McAuliffe, Socolof, Mihuc, Wagner, & Sonderegger, 2017), with hand corrections made where necessary (accuracy was maximized by first segmenting the four speech samples into yet smaller files prior to alignment). Lexically stressed syllables were automatically identified with reference to the Carnegie Mellon Pronouncing Dictionary (ver. cmudict-0.7b), and acoustic measures were extracted for the vowel of the lexically stressed syllable of each word by a Praat script and z-normalized. Measures included (a) Max F0 (the maximum F0 during the vowel, measured using autocorrelation and hand corrected where tracking errors clearly occurred); (b) RMS intensity (measured uniformly across the frequency spectrum); and (c) the acoustic duration of the vowel. Other non-phonological properties of the stimuli included (a) each word’s CELEX frequency (Baayen, Piepenbrock, & Gulikers, 1996); (b) the number of previous repetitions in the speech sample; and (c) phrasal position (whether the word was final versus non-final in an intermediate phrase). Finally, listener-based properties included (a) gender (the listener’s self-declared status as male or female) and (b) pragmatic skill (the listener’s score on the communication subscale of the Autism Spectrum Quotient (AQ), using the Likert scoring method).
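As an illustration of the z-normalization step applied to the acoustic measures, the following minimal Python sketch converts a set of raw values to z-scores. The duration values are invented for illustration, and the use of the population standard deviation is an assumption; the text does not specify which estimator the Praat script used, nor whether normalization was carried out within or across speech samples.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Return z-scores: each value minus the mean, divided by the SD."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; sample SD is equally plausible
    if sigma == 0:
        raise ValueError("cannot z-normalize a constant series")
    return [(x - mu) / sigma for x in values]


# Hypothetical stressed-vowel durations in milliseconds (not from the study).
durations_ms = [62.0, 85.0, 110.0, 74.0, 139.0]
z_durations = z_normalize(durations_ms)
```

After this transformation the values have a mean of zero and unit standard deviation, which is what allows regression coefficients for F0, duration, and intensity to be read as effects of a one standard deviation change.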

2.1.2. Participants

Participants for the study were 160 monolingual American English speakers recruited from the Greater New York City area (51 male, 109 female; ages ranged from 18 to 48). “Monolingual” was defined as not having learned a language other than English before the age of ten, and not being (by their own estimation) a fluent speaker of any second language studied after that age. Despite this screening, two participants were later discovered not to meet the requirements, and so their data were excluded from analysis, leaving 158 listeners. All participants confirmed that they were free of any history of hearing or communication disorders, and that they lacked any training in prosodic theory or transcription.

2.1.3. Procedure

RPT task: Participants served as listeners in an RPT experiment, designed to elicit coarse prosodic “annotations” of both prominence and juncture (in separate tasks), although we set aside discussion of the latter task. The experiment was carried out in a laboratory setting, in a sound-attenuated booth with paper and pencil. In this way our experiment was more similar to the version of the task described by Cole, Mo, & Hasegawa-Johnson (2010) than to the more recent experiment reported by Cole et al. (2017), who administered the task electronically and (for some subject groups) remotely via Amazon’s Mechanical Turk. Participants in our study each listened to two of the speech samples (half the participants being assigned to Samples A and B, the other half to Samples C and D), identifying prominent words in one of the samples and instances of juncture in the other (with the ordering of these two tasks, and the speech samples used for them, balanced across participants). The instructions given to participants for the prominence transcription task were intended to direct their attention to the speaker’s (i.e., Obama’s) voice, rather than to the meaning of utterances, although we assume that interpretation inevitably influences behavior in this task (see Cole et al., 2019, this Special Issue, for evidence supporting this assumption). The word “prominence” was not itself used with participants, and instead the task was described as in (4):

(4) “This part of the study is about how people use their voice when pronouncing words in English. When people speak, they use things like “loudness” and “tone of voice” to make some words “stand out” more than others. In this part of the experiment, your job is to listen to President Obama’s voice and underline any and all words that he makes stand out in this way. To do this, you will need to listen very carefully to how he pronounces words in “real-time”.”

Participants were to make their identifications of prominent words by underlining them on the printed transcript provided. Although these identifications had to be made in real-time, without the ability to pause or rewind, participants were presented with the speech sample more than once, and were able to make additions and retractions of prominence identifications on each presentation, as done by Cole, Mo, & Hasegawa-Johnson (2010) in their experiment. In our study, listeners had three such chances rather than only two, and we use the term “pass” to refer to each of these three passes through the materials. Additionally, we kept track (via different color markings) of which pass each response had been made on. Before beginning this task, participants carried out a brief practice trial intended to familiarize them with the setup. This practice session consisted of one utterance produced by President Obama (one that did not appear in any of the samples used in the experiment).

Individual differences measures: In addition to the RPT task, all participants in the study completed three measures of cognitive processing styles that are arguably related to pragmatic skill: the Autism Spectrum Quotient (Baron-Cohen, Wheelwright, Hill, et al., 2001), the Broad Autism Phenotype Questionnaire (Hurley, Losh, Parlier, Reznick, & Piven, 2007) and the Reading the Mind in the Eyes test (Baron-Cohen, Wheelwright, Skinner, et al., 2001). Due to space considerations, and because we generally found these measures to predict similar things, we focus only on the AQ here. The AQ is a 50-item, self-report questionnaire, measuring “autistic-like” personality traits along five dimensions: social skills, imagination, attention to detail, attention-switching, and communication. As discussed above, the communication subscale of the AQ (henceforth AQ-Comm) is the measure associated with pragmatic skill in previous work (e.g., Nieuwland et al., 2010; Xiang et al., 2013), and so it is this subscale rather than the whole AQ that is utilized in analyses here (although participants completed the whole test). The items pertaining to AQ-Comm are listed in Appendix I. Scoring was done using a 4-point Likert scale as in Yu (2010) and elsewhere rather than the binary agree/disagree scoring used in Baron-Cohen, Wheelwright, Hill, et al. (2001) (see Stevenson & Hart, 2017 for some justification for use of the Likert scoring method).

The entire experimental session took approximately 45 min to complete.

Table 1

Interrater agreement for assignment of MAE_ToBI labels to the four speech samples by two trained annotators. Agreement is expressed as Cohen’s kappa (with the corresponding percent agreement in parentheses).

| Materials | Presence of Pitch Accent | Pitch Accent Type |
| Sample A | 0.78 (89%) | 0.62 (72%) |
| Sample B | 0.81 (91%) | 0.65 (76%) |
| Sample C | 0.90 (95%) | 0.83 (88%) |
| Sample D | 0.84 (91%) | 0.71 (81%) |
| All Materials | 0.83 (92%) | 0.70 (79%) |

2.2. Results

2.2.1. Overview

In this section we present mixed-effects logistic regression analyses intended to model RPT listeners’ prominence identifications as a function of the phonological, phonetic, and signal-extrinsic factors described above. A special interest here was in illuminating how the effects of the latter two types of cues may be dependent on phonological contrasts related to accent status and accent type. Before going on to the regression analyses, however, we offer a brief analysis of interrater agreement, which provides some useful information about how our RPT listeners compare to those in previous RPT studies, both in terms of agreement with each other as well as with the “consensus” ToBI annotation derived from our two trained transcribers.

2.2.2. Agreement

Considering first agreement among RPT listeners, we calculated Fleiss’s kappa (similar to Cohen’s kappa, but suitable when the number of raters is more than two) for responses by all participants for each of the four speech samples. Values are shown in Table 2, for responses that RPT listeners gave on the first pass (i.e., after hearing the materials just once) and after all three passes. The first observation we make is that agreement is quite low after just one pass. The second observation is that after three passes, listeners in our study agreed at a rate very similar to those in Cole et al.’s (2017) study, who agreed at a rate of κ = 0.31. Thus, RPT listeners, at least if they have multiple opportunities to hear the materials, seem to agree with each other at rates within the “fair agreement” range (by the standards of Landis & Koch, 1977). While this is much lower than the agreement that is achieved by ToBI annotators (in this study and elsewhere), this is not surprising; RPT listeners lack training and instruction and make their decisions quickly in real time, without any visual representation of the speech materials.
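For readers unfamiliar with the statistic, the following Python sketch shows one standard way of computing Fleiss’s kappa from a table of per-word response counts. It is offered only as an illustration of the measure, not as the code used in the study; the function name and the toy data (five words, four listeners) are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss's kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of raters, and at least two raters are required."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    category_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in category_share)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: columns are [marked prominent, not marked prominent].
toy_counts = [[4, 0], [0, 4], [3, 1], [1, 3], [2, 2]]
toy_kappa = fleiss_kappa(toy_counts)
```

In the RPT setting, each row would correspond to a word in the transcript and the two columns to the number of listeners who did or did not underline it.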

Turning now to agreement between RPT listeners and the consensus ToBI annotation, we treated prominence identifications by RPT listeners as equivalent to pitch accent identifications (ignoring pitch accent type distinctions) by ToBI annotators. Pairwise Cohen’s kappas and raw proportion agreement were then calculated between the consensus ToBI annotation and each of the individual RPT listeners. The results are illustrated in Fig. 1, for RPT responses made after three passes through the materials. In order to illustrate how chance-corrected kappa relates to percent agreement, the figure plots these two measures of agreement against each other. This helps make apparent the extent to which chance inflates the picture of agreement in this binary-choice task; a listener with 75% agreement with ToBI labelers in this dataset has a chance-corrected agreement of only approximately κ = 0.50 (“moderate agreement”). Another thing it makes apparent is that almost all RPT listeners agree below this level; the individual listener who achieved the highest level of agreement with trained ToBI labelers (a linguistically naïve “super annotator” in the words of Cole et al., 2017) agreed at the κ = 0.56 level. Though considerably lower than the rate at which ToBI labelers agree with each other, this is approaching “substantial agreement” by common standards (Landis & Koch, 1977). In any case, that some listeners perform this similarly to ToBI annotators is rather impressive given their lack of training and the differences between the tasks. However, given that only one or two out of the 158 listeners that we analyzed achieved this, the extent to which such “super annotator” performance is due to chance is rather unclear. We now turn to our statistical modeling of prominence perception in the RPT task, in which we sought to determine the interactive role of phonological, phonetic, and signal-external factors.
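The relationship between raw percent agreement and chance-corrected kappa can be made concrete with a small worked example. The Python sketch below computes both measures for one hypothetical listener against a hypothetical consensus annotation; the twenty binary labels are invented so that 75% raw agreement corresponds to κ = 0.50, and in real data the mapping depends on how often each rater uses each label.

```python
def agreement_and_kappa(labels_a, labels_b):
    """Return (proportion agreement, Cohen's kappa) for two label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from each rater's own marginal label proportions.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical data: 1 = pitch accented (annotation) or marked prominent (listener).
consensus = [1] * 10 + [0] * 10
listener = [1] * 6 + [0] * 4 + [1] * 1 + [0] * 9
raw_agreement, kappa = agreement_and_kappa(consensus, listener)
```

Here the listener matches the annotation on 15 of 20 words, but because half of that agreement would be expected by chance given the two raters’ marginal rates, the chance-corrected value is considerably lower than the raw percentage suggests.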

2.2.3. Modeling prominence perception

2.2.3.1. Overview. We now turn to our main analyses involving how prominence perception in the RPT task relates to phonological distinctions. Using a series of mixed-effects logistic regression models, our approach in this section involved first determining the extent to which intonational phonological contrasts serve as predictors of perceived prominence, and then examining whether the influence of acoustic and signal-extrinsic factors varies within these phonologically-defined contrasts. We divide our analysis in this section into two parts, based on the phonological contrast considered.

Table 2

Interrater agreement, expressed as Fleiss’s kappa, among RPT listeners for each of the four speech samples. Agreement was calculated after one and after three passes through the materials.

| Materials | One Pass | Three Passes |
| Sample A | 0.185 | 0.266 |
| Sample B | 0.213 | 0.293 |
| Sample C | 0.239 | 0.320 |
| Sample D | 0.180 | 0.274 |
| Mean | 0.204 | 0.288 |

Fig. 1. Levels of agreement between expert ToBI annotators and each individual RPT listener, expressed as both Cohen’s kappa (κ) and percent agreement (%). Kappa agreement and percent agreement are plotted against one another in order to highlight the relationship between chance-corrected and non-corrected measures.

2.2.3.2. Pitch accent status. Effect of phonology: Considering first the effect of accent status on prominence perception, we derived from the ToBI transcribers’ annotations each word’s status as unaccented, prenuclear accented or nuclear accented, and calculated the proportion of each judged to be prominent by listeners. This was done for judgments made after just one pass through the materials, and after all three passes that listeners were to make. Table 3 summarizes these grand proportions. First, it is clear that, overall, listeners identified more words as prominent when given additional passes (approximately 1.8 times as many overall). Second, words parsed into stronger metrical positions were more likely to be perceived as prominent. Notably, however, the most metrically prominent words (those bearing nuclear accents) were still far from ceiling-level identification, being judged as prominent less than half the time. Furthermore, unaccented words were judged as prominent at a non-zero rate by RPT listeners. Both of these findings represent clear mismatches between phonology and subjective prominence judgments. It is also worth pointing out that, although listeners identified additional prenuclear and nuclear accents on later passes through the materials, they also identified additional unaccented words as prominent. This indicates that additional passes through the materials to some extent result in listeners simply identifying more words as prominent overall. That is, in relation to ToBI annotators, RPT listeners make more “correct” identifications when given more time, but they also produce more “errors”, and thus overall agreement between RPT listeners and ToBI annotators does not necessarily increase. This, of course, has methodological implications for RPT’s use as a tool for crowdsourcing prosodic annotation (Cole et al., 2017).

To confirm the significance of these numerical patterns, mixed-effects logistic regression was carried out using the glmer function in the lme4 package (Bates, Maechler, Bolker, & Walker, 2015) for R (R Core Team, 2018). The regression model was used to predict the binary outcome variable “marked prominent by listener” as a function of the fixed-effects factors “accent status”, “pass”, and “order”. Accent status was treated as a discrete ordinal variable so that effects at each level (unaccented < prenuclear accented < nuclear accented) could be simultaneously compared with Tukey corrections applied (using the glht function in the multcomp package for R; see Bretz, Hothorn, & Westfall, 2011). Pass was a binary predictor (three passes through the materials vs. just one) and order (a word’s linear position in the speech sample, a measure of listeners’ progression through the experiment) was a continuous predictor, centered on its mean. Random-effects factors included intercepts for participant and lexical item; the model also included a by-listener random slope for accent status, as its inclusion significantly improved model fit as determined by a log likelihood ratio test (Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017).

The output of the model is shown in Table 4, and indicates that the likelihood that RPT listeners identified a word as prominent increased along with its metrical prominence; nuclear accented words were most likely to be judged as prominent, followed by prenuclear accented words, followed by words without a pitch accent. The likelihood that a word was judged as prominent also increased significantly from the first to the last pass through the materials. There was a numerical tendency for the identification of words of lower metrical prominence to benefit more from additional passes through the materials than words of higher metrical prominence. Computable from Table 3 is that by the third pass, listeners identified 1.70 times as many nuclear accented words as after the first pass, but 1.83 times and 2.07 times as many prenuclear accented and unaccented words, respectively. However, a log likelihood ratio test indicated that an interaction term between accent status and pass improved model fit only marginally (χ² = 5.45, p < .1). Thus, although there was a tendency for listeners to identify nuclear accented words sooner/more readily in the experiment, the effect of additional passes was primarily a simple effect. In the rest of our analyses, we consider prominence judgments that were made after all three passes.
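The pass-to-pass ratios cited above follow directly from the proportions reported in Table 3. The short Python sketch below reproduces the arithmetic; the dictionary layout and variable names are illustrative only.

```python
# Proportions judged prominent after (one pass, three passes), from Table 3.
table3 = {
    "Unaccented": (0.057, 0.118),
    "PPA": (0.163, 0.298),
    "NPA": (0.250, 0.424),
}

# Ratio of third-pass to first-pass identification rate per accent status.
ratios = {
    status: three_passes / one_pass
    for status, (one_pass, three_passes) in table3.items()
}

for status, ratio in ratios.items():
    print(f"{status}: {ratio:.2f} times as many identifications after three passes")
```

Because the inputs are proportions rounded to three decimal places, the ratios are themselves approximate.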

Effects of phonetic and signal-external factors: One thing that the previous section demonstrated was that listeners in the RPT task were not simply identifying nuclear accented words. Indeed, our listeners both (a) failed to identify many nuclear accented words as prominent, and (b) succeeded in identifying many prenuclear accented words (and even some unaccented words) as prominent. The purpose of this section is to determine what other factors, signal-based and signal-extrinsic, predict prominence perception. As discussed in Section 1.2, our assumption was that phonetic cues are unlikely to be weighted equally across phonological distinctions, and so, in addition to signal-extrinsic cues, our focus here was on assessing the relative contribution of F0, duration and intensity for words of different accent status. We were especially curious about whether some of these factors led listeners to identify unaccented words as prominent at the rate that they did (almost 12% of the time in the aggregate).

To this end, three logistic regression models were constructed, each intended to predict prominence perception for a different accent status category. Fixed-effects factors included in the models fell into three categories, as outlined above in Section 2.1.1. First, these included acoustic properties: “F0 Max” (the maximum F0 occurring during the word’s stressed syllable), “duration” (the duration of the word’s stressed syllable), “RMS intensity” (the root mean square intensity over the word’s stressed vowel), as well as “RMS intensity*duration” (the interaction between these two factors, given their complex and dependent relation to the percept of “loudness”; Beckman, 1986; Turk & Sawusch, 1996). Second, these included signal-extrinsic properties of the stimuli: “lexical frequency” (CELEX frequency; Baayen et al., 1996), “repetition” (number of times a word had previously occurred in the speech materials), “order” (the location of the word in the passage), and “phrasal position” (a word’s positioning as final versus non-final in an intermediate phrase).⁸ Finally, we also included in the model two listener-based properties: “gender” (the gender, male/female, declared by the listener) and “pragmatic skill” (the participant’s score on AQ-Comm; higher values correspond to more autistic-like, and thus poorer, pragmatic skill). As noted earlier, values for all continuous predictors in the model were z-transformed (and thus also centered on their mean). Random-effects factors included intercepts for listener and lexical item, and a by-listener slope for lexical frequency.

Table 3

Proportion of unaccented, prenuclear accented (PPA) and nuclear accented (NPA) words identified as prominent by RPT listeners on the first and last of three passes through the speech materials.

| Accent Status | One Pass | Three Passes |
| Unaccented | 0.057 | 0.118 |
| PPA | 0.163 | 0.298 |
| NPA | 0.250 | 0.424 |

The output of these models is shown in Table 5. For ease of exposition we discuss this output by making reference to Fig. 2, which displays the change in the odds ratio (which we express as percent change), a convenient measure of effect size (Hosmer, Lemeshow, & Sturdivant, 2013), for all significant factors in the models. Considering first acoustic factors, it is apparent in Fig. 2 that words realized with greater acoustic prominence were generally more likely to be perceived as prominent. However, and confirming our basic prediction, the details depended on phonological status (i.e., distinctions such as accent status and accent type). For example, one standard deviation increases in F0, duration, and intensity all had their largest effects on the perceived prominence of prenuclear accented words, and (except in the case of duration) their weakest effects on the perceived prominence of unaccented words (with no significant effect of intensity at all). One interpretation of the first observation is that nuclear accented words derive their perceived prominence primarily from their structural prominence and semantic significance (Calhoun, 2006); additional acoustic prominence therefore has only a moderate effect on how nuclear accented words are perceived. In the case of unaccented words, which are generally of lower phonetic prominence to begin with, there may simply be too little acoustic variation to produce a large difference, and as discussed earlier, their F0 values reflect interpolation rather than a structurally significant target. Intensity’s effect on the perceived prominence of nuclear accented words was only positive at higher durations, as the simple effect was, curiously, actually negative. However, as described above, our assumption was that the interaction between intensity and duration, as a somewhat better approximation of loudness, is likely the more reliable measure (especially when the model contains both an interaction and simple effect). At any rate, it seems clear that all three parameters were quite relevant to perceived prominence, including F0, in contrast to some previous claims (e.g., Kochanski et al., 2005; Cole, Mo, & Hasegawa-Johnson, 2010). Indeed, F0 had the largest effect of the three acoustic measures we tested, with a one standard deviation increase in F0 producing a 64% increase in the odds ratio of a “prominent” response for prenuclear accented words. Notably, it is unlikely this effect for F0 would have been apparent if accent status had been invisible to the model.
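The percent-change figures discussed here can be recovered from the logistic regression coefficients, since exponentiating a log-odds coefficient gives the multiplicative change in the odds associated with a one-unit (here, one standard deviation) increase in the predictor. The Python sketch below illustrates this standard conversion using the F0 Max coefficient for prenuclear accented words from Table 5; the function name is hypothetical, and the sketch is not the authors’ own code.

```python
import math


def percent_change_in_odds(coefficient):
    """Convert a logistic regression coefficient (log-odds scale) into the
    percent change in the odds for a one-unit increase in the predictor."""
    return (math.exp(coefficient) - 1.0) * 100.0


# F0 Max coefficient for prenuclear accented words, as reported in Table 5.
f0_max_ppa = 0.4954
f0_change = percent_change_in_odds(f0_max_ppa)
print(f"F0 Max (PPA): {f0_change:.0f}% change in the odds of a 'prominent' response")
```

A negative coefficient yields a negative percent change under the same conversion, which is how decreases in the odds are expressed in Fig. 2.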

Next, we consider signal-extrinsic effects on perceived prominence. Apparent in Fig. 2 is that these factors were generally better predictors of perceived prominence for unaccented words than for accented words. As described above, some previous studies of prominence perception have shown that phrase-final words are more readily identified as prominent than non-phrase-final words; while this effect was also apparent in our data, it was clearly a stronger predictor of perceived prominence for unaccented words, with an increase in the odds ratio that was more than three times the size of that for nuclear accented words. Similarly, previous studies have shown lexical frequency to be inversely associated with perceived prominence; here this relationship was found to be significant only in the case of unaccented words, for which the odds ratio for perceived prominence decreased sharply (approximately 40%) given a one standard deviation increase in lexical frequency. Repetition in the speech materials, while it affected nuclear accented words more than unaccented words, had an extremely small effect on both, with additional occurrences of a nuclear accented word corresponding to only an approximately 5% decrease in the odds ratio for perceived prominence. Similarly, a word’s order in the speech materials had a statistically significant association with prominence judgments by listeners, indicating that listeners were somewhat more conservative with their prominence judgments as the experiment progressed. But here, too, the effect size was so small (less than a 2% change in the odds ratio, and only for unaccented and nuclear accented words) that we do not discuss it further.

Finally, perceived prominence was significantly predicted by properties of listeners themselves, which indicates not only the presence of individual differences but also that they are systematically related to the particular variables we tested. First, and most important to us, high scores on AQ-Comm (which indicate poorer pragmatic skill) were inversely related to the likelihood of perceived prominence. However, this was only for words of lower metrical prominence; a one standard deviation increase in AQ-Comm was associated with an odds ratio

8 Brief commentary is required regarding some of these predictors. First, we acknowledge that F0 maximum is only an estimate of the effects that F0 has (excursion being another aspect of F0), but we have limited our measure for methodological and analytical simplicity (and for comparability with other RPT studies; e.g., Cole et al., 2010). Second, we utilized CELEX frequencies for the analyses we report below, but also explored SUBTLEX frequencies (Brysbaert & New, 2009) and did not find differences in the pattern of statistical results. Third, we point out that “phrase position” does not apply in the case of prenuclear accented words, as such words cannot, by definition, be phrase-final.

Table 4

Results for fixed-effects factors in the logistic regression model that tested for effects of accent status (unaccented, prenuclear accented, or nuclear accented) and Pass (first pass through the materials or after the third and final pass). R code for the model is shown in Appendix II.

| Predictor | B | SE | z | p |
| (Intercept) | 2.8723 | 0.0877 | 32.75 | <0.001 |
| Pass (3 vs. 1) | 0.9589 | 0.0177 | 54.27 | <0.001 |
| Order | 0.0132 | 0.0009 | 15.27 | <0.001 |
| Accent Status (PPA vs. Unaccented) | 0.7187 | 0.0584 | 12.31 | <0.001 |
| Accent Status (PPA vs. NPA) | 0.6674 | 0.0539 | 12.37 | <0.001 |
| Accent Status (NPA vs. Unaccented) | 1.3861 | 0.0635 | 21.83 | <0.001 |


Table 5

Results for fixed-effects factors in the logistic regression models testing acoustic and signal-extrinsic factors for each accent status category. R code for the models is shown in Appendix II.

Unaccented Words:

| Predictor | B | SE | z | p |
| (Intercept) | 2.3881 | 0.1343 | 17.78 | <0.001 |
| Phrase Position (Final vs. Non-final) | 1.0176 | 0.1490 | 6.83 | <0.001 |
| CELEX Frequency | 0.5255 | 0.1583 | 3.32 | <0.001 |
| Repetition | 0.0198 | 0.0107 | 1.85 | <0.1 |
| Order | 0.0070 | 0.0029 | 2.44 | <0.05 |
| Gender (M) | 0.2362 | 0.1686 | 1.40 | >0.1 |
| AQ-Comm | 0.2171 | 0.0779 | 2.79 | <0.01 |
| F0 Max | 0.2254 | 0.0455 | 4.96 | <0.001 |
| Duration | 0.2122 | 0.0530 | 4.00 | <0.001 |
| RMS Intensity * Duration | 0.0766 | 0.0483 | 1.58 | >0.1 |
| RMS Intensity | 0.0357 | 0.0452 | 0.79 | >0.1 |

Prenuclear Accented Words:

| Predictor | B | SE | z | p |
| (Intercept) | 1.3742 | 0.1647 | 8.34 | <0.001 |
| CELEX Frequency | 0.2714 | 0.2382 | 1.14 | >0.1 |
| Repetition | 0.0490 | 0.0352 | 1.39 | >0.1 |
| Order | 0.0025 | 0.0045 | 0.56 | >0.1 |
| Gender (M) | 0.4075 | 0.1456 | 2.80 | <0.001 |
| AQ-Comm | 0.1249 | 0.0668 | 1.87 | <0.1 |
| F0 Max | 0.4954 | 0.0670 | 7.39 | <0.001 |
| Duration | 0.3240 | 0.0922 | 3.51 | <0.001 |
| RMS Intensity * Duration | 0.2146 | 0.0784 | 2.74 | <0.001 |
| RMS Intensity | 0.2452 | 0.1010 | 2.43 | <0.05 |

Nuclear Accented Words:

| Predictor | B | SE | z | p |
| (Intercept) | 0.6318 | 0.2045 | 3.09 | <0.001 |
| Phrase Position (Final vs. Non-final) | 0.4409 | 0.1950 | 2.26 | <0.05 |
| CELEX Frequency | 0.2683 | 0.1827 | 1.47 | >0.1 |
| Repetition | 0.0594 | 0.0264 | 2.25 | <0.05 |
| Order | 0.0156 | 0.0044 | 3.51 | <0.001 |
| Gender (M) | 0.3364 | 0.1367 | 2.46 | <0.05 |
| AQ-Comm | 0.0443 | 0.0630 | 0.70 | >0.1 |
| F0 Max | 0.3119 | 0.0483 | 6.46 | <0.001 |
| Duration | 0.1790 | 0.0532 | 3.37 | <0.001 |
| RMS Intensity * Duration | 0.1925 | 0.0509 | 3.79 | <0.001 |
| RMS Intensity | 0.2409 | 0.0981 | 2.46 | <0.05 |

Fig. 2. Effect size (expressed as percent change in the odds ratio for a “prominent” response) for fixed-effects factors in the logistic regression models of unaccented, prenuclear accented (PPA) and nuclear accented (NPA) words. Only factors whose effect was significant are shown; note that phrase position does not apply to PPA words, as all PPA words are by definition either phrase-initial or phrase-medial.


decrease of approximately 4% for prenuclear accented words, and 20% for unaccented words, but had no significant effect on nuclear accented words. The effect of pragmatic skill was thus like the other signal-extrinsic factors just explored, in that its effects were largely limited to words parsed into phonologically weak positions. To put this in context, the effect of a one standard deviation change in pragmatic skill was, for unaccented words, comparable to a one standard deviation change in acoustic duration. In addition to pragmatic skill, gender was a significant predictor, though only for accented words; relative to female listeners, male listeners were associated with fewer prominence identifications—a decrease of approximately 34% and 29% in the odds ratio for prenuclear accented and nuclear accented words, respectively. Gender, then, had an effect on accented words that was roughly comparable in size to a one standard deviation change in acoustic duration or intensity. We do not know of this gender difference having been reported previously. We now turn to our consideration of effects related to phonological distinctions in pitch accent type rather than pitch accent status.

2.2.3.3. Pitch accent type/level. Effect of category: As described above, we examined the role of pitch accent type primarily in terms of level, that is, groupings based on the height of the accent’s starred tone. Thus L* and L*+H were grouped as one category, and !H*, H+!H* and L+!H* as another category. The one exception to this classification involved H* and L+H*, which, again, due to L+H*’s association with raised F0, were kept distinct from each other. While our analyses center on these pitch accent level distinctions, we report in Table 6 the proportion of prominence judgments for each individual pitch accent, as well as the number of observations for each.⁹ Considering first the differences in perceived prominence associated with the categories themselves, Table 7 displays the proportion of words judged as prominent by RPT listeners for each accent level, broken down by accent status. Overall, words with L* or !H* pitch accents were least likely to be perceived as prominent, words with a L+H* were most likely to be perceived as prominent, and words with H* showed an intermediate likelihood. Notably, perceived prominence of words with H* seemed to vary the most as a function of accent status; prenuclear H* appears to pattern more like prenuclear L* and !H*, but nuclear H* patterns more like nuclear L+H* in terms of perceived prominence. We therefore explored the role of accent level separately for prenuclear and nuclear accented words.¹⁰ Mixed-effects regression models were thus constructed similarly as for accent status contrasts in the previous section, but in this case the crucial predictor was accent level, which was also modeled as a discrete ordinal variable (L* vs. !H* vs. H* vs. L+H*), with Tukey corrections again being applied to the multiple comparisons made.

The results of the two models are shown in Table 8. In general, words marked by pitch accents of a lower accent level were significantly less likely to be perceived as prominent than words bearing a pitch accent of a higher accent level, for both prenuclear and nuclear accented words. One exception to this was the perceived prominence for words bearing a L*, which were no less likely to be perceived as prominent than words bearing !H* (and in fact, words with !H* were numerically less likely to be judged as prominent than words with a L* when considering only nuclear pitch accents; see also Cole et al., 2019, this Special Issue). Additionally, words with H* were more strongly associated with perceived prominence than words with a L* and !H* when nuclear accented, but H* did not differ significantly from !H* in prenuclear accent position.

Thus, overall, the findings seem to suggest that accent level corresponds to perceived prominence in a mostly non-gradient way. In nuclear position, the distinction is primarily between high and non-high pitch accents (where downstep is regarded as non-high); in prenuclear position, the distinction seems to be between L+H* and all other levels. It is somewhat unclear why accent status should affect H* more than the other pitch accent levels. One possibility is that this in part reflects the tendency for the second H* in English H*_H*_L-L% sequences to have a phonetically higher F0 than the first. This would of course suggest an important role for within-category phonetic variation (in this case related to F0) in prominence perception. We now turn to listeners’ sensitivity to such variation.

Effects of phonetic factors: Our analysis here focused on effects of signal-based acoustic factors within accent category, setting aside signal-extrinsic factors. We did not expect reexamination of these factors in the context of accent level contrasts to yield additional insights into their effects. Additionally, instead of collapsing pitch accent types into level categories as above, our modeling of within-category phonetic effects excluded the bitonal accents, and thus accent level here refers to whether an accented word bore a L*, !H*, H* or L+H* pitch accent type. We did this both because of the more complex phonetic nature of bitonal accents (especially involving F0), but also because of the relatively small number of

9 There were no agreed-upon instances of L*+!H in our speech materials, and indeed, rather few instances of the non-downstepped L*+H.

10 We did this for ease of interpretation, given the difficulty associated with interpreting interactions between multi-level categorical predictors. To confirm that such an interaction was likely significant, however, we first compared a model of prominence perception that contained an interaction between accent level and accent status with one that contained only the simple effects of these two factors. A log likelihood ratio test confirmed that the model with the interaction provided a significantly better fit to the data (χ² = 14.62, p < .01).

Table 6

Proportion of words identified as prominent as a function of pitch accent type.

| Measure | L* | L*+H | L*+!H | !H* | H+!H* | L+!H* | H* | L+H* |
| # of Observations | 1303 | 312 | 0 | 4002 | 1275 | 539 | 9659 | 4455 |
| Proportion Prominent | 0.329 | 0.170 | – | 0.304 | 0.345 | 0.243 | 0.367 | 0.467 |
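To illustrate how the individual pitch accent types in Table 6 map onto the level categories described in the text, the Python sketch below pools the reported proportions within each level, weighting by the number of observations. This is a rough reconstruction for illustration only: it is not the authors’ procedure, the inputs are rounded proportions, and the resulting values are not directly comparable to Table 7, which is further broken down by accent status.

```python
# (number of observations, proportion judged prominent), from Table 6.
table6 = {
    "L*": (1303, 0.329),
    "L*+H": (312, 0.170),
    "!H*": (4002, 0.304),
    "H+!H*": (1275, 0.345),
    "L+!H*": (539, 0.243),
    "H*": (9659, 0.367),
    "L+H*": (4455, 0.467),
}

# Level groupings as described in the text; L*+!H is omitted (no observations).
levels = {
    "L*": ["L*", "L*+H"],
    "!H*": ["!H*", "H+!H*", "L+!H*"],
    "H*": ["H*"],
    "L+H*": ["L+H*"],
}


def pooled_proportion(accents):
    """Observation-weighted mean proportion across the given accent types."""
    total = sum(table6[a][0] for a in accents)
    weighted = sum(table6[a][0] * table6[a][1] for a in accents)
    return weighted / total


pooled = {level: pooled_proportion(members) for level, members in levels.items()}
```

Under this pooling the ordering of the four levels can be inspected directly, although the statistical comparisons reported in the text rest on the regression models rather than on these descriptive values.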

Table 7

Proportion of words identified as prominent as a function of pitch accent level and status.

| Accent Status | L* | !H* | H* | L+H* |
| PPA | 0.247 | 0.224 | 0.285 | 0.428 |
| NPA | 0.358 | 0.337 | 0.484 | 0.504 |