
Chairs: Neil Brigden, Miami University, USA

Gabriele Paolacci, Erasmus University Rotterdam, The Netherlands

Paper #1: Beyond the Turk: An Empirical Comparison of Alternative Platforms for Crowdsourcing Online Research

Eyal Peer, Bar-Ilan University, Israel

Sonam Samat, Carnegie Mellon University, USA

Laura Brandimarte, Carnegie Mellon University, USA

Alessandro Acquisti, Carnegie Mellon University, USA

Paper #2: Using Nonnaive Participants Can Reduce Effect Sizes

Jesse Chandler, University of Michigan and PRIME Research, USA

Gabriele Paolacci, Erasmus University Rotterdam, The Netherlands

Eyal Peer, Bar-Ilan University, Israel

Pam Mueller, Princeton University, USA

Kate Ratliff, University of Florida, USA

Paper #3: Research Design Decisions and Their Effects on Task Outcomes: An Illustration Using Free-sorting Experiments

Simon J. Blanchard, Georgetown University, USA

Ishani Banerji, University of Texas at San Antonio, USA

Paper #4: Interactivity and Data Quality in Computer-Based Experiments

Neil Brigden, Miami University, USA

SESSION OVERVIEW

Consumer researchers are increasingly relying on online samples to conduct their empirical investigations. While crowdsourcing markets and online panels (e.g., Amazon Mechanical Turk) have improved the efficiency of behavioral research, they have also introduced novel challenges for data quality (for recent reviews see Gosling & Mason 2015; Paolacci & Chandler 2014). In particular, both the recruitment and execution stages of research are inherently less controllable when they happen online than offline (e.g., in a physical laboratory). As a result, several questions arise that directly impact the reliability of online behavioral investigations and ultimately the trustworthiness of research findings: Is there such a thing as an “online population”, or are some online samples more reliable than others?

What are the consequences of research participants self-selecting into research studies as they wish? How can researchers ensure that unsupervised participants will truly engage in the tasks they perform?

This session brings together four papers that assess these concerns, and suggest practical solutions for consumer researchers to ensure high data quality in their online investigations.

The first two papers focus on the recruitment stage of online research. Peer and colleagues compare a variety of online panels, finding vast differences in data quality as well as in participant demographics and psychometrics. Chandler and colleagues investigate how participating twice in the same study (a phenomenon that previous research has documented as prevalent) affects results, and find that including non-naive participants consistently reduces observed effect sizes. The last two papers focus on how features of the research execution can be changed to increase data quality in online research. Blanchard and Banerji comprehensively examine free-sorting tasks in online research and find evidence that researcher decisions (e.g., topic, number of items, pre-task video tutorials) substantially affect both participants’ experience and the quality of the data they provide. Brigden examines the effect of interactive study elements on participants’ attentiveness, and finds that attentiveness is significantly improved in the presence of interactive elements. Altogether, the four papers highlight specific data quality concerns for online researchers, test the effectiveness of attempts to address these concerns, and identify important avenues for future research in this area.

The Internet democratized science by lowering the barriers to its consumption and dissemination. More recently, opportunities to conduct research online have allowed for a more democratic production of scientific knowledge. However, online data collection also has many often-undetected pitfalls, which this session uncovers and examines. Prior ACR sessions on crowdsourcing and online research (e.g., Goodman & Paolacci, 2014) demonstrated the community’s strong interest in a more thorough understanding of the consequences of relocating empirical investigations online. We expect this session to be equally successful and, by providing attendees with actionable methodological guidance, to contribute substantially to improving the practices of online data collection.

Beyond the Turk: An Empirical Comparison of Alternative Platforms for Crowdsourcing Online Research

EXTENDED ABSTRACT

In recent years, a growing number of researchers have been using Amazon Mechanical Turk (MTurk) as an efficient platform for crowdsourcing online human-subjects research. A large body of work has shown MTurk to be a reliable and cost-effective source of high-quality and representative data across various fields and research purposes (e.g., Buhrmester, Kwang, & Gosling, 2011; Chandler, Mueller, & Paolacci, 2014; Crump, McDonnell, & Gureckis, 2013; Fort, Adda, & Cohen, 2011; Goodman, Cryder, & Cheema, 2013; Litman, Robinson, & Rosenzweig, 2014; Mason & Suri, 2012; Paolacci & Chandler, 2014; Paolacci, Chandler, & Ipeirotis, 2010; Peer, Vosgerau, & Acquisti, 2013; Rand, 2012; Simcox & Fiez, 2014; Sprouse, 2011). In parallel, several alternative platforms now offer similar services, with distinct differences from MTurk: they offer access to new populations that are more naïve than MTurk’s, and they place fewer restrictions on the types of assignments researchers may ask participants to complete (see Vakharia & Lease, 2014, for an overview). These alternative services for crowdsourced research could be highly beneficial for researchers interested in conducting online surveys and experiments, as long as they prove to provide high-quality data. We conducted an empirical investigation of the data quality (in terms of response rates, attention, dishonesty, reliability, and replicability) of several alternative online crowdsourcing platforms, and compared them to both MTurk and a university-based online participant pool.

We first focused on six services, found by searching the web for crowdsourcing websites, that are similar in purpose and general design to MTurk: CrowdFlower, MicroWorkers, RapidWorkers, MiniJobz, ClickWorker, and ShortTask.

However, we were able to run our study only on the first three sites, due to various problems with the others (MiniJobz rejected our study with no explanation and did not respond to our questions; ClickWorker required a high set-up fee of about $840 for 200 participants; and ShortTask repeatedly failed to process our payment, and no support could be reached). In addition to these three sites and MTurk, we also ran our study on a university-based online participant pool (CBDR) as another comparison group. We aimed to sample 200 participants from each site over one week. We obtained 200 responses from MTurk and CrowdFlower in less than two hours (101.01 and 108.55 responses per hour, respectively). By a considerable margin, CBDR showed the third-fastest response rate (1.42 responses per hour), followed by MicroWorkers (1.08 responses per hour) and RapidWorkers (0.63 responses per hour), from which we could sample only 105 completed responses in a week. In total, we obtained a sample of 890 participants.

Our online study included several parts designed to examine different aspects of data quality. For brevity, we describe these parts here alongside their results. In one part, participants completed several validated questionnaires to examine differences in reliability between the sites: the Internet Users’ Information Privacy Concerns (IUIPC) scale (Malhotra, Kim, & Agarwal, 2004), the Need for Cognition (NFC) scale (Cacioppo, Petty, & Kao, 1984), and the Rosenberg Self-Esteem Scale (RSES; Rosenberg, 1979). Overall, MTurk participants showed the highest reliability scores on all three scales, followed by CrowdFlower, CBDR, and MicroWorkers participants, all of whom performed adequately on all scales (except for a somewhat lower score for CrowdFlower participants on the NFC scale).

RapidWorkers participants showed high reliability on the IUIPC scale, but very low reliability on the NFC scale and mediocre reliability on the RSES. We used Hakstian and Whalen’s (1976) method for comparing independent reliability coefficients and found no statistically significant differences between the samples (using all participants from each sample) for the IUIPC, χ2(4) = 6.63, p = .17, but we did find statistically significant differences for the NFC and the RSES, χ2(4) = 127.07 and 75.69, respectively, ps < .01.
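For readers who want to run a similar per-platform reliability check, a minimal sketch is given below. It assumes item-level responses in a pandas DataFrame with hypothetical column names (a platform label plus one column per scale item), and it computes Cronbach’s alpha from the standard formula rather than applying Hakstian and Whalen’s (1976) transformation.

```python
import pandas as pd


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)


# Hypothetical layout: one row per participant, a 'platform' label, and NFC item columns.
df = pd.read_csv("responses.csv")  # assumed file, not released with this abstract
nfc_items = [c for c in df.columns if c.startswith("nfc_")]

for platform, grp in df.groupby("platform"):
    print(platform, round(cronbach_alpha(grp[nfc_items]), 3))
```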

Two attention-check questions (Peer et al., 2014), embedded at different points in the study, assessed participants’ attention and compliance with written instructions. Whereas only 14% of MTurk participants failed both questions, almost half of the CBDR participants failed them, and the majority of participants on all other sites failed them as well. Interestingly, CrowdFlower participants (who showed the fastest response rate) had a failure rate of almost 75%.
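The screening itself is simple to reproduce; the sketch below computes the share of participants failing both checks on each platform, again assuming a hypothetical data layout and column names.

```python
import pandas as pd

# Hypothetical layout: one row per participant, a 'platform' label, and boolean
# columns indicating whether each attention check was passed.
df = pd.read_csv("responses.csv")  # assumed file, not released with this abstract

df["failed_both"] = ~df["attention_check_1"] & ~df["attention_check_2"]
print(df.groupby("platform")["failed_both"].mean().round(2))
```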

Another part of the study examined the replicability of known findings using tasks from the judgment and decision-making literature (following Chandler et al., 2010): the Asian-disease gain vs. loss framing problem, the sunk-cost fallacy, and four anchoring-and-adjustment questions. We found the expected effects on both CrowdFlower and MicroWorkers, at levels comparable to MTurk, whereas RapidWorkers’ results were less than adequate. In another part, which used a die-throwing task, we found no differences in the propensity for dishonest behavior between the sites.

To conclude, we found that, at the time of writing, both CrowdFlower and MicroWorkers, but not RapidWorkers, could be viable alternatives to MTurk. Additional examinations revealed very little overlap between participants across the different sites, as well as some individual differences between the sites. The most pronounced, and probably most practically relevant, difference was that CrowdFlower’s sample included many more Asian participants and non-U.S. citizens than MicroWorkers or MTurk.

We believe additional research is required to understand the origins of these differences between the sites, and to further explore other aspects of data quality between these sites in comparison with MTurk, as well as in comparison with other, more traditional samples.

Using Nonnaive Participants Can Reduce Effect Sizes

EXTENDED ABSTRACT

When conducting a study, researchers often assume that participants are naive to the research materials, either because the pool of participants is large (e.g., Internet samples) or because participants’ prior exposure to research is limited (e.g., first-year college students). This assumption, however, is often violated. People can belong to a participant pool for several years, and some members are disproportionately likely to be sampled (Chandler, Mueller, & Paolacci, 2014). Moreover, researchers with overlapping interests rely on the same undergraduate subject pools, and participants may easily share information with each other (Edlund et al., 2009). People may also gain knowledge of research materials through college courses or media coverage.

Some research suggests that familiarity with research materials might impact findings. Prior knowledge may increase the likelihood of hypothesis guessing and potentially lead to demand effects (Weber & Cook, 1972). Relatedly, earlier conditions in within-subject experiments inform subsequent conditions, causing effects observed in between-subjects designs to be inflated, attenuated, or reversed (see Charness, Gneezy, & Kuhn, 2012). Recently, researchers have noted that responses to psychological measures correlate with proxies of prior participation in similar experiments, such as memory of prior participation (Greenwald & Nosek, 2001), the chronological order of the studies themselves (Rand et al., 2014), measures of the total number of completed experiments (Chandler et al., 2014), or naturally varying levels of prior experience with a task (Mason, Suri, & Watts, 2014). Although these findings suggest that non-naivety may influence observed effect sizes more generally, this possibility has not been directly tested. To address this gap, we examine how prior exposure to study materials affects responses.

Method

We conducted a two-stage study on Amazon Mechanical Turk (for a review see Paolacci & Chandler, 2014). One thousand participants completed a set of eleven two-condition experiments in Wave 1 (W1), testing phenomena such as anchoring, framing, and the retrospective gambler’s fallacy (full details about W1 are reported in Klein et al., 2014). In Wave 2 (W2), these participants were invited to participate in a study including the same experiments, excluding two that were not successful in W1. For each experiment, participants were randomly assigned to the same condition as in W1 or to the alternative condition. Additionally, we manipulated two factors that should affect whether participants recall previous materials and potentially moderate the effect of non-naivety. Visual Similarity was manipulated by randomly assigning participants to complete the experiments on the same platform as in W1 or on a different, visually distinct platform. Time Delay was manipulated by re-contacting participants a few days, about a week, or about a month after W1. This resulted in a 3 (Time Delay) × 2 (Visual Similarity) × 2 (Condition) between-participants design.

Results

We tested the effect of non-naivety on the responses of participants who participated in both W1 and W2 (N = 638; 55% women; Mage = 36, SD = 12.8). Overall, effect sizes declined from W1 (weighted d = 0.82) to W2 (weighted d = 0.63) by d = 0.19, a drop of about 25%. Only one effect size increased from W1 to W2 (the low vs. high scales task; Schwarz et al., 1985), and all others showed declines of 17% to 83%. Nine of the 12 effects (we analyzed the four anchoring tasks separately) exhibited declines, and 5 of these declines were statistically significant.
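These comparisons rest on standard effect-size arithmetic. As a minimal sketch (not the authors’ code), Cohen’s d per experiment and a sample-size-weighted average per wave could be computed as below; variable names are hypothetical, and simple n-weighting is assumed since the exact weighting scheme is not specified here.

```python
import numpy as np


def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))


def weighted_d(effects: list[float], ns: list[int]) -> float:
    """Average of per-experiment effect sizes, weighted by sample size (an assumption)."""
    return float(np.average(effects, weights=np.asarray(ns, dtype=float)))


# Synthetic illustration: two groups drawn so that the true effect is roughly d = 0.8.
rng = np.random.default_rng(0)
treatment, control = rng.normal(0.8, 1.0, 200), rng.normal(0.0, 1.0, 200)
print(round(cohens_d(treatment, control), 2))
```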

To examine whether the attenuation of effects was stronger when recalling information from previous participation was easier, we regressed W2 effect sizes on same vs. different Condition, Visual Similarity, Time Delay, and dummy variables for experiment (to account for differences in attenuation across experiments). There was a significant effect of Condition, reflecting that participants exposed to a different condition than in W1 demonstrated a greater decline of effects from W1 to W2. There were no other main effects or interactions.
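As a rough illustration of this kind of analysis (not the authors’ code), one could fit the regression with statsmodels as below, assuming a hypothetical long-format table of W2 effect sizes with one row per experiment and design cell.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: the W2 effect size ('d_w2'), whether participants saw the same
# or a different condition than in W1 ('same_condition'), the two moderators, and an
# experiment identifier.
cells = pd.read_csv("w2_effect_sizes.csv")  # assumed file, not released with this abstract

model = smf.ols(
    "d_w2 ~ C(same_condition) + C(visual_similarity) + C(time_delay) + C(experiment)",
    data=cells,
).fit()
print(model.params)
print(model.pvalues)
```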

After completing the study, participants reported for each experiment whether they remembered participating in it. Memory for participation in each experiment depended on Time Delay and ranged between 35% and 80% of participants. We regressed W2 effect sizes on whether participants reported remembering the prior experiment, the same vs. different condition dummy, and dummies for the different experiments. There was a significant effect of being assigned to a different condition, but neither memory of prior participation nor its interaction with same vs. different condition was significant. This suggests that self-reported memory of prior participation is at best a poor indicator of whether participants will display attenuated effect sizes because of prior participation.

Our findings show that prior exposure to research materials can reduce the effect sizes of true research findings. Effect sizes decreased by 40% on average, although the reduction differed across experiments. Future research should examine whether and how some paradigms (e.g., those that require participants to generate numerical estimates) are more susceptible to non-naivety. Effects were particularly attenuated when participants were exposed to the alternative condition of an experiment, suggesting that the decrease may be a function of information gained about the other condition. However, effects were also attenuated among participants exposed to the same condition twice, which might be explained by repeated exposure leading to more elaboration and decreased reliance on intuition (Sherman, 1980). Self-reported memory of participation does not identify all prior participants, or even those who demonstrate a particularly large non-naivety effect. This may be explained if participants quickly forget the source from which information was learned but do not forget the information itself (Johnson, Hashtroudi, & Lindsay, 1993).

Non-naivety is a serious concern for behavioral researchers that cannot be solved by controlling for self-reported previous participation. When directly monitoring prior participation is not possible, researchers should design procedures and stimuli that differ from those known to the tested population (Chandler et al., 2014), or increase their sample size to offset the anticipated decrease in power.
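The sample-size recommendation can be made concrete with a standard power calculation. The sketch below uses statsmodels and the aggregate effect sizes reported above purely for illustration; it is not part of the original abstract.

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

# Per-group n needed for 80% power (alpha = .05, two-sided) at the naive (0.82)
# versus attenuated (0.63) aggregate effect sizes reported above.
for d in (0.82, 0.63):
    n = power.solve_power(effect_size=d, alpha=0.05, power=0.80, alternative="two-sided")
    print(f"d = {d}: about {n:.0f} participants per condition")
```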

Research Design Decisions and Their Effects on Task Outcomes: An Illustration Using Free-sorting Experiments

EXTENDED ABSTRACT

Numerous research areas within psychology and marketing have relied on free-sorting tasks, in which participants allocate a set of objects into groups of their own choosing, to study the natural cognitive processing of information that consumers encounter in their lives (e.g., Blanchard 2011; Ross & Murphy 1999). Unfortunately, there is little systematic empirical research that provides guidance on how researchers should design sorting tasks to minimize unwanted consequences such as contaminated process data and depleted or dissatisfied participants. Because different studies have provided differing recommendations, we empirically investigate the effects of researcher-driven sorting task design decisions on a variety of outcomes.

To do so, we created an experimental design that systematically varies the decisions a researcher may face when adopting a sorting task, using a fractional factorial design to test the main effect of each factor (i.e., each researcher decision) on various dependent measures (Collins, Dziak, & Li, 2009). We then provide guidance on best practices and potential pitfalls. The factors, along with the final design (involving 36 orthogonal tasks), are presented in a table that can be downloaded here: http://tinyurl.com/blanchardtable

Participants & Procedure

We requested 720 participants (20 per task) from Amazon Mechanical Turk (MTurk) for a “consumer perceptions study.” Participants were paid $0.50 and were allocated evenly and sequentially across tasks using Qualtrics’ quota functions.

Once a participant clicked the survey link on MTurk, they were randomly assigned to one of the 36 task configurations via an initial Qualtrics survey whose sole purpose was random assignment. If the participant was assigned to a task design with pre-task examples (using pre-task tutorials: yes/no), Qualtrics displayed a short video that demonstrated musical instruments being sorted (adapted from Lickel et al., 2000). Participants then proceeded to the online sorting task interface (Cardsorting.net), which allows researchers to vary the number of objects to be sorted (number of objects: 20, 40, 60), to customize the instructions (providing a criterion for the sorts: similarity/dissimilarity), to use either a single-sort or a multiple-sort task (type of sorting task), to require that participants sort every object at least once (requiring all cards to be used at least once), and to ask participants to label the piles during the task, after the task, or not at all (if and when to ask for pile labels). After submitting their sorts, all participants proceeded to the same post-task Qualtrics survey, which contained additional dependent measures.
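To make the design space concrete, the sketch below enumerates the factors described above and draws one configuration at random. It enumerates the full factorial for simplicity, whereas the study used a 36-task fractional (orthogonal) subset, and the level labels are paraphrased rather than taken from the study materials.

```python
import itertools
import random

# Factor levels as described in the procedure (labels paraphrased).
factors = {
    "pretask_tutorial": ["yes", "no"],
    "num_objects": [20, 40, 60],
    "sort_criterion": ["similarity", "dissimilarity"],
    "task_type": ["single_sort", "multiple_sort"],
    "require_all_cards": ["yes", "no"],
    "pile_labels": ["during_task", "after_task", "none"],
}

# Full factorial (144 cells); the study itself used a 36-task orthogonal fraction.
full_design = [dict(zip(factors, levels)) for levels in itertools.product(*factors.values())]
print(len(full_design))

# Random assignment of an incoming participant to one configuration.
print(random.choice(full_design))
```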

Results

For our analyses, we use mixed-effects linear models in which the task number is a random effect and all researcher decisions are fixed effects; a minimal sketch of such a model appears after the recommendations below. We provide the following recommendations:

Researchers should keep the number of objects manageable, but there is no need to severely constrain the number of objects.

Although participants prefer tasks with fewer objects (20), we find no evidence that participants cannot properly follow instructions that require them to sort 40 or even 60 objects. As the number of objects increases, the biggest impact seems to be on completion time, perceptions of the effort required, and the extent to which the task was enjoyable.

Researchers should be mindful of what they ask participants to sort. The type of objects being sorted has a significant impact: using “group types” instead of food objects led to a significant increase in dropout rates, completion time (for an equivalent number of objects), and participant depletion.

There is no harm in allowing participants to sort objects into multiple piles. Even when given the option, participants assigned objects to multiple piles only when their perceptions dictated it. Allowing participants to do so when it does not conflict with research goals may result in less attrition.

If researchers want labels for the piles, they should ask participants to provide them after the sorts have been submitted as complete. Asking participants to name the piles during the task led to significantly greater dropout rates, a greater number of cards left unused, and longer completion times.

If researchers are trying to increase the frequency at which participants assign objects to multiple piles, they should consider providing pictures along with the objects’ names. Doing so allows participants to visualize the objects, and tends to lead to a greater number of objects used more than once in the sorts.

Requiring participants to use all the cards has little effect on task and satisfaction measures. If researchers suspect that participants will be familiar with the majority of the objects to be sorted, then requiring that they sort all objects is more likely to result in fully complete data without any detrimental effects on the sorting process or participants’ experience.

Sorting interfaces are sufficiently intuitive. Expansive instructions may not be necessary. We found that participants followed the instructions, and providing them with videos that illustrated the features of the sorting interface (e.g., adding/removing objects from piles, labeling, etc.) did not have much of an impact other than to

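As noted above, a minimal sketch of the mixed-effects specification described in the Results section is given below. It assumes a hypothetical long-format data layout (one row per participant) with hypothetical column names; it is an illustration, not the authors’ code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical layout: one row per participant, with an outcome of interest
# (e.g., completion time), the researcher-decision factors, and the task id.
df = pd.read_csv("sorting_outcomes.csv")  # assumed file, not released with this abstract

model = smf.mixedlm(
    "completion_time ~ C(num_objects) + C(object_type) + C(pretask_tutorial)"
    " + C(sort_criterion) + C(task_type) + C(require_all_cards) + C(pile_labels)",
    data=df,
    groups=df["task_id"],  # task number as a random intercept
).fit()
print(model.summary())
```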
