3 Plan for behavioral variability
Online studies can be undermined by low participant engagement or deliberate attempts to “cheat”. These issues are often invisible to the experimenter and can render data unusable. However, there is growing consensus that, when properly managed, online data can rival the quality of laboratory data (Dandurand, Shultz, and Onishi 2008; Germine et al. 2012; Hartshorne et al. 2019; Sauter, Stefani, and Mack 2022; Semmelmann and Weigelt 2017).
3.1 Understand and deter the sources of ‘bad data’
Low-quality online data can arise for several reasons. First, participants may misunderstand the task and complete it in unintended ways. For example, participants in a visuomotor rotation experiment may fail to recognize that they should re-aim to counteract the imposed perturbation, and as a result, perseverate in reaching directly toward the target. It is therefore crucial that participants adequately understand the task before starting (see § Instruct Clearly for tips on instructions).
Second, participants may cheat – gaining an unfair advantage through dishonest behavior – for example by writing notes during an online memory assessment. It is therefore crucial to take proactive steps to discourage cheating where possible, such as removing performance-based monetary incentives, explicitly defining for participants which behaviors count as dishonest, and using task designs that minimize opportunities for misconduct (e.g., requiring rapid responses that leave little time for external aids).
Third, participants may not be fully attentive to the task. Distractions could range from brief lapses that affect a few trials (e.g., replying to a text message) to prolonged disengagement that affects the whole session (e.g., watching a film). It is therefore crucial to proactively deter these behaviors by making the experiment more engaging or embedding attention checks and catch trials that require an atypical response (see § Make it engaging; see Rodd (2024) for a thorough treatment of the potential sources of bad data).
3.1.1 Bots?
Concerns about automated or AI-generated survey responses (‘bots’) have grown alongside advances in artificial intelligence, including large language models (Storozuk et al. 2020; Moss et al. 2021; Griffin et al. 2022; Webb and Tangney 2022; Goodrich et al. 2023; Keith and McKay 2024; Westwood 2025). At present, bots appear ill-suited to the types of behavioral studies considered here, largely because they struggle to mimic human movement (indeed, idiosyncratic human movement underpins many CAPTCHA tests). Nonetheless, as AI continues to advance, more robust safeguards may be needed to prevent bots from completing behavioral experiments.
3.2 Define and flag bad data
To anticipate how poor data quality may manifest, researchers can complete their own tasks while deliberately mimicking likely failure modes, such as rushing through trials or multitasking, and examine the resulting data patterns. We also recommend recording auxiliary variables that can help detect disengaged behavior. For example, even when response accuracy is the primary dependent measure, recording response time provides an additional indicator of inattentiveness, which may manifest as unusually fast and overly consistent response times.
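As a minimal illustration, not drawn from any of the cited studies, the following Python sketch flags participants whose response times are both implausibly fast and overly consistent; the threshold values are hypothetical placeholders that should be calibrated against pilot data.

```python
import numpy as np

# Hypothetical thresholds (calibrate against pilot data):
FAST_MEDIAN_RT_MS = 250   # a median RT below this is implausibly fast
LOW_RT_SPREAD_MS = 30     # an interquartile range below this is overly consistent

def flag_inattentive_rts(rts_ms: np.ndarray) -> bool:
    """Return True if a participant's response times look disengaged."""
    median_rt = np.median(rts_ms)
    iqr = np.percentile(rts_ms, 75) - np.percentile(rts_ms, 25)
    return (median_rt < FAST_MEDIAN_RT_MS) and (iqr < LOW_RT_SPREAD_MS)

# Example: stereotyped, rapid responding produces a flag for further inspection
rts = np.random.normal(loc=220, scale=10, size=200)
print(flag_inattentive_rts(rts))  # True
```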
Clearly defining what constitutes bad data enables targeted tests to identify and remove them; below, we offer several recommendations. First, we recommend flagging bad data via objective criteria – behavioral metrics with predefined thresholds – rather than subjective judgement or visual inspection. For instance, researchers might flag trials as inattentive when response error exceeds a threshold and exclude participants who surpass a specified proportion of such trials. Second, we recommend defining bad data in terms of absolute thresholds (e.g., where error exceeds a set value) rather than relative ones (e.g., distance from the mean), so that inattentive participants do not distort the benchmark. Third, when relative criteria are necessary, we recommend robust outlier detection methods, such as those based on medians and median absolute deviations (Gagné and Franzen 2023; Leys et al. 2013), which are less sensitive to extreme values.
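As a minimal sketch of the third recommendation, the following Python function flags values that lie more than a fixed number of median absolute deviations (MADs) from the median (cf. Leys et al. 2013); the criterion of 3 MADs is a common but ultimately arbitrary choice, not a value taken from the cited studies.

```python
import numpy as np

def mad_outliers(values: np.ndarray, criterion: float = 3.0) -> np.ndarray:
    """Flag values more than `criterion` MADs from the median (robust relative criterion)."""
    median = np.median(values)
    mad = 1.4826 * np.median(np.abs(values - median))  # 1.4826 rescales MAD to ~SD under normality
    return np.abs(values - median) > criterion * mad
```

Because the median and MAD are barely affected by a handful of extreme values, inattentive participants contribute little to the benchmark against which they are judged.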
While such checks can detect many forms of problematic behavior, they cannot catch all of them. For example, while some behavioral signatures of cheating are detectable (e.g., unusually consistent response times, perfect accuracy, or the absence of hallmark behavioral effects), a diligent cheater could intentionally mask these patterns. Researchers should therefore be realistic about how much of a guarantee post-hoc exclusions can offer.
3.3 Pre-register exclusion criteria
All exclusion and inclusion criteria should ideally be pre-registered before data collection, that is, publicly specified in advance in a time-stamped record that constrains analytic decisions. Pre-registration limits researcher degrees of freedom, reducing the risk of spurious findings from questionable research practices (e.g., ‘p-hacking’; Simmons, Nelson, and Simonsohn (2021)). However, it can be difficult to anticipate how bad data may arise, even with extensive pilot testing. As such, researchers may need to deviate from pre-registered plans as they collect more data, but any such deviations should be transparently reported (Lakens 2024).
3.4 Account for greater variability in a priori power analyses
Online studies often show greater within- and between-subject variability than in-lab samples (Miller et al. 2018; Semmelmann and Weigelt 2017). While some of this variability reflects welcome sample diversity, it also reduces statistical power to detect effects of interest. To pre-empt this, we recommend conducting conservative power analyses informed by pilot data or meta-analyses to obtain estimates of sample size (see Box 1).
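As one illustration of a conservative approach, the sketch below deflates a pilot effect size before solving for the required sample size using the statsmodels power module; the pilot effect size (d = 0.5) and the 25% attenuation factor are hypothetical assumptions, not values from the cited work.

```python
import math
from statsmodels.stats.power import TTestIndPower

pilot_d = 0.5                     # effect size estimated from pilot data (assumed)
conservative_d = pilot_d * 0.75   # deflate to allow for greater online variability (assumed)

n_per_group = TTestIndPower().solve_power(effect_size=conservative_d,
                                          alpha=0.05, power=0.80,
                                          alternative='two-sided')
print(f"Required n per group: {math.ceil(n_per_group)}")
```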
3.5 The principle in action
We present two experiments that underscore the need to tailor exclusion criteria to the specific demands of each study (Figure 3.1). In our first experiment, we examined goal-directed motor control by instructing participants to simply move straight through a target, with veridical visual feedback throughout the movement (Figure 3.1a; Warburton et al. (2025)). An engaged, instruction-abiding participant would have nearly all their movements fall near the target (Figure 3.1b). In contrast, a disengaged participant might either perseverate in a fixed direction or move in a random manner, with few reaches directed toward the target (Figure 3.1c). Drawing on pilot data, we set a ±60° threshold to flag outlier trials, a conservative criterion chosen to accommodate typical motor variability (≈±10°), and excluded participants who exceeded this threshold on more than 20% of trials.
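A minimal sketch of this criterion, assuming reach errors are stored as signed angular deviations from the target in degrees (variable names are ours, not from the original analysis code):

```python
import numpy as np

OUTLIER_DEG = 60.0             # flag trials with |error| > 60°
MAX_OUTLIER_PROPORTION = 0.20  # exclude participants with > 20% flagged trials

def exclude_participant(reach_errors_deg: np.ndarray) -> bool:
    """Apply the ±60°, 20%-of-trials exclusion rule to one participant's data."""
    outlier_trials = np.abs(reach_errors_deg) > OUTLIER_DEG
    return outlier_trials.mean() > MAX_OUTLIER_PROPORTION
```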
However, the same exclusion criteria would not be appropriate in contexts where variable, exploratory behavior is expected. In our second experiment, we examined motor adaptation by instructing participants to counteract a 60° visuomotor rotation, requiring them to deliberately re-aim away from the target (Figure 3.1d; Chen et al. (2025)). Performance on this task is characterized by pronounced trial-to-trial fluctuations as participants explore and ultimately discover an effective re-aiming strategy to counteract the perturbation (Figure 3.1e; Townsend et al. (2025); Ding, Niyogi, and Tsay (2025)). Excluding data based on a simple absolute threshold (e.g., ±60° of fully compensatory movement) would disproportionately remove data from the early exploratory phase, artificially smoothing learning curves and obscuring the very signatures of adaptation (Tsay et al. 2024). Techniques that account for local variability (e.g., sliding windows) and/or complementary metrics (e.g., unusually fast (<100 ms) or slow (>1000 ms) reaction times) may better isolate “bad data” arising from attentional lapses (see the sketch below). An inattentive participant would not attempt to identify an appropriate strategy and would instead persist in aiming directly at the target throughout (Figure 3.1f), behavior that would have appeared attentive in the first experiment. Thus, appropriately identifying problematic data rests on assumptions about what constitutes signal versus noise, assumptions that are often experiment-specific and require subject-matter expertise.
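A minimal sketch of such an approach flags trials relative to a local sliding-window median rather than an absolute, target-relative threshold, and adds the reaction-time bounds mentioned above; the window size and deviation threshold are hypothetical and would need tuning to the task.

```python
import numpy as np

def flag_local_outliers(hand_angles_deg: np.ndarray,
                        rts_ms: np.ndarray,
                        window: int = 5,
                        deviation_deg: float = 45.0) -> np.ndarray:
    """Flag trials that deviate sharply from their local context or have implausible RTs."""
    n = len(hand_angles_deg)
    flags = np.zeros(n, dtype=bool)
    for t in range(n):
        lo, hi = max(0, t - window), min(n, t + window + 1)
        local_median = np.median(hand_angles_deg[lo:hi])
        flags[t] = abs(hand_angles_deg[t] - local_median) > deviation_deg
    # Complementary criterion from the text: implausibly fast or slow reaction times
    flags |= (rts_ms < 100) | (rts_ms > 1000)
    return flags
```

Because each trial is compared with its immediate neighbours, gradual strategy changes during early learning are preserved, whereas isolated lapses still stand out.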