2  Plan for technical variability

A primary concern with online crowdsourcing is the variability in participants’ hardware and software. Computers may differ in screen size and resolution, and input devices can range from mice to trackballs to trackpads, each with different characteristics. Although these factors may appear trivial, they can sometimes compromise the ability to address key research questions (Germine, Reinecke, and Chaytor 2019). Below, we outline strategies to account for variability in software and hardware through experimental design and data analysis.

2.1 Combat technical variability through experiment design

Modern computing systems exhibit variability both within a single setup and across different setups. Within a given setup, response times measured via key presses can vary by roughly 30 ms (standard deviation, SD). Across setups, differences in operating systems and browsers can shift mean response times by as much as 80 ms (Anwyl-Irvine et al. 2021; Bridges et al. 2020). Hardware differences can further exacerbate behavioral variability, with reaction times differing by 70 ms across different touchscreen devices (Schatz, Ybarra, and Leitner 2015) and by 130 ms between mouse and trackpad inputs (Warburton et al. 2025a; Watral et al. 2023).

To mitigate technical variability, we recommend the use of within-subject experimental designs where possible. This approach effectively controls for both within- and between-setup variability, since setup-specific noise affects all conditions equally.

However, when within-subject designs are infeasible and between-subject designs are required, it becomes especially important to understand the two key ways in which technical variability can undermine data quality: first, technical variability adds noise around group means and thus reduces statistical power to detect effects of interest (Figure 2.1a); and second, if random participant assignment fails to balance technical variables across conditions, between-subject designs can lead to spurious effects (Figure 2.1b).

Both risks above are especially problematic when technical variability approaches or exceeds natural behavioral variability – for example, adding 35 ms of technical variability to 70 ms of behavioral variability raises a group’s standard deviation to just 78 ms, a negligible effect on group comparisons. In contrast, adding technical variability (70 ms) equal to behavioral variability (70 ms) raises a group’s standard deviation to 99 ms, a distortion that can obscure group differences. Nonetheless, larger sample sizes can help offset these risks, and a priori power analyses can clarify how technical noise affects the inferences drawn (see Box 1).
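These combined values follow from treating behavioral and technical noise as independent, so that their variances add:

\[
\sigma_{\text{combined}} = \sqrt{\sigma_b^2 + \sigma_t^2}, \qquad \sqrt{70^2 + 35^2} \approx 78 \text{ ms}, \qquad \sqrt{70^2 + 70^2} \approx 99 \text{ ms}.
\]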

Studies of individual differences are especially vulnerable to technical variability. For example, if a demographic factor of interest (e.g., sex/gender) is associated with the hardware participants use (e.g., men more often use trackpads; women more often use mice), and these devices differ in response time properties, analyses that ignore this mediating pathway may falsely attribute response-time differences to sex/gender, despite no direct causal link (Figure 2.1c). In principle, one could measure and control for many such variables, but in practice it is difficult to know whether unmeasured factors remain that could undermine the study, making it crucial for researchers to evaluate how seriously potential confounds could threaten their study.
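To make this pathway concrete, the following R sketch simulates data in which the two groups have identical underlying response times, but input devices differ in latency and are unevenly distributed across groups. All numbers are illustrative rather than estimates from any study; a naive comparison can show a spurious group effect that shrinks once device is included as a covariate.

```r
set.seed(1)

n <- 400                                  # participants per group
# Device usage differs between groups (illustrative proportions)
device_f <- rbinom(n, 1, 0.6)             # 1 = trackpad, 0 = mouse
device_m <- rbinom(n, 1, 0.3)

# Response times: identical group means, but trackpads add ~50 ms
rt_f <- 450 + 50 * device_f + rnorm(n, sd = 70)
rt_m <- 450 + 50 * device_m + rnorm(n, sd = 70)

dat <- data.frame(
  rt     = c(rt_f, rt_m),
  group  = rep(c("F", "M"), each = n),
  device = factor(c(device_f, device_m), labels = c("mouse", "trackpad"))
)

# Naive model: a spurious group effect can emerge
summary(lm(rt ~ group, data = dat))

# Adjusted model: the group effect shrinks once device is controlled
summary(lm(rt ~ group + device, data = dat))
```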

Figure 2.1: Technical variability can introduce variability or bias to the data. (a) When substantial technical variability is added to behavioral variability, statistical power to detect true effects is reduced. (b) If random assignment fails to balance extraneous variables (e.g., input device) between groups (e.g., sex/gender), there can be spurious group differences in the dependent variable (e.g., response time) even in the absence of a true difference. (c) If sex/gender correlates with input device, and that device itself correlates with response time, a spurious correlation between sex/gender and response time may emerge.
Box 1: Accounting for variability through power analyses

To demonstrate how technical variability can be accounted for when conducting power analyses, we modeled an experiment where we expect to observe a 40 ms difference in reaction time between two groups, each with a behavioral standard deviation in reaction time of 80 ms across participants (similar to estimates for simple reaction times; Deary and Der 2005), equivalent to a Cohen’s d of 0.5. However, our ability to measure this true effect is hampered by technical variability between setups.

The simplest way to assess the impact of technical variability is to use a power calculator, such as G*Power (Faul et al. 2007), and enter an inflated estimate of each group’s standard deviation. Assuming the behavioral and technical noise are independent, their variances add, so the sample standard deviation can be calculated as \(\sigma_s = \sqrt{\sigma_b^2 + \sigma_t^2}\), where \(\sigma_b\) and \(\sigma_t\) are the behavioral and technical standard deviations, respectively. With this, we can find the sample size required to reach a desired statistical power, here 80%, across different levels of technical variability (Figure 2.2a).
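A minimal sketch of this calculation in base R, using power.t.test in place of G*Power (the assumed 40 ms of technical noise is purely illustrative):

```r
# Behavioral SD of 80 ms, true group difference of 40 ms (Cohen's d = 0.5)
sigma_b <- 80
delta   <- 40

# Inflate the SD with an assumed 40 ms of between-setup technical noise
sigma_t <- 40
sigma_s <- sqrt(sigma_b^2 + sigma_t^2)

# Sample size per group for 80% power, without and with technical noise
power.t.test(delta = delta, sd = sigma_b, power = 0.8, type = "two.sample")
power.t.test(delta = delta, sd = sigma_s, power = 0.8, type = "two.sample")
```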

An alternative, more comprehensive approach is to observe how power changes as the sample size and the potential technical variability are both varied, as shown in Figure 2.2b (we provide R scripts in the materials). This framework can be extended to incorporate additional factors, for example, jointly modeling power as a function of both the number of participants and the number of trials per participant (Figure 2.2c; see Baker et al. (2021) for more information). The advantage of this approach is that it shows the ‘wiggle room’ around a selected level of power, for instance if participant numbers drop due to exclusions. Together, these power analyses clarify the impact of technical variability and help inform the overall experimental design.
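As a simplified, purely analytic sketch of the same idea (this is not the provided script; the grid values are illustrative), power can be computed over a grid of sample sizes and hypothetical levels of technical variability:

```r
# Power as a function of sample size and technical SD, for a 40 ms true effect
grid <- expand.grid(
  n_per_group = seq(20, 200, by = 10),
  sigma_t     = seq(0, 80, by = 20)
)

grid$power <- mapply(function(n, sigma_t) {
  sigma_s <- sqrt(80^2 + sigma_t^2)                 # inflated SD
  power.t.test(n = n, delta = 40, sd = sigma_s,
               type = "two.sample")$power
}, grid$n_per_group, grid$sigma_t)

# Power "contours": e.g., inspect where power first exceeds 0.8
subset(grid, power >= 0.8)
```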

Figure 2.2: Power analysis helps anticipate the effects of technical variability. (a) The sample size required to achieve 80% power increases as technical variability rises. (b) Power “contours” illustrate the safety margin available at different levels of technical variability. (c) These contours also indicate how large a sample and how many trials per participant are needed to achieve the desired statistical power.

2.2 Standardize hardware and software

Standardizing software and hardware across participants is one of the most effective ways to reduce technical variability. Whenever possible, we recommend researchers restrict participation to specific device types (i.e., computers, tablets, or phones), a feature supported by crowdsourcing platforms such as Prolific, because measures such as simple response times can vary substantially across these device types (Passell et al. 2021). Researchers can also implement code or questionnaires to further restrict participation based on operating system, browser type, input device, and screen refresh rate. However, the benefits of greater experimental control must be weighed against the risk of homogenizing the sample. For example, findings may fail to generalize if a study inadvertently limits its sample to tech-savvy participants or those using high-performance devices (Germine, Reinecke, and Chaytor 2019).
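Where such restrictions cannot be enforced at recruitment, the same criteria can be applied when screening recorded setup information. A minimal R sketch, with hypothetical column names and illustrative thresholds:

```r
# Hypothetical per-participant metadata, as might be logged by the experiment
meta <- data.frame(
  id           = 1:4,
  device_type  = c("desktop", "desktop", "phone", "desktop"),
  input_device = c("mouse", "trackpad", "touch", "mouse"),
  refresh_rate = c(60, 60, 60, 30)
)

# Keep only setups within the intended range
keep <- meta$device_type == "desktop" &
        meta$input_device %in% c("mouse", "trackpad") &
        meta$refresh_rate >= 55
meta_included <- meta[keep, ]
```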

2.3 Standardize the experiment

Standardizing experimental stimuli is a powerful way to further reduce unwanted variability. For example, psychophysics studies often require precise control over stimulus size in terms of visual angle, which depends on both the physical size of the stimulus and the participant’s viewing distance. Similarly, motor control studies may need to measure the actual movement distance on a participant’s trackpad.
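For reference, the visual angle \(\theta\) subtended by a stimulus of physical size \(s\) viewed from distance \(d\) is

\[
\theta = 2\arctan\!\left(\frac{s}{2d}\right),
\]

so, for example, a 1 cm stimulus viewed from 50 cm subtends roughly 1.15°.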

To standardize stimuli across participants, researchers can implement calibration procedures that exploit an element of universality. For example, the ubiquity and standard size of credit cards make it easy to map virtual units onto real-world measurements, allowing estimation of on-screen stimulus size (Li et al. 2020; Yung et al. 2015) and trackpad movement distance (Coltman et al. 2021). Similarly, the relatively fixed position of the human eye’s blind spot can be used to standardize viewing distances (Li et al. 2020), and audio calibration procedures can adjust volume to each participant’s setup and hearing ability (Zhao et al. 2022).
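As a rough sketch of how a credit-card calibration can feed into stimulus sizing (the on-screen card width, stimulus size, and viewing distance below are illustrative, not values from the cited procedures):

```r
# Convert pixels to millimetres using the participant's matched card width
card_width_mm <- 85.6   # ISO/IEC 7810 ID-1 card width
card_width_px <- 320    # hypothetical logged width of the on-screen rectangle
                        # the participant adjusted to match their card
px_per_mm <- card_width_px / card_width_mm

# Draw a stimulus with a desired physical width, e.g. 20 mm
stim_width_px <- 20 * px_per_mm

# Visual angle of that stimulus at an assumed viewing distance of 500 mm
viewing_distance_mm <- 500
theta_deg <- 2 * atan(20 / (2 * viewing_distance_mm)) * 180 / pi
```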

However, standardizing experimental stimuli comes with trade-offs. First, calibration procedures can be tedious and frustrating. Moreover, some participants may be unwilling or unable to use tools like a credit card for calibration (though clear instructions can help mitigate this; § Instruct clearly). Second, device differences limit how much standardization is possible (for instance, stimuli must be sized to remain visible even on the smallest screens), often forcing researchers to impose additional hardware requirements. Third, even with calibration, variability inevitably persists (even in lab studies), so researchers must weigh the benefits of calibration against participant burden and assess how the remaining variability could affect their research question.

2.4 Record technical factors

For technical variables that cannot be controlled, we recommend recording them and accounting for them analytically, for example by including them as covariates in statistical models (more on this in § Collect data comprehensively). For instance, although screen refresh rate cannot be controlled and directly determines temporal resolution, it can be measured and modeled during analysis to assess whether it affects the behavior of interest.
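A minimal R sketch of this kind of adjustment, with simulated data and hypothetical variable names:

```r
# Simulate per-participant data with a logged refresh rate and input device,
# then include the technical factors as covariates (all values illustrative)
set.seed(2)
n <- 200
dat <- data.frame(
  condition    = rep(c("A", "B"), each = n / 2),
  refresh_rate = sample(c(60, 120, 144), n, replace = TRUE),
  input_device = sample(c("mouse", "trackpad"), n, replace = TRUE)
)
dat$rt <- 450 + 20 * (dat$condition == "B") -
          0.2 * (dat$refresh_rate - 60) +
          50 * (dat$input_device == "trackpad") + rnorm(n, sd = 70)

summary(lm(rt ~ condition + refresh_rate + input_device, data = dat))
```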

2.5 Pilot the experiment on a range of devices

We recommend piloting studies across the full range of devices participants might use. Are stimuli legible on small screens? Are experiments easily completed using a trackpad or mouse? Systematically testing diverse setups in advance helps detect and prevent systematic errors before launching the experiment.

2.6 The principle in action

We present an example illustrating how input devices can contaminate online studies of individual differences (Figure 2.3a). When considering the effect of sex/gender on reaction times, males appear to be 53 ms quicker to initiate a goal-directed movement than females (Warburton et al. 2025a). However, after including input device (mouse vs trackpad) as a covariate, the sex/gender gap shrinks to 6 ms and is no longer statistically significant. Thus, the apparent sex/gender effect reflects systematic latency differences between devices combined with biased device usage: female participants were more likely to use slower trackpads, whereas male participants more often used faster mice (also see Zhang et al. 2026).

This technological confound extends beyond response times to many behavioral measures. For example, reach velocity profiles differ between devices (Figure 2.3b), posing a concern for mouse-tracking studies that infer latent cognitive processes from these trajectories (Dotan et al. 2019). However, device effects do not extend to all behaviors: in visuomotor adaptation, the rate and extent of adaptation are comparable between trackpad and mouse users (Figure 2.3c; Kim, Forrence, and McDougle 2022; Tsay et al. 2021, 2024; Warburton et al. 2025b). This contrast underscores the need to identify when technical variability is likely to influence outcomes.

Figure 2.3: Technical variability can have different effects across behaviors. (a) When modeling the effect of sex/gender alone on reaction time, a significant effect is observed where males show faster reaction times than females (left). However, this apparent sex/gender effect disappears once input device is included as a covariate (right). The intercept reflects reaction times for females pooled across devices when device is not controlled (left), and for females specifically using a mouse when device is controlled (right). (b) Other variables, such as movement time, target-click latency, and overall cursor speed (shown) also differ between devices. (c) In contrast, the rate and extent of visuomotor adaptation in response to a 30° rotation does not differ between devices. Data in (a & b) from Warburton et al. (2025a), and in (c) from Warburton et al. (2025b).