Sample size matters when estimating test-retest reliability of behaviour

manuscript_accepted.pdf - Accepted Version (3MB)
Restricted to Repository staff only

Williams, B. ORCID: https://orcid.org/0000-0003-3844-3117, Fitzgibbon, L. ORCID: https://orcid.org/0000-0002-8563-391X, Brady, D. and Christakou, A. ORCID: https://orcid.org/0000-0002-4267-3436 (2024) Sample size matters when estimating test-retest reliability of behaviour. Behavior Research Methods. ISSN 1554-3528 doi: 10.3758/s13428-025-02599-1 (In Press)

Abstract/Summary

Intraclass correlation coefficients (ICCs) are a commonly used metric in test-retest reliability research to assess a measure's ability to quantify systematic between-subject differences. However, estimates of between-subject differences are also influenced by factors including within-subject variability, random error, and measurement bias. Here, we use data collected from a large online sample (N = 150) to (1) quantify the test-retest reliability of behavioural and computational measures of reversal learning using ICCs, and (2) use our dataset as the basis of a simulation study investigating the effects of sample size on variance component estimation and on the association between variance component estimates and ICC measures. In line with previously published work, we find reliable behavioural and computational measures of reversal learning, a commonly used assay of behavioural flexibility. Reliable estimates of the between-subject, within-subject (across-session), and error variance components for behavioural and computational measures (with ±.05 precision and 80% confidence) required sample sizes ranging from 10 to over 300 (behavioural median N: between-subjects = 167, within-subjects = 34, error = 103; computational median N: between-subjects = 68, within-subjects = 20, error = 45). These sample sizes exceed those typically used in reliability studies (circa 30), suggesting that larger samples are required to robustly estimate the reliability of task performance measures. Additionally, ICC estimates showed strong positive correlations with the between-subject variance component and strong negative correlations with the error variance component, as might be expected, and these associations remained relatively stable across sample sizes. However, ICC estimates were weakly correlated or uncorrelated with within-subject variance, highlighting the importance of variance decomposition in reliability studies.
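The abstract does not specify which ICC form the authors used, but the relationship it describes — ICC rising with between-subject variance and falling with error variance — can be illustrated with a minimal sketch. The function below computes ICC(2,1) (two-way random effects, absolute agreement) from the ANOVA mean squares of a subjects-by-sessions data matrix; the function name and design are this sketch's assumptions, not the paper's code.

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1) for a (n_subjects, k_sessions) matrix, computed
    from two-way ANOVA mean squares (rows = subjects, cols = sessions)."""
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()   # between-subjects
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()   # between-sessions
    ss_total = ((data - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols                    # residual error
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Toy test-retest data: a stable per-subject trait plus small session noise
# should yield a high ICC; inflating the noise term drives the ICC down.
rng = np.random.default_rng(0)
trait = rng.normal(0, 1, (150, 1))                 # between-subject differences
scores = trait + rng.normal(0, 0.3, (150, 2))      # two sessions with error
print(round(icc_2_1(scores), 2))
```

Sweeping the noise standard deviation in a loop like this, across repeated simulated samples of varying N, is the general shape of the simulation study the abstract describes.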


Item Type Article
URI https://reading-clone.eprints-hosting.org/id/eprint/120436
Refereed Yes
Divisions Interdisciplinary Research Centres (IDRCs) > Centre for Integrative Neuroscience and Neurodynamics (CINN)
Life Sciences > School of Psychology and Clinical Language Sciences > Department of Psychology
Life Sciences > School of Psychology and Clinical Language Sciences > Neuroscience
Life Sciences > School of Psychology and Clinical Language Sciences > Language and Cognition
Uncontrolled Keywords reliability; test-retest; sample size; reinforcement learning; computational modelling; reversal learning; cognitive flexibility
Publisher Springer
