References

Alagar, Vangalur S., and K. Periyasamy. 2011. Specification of Software Systems. 2nd ed. Texts in Computer Science. New York: Springer.

Attali, Yigal. 2011. “Immediate Feedback and Opportunity to Revise Answers: Application of a Graded Response IRT Model.” Applied Psychological Measurement 35 (6): 472–79. https://doi.org/10.1177/0146621610381755.

Attali, Yigal, Andrew Runge, Geoffrey T. LaFlair, Kevin Yancey, Sarah Goodwin, Yena Park, and Alina A. von Davier. 2022. “The Interactive Reading Task: Transformer-based Automatic Item Generation.” Frontiers in Artificial Intelligence 5 (July): 903077. https://doi.org/10.3389/frai.2022.903077.

Baarsen, Jeroen van. 2014. GitLab Cookbook.

Baghaei, Purya, and Mona Tabatabaee. 2015. “The C-Test: An Integrative Measure of Crystallized Intelligence.” Journal of Intelligence 3 (2): 46–58. https://doi.org/10.3390/jintelligence3020046.

Bartram, Dave. 2005. “Testing on the Internet: Issues, Challenges and Opportunities in the Field of Occupational Assessment.” In Computer-Based Testing and the Internet, edited by Dave Bartram and Ronald K. Hambleton, 13–37. West Sussex, England: John Wiley & Sons, Ltd. https://doi.org/10.1002/9780470712993.ch1.

Bartram, Dave, and Ronald K. Hambleton, eds. 2006. Computer-Based Testing and the Internet: Issues and Advances. Chichester: Wiley.

Becker, Benjamin, Dries Debeer, Karoline A. Sachse, and Sebastian Weirich. 2021. “Automated Test Assembly in R: The eatATA Package.” Psych 3 (2): 96–112. https://doi.org/10.3390/psych3020010.

Bengs, Daniel, Ulf Kroehne, and Ulf Brefeld. 2021. “Simultaneous Constrained Adaptive Item Selection for Group-Based Testing.” Journal of Educational Measurement 58 (2): 236–61. https://doi.org/10.1111/jedm.12285.

Bennett, Randy Elliot, James Braswell, Andreas Oranje, Brent Sandene, Bruce Kaplan, and Fred Yan. 2008. “Does It Matter If I Take My Mathematics Test on Computer? A Second Empirical Study of Mode Effects in NAEP,” 39.

Böckenholt, Ulf, and Thorsten Meiser. 2017. “Response Style Analysis with Threshold and Multi-Process IRT Models: A Review and Tutorial.” British Journal of Mathematical and Statistical Psychology 70 (1): 159–81. https://doi.org/10.1111/bmsp.12086.

Bolt, Daniel. 2016. “Item Response Models for CBT.” In Technology and Testing: Improving Educational and Psychological Measurement, edited by Fritz Drasgow, 305. Routledge.

Born, Sebastian, and Andreas Frey. 2017. “Heuristic Constraint Management Methods in Multidimensional Adaptive Testing.” Educational and Psychological Measurement 77 (2): 241–62. https://doi.org/10.1177/0013164416643744.

Bridgeman, Brent. 2009. “Experiences from Large-Scale Computer-Based Testing in the USA.” The Transition to Computer-Based Assessment 39.

Bryant, William. 2017. “Developing a Strategy for Using Technology-Enhanced Items in Large-Scale Standardized Tests.” https://doi.org/10.7275/70YB-DJ34.

Buchanan, Tom. 2002. “Online Assessment: Desirable or Dangerous?” Professional Psychology: Research and Practice 33 (2): 148–54. https://doi.org/10.1037/0735-7028.33.2.148.

Buerger, Sarah, Ulf Kroehne, and Frank Goldhammer. 2016. “The Transition to Computer-Based Testing in Large-Scale Assessments: Investigating (Partial) Measurement Invariance Between Modes.”

Buerger, Sarah, Ulf Kroehne, Carmen Koehler, and Frank Goldhammer. 2019. “What Makes the Difference? The Impact of Item Properties on Mode Effects in Reading Assessments.” Studies in Educational Evaluation 62: 1–9. https://doi.org/10.1016/j.stueduc.2019.04.005.

Bugbee, Alan C. 1996. “The Equivalence of Paper-and-Pencil and Computer-Based Testing.” Journal of Research on Computing in Education 28 (3): 282.

Christensen, Garret S., Jeremy Freese, and Edward Miguel. 2019. Transparent and Reproducible Social Science Research: How to Do Open Science. Oakland, California: University of California Press.

Clariana, Roy, and Patricia Wallace. 2002. “Paper-Based Versus Computer-Based Assessment: Key Factors Associated with the Test Mode Effect.” British Journal of Educational Technology 33 (5): 593–602. https://doi.org/10.1111/1467-8535.00294.

Cochran, Gary L., Jennifer A. Foster, Donald G. Klepser, Paul P. Dobesh, and Allison M. Dering-Anderson. 2020. “The Impact of Eliminating Backward Navigation on Computerized Examination Scores and Completion Time.” American Journal of Pharmaceutical Education 84 (12): ajpe8034. https://doi.org/10.5688/ajpe8034.

Dann, Peter L., Sidney H. Irvine, and Janet M. Collis, eds. 1991. Advances in Computer-Based Human Assessment. Dordrecht: Springer Science+Business Media.

Das, Bidyut, Mukta Majumder, Santanu Phadikar, and Arif Ahmed Sekh. 2021. “Automatic Question Generation and Answer Assessment: A Survey.” Research and Practice in Technology Enhanced Learning 16 (1): 5. https://doi.org/10.1186/s41039-021-00151-1.

Deribo, Tobias, Ulf Kroehne, and Frank Goldhammer. 2021. “Model-Based Treatment of Rapid Guessing.” Journal of Educational Measurement 58 (2): 281–303. https://doi.org/10.1111/jedm.12290.

Diao, Qi, and Wim J. van der Linden. 2011. “Automated Test Assembly Using lp_solve Version 5.5 in R.” Applied Psychological Measurement 35 (5): 398–409. https://doi.org/10.1177/0146621610392211.

DiBattista, David. 2013. “The Immediate Feedback Assessment Technique: A Learner-centered Multiple-choice Response Form.” Canadian Journal of Higher Education 35 (4): 111–31. https://doi.org/10.47678/cjhe.v35i4.184475.

DiCerbo, Kristen, Emily Lai, and Matthew Ventura. 2020. “Assessment Design with Automated Scoring in Mind.” In Handbook of Automated Scoring, 29–48. Chapman and Hall/CRC.

Dirk, Judith, Gesa Katharina Kratzsch, John P. Prindle, Ulf Kroehne, Frank Goldhammer, and Florian Schmiedek. 2017. “Paper-Based Assessment of the Effects of Aging on Response Time: A Diffusion Model Analysis.” Journal of Intelligence 5 (2): 12. https://doi.org/10.3390/jintelligence5020012.

Downing, Steven M., and Thomas M. Haladyna, eds. 2006. Handbook of Test Development. Mahwah, N.J: L. Erlbaum.

Ebel, Robert L. 1953. “The Use of Item Response Time Measurements in the Construction of Educational Achievement Tests.” Educational and Psychological Measurement 13 (3): 391–401. https://doi.org/10.1177/001316445301300303.

Embretson, Susan, and Steven P. Reise. 2013. Item Response Theory. Psychology Press.

Embretson, Susan, and Xiangdong Yang. 2006. “Automatic Item Generation and Cognitive Psychology.” In Handbook of Statistics, 26:747–68. Elsevier. https://doi.org/10.1016/S0169-7161(06)26023-1.

Feskens, Remco, Jean-Paul Fox, and Robert Zwitser. 2019. “Differential Item Functioning in PISA Due to Mode Effects.” In Theoretical and Practical Advances in Computer-Based Educational Measurement, edited by Bernard P. Veldkamp and Cor Sluijter, 231–47. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-18480-3_12.

Fink, Aron, Sebastian Born, Christian Spoden, and Andreas Frey. 2018. “A Continuous Calibration Strategy for Computerized Adaptive Testing.” Psychological Test and Assessment Modeling 60 (3): 327–46.

Finn, Bridgid, and Janet Metcalfe. 2010. “Scaffolding Feedback to Maximize Long-Term Error Correction.” Memory & Cognition 38 (7): 951–61. https://doi.org/10.3758/MC.38.7.951.

Fishbein, Bethany, Michael O. Martin, Ina V. S. Mullis, and Pierre Foy. 2018. “The TIMSS 2019 Item Equivalence Study: Examining Mode Effects for Computer-Based Assessment and Implications for Measuring Trends.” Large-Scale Assessments in Education 6 (1): 11. https://doi.org/10.1186/s40536-018-0064-z.

Frey, Andreas, Johannes Hartig, and André A. Rupp. 2009. “An NCME Instructional Module on Booklet Designs in Large-Scale Assessments of Student Achievement: Theory and Practice.” Educational Measurement: Issues and Practice 28 (3): 39–53. https://doi.org/10.1111/j.1745-3992.2009.00154.x.

Frey, Andreas, Christian Spoden, Frank Goldhammer, and S. Franziska C. Wenzel. 2018. “Response Time-Based Treatment of Omitted Responses in Computer-Based Testing.” Behaviormetrika 45 (2): 505–26. https://doi.org/10.1007/s41237-018-0073-9.

Frey, Bruce B., Vicki L. Schmitt, and Justin P. Allen. 2012. “Defining Authentic Classroom Assessment.” https://doi.org/10.7275/SXBS-0829.

Gabadinho, Alexis, Gilbert Ritschard, Nicolas S. Müller, and Matthias Studer. 2011. “Analyzing and Visualizing State Sequences in R with TraMineR.” Journal of Statistical Software 40 (4). https://doi.org/10.18637/jss.v040.i04.

Gandrud, Christopher. 2020. Reproducible Research with R and RStudio. Third edition. The R Series. Boca Raton, FL: CRC Press.

George, A. C., and A. Robitzsch. 2015. “Cognitive Diagnosis Models in R: A Didactic.” The Quantitative Methods for Psychology 11 (3): 189–205. https://doi.org/10.20982/tqmp.11.3.p189.

Gierl, Mark J., and Hollis Lai. 2013. “Instructional Topics in Educational Measurement (ITEMS) Module: Using Automated Processes to Generate Test Items.” Educational Measurement: Issues and Practice 32 (3): 36–50. https://doi.org/10.1111/emip.12018.

Gierl, Mark J., Hollis Lai, and Simon R. Turner. 2012. “Using Automatic Item Generation to Create Multiple-Choice Test Items: Automatic Generation of Test Items.” Medical Education 46 (8): 757–65. https://doi.org/10.1111/j.1365-2923.2012.04289.x.

Gobert, Janice D., Michael Sao Pedro, Juelaila Raziuddin, and Ryan S. Baker. 2013. “From Log Files to Assessment Metrics: Measuring Students’ Science Inquiry Skills Using Educational Data Mining.” Journal of the Learning Sciences 22 (4): 521–63. https://doi.org/10.1080/10508406.2013.837391.

Goldhammer, Frank. 2015. “Measuring Ability, Speed, or Both? Challenges, Psychometric Solutions, and What Can Be Gained From Experimental Control.” Measurement: Interdisciplinary Research and Perspectives 13 (3-4): 133–64. https://doi.org/10.1080/15366367.2015.1100020.

Goldhammer, Frank, Carolin Hahnel, and Ulf Kroehne. 2020. “Analysing Log File Data from PIAAC.” In Large-Scale Cognitive Assessment: Analyzing PIAAC Data, edited by Debora B. Maehler and Beatrice Rammstedt. Cham: Springer.

Goldhammer, Frank, Carolin Hahnel, Ulf Kroehne, and Fabian Zehner. 2021. “From Byproduct to Design Factor: On Validating the Interpretation of Process Indicators Based on Log Data.” Large-Scale Assessments in Education 9 (1): 20. https://doi.org/10.1186/s40536-021-00113-5.

Goldhammer, Frank, and Ulf Kroehne. 2020. “Computerbasiertes Assessment.” In Testtheorie und Fragebogenkonstruktion, edited by Helfried Moosbrugger and Augustin Kelava, 119–41. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-61532-4_6.

Goldhammer, Frank, Thomas Martens, and Oliver Lüdtke. 2017. “Conditioning Factors of Test-Taking Engagement in PIAAC: An Exploratory IRT Modelling Approach Considering Person and Item Characteristics.” Large-Scale Assessments in Education 5 (1): 18. https://doi.org/10.1186/s40536-017-0051-9.

Goldhammer, Frank, Johannes Naumann, Annette Stelter, Krisztina Tóth, Heiko Rölke, and Eckhard Klieme. 2014. “The Time on Task Effect in Reading and Problem Solving Is Moderated by Task Difficulty and Skill: Insights from a Computer-Based Large-Scale Assessment.” Journal of Educational Psychology 106 (3): 608–26. https://doi.org/10.1037/a0034716.

Goldhammer, Frank, and Fabian Zehner. 2017. “What to Make Of and How to Interpret Process Data.” Measurement: Interdisciplinary Research and Perspectives 15 (3-4): 128–32. https://doi.org/10.1080/15366367.2017.1411651.

Gong, Tao, Yang Jiang, Luis E. Saldivia, and Christopher Agard. 2022. “Using Sankey Diagrams to Visualize Drag and Drop Action Sequences in Technology-Enhanced Items.” Behavior Research Methods 54 (1): 117–32. https://doi.org/10.3758/s13428-021-01615-4.

Gorgun, Guher, and Okan Bulut. 2021. “A Polytomous Scoring Approach to Handle Not-Reached Items in Low-Stakes Assessments.” Educational and Psychological Measurement 81 (5): 847–71. https://doi.org/10.1177/0013164421991211.

Greiff, Samuel, Christoph Niepel, Ronny Scherer, and Romain Martin. 2016. “Understanding Students’ Performance in a Computer-Based Assessment of Complex Problem Solving: An Analysis of Behavioral Data from Computer-Generated Log Files.” Computers in Human Behavior 61 (August): 36–46. https://doi.org/10.1016/j.chb.2016.02.095.

Hahnel, Carolin, Beate Eichmann, and Frank Goldhammer. 2020. “Evaluation of Online Information in University Students: Development and Scaling of the Screening Instrument EVON.” Frontiers in Psychology 11 (December): 562128. https://doi.org/10.3389/fpsyg.2020.562128.

Hahnel, Carolin, Alexander J. Jung, and Frank Goldhammer. 2023. “Theory Matters: An Example of Deriving Process Indicators From Log Data to Assess Decision-Making Processes in Web Search Tasks.” European Journal of Psychological Assessment 39 (4): 271–79. https://doi.org/10.1027/1015-5759/a000776.

Hahnel, Carolin, Ulf Kroehne, Frank Goldhammer, Cornelia Schoor, Nina Mahlow, and Cordula Artelt. 2019. “Validating Process Variables of Sourcing in an Assessment of Multiple Document Comprehension.” British Journal of Educational Psychology, April, bjep.12278. https://doi.org/10.1111/bjep.12278.

Hahnel, Carolin, Dara Ramalingam, Ulf Kroehne, and Frank Goldhammer. 2022. “Patterns of Reading Behaviour in Digital Hypertext Environments.” Journal of Computer Assisted Learning, July, jcal.12709. https://doi.org/10.1111/jcal.12709.

Haigh, Matt. 2010. “Why Use Computer-Based Assessment in Education? A Literature Review,” no. 10: 8.

Hao, Jiangang, Zhan Shu, and Alina von Davier. 2015. “Analyzing Process Data from Game/Scenario-Based Tasks: An Edit Distance Approach.” JEDM-Journal of Educational Data Mining 7 (1): 33–50.

Harsch, Claudia, and Johannes Hartig. 2016. “Comparing C-tests and Yes/No Vocabulary Size Tests as Predictors of Receptive Language Skills.” Language Testing 33 (4): 555–75. https://doi.org/10.1177/0265532215594642.

Haverkamp, Ymkje E., Ivar Bråten, Natalia Latini, and Ladislao Salmerón. 2022. “Is It the Size, the Movement, or Both? Investigating Effects of Screen Size and Text Movement on Processing, Understanding, and Motivation When Students Read Informational Text,” 20.

Hethey, Jonathan M. 2013. GitLab Repository Management: Delve into Managing Your Projects with GitLab, While Tailoring It to Fit Your Environment. Community Experience Distilled. Birmingham: Packt Publ.

Hetter, Rebecca D., and J. Bradford Sympson. 1997. “Item Exposure Control in CAT-ASVAB.” In Computerized Adaptive Testing: From Inquiry to Operation, edited by William A. Sands, Brian K. Waters, and James R. McBride, 141–44. Washington: American Psychological Association. https://doi.org/10.1037/10244-014.

Heyne, Nora, Cordula Artelt, Timo Gnambs, Karin Gehrer, and Cornelia Schoor. 2020. “Instructed Highlighting of Text Passages: Indicator of Reading or Strategic Performance?” Lingua 236 (March): 102803. https://doi.org/10.1016/j.lingua.2020.102803.

Hornke, Lutz F. 2005. “Response Time in Computer-Aided Testing: A ‘Verbal Memory’ Test for Routes and Maps,” 14.

Ihme, Jan Marten, Martin Senkbeil, Frank Goldhammer, and Julia Gerick. 2017. “Assessment of Computer and Information Literacy in ICILS 2013: Do Different Item Types Measure the Same Construct?” European Educational Research Journal 16 (6): 716–32. https://doi.org/10.1177/1474904117696095.

International Test Commission and Association of Test Publishers. 2022. Guidelines for Technology-Based Assessment.

Jaeger, Judith. 2018. “Digit Symbol Substitution Test: The Case for Sensitivity Over Specificity in Neuropsychological Testing.” Journal of Clinical Psychopharmacology 38 (5): 513–19. https://doi.org/10.1097/JCP.0000000000000941.

Jiang, Yang, Tao Gong, Luis E. Saldivia, Gabrielle Cayton-Hodges, and Christopher Agard. 2021. “Using Process Data to Understand Problem-Solving Strategies and Processes for Drag-and-Drop Items in a Large-Scale Mathematics Assessment.” Large-Scale Assessments in Education 9 (1): 2. https://doi.org/10.1186/s40536-021-00095-4.

Jiao, Hong, Dandan Liao, and Peida Zhan. 2019. “Utilizing Process Data for Cognitive Diagnosis.” In Handbook of Diagnostic Classification Models, edited by Matthias von Davier and Young-Sun Lee, 421–36. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-05584-4_20.

Jung, Stefanie, Juergen Heller, Korbinian Moeller, and Elise Klein. 2019. “Mode Effect: An Issue of Perspective? Writing Mode Differences in a Spelling Assessment in German Children with and Without Developmental Dyslexia,” 38.

Jurecka, Astrid. 2008. “Introduction to the Computer-Based Assessment of Competencies.” Assessment of Competencies in Educational Contexts, 193–214.

Kirsch, Irwin, and Mary Louise Lennon. 2017. “PIAAC: A New Design for a New Era.” Large-Scale Assessments in Education 5 (1): 11. https://doi.org/10.1186/s40536-017-0046-6.

Koehler, C., S. Pohl, and C. H. Carstensen. 2014. “Taking the Missing Propensity Into Account When Estimating Competence Scores: Evaluation of Item Response Theory Models for Nonignorable Omissions.” Educational and Psychological Measurement, December. https://doi.org/10.1177/0013164414561785.

Kreuter, Frauke, ed. 2013. Improving Surveys with Paradata: Analytic Uses of Process Information. Wiley Series in Survey Methodology. Hoboken, New Jersey: Wiley & Sons.

Kroehne, Ulf. In preparation. “Standardization of Log Data from Computer-Based Assessments.”

Kroehne, Ulf, Sarah Buerger, Carolin Hahnel, and Frank Goldhammer. 2019. “Construct Equivalence of PISA Reading Comprehension Measured With Paper-Based and Computer-Based Assessments.” Educational Measurement: Issues and Practice, July, emip.12280. https://doi.org/10.1111/emip.12280.

Kroehne, Ulf, Tobias Deribo, and Frank Goldhammer. 2020. “Rapid Guessing Rates Across Administration Mode and Test Setting.” Psychological Test and Assessment Modeling 62 (2): 147–77.

Kroehne, Ulf, Timo Gnambs, and Frank Goldhammer. 2019. “Disentangling Setting and Mode Effects for Online Competence Assessment.” In Education as a Lifelong Process, 171–93. Edition ZfE. Wiesbaden: Springer VS. https://doi.org/10.1007/978-3-658-23162-0_10.

Kroehne, Ulf, and Frank Goldhammer. In press. “Tools for Analyzing Log-File Data.” In.

———. 2018. “How to Conceptualize, Represent, and Analyze Log Data from Technology-Based Assessments? A Generic Framework and an Application to Questionnaire Items.” Behaviormetrika. https://doi.org/10.1007/s41237-018-0063-y.

Kroehne, Ulf, Frank Goldhammer, and Ivailo Partchev. 2014. “Constrained Multidimensional Adaptive Testing Without Intermixing Items from Different Dimensions.” Psychological Test and Assessment Modeling 56 (4): 348.

Kroehne, Ulf, Carolin Hahnel, and Frank Goldhammer. 2019. “Invariance of the Response Processes Between Gender and Modes in an Assessment of Reading.” Frontiers in Applied Mathematics and Statistics 5: 2. https://doi.org/10.3389/fams.2019.00002.

Kroehne, Ulf, and Thomas Martens. 2011. “Computer-Based Competence Tests in the National Educational Panel Study: The Challenge of Mode Effects.” Zeitschrift Für Erziehungswissenschaft 14 (S2): 169–86. https://doi.org/10.1007/s11618-011-0185-4.

Kyllonen, Patrick, and Jiyun Zu. 2016. “Use of Response Time for Measuring Cognitive Ability.” Journal of Intelligence 4 (4): 14. https://doi.org/10.3390/jintelligence4040014.

Lahza, Hatim, Tammy G. Smith, and Hassan Khosravi. 2022. “Beyond Item Analysis: Connecting Student Behaviour and Performance Using e-Assessment Logs.” British Journal of Educational Technology, October, bjet.13270. https://doi.org/10.1111/bjet.13270.

Lane, Suzanne, Mark R. Raymond, and Thomas M. Haladyna, eds. 2015. Handbook of Test Development. Second edition. New York: Routledge.

Lesyuk, Andriy. 2013. Mastering Redmine. Birmingham, UK: Packt Publishing.

van der Linden, Wim J., and Qi Diao. 2011. “Automated Test-Form Generation.” Journal of Educational Measurement 48 (2): 206–22.

Lu, Jing, Chun Wang, Jiwei Zhang, and Jian Tao. 2019. “A Mixture Model for Responses and Response Times with a Higher-Order Ability Structure to Detect Rapid Guessing Behaviour.” British Journal of Mathematical and Statistical Psychology, August, bmsp.12175. https://doi.org/10.1111/bmsp.12175.

Magis, David, and Juan Ramon Barrada. 2017. “Computerized Adaptive Testing with R: Recent Updates of the Package catR.” Journal of Statistical Software 76 (Code Snippet 1). https://doi.org/10.18637/jss.v076.c01.

Magis, David, and Gilles Raîche. 2012. “Random Generation of Response Patterns Under Computerized Adaptive Testing with the R Package catR.” Journal of Statistical Software 48 (8): 1–31. https://doi.org/10.18637/jss.v048.i08.

Mason, Mike. 2006. Pragmatic Version Control Using Subversion. 2nd ed. Pragmatic Starter Kit, v. 1. Raleigh, N.C: Pragmatic Bookshelf.

Mayerl, Jochen. 2013. “Response Latency Measurement in Surveys. Detecting Strong Attitudes and Response Effects.” Survey Methods: Insights from the Field (SMIF).

Mills, Craig N., ed. 2002. Computer-Based Testing: Building the Foundation for Future Assessments. Mahwah, N.J: L. Erlbaum Associates.

Naglieri, Jack A., Fritz Drasgow, Mark Schmit, Len Handler, Aurelio Prifitera, Amy Margolis, and Roberto Velasquez. 2004. “Psychological Testing on the Internet: New Problems, Old Issues.” American Psychologist 59 (3): 150.

Naumann, Johannes. 2015. “A Model of Online Reading Engagement: Linking Engagement, Navigation, and Performance in Digital Reading.” Computers in Human Behavior 53 (December): 263–77. https://doi.org/10.1016/j.chb.2015.06.051.

Naumann, Johannes, and Frank Goldhammer. 2017. “Time-on-Task Effects in Digital Reading Are Non-Linear and Moderated by Persons’ Skills and Tasks’ Demands.” Learning and Individual Differences 53: 1–16.

Naumann, Johannes, and Christine Sälzer. 2017. “Digital Reading Proficiency in German 15-Year Olds: Evidence from PISA 2012.” Zeitschrift Für Erziehungswissenschaft 20 (4): 585–603. https://doi.org/10.1007/s11618-017-0758-y.

Neubert, Jonas C., André Kretzschmar, Sascha Wüstenberg, and Samuel Greiff. 2015. “Extending the Assessment of Complex Problem Solving to Finite State Automata: Embracing Heterogeneity.” European Journal of Psychological Assessment 31 (3): 181–94. https://doi.org/10.1027/1015-5759/a000224.

OECD. 2013. The Survey of Adult Skills: Reader’s Companion. OECD. https://doi.org/10.1787/9789264204027-en.

———. 2019. Beyond Proficiency: Using Log Files to Understand Respondent Behaviour in the Survey of Adult Skills. OECD Skills Studies. OECD. https://doi.org/10.1787/0b1414ed-en.

Parshall, Cynthia G., ed. 2002. Practical Considerations in Computer-Based Testing. New York: Springer.

Partchev, Ivailo. 2004. “A Visual Guide to Item Response Theory.” Retrieved November 9, 2004.

Pavic, Aleksandar. 2016. Redmine Cookbook: Over 80 Hands-on Recipes to Improve Your Skills in Project Management, Team Management, Process Improvement, and Redmine Administration.

Persic-Beck, Lothar, Frank Goldhammer, and Ulf Kroehne. 2022. “Disengaged Response Behavior When the Response Button Is Blocked: Evaluation of a Micro-Intervention.” Frontiers in Psychology 13 (November): 954532. https://doi.org/10.3389/fpsyg.2022.954532.

Phillips, Addison, and Mark Davis. 2009. “Tags for Identifying Languages.”

Pohl, Steffi. 2013. “Longitudinal Multistage Testing.” Journal of Educational Measurement 50 (4): 447–68.

“Question and Test Interoperability (QTI): Implementation Guide.” 2022. http://www.imsglobal.org/question/qtiv2p2/imsqti_v2p2_impl.html.

Reips, Ulf-Dietrich. 2010. “Design and Formatting in Internet-based Research.” In Advanced Methods for Conducting Online Behavioral Research, edited by S. Gosling and J. Johnson, 29–43. Washington, DC: American Psychological Association.

Reis Costa, Denise, Maria Bolsinova, Jesper Tijmstra, and Björn Andersson. 2021. “Improving the Precision of Ability Estimates Using Time-On-Task Variables: Insights From the PISA 2012 Computer-Based Assessment of Mathematics.” Frontiers in Psychology 12 (March): 579128. https://doi.org/10.3389/fpsyg.2021.579128.

Reis Costa, Denise, and Waldir Leoncio. 2019. LOGAN: Log File Analysis in International Large-Scale Assessments. Manual.

Rios, Joseph A. 2022. “Assessing the Accuracy of Parameter Estimates in the Presence of Rapid Guessing Misclassifications.” Educational and Psychological Measurement 82 (1): 122–50. https://doi.org/10.1177/00131644211003640.

Rios, Joseph A., Hongwen Guo, Liyang Mao, and Ou Lydia Liu. 2017. “Evaluating the Impact of Careless Responding on Aggregated-Scores: To Filter Unmotivated Examinees or Not?” International Journal of Testing 17 (1): 74–104. https://doi.org/10.1080/15305058.2016.1231193.

Rios, Joseph A., and James Soland. 2021. “Parameter Estimation Accuracy of the Effort-Moderated Item Response Theory Model Under Multiple Assumption Violations.” Educational and Psychological Measurement 81 (3): 569–94. https://doi.org/10.1177/0013164420949896.

———. 2022. “An Investigation of Item, Examinee, and Country Correlates of Rapid Guessing in PISA.” International Journal of Testing, February, 1–31. https://doi.org/10.1080/15305058.2022.2036161.

Robitzsch, Alexander, Thomas Kiefer, and Margaret Wu. 2022. TAM: Test Analysis Modules. Manual.

Robitzsch, Alexander, and Oliver Lüdtke. 2022. “Some Thoughts on Analytical Choices in the Scaling Model for Test Scores in International Large-Scale Assessment Studies.” Measurement Instruments for the Social Sciences 4 (1): 9. https://doi.org/10.1186/s42409-022-00039-w.

Robitzsch, Alexander, Oliver Lüdtke, Frank Goldhammer, Ulf Kroehne, and Olaf Köller. 2020. “Reanalysis of the German PISA Data: A Comparison of Different Approaches for Trend Estimation with a Particular Emphasis on Mode Effects.” Frontiers in Psychology 11: 884. https://doi.org/10.3389/fpsyg.2020.00884.

Rölke, Heiko. 2012. “The ItemBuilder: A Graphical Authoring System for Complex Item Development.” In World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, 2012:344–53.

Rose, Norman, Matthias von Davier, and Benjamin Nagengast. 2017. “Modeling Omitted and Not-Reached Items in IRT Models.” Psychometrika 82 (3): 795–819. https://doi.org/10.1007/s11336-016-9544-7.

Sahin, Füsun, and Kimberly F. Colvin. 2020. “Enhancing Response Time Thresholds with Response Behaviors for Detecting Disengaged Examinees.” Large-Scale Assessments in Education 8 (1): 5. https://doi.org/10.1186/s40536-020-00082-1.

Scalise, Kathleen, and Diane D. Allen. 2015. “Use of Open-Source Software for Adaptive Measurement: Concerto as an R-based Computer Adaptive Development and Delivery Platform.” British Journal of Mathematical and Statistical Psychology 68 (3): 478–96. https://doi.org/10.1111/bmsp.12057.

Scalise, Kathleen, and Bernard Gifford. 2006. “Computer-Based Assessment in E-learning: A Framework for Constructing ‘Intermediate Constraint’ Questions and Tasks for Technology Platforms.” The Journal of Technology, Learning and Assessment 4 (6).

Schmiedek, Florian, Ulf Kroehne, Frank Goldhammer, John J. Prindle, Ulman Lindenberger, Johanna Klinger-König, Hans J. Grabe, et al. 2022. “General Cognitive Ability Assessment in the German National Cohort (NAKO): The Block-Adaptive Number Series Task.” The World Journal of Biological Psychiatry, February, 1–12. https://doi.org/10.1080/15622975.2021.2011407.

Schnipke, Deborah L., and David J. Scrams. 1997. “Modeling Item Response Times With a Two-State Mixture Model: A New Method of Measuring Speededness.” Journal of Educational Measurement 34 (3): 213–32.

Schnitzler, Maya, R. Baumann, I. Barkow, and Heiko Rölke. 2013. “Chapter 5: Development of the Cognitive Items.” In.

Shin, Hyo Jeong, Paul A. Jewsbury, and Peter W. van Rijn. 2022. “Generating Group-Level Scores Under Response Accuracy-Time Conditional Dependence.” Large-Scale Assessments in Education 10 (1): 4. https://doi.org/10.1186/s40536-022-00122-y.

Shute, Valerie J. 2008. “Focus on Formative Feedback.” Review of Educational Research 78 (1): 153–89. https://doi.org/10.3102/0034654307313795.

Shute, Valerie J., Lubin Wang, Samuel Greiff, Weinan Zhao, and Gregory Moore. 2016. “Measuring Problem Solving Skills via Stealth Assessment in an Engaging Video Game.” Computers in Human Behavior 63 (October): 106–17. https://doi.org/10.1016/j.chb.2016.05.047.

Sideridis, Georgios, and Maisa Alahmadi. 2022. “Estimation of Person Ability Under Rapid and Effortful Responding.” Journal of Intelligence 10 (3): 67. https://doi.org/10.3390/jintelligence10030067.

Sireci, Stephen G., and April L. Zenisky. 2015. “Innovative Item Formats in Computer-Based Testing: In Pursuit of Improved Construct Representation.” In Handbook of Test Development, 313–34. Routledge.

Slepkov, Aaron D., and Alan T. K. Godfrey. 2019. “Partial Credit in Answer-Until-Correct Multiple-Choice Tests Deployed in a Classroom Setting.” Applied Measurement in Education 32 (2): 138–50. https://doi.org/10.1080/08957347.2019.1577249.

Soland, James, Megan Kuhfeld, and Joseph Rios. 2021. “Comparing Different Response Time Threshold Setting Methods to Detect Low Effort on a Large-Scale Assessment.” Large-Scale Assessments in Education 9 (1): 8. https://doi.org/10.1186/s40536-021-00100-w.

Stemmann, Jennifer. 2016. “Technische Problemlösekompetenz Im Alltag - Theoretische Entwicklung Und Empirische Prüfung Des Kompetenzkonstruktes Problemlösen Im Umgang Mit Technischen Geräten.” PhD thesis.

Stodden, Victoria, Friedrich Leisch, and Roger D. Peng, eds. 2014. Implementing Reproducible Research. The R Series. Boca Raton: CRC Press, Taylor & Francis Group.

Striewe, Michael, and Matthias Kramer. 2018. “Empirische Untersuchungen von Lückentext-Items Zur Beherrschung Der Syntax Einer Programmiersprache.” Commentarii Informaticae Didacticae, no. 12: 101–15.

Tang, Xueying, Susu Zhang, Zhi Wang, Jingchen Liu, and Zhiliang Ying. 2021. “ProcData: An R Package for Process Data Analysis.” Psychometrika 86 (4): 1058–83. https://doi.org/10.1007/s11336-021-09798-7.

Tomasik, Martin J., Stéphanie Berger, and Urs Moser. 2018. “On the Development of a Computer-Based Tool for Formative Student Assessment: Epistemological, Methodological, and Practical Issues.” Frontiers in Psychology 9 (November): 2245. https://doi.org/10.3389/fpsyg.2018.02245.

Toplak, Maggie E., Richard F. West, and Keith E. Stanovich. 2014. “Assessing Miserly Information Processing: An Expansion of the Cognitive Reflection Test.” Thinking & Reasoning 20 (2): 147–68. https://doi.org/10.1080/13546783.2013.844729.

Tóth, Krisztina, Heiko Rölke, Samuel Greiff, and Sascha Wüstenberg. 2014. “Discovering Students’ Complex Problem Solving Strategies in Educational Assessment.” In Proceedings of the 7th International Conference on Educational Data Mining, 225–28. International Educational Data Mining Society.

Tsitoara, Mariot. 2020. Beginning Git and GitHub: A Comprehensive Guide to Version Control, Project Management, and Teamwork for the New Developer. Berkeley, CA: Apress. https://doi.org/10.1007/978-1-4842-5313-7.

Ulitzsch, Esther, Qiwei He, and Steffi Pohl. 2022. “Using Sequence Mining Techniques for Understanding Incorrect Behavioral Patterns on Interactive Tasks.” Journal of Educational and Behavioral Statistics 47 (1): 3–35. https://doi.org/10.3102/10769986211010467.

Ulitzsch, Esther, Christiane Penk, Matthias von Davier, and Steffi Pohl. 2021. “Model Meets Reality: Validating a New Behavioral Measure for Test-Taking Effort.” Educational Assessment 26 (2): 104–24. https://doi.org/10.1080/10627197.2020.1858786.

Ulitzsch, Esther, Steffi Pohl, Lale Khorramdel, Ulf Kroehne, and Matthias von Davier. 2021. “A Response-Time-Based Latent Response Mixture Model for Identifying and Modeling Careless and Insufficient Effort Responding in Survey Data.” Psychometrika, December. https://doi.org/10.1007/s11336-021-09817-7.

Ulitzsch, Esther, Matthias von Davier, and Steffi Pohl. 2020. “A Multiprocess Item Response Model for Not-Reached Items Due to Time Limits and Quitting.” Educational and Psychological Measurement 80 (3): 522–47. https://doi.org/10.1177/0013164419878241.

van der Kleij, Fabienne M., Theo J. H. M. Eggen, Caroline F. Timmers, and Bernard P. Veldkamp. 2012. “Effects of Feedback in a Computer-Based Assessment for Learning.” Computers & Education 58 (1): 263–72. https://doi.org/10.1016/j.compedu.2011.07.020.

van der Linden, Wim J. 2007. “A Hierarchical Framework for Modeling Speed and Accuracy on Test Items.” Psychometrika 72 (3): 287–308. https://doi.org/10.1007/s11336-006-1478-z.

van der Linden, Wim J., and Cees A. W. Glas. 2000. Computerized Adaptive Testing: Theory and Practice. Springer.

van der Linden, Wim J., R. H. Klein Entink, and J.-P. Fox. 2010. “IRT Parameter Estimation With Response Times as Collateral Information.” Applied Psychological Measurement 34 (5): 327–47. https://doi.org/10.1177/0146621609349800.

van der Linden, Wim J. 2006. “Model-Based Innovations in Computer-Based Testing.” In Computer-Based Testing and the Internet: Issues and Advances, edited by Dave Bartram and Ronald K. Hambleton, 39–58. Chichester: Wiley.

Veldkamp, Bernard P., and Cor Sluijter, eds. 2019. Theoretical and Practical Advances in Computer-based Educational Measurement. Methodology of Educational Measurement and Assessment. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-18480-3.

von Davier, Matthias. 2018. “Automated Item Generation with Recurrent Neural Networks.” Psychometrika 83 (4): 847–57. https://doi.org/10.1007/s11336-018-9608-y.

Weiss, D. J. 1982. “Improving Measurement Quality and Efficiency with Adaptive Testing.” Applied Psychological Measurement 6 (4): 473–92. https://doi.org/10.1177/014662168200600408.

Wiliam, Dylan. 2011. “What Is Assessment for Learning?” Studies in Educational Evaluation 37 (1): 3–14. https://doi.org/10.1016/j.stueduc.2011.03.001.

Williamson, David M., Robert J. Mislevy, and Isaac I. Bejar, eds. 2006. Automated Scoring of Complex Tasks in Computer-Based Testing. Mahwah, N.J.: Lawrence Erlbaum Associates.

Wise, Steven L. 2017. “Rapid-Guessing Behavior: Its Identification, Interpretation, and Implications.” Educational Measurement: Issues and Practice 36 (4): 52–61. https://doi.org/10.1111/emip.12165.

———. 2019. “An Information-Based Approach to Identifying Rapid-Guessing Thresholds.” Applied Measurement in Education 32 (4): 325–36. https://doi.org/10.1080/08957347.2019.1660350.

Wise, Steven L., and Christine E. DeMars. 2006. “An Application of Item Response Time: The Effort-Moderated IRT Model.” Journal of Educational Measurement 43 (1): 19–38.

Wise, Steven L., Megan R. Kuhfeld, and James Soland. 2019. “The Effects of Effort Monitoring With Proctor Notification on Test-Taking Engagement, Test Performance, and Validity.” Applied Measurement in Education 32 (2): 183–92. https://doi.org/10.1080/08957347.2019.1577248.

Wools, Saskia, Mark Molenaar, and Dorien Hopster-den Otter. 2019. “The Validity of Technology Enhanced Assessments and Opportunities.” In Theoretical and Practical Advances in Computer-based Educational Measurement, edited by Bernard P. Veldkamp and Cor Sluijter, 3–19. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-18480-3_1.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton: CRC Press/Taylor & Francis.

Yan, Duanli, André A. Rupp, and Peter W. Foltz, eds. 2020. Handbook of Automated Scoring: Theory into Practice. CRC Press/Taylor & Francis Group.

Yousfi, Safir, and Hendryk F. Böhme. 2012. “Principles and Procedures of Considering Item Sequence Effects in the Development of Calibrated Item Pools: Conceptual Analysis and Empirical Illustration.” Psychological Test and Assessment Modeling 54: 366–96.

Zehner, Fabian, Frank Goldhammer, Emily Lubaway, and Christine Sälzer. 2018. “Unattended Consequences: How Text Responses Alter Alongside PISA’s Mode Change from 2012 to 2015.” Education Inquiry, October, 1–22. https://doi.org/10.1080/20004508.2018.1518080.

Zehner, Fabian, Christine Sälzer, and Frank Goldhammer. 2016. “Automatic Coding of Short Text Responses via Clustering in Educational Assessment.” Educational and Psychological Measurement 76 (2): 280–303. https://doi.org/10.1177/0013164415590022.

Zheng, Y., and H.-H. Chang. 2015. “On-the-Fly Assembled Multistage Adaptive Testing.” Applied Psychological Measurement 39 (2): 104–18. https://doi.org/10.1177/0146621614544519.

Zumbo, Bruno D., and Anita M. Hubley, eds. 2017. Understanding and Investigating Response Processes in Validation Research. Vol. 69. Social Indicators Research Series. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-56129-5.

B Useful Tables