Publication:
A Comparison of Anchor-Item Designs for the Concurrent Calibration of Large Banks of Likert-Type Items

Publication Date
2010
Publisher
SAGE
Abstract
Current interest in measuring quality of life has prompted the construction of computerized adaptive tests (CATs) with Likert-type items. Calibrating an item bank for use in CAT requires collecting responses to a large number of candidate items, usually too many to administer to every subject in the calibration sample. The concurrent anchor-item design solves this problem by splitting the items into separate subtests that share some common items, administering each subtest to a different sample, and then running the estimation algorithms once on the aggregated data array, from which a substantial number of responses are structurally missing. Although anchor-item designs are widely used, the consequences of several configuration decisions on the accuracy of parameter estimates have never been studied in the polytomous case. The present study addresses this question by simulation, comparing the outcomes of several alternative configurations of the anchor-item design. The factors defining the variants are (a) subtest size, (b) the balance of common and unique items per subtest, (c) the characteristics of the common items, and (d) the criterion for distributing unique items across subtests. The results indicate that maximizing accuracy in item parameter recovery requires subtests with the largest possible number of items and the smallest possible number of common items; the characteristics of the common items and the criterion for distributing unique items do not affect accuracy.
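The design described in the abstract can be illustrated with a small sketch. All sizes below (bank size, number of anchors, subtests, and respondents per sample) are hypothetical values chosen for illustration, not the study's conditions; the responses are random placeholders, not draws from a graded response model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items = 30        # total bank size (hypothetical)
n_common = 6        # common (anchor) items shared by every subtest
n_subtests = 3
n_per_sample = 50   # respondents per subtest sample (hypothetical)

# Split the bank: the first n_common items serve as anchors;
# the remaining unique items are distributed evenly across subtests.
common = np.arange(n_common)
unique_blocks = np.array_split(np.arange(n_common, n_items), n_subtests)

# Each subtest = the anchors plus its own block of unique items.
subtests = [np.concatenate([common, block]) for block in unique_blocks]

# Aggregated data array: rows = all respondents, columns = all items.
# Items a respondent was never shown remain missing (NaN).
data = np.full((n_subtests * n_per_sample, n_items), np.nan)
for s, items in enumerate(subtests):
    rows = slice(s * n_per_sample, (s + 1) * n_per_sample)
    # Placeholder 5-category Likert responses (1..5).
    data[rows, items] = rng.integers(1, 6, size=(n_per_sample, len(items)))

# Anchor items are answered by every sample; each unique item only by one.
assert not np.isnan(data[:, common]).any()
```

Estimation algorithms are then run once on `data`, treating the NaN entries as missing by design; the anchor columns, observed in every sample, are what place all item parameters on a common scale.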