The British statistician George Box is credited with the quote, “All models are wrong but some are useful.” As psychometricians, it is important that we never forget this perspective. We cannot be so haughty as to think that our psychometric models actually represent the true underlying phenomena and that any data that do not fit nicely are just noise. We need to remember that everything we do is an approximation, and respect the balance between parsimony and parameterization.
Really… all psychometric models are wrong?
Yeah, there is no TRUE model that perfectly describes the interaction between an examinee and a test item. Obviously the probability of a correct response is primarily due to important factors such as examinee ability, item difficulty, item quality, the presence of guessing, and the scoring function of the item. There are also additional factors, such as student motivation, timing factors, lighting in the room, screen size, whether they broke up with their girlfriend/boyfriend the previous day, whether their mom made their favorite breakfast that morning… you get the picture. Attempting to model all those factors is certainly overparameterization.
Wikipedia has a lengthier quote on that aspect:
Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.
Most psychometricians, if not all, would agree that my earlier description of overparameterization is valid. The controversy in the field of psychometrics is which of those “important factors” I mentioned qualify as overparameterization. The Rasch model famously boils down the interaction to a single item parameter (difficulty) and a single person parameter (ability). Many psychometricians consider this to be underparameterization since, for example, items are known to differ widely in their quality (discrimination). The Rasch cohort would consider the two- and three-parameter item response theory (IRT) models to be overparameterization, especially since they necessitated the development of new parameter estimation algorithms in the 1970s. There are some practitioners in each camp who would claim that the other is the “mark of mediocrity.”
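To make the parameter counts concrete, here is a minimal Python sketch of the logistic item response function. It is an illustration rather than anyone's production code: the function name `p_correct` and the example values are my own assumptions, and the optional scaling constant (D = 1.7) that some formulations include is left out. The point is simply that the Rasch model is what you get when discrimination is fixed at 1 and guessing at 0.

```python
import math


def p_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the three-parameter logistic model.

    theta: examinee ability
    b:     item difficulty
    a:     item discrimination (fixed at 1.0 for the Rasch model)
    c:     pseudo-guessing lower asymptote (fixed at 0.0 for Rasch and 2PL)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative values only: an examinee whose ability equals the item difficulty.
rasch = p_correct(theta=0.0, b=0.0)                    # difficulty only
two_pl = p_correct(theta=0.0, b=0.0, a=2.0)            # adds discrimination
three_pl = p_correct(theta=0.0, b=0.0, a=2.0, c=0.25)  # adds guessing

print(rasch, two_pl, three_pl)
```

When ability equals difficulty, the Rasch and two-parameter versions both give 0.5, while the guessing parameter lifts the three-parameter probability above that. Each extra parameter buys flexibility at the cost of one more thing to estimate per item, which is exactly the tradeoff the two camps argue about.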
Sooo… How do I select a psychometric model?
Well, try to be cognizant of that tradeoff, which is one of several tradeoffs when selecting an IRT model. No answer is right all the time; it is more a matter of whether your data fit a model and whether that model satisfies your requirements for a particular situation. That is, whether it is truly useful, which is Box’s original point. But don’t forget that all the models are wrong!
Nathan Thompson, PhD, is CEO and Co-Founder of Assessment Systems Corporation (ASC). He is a psychometrician, software developer, author, researcher, and evangelist for AI and automation. His mission is to elevate the profession of psychometrics by using software to automate psychometric work like item review, job analysis, and Angoff studies, so we can focus on more innovative work. His core goal is to improve assessment throughout the world.
Nate was originally trained as a psychometrician, with an honors degree at Luther College with a triple major of Math/Psych/Latin, and then a PhD in Psychometrics at the University of Minnesota. He then worked multiple roles in the testing industry, including item writer, test development manager, essay test marker, consulting psychometrician, software developer, project manager, and business leader. He is also cofounder and Membership Director at the International Association for Computerized Adaptive Testing (iacat.org). He’s published 100+ papers and presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/.