All Psychometric Models Are Wrong

The British statistician George Box is credited with the quote, “All models are wrong but some are useful.” As psychometricians, it is important that we never forget this perspective. We cannot be so haughty as to think that our psychometric models actually represent the true underlying phenomena and that any data that do not fit nicely are just noise. We need to remember that everything we do is an approximation, and respect the balance between parsimony and parameterization.

Really… all psychometric models are wrong?

Yeah, there is no TRUE model that perfectly describes the interaction between an examinee and a test item. Obviously, the probability of a correct response depends primarily on important factors such as examinee ability, item difficulty, item quality, the presence of guessing, and the scoring function of the item. There are also additional factors, such as student motivation, timing, lighting in the room, screen size, whether they broke up with their girlfriend/boyfriend the previous day, whether their mom made their favorite breakfast that morning… you get the picture. Attempting to model all those factors is certainly overparameterization.

Wikipedia has a lengthier quote from Box on that aspect:

Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.

Most, if not all, psychometricians would agree that my earlier description of overparameterization is valid. The controversy in the field of psychometrics is over which of those “important factors” I mentioned qualify as overparameterization. The Rasch model famously boils down the interaction to a single item parameter (difficulty) and a single person parameter (ability). Many psychometricians consider this to be underparameterization since, for example, items are known to differ widely in their quality (discrimination). The Rasch cohort would consider the 2- and 3-parameter item response theory (IRT) models to be overparameterization, especially since they necessitated the development of new parameter estimation algorithms in the 1970s. There are some practitioners in each camp who would claim that the other is the “mark of mediocrity.”
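To make the parameter-count debate concrete, here is a minimal sketch of the three logistic response functions as they are commonly written in textbooks. The function name irt_probability and the symbols theta, b, a, and c are my own labels for illustration, and the example values are made up; the scaling constant some texts include is left out for simplicity.

```python
import math


def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: examinee ability (person parameter)
    b:     item difficulty
    a:     item discrimination; leaving it at 1.0 gives the Rasch form
    c:     lower asymptote (pseudo-guessing); leaving it at 0.0 gives
           the Rasch or 2-parameter form
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical examinee and item, chosen only for illustration.
theta, b = 0.5, 0.0

rasch = irt_probability(theta, b)                    # difficulty only
two_pl = irt_probability(theta, b, a=2.0)            # adds discrimination
three_pl = irt_probability(theta, b, a=2.0, c=0.2)   # adds guessing

print(f"Rasch: {rasch:.3f}  2PL: {two_pl:.3f}  3PL: {three_pl:.3f}")
```

The point is that the simpler models are special cases of the more complex one: fix c at 0 and you have the 2-parameter model, fix a at 1 as well and you have the Rasch form. The argument between the camps is about whether the extra parameters earn their keep, not about the shape of the curve.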

Sooo… How do I select a psychometric model?

Well, try to be cognizant of that tradeoff, which is one of several tradeoffs when selecting an IRT model. There is no single right answer; it is more a matter of whether your data fit a model and whether the model satisfies your requirements for a particular situation. That is, whether it is truly useful, which is Box’s original point. But don’t forget that all the models are wrong!


Nathan Thompson, PhD

Nathan Thompson earned his PhD in Psychometrics from the University of Minnesota, with a focus on computerized adaptive testing. His undergraduate degree was from Luther College with a triple major of Mathematics, Psychology, and Latin. He is primarily interested in the use of AI and software automation to augment and replace the work done by psychometricians, which has provided extensive experience in software design and programming. Dr. Thompson has published over 100 journal articles and conference presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/ .