Item response theory (IRT) represents an important innovation in the field of psychometrics. While now roughly 50 years old – assuming the “birth” is the classic Lord and Novick (1968) text – it is still underutilized and remains a mystery to many practitioners. So what is item response theory, and why was it invented?
Classical test theory (CTT) is approximately 100 years old, and remains commonly used because it is appropriate for certain situations and simple enough to be used by many people without formal training in psychometrics. Most of its statistics are limited to means, proportions, and correlations. However, that simplicity means it lacks the sophistication to deal with a number of very important measurement problems. Here are just a few.
- Sample dependency: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation (that is, two samples of different ability levels can be easily converted onto the same scale)
- Test dependency: Classical statistics are tied to a specific test form, and do not deal well with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing
- Weak linking/equating: CTT has a number of methods for linking multiple forms, but they are weak compared to IRT
- Measuring the range of students: Classical tests are built for the average student, and do not measure high- or low-ability students very well; likewise, classical statistics for very difficult or very easy items are suspect
- Lack of accounting for guessing: CTT does not account for guessing on multiple choice exams
- Scoring: Scoring in classical test theory does not take into account item difficulty
- Adaptive testing: CTT does not support adaptive testing in most cases
The Foundation of Item Response Theory
The foundation of IRT is a mathematical model defined by item parameters. For dichotomous items (those scored correct/incorrect), each item has three parameters:
a: the discrimination parameter, an index of how well the item differentiates low-ability from high-ability examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.
b: the difficulty parameter, an index of the examinee level for which the item is appropriate; typically ranges from -3 to +3, with 0 being an average examinee level.
c: the pseudoguessing parameter, which is a lower asymptote; typically close to 1/k, where k is the number of options.
These parameters are used to graphically display an item response function (IRF). An example IRF is on the right. Here, the a parameter is approximately 1.0, indicating a fairly discriminating item. The b parameter is approximately -0.6 (the point on the x-axis where the midpoint of the curve is), indicating an easy item; an examinee at -0.6, somewhat below average, would have about a 60% chance of answering correctly, which is halfway between the lower asymptote and 100%. The c parameter is approximately 0.20, though the lower asymptote is off the left edge of the graph.
What does this mean conceptually? We are trying to model the interaction of an examinee with the item, hence the name item response theory. Consider the x-axis to be z-scores on a standard normal scale. Examinees with higher ability are much more likely to respond correctly. Someone at +2.0 (about the 98th percentile) has about a 94% chance of getting the item correct. Meanwhile, someone at -2.0 has only about a 37% chance.
Building with the Basic Building Block
The IRF is used for several purposes. Here are a few.
- Interpreting and improving item performance
- Scoring examinees with maximum likelihood or Bayesian methods
- Form assembly, including linear on the fly testing (LOFT)
- Calculating the accuracy of examinee scores
- Development of computerized adaptive tests (CAT)
- Data forensics to detect cheating or other irregularities
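As an illustration of the scoring use case in the list above, here is a rough Python sketch of maximum likelihood scoring. It searches a grid of ability values for the one that makes the observed response pattern most likely. The item parameters and responses are made up for the example, the grid search stands in for the iterative methods real software typically uses, and the logistic metric without the 1.7 constant is assumed.

```python
import math

def irf_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric, no 1.7 constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at a given ability."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = irf_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def mle_theta(items, responses, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum likelihood estimate of ability on [lo, hi]."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical (a, b, c) parameters for five items, easiest to hardest
items = [(1.0, -1.0, 0.2), (1.2, -0.5, 0.2), (0.8, 0.0, 0.2),
         (1.1, 0.5, 0.2), (0.9, 1.0, 0.2)]
responses = [1, 1, 1, 0, 0]  # hypothetical examinee: first three correct

theta_hat = mle_theta(items, responses)
print(f"Estimated ability: {theta_hat:.2f}")
```

Because the likelihood weighs each item by its parameters, two examinees with the same number correct can receive different scores, which is the contrast with classical scoring noted earlier.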
In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form. The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF). The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately. The CSEM is the inverse of the square root of the TIF, and has the advantage of being directly usable for confidence intervals; a person’s score plus or minus 1.96 times the CSEM is a 95% confidence interval for their score. The graph on the right shows part of the form assembly process in our FastTest platform.
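The relationship between item information, the TIF, and the CSEM can be sketched in a few lines of Python. This is an illustration only: the item pool is hypothetical, the function names are our own, and the information formula is the standard 3PL one on the logistic metric without the 1.7 constant.

```python
import math

def irf_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric, no 1.7 constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information for the 3PL model at ability theta."""
    p = irf_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def tif(theta, items):
    """Test information function: the sum of the item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

def csem(theta, items):
    """Conditional standard error of measurement: 1 / sqrt(TIF)."""
    return 1.0 / math.sqrt(tif(theta, items))

# Hypothetical (a, b, c) parameters for a short form
items = [(1.0, -1.0, 0.2), (1.2, -0.5, 0.2), (0.8, 0.0, 0.2),
         (1.1, 0.5, 0.2), (0.9, 1.0, 0.2)]

theta_hat = 0.5  # hypothetical examinee score
se = csem(theta_hat, items)
print(f"TIF  = {tif(theta_hat, items):.2f}")
print(f"CSEM = {se:.2f}")
print(f"95% CI: {theta_hat - 1.96 * se:.2f} to {theta_hat + 1.96 * se:.2f}")
```

With only five items the interval will be wide; adding items that are informative near a given ability raises the TIF there and shrinks the CSEM.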
One Big Happy Family
IRT is actually a family of models, making flexible use of the parameters. In some cases, only two parameters (a and b) or one parameter (b) are used, depending on the type of assessment and the fit of the data. If there are multipoint items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters.
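One way to see the family relationship for dichotomous items is that the simpler models fall out of the three-parameter function by fixing parameters. The Python sketch below assumes the logistic metric without the 1.7 constant and treats the one-parameter case as a = 1 and c = 0; the parameter values are purely illustrative.

```python
import math

def irf(theta, b, a=1.0, c=0.0):
    """Dichotomous IRF: defaults give the 1PL, set a for 2PL, a and c for 3PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

theta, b = 0.5, 0.0  # illustrative ability and difficulty

p_1pl = irf(theta, b)                # difficulty only
p_2pl = irf(theta, b, a=1.4)         # adds discrimination
p_3pl = irf(theta, b, a=1.4, c=0.2)  # adds a lower asymptote for guessing

print(f"1PL: {p_1pl:.3f}  2PL: {p_2pl:.3f}  3PL: {p_3pl:.3f}")
```

Polytomous models for rating scales or partial credit items need a different functional form and are not covered by this sketch.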
Where can I learn more?
For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment. If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software, Xcalibre.