Classical Test Theory vs. Item Response Theory: What are some key differences, and how to choose?
Classical Test Theory and Item Response Theory (CTT & IRT) are the two primary psychometric paradigms. That is, they are mathematical approaches to how tests are analyzed and scored. They differ quite substantially in substance and complexity, even though they both nominally do the same thing. So how are they different, and how can you effectively choose the right solution? Let’s discuss.
First, let’s start by defining the two. This is just a brief intro; there are entire books dedicated to the details!
Classical test theory
Classical test theory (CTT) is an approach that is based on simple mathematics: primarily averages, proportions, and correlations. It is more than 100 years old, but is still used quite often, with good reason. In addition to working with small sample sizes, it is very simple and easy to understand, which makes it useful for working directly with content experts to evaluate, diagnose, and improve items or tests.
Item response theory
Item response theory (IRT) is a much more complex approach to analyzing tests. Moreover, it is not just for analyzing; it is a complete psychometric paradigm that changes how item banks are developed, test forms are designed, tests are delivered (adaptive or linear-on-the-fly), and scores are produced. There are many benefits to this approach that justify the complexity, and there is good reason that all major examinations in the world utilize IRT.
How Classical Test Theory and Item Response Theory Differ
Test-Level and Subscore-Level Analysis
CTT statistics for total scores and subscores include coefficient alpha reliability, standard error of measurement (a function of reliability and SD), descriptive statistics (average, SD…), and roll-ups of item statistics (e.g., mean Rpbis).
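As a rough illustration, the two test-level CTT statistics named above can be computed from nothing more than variances. The sketch below assumes dichotomously scored items and a made-up response matrix (`responses` is hypothetical data, not from a real test). It uses population variances throughout, whereas some software uses sample variances, so the SEM may differ slightly from what a given program reports.

```python
from math import sqrt
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha for a list of examinee rows (one item score per column)."""
    k = len(scores[0])
    item_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def standard_error_of_measurement(scores):
    """SEM = SD of total scores * sqrt(1 - reliability)."""
    total_sd = sqrt(pvariance([sum(row) for row in scores]))
    return total_sd * sqrt(1 - coefficient_alpha(scores))


# Hypothetical data: 5 examinees (rows) by 4 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = coefficient_alpha(responses)
sem = standard_error_of_measurement(responses)
print(round(alpha, 3), round(sem, 3))
```

With this toy data, alpha works out to 0.80 and the SEM to about 0.63 raw-score points.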
With IRT, we utilize the same descriptive statistics, but the scores are now different (theta, not number-correct). The standard error of measurement is now a conditional function of theta, not a single number. The entire concept of reliability is dropped and replaced with the concept of precision, which is expressed as that same conditional function.
Item statistics for CTT include proportion-correct (difficulty), point-biserial (Rpbis) correlation (discrimination), and a distractor/answer analysis. If there is demographic information, CTT analysis can also provide a simple evaluation of differential item functioning (DIF).
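The two core CTT item statistics can be sketched in a few lines. This example assumes dichotomous scoring and uses hypothetical data. The point-biserial here is the uncorrected version, meaning the item itself is included in the total score; many programs also report a corrected version that removes it.

```python
from math import sqrt
from statistics import mean


def item_p_value(item_scores):
    """Proportion-correct (CTT difficulty) for one dichotomous item."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item and the total score."""
    mx, my = mean(item_scores), mean(total_scores)
    cov = mean([(x - mx) * (y - my) for x, y in zip(item_scores, total_scores)])
    var_x = mean([(x - mx) ** 2 for x in item_scores])
    var_y = mean([(y - my) ** 2 for y in total_scores])
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one item's scores and the total scores of 5 examinees.
item = [1, 1, 1, 1, 0]
totals = [4, 3, 2, 1, 0]

print(item_p_value(item), round(point_biserial(item, totals), 3))
```

For this toy item, the p-value is 0.8 (an easy item) and the point-biserial is about 0.71.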
IRT replaces the difficulty and discrimination with its own quantifications, called simply b and a. In addition, it can add a c parameter for guessing effects. More importantly, it creates entirely new classes of statistics for partial credit or rating scale items.
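The role of the a, b, and c parameters is easiest to see in the response function itself. Below is a minimal sketch of the three-parameter logistic (3PL) model, written without the optional 1.7 scaling constant; the parameter values in the example calls are arbitrary.

```python
from math import exp


def p_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    Setting c=0 gives the 2PL model.
    """
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


# An examinee whose theta equals the item difficulty:
print(p_3pl(0.0, a=1.0, b=0.0))         # no guessing parameter
print(p_3pl(0.0, a=1.0, b=0.0, c=0.2))  # four- or five-option item with guessing
```

When theta equals b, the probability is 0.5 without guessing and rises to halfway between c and 1 (about 0.6 here) once a guessing parameter is included.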
CTT scores tests with traditional scoring: number-correct, proportion-correct, or sum-of-points. IRT scores examinees directly on a latent scale, which psychometricians call theta.
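The contrast between the two scoring approaches can be sketched as follows. This is only an illustration under stated assumptions: a 2PL model, invented item parameters, and a brute-force grid search for the maximum likelihood estimate of theta. Operational software typically uses Newton-Raphson or Bayesian methods instead, and a pure maximum likelihood estimate does not exist for all-correct or all-incorrect response patterns.

```python
from math import exp, log


def p_2pl(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    total = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        total += log(p) if u == 1 else log(1.0 - p)
    return total


def estimate_theta(responses, items):
    """Grid-search maximum likelihood estimate on [-4, 4] in steps of 0.01."""
    grid = [i / 100 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


# Hypothetical (a, b) parameters for three items: easy, medium, hard.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]

pattern = [1, 1, 0]
number_correct = sum(pattern)                            # CTT score
theta_two_right = estimate_theta(pattern, items)         # IRT score
theta_one_right = estimate_theta([1, 0, 0], items)
print(number_correct, theta_two_right, theta_one_right)
```

The number-correct score is a count tied to this particular set of items, while theta is a location on the latent scale that the items themselves are also placed on.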
Linking and Equating
Linking and equating is a statistical analysis to determine comparable scores on different forms; e.g., Form A is “two points easier” than Form B and therefore a 72 on Form A is comparable to a 70 on Form B. CTT has several methods for this, including the Tucker and Levine methods, but there are methodological issues with these approaches. These issues, and other issues with CTT, eventually led to the development of IRT in the 1960s and 1970s.
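The arithmetic in the "two points easier" example corresponds to the simplest form of equating, usually called mean equating. The sketch below is not the Tucker or Levine method; it assumes the two forms were taken by randomly equivalent groups, and the score lists are hypothetical.

```python
from statistics import mean


def mean_equate(score_on_a, form_a_scores, form_b_scores):
    """Convert a Form A score to the Form B scale by adjusting for mean difference."""
    return score_on_a - mean(form_a_scores) + mean(form_b_scores)


# Hypothetical total scores from two randomly equivalent groups.
form_a_scores = [70, 74, 78]  # mean 74: Form A is two points easier
form_b_scores = [68, 72, 76]  # mean 72

print(mean_equate(72, form_a_scores, form_b_scores))
```

A 72 on Form A converts to a 70 on Form B, matching the example above.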
IRT has methods to accomplish linking and equating which are much more powerful than those of CTT, including anchor-item calibration or conversion methods like Stocking-Lord. There are other advantages as well.
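To give a flavor of how an IRT conversion works, here is a sketch of the mean/sigma method, a simpler relative of Stocking-Lord (which instead finds the constants by minimizing differences between test characteristic curves). It assumes a set of anchor items calibrated separately on a new form and on the base scale; all parameter values are hypothetical.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_new, b_base):
    """Slope and intercept that place the new scale onto the base scale."""
    slope = pstdev(b_base) / pstdev(b_new)
    intercept = mean(b_base) - slope * mean(b_new)
    return slope, intercept


def rescale_item(a, b, slope, intercept):
    """Transform one item's a and b parameters onto the base scale."""
    return a / slope, slope * b + intercept


# Hypothetical difficulty estimates for the same three anchor items.
b_new = [-1.0, 0.0, 1.0]    # from the new form's calibration
b_base = [-0.5, 0.7, 1.9]   # from the base scale

slope, intercept = mean_sigma_constants(b_new, b_base)
print(round(slope, 3), round(intercept, 3))
print(rescale_item(1.0, 0.5, slope, intercept))
```

The same slope and intercept are then applied to every non-anchor item and to examinee thetas from the new form, which is what puts everything on one common scale.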
One major advantage of IRT, as a corollary to the strong linking/equating, is that we can link/equate not just across multiple forms in one grade, but from grade to grade. This produces a vertical scale. A vertical scale can span multiple grades, making it much easier to track student growth, or to measure students who are off-grade in their performance (e.g., a 7th grader who is at a 5th grade level). A vertical scale is a substantial investment, but is extremely powerful for K-12 assessments.
Classical test theory can work effectively with 50 examinees, and provide useful results with as few as 20. Depending on the IRT model you select (there are many), the minimum sample size can be 100 to 1,000.
Sample- and Test-Dependence
CTT analyses are sample-dependent and test-dependent, which means that such analyses are performed on a single test form and set of students. It is possible to combine data across multiple test forms to create a sparse matrix, but this has a detrimental effect on some of the statistics (especially alpha), even if the test is of high quality, and the results will not reflect reality.
For example, if Grade 7 Math has 3 forms (beginning, middle, end of year), it is conceivable to combine them into one “super-matrix” and analyze together. The same is true if there are 3 forms given at the same time, and each student randomly receives one of the forms. In that case, 2/3 of the matrix would be empty, which psychometricians call sparse.
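The structure of such a super-matrix can be sketched as below, for the case where each student randomly receives one of 3 forms. The form names, item names, and responses are hypothetical, and the forms are assumed to share no items. Items a student never saw are stored as `None` rather than 0, since "not administered" is not the same as "incorrect".

```python
# Hypothetical forms, each with its own two items.
forms = {"A": ["q1", "q2"], "B": ["q3", "q4"], "C": ["q5", "q6"]}
all_items = [item for items in forms.values() for item in items]

# Hypothetical students: (id, form taken, scores on that form's items).
students = [("s1", "A", [1, 0]), ("s2", "B", [1, 1]), ("s3", "C", [0, 1])]


def build_super_matrix(students, forms, all_items):
    """One row per student, one column per item across all forms."""
    matrix = []
    for _, form, scores in students:
        answered = dict(zip(forms[form], scores))
        matrix.append([answered.get(item) for item in all_items])
    return matrix


matrix = build_super_matrix(students, forms, all_items)
cells = [cell for row in matrix for cell in row]
fraction_empty = cells.count(None) / len(cells)

for row in matrix:
    print(row)
print(round(fraction_empty, 3))
```

Two thirds of the cells are empty. A statistic like coefficient alpha depends on covariances between pairs of items, and here no student has responses to items from two different forms, which is why CTT results on such a matrix are distorted.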
Classical test theory will analyze the distractors of a multiple choice item. IRT models, except for the rarely-used Nominal Response Model, do not.
Item response theory has a parameter to account for guessing, though some psychometricians argue against its use. Classical test theory has no effective way to account for guessing.
There are rare cases where adaptive testing (personalized assessment) can be done with classical test theory. However, adaptive testing pretty much requires the use of item response theory for one important reason: IRT puts people and items onto the same latent scale.
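Because examinees and items share one scale, an adaptive algorithm can compare a person's current theta estimate directly with each item's parameters. The sketch below shows one common selection rule, picking the unused item with maximum information at the current theta. It assumes a 2PL model, and the item bank and its identifiers are hypothetical.

```python
from math import exp


def item_information(theta, a, b):
    """2PL item information at theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_next_item(theta, bank, administered):
    """Return the id of the unused item that is most informative at theta."""
    candidates = [item_id for item_id in bank if item_id not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))


# Hypothetical bank: item id -> (a, b).
bank = {"easy": (1.0, -1.5), "medium": (1.0, 0.0), "hard": (1.0, 1.5)}

print(select_next_item(-1.2, bank, administered=set()))
print(select_next_item(1.0, bank, administered={"hard"}))
```

A low-scoring examinee is routed to the easy item, while a stronger examinee who has already seen the hard item gets the medium one. There is no comparable way to match a number-correct score to a proportion-correct item difficulty, because those two CTT statistics are on different scales.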
Linear Test Design
Classical Test Theory and Item Response Theory differ in how test forms are designed and built. Classical test theory works best when there are lots of items of middle difficulty, as this maximizes the coefficient alpha reliability. However, there are definitely situations where the purpose of the assessment is otherwise. IRT provides stronger methods for designing such tests, and then scoring as well.
So… How to Choose?
There is no single best answer to the question of Classical Test Theory vs. Item Response Theory. You need to evaluate the aspects listed above, and in some cases other aspects (e.g., financial, or whether you have staff available with the expertise in the first place). In many cases, BOTH are necessary. This is especially true because IRT does not provide an effective and easy-to-understand distractor analysis that you can discuss with subject matter experts. It is for this reason that IRT software will typically produce CTT analysis too, though the reverse is not true.
IRT is very powerful, and can provide additional information about tests even if used just for analyzing results to evaluate item and test performance. However, IRT is really only useful if you are going to make it your psychometric paradigm, thereby using it in the list of activities above, especially IRT scoring of examinees. Otherwise, IRT analysis is merely another way of looking at test and item performance that will correlate substantially with CTT.
Nathan Thompson, PhD