# Item response theory (IRT): An Introduction

Item response theory (IRT) is a family of mathematical models in the field of psychometrics, which are used to design, analyze, and score exams.  It is a very powerful psychometric paradigm that allows researchers to build stronger assessments.  This post will provide an introduction to the theory, discuss benefits, and explain how item response theory is used.

IRT represents an important innovation in the field of psychometrics. While now roughly 50 years old, assuming the “birth” is the classic Lord and Novick (1968) text, it is still underutilized and remains a mystery to many practitioners.  So what is item response theory, and why was it invented?  For starters, IRT is complex and requires larger sample sizes than classical methods, so it is rarely used for small-scale exams, but most large-scale exams use it.

Item response theory is more than just a way of analyzing exam data; it is a paradigm that drives the entire lifecycle of designing, building, delivering, scoring, and analyzing assessments.  It is much more complex than its predecessor, classical test theory, but is also far more powerful.  IRT requires quite a bit of expertise, as well as specially designed software.  Our software Xcalibre provides a user-friendly and visual platform for implementing IRT.

## The Driver: Problems with Classical Test Theory

Classical test theory (CTT) is approximately 100 years old, and remains commonly used because it is appropriate for certain situations, and it is simple enough that it can be used by many people without formal training in psychometrics.  Most of its statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  Here are just a few.

• Sample dependency: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation (that is, two samples of different ability levels can be easily converted onto the same scale)
• Test dependency: Classical statistics are tied to a specific test form, and do not deal well with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing
• Weak linking/equating: CTT has a number of methods for linking multiple forms, but they are weak compared to IRT
• Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect
• Vertical scaling: CTT cannot do vertical scaling
• Lack of accounting for guessing: CTT does not account for guessing on multiple choice exams
• Scoring: Scoring in classical test theory does not take into account item difficulty.
• Adaptive testing: CTT does not support adaptive testing in most cases.
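
For reference, the classical statistics mentioned above (proportions and correlations) can be sketched in a few lines. This is only an illustration: the function names and the tiny response matrix are hypothetical, and the point-biserial here is the uncorrected version, where the item is included in the total score. Both statistics would change if a stronger or weaker sample took the same items, which is the sample dependency described in the list.

```python
from statistics import mean, pstdev


def item_p_value(item_scores):
    """Classical difficulty: proportion of examinees answering correctly."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total score."""
    mx, my = mean(item_scores), mean(total_scores)
    sx, sy = pstdev(item_scores), pstdev(total_scores)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when a score has no variance
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    return cov / (sx * sy)


# Hypothetical data: rows are examinees, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]

for j in range(len(responses[0])):
    column = [row[j] for row in responses]
    print(j, round(item_p_value(column), 2), round(point_biserial(column, totals), 2))
```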

## So what is Item Response Theory?

It is a family of mathematical models that try to describe how examinees respond to items (hence the name).  These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves.  However, item response theory ended up doing so much more – namely, addressing the problems above.

IRT is model-driven, in that there is a specific mathematical equation that is assumed.  There are different parameters that shape this equation to different needs.  That’s what defines different IRT models.

IRT used to be known as latent trait theory and item characteristic curve theory.

## The Foundation of Item Response Theory

The foundation of IRT is a mathematical model defined by item parameters.  For dichotomous items (those scored correct/incorrect), each item has three parameters:

a: the discrimination parameter, an index of how well the item differentiates low-ability from high-ability examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.

b: the difficulty parameter, an index of the examinee level for which the item is appropriate; typically ranges from -3 to +3, with 0 being an average examinee level.

c: the pseudoguessing parameter, which is a lower asymptote; typically close to 1/k, where k is the number of options.

These parameters are used to graphically display an item response function (IRF), which models the probability of a correct answer as a function of ability.  An example IRF is below.  Here, the a parameter is approximately 1.0, indicating a fairly discriminating test item.  The b parameter is approximately 0.0 (the point on the x-axis where the midpoint of the curve is), indicating an average-difficulty item; examinees of average ability would have a 60% chance of answering correctly.  The c parameter is approximately 0.20, like a 5-option multiple choice item.

What does this mean conceptually?  We are trying to model the interaction of an examinee responding to an item, hence the name item response theory.  Consider the x-axis to be z-scores on a standard normal scale.  Examinees with higher ability are much more likely to respond correctly.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only a 37% chance.
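
As a minimal sketch, the curve described above can be written as the three-parameter logistic (3PL) function. The function name `irf_3pl` and the scaling constant `D` are illustrative assumptions rather than anything this post specifies; some software sets `D` to 1.7 to approximate the normal ogive, while others use 1.0. Because the example item's parameters are only approximate, probabilities read off a graph may differ somewhat from what this function prints.

```python
import math


def irf_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: pseudoguessing (lower asymptote); D: assumed scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Parameters roughly matching the example item in the text.
a, b, c = 1.0, 0.0, 0.20

for theta in (-2.0, 0.0, 2.0):
    print(theta, round(irf_3pl(theta, a, b, c), 2))
```

At theta equal to b, the probability is halfway between c and 1, which is where the 60% figure for an average examinee comes from when c is 0.20.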

Of course, the parameters can and should differ from item to item, reflecting differences in item performance.  The following graph shows five IRFs.  The dark blue line is the easiest item, with a b of -2.00.  The light blue item is the hardest, with a b of +1.80.  The purple one has a c=0.00 while the light blue has c=0.25, indicating that it is susceptible to guessing.

These IRFs are not just a pretty graph or a way to describe how an item performs.  They are the basic building block to accomplishing those important goals mentioned earlier.  That comes next…

## Applications of IRT to Improve Assessment

Item response theory uses the IRF for several purposes.  Here are a few.

1. Interpreting and improving item performance
2. Scoring examinees with maximum likelihood or Bayesian methods
3. Form assembly, including linear on the fly testing (LOFT) and pre-equating
4. Calculating the accuracy of examinee scores
5. Development of computerized adaptive tests (CAT)
6. Post-equating
7. Differential item functioning (finding bias)
8. Data forensics to find cheaters or other issues.
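
To make item 2 concrete, here is a hedged sketch of maximum likelihood scoring using a simple grid search over ability. The item parameters, the function names, and the bounds of -4 to +4 are all hypothetical choices for illustration; production software typically uses Newton-Raphson or a similar optimizer rather than a grid.

```python
import math


def irf_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    ll = 0.0
    for (a, b, c), u in zip(items, responses):
        p = irf_3pl(theta, a, b, c)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll


def mle_theta(items, responses, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum likelihood estimate of theta within [lo, hi]."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical calibrated items as (a, b, c) tuples.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.25), (1.1, 1.5, 0.2)]

print(mle_theta(items, [1, 1, 0, 0]))  # mixed pattern: estimate inside the bounds
print(mle_theta(items, [1, 1, 1, 1]))  # all correct: estimate sits at the upper bound
```

The all-correct case shows why pure maximum likelihood needs bounds or a Bayesian prior: the likelihood keeps rising with theta, so there is no finite maximum.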

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is the inverse square root of the TIF (CSEM = 1/√TIF), and has the interpretive advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the CSEM is a 95% confidence interval for their score.  The graph on the right shows part of the form assembly process in our FastTest platform.
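
The relationship between item information, the TIF, and the CSEM can be sketched as follows. This assumes the standard 3PL information formula on the logistic metric, with the CSEM computed as the reciprocal of the square root of the TIF; the item parameters and function names are hypothetical.

```python
import math


def irf_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    """Information contributed by one 3PL item at a given theta."""
    p = irf_3pl(theta, a, b, c)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * ((1.0 - p) / p)


def tif(theta, items):
    """Test information function: the sum of the item informations."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def csem(theta, items):
    """Conditional standard error of measurement: 1 / sqrt(TIF)."""
    return 1.0 / math.sqrt(tif(theta, items))


# Hypothetical form of four items as (a, b, c) tuples.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.25), (1.1, 1.5, 0.2)]

theta_hat = 0.5
se = csem(theta_hat, items)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(round(tif(theta_hat, items), 3), round(se, 3), tuple(round(x, 2) for x in ci))
```

With only four items the interval is very wide; real forms have many more items, which raises the TIF and shrinks the CSEM.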

## Advantages and Benefits of Item Response Theory

So why does this matter?  Let’s go back to the problems with classical test theory.  Why is IRT better?

• Sample-independence of scale: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation.  Two samples of different ability levels can be easily converted onto the same scale.
• Test statistics: Classical statistics are tied to a specific test form
• Sparse matrices are OK: Classical test statistics do not work with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing
• Linking/equating: Item response theory has much stronger equating, so if your exam has multiple forms, or if you deliver twice per year with a new form, you can have much greater validity in the comparability of scores.
• Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect
• Vertical scaling: IRT can do vertical scaling but CTT cannot
• Lack of accounting for guessing: CTT does not account for guessing on multiple choice exams
• Scoring: Scoring in classical test theory does not take into account item difficulty.  With IRT, you can score a student on any set of items and be sure it is on the same latent scale.
• Adaptive testing: CTT does not support adaptive testing in most cases.  Adaptive testing has its own list of benefits.
• Characterization of error: CTT assumes that every examinee has the same amount of error in their score (SEM); IRT recognizes that if the test is all middle-difficulty items, then low or high students will have inaccurate scores
• Stronger form building: IRT has functionality to build forms to be more strongly equivalent and meet the purposes of the exam
• Nonlinear function: IRT does not assume a linear student-item relationship where one is impossible.  CTT assumes a linear function (the point-biserial correlation), even though the probability of a correct response is bounded between 0 and 1 and so cannot be a linear function of ability.
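
The phrase "sample-independent within a linear transformation" in the first bullet can be illustrated with a small sketch. If ability is rescaled as theta* = A·theta + B, then rescaling b the same way and dividing a by A leaves every response probability unchanged, which is what makes linking two calibrations possible. The constants `A` and `B` and the item parameters below are hypothetical; in practice they would be estimated with a linking method.

```python
import math


def irf_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def rescale_theta(theta, A, B):
    """Move an ability value onto the new scale theta* = A * theta + B."""
    return A * theta + B


def rescale_item(a, b, c, A, B):
    """Move item parameters onto the new scale; c is unaffected."""
    return a / A, A * b + B, c


A, B = 1.2, -0.5           # hypothetical linking constants
a, b, c = 1.1, 0.3, 0.2    # hypothetical item on the original scale
theta = 0.8                # hypothetical examinee on the original scale

p_old = irf_3pl(theta, a, b, c)
p_new = irf_3pl(rescale_theta(theta, A, B), *rescale_item(a, b, c, A, B))
print(round(p_old, 6), round(p_new, 6))  # the two probabilities agree
```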

## One Big Happy Family

Remember: Item response theory is actually a family of models, making flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and fit of the data.  If there are multipoint items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters.
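
One way to see how the dichotomous models nest is to write the most general form and then fix parameters. The notation below is a sketch on the logistic metric; the upper asymptote $d_i$ is the extra parameter of the 4PL, and whether a scaling constant such as 1.7 multiplies $a_i$ varies by software, so treat that detail as an assumption.

```latex
\begin{align*}
\text{4PL:}\quad P_i(\theta) &= c_i + (d_i - c_i)\,\frac{1}{1 + \exp\left[-a_i(\theta - b_i)\right]} \\
\text{3PL:}\quad P_i(\theta) &= c_i + (1 - c_i)\,\frac{1}{1 + \exp\left[-a_i(\theta - b_i)\right]} && (d_i = 1) \\
\text{2PL:}\quad P_i(\theta) &= \frac{1}{1 + \exp\left[-a_i(\theta - b_i)\right]} && (d_i = 1,\ c_i = 0) \\
\text{1PL:}\quad P_i(\theta) &= \frac{1}{1 + \exp\left[-a(\theta - b_i)\right]} && (\text{one common } a \text{ for all items})
\end{align*}
```

The Rasch model has the same form as the 1PL with the common discrimination fixed at 1.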

Here’s a quick breakdown of the family tree, with the most common models.

• Unidimensional
  • Dichotomous
    • Rasch model
    • 1PL
    • 2PL
    • 3PL
    • 4PL (intellectual curiosity only!)
  • Polytomous
    • Rasch partial credit
    • Rasch rating scale
    • Generalized partial credit
    • Generalized rating scale
• Multidimensional
  • Compensatory
  • Non-compensatory
  • Bifactor

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software, Xcalibre.

Want to talk to one of our experts about how to apply IRT?  Get in touch!

Nathan Thompson, PhD, is CEO and Co-Founder of Assessment Systems Corporation (ASC). He is a psychometrician, software developer, author, researcher, and evangelist for AI and automation. His mission is to elevate the profession of psychometrics by using software to automate psychometric work like item review, job analysis, and Angoff studies, so we can focus on more innovative work. His core goal is to improve assessment throughout the world.

Nate was originally trained as a psychometrician, with an honors degree at Luther College with a triple major of Math/Psych/Latin, and then a PhD in Psychometrics at the University of Minnesota. He then worked multiple roles in the testing industry, including item writer, test development manager, essay test marker, consulting psychometrician, software developer, project manager, and business leader. He is also cofounder and Membership Director at the International Association for Computerized Adaptive Testing (iacat.org). He’s published 100+ papers and presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/.
