Item analysis refers to the process of statistically analyzing assessment data to evaluate the quality and performance of your test items. This is an extremely important step in the test development cycle, not only because it helps improve the quality of your test, but because it provides documentation for validity: evidence that your test performs well and score interpretations mean what you intend.
The Goals of Item Analysis
Item analysis boils down to two goals:
- Find the items that are not performing well (difficulty and discrimination, usually)
- Figure out WHY those items are not performing well
There are different ways to evaluate performance, such as whether the item is too difficult/easy, too confusing (not discriminating), mis-keyed, or perhaps even biased to a minority group. Moreover, there are two completely different paradigms for this analysis: classical test theory (CTT) and item response theory (IRT). On top of that, the analyses can differ based on whether the item is dichotomous (right/wrong) or polytomous (2 or more points). Because of the possible variations, item analysis is actually a very deep and complex topic. And that doesn’t even get into evaluation of test performance. In this post, we’ll cover some of the basics for each theory, at the item level.
Implementing Item Analysis
To implement item analysis, you should utilize dedicated software designed for this purpose. If you utilize an online assessment platform, it will provide you output like you see below because it will have such dedicated software already integrated (if not, it isn’t a real assessment platform). In some cases, you might utilize standalone software. CITAS provides a simple spreadsheet-based approach to help you learn the basics.
Classical Test Theory
Classical Test Theory provides a very simple and intuitive approach to item analysis. It utilizes nothing more complicated than proportions, averages, counts, and correlations. For this reason, it is useful for small-scale exams, or for use with groups that do not have psychometric expertise.
CTT quantifies item difficulty for dichotomous items as the proportion (P value) of examinees that correctly answer it. If P = 0.95, that means the item is very easy. If P = 0.35, the item is very difficult. Note that because the scale is inverted (lower value means higher difficulty), this is sometimes referred to as item facility.
For polytomous items, we evaluate the mean score. If the item is an essay that is scored 0 to 5 points, is the average score 1.9 (difficult) or 4.1 (easy)?
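The two classical difficulty statistics above are easy to compute directly. Here is a minimal sketch with hypothetical scored response data (the values are made up for illustration):

```python
import numpy as np

# Hypothetical scored responses: rows = examinees, columns = items.
# Dichotomous items are scored 0/1.
dichotomous = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])

# Classical difficulty (P value): proportion of examinees answering correctly
p_values = dichotomous.mean(axis=0)
print(p_values)  # [0.75 0.75 0.25]

# For a polytomous item (e.g., an essay scored 0-5), difficulty is the mean score
essay_scores = np.array([4, 2, 5, 3])
print(essay_scores.mean())  # 3.5
```

With real data, you would flag items whose P value falls outside whatever bounds fit your test's purpose (e.g., below 0.30 or above 0.95).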
In psychometrics, discrimination is a GOOD thing. The entire point of an exam is to discriminate amongst examinees; smart students should get a high score and not-so-smart students should get a low score. If everyone gets the same score, there is no discrimination, and no point in the exam! Item discrimination evaluates this concept.
CTT uses the point-biserial item-total correlation (Rpbis) as its primary statistic for this. It correlates scores on the item to the total score on the test. If the item is strong, and it measures the topic well, then examinees who get the item right will tend to score higher on the test. This will mean the correlation will be 0.20 or higher. If it is around 0.0, that means the item is just a random data generator, and worthless on the exam.
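Because the item is scored 0/1, the point-biserial is just a Pearson correlation between the item column and the total score. A minimal sketch with hypothetical data (many programs also report a "corrected" version that removes the item's own points from the total first):

```python
import numpy as np

# Hypothetical data: one dichotomous item and total test scores
item = np.array([1, 0, 1, 1, 0, 1, 0, 1])
total = np.array([28, 15, 30, 22, 12, 25, 18, 27])

# Point-biserial: Pearson correlation of the 0/1 item with the total score
rpbis = np.corrcoef(item, total)[0, 1]

# Corrected item-total correlation: remove the item's contribution to the total
rpbis_corrected = np.corrcoef(item, total - item)[0, 1]

print(round(rpbis, 3), round(rpbis_corrected, 3))
```

In this toy example the examinees who answered correctly also scored higher overall, so the correlation is strongly positive, as we would want from a healthy item.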
Key and Distractor Analysis
In the case of many item types, it pays to evaluate the answers. A distractor is an incorrect option. We want to make sure that no distractor is selected by more examinees than the key (its P value), and also that no distractor has a higher discrimination than the key. The latter would mean that smart students are selecting the wrong answer, and not-so-smart students are selecting what is supposedly correct. In some cases, the item is just bad. In others, the answer is just incorrectly recorded, perhaps by a typo. We call this a miskey of the item. In both cases, we want to flag the item and then dig into the distractor statistics to figure out what is wrong.
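A basic distractor analysis can be sketched by computing, for each option, the proportion selecting it and its correlation with total score. The responses and key below are hypothetical; a miskey would show up as a "distractor" with a higher P value or a more positive correlation than the key:

```python
import numpy as np

# Hypothetical raw responses (option chosen) and total scores; the key is "B"
choices = np.array(list("BBABCBDBBA"))
totals = np.array([30, 27, 14, 25, 12, 28, 10, 26, 29, 16])
key = "B"

results = {}
for opt in "ABCD":
    mask = choices == opt
    prop = mask.mean()  # proportion of examinees choosing this option
    # Option-total correlation: does choosing this option track ability?
    r = np.corrcoef(mask.astype(float), totals)[0, 1]
    results[opt] = (prop, r)
    label = "key" if opt == key else "distractor"
    print(f"{opt} ({label}): P = {prop:.2f}, r = {r:+.2f}")
```

Here the key has the highest P value and the only positive correlation, which is the pattern a well-behaved multiple-choice item should show.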
Below is an example output for one item from our Iteman software, which you can download for free. You might also be interested in this video. Here, we see that 82% of students answered this item correctly, with very high Rpbis. This is a very well performing item.
Item Response Theory
Item Response Theory (IRT) is a very sophisticated paradigm for item analysis and many other psychometric tasks. It requires much larger sample sizes than CTT, and extensive expertise, so it isn’t relevant for small-scale exams like classroom quizzes. However, it is used by virtually every “real” exam you will take in your life, from K-12 benchmark exams to university admissions to professional certifications.
If you haven’t used IRT, I recommend you check out this blog post first.
IRT evaluates item difficulty for dichotomous items as a b-parameter, which is sort of like a z-score for the item on the bell curve: 0.0 is average, 2.0 is hard, and -2.0 is easy. (This can differ somewhat with the Rasch approach, which rescales everything.) In the case of polytomous items, there is a b-parameter for each threshold, or step between points.
IRT evaluates item discrimination by the slope of its item response function, which is called the a-parameter. Often, values above 0.80 are good and below 0.80 are less effective.
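The a- and b-parameters above define the item response function. A minimal sketch of the two-parameter logistic (2PL) model, omitting the optional 1.7 scaling constant that some programs apply, shows how both parameters shape the probability of a correct response:

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function: the probability of a
    correct response at ability theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# An item of average difficulty (b = 0.0) and good discrimination (a = 1.2)
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(irf_2pl(theta, a=1.2, b=0.0), 3))
```

Note that at theta = b the probability is always 0.5, which is why b acts as a difficulty: harder items shift the curve to the right. A larger a makes the curve steeper, so the item separates examinees near its difficulty more sharply.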
Key and Distractor Analysis
In the case of polytomous items, the multiple b-parameters provide an evaluation of the different answers. For dichotomous items, the IRT modeling does not distinguish among the incorrect answers; it only models correct vs. incorrect. Therefore, we utilize the CTT approach for distractor analysis. This remains extremely important for diagnosing issues in multiple choice items.
Here is an example of what output from an IRT analysis program (Xcalibre) looks like. You might also be interested in this video. Here, we have a polytomous item, utilizing the generalized partial credit model. It has a strong classical discrimination (0.62) but poor IRT discrimination (0.466).
Nathan Thompson, PhD