Item analysis refers to the process of statistically analyzing assessment data to evaluate the quality and performance of your test items. This is an important step in the test development cycle, not only because it helps improve the quality of your test, but because it provides documentation for validity: evidence that your test performs well and score interpretations mean what you intend.
The Goals of Item Analysis
Item analysis boils down to two goals:
- Find the items that are not performing well (difficulty and discrimination, usually)
- Figure out WHY those items are not performing well
There are different ways to evaluate performance, such as whether the item is too difficult/easy, too confusing (not discriminating), miskeyed, or perhaps even biased against a minority group.
Moreover, there are two completely different paradigms for this analysis: classical test theory (CTT) and item response theory (IRT). On top of that, the analyses can differ based on whether the item is dichotomous (right/wrong) or polytomous (2 or more points).
Because of the possible variations, item analysis is a complex topic, and that doesn’t even get into the evaluation of test performance. In this post, we’ll cover some of the basics for each theory, at the item level.
Implementing Item Analysis
To implement item analysis, you should utilize dedicated software designed for this purpose. If you utilize an online assessment platform, it will provide you output like you see below because it will have such dedicated software already integrated (if not, it isn’t a real assessment platform).
In some cases, you might utilize standalone software. CITAS provides a simple spreadsheet-based approach to help you learn the basics. Iteman and Xcalibre are two specially-designed software programs from ASC for this purpose, one for classical test theory and one for item response theory.
Item Analysis with Classical Test Theory
Classical Test Theory provides a simple and intuitive approach to item analysis. It utilizes nothing more complicated than proportions, averages, counts, and correlations. For this reason, it is useful for small-scale exams or use with groups that do not have psychometric expertise.
CTT quantifies item difficulty for dichotomous items as the proportion (P value) of examinees that correctly answer it. If P = 0.95, that means the item is very easy. If P = 0.35, the item is very difficult. Note that because the scale is inverted (lower value means higher difficulty), this is sometimes referred to as item facility.
For polytomous items, we evaluate the mean score. If the item is an essay that is scored 0 to 5 points, is the average score 1.9 (difficult) or 4.1 (easy)?
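As a minimal sketch of the two difficulty statistics above, the snippet below computes a P value for a dichotomous item and a mean score for a 0 to 5 essay item. The response lists and the function name `item_difficulty` are made up for illustration; they are not output from any particular software.

```python
from statistics import mean

def item_difficulty(scores):
    """Mean item score across examinees.

    For 0/1 scored items this is the classical P value (proportion correct);
    for polytomous items it is the mean item score.
    """
    return mean(scores)

# Hypothetical data: one entry per examinee.
dichotomous = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]   # 7 of 10 answered correctly
essay = [2, 1, 3, 0, 2, 4, 1, 2, 3, 1]         # essay scored 0 to 5 points

p_value = item_difficulty(dichotomous)   # proportion correct
essay_mean = item_difficulty(essay)      # average points earned

print(f"P value: {p_value:.2f}")
print(f"Essay mean: {essay_mean:.2f}")
```

The same function serves both item types because a P value is simply the mean of 0/1 scores.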
In psychometrics, discrimination is a POSITIVE. The entire point of an exam is to discriminate amongst examinees; smart students should get a high score and not-so-smart students should get a low score. If everyone gets the same score, there is no discrimination and no point in the exam! Item discrimination evaluates this concept.
CTT uses the point-biserial item-total correlation (Rpbis) as its primary statistic for this. It correlates scores on the item with the total score on the test. If the item is strong and measures the topic well, then examinees who get the item right tend to score higher on the test, and as a rule of thumb the correlation will be 0.20 or higher. If it is around 0.0, then the item is just a random data generator and worthless on the exam.
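The item-total correlation can be sketched in a few lines. The point-biserial is just a Pearson correlation where one variable is scored 0/1, so the example below uses a hand-written `pearson` helper. The response matrix is hypothetical, and this is the simple uncorrected version in which the item is included in the total score; some programs remove the item from the total before correlating.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def item_total_correlations(responses):
    """responses: one row per examinee, each row a list of 0/1 item scores."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [
        pearson([row[j] for row in responses], totals)
        for j in range(n_items)
    ]

# Hypothetical 6 examinees x 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]

rpbis = item_total_correlations(responses)
for j, r in enumerate(rpbis, start=1):
    print(f"Item {j}: Rpbis = {r:.2f}")
```

With so few examinees the values are noisy; the point is only to show the mechanics.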
Key and Distractor Analysis
In the case of many item types, it pays to evaluate the answers. A distractor is an incorrect option. We want to make sure that no distractor is selected by more examinees than the key (whose proportion is the P value), and also that no distractor has higher discrimination than the key. The latter would mean that smart students are selecting the wrong answer, and not-so-smart students are selecting what is supposedly correct. In some cases, the item is just bad. In others, the answer is just incorrectly recorded, perhaps by a typo. We call this a miskey of the item. In both cases, we want to flag the item and then dig into the distractor statistics to figure out what is wrong.
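The two checks described above (no distractor more popular than the key, no distractor with higher discrimination than the key) can be sketched as follows. The function names `option_stats` and `flag_item` and the toy data are hypothetical, and this is not how Iteman or any specific program implements its flags.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def option_stats(choices, totals, options="ABCD"):
    """Proportion selecting each option and its correlation with total score."""
    n = len(choices)
    stats = {}
    for opt in options:
        picked = [1 if c == opt else 0 for c in choices]
        stats[opt] = (sum(picked) / n, pearson(picked, totals))
    return stats

def flag_item(stats, key):
    """Return (option, reason) pairs for distractors that outperform the key."""
    key_p, key_r = stats[key]
    flags = []
    for opt, (p, r) in stats.items():
        if opt == key:
            continue
        if p > key_p:
            flags.append((opt, "selected more often than the key"))
        if r > key_r:  # comparisons with NaN are False, so unused options pass
            flags.append((opt, "higher Rpbis than the key"))
    return flags

# Hypothetical item: high scorers choose A, low scorers choose B or C.
choices = ["A", "A", "A", "B", "B", "C"]
totals = [9, 8, 7, 3, 2, 4]   # total test score for each examinee

stats = option_stats(choices, totals)
flags_correct_key = flag_item(stats, key="A")   # key recorded correctly
flags_miskey = flag_item(stats, key="B")        # key recorded as B by mistake

print(flags_correct_key)
print(flags_miskey)
```

When the key is recorded as B, option A is both more popular and more discriminating than the supposed key, which is the typical signature of a miskey.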
Below is an example output for one item from our Iteman software, which you can download for free. You might also be interested in this video. This is a very well-performing item. Here are some key takeaways.
- This is a 4-option multiple choice item in Domain 2 of this particular test
- The ID of the item is Item18
- This item was seen by 300 examinees
- 82% of students answered it correctly, so it was fairly easy, but not too easy
- The Rpbis was 0.51 which is extremely high; the item is good quality
- The line for the correct answer in the quantile plot has a clear positive slope, which reflects the high discrimination quality
- The proportion of examinees selecting the wrong answers was nicely distributed, not too high, and with negative Rpbis values. This means the distractors are sufficiently incorrect and not confusing.
- The Mean for students selecting the correct answer was much higher (
Item Analysis with Item Response Theory
Item Response Theory (IRT) is a very sophisticated paradigm of item analysis and tackles numerous psychometric tasks. It requires much larger sample sizes than CTT (100-1000 responses per item) and extensive expertise (typically a PhD psychometrician). It isn’t suitable for small-scale exams like classroom quizzes.
However, it is used by virtually every “real” exam you will take in your life, from K-12 benchmark exams to university admissions to professional certifications.
If you haven’t used IRT, I recommend you check out this blog post first.
IRT evaluates item difficulty for dichotomous items as a b-parameter, which is sort of like a z-score for the item on the bell curve: 0.0 is average, 2.0 is hard, and -2.0 is easy. (This can differ somewhat with the Rasch approach, which rescales everything.) In the case of polytomous items, there is a b-parameter for each threshold, or step between points.
IRT evaluates item discrimination by the slope of its item response function, which is called the a-parameter. Often, values above 0.80 are good and below 0.80 are less effective.
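To make the a- and b-parameters concrete, here is a minimal sketch of a logistic item response function for a dichotomous item. It assumes the common logistic form with an optional lower asymptote `c` (a guessing parameter, which the text above does not discuss) and no scaling constant; some programs multiply the exponent by 1.7. The function name `irf` and the parameter values are illustrative only.

```python
from math import exp

def irf(theta, a, b, c=0.0):
    """Probability of a correct response at ability theta.

    a: discrimination (slope), b: difficulty (location),
    c: lower asymptote; c=0.0 gives the two-parameter logistic model.
    """
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))

# Illustrative items: same difficulty, different discrimination.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    strong = irf(theta, a=1.2, b=0.0)
    weak = irf(theta, a=0.4, b=0.0)
    print(f"theta={theta:+.1f}  a=1.2: {strong:.2f}  a=0.4: {weak:.2f}")
```

At theta equal to b the probability is 0.5 when c is 0, and a larger a-parameter makes the curve rise more steeply around that point, which is what a higher discrimination means here.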
Key and Distractor Analysis
In the case of polytomous items, the multiple b-parameters provide an evaluation of the different answers. For dichotomous items, the IRT modeling does not distinguish amongst incorrect answers. Therefore, we utilize the CTT approach for distractor analysis. This remains extremely important for diagnosing issues in multiple choice items.
Here is an example of what output from an IRT analysis program (Xcalibre) looks like. You might also be interested in this video.
- Here, we have a polytomous item, such as an essay scored from 0 to 3 points.
- It is calibrated with the generalized partial credit model.
- It has strong classical discrimination (0.62)
- It has poor IRT discrimination (0.466)
- The average raw score was 2.314 out of 3.0, so fairly easy
- There was a sufficient distribution of responses over the four point levels
- The boundary parameters are not in sequence; this item should be reviewed
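For readers curious what "boundary parameters not in sequence" means, here is a rough sketch of the generalized partial credit model for a 0 to 3 point item. The a-parameter of 0.466 comes from the example above, but the boundary values are invented for illustration because the actual values from the output are not reproduced in this post. The function names are hypothetical.

```python
from math import exp

def gpcm_probs(theta, a, boundaries):
    """Category probabilities under the generalized partial credit model.

    boundaries: one b-parameter per step between adjacent score points,
    so three boundaries give four categories (0, 1, 2, 3).
    """
    # Cumulative sums of a * (theta - b_v); category 0 has an empty sum.
    logits = [0.0]
    for b in boundaries:
        logits.append(logits[-1] + a * (theta - b))
    m = max(logits)                       # subtract max for numerical stability
    weights = [exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def boundaries_in_sequence(boundaries):
    """True if each boundary is higher than the one before it."""
    return all(x < y for x, y in zip(boundaries, boundaries[1:]))

a = 0.466                                # IRT discrimination from the example
hypothetical_b = [-0.5, -1.2, 0.3]       # invented values, out of order

probs = gpcm_probs(theta=0.0, a=a, boundaries=hypothetical_b)
print([round(p, 3) for p in probs])
print("In sequence:", boundaries_in_sequence(hypothetical_b))
```

When a boundary is lower than the one before it, the score point between them is never the most likely response at any ability level, which is one reason such an item gets flagged for review.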