Item analysis refers to the process of statistically analyzing assessment data to evaluate the quality and performance of your test items. This is an extremely important step in the test development cycle, not only because it helps improve the quality of your test, but because it provides documentation for validity: evidence that your test performs well and score interpretations mean what you intend.

The Goals of Item Analysis

Item analysis boils down to two goals:

  1. Find the items that are not performing well (difficulty and discrimination, usually)
  2. Figure out WHY those items are not performing well

There are different ways to evaluate performance, such as whether the item is too difficult/easy, too confusing (not discriminating), mis-keyed, or perhaps even biased against a minority group. Moreover, there are two completely different paradigms for this analysis: classical test theory (CTT) and item response theory (IRT). On top of that, the analyses can differ based on whether the item is dichotomous (right/wrong) or polytomous (partial credit, with two or more possible points). Because of these possible variations, item analysis is actually a very deep and complex topic. And that doesn’t even get into evaluation of test performance. In this post, we’ll cover some of the basics for each theory, at the item level.

Implementing Item Analysis

To implement item analysis, you should use dedicated software designed for this purpose. If you use an online assessment platform, it will provide output like you see below, because such software is already integrated (if not, it isn’t a real assessment platform). In some cases, you might use standalone software. CITAS provides a simple spreadsheet-based approach to help you learn the basics.

Classical Test Theory

Classical Test Theory provides a very simple and intuitive approach to item analysis. It utilizes nothing more complicated than proportions, averages, counts, and correlations. For this reason, it is useful for small-scale exams, or for use by groups that do not have psychometric expertise.

Item Difficulty

CTT quantifies item difficulty for dichotomous items as the proportion (P value) of examinees that correctly answer it. If P = 0.95, that means the item is very easy. If P = 0.35, the item is very difficult. Note that because the scale is inverted (lower value means higher difficulty), this is sometimes referred to as item facility.

For polytomous items, we evaluate the mean score. If the item is an essay that is scored 0 to 5 points, is the average score 1.9 (difficult) or 4.1 (easy)?
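These two difficulty statistics can be computed directly from a scored data matrix. Here is a minimal sketch in Python with NumPy, using a small hypothetical dataset (the response values are made up for illustration):

```python
import numpy as np

# Hypothetical scored responses: rows = examinees, columns = items.
# Dichotomous items are scored 0/1.
dichotomous = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 0],
])

# Classical item difficulty (P value): proportion answering correctly.
p_values = dichotomous.mean(axis=0)
print(p_values)  # [0.8 0.4 0.8]

# For a polytomous item (e.g., an essay scored 0-5), difficulty is
# simply the mean score.
essay_scores = np.array([2, 4, 1, 3, 5])
print(essay_scores.mean())  # 3.0
```

With real data, the same two lines of arithmetic scale up to any number of examinees and items.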

Item Discrimination

In psychometrics, discrimination is a GOOD thing. The entire point of an exam is to discriminate amongst examinees; smart students should get a high score and not-so-smart students should get a low score. If everyone gets the same score, there is no discrimination, and no point in the exam! Item discrimination evaluates this concept.

CTT uses the point-biserial item-total correlation (Rpbis) as its primary statistic for this. It correlates scores on the item to the total score on the test. If the item is strong, and it measures the topic well, then examinees who get the item right will tend to score higher on the test. This will mean the correlation will be 0.20 or higher. If it is around 0.0, that means the item is just a random data generator, and worthless on the exam.
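The Rpbis is just a Pearson correlation between the 0/1 item column and the total score. A common refinement, sketched below with hypothetical data, is the "corrected" item-total correlation, which removes the item itself from the total so the item does not correlate with its own contribution:

```python
import numpy as np

# Hypothetical 0/1 scored matrix: rows = examinees, columns = items.
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])

total = scores.sum(axis=1)

# Rpbis for each item: correlate the item column with the total score,
# after subtracting the item out of the total (the "corrected" version).
for j in range(scores.shape[1]):
    rest = total - scores[:, j]          # total score minus this item
    r = np.corrcoef(scores[:, j], rest)[0, 1]
    print(f"Item {j + 1}: corrected Rpbis = {r:.2f}")
```

Items with corrected values near or below 0.0 would be flagged for review under the guideline above.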

Key and Distractor Analysis

In the case of many item types, it pays to evaluate the answer options. A distractor is an incorrect option. We want to make sure that no distractor is selected by more examinees than the key (its P value), and that no distractor has a higher discrimination than the key. The latter would mean that smart students are selecting a wrong answer, and not-so-smart students are selecting what is supposedly correct. In some cases, the item is just bad. In others, the answer is simply recorded incorrectly, perhaps by a typo. We call this a miskey of the item. In both cases, we want to flag the item and then dig into the distractor statistics to figure out what is wrong.
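A basic distractor analysis computes, for each option, the proportion choosing it and its correlation with the total score. A minimal sketch, using made-up responses and totals for a single four-option item keyed B:

```python
import numpy as np

# Hypothetical raw responses (A-D) for one item, plus total test scores.
responses = np.array(list("BBABCBDBAB"))
totals = np.array([18, 22, 9, 25, 12, 20, 8, 24, 11, 19])
key = "B"

for option in "ABCD":
    chosen = (responses == option).astype(float)
    prop = chosen.mean()                 # P value for this option
    # Option-total correlation: should be positive for the key and
    # negative (or near zero) for distractors. A distractor with a
    # higher correlation than the key suggests a miskey.
    r = np.corrcoef(chosen, totals)[0, 1]
    flag = "key" if option == key else "distractor"
    print(f"{option} ({flag}): P = {prop:.2f}, Rpbis = {r:+.2f}")
```

If a distractor ever beats the key on either statistic, the item goes on the review list.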


Below is an example output for one item from our Iteman software, which you can download for free. You might also be interested in this video. Here, we see that 82% of students answered this item correctly, with very high Rpbis. This is a very well performing item.


Item Response Theory

Item Response Theory (IRT) is a much more sophisticated paradigm, used not only for item analysis but for many other psychometric tasks. It requires much larger sample sizes than CTT, as well as extensive expertise, so it isn’t relevant for small-scale exams like classroom quizzes. However, it is used by virtually every “real” exam you will take in your life, from K-12 benchmark exams to university admissions to professional certifications.

If you haven’t used IRT, I recommend you check out this blog post first.

Item Difficulty

IRT evaluates item difficulty for dichotomous items as a b-parameter, which is sort of like a z-score for the item on the bell curve: 0.0 is average, 2.0 is hard, and -2.0 is easy. (This can differ somewhat with the Rasch approach, which rescales everything.) In the case of polytomous items, there is a b-parameter for each threshold, or step between points.

Item Discrimination

IRT evaluates item discrimination by the slope of its item response function, which is called the a-parameter. As a rough rule of thumb, values above 0.80 are good, and values below that are less effective.
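The a- and b-parameters come together in the item response function. A minimal sketch of the two-parameter logistic (2PL) model is below; note this version omits the D = 1.7 scaling constant that some programs include, and the parameter values are made up for illustration:

```python
import numpy as np

def irt_2pl(theta, a, b):
    """Two-parameter logistic item response function.

    theta: examinee ability (on a z-score-like metric)
    a: discrimination (slope), b: difficulty (location)
    Returns the probability of a correct response.
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# An average-difficulty item (b = 0) with discrimination a = 1.0,
# evaluated for low, average, and high ability examinees:
theta = np.array([-2.0, 0.0, 2.0])
print(irt_2pl(theta, a=1.0, b=0.0))  # approximately [0.12 0.5 0.88]
```

An examinee whose ability equals the item difficulty (theta = b) always has a 50% chance of a correct response; the a-parameter controls how quickly that probability rises on either side.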

Key and Distractor Analysis

In the case of polytomous items, the multiple b-parameters provide an evaluation of the different answers. For dichotomous items, the IRT model does not distinguish among the incorrect answers, so we utilize the CTT approach for distractor analysis. This remains extremely important for diagnosing issues in multiple-choice items.


Here is an example of what output from an IRT analysis program (Xcalibre) looks like. You might also be interested in this video. Here, we have a polytomous item, utilizing the generalized partial credit model. It has a strong classical discrimination (0.62) but poor IRT discrimination (0.466).


Nathan Thompson, PhD

Nathan Thompson earned his PhD in Psychometrics from the University of Minnesota, with a focus on computerized adaptive testing. His undergraduate degree was from Luther College, with a triple major of Mathematics, Psychology, and Latin. He is primarily interested in the use of AI and software automation to augment and replace the work done by psychometricians, which has provided extensive experience in software design and programming. Dr. Thompson has published over 100 journal articles and conference presentations.
