## Test Information Function from Item Response Theory

The Test Information Function is a concept from item response theory (IRT) designed to evaluate how well an assessment differentiates examinees, and at which ranges of ability. For example, we might expect an exam composed of difficult items to do a great job of differentiating top examinees, but to provide little useful measurement for the lower half of examinees, who will find nearly every item too hard.

The reverse is true of an easy test; it tells us little about top examinees. The test information function quantifies this and has a number of other important applications and interpretations.

## Test Information Function: how to calculate it

The test information function is not something you can calculate by hand. First, you need to estimate item-level IRT parameters, which define the item response function. The only way to do this is with specialized software; there are a few options on the market, but we recommend Xcalibre.

Next, the item response function is converted to an item information function for each item. The item information functions can then be summed into a test information function. Lastly, the test information function is often inverted into the conditional standard error of measurement function, which is extremely useful in test design and evaluation.
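
To make that pipeline concrete, here is a minimal sketch in Python (NumPy only) using five hypothetical items under the three-parameter (3PL) model described in the next sections; the parameter values are made up for illustration, not taken from any real calibration.

```python
import numpy as np

# Hypothetical 3PL item parameters for a 5-item test (values made up for illustration)
a = np.array([1.0, 0.9, 1.2, 0.8, 1.1])      # discrimination
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])    # difficulty, ordered easy to hard
c = np.array([0.20, 0.20, 0.20, 0.20, 0.20]) # pseudo-guessing

theta = np.linspace(-4, 4, 161)              # ability (theta) scale

# Item response functions: P_i(theta) = c_i + (1 - c_i) / (1 + exp(-a_i * (theta - b_i)))
P = c + (1 - c) / (1 + np.exp(-a * (theta[:, None] - b)))
Q = 1 - P

# 3PL item information functions, then sum them into the test information function
info = a**2 * (Q / P) * ((P - c) / (1 - c))**2
tif = info.sum(axis=1)

# Invert the TIF into the conditional standard error of measurement
csem = 1 / np.sqrt(tif)

mid = np.argmin(np.abs(theta))               # grid point closest to theta = 0
print(f"TIF at theta=0: {tif[mid]:.2f}, CSEM: {csem[mid]:.2f}")
```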

## IRT Item Parameters

Software like Xcalibre will estimate a set of item parameters. The parameters you use depend on the item types and other aspects of your assessment.

For example, let’s use the 3-parameter (3PL) model, which estimates a, b, and c, applied to a small test of 5 items. The items are ordered by difficulty: Item 1 is very easy and Item 5 is very hard.

## Item Response Function

The item response function uses the IRT equation to convert the item parameters into a curve. The item parameters fit this curve to each item, much like a regression model, describing how the item performs.
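
For reference, the standard 3PL item response function, which generates each of these curves, is:

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}$$

where θ is examinee ability and $a_i$, $b_i$, $c_i$ are the item parameters described above.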

Here are the response functions for those 5 items. Note the scale on the x-axis, similar to the bell curve, with the easy items to the left and hard ones to the right.

## Item Information Function

The item information function is based on the slope (the first derivative) of the item response function: an item provides more information about examinees in the range of ability where its IRF is steepest.
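
For the 3PL model, the item information function has a standard closed form:

$$I_i(\theta) = a_i^2 \,\frac{Q_i(\theta)}{P_i(\theta)} \left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^2, \qquad Q_i(\theta) = 1 - P_i(\theta)$$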

For example, consider Item 5: it is difficult, so it is not very useful for examinees in the bottom half of the ability range. The slope of the Item 5 IRF is nearly 0 across that entire range, which means its information function is also nearly 0 there.

## Test Information Function

The test information function then sums the item information functions to summarize where the test is providing information. If you imagine adding the graphs above, you can picture some humps near the top and bottom of the range, where the most prominent IIFs are located.
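
In equation form, for a test of n items, this is simply:

$$\mathrm{TIF}(\theta) = \sum_{i=1}^{n} I_i(\theta)$$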

## Conditional Standard Error of Measurement Function

The test information function can be inverted into an estimate of the conditional standard error of measurement. What do we mean by conditional? If you are familiar with classical test theory, you know that it estimates the same standard error of measurement for everyone who takes a test.

But given the concepts above, that is an unreasonable expectation. If a test has only difficult items, it measures top students well and lower students poorly, so why should we say that everyone's scores are equally accurate? The conditional standard error of measurement turns the standard error into a function of ability.
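
The inversion itself is simple: at any theta, the conditional standard error is the reciprocal of the square root of the test information at that theta, so more information means a smaller standard error.

$$\mathrm{CSEM}(\theta) = \frac{1}{\sqrt{\mathrm{TIF}(\theta)}}$$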

Also, note that it refers to the theta scale and not to the number-correct scale.

## How can I implement all this?

For starters, I recommend delving deeper into an item response theory textbook. My favorite is Item Response Theory for Psychologists by Embretson and Reise. Next, you need some item response theory software.

Xcalibre can be downloaded as a free version for learning and is the easiest program to learn how to use (no 1980s-style command code… how is that still a thing?). But if you are an R fan, there are plenty of resources in that community as well.

## Tell me again: why are we doing this?

The purpose of all this is to effectively model how items and tests work, namely, how they interact with examinees. This then allows us to evaluate their performance so that we can improve them, thereby enhancing reliability and validity.

Classical test theory had a lot of shortcomings in this endeavor, which led to IRT being invented. IRT also facilitates some modern approaches to assessment, such as linear on-the-fly testing, adaptive testing, and multistage testing.

## What is the generalized partial credit model (GPCM)?

The generalized partial credit model (GPCM; Muraki, 1992) is one member of the family of polytomous models in item response theory.

It is designed for items scored with partial credit.  That is, instead of scoring only right/wrong, an examinee can receive partial points for completing some aspects of the item correctly.  For example, a typical multiple-choice item is scored 0 points for incorrect and 1 point for correct.

A GPCM item might consist of 3 aspects and be scored 0 points for fully incorrect, 3 points for fully correct, and 1 or 2 points if the examinee completes only 1 or 2 of the aspects, but not all three.

## Examples of GPCM items

GPCM items, therefore, contain multiple point levels starting at 0.  There are several examples that are common in the world of educational assessment.

The first example, which nearly everyone is familiar with, is essay rubrics.  A student might be instructed to write an essay on why extracurriculars are important in school, with at least 3 supporting points.  Such an essay might be scored on the number of points presented (0, 1, 2, 3) as well as on grammar (0 = 10 or more errors, 1 = 3-9 errors, 2 = 2 or fewer errors).

Another example is multiple response items.  For example, a student might be presented with a list of 5 animals and be asked to identify which are mammals.  There are 2 correct answers, so the possible points are 0, 1, or 2.

Note that this also includes their tech-enhanced equivalents, such as drag-and-drop; such items might be reconfigured so that the examinee drags the animal names into boxes, but that is just window dressing to make the item look more engaging.

The National Assessment of Educational Progress and many other K-12 assessments utilize the GPCM since they so often use item types like this.

## Why use the generalized partial credit model?

Well, the first part of the answer is a more general question: why use polytomous items?  These items are generally regarded as higher-fidelity and as assessing deeper thinking than multiple-choice items. They also provide much more information than multiple-choice items in an IRT paradigm.

The second part of the answer is the specific question: If we have polytomous items, why use the GPCM rather than other models?

There are two parts to that answer, and they refer to the name generalized partial credit model.  First, partial credit models are appropriate for items where the scoring starts at 0 and different polytomous items can perform very differently.  In contrast, Likert-style items are also (almost always) polytomous, but they start at 1 and apply the same psychological response process to every item; for example, a survey where statements are presented and examinees are asked to "rate each on a scale of 1 to 5."

Second, the “generalized” part of the name means that it includes a discrimination parameter for evaluating the measurement quality of an item.  This is similar to using the 2PL or 3PL for dichotomous items rather than using the Rasch model and assuming items are of equal discrimination.  There is also a Rasch partial credit model that is equivalent and can be used alongside Rasch dichotomous items, but this post is just focusing on GPCM.

## Definition of the Generalized Partial Credit Model

The equation below (Embretson & Reise, 2000) defines the generalized partial credit model.

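A standard way to write the GPCM, consistent with the parameter definitions that follow, is:

$$P_{ix}(\theta) = \frac{\exp\left[\sum_{j=1}^{x} a_i(\theta - g_{ij})\right]}{\sum_{r=0}^{m-1} \exp\left[\sum_{j=1}^{r} a_i(\theta - g_{ij})\right]}, \qquad x = 0, 1, \ldots, m-1$$

with the convention that the sum from j = 1 to 0 is zero, so the numerator for x = 0 equals 1.
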
In this equation:

• m = the number of possible point levels on the item

• x = the student’s score on the item

• i = the index for the item

• θ = student ability

• a = the discrimination parameter for item i

• g_ij = the boundary parameter for step j on item i; there are always m-1 boundaries

• r = an index used to manage the summation

What do these mean?  The a parameter is the same concept as the a parameter in dichotomous IRT, where 0.5 might be low and 1.2 might be high.  The boundary parameters define the steps or thresholds that explain how the GPCM works, which will become clearer when you see the graph below.

As an example, let us consider a 4-point item with the following parameters.

If you use those numbers to graph the functions for each point level as a function of theta, you would see a graph like the one below.  Here, consider Option 1 to be the probability of getting 0 points; this is a very high probability for the lowest examinees but drops as ability increases.

The Option 5 line is for receiving all possible points; high probability for the best examinees, but probability decreases as ability does.  Between, we have probability curves for 1, 2, and 3 points.  If an examinee has a theta of -0.5, they have a high probability of getting 2 points on the item (yellow curve).

The boundary parameters mentioned earlier have a very real interpretation in this graph; they are literally the boundaries between the curves.  That is, the theta level at which 1 point (purple) becomes more likely than 0 points (red) is -2.4, where the two curves cross.  Note that this is the first boundary parameter, b1, in the image earlier.
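
As a concrete (hypothetical) illustration, the sketch below computes GPCM category probabilities for a single 5-category item in Python; the discrimination and boundary values are made up, except that the first boundary is set to -2.4 to match the crossing point described above.

```python
import numpy as np

# Hypothetical GPCM parameters for one item with 5 score levels (0-4 points);
# the first boundary is set to -2.4 to match the crossing point described above
a = 1.0                                  # discrimination
g = np.array([-2.4, -1.0, 0.3, 1.8])     # boundary (step) parameters

theta = np.linspace(-4, 4, 161)

# Numerator exponents: cumulative sum of a*(theta - g_j), with 0 for a score of 0
z = np.concatenate(
    [np.zeros((len(theta), 1)), np.cumsum(a * (theta[:, None] - g), axis=1)],
    axis=1,
)
num = np.exp(z)
probs = num / num.sum(axis=1, keepdims=True)   # columns = P(0 pts), P(1 pt), ..., P(4 pts)

# The first boundary is where the curves for 0 points and 1 point cross
idx = np.argmin(np.abs(probs[:, 0] - probs[:, 1]))
print(f"0-point and 1-point curves cross near theta = {theta[idx]:.2f}")  # ~ -2.4
```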

## How to use the GPCM

As mentioned before, the GPCM is an appropriate IRT model for multi-point items in an educational context, as opposed to Likert-style psychological items.  It is almost always used in conjunction with the 2PL or 3PL dichotomous models; consider, for example, a test of 25 multiple-choice items, 3 multiple-response items, and an essay scored with 2 rubrics.

To implement this, you need an IRT software program that can estimate dichotomous and polytomous items jointly, such as Xcalibre.  The screenshot below shows how to specify these models.

If you implement IRT with Xcalibre, it produces a page like this for each GPCM item.

To score students with the GPCM, you either need an IRT program like Xcalibre to score them, or a test delivery system that has been specifically designed to support the GPCM in its item banker and implement it in its scoring routines.  The former only works when you are doing the IRT analysis after all examinees have completed the test; if you have continuous deployment of assessments, you will need the latter approach.

IRT textbooks provide a treatment of polytomous models like the generalized partial credit model; examples are de Ayala (2009) and Embretson & Reise (2000). I also recommend the 2010 book by Nering and Ostini, which was previously available as a monograph.

## What is item response theory?

Item response theory (IRT) represents an important innovation in the field of psychometrics. While now 50 years old (assuming the "birth" is the classic Lord and Novick, 1968, text), it is still underutilized and remains a mystery to many practitioners.  So what is item response theory, and why was it invented?

## The Driver: Problems with Classical Test Theory

Classical test theory (CTT) is approximately 100 years old and remains commonly used because it is appropriate for certain situations and simple enough to be used by many people without formal training in psychometrics.  Most of its statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  Here are just a few.

• Sample dependency: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation (that is, two samples of different ability levels can be easily converted onto the same scale)
• Test dependency: Classical statistics are tied to a specific test form, and do not deal well with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing
• Weak linking/equating: CTT has a number of methods for linking multiple forms, but they are weak compared to IRT
• Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect
• Lack of accounting for guessing: CTT does not account for guessing on multiple choice exams
• Scoring: Scoring in classical test theory does not take into account item difficulty.
• Adaptive testing: CTT does not support adaptive testing in most cases.

## So what is item response theory?

It is a family of mathematical models that try to describe how examinees respond to items (hence the name).  These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves.  However, item response theory ended up doing so much more, namely addressing the problems listed above.

## The Foundation of Item Response Theory

The foundation of IRT is a mathematical model defined by item parameters.  For dichotomous items (those scored correct/incorrect), each item has three parameters:

a: the discrimination parameter, an index of how well the item differentiates low from top examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.

b: the difficulty parameter, an index of the ability level for which the item is most appropriate; typically ranges from -3 to +3, with 0 representing an average examinee.

c: the pseudo-guessing parameter, which is a lower asymptote; typically near 1/k, where k is the number of options.

These parameters are used to graphically display an item response function (IRF).  An example IRF is on the right.  Here, the a parameter is approximately 1.0, indicating a fairly discriminating item.  The b parameter is approximately -0.6 (the point on the x-axis at the midpoint of the curve), indicating an easy item; examinees slightly below average (around θ = -0.6) would have a 60% chance of answering correctly.  The c parameter is approximately 0.20, though the lower asymptote is off the left edge of the graph.

What does this mean conceptually?  We are trying to model the interaction of an examinee with the item, hence the name item response theory.  Consider the x-axis to be z-scores on a standard normal scale.  Examinees with higher ability are much more likely to respond correctly.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only a 37% chance.
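
A quick check of those numbers, using the standard 3PL item response function with the approximate parameters read off the graph (a = 1.0, b = -0.6, c = 0.20):

```python
import math

def p_3pl(theta, a=1.0, b=-0.6, c=0.20):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

print(round(p_3pl(2.0), 2))    # ~0.94, matching the 94% stated above
print(round(p_3pl(-2.0), 2))   # ~0.36, close to the 37% stated above
```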

## Building with the Basic Building Block

The IRF is used for several purposes.  Here are a few.

1. Interpreting and improving item performance
2. Scoring examinees with maximum likelihood or Bayesian methods
3. Form assembly, including linear on the fly testing (LOFT) and pre-equating
4. Calculating the accuracy of examinee scores
5. Development of computerized adaptive tests (CAT)
6. Data forensics to find cheaters or other issues.

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is the inverse of the TIF (the reciprocal of the square root of the information), and has the interpretable advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the SEM is a 95% confidence interval for their score.  The graph on the right shows part of the form assembly process in our FastTest platform.

## One Big Happy Family

IRT is actually a family of models that makes flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and the fit of the data.  If there are multipoint items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters. Learn more about the partial credit situation here.

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software, Xcalibre.

## The 3 best approaches for IRT equating

If you are delivering high-stakes tests in linear forms – or piloting a bank for CAT/LOFT – you are faced with the issue of how to equate the forms.  That is, how can we defensibly translate a score on Form A to a score on Form B?  While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. There are a number of ways to approach this issue, and IRT equating is the strongest.

## Why do we need equating?

The need is obvious: to adjust for differences in difficulty so that all examinees receive a fair score on a stable scale.  Suppose you take Form A and get a score of 72/100 while your friend takes Form B and gets a score of 74/100.  Is your friend smarter than you, or did his form happen to have easier questions?  Well, if the test designers built in some overlap, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items.  They are each delivered to a large, representative sample.  Here are the results.

| Exam Form | Mean score on 50 overlap items | Mean score on 100 total items |
|-----------|--------------------------------|-------------------------------|
| A         | 30                             | 72                            |
| B         | 32                             | 74                            |

Because the mean score on the anchor items was higher for Form B, we can conclude that the Form B group was a little more able, which led to the higher total score.

Now suppose these are the results:

| Exam Form | Mean score on 50 overlap items | Mean score on 100 total items |
|-----------|--------------------------------|-------------------------------|
| A         | 32                             | 72                            |
| B         | 32                             | 74                            |

Now, we have evidence that the groups are of equal ability.  The higher total score on Form B must then be because the unique items on that form are a bit easier.

## How do I calculate an equating?

You can equate forms with classical test theory (CTT) or item response theory (IRT).  However, one of the reasons that IRT was invented was that equating with CTT was very weak.  CTT methods include Tucker, Levine, and equipercentile.  Right now, though, let’s focus on IRT.

## IRT equating

There are three general approaches to IRT equating.  All of them can be accomplished with our industry-leading software Xcalibre, though conversion equating requires additional software called IRTEQ.

1. Conversion
2. Concurrent Calibration
3. Fixed Anchor Calibration

### Conversion

With this approach, you calibrate each form of your test with IRT, completely separately.  We then evaluate the relationship between the IRT parameters on each form and use it to convert examinee scores.  Conceptually, you line up the IRT parameters of the common items and estimate a linear transformation between the two scales, which you can then apply to convert scores.

But DO NOT just do a regular linear regression.  There are specific methods you must use, including mean/mean, mean/sigma, Stocking & Lord, and Haebara.  Fortunately, you don’t have to figure out all the calculations yourself, as there is free software available to do it for you: IRTEQ.
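
As an illustration of the simplest of these, here is a sketch of mean/sigma linking in Python; the anchor-item b parameters are hypothetical, and a serious application would use IRTEQ or the Stocking & Lord / Haebara characteristic-curve methods rather than this bare-bones version.

```python
import numpy as np

# Hypothetical difficulty (b) parameters for the same anchor items,
# calibrated separately on Form A (target scale) and Form B (scale to convert)
b_formA = np.array([-1.2, -0.5, 0.1, 0.8, 1.5])
b_formB = np.array([-1.0, -0.3, 0.3, 1.0, 1.7])

# Mean/sigma transformation: theta_A = A * theta_B + B
A = b_formA.std(ddof=1) / b_formB.std(ddof=1)
B = b_formA.mean() - A * b_formB.mean()

# Convert Form B item difficulties and examinee thetas onto the Form A scale
b_formB_converted = A * b_formB + B
theta_B = 0.40                       # a hypothetical Form B examinee score
theta_on_A_scale = A * theta_B + B
print(A, B, theta_on_A_scale)
```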

### Concurrent Calibration

The second approach is to combine the datasets into what is known as a sparse matrix.  You then run this single dataset through the IRT calibration, and it will place all items and examinees onto a common scale.  The concept of a sparse matrix is typically illustrated by the figure below, which represents the non-equivalent anchor test (NEAT) design.
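
A simple way to picture it: each group of examinees has responses for its own form’s unique items plus the shared anchor items, and missing values everywhere else. A hypothetical sketch of how such a matrix might be assembled (the counts and response probabilities are made up):

```python
import numpy as np

n_A, n_B = 1000, 1000          # examinees per form (hypothetical)
n_unique, n_anchor = 50, 50    # unique items per form and shared anchor items

# Columns: [Form A unique items | anchor items | Form B unique items]
matrix = np.full((n_A + n_B, n_unique * 2 + n_anchor), np.nan)

# Form A examinees answered their unique items and the anchors
matrix[:n_A, :n_unique + n_anchor] = np.random.binomial(1, 0.60, (n_A, n_unique + n_anchor))

# Form B examinees answered the anchors and their own unique items
matrix[n_A:, n_unique:] = np.random.binomial(1, 0.65, (n_B, n_unique + n_anchor))

# The anchor columns are answered by everyone; each unique block is missing
# for the other group, which is what makes the matrix "sparse"
print(np.isnan(matrix).mean())   # proportion of missing responses
```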

The IRT calibration software will automatically equate the two forms and you can use the resultant scores.

### Fixed Anchor Calibration

The third approach is a combination of the two above; it utilizes the separate calibration concept but still uses the IRT calibration process to perform the equating rather than separate software.

With this approach, you would first calibrate your data for Form A.  You then find all the IRT item parameters for the common items and input them into your IRT calibration software when you calibrate Form B.

You can tell the software to “fix” the item parameters so that those particular ones (from the common items) do not change.  Then all the item parameters for the unique items are forced onto the scale of the common items, which of course is the underlying scale from Form A.  This then also forces the scores from the Form B students onto the Form A scale.

## How do these approaches compare to each other?

Concurrent calibration is arguably the easiest but has the drawback that it merges the scales of each form into a new scale somewhere in the middle.  If you need to report the scores on either form on the original scale, then you must use the Conversion or Fixed Anchor approaches.  This situation commonly happens if you are equating across time periods.

Suppose you delivered Form A last year and are now trying to equate Form B.  You can’t just create a new scale and thereby nullify all the scores you reported last year.  You must map Form B onto Form A so that this year’s scores are reported on last year’s scale and everyone’s scores will be consistent.

## Where do I go from here?

If you want to do IRT equating, you need IRT calibration software.  All three approaches use it.  I highly recommend Xcalibre since it is easy to use and automatically creates reports in Word for you.  If you want to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014).  There are other resources more readily available on the internet, like this free handbook from CCSSO.  If you would like to learn more about IRT, I recommend the books by de Ayala (2009) and Embretson & Reise (2000).  A very brief intro is available on our website.

## Interpreting IRT cutscores

Some time ago, I received this question regarding interpreting item response theory (IRT) cutscores:

In my examination system, we are currently labeling ‘FAIL’ for student’s mark with below 50% and ‘PASS’ for 50% and above.  I found that this amazing Xcalibre software can classify students’ achievement in 2 groups based on scores.  But, when I tried to run IRT EPC with my data (with cut point of 0.5 selected), it shows that students with 24/40 correct items were classified as ‘FAIL’. Because in CTT, 24/40 correctly answered items is equal to 60% (Pass).  I can’t find its interpretation in Guyer & Thompson (2013) User’s Manual for Xcalibre.  How exactly should I set my cut point to perform 2-group classification using IRT EPC in Xcalibre to make it about equal to 50% achievement in CTT?

In this context, EPC refers to expected percent (proportion) correct.  IRT uses the test response function (TRF) to convert a theta score into an expectation of the percentage of items in the pool that a student would answer correctly.  So this Xcalibre user is wondering how to set an IRT cutscore on theta that meets their needs.

## Setting IRT cutscores

The short answer, in this case, is to evaluate the TRF and reverse-calculate the theta for the cutscore.  That is, find your desired cutscore on the y-axis and determine the corresponding value of theta.  In the example below, I located a percent-correct cutscore of 54 and found the corresponding theta of about -0.13.  In the case above, a theta of 0.5 likely corresponds to a percent-correct score in the 60%-70% range, so an observed score of 24/40 (60%) would indeed fall below the cut and fail.
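
Here is a sketch of that reverse calculation in Python, assuming hypothetical 3PL parameters for a 40-item form; in practice the parameters would come from your Xcalibre calibration, and the resulting theta depends entirely on those items.

```python
import numpy as np

# Hypothetical 3PL parameters for a 40-item form
rng = np.random.default_rng(1)
a = rng.uniform(0.6, 1.4, 40)
b = rng.normal(0.0, 1.0, 40)
c = np.full(40, 0.20)

def trf(theta):
    """Test response function: expected proportion correct at a given theta."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return p.mean()

# Reverse-calculate the theta corresponding to a 54% expected-correct cutscore
target = 0.54
thetas = np.linspace(-4, 4, 8001)
cut_theta = thetas[np.argmin([abs(trf(t) - target) for t in thetas])]
print(f"Theta cutscore for an EPC of {target:.0%}: {cut_theta:.2f}")
```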

Of course, it is indefensible to set a cutscore at an arbitrary round number.  To be defensible, you need to set the cutscore with an accepted methodology such as Angoff, modified-Angoff, Nedelsky, Bookmark, or Contrasting Groups.

A nice example is the modified-Angoff method, which is used extremely often in certification and licensure settings.  More information on this method is available here.  The result will typically be a specific cutscore, either on the raw or percent metric.  The TRF can be presented in both of those metrics, allowing the conversion on the right to be calculated easily.

Alternatively, some standard-setting methods can work directly on the IRT theta scale, including the Bookmark and Contrasting Groups approaches.

Interested in applying IRT to improve your assessments?  Download a free trial copy of Xcalibre here.  If you want to deliver online tests that are scored directly with IRT, in real time (including computerized adaptive testing), check out FastTest.

## January 2012 Newsletter

Xcalibre 4 is the most user-friendly software available for item response theory (IRT) analysis.  An update has recently been released, which includes a number of bug fixes and enhancements, including the addition of distractor (quantile) plots to the output.  These plots provide an excellent method for evaluating multiple-choice distractors by combining IRT and classical test theory.  We have also added a new fit plot with standard errors to the report document and percentiles to the scores spreadsheet.  Current license holders can download the no-cost update from https://assess.com/xcart/pages.php?pageid=10.  If you do not yet have Xcalibre 4, a free demo version is available at https://assess.com/xcart/product.php?productid=569.