
Item response theory (IRT) represents an important innovation in the field of psychometrics. While now more than 50 years old – assuming the “birth” is the classic Lord and Novick (1968) text – it is still underutilized and remains a mystery to many practitioners.  So what is item response theory, and why was it invented?

Classical test theory (CTT) is approximately 100 years old and remains commonly used because it is appropriate for certain situations, and it is simple enough that it can be used by many people without formal training in psychometrics.  Most of its statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  Here are just a few.

  • Sample dependency: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation (that is, two samples of different ability levels can be easily converted onto the same scale)
  • Test dependency: Classical statistics are tied to a specific test form, and do not deal well with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing
  • Weak linking/equating: CTT has a number of methods for linking multiple forms, but they are weak compared to IRT
  • Measuring the range of students: Classical tests are built for the average student, and do not measure high- or low-ability students very well; likewise, classical statistics for very difficult or very easy items are suspect
  • Lack of accounting for guessing: CTT does not account for guessing on multiple choice exams
  • Scoring: Scoring in classical test theory does not take into account item difficulty.
  • Adaptive testing: CTT does not support adaptive testing in most cases.

Want to start applying IRT without having to learn how to code?
Download Xcalibre for free!

The Foundation of Item Response Theory

The foundation of IRT is a mathematical model defined by item parameters. For dichotomous items (those scored correct/incorrect), each item has three parameters:


a: the discrimination parameter, an index of how well the item differentiates low-ability from high-ability examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.

b: the difficulty parameter, an index of the examinee ability level for which the item is most appropriate; typically ranges from -3 to +3, with 0 being an average examinee level.

c: the pseudoguessing parameter, which is the lower asymptote of the curve; typically falls near 1/k, where k is the number of response options.


Dichotomous IRF from FastTest

These parameters are used to graphically display an item response function (IRF).  An example IRF is on the right.  Here, the a parameter is approximately 1.0, indicating a fairly discriminating item.  The b parameter is approximately -0.6 (the point on the x-axis below the midpoint of the curve), indicating a relatively easy item; examinees at that level, somewhat below average, would have about a 60% chance of answering correctly.  The c parameter is approximately 0.20, though the lower asymptote is off the left of the screen.


What does this mean conceptually?  We are trying to model the interaction of an examinee with the item, hence the name item response theory.  Consider the x-axis to be z-scores on a standard normal scale.  Examinees with higher ability are much more likely to respond correctly.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only a 37% chance.
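To make this concrete, here is a minimal Python sketch of the three-parameter logistic (3PL) item response function, using values close to the example item above (a = 1.0, b = -0.6, c = 0.20).  The logistic metric without the D = 1.7 scaling constant is assumed, and the parameters are illustrative only.

```python
import math

def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (logistic metric, D = 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative parameters, roughly matching the example item described above
a, b, c = 1.0, -0.6, 0.20

for theta in (-2.0, -0.6, 0.0, 2.0):
    print(f"theta = {theta:+.1f}   P(correct) = {irf_3pl(theta, a, b, c):.2f}")
# theta = -2.0 gives roughly 0.36 and theta = +2.0 gives roughly 0.94,
# in line with the percentages quoted in the text.
```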

Building with the Basic Building Block

The IRF is used for several purposes.  Here are a few.

  1. Interpreting and improving item performance
  2. Scoring examinees with maximum likelihood or Bayesian methods
  3. Form assembly, including linear on the fly testing (LOFT)
  4. Calculating the accuracy of examinee scores
  5. Development of computerized adaptive tests (CAT)
  6. Data forensics to find cheaters or other issues.

Test information function

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is the inverse of the square root of the TIF, and has the interpretable advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the CSEM is a 95% confidence interval for their score.  The graph on the right shows part of the form assembly process in our FastTest platform.
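As a rough illustration of that relationship, the sketch below computes 3PL item information for a small hypothetical item bank, sums it into a TIF, and converts it to a CSEM and a 95% confidence interval.  The item parameters are made up for the example.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_information_3pl(theta, a, b, c):
    """Fisher information for one 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    q = 1 - p
    return (a ** 2) * ((p - c) ** 2 / (1 - c) ** 2) * (q / p)

# Hypothetical small item bank: (a, b, c) triples
items = [(1.0, -0.6, 0.20), (0.8, 0.0, 0.25), (1.2, 0.5, 0.20), (0.9, 1.0, 0.25)]

for theta in (-2, -1, 0, 1, 2):
    tif = sum(item_information_3pl(theta, *item) for item in items)
    csem = 1 / math.sqrt(tif)                           # CSEM = 1 / sqrt(TIF)
    lo, hi = theta - 1.96 * csem, theta + 1.96 * csem   # 95% confidence interval
    print(f"theta = {theta:+d}: TIF = {tif:.2f}, CSEM = {csem:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```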

One Big Happy Family

IRT is actually a family of models that makes flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and the fit of the data.  If there are polytomous (multi-point) items, such as Likert rating scales or partial-credit items, the models are extended to include additional parameters.

Where can I learn more?

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software, Xcalibre.


If you are delivering high-stakes tests in linear forms – or piloting a bank for CAT/LOFT – you are faced with the issue of how to equate the forms together.  That is, how can we defensibly translate a score on Form A to a score on Form B?  While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic.  There are a number of ways to approach this issue, and IRT equating is the strongest.

Why do we need equating?

The need is obvious: to adjust for differences in difficulty so that all examinees receive a fair score on a stable scale.  Suppose you take Form A and get a score of 72/100, while your friend takes Form B and gets a score of 74/100.  Is your friend smarter than you, or did his form happen to have easier questions?  Well, if the test designers built in some overlap, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items.  They are each delivered to a large, representative sample.  Here are the results.

Exam Form | Mean score on 50 overlap items | Mean score on 100 total items
A         | 30                             | 72
B         | 32                             | 74

Because the Form B group scored higher on the anchor items, we conclude that the Form B group was a little smarter, which led to the higher total score.

Now suppose these are the results:

Exam Form | Mean score on 50 overlap items | Mean score on 100 total items
A         | 32                             | 72
B         | 32                             | 74

Now, we have evidence that the groups are of equal ability.  The higher total score on Form B must then be because the unique items on that form are a bit easier.
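The decision logic in these two scenarios can be sketched in a few lines of Python.  The numbers are the hypothetical ones from the tables above, and real CTT equating (Tucker, Levine, equipercentile) is considerably more involved than this.

```python
# Minimal sketch of the anchor-item reasoning above (hypothetical numbers from the second table)
forms = {
    "A": {"anchor_mean": 32, "total_mean": 72},
    "B": {"anchor_mean": 32, "total_mean": 74},
}

anchor_gap = forms["B"]["anchor_mean"] - forms["A"]["anchor_mean"]
total_gap = forms["B"]["total_mean"] - forms["A"]["total_mean"]

if anchor_gap == 0:
    # Groups look equally able, so the total-score gap is attributed to form difficulty;
    # a crude adjustment would shift Form B scores onto the Form A scale.
    print(f"Form B appears easier by about {total_gap} points; adjust Form B scores by {-total_gap}.")
else:
    # The groups differ on the anchors, so the total-score gap largely reflects group ability.
    print(f"The Form B group appears more able (anchor gap = {anchor_gap}), "
          "which explains the higher total score.")
```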

How do I calculate an equating?

You can equate forms with classical test theory (CTT) or item response theory (IRT).  However, one of the reasons that IRT was invented was that equating with CTT was very weak.  CTT methods include Tucker, Levine, and equipercentile.  Right now, though, let’s focus on IRT.

IRT equating

There are three general approaches to IRT equating.  All of them can be accomplished with our industry-leading software Xcalibre, though conversion equating also requires an additional program called IRTEQ.

  1. Conversion
  2. Concurrent Calibration
  3. Fixed Anchor Calibration

Conversion

With this approach, you calibrate each form of your test with IRT completely separately.  You then evaluate the relationship between the IRT parameters on each form and use it to estimate a conversion for examinee scores.  Conceptually, you line up the IRT parameters of the common items and estimate a linear transformation, which you can then apply to convert scores.  But DO NOT just run an ordinary linear regression.  There are specific methods you must use, including mean/mean, mean/sigma, Stocking & Lord, and Haebara.  Fortunately, you don’t have to figure out all the calculations yourself, as there is free software available to do it for you: IRTEQ.
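As a rough illustration of the mean/sigma method only (the characteristic-curve methods such as Stocking & Lord and Haebara are generally preferred and are what IRTEQ implements), here is a small Python sketch with made-up b-parameters for the common items.

```python
from statistics import mean, stdev

# Hypothetical b-parameters of the common (anchor) items from two separate calibrations
b_form_a = [-1.2, -0.5, 0.0, 0.4, 1.1]   # target scale (Form A)
b_form_b = [-1.0, -0.3, 0.2, 0.6, 1.3]   # scale to be transformed (Form B)

# Mean/sigma method: find the linear transformation theta_A = A * theta_B + B
A = stdev(b_form_a) / stdev(b_form_b)
B = mean(b_form_a) - A * mean(b_form_b)

def to_form_a_scale(theta_b):
    """Place a Form B theta onto the Form A scale."""
    return A * theta_b + B

def convert_item(a_param, b_param):
    """Place Form B item parameters onto the Form A scale (c is unchanged)."""
    return a_param / A, A * b_param + B

print(f"A = {A:.3f}, B = {B:.3f}")
print("theta_B = 0.5 maps to theta_A =", round(to_form_a_scale(0.5), 3))
```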

Concurrent Calibration

The second approach is to combine the datasets into what is known as a sparse matrix.  You then run this single dataset through the IRT calibration, and it will place all items and examinees onto a common scale.  The concept of a sparse matrix is typically represented by the figure below, reflecting the non-equivalent groups with anchor test (NEAT) design.  The IRT calibration software will automatically equate the two forms, and you can use the resultant scores directly.
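Here is a minimal sketch of what such a sparse matrix might look like in Python, assuming a hypothetical NEAT layout in which Form A covers items 1-100, Form B covers items 51-150, and items 51-100 are the anchors; responses that are missing by design are coded as NaN.

```python
import numpy as np

# Hypothetical NEAT-style layout
n_items_total = 150
form_a_items = range(0, 100)     # items 1-100
form_b_items = range(50, 150)    # items 51-150; items 51-100 are the common anchors

def to_sparse_row(responses, item_positions, n_items_total):
    """Place one examinee's 0/1 responses into a full-length row; NaN = not administered."""
    row = np.full(n_items_total, np.nan)
    row[list(item_positions)] = responses
    return row

rng = np.random.default_rng(0)   # simulated responses, purely for illustration
form_a_rows = [to_sparse_row(rng.integers(0, 2, 100), form_a_items, n_items_total) for _ in range(3)]
form_b_rows = [to_sparse_row(rng.integers(0, 2, 100), form_b_items, n_items_total) for _ in range(3)]

sparse_matrix = np.vstack(form_a_rows + form_b_rows)
print(sparse_matrix.shape)       # (6, 150): all examinees by all items, NaN where unseen
```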

Fixed Anchor Calibration

The third approach is a combination of the two above; it utilizes the separate calibration concept but still uses the IRT calibration process to perform the equating rather than separate software.  With this approach, you would first calibrate your data for Form A.  You then find all the IRT item parameters for the common items, and input them into your IRT calibration software when you calibrate Form B.  You can tell the software to “fix” the item parameters so that those particular ones (from the common items) do not change.  Then all the item parameters for the unique items are forced onto the scale of the common items, which of course is the underlying scale from Form A.  This then also forces the scores from the Form B students onto the Form A scale.
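To see why fixing the anchor parameters pins down the scale, consider scoring a Form B examinee against the anchor items alone, using the parameter estimates from the Form A calibration.  The sketch below uses a crude grid-search maximum likelihood estimate with made-up numbers; real fixed-anchor calibration estimates the unique-item parameters and thetas jointly inside the calibration software, but the scale is transmitted the same way.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, params):
    """Log-likelihood of a 0/1 response string given fixed item parameters."""
    ll = 0.0
    for u, (a, b, c) in zip(responses, params):
        p = p_3pl(theta, a, b, c)
        ll += math.log(p) if u == 1 else math.log(1 - p)
    return ll

def mle_theta(responses, params):
    """Crude grid-search MLE of theta; sufficient for illustration."""
    grid = [g / 100 for g in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, params))

# Hypothetical anchor-item parameters taken from the Form A calibration and held fixed
anchor_params = [(1.0, -0.6, 0.20), (0.8, 0.0, 0.25), (1.2, 0.5, 0.20), (0.9, 1.0, 0.25)]

# A Form B examinee's responses to those same anchor items
responses = [1, 1, 0, 0]
print("theta on the Form A scale:", mle_theta(responses, anchor_params))
```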


How do these approaches compare to each other?
Concurrent calibration linking

Concurrent calibration is arguably the easiest, but has the drawback that it merges the scales of each form into a new scale somewhere in the middle.  If you need to report the scores on either form on the original scale, then you must use the Conversion or Fixed Anchor approaches.  This situation commonly happens if you are equating across time periods.  Suppose you delivered Form A last year and are now trying to equate Form B.  You can’t just create a new scale and thereby nullify all the scores you reported last year.  You must map Form B onto Form A so that this year’s scores are reported on last year’s scale and everyone’s scores will be consistent.


Where do I go from here?

If you want to do IRT equating, you need IRT calibration software; all three approaches use it.  I highly recommend Xcalibre since it is easy to use and automatically creates reports in Word for you.  If you want to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014).  There are other resources more readily available on the internet, like this free handbook from CCSSO.  If you would like to learn more about IRT, I recommend the books by de Ayala (2009) and Embretson & Reise (2000).  A very brief intro is available on our website.


Some time ago, I received this question regarding the interpretation of cutscores under item response theory (IRT):

In my examination system, we are currently labeling ‘FAIL’ for student’s mark with below 50% and ‘PASS’ for 50% and above. I found that this amazing Xcalibre software can classify students’ achievement in 2 groups based on scores. But, when I tried to run IRT EPC with my data (with cut point of 0.5 selected), it shows that students with 24/40 correct items were classified as ‘FAIL’. Because in CTT, 24/40 correctly answered items is equal to 60% (Pass). I can’t find its interpretation in Guyer & Thompson (2013) User’s Manual for Xcalibre. How exactly should I set my cut point to perform 2-group classification using IRT EPC in Xcalibre to make it about equal to 50% achievement in CTT?

In this context, EPC refers to expected percent (proportion) correct.  IRT uses the test response function (TRF) to convert a theta score into an expectation of the percentage of items in the pool that a student would answer correctly.  So this Xcalibre user is wondering how to set an IRT cutscore on theta that meets their needs.

Setting IRT cutscores

The short answer, in this case, is to evaluate the TRF and reverse-calculate the theta for the cutscore.  That is, find your desired cutscore on the y-axis and determine the corresponding value of theta on the x-axis.  In the example below, I have located a percent-correct cutscore of 54 and found the corresponding theta of about -0.13.  In the case above, a cut point of theta = 0.5 likely corresponded to an expected percent correct of 60%-70%, so an observed score of 24/40 (60%) would indeed fail.

Test response function
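The same reverse lookup can be sketched programmatically: compute the TRF from the item parameters and search for the theta whose expected proportion correct equals the target.  The item parameters below are hypothetical; in practice you would simply read the value off the TRF plot produced by your calibration software.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def trf(theta, items):
    """Test response function: expected proportion correct at a given theta."""
    return sum(p_3pl(theta, *item) for item in items) / len(items)

def theta_for_cutscore(target_epc, items, lo=-4.0, hi=4.0, tol=1e-4):
    """Reverse-calculate the theta whose expected proportion correct equals the target."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if trf(mid, items) < target_epc:   # TRF is increasing in theta
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical item bank of (a, b, c) parameters
items = [(1.0, -0.6, 0.20), (0.8, 0.0, 0.25), (1.2, 0.5, 0.20), (0.9, 1.0, 0.25)]

theta_cut = theta_for_cutscore(0.60, items)   # e.g., a 60% expected-correct cutscore
print(f"theta cutscore ~ {theta_cut:.2f}, TRF at that theta = {trf(theta_cut, items):.3f}")
```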

Of course, it is indefensible to set a cutscore at an arbitrary round number.  To be defensible, you need to set the cutscore with an accepted methodology such as Angoff, modified-Angoff, Nedelsky, Bookmark, or Contrasting Groups.

A nice example is the modified-Angoff method, which is used extremely often in certification and licensure situations.  More information on this method is available here.  The result of this method will typically be a specific cutscore, either on the raw or percent-correct metric.  The TRF can be presented in both of those metrics, allowing the conversion on the right to be calculated easily.

Alternatively, some standard-setting methods can work directly on the IRT theta scale, including the Bookmark and Contrasting Groups approaches.

Interested in applying IRT to improve your assessments?  Download a free trial copy of Xcalibre here.  If you want to deliver online tests that are scored directly with IRT, in real time (including computerized adaptive testing), check out FastTest.


Xcalibre 4 updates
Xcalibre 4 is the most user-friendly software available for item response theory (IRT) analysis.  An update has recently been released, which includes a number of bug fixes and enhancements, including the addition of distractor (quantile) plots to the output.  These plots provide an excellent method for evaluating multiple-choice distractors by combining IRT and classical test theory.  We have also added a new fit plot with standard errors to the report document and percentiles to the scores spreadsheet.  Current license holders can download the no-cost update from https://assess.com/xcart/pages.php?pageid=10.  If you do not yet have Xcalibre 4, a free demo version is available at https://assess.com/xcart/product.php?productid=569.

Iteman 4 updates
We have also released an update for Iteman 4, the leading software for analysis with classical test theory (CTT).  This update focuses on bug fixes and maintenance.  If you have a current license, you can download the update at https://assess.com/xcart/pages.php?pageid=10.  If you do not yet have Iteman 4, a free demo version is available at https://assess.com/xcart/product.php?productid=541.

Conferences page on ASC website updated for 2012
ASC provides a number of resources to professionals in the field of testing and psychometrics, including a list of major conferences in the field.  We have updated that list for 2012 at https://assess.com/conferences.php… see where you want to go!  ASC has plans to be at the SIOP, IACAT, and ICE conferences.

Connect with ASC on LinkedIn
LinkedIn is an essential social networking site for professionals of all fields.  It marks an important advancement in professional connectivity.  ASC has had a page on LinkedIn for a number of years; if you aren’t already following us, please visit http://www.linkedin.com/company/assessment-systems-corporation. We also update our news at the ASC blog.

Free training videos released
ASC hosts one or two workshops every year (one was recently held, January 23-25, in Brazil) and presents at a number of conferences, but we realize that not everyone is able to attend.  Therefore, ASC is publishing a set of tutorial videos on our software and on important issues in psychometrics, like item response theory and computerized adaptive testing.  The videos are available for free at https://assess.com/tutorials.php.  Two have been completed so far: Running Iteman 4 and Interpreting Iteman 4 Output.

ASC to present at 2012 SIOP Conference
ASC will present a mini-workshop on the development of computerized adaptive tests (CATs) at the Society for Industrial-Organizational Psychology (SIOP) conference.  The session is scheduled for 1:30 PM on Friday, April 27.  More information is available at the SIOP website.