# How do I implement item response theory?

I recently received an email from a researcher who wanted to implement item response theory (IRT) but was not sure where to start. It occurred to me that plenty of resources describe IRT, but few, if any, offer guidance on how someone new to the topic could apply it. That is, plenty of resources define the a-b-c parameters and discuss the item response function, but few tell you how to calculate those parameters or what to do with them.

# Why do I need to implement item response theory?

First of all, you might want to ask yourself this question. Don’t use IRT just because you heard it is an advanced psychometric paradigm. IRT was invented to address shortcomings in classical test theory, and it works best in situations where those shortcomings are highlighted. For example, you might want to design adaptive tests, assemble parallel forms, or equate score scales across years.

# What sort of tests/data work with IRT?

The next question you need to ask yourself is whether your test can work with IRT. IRT assumes unidimensionality and local independence. Unidimensionality means that all items intercorrelate highly and, from a factor analysis perspective, load highly on one primary factor. Local independence means that items are independent of one another once examinee ability is taken into account, so testlets and “innovative” item types that violate this might not work well.

IRT assumes that items are scored dichotomously (correct/incorrect) or polytomously (integer points where smarter or high-trait examinees earn higher points).  Surprisingly, this isn’t always the case.  This blog post explores how a certain PARCC item type violated the should-be-obvious assumption that smarter students earn higher points, a great example of pedagogues trying to do psychometrics.
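As a quick sanity check on that scoring assumption, you could compare the average total score of the examinees in each score category of a polytomous item. The sketch below is only an illustration: the function names and the example data are invented here, and it uses the raw total score as a stand-in for the trait (ideally the total would exclude the item being checked).

```python
from statistics import mean

def mean_total_by_category(item_scores, total_scores):
    """Average total score of examinees in each score category of one item."""
    groups = {}
    for score, total in zip(item_scores, total_scores):
        groups.setdefault(score, []).append(total)
    # Sort by category so the result runs from the lowest points to the highest.
    return {score: mean(totals) for score, totals in sorted(groups.items())}

def categories_are_ordered(item_scores, total_scores):
    """True if higher item points go with strictly higher average total scores."""
    means = list(mean_total_by_category(item_scores, total_scores).values())
    return all(low < high for low, high in zip(means, means[1:]))

# Hypothetical data: points earned on one 0-2 item, and each examinee's total score.
item_scores = [0, 0, 1, 1, 2, 2]
total_scores = [3, 5, 8, 10, 14, 16]
print(mean_total_by_category(item_scores, total_scores))
print(categories_are_ordered(item_scores, total_scores))
```

If the check comes back false, the item's point values probably do not reflect increasing levels of the trait, and the scoring rule deserves a closer look before any IRT calibration.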

And, of course, IRT has sample size requirements. I’ve received plenty of email questions from people who wonder why Xcalibre doesn’t work on their data set… of 6 students. Well, IRT requires at least 100 examinees for the simplest model, and 1,000 or more for more complex models. Six students obviously isn’t enough for classical test theory either, for that matter.

# How do I calculate IRT analytics?

Classical test theory is super-super-simple, so anyone can easily calculate things like P, Rpbis, and coefficient alpha with Microsoft Excel formulas. CITAS does this. IRT calculations are much more complex, and it takes hundreds of lines of real code to estimate item parameters like a, b, and c. I recommend the program Xcalibre to do so. It has a straightforward, user-friendly interface and will automatically create MS Word reports for you. If you are a member of the Rasch club, the go-to software is Winsteps. You can also try R packages, but to do so you will need to learn to program in the R language, and the output is greatly inferior to that of commercial software.
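To show just how simple the classical statistics are, here is a minimal sketch in Python rather than Excel. The function names and the tiny response matrix are made up for illustration, this is not how CITAS is implemented, and the point-biserial shown is the uncorrected version (the item is included in the total score).

```python
from statistics import mean, pstdev, pvariance

def p_values(responses):
    """Proportion correct (P) for each item; rows are examinees, columns are items."""
    n_items = len(responses[0])
    return [mean(row[j] for row in responses) for j in range(n_items)]

def point_biserial(responses, j):
    """Pearson correlation between item j (scored 0/1) and the total score."""
    item = [row[j] for row in responses]
    total = [sum(row) for row in responses]
    item_mean, total_mean = mean(item), mean(total)
    cov = mean((x - item_mean) * (t - total_mean) for x, t in zip(item, total))
    return cov / (pstdev(item) * pstdev(total))

def coefficient_alpha(responses):
    """Cronbach's coefficient alpha from item variances and total-score variance."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical scored responses: 6 examinees by 4 items (far too few for real use).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(p_values(responses))
print([round(point_biserial(responses, j), 3) for j in range(4)])
print(round(coefficient_alpha(responses), 3))
```

Nothing comparable exists for IRT calibration: estimating a, b, and c requires an iterative estimation routine, which is why dedicated software is the practical route.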

Some of the secondary analyses in IRT can be calculated easily enough that Excel formulas are an option.  The IRT Scoring Spreadsheet scores a single student with IRT item parameters you supply, in an interactive way that helps you learn how IRT scoring works. I also have a spreadsheet that helps you build parallel forms by calculating the test information function (TIF) and conditional standard error of measurement (CSEM).  However, my TestAssembler program does that with automation, saving you hours of manual labor.
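To give a feel for what those spreadsheets compute, the sketch below codes the three-parameter item response function, item and test information, CSEM, and a crude maximum likelihood score found by searching a grid of theta values. It is only a sketch under stated assumptions: the scaling constant D = 1.7, the function names, the grid search, and the three items are all choices made for this example, not a description of how the spreadsheets or TestAssembler work.

```python
import math

D = 1.7  # Assumption: the common scaling constant; some programs use D = 1.0.

def irf(theta, a, b, c):
    """3PL item response function: probability of a correct response at theta."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Information that one 3PL item provides at theta."""
    p = irf(theta, a, b, c)
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def tif(theta, items):
    """Test information function: the sum of the item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

def csem(theta, items):
    """Conditional standard error of measurement at theta."""
    return 1 / math.sqrt(tif(theta, items))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response vector at theta."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = irf(theta, a, b, c)
        total += math.log(p) if u else math.log(1 - p)
    return total

def mle_theta(items, responses, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum likelihood estimate of theta on [lo, hi]."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical (a, b, c) parameters for three items.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
theta_hat = mle_theta(items, [1, 1, 0])
print(theta_hat, tif(theta_hat, items), csem(theta_hat, items))
```

Note that all-correct or all-incorrect response patterns have no finite maximum likelihood estimate, so the grid search simply returns an endpoint of the range in those cases. Building parallel forms then amounts to choosing item sets whose `tif` curves match across the range of theta you care about.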

There are also a few specific-use tools available on the web.  One of my favorites is IRTEQ, which performs conversion-style equating such as mean/sigma and Stocking-Lord.  That is, it links together scores from different forms of an exam onto a common scale, even if the forms are delivered in different years.
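The mean/sigma method is simple enough to sketch in a few lines. Using the b parameters of the common (anchor) items as estimated on each form, it finds a slope A and intercept B that place the new form on the base form's scale. The code below is an illustration only: the function names and parameter values are invented, and it is not IRTEQ's implementation. Stocking-Lord requires an iterative minimization and is not shown.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_new, b_base):
    """Slope A and intercept B from the anchor items' b parameters on each form."""
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B

def rescale_item(a, b, c, A, B):
    """Put a new-form item's parameters on the base scale; c does not change."""
    return a / A, A * b + B, c

def rescale_theta(theta, A, B):
    """Put a new-form ability estimate on the base scale."""
    return A * theta + B

# Hypothetical b parameters for the same four anchor items on two forms.
b_new = [-1.2, -0.3, 0.4, 1.1]
b_base = [-1.0, -0.2, 0.5, 1.3]
A, B = mean_sigma_constants(b_new, b_base)
print(A, B)
print(rescale_item(1.1, 0.4, 0.2, A, B))
```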