Item response theory software

I recently received an email from a researcher who wanted to implement item response theory (IRT) but was not sure where to start.  It occurred to me that plenty of resources describe IRT, but few, if any, provide guidance on how someone new to the topic can apply it.  That is, many resources define the a-b-c parameters and discuss the item response function, but few tell you how to calculate those parameters or what to do with them.  In short: what item response theory software is out there, and how do you use it?

Why do I need to implement item response theory?

First of all, ask yourself this question.  Don’t use IRT just because you heard it is an advanced psychometric paradigm.  IRT was invented to address shortcomings in classical test theory, and it works best in situations where those shortcomings matter most.  For example, you might want to design adaptive tests, assemble parallel forms, or equate score scales across years.

What sort of tests/data work with IRT?

The next question to ask yourself is whether your test can work with IRT.  IRT assumes unidimensionality and local independence.  Unidimensionality means that all items intercorrelate highly and, from a factor analysis perspective, load highly on one primary factor.  Local independence means that responses to items are independent of one another once the underlying trait is accounted for, so testlets and “innovative” item types that violate this might not work well.

IRT assumes that items are scored dichotomously (correct/incorrect) or polytomously (integer points where smarter or high-trait examinees earn higher points).  Surprisingly, this isn’t always the case.  This blog post explores how a certain PARCC item type violated the should-be-obvious assumption that smarter students earn higher points, a great example of pedagogues trying to do psychometrics.

And, of course, IRT has sample size requirements.  I’ve received plenty of email questions from people who wonder why Xcalibre doesn’t work on their data set… of 6 students.  Well, IRT requires at least 100 examinees for the simplest model, and a minimum of up to 1,000 for more complex models.  For that matter, six students isn’t enough for classical test theory either.

Item response theory software

Classical test theory is super-super-simple, so that anyone can easily calculate things like P, Rpbis, and coefficient alpha with Microsoft Excel formulas.  CITAS does this.  IRT calculations are much more complex, and it takes hundreds of lines of real code to estimate item parameters like a, b, and c.  I recommend real item response theory software like the program Xcalibre to do so.  It has a straightforward, user-friendly interface and will automatically create MS Word reports for you.  If you are a member of the Rasch club, the go-to software is Winsteps.  You can also try R packages, but to do so you will need to learn to program in the R language, and the output is greatly inferior to commercial software.
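As a rough illustration of how simple these classical statistics are, here is a minimal Python sketch (standard library only) that computes P, Rpbis, and coefficient alpha for a small matrix of 0/1 responses.  The function names and the toy data are made up for this example, and this is not how CITAS or any other program is implemented.  Note that Rpbis here is the uncorrected version, where the item itself is included in the total score.

```python
from statistics import mean, pstdev, pvariance

def item_p_values(responses):
    """Proportion correct (P) per item; responses is a list of 0/1 rows."""
    n_items = len(responses[0])
    return [mean([row[j] for row in responses]) for j in range(n_items)]

def point_biserial(responses, j):
    """Rpbis: Pearson correlation between item j and the total score."""
    item = [row[j] for row in responses]
    total = [sum(row) for row in responses]
    mi, mt = mean(item), mean(total)
    cov = mean([(x - mi) * (t - mt) for x, t in zip(item, total)])
    return cov / (pstdev(item) * pstdev(total))

def coefficient_alpha(responses):
    """Cronbach's alpha from item variances and total-score variance."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Toy data (hypothetical): 6 examinees by 4 items, 1 = correct, 0 = incorrect.
# Far too few examinees for real use; this only shows the arithmetic.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

p_values = item_p_values(responses)
rpbis = [point_biserial(responses, j) for j in range(4)]
alpha = coefficient_alpha(responses)
print(p_values, rpbis, alpha)
```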

Some of the secondary analyses in IRT can be calculated easily enough that Excel formulas are an option.  The IRT Scoring Spreadsheet scores a single student with IRT item parameters you supply, in an interactive way that helps you learn how IRT scoring works. I also have a spreadsheet that helps you build parallel forms by calculating the test information function (TIF) and conditional standard error of measurement (CSEM).  However, my TestAssembler program does that with automation, saving you hours of manual labor.
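To give a feel for the math behind these secondary analyses, here is a minimal Python sketch of the three-parameter item response function, the TIF, the CSEM, and a simple maximum-likelihood score found by grid search.  It is an illustration under assumptions, not the logic of the IRT Scoring Spreadsheet or TestAssembler: the scaling constant D = 1.7, the item parameters, the response pattern, and all function names are hypothetical.

```python
import math

D = 1.7  # scaling constant; an assumption here, some software uses D = 1.0

def irf(theta, a, b, c):
    """3PL item response function: probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """3PL item information at theta."""
    p = irf(theta, a, b, c)
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def tif(theta, items):
    """Test information function: sum of item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

def csem(theta, items):
    """Conditional standard error of measurement: 1 / sqrt(TIF)."""
    return 1 / math.sqrt(tif(theta, items))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    ll = 0.0
    for (a, b, c), u in zip(items, responses):
        p = irf(theta, a, b, c)
        ll += math.log(p) if u == 1 else math.log(1 - p)
    return ll

def score_mle(items, responses, lo=-4.0, hi=4.0, step=0.01):
    """Crude maximum-likelihood theta: best point on an evenly spaced grid."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical (a, b, c) parameters for a three-item test
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
pattern = [1, 1, 0]  # one examinee's scored responses

theta_hat = score_mle(items, pattern)
print(theta_hat, tif(theta_hat, items), csem(theta_hat, items))
```

A grid search is slow but easy to follow; production software typically uses Newton-Raphson or Bayesian estimators instead, and all-correct or all-incorrect patterns have no finite maximum-likelihood estimate, so the grid simply returns its boundary in those cases.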

There are also a few specific-use tools available on the web.  One of my favorites is IRTEQ, which performs conversion-style equating such as mean/sigma and Stocking-Lord.  That is, it links together scores from different forms of an exam onto a common scale, even if the forms are delivered in different years.
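The mean/sigma method is simple enough to sketch in a few lines of Python.  The idea is to take the b parameters of the anchor items as estimated on each form, then find the slope A and intercept B that line up their standard deviations and means.  This is only an illustration of the textbook formulas, not IRTEQ's code; the function names and anchor values are hypothetical, and Stocking-Lord is a more involved method that is not shown here.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_new, b_base):
    """Slope A and intercept B that map the new form's scale onto the base scale.

    b_new and b_base hold the difficulty (b) estimates of the same anchor
    items, calibrated separately on the new form and on the base form.
    """
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B

def rescale_item(a, b, c, A, B):
    """Put one item's parameters on the base scale (c is unchanged)."""
    return a / A, A * b + B, c

def rescale_theta(theta, A, B):
    """Put an examinee's theta on the base scale."""
    return A * theta + B

# Hypothetical anchor-item difficulties from two separate calibrations
b_new = [-1.2, -0.4, 0.3, 1.1]
b_base = [-0.9, -0.1, 0.5, 1.4]

A, B = mean_sigma_constants(b_new, b_base)
print(A, B)
print(rescale_item(1.1, 0.3, 0.2, A, B))
print(rescale_theta(0.0, A, B))
```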

So where do I go from here?

If you want to implement item response theory, I recommend that you start by downloading the free version of Xcalibre.  If you are interested in managing an assessment with IRT throughout the test development cycle, sign up for a free account in FastTest, our cloud-based testing ecosystem.  If you still need to learn more about what IRT is, read this introductory article; if you want to go further, I recommend the book Item Response Theory for Psychologists by Embretson and Reise (2000).


Nathan Thompson, PhD

CEO at Assessment Systems
Nathan Thompson, PhD, is CEO and Co-Founder of Assessment Systems Corporation (ASC). He is a psychometrician, software developer, author, researcher, and evangelist for AI and automation. His mission is to elevate the profession of psychometrics by using software to automate psychometric work like item review, job analysis, and Angoff studies, so we can focus on more innovative work. His core goal is to improve assessment throughout the world. Nate was originally trained as a psychometrician, with an honors degree from Luther College with a triple major in Math/Psych/Latin, and then a PhD in Psychometrics at the University of Minnesota. He then worked in multiple roles in the testing industry, including item writer, test development manager, essay test marker, consulting psychometrician, software developer, project manager, and business leader. He is also cofounder and Membership Director at the International Association for Computerized Adaptive Testing (iacat.org). He's published 100+ papers and presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/.
