What is computerized adaptive testing? Computerized adaptive tests (CATs), also known as computer-adaptive tests or simply adaptive tests, are a sophisticated method of test delivery that uses AI algorithms to personalize the test to every examinee. This means that the test becomes shorter, more accurate, more secure, and fairer. The AI algorithms are almost always based on item response theory (IRT), an application of machine learning to assessment, but can be based on other models as well.
This post will cover the following topics:
- What is computerized adaptive testing?
- How does the test adapt?
- An example of computerized adaptive testing
- Advantages of computerized adaptive testing
- How to develop a CAT that is valid and defensible
- What do I need for adaptive testing?
What is computerized adaptive testing?
Computerized adaptive testing is an algorithm that drives how a test is delivered. It is coded into a software platform, using the machine-learning approach of IRT to select items and score examinees. The algorithm proceeds in a loop until the test is complete.
The steps in the diagram above are adapted from Kingsbury and Weiss (1984). Let’s step through how it works.
For starters, you need an item bank that has been calibrated with a relevant psychometric or machine learning model. That is, you can’t just write a few items and subjectively rank them as Easy, Medium, or Hard difficulty. That’s an easy way to get sued. Instead, you need to write a large number of items (rule of thumb is 3x your intended test length) and then pilot them on a representative sample of examinees. The sample must be large enough to support the psychometric model you choose, and can range from 100 to 1000. You then need to perform simulation research – more on that later.
Once you have an item bank ready, here is how the computerized adaptive testing algorithm works for a student that sits down to take the test.
1. Starting point: there are three options to select the starting score, which psychometricians call theta
   - Everyone gets the same value, like 0.0 (average, in the case of non-Rasch models)
   - Randomized within a range, to help test security and item exposure
   - Predicted value, perhaps from external data, or from a previous exam
2. Select item
   - Find the item in the bank that has the highest information value
   - Often, you need to balance this with practical constraints such as item exposure or content balancing
3. Score examinee
   - Score the examinee; if using IRT, perhaps maximum likelihood or Bayes modal
4. Evaluate termination criterion: using a predefined rule supported by your simulation research
   - Is a certain level of precision reached, such as a standard error of measurement <0.30?
   - Are there no good items left in the bank?
   - Has a time limit been reached?
   - Has a Max Items limit been reached?
The algorithm works by looping through steps 2, 3, and 4 (select item, score examinee, evaluate termination criterion) until the termination criterion is satisfied.
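The select-score-evaluate loop described above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not how any particular platform implements it: it assumes a two-parameter logistic (2PL) IRT model, scores with expected a posteriori (EAP) estimation on a grid (a Bayesian method swapped in here for the maximum likelihood or Bayes modal options named above, because it is short to write), and ignores item exposure and content balancing. The names `run_cat`, `eap_score`, `bank`, and `answer` are hypothetical.

```python
import math

# Hypothetical theta grid from -4.0 to 4.0, used for numerical scoring.
GRID = [g / 10.0 for g in range(-40, 41)]

def p_correct(theta, a, b):
    """2PL probability of a correct response (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def eap_score(responses, bank):
    """EAP estimate and posterior SD, assuming a standard normal prior.
    responses is a list of (item_id, score) pairs with score 0 or 1."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior
        for item_id, score in responses:
            p = p_correct(t, *bank[item_id])
            w *= p if score == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, start_theta=0.0, se_target=0.30, max_items=20):
    """bank maps item_id -> (a, b); answer(item_id) returns 0 or 1."""
    theta, se = start_theta, float("inf")  # starting point
    responses = []
    while True:
        used = {item_id for item_id, _ in responses}
        remaining = [i for i in bank if i not in used]
        # Evaluate termination criterion: precision, max items, or empty bank.
        if se <= se_target or len(responses) >= max_items or not remaining:
            return theta, se, [item_id for item_id, _ in responses]
        # Select item: highest information at the current theta estimate.
        item_id = max(remaining, key=lambda i: information(theta, *bank[i]))
        # Score examinee after recording the response.
        responses.append((item_id, answer(item_id)))
        theta, se = eap_score(responses, bank)

# Usage with a made-up bank and a made-up examinee who answers every item
# easier than 1.0 correctly.
bank = {k: (1.5, (k - 15) / 5.0) for k in range(31)}
theta, se, administered = run_cat(bank, lambda i: 1 if bank[i][1] < 1.0 else 0)
print(round(theta, 2), round(se, 2), administered)
```

Checking the termination rule at the top of the loop keeps all of the stopping logic in one place; a production engine would also apply time limits and the exposure and content constraints mentioned above.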
Do I need to program all that myself?
No. Our revolutionary platform, FastTest, makes it easy to publish a CAT. Once you upload the IRT parameters, you can choose whatever options you please for steps 2-3-4 of the algorithm, simply by clicking on elements in our easy-to-use interface. Want to try it yourself? Contact us to set up a free account and demo.
But of course, there are many technical considerations that affect the quality and defensibility of your CAT – we’ll be talking about those in this post.
How does the test adapt? By Difficulty and/or Quantity
Adaptive tests operate by adapting both the difficulty and quantity of items seen by each examinee.
Most characterizations of adaptive testing focus on how item difficulty is matched to examinee ability. High-ability examinees receive more difficult items, while low-ability examinees receive easier items, which has important benefits to the student and the organization. An adaptive test typically begins by delivering an item of medium difficulty; if you get it correct, you get a tougher item, and if you get it incorrect, you get an easier item. This basic algorithm continues until the test is finished, though it usually includes sub-algorithms for important things like content distribution and item exposure.
A less publicized facet of adaptation is the number of items. Adaptive tests can be designed to stop when certain psychometric criteria are reached, such as a specific level of score precision. Some examinees finish very quickly with few items, so adaptive tests typically contain about half as many questions as a regular test, with at least as much accuracy. Since other examinees receive longer tests, these adaptive tests are referred to as variable-length. This is a massive benefit: cutting testing time in half, on average, can substantially decrease testing costs.
Some adaptive tests use a fixed length, and only adapt item difficulty. This is done merely for public relations reasons, namely the inconvenience of dealing with examinees who feel they were unfairly treated by the CAT, even though it is arguably more fair and valid than conventional tests.
In general, it is best practice to meld the two: allow test length to be shorter or longer, but put caps on either end that prevent inadvertently too-short tests or tests that could potentially go on to 400 items. For example, the NCLEX has a minimum length of 75 items and a maximum length of 145 items.
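A capped variable-length rule like this can be written as a small function. The sketch below is illustrative only: the function name `should_stop` and the precision target of 0.30 are assumptions, and the 75 and 145 defaults are borrowed from the NCLEX lengths quoted above purely as example caps, not as a description of how the NCLEX actually decides when to stop.

```python
def should_stop(n_items, se, min_items=75, max_items=145, se_target=0.30):
    """Variable-length stopping rule with caps on both ends.

    n_items: number of items administered so far
    se: current standard error of the theta estimate
    """
    if n_items < min_items:
        return False  # never stop before the minimum length
    if n_items >= max_items:
        return True   # always stop at the maximum length
    return se <= se_target  # in between, stop once precision is reached

# Usage: a precise examinee at 80 items stops, an imprecise one continues.
print(should_stop(80, 0.25), should_stop(80, 0.45))
```

Checking the minimum first means an examinee who reaches the precision target early still sees at least `min_items` questions, which addresses the "too-short test" concern.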
An example of computerized adaptive testing
Let’s walk through an oversimplified example. Below are the item information functions for five items in a bank. Let’s suppose the starting theta is 0.0.
- We find the first item to deliver. Which item has the highest information at 0.0? It is Item 4.
- Suppose the student answers incorrectly.
- We run the IRT scoring algorithm, and suppose the score is -2.0.
- Check the termination criterion; we certainly aren’t done yet, after 1 item.
- Find the next item. Which has the highest information at -2.0? Item 2.
- Suppose the student answers correctly.
- We run the IRT scoring algorithm, and suppose the score is -0.8.
- Evaluate termination criterion; not done yet.
- Find the next item. Item 2 is the highest at -0.8 but we already used it. Item 4 is next best, but we already used it. So the next best is Item 1.
- Item 1 is very easy, so the student gets it correct.
- New score is -0.2.
- Best remaining item at -0.2 is Item 3.
- Suppose the student gets it incorrect.
- New score is perhaps -0.4.
- Evaluate termination criterion. Suppose that the test has a max of 4 items, an extremely simple criterion. We have met it. The test is now done and automatically submitted.
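For readers who want the math behind "highest information" in the walkthrough above, here is the standard form under a two-parameter logistic (2PL) model. The choice of the 2PL is an assumption for illustration, since the example does not say which IRT model the five items use; a_i is the item's discrimination and b_i its difficulty.

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}, \qquad
I_i(\theta) = a_i^2 \, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)

i^{*} = \arg\max_{i \notin \text{administered}} I_i(\hat{\theta}), \qquad
\mathrm{SEM}(\hat{\theta}) \approx \frac{1}{\sqrt{\sum_{i \in \text{administered}} I_i(\hat{\theta})}}
```

Each item's information peaks where theta equals its difficulty, which is why the selected item changes as the score estimate moves. The standard error shown is the usual approximation for maximum likelihood scoring: it shrinks as administered items add information, which is what a precision-based termination criterion monitors.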
Advantages of computerized adaptive testing
Our experienced team of psychometricians can provide your organization with the requisite experience to implement adaptive testing and help your organization benefit from these advantages. Contact us or read this white paper to learn more.
- Shorter tests, anywhere from a 50% to 90% reduction; reduces cost, examinee fatigue, and item exposure
- More precise scores: CAT will make tests more accurate
- More control of score precision (accuracy): CAT ensures that all students will have the same accuracy, making the test much more fair. Traditional tests measure the middle students well but not the top or bottom students.
- Increased efficiency
- Greater test security because everyone is not seeing the same form
- A better experience for examinees, as they only see items relevant for them, providing an appropriate challenge
- The better experience can lead to increased examinee motivation
- Immediate score reporting
- More frequent retesting is possible because practice effects are minimized, which makes CAT extremely useful for K-12 formative assessment.
- Individual pacing of tests; examinees move at their own speed
- On-demand testing can reduce printing, scheduling, and other paper-based concerns
- Storing results in a database immediately makes data management easier
- Computerized testing facilitates the use of multimedia in items
How to develop a CAT that is valid and defensible
CATs are the future of assessment. They operate by adapting both the difficulty and number of items to each individual examinee. The development of an adaptive test is no small feat, and requires five steps integrating the expertise of test content developers, software engineers, and psychometricians.
The development of a quality adaptive test is complex and requires experienced psychometricians in both item response theory (IRT) calibration and CAT simulation research. FastTest can provide you the psychometrician and software; if you provide test items and pilot data, we can help you quickly publish an adaptive version of your test.
Step 1: Feasibility, applicability, and planning studies. First, extensive Monte Carlo simulation research must occur, and the results formulated as business cases, to evaluate whether adaptive testing is feasible, applicable, or even possible.
Step 2: Develop item bank. An item bank must be developed to meet the specifications recommended by Step 1.
Step 3: Pretest and calibrate item bank. Items must be pilot tested on 200-1000 examinees (depends on IRT model) and analyzed by a Ph.D. psychometrician.
Step 4: Determine specifications for final CAT. Data from Step 3 is analyzed to evaluate CAT specifications and determine most efficient algorithms using CAT simulation software such as CATSim.
Step 5: Publish live CAT. The adaptive test is published in a testing engine capable of fully adaptive tests based on IRT. There are not very many of them out in the market. Sign up for a free account in our platform FastTest and try for yourself!
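To give a feel for the Monte Carlo simulation research mentioned in Steps 1 and 4, here is a rough Python sketch. It is not how CATSim works internally; it is a toy illustration that assumes a 2PL model, a standard normal examinee population, EAP scoring on a grid, and a made-up item bank. The function names `simulate` and `simulate_one` are hypothetical. It reports the average test length and the root mean squared error (RMSE) of the score estimates for a given termination rule.

```python
import math
import random

GRID = [g / 10.0 for g in range(-40, 41)]  # theta grid from -4.0 to 4.0

def prob(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    """2PL item information at theta."""
    p = prob(theta, a, b)
    return a * a * p * (1.0 - p)

def simulate_one(true_theta, bank, rng, se_target, max_items):
    """Run one simulated CAT; return (theta estimate, items used)."""
    posterior = [math.exp(-0.5 * t * t) for t in GRID]  # N(0, 1) prior
    theta, se, used = 0.0, float("inf"), set()
    while se > se_target and len(used) < min(max_items, len(bank)):
        # Maximum-information selection among unused items.
        best = max((i for i in range(len(bank)) if i not in used),
                   key=lambda i: info(theta, *bank[i]))
        used.add(best)
        a, b = bank[best]
        # Generate a random response from the examinee's true theta.
        correct = rng.random() < prob(true_theta, a, b)
        posterior = [w * (prob(t, a, b) if correct else 1.0 - prob(t, a, b))
                     for t, w in zip(GRID, posterior)]
        total = sum(posterior)
        posterior = [w / total for w in posterior]  # normalize each step
        theta = sum(t * w for t, w in zip(GRID, posterior))
        se = math.sqrt(sum((t - theta) ** 2 * w for t, w in zip(GRID, posterior)))
    return theta, len(used)

def simulate(bank, n_examinees=200, se_target=0.30, max_items=30, seed=1):
    """Simulate many examinees and summarize length and accuracy."""
    rng = random.Random(seed)
    lengths, sq_errors = [], []
    for _ in range(n_examinees):
        true_theta = rng.gauss(0.0, 1.0)
        est, n = simulate_one(true_theta, bank, rng, se_target, max_items)
        lengths.append(n)
        sq_errors.append((est - true_theta) ** 2)
    return {"mean_length": sum(lengths) / n_examinees,
            "rmse": math.sqrt(sum(sq_errors) / n_examinees)}

# Usage with a made-up bank of 61 items, difficulties from -3.0 to 3.0.
bank = [(1.5, (k - 30) / 10.0) for k in range(61)]
print(simulate(bank, n_examinees=100))
```

Rerunning with different values of `se_target` and `max_items`, or with a different bank, is the kind of comparison that supports the business case in Step 1 and the final specifications in Step 4.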
Want to learn more about our one-of-a-kind model? Click here to read the seminal article by our two co-founders, or read this blog post on developing an adaptive test. More adaptive testing research is available here.
What do I need for adaptive testing?
Here are some minimum requirements to evaluate if you are considering a move to the CAT approach.
- A large item bank piloted so that each item has at least 100 valid responses (Rasch model) or 500 (3PL model)
- 500 examinees per year
- Specialized IRT calibration and CAT simulation software like Xcalibre and CATSim
- Staff with a Ph.D. in psychometrics or an equivalent level of experience. Or, leverage our internationally recognized expertise in the field.
- Items (questions) that can be scored objectively correct/incorrect in real-time
- An item banking system and CAT delivery platform
- Financial resources: Because it is so complex, the development of a CAT will cost at least $10,000 (USD) — but if you are testing large volumes of examinees, it will be a significantly positive investment. If you pay $20/hour for proctoring seats and cut a test from 2 hours to 1 hour for just 1,000 examinees… that’s a $20,000 savings. If you are doing 200,000 exams? That is $4,000,000 in seat time that is saved.
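The seat-time arithmetic in the last bullet is easy to rerun with your own numbers. The sketch below uses the figures from that bullet as defaults ($20 per hour, one hour saved, $10,000 minimum development cost); the function names are hypothetical, and the break-even figure is a simplification that ignores ongoing costs such as item bank maintenance.

```python
import math

def seat_time_savings(examinees, hours_saved=1.0, hourly_rate=20.0):
    """Proctoring seat cost saved by shortening the test."""
    return examinees * hours_saved * hourly_rate

def break_even_examinees(development_cost=10_000.0, hours_saved=1.0, hourly_rate=20.0):
    """Number of examinees needed for seat-time savings to cover development."""
    return math.ceil(development_cost / (hours_saved * hourly_rate))

print(seat_time_savings(1_000))    # 20000.0
print(seat_time_savings(200_000))  # 4000000.0
print(break_even_examinees())      # 500
```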
Adaptive testing: Resources for further reading
Visit the links below to learn more about adaptive testing.
- We recommend that you first read this landmark article by our co-founders.
- Want to learn more about the initial hurdles? Here is our white paper detailing the requirements of CAT.
- Read this article on producing better measurements with CAT from Prof. David J. Weiss.
- International Association for Computerized Adaptive Testing: www.iacat.org
- CAT Tutorial from Larry Rudner: http://edres.org/scripts/cat/catdemo.htm
- Below is a video on the history of CAT, by the godfather of CAT, Prof. David J. Weiss
How can I start developing a CAT?
Sign up below for a free account in our industry-leading CAT platform.
Nathan Thompson, PhD