Computerized adaptive tests (CATs) are a sophisticated method of test delivery based on item response theory (IRT). They operate by adapting both the difficulty and quantity of items seen by each examinee.

Difficulty
Most characterizations of adaptive testing focus on how item difficulty is matched to examinee ability: high-ability examinees receive more difficult items, while low-ability examinees receive easier items, which benefits both the examinee and the organization. An adaptive test typically begins by delivering an item of medium difficulty; an examinee who answers correctly receives a harder item, while one who answers incorrectly receives an easier item. This basic algorithm continues until the test is finished, though it usually includes subalgorithms for important concerns such as content distribution and item exposure.
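The up-and-down rule just described can be sketched in a few lines of Python. This is an illustration only: the fixed step size and the item-free difficulty scale are assumptions, and a production CAT instead selects real items by maximum information under IRT.

```python
def next_difficulty(current: float, correct: bool, step: float = 0.5) -> float:
    """Basic up-and-down rule: move difficulty up after a correct answer,
    down after an incorrect one. The step size here is arbitrary."""
    return current + step if correct else current - step

# Start at medium difficulty (0.0 on a typical IRT scale) and walk
# through a hypothetical response string.
difficulty = 0.0
for correct in [True, True, False, True, False]:
    difficulty = next_difficulty(difficulty, correct)
print(difficulty)  # 0.5
```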

Quantity
A less publicized facet of adaptation is the number of items. Adaptive tests can be designed to stop when certain psychometric criteria are reached, such as a specific level of score precision. Because the test stops as soon as that criterion is met, adaptive tests typically require about half as many questions as a conventional test while providing at least as much accuracy. Because test length differs across examinees, such adaptive tests are referred to as variable-length. This is a massive benefit: cutting testing time in half, on average, can substantially decrease testing costs. Nevertheless, some adaptive tests use a fixed length and adapt only item difficulty. This is usually done for public relations reasons, namely to avoid dealing with examinees who feel they were treated unfairly by the CAT, even though a variable-length CAT is arguably more fair and valid than a conventional test.
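A variable-length termination rule can be expressed as a simple predicate. The threshold values below are illustrative assumptions; real programs choose them through simulation studies.

```python
def should_stop(sem: float, items_given: int,
                target_sem: float = 0.30, max_items: int = 50) -> bool:
    """Stop the CAT once the standard error of measurement (SEM) is small
    enough, or once a practical ceiling on test length is reached."""
    return sem <= target_sem or items_given >= max_items

print(should_stop(sem=0.25, items_given=12))  # True: score is precise enough
print(should_stop(sem=0.42, items_given=12))  # False: keep administering items
```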

Advantages of adaptive testing

By making the test more intelligent, adaptive testing provides a wide range of benefits.  Some of the well-known advantages of adaptive testing, recognized by scholarly psychometric research, are listed below.  However, the development of an adaptive test is a very complex process that requires substantial expertise in item response theory (IRT) and CAT simulation research.  Our Ph.D. psychometricians can provide your organization with the requisite experience to implement adaptive testing and help your organization benefit from these advantages.  Contact us or read this white paper to learn more.
  • Shorter tests: anywhere from a 50% to 90% reduction in length, which reduces cost, examinee fatigue, and item exposure
  • More precise scores: matching item difficulty to examinee ability yields more accurate measurement
  • More control of score precision (accuracy): a CAT can target the same accuracy for all examinees, making the test fairer; traditional tests measure middle-ability examinees well but not those at the top or bottom
  • Increased efficiency
  • Greater test security, because examinees do not all see the same form
  • A better experience for examinees, who see only items relevant to them, providing an appropriate challenge
  • That better experience can increase examinee motivation
  • Immediate score reporting
  • More frequent retesting is possible, minimizing practice effects
  • Individual pacing of tests: examinees move at their own speed
  • On-demand testing, which reduces printing, scheduling, and other paper-based concerns
  • Immediate storage of results in a database, making data management easier
  • Computerized delivery, which facilitates the use of multimedia in items

No, you can’t just subjectively rank items!

Computerized adaptive tests (CATs) are the future of assessment. They operate by adapting both the difficulty and number of items to each individual examinee. The development of an adaptive test is no small feat, and requires five steps integrating the expertise of test content developers, software engineers, and psychometricians. The development of a quality adaptive test is not possible without a Ph.D. psychometrician experienced in both item response theory (IRT) calibration and CAT simulation research. FastTest can provide you the psychometrician and software; if you provide test items and pilot data, we can help you quickly publish an adaptive version of your test.

Step 1: Feasibility, applicability, and planning studies. First, extensive Monte Carlo simulation research must occur, and the results must be formulated as business cases, to evaluate whether adaptive testing is feasible and applicable for your program.
Step 2: Develop item bank. An item bank must be developed to meet the specifications recommended by Step 1.
Step 3: Pretest and calibrate item bank. Items must be pilot tested on 200-1,000 examinees (depending on the IRT model) and analyzed by a Ph.D. psychometrician.
Step 4: Determine specifications for final CAT. Data from Step 3 are analyzed to evaluate CAT specifications and determine the most efficient algorithms, using CAT simulation software such as CATSim.
Step 5: Publish live CAT. The adaptive test is published in a testing engine capable of fully adaptive tests based on IRT. Want to learn more about our one-of-a-kind model? Click here to read the seminal article by two of our psychometricians. More adaptive testing research is available here.

What do I need for adaptive testing?

Minimum requirements:

  • A large item bank piloted on at least 500 examinees
  • 1,000 examinees per year
  • Specialized IRT calibration and CAT simulation software
  • Staff with a Ph.D. in psychometrics or an equivalent level of experience (or leverage our internationally recognized expertise in the field)
  • Items (questions) that can be scored objectively correct/incorrect in real time
  • Item banking system and CAT delivery platform
  • Financial resources: Because it is so complex, development of a CAT will cost at least $10,000 (USD) — but if you are testing large volumes of examinees, it will be a significant positive investment.


Visit the links below to learn more about adaptive testing.  We also recommend that you first read this landmark article by our industry-leading psychometricians and our white paper detailing the requirements of CAT.  Additionally, you will likely be interested in this article on producing better measurements with CAT.

ASC has been empowering organizations to develop better assessments since 1979.  Curious as to how things were back then?  Below is a copy of our newsletter from 1988, long before the days of sharing news via email and social media!  Our platform at the time was named MICROCAT.  It was later modernized as FastTest PC (Windows), then FastTest Web, and is now being reincarnated yet again as Ada.

Special thanks to Cliff Donath for finding and sharing!

MicroCAT News, April 1988

Fraudulent testing data is everywhere. In academic testing, students cheat by looking at other students’ responses or informing their friends in the next section what questions are on the test. In professional credentialing, candidates will sit for the exam simply to steal the content for posting on brain dump sites, while other candidates purchasing the content from these sites never pause to consider the ethical ramifications of trading in stolen property.

Threats to test security are also threats to validity and, by extension, the entire existence and integrity of the assessment. What’s worse? The greater the stakes, the greater the incentive to cheat. Has your organization ever taken a deep dive into your assessment data to search for evidence of cheating or other invalid behavior?

Dr. Nathan Thompson, Assessment Systems co-founder and VP of Psychometrics, has long recognized the value of psychometric forensics to an assessment program, but also the lack of software to implement it. Because of this, Dr. Thompson developed Software for Investigating Fraud in Testing (SIFT) in 2016.

“The software is easy to run because of its friendly UI, but the results are so complex that only a small percentage of Ph.D. psychometricians can understand the output,” Dr. Thompson said.

That is why Assessment Systems is proud to offer Psychometric Forensics service, leveraging Dr. Thompson’s expertise (and our love for test security) to bring this customized consulting to organizations who wish to protect the integrity of their assessments.

“The cliché holds true here: an ounce of prevention is worth a pound of cure,” Dr. Thompson said. “We can work with you to identify areas of concern and explore policies, procedures, and practices that will help you.”

If you provide us a dataset, we’ll analyze it with a range of collusion indices and other statistics, evaluating your examinees individually as well as groups such as test centers or classrooms. ASC’s mission is to improve the quality of as many assessments as we can.

Psychometrics is the cornerstone of any high-quality assessment program.  Most organizations do not have an in-house PhD psychometrician, which then necessitates the search for psychometric consulting.  Most organizations, when first searching, are new to the topic and not sure what role the psychometrician plays.  In this article, we’ll talk about how psychometricians and their tools can help improve your assessments, whether you just want to check on test reliability or pursue the lengthy process of accreditation.

Why ASC?

Whether you are establishing or expanding a credentialing program, streamlining operations, or moving from paper to online testing, ASC has a proven track record of providing practical, cost-efficient solutions with uncompromising quality. We offer a free consultation with our team of experts to discuss your needs and determine which solutions are the best fit, including our enterprise SaaS platforms, consulting on sound psychometrics, or recommending you to one of our respected partners.
 

At the heart of our business is our people.

Our collaborative team of Ph.D. psychometricians, accreditation experts, and software developers has diverse experience developing solutions that drive best practices in assessment. This real-world knowledge enables us to provide your organization with solutions tailored specifically to your goals, timeline, and budget.
 

Comprehensive Solutions to Address Specific Measurement Problems

Much of psychometric consulting is project-based around solving a specific problem.  For example, you might be wondering how to set a cutscore on a certification/licensure exam that is legally defensible and meets accreditation standards.  This is a very specific issue, and the scientific literature has suggested a number of sound approaches.  Here are some of the topics where psychometricians can really help:

  • Test Design: Job Analysis & Blueprints
  • Standard and Cutscore Setting Studies
  • Item Writing and Review Workshops
  • Test and Item Statistical Analysis
  • Equating Across Years and Forms
  • Adaptive Testing Research
  • Test Security Evaluation
  • NCCA/ANSI Accreditation

 

Why psychometric consulting?

All areas of assessment can be smarter, faster and fairer.

Develop Reliable and Valid Assessments
We’ll help you understand what needs to be done to develop defensible tests and how to implement them in a cost-efficient manner.  Much of the work revolves around establishing a sound test development cycle.

Increase Test Security
We have specific expertise in psychometric forensics, allowing you to flag suspicious candidates or groups in real time, using our automated forensics report.

Achieve Accreditation
Our dedicated experts will assist in setting your organization up for success with NCCA/ANSI accreditation of professional certification programs.

Comprehensive Psychometric Analytics
We use CTT and IRT with principles of machine learning and AI to deeply understand your data and provide actionable recommendations.

We can help your organization develop and publish certification and licensure exams, based on best practices and accreditation standards, in a matter of months.

If you’re looking for a way to add these best practices to your assessments, here’s how:

Item and Test Statistical Analysis
If you are not doing this process at least annually, you are not meeting best practices or accreditation standards. But don't worry, we can help! In addition to having us perform these analyses for you, you have the option of running them yourself in our FastTest platform or using our psychometric software, Iteman and Xcalibre.

Job Analysis
How do you know what a professional certification test should cover?  Well, let’s get some hard data by surveying job incumbents. Knowing and understanding this information and how to use it is essential if you want to test people on whether they are prepared for the job or profession.

Cutscore Studies (Standard Setting)
Sound psychometric practices such as the modified-Angoff, Beuk Compromise, Bookmark, and Contrasting Groups methods will help you establish a cutscore that meets professional standards.
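As a rough sketch of the modified-Angoff arithmetic: each judge rates, for every item, the probability that a minimally competent candidate answers it correctly; ratings are averaged across judges and summed across items to produce a raw cutscore. The ratings below are entirely hypothetical.

```python
# Each row is one judge's ratings across four items (hypothetical values:
# probability a minimally competent candidate answers each item correctly).
ratings = [
    [0.60, 0.75, 0.50, 0.80],  # Judge 1
    [0.55, 0.70, 0.60, 0.85],  # Judge 2
    [0.65, 0.80, 0.55, 0.75],  # Judge 3
]

n_judges = len(ratings)
# Average across judges for each item, then sum across items.
item_means = [sum(col) / n_judges for col in zip(*ratings)]
cutscore = sum(item_means)
print(round(cutscore, 2))  # 2.7, i.e. about 2.7 of 4 raw points
```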

 

It’s all much easier if you use the right software!

Once we help you determine the best solutions for your organization, we can train you on best practices, and it’s extremely easy to use our software yourself.  Software like Iteman and Xcalibre is designed to replace much of the manual work done by psychometricians for item and test analysis, and FastTest automates many aspects of test development and publishing.  We even offer free software like the Angoff Analysis Tool.  However, our ultimate goal is your success: Assessment Systems is a full-service company that continues to provide psychometric consulting and support even after you’ve made a purchase. Our team of professionals is available to provide you with additional support at any point in time. We want to ensure you’re getting the most out of our products!  Click below to sign up for a free account in FastTest and see for yourself.

 

Computerized adaptive testing (CAT) has been around since the 1970s and is well-known for the benefits it can provide, most notably that it can reduce testing time 50-90% with no loss of measurement precision.  Developing a sound, defensible CAT is not easy, but our goal is to make it as easy as possible: everything you need is available in a clean software UI, and you never have to write a single line of code.  Here, we outline the software, data analysis, and project management steps needed to develop a computerized adaptive test that aligns with best practices and international standards.

This approach is based on the Thompson and Weiss (2011) model; refer there for a general treatment of CAT development, especially the use of simulation studies.  This article also assumes you have mastered the concepts of item response theory and CAT, including:

  • IRT models (e.g., 3PL, rating scale model)
  • Item response functions
  • Item information functions
  • Theta estimation
  • Conditional standard error of measurement
  • Item selection algorithms
  • Termination criterion

 

If IRT is new to you, please visit these resources

https://assess.com/what-is-item-response-theory/

https://assess.com/how-do-i-implement-item-response-theory/

https://assess.com/what-do-dichotomous-and-polytomous-mean-in-irt/

 

If you have some background in IRT but CAT is new, please visit these resources

https://assess.com/monte-carlo-simulation-adaptive-testing/

https://assess.com/adaptive-testing/

 

And for videos that delve more deeply into IRT/CAT,

https://www.youtube.com/user/ASCpsychometrics

 

Overview: Steps to develop an adaptive test

There are nine steps to developing a CAT on our industry-leading platform, FastTest:

Step   Work to be done                                  Software
1      Perform feasibility and planning studies         CATSim
2      Develop item bank                                FastTest
3      Pilot items on 100-2000 examinees                FastTest
4      Perform item analysis and other due diligence    Iteman/Xcalibre
5      IRT calibration                                  Xcalibre
6      Upload IRT parameters into FastTest              FastTest
7      Validity study                                   CATSim
8      Publish CAT                                      FastTest
9      Quality assurance                                FastTest

 

We’ll now talk a little more about each of these.

 

Perform feasibility and planning studies

The first step, before doing anything else, is to confirm that your assessment meets the basic requirements of CAT.  For example, you need a decent-sized item bank, data on hundreds of examinees (or the future opportunity to gather it), and items that are scoreable in real time.  See this paper for a full discussion.  If there are no roadblocks, the next step is to perform Monte Carlo simulations that help you scope out the project, using the CATSim software.  For example, you might simulate CATs with three sizes of item bank, so you have a better idea of how many items to write.

 

Develop item bank

Now that you have some idea of how many items you need, and in which ranges of difficulty and/or content constraints, you can leverage the powerful item authoring functionality of FastTest, as well as its item review and workflow management, to ensure that subject matter experts perform quality assurance on each other's work.

 

Pilot items

Because IRT requires data from real examinees to calibrate item difficulty, you need to collect that data.  To do so, create one or more tests in FastTest to deliver your items in a manner that fits your practical situation. Some organizations have a captive audience and might be able to have 500 people take all 300 items in their bank next week.  Other organizations might need to create 4 linear forms of 100 items with some overlap. Others might be constrained to continue using current test forms and only tack 20 new items onto the end of every examinee's test.

Of course, some of you might have existing data.  That is, you might have spreadsheets of data from a previous test delivery system, paper based delivery, or perhaps even already have your IRT parameters from past efforts.  You can use those too.

If you do deliver the pilot phase with FastTest, you now need to export the data to be analyzed in psychometric analytic software.  FastTest makes it easy to export both the data matrix and the item metadata needed for Xcalibre’s control file.

 

Perform item analysis, DIF, and other due diligence

The purpose of this step is to ensure that the items included in your future CAT are of high quality.  Any steps your organization normally takes to review item performance are still relevant. This typically includes a review of items with low point-biserial correlations (poor discrimination), items where more examinees selected a distractor than the correct option (key flags), items with very high or low classical P values, and differential item functioning (DIF) flags.  Our Iteman software is designed exactly for this process. If you have a FastTest account, the Iteman analysis report is available at a single click. If not, Iteman is also available as a standalone program.
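For readers curious what a point-biserial flag actually involves, here is a minimal sketch. The 0.20 review threshold and the tiny data set are illustrative assumptions, not universal standards.

```python
from math import sqrt

def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item)
    mi, mt = sum(item) / n, sum(totals) / n
    cov = sum((x - mi) * (y - mt) for x, y in zip(item, totals)) / n
    sd_i = sqrt(sum((x - mi) ** 2 for x in item) / n)
    sd_t = sqrt(sum((y - mt) ** 2 for y in totals) / n)
    return cov / (sd_i * sd_t)

# Hypothetical responses to one item (1 = correct) and total test scores.
item = [1, 1, 0, 1, 0, 1]
totals = [9, 8, 3, 7, 4, 8]
rpbis = point_biserial(item, totals)
flagged = rpbis < 0.20  # flag weakly discriminating items for review
```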

 

Calibrate with Xcalibre

Because CAT algorithms rely entirely on IRT parameters (unless you are using special approaches like diagnostic measurement models or measurement decision theory), we need to calculate the IRT parameters and get them into the testing platform.  If you delivered all your items in a single block to examinees, like the example above with 500 people, then that single matrix can simply be analyzed with Xcalibre. If you have multiple forms, LOFT (linear-on-the-fly testing), or the “tack-on” approach, you will also need to address IRT equating.

 

Upload IRT parameters into FastTest

Xcalibre will provide all the IRT parameters in a spreadsheet, in addition to the primary Word report.  Import them into your testing platform.  This will associate the IRT parameters with all the items in your CAT pool.  FastTest has functionality to streamline this process.

 

Validity study

Now that you have established your final pool of items and calculated the IRT parameters, you need to establish the algorithms you will use to publish the CAT.  That is, you need to decide on the Initial Theta rule, Item Selection rule (including subalgorithms like content or exposure constraints), and Termination Criterion. To establish these, you perform more simulation studies, but now with your final bank as the input rather than a fake bank from the Monte Carlo simulations.  The most important aspect is determining the tradeoff between test length and precision; a termination criterion that demands more precise scores will produce longer tests, and with a CAT you can control the exact tradeoff.

 

Publish CAT

Assemble a “test form” in FastTest that consists of all the items you intend to use in your CAT pool.  Then select CAT as the delivery method in the Test Options screen, and you’ll see a screen where you can input the results from your CATSim validity study for the three important CAT algorithms.

[Screenshot: adaptive test options in FastTest]

Quality assurance

Your CAT is now ready to go!  Before bringing in real students, however, we recommend that you take it a few times yourself as QA.  Do so with certain profiles in mind, such as a very low-ability student, a very high-ability student, or one near the cutscore (if you have one).  To peek under the hood at the CAT algorithm, you can export the Examinee Test Detail Report from FastTest, which provides an item-by-item picture of how the CAT proceeds.

[Screenshot: adaptive test report]

 

Summary

As you can see, the development of an adaptive test is not easy, and can take months even if you have all the software and expertise you need.  For something so important, which could be used to make important decisions about people, that investment is absolutely warranted.  However, once you have all the data you need, publishing the adaptive test itself should not take months; an assessment platform should make it easy enough to do in an afternoon, which FastTest absolutely does.

Want to talk with one of our experts about applying this process to your exam?  Get in touch or sign up for a free account!

Simulation studies are an essential step in the development of a computerized adaptive test (CAT) that is defensible and meets the needs of your organization or other stakeholders.  There are three types of simulations: Monte Carlo, real-data (post hoc), and hybrid.  Monte Carlo simulation is the most general-purpose approach, and the one most often used early in the process of developing a CAT.  This is because it requires no actual data on test items or examinees (although real data are welcome if available), which makes it extremely useful for evaluating whether CAT is even feasible for your organization before any money is invested in moving forward.  Let's begin with an overview of how Monte Carlo simulation works before returning to that point.

First of all, what do we mean by CAT simulation?  Well, a CAT is a test that is administered to students via an algorithm.  We can use that same algorithm on imaginary examinees, or real examinees from the past, and simulate how well a CAT performs on them.  Best of all, we can change the specifications of the algorithm to see how it impacts the examinees and the CAT performance.

Each simulation approach requires three things:

  1. Item parameters from item response theory, though new CAT methods such as diagnostic models are now being developed
  2. Examinee scores (theta) from item response theory
  3. A way to determine how an examinee responds to an item if the CAT algorithm says it should be delivered to the examinee.

The Monte Carlo approach is defined by how it addresses the third requirement: it generates each response using a mathematical model, whereas the real-data approach looks up the actual responses of past examinees and the hybrid approach mixes the two.  In the Monte Carlo approach, the item parameters can come either from a bank of actual items or be generated; likewise, the examinee thetas can come from a database of past data or be generated.

How does the response generation process work? 

Well, it differs based on the model that is used as the basis for the CAT algorithm.  Here, let’s assume that we are using the three-parameter logistic model.  Start by supposing we have a fake examinee with a true theta of 0.0.  The CAT algorithm looks in the bank and says that we need to administer item #17 as the first item, which has the following item parameters: a=1.0, b=0.0, and c=0.20.  Well, we can simply plug those numbers into the equation for the three parameter model and obtain the probability that this person would correctly answer this item.

P(θ) = c + (1 - c) / (1 + exp(-D a (θ - b)))

The probability in this case is 0.6.  The next step is to generate a random number from a uniform distribution between 0.0 and 1.0.  If that number is less than the probability of a correct response, the simulated examinee “gets” the item correct; if greater, the examinee gets it incorrect.  Either way, the examinee is scored and the CAT algorithm proceeds.

For every item that comes up to be used, we utilize this same process.  Of course, the true theta does not change, but the item parameters are different for each item.  Each time, we generate a new random number and compare it to the probability to determine a response of correct or incorrect.  The CAT algorithm proceeds as if a real examinee is on the other side of the computer screen, actually responding to questions, and stops whenever the termination criterion is satisfied.  However, the same process can be used to “deliver” linear exams to examinees; instead of the CAT algorithm selecting the next item, we just process sequentially through the test.
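The worked example above translates directly into code. This sketch assumes the common scaling constant D = 1.7; with θ = b the exponent is zero regardless, so the probability matches the 0.6 in the text.

```python
import math
import random

def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic model: probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def simulate_response(theta, a, b, c, rng=random.random):
    """Monte Carlo response generation: correct if a uniform draw is below P."""
    return rng() < p_3pl(theta, a, b, c)

# The example from the text: theta = 0.0, a = 1.0, b = 0.0, c = 0.20.
p = p_3pl(0.0, 1.0, 0.0, 0.20)
print(round(p, 2))  # 0.6
```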

A road to research

For a single examinee, this process is not much more than a curiosity.  Where it becomes useful is at a large-scale, aggregate level.  Imagine the process above as part of a much larger loop.  First, we establish a pool of 200 items pulled from items used in the past by your program.  Next, we generate a set of 1,000 examinees by pulling numbers from a random distribution.  Finally, we loop through each examinee and administer a CAT, using the CAT algorithm and generating responses with the Monte Carlo process.  We then have extensive data on how the CAT algorithm performed, which can be used to evaluate the algorithm and the item bank.  The two most important outcomes are the length of the CAT and its accuracy, which trade off against each other in most cases.
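A toy version of that outer loop is below, with generated item parameters, generated thetas, and a deliberately crude item-selection and scoring rule. Real simulations (and software like CATSim) would use maximum-information selection and maximum-likelihood or Bayesian theta estimation; everything here is a simplified stand-in.

```python
import math
import random

def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic model: probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

random.seed(0)

# Generate a hypothetical 200-item bank and 1,000 simulated examinees.
bank = [{"a": random.uniform(0.5, 2.0), "b": random.uniform(-3, 3), "c": 0.2}
        for _ in range(200)]
examinees = [random.gauss(0, 1) for _ in range(1000)]

results = []
for true_theta in examinees:
    est, used = 0.0, set()
    for k in range(20):  # fixed 20-item test, for brevity
        # Crude selection: the unused item whose difficulty is nearest the
        # current estimate (a stand-in for maximum-information selection).
        i = min((j for j in range(len(bank)) if j not in used),
                key=lambda j: abs(bank[j]["b"] - est))
        used.add(i)
        itm = bank[i]
        correct = random.random() < p_3pl(true_theta, itm["a"], itm["b"], itm["c"])
        est += (1.0 if correct else -1.0) / (k + 1)  # shrinking-step update
    results.append((true_theta, est))

# Aggregate accuracy of the simulated CAT: mean absolute estimation error.
mae = sum(abs(t - e) for t, e in results) / len(results)
```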

So how is this useful for evaluating the feasibility of CAT?  You can evaluate the performance of the CAT algorithm by setting up an experiment to compare different conditions.  Suppose you don't have past items and are not even sure how many items you need: you can create several different fake item banks and administer a CAT to the same set of fake examinees.  Or you might know the item bank to be used, but need to establish that a CAT will outperform the linear tests you currently use.  There is a wide range of research questions you can ask, and since all the data are being generated, you can design a study to answer many of them.  In fact, one of the greatest problems you might face is getting carried away and creating too many conditions!

How do I actually do a monte carlo simulation study?

Fortunately, there is software to do all the work for you.  The best option is CATSim, which provides all the options you need in a straightforward user interface (beware: this makes it even easier to get carried away).  The advantage of CATSim is that it collates the results for you and presents most of the summary statistics you need without your having to calculate them.  For example, it calculates the average test length (number of items used by a variable-length CAT) and the correlation of CAT thetas with true thetas.  Other software exists that is useful for generating data sets via Monte Carlo simulation (see SimulCAT), but it does not include this important feature.

The traditional Learning Management System (LMS) is designed to serve as a portal between educators and their learners. Platforms like Moodle are successful in facilitating cooperative online learning in a number of groundbreaking ways: course management, interactive discussion boards, assignment submissions, and delivery of learning content. While all of this is great, we've yet to see an LMS that implements best practices in assessment and psychometrics to ensure that medium- or high-stakes tests meet international standards.

To put it bluntly, LMS systems have assessment functionality that is usually good enough for short classroom quizzes but falls far short of what is required for a test that is used to award a credential.  A white paper on this topic is available here, but some examples include:

  • Treatment of items as reusable objects
  • Item metadata and historical use
  • Collaborative item review and versioning
  • Test assembly based on psychometrics
  • Psychometric forensics to search for non-independent test-taking behavior
  • Deeper score reporting and analytics

Assessment Systems is pleased to announce the launch of an easy-to-use bridge between FastTest and Moodle that will allow users to seamlessly deliver sound assessments from within Moodle while taking advantage of the sophisticated test development and psychometric tools available within FastTest. In addition to seamless delivery for learners, all candidate information is transferred to FastTest, eliminating the examinee import process.  The bridge makes use of the international Learning Tools Interoperability standards.

If you are already a FastTest user, watch a step-by-step tutorial on how to establish the connection, in the FastTest User Manual by logging into your FastTest workspace and selecting Manual in the upper right-hand corner. You’ll find the guide in Appendix N.

If you are not yet a FastTest user and would like to discuss how it can improve your assessments while still allowing you to leverage Moodle or other LMS systems for learning content, sign up for a free account here.

As we jump headfirst into 2018, we’re reflecting on our successes from the past year. One such success was our inclusion in the Minneapolis/St. Paul Business Journal’s list of Best Places to Work in 2017. We’re honored to be recognized!


So, what makes Assessment Systems one of the best places to work?

Though founded in 1979, we run our company with the mindset and energy of a startup. This means we have a strong foundation on which to create world-class software, but at the same time, we’re constantly innovating, working with the newest technologies and taking risks.

Our leadership team drives this startup mentality, which encourages employees to constantly be on their toes. With experts in a variety of areas, including assessment, psychometrics, entrepreneurship, and tech, not only do all team members play an important role in the business, they also have a real opportunity to make a difference.

We have great company values.

Furthermore, it’s easy for our employees to be inspired every day due to our company’s values. Our CEO stresses the importance of doing the right thing and being kind, which everyone on the team is proud to stand behind. Principles such as these are fundamental to the success of our employees. Ask anyone who’s partnered with us and they’ll tell you that we’re a small company with a big heart that wants to provide the best product and service to our clients.


Last, but certainly not least, we love what we do!

Our unique company culture, diverse team, and values make it easy to love where we work. As a result, we’re all the more motivated to make a difference in our industry and to continue improving our company culture even more.

We may not have a big team, but we have incredible skill sets and a collaborative environment where we rely on each other to make great things happen. We are a small company that is changing the way people test online and improving the world one test at a time.


Sound interesting? Check out our careers page or learn more about what we’re doing at assess.com.

Computerized adaptive testing (CAT) is a powerful paradigm for delivering tests that are smarter, faster, and fairer than the traditional linear approach.  However, CAT is not without its challenges.  One is that it is a greedy algorithm, always selecting the best items from the pool if it can.  CAT researchers address this issue with item exposure controls: subalgorithms injected into the main item selection algorithm to keep it from always using the best items. The Sympson-Hetter method is one such approach.

The simplest approach is called the randomesque method.  This selects from the top X items in terms of item information (a term from item response theory), usually for the first Y items in a test.  For example, instead of always selecting the top item, the algorithm finds the 3 top items and then randomly selects between those.
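As a rough sketch, the randomesque rule might look like the following. The data structure and function name here are illustrative, not from any particular CAT platform; item information is assumed to have already been computed at the examinee's current ability estimate.

```python
import random

def randomesque_select(candidate_items, x=3):
    """Randomesque exposure control: rather than always administering
    the single most informative item, choose at random among the top x.

    candidate_items: list of (item_id, information) pairs, where
    information is the item's Fisher information at the examinee's
    current ability estimate.
    """
    # Rank items from most to least informative
    ranked = sorted(candidate_items, key=lambda item: item[1], reverse=True)
    # Pick at random among the top x
    return random.choice(ranked[:x])

pool = [("A", 2.1), ("B", 1.8), ("C", 1.6), ("D", 0.9), ("E", 0.3)]
chosen = randomesque_select(pool, x=3)  # one of A, B, or C
```

In practice this rule is often applied only for the first few items of the test, where everyone's ability estimates are similar and the same items would otherwise be selected for every examinee.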

The Sympson-Hetter Method

A more sophisticated method is the Sympson-Hetter method.  Here, the user specifies a target proportion as a parameter for the selection algorithm.  For example, we might decide that we do not want an item seen by more than 75% of examinees.  So, every time that the CAT algorithm goes into the item pool to select a new item, we generate a random number between 0 and 1, which is then compared to the threshold.  If the number is between 0 and 0.75 in this case, we go ahead and administer the item.  If the number is from 0.75 to 1.0, we skip over it and go on to the next most informative item in the pool, though we then do the same comparison for that item.
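A minimal sketch of that selection loop is below. The names and structures are illustrative, and a single target proportion is assumed for every item; a real implementation would sit inside a full CAT engine.

```python
import random

def sympson_hetter_select(ranked_items, target=0.75):
    """Sympson-Hetter exposure control (sketch).

    ranked_items: item IDs sorted from most to least informative at
    the examinee's current ability estimate.
    target: the exposure parameter; an item survives its check with
    this probability, otherwise we move on to the next best item.
    """
    for item_id in ranked_items:
        # Draw a uniform number in [0, 1); administer the item only
        # if the draw falls below the target proportion.
        if random.random() < target:
            return item_id
    # Every item failed its check; fall back to the most informative.
    return ranked_items[0]
```

With `target=0.75`, each candidate item is skipped about a quarter of the times it comes up for selection.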

Why do this?  It obviously limits the exposure of the item.  But just how much it limits it depends on the difficulty of the item.  A very difficult item is likely only going to be a candidate for selection for very high ability examinees.  Let’s say it’s the top 4%… well, then the approach above will limit it to 3% of the sample overall, but 75% of the examinees in its neighborhood.
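That arithmetic can be checked with a quick Monte Carlo sketch. The 4% candidacy rate and 0.75 parameter are the illustrative numbers from above, not real calibration data.

```python
import random

def simulate_exposure(candidate_rate, k_param, n_examinees=200_000):
    """Estimate an item's overall exposure rate when it is a selection
    candidate for `candidate_rate` of examinees and then survives its
    Sympson-Hetter check with probability `k_param`."""
    exposed = sum(
        1
        for _ in range(n_examinees)
        if random.random() < candidate_rate and random.random() < k_param
    )
    return exposed / n_examinees

# A very hard item: a candidate for ~4% of examinees, parameter 0.75.
rate = simulate_exposure(candidate_rate=0.04, k_param=0.75)
# rate comes out near 0.03: 75% exposure within the item's
# neighborhood, about 3% exposure overall.
```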

On the other hand, an item of middle difficulty is used not only for middle examinees, but often for any examinee.  Remember, unless there are some controls, the first item for the test will be the same for everyone!  So if we apply the Sympson-Hetter rule to that item, it limits it to 75% exposure in a more absolute sense.

Because of this, you don’t have to set that threshold parameter to the same value for each item.  The original recommendation was to do some CAT simulation studies, then set the parameters thoughtfully for different items.  Items that are likely to be highly exposed (middle difficulty with high discrimination) might deserve a more strict parameter like 0.40.  On the other hand, that super-difficult item isn’t an exposure concern because only the top 4% of students see it anyway… so we might leave its parameter at 1.0 and therefore not limit it at all.
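In code, per-item control just means looking up an item-specific parameter instead of one global value. The item labels and parameter values here are hypothetical, echoing the examples above.

```python
import random

# Hypothetical per-item exposure parameters, tuned via CAT simulation
# studies: a likely-overexposed item gets a strict value, while a
# rarely seen item is left at 1.0 (unlimited).
K_PARAMS = {
    "mid_difficulty_high_discrimination": 0.40,
    "typical_item": 0.75,
    "super_difficult_item": 1.00,  # only top examinees see it anyway
}

def passes_exposure_check(item_id):
    """Return True if the item survives its Sympson-Hetter draw.
    Items without a tuned parameter default to 1.0 (never limited)."""
    return random.random() < K_PARAMS.get(item_id, 1.0)
```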

Is this the only method available?

No.  As mentioned, there’s that simple randomesque approach.  But there are plenty more.  You might be interested in this paper, this paper, or this paper.  The last one reviews the research literature from 1983 to 2005.

What is the original reference?

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

How can I apply this to my tests?

Well, you certainly need a CAT platform first.  Our platform at ASC allows this method right out of the box – that is, all you need to do is enter the target proportion when you publish your exam, and the Sympson-Hetter method will be implemented.  No need to write any code yourself!  Click here to sign up for a free account.

Desperation is seldom fun to see.

Some years ago, having recently released our online marking functionality, I was reviewing a customer workspace when I was intrigued to see “Beyonce??” mentioned in a marker’s comments on an essay. The student’s essay was evaluating some poetry and had completely misunderstood the use of metaphor in the poem in question. The student also clearly knew that her interpretation was way off, but didn’t know how, and had reached the end of her patience. So after a desultory attempt at answering, with a cry from the heart reminiscent of William Wallace’s call for freedom, she wrote “BEYONCE” with about seventeen exclamation points. It felt good to see that her spirit was not broken, and it was a moment of empathy that drove home the damage that standardized tests are inflicting on our students. That vignette plays itself out millions of times each year in this country; the following explains why.

What are “Standardized Tests”?

We use standardized tests for a variety of reasons, but underlying every one (curriculum effectiveness, college/career preparedness, teacher effectiveness, etc.) is the understanding that the test measures what a student has learned. In order to know how all our students are doing, we give them all standardized tests, meaning every student takes essentially the same test. This is a difficult endeavor given the wide range of students and the number of tests, and it raises the question “How do we do this reliably and in a reasonable amount of time?”

Accuracy and Difficulty vs Length

We all want tests to reliably measure students’ learning. To make a test reliable, we need to supply questions of varying difficulty, from very easy to very difficult, to cover a wide range of abilities. To keep the test from becoming too long, most of the questions fall in the medium-easy to medium-difficult range, because that is where most students’ ability levels fall. So the test that best balances length and accuracy for the whole population is constructed such that the number of questions at any difficulty level is proportionate to the number of students at that ability level.

Why are most questions in the medium difficulty range? Imagine creating a test to measure 10th graders’ math ability. A small number of the students might have a couple of years of calculus. If the test covered those topics, imagine the experience of most students, who would often not even understand the notation in the question. Frustrating, right? On the other hand, if the test were also constructed to measure students with only rudimentary math knowledge, the average to advanced students would be frustrated and bored by answering a lot of questions on basic math facts. The solution most organizations use is to present only a few questions that are really easy or really difficult, and to accept that the score is not as accurate as they would prefer for students at either end of the ability range.

These Tests are Inaccurate and Mean Spirited

The problem is that while this might work OK for a lot of kids, it exacts a pretty heavy toll on others. Almost one in five students will not know the answer to 80% of the questions on these tests, and scoring about 20% on a test certainly feels like failing. It feels like failing every time a student takes such a test. Over the course of an academic career, students in the bottom quintile will guess on or skip 10,000 questions. That is 10,000 times the student is told that school, learning, or success is not for them. Even biasing the test to be easier only makes a slight improvement.

[Figure: bell curve of test performance. The shaded area represents students who will miss at least 80% of questions.]

It isn’t necessarily better for the top students, whose every testing experience assures them that they are already very successful, when the reality is that they are likely being outperformed by a significant percentage of their future colleagues.

In other words, at both ends of the bell curve, we are serving our students very poorly, inadvertently encouraging lower-performing students to give up (there is some evidence that the two correlate) and higher-performing students to take it easy. It is no wonder that people dislike standardized tests.

There is a Solution

A computerized adaptive test (CAT) solves all of the problems outlined above. Properly constructed, a CAT is faster, fairer, and more valid:

  • Every examinee completes the test in less time (fast)
  • Every examinee gets a more accurate score (valid)
  • Every examinee receives questions tuned to their ability, so each gets about half right (fair)

Given all the advantages of CAT, it may seem hard to believe that it is not used more often. While it is starting to catch on, adoption is not happening fast enough given the heavy toll that the old methods exact on our students. It is true that few testing providers can deliver CATs, but that is no excuse. If a standardized test is delivered to as few as 500 students, it can be made adaptive. It probably isn’t, but it could be. All that is needed are computers or tablets, an Internet connection, and some effort. We should expect more.

How can my organization implement CAT?

While CAT used to be feasible only for large organizations that tested hundreds of thousands or millions of examinees per year, a number of advances have changed this landscape.  If you’d like to do something about your test, it might be worthwhile to evaluate CAT.  We can help you with that evaluation; if you’d like to chat, here is a link to schedule a meeting. Or, if you’d like to discuss the math or related ideas, please drop me a note.