Computerized adaptive tests (CATs) are a sophisticated method of test delivery based on item response theory (IRT). They operate by adapting both the difficulty and quantity of items seen by each examinee.

Difficulty
Most characterizations of adaptive testing focus on how item difficulty is matched to examinee ability: high-ability examinees receive more difficult items, while low-ability examinees receive easier items, which benefits both the examinee and the organization. An adaptive test typically begins by delivering an item of medium difficulty; an examinee who answers correctly receives a more difficult item, while one who answers incorrectly receives an easier item. This basic algorithm continues until the test is finished, though it usually includes sub-algorithms for important concerns like content distribution and item exposure.
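
To make that loop concrete, here is a minimal sketch in R. The bank structure, the answer_item() scoring function, and the one-step difficulty ladder are all illustrative assumptions; operational CATs instead select the item with maximum IRT information at the current ability estimate.

```r
# Minimal sketch of the basic adaptive loop. Assumptions: "bank" is a data
# frame with columns item_id and difficulty, sorted by difficulty, and
# answer_item() scores a response in real time. A real implementation would
# use IRT information for selection and skip already-administered items.
run_basic_cat <- function(bank, answer_item, n_items = 20) {
  current <- which.min(abs(bank$difficulty - median(bank$difficulty)))  # start medium
  administered <- integer(0)
  for (i in seq_len(n_items)) {
    administered <- c(administered, current)
    correct <- answer_item(bank$item_id[current])  # TRUE if answered correctly
    # harder item after a correct answer, easier after an incorrect one
    current <- if (correct) min(current + 1, nrow(bank)) else max(current - 1, 1)
  }
  bank$item_id[administered]
}
```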

Quantity
A less publicized facet of adaptation is the number of items. Adaptive tests can be designed to stop when certain psychometric criteria are reached, such as a specific level of score precision. Some examinees finish very quickly with few items; on average, an adaptive test needs only about half as many questions as a conventional test while providing at least as much accuracy. Because test length then differs across examinees, these adaptive tests are referred to as variable-length. Obviously, this makes for a massive benefit: cutting testing time in half, on average, can substantially decrease testing costs. Nevertheless, some adaptive tests use a fixed length and adapt only item difficulty. This is usually done for public relations reasons, namely the inconvenience of dealing with examinees who feel they were treated unfairly by the CAT, even though it is arguably more fair and valid than conventional tests.
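
As an illustration of a variable-length stopping rule, the open-source R package catR can simulate a CAT that ends once the standard error of the ability estimate falls below a threshold. A minimal sketch, assuming a simulated 2PL item bank:

```r
# Sketch of a variable-length CAT simulation with the catR package,
# stopping when the standard error of theta drops below 0.30.
library(catR)
bank <- genDichoMatrix(items = 300, model = "2PL")  # simulated 2PL item bank
res <- randomCAT(trueTheta = 0.5, itemBank = bank,
                 start = list(theta = 0),                     # begin at medium difficulty
                 test  = list(method = "BM"),                 # Bayes modal theta estimates
                 stop  = list(rule = "precision", thr = 0.30),
                 final = list(method = "BM"))
length(res$testItems)  # number of items this simulated examinee needed
res$thFinal            # final ability estimate
```

Running such simulations across a population of true thetas is exactly the kind of Monte Carlo research used to evaluate whether variable-length CAT will work for a given item bank.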

Advantages of adaptive testing

By making the test more intelligent, adaptive testing provides a wide range of benefits.  Some of the well-known advantages of adaptive testing, recognized by scholarly psychometric research, are listed below.  However, the development of an adaptive test is a very complex process that requires substantial expertise in item response theory (IRT) and CAT simulation research.  Our Ph.D. psychometricians can provide your organization with the requisite experience to implement adaptive testing and help your organization benefit from these advantages.  Contact us or read this white paper to learn more.
  • Shorter tests, anywhere from a 50% to 90% reduction; this reduces cost, examinee fatigue, and item exposure
  • More precise scores: CAT makes tests more accurate
  • More control of score precision (accuracy): CAT can ensure that all students receive scores of comparable accuracy, making the test much fairer; traditional tests measure middle-ability students well, but not the top or bottom students
  • Increased efficiency
  • Greater test security, because examinees are not all seeing the same form
  • A better experience for examinees, as they see only items relevant for them, providing an appropriate challenge
  • The better experience can lead to increased examinee motivation
  • Immediate score reporting
  • More frequent retesting is possible, minimizing practice effects
  • Individual pacing of tests; examinees move at their own speed
  • On-demand testing can reduce printing, scheduling, and other paper-based concerns
  • Storing results in a database immediately makes data management easier
  • Computerized testing facilitates the use of multimedia in items

No, you can’t just subjectively rank items!

Computerized adaptive tests (CATs) are the future of assessment. They operate by adapting both the difficulty and number of items to each individual examinee. The development of an adaptive test is no small feat, and requires five steps integrating the expertise of test content developers, software engineers, and psychometricians. The development of a quality adaptive test is not possible without a Ph.D. psychometrician experienced in both item response theory (IRT) calibration and CAT simulation research. FastTest can provide you the psychometrician and software; if you provide test items and pilot data, we can help you quickly publish an adaptive version of your test.

Step 1: Feasibility, applicability, and planning studies. First, extensive Monte Carlo simulation research must occur, and the results formulated as business cases, to evaluate whether adaptive testing is feasible and applicable for your program.
Step 2: Develop item bank. An item bank must be developed to meet the specifications recommended by Step 1.
Step 3: Pretest and calibrate item bank. Items must be pilot tested on 200-1000 examinees (depending on the IRT model) and analyzed by a Ph.D. psychometrician; a calibration sketch follows these steps.
Step 4: Determine specifications for final CAT. Data from Step 3 are analyzed to evaluate CAT specifications and determine the most efficient algorithms, using CAT simulation software such as CATSim.
Step 5: Publish live CAT. The adaptive test is published in a testing engine capable of fully adaptive tests based on IRT. Want to learn more about our one-of-a-kind model? Click here to read the seminal article by two of our psychometricians. More adaptive testing research is available here.
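
As promised above, here is a minimal sketch of the Step 3 calibration using the mirt package in R. The responses matrix is an assumed set of 0/1 scored pilot data; a real calibration would also include model-fit, item-fit, and DIF checks.

```r
# Sketch of Step 3: calibrating pilot data under the 2PL model with the
# mirt package. "responses" is an assumed N-examinee x k-item matrix of
# 0/1 scored responses.
library(mirt)
mod <- mirt(responses, model = 1, itemtype = "2PL")  # unidimensional 2PL
coef(mod, IRTpars = TRUE, simplify = TRUE)$items     # a (discrimination) and b (difficulty)
```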

What do I need for adaptive testing?

Minimum requirements:

  • A large item bank piloted on at least 500 examinees
  • 1,000 examinees per year
  • Specialized IRT calibration and CAT simulation software
  • Staff with a Ph.D. in psychometrics or an equivalent level of experience; or, leverage our internationally recognized expertise in the field
  • Items (questions) that can be scored objectively correct/incorrect in real time
  • Item banking system and CAT delivery platform
  • Financial resources: Because it is so complex, development of a CAT will cost at least $10,000 (USD), but if you are testing large volumes of examinees, the return on that investment can be substantial


Visit the links below to learn more about adaptive testing.  We also recommend that you first read this landmark article by our industry-leading psychometricians and our white paper detailing the requirements of CAT.  Additionally, you will likely be interested in this article on producing better measurements with CAT.

One of the most cliche phrases associated with assessment is “teaching to the test.”  I’ve always hated this phrase, because it is only used in a derogatory manner, almost always by people who do not understand the basics of assessment and psychometrics.  I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.

 

Why teaching to the test is usually a good thing

If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum.  Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

In the world of credentialing assessment, this is an extremely important link.  Credential tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills in the job.  This data is then transformed into test blueprints.  Instructors for the profession, as well as aspiring students who are studying to pass the test, then focus on what is in the blueprints.  This, of course, still contains the skills that are most important and frequently used in the job!

 

So what is the problem then?

Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then the power to choose how they teach, as noted in the PISA article.

As a counterexample, my high school math department made an edict starting my sophomore year that all teachers had to use the “Chicago Method.”  It was pure bunk, based on the notion that students should be doing as much busy work as possible instead of the teachers actually teaching. I think some salesman convinced the department head to make the switch so that the school would buy a thousand brand-new textbooks.  The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school), but I think we ended up with a bastardization of it, as the edict was primarily:

  1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
  2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.  
  3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.

 

What about all the bad tests out there?

Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints.  There are, most certainly, plenty of cases where an assessment is not designed or built well.  That’s an entirely different problem, and an entirely valid concern; I have seen a number of these in my career.  This danger is why we have international standards on assessments, like AERA/APA/NCME and NCCA.  These provide guidelines on how a test should be built, sort of like how you need to build a house according to building code rather than just throwing up some walls and a roof.

For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world.  It’s a matter of caveat emptor.  But just because you had some cheap furniture in college that broke right away doesn’t mean you swear off all furniture.  You stay away from bad furniture.

There’s also the problem of tests being misused, but again, that’s not a problem with the test itself; rather, it means someone making decisions with it is uninformed.  It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application then it’s still not a good situation – for example, if you took a very well-made exam for high school graduation and started using it for employment decisions with adults.  Psychometricians call this validity – that we have evidence to support the intended use of the test and interpretations of scores.  It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.

 

So where do we go from here?

Put it this way: if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining vaguely about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  At the same time, of course, we need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

The field of Psychometrics is definitely a small niche in the world, even though it touches almost every person at some point in their lives.  When I’m trying to explain what I do to people from outside the field, I’m often asked something like, “Where do you even go to study something like that?”  I’m also frequently asked by people already in the field where they can go to get an advanced degree in one of the sophisticated topics like item response theory or adaptive testing.  Well, there are indeed a good number of PhD programs in psychometrics, though they rarely appear with that straightforward name, as you can see below.  This can make them tough to find even if you are specifically looking for them.

 

First of all, you can visit a list of programs at the NCME website.  This list is pretty comprehensive, but here are a few highlights.  I also highly recommend the SIOP list of grad programs; they are for I/O psychology but many of them have professors with expertise in things like assessment validation or item response theory.

My apologies in advance if I left out any that you think should be included here!

 

University of Minnesota: Quantitative/Psychometrics Program (Psychology) and Quantitative Foundations of Educational Research (Education)

I’m partial to this one since it is where I completed my PhD, with Prof. David J. Weiss in the Psychology Department.  The UMN is interesting in that it actually has two separate graduate programs in psychometrics: one in Psychology, which has since become more focused on quantitative psychology, and one in the Education department.

https://cla.umn.edu/psychology/graduate/areas-specialization/quantitativepsychometric-methods-qpm

http://www.cehd.umn.edu/edpsych/programs/qme/

University of Massachusetts: Research, Educational Measurement, and Psychometrics (REMP)

For many years, if you wanted to learn item response theory, you read Item Response Theory: Principles and Applications by Hambleton and Swaminathan (1985).  These were two longtime professors at UMass, and it speaks to the quality of that program.  Also note that the program has a nice page on psychometric resources and software.

https://www.umass.edu/remp/

University of Iowa: Center for Advanced Studies in Measurement and Assessment

This program is in the Education department, and has the advantage of being in one of the epicenters of the industry: the testing giant ACT is headquartered only a few miles away, the giant Pearson has an office in town, and the Iowa Test of Basic Skills is an offshoot of the university itself.  Like UMass, Iowa also has a website with educational materials and useful software.

https://education.uiowa.edu/centers/casma

University of Wisconsin-Madison

UW has well-known professors like Subkoviak, Daniel Bolt, and James Wollack.  Plus, Madison is well-known for being a fun city given its small size.

https://edpsych.education.wisc.edu/category/quantitative-methods/

University of Nebraska – Lincoln: Quantitative, Qualitative & Psychometric Methods

For many years, the cornerstones of this program were the husband-and-wife duo of James Impara and Barbara Plake.  They’ve now retired, but excellent new professors have joined.  In addition, UNL is the home of the Buros Institute.

https://cehs.unl.edu/edpsych/quantitative-qualitative-psychometric-methods/

University of Kansas: Research, Evaluation, Measurement, and Statistics

Not far from Lincoln, NE is Lawrence, Kansas.  The program here has been around a long time, and is now home to Jonathan Templin, one of the world’s experts on diagnostic measurement models.

https://epsy.ku.edu/academics/educational-psychology-research/phd/overview-benefits

Michigan State University: Measurement and Quantitative Methods

MSU is home to Mark Reckase, current president of IACAT.  Like most of the rest of these programs, it is in a vibrant college town. The website needs a slight update though!

https://education.msu.edu/cepse/mqm/

 

UNC-Greensboro: Educational Research, Measurement, and Evaluation

While most programs listed here are in the northern USA, this one is in the southern part of the country, where such programs are smaller and fewer.  UNCG is quite strong however.

https://soe.uncg.edu/academics/departments/erm/erm-programs/ph-d-in-educational-research-measurement-and-evaluation/

University of Texas: Quantitative Methods

UT, like some of the other programs, has an advantage in that the educational assessment arm of Pearson is located there.

https://education.utexas.edu/departments/educational-psychology/graduate-programs/quantitative-methods

 

Outside the US

University of Alberta: Center for Research in Applied Measurement and Evaluation

Mark J Gierl has been a longtime professor here, and there are now 4 other professors in the program.

https://sites.google.com/ualberta.ca/crame

University of British Columbia: Measurement, Evaluation, and Research Methodology

UBC is home to Bruno Zumbo, one of the most prolific researchers in the field.

http://ecps.educ.ubc.ca/program/measurement-evaluation-and-research-methodology/

University of Twente: Research Methodology, Measurement and Data Analysis

For decades, Twente has been the center of psychometrics in Europe, with professors like Wim van der Linden, Theo Eggen, Cees Glas, and Bernard Veldkamp.  It’s also linked with Cito, the premier testing company in Europe, which provides excellent opportunities to apply your skills.

https://www.utwente.nl/en/bms/omd/

University of Cambridge: The Psychometrics Centre

The Psychometrics Centre at Cambridge includes professors John Rust and David Stillwell.  It hosted the 2015 IACAT conference and is the home to the open-source CAT platform Concerto.

https://www.psychometrics.cam.ac.uk/

KU Leuven: Research Group of Quantitative Psychology and Individual Differences

This is home to well-known researchers such as Paul De Boeck and David Magis.

https://ppw.kuleuven.be/okp/home/

University of Western Australia: Pearson Psychometrics Laboratory

This is home to David Andrich, best known for the Rasch Rating Scale Model.

http://www.education.uwa.edu.au/ppl

Online

There are very few programs that offer graduate training in psychometrics that is 100% online.  If you know of another one, please get in touch with me.

 

University of Illinois at Chicago: Measurement, Evaluation, Statistics, and Assessment

This program is of particular note because it has an online Master’s program, which allows you to get a high-quality graduate degree in psychometrics from just about anywhere in the world.  One of my colleagues here at ASC has recently enrolled in this program.

https://education.uic.edu/academics-admissions/programs/measurement-evaluation-statistics-and-assessment-mesa

Artificial intelligence (AI) and machine learning (ML) have become buzzwords over the past few years.  As I already wrote about, they are actually old news in the field of psychometrics.   Factor analysis is a classical example of ML, and item response theory also qualifies as ML. Computerized adaptive testing is actually an application of AI to psychometrics that dates back to the 1970s.

One thing that is very different about the world of AI/ML today is the massive power available in free platforms like R, Python, and TensorFlow.  I’ve been thinking a lot over the past few years about how these tools can impact the world of assessment.  A straightforward application is automated essay scoring; a common way to approach that problem is natural language processing with the “bag of words” model, using the document-term matrix (DTM) as predictors in a model with essay score as the criterion variable.  Surprisingly simple.  This got me to wondering where else we could apply that sort of modeling.  Obviously, student response data on selected-response items provides a ton of data, but the research questions are less clear.  So, I turned to the topic that I think has the next largest set of data and text: item banks.

Step 1: Text Mining

The first step was to explore tools for text mining in R.  I found this well-written and clear tutorial on the text2vec package and used that as my springboard.  Within minutes I was able to get a document-term matrix, and in a few more minutes was able to prune it.  This DTM alone can provide useful info to an organization on their item bank, but I wanted to delve further.  Can the DTM predict item quality?
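
For the curious, the heart of that workflow is only a few lines. A sketch, assuming item_stems is a character vector holding each item's stem text:

```r
# Sketch of the text2vec workflow: build and prune a document-term matrix
# from item stems ("item_stems" is an assumed character vector, one per item).
library(text2vec)
it    <- itoken(item_stems, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5, doc_proportion_max = 0.5)
dtm   <- create_dtm(it, vocab_vectorizer(vocab))  # sparse document-term matrix
```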

Step 2: Fit Models

To do this, I utilized both the caret and glmnet packages to fit models.  I love the caret package, but if you search the literature you’ll find it has problems with sparse matrices, which is exactly what the DTM is.  One blog post I found said that anyone with a sparse matrix is pretty much stuck using glmnet.

I tried a few models on a small item bank of 500 items from a friend of mine, and my adjusted R-squared for the prediction of IRT a parameters (as an index of item quality) was 0.53, meaning that I could account for more than half the variance of item quality just by knowing some of the common words in each item’s stem.  I wasn’t even using the answer text, n-grams, or additional information like author and content domain.
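
For reference, here is a sketch of that modeling step; dtm is the sparse matrix from the text mining step, and a_params is an assumed vector of IRT discrimination parameters, one per item. glmnet accepts the sparse DTM directly:

```r
# Sketch of the prediction step: cross-validated elastic-net regression of
# IRT a parameters on the sparse DTM (glmnet handles sparse matrices natively).
library(glmnet)
fit   <- cv.glmnet(x = dtm, y = a_params, alpha = 0.5)  # elastic net
preds <- predict(fit, newx = dtm, s = "lambda.min")     # predicted item quality
cor(as.numeric(preds), a_params)^2                      # in-sample variance accounted for
```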

Want to learn more about your item bank?

I’d love to dive even deeper into this issue.  If you have a large item bank and would like to work with me to analyze it so you can provide better feedback and direction to your item writers and test developers, drop me a message at nthompson@54.89.150.95!  This could directly impact the efficiency of your organization and the quality of your assessments.

 

 

The traditional Learning Management System (LMS) is designed to serve as a portal between educators and their learners. Platforms like Moodle are successful in facilitating cooperative online learning in a number of groundbreaking ways: course management, interactive discussion boards, assignment submissions, and delivery of learning content. While all of this is great, we’ve yet to see an LMS that implements best practices in assessment and psychometrics to ensure that medium or high stakes tests meet international standards.

To put it bluntly, LMS systems have assessment functionality that is usually good enough for short classroom quizzes but falls far short of what is required for a test that is used to award a credential.  A white paper on this topic is available here, but some examples include:

  • Treatment of items as reusable objects
  • Item metadata and historical use
  • Collaborative item review and versioning
  • Test assembly based on psychometrics
  • Psychometric forensics to search for non-independent test-taking behavior
  • Deeper score reporting and analytics

Assessment Systems is pleased to announce the launch of an easy-to-use bridge between FastTest and Moodle that will allow users to seamlessly deliver sound assessments from within Moodle while taking advantage of the sophisticated test development and psychometric tools available within FastTest. In addition to seamless delivery for learners, all candidate information is transferred to FastTest, eliminating the examinee import process.  The bridge makes use of the international Learning Tools Interoperability standards.

If you are already a FastTest user, watch a step-by-step tutorial on how to establish the connection, in the FastTest User Manual by logging into your FastTest workspace and selecting Manual in the upper right-hand corner. You’ll find the guide in Appendix N.

If you are not yet a FastTest user and would like to discuss how it can improve your assessments while still allowing you to leverage Moodle or other LMS systems for learning content, sign up for a free account here.

As we jump headfirst into 2018, we’re reflecting on our successes from the past year. One such success was our inclusion in the Minneapolis/St. Paul Business Journal’s list of Best Places to Work in 2017. We’re honored to be recognized!


So, what makes Assessment Systems one of the best places to work?

Though founded in 1979, we run our company with the mindset and energy of a startup. This means we have a strong foundation on which to create world-class software, but at the same time, we’re constantly innovating, working with the newest technologies and taking risks.

Our leadership team drives this startup mentality, which encourages employees to constantly be on their toes. With experts in a variety of areas, including assessment, psychometrics, entrepreneurship, and tech, not only do all team members play an important role in the business, they also have a real opportunity to make a difference.

We have great company values.

Furthermore, it’s easy for our employees to be inspired every day due to our company’s values. Our CEO stresses the importance of doing the right thing and being kind, which everyone on the team is proud to stand behind. Principles such as these are fundamental to the success of our employees. Ask anyone who’s partnered with us and they’ll tell you that we’re a small company with a big heart that wants to provide the best product and service to our clients.


Last, but certainly not least, we love what we do!

Our unique company culture, diverse team, and values make it easy to love where we work. As a result, we’re all the more motivated to make a difference in our industry and to continue improving our company culture even more.

We may not have a big team, but we have incredible skill sets and a collaborative environment where we rely on each other to make great things happen. We are a small company that is changing the way people test online and improving the world one test at a time.


Sound interesting? Check out our careers page or learn more about what we’re doing at assess.com.

Computerized adaptive testing (CAT) is a powerful paradigm for delivering tests that are smarter, faster, and fairer than the traditional linear approach.  However, CAT is not without its challenges.  One is that it is a greedy algorithm, which always selects the most informative items from the pool if it can.  The way that CAT researchers address this issue is with item exposure controls.  These are subalgorithms that are injected into the main item selection algorithm, altering it so that it does not always use the best items.  The Sympson-Hetter method is one such approach.

The simplest approach is called the randomesque method.  This selects from the top X items in terms of item information (a term from item response theory), usually for the first Y items in a test.  For example, instead of always selecting the top item, the algorithm finds the 3 top items and then randomly selects between those.
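
A minimal sketch of this logic in R, with illustrative function and variable names:

```r
# Sketch of randomesque selection: pick at random among the top X items by
# information at the current theta estimate, rather than always the single best.
select_randomesque <- function(info, available, x = 3) {
  # info: item information at the current theta; available: unused item indices
  top <- available[order(info[available], decreasing = TRUE)]
  k <- min(x, length(top))
  top[sample.int(k, 1)]   # random draw among the k most informative items
}
```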

The Sympson-Hetter Method

A more sophisticated method is the Sympson-Hetter method.  Here, the user specifies a target proportion as a parameter for the selection algorithm.  For example, we might decide that we do not want an item seen by more than 75% of examinees.  So, every time that the CAT algorithm goes into the item pool to select a new item, we generate a random number between 0 and 1, which is then compared to the threshold.  If the number is between 0 and 0.75 in this case, we go ahead and administer the item.  If the number is from 0.75 to 1.0, we skip over it and go on to the next most informative item in the pool, though we then do the same comparison for that item.
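
Here is a minimal sketch of that acceptance step in R; the inputs are illustrative assumptions rather than a production implementation:

```r
# Sketch of the Sympson-Hetter acceptance step. k[i] is item i's
# exposure-control parameter (the target proportion, e.g., 0.75).
select_sympson_hetter <- function(info, k, available) {
  # walk the pool from most to least informative at the current theta
  for (i in available[order(info[available], decreasing = TRUE)]) {
    if (runif(1) <= k[i]) return(i)   # accept: administer this item
    # reject: skip it this time and test the next most informative item
  }
  available[1]  # fallback if every candidate was skipped this pass
}
```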

Why do this?  It obviously limits the exposure of the item.  But just how much it limits it depends on the difficulty of the item.  A very difficult item is likely only going to be a candidate for selection for very high-ability examinees.  Let’s say that’s the top 4% of examinees; then the approach above will limit the item to 75% of that group, which is 3% of the sample overall (0.75 × 4%).

On the other hand, an item of middle difficulty is used not only for middle examinees, but often for any examinee.  Remember, unless there are some controls, the first item for the test will be the same for everyone!  So if we apply the Sympson-Hetter rule to that item, it limits it to 75% exposure in a more absolute sense.

Because of this, you don’t have to set that threshold parameter to the same value for each item.  The original recommendation was to do some CAT simulation studies, then set the parameters thoughtfully for different items.  Items that are likely to be highly exposed (middle difficulty with high discrimination) might deserve a more strict parameter like 0.40.  On the other hand, that super-difficult item isn’t an exposure concern because only the top 4% of students see it anyway… so we might leave its parameter at 1.0 and therefore not limit it at all.

Is this the only method available?

No.  As mentioned, there’s that simple randomesque approach.  But there are plenty more.  You might be interested in this paper, this paper, or this paper.  The last one reviews the research literature from 1983 to 2005.

What is the original reference?

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

How can I apply this to my tests?

Well, you certainly need a CAT platform first.  Our platform at ASC allows this method right out of the box – that is, all you need to do is enter the target proportion when you publish your exam, and the Sympson-Hetter method will be implemented.  No need to write any code yourself!  Click here to sign up for a free account.

Desperation is seldom fun to see.

Some years ago, having recently released our online marking functionality, I was reviewing a customer workspace when I was intrigued to see “Beyonce??” mentioned in a marker’s comments on an essay. The student’s essay was evaluating some poetry and had completely misunderstood the use of metaphor in the poem in question. The student also clearly knew that her interpretation was way off, but didn’t know how, and had reached the end of her patience. So after a desultory attempt at answering, with a cry from the heart reminiscent of William Wallace’s call for freedom, she wrote “BEYONCE” with about seventeen exclamation points. It felt good to see that her spirit was not broken, and it was a moment of empathy that drove home the damage that standardized tests are inflicting on our students. That vignette plays itself out millions of times each year in this country; the following explains why.

What are “Standardized Tests”?

We use standardized tests for a variety of reasons, but underlying every reason (curriculum effectiveness, college/career preparedness, teacher effectiveness, etc.) is the understanding that the test measures what a student has learned. In order to know how all our students are doing, we give them all standardized tests, meaning every student receives essentially the same set of tests; that is what makes the tests standardized. This is a difficult endeavor given the wide range of students and number of tests, and it raises the question: how do we do this reliably and in a reasonable amount of time?

Accuracy and Difficulty vs Length

We all want tests to reliably measure students’ learning. In order to make these tests reliable, we need to supply questions of varying difficulty, from very easy to very difficult, to cover a wide range of abilities. In order to reduce the length of the test, most of the questions fall in the medium-easy to medium-difficult range, because that is where most students’ ability levels fall. So the test that best balances length and accuracy for the whole population should be constructed such that the number of questions at any difficulty is proportionate to the number of students at that ability.

Why are most questions in the medium difficulty range? Imagine creating a test to measure 10th graders’ math ability. A small number of the students might have a couple years of calculus. If the test covered those topics, imagine the experience of most students, who would often not even understand the notation in the question. Frustrating, right? On the other hand, if the test were also constructed to measure students with only rudimentary math knowledge, the average to advanced students would be frustrated and bored from answering a lot of questions on basic math facts. The solution most organizations use is to present only a few questions that are really easy or really difficult, and accept that the resulting score is not as accurate as they would prefer for students at either end of the ability range.

These Tests are Inaccurate and Mean Spirited

The problem is that while this might work OK for a lot of kids, it exacts a pretty heavy toll on others. Almost one in five students will not know the answer to 80% of the questions on these tests, and scoring about 20% on a test certainly feels like failing. It feels like failing every time a student takes such a test. Over the course of an academic career, students in the bottom quintile will guess on or skip 10,000 questions. That is 10,000 times the student is told that school, learning, or success is not for them. Even biasing the test to be easier only makes a slight improvement.

Computerized Adaptive Testing, Test Performance with Bell Curve

The shaded area represents students who will miss at least 80% of questions.

It isn’t necessarily better for the top students, whose every testing experience assures them that they are already very successful, when the reality is that they are likely being outperformed by a significant percentage of their future colleagues.

In other words, at both ends of the Bell Curve, we are serving our students very poorly, inadvertently encouraging lower performing students to give up (there is some evidence that the two correlate) and higher performing students to take it easy. It is no wonder that people dislike standardized tests.

There is a Solution

A computerized adaptive test (CAT) solves all the problems outlined above. Properly constructed, a CAT has the ability to make the following faster, fairer, and more valid:

  • Every examinee completes the test in less time (fast)
  • Every examinee gets a more accurate score (valid)
  • Every examinee receives questions tuned to their ability, answering about half correctly (fair)

Given all the advantages of CAT, it may seem hard to believe that they are not used more often. While they are starting to catch on, adoption is not fast enough given the heavy toll that the old methods exact on our students. It is true that few testing providers can deliver CATs, but that is no excuse. If a standardized test is delivered to as few as 500 students, it can be made adaptive. It probably isn’t, but it could be. All that is needed are computers or tablets, an Internet connection, and some effort. We should expect more.

How can my organization implement CAT?

While CAT used to be feasible only for large organizations that tested hundreds of thousands or millions of examinees per year, a number of advances have changed this landscape.  If you’d like to do something about your test, it might be worthwhile for you to evaluate CAT.  We can help you with that evaluation; if you’d like to chat, here is a link to schedule a meeting.  Or, if you’d like to discuss the math or related ideas, please drop me a note.

Since the first tests were developed 2,000 years ago for entry into the civil service of Imperial China, test security has been a concern.  The reason is straightforward: most threats to test security are also threats to validity, and the decisions we make with test scores could therefore be invalid, or at least suboptimal.  It is therefore imperative that organizations that develop or utilize tests develop a Test Security Plan (TSP).  The TSP is a document that helps an organization anticipate test security issues, establish deterrent and detection methods, and plan responses.  It can also include validity threats that are not security-related, such as how to deal with examinees who have low motivation.

There are several reasons to develop a Test Security Plan.  First, it drives greater security and therefore validity.  Second, the TSP enhances the legal defensibility of the testing program.  It also helps to safeguard the content, which is typically an expensive investment for any organization that develops its own tests.  If incidents do happen, they can be dealt with more swiftly and effectively.  Finally, it helps to manage all security-related efforts.

The development of such a complex document requires a strong framework.  We advocate a framework with three phases: planning, implementation, and response.  In addition, the TSP should be revised periodically.

 

Phase 1: Planning

The first step in this phase is to list all potential threats to each assessment program at your organization.  This could include harvesting of test content, preknowledge of test content from past harvesters, copying other examinees, proxy testers, proctor help, and outside help.  Next, these should be rated on axes that are important to the organization; a simple approach would be to rate on potential impact to score validity, cost to the organization, and likelihood of occurrence.  This risk assessment exercise will help the remainder of the framework.

Next, the organization should develop the TSP.  The first piece is to identify deterrents and procedures to reduce the possibility of issues.  This includes delivery procedures (such as a lockdown browser or proctoring), proctor training manuals, a strong candidate agreement, anonymous reporting pathways, confirmation testing, and candidate identification requirements.  The second piece is to explicitly plan for psychometric forensics.  This can range from complex collusion indices based on item response theory to simple flags, such as a candidate responding with a certain multiple-choice option more than 50% of the time, or obtaining a score in the top 10% while finishing in the fastest 10% of time.  The third piece is to establish planned responses.  What will you do if a proctor reports that two candidates were copying each other?  What if someone obtains a high score in an unreasonably short time?  What if someone obviously did not try to pass the exam, but still sat there for the allotted time?  If a candidate were to lose a job opportunity due to your response, it helps your defensibility to show that the process was established ahead of time with the input of important stakeholders.
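
To illustrate the simple-flag variety, here is a sketch in R of the two flags just mentioned; the data structures are assumptions, and dedicated software like SIFT implements far more sophisticated indices:

```r
# Sketch of two simple forensic flags. Assumed inputs: "responses" is a
# candidate-by-item matrix of selected options (e.g., "A"-"D"); "scores" and
# "times" are per-candidate vectors of total score and total testing time.
flag_candidates <- function(responses, scores, times) {
  # Flag 1: a single option chosen on more than 50% of items
  option_overuse <- apply(responses, 1, function(r) max(table(r)) / length(r) > 0.50)
  # Flag 2: score in the top 10% but total time in the fastest 10%
  fast_high_score <- scores >= quantile(scores, 0.90) & times <= quantile(times, 0.10)
  data.frame(option_overuse, fast_high_score)
}
```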

 

Phase 2: Implementation

The second phase is to implement the relevant aspects of the Test Security Plan, such as training all proctors in accordance with the manual and login procedures, setting IP address limits, or ensuring that a new secure testing platform with lockdown is rolled out to all testing locations.  There are generally two approaches.  Proactive approaches attempt to reduce the likelihood of issues in the first place, and reactive methods happen after the test is given.  The reactive methods can be observational, quantitative, or content-focused.  Observational methods include proctor reports or an anonymous tip line.  Quantitative methods include psychometric forensics, for which you will need software like SIFT.  Content-focused methods include automated web crawling.

Both approaches require continuous attention.  You might need to train new proctors several times per year, or update your lockdown browser.  If you use a virtual proctoring service based on record-and-review, flagged candidates must be periodically reviewed.  The reactive methods are similar: incoming anonymous tips or proctor reports must be dealt with at any given time.  The least continuous aspect is some of the psychometric forensics, which depend on a large-scale data analysis; for example, you might gather data from tens of thousands of examinees in a testing window and can only do a complete analysis at that point, which could take several weeks.

 

Phase 3: Response

The third phase, of course, is to put your planned responses into motion when issues are detected.  Some of these could be relatively innocuous; if a proctor is reported as not following procedures, they might need some remedial training, and it’s certainly possible that no security breach occurred.  The more dramatic responses include actions taken against the candidate.  The most lenient is to provide a warning or simply ask them to retake the test.  The most extreme methods include a full invalidation of the score with future sanctions, such as a five-year ban on taking the test again, which could prevent someone from entering a profession for which they spent eight years and hundreds of thousands of dollars in educational preparation.

 

What does a test security plan mean for me?

It is clear that test security threats are also validity threats, and that the extensive (and expensive!) measures warrant a strategic and proactive approach in many situations.  A framework like the one advocated here will help organizations identify and prioritize threats so that the measures are appropriate for a given program.  Note that the results can be quite different if an organization has multiple programs, from a practice test to an entry level screening test to a promotional test to a professional certification or licensure.

Another important difference is that between test sponsors/publishers and test consumers.  In the case of an organization that purchases off-the-shelf pre-employment tests, the validity of score interpretations is of more direct concern, while the theft of content might not be an immediate concern.  Conversely, the publisher of such tests has invested heavily in the content and could be massively impacted by theft, while the copying of two examinees in the hiring organization is not of immediate concern.

In summary, there are more security threats, deterrents, procedures, and psychometric forensic methods than can be discussed in one blog post, so the focus here is on the framework itself.  To get started, think strategically about test security and how it impacts your assessment programs by using the multi-axis rating approach, then begin to develop a Test Security Plan.  The end goal is to improve the health and validity of your assessments.

Want to implement some of the security aspects discussed here, like online delivery lockdown browser, IP address limits, and proctor passwords? Sign up for a free account in FastTest!

One of the best aspects of my position is the opportunity to travel the world and talk with many experts about psychometrics and educational assessment.  In December 2017, I was lucky enough to travel to Monterrey, Mexico, with the dual purpose of a conference and a Psychometrics Seminar.  It was an exciting week that taught me a lot about education in Latin America.

The first half of the week was the Congreso Internacional de Innovacion Educativa, the premier conference in educational technology in Latin America, with more than 3,000 attendees.  I had the opportunity to present on a project we are doing with Tecnologico de Monterrey to implement adaptive testing in admissions exams.  Tecnologico is the #2 university in Mexico, and therefore a leader in all of Latin America, as well as being the host of the conference.  In addition, I heard a number of interesting talks, including one by the CEO of Coursera on the role of MOOCs in higher education.  I’m personally a MOOC addict, though I tend to start new courses far faster than I ever finish one.

 


The second half of the week was the 2017 ASC Learning Summit, a two-day seminar to teach item response theory and adaptive testing to anyone who wanted to learn.  We had 22 attendees, which was extremely successful given the short notice (Tec had to move the entire conference from Mexico City to Monterrey after an earthquake).  We had 18 from Mexico, two from the US, one from Barbados, and one from the UK.  These included university professors, K-12 assessment professionals, higher education admissions staff, certification test psychometricians, and more.

 

 

If time allows, I hope to record my presentations from that summit as a series of webinars or videos, so join our mailing list to stay tuned if you aren’t already on it.  In addition, I’ll be holding similar seminars in the future.  If you are interested in having me visit your country for such an event, get in touch with me at nthompson@54.89.150.95.