The traditional Learning Management System (LMS) is designed to serve as a portal between educators and their learners. Platforms like Moodle are successful in facilitating cooperative online learning in a number of groundbreaking ways: course management, interactive discussion boards, assignment submissions, and delivery of learning content. While all of this is great, we’ve yet to see an LMS that implements best practices in assessment and psychometrics to ensure that medium or high stakes tests meet international standards.

To put it bluntly, most LMSs have assessment functionality that is good enough for short classroom quizzes but falls far short of what is required for a test that is used to award a credential. A white paper on this topic is available here, but some examples include:

  • Treatment of items as reusable objects
  • Item metadata and historical use
  • Collaborative item review and versioning
  • Test assembly based on psychometrics
  • Psychometric forensics to search for non-independent test-taking behavior
  • Deeper score reporting and analytics

Assessment Systems is pleased to announce the launch of an easy-to-use bridge between FastTest and Moodle that allows users to seamlessly deliver sound assessments from within Moodle while taking advantage of the sophisticated test development and psychometric tools available within FastTest. In addition to streamlined delivery for learners, all candidate information is transferred to FastTest, eliminating the examinee import process. The bridge makes use of the international Learning Tools Interoperability (LTI) standard.

If you are already a FastTest user, you can watch a step-by-step tutorial on how to establish the connection in the FastTest User Manual: log into your FastTest workspace and select Manual in the upper right-hand corner. You’ll find the guide in Appendix N.

If you are not yet a FastTest user and would like to discuss how it can improve your assessments while still allowing you to leverage Moodle or another LMS for learning content, sign up for a free account here.

Computerized adaptive testing (CAT) is a powerful paradigm for delivering tests that are smarter, faster, and fairer than the traditional linear approach.  However, CAT is not without its challenges.  One is that it is a greedy algorithm that always selects your best items from the pool if it can.

The way that CAT researchers address this issue is with item exposure controls: sub-algorithms injected into the main item selection algorithm to keep it from always using the best items. The Sympson-Hetter method is one such approach.

The simplest approach is called the randomesque method.

This approach selects from the top X items in terms of item information (a term from item response theory), usually for the first Y items in a test.  For example, instead of always selecting the single most informative item, the algorithm finds the top 3 items and then randomly selects among them.
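
As a concrete illustration, here is a minimal sketch of the randomesque rule in Python. It assumes a hypothetical item object with an information(theta) method returning the item's information at the current ability estimate; any real CAT engine will have its own equivalent.

```python
import random

def randomesque_select(available_items, theta, top_k=3):
    """Randomesque exposure control: rather than always administering the single
    most informative item, randomly pick one of the top_k most informative items
    at the current ability estimate theta."""
    ranked = sorted(available_items,
                    key=lambda item: item.information(theta),  # assumed method
                    reverse=True)
    return random.choice(ranked[:top_k])
```

In practice, top_k is often applied only for the first few items of the test, where exposure pressure is highest, and then reduced to 1.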

The Sympson-Hetter Method

A more sophisticated method is the Sympson-Hetter method.

Here, the user specifies a target proportion as a parameter for the selection algorithm.  For example, we might decide that we do not want an item seen by more than 75% of examinees.  So, every time that the CAT algorithm goes into the item pool to select a new item, we generate a random number between 0 and 1, which is then compared to the threshold.  If the number is between 0 and 0.75 in this case, we go ahead and administer the item.  If the number is from 0.75 to 1.0, we skip over it and go on to the next most informative item in the pool, though we then do the same comparison for that item.
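
Here is a minimal sketch of that selection loop, assuming the candidate items have already been ranked by information and that exposure_targets is a dictionary mapping a hypothetical item.id to its Sympson-Hetter parameter (1.0 means no restriction):

```python
import random

def sympson_hetter_select(ranked_items, exposure_targets):
    """Walk down the items in order of information; administer an item only if a
    uniform random draw falls at or below its exposure target, otherwise move on."""
    for item in ranked_items:
        target = exposure_targets.get(item.id, 1.0)  # item.id is an assumed attribute
        if random.random() <= target:
            return item
    # If every candidate was skipped by the random draws, fall back to the best item
    # (real implementations vary in how they handle this edge case)
    return ranked_items[0]
```

Note that the per-item targets need not be equal, which is exactly the point made below about setting stricter parameters for highly exposed items.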

Why do this?  It obviously limits the exposure of the item.  But just how much it limits it depends on the difficulty of the item.  A very difficult item is likely only going to be a candidate for selection for very high-ability examinees.  Let’s say it’s the top 4%… well, then the approach above will limit it to 3% of the sample overall, but 75% of the examinees in its neighborhood.

On the other hand, an item of middle difficulty is used not only for middle examinees but often for any examinee.  Remember, unless there are some controls, the first item for the test will be the same for everyone!  So if we apply the Sympson-Hetter rule to that item, it limits it to 75% exposure in a more absolute sense.

Because of this, you don’t have to set that threshold parameter to the same value for each item.  The original recommendation was to do some CAT simulation studies, then set the parameters thoughtfully for different items.  Items that are likely to be highly exposed (middle difficulty with high discrimination) might deserve a more strict parameter like 0.40.  On the other hand, that super-difficult item isn’t an exposure concern because only the top 4% of students see it anyway… so we might leave its parameter at 1.0 and therefore not limit it at all.

Is this the only method available?

No.  As mentioned, there’s that simple randomesque approach.  But there are plenty more.  You might be interested in this paper, this paper, or this paper.  The last one reviews the research literature from 1983 to 2005.

What is the original reference?

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

How can I apply this to my tests?

Well, you certainly need a CAT platform first.  Our platform at ASC allows this method right out of the box – that is, all you need to do is enter the target proportion when you publish your exam, and the Sympson-Hetter method will be implemented.  No need to write any code yourself!  Click here to sign up for a free account.

Desperation is seldom fun to see.

Some years ago, shortly after we released our online marking functionality, I was reviewing a customer workspace and was intrigued to see “Beyonce??” mentioned in a marker’s comments on an essay. The student’s essay was evaluating some poetry and had completely misunderstood the use of metaphor in the poem in question. The student also clearly knew that her interpretation was way off, but didn’t know how, and had reached the end of her patience. So after a desultory attempt at answering, with a cry from the heart reminiscent of William Wallace’s call for freedom, she wrote “BEYONCE” with about seventeen exclamation points. It felt good to see that her spirit was not broken, and it was a moment of empathy that drove home the damage that standardized tests are inflicting on our students. That vignette plays itself out millions of times each year in this country; the following explains why.

What are “Standardized Tests”?

We use standardized tests for a variety of reasons, but underlying every reason (curriculum effectiveness, college/career preparedness, teacher effectiveness, etc.) is the understanding that the test is measuring what a student has learned. In order to know how all our students are doing, we give them standardized tests, meaning every student takes essentially the same test. This is a difficult endeavor given the wide range of students and the number of tests, and it raises the question, “How do we do this reliably and in a reasonable amount of time?”

Accuracy and Difficulty vs Length

We all want tests to reliably measure the students’ learning. In order to make these tests reliable, we need to supply questions of varying difficulty, from very easy to very difficult, to cover a wide range of abilities. In order to reduce the length of the test, most of the questions fall in the medium-easy to medium-difficult range, because that is where most of the students’ ability levels will fall. So the test that best balances length and accuracy for the whole population is constructed such that the number of questions at any difficulty level is proportionate to the number of students at that ability level.

Why are most questions in the medium difficulty range? Imagine creating a test to measure 10th graders’ math ability. A small number of the students might have a couple years of calculus. If the test covered those topics, imagine the experience of most students who would often not even understand the notation in the question. Frustrating, right? On the other hand, if the test was also constructed to measure students with only rudimentary math knowledge, these average to advanced students would be frustrated and bored from answering a lot of questions on basic math facts. The solution most organizations use is to present only a few questions that are really easy or difficult, and accept that this score is not as accurate as they would prefer for the students at either end of the ability range.

These Tests are Inaccurate and Mean-Spirited

The problem is that while this might work OK for a lot of kids, it exacts a pretty heavy toll on others. Almost one in five students will not know the answer to 80% of the questions on these tests, and scoring about 20% on a test certainly feels like failing. It feels like failing every time a student takes such a test. Over the course of an academic career, students in the bottom quintile will guess on or skip 10,000 questions. That is 10,000 times the student is told that school, learning, or success is not for them. Even biasing the test to be easier only makes a slight improvement.

Computerized Adaptive Testing, Test Performance with Bell Curve

The shaded area represents students who will miss at least 80% of questions.

It isn’t necessarily better for the top students, whose every testing experience assures them that they are already very successful, when the reality is that they are likely being outperformed by a significant percentage of their future colleagues.

In other words, at both ends of the Bell Curve, we are serving our students very poorly, inadvertently encouraging lower performing students to give up (there is some evidence that the two correlate) and higher performing students to take it easy. It is no wonder that people dislike standardized tests.

There is a Solution

A computerized adaptive test (CAT) solves all the problems outlined above. Properly constructed, a CAT makes testing faster, fairer, and more valid:

  • Every examinee completes the test in less time (fast)
  • Every examinee gets a more accurate score (valid)
  • Every examinee receives questions tuned to their ability, so they get about half right (fair)

Given all the advantages of CAT, it may seem hard to believe that it is not used more often. While it is starting to catch on, adoption is not fast enough given the heavy toll that the old methods exact on our students. It is true that few testing providers can deliver CAT, but that is no excuse. If a standardized test is delivered to as few as 500 students, it can be made adaptive. It probably isn’t, but it could be. All that is needed are computers or tablets, an Internet connection, and some effort. We should expect more.

How can my organization implement CAT?

While CAT used to be feasible only for large organizations that tested hundreds of thousands or millions of examinees per year, a number of advances have changed this landscape.  If you’d like to do something about your test, it might be worthwhile for you to evaluate CAT.  We can help you with that evaluation; if you’d like to chat, here is a link to schedule a meeting. Or, if you’d like to discuss the math or related ideas, please drop me a note.

In the past decade, terms like machine learning, artificial intelligence, and data mining have become ever bigger buzzwords as computing power, APIs, and the massively increased availability of data enable new technologies like self-driving cars. However, we’ve been using methodologies like machine learning in psychometrics for decades, so much of the hype is just hype.

So, what exactly is Machine Learning?

Unfortunately, there is no widely agreed-upon definition, and as Wikipedia notes, machine learning is often conflated with data mining. A broad definition from Wikipedia is that machine learning explores the study and construction of algorithms that can learn from and make predictions on data. It’s often divided into supervised learning, where a researcher drives the process, and unsupervised learning, where the computer is allowed to creatively run wild and look for patterns. The latter isn’t of much use to us, at least yet.

Supervised learning includes specific topics like regression, dimensionality reduction, and anomaly detection that we obviously have in psychometrics. But it’s the general definition above that really fits what psychometrics has been doing for decades.

What is Machine Learning in Psychometrics?

We can’t cover all the ways that machine learning and related topics are used in psychometrics and test development, but here’s a sampling. My goal is not to cover them all but to point out that this is old news and that we should not get hung up on buzzwords, fads, and marketing schticks – but by all means, we should continue to drive in this direction.

Dimensionality Reduction

One of the first, and most straightforward, areas is dimensionality reduction. Given a bunch of unstructured data, how can we find some sort of underlying structure, especially in terms of latent dimensions? We’ve been doing this, utilizing methods like cluster analysis and factor analysis, since Spearman first started investigating the structure of intelligence 100 years ago. In fact, Spearman helped invent those approaches to solve the problems he was trying to address in psychometrics, which was a new field at the time and had no methodology yet. How seminal was this work in psychometrics for the field of machine learning in general? The Coursera MOOC on Machine Learning uses Spearman’s work as an example in one of the early lectures!
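
As a toy illustration of the same idea with modern tooling, the sketch below simulates item scores driven by a single latent trait and recovers that structure with scikit-learn's factor analysis. The data and loadings are made up purely for illustration, not a real calibration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 1))                 # one latent trait for 500 examinees
loadings = rng.uniform(0.5, 1.5, size=(1, 10))    # how strongly each of 10 items reflects it
scores = theta @ loadings + rng.normal(scale=0.8, size=(500, 10))

fa = FactorAnalysis(n_components=1).fit(scores)
print(fa.components_.round(2))   # estimated loadings: one dominant factor emerges
```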

Classification

Classification is a typical problem in machine learning. A common example is classifying images, and the classic dataset is the MNIST handwriting set (though Silicon Valley fans will think of the “not hot dog” algorithm). Given a bunch of input data (image files) and labels (what number is in the image), we develop an algorithm that most effectively can predict future image classification.  A closer example to our world is the iris dataset, where several quantitative measurements are used to predict the species of a flower.
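
For instance, the iris classification problem mentioned above takes only a few lines with scikit-learn; this is just the generic machine-learning workflow of training on labeled data and then predicting on a held-out set.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # flower measurements and species labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))    # proportion of held-out flowers classified correctly
```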

The contrasting groups method of setting a test cutscore is a simple example of classification in psychometrics. We have a training set where examinees are already classified as pass/fail by a criterion other than test score (which of course rarely happens, but that’s another story), and use mathematical models to find the cutscore that most efficiently divides them. Not all standard setting methods take a purely statistical approach; understandably, the cutscores cannot be decided by an arbitrary computer algorithm like support vector machines or they’d be subject to immediate litigation. Strong use of subject matter experts and integration of the content itself is typically necessary.
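
One simple way to operationalize the statistical side of contrasting groups is to scan candidate cutscores and keep the one that best separates the two criterion groups. The sketch below uses simulated score distributions and a plain accuracy criterion purely for illustration; real studies weigh false positives against false negatives, smooth the distributions, and lean heavily on subject matter experts.

```python
import numpy as np

def contrasting_groups_cutscore(master_scores, nonmaster_scores):
    """Return the cutscore that maximizes correct classification of the two
    criterion-defined groups (masters should score at or above the cut)."""
    candidates = np.arange(nonmaster_scores.min(), master_scores.max() + 1)
    def accuracy(cut):
        correct = np.sum(master_scores >= cut) + np.sum(nonmaster_scores < cut)
        return correct / (master_scores.size + nonmaster_scores.size)
    return max(candidates, key=accuracy)

rng = np.random.default_rng(1)
masters = rng.normal(75, 8, 200).round()      # simulated scores of examinees judged competent
nonmasters = rng.normal(60, 8, 100).round()   # simulated scores of examinees judged not competent
print(contrasting_groups_cutscore(masters, nonmasters))
```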

Of course, all tests that seek to assign examinees into categories like Pass/Fail are addressing the classification problem. Some of my earliest psychometric work on the sequential probability ratio test and the generalized likelihood ratio was in this area.

One of the best examples of supervised learning for classification, but much more advanced than the contrasting groups method, is automated essay scoring, which has been around for about two decades. It has all the classic trappings: a training set where the observations are classified by humans first, and then mathematical models are trained to best approximate the humans. What makes it more complex is that the predictor data is now long strings of text (student essays) rather than a single number.
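
At its core, that training loop can be sketched with ordinary regression tools. The toy example below uses TF-IDF word features and ridge regression on a handful of made-up essays; it is nowhere near a production AES engine (those use far richer linguistic features and thousands of human-scored essays), but it shows the supervised-learning shape of the problem.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical training set: essays already scored by human raters on a 0-5 rubric
essays = [
    "The poet uses the storm as a metaphor for the speaker's grief and loss.",
    "The poem describes a storm. I think the author likes storms.",
    "I do not know what this poem means.",
]
human_scores = [5, 2, 0]

model = make_pipeline(TfidfVectorizer(), Ridge())
model.fit(essays, human_scores)                        # train on the human-scored essays
print(model.predict(["The storm stands in for the speaker's sorrow."]))
```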

Anomaly Detection

The most obvious way this is used in our field is psychometric forensics: trying to find examinees who are cheating or exhibiting some other behavior that warrants attention. But we also use it to evaluate model fit, possibly removing items or examinees from our data set.
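
As a generic illustration only (real psychometric forensics relies on purpose-built indices such as response-similarity and person-fit statistics), one could feed simple per-examinee features into an off-the-shelf anomaly detector and review whatever it flags.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
# Hypothetical per-examinee features: total score and total testing time in minutes
features = np.column_stack([rng.normal(70, 10, 1000),
                            rng.normal(45, 8, 1000)])
features[:5] = [98, 9]      # plant a few suspiciously fast, high-scoring examinees

flags = IsolationForest(contamination=0.01, random_state=0).fit_predict(features)
print(np.where(flags == -1)[0])   # indices of examinees flagged for human review
```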

Using Algorithms to Learn/Predict from Data

Item response theory is a great example of the general definition. With IRT, we are certainly using a training set, which we call a calibration sample. We use it to train some models, which are then used to make decisions in future observations, primarily scoring examinees that take the test by predicting where those examinees would fall in the score distribution of the calibration sample. IRT is also applied to solve more sophisticated algorithmic problems: computerized adaptive testing and automated test assembly are fantastic examples. We use IRT more generally to learn from the data: which items are most effective, which are not, the ability range where the test provides the most precision, and so on.

What differs from the Classification problem is that we don’t have a “true state” of labels for our training set. That is, we don’t know what the true scores are of the examinees, or if they are truly a “pass” or a “fail” – especially because those terms can be somewhat arbitrary. It is for this reason we rely on a well-defined model with theoretical reasons for it fitting our data, rather than just letting a machine learning toolkit analyze it with any model it feels like.

Arguably, classical test theory also fits this definition. We have a very specific mathematical model that is used to learn from the data, including which items are stronger or more difficult than others, and how to construct test forms to be statistically equivalent. However, its use of prediction is much weaker. We do not predict where future examinees would fall in the distribution of our calibration set. The fact that it is test-form-specific hampers its generalizability.

Reinforcement Learning

The Wikipedia article also mentions reinforcement learning. This is used less often in psychometrics because test forms are typically published with some sort of finality. That is, they might be used in the field for a year or two before being retired, and no data is analyzed in that time except perhaps some high-level checks like the NCCA Annual Statistical Report. Online IRT calibration is a great example of reinforcement learning in our field, though it is rarely used in practice: response data is analyzed algorithmically over time and used to estimate or update the IRT parameters. Evaluation of parameter drift also fits this definition.

Use of Test Scores

We also use test scores “outside” the test in a machine learning approach. A classic example of this is using pre-employment test scores to predict job performance, especially with additional variables to increase the incremental validity. But I’m not going to delve into that topic here.

Automation

Another huge opportunity for psychometrics that is highly related is automation: programming computers to do tasks more effectively or efficiently than humans. Automated test assembly and automated essay scoring are examples of this, but there are plenty of ways that automation can help that are less “cool” but have more impact. My favorite is the creation of psychometric reports; Iteman and Xcalibre do not produce many numbers that you cannot get from other software, but they automatically build you a draft report in MS Word, with all the tables, graphs, and narratives already embedded. That is unique. Without that automation, organizations would typically pay a PhD psychometrician to spend hours on copy-and-paste, which is an absolute shame. The goal of my mentor, Prof. David Weiss, and myself is to automate the test development cycle as a whole: job analysis, test design, item writing, item review, standard setting, form assembly, test publishing, test delivery, and scoring. There’s no reason people should be allowed to continue making bad tests, and then using those tests to ruin people’s lives, when we know so much about what makes a decent test.

Summary

I am sure there are other areas of psychometrics and the testing industry that are soon to be disrupted by technological innovations such as this. What’s next?

As this article notes, the future direction is about the systems being able to learn on their own rather than being programmed; that is, more towards unsupervised learning than supervised learning. I’m not sure how well that fits with psychometrics.

But back to my original point: psychometrics has been a data-driven field since its inception a century ago. In fact, we contributed some of the methodology that is used generally in the field of machine learning and data analytics. So it shouldn’t be any big news when you hear terms like machine learning, data mining, AI, or dimensionality reduction used in our field! In contrast, I think it’s more important to consider how we remove roadblocks to more widespread use.

One of the hurdles we still need to overcome is simply how to get more organizations doing what has been considered best practice for decades. There are two types of problem organizations. The first type is one that does not have the sample sizes or budget to deal with methodologies like those I’ve discussed here. The salient example I always think of is a state licensure test required by law for a niche profession that might have only 3 examinees per year (I have talked with such programs!). Not much we can do there. The second type is those organizations that indeed have large sample sizes and a decent budget, but are still doing things the same way they did them 30 years ago. How can we bring modern methods and innovations to these organizations? Doing so would only make their tests more effective and fairer.

Computerized adaptive testing (CAT) is an incredibly important innovation in the world of assessment.  It’s a psychometric paradigm that applies machine learning principles to personalize millions and millions of assessments, from K12 education to university admissions to professional certification to employment screening to medical surveys.  While invented in the 1970s, primarily as part of a Defense research grant at the University of Minnesota, it remains a highly relevant and exciting topic today, especially with the advent of the cloud.

The International Association for Computerized Adaptive Testing (IACAT; www.iacat.org) was founded in 2009 at a small conference held at the University of Minnesota.  Since then, it has focused on being the nexus of adaptive testing resources and research.  A key component of this mission is a biannual conference that rotates around the world.  For 2017, the conference will take place in Niigata, Japan, 18-21 August.

Today, a draft of the program was released, so you can now see the scientifically rigorous and internationally diverse slate of speakers and topics.  Click here to view the program.  I’m honored to be able to present a paper with my colleague Jordan Stoeger, as well as teach a workshop with my good friend John Barnard from Australia (www.epecat.com).  A big thanks to John, Cliff Donath, Tetsuo Kimura, Alper Sahin, and all others who have contributed time to making this conference the excellent meeting that it is.

If you are at all interested in CAT, or even assessment technology in general, I highly recommend that you consider attending this year’s conference.  I hope to see you there!

Are you on social media?  Use the hashtag #IACAT2017.  Join our LinkedIn Group.

 

Item response theory (IRT) represents an important innovation in the field of psychometrics. While now 50 years old – assuming the “birth” is the classic Lord and Novick (1968) text – it is still underutilized and remains a mystery to many practitioners.  So what is item response theory, and why was it invented?

The Driver: Problems with Classical Test Theory

Classical test theory (CTT) is approximately 100 years old, and still remains commonly used because it is appropriate for certain situations, and it is simple enough that it can be used by many people without formal training in psychometrics.  Most statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  Here are just a few.

  • Sample dependency: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation (that is, two samples of different ability levels can be easily converted onto the same scale)
  • Test dependency: Classical statistics are tied to a specific test form, and do not deal well with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing
  • Weak linking/equating: CTT has a number of methods for linking multiple forms, but they are weak compared to IRT
  • Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect
  • Lack of accounting for guessing: CTT does not account for guessing on multiple choice exams
  • Scoring: Scoring in classical test theory does not take into account item difficulty.
  • Adaptive testing: CTT does not support adaptive testing in most cases.

So what is item response theory?

It is a family of mathematical models that try to describe how examinees respond to items (hence the name).  These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves.  However, item response theory ended up doing so much more – namely, addressing the problems above.

Want to start applying IRT without having to learn how to code?
Download Xcalibre for free!

The Foundation of Item Response Theory

The foundation of IRT is a mathematical model defined by item parameters.  For dichotomous items (those scored correct/incorrect), each item has three parameters:

 

a: the discrimination parameter, an index of how well the item differentiates low from top examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.

b: the difficulty parameter, an index of what level of examinees for which the item is appropriate; typically ranges from -3 to +3, with 0 being an average examinee level.

c: the pseudoguessing parameter, which is a lower asymptote; typically near 1/k, where k is the number of options.

 

[Figure: example dichotomous item response function (IRF) from FastTest]

These parameters are used to graphically display an item response function (IRF).  An example IRF is on the right.  Here, the a parameter is approximately 1.0, indicating a fairly discriminating item.  The b parameter is approximately -0.6 (the point on the x-axis where the midpoint of the curve is), indicating an easy item; an examinee slightly below average would have about a 60% chance of answering correctly.  The c parameter is approximately 0.20, though the lower asymptote is obviously off the left of the screen.

 

What does this mean conceptually?  We are trying to model the interaction of an examinee with the item, hence the name item response theory.  Consider the x-axis to be z-scores on a standard normal scale.  Examinees with higher ability are much more likely to respond correctly.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only a 37% chance.
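
Those probabilities come straight from the item response function. Here is a small sketch of the three-parameter logistic IRF (in the logistic metric, without the 1.7 scaling constant) using roughly the parameter values described above; the outputs line up with the ~60%, ~94%, and ~37% figures quoted here, give or take rounding of the parameter values.

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """Three-parameter logistic IRF: probability of a correct response at ability theta."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# The example item above: a ≈ 1.0, b ≈ -0.6, c ≈ 0.20
for theta in (-2.0, -0.6, 0.0, 2.0):
    print(f"theta = {theta:+.1f}  P(correct) = {irf_3pl(theta, a=1.0, b=-0.6, c=0.20):.2f}")
# approximately 0.36, 0.60, 0.72, and 0.94
```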

Building with the Basic Building Block

The IRF is used for several purposes.  Here are a few.

  1. Interpreting and improving item performance
  2. Scoring examinees with maximum likelihood or Bayesian methods
  3. Form assembly, including linear on the fly testing (LOFT) and pre-equating
  4. Calculating the accuracy of examinee scores
  5. Development of computerized adaptive tests (CAT)
  6. Data forensics to find cheaters or other issues.

[Figure: test information function for an assembled test form]

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is the inverse of the square root of the TIF, and has the interpretable advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the SEM is a 95% confidence interval for their score.  The graph on the right shows part of the form assembly process in our FastTest platform.
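
For the curious, here is a sketch of how the TIF and CSEM are computed from the item parameters, using the standard 3PL item information formula (again in the logistic metric) and a handful of hypothetical items.

```python
import numpy as np

def item_information_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return (a ** 2) * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Hypothetical five-item form, parameters (a, b, c)
items = [(1.0, -1.5, 0.20), (0.8, -0.5, 0.25), (1.2, 0.0, 0.20),
         (0.9, 0.8, 0.20), (1.1, 1.5, 0.25)]

thetas = np.linspace(-3, 3, 7)
tif = sum(item_information_3pl(thetas, a, b, c) for a, b, c in items)
csem = 1 / np.sqrt(tif)   # conditional standard error of measurement
for t, info, se in zip(thetas, tif, csem):
    print(f"theta = {t:+.1f}   TIF = {info:.2f}   CSEM = {se:.2f}")
```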

One Big Happy Family

IRT is actually a family of models, making flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and the fit of the data.  If there are multipoint items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters. Learn more about the partial credit situation here.

Where can I learn more?

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software, Xcalibre.

 

If you are delivering high-stakes tests in linear forms – or piloting a bank for CAT/LOFT – you are faced with the issue of how to equate the forms together.  That is, how can we defensibly translate a score on Form A to a score on Form B?  While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. There are a number of ways to approach this issue, and IRT equating is the strongest.

Why do we need equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale.  Suppose you take Form A and get a score of 72/100, while your friend takes Form B and gets a score of 74/100.  Is your friend smarter than you, or did his form happen to have easier questions?  Well, if the test designers built in some overlap, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items.  They are each delivered to a large, representative sample.  Here are the results.

Exam Form | Mean score on 50 overlap items | Mean score on 100 total items
A | 30 | 72
B | 32 | 74

Because the mean score on the anchor items was higher for Form B, we conclude that the Form B group was a little smarter, which led to the higher total score.

Now suppose these are the results:

Exam Form | Mean score on 50 overlap items | Mean score on 100 total items
A | 32 | 72
B | 32 | 74

Now, we have evidence that the groups are of equal ability.  The higher total score on Form B must then be because the unique items on that form are a bit easier.

How do I calculate an equating?

You can equate forms with classical test theory (CTT) or item response theory (IRT).  However, one of the reasons that IRT was invented was that equating with CTT was very weak.  CTT methods include Tucker, Levine, and equipercentile.  Right now, though, let’s focus on IRT.

IRT equating

There are three general approaches to IRT equating.  All of them can be accomplished with our industry-leading software Xcalibre, though conversion equating requires an additional software called IRTEQ.

  1. Conversion
  2. Concurrent Calibration
  3. Fixed Anchor Calibration

Conversion

With this approach, you need to calibrate each form of your test using IRT, completely separately.  We then evaluate the relationship between IRT parameters on each form and use that to estimate the relationship to convert examinee scores.  Theoretically what you do is line up the IRT parameters of the common items and perform a linear regression, so you can then apply that linear conversion to scores.

But DO NOT just do a regular linear regression.  There are specific methods you must use, including mean/mean, mean/sigma, Stocking & Lord, and Haebara.  Fortunately, you don’t have to figure out all the calculations yourself, as there is free software available to do it for you: IRTEQ.
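
To make the idea concrete, here is a sketch of the mean/sigma variant with hypothetical anchor-item difficulty parameters; the more robust Stocking & Lord and Haebara methods optimize over the whole response function, which is what IRTEQ handles for you.

```python
import numpy as np

# Hypothetical b parameters of the common (anchor) items, as calibrated on each form
b_on_form_A = np.array([-1.2, -0.4, 0.1, 0.9, 1.6])
b_on_form_B = np.array([-1.0, -0.3, 0.3, 1.1, 1.9])

# Mean/sigma: linear transformation that places Form B's scale onto Form A's scale
slope = np.std(b_on_form_A, ddof=1) / np.std(b_on_form_B, ddof=1)
intercept = np.mean(b_on_form_A) - slope * np.mean(b_on_form_B)

# Apply to every Form B item: b* = slope*b + intercept, a* = a/slope, c unchanged
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```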

Concurrent Calibration

The second approach is to combine the datasets into what is known as a sparse matrix.  You then run this single data set through the IRT calibration, and it will place all items and examinees onto a common scale.  The sparse matrix is typically represented by a figure like the one below, illustrating the non-equivalent anchor test (NEAT) design.

[Figure: common-item linking in a sparse (NEAT) data matrix]

The IRT calibration software will automatically equate the two forms and you can use the resultant scores.
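
Structurally, building that sparse matrix is just stacking the two datasets so that the anchor columns line up. A sketch with random 0/1 data (matching the running example of 100-item forms that share 50 anchor items) looks like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical scored (0/1) responses: 500 examinees per form, 100 items per form,
# of which the 50 ANCHOR items are shared between the two forms
cols_a = [f"A{i}" for i in range(1, 51)] + [f"ANCHOR{i}" for i in range(1, 51)]
cols_b = [f"B{i}" for i in range(1, 51)] + [f"ANCHOR{i}" for i in range(1, 51)]
form_a = pd.DataFrame(rng.integers(0, 2, (500, 100)), columns=cols_a)
form_b = pd.DataFrame(rng.integers(0, 2, (500, 100)), columns=cols_b)

# Stacking yields the sparse matrix: each group has missing data (NaN) for the
# unique items it never saw, while the anchor columns are complete
sparse = pd.concat([form_a, form_b], ignore_index=True, sort=False)
print(sparse.shape)                    # (1000, 150)
print(sparse.isna().mean().round(2))   # unique items ~50% missing, anchors 0%
```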

Fixed Anchor Calibration

The third approach is a combination of the two above; it utilizes the separate calibration concept but still uses the IRT calibration process to perform the equating rather than separate software.

With this approach, you would first calibrate your data for Form A.  You then find all the IRT item parameters for the common items and input them into your IRT calibration software when you calibrate Form B.

You can tell the software to “fix” the item parameters so that those particular ones (from the common items) do not change.  Then all the item parameters for the unique items are forced onto the scale of the common items, which of course is the underlying scale from Form A.  This then also forces the scores from the Form B students onto the Form A scale.

How do these approaches compare to each other?

Concurrent calibration is arguably the easiest but has the drawback that it merges the scales of each form into a new scale somewhere in the middle.  If you need to report the scores on either form on the original scale, then you must use the Conversion or Fixed Anchor approaches.  This situation commonly happens if you are equating across time periods.

Suppose you delivered Form A last year and are now trying to equate Form B.  You can’t just create a new scale and thereby nullify all the scores you reported last year.  You must map Form B onto Form A so that this year’s scores are reported on last year’s scale and everyone’s scores will be consistent.

Where do I go from here?

If you want to do IRT equating, you need IRT calibration software.  All three approaches use it.  I highly recommend Xcalibre since it is easy to use and automatically creates reports in Word for you.  If you want to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014).  There are other resources more readily available on the internet, like this free handbook from CCSSO.  If you would like to learn more about IRT, I recommend the books by de Ayala (2009) and Embretson & Reise (2000).  A very brief intro is available on our website.


Education, to me, is the never-ending opportunity we have for a cycle of instruction and assessment.  This can range from the extremely small scale (watching a YouTube video on how to change a bike tire, then doing it) to the large scale (teaching a 5th grade math curriculum and then assessing it nationwide).  Psychometrics is the Science of Assessment – using scientific principles to make the assessment side of that equation more efficient, accurate, and defensible.  How can psychometrics, especially its intersection with technology, improve your assessment?  Here are 10 important avenues to improve assessment with psychometrics.

10 ways to improve assessment with psychometrics

  • Job analysis: If you are doing assessment of anything job-related, from pre-employment screening tests of basic skills to a nationwide licensure exam for a high-profile profession, a job analysis is the essential first step.  It uses a range of scientifically vetted and quantitatively leveraged approaches to help you define the scope of the exam.
  • Standard-setting studies: If a test has a cutscore, you need a defensible method to set that cutscore.  Simply selecting a round number like 70% is asking for a disaster.  There are a number of approaches from the scientific literature that will improve this process, including the Angoff method and Contrasting Groups method.
  • Technology-Enhanced Items (TEIs): These item types leverage the power of computers to change assessment by moving the medium from multiple-choice recall questions to questions that evaluate deeper thinking.  Substantial research exists on these, but don’t forget to establish a valid scoring algorithm!
  • Workflow management: Items are the basic building blocks of the assessment.  If they are not high quality, everything else is a moot point.  There need to be formal processes in place to develop and review test questions.
  • Linking: Linking and equating refer to the process of statistically determining comparable scores on different forms of an exam, including tracking a scale across years and completely different sets of items.  If you have multiple test forms or track performance across time, you need this.  And IRT provides far superior methodologies.
  • Automated test assembly: The assembly of test forms – selecting items to match blueprints – can be incredibly laborious.  That’s why we have algorithms to do it for you.  Check out TestAssembler.
  • Distractor analysis: If you are using items with selected responses (including multiple choice, multiple response, and Likert), a distractor/option analysis is essential to determine if those basic building blocks are indeed up to snuff.  Our reporting platform in FastTest, as well as software like Iteman and Xcalibre, is designed for this purpose.
  • Item response theory (IRT): This is the modern paradigm for developing large-scale assessments.  Most important exams in the world over the past 40 years have used it, across all areas of assessment: licensure, certification, K12 education, postsecondary education, language, medicine, psychology, pre-employment… the trend is clear.  For good reason.
  • Automated essay scoring: This technology is just becoming more widely available, thanks to a public contest hosted by Kaggle.  If your organization scores large volumes of essays, you should probably consider this.
  • Computerized adaptive testing (CAT):  Tests should be smart.  CAT makes them so.  Why waste vast amounts of examinee time on items that don’t contribute?  There are many other advantages too.


The International Association for Computerized Adaptive Testing (IACAT, www.iacat.org) will hold a research summit this year in Princeton, NJ.  Hosted by Educational Testing Service, the summit will bring together researchers on the advanced psychometric algorithms that form the foundation of adaptive testing.

Nathan Thompson, PhD, Vice President of Client Services and Psychometrics, has been invited to present an introductory workshop on CAT.  The workshop is intended for researchers who are experienced in fields such as psychology, education, and medicine but are new to the topic of adaptive testing.  “I’m honored to have the opportunity to be a part of this exciting conference.  I have attended past IACAT conferences in Minnesota, California, Arnhem (The Netherlands), and Sydney, and have always been impressed by the technical sophistication of the research in combination with the real-world applications.”

Dr. Thompson and Dr. David Weiss, Chief Psychometric Officer, serve on the conference’s scientific committee, helping to establish the program and contribute to the blind proposal review process, as well as serving on the IACAT Board.

You can read a copy of the official press release here.
