An item distractor, also known as a foil or a trap, is an incorrect option for a selected-response item on an assessment.

What makes a good item distractor?

One word: plausibility.  We need the item distractor to attract examinees.  If it is so irrelevant that no one considers it, then it does not do any good to include it in the item.  Consider the following item.

What is the capital of the United States of America?

A. Los Angeles

B. New York

C. Washington, D.C.

D. Mexico City

The last option is quite implausible – not only is it outside the USA, but its name even contains another country, so no student is likely to select it.  The item then becomes a three-horse race, and students have a 1 in 3 chance of guessing correctly.  This certainly makes the item easier.

In addition, the distractor needs to have negative discrimination.  That is, while we want the correct answer to attract the more capable examinees, we want the distractors to attract the less capable examinees.  If you have a distractor that you thought was incorrect, and it turns out to attract all the top students, you need to take a long, hard look at that question! To calculate discrimination statistics on distractors, you will need software such as Iteman.
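
Software such as Iteman reports these option-level statistics for you, but to make the idea concrete, here is a minimal sketch of how they could be computed.  This is not Iteman's algorithm, and the response matrix and keys are hypothetical.

```python
import numpy as np

# Hypothetical data: each row is an examinee, each column an item, and each
# entry is the option the examinee chose. keys holds the correct option per item.
responses = np.array([
    ["C", "B", "A"],
    ["C", "A", "A"],
    ["B", "B", "C"],
    ["C", "B", "A"],
    ["A", "C", "B"],
])
keys = np.array(["C", "B", "A"])

total_scores = (responses == keys).sum(axis=1)  # number-correct score per examinee

def distractor_stats(item: int, option: str):
    """P-value and discrimination (point-biserial) for one option of one item."""
    chose = (responses[:, item] == option).astype(float)
    p = chose.mean()
    # A good distractor should show NEGATIVE discrimination: choosing it
    # should correlate negatively with the total score.
    r = float(np.corrcoef(chose, total_scores)[0, 1]) if 0 < p < 1 else float("nan")
    return p, r

for option in ["A", "B", "C", "D"]:
    p, r = distractor_stats(0, option)
    print(f"Item 1, option {option}: P = {p:.2f}, r = {r:+.2f}")
```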

What makes a bad item distractor?

Obviously, implausibility and positive discrimination (a distractor that attracts the top students) are frequent offenders.  But if you think more deeply about plausibility, the key is actually plausibility without being arguably correct.  This can be a fine line to walk, and it is a common source of problems for items.  You might have a medical item that presents a scenario and asks for a likely diagnosis; perhaps one of the distractors is so unlikely as to be essentially implausible, but it might actually be possible for a small subset of patients under certain conditions.  If the author and item reviewers did not catch this, the examinees probably will, and it will be evident in the statistics.  This is one of the reasons it is important to do psychometric analysis of test results; in fact, accreditation standards often require you to go through this process at least once a year.

What is item review?  It is the process of performing quality control on items before they are ever delivered to examinees.  This is an absolutely essential step in the development of items for medium- and high-stakes exams; while a teacher might not have other teachers review questions on a 4th grade math quiz, items that are part of an admissions exam or professional certification exam will go through multiple layers of independent item review before a single examinee sees them.  This blog post will discuss some important aspects of the item review process.

Why item review?

Assessment items are, when you look at it from a business perspective, a work product.  They are component parts of a larger machine, the test or assessment; in some cases interchangeable, in other cases very intentional and specific.  It is obviously common practice to perform quality assurance on work products, and the item review process simply applies this concept to test questions.

Who does the item review?

This can differ greatly based on the type of assessment and the stakes involved.  In a medium stakes situation, it might be just one other reviewer.  A professional certificate exam might have all items reviewed by one content expert other than the person who wrote the item, and this could be considered sufficient.  In higher stakes exams that are developed by large organizations, the item might go through two content reviewers, a psychometric reviewer, a bias reviewer, and an editor.  It might then go through additional stages for formatting.  You can see how this can become a very big deal, with dozens of people and hundreds of items floating around.

What do the reviewers check?

It depends on who the reviewer is, but there are often checklists that the organization provides.  A content reviewer might check that the stem is clear, the key is fully correct, the distractors fully incorrect, and all answers of reasonably equivalent length.  The psychometric reviewer might check for aspects that inadvertently tip off the correct answer.  The bias reviewer might look for a specific set of situations that potentially disadvantage some subgroup of the population.  An editor might look for correct usage of punctuation, such as the rule that the stem should never end in a colon.

For example, during my graduate school years I used to write items that were eventually used in the US State of Alaska for K-12 assessments.  The reviewers not only looked for straightforward issues like answer correctness, but also for potential bias in the case of Alaskan students.  As item writers, we were warned to be careful about mentioning any objects that we take for granted in the Lower 48: roads, shopping malls, indoor plumbing, and farms are examples that come to mind.  Checking this was a stage of item review.

How do we manage the work?

Best practice for managing the process is to implement stages.  An organization might decide that all items go to the reviewers listed previously, and in the order that I described them.  Each one must complete their review checklist before the item can be moved on to the next stage.  This might seem like a coldhearted assembly line, given that there certainly is an art to writing good items, but assembly lines unarguably lead to greater quality and increased productivity.

Is there software that makes the item review process easier?

Yes.  You have likely used some form of work process management software in your own job, such as Trello, JIRA, or GitHub.  These are typically based on the concept of swimlanes, which as a whole is often referred to as a Kanban board.  Back in the day, Kanban boards were actual boards with sticky notes on them, as you might have seen on shows like Silicon Valley.  This presents the aforementioned stages as columns in a user interface, and tasks (items) are moved through the stages.  Once Content Reviewer 1 is done with their work and leaves comments on the item, the software provides a way for them to change the stage to Content Review 2 and assign someone as Content Reviewer 2.

Below is an example of this from ASC’s online assessment platform, Assess.ai.  Because Assess.ai is designed for organizations that are driven by best practices and advanced psychometrics, there is an entire portion of the system dedicated to management of item review via the swimlanes interface.

[Figure: Kanban board for item review in Assess.ai]

To implement this process, an administrator at the organization defines the stages that they want all items to receive, and Assess.ai will present these as columns in the swimlane interface.  Administrators can then track and manage the workflow visually.  The reviewers themselves don’t need access to everything, but instead are instructed to click on the items they are supposed to review, and they will be presented an interface like the one below.

[Figure: item review interface in Assess.ai]

Can I implement Kanban item review at my organization?

Absolutely!  Assess.ai is available as a free version (sign up here), with a limit of 500 items and 1 user.  While this means that the free version won’t let you manage dozens of users, you can still implement some aspects of the process to improve item quality in your organization.  Once you are ready to expand, you can simply upgrade your account and add the users. 

Want to learn more?  Drop us an email at solutions@assess.com.

Technology enhanced items are assessment items (questions) that utilize technology to improve the interaction of the item, over and above what is possible with paper.  Technology enhanced items can improve examinee engagement (important with K-12 assessment), assess complex concepts with higher fidelity, improve precision/reliability, and enhance face validity/sellability.  To some extent, the last word is the key one; tech enhanced items simply look sexier and therefore make an assessment platform easier to sell, even if they don’t actually improve assessment.  I’d argue that there are also technology enabled items, which are distinct, as discussed below.

What is the goal of technology enhanced items?

The goal is to improve assessment, by increasing things like reliability/precision, validity, and fidelity.  However, there are a number of TEIs that are actually designed more for sales purposes than psychometric purposes.  So, how do we know if TEIs improve assessment?  That, of course, is an empirical question that is best answered with an experiment.  But let me suggest one metric to address this question: how far does the item go beyond just reformulating a traditional item format to use current user-interface technology?  I would define the reformulation of a traditional format to be a fake TEI, while going beyond would define a true TEI.  An alternative nomenclature might be to call the reformulations technology enhanced items and the true tech usage technology enabled items (Almond et al., 2010; Bryant, 2017), as they would not be possible without technology.

A great example of this is the relationship between a traditional multiple response item and certain types of drag and drop items.  There are a number of different ways that drag and drop items can be created, but for now, let’s use the example of a format that asks the examinee to drag text statements into a box.  An example of this is K-12 assessment items from PARCC that ask the student to read a passage, then present a list of statements about the story, asking the student to drag all true statements into a box.  Take this drag-and-drop tech-enhanced item, for example.

[Figure: drag-and-drop item with statements about Brian's Winter]

Now, consider the following item, often called multiple response.

[Figure: the same Brian's Winter statements presented as a multiple response item]

You can see how this item is exactly the same in terms of psychometric interaction: the student is presented a list of statements, and selects those they think are true.  The item is scored with integers from 0 to K, where K is the number of correct statements; the integers are often then used to implement the generalized partial credit model for final scoring.  This is true regardless of whether the item is presented as multiple response or drag and drop.  The multiple response item, of course, could just as easily be delivered via paper and pencil.  Converting it to drag and drop enhances the item with technology, but the interaction of the student with the item, psychometrically, remains the same.
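
To make that scoring concrete, here is a minimal sketch of the 0-to-K integer scoring; the statement labels are hypothetical.  Note that it is identical for the drag-and-drop and multiple response presentations, which is exactly the point.

```python
# Minimal sketch: identical scoring for drag-and-drop and multiple response.
def score_item(selected: set, correct: set) -> int:
    """Integer score from 0 to K, where K is the number of correct statements.

    This simple variant counts correctly selected statements; some programs
    also subtract points for incorrect selections.
    """
    return len(selected & correct)

correct = {"S1", "S3", "S4"}                          # K = 3
print(score_item({"S1", "S3"}, correct))              # 2: two true statements chosen
print(score_item({"S1", "S2", "S3", "S4"}, correct))  # 3: all K chosen (false S2 not penalized here)
```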

Some True TEIs, or Technology Enabled Items

Of course, the past decade or so has witnessed stronger innovation in item formats.  Gamified assessments change how the interaction of person and item is approached, though this is arguably not as relevant for high stakes assessment due to concerns of validity.  There are also simulation items.  For example, a test for a construction crane operator might provide an interface with crane controls and ask the examinee to complete a task.  Even at the K-12 level there can be such items, such as the simulation of a science experiment where the student is given various test tubes or other instruments on the screen.

Both of these approaches are extremely powerful but have a major disadvantage: cost. They are typically custom-designed. In the case of the crane operator exam or even the science experiment, you would need to hire software developers to create this simulation. There are now some simulation-development ecosystems that make this process more efficient, but the items still involve custom authoring and custom scoring algorithms.

To address this shortcoming, there is a new generation of self-authored item types that are true TEIs. By “self-authored” I mean that a science teacher would be able to create these items themselves, just like they would a multiple choice item. The amount of technology leveraged is somewhere between a multiple choice item and a custom-designed simulation, providing a compromise of reduced cost but still increasing the engagement for the examinee. An example of this is shown below from ASC’s Assess.ai assessment platform. A major advantage of this approach is that the items do not need custom scoring algorithms, and instead are typically scored via point integers, which enables the use of polytomous item response theory.

[Figure: a self-authored tech-enhanced graphing item in Assess.ai]

Are we at least moving forward?  Not always!

There is always pushback against technology, and in this topic the counterexample is the gridded item type.  It goes in the reverse direction of innovation: rather than taking a traditional format and reformulating it for current UI, it ignores the capabilities of current UI (in fact, UI for the past 20+ years) and is therefore a step backward.  With that item type, students are presented a bubble sheet from a 1960s style paper exam, on a computer screen, and asked to fill in the bubbles by clicking on them rather than using a pencil on paper.

Another example is the EBSR (evidence-based selected response) item type from the artist formerly known as PARCC.  It was a new item type intended to assess deeper understanding, but it did not use any tech-enhancement or -enablement, instead asking two traditional questions in a linked manner.  As any psychometrician can tell you, this approach ignored basic assumptions of psychometrics, so you can guess the quality of measurement that it put out.

How can I implement TEIs?

It takes very little software development expertise to develop a platform that supports multiple choice items. An item like the graphing one above, though, takes substantial investment. So there are relatively few platforms that can support these, especially with best practices like workflow item review or item response theory. You can try authoring them for free in our Assess.ai assessment platform, or if you have more questions, contact solutions@assess.com.

Automated item generation (AIG) is a paradigm for developing assessment items, aka test questions, utilizing principles of artificial intelligence and automation. As the name suggests, it tries to automate some or all of the effort involved with item authoring, as that is one of the most time-intensive aspects of assessment development – which is no news to anyone who has authored test questions! Items can cost up to $2000 to develop, so even cutting the average cost in half could provide massive time/money savings to an organization.

There are two types of automated item generation:

Type 1: Item Templates (Current Technology)

The first type is based on the concept of item templates to create a family of items using dynamic, insertable variables. There are three stages to this work. For more detail, read this article by Gierl, Lai, and Turner (2012).

  • Authors, or a team, create a cognitive model by isolating exactly what it is they are trying to assess and the different ways that the knowledge could be presented or evidenced.  This might include information such as which variables are important vs. incidental, and what a correct answer should include.
  • They then develop templates for items based on this model, like the example you see below.
  • An algorithm then turns this template into a family of related items, often by producing all possible permutations (a minimal sketch of this stage follows below).
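
Here is a minimal sketch of that third stage, using a hypothetical two-variable template.  A real AIG engine also handles keys, distractors, and constraints between variables, which this sketch ignores.

```python
from itertools import product

# Hypothetical item template with two insertable variables.
template = ("A patient presents with {symptom} while taking {drug}. "
            "What is the most likely cause?")
variables = {
    "symptom": ["dizziness", "chest pain", "shortness of breath"],
    "drug": ["warfarin", "lisinopril"],
}

# Produce all permutations: 3 symptoms x 2 drugs = 6 related items.
names = list(variables)
family = [template.format(**dict(zip(names, combo)))
          for combo in product(*(variables[n] for n in names))]
for stem in family:
    print(stem)
```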

Obviously, you can’t use more than one of these on a given test form.  And in some cases, some of the permutations will be an unlikely scenario or possibly completely irrelevant.  But the savings can still be quite real.  I saw a conference presentation by Andre de Champlain from the Medical Council of Canada, in which he stated that overall efficiency improved by 6x and that the generated items were higher quality than traditionally written items, because the process made the authors think more deeply about what they were assessing and how.  He also recommended that template permutations not be automatically moved to the item bank, but instead that each be reviewed by SMEs, for reasons such as those stated above.

You might think “Hey, that’s not really AI…” – but AI is simply software doing things that humans have done in the past, and the definition gets pushed further every year.  Remember, AI used to be just having the Atari be able to play Pong with you!

Type 2: AI Processing of Source Text (Future Technology)

The second type is what the phrase “automated item generation” more likely brings to mind: upload a textbook or similar source to some software, and it spits back drafts of test questions.  For example, see this article by von Davier (2019).  This technology is still cutting edge and working through issues.  For example, how do you automatically come up with quality, plausible distractors for a multiple choice item?  This might be automated in some cases like mathematics, but in most cases the knowledge of plausibility lies with subject matter expertise.  Moreover, this approach is certainly not accessible for the typical educator.  It is currently in use, but by massive organizations that spend millions of dollars.

How Can I Implement Automated Item Generation?

AIG has been used by large testing companies for years, but it is no longer limited to their domain.  It is now available off the shelf as part of ASC’s nextgen assessment platform, Assess.ai.  Best of all, that component is available at the free subscription level; all you need to do is register with a valid email address.

Assess.ai provides a clean, intuitive interface to implement Type 1 AIG, in a way that is accessible to all organizations.  Develop your item templates, insert dynamic fields, and then export the results for review and implementation in an item banking system, which is also available for free in Assess.ai.

If you have worked in the field of assessment and psychometrics, you have undoubtedly encountered the word “standard.” While a relatively simple word, it has the potential to be confusing because it is used in three (and more!) completely different but very important ways. Here’s a brief discussion.

Standard = Cutscore

As noted by the well-known professor Gregory Cizek here, “standard setting refers to the process of establishing one or more cut scores on a test.” The various methods of setting a cutscore, like Angoff or Bookmark, are referred to as standard setting studies. In this context, the standard is the bar that separates a Pass from a Fail. We use methods like the ones mentioned to determine this bar in as scientific and defensible fashion as possible, and give it more concrete meaning than an arbitrarily selected round number like 70%. Selecting a round number like that will likely get you sued since there is no criterion-referenced interpretation.

Standard = Blueprint

If you work in the field of education, you often hear the term “educational standards.” These refer to the curriculum blueprints for an educational system, which also translate into assessment blueprints, because you want to assess what is on the curriculum. Several important ones in the USA are noted here, perhaps the most common of which nowadays is the Common Core State Standards, which attempted to standardize the standards across states. These standards exist to standardize the educational system, by teaching what a group of experts have agreed upon should be taught in 6th grade Math classes for example. Note that they don’t state how or when a topic should be taught, merely that 6th Grade Math should cover Number Lines, Measurement Scales, Variables, whatever – sometime in the year.

Standard = Guideline

If you work in the field of professional certification, you hear the term just as often, but in a different context: accreditation standards.  The two most common are from the National Commission for Certifying Agencies (NCCA) and the ANSI National Accreditation Board (ANAB).  These organizations give a stamp of approval to credentialing bodies, stating that a Certification or Certificate program is legit.  Why?  Because there is no law to stop me from buying a textbook on any topic, writing 50 test questions in my basement, and selling it as a Certification.  It is completely a situation of caveat emptor, and these organizations help the buyers by giving a stamp of approval that the certification was developed with accepted practices like a Job Analysis, Standard Setting Study, etc.

In addition, there are the professional standards for our field.  These are guidelines on assessment in general rather than just credentialing.  Two great examples are the AERA/APA/NCME Standards for Educational and Psychological Testing and the International Test Commission’s Guidelines (yes, they switch to that term) on various topics.

Also: Standardized = Equivalent Conditions

The word is also used quite frequently in the context of standardized testing, though it is rarely chopped to the root word “standard.” In this case, it refers to the fact that the test is given under equivalent conditions to provide greater fairness and validity. A standardized test does NOT mean multiple choice, bubble sheets, or any of the other pop connotations that are carried with it. It just means that we are standardizing the assessment and the administration process. Think of it as a scientific experiment; the basic premise of the scientific method is holding all variables constant except the variable in question, which in this case is the student’s ability. So we ensure that all students receive a psychometrically equivalent exam, with equivalent (as much as possible) writing utensils, scrap paper, computer, time limit, and all other practical surroundings. The problem comes with the lack of equivalence in access to study materials, prep coaching, education, and many bigger questions… but those are a societal issue and not a psychometric one.

So despite all the bashing that the term gets, a standardized test is MUCH better than the alternatives of no assessment at all, or an assessment that is not a level playing field and has low reliability. Consider the case of hiring employees: if assessments were not used to provide objective information on applicant skills and we could only use interviews (which are famously subjective and inaccurate), all hiring would be virtually random and the amount of incompetent people in jobs would increase a hundredfold. And don’t we already have enough people in jobs where they don’t belong?

A Standard Setting Study is a formal process of establishing a performance standard.  In the assessment world, there are actually two uses of the word standard – the other one refers to a formal definition of the content that is being tested, such as the Common Core State Standards in the USA.  For this reason, I prefer the term cutscore study.

After item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification.  This cannot be done arbitrarily (e.g., setting it at 70% because that’s what you saw when you were in school).  To be legally defensible and eligible for accreditation, it must be done using one of several standard setting approaches from the psychometric literature.  The choice of method depends upon the nature of the test, the availability of pilot data, and the availability of subject matter experts.

Some types of Cutscore Studies:

  • Angoff – In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly (a small computational sketch follows this list).  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
  • Bookmark – The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be.  Obviously, this requires enough real data to calibrate item difficulty, usually using item response theory, which requires several hundred examinees.
  • Contrasting Groups – Candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other external standard.  If using data from another exam, a sample of at least 50 candidates is obviously needed.
  • Borderline Group – Similar to Contrasting Groups, but a borderline group is defined using alternative information such as biodata, and the scores of the group are evaluated.
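
To make the Angoff arithmetic concrete, here is a minimal sketch with hypothetical ratings.  Real studies add rounds of discussion, rater agreement statistics, and often the Beuk Compromise on top of this calculation.

```python
import numpy as np

# Hypothetical ratings: rows are SMEs, columns are items. Each entry is the
# judged probability that a minimally competent candidate answers correctly.
ratings = np.array([
    [0.90, 0.60, 0.75, 0.50],
    [0.85, 0.55, 0.80, 0.45],
    [0.95, 0.65, 0.70, 0.55],
])

item_means = ratings.mean(axis=0)   # consensus expectation per item
cutscore = item_means.sum()         # recommended raw (number-correct) cutscore
print(f"Recommended cutscore: {cutscore:.2f} out of {ratings.shape[1]}")
# -> Recommended cutscore: 2.75 out of 4
```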

One of the most cliche phrases associated with assessment is “teaching to the test.”  I’ve always hated this phrase, because it is only used in a derogatory manner, almost always by people who do not understand the basics of assessment and psychometrics.  I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.

 

Why teaching to the test is usually a good thing

If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum.  Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

In the world of credentialing assessment, this is an extremely important link.  Credential tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills in the job.  This data is then transformed into test blueprints.  Instructors for the profession, as well as aspiring students that are studying to pass the test, then focus on what is in the blueprints.  This, of course, still contains the skills that are most important and frequently used in the job!

 

So what is the problem then?

Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then power to choose how they teach, as noted in the PISA article.

As a counterexample, my high school math department made an edict starting my sophomore year that all teachers had to use the “Chicago Method.”  It was pure bunk, based on the idea that students should be doing as much busy work as possible instead of the teachers actually teaching.  I think it is because some salesman convinced the department head to make the switch so that they would buy a thousand brand new textbooks.  The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school), but I think we ended up with a bastardization of it, as the edict was primarily:

  1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
  2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.  
  3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.

 

What about all the bad tests out there?

Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints.  There are, most certainly, plenty of cases where an assessment is not designed or built well.  That’s an entirely different problem, and an entirely valid concern.  I have seen a number of these in my career.  This danger is why we have international standards on assessments, like AERA/APA/NCME and NCCA.  These provide guidelines on how a test should be built, sort of like how you need to build a house according to building code rather than just throwing up some walls and a roof.

For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world.  It’s a matter of caveat emptor.  But just because you had some cheap furniture in college that broke right away doesn’t mean you swear off all furniture.  You stay away from bad furniture.

There’s also the problem of tests being misused, but again that’s not a problem with the test itself.  More likely, someone making decisions with the scores is uninformed.  It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application then it’s still not a good situation.  For example, imagine taking a very well-made exam for high school graduation and using it for employment decisions with adults.  Psychometricians call this validity – that we have evidence to support the intended use of the test and interpretations of scores.  It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.

 

So where do we go from here?

Put it this way, if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining unspecifically about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  Nevertheless, of course, we need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

Classical test theory is a century-old paradigm for psychometrics – using quantitative and scientifically based processes to develop and analyze assessments to maximize their quality.  (Nobody likes unfair tests!)  The most basic and frequently used item statistic from classical test theory is the P-value.  It is usually called item difficulty, but is sometimes called item facility, which can lead to confusion.

The P-Value Statistic

The classical P-value is the proportion of examinees that respond correctly to a question, or respond in the “keyed direction” for items where the notion of correct is not relevant (imagine a personality assessment where all questions are Yes/No statements such as “I like to go to parties” … Yes is the keyed direction for an Extraversion scale).  Note that this is NOT the same as the p-value that is used in hypothesis testing from general statistical methods.  This P-value is almost universally agreed upon in terms of calculation.  But some people call it item difficulty and others call it item facility.  Why?

It has to do with the clarity of interpretation.  It usually makes sense to think of difficulty as an important aspect of the item.  The P-value presents this, but in a reverse manner.  We usually expect higher values to indicate more of something, right?  But a P-value of 1.00 is high, and it means that there is not much difficulty; everyone gets the item correct, so there is actually no difficulty whatsoever.  A P-value of 0.25 is low, but it means that there is a lot of difficulty; only 25% of examinees are getting it correct, so it has quite a lot of difficulty.
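
The calculation itself is trivial; here is a minimal sketch using a hypothetical scored response matrix (1 = keyed direction, 0 = not).

```python
import numpy as np

# Hypothetical scored responses: rows are examinees, columns are items.
scored = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
])

p_values = scored.mean(axis=0)  # proportion in the keyed direction, per item
print(p_values)                 # [0.75 0.75 0.25 1.  ]
```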

So where does “item facility” come in?

See how the meaning is reversed?  It’s for this reason that some psychometricians prefer to call it item facility or item easiness.  We still use the P-value, but 1.00 means high facility/easiness and 0.25 means low facility/easiness.  The direction of the semantics fits much better.

Nevertheless, this is a minority of psychometricians.  There’s too much momentum to change an entire field at this point!  It’s similar to the 3 dichotomous IRT parameters (a,b,c); some of you might have noticed that they are actually in the wrong order, because the 1-parameter model does not use the a parameter, it uses the b.  At the end of the day, it doesn’t really matter, but it’s another good example of how we all just got used to doing something and it’s now too far down the road to change it.  Tradition is a funny thing.

The modified-Angoff method is arguably the most common method of setting a cutscore on a test.  The Angoff cutscore is legally defensible and meets international standards such as AERA/APA/NCME, ISO 17024, and NCCA.  It also has the benefit that it does not require the test to be administered to a sample of candidates first, whereas methods like Contrasting Groups, Borderline Group, and Bookmark do.

There are, of course, some drawbacks to the Angoff cutscore process.  The most significant is the fact that the subject matter experts (SMEs) tend to overestimate the ability of their conceptualized minimally competent candidate, and therefore overestimate the cutscore – sometimes to the point that the expected pass rate is zero!

Another drawback is that the Angoff cutscore process only works in the classical psychometric paradigm – the recommended cutscores are on the number-correct metric or percentage-correct metric.  If your tests are developed and scored in the item response theory (IRT) paradigm, you need to convert the classical cutscore to the IRT theta scale.  The easiest way to do that is to reverse-calculate the test response function (TRF) from IRT.

The Test Response Function

The TRF (sometimes called a test characteristic curve) is an important method of characterizing test performance in the IRT paradigm.  The TRF predicts a classical score from an IRT score, as you see below.  Like the item response function and test information function (these need blog posts too), it uses the theta scale as the X-axis.  The Y-axis can be either the number-correct metric or proportion-correct metric.

In this example, you can see that a theta of -0.6 translates to an estimated number-correct score of approximately 10, and +1 to 15.5.  Note that the number-correct metric only makes sense for linear or LOFT exams, where every examinee receives the same number of items.  In the case of CAT exams, only the proportion correct metric makes sense.

[Figure: test response function mapping theta to expected number-correct score]

So how does this help us with the conversion of a cutscore?  Well, we hereby have a way of translating any number-correct score or proportion-correct score.  So any Angoff-recommended cutscore can be reverse-calculated to a theta value.  If your Angoff study (or Beuk) recommends a cutscore of 10 out of 20 points, you can convert that to a theta cutscore of -0.6.  If the recommended cutscore was 15.5, the theta cutscore would be 1.0.
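
Here is a minimal sketch of that reverse-calculation, assuming a short test with hypothetical 3PL item parameters.  Because the TRF is monotonically increasing in theta, a simple bisection search is enough.

```python
import numpy as np

# Hypothetical 3PL parameters (a, b, c) for a short test.
params = [(1.2, -0.5, 0.20), (0.9, 0.0, 0.25), (1.5, 0.8, 0.20), (1.1, -1.2, 0.20)]

def trf(theta: float) -> float:
    """Test response function: expected number-correct score at theta."""
    return sum(c + (1 - c) / (1 + np.exp(-a * (theta - b))) for a, b, c in params)

def theta_cutscore(raw_cut: float, lo: float = -4.0, hi: float = 4.0) -> float:
    """Find the theta whose expected score equals the raw cutscore, via
    bisection; the TRF is monotonically increasing, so this converges."""
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        if trf(mid) < raw_cut:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(theta_cutscore(2.5), 2))  # theta equivalent of a raw cutscore of 2.5/4
```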

IRT scores examinees on the same scale with any set of items, as long as those items have been part of a linking/equating study.  Therefore, a single Angoff study on a set of items can be equated to any other linear test form, LOFT pool, or CAT pool.  This makes it possible to apply the classically focused Angoff method to IRT-focused programs.

Linear on the fly testing (LOFT) is an approach to delivering assessments to examinees.  In general, there are two families of test delivery.  Static approaches deliver the same test form or forms to everyone; this is the ubiquitous and traditional “linear” method of testing.  Algorithmic approaches deliver the test to each examinee based on a computer algorithm; this includes LOFT, computerized adaptive testing (CAT), and multistage testing (MST).

What is linear on the fly testing?

The purpose of linear on the fly testing is to give every examinee a linear form that is uniquely created for them – but each form is created to be psychometrically equivalent to all others to ensure fairness.  For example, we might have a pool of 200 items, and every person only gets 100, but that 100 is balanced for each person.  This can be done by ensuring content and/or statistical equivalency, and by balancing ancillary metadata such as item types or cognitive level.

Content Equivalence

This portion is relatively straightforward.  If your test blueprint calls for 20 items in each of 5 domains, for a total of 100 items, then each form administered to examinees should follow this blueprint.  Sometimes the content blueprint might go 2 or even 3 levels deep.

Statistical Equivalence

There are, of course, two predominant psychometric paradigms: classical test theory (CTT) and item response theory (IRT).  With CTT, forms can easily be built to have an equivalent P-value, and therefore an equivalent expected mean score.  If point-biserial statistics are available for each item, you can also design the algorithm to assemble forms that have the same standard deviation and reliability.  With IRT, the typical approach is to design forms to have the same test information function, or inversely, the same conditional standard error of measurement function.  To learn more about how these are implemented, download our IRT Scoring Spreadsheet or Classical Form Assembly Tool.
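
As a toy illustration of the CTT case, here is a minimal sketch that assembles a blueprint-conforming form whose mean P-value approximates a target.  Real LOFT engines use far more sophisticated optimization, plus item exposure controls; the pool, blueprint, and target below are all hypothetical.

```python
import random

# Hypothetical pool: (item_id, domain, P-value). A real engine would also
# track point-biserials or IRT parameters and enforce item exposure limits.
pool = [(f"item{i:02d}", ["D1", "D2"][i % 2], round(random.uniform(0.4, 0.9), 2))
        for i in range(40)]
blueprint = {"D1": 5, "D2": 5}   # items required per domain
target_p = 0.70                  # target mean P-value for every form

def assemble_form(pool, blueprint, target_p, tries=500):
    """Draw random blueprint-conforming forms; keep the one whose mean
    P-value is closest to the target (a stand-in for real optimization)."""
    best, best_gap = None, float("inf")
    for _ in range(tries):
        form = []
        for domain, n in blueprint.items():
            form += random.sample([it for it in pool if it[1] == domain], n)
        mean_p = sum(p for _, _, p in form) / len(form)
        if abs(mean_p - target_p) < best_gap:
            best, best_gap = form, abs(mean_p - target_p)
    return best

form = assemble_form(pool, blueprint, target_p)
print(sorted(item_id for item_id, _, _ in form))
```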

Implementing LOFT

LOFT is typically implemented by publishing a pool of items with an algorithm to select subsets that meet the requirements.  Therefore, you need a psychometrically sophisticated testing engine that stores the necessary statistics and item metadata, lets you define a pool of items, specify the relevant options such as target statistics and blueprints, and deliver the test in a secure manner.  Very few testing platforms can implement a quality LOFT assessment.

Why all this?

It certainly is not easy to build a strong item bank, design LOFT pools, and develop a complex algorithm that meets content and statistical balancing needs.  So why would an organization use linear on the fly testing?  Well, it is much more secure than having a few linear forms.  Since everyone receives a unique form, it is impossible for word to get out about what the first questions on the test are.  And of course, we could simply perform a random selection of 100 items from a pool of 200, but that would be potentially unfair.  Using LOFT will ensure the test remains fair and defensible.