What is a Standard Setting Study?

A standard setting study is a formal, quantitative process for establishing a performance standard on an exam, such as what score is “proficient” or “passing.” This is typically manifested as a cutscore which is then used for making decisions about people: hire them, pass them, accept them into university, etc. Because it is used for such important decisions, a lot of work goes into standard setting, using methods based on scientific research.

What is NOT standard setting?

In the assessment world, there are actually three uses of the word standard:

A formal definition of the content that is being tested, such as the Common Core State Standards in the USA.
A formalized process for delivering exams, as seen in the phrase “standardized testing.”
A benchmark for performance, like we are discussing here.

For this reason, I prefer the term cutscore study, but the phrase standard setting is used more often.

How is a standard setting study used?

As part of a comprehensive test development cycle, after item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification. This cannot be done arbitrarily, such as setting it at 70% because that’s what you saw when you were in school. That is a legal landmine! To be legally defensible and eligible for accreditation of a certification program, the cutscore must be set using one of several standard-setting approaches from the psychometric literature. So, if your organization is classifying examinees into Pass/Fail, Hire/NotHire, Basic/Proficient/Advanced, or any other groups, you most likely need a standard setting study. This is NOT limited to certification, although it is often discussed in that pass/fail context.

What are some methods of a standard setting study?

Many methods have been suggested in the scientific literature of psychometrics. They are often divided into examinee-centered and item-centered approaches. Angoff and Bookmark are designed around evaluating items, while Contrasting Groups and Borderline Group are designed around evaluating the distributions of actual examinee scores. The Bookmark approach is arguably a bit of both, however, because the panel evaluates items but the items are ordered using examinee performance on them. You may also be interested in reading this introductory post on setting a cutscore using item response theory.

Angoff

In an Angoff study, a panel of subject matter experts (SMEs) rates each item, estimating the percentage of minimally competent candidates who would answer it correctly. If we average the ratings across all raters and items, the result is the percentage-correct score that the raters expect from a minimally competent candidate, which is a very compelling argument for a cutscore to pass competent examinees! It is often done in tandem with the Beuk Compromise. The Angoff method does not require actual examinee data, though the Beuk does.
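
The averaging step can be sketched in a few lines of Python. This is a minimal illustration, not a full Angoff workflow: the function name `angoff_cutscore` and the ratings are made up for the example, and it assumes every rater has rated every item on a 0-100 percentage scale.

```python
from statistics import mean


def angoff_cutscore(ratings):
    """Average Angoff ratings into a cutscore.

    ratings[r][i] is rater r's estimate (0-100) of the percentage of
    minimally competent candidates who would answer item i correctly.
    Returns (percent-correct cutscore, raw-score cutscore).
    """
    n_items = len(ratings[0])
    if any(len(rater) != n_items for rater in ratings):
        raise ValueError("every rater must rate every item")
    # Mean rating per item, then the mean across items.
    item_means = [mean([rater[i] for rater in ratings]) for i in range(n_items)]
    percent_cut = mean(item_means)
    raw_cut = percent_cut / 100 * n_items
    return percent_cut, raw_cut


# Hypothetical ratings: 3 raters x 4 items.
example_ratings = [
    [80, 60, 90, 50],
    [70, 65, 85, 55],
    [75, 55, 95, 60],
]
percent_cut, raw_cut = angoff_cutscore(example_ratings)
print(percent_cut, raw_cut)
```

In practice the raw cutscore is usually rounded according to a rule the organization sets in advance, and the per-item means are worth reviewing for items where raters disagree strongly.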

Bookmark

The Bookmark method orders the items in a test form by ascending difficulty, and a panel of experts reads through this ordered booklet and places a “bookmark” where they think the cutscore should be. This process requires enough real data to calibrate item difficulty accurately, typically using item response theory, which necessitates data from several hundred examinees. The aim is a cutscore that reflects the proficiency the test is meant to certify; as with any method, its validity and reliability depend on the quality of the panel and of the calibration data.
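
To make the link between a bookmark and a cutscore concrete, here is a rough sketch. Several things in it are my assumptions rather than part of the description above: items are calibrated with the Rasch model, a bookmark is recorded as the 1-based position of the last item the panelist thinks a minimally competent candidate would answer correctly, a response probability of 0.67 is used to map that item to the ability scale, and the panel's cutscore is the median across panelists. Real studies vary on all of these conventions.

```python
import math
from statistics import median


def bookmark_theta_cut(difficulties, bookmarks, rp=0.67):
    """Convert panelists' bookmarks to a cutscore on the theta scale.

    difficulties: Rasch item difficulties (any order).
    bookmarks: for each panelist, the 1-based position in the
        difficulty-ordered booklet of the last item they judge a
        minimally competent candidate would answer correctly.
    rp: response probability criterion (assumed convention).
    """
    if not 0 < rp < 1:
        raise ValueError("rp must be between 0 and 1")
    ordered = sorted(difficulties)  # easiest item first
    # Under the Rasch model, P(correct) = rp when theta = b + ln(rp / (1 - rp)).
    offset = math.log(rp / (1 - rp))
    cuts = []
    for position in bookmarks:
        if not 1 <= position <= len(ordered):
            raise ValueError("bookmark position outside the booklet")
        cuts.append(ordered[position - 1] + offset)
    return median(cuts)


# Hypothetical calibration and bookmarks from three panelists.
example_difficulties = [-1.2, 0.4, -0.3, 1.1, 0.0]
example_bookmarks = [3, 4, 3]
theta_cut = bookmark_theta_cut(example_difficulties, example_bookmarks)
print(round(theta_cut, 3))
```

The theta cutscore would then be converted to the reporting scale (raw or scaled score) using the same calibration.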

Contrasting Groups

With the Contrasting Groups approach, candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other external standard. We can then compare the score distributions on our exam for the two groups, and pick the cutscore that best differentiates Pass vs Fail on the other standard. An example of this is below. If using data from another exam, a sample of at least 50 candidates is generally needed, since you are evaluating distributions.
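
One simple way to pick the cutscore that "best differentiates" the groups is to minimize the number of misclassified candidates. The sketch below does exactly that; the function name and the scores are invented for illustration, and other criteria (such as the point where two smoothed distributions intersect, or weighting false passes more heavily than false fails) are equally reasonable.

```python
def contrasting_groups_cutscore(pass_scores, fail_scores):
    """Find the cutscore with the fewest misclassifications.

    pass_scores / fail_scores: scores on our exam for candidates
    classified Pass / Fail by the external standard.
    A candidate passes our exam when score >= cutscore.
    Ties are broken in favor of the lowest cutscore.
    Returns (cutscore, number of misclassified candidates).
    """
    candidates = sorted(set(pass_scores) | set(fail_scores))
    best_cut, best_errors = None, None
    for cut in candidates:
        false_fails = sum(score < cut for score in pass_scores)
        false_passes = sum(score >= cut for score in fail_scores)
        errors = false_fails + false_passes
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors


# Hypothetical data, far smaller than a real study would use.
example_pass = [62, 70, 71, 72, 75, 78, 80, 85]
example_fail = [50, 55, 60, 64, 66, 68, 71]
cg_cut, cg_errors = contrasting_groups_cutscore(example_pass, example_fail)
print(cg_cut, cg_errors)
```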

Borderline Group

The Borderline Group method is similar to the Contrasting Groups method, but instead of comparing two groups it defines a single borderline group using alternative information and evaluates the scores of that group. The group consists of individuals who are deemed to be on the threshold of passing or failing based on external criteria, which might include previous performance data, demographic information, or other relevant biodata. The scores of this group on the exam are then analyzed to determine the cutscore. Drawing on this contextual information about test-takers helps refine the cutscore.
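
A common way to "analyze" the borderline group's scores is simply to take their median, which is what this small sketch does. Using the median is my assumption here, not something the description above prescribes, and the scores and flags are made up.

```python
from statistics import median


def borderline_group_cutscore(scores, is_borderline):
    """Return the median exam score of the candidates flagged as borderline.

    scores: exam scores for all candidates.
    is_borderline: parallel list of booleans from the external criteria.
    """
    borderline_scores = [s for s, flag in zip(scores, is_borderline) if flag]
    if not borderline_scores:
        raise ValueError("no candidates were flagged as borderline")
    return median(borderline_scores)


# Hypothetical candidates; the middle five are judged borderline.
example_scores = [55, 62, 66, 68, 70, 71, 74, 80, 88]
example_flags = [False, False, True, True, True, True, True, False, False]
bg_cut = borderline_group_cutscore(example_scores, example_flags)
print(bg_cut)
```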

Hofstee

The Hofstee approach is often used as a reality check for the modified-Angoff method but can also stand alone as a method for setting cutscores. It involves only a few estimates from a panel of SMEs. Specifically, the SMEs provide estimates for the minimum and maximum acceptable failure rates and the minimum and maximum acceptable cutscores. These estimates are then plotted against the observed score distribution to determine a compromise cutscore that balances these criteria. The simplicity and practicality of the Hofstee approach make it a valuable tool in various testing scenarios, helping keep the cutscore both realistic and justifiable.
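
The compromise can be sketched as follows. The panel's four estimates define a straight line from (minimum cutscore, maximum failure rate) to (maximum cutscore, minimum failure rate), and the compromise is where that line meets the failure-rate curve implied by the actual scores. This is a simplified illustration under my own assumptions: cutscores are integers, a candidate fails when scoring below the cutscore, the panel's estimates have already been averaged, and the "intersection" is taken as the integer cutscore where the two curves are closest.

```python
def hofstee_cutscore(scores, c_min, c_max, f_min, f_max):
    """Approximate the Hofstee compromise cutscore.

    scores: observed exam scores.
    c_min, c_max: minimum and maximum acceptable cutscores (integers).
    f_min, f_max: minimum and maximum acceptable failure rates (0-1).
    """
    if c_min >= c_max or f_min >= f_max:
        raise ValueError("minimum values must be below maximum values")
    n = len(scores)
    best_cut, best_gap = None, None
    for cut in range(c_min, c_max + 1):
        # Failure rate the data would produce at this cutscore.
        observed_fail = sum(score < cut for score in scores) / n
        # Failure rate on the panel's line at this cutscore.
        line_fail = f_max + (f_min - f_max) * (cut - c_min) / (c_max - c_min)
        gap = abs(observed_fail - line_fail)
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


# Hypothetical scores and panel estimates.
example_scores = [52, 55, 58, 60, 61, 63, 64, 66, 67, 68,
                  70, 71, 72, 74, 75, 77, 79, 82, 85, 90]
hofstee_cut = hofstee_cutscore(example_scores, c_min=60, c_max=70,
                               f_min=0.10, f_max=0.40)
print(hofstee_cut)
```

If the observed curve never crosses the panel's line inside the acceptable range, the sketch still returns the closest point, so it is worth checking the gap before trusting the result.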

Ebel

The Ebel approach categorizes test items by both their importance and difficulty level. A panel of experts rates each item on these two dimensions, creating a matrix that helps in determining the cutscore. Despite its thorough and structured approach, the Ebel method is one of the older methods and has largely fallen out of use in modern testing practice. Advances in psychometric techniques and the development of more efficient and accurate methods, such as those based on item response theory, have led to the Ebel approach being replaced by more contemporary standard-setting techniques.
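
Here is a small sketch of how the matrix turns into a cutscore. It assumes the usual form of the method, in which the panel also agrees on the percentage of items in each importance-by-difficulty cell that a minimally competent candidate should answer correctly; that extra judgment, the category labels, and all the numbers are assumptions for illustration.

```python
from collections import Counter


def ebel_cutscore(item_cells, expected_pct):
    """Compute an Ebel cutscore.

    item_cells: one (importance, difficulty) tuple per item.
    expected_pct: maps each cell to the percentage (0-100) of its items
        a minimally competent candidate is expected to answer correctly.
    Returns (raw-score cutscore, percent-correct cutscore).
    """
    counts = Counter(item_cells)
    missing = set(counts) - set(expected_pct)
    if missing:
        raise ValueError(f"no expected percentage for cells: {sorted(missing)}")
    # Weight each cell's expected percentage by its number of items.
    raw_cut = sum(n * expected_pct[cell] / 100 for cell, n in counts.items())
    percent_cut = 100 * raw_cut / len(item_cells)
    return raw_cut, percent_cut


# Hypothetical 10-item form and panel judgments.
example_cells = (
    [("essential", "easy")] * 4
    + [("essential", "hard")] * 2
    + [("important", "easy")] * 3
    + [("important", "hard")] * 1
)
example_expected = {
    ("essential", "easy"): 90,
    ("essential", "hard"): 60,
    ("important", "easy"): 80,
    ("important", "hard"): 40,
}
ebel_raw, ebel_pct = ebel_cutscore(example_cells, example_expected)
print(ebel_raw, ebel_pct)
```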

How to choose an approach?

There is often no single correct answer. In fact, guidelines like those from NCCA do not lay out which method to use; they just tell you to use an appropriate method.

There are several considerations. Perhaps the most important is whether you have existing data. The Bookmark, Contrasting Groups, and Borderline Group approaches all assume that we have data from a test already delivered, which we can analyze with the perspective of the latent standard. The Angoff approach, in contrast, can be done before a test is ever delivered, and the Hofstee estimates can also be collected in advance, although locating the Hofstee compromise requires an observed score distribution. Setting a cutscore without data is arguably less defensible, but is a huge practical advantage.

The choice also depends on whether you can easily recruit a panel of subject matter experts, as that is required for Angoff and Bookmark. The Contrasting Groups method assumes we have a gold standard, which is rare.

How can I implement a standard setting study?

If your organization has an in-house psychometrician, they can usually do this. If, for example, you are a board of experts in a profession but lack experience in psychometrics, you need to hire a firm. We can perform such work for you – contact us to learn more.

Nathan Thompson, PhD

Nathan Thompson earned his PhD in Psychometrics from the University of Minnesota, with a focus on computerized adaptive testing. His undergraduate degree was from Luther College with a triple major of Mathematics, Psychology, and Latin. He is primarily interested in the use of AI and software automation to augment and replace the work done by psychometricians, which has provided extensive experience in software design and programming. Dr. Thompson has published over 100 journal articles and conference presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/ .
