test development Archives


What is Item Banking? What are item banks?

Admin, April 12, 2017

Item banking refers to the purposeful creation of a database of assessment items to serve as a central repository of all test content, improving efficiency and quality. The term item refers to what many call questions, though their content need not be restricted as such and can include problems to solve or situations to evaluate in addition to straightforward questions. As a critical part of the test development cycle, item banking is the foundation for the development of valid, reliable content and defensible test forms.

Automated item banking systems, such as Assess.ai or FastTest, can significantly reduce the administrative time spent developing and reviewing items and assembling and publishing tests, while helping to produce exams that have greater reliability and validity. Contact us to request a free account.


What is Item Banking?

While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Here are the essentials you should be looking for:

Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally, item performance should be tracked not only within a test form but across test forms as well.

Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. Upon exposure to many examinees, items can then be flagged for retirement or revision to reduce cheating or teaching to the test.

Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization methods, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.

Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.

Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content and the users who made these changes are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item can expedite that process.

Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, IRT parameters, and CTT statistics, but there are likely many data points specific to your organization that are worth storing.

Managing an Item Bank

Names are important. As you create or import your item banks it is important to identify each item with a unique, but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting. You might want to also add additional pieces of information. If importing, the system should be smart enough to recognize duplicates.
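To see why leading zeros matter, here is a minimal Python sketch. The "ALG-" prefix and the four-digit width are hypothetical choices for illustration, not a required convention.

```python
# Hypothetical naming convention: a content-area prefix plus a sequence number.
numbers = [1, 2, 10, 11, 100]

unpadded = [f"ALG-{n}" for n in numbers]
padded = [f"ALG-{n:04d}" for n in numbers]

# Plain text sorting compares character by character,
# so "ALG-10" lands before "ALG-2" when names are not padded.
print(sorted(unpadded))  # ['ALG-1', 'ALG-10', 'ALG-100', 'ALG-11', 'ALG-2']

# With leading zeros, text order and numeric order agree.
print(sorted(padded))    # ['ALG-0001', 'ALG-0002', 'ALG-0010', 'ALG-0011', 'ALG-0100']
```

Pick a width that leaves room for the bank to grow; four digits covers up to 9,999 items per prefix.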

Search and filter. The system should also have a reliable sorting mechanism.

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department, or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability, and validity.
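As a rough illustration of those two statistics, the sketch below computes classical difficulty (proportion correct) and discrimination for dichotomously scored items. It assumes discrimination is taken as the correlation between the item score and the score on the remaining items (a corrected point-biserial); other definitions exist, and the response data and function names are made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """responses: one row per examinee, each row a list of 0/1 item scores."""
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        scores = [row[j] for row in responses]
        # Remove the item from the total so it is not correlated with itself.
        rest = [t - s for t, s in zip(totals, scores)]
        stats.append({
            "difficulty": sum(scores) / len(scores),
            "discrimination": pearson(scores, rest),
        })
    return stats


# Hypothetical scored responses: 6 examinees by 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]

for number, s in enumerate(item_statistics(responses), start=1):
    print(number, round(s["difficulty"], 2), round(s["discrimination"], 2))
```

Values like these are what you would store back in the item's metadata after each administration.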

Item statistics are used in the assembly of test forms; classical statistics can be used to estimate the mean, standard deviation, reliability, standard error, and pass rate of a form.

Item response theory parameters can come in handy when calculating test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
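Here is a minimal sketch of those two functions, assuming a two-parameter logistic (2PL) model with no scaling constant; your bank may use a different model such as the 3PL or Rasch. The item parameters and function names are hypothetical.

```python
from math import exp, sqrt


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def form_information(theta, items):
    """Test information: the sum of item information a^2 * P * (1 - P)."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


def standard_error(theta, items):
    """Conditional standard error of measurement: 1 / sqrt(information)."""
    return 1.0 / sqrt(form_information(theta, items))


# Hypothetical (a, b) parameters pulled from item metadata.
items = [(1.2, -1.0), (0.8, -0.3), (1.5, 0.0), (1.0, 0.6), (0.9, 1.4)]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(form_information(theta, items), 3),
          round(standard_error(theta, items), 3))
```

Plotting these values across theta shows where a form measures most precisely, which is useful when assembling forms that should be parallel.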

In the event that your organization decides to publish an adaptive test, utilizing CAT delivery, item parameters for each item will be essential. This is because they are used for intelligent selection of items and scoring examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility and the storage of metadata is one such vital piece of documentation.

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test-takers, as they are likely to be diverse as well!

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. In the event of a high school science test, this may include the evaluation of the content taught in class. A high-stakes certification exam almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed.

Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.
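A simple way to picture blueprint-driven selection is sketched below. The bank records, area names, and the rule of taking the first available items in each area are all assumptions for illustration; a real assembly would also weigh item statistics and exposure.

```python
from collections import defaultdict


def assemble_form(bank, blueprint):
    """Pick the number of items the blueprint requires from each content area.

    bank: list of {"id": ..., "area": ...} records (hypothetical structure).
    blueprint: mapping of content area to the number of items required.
    """
    by_area = defaultdict(list)
    for item in bank:
        by_area[item["area"]].append(item["id"])

    form = []
    for area, needed in blueprint.items():
        available = sorted(by_area[area])
        if len(available) < needed:
            raise ValueError(f"{area}: need {needed} items, bank has {len(available)}")
        # Simplest possible rule: take the first items in name order.
        form.extend(available[:needed])
    return form


# Hypothetical bank and blueprint for a science test.
bank = [
    {"id": "BIO-0001", "area": "Biology"},
    {"id": "BIO-0002", "area": "Biology"},
    {"id": "BIO-0003", "area": "Biology"},
    {"id": "CHM-0001", "area": "Chemistry"},
    {"id": "CHM-0002", "area": "Chemistry"},
    {"id": "PHY-0001", "area": "Physics"},
]
blueprint = {"Biology": 2, "Chemistry": 1, "Physics": 1}

print(assemble_form(bank, blueprint))
# ['BIO-0001', 'BIO-0002', 'CHM-0001', 'PHY-0001']
```

The error raised for a short content area is the useful part: it tells you where the bank needs more items before a form can meet the blueprint.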

Why Item Banking?

There is no doubt that item banking is a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment. Moreover, good item banking will make the job easier and more efficient, thus reducing the cost of item development and test publishing.

Ready to improve assessment quality through item banking?

Visit our Contact Us page, where you can request a demonstration or a free account (up to 500 items).


Modified-Angoff Method Study

Nathan Thompson, PhD, January 11, 2017

A modified-Angoff method study is one of the most common ways to set a defensible cutscore on an exam. As a result, the pass/fail decisions made by the test are more trustworthy than if you picked a random number; if your doctor, lawyer, accountant, or other professional has passed an exam where the cutscore has been set with this method, you can place more trust in their skills.

What is the Angoff method?

It is a scientific way of setting a cutscore (pass point) on a test. If you have a criterion-referenced interpretation, it is not legally defensible to just conveniently pick a round number like 70%; you need a formal process. There are a number of acceptable methodologies in the psychometric literature for standard-setting studies, which are used to set cutscores (also known as passing points). Some examples include Angoff, modified-Angoff, Bookmark, Contrasting Groups, and Borderline. The modified-Angoff approach is by far the most popular approach. It is used especially frequently for certification, licensure, certificate, and other credentialing exams.

It was originally suggested as a mere footnote by renowned researcher William Angoff, at Educational Testing Service.

How does the Angoff approach work?

First, you gather a group of subject matter experts, and have them define what they consider to be a Minimally Competent Candidate (MCC). Next, you have them estimate the percent of minimally competent candidates that will answer each item correctly. You then analyze the results for outliers or inconsistencies, and have the experts discuss then re-rate the items to gain better consensus. The average final rating is then the expected percent-correct score for a minimally competent candidate.
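The arithmetic behind that last sentence can be sketched in a few lines of Python. The ratings below are invented, and the function name is just for this example.

```python
def angoff_cutscore(ratings):
    """ratings[r][i] is rater r's estimate (0-100) of the percent of
    minimally competent candidates who would answer item i correctly."""
    n_raters = len(ratings)
    n_items = len(ratings[0])
    item_means = [sum(rater[i] for rater in ratings) / n_raters
                  for i in range(n_items)]
    percent_cut = sum(item_means) / n_items  # cutscore on the percent scale
    raw_cut = sum(item_means) / 100          # cutscore in number-correct points
    return item_means, percent_cut, raw_cut


# Hypothetical final-round ratings: 3 raters by 4 items.
ratings = [
    [90, 70, 60, 80],
    [85, 75, 55, 85],
    [95, 65, 65, 75],
]

item_means, percent_cut, raw_cut = angoff_cutscore(ratings)
print(item_means)   # [90.0, 70.0, 60.0, 80.0]
print(percent_cut)  # 75.0
print(raw_cut)      # 3.0
```

Here the panel expects a minimally competent candidate to score 75%, or 3 points on this 4-item form.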

Advantages of the Angoff method

It is defensible. Because it is the most commonly used approach and is widely studied in the scientific literature, it is well-accepted.
You can implement it before a test is ever delivered. Some other methods require you to deliver the test to a large sample first.
It is conceptually simple, easy enough to explain to non-psychometricians.
It incorporates the judgment of a panel of experts, not just one person or a round number.
It works for tests with both classical test theory and item response theory.
It does not take long to implement – for a short test, it can be done in a matter of hours!
It can be used with different item types, including polytomously scored items (multi-points).

Disadvantages of the Angoff method

It does not use actual data, unless you implement the Beuk method alongside.
It can lead to the experts overestimating the performance of entry-level candidates, as they may have forgotten what it was like to start out 20-30 years ago.

FAQ about the Angoff approach

How do I calculate the Angoff cutscore and inter-rater reliability?

What is the difference between Angoff and modified-Angoff?

The original approach had the experts only say whether they thought an MCC would get it right, not the percentage.

Why do I need to do an Angoff study?

If the test is used to make decisions, like hiring or certification, you are not allowed to pick a round number like 70% with no justification.

What if the experts disagree?

You will need to evaluate inter-rater reliability and agreement, then re-rate the items. More info below.

How many experts do I need?

The bare minimum is 6; 8-10 is better.

Do I need to deliver the test first?

No, that is one advantage of this method - you can set a cutscore before you deliver to any examinees.

Example of the Modified-Angoff Method

First of all, do not expect a straightforward, easy process that leads to an unassailably correct cutscore. All standard-setting methods involve some degree of subjectivity. The goal of the methods is to reduce that subjectivity as much as possible. Some methods focus on content, others on examinee performance data, while some try to meld the two.

Step 1: Prepare Your Team

The modified-Angoff process depends on a representative sample of subject matter experts (SMEs), usually 6-20. By “representative” I mean they should represent the various stakeholders. For instance, a certification for medical assistants might include experienced medical assistants, nurses, and physicians, from different areas of the country. You must train them about their role and how the process works, so they can understand the end goal and drive toward it.

Step 2: Define The Minimally Competent Candidate (MCC)

This concept is the core of the modified-Angoff method, though it is known by a range of terms or acronyms, including minimally qualified candidates (MQC) or just barely qualified (JBQ). The reasoning is that we want our exam to separate candidates that are qualified from those that are not. So we ask the SMEs to define what makes someone qualified (or unqualified!) from a perspective of skills and knowledge. This leads to a conceptual definition of an MCC. We then want to estimate what score this borderline candidate would achieve, which is the goal of the remainder of the study. This step can be conducted in person, or via webinar.

Step 3: Round 1 Ratings

Next, ask your SMEs to read through all the items on your test form and estimate the percentage of MCCs that would answer each correctly. A rating of 100 means the item is a slam dunk; it is so easy that every MCC would get it right. A rating of 40 is very difficult. Most ratings are in the 60-90 range if the items are well-developed. The ratings should be gathered independently; if everyone is in the same room, let them work on their own in silence. This can easily be conducted remotely, though.

Step 4: Discussion

This is where it gets fun. Identify items where there is the most disagreement (as defined by grouped frequency distributions or standard deviation) and make the SMEs discuss it. Maybe two SMEs thought it was super easy and gave it a 95 and two other SMEs thought it was super hard and gave it a 45. They will try to convince the other side of their folly. Chances are that there will be no shortage of opinions and you, as the facilitator, will find your greatest challenge is keeping the meeting on track. This step can be conducted in person, or via webinar.
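One way to build the discussion list is to rank items by the standard deviation of their Round 1 ratings. The sketch below does that; the threshold of 10 rating points is an arbitrary assumption, and the item names and ratings are invented.

```python
from statistics import stdev


def flag_for_discussion(ratings_by_item, sd_threshold=10.0):
    """Return (item, SD) pairs at or above the threshold, most disputed first."""
    flagged = []
    for item_id, ratings in ratings_by_item.items():
        spread = stdev(ratings)  # sample standard deviation across raters
        if spread >= sd_threshold:
            flagged.append((item_id, spread))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical Round 1 ratings from four SMEs.
round1 = {
    "ITEM-0001": [95, 95, 45, 45],  # two camps, as in the example above
    "ITEM-0002": [70, 75, 70, 80],
    "ITEM-0003": [60, 85, 65, 80],
}

for item_id, spread in flag_for_discussion(round1):
    print(item_id, round(spread, 1))
```

With these numbers, ITEM-0001 tops the list and ITEM-0002 is left off, so the panel's time goes to the items where it disagrees most.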

Step 5: Round 2 Ratings

Raters then re-rate the items based on the discussion. The goal is that there will be a greater consensus. In the previous example, it’s not likely that every rater will settle on a 70. But if your raters all end up from 60-80, that’s OK. How do you know there is enough consensus? We recommend the inter-rater reliability suggested by Shrout and Fleiss (1979), as well as looking at inter-rater agreement and dispersion of ratings for each item. This use of multiple rounds is known as the Delphi approach; it pertains to all consensus-driven discussions in any field, not just psychometrics.
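For readers who want to compute the reliability themselves, here is a sketch of two of the intraclass correlations defined by Shrout and Fleiss (1979). Which of their forms to report is a judgment call; this example assumes the two-way random effects versions, ICC(2,1) for a single rater and ICC(2,k) for the average of k raters. The ratings are invented.

```python
def icc_two_way_random(ratings):
    """ratings[i][j] is rater j's rating of item i.

    Returns (ICC(2,1), ICC(2,k)) from a two-way ANOVA decomposition.
    """
    n = len(ratings)     # items (targets)
    k = len(ratings[0])  # raters (judges)
    grand = sum(sum(row) for row in ratings) / (n * k)
    item_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_items = k * sum((m - grand) ** 2 for m in item_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_total - ss_items - ss_raters

    bms = ss_items / (n - 1)               # between-items mean square
    jms = ss_raters / (k - 1)              # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square

    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    average = (bms - ems) / (bms + (jms - ems) / n)
    return single, average


# Hypothetical Round 2 ratings: 5 items by 4 raters.
round2 = [
    [70, 75, 65, 70],
    [85, 90, 80, 85],
    [60, 65, 55, 70],
    [90, 95, 90, 85],
    [75, 70, 70, 80],
]

single, average = icc_two_way_random(round2)
print(round(single, 2), round(average, 2))
```

Running the same function on Round 1 and Round 2 ratings gives a direct check on whether the discussion improved consensus.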

Step 6: Evaluate Results and Final Recommendation

Evaluate the results from Round 2 as well as Round 1. An example of this is below. What is the recommended cutscore, which is the average or sum of the Angoff ratings depending on the scale you prefer? Did the reliability improve? Estimate the mean and SD of examinee scores (there are several methods for this). What sort of pass rate do you expect? Even better, utilize the Beuk Compromise as a “reality check” between the modified-Angoff approach and actual test data. You should take multiple points of view into account, and the SMEs need to vote on a final recommendation. They, of course, know the material and the candidates so they have the final say. This means that standard setting is a political process; again, reduce that effect as much as you can.

Some organizations do not set the cutscore at the recommended point, but at one standard error of judgment (SEJ) below the recommended point. The SEJ is based on the inter-rater reliability; note that it is NOT the standard error of the mean or the standard error of measurement. Some organizations use the latter; the former is just plain wrong (though I have seen it used by amateurs).

Step 7: Write Up Your Report

Validity refers to evidence gathered to support test score interpretations. Well, you have lots of relevant evidence here. Document it. If your test gets challenged, you’ll have all this in place. On the other hand, if you just picked 70% as your cutscore because it was a nice round number, you could be in trouble.

Additional Topics

In some situations, there are more issues to worry about. Multiple forms? You’ll need to equate in some way. Using item response theory? You’ll have to convert the cutscore from the modified-Angoff method onto the theta metric using the Test Response Function (TRF). New credential and no data available? That’s a real chicken-and-egg problem there.
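To make the theta conversion concrete, the sketch below inverts the test response function numerically. It assumes a 2PL model and a simple bisection search on the interval from -6 to 6; the item parameters and the 72% cutscore are invented, and your own software may do this differently.

```python
from math import exp


def expected_proportion(theta, items):
    """Test response function, scaled to expected proportion correct (2PL)."""
    total = sum(1.0 / (1.0 + exp(-a * (theta - b))) for a, b in items)
    return total / len(items)


def cutscore_to_theta(cut_proportion, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Find the theta whose expected proportion correct equals the cutscore."""
    if not expected_proportion(lo, items) < cut_proportion < expected_proportion(hi, items):
        raise ValueError("cutscore is not reachable on the search interval")
    # The function increases with theta, so bisection converges.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_proportion(mid, items) < cut_proportion:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical (a, b) parameters for the form that was rated.
items = [(1.2, -1.0), (0.8, -0.3), (1.5, 0.0), (1.0, 0.6), (0.9, 1.4)]

theta_cut = cutscore_to_theta(0.72, items)
print(round(theta_cut, 3))
```

The resulting theta can then be applied to any form calibrated on the same scale, which is one way to carry a cutscore across multiple forms.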

Where Do I Go From Here?

Ready to take the next step and actually apply the modified-Angoff process to improving your exams? Sign up for a free account in our FastTest item banker.

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological bulletin, 86(2), 420.


Item writing: Tips for authoring test questions

Nathan Thompson, PhD, August 12, 2013

Item writing (aka item authoring) is a science as well as an art, and if you have done it, you know just how challenging it can be! You are experts at what you do, and you want to make sure that your examinees are too. But it’s hard to write questions that are clear, reliable, unbiased, and differentiate on the thing you are trying to assess. Here are some tips.

What is Item Authoring / Item Writing?

Item authoring is the process of creating test questions. You most likely have seen “bad” test questions in your life, and know firsthand just how frustrating and confusing that can be. Fortunately, there is a lot of research in the field of psychometrics on how to write good questions, and also how to have other experts review them to ensure quality. It is best practice to make items go through a workflow, so that the test development process is similar to the software development process.

Because items are the building blocks of tests, the items within a test are likely the greatest threat to its overall validity and reliability. Here are some important tips in item authoring. Want deeper guidance? Check out our Item Writing Guide.

Anatomy of an Item

First, let’s talk a little bit about the parts of a test question. The diagram on the right shows a reading passage with two questions on it. Here are some of the terms used:

Asset/Stimulus: This is a reading passage here, but could also be an audio, video, table, PDF, or other resource
Item: An overall test question, usually called an “item” rather than a “question” because sometimes they might be statements.
Stem: The part of the item that presents the situation or poses a question.
Options: All of the choices to answer.
Key: The correct answer.
Distractors: The incorrect answers.

Item authoring tips: The Stem

To find out whether your test items are your allies or your enemies, read through your test and identify the items that contain the most prevalent item construction flaws. The first three of the most prevalent construction flaws are located in the item stem (i.e. question). Look to see if your item stems contain…

1) BIAS

Nowadays, we tend to think of bias as relating to culture or religion, but there are many more subtle types of biases that oftentimes sneak into your tests. Consider the following questions to determine the extent of bias in your tests:

Are there acronyms in your test that are not considered industry standard?
Are you testing on policies and procedures that may vary from one location to another?
Are you using vocabulary that is more recognizable to a female examinee than a male?
Are you referencing objects that are not familiar to examinees from a newer or older generation?

2) NOT

We’ve all taken tests which ask a negatively worded question. These test items are often the product of item authoring by newbies, but they are devastating to the validity and reliability of your tests, particularly for fast test-takers or individuals with lower reading skills. If the examinee misses that one single word, they will get the question wrong even if they actually know the material. This test item ends up penalizing the wrong examinees!

3) EXCESS VERBIAGE

Long stems can be effective and essential in many situations, but they are also more prone to two specific item construction flaws. If the stem is unnecessarily long, it can contribute to examinee fatigue. Because each item requires more energy to read and understand, examinees tire sooner and may begin to perform more poorly later on in the test—regardless of their competence level.

Additionally, long stems often include information that can be used to answer other questions in the test. This could lead your test to be an assessment of whose test-taking memory is best (i.e. “Oh yeah, #5 said XYZ, so the answer to #34 is XYZ.”) rather than who knows the material.

Item writing tips: distractors / options

Unfortunately, item stems aren’t the only offenders. Experienced test writers know that the distractors (i.e. options) are actually more difficult to write than the stems themselves. When you review your test items, look to see if your item distractors contain…

4) IMPLAUSIBILITY

The purpose of a distractor is to pull less qualified examinees away from the correct answer by other options that look correct. In order for them to “distract” an examinee from the correct answer, they have to be plausible. The closer they are to being correct, the more difficult the exam will be. If the distractors are obviously incorrect, even unqualified examinees won’t pick them. Then your exam will not help you discriminate between examinees who know the material and examinees that do not, which is the entire goal.

5) 3-TO-1 SPLITS

You may recall watching Sesame Street as a child. If so, you remember the song “One of these things…” Looking back, it seems really elementary, but sometimes our test item options are written in such a way that an examinee can play this simple game with your test. Instead of knowing the material, they can look for the option that stands out as different from the others. Consider the following questions to determine if one of your items falls into this category:

Is the correct answer significantly longer than the distractors?
Does the correct answer contain more detail than the distractors?
Is the grammatical structure different for the answer than for the distractors?

6) ALL OF THE ABOVE

There are a couple of problems with having this phrase (or the opposite “None of the above”) as an option. For starters, good test takers know that this is—statistically speaking—usually the correct answer. If it’s there and the examinee picks it, they have a better than 50% chance of getting the item right—even if they don’t know the content. Also, if they are able to identify two options as correct, they can select “All of the above” without knowing whether or not the third option was correct. These sorts of questions also get in the way of good item analysis. Whether the examinee gets this item right or wrong, it’s harder to ascertain what knowledge they have because the correct answer is so broad.

Item authoring is easier with an item banking system

The process of reading through your exams in search of these flaws in the item authoring is time-consuming (and oftentimes depressing), but it is an essential step towards developing an exam that is valid, reliable, and reflects well on your organization as a whole. We also recommend that you look into getting a dedicated item banking platform, designed to help with this process.

Summary Checklist

Issue	Recommendation
Key is invalid due to multiple correct answers.	Consider each answer option individually; the key should be fully correct with each distractor being fully incorrect.
Item was written in a hard-to-comprehend way; examinees were unable to apply their knowledge because of poor wording.	Ensure that the item can be understood after just one read through. If you have to read the stem multiple times, it needs to be rewritten.
Grammar, spelling, or syntax errors direct savvy test takers toward the correct answer (or away from incorrect answers).	Read the stem, followed by each answer option, aloud. Each answer option should fit with the stem.
Information was introduced in the stem text that was not relevant to the question.	After writing each question, evaluate the content of the stem. It should be clear and concise without introducing irrelevant information.
Item emphasizes trivial facts.	Work off of a test blueprint to ensure that each of your items maps to a relevant construct. If you are using Bloom’s taxonomy or a similar approach, items should be from higher order levels.
Numerical answer options overlap.	Carefully evaluate numerical ranges to ensure there is no overlap among options.
Examinees noticed answer was most often A.	Distribute the key evenly among the answer options. This can be avoided with FastTest’s randomized delivery functionality.
Key was overly specific compared to distractors.	Answer options should all be about the same length and contain the same amount of information.
Key was only option to include key word from item stem.	Avoid re-using key words from the stem text in your answer options. If you do use such words, evenly distribute them among all of the answer options so as to not call out individual options.
Rare exception can be argued to invalidate true/false always/never question.	Avoid using “always” or “never” as there can be unanticipated or rare scenarios. Opt for less absolute terms like “most often” or “rarely”.
Distractors were not plausible, key was obvious.	Review each answer option and ensure that it has some bearing in reality. Distractors should be plausible.
Idiom or jargon was used; non-native English speakers did not understand.	It is best to avoid figures of speech; keep the stem text and answer options literal to avoid introducing undue discrimination against certain groups.
Key was significantly longer than distractors.	There is a strong tendency to write a key that is very descriptive. Be wary of this and evaluate distractors to ensure that they are approximately the same length.
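
Some of the checks in this list are easy to automate. As one example, the Python sketch below tallies how often each option position holds the key, assuming four-option items labeled A to D; the key list is invented.

```python
from collections import Counter


def key_distribution(keys, options="ABCD"):
    """Share of items keyed to each option position."""
    counts = Counter(keys)
    return {option: counts.get(option, 0) / len(keys) for option in options}


# Hypothetical answer key for an 8-item form.
keys = ["A", "A", "B", "A", "C", "A", "D", "A"]

for option, share in key_distribution(keys).items():
    print(option, f"{share:.0%}")
```

With four options, each position should hold the key roughly a quarter of the time; here A is keyed on 5 of 8 items, which savvy examinees could notice.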
