
Item banking refers to the purposeful creation of a database of items intended to measure a predetermined set of constructs. The term item refers to what many call questions, though their content need not be restricted as such and can include problems to solve or situations to evaluate in addition to straightforward questions. The art of item banking is the organizational structure by which items are categorized. As a critical component of any high-quality assessment, item banking is the foundation for the development of valid, reliable content and defensible test forms. Automated item banking systems, such as the Item Explorer module of FastTest, significantly reduce the administrative time required to maintain content and produce tests. While there are no absolute standards for creating and managing item banks, best practice guidelines are emerging. Some of the essential aspects include ensuring that:

  • Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally item performance should be tracked not only within a test form, but across test forms as well.
  • Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. Once an item has been exposed to many examinees, it can be flagged for retirement or revision to reduce cheating or teaching to the test.
  • Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization method, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.
  • Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.
  • Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content and the users who made these changes are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item can expedite that process.
  • Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, IRT parameters, and CTT statistics, but there are likely many data points specific to your organization that are worth storing.
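
To make the metadata point concrete, here is a minimal sketch of what a single item record might look like in code. The field names are illustrative assumptions, not the schema of FastTest or any other platform.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ItemRecord:
    """A single banked item with its metadata (illustrative fields only)."""
    item_id: str                              # unique, recognizable name, e.g. "8A-1_001"
    author: str
    source: str
    content_area: str
    depth_of_knowledge: Optional[int] = None
    version: int = 1
    status: str = "draft"                     # e.g. draft, in review, active, retired
    # Classical test theory statistics, recorded after delivery
    p_value: Optional[float] = None           # proportion correct (difficulty)
    point_biserial: Optional[float] = None    # item-total correlation (discrimination)
    # Item response theory parameters, e.g. from a 3PL calibration
    irt_a: Optional[float] = None
    irt_b: Optional[float] = None
    irt_c: Optional[float] = None
    exposure_count: int = 0                   # how many examinees have seen the item
    forms: List[str] = field(default_factory=list)  # test forms the item has appeared on

item = ItemRecord(item_id="8A-1_001", author="J. Smith", source="2023 workshop",
                  content_area="Biology: Cells")
```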

Keeping these guidelines in mind, here are some concrete steps that you can take to establish your item bank in accordance with psychometric best practices.

Make your Job Easier: Establish a Naming Convention

Names are important. As you are importing or creating your item banks, it is important to identify each item with a unique but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting. For example, let’s consider the item banks of a high school science teacher. Take a look at the example below:

What are some ways that this utilizes best practices?

  • Each subject has its own item bank. We can easily view all Biology items by selecting the Biology item bank.
  • A separate folder, 8Ah, clearly delineates items for honors students.
  • The item names follow along with the item bank and category names, allowing us to search for all items for 8th grade unit A-1 with the query “8A-1”, or similarly for honors items with “8Ah-1”.
  • Leading zeros are used so that as the item bank expands, items will sort properly; an item ending in 001 will appear before 010.

The execution of these best practices should be adapted to the needs of your organization, but it is important to establish a convention of some kind. For example, you can use a period rather than an underscore, as long as you are consistent.
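
As a quick illustration of why leading zeros matter, the snippet below (using hypothetical names modeled on the 8A-1 convention above) shows how padded and unpadded names sort, and how a consistent prefix supports simple queries.

```python
# Zero-padded names sort correctly as plain strings; unpadded names do not.
unpadded = [f"8A-1_{i}" for i in (1, 2, 10, 11, 100)]
padded = [f"8A-1_{i:03d}" for i in (1, 2, 10, 11, 100)]

print(sorted(unpadded))  # ['8A-1_1', '8A-1_10', '8A-1_100', '8A-1_11', '8A-1_2']
print(sorted(padded))    # ['8A-1_001', '8A-1_002', '8A-1_010', '8A-1_011', '8A-1_100']

# A consistent prefix also makes simple queries possible, e.g. all honors unit-1 items:
bank = padded + [f"8Ah-1_{i:03d}" for i in (1, 2, 3)]
honors_unit_1 = [name for name in bank if name.startswith("8Ah-1")]
```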

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department, or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability and validity.
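
If you compute statistics yourself rather than relying on your delivery platform, a minimal sketch like the following (using simulated 0/1 response data) shows how classical difficulty and discrimination values could be derived and then stored as item metadata.

```python
import numpy as np

def classical_item_stats(scores):
    """Classical difficulty and discrimination from a scored (0/1) response matrix.

    scores: examinees x items array of 0/1 responses.
    Difficulty is the proportion correct; discrimination is the corrected
    point-biserial (correlation with the total score excluding the item).
    """
    difficulty = scores.mean(axis=0)
    total = scores.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
        for j in range(scores.shape[1])
    ])
    return difficulty, discrimination

# Simulated responses: 500 examinees, 10 items
rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
item_difficulty = rng.normal(size=(1, 10))
prob_correct = 1 / (1 + np.exp(-(ability - item_difficulty)))
scores = (rng.random((500, 10)) < prob_correct).astype(int)

p_values, point_biserials = classical_item_stats(scores)
```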

Statistics are also used in the assembly of test forms. Classical statistics can be used to estimate a form’s mean, standard deviation, reliability, standard error, and pass rate, while item response theory parameters can be used to calculate test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
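
For the IRT side, here is a brief sketch of computing a test information function and conditional standard error from 3PL item parameters; the parameter values are made up for illustration.

```python
import numpy as np

def three_pl_prob(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def test_information(theta_grid, a, b, c):
    """Test information function: sum of 3PL item information at each theta."""
    theta = np.asarray(theta_grid, dtype=float)[:, None]    # grid x 1
    p = three_pl_prob(theta, a, b, c)                       # grid x items
    info = a**2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)
    return info.sum(axis=1)

# Hypothetical 3PL parameters for a 5-item form
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.0, -0.3, 0.0, 0.7, 1.2])
c = np.array([0.20, 0.25, 0.20, 0.20, 0.25])

theta_grid = np.linspace(-3, 3, 61)
tif = test_information(theta_grid, a, b, c)
csem = 1 / np.sqrt(tif)   # conditional standard error of measurement
```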

In the event that your organization decides to publish a computerized adaptive test (CAT), item parameters for each item will be essential because they are used to intelligently select items and score examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility, and the storage of metadata is one such vital piece of documentation.
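
The core of most CAT item-selection rules is picking the unadministered item that is most informative at the examinee’s current ability estimate. The sketch below illustrates that idea with 2PL items and a hypothetical calibrated bank; a production CAT engine would add exposure control, content balancing, and a proper ability-estimation step.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of 2PL items at ability theta."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

def select_next_item(theta_hat, a, b, administered):
    """Return the unadministered item with maximum information at the current theta estimate."""
    info = item_information_2pl(theta_hat, a, b)
    info[list(administered)] = -np.inf        # mask items the examinee has already seen
    return int(np.argmax(info))

# Hypothetical calibrated bank of 6 items
a = np.array([1.1, 0.9, 1.4, 1.0, 1.3, 0.8])
b = np.array([-1.2, -0.5, 0.0, 0.4, 1.0, 1.8])

next_item = select_next_item(theta_hat=0.3, a=a, b=b, administered={0, 2})
```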

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test takers, as they are likely to be diverse as well!

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. In the event of a high school science test, this may include the evaluation of the content taught in class. For a high-stakes certification exam, this almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed. Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.
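
As a simple sketch of blueprint-driven selection, the snippet below draws the required number of items from each content area of a hypothetical bank; the blueprint counts and item names are invented, and in practice you would also filter on item statistics and exposure.

```python
import random
from collections import defaultdict

# Hypothetical blueprint: how many items to draw from each content area
blueprint = {"Cells": 10, "Genetics": 8, "Ecology": 7}

# Hypothetical bank: item name -> content area (30 items per area)
bank = {f"BIO_{area[:3].upper()}_{i:03d}": area
        for area in blueprint for i in range(1, 31)}

def assemble_form(bank, blueprint, seed=42):
    """Draw the blueprint-specified number of items from each content area."""
    by_area = defaultdict(list)
    for item_id, area in bank.items():
        by_area[area].append(item_id)
    rng = random.Random(seed)
    form = []
    for area, n_needed in blueprint.items():
        form.extend(rng.sample(sorted(by_area[area]), n_needed))
    return form

form = assemble_form(bank, blueprint)   # 25 items matching the blueprint
```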

There is no doubt that item banking will remain a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment.

Worried your current item banking platform isn’t up to par? We would love to discuss how Assessment Systems can help. FastTest was designed by psychometricians with an intuitive and easy to use item banking module. Check out our free version here, or contact us to learn more.

There are a number of acceptable methodologies in the psychometric literature for standard setting studies, which establish cutscores or passing points.  Examples include Angoff, modified-Angoff, Bookmark, Contrasting Groups, and Borderline.  The modified-Angoff approach is by far the most commonly used, yet it remains a black box to many professionals in the testing industry, especially non-psychometricians in the credentialing field.  This post hopefully provides some elucidation and demystification.  There is some flexibility in the study implementation, but this article describes a sound method.

What to Expect with the Modified-Angoff Approach

First of all, do not expect a straightforward, easy process that leads to an unassailably correct cutscore.  All standard setting methods involve some degree of subjectivity.  The goal of the methods is to reduce that subjectivity as much as possible.  Some methods focus on content, others on data, while some try to meld the two.

Step 1: Prepare Your Team

The modified-Angoff process depends on a representative sample of subject matter experts (SMEs), usually 6-20.  By “representative” I mean they should represent the various stakeholders.  A certification for medical assistants might include experienced medical assistants, nurses, and physicians, from different areas of the country.  You must train them about their role and how the process works, so they can understand the end goal and drive toward it.

Step 2: The Minimally Competent Candidate (MCC)

This concept is the core of the Angoff process, though it is known by a range of terms or acronyms, including minimally qualified candidate (MQC) or just barely qualified (JBQ).  The reasoning is that we want our exam to separate candidates who are qualified from those who are not.  So we ask the SMEs to define what makes someone qualified (or unqualified!) from a perspective of skills and knowledge.  This leads to a conceptual definition of an MCC.  We then want to estimate what score this borderline candidate would achieve, which is the goal of the remainder of the study.  This step can be conducted in person, or via webinar.

Step 3: Round 1 Ratings

Next, ask your SMEs to read through all the items on your test form and estimate the percentage of MCCs that would answer each correctly.  A rating of 100 means the item is a slam dunk; it is so easy that every MCC would get it right.  A rating of 40 means the item is very difficult, with only 40% of MCCs expected to answer it correctly.  Most ratings fall in the 60-90 range if the items are well-developed.  The ratings should be gathered independently; if everyone is in the same room, let them work on their own in silence.  This can easily be conducted remotely, though.

Step 4: Discussion

This is where it gets fun.  Identify the items where there is the most disagreement (as defined by grouped frequency distributions or standard deviation) and have the SMEs discuss them.  Maybe two SMEs thought an item was super easy and gave it a 95, while two others thought it was super hard and gave it a 45.  They will try to convince the other side of their folly.  Chances are that there will be no shortage of opinions and you, as the facilitator, will find your greatest challenge is keeping the meeting on track.  This step can be conducted in person, or via webinar.
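
Here is a small sketch of how a facilitator might flag the items with the most disagreement, using the standard deviation of the Round 1 ratings; the rating values are invented.

```python
import numpy as np

# Hypothetical Round 1 ratings: rows are raters, columns are items
# (each value is the estimated percentage of MCCs answering correctly)
ratings = np.array([
    [95, 70, 60, 85],
    [90, 75, 55, 45],
    [60, 72, 65, 80],
    [65, 68, 58, 50],
])

item_sd = ratings.std(axis=0, ddof=1)        # spread of ratings for each item
most_disputed = np.argsort(item_sd)[::-1]    # items ordered from most to least disagreement
for j in most_disputed:
    print(f"Item {j + 1}: mean = {ratings[:, j].mean():.1f}, SD = {item_sd[j]:.1f}")
```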

Step 5: Round 2 Ratings

Raters then re-rate the items based on the discussion.  The goal is that there will be greater consensus.  In the previous example, it’s not likely that every rater will settle on a 70, but if your raters all end up between 60 and 80, that’s OK.  How do you know there is enough consensus?  We recommend the inter-rater reliability coefficient suggested by Shrout and Fleiss (1979).
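
Below is a minimal implementation of the two-way random-effects intraclass correlation for the mean of k raters, ICC(2,k), following Shrout and Fleiss (1979); the ratings matrix is hypothetical.

```python
import numpy as np

def icc_2k(ratings):
    """ICC(2,k) per Shrout & Fleiss (1979): two-way random effects,
    reliability of the mean of k raters. ratings is an items x raters matrix."""
    n, k = ratings.shape
    grand = ratings.mean()
    item_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)

    bms = k * ((item_means - grand) ** 2).sum() / (n - 1)    # between-items mean square
    jms = n * ((rater_means - grand) ** 2).sum() / (k - 1)   # between-raters mean square
    resid = ratings - item_means[:, None] - rater_means[None, :] + grand
    ems = (resid ** 2).sum() / ((n - 1) * (k - 1))           # residual mean square

    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical Round 2 ratings: rows are items, columns are raters
round2 = np.array([
    [80, 75, 85, 78],
    [60, 65, 62, 58],
    [90, 88, 92, 85],
    [70, 72, 68, 75],
    [55, 60, 58, 52],
])
print(f"ICC(2,k) = {icc_2k(round2):.2f}")
```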

Step 6: Evaluate Results and Final Recommendation

Evaluate the results from Round 2 as well as Round 1.  An example of this is shown below.  What is the recommended cutscore, which is the average or sum of the Angoff ratings depending on the scale you prefer?  Did the reliability improve?  Estimate the mean and SD of examinee scores (there are several methods for this).  What sort of pass rate do you expect?  Even better, utilize the Beuk Compromise as a “reality check” between the modified-Angoff approach and actual test data.  You should take multiple points of view into account, and the SMEs need to vote on a final recommendation.  They, of course, know the material and the candidates, so they have the final say.  This means that standard setting is a political process; again, reduce that effect as much as you can.

[Figure: Angoff Method rating results]
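
Here is a short sketch of turning Round 2 ratings into a recommended cutscore and a rough expected pass rate.  The ratings, score mean, and SD are invented, and the normal approximation is only a quick sanity check, not the Beuk procedure itself.

```python
import numpy as np
from math import erf, sqrt

# Hypothetical Round 2 ratings: rows are items, columns are raters (percentages)
round2 = np.array([
    [80, 75, 85, 78],
    [60, 65, 62, 58],
    [90, 88, 92, 85],
    [70, 72, 68, 75],
    [55, 60, 58, 52],
])

item_means = round2.mean(axis=1)
cut_percent = item_means.mean()          # cutscore on the percent-correct scale
cut_raw = item_means.sum() / 100         # cutscore on the number-correct scale

# Rough pass-rate check, assuming percent scores are roughly normal with a
# mean and SD estimated from past data (the values here are invented).
score_mean, score_sd = 74.0, 9.0
pass_rate = 1 - 0.5 * (1 + erf((cut_percent - score_mean) / (score_sd * sqrt(2))))

print(f"Recommended cutscore: {cut_percent:.1f}% ({cut_raw:.2f} of {len(item_means)} items)")
print(f"Expected pass rate:   {pass_rate:.0%}")
```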


Step 7: Write Up Your Report

Validity refers to evidence gathered to support test score interpretations.  Well, you have lots of relevant evidence here.  Document it.  If your test gets challenged, you’ll have all this in place.  On the other hand, if you just picked 70% as your cutscore because it was a nice round number, you could be in trouble.

Additional Topics

In some situations, there are more issues to worry about.  Multiple forms?  You’ll need to equate in some way.  Using item response theory?  You’ll have to convert the Angoff-recommended cutscore onto the theta metric using the Test Response Function (TRF).  New credential and no data available?  That’s a real chicken-and-egg problem there.
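
If you do use IRT, the conversion mentioned above amounts to solving TRF(theta) = Angoff cutscore for theta.  A minimal sketch with made-up 3PL parameters:

```python
import numpy as np
from scipy.optimize import brentq

def trf(theta, a, b, c):
    """Test response function: expected number-correct score at ability theta (3PL)."""
    return np.sum(c + (1 - c) / (1 + np.exp(-a * (theta - b))))

# Hypothetical 3PL parameters for the operational form
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.0, -0.3, 0.0, 0.7, 1.2])
c = np.array([0.20, 0.25, 0.20, 0.20, 0.25])

angoff_cut_raw = 3.57   # Angoff cutscore on the number-correct scale

# Solve TRF(theta) = cutscore for theta over a reasonable ability range
theta_cut = brentq(lambda t: trf(t, a, b, c) - angoff_cut_raw, -4, 4)
print(f"Cutscore on the theta metric: {theta_cut:.2f}")
```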

Where Do I Go From Here?

Ready to take the next step and actually apply the modified-Angoff process to improving your exams?  Download our free Angoff Analysis Tool.

Want to go even further and implement automation in your Angoff study?  Sign up for a free account in our FastTest item banker.

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.


Authoring test items: Science as well as art

You are experts at what you do, and you want to make sure that your examinees are too.  In order to do so, you need tests that are reliable, valid, and legally defensible.  That said, it is likely that the test items within your tests are the greatest threat to their validity and reliability.

To find out whether your test items are your allies or your enemies, read through your test and identify the items that contain the most prevalent item construction flaws.  The first three of the most prevalent construction flaws are located in the item stem (i.e. question).  Look to see if your item stems contain…

1) BIAS – Nowadays, we tend to think of bias as relating to culture or religion, but there are many more subtle types of biases that oftentimes sneak into your tests.  Consider the following questions to determine the extent of bias in your tests:

  • Are there acronyms in your test that are not considered industry standard?
  • Are you testing on policies and procedures that may vary from one location to another?
  • Are you using vocabulary that is more recognizable to a female examinee than a male?
  • Are you referencing objects that are not familiar to examinees from a newer or older generation?

2) NOT – We’ve all taken tests which ask a negatively worded question. These test items are easy to write, but they are devastating to the validity and reliability of your tests—particularly for fast test-takers or individuals with lower reading skills.  If the examinee misses that one single word, they will get the question wrong even if they actually know the material.  This test item ends up penalizing the wrong examinees!

3) EXCESS VERBIAGE – Long stems can be effective and essential in many situations, but they are also more prone to two specific item construction flaws.  If the stem is unnecessarily long, it can contribute to examinee fatigue.  Because each item requires more energy to read and understand, examinees tire sooner and may begin to perform more poorly later on in the test—regardless of their competence level.

Additionally, long stems often include information that can be used to answer other questions in the test.  This could lead your test to be an assessment of whose test-taking memory is best (i.e. “Oh yeah, #5 said XYZ, so the answer to #34 is XYZ.”) rather than who knows the material.

Unfortunately, item stems aren’t the only offenders.  Experienced test writers know that the distractors (i.e. options) are actually more difficult to write than the stems themselves.  When you review your test items, look to see if your item distractors contain…

4) IMPLAUSIBILITY – The purpose of a distractor is to pull less qualified examinees away from the correct answer by offering other options that look correct.  In order for them to “distract” an examinee from the correct answer, they have to be plausible.  The closer they are to being correct, the more difficult the exam will be.  If the distractors are obviously incorrect, even unqualified examinees won’t pick them, and your exam will not help you discriminate between examinees who know the material and examinees who do not.

5) 3-TO-1 SPLITS – You may recall watching Sesame Street as a child.  If so, you remember the song “One of these things…”  (Either way, enjoy refreshing your memory!)   Looking back, it seems really elementary, but sometimes our test item options are written in such a way that an examinee can play this simple game with your test.  Instead of knowing the material, they can look for the option that stands out as different from the others.  Consider the following questions to determine if one of your items falls into this category:

  • Is the correct answer significantly longer than the distractors?
  • Does the correct answer contain more detail than the distractors?
  • Is the grammatical structure different for the answer than for the distractors?

6) ALL OF THE ABOVE – There are a couple of problems with having this phrase (or the opposite “None of the above”) as an option.  For starters, good test takers know that this is—statistically speaking—usually the correct answer.  If it’s there and the examinee picks it, they have a better than 50% chance of getting the item right—even if they don’t know the content.  Also, if they are able to identify two options as correct, they can select “All of the above” without knowing whether or not the third option was correct.  These sorts of questions also get in the way of good item analysis.   Whether the examinee gets this item right or wrong, it’s harder to ascertain what knowledge they have because the correct answer is so broad.

The process of reading through your exams in search of these flaws is time-consuming (and oftentimes depressing), but it is an essential step towards developing an exam that is valid, reliable, and reflects well on your organization as a whole.  Once you have a chance to look at one of your tests, please write in the comments below what you discovered.  We’d love to hear from you and support you as you strive towards better items, exams, and professionals.