A Standard Setting Study is a formal process for establishing a performance standard. In the assessment world, there are actually two uses of the word standard – the other refers to a formal definition of the content that is being tested, such as the Common Core State Standards in the USA. For this reason, I prefer the term cutscore study.

After item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification.  This cannot be done arbitrarily (e.g., setting it at 70% because that’s what you saw when you were in school).  To be legally defensible and eligible for accreditation, it must be done using one of several standard setting approaches from the psychometric literature.  The choice of method depends upon the nature of the test, the availability of pilot data, and the availability of subject matter experts.

Some types of Cutscore Studies:

  • Angoff – In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly.  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
  • Bookmark – The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be.  Obviously, this requires enough real data to calibrate item difficulty, usually using item response theory, which requires several hundred examinees.
  • Contrasting Groups – Candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other external standard.  If using data from another exam, a sample of at least 50 candidates is needed.
  • Borderline Group – Similar to Contrasting Groups, but a borderline group is defined using alternative information such as biodata, and the scores of the group are evaluated.
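
As a concrete illustration of the Angoff arithmetic, here is a minimal sketch in Python. The averaging rule (mean over SMEs for each item, then combined across items) is the standard modified-Angoff computation, but the ratings themselves are invented for the example.

```python
# Minimal modified-Angoff sketch with hypothetical SME ratings.
# Each row is one SME; each column is that SME's estimate of the
# percentage of minimally competent candidates answering the item correctly.
ratings = [
    [70, 80, 55, 90, 65],   # SME 1
    [75, 85, 60, 85, 70],   # SME 2
    [65, 75, 50, 95, 60],   # SME 3
]

n_items = len(ratings[0])

# Average the ratings for each item across SMEs, then average across items
# to get the recommended percent-correct cutscore.
item_means = [sum(r[i] for r in ratings) / len(ratings) for i in range(n_items)]
cutscore_pct = sum(item_means) / n_items          # percent-correct cutscore
cutscore_raw = cutscore_pct / 100 * n_items       # number-correct cutscore
```

Panels often run a second round of ratings after discussing items with large rater disagreement; the computation itself stays the same.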

Whether you are a newly-launched credentialing program or a mature certification body, it is important to perform frequent “checkups” on your assessments, to ensure that they’re not only accurate, but also legally defensible.  The primary component of this process is a psychometric performance report, which provides important statistics on the test like reliability, and item statistics like difficulty and discrimination.  This work is primarily done by a psychometrician, though particular items flagged for poor performance should be reviewed by Subject Matter Experts (SMEs).  However, checkups should also sometimes include Job Task Analysis studies (JTAs) and Cutscore studies. This is where your SMEs really come in.  The frequency depends on how quickly your field is evolving, but a cycle of 5 years is often recommended. JTAs are sometimes called job analysis, practice analysis, or role delineation studies.
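
The statistics named above can be computed with standard classical test theory formulas. Here is a minimal sketch using a hypothetical matrix of 0/1 scored responses: difficulty is the proportion-correct p-value, discrimination is the corrected item-total correlation, and reliability is coefficient alpha.

```python
import statistics

# Hypothetical 0/1 scored responses: rows = examinees, columns = items.
X = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

n, k = len(X), len(X[0])
totals = [sum(row) for row in X]

# Item difficulty: proportion of examinees answering correctly (p-value).
difficulty = [sum(row[j] for row in X) / n for j in range(k)]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

# Item discrimination: correlation between the item score and the total
# score with that item removed (corrected item-total correlation).
discrimination = [
    corr([row[j] for row in X], [t - row[j] for row, t in zip(X, totals)])
    for j in range(k)
]

# Coefficient alpha: internal-consistency reliability of the total score.
item_vars = [statistics.pvariance([row[j] for row in X]) for j in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / statistics.pvariance(totals))
```

In practice the flagging rules in a psychometric report are thresholds on exactly these values, e.g., items with very low p-values or near-zero discrimination get routed to SME review.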

Your SMEs play a pivotal role in getting new assessments off the ground and keeping existing assessments fair and accurate. Whether they keep your program abreast with current innovations and industry standards or help you quantify the knowledge and various skills measured in your assessment, your SMEs work side-by-side with your psychometric experts through the job task analysis and cutscore process to ensure fair and accurate decisions are made.

If your program or assessment is in its infant stages, you will need to perform a Job Task Analysis to kick things off. The JTA is all about surveying on-the-job tasks, creating a list of tasks, and then devising a blueprint of what knowledge, skills, and abilities (KSAs) are required for certification in a given role or field.

The Basics of Job Task Analysis

  • Observe— Typically the psychometrician (that’s us) shadows a representative sample of people who perform the job in question (chosen through Panel Composition) to observe and take notes. After the day(s) of observation, the SMEs sit down with the observer so that he or she may ask any clarifying questions. The goal is to avoid doing this during the observation so that the observer has an untainted view of the job.  Alternatively, your SMEs can observe job incumbents – which is often the case when the SMEs are supervisors.

  • Generate— The SMEs now have a corpus of information on what is involved with the job, and generate a list of tasks that describe the most important job-related components.  Not all job analyses use tasks, but this is the most common approach in certification testing; hence you will often hear job task analysis used as a general term.
  • Survey— Now that we have a list of tasks, we send a survey out to a larger group of SMEs and ask them to rate various features of each task. How important is the task? How often is it performed? What larger category of tasks does it fall into?

  • Analyze— Next, we crunch the data and quantitatively evaluate the SMEs’ subjective ratings to determine which of the tasks and categories are most important.

  • Review— As a non-SME, the psychometrician needs to take their findings back to the SME panel to review the recommendation and make sure it makes sense.

  • Report— We put together a comprehensive report that outlines what the most important tasks/categories are for the given job.  This in turn serves as the foundation for a test blueprint, because more important content deserves more weight on the test.  This connection is one of the fundamental links in the validity argument for an assessment.
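
The Survey and Analyze steps above can be sketched as follows. The importance-times-frequency weighting rule is one common convention rather than the only defensible one, and the tasks and ratings here are hypothetical.

```python
# Hypothetical SME survey ratings (1-5 scales) for each task:
# (importance ratings, frequency ratings), one rating per surveyed SME.
survey = {
    "Diagnose engine faults":  ([5, 4, 5], [4, 4, 5]),
    "Document repair orders":  ([3, 3, 4], [5, 5, 4]),
    "Order replacement parts": ([2, 3, 2], [3, 2, 3]),
}

def mean(xs):
    return sum(xs) / len(xs)

# One common weighting rule: mean importance times mean frequency.
weights = {
    task: mean(imp) * mean(freq) for task, (imp, freq) in survey.items()
}

# Normalize into blueprint percentages: the share of the test devoted
# to each task, so more important content gets more weight.
total = sum(weights.values())
blueprint = {task: round(100 * w / total, 1) for task, w in weights.items()}
```

The resulting percentages feed directly into the test blueprint described in the Report step, which is what links the JTA evidence to the validity argument.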

Cutscore studies after job task analysis

When the JTA is completed, we have to determine who should pass the assessment, and who should fail. This is most often done using the modified Angoff process, where the SMEs conceptualize a minimally competent candidate (MCC) and then set pass/fail point so that the MCC would just barely pass.  There are other methods too, such as Bookmark or Contrasting Groups.

For newly-launching certification programs, these processes go hand-in-hand with item writing and review. The use of evidence-based practices in conducting the job task analysis, test design, item writing, and standard setting serves as the basis for a good certification program.  Moreover, if you are seeking to achieve accreditation – a third-party stamp of approval that your credential is high quality – documentation that you completed all these steps is required.

Performing these tasks with a trained psychometrician inherently checks a lot of boxes on the accreditation to-do list, which can position your organization well for the future. When it comes to accreditation, the psychometricians and measurement specialists at Assessment Systems have been around the block a time or two. We can walk you through the lengthy process of becoming accredited, or we can help you perform these tasks a la carte.

Working toward accreditation or building your team of professionals? Accreditation bodies like ANSI and NCCA require job analyses. Our Psychometricians are available to conduct a job analysis study and write defensible documentation to move your program forward and ensure you are hiring individuals with the skills and knowledge necessary to be successful.

The job market is competitive, especially for employers; whether you need a job analysis or not, the job description you post must convert prospects to candidates. After all, you can lead a horse to water but you can’t make it drink. Vervoe Co-Founder and CEO Omer Molad shares his thoughts about job descriptions that get the right people. Here’s how to write a job description that will attract the right candidates.

Why Focus on Activities?

People are hired to perform value-adding activities. While companies have different approaches to how they hire, their goals are usually the same. Every company wants to hire high-performing people, not people who just look good on paper.

Despite this simple and obvious assumption, too many companies ignore activities and focus on things that don’t indicate performance. This happens at every stage of the hiring process. For example:

  • Many job descriptions focus on what candidates have done in the past.
  • Screening is based on candidates’ backgrounds.
  • Assessment methods often don’t simulate the tasks performed in the role.

Instead, use on-the-job activities as the guide for the entire hiring process. If you follow this principle, you will hire people who perform the value-adding activities you require.


Here’s how it works.

The Job Description

Defining the role is the foundation of hiring. If you do that incorrectly, the entire hiring process will be steered in the wrong direction. The clearer you are, the higher your chances of attracting the person you want. The problem with so many job descriptions is that they aren’t linked closely enough to the daily activities of the job. Let’s change that.

A good job description should have three sections:

1. Start with why

“People don’t buy what you do, they buy why you do it.” – Simon Sinek

This approach is entirely applicable to job descriptions. Sell candidates on your company’s vision and story. Sell them on the role and the culture. This will achieve two things. First, it is likely to increase the quality of applicants. Second, candidates will be more likely to invest in the application process and make an effort if they buy into your “why”.

Conversely, candidates who don’t relate to your vision or culture will opt out. Mission accomplished.

2. Describe the role in activities

Outline, point by point, what the successful candidate will do every day. Keep it simple and be very specific. No clichés, no jargon. Candidates need to understand how they will spend each day, what they need to achieve, who they’ll be working with and under what conditions.

This is a great way of managing expectations. By communicating to candidates what they’ll be doing in the role, you are forcing them to ask themselves whether they can do those activities well and how much they enjoy doing them. This presents another opportunity for less suitable candidates to opt out.

3. State your requirements

The previous two sections should make this part easy because you’ve set the scene. Candidates already know what your company stands for and what they’ll be doing in the role. Now you can add some more detail about the type of person you are looking for and how you expect them to approach the role.

Don’t worry about years of experience, grades in college or anything else that’s not activity-based. Bring it back to activities and use plain English.

Describe the kind of person you’re looking for by listing how you want them to approach the role. Put things in context. Instead of “strong communicator”, write “clearly communicate customer feedback to the product team”. Instead of “flexible”, write “prepared to join calls with developers late at night when necessary”.

You should also use this section to articulate the attitude and behaviors you’d like to see. Candidates already know from the previous section what they’ll be doing on a daily basis. Now explain how.

Here are some examples of good job descriptions and a useful guide on how to write one.

Candidate Screening

With a good job description and scenario-based assessment, candidate screening is simply not required. To learn more about why you don’t need to screen candidates read this.

But in short, screening is not about activities, it’s about a candidate’s background. Ruling people out based on their background is counterproductive. Instead, set candidates up for success with a savvy job description, and then assess the ones that want the job based on that description.

Don’t worry about receiving too many applications from people who aren’t qualified or ignore the job description. That is solved automatically in the assessment stage and you won’t need to lift a finger.

Scenario-based Assessment

Your job description will attract people who want to be part of your journey, and want to do the job you advertised. That’s the theory at least.

Now it’s time to find out how it stacks up.

The assessment stage, which is the most important part of your hiring process, should be entirely based on activities. Go back to the job description and choose the most important on-the-job activities.

Create simulations of those activities so you can see how candidates perform in real-world scenarios. To learn how to write a great interview script read this.

Use automated interviews to deliver the simulations to candidates online.

Some candidates will not make the effort. Others will find the activities too challenging. Others yet will see that the activities are not aligned with their interests or passions. The most motivated and qualified candidates will prevail.

It’s easy to read a job description and apply for a job. However, when candidates are asked to perform challenging tasks, they need to be motivated and confident in their abilities. You’ll only need to view and score completed interviews and you’ll know who measures up within minutes.

Using automated interviews based on activities, you can audition candidates for the role. They will, in turn, get a chance to do the role, albeit in a small way.

The candidates who perform well in the automated interviews will have proven they can do the activities you want them to do in the role. Seeing first hand how well they perform each of those activities will help you confidently make your hiring decision.

By focusing on activities, you can create a hiring process that reflects your role and how you want it to be performed. It’s a simple and effective method to hire people who can, and want to, perform the activities you consider to be value-adding.

***

Our friends at Vervoe specialize in automating your recruiting and screening process to improve your time to hire and ensure you’re hiring the right person for the right position. This post was originally posted by Vervoe, reposted with permission. For more information about Vervoe, visit them at https://vervoe.com/.

MINNEAPOLIS, MN, September 7, 2018 – Assessment Systems, a global leader in psychometrics and assessment software, has added the National Institute for Automotive Service Excellence (ASE) to its growing list of valued partners.

Since 1972, ASE has driven to elevate the quality of vehicle repair and service by assessing and certifying automotive professionals. This partnership joins the power and sophistication of Assessment Systems’ flagship products – FastTest, Iteman and Xcalibre – with ASE’s long-standing and renowned certification in the automotive industry.

“Our values align,” said Cassandra Bettenberg, Executive Director of Strategic Partnerships at Assessment Systems, “Like Assessment Systems, ASE values psychometrics and wants to develop and deliver more valid and reliable exams.”

As Assessment Systems continues to diversify its list of partners across industries, they continue to improve their best-in-class assessment technology and their new assessment platform, Ada. Assessment Systems recently earned a spot on the Inc. 5000 –  Inc. Magazine’s list of America’s Fastest-Growing Private Companies – for the second year in a row.

About The National Institute for Automotive Service Excellence
The National Institute for Automotive Service Excellence was established in 1972 as a non-profit organization to help improve the quality of automotive service and repair through the voluntary testing and certification of automotive technicians and parts specialists. Today, there are nearly 400,000 ASE-certified professionals at work in dealerships, independent shops, collision repair shops, auto parts stores, fleets, schools and colleges throughout the country.

About Assessment Systems Corporation
Assessment Systems is the trusted provider of high-stakes assessment and psychometric services for over 250 partners worldwide, delivering over 2,000,000 assessments every year. Powered by decades of research in psychometrics, Assessment Systems offers best-in-class software platforms and consulting services to support high-quality measurement and completely scalable solutions. Assessment Systems’ success is driven by a commitment to make assessments smarter, faster, and fairer to ensure bad tests don’t hurt good people.

###

The modified-Angoff method is arguably the most common method of setting a cutscore on a test.  The Angoff cutscore is legally defensible and meets international standards such as AERA/APA/NCME, ISO 17024, and NCCA.  It also has the benefit that it does not require the test to be administered to a sample of candidates first, as methods like Contrasting Groups, Borderline Group, and Bookmark do.

There are, of course, some drawbacks to the Angoff cutscore process.  The most significant is that subject matter experts (SMEs) tend to overestimate the capabilities of a minimally competent candidate, and therefore overestimate the cutscore – sometimes to the point that the expected pass rate is zero!

Another drawback is that the Angoff cutscore process only works in the classical psychometric paradigm – the recommended cutscores are on the number-correct or percentage-correct metric.  If your tests are developed and scored in the item response theory (IRT) paradigm, you need to convert the classical cutscore to the IRT theta scale.  The easiest way to do that is to reverse-calculate it through the test response function (TRF).

The Test Response Function

The TRF (sometimes called a test characteristic curve) is an important method of characterizing test performance in the IRT paradigm.  The TRF predicts a classical score from an IRT score, as you see below.  Like the item response function and test information function (these need blog posts too), it uses the theta scale as the X-axis.  The Y-axis can be either the number-correct metric or proportion-correct metric.

In this example, you can see that a theta of -0.6 translates to an estimated number-correct score of approximately 10, and +1 to 15.5.  Note that the number-correct metric only makes sense for linear or LOFT exams, where every examinee receives the same number of items.  In the case of CAT exams, only the proportion correct metric makes sense.

Angoff cutscore to IRT

So how does this help us with the conversion of a cutscore?  Well, we now have a way of translating any number-correct or proportion-correct score to a theta value, so any Angoff-recommended cutscore can be reverse-calculated to theta.  If your Angoff study (or Beuk) recommends a cutscore of 10 out of 20 points, you can convert that to a theta cutscore of -0.6.  If the recommended cutscore was 15.5, the theta cutscore would be 1.0.

IRT scores examinees on the same scale with any set of items, as long as those items have been part of a linking/equating study.  Therefore, a single Angoff study on a set of items can be equated to any other linear test form, LOFT pool, or CAT pool.  This makes it possible to apply the classically-focused Angoff method to IRT-focused programs.
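
A sketch of this reverse calculation, assuming the 3PL model with hypothetical item parameters; the 1.7 scaling constant and the bisection search are implementation choices, not part of the Angoff method itself.

```python
import math

# Hypothetical 3PL item parameters (a, b, c) for a short test.
items = [
    (1.0, -1.0, 0.20),
    (0.8, -0.5, 0.25),
    (1.2,  0.0, 0.20),
    (1.0,  0.5, 0.25),
    (0.9,  1.0, 0.20),
]

def trf(theta):
    """Test response function: expected number-correct score at theta."""
    return sum(
        c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))
        for a, b, c in items
    )

def theta_for_cutscore(raw_cut, lo=-4.0, hi=4.0, tol=1e-6):
    """Reverse-calculate the theta where the TRF equals the raw cutscore.
    The TRF is monotonically increasing, so bisection works."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if trf(mid) < raw_cut:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# e.g., an Angoff-recommended cutscore of 3 out of 5 points:
theta_cut = theta_for_cutscore(3.0)
```

For a proportion-correct cutscore, divide the TRF by the number of items and invert the same way.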

Have you heard about standard setting approaches such as the Hofstee method, or perhaps the Angoff, Ebel, Nedelsky, or Bookmark methods?  There are certainly various ways to set a defensible cutscore for a professional credentialing or pre-employment test.  Today, we are going to discuss the Hofstee method.

Why Standard Setting?

Certification organizations that care about the quality of their examinations need to follow best practices and international standards for test development, such as  the Standards laid out by the National Commission for Certifying Agencies (NCCA).  One component of that is standard setting, also known as cutscore studies.  One of the most common and respected approaches for that is the modified-Angoff methodology.

However, the Angoff approach has one flaw: the subject matter experts (SMEs) tend to expect too much of minimally competent candidates, and sometimes set a cutscore so high that even they themselves would not pass the exam.  There are several reasons this can occur.  For example, raters might think “I would expect anyone who worked for me to know how to do this” without considering that the people who work for them might have 10 years of experience, while test candidates could be fresh out of training/school, where the topic may have been touched on for only five minutes.  SMEs often forget what it was like to be a much younger and inexperienced version of themselves.

For this reason, several compromise methods have been suggested to compare the Angoff-recommended cutscore with a “reality check” of actual score performance on the exam, allowing the SMEs to make a more informed decision when setting the official cutscore of the exam.  I like to use the Beuk method and the Hofstee method.

The Hofstee Method

One method of adjusting the cutscore based on raters’ impressions of the difficulty of the test and possible pass rates is the Hofstee method (Mills & Melican, 1987; Cizek, 2006; Burr et al., 2016).  This method requires the raters to estimate four values:


  1. The minimum acceptable failure rate
  2. The maximum acceptable failure rate
  3. The minimum cutscore, even if all examinees failed
  4. The maximum cutscore, even if all examinees passed


The first two values are failure rates, and are therefore between 0% and 100%, with 100% indicating a test that is too difficult for anyone to pass.  The latter two values are on the raw score scale, and therefore range between 0 and the number of items in the test, again with a higher value indicating a more difficult cutscore to achieve.

These values are paired, and the line passing through the two points is estimated.  The intersection of this line with the observed failure-rate function is the recommended adjusted cutscore.
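
A minimal sketch of the Hofstee computation, with hypothetical panel values and scores. The line runs from (minimum cutscore, maximum failure rate) to (maximum cutscore, minimum failure rate), and we step along it until it crosses the observed failure-rate curve.

```python
# Hofstee sketch: hypothetical panel values and score distribution.
scores = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20, 21, 22, 23, 25, 27]
n = len(scores)

k_min, k_max = 14, 22      # min/max acceptable cutscore (raw points)
f_max, f_min = 0.40, 0.05  # max/min acceptable failure rate

def failure_rate(cut):
    """Proportion of examinees scoring below the cutscore."""
    return sum(s < cut for s in scores) / n

def hofstee_line(cut):
    """Line through (k_min, f_max) and (k_max, f_min)."""
    slope = (f_min - f_max) / (k_max - k_min)
    return f_max + slope * (cut - k_min)

# Find the cutscore (stepping in half-points here) where the observed
# failure-rate curve first crosses the Hofstee line.
cut = k_min
while cut <= k_max and failure_rate(cut) < hofstee_line(cut):
    cut += 0.5
```

With real data you would use the full empirical score distribution, and possibly interpolate for a smoother intersection; the half-point grid here is just for illustration.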

How can I use the Hofstee Method?

Unlike the Beuk, the Hofstee method does not utilize the Angoff ratings, so it represents a completely independent reality check.  In fact, it is sometimes used as a standalone cutscore setting method itself, but because it does not involve rating of every single item, I recommend it be used in concert with the Angoff and Beuk approaches.

How can you perform all the calculations that go into the Hofstee method?  Well, you don’t need to program it all from scratch.  Just head over to our Angoff Analysis Tool page and download a copy for yourself.

Psychometrics is the cornerstone of any high-quality assessment program.  Most organizations do not have an in-house PhD psychometrician, which then necessitates the search for psychometric consulting.  Most organizations, when first searching, are new to the topic and not sure what role the psychometrician plays.  In this article, we’ll talk about how psychometricians and their tools can help improve your assessments, whether you just want to check on test reliability or pursue the lengthy process of accreditation.

Why ASC?

Whether you are establishing or expanding a credentialing program, streamlining operations, or moving from paper to online testing, ASC has a proven track record of providing practical, cost-efficient solutions with uncompromising quality. We offer a free consultation with our team of experts to discuss your needs and determine which solutions are the best fit, including our enterprise SaaS platforms, consulting on sound psychometrics, or recommending you to one of our respected partners.

At the heart of our business is our people.

Our collaborative team of Ph.D. psychometricians, accreditation experts, and software developers have diverse experience developing solutions that drive best practices in assessment. This real-world knowledge enables us to consult your organization with solutions tailored specifically to your goals, timeline, and budget.

Comprehensive Solutions to Address Specific Measurement Problems

Much of psychometric consulting is project-based around solving a specific problem.  For example, you might be wondering how to set a cutscore on a certification/licensure exam that is legally defensible and meets accreditation standards.  This is a very specific issue, and the scientific literature has suggested a number of sound approaches.  Here are some of the topics where psychometricians can really help:

  • Test Design: Job Analysis & Blueprints
  • Standard and Cutscore Setting Studies
  • Item Writing and Review Workshops
  • Test and Item Statistical Analysis
  • Equating Across Years and Forms
  • Adaptive Testing Research
  • Test Security Evaluation
  • NCCA/ANSI Accreditation


Why psychometric consulting?

All areas of assessment can be smarter, faster and fairer.

Develop Reliable and Valid Assessments
We’ll help you understand what needs to be done to develop defensible tests and how to implement them in a cost-efficient manner.  Much of the work revolves around establishing a sound test development cycle.

Increase Test Security
We have specific expertise in psychometric forensics, allowing you to flag suspicious candidates or groups in real time, using our automated forensics report.

Achieve Accreditation
Our dedicated experts will assist in setting your organization up for success with NCCA/ANSI accreditation of professional certification programs.

Comprehensive Psychometric Analytics
We use CTT and IRT with principles of machine learning and AI to deeply understand your data and provide actionable recommendations.

We can help your organization develop and publish certification and licensure exams, based on best practices and accreditation standards, in a matter of months.

If you’re looking for a way to add these best practices to your assessments, here’s how:

Item and Test Statistical Analysis
If you are not doing this process at least annually, you are not meeting best practices or accreditation standards. But don’t worry, we can help! In addition to having us perform these analyses for you, you have the option of running them yourself in our FastTest platform or using our psychometric software like Iteman and Xcalibre.

Job Analysis
How do you know what a professional certification test should cover?  Well, let’s get some hard data by surveying job incumbents. Knowing and understanding this information and how to use it is essential if you want to test people on whether they are prepared for the job or profession.

Cutscore Studies (Standard Setting)
When you use sound psychometric practices like the modified-Angoff, Beuk Compromise, Bookmark, and Contrasting Groups methods, it will help you establish a cutscore that meets professional standards.


It’s all much easier if you use the right software!

Once we help you determine the best solutions for your organization, we can train you on best practices, and it’s extremely easy to use our software yourself.  Software like Iteman and Xcalibre is designed to replace much of the manual work done by psychometricians for item and test analysis, and FastTest automates many aspects of test development and publishing.  We even offer free software like the Angoff Analysis Tool.  However, our ultimate goal is your success: Assessment Systems is a full-service company that continues to provide psychometric consulting and support even after you’ve made a purchase. Our team of professionals is available to provide you with additional support at any point in time. We want to ensure you’re getting the most out of our products!  Click below to sign up for a free account in FastTest and see for yourself.


If you are involved with certification testing and are accredited by the National Commission for Certifying Agencies (NCCA), you have come across the term decision consistency.  NCCA requires you to submit a report of 11 important statistics each year, for each active test form.  These 11 provide a high-level summary of the psychometric health of each form; more on that report here.  One of the 11 is decision consistency.

Decision consistency is an estimate of how consistent the pass/fail decision is on your test.  That is, if someone took your test today, had their brain wiped of that memory, and took the test again next week, what is the probability that they would obtain the same classification both times?  This is often estimated as a proportion or percentage, and we would of course hope that this number is high, but if the test is unreliable it might not be.

The reasoning behind the need for an index specifically for this is that the psychometric quantity we are trying to estimate is different from the reliability of point scores (Moltner, Timbil, & Junger, 2015; Downing & Mehrens, 1978).  The argument is that the examinees near the cutscore are of interest, while reliability evaluates the entire scale.  It’s for this reason that, if you are using item response theory, the NCCA allows you to instead submit the conditional standard error of measurement function at the cutscore.  But all of the classical decision consistency indices evaluate all examinees, and since most candidates are not near the cutscore, this inflates the baseline.  Only the CSEM – from IRT – follows the line of reasoning of focusing on examinees near the cutscore.

An important distinction that stems from this dichotomy is that of decision consistency vs. accuracy.  Consistency refers to receiving the same pass/fail classification each time if you take the test twice.  But what we really care about is whether your pass/fail based on the test matches with your true state.  For a more advanced treatment on this, I recommend Lathrop (2015).

There are a number of classical methods for estimating an index of decision consistency that have been suggested in the psychometric literature.  A simple and classic approach is Hambleton (1972), which is based on an assumption that examinees actually take the same test twice (or equivalent forms).  Of course, this is rarely feasible in practice, so a number of methods were suggested over the next few years on how to estimate this with a single test administration to a given set of examinees.  These include Huynh (1976), Livingston (1972), and Subkoviak (1976).  These are fairly complex.  I once reviewed a report from a psychometrician that faked the Hambleton index because they didn’t have the skills to figure out any of the indices.

How does decision consistency relate to reliability?

The note I made above about unreliability is worth another visit, however.  After the rash of publications on the topic, Mellenbergh and van der Linden (1978; 1980) pointed out that if you assume a linear loss function for misclassification, the conventional estimate of reliability – coefficient alpha – serves as a solid estimate of decision consistency.  What is a linear loss function?  It means that a misclassification is worse the further the person’s score is from the cutscore.  That is, if the cutscore is 70, failing someone with a true score of 80 is twice as bad as failing someone with a true score of 75.  Of course, we never know someone’s true score, so this is a theoretical assumption, but the researchers make an excellent point.
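
One way to build intuition for how decision consistency relates to reliability is a small simulation under the classical model X = T + E; the score distribution, reliability, and cutscore below are hypothetical.

```python
import random

random.seed(0)

# Simulate two parallel administrations under the classical model X = T + E,
# then count how often the pass/fail classification agrees.
reliability = 0.90
true_mean, true_sd = 75.0, 10.0   # hypothetical true-score distribution
cutscore = 70.0
n = 100_000

# Under CTT, reliability = var(T) / var(X) = var(T) / (var(T) + var(E)),
# so var(E) = var(T) * (1 - rel) / rel.
error_sd = true_sd * ((1 - reliability) / reliability) ** 0.5

agree = 0
for _ in range(n):
    t = random.gauss(true_mean, true_sd)
    x1 = t + random.gauss(0, error_sd)      # administration 1
    x2 = t + random.gauss(0, error_sd)      # administration 2
    agree += (x1 >= cutscore) == (x2 >= cutscore)

decision_consistency = agree / n
```

Moving the cutscore toward the mean of the score distribution lowers the consistency, which illustrates why indices that pool all examinees, most of whom sit far from the cutscore, tend to look optimistic.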

But while research amongst psychometricians on the topic has cooled since they made that point, NCCA still requires one of these statistics – most from the 1970s – to be reported.  The only other well-known index on the topic is Hanson and Brennan (1990).  While the indices have been shown to be different from classical reliability, I remain unconvinced that they are the right approach.  Of course, I’m not much of a fan of classical test theory in the first place; the acceptance of CSEM from IRT is definitely aligned with my views on how psychometrics should tackle measurement problems.


Item banking refers to the purposeful creation of a database of items intended to measure a predetermined set of constructs. The term item refers to what many call questions, though their content need not be restricted as such and can include problems to solve or situations to evaluate in addition to straightforward questions. The art of item banking is the organizational structure by which items are categorized. As a critical component of any high-quality assessment, item banking is the foundation for the development of valid, reliable content and defensible test forms. Automated item banking systems, such as the Item Explorer module of FastTest, result in significantly reduced administrative time for maintaining content and producing tests. While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Some of the essential aspects include ensuring that:

  • Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally item performance should be tracked not only within a test form, but across test forms as well.
  • Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. After exposure to many examinees, items can be flagged for retirement or revision to reduce cheating or teaching to the test.
  • Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization method, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.
  • Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.
  • Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content and the users who made those changes are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item can expedite that process.
  • Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, IRT parameters, and CTT statistics, but there are likely many data points specific to your organization that are worth storing.
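As a rough illustration of how these guidelines might translate into a data structure, here is a minimal sketch of an item record with version links, exposure tracking, and a metadata field. The field names are hypothetical, not any particular platform’s schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Item:
    # Illustrative fields only, not a specific product's schema
    item_id: str
    content_area: str
    author: str
    status: str = "draft"            # e.g. draft, in_review, active, retired
    version: int = 1
    parent_id: Optional[str] = None  # link to the earlier version of this item
    exposure_count: int = 0          # how many examinees have seen it
    metadata: dict = field(default_factory=dict)  # IRT params, CTT stats, etc.

def new_version(item: Item) -> Item:
    """Create a revised copy linked back to the original for version tracking."""
    return Item(item_id=f"{item.item_id}.v{item.version + 1}",
                content_area=item.content_area, author=item.author,
                version=item.version + 1, parent_id=item.item_id)

original = Item(item_id="8A-1-001", content_area="Biology", author="jdoe")
revised = new_version(original)
print(revised.item_id, revised.parent_id)  # 8A-1-001.v2 8A-1-001
```

The key design point is that a revision is a new record with a pointer back to its predecessor, so the performance of earlier versions can always be reviewed alongside the current one.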

Keeping these guidelines in mind, here are some concrete steps that you can take to establish your item bank in accordance with psychometric best practices.

Make your Job Easier: Establish a Naming Convention

Names are important. As you are importing or creating your item banks, it is important to identify each item with a unique but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting. For example, let’s consider the item banks of a high school science teacher. Take a look at the example below:

What are some ways that this utilizes best practices?

  • Each subject has its own item bank. We can easily view all Biology items by selecting the Biology item bank.
  • A separate folder, 8Ah, clearly delineates items for honors students.
  • The item names follow along with the item bank and category names, allowing us to search for all items for 8th grade unit A-1 with the query “8A-1”, or similarly for honors items with “8Ah-1”.
  • Leading zeros are used so that as the item bank expands, items will sort properly; an item ending in 001 will appear before 010.
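A quick sketch shows why the leading zeros matter: plain string sorting misorders unpadded names but handles zero-padded ones correctly. The item names below are hypothetical, following the example above:

```python
# Without leading zeros, string sorting misorders the items:
unpadded = ["BIO-1", "BIO-10", "BIO-2"]
print(sorted(unpadded))  # BIO-10 lands before BIO-2

# Zero-padded names sort correctly as plain strings:
padded = ["8A-1-010", "8A-1-002", "8Ah-1-001", "8A-1-001"]
print(sorted(padded))    # 001, 002, 010, then the honors bank
```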

Of course, the execution of these best practices should be adapted to the needs of your organization, but it is important to establish a convention of some kind.  For example, you might use a period rather than an underscore as a separator – as long as you are consistent.

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department, or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability and validity.

Statistics are used in the assembly of test forms, for example.  Classical statistics can be used to estimate mean, standard deviation, reliability, standard error, and pass rate, while item response theory parameters can be used to calculate test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
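For instance, two common classical calculations in form assembly are the standard error of measurement, computed from the form’s standard deviation and reliability, and the predicted mean raw score, which is the sum of the item p-values. A minimal sketch with toy values:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: the SD of the error component of observed scores."""
    return sd * math.sqrt(1 - reliability)

# Toy values for an assembled form
print(round(standard_error_of_measurement(sd=10.0, reliability=0.91), 2))  # 3.0

# The predicted raw-score mean of a form is the sum of its item p-values
p_values = [0.85, 0.70, 0.60, 0.55, 0.90]
print(round(sum(p_values), 2))  # expected mean score on this 5-item form: 3.6
```

Running these numbers for each candidate form is what makes pre-equating possible: forms can be adjusted before administration so their predicted statistics match.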

In the event that your organization decides to publish an adaptive test, utilizing CAT delivery, item parameters for each item will be essential because they are used for intelligently selecting items and scoring examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility and the storage of metadata is one such vital piece of documentation.
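As a rough sketch of why those item parameters matter for CAT, the most common selection rule picks the unadministered item with maximum Fisher information at the examinee’s current ability estimate; for a 2PL item this is a²P(1−P). The item pool below is hypothetical:

```python
import math

def prob_2pl(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = prob_2pl(theta, a, b)
    return a * a * p * (1 - p)

def select_next_item(theta, items):
    """Pick the item with maximum information at the current ability estimate."""
    return max(items, key=lambda it: information(theta, it["a"], it["b"]))

# Hypothetical pool with IRT a (discrimination) and b (difficulty) parameters
pool = [{"id": "A", "a": 1.2, "b": -1.0},
        {"id": "B", "a": 0.8, "b": 0.0},
        {"id": "C", "a": 1.5, "b": 0.2}]
print(select_next_item(theta=0.0, items=pool)["id"])  # C
```

Note how the highly discriminating item C wins even though item B’s difficulty exactly matches the ability estimate; without calibrated parameters stored in the bank, this selection cannot happen at all.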

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test takers, as they are likely to be diverse as well!

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. In the event of a high school science test, this may include the evaluation of the content taught in class. For a high-stakes certification exam, this almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed. Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.
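For example, once blueprint weights are defined, the number of items to draw from each content area follows directly from the form length. The areas and weights below are hypothetical:

```python
# Hypothetical blueprint weights: proportion of the form per content area
blueprint = {"Cells": 0.40, "Genetics": 0.35, "Ecology": 0.25}
form_length = 40

# Round each area's weight times the form length to a whole item count
counts = {area: round(weight * form_length) for area, weight in blueprint.items()}
print(counts)  # {'Cells': 16, 'Genetics': 14, 'Ecology': 10}
```

Categorized items can then be filtered by area so that each form matches these counts, keeping every form representative of the blueprint.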

There is no doubt that item banking will remain a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment.

Worried your current item banking platform isn’t up to par? We would love to discuss how Assessment Systems can help. FastTest was designed by psychometricians with an intuitive and easy to use item banking module. Check out our free version here, or contact us to learn more.
