Posts on psychometrics: The Science of Assessment

Spearman-Brown

The Spearman-Brown formula, also known as the Spearman-Brown Prophecy Formula or Correction, is a method used in evaluating test reliability.  It is based on the idea that split-half reliability has better assumptions than coefficient alpha but only estimates reliability for a half-length test, so you need to implement a correction that steps it up to a true estimate for a full-length test.

Looking for software to help you analyze reliability?  Download a free copy of Iteman.

Coefficient Alpha vs. Split Half Reliability

The most commonly used index of test score reliability is coefficient alpha.  However, it’s not the only index of internal consistency.  Another common approach is split-half reliability, where you split the test into two halves (first/last, even/odd, or random split) and then correlate scores on the two halves.  The reasoning is that if both halves of the test measure the same construct at a similar level of precision and difficulty, then scores on one half should correlate highly with scores on the other half.  More information on split-half reliability is found here.

However, split-half reliability presents an inconvenient situation: we are effectively gauging the reliability of only half a test.  It is well known that reliability increases with more items (observations); we can all agree that a 100-item test is more reliable than a 10-item test composed of similar-quality items.  So the split-half correlation blatantly underestimates the reliability of the full-length test.

Adjusting the Split Half Back To Reality: The Spearman-Brown Formula

To adjust for this, psychometricians use the Spearman-Brown prophecy formula.  It takes the split half correlation as input and converts it to an estimate of the equivalent level of reliability for the full-length test.  While this might sound complex, the actual formula is quite simple.

rfull = (2 × rhalf) / (1 + rhalf)

As you can see, the formula takes the split-half reliability (rhalf) as input and produces the full-length estimate (rfull).  This can then be interpreted alongside the ubiquitously used coefficient alpha.
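If you want to check the arithmetic yourself, here is a minimal sketch in Python (not the Iteman implementation).  It assumes a 0/1 scored response matrix with examinees as rows, computes an odd-even split-half correlation, and applies the Spearman-Brown correction; the simulated data are purely for illustration.

import numpy as np

def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)

def odd_even_split_half(scored):
    """Correlate total scores on the odd- vs. even-numbered items (rows = examinees)."""
    odd = scored[:, 0::2].sum(axis=1)
    even = scored[:, 1::2].sum(axis=1)
    return np.corrcoef(odd, even)[0, 1]

# Simulated 0/1 scored data, only for illustration: 500 examinees x 20 items
rng = np.random.default_rng(42)
ability = rng.normal(size=(500, 1))
scored = (rng.normal(size=(500, 20)) < ability).astype(int)

r_half = odd_even_split_half(scored)
print(f"Split-half (odd-even): {r_half:.3f}")
print(f"Spearman-Brown full-length estimate: {spearman_brown(r_half):.3f}")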

While the calculation is quite simple, you still shouldn’t have to do it yourself.  Any decent software for classical item analysis will produce it for you.  As an example, here is the output of the Reliability Analysis table from our Iteman software for automated reporting and assessment intelligence with CTT.  This lists the various split-half estimates alongside the coefficient alpha (and its associated SEM) for the total score as well as the domains, so you can evaluate if there are domains that are producing unusually unreliable scores. 

Note: There is an ongoing argument amongst psychometricians whether domain scores are even worthwhile since the assumed unidimensionality of most tests means that the domain scores are  less reliable estimates of the total score, but that’s a whole ‘nother blog post!

Score      N Items  Alpha  SEM    Split-Half (Random)  Split-Half (First-Last)  Split-Half (Odd-Even)  S-B Random  S-B First-Last  S-B Odd-Even
All items  50       0.805  3.058  0.660                0.537                    0.668                  0.795       0.699           0.801
1          10       0.522  1.269  0.338                0.376                    0.370                  0.506       0.547           0.540
2          18       0.602  1.860  0.418                0.309                    0.448                  0.590       0.472           0.619
3          12       0.605  1.496  0.449                0.417                    0.383                  0.620       0.588           0.553
4          10       0.485  1.375  0.300                0.329                    0.297                  0.461       0.495           0.457

You can see that, as mentioned earlier, there are three ways to do the split in the first place, and Iteman reports all three.  It then reports the Spearman-Brown estimate for each, which is higher than the corresponding split-half coefficient.  These generally align with the alpha estimates, which overall provides a cohesive picture of the exam’s structure and the reliability of its scores.  As you might expect, domains with more items are slightly more reliable, but not highly reliable, since they all have fewer than 20 items.

So, what does this mean in the big scheme of things?  Well, in many cases the Spearman-Brown estimates will not differ much from the alpha estimates, but it’s still good to confirm that.  In the case of high-stakes tests, you want to make every effort to ensure that the scores are highly reliable and precise.

Tell me more!

If you’d like to learn more, here is an article on the topic.  Or, contact solutions@assess.com to discuss consulting projects with our Ph.D. psychometricians.

Item Writing Tips

Item writing (aka item authoring) is a science as well as an art, and if you have done it, you know just how challenging it can be!  You are an expert at what you do, and you want to make sure that your examinees are too.  But it’s hard to write questions that are clear, reliable, unbiased, and that differentiate examinees on the construct you are trying to assess.  Here are some tips.

What is Item Writing / Item Authoring?

Item authoring is the process of creating test questions.  You have certainly seen “bad” test questions in your life, and know firsthand just how frustrating and confusing that can be.  Fortunately, there is a lot of research in the field of psychometrics on how to write good questions, and also how to have other experts review them to ensure quality.  It is best practice to make items go through a workflow, so that the test development process is similar to the software development process.

Because items are the building blocks of tests, it is likely that the items within your test are the greatest threat to its overall validity and reliability.  Here are some important tips for item authoring.  Want deeper guidance?  Check out our Item Writing Guide.

Anatomy of an Item

First, let’s talk a little bit about the parts of a test question.  The diagram on the right shows a reading passage with two questions on it.  Here are some of the terms used:

  • Asset/Stimulus: This is a reading passage here, but could also be an audio, video, table, PDF, or other resource
  • Item: An overall test question, usually called an “item” rather than a “question” because sometimes they might be statements.
  • Stem: The part of the item that presents the situation or poses a question.
  • Options: All of the choices to answer.
  • Key: The correct answer.
  • Distractors: The incorrect answers.

Parts of a test item

Item writing tips: The Stem

To find out whether your test items are your allies or your enemies, read through your test and identify the items that contain the most prevalent item construction flaws.  The first three of the most prevalent construction flaws are located in the item stem (i.e. question).  Look to see if your item stems contain…

1) BIAS

Nowadays, we tend to think of bias as relating to culture or religion, but there are many more subtle types of biases that oftentimes sneak into your tests.  Consider the following questions to determine the extent of bias in your tests:

  • Are there acronyms in your test that are not considered industry standard?
  • Are you testing on policies and procedures that may vary from one location to another?
  • Are you using vocabulary that is more recognizable to a female examinee than a male?
  • Are you referencing objects that are not familiar to examinees from a newer or older generation?

2) NOT

We’ve all taken tests which ask a negatively worded question. These test items are often the product of item authoring by newbies, but they are devastating to the validity and reliability of your tests, particularly for fast test-takers or individuals with lower reading skills.  If the examinee misses that one single word, they will get the question wrong even if they actually know the material.  This test item ends up penalizing the wrong examinees!

3) EXCESS VERBIAGE

Long stems can be effective and essential in many situations, but they are also more prone to two specific item construction flaws.  If the stem is unnecessarily long, it can contribute to examinee fatigue.  Because each item requires more energy to read and understand, examinees tire sooner and may begin to perform more poorly later on in the test—regardless of their competence level.

Additionally, long stems often include information that can be used to answer other questions in the test.  This could lead your test to be an assessment of whose test-taking memory is best (i.e. “Oh yeah, #5 said XYZ, so the answer to #34 is XYZ.”) rather than who knows the material.

Item writing tips:  distractors / options

Unfortunately, item stems aren’t the only offenders.  Experienced test writers know that the distractors (i.e. options) are actually more difficult to write than the stems themselves.  When you review your test items, look to see if your item distractors contain…

4) IMPLAUSIBILITY

The purpose of a distractor is to pull less qualified examinees away from the correct answer with other options that look correct.  In order for them to “distract” an examinee from the correct answer, they have to be plausible.  The closer they are to being correct, the more difficult the exam will be.  If the distractors are obviously incorrect, even unqualified examinees won’t pick them.  Then your exam will not help you discriminate between examinees who know the material and examinees who do not, which is the entire goal.

5) 3-TO-1 SPLITS

You may recall watching Sesame Street as a child.  If so, you remember the song “One of these things…”  (Either way, enjoy refreshing your memory!)   Looking back, it seems really elementary, but sometimes our test item options are written in such a way that an examinee can play this simple game with your test.  Instead of knowing the material, they can look for the option that stands out as different from the others.  Consider the following questions to determine if one of your items falls into this category:

  • Is the correct answer significantly longer than the distractors?
  • Does the correct answer contain more detail than the distractors?
  • Is the grammatical structure different for the answer than for the distractors?

6) ALL OF THE ABOVE

There are a couple of problems with having this phrase (or the opposite “None of the above”) as an option.  For starters, good test takers know that this is—statistically speaking—usually the correct answer.  If it’s there and the examinee picks it, they have a better than 50% chance of getting the item right—even if they don’t know the content.  Also, if they are able to identify two options as correct, they can select “All of the above” without knowing whether or not the third option was correct.  These sorts of questions also get in the way of good item analysis.   Whether the examinee gets this item right or wrong, it’s harder to ascertain what knowledge they have because the correct answer is so broad.

This is helpful, can I learn more?

Want to learn more about item writing?  Here’s an instructional video from one of our PhD psychometricians.  You should also check out this book.

Item authoring is easier with an item banking system

The process of reading through your exams in search of these flaws in the item authoring is time-consuming (and oftentimes depressing), but it is an essential step towards developing an exam that is valid, reliable, and reflects well on your organization as a whole.  We also recommend that you look into getting a dedicated item banking platform, designed to help with this process.

Summary Checklist

 

Issue: Key is invalid due to multiple correct answers.
Recommendation: Consider each answer option individually; the key should be fully correct and each distractor fully incorrect.

Issue: Item was written in a hard-to-comprehend way; examinees were unable to apply their knowledge because of poor wording.
Recommendation: Ensure that the item can be understood after just one read-through. If you have to read the stem multiple times, it needs to be rewritten.

Issue: Grammar, spelling, or syntax errors direct savvy test-takers toward the correct answer (or away from incorrect answers).
Recommendation: Read the stem, followed by each answer option, aloud. Each answer option should fit with the stem.

Issue: Information was introduced in the stem text that was not relevant to the question.
Recommendation: After writing each question, evaluate the content of the stem. It should be clear and concise, without irrelevant information.

Issue: Item emphasizes trivial facts.
Recommendation: Work off of a test blueprint to ensure that each of your items maps to a relevant construct. If you are using Bloom’s taxonomy or a similar approach, items should come from the higher-order levels.

Issue: Numerical answer options overlap.
Recommendation: Carefully evaluate numerical ranges to ensure there is no overlap among options.

Issue: Examinees noticed the answer was most often A.
Recommendation: Distribute the key evenly among the answer options. This can be avoided with FastTest’s randomized delivery functionality.

Issue: Key was overly specific compared to distractors.
Recommendation: Answer options should all be about the same length and contain the same amount of information.

Issue: Key was the only option to include a key word from the item stem.
Recommendation: Avoid re-using key words from the stem text in your answer options. If you do use such words, distribute them evenly among all of the answer options so as to not call out individual options.

Issue: A rare exception can be argued to invalidate a true/false always/never question.
Recommendation: Avoid using “always” or “never,” as there can be unanticipated or rare scenarios. Opt for less absolute terms like “most often” or “rarely.”

Issue: Distractors were not plausible, so the key was obvious.
Recommendation: Review each answer option and ensure that it has some bearing in reality. Distractors should be plausible.

Issue: Idiom or jargon was used; non-native English speakers did not understand.
Recommendation: It is best to avoid figures of speech; keep the stem text and answer options literal to avoid introducing undue discrimination against certain groups.

Issue: Key was significantly longer than distractors.
Recommendation: There is a strong tendency to write a key that is very descriptive. Be wary of this and evaluate distractors to ensure that they are approximately the same length.
 

Validity Threats

Validity threats are issues with a test or assessment that hinder the interpretations and use of scores, such as cheating, inappropriate use of scores, unfair preparation, or non-standardized delivery.  It is important to establish a test security plan to define the threats relevant for you and address them.

Validity, in its modern conceptualization, refers to evidence that supports our intended interpretations of test scores (see Chapter 1 of APA/AERA/NCME Standards for full treatment).   The word “interpretation” is key because test scores can be interpreted in different ways, including ways that are not intended by the test designers.  For example, a test given at the end of Nursing school to prepare for a national licensure exam might be used by the school as a sort of Final Exam.  However, the test was not designed for this purpose and might not even be aligned with the school’s curriculum.  Another example is that certification tests are usually designed to demonstrate minimal competence, not differentiate amongst experts, so interpreting a high score as expertise might not be warranted.

Validity threats: Always be on the lookout!

Test sponsors, therefore, must be vigilant against any validity threats.  Some of these, like the two aforementioned examples, might be outside the scope of the organization.  While it is certainly worthwhile to address such issues, our primary focus is on aspects of the exam itself.

Which validity threats rise to the surface in psychometric forensics?

Here, we will discuss several threats to validity that typically present themselves in psychometric forensics, with a focus on security aspects.  However, I’m not just listing security threats here, as psychometric forensics is excellent at flagging other types of validity threats too.

Threat: Collusion (copying)
Description: Examinees are copying answers from one another, usually with a defined source.
  • Approach: Error similarity (only looks at incorrect responses).  Example: 2 examinees get the same 10 items wrong, and select the same distractor on each.  Indices: B-B Ran, B-B Obs, K, K1, K2, S2
  • Approach: Response similarity.  Example: 2 examinees give the same response on 98/100 items.  Indices: S2, g2, ω, Zjk

Threat: Group-level help/issues
Description: Similar to collusion but at a group level; could be examinees working together, or receiving answers from a teacher/proctor.  Note that many examinees using the same brain dump would have a similar signature, but across locations.
  • Approach: Group-level statistics.  Example: a location has one of the highest mean scores but one of the lowest mean times.  Indices: descriptive statistics such as mean score, mean time, and pass rate
  • Approach: Response or error similarity.  Example: on a certain group of items, the entire classroom gives the same answers.  Indices: roll-up analysis, such as mean collusion flags per group; also erasure analysis (paper only)

Threat: Pre-knowledge
Description: Examinee comes in to take the test already knowing the items and answers, often purchased from a brain dump website.
  • Approach: Time-score analysis.  Example: examinee has a high score and a very short time.  Indices: RTE or total time vs. scores
  • Approach: Response or error similarity.  Example: examinee has all the same responses as a known brain dump site.  Indices: all indices
  • Approach: Pretest item comparison.  Example: examinee gets 100% on existing items but 50% on new items.  Indices: pretest vs. scored results
  • Approach: Person fit.  Example: examinee gets the 10 hardest items correct but performs below average on the rest of the items.  Indices: Guttman indices, lz

Threat: Harvesting
Description: Examinee is not actually trying to pass the test, but is sitting it to memorize items so they can be sold afterwards, often on a brain dump website.  Similar signature to the Sleeper, but more likely to occur on voluntary tests, or where high scores benefit examinees.
  • Approach: Time-score analysis.  Example: low score, high time, few attempts.  Indices: RTE or total time vs. scores
  • Approach: Mean vs. median item time.  Example: examinee “camps” on 10 items to memorize them; mean item time is much higher than the median.  Indices: mean-median index
  • Approach: Option flagging.  Example: examinee answers “C” to all items in the second half.  Indices: option proportions

Threat: Low motivation (Sleeper)
Description: Examinees are disengaged, producing data that is flagged as unusual and invalid; fortunately, not usually a security concern, but it could be a policy concern.  Similar signature to the Harvester, but more likely to occur on mandatory tests, or where high scores do not benefit examinees.
  • Approach: Time-score analysis.  Example: low score, high time, few attempts.  Indices: RTE or total time vs. scores
  • Approach: Item timeout rate.  Example: if you have item time limits, the examinee hits them.  Indices: proportion of items that hit the limit
  • Approach: Person fit.  Example: examinee attempts a few items, passes through the rest.  Indices: Guttman indices, lz

Threat: Low motivation (Clicker)
Description: Examinees are disengaged, producing data that is flagged as unusual and invalid; fortunately, not usually a security concern, but it could be a policy concern.  Similar idea to the Sleeper, but the data look quite different.
  • Approach: Time-score analysis.  Example: examinee quickly clicks “A” to all items, finishing with a low time and a low score.  Indices: RTE, total time vs. scores
  • Approach: Option flagging.  Example: see above.  Indices: option proportions

Psychometric Forensics to Find Evidence of Cheating

An emerging sector in the field of psychometrics is the area devoted to analyzing test data to find cheaters and other illicit or invalid testing behavior.  There is a distinction between primary and secondary collusion, and there are specific collusion detection indices and methods to investigate aberrant testing behavior, such as those listed in the table above.
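The published indices listed above (ω, g2, B-B, and so on) each have their own formulas in the literature.  Purely to illustrate the general idea of response and error similarity, here is a naive Python sketch that counts identical responses and shared wrong answers for every pair of examinees and flags pairs above an arbitrary threshold; the data and threshold are hypothetical, and this is not a substitute for the proper indices.

from itertools import combinations
import numpy as np

def flag_similar_pairs(responses, keys, match_threshold=0.95):
    """Flag examinee pairs with a very high proportion of identical responses.

    responses: 2-D array of selected options (rows = examinees, columns = items)
    keys: 1-D array of correct options
    This is only a naive illustration, not the published collusion indices.
    """
    flags = []
    n_examinees = responses.shape[0]
    for i, j in combinations(range(n_examinees), 2):
        same = responses[i] == responses[j]
        both_wrong = (responses[i] != keys) & (responses[j] != keys)
        exact_match_rate = same.mean()                  # response similarity
        shared_errors = int((same & both_wrong).sum())  # error similarity
        if exact_match_rate >= match_threshold:
            flags.append((i, j, exact_match_rate, shared_errors))
    return flags

# Hypothetical data: options coded 0-3 for 100 items, 50 examinees
rng = np.random.default_rng(0)
keys = rng.integers(0, 4, size=100)
responses = rng.integers(0, 4, size=(50, 100))
responses[7] = responses[3]          # plant an identical pair for the demo
print(flag_similar_pairs(responses, keys))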

While research on this topic is more than 50 years old, the modern era did not begin until Wollack published his paper on the Omega index in 1997. Since then, the sophistication and effectiveness of methodology in the field has multiplied, and many more publications focus on it than in the pre-Omega era. This is evidenced by not one but three recent books on the subject:

  1. Wollack, J., & Fremer, J. (2013).  Handbook of Test Security.
  2. Kingston, N., & Clark, A. (2014).  Test Fraud: Statistical Detection and Methodology.
  3. Cizek, G., & Wollack, J. (2016). Handbook of Quantitative Methods for Detecting Cheating on Tests.

 

Likert Scales

Likert scales (items) are a type of item used in human psychoeducational assessment, primarily to assess noncognitive constructs.  That is, while item types like multiple choice or short answer are used to measure knowledge or ability, Likert scales are better suited to measuring things like anxiety, conscientiousness, or motivation.

In the realm of psychology, surveys, and market analysis, Likert scales stand tall as one of the most versatile and widely used tools. Whether you’re a researcher, a marketer, or simply someone interested in understanding human attitudes and opinions, grasping the essence of Likert scales can significantly enhance your understanding of data collection and analysis. In this guide, we’ll delve into what Likert scales are, why they’re indispensable, the types of items they’re suited for, and how to score them effectively.

What is a Likert Scale/Item?

A Likert scale, named after its creator Rensis Likert, is a psychometric scale used to gauge attitudes, opinions, perceptions, and behaviors. It typically consists of a series of statements or questions that respondents are asked to rate based on a specified scale. The scale often ranges from strongly disagree to strongly agree, with varying degrees of intensity or frequency in between. Likert scales are primarily used in survey research but have found applications in various fields, including psychology, sociology, marketing, and education.

We’ve all seen these in our past; they are the items that say something like “Rate on a scale of 1 to 5.”  Sometimes the numbers have descriptive text anchors, like you see below.  If these are behaviorally-based, they are called Behaviorally Anchored Rating Scales (BARS).

Likert scale item

You can consider the Likert Scale to be the notion of 1 to 5 or Strongly Disagree to Strongly Agree.  A Likert Item is an item on an assessment that uses a Likert Scale.  In many cases, the scale is reused over items; in the example above, we have two items that use the same scale.  However, the terms are often used interchangeably.

Why Use a Likert Scale?

The popularity of Likert scales stems from their simplicity, flexibility, and ability to capture nuanced responses. Here are several reasons why Likert scales are favored:

  • Ease of Administration: Likert items are easy to administer, making them suitable for both online and offline surveys.
  • Quantifiable Data: Likert scales generate quantitative data, allowing for statistical analysis and comparison across different groups or time points. Open response items, where an examinee might type in how they feel about something, are much harder to quantify.
  • Flexibility: They can accommodate a wide range of topics and attitudes, from simple preferences to complex opinions.
  • Standardization: Likert scales provide a standardized format for measuring attitudes, enhancing the reliability and validity of research findings.
  • Ease of Interpretation: Likert responses are straightforward to interpret, making them accessible to researchers and non-researchers alike.  For example, in the first example above, if the average response is 4.1, we can say that respondents generally Agree with the statement.
  • Ease of understanding: Since these are so commonly used, everyone is familiar with the format and can respond quickly.

 

What Sort of Assessments Use a Likert Scale?

Likert scales are well-suited for measuring various constructs, including:

  • Attitudes: Assessing attitudes towards a particular issue, product, or service (e.g., “I believe climate change is a pressing issue”).
  • Opinions: Gauging opinions on controversial topics or current events (e.g., “I support the legalization of marijuana”).
  • Perceptions: Capturing perceptions of quality, satisfaction, or trust (e.g., “I am satisfied with the customer service provided”).
  • Behaviors: Examining self-reported behaviors or intentions (e.g., “I exercise regularly”).
  • Agreement or Frequency: Measuring agreement with statements or the frequency of certain behaviors (e.g., “I often recycle household waste”).

 

How Do You Score a Likert Item?

Scoring a Likert scale item involves assigning numerical values to respondents’ selected options. Typically, the scale is assigned values from 1 to 5 (or more), representing varying degrees of agreement, frequency, or intensity.  In the example above, the possible scores for each item are 1, 2, 3, 4, 5.  There are then two ways we can use this to obtain scores for examinees.

  • Classical test theory: Either sum or average. For the former, simply add up the scores for all items within the scale for each respondent. If they respond as 4 to both items, their score is 8.  For the latter, we find their average answer.  If they answer a 3 and a 4, their score is 3.50.  Note that both of these are easily interpretable (a small sketch below illustrates this, together with reverse coding).
  • Item Response Theory: In large scale assessment, Likert scales are often analyzed and scored with polytomous IRT models such as the Rating Scale Model and Graded Response Model.  An example of this sort of analysis is shown here.

Other important considerations:

  • Reverse Coding: If necessary, reverse code items to ensure consistency (e.g., strongly disagree = 1, strongly agree = 5).  In the example above, we are clearly assessing Extraversion; the first item is normal scoring, while the second item is reverse-scored.  So actually, answering a 2 to the second question is really a 4 in the direction of Extraversion, and we would score it as such.
  • Collapsing Categories: Sometimes, if few respondents answer a 2, you might collapse 1 and 2 into a single category.  This is especially true if using IRT.  The image here also shows an example of that.
  • Norms: Because most traits measured by Likert scales are norm-referenced (see more on that here!), we often need to set norms.  In the simple example, what does a score of 8/10 mean?  The meaning is quite different if the average is 6 with a standard deviation of 1, than if the average is 8.  Because of this, scores might be reported as z-scores or T-scores.
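To make the classical scoring, reverse coding, and simple norm reporting concrete, here is a small Python sketch assuming the two-item, 5-point Extraversion example above; it flips the reverse-worded item, computes sum and average scores, and converts the sums to z-scores and T-scores.  The response data are hypothetical.

import numpy as np

def reverse_code(responses, n_points=5):
    """Reverse-score a 1..n_points Likert response (e.g., 2 becomes 4 on a 5-point scale)."""
    return (n_points + 1) - responses

# Hypothetical responses to the two Extraversion items above (rows = respondents)
raw = np.array([
    [4, 2],   # item 1 normal-scored, item 2 reverse-scored
    [5, 1],
    [2, 4],
])

scored = raw.astype(float)
scored[:, 1] = reverse_code(scored[:, 1])   # flip the reverse-worded item

sum_scores = scored.sum(axis=1)        # CTT sum score
mean_scores = scored.mean(axis=1)      # CTT average score

# Simple norm-referenced reporting: convert to z-scores (and T = 50 + 10z)
z = (sum_scores - sum_scores.mean()) / sum_scores.std(ddof=1)
t_scores = 50 + 10 * z
print(sum_scores, mean_scores.round(2), t_scores.round(1))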

 

Item Analysis of Likert items

We use statistical techniques to evaluate item quality, such as the average score per item or the item-total correlation (discrimination). You can also perform more advanced analyses like factor analysis or regression to uncover underlying patterns or relationships.  Here are some initial considerations, followed by a small sketch of the calculations.

  • Frequency of each response: How many examinees selected each?  This is the N column in the graph above.  The Prop column is the same thing but converted to proportion.
  • Mean score per response: This is evidence that the item is working well.  Did people who answered “1” score lower overall on the Extraversion score than people who scored 3?  This is definitely the case above.
  • Rpbis per response, or overall R: We want the item to correlate with total score.  This is strong evidence for the validity of the item.  In this example, the correlation is 0.620, which is great.
  • Item response theory:  We can evaluate threshold values and overall item discrimination, as well as issues like item fit.  This is extremely important, but beyond the scope of this post!
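Here is a small Python sketch of the first three bullets, assuming the responses are already scored in the keyed direction.  It tabulates the frequency and proportion of each response, the mean score of the respondents choosing each response, and the item-total correlation; this version uses the rest score (total minus the item) to avoid inflating the correlation, which is a design choice rather than the only option.  The data are hypothetical.

import numpy as np

def likert_item_analysis(scored, item_index):
    """Basic response-level statistics for one Likert item.

    scored: 2-D array of scored responses (rows = respondents, columns = items)
    Returns frequency, proportion, and mean rest score for each response value,
    plus the item-rest correlation.
    """
    item = scored[:, item_index]
    rest = scored.sum(axis=1) - item          # total score excluding this item
    stats = {}
    for value in np.unique(item):
        mask = item == value
        stats[int(value)] = {
            "N": int(mask.sum()),
            "Prop": round(float(mask.mean()), 3),
            "Mean rest score": round(float(rest[mask].mean()), 2),
        }
    r_item_rest = np.corrcoef(item, rest)[0, 1]
    return stats, r_item_rest

# Hypothetical 1-5 responses for 6 respondents on 4 items
scored = np.array([
    [1, 2, 2, 1],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [4, 4, 3, 4],
    [5, 4, 5, 5],
    [5, 5, 5, 4],
])
stats, r = likert_item_analysis(scored, item_index=0)
print(stats)
print(f"Item-rest correlation: {r:.3f}")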

We also want to validate the overall test.  Scores and subscores can be evaluated with descriptive statistics, and for reliability with indices like coefficient alpha.

 

Summary

In conclusion, Likert scales are invaluable tools for capturing and quantifying human attitudes, opinions, and behaviors. Understanding their utility and nuances can empower researchers, marketers, and decision-makers to extract meaningful insights from their data, driving informed decisions and actions. So, whether you’re embarking on a research project, designing a customer satisfaction survey, or conducting employee assessments, remember to leverage Likert scales to efficiently assess noncognitive traits and opinions.

Test Blueprints / Specifications

Test blueprints, aka test specifications (shortened to “test specs”), are the formalized design of an assessment, test, or exam.  This can be in the context of educational assessment, pre-employment, certification, licensure, or any other type.  Generally, the amount of effort and detail is commensurate with the stakes of the assessment; a 10 item quiz for 5th grade math is quite different than the licensure exam for surgeons!

Why do we need test blueprints?

The blueprints are used for various purposes.  The most important is that they are part of the validity documentation.  Validity refers to the evidence we have (“evidence-centered design”) that a test’s scores mean what we want them to mean.  So if we want the scores to reflect knowledge of the high school math curriculum for graduation, then the test specifications should align to the curriculum quite closely.  If we want the scores to reflect that a surgeon is qualified to practice, we want the test specifications to reflect the knowledge and skills needed to practice.  A lot of work can go into designing the blueprints, such as a job task analysis in certification and licensure.  The image here provides an example of how JTA data is converted into content blueprints.

The test blueprints/specifications are also important for directing efforts in test development.  At the simplest level, you want your item writers to create new items in areas where you need them.  If the blueprints only call for 1% of the test on a certain topic, you don’t want the item writers making a lot of new questions there.

The test blueprints are often published publicly in a simplified version to help external stakeholders.  For example, you want the surgeons to be able to study for their test, so you publish a list of content domains that is covered by the test, and the percentage of items from each.  A fantastic example of this is at NOCTI.  Another good example which covers multiple aspects of the list below is this one from New Mexico.

 

What are test blueprints?

The test blueprints, like the blueprints of a house or office building, define everything needed to build it.  There are multiple aspects to this, which can vary by type of exam.  It breaks down into two types of information: item distribution, and operational guidelines.

Item distribution

There are many ways that you can classify items on the test.  The content domain or topic that they cover is the most obvious here, such as defining a math test that is 40% Algebra, 30% Geometry, and 30% Calculus.  But there are other, more practical and operational, considerations as well.

 

Number of items

First, the blueprints should define the number of items, including a breakdown of scored vs. unscored (pilot) items.  Often, there is documented reasoning behind the choices for this, such as pretesting plans, or an estimate of reliability based on projected test length.

Content

This is the most important and most common.  Some test blueprints only cover this and the number of items.  It defines all the content covered by the test, and the percentage for each.  Sometimes, there are sub-domains and sub-sub-domains!  Here is an example of that, from the New Mexico link provided earlier.

New Mexico test blueprints

Item type

Many tests only have multiple choice items, so this is then unnecessary.  But there are tests, for example, that require 50 multiple choice items, 10 drag and drop, 10 fill-in-the-blank, and 2 essay.  Such designs need to be explained and codified in the test blueprints.

Statistics

Some test blueprints define a distribution or target level of statistics.  For example, it might require 20% of the items to have classical difficulty statistics (P-values) of 0.40 to 0.60, 60% of the items with values 0.60 to 0.90, and 20% from 0.90 to 1.00.  Or, there might just be acceptable ranges, such as stating that all difficulty statistics should be 0.40 to 0.98.

Cognitive level or Bloom’s

Not all assessments tackle this consideration, but it is common in education.  The test blueprints might specify a certain number of items that are Recall vs. higher levels of cognitive complexity.  Note that this might overlap with Item Type.

Sections

The design of the test might be ordered into sections, which is documented closely.  Continuing the example above, there might be Section 1 that is the 50 multiple choice items, Section 2 is drag-and-drop plus fill-in-the-blank, and Section 3 is Essay.

 

Operational and practical considerations

This part of the blueprints covers aspects other than the nature of items.  There are many things that are useful, but here are a few examples.

  • Time limits – What is the overall time limit of the test?  Section time limits?
  • Navigation – Are examinees allowed to move back and forth between sections?  Between items?
  • Test design – If you are using modern designs like computerized adaptive testing or linear on the fly testing, you need to define these with a lot of detail.
  • Messaging – What instructions will you give?  Are there pop-up messages?
  • Access – How do you control access to the exam?  Are there eligibility requirements?  Published online vs. paper?  So many options.

 

Summary

As you can see, there are a ton of things to consider when publishing a test.  If the test is low-stakes, many of these are treated informally, such as a teacher handing out a 10 item quiz.  But for high-stakes assessment, the definition of formal test blueprints and specifications is absolutely essential.  Not only does it prepare the candidates and other stakeholders, but it makes things easier for the test developers, and provides substantial documentation for validity.  Moreover, if you work in an area where there are potential legal challenges, it provides a bulwark of legal defensibility.  If you work in high-stakes or high-volume assessment, you need to define your test blueprints.

Test Publishing Quality Control

Test publishing is the process of preparing computer-based assessments for delivery on an electronic platform. Test publishing is like a car rolling off the assembly line. It’s the culmination of a great deal of effort in developing the assessment. Just as a car undergoes extensive checks before leaving the factory, a computer-based assessment requires meticulous quality control procedures to make sure that it functions as intended. Errors may have significant consequences for the sponsoring organization, including a loss of reputation, and can even have legal implications, depending upon the type of error.

test publishing quality assurance

The test publishing quality control process begins prior to the  start of the publishing process. The key steps in the process are as follows:

 

Step 1: Determine the test publishing specifications

Quality control begins with the completion of a test specifications document. The test specifications document provides the pattern or the playbook for how the test should be published. It typically includes the following information:

  • Test design
    • Administration model (i.e., linear fixed form, LOFT, CAT)
    • Scoring strategy (dichotomous/polytomous item-level scoring, compensatory/conjunctive domain/sectional scoring)
    • Test length (number of items shown to each candidate)
    • Test duration (time allowed for exam/sections)
  • Content specifications
    • List of included items (and which are scored/unscored)
    • Mapping of items to domains/sections/subscales (if applicable)
    • Mapping of stimuli to items (if applicable)
    • Item keys
  • Ancillary delivery components
    • Non-disclosure agreement
    • Tutorial
    • Customized help screens
    • Calculator
  • Features and functionality
    • Navigation (e.g., review of previous items allowed)
    • Review screens
    • Electronic scratch pad
    • Item-level comments/feedback

Note that this is not a comprehensive list, and information needed for the test specifications documents may vary depending upon the type of assessment and the specific testing platform used for delivery. Some of the data on the test specifications are relatively static and will change only with changes to the test design. Other data, such as the list of included items, are dynamic and will typically change each time the assessment is republished.

The test specifications document becomes the authoritative source of truth used by the test publisher for how the assessment should be published. It is a key communication tool between the sponsoring organization and their test publishing vendor or partner. 

 

Step 2: Identify sources of test publishing errors

A comprehensive determination of everything that could possibly go wrong in the test publishing process should serve as the guide for the quality control checks that need to be performed before the test goes live. A tool that can assist in developing a comprehensive list of potential errors is a fishbone diagram, also known as an Ishikawa diagram or a cause-and-effect diagram.  It is a visual representation used to identify and organize possible causes of a specific problem or effect. The diagram takes the form of a fish skeleton (hence, its name), with the “head” representing the problem or effect, and the “bones” representing different categories of potential causes. Along each bone, smaller branches represent sub-causes, which are specific elements that may contribute to the problem.

Fishbone diagrams are created by having a team representing all disciplines involved in a process brainstorm potential problems, or in the case of test publishing, potential errors that can be introduced into the test publishing process. Determining potential categories of errors first and then brainstorming more specific errors enables a comprehensive analysis of the test publishing process and the potential occasions in which errors can be introduced in that process.

Here’s a sample fishbone diagram for test publishing errors: 

test publishing errors fishbone

 

Step 3: Review against source of truth

The examination should be reviewed against the source of truth for each potential error. The test specifications will be the key source of truth against which the published examination is compared to identify any errors. The item bank is the source of truth for item presentation and item metadata.
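As a concrete illustration of this kind of review, here is a minimal Python sketch that compares a published exam export against the specification’s item list and the item bank’s keys.  All field names and data here are hypothetical, not the format of any particular delivery platform.

# A minimal sketch of one quality-control check: comparing the item list and keys
# in a published exam export against the test specifications and the item bank.
# All field names here are hypothetical.

def check_published_exam(published_items, item_bank, spec_item_ids):
    """Return a list of human-readable discrepancies (empty list = no issues found)."""
    issues = []
    published_ids = {item["id"] for item in published_items}

    # 1. Every item on the spec list is present, and nothing extra was published
    missing = set(spec_item_ids) - published_ids
    extra = published_ids - set(spec_item_ids)
    if missing:
        issues.append(f"Items on spec but not published: {sorted(missing)}")
    if extra:
        issues.append(f"Items published but not on spec: {sorted(extra)}")

    # 2. Keys match the item bank (the source of truth for item metadata)
    for item in published_items:
        bank_key = item_bank.get(item["id"], {}).get("key")
        if bank_key is not None and item["key"] != bank_key:
            issues.append(f"Key mismatch on {item['id']}: published {item['key']}, bank {bank_key}")
    return issues

# Hypothetical data
spec_item_ids = ["ITM001", "ITM002", "ITM003"]
item_bank = {"ITM001": {"key": "B"}, "ITM002": {"key": "D"}, "ITM003": {"key": "A"}}
published = [{"id": "ITM001", "key": "B"}, {"id": "ITM002", "key": "C"}]
for issue in check_published_exam(published, item_bank, spec_item_ids):
    print(issue)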

Error-free test publishing is central to preserving the test sponsor’s reputation. Even minor mistakes, such as misspelled words, can be damaging. More importantly, errors in a published examination can have deleterious effects for candidates. A scoring error might mean the difference between a candidate failing and passing, and in the case of a certification or licensure examination, that can have dire consequences for the candidate’s career and livelihood. 

As mathematician Nassim Nicholas Taleb stated, “Quality is the result of an intelligent effort, not a chance happening.” A rigorous quality control procedure aids in making the publishing process an intelligent effort.

Testlets

Testlet is a term in educational assessment that refers to a set of test items or questions grouped together on a test, often with a common theme or scenario. This approach aims to provide a more comprehensive and nuanced assessment of an individual’s abilities compared to traditional testing methods.

What is a testlet?

As mentioned above, a testlet is a group of items delivered together.  There are two ways of doing this.

  1. Items that share a common stimulus or otherwise MUST be together.  An example of this is a reading passage with 4 questions about it.  You can’t have the passage and the 4 questions scattered about a 100 item test as 5 screens in random places!  It all has to be together to make sense.
  2. Items that do not have to be together, but it improves the purpose of the assessment.  In this case, you might have 10 items that are standalone (no reading passage or anything relating them), but your test might be multistage testing and all items are delivered in blocks of 10.  Test designers can tailor the difficulty level based on the test-taker’s performance. As a test-taker progresses through a testlet, the system dynamically adjusts the complexity of subsequent questions, ensuring a personalized and accurate assessment of their proficiency.

Example item - testlet

 

Why use testlets?

The answer is obvious in the first case: you have to.  But it does get deeper than that.

One key feature of testlets is their ability to mimic real-world scenarios. Unlike standalone questions, testlets present a series of interconnected problems or tasks that require the test-taker to apply their knowledge in a cohesive manner. This not only assesses their understanding of isolated concepts but also evaluates their ability to integrate information and solve complex problems.  Testlets can be particularly effective in assessing critical thinking, problem-solving skills, and practical application of knowledge. By presenting questions in a contextually linked manner, testlets offer a more authentic representation of a person’s ability to handle real-world challenges.

Testlets promote efficiency in testing. With a focused set of questions, they save time and reduce the fatigue associated with extensive testing sessions. This makes them an attractive option for educators and testing organizations seeking to streamline assessment processes while maintaining accuracy.  That is, if you want 20 items on reading comprehension, you could have 20 reading passages each with 1 question, or 4 reading passages each with 5 questions.  The fatigue would be far less in the latter test!

The second case, of standalone items, is a bit more nuanced.  It often has to do with managing the blueprints of the test, making best use of the item bank, and other operational considerations.  For example, perhaps the test has a blueprint to have 50% algebra items, 30% geometry, and 20% trigonometry.  You might build packets of 10 items with 5, 3, 2 respectively, and use those packets.

 

How do you score testlets?

Testlets can be scored with traditional methods, or with a new technology that was developed for this unique situation.

First, you can score with classical test theory, which is the traditional method of number-correct or points.

Second, you can use item response theory.  However, if the items share a strong relation, this might violate the IRT assumption of local independence.

Third, testlet response theory (TRT; Wainer, Bradlow, & Wang, 2007) works to address some of the concerns with traditional IRT.
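As a small illustration of the first (classical) approach, here is a Python sketch that collapses 0/1 item scores into a number-correct score per testlet; such testlet-level "super-item" scores are also the kind of polytomous data you could feed into a graded or partial-credit IRT model if you go that route.  The data and testlet groupings are hypothetical.

import numpy as np

def testlet_scores(item_scores, testlet_map):
    """Collapse 0/1 item scores into number-correct scores per testlet.

    item_scores: 2-D array (rows = examinees, columns = items)
    testlet_map: dict of testlet name -> list of column indices
    """
    return {name: item_scores[:, cols].sum(axis=1) for name, cols in testlet_map.items()}

# Hypothetical: 2 reading passages with 4 items each, 2 examinees
scores = np.array([
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 0],
])
testlets = {"Passage A": [0, 1, 2, 3], "Passage B": [4, 5, 6, 7]}
print(testlet_scores(scores, testlets))   # e.g., Passage A scores [3, 4], Passage B scores [2, 2]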

 

Summary

In conclusion, a testlet is a powerful and flexible tool in the toolbox of assessment designers. Its ability to present interconnected questions, mimic real-world scenarios, and adapt to individual performance makes it a valuable asset in gauging a person’s knowledge and skills. As education and assessment methods continue to evolve, the role of testlets is likely to expand, contributing to more accurate and meaningful evaluations of individuals in various fields.

Job Task Analysis

Job Task Analysis (JTA) is an essential step in designing a test to be used in the workforce, such as pre-employment or certification/licensure, by analyzing data on what is actually being done in the job.  Also known as Job Analysis or Role Delineation, job task analysis is important to design a test that is legally defensible and eligible for accreditation.  It usually involves a panel of subject matter experts to develop a survey, which you then deliver to professionals in your field to get quantitative data about what is most frequently done on the job and what is most critical/important.  This data can then be used for several important purposes.

Need help? Our experts can help you efficiently produce a job task analysis study for your certification, guide the process of item writing and standard setting, then publish and deliver the exam on our secure platform.

 

Reasons to do a Job Task Analysis

Job analysis is extremely important in the field of industrial/organizational psychology, hence the meme here from @iopsychmemes.  It’s not just limited to credentialing.

Job analysis I/O Psychology

Exam design

The most common reason is to get quantitative data that will help you design an exam.  By knowing which knowledge, skills, or abilities (KSAs) are most commonly used, you then know which deserve more questions on the test.  It can also help you with more complex design aspects, such as defining a practical exam with live patients.

Training curriculum

Similarly, that quantitative info can help design a curriculum and other training materials.  You will have data on what is most important or frequent.

Compensation analysis

You have a captive audience with the JTA survey.  Ask them other things that you want to know!  This is an excellent time to gather information about compensation.  I worked on a JTA in the past which asked about work location: clinic, hospital, private practice, or vendor/corporate.

Job descriptions

A good job analysis will help you write a job description for postings.  It will tell you the job responsibilities (common tasks), qualifications (required skills, abilities, and education), and other important aspects.  If you gather compensation data in the survey, that can be used to define the salary range of the open position.

Workforce planning

Important trends might become obvious when analyzing the data.  Are fewer people entering your profession, perhaps specific to a certain region or demographic?  Are they entering without certain skills?  Are there certain universities or training programs that are not performing well?  A JTA can help you discover such issues and then work with stakeholders to address them.  These are major potential problems for the profession.

IT IS MANDATORY

If you have a professional certification exam and want to get it accredited by a board such as NCCA or ANSI/ANAB/ISO, then you are REQUIRED to do some sort of job task analysis.

 

Why is a JTA so important for certification and licensure?  Validity.

The fundamental goal of psychometrics is validity, which is evidence that the interpretations we make from scores are actually true. In the case of certification and licensure exams, we are interpreting that someone who passes the test is qualified to work in that job role. So, the first thing we need to do is define exactly what is the job role, and to do it in a quantitative, scientific way. You can’t just have someone sit down in their basement and write up 17 bullet points as the exam blueprint.  That is a lawsuit waiting to happen.

There are other aspects that are essential as well, such as item writer training and standard setting studies.

 

The Methodology: Job Task Inventory

It’s not easy to develop a defensible certification exam, but the process of job task analysis (JTA) doesn’t require a Ph.D. in Psychometrics to understand. Here’s an overview of what to expect.

  1. Convene a panel of subject matter experts (SMEs), and provide a training on the JTA process.
  2. The SMEs then discuss the role of the certification in the profession, and establish high-level topics (domains) that the certification test should cover. Usually, there are 5-20. Sometimes there are subdomains, and occasionally sub-subdomains.
  3. The SME panel generates a list of job tasks that are assigned to domains; the list is reviewed for duplicates and other potential issues. These tasks have an action verb, a subject, and sometimes a qualifier. Examples: “Calibrate the lensometer,” “Take out the trash”, “Perform an equating study.”  There is a specific approach to help with the generation, called the critical incident technique.  With this, you ask the SMEs to describe a critical incident that happened on the job and what skills or knowledge led to success by the professional.  While this might not generate ideas for frequent yet simple tasks, it can help generate ideas for tasks that are rarer but very important.
  4. The final list is used to generate a survey, which is sent to a representative sample of professionals that actually work in the role. The respondents take the survey, whereby they rate each task, usually on its importance and time spent (sometimes called criticality and frequency). Demographics are also gathered, which include age range, geographic region, work location (e.g., clinic vs hospital if medical), years of experience, educational level, and additional certifications.
  5. A psychometrician analyzes the results and creates a formal report, which is essential for validity documentation.  This report is sometimes considered confidential, sometimes published on the organization’s website for the benefit of the profession, and sometimes published in an abbreviated form.  It’s up to you.  For example, this site presents the final results, but then asks you to submit your email address for the full report.

 

Using JTA results to create test blueprints

Many corporations do a job analysis purely for in-house purposes, such as job descriptions and compensation.  This becomes important for large corporations where you might have thousands of people in the same job; it needs to be well-defined, with good training and appropriate compensation.

If you work for a credentialing organization (typically a non-profit, but sometimes the training arm of a corporation; for example, Amazon Web Services has a division dedicated to certification exams), you will need to analyze the results of the JTA to develop exam blueprints.  We will discuss this process in more detail in another blog post.  But below is an example of how this will look, and here is a free spreadsheet to perform the calculations: Job Task Analysis to Test Blueprints.

 

Job Task Analysis Example

Suppose you are an expert widgetmaker in charge of the widgetmaker certification exam.  You hire a psychometrician to guide the organization through the test development process.  The psychometrician would start by holding a webinar or in-person meeting for a panel of SMEs to define the role and generate a list of tasks.  The group comes up with a list of 20 tasks, sorted into 4 content domains.  These are listed in a survey to current widgetmakers, who rate them on importance and frequency.  The psychometrician analyzes the data and presents a table like you see below.

We can see here that Task 14 is the most frequent, while Task 2 is the least frequent.  Task 7 is the most important while Task 17 is the least.  When you combine Importance and Frequency either by adding or multiplying, you get the weights on the right-hand columns.  If we sum these and divide by the total, we get the suggested blueprints in the green cells.
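Here is a minimal Python sketch of that arithmetic, using hypothetical tasks and mean ratings; importance and frequency are combined by multiplication and then normalized so the blueprint weights sum to 100%.

# A minimal sketch of turning mean JTA survey ratings into blueprint weights.
# Task names and ratings are hypothetical; importance and frequency are combined
# by multiplication, then normalized so the weights sum to 100%.

tasks = {
    # task: (mean importance, mean frequency)
    "Calibrate the lensometer": (4.5, 3.2),
    "Perform an equating study": (3.8, 1.5),
    "Document patient results":  (4.9, 4.7),
    "Maintain equipment":        (3.1, 4.0),
}

weights = {task: imp * freq for task, (imp, freq) in tasks.items()}
total = sum(weights.values())
blueprint = {task: round(100 * w / total, 1) for task, w in weights.items()}

for task, pct in sorted(blueprint.items(), key=lambda kv: -kv[1]):
    print(f"{pct:5.1f}%  {task}")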

 

Job task analysis to test blueprints

 

The Four-Fifths Rule

The Four-Fifths Rule is a term that refers to a guideline for fairness in hiring practices in the USA.  Because tests are often used in making hiring decisions, the Four-Fifths Rule applies to them so it is an important aspect of assessment in the workforce, but it also applies to other selection methods, such as interviews or biodata.  It is important not only because violations could lead to legal entanglements, but because achieving a diverse and inclusive workforce is a goal for most organizations.

What is the Four-Fifths Rule?

The Four-Fifths Rule, also known as the 80% Rule, is a statistical guideline established by the Equal Employment Opportunity Commission (EEOC) in the United States, used to evaluate whether a selection process leads to adverse impact against any specific group. The rule comes into play when comparing the selection rates of different demographic groups within an organization, aiming to identify potential disparities. According to the EEOC, a selection rate for any group that is less than four-fifths (or 80%) of the rate for the group with the highest selection rate may indicate adverse impact.

This applies to any organization that is hiring in the United States, even if that organization is based overseas.  A great example of this is a 2023 lawsuit against a Chinese company that was hiring US employees with unfair practices.

The Four-Fifths Rule serves as a vital benchmark for organizations striving for diversity and inclusion. By highlighting disparities in selection rates, it helps employers identify and rectify potential discriminatory practices. This not only aligns with ethical considerations but also ensures compliance with anti-discrimination laws, fostering an environment that values equal opportunity for all.

four-fifths rule diversity in pre-employment testing

Calculation Method

First, determine the selection rate for each demographic group by dividing the number of individuals selected from that group by the total number of applicants from the same group. Next, compare the selection rates of different groups. If the selection rate for any group is less than 80% of the rate for the group with the highest selection rate, it triggers further investigation into potential discrimination.

Example:

Group A has 500 applicants and 100 were selected; a 20% selection rate

Group B has 120 applicants and 17 were selected; a 14.17% selection rate

The ratio is 0.1417/0.20 = 0.7083.  This is below 0.80, so the selection procedure is flagged for potential adverse impact against Group B.

Note that we are focusing on rates, not overall numbers.  Clearly, Group B has far fewer people selected, but what matters is the rates: 20% vs. 14.17% may not look too different, but they differ enough that this test would come under scrutiny.
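Here is a minimal Python sketch of the calculation, using the numbers from the example above; the group labels and the 0.80 threshold are the only inputs.

def four_fifths_check(selected_by_group, applicants_by_group, threshold=0.80):
    """Compare each group's selection rate to the highest group's rate (the 80% rule)."""
    rates = {g: selected_by_group[g] / applicants_by_group[g] for g in applicants_by_group}
    highest = max(rates.values())
    return {g: {"rate": round(r, 4),
                "ratio": round(r / highest, 4),
                "flag": (r / highest) < threshold}
            for g, r in rates.items()}

# Numbers from the example above
result = four_fifths_check(selected_by_group={"A": 100, "B": 17},
                           applicants_by_group={"A": 500, "B": 120})
print(result)
# Group B: rate 0.1417, ratio 0.7083 -> flagged for potential adverse impact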

Implementing the Four-Fifths Rule in Practice

To apply the Four-Fifths Rule effectively, organizations must adopt proactive measures. Regularly monitoring and analyzing selection rates for different demographic groups can help identify trends and address potential issues promptly. Furthermore, organizations should establish clear policies and procedures for hiring, ensuring that decision-makers are well-informed about the Four-Fifths Rule and its implications.

Note that this is only a guideline for flagging potential adverse impact.  It does not mean the selection method will be stricken.  Consider a physical fitness test for firefighters; it will most definitely produce lower results for people aged 60 and over, but physical fitness is unarguably a job requirement, so if the test has been validated it will most likely be upheld.

How does AI fit into this?

Artificial intelligence (AI) is governed by the Four-Fifths Rule just like any other selection approach.  Do you use AI to comb through a pile of resumes and flag those worthy of an interview?  That is a selection procedure, and if it were found to be biased against a subgroup, you would be liable.

Conclusion

In the pursuit of a fair and inclusive workplace, the Four-Fifths Rule is a valuable tool for organizations committed to diversity. Moreover, it is a legal guideline for any organization that hires in the United States.  It is legally required that your organization follows this guideline with respect to pre-employment assessments as well as any other selection procedure.

Note: ASC does not provide legal advice, this is only for educational purposes.

Classical Test Theory Item Statistics

Classical Test Theory (CTT) is a psychometric approach to analyzing, improving, scoring, and validating assessments.  It is based on relatively simple concepts, such as averages, proportions, and correlations.  One of the most frequently used aspects is item statistics, which provide insight into how an individual test question is performing.  Is it too easy, too hard, too confusing, miskeyed, or potentially another issue?  Item statistics are what tell you these things.

What are classical test theory item statistics?

They are indices of how a test item, or components of it, is performing.  Items can be hard vs easy, strong vs weak, and other important aspects.  Below is the output from the Iteman report in our FastTest online assessment platform, showing an English vocabulary item with real student data.  How do we interpret this?

FastTest Iteman Psychometric Analysis

Interpreting Classical Test Theory Item Statistics: Item Difficulty

The P value (Multiple Choice)

The P value is the classical test theory index of difficulty, and is the proportion of examinees that answered an item correctly (or in the keyed direction). It ranges from 0.0 to 1.0. A high value means that the item is easy, and a low value means that the item is difficult.  There are no hard and fast rules because interpretation can vary widely for different situations.  For example, a test given at the beginning of the school year would be expected to have low statistics since the students have not yet been taught the material.  On the other hand, a professional certification exam, where someone cannot even sit unless they have 3 years of experience and a relevant degree, might have all items appear easy even though they cover quite advanced topics!  Here are some general guidelines:

    0.95-1.0 = Too easy (not doing much good to differentiate examinees, which is really the purpose of assessment)

    0.60-0.95 = Typical

    0.40-0.60 = Hard

    <0.40 = Too hard (consider that a 4 option multiple choice has a 25% chance of pure guessing)

With Iteman, you can set bounds to automatically flag items.  The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, which means that 50% of the examinees have answered the item correctly.

For a test where we expect examinees to perform poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing; if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be set as low as 0.20.  The maximum P value bound represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees.

In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.
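
For readers who want to compute this themselves, here is a minimal sketch in Python; the scored 0/1 matrix and the flagging bounds are hypothetical and would of course be replaced by your own data and standards.

```python
import numpy as np

# Hypothetical scored data: rows = examinees, columns = items (1 = correct, 0 = incorrect).
scored = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
])

p_values = scored.mean(axis=0)  # proportion correct per item

# Illustrative flagging bounds; adjust for your own testing situation.
P_MIN, P_MAX = 0.40, 0.95

for i, p in enumerate(p_values, start=1):
    flag = "too hard" if p < P_MIN else "too easy" if p > P_MAX else "ok"
    print(f"Item {i}: P = {p:.2f} ({flag})")
```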

The Item Mean (Polytomous)

This refers to an item that is scored with more than two score levels, like an essay scored on a 0-4 point rubric or a Likert-type item such as "Rate on a scale of 1 to 5":

  • 1=Strongly Disagree
  • 2=Disagree
  • 3=Neutral
  • 4=Agree
  • 5=Strongly Agree

The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.

You will have to adjust for your own situation, but here is an example for the 5-point Likert-style item.

1-2 is very low; people disagree fairly strongly on average

2-3 is low to neutral; people tend to disagree on average

3-4 is neutral to high; people tend to agree on average

4-5 is very high; people agree fairly strongly on average

Iteman also provides flagging bounds for this statistic.  The minimum item mean bound represents what you consider the cut point for the item mean being too low.  The maximum item mean bound represents what you consider the cut point for the item mean being too high.

The number of categories for the items must be considered when setting the minimum/maximum bounds; otherwise, all items of a certain type (e.g., 3-category) might be flagged.
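
Here is a similar minimal sketch in Python for polytomous items, using a hypothetical matrix of 1-5 Likert responses and illustrative bounds.

```python
import numpy as np

# Hypothetical 1-5 Likert responses: rows = examinees, columns = items.
responses = np.array([
    [4, 2, 5],
    [5, 1, 4],
    [4, 2, 5],
    [3, 2, 4],
    [5, 1, 5],
])

item_means = responses.mean(axis=0)

# Illustrative bounds for a 1-5 scale; the midpoint of the scale is 3.0.
MEAN_MIN, MEAN_MAX = 2.0, 4.0

for i, m in enumerate(item_means, start=1):
    flag = "low" if m < MEAN_MIN else "high" if m > MEAN_MAX else "ok"
    print(f"Item {i}: mean = {m:.2f} ({flag})")
```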

Interpreting Classical Test Theory Item Statistics: Item Discrimination

Multiple-Choice Items

The Pearson point-biserial correlation (r-pbis) is a classical test theory measure of the discrimination, or differentiating strength, of the item. It ranges from −1.0 to 1.0 and is a correlation of item scores and total raw scores.  If you consider a scored data matrix (multiple-choice items converted to 0/1 data), this would be the correlation between the item column and a column that is the sum of all item columns for each row (a person’s score).

A good item is able to differentiate between examinees of high and low ability, and will therefore have a higher point-biserial, though rarely above 0.50. A negative point-biserial is indicative of a very poor item because it means that high-ability examinees are answering incorrectly while low-ability examinees are answering correctly; this would of course be bizarre, and it typically indicates that the specified correct answer is actually wrong. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random “noise.”  Here are some general guidelines on interpretation.  Note that these assume a decent sample size; if you only have a small number of examinees, many item statistics will be flagged!

0.20+ = Good item; smarter examinees tend to get the item correct

0.10-0.20 = OK item; but probably review it

0.0-0.10 = Marginal item quality; should probably be revised or replaced

<0.0 = Terrible item; replace it

A major red flag is when the correct answer has a negative Rpbis and a distractor has a positive Rpbis.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.
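
To make the computation concrete, here is a minimal sketch in Python; the response matrix and answer key are hypothetical, and the last loop shows an option-level (distractor) check related to the red flag mentioned above.

```python
import numpy as np

# Hypothetical multiple-choice responses (A-D) and answer key, for illustration only.
responses = np.array([
    ["A", "B", "C", "D"],
    ["A", "B", "D", "D"],
    ["A", "C", "C", "D"],
    ["B", "B", "C", "A"],
    ["A", "B", "A", "D"],
    ["C", "B", "C", "D"],
])
key = np.array(["A", "B", "C", "D"])

scored = (responses == key).astype(int)   # 0/1 scored matrix
total = scored.sum(axis=1)                # raw total score per examinee

def point_biserial(item_scores, totals):
    """Pearson correlation between a 0/1 item column and total scores."""
    return np.corrcoef(item_scores, totals)[0, 1]

for i in range(scored.shape[1]):
    rpbis = point_biserial(scored[:, i], total)
    print(f"Item {i + 1}: P = {scored[:, i].mean():.2f}, r-pbis = {rpbis:.2f}")

# Option-level (distractor) analysis for item 1: each option as a 0/1 indicator.
for option in ["A", "B", "C", "D"]:
    chose = (responses[:, 0] == option).astype(int)
    if chose.std() == 0:
        continue  # option never (or always) chosen; correlation undefined
    print(f"Item 1, option {option}: r-pbis = {np.corrcoef(chose, total)[0, 1]:.2f}")
```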

The biserial correlation is also a measure of the discrimination, or differentiating strength, of the item. It ranges from −1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item were a continuous measure of the trait. Since the biserial is an estimate of Pearson’s r, it will be larger in absolute magnitude than the corresponding point-biserial.

The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).
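
If you only have the point-biserial and the P value, a common approximation recovers the biserial under the normality assumption just described; here is a minimal sketch in Python with hypothetical input values.

```python
from math import sqrt
from statistics import NormalDist

def biserial_from_point_biserial(r_pbis: float, p: float) -> float:
    """Approximate biserial correlation from the point-biserial, assuming an
    underlying normal trait dichotomized at proportion-correct p."""
    q = 1.0 - p
    norm = NormalDist()
    z = norm.inv_cdf(p)   # z-point that splits the normal distribution at p
    y = norm.pdf(z)       # ordinate (density) of the standard normal at that point
    return r_pbis * sqrt(p * q) / y

# Hypothetical values for illustration: r-pbis = 0.30 on an item with P = 0.70.
print(round(biserial_from_point_biserial(0.30, 0.70), 3))   # ~0.40, larger than 0.30
```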

Polytomous Items

The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from −1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the r-pbis are equivalent for a 2-category item, so guidelines for interpretation remain unchanged.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Since the typical r correlation (around 0.5) will be larger than the typical r-pbis (around 0.3), you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the correlation be as high as possible.

The eta coefficient is an additional index of discrimination computed using an analysis of variance with the item response as the independent variable and total score as the dependent variable. The eta coefficient is the square root of the ratio of the between-groups sum of squares to the total sum of squares and has a range of 0 to 1. The eta coefficient does not assume that the item responses are continuous and also does not assume a linear relationship between the item response and total score.

As a result, the eta coefficient will always be equal to or greater than the absolute value of Pearson’s r. Note that the biserial correlation will be reported if the item has only 2 categories.
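
To illustrate the relationship between the two indices, here is a minimal sketch in Python that computes both Pearson’s r and the eta coefficient for a single hypothetical 0-4 polytomous item against total score.

```python
import numpy as np

# Hypothetical polytomous item (0-4 rubric) and total scores, for illustration only.
item = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 4])
total = np.array([5, 8, 7, 12, 10, 15, 14, 18, 20, 17])

# Pearson's r: linear relationship between item response and total score.
r = np.corrcoef(item, total)[0, 1]

# Eta: one-way ANOVA with the item response as the grouping variable
# and total score as the dependent variable.
grand_mean = total.mean()
ss_total = ((total - grand_mean) ** 2).sum()
ss_between = sum(
    len(total[item == c]) * (total[item == c].mean() - grand_mean) ** 2
    for c in np.unique(item)
)
eta = np.sqrt(ss_between / ss_total)  # square root of SS_between / SS_total

print(f"Pearson r = {r:.3f}, eta = {eta:.3f}")  # eta >= |r| always holds
```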