Posts on psychometrics: The Science of Assessment

Likert scale meme

Likert scales (items) are a type of item used in human psychoeducational assessment, primarily to assess noncognitive constructs.  That is, while item types like multiple choice or short answer are used to measure knowledge or ability, Likert scales are better suited to measuring things like anxiety, conscientiousness, or motivation.

In the realm of psychology, surveys, and market analysis, Likert scales stand tall as one of the most versatile and widely used tools. Whether you’re a researcher, a marketer, or simply someone interested in understanding human attitudes and opinions, grasping the essence of Likert scales can significantly enhance your understanding of data collection and analysis. In this guide, we’ll delve into what Likert scales are, why they’re indispensable, the types of items they’re suited for, and how to score them effectively.

What is a Likert Scale/Item?

A Likert scale, named after its creator Rensis Likert, is a psychometric scale used to gauge attitudes, opinions, perceptions, and behaviors. It typically consists of a series of statements or questions that respondents are asked to rate based on a specified scale. The scale often ranges from strongly disagree to strongly agree, with varying degrees of intensity or frequency in between. Likert scales are primarily used in survey research but have found applications in various fields, including psychology, sociology, marketing, and education.

We’ve all seen these in our past; they are the items that say something like “Rate on a scale of 1 to 5.”  Sometimes the numbers have descriptive text anchors, like you see below.  If these are behaviorally-based, they are called Behaviorally Anchored Rating Scales (BARS).

Likert scale item

You can consider the Likert Scale to be the notion of 1 to 5 or Strongly Disagree to Strongly Agree.  A Likert Item is an item on an assessment that uses a Likert Scale.  In many cases, the scale is reused over items; in the example above, we have two items that use the same scale.  However, the terms are often used interchangeably.

Why Use a Likert Scale?

The popularity of Likert scales stems from their simplicity, flexibility, and ability to capture nuanced responses. Here are several reasons why Likert scales are favored:

  • Ease of Administration: Likert items are easy to administer, making them suitable for both online and offline surveys.
  • Quantifiable Data: Likert scales generate quantitative data, allowing for statistical analysis and comparison across different groups or time points. Open response items, where an examinee might type in how they feel about something, are much harder to quantify.
  • Flexibility: They can accommodate a wide range of topics and attitudes, from simple preferences to complex opinions.
  • Standardization: Likert scales provide a standardized format for measuring attitudes, enhancing the reliability and validity of research findings.
  • Ease of Interpretation: Likert responses are straightforward to interpret, making them accessible to both researchers.  For example, in the first example above, if the average response is 4.1, we can say that respondents generally Agree with the statement.
  • Ease of understanding: Since these are so commonly used, everyone is familiar with the format and can respond quickly.

 

What Sort of Assessments Use a Likert Scale?

Likert scales are well-suited for measuring various constructs, including:

  • Attitudes: Assessing attitudes towards a particular issue, product, or service (e.g., “I believe climate change is a pressing issue”).
  • Opinions: Gauging opinions on controversial topics or current events (e.g., “I support the legalization of marijuana”).
  • Perceptions: Capturing perceptions of quality, satisfaction, or trust (e.g., “I am satisfied with the customer service provided”).
  • Behaviors: Examining self-reported behaviors or intentions (e.g., “I exercise regularly”).
  • Agreement or Frequency: Measuring agreement with statements or the frequency of certain behaviors (e.g., “I often recycle household waste”).

 

How Do You Score a Likert Item?

Scoring a Likert scale item involves assigning numerical values to respondents’ selected options. Typically, the scale is assigned values from 1 to 5 (or more), representing varying degrees of agreement, frequency, or intensity.  In the example above, the possible scores for each item are 1, 2, 3, 4, 5.  There are then two ways we can uXcalibre-poly-outputse this to obtain scores for examinees.

  • Classical test theory: Either sum or average. For the former, simply add up the
  • scores for all items within the scale for each respondent. If they respond as 4 to both items, their score is 8.  For the latter, we find their average answer.  If they answer a 3 and a 4, their score is 3.50.  Note that both of these are easily interpretable.
  • Item Response Theory: In large scale assessment, Likert scales are often analyzed and scored with polytomous IRT models such as the Rating Scale Model and Graded Response Model.  An example of this sort of analysis is shown here.

Other important considerations:

  • Reverse Coding: If necessary, reverse code items to ensure consistency (e.g., strongly disagree = 1, strongly agree = 5).  In the example above, we are clearly assessing Extraversion; the first item is normal scoring, while the second item is reverse-scored.  So actually, answering a 2 to the second question is really a 4 in the direction of Extraversion, and we would score it as such.
  • Collapsing Categories: Sometimes, if few respondents answer a 2, you might collapse 1 and 2 into a single category.  This is especially true if using IRT.  The image here also shows an example of that.
  • Norms: Because most traits measured by Likert scales are norm-referenced (see more on that here!), we often need to set norms.  In the simple example, what does a score of 8/10 mean?  The meaning is quite different if the average is 6 with a standard deviation of 1, than if the average is 8.  Because of this, scores might be reported as z-scores or T-scores.

 

Item Analysis of Likert items

We use statistical techniques to item quality, such as average score per item or item-total correlation (discrimination). You can also perform more advanced analyses like factor analysis or regression to uncover underlying patterns or relationships.  Here are some initial considerations.

  • Frequency of each response: How many examinees selected each?  This is the N column in the graph above.  The Prop column is the same thing but converted to proportion.
  • Mean score per response: This is evidence that the item is working well.  Did people who answered “1” score lower overall on the Extraversion score than people who scored 3?  This is definitely the case above.
  • Rpbis per response, or overall R: We want the item to correlate with total score.  This is strong evidence for the validity of the item.  In this example, the correlation is 0.620, which is great.
  • Item response theory:  We can evaluate threshold values and overall item discrimination, as well as issues like item fit.  This is extremely important, but beyond the scope of this post!

We also want to validate the overall test.  Scores and subscores can be evaluated with descriptive statistics, and for reliability with indices like coefficient alpha.

 

Summary

In conclusion, Likert scales are invaluable tools for capturing and quantifying human attitudes, opinions, and behaviors. Understanding their utility and nuances can empower researchers, marketers, and decision-makers to extract meaningful insights from their data, driving informed decisions and actions. So, whether you’re embarking on a research project, designing a customer satisfaction survey, or conducting employee assessments, remember to leverage Likert scales to efficiently assess the noncognitive traits and opinions.

test blueprints specifications

Test blueprints, aka test specifications (shortened to “test specs”), are the formalized design of an assessment, test, or exam.  This can be in the context of educational assessment, pre-employment, certification, licensure, or any other type.  Generally, the amount of effort and detail is commensurate with the stakes of the assessment; a 10 item quiz for 5th grade math is quite different than the licensure exam for surgeons!

Why do we need test blueprints?

Job-analysis-to-test-blueprintsThe blueprints are used for various purposes.  The most important is that they are part of the validity documentation.  Validity refers to the evidence we have (“evidence-centered design”) that a test’s scores mean what we want them to mean.  So if we want the scores to reflect knowledge of high school math curriculum for graduation, then the test specifications should align to the curriculum quite closely.  If we want the scores to reflect that a surgeon is qualified to do practice, we want the test specifications to reflect the knowledge and skills needed to practice.  A lot of work can go into designing the blueprints, such as job task analysis in certification and licensure.  The image here provides an example of how JTA data is converted into content blueprints.

The test blueprints/specifications are also important for directing efforts in test development.  At the simplest level, you want your item writers to create new items in areas where you need them.  If the blueprints only call for 1% of the test on a certain topic, you don’t want the item writers making a lot of new questions there.

The test blueprints are often published publicly in a simplified version to help external stakeholders.  For example, you want the surgeons to be able to study for their test, so you publish a list of content domains that is covered by the test, and the percentage of items from each.  A fantastic example of this is at NOCTI.  Another good example which covers multiple aspects of the list below is this one from New Mexico.

 

What are test blueprints?

The test blueprints, like the blueprints of a house or office building, define everything needed to build it.  There are multiple aspects to this, which can vary by type of exam.  It breaks down into two types of information: item distribution, and operational guidelines.

Item distribution

There are many ways that you can classify items on the test.  The content domain or topic that they cover is the most obvious here, such as defining a math test that is 40% Algebra, 30% Geometry, and 30% Calculus.  But there are other, more practical and operational, considerations as well.

 

Number of items

First, the blueprints should define the number of items, including a breakdown of scored vs. unscored (pilot) items.  Often, there is documented reasoning behind the choices for this, such as pretesting plans, or an estimate of reliability based on projected test length.

Content

This is the most important and most common.  Some test blueprints only cover this and the number of items.  It defines all the content covered by the test, and the percentage for each.  Sometimes, there are sub-domains and sub-sub-domains!  Here is an example of that, from the New Mexico link provided earlier.

New Mexico test blueprints

Item type

Many tests only have multiple choice items, so this is then unnecessary.  But there are tests, for example, that require 50 multiple choice items, 10 drag and drop, 10 fill-in-the-blank, and 2 essay.  Such designs need to be explained and codified in the the test blueprints.

Statistics

Some test blueprints define a distribution or target level of statistics.  For example, it might require 20% of the items to have classical difficulty statistics (P-values) of 0.40 to 0.60, 60% of the items with values 0.60 to 0.90, and 20% from 0.90 to 1.00.  Or, there might just be acceptable ranges, such as stating that all difficulty statistics should be 0.40 to 0.98.

Cognitive level or Bloom’s

Not all assessments tackle this consideration, but it is common in education.  The test blueprints might specify a certain number of items that are Recall vs. higher levels of cognitive complexity.  Note that this might overlap with Item Type.

Sections

The design of the test might be ordered into sections, which is documented closely.  Continuing the example above, there might be Section 1 that is the 50 multiple choice items, Section 2 is drag-and-drop plus fill-in-the-blank, and Section 3 is Essay.

 

Operational and practical considerations

This part of the blueprints covers aspects other than the nature of items.  There are many things that are useful, but here are a few examples.

  • Time limits – What is the overall time limit of the test?  Section time limits?
  • Navigation – Are examinees allowed to move back and forth between sections?  Between items?
  • Test design – If you are using modern designs like computerized adaptive testing or linear on the fly testing, you need to define these with a lot of detail.
  • Messaging – What instructions will you give?  Are there pop-up messages?
  • Access – How do you control access to the exam?  Are there eligibility requirements?  Published online vs. paper?  So many options.

 

Summary

As you can see, there are a ton of things to consider when publishing a test.  If the test is low-stakes, many of these are treated informally, such as a teacher handing out a 10 item quiz.  But for high-stakes assessment, the definition of formal test blueprints and specifications is absolutely essential.  Not only does it prepare the candidates and other stakeholders, but it makes things easier for the test developers, and provides substantial documentation for validity.  Moreover, if you work in an area where there are potential legal challenges, it provides a bulwark of legal defensibility.  If you work in high-stakes or high-volume assessment, you need to define your test blueprints.

test publishing quality control

Test publishing is the process of preparing computer-based assessments for delivery on an electronic platform. Test publishing is like a car rolling off the assembly line. It’s the culmination of a great deal of effort in developing the assessment. Just as a car undergoes extensive checks before leaving the factory, a computer-based assessment requires meticulous quality control procedures to make sure that it functions as intended. Errors may have significant consequences for the sponsoring organization, including a loss of reputation, and can even have legal implications, depending upon the type of error.

test publishing quality assurance

The test publishing quality control process begins prior to the  start of the publishing process. The key steps in the process are as follows:

 

Step 1: Determine the test publishing specifications

Quality control begins with the completion of a test specifications document. The test specifications document provides the pattern or the playbook for how the test should be published. It typically includes the following information:

  • Test design
    • Administration model (i.e., linear fixed form, LOFT, CAT)
    • Scoring strategy (dichotomous/polytomous item-level scoring, compensatory/conjunctive domain/sectional scoring)
    • Test length (number of items shown to each candidate)
    • Test duration (time allowed for exam/sections)
  • Content specifications
    • List of included items (and which are scored/unscored)
    • Mapping of items to domains/sections/subscales (if applicable)
    • Mapping of stimuli to items (if applicable)
    • Item keys
  • Ancillary delivery components
    • Non-disclosure agreement
    • Tutorial
    • Customized help screens
    • Calculator
  • Features and functionality
    • Navigation (e.g., review of previous items allowed)
    • Review screens
    • Electronic scratch pad
    • Item-level comments/feedback

Note that this is not a comprehensive list, and information needed for the test specifications documents may vary depending upon the type of assessment and the specific testing platform used for delivery. Some of the data on the test specifications are relatively static and will change only with changes to the test design. Other data, such as the list of included items, are dynamic and will typically change each time the assessment is republished.

The test specifications document becomes the authoritative source of truth used by the test publisher for how the assessment should be published. It is a key communication tool between the sponsoring organization and their test publishing vendor or partner. 

 

Step 2: Identify sources of test publishing errors

A comprehensive determination of everything that could possibly go wrong in the test publishing process should serve as the guide for the quality control checks that need to be performed before the test goes live. A tool that can assist in developing a comprehensive list of potential errors is a fishbone diagram, also known as an Ishikawa diagram or a cause-and-effect diagram.  It is a visual representation used to identify and organize possible causes of a specific problem or effect. The diagram takes the form of a fish skeleton (hence, its name), with the “head” representing the problem or effect, and the “bones” representing different categories of potential causes. Along each bone, smaller branches represent sub-causes, which are specific elements that may contribute to the problem.

Fishbone diagrams are created by having a team representing all disciplines involved in a process brainstorm potential problems, or in the case of test publishing, potential errors that can be introduced into the test publishing process. Determining potential categories of errors first and then brainstorming more specific errors enables a comprehensive analysis of the test publishing process and the potential occasions in which errors can be introduced in that process.

Here’s a sample fishbone diagram for test publishing errors: 

test publishing errors fishbone

 

Step 3: Review against source of truth

The examination should be reviewed against the source of truth for each potential error. The test specifications will be the key source of truth against which the published examination is compared to identify any errors. The item bank is the source of truth for item presentation and item metadata.

Error-free test publishing is central to preserving the test sponsor’s reputation. Even minor mistakes, such as misspelled words, can be damaging. More importantly, errors in a published examination can have deleterious effects for candidates. A scoring error might mean the difference between a candidate failing and passing, and in the case of a certification or licensure examination, that can have dire consequences for the candidate’s career and livelihood. 

As mathematician Nassem Nicholas Taleb stated, “Quality is the result of an intelligent effort, not a chance happening.” A rigorous quality control procedure aids in making the publishing process an intelligent effort.

student-testlet

Testlet is a term in educational assessment that refers to a set of test items or questions grouped together on a test, often with a common theme or scenario. This approach aims to provide a more comprehensive and nuanced assessment of an individual’s abilities compared to traditional testing methods.

What is a testlet?

As mentioned above, a testlet is a group of items delivered together.  There are two ways of doing this.

  1. Items that share a common stimulus or otherwise MUST be together.  An example of this is a reading passage with 4 questions about it.  You can’t have the passage and the 4 questions scattered about a 100 item test as 5 screens in random places!  It all has to be together to make sense.
  2. Items that do not have to be together, but it improves the purpose of the assessment.  In this case, you might have 10 items that are standalone (no reading passage or anything relating them), but your test might be multistage testing and all items are delivered in blocks of 10.  Test designers can tailor the difficulty level based on the test-taker’s performance. As a test-taker progresses through a testlet, the system dynamically adjusts the complexity of subsequent questions, ensuring a personalized and accurate assessment of their proficiency.

Example item - testlet

 

Why use testlets?

The answer is obvious in the first case: you have to.  But it does get deeper than that.

One key feature of testlets is their ability to mimic real-world scenarios. Unlike standalone questions, testlets present a series of interconnected problems or tasks that require the test-taker to apply their knowledge in a cohesive manner. This not only assesses their understanding of isolated concepts but also evaluates their ability to integrate information and solve complex problems.  Testlets are can be particularly effective in assessing critical thinking, problem-solving skills, and practical application of knowledge. By presenting questions in a contextually linked manner, testlets offer a more authentic representation of a person’s ability to handle real-world challenges.

Testlets promote efficiency in testing. With a focused set of questions, they save time and reduce the fatigue associated with extensive testing sessions. This makes them an attractive option for educators and testing organizations seeking to streamline assessment processes while maintaining accuracy.  That is, if you want 20 items on reading comprehension, you could have 20 reading passages each with 1 question, or 4 reading passages each with 5 questions.  The fatigue would be far less in the latter test!

The second case, of standalone items, is a bit more nuanced.  It often has to do with managing the blueprints of the test, making best use of the item bank, and other operational considerations.  For example, perhaps the test has a blueprint to have 50% algebra items, 30% geometry, and 20% trigonometry.  You might build packets of 10 items with 5, 3, 2 respectively, and use those packets.

 

How do you score testlets?

Testlets can be scored with traditional methods, or with a new technology that was developed for this unique situation.

First, you can score with classical test theory, which is the traditional method of number-correct or points.

Second, you can use item response theory.  However, if the items share a strong relation, this might violate the IRT assumption of local independence.

Third, testlet response theory (TRT; Wainer, Bradlow, & Wang, 2007) works to address some of the concerns with traditional IRT.

 

Summary

In conclusion, a testlet is a powerful and flexible tool in toolbox of assessment designers. Its ability to present interconnected questions, mimic real-world scenarios, and adapt to individual performance makes it a valuable asset in gauging a person’s knowledge and skills. As education and assessment methods continue to evolve, the role of testlets is likely to expand, contributing to more accurate and meaningful evaluations of individuals in various fields.

job-task-analysis

Job Task Analysis (JTA) is an essential step in designing a test to be used in the workforce, such as pre-employment or certification/licensure, by analyzing data on what is actually being done in the job.  Also known as Job Analysis or Role Delineation, job task analysis is important to design a test that is legally defensible and eligible for accreditation.  It usually involves a panel of subject matter experts to develop a survey, which you then deliver to professionals in your field to get quantitative data about what is most frequently done on the job and what is most critical/important.  This data can then be used for several important purposes.

Need help? Our experts can help you efficiently produce a job task analysis study for your certification, guide the process of item writing and standard setting, then publish and deliver the exam on our secure platform.

 

Reasons to do a Job Task Analysis

Job analysis is extremely important in the field of industrial/organizational psychology, hence the meme here from @iopsychmemes.  It’s not just limited to credentialing.

Job analysis I/O Psychology

Exam design

The most common reason is to get quantitative data that will help you design an exam.  By knowing what knowledge, skills, or abilities (KSAs), are most commonly used, you then know which deserve more questions on the test.  It can also help you with more complex design aspects, such as defining a practical exam with live patients.

Training curriculum

Similarly, that quantitative info can help design a curriculum and other training materials.  You will have data on what is most important or frequent.

Compensation analysis

You have a captive audience with the JTA survey.  Ask them other things that you want to know!  This is an excellent time to gather information about compensation.  I worked on a JTA in the past which asked about work location: clinic, hospital, private practice, or vendor/corporate.

Job descriptions

A good job analysis will help you write a job description for postings.  It will tell you the job responsibilities (common tasks), qualifications (required skills, abilities, and education), and other important aspects.  If you gather compensation data in the survey, that can be used to define the salary range of the open position.

Workforce planning

Important trends might become obvious when analyzing the data.  Are fewer people entering your profession, perhaps specific to a certain region or demographic?  Are they entering without certain skills?  Are there certain universities or training programs that are not performing well?  A JTA can help you discover such issues and then work with stakeholders to address them.  These are major potential problems for the profession.

IT IS MANDATORY

If you have a professional certification exam and want to get it accredited by a board such as NCCA or ANSI/ANAB/ISO, then you are REQUIRED to do some sort of job task analysis.

 

Why is a JTA so important for certification and licensure?  Validity.

The fundamental goal of psychometrics is validity, which is evidence that the interpretations we make from scores are actually true. In the case of certification and licensure exams, we are interpreting that someone who passes the test is qualified to work in that job role. So, the first thing we need to do is define exactly what is the job role, and to do it in a quantitative, scientific way. You can’t just have someone sit down in their basement and write up 17 bullet points as the exam blueprint.  That is a lawsuit waiting to happen.

There are other aspects that are essential as well, such as item writer training and standard setting studies.

 

The Methodology: Job Task Inventory

It’s not easy to develop a defensible certification exam, but the process of job task analysis (JTA) doesn’t require a Ph.D. in Psychometrics to understand. Here’s an overview of what to expect.

  1. Convene a panel of subject matter experts (SMEs), and provide a training on the JTA process.
  2. The SMEs then discuss the role of the certification in the profession, and establish high-level topics (domains) that the certification test should cover. Usually, there is 5-20. Sometimes there are subdomains, and occasionally sub-subdomains.
  3. The SME panel generates a list of job tasks that are assigned to domains; the list is reviewed for duplicates and other potential issues. These tasks have an action verb, a subject, and sometimes a qualifier. Examples: “Calibrate the lensometer,” “Take out the trash”, “Perform an equating study.”  There is a specific approach to help with the generation, called the critical incident technique.  With this, you ask the SMEs to describe a critical incident that happened on the job and what skills or knowledge led to success by the professional.  While this might not generate ideas for frequent yet simple tasks, it can help generate ideas for tasks that are rarer but very important.
  4. The final list is used to generate a survey, which is sent to a representative sample of professionals that actually work in the role
    The respondents take the survey, whereby they rate each task, usually on its importance and time spent (sometimes called criticality and frequency). Demographics are also gathered, which include age range, geographic region, work location (e.g., clinic vs hospital if medical), years of experience, educational level, and additional certifications.
  5. A psychometrician analyzes the results and creates a formal report, which is essential for validity documentation.  This report is sometimes considered confidential, sometimes published on the organization’s website for the benefit of the profession, and sometimes published in an abbreviated form.  It’s up to you.  For example, this site presents the final results, but then asks you to submit your email address for the full report.

 

Using JTA results to create test blueprints

Many corporations do a job analysis purely for in-house purposes, such as job descriptions and compensation.  This becomes important for large corporations where you might have thousands of people in the same job; it needs to be well-defined, with good training and appropriate compensation.

If you work for a credentialing organization (typically a non-profit, but sometimes the Training arm of a corporation… for example, Amazon Web Services has a division dedicated to certification exams), you will need to analyze the results of the JTA to develop exam blueprints.  We will discuss this process in more detail with another blog post.  But below is an example of how this will look, and here is a free spreadsheet to perform the calculations: Job Task Analysis to Test Blueprints.

 

Job Task Analysis Example

Suppose you are an expert widgetmaker in charge of the widgetmaker certification exam.  You hire a psychometrician to guide the organization through the test development process.  The psychometrician would start by holding a webinar or in-person meeting for a panel of SMEs to define the role and generate a list of tasks.  The group comes up with a list of 20 tasks, sorted into 4 content domains.  These are listed in a survey to current widgetmakers, who rate them on importance and frequency.  The psychometrician analyzes the data and presents a table like you see below.

We can see here that Task 14 is the most frequent, while Task 2 is the least frequent.  Task 7 is the most important while Task 17 is the least.  When you combine Importance and Frequency either by adding or multiplying, you get the weights on the right-hand columns.  If we sum these and divide by the total, we get the suggested blueprints in the green cells.

 

Job task analysis to test blueprints

 

job analysis

The Four-Fifths Rule is a term that refers to a guideline for fairness in hiring practices in the USA.  Because tests are often used in making hiring decisions, the Four-Fifths Rule applies to them so it is an important aspect of assessment in the workforce, but it also applies to other selection methods, such as interviews or biodata.  It is important not only because violations could lead to legal entanglements, but because achieving a diverse and inclusive workforce is a goal for most organizations.

What is the Four-Fifths Rule?

The Four-Fifths Rule, also known as the 80% Rule, is a statistical guideline established by the Equal Employment Opportunity Commission (EEOC) in the United States, used to evaluate whether a selection process leads to adverse impact against any specific group. The rule comes into play when comparing the selection rates of different demographic groups within an organization, aiming to identify potential disparities. According to the EEOC, a selection rate for any group that is less than four-fifths (or 80%) of the rate for the group with the highest selection rate may indicate adverse impact.

This applies to any organization that is hiring in the United States, even if that organization is based overseas.  A great example of this is a 2023 lawsuit against a Chinese company that was hiring US employees with unfair practices.

The Four-Fifths Rule serves as a vital benchmark for organizations striving for diversity and inclusion. By highlighting disparities in selection rates, it helps employers identify and rectify potential discriminatory practices. This not only aligns with ethical considerations but also ensures compliance with anti-discrimination laws, fostering an environment that values equal opportunity for all.

four-fifths rule diversity in pre-employment testing

Calculation Method

First, determine the selection rate for each demographic group by dividing the number of individuals selected from that group by the total number of applicants from the same group. Next, compare the selection rates of different groups. If the selection rate for any group is less than 80% of the rate for the group with the highest selection rate, it triggers further investigation into potential discrimination.

Example:

Group A has 500 applicants and 100 were selected; a 20% selection rate

Group B has 120 applicants and 17 were selected; a 14.17% selection rate

The ratio is 0.1417/0.20 = 0.7083.  This is below 0.80, so the procedure is biased against Group B.

Note that we are focusing on rates and not overall numbers.  Clearly, Group B has far fewer selected, but the rates are not too different at 20% and 14.17% – but different enough that this test would be under scrutiny.

Implementing the Four-Fifths Rule in Practice

To implement protections against the Four-Fifths Rule effectively, organizations must adopt proactive measures. Regularly monitoring and analyzing selection rates for different demographic groups can help identify trends and address potential issues promptly. Furthermore, organizations should establish clear policies and procedures for hiring, ensuring that decision-makers are well-informed about the Four-Fifths Rule and its implications.

Note that this is only a guideline for flagging potential adverse impact.  It does not mean the selection method will be stricken.  Consider a physical fitness test for firefighters; it most definitely produce lower results for people aged 60 and over, but physical fitness is unarguably a job requirement, so if the test has been validated it will most likely be upheld.

How does AI fit into this?

Artificial intelligence (AI) is governed by the Four-Fifths rule as any other selection approach.  Do you use AI to comb through a pile of resumes, and flag those worthy of an interview?  This is then a selection procedure, and if it were to be found that it was biased against a subgroup, you would be liable.

Conclusion

In the pursuit of a fair and inclusive workplace, the Four-Fifths Rule is a valuable tool for organizations committed to diversity. Moreover, it is a legal guideline for any organization that hires in the United States.  It is legally required that your organization follow this guideline with respect to pre-employment assessments as well as any other selection procedure.

Note: ASC does not provide legal advice, this is only for educational purposes.

Iteman45-quantile-plot

Classical Test Theory (CTT) is a psychometric approach to analyzing, improving, scoring, and validating assessments.  It is based on relatively simple concepts, such as averages, proportions, and correlations.  One of the most frequently used aspects is item statistics, which provide insight into how an individual test question is performing.  Is it too easy, too hard, too confusing, miskeyed, or potentially another issue?  Item statistics are what tell you these things.

What are classical test theory item statistics?

They are indices of how a test item, or components of it, is performing.  Items can be hard vs easy, strong vs weak, and other important aspects.  Below is the output from the Iteman report in our FastTest online assessment platform, showing an English vocabulary item with real student data.  How do we interpret this?

FastTest Iteman Psychometric Analysis

Interpreting Classical Test Theory Item Statistics: Item Difficulty

The P value (Multiple Choice)

The P value is the classical test theory index of difficulty, and is the proportion of examinees that answered an item correctly (or in the keyed direction). It ranges from 0.0 to 1.0. A high value means that the item is easy, and a low value means that the item is difficult.  There are no hard and fast rules because interpretation can vary widely for different situations.  For example, a test given at the beginning of the school year would be expected to have low statistics since the students have not yet been taught the material.  On the other hand, a professional certification exam, where someone can not even sit unless they have 3 years of experience and a relevant degree, might have all items appear easy even though they are quite advanced topics!  Here are some general guidelines”

    0.95-1.0 = Too easy (not doing much good to differentiate examinees, which is really the purpose of assessment)

    0.60-0.95 = Typical

    0.40-0.60 = Hard

    <0.40 = Too hard (consider that a 4 option multiple choice has a 25% chance of pure guessing)

With Iteman, you can set bounds to automatically flag items.  The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, which means that 50% of the examinees have answered the item correctly.

For a test where we expect examinees to perform poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing; if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be 0.20.  The maximum P value represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees.

In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.

The Item Mean (Polytomous)

This refers to an item that is scored with 2 or more point levels, like an essay scored on a 0-4 point rubric or a Likert-type item that is “Rate on a scale of 1 to 5.”

  • 1=Strongly Disagree
  • 2=Disagree
  • 3=Neutral
  • 4=Agree
  • 5=Strongly Agree

The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.

You will have to adjust for your own situation, but here is an example for the 5-point Likert-style item.

1-2 is very low; people disagree fairly strongly on average

2-3 is low to neutral; people tend to disagree on average

3-4 is neutral to high; people tend to agree on average

4-5 is very high; people agree fairly strongly on average

Iteman also provides flagging bounds for this statistic.  The minimum item mean bound represents what you consider the cut point for the item mean being too low.  The maximum item mean bound represents what you consider the cut point for the item mean being too high.

The number of categories for the items must be considered when setting the bounds of the minimum/maximum values. This is important as all items of a certain type (e.g., 3-category) might be flagged.

Interpreting Classical Test Theory Item Statistics: Item Discrimination

Multiple-Choice Items

The Pearson point-biserial correlation (r-pbis) is a classical test theory measure of the discrimination or differentiating strength, of the item. It ranges from −1.0 to 1.0 and is a correlation of item scores and total raw scores.  If you consider a scored data matrix (multiple-choice items converted to 0/1 data), this would be the correlation between the item column and a column that is the sum of all item columns for each row (a person’s score).

A good item is able to differentiate between examinees of high and low ability yet have a higher point-biserial, but rarely above 0.50. A negative point-biserial is indicative of a very poor item because it means that the high-ability examinees are answering incorrectly, while the low examinees are answering it correctly, which of course would be bizarre, and therefore typically indicates that the specified correct answer is actually wrong. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random “noise.”  Here are some general guidelines on interpretation.  Note that these assume a decent sample size; if you only have a small number of examinees, many item statistics will be flagged!

0.20+ = Good item; smarter examinees tend to get the item correct

0.10-0.20 = OK item; but probably review it

0.0-0.10 = Marginal item quality; should probably be revised or replaced

<0.0 = Terrible item; replace it

***Major red flag is if the correct answer has a negative Rpbis and a distractor has a positive Rpbis

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.

The biserial correlation is also a measure of the discrimination or differentiating strength, of the item. It ranges from −1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item was a continuous measure of the trait. Since the biserial is an estimate of Pearson’s r it will be larger in absolute magnitude than the corresponding point-biserial.

The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).

Polytomous Items

The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from −1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the r-pbis are equivalent for a 2-category item, so guidelines for interpretation remain unchanged.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Since the typical r correlation (0.5) will be larger than the typical rpbis (0.3) correlation, you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.

The eta coefficient is an additional index of discrimination computed using an analysis of variance with the item response as the independent variable and total score as the dependent variable. The eta coefficient is the ratio of the between-groups sum of squares to the total sum of squares and has a range of 0 to 1. The eta coefficient does not assume that the item responses are continuous and also does not assume a linear relationship between the item response and total score.

As a result, the eta coefficient will always be equal or greater than Pearson’s r. Note that the biserial correlation will be reported if the item has only 2 categories.

students in school

Content validity is an aspect of validity, a term that psychometricians use to refer to evidence that interpretations of test scores are supported.  For example, predictive validity provides evidence that a pre-employment test will predict job performance, tenure, and other important criteria.  Content validity, on the other hand, focuses on evidence that the content of the test covers what it should cover.

What is Content Validity?

Content validity refers to the extent to which a measurement instrument (e.g., a test, questionnaire, or survey) accurately and adequately measures the specific content or construct it is designed to assess. In simpler terms, it assesses whether the questions or items included in an assessment are relevant and representative of the subject matter or concept under investigation.

Example 1: You are working on a benchmark test for 5th grade mathematics in the USA.  You would likely want to ensure that all items align to the Common Core State Standards for the 5th grade mathematics curriculum.

Example 2: You are working on a certification exam for widgetmakers.  You should make sure that all items align to the publicly posted blueprint for this certification.  That, in turn, was not defined in willy-nilly – it should have been built on the results of a formal job task analysis study.

The Importance of Content Validity

Drives Accurate Measurement: Content validity helps in ensuring that the assessment tool is measuring what it’s intended to measure. This is critical for drawing meaningful conclusions and making informed decisions based on the results.content validity

Enhances Credibility: When your assessment has high content validity, it enhances the credibility and trustworthiness of your findings. It demonstrates that you’ve taken the time to design a valid instrument. This is often referred to as face validity – which is not a “real” type of validity that psychometricians consider, but refers to if someone off the street looks at the test and says “yeah, that looks like all the items are on widgetmaking.”

Reduces Bias: Using assessment items that are not content-valid can introduce bias and inaccuracies into your results. By maintaining content validity, you reduce the risk of skewed or unreliable data.

Improves Decision-Making: Organizations often rely on assessments to make important decisions, such as hiring employees, designing educational curricula, or evaluating the effectiveness of marketing campaigns. Content-valid assessments provide a solid foundation for making these decisions.

Legal Defensibility: In general, if you deliver a test to select employees, you need to show either content validity (e.g., test on Microsoft Excel for bookkeepers) or predictive validity (conscientiousness is a personality trait but probably related to success as a bookkeeper).  A similar notion applies to other types of tests.

How to Assess Content Validity

There are various methods to assess content validity, such as expert reviews, pilot testing, and statistical techniques. One common method is to gather a panel of experts in the subject matter and have them review the assessment items to ensure that they align with the content domain.  Of course, if all the items are written directly to the blueprints in the first place, and reviewed before they even become part of the pool of active items, a post-hoc review like that is not necessary.

There has been more recent research on the application of machine learning to evaluate content, including the add-on option to look for enemy items by evaluating the distance between the content of any given pair of items.

If the test is multidimensional, a statistical approach known as factor analysis can help, to see if the items actually load on the dimensions they should.

Conclusion

In summary, content validity is an essential aspect of assessment design that ensures the questions or items used in an assessment are appropriate, relevant, and representative of the construct being measured. It plays a significant role in enhancing the accuracy, credibility, and overall quality of your assessments. Whether you’re a student preparing for an exam, a researcher developing a survey, or a business professional creating a customer feedback form, understanding and prioritizing content validity will help you achieve more reliable and meaningful results. So, next time you’re tasked with creating or using an assessment tool, remember the importance of content validity and its impact on the quality of your data and decision-making processes.

However, it is not the only aspect of validity.  The documentation of validity is a complex process that is often ongoing.  You will also need data on statistical performance of the test (e.g., alpha reliability), evaluation bias (e.g., differential item functioning), possibly predictive validity, and more.  Therefore, it’s important to work with a psychometrician that can help you understand what is involved and ensure that the test meets both international standards and the reason that you are building the test in the first place!

making-predictions-and-decisions-based-on-test-scores

Predictive Validity is a type of test score validity which evaluates how well a test predicts something in the future, usually with a goal of making more effective decisions about people.  For instance, it is often used in the world of pre-employment testing, where we want a test to predict things like job performance or tenure, so that a company can hire people that do a good job and stay a long time – a very good result for the company, and worth the investment.

Validity, in a general sense, is evidence that we have to support intended interpretations of test scores.  There are different types of evidence that we can gather to do so.  Predictive validity refers to evidence that the test predicts things that it should predict.  If we have quantitative data to support such conclusions, it makes the test more defensible and can improve the efficiency of its use.  For example, if a university admissions test does a great job of predicting success at university, then universities will want to use it to select students that are more likely to succeed.

Examples of Predictive Validity

Predictive validity evidence can be gathered for a variety of assessment types.

  1. Pre-employment: Since the entire purpose of a pre-employment test is to positively predict good things like job performance or negatively predict bad things like employee theft or short tenure, a ton of effort goes into developing tests to function in this way, and then documenting that they do.
  2. University Admissions: Like pre-employment testing, the entire purpose of university admissions exams is predictive.  They should positively correlate with good things (first year GPA, four year graduation rate) and negatively predict the negative outcomes like academic probation or dropping out.
  3. Prep Exams: Preparatory or practice tests are designed to predict performance on their target test.  For example, if a prep test is designed to mimic the Scholastic Aptitude Test (SAT), then one way to validate it is to gather the SAT scores later, after the examinees take it, and correlate with the prep test.
  4. Certification & Licensure: The primary purpose of credentialing exams is not to predict job performance, but to ensure that the candidate has mastered the material necessary to practice their profession.  Therefore, predictive validity is not important, compared to content-related validity such as blueprints based on a job analysis. However, some credentialing organizations do research on the “value of certification” linking it to improved job performance, reduced clinical errors, and often external third variables such as greater salary.
  5. Medical/Psychological: There are some assessments that are used in a clinical situation, and the predictive validity is necessary in that sense.  For instance, there might be an assessment of knee pain used during initial treatment (physical therapy, injections) that can be predictively correlated with later surgery.  The same assessment might then be used after the surgery to track rehabilitation.

Predictive Validity in Pre-employment Testing

The case of pre-employment testing is perhaps the most common use of this type of validity evidence.  A new study (Sacket, Zhang, Berry, & Lievens, 2022) was recently released that was a meta-analysis of the various types of pre-employment tests and other selection procedures (e.g., structured interview), comparing their predictive validity power.  This was a modern update to the classic article by Schmidt & Hunter (1998).  While in the past the consensus has been that cognitive ability tests provide the best predictive power in the widest range of situations, the new article suggests otherwise.  It recommends the use of structured interview and job knowledge tests, which are more targeted towards the role in question, and therefore not surprising that they are well-performing.  This in turn suggests that you should not buy pre-fab ability tests and use them in a shotgun approach with the assumption of validity generalization, but instead leverage an online testing platform like FastTest that allows you to build high-quality exams that are more specific to your organization.

Why do we need predictive validity?

There are a number of reasons that you might need predictive validity for an exam.  They are almost always regarding the case where the test is used to make important decisions about people.

  1. Smarter decision-making: Predictive validity provides valuable insights for decision-makers. It helps recruiters identify the most suitable candidates, educators tailor their teaching methods to enhance student learning, and universities to admit the best students.
  2. Legal defensibility: If a test is being used for pre-employment purposes, it is legally required in the USA to either show that the test is obviously job-related (e.g., knowledge of Excel for a bookkeeping job) or that you have hard data demonstrating predictive validity.  Otherwise, you are open for a lawsuit.
  3. Financial benefits: Often, the reason for needing improved decisions is very financial.  It is often costly for large companies to recruit and train personnel.  It’s entirely possible that spending $100,000 per year on pre-employment tests could save millions of dollars in the long run.
  4. Benefits to the examinee: Sometimes, there is directly a benefit to the examinee.  This is often the case with medical assessments.

How to implement predictive validity

The simplest case is that of regression and correlation.  How well does the test score correlate with the criterion variable?  Below is a oversimplified example, of predicting university GPA from scores on an admissions test.  Here, the correlation is 0.858 and the regression is GPA = 0.34*SCORE + 0.533.  Of course, in real life, you would not see this strong of a predictive power, as there are many other factors which influence GPA.

Predictive validity

Advanced Issues

It is usually not a simple situation of two straightforward variables, such as one test and one criterion variable.  Often, there are multiple predictor variables (quantitative reasoning test, MS Excel knowledge test, interview, rating of the candidate’s resume), and moreover there are often multiple criterion variables (job performance ratings, job tenure, counterproductive work behavior).  When you use multiple predictors and a second or third predictor adds some bit of predictive power over that of the first variable, this is known as incremental validity.

You can also implement more complex machine learning models, such as neural networks or support vector machines, if they fit and you have sufficient sample size.

When performing such validation, you need to also be aware of bias.  There can be test bias where the test being used as a predictor is biased against a subgroup.  There can also be predictive bias where two subgroups have the same performance on the test, but one is overpredicted for the criterion and the other is underpredicted.  A rule of thumb for investigating this in the USA is the four-fifths rule.

Summary

Predictive validity is one type of test score validity, referring to evidence that scores from a certain test can predict their intended target variables.  The most common application of it is to pre-employment testing, but it is useful in other situations as well.  But validity is an extremely important and wide-ranging topic, so it is not the only type of validity evidence that you should gather.

shocked-girl-all-psychometric-models-are-wrong

The British statistician George Box is credited with the quote, “All models are wrong but some are useful.”  As psychometricians, it is important that we never forget this perspective.  We cannot be so haughty as to think that our psychometric models actually represent the true underlying phenomena and any data that does not fit nicely is just noise.  We need to remember that everything we do is an approximation, and respect the balance between parsimony and parameterization.

Really… all psychometric models are wrong?

Yeah, there is no TRUE model that perfectly describes the interaction between an examinee and a test item.  Obviously the probability of a correct response is primarily due to important factors such as examinee ability, item difficulty, item quality, the presence of guessing, and the scoring function of the item.  There are also additional factors, such as student motivation, timing factors, lighting in the room, screen size, whether they broke up with their girlfriend/boyfriend the previous day, whether their mom made their favorite breakfast that morning… you get the picture.  Attempting to model all those factors is certainly overparameterization.

Wikipedia as has a lengthier quote on that aspect:

Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.

Most, if not all psychometricians, would agree that my earlier description of overparameterization is valid.  The controversy in the field of Psychometrics is which of those “important factors” I mentioned qualify as overparameterization.  The Rasch model famously boils down the interaction to a single item parameter (difficulty) and a single person parameter (ability).  Many psychometricians consider this to be underparameterization since, for example, items are known widely differ in their quality (discrimination).  The Rasch cohort would consider the 2 and 3 parameter item response theory (IRT) models to be overparameterization, especially since they necessitated the development of new parameter estimation algorithms in the 1970s.  There are some practitioners in each camp who would claim that the other is the “mark of mediocrity.”

IRT continues to add more and more parameters, such as multidimensionality, response time, and upper asymptote.  For the most part, these are only academic curiosities, existing only to publish papers on new research, even though most assessments in the world still struggle to apply the Rasch model from 1960.

On the other end of the spectrum is classical test theory, which is based on simple mathematics like averages, proportions, and correlations.  This greatly underparameterizes what is actually going on.  The point-biserial coefficient, for example, assumes that the relation of ability to getting an item correct is linear, which is blatantly false since the probability cannot go above 1.0 or below 0.0.

Sooo… How do I select a psychometric model?

Well, try to be cognizant of that tradeoff, which is one of several tradeoffs when selecting an IRT model.  There is no right answer all the time, it is more a matter of whether your data fits a model and whether it satisfies your requirements for a particular situation.  That is, whether it is truly useful, which is Box’s original point. But don’t forget that all the models are wrong!