Posts on psychometrics: The Science of Assessment

Likert scales are a type of item used in human psychoeducational assessment, primarily to assess noncognitive constructs.  That is, while item types like multiple choice or short answer are used to measure knowledge or ability, Likert scales are better suited to measuring things like anxiety, conscientiousness, or motivation.

In the realm of psychological research, surveys, and market analysis, Likert scales stand tall as one of the most versatile and widely used tools. Whether you’re a researcher, a marketer, or simply someone interested in understanding human attitudes and opinions, grasping the essence of Likert scales can significantly enhance your understanding of data collection and analysis. In this guide, we’ll delve into what Likert scales are, why they’re indispensable, the types of items they’re suited for, and how to score them effectively.

What is a Likert Scale?

A Likert scale, named after its creator Rensis Likert, is a psychometric scale used to gauge attitudes, opinions, perceptions, and behaviors. It typically consists of a series of statements or questions that respondents are asked to rate based on a specified scale. The scale often ranges from strongly disagree to strongly agree, with varying degrees of intensity or frequency in between. Likert scales are primarily used in survey research but have found applications in various fields, including psychology, sociology, marketing, and education.

 

What does a Likert item look like?

We’ve all seen these in our past; they are the items that say something like “Rate on a scale of 1 to 5.”  Sometimes the numbers have descriptive text anchors, like you see below.  If these are behaviorally-based, they are called Behaviorally Anchored Rating Scales (BARS).

Likert scale item

Why Use a Likert Scale?

The popularity of Likert scales stems from their simplicity, flexibility, and ability to capture nuanced responses. Here are several reasons why Likert scales are favored:

  • Ease of Administration: Likert items are easy to administer, making them suitable for both online and offline surveys.
  • Quantifiable Data: Likert scales generate quantitative data, allowing for statistical analysis and comparison across different groups or time points. Open response items, where an examinee might type in how they feel about something, are much harder to quantify.
  • Flexibility: They can accommodate a wide range of topics and attitudes, from simple preferences to complex opinions.
  • Standardization: Likert scales provide a standardized format for measuring attitudes, enhancing the reliability and validity of research findings.
  • Ease of Interpretation: Likert responses are straightforward to interpret, making them accessible to researchers and non-researchers alike.  For instance, in the first example above, if the average response is 4.1, we can say that respondents generally Agree with the statement.
  • Ease of Understanding: Since these are so commonly used, everyone is familiar with the format and can respond quickly.

 

What Sort of Items Use a Likert Scale?

Likert scales are well-suited for measuring various constructs, including:

  • Attitudes: Assessing attitudes towards a particular issue, product, or service (e.g., “I believe climate change is a pressing issue”).
  • Opinions: Gauging opinions on controversial topics or current events (e.g., “I support the legalization of marijuana”).
  • Perceptions: Capturing perceptions of quality, satisfaction, or trust (e.g., “I am satisfied with the customer service provided”).
  • Behaviors: Examining self-reported behaviors or intentions (e.g., “I exercise regularly”).
  • Agreement or Frequency: Measuring agreement with statements or the frequency of certain behaviors (e.g., “I often recycle household waste”).

 

How Do You Score a Likert Item?

Scoring a Likert scale item involves assigning numerical values to respondents’ selected options. Typically, the scale is assigned values from 1 to 5 (or more), representing varying degrees of agreement, frequency, or intensity.  In the example above, the possible scores for each item are 1, 2, 3, 4, 5.  There are then two ways we can use this to obtain scores for examinees.

  • Classical test theory: Either sum or average. For the former, simply add up the scores for all items within the scale for each respondent. If they respond as 4 to both items, their score is 8.  For the latter, we find their average answer.  If they answer a 3 and a 4, their score is 3.50.  Note that both of these are easily interpretable.
  • Item Response Theory: In large scale assessment, Likert scales are often analyzed and scored with polytomous IRT models such as the Rating Scale Model and Graded Response Model.  An example of this sort of analysis is shown here.
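The classical sum and average scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not a production scoring routine; the function name `score_likert` is hypothetical, and it assumes every response is already coded as an integer (for example, 1 to 5).

```python
def score_likert(responses):
    """Return (sum score, mean score) for one respondent's item responses."""
    total = sum(responses)
    return total, total / len(responses)


# The two cases from the text: answering 4 to both items, then a 3 and a 4.
print(score_likert([4, 4]))  # (8, 4.0)
print(score_likert([3, 4]))  # (7, 3.5)
```

The sum depends on how many items are in the scale, while the average stays on the original response metric, which is why the average is often easier to interpret across scales of different lengths.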

 

Other important considerations:

  • Reverse Coding: If necessary, reverse code items to ensure consistency (e.g., strongly disagree = 1, strongly agree = 5).  In the example above, we are clearly assessing Extraversion; the first item is normal scoring, while the second item is reverse-scored.  So actually, answering a 2 to the second question is really a 4 in the direction of Extraversion, and we would score it as such.
  • Collapsing Categories: Sometimes, if few respondents answer a 2, you might collapse 1 and 2 into a single category.  This is especially true if using IRT.  The image here also shows an example of that.
  • Norms: Because most traits measured by Likert scales are norm-referenced (see more on that here!), we often need to set norms.  In the simple example, what does a score of 8/10 mean?  The meaning is quite different if the average is 6 with a standard deviation of 1, than if the average is 8.  Because of this, scores might be reported as z-scores or T-scores.
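Reverse coding and norm-referenced reporting are both simple arithmetic, so a short Python sketch may help. The function names here are hypothetical; the sketch assumes a 1 to 5 response scale and the usual T-score convention of a mean of 50 and a standard deviation of 10.

```python
def reverse_code(response, low=1, high=5):
    """Flip a response on a low..high scale (on a 1-5 scale, 2 becomes 4)."""
    return low + high - response


def z_score(raw, mean, sd):
    """Express a raw score in standard deviations from the norm-group mean."""
    return (raw - mean) / sd


def t_score(raw, mean, sd):
    """Rescale a z-score to the T metric (mean 50, SD 10)."""
    return 50 + 10 * z_score(raw, mean, sd)


print(reverse_code(2))         # 4
print(z_score(8, mean=6, sd=1))  # 2.0
print(t_score(8, mean=6, sd=1))  # 70.0
```

With a norm-group mean of 6 and a standard deviation of 1, the raw score of 8 sits two standard deviations above average; if the mean were 8, the same raw score would be exactly average (z = 0, T = 50).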

 

Item Analysis of Likert items

We use statistical techniques to evaluate item quality, such as average score per item or item-total correlation (discrimination). You can also perform more advanced analyses like factor analysis or regression to uncover underlying patterns or relationships.  Here are some initial considerations.

  • Frequency of each response: How many examinees selected each?  This is the N column in the graph above.  The Prop column is the same thing but converted to proportion.
  • Mean score per response: This is evidence that the item is working well.  Did people who answered “1” score lower overall on the Extraversion score than people who answered “3”?  This is definitely the case above.
  • Rpbis per response, or overall R: We want the item to correlate with total score.  This is strong evidence for the validity of the item.  In this example, the correlation is 0.620, which is great.
  • Item response theory:  We can evaluate threshold values and overall item discrimination, as well as issues like item fit.  This is extremely important, but beyond the scope of this post!

We also want to validate the overall test.  Scores and subscores can be evaluated with descriptive statistics, and for reliability with indices like coefficient alpha.
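The classical statistics mentioned above can be computed with the Python standard library. The sketch below is illustrative only: the function names and the small data set are made up, and it uses the corrected item-total correlation (the item against the sum of the other items), which is one common choice rather than the only one.

```python
from collections import Counter
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation; assumes neither variable is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def item_analysis(data, item):
    """data: one list of item scores per respondent; item: column index."""
    item_scores = [row[item] for row in data]
    # The rest score leaves out the item itself, so the item is not
    # correlated with a total that already contains it.
    rest_scores = [sum(row) - row[item] for row in data]
    freq = dict(sorted(Counter(item_scores).items()))
    return {
        "N": freq,
        "Prop": {resp: n / len(data) for resp, n in freq.items()},
        "mean": mean(item_scores),
        "r_item_rest": pearson(item_scores, rest_scores),
    }


def coefficient_alpha(data):
    """Coefficient (Cronbach's) alpha for a respondents-by-items table."""
    k = len(data[0])
    item_vars = [pvariance([row[i] for row in data]) for i in range(k)]
    total_var = pvariance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 respondents, 3 items on a 1-5 scale.
data = [[1, 2, 1], [2, 2, 3], [3, 3, 3], [4, 5, 4], [5, 4, 5]]
print(item_analysis(data, item=0))
print(round(coefficient_alpha(data), 3))
```

Real analyses would use far more respondents and, typically, dedicated psychometric software; the point here is only to show what each statistic is made of.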

 

Summary

In conclusion, Likert scales are invaluable tools for capturing and quantifying human attitudes, opinions, and behaviors. Understanding their utility and nuances can empower researchers, marketers, and decision-makers to extract meaningful insights from their data, driving informed decisions and actions. So, whether you’re embarking on a research project, designing a customer satisfaction survey, or conducting employee assessments, remember to leverage Likert scales to efficiently assess noncognitive traits and opinions.

Test blueprints, aka test specifications (shortened to “test specs”), are the formalized design of an assessment, test, or exam.  This can be in the context of educational assessment, pre-employment, certification, licensure, or any other type.  Generally, the amount of effort and detail is commensurate with the stakes of the assessment; a 10 item quiz for 5th grade math is quite different than the licensure exam for surgeons!

Why do we need test blueprints?

The blueprints are used for various purposes.  The most important is that they are part of the validity documentation.  Validity refers to the evidence we have (“evidence-centered design”) that a test’s scores mean what we want them to mean.  So if we want the scores to reflect knowledge of high school math curriculum for graduation, then the test specifications should align to the curriculum quite closely.  If we want the scores to reflect that a surgeon is qualified to practice, we want the test specifications to reflect the knowledge and skills needed to practice.  A lot of work can go into designing the blueprints, such as job task analysis in certification and licensure.  The image here provides an example of how JTA data is converted into content blueprints.

The test blueprints/specifications are also important for directing efforts in test development.  At the simplest level, you want your item writers to create new items in areas where you need them.  If the blueprints only call for 1% of the test on a certain topic, you don’t want the item writers making a lot of new questions there.

The test blueprints are often published publicly in a simplified version to help external stakeholders.  For example, you want the surgeons to be able to study for their test, so you publish a list of content domains that is covered by the test, and the percentage of items from each.  A fantastic example of this is at NOCTI.  Another good example which covers multiple aspects of the list below is this one from New Mexico.

 

What are test blueprints?

The test blueprints, like the blueprints of a house or office building, define everything needed to build it.  There are multiple aspects to this, which can vary by type of exam.  It breaks down into two types of information: item distribution, and operational guidelines.

Item distribution

There are many ways that you can classify items on the test.  The content domain or topic that they cover is the most obvious here, such as defining a math test that is 40% Algebra, 30% Geometry, and 30% Calculus.  But there are other, more practical and operational, considerations as well.

 

Number of items

First, the blueprints should define the number of items, including a breakdown of scored vs. unscored (pilot) items.  Often, there is documented reasoning behind the choices for this, such as pretesting plans, or an estimate of reliability based on projected test length.

Content

This is the most important and most common.  Some test blueprints only cover this and the number of items.  It defines all the content covered by the test, and the percentage for each.  Sometimes, there are sub-domains and sub-sub-domains!  Here is an example of that, from the New Mexico link provided earlier.

New Mexico test blueprints
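Content percentages eventually have to become whole numbers of items. The sketch below shows one way to do that, using the math test example from earlier (40% Algebra, 30% Geometry, 30% Calculus). The function name is hypothetical, and the largest-remainder rounding rule is an assumption chosen for illustration; a testing program may well document a different rule.

```python
def allocate_items(percentages, n_items):
    """Turn blueprint percentages into whole-item counts that sum to n_items."""
    exact = {domain: pct / 100 * n_items for domain, pct in percentages.items()}
    counts = {domain: int(value) for domain, value in exact.items()}
    leftover = n_items - sum(counts.values())
    # Hand any remaining items to the domains with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda d: exact[d] - counts[d], reverse=True)
    for domain in by_remainder[:leftover]:
        counts[domain] += 1
    return counts


blueprint = {"Algebra": 40, "Geometry": 30, "Calculus": 30}
print(allocate_items(blueprint, 50))  # {'Algebra': 20, 'Geometry': 15, 'Calculus': 15}
print(allocate_items(blueprint, 25))
```

When the percentages do not divide evenly, as with 25 items here, some domain has to absorb the rounding, which is one reason blueprints often state item counts or ranges alongside percentages.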

Item type

Many tests only have multiple choice items, so this is then unnecessary.  But there are tests, for example, that require 50 multiple choice items, 10 drag and drop, 10 fill-in-the-blank, and 2 essay.  Such designs need to be explained and codified in the test blueprints.

Statistics

Some test blueprints define a distribution or target level of statistics.  For example, it might require 20% of the items to have classical difficulty statistics (P-values) of 0.40 to 0.60, 60% of the items with values 0.60 to 0.90, and 20% from 0.90 to 1.00.  Or, there might just be acceptable ranges, such as stating that all difficulty statistics should be 0.40 to 0.98.
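A statistical target like this can be checked against an assembled form with a short script. The following is a rough sketch: the function name and the P-values are hypothetical, and because the bands in the example share their boundaries (0.60 and 0.90), the code assumes each band includes its lower bound and excludes its upper bound, except that the top band includes 1.00.

```python
def proportion_in_bands(p_values, bands):
    """Proportion of items whose P-value falls in each (low, high) band.

    Bands are assumed to be sorted from lowest to highest.
    """
    top = bands[-1][1]
    proportions = {}
    for low, high in bands:
        in_band = [p for p in p_values
                   if low <= p < high or (high == top and p == top)]
        proportions[(low, high)] = len(in_band) / len(p_values)
    return proportions


# Targets from the example in the text.
targets = {(0.40, 0.60): 0.20, (0.60, 0.90): 0.60, (0.90, 1.00): 0.20}

# Hypothetical P-values for a 10-item form.
form = [0.45, 0.55, 0.62, 0.70, 0.75, 0.80, 0.85, 0.88, 0.92, 0.97]

observed = proportion_in_bands(form, list(targets))
for band, target in targets.items():
    print(band, "target", target, "observed", observed[band])
```

In practice a form rarely hits the targets exactly, so the comparison would usually allow a tolerance rather than demand equality.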

Cognitive level or Bloom’s

Not all assessments tackle this consideration, but it is common in education.  The test blueprints might specify a certain number of items that are Recall vs. higher levels of cognitive complexity.  Note that this might overlap with Item Type.

Sections

The design of the test might be ordered into sections, which is documented closely.  Continuing the example above, there might be Section 1 that is the 50 multiple choice items, Section 2 is drag-and-drop plus fill-in-the-blank, and Section 3 is Essay.

 

Operational and practical considerations

This part of the blueprints covers aspects other than the nature of items.  There are many things that are useful, but here are a few examples.

  • Time limits – What is the overall time limit of the test?  Section time limits?
  • Navigation – Are examinees allowed to move back and forth between sections?  Between items?
  • Test design – If you are using modern designs like computerized adaptive testing or linear on the fly testing, you need to define these with a lot of detail.
  • Messaging – What instructions will you give?  Are there pop-up messages?
  • Access – How do you control access to the exam?  Are there eligibility requirements?  Published online vs. paper?  So many options.

 

Summary

As you can see, there are a ton of things to consider when publishing a test.  If the test is low-stakes, many of these are treated informally, such as a teacher handing out a 10 item quiz.  But for high-stakes assessment, the definition of formal test blueprints and specifications is absolutely essential.  Not only does it prepare the candidates and other stakeholders, but it makes things easier for the test developers, and provides substantial documentation for validity.  Moreover, if you work in an area where there are potential legal challenges, it provides a bulwark of legal defensibility.  If you work in high-stakes or high-volume assessment, you need to define your test blueprints.

Test publishing is the process of preparing computer-based assessments for delivery on an electronic platform. Test publishing is like a car rolling off the assembly line. It’s the culmination of a great deal of effort in developing the assessment. Just as a car undergoes extensive checks before leaving the factory, a computer-based assessment requires meticulous quality control procedures to make sure that it functions as intended. Errors may have significant consequences for the sponsoring organization, including a loss of reputation, and can even have legal implications, depending upon the type of error.

test publishing quality assurance

The test publishing quality control process begins prior to the  start of the publishing process. The key steps in the process are as follows:

 

Step 1: Determine the test publishing specifications

Quality control begins with the completion of a test specifications document. The test specifications document provides the pattern or the playbook for how the test should be published. It typically includes the following information:

  • Test design
    • Administration model (i.e., linear fixed form, LOFT, CAT)
    • Scoring strategy (dichotomous/polytomous item-level scoring, compensatory/conjunctive domain/sectional scoring)
    • Test length (number of items shown to each candidate)
    • Test duration (time allowed for exam/sections)
  • Content specifications
    • List of included items (and which are scored/unscored)
    • Mapping of items to domains/sections/subscales (if applicable)
    • Mapping of stimuli to items (if applicable)
    • Item keys
  • Ancillary delivery components
    • Non-disclosure agreement
    • Tutorial
    • Customized help screens
    • Calculator
  • Features and functionality
    • Navigation (e.g., review of previous items allowed)
    • Review screens
    • Electronic scratch pad
    • Item-level comments/feedback

Note that this is not a comprehensive list, and information needed for the test specifications documents may vary depending upon the type of assessment and the specific testing platform used for delivery. Some of the data on the test specifications are relatively static and will change only with changes to the test design. Other data, such as the list of included items, are dynamic and will typically change each time the assessment is republished.

The test specifications document becomes the authoritative source of truth used by the test publisher for how the assessment should be published. It is a key communication tool between the sponsoring organization and their test publishing vendor or partner. 

 

Step 2: Identify sources of test publishing errors

A comprehensive determination of everything that could possibly go wrong in the test publishing process should serve as the guide for the quality control checks that need to be performed before the test goes live. A tool that can assist in developing a comprehensive list of potential errors is a fishbone diagram, also known as an Ishikawa diagram or a cause-and-effect diagram.  It is a visual representation used to identify and organize possible causes of a specific problem or effect. The diagram takes the form of a fish skeleton (hence, its name), with the “head” representing the problem or effect, and the “bones” representing different categories of potential causes. Along each bone, smaller branches represent sub-causes, which are specific elements that may contribute to the problem.

Fishbone diagrams are created by having a team representing all disciplines involved in a process brainstorm potential problems, or in the case of test publishing, potential errors that can be introduced into the test publishing process. Determining potential categories of errors first and then brainstorming more specific errors enables a comprehensive analysis of the test publishing process and the potential occasions in which errors can be introduced in that process.

Here’s a sample fishbone diagram for test publishing errors: 

test publishing errors fishbone

 

Step 3: Review against source of truth

The examination should be reviewed against the source of truth for each potential error. The test specifications will be the key source of truth against which the published examination is compared to identify any errors. The item bank is the source of truth for item presentation and item metadata.
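Part of this review can be automated when both the specifications and the published form are available as structured data. The sketch below is purely illustrative: the dictionary layout, field names, and item IDs are hypothetical and do not correspond to any particular testing platform's export format.

```python
def find_publishing_errors(spec, published):
    """Compare a published form with the specifications (the source of truth)."""
    errors = []
    spec_items, pub_items = spec["items"], published["items"]
    for item_id in spec_items.keys() - pub_items.keys():
        errors.append(f"{item_id}: missing from published form")
    for item_id in pub_items.keys() - spec_items.keys():
        errors.append(f"{item_id}: not listed in specifications")
    for item_id in spec_items.keys() & pub_items.keys():
        for field in ("key", "scored"):
            expected = spec_items[item_id][field]
            actual = pub_items[item_id][field]
            if expected != actual:
                errors.append(f"{item_id}: {field} is {actual!r}, expected {expected!r}")
    if spec["time_limit_minutes"] != published["time_limit_minutes"]:
        errors.append("time limit does not match specifications")
    return sorted(errors)


# Hypothetical specification and published form.
spec = {
    "time_limit_minutes": 60,
    "items": {"Q1": {"key": "A", "scored": True},
              "Q2": {"key": "C", "scored": False}},
}
published = {
    "time_limit_minutes": 90,
    "items": {"Q1": {"key": "B", "scored": True},
              "Q3": {"key": "D", "scored": True}},
}
for error in find_publishing_errors(spec, published):
    print(error)
```

A script like this only covers errors that show up in metadata; checks on item rendering, navigation, and ancillary screens still need a human reviewer working through the published exam.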

Error-free test publishing is central to preserving the test sponsor’s reputation. Even minor mistakes, such as misspelled words, can be damaging. More importantly, errors in a published examination can have deleterious effects for candidates. A scoring error might mean the difference between a candidate failing and passing, and in the case of a certification or licensure examination, that can have dire consequences for the candidate’s career and livelihood. 

As mathematician Nassim Nicholas Taleb stated, “Quality is the result of an intelligent effort, not a chance happening.” A rigorous quality control procedure aids in making the publishing process an intelligent effort.

Testlet is a term in educational assessment that refers to a set of test items or questions grouped together on a test, often with a common theme or scenario. This approach aims to provide a more comprehensive and nuanced assessment of an individual’s abilities compared to traditional testing methods.

What is a testlet?

As mentioned above, a testlet is a group of items delivered together.  There are two ways of doing this.

  1. Items that share a common stimulus or otherwise MUST be together.  An example of this is a reading passage with 4 questions about it.  You can’t have the passage and the 4 questions scattered about a 100 item test as 5 screens in random places!  It all has to be together to make sense.
  2. Items that do not have to be together, but grouping them serves the purpose of the assessment.  In this case, you might have 10 items that are standalone (no reading passage or anything relating them), but your test might be multistage testing and all items are delivered in blocks of 10.  Test designers can tailor the difficulty level based on the test-taker’s performance. As a test-taker progresses through a testlet, the system dynamically adjusts the complexity of subsequent questions, ensuring a personalized and accurate assessment of their proficiency.

Example item - testlet

Why use testlets?

The answer is obvious in the first case: you have to.  But it does get deeper than that.

One key feature of testlets is their ability to mimic real-world scenarios. Unlike standalone questions, testlets present a series of interconnected problems or tasks that require the test-taker to apply their knowledge in a cohesive manner. This not only assesses their understanding of isolated concepts but also evaluates their ability to integrate information and solve complex problems.  Testlets can be particularly effective in assessing critical thinking, problem-solving skills, and practical application of knowledge. By presenting questions in a contextually linked manner, testlets offer a more authentic representation of a person’s ability to handle real-world challenges.

Testlets promote efficiency in testing. With a focused set of questions, they save time and reduce the fatigue associated with extensive testing sessions. This makes them an attractive option for educators and testing organizations seeking to streamline assessment processes while maintaining accuracy.  That is, if you want 20 items on reading comprehension, you could have 20 reading passages each with 1 question, or 4 reading passages each with 5 questions.  The fatigue would be far less in the latter test!

The second case, of standalone items, is a bit more nuanced.  It often has to do with managing the blueprints of the test, making best use of the item bank, and other operational considerations.  For example, perhaps the test has a blueprint to have 50% algebra items, 30% geometry, and 20% trigonometry.  You might build packets of 10 items with 5, 3, 2 respectively, and use those packets.
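The packet idea can be sketched as a random draw from each content area of the item bank. Everything in this example is hypothetical (the item IDs, the bank layout, and the function name), and it ignores the statistical targets and exposure controls that a real assembly process would also consider.

```python
import random


def build_packet(item_bank, blueprint, rng=None):
    """Draw one packet with the requested number of items from each domain."""
    rng = rng or random.Random()
    packet = []
    for domain, count in blueprint.items():
        # sample() draws without replacement, so no item repeats within a packet.
        packet.extend(rng.sample(item_bank[domain], count))
    return packet


# Hypothetical bank: 20 items per domain.
item_bank = {
    "algebra": [f"ALG-{i:02d}" for i in range(1, 21)],
    "geometry": [f"GEO-{i:02d}" for i in range(1, 21)],
    "trigonometry": [f"TRG-{i:02d}" for i in range(1, 21)],
}
# A 10-item packet that mirrors the 50% / 30% / 20% blueprint.
blueprint = {"algebra": 5, "geometry": 3, "trigonometry": 2}

print(build_packet(item_bank, blueprint, rng=random.Random(42)))
```

Because every packet matches the blueprint on its own, any combination of packets delivered to a candidate also matches it, which is what makes this approach convenient operationally.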

 

How do you score testlets?

Testlets can be scored with traditional methods, or with a new technology that was developed for this unique situation.

First, you can score with classical test theory, which is the traditional method of number-correct or points.

Second, you can use item response theory.  However, if the items share a strong relation, this might violate the IRT assumption of local independence.

Third, testlet response theory (TRT; Wainer, Bradlow, & Wang, 2007) works to address some of the concerns with traditional IRT.

 

Summary

In conclusion, a testlet is a powerful and flexible tool in the toolbox of assessment designers. Its ability to present interconnected questions, mimic real-world scenarios, and adapt to individual performance makes it a valuable asset in gauging a person’s knowledge and skills. As education and assessment methods continue to evolve, the role of testlets is likely to expand, contributing to more accurate and meaningful evaluations of individuals in various fields.

Job Task Analysis (JTA) is an essential step in designing a test to be used in the workforce, such as pre-employment or certification/licensure, by analyzing data on what is actually being done in the job.  Also known as Job Analysis or Role Delineation, job task analysis is important to design a test that is legally defensible and eligible for accreditation.  It usually involves a panel of subject matter experts to develop a survey, which you then deliver to professionals in your field to get quantitative data about what is most frequently done on the job and what is most critical/important.  This data can then be used for several important purposes.


 

Reasons to do a Job Task Analysis

Job analysis is extremely important in the field of industrial/organizational psychology, hence the meme here from @iopsychmemes.  It’s not just limited to credentialing.

Job analysis I/O Psychology

Exam design

The most common reason is to get quantitative data that will help you design an exam.  By knowing which knowledge, skills, or abilities (KSAs) are most commonly used, you then know which deserve more questions on the test.  It can also help you with more complex design aspects, such as defining a practical exam with live patients.

Training curriculum

Similarly, that quantitative info can help design a curriculum and other training materials.  You will have data on what is most important or frequent.

Compensation analysis

You have a captive audience with the JTA survey.  Ask them other things that you want to know!  This is an excellent time to gather information about compensation.  I worked on a JTA in the past which asked about work location: clinic, hospital, private practice, or vendor/corporate.

Job descriptions

A good job analysis will help you write a job description for postings.  It will tell you the job responsibilities (common tasks), qualifications (required skills, abilities, and education), and other important aspects.  If you gather compensation data in the survey, that can be used to define the salary range of the open position.

Workforce planning

Important trends might become obvious when analyzing the data.  Are fewer people entering your profession, perhaps specific to a certain region or demographic?  Are they entering without certain skills?  Are there certain universities or training programs that are not performing well?  A JTA can help you discover such issues and then work with stakeholders to address them.  These are major potential problems for the profession.

IT IS MANDATORY

If you have a professional certification exam and want to get it accredited by a board such as NCCA or ANSI/ANAB/ISO, then you are REQUIRED to do some sort of job task analysis.

 

Why is a JTA so important for certification and licensure?  Validity.

The fundamental goal of psychometrics is validity, which is evidence that the interpretations we make from scores are actually true. In the case of certification and licensure exams, we are interpreting that someone who passes the test is qualified to work in that job role. So, the first thing we need to do is define exactly what is the job role, and to do it in a quantitative, scientific way. You can’t just have someone sit down in their basement and write up 17 bullet points as the exam blueprint.  That is a lawsuit waiting to happen.

There are other aspects that are essential as well, such as item writer training and standard setting studies.

 

The Methodology: Job Task Inventory

It’s not easy to develop a defensible certification exam, but the process of job task analysis (JTA) doesn’t require a Ph.D. in Psychometrics to understand. Here’s an overview of what to expect.

  1. Convene a panel of subject matter experts (SMEs), and provide a training on the JTA process.
  2. The SMEs then discuss the role of the certification in the profession, and establish high-level topics (domains) that the certification test should cover. Usually, there are 5-20. Sometimes there are subdomains, and occasionally sub-subdomains.
  3. The SME panel generates a list of job tasks that are assigned to domains; the list is reviewed for duplicates and other potential issues. These tasks have an action verb, a subject, and sometimes a qualifier. Examples: “Calibrate the lensometer,” “Take out the trash”, “Perform an equating study.”  There is a specific approach to help with the generation, called the critical incident technique.  With this, you ask the SMEs to describe a critical incident that happened on the job and what skills or knowledge led to success by the professional.  While this might not generate ideas for frequent yet simple tasks, it can help generate ideas for tasks that are rarer but very important.
  4. The final list is used to generate a survey, which is sent to a representative sample of professionals that actually work in the role.  The respondents take the survey, whereby they rate each task, usually on its importance and time spent (sometimes called criticality and frequency). Demographics are also gathered, which include age range, geographic region, work location (e.g., clinic vs hospital if medical), years of experience, educational level, and additional certifications.
  5. A psychometrician analyzes the results and creates a formal report, which is essential for validity documentation.  This report is sometimes considered confidential, sometimes published on the organization’s website for the benefit of the profession, and sometimes published in an abbreviated form.  It’s up to you.  For example, this site presents the final results, but then asks you to submit your email address for the full report.

 

Using JTA results to create test blueprints

Many corporations do a job analysis purely for in-house purposes, such as job descriptions and compensation.  This becomes important for large corporations where you might have thousands of people in the same job; it needs to be well-defined, with good training and appropriate compensation.

If you work for a credentialing organization (typically a non-profit, but sometimes the Training arm of a corporation… for example, Amazon Web Services has a division dedicated to certification exams), you will need to analyze the results of the JTA to develop exam blueprints.  We will discuss this process in more detail with another blog post.  But below is an example of how this will look, and here is a free spreadsheet to perform the calculations: Job Task Analysis to Test Blueprints.

 

Job Task Analysis Example

Suppose you are an expert widgetmaker in charge of the widgetmaker certification exam.  You hire a psychometrician to guide the organization through the test development process.  The psychometrician would start by holding a webinar or in-person meeting for a panel of SMEs to define the role and generate a list of tasks.  The group comes up with a list of 20 tasks, sorted into 4 content domains.  These are listed in a survey to current widgetmakers, who rate them on importance and frequency.  The psychometrician analyzes the data and presents a table like you see below.

We can see here that Task 14 is the most frequent, while Task 2 is the least frequent.  Task 7 is the most important while Task 17 is the least.  When you combine Importance and Frequency either by adding or multiplying, you get the weights on the right-hand columns.  If we sum these and divide by the total, we get the suggested blueprints in the green cells.
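The calculation described above (combine Importance and Frequency per task, sum within each domain, divide by the grand total) can be sketched as follows. The ratings below are invented for illustration and are not the values from the table; the function name is hypothetical as well.

```python
def blueprint_from_jta(tasks, combine="multiply"):
    """tasks: (domain, mean importance, mean frequency) for each task.

    Returns the suggested proportion of the exam for each domain.
    """
    domain_weights = {}
    for domain, importance, frequency in tasks:
        if combine == "multiply":
            weight = importance * frequency
        else:
            weight = importance + frequency
        domain_weights[domain] = domain_weights.get(domain, 0) + weight
    total = sum(domain_weights.values())
    return {domain: w / total for domain, w in domain_weights.items()}


# Hypothetical mean ratings for six tasks in three domains.
tasks = [
    ("Design", 4.2, 3.1), ("Design", 3.8, 2.5),
    ("Assembly", 4.5, 4.4), ("Assembly", 3.0, 4.0),
    ("Quality", 4.8, 2.0), ("Quality", 3.5, 1.5),
]
for method in ("add", "multiply"):
    shares = blueprint_from_jta(tasks, combine=method)
    print(method, {d: round(s, 3) for d, s in shares.items()})
```

Adding and multiplying generally give similar but not identical blueprints; multiplying spreads the weights out more, so tasks rated high on both dimensions count for proportionally more.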

 

Job task analysis to test blueprints

 

The Four-Fifths Rule is a guideline for fairness in hiring practices in the USA.  Because tests are often used in making hiring decisions, the rule applies to them, which makes it an important aspect of assessment in the workforce; it also applies to other selection methods, such as interviews or biodata.  It is important not only because violations could lead to legal entanglements, but because achieving a diverse and inclusive workforce is a goal for most organizations.

What is the Four-Fifths Rule?

The Four-Fifths Rule, also known as the 80% Rule, is a statistical guideline established by the Equal Employment Opportunity Commission (EEOC) in the United States, used to evaluate whether a selection process leads to adverse impact against any specific group. The rule comes into play when comparing the selection rates of different demographic groups within an organization, aiming to identify potential disparities. According to the EEOC, a selection rate for any group that is less than four-fifths (or 80%) of the rate for the group with the highest selection rate may indicate adverse impact.

This applies to any organization that is hiring in the United States, even if that organization is based overseas.  A great example of this is a 2023 lawsuit against a Chinese company that was hiring US employees with unfair practices.

The Four-Fifths Rule serves as a vital benchmark for organizations striving for diversity and inclusion. By highlighting disparities in selection rates, it helps employers identify and rectify potential discriminatory practices. This not only aligns with ethical considerations but also ensures compliance with anti-discrimination laws, fostering an environment that values equal opportunity for all.

four-fifths rule diversity in pre-employment testing

Calculation Method

First, determine the selection rate for each demographic group by dividing the number of individuals selected from that group by the total number of applicants from the same group. Next, compare the selection rates of different groups. If the selection rate for any group is less than 80% of the rate for the group with the highest selection rate, it triggers further investigation into potential discrimination.

Example:

Group A has 500 applicants and 100 were selected; a 20% selection rate

Group B has 120 applicants and 17 were selected; a 14.17% selection rate

The ratio is 0.1417/0.20 = 0.7083.  This is below 0.80, so the procedure is flagged for potential adverse impact against Group B.

Note that we are focusing on rates and not overall numbers.  Clearly, Group B has far fewer people selected, and the rates of 20% and 14.17% may not look very different, but they are different enough that this test would come under scrutiny.
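The calculation in this example can be sketched in a few lines of Python. The function names here are illustrative choices, not part of any official tool, and the 0.80 threshold is simply the four-fifths cutoff described above.

```python
def selection_rate(selected, applicants):
    """Proportion of applicants in a group who were selected."""
    return selected / applicants


def impact_ratios(groups):
    """groups maps a group name to (selected, applicants).

    Returns each group's selection rate divided by the highest group's rate.
    """
    rates = {name: selection_rate(s, n) for name, (s, n) in groups.items()}
    highest = max(rates.values())
    return {name: rate / highest for name, rate in rates.items()}


# The two groups from the example above.
groups = {"Group A": (100, 500), "Group B": (17, 120)}
ratios = impact_ratios(groups)

# Flag any group whose ratio falls below four-fifths (0.80).
flagged = [name for name, ratio in ratios.items() if ratio < 0.80]

for name, ratio in ratios.items():
    print(f"{name}: impact ratio {ratio:.4f}")
print("Flagged for review:", flagged)
```

A flag here only signals that the procedure deserves a closer look; it is not by itself a finding of discrimination.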

Implementing the Four-Fifths Rule in Practice

To guard against adverse impact under the Four-Fifths Rule, organizations must adopt proactive measures. Regularly monitoring and analyzing selection rates for different demographic groups can help identify trends and address potential issues promptly. Furthermore, organizations should establish clear policies and procedures for hiring, ensuring that decision-makers are well-informed about the Four-Fifths Rule and its implications.

Note that this is only a guideline for flagging potential adverse impact.  It does not mean the selection method will be struck down.  Consider a physical fitness test for firefighters; it will most definitely produce lower results for people aged 60 and over, but physical fitness is unarguably a job requirement, so if the test has been validated it will most likely be upheld.

How does AI fit into this?

Artificial intelligence (AI) is governed by the Four-Fifths Rule just like any other selection approach.  Do you use AI to comb through a pile of resumes and flag those worthy of an interview?  That is a selection procedure, and if it were found to be biased against a subgroup, you would be liable.

Conclusion

In the pursuit of a fair and inclusive workplace, the Four-Fifths Rule is a valuable tool for organizations committed to diversity. Moreover, it is a legal guideline for any organization that hires in the United States.  It is legally required that your organization follow this guideline with respect to pre-employment assessments as well as any other selection procedure.

Note: ASC does not provide legal advice; this is only for educational purposes.

Content validity is an aspect of validity, a term that psychometricians use to refer to evidence that interpretations of test scores are supported.  For example, predictive validity provides evidence that a pre-employment test will predict job performance, tenure, and other important criteria.  Content validity, on the other hand, focuses on evidence that the content of the test covers what it should cover.

What is Content Validity?

Content validity refers to the extent to which a measurement instrument (e.g., a test, questionnaire, or survey) accurately and adequately measures the specific content or construct it is designed to assess. In simpler terms, it assesses whether the questions or items included in an assessment are relevant and representative of the subject matter or concept under investigation.

Example 1: You are working on a benchmark test for 5th grade mathematics in the USA.  You would likely want to ensure that all items align to the Common Core State Standards for the 5th grade mathematics curriculum.

Example 2: You are working on a certification exam for widgetmakers.  You should make sure that all items align to the publicly posted blueprint for this certification.  That blueprint, in turn, should not have been defined willy-nilly; it should have been built on the results of a formal job task analysis study.

The Importance of Content Validity

Drives Accurate Measurement: Content validity helps in ensuring that the assessment tool is measuring what it’s intended to measure. This is critical for drawing meaningful conclusions and making informed decisions based on the results.

Enhances Credibility: When your assessment has high content validity, it enhances the credibility and trustworthiness of your findings. It demonstrates that you’ve taken the time to design a valid instrument. This is often referred to as face validity, which is not a “real” type of validity that psychometricians consider, but refers to whether someone off the street would look at the test and say “yeah, that looks like all the items are on widgetmaking.”

Reduces Bias: Using assessment items that are not content-valid can introduce bias and inaccuracies into your results. By maintaining content validity, you reduce the risk of skewed or unreliable data.

Improves Decision-Making: Organizations often rely on assessments to make important decisions, such as hiring employees, designing educational curricula, or evaluating the effectiveness of marketing campaigns. Content-valid assessments provide a solid foundation for making these decisions.

Legal Defensibility: In general, if you deliver a test to select employees, you need to show either content validity (e.g., test on Microsoft Excel for bookkeepers) or predictive validity (conscientiousness is a personality trait but probably related to success as a bookkeeper).  A similar notion applies to other types of tests.

How to Assess Content Validity

There are various methods to assess content validity, such as expert reviews, pilot testing, and statistical techniques. One common method is to gather a panel of experts in the subject matter and have them review the assessment items to ensure that they align with the content domain.  Of course, if all the items are written directly to the blueprints in the first place, and reviewed before they even become part of the pool of active items, a post-hoc review like that is not necessary.

There has been more recent research on the application of machine learning to evaluate content, including the add-on option to look for enemy items by evaluating the distance between the content of any given pair of items.

If the test is multidimensional, a statistical approach known as factor analysis can help show whether the items actually load on the dimensions they should.

Conclusion

In summary, content validity is an essential aspect of assessment design that ensures the questions or items used in an assessment are appropriate, relevant, and representative of the construct being measured. It plays a significant role in enhancing the accuracy, credibility, and overall quality of your assessments. Whether you’re a student preparing for an exam, a researcher developing a survey, or a business professional creating a customer feedback form, understanding and prioritizing content validity will help you achieve more reliable and meaningful results. So, next time you’re tasked with creating or using an assessment tool, remember the importance of content validity and its impact on the quality of your data and decision-making processes.

However, it is not the only aspect of validity.  The documentation of validity is a complex process that is often ongoing.  You will also need data on statistical performance of the test (e.g., alpha reliability), evaluation of bias (e.g., differential item functioning), possibly predictive validity, and more.  Therefore, it’s important to work with a psychometrician who can help you understand what is involved and ensure that the test meets both international standards and the reason that you are building the test in the first place!

Predictive Validity is a type of test score validity which evaluates how well a test predicts something in the future, usually with a goal of making more effective decisions about people.  For instance, it is often used in the world of pre-employment testing, where we want a test to predict things like job performance or tenure, so that a company can hire people that do a good job and stay a long time – a very good result for the company, and worth the investment.

Validity, in a general sense, is evidence that we have to support intended interpretations of test scores.  There are different types of evidence that we can gather to do so.  Predictive validity refers to evidence that the test predicts things that it should predict.  If we have quantitative data to support such conclusions, it makes the test more defensible and can improve the efficiency of its use.  For example, if a university admissions test does a great job of predicting success at university, then universities will want to use it to select students that are more likely to succeed.

Examples of Predictive Validity

Predictive validity evidence can be gathered for a variety of assessment types.

  1. Pre-employment: Since the entire purpose of a pre-employment test is to positively predict good things like job performance or negatively predict bad things like employee theft or short tenure, a ton of effort goes into developing tests to function in this way, and then documenting that they do.
  2. University Admissions: Like pre-employment testing, the entire purpose of university admissions exams is predictive.  They should positively correlate with good things (first year GPA, four year graduation rate) and negatively predict the negative outcomes like academic probation or dropping out.
  3. Prep Exams: Preparatory or practice tests are designed to predict performance on their target test.  For example, if a prep test is designed to mimic the Scholastic Aptitude Test (SAT), then one way to validate it is to gather the SAT scores later, after the examinees take it, and correlate with the prep test.
  4. Certification & Licensure: The primary purpose of credentialing exams is not to predict job performance, but to ensure that the candidate has mastered the material necessary to practice their profession.  Therefore, predictive validity is less important than content-related validity evidence, such as blueprints based on a job analysis. However, some credentialing organizations do research on the “value of certification,” linking it to improved job performance, reduced clinical errors, and often external third variables such as greater salary.
  5. Medical/Psychological: There are some assessments that are used in a clinical situation, and the predictive validity is necessary in that sense.  For instance, there might be an assessment of knee pain used during initial treatment (physical therapy, injections) that can be predictively correlated with later surgery.  The same assessment might then be used after the surgery to track rehabilitation.

Predictive Validity in Pre-employment Testing

The case of pre-employment testing is perhaps the most common use of this type of validity evidence.  A new study (Sackett, Zhang, Berry, & Lievens, 2022) was recently released that was a meta-analysis of the various types of pre-employment tests and other selection procedures (e.g., structured interview), comparing their predictive validity.  This was a modern update to the classic article by Schmidt & Hunter (1998).  While in the past the consensus has been that cognitive ability tests provide the best predictive power in the widest range of situations, the new article suggests otherwise.  It recommends the use of structured interviews and job knowledge tests, which are more targeted towards the role in question, so it is not surprising that they perform well.  This in turn suggests that you should not buy pre-fab ability tests and use them in a shotgun approach with the assumption of validity generalization, but instead leverage an online testing platform like FastTest that allows you to build high-quality exams that are more specific to your organization.

Why do we need predictive validity?

There are a number of reasons that you might need predictive validity for an exam.  They are almost always regarding the case where the test is used to make important decisions about people.

  1. Smarter decision-making: Predictive validity provides valuable insights for decision-makers. It helps recruiters identify the most suitable candidates, educators tailor their teaching methods to enhance student learning, and universities admit the best students.
  2. Legal defensibility: If a test is being used for pre-employment purposes, it is legally required in the USA to either show that the test is obviously job-related (e.g., knowledge of Excel for a bookkeeping job) or that you have hard data demonstrating predictive validity.  Otherwise, you are open to a lawsuit.
  3. Financial benefits: Often, the reason for needing improved decisions is very financial.  It is often costly for large companies to recruit and train personnel.  It’s entirely possible that spending $100,000 per year on pre-employment tests could save millions of dollars in the long run.
  4. Benefits to the examinee: Sometimes, there is directly a benefit to the examinee.  This is often the case with medical assessments.

How to implement predictive validity

The simplest case is that of regression and correlation.  How well does the test score correlate with the criterion variable?  Below is an oversimplified example of predicting university GPA from scores on an admissions test.  Here, the correlation is 0.858 and the regression is GPA = 0.34*SCORE + 0.533.  Of course, in real life, you would not see predictive power this strong, as there are many other factors which influence GPA.

Predictive validity
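For readers who want to try this themselves, here is a minimal Python sketch of the correlation and simple regression calculation. The scores and GPAs are invented for illustration and are not the data behind the figure, so the output will not match the 0.858 and 0.34 values quoted above.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def correlation_and_regression(x, y):
    """Return (Pearson r, slope, intercept) for predicting y from x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx                    # least-squares slope
    intercept = mean_y - slope * mean_x  # line passes through the means
    return r, slope, intercept


# Hypothetical data: admissions test scores and later university GPA.
scores = [3, 4, 5, 6, 7, 8, 9, 10]
gpas = [1.8, 2.1, 2.0, 2.9, 2.7, 3.4, 3.3, 3.9]

r, slope, intercept = correlation_and_regression(scores, gpas)
print(f"r = {r:.3f}")
print(f"GPA = {slope:.3f}*SCORE + {intercept:.3f}")
```

In practice you would likely use a statistics package, but the hand calculation shows that the validity coefficient is nothing more exotic than a Pearson correlation between the test and the criterion.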

Advanced Issues

It is usually not a simple situation of two straightforward variables, such as one test and one criterion variable.  Often, there are multiple predictor variables (quantitative reasoning test, MS Excel knowledge test, interview, rating of the candidate’s resume), and moreover there are often multiple criterion variables (job performance ratings, job tenure, counterproductive work behavior).  When you use multiple predictors and a second or third predictor adds some bit of predictive power over that of the first variable, this is known as incremental validity.
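To make incremental validity concrete, here is a small Python sketch that compares the variance explained by one predictor with the variance explained by two. It relies on the standard formula for R-squared with two predictors expressed through the three pairwise correlations; the data and variable names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared_two_predictors(y, x1, x2):
    """R-squared for regressing y on x1 and x2, from pairwise correlations."""
    r_y1, r_y2, r_12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Hypothetical data for ten hires, invented for illustration.
reasoning = [52, 61, 47, 70, 66, 58, 75, 49, 63, 68]              # test score
interview = [3.0, 3.5, 2.5, 3.0, 4.5, 4.0, 4.0, 2.0, 3.5, 3.0]    # interview rating
performance = [2.9, 3.4, 2.6, 3.6, 4.2, 3.7, 4.3, 2.4, 3.5, 3.4]  # criterion

r2_first = pearson_r(performance, reasoning) ** 2
r2_both = r_squared_two_predictors(performance, reasoning, interview)
incremental = r2_both - r2_first  # extra variance explained by the interview

print(f"R^2 with reasoning test only: {r2_first:.3f}")
print(f"R^2 with test plus interview: {r2_both:.3f}")
print(f"Incremental validity (delta R^2): {incremental:.3f}")
```

With more than two predictors you would fit a multiple regression instead, but the idea is the same: how much does R-squared go up when the new predictor is added?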

You can also implement more complex machine learning models, such as neural networks or support vector machines, if they fit and you have sufficient sample size.

When performing such validation, you need to also be aware of bias.  There can be test bias where the test being used as a predictor is biased against a subgroup.  There can also be predictive bias where two subgroups have the same performance on the test, but one is overpredicted for the criterion and the other is underpredicted.  A rule of thumb for investigating this in the USA is the four-fifths rule.

Summary

Predictive validity is one type of test score validity, referring to evidence that scores from a certain test can predict their intended target variables.  The most common application of it is to pre-employment testing, but it is useful in other situations as well.  But validity is an extremely important and wide-ranging topic, so it is not the only type of validity evidence that you should gather.

Psychometrics is the science of educational and psychological assessment.  It scientifically studies how tests are developed, delivered, and scored, regardless of the test topic.  The goal is to provide validity: evidence to support that interpretations of scores from the test are trustworthy.  This makes the tests more effective for their purpose of providing useful information about people.

Psychometrics tackles fundamental questions around assessment, such as how to determine if a test is reliable or if a question is of good quality, as well as much more complex questions like how to ensure that a score today is on the same scale as a score 10 years ago.  The goal of psychometrics is to make test scores fairer, more precise, and more valid – because test scores are used to make decisions about people (pass a course, hire for a job…), and better tests mean better decisions.  Why?  The scientific evidence is overwhelming that tests provide better information for decision makers than many other types of information, such as interviews, resumes, or educational attainment.

What is psychometrics? An introduction / definition.

Psychometrics is the study of assessment itself, regardless of what type of test is under consideration. In fact, many psychometricians don’t even work on a particular test; they just work on psychometrics itself, such as new methods of data analysis.  Most professionals don’t care about what the test is measuring, and will often switch to new jobs at completely unrelated topics, such as moving from a K-12 testing company to psychological measurement to an accountant certification exam.  We often refer to whatever we are measuring simply as “theta” – a term from item response theory.

Psychometrics is a branch of data science.  In fact, it was around long before that term was even a buzzword.  Don’t believe me?  Check out this Coursera course on Data Science, and the first example they give as one of the foundational historical projects in data science is… psychometrics!  (early research on factor analysis of intelligence)

Even though assessment is everywhere and Psychometrics is an essential aspect of assessment, to most people it remains a black box, and professionals are referred to as “psychomagicians” in jest. However, a basic understanding is important for anyone working in the testing industry, especially those developing or selling tests.  It’s also important for many areas that use assessments, like human resources and education.

What is not psychometrics?

Psychometrics is NOT limited to very narrow types of assessment.  Some people use the term interchangeably with concepts like IQ testing, personality assessment, or pre-employment testing.  These are each but tiny parts of the field!  Also, it is not the administration of a test.

 

What questions does the field of Psychometrics address?

Building and maintaining a high-quality test is not easy.  A lot of big issues can arise.  Much of the field revolves around solving major questions about tests: what should they cover, what is a good question, how do we set a good cutscore, how do we make sure that the test predicts job performance or student success, etc.

 

How do we define what should be covered by the test? (Test Design)

Before writing any items, you need to define very specifically what will be on the test.  If the test is in credentialing or pre-employment, psychometricians typically run a job analysis study to form a quantitative, scientific basis for the test blueprints.  A job analysis is necessary for a certification program to get accredited.  In Education, the test coverage is often defined by the curriculum.

 

How do we ensure the questions are good quality? (Item Writing)

There is a corpus of scientific literature on how to develop test items that accurately measure whatever you are trying to measure.  A great overview is the book by Haladyna.  This is not just limited to multiple-choice items, although that approach remains popular.  Psychometricians leverage their knowledge of best practices to guide the item authoring and review process in a way that the result is highly defensible test content.  Professional item banking software provides the most efficient way to develop high-quality content and publish multiple test forms, as well as store important historical information like item statistics.

 

How do we set a defensible cutscore? (Standard Setting)

Test scores are often used to classify candidates into groups, such as pass/fail (Certification/Licensure), hire/non-hire (Pre-Employment), and below-basic/basic/proficient/advanced (Education).  Psychometricians lead studies to determine the cutscores, using methodologies such as Angoff, Beuk, Contrasting-Groups, and Borderline.

 

How do we analyze results to improve the exam? (Psychometric Analysis)

Psychometricians are essential for this step, as the statistical analyses can be quite complex.  Smaller testing organizations typically utilize classical test theory, which is based on simple mathematics like proportions and correlations.  Large, high-profile organizations typically use item response theory (IRT), which is based on a type of nonlinear regression analysis.  Psychometricians evaluate overall reliability of the test, difficulty and discrimination of each item, distractor analysis, possible bias, multidimensionality, linking multiple test forms/years, and much more.  Software such as  Iteman  and  Xcalibre  is also available for organizations with enough expertise to run statistical analyses internally.  Scroll down below for examples.

 

How do we compare scores across groups or years? (Equating)

This is referred to as linking and equating.  There are some psychometricians that devote their entire career to this topic.  If you are working on a certification exam, for example, you want to make sure that the passing standard is the same this year as last year.  If you passed 76% last year and this year you passed 25%, not only will the candidates be angry, but there will be much less confidence in the meaning of the credential.

 

How do we know the test is measuring what it should? (Validity)

Validity is the evidence provided to support score interpretations.  For example, we might interpret scores on a test to reflect knowledge of English, and we need to provide documentation and research supporting this.  There are several ways to provide this evidence.  A straightforward approach is to establish content-related evidence, which includes the test definition, blueprints, and item authoring/review.  In some situations, criterion-related evidence is important, which directly correlates test scores to another variable of interest.  Delivering tests in a secure manner is also essential for validity.

 

Where is Psychometrics Used?

Certification/Licensure/Credentialing

In certification testing, psychometricians develop the test via a documented chain of evidence following a sequence of research outlined by accreditation bodies, typically: job analysis, test blueprints, item writing and review, cutscore study, and statistical analysis.  Web-based item banking software like  FastTest  is typically useful because the exam committee often consists of experts located across the country or even throughout the world; they can then easily log in from anywhere and collaborate.

 

Pre-Employment

In pre-employment testing, validity evidence relies primarily on establishing appropriate content (a test on PHP programming for a PHP programming job) and the correlation of test scores with an important criterion like job performance ratings (shows that the test predicts good job performance).  Adaptive tests are becoming much more common in pre-employment testing because they provide several benefits, the most important of which is cutting test time by 50% – a big deal for large corporations that test a million applicants each year. Adaptive testing is based on item response theory, and requires a specialized psychometrician as well as specially designed software like  FastTest.

 

K-12 Education

Most assessments in education fall into one of two categories: lower-stakes formative assessment in classrooms, and higher-stakes summative assessments like year-end exams.  Psychometrics is essential for establishing the reliability and validity of higher-stakes exams, and for equating the scores across different years.  It is also important for formative assessments, which are moving towards adaptive formats because of the 50% reduction in test time, meaning that students spend less time testing and more time learning.

 

Universities

Universities typically do not give much thought to psychometrics even though a significant amount of testing occurs in higher education, especially with the move to online learning and MOOCs.  Given that many of the exams are high stakes (consider a certificate exam after completing a year-long graduate program!), psychometricians should be involved in establishing legally defensible cutscores and in statistical analysis to ensure reliable tests, and professionally designed assessment systems should be used for developing and delivering tests, especially with enhanced security.

 

Medicine/Psychology

Have you ever taken a survey at your doctor’s office, or before/after a surgery?  Perhaps a depression or anxiety inventory at a psychotherapist?  Psychometricians have worked on these.

 

The Test Development Cycle

Psychometrics is the core of the test development cycle, which is the process of developing a strong exam.  It is sometimes called by similar names, such as the assessment lifecycle.

test development cycle job task analysis psychometrics

You will recognize some of the terms from the introduction earlier.  What we are trying to demonstrate here is that those questions are not standalone topics, or something you do once and simply file a report.  An exam is usually a living thing.  Organizations will often be republishing a new version every year or 6 months, which means that much of the cycle is repeated on that timeline.  Not all of it is; for example, many orgs only do a job analysis and standard setting every 5 years.

Consider a certification exam in healthcare.  The profession does not change quickly because things like anatomy never change and medical procedures rarely change (e.g., how to measure blood pressure).  So, every 5 years it does a job analysis of its certificants to see what they are doing and what is important.  This is then converted to test blueprints.  Items are re-mapped if needed, but most likely do not need it because there are probably only minor changes to the blueprints.  Then a new cutscore is set with the modified-Angoff method, and the test is delivered this year.  It is delivered again next year, but equated to this year rather than starting again.  However, the item statistics are still analyzed, which leads to a new cycle of revising items and publishing a new form for next year.

 

Example of Psychometrics in Action

Here is some output from our Iteman software.  This is deeply analyzing a single question on English vocabulary, to see if the student knows the word alleviate.  About 70% of the students answered correctly, with a very strong point-biserial.  The distractor P values were all in the minority and the distractor point-biserials were negative, which adds evidence to the validity.  The graph shows that the line for the correct answer is going up while the others are going down, which is good.  If you are familiar with item response theory, you’ll notice how the blue line is similar to an item response function.  That is not a coincidence.

FastTest Iteman Psychometrics Analysis

 

Now, let’s look at another one, which is more interesting.  Here’s a vocab question about the word confectioner.  Note that only 37% of the students get it right… even though there is a 25% chance of getting it right just by guessing!  However, the point-biserial discrimination remains very strong at 0.49.  That means it is a really good item.  It’s just hard, which means it does a great job of differentiating among the top students.

Confectioner confetti
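The two headline statistics in these examples, the item P value and the point-biserial, are simple to compute by hand. Below is a minimal Python sketch using a tiny made-up response matrix; it is not the Iteman algorithm, and it assumes the uncorrected point-biserial, i.e., the Pearson correlation between the 0/1 item score and the total score including that item.

```python
from math import sqrt


def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scored responses: rows are examinees, columns are items.
# 1 = correct, 0 = incorrect. Invented for illustration only.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

totals = [sum(row) for row in responses]  # number-correct score per examinee
n_items = len(responses[0])

p_values = []
point_biserials = []
for i in range(n_items):
    item = [row[i] for row in responses]
    p_values.append(sum(item) / len(item))            # proportion correct
    point_biserials.append(pearson_r(item, totals))   # item-total correlation

for i in range(n_items):
    print(f"Item {i + 1}: P = {p_values[i]:.2f}, Rpbis = {point_biserials[i]:.2f}")
```

A low P value with a high point-biserial, as in the confectioner item, is the pattern of a hard but well-functioning question.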

 

Psychometrics looks fun!  How can I join the band?

You will need a graduate degree.  I recommend you look at the NCME website with resources for students.  Good luck!

Already have a degree and looking for a job?  Here are the two sites that I recommend:

NCME – Also has a job listings page that is really good (ncme.org)

Horizon Search – Headhunter for Psychometricians and I/O Psychologists

Samejima’s (1969) Graded Response Model (GRM, sometimes SGRM) is an extension of the two parameter logistic model (2PL) within the item response theory (IRT) paradigm.  IRT provides a number of benefits over classical test theory, especially regarding the treatment of polytomous items; learn more about IRT vs. CTT here.

 

What is the Graded Response Model?

The GRM is a family of latent trait mathematical models for graded responses, developed by Fumiko Samejima (1969) and widely used since then.  (A latent trait is a variable that is not directly measurable, e.g., a person’s level of neurosis, conscientiousness, or openness.)  The GRM is also known as the Ordered Categorical Responses Model, as it deals with ordered polytomous categories, which can arise from either constructed-response or selected-response items where examinees can obtain various levels of scores, such as 0-4 points. In this case, the categories are 0, 1, 2, 3, and 4, and they are ordered. ‘Ordered’ means what it says: there is a specific order or ranking of responses. ‘Polytomous’ means that the responses are divided into more than two categories, i.e., not just correct/incorrect or true/false.

 

When should I use the GRM?

This family of models is applicable when responses to an item can be classified into more than two ordered categories (something more than correct/incorrect), such as different degrees of achievement in a solution to a problem, levels of agreement on a Likert scale, or frequency ratings for a certain statement. The GRM covers both homogeneous and heterogeneous cases; the homogeneous case implies that the discriminating power underlying a thinking process is constant throughout the range of attitude or reasoning.

Samejima (1997) notes that the GRM is reasonable to employ when examinees are scored on degree of correctness (e.g., incorrect, partially correct, correct) or when measuring people’s attitudes and preferences, as in Likert-scale attitude surveys (e.g., strongly agree, agree, neutral, disagree, strongly disagree). For instance, the GRM can be used in an extroversion scoring model that treats “I like to go to parties” as a high-difficulty item and “I like to go out for coffee with a close friend” as an easy one.

Here are some examples of assessments where GRM is utilized:

  • Survey attitude questions using responses like ‘strongly disagree, disagree, neutral, agree, strongly agree’
  • Multiple response items, such as a list of 8 animals and student selects which 3 are reptiles
  • Drag and drop or other tech enhanced items with multiple points available
  • Letter grades assigned to an essay: A, B, C, D, and E
  • Essay responses graded on a 0-to-4 rubric

 

Why use the GRM?

There are three general goals of applying GRM:

  • estimating an ability level/latent trait
  • estimating how adequately the test questions measure that ability level/latent trait
  • evaluating the probability that a particular examinee will receive a specific score/grade on each question

Using item response theory in general (not just the GRM) provides a host of advantages.  It can help you validate the assessment.  Using the GRM can also enable adaptive testing.

 

How to calculate a response probability with the GRM?

There is a two-step process for calculating the probability that an examinee selects a certain category in a given question. The first step is to find the probability that an examinee with a given ability level selects category m or greater in a given question:

$$P^*_m(\Theta) = \frac{e^{1.7a(\Theta - b_m)}}{1 + e^{1.7a(\Theta - b_m)}}$$

where

1.7  is the scale factor

a  is the discrimination of the question

bm  is the boundary (threshold) location for category m, i.e., the ability level at which the probability of choosing category m or higher is 0.5

e  is the constant approximately equal to 2.718

Θ  is the ability level

P*m(Θ) = 1  if  m = 1  since responding in the lowest category or any higher one is a certain event

P*m(Θ) = 0  if  m = M + 1  since the probability of responding in a category above the highest one is zero.

 

The second step is to find a probability that an examinee responds in a given category:

$$P_m(\Theta) = P^*_m(\Theta) - P^*_{m+1}(\Theta)$$

This formula describes the probability of choosing a specific response to the question for each level of the ability it measures.
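The two steps can be sketched in Python as follows. This assumes the usual logistic form of the boundary function with the 1.7 scale factor, and the item parameters (a = 1.2 and the four boundary locations) are hypothetical values chosen only to show the mechanics; the function names are illustrative as well.

```python
from math import exp

D = 1.7  # scale factor


def boundary_probability(theta, a, b):
    """Step 1: probability of responding in a given category or higher."""
    return 1.0 / (1.0 + exp(-D * a * (theta - b)))


def category_probabilities(theta, a, boundaries):
    """Step 2: probability of responding in each of the M ordered categories.

    boundaries holds the M - 1 ordered boundary locations (b for categories 2..M).
    """
    # P* is 1 for the lowest category and 0 for the category "above" the highest.
    p_star = [1.0] + [boundary_probability(theta, a, b) for b in boundaries] + [0.0]
    # Each category probability is the difference of adjacent boundary probabilities.
    return [p_star[m] - p_star[m + 1] for m in range(len(p_star) - 1)]


# Hypothetical 5-category Likert item.
a = 1.2
boundaries = [-1.5, -0.3, 0.8, 1.9]

for theta in (-2.0, 0.0, 2.0):
    probs = category_probabilities(theta, a, boundaries)
    print(theta, [round(p, 3) for p in probs])
```

Plotting these probabilities across a range of theta values gives the category response functions that IRT software displays for each item.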

 

How do I implement the GRM on my assessment?

You need item response theory software.  Start by downloading Xcalibre for free.  Below are outputs for two example items.

How to interpret this?  The GRM uses category response functions which show the probability of selecting a given response as a function of theta (trait or ability).  For item 6, we see that someone of theta -3.0 to -0.5 is very likely to select “2” on the Likert scale (or whatever our response is).  Examinees above -0.5 are likely to select “3” on the scale.  But on Item 10, the green curve is low and not likely to be chosen at all; examinees from -2.0 to +2.0 are likely to select “3” on the Likert scale, and those above +2.0 are likely to select “4”.  Item 6 is relatively difficult, in a sense, because no one chose “4.”

Xcalibre - graded response model easy

Xcalibre - graded response model difficult

References

Keller, L. A. (2014). Item Response Theory Models for Polytomous Response Data. Wiley StatsRef: Statistics Reference Online.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17(4), 2. doi:10.1002/j.2333-8504.1968.tb00153.x.

Samejima, F. (1997). Graded response model. In W. J. van der Linden and R. K. Hambleton (Eds), Handbook of Modern Item Response Theory, (pp. 85–100). Springer-Verlag.