Digital literacy assessments are a critical aspect of modern educational and workforce development initiatives. In today’s fast-paced, technology-driven world, digital literacy is essential for education, work, and daily life. Defined broadly as the ability to navigate, evaluate, and create information in digital formats, digital literacy is no longer a “nice-to-have” but a “must-have” skill set. Measuring this complex construct requires strong validity documentation, and psychometrics provides the theoretical and practical tools to do so effectively. This blog delves into the intersection of digital literacy assessments and psychometrics, exploring the frameworks, challenges, and innovations shaping the field.

An important assessment in the field is the Programme for the International Assessment of Adult Competencies (PIAAC), which evaluates digital literacy and related skills across countries.  If you are interested in research on this topic, the OECD’s PIAAC website provides extensive documentation on how its assessments were developed, as well as results data.

 

Understanding Digital Literacy

Digital literacy refers to the ability to use digital tools and technologies effectively to access, analyze, and create information. It encompasses a broad range of skills, from basic functions like using devices and navigating the Internet to more advanced skills such as cybersecurity awareness, digital communication, and content creation. According to frameworks like the European Commission’s DigComp and UNESCO’s Global Framework, digital literacy includes:

  1. Information Literacy: The ability to locate, evaluate, and use information effectively.
  2. Communication and Collaboration: The ability to interact, communicate, and collaborate with others through digital technologies.
  3. Media Literacy: Understanding and critically analyzing media content and formats.
  4. Technical Literacy: Proficiency in using devices, software, and platforms.
  5. Digital Safety: Awareness of cybersecurity and ethical considerations in the digital space.
  6. Problem Solving: The ability to identify needs and problems and resolve them in different digital environments.

These subdomains highlight the multidimensional nature of digital literacy, making it a challenging construct to measure. However, with clear frameworks and psychometric methodologies, we can create assessments that not only evaluate these skills but also guide their development.

 

Digital Literacy Statistics

Eurostat found that “54% in the EU aged 16 to 74 had at least basic overall digital skills in 2021,” while a 2012 report from the U.S. Department of Education estimated that “16 percent of adults (31.8 million Americans) lack sufficient comfort or competence with technology to use a computer” (page 3).

To elaborate on multinational digital literacy statistics, the National Center for Education Statistics compared the average scores of adults ages 16-65 in 26 jurisdictions including the United States and identified “a mixed picture, with U.S. adults scoring higher than the International Average in Literacy, but lower in both Numeracy and Digital Problem Solving.”

As new technological innovations emerge, new skills must be acquired. For example, one must possess skills beyond just knowing how to type on a keyboard. One must also understand how to evaluate information found online, how to communicate securely, and how to create digital content. Digital literacy is a multifaceted competency that affects one’s personal, as well as professional, growth.

What is the Importance of Digital Literacy Assessments?

Digital Literacy Assessments are the process of evaluating an individual’s proficiency in using technologies and tools. There are several reasons for the importance of this type of assessment:

  1. Ability to Measure Skill Levels: A digital literacy assessment helps determine where an individual stands in terms of their digital skills. It allows educators, employers, and policymakers to determine whether individuals are adequately prepared for the digital demands of today.
  2. Targeted Training: After analyzing the results of an assessment, tailored training programs can be developed to improve specific areas of an individual’s digital literacy. For example, an employee struggling with cybersecurity can receive focused training to improve their competence and understanding of this area.
  3. Empowering Learners and Workers: Understanding one’s digital literacy level allows individuals to take control of their learning and development, leading to improved confidence in using technology. This can reduce the digital divide that hinders groups, such as low-income communities, in their efforts to access education and employment.
  4. Enhancing Educational and Professional Outcomes: Digital literacy directly impacts academic success and workplace productivity. For example, a student who is well-versed in using a word processor will find writing essays and workplace reports easier than a student with only introductory knowledge of the same. The ability to assess and improve such skills ensures that individuals are better equipped to excel in both their academic and professional lives.

 

Types of Digital Literacy Assessments

Digital literacy assessments can take multiple forms, ranging from self-assessments to formal evaluations. Below are a few types of digital literacy assessments that are commonly used:


    1. Self-Assessment Questionnaires: These surveys often ask individuals to rate their own digital skills across various areas such as Internet navigation, software use, and online communication. While these are not as accurate as other methods, self-assessments can give estimates of an individual’s strengths and weaknesses pertaining to their digital skills.
    2. Standardized Tests: Some organizations and educational institutions use standardized tests, which evaluate digital literacy in a controlled setting. These assessments often measure proficiency in tasks such as document creation, online research, and/or responsible use of social media.
    3. Performance-Based Assessments: These simulate real-world tasks to measure practical skills. For example:
      • Using a search engine to find credible information.
      • Identifying and responding to phishing emails.
      • Creating digital content like a blog post or infographic.

      Performance-based assessments are often considered the gold standard because they reflect authentic digital tasks. However, they can be resource-intensive to develop and score.

    4. Knowledge Tests: Traditional knowledge-based tests evaluate understanding of digital concepts, such as:
        • What is a secure password?
        • How do algorithms affect social media feeds?

      Though straightforward to implement, these tests may not fully capture applied skills.

    5. Project-Based Assessments: These involve more extensive tasks in which individuals create digital content or solve real-world problems. These can include designing a website, developing a mobile app, or creating a digital marketing plan. These provide a hands-on way to assess how well individuals can apply their digital knowledge.
    6. Behavioral Data Analysis: This innovative approach uses data from digital interactions (e.g., how users navigate websites or apps) to infer literacy levels. It offers rich insights but raises ethical concerns about privacy.

 

Psychometrics in Digital Literacy

Psychometrics, the science of measurement, provides tools to ensure digital literacy assessments are valid, reliable, and fair. Here’s how psychometric principles are applied:

1. Reliability: Reliability ensures consistent results across different administrations. For example, a test-taker should earn a similar score whether they take the test on a Monday or a Friday (test-retest reliability), and items targeting the same skill should produce consistent responses (internal consistency).

High reliability is critical for confidence in assessment results.
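
To make the internal-consistency idea concrete, here is a minimal sketch (my own illustration, not from the original post) that computes coefficient alpha for a small, hypothetical matrix of scored responses:

```python
import numpy as np

def cronbachs_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a persons-by-items matrix of item scores."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 0/1 responses: 6 examinees by 4 digital-skills items
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
])
print(round(cronbachs_alpha(responses), 3))
```

In practice, you would run this on the full response matrix from a pilot administration and look for a value high enough for the intended use of the scores.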

2. Validity: Validity ensures the test measures what it claims to measure. Psychometricians focus on:

  • Content Validity: Does the test cover all aspects of digital literacy?
  • Construct Validity: Does the test align with theoretical models?
  • Criterion Validity: Do test scores correlate with real-world performance?

A test measuring digital literacy should reflect not just theoretical understanding but also practical application.

3. Item Response Theory (IRT): IRT models how individual test items relate to the overall ability being measured. It allows for:

  • Adaptive testing, where questions adjust based on the test-taker’s responses.
  • More precise scoring by accounting for item difficulty and discrimination.
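
To make the IRT idea concrete, here is a minimal sketch (my own illustration, with hypothetical item parameters) of a two-parameter logistic item response function, which gives the probability of a correct response as a function of the examinee’s theta and the item’s difficulty and discrimination:

```python
import numpy as np

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of answering the item correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical items: (discrimination a, difficulty b)
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]

for theta in (-1.0, 0.0, 1.0):
    probs = [round(p_correct_2pl(theta, a, b), 2) for a, b in items]
    print(f"theta={theta:+.1f}  P(correct) per item: {probs}")
```

An adaptive test uses functions like this to select the next item whose difficulty best matches the current theta estimate, which is how it achieves precise scores with fewer items.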

4. Addressing Bias: Bias in assessments can arise from socioeconomic, cultural, or technical differences. Psychometricians use techniques like differential item functioning (DIF) analysis to identify and mitigate bias, ensuring fairness.
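
One widely used DIF technique is the Mantel-Haenszel procedure, which compares item performance for a reference and a focal group after matching examinees on total score. The sketch below is my own simplified illustration (the function and variable names are hypothetical), not a description of any particular software:

```python
import numpy as np

def mantel_haenszel_dif(correct, group, matching_score):
    """Mantel-Haenszel common odds ratio and ETS delta statistic for one item.

    correct        : 0/1 responses to the studied item
    group          : 0 = reference group, 1 = focal group
    matching_score : total test score used to match examinees
    """
    correct, group, matching_score = map(np.asarray, (correct, group, matching_score))
    num, den = 0.0, 0.0
    for k in np.unique(matching_score):
        m = matching_score == k
        a = np.sum((group[m] == 0) & (correct[m] == 1))  # reference, correct
        b = np.sum((group[m] == 0) & (correct[m] == 0))  # reference, incorrect
        c = np.sum((group[m] == 1) & (correct[m] == 1))  # focal, correct
        d = np.sum((group[m] == 1) & (correct[m] == 0))  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    odds_ratio = num / den
    delta = -2.35 * np.log(odds_ratio)  # |delta| near 1.5 or more is conventionally flagged
    return odds_ratio, delta
```

Items that get flagged would then be reviewed by content experts and revised or removed before the assessment is finalized.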

How to Implement Digital Literacy Assessments

Follow these suggested steps to implement an effective digital literacy assessment:

      1. Define the Scope: Identify which digital literacy skills are most relevant for the context – for example, whether these skills will be used in an educational institution, in a corporate setting, or for general purposes.
      2. Choose the Right Tool: Select the appropriate assessment method based on the needs of the individual being assessed. Consider using a combination of tests, performance tasks, and self-assessments.
      3. Analyze Results: Review the results of the assessment to identify strengths and weaknesses to guide future training and support needs.
      4. Provide Feedback: Offer personalized feedback to individuals, highlighting areas of improvement and offering resources for further learning.
      5. Regular Re-assessment: Because digital technology evolves continuously, it is crucial to re-assess digital literacy periodically so that individuals keep acquiring new skills and the ability to use new tools.

 

Innovations in Digital Literacy Assessment

1. Gamified Assessments

Gamification makes assessments engaging and interactive. For example:

        • A cybersecurity game in which users identify phishing attempts or secure accounts.
        • A digital collaboration exercise in which users solve problems in a virtual workspace.

2. Adaptive Testing

Adaptive tests use algorithms to tailor questions based on a test-taker’s responses. This approach:

        • Reduces test length without sacrificing reliability.
        • Provides a more personalized assessment experience.

3. Data-Driven Insights

AI and machine learning analyze patterns in test responses and digital interactions. For example:

        • Tracking how users evaluate online information to identify gaps in critical thinking.
        • Analyzing social media behavior for insights into media literacy.

4. Cross-Cultural and Global Tools

Global frameworks require assessments that work across diverse cultural contexts. Localization involves:

        • Translating assessments into multiple languages.
        • Adapting scenarios to reflect local digital practices.

Conclusion

In today’s increasingly technology-driven world, digital literacy is a vital skill for everyone. Digital literacy assessments are invaluable tools for understanding how skillfully individuals can navigate the digital landscape and where improvements can be made. By accurately assessing digital skills and providing targeted training, we can ensure that people of all ages and backgrounds are prepared for their futures. As new technologies continue to emerge, individuals’ digital literacy skills must be updated to keep pace.

Situational judgment tests (SJTs) are a type of assessment typically used in a pre-employment context to assess candidates’ soft skills and decision-making abilities. As the name suggests, we are not trying to assess something like knowledge, but rather the judgments or likely behaviors of candidates in specific situations, such as an unruly customer. These tests have become a critical component of modern recruitment, offering employers valuable insights into how applicants approach real-world scenarios, with higher fidelity than traditional assessments.

The importance of tools like SJTs becomes even clearer when considering the significant costs of poor hiring decisions. The U.S. Department of Labor suggests that the financial impact of a poor hiring decision can amount to roughly 30% of the employee’s annual salary. Similarly, CareerBuilder reports that around three-quarters of employers face an average loss of approximately $15,000 for each bad hire due to various costs such as training, lost productivity, and recruitment expenses. Gallup’s State of the Global Workplace Report 2022 further emphasizes the broader implications, revealing that disengaged employees—often a result of poor hiring practices—cost companies globally $8.8 trillion annually in lost productivity.

In this article, we’ll define situational judgment tests, explore their benefits, and provide an example question to better understand how they work.

What is a Situational Judgment Test?

A Situational Judgment Test (SJT) is a psychological assessment tool designed to evaluate how individuals handle hypothetical workplace scenarios. These tests present a series of realistic situations and ask candidates to choose or rank responses that best reflect how they would act. Unlike traditional aptitude tests that measure specific knowledge or technical skills, SJTs focus on soft skills like problem-solving, teamwork, communication, and adaptability. They can provide meaningful incremental validity over cognitive and job knowledge assessments.

SJTs are widely used in recruitment for roles where interpersonal and decision-making skills are critical, such as management, customer service, and healthcare. They can be administered in various formats, including multiple-choice questions, multiple-response items, video scenarios, or interactive simulations.

Example of a Situational Judgment Test Question

Here’s a typical SJT question to illustrate the concept:

 

Scenario:

You are leading a team project with a tight deadline. One of your team members, who is critical to the project’s success, has missed several key milestones. When you approach them, they reveal they are overwhelmed with personal issues and other work commitments.


Question:

What would you do in this situation?

– Report the issue to your manager and request their intervention.

– Offer to redistribute some of their tasks to other team members to ease their workload.

– Have a one-on-one meeting to understand their challenges and develop a plan together.

– Leave them to handle their tasks independently to avoid micromanaging.

 

Answer Key:

While there’s no definitive “right” answer in SJTs, some responses align better with desirable workplace behaviors. In this example, Option 3 demonstrates empathy, problem-solving, and leadership, which are highly valued traits in most professional settings.

 

Because SJTs typically do not have an overtly correct answer, they will sometimes have a partial credit scoring rule. In the example above, you might elect to give 2 points to Option 3 and 1 point to Option 2. Perhaps even a negative point to some options!
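
Here is a minimal sketch of how such a partial-credit rule might be implemented (my own illustration; the point values follow the example above, but the code is not from the original article):

```python
# Hypothetical partial-credit key for the example item above:
# Option 3 is best (2 points), Option 2 is acceptable (1 point),
# Option 4 is counterproductive (-1 point), Option 1 is neutral (0 points).
SCORING_KEY = {1: 0, 2: 1, 3: 2, 4: -1}

def score_sjt_item(selected_option: int, key: dict = SCORING_KEY) -> int:
    """Return the partial-credit score for a single SJT response."""
    return key[selected_option]

candidate_responses = [3, 2, 4, 1]                       # four hypothetical candidates
print([score_sjt_item(r) for r in candidate_responses])  # [2, 1, -1, 0]
```

These item scores can then be summed classically or fed into a polytomous IRT model such as the generalized partial credit model.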

Potential topics for SJTs

Customer service – Given a video of an unruly customer, how would you respond?

Difficult coworker situation – Like the previous example, how would you find a solution?

Police/Fire – If you made a routine traffic stop and the driver was acting intoxicated and belligerent, what would you do?

How to Develop and Deliver an SJT

Development of an SJT is typically more complex than that of knowledge-based tests, both because it is more difficult to come up with the topic/content of the items and because plausible distractors and scoring rules are harder to write. It can also get expensive if you are utilizing simulation formats or high-quality videos for which you hire real actors!

Here are some suggested steps:

  1. Define the construct you want to measure
  2. Draft item content
  3. Establish the scoring rules
  4. Have items reviewed by experts
  5. Create videos/simulations
  6. Set your cutscore (Standard setting)
  7. Publish the test

SJTs are almost always delivered by computer nowadays because it is so easy to include multimedia. Below is an example of what this will look like, using ASC’s FastTest platform.

FastTest - Situational Judgment Test SJT example

Advantages of Situational Judgment Tests

1. Realistic Assessment of Skills

Unlike theoretical tests, SJTs mimic real-world situations, making them a practical way to evaluate how candidates might behave in the workplace. This approach helps employers identify individuals who align with their organizational values and culture.

2. Focus on Soft Skills

Technical expertise can often be measured through other assessments or qualifications, but soft skills like emotional intelligence, adaptability, and teamwork are harder to gauge. SJTs provide insights into these intangible qualities that are crucial for success in many roles.

3. Reduced Bias

SJTs focus on behavior rather than background, making them a fairer assessment tool. They can help level the playing field by emphasizing practical decision-making over academic credentials or prior experience.

4. Efficient Screening Process

For roles that receive a high volume of applications, SJTs offer a quick and efficient way to filter candidates. By identifying top performers early, organizations can save time and resources in subsequent hiring stages.

5. Improved Candidate Experience

Interactive and scenario-based assessments often feel more engaging to candidates than traditional tests. This positive experience can enhance a company’s employer brand and attract top talent.

Tips for Success in Taking an SJT

If you’re preparing to take a situational judgment test, keep these tips in mind:

– Understand the Role: Research the job to better understand the types of situations that might be encountered, and think through your responses ahead of time.

– Understand the Company: Research the organization to align your responses with its values, culture, and expectations.

– Prioritize Key Skills: Many SJTs assess teamwork, leadership, and conflict resolution, so focus on demonstrating these attributes.

– Practice: Familiarize yourself with sample questions to build confidence and improve your response strategy.

Conclusion

Situational judgment tests are a powerful tool for employers to evaluate candidates’ interpersonal and decision-making abilities in a realistic context, and in a way that is much more scalable than 1-on-1 interviews.

For job seekers, they offer an opportunity to showcase soft skills that might not be evident from a resume or educational record alone. As their use continues to grow across industries, understanding and preparing for SJTs can give candidates a competitive edge in the job market.

Additional Resources on SJTs

Lievens, F., & Sackett, P. R. (2012). The validity of interpersonal skills assessment via SJTs: A review. Journal of Applied Psychology, 97(1), 3–17.

Weekley, J. A., & Ployhart, R. E. (Eds.). (2005). Situational judgment tests: Theory, measurement, and application. Psychology Press.

Christian, M. S., Edwards, B. D., & Bradley, J. C. (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validity. Personnel Psychology, 63(1), 83–117.


A z-score measures the distance between a raw score and a mean in standard deviation units, conveying the location of an observation in a normal distribution, of which scores on a test are just one of many examples. The z-score is also known as a standard score since it enables comparing scores on various variables by standardizing the distribution of scores. A standard normal distribution (also known as the z-score distribution or probability distribution) is a normally shaped distribution with a mean of 0 and a standard deviation of  1. A T-score is another example of standardized scores, which translates a z-score from N(0,1) to N(50,10).

What does a z-score mean?

The z-score can be positive or negative; the sign depends on whether the observation is above or below the mean. For instance, a z of +2 indicates that the raw score (data point) is two standard deviations above the mean, while a z of -1 signifies that it is one standard deviation below the mean. A z of 0 corresponds to the mean itself. Z-scores generally range from -3 standard deviations (far left of the normal distribution curve) to +3 standard deviations (far right of the curve), which covers about 99.7% of the population. There are people outside that range (e.g., extremely gifted students), but in most cases it is difficult to measure the extremes and there is little practical difference.

Details and examples are below.  If you would like to explore the concept on your own, here’s a free tool in Excel that you can download!

How to calculate a z-score

Here is the formula for calculating a z-score:

z = (x − μ) / σ

where

     x = the individual value

     μ = the mean

     σ = the standard deviation.

Interpretation of the formula:

  • Subtract the mean of the values from the individual value
  • Divide the difference by the standard deviation.

Here is a graphical depiction of the standard normal curve and how the z-score relates to other metrics.

 


z-scores vs Scaled Scores in Assessment

Many exams implement scaled scores when they report scores to examinees and other stakeholders.  These are often just a repackaging of z-scores because nobody wants to receive a negative score!  Something like -0.012 might seem disheartening if you don’t know what it means – which is that you are an average student.

The scaled scoring often uses the plus/minus 3 SD paradigm; the SAT has a mean of 500 and a standard deviation of 100, so the range is 200 to 800.  The ACT nominally has a mean of 18 and an SD of 6, hence scores ranging from 0 to 36.  IQ tests tend to have a mean of 100 and a standard deviation of 15.
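
Here is a minimal sketch (my own illustration, using the nominal means and SDs mentioned above) of how a z-score is repackaged onto these familiar scales:

```python
def z_to_scale(z: float, mean: float, sd: float) -> float:
    """Convert a z-score onto a reporting scale with the given mean and SD."""
    return mean + z * sd

z = 1.0  # one standard deviation above average
print("T score :", z_to_scale(z, 50, 10))    # 60
print("SAT-like:", z_to_scale(z, 500, 100))  # 600
print("ACT-like:", z_to_scale(z, 18, 6))     # 24
print("IQ-like :", z_to_scale(z, 100, 15))   # 115
```

This is the same equivalence noted below: a 600 on the SAT, a 24 on the ACT, and an IQ of 115 all sit about one standard deviation above the mean.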

Advantages of using a z-score

When you standardize the raw data by transforming them into z-scores, you receive the following benefits:

  • Identify outliers
  • Understand where an individual score fits into a distribution
  • Normalize scores for statistical decision-making (e.g., grading on a curve)
  • Calculate probabilities and percentiles using the standard normal distribution (e.g., percentile rank)
  • Compare scores on different distributions with different means and standard deviations; a score of 600 on the SAT is equivalent to 24 on the ACT or 115 on an IQ test (nominally).

Example of using a z-score in real life situation

Let’s imagine that there is a set of SAT scores from students, and this data set follows a normal distribution with a mean score of 500 and a standard deviation of 100. Suppose we need to find the probability that these SAT scores exceed 650. In order to standardize our data, we have to find the z-score for 650. The z will tell us how many standard deviations away from the mean 650 is.

  • Subtracting the mean from the individual value:

x = 650, μ = 500

x − μ = 650 − 500 = 150

  • Dividing the obtained difference by the standard deviation:

σ = 100

z = 150 ÷ 100 = 1.5

The z for the value of  650  is  1.5, i.e.  650 is 1.5 standard deviations above the mean in our distribution.

If you look up this z-score on a conversion table, you will see that it says  0.93319.  This means that a score of  650  is at the  93rd  percentile of students.
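
The same lookup can be done in code; here is a minimal sketch (not part of the original article) using scipy’s standard normal distribution:

```python
from scipy.stats import norm

mean, sd, score = 500, 100, 650
z = (score - mean) / sd          # 1.5
percentile = norm.cdf(z)         # P(Z <= 1.5), about 0.93319
prob_exceeding = 1 - percentile  # probability of scoring above 650

print(round(z, 2), round(percentile, 5), round(prob_exceeding, 5))
```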

Additional resources

Khan Academy

Normal Distribution (Wikipedia)


Confidence intervals (CIs) are a fundamental concept in statistics, used extensively in assessment and measurement to estimate the reliability and precision of data. Whether in scientific research, business analytics, or health studies, confidence intervals provide a range of values that likely contain the true value of a parameter, giving us a better understanding of uncertainty. This article dives into the concept of confidence intervals, how they are used in assessments, and real-world applications to illustrate their importance.

What Is a Confidence Interval?

A confidence interval is a range of values, derived from sample data, that is likely to contain the true population parameter. Instead of providing a single estimate (like a mean or proportion), it gives a range of plausible values, often expressed with a specific level of confidence, such as 95% or 99%. For example, if a survey estimates the average height of adult males to be 175 cm with a 95% CI of 170 cm to 180 cm, it means that we can be 95% confident that the true average height of all adult males falls within this range.

These are made by taking plus or minus a standard error times a factor, around the single estimate, creating a lower bound and upper bound for a range.

Upper Bound = Estimate + factor * standard_error

Lower Bound = Estimate – factor * standard_error

The value of the factor depends upon your assumptions and the desired percentage in the range.  You might want 90%, 95%, or 99%.  With the standard normal distribution, the factor for 95% is 1.96 (see any table of z-scores), which you can round to 2 for quick mental math.  So, in the example above, we might find that the average height for a sample of 100 adult males is 175 cm with a standard deviation of 25.  The standard error of the mean is SD/sqrt(N), which in this case is 25/sqrt(100) = 2.5.  If we take plus or minus roughly 2 times the standard error of 2.5, that is how we get a confidence interval of 170 to 180 for the true population mean.
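
Here is a minimal sketch of that arithmetic (my own illustration, using the hypothetical sample above):

```python
import math

mean, sd, n = 175.0, 25.0, 100  # hypothetical sample of adult male heights (cm)
factor = 1.96                   # ~95% coverage under a normal distribution

standard_error = sd / math.sqrt(n)       # 2.5
lower = mean - factor * standard_error   # about 170.1
upper = mean + factor * standard_error   # about 179.9

print(f"95% CI: {lower:.1f} to {upper:.1f}")
```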

This example is from general statistics, but confidence intervals are used for specific reasons in assessment.

 

How Confidence Intervals Are Used in Assessment and Measurement


1. Statistical Inference

CIs play a crucial role in making inferences about a population based on sample data. For instance, researchers studying the effect of a new drug might calculate a CI for the average improvement in symptoms. This helps determine if the drug has a significant effect compared to a placebo.

2. Quality Control

In industries like manufacturing, CIs help assess product consistency and quality. For example, a factory producing light bulbs may use CIs to estimate the average lifespan of their products. This ensures that most bulbs meet performance standards.

3. Education and Testing

In educational assessments, CIs can provide insights into test reliability and student performance. For instance, a standardized test score might come with a CI to account for variability in test conditions or scoring methods.

Real-World Examples of Confidence Intervals

1. Medical Research


In clinical trials, CIs are often used to estimate the effectiveness of treatments. Suppose a study finds that a new vaccine reduces the risk of a disease by 40%, with a 95% CI of 30% to 50%. This means there’s a high probability that the true effectiveness lies within this range, helping policymakers make informed decisions.

2. Business Analytics

Businesses use CIs to forecast sales, customer satisfaction, or market trends. For example, a company surveying customer satisfaction might report an average satisfaction score of 8 out of 10, with a 95% CI of 7.5 to 8.5. This helps managers gauge customer sentiment while accounting for survey variability.

3. Environmental Studies

Environmental scientists use CIs to measure pollution levels or climate changes. For instance, if data shows that the average global temperature has increased by 1.2°C over the past century, with a CI of 0.9°C to 1.5°C, this range provides a clearer picture of the uncertainty in the estimate.

Confidence Intervals in Education: A Closer Look

CIs are particularly valuable in education, where they help assess the reliability and validity of test scores and other measurements. By understanding and applying CIs, educators, test developers, and policymakers can make more informed decisions that impact students and learning outcomes.

1. Estimating a Range for True Score

CIs are often paired with the standard error of measurement (SEM) to provide insights into the reliability of test scores. SEM quantifies the amount of error expected in a score due to various factors like testing conditions or measurement tools.  It gives us a range for the true score around the observed score (see the technical note near the end of this article).

For example, consider a standardized test with a scaled score range of 200 to 800. If a student scores 700 with an SEM of 20, the 95% CI for their true score is calculated as:

     Score ± (SEM × Z-value for 95% confidence)

     700 ± (20 × 1.96) = 700 ± 39.2

Thus, the 95% CI is approximately 660 to 740. This means we can be 95% confident that the student’s true score lies within this range, accounting for potential measurement error.  Because this matters, it is sometimes factored into high-stakes decisions, such as setting the cutscore on a screening test used for hiring at a company.

The reasoning for this is accurately described by this quote from Prof. Michael Rodriguez, noted by Mohammed Abulela on LinkedIn:

A test score is a snapshot estimate, based on a sample of knowledge, skills, or dispositions, with a standard error of measurement reflecting the uncertainty in that score-because it is a sample. Fair test score interpretation employs that standard error and does not treat a score as an absolute or precise indicator of performance.
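
Here is a minimal sketch of that true-score interval (my own illustration of the calculation above):

```python
observed_score = 700
sem = 20      # standard error of measurement
z_95 = 1.96   # factor for a 95% interval

margin = z_95 * sem  # 39.2
lower, upper = observed_score - margin, observed_score + margin

print(f"95% CI for the true score: {lower:.0f} to {upper:.0f}")  # about 660 to 740
```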

2. Using Standard Error of Estimate (SEE) for Predictions

The standard error of the estimate (SEE) is used to evaluate the accuracy of predictions in models, such as predicting student performance based on prior data.

For instance, suppose that a college readiness score ranges from 0 to 500, and is predicted by a student’s school grades and admissions test score.  If a predictive model estimates a student’s college readiness score to be 450, with an SEE of 25, the 95% confidence interval for this predicted score is:

     450 ± (25 × 1.96) = 450 ± 49

This results in a confidence interval of 401 to 499, indicating that the true readiness score is likely within this range. Such information helps educators evaluate predictive assessments and develop better intervention strategies.

3. Evaluating Group Performance


CIs are also used to assess the performance of groups, such as schools or districts. For instance, if a district’s average math score is 75 with a 95% CI of 73 to 77, policymakers can be fairly confident that the district’s true average falls within this range. This insight is crucial for making fair comparisons between schools or identifying areas that need improvement.

4. Identifying Achievement Gaps

When studying educational equity, CIs help measure differences in achievement between groups, such as socioeconomic or demographic categories. For example, if one group scores an average of 78 with a CI of 76 to 80 and another scores 72 with a CI of 70 to 74, the overlap (or lack thereof) in intervals can indicate whether the gap is statistically significant or might be due to random variability.

5. Informing Curriculum Development

CIs can guide decisions about curriculum and instructional methods. For instance, when pilot-testing a new teaching method, researchers might use CIs to evaluate its effectiveness. If students taught with the new method have scores averaging 85 with a CI of 83 to 87, compared to 80 (78 to 82) for traditional methods, educators might confidently adopt the new approach.

6. Supporting Student Growth Tracking

In long-term assessments, CIs help track student growth by providing a range around estimated progress. If a student’s reading level improves from 60 (58–62) to 68 (66–70), educators can confidently assert growth while acknowledging measurement variability.

Key Benefits of Using Confidence Intervals

  • Enhanced Decision-Making: CIs provide a range, rather than a single estimate, making decisions more robust and informed.
  • Clarity in Uncertainty: By quantifying uncertainty, confidence intervals allow stakeholders to understand the limitations of the data.
  • Improved Communication: Reporting findings with CIs ensures transparency and builds trust in the results.

 

How to Interpret Confidence Intervals

A common misconception is that a 95% CI means there’s a 95% chance the true value falls within the interval. Instead, it means that if we repeated the study many times, 95% of the calculated intervals would contain the true parameter. Thus, it’s a statement about the method, not the specific interval.  This is similar to the common misinterpretation of an experimental p-value as the probability that our alternative hypothesis is true; instead, it is the probability of obtaining results at least as extreme as ours if the null hypothesis is true.
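
This repeated-sampling interpretation is easy to check by simulation; below is a minimal sketch (my own illustration, not from the original article) that draws many samples from a known population and counts how often the 95% interval captures the true mean:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sd, n, n_studies = 100.0, 15.0, 50, 10_000

hits = 0
for _ in range(n_studies):
    sample = rng.normal(true_mean, sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lower = sample.mean() - 1.96 * se
    upper = sample.mean() + 1.96 * se
    hits += lower <= true_mean <= upper

print(f"Coverage across {n_studies} studies: {hits / n_studies:.3f}")  # close to 0.95
```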

Final Thoughts

CIs are indispensable in assessment and measurement, offering a clearer understanding of data variability and precision. By applying them effectively, researchers, businesses, and policymakers can make better decisions based on statistically sound insights.

Whether estimating population parameters or evaluating the reliability of a new method, CIs provide the tools to navigate uncertainty with confidence. Start using CIs today to bring clarity and precision to your analyses!

General intelligence, often symbolized as “g,” is a concept that has been central to psychology and cognitive science since the early 20th century. First introduced by Charles Spearman, general intelligence represents an individual’s overall cognitive ability. This foundational concept has evolved over the years and remains crucial in both academic and applied settings, particularly in assessment and measurement. Understanding general intelligence can help in evaluating mental abilities, predicting academic and career success, and creating reliable and valid assessment tools. This article delves into the nature of general intelligence, its assessment, and its importance in measurement fields.

What is General Intelligence?


General intelligence (GI), or “g,” is a theoretical construct referring to the common cognitive abilities underlying performance across various mental tasks. Spearman proposed that a general cognitive ability contributes to performance in a wide range of intellectual tasks. This ability encompasses multiple cognitive skills, such as reasoning, memory, and problem-solving, which are thought to be interconnected. In Spearman’s model, a person’s performance on any cognitive test relies partially on “g” and partially on task-specific skills.

For example, both solving complex math problems and understanding a new language involve specific abilities unique to each task but are also underpinned by an individual’s GI. This concept has been pivotal in shaping how we understand cognitive abilities and the development of intelligence tests.

To further explore the foundational aspects of intelligence, the Positive Manifold phenomenon demonstrates that most cognitive tasks tend to be positively correlated, meaning that high performance in one area generally predicts strong performance in others. You can read more about it in our article on Positive Manifold.

GI in Assessment and Measurement

The assessment of GI has been integral to psychology, education, and organizational settings for decades. Testing for “g” provides insight into an individual’s mental abilities and often serves as a predictor of various outcomes, such as academic performance, job performance, and life success.

  1. Intelligence Testing: Intelligence tests, like the Wechsler Adult Intelligence Scale (WAIS) and Stanford-Binet, aim to provide a measurement of GI. These tests typically consist of a variety of subtests measuring different cognitive skills, including verbal comprehension, working memory, and perceptual reasoning. The results are aggregated to produce an overall IQ score, representing a general measure of “g.” These scores are then compared to population averages to understand where an individual stands in terms of cognitive abilities relative to their peers.
  2. Educational Assessment: GI is often used in educational assessments to help identify students who may need additional support or advanced academic opportunities. For example, cognitive ability tests can assist in identifying gifted students who may benefit from accelerated programs or those who need extra resources. Schools also use “g” as one factor in admission processes, relying on tests like the SAT, GRE, and similar exams, which assess reasoning and problem-solving abilities linked to GI.
  3. Job and Career Assessments: Many organizations use cognitive ability tests as part of their recruitment processes. GI has been shown to predict job performance across many types of employment, especially those requiring complex decision-making and problem-solving skills. By assessing “g,” employers can gauge a candidate’s potential for learning new tasks, adapting to job challenges, and developing in their role. This approach is especially prominent in fields requiring high levels of cognitive performance, such as research, engineering, and management. One notable example is the Armed Services Vocational Aptitude Battery (ASVAB), a multi-test battery that assesses candidates for military service. The ASVAB includes subtests like arithmetic reasoning, mechanical comprehension, and word knowledge, all of which reflect diverse cognitive abilities. These individual scores are then combined into the Armed Forces Qualifying Test (AFQT) score, an overall measure that serves as a proxy for GI. The AFQT score acts as a threshold across military branches, with each branch requiring minimum scores.

Here are a few ASVAB-style sample questions that reflect different cognitive areas while collectively representing general intelligence:

  1. Arithmetic Reasoning:
    If a train travels at 60 mph for 3 hours, how far does it go?
    Answer: 180 miles
  2. Word Knowledge:
    What does the word “arduous” most nearly mean?
    Answer: Difficult
  3. Mechanical Comprehension:
    If gear A turns clockwise, which direction will gear B turn if it is directly connected?
    Answer: Counterclockwise

 

How GI is Measured


In measuring GI, psychometricians use a variety of statistical techniques to ensure the reliability and validity of intelligence assessments. One common approach is factor analysis, a statistical method that identifies the relationships between variables and ensures that test items truly measure “g” as intended.

Tests designed to measure general intelligence are structured to cover a range of cognitive functions, capturing a broad spectrum of mental abilities. Each subtest score contributes to a composite score that reflects an individual’s general cognitive ability. Assessments are also periodically normed, or standardized, so that scores remain meaningful and comparable over time. This standardization process helps maintain the relevance of GI scores in diverse populations.

 

The Importance of GI in Modern Assessment

GI continues to be a critical measure for various practical and theoretical applications:

  • Predicting Success: Numerous studies have linked GI to a wide array of outcomes, from academic performance to career advancement. Because “g” encompasses the ability to learn and adapt, it is often a better predictor of success than task-specific skills alone. In fact, meta-analyses indicate that g accounts for approximately 25% of the variance in job performance, highlighting its unparalleled predictive power in educational and occupational contexts.
  • Validating Assessments: In psychometrics, GI is used to validate and calibrate assessment tools, ensuring that they measure what they intend to. Understanding “g” helps in creating reliable test batteries and composite scores, making it essential for effective educational and professional testing.
  • Advancing Cognitive Research: GI also plays a vital role in cognitive research, helping psychologists understand the nature of mental processes and the structure of human cognition. Studies on “g” contribute to theories about how people learn, adapt, and solve problems, fueling ongoing research in cognitive psychology and neuroscience.

 

The Future of GI in Assessment

With advancements in technology, the assessment of GI is becoming more sophisticated and accessible. Computerized adaptive testing (CAT) and machine learning algorithms allow for more personalized assessments, adjusting test difficulty based on real-time responses. These innovations not only improve the accuracy of GI testing but also provide a more engaging experience for test-takers.

As our understanding of human cognition expands, the concept of GI remains a cornerstone in both educational and occupational assessments. The “g” factor offers a powerful framework for understanding mental abilities and continues to be a robust predictor of various life outcomes. Whether applied in the classroom, the workplace, or in broader psychological research, GI is a valuable metric for understanding human potential and guiding personal and professional development.

Factor analysis is a statistical technique widely used in research to understand and evaluate the underlying structure of assessment data. In fields such as education, psychology, and medicine, this approach to unsupervised machine learning helps researchers and educators identify latent variables, called factors, and which items or tests load on these factors.

For instance, when students take multiple tests, factor analysis can reveal whether these assessments are influenced by common underlying abilities, like verbal reasoning or mathematical reasoning. This insight is crucial for developing reliable and valid assessments, as it helps ensure that test items are measuring the intended constructs. It can also be used to evaluate whether items in an assessment are unidimensional, which is an assumption of both item response theory and classical test theory.

Why Do We Need Factor Analysis?

Factor analysis is a powerful tool for test validation. By analyzing the data, educators and psychometricians can confirm whether the items on a test align with the theoretical constructs they are designed to measure. This ensures that the test is not only reliable but also valid, meaning it accurately reflects the abilities or knowledge it intends to assess. Through this process, factor analysis contributes to the continuous improvement of educational tools, helping to enhance both teaching and learning outcomes.

What is Factor Analysis?

Factor analysis is a comprehensive statistical technique employed to uncover the latent structure underlying a set of observed variables. In the realms of education and psychology, these observed variables are often test scores or scores on individual test items. The primary goal of factor analysis is to identify underlying dimensions, or factors, that explain the patterns of intercorrelations among these variables. By analyzing these intercorrelations, factor analysis helps researchers and test developers understand which variables group together and may be measuring the same underlying construct.

One of the key outputs of factor analysis is the loading table or matrix (see below), which displays the correlations between the observed variables with the latent dimensions, or factors. These loadings indicate how strongly each variable is associated with a particular factor, helping to reveal the structure of the data. Ideally, factor analysis aims to achieve a “simple structure,” where each variable loads highly on one factor and has minimal loadings on others. This clear pattern makes it easier to interpret the results and understand the underlying constructs being measured. By providing insights into the relationships between variables, factor analysis is an essential tool in test development and validation, helping to ensure that assessments are both reliable and valid.

Confirmatory vs. Exploratory Factor Analysis

Factor analysis comes in two main forms: Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA), each serving distinct purposes in research.

Exploratory Factor Analysis (EFA) is typically used when researchers have little to no prior knowledge about the underlying structure of their data. It is a data-driven approach that allows researchers to explore the potential factors that emerge from a set of observed variables. In EFA, the goal is to uncover patterns and identify how many latent factors exist without imposing any preconceived structure on the data. This approach is often used in the early stages of research, where the objective is to discover the underlying dimensions that might explain the relationships among variables.

On the other hand, Confirmatory Factor Analysis (CFA) is a hypothesis-driven approach used when researchers have a clear theoretical model of the factor structure they expect to find. In CFA, researchers specify the number of factors and the relationships between the observed variables and these factors before analyzing the data. The primary goal of CFA is to test whether the data fit the hypothesized model. This approach is often used in later stages of research or in validation studies, where the focus is on confirming the structure that has been previously identified or theoretically proposed. By comparing the model fit indices, researchers can determine how well their proposed factor structure aligns with the actual data, providing a more rigorous test of their hypotheses.
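
As an illustration of the exploratory approach, here is a minimal sketch (my own, not from the original article) that uses scikit-learn’s FactorAnalysis to extract two rotated factors from a score matrix; the variables and data are hypothetical, and it assumes a scikit-learn version recent enough to support rotation="varimax":

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate a hypothetical examinee-by-variable score matrix (rows = people)
rng = np.random.default_rng(42)
n_people = 300
task_factor = rng.normal(size=n_people)        # latent "task" dimension
contextual_factor = rng.normal(size=n_people)  # latent "contextual" dimension
scores = np.column_stack([
    task_factor + rng.normal(scale=0.8, size=n_people),        # cognitive ability
    task_factor + rng.normal(scale=0.8, size=n_people),        # job knowledge
    contextual_factor + rng.normal(scale=0.8, size=n_people),  # integrity
    contextual_factor + rng.normal(scale=0.8, size=n_people),  # interview score
])

efa = FactorAnalysis(n_components=2, rotation="varimax")
efa.fit(scores)
loadings = efa.components_.T  # rows = variables, columns = factors
print(np.round(loadings, 2))
```

A CFA, by contrast, is typically fit with structural equation modeling software, where the researcher specifies in advance which variables load on which factors and then evaluates model fit.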

Factor Analysis of Test Batteries or Sections, or Multiple Predictors

Factor analysis is particularly valuable when dealing with test batteries, which are collections of tests designed to measure various aspects of student cognitive abilities, skills, or knowledge. In the context of a test battery, factor analysis helps to identify the underlying structure of the tests and determine whether they measure distinct yet related constructs.

For example, a cognitive ability test battery might include subtests for verbal reasoning, quantitative reasoning, and spatial reasoning. Through factor analysis, researchers can examine how these subtests correlate and whether they load onto separate factors, indicating they measure distinct abilities, or onto a single factor, suggesting a more general underlying ability, often referred to as the “g” factor or general intelligence.

This approach can also incorporate non-assessment data. For example, a researcher studying employee selection might look at a set of assessments (cognitive ability, job knowledge, quantitative reasoning, MS Word skills, integrity, counterproductive work behavior), as well as variables such as interview scores or resume ratings. Below is an oversimplified example of how the loading matrix might look for this.

Table 1

Variable                          Dimension 1   Dimension 2
Cognitive ability                     0.42          0.09
Job knowledge                         0.51          0.02
Quantitative reasoning                0.36         -0.02
MS Word skills                        0.49          0.07
Integrity                             0.03          0.26
Counterproductive work behavior      -0.01          0.31
Interview scores                      0.16          0.29
Resume ratings                        0.11          0.12

Readers who are familiar with the topic will recognize this as a nod to the work by Walter Borman and Steve Motowidlo on task vs. contextual aspects of job performance.  A variable like job knowledge would load highly on a factor of task aspects of performing a job.  However, an assessment of counterproductive work behavior might not predict how well an employee performs their tasks, but rather how well they contribute to company culture and other contextual aspects.

This analysis is crucial for ensuring that the test battery provides comprehensive and valid measurements of the constructs it aims to assess. By confirming that each subtest contributes unique information, factor analysis supports the interpretation of composite scores and aids in the design of more effective assessment tools. The process of validating test batteries is essential to maintain the integrity and utility of the test results in educational and psychological settings.

This approach typically uses “regular” factor analysis, which assumes that scores on each input variable are normally distributed. That is usually a reasonable assumption for something like scores on an intelligence test. But if you are analyzing scores on individual test items, these are rarely normally distributed; for dichotomous data, where the only possible scores are 0 and 1, normality is impossible. Therefore, other mathematical approaches must be applied.

Factor Analysis on the Item Level

Factor analysis at the item level is a more granular approach, focusing on the individual test items rather than entire subtests or batteries. This method is used to ensure that each item contributes appropriately to the overall construct being measured and to identify any items that do not align well with the intended factors.

For instance, in a reading comprehension test, factor analysis at the item level can reveal whether each question accurately measures the construct of reading comprehension or whether some items are more aligned with other factors, such as vocabulary knowledge or reasoning skills. Items that do not load strongly onto the intended factor may be flagged for revision or removal, as they could distort the accuracy of the test scores.

This item-level analysis is crucial for developing high-quality educational or knowledge assessments, as it helps to ensure that every question is both valid and reliable, contributing meaningfully to the overall test score. It also aids in identifying “enemy items,” which are questions that could undermine the test’s consistency and fairness.

Similarly, in personality assessments like the Big Five Personality Test, factor analysis is used to confirm the structure of personality traits, ensuring that the test accurately captures the five broad dimensions: openness, conscientiousness, extraversion, agreeableness, and neuroticism. This process ensures that each trait is measured distinctly while also considering how they may interrelate.  Note that the result here was not to show overall unidimensionality in personality, but evidence to support five factors.  An assessment of a given factor is then more or less unidimensional.

An example of this is shown in Table 2 below.  Consider a survey in which people rate each descriptive statement on a Likert scale of 1 to 5.  The survey might have hundreds of such statements, but they would align themselves with the Big Five under factor analysis, and the simple structure would look something like what you see below (two items per factor in this small example).

 

Table 2

Statement                                               Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
I like to try new things                                 0.63    0.02    0.00   -0.03   -0.02
I enjoy exciting sports                                  0.71    0.00    0.11   -0.08    0.07
I consider myself neat and tidy                          0.02    0.56    0.08    0.11    0.08
I am a perfectionist                                    -0.05    0.69   -0.08    0.09   -0.09
I like to go to parties                                  0.11    0.15    0.74    0.08    0.00
I prefer to spend my free time alone (reverse scored)    0.13    0.07    0.81    0.01    0.05
I tend to “go with the flow”                            -0.14    0.02   -0.04    0.68    0.08
I enjoy arguments and debates (reverse scored)           0.03   -0.04   -0.05    0.72    0.11
I get stressed out easily (reverse scored)              -0.05    0.03    0.03    0.05    0.81
I perform well under pressure                            0.02    0.02    0.02   -0.01    0.77

 

Tools like MicroFACT, a specialized software for evaluating unidimensionality, are invaluable in this process. MicroFACT enables psychometricians to assess whether each item in a test measures a single underlying construct, ensuring the test’s coherence and effectiveness.

Summary

Factor analysis plays a pivotal role in the field of psychometrics, offering deep insights into the structure and validity of educational assessments. Whether applied to test batteries or individual items, factor analysis helps ensure that tests are both reliable and meaningful.

Overall, factor analysis is indispensable for developing effective educational tools and improving assessment practices. It ensures that tests not only measure what they are supposed to but also do so in a way that is fair and consistent across different groups and over time. As educational assessments continue to evolve, the insights provided by factor analysis will remain crucial in maintaining the integrity and effectiveness of these tools.

References

Geisinger, K. F., Bracken, B. A., Carlson, J. F., Hansen, J.-I. C., Kuncel, N. R., Reise, S. P., & Rodriguez, M. C. (Eds.). (2013). APA handbook of testing and assessment in psychology, Vol. 1. Test theory and testing and assessment in industrial and organizational psychology. American Psychological Association. https://doi.org/10.1037/14047-000

Kline, R. B. (2015). Principles and practice of structural equation modeling (4th ed.). The Guilford Press.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.

 


Setting a cutscore on a test scored with item response theory (IRT) requires some psychometric knowledge.  This post will get you started.

How do I set a cutscore with item response theory?

There are two approaches: directly with IRT, or using CTT then converting to IRT.

  1. Some standard setting methods work directly with IRT, such as the Bookmark method.  Here, you calibrate your test with IRT, rank the items by difficulty, and have an expert panel place “bookmarks” in the ranked list.  The average IRT difficulty of their bookmarks is then a defensible IRT cutscore.  The Contrasting Groups method and the Hofstee method can also work directly with IRT.
  2. Cutscores set with classical test theory, such as the Angoff, Nedelsky, or Ebel methods, are easy to implement when the test is scored classically.  But if your test is scored with the IRT paradigm, you need to convert your cutscores onto the theta scale.  The easiest way to do that is to reverse-calculate the test response function (TRF) from IRT.

The Test Response Function

The TRF (sometimes called a test characteristic curve) is an important method of characterizing test performance in the IRT paradigm.  The TRF predicts a classical score from an IRT score, as you see below.  Like the item response function and test information function, it uses the theta scale as the X-axis.  The Y-axis can be either the number-correct metric or the proportion-correct metric.

Test response function 10 items Angoff

In this example, you can see that a theta of -0.3 translates to an estimated number-correct score of approximately 7, or 70%.

Classical cutscore to IRT

So how does this help us with the conversion of a classical cutscore?  Well, we now have a way of translating any number-correct score or proportion-correct score to theta.  So any classical cutscore can be reverse-calculated to a theta value.  If your Angoff study (or Beuk) recommends a cutscore of 7 out of 10 points (70%), you can convert that to a theta cutscore of -0.3 as above.  If the recommended cutscore was 8 (80%), the theta cutscore would be approximately 0.7.
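
Here is a minimal sketch of that reverse calculation (my own illustration, with hypothetical 3PL item parameters rather than a real calibration):

```python
import numpy as np

# Hypothetical 3PL parameters for a 10-item test: (a, b, c)
items = [(1.0, -1.5, 0.20), (1.2, -1.0, 0.20), (0.9, -0.5, 0.25), (1.1, -0.2, 0.20),
         (1.3,  0.0, 0.20), (0.8,  0.3, 0.20), (1.0,  0.6, 0.25), (1.2,  1.0, 0.20),
         (0.9,  1.4, 0.20), (1.1,  1.8, 0.20)]

def test_response_function(theta: float) -> float:
    """Expected proportion-correct score at a given theta under the 3PL model."""
    probs = [c + (1 - c) / (1 + np.exp(-a * (theta - b))) for a, b, c in items]
    return float(np.mean(probs))

def theta_for_cutscore(proportion_correct: float) -> float:
    """Find the theta whose expected score matches the classical cutscore."""
    thetas = np.linspace(-4, 4, 8001)
    trf = np.array([test_response_function(t) for t in thetas])
    return float(thetas[np.argmin(np.abs(trf - proportion_correct))])

print(round(theta_for_cutscore(0.70), 2))  # theta cutscore equivalent to 70% correct
```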

IRT scores examinees on the same scale regardless of which items they take, as long as those items have been part of a linking/equating study.  Therefore, a cutscore set in a single study on one set of items can be carried over to any other linear test form, LOFT pool, or CAT pool.  This makes it possible to apply the classically focused Angoff method to IRT-focused programs.  You can even set the cutscore with a subset of your item pool, delivered in a linear sense, with the full intention of applying it to CAT tests later.

Note that the number-correct metric only makes sense for linear or LOFT exams, where every examinee receives the same number of items.  In the case of CAT exams, only the proportion correct metric makes sense.

How do I implement IRT?

Interested in applying IRT to improve your assessments?  Download a free trial copy of  Xcalibre  here.  If you want to deliver online tests that are scored directly with IRT, in real time (including computerized adaptive testing), check out  FastTest.


Technology-enhanced items are assessment items (questions) that utilize technology to improve the interaction of a test question in digital assessment, over and above what is possible with paper.  Tech-enhanced items can improve examinee engagement (important with K12 assessment), assess complex concepts with higher fidelity, improve precision/reliability, and enhance face validity/sellability. 

To some extent, the last word is the key one; tech-enhanced items simply look sexier and therefore make an assessment platform easier to sell, even if they don’t actually improve assessment.  I’d argue that there are also technology-enabled items, which are distinct, as discussed below.

What is the goal of technology enhanced items?

The goal is to improve assessment by increasing things like reliability/precision, validity, and fidelity. However, a number of TEIs are actually designed more for sales purposes than psychometric purposes. So how do we know if TEIs improve assessment?  That, of course, is an empirical question that is best answered with an experiment.  But let me suggest one metric to address this question: how far does the item go beyond just reformulating a traditional item format to use current user-interface technology?  I would call an item that merely reformulates a traditional format a fake TEI, while one that goes beyond that is a true TEI.

An alternative nomenclature might be to call the reformulations technology-enhanced items and the true tech usage to be technology-enabled items (Almond et al, 2010; Bryant, 2017), as they would not be possible without technology.

A great example of this is the relationship between a traditional multiple response item and certain types of drag and drop items.  There are a number of different ways that drag and drop items can be created, but for now, let’s use the example of a format that asks the examinee to drag text statements into a box. 

An example is the K12 assessment items from PARCC that ask the student to read a passage and then answer questions about it.

drag drop sequence

The item is scored with integers from 0 to K where K is the number of correct statements; the integers are often then used to implement the generalized partial credit model for final scoring.  This would be true regardless of whether the item was presented as multiple response vs. drag and drop. The multiple response item, of course, could just as easily be delivered via paper and pencil. Converting it to drag and drop enhances the item with technology, but the interaction of the student with the item, psychometrically, remains the same.
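As a small illustration of that scoring logic, here is a sketch in Python; the statement labels and key are hypothetical.  Whether the examinee clicked checkboxes or dragged the statements into a box, the resulting integer score is the same.

```python
# Hypothetical key: the statements that belong in the box
KEY = {"statement_2", "statement_4", "statement_5"}

def score_multiple_response(selected):
    """Return an integer score from 0 to K, where K is the number of correct
    statements.  This simple variant gives no penalty for incorrect selections;
    operational scoring rules sometimes subtract points or require an exact match."""
    return len(set(selected) & KEY)

print(score_multiple_response({"statement_2", "statement_5"}))  # scores 2 out of 3
```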

Some True TEIs, or Technology Enabled Items

Of course, the past decade or so has witnessed stronger innovation in item formats. Gamified assessments change how the interaction of person and item is approached, though this is arguably less relevant for high-stakes assessment due to validity concerns. There are also simulation items. For example, a test for a construction crane operator might provide an interface with crane controls and ask the examinee to complete a task. Even at the K-12 level there can be such items, such as a simulation of a science experiment where the student is given various test tubes or other instruments on the screen.

Both of these approaches are extremely powerful but have a major disadvantage: cost. They are typically custom-designed. In the case of the crane operator exam or even the science experiment, you would need to hire software developers to create this simulation. There are now some simulation-development ecosystems that make this process more efficient, but the items still involve custom authoring and custom scoring algorithms.

To address this shortcoming, there is a new generation of self-authored item types that are true TEIs. By “self-authored” I mean that a science teacher would be able to create these items themselves, just like they would a multiple choice item. The amount of technology leveraged is somewhere between a multiple choice item and a custom-designed simulation, providing a compromise of reduced cost but still increasing the engagement for the examinee. A major advantage of this approach is that the items do not need custom scoring algorithms, and instead are typically scored via point integers, which enables the use of polytomous item response theory.

Are we at least moving forward?  Not always!

There is always pushback against technology, and in this topic the counterexample is the gridded item type.  It goes in reverse of innovation: rather than taking a traditional format and reformulating it for current UI, it ignores the capabilities of current UI (indeed, of UI from the past 20+ years) and is therefore a step backward. With that item type, students are presented a bubble sheet from a 1960s-style paper exam, on a computer screen, and asked to fill in the bubbles by clicking on them rather than using a pencil on paper.

Another example is the EBSR item type from the artist formerly known as PARCC. It was a new item type that intended to assess deeper understanding, but it did not use any tech-enhancement or -enablement, instead asking two traditional questions in a linked manner. As any psychometrician can tell you, this approach ignored basic assumptions of psychometrics, so you can guess the quality of measurement that it put out.

How can I implement TEIs?

It takes very little software development expertise to develop a platform that supports multiple choice items. An item like the graphing one above, though, takes substantial investment. So there are relatively few platforms that can support these, especially with best practices like workflow item review or item response theory. 

modified-Angoff Beuk compromise

A modified-Angoff study is one of the most common ways to set a defensible cutscore on an exam.  This means that the pass/fail decisions made by the test are more trustworthy than if you picked an arbitrary round number like 70%.  If your doctor, lawyer, accountant, or other professional has passed an exam where the cutscore was set with this method, you can place more trust in their skills.

What is the Angoff method?

The Angoff method is a scientific way of setting a cutscore (pass point) on a test.  If you have a criterion-referenced interpretation, it is not legally defensible to just conveniently pick a round number like 70%; you need a formal process.  There are a number of acceptable methodologies in the psychometric literature for standard-setting studies, which establish cutscores, also known as passing points.  Some examples include Angoff, modified-Angoff, Bookmark, Contrasting Groups, and Borderline.  The modified-Angoff approach is by far the most popular, and it is used especially frequently for certification, licensure, certificate, and other credentialing exams.

It was originally suggested as a mere footnote by the renowned researcher William Angoff at Educational Testing Service. Studies have found that panelists in modified-Angoff sessions typically reach high levels of agreement, with inter-rater reliability often surpassing 0.85, which supports the consistency of the resulting decisions.

How does the Angoff approach work?

First, you gather a group of subject matter experts (SMEs), with a minimum of 6, though 8-10 is preferred for better reliability, and have them define what they consider to be a Minimally Competent Candidate (MCC).  Next, you have them estimate the percentage of minimally competent candidates that will answer each item correctly.  You then analyze the results for outliers or inconsistencies.  If experts disagree, you will need to evaluate inter-rater reliability and agreement, and after that have the experts discuss and re-rate the items to gain better consensus.  The average final rating is then the expected percent-correct score for a minimally competent candidate.
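Here is a minimal sketch of that arithmetic in Python, using hypothetical ratings from four SMEs on five items; it averages the ratings into a recommended percent-correct cutscore and flags items where the SMEs disagree enough to warrant discussion.

```python
import numpy as np

# Hypothetical Angoff ratings: rows = SMEs, columns = items
# (each rating is the percent of MCCs expected to answer the item correctly)
ratings = np.array([
    [70, 85, 95, 90, 75],
    [65, 80, 45, 95, 70],
    [75, 90, 90, 85, 80],
    [70, 85, 50, 90, 75],
])

item_means = ratings.mean(axis=0)        # expected MCC performance per item
item_sds = ratings.std(axis=0, ddof=1)   # disagreement among SMEs per item

cutscore_percent = item_means.mean()             # recommended percent-correct cutscore
needs_discussion = np.where(item_sds > 10)[0]    # arbitrary flagging threshold

print(round(cutscore_percent, 1), needs_discussion)
```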

Advantages of the Angoff method

  1. It is defensible.  Because it is the most commonly used approach and is widely studied in the scientific literature, it is well-accepted.
  2. You can implement it before a test is ever delivered.  Some other methods require you to deliver the test to a large sample first.
  3. It is conceptually simple, easy enough to explain to non-psychometricians.
  4. It incorporates the judgment of a panel of experts, not just one person or a round number.
  5. It works for tests with both classical test theory and item response theory.
  6. It does not take long to implement – if a short test, it can be done in a matter of hours!
  7. It can be used with different item types, including polytomously scored items (multi-point).

Disadvantages of the Angoff method

  1. It does not use actual data, unless you implement the Beuk method alongside.
  2. It can lead to the experts overestimating the performance of entry-level candidates, as they may have forgotten what it was like to start out 20-30 years ago.  This is one reason to use the Beuk method as a “reality check”: it shows the experts that if they stay with the cutscore they just picked, the majority of candidates might fail!

Example of the Modified-Angoff Approach

First of all, do not expect a straightforward, easy process that leads to an unassailably correct cutscore.  All standard-setting methods involve some degree of subjectivity.  The goal of the methods is to reduce that subjectivity as much as possible.  Some methods focus on content, others on examinee performance data, while some try to meld the two.

Step 1: Prepare Your Team

The modified-Angoff process depends on a representative sample of SMEs, usually 6-20. By “representative” I mean they should represent the various stakeholders. For instance, a certification for medical assistants might include experienced medical assistants, nurses, and physicians, from different areas of the country. You must train them about their role and how the process works, so they can understand the end goal and drive toward it.

Step 2: Define The Minimally Competent Candidate (MCC)

This concept is the core of the modified-Angoff method, though it is known by a range of terms or acronyms, including minimally qualified candidates (MQC) or just barely qualified (JBQ).  The reasoning is that we want our exam to separate candidates that are qualified from those that are not.  So we ask the SMEs to define what makes someone qualified (or unqualified!) from a perspective of skills and knowledge. This leads to a conceptual definition of an MCC. We then want to estimate what score this borderline candidate would achieve, which is the goal of the remainder of the study. This step can be conducted in person, or via webinar.

Step 3: Round 1 Ratings

Next, ask your SMEs to read through all the items on your test form and estimate the percentage of MCCs that would answer each correctly.  A rating of 100 means the item is a slam dunk; it is so easy that every MCC would get it right.  A rating of 40 means the item is very difficult.  Most ratings are in the 60-90 range if the items are well-developed. The ratings should be gathered independently; if everyone is in the same room, let them work on their own in silence. This step can easily be conducted remotely, though.

Step 4: Discussion

This is where it gets fun.  Identify items where there is the most disagreement (as defined by grouped frequency distributions or standard deviation) and make the SMEs discuss it.  Maybe two SMEs thought it was super easy and gave it a 95 and two other SMEs thought it was super hard and gave it a 45.  They will try to convince the other side of their folly. Chances are that there will be no shortage of opinions and you, as the facilitator, will find your greatest challenge is keeping the meeting on track. This step can be conducted in person, or via webinar.

Step 5: Round 2 Ratings

Raters then re-rate the items based on the discussion.  The goal is that there will be a greater consensus.  In the previous example, it’s not likely that every rater will settle on a 70.  But if your raters all end up from 60-80, that’s OK. How do you know there is enough consensus?  We recommend the inter-rater reliability suggested by Shrout and Fleiss (1979), as well as looking at inter-rater agreement and dispersion of ratings for each item. This use of multiple rounds is known as the Delphi approach; it pertains to all consensus-driven discussions in any field, not just psychometrics.
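Here is a minimal sketch of one such consistency check, the intraclass correlation ICC(2,1) from Shrout and Fleiss (1979), computed with plain numpy; the Round 2 ratings matrix is hypothetical, with items as rows and raters as columns.  (Statistical packages also provide these coefficients directly.)

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1) per Shrout & Fleiss (1979): two-way random effects,
    absolute agreement, single rater.  Rows = items, columns = raters."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # item means
    col_means = x.mean(axis=0)   # rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between-items mean square
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between-raters mean square
    sse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                        # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical Round 2 ratings: rows = items, columns = SMEs
round2 = np.array([
    [70, 68, 72, 70],
    [85, 82, 88, 85],
    [65, 62, 68, 64],
    [90, 88, 92, 90],
    [75, 72, 78, 76],
])
print(round(icc_2_1(round2), 3))
```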

Step 6: Evaluate Results and Final Recommendation

Evaluate the results from Round 2 as well as Round 1.  An example of this is below.  What is the recommended cutscore?  It is the average or sum of the Angoff ratings, depending on the scale you prefer.  Did the reliability improve?  Estimate the mean and SD of examinee scores (there are several methods for this). What sort of pass rate do you expect?  Even better, utilize the Beuk Compromise as a “reality check” between the modified-Angoff approach and actual test data.  You should take multiple points of view into account, and the SMEs need to vote on a final recommendation. They, of course, know the material and the candidates, so they have the final say.  This means that standard setting is partly a political process; again, reduce that effect as much as you can.

Some organizations do not set the cutscore at the recommended point, but at one standard error of judgment (SEJ) below the recommended point.  The SEJ is based on the inter-rater reliability; note that it is NOT the standard error of the mean or the standard error of measurement.  Some organizations use the standard error of measurement instead; using the standard error of the mean, however, is just plain wrong (though I have seen it done by amateurs).

 

modified angoff

Step 7: Write Up Your Report

Validity refers to evidence gathered to support test score interpretations.  Well, you have lots of relevant evidence here. Document it.  If your test gets challenged, you’ll have all this in place.  On the other hand, if you just picked 70% as your cutscore because it was a nice round number, you could be in trouble.

Additional Topics

In some situations, there are more issues to worry about.  Multiple forms?  You’ll need to equate in some way.  Using item response theory?  You’ll have to convert the cutscore from the modified-Angoff method onto the theta metric using the Test Response Function (TRF).  New credential and no data available? That’s a real chicken-and-egg problem there.

Where Do I Go From Here?

Ready to take the next step and actually apply the modified-Angoff process to improving your exams?  Sign up for a free account in our  FastTest item banker. You can also download our Angoff analysis tool for free.

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.

test response functions

Item response theory (IRT) is a family of machine learning models in the field of psychometrics, which are used to design, analyze, validate, and score assessments.  It is a very powerful psychometric paradigm that allows researchers to build stronger assessments, whether they work in Education, Psychology, Human Resources, or other fields.  It also solves critical measurement problems like equating across years, designing adaptive tests, or creating vertical scales.

Want to learn more about IRT, how it works, and why it is so important for assessment?  Read on.

What is Item Response Theory?

IRT is a family of models that try to describe how examinees respond to items on a test, hence the name.  These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves.  However, item response theory ended up doing so much more.

Example item response theory function

IRT is model-driven, in that there is a specific mathematical equation that is assumed, and we fit the models based on raw data, similar to linear regression.  There are different parameters (a, b, c) that shape this equation to different needs.  That’s what defines different IRT models.  This will be discussed at length below.

The models put people and items onto a latent scale, which is usually called θ (theta).  This represents whatever is being measured, whether IQ, anxiety, or knowledge of accounting laws in Croatia.  IRT helps us understand the nature of the scale, how a person answers each question, the distribution of item difficulty, and much more.  IRT used to be known as latent trait theory and item characteristic curve theory.

IRT requires specially-designed software.  Click the link below to download our software Xcalibre, which provides a user-friendly and visual platform for implementing IRT.

 

IRT analysis with Xcalibre

 

Why do we need Item Response Theory?

IRT represents an important innovation in the field of psychometrics. While now more than 50 years old, assuming the “birth” is the classic Lord and Novick (1968) text, it is still underutilized and remains a mystery to many practitioners.

Item response theory is more than just a way of analyzing exam data; it is a paradigm that drives the entire lifecycle of designing, building, delivering, scoring, and analyzing assessments.

IRT requires larger sample sizes and is much more complex than its predecessor, classical test theory, but is also far more powerful.  IRT requires quite a lot of expertise, typically a PhD.  So it is not used for small assessments like a final exam at universities, but is used for almost all major assessments in the world.

 

The Driver: Problems with Classical Test Theory

Classical test theory (CTT) is approximately 100 years old, and still remains commonly used because it is appropriate for certain situations, and it is simple enough that it can be used by many people without formal training in psychometrics.  Most statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  A list of these is presented later.

Learn more about the differences between CTT and IRT here.

 

Item Response Theory Parameters

The foundation of IRT is a mathematical model defined by item parameters.  A parameter is an aspect of a mathematical model that can change its shape or other aspects.  For dichotomous items (those scored correct/incorrect), each item has three parameters:

 

   a: the discrimination parameter, an index of how well the item differentiates low from top examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.

   b: the difficulty parameter, an index of what level of examinees for which the item is appropriate; typically ranges from -3 to +3, with 0 being an average examinee level.

   c: the pseudo-guessing parameter, which is a lower asymptote; typically is near 1/k, where k is the number of options.

These parameters are used in the formula below, but are also displayed graphically.

The three-parameter logistic (3PL) model:

P(θ) = c + (1 − c) / (1 + exp[−a(θ − b)])

Item response function

These parameters are used to graphically display an item response function (IRF), which models the probability of a correct answer as a function of ability.  In the example IRF, the a parameter is approximately 1.0, indicating a fairly discriminating test item.  The b parameter is approximately 0.0 (the point on the x-axis where the midpoint of the curve is), indicating an average-difficulty item; examinees of average ability would have a 60% chance of answering correctly.  The c parameter is approximately 0.20, like a 5-option multiple choice item.  Consider the x-axis to be z-scores on a standard normal scale.

In some cases, there is no guessing involved, and we only use a and b.  This is called the two-parameter model.  If we only use b, this is the one-parameter or Rasch Model.  Here is how that is calculated.

The one-parameter logistic (1PL) model:

P(θ) = 1 / (1 + exp[−(θ − b)])

Item parameters, which are crucial within the IRT framework, might change over time or multiple testing occasions, a phenomenon known as item parameter drift.

 

Example Item Response Theory calculations

Examinees with higher ability are much more likely to respond correctly.  Look at the graph above.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only a 25% chance – barely above the 1-in-5 guessing rate of 20%.  An average person (0.0) has a 60% chance.  Why 60?  Because we are accounting for guessing.  If the curve went from 0% to 100% probability, then yes, the middle would be a 50% chance.  But here, we assume 20% as a baseline due to guessing, so halfway up is 60%.
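Here is a minimal sketch of that arithmetic, using the approximate parameters read off the example IRF (a ≈ 1.0, b ≈ 0.0, c ≈ 0.20).  Because the parameters are only approximate, and some programs include a 1.7 scaling constant in the exponent, the computed values will be close to, but not exactly, the percentages quoted above.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.0):
    """Three-parameter logistic IRF; some programs set the scaling constant D = 1.7."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

a, b, c = 1.0, 0.0, 0.20   # approximate values read off the example IRF
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_3pl(theta, a, b, c), 2))   # low, average, and high examinees
```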

five item response functions

Of course, the parameters can and should differ from item to item, reflecting differences in item performance.  The following graph shows five IRFs with the three-parameter model.  The dark blue line is the easiest item, with a b of -2.00.  The light blue item is the hardest, with a b of +1.80.  The purple one has a c=0.00 while the light blue has c=0.25, indicating that it is more susceptible to guessing.

These IRFs are not just a pretty graph or a way to describe how an item performs.  They are the basic building block to accomplishing those important goals mentioned earlier.  That comes next…

 

Applications of Item Response Theory to Improve Assessment

Item response theory uses the IRF for several purposes.  Here are a few.

test information function from item response theory

  1. Interpreting and improving item performance
  2. Scoring examinees with maximum likelihood or Bayesian methods (see the sketch after this list)
  3. Form assembly, including linear on the fly testing (LOFT) and pre-equating
  4. Calculating the accuracy of examinee scores
  5. Development of computerized adaptive tests (CAT)
  6. Post-equating
  7. Differential item functioning (finding bias)
  8. Data forensics to find cheaters or other issues
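As a minimal sketch of application 2, maximum likelihood scoring: given calibrated 3PL parameters and an examinee's scored responses, we find the theta that maximizes the likelihood.  The item parameters and responses below are hypothetical, and the grid search is purely illustrative; operational programs use Newton-type or Bayesian (EAP/MAP) estimation.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def mle_theta(responses, a, b, c):
    """Maximum likelihood theta estimate via a simple grid search.
    responses: array of 0/1 scores on the items this examinee saw."""
    grid = np.linspace(-4, 4, 801)
    best_t, best_ll = grid[0], -np.inf
    for t in grid:
        p = p_3pl(t, a, b, c)
        ll = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t

# Hypothetical calibrated parameters and one examinee's responses
a = np.array([0.8, 1.0, 1.2, 0.9, 1.1, 1.0])
b = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0])
c = np.array([0.20] * 6)
responses = np.array([1, 1, 1, 1, 0, 0])

print(round(mle_theta(responses, a, b, c), 2))
```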

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is the inverse of the square root of the TIF, and has the interpretable advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the SEM is a 95% confidence interval for their score.  The graph on the right shows part of the form assembly process in our  FastTest  platform.
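Here is a minimal sketch of how those two functions are computed from 3PL item parameters (hypothetical values): item information functions are summed into the TIF, and the CSEM at each theta is the reciprocal of the square root of the TIF.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    """3PL item information: a^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Hypothetical 3PL parameters for a short form
a = np.array([0.8, 1.0, 1.2, 0.9, 1.1])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
c = np.array([0.20] * 5)

theta = np.linspace(-3, 3, 61)
tif = sum(item_info_3pl(theta, ai, bi, ci) for ai, bi, ci in zip(a, b, c))
csem = 1 / np.sqrt(tif)   # conditional standard error of measurement

print(round(tif.max(), 2), round(csem.min(), 2))
```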

 

Assumptions of Item Response Theory

Item response theory assumes a few things about your data.

  1. The latent trait you are measuring is unidimensional.  If it is multidimensional, there is multidimensional item response theory, or you can treat the dimensions as separate traits.
  2. Items have local independence, which means that the act of answering one is not impacted by others.  This affects the use of testlets and enemy items.
  3. The probability of responding correctly to an item (or in a certain response category, in the case of polytomous items like Likert scales) is a function of the examinee’s ability/trait level and the parameters of the model, following the calculation of the item response function, with some allowance for random error.  As a corollary, we are assuming that the ability/trait has some distribution, with some people having higher or lower levels (e.g., intelligence), and that we are trying to find those differences.

Many texts will only postulate the first two as assumptions, because the third is just implicitly assumed.

 

Advantages and Benefits of Item Response Theory

So why does this matter?  Let’s go back to the problems with classical test theory.  Why is IRT better?

  • Sample-independence of scale: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation.  Two samples of different ability levels can be easily converted onto the same scale.
  • Test statistics: Classical statistics are tied to a specific test form; IRT statistics are not.
  • Sparse matrices are OK: Classical test statistics do not work with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing.
  • Linking/equating: Item response theory has much stronger equating, so if your exam has multiple forms, or if you deliver twice per year with a new form, you can have much greater validity in the comparability of scores.
  • Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect.
  • Vertical scaling: IRT can do vertical scaling but CTT cannot.
  • Accounting for guessing: CTT does not account for guessing on multiple choice exams.
  • Scoring: Scoring in classical test theory does not take into account item difficulty.  With IRT, you can score a student on any set of items and be sure it is on the same latent scale.
  • Adaptive testing: CTT does not support adaptive testing in most cases.  Adaptive testing has its own list of benefits.
  • Characterization of error: CTT assumes that every examinee has the same amount of error in their score (SEM); IRT recognizes that if the test is all middle-difficulty items, then low or high students will have inaccurate scores.
  • Stronger form building: IRT has functionality to build forms to be more strongly equivalent and meet the purposes of the exam.
  • Nonlinear function: CTT assumes a linear relationship between the student and the item (the point-biserial) even when that is blatantly impossible; IRT models the relationship with a nonlinear function.

 

Item Response Theory Models: One Big Happy Family

Remember: IRT is actually a family of models, making flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and fit of the data.  If there are multipoint items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters. Learn more about the partial credit situation here.

Here’s a quick breakdown of the family tree, with the most common models.

 

How do I analyze my test with Item Response Theory?

OK item fit

First: you need to get special software.  There are some commercial packages like  Xcalibre, or you can use packages inside platforms like R and Python.

The software will analyze the data in cycles or loops to try to find the best model.  This is because real data never align perfectly with the model.  You might see graphs like the one above, which compares actual proportions (red) to the predicted ones from the item response function (black).  That’s OK!  IRT is quite robust.  And there are analyses built in to help you evaluate model fit.

Some more unpacking of the image above:

  • This was item #39 on the test
  • We are using the three parameter logistic model (3PL), as this was a multiple choice item with 4 options
  • 3422 examinees answered the item
  • 76.9% of them got it correct
  • The classical item discrimination (point biserial item-total correlation) was 0.253, which is OK but not very high
  • The a parameter was 0.432, which is OK but not very strong
  • The b parameter was -1.195, which means the item was quite easy
  • The c parameter was 0.248, which you would expect if there was a 25% chance of guessing
  • The Chi-square fit statistic rejected the null, indicating poor fit, but this statistic is susceptible to sample size
  • The z-Resid fit statistic is a bit more robust, and it did not flag the item for bad fit
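The fit graphic being unpacked here plots observed proportions correct within theta groups (red) against the proportions predicted by the item response function (black).  Here is a minimal sketch of that comparison using item #39's reported parameters; the examinee thetas and responses are simulated stand-ins, since the real data are not shown.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# Reported parameters for item #39; simulated examinees stand in for the real sample
a, b, c = 0.432, -1.195, 0.248
rng = np.random.default_rng(0)
theta = rng.normal(0, 1, 3422)
responses = rng.random(3422) < p_3pl(theta, a, b, c)

# Observed proportion correct per theta bin vs. the model's prediction at the bin center
bins = np.linspace(-3, 3, 13)
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (theta >= lo) & (theta < hi)
    if in_bin.sum() >= 30:   # skip sparse bins
        mid = (lo + hi) / 2
        print(round(mid, 1), round(responses[in_bin].mean(), 3), round(p_3pl(mid, a, b, c), 3))
```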

Xcalibre-poly-output
The image here shows output from  Xcalibre  for the generalized partial credit model, which is a polytomous model often used for items scored with partial credit.  For example, a question might list 6 animals and ask students to click on the ones that are reptiles, of which there are 3; the possible scores are then 0, 1, 2, 3.

Here, the graph labels them as 1-2-3-4, but the meaning is the same.  Here is how you can interpret this.

  • Someone is likely to get 0 points if their theta is below -2.0 (bottom 3% or so of students).
  • A few low students might get 1 point (green).
  • Low-middle ability students are likely to get 2 correct (blue).
  • Anyone above average (0.0) is likely to get all 3 correct.

The boundary locations are where one level becomes more likely than another, i.e., where the curves cross.  For example, you can see that the blue and black lines cross at the boundary -0.339.
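Here is a minimal sketch of the generalized partial credit model behind that graph, for a 0-1-2-3 point item.  The discrimination and the first two boundary locations are hypothetical; the third is set at the -0.339 crossing noted above, which is where the probability of 3 points overtakes the probability of 2.

```python
import numpy as np

def gpcm_probs(theta, a, boundaries):
    """Generalized partial credit model: probability of each score category
    (0..m) at a given theta.  'boundaries' are the m locations where adjacent
    category curves cross."""
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(boundaries)))))
    ez = np.exp(z - z.max())   # subtract the max for numerical stability
    return ez / ez.sum()

# Hypothetical parameters for a 0-3 point item; last boundary matches the -0.339 above
a = 1.1
boundaries = [-1.8, -0.9, -0.339]
for theta in (-2.5, -1.2, -0.6, 0.5):
    print(theta, np.round(gpcm_probs(theta, a, boundaries), 2))
```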

Where can I learn more?

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software,  Xcalibre.

Want to talk to one of our experts about how to apply IRT?  Get in touch!
