General intelligence, often symbolized as “g,” is a concept that has been central to psychology and cognitive science since the early 20th century. First introduced by Charles Spearman, general intelligence represents an individual’s overall cognitive ability. This foundational concept has evolved over the years and remains crucial in both academic and applied settings, particularly in assessment and measurement. Understanding general intelligence can help in evaluating mental abilities, predicting academic and career success, and creating reliable and valid assessment tools. This article delves into the nature of general intelligence, its assessment, and its importance in measurement fields.

What is General Intelligence?


General intelligence (GI), or “g,” is a theoretical construct referring to the common cognitive abilities underlying performance across various mental tasks. Spearman proposed that a general cognitive ability contributes to performance in a wide range of intellectual tasks. This ability encompasses multiple cognitive skills, such as reasoning, memory, and problem-solving, which are thought to be interconnected. In Spearman’s model, a person’s performance on any cognitive test relies partially on “g” and partially on task-specific skills.

For example, both solving complex math problems and understanding a new language involve specific abilities unique to each task but are also underpinned by an individual’s GI. This concept has been pivotal in shaping how we understand cognitive abilities and the development of intelligence tests.

To further explore the foundational aspects of intelligence, the Positive Manifold phenomenon demonstrates that most cognitive tasks tend to be positively correlated, meaning that high performance in one area generally predicts strong performance in others. You can read more about it in our article on Positive Manifold.

GI in Assessment and Measurement

The assessment of GI has been integral to psychology, education, and organizational settings for decades. Testing for “g” provides insight into an individual’s mental abilities and often serves as a predictor of various outcomes, such as academic performance, job performance, and life success.

  1. Intelligence Testing: Intelligence tests, like the Wechsler Adult Intelligence Scale (WAIS) and Stanford-Binet, aim to provide a measurement of GI. These tests typically consist of a variety of subtests measuring different cognitive skills, including verbal comprehension, working memory, and perceptual reasoning. The results are aggregated to produce an overall IQ score, representing a general measure of “g.” These scores are then compared to population averages to understand where an individual stands in terms of cognitive abilities relative to their peers.
  2. Educational Assessment: GI is often used in educational assessments to help identify students who may need additional support or advanced academic opportunities. For example, cognitive ability tests can assist in identifying gifted students who may benefit from accelerated programs or those who need extra resources. Schools also use “g” as one factor in admission processes, relying on tests like the SAT, GRE, and similar exams, which assess reasoning and problem-solving abilities linked to GI.
  3. Job and Career Assessments: Many organizations use cognitive ability tests as part of their recruitment processes. GI has been shown to predict job performance across many types of employment, especially those requiring complex decision-making and problem-solving skills. By assessing “g,” employers can gauge a candidate’s potential for learning new tasks, adapting to job challenges, and developing in their role. This approach is especially prominent in fields requiring high levels of cognitive performance, such as research, engineering, and management. One notable example is the Armed Services Vocational Aptitude Battery (ASVAB), a multi-test battery that assesses candidates for military service. The ASVAB includes subtests like arithmetic reasoning, mechanical comprehension, and word knowledge, all of which reflect diverse cognitive abilities. These individual scores are then combined into the Armed Forces Qualifying Test (AFQT) score, an overall measure that serves as a proxy for GI. The AFQT score acts as a threshold across military branches, with each branch requiring minimum scores.

Here are a few ASVAB-style sample questions that reflect different cognitive areas while collectively representing general intelligence:

  1. Arithmetic Reasoning:
    If a train travels at 60 mph for 3 hours, how far does it go?
    Answer: 180 miles
  2. Word Knowledge:
    What does the word “arduous” most nearly mean?
    Answer: Difficult
  3. Mechanical Comprehension:
    If gear A turns clockwise, which direction will gear B turn if it is directly connected?
    Answer: Counterclockwise

 

How GI is Measured


In measuring GI, psychometricians use a variety of statistical techniques to ensure the reliability and validity of intelligence assessments. One common approach is factor analysis, a statistical method that identifies the relationships between variables and ensures that test items truly measure “g” as intended.

Tests designed to measure general intelligence are structured to cover a range of cognitive functions, capturing a broad spectrum of mental abilities. Each subtest score contributes to a composite score that reflects an individual’s general cognitive ability. Assessments are also periodically normed, or standardized, so that scores remain meaningful and comparable over time. This standardization process helps maintain the relevance of GI scores in diverse populations.
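To make the idea of a composite score concrete, here is a minimal Python sketch of combining subtest scores into a normed, IQ-style composite. The subtest names and norm-group statistics are hypothetical; operational intelligence tests use carefully constructed norm tables rather than a simple z-score average.

```python
# Combine subtest scores into a normed, IQ-style composite (illustrative only).
import numpy as np

# Hypothetical norm-group means and SDs for three subtests.
norms = {"verbal": (50.0, 10.0), "working_memory": (30.0, 6.0), "reasoning": (40.0, 8.0)}

def composite_iq(scores: dict) -> float:
    """Average the subtest z-scores and place the result on an IQ-like metric (100 + 15*z).

    A real composite would also be re-normed so that the composite itself has SD 15.
    """
    z = [(scores[name] - mean) / sd for name, (mean, sd) in norms.items()]
    return 100 + 15 * float(np.mean(z))

# Hypothetical examinee raw subtest scores.
print(round(composite_iq({"verbal": 58, "working_memory": 33, "reasoning": 44})))  # -> 109
```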

 

The Importance of GI in Modern Assessment

GI continues to be a critical measure for various practical and theoretical applications:

  • Predicting Success: Numerous studies have linked GI to a wide array of outcomes, from academic performance to career advancement. Because “g” encompasses the ability to learn and adapt, it is often a better predictor of success than task-specific skills alone.
  • Validating Assessments: In psychometrics, GI is used to validate and calibrate assessment tools, ensuring that they measure what they intend to. Understanding “g” helps in creating reliable test batteries and composite scores, making it essential for effective educational and professional testing.
  • Advancing Cognitive Research: GI also plays a vital role in cognitive research, helping psychologists understand the nature of mental processes and the structure of human cognition. Studies on “g” contribute to theories about how people learn, adapt, and solve problems, fueling ongoing research in cognitive psychology and neuroscience.

 

The Future of GI in Assessment

With advancements in technology, the assessment of GI is becoming more sophisticated and accessible. Computerized adaptive testing (CAT) and machine learning algorithms allow for more personalized assessments, adjusting test difficulty based on real-time responses. These innovations not only improve the accuracy of GI testing but also provide a more engaging experience for test-takers.

As our understanding of human cognition expands, the concept of GI remains a cornerstone in both educational and occupational assessments. The “g” factor offers a powerful framework for understanding mental abilities and continues to be a robust predictor of various life outcomes. Whether applied in the classroom, the workplace, or in broader psychological research, GI is a valuable metric for understanding human potential and guiding personal and professional development.

Factor analysis is a statistical technique widely used in research to understand and evaluate the underlying structure of assessment data. In fields such as education, psychology, and medicine, this approach to unsupervised machine learning helps researchers and educators identify latent variables, called factors, and which items or tests load on these factors.

For instance, when students take multiple tests, factor analysis can reveal whether these assessments are influenced by common underlying abilities, like verbal reasoning or mathematical reasoning. This insight is crucial for developing reliable and valid assessments, as it helps ensure that test items are measuring the intended constructs. It can also be used to evaluate whether items in an assessment are unidimensional, which is an assumption of both item response theory and classical test theory.

Why Do We Need Factor Analysis?

Factor analysis is a powerful tool for test validation. By analyzing the data, educators and psychometricians can confirm whether the items on a test align with the theoretical constructs they are designed to measure. This ensures that the test is not only reliable but also valid, meaning it accurately reflects the abilities or knowledge it intends to assess. Through this process, factor analysis contributes to the continuous improvement of educational tools, helping to enhance both teaching and learning outcomes.

What is Factor Analysis?

Factor analysis is a comprehensive statistical technique employed to uncover the latent structure underlying a set of observed variables. In the realms of education and psychology, these observed variables are often test scores or scores on individual test items. The primary goal of factor analysis is to identify underlying dimensions, or factors, that explain the patterns of intercorrelations among these variables. By analyzing these intercorrelations, factor analysis helps researchers and test developers understand which variables group together and may be measuring the same underlying construct.

One of the key outputs of factor analysis is the loading table or matrix (see below), which displays the correlations between the observed variables and the latent dimensions, or factors. These loadings indicate how strongly each variable is associated with a particular factor, helping to reveal the structure of the data. Ideally, factor analysis aims to achieve a “simple structure,” where each variable loads highly on one factor and has minimal loadings on others. This clear pattern makes it easier to interpret the results and understand the underlying constructs being measured. By providing insights into the relationships between variables, factor analysis is an essential tool in test development and validation, helping to ensure that assessments are both reliable and valid.
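To make the loading matrix concrete, here is a minimal Python sketch of an exploratory factor analysis on simulated scores. The variable names, factor count, and data are hypothetical, and it uses scikit-learn’s FactorAnalysis with a varimax rotation (available in recent versions); real analyses also involve decisions about extraction method, rotation, and the number of factors.

```python
# Minimal exploratory factor analysis sketch on simulated test scores.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)
n_people = 500

# Simulate two correlated latent abilities and six observed test scores.
verbal = rng.normal(size=n_people)
quant = 0.4 * verbal + rng.normal(size=n_people)   # moderately correlated factors
observed = np.column_stack([
    verbal + 0.6 * rng.normal(size=n_people),   # vocabulary
    verbal + 0.6 * rng.normal(size=n_people),   # reading comprehension
    verbal + 0.6 * rng.normal(size=n_people),   # analogies
    quant  + 0.6 * rng.normal(size=n_people),   # arithmetic
    quant  + 0.6 * rng.normal(size=n_people),   # algebra
    quant  + 0.6 * rng.normal(size=n_people),   # number series
])
names = ["vocab", "reading", "analogies", "arithmetic", "algebra", "number_series"]

# Fit a two-factor model; rotation="varimax" requires scikit-learn >= 0.24.
fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(observed)

# Loadings: rows of components_ are factors, columns are observed variables.
for name, loadings in zip(names, fa.components_.T):
    print(f"{name:15s} {loadings[0]:6.2f} {loadings[1]:6.2f}")
```

With data like this, the three verbal-type scores should load mainly on one factor and the three quantitative-type scores on the other, which is the “simple structure” described above.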

Confirmatory vs. Exploratory Factor Analysis

Factor analysis comes in two main forms: Confirmatory Factor Analysis (CFA) and Exploratory Factor Analysis (EFA), each serving distinct purposes in research.

Exploratory Factor Analysis (EFA) is typically used when researchers have little to no prior knowledge about the underlying structure of their data. It is a data-driven approach that allows researchers to explore the potential factors that emerge from a set of observed variables. In EFA, the goal is to uncover patterns and identify how many latent factors exist without imposing any preconceived structure on the data. This approach is often used in the early stages of research, where the objective is to discover the underlying dimensions that might explain the relationships among variables.

On the other hand, Confirmatory Factor Analysis (CFA) is a hypothesis-driven approach used when researchers have a clear theoretical model of the factor structure they expect to find. In CFA, researchers specify the number of factors and the relationships between the observed variables and these factors before analyzing the data. The primary goal of CFA is to test whether the data fit the hypothesized model. This approach is often used in later stages of research or in validation studies, where the focus is on confirming the structure that has been previously identified or theoretically proposed. By comparing the model fit indices, researchers can determine how well their proposed factor structure aligns with the actual data, providing a more rigorous test of their hypotheses.

Factor Analysis of Test Batteries or Sections, or Multiple Predictors

Factor analysis is particularly valuable when dealing with test batteries, which are collections of tests designed to measure various aspects of student cognitive abilities, skills, or knowledge. In the context of a test battery, factor analysis helps to identify the underlying structure of the tests and determine whether they measure distinct yet related constructs.

For example, a cognitive ability test battery might include subtests for verbal reasoning, quantitative reasoning, and spatial reasoning. Through factor analysis, researchers can examine how these subtests correlate and whether they load onto separate factors, indicating they measure distinct abilities, or onto a single factor, suggesting a more general underlying ability, often referred to as the “g” factor or general intelligence.

This approach can also incorporate non-assessment data. For example, a researcher working on employee selection might look at a set of assessments (cognitive ability, job knowledge, quantitative reasoning, MS Word skills, integrity, counterproductive work behavior), but also variables such as interview scores or resume ratings. Below is an oversimplified example of how the loading matrix might look for this.

Table 1

Variable                          Dimension 1   Dimension 2
Cognitive ability                        0.42          0.09
Job knowledge                            0.51          0.02
Quantitative reasoning                   0.36         -0.02
MS Word skills                           0.49          0.07
Integrity                                0.03          0.26
Counterproductive work behavior         -0.01          0.31
Interview scores                         0.16          0.29
Resume ratings                           0.11          0.12

Readers who are familiar with the topic will recognize this as a nod to the work by Walter Borman and Steve Motowidlo on task vs. contextual aspects of job performance.  A variable like job knowledge would load highly on a factor representing the task aspects of performing a job.  An assessment of counterproductive work behavior, however, might not predict how well an employee performs tasks, but rather how well they contribute to company culture and other contextual aspects.

This analysis is crucial for ensuring that the test battery provides comprehensive and valid measurements of the constructs it aims to assess. By confirming that each subtest contributes unique information, factor analysis supports the interpretation of composite scores and aids in the design of more effective assessment tools. The process of validating test batteries is essential to maintain the integrity and utility of the test results in educational and psychological settings.

This approach typically uses “regular” factor analysis, which assumes that scores for each input variable are normally distributed. That is usually a reasonable assumption for something like scores on an intelligence test. But scores on individual test items are rarely normally distributed, and for dichotomous data, where the only possible scores are 0 and 1, normality is impossible. Therefore, other mathematical approaches must be applied.

Factor Analysis on the Item Level

Factor analysis at the item level is a more granular approach, focusing on the individual test items rather than entire subtests or batteries. This method is used to ensure that each item contributes appropriately to the overall construct being measured and to identify any items that do not align well with the intended factors.

For instance, in a reading comprehension test, factor analysis at the item level can reveal whether each question accurately measures the construct of reading comprehension or whether some items are more aligned with other factors, such as vocabulary knowledge or reasoning skills. Items that do not load strongly onto the intended factor may be flagged for revision or removal, as they could distort the accuracy of the test scores.

This item-level analysis is crucial for developing high-quality educational or knowledge assessments, as it helps to ensure that every question is both valid and reliable, contributing meaningfully to the overall test score. It also aids in identifying “enemy items,” which are questions that could undermine the test’s consistency and fairness.

Similarly, in personality assessments like the Big Five Personality Test, factor analysis is used to confirm the structure of personality traits, ensuring that the test accurately captures the five broad dimensions: openness, conscientiousness, extraversion, agreeableness, and neuroticism. This process ensures that each trait is measured distinctly while also considering how the traits may interrelate.  Note that the result here is not evidence of overall unidimensionality in personality, but evidence supporting five factors; an assessment of any one factor is then more or less unidimensional.

An example of this is shown in Table 2 below.  Suppose all the descriptive statements are items in a survey where people rate them on a Likert scale of 1 to 5.  The survey might have hundreds of such statements, but factor analysis would align them with the Big Five, and the simple structure would look something like what you see below (2 items per factor in this small example).

 

Table 2

Statement                                               Dimension 1   Dimension 2   Dimension 3   Dimension 4   Dimension 5
I like to try new things                                       0.63          0.02          0.00         -0.03         -0.02
I enjoy exciting sports                                        0.71          0.00          0.11         -0.08          0.07
I consider myself neat and tidy                                0.02          0.56          0.08          0.11          0.08
I am a perfectionist                                          -0.05          0.69         -0.08          0.09         -0.09
I like to go to parties                                        0.11          0.15          0.74          0.08          0.00
I prefer to spend my free time alone (reverse scored)          0.13          0.07          0.81          0.01          0.05
I tend to “go with the flow”                                  -0.14          0.02         -0.04          0.68          0.08
I enjoy arguments and debates (reverse scored)                 0.03         -0.04         -0.05          0.72          0.11
I get stressed out easily (reverse scored)                    -0.05          0.03          0.03          0.05          0.81
I perform well under pressure                                  0.02          0.02          0.02         -0.01          0.77

 

Tools like MicroFACT, a specialized software for evaluating unidimensionality, are invaluable in this process. MicroFACT enables psychometricians to assess whether each item in a test measures a single underlying construct, ensuring the test’s coherence and effectiveness.

Summary

Factor analysis plays a pivotal role in the field of psychometrics, offering deep insights into the structure and validity of educational assessments. Whether applied to test batteries or individual items, factor analysis helps ensure that tests are both reliable and meaningful.

Overall, factor analysis is indispensable for developing effective educational tools and improving assessment practices. It ensures that tests not only measure what they are supposed to but also do so in a way that is fair and consistent across different groups and over time. As educational assessments continue to evolve, the insights provided by factor analysis will remain crucial in maintaining the integrity and effectiveness of these tools.


 


Setting a cutscore on a test scored with item response theory (IRT) requires some psychometric knowledge.  This post will get you started.

How do I set a cutscore with item response theory?

There are two approaches: directly with IRT, or using CTT then converting to IRT.

  1. Some standard setting methods work directly with IRT, such as the Bookmark method.  Here, you calibrate your test with IRT, rank the items by difficulty, and have an expert panel place “bookmarks” in the ranked list.  The average IRT difficulty of their bookmarks is then a defensible IRT cutscore.  The Contrasting Groups method and the Hofstee method can also work directly with IRT.
  2. Cutscores set with classical test theory, such as the Angoff, Nedelsky, or Ebel methods, are easy to implement when the test is scored classically.  But if your test is scored with the IRT paradigm, you need to convert your cutscores onto the theta scale.  The easiest way to do that is to reverse-calculate the test response function (TRF) from IRT.

The Test Response Function

The TRF (sometimes called a test characteristic curve) is an important method of characterizing test performance in the IRT paradigm.  The TRF predicts a classical score from an IRT score, as you see below.  Like the item response function and the test information function, it uses the theta scale as the X-axis.  The Y-axis can be either the number-correct metric or the proportion-correct metric.

[Figure: Test response function for a 10-item test with an Angoff cutscore]

In this example, you can see that a theta of -0.3 translates to an estimated number-correct score of approximately 7, or 70%.

Classical cutscore to IRT

So how does this help us with the conversion of a classical cutscore?  Well, we now have a way of translating any number-correct or proportion-correct score onto the theta scale.  So any classical cutscore can be reverse-calculated to a theta value.  If your Angoff study (or Beuk compromise) recommends a cutscore of 7 out of 10 points (70%), you can convert that to a theta cutscore of -0.3 as above.  If the recommended cutscore was 8 (80%), the theta cutscore would be approximately 0.7.
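Here is a minimal Python sketch of that reverse calculation. The ten sets of item parameters are made up for illustration, and the logistic 3PL without a scaling constant is just one convention, so the exact theta your software reports may differ slightly.

```python
# Convert a classical (number-correct) cutscore to a theta cutscore via the TRF.
import numpy as np
from scipy.optimize import brentq

# Hypothetical 3PL parameters for a 10-item test: (a, b, c).
items = [(1.0, -1.5, 0.2), (0.8, -1.0, 0.2), (1.2, -0.5, 0.2), (0.9, -0.2, 0.2),
         (1.1,  0.0, 0.2), (0.7,  0.3, 0.2), (1.3,  0.6, 0.2), (1.0,  1.0, 0.2),
         (0.9,  1.5, 0.2), (1.1,  2.0, 0.2)]

def p_correct(theta, a, b, c):
    """3PL probability of a correct response (logistic form, no scaling constant)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def trf(theta):
    """Test response function: expected number-correct score at a given theta."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)

cutscore = 7.0  # e.g., an Angoff-recommended 7 out of 10 (70%)

# Find the theta at which the expected number-correct equals the cutscore.
theta_cut = brentq(lambda t: trf(t) - cutscore, -4, 4)
print(f"Theta cutscore for {cutscore:.0f}/10: {theta_cut:.2f}")
```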

IRT scores examinees on the same scale with any set of items, as long as those items have been part of a linking/equating study.  Therefore, a single standard-setting study on a set of items can be equated to any other linear test form, LOFT pool, or CAT pool.  This makes it possible to apply the classically-focused Angoff method to IRT-focused programs.  You can even set the cutscore with a subset of your item pool, delivered in a linear sense, with the full intention of applying it to CAT tests later.

Note that the number-correct metric only makes sense for linear or LOFT exams, where every examinee receives the same number of items.  In the case of CAT exams, only the proportion correct metric makes sense.

How do I implement IRT?

Interested in applying IRT to improve your assessments?  Download a free trial copy of  Xcalibre  here.  If you want to deliver online tests that are scored directly with IRT, in real time (including computerized adaptive testing), check out  FastTest.


Technology-enhanced items are assessment items (questions) that utilize technology to improve the interaction of a test question in digital assessment, over and above what is possible with paper.  Tech-enhanced items can improve examinee engagement (important with K12 assessment), assess complex concepts with higher fidelity, improve precision/reliability, and enhance face validity/sellability. 

To some extent, the last word is the key one; tech-enhanced items simply look sexier and therefore make an assessment platform easier to sell, even if they don’t actually improve assessment.  I’d argue that there are also technology-enabled items, which are distinct, as discussed below.

What is the goal of technology enhanced items?

The goal is to improve assessment by increasing things like reliability/precision, validity, and fidelity. However, there are a number of TEIs that are actually designed more for sales purposes than psychometric purposes. So, how do we know if TEIs improve assessment?  That, of course, is an empirical question that is best answered with an experiment.  But let me suggest one metric to address this question: how far does the item go beyond just reformulating a traditional item format to use current user-interface technology?  I would define the mere reformulation of a traditional format as a fake TEI, while going beyond it defines a true TEI.

An alternative nomenclature might be to call the reformulations technology-enhanced items and the true tech usage technology-enabled items (Almond et al., 2010; Bryant, 2017), as the latter would not be possible without technology.

A great example of this is the relationship between a traditional multiple response item and certain types of drag and drop items.  There are a number of different ways that drag and drop items can be created, but for now, let’s use the example of a format that asks the examinee to drag text statements into a box. 

An example of this is K12 assessment items from PARCC that ask the student to read a passage, then ask questions about it.

[Image: drag-and-drop sequence item]

The item is scored with integers from 0 to K, where K is the number of correct statements; the integers are often then used to implement the generalized partial credit model for final scoring.  This would be true regardless of whether the item was presented as multiple response or drag and drop. The multiple response item, of course, could just as easily be delivered via paper and pencil. Converting it to drag and drop enhances the item with technology, but the interaction of the student with the item, psychometrically, remains the same.
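As a small illustration of that point, the scoring logic is identical for the two presentations; only the interface differs. This sketch assumes a simple “count of keyed statements selected” rubric, which is only one of several possible polytomous scoring rules.

```python
# Score a multiple-response / drag-and-drop item as an integer from 0 to K.
def score_selected_response(selected: set[str], keyed_correct: set[str]) -> int:
    """Count how many of the keyed statements the examinee selected.

    The same function applies whether the statements were checked in a
    multiple-response item or dragged into a box; the interaction differs,
    the psychometric scoring does not.
    """
    return len(selected & keyed_correct)

# Hypothetical example: 3 keyed statements, examinee selects 2 of them plus a distractor.
keyed = {"B", "D", "E"}
response = {"B", "E", "A"}
print(score_selected_response(response, keyed))  # -> 2
```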

Some True TEIs, or Technology Enabled Items

Of course, the past decade or so has witnessed stronger innovation in item formats. Gamified assessments change how the interaction of person and item is approached, though this is arguably less relevant for high-stakes assessment due to concerns about validity. There are also simulation items. For example, a test for a construction crane operator might provide an interface with crane controls and ask the examinee to complete a task. Even at the K-12 level there can be such items, such as the simulation of a science experiment where the student is given various test tubes or other instruments on the screen.

Both of these approaches are extremely powerful but have a major disadvantage: cost. They are typically custom-designed. In the case of the crane operator exam or even the science experiment, you would need to hire software developers to create this simulation. There are now some simulation-development ecosystems that make this process more efficient, but the items still involve custom authoring and custom scoring algorithms.

To address this shortcoming, there is a new generation of self-authored item types that are true TEIs. By “self-authored” I mean that a science teacher would be able to create these items themselves, just like they would a multiple choice item. The amount of technology leveraged is somewhere between a multiple choice item and a custom-designed simulation, providing a compromise of reduced cost but still increasing the engagement for the examinee. A major advantage of this approach is that the items do not need custom scoring algorithms, and instead are typically scored via point integers, which enables the use of polytomous item response theory.

Are we at least moving forward?  Not always!

There is always pushback against technology, and in this topic the counterexample is the gridded item type.  It actually goes in reverse of innovation: it does not take a traditional format and reformulate it for current UI; it ignores the capabilities of current UI (indeed, UI of the past 20+ years) and is therefore a step backward. With that item type, students are presented with a bubble sheet from a 1960s-style paper exam, on a computer screen, and asked to fill in the bubbles by clicking on them rather than using a pencil on paper.

Another example is the EBSR item type from the artist formerly known as PARCC. It was a new item type intended to assess deeper understanding, but it did not use any tech-enhancement or -enablement, instead asking two traditional questions in a linked manner. As any psychometrician can tell you, this approach ignores basic assumptions of psychometrics, so you can guess the quality of measurement that it put out.

How can I implement TEIs?

It takes very little software development expertise to develop a platform that supports multiple choice items. Item types like the ones described above, though, take substantial investment. So there are relatively few platforms that can support these, especially with best practices like workflow item review or item response theory.


A modified-Angoff method study is one of the most common ways to set a defensible cutscore on an exam.  Using it means that the pass/fail decisions made by the test are more trustworthy than if you picked an arbitrary round number like 70%.  If your doctor, lawyer, accountant, or other professional has passed an exam where the cutscore has been set with this method, you can place more trust in their skills.

What is the Angoff method?

The Angoff method is a scientific way of setting a cutscore (pass point) on a test.  If you have a criterion-referenced interpretation, it is not legally defensible to just conveniently pick a round number like 70%; you need a formal process.  There are a number of acceptable methodologies in the psychometric literature for standard-setting studies, which establish cutscores, also known as passing points.  Some examples include Angoff, modified-Angoff, Bookmark, Contrasting Groups, and Borderline.  The modified-Angoff approach is by far the most popular, and it is used especially frequently for certification, licensure, certificate, and other credentialing exams.

It was originally suggested as a mere footnote by renowned researcher William Angoff, at Educational Testing Service.

How does the Angoff approach work?

First, you gather a group of subject matter experts (SMEs), with a minimum of 6, though 8-10 is preferred for better reliability, and have them define what they consider to be a Minimally Competent Candidate (MCC).  Next, you have them estimate the percentage of minimally competent candidates that will answer each item correctly.  You then analyze the results for outliers or inconsistencies.  If experts disagree, you will need to evaluate inter-rater reliability and agreement, and after that have the experts discuss and re-rate the items to gain better consensus.  The average final rating is then the expected percent-correct score for a minimally competent candidate.

Advantages of the Angoff method

  1. It is defensible.  Because it is the most commonly used approach and is widely studied in the scientific literature, it is well-accepted.
  2. You can implement it before a test is ever delivered.  Some other methods require you to deliver the test to a large sample first.
  3. It is conceptually simple, easy enough to explain to non-psychometricians.
  4. It incorporates the judgment of a panel of experts, not just one person or a round number.
  5. It works for tests with both classical test theory and item response theory.
  6. It does not take long to implement – for a short test, it can be done in a matter of hours!
  7. It can be used with different item types, including polytomously scored items (multi-points).

Disadvantages of the Angoff method

  1. It does not use actual data, unless you implement the Beuk method alongside.
  2. It can lead to the experts overestimating the performance of entry-level candidates, as they may have forgotten what it was like to start out 20-30 years ago.  This is one reason to use the Beuk method as a “reality check,” by showing the experts that if they stay with the cutscore they just picked, the majority of candidates might fail!

Example of the Modified-Angoff Approach

First of all, do not expect a straightforward, easy process that leads to an unassailably correct cutscore.  All standard-setting methods involve some degree of subjectivity.  The goal of the methods is to reduce that subjectivity as much as possible.  Some methods focus on content, others on examinee performance data, while some try to meld the two.

Step 1: Prepare Your Team

The modified-Angoff process depends on a representative sample of SMEs, usually 6-20. By “representative” I mean they should represent the various stakeholders. For instance, a certification for medical assistants might include experienced medical assistants, nurses, and physicians, from different areas of the country. You must train them about their role and how the process works, so they can understand the end goal and drive toward it.

Step 2: Define The Minimally Competent Candidate (MCC)

This concept is the core of the modified-Angoff method, though it is known by a range of terms or acronyms, including minimally qualified candidates (MQC) or just barely qualified (JBQ).  The reasoning is that we want our exam to separate candidates that are qualified from those that are not.  So we ask the SMEs to define what makes someone qualified (or unqualified!) from a perspective of skills and knowledge. This leads to a conceptual definition of an MCC. We then want to estimate what score this borderline candidate would achieve, which is the goal of the remainder of the study. This step can be conducted in person, or via webinar.

Step 3: Round 1 Ratings

Next, ask your SMEs to read through all the items on your test form and estimate the percentage of MCCs that would answer each correctly.  A rating of 100 means the item is a slam dunk; it is so easy that every MCC would get it right.  A rating of 40 is very difficult.  Most ratings are in the 60-90 range if the items are well-developed. The ratings should be gathered independently; if everyone is in the same room, let them work on their own in silence. This can easily be conducted remotely, though.

Step 4: Discussion

This is where it gets fun.  Identify items where there is the most disagreement (as defined by grouped frequency distributions or standard deviation) and make the SMEs discuss it.  Maybe two SMEs thought it was super easy and gave it a 95 and two other SMEs thought it was super hard and gave it a 45.  They will try to convince the other side of their folly. Chances are that there will be no shortage of opinions and you, as the facilitator, will find your greatest challenge is keeping the meeting on track. This step can be conducted in person, or via webinar.

Step 5: Round 2 Ratings

Raters then re-rate the items based on the discussion.  The goal is that there will be a greater consensus.  In the previous example, it’s not likely that every rater will settle on a 70.  But if your raters all end up from 60-80, that’s OK. How do you know there is enough consensus?  We recommend the inter-rater reliability suggested by Shrout and Fleiss (1979), as well as looking at inter-rater agreement and dispersion of ratings for each item. This use of multiple rounds is known as the Delphi approach; it pertains to all consensus-driven discussions in any field, not just psychometrics.

Step 6: Evaluate Results and Final Recommendation

Evaluate the results from Round 2 as well as Round 1.  An example of this is below.  What is the recommended cutscore, which is the average or sum of the Angoff ratings depending on the scale you prefer?  Did the reliability improve?  Estimate the mean and SD of examinee scores (there are several methods for this). What sort of pass rate do you expect?  Even better, utilize the Beuk Compromise as a “reality check” between the modified-Angoff approach and actual test data.  You should take multiple points of view into account, and the SMEs need to vote on a final recommendation. They, of course, know the material and the candidates so they have the final say.  This means that standard setting is a political process; again, reduce that effect as much as you can.

Some organizations do not set the cutscore at the recommended point, but at one standard error of judgment (SEJ) below the recommended point.  The SEJ is based on the inter-rater reliability; note that it is NOT the standard error of the mean or the standard error of measurement.  Some organizations use the latter; the former is just plain wrong (though I have seen it used by amateurs).
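For those who like to see the arithmetic, here is a minimal sketch of the cutscore and SEJ calculations. The ratings matrix is hypothetical, and the SEJ formula shown (the standard deviation of the rater-level cutscores divided by the square root of the number of raters) is one common formulation; your organization may compute it differently.

```python
# Compute a modified-Angoff cutscore and a standard error of judgment (SEJ).
import numpy as np

# Hypothetical Round 2 ratings: rows are raters, columns are items,
# each value is the estimated percent of MCCs answering correctly.
ratings = np.array([
    [70, 80, 65, 90, 75, 60, 85, 70, 80, 75],
    [65, 85, 60, 95, 70, 65, 80, 75, 85, 70],
    [75, 75, 70, 85, 80, 55, 90, 65, 75, 80],
    [70, 80, 60, 90, 75, 60, 85, 70, 80, 70],
])

rater_cutscores = ratings.mean(axis=1)    # each rater's implied percent-correct cutscore
cutscore = rater_cutscores.mean()         # panel-recommended cutscore (percent correct)
sej = rater_cutscores.std(ddof=1) / np.sqrt(len(rater_cutscores))

print(f"Recommended cutscore: {cutscore:.1f}%")
print(f"SEJ: {sej:.2f} percentage points")
print(f"Cutscore one SEJ lower: {cutscore - sej:.1f}%")
```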

 

[Figure: Example summary of modified-Angoff ratings]

Step 7: Write Up Your Report

Validity refers to evidence gathered to support test score interpretations.  Well, you have lots of relevant evidence here. Document it.  If your test gets challenged, you’ll have all this in place.  On the other hand, if you just picked 70% as your cutscore because it was a nice round number, you could be in trouble.

Additional Topics

In some situations, there are more issues to worry about.  Multiple forms?  You’ll need to equate in some way.  Using item response theory?  You’ll have to convert the cutscore from the modified-Angoff method onto the theta metric using the Test Response Function (TRF).  New credential and no data available? That’s a real chicken-and-egg problem there.

Where Do I Go From Here?

Ready to take the next step and actually apply the modified-Angoff process to improving your exams?  Sign up for a free account in our  FastTest item banker. You can also download our Angoff analysis tool for free.

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.


Item response theory (IRT) is a family of machine learning models in the field of psychometrics, which are used to design, analyze, validate, and score assessments.  It is a very powerful psychometric paradigm that allows researchers to build stronger assessments, whether they work in Education, Psychology, Human Resources, or other fields.  It also solves critical measurement problems like equating across years, designing adaptive tests, or creating vertical scales.

Want to learn more about IRT, how it works, and why it is so important for assessment?  Read on.

What is Item Response Theory?

IRT is a family of models that try to describe how examinees respond to items on a test, hence the name.  These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves.  However, item response theory ended up doing so much more.

[Figure: Example item response function]

IRT is model-driven, in that there is a specific mathematical equation that is assumed, and we fit the models based on raw data, similar to linear regression.  There are different parameters (a, b, c) that shape this equation to different needs.  That’s what defines different IRT models.  This will be discussed at length below.

The models put people and items onto a latent scale, which is usually called θ (theta).  This represents whatever is being measured, whether IQ, anxiety, or knowledge of accounting laws in Croatia.  IRT helps us understand the nature of the scale, how a person answers each question, the distribution of item difficulty, and much more.  IRT used to be known as latent trait theory and item characteristic curve theory.

IRT requires specially-designed software, such as our Xcalibre, which provides a user-friendly and visual platform for implementing IRT.

 


 

Why do we need Item Response Theory?

IRT represents an important innovation in the field of psychometrics. While now more than 50 years old – assuming the “birth” is the classic Lord and Novick (1968) text – it is still underutilized and remains a mystery to many practitioners.

Item response theory is more than just a way of analyzing exam data, it is a paradigm to drive the entire lifecycle of designing, building, delivering, scoring, and analyzing assessments.

IRT requires larger sample sizes and is much more complex than its predecessor, classical test theory, but is also far more powerful.  IRT requires quite a lot of expertise, typically a PhD.  So it is not used for small assessments like a final exam at universities, but is used for almost all major assessments in the world.

 

The Driver: Problems with Classical Test Theory

Classical test theory (CTT) is approximately 100 years old, and still remains commonly used because it is appropriate for certain situations, and it is simple enough that it can be used by many people without formal training in psychometrics.  Most statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  A list of these is presented later.

Learn more about the differences between CTT and IRT here.

 

Item Response Theory Parameters

The foundation of IRT is a mathematical model defined by item parameters.  A parameter is an aspect of a mathematical model that can change its shape or other aspects.  For dichotomous items (those scored correct/incorrect), each item has three parameters:

 

   a: the discrimination parameter, an index of how well the item differentiates low from top examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.

   b: the difficulty parameter, an index of what level of examinees for which the item is appropriate; typically ranges from -3 to +3, with 0 being an average examinee level.

   c: the pseudo-guessing parameter, which is a lower asymptote; typically near 1/k, where k is the number of options.

These parameters are used in the formula below, but are also displayed graphically.

$$P(\theta) = c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}$$

Item response function

These parameters are used to graphically display an item response function (IRF), which models the probability of a correct answer as a function of ability.  In the example IRF, the a parameter is approximately 1.0, indicating a fairly discriminating test item.  The b parameter is approximately 0.0 (the point on the x-axis where the midpoint of the curve is), indicating an average-difficulty item; examinees of average ability would have a 60% chance of answering correctly.  The c parameter is approximately 0.20, like a 5-option multiple choice item.  Consider the x-axis to be z-scores on a standard normal scale.

In some cases, there is no guessing involved, and we only use a and b.  This is called the two-parameter model.  If we only use b, this is the one-parameter or Rasch Model.  Here is how that is calculated.

$$P(\theta) = \frac{e^{(\theta - b)}}{1 + e^{(\theta - b)}}$$

Item parameters, which are crucial within the IRT framework, might change over time or multiple testing occasions, a phenomenon known as item parameter drift.

 

Example Item Response Theory calculations

Examinees with higher ability are much more likely to respond correctly.  Look at the graph above.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only about a 25% chance – barely above the 1 in 5 guessing rate of 20%.  An average person (0.0) has a 60% chance.  Why 60?  Because we are accounting for guessing.  If the curve went from 0% to 100% probability, then yes, the middle would be a 50% chance.  But here, we assume 20% as a baseline due to guessing, so halfway up is 60%.
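You can verify these ballpark figures with a few lines of Python. This sketch uses the logistic 3PL without a scaling constant, and the percentages quoted above were read approximately off the graph, so the numbers will be close but not identical.

```python
# Probability of a correct response under the 3PL for the example item.
import numpy as np

def p_3pl(theta, a=1.0, b=0.0, c=0.20):
    """Logistic 3PL item response function (no scaling constant)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}  ->  P(correct) = {p_3pl(theta):.2f}")
# Roughly 0.30, 0.60, and 0.90 with these parameters; adding the 1.7
# scaling constant inside the exponent pulls the extremes further apart.
```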

[Figure: Five item response functions from the three-parameter model]

Of course, the parameters can and should differ from item to item, reflecting differences in item performance.  The graph above shows five IRFs from the three-parameter model.  The dark blue line is the easiest item, with a b of -2.00.  The light blue item is the hardest, with a b of +1.80.  The purple one has a c of 0.00, while the light blue has a c of 0.25, indicating that it is more susceptible to guessing.

These IRFs are not just a pretty graph or a way to describe how an item performs.  They are the basic building block to accomplishing those important goals mentioned earlier.  That comes next…

 

Applications of Item Response Theory to Improve Assessment

Item response theory uses the IRF for several purposes.  Here are a few.

[Figure: Test information function]

  1. Interpreting and improving item performance
  2. Scoring examinees with maximum likelihood or Bayesian methods
  3. Form assembly, including linear on the fly testing (LOFT) and pre-equating
  4. Calculating the accuracy of examinee scores
  5. Development of computerized adaptive tests (CAT)
  6. Post-equating
  7. Differential item functioning (finding bias)
  8. Data forensics to find cheaters or other issues

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is the inverse of the square root of the TIF, and has the interpretable advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the SEM is a 95% confidence interval for their score.  The graph above shows part of the form assembly process in our FastTest platform.
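Here is a minimal sketch of how the TIF and CSEM relate, using the standard 3PL item information formula. The item parameters are hypothetical, and operational software sums information over all items on the form and across a fine grid of theta values.

```python
# Test information function (TIF) and conditional SEM for a small hypothetical form.
import numpy as np

items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2), (1.1, 1.0, 0.2)]  # (a, b, c)

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Standard 3PL item information (logistic metric)."""
    p = p_3pl(theta, a, b, c)
    q = 1 - p
    return a**2 * (q / p) * ((p - c) / (1 - c))**2

for theta in np.linspace(-3, 3, 7):
    tif = sum(item_information(theta, a, b, c) for a, b, c in items)
    csem = 1 / np.sqrt(tif)   # conditional standard error of measurement
    print(f"theta = {theta:+.1f}   TIF = {tif:5.2f}   CSEM = {csem:5.2f}")
```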

 

Assumptions of Item Response Theory

Item response theory assumes a few things about your data.

  1. The latent trait you are measuring is unidimensional.  If it is multidimensional, there is multidimensional item response theory, or you can treat the dimensions as separate traits.
  2. Items have local independence, which means that the act of answering one is not impacted by others.  This affects the use of testlets and enemy items.
  3. The probability of responding correctly to an item (or responding in a certain category, in the case of polytomous items such as Likert scales) is a function of the examinee’s ability/trait level and the parameters of the model, following the calculation of the item response function, with some allowance for random error.  As a corollary, we are assuming that the ability/trait has some distribution, with some people having higher or lower levels (e.g., intelligence), and that we are trying to find those differences.

Many texts will only postulate the first two as assumptions, because the third is just implicitly assumed.

 

Advantages and Benefits of Item Response Theory

So why does this matter?  Let’s go back to the problems with classical test theory.  Why is IRT better?

  • Sample-independence of scale: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation.  Two samples of different ability levels can be easily converted onto the same scale.
  • Test statistics: Classical statistics are tied to a specific test form; IRT item parameters and scores are not.
  • Sparse matrices are OK: Classical test statistics do not work with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing.
  • Linking/equating: Item response theory has much stronger equating, so if your exam has multiple forms, or if you deliver twice per year with a new form, you can have much greater validity in the comparability of scores.
  • Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect.
  • Vertical scaling: IRT can do vertical scaling but CTT cannot.
  • Accounting for guessing: CTT does not account for guessing on multiple choice exams.
  • Scoring: Scoring in classical test theory does not take into account item difficulty.  With IRT, you can score a student on any set of items and be sure it is on the same latent scale.
  • Adaptive testing: CTT does not support adaptive testing in most cases.  Adaptive testing has its own list of benefits.
  • Characterization of error: CTT assumes that every examinee has the same amount of error in their score (SEM); IRT recognizes that if the test is all middle-difficulty items, then low or high students will have inaccurate scores.
  • Stronger form building: IRT has functionality to build forms to be more strongly equivalent and meet the purposes of the exam.
  • Nonlinear function: IRT does not assume a linear function for the student-item relationship; CTT assumes a linear function (the point-biserial) even when that is blatantly impossible.

 

Item Response Theory Models: One Big Happy Family

Remember: IRT is actually a family of models, making flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and the fit of the data.  If there are multipoint items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters. Learn more about the partial credit situation here.

Here’s a quick breakdown of the family tree, with the most common models:

  • Dichotomous models: the one-parameter/Rasch model (b only), the two-parameter model (a and b), and the three-parameter model (a, b, and c).
  • Polytomous models: the Rating Scale Model, the Partial Credit Model, the Generalized Partial Credit Model, and the Graded Response Model.

 

How do I analyze my test with Item Response Theory?

[Figure: Item fit plot for an example item, comparing observed proportions (red) to the model-predicted item response function (black)]

First: you need to get special software.  There are some commercial packages like  Xcalibre, or you can use packages inside platforms like R and Python.

The software will analyze the data in cycles or loops to try to find the best model.  This is because, as always, the data does not always perfectly align.  You might see graphs like the one above if you compare actual proportions (red) to the predicted ones from the item response function (black).  That’s OK!  IRT is quite robust, and there are analyses built in to help you evaluate model fit; a rough code sketch of this kind of check appears after the list below.

Some more unpacking of the image above:

  • This was item #39 on the test
  • We are using the three parameter logistic model (3PL), as this was a multiple choice item with 4 options
  • 3422 examinees answered the item
  • 76.9% of them got it correct
  • The classical item discrimination (point biserial item-total correlation) was 0.253, which is OK but not very high
  • The a parameter was 0.432, which is OK but not very strong
  • The b parameter was -1.195, which means the item was quite easy
  • The c parameter was 0.248, which you would expect if there was a 25% chance of guessing
  • The Chi-square fit statistic rejected the null, indicating poor fit, but this statistic is susceptible to sample size
  • The z-Resid fit statistic is a bit more robust, and it did not flag the item for bad fit
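Here is the rough sketch promised above: it simulates responses from a known 3PL item, bins examinees by theta, and compares the observed proportion correct in each bin to the model prediction, which is essentially what the red and black curves in a fit plot show. The parameters and sample size are hypothetical, and real software applies formal fit statistics (chi-square, z-Resid) to the same idea.

```python
# Compare observed proportions correct to 3PL model predictions (a crude fit check).
import numpy as np

rng = np.random.default_rng(7)
a, b, c = 0.5, -1.2, 0.25          # hypothetical item parameters
n_examinees = 3000

def p_3pl(theta):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = rng.normal(size=n_examinees)                 # simulated abilities
responses = rng.random(n_examinees) < p_3pl(theta)   # simulated 0/1 responses

# Bin examinees by theta and compare observed vs. predicted proportion correct.
edges = np.linspace(-3, 3, 13)
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (theta >= lo) & (theta < hi)
    if mask.sum() < 30:
        continue  # skip sparse bins
    observed = responses[mask].mean()
    predicted = p_3pl((lo + hi) / 2)
    print(f"theta in [{lo:+.1f}, {hi:+.1f})  observed = {observed:.2f}  predicted = {predicted:.2f}")
```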

[Figure: Xcalibre output for a polytomous item calibrated with the generalized partial credit model]
The image here shows output from Xcalibre for the generalized partial credit model, which is a polytomous model often used for items scored with partial credit.  For example, a question might list 6 animals and ask students to click on the ones that are reptiles, of which there are 3.  The possible scores are then 0, 1, 2, or 3.

Here, the graph labels them as 1-2-3-4, but the meaning is the same.  Here is how you can interpret this.

  • Someone is likely to get 0 points if their theta is below -2.0 (bottom 3% or so of students).
  • A few low students might get 1 point (green)
  • Low-middle ability students are likely to get 2 correct (blue)
  • Anyone above average (0.0) is likely to get all 3 correct.

The boundary locations are where one level becomes more likely than another, i.e., where the curves cross.  For example, you can see that the blue and black lines cross at the boundary -0.339.
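For the curious, here is a minimal sketch of how those category curves and their crossing points arise from the generalized partial credit model. The discrimination and threshold values are hypothetical; in the GPCM, adjacent category curves cross exactly at the threshold (boundary) parameters.

```python
# Category probabilities for a 0-3 point item under the generalized partial credit model.
import numpy as np

a = 0.9                                # hypothetical discrimination
thresholds = [-2.1, -1.2, -0.34]       # hypothetical boundary (step) parameters

def gpcm_probs(theta):
    """Return P(score = 0..3 | theta) under the GPCM."""
    # Cumulative sums of a*(theta - b_v); the score-0 term is defined as 0.
    numerators = np.exp(np.concatenate(([0.0], np.cumsum(a * (theta - np.array(thresholds))))))
    return numerators / numerators.sum()

for theta in (-2.5, -1.5, -0.5, 0.5):
    probs = gpcm_probs(theta)
    print(f"theta = {theta:+.1f}  " + "  ".join(f"P({k})={p:.2f}" for k, p in enumerate(probs)))

# Adjacent categories are equally likely exactly at the thresholds: at theta = -0.34,
# P(score=2) = P(score=3), matching the "boundary location" idea described above.
```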

Where can I learn more?

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software, Xcalibre.

Want to talk to one of our experts about how to apply IRT?  Get in touch!



The Partnership for Assessment of Readiness for College and Careers (PARCC) is a consortium of US States working together to develop educational assessments aligned with the Common Core State Standards.  This is a daunting task, and PARCC is doing an admirable job, especially with their focus on utilizing technology.  However, one of the new item types has a serious psychometric fault that deserves a caveat with regards to scoring and validation.

What is an Evidence-Based Selected-Response (EBSR) question?

The item type is an “Evidence-Based Selected-Response” (PARCC EBSR) item format, commonly called a Part A/B item or Two-Part item.  The goal of this format is to delve deeper into student understanding and award credit for deeper knowledge while minimizing the impact of guessing.  This is obviously an appropriate goal for assessment.  To do so, the item is presented to the student in two parts, where the first part asks a simple question and the second part asks for supporting evidence for the answer given in Part A.  Students must answer Part A correctly to receive credit on Part B.  As described on the PARCC website:

In order to receive full credit for this item, students must choose two supporting facts that support the adjective chosen for Part A. Unlike tests in the past, students may not guess on Part A and receive credit; they will only receive credit for the details they’ve chosen to support Part A.

How EBSR items are scored

While this makes sense in theory, it leads to problems in data analysis, especially if you are using item response theory (IRT).  Obviously, the Part A/B dependency violates the fundamental IRT assumption of local independence (items are not dependent on each other).  So when working with a client of mine, we decided to combine the two parts into one multi-point question, which matches the theoretical approach PARCC EBSR items are taking.  The goal was to calibrate the item with Muraki’s Generalized Partial Credit Model (GPCM), which is the standard approach used to analyze polytomous items in K12 assessment (learn more here).  The GPCM tries to order students based on the points they earn: 0-point students tend to have the lowest ability, 1-point students moderate ability, and 2-point students the highest ability.  Should be obvious, right?  Nope.

The first thing we noticed was that some point levels had very small sample sizes.  Suppose that Part A is 1 point and Part B is 1 point (select two evidence pieces but must get both).  Most students will get 0 points or 2 points.  Not many will receive 1 point.  We thought about it, and realized that the only way to earn 1 point is to guess Part A but select no correct evidence or only select one evidence point.  This leads to issues with the GPCM.
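A quick simulation illustrates why the middle category ends up so sparse. All of the probabilities here are hypothetical; the point is simply that, under the Part A gating rule, the only route to exactly 1 point is to guess Part A correctly without knowing the evidence.

```python
# Simulate composite scores (0-2) for a hypothetical Part A/B (EBSR) item.
import numpy as np

rng = np.random.default_rng(1)
n = 10000

knows_content = rng.random(n) < 0.55      # hypothetical: 55% of students know the material
guesses_a = rng.random(n) < 0.25          # 4-option Part A guess rate for those who don't know

part_a = knows_content | guesses_a
part_b = knows_content & part_a           # evidence credit essentially requires knowing it,
                                          # and is only awarded if Part A is correct

score = part_a.astype(int) + part_b.astype(int)
for s in (0, 1, 2):
    print(f"score {s}: {np.mean(score == s):.1%}")
# The 1-point group is mostly lucky guessers on Part A, which is why it is small
# and behaves like a flat (guessing-driven) line in the quantile plots.
```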

Using the Generalized Partial Credit Model

Even when there was sufficient N at each level, we found that the GPCM had terrible fit statistics, meaning that the item was not performing according to the model described above.  So I ran Iteman, our classical analysis software, to obtain quantile plots that approximate the polytomous IRFs without imposing the GPCM modeling.  I found that the 0-2 point items tend to have the issue that not many students get 1 point, and moreover the line for that category is relatively flat.  The GPCM assumes that it is relatively bell-shaped.  So the GPCM is looking for where the drop-offs are in the bell shape, crossing with adjacent category response functions – the thresholds – and they aren’t there.  The GPCM would blow up, usually not even estimating thresholds in the correct ordering.

[Figure: Quantile plots for example PARCC EBSR items]

So I tried to think of this from a test development perspective.  How do students get 1 point on these PARCC EBSR items?  The only way to do so is to get Part A right but not Part B.  Given that Part B is the reason for Part A, this means this group is students who answer Part A correctly but don’t know the reason, which means they are guessing.  It is then no surprise that the data for 1-point students is in a flat line – it’s just like the c parameter in the 3PL.  So the GPCM will have an extremely tough time estimating threshold parameters.

Why EBSR items don’t work

From a psychometric perspective, point levels are supposed to represent different levels of ability.  A 1-point student should be of higher ability than a 0-point student on this item, and a 2-point student of higher ability than a 1-point student.  This seems obvious and intuitive.  But this item, by definition, violates the idea that a 1-point student should have higher ability than a 0-point student.  The only way to get 1 point is to guess the first part – which means these students do not know the answer and are no different from the 0-point examinees.  So of course the 1-point results look funky here.

The items were instead calibrated as two separate dichotomous items rather than one polytomous item, and the statistics turned out much better.  This still violates the local independence assumption of IRT, but it at least produces usable IRT parameters that can score students.  Nevertheless, I think the scoring of these items needs to be revisited so that the algorithm produces data that can be calibrated with IRT.

The entire goal of test items is to provide data points used to measure students; if the Evidence-Based Selected-Response item type is not providing usable data, then it is not worth using, no matter how good it seems in theory!

test-scaling

Scaling is a psychometric term regarding the establishment of a score metric for a test, and it often has two meanings. First, it involves defining the method used to operationally score the test, establishing the underlying scale on which people are being measured.  A common example is the T-score, which transforms raw scores into a standardized scale with a mean of 50 and a standard deviation of 10, making it easier to compare results across different populations or test forms.  Second, it refers to score conversions used for reporting scores, especially conversions that are designed to carry specific information.  The latter is typically called scaled scoring.

Examples of Scaling

You have all been exposed to this type of scaling, though you might not have realized it at the time. Most high-stakes tests like the ACT, SAT, GRE, and MCAT are reported on scales that are selected to convey certain information, with the actual numbers selected more or less arbitrarily. The SAT and GRE have historically had a nominal mean of 500 and a standard deviation of 100, while the ACT has a nominal mean of 18 and a standard deviation of 6. These are effectively the same scale, because each is nothing more than a converted z-score (standard score); the conversion exists simply because no examinee wants to receive a score report saying they got a score of -1. The numbers above were arbitrarily selected, and the score range bounds were then chosen based on the fact that roughly 99.7% of the population falls within plus or minus three standard deviations. Hence, the SAT and GRE range from 200 to 800 and the ACT ranges from 0 to 36. This leads to the urban legend of receiving 200 points for writing your name correctly on the SAT; again, it feels better for the examinee. A score of 300 might seem like a big number, and 100 points above the minimum, but it just means that someone is around the 2nd percentile.
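As a quick illustration, here is a minimal sketch of that linear conversion using the nominal constants mentioned above; the clipping to published bounds is my own simplification of how reported ranges are enforced.

```python
def to_reported_scale(z, mean, sd, lo=None, hi=None):
    """Convert a z-score to a reporting scale via a linear transform,
    optionally clipping to the published score bounds."""
    score = mean + sd * z
    if lo is not None:
        score = max(lo, score)
    if hi is not None:
        score = min(hi, score)
    return round(score)

z = -2.0  # an examinee two standard deviations below the mean
print(to_reported_scale(z, 500, 100, 200, 800))  # historical SAT/GRE metric -> 300
print(to_reported_scale(z, 18, 6, 0, 36))        # ACT metric -> 6
print(to_reported_scale(z, 50, 10))              # T-score -> 30
```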

Now, notice that I said “nominal.” That is because the tests do not actually show those means in observed samples; the samples have substantial range restriction. Because these tests are only taken by students serious about proceeding to the next level of education, the actual sample is of higher ability than the population. The lower third or so of high school students usually do not bother with the SAT or ACT. So many states will have an observed average ACT of 21 and a standard deviation of 4. This is an important issue to consider in developing any test. Consider just how restricted the population of medical school students is; it is a very select group.

How can I select a score scale?

score-scale

For various reasons, actual observed scores from tests are often not reported, and only converted scores are reported.  If multiple forms are being equated, scaling will hide the fact that the forms differ in difficulty and, in many cases, differ in cutscore.  Scaled scores can also facilitate clearer feedback to examinees, and they can help the organization avoid lengthy explanations of IRT scoring, which can be a headache for some audiences.

When deciding on the conversion calculations, there are several important questions to consider.

First, do we want to be able to make fine distinctions among examinees? If so, the range should be sufficiently wide. My personal view is that the scale should be at least as wide as the number of items; otherwise you are voluntarily giving up information. This in turn means you are giving up variance, which makes it more difficult to correlate your scaled scores with other variables, the way the MCAT is correlated with success in medical school. This, of course, means that you are hampering future research – unless that research is able to revert back to the actual observed scores to make sure all possible information is used. For example, suppose a test with 100 items is reported on a 5-point grade scale of A-B-C-D-F. That scale is quite restricted, and therefore difficult to correlate with other variables in research. But you have the option of reporting the grades to students and still using the original scores (0 to 100) for your research.
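Here is a minimal simulation sketch of that point, with made-up numbers: collapsing a 0–100 score scale into five grades throws away variance, which attenuates the correlation with an outside variable.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated raw test scores (0-100) and a correlated outcome variable
raw = np.clip(rng.normal(70, 12, n), 0, 100)
outcome = 0.6 * (raw - raw.mean()) / raw.std() + rng.normal(0, 0.8, n)

# Collapse the 100-point scale into a 5-point A-F grade scale (cutoffs are illustrative)
grades = np.digitize(raw, bins=[60, 70, 80, 90])  # 0 = F ... 4 = A

corr_raw = np.corrcoef(raw, outcome)[0, 1]
corr_grades = np.corrcoef(grades, outcome)[0, 1]
print(f"Correlation using 0-100 scores: {corr_raw:.3f}")
print(f"Correlation using A-F grades:   {corr_grades:.3f}  (attenuated)")
```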

Along the same lines, we can swing completely in the other direction. For many tests, the purpose of the test is not to make fine distinctions, but only to broadly categorize examinees. The most common example of this is a mastery test, where the examinee is being assessed on their mastery of a certain subject, and the only possible scores are pass and fail. Licensure and certification examinations are an example. An extension of this is the “proficiency categories” used in K-12 testing, where students are classified into four groups: Below Basic, Basic, Proficient, and Advanced. This is used in the National Assessment of Educational Progress. Again, we see the care taken for reporting of low scores; instead of receiving a classification like “nonmastery” or “fail,” the failures are given the more palatable “Below Basic.”

Another issue to consider, which is very important in some settings but irrelevant in others, is vertical scaling. This refers to the chaining of scales across various tests that are at quite different levels. In education, this might involve linking the scales of exams in 8th grade, 10th grade, and 12th grade (graduation), so that student progress can be accurately tracked over time. Obviously, this is of great use in educational research, such as the medical school process. But for a test to award a certification in a medical specialty, it is not relevant because it is really a one-time deal.

Lastly, there are three calculation options: pure linear (ScaledScore = RawScore * Slope + Intercept), standardized conversion (Old Mean/SD to New Mean/SD), and nonlinear approaches like Equipercentile.
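The first two options are simple linear transforms like the one sketched earlier; the equipercentile approach instead matches scores that sit at the same percentile rank in two distributions. Here is a deliberately simplified sketch of that idea with made-up data; operational equipercentile methods add smoothing and other refinements.

```python
import numpy as np

def equipercentile(raw, ref_scores, target_scores):
    """Very simplified equipercentile mapping: find the percentile rank of `raw`
    in the reference score distribution, then return the score at that same
    percentile in the target distribution."""
    pct = np.mean(np.asarray(ref_scores) <= raw) * 100
    return np.percentile(target_scores, pct)

rng = np.random.default_rng(1)
form_x = rng.normal(40, 8, 1000)      # observed raw scores on one form (hypothetical)
scale_y = rng.normal(500, 100, 1000)  # target reporting scale distribution (hypothetical)
print(round(equipercentile(42, form_x, scale_y)))  # scaled score at the same percentile rank
```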

Perhaps the most important issue is whether the scores from the test will be criterion-referenced or norm-referenced. Often, this choice will be made for you because it distinctly represents the purpose of your tests. However, it is quite important and usually misunderstood, so I will discuss this in detail.

Criterion-Referenced vs. Norm-Referenced

data-analysis-norms

This is a distinction between the ways test scores are used or interpreted. A criterion-referenced score interpretation means that the score is interpreted with regard to defined content, a blueprint, or a curriculum (the criterion), and ignores how other examinees perform (Bond, 1996). A classroom assessment is the most common example; students are scored on the percent of items correct, which is taken to imply the percent of the content they have mastered. Conversely, a norm-referenced score interpretation is one where the score provides information about the examinee’s standing in the population, but no absolute (or ostensibly absolute) information regarding their mastery of content. This is often the case with non-educational measurements like personality or psychopathology. There is no defined content which we can use as a basis for some sort of absolute interpretation. Instead, scores are often either z-scores or some linear function of z-scores; IQ, for example, is historically scaled with a mean of 100 and a standard deviation of 15.

It is important to note that this dichotomy is not a characteristic of the test, but of the test score interpretations. This fact is more apparent when you consider that a single test or test score can have several interpretations, some of which are criterion-referenced and some of which are norm-referenced. We will discuss this in more depth when we reach the topic of validity, but consider the following example. A high school graduation exam is designed to be a comprehensive summative assessment of a secondary education. It is therefore specifically designed to cover the curriculum used in schools, and scores are interpreted within that criterion-referenced context. Yet scores from this test could also be used for making acceptance decisions at universities, where scores are only interpreted with respect to their percentile (e.g., accept the top 40%). The scores might even do a fairly decent job in this norm-referenced application. However, this is not what they were designed for, and such score interpretations should be made with caution.

Another important note is the definition of “criterion.” Because most tests with criterion-referenced scores are educational and involve a cutscore, a common misunderstanding is that the cutscore is the criterion. It is still the underlying content or curriculum that is the criterion, because we can have this type of score interpretation without a cutscore. Regardless of whether there is a cutscore for pass/fail, a score on a classroom assessment is still interpreted with regard to mastery of the content.  To further add to the confusion, Industrial/Organizational psychology refers to outcome variables as the criterion; for a pre-employment test, the criterion is typically job performance at a later time.

This dichotomy also leads to some interesting thoughts about the nature of your construct. If you have a criterion-referenced score, you are assuming that the construct is concrete enough that anybody can make interpretations regarding it, such as mastering a certain percentage of content. This is why non-concrete constructs like personality tend to be only norm-referenced. There is no agreed-upon blueprint of personality.

Multidimensional Scaling

camera lenses for multidimensional item response theory

An advanced topic worth mentioning is multidimensional scaling (see Davison, 1998). The purpose of multidimensional scaling is similar to factor analysis (a later discussion!) in that it is designed to evaluate the underlying structure of constructs and how they are represented in items. This is therefore useful if you are working with constructs that are brand new, so that little is known about them, and you think they might be multidimensional. This is a pretty small percentage of the tests out there in the world; I encountered the topic in my first year of graduate school – only because I was in a Psychological Scaling course – and have not encountered it since.

Summary of test scaling

Scaling is the process of defining the scale on which your measurements will take place. It raises fundamental questions about the nature of the construct. Fortunately, in many cases we are dealing with a simple construct that has well-defined content, like an anatomy course for first-year medical students. Because it is so well-defined, we often take criterion-referenced score interpretations at face value. But as constructs become more complex, like the job performance of a first-year resident, it becomes harder to define the scale, and we start to deal more in relatives than absolutes. At the other end of the spectrum are completely ephemeral constructs where researchers still can’t agree on the nature of the construct, and we are pretty much limited to z-scores. Intelligence is a good example of this.

Some sources attempt to delineate the scaling of people and the scaling of items or stimuli as separate things, but this is really impossible because they are so confounded: people define item statistics (the percent of people that get an item correct) and items define people’s scores (the percent of items a person gets correct). It is for this reason that item response theory, the most advanced paradigm in measurement theory, was designed to place items and people on the same scale. It is also for this reason that item writing should consider how the items are going to be scored and therefore lead to person scores. But because we start writing items long before the test is administered, and the nature of the construct is caught up in the scale, the issues presented here need to be addressed at the very beginning of the test development cycle.

certification exam development construction

Certification exam development is a well-defined process governed by accreditation guidelines such as those of the NCCA, requiring steps such as a job task analysis and a standard-setting study.  For certification, and other credentialing like licensure or certificates, this process is incredibly important to establishing validity.  Such exams serve as gatekeepers into many professions, often after people have invested a ton of money and years of their life in preparation.  Therefore, it is critical that the tests be developed well and have the necessary supporting documentation to show that they are defensible.

So what exactly goes into developing a quality exam, sound psychometrics, and the validity documentation, perhaps enough to achieve NCCA accreditation for your certification? Well, there is a well-defined and recognized process for certification exam development, though it is rarely exactly the same for every organization.  In general, the accreditation guidelines say you need to address these things, but leave the specific approach up to you.  For example, you have to do a cutscore study, but you are allowed to choose the Bookmark method vs. the Angoff method vs. another defensible approach.

Key Stages in Certification Exam Development

Job Analysis / Practice Analysis

A job analysis study provides the vehicle for defining the important job knowledge, skills, and abilities (KSA) that will later be translated into content on a certification exam. During a job analysis, important job KSAs are obtained by directly analyzing job performance of highly competent job incumbents or surveying subject-matter experts regarding important aspects of successful job performance. The job analysis generally serves as a fundamental source of evidence supporting the validity of scores for certification exams.

Test Specifications and Blueprints

The results of the job analysis study are quantitatively converted into a blueprint for the certification exam.  Basically, it comes down to this: if the experts say that a certain topic or skill is done quite often or is very critical, then it deserves more weight on the exam, right?  There are different ways to do this; my favorite article on the topic is Raymond & Neustel (2006).  Here’s a free tool to help.
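As an illustration, here is a minimal sketch of one common weighting scheme, in which mean frequency and criticality ratings from the job analysis survey are multiplied and then normalized into blueprint percentages; the domains and ratings below are hypothetical, and Raymond & Neustel discuss several alternative schemes.

```python
# Hypothetical survey results: mean frequency and criticality ratings per domain,
# both on a 1-5 scale.
domains = {
    "Patient assessment": (4.2, 4.8),
    "Treatment planning": (3.5, 4.1),
    "Documentation":      (4.6, 2.9),
    "Ethics & law":       (2.1, 4.5),
}

# One common approach: weight = frequency x criticality, then normalize.
raw_weights = {d: freq * crit for d, (freq, crit) in domains.items()}
total = sum(raw_weights.values())

n_items = 150  # planned length of the exam form
for domain, w in raw_weights.items():
    pct = w / total
    print(f"{domain:20s} {pct:5.1%}  ~{round(pct * n_items)} items")
```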

test development cycle job task analysis

Item Development

After important job KSAs are established, subject-matter experts write test items to assess them. The end result is the development of an item bank from which exam forms can be constructed. The quality of the item bank also supports test validity.  A key operational step is developing an Item Writing Guide and holding an item writing workshop for the SMEs.

Pilot Testing

There should be evidence that each item in the bank actually measures the content that it is supposed to measure; to assess this, data must be gathered from samples of test-takers. After items are written, they are generally pilot tested by administering them to a sample of examinees in a low-stakes context—one in which examinees’ responses to the test items do not factor into any decisions regarding competency. After pilot test data is obtained, a psychometric analysis of the test and test items can be performed. This analysis yields statistics that indicate the degree to which the items measure the intended test content. Items that appear to be weak indicators of the test content are generally removed from the item bank, or flagged so that subject-matter experts can review them for correctness and clarity.

Note that this is not always possible, and is one of the ways that different organizations diverge in how they approach exam development.

Standard Setting

Standard setting is also a critical source of evidence supporting the validity of professional credentialing exam decisions (i.e., pass/fail) made based on test scores.  Standard setting is a process by which a passing score (or cutscore) is established; this is the point on the score scale that differentiates between examinees who are and are not deemed competent to perform the job. In order to be valid, the cutscore cannot be arbitrarily defined. Two examples of arbitrary methods are the quota (setting the cutscore to produce a certain percentage of passing scores) and the flat cutscore (such as 70% on all tests). Both of these approaches ignore the content and difficulty of the test.  Avoid these!

Instead, the cutscore must be based on one of several well-researched criterion-referenced methods from the psychometric literature.  There are two types of criterion-referenced standard-setting procedures (Cizek, 2006): examinee-centered and test-centered.

The Contrasting Groups method is one example of a defensible examinee-centered standard-setting approach. This method compares the scores of candidates previously classified as Pass or Fail. Obviously, this has the drawback that some separate classification method must already exist. Moreover, examinee-centered approaches such as this require data from examinees, but many testing programs wish to set the cutscore before publishing the test and delivering it to any examinees. Therefore, test-centered methods are more commonly used in credentialing.

The most frequently used test-centered method is the Modified Angoff Method (Angoff, 1971) which requires a committee of subject matter experts (SMEs).  Another commonly used approach is the Bookmark Method.
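To make the Modified Angoff computation concrete, here is a minimal sketch with hypothetical ratings: each SME judges the probability that a minimally competent candidate would answer each item correctly, the ratings are averaged per item, and the item averages are summed to give a raw cutscore. Operational studies add discussion rounds, impact data, and other refinements.

```python
import numpy as np

# Hypothetical Angoff ratings: rows = SMEs, columns = items.
# Each value is the judged probability that a minimally competent
# candidate answers the item correctly.
ratings = np.array([
    [0.70, 0.55, 0.90, 0.60, 0.80],
    [0.65, 0.60, 0.85, 0.55, 0.75],
    [0.75, 0.50, 0.95, 0.65, 0.85],
])

item_means = ratings.mean(axis=0)   # average rating per item
raw_cutscore = item_means.sum()     # expected raw score of the minimally competent candidate
print("Item means:", np.round(item_means, 3))
print("Raw cutscore:", round(raw_cutscore, 2), "out of", ratings.shape[1])
```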

Equating

If the test has more than one form – which is required by NCCA standards and other guidelines – the forms must be statistically equated.  If you use classical test theory, there are methods like Tucker or Levine.  If you use item response theory, you can either bake the equating into the item calibration process with software like Xcalibre, or use conversion methods like Stocking & Lord.

What does this process do?  Well, if this year’s certification exam had an average 3 points higher than last year’s, how do you know whether this year’s version was 3 points easier, this year’s cohort was 3 points smarter, or a mixture of both?  Learn more here.
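As a flavor of how this works on the IRT side, here is a minimal sketch of mean/sigma linking on common (anchor) items – a simpler alternative to Stocking & Lord – using hypothetical parameter values.

```python
import numpy as np

# Hypothetical IRT difficulty (b) parameters for the same anchor items,
# estimated separately on last year's form (old) and this year's form (new).
b_old = np.array([-1.20, -0.40, 0.10, 0.75, 1.50])
b_new = np.array([-1.05, -0.20, 0.35, 0.95, 1.70])

# Mean/sigma linking: find A and B so that A * b_new + B is on the old scale.
A = b_old.std() / b_new.std()
B = b_old.mean() - A * b_new.mean()
print(f"slope A = {A:.3f}, intercept B = {B:.3f}")

# Any parameter (or theta estimate) from the new form can now be placed on
# last year's scale, so score differences reflect examinees rather than forms.
theta_new = 0.50
print(f"theta on the old scale: {A * theta_new + B:.3f}")
```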

Psychometric Analysis & Reporting

This is an absolutely critical step in the exam development cycle for professional credentialing.  You need to statistically analyze the results to flag any items that are not performing well, so you can replace or modify them.  This involves statistics like the item p-value (difficulty), item point-biserial (discrimination), option/distractor analysis, and differential item functioning.  You should also look at overall test reliability/precision and other psychometric indices.  If you are accredited, you need to prepare year-end reports and submit them to the governing body.  Learn more about item and test analysis.
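For readers newer to classical item analysis, here is a minimal sketch of the two workhorse statistics – the p-value and the corrected point-biserial – computed from a small, made-up scored response matrix.

```python
import numpy as np

# Hypothetical scored response matrix: rows = examinees, columns = items (1 = correct).
X = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
])

total = X.sum(axis=1)
p_values = X.mean(axis=0)  # item difficulty: proportion of examinees answering correctly

# Corrected point-biserial: correlate each item with the total score excluding
# that item, so the item does not inflate its own discrimination estimate.
point_biserials = np.array([
    np.corrcoef(X[:, j], total - X[:, j])[0, 1] for j in range(X.shape[1])
])

for j, (p, rpb) in enumerate(zip(p_values, point_biserials), start=1):
    print(f"Item {j}: p = {p:.2f}, corrected r_pbis = {rpb:.2f}")
```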

Exam Development: It’s a Vicious Cycle

Now, consider the big picture: in many cases, an exam is not a one-and-done thing.  It is re-used, perhaps continually.  Often there are new versions released, perhaps based on updated blueprints or simply to swap out questions so that they don’t get overexposed.  That’s why this is better conceptualized as an exam development cycle, like the circle shown above.  Often some steps like Job Analysis are only done once every 5 years, while the rotation of item development, piloting, equating, and psychometric reporting might happen with each exam window (perhaps you do exams in December and May each year).

ASC has extensive expertise in managing this cycle for professional credentialing exams, as well as many other types of assessments.  Get in touch with us to talk to one of our psychometricians.  I also suggest reading two other blog posts on the topic, ‘Certification Exam Administration and Proctoring’ and ‘Certification Exam Delivery: Guidelines for Success’, for a comprehensive understanding.

One of my favorite quotes is from Mark Twain: “There is no such thing as a new idea. It is impossible. We simply take a lot of old ideas and put them into a sort of mental kaleidoscope.”  How can we construct a better innovation kaleidoscope for assessment?

We all attend conferences to get ideas from our colleagues in the assessment community on how to manage challenges. But ideas from across industries have been the source for some of the most radical innovations. Did you know that the inspiration for fast food drive-throughs was race car pit stops? Or that the idea for wine packaging came from egg cartons?

Most of the assessment conferences we have attended recently have been filled with sessions about artificial intelligence. AI is one of the most exciting developments to come along in our industry – as well as in other industries – in a long time. But many small- or moderate-sized organizations may feel it is out of reach for their organizations. Or they may be reluctant to adopt it for security or other concerns.

There are other worthwhile ideas that can be borrowed from other industries and adapted for use by small and moderate-sized assessment organizations. For instance, concepts from product development, design thinking, and lean manufacturing can be beneficial to assessment processes.

Agile Software Development

Many organizations use agile product methodologies for software development. While strict adherence to an agile methodology may not be appropriate for item development activities, there are pieces of the agile philosophy that might be helpful for item development processes. For instance, in the agile methodology, user stories are used to describe the end goal of a software feature from the standpoint of a customer or end user. In the same way, the user story concept could be used to delineate the construct an item is intended to measure, or how the item is intended to be scored. This can help ensure that everyone involved in test development has a clear understanding of the measurement intent of the item from the outset.

item review kanban

Another feature of agile development is the use of acceptance criteria. Acceptance criteria are predefined standards used to determine if user stories have been completed. In item development processes, acceptance criteria can be developed to set and communicate common standards to all involved in the item authoring process.

Agile development also uses a tool known as a Kanban Board to manage the process of software development by assigning tasks and moving development requests through various stages such as new, awaiting specs, in development, in QA, and user review. This approach can be applied to the management of item development in assessment, as you can see here from our Assess.ai platform.
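As a toy illustration of the concept, here is a minimal sketch of a Kanban-style board for item development using the stages mentioned above; this is only an illustration of the workflow idea, not how any particular platform implements it.

```python
# Stages mirror the workflow described above; item IDs are hypothetical.
STAGES = ["New", "Awaiting Specs", "In Development", "In QA", "User Review", "Done"]

board = {stage: [] for stage in STAGES}
board["New"] = ["ITEM-101", "ITEM-102"]
board["In QA"] = ["ITEM-087"]

def advance(board, item_id):
    """Move an item to the next stage of the workflow."""
    for i, stage in enumerate(STAGES[:-1]):
        if item_id in board[stage]:
            board[stage].remove(item_id)
            board[STAGES[i + 1]].append(item_id)
            return
    raise ValueError(f"{item_id} not found or already Done")

advance(board, "ITEM-101")   # New -> Awaiting Specs
advance(board, "ITEM-087")   # In QA -> User Review
print({stage: items for stage, items in board.items() if items})
```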

Design Thinking and Innovation

Design thinking is a human-centered approach to innovation. At its core is empathy for customers and users. A key design thinking tool is the journey map, which is a visual representation of a process that individuals (e.g., customers or users) go through to achieve a goal. The purpose of creating a journey map is to identify pain points in the user experience and create better user experiences. Journey maps could potentially be used by assessment organizations to diagram the volunteer SME experience and identify potential improvements. Likewise, it could be used in the candidate application and registration process.

Lean Manufacturing

Lean manufacturing is a methodology aimed at reducing production times. A key technique within the lean methodology is value stream mapping (VSM). VSM is a way of visualizing both the flow of information and materials through a process as a means of identifying waste. Admittedly, I do not know a great deal about the intricacies of the technique, but it is most helpful to understand the underlying philosophy and intentions:

· To develop a mutual understanding between all stakeholders involved in the process;

· To eliminate process steps and tasks which do not add value to the process but may contribute to user frustration and to error.

The big question for innovation: Why?

A key question to ask when examining a process is ‘why.’ So often we proceed with the same processes year in and year out because ‘it’s the way we’ve always done them,’ without ever questioning why – often for so long that we have forgotten what the original answer to that question was. ‘Why’ is an immensely powerful and helpful question.

In addition to asking the ‘why’ question, a takeaway from value stream mapping and journey mapping is visual representation. Being able to diagram or display a process is a fantastic way to develop a mutual understanding among all the stakeholders involved in it. We so often concentrate on pursuing shiny new tools like AI that we neglect potential efficiencies in the underlying processes. Visually displaying processes can be extremely helpful in process improvement.