
Item response theory (IRT) is a family of mathematical models used in psychometrics to design, analyze, validate, and score assessments.  It is a very powerful psychometric paradigm that allows researchers to build stronger assessments, whether they work in education, psychology, human resources, or other fields.  It also solves critical measurement problems like equating across years, designing adaptive tests, and creating vertical scales.

Want to learn more about IRT, how it works, and why it is so important for assessment?  Read on.

What is Item Response Theory?

IRT is a family of models that describe how examinees respond to items on a test, hence the name.  These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves.  However, item response theory ended up doing so much more.

IRT is model-driven, in that there is a specific mathematical equation that is assumed, and we fit the models based on raw data, similar to linear regression.  There are different parameters (a, b, c) that shape this equation to different needs.  That’s what defines different IRT models.  This will be discussed at length below.

The models put people and items onto a latent scale, which is usually called θ (theta).  This represents whatever is being measured, whether IQ, anxiety, or knowledge of accounting laws in Croatia.  IRT helps us understand the nature of the scale, how a person answers each question, the distribution of item difficulty, and much more.  IRT used to be known as latent trait theory and item characteristic curve theory.

IRT requires specially-designed software.  Click the link below to download our software Xcalibre, which provides a user-friendly and visual platform for implementing IRT.

 

IRT analysis with Xcalibre

 

Why do we need Item Response Theory?

IRT represents an important innovation in the field of psychometrics. While now more than 50 years old – assuming the “birth” is the classic Lord and Novick (1968) text – it is still underutilized and remains a mystery to many practitioners.

Item response theory is more than just a way of analyzing exam data; it is a paradigm for driving the entire lifecycle of designing, building, delivering, scoring, and analyzing assessments.

IRT requires larger sample sizes and is much more complex than its predecessor, classical test theory, but it is also far more powerful.  IRT requires quite a lot of expertise, typically a PhD, so it is not used for small assessments like a university final exam, but it is used for almost all major assessments in the world.

 

The Driver: Problems with Classical Test Theory

Classical test theory (CTT) is approximately 100 years old, and still remains commonly used because it is appropriate for certain situations, and it is simple enough that it can be used by many people without formal training in psychometrics.  Most statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  A list of these is presented later.

Learn more about the differences between CTT and IRT here.

 

Item Response Theory Parameters

The foundation of IRT is a mathematical model defined by item parameters.  A parameter is an aspect of a mathematical model that can change its shape or other features.  For dichotomous items (those scored correct/incorrect), each item has three parameters:

 

   a: the discrimination parameter, an index of how well the item differentiates low-ability from high-ability examinees; typically ranges from 0 to 2, where higher is better, though relatively few items are above 1.0.

   b: the difficulty parameter, an index of the ability level for which the item is most appropriate; typically ranges from -3 to +3, with 0 representing an examinee of average ability.

   c: the pseudo-guessing parameter, which is a lower asymptote; typically near 1/k, where k is the number of options.

These parameters are used in the formula below, but are also displayed graphically.

3PL irt equation
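
Written out, the three-parameter logistic (3PL) model gives the probability of a correct response as:

P(θ) = c + (1 - c) / (1 + exp(-a(θ - b)))

where θ is the examinee's ability.  Some presentations also include a scaling constant D = 1.7 inside the exponent so that the logistic curve approximates the normal ogive.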

Item response function

These parameters are used to graphically display an item response function (IRF), which models the probability of a correct answer as a function of ability.  In the example IRF, the a parameter is approximately 1.0, indicating a fairly discriminating test item.  The b parameter is approximately 0.0 (the point on the x-axis where the midpoint of the curve is), indicating an average-difficulty item; examinees of average ability would have a 60% chance of answering correctly.  The c parameter is approximately 0.20, as expected for a 5-option multiple choice item.  Consider the x-axis to be z-scores on a standard normal scale.

In some cases, there is no guessing involved, and we only use a and b.  This is called the two-parameter model.  If we only use b, this is the one-parameter or Rasch Model.  Here is how that is calculated.

One-parameter-logistic-model-IRT
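
In this simplest form, the probability of a correct response depends only on the difference between the person and the item:

P(θ) = 1 / (1 + exp(-(θ - b)))

so there is no guessing floor (c) and all items are assumed to have equal discrimination (a).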

Item parameters, which are crucial within the IRT framework, might change over time or multiple testing occasions, a phenomenon known as item parameter drift.

 

Example Item Response Theory calculations

Examinees with higher ability are much more likely to respond correctly.  Look at the graph above.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only a 25% chance – barely above the 1 in 5 guessing rate of 20%.  An average person (0.0) has a 60% chance.  Why 60?  Because we are accounting for guessing.  If the curve went from 0% to 100% probability, then yes, the middle would be a 50% chance.  But here, we assume 20% as a baseline due to guessing, so halfway up is 60%.
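
If you want to verify numbers like these yourself, here is a minimal Python sketch of the 3PL item response function, assuming a = 1.0, b = 0.0, c = 0.20 and a scaling constant of D = 1.7; the exact probabilities depend on the scaling convention and the precise parameter values, so they will only approximate the values read off the graph.

    import math

    def irf_3pl(theta, a, b, c, D=1.7):
        """Probability of a correct response under the 3PL model."""
        return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

    # Illustrative item: average difficulty, moderate discrimination, 20% guessing floor
    a, b, c = 1.0, 0.0, 0.20
    for theta in (-2.0, 0.0, 2.0):
        print(f"theta = {theta:+.1f}   P(correct) = {irf_3pl(theta, a, b, c):.2f}")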

five item response functions

Of course, the parameters can and should differ from item to item, reflecting differences in item performance.  This graph shows five IRFs with the three-parameter model.  The dark blue line is the easiest item, with a b of -2.00.  The light blue item is the hardest, with a b of +1.80.  The purple one has a c=0.00 while the light blue has c=0.25, indicating that the light blue item is more susceptible to guessing.

These IRFs are not just a pretty graph or a way to describe how an item performs.  They are the basic building block to accomplishing those important goals mentioned earlier.  That comes next…

 

Applications of Item Response Theory to Improve Assessment

Item response theory uses the IRF for several purposes.  Here are a few.

test information function from item response theory

  1. Interpreting and improving item performance
  2. Scoring examinees with maximum likelihood or Bayesian methods (see the sketch after this list)
  3. Form assembly, including linear on the fly testing (LOFT) and pre-equating
  4. Calculating the accuracy of examinee scores
  5. Development of computerized adaptive tests (CAT)
  6. Post-equating
  7. Differential item functioning (finding bias)
  8. Data forensics to find cheaters or other issues
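
Item 2 above deserves a concrete illustration.  Here is a minimal Python sketch of maximum likelihood scoring under the 3PL model: given known item parameters and an examinee's 0/1 responses, SciPy finds the theta that maximizes the likelihood.  The parameter values, responses, and function names are hypothetical, and operational scoring engines handle many more details (priors, non-convergence, perfect scores, etc.).

    import numpy as np
    from scipy.optimize import minimize_scalar

    def p_3pl(theta, a, b, c, D=1.7):
        return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

    def neg_log_likelihood(theta, a, b, c, responses):
        # Negative log-likelihood of a 0/1 response vector, given theta and item parameters
        p = p_3pl(theta, a, b, c)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    # Hypothetical bank of five items and one examinee's responses
    a = np.array([0.8, 1.0, 1.2, 0.9, 1.1])
    b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
    c = np.array([0.20, 0.20, 0.20, 0.25, 0.20])
    responses = np.array([1, 1, 1, 0, 0])

    result = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded",
                             args=(a, b, c, responses))
    print(f"Maximum likelihood theta estimate: {result.x:.2f}")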

Standard error of measurement and test information function

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is computed from the TIF (one divided by the square root of the information) and has the interpretable advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the SEM is a 95% confidence interval for their score.  The graph on the right shows part of the form assembly process in our FastTest platform.
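
To show how the TIF and CSEM relate, here is a minimal Python sketch using the standard 3PL item information formula; the item parameters are hypothetical, D = 1.7 is one common scaling convention, and operational programs such as Xcalibre handle this (and much more) for you.

    import numpy as np

    def p_3pl(theta, a, b, c, D=1.7):
        return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

    def item_information(theta, a, b, c, D=1.7):
        # Standard 3PL item information function
        P = p_3pl(theta, a, b, c, D)
        Q = 1 - P
        return (D * a) ** 2 * (Q / P) * ((P - c) / (1 - c)) ** 2

    # Hypothetical five-item form
    a = np.array([0.8, 1.0, 1.2, 0.9, 1.1])
    b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
    c = np.array([0.20, 0.20, 0.20, 0.25, 0.20])

    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        tif = item_information(theta, a, b, c).sum()   # test information = sum of item information
        csem = 1 / np.sqrt(tif)                        # conditional standard error of measurement
        print(f"theta = {theta:+.1f}   TIF = {tif:.2f}   CSEM = {csem:.2f}   "
              f"95% CI = theta +/- {1.96 * csem:.2f}")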

 

Assumptions of Item Response Theory

Item response theory assumes a few things about your data.

  1. The latent trait you are measuring is unidimensional.  If it is multidimensional, there is multidimensional item response theory, or you can treat the dimensions as separate traits.
  2. Items have local independence, which means that the act of answering one is not impacted by others.  This affects the use of testlets and enemy items.
  3. The probability of responding correctly to an item (or in a certain response category, in the case of polytomous items such as Likert scales) is a function of the examinee’s ability/trait level and the parameters of the model, following the item response function, with some allowance for random error.  As a corollary, we assume that the ability/trait has some distribution, with some people having higher or lower levels (e.g., intelligence), and that we are trying to find those differences.

Many texts will only postulate the first two as assumptions, because the third is just implicitly assumed.

 

Advantages and Benefits of Item Response Theory

So why does this matter?  Let’s go back to the problems with classical test theory.  Why is IRT better?

  • Sample-independence of scale: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation.  Two samples of different ability levels can be easily converted onto the same scale.
  • Test statistics: Classical statistics are tied to a specific test form; IRT item parameters are not.
  • Sparse matrices are OK: Classical test statistics do not work with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing.
  • Linking/equating: Item response theory has much stronger equating, so if your exam has multiple forms, or if you deliver twice per year with a new form, you can have much greater validity in the comparability of scores.
  • Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect.
  • Vertical scaling: IRT can do vertical scaling but CTT cannot.
  • Accounting for guessing: CTT does not account for guessing on multiple choice exams.
  • Scoring: Scoring in classical test theory does not take into account item difficulty.  With IRT, you can score a student on any set of items and be sure it is on the same latent scale.
  • Adaptive testing: CTT does not support adaptive testing in most cases.  Adaptive testing has its own list of benefits.
  • Characterization of error: CTT assumes that every examinee has the same amount of error in their score (SEM); IRT recognizes that if the test is all middle-difficulty items, then low or high students will have inaccurate scores.
  • Stronger form building: IRT has functionality to build forms to be more strongly equivalent and meet the purposes of the exam.
  • Nonlinear function: CTT effectively assumes a linear relationship between ability and item performance (the point-biserial), which cannot hold across the full ability range; IRT models the relationship with a nonlinear function, which is far more realistic.

 

Item Response Theory Models: One Big Happy Family

Remember: IRT is actually a family of models, making flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and fit of the data.  If there are multipoint items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters.  Learn more about the partial credit situation here.

Here’s a quick breakdown of the family tree, with the most common models.

 

How do I analyze my test with Item Response Theory?

OK item fit

First: you need to get special software.  There are some commercial packages like  Xcalibre, or you can use packages inside platforms like R and Python.

The software will analyze the data in cycles or loops to try to find the best model.  This is because real data never align perfectly with the model.  You might see graphs like the one shown here if you compare actual proportions (red) to the predicted ones from the item response function (black).  That’s OK!  IRT is quite robust, and there are analyses built in to help you evaluate model fit.

Some more unpacking of the image above:

  • This was item #39 on the test
  • We are using the three parameter logistic model (3PL), as this was a multiple choice item with 4 options
  • 3422 examinees answered the item
  • 76.9% of them got it correct
  • The classical item discrimination (point biserial item-total correlation) was 0.253, which is OK but not very high
  • The a parameter was 0.432, which is OK but not very strong
  • The b parameter was -1.195, which means the item was quite easy
  • The c parameter was 0.248, which you would expect if there was a 25% chance of guessing
  • The Chi-square fit statistic rejected the null, indicating poor fit, but this statistic is susceptible to sample size
  • The z-Resid fit statistic is a bit more robust, and it did not flag the item for bad fit

Xcalibre-poly-output
The image here shows output from Xcalibre for the generalized partial credit model, which is a polytomous model often used for items scored with partial credit.  For example, a question might list 6 animals and ask students to click on the ones that are reptiles, of which there are 3.  The possible scores are then 0, 1, 2, or 3.

Here, the graph labels the categories as 1-2-3-4, but the meaning is the same.  This is how you can interpret it.

  • Someone is likely to get 0 points if their theta is below -2.0 (bottom 3% or so of students).
  • A few low students might get 1 point (green)
  • Low-middle ability students are likely to get 2 correct (blue)
  • Anyone above average (0.0) is likely to get all 3 correct.

The boundary locations are where one level becomes more likely than another, i.e., where the curves cross.  For example, you can see that the blue and black lines cross at the boundary -0.339.
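
For readers who want to see the underlying math, here is a minimal Python sketch of how GPCM category probabilities can be computed.  The discrimination and threshold values below are hypothetical illustrations (not the item shown in the Xcalibre output), and the scaling constant D = 1.7 is one common convention.

    import numpy as np

    def gpcm_probs(theta, a, thresholds, D=1.7):
        """Category probabilities (0..K points) under the generalized partial credit model."""
        # Cumulative sums of D*a*(theta - b_j), with 0 for the lowest category
        steps = np.concatenate(([0.0], np.cumsum(D * a * (theta - thresholds))))
        exp_steps = np.exp(steps)
        return exp_steps / exp_steps.sum()

    # Hypothetical 0-3 point item; each threshold is where adjacent categories cross
    a = 0.9
    thresholds = np.array([-2.0, -0.3, 0.8])

    for theta in (-2.5, -1.0, 0.0, 1.5):
        probs = gpcm_probs(theta, a, thresholds)
        print(f"theta = {theta:+.1f}   P(0, 1, 2, 3 points) = {np.round(probs, 2)}")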

Where can I learn more?

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software, Xcalibre.

Want to talk to one of our experts about how to apply IRT?  Get in touch!

TALK TO US

PARCC EBSR Items

The Partnership for Assessment of Readiness for College and Careers (PARCC) is a consortium of US States working together to develop educational assessments aligned with the Common Core State Standards.  This is a daunting task, and PARCC is doing an admirable job, especially with their focus on utilizing technology.  However, one of the new item types has a serious psychometric fault that deserves a caveat with regards to scoring and validation.

What is an Evidence-Based Selected-Response (EBSR) question?

The item type is an “Evidence-Based Selected-Response” (PARCC EBSR) item format, commonly called a Part A/B item or Two-Part item.  The goal of this format is to delve deeper into student understanding and award credit for deeper knowledge while minimizing the impact of guessing.  This is obviously an appropriate goal for assessment.  To do so, the item is presented to the student in two parts, where the first part asks a simple question and the second part asks for supporting evidence for the answer to Part A.  Students must answer Part A correctly to receive credit on Part B.  As described on the PARCC website:

In order to receive full credit for this item, students must choose two supporting facts that support the adjective chosen for Part A. Unlike tests in the past, students may not guess on Part A and receive credit; they will only receive credit for the details they’ve chosen to support Part A.

How EBSR items are scored

While this makes sense in theory, it leads to problems in data analysis, especially if using item response theory (IRT).  Obviously, this scoring rule violates the fundamental assumption of IRT: local independence (responses to items should not depend on each other).  So when working with a client of mine, we decided to combine the two parts into one multi-point question, which matches the theoretical approach PARCC EBSR items are taking.  The goal was to calibrate the item with Muraki’s Generalized Partial Credit Model (GPCM), which is the standard approach used to analyze polytomous items in K12 assessment (learn more here).  The GPCM tries to order students based on the points they earn: 0-point students tend to have the lowest ability, 1-point students moderate ability, and 2-point students the highest ability.  Should be obvious, right?  Nope.

The first thing we noticed was that some point levels had very small sample sizes.  Suppose that Part A is 1 point and Part B is 1 point (select two evidence pieces but must get both).  Most students will get 0 points or 2 points.  Not many will receive 1 point.  We thought about it, and realized that the only way to earn 1 point is to guess Part A but select no correct evidence or only select one evidence point.  This leads to issues with the GPCM.

Using the Generalized Partial Credit Model

Even when there was sufficient N at each level, we found that the GPCM had terrible fit statistics, meaning that the item was not performing according to the model described above.  So I ran Iteman, our classical analysis software, to obtain quantile plots that approximate the polytomous IRFs without imposing the GPCM model.  I found that the 0-2 point items tend to have the issue where not many students get 1 point, and moreover the line for that category is relatively flat.  The GPCM assumes that it is relatively bell-shaped.  So the GPCM is looking for where the drop-offs are in the bell shape, crossing with adjacent CRFs – the thresholds – and they aren’t there.  The GPCM would blow up, usually not even estimating thresholds in the correct ordering.

PARCC EBSR Graphs

So I tried to think of this from a test development perspective.  How do students get 1 point on these PARCC EBSR items?  The only way to do so is to get Part A right but not Part B.  Given that Part B is the reason for Part A, this means this group is students who answer Part A correctly but don’t know the reason, which means they are guessing.  It is then no surprise that the data for 1-point students is in a flat line – it’s just like the c parameter in the 3PL.  So the GPCM will have an extremely tough time estimating threshold parameters.

Why EBSR items don’t work

From a psychometric perspective, point levels are supposed to represent different levels of ability.  A 1-point student should be higher ability than a 0-point student on this item, and a 2-point student higher ability than a 1-point student.  This seems obvious and intuitive.  But this item, by definition, violates the idea that a 1-point student should have higher ability than a 0-point student.  The only way to get 1 point is to guess the first part – and therefore not know the answer, making those students no different from the 0-point examinees.  So of course the 1-point results look funky here.

The items were calibrated as two separate dichotomous items rather than one polytomous item, and the statistics turned out much better.  This still violates the IRT assumption of local independence, but it at least produces usable IRT parameters that can score students.  Nevertheless, I think the scoring of these items needs to be revisited so that the algorithm produces data that can be calibrated with IRT.

The entire goal of test items is to provide data points used to measure students; if the Evidence-Based Selected-Response item type is not providing usable data, then it is not worth using, no matter how good it seems in theory!

Test Scaling

Scaling is a psychometric term regarding the establishment of a score metric for a test, and it often has two meanings. First, it refers to defining the method used to operationally score the test, establishing an underlying scale on which people are being measured.  A common example is the T-score, which transforms raw scores into a standardized scale with a mean of 50 and a standard deviation of 10, making it easier to compare results across different populations or test forms.  Second, it refers to score conversions used for reporting scores, especially conversions that are designed to carry specific information.  The latter is typically called scaled scoring.

Examples of Scaling

You have all been exposed to this type of scaling, though you might not have realized it at the time. Most high-stakes tests like the ACT, SAT, GRE, and MCAT are reported on scales that are selected to convey certain information, with the actual numbers selected more or less arbitrarily. The SAT and GRE have historically had a nominal mean of 500 and a standard deviation of 100, while the ACT has a nominal mean of 18 and standard deviation of 6. These are actually the same scale, because they are nothing more than a converted z-score (standard or zed score), used simply because no examinee wants to receive a score report that says they got a score of -1. The numbers above were arbitrarily selected, and then the score range bounds were selected based on the fact that roughly 99.7% of the population is within plus or minus three standard deviations. Hence, the SAT and GRE range from 200 to 800 and the ACT ranges from 0 to 36. This leads to the urban legend of receiving 200 points for writing your name correctly on the SAT; again, it feels better for the examinee. A score of 300 might seem like a big number and 100 points above the minimum, but it just means that someone is around the 3rd percentile.

Now, notice that I said “nominal.” I said that because the tests do not actually have those means observed in samples, because the samples have substantial range restriction. Because these tests are only taken by students serious about proceeding to the next level of education, the actual sample is of higher ability than the population. The lower third or so of high school students usually do not bother with the SAT or ACT. So many states will have an observed average ACT of 21 and standard deviation of 4. This is an important issue to consider in developing any test. Consider just how restricted the population of medical school students is; it is a very select group.

How can I select a score scale?

score-scale

For various reasons, actual observed scores from tests are often not reported, and only converted scores are reported.  If there are multiple forms which are being equated, scaling will hide the fact that the forms differ in difficulty, and in many cases, differ in cutscore.  Scaled scores can facilitate feedback.  They can also help the organization avoid explanations of IRT scoring, which can be a headache to some.

When deciding on the conversion calculations, there are several important questions to consider.

First, do we want to be able to make fine distinctions among examinees? If so, the range should be sufficiently wide. My personal view is that the scale should be at least as wide as the number of items; otherwise you are voluntarily giving up information. This in turn means you are giving up variance, which makes it more difficult to correlate your scaled scores with other variables, as when the MCAT is correlated with success in medical school. This, of course, means that you are hampering future research – unless that research is able to revert back to actual observed scores to make sure all possible information is used. For example, suppose a test with 100 items is reported on a 5-point grade scale of A-B-C-D-F. That scale is quite restricted, and therefore difficult to correlate with other variables in research. But you have the option of reporting the grades to students and still using the original scores (0 to 100) for your research.

Along the same lines, we can swing completely in the other direction. For many tests, the purpose of the test is not to make fine distinctions, but only to broadly categorize examinees. The most common example of this is a mastery test, where the examinee is being assessed on their mastery of a certain subject, and the only possible scores are pass and fail. Licensure and certification examinations are an example. An extension of this is the “proficiency categories” used in K-12 testing, where students are classified into four groups: Below Basic, Basic, Proficient, and Advanced. This is used in the National Assessment of Educational Progress. Again, we see the care taken for reporting of low scores; instead of receiving a classification like “nonmastery” or “fail,” the failures are given the more palatable “Below Basic.”

Another issue to consider, which is very important in some settings but irrelevant in others, is vertical scaling. This refers to the chaining of scales across various tests that are at quite different levels. In education, this might involve linking the scales of exams in 8th grade, 10th grade, and 12th grade (graduation), so that student progress can be accurately tracked over time. Obviously, this is of great use in educational research, such as the medical school process. But for a test to award a certification in a medical specialty, it is not relevant because it is really a one-time deal.

Lastly, there are three calculation options: pure linear (ScaledScore = RawScore * Slope + Intercept), standardized conversion (Old Mean/SD to New Mean/SD), and nonlinear approaches like Equipercentile.
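
As a rough sketch of the first two options, here is what those conversions look like in Python; the slope, intercept, means, and standard deviations are hypothetical, and an equipercentile conversion would additionally require the full score distributions of both forms.

    def linear_scale(raw, slope, intercept):
        """Pure linear conversion: ScaledScore = RawScore * Slope + Intercept."""
        return raw * slope + intercept

    def standardized_scale(raw, old_mean, old_sd, new_mean, new_sd):
        """Standardized conversion: map the old mean/SD onto the new mean/SD via z-scores."""
        z = (raw - old_mean) / old_sd
        return z * new_sd + new_mean

    # Hypothetical raw score of 62 on a 100-item test
    print(linear_scale(62, slope=2.0, intercept=200))              # 324 on a made-up 200-400 scale
    print(standardized_scale(62, old_mean=70, old_sd=10,
                             new_mean=500, new_sd=100))            # 420 on an SAT-like 500/100 scale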

Perhaps the most important issue is whether the scores from the test will be criterion-referenced or norm-referenced. Often, this choice will be made for you because it distinctly represents the purpose of your tests. However, it is quite important and usually misunderstood, so I will discuss this in detail.

Criterion-Referenced vs. Norm-Referenced

data-analysis-norms

This is a distinction between the ways test scores are used or interpreted. A criterion-referenced score interpretation means that the score is interpreted with regards to defined content, blueprint, or curriculum (the criterion), and ignores how other examinees perform (Bond, 1996). A classroom assessment is the most common example; students are scored on the percent of items correct, which is taken to imply the percent of the content they have mastered. Conversely, a norm-referenced score interpretation is one where the score provides information about the examinee’s standing in the population, but no absolute (or ostensibly absolute) information regarding their mastery of content. This is often the case with non-educational measurements like personality or psychopathology. There is no defined content which we can use as a basis for some sort of absolute interpretation. Instead, scores are often either z-scores or some linear function of z-scores.  IQ is historically scaled with a mean of 100 and standard deviation of 15.

It is important to note that this dichotomy is not a characteristic of the test, but of the test score interpretations. This fact is more apparent when you consider that a single test or test score can have several interpretations, some of which are criterion-referenced and some of which are norm-referenced. We will discuss this deeper when we reach the topic of validity, but consider the following example. A high school graduation exam is designed to be a comprehensive summative assessment of a secondary education. It is therefore specifically designed to cover the curriculum used in schools, and scores are interpreted within that criterion-referenced context. Yet scores from this test could also be used for making acceptance decisions at universities, where scores are only interpreted with respect to their percentile (e.g., accept the top 40%). The scores might even do a fairly decent job at this norm-referenced application. However, this is not what they are designed for, and such score interpretations should be made with caution.

Another important note is the definition of “criterion.” Because most tests with criterion-referenced scores are educational and involve a cutscore, a common misunderstanding is that the cutscore is the criterion. It is still the underlying content or curriculum that is the criterion, because we can have this type of score interpretation without a cutscore. Regardless of whether there is a cutscore for pass/fail, a score on a classroom assessment is still interpreted with regards to mastery of the content.  To further add to the confusion, Industrial/Organizational psychology refers to outcome variables as the criterion; for a pre-employment test, the criterion is typically Job Performance at a later time.

This dichotomy also leads to some interesting thoughts about the nature of your construct. If you have a criterion-referenced score, you are assuming that the construct is concrete enough that anybody can make interpretations regarding it, such as mastering a certain percentage of content. This is why non-concrete constructs like personality tend to be only norm-referenced. There is no agreed-upon blueprint of personality.

Multidimensional Scaling

camera lenses for multidimensional item response theory

An advanced topic worth mentioning is multidimensional scaling (see Davison, 1998). The purpose of multidimensional scaling is similar to factor analysis (a later discussion!) in that it is designed to evaluate the underlying structure of constructs and how they are represented in items. This is therefore useful if you are working with constructs that are brand new, so that little is known about them, and you think they might be multidimensional. This is a pretty small percentage of the tests out there in the world; I encountered the topic in my first year of graduate school – only because I was in a Psychological Scaling course – and have not encountered it since.

Summary of test scaling

Scaling is the process of defining the scale on which your measurements will take place. It raises fundamental questions about the nature of the construct. Fortunately, in many cases we are dealing with a simple construct that has well-defined content, like an anatomy course for first-year medical students. Because it is so well-defined, we often take criterion-referenced score interpretations at face value. But as constructs become more complex, like job performance of a first-year resident, it becomes harder to define the scale, and we start to deal more in relatives than absolutes. At the other end of the spectrum are completely ephemeral constructs where researchers still can’t agree on the nature of the construct and we are pretty much limited to z-scores. Intelligence is a good example of this.

Some sources attempt to delineate the scaling of people and the scaling of items or stimuli as separate things, but this is really impossible because they are so confounded: people define item statistics (the percent of people that get an item correct) and items define person scores (the percent of items a person gets correct). It is for this reason that item response theory, the most advanced paradigm in measurement theory, was designed to place items and people on the same scale. It is also for this reason that item writing should consider how items will be scored and therefore lead to person scores. But because we start writing items long before the test is administered, and the nature of the construct is caught up in the scale, the issues presented here need to be addressed at the very beginning of the test development cycle.

Certification Exam Development and Delivery

Certification exams are a critical component of workforce development for many professions and play a significant role in the global Testing, Inspection, and Certification (TIC) market, which was valued at approximately $359.35 billion in 2022 and is projected to grow at a compound annual growth rate (CAGR) of 4.0% from 2023 to 2030. As such, a lot of effort goes into exam development and delivery, working to ensure that the exams are valid and fair, then delivered securely yet with enough convenience to reach the target market. If you work for a certification organization or awarding body, this article provides a guidebook to that process and how to select a vendor.

Certification Exam Development

Certification exam development is a well-defined process governed by accreditation guidelines such as NCCA, requiring steps such as job task analysis and standard setting studies.  For certification, and other credentialing like licensure or certificates, this process is incredibly important to establishing validity.  Such exams serve as gatekeepers into many professions, often after people have invested a ton of money and years of their life in preparation.  Therefore, it is critical that the tests be developed well, and have the necessary supporting documentation to show that they are defensible.

So what exactly goes into developing a quality exam, sound psychometrics, and establishing the validity documentation, perhaps enough to achieve NCCA accreditation for your certification? Well, there is a well-defined and recognized process for certification exam development, though it is rarely the exact same for every organization.  In general, the accreditation guidelines say you need to address these things, but leave the specific approach up to you.  For example, you have to do a cutscore study, but you are allowed to choose Bookmark vs Angoff vs other method.

Job Analysis / Practice Analysis

A job analysis study provides the vehicle for defining the important job knowledge, skills, and abilities (KSA) that will later be translated into content on a certification exam. During a job analysis, important job KSAs are obtained by directly analyzing job performance of highly competent job incumbents or surveying subject-matter experts regarding important aspects of successful job performance. The job analysis generally serves as a fundamental source of evidence supporting the validity of scores for certification exams.

Test Specifications and Blueprints

The results of the job analysis study are quantitatively converted into a blueprint for the certification exam.  Basically, it comes down to this: if the experts say that a certain topic or skill is done quite often or is very critical, then it deserves more weight on the exam, right?  There are different ways to do this.  My favorite article on the topic is Raymond & Neustel (2006).  Here’s a free tool to help.

test development cycle job task analysis

Item Development

After important job KSAs are established, subject-matter experts write test items to assess them. The end result is the development of an item bank from which exam forms can be constructed. The quality of the item bank also supports test validity.  A key operational step is the development of an Item Writing Guide and holding an item writing workshop for the SMEs.

Pilot Testing

There should be evidence that each item in the bank actually measures the content that it is supposed to measure; in order to assess this, data must be gathered from samples of test-takers. After items are written, they are generally pilot tested by administering them to a sample of examinees in a low-stakes context—one in which examinees’ responses to the test items do not factor into any decisions regarding competency. After pilot test data is obtained, a psychometric analysis of the test and test items can be performed. This analysis will yield statistics that indicate the degree to which the items measure the intended test content. Items that appear to be weak indicators of the test content generally are removed from the item bank or flagged for item review so they can be reviewed by subject matter experts for correctness and clarity.

Note that this is not always possible, and is one of the ways that different organizations diverge in how they approach exam development.

Standard Setting

Standard setting also is a critical source of evidence supporting the validity of professional credentialing exam (i.e. pass/fail) decisions made based on test scores.  Standard setting is a process by which a passing score (or cutscore) is established; this is the point on the score scale that differentiates between examinees that are and are not deemed competent to perform the job. In order to be valid, the cutscore cannot be arbitrarily defined. Two examples of arbitrary methods are the quota (setting the cut score to produce a certain percentage of passing scores) and the flat cutscore (such as 70% on all tests). Both of these approaches ignore the content and difficulty of the test.  Avoid these!

Instead, the cutscore must be based on one of several well-researched criterion-referenced methods from the psychometric literature.  There are two types of criterion-referenced standard-setting procedures (Cizek, 2006): examinee-centered and test-centered.

The Contrasting Groups method is one example of a defensible examinee-centered standard-setting approach. This method compares the scores of candidates previously defined as Pass or Fail. Obviously, this has the drawback that a separate method already exists for classification. Moreover, examinee-centered approaches such as this require data from examinees, but many testing programs wish to set the cutscore before publishing the test and delivering it to any examinees. Therefore, test-centered methods are more commonly used in credentialing.

The most frequently used test-centered method is the Modified Angoff Method (Angoff, 1971) which requires a committee of subject matter experts (SMEs).  Another commonly used approach is the Bookmark Method.
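
The arithmetic behind a Modified Angoff study is straightforward, as in the minimal Python sketch below: each SME rates the probability that a minimally competent candidate would answer each item correctly, and the ratings are averaged across SMEs and summed across items to produce a recommended raw cutscore.  The ratings here are hypothetical, and real studies add rater training, discussion rounds, and impact data.

    import numpy as np

    # Hypothetical Angoff ratings: rows = SMEs, columns = items.  Each value is the judged
    # probability that a minimally competent candidate answers the item correctly.
    ratings = np.array([
        [0.60, 0.75, 0.50, 0.90, 0.70],
        [0.65, 0.70, 0.55, 0.85, 0.75],
        [0.55, 0.80, 0.45, 0.95, 0.65],
    ])

    item_means = ratings.mean(axis=0)   # average rating for each item across SMEs
    cutscore = item_means.sum()         # recommended raw cutscore for this 5-item example
    print("Item means:", np.round(item_means, 2))
    print(f"Recommended raw cutscore: {cutscore:.2f} out of {ratings.shape[1]}")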

Equating

If the test has more than one form – which is required by NCCA Standards and other guidelines – they must be statistically equated.  If you use classical test theory, there are methods like Tucker or Levine.  If you use item response theory, you can either bake the equating into the item calibration process with software like Xcalibre, or use conversion methods like Stocking & Lord.

What does this process do?  Well, if this year’s certification exam had an average 3 points higher than last year’s, how do you know whether this year’s version was 3 points easier, this year’s cohort was 3 points smarter, or a mixture of both?  Learn more here.
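
To give a flavor of the IRT side, here is a minimal Python sketch of mean/sigma linking, a simpler method than the Stocking & Lord procedure itself: the b parameters of anchor items common to both forms define a linear transformation that places the new form onto the old form's scale.  All parameter values below are hypothetical.

    import numpy as np

    # Hypothetical b parameters for anchor items that appear on both forms
    b_old = np.array([-1.20, -0.40, 0.10, 0.80, 1.50])   # calibrated on the old form's scale
    b_new = np.array([-1.05, -0.25, 0.30, 0.95, 1.70])   # calibrated on the new form's scale

    # Mean/sigma linking: theta_old = A * theta_new + B
    A = b_old.std() / b_new.std()
    B = b_old.mean() - A * b_new.mean()

    print(f"A = {A:.3f}, B = {B:.3f}")
    print("New-form b values on the old scale:", np.round(A * b_new + B, 2))
    # a parameters would be divided by A, and thetas transformed just like the b values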

Psychometric Analysis & Reporting

This part is an absolutely critical step in the exam development cycle for professional credentialing.  You need to statistically analyze the results to flag any items that are not performing well, so you can replace or modify them.  This looks at statistics like item p-value (difficulty), item point biserial (discrimination), option/distractor analysis, and differential item functioning.  You should also look at overall test reliability/precision and other psychometric indices.  If you are accredited, you need to perform year-end reports and submit them to the governing body.  Learn more about item and test analysis.

Exam Development: It’s a Vicious Cycle

Now, consider the big picture: in many cases, an exam is not a one-and-done thing.  It is re-used, perhaps continually.  Often there are new versions released, perhaps based on updated blueprints or simply to swap out questions so that they don’t get overexposed.  That’s why this is better conceptualized as an exam development cycle, like the circle shown above.  Often some steps like Job Analysis are only done once every 5 years, while the rotation of item development, piloting, equating, and psychometric reporting might happen with each exam window (perhaps you do exams in December and May each year).

ASC has extensive expertise in managing this cycle for professional credentialing exams, as well as many other types of assessments.  Get in touch with us to talk to one of our psychometricians.

Certification Exam Delivery & Administration

Certification exam administration and proctoring is a crucial component of the professional credentialing process.  Certification exams are expensive to develop well, so an organization wants to protect that investment by delivering the exam with appropriate security so that items are not stolen.  Moreover, there is an obvious incentive for candidates to cheat.  So, a certification body needs appropriate processes in place to deliver the certification exams.  Here are some tips.

1. Determine the best approach for certification exam administration and proctoring

Here are a few of the considerations to take into account.  These can be crossed with each other, such as delivering paper exams at Events vs. Test Centers.

Timing: Cohorts/Windows vs Continuous Availability

Do you have cohorts, where events make more sense, or do you need continuous availability?  For example, if the test is tied to university training programs that graduate candidates in December and May each year, that affects your need for delivery.  Alternatively, some certifications are not tied to such training; you might only have to show work experience.  In those cases, candidates are ready to take the test continuously throughout the year.

Mode: Paper vs Computer

Does it make more sense to deliver the test on paper or on computer?  This used to be a cost issue, but now the cost of computerized delivery, especially with online proctoring at home, has dropped significantly while saving so much time for candidates.  Also, some exam types like clinical simulations can only be delivered on computers.

Location: Test centers vs Online proctored vs Events vs Multi-Modal

Some types of tests require events, such as a clinical assessment in an actual clinic with standardized patients.  Some tests can be taken anywhere.  Exam events can also coincide with other events; perhaps you have online delivery through the year but deliver a paper version of the test at your annual conference, for convenience.

Do you have an easy way to make your own locations, if you are considering that?  One example is that you have quarterly regional conferences for your profession, where you could simply get a side room to deliver your test to candidates since they will already be there.  Another is that most of your candidates are coming from training programs at universities, and you are able to use classrooms at those universities.

ansi accreditation certification exam candidates

Geography: State, National, or International

If your exam is for a small US state or a small country, it might be easy to require exams in a test center, because you can easily set up only one or two test centers to cover the geography.  Some certifications are international, and need to deliver on-demand throughout the year; those are a great fit for online.

Security: Low vs High

If your test has extremely high stakes, there is extremely high incentive to cheat.  An entry-level certification on WordPress is different than a medical licensure exam.  The latter is a better fit for test centers, while the former might be fine with online proctoring on-demand.

Online proctoring: AI vs Recorded vs Live

If you choose to explore this approach, here are three main types to evaluate.

A. AI only: AI-only proctoring means that there are no humans.  The examinee is recorded on video, and AI algorithms flag potential issues, such as the examinee leaving their seat, then notify an administrator (usually a professor) of students with a high number of flags.  This approach is usually not relevant for certifications or other credentialing exams; it is more for low-stakes exams like a Psychology 101 midterm at your local university.  The vendors for this approach are interested in large-scale projects, such as proctoring all midterms and finals at a university, perhaps hundreds of thousands of exams per year.

B. Record and Review: Record-and-review proctoring means that the examinee is recorded on video, but that video is watched by a real human, who flags it if they think there is cheating, theft, or other issues.  This is much higher quality, and higher price, but has one major flaw that might be concerning for certification tests: if someone steals your test by taking pictures, you won’t find out until the next day.  But at least you know who it was and you are certain of what happened, with video proof.  Perhaps useful for microcredentials or recertification exams.

C. Live Online Proctoring: Live online proctoring (LOP), or what I call “live human proctoring” (because some AI proctoring is also “live” in real time!) means that there is a professional human proctor on the other side of the video from the examinee.  They check the examinee in, confirm their identity, scan the room, provide instructions, and actually watch them take the test.  Some providers like MonitorEDU even have the examinee make a second video stream on their phone, which is placed on a bookshelf or similar spot to see the entire room through the test.  Certainly, this approach is a very good fit with certification exams and other credentialing.  You protect the test content as well as the validity of that individual’s score; that is not possible with the other two approaches.

We have also prepared a list of the best online proctoring software platforms.

2. Determine other technology, psychometric, and operational needs

Next, your organization should establish any other needs for your exams that could impact the vendor selection.

  1. Do you require special item types, such that the delivery platform needs to support or integrate with them?
  2. Do you have simulations or OSCEs?
  3. Do you have specific needs around accessibility and accommodations for your candidates?
  4. Do you need adaptive testing or linear on the fly testing?
  5. Do you need extensive Psychometric consulting services?
  6. Do you need an integrated registration and payment portal?  Or a certification management system to track expirations and other important information?

Write all these up so that you can use the list to shop for a provider.

3. Find a provider – or several!

test development cycle fasttest

While it might seem easier to find a single provider for everything, that’s often not the best solution.  Look for those vendors that specifically fit your needs.

For example, most providers of remote proctoring are just that: remote proctoring.  They do not have a professional platform to manage item banks, schedule examinees, deliver tests, create custom score reports, and analyze psychometrics.  Some do not even integrate with such platforms, and only integrate with learning management systems like Moodle, seeing as their entire target market is only low-stakes university exams.  So if you are seeking a vendor for certification testing or other credentialing, the list of potential vendors is smaller.

Likewise, there are some vendors that only do the exam development and psychometrics, but lack a software platform and proctoring services for delivery.  In these cases, they might have very specific expertise, and often have lower costs due to lower overhead.  An example is JML Testing Services.

Once you have some idea what you are looking for, start shopping for vendors that provide services for certification exam delivery, development, and scoring.  In some cases, you might not settle on a certain approach right away, and that’s OK.  See what is out there and compare prices.  Perhaps the cost of Live Remote Proctoring is more affordable than you anticipated, and you can upgrade to that.

Besides a simple Google search, some good places to start are the member listings of the Association of Test Publishers and the Institute for Credentialing Excellence.

4. Establish the new process with policies and documentation

Once you have finalized your vendors, you need to write policies and documentation around them.  For example, if your vendor has a certain login page for proctoring (we have ascproctor.com), you should take relevant screenshots and write up a walkthrough so candidates know what to expect.  Much of this should go into your Candidate Handbook.  Some of the things to cover that are specific to exam day for the candidates:

  • How to prepare for the exam
  • How to take a practice test
  • What is allowed during the exam
  • What is not allowed
  • ID needed and the check-in process
  • Details on specific locations (if using locations)
  • Rules for accessibility and accommodations
  • Time limits and other practical considerations in the exam

Next, consider all the things that are impacted other than exam day.

  • Eligibility pathways and applications
  • Registration and scheduling
  • Candidate training and practice tests
  • Reporting: just to the candidates, or perhaps to training programs as well?
  • Accounting and other operations: consider your business needs, such as how you manage money, monthly accounting reports, etc.
  • Test security plan: What do you do if someone is caught taking pictures of the exam with their phone, or another security incident occurs?

5. Let Everyone Know

Once you have written up everything, make sure all the relevant stakeholders know.  Publish the new Candidate Handbook and announce to the world.  Send emails to all upcoming candidates with instructions and an opportunity for a practice exam.  Put a link on your homepage.  Get in touch with all the training programs or universities in your field.  Make sure that everyone has ample opportunity to know about the new process!

6. Roll Out

Finally, of course, you can implement the new approach to certification exam delivery.  You might launch a new certification exam from scratch, or perhaps you are moving one from paper to online with remote proctoring, or some other change.  Either way, you need a date to start using it and a change management process.  The good news is that, even though it’s probably a lot of work to get here, the new approach is probably going to save you time and money in the long run.  Roll it out!

Also, remember that this is not a single point in time.  You’ll need to update into the future.  You should also consider the implementation of audits or quality control as a way to drive improvement.

 

Ready to start?

exam development certification committee

Certification exam delivery is the process of administering a certification test to candidates.  This might seem straightforward, but it is surprisingly complex.  The greater the scale and the stakes, the more potential threats and pitfalls.  Assessment Systems Corporation is one of the world leaders in the development and delivery of certification exams.  Contact us to get a free account in our platform and experience the examinee process, or to receive a demonstration from one of our experts.

 

 

One of my favorite quotes is from Mark Twain: “There is no such thing as a new idea. It is impossible. We simply take a lot of old ideas and put them into a sort of mental kaleidoscope.”  How can we construct a better innovation kaleidoscope for assessment?

We all attend conferences to get ideas from our colleagues in the assessment community on how to manage challenges. But ideas from across industries have been the source for some of the most radical innovations. Did you know that the inspiration for fast food drive-throughs was race car pit stops? Or that the idea for wine packaging came from egg cartons?

Most of the assessment conferences we have attended recently have been filled with sessions about artificial intelligence. AI is one of the most exciting developments to come along in our industry – as well as in other industries – in a long time. But many small- or moderate-sized organizations may feel it is out of reach for their organizations. Or they may be reluctant to adopt it for security or other concerns.

There are other worthwhile ideas that can be borrowed from other industries and adapted for use by small and moderate-sized assessment organizations. For instance, concepts from product development, design thinking, and lean manufacturing can be beneficial to assessment processes.

Agile Software Development

Many organizations use agile product methodologies for software development. While strict adherence to an agile methodology may not be appropriate for item development activities, there are pieces of the agile philosophy that might be helpful for item development processes. For instance, in the agile methodology, user stories are used to describe the end goal of a software feature from the standpoint of a customer or end user. In the same way, the user story concept could be used to delineate the construct an item is intended to measure, the requirements it must meet, or how it is intended to be scored. This can help ensure that everyone involved in test development has a clear understanding of the measurement intent of the item from the onset.

item review kanban

Another feature of agile development is the use of acceptance criteria. Acceptance criteria are predefined standards used to determine if user stories have been completed. In item development processes, acceptance criteria can be developed to set and communicate common standards to all involved in the item authoring process.

Agile development also uses a tool known as a Kanban Board to manage the process of software development by assigning tasks and moving development requests through various stages such as new, awaiting specs, in development, in QA, and user review. This approach can be applied to the management of item development in assessment, as you see here from our Assess.ai platform.

Design Thinking and Innovation

Design thinking is a human-centered approach to innovation. At its core is empathy for customers and users. A key design thinking tool is the journey map, which is a visual representation of a process that individuals (e.g., customers or users) go through to achieve a goal. The purpose of creating a journey map is to identify pain points in the user experience and create better user experiences. Journey maps could potentially be used by assessment organizations to diagram the volunteer SME experience and identify potential improvements. Likewise, it could be used in the candidate application and registration process.

Lean Manufacturing

Lean manufacturing is a methodology aimed at reducing production times. A key technique within the lean methodology is value stream mapping (VSM). VSM is a way of visualizing both the flow of information and materials through a process as a means of identifying waste. Admittedly, I do not know a great deal about the intricacies of the technique, but it is most helpful to understand the underlying philosophy and intentions:

· To develop a mutual understanding between all stakeholders involved in the process;

· To eliminate process steps and tasks which do not add value to the process but may contribute to user frustration and to error.

The big question for innovation: Why?

A key question to ask when examining a process is ‘why.’ So often we proceed with the same processes year in and year out because ‘it’s the way we’ve always done them,’ without ever questioning why – for so long that we have forgotten what the original answer to that question was. ‘Why’ is an immensely powerful and helpful question.

In addition to asking the ‘why’ question, a takeaway from value stream mapping and from journey mapping is visual representation. Being able to diagram or display a process is a fantastic way to develop a mutual understanding among all stakeholders involved in it. We are often so focused on pursuing shiny new tools like AI that we neglect potential efficiencies in the underlying processes. Visually displaying processes can be extremely helpful in process improvement.

T scores

A T Score is a conversion of scores on a test to a standardized scale with a mean of 50 and standard deviation of 10.  This is a common example of a scaled score in psychometrics and assessment.  A scaled score is simply a way to present scores in a more meaningful and easier-to-digest context, especially across different types of assessment.  Therefore, a T Score is a standardized way that scores are presented to make them easier to understand.

Details and examples are below.  If you would like to explore the concept on your own, here’s a free tool in Excel that you can download!

What is a T Score?

A T score is a conversion of the standard normal distribution, aka Bell Curve or z-score.  The normal distribution places observations (of anything, not just test scores) on a scale that has a mean of 0.00 and a standard deviation of 1.00.  We simply convert this to have a mean of 50 and standard deviation of 10.  Doing so has two immediate benefits to most consumers:

  1. There are no negative scores; people generally do not like to receive a negative score!
  2. Scores are round numbers that generally range from 0 to 100, depending on whether 3, 4, or 5 standard deviations is the bound (usually 20 to 80); this somewhat fits with what most people expect from their school days, even though the numbers are entirely different.

The image below shows the normal distribution, labeled with the different scales for interpretation.

T score vs z score vs percentile

How do I calculate a T score?

Use this formula:

T = z*10 + 50

where z is the z-score on the standard normal distribution N(0,1).

Example of a T score

Suppose you have a z-score of -0.5.  If you put that into the formula, you get T = -0.5*10 + 50 = -5 + 50 = 45.  If you look at the graphic above, you can see how being half a standard deviation below the mean translates to a T score of 45.
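If you want to script this yourself, here is a minimal sketch in Python that reproduces the conversion and the example above; the function names are my own, not from any particular package.

def t_score(z: float) -> float:
    """Convert a z-score (mean 0, SD 1) to a T score (mean 50, SD 10)."""
    return z * 10 + 50

def z_score(raw: float, mean: float, sd: float) -> float:
    """Standardize a raw score against a reference group mean and SD."""
    return (raw - mean) / sd

# The example from the text: half a standard deviation below the mean
print(t_score(-0.5))  # 45.0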

How do I interpret a T score?

As you can see above, a T Score of 40 means that you are at approximately the 16th percentile.   This is a low score, obviously, but a student will feel better receiving a 40 than a score of -1.  It is for the same reason that many educational assessments use other scaled scores.  The SAT has a scale with mean=500 and SD=100 (a T score times 10), so a score of 400 again means that you are at z=-1, or the 16th percentile.

A 70 means that you are at approximately the 98th percentile, so it is actually quite high, though students who are used to receiving 90s will feel like it is low!

Since there is a 1-to-1 mapping of the T Score to the other rows, you can see that it does not actually provide any new information.  It is simply a conversion to round, positive numbers that is easier to digest and less likely to upset someone who is unfamiliar with psychometrics.  My undergraduate professor who introduced me to psychometrics used the term “repackaging” to describe scaled scores: if you take an object out of one box and put it in a different box, it looks different superficially, but the object itself and its meaning (e.g., its weight) have not changed.
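To make the “repackaging” point concrete, here is a hedged sketch using only the Python standard library; the specific T values are simply the ones discussed above, and the function name is hypothetical.

import math

def t_to_percentile(t: float) -> float:
    """Convert a T score back to z, then to a percentile via the normal CDF."""
    z = (t - 50) / 10
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

for t in (40, 45, 50, 70):
    print(t, round(t_to_percentile(t), 1))
# 40 -> 15.9, 45 -> 30.9, 50 -> 50.0, 70 -> 97.7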

Is a T Score like a t-test?

No.  Despite the similar name, the two could not be more unrelated; the t-test is a statistical hypothesis test, while the T score is simply a scaled score.

How do I implement with an assessment?

If you are using off-the-shelf psychological assessments, they will likely produce a T Score for you in the results.  If you want to utilize it for your own assessments, you need a world-class assessment platform like FastTest that has strong functionality for scoring methods and scaled scoring.  An example of this is below.  Here, we are utilizing item response theory for the raw score.

As with all scaled scoring, it is a good idea to provide an explanation to your examinees and stakeholders.

Scaled scores in FastTest

Spearman-Brown

The Spearman-Brown formula, also known as the Spearman-Brown Prophecy Formula or Correction, is a method used in evaluating test reliability.  It is based on the idea that split-half reliability has better assumptions than coefficient alpha but only estimates reliability for a half-length test, so you need to implement a correction that steps it up to a true estimate for a full-length test.

Looking for software to help you analyze reliability?  Download a free copy of Iteman.

Coefficient Alpha vs. Split Half Reliability

The most commonly used index of test score reliability is coefficient alpha.  However, it is not the only index of internal consistency.  Another common approach is split-half reliability, where you split the test into two halves (first/last, even/odd, or a random split) and then correlate scores on the two halves.  The reasoning is that if both halves of the test measure the same construct at a similar level of precision and difficulty, then scores on one half should correlate highly with scores on the other half.  More information on split-half is found here.

However, split-half reliability presents an inconvenient situation: we are effectively gauging the reliability of only half a test.  It is a well-known fact that reliability increases with more items (observations); we can all agree that a 100-item test is more reliable than a 10-item test comprised of similar-quality items.  So the split-half correlation blatantly underestimates the reliability of the full-length test.
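As a concrete illustration, here is a minimal sketch of an odd/even split-half correlation, assuming you already have a scored 0/1 response matrix; the simulated data below is made up purely for demonstration and is not from Iteman or any real exam.

import numpy as np

def odd_even_split_half(responses: np.ndarray) -> float:
    """Correlate total scores on odd-numbered items with even-numbered items."""
    odd = responses[:, 0::2].sum(axis=1)
    even = responses[:, 1::2].sum(axis=1)
    return float(np.corrcoef(odd, even)[0, 1])

# Simulate a 200-examinee, 50-item response matrix with a simple logistic model
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 50))
prob = 1 / (1 + np.exp(-(ability - difficulty)))
responses = (rng.random((200, 50)) < prob).astype(int)

print(round(odd_even_split_half(responses), 3))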

Adjusting the Split Half Back To Reality: The Spearman-Brown Formula

To adjust for this, psychometricians use the Spearman-Brown prophecy formula.  It takes the split half correlation as input and converts it to an estimate of the equivalent level of reliability for the full-length test.  While this might sound complex, the actual formula is quite simple.

rfull = 2*rhalf / (1 + rhalf)

As you can see, the formula takes the split-half reliability (rhalf) as input and produces the full-length estimate (rfull).  This can then be interpreted alongside the ubiquitously used coefficient alpha.
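Here is a one-line sketch of that correction in Python; as a check, plugging in the 0.660 random split-half from the Iteman table below returns the 0.795 shown in the S-B Random column.

def spearman_brown(r_half: float) -> float:
    """Step a split-half correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.660), 3))  # 0.795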

While the calculation is quite simple, you still shouldn’t have to do it yourself.  Any decent software for classical item analysis will produce it for you.  As an example, here is the output of the Reliability Analysis table from our Iteman software for automated reporting and assessment intelligence with CTT.  This lists the various split-half estimates alongside the coefficient alpha (and its associated SEM) for the total score as well as the domains, so you can evaluate if there are domains that are producing unusually unreliable scores. 

Note: There is an ongoing argument amongst psychometricians whether domain scores are even worthwhile since the assumed unidimensionality of most tests means that the domain scores are  less reliable estimates of the total score, but that’s a whole ‘nother blog post!

Score | N Items | Alpha | SEM | Split-Half (Random) | Split-Half (First-Last) | Split-Half (Odd-Even) | S-B Random | S-B First-Last | S-B Odd-Even
All items | 50 | 0.805 | 3.058 | 0.660 | 0.537 | 0.668 | 0.795 | 0.699 | 0.801
1 | 10 | 0.522 | 1.269 | 0.338 | 0.376 | 0.370 | 0.506 | 0.547 | 0.540
2 | 18 | 0.602 | 1.860 | 0.418 | 0.309 | 0.448 | 0.590 | 0.472 | 0.619
3 | 12 | 0.605 | 1.496 | 0.449 | 0.417 | 0.383 | 0.620 | 0.588 | 0.553
4 | 10 | 0.485 | 1.375 | 0.300 | 0.329 | 0.297 | 0.461 | 0.495 | 0.457

You can see that, as mentioned earlier, there are 3 ways to do the split in the first place, and Iteman reports all three.  It then reports the Spearman-Brown formula for each, which are higher than the split-half coefficients.  These generally align with the results of the alpha estimates, which overall provide a cohesive picture about the structure of the exam and its reliability of scores.  As you might expect, domains with more items are slightly more reliable, but not super reliable since they are all less than 20 items.

So, what does this mean in the big scheme of things?  In many cases the Spearman-Brown estimates will not differ much from the alpha estimates, but it is still good to check whether they do.  In the case of high-stakes tests, you want to make every effort to ensure that the scores are highly reliable and precise.

Tell me more!

If you’d like to learn more, here is an article on the topic.  Or, contact solutions@assess.com to discuss consulting projects with our Ph.D. psychometricians.

Item Writing Tips

Item writing (aka item authoring) is a science as well as an art, and if you have done it, you know just how challenging it can be!  You are an expert at what you do, and you want to make sure that your examinees are too.  But it is hard to write questions that are clear, reliable, unbiased, and that differentiate on the thing you are trying to assess.  Here are some tips.

What is Item Writing / Item Authoring ?

Item authoring is the process of creating test questions.  You have certainly seen “bad” test questions in your life, and know firsthand just how frustrating and confusing that can be.  Fortunately, there is a lot of research in the field of psychometrics on how to write good questions, and also how to have other experts review them to ensure quality.  It is best practice to make items go through a workflow, so that the test development process is similar to the software development process.

Because items are the building blocks of tests, the items within your test are likely the greatest threat to its overall validity and reliability.  Here are some important tips for item authoring.  Want deeper guidance?  Check out our Item Writing Guide.

Anatomy of an Item

First, let’s talk a little bit about the parts of a test question.  The diagram below shows a reading passage with two questions on it.  Here are some of the terms used:

  • Asset/Stimulus: This is a reading passage here, but could also be an audio, video, table, PDF, or other resource
  • Item: An overall test question, usually called an “item” rather than a “question” because sometimes they might be statements.
  • Stem: The part of the item that presents the situation or poses a question.
  • Options: All of the possible answer choices.
  • Key: The correct answer.
  • Distractors: The incorrect answers.

Parts of a test item

Item writing tips: The Stem

To find out whether your test items are your allies or your enemies, read through your test and identify the items that contain the most prevalent item construction flaws.  The first three of the most prevalent construction flaws are located in the item stem (i.e. question).  Look to see if your item stems contain…

1) BIAS

Nowadays, we tend to think of bias as relating to culture or religion, but there are many more subtle types of biases that oftentimes sneak into your tests.  Consider the following questions to determine the extent of bias in your tests:

  • Are there acronyms in your test that are not considered industry standard?
  • Are you testing on policies and procedures that may vary from one location to another?
  • Are you using vocabulary that is more recognizable to a female examinee than a male?
  • Are you referencing objects that are not familiar to examinees from a newer or older generation?

2) NOT

We’ve all taken tests that ask a negatively worded question.  These test items are often the product of item authoring by newbies, but they are devastating to the validity and reliability of your tests, particularly for fast test-takers or individuals with lower reading skills.  If the examinee misses that one single word, they will get the question wrong even if they actually know the material.  This test item ends up penalizing the wrong examinees!

3) EXCESS VERBIAGE

Long stems can be effective and essential in many situations, but they are also more prone to two specific item construction flaws.  If the stem is unnecessarily long, it can contribute to examinee fatigue.  Because each item requires more energy to read and understand, examinees tire sooner and may begin to perform more poorly later on in the test—regardless of their competence level.

Additionally, long stems often include information that can be used to answer other questions in the test.  This could lead your test to be an assessment of whose test-taking memory is best (i.e. “Oh yeah, #5 said XYZ, so the answer to #34 is XYZ.”) rather than who knows the material.

Item writing tips:  distractors / options

Unfortunately, item stems aren’t the only offenders.  Experienced test writers know that the distractors (i.e., the incorrect options) are actually more difficult to write than the stems themselves.  When you review your test items, look to see if your distractors contain…

4) IMPLAUSIBILITY

The purpose of a distractor is to pull less-qualified examinees away from the correct answer by offering other options that look correct.  In order to “distract” an examinee from the correct answer, they have to be plausible.  The closer they are to being correct, the more difficult the exam will be.  If the distractors are obviously incorrect, even unqualified examinees won’t pick them.  Then your exam will not help you discriminate between examinees who know the material and examinees who do not, which is the entire goal.

5) 3-TO-1 SPLITS

You may recall watching Sesame Street as a child.  If so, you remember the song “One of these things…”  (Either way, enjoy refreshing your memory!)   Looking back, it seems really elementary, but sometimes our test item options are written in such a way that an examinee can play this simple game with your test.  Instead of knowing the material, they can look for the option that stands out as different from the others.  Consider the following questions to determine if one of your items falls into this category:

  • Is the correct answer significantly longer than the distractors?
  • Does the correct answer contain more detail than the distractors?
  • Is the grammatical structure different for the answer than for the distractors?

6) ALL OF THE ABOVE

There are a couple of problems with having this phrase (or the opposite “None of the above”) as an option.  For starters, good test takers know that this is—statistically speaking—usually the correct answer.  If it’s there and the examinee picks it, they have a better than 50% chance of getting the item right—even if they don’t know the content.  Also, if they are able to identify two options as correct, they can select “All of the above” without knowing whether or not the third option was correct.  These sorts of questions also get in the way of good item analysis.   Whether the examinee gets this item right or wrong, it’s harder to ascertain what knowledge they have because the correct answer is so broad.

This is helpful, can I learn more?

Want to learn more about item writing?  Here’s an instructional video from one of our PhD psychometricians.  You should also check out this book.

Item authoring is easier with an item banking system

The process of reading through your exams in search of these flaws in the item authoring is time-consuming (and oftentimes depressing), but it is an essential step towards developing an exam that is valid, reliable, and reflects well on your organization as a whole.  We also recommend that you look into getting a dedicated item banking platform, designed to help with this process.

Summary Checklist

 

Issue | Recommendation
Key is invalid due to multiple correct answers. | Consider each answer option individually; the key should be fully correct, with each distractor being fully incorrect.
Item was written in a hard-to-comprehend way; examinees were unable to apply their knowledge because of poor wording. | Ensure that the item can be understood after just one read-through. If you have to read the stem multiple times, it needs to be rewritten.
Grammar, spelling, or syntax errors direct savvy test takers toward the correct answer (or away from incorrect answers). | Read the stem, followed by each answer option, aloud. Each answer option should fit with the stem.
Information was introduced in the stem text that was not relevant to the question. | After writing each question, evaluate the content of the stem. It should be clear and concise, without irrelevant information.
Item emphasizes trivial facts. | Work from a test blueprint to ensure that each of your items maps to a relevant construct. If you are using Bloom’s taxonomy or a similar approach, items should come from the higher-order levels.
Numerical answer options overlap. | Carefully evaluate numerical ranges to ensure there is no overlap among options.
Examinees noticed the answer was most often A. | Distribute the key evenly among the answer options. This can be avoided with FastTest’s randomized delivery functionality.
Key was overly specific compared to distractors. | Answer options should all be about the same length and contain the same amount of information.
Key was the only option to include a key word from the item stem. | Avoid re-using key words from the stem text in your answer options. If you do use such words, distribute them evenly among all of the answer options so as not to call out individual options.
A rare exception can be argued to invalidate a true/false or always/never question. | Avoid using “always” or “never,” as there can be unanticipated or rare scenarios. Opt for less absolute terms like “most often” or “rarely.”
Distractors were not plausible; key was obvious. | Review each answer option and ensure that it has some bearing in reality. Distractors should be plausible.
Idiom or jargon was used; non-native English speakers did not understand. | It is best to avoid figures of speech; keep the stem text and answer options literal to avoid introducing undue discrimination against certain groups.
Key was significantly longer than distractors. | There is a strong tendency to write a key that is very descriptive. Be wary of this, and evaluate distractors to ensure they are approximately the same length.

validity threats

Validity threats are issues with a test or assessment that hinder the interpretations and use of scores, such as cheating, inappropriate use of scores, unfair preparation, or non-standardized delivery.  It is important to establish a test security plan to define the threats relevant for you and address them.

Validity, in its modern conceptualization, refers to evidence that supports our intended interpretations of test scores (see Chapter 1 of APA/AERA/NCME Standards for full treatment).   The word “interpretation” is key because test scores can be interpreted in different ways, including ways that are not intended by the test designers.  For example, a test given at the end of Nursing school to prepare for a national licensure exam might be used by the school as a sort of Final Exam.  However, the test was not designed for this purpose and might not even be aligned with the school’s curriculum.  Another example is that certification tests are usually designed to demonstrate minimal competence, not differentiate amongst experts, so interpreting a high score as expertise might not be warranted.

Validity threats: Always be on the lookout!

Test sponsors, therefore, must be vigilant against any validity threats.  Some of these, like the two aforementioned examples, might be outside the scope of the organization.  While it is certainly worthwhile to address such issues, our primary focus is on aspects of the exam itself.

Which validity threats rise to the surface in psychometric forensics?

Here, we will discuss several threats to validity that typically present themselves in psychometric forensics, with a focus on security aspects.  However, I’m not just listing security threats here, as psychometric forensics is excellent at flagging other types of validity threats too.

Collusion (copying): Examinees are copying answers from one another, usually with a defined source.
  • Error similarity (looks only at incorrect responses). Example: two examinees get the same 10 items wrong and select the same distractor on each. Indices: B-B Ran, B-B Obs, K, K1, K2, S2.
  • Response similarity. Example: two examinees give the same response on 98/100 items. Indices: S2, g2, ω, Zjk.

Group-level help/issues: Similar to collusion but at a group level; could be examinees working together, or receiving answers from a teacher/proctor. Note that many examinees using the same brain dump would have a similar signature, but spread across locations.
  • Group-level statistics. Example: a location has one of the highest mean scores but the lowest mean times. Indices: descriptive statistics such as mean score, mean time, and pass rate.
  • Response or error similarity. Example: on a certain group of items, the entire classroom gives the same answers. Indices: roll-up analysis, such as mean collusion flags per group; also erasure analysis (paper only).

Pre-knowledge: Examinee comes in to take the test already knowing the items and answers, often purchased from a brain dump website.
  • Time-score analysis. Example: examinee has a high score and a very short time. Indices: RTE or total time vs. scores.
  • Response or error similarity. Example: examinee has all the same responses as a known brain dump site. Indices: all indices.
  • Pretest item comparison. Example: examinee gets 100% on existing items but 50% on new items. Indices: pretest vs. scored results.
  • Person fit. Example: examinee gets the 10 hardest items correct but performs below average on the rest of the items. Indices: Guttman indices, lz.

Harvesting: Examinee is not actually trying to pass the test, but is sitting it to memorize items so they can be sold afterwards, often to a brain dump website. Similar signature to Sleepers, but more likely to occur on voluntary tests, or where high scores benefit examinees.
  • Time-score analysis. Example: low score, high time, few attempts. Indices: RTE or total time vs. scores.
  • Mean vs. median item time. Example: examinee “camps” on 10 items to memorize them; mean item time is much higher than the median. Indices: mean-median index.
  • Option flagging. Example: examinee answers “C” to all items in the second half. Indices: option proportions.

Low motivation (Sleeper): Examinees are disengaged, producing data that is flagged as unusual and invalid; fortunately, not usually a security concern, but it could be a policy concern. Similar signature to Harvesters, but more likely to occur on mandatory tests, or where high scores do not benefit examinees.
  • Time-score analysis. Example: low score, high time, few attempts. Indices: RTE or total time vs. scores.
  • Item timeout rate. Example: if you have item time limits, the examinee hits them. Indices: proportion of items that hit the limit.
  • Person fit. Example: examinee attempts a few items and passes through the rest. Indices: Guttman indices, lz.

Low motivation (Clicker): Examinees are disengaged, producing data that is flagged as unusual and invalid; fortunately, not usually a security concern, but it could be a policy concern. Similar idea to the Sleeper, but the data looks quite different.
  • Time-score analysis. Example: examinee quickly clicks “A” to all items, finishing with a low time and low score. Indices: RTE, total time vs. scores.
  • Option flagging. Example: see above. Indices: option proportions.
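To give a flavor of what these screens look like in practice, here is a simplified sketch of two of them: a pairwise response-similarity proportion and a crude time-vs-score flag. These are toy versions for illustration only; operational forensics relies on the published indices named above (ω, g2, lz, and so on), and the data below is made up.

import numpy as np

def response_similarity(resp_a: np.ndarray, resp_b: np.ndarray) -> float:
    """Proportion of items on which two examinees selected the same option."""
    return float(np.mean(resp_a == resp_b))

def time_score_flags(scores: np.ndarray, times: np.ndarray, z_cut: float = 2.0) -> np.ndarray:
    """Flag examinees with unusually high scores but unusually low total time,
    a crude screen for possible pre-knowledge."""
    z_scores = (scores - scores.mean()) / scores.std()
    z_times = (times - times.mean()) / times.std()
    return np.where((z_scores > z_cut) & (z_times < -z_cut))[0]

# Hypothetical data: option choices for two examinees, plus scores and times
a = np.array(list("ABCDABCDAB"))
b = np.array(list("ABCDABCDAC"))
print(response_similarity(a, b))  # 0.9

scores = np.array([62, 55, 71, 95, 58, 60])
times = np.array([3400, 3600, 3550, 1200, 3700, 3500])
print(time_score_flags(scores, times))  # flags examinee 3: high score, low time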

Psychometric Forensics to Find Evidence of Cheating

An emerging sector in the field of psychometrics is devoted to analyzing test data to find cheaters and other illicit or invalid testing behavior.  There is a distinction between primary and secondary collusion, and there are specific collusion detection indices and methods to investigate aberrant testing behavior, such as those listed above.

While research on this topic is more than 50 years old, the modern era did not begin until Wollack published his paper on the Omega index in 1997. Since then, the sophistication and effectiveness of methodology in the field has multiplied, and many more publications focus on it than in the pre-Omega era. This is evidenced by not one but three recent books on the subject:

  1. Wollack, J., & Fremer, J. (2013).  Handbook of Test Security.
  2. Kingston, N., & Clark, A. (2014).  Test Fraud: Statistical Detection and Methodology.
  3. Cizek, G., & Wollack, J. (2016). Handbook of Quantitative Methods for Detecting Cheating on Tests.

 

Item Review

Item review is the process of ensuring that newly-written test questions go through a rigorous peer review, to ensure that they are high quality and meet industry standards.

What is an item review workflow?

Developing a high-quality item bank is an extremely involved process, and authoring of the items is just the first step.  Items need to go through a defined workflow, with multiple people providing item review.  For example, you might require all items to be reviewed by another content expert, a psychometrician, an editor, and a bias reviewer.  Each needs to give their input and pass the item along to the next in line.  You need to record the results of the review for posterity, as part of the concept of validity is that we have documentation to support the development of a test.
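As a rough sketch of what such a workflow looks like as a data structure, each item carries its stage plus an audit log of every review. The stage names and fields below are hypothetical, not the FastTest or Assess.ai data model.

from dataclasses import dataclass, field

STAGES = ["Content Review", "Psychometric Review", "Edit", "Bias Review", "Released"]

@dataclass
class Item:
    item_id: str
    stage: int = 0                                  # index into STAGES
    review_log: list = field(default_factory=list)  # kept for posterity/validity documentation

    def review(self, reviewer: str, comments: str, approved: bool) -> None:
        """Record the review, then advance the item to the next stage if approved."""
        self.review_log.append({
            "stage": STAGES[self.stage],
            "reviewer": reviewer,
            "comments": comments,
            "approved": approved,
        })
        if approved and self.stage < len(STAGES) - 1:
            self.stage += 1

item = Item("ITEM-001")
item.review("Content SME", "Key verified; all distractors incorrect", approved=True)
item.review("Psychometrician", "No tip-offs; option lengths balanced", approved=True)
print(STAGES[item.stage])  # "Edit"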

What to review?

You should first establish what you want reviewed.  Assessment organizations will often formalize the guidelines as an Item Writing Guide.  Here is the guide that Assessment Systems uses with our clients, but I also recommend checking out the NBME Item Writing Guide.  For an even deeper treatment, I recommend the book Developing and Validating Test Items by Haladyna and Rodriguez (2013).

Here are some aspects to consider for item review.

Content

Most importantly, other content experts should check the item’s content.  Is the keyed answer actually correct?  Are all the distractors actually incorrect?  Does the stem provide all the necessary information?  You’d be surprised how many times such issues slip past even the best reviewers!

Psychometrics

Psychometricians will often review an item to confirm that it meets best practices and that there are no tip-offs.  A common one is that the correct answer is often longer (more words) than the distractors.  Some organizations avoid “all of the above” and other approaches.

Format

Formal editors are sometimes brought in to work on the language and format of the item.  A common mistake is to end the stem with a colon even though that does not follow basic grammatical rules of English.

Bias/Sensitivity

For high-stakes exams that are used on diverse populations, it is important to add this step.  You don’t want items that are biased against a subset of students.  This is not just racial; it can include other differentiations of students.  Years ago I worked on items for the US State of Alaska, which has some incredibly rural regions; we had to avoid concepts that many people take for granted, like roads or shopping malls!

How to implement an item review workflow

item review kanban

This is an example of how to implement the process in a professional-grade item banking platform.  Both of our platforms, FastTest and Assess.ai, have powerful functionality to manage this process.  Admin users can define the stages and the required input, then manage the team members and flow of items.  Assess.ai is unique in the industry with its use of Kanban boards – recognized as the best UI for workflow management – for item review.

An additional step, often done at the same time, is standard setting.  One of the most common approaches is the modified-Angoff method, which requires you to obtain a difficulty rating from a team of experts for each item.  The Item Review interfaces excel at managing this process as well, saving you all the effort of tracking those ratings manually!
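Here is a hedged sketch of the arithmetic behind a modified-Angoff rating exercise; the ratings matrix below is made up for illustration. Each expert estimates the probability that a minimally competent candidate answers each item correctly; the per-item ratings are averaged and then summed to suggest a raw cut score.

import numpy as np

ratings = np.array([           # rows = raters, columns = items
    [0.60, 0.75, 0.40, 0.90],
    [0.55, 0.80, 0.50, 0.85],
    [0.65, 0.70, 0.45, 0.95],
])

item_means = ratings.mean(axis=0)   # consensus rating per item
cut_score = item_means.sum()        # expected raw score of a borderline candidate
print(item_means.round(2))          # [0.6  0.75 0.45 0.9 ]
print(round(cut_score, 2))          # 2.7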

CREATE WORKFLOW
Specify your stages and how items can move between them

DEFINE YOUR REVIEW FIELDS
These are special item metadata fields that require input from multiple users

MOVE NEW ITEMS INTO THE WORKFLOW
Once an item is written, it is ready for review

ASSIGN ITEMS TO USERS
Assign the item in the UI, with the option to send an email

USERS PERFORM REVIEWS
They can read the item, interact as a student would, and then leave feedback and other metadata in the review fields; then push the item down the line

ADMINS EVALUATE/EXPORT THE RESULTS
Admins can evaluate the results and decide if an item needs revision, or if it can be considered released.