Posts on psychometrics: The Science of Assessment

Automated item generation (AIG) is a paradigm for developing assessment items (test questions), utilizing principles of artificial intelligence and automation. As the name suggests, it tries to automate some or all of the effort involved with item authoring, as that is one of the most time-intensive aspects of assessment development – which is no news to anyone who has authored test questions!

What is Automated Item Generation?

Automated item generation involves the use of computer algorithms to create new test questions, or variations of them.  It can also be used for item review, answer generation, or the creation of assets such as reading passages.  Items still need to be reviewed and edited by humans, but AIG still saves a massive amount of time in test development.

Why Use Automated Item Generation?

Items can cost up to $2000 to develop, so even cutting the average cost in half could provide massive time/money savings to an organization.  ASC provides AIG functionality, with no limits, to anyone who signs up for a free item banking account in our platform  Assess.ai.

Types of Automated Item Generation

There are two types of automated item generation.  The Item Templates approach was developed before large language models (LLMs) were widely available.  The second approach is to use LLMs, which became widely available at the end of 2022.

Type 1: Item Templates

The first type is based on the concept of item templates to create a family of items using dynamic, insertable variables. There are three stages to this work. For more detail, read this article by Gierl, Lai, and Turner (2012).

  • Authors, or a team, create a cognitive model by isolating exactly what they are trying to assess and the different ways that the knowledge could be presented or evidenced. This might include information such as which variables are important vs. incidental, and what a correct answer should include.
  • They then develop templates for items based on this model, like the example you see below.
  • An algorithm then turns this template into a family of related items, often by producing all possible permutations (a minimal sketch of this step appears below).
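To make the permutation step concrete, here is a minimal sketch in Python. The template, variable values, and scoring rule are all invented for illustration; a real cognitive model would be far richer.

```python
from itertools import product

# Hypothetical item template with insertable variables (invented for illustration).
TEMPLATE = ("A patient weighs {weight} kg and is prescribed {dose} mg/kg of {drug}. "
            "What is the total dose in mg?")

# Possible values for each dynamic variable.
variables = {
    "weight": [60, 70, 80],
    "dose": [2, 5],
    "drug": ["Drug A", "Drug B"],
}

def generate_items(template, variables):
    """Expand the template into every permutation of its variables,
    returning each item stem with its keyed (correct) answer."""
    names = list(variables)
    for combo in product(*(variables[n] for n in names)):
        values = dict(zip(names, combo))
        stem = template.format(**values)
        key = values["weight"] * values["dose"]   # correct answer for this family
        yield stem, key

for stem, key in generate_items(TEMPLATE, variables):
    print(f"{stem}  Key: {key}")
```

This little sketch produces 12 permutations; as noted below, each one should still be reviewed by SMEs before it enters the item bank.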

Obviously, you can’t use more than one of these on a given test form. And in some cases, some of the permutations will be an unlikely scenario or possibly completely irrelevant. But the savings can still be quite real. I saw a conference presentation by Andre de Champlain from the Medical Council of Canada, stating that overall efficiency improved by 6x and the generated items were higher quality than traditionally written items because the process made the authors think more deeply about what they were assessing and how. He also recommended that template permutations not be automatically moved to the item bank but instead that each is reviewed by SMEs, for reasons such as those stated above.

You might think, “Hey, that’s not really AI…” – but AI is simply doing things that have historically been done by humans, and the definition gets pushed further every year. Remember, AI used to be just having the Atari be able to play Pong with you!

[Image: example AIG item template]

Type 2: AI Generation or Processing of Source Text

The second type is what the phrase “automated item generation” more likely brings to mind: upload a textbook or similar source to some software, and it spits back drafts of test questions. For example, see this article by von Davier (2019). Or alternatively, simply state a topic as a prompt and the AI will generate test questions.

Until the release of ChatGPT and other publicly available AI platforms implementing large language models (LLMs), this approach was only available to experts at large organizations.  Now, it is available to everyone with an internet connection.  If you use such products directly, you can provide a prompt such as “Write me 10 exam questions on Glaucoma, in a 4-option multiple choice format” and it will do so.  You can also update the instructions to be more specific, and add instructions such as formatting the output in your preferred format, such as QTI or JSON.
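As a rough illustration, here is how such a prompt might be sent programmatically to a chat-completions-style API using the Python requests library. The endpoint and model name follow OpenAI’s public API but should be treated as assumptions: vendors differ, and the key is assumed to live in an environment variable.

```python
import os
import requests

# Assumes an OpenAI-style chat completions endpoint; adjust for your vendor.
API_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]          # key assumed to be set beforehand

prompt = ("Write me 10 exam questions on Glaucoma, in a 4-option "
          "multiple choice format. Return the items as JSON.")

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o-mini",                  # assumed model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    },
    timeout=60,
)
response.raise_for_status()

# The draft items come back as text; they still need SME review before banking.
draft_items = response.json()["choices"][0]["message"]["content"]
print(draft_items)
```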

Alternatively, many assessment platforms now integrate with these products directly, so you can do the same thing, but have the items appear for you in the item banker under New status, rather than have them go to a raw file on your local computer that you then have to clean and upload.  FastTest  has such functionality available.

This technology has completely revolutionized how we develop test questions.  I’ve seen several research presentations on this, and they all find that AIG produces more items, of quality that is as good or even better than humans, in a fraction of the time!  But, they have also found that prompt engineering is critical, and even one word – like including “concise” in your prompt – can affect the quality of the items.

[Image: FastTest automated item generation screen]

The Limitations of Automated Item Generation

Automated item generation (AIG) has revolutionized the way educational and psychological assessments are developed, offering increased efficiency and consistency. However, this technology comes with several limitations that can impact the quality and effectiveness of the items produced.

One significant limitation is the challenge of ensuring content validity. AIG relies heavily on algorithms and pre-defined templates, which may not capture the nuanced and comprehensive understanding of subject matter that human experts possess. This can result in items that are either too simplistic or fail to fully address the depth and breadth of the content domain.

Another limitation is the potential for over-reliance on statistical properties rather than pedagogical soundness. While AIG can generate items that meet certain psychometric criteria, such as difficulty and discrimination indices, these items may not always align with best practices in educational assessment or instructional design. This can lead to tests that are technically robust but lack relevance or meaningfulness to the learners.

Furthermore, the use of AIG can inadvertently introduce bias. Algorithms used in item generation are based on historical data and patterns, which may reflect existing biases in the data. Without careful oversight and adjustment, AIG can perpetuate or even exacerbate these biases, leading to unfair assessment outcomes for certain groups of test-takers.

Lastly, there is the issue of limited creativity and innovation. Automated systems generate items based on existing templates and rules, which can result in a lack of variety and originality in the items produced. This can make assessments predictable and less engaging for test-takers, potentially impacting their motivation and performance.

In conclusion, while automated item generation offers many benefits, it is crucial to address these limitations through continuous oversight, integration of expert input, and regular validation studies to ensure the development of high-quality assessment items.

How Can I Implement Automated Item Generation?

If you are a user of AI products like ChatGPT or Bard, you can work directly with them.  Advanced users can implement APIs to upload documents or fine-tune the machine learning models.  The aforementioned article by von Davier talks about such usage.

If you want to save time, FastTest provides a direct ChatGPT integration, so you can provide the prompt using the screen shown above, and items will then be automatically created in the item banking folder you specify, with the item naming convention you specify, tagged as Status=New and ready for review.  Items can then be routed through our configurable Item Review Workflow process, including functionality to gather modified-Angoff ratings.

Ready to improve your test development process?  Click here to talk to a psychometric expert.

The IRT Item Difficulty Parameter

The item difficulty parameter from item response theory (IRT) is both a shape parameter of the item response function (IRF) and an important way to evaluate the performance of an item in a test.

Item Parameters and Models in IRT

There are three item parameters estimated under dichotomous IRT: the item difficulty (b), the item discrimination (a), and the pseudo-guessing parameter (c).  IRT is actually a family of models, the most common of which are the dichotomous 1-parameter, 2-parameter, and 3-parameter logistic models (1PL, 2PL, and 3PL). The key parameter that is utilized in all three IRT models is the item difficulty parameter, b.  The 3PL uses all three, the 2PL uses a and b, and the 1PL/Rasch uses only b.

Interpreting the IRT item difficulty parameter

The b parameter is an index of how difficult the item is, or the construct level at which we would expect examinees to have a probability of 0.50 (assuming no guessing) of getting the keyed item response. It is worth remembering that in IRT we model the probability of a correct response on a given item, Pr(X), as a function of examinee ability (θ) and certain properties of the item itself. This function is called the item response function (IRF) or item characteristic curve (ICC), and it is the basic feature of IRT, since all other constructs depend on this curve.

The IRF plots the probability that an examinee will respond correctly to an item as a function of the latent trait θ. The probability of a correct response is a result of the interaction between the examinee’s ability θ and the item difficulty parameter b. As θ increases, the probability that the examinee will provide a correct response to the item rises. The b parameter is a location index that indicates the position of the item function on the ability scale, showing how difficult or easy a specific item is. The higher the b parameter, the higher the ability required for an examinee to have a 50% chance of getting the item correct. Difficult items are located to the right, or higher end, of the ability scale, while easier items are located to the left, or lower end, of the ability scale. Typical values of the item difficulty range from −3 to +3; items whose b values are near −3 are very easy, while items with values near +3 are very difficult for the examinees.

You can interpret the b parameter as a sort of “z-score for the item.”  If the value is -1.0, that means it is appropriate for examinees at a score of -1.0 (approximately the 16th percentile).

The interpretation of the b parameter is the opposite of the item difficulty statistic (the p-value) in classical test theory (CTT): a low b indicates an easy item, and a high b indicates a difficult item, since higher b requires higher θ for a correct response.  With the CTT p-value, a low value indicates a hard item and a high value indicates an easy item, which is why it is sometimes called item facility.

Examples of the IRT item difficulty parameter

Let’s consider an example. There are three IRFs below for three different items: D, E, and F. All three items have the same level of discrimination but different item difficulty values on the ability scale. In the 1PL, it is assumed that the only item characteristic that influences examinee performance is the item difficulty (the b parameter), and all items are equally discriminating. The b values for items D, E, and F are −0.5, 0.0, and 1.0, respectively. Item D is quite an easy item. Item E represents an item of medium difficulty, such that the probability of a correct response is low at the lowest ability levels and near 1 at the highest ability levels. Item F is a hard item: the probability of responding correctly is low along most of the ability scale and only increases at the higher ability levels.

[Image: item response functions for items D, E, and F]
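A quick way to see these numbers in action is to compute the 1PL probability of a correct response for items D, E, and F at a few ability levels. This short Python sketch uses the b values from the example above.

```python
import math

def p_1pl(theta, b):
    """Probability of a correct response under the 1PL model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

items = {"D": -0.5, "E": 0.0, "F": 1.0}   # b parameters from the example

for theta in (-2, -1, 0, 1, 2):
    probs = ", ".join(f"{name}: {p_1pl(theta, b):.2f}" for name, b in items.items())
    print(f"theta = {theta:+d} -> {probs}")
```

Each item crosses a probability of 0.50 exactly where θ equals its b value, which is why b is called a location parameter.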

Look at the five IRFs below and check whether you are able to compare the items in terms of their difficulty. Below are some specific questions and answers for comparing the items.

[Image: five item response functions]

  • Which item is the hardest, requiring the highest ability level, on average, to get it correct?

Blue (No 5), as it is the furthest to the right.

  • Which item is the easiest?

Dark blue (No 1), as it is the furthest to the left.

How do I calculate the IRT item difficulty?

You’ll need special software like Xcalibre.  Download a copy for free here.

The One Parameter Logistic Model

The One Parameter Logistic Model (OPLM, 1PL, or IRT 1PL) is one of the three main dichotomous models in the Item Response Theory (IRT) framework. The OPLM combines mathematical properties of the Rasch model with the flexibility of the Two Parameter Logistic Model (2PL or IRT 2PL). In the OPLM, difficulty parameters, b, are estimated, while discrimination indices, a, are imputed as known constants.

Background behind the One Parameter Logistic Model

IRT employs mathematical models assuming that the probability that an examinee would answer the question correctly depends on their ability and item characteristics. Examinee’s ability is considered the major individual characteristic and is denoted as θ (“theta”); it is also called the ability parameter. The ability parameter is conceived as an underlying, unobservable latent construct or trait that helps an individual to answer a question correctly.

These mathematical models include item characteristics, also known as the item parameters: discrimination (a), difficulty (b), and pseudo-guessing (c). According to the IRT paradigm, all item parameters are considered invariant, or “person-free,” i.e. they do not depend on examinees’ abilities. In addition, ability estimates are also invariant, or “item-free,” since they do not depend on the particular set of items. This mutual independence forms the basis of the IRT models and provides objectivity in measurement.

The OPLM is built off only one parameter, difficulty. Item difficulty simply means how hard an item is (how high does the latent trait ability level need to be in order to have a 50% chance of getting the item right?). b is estimated for each item of the test. The item response function for the 1PL model looks like this:

$$P(\theta) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}$$

where P is the probability that a randomly selected examinee with ability θ will answer a specific item correctly, b is the item difficulty parameter, and e is a mathematical constant approximately equal to 2.71828, also known as the exponential number or Euler’s number.

Assumptions of the OPLM

The OPLM is based on two basic assumptions: unidimensionality and local independence.

  • Unidimensionality is the most common, but also the most complex and restrictive, assumption for all IRT models, and sometimes it cannot be met. It states that only one ability is measured by the set of items in a single test; thus, a single dominant factor should underlie all item responses. For example, in a Math test examinees need to possess strong mathematical ability to answer test questions correctly. However, if test items also measure another ability, such as verbal ability, the test is no longer unidimensional. Unidimensionality can be assessed by various methods, the most popular of which is the factor analysis approach, available in the free software MicroFACT.
  • Local independence assumes that, holding ability constant, an examinee’s responses to any items are statistically independent, i.e. the probability that an examinee answers a test question correctly does not depend on their answers to other questions. In other words, the only factor influencing the examinee’s responses is the ability.

 

Item characteristic curve

The S-shaped curve describing the relationship between the probability of an examinee’s correct response to a test question and their ability θ is called the item characteristic curve (ICC) or item response function (IRF). In a test, each item has its own ICC/IRF.

A typical ICC for the One Parameter Logistic Model looks like this:

[Image: item characteristic curve for the 1PL with b = 1.0]

The S-shaped curve shows that the probability of a correct response is near zero at the lowest levels of ability and approaches 1 at the highest levels of ability. The curve rises rapidly as we move from left to right and is strictly monotonic.

The OPLM function ranges between 0 and 1; the ICC can approach but never reach 0 or 1. Theoretically, the item difficulty parameter ranges from −∞ to +∞, but in practice the range is limited to roughly −3 to +3. You can easily plot the ICC using the IRT calibration software Xcalibre.
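Xcalibre will produce these plots for you, but if you just want a rough picture, a few lines of Python (assuming numpy and matplotlib are installed) will draw a 1PL ICC like the one above.

```python
import numpy as np
import matplotlib.pyplot as plt

b = 1.0                                    # item difficulty
theta = np.linspace(-3, 3, 200)            # practical range of the ability scale
p = 1.0 / (1.0 + np.exp(-(theta - b)))     # 1PL item response function

plt.plot(theta, p)
plt.axhline(0.5, linestyle="--", linewidth=0.8)   # P = 0.50 where theta = b
plt.axvline(b, linestyle="--", linewidth=0.8)
plt.xlabel("Ability (theta)")
plt.ylabel("Probability of correct response")
plt.title("1PL item characteristic curve, b = 1.0")
plt.show()
```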

Application of the OPLM in test development

The OPLM is especially useful in item selection, item banking, item analysis, test equating, and investigating item bias or Differential Item Functioning (DIF). Since the IRT One Parameter Logistic Model produces item parameter estimates that are “examinee-free,” it is possible to estimate item parameters during piloting and use them later. Based on the information about items and examinees collected during testing, it is straightforward to build item banks that can ultimately be used for large-scale testing programs and Computerized Adaptive Testing (CAT).

The Ebel Method for Multiple-Choice Questions

The Ebel method of standard setting is a psychometric approach to establish a cutscore for tests consisting of multiple-choice questions. It is usually used for high-stakes examinations in the fields of higher education, medical and health professions, and for selecting applicants.

How is the Ebel method performed?

The Ebel method requires a panel of judges who would first categorize each item in a data set by two criteria: level of difficulty and relevance or importance. Then the panel would agree upon an expected percentage of items that should be answered correctly for each group of items according to their categorization.

It is crucial that the judges are experts in the examined field; otherwise, their judgements will not be valid and reliable. Prior to the item rating process, the panelists should be given a sufficient amount of information about the purpose and procedures of the Ebel method. In particular, it is important that the judges understand the meaning of difficulty and relevance in the context of the current assessment.

The next stage is to determine what “minimally competent” performance means in the specific case, depending on the content. When everything is clear and all definitions are agreed upon, the experts classify each item by difficulty (easy, medium, or hard) and relevance (minimal, acceptable, important, or essential). In order to minimize the influence of the judges’ opinions on each other, it is recommended to use individual ratings rather than consensus ratings.

Afterwards, judgements on the proportion of items expected to be answered correctly by minimally competent candidates are collected for each item category, e.g. easy and essential. To save time during the rating process, the grid proposed by Ebel and Frisbie (1972) might be used. It is worth mentioning, though, that Ebel ratings are content-specific, so values in the grid might turn out to be too low or too high for a particular test.

[Image: example Ebel method rating grid]

At the end, the Ebel method, like the modified-Angoff method, identifies a cut-off score for an examination based on the performance of candidates in relation to a defined standard (absolute), rather than how they perform in relation to their peers (relative). Ebel scores for each item and for the whole exam are calculated as the average of the scores provided by each expert: the number of items in each category is multiplied by the expected percentage of correct answers, and the total results are added to calculate the cutscore.
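Here is a minimal sketch of that calculation in Python; the category counts and expected percentages are invented for illustration, not taken from the Ebel and Frisbie grid.

```python
# Hypothetical item counts per (difficulty, relevance) category and the judges'
# average expected percent correct for minimally competent candidates.
categories = {
    ("easy",   "essential"): {"n_items": 10, "expected_pct": 0.90},
    ("medium", "essential"): {"n_items": 15, "expected_pct": 0.70},
    ("hard",   "essential"): {"n_items": 5,  "expected_pct": 0.50},
    ("medium", "important"): {"n_items": 10, "expected_pct": 0.60},
}

# Multiply the number of items in each category by the expected percentage
# correct, then sum across categories to obtain the raw cutscore.
cutscore = sum(c["n_items"] * c["expected_pct"] for c in categories.values())
total_items = sum(c["n_items"] for c in categories.values())

print(f"Cutscore: {cutscore:.1f} of {total_items} items "
      f"({100 * cutscore / total_items:.1f}%)")
```

With these invented numbers, the cutscore works out to 28 of 40 items, or 70%; in practice the expected percentages would be the averages across the full panel of judges.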

Pros of using Ebel

  • This method provides an overview of a test difficulty
  • Cut-off score is identified prior to an examination
  • It is relatively easy for experts to perform

 

Cons of using Ebel

  • This method is time-consuming and costly
  • Evaluation grid is hard to get right
  • Digital software is required
  • Back-up is necessary

 

Conclusion

The Ebel method is a rather complex standard-setting process compared to others, due to the need for an analysis of the content, and it therefore imposes a burden on the standard-setting panel. However, the Ebel method considers the relevance of the test items and the expected proportion of correct answers from minimally competent candidates, including borderline candidates. Thus, even though the procedure is complicated, the results are very stable and very close to the actual cut-off scores.

References

Ebel, R. L., & Frisbie, D. A. (1972). Essentials of educational measurement.

Item Parameter Drift

Item parameter drift (IPD) refers to the phenomenon in which the parameter values of a given test item change over multiple testing occasions within the item response theory (IRT) framework. This is often relevant to student progress monitoring assessments, where a set of items is used several times in one year, or across years, to track student growth. Observing trends in student academic achievement depends upon stable linking (anchoring) between assessment occasions over time; if the item parameters are not stable, the scale is not stable, and time-to-time comparisons are not either. Some psychometricians consider IPD a special case of differential item functioning (DIF), but the two are different issues and should not be confused with each other.

Reasons for Item Parameter Drift

IRT modeling is attractive for the assessment field because of its property of item parameter invariance: parameters do not depend on the particular sample of test-takers used to estimate them. This assumption enables important applications such as strong equating of tests across time and computerized adaptive testing. However, item parameters are not always invariant. There are many possible reasons behind IPD. One possibility is curricular changes based on assessment results, or instruction that has become more concentrated. Other feasible reasons are item exposure, cheating, or curricular misalignment with standards. No matter what has led to IPD, its presence can cause biased estimates of student ability. In particular, IPD can be highly detrimental to reliability and validity in the case of high-stakes examinations. Therefore, it is crucial to detect item parameter drift when anchoring assessment occasions over time, especially when the same anchor items are used repeatedly.

Perhaps the simplest example is item exposure.  Suppose a 100-item test is delivered twice per year, with 20 items always remaining as anchors.  Eventually, students will share memories of those items, and their topics will become known.  More students will get them correct over time, making the items appear easier.

Identifying Item Parameter Drift

There are several methods of detecting IPD. Some are simpler because they do not require estimation of anchoring constants, and some are more difficult due to the need for that estimation. Simple methods include the “3-sigma p-value”, the “0.3 logits”, and the “3-sigma IRT” approaches. Complex methods involve the “3-sigma scaled IRT”, the “Mantel-Haenszel”, and the “area between item characteristic curves” approaches; the last two treat IPD as a special case of DIF, and therefore draw upon a massive body of existing research on DIF methodologies.
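As a simple illustration of the “0.3 logits” idea, here is a sketch that flags anchor items whose calibrated b parameter shifts by more than 0.3 logits between two occasions. The item IDs and values are invented, and a real analysis would place both calibrations on a common scale before comparing.

```python
# Calibrated b parameters for the same anchor items on two occasions (invented data).
anchor_items = {
    "ITEM_001": (-0.40, -0.35),
    "ITEM_002": ( 0.10, -0.55),   # appears much easier on the second occasion
    "ITEM_003": ( 1.20,  1.05),
}

THRESHOLD = 0.3  # logits

for item_id, (b_time1, b_time2) in anchor_items.items():
    drift = b_time2 - b_time1
    flag = "DRIFT" if abs(drift) > THRESHOLD else "ok"
    print(f"{item_id}: b1={b_time1:+.2f}, b2={b_time2:+.2f}, "
          f"drift={drift:+.2f} -> {flag}")
```

Flagged items would then be handled as described in the next section: reviewed or removed from the anchor set, with linking constants re-estimated and the check repeated.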

Handling Item Parameter Drift

Even though not all psychometricians think that removing outlying anchor items is the best solution for item parameter drift, if we do not eliminate drifting items from the process of equating test scores, they will affect the transformations of ability estimates, not only the item parameters. Imagine an examination that classifies examinees as failing or passing, or into four performance categories; in the case of IPD, 10-40% of students could be misclassified. In high-stakes testing situations where classification implies certain sanctions or rewards, IPD scenarios should be minimized as much as possible. As soon as some items are detected to exhibit IPD, they should be referred to subject-matter experts for further investigation. If a faster decision is needed, such flagged anchor items should be removed immediately. Afterwards, psychometricians need to re-estimate the linking constants and evaluate IPD again. This process should be repeated until none of the anchor items shows item parameter drift.

Item Fit Analysis

Item fit analysis is a type of model-data fit evaluation that is specific to the performance of test items. It is a very useful tool in interpreting and understanding test results, and in evaluating item performance. By implementing any psychometric model, we assume some sort of mathematical function is happening under the hood, and we should check that it is an appropriate function.  In classical test theory (CTT), if you use the point-biserial correlation, you are assuming a linear relationship between examinee ability and the probability of a correct answer.  If using item response theory (IRT), it is a logistic function.  You can evaluate the fit of these using both graphical (visual) and purely quantitative approaches.

Why do item fit analysis?

There are several reasons to do item fit analysis.

  1. As noted above, if you are assuming some sort of mathematical model, it behooves you to check on whether it is appropriate to even use.
  2. It can help you choose the model; perhaps you are using the 2PL IRT model and then notice a strong guessing factor (lower asymptote) when evaluating fit.
  3. Item fit analysis can help identify improper item keying.
  4. It can help find errors in the item calibration, which determines validity of item parameters.
  5. Item fit can be used to measure test dimensionality that affects validity of test results (Reise, 1990).  For example, if you are trying to run IRT on a single test that is actually two-dimensional, it will likely fit well on one dimension and the other dimension’s items have poor fit.
  6. Item fit analysis can be beneficial in detecting measurement disturbances, such as differential item functioning (DIF).

 

What is item fit?

Model-data fit, in general, refers to how far away our data is from the predicted values from the model.  As such, it is often evaluated with some sort of distance metric, such as a chi-square or a standardized version of it.  This easily translates into visual inspection as well.

Suppose we took a sample of examinees and divided it into 10 quantiles.  The first is the lowest 10%, then the 10th-20th percentile, and so on.  We graph the proportion in each group that gets the item correct.  The proportion will be higher for the more able students, but if the sample is small, the line might bounce around like the blue line below.  When we fit a model like the black line, we can find the total distance of the red lines, which gives us some quantification of how well the model is fitting.  In some cases, the blue line might be very close to the black line, and in others it will not be at all.

Of course, psychometricians turn those values into quantitative indices.  Some examples are a Chi-square and a z-Residual, but there are plenty of others.  The Chi-square will square the red values and sum them up.  The z-Residual takes that and adjusts for sample size then standardizes it onto the familiar z-metric.
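Here is a rough sketch of that chi-square idea with invented data: observed proportions correct in ten quantile groups are compared against a 2PL model prediction. The exact statistic differs by software, so treat this as illustrative only.

```python
import math

# Invented data: 10 quantile groups of 50 examinees each.
n_per_group = 50
theta_means = [-1.8, -1.4, -1.0, -0.6, -0.2, 0.2, 0.6, 1.0, 1.4, 1.8]
observed_p  = [0.12, 0.20, 0.22, 0.35, 0.42, 0.55, 0.68, 0.70, 0.85, 0.90]

a, b = 1.0, 0.0   # assumed 2PL item parameters for this item

def p_2pl(theta, a, b):
    """Model-predicted probability of a correct response under the 2PL."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

chi_square = 0.0
for theta, obs in zip(theta_means, observed_p):
    exp = p_2pl(theta, a, b)
    obs_correct, exp_correct = obs * n_per_group, exp * n_per_group
    obs_wrong,   exp_wrong   = n_per_group - obs_correct, n_per_group - exp_correct
    chi_square += (obs_correct - exp_correct) ** 2 / exp_correct
    chi_square += (obs_wrong - exp_wrong) ** 2 / exp_wrong

print(f"Chi-square across {len(theta_means)} groups: {chi_square:.2f}")
```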

Item fit with Item Response Theory

IRT was created in order to overcome most of the limitations that CTT has. Within the IRT framework, item and test-taker parameters are independent when test data fit the assumed model. Additionally, these two parameters can be located on one scale, so they are comparable with each other. The independency (invariance) property of IRT makes it possible to solve measurement problems that are almost impossible to solve within CTT, such as item banking, item bias, test equating, and computerized adaptive testing (Hambleton, Swaminathan, and Rogers, 1991).

There are three logistic models defined and widely used in IRT: one-parameter (1PL), two-parameter (2PL), and three-parameter (3PL). 1PL employs only one parameter, difficulty, to describe the item. 2PL uses two parameters, difficulty and discrimination. 3PL uses three—difficulty, discrimination, and guessing. A successful application of IRT means that test data fit the assumed IRT model. However, it may happen that even when a whole test fits the model, some of the items misfit it, i.e. do not function in the intended manner. Statistically it means that there is a difference between expected and observed frequencies of correct answers to the item at various ability levels.

There are many different reasons for item misfit. For instance, an easy item might not fit the model when low-ability test-takers do not attempt it at all; this usually happens in speeded tests, when there is no penalty for slow work. Another example is when low-ability test-takers answer difficult items correctly by guessing; this usually occurs in tests consisting purely of multiple-choice items. Yet another example is a test that is not unidimensional, in which case some items may misfit the model.

Examples

Here are two examples of evaluating item fit with item response theory, using the software  Xcalibre.  Here is an item with great fit.  The red line (observed) is very close to the black line (model).  The two fit statistics are Chi-Square and z-Residual.  The p-values for both are large, indicating that we are nowhere near rejecting the hypothesis of model fit.

[Image: item with good fit]

Now, consider the following item.  The red line is much more erratic.  The Chi-square rejects the model fit hypothesis with p=0.000.  The z-Residual, which corrects for sample size, does not reject, but its p-value is still smaller.  This item also has a very low a parameter, so it should probably be evaluated.

[Image: item with marginal fit]

Summary

To sum up, item fit analysis is key in item and test development. The relationship between item parameters and item fit identifies factors related to item fit, which is useful in predicting item performance. In addition, this relationship helps understand, analyze, and interpret test results, especially when a test has a significant number of misfitting items.

References

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). Sage.
Reise, S. P. (1990). A comparison of item-and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14(2), 127-137.

Distractor Analysis

Distractor analysis refers to the process of evaluating the performance of incorrect answers vs the correct answer for multiple choice items on a test.  It is a key step in the psychometric analysis process to evaluate item and test performance as part of documenting test reliability and validity.

What is a distractor?

An item distractor, also known as a foil or a trap, is an incorrect option for a selected-response item on an assessment. Multiple-choice questions always have a few options for an answer, one of which is the key/correct answer, and the remaining ones are distractors/wrong answers. It is worth noting that distractors should not be just any wrong answers, but have to be probable answers in case an examinee makes a mistake when looking for the right option. In short, distractors are feasible answers that an examinee might select when making misjudgments or having partial knowledge/understanding.  A great example is later in this article with the word “confectioner.”

[Image: parts of an item – stem, options, distractors]

What makes a good item distractor?

One word: plausibility.  We need the item distractor to attract examinees.  If it is so irrelevant that no one considers it, then it does not do any good to include it in the item.  Consider the following item.

 

   What is the capital of the United States of America?

 A. Los Angeles

 B. New York

 C. Washington, D.C.

 D. Mexico City

 

The last option is quite implausible – not only is it outside the USA, but it mentions another country in the name, so no student is likely to select this.  This then becomes a three-horse race, and students have a 1 in 3 chance of guessing.  This certainly makes the item easier. How much do distractors matter?  Well, how much is the difficulty affected by this new set?

 

   What is the capital of the United States of America?

 A. Paris

B. Rome

 C. Washington, D.C.

 D. Mexico City  

 

In addition, the distractor needs to have negative discrimination.  That is, while we want the correct answer to attract the more capable examinees, we want the distractors to attract the lower-ability examinees.  If you have a distractor that you thought was incorrect, and it turns out to attract all the top students, you need to take a long, hard look at that question! To calculate discrimination statistics on distractors, you will need software such as Iteman.

What makes a bad item distractor?

Obviously, implausibility and negative discrimination are frequent offenders.  But if you think more deeply about plausibility, the key is actually plausibility without being arguably correct.  This can be a fine line to walk, and is a common source of problems for items.  You might have a medical item that presents a scenario and asks for a likely diagnosis; perhaps one of the distractors is very unlikely so as to be essentially implausible, but it might actually be possible for a small subset of patients under certain conditions.  If the author and item reviewers did not catch this, the examinees probably will, and this will be evident in the statistics.  This is one of the reasons it is important to do psychometric analysis of test results, including distractor analysis to evaluate the effectiveness of incorrect options in multiple-choice questions.  In fact, accreditation standards often require you to go through this process at least once a year.

Why do we need a distractor analysis?

After a test form is delivered to examinees, distractor analysis should be implemented to make sure that all answer options work well and that the item is performing well and defensibly. For example, it is generally expected that around 40-95% of examinees pick the correct answer, and that each distractor is chosen by a smaller number of examinees than the number choosing the key, with an approximately equal distribution of choices across the distractors. Distractor analysis is usually done with classical test theory, even if item response theory is used for scoring, equating, and other tasks.

How to do a distractor analysis

There are three main aspects:

  1. Option frequencies/proportions
  2. Option point-biserial
  3. Quantile plot

The option frequencies/proportions simply refers to the analysis of how many examinees selected each answer.  Usually it is reported as a proportion and labeled as “P.”  Did 70% choose the correct answer while the remaining 30% were evenly distributed amongst the 3 distractors?  Great.  But if only 40% chose the correct answer and 45% chose one of the distractors, you might have a problem on your hands.  Perhaps the answer specified as the key was not actually correct.

The point-biserials (Rpbis) will help you evaluate if this is the case.  The point-biserial is an item-total correlation, meaning that we correlate scores on the item with the total score on the test, which is a proxy index of examinee ability.  If it is 0.0, there is no relationship, which means the item is not correlated with ability, and therefore probably not doing any good.  If negative, it means that lower-ability students are selecting it more often; if positive, it means that higher-ability students are selecting it more often.  We want the correct answer to have a positive value and the distractors to have negative values.  This is one of the most important points in determining if the item is performing well.

In addition, there is a third approach, which is visual: the quantile plot.  It is very useful for diagnosing how an item is working and how it might be improved.  This splits the sample up into blocks ordered by performance, such as 5 groups where Group 1 is the 0-20th percentile, Group 2 is the 21st-40th, etc.  We expect the smartest group to have a high proportion of examinees selecting the correct answer and a low proportion selecting the distractors, and vice versa.  You can see how this aligns with the concept of the point-biserial.  An example of this is below.

Note that the P and point-biserial for the correct answer serve as “the” statistics for the item as a whole.  The P for the item is called the item difficulty or facility statistic.
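Here is a minimal sketch of the first two aspects, option proportions and option point-biserials, using numpy. The responses and total scores are invented, and dedicated programs such as Iteman apply refinements this sketch omits (for example, removing the item from the total score before correlating).

```python
import numpy as np

# Invented data: each examinee's selected option for one item, plus total test score.
responses    = np.array(["B", "B", "A", "C", "B", "D", "B", "A", "B", "C"])
total_scores = np.array([ 28,  25,  14,  17,  30,  12,  27,  15,  26,  18])
key = "B"

for option in ["A", "B", "C", "D"]:
    selected = (responses == option).astype(float)
    p = selected.mean()                                  # option proportion
    # Point-biserial: correlation between selecting this option and total score.
    rpbis = np.corrcoef(selected, total_scores)[0, 1] if 0 < p < 1 else float("nan")
    label = "KEY" if option == key else "distractor"
    print(f"Option {option} ({label}): P={p:.2f}, Rpbis={rpbis:+.2f}")
```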

Examples of a distractor analysis

Here is an example of a good item.  The P is medium (67% correct) and the Rpbis is strongly positive for the correct answer while strongly negative for the incorrect answers.  This translates to a clean quantile plot where the curve for the correct answer (B) goes up while the curves for the incorrect answers go down.  An ideal situation.

 

[Image: quantile plot and classical statistics for a well-performing item]

 

Now contrast that with the following item.  Here, only 12% of examinees got this correct, and the Rpbis was negative.  Answer C had 21% and a nicely positive Rpbis, as well as a quantile curve that goes up.  This item should be reviewed to see if C is actually correct.  Or B, which had the most responses.  Most likely, this item will need a total rewrite!

 

[Image: quantile plot and table for a poorly performing item]

 

Note that an item can be extremely difficult but still perform well.  Here is an example where the distractor analysis supports continued use of the item.  The distractor is just extremely attractive to lower students; they think that a confectioner makes confetti, since those two words look the closest.  Look how strong the Rpbis is here, and very negative for that distractor.  This is a good result!

 

[Image: distractor analysis for the confectioner/confetti item]

 

Confidence Intervals for Test Scores

A confidence interval for test scores is a common way to interpret the results of a test by phrasing it as a range rather than a single number.  We all understand that tests provide imperfect measurements at a specific point in time, and actual performance can vary over different occasions.  The examinee might be sick or tired today and score lower than their true score on the test, or get lucky with some items on topics they have studied more closely, then score higher today than they normally might (or vice versa with tricky items).

Psychometricians recognize this and have developed the concept of the standard error of measurement, which is an index of this variation.  The calculation of the SEM differs between classical test theory and item response theory, but in either case, we can use it to make a confidence interval around the observed score. Because tests are imperfect measurements, some psychometricians recommend always reporting scores as a range rather than a single number.

A confidence interval is a very common concept from statistics in general (not psychometrics alone): a likely range for the true value of something being estimated.  We can take 1.96 times the standard error on each side of a point estimate to get a 95% confidence interval.  Start by calculating 1.96 times the SEM, then add it to and subtract it from the observed score to get a range.

Example of confidence interval with Classical Test Theory

With CTT, the confidence interval is placed on raw number-correct scores.  Suppose the reliability of a 100-item test is 0.90, with a mean of 85 and standard deviation of 5.  The SEM is then 5*sqrt(1-0.90) = 5*0.31 = 1.58.  If your score is a 67, then a 95% confidence interval is 63.90 to 70.10.  We are 95% sure that your true score lies in that range.
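The arithmetic is easy to reproduce; here is a short Python sketch using the same numbers.

```python
import math

reliability = 0.90
sd = 5.0
observed_score = 67

sem = sd * math.sqrt(1 - reliability)          # standard error of measurement
margin = 1.96 * sem                            # half-width of a 95% interval
lower, upper = observed_score - margin, observed_score + margin

print(f"SEM = {sem:.2f}")                      # about 1.58
print(f"95% CI: {lower:.2f} to {upper:.2f}")   # about 63.90 to 70.10
```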

Example of confidence interval with Item Response Theory

The same concept applies to item response theory (IRT).  But the scale of numbers is quite different, because the theta scale runs from approximately -3 to +3.  Also, the SEM is calculated directly from item parameters, in a complex way that is beyond the scope of this discussion.  But if your score is -1.0 and the SEM is 0.30, then the 95% confidence interval for your score is -1.588 to -0.412.  This confidence interval can be compared to a cutscore as an adaptive testing approach to pass/fail tests.

Example of confidence interval with a Scaled Score

This concept also works on scaled scores.  IQ is typically reported on a scale with a mean of 100 and standard deviation of 15.  Suppose the test had an SEM of 3.2, and your score was 112.  Then if we take 1.96*3.2 and plus or minus it on either side, we get a confidence interval of 105.73 to 118.27.

Composite Scores

A composite test score refers to a test score that is combined from the scores of multiple tests, that is, a test battery.  The purpose is to create a single number that succinctly summarizes examinee performance.  Of course, some information is lost by this, so the original scores are typically reported as well.

This is a case where multiple tests are delivered to each examinee, but an overall score is desired.  Note that this is different than the case of a single test with multiple domains; in that case, there is one latent dimension, while with a battery each test has a different dimension, though possibly highly correlated.  That is, we have four measurement situations:

  1. Single test, one domain
  2. Single test, multiple domains
  3. Multiple tests, but correlated or related
  4. Multiple tests, but unrelated latent dimensions

With regards to the composite test score, we are only considering #3.  A case of #4 where a composite score does not make sense is a Big 5 personality assessment.  There are five components (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism), but they are unrelated, and a sum of their scores would not quantify you as having a good or bad personality, or any other meaningful interpretation!

Example of a Composite Test Score

A common example of a composite test score situation is a university admissions exam.  There are often several component tests, such as Logical Reasoning, Mathematics, and English.  These are psychometrically distinct, but there is definitely a positive manifold amongst them.  The exam sponsor will probably report each separately, but also sum all three to a total score as a way to summarize student performance in a single number.

How do you calculate a Composite Test Score?

Here are four ways that you can calculate a composite test score.  They typically use a Scaled Score rather than a Raw Score.  (A small sketch of the first three approaches appears after the list.)

  1. Average – An example is the ACT assessment in the United States, for university admissions. There are four tests (English, Math, Science, Reading), each of which is reported on a scale of 0 to 36, but also the average of them is reported.  Here is a nice explanation.
  2. Sum – An example of this is the SAT, also a university admissions test in the United States.  See explanation at Khan Academy.
  3. Linear combination – You also have the option to combine like a sum, but with differential weighting. An example of this is the ASVAB, the test to enter the United States military. There are 12 tests, but the primary summary score is called AFQT and it is calculated by combining only 4 of the tests.
  4. Nonlinear transformation – There is also the possibility of any nonlinear transformation that you can think of, but this is rare.
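Here is a small sketch of the first three approaches applied to one hypothetical examinee; the scaled scores and weights are invented and not those of any real exam.

```python
# Hypothetical scaled scores on a three-test battery.
scores = {"Logical Reasoning": 24, "Mathematics": 30, "English": 27}

# 1. Average (ACT-style)
composite_average = sum(scores.values()) / len(scores)

# 2. Sum (SAT-style)
composite_sum = sum(scores.values())

# 3. Linear combination with differential weights (invented weights)
weights = {"Logical Reasoning": 0.25, "Mathematics": 0.50, "English": 0.25}
composite_weighted = sum(weights[test] * score for test, score in scores.items())

print(f"Average:  {composite_average:.1f}")
print(f"Sum:      {composite_sum}")
print(f"Weighted: {composite_weighted:.1f}")
```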

How to implement a Composite Test Score

You will need an online testing platform that supports the concept of a test battery, provides scaled scoring, and then also provides functionality for composite scores.  An example of this screen from our platform is below.  Click here to sign up for a free account.

Inter-Rater Reliability and Agreement

Inter-rater reliability and inter-rater agreement are important concepts in certain psychometric situations.  Many assessments never involve raters, but there are certainly plenty of assessments that do.  This article will define these two concepts and discuss two psychometric situations where they are important.  For a more detailed treatment, I recommend Tinsley and Weiss (1975), which is one of the first articles that I read in grad school.

Inter-Rater Reliability

Inter-rater reliability refers to the consistency between raters, which is slightly different than agreement.  Reliability can be quantified by a correlation coefficient.  In some cases this is the standard Pearson correlation, but in others it might be tetrachoric or intraclass (Shrout & Fleiss, 1979), especially if there are more than two raters.  If raters correlate highly, then they are consistent with each other and would have a high reliability estimate.

Inter-Rater Agreement

Inter-rater agreement looks at how often two raters give exactly the same result.  There are different ways to quantify this as well, as discussed below.  Perhaps the simplest, in the two-rater case, is to simply calculate the proportion of rows where the two provided the same rating.  If there are more than two raters in a case, you will need an index of dispersion amongst their ratings; the standard deviation and mean absolute difference are two examples.

Situation 1: Scoring Essays with Rubrics

If you have an assessment with open-response questions like essays, they need to be scored with a rubric to convert them to numeric scores.  In some cases, there is only one rater doing this.  You have all had essays graded by a single teacher within a classroom when you were a student.  But for larger scale or higher stakes exams, two raters are often used, to provide quality assurance on each other.  Moreover, this is often done in an aggregate scale; if you have 10,000 essays to mark, that is a lot for two raters, so instead of two raters rating 10,000 each you might have a team of 20 rating 1,000 each.  Regardless, each essay has two ratings, so that inter-rater reliability and agreement can be evaluated.  For any given rater, we can easily calculate the correlation of their 1,000 marks with the 1,000 marks from the other rater (even if the other rater rotates between the 19 remaining).  Similarly, we can calculate the proportion of times that they provided the same rating or were within 1 point of the other rater.

Situation 2: Modified-Angoff Standard Setting

Another common assessment situation is a modified-Angoff study, which is used to set a cutscore on an exam.  Typically, there are 6 to 15 raters who rate each item on its difficulty, on a scale of 0 to 100 in multiples of 5.  This makes for a more complex situation, since there are not only many more raters per instance (item) but also many more possible ratings.

To evaluate inter-rater reliability, I typically use the intra-class correlation coefficient, which is:

$$ICC = \frac{BMS - EMS}{BMS + \frac{JMS - EMS}{n}}$$

Where BMS is the between items mean-square, EMS is the error mean-square, JMS is the judges mean-square, and n is the number of items.  It is like the Pearson correlation used in a two-rater situation, but aggregated across the raters and improved.  There are other indices as well, as discussed on Wikipedia.
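If you want to compute this yourself, here is a sketch using numpy that derives the mean squares from an items-by-judges matrix and plugs them into the formula above (the average-measures form from Shrout & Fleiss); the rating data are invented.

```python
import numpy as np

# Invented ratings: rows are items, columns are judges (modified-Angoff percentages).
ratings = np.array([
    [80, 85, 75, 90],
    [50, 55, 45, 60],
    [65, 70, 60, 75],
    [85, 90, 80, 95],
    [70, 75, 65, 80],
], dtype=float)

n, k = ratings.shape                         # n items, k judges
grand = ratings.mean()
item_means = ratings.mean(axis=1)
judge_means = ratings.mean(axis=0)

BMS = k * ((item_means - grand) ** 2).sum() / (n - 1)     # between-items mean square
JMS = n * ((judge_means - grand) ** 2).sum() / (k - 1)    # between-judges mean square
resid = ratings - item_means[:, None] - judge_means[None, :] + grand
EMS = (resid ** 2).sum() / ((n - 1) * (k - 1))            # error mean square

icc = (BMS - EMS) / (BMS + (JMS - EMS) / n)
print(f"Intraclass correlation (average measures): {icc:.3f}")
```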

For inter-rater agreement, I often use the standard deviation (as a very gross index) or quantile “buckets.”  See the Angoff Analysis Tool for more information.

 

Examples of Inter-Rater Reliability vs. Agreement

Consider these three examples with a very simple set of data: two raters scoring five students on a rubric ranging from 0 to 5.

Reliability = 1, Agreement = 1

Student  Rater 1  Rater 2
1        0        0
2        1        1
3        2        2
4        3        3
5        4        4

Here, the two are always the same, so both reliability and agreement are 1.0.

Reliability = 1, Agreement = 0

Student  Rater 1  Rater 2
1        0        1
2        1        2
3        2        3
4        3        4
5        4        5

In this example, Rater 1 is always 1 point lower.  They never have the same rating, so agreement is 0.0, but they are completely consistent, so reliability is 1.0.

Reliability = -1, Agreement = 0.20 (because they intersect at the middle point)

Student  Rater 1  Rater 2
1        0        4
2        1        3
3        2        2
4        3        1
5        4        0

In this example, we have a perfect inverse relationship.  The correlation of the two is -1.0, while the agreement is 0.20 (they agree 20% of the time).
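These three examples can be reproduced with a few lines of numpy:

```python
import numpy as np

def reliability_and_agreement(rater1, rater2):
    """Pearson correlation (reliability) and proportion of exact matches (agreement)."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    reliability = np.corrcoef(r1, r2)[0, 1]
    agreement = np.mean(r1 == r2)
    return reliability, agreement

examples = {
    "Example 1": ([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]),
    "Example 2": ([0, 1, 2, 3, 4], [1, 2, 3, 4, 5]),
    "Example 3": ([0, 1, 2, 3, 4], [4, 3, 2, 1, 0]),
}

for name, (r1, r2) in examples.items():
    rel, agr = reliability_and_agreement(r1, r2)
    print(f"{name}: reliability = {rel:+.2f}, agreement = {agr:.2f}")
```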

Now consider Example 2 with the modified-Angoff situation, with an oversimplification of only two raters.

Item  Rater 1  Rater 2
1     80       90
2     50       60
3     65       75
4     85       95

This is like Example 2 above; one rater is always 10 points higher, so there is a reliability of 1.0 but agreement of 0.  Even though agreement is an abysmal 0, the psychometrician running this workshop would be happy with the results!  Of course, real Angoff workshops have more raters and many more items, so this is an overly simplistic example.

 

References

Tinsley, H.E.A., & Weiss, D.J. (1975).  Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22(4), 358-376.

Shrout, P.E., & Fleiss, J.L. (1979).  Intraclass correlations: Uses in assessing rater reliability.  Psychological Bulletin, 86(2), 420-428.