Posts on psychometrics: The Science of Assessment

Multidimensional Item Response Theory

Multidimensional item response theory (MIRT) has developed from its factor-analytic and unidimensional item response theory (IRT) roots. This development has led to an increased emphasis on precise modeling of the item-examinee interaction and a decreased emphasis on data reduction and simplification. MIRT represents a broad family of probabilistic models designed to portray an examinee’s likelihood of a correct response based on item parameters and multiple latent traits/dimensions. MIRT models define a multidimensional latent space in which individual differences on the targeted dimensions can be described.

Within the MIRT framework, items are treated as the fundamental units of test construction. Furthermore, items are considered multidimensional trials for obtaining valid and reliable information about an examinee’s location in a complex latent space. This philosophy extends the work from unidimensional IRT to provide a more comprehensive description of item parameters and of how the information from items combines to depict examinees’ characteristics. Therefore, items need to be crafted mindfully so that they are sufficiently sensitive to the targeted combinations of knowledge and skills, and then carefully selected to improve estimates of examinees’ characteristics in the multidimensional space.

Trigger for development of Multidimensional Item Response Theory

In modern psychometrics, IRT is employed to calibrate items belonging to individual scales, so that each scale is regarded as unidimensional. According to IRT models, an examinee’s response to an item depends solely on the item parameters and on a single examinee parameter, the latent trait θ. Unidimensional IRT models are advantageous in that they use quite simple mathematical forms, have a wide range of applications, and are somewhat robust to violations of their assumptions.

However, real interactions between examinees and items are very likely far more complex than these IRT models imply. Responding to a specific item may require examinees to apply multiple abilities and skills, especially in complex areas such as the natural sciences. Thus, even though unidimensional IRT models are highly useful under specific conditions, the field of psychometrics needed more sophisticated models to reflect the many forms of examinee-item interaction. For that reason, unidimensional IRT models were extended to multidimensional models capable of expressing situations in which examinees need multiple abilities and skills to respond to test items.

Categories of Multidimensional Item Response Theory models

There are two broad categories of MIRT models: compensatory and non-compensatory (partially compensatory).


  • Under the compensatory model, examinees’ abilities work in combination to increase the probability of a correct response to an item, i.e. higher ability on one trait/dimension compensates for lower ability on another. For instance, suppose an examinee must read a passage on a current event and answer a question about it. This item assesses two abilities: reading comprehension and knowledge of current events. If the examinee is familiar with the current event, that knowledge will compensate for lower reading ability. On the other hand, if the examinee is an excellent reader, their reading skills will compensate for a lack of knowledge about the event.
  • Under the non-compensatory model, abilities do not compensate for each other, i.e. an examinee needs a high level of ability on all traits/dimensions to have a high chance of responding to a test item correctly. For example, suppose an examinee must solve a traditional mathematical word problem. This item assesses two abilities: reading comprehension and mathematical computation. If the examinee has excellent reading ability but low mathematical computation ability, they will be able to read the text but not solve the problem. With the reverse profile, the examinee will not be able to solve the problem because they cannot understand what is being asked. (A small numeric illustration of both models follows this list.)
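To make the distinction concrete, here is a minimal Python sketch of a two-dimensional compensatory model and a partially compensatory model. The item parameter values are invented purely for illustration, not taken from any calibration.

```python
import math

def compensatory_2d(theta1, theta2, a1, a2, d):
    """Compensatory MIRT: abilities enter a weighted sum, so strength on one
    dimension can offset weakness on the other."""
    z = a1 * theta1 + a2 * theta2 + d
    return 1.0 / (1.0 + math.exp(-z))

def noncompensatory_2d(theta1, theta2, a1, a2, b1, b2):
    """Partially compensatory MIRT: the probability is a product of per-dimension
    terms, so a deficit on either dimension keeps the probability low."""
    p1 = 1.0 / (1.0 + math.exp(-a1 * (theta1 - b1)))
    p2 = 1.0 / (1.0 + math.exp(-a2 * (theta2 - b2)))
    return p1 * p2

# An examinee who is strong on dimension 1 but weak on dimension 2
theta1, theta2 = 2.0, -2.0
print(round(compensatory_2d(theta1, theta2, a1=1.0, a2=1.0, d=0.0), 2))    # 0.5: strength offsets weakness
print(round(noncompensatory_2d(theta1, theta2, 1.0, 1.0, 0.0, 0.0), 2))    # ~0.1: the weak dimension dominates
```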

Within the literature, compensatory MIRT models are more commonly used.

Applications of Multidimensional Item Response Theory

  • Since MIRT analyses concentrate on the interaction between item parameters and examinee characteristics, they have stimulated numerous studies of the skills and abilities needed to answer an item correctly, and of the dimensions to which test items are sensitive. This research area demonstrates the importance of a thorough understanding of how tests function. MIRT analyses can help verify group differences and item sensitivities that contribute to test and item bias, and help identify the reasons behind differential item functioning (DIF) statistics.
  • MIRT allows linking of calibrations, i.e. putting item parameter estimates from multiple calibrations into the same multidimensional coordinate system. This enables reporting examinee performance on different sets of items as profiles on multiple dimensions located on the same scales. Thus, MIRT makes it possible to create large pools of calibrated items that can be used for the construction of multidimensionally parallel test forms and computerized adaptive testing (CAT).

Conclusion

Given the complexity of the constructs in education and psychology and the level of detail provided in test specifications, MIRT is particularly relevant for investigating how individuals approach their learning and, subsequently, how learning is influenced by various factors. MIRT analysis is still at an early stage of its development and hence is a very active area of current research, particularly with respect to CAT technologies. Interested readers are referred to Reckase (2009) for more detailed information about MIRT.

References

Reckase, M. D. (2009). Multidimensional Item Response Theory. Springer.

The IRT Item Pseudo-Guessing Parameter

The item pseudo-guessing parameter c is one of the three item parameters estimated under item response theory (IRT), alongside discrimination a and difficulty b. It is the parameter used only in the 3PL model, and it represents a lower asymptote for the probability of an examinee responding correctly to an item.

Background of IRT item pseudo-guessing parameter 

If you look at the post on the IRT 2PL model, you will see that the probability of a response depends on the examinee ability level θ, the item discrimination parameter a, and the item difficulty parameter b. However, one of the realities of testing is that examinees will get some multiple-choice items correct by guessing. Therefore, the probability of a correct response includes a small component that is due to guessing.

Neither the 1PL nor the 2PL model accounts for guessing, but Birnbaum (1968) altered the 2PL model to include it. Unfortunately, because of this change the function lost the nice mathematical properties of the logistic form used in the 2PL model. Nevertheless, even though it is technically no longer a logistic model, it has become known as the three-parameter logistic model (3PL or IRT 3PL). Baker (2001) gives the following equation for the IRT 3PL model:

P(θ) = c + (1 − c) · e^(a(θ − b)) / (1 + e^(a(θ − b)))

where:

a is the item discrimination parameter

b is the item difficulty parameter

c is the item pseudo-guessing parameter

θ is the examinee ability parameter
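To see the formula in action, here is a minimal Python sketch of the 3PL response function exactly as written above; the parameter values are arbitrary examples.

```python
import math

def p_3pl(theta, a, b, c):
    """IRT 3PL: probability of a correct response, with lower asymptote c."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Example item: a = 1.2, b = 0.0, c = 0.20
for theta in (-3, 0, 3):
    print(theta, round(p_3pl(theta, a=1.2, b=0.0, c=0.20), 3))
# At very low theta the probability approaches c = 0.20, at theta = b it equals (1 + c)/2 = 0.60,
# and at high theta it approaches 1.
```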

Interpretation of pseudo-guessing parameter

In general, the pseudo-guessing parameter c is the probability of getting the item correct by guessing alone. For instance, c = 0.20 means that at all ability levels, the probability of getting the item correct by guessing alone is 0.20. This often reflects the structure of multiple-choice items: 5-option items tend to have values around 0.20 and 4-option items around 0.25.

It is worth noting that the value of c does not vary as a function of the trait/ability level θ, i.e. examinees with high and low ability levels have the same probability of responding correctly by guessing. Theoretically, the guessing parameter ranges between 0 and 1, but in practice values above 0.35 are considered unacceptable, hence the range 0 < c < 0.35 is applied. A value higher than 1/k, where k is the number of options, often indicates that a distractor is not performing.

How pseudo-guessing parameter affects other parameters

Due to the presence of the guessing parameter, the definition of the item difficulty parameter b changes. Within the 1PL and 2PL models, b is the point on the ability scale at which the probability of a correct response is 0.5. Under the 3PL model, the lower limit of the item characteristic curve (ICC) or item response function (IRF) is the value of c rather than zero. According to Baker (2001), the item difficulty parameter is the point on the ability scale where:

P(θ = b) = c + (1 − c)/2 = (1 + c)/2

Therefore, the probability is halfway between the value of c and 1. Thus, the parameter c defines a lower boundary for the probability of a correct response, and the item difficulty parameter b determines the point on the ability scale where the probability of a correct response is halfway between this boundary and 1.

The item discrimination parameter a can still be interpreted as being proportional to the slope of the ICC/IRF at the point θ = b. However, under the 3PL model, the slope of the ICC/IRF at θ = b equals a(1 − c)/4. These changes in the definitions of the item parameters a and b are quite important when interpreting test analyses.
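As a quick numerical check on that slope result, the sketch below (with invented parameter values) approximates the derivative of the 3PL curve at θ = b and compares it with a(1 − c)/4.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

a, b, c = 1.5, 0.5, 0.25
h = 1e-5
# Central-difference approximation of the slope of the ICC at theta = b
slope_numeric = (p_3pl(b + h, a, b, c) - p_3pl(b - h, a, b, c)) / (2 * h)
slope_formula = a * (1 - c) / 4
print(f"numeric: {slope_numeric:.3f}  formula: {slope_formula:.3f}")  # both print as 0.281
```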

References

Baker, F. B. (2001). The basics of item response theory.

Birnbaum, A. L. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Addison-Wesley.

The IRT Item Discrimination Parameter

The item discrimination parameter a is an index of item performance within the paradigm of item response theory (IRT). There are three item parameters estimated with IRT: the discrimination a, the difficulty b, and the pseudo-guessing parameter c. The discrimination parameter a is used in two of these models, the 2PL and the 3PL.

Definition of IRT item discrimination

[Figure: item response function with b = −2.2]

Generally speaking, the item discrimination parameter is a measure of the differential capability of an item. Analytically, the item discrimination parameter a is the slope of the item response function: the steeper the slope, the stronger the relationship between the ability θ and a correct response, and the better a correct response discriminates among individual examinees along the ability continuum. A high item discrimination value suggests that the item has a strong ability to differentiate examinees. In practice, a high discrimination value means that the probability of a correct response increases more rapidly as the ability θ (latent trait) increases.

In a broad sense, the item discrimination parameter a refers to the degree to which a score varies with the examinee ability level θ, as well as the effectiveness of this score in differentiating between examinees with a high ability level and examinees with a low ability level. This property is directly related to the quality of the score as a measure of the latent trait/ability, so it is of central practical importance, particularly for the purpose of item selection.

Application of IRT item discrimination

Theoretically, the IRT item discrimination parameter can range from –∞ to +∞, but in practice its value rarely exceeds 2.0; thus, the working range in practical use is roughly 0.0 to 2.0. Some software forces the values to be positive, and will drop items that do not meet this requirement. The item discrimination parameter varies between items; therefore, item response functions of different items can intersect and have different slopes. The steeper the slope, the higher the item discrimination, and the better the item is able to detect subtle differences in the ability of the examinees.
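To illustrate what a steeper slope buys you, the sketch below (arbitrary parameter values) compares a low-discrimination and a high-discrimination 2PL item for two examinees located just below and just above the item difficulty.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

b = 0.0
for a in (0.5, 2.0):                     # low vs. high discrimination
    p_lo = p_2pl(-0.5, a, b)             # examinee just below the item difficulty
    p_hi = p_2pl(+0.5, a, b)             # examinee just above the item difficulty
    print(f"a={a}: P(-0.5)={p_lo:.2f}, P(+0.5)={p_hi:.2f}, difference={p_hi - p_lo:.2f}")
# a=0.5 separates these two examinees by about 0.12 in probability; a=2.0 separates them by about 0.46.
```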

The ultimate purpose of designing a reliable and valid measure is to be able to map examinees along the continuum of the latent trait. One way to do so is to include in a test items with high discrimination, which add to the precision of the measurement tool and lessen the burden of answering long questionnaires.

However, test developers should be cautious if an item has a negative discrimination because the probability of endorsing a correct response should not decrease as the examinee’s ability increases. Hence, a careful revision of such items should be carried out. In this case, subject matter experts with support from psychometricians would discuss these flagged items and decide what to do next so that they would not worsen the quality of the test.

Sophisticated software provides a more accurate evaluation of item discrimination power because it takes into account the responses of all examinees, rather than just the high- and low-scoring groups as with the item discrimination indices used in classical test theory (CTT). For instance, you could use our software FastTest, which has been designed to support best testing practices and advanced psychometrics such as IRT and computerized adaptive testing (CAT).

Detecting items with higher or lower discrimination

Now let’s do some practice. Look at the five IRFs below and check whether you are able to compare the items in terms of their discrimination capability.

[Figure: five item response functions]

Q1: Which item has the highest discrimination?

A1: Red, with the steepest slope.

Q2: Which item has the lowest discrimination?

A2: Green, with the shallowest slope.

 

Automated Item Generation

Automated item generation (AIG) is a paradigm for developing assessment items (test questions), utilizing principles of artificial intelligence and automation. As the name suggests, it tries to automate some or all of the effort involved with item authoring, as that is one of the most time-intensive aspects of assessment development – which is no news to anyone who has authored test questions!

What is Automated Item Generation?

Automated item generation involves the use of computer algorithms to create new test questions, or variations of them.  It can also be used for item review, or the generation of answers, or the generation of assets such as reading passages.  Items still need to be reviewed and edited by humans, but this still saves a massive amount of time in test development.

Why Use Automated Item Generation?

Items can cost up to $2000 to develop, so even cutting the average cost in half could provide massive time/money savings to an organization.  ASC provides AIG functionality, with no limits, to anyone who signs up for a free item banking account in our platform  Assess.ai.

Types of Automated Item Generation

There are two types of automated item generation.  The Item Templates approach was developed before large language models (LLMs) were widely available.  The second approach is to use LLMs, which became widely available at the end of 2022.

Type 1: Item Templates

The first type is based on the concept of item templates to create a family of items using dynamic, insertable variables. There are three stages to this work. For more detail, read this article by Gierl, Lai, and Turner (2012).

  • Authors, or a team, create a cognitive model by isolating exactly what they are trying to assess and the different ways that the knowledge could be presented or evidenced. This might include information such as which variables are important vs. incidental, and what a correct answer should include.
  • They then develop templates for items based on this model, like the example you see below.
  • An algorithm then turns this template into a family of related items, often by producing all possible permutations (a minimal sketch of this step follows this list).
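Here is a minimal, hypothetical Python sketch of that last step: a template with insertable variables is expanded into all permutations. The template text and variable lists are invented for illustration only.

```python
from itertools import product

# Hypothetical item template with two insertable variables
template = ("A patient presents with {symptom} after {duration} of exertion. "
            "What is the most appropriate first step?")
variables = {
    "symptom": ["chest pain", "shortness of breath", "dizziness"],
    "duration": ["5 minutes", "30 minutes"],
}

# Generate the item family: every combination of the variable values
names = list(variables)
items = [template.format(**dict(zip(names, combo)))
         for combo in product(*(variables[n] for n in names))]

for stem in items:
    print(stem)   # 3 x 2 = 6 candidate stems, each still needing SME review
```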

Obviously, you can’t use more than one of these on a given test form. And in some cases, some of the permutations will be an unlikely scenario or possibly completely irrelevant. But the savings can still be quite real. I saw a conference presentation by Andre de Champlain from the Medical Council of Canada, stating that overall efficiency improved by 6x and that the generated items were higher quality than traditionally written items, because the process made the authors think more deeply about what they were assessing and how. He also recommended that template permutations not be automatically moved to the item bank, but instead that each be reviewed by SMEs, for reasons such as those stated above.

You might think “Hey, that’s not really AI…” – but AI is simply doing things that have historically been done by humans, and the definition gets pushed further every year. Remember, AI used to be just having the Atari be able to play Pong with you!

[Figure: example AIG item template (CPR)]

Type 2: AI Generation or Processing of Source Text

The second type is what the phrase “automated item generation” more likely brings to mind: upload a textbook or similar source to some software, and it spits back drafts of test questions. For example, see this article by von Davier (2019). Or alternatively, simply state a topic as a prompt and the AI will generate test questions.

Until the release of ChatGPT and other publicly available AI platforms implementing large language models (LLMs), this approach was only available to experts at large organizations.  Now, it is available to everyone with an internet connection.  If you use such products directly, you can provide a prompt such as “Write me 10 exam questions on Glaucoma, in a 4-option multiple choice format” and it will do so.  You can also make the instructions more specific, such as asking for the output in your preferred format, such as QTI or JSON.
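If you work with an LLM API directly, the call might look something like the sketch below. It assumes the openai Python package (v1-style client) and an example model name; adapt it to whichever provider and model you actually use, and treat the output strictly as draft items for human review.

```python
from openai import OpenAI   # assumes the openai package (v1-style client) and an OPENAI_API_KEY in the environment

client = OpenAI()

prompt = (
    "Write 10 exam questions on glaucoma in a 4-option multiple-choice format. "
    "Return the items as JSON with fields: stem, options, key, rationale."
)

response = client.chat.completions.create(
    model="gpt-4o",          # example model name; substitute whatever your provider offers
    messages=[{"role": "user", "content": prompt}],
)

draft_items = response.choices[0].message.content
print(draft_items)           # raw drafts only; import into the item bank and route for human review
```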

Alternatively, many assessment platforms now integrate with these products directly, so you can do the same thing, but have the items appear for you in the item banker under New status, rather than have them go to a raw file on your local computer that you then have to clean and upload.  FastTest  has such functionality available.

This technology has completely revolutionized how we develop test questions.  I’ve seen several research presentations on this, and they all find that AIG produces more items, of quality that is as good or even better than humans, in a fraction of the time!  But, they have also found that prompt engineering is critical, and even one word – like including “concise” in your prompt – can affect the quality of the items.

[Figure: FastTest automated item generation screen]

The Limitations of Automated Item Generation

Automated item generation (AIG) has revolutionized the way educational and psychological assessments are developed, offering increased efficiency and consistency. However, this technology comes with several limitations that can impact the quality and effectiveness of the items produced.

One significant limitation is the challenge of ensuring content validity. AIG relies heavily on algorithms and pre-defined templates, which may not capture the nuanced and comprehensive understanding of subject matter that human experts possess. This can result in items that are either too simplistic or fail to fully address the depth and breadth of the content domain.

Another limitation is the potential for over-reliance on statistical properties rather than pedagogical soundness. While AIG can generate items that meet certain psychometric criteria, such as difficulty and discrimination indices, these items may not always align with best practices in educational assessment or instructional design. This can lead to tests that are technically robust but lack relevance or meaningfulness to the learners.

Furthermore, the use of AIG can inadvertently introduce bias. Algorithms used in item generation are based on historical data and patterns, which may reflect existing biases in the data. Without careful oversight and adjustment, AIG can perpetuate or even exacerbate these biases, leading to unfair assessment outcomes for certain groups of test-takers.

Lastly, there is the issue of limited creativity and innovation. Automated systems generate items based on existing templates and rules, which can result in a lack of variety and originality in the items produced. This can make assessments predictable and less engaging for test-takers, potentially impacting their motivation and performance.

In conclusion, while automated item generation offers many benefits, it is crucial to address these limitations through continuous oversight, integration of expert input, and regular validation studies to ensure the development of high-quality assessment items.

How Can I Implement Automated Item Generation?

If you are a user of AI products like ChatGPT or Bard, you can work directly with them.  Advanced users can implement APIs to upload documents or fine-tune the machine learning models.  The aforementioned article by von Davier talks about such usage.

If you want to save time, FastTest provides a direct ChatGPT integration, so you can provide the prompt using the screen shown above, and items will then be automatically created in the item banking folder you specify, with the item naming convention you specify, tagged as Status=New and ready for review.  Items can then be routed through our configurable Item Review Workflow process, including functionality to gather modified-Angoff ratings.


The IRT Item Difficulty Parameter

The item difficulty parameter from item response theory (IRT) is both a shape parameter of the item response function (IRF) and an important way to evaluate the performance of an item in a test.

Item Parameters and Models in IRT

There are three item parameters estimated under dichotomous IRT: the item difficulty (b), the item discrimination (a), and the pseudo-guessing parameter (c).  IRT is actually a family of models, the most common of which are the dichotomous 1-parameter, 2-parameter, and 3-parameter logistic models (1PL, 2PL, and 3PL). The key parameter that is utilized in all three IRT models is the item difficulty parameter, b.  The 3PL uses all three, the 2PL uses a and b, and the 1PL/Rasch uses only b.

Interpreting the IRT item difficulty parameter

The b parameter is an index of how difficult the item is: the construct level at which we would expect examinees to have a probability of 0.50 (assuming no guessing) of getting the keyed item response. It is worth remembering that in IRT we model the probability of a correct response on a given item, P(X), as a function of examinee ability (θ) and certain properties of the item itself. This function is called the item response function (IRF) or item characteristic curve (ICC), and it is the basic feature of IRT, since all the other constructs depend on this curve.

The IRF plots the probability that an examinee will respond correctly to an item as a function of the latent trait θ. The probability of a correct response results from the interaction between the examinee’s ability θ and the item difficulty parameter b. As θ increases, the probability that the examinee will provide a correct response to the item rises. The b parameter is a location index that indicates the position of the item function on the ability scale, showing how difficult or easy a specific item is. The higher the b parameter, the higher the ability required for an examinee to have a 50% chance of getting the item correct. Difficult items are located to the right, or higher end, of the ability scale, while easier items are located to the left, or lower end. Typical values of the item difficulty range from −3 to +3: items whose b values are near −3 are very easy, whilst items with values near +3 are very difficult for the examinees.

You can interpret the b parameters as a sort of “z-score for the item.”  If the value is -1.0, that means it is appropriate for examinees at a score of -1.0 (15th percentile).
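To see where that percentile figure comes from, the sketch below converts θ values to percentiles under the common assumption that θ is scaled like a standard normal z-score (it uses scipy for the normal CDF).

```python
from scipy.stats import norm

# Percent of a standard normal population falling below each theta value
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  ->  percentile {norm.cdf(theta) * 100:.0f}")
# theta = -1.0 lands at roughly the 16th percentile, matching the rule of thumb quoted above.
```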

The interpretation of the b parameter is the opposite of the item difficulty statistic (p-value) in classical test theory (CTT). With b, a low value indicates an easy item and a high value indicates a difficult item; obviously, higher b values require higher θ for a correct response. With the CTT p-value, a low value is hard and a high value is easy; for this reason it is sometimes called item facility.

Examples of the IRT item difficulty parameter

Let’s consider an example. There are three IRFs below for three different items: D, E, and F. All three items have the same level of discrimination but different item difficulty values on the ability scale. In the 1PL, it is assumed that the only item characteristic that influences examinee performance is the item difficulty (the b parameter) and that all items are equally discriminating. The b-values for items D, E, and F are −0.5, 0.0, and 1.0, respectively. Item D is quite an easy item. Item E represents an item of medium difficulty, such that the probability of a correct response is low at the lowest ability levels and near 1 at the highest ability levels. Item F is a hard item: the probability of a correct response is low along most of the ability scale and increases only at the higher ability levels.

[Figure: item response functions for items D, E, and F]

Look at the five IRFs below and check whether you are able to compare the items in terms of their difficulty. Below are some specific questions and answers for comparing the items.

[Figure: five item response functions]

  • Which item is the hardest, requiring the highest ability level, on average, to get it correct?

Blue (No 5), as it is the furthest to the right.

  • Which item is the easiest?

Dark blue (No 1), as it is the furthest to the left.

How do I calculate the IRT item difficulty?

You’ll need special software like Xcalibre, which is available as a free download.

The One Parameter Logistic Model

The One Parameter Logistic Model (OPLM or 1PL or IRT 1PL) is one of the three main dichotomous models in the Item Response Theory (IRT) framework. The OPLM combines mathematical properties of the Rasch model with the flexibility of the Two Parameter Logistic Model (2PL or IRT 2PL). In the OPLM, difficulty parameters, b, are estimated and discrimination indices, a, are imputed as known constants.

Background behind the One Parameter Logistic Model

IRT employs mathematical models assuming that the probability that an examinee will answer a question correctly depends on their ability and on item characteristics. An examinee’s ability is considered the major individual characteristic and is denoted as θ (“theta”); it is also called the ability parameter. The ability parameter is conceived as an underlying, unobservable latent construct or trait that helps an individual to answer a question correctly.

These mathematical models include item characteristics, also known as the item parameters: discrimination (a), difficulty (b), and pseudo-guessing (c). According to the IRT paradigm, all item parameters are considered to be invariant or “person-free”, i.e. they do not depend on examinees’ abilities. In addition, ability estimates are also invariant or “item-free”, since they do not depend on the particular set of items. This mutual independence forms the basis of IRT models and provides objectivity in measurement.

The OPLM is built off only one parameter, difficulty. Item difficulty simply means how hard an item is (how high does the latent trait ability level need to be in order to have a 50% chance of getting the item right?). b is estimated for each item of the test. The item response function for the 1PL model looks like this:

P(θ) = e^(θ − b) / (1 + e^(θ − b))

where P is the probability that a randomly selected examinee with ability θ will answer a specific item correctly; b is the item difficulty parameter; and e is a mathematical constant approximately equal to 2.71828, also known as Euler’s number.
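Here is a minimal Python sketch of the 1PL response function written above, evaluated for a single item with an example difficulty of b = 1.0.

```python
import math

def p_1pl(theta, b):
    """1PL / Rasch-type response function: only the item difficulty b matters."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

b = 1.0
for theta in (-3, -1, 1, 3):
    print(theta, round(p_1pl(theta, b), 3))
# The probability is near 0 at low ability, exactly 0.5 at theta = b, and approaches 1 at high ability.
```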

Assumptions of the OPLM

The OPLM is based on two basic assumptions: unidimensionality and local independence.

  • The unidimensionality assumption is the most common, but also the most complex and restrictive, assumption for all IRT models, and sometimes it cannot be met. It states that only one ability is measured by the set of items in a single test. Thus, it assumes that a single dominant factor underlies all item responses. For example, in a Math test examinees need to possess strong mathematical abilities to answer test questions correctly. However, if test items also measure another ability, such as verbal ability, the test is no longer unidimensional. Unidimensionality can be assessed by various methods, but the most popular is the factor analysis approach, which is available in the free software MicroFACT.
  • Local independence assumes that, for a given ability level, an examinee’s responses to the items are statistically independent, i.e. the probability that an examinee will answer a test question correctly does not depend on their answers to other questions. In other words, the only factor influencing the examinee’s responses is the ability.

 

Item characteristic curve

The S-shaped curve describing the relationship between the probability of an examinee’s correct response to a test question and their ability θ is called the item characteristic curve (ICC) or item response function (IRF). In a test, each item has its own ICC/IRF.

Typical ICC for the One Parameter Logistic Model looks like this:

[Figure: item characteristic curve with b = 1.0]

The S-shaped curve shows that the probability of a correct response is near zero at the lowest levels of examinee ability and approaches 1 at the highest levels. The curve rises rapidly as we move from left to right and is strictly monotonic.

The OPLM function ranges between 0 and 1; the ICC can approach but never reach or exceed 1. Theoretically, the item difficulty parameter ranges from −∞ to +∞, but in practice it is typically limited to between −3 and +3. You can easily plot the ICC using the IRT calibration software Xcalibre.

Application of the OPLM in test development

The OPLM is especially useful in item selection, item banking, item analysis, test equating, and investigating item bias or differential item functioning (DIF). Since the IRT One Parameter Logistic Model provides item parameter estimates that are “examinee-free”, item parameters can be estimated during piloting and used later. Based on the information about items and examinees collected during testing, it is easy to build item banks that can ultimately be used for large-scale testing programs and computerized adaptive testing (CAT).

The Ebel Method for Multiple-Choice Questions

The Ebel method of standard setting is a psychometric approach to establish a cutscore for tests consisting of multiple-choice questions. It is usually used for high-stakes examinations in the fields of higher education, medical and health professions, and for selecting applicants.

How is the Ebel method performed?

The Ebel method requires a panel of judges who first categorize each item on the test by two criteria: level of difficulty and relevance (importance). The panel then agrees upon an expected percentage of items that should be answered correctly for each group of items according to that categorization.

It is crucial that the judges are experts in the field being examined; otherwise, their judgement will not be valid and reliable. Prior to the item rating process, the panelists should be given a sufficient amount of information about the purpose and procedures of the Ebel method. In particular, it is important that the judges understand the meaning of difficulty and relevance in the context of the current assessment.

The next stage is to determine what “minimally competent” performance means in the specific case, depending on the content. When everything is clear and all definitions are agreed upon, the experts classify each item by difficulty (easy, medium, or hard) and relevance (minimal, acceptable, important, or essential). In order to minimize the influence of the judges’ opinions on each other, it is recommended to use individual ratings rather than consensus ratings.

Afterwards, judgements on the proportion of items expected to be answered correctly by minimally competent candidates are collected for each item category, e.g. “easy and essential”. To save time in the rating process, the grid proposed by Ebel and Frisbie (1972) may be used. It is worth mentioning, though, that Ebel ratings are content-specific, so the values in the grid may turn out to be too low or too high for a given test.

[Figure: Ebel method rating grid]

At the end, the Ebel method, like the modified-Angoff method, identifies a cut-off score for an examination based on the performance of candidates in relation to a defined standard (absolute), rather than how they perform in relation to their peers (relative). Ebel scores for each item and for the whole exam are calculated as the average of the scores provided by each expert: the number of items in each category is multiplied by the expected percentage of correct answers, and the total results are added to calculate the cutscore.
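As a hypothetical worked example of that calculation, the sketch below computes an Ebel cutscore from invented category counts and expected-percentage ratings; in practice these values would come from the judging panel.

```python
# Hypothetical Ebel ratings: (difficulty, relevance) -> (number of items, expected proportion
# correct for a minimally competent candidate), already averaged across judges.
ratings = {
    ("easy",   "essential"): (10, 0.90),
    ("easy",   "important"): ( 8, 0.80),
    ("medium", "essential"): (12, 0.70),
    ("medium", "important"): (10, 0.60),
    ("hard",   "essential"): ( 6, 0.50),
    ("hard",   "important"): ( 4, 0.40),
}

expected_correct = sum(n_items * pct for n_items, pct in ratings.values())
total_items = sum(n_items for n_items, _ in ratings.values())

print(f"Cutscore: {expected_correct:.1f} out of {total_items} items "
      f"({expected_correct / total_items:.1%})")
```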

Pros of using Ebel

  • This method provides an overview of test difficulty
  • Cut-off score is identified prior to an examination
  • It is relatively easy for experts to perform

 

Cons of using Ebel

  • This method is time-consuming and costly
  • Evaluation grid is hard to get right
  • Digital software is required
  • Back-up is necessary

 

Conclusion

The Ebel method is a rather complex standard-setting process compared to others because it requires an analysis of the content, and it therefore imposes a burden on the standard-setting panel. However, Ebel considers both the relevance of the test items and the expected proportion of correct answers from minimally competent candidates, including borderline candidates. Thus, even though the procedure is complicated, the results are very stable and very close to the actual cut-off scores.

References

Ebel, R. L., & Frisbie, D. A. (1972). Essentials of educational measurement.

Item Parameter Drift

Item parameter drift (IPD) refers to the phenomenon in which the parameter values of a given test item change over multiple testing occasions within the item response theory (IRT) framework. This phenomenon is often relevant to student progress monitoring assessments, where a set of items is used several times in one year, or across years, to track student growth. Observing trends in student academic achievement depends upon stable linking (anchoring) between assessment occasions over time; if the item parameters are not stable, the scale is not stable, and time-to-time comparisons are not either. Some psychometricians consider IPD a special case of differential item functioning (DIF), but the two are different issues and should not be confused with each other.

Reasons for Item Parameter Drift

IRT modeling is attractive to the assessment field because of its property of item parameter invariance: parameters are assumed not to depend on the particular sample of test-takers used for estimation. That assumption enables important applications such as strong equating of tests across time and computerized adaptive testing. However, item parameters are not always invariant. There are many possible causes of IPD. One possibility is curricular change driven by assessment results, or more focused instruction. Other feasible reasons are item exposure, cheating, or curricular misalignment with certain standards. No matter what has led to IPD, its presence can cause biased estimates of student ability. In particular, IPD can be highly detrimental to reliability and validity in the case of high-stakes examinations. Therefore, it is crucial to detect item parameter drift when anchoring assessment occasions over time, especially when the same anchor items are used repeatedly.

Perhaps the simplest example is item exposure.  Suppose a 100-item test is delivered twice per year, with 20 items always remaining as anchors.  Eventually students will share memories and the topics of those will become known.  More students will get them correct over time, making the items appear easier.

Identifying Item Parameter Drift

There are several methods for detecting IPD. Some are simpler because they do not require estimation of anchoring constants, and some are more involved because they require that estimation. Simple methods include the “3-sigma p-value”, the “0.3 logits”, and the “3-sigma IRT” approaches. More complex methods include the “3-sigma scaled IRT”, the “Mantel-Haenszel”, and the “area between item characteristic curves” approaches; the last two treat IPD as a special case of DIF and can therefore draw upon the massive body of existing research on DIF methodology.
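As a minimal illustration of the simplest kind of screen, the “0.3 logits” rule mentioned above, the sketch below compares anchor-item difficulty estimates from two administrations (invented values, assumed to already be on a common scale) and flags items whose b estimates shift by more than 0.3 logits.

```python
# Hypothetical anchor-item difficulty (b) estimates from two administrations,
# assumed to already be placed on a common scale.
b_time1 = {"A01": -0.50, "A02": 0.20, "A03": 1.10, "A04": -1.30}
b_time2 = {"A01": -0.55, "A02": 0.65, "A03": 1.05, "A04": -1.80}

THRESHOLD = 0.30   # the "0.3 logits" screening rule

for item in b_time1:
    drift = b_time2[item] - b_time1[item]
    flag = "DRIFT" if abs(drift) > THRESHOLD else "ok"
    print(f"{item}: b shifted by {drift:+.2f} logits -> {flag}")
# A02 and A04 would be referred to subject-matter experts or dropped from the anchor set.
```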

Handling Item Parameter Drift

Even though not all psychometricians think that removal of outlying anchor items is the best solution for item parameter drift, if we do not eliminate drifting items from the process of equating test scores, they will affect the transformations of ability estimates, not just the item parameters. Imagine an examination that classifies examinees as failing or passing, or into four performance categories; in the presence of IPD, 10-40% of students could be misclassified. In high-stakes testing situations, where classification of examinees implies certain sanctions or rewards, IPD should be minimized as much as possible. As soon as some items are found to exhibit IPD, they should be referred to subject-matter experts for further investigation. Otherwise, if a faster decision is needed, such flagged anchor items should be removed immediately. Afterwards, psychometricians need to re-estimate the linking constants and evaluate IPD again. This process should be repeated until none of the anchor items shows item parameter drift.

Item Fit Analysis

Item fit analysis is a type of model-data fit evaluation that is specific to the performance of test items. It is a very useful tool in interpreting and understanding test results, and in evaluating item performance. By implementing any psychometric model, we assume some sort of mathematical function is happening under the hood, and we should check that it is an appropriate function.  In classical test theory (CTT), if you use the point-biserial correlation, you are assuming a linear relationship between examinee ability and the probability of a correct answer.  If using item response theory (IRT), it is a logistic function.  You can evaluate the fit of these using both graphical (visual) and purely quantitative approaches.

Why do item fit analysis?

There are several reasons to do item fit analysis.

  1. As noted above, if you are assuming some sort of mathematical model, it behooves you to check on whether it is appropriate to even use.
  2. It can help you choose the model; perhaps you are using the 2PL IRT model and then notice a strong guessing factor (lower asymptote) when evaluating fit.
  3. Item fit analysis can help identify improper item keying.
  4. It can help find errors in the item calibration, which determines validity of item parameters.
  5. Item fit can be used to assess test dimensionality, which affects the validity of test results (Reise, 1990).  For example, if you are trying to run IRT on a single test that is actually two-dimensional, it will likely fit well on one dimension while the items on the other dimension show poor fit.
  6. Item fit analysis can be beneficial in detecting measured disturbances, such as differential item functioning (DIF).

 

What is item fit?

Model-data fit, in general, refers to how far away our data is from the values predicted by the model.  As such, it is often evaluated with some sort of distance metric, such as a chi-square or a standardized version of it.  This easily translates into visual inspection as well.

Suppose we took a sample of examinees and divided it into 10 quantile groups.  The first is the lowest 10%, then the 10th-20th percentile, and so on.  We graph the proportion in each group that gets the item correct; it will be a higher proportion for the smarter students.  But if the sample is small, the line might bounce around like the blue line below.  When we fit a model like the black line, we can find the total distance of the red lines, which gives us some quantification of how well the model is fitting.  In some cases, the blue line might be very close to the black line, and in others it will not be at all.

Of course, psychometricians turn those values into quantitative indices.  Some examples are a Chi-square and a z-Residual, but there are plenty of others.  The Chi-square squares the red distances and sums them up.  The z-Residual takes that, adjusts for sample size, and standardizes it onto the familiar z-metric.
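Here is a rough Python sketch of that idea: group examinees into quantiles, compare the observed proportion correct in each group with the model-predicted proportion, and sum the scaled squared differences into a chi-square-type statistic. The group data and item parameters are invented, and real programs such as Xcalibre use more refined versions of these statistics.

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical data: mean theta, group size, and observed proportion correct for 5 quantile groups
groups = [(-1.6, 40, 0.20), (-0.8, 40, 0.38), (0.0, 40, 0.55), (0.8, 40, 0.72), (1.6, 40, 0.90)]
a, b = 1.0, 0.0   # calibrated item parameters (invented for the example)

chi_sq = 0.0
for theta, n, observed in groups:
    expected = p_2pl(theta, a, b)                                          # model-predicted proportion correct
    chi_sq += n * (observed - expected) ** 2 / (expected * (1 - expected)) # Pearson-type contribution
print(f"Chi-square-type fit statistic: {chi_sq:.2f} across {len(groups)} groups")
# Large values relative to a chi-square reference distribution suggest misfit.
```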

Item fit with Item Response Theory

IRT was created to overcome most of the limitations of CTT. Within the IRT framework, item and test-taker parameters are independent when the test data fit the assumed model. Additionally, these two sets of parameters are located on one scale, so they are comparable with each other. The invariance property of IRT makes it possible to solve measurement problems that are almost impossible to solve within CTT, such as item banking, item bias, test equating, and computerized adaptive testing (Hambleton, Swaminathan, and Rogers, 1991).

There are three logistic models defined and widely used in IRT: one-parameter (1PL), two-parameter (2PL), and three-parameter (3PL). 1PL employs only one parameter, difficulty, to describe the item. 2PL uses two parameters, difficulty and discrimination. 3PL uses three—difficulty, discrimination, and guessing. A successful application of IRT means that test data fit the assumed IRT model. However, it may happen that even when a whole test fits the model, some of the items misfit it, i.e. do not function in the intended manner. Statistically it means that there is a difference between expected and observed frequencies of correct answers to the item at various ability levels.

There are many different reasons for item misfit. For instance, an easy item might not fit the model when low-ability test-takers do not attempt it at all; this usually happens in speeded tests when there is no penalty for slow work. Another example is when low-ability test-takers answer difficult items correctly by guessing; this usually occurs with tests consisting purely of multiple-choice items. A further example is a test that is not unidimensional, in which case some items may misfit the model.

Examples

Here are two examples of evaluating item fit with item response theory, using the software  Xcalibre.  Here is an item with great fit.  The red line (observed) is very close to the black line (model).  The two fit statistics are Chi-Square and z-Residual.  The p-values for both are large, indicating that we are nowhere near rejecting the hypothesis of model fit.

[Figure: Xcalibre fit plot and statistics for an item with good fit]

Now, consider the following item.  The red line is much more erratic.  The Chi-square rejects the model fit hypothesis with p=0.000.  The z-Residual, which corrects for sample size, does not reject, though its p-value is smaller than for the previous item.  This item also has a very low a parameter, so it should probably be evaluated.

[Figure: Xcalibre fit plot and statistics for an item with poorer fit]

Summary

To sum up, item fit analysis is key in item and test development. The relationship between item parameters and item fit identifies factors related to item fitness, which is useful in predicting item performance. In addition, this relationship helps understand, analyze, and interpret test results especially when a test has a significant number of misfit items.

References

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). Sage.
Reise, S. P. (1990). A comparison of item-and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14(2), 127-137.

Distractor Analysis

Distractor analysis refers to the process of evaluating the performance of incorrect answers vs the correct answer for multiple choice items on a test.  It is a key step in the psychometric analysis process to evaluate item and test performance as part of documenting test reliability and validity.

What is a distractor?

An item distractor, also known as a foil or a trap, is an incorrect option for a selected-response item on an assessment. Multiple-choice questions have several answer options, one of which is the key (correct answer); the remaining ones are distractors (wrong answers). It is worth noting that distractors should not be just any wrong answers, but plausible answers that an examinee might select when making a misjudgment or working from partial knowledge/understanding.  A great example is later in this article with the word “confectioner.”

[Figure: parts of an item: stem, options, and distractors]

What makes a good item distractor?

One word: plausibility.  We need the item distractor to attract examinees.  If it is so irrelevant that no one considers it, then it does not do any good to include it in the item.  Consider the following item.

 

   What is the capital of the United States of America?

 A. Los Angeles

 B. New York

 C. Washington, D.C.

 D. Mexico City

 

The last option is quite implausible – not only is it outside the USA, but it mentions another country in the name, so no student is likely to select this.  This then becomes a three-horse race, and students have a 1 in 3 chance of guessing.  This certainly makes the item easier. How much do distractors matter?  Well, how much is the difficulty affected by this new set?

 

   What is the capital of the United States of America?

 A. Paris

B. Rome

 C. Washington, D.C.

 D. Mexico City  

 

In addition, a distractor needs to have negative discrimination.  That is, while we want the correct answer to attract the more capable examinees, we want the distractors to attract the lower-ability examinees.  If you have a distractor that you thought was incorrect, and it turns out to attract all the top students, you need to take a long, hard look at that question! To calculate discrimination statistics on distractors, you will need software such as Iteman.

What makes a bad item distractor?

Obviously, implausibility and negative discrimination are frequent offenders.  But if you think more deeply about plausibility, the key is actually plausibility without being arguably correct.  This can be a fine line to walk, and is a common source of problems for items.  You might have a medical item that presents a scenario and asks for a likely diagnosis; perhaps one of the distractors is very unlikely so as to be essentially implausible, but it might actually be possible for a small subset of patients under certain conditions.  If the author and item reviewers did not catch this, the examinees probably will, and this will be evident in the statistics.  This is one of the reasons it is important to do psychometric analysis of test results, including distractor analysis to evaluate the effectiveness of incorrect options in multiple-choice questions.  In fact, accreditation standards often require you to go through this process at least once a year.

Why do we need a distractor analysis?

After a test form is delivered to examinees, distractor analysis should be implemented to make sure that all answer options work well and that the item is performing well and defensibly. For example, it is expected that roughly 40-95% of examinees pick the correct answer, and that each distractor is chosen by a smaller number of examinees than the number choosing the key, with the choices distributed approximately evenly across the distractors. Distractor analysis is usually done with classical test theory, even if item response theory is used for scoring, equating, and other tasks.

How to do a distractor analysis

There are three main aspects:

  1. Option frequencies/proportions
  2. Option point-biserial
  3. Quantile plot

The option frequencies/proportions just refers to the analysis of how many examinees selected each answer.  Usually it is a proportion, labeled “P.”  Did 70% choose the correct answer while the remaining 30% were evenly distributed amongst the 3 distractors?  Great.  But if only 40% chose the correct answer and 45% chose one of the distractors, you might have a problem on your hands.  Perhaps the answer specified as the key was not actually correct.

The point-biserials (Rpbis) will help you evaluate if this is the case.  The point-biserial is an item-total correlation, meaning that we correlate scores on the item with the total score on the test, which is a proxy index of examinee ability.  If it is 0.0, there is no relationship, which means the item is not correlated with ability and therefore probably not doing any good.  If it is negative, lower-ability students are selecting that option more often; if positive, higher-ability students are selecting it more often.  We want the correct answer to have a positive value and the distractors to have negative values.  This is one of the most important points in determining whether the item is performing well.

In addition, there is a third approach, which is visual: the quantile plot.  It is very useful for diagnosing how an item is working and how it might be improved.  This splits the sample into blocks ordered by performance, such as 5 groups where Group 1 is the 0-20th percentile, Group 2 is the 21st-40th, and so on.  We expect the smartest group to have a high proportion of examinees selecting the correct answer and a low proportion selecting the distractors, and vice versa.  You can see how this aligns with the concept of the point-biserial.  An example of this is below.

Note that the P and point-biserial for the correct answer serve as “the” statistics for the item as a whole.  The P for the item is called the item difficulty or facility statistic.
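Here is a minimal Python sketch of the first two pieces, option proportions and option-level point-biserials, computed from a tiny invented response matrix; a real analysis would use software such as Iteman on the full set of response data.

```python
import numpy as np

# Hypothetical responses of 8 examinees to one 4-option item, plus their total test scores
choices = np.array(["B", "B", "A", "B", "C", "B", "D", "A"])
totals  = np.array([30, 27, 14, 25, 12, 28, 10, 16], dtype=float)
key = "B"

for option in ["A", "B", "C", "D"]:
    selected = (choices == option).astype(float)    # 1 if the examinee chose this option, else 0
    p = selected.mean()                             # option proportion ("P")
    rpbis = np.corrcoef(selected, totals)[0, 1]     # point-biserial of the option vs. total score
    label = "key" if option == key else "distractor"
    print(f"{option} ({label}): P = {p:.2f}, Rpbis = {rpbis:+.2f}")
# We want the key to show a positive Rpbis and each distractor a negative one.
```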

Examples of a distractor analysis

Here is an example of a good item.  The P is medium (67% correct) and the Rpbis is strongly positive for the correct answer while strongly negative for the incorrect answers.  This translates to a clean quantile plot where the curve for the correct answer (B) goes up while the curves for the incorrect answers go down.  An ideal situation.

 

[Figure: distractor analysis table and quantile plot for a well-performing item]

 

Now contrast that with the following item.  Here, only 12% of examinees got this correct, and the Rpbis was negative.  Answer C had 21% and a nicely positive Rpbis, as well as a quantile curve that goes up.  This item should be reviewed to see if C is actually correct.  Or B, which had the most responses.  Most likely, this item will need a total rewrite!

 

[Figure: distractor analysis table and quantile plot for a poorly performing item]

 

Note that an item can be extremely difficult but still perform well.  Here is an example where the distractor analysis supports continued use of the item.  The distractor is just extremely attractive to lower-ability examinees; they think that a confectioner makes confetti, since those two words look the closest.  Look how strong the Rpbis is here, and very negative for that distractor.  This is a good result!

 

[Figure: distractor analysis for the confectioner/confetti item]