
Coefficient alpha reliability, sometimes called Cronbach’s alpha, is a statistical index that is used to evaluate the internal consistency or reliability of an assessment. That is, it quantifies how consistent we can expect scores to be, by analyzing the item statistics. A high value indicates that the test is of high reliability, and a low value indicates low reliability. This is one of the most fundamental concepts in psychometrics, and alpha is arguably the most common index. You may also be interested in reading about its competitor, the Split Half Reliability Index.

What is coefficient alpha, aka Cronbach’s alpha?

The classic reference to alpha is Cronbach (1951). He defines it as:

   alpha = (k / (k - 1)) * (1 - (Σ σ²_i) / σ²_X)

where k is the number of items, σ²_i is the variance of item i, and σ²_X is the variance of total scores.
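To make the formula concrete, here is a minimal Python sketch that computes coefficient alpha from a complete persons-by-items matrix of item scores. The data and function name are purely illustrative; real psychometric software will also handle missing data and other practical details.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy example: 6 examinees responding to 4 Likert-type items
data = [
    [4, 5, 4, 5],
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
]
print(round(cronbach_alpha(data), 3))
```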

Kuder-Richardson 20

While Cronbach tends to get the credit, to the point that the index is often called “Cronbach’s Alpha,” he really did not invent it. Kuder and Richardson (1937) suggested the following equation to estimate the reliability of a test with dichotomous (right/wrong) items.

   KR-20 = (k / (k - 1)) * (1 - (Σ p_i q_i) / σ²_X)

Note that it is the same as Cronbach’s equation, except that Cronbach replaced the binomial item variance p_i q_i with the more general variance term σ²_i. This just means that you can use Cronbach’s equation on polytomous data such as Likert rating scales. In the case of dichotomous data such as multiple choice items, Cronbach’s alpha and KR-20 are identical.
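As a quick, hypothetical check of that equivalence, the sketch below computes KR-20 from p and q and computes alpha from general item variances on the same simulated dichotomous data; with a consistent variance definition the two values come out identical.

```python
import numpy as np

def kr20(scores):
    """KR-20 for a persons-by-items matrix of 0/1 item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    p = scores.mean(axis=0)                 # proportion correct per item
    pq = p * (1 - p)                        # binomial item variances
    total_var = scores.sum(axis=1).var()    # total score variance
    return (k / (k - 1)) * (1 - pq.sum() / total_var)

def alpha(scores):
    """Coefficient alpha using general item variances (same variance definition)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    return (k / (k - 1)) * (1 - scores.var(axis=0).sum() / scores.sum(axis=1).var())

rng = np.random.default_rng(0)
data01 = (rng.random((50, 10)) > 0.4).astype(int)       # simulated right/wrong responses
print(round(kr20(data01), 4), round(alpha(data01), 4))  # the two values match
```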

Additionally, Cyril Hoyt defined reliability in an equivalent approach using ANOVA in 1941, a decade before Cronbach’s paper.

How to interpret coefficient alpha

In general, alpha will range from 0.0 (random number generator) to 1.0 (perfect measurement). However, in rare cases, it can go below 0.0, such as if the test is very short or if there is a lot of missing data (sparse matrix). This, in fact, is one of the reasons NOT to use alpha in some cases. If you are dealing with linear-on-the-fly tests (LOFT), computerized adaptive tests (CAT), or a set of overlapping linear forms for equating (non-equivalent anchor test, or NEAT design), then you will likely have a large proportion of sparseness in the data matrix and alpha will be very low or negative. In such cases, item response theory provides a much more effective way of evaluating the test.

What is “perfect measurement?”  Well, imagine using a ruler to measure a piece of paper.  If it is American-sized, that piece of paper is always going to be 8.5 inches wide, no matter how many times you measure it with the ruler.  A bathroom scale is slightly less reliable; you might step on it, see 190.2 pounds, then step off and on again, and see 190.4 pounds.  This is a good example of how we often accept a small amount of unreliability in measurement.

Of course, we never have this level of accuracy in the world of psychoeducational measurement.  Even a well-made test is something where a student might get 92% today and 89% tomorrow (assuming we could wipe their brain of memory of the exact questions).

Reliability can also be interpreted as the ratio of true score variance to total score variance. That is, all test score distributions have a total variance, which consists of variance due to the construct of interest (i.e., smart students do well and poor students do poorly) plus some error variance (random error, kids not paying attention to a question, a second dimension in the test… it could be many things).

What is a good value of coefficient alpha?

As psychometricians love to say, “it depends.” The rule of thumb that you generally hear is that a value of 0.70 is good and below 0.70 is bad, but that is terrible advice. A higher value indeed indicates higher reliability, but you don’t always need high reliability. A test to certify surgeons, of course, deserves all the items it needs to make it quite reliable. Anything below 0.90 would be horrible. However, the survey you take from a car dealership will likely have the statistical results analyzed, and a reliability of 0.60 isn’t going to be the end of the world; it will still provide much better information than not doing a survey at all!

Here’s a general depiction of how to evaluate levels of coefficient alpha.

[Image: general guidelines for interpreting levels of coefficient alpha]

Using alpha: the classical standard error of measurement

Coefficient alpha is also often used to calculate the classical standard error of measurement (SEM), which provides a related method of interpreting the quality of a test and the precision of its scores. The SEM can be interpreted as the standard deviation of scores that you would expect if a person took the test many times, with their brain wiped clean of the memory each time. If the test is reliable, you’d expect them to get almost the same score each time, meaning that SEM would be small.

   SEM = SD * sqrt(1 - r)

where SD is the standard deviation of total test scores and r is the reliability coefficient (alpha).

Note that SEM is a direct function of alpha, so that if alpha is 0.99, SEM will be small, and if alpha is 0.1, then SEM will be very large.
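Here is a small sketch of that relationship in Python; the SD value is made up, and the point is simply how quickly the SEM grows as alpha drops.

```python
import math

def classical_sem(sd_total, reliability):
    """Classical standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd_total * math.sqrt(1 - reliability)

# Example: a test whose total scores have a standard deviation of 10 points
for r in (0.99, 0.90, 0.70, 0.10):
    print(r, round(classical_sem(10, r), 2))
# alpha = 0.99 gives SEM = 1.0, while alpha = 0.10 gives SEM = 9.49
```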

Coefficient alpha and unidimensionality

It can also be interpreted as a measure of unidimensionality. If all items are measuring the same construct, then scores on them will align, and the value of alpha will be high. If there are multiple constructs, alpha will be reduced, even if the items are still high quality. For example, if you were to analyze data from a Big Five personality assessment with all five domains at once, alpha would be quite low. Yet if you took the same data and calculated alpha separately on each domain, it would likely be quite high.

How to calculate the index

Because the calculation of coefficient alpha reliability is so simple, it can be done quite easily if you need to calculate it from scratch, such as using formulas in Microsoft Excel. However, any decent assessment platform or psychometric software will produce it for you as a matter of course. It is one of the most important statistics in psychometrics.

Cautions on overuse

Because alpha is just so convenient – boiling down the complex concept of test quality and accuracy to a single easy-to-read number – it is overused and over-relied upon. There are papers out in the literature that describe the cautions in detail; here is a classic reference.

One important consideration is the over-simplification of precision by coefficient alpha and the classical standard error of measurement, when compared with the concept of the conditional standard error of measurement from item response theory. Most traditional tests have a lot of items of middle difficulty, which maximizes alpha. Such a test measures students of middle ability quite well. However, if there are no difficult items on a test, it will do nothing to differentiate among the top students. That test would therefore have a high overall alpha, but virtually no precision for the top students. In an extreme example, they’d all score 100%.

Also, alpha will completely fall apart when you calculate it on sparse matrices, because the total score variance is artifactually reduced.

Limitations of coefficient alpha

Cronbach’s alpha has several limitations. Firstly, it assumes that all items on a scale measure the same underlying construct and have equal variances, which is often not the case. Secondly, it is sensitive to the number of items on the scale; longer scales tend to produce higher alpha values, even if the additional items do not necessarily improve measurement quality. Thirdly, Cronbach’s alpha assumes that item errors are uncorrelated, an assumption that is frequently violated in practice. Lastly, it provides only a lower bound estimate of reliability, which means it can underestimate the true reliability of the test.

Summary

In conclusion, coefficient alpha is one of the most important statistics in psychometrics, and for good reason. It is quite useful in many cases, and easy enough to interpret that you can discuss it with test content developers and other non-psychometricians. However, there are cases where you should be cautious about its use, and some cases where it completely falls apart. In those situations, item response theory is highly recommended.


What is the difference between the terms dichotomous and polytomous in psychometrics?  Well, these terms represent two subcategories within item response theory (IRT) which is the dominant psychometric paradigm for constructing, scoring and analyzing assessments.  Virtually all large-scale assessments utilize IRT because of its well-documented advantages.  In many cases, however, it is referred to as a single way of analyzing data.  But IRT is actually a family of fast-growing models, each requiring rigorous item fit analysis to ensure that each question functions appropriately within the model.   The models operate quite differently based on whether the test questions are scored right/wrong or yes/no (dichotomous), vs. complex items like an essay that might be scored on a rubric of 0 to 6 points (polytomous).  This post will provide a description of the differences and when to use one or the other.

 

Ready to use IRT?  Download Xcalibre for free

 

Dichotomous IRT Models

Dichotomous IRT models are those with two possible item scores.  Note that I say “item scores” and not “item responses” – the most common example of a dichotomous item is multiple choice, which typically has 4 to 5 options, but only two possible scores (correct/incorrect).  

True/False or Yes/No items are also obvious examples and are more likely to appear in surveys or inventories, as opposed to the ubiquity of the multiple-choice item in achievement/aptitude testing. Other item types that can be dichotomous are Scored Short Answer and Multiple Response (all or nothing scoring).  

What models are dichotomous?

The three most common dichotomous models are the 1PL/Rasch, the 2PL, and the 3PL.  Which one to use depends on the type of data you have, as well as your doctrine, of course.  A great example is Scored Short Answer items: there should be no effect of guessing on such an item, so the 2PL is a logical choice.  Here is a broad overgeneralization:

  • 1PL/Rasch: Uses only the difficulty (b) parameter and does not take into account guessing effects or the possibility that some items might be more discriminating than others; however, it can be useful with small samples and in other situations
  • 2PL: Uses difficulty (b) and discrimination (a) parameters, but no guessing (c); relevant for the many types of assessment where there is no guessing
  • 3PL: Uses all three parameters, typically relevant for achievement/aptitude testing.

What do dichotomous models look like?

Dichotomous models, graphically, will have one S-shaped curve with a positive slope, as seen here.  The model says that the probability of responding in the keyed direction increases with higher levels of the trait or ability.

[Image: item response function for a dichotomous item]

Technically, there is also a line for the probability of an incorrect response, which goes down, but this is obviously the 1-P complement, so it is rarely drawn in graphs.  It is, however, used in scoring algorithms (check out this white paper).

In the example, a student with theta = -3 has about a 0.28 chance of responding correctly, while theta = 0 has about 0.60 and theta = 1 has about 0.90.
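As a rough sketch of how such a curve is produced, the function below evaluates the three-parameter logistic (3PL) item response function; the parameter values are arbitrary assumptions for illustration, not the ones behind the figure above.

```python
import numpy as np

def irf_3pl(theta, a=1.0, b=0.0, c=0.25):
    """3PL item response function: probability of a correct response at ability theta.
    a = discrimination, b = difficulty, c = pseudo-guessing (lower asymptote).
    Setting c = 0 gives the 2PL; additionally fixing a gives the 1PL/Rasch form."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

for theta in (-3, 0, 1):
    print(theta, round(float(irf_3pl(theta)), 2))
```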

Polytomous IRT Models

Polytomous models are for items that have more than two possible scores.  The most common examples are Likert-type items (Rate on a scale of 1 to 5) and partial credit items (score on an Essay might be 0 to 5 points). IRT models typically assume that the item scores are integers.

What models are polytomous?

Unsurprisingly, the most common polytomous models use names like rating scale and partial credit.

  • Rating Scale Model (Andrich, 1978)
  • Partial Credit Model (Masters, 1982)
  • Generalized Rating Scale Model (Muraki, 1990)
  • Generalized Partial Credit Model (Muraki, 1992)
  • Graded Response Model (Samejima, 1969)
  • Nominal Response Model (Bock, 1972)

What do polytomous models look like?

Polytomous models have a separate curve for each possible response.  The curve for the highest point value is typically S-shaped like a dichotomous curve.  The curve for the lowest point value is typically sloped downward like the 1-P dichotomous curve.  Point values in the middle typically have bell-shaped curves. The example is for an essay scored 0 to 5 points.  Only students with theta > 2 are likely to get the full points (blue), while students with 1 < theta < 2 are likely to receive 4 points (green).
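As one concrete, hypothetical example of such curves, the sketch below computes category probabilities under the Generalized Partial Credit Model for an item scored 0 to 5; the discrimination and threshold values are invented for illustration.

```python
import numpy as np

def gpcm_probs(theta, a=1.0, thresholds=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Generalized Partial Credit Model category probabilities for one item.
    thresholds are the step parameters b_1..b_m for an item scored 0..m;
    a is the discrimination parameter."""
    steps = a * (theta - np.asarray(thresholds))
    # cumulative sums of the step terms, with 0 for the lowest category
    numerators = np.exp(np.concatenate(([0.0], np.cumsum(steps))))
    return numerators / numerators.sum()

# Probability of each essay score (0-5) at a few ability levels;
# middle categories peak for middle thetas, and the top score dominates for high thetas
for theta in (-2, 0, 2.5):
    print(theta, np.round(gpcm_probs(theta), 2))
```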

I’ve seen “polychotomous.”  What does that mean?

It means the same as polytomous.  

How is IRT used in our platform?

We use it to support the test development cycle, including form assembly, scoring, and adaptive testing.  You can learn more on this page.

How can I analyze my tests with IRT?

You need specially designed software, like  Xcalibre.  Classical test theory is so simple that you can do it with Excel functions.

Recommended Readings

Item Response Theory for Psychologists by Embretson and Reise (2000).


Meta-analysis is a research process of collating data from multiple independent but similar scientific studies in order to identify common trends and findings by means of statistical methods. To put it simply, it is a method where you can accumulate all of your research findings and analyze them statistically. It is often used in psychometrics and industrial-organizational psychology to help validate assessments. Meta-analysis not only serves as a summary of a research question but also provides a quantitative evaluation of the relationship between two variables or the effectiveness of an experiment. It can also work for examining theoretical assumptions that compete with each other.

Background of Meta-Analysis

An American statistician and researcher, Gene Glass, devised the term ‘meta-analysis’ in 1976, using it to refer to the statistical analysis of a large amount of data from individual studies in order to integrate the findings. Medical researchers began employing meta-analysis a few years later. One of the first influential applications of this method was when Elwood and Cochrane used meta-analysis to examine the effect of aspirin on reducing recurrences of heart attacks.


Purpose of Meta-Analysis

In general, meta-analysis is aimed at two things:

  • to establish whether the studied effect exists and to determine whether it is positive or negative,
  • to analyze the results of previously conducted studies to find out common trends.

Performing Meta-Analysis

Even though there could be various ways of conducting meta-analysis depending on the research purpose and field, there are eight major steps:

  1. Set a research question and propose a hypothesis
  2. Conduct a systematic review of the relevant studies
  3. Extract data from the studies to include in the meta-analysis, considering sample sizes and data variability measures for the intervention and control groups (the control group does not receive the intervention, whilst the intervention group does)
  4. Calculate summary measures, called effect sizes (e.g., the difference in average values between the intervention and control groups), and standardize the estimates if necessary for making comparisons between the groups (see the sketch after this list)
  5. Choose a meta-analytical method: quantitative (traditional univariate meta-analysis, meta-regression, meta-analytic structural equation modeling) or qualitative
  6. Pick up the software depending on the complexity of the methods used and the dataset (e.g. templates for Microsoft Excel, Stata, SPSS, SAS, R, Comprehensive Meta-Analysis, RevMan), and code the effect sizes
  7. Do analyses by employing an appropriate model for comparing effect sizes using fixed effects (assumes that all observations share a common mean effect size) or random effects (assumes heterogeneity and allows for a variation of the true effect sizes across observations)
  8. Synthesize results and report them
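To make steps 4 and 7 concrete, here is a minimal sketch of a fixed-effect meta-analysis of correlation coefficients using the Fisher-z transformation and inverse-variance weights; the study correlations and sample sizes are invented.

```python
import math

# Hypothetical studies: (correlation between test score and job performance, sample size)
studies = [(0.45, 120), (0.55, 300), (0.38, 80), (0.50, 200)]

# Step 4: convert each r to Fisher's z, an approximately normal effect size
effects = [(0.5 * math.log((1 + r) / (1 - r)), n) for r, n in studies]

# Step 7: fixed-effect model -- weight each study by its inverse variance (n - 3 for Fisher's z)
weights = [n - 3 for _, n in effects]
pooled_z = sum(w * z for (z, _), w in zip(effects, weights)) / sum(weights)

# Back-transform the pooled effect to the correlation metric
pooled_r = (math.exp(2 * pooled_z) - 1) / (math.exp(2 * pooled_z) + 1)
print(round(pooled_r, 3))
```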

Prior to making any conclusions and reporting results, it would be helpful to use the checklist suggested by DeSimone et al. (2021) to ensure that all crucial aspects of the meta-analysis have been addressed in your study.

Meta-Analysis in Assessment & Psychometrics: Test Validation & Validity Generalization

Due to its versatility, meta-analysis is used in various fields of research, in particular as a test validation strategy in psychology and psychometrics. The most common situation for applying meta-analysis is validating the use of tests in the workplace, in the fields of personnel psychology and pre-employment testing. The classic example of such an application is the work done by Schmidt and Hunter (1998), who analyzed 85 years of research on what best predicts job performance. This is one of the most important articles on that topic. It was recently updated by Sackett et al. (2021) with slightly different results.

How is meta-analysis applied to such a situation?  Well, start by reconceptualizing a “sample” as a set of studies, not a set of people. So let’s say we find 100 studies that use pre-employment tests to select examinees by predicting job performance (obviously, there are far more). Because most studies use more than one test, there might be 77 that use a general cognitive ability test, 63 that use a conscientiousness assessment, 24 that use a situational judgment test, etc. We look at the correlation coefficients reported for those 77 studies and find that the average is 0.51, while the average correlation for conscientiousness is 0.44 and for SJTs is 0.39. You can see how this is extremely useful in a practical sense, for a practitioner who might be tasked with selecting an assessment battery!

Meta-analysis studies will often go further and clean up the results, by tossing studies with poor methodology or skewed samples, and applying corrections for things like range restriction and unreliability. This enhances the validity of the overall results. To see such an example, visit the Sackett et al. (2021) article.

Such research has led to the concept of validity generalization. This suggests that if a test has been validated for many uses, or similar uses, you can consider it validated for your particular use without having to do a validation study. For example, if you are selecting clerical workers and you can see that there are literally hundreds of studies which show that numeracy or quantitative tests will predict job performance, there is no need for you to do ANOTHER study. If challenged, you can just point to the hundreds of studies already done. Obviously, this is a reasonable argument, but you should not take it too far, i.e., generalize too much.

Conclusion

As you might have gathered by now, conducting a meta-analysis is not a piece of cake. However, it is very efficient when the researcher intends to evaluate effects across diverse participants, set new hypotheses that create a precedent for future research studies, demonstrate statistical significance, or overcome the issue of small sample sizes in individual studies.

References

Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2021). Introduction to meta-analysis. John Wiley & Sons.

DeSimone, J. A., Brannick, M. T., O’Boyle, E. H., & Ryu, J. W. (2021). Recommendations for reviewing meta-analyses in organizational research. Organizational Research Methods, 24(4), 694-717.

Field, A. P., & Gillett, R. (2010). How to do a meta-analysis. British Journal of Mathematical and Statistical Psychology, 63(3), 665-694.

Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5(10), 3-8.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Sage Publications.

Gurevitch, J., Koricheva, J., Nakagawa, S., & Stewart, G. (2018). Meta-analysis and the science of research synthesis. Nature, 555(7695), 175-182.

Hansen, C., Steinmetz, H., & Block, J. (2022). How to conduct a meta-analysis in eight steps: A practical guide. Management Review Quarterly, 72(1), 1-19.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Academic Press.

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Sage Publications.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Sage.

Peto, R., & Parish, S. (1980). Aspirin after myocardial infarction. Lancet, 1(8179), 1172-1173.

Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2021). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology.

Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262.

 


Test validation is the process of verifying, based on solid evidence, whether the requirements of each test development stage have been fulfilled. In particular, test validation is an ongoing process of developing an argument that a specific test, and its score interpretation or use, is valid. The interpretation and use of testing data should be validated in terms of content, substantive, structural, external, generalizability, and consequential aspects of construct validity (Messick, 1994). Validity is the status of an argument that can be positive or negative: positive evidence supports the validity argument and negative evidence weakens it. Validity cannot be absolute and can be judged only in degrees. Meta-analysis, a technique frequently employed in psychometrics, aggregates research findings across multiple studies to assess the overall validity and reliability of a test. By synthesizing data from diverse sources, meta-analysis provides a comprehensive evaluation of the test’s construct validity, supporting its use in educational and psychological assessments (AERA, APA, & NCME, 1999).

Validation as part of test development

To be effective, test development has to be structured, systematic, and detail-oriented. These features can guarantee sufficient validity evidence supporting inferences proposed by test scores obtained via assessment. Downing (2006) suggested a twelve-step framework for the effective test development:

  1. Overall plan
  2. Content definition
  3. Test blueprint
  4. Item development
  5. Test design and assembly
  6. Test production
  7. Test administration
  8. Scoring test responses
  9. Standard setting
  10. Reporting test results
  11. Item bank management
  12. Technical report

Even though this framework is outlined as a sequential timeline, in practice some of these steps may occur simultaneously or may be ordered differently. A starting point of the test development – the purpose – defines the planned test and regulates almost all validity-related activities. Each step of the test development process focuses on its crucial aspect – validation.

Hypothetically, excellent execution of all steps can ensure test validity, i.e., the resulting test would estimate examinee ability fairly within the content area it is intended to measure. However, the human factor involved in test production might play a negative role, so there is an essential need for test validation.

Reasons for test validation

There are myriad possible reasons that can lead to the invalidation of test score interpretation or use. Let us consider some obvious issues that potentially jeopardize validity and are therefore subject to validation:

  • overall plan: wrong choice of a psychometric model;
  • content definition: content domain is ill defined;
  • test blueprint: test blueprint does not specify an exact sampling plan for the content domain;
  • item development: items measure content at an inappropriate cognitive level;
  • test design and assembly: unequal booklets;
  • test administration: cheating;
  • scoring test responses: inconsistent scoring among examiners;
  • standard setting: unsuitable method of establishing passing scores;
  • item bank management: inaccurate updating of item parameters.

Context for test validation

All tests share common types of purported validity evidence, e.g., reliability, comparability, equating, and item quality. However, tests vary in the number of constructs measured (single or multiple) and can have different purposes, which call for unique types of test validation evidence. In general, there are several major types of tests:

  • Admissions tests (e.g., SAT, ACT, and GRE)
  • Credentialing tests (e.g., a live-patient examination for a dentist before licensing)
  • Large-scale achievement tests (e.g., Stanford Achievement Test, Iowa Test of Basic Skills, and TerraNova)
  • Pre-employment tests
  • Medical or psychological
  • Language

The main idea is that the type of test usually defines a unique validation agenda that focuses on appropriate types of validity evidence and issues that are challenged in that type of test.

Categorization of test validation studies

Since there are multiple precedents for the test score invalidation, there are many categories of test validation studies that can be applied to validate test results. In our post, we will look at the categorization suggested by Haladyna (2011):

Category 1: Test Validation Studies Specific to a Testing Program

Each subcategory of study is listed below, along with its typical areas of focus.

    1. Studies That Provide Validity Evidence in Support of the Claim for a Test Score Interpretation or Use
  • Content analysis
  • Item analysis
  • Standard setting
  • Equating
  • Reliability
    2. Studies That Threaten a Test Score Interpretation or Use
  • Cheating
  • Scoring errors
  • Student motivation
  • Unethical test preparation
  • Inappropriate test administration
    3. Studies That Address Other Problems That Threaten Test Score Interpretation or Use
  • Drop in reliability
  • Drift in item parameters over time
  • Redesign of a published test
  • Possible security problem

Category 2: Test Validation Studies That Apply to More Than One Testing Program

    Studies that lead to the establishment of concepts, principles, or procedures that guide, inform, or improve test development or scoring
  • Introducing a concept
  • Introducing a principle
  • Introducing a procedure
  • Studying a pervasive problem

Summary

Even though test development is a long and laborious process, test developers have to be extremely accurate while executing each activity. The culmination of this process is obtaining valid and reliable test scores, and interpreting and using them adequately. The higher the stakes or consequences of the test scores, the greater the attention that should be paid to test validity and, therefore, to test validation. The latter is accomplished by integrating all reliable sources of evidence to strengthen the argument for test score interpretation and use.

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. American Educational Research Association.

Downing, S. M. (2006). Twelve steps for effective test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 3-25). Lawrence Erlbaum Associates.

Haladyna, T. M. (2011). Roles and importance of validity studies in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 739-755). Lawrence Erlbaum Associates.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.

 


Automated item generation (AIG) is a paradigm for developing assessment items (test questions), utilizing principles of artificial intelligence and automation. As the name suggests, it tries to automate some or all of the effort involved with item authoring, as that is one of the most time-intensive aspects of assessment development – which is no news to anyone who has authored test questions!

What is Automated Item Generation?

Automated item generation involves the use of computer algorithms to create new test questions, or variations of them.  It can also be used for item review, or the generation of answers, or the generation of assets such as reading passages.  Items still need to be reviewed and edited by humans, but this still saves a massive amount of time in test development.

Why Use Automated Item Generation?

Items can cost up to $2000 to develop, so even cutting the average cost in half could provide massive time/money savings to an organization.  ASC provides AIG functionality, with no limits, to anyone who signs up for a free item banking account in our platform  Assess.ai.

Types of Automated Item Generation?

There are two types of automated item generation.  The Item Templates approach was developed before large language models (LLMs) were widely available.  The second approach is to use LLMs, which became widely available at the end of 2022.

Type 1: Item Templates

The first type is based on the concept of item templates to create a family of items using dynamic, insertable variables. There are three stages to this work. For more detail, read this article by Gierl, Lai, and Turner (2012).

  • Authors, or a team, create a cognitive model by isolating exactly what they are trying to assess and the different ways that the knowledge could be presented or evidenced. This might include information such as which variables are important vs. incidental, and what a correct answer should include.
  • They then develop templates for items based on this model, like the example you see below.
  • An algorithm then turns this template into a family of related items, often by producing all possible permutations.
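As a toy illustration of the template idea from the three stages above, the sketch below expands a single invented template into a small family of items by taking all permutations of its variables; a real cognitive model would of course be far richer.

```python
from itertools import product

# Hypothetical item template with insertable variables
template = ("A patient presents with {symptom} and has been taking {drug}. "
            "What is the most appropriate next step?")

variables = {
    "symptom": ["chest pain", "shortness of breath", "dizziness"],
    "drug": ["aspirin", "metoprolol"],
}

# Generate every permutation of the variables -> a family of related draft items
names = list(variables)
family = [template.format(**dict(zip(names, combo)))
          for combo in product(*(variables[n] for n in names))]

for item in family:
    print(item)
print(len(family), "draft items generated")  # 3 x 2 = 6, all still needing SME review
```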

Obviously, you can’t use more than one of these on a given test form. And in some cases, some of the permutations will be an unlikely scenario or possibly completely irrelevant. But the savings can still be quite real. I saw a conference presentation by Andre de Champlain from the Medical Council of Canada, stating that overall efficiency improved by 6x and the generated items were higher quality than traditionally written items, because the process made the authors think more deeply about what they were assessing and how. He also recommended that template permutations not be automatically moved to the item bank, but instead that each be reviewed by SMEs, for reasons such as those stated above.

You might think “Hey, that’s not really AI…” – but AI is simply doing things that have in the past been done by humans, and the definition gets pushed further every year. Remember, AI used to be just having the Atari be able to play Pong with you!

[Image: example item template]

Type 2: AI Generation or Processing of Source Text

The second type is what the phrase “automated item generation” more likely brings to mind: upload a textbook or similar source to some software, and it spits back drafts of test questions. For example, see this article by von Davier (2019). Or alternatively, simply state a topic as a prompt and the AI will generate test questions.

Until the release of ChatGPT and other publicly available AI platforms to implement large language models (LLMs), this approach was only available to experts at large organizations.  Now, it is available to everyone with an internet connection.  If you use such products directly, you can provide a prompt such as “Write me 10 exam questions on Glaucoma, in a 4-option multiple choice format” and it will do so.  You can also update the instructions to be more specific, and add instructions such as formatting the output for your preferred method, such as QTI or JSON.
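If you would rather send such a prompt programmatically than type it into a chat window, something like the following sketch works with the OpenAI Python client; the model name and prompt are placeholders, and the drafts still need human review like any other AIG output.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

prompt = (
    "Write 10 exam questions on glaucoma in a 4-option multiple-choice format. "
    "Return the items as JSON with the fields: stem, options, key."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whichever model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # draft items, ready for SME review
```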

Alternatively, many assessment platforms now integrate with these products directly, so you can do the same thing, but have the items appear for you in the item banker under New status, rather than have them go to a raw file on your local computer that you then have to clean and upload.  FastTest  has such functionality available.

This technology has completely revolutionized how we develop test questions.  I’ve seen several research presentations on this, and they all find that AIG produces more items, of quality that is as good or even better than humans, in a fraction of the time!  But, they have also found that prompt engineering is critical, and even one word – like including “concise” in your prompt – can affect the quality of the items.

[Image: automated item generation screen in FastTest]

The Limitations of Automated Item Generation

Automated item generation (AIG) has revolutionized the way educational and psychological assessments are developed, offering increased efficiency and consistency. However, this technology comes with several limitations that can impact the quality and effectiveness of the items produced.

One significant limitation is the challenge of ensuring content validity. AIG relies heavily on algorithms and pre-defined templates, which may not capture the nuanced and comprehensive understanding of subject matter that human experts possess. This can result in items that are either too simplistic or fail to fully address the depth and breadth of the content domain.

Another limitation is the potential for over-reliance on statistical properties rather than pedagogical soundness. While AIG can generate items that meet certain psychometric criteria, such as difficulty and discrimination indices, these items may not always align with best practices in educational assessment or instructional design. This can lead to tests that are technically robust but lack relevance or meaningfulness to the learners.

Furthermore, the use of AIG can inadvertently introduce bias. Algorithms used in item generation are based on historical data and patterns, which may reflect existing biases in the data. Without careful oversight and adjustment, AIG can perpetuate or even exacerbate these biases, leading to unfair assessment outcomes for certain groups of test-takers.

Lastly, there is the issue of limited creativity and innovation. Automated systems generate items based on existing templates and rules, which can result in a lack of variety and originality in the items produced. This can make assessments predictable and less engaging for test-takers, potentially impacting their motivation and performance.

In conclusion, while automated item generation offers many benefits, it is crucial to address these limitations through continuous oversight, integration of expert input, and regular validation studies to ensure the development of high-quality assessment items.

How Can I Implement Automated Item Generation?

If you are a user of AI products like ChatGPT or Bard, you can work directly with them.  Advanced users can implement APIs to upload documents or fine-tune the machine learning models.  The aforementioned article by von Davier talks about such usage.

If you want to save time, FastTest provides a direct ChatGPT integration, so you can provide the prompt using the screen shown above, and items will then be automatically created in the item banking folder you specify, with the item naming convention you specify, tagged as Status=New and ready for review.  Items can then be routed through our configurable Item Review Workflow process, including functionality to gather modified-Angoff ratings.

Ready to improve your test development process?  Click here to talk to a psychometric expert.


The borderline group method of standard setting is one of the most common approaches to establishing a cutscore for an exam.  In comparison with the item-centered standard setting methods such as modified-Angoff, Nedelsky, and Ebel, there are two well-known examinee-centered methods (Jaeger, 1989), the contrasting groups method and the borderline group method (Livingston & Zieky, 1989). This post will focus on the latter one.

The concept of the borderline group method

Examinee-centered methods require participants to judge whether an individual examinee possesses adequate knowledge, skills, and abilities across specific content standards. The borderline group method is based on the idea of setting the passing score equal to the score expected from an examinee whose competencies are right on the borderline between adequate and inadequate.

How to perform the borderline group method

First of all, the judges are selected from those who are thoroughly familiar with the content examined and are knowledgeable about knowledge, skills, and abilities of individual examinees. Next, the judges engage in a discussion to develop a description of an examinee who is on the borderline between two extremes, mastery and non-mastery. Alternatively, the judges may be tasked to sort examinees into three categories: clearly competent, clearly incompetent, and those in-between.

After the description is agreed upon, borderline examinees need to be identified. The ultimate goal of the borderline group method is to distribute the borderline examinees’ scores and to find the median of that distribution (50th percentile), which would become the recommended cut score for the borderline group.

Why is the median used and not the mean, you might ask? The reason is that the median is much less affected by extremely high or extremely low values. This feature of the median is particularly important for the borderline group method, because an examinee with a very high or very low score is likely to not really belong in the group.
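A minimal sketch of that final step: take the scores of the examinees judged to be borderline (the values here are invented) and use the median, not the mean, as the recommended cut score.

```python
import statistics

# Hypothetical test scores of the examinees judged to be "borderline"
borderline_scores = [61, 58, 64, 59, 62, 95, 60, 57, 63]  # 95 looks like a misjudged examinee

cutscore = statistics.median(borderline_scores)
print(cutscore)                                       # 61 -- barely affected by the outlier
print(round(statistics.mean(borderline_scores), 1))   # 64.3 -- pulled upward by the outlier
```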

Analyzing the borderline group method

Advantages of this method:

  • Time efficient
  • Straightforward to implement

Disadvantages of this method:

  • Difficult to achieve consensus on the nature of borderline examinees
  • The cut score could have low validity if the number of borderline examinees is small

Why might the borderline group method work poorly, and how can this be tackled?

Possible issue: The judges could identify some examinees as borderline by mistake (e.g., their skills were difficult to judge), so the borderline group might contain examinees who do not belong in it.
Probable solution: Remind the judges not to include in the borderline group any examinees whose competencies they are not sure about.

Possible issue: The judges may base their judgements on something other than what the exam measures.
Probable solution: Give the judges appropriate instructions and have them agree with each other when defining a borderline examinee.

Possible issue: The judgements of individual standards regarding the examinees’ skills and abilities may differ greatly among judges.

There is also a risk that judges would be sensitive to errors of central tendency and, therefore, might assign a disproportionately large number of examinees to the borderline group if they do not have sufficient knowledge about individual examinees’ performances. Thus, it is key for implementing the borderline group method to pick highly competent judges.

Conclusion

Let’s summarize the steps needed to implement the borderline group method:

  • Select competent judges
  • Define the borderline level of examinees’ knowledge, skills, and abilities
  • Identify the borderline examinees
  • Obtain the test scores of the borderline examinees
  • Calculate the cut score as the median of the distribution of the borderline examinees’ test scores

 

References

Jaeger, R. M. (1989). Certification of student competence.

Livingston, S. A., & Zieky, M. J. (1989). A comparative study of standard-setting methods. Applied Measurement in Education, 2(2), 121-141.

 


The Ebel method of standard setting is a psychometric approach to establish a cutscore for tests consisting of multiple-choice questions. It is usually used for high-stakes examinations in the fields of higher education, medical and health professions, and for selecting applicants.

How is the Ebel method performed?

The Ebel method requires a panel of judges who would first categorize each item in a data set by two criteria: level of difficulty and relevance or importance. Then the panel would agree upon an expected percentage of items that should be answered correctly for each group of items according to their categorization.

It is crucial that the judges are experts in the field being examined; otherwise, their judgements would not be valid and reliable. Prior to the item rating process, the panelists should be given a sufficient amount of information about the purpose and procedures of the Ebel method. In particular, it is important that the judges understand the meaning of difficulty and relevance in the context of the current assessment.

The next stage is to determine what “minimally competent” performance means in the specific case, depending on the content. When everything is clear and all definitions are agreed upon, the experts classify each item by difficulty (easy, medium, or hard) and relevance (minimal, acceptable, important, or essential). In order to minimize the influence of the judges’ opinions on each other, individual ratings are recommended rather than consensus ratings.

Afterwards, judgements on the proportion of items expected to be answered correctly by minimally competent candidates need to be collected for each item category (e.g., easy and desirable). To save rating time, the grid proposed by Ebel and Frisbie (1972) might be used. It is worth mentioning, though, that Ebel ratings are content-specific, so values in the grid might happen to be too low or too high for a particular test.

[Image: example Ebel method data]

At the end, the Ebel method, like the modified-Angoff method, identifies a cut-off score for an examination based on the performance of candidates in relation to a defined standard (absolute), rather than how they perform in relation to their peers (relative). Ebel scores for each item and for the whole exam are calculated as the average of the scores provided by each expert: the number of items in each category is multiplied by the expected percentage of correct answers, and the total results are added to calculate the cutscore.
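Here is a minimal sketch of that arithmetic; the item counts and expected percentages below are invented for illustration and are not the values from the Ebel and Frisbie (1972) grid.

```python
# Hypothetical Ebel ratings: (difficulty, relevance) -> (number of items, expected proportion
# correct for a minimally competent candidate). All values are invented.
grid = {
    ("easy",   "essential"): (10, 0.90),
    ("easy",   "important"): (8,  0.80),
    ("medium", "essential"): (12, 0.70),
    ("medium", "important"): (10, 0.60),
    ("hard",   "essential"): (6,  0.50),
    ("hard",   "important"): (4,  0.40),
}

n_items = sum(count for count, _ in grid.values())
expected_correct = sum(count * pct for count, pct in grid.values())

print(n_items)                                        # 50 items on the form
print(round(expected_correct, 1))                     # 34.4 -- cutscore in number-correct terms
print(round(100 * expected_correct / n_items, 1))     # 68.8 -- cutscore as a percentage
```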

Pros of using Ebel

  • This method provides an overview of a test difficulty
  • Cut-off score is identified prior to an examination
  • It is relatively easy for experts to perform

Cons of using Ebel

  • This method is time-consuming and costly
  • Evaluation grid is hard to get right
  • Digital software is required
  • Back-up is necessary

Conclusion

The Ebel method is a rather complex standard-setting process compared to others, due to the need for content analysis, and it therefore imposes a burden on the standard-setting panel. However, the Ebel method considers the relevance of the test items and the expected proportion of correct answers from minimally competent candidates, including borderline candidates. Thus, even though the procedure is complicated, the results are very stable and very close to the actual cut-off scores.

References

Ebel, R. L., & Frisbie, D. A. (1972). Essentials of educational measurement. Prentice-Hall.


Item parameter drift (IPD) refers to the phenomenon in which the parameter values of a given test item change over multiple testing occasions within the item response theory (IRT) framework. This phenomenon is often relevant to student progress monitoring assessments where a set of items is used several times in one year, or across years, to track student growth. Observing trends in student academic achievement depends upon stable linking (anchoring) between assessment moments over time; if the item parameters are not stable, the scale is not stable, and time-to-time comparisons are not either. Some psychometricians consider IPD a special case of differential item functioning (DIF), but these two are different issues and should not be confused with each other.

Reasons for Item Parameter Drift

IRT modeling is attractive for the assessment field because of its property of item parameter invariance: the parameters do not depend on the particular sample of test-takers used to estimate them. That assumption enables important things like strong equating of tests across time and the possibility of computerized adaptive testing. However, item parameters are not always invariant. There are plenty of reasons that could lie behind IPD. One possibility is curricular change based on assessment results, or instruction that is more concentrated. Other feasible reasons are item exposure, cheating, or curricular misalignment with some standards. No matter what has led to IPD, its presence can cause biased estimates of student ability. In particular, IPD can be highly detrimental to reliability and validity in the case of high-stakes examinations. Therefore, it is crucial to detect item parameter drift when anchoring assessment occasions over time, especially when the same anchor items are used repeatedly.

Perhaps the simplest example is item exposure.  Suppose a 100-item test is delivered twice per year, with 20 items always remaining as anchors.  Eventually students will share memories and the topics of those will become known.  More students will get them correct over time, making the items appear easier.

Identifying IPD

There are several methods of detecting IPD. Some of them are simpler because they do not require estimation of anchoring constants, and some of them are more difficult due to the need of that estimation. Simple methods include the “3-sigma p-value”, the “0.3 logits”, and the “3-sigma IRT” approaches. Complex methods involve the “3-sigma scaled IRT”, the “Mantel-Haenszel”, and the “area between item characteristic curves”, where the last two approaches are based on consideration that IPD is a special case of DIF, and therefore there is an opportunity to draw upon a massive body of existing research on DIF methodologies.
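As a minimal sketch of the simplest of these, the “0.3 logits” rule, the code below flags anchor items whose difficulty estimates shift by more than 0.3 logits between two administrations; the item names and values are invented, and the b-parameters are assumed to already be on a common scale.

```python
# b-parameter (difficulty) estimates for the same anchor items at two time points,
# assumed to already be linked to a common logit scale. Values are invented.
b_time1 = {"item01": -0.50, "item02": 0.20, "item03": 1.10, "item04": -1.30}
b_time2 = {"item01": -0.45, "item02": 0.85, "item03": 1.05, "item04": -1.35}

DRIFT_THRESHOLD = 0.30  # logits

flagged = [item for item in b_time1
           if abs(b_time2[item] - b_time1[item]) > DRIFT_THRESHOLD]
print(flagged)  # ['item02'] -- refer to SMEs, or drop from the anchor set and re-link
```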

Handling IPD

Not all psychometricians think that removal of outlying anchor items is the best solution for item parameter drift, but if we do not eliminate drifting items from the process of equating test scores, they will affect the transformation of ability estimates, not only the item parameters. Imagine an examination which classifies examinees as either failing or passing, or into four performance categories; in the case of IPD, 10-40% of students could be misclassified. In high-stakes testing situations where classification of examinees implies certain sanctions or rewards, IPD scenarios should be minimized as much as possible. As soon as it is detected that some items exhibit IPD, these items should be referred to the subject-matter experts for further investigation. Alternatively, if a faster decision is needed, such flagged anchor items should be removed immediately. Afterwards, psychometricians need to re-estimate the linking constants and evaluate IPD again. This process should be repeated until none of the anchor items shows item parameter drift.


The concept of Speeded vs Power Test is one of the ways of differentiating psychometric or educational assessments. In the context of educational measurement and depending on the assessment goals and time constraints, tests are categorized as speeded and power. There is also the concept of a Timed test, which is really a Power test. Let’s look at these types more carefully.

Speeded test

In a speeded test, examinees are limited in time and expected to answer as many questions as possible, but the time limit is unreasonably short, preventing even the best examinees from completing the test and therefore forcing speed.  Items are delivered sequentially, starting from the first one and continuing to the last.  All items are usually relatively easy, though sometimes they increase in difficulty.  If the time limit and difficulty level are correctly set, none of the test takers will be able to reach the last item before the time limit is reached. A speeded test is supposed to demonstrate how fast an examinee can respond to questions within a time limit. In this case, examinees’ answers are not as important as their speed of answering questions. The total score is usually computed as the number of questions answered correctly when the time limit is met, and differences in scores are mainly attributed to individual differences in speed rather than knowledge.

An example of this might be a mathematical calculation speed test. Examinees are given 100 multiplication problems and told to solve as many as they can in 20 seconds. Most examinees know the answers to all the items; it is a question of how many they can finish. Another might be a 10-key task, where examinees are given a list of 100 5-digit strings and told to type as many as they can in 20 seconds.

Pros of a speeded test:

  • A speeded test is appropriate when you actually want to test the speed of examinees; the 10-key task above would be useful in selecting data entry clerks, for example. The concept of “knowledge of 5-digit strings” in this case is not relevant and doesn’t even make sense.
  • Tests can sometimes be very short but still discriminating.
  • When a test is a mixture of items in terms of their difficulty, examinees might save some time when responding to easier items in order to spend it on more difficult items. This can create an increased spread in scores.

Cons of a speeded test:

  • In most situations, a test is used to evaluate knowledge, not speed.
  • The nature of the test provokes examinees to commit errors even when they know the answers, which can be stressful.
  • A speeded test does not consider individual peculiarities of examinees.

Power test

A power test provides examinees with sufficient time so that they could attempt all items and express their true level of knowledge or ability. Therefore, this testing category focuses on assessing knowledge, skills, and abilities of the examinees.  The total score is often computed as a number of questions answered correctly (or with item response theory), and individual differences in scores are attributed to differences in ability under assessment, not to differences in basic cognitive abilities such as processing speed or reaction time.

There is also the concept of a Timed Test. This has a time limit, but the limit is NOT a major factor in how examinees respond to questions, nor does it affect their scores. For example, the time limit might be set so that 95% of examinees are not affected at all, and the remaining 5% are only slightly hurried. This is done with the CAT-ASVAB.

Pros of a power test:

  • There are no time restrictions for test-takers
  • A power test is great for evaluating knowledge, skills, and abilities of examinees
  • A power test reduces the chances of examinees committing errors when they know the answers
  • A power test considers individual peculiarities of examinees

Cons of a power test:

  • It can be time consuming (some of these exams are 8 hours long or even more!)
  • This test format sometimes does not suit competitive examinations because of administrative issues (too much test time across too many examinees)
  • Power test is sometimes bad for discriminative purposes, since all examinees have high chances to perform well.  There are certainly some pass/fail knowledge exams where almost everyone passes.  But the purpose of those exams is not to differentiate for selection, but to make sure students have mastered the material, so this is a good thing in that case.

Speeded test vs power test

The categorization as a speeded or power test depends on the assessment purpose. For instance, an arithmetic test for Grade 8 students might be a speeded test when it contains many relatively easy questions, but the same test could be a power test for Grade 7 students. Thus, a speeded test starts to measure power when all of the items can be answered correctly within the limited time period. Similarly, a power test turns into a speeded test once a restrictive time limit is imposed. Today, a pure speeded or power test is rare. Usually, what we meet in practice is a mixture of both, typically a Timed Test.

Below you may find a comparison of a speeded vs power test, in terms of the main features.

 

  • Time limit: In a speeded test, the time limit is fixed and affects all examinees. In a power test, there is no time limit, or there is one that affects only a small percentage of examinees.
  • Goal: A speeded test evaluates speed only, or a combination of speed and correctness. A power test evaluates correctness, in the sense of the knowledge, skills, and abilities of test-takers.
  • Difficulty: Questions on a speeded test are relatively easy in nature. Questions on a power test are relatively difficult in nature.
  • Errors: The speeded format increases the chances of committing errors. The power format reduces the chances of committing errors.

 


Distractor analysis refers to the process of evaluating the performance of incorrect answers vs the correct answer for multiple choice items on a test.  It is a key step in the psychometric analysis process to evaluate item and test performance as part of documenting test reliability and validity.

What is a distractor?

An item distractor, also known as a foil or a trap, is an incorrect option for a selected-response item on an assessment. Multiple-choice questions always have a few options for an answer, one of which is the key (correct answer), and the remaining ones are distractors (wrong answers). It is worth noting that distractors should not be just any wrong answers; they have to be probable answers in case an examinee makes a mistake when looking for the right option. In short, distractors are feasible answers that an examinee might select when making misjudgments or having partial knowledge/understanding.  A great example is later in this article with the word “confectioner.”

[Image: parts of an item – stem, options, and distractors]

What makes a good item distractor?

One word: plausibility.  We need the item distractor to attract examinees.  If it is so irrelevant that no one considers it, then it does not do any good to include it in the item.  Consider the following item.

 

   What is the capital of the United States of America?

 A. Los Angeles

 B. New York

 C. Washington, D.C.

 D. Mexico City

 

The last option is quite implausible – not only is it outside the USA, but it mentions another country in the name, so no student is likely to select this.  This then becomes a three-horse race, and students have a 1 in 3 chance of guessing.  This certainly makes the item easier. How much do distractors matter?  Well, how much is the difficulty affected by this new set?

 

   What is the capital of the United States of America?

 A. Paris

B. Rome

 C. Washington, D.C.

 D. Mexico City  

 

In addition, the distractor needs to have negative discrimination.  That is, while we want the correct answer to attract the more capable examinees, we want the distractors to attract the lower examinees.  If you have a distractor that you thought was incorrect, and it turns out to attract all the top students, you need to take a long, hard look at that question! To calculate discrimination statistics on distractors, you will need software such as  Iteman.

What makes a bad item distractor?

Obviously, implausibility and negative discrimination are frequent offenders.  But if you think more deeply about plausibility, the key is actually plausibility without being arguably correct.  This can be a fine line to walk, and is a common source of problems for items.  You might have a medical item that presents a scenario and asks for a likely diagnosis; perhaps one of the distractors is very unlikely so as to be essentially implausible, but it might actually be possible for a small subset of patients under certain conditions.  If the author and item reviewers did not catch this, the examinees probably will, and this will be evident in the statistics.  This is one of the reasons it is important to do psychometric analysis of test results, including distractor analysis to evaluate the effectiveness of incorrect options in multiple-choice questions.  In fact, accreditation standards often require you to go through this process at least once a year.

Why do we need distractor analysis?

After a test form is delivered to examinees, distractor analysis should be implemented to make sure that all answer options work well and that the item is performing well and defensibly. For example, it is expected that around 40-95% of students will pick the correct answer, and that each distractor will be chosen by a smaller number of examinees than the number choosing the key, with an approximately equal distribution of choices across the distractors. Distractor analysis is usually done with classical test theory, even if item response theory is used for scoring, equating, and other tasks.

How to do a distractor analysis

There are three main aspects:

  1. Option frequencies/proportions
  2. Option point-biserial
  3. Quantile plot

The option frequencies/proportions just refers to the analysis of how many examinees selected each answer.  Usually it is a proportion and labeled as “P.”  Did 70% choose the correct answer while the remaining 30% were evenly distributed amongst the 3 distractors?  Great.  But if only 40% chose the correct answer and 45% chose one of the distractors, you might have a problem on your hands.  Perhaps the answer specified as the key was not actually correct.

The point-biserials (Rpbis) will help you evaluate if this is the case.  The point-biserial is an item-total correlation, meaning that we correlate scores on the item with the total score on the test, which is a proxy index of examinee ability.  If 0.0, there is no relationship, which means the item is not correlated with ability, and therefore probably not doing any good.  If negative, it means that the lower-ability students are selecting it more often; if positive, it means that the higher-ability students are selecting it more often.  We want the correct answer to have a positive value and the distractors to have negative values.  This is one of the most important points in determining if the item is performing well.

In addition, there is a third approach, which is visual, called the quantile plot.  It is very useful for diagnosing how an item is working and how it might be improved.  This splits the sample up into blocks ordered by performance, such as 5 groups where Group 1 is the 0-20th percentile, Group 2 is the 21st-40th, etc.  We expect the smartest group to have a high proportion of examinees selecting the correct answer and a low proportion selecting the distractors, and vice versa.  You can see how this aligns with the concept of the point-biserial.  An example of this is below.

Note that the P and point-biserial for the correct answer serve as “the” statistics for the item as a whole.  The P for the item is called the item difficulty or facility statistic.
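Here is a minimal sketch of the first two aspects, option proportions and option point-biserials, computed on a small invented dataset; dedicated software such as Iteman reports these alongside the quantile plot.

```python
import numpy as np

# Hypothetical responses of 20 examinees to one 4-option item, plus their total test scores
responses = np.array(list("CCABCDCCBACCBCCDCABC"))
totals = np.array([28, 25, 14, 12, 30, 9, 27, 22, 15, 11,
                   26, 29, 13, 24, 31, 10, 23, 8, 16, 21])
key = "C"

for option in "ABCD":
    chose = (responses == option)
    p = chose.mean()                                     # option proportion ("P")
    # point-biserial: correlation between choosing the option (0/1) and total score
    rpbis = np.corrcoef(chose.astype(float), totals)[0, 1]
    label = "key" if option == key else "distractor"
    print(f"{option} ({label}): P = {p:.2f}, Rpbis = {rpbis:+.2f}")
```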

Examples of distractor analysis

Here is an example of a good item.  The P is medium (67% correct) and the Rpbis is strongly positive for the correct answer while strongly negative for the incorrect answers.  This translates to a clean quantile plot where the curve for the correct answer (B) goes up while the curves for the incorrect answers go down.  An ideal situation.

[Image: distractor analysis table and quantile plot for a well-performing item]

Now contrast that with the following item.  Here, only 12% of examinees got this correct, and the Rpbis was negative.  Answer C had 21% and a nicely positive Rpbis, as well as a quantile curve that goes up.  This item should be reviewed to see if C is actually correct.  Or B, which had the most responses.  Most likely, this item will need a total rewrite!

[Image: distractor analysis table and quantile plot for a poorly performing item]

Note that an item can be extremely difficult but still perform well.  Here is an example where the distractor analysis supports continued use of the item.  The distractor is just extremely attractive to lower-ability students; they think that a confectioner makes confetti, since those two words look the closest.  Look how strong the Rpbis is here, and very negative for that distractor.  This is a good result!

[Image: distractor analysis for the “confectioner” item]