Posts on psychometrics: The Science of Assessment


Content validity is an aspect of validity, a term that psychometricians use to refer to evidence that interpretations of test scores are supported.  For example, predictive validity provides evidence that a pre-employment test will predict job performance, tenure, and other important criteria.  Content validity, on the other hand, focuses on evidence that the content of the test covers what it should cover.

What is Content Validity?

Content validity refers to the extent to which a measurement instrument (e.g., a test, questionnaire, or survey) accurately and adequately measures the specific content or construct it is designed to assess. In simpler terms, it assesses whether the questions or items included in an assessment are relevant and representative of the subject matter or concept under investigation.

Example 1: You are working on a benchmark test for 5th grade mathematics in the USA.  You would likely want to ensure that all items align to the Common Core State Standards for the 5th grade mathematics curriculum.

Example 2: You are working on a certification exam for widgetmakers.  You should make sure that all items align to the publicly posted blueprint for this certification.  That blueprint, in turn, should not have been defined willy-nilly – it should have been built on the results of a formal job task analysis study.

The Importance of Content Validity

  • Drives Accurate Measurement: Content validity helps in ensuring that the assessment tool is measuring what it’s intended to measure. This is critical for drawing meaningful conclusions and making informed decisions based on the results.
  • Enhances Credibility: When your assessment has high content validity, it enhances the credibility and trustworthiness of your findings. It demonstrates that you’ve taken the time to design a valid instrument. This is often referred to as face validity – which is not a “real” type of validity that psychometricians consider, but refers to whether someone off the street can look at the test and say “yeah, it looks like all the items are on widgetmaking.”
  • Reduces Bias: Using assessment items that are not content-valid can introduce bias and inaccuracies into your results. By maintaining content validity, you reduce the risk of skewed or unreliable data.
  • Improves Decision-Making: Organizations often rely on assessments to make important decisions, such as hiring employees, designing educational curricula, or evaluating the effectiveness of marketing campaigns. Content-valid assessments provide a solid foundation for making these decisions.
  • Legal Defensibility: In general, if you deliver a test to select employees, you need to show either content validity (e.g., a test on Microsoft Excel for bookkeepers) or predictive validity (e.g., conscientiousness is a personality trait rather than a job skill, but it is probably related to success as a bookkeeper).  A similar notion applies to other types of tests.

 

How to Assess Content Validity

There are various methods to assess content validity, such as expert reviews, pilot testing, and statistical techniques. One common method is to gather a panel of experts in the subject matter and have them review the assessment items to ensure that they align with the content domain.  Of course, if all the items are written directly to the blueprints in the first place, and reviewed before they even become part of the pool of active items, a post-hoc review like that is not necessary.

There has been more recent research on the application of machine learning to evaluate content, including the option of flagging enemy items by evaluating the semantic distance between the content of any given pair of items.
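
As a rough illustration of that idea, here is a minimal sketch that flags candidate enemy pairs with TF-IDF text similarity; the item texts, the 0.80 cutoff, and the scikit-learn approach are all illustrative assumptions, not the method used by any particular platform.

```python
# A minimal sketch: flag potential "enemy" item pairs by comparing item texts.
# Thresholds and item texts here are purely illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "ITEM_001": "Add the fractions 1/2 and 1/3 and give the answer in lowest terms.",
    "ITEM_002": "What is the sum of the fractions 1/2 and 1/3 in lowest terms?",
    "ITEM_003": "Find the area of a rectangle with sides 4 cm and 9 cm.",
}

ids = list(items.keys())
tfidf = TfidfVectorizer().fit_transform(items.values())
sim = cosine_similarity(tfidf)  # pairwise similarity matrix, values from 0 to 1

THRESHOLD = 0.80  # illustrative cutoff for human review
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        if sim[i, j] >= THRESHOLD:
            print(f"Possible enemy pair: {ids[i]} / {ids[j]} (similarity {sim[i, j]:.2f})")
```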

If the test is multidimensional, a statistical approach known as factor analysis can help, to see if the items actually load on the dimensions they should.
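
Here is a minimal sketch of that kind of dimensionality check, using exploratory factor analysis on a simulated examinee-by-item score matrix; the simulated data and the scikit-learn FactorAnalysis call are assumptions for illustration only.

```python
# A rough sketch of checking dimensionality with exploratory factor analysis,
# assuming `responses` is an examinee-by-item score matrix (simulated here).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 2))                    # two latent traits
loadings_true = np.array([[1, 0], [1, 0], [1, 0],    # items 1-3 load on trait 1
                          [0, 1], [0, 1], [0, 1]])   # items 4-6 load on trait 2
responses = theta @ loadings_true.T + rng.normal(scale=0.5, size=(500, 6))

fa = FactorAnalysis(n_components=2).fit(responses)
print(np.round(fa.components_.T, 2))  # items should load on their intended factor
```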

Conclusion

In summary, content validity is an essential aspect of assessment design that ensures the questions or items used in an assessment are appropriate, relevant, and representative of the construct being measured. It plays a significant role in enhancing the accuracy, credibility, and overall quality of your assessments. Whether you’re a student preparing for an exam, a researcher developing a survey, or a business professional creating a customer feedback form, understanding and prioritizing content validity will help you achieve more reliable and meaningful results. So, next time you’re tasked with creating or using an assessment tool, remember the importance of content validity and its impact on the quality of your data and decision-making processes.

However, it is not the only aspect of validity.  The documentation of validity is a complex process that is often ongoing.  You will also need data on the statistical performance of the test (e.g., alpha reliability), evaluation of bias (e.g., differential item functioning), possibly predictive validity, and more.  Therefore, it’s important to work with a psychometrician who can help you understand what is involved and ensure that the test meets both international standards and the reason that you are building the test in the first place!

Predictive Validity: Making Predictions and Decisions Based on Test Scores

Predictive Validity is a type of test score validity which evaluates how well a test predicts something in the future, usually with a goal of making more effective decisions about people.  For instance, it is often used in the world of pre-employment testing, where we want a test to predict things like job performance or tenure, so that a company can hire people that do a good job and stay a long time – a very good result for the company, and worth the investment.

Validity, in a general sense, is evidence that we have to support intended interpretations of test scores.  There are different types of evidence that we can gather to do so.  Predictive validity refers to evidence that the test predicts things that it should predict.  If we have quantitative data to support such conclusions, it makes the test more defensible and can improve the efficiency of its use.  For example, if a university admissions test does a great job of predicting success at university, then universities will want to use it to select students that are more likely to succeed.

Examples of Predictive Validity

Predictive validity evidence can be gathered for a variety of assessment types.

  1. Pre-employment: Since the entire purpose of a pre-employment test is to positively predict good things like job performance or negatively predict bad things like employee theft or short tenure, a ton of effort goes into developing tests to function in this way, and then documenting that they do.
  2. University Admissions: Like pre-employment testing, the entire purpose of university admissions exams is predictive.  They should positively correlate with good things (first year GPA, four year graduation rate) and negatively predict the negative outcomes like academic probation or dropping out.
  3. Prep Exams: Preparatory or practice tests are designed to predict performance on their target test.  For example, if a prep test is designed to mimic the Scholastic Aptitude Test (SAT), then one way to validate it is to gather the SAT scores later, after the examinees take it, and correlate with the prep test.
  4. Certification & Licensure: The primary purpose of credentialing exams is not to predict job performance, but to ensure that the candidate has mastered the material necessary to practice their profession.  Therefore, predictive validity is not important, compared to content-related validity such as blueprints based on a job analysis. However, some credentialing organizations do research on the “value of certification” linking it to improved job performance, reduced clinical errors, and often external third variables such as greater salary.
  5. Medical/Psychological: There are some assessments that are used in a clinical situation, and the predictive validity is necessary in that sense.  For instance, there might be an assessment of knee pain used during initial treatment (physical therapy, injections) that can be predictively correlated with later surgery.  The same assessment might then be used after the surgery to track rehabilitation.

Predictive Validity in Pre-employment Testing

The case of pre-employment testing is perhaps the most common use of this type of validity evidence.  A recent study (Sackett, Zhang, Berry, & Lievens, 2022) presented a meta-analysis of the various types of pre-employment tests and other selection procedures (e.g., structured interviews), comparing their predictive power.  This was a modern update to the classic article by Schmidt & Hunter (1998).  While in the past the consensus has been that cognitive ability tests provide the best predictive power in the widest range of situations, the new article suggests otherwise.  It recommends the use of structured interviews and job knowledge tests, which are more targeted towards the role in question, so it is not surprising that they perform well.  This in turn suggests that you should not buy pre-fab ability tests and use them in a shotgun approach with the assumption of validity generalization, but instead leverage an online testing platform like FastTest that allows you to build high-quality exams that are more specific to your organization.

Why do we need predictive validity?

There are a number of reasons that you might need predictive validity for an exam.  They almost always involve cases where the test is used to make important decisions about people.

  1. Smarter decision-making: Predictive validity provides valuable insights for decision-makers. It helps recruiters identify the most suitable candidates, educators tailor their teaching methods to enhance student learning, and universities to admit the best students.
  2. Legal defensibility: If a test is being used for pre-employment purposes, it is legally required in the USA to either show that the test is obviously job-related (e.g., knowledge of Excel for a bookkeeping job) or that you have hard data demonstrating predictive validity.  Otherwise, you are open to a lawsuit.
  3. Financial benefits: Often, the reason for needing improved decisions is very financial.  It is often costly for large companies to recruit and train personnel.  It’s entirely possible that spending $100,000 per year on pre-employment tests could save millions of dollars in the long run.
  4. Benefits to the examinee: Sometimes, there is directly a benefit to the examinee.  This is often the case with medical assessments.

How to implement predictive validity

The simplest case is that of regression and correlation.  How well does the test score correlate with the criterion variable?  Below is an oversimplified example of predicting university GPA from scores on an admissions test.  Here, the correlation is 0.858 and the regression is GPA = 0.34*SCORE + 0.533.  Of course, in real life, you would not see this strong of a predictive power, as there are many other factors which influence GPA.

[Figure: scatterplot and fitted regression line predicting university GPA from admissions test scores]
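
A minimal sketch of that calculation is below, using made-up numbers (not the data behind the figure); it simply computes the correlation and fits the regression line with NumPy.

```python
# A toy sketch of simple predictive validity evidence: correlate admissions
# test scores with later GPA and fit a regression line. Numbers are invented.
import numpy as np

scores = np.array([ 8, 10, 11, 13, 14, 16, 17, 19])          # admissions test scores
gpa    = np.array([1.9, 2.3, 2.5, 2.8, 3.0, 3.3, 3.6, 3.9])  # later university GPA

r = np.corrcoef(scores, gpa)[0, 1]             # predictive validity coefficient
slope, intercept = np.polyfit(scores, gpa, 1)  # fitted line: GPA = slope*SCORE + intercept
print(f"r = {r:.3f}, GPA = {slope:.2f}*SCORE + {intercept:.2f}")
```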

Advanced Issues

It is usually not a simple situation of two straightforward variables, such as one test and one criterion variable.  Often, there are multiple predictor variables (quantitative reasoning test, MS Excel knowledge test, interview, rating of the candidate’s resume), and moreover there are often multiple criterion variables (job performance ratings, job tenure, counterproductive work behavior).  When you use multiple predictors and a second or third predictor adds some bit of predictive power over that of the first variable, this is known as incremental validity.
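
Below is a hedged sketch of how incremental validity might be quantified: compare the R-squared of a model with one predictor to a model with two. The predictor names and simulated data are illustrative assumptions.

```python
# A sketch of incremental validity: does an Excel knowledge test add
# predictive power (R-squared) beyond a reasoning test? Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 300
reasoning = rng.normal(size=n)
excel     = rng.normal(size=n)
job_perf  = 0.5 * reasoning + 0.3 * excel + rng.normal(scale=0.8, size=n)

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_base = r_squared(reasoning.reshape(-1, 1), job_perf)
r2_full = r_squared(np.column_stack([reasoning, excel]), job_perf)
print(f"R2 reasoning only: {r2_base:.3f}, with Excel test added: {r2_full:.3f}, "
      f"incremental validity: {r2_full - r2_base:.3f}")
```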

You can also implement more complex machine learning models, such as neural networks or support vector machines, if they fit and you have sufficient sample size.

When performing such validation, you need to also be aware of bias.  There can be test bias where the test being used as a predictor is biased against a subgroup.  There can also be predictive bias where two subgroups have the same performance on the test, but one is overpredicted for the criterion and the other is underpredicted.  A rule of thumb for investigating this in the USA is the four-fifths rule.
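
For the four-fifths rule specifically, the check is a simple ratio of selection rates; the sketch below uses invented counts purely for illustration.

```python
# A simple sketch of the four-fifths (80%) rule for adverse impact: compare
# selection rates between a focal (minority) and reference (majority) group.
# Counts are illustrative assumptions, not real data.
hired = {"reference": 60, "focal": 18}
applied = {"reference": 120, "focal": 50}

rates = {g: hired[g] / applied[g] for g in hired}
impact_ratio = rates["focal"] / rates["reference"]
print(rates, f"impact ratio = {impact_ratio:.2f}")
if impact_ratio < 0.80:
    print("Selection rate ratio falls below four-fifths; investigate further.")
```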

Summary

Predictive validity is one type of test score validity, referring to evidence that scores from a certain test can predict their intended target variables.  The most common application of it is to pre-employment testing, but it is useful in other situations as well.  But validity is an extremely important and wide-ranging topic, so it is not the only type of validity evidence that you should gather.

Classical Test Theory vs. Item Response Theory

Classical Test Theory and Item Response Theory (CTT & IRT) are the two primary psychometric paradigms.  That is, they are mathematical approaches to how tests are analyzed and scored.  They differ quite substantially in substance and complexity, even though they both nominally do the same thing, which is statistically analyze test data to ensure reliability and validity.  CTT is quite simple, easily understood, and works with small samples, but IRT is far more powerful and effective, so it is used by most big exams in the world.

So how are they different, and how can you effectively choose the right solution?  First, let’s start by defining the two.  This is just a brief intro; there are entire books dedicated to the details!

Classical Test Theory

CTT is an approach that is based on simple mathematics; primarily averages, proportions, and correlations.  It is more than 100 years old, but is still used quite often, with good reason. In addition to working with small sample sizes, it is very simple and easy to understand, which makes it useful for working directly with content experts to evaluate, diagnose, and improve items or tests.

Download free version of Iteman for CTT Analysis

 


 

Item Response Theory

IRT is a much more complex approach to analyzing tests. Moreover, it is not just for analyzing; it is a complete psychometric paradigm that changes how item banks are developed, test forms are designed, tests are delivered (adaptive or linear-on-the-fly), and scores produced. There are many benefits to this approach that justify the complexity, and there is good reason that all major examinations in the world utilize IRT.  Learn more about IRT here.

 

Download free version of Xcalibre for IRT Analysis

 

Similarities between Classical Test Theory and Item Response Theory

CTT & IRT are both foundational frameworks in psychometrics aimed at improving the reliability and validity of psychological assessments. Both methodologies involve item analysis to evaluate and refine test items, ensuring they effectively measure the intended constructs. Additionally, IRT and CTT emphasize the importance of test standardization and norm-referencing, which facilitate consistent administration and meaningful score interpretation. Despite differing in specific techniques, both frameworks ultimately strive to produce accurate and consistent measurement tools. These shared goals highlight the complementary nature of IRT and CTT in advancing psychological testing.

Differences between Classical Test Theory and Item Response Theory

Test-Level and Subscore-Level Analysis

CTT statistics for total scores and subscores include coefficient alpha reliability, standard error of measurement (a function of reliability and SD), descriptive statistics (average, SD…), and roll-ups of item statistics (e.g., mean Rpbis).

With IRT, we utilize the same descriptive statistics, but the scores are now different (theta, not number-correct).  The standard error of measurement is now a conditional function, not a single number.  The entire concept of reliability is dropped and replaced with the concept of precision, which is expressed as that same conditional function.

Item-Level Analysis

Item statistics for CTT include proportion-correct (difficulty), point-biserial (Rpbis) correlation (discrimination), and a distractor/answer analysis. If there is demographic information, CTT analysis can also provide a simple evaluation of differential item functioning (DIF).

IRT replaces the difficulty and discrimination with its own quantifications, called simply b and a.  In addition, it can add a c parameter for guessing effects. More importantly, it creates entirely new classes of statistics for partial credit or rating scale items.
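
To make the b/a/c parameterization concrete, here is a minimal sketch of the three-parameter logistic item response function; the parameter values are illustrative, not from any real calibration.

```python
# A minimal sketch of the dichotomous 3PL item response function, using the
# b (difficulty), a (discrimination), and c (guessing) parameters described above.
import numpy as np

def p_correct_3pl(theta, a=1.0, b=0.0, c=0.25, D=1.7):
    """Probability of a correct response under the 3PL model (illustrative parameters)."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

for theta in (-2, 0, 2):
    print(theta, round(p_correct_3pl(theta), 3))
```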

Scoring

CTT scores tests with traditional scoring: number-correct, proportion-correct, or sum-of-points.  CTT interprets test scores based on the total number of correct responses, assuming all items contribute equally.  IRT scores examinees directly on a latent scale, which psychometricians call theta, allowing for more nuanced and precise ability estimates.

Linking and Equating

Linking and equating is a statistical analysis to determine comparable scores on different forms; e.g., Form A is “two points easier” than Form B and therefore a 72 on Form A is comparable to a 70 on Form B. CTT has several methods for this, including the Tucker and Levine methods, but there are methodological issues with these approaches. These issues, and other issues with CTT, eventually led to the development of IRT in the 1960s and 1970s.

IRT has methods to accomplish linking and equating which are much more powerful than CTT, including anchor-item calibration or conversion methods like Stocking-Lord. There are other advantages as well.
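
As a simple illustration of the underlying idea (using the mean/sigma method rather than Stocking-Lord), the sketch below estimates linking constants from anchor-item difficulty parameters; the b values are invented.

```python
# A sketch of the simpler mean/sigma linking method (not Stocking-Lord):
# estimate the A and B constants that place new-form parameters onto the
# old form's scale, using the anchor items' difficulty (b) estimates.
import numpy as np

b_old = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])   # anchor item b's, old calibration
b_new = np.array([-1.0, -0.2, 0.3, 1.0, 1.7])   # same anchors, new calibration

A = b_old.std() / b_new.std()
B = b_old.mean() - A * b_new.mean()
print(f"theta_old = {A:.3f} * theta_new + {B:.3f}")
```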

Vertical Scaling

One major advantage of IRT, as a corollary to the strong linking/equating, is that we can link/equate not just across multiple forms in one grade, but from grade to grade. This produces a vertical scale. A vertical scale can span across multiple grades, making it much easier to track student growth, or to measure students that are off-grade in their performance (e.g., 7th grader that is at a 5th grade level). A vertical scale is a substantial investment, but is extremely powerful for K-12 assessments.

Sample Sizes

Classical test theory can work effectively with 50 examinees, and provide useful results with as few as 20.  Depending on the IRT model you select (there are many), the minimum sample size can be 100 to 1,000.

Sample- and Test-Dependence

CTT analyses are sample-dependent and test-dependent, which means that such analyses are performed on a single test form and set of students. It is possible to combine data across multiple test forms to create a sparse matrix, but this has a detrimental effect on some of the statistics (especially alpha), even if the test is of high quality, and the results will not reflect reality.

For example, if Grade 7 Math has 3 forms (beginning, middle, end of year), it is conceivable to combine them into one “super-matrix” and analyze together. The same is true if there are 3 forms given at the same time, and each student randomly receives one of the forms. In that case, 2/3 of the matrix would be empty, which psychometricians call sparse.

Distractor Analysis

Classical test theory will analyze the distractors of a multiple choice item.  IRT models, except for the rarely-used Nominal Response Model, do not.  So even if you primarily use IRT, psychometricians will also use CTT for this.

Guessing


IRT has a parameter to account for guessing, though some psychometricians argue against its use.  CTT has no effective way to account for guessing.

Adaptive Testing

There are rare cases where adaptive testing (personalized assessment) can be done with classical test theory.  However, it pretty much requires the use of item response theory for one important reason: IRT puts people and items onto the same latent scale.

Linear Test Design

CTT and IRT differ in how test forms are designed and built.  CTT works best when there are lots of items of middle difficulty, as this maximizes the coefficient alpha reliability.  However, there are definitely situations where the purpose of the assessment is otherwise.  IRT provides stronger methods for designing such tests, and then scoring as well.

So… How to Choose?

There is no single best answer to the question of CTT vs. IRT.  You need to evaluate the aspects listed above, and in some cases other aspects (e.g., financial, or whether you have staff available with the expertise in the first place).  In many cases, BOTH are necessary.  This is especially true because IRT does not provide an effective and easy-to-understand distractor analysis that you can use to discuss with subject matter experts.  It is for this reason that IRT software will typically produce CTT analysis too, though the reverse is not true.

IRT is very powerful, and can provide additional information about tests if used just for analyzing results to evaluate item and test performance.  A researcher might choose IRT over CTT for its ability to provide detailed item-level data, handle varying item characteristics, and improve the precision of ability estimates.  IRT’s flexibility and advanced modeling capabilities make it suitable for complex assessments and adaptive testing scenarios.

However, IRT is really only useful if you are going to make it your psychometric paradigm, thereby using it in the list of activities above, especially IRT scoring of examinees.  Otherwise, IRT analysis is merely another way of looking at test and item performance, one that will correlate substantially with CTT.

Contact Us To Talk With An Expert

 

All Psychometric Models Are Wrong

The British statistician George Box is credited with the quote, “All models are wrong but some are useful.”  As psychometricians, it is important that we never forget this perspective.  We cannot be so haughty as to think that our psychometric models actually represent the true underlying phenomena and any data that does not fit nicely is just noise.  We need to remember that everything we do is an approximation, and respect the balance between parsimony and parameterization.

Really… All psychometric models are wrong?

Yeah, there is no TRUE model that perfectly describes the interaction between an examinee and a test item.  Obviously the probability of a correct response is primarily due to important factors such as examinee ability, item difficulty, item quality, the presence of guessing, and the scoring function of the item.  There are also additional factors, such as student motivation, timing factors, lighting in the room, screen size, whether they broke up with their girlfriend/boyfriend the previous day, whether their mom made their favorite breakfast that morning… you get the picture.  Attempting to model all those factors is certainly overparameterization.

Wikipedia has a lengthier quote on that aspect:

Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.

Most, if not all, psychometricians would agree that my earlier description of overparameterization is valid.  The controversy in the field of psychometrics is which of those “important factors” I mentioned qualify as overparameterization.  The Rasch model famously boils down the interaction to a single item parameter (difficulty) and a single person parameter (ability).  Many psychometricians consider this to be underparameterization since, for example, items are known to differ widely in their quality (discrimination).  The Rasch cohort would consider the 2- and 3-parameter item response theory (IRT) models to be overparameterization, especially since they necessitated the development of new parameter estimation algorithms in the 1970s.  There are some practitioners in each camp who would claim that the other is the “mark of mediocrity.”

IRT continues to add more and more parameters, such as multidimensionality, response time, and upper asymptote.  For the most part, these are only academic curiosities, existing only to publish papers on new research, even though most assessments in the world still struggle to apply the Rasch model from 1960.

On the other end of the spectrum is classical test theory, which is based on simple mathematics like averages, proportions, and correlations.  This greatly underparameterizes what is actually going on.  The point-biserial coefficient, for example, assumes that the relation of ability to getting an item correct is linear, which is blatantly false since the probability cannot go above 1.0 or below 0.0.

Sooo… How do I select a psychometric model?

Well, try to be cognizant of that tradeoff, which is one of several tradeoffs when selecting an IRT model.  There is no right answer for every situation; it is more a matter of whether your data fit a model and whether the model satisfies your requirements for a particular situation.  That is, whether it is truly useful, which is Box’s original point.  But don’t forget that all the models are wrong!

Samejima’s Graded Response Model

Samejima’s (1969) Graded Response Model (GRM, sometimes SGRM) is an extension of the two parameter logistic model (2PL) within the item response theory (IRT) paradigm.  IRT provides a number of benefits over classical test theory, especially regarding the treatment of polytomous items; learn more about IRT vs. CTT here.

What is the Graded Response Model?

The GRM is a family of latent trait models (a latent trait is a variable that is not directly measurable, e.g., a person’s level of neuroticism, conscientiousness, or openness) for graded responses, developed by Fumiko Samejima in 1969 and utilized widely since then. The GRM is also known as the Ordered Categorical Response Model because it deals with ordered polytomous categories; it can apply to both constructed-response and selected-response items where examinees can obtain various levels of scores, such as 0–4 points. In that case, the categories are 0, 1, 2, 3, and 4, and they are ordered. ‘Ordered’ means what it says: there is a specific order or ranking of responses. ‘Polytomous’ means that the responses are divided into more than two categories, i.e., not just correct/incorrect or true/false.

 

When should I use the GRM?

This family of models is applicable when polytomous responses to an item can be classified into more than two ordered categories (something more than correct/incorrect), such as different degrees of achievement in a solution to a problem, or levels of agreement or frequency with a certain statement, as on a Likert scale. The GRM covers both homogeneous and heterogeneous cases, where the former implies that the discriminating power underlying the thinking process is constant throughout the range of attitude or reasoning.

Samejima (1997) highlights the reasonableness of employing the GRM in testing occasions where examinees are scored on degree of correctness (e.g., incorrect, partially correct, correct) or when measuring people’s attitudes and preferences, as in Likert-scale attitude surveys (e.g., strongly agree, agree, neutral, disagree, strongly disagree). For instance, the GRM could be used in an extroversion scoring model that treats “I like to go to parties” as a high-difficulty statement and “I like to go out for coffee with a close friend” as an easy one.

Here are some examples of assessments where GRM is utilized:

  • Survey attitude questions using responses like ‘strongly disagree, disagree, neutral, agree, strongly agree’
  • Multiple response items, such as a list of 8 animals and student selects which 3 are reptiles
  • Drag and drop or other tech enhanced items with multiple points available
  • Letter grades assigned to an essay: A, B, C, D, and E
  • Essay responses graded on a 0-to-4 rubric

 

Why use the GRM?

There are three general goals of applying GRM:

  • estimating an examinee’s ability level/latent trait
  • estimating the adequacy with which test questions measure that ability level/latent trait
  • evaluating the probability that an examinee will receive a specific score/grade on each question

Using item response theory in general (not just the GRM) provides a host of advantages.  It can help you validate the assessment.  Using the GRM can also enable adaptive testing.

 

How to calculate a response probability with the GRM?

There is a two-step process for calculating the probability that an examinee selects a certain category on a given question. The first step is to find the probability that an examinee with a given ability level responds in category m or higher on that question:

P^*_m(\theta) = \frac{e^{1.7a(\theta - b_m)}}{1 + e^{1.7a(\theta - b_m)}}

where

1.7  is the scale factor

a  is the discrimination parameter of the question

bm  is the boundary (threshold) parameter for category m, i.e., the point on the ability scale at which an examinee has a 50% chance of responding in category m or higher

e  is the mathematical constant, approximately equal to 2.718

Θ  is the ability level

P*m(Θ) = 1  if  m = 1, since responding in the lowest category or higher is a certain event

P*m(Θ) = 0  if  m = M + 1, since the probability of responding in a category above the highest is zero.

 

The second step is to find the probability that an examinee responds in a specific category, by subtracting adjacent cumulative probabilities:

P_m(\theta) = P^*_m(\theta) - P^*_{m+1}(\theta)

This formula describes the probability of choosing a specific response to the question for each level of the ability it measures.
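
Here is a minimal sketch of those two steps in code, using illustrative (invented) discrimination and boundary parameters; category indexing is zero-based for convenience.

```python
# A sketch of the two-step GRM calculation described above: cumulative
# probabilities P*_m(theta), then category probabilities by subtraction.
# Parameters (a and the ordered boundaries b_m) are illustrative.
import numpy as np

def grm_category_probs(theta, a, b, D=1.7):
    """Return probabilities for categories 0..M, given M boundaries b[0..M-1]."""
    b = np.asarray(b, dtype=float)
    p_star = 1 / (1 + np.exp(-D * a * (theta - b)))   # P* of responding in category m or higher
    p_star = np.concatenate(([1.0], p_star, [0.0]))   # P*_lowest = 1, P*_above_highest = 0
    return p_star[:-1] - p_star[1:]                   # step 2: subtraction

probs = grm_category_probs(theta=0.5, a=1.2, b=[-1.5, -0.3, 0.8, 2.0])
print(np.round(probs, 3), "sum =", probs.sum())       # category probabilities sum to 1.0
```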

 

How do I implement the GRM on my assessment?

You need item response theory software.  Start by downloading  Xcalibre  for free.  Below are outputs for two example items.

How to interpret this?  The GRM uses category response functions which show the probability of selecting a given response as a function of theta (trait or ability).  For Item 6, we see that someone with theta from -3.0 to -0.5 is very likely to select “2” on the Likert scale (or whatever our response scale is).  Examinees above -0.5 are likely to select “3” on the scale.  But on Item 10, the green curve is low and not likely to be chosen at all; examinees from -2.0 to +2.0 are likely to select “3” on the Likert scale, and those above +2.0 are likely to select “4”.  Item 6 is relatively difficult, in a sense, because no one chose “4.”

[Figure: Xcalibre category response function plots for Item 6 and Item 10]

References

Keller, L. A. (2014). Item Response Theory Models for Polytomous Response Data. Wiley StatsRef: Statistics Reference Online.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17. doi:10.1002/j.2333-8504.1968.tb00153.x

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 85–100). Springer-Verlag.

 

Coefficient Alpha Reliability (Cronbach’s Alpha)

Coefficient alpha reliability, sometimes called Cronbach’s alpha, is a statistical index that is used to evaluate the internal consistency or reliability of an assessment. That is, it quantifies how consistent we can expect scores to be, by analyzing the item statistics. A high value indicates that the test is of high reliability, and a low value indicates low reliability. This is one of the most fundamental concepts in psychometrics, and alpha is arguably the most common index. You may also be interested in reading about its competitor, the Split Half Reliability Index.

What is coefficient alpha, aka Cronbach’s alpha?

The classic reference for alpha is Cronbach (1951). He defines it as:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)

where k is the number of items, σ_i² is the variance of item i, and σ_X² is the variance of total scores.

Kuder-Richardson 20

While Cronbach tends to get the credit, to the point that the index is often called “Cronbach’s Alpha,” he really did not invent it. Kuder and Richardson (1937) suggested the following equation to estimate the reliability of a test with dichotomous (right/wrong) items.

KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right)

Note that it is the same as Cronbach’s equation, except that Cronbach replaced the binomial item variance pq with the more general notation of variance (sigma squared). This just means that you can use Cronbach’s equation on polytomous data such as Likert rating scales. In the case of dichotomous data such as multiple choice items, Cronbach’s alpha and KR-20 are exactly the same.

Additionally, Cyril Hoyt defined reliability in an equivalent approach using ANOVA in 1941, a decade before Cronbach’s paper.

How to interpret coefficient alpha

In general, alpha will range from 0.0 (random number generator) to 1.0 (perfect measurement). However, in rare cases, it can go below 0.0, such as if the test is very short or if there is a lot of missing data (sparse matrix). This, in fact, is one of the reasons NOT to use alpha in some cases. If you are dealing with linear-on-the-fly tests (LOFT), computerized adaptive tests (CAT), or a set of overlapping linear forms for equating (non-equivalent anchor test, or NEAT design), then you will likely have a large proportion of sparseness in the data matrix and alpha will be very low or negative. In such cases, item response theory provides a much more effective way of evaluating the test.

What is “perfect measurement?”  Well, imagine using a ruler to measure a piece of paper.  If it is American-sized, that piece of paper is always going to be 8.5 inches wide, no matter how many times you measure it with the ruler.  A bathroom scale is slightly less reliable; you might step on it, see 190.2 pounds, then step off and on again, and see 190.4 pounds.  This is a good example of how we often accept a small amount of unreliability in measurement.

Of course, we never have this level of accuracy in the world of psychoeducational measurement.  Even a well-made test is something where a student might get 92% today and 89% tomorrow (assuming we could wipe their brain of memory of the exact questions).

Reliability can also be interpreted as the ratio of true score variance to total score variance. That is, all test score distributions have a total variance, which consists of variance due to the construct of interest (i.e., smart students do well and poor students do poorly), but also some error variance (random error, kids not paying attention to a question, a second dimension in the test… it could be many things).

What is a good value of coefficient alpha?

As psychometricians love to say, “it depends.” The rule of thumb that you generally hear is that a value of 0.70 is good and below 0.70 is bad, but that is terrible advice. A higher value indeed indicates higher reliability, but you don’t always need high reliability. A test to certify surgeons, of course, deserves all the items it needs to make it quite reliable. Anything below 0.90 would be horrible. However, the survey you take from a car dealership will likely have the statistical results analyzed, and a reliability of 0.60 isn’t going to be the end of the world; it will still provide much better information than not doing a survey at all!

Here’s a general depiction of how to evaluate levels of coefficient alpha.

[Figure: general guidelines for interpreting values of coefficient alpha]

Using alpha: the classical standard error of measurement

Coefficient alpha is also often used to calculate the classical standard error of measurement (SEM), which provides a related method of interpreting the quality of a test and the precision of its scores. The SEM can be interpreted as the standard deviation of scores that you would expect if a person took the test many times, with their brain wiped clean of the memory each time. If the test is reliable, you’d expect them to get almost the same score each time, meaning that SEM would be small.

   SEM=SD*sqrt(1-r)

Note that SEM is a direct function of alpha, so that if alpha is 0.99, SEM will be small, and if alpha is 0.1, then SEM will be very large.

Coefficient alpha and unidimensionality

It can also be interpreted as a measure of unidimensionality. If all items are measuring the same construct, then scores on them will align, and the value of alpha will be high. If there are multiple constructs, alpha will be reduced, even if the items are still high quality. For example, if you were to analyze data from a Big Five personality assessment with all five domains at once, alpha would be quite low. Yet if you took the same data and calculated alpha separately on each domain, it would likely be quite high.

How to calculate the index

Because the calculation of coefficient alpha reliability is so simple, it can be done quite easily if you need to calculate it from scratch, such as using formulas in Microsoft Excel. However, any decent assessment platform or psychometric software will produce it for you as a matter of course. It is one of the most important statistics in psychometrics.
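
For example, here is a from-scratch sketch of the calculation on a small, invented matrix of dichotomous item scores, along with the classical SEM described above.

```python
# A from-scratch sketch of coefficient alpha and the classical SEM, assuming
# `X` is an examinee-by-item matrix of 0/1 scores (so alpha equals KR-20).
import numpy as np

X = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1],
])

k = X.shape[1]
item_var = X.var(axis=0, ddof=1)         # per-item variances
total_var = X.sum(axis=1).var(ddof=1)    # variance of total scores
alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)

sd_total = X.sum(axis=1).std(ddof=1)
sem = sd_total * np.sqrt(1 - alpha)      # classical standard error of measurement
print(f"alpha = {alpha:.3f}, SEM = {sem:.2f}")
```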

Cautions on overuse

Because alpha is just so convenient – boiling down the complex concept of test quality and accuracy to a single easy-to-read number – it is overused and over-relied upon. There are papers out in the literature that describe the cautions in detail; here is a classic reference.

One important consideration is the over-simplification of precision with coefficient alpha, and the classical standard error of measurement, when juxtaposed to the concept of conditional standard error of measurement from item response theory. This refers to the fact that most traditional tests have a lot of items of middle difficulty, which maximizes alpha. This measures students of middle ability quite well. However, if there are no difficult items on a test, it will do nothing to differentiate amongst the top students. Therefore, that test would have a high overall alpha, but have virtually no precision for the top students. In an extreme example, they’d all score 100%.

Also, alpha will completely fall apart when you calculate it on sparse matrices, because the total score variance is artifactually reduced.

Limitations of coefficient alpha

Cronbach’s alpha has several limitations. Firstly, it assumes that all items on a scale measure the same underlying construct and have equal variances, which is often not the case. Secondly, it is sensitive to the number of items on the scale; longer scales tend to produce higher alpha values, even if the additional items do not necessarily improve measurement quality. Thirdly, Cronbach’s alpha assumes that item errors are uncorrelated, an assumption that is frequently violated in practice. Lastly, it provides only a lower bound estimate of reliability, which means it can underestimate the true reliability of the test.

Summary

In conclusion, coefficient alpha is one of the most important statistics in psychometrics, and for good reason. It is quite useful in many cases, and easy enough to interpret that you can discuss it with test content developers and other non-psychometricians. However, there are cases where you should be cautious about its use, and some cases where it completely falls apart. In those situations, item response theory is highly recommended.

Differential Item Functioning

Differential item functioning (DIF) is a term in psychometrics for the statistical analysis of assessment data to determine if items are performing in a biased manner against some group of examinees.  This analysis is often complemented by item fit analysis, which ensures that each item aligns appropriately with the theoretical model and functions uniformly across different groups.  Most often, this is based on a demographic variable such as gender, ethnicity, or first language. For example, you might analyze a test to see if items are biased against an ethnic minority, such as Blacks or Hispanics in the USA.  Another organization I have worked with was concerned primarily with Urban vs. Rural students.  In the scientific literature, the majority is called the reference group and the minority is called the focal group.

As you would expect from the name, DIF analysis looks for evidence that an item functions (performs) differently for two groups. However, this is not as simple as one group getting the item incorrect (P value) more often. What if that group also has a lower ability/trait level on average? Therefore, we must analyze the difference in performance conditional on ability.  This means we find examinees at a given level of ability (e.g., 20-30th percentile) and compare the difficulty of the item for minority vs. majority examinees at that level.

Mantel-Haenszel analysis of differential item functioning

The Mantel-Haenszel approach is a simple yet powerful way to analyze differential item functioning. We simply use the raw classical number-correct score as the indicator of ability, and use it to evaluate group differences conditional on ability. For example, we could split up the sample into fifths (slices of 20%), and for each slice, we evaluate the difference in P value between the groups. An example of this is below, to help visualize how DIF might operate.  Here, there is a notable difference in the probability of getting an item correct, with ability held constant.  The item is biased against the focal group.  In the slice of examinees 41-60th percentile, the reference group has a 60% chance while the focal group (minority) has a 48% chance.

[Figure: proportion correct by ability group for the reference and focal groups, illustrating DIF against the focal group]
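
A bare-bones sketch of the Mantel-Haenszel computation is below; the stratum counts are invented for illustration, and the ETS delta transformation (-2.35 times the log of the common odds ratio) is shown as one common way to express the result.

```python
# A bare-bones sketch of the Mantel-Haenszel DIF statistic, with examinees
# grouped into score strata. The counts below are invented for illustration.
import math

# For each score stratum: (reference correct, reference wrong, focal correct, focal wrong)
strata = [
    (20, 30, 12, 38),   # e.g., lowest 20% of total scores
    (35, 25, 22, 38),
    (45, 15, 30, 30),
    (50, 10, 38, 22),
    (55,  5, 45, 15),
]

num = sum(A * D / (A + B + C + D) for A, B, C, D in strata)
den = sum(B * C / (A + B + C + D) for A, B, C, D in strata)
odds_ratio = num / den                     # MH common odds ratio across strata
mh_d_dif = -2.35 * math.log(odds_ratio)    # ETS delta metric; negative values flag DIF against the focal group
print(f"MH odds ratio = {odds_ratio:.2f}, MH D-DIF = {mh_d_dif:.2f}")
```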

Crossing and non-crossing DIF

Differential item functioning is sometimes described as crossing or non-crossing DIF. The example above is non-crossing, because the lines do not cross. In this case, there would be a difference in the overall P value between the groups. A case of crossing DIF would see the two lines cross, with potentially no difference in overall P value – which would mean that DIF would go completely unnoticed unless you specifically did a DIF analysis like this.  Hence, it is important to perform DIF analysis; though not for just this reason.

More methods of evaluating differential item functioning

There are, of course, more sophisticated methods of analyzing differential item functioning.  Logistic regression is a commonly used approach.  A sophisticated methodology is Raju’s differential functioning of items and tests (DFIT) approach.

How do I implement DIF?

There are three ways you can implement a DIF analysis.

1. General psychometric software: Well-known software for classical or item response theory analysis will often include an option for DIF. Examples are Iteman, Xcalibre, and IRTPRO (formerly Parscale/Multilog/Bilog).

2. DIF-specific software: While there are not many, there are software programs or R packages that are specific to DIF. An example is DFIT; there was once standalone software of that name to perform the analysis of the same name, but it is no longer supported, so you would now use an R package instead.

3. General statistical software or programming environments: For example, if you are a fan of SPSS, you can use it to implement some DIF analyses such as logistic regression.

More resources on differential item functioning

Sage Publishing puts out “little green books” that are useful introductions to many topics.  There is one specifically on differential item functioning.


What is the difference between the terms dichotomous and polytomous in psychometrics?  Well, these terms represent two subcategories within item response theory (IRT), which is the dominant psychometric paradigm for constructing, scoring, and analyzing assessments.  Virtually all large-scale assessments utilize IRT because of its well-documented advantages.  In many cases, however, it is referred to as if it were a single way of analyzing data.  But IRT is actually a fast-growing family of models, each requiring rigorous item fit analysis to ensure that each question functions appropriately within the model.  The models operate quite differently based on whether the test questions are scored right/wrong or yes/no (dichotomous), vs. complex items like an essay that might be scored on a rubric of 0 to 6 points (polytomous).  This post will provide a description of the differences and when to use one or the other.

 

Ready to use IRT?  Download Xcalibre for free

 

Dichotomous IRT Models

Dichotomous IRT models are those with two possible item scores.  Note that I say “item scores” and not “item responses” – the most common example of a dichotomous item is multiple choice, which typically has 4 to 5 options, but only two possible scores (correct/incorrect).  

True/False or Yes/No items are also obvious examples and are more likely to appear in surveys or inventories, as opposed to the ubiquity of the multiple-choice item in achievement/aptitude testing. Other item types that can be dichotomous are Scored Short Answer and Multiple Response (all or nothing scoring).  

What models are dichotomous?

The three most common dichotomous models are the 1PL/Rasch, the 2PL, and the 3PL.  Which one to use depends on the type of data you have, as well as your doctrine, of course.  A great example is Scored Short Answer items: there should be no effect of guessing on such an item, so the 2PL is a logical choice.  Here is a broad overgeneralization:

  • 1PL/Rasch: Uses only the difficulty (b) parameter and does not take into account guessing effects or the possibility that some items might be more discriminating than others; however, can be useful with small samples and other situations
  • 2PL: Uses difficulty (b) and discrimination (a) parameters, but no guessing (c); relevant for the many types of assessment where there is no guessing
  • 3PL: Uses all three parameters, typically relevant for achievement/aptitude testing.

What do dichotomous models look like?

Dichotomous models, graphically, will have one S-shaped curve with a positive slope, as seen here.  This reflects that the probability of responding in the keyed direction increases with higher levels of the trait or ability.

[Figure: item response function showing the probability of a correct response as a function of theta]

Technically, there is also a line for the probability of an incorrect response, which goes down, but this is obviously the 1-P complement, so it is rarely drawn in graphs.  It is, however, used in scoring algorithms (check out this white paper).

In the example, a student with theta = -3 has about a 0.28 chance of responding correctly, while theta = 0 has about 0.60 and theta = 1 has about 0.90.

Polytomous IRT Models

Polytomous models are for items that have more than two possible scores.  The most common examples are Likert-type items (Rate on a scale of 1 to 5) and partial credit items (score on an Essay might be 0 to 5 points). IRT models typically assume that the item scores are integers.

What models are polytomous?

Unsurprisingly, the most common polytomous models use names like rating scale and partial credit.

  • Rating Scale Model (Andrich, 1978)
  • Partial Credit Model (Masters, 1982)
  • Generalized Rating Scale Model (Muraki, 1990)
  • Generalized Partial Credit Model (Muraki, 1992)
  • Graded Response Model (Samejima, 1969)
  • Nominal Response Model (Bock, 1972)

What do polytomous models look like?

Polytomous models have a line that dictates each possible response.  The line for the highest point value is typically S-shaped like a dichotomous curve.  The line for the lowest point value is typically sloped down like the 1-P dichotomous curve.  Point values in the middle typically have a bell-shaped curve. The example is for an Essay that scored 0 to 5 points.  Only students with theta >2 are likely to get the full points (blue), while students 1<theta<2 are likely to receive 4 points (green).

I’ve seen “polychotomous.”  What does that mean?

It means the same as polytomous.  

How is IRT used in our platform?

We use it to support the test development cycle, including form assembly, scoring, and adaptive testing.  You can learn more on this page.

How can I analyze my tests with IRT?

You need specially designed software, like  Xcalibre.  Classical test theory is so simple that you can do it with Excel functions.

Recommended Readings

Item Response Theory for Psychologists by Embretson and Reise (2000).

Test Security Plan

A test security plan (TSP) is a document that lays out how an assessment organization addresses the security of its intellectual property, to protect the validity of the exam scores.  If a test is compromised, the scores become meaningless, so security is obviously important.  The test security plan helps an organization anticipate test security issues, establish deterrent and detection methods, and plan responses.  It can also include validity threats that are not security-related, such as how to deal with examinees that have low motivation.  Note that it is not limited to test delivery; it can often include topics like how to manage item writers.

Since the first tests were developed 2000 years ago for entry into the civil service of Imperial China, test security has been a concern.  The reason is quite straightforward: most threats to test security are also validity threats. The decisions we make with test scores could therefore be invalid, or at least suboptimal.  It is therefore imperative that organizations that use or develop tests should develop a TSP.

Why do we need a test security plan?

There are several reasons to develop a test security plan.  First, it drives greater security and therefore validity.  The TSP will enhance the legal defensibility of the testing program.  It helps to safeguard the content, which is typically an expensive investment for any organization that develops tests themselves.  If incidents do happen, they can be dealt with more swiftly and effectively.  It helps to manage all the security-related efforts.

The development of such a complex document requires a strong framework.  We advocate a framework with three phases: planning, implementation, and response.  In addition, the TSP should be revised periodically.

Phase 1: Planning

The first step in this phase is to list all potential threats to each assessment program at your organization.  This could include harvesting of test content, preknowledge of test content from past harvesters, copying other examinees, proxy testers, proctor help, and outside help.  Next, these should be rated on axes that are important to the organization; a simple approach would be to rate on potential impact to score validity, cost to the organization, and likelihood of occurrence.  This risk assessment exercise will help the remainder of the framework.

Next, the organization should develop the test security plan.  The first piece is to identify deterrents and procedures to reduce the possibility of issues.  This includes delivery procedures (such as a lockdown browser or proctoring), proctor training manuals, a strong candidate agreement, anonymous reporting pathways, confirmation testing, and candidate identification requirements.  The second piece is to explicitly plan for psychometric forensics. 

This can range from complex collusion indices based on item response theory to simple flags, such as a candidate responding to a certain multiple choice option more than 50% of the time or obtaining a score in the top 10% but in the lowest 10% of time.  The third piece is to establish planned responses.  What will you do if a proctor reports that two candidates were copying each other?  What if someone obtains a high score in an unreasonably short time? 

What if someone obviously did not try to pass the exam, but still sat there for the allotted time?  If a candidate were to lose a job opportunity due to your response, it helps your defensibility to show that the process was established ahead of time with the input of important stakeholders.
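
As a toy illustration of such simple flags (not a substitute for purpose-built software like SIFT), the sketch below flags candidates with top-decile scores earned in bottom-decile time, and candidates who used one response option on more than half of the items; all data are simulated.

```python
# A toy sketch of two simple psychometric forensics flags mentioned above.
# The data structures and cutoffs are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(70, 10, size=1000)                  # total test scores
times = rng.normal(60, 15, size=1000)                   # total testing time in minutes
responses = rng.choice(list("ABCD"), size=(1000, 50))   # selected options, 50 items

# Flag 1: score in the top decile but time in the bottom decile
fast_and_high = (scores >= np.percentile(scores, 90)) & (times <= np.percentile(times, 10))

# Flag 2: one response option used on more than half of the items
max_option_rate = np.array([max(np.mean(row == opt) for opt in "ABCD") for row in responses])
option_overuse = max_option_rate > 0.5

print("Fast-and-high flags:", np.where(fast_and_high)[0])
print("Option-overuse flags:", np.where(option_overuse)[0])
```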

Phase 2: Implementation

The second phase is to implement the relevant aspects of the Test Security Plan, such as training all proctors in accordance with the manual and login procedures, setting IP address limits, or ensuring that a new secure testing platform with lockdown is rolled out to all testing locations.  There are generally two approaches.  Proactive approaches attempt to reduce the likelihood of issues in the first place, and reactive methods happen after the test is given.  The reactive methods can be observational, quantitative, or content-focused.  Observational methods include proctor reports or an anonymous tip line.  Quantitative methods include psychometric forensics, for which you will need software like SIFT.  Content-focused methods include automated web crawling.

Both approaches require continuous attention.  You might need to train new proctors several times per year, or update your lockdown browser.  If you use a virtual proctoring service based on record-and-review, flagged candidates must be periodically reviewed.  The reactive methods are similar: incoming anonymous tips or proctor reports must be dealt with at any given time.  The least continuous aspect is some of the psychometric forensics, which depend on a large-scale data analysis; for example, you might gather data from tens of thousands of examinees in a testing window and can only do a complete analysis at that point, which could take several weeks.

Phase 3: Response

The third phase, of course, is to put your planned responses into motion if issues are detected.  Some of these could be relatively innocuous; if a proctor is reported as not following procedures, they might need some remedial training, and it’s certainly possible that no security breach occurred.  The more dramatic responses include actions taken against the candidate.  The most lenient is to provide a warning or simply ask them to retake the test.  The most extreme methods include a full invalidation of the score with future sanctions, such as a five-year ban on taking the test again, which could prevent someone from entering a profession for which they spent 8 years and hundreds of thousands of dollars in educational preparation.

What does a test security plan mean for me?

It is clear that test security threats are also validity threats, and that the extensive (and expensive!) measures warrant a strategic and proactive approach in many situations.  A framework like the one advocated here will help organizations identify and prioritize threats so that the measures are appropriate for a given program.  Note that the results can be quite different if an organization has multiple programs, from a practice test to an entry level screening test to a promotional test to a professional certification or licensure.

Another important distinction is between test sponsors/publishers and test consumers.  In the case of an organization that purchases off-the-shelf pre-employment tests, the validity of score interpretations is of more direct concern, while the theft of content might not be an immediate concern.  Conversely, the publisher of such tests has invested heavily in the content and could be massively impacted by theft, while two examinees copying each other in the hiring organization is not of immediate concern to the publisher.

In summary, there are more security threats, deterrents, procedures, and psychometric forensic methods than can be discussed in one blog post, so the focus here is on the framework itself.  For starters, begin thinking strategically about test security and how it impacts your assessment programs by using the multi-axis rating approach, then begin to develop a Test Security Plan.  The end goal is to improve the health and validity of your assessments.


Want to implement some of the security aspects discussed here, like online delivery lockdown browser, IP address limits, and proctor passwords?

Sign up for a free account in FastTest!

Multistage Testing

Multistage testing (MST) is a type of computerized adaptive testing (CAT).  This means it is an exam delivered on computers which dynamically personalize it for each examinee or student.  Typically, this is done with respect to the difficulty of the questions, by making the exam easier for lower-ability students and harder for high-ability students.  Doing this makes the test shorter and more accurate while providing additional benefits.  This post will provide more information on multistage testing so you can evaluate if it is a good fit for your organization.

Already interested in MST and want to implement it?  Contact us to talk to one of our experts and get access to our powerful online assessment platform, where you can create your own MST and CAT exams in a matter of hours.

 

What is multistage testing?

[Figure: example multistage testing panel design (1-3-3)]

Like CAT, multistage testing adapts the difficulty of the items presented to the student. But while adaptive testing works by adapting each item one by one using item response theory (IRT), multistage works in blocks of items.  That is, CAT will deliver one item, score it, pick a new item, score it, pick a new item, etc.  Multistage testing will deliver a block of items, such as 10, score them, then deliver another block of 10.

The design of a multistage test is often referred to as panels.  There is usually a single routing test or routing stage which starts the exam, and then students are directed to different levels of panels for subsequent stages.  The number of levels is sometimes used to describe the design; the example shown in the figure above is a 1-3-3 design.  Unlike CAT, there are only a few potential paths, unless each stage has a pool of available testlets.
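
To make the block-level routing concrete, here is a highly simplified sketch of a 1-3-3 routing rule based on a provisional theta estimate from the routing stage; the cut points and module names are illustrative assumptions, and real programs would select modules using IRT information functions.

```python
# A highly simplified sketch of routing in a 1-3-3 multistage design: use the
# provisional score from the routing stage to choose an easy, medium, or hard
# second-stage module. Cut scores and module names are illustrative only.
def next_module(provisional_theta, cut_low=-0.5, cut_high=0.5):
    """Pick the stage-2 module based on the routing-stage theta estimate."""
    if provisional_theta < cut_low:
        return "stage2_easy"
    elif provisional_theta > cut_high:
        return "stage2_hard"
    return "stage2_medium"

for theta in (-1.2, 0.1, 1.4):
    print(theta, "->", next_module(theta))
```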

As with item-by-item CAT, multistage testing is almost always done using IRT as the psychometric paradigm, selection algorithm, and scoring method.  This is because IRT can score examinees on a common scale regardless of which items they see, which is not possible using classical test theory.

To learn more about MST, I recommend this book.

Why multistage testing?

Item-by-item CAT is not the best fit for all assessments, especially those that naturally tend towards testlets, such as language assessments where there is a reading passage with 3-5 associated questions.

Multistage testing allows you to realize some of the well-known benefits of adaptive testing (see below), with more control over content and exposure.  In addition to controlling content at an examinee level, it also can make it easier to manage item bank usage for the organization.

 

How do I implement multistage testing?

1. Develop your item banks using items calibrated with item response theory

2. Assemble a test with multiple stages, defining pools of items in each stage as testlets

3. Evaluate the test information functions for each testlet

4. Run simulation studies to validate the delivery algorithm with your predefined testlets

5. Publish for online delivery

Our industry-leading assessment platform manages much of this process for you.  The image below shows our test assembly screen, where you can evaluate the test information functions for each testlet.

[Screenshot: test assembly screen showing the test information functions for each testlet]

 

Benefits of multistage testing

There are a number of benefits to this approach, which are mostly shared with CAT.

  • Shorter exams: because difficulty is targeted, you waste less time
  • Increased security: There are many possible configurations, unlike a linear exam where everyone sees the same set of items
  • Increased engagement: Lower ability students are not discouraged, and high ability students are not bored
  • Control of content: CAT has some content control algorithms, but they are sometimes not sufficient
  • Supports testlets: CAT does not support tests that have testlets, like a reading passage with 5 questions
  • Allows for review: CAT does not usually allow for review (students can go back a question to change an answer), while MST does

 

Examples of multistage testing

MST is often used in language assessment, which means that it is often used in educational assessment, such as benchmark K-12 exams, university admissions, or language placement/certification.  One of the most famous examples is the Scholastic Aptitude Test from The College Board; it is moving to an MST approach in 2023.

Because of the complexity of item response theory, most organizations that implement MST have a full-time psychometrician on staff.  If your organization does not, we would love to discuss how we can work together.