Psychometrics: The Science of Assessment

What is psychometrics?

Psychometrics is the science of educational and psychological assessment. It scientifically studies how tests are developed, delivered, and scored, regardless of the test topic. Psychometrics tackles fundamental questions around assessment, such as how to determine if a test is reliable or if a question is of good quality, as well as much more complex questions like how to ensure that a score today means the same thing as it did 10 years ago.

Why do we need psychometrics?

The purpose of tests is to provide useful information about people, such as whether to hire them, whether to certify them in a profession, or what to teach them next in school.  Better tests mean better decisions.  Why?  The scientific evidence is overwhelming that tests provide better information for decision-makers than many other sources, such as interviews, resumes, or educational attainment.  Thus, tests serve an extremely useful role in our society.

The goal of psychometrics is to provide validity: evidence that the interpretations of scores from the test are the ones we intend.  If passing a certification test is supposed to mean that someone meets the minimum standard to work in a certain job, we need a lot of evidence for that claim, especially because the stakes are so high.

What is psychometrics? An introduction / definition.

Psychometrics is the study of assessment itself, regardless of what type of test is under consideration.  In fact, many psychometricians don’t even work on a particular test; they work on psychometrics itself, such as new methods of data analysis.  Most professionals are agnostic about what the test measures, and will often move to jobs on completely unrelated topics, such as from a K-12 testing company to psychological measurement to an accountancy certification exam.  We often refer to whatever we are measuring simply as “theta” – a term from item response theory.


Psychometrics is a branch of data science.  In fact, it was around long before that term became a buzzword.  Don’t believe me?  Check out this Coursera course on Data Science: the first example it gives of a foundational historical project in data science is… psychometrics!  (Early research on factor analysis of intelligence.)

Even though assessment is everywhere and psychometrics is an essential aspect of assessment, to most people it remains a black box, and its practitioners are jokingly referred to as “psychomagicians.”  However, a basic understanding is important for anyone working in the testing industry, especially those developing or selling tests.  It is also important for many fields that use assessments, like human resources and education.

Psychometrics is NOT limited to narrow types of assessment.  Some people use the term interchangeably with concepts like IQ testing, personality assessment, or pre-employment testing.  Each of these is only a small part of the field!  Nor is psychometrics simply the administration of a test.

 

What questions does the field of Psychometrics address?

 

Building and maintaining a high-quality test is not easy.  A lot of big issues can arise.  Much of the field revolves around solving major questions about tests: what should they cover, what makes a good question, how do we set a defensible cutscore, how do we make sure that the test predicts job performance or student success, and so on.  Many of these questions align with the test development cycle – more on that later.

How do we define what should be covered by the test? (Test Design)

Before writing any items, you need to define very specifically what will be on the test.  If the test is in credentialing or pre-employment, psychometricians typically run a job analysis study to form a quantitative, scientific basis for the test blueprints.  A job analysis is necessary for a certification program to get accredited.  In Education, the test coverage is often defined by the curriculum.

How do we ensure the questions are good quality? (Item Writing)

There is a corpus of scientific literature on how to develop test items that accurately measure whatever you are trying to measure.  A great overview is the book by Haladyna.  This is not just limited to multiple-choice items, although that approach remains popular.  Psychometricians leverage their knowledge of best practices to guide the item authoring and review process in a way that the result is highly defensible test content.  Professional item banking software provides the most efficient way to develop high-quality content and publish multiple test forms, as well as store important historical information like item statistics.

How do we set a defensible cutscore? (Standard Setting)

Test scores are often used to classify candidates into groups, such as pass/fail (Certification/Licensure), hire/non-hire (Pre-Employment), and below-basic/basic/proficient/advanced (Education).  Psychometricians lead studies to determine the cutscores, using methodologies such as Angoff, Beuk, Contrasting-Groups, and Borderline.
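
To make the arithmetic concrete, here is a minimal sketch of the core calculation in a modified-Angoff study, written in Python with entirely hypothetical ratings: each subject-matter expert estimates the probability that a minimally competent candidate would answer each item correctly, and the recommended cutscore is the sum of the averaged item ratings.

  # Minimal sketch of modified-Angoff cutscore arithmetic (hypothetical ratings).
  # Each subject-matter expert (SME) rates, for every item, the probability that
  # a minimally competent candidate would answer it correctly.

  ratings = {                      # hypothetical ratings from 3 SMEs on 5 items
      "SME1": [0.70, 0.60, 0.85, 0.50, 0.90],
      "SME2": [0.65, 0.55, 0.80, 0.60, 0.95],
      "SME3": [0.75, 0.65, 0.90, 0.55, 0.85],
  }

  n_items = 5
  # Average each item's ratings across SMEs, then sum across items to get the
  # recommended raw cutscore (the expected score of a minimally competent candidate).
  item_means = [sum(r[i] for r in ratings.values()) / len(ratings) for i in range(n_items)]
  raw_cutscore = sum(item_means)
  print(f"Recommended raw cutscore: {raw_cutscore:.2f} out of {n_items}")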

How do we analyze results to improve the exam? (Psychometric Analysis)

Psychometricians are essential for this step, as the statistical analyses can be quite complex.  Smaller testing organizations typically utilize classical test theory, which is based on simple mathematics like proportions and correlations.  Large, high-profile organizations typically use item response theory (IRT), which is based on a type of nonlinear regression analysis.  Psychometricians evaluate overall reliability of the test, difficulty and discrimination of each item, distractor analysis, possible bias, multidimensionality, linking multiple test forms/years, and much more.  Software such as  Iteman  and  Xcalibre  is also available for organizations with enough expertise to run statistical analyses internally.  Scroll down below for examples.
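
As an illustration of the classical statistics mentioned above, here is a minimal sketch in Python that computes an item P value (difficulty) and an item-rest point-biserial correlation (discrimination) from a tiny, entirely hypothetical 0/1 response matrix; dedicated software like Iteman produces these and much more for real datasets.

  # Minimal sketch of classical item analysis: item P values (difficulty) and
  # point-biserial correlations (discrimination), using hypothetical 0/1 data.
  import statistics

  # Rows = examinees, columns = items (1 = correct, 0 = incorrect); hypothetical.
  responses = [
      [1, 1, 0, 1],
      [1, 0, 0, 1],
      [1, 1, 1, 1],
      [0, 0, 0, 1],
      [1, 1, 0, 0],
  ]

  def pearson(x, y):
      mx, my = statistics.mean(x), statistics.mean(y)
      num = sum((a - mx) * (b - my) for a, b in zip(x, y))
      den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
      return num / den if den else float("nan")

  total_scores = [sum(row) for row in responses]
  for j in range(len(responses[0])):
      item = [row[j] for row in responses]
      p_value = statistics.mean(item)          # proportion answering correctly
      # Point-biserial: correlation of the item with the rest of the test
      # (item removed from the total to avoid inflating the correlation).
      rest = [t - i for t, i in zip(total_scores, item)]
      rpbis = pearson(item, rest)
      print(f"Item {j + 1}: P = {p_value:.2f}, point-biserial (item-rest) = {rpbis:.2f}")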

How do we compare scores across groups or years? (Equating)

This is referred to as linking and equating.  Some psychometricians devote their entire careers to this topic.  If you are working on a certification exam, for example, you want to make sure that the passing standard is the same this year as last year.  If 76% of candidates passed last year and only 25% passed this year, not only will the candidates be angry, but there will be much less confidence in the meaning of the credential.

How do we know the test is measuring what it should? (Validity)

Validity is the evidence provided to support score interpretations.  For example, we might interpret scores on a test to reflect knowledge of English, and we need to provide documentation and research supporting this.  There are several ways to provide this evidence.  A straightforward approach is to establish content-related evidence, which includes the test definition, blueprints, and item authoring/review.  In some situations, criterion-related evidence is important, which directly correlates test scores to another variable of interest.  Delivering tests in a secure manner is also essential for validity.

 

Where is Psychometrics Used?

Certification/Licensure/Credentialing

In certification testing, psychometricians develop the test via a documented chain of evidence following a sequence of research outlined by accreditation bodies, typically: job analysis, test blueprints, item writing and review, cutscore study, and statistical analysis.  Web-based item banking software like  FastTest  is typically useful because the exam committee often consists of experts located across the country or even throughout the world; they can then easily log in from anywhere and collaborate.

Pre-Employment

In pre-employment testing, validity evidence relies primarily on establishing appropriate content (a test on PHP programming for a PHP programming job) and the correlation of test scores with an important criterion like job performance ratings (shows that the test predicts good job performance).  Adaptive tests are becoming much more common in pre-employment testing because they provide several benefits, the most important of which is cutting test time by 50% – a big deal for large corporations that test a million applicants each year. Adaptive testing is based on item response theory, and requires a specialized psychometrician as well as specially designed software like  FastTest.

K-12 Education

Most assessments in education fall into one of two categories: lower-stakes formative assessment in classrooms, and higher-stakes summative assessments like year-end exams.  Psychometrics is essential for establishing the reliability and validity of higher-stakes exams, and for equating scores across different years.  It is also important for formative assessments, which are moving towards adaptive formats because of the roughly 50% reduction in test time, meaning that students spend less time testing and more time learning.

Universities

Universities typically do not give much thought to psychometrics, even though a significant amount of testing occurs in higher education, especially with the move to online learning and MOOCs.  Given that many of the exams are high stakes (consider a certificate exam after completing a year-long graduate program!), psychometricians should be involved in establishing legally defensible cutscores and in the statistical analysis needed to ensure reliable tests, and professionally designed assessment systems should be used to develop and deliver the tests, especially where enhanced security is needed.

Medicine/Psychology

Have you ever taken a survey at your doctor’s office, or before/after a surgery?  Perhaps a depression or anxiety inventory at a psychotherapist?  Psychometricians have worked on these.

 

The Test Development Cycle

Psychometrics is the core of the test development cycle, which is the process of developing a strong exam.  It is sometimes called similar names like assessment lifecycle.

You will recognize some of the terms from the introduction earlier.  What we are trying to demonstrate here is that those questions are not standalone topics, or something you do once and simply file a report.  An exam is usually a living thing.  Organizations will often be republishing a new version every year or 6 months, which means that much of the cycle is repeated on that timeline.  Not all of it is; for example, many orgs only do a job analysis and standard setting every 5 years.

Consider a certification exam in healthcare.  The profession does not change quickly, because things like anatomy never change and medical procedures rarely change (e.g., how to measure blood pressure).  So, every 5 years the organization does a job analysis of its certificants to see what they are doing and what is important.  This is then converted to test blueprints.  Items are re-mapped if needed, but most likely do not need it because there are probably only minor changes to the blueprints.  Then a new cutscore is set with the modified-Angoff method, and the test is delivered this year.  It is delivered again next year, but equated to this year rather than starting over.  However, the item statistics are still analyzed, which leads to a new cycle of revising items and publishing a new form for the following year.

 

Example of Psychometrics in Action

Iteman item analysis output

Here is some output from our Iteman software.  This is a deep analysis of a single question on English vocabulary, checking whether the student knows the word alleviate.  About 70% of the students answered correctly, with a very strong point-biserial.  Each distractor was chosen by only a minority of examinees and had a negative point-biserial, which adds evidence to the validity.  The graph shows that the line for the correct answer goes up while the others go down, which is good.  If you are familiar with item response theory, you’ll notice how the blue line is similar to an item response function.  That is not a coincidence.

 

Iteman item analysis output for the confectioner item

Now, let’s look at another one, which is more interesting.  Here’s a vocabulary question about the word confectioner.  Note that only 37% of the students got it right, even though there is a 25% chance of guessing correctly!  However, the point-biserial discrimination remains very strong at 0.49.  That means it is a really good item – it is just hard, which means it does a great job of differentiating among the top students.

 

Psychometrics looks fun!  How can I join the band?

You will need a graduate degree.  I recommend you look at the NCME website with resources for students.  Good luck!

Already have a degree and looking for a job?  Here are the two sites that I recommend:

NCME – Also has a job listings page that is really good (ncme.org)

Horizon Search – Headhunter for Psychometricians and I/O Psychologists

Computerized adaptive testing is an AI-based approach to assessment where the test is personalized based on your performance as you take the test, making the test shorter, more accurate, more secure, more engaging, and fairer.  If you do well, the items get more difficult, and if you do poorly, the items get easier.  If an accurate score is reached, the test stops early.  The AI algorithms are almost always based on item response theory (IRT), an application of machine learning to assessment, but can be based on other models as well. 

Prefer to learn by doing?  Request a free account in FastTest, our powerful adaptive testing platform.

Free FastTest Account

What is computerized adaptive testing (CAT)?

Computerized adaptive testing, sometimes called computer-adaptive testing, adaptive assessment, or adaptive testing, is an algorithm that personalizes how an assessment is delivered to each examinee.  It is coded into a software platform, using the machine-learning approach of IRT to select items and score examinees.  The algorithm proceeds in a loop until the test is complete.  This makes the test smarter, shorter, fairer, and more precise.

Diagram: components of computerized adaptive testing

The steps in the diagram above are adapted from Kingsbury and Weiss (1984) and are based on the components listed below.

Components of a CAT

  1. Item bank calibrated with IRT
  2. Starting point (theta level before someone answers an item)
  3. Item selection algorithm (usually maximum Fisher information)
  4. Scoring method (e.g., maximum likelihood)
  5. Termination criterion (stop the test at 50 items, or when standard error is less than 0.30?  Both?)

How the components work

For starters, you need an item bank that has been calibrated with a relevant psychometric or machine learning model.  That is, you can’t just write a few items and subjectively rank them as Easy, Medium, or Hard difficulty.  That’s an easy way to get sued.  Instead, you need to write a large number of items (rule of thumb is 3x your intended test length) and then pilot them on a representative sample of examinees.  The sample must be large enough to support the psychometric model you choose, and can range from 100 to 1000.  You then need to perform simulation research – more on that later.


Once you have an item bank ready, here is how the computerized adaptive testing algorithm works for a student that sits down to take the test, with options for how to do so.

  1. Starting point: there are three options for selecting the starting score, which psychometricians call theta
    1. Everyone gets the same value, like 0.0 (average, in the case of non-Rasch models)
    2. Randomized within a range, to help test security and item exposure
    3. Predicted value, perhaps from external data, or from a previous exam
  2. Select item
    1. Find the item in the bank that has the highest information value
    2. Often, you need to balance this with practical constraints such as Item Exposure or Content Balancing
  3. Score the examinee
    1. Usually IRT, maximum likelihood or Bayes modal
  4. Evaluate termination criterion: using a predefined rule supported by your simulation research
    1. Is a certain level of precision reached, such as a standard error of measurement <0.30?
    2. Are there no good items left in the bank?
    3. Has a time limit been reached?
    4. Has a Max Items limit been reached?

The algorithm works by looping through steps 2-3-4 until the termination criterion is satisfied.
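
Here is a minimal sketch of that loop in Python, assuming a small hypothetical 2PL item bank, a crude grid-search maximum-likelihood scoring step, and a simple 3-item stopping rule.  Operational engines use more refined estimation (e.g., Newton-Raphson or Bayesian scoring) plus exposure and content controls.

  # Minimal sketch of the CAT loop: select item -> administer -> score -> check stop.
  # Item parameters are hypothetical 2PL values; scoring is a crude grid search.
  import math, random

  bank = [{"a": 1.0, "b": b} for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]   # hypothetical items

  def p_correct(theta, item):
      return 1.0 / (1.0 + math.exp(-1.7 * item["a"] * (theta - item["b"])))

  def information(theta, item):
      p = p_correct(theta, item)
      return (1.7 * item["a"]) ** 2 * p * (1.0 - p)

  def estimate_theta(items, scores):
      # Crude grid-search maximum likelihood over theta in [-4, 4].
      grid = [g / 10.0 for g in range(-40, 41)]
      def loglik(t):
          return sum(math.log(p_correct(t, i)) if u else math.log(1.0 - p_correct(t, i))
                     for i, u in zip(items, scores))
      return max(grid, key=loglik)

  theta = 0.0                                                # Step 1: starting theta
  administered, scores = [], []
  while True:
      remaining = [i for i in bank if i not in administered]
      if not remaining or len(administered) >= 3:            # Step 4: termination criteria
          break
      item = max(remaining, key=lambda i: information(theta, i))   # Step 2: max information
      response = int(random.random() < p_correct(-0.5, item))      # simulated examinee (true theta -0.5)
      administered.append(item)
      scores.append(response)
      theta = estimate_theta(administered, scores)                 # Step 3: re-score
  print(f"Final theta estimate after {len(administered)} items: {theta:.2f}")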

How does the test adapt? By Difficulty or Quantity?

CATs operate by adapting both the difficulty and quantity of items seen by each examinee.

Difficulty
Most characterizations of computerized adaptive testing focus on how item difficulty is matched to examinee ability.  High-ability examinees receive more difficult items, while low-ability examinees receive easier items, which has important benefits for the student and the organization.  An adaptive test typically begins by delivering an item of medium difficulty; if you get it correct, you get a harder item, and if you get it incorrect, you get an easier item.  This pattern continues throughout the test.

Quantity: Fixed-Length vs. Variable-Length
A less publicized facet of adaptation is the number of items.  Adaptive tests can be designed to stop when certain psychometric criteria are reached, such as a specific level of score precision.  Some examinees finish very quickly with few items; on average, adaptive tests need only about half as many questions as a conventional test, with at least as much accuracy.  Because different examinees get different test lengths, these adaptive tests are referred to as variable-length.  Obviously, this is a massive benefit: cutting testing time in half, on average, can substantially decrease testing costs.

Some adaptive tests use a fixed length and adapt only item difficulty.  This is usually done for public relations reasons, namely to avoid dealing with examinees who feel they were treated unfairly by a variable-length CAT, even though it is arguably more fair and valid than a conventional test.  In general, it is best practice to blend the two: allow test length to vary, but put caps on either end that prevent tests that are inadvertently too short or that could drag on to 400 items.  For example, the NCLEX has a minimum length of 75 items and a maximum of 145 items.

 

Example of the computerized adaptive testing algorithm

Let’s walk through an oversimplified example.  Here, we have an item bank with 5 questions.  We will start with an item of average difficulty, and answer as would a student of below-average ability.

Below are the item information functions for five items in a bank.  Let’s suppose the starting theta is 0.0.  

item information functions

 

  1. We find the first item to deliver.  Which item has the highest information at 0.0?  It is Item 4.
  2. Suppose the student answers incorrectly.
  3. We run the IRT scoring algorithm, and suppose the score is -2.0.  
  4. Check the termination criterion; we certainly aren’t done yet, after 1 item.
  5. Find the next item.  Which has the highest information at -2.0?  Item 2.
  6. Suppose the student answers correctly.
  7. We run the IRT scoring algorithm, and suppose the score is -0.8.  
  8. Evaluate termination criterion; not done yet.
  9. Find the next item.  Item 2 has the highest information at -0.8, but we already used it.  Item 4 is next best, but we already used that too.  So the next best is Item 1.
  10. Item 1 is very easy, so the student gets it correct.
  11. New score is -0.2.
  12. Best remaining item at -0.2 is Item 3.
  13. Suppose the student gets it incorrect.
  14. New score is perhaps -0.4.
  15. Evaluate termination criterion.  Suppose that the test has a max of 3 items, an extremely simple criterion.  We have met it, so the test is now done and automatically submitted.  (A precision-based alternative is sketched below.)
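
The stopping rule in step 15 was a fixed maximum of 3 items; many operational CATs instead stop once the score is precise enough.  Under IRT, the standard error of the theta estimate is approximately 1 divided by the square root of the test information, which is the sum of the item informations at the current theta.  Here is a minimal sketch of that check in Python, using hypothetical 2PL parameters for three administered items.

  # Minimal sketch of a precision-based termination check.  Under IRT,
  # SE(theta) ≈ 1 / sqrt(test information), where test information is the sum of
  # the item informations at the current theta.  Item parameters are hypothetical.
  import math

  administered = [{"a": 1.2, "b": -1.5}, {"a": 0.9, "b": -0.7}, {"a": 1.1, "b": -0.2}]
  theta = -0.4                                  # current estimate from the example above

  def information(theta, item):
      p = 1.0 / (1.0 + math.exp(-1.7 * item["a"] * (theta - item["b"])))
      return (1.7 * item["a"]) ** 2 * p * (1.0 - p)

  test_info = sum(information(theta, item) for item in administered)
  se = 1.0 / math.sqrt(test_info)
  print(f"Test information = {test_info:.2f}, standard error = {se:.2f}")
  print("Stop the test" if se < 0.30 else "Continue: deliver another item")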

 

Advantages of computerized adaptive testing

By making the test more intelligent, adaptive testing provides a wide range of benefits.  Some of the well-known advantages of adaptive testing, recognized by scholarly psychometric research, are listed below.  
 

Shorter tests

Research has found that adaptive tests produce anywhere from a 50% to 90% reduction in test length.  This is no surprise.  Suppose you have a pool of 100 items.  A top student is practically guaranteed to get the easiest 70 correct; only the hardest 30 will make them think.  Vice versa for a low-ability student.  Middle-ability students do not need the super-hard or the super-easy items.

Why does this matter?  Primarily, it can greatly reduce costs.  Suppose you are delivering 100,000 exams per year in testing centers, and you are paying $30/hour.  If you can cut your exam from 2 hours to 1 hour, you just saved $3,000,000.  Yes, there will be increased costs from the use of adaptive assessment, but you will likely save money in the end.

For K-12 assessment, you aren’t paying for seat time, but there is the opportunity cost of lost instruction time.  If students take formative assessments 3 times per year to check on progress, and you can reduce each by 20 minutes, that is 1 hour per student; if there are 500,000 students in your state, you just saved 500,000 hours of learning.

More precise scores

CAT will make tests more accurate, in general.  It does this by designing the algorithms specifically around how to get more accurate scores without wasting examinee time.

More control of score precision (accuracy)

With variable-length stopping rules, CAT can give all students scores of comparable precision, making the test much fairer.  Traditional tests measure middle-ability students well but not the top or bottom students.  Which is better: (A) students see the same items but can have drastically different score accuracy, or (B) students see different items but have equivalent score accuracy?

Better test security

Since all students essentially get an assessment tailored to them, there is better test security than when everyone sees the same 100 items.  Item exposure is greatly reduced; note, however, that adaptive delivery introduces its own challenges, and CAT algorithms need their own item exposure controls.

A better experience for examinees, with reduced fatigue

Adaptive assessments tend to be less frustrating for examinees across the ability range.  Moreover, variable-length stopping rules (e.g., once we know you are a top student, we don’t make you slog through the 70 easy items) reduce fatigue.

Increased examinee motivation

Since examinees only see items relevant to them, this provides an appropriate challenge.  Low-ability examinees will feel more comfortable and get many more items correct than with a linear test.  High-ability students will get the difficult items that make them think.

Frequent retesting is possible

The whole “unique form” idea also applies to the same student taking the same exam twice.  Suppose you take the test in September, at the beginning of a school year, and take it again in November to check your learning.  You’ve likely learned quite a bit and are higher on the ability range; you’ll get more difficult items, and therefore effectively a new test.  With a linear test, you might see the exact same form both times.

This is a major reason that adaptive assessment plays a formative role in K-12 education, delivered several times per year to millions of students in the US alone.

Individual pacing of tests

Examinees can move at their own speed.  Some might move quickly and be done in only 30 items.  Others might waver, also seeing 30 items but taking more time.  Still others might see 60 items.  The algorithms can be designed to optimize this process.

Advantages of computerized testing in general

Of course, the advantages of using a computer to deliver a test are also relevant.  Here are a few:
  • Immediate score reporting
  • On-demand testing can reduce printing, scheduling, and other paper-based concerns
  • Storing results in a database immediately makes data management easier
  • Computerized testing facilitates the use of multimedia in items
  • You can immediately run psychometric reports
  • Timelines are reduced with an integrated item banking system

 

How to develop an adaptive assessment that is valid and defensible

CATs are the future of assessment. They operate by adapting both the difficulty and number of items to each individual examinee. The development of an adaptive test is no small feat, and requires five steps integrating the expertise of test content developers, software engineers, and psychometricians.

The development of a quality adaptive test is complex and requires experienced psychometricians in both item response theory (IRT) calibration and CAT simulation research. FastTest can provide you the psychometrician and software; if you provide test items and pilot data, we can help you quickly publish an adaptive version of your test.

   Step 1: Feasibility, applicability, and planning studies. First, extensive Monte Carlo simulation research must occur, and the results must be formulated as business cases, to evaluate whether adaptive testing is feasible, applicable, or even possible.

   Step 2: Develop item bank. An item bank must be developed to meet the specifications recommended by Step 1.

   Step 3: Pretest and calibrate item bank. Items must be pilot tested on 200-1000 examinees (depends on IRT model) and analyzed by a Ph.D. psychometrician.

   Step 4: Determine specifications for the final CAT. Data from Step 3 are analyzed to evaluate CAT specifications and determine the most efficient algorithms, using CAT simulation software such as CATSim.

   Step 5: Publish live CAT. The adaptive test is published in a testing engine capable of fully adaptive tests based on IRT.  There are not very many of them out in the market.  Sign up for a free account in our platform FastTest and try for yourself!

Want to learn more about our one-of-a-kind model? Click here to read the seminal article by our two co-founders.  More adaptive testing research is available here.

Minimum requirements for computerized adaptive testing

Here are some minimum requirements to evaluate if you are considering a move to the CAT approach.

  • A large item bank piloted so that each item has at least 100 valid responses (Rasch model) or 500 (3PL model)
  • 500 examinees per year
  • Specialized IRT calibration and CAT simulation software like Xcalibre and CATsim.
  • Staff with a Ph.D. in psychometrics or an equivalent level of experience. Or, leverage our internationally recognized expertise in the field.
  • Items (questions) that can be scored objectively correct/incorrect in real-time
  • An item banking system and CAT delivery platform
  • Financial resources: Because it is so complex, the development of a CAT will cost at least $10,000 (USD) — but if you are testing large volumes of examinees, it will be a significantly positive investment. If you pay $20/hour for proctoring seats and cut a test from 2 hours to 1 hour for just 1,000 examinees… that’s a $20,000 savings.  If you are doing 200,000 exams?  That is $4,000,000 in seat time that is saved.

Adaptive testing: Resources for further reading

Visit the links below to learn more about adaptive assessment.  

  • We recommend that you first read this landmark article by our co-founders.
  • Read this article on producing better measurements with CAT from Prof. David J. Weiss.
  • International Association for Computerized Adaptive Testing: www.iacat.org
  • Here is a video on the history of CAT, by the godfather of CAT, Prof. David J. Weiss

Quick FAQ

Let’s start with some quick FAQ.  Afterwards, we will delve into details about the machine learning algorithm.

How do computer adaptive tests work?

Computer adaptive tests adjust the difficulty of upcoming questions based on a test-taker's previous answers. The process starts with a question of medium difficulty; if answered correctly, a more difficult question follows. An incorrect answer leads to an easier question. This dynamic adjustment continues throughout the exam, creating a tailored testing experience that accurately measures the individual's ability level.

What is the purpose of computerized adaptive testing?

The purpose of Computerized Adaptive Testing (CAT) is to accurately measure an individual's proficiency with fewer questions and in less time. By tailoring question difficulty to each test-taker's performance, CAT ensures an efficient and secure testing process.

What are the pros and cons of computer adaptive testing?

Pros of computer adaptive testing include more efficient assessments (potentially saving millions of hours of time), greater student engagement, and enhanced test security. The main cons are the high cost and complexity of test development, which precludes CAT for small exams.

Is computer adaptive testing fair?

Yes. It is psychometrically more fair than a traditional, static test. Even though test-takers encounter different questions, the adaptive algorithm accounts for question difficulty, so scores are reported on a common scale. This allows for an accurate assessment of each student's ability and provides employers with a fair basis to compare qualifications among candidates. A traditional test typically has mostly average items, leading to inaccurate scores for the top and bottom students.

What is an example of an adaptive test?

The GRE (Graduate Record Examinations) is a prime example of an adaptive test. So is the NCLEX (nursing exam in the USA), GMAT (business school admissions), and many formative assessments like the NWEA MAP.

How to implement CAT


Our revolutionary platform, FastTest, makes it easy to publish a CAT.  Once you upload your item texts and the IRT parameters, you can choose whatever options you please for steps 2-3-4 of the algorithm, simply by clicking on elements in our easy-to-use interface.  

 

Contact us to sign up for a free account in our industry-leading CAT platform or to discuss with one of our PhD psychometricians.

 

 


Samejima’s (1969) Graded Response Model (GRM, sometimes SGRM) is an extension of the two-parameter logistic model (2PL) within the item response theory (IRT) paradigm.  IRT provides a number of benefits over classical test theory, especially regarding the treatment of polytomous items; learn more about IRT vs. CTT here.

What is the Graded Response Model?

The GRM is a family of latent trait mathematical models for graded responses, developed by Fumiko Samejima (1969) and widely used ever since.  (A latent trait is a variable that is not directly measurable, e.g., a person’s level of neuroticism, conscientiousness, or openness.)  The GRM is also known as the Ordered Categorical Response Model because it deals with ordered polytomous categories.  It applies to both constructed-response and selected-response items where examinees can earn various levels of scores, such as 0-4 points; in that case the categories are 0, 1, 2, 3, and 4, and they are ordered.  ‘Ordered’ means exactly that: there is a specific order or ranking of the responses.  ‘Polytomous’ means that the responses are divided into more than two categories, i.e., not just correct/incorrect or true/false.

 

When should I use the GRM?

This family of models is applicable when polytomous responses to an item can be classified into more than two ordered categories (something beyond correct/incorrect), such as different degrees of achievement in solving a problem, levels of agreement on a Likert scale, or the frequency of a certain behavior.  The GRM covers both homogeneous and heterogeneous cases; the homogeneous case implies that the discriminating power of the underlying thinking process is constant throughout the range of the attitude or reasoning.

Samejima (1997) highlights the reasonableness of employing the GRM when examinees are scored on degrees of correctness (e.g., incorrect, partially correct, correct) or when measuring attitudes and preferences, as in Likert-scale surveys (e.g., strongly agree, agree, neutral, disagree, strongly disagree).  For instance, the GRM could be used in an extroversion scale where “I like to go to parties” is a high-difficulty statement and “I like to go out for coffee with a close friend” is an easy one.

Here are some examples of assessments where GRM is utilized:

  • Survey attitude questions using responses like ‘strongly disagree, disagree, neutral, agree, strongly agree’
  • Multiple response items, such as a list of 8 animals where the student selects which 3 are reptiles
  • Drag and drop or other tech enhanced items with multiple points available
  • Letter grades assigned to an essay: A, B, C, D, and E
  • Essay responses graded on a 0-to-4 rubric

 

Why use the GRM?

There are three general goals of applying GRM:

  • estimating an examinee’s ability level/latent trait
  • estimating how adequately the test questions measure that ability level/latent trait
  • evaluating the probability that an examinee will receive a specific score/grade on each question

Using item response theory in general (not just the GRM) provides a host of advantages.  It can help you validate the assessment.  Using the GRM can also enable adaptive testing.

 

How to calculate a response probability with the GRM?

There is a two-step process for calculating the probability that an examinee selects a certain category on a given question.  The first step is to find the probability that an examinee with a given ability level responds in category m or higher on the question:

   P*m(Θ) = 1 / (1 + e^(-1.7·a·(Θ - bm)))

where

1.7  is the scale factor

a  is the discrimination of the question

bm  is the boundary (threshold) parameter for category m, i.e., the ability level at which an examinee has a 50% chance of responding in category m or higher

e  is the mathematical constant, approximately equal to 2.718

Θ  is the ability level

P*m(Θ) = 1  if  m = 1, since responding in the lowest category or above is a certain event

P*m(Θ) = 0  if  m = M + 1, since responding in a category above the highest one is impossible.

 

The second step is to find the probability that the examinee responds in a given category, which is the difference between two adjacent cumulative probabilities:

   Pm(Θ) = P*m(Θ) - P*m+1(Θ)

This formula describes the probability of choosing a specific response to the question for each level of the ability it measures.
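
Here is a minimal sketch of this two-step calculation in Python for a single item with five ordered categories (0-4), using hypothetical discrimination and boundary parameters.  The category probabilities at any theta sum to 1.

  # Minimal sketch of the two-step GRM calculation for one item with five ordered
  # categories (0-4).  The discrimination (a) and boundary parameters (b) below
  # are hypothetical.
  import math

  a = 1.2
  b = [-1.5, -0.5, 0.6, 1.8]          # boundaries for categories 1, 2, 3, 4

  def p_star(theta, boundary):
      # Step 1: probability of responding in this category or higher.
      return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - boundary)))

  def category_probs(theta):
      # Step 2: P(category m) = P*(m) - P*(m+1), with P*(lowest) = 1 and P*(above highest) = 0.
      cum = [1.0] + [p_star(theta, bm) for bm in b] + [0.0]
      return [cum[m] - cum[m + 1] for m in range(len(cum) - 1)]

  for theta in (-2.0, 0.0, 2.0):
      probs = category_probs(theta)
      print(theta, [round(p, 3) for p in probs], "sum =", round(sum(probs), 3))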

 

How do I implement the GRM on my assessment?

You need item response theory software.  Start by downloading  Xcalibre  for free.  Below are outputs for two example items.

How to interpret this?  The GRM uses category response functions, which show the probability of selecting a given response as a function of theta (trait or ability).  For Item 6, we see that someone with theta from -3.0 to -0.5 is very likely to select “2” on the Likert scale (or whatever our response labels are).  Examinees above -0.5 are likely to select “3”.  But on Item 10, the green curve is low and not likely to be chosen at all; examinees from -2.0 to +2.0 are likely to select “3” on the Likert scale, and those above +2.0 are likely to select “4”.  Item 6 is relatively difficult, in a sense, because no one chose “4”.

Item 6

Xcalibre - graded response model easy

Item 10

Xcalibre - graded response model difficult

References

Keller, L. A. (2014). Item Response Theory Models for Polytomous Response Data. Wiley StatsRef: Statistics Reference Online.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17. doi:10.1002/j.2333-8504.1968.tb00153.x

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85–100). Springer-Verlag.


Coefficient alpha reliability, sometimes called Cronbach’s alpha, is a statistical index that is used to evaluate the internal consistency or reliability of an assessment. That is, it quantifies how consistent we can expect scores to be, by analyzing the item statistics. A high value indicates that the test is of high reliability, and a low value indicates low reliability.  This is one of the most fundamental concepts in psychometrics, and alpha is arguably the most common index.

What is coefficient alpha, aka Cronbach’s alpha?

The classic reference for alpha is Cronbach (1951). He defines it as:

   alpha = (k / (k - 1)) * (1 - Σσ²i / σ²X)

where k is the number of items, σ²i is the variance of item i, and σ²X is the total score variance.

Kuder-Richardson 20

While Cronbach tends to get the credit – to the point that the index is often called “Cronbach’s alpha” – he did not invent it. Kuder and Richardson (1937) suggested the following equation to estimate the reliability of a test with dichotomous (right/wrong) items.

   KR-20 = (k / (k - 1)) * (1 - Σpiqi / σ²X)

Note that Cronbach’s equation is the same, except that he replaced the binomial variance pq with the more general notation of variance (sigma).  This means that you can use Cronbach’s equation on polytomous data such as Likert rating scales.  For dichotomous data such as multiple-choice items, Cronbach’s alpha and KR-20 are exactly the same.

Additionally, Cyril Hoyt defined reliability in an equivalent approach using ANOVA in 1941, a decade before Cronbach’s paper.

How to interpret coefficient alpha

In general, alpha will range from 0.0 (random number generator) to 1.0 (perfect measurement). However, in rare cases, it can go below 0.0, such as if the test is very short or if there is a lot of missing data (sparse matrix). This, in fact, is one of the reasons NOT to use alpha in some cases. If you are dealing with linear-on-the-fly tests (LOFT), computerized adaptive tests (CAT), or a set of overlapping linear forms for equating (non-equivalent anchor test, or NEAT design), then you will likely have a large proportion of sparseness in the data matrix and alpha will be very low or negative. In such cases, item response theory provides a much more effective way of evaluating the test.

What is “perfect measurement”?  Well, imagine using a ruler to measure a piece of paper.  If it is American-sized, that piece of paper is always going to be 8.5 inches wide, no matter how many times you measure it with the ruler.  A bathroom scale is slightly less reliable; you might step on it, see 190.2 pounds, then step off and on again, and see 190.4 pounds.  This is a good example of how we often accept a small amount of unreliability in measurement.

Of course, we never have this level of accuracy in the world of psychoeducational measurement.  Even a well-made test is something where a student might get 92% today and 89% tomorrow (assuming we could wipe their brain of memory of the exact questions).

Reliability can also be interpreted as the ratio of true score variance to total score variance.  That is, every test score distribution has a total variance, which consists of variance due to the construct of interest (i.e., smart students do well and poor students do poorly) plus some error variance (random error, kids not paying attention to a question, a second dimension in the test… it could be many things).

What is a good value of coefficient alpha?

As psychometricians love to say, “it depends.” The rule of thumb that you generally hear is that a value of 0.70 is good and below 0.70 is bad, but that is terrible advice. A higher value indeed indicates higher reliability, but you don’t always need high reliability. A test to certify surgeons, of course, deserves all the items it needs to make it quite reliable. Anything below 0.90 would be horrible. However, the survey you take from a car dealership will likely have the statistical results analyzed, and a reliability of 0.60 isn’t going to be the end of the world; it will still provide much better information than not doing a survey at all!

Here’s a general depiction of how to evaluate levels of coefficient alpha.

Interpretation guidelines for coefficient (Cronbach’s) alpha

Using Alpha: The classical standard error of measurement

Coefficient alpha is also often used to calculate the classical standard error of measurement (SEM), which provides a related method of interpreting the quality of a test and the precision of its scores. The SEM can be interpreted as the standard deviation of scores that you would expect if a person took the test many times, with their brain wiped clean of the memory each time. If the test is reliable, you’d expect them to get almost the same score each time, meaning that SEM would be small.

   SEM = SD * sqrt(1 - r)

Note that SEM is a direct function of alpha, so that if alpha is 0.99, SEM will be small, and if alpha is 0.1, then SEM will be very large.

Coefficient Alpha and Unidimensionality

It can also be interpreted as a measure of unidimensionality. If all items are measuring the same construct, then scores on them will align, and the value of alpha will be high. If there are multiple constructs, alpha will be reduced, even if the items are still high quality. For example, if you were to analyze data from a Big Five personality assessment with all five domains at once, alpha would be quite low. Yet if you took the same data and calculated alpha separately on each domain, it would likely be quite high.

How to calculate the index

Because the calculation of coefficient alpha reliability is so simple, it can be done quite easily if you need to calculate it from scratch, such as using formulas in Microsoft Excel. However, any decent assessment platform or psychometric software will produce it for you as a matter of course. It is one of the most important statistics in psychometrics.
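
For readers who want to see the arithmetic, here is a minimal sketch in Python that computes coefficient alpha and the classical SEM from a small, entirely hypothetical matrix of scored item responses.

  # Minimal sketch of coefficient alpha and the classical SEM on a small,
  # hypothetical score matrix (rows = examinees, columns = items).
  import statistics

  scores = [
      [1, 1, 0, 1, 1],
      [1, 0, 0, 1, 0],
      [1, 1, 1, 1, 1],
      [0, 0, 0, 1, 0],
      [1, 1, 0, 0, 1],
      [0, 1, 1, 1, 1],
  ]

  k = len(scores[0])
  item_vars = [statistics.pvariance([row[j] for row in scores]) for j in range(k)]
  totals = [sum(row) for row in scores]
  total_var = statistics.pvariance(totals)

  alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
  sem = statistics.pstdev(totals) * (1 - alpha) ** 0.5
  print(f"alpha = {alpha:.3f}, SEM = {sem:.2f} raw-score points")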

Cautions on Overuse

Because alpha is just so convenient – boiling down the complex concept of test quality and accuracy to a single easy-to-read number – it is overused and over-relied upon. There are papers out in the literature that describe the cautions in detail; here is a classic reference.

One important consideration is the over-simplification of precision with coefficient alpha, and the classical standard error of measurement, when juxtaposed to the concept of conditional standard error of measurement from item response theory. This refers to the fact that most traditional tests have a lot of items of middle difficulty, which maximizes alpha. This measures students of middle ability quite well. However, if there are no difficult items on a test, it will do nothing to differentiate amongst the top students. Therefore, that test would have a high overall alpha, but have virtually no precision for the top students. In an extreme example, they’d all score 100%.

Also, alpha will completely fall apart when you calculate it on sparse matrices, because the total score variance is artifactually reduced.

Summary

In conclusion, coefficient alpha is one of the most important statistics in psychometrics, and for good reason. It is quite useful in many cases, and easy enough to interpret that you can discuss it with test content developers and other non-psychometricians. However, there are cases where you should be cautious about its use, and some cases where it completely falls apart. In those situations, item response theory is highly recommended.

differential item functioning

Differential item functioning (DIF) is a term in psychometrics for the statistical analysis of assessment data to determine if items are performing in a biased manner against some group of examinees. Most often, this is based on a demographic variable such as gender, ethnicity, or first language. For example, you might analyze a test to see if items are biased against an ethnic minority, such as Blacks or Hispanics in the USA.  Another organization I have worked with was concerned primarily with Urban vs. Rural students.  In the scientific literature, the majority is called the reference group and the minority is called the focal group.

As you would expect from the name, we are trying to find evidence that an item functions (performs) differently for two groups.  However, this is not as simple as one group getting the item wrong more often (a lower P value).  What if that group also has a lower ability/trait level on average?  Therefore, we must analyze the difference in performance conditional on ability.  This means we find examinees at a given level of ability (e.g., the 20th-30th percentile) and compare the difficulty of the item for minority vs. majority examinees at that level.

Mantel-Haenszel analysis of differential item functioning

The Mantel-Haenszel approach is a simple yet powerful way to analyze differential item functioning. We simply use the raw classical number-correct score as the indicator of ability, and use it to evaluate group differences conditional on ability. For example, we could split up the sample into fifths (slices of 20%), and for each slice, we evaluate the difference in P value between the groups. An example of this is below, to help visualize how DIF might operate.  Here, there is a notable difference in the probability of getting an item correct, with ability held constant.  The item is biased against the focal group.  In the slice of examinees 41-60th percentile, the reference group has a 60% chance while the focal group (minority) has a 48% chance.

Example: item P values by score band for the reference vs. focal group
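
Here is a minimal sketch of the conditioning idea in Python, with simulated (hypothetical) data: compare an item’s P value for reference versus focal examinees within bands of the total score.  A full Mantel-Haenszel analysis goes further, aggregating the stratified 2x2 tables into a chi-square test and a common odds ratio.

  # Minimal sketch of conditioning on ability for DIF: compare an item's P value
  # for reference vs. focal examinees within bands of the total score.  This shows
  # the idea only; the full Mantel-Haenszel statistic aggregates 2x2 tables across
  # strata.  All data below are simulated and hypothetical.
  import random
  random.seed(1)

  def simulate(group, n):
      people = []
      for _ in range(n):
          total = random.randint(0, 40)                       # total test score
          # Hypothetical biased item: harder for the focal group at the same score.
          p = total / 40 * (0.75 if group == "focal" else 1.0)
          people.append((total, int(random.random() < p)))
      return people

  data = {"reference": simulate("reference", 500), "focal": simulate("focal", 500)}

  bands = [(0, 8), (9, 16), (17, 24), (25, 32), (33, 40)]     # score slices
  for lo, hi in bands:
      line = f"Scores {lo:2d}-{hi:2d}:"
      for group, people in data.items():
          in_band = [u for total, u in people if lo <= total <= hi]
          p_val = sum(in_band) / len(in_band) if in_band else float("nan")
          line += f"  {group} P = {p_val:.2f}"
      print(line)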

Crossing and non-crossing DIF

Differential item functioning is sometimes described as crossing or non-crossing DIF.  The example above is non-crossing, because the lines do not cross.  In this case, there would also be a difference in the overall P value between the groups.  In crossing DIF, the two lines cross, and there may be no difference in overall P value at all – meaning the DIF would go completely unnoticed unless you specifically ran a DIF analysis like this.  That is one reason, though not the only one, that it is important to perform DIF analysis.

More methods of evaluating differential item functioning

There are, of course, more sophisticated methods of analyzing differential item functioning.  Logistic regression is a commonly used approach.  A sophisticated methodology is Raju’s differential functioning of items and tests (DFIT) approach.

How do I implement DIF?

There are three ways you can implement a DIF analysis.

1. General psychometric software: Well-known software for classical or item response theory analysis will often include an option for DIF. Examples are Iteman, Xcalibre, and IRTPRO (formerly Parscale/Multilog/Bilog).

2. DIF-specific software: While there are not many, there are software programs and R packages specific to DIF. An example is DFIT: there was once a standalone program of that name for running the analysis of the same name, but it is no longer supported, and you can use an R package instead.

3. General statistical software or programming environments: For example, if you are a fan of SPSS, you can use it to implement some DIF analyses such as logistic regression.

More resources on differential item functioning

Sage Publishing puts out “little green books” that are useful introductions to many topics.  There is one specifically on differential item functioning.


What is the difference between the terms dichotomous and polytomous in psychometrics?  These terms represent two subcategories of models within item response theory.  Item response theory (IRT) is the dominant psychometric paradigm for constructing, scoring, and analyzing assessments.  Virtually all large-scale assessments utilize IRT because of its well-documented advantages.  In many cases, however, it is referred to as if it were a single way of analyzing data.  But IRT is actually a fast-growing family of models, and the models operate quite differently depending on whether the test questions are scored right/wrong or yes/no (dichotomous) versus complex items like an essay scored on a rubric of 0 to 6 points (polytomous).  This post describes the differences and when to use one or the other.

 

Ready to use IRT?  Download Xcalibre for free

 

Dichotomous IRT Models

Dichotomous IRT models are those with two possible item scores.  Note that I say “item scores” and not “item responses” – the most common example of a dichotomous item is multiple choice, which typically has 4 to 5 options, but only two possible scores (correct/incorrect).  

True/False or Yes/No items are also obvious examples and are more likely to appear in surveys or inventories, as opposed to the ubiquity of the multiple-choice item in achievement/aptitude testing. Other item types that can be dichotomous are Scored Short Answer and Multiple Response (all or nothing scoring).  

What models are dichotomous?

The three most common dichotomous models are the 1PL/Rasch, the 2PL, and the 3PL.  Which one to use depends on the type of data you have, as well as your measurement philosophy.  A great example is Scored Short Answer items: there should be no effect of guessing on such an item, so the 2PL is a logical choice.  Here is a broad overgeneralization:

  • 1PL/Rasch: Uses only the difficulty (b) parameter and does not take into account guessing effects or the possibility that some items might be more discriminating than others; however, can be useful with small samples and other situations
  • 2PL: Uses difficulty (b) and discrimination (a) parameters, but no guessing (c); relevant for the many types of assessment where there is no guessing
  • 3PL: Uses all three parameters, typically relevant for achievement/aptitude testing.

What do dichotomous models look like?

Graphically, dichotomous models have one S-shaped curve with a positive slope, as seen here.  It shows that the probability of responding in the keyed direction increases with higher levels of the trait or ability.

item response function

Technically, there is also a line for the probability of an incorrect response, which goes down, but this is obviously the 1-P complement, so it is rarely drawn in graphs.  It is, however, used in scoring algorithms (check out this white paper).

In the example, a student with theta = -3 has about a 0.28 chance of responding correctly, while theta = 0 has about 0.60 and theta = 1 has about 0.90.
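
Here is a minimal sketch of such a curve in Python: a three-parameter logistic (3PL) item response function with hypothetical parameters chosen to roughly resemble the probabilities described above.

  # Minimal sketch of a dichotomous (3PL) item response function.  The parameters
  # are hypothetical: discrimination a, difficulty b, pseudo-guessing c.
  import math

  a, b, c = 1.0, 0.0, 0.25

  def p_correct(theta):
      return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

  for theta in (-3, -2, -1, 0, 1, 2, 3):
      print(f"theta = {theta:+d}: P(correct) = {p_correct(theta):.2f}")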

Polytomous IRT Models

Polytomous models are for items that have more than two possible scores.  The most common examples are Likert-type items (Rate on a scale of 1 to 5) and partial credit items (score on an Essay might be 0 to 5 points). IRT models typically assume that the item scores are integers.

What models are polytomous?

Unsurprisingly, the most common polytomous models use names like rating scale and partial credit.

  • Rating Scale Model (Andrich, 1978)
  • Partial Credit Model (Masters, 1982)
  • Generalized Rating Scale Model (Muraki, 1990)
  • Generalized Partial Credit Model (Muraki, 1992)
  • Graded Response Model (Samejima, 1969)
  • Nominal Response Model (Bock, 1972)

What do polytomous models look like?

Polytomous models have a line for each possible response.  The line for the highest point value is typically S-shaped, like a dichotomous curve.  The line for the lowest point value typically slopes downward, like the 1-P complement of a dichotomous curve.  Point values in the middle typically have bell-shaped curves.  The example is for an essay scored 0 to 5 points.  Only students with theta > 2 are likely to get full points (blue), while students with 1 < theta < 2 are likely to receive 4 points (green).

I’ve seen “polychotomous.”  What does that mean?

It means the same as polytomous.  

How is IRT used in our platform?

We use it to support the test development cycle, including form assembly, scoring, and adaptive testing.  You can learn more on this page.

How can I analyze my tests with IRT?

You need specially designed software, like Xcalibre.  Classical test theory is so simple that you can do it with Excel functions.

Recommended Readings

Item Response Theory for Psychologists by Embretson and Reise (2000).


Multistage testing (MST) is a type of computerized adaptive testing (CAT).  This means it is an exam delivered on computers which dynamically personalize it for each examinee or student.  Typically, this is done with respect to the difficulty of the questions, by making the exam easier for lower-ability students and harder for high-ability students.  Doing this makes the test shorter and more accurate while providing additional benefits.  This post will provide more information on multistage testing so you can evaluate if it is a good fit for your organization.

Already interested in MST and want to implement it?  Contact us to talk to one of our experts and get access to our powerful online assessment platform, where you can create your own MST and CAT exams in a matter of hours.

 

What is multistage testing?

Like CAT, multistage testing adapts the difficulty of the items presented to the student. But while adaptive testing works by adapting each item one by one using item response theory (IRT), multistage works in blocks of items.  That is, CAT will deliver one item, score it, pick a new item, score it, pick a new item, etc.  Multistage testing will deliver a block of items, such as 10, score them, then deliver another block of 10.

The design of a multistage test is often referred to as panels.  There is usually a single routing test or routing stage which starts the exam, and then students are directed to different levels of panels for subsequent stages.  The number of levels is sometimes used to describe the design; the example on the right is a 1-3-3 design.  Unlike CAT, there are only a few potential paths, unless each stage has a pool of available testlets.

As with item-by-item CAT, multistage testing is almost always done using IRT as the psychometric paradigm, selection algorithm, and scoring method.  This is because IRT can score examinees on a common scale regardless of which items they see, which is not possible using classical test theory.

To learn more about MST, I recommend this book.

Why multistage testing?

Item-by-item CAT is not the best fit for all assessments, especially those that naturally tend towards testlets, such as language assessments where there is a reading passage with 3-5 associated questions.

Multistage testing allows you to realize some of the well-known benefits of adaptive testing (see below), with more control over content and exposure.  In addition to controlling content at an examinee level, it also can make it easier to manage item bank usage for the organization.

 

How do I implement multistage testing?

1. Develop your item banks using items calibrated with item response theory

2. Assemble a test with multiple stages, defining pools of items in each stage as testlets

3. Evaluate the test information functions for each testlet (see the sketch after this list)

4. Run simulation studies to validate the delivery algorithm with your predefined testlets

5. Publish for online delivery
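
Here is a minimal sketch, in Python, of the evaluation in step 3 and how it drives routing: a testlet’s information function is the sum of its items’ information functions, and after the routing stage the examinee is sent to the testlet that is most informative at their current theta estimate.  All item parameters below are hypothetical 2PL values.

  # Minimal sketch of evaluating testlet information functions and routing an
  # examinee to the most informative second-stage testlet.  A testlet's information
  # is the sum of its items' information.  Item parameters are hypothetical 2PL.
  import math

  def info(theta, a, b):
      p = 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))
      return (1.7 * a) ** 2 * p * (1.0 - p)

  testlets = {                                  # hypothetical second-stage testlets
      "easy":   [(1.0, -1.8), (0.9, -1.2), (1.1, -1.0)],
      "medium": [(1.0, -0.3), (1.2,  0.0), (0.9,  0.4)],
      "hard":   [(1.1,  1.0), (1.0,  1.4), (1.2,  1.9)],
  }

  theta_after_routing = 0.3                     # theta estimated from the routing stage
  testlet_info = {name: sum(info(theta_after_routing, a, b) for a, b in items)
                  for name, items in testlets.items()}
  chosen = max(testlet_info, key=testlet_info.get)
  print(testlet_info)
  print(f"Route examinee to the '{chosen}' testlet")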

Our industry-leading assessment platform manages much of this process for you.  The image to the right shows our test assembly screen where you can evaluate the test information functions for each testlet.

Multistage testing

 

Benefits of multistage testing

There are a number of benefits to this approach, which are mostly shared with CAT.

  • Shorter exams: because difficulty is targeted, you waste less time
  • Increased security: There are many possible configurations, unlike a linear exam where everyone sees the same set of items
  • Increased engagement: Lower ability students are not discouraged, and high ability students are not bored
  • Control of content: CAT has some content control algorithms, but they are sometimes not sufficient
  • Supports testlets: CAT does not support tests that have testlets, like a reading passage with 5 questions
  • Allows for review: CAT does not usually allow for review (students can go back a question to change an answer), while MST does

 

Examples of multistage testing

MST is often used in language assessment, which means that it is often used in educational assessment, such as benchmark K-12 exams, university admissions, or language placement/certification.  One of the most famous examples is the Scholastic Aptitude Test from The College Board; it is moving to an MST approach in 2023.

Because of the complexity of item response theory, most organizations that implement MST have a full-time psychometrician on staff.  If your organization does not, we would love to discuss how we can work together.

 


Maximum Likelihood Estimation (MLE) is an approach to estimating parameters for a model.  It is one of the core aspects of Item Response Theory (IRT), especially to estimate item parameters (analyze questions) and estimate person parameters (scoring).  This article will provide an introduction to the concepts of MLE.

Content

  1. History behind Maximum Likelihood Estimation
  2. Defining Maximum Likelihood Estimation
  3. Comparison of likelihood and probability
  4. Calculating Maximum Likelihood Estimation
  5. Key characteristics of Maximum Likelihood Estimation
  6. Weaknesses of Maximum Likelihood Estimation
  7. Application of Maximum Likelihood Estimation
  8. Summarizing remarks about Maximum Likelihood Estimation
  9. References

History behind Maximum Likelihood Estimation

Even though early ideas about MLE appeared in the mid-1700s, it was Sir Ronald Aylmer Fisher who developed them into a formalized concept much later. Fisher worked on maximum likelihood from 1912 to 1922, repeatedly critiquing and refining his own justifications for the method (Aldrich, 1997; Stigler, 2007). In 1925, he published “Statistical Methods for Research Workers”, one of the 20th century’s most influential books on statistical methods. The development of the maximum likelihood concept was a breakthrough in statistics.

 

Defining Maximum Likelihood Estimation

Wikipedia defines MLE as follows:

In statistics, Maximum Likelihood Estimation is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.

Merriam Webster has a slightly different definition for MLE:

A statistical method for estimating population parameters (as the mean and variance) from sample data that selects as estimates those parameter values maximizing the probability of obtaining the observed data.

To sum up, MLE is a method that estimates the parameter values of a model. The parameter values are chosen such that they maximize the likelihood that the process described by the model produced the data that were actually observed. To put it simply, MLE answers the question:

For which parameter values does the observed data have the highest probability?

 

Comparison of likelihood and probability

The definitions above contain the word “probability”, but it is important not to mix up these two concepts. Let us look at some differences between likelihood and probability so that you can differentiate between them.

Likelihood:

  • Refers to events that have already occurred, with known outcomes
  • Likelihoods do not need to add up to 1
  • Example 1: I flipped a coin 20 times and obtained 20 heads. What is the likelihood that the coin is fair?
  • Example 2: Given the fixed outcomes (data), what is the likelihood of different parameter values?

Probability:

  • Refers to events that will occur in the future
  • Probabilities add up to 1
  • Example 1: I flipped a coin 20 times. What is the probability of the coin landing heads every time?
  • Example 2: The fixed parameter P = 0.5 is given. What is the probability of different outcomes?
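The coin examples can be made concrete with a few lines of code. Here is a minimal sketch using SciPy’s binomial distribution: the first call fixes the parameter and asks about an outcome (probability), while the loop fixes the observed data and varies the parameter (likelihood).

```python
from scipy.stats import binom

# Probability: the parameter p = 0.5 is fixed; we ask about a possible outcome.
print(binom.pmf(20, 20, 0.5))       # probability of 20 heads in 20 fair flips

# Likelihood: the data (20 heads out of 20 flips) are fixed; we vary the parameter p.
for p in (0.5, 0.8, 0.95, 1.0):
    print(p, binom.pmf(20, 20, p))  # likelihood of each candidate value of p
```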

 

Calculating Maximum Likelihood Estimation

MLE is typically calculated by taking the derivative of the log-likelihood with respect to each parameter (for example, the mean μ and the variance σ²) and setting it equal to 0. There are four general steps in estimating the parameters:

  • Choose a distribution assumed to have generated the observed data
  • Estimate the distribution’s parameters by maximizing the log-likelihood
  • Plug the estimated parameters into the distribution’s probability function
  • Evaluate how well the distribution describes the observed data
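As an illustration of these steps, here is a minimal sketch of MLE for a normal distribution using SciPy’s general-purpose optimizer; the data values are hypothetical and purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

data = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.3, 4.6, 5.0])  # hypothetical observations

def negative_log_likelihood(params, x):
    """Negative log-likelihood of a normal distribution with mean mu and sd sigma."""
    mu, sigma = params
    if sigma <= 0:                     # sigma must be positive
        return np.inf
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# Maximize the likelihood by minimizing its negative
result = minimize(negative_log_likelihood, x0=[np.mean(data), 1.0],
                  args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(f"MLE estimates: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
# For the normal distribution these match the closed-form answers:
# the sample mean and the (biased) sample standard deviation.
```

For simple cases like this one a closed-form solution exists, but the same numerical approach carries over to models, such as IRT, where no closed form is available.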

 

Key characteristics of Maximum Likelihood Estimation

  • MLE is most straightforward with one-dimensional data
  • MLE assumes relatively “clean” data (e.g. no outliers)
  • MLE is usually computationally manageable
  • MLE often runs in real time on modern computers
  • MLE works well for simple cases (e.g. the binomial distribution)

 

Weaknesses of Maximum Likelihood Estimation

  • MLE is sensitive to outliers
  • MLE often demands optimization for speed and memory to obtain useful results
  • MLE is sometimes poor at differentiating between models with similar distributions
  • MLE can be technically challenging, especially for multidimensional data and complex models

 

Application of Maximum Likelihood Estimation

In order to apply MLE, two important assumptions (typically referred to as the i.i.d. assumption) need to be made:

  • Data must be independently distributed, i.e. the observation of any given data point does not depend on the observation of any other data point (each data point is an independent experiment)
  • Data must be identically distributed, i.e. each data point is generated from the same distribution family with the same parameters

Let us consider several world-known applications of MLE:

  • Global Positioning System (GPS)
  • Smart keyboard programs for iOS and Android operating systems (e.g. Swype)
  • Speech recognition programs (e.g. Carnegie Mellon open source SPHINX speech recognizer, Dragon Naturally Speaking)
  • Detection and measurement of the properties of the Higgs boson at the European Organization for Nuclear Research (CERN) by means of the Large Hadron Collider (François Englert and Peter Higgs were awarded the 2013 Nobel Prize in Physics for the theory of the Higgs boson)

Generally speaking, MLE is employed in agriculture, economics, finance, physics, medicine and many other fields.

 

Summarizing remarks about Maximum Likelihood Estimation

Despite some functional issues with MLE, such as the technical challenges posed by multidimensional data and complex multiparameter models in many real-world problems, MLE remains a powerful and widely used statistical approach for classification and parameter estimation. It has brought many successes in both the scientific and commercial worlds.

 

References

Aldrich, J. (1997). R. A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12(3), 162-176.

Stigler, S. M. (2007). The epic story of maximum likelihood. Statistical Science, 598-620.

 

Multidimensional Item Response Theory

Multidimensional item response theory (MIRT) has developed from its factor-analytic and unidimensional item response theory (IRT) roots. This development has led to an increased emphasis on precise modeling of the item-examinee interaction and a decreased emphasis on data reduction and simplification. MIRT represents a broad family of probabilistic models designed to portray an examinee’s likelihood of a correct response based on item parameters and multiple latent traits/dimensions. MIRT models define a multidimensional space that describes individual differences on the targeted dimensions.

Within the MIRT framework, items are treated as fundamental units of test construction. Furthermore, items are considered multidimensional trials for obtaining valid and reliable information about an examinee’s location in a complex space. This philosophy extends the work from unidimensional IRT to provide a more comprehensive description of item parameters and of how the information from items combines to depict examinees’ characteristics. Therefore, items need to be crafted mindfully so that they are sufficiently sensitive to the targeted combinations of knowledge and skills, and then carefully selected to help improve estimates of examinees’ characteristics in the multidimensional space.

Trigger for development of Multidimensional Item Response Theory

In modern psychometrics, IRT is employed for calibrating items belonging to individual scales, so that each dimension is regarded as unidimensional. According to IRT models, an examinee’s response to an item depends solely on the item parameters and on a single examinee parameter, the latent trait θ. Unidimensional IRT models are advantageous in that they use fairly simple mathematical forms, apply to a wide range of fields, and are somewhat robust to violations of their assumptions.

However, real interactions between examinees and items are likely far more complex than these IRT models imply. Responding to a specific item may require examinees to apply multiple abilities and skills, especially in complex areas such as the natural sciences. Thus, even though unidimensional IRT models are highly useful under specific conditions, the field of psychometrics needed more sophisticated models to reflect these multiform examinee-item interactions. For that reason, unidimensional IRT models were extended to multidimensional models capable of expressing situations in which examinees need multiple abilities and skills to respond to test items.

Categories of Multidimensional Item Response Theory models

There are two broad categories of MIRT models: compensatory and non-compensatory (partially compensatory).


  • Under the compensatory model, examinees’ abilities work together to increase the probability of a correct response to an item, i.e. higher ability on one trait/dimension compensates for lower ability on another. For instance, suppose an examinee must read a passage on a current event and answer a question about it. This item assesses two abilities: reading comprehension and knowledge of current events. If the examinee is aware of the current event, that will compensate for lower reading ability; conversely, if the examinee is an excellent reader, their reading skills will compensate for a lack of knowledge about the event.
  • Under the non-compensatory model, abilities do not compensate for each other, i.e. an examinee needs a high level of ability on all traits/dimensions to have a high chance of responding to a test item correctly. For example, suppose an examinee must solve a traditional mathematical word problem. This item assesses two abilities: reading comprehension and mathematical computation. If the examinee has excellent reading ability but low mathematical computation ability, they will be able to read the text but not solve the problem. With the reverse abilities, the examinee will not be able to solve the problem because they cannot understand what is being asked.

Within the literature, compensatory MIRT models are more commonly used.
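To make the compensatory case concrete, here is a minimal sketch of a compensatory multidimensional 2PL item, following the reading-passage example above; all parameter values are hypothetical.

```python
import numpy as np

def compensatory_m2pl(theta, a, d):
    """P(correct) for trait vector theta, discrimination vector a, and intercept d."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

a = np.array([1.2, 0.8])   # discriminations: reading comprehension, current events
d = -0.5                   # intercept (easiness) parameter

# High ability on one dimension compensates for low ability on the other
print(compensatory_m2pl(np.array([ 2.0, -1.0]), a, d))  # strong reader, little knowledge
print(compensatory_m2pl(np.array([-1.0,  2.0]), a, d))  # weak reader, knows the event
print(compensatory_m2pl(np.array([-1.0, -1.0]), a, d))  # low on both: much lower probability
```

A non-compensatory model would instead multiply separate probabilities for each dimension, so a deficit on either trait drags the overall probability down.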

Applications of Multidimensional Item Response Theory

  • Since MIRT analyses concentrate on the interaction between item parameters and examinee characteristics, they have motivated numerous studies of the skills and abilities needed to answer an item correctly, and of the dimensions to which test items are sensitive. This research area demonstrates the importance of a thorough understanding of the ways that tests function. MIRT analyses can help identify the group differences and item sensitivities that underlie test and item bias, and help explain differential item functioning (DIF) statistics.
  • MIRT allows linking of calibrations, i.e. putting item parameter estimates from multiple calibrations into the same multidimensional coordinate system. This enables reporting examinee performance on different sets of items as profiles on multiple dimensions located on the same scales. Thus, MIRT makes it possible to create large pools of calibrated items that can be used for the construction of multidimensionally parallel test forms and computerized adaptive testing (CAT).

Conclusion

Given the complexity of the constructs in education and psychology and the level of detail provided in test specifications, MIRT is particularly relevant for investigating how individuals approach their learning and, subsequently, how it is influenced by various factors. MIRT analysis is still at an early stage of development and hence is a very active area of current research, particularly with respect to CAT technologies. Interested readers are referred to Reckase (2009) for more detailed information about MIRT.

References

Reckase, M. D. (2009). Multidimensional Item Response Theory. Springer.

The IRT Item Pseudo-Guessing Parameter

The item pseudo-guessing parameter is one of the three item parameters estimated under item response theory (IRT): discrimination a, difficulty b, and pseudo-guessing c. The pseudo-guessing parameter c is utilized only in the 3PL model.  It represents a lower asymptote for the probability of an examinee responding correctly to an item.

Background of IRT item pseudo-guessing parameter 

If you look at the post on the IRT 2PL model, you will see that the probability of a correct response depends on the examinee ability level θ, the item discrimination parameter a, and the item difficulty parameter b. However, one of the realities of testing is that examinees will get some multiple-choice items correct by guessing. Therefore, the probability of a correct response includes a small component that is due to guessing.

Neither the 1PL nor the 2PL model accounts for guessing, but Birnbaum (1968) altered the 2PL model to include it. Unfortunately, due to this inclusion, the function lost the nice mathematical properties of the 2PL logistic model. Nevertheless, even though it is technically no longer a logistic model, it has become known as the three-parameter logistic model (3PL or IRT 3PL). Baker (2001) presents the following equation for the IRT 3PL model:

P(θ) = c + (1 − c) · exp[a(θ − b)] / (1 + exp[a(θ − b)])

where:

a is the item discrimination parameter

b is the item difficulty parameter

c is the item pseudo-guessing parameter

θ is the examinee ability parameter
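Here is a minimal sketch of the equation above in code, showing that the probability of a correct response never drops below c; the parameter values are hypothetical.

```python
import numpy as np

def p_3pl(theta, a=1.0, b=0.0, c=0.20):
    """Probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta = {theta:+.1f}: P = {p_3pl(theta):.3f}")
# Even at theta = -3, P stays near the lower asymptote c = 0.20.
```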

Interpretation of pseudo-guessing parameter

In general, the pseudo-guessing parameter c is the probability of getting the item correct by guessing alone. For instance, c = 0.20 means that, at all ability levels, the probability of getting the item correct by guessing alone is 0.20.  This often reflects the structure of multiple-choice items: 5-option items tend to have values around 0.20 and 4-option items around 0.25.

It is worth noting that the value of c does not vary as a function of the trait/ability level θ, i.e. examinees with high and low ability levels have the same probability of responding correctly by guessing. Theoretically, the guessing parameter ranges between 0 and 1, but in practice values above 0.35 are considered unacceptable, hence the range 0 < c < 0.35 is applied.  A value higher than 1/k, where k is the number of options, often indicates that a distractor is not performing.

How pseudo-guessing parameter affects other parameters

Due to the presence of the guessing parameter, the definition of the item difficulty parameter b is changed. Within the 1PL and 2PL models, b is the point on the ability scale at which the probability of the correct response is 0.5. Under the 3PL model, the lower limit of the item characteristic curve (ICC) or item response function (IRF) is the value of c rather than zero. According to Baker (2001), the item difficulty parameter is the point on the ability scale where:

P(θ) = (1 + c) / 2

Therefore, the probability is halfway between the value of c and 1. Thus, the parameter c has defined a boundary to the lowest value of the probability of the correct response, and the item difficulty parameter b determines the point on the ability scale where the probability of the correct response is halfway between this boundary and 1.

The item discrimination parameter a can still be interpreted as being proportional to the slope of the ICC/IRF at the point θ = b. However, under the 3PL model, the slope of the ICC/IRF at θ = b actually equals a(1 − c)/4. These changes in the definitions of the item parameters a and b are quite important when interpreting test analyses.
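Both claims are easy to check numerically. The sketch below uses hypothetical parameter values to confirm that P(θ) at θ = b equals (1 + c)/2 and that the slope there equals a(1 − c)/4.

```python
import numpy as np

a, b, c = 1.5, 0.0, 0.20   # hypothetical 3PL item parameters

def p_3pl(theta):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

print(p_3pl(b), (1 + c) / 2)                  # both are 0.6

h = 1e-5                                      # finite-difference step
slope = (p_3pl(b + h) - p_3pl(b - h)) / (2 * h)
print(slope, a * (1 - c) / 4)                 # both are 0.3
```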

References

Baker, F. B. (2001). The basics of item response theory.

Birnbaum, A. L. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Addison-Wesley.