item difficulty vs. item facility

Classical test theory is a century-old paradigm for psychometrics: using quantitative and scientific processes to develop and analyze assessments in order to improve their quality.  (Nobody likes unfair tests!)  The most basic and frequently used item statistic from classical test theory is the P-value.  It is usually called item difficulty but is sometimes called item facility, which can lead to confusion.

The P-Value Statistic

The classical P-value is the proportion of examinees that respond correctly to a question, or respond in the “keyed direction” for items where the notion of correct is not relevant (imagine a personality assessment where all questions are Yes/No statements such as “I like to go to parties” … Yes is the keyed direction for an Extraversion scale).  Note that this is NOT the same as the p-value that is used in hypothesis testing from general statistical methods.  This P-value is almost universally agreed upon in terms of calculation.  But some people call it item difficulty and others call it item facility.  Why?

It has to do with the clarity of interpretation.  It usually makes sense to think of difficulty as an important aspect of the item.  The P-value reflects this, but in a reversed manner.  We usually expect higher values to indicate more of something, right?  But a P-value of 1.00 is high, and it means there is not much difficulty; everyone gets the item correct, so there is actually no difficulty whatsoever.  A P-value of 0.25 is low, but it indicates a lot of difficulty; only 25% of examinees get the item correct.

So where does “item facility” come in?

See how the meaning is reversed?  It’s for this reason that some psychometricians prefer to call it item facility or item easiness.  We still use the P-value, but 1.00 means high facility/easiness, and 0.25 means low facility/easiness.  The direction of the semantics fits much better.

Nevertheless, this is a minority of psychometricians.  There’s too much momentum to change an entire field at this point!  It’s similar to the three dichotomous IRT parameters (a, b, c); some of you might have noticed that they are actually in the wrong order, because the 1-parameter model does not use the a parameter, it uses the b.

At the end of the day, it doesn’t really matter, but it’s another good example of how we all just got used to doing something and it’s now too far down the road to change it.  Tradition is a funny thing.
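
Whatever you call it, the calculation is the same: score each response as 1 or 0 in the keyed direction and take the mean for each item.  Here is a minimal sketch in Python, using a small hypothetical matrix of scored responses (the data are made up for illustration only):

    import numpy as np

    # Rows are examinees, columns are items; 1 = correct (or keyed) response, 0 = not.
    # Hypothetical data, just for illustration.
    responses = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 1, 0, 1],
    ])

    # Classical P-value: proportion responding in the keyed direction for each item
    p_values = responses.mean(axis=0)
    print(p_values)   # -> 0.75, 0.75, 0.25, 1.00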

Have you heard about standard setting approaches such as the Hofstee method, or perhaps the Angoff, Ebel, Nedelsky, or Bookmark methods?  There are certainly various ways to set a defensible cutscore on a professional credentialing or pre-employment test.  Today, we are going to discuss the Hofstee method.  You may also be interested in reading this introductory post on setting a cutscore using item response theory.

Why Standard Setting?

Certification organizations that care about the quality of their examinations need to follow best practices and international standards for test development, such as the Standards laid out by the National Commission for Certifying Agencies (NCCA).  One component of that is standard setting, also known as cutscore studies.  One of the most common and respected approaches for that is the modified-Angoff methodology.

However, the Angoff approach has one flaw: the subject matter experts (SMEs) tend to expect too much of minimally competent candidates, and sometimes set a cutscore so high that even they themselves would not pass the exam.  There are several reasons this can occur.  For example, raters might think “I would expect anyone who worked for me to know how to do this” without considering that the people who work for them might have 10 years of experience, while test candidates could be fresh out of training or school and have had the topic touched on for only 5 minutes.  SMEs often forget what it was like to be a much younger and inexperienced version of themselves.

For this reason, several compromise methods have been suggested to compare the Angoff-recommended cutscore with a “reality check” of actual score performance on the exam, allowing the SMEs to make a more informed decision when setting the official cutscore of the exam.  I like to use the Beuk method and the Hofstee method.

The Hofstee Method

One method of adjusting the cutscore based on raters’ impressions of the difficulty of the test and possible pass rates is the Hofstee method (Mills & Melican, 1987; Cizek, 2006; Burr et al., 2016).  This method requires the raters to estimate four values:

  1. The minimum acceptable failure rate
  2. The maximum acceptable failure rate
  3. The minimum acceptable cutscore, even if every examinee passed
  4. The maximum acceptable cutscore, even if every examinee failed

The first two values are failure rates, and are therefore between 0% and 100%, with 100% indicating a test that is too difficult for anyone to pass.  The latter two values are on the raw score scale, and therefore range between 0 and the number of items in the test, again with a higher value indicating a more difficult cutscore to achieve.

These values are paired as two points, (minimum cutscore, maximum failure rate) and (maximum cutscore, minimum failure rate), and the line passing through those two points is estimated.  The intersection of this line with the observed failure-rate function (the percentage of examinees who would fail at each possible cutscore) is the recommended adjusted cutscore.

[Figure: Hofstee method plot]
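
To make the arithmetic concrete, here is a minimal sketch in Python.  The function name and the simulated scores are hypothetical; it assumes the line runs from (minimum cutscore, maximum failure rate) to (maximum cutscore, minimum failure rate) and simply picks the candidate cutscore where the observed failure-rate curve comes closest to that line, whereas in practice the intersection is often read off a graph like the one above.

    import numpy as np

    def hofstee_cutscore(scores, k_min, k_max, f_min, f_max):
        """Hofstee compromise: intersect the line from (k_min, f_max) to
        (k_max, f_min) with the observed failure-rate curve."""
        cutscores = np.arange(k_min, k_max + 1)
        # Proportion of examinees who would fail at each candidate cutscore
        fail_rate = np.array([(scores < c).mean() for c in cutscores])
        # The Hofstee line falls from f_max at k_min to f_min at k_max
        line = f_max + (f_min - f_max) * (cutscores - k_min) / (k_max - k_min)
        # Recommended cutscore: where the curve comes closest to the line
        idx = np.argmin(np.abs(fail_rate - line))
        return cutscores[idx], fail_rate[idx]

    # Hypothetical example: 200 examinees on a 50-item test
    rng = np.random.default_rng(0)
    scores = rng.binomial(50, 0.7, size=200)
    cut, projected_fail_rate = hofstee_cutscore(scores, k_min=25, k_max=40,
                                                f_min=0.05, f_max=0.40)
    print(cut, projected_fail_rate)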

How can I use the Hofstee Method?

Unlike the Beuk, the Hofstee method does not utilize the Angoff ratings, so it represents a completely independent reality check.  In fact, it is sometimes used as a standalone cutscore setting method itself, but because it does not involve rating of every single item, I recommend it be used in concert with the Angoff and Beuk approaches.

 

three standard errors: mean, measurement, and estimate

One of my graduate school mentors once said in class that there are three standard errors that everyone in the assessment or I/O psychology field needs to know: mean, measurement, and estimate.  They are quite distinct in concept and application but easily confused by someone with minimal training.

I’ve personally seen the standard error of the mean reported as the standard error of measurement, which is completely unacceptable.

So in this post, I’ll briefly describe each so that the differences are clear.  In later posts, I’ll delve deeper into each of the standard errors.

Standard Error of the Mean

This is the standard error that you learned about in Introduction to Statistics back in your sophomore year of college/university.  It is related to the Central Limit Theorem, the cornerstone of statistics.  Its purpose is to provide an index of accuracy (or conversely, error) in a sample mean.  Any sample drawn from a population will have a mean, but means from different samples will vary.  The standard error of the mean estimates the variation we might expect in these means across different samples and is defined as

   SEmean = SD/sqrt(n)

Where SD is the sample’s standard deviation and n is the number of observations in the sample.  This can be used to create a confidence interval for the true population mean.

The most important thing to note, with respect to psychometrics, is that this has nothing to do with psychometrics.  This is just general statistics.  You could be weighing a bunch of hay bales and calculating their average; it applies anywhere you are making observations.  It can be used, however, with assessment data.

For example, suppose you do not want to make every student in a country take a test, and instead sample 50,000 students, who obtain a mean of 71 items correct with an SD of 12.3.  The standard error of the mean is then 12.3/sqrt(50000) = 0.055.  You can be 95% certain that the true population mean lies in the narrow range of 71 ± (1.96 * 0.055), roughly 70.89 to 71.11.
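
A quick sketch of this calculation in Python, with simulated scores standing in for the national sample:

    import numpy as np

    # Simulated sample of 50,000 scores with mean ~71 and SD ~12.3 (hypothetical)
    rng = np.random.default_rng(42)
    scores = rng.normal(loc=71, scale=12.3, size=50_000)

    se_mean = scores.std(ddof=1) / np.sqrt(len(scores))   # SD / sqrt(n)
    mean = scores.mean()
    ci_95 = (mean - 1.96 * se_mean, mean + 1.96 * se_mean)
    print(round(se_mean, 3), ci_95)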

Click here to read more.

Standard Error of Measurement

More important in the world of assessment is the standard error of measurement.  Its purpose is to provide an index of the accuracy of a person’s score on a test; that is, it applies to a single person, rather than to a group as with the standard error of the mean.  It can be used in both the classical test theory and item response theory perspectives, though it is defined quite differently in each.

In classical test theory, it is defined as

   SEM = SD*sqrt(1-r)

Where SD is the standard deviation of scores for everyone who took the test, and r is the reliability of the test.  It can be interpreted as the standard deviation of scores that you would find if you had the person take the test over and over, with a fresh mind each time.  A confidence interval with this is then interpreted as the band where you would expect the person’s true score on the test to fall.
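
A minimal sketch of this classical calculation in Python, assuming a hypothetical test with an observed-score SD of 10 and a reliability of 0.90:

    import numpy as np

    sd = 10.0           # standard deviation of observed test scores
    reliability = 0.90  # e.g., coefficient alpha

    sem = sd * np.sqrt(1 - reliability)   # about 3.16 points
    observed_score = 75
    # Band in which we would expect this examinee's true score to fall (95%)
    ci_95 = (observed_score - 1.96 * sem, observed_score + 1.96 * sem)
    print(round(sem, 2), ci_95)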

Item Response Theory conceptualizes the SEM as a continuous function across the range of student abilities.  A test form will have more accuracy – less error – in a range of abilities where there are more items or items of higher quality.  That is, a test with most items of middle difficulty will produce accurate scores in the middle of the range, but not measure students on the top or bottom very well.  The example below is a test that has many items above the average examinee score (θ) of 0.0 so that any examinee with a score of less than 0.0 has a relatively inaccurate score, namely with an SEM greater than 0.50.

[Figure: Standard error of measurement and test information function]
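
In IRT, the conditional SEM at a given θ is 1 divided by the square root of the test information at that θ, where test information is the sum of the item information functions.  Below is a toy sketch in Python for a two-parameter logistic model with five hypothetical items located mostly above θ = 0, so error grows as θ drops below 0; the parameters are made up for illustration only.

    import numpy as np

    def item_information_2pl(theta, a, b):
        # Fisher information of a 2PL item at ability theta
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return a**2 * p * (1 - p)

    # Hypothetical item parameters, mostly located above theta = 0
    a_params = np.array([1.2, 1.0, 0.8, 1.5, 1.1])
    b_params = np.array([0.5, 1.0, 1.5, 0.8, 2.0])

    for theta in np.linspace(-3, 3, 13):
        test_info = item_information_2pl(theta, a_params, b_params).sum()
        csem = 1.0 / np.sqrt(test_info)   # conditional standard error of measurement
        print(f"theta = {theta:+.1f}   SEM = {csem:.2f}")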

 

For a deeper discussion of SEM, click here. 

Standard Error of the Estimate

Lastly, we have the standard error of the estimate.  This is an estimate of the accuracy of a prediction, usually in the paradigm of linear regression.  Suppose we are using scores on a 40-item job knowledge test to predict job performance, and we have data on a sample of 1,000 job incumbents who took the test last year and have job performance ratings from this year on a measure comprising 20 items scored on a 5-point scale, for a total of 100 points.

There might have been 86 incumbents that scored 30/40 on the test, and they will have a range of job performance, let’s say from 61 to 89.  If a new person takes the test and scores 30/40, how would we predict their job performance?

The SEE is defined as

       SEE = SDy*sqrt(1 - r^2)

Here, r is the correlation of x and y, not a reliability, and SDy is the standard deviation of the criterion scores.  Many statistical packages can estimate linear regression, the SEE, and many other related statistics for you.  In fact, Microsoft Excel comes with a free add-in that implements simple linear regression.  Excel estimates the SEE as 4.69 in the example above, with a regression intercept of 29.93 and slope of 1.76.

Given this, we can predict the job performance of a person with a test score of 30 to be 29.93 + (1.76 * 30) = 82.73.  A 95% confidence interval for that prediction is then 82.73 - (4.69 * 1.96) to 82.73 + (4.69 * 1.96), or roughly 73.54 to 91.92.
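
Here is a minimal sketch in Python that fits the regression and computes the SEE from the formula above; the data are simulated to roughly match the hypothetical example (intercept near 29.93, slope near 1.76, SEE near 4.69):

    import numpy as np

    # Simulated data: test scores (x) predicting later job performance ratings (y)
    rng = np.random.default_rng(1)
    x = rng.integers(15, 41, size=1000).astype(float)
    y = 29.93 + 1.76 * x + rng.normal(0, 4.69, size=1000)

    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    see = y.std(ddof=1) * np.sqrt(1 - r**2)   # SEE = SDy * sqrt(1 - r^2)

    test_score = 30
    predicted = intercept + slope * test_score
    ci_95 = (predicted - 1.96 * see, predicted + 1.96 * see)
    print(round(see, 2), round(predicted, 2), ci_95)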

You can see how this might be useful in prediction situations.  Suppose we wanted to hire only people who are likely to have a job performance rating of 80 or better.  A cutscore of 30 on the test would then be quite feasible.

OK, so now what?

Well, remember that these three standard errors are quite different and are not even used in related situations.  When you see a standard error requested (for example, if you must report the standard error for an assessment), make sure you use the right one!