Classical test theory is a century-old paradigm for psychometrics, the use of quantitative, scientifically based processes to develop and analyze assessments and maximize their quality (nobody likes unfair tests!). The most basic and frequently used item statistic from classical test theory is the P-value. It is usually called item difficulty but is sometimes called item facility, which can lead to confusion.
The P-Value Statistic
The classical P-value is the proportion of examinees who respond correctly to a question, or who respond in the “keyed direction” for items where the notion of correct is not relevant. Imagine a personality assessment where all questions are Yes/No statements such as “I like to go to parties”; Yes is the keyed direction for an Extraversion scale. Note that this is NOT the same as the p-value used in statistical hypothesis testing. The calculation of this P-value is almost universally agreed upon, but some people call it item difficulty and others call it item facility. Why?
It has to do with clarity of interpretation. It usually makes sense to think of difficulty as an important aspect of the item. The P-value captures this, but in reverse. We usually expect higher values to indicate more of something, right? But a P-value of 1.00 is high, and it means there is no difficulty whatsoever: everyone gets the item correct. A P-value of 0.25 is low, but it means there is a lot of difficulty: only 25% of examinees get it correct.
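To make the calculation concrete, here is a minimal Python sketch. It assumes responses have already been scored as 1 (correct, or in the keyed direction) and 0 (otherwise), with one row per examinee and one column per item. The function name `item_p_values` and the tiny data set are made up for illustration and do not come from any particular psychometrics package.

```python
def item_p_values(scored):
    """Return the classical P-value (proportion of 1s) for each item.

    Assumes `scored` is a non-empty list of examinee rows, where each row
    is a list of 0/1 item scores and all rows have the same length.
    """
    if not scored:
        raise ValueError("need at least one examinee")
    n_examinees = len(scored)
    n_items = len(scored[0])
    # For each item (column), count the 1s and divide by the number of examinees.
    return [sum(row[j] for row in scored) / n_examinees for j in range(n_items)]


# Hypothetical data: 4 examinees (rows) by 3 items (columns).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

p_values = item_p_values(responses)
for item_number, p in enumerate(p_values, start=1):
    print(f"Item {item_number}: P = {p:.2f}")
```

With this made-up data, the first item is answered correctly by everyone (P = 1.00, the highest value but the least difficulty) and the third by only one examinee in four (P = 0.25, a low value but a lot of difficulty).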
So where does “item facility” come in?
See how the meaning is reversed? It’s for this reason that some psychometricians prefer to call it item facility or item easiness. We still use the P-value, but 1.00 means high facility/easiness and 0.25 means low facility/easiness. The direction of the semantics fits much better.
Nevertheless, psychometricians who prefer the term item facility are in the minority. There’s too much momentum to change an entire field at this point! It’s similar to the three dichotomous IRT parameters (a, b, c); some of you might have noticed that they are actually in the wrong order, because the 1-parameter model uses the b parameter, not the a. At the end of the day, it doesn’t really matter, but it’s another good example of how we all just got used to doing something and it’s now too far down the road to change it. Tradition is a funny thing.