decision consistency

If you are involved with certification testing and are accredited by the National Commission of Certifying Agencies (NCCA), you have come across the term decision consistency.  NCCA requires you to submit a report of 11 important statistics each year, each for all active test forms.  These 11 provide a high level summary of the psychometric health of each form; more on that report here.  One of the 11 is decision consistency.

Decision consistency is an estimate of how consistent the pass/fail decision is on your test.  That is, if someone took your test today, had their brain wiped of that memory, and took the test again next week, what is the probability that they would obtain the same classification both times?  This is often estimated as a proportion or percentage, and we would of course hope that this number is high, but if the test is unreliable it might not be.

The reasoning behind the need for a index specifically on this is that the psychometric aspect we are trying to estimate is different than reliability of point scores (Moltner, Timbil, & Junger, 2015; Downing & Mehrens, 1978).  The argument is that examinees near the cutscore are of interest, and reliability evaluates the entire scale.  It’s for this reason that if you are using item response theory, the NCCA allows you to instead submit the conditional standard error of measurement function at the cutscore.  But all of the classical decision consistency indices evaluate all examinees, and since most candidates are not near the cutscore, this inflates the baseline.  Only the CSEM – from IRT – follows the line of reasoning of focusing on examinees near the cutscore.

An important distinction that stems from this dichotomy is that of decision consistency vs. accuracy.  Consistency refers to receiving the same pass/fail classification each time if you take the test twice.  But what we really care about is whether your pass/fail based on the test matches with your true state.  For a more advanced treatment on this, I recommend Lathrop (2015).

There are a number of classical methods for estimating an index of decision consistency that have been suggested in the psychometric literature.  A simple and classic approach is Hambleton (1972), which is based on an assumption that examinees actually take the same test twice (or equivalent forms).  Of course, this is rarely feasible in practice, so a number of methods were suggested over the next few years on how to estimate this with a single test administration to a given set of examinees.  These include Huynh (1976), Livingston (1972), and Subkoviak (1976).  These are fairly complex.  I once reviewed a report from a psychometrician that faked the Hambleton index because they didn’t have the skills to figure out any of the indices.

How does decision consistency relate to reliability?

The note I made above about unreliability is worth another visit, however.  After the rash of publications on the topic, Mellenbergh and van der Linden (1978; 1980) pointed out that if you assume a linear loss function for misclassification, the conventional estimate of reliability – coefficient alpha – serves as a solid estimate of decision consistency.  What is a linear loss function?  It means that a misclassification is worse if the person’s score is further from the cutscore.  That is, of the cutscore is 70, failing someone with a true score of 80 is twice as bad as failing someone with a true score of 75.  Of course, we never know someone’s true score, so this is a theoretical assumption, but the researchers make an excellent point.

But while research amongst psychometricians on the topic cooled since they made that point, NCCA still requires one of the statistics -most from the 1970s – to be reported.  The only other well-known index on the topic was Hanson and Brennan (1990).  While the indices have been show to be different than classical reliability, I remain to be convinced that they are the right approach.  Of course, I’m not much of a fan of classical test theory at all in the first place; that acceptance of CSEM from IRT is definitely aligned with my views on how psychometrics should tackle measurement problems.

 

The following two tabs change content below.

Nathan Thompson, PhD

Nathan Thompson earned his PhD in Psychometrics from the University of Minnesota, with a focus on computerized adaptive testing. His undergraduate degree was from Luther College with a triple major of Mathematics, Psychology, and Latin. He is primarily interested in the use of AI and software automation to augment and replace the work done by psychometricians, which has provided extensive experience in software design and programming. Dr. Thompson has published over 100 journal articles and conference presentations, but his favorite remains https://pareonline.net/getvn.asp?v=16&n=1.

Latest posts by Nathan Thompson, PhD (see all)