Confidence interval for test scores

A confidence interval for test scores is a common way to interpret the results of a test by reporting a range rather than a single number.  Tests are imperfect measurements taken at a single point in time, and performance can vary from one occasion to the next.  An examinee might be sick or tired today and score lower than their true score, or get lucky with items on topics they studied more closely and score higher than they normally would (or get unlucky with tricky items and score lower).

Psychometricians recognize this and have developed the concept of the standard error of measurement (SEM), which is an index of this variation.  The calculation of the SEM differs between classical test theory (CTT) and item response theory (IRT), but in either case, we can use it to build a confidence interval around the observed score.  Because tests are imperfect measurements, some psychometricians recommend always reporting scores as a range rather than a single number.

A confidence interval is a common concept from statistics in general (not psychometrics alone): it gives a likely range for the true value of something being estimated.  Taking 1.96 times the standard error on each side of a point estimate gives a 95% confidence interval.  For a test score, calculate 1.96 times the SEM, then subtract that amount from the observed score to get the lower bound and add it to get the upper bound.
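As a rough illustration of the calculation just described, here is a minimal Python sketch.  The function names `z_multiplier` and `confidence_interval` are made up for this example, and it assumes measurement error is approximately normal, which is where the 1.96 multiplier comes from.

```python
from statistics import NormalDist


def z_multiplier(confidence=0.95):
    """Two-sided normal multiplier for a confidence level (about 1.96 for 95%)."""
    tail = (1.0 - confidence) / 2.0
    return NormalDist().inv_cdf(1.0 - tail)


def confidence_interval(score, sem, confidence=0.95):
    """Return (lower, upper) bounds around an observed score.

    Assumes normally distributed measurement error; `sem` is the
    standard error of measurement on the same scale as `score`.
    """
    margin = z_multiplier(confidence) * sem
    return score - margin, score + margin
```

Calling `confidence_interval(score, sem)` with the default level reproduces the "1.96 times the SEM on each side" rule; passing a different `confidence` changes the multiplier accordingly.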

Example of confidence interval with Classical Test Theory

With CTT, the confidence interval is placed on raw number-correct scores.  Suppose the reliability of a 100-item test is 0.90, with a mean of 85 and standard deviation of 5.  The SEM is then 5*sqrt(1-0.90) = 5*0.316 = 1.58.  Multiplying by 1.96 gives a margin of about 3.10, so if your score is 67, the 95% confidence interval is 63.90 to 70.10.  We can be about 95% confident that your true score lies in that range.
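The CTT arithmetic above can be checked with a short Python sketch.  The helper names `ctt_sem` and `ctt_interval` are hypothetical, chosen only for this illustration; the formula is the one used in the example, SEM = SD * sqrt(1 - reliability).

```python
import math


def ctt_sem(sd, reliability):
    """Classical SEM: standard deviation times sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def ctt_interval(observed, sd, reliability, z=1.96):
    """95% interval (by default) around an observed raw score."""
    margin = z * ctt_sem(sd, reliability)
    return observed - margin, observed + margin


# Values from the example: SD = 5, reliability = 0.90, observed score = 67
sem = ctt_sem(5, 0.90)
lower, upper = ctt_interval(67, 5, 0.90)
print(f"SEM = {sem:.2f}, interval = {lower:.2f} to {upper:.2f}")
```

Note that the test mean (85) does not enter the calculation; only the standard deviation and reliability determine the SEM.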

Example of confidence interval with Item Response Theory

The same concept applies to item response theory, but the scale of numbers is quite different, because the theta scale runs from approximately -3 to +3.  Also, the SEM is calculated directly from item parameters, in a way that is beyond the scope of this discussion.  If your score is -1.0 and the SEM is 0.30, then the margin is 1.96*0.30 = 0.588, and the 95% confidence interval for your score is -1.588 to -0.412.  This confidence interval can be compared to a cutscore as an adaptive testing approach to pass/fail tests.
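To make the cutscore comparison concrete, here is a minimal sketch of one way it could work.  The function name `classify`, the "continue" outcome, and the cutscore of 0.0 are assumptions for illustration only; operational adaptive tests use their own decision rules and stopping criteria.

```python
def classify(theta, sem, cutscore, z=1.96):
    """Compare a theta confidence interval to a cutscore.

    Returns "pass" if the whole interval is above the cutscore,
    "fail" if it is entirely below, and "continue" if the interval
    still contains the cutscore (no confident decision yet).
    """
    lower = theta - z * sem
    upper = theta + z * sem
    if lower > cutscore:
        return "pass"
    if upper < cutscore:
        return "fail"
    return "continue"


# Example from the text: theta = -1.0, SEM = 0.30, hypothetical cutscore of 0.0
decision = classify(-1.0, 0.30, 0.0)
```

With these values the upper bound (-0.412) is below the assumed cutscore, so the sketch returns a failing decision; an examinee whose interval straddles the cutscore would simply be given more items.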

Example of confidence interval with a Scaled Score

This concept also works on scaled scores.  IQ is typically reported on a scale with a mean of 100 and standard deviation of 15.  Suppose the test had an SEM of 3.2, and your score was 112.  Then 1.96*3.2 = 6.27, and subtracting and adding that amount gives a 95% confidence interval of 105.73 to 118.27.

Nathan Thompson, PhD

Nathan Thompson, PhD, is CEO and Co-Founder of Assessment Systems Corporation (ASC). He is a psychometrician, software developer, author, and researcher, and evangelist for AI and automation. His mission is to elevate the profession of psychometrics by using software to automate psychometric work like item review, job analysis, and Angoff studies, so we can focus on more innovative work. His core goal is to improve assessment throughout the world.

Nate was originally trained as a psychometrician, with an honors degree at Luther College with a triple major of Math/Psych/Latin, and then a PhD in Psychometrics at the University of Minnesota. He then worked multiple roles in the testing industry, including item writer, test development manager, essay test marker, consulting psychometrician, software developer, project manager, and business leader. He is also cofounder and Membership Director at the International Association for Computerized Adaptive Testing (iacat.org). He’s published 100+ papers and presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/.
