The standard error of measurement (SEM) is one of the core concepts in psychometrics. One of the primary assumptions of any assessment is that it accurately and consistently measures whatever we want to measure. We therefore need to demonstrate that it does so. There are a number of ways of quantifying this, and one of the most common is the SEM.
The SEM can be used in both classical test theory and item response theory (IRT), though it is defined quite differently in each.
The Standard Error of Measurement in Classical Test Theory
In classical test theory, it is defined as

$$\mathrm{SEM} = SD \sqrt{1 - r}$$

where SD is the standard deviation of scores for everyone who took the test, and r is the reliability of the test. It is interpreted as the standard deviation of the scores you would find if the person took the test over and over, with a fresh mind each time. A confidence interval built from it (for example, the observed score plus or minus 1.96 SEM for a 95% band) is then interpreted as the band where you would expect the person’s true score on the test to fall.
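As a minimal sketch of the classical calculation, the Python snippet below computes the SEM as SD times the square root of (1 − r) and builds a score band around an observed score. The function names, the example values (SD of 10, reliability of 0.91, observed score of 75), and the 1.96 multiplier for an approximate 95% band are illustrative assumptions, not values from any real test.

```python
import math


def classical_sem(sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD * sqrt(1 - r)."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, z: float = 1.96) -> tuple[float, float]:
    """Symmetric band of +/- z * SEM around an observed score."""
    return (observed - z * sem, observed + z * sem)


# Hypothetical inputs chosen only for illustration.
sem = classical_sem(sd=10.0, reliability=0.91)
low, high = score_band(observed=75.0, sem=sem)
print(f"SEM = {sem:.2f}; 95% band = [{low:.2f}, {high:.2f}]")
```

With these assumed inputs the SEM is 3.00 and the band runs from 69.12 to 80.88. Note that the same SEM is applied to every examinee, whatever their score.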
This has some conceptual disadvantages. For one, it assumes that SEM is the same for all examinees, which is unrealistic. The interpretation focuses only on this single test form rather than the accuracy of measuring someone’s true standing on the trait. Moreover, it does not utilize the examinee’s responses in any way. Lord (1984) suggested a conditional standard error of measurement based on classical test theory, but it focuses on the error of the examinee taking the same test again, rather than the measurement of the true latent value as is done with IRT below.
The classical SEM is reported in Iteman for each subscore, the total score, the score on scored items only, and the score on pretest items.
The Standard Error of Measurement in Item Response Theory
The weaknesses of the classical SEM are one of the reasons that IRT was developed. IRT conceptualizes the SEM as a continuous function across the range of student ability. It is the inverse of the square root of the test information function (TIF), so the SEM is low wherever information is high. A test form will have more accuracy, and therefore less error, in a range of ability where there are more items or items of higher quality. That is, a test with most items of middle difficulty will produce accurate scores in the middle of the range, but will not measure students at the top or bottom very well. The example below is a test that has many items below the average examinee score (θ) of 0.0 so that any examinee with a score above 1.0 has a relatively inaccurate score, namely with an SEM greater than 0.25.
This is actually only the predicted SEM based on all the items in a test/pool. The observed SEM can differ for each examinee based on the items that they answered, and which ones they answered correctly. If you want to calculate the IRT SEM on a test of yours, you need to download Xcalibre and implement a full IRT calibration study.
Nathan Thompson, PhD, is CEO and Co-Founder of Assessment Systems Corporation (ASC). He is a psychometrician, software developer, author, researcher, and evangelist for AI and automation. His mission is to elevate the profession of psychometrics by using software to automate psychometric work like item review, job analysis, and Angoff studies, so we can focus on more innovative work. His core goal is to improve assessment throughout the world.
Nate was originally trained as a psychometrician, with an honors degree from Luther College with a triple major in Math/Psych/Latin, and then a PhD in Psychometrics from the University of Minnesota. He then worked in multiple roles in the testing industry, including item writer, test development manager, essay test marker, consulting psychometrician, software developer, project manager, and business leader. He is also cofounder and Membership Director at the International Association for Computerized Adaptive Testing (iacat.org). He’s published 100+ papers and presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/.