
Background of the NCCA Annual Report

To ensure that accredited certification programs continue to provide high-quality certifications, the National Commission for Certifying Agencies (NCCA) requires the submission of an NCCA annual report. The report includes operational information as well as statistics on the psychometric performance of your exams. Psychometrics remains a black box to many certification professionals, so I explain the required statistics below. Note that these statistics must be reported for each form of each exam, separated by certification program. So if you offer four certifications, each with two forms, you will have to calculate and submit eight sets of statistics.

NCCA provides two vital resources for this process at the links below.

Annual Report Form: www.credentialingexcellence.org/d/do/66 (this is what you would fill out and submit)

Sample Annual Report: www.credentialingexcellence.org/d/do/65 (this is filled with imaginary example data, but is very useful as a guideline)

| NCCA requirement | Explanation | Example |
| --- | --- | --- |
| Form name or number | This is the name you use to keep track of the exam form. | Suppose you had two: MA2014-1, MA2014-2 |
| Total # of candidates tested on this exam form in 20xx | This is simply the number of people who took this test during the given time period. | 1,234 |
| % of Candidates Passing in 20xx | This is the pass rate of the form: NumberPassing / NumberCandidates x 100. | Suppose 802 passed out of the 1,234. Then your pass rate is 65%. |
| Passing Point | Also known as the cutscore, this is the score needed to pass the exam. | If you have 100 items and candidates need a 72 to pass, then this is 72. |
| Average Score | This is the average (mean) score for everyone who took this exam during the given time period. | 75.25 |
| Standard Deviation | The standard deviation is an index of the spread of scores. If this number is small, most examinees had scores near the average. If it is large, examinees had a wide range of scores. | If you have 100 items, then an SD of 3.2 would be pretty small, and an SD of 18.4 would be considered large. |
| Standard Error of Measurement | A large SEM means high error and therefore low accuracy, so lower is better. There are two ways to calculate SEM, depending on the psychometric approach used by your organization. If you use classical test theory, the SEM is simply SEM = SD * sqrt(1 - Reliability). If you use IRT, the SEM is based on extremely complex calculations beyond the scope of this paper, and is a continuous function rather than a single index. You also have the option to just use the classical SEM, as you have to calculate the classical reliability anyway (see below). | Suppose you have an SD of 5.4 and Reliability of 0.92. This is then 5.4 * sqrt(1 - 0.92) = 1.527. The SEM is fairly small because our Reliability is good. |
| Decision Consistency Estimate (of P/F decisions) | This is the proportion of candidates who would receive a consistent P/F decision if they took the test over. Again, there are two options here. Classical test theory programs will use an index that ranges from 0 to 1, with 1 being perfect. There are several such indices, but common ones are Livingston, Huynh, and Subkoviak. (Though actually, van der Linden and Mellenbergh proved that the Reliability coefficient should be used here.) IRT-based programs have the option to submit the value of the SEM function at the cutscore. | Classical: 0.94 would mean that we expect 94% of candidates to receive a consistent P/F decision if they took the test again. IRT: 0.32 would mean that we expect that level of variation in IRT (theta) scores near the cutscore. Above 0.50 is relatively inaccurate. |
| Reliability Estimate (of test scores) | Reliability is an attempt to boil down the quality of your entire assessment into a single number between 0 and 1. Reliability of 0 means random numbers, while 1 is perfect measurement. Obviously, you lose some important information by boiling down a complex assessment process to a single number, but it is convenient, so it is ubiquitous. Need to raise this? Either add more scored items to the test, or increase the quality of your items. | <0.7 is generally regarded as unacceptable; >0.7 is generally regarded as acceptable; >0.9 is regarded as good (accurate scores). |
| Total Number of Items on Exam | The number of scored items on the exam. | Suppose you had 100 items that count towards the score plus 20 pilot items. This submission should then be 100. |
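
The arithmetic behind several of these rows is simple enough to script. The following is a minimal Python sketch, not an NCCA tool: the function names are my own, and the use of the population standard deviation (rather than the sample version) and of "score at or above the cutscore" as the passing rule are assumptions you should confirm with your psychometrician.

```python
import math
from statistics import mean, pstdev


def pass_rate(n_passing, n_candidates):
    """Percentage of candidates who passed: NumberPassing / NumberCandidates x 100."""
    return 100.0 * n_passing / n_candidates


def classical_sem(sd, reliability):
    """Classical test theory SEM = SD * sqrt(1 - Reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def describe_scores(scores, cutscore):
    """Summarize one form from a list of candidate scores.

    Assumptions: a candidate passes when score >= cutscore, and the
    SD is the population SD (pstdev), not the sample SD.
    """
    n_passing = sum(1 for s in scores if s >= cutscore)
    return {
        "n": len(scores),
        "pass_rate": pass_rate(n_passing, len(scores)),
        "mean": mean(scores),
        "sd": pstdev(scores),
    }


# Example values taken from the table above.
print(round(pass_rate(802, 1234)))         # 65
print(round(classical_sem(5.4, 0.92), 3))  # 1.527
```

With real data you would pass the full list of candidate scores for one form to describe_scores, then feed the resulting SD and your reliability estimate into classical_sem.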

NCCA also provides the following guidelines in footnotes

For Passing Point, Average Score, Standard Deviation, and Standard Error of Measurement, you must state the scale or metric that you use in the NCCA annual report.  For example, if you score all your tests by counting number of items correct and then report that to the candidates, these four things should all be calculated on number-correct scores.  If you use raw IRT scoring, with a bell curve that has a mean of 0.0 and a SD of 1.0, then these four things should be calculated on those scores.  If you convert all your scores to scaled scores (for example, how university admissions tests often use a scale of 200 to 800), then calculate using those scores.  The choice is in part up to your psychometrician and you; the actual choice does not matter as much as you being consistent.  Otherwise, it is difficult for the NCCA evaluators to conceptualize the performance of your exam.
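
To see why consistency of metric matters, here is a small Python sketch of what happens to these four statistics under a scaled score. It assumes the scaled score is a simple linear transformation of the raw score (scaled = slope * raw + intercept); many programs use other conversions, and the slope and intercept below are hypothetical values chosen only to map 0-100 onto 200-800.

```python
def to_scaled_metric(stats, slope, intercept):
    """Re-express raw-score statistics on a linear scaled-score metric.

    Assumes scaled = slope * raw + intercept. Location statistics
    (passing point, mean) shift and stretch; spread statistics
    (SD, SEM) only stretch by the absolute slope.
    """
    return {
        "passing_point": slope * stats["passing_point"] + intercept,
        "mean": slope * stats["mean"] + intercept,
        "sd": abs(slope) * stats["sd"],
        "sem": abs(slope) * stats["sem"],
    }


# Raw number-correct statistics for a hypothetical 100-item form.
raw = {"passing_point": 72, "mean": 75.25, "sd": 5.4, "sem": 1.527}

# Hypothetical conversion: raw 0-100 mapped onto a 200-800 scale.
scaled = to_scaled_metric(raw, slope=6, intercept=200)
print(scaled["passing_point"], scaled["mean"])           # 632 651.5
print(round(scaled["sd"], 1), round(scaled["sem"], 3))   # 32.4 9.162
```

Mixing metrics, for instance a raw passing point of 72 next to a scaled mean of 651.5, is exactly the kind of inconsistency that makes a form hard to evaluate.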

For Decision Consistency, you need to note whether you are using the classical approach (index 0 to 1) or the IRT approach (SEM at cutscore).  If using classical, please note the name of the index (Livingston, Huynh, Subkoviak…).
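
For the classical approach, the Livingston coefficient is the easiest of the named indices to compute, because it needs only the reliability, the SD, the mean, and the cutscore. The Python sketch below uses the usual statement of Livingston's coefficient, K² = (Reliability * SD² + (Mean - Cutscore)²) / (SD² + (Mean - Cutscore)²). The function name and the input values are hypothetical; treat it as an illustration rather than a replacement for your psychometrician's calculation.

```python
def livingston_k2(reliability, sd, mean_score, cutscore):
    """Livingston's coefficient for pass/fail decisions.

    K^2 = (r * SD^2 + (mean - cut)^2) / (SD^2 + (mean - cut)^2)
    It equals the reliability when the mean sits exactly at the
    cutscore and increases as the mean moves away from it.
    """
    variance = sd ** 2
    distance_sq = (mean_score - cutscore) ** 2
    return (reliability * variance + distance_sq) / (variance + distance_sq)


# Hypothetical form: reliability 0.90, SD 10, mean 75, cutscore 72.
print(round(livingston_k2(0.90, 10, 75, 72), 3))  # 0.908
```

Huynh and Subkoviak indices require more involved calculations and are not sketched here.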

For Reliability estimate, there are also several indices that could be used, such as alpha/KR20, alternative forms, and split-half with Spearman-Brown correction.  Note which one you use.  Alpha/KR20 is by far the most common.
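
Since alpha/KR20 is the index most programs will report, here is a minimal Python sketch of the calculation from a matrix of scored item responses. The function name and the tiny data set are made up for illustration, and the sketch uses population variances, in which case alpha on 0/1 items coincides with KR-20. A real analysis would use your full response data and psychometric software.

```python
from statistics import pvariance


def coefficient_alpha(responses):
    """Coefficient alpha for a list of examinee rows of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance of total scores)
    """
    k = len(responses[0])  # number of scored items
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 examinees, 4 dichotomously scored items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(coefficient_alpha(responses), 3))  # 0.8
```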

Most tests are of fixed length, i.e., every candidate receives 100 items.  Very large certification programs will sometimes use adaptive testing, which is based on complex algorithms, and not every candidate receives the same number of items.  If this is the case, you need to provide the possible range; the Total Number of Items is then the average number of items seen by examinees. 

Example

The following table provides statistical information in the format required for the NCCA annual report.  This is only an example; as discussed above, there are sometimes a few ways you can approach a certain column.

Table 1: Test Summary Statistics for Each Test Form

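| Test | Form Name | N Candidates | N Passed | Passing Point | Average Score | Standard Deviation | SEM | Decision Consistency | Reliability | Items |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CBA | 2014-1 | 978 | 645 | 72 | 75.94 | 9.03 | 2.71 | 0.86 | 0.91 | 100 |
| CBA | 2014-2 | 963 | 638 | 72 | 76.13 | 8.89 | 2.51 | 0.88 | 0.92 | 100 |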

Average score, standard deviation, SEM, and passing point are all reported on the raw number-correct score metric.

Decision consistency index is the Livingston coefficient.

Reliability is estimated by coefficient alpha.
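
Before submitting, it is worth sanity-checking each row. The Python sketch below takes the Table 1 values, converts N Passed into the percentage passing that the report form asks for, and recomputes the classical SEM from the SD and reliability so it can be compared with the reported SEM. The function name is my own and the 0.01 tolerance is an arbitrary choice.

```python
import math

# Values copied from Table 1.
forms = [
    {"form": "2014-1", "n": 978, "passed": 645, "sd": 9.03, "rel": 0.91, "sem": 2.71},
    {"form": "2014-2", "n": 963, "passed": 638, "sd": 8.89, "rel": 0.92, "sem": 2.51},
]


def check_form(row, tolerance=0.01):
    """Return (percent passing, recomputed SEM, whether it matches the reported SEM)."""
    pct_passing = 100 * row["passed"] / row["n"]
    sem = row["sd"] * math.sqrt(1 - row["rel"])
    return round(pct_passing, 1), round(sem, 2), abs(sem - row["sem"]) <= tolerance


for row in forms:
    print(row["form"], check_form(row))
# 2014-1 (66.0, 2.71, True)
# 2014-2 (66.3, 2.51, True)
```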

OK, now I need to get all these statistics.  Where do I find them?

Your psychometrician should report them to you.  Alternatively, you can calculate them in-house if you have any psychometric expertise.  If you prefer to have them calculated for you, we recommend you utilize our Certifior platform for credential management and delivery.  We have an automated report that provides you with all the necessary information.


nthompson

Nathan Thompson earned his PhD in Psychometrics from the University of Minnesota, with a focus on computerized adaptive testing. His undergraduate degree was from Luther College with a triple major of Mathematics, Psychology, and Latin. He is primarily interested in the use of AI and software automation to augment and replace the work done by psychometricians, which has provided extensive experience in software design and programming. Dr. Thompson has published over 100 journal articles and conference presentations, but his favorite remains https://pareonline.net/getvn.asp?v=16&n=1.
