*Background of the NCCA Annual Report*

As a way of ensuring that accredited certification programs continue to provide high-quality certifications, the National Commission for Certifying Agencies (NCCA) requires the submission of an NCCA annual report. The report includes not only operational information but also statistics on the psychometric performance of your exams. Psychometrics remains a black box to many certification professionals, so I provide some explanations below of the required statistics. Note that these statistics must be reported for each form of each exam, separated by certification program. So if you offer four certifications, each with two forms, you will have to calculate and submit eight sets of statistics.

NCCA provides two vital resources for this process at the links below.

Annual Report Form: www.credentialingexcellence.org/d/do/66 - *This is what you fill out and submit.*

Sample Annual Report: www.credentialingexcellence.org/d/do/65 - *This is filled with imaginary example data, but is very useful as a guideline.*

| NCCA requirement | Explanation | Example |
|---|---|---|
| Form name or number | The name you use to keep track of the exam form. | Suppose you have two: MA2014-1 and MA2014-2. |
| Total # of candidates tested on this exam form in 20xx | Simply the number of people who took this form during the given time period. | 1,234 |
| % of Candidates Passing in 20xx | The pass rate of the form: NumberPassing / NumberCandidates × 100. | Suppose 802 passed out of the 1,234. Then your pass rate is 65%. |
| Passing Point | Also known as the cutscore, this is the score needed to pass the exam. | If you have 100 items and candidates need a 72 to pass, then this is 72. |
| Average Score | The average (mean) score for everyone who took this form during the given time period. | 75.25 |
| Standard Deviation | An index of the spread of scores. If this number is small, most examinees had scores near the average; if it is large, examinees had a wide range of scores. | On a 100-item exam, an SD of 3.2 would be fairly small; an SD of 18.4 would be considered large. |
| Standard Error of Measurement | A large SEM means high error and therefore low accuracy, so lower is better. There are two ways to calculate the SEM, depending on the psychometric approach your organization uses. If you use classical test theory, the SEM is simply SEM = SD × sqrt(1 − Reliability). If you use IRT, the SEM is based on much more complex calculations beyond the scope of this paper, and is a continuous function rather than a single index. You also have the option to just use the classical SEM, since you have to calculate the classical reliability anyway (see below). | Suppose you have an SD of 5.4 and a Reliability of 0.92. The SEM is then 5.4 × sqrt(1 − 0.92) = 1.527. The SEM is fairly small because our Reliability is good. |
| Decision Consistency Estimate (of P/F decisions) | The proportion of candidates who would receive a consistent P/F decision if they took the test over. Again, there are two options. Classical test theory programs use an index that ranges from 0 to 1, with 1 being perfect; there are several such indices, but common ones are Livingston, Huynh, and Subkoviak. (Though van der Linden and Mellenbergh proved that the Reliability coefficient should be used here.) IRT-based programs have the option to submit the value of the SEM function at the cutscore. | 0.94 would mean that we expect 94% of candidates to receive a consistent P/F decision if they took the test again. 0.32 would mean that we expect that level of variation in IRT (theta) scores near the cutscore; above 0.50 is relatively inaccurate. |
| Reliability Estimate³ (of test scores) | Reliability attempts to boil down the quality of your entire assessment into a single number between 0 and 1. A reliability of 0 means random numbers, while 1 is perfect measurement. Obviously, you lose some important information by boiling down a complex assessment process to a single number, but it is so convenient that it is ubiquitous. Need to raise this? Either add more scored items to the test or increase the quality of your items. | Below 0.70 is generally regarded as unacceptable; above 0.70 is generally regarded as acceptable; above 0.90 is regarded as good (accurate scores). |
| Total Number of Items on Exam⁴ | The number of scored items on the exam. | Suppose you have 100 items that count toward the score plus 20 pilot items. This submission should then be 100. |
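Most of these statistics are simple enough to compute directly from summary data. Here is a minimal Python sketch using the hypothetical figures from the table above (1,234 candidates, 802 passing, SD of 5.4, reliability of 0.92, cutscore of 72); the Livingston coefficient shown is just one of the classical decision-consistency options named in the table.

```python
import math

# Hypothetical summary statistics from the table above
n_candidates = 1234
n_passed = 802
mean_score = 75.25
sd = 5.4
reliability = 0.92
cutscore = 72

# Pass rate: NumberPassing / NumberCandidates x 100
pass_rate = n_passed / n_candidates * 100

# Classical SEM = SD * sqrt(1 - Reliability)
sem = sd * math.sqrt(1 - reliability)

# Livingston's decision-consistency coefficient (one classical option):
# K2 = (r * SD^2 + (mean - cut)^2) / (SD^2 + (mean - cut)^2)
dev2 = (mean_score - cutscore) ** 2
livingston = (reliability * sd**2 + dev2) / (sd**2 + dev2)

print(round(pass_rate), round(sem, 3), round(livingston, 3))
```

With these inputs the pass rate comes out near 65%, the SEM near 1.527, and the Livingston coefficient a bit above 0.94, consistent with the examples in the table.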

**NCCA also provides the following guidelines in footnotes**

For **Passing Point**, **Average Score**, **Standard Deviation**, and **Standard Error of Measurement**, you must state the scale or metric that you use in the NCCA annual report. For example, if you score all your tests by counting the number of items correct and then report that to the candidates, these four things should all be calculated on number-correct scores. If you use raw IRT scoring, with a bell curve that has a mean of 0.0 and an SD of 1.0, then these four things should be calculated on those scores. If you convert all your scores to scaled scores (for example, how university admissions tests often use a scale of 200 to 800), then calculate using those scores. The choice is in part up to your psychometrician and you; the actual choice matters less than your being consistent. Otherwise, it is difficult for the NCCA evaluators to conceptualize the performance of your exam.
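If you do report scaled scores, the conversion is often just a linear transformation of the raw score. A sketch, assuming a 0 to 100 number-correct range mapped onto an illustrative 200 to 800 reporting scale (the function name and both ranges are hypothetical, not anything prescribed by NCCA):

```python
def to_scaled(raw, raw_min=0, raw_max=100, scaled_min=200, scaled_max=800):
    """Linearly map a raw number-correct score onto a reporting scale.
    The 200-800 range is purely illustrative."""
    span = (scaled_max - scaled_min) / (raw_max - raw_min)
    return scaled_min + (raw - raw_min) * span

# A raw cutscore of 72 lands at 632 on this particular scale
print(to_scaled(72))
```

Whatever transformation you choose, the key point from the footnote stands: report the passing point, average, SD, and SEM all on the same metric.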

For Decision Consistency, you need to note whether you are using the classical approach (index 0 to 1) or the IRT approach (SEM at cutscore). If using classical, please note the name of the index (Livingston, Huynh, Subkoviak…).

For the Reliability estimate, there are also several indices that could be used, such as alpha/KR-20, alternate forms, and split-half with the Spearman-Brown correction. Note which one you use. Alpha/KR-20 is by far the most common.
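Alpha/KR-20 can be computed directly from the matrix of scored (0/1) item responses. A minimal sketch with a toy response matrix, rows being examinees and columns being items (the data are invented purely for illustration; a real program would use the full response matrix for the form):

```python
def coefficient_alpha(matrix):
    """Coefficient alpha (equivalent to KR-20 for 0/1 items)."""
    k = len(matrix[0])                     # number of items
    def var(xs):                           # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[j] for row in matrix]) for j in range(k)]
    total_var = var([sum(row) for row in matrix])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy data: 5 examinees x 4 items, scored 1 = correct, 0 = incorrect
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
print(round(coefficient_alpha(responses), 3))
```

A real exam form would of course have far more items and examinees; with only four toy items, the alpha here is predictably low.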

Most tests are of fixed length, i.e., every candidate receives the same number of items (say, 100). Very large certification programs sometimes use *adaptive testing*, which is based on complex algorithms and does not administer the same number of items to every candidate. If this is the case, you need to provide the possible range; the Total Number of Items is then the average number of items seen by examinees.

### Example

The following table provides statistical information in the format required for the NCCA annual report. This is only an example; as discussed above, there is sometimes more than one way to approach a given column.

**Table 1: Test Summary Statistics for Each Test Form**

| Test | Form Name | N Candidates | N Passed | Passing Point | Average Score | Standard Deviation | SEM | Decision Consistency | Reliability | Items |
|---|---|---|---|---|---|---|---|---|---|---|
| CBA | 2014-1 | 978 | 645 | 72 | 75.94 | 9.03 | 2.71 | 0.86 | 0.91 | 100 |
| | 2014-2 | 963 | 638 | 72 | 76.13 | 8.89 | 2.51 | 0.88 | 0.92 | 100 |

*Average score, standard deviation, SEM, and passing point are all reported on the raw number-correct score metric.*

*Decision consistency index is the Livingston coefficient.*

*Reliability is estimated by coefficient alpha.*

**OK, now I need to get all these statistics. Where do I find them?**

Your psychometrician should report them to you. Alternatively, you can calculate them in-house if you have psychometric expertise on staff. If you prefer to have them calculated for you, we recommend our *Certifior* platform for credential management and delivery, which includes an automated report that provides all the necessary information.

