Certification and Licensure are two terms that are used quite frequently to refer to examinations that someone has to pass to demonstrate skills in a certain profession or topic.  They are quite similar, and often confused.  This is exacerbated by even more similar terms in the field, such as accreditation, credentialing, certificate, and microcredentials.  This post will help you understand the differences.

What is Certification?

Certification is “a credential that you earn to show that you have specific skills or knowledge. They are usually tied to an occupation, technology, or industry.” (CareerOneStop)  The important aspect of this definition is the latter portion; the organization that runs the certification generally spans an industry or a profession, regardless of political boundaries.  It is almost always some sort of professional association or industry board, like the American Association of Widgetmakers (obviously not a real thing).  However, it is sometimes governed by a specific company or other organization regarding their products; perhaps the best-known example is how Amazon Web Services will certify you in the skills to handle their offerings.  Many other technology and software companies do the same.

What is Licensure?

Licensure is a “formal permission to do something: esp., authorization by law to do some specified thing (license to marry, practice medicine, hunt, etc.)” (Schmitt, 1995).  The key phrase here is by law.  The governing organization is a governmental entity, and that is what defines licensure.  In fact, licensure is not even always about a profession; almost all of us have a Driver’s License for which we passed a simple exam.  Moreover, it does not always even have to be about skills; many millions of people have a Fishing License, which is granted by the government (by States in the USA), for which you simply pay a small fee.  The license is still an attestation, but not of your skills, just that you have been authorized to do something.

Certification and Licensure

In almost all cases, there is a test that you must pass, for both certification and licensure.  The development and delivery of such tests are extremely similar, which leads to the confusion.  Both will often utilize job analysis, Angoff studies, and the like.  The difference between the two lies outside the test itself, and instead refers to the sponsoring organization: is it mandated/governed by a governmental entity, or is it unrelated to political/governmental boundaries?

Can they be the same exam?

To make things even more confusing… yes.  And it does not even have to be consistent.  In the US, some professions have a wide certification, which is also required in some States as licensure, but not in all States!  Some States might have their own exams, or not even require an exam.  This muddles the difference between certification and licensure.


This outline summarizes some of the relevant terms.  This is certainly more than can be covered in a single blog post, so this will need to be revisited!!!

  • Attestation of some level of quality for a person or organization = CREDENTIALING
    • Attestation of a person
      • By government = LICENSURE
      • By independent board or company
        • High stakes, wide profession = CERTIFICATION
        • Medium stakes = CERTIFICATE
        • Low stakes, quite specific skill = MICROCREDENTIAL
      • By an educational institution = DEGREE OR DIPLOMA
    • Attestation of an organization = ACCREDITATION




Laila Issayeva, MS

Nathan Thompson, PhD


The Bookmark Method of standard setting (Lewis, Mitzel, & Green, 1996) is a scientifically-based approach to setting cutscores on an examination. It allows stakeholders of an assessment to make decisions and classifications about examinees that are constructive rather than arbitrary (e.g., 70%), meet the goals of the test, and contribute to overall validity. A major advantage of the bookmark method over others is that it utilizes difficulty statistics on all items, making it very data-driven; but this can also be a disadvantage in situations where such data is not available. It also has the advantage of panelist confidence (Karantonis & Sireci, 2006).

The bookmark method operates by delivering a test to a representative sample (or population) of examinees, and then calculating the difficulty statistics for each item. We line up the items in order of difficulty, and experts review the items to place a bookmark where they think a cutscore should be. Nowadays, we use computer screens, but of course in the past this was often done by printing the items in paper booklets, and the experts would literally insert a bookmark.

What is standard setting?

Standard setting (Cizek & Bunch, 2006) is an integral part of the test development process even though it has been undervalued outside of practitioners’ view in the past (Bejar, 2008). Standard setting is the methodology of defining achievement or proficiency levels and corresponding cutscores. A cutscore is a score that serves as a measure of classifying test takers into categories.

Educational assessments and credentialing examinations are often employed to distribute test takers among ordered categories according to their performance across specific content and skills (AERA, APA, & NCME, 2014; Hambleton, 2013). For instance, in tests used for certification and licensing purposes, test takers are typically classified as “pass” (those who score at or above the cutscore) or “fail” (those who score below it). In education, students are often classified in terms of proficiency; the Nation’s Report Card assessment (NAEP) in the United States classifies students as Below Basic, Basic, Proficient, or Advanced.

However, assessment results could come into question unless the cutscores are appropriately defined. This is why arbitrary cutscores are considered indefensible and lacking validity. Instead, psychometricians help test sponsors to set cutscores using methodologies from the scientific literature, driven by evaluations of item and test difficulty as well as examinee performance.

When to use the bookmark method?

Two approaches are mainly used in international practice to establish assessment standards: the Angoff method (Cizek, 2006) and the Bookmark method (Buckendahl, Smith, Impara, & Plake, 2000). The Bookmark method, unlike the Angoff method, requires the test to be administered before cutscores are defined, because the cutscores are based on test data. This provides additional weight to the validity of the process, and better informs the subject matter experts during the process. Of course, many exams require a cutscore to be set before the exam is published, which is impossible with the Bookmark method; the Angoff procedure is very useful then.

How do I implement the bookmark method?

The process of standard setting employing the Bookmark method consists of the following stages:

  1. Identify a team of subject matter experts (SMEs); there should be around 6-12 of them, led by a test developer/psychometrician/statistician
  2. Analyze test takers’ responses by means of the item response theory (IRT)
  3. Create a list of items ordered by item difficulty, in ascending order
  4. Define the competency levels for test takers; for example, have the 6-12 experts discuss what should differentiate a “pass” candidate from a “fail” candidate
  5. Experts read the items in the ascending order (they do not need to see the IRT values), and place a bookmark where appropriate based on professional judgement across well-defined levels
  6. Calculate thresholds based on the bookmarks set, across all experts
  7. If needed, discuss results and perform a second round
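Steps 3, 5, and 6 above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the definitive procedure: the item difficulties and bookmark placements are made-up data, each SME's cutscore is taken to be the difficulty of the last item before their bookmark, and the panel's cutscore is the median across SMEs. Operational Bookmark studies often use other conventions (for example, a response-probability criterion), so treat the function names and rules here as hypothetical.

```python
from statistics import median


def order_items(difficulties):
    """Return (item_id, difficulty) pairs sorted from easiest to hardest."""
    return sorted(difficulties.items(), key=lambda pair: pair[1])


def bookmark_cutscore(difficulties, bookmarks):
    """Combine SME bookmarks into one cutscore on the difficulty scale.

    bookmarks maps an SME name to the number of items that come before
    their bookmark in the ordered booklet (1-based position).
    Assumption: each SME's cut is the difficulty of the last item before
    the bookmark, and the panel cut is the median across SMEs.
    """
    ordered = order_items(difficulties)
    per_sme = {}
    for sme, position in bookmarks.items():
        if not 1 <= position <= len(ordered):
            raise ValueError(f"bookmark for {sme} is outside the booklet")
        per_sme[sme] = ordered[position - 1][1]
    return median(per_sme.values()), per_sme


# Hypothetical IRT difficulty values and bookmark placements.
item_difficulty = {"Q1": -1.2, "Q2": 0.4, "Q3": -0.3, "Q4": 1.1, "Q5": 0.0, "Q6": 1.8}
sme_bookmarks = {"SME1": 3, "SME2": 2, "SME3": 4, "SME4": 5}

cutscore, per_sme = bookmark_cutscore(item_difficulty, sme_bookmarks)
print(per_sme)
print(round(cutscore, 2))
```

The per-SME values are worth keeping alongside the median, since the spread between experts is what the panel discusses before a second round.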

Example of the Bookmark Method

If there are four competency levels, as in the NAEP example, then SMEs need to set three bookmarks in between: the first bookmark is set after the last item in the sequence that fits the minimally competent candidate for the first level, and the second and third are set the same way. This gives thresholds/cutscores from 1 to 2, 2 to 3, and 3 to 4. SMEs perform this individually without discussion, by reading the items.

When all SMEs have provided their opinion, the standard setting coordinator combines all results into one spreadsheet and leads a discussion in which all participants explain their reasoning with reference to the bookmarks they set. This might look like the sheet below. Note that SME4 had a relatively high standard in mind, while SME2 had a low standard in mind – placing virtually every student above an IRT score of 0.0 into the top category!

bookmark method 1

After the discussion, the SMEs are given one more opportunity to set the bookmarks again. Usually, after the exchange of opinions, the picture alters. SMEs gain consensus, and the variation in the graphic is reduced.  An example of this is below.

bookmark method 2

What to do with the results?

Based on the SMEs’ voting results, the coordinator or psychometrician calculates the final thresholds on the IRT scale, and provides them to the analytical team who would ultimately prepare reports for the assessment across competency levels. This might entail score reports to examinees, feedback reports to teachers, and aggregate reports to test sponsors, government officials, and more.
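As a rough sketch of how final thresholds turn into reportable categories, the snippet below classifies IRT scores into four levels and counts how many examinees fall in each. The three cutscores and the examinee scores are invented for illustration, and the rule that a score exactly at a cutscore goes to the higher level is an assumption (it mirrors the "at or above the cutscore" convention for passing).

```python
from bisect import bisect_right
from collections import Counter

LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]


def classify(theta, cutscores, levels=LEVELS):
    """Return the level for one IRT score.

    Assumption: a score exactly equal to a cutscore is placed in the
    higher of the two adjacent levels.
    """
    if len(cutscores) != len(levels) - 1:
        raise ValueError("need exactly one cutscore between each pair of levels")
    # bisect_right counts how many cutscores are at or below theta.
    return levels[bisect_right(sorted(cutscores), theta)]


# Hypothetical thresholds on the IRT scale and hypothetical examinee scores.
cutscores = [-0.8, 0.2, 1.3]
scores = [-1.5, -0.8, 0.0, 0.2, 0.9, 1.3, 2.1]

counts = Counter(classify(theta, cutscores) for theta in scores)
for level in LEVELS:
    print(level, counts[level])
```

The same counts can feed aggregate reports, while the per-examinee label feeds individual score reports.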

You can see how the scientific approach will directly impact the interpretations of such reports. Rather than government officials just knowing how many students scored 80-90% correct vs 90-100% correct, the results are framed in terms of how many students are truly proficient in the topic. This makes decisions from test scores – both at the individual and aggregate levels – much more defensible and informative.  They become truly criterion-referenced.  This is especially true when the scores are equated across years to account for differences in examinee distributions and test difficulty, and the standard can be demonstrated to be stable.  For high-stakes examinations such as medical certification/licensure, admissions exams, and many more situations, this is absolutely critical.

A test battery or assessment battery is a set of multiple psychometrically-distinct exams delivered in one administration.  In some cases, these are various tests that are cobbled together for related purposes, such as a psychologist testing an 8-year-old child on their intelligence, anxiety, and autism spectrum.  However, in many cases it is a single test title that we often refer to as a single test but is actually several separate tests, like a university admissions test that has English, Math, and Logical Reasoning components.  Why do so? The key here is that we want to keep them psychometrically separate, but maximize the amount of information about the person to meet the purposes of the test.

Examples of a Test Battery

Test batteries are used in a variety of fields, pretty much anywhere assessment is done.

Admissions and Placement Testing

The classic example is a university admissions test that has English, Math, and Logic portions.  These are separate tests, and psychometricians would calculate the reliability and other important statistics separately.  However, the scores are combined at the end to get an overall picture of examinee aptitude or achievement, which is then used to maximally predict 4-year graduation rates and other important criterion variables.
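To make "calculate the reliability separately" concrete, here is a small sketch that computes a reliability coefficient for each subtest on its own. It assumes Cronbach's alpha as the coefficient (one common choice; the text does not say which statistic is used) and uses tiny made-up response matrices where each row is an examinee and each column is an item scored 0/1.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one test.

    responses is a list of rows, one per examinee, each row holding that
    examinee's item scores. Uses sample variances throughout.
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance(column) for column in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical battery: each subtest has its own response matrix.
battery = {
    "English": [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]],
    "Math": [[1, 1], [0, 0], [1, 1], [0, 0]],
}

# Reliability is computed per subtest, never on the pooled items.
reliability = {name: cronbach_alpha(rows) for name, rows in battery.items()}
print(reliability)
```

Real subtests would have far more items and examinees; the point is only that each subtest gets its own statistic.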

Why is it called a battery?  Because we are battering the poor student with not just one, but many exams!

Pre-Employment Testing

Exam batteries are often used in pre-employment testing.  You might get tested on computer skills, numerical reasoning, and noncognitive traits such as integrity or conscientiousness. These are used together to gain incremental validity.  A good example is the CAT-ASVAB, which is the selection test to get into the US Armed Forces.  There are 10 tests (vocabulary, math, mechanical aptitude…).

Psychological or Psychoeducational Assessment

In a clinical setting, clinicians will often use a battery of tests, such as IQ, autism, anxiety, and depression.  Some IQ tests are themselves a battery, as they might assess visual reasoning, logical reasoning, numerical reasoning, etc.  However, these have a positive manifold, meaning that they correlate quite highly with each other.  Another example is the Woodcock-Johnson.

K-12 Educational Assessment

Many large-scale tests that are used in schools are considered a battery, though often with only 2 or 3 aspects.  A common one in the USA is the NWEA Measures of Academic Progress.

Composite Scores

A composite score is a combination of scores in a battery.  If you took an admissions test like the SAT or GRE, you may recall that it adds your scores on the different subtests, while the ACT takes the average.  The ASVAB takes a linear combination of the 4 most important subtests and uses it for admission; the others are used for job matching.
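The three ways of combining subtest scores mentioned above (sum, average, and linear combination) can be sketched as follows. All subtest names, scores, and weights are hypothetical and chosen only to show the arithmetic; they are not the actual scoring rules of any of the tests named.

```python
def composite_sum(scores):
    """Composite as the simple sum of subtest scores."""
    return sum(scores.values())


def composite_mean(scores):
    """Composite as the average of subtest scores."""
    return sum(scores.values()) / len(scores)


def composite_weighted(scores, weights):
    """Composite as a linear combination of selected subtests.

    Only subtests named in weights contribute; the rest of the battery
    is ignored for this composite.
    """
    return sum(weight * scores[name] for name, weight in weights.items())


# Hypothetical examinee scores on three different batteries.
two_section = {"Verbal": 600, "Math": 650}
four_section = {"English": 24, "Math": 28, "Reading": 26, "Science": 22}
five_subtests = {"A": 50, "B": 60, "C": 40, "D": 55, "E": 70}
# Hypothetical weights: subtest E is left out of this composite.
weights = {"A": 2.0, "B": 1.0, "C": 1.0, "D": 1.0}

print(composite_sum(two_section))
print(composite_mean(four_section))
print(composite_weighted(five_subtests, weights))
```

In each case the subtests are still scored and analyzed separately; the composite is computed only after that.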

A Different Animal: Test with Sections

A battery is different from a single test that has distinct sections.  For example, a K12 English test might have 10 vocab items, 10 sentence-completion grammar items, and 2 essays.  Such tests are usually analyzed as a single test, as they are psychometrically unidimensional.

How to Deliver A Test Battery

