NCCA Accreditation is a stamp of approval on the quality of a certification program, governed by the National Commission for Certifying Agencies (NCCA)™.  This is part of the Institute for Credentialing Excellence™, the leader in the world of professional credentialing.  NCCA accreditation tells your certificants – and all stakeholders in your profession, including customers/patients – that the credential meets best practices and international standards, so they can trust the quality of the personnel who have achieved it.  In many cases, you can’t have this trust with an unaccredited credential, though there are definitely many decent ones that simply lack the size/funding to get accredited.

What is NCCA Accreditation?

Getting a certification accredited shows that it is of good quality.  Anyone can write 50 questions on a topic in their basement and throw them up on some free survey/quiz software, then call it a certification.  In fact, many places do, and charge hundreds of dollars for this.  NCCA accreditation is a push back on this practice, where respectable certifications banded together and agreed on a few main points regarding what is high quality.  Some examples:

  • You must have an oversight board, which includes a public member
  • You must have a legit organization with audited financial statements
  • You must have policies for application, retakes, continuing education, and more
  • There must be a firewall between certification staff and education staff
  • The test must be professionally designed and maintained
  • The test must be delivered securely, with proctoring.

What do we mean by “certification program”?

A certification is a validation of a person’s skills and knowledge for a particular profession.  We all think of it as a test that must be passed, but that’s actually a minority of the process.  There are also things like initial education, eligibility pathways to sit for the exam, retake policies, how to get recertified, continuing education, etc.  On top of that, there are organizational issues; you need to make sure that there is an appropriate governing board, that education and certification staff don’t overlap, that you have valid financial accounting, etc.  So that’s why the accreditation refers to a “program” and not just a “test.”

This means that an organization with multiple certification programs will need to apply for accreditation on each.  However, since many of the aspects are about the organization (e.g., financial statements), there is massive overlap and these can be re-used for each.

What do we mean by “stamp of approval”?

NCCA is a panel of experts, composed of a range of stakeholders in the industry: PhD psychometricians, internationally-known certification managers, attorneys with expertise in this specific topic, and so on.  You need to complete a formal application process, submitting tons of documentation about the aforementioned topics.  The panel will then review this and grant accreditation, stating that you have followed all the standards.

Again, note that this is not just a stamp of approval on the exam.  If you have an exam for certified Widgetmakers and you have a panel of expert Widgetmakers, the NCCA is not going to evaluate your actual questions.  They are evaluating much bigger questions.  Do you have a nonprofit board set up and have the correct legal governance?  Do you have audited financial statements like any other sound entity?  Do you have a published Candidate Handbook that lays out everything from how to initially apply for the certification to how to maintain it for your career?

Why should we get accredited?

In many cases, it is not necessary to achieve NCCA accreditation.  However, there are several reasons to pursue it:

  1. Quality Assurance: Accreditation ensures that a certification program meets established standards of quality and rigor. It validates that the program has undergone a comprehensive review by an independent accrediting body and has demonstrated its adherence to industry-recognized standards and best practices. Accreditation helps maintain and improve the quality of the certification program over time.
  2. Credibility and Recognition: Accreditation adds credibility and recognition to a certification program. It signifies that the program has been evaluated by experts in the field and has met rigorous criteria. Accreditation enhances the reputation of the certification, making it more valuable and trusted by employers, professionals, and other stakeholders.  This helps you sell more certifications; remember, credentialing is a business and certifications are the flagship product!
  3. Industry Acceptance: Accreditation can increase the acceptance and recognition of a certification within the industry or professional community. It provides assurance to employers, clients, and regulatory bodies that the certified individuals have acquired the necessary knowledge, skills, and competencies to perform their roles effectively.
  4. Competitive Advantage: Some fields, like personal training, have many organizations that offer training and certifications.  Achieving accreditation provides an advantage over your competitors in the marketplace.
  5. Standardization: Accreditation promotes standardization and consistency in the certification process. It ensures that the program’s content, assessment methods, passing criteria, and recertification requirements are fair, transparent, and consistent across all candidates. Standardization helps maintain a level playing field and ensures that certified individuals possess the same level of expertise.
  6. Career Advancement: Accreditation can enhance career opportunities for individuals holding the certified credential. It demonstrates their commitment to professional development and continuous learning. Accredited certifications are often preferred or required by employers, which can lead to better job prospects, promotions, and salary advancements.
  7. Regulatory Compliance: In some industries or professions, accreditation may be a requirement for regulatory compliance. Certain certifications may be mandated by licensing boards or regulatory authorities to ensure public safety, consumer protection, or adherence to specific standards and regulations.  Another example is that if you are selling certifications to members of the US Military, they need to be accredited.

These are all very good reasons, certainly.

What is involved in NCCA Accreditation?

The time and cost can vary widely depending on the current state of your organization. If you read the NCCA Standards (requirements to get accredited), they generally fall into 3 categories:

1. Psychometrics and test development: You need to follow best practices in making the exam.  You can’t just write 50 items in your basement and throw them up on a survey platform.  You need statistical reports, a job task analysis study, standard setting studies, a defensible pass score, and much more.
2. Certification operations and policies: You need to establish policies and procedures, then document them in a Candidate Handbook.  You need to set up a business: accepting payments, bookkeeping, tracking status, retakes, annual recertification, perhaps a member conference or webinars, etc.
3. Business/legal/governance:  You need to be a legit organization with Bylaws and audited financial statements.

What is the cost of NCCA Accreditation?

A rule of thumb that I have heard in the industry is that achieving NCCA accreditation for a certification exam will take 1 year and $100,000. Most of that is for parts 2 and 3, which are typically done by you, and not your testing vendor.  So those costs are not what is paid to NCCA for the application process, either.  It is to your staff, to work on a quality Candidate Handbook, set up quarterly webinars for continuing education, create a registration portal – whatever makes sense for you, as long as it follows the Standards.  In some cases they might be things you already do, such as audited financial statements.

We specialize in the psychometrics, which costs far less than $100,000 and takes 3-6 months depending on availability of your subject matter experts. We can certainly work on parts 2 and 3 if you do not have bandwidth and expertise internally.  We can also deliver the exams for you.

If you aren’t sure of the next steps, we can perform an audit on your current state and potential timeline, which will provide a much clearer picture.  Contact us to learn more.

Note: this is not an endorsement of NCCA by ASC, or vice versa, and is meant for educational purposes only.

 

The Beuk Compromise or Beuk Adjustment is a method for a “reality check” on the results of a modified-Angoff standard setting study.  It is well-known that experts will often overestimate examinee capabilities and choose a cutscore that is too high – in some cases, so high that even the experts themselves would fail the exam!  The Beuk Compromise was designed to balance this with the reality of actual examinee performance.  There are similar methods as well, such as the Hofstee Method.

What is a modified-Angoff study?

The Angoff approach is one of the most common ways of setting a defensible cutscore on an exam, especially in the world of professional credentialing (certification and licensure testing).  A panel of subject matter experts (SMEs) is convened to discuss the concept of a minimally competent candidate (MCC) and then review each item on the exam to estimate the percentage of minimally competent candidates that would get each item correct.  The average of these ratings is then the average score that the panel expects an MCC to achieve – a very compelling argument for what should be the passing score!
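The arithmetic behind that average can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `angoff_cutscore` and the ratings are hypothetical, each rating is the proportion of MCCs an SME expects to answer the item correctly, and each item is worth one point.

```python
from statistics import mean


def angoff_cutscore(ratings):
    """Return (panel cutscore, per-SME cutscores) in raw score points.

    ratings[s][i] is SME s's estimated proportion of minimally competent
    candidates (MCCs) who would answer item i correctly.
    """
    # The sum of an SME's ratings is the raw score that SME expects from an MCC.
    per_sme = [sum(sme_ratings) for sme_ratings in ratings]
    # The panel recommendation is the average across SMEs.
    return mean(per_sme), per_sme


# Hypothetical ratings: 3 SMEs rating a 4-item exam.
ratings = [
    [0.9, 0.7, 0.6, 0.8],
    [0.8, 0.6, 0.5, 0.9],
    [1.0, 0.8, 0.7, 0.7],
]
cutscore, per_sme = angoff_cutscore(ratings)
percent_cutscore = 100 * cutscore / len(ratings[0])
print(cutscore, percent_cutscore)
```

Keeping the per-SME values is useful later, because adjustment methods such as the Beuk compromise need each SME's own cutscore rather than only the panel average.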

OK, then what is the issue?

But in practice, the experts are often in rarefied air and have forgotten what it was like to be 22 years old and entering the profession wide-eyed, so they often overestimate both what the MCC can do and the item ratings themselves.  You might find a situation where they set the cutscore at 82, but the average score on the exam is 63.  You might go further and ask the experts to take the exam themselves and find their average is only 75!

So, psychometricians have developed add-on procedures to address this issue.  Each SME can also be asked to provide information for an adjustment or compromise method.  A compromise method assumes that we should not rely on modified-Angoff ratings alone; the results of another method should be considered in conjunction.

The most common adjustment method is the Beuk adjustment or Beuk compromise, which recognizes that a pure Angoff study makes no use of actual data on the test, and instead attempts to reconcile the Angoff approach with an estimate of the score distribution on the test.  Of course, this approach can then be only used if data exists; if there is no data available with which to estimate the score distribution, the Beuk adjustment is not possible.

What is the Beuk Compromise?

To find the Beuk compromise, two pieces of information are needed from each SME: an estimated pass rate and an estimated cutscore.  The estimated cutscore is obtained by calculating the average Angoff rating for each SME; the estimated pass rate must be asked for directly, as what they think the pass rate should be.  What you will often find is that they say the pass rate should be, say, 75%, but when you continue the earlier example (average score of 63), the pass rate with their recommended cutscore turns out to be 10%!

How do I implement the Beuk Compromise?

Use the Angoff Analysis Tool.

SMEs are then simply asked in the meeting to estimate the pass rate of examinees who take the test, after having reviewed all the items.   Enter those values into the AAT in the assigned cells. If the SMEs consider the test difficult with regard to the cutscore that should be applied and the types of examinees, a low pass rate will be estimated.  These ratings are recorded on the “Adjustments” tab of the AAT.

The Beuk adjustment is best depicted graphically, and this figure is presented on the last tab of the AAT workbook.  It involves two functions:

  1. A curve that presents the pass rate as a function of all possible cutscores – this is calculated using the estimates of the score distribution.
  2. A straight line that is a function of the estimated pass rates. The line must pass through the point on the plane where the expected pass rate and panel-recommended cutscore intersect, and has a slope equal to the ratio of the standard deviations of the raters’ cutscore and pass rate estimates.

The x-coordinate of the intersection of these two functions is the Beuk adjustment.  An example of this graph is presented below.  Here, we have a 200-point exam.  A cutscore of 170 would produce a pass rate of about 20%.  A cutscore of 120 would produce a pass rate of about 90%.  The Beuk comes out to be about 145.

Beuk compromise
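For readers who want to see the intersection computed rather than read off a graph, here is a minimal Python sketch. It is not the AAT's implementation. The function name and data are hypothetical, and it assumes the line's slope is the standard deviation of the SMEs' pass rate estimates divided by the standard deviation of their cutscore estimates, with cutscore on the x-axis and pass rate on the y-axis, which is how the method is commonly described.

```python
from statistics import mean, stdev


def beuk_cutscore(sme_cutscores, sme_pass_rates, observed_scores, max_score):
    """Return the integer cutscore where the panel line meets the empirical curve."""
    x_bar = mean(sme_cutscores)    # panel-recommended cutscore
    y_bar = mean(sme_pass_rates)   # panel-expected pass rate (as a proportion)
    # Assumption: slope = SD of pass rate estimates / SD of cutscore estimates.
    slope = stdev(sme_pass_rates) / stdev(sme_cutscores)
    n = len(observed_scores)

    def empirical_pass_rate(cut):
        # Function 1: proportion of examinees scoring at or above the cutscore.
        return sum(score >= cut for score in observed_scores) / n

    def panel_line(cut):
        # Function 2: straight line through (x_bar, y_bar).
        return y_bar + slope * (cut - x_bar)

    # Check every possible cutscore and keep the one closest to the crossing.
    return min(
        range(max_score + 1),
        key=lambda cut: abs(empirical_pass_rate(cut) - panel_line(cut)),
    )


# Hypothetical data: three SMEs and 50 examinees on a 100-point exam.
sme_cutscores = [80, 82, 84]
sme_pass_rates = [0.70, 0.75, 0.80]
observed_scores = list(range(40, 90))

compromise = beuk_cutscore(sme_cutscores, sme_pass_rates, observed_scores, 100)
print(compromise)
```

With these made-up numbers the compromise falls below the panel's average Angoff cutscore of 82, which is the direction you would expect when the panel's expectations are higher than actual examinee performance. Note that the sketch needs at least two SMEs who disagree on the cutscore, since the slope divides by that standard deviation.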

Test equating refers to the issue of defensibly translating scores from one test form to another. That is, if you have an exam where half of students see one set of items while the other half see a different set, how do you know that a score of 70 is the same on both forms? What if one is a bit easier? If you are delivering assessments in conventional linear forms – or piloting a bank for CAT/LOFT – you are likely to utilize more than one test form, and, therefore, are faced with the issue of test equating.

When two test forms have been properly equated, educators can validly interpret performance on one test form as having the same substantive meaning compared to the equated score of the other test form (Ryan & Brockmann, 2009). While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. This post will provide an overview of the topic.

Why do we need test linking and equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale. Suppose you take Form A and get a score of 72/100 while your friend takes Form B and gets a score of 74/100. Is your friend smarter than you, or did his form happen to have easier questions?  What if the passing score on the exam was 73? Well, if the test designers built in some overlap of items between the forms, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items. They are delivered to a large, representative sample. Here are the results.

Form | Mean score on 50 overlap items | Mean score on 100 total items
Form A | 30 | 72
Form B | 32 | 74

Because the mean score on the anchor items was higher, we then think that the Form B group was a little smarter, which led to a higher total score.

Now suppose these are the results:

Form | Mean score on 50 overlap items | Mean score on 100 total items
Form A | 32 | 72
Form B | 32 | 74

Now, we have evidence that the groups are of equal ability. The higher total score on Form B must then be because the unique items on that form are a bit easier.

What is test equating?

According to Ryan and Brockmann (2009), “Equating is a technical procedure or process conducted to establish comparable scores, with equivalent meaning, on different versions of test forms of the same test; it allows them to be used interchangeably.” (p. 8). Thus, successful equating is an important factor in evaluating assessment validity, and, therefore, it often becomes an important topic of discussion within testing programs.

Practice has shown that scores, and tests producing scores, must satisfy very strong requirements to achieve this demanding goal of interchangeability. Equating would not be necessary if test forms were assembled as strictly parallel, meaning that they would have identical psychometric properties. In reality, it is almost impossible to construct multiple test forms that are strictly parallel, and equating is necessary to fine-tune the test construction process.

Dorans, Moses, and Eignor (2010) suggest the following five requirements towards equating of two test forms:

  • tests should measure the same construct (e.g. latent trait, skill, ability);
  • tests should have the same level of reliability;
  • the equating transformation should be symmetric: the function mapping scores from one form to the other should be the inverse of the function mapping them back;
  • test results should not depend on the test form an examinee actually takes;
  • the equating function used to link the scores of two tests should be the same regardless of the choice of (sub) population from which it is derived.

How do I calculate an equating?

Classical test theory (CTT) methods include linear equating and equipercentile equating as well as several others. Some newer approaches that work well with small samples are Circle-Arc (Livingston & Kim, 2009) and Nominal Weights (Babcock, Albano, & Raymond, 2012).  Specific methods for linear equating include Tucker, Levine, and Chained (von Davier & Kong, 2003). Linear equating approaches are conceptually simple and easy to interpret; given the examples above, the equating transformation might be estimated with a slope of 1.01 and an intercept of 1.97, which would directly confirm the hypothesis that one form was about 2 points easier than the other.
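To make the slope-and-intercept idea concrete, here is a minimal Python sketch of the simplest linear equating case. It assumes a random (equivalent) groups design, where the two forms are matched on mean and standard deviation; the Tucker, Levine, and Chained methods mentioned above additionally use anchor items and are not shown. The function name and scores are hypothetical.

```python
from statistics import mean, pstdev


def linear_equating(scores_x, scores_y):
    """Return (slope, intercept) that convert Form X scores to the Form Y scale.

    Assumption: the groups taking each form are equivalent in ability, so the
    forms are matched on mean and standard deviation only.
    """
    slope = pstdev(scores_y) / pstdev(scores_x)
    intercept = mean(scores_y) - slope * mean(scores_x)
    return slope, intercept


# Hypothetical total scores from two equivalent groups.
scores_x = [60, 70, 80]   # Form X (slightly harder)
scores_y = [62, 72, 82]   # Form Y (slightly easier)

slope, intercept = linear_equating(scores_x, scores_y)
equated = slope * 72 + intercept   # a Form X score of 72 on the Form Y scale
print(slope, intercept, equated)
```

In this toy data Form Y is uniformly 2 points easier, so the transformation has a slope of 1 and an intercept of 2, and a 72 on Form X corresponds to a 74 on Form Y.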

Item response theory (IRT) approaches include equating through common items (equating by applying an equating constant, equating by concurrent or simultaneous calibration, and equating with common items through test characteristic curves), and common person calibration (Ryan & Brockmann, 2009). The common-item approach is quite often used, and specific methods for finding the constants (conversion parameters) include Stocking-Lord, Haebara, Mean/Mean, and Mean/Sigma. Because IRT assumes that two scales on the same construct differ by only a simple linear transformation, all we need to do is find the slope and intercept of that transformation. Those methods do so, and often produce nice looking figures like the one below from the program IRTEQ (Han, 2007). Note that the b parameters do not fall on the identity line, because there was indeed a difference between the groups, and the results clearly find that is the case.

IRTEQ IRT equating
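Of the methods listed above, Mean/Sigma is simple enough to sketch by hand. The Python below is an illustration only, not how IRTEQ is implemented: it estimates the slope and intercept from the difficulty (b) parameters of the common items as calibrated on each form, then rescales an item. The function names and parameter values are hypothetical.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_new, b_base):
    """Return slope A and intercept B that place the new form on the base scale.

    b_new and b_base hold the difficulty estimates of the same anchor items,
    in the same order, from the new-form and base-form calibrations.
    """
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B


def rescale_item(a, b, A, B):
    """Transform an item's discrimination (a) and difficulty (b) to the base scale."""
    return a / A, A * b + B


# Hypothetical anchor-item difficulties from two separate calibrations.
b_new = [-1.0, 0.0, 1.0]
b_base = [-0.5, 0.7, 1.9]

A, B = mean_sigma_constants(b_new, b_base)
a_star, b_star = rescale_item(1.2, 0.0, A, B)
print(A, B, a_star, b_star)
```

Examinee ability estimates are transformed the same way as difficulties (A times theta plus B). Characteristic curve methods such as Stocking-Lord and Haebara estimate the same two constants by minimizing differences between curves, which requires an optimizer and is better left to dedicated software.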

Practitioners can equate forms with CTT or IRT. However, one of the reasons that IRT was invented was that equating with CTT was very weak. Hambleton and Jones (1993) explain that when CTT equating methods are applied, both the ability parameter (i.e., observed score) and the item parameters (i.e., difficulty and discrimination) are dependent on each other, limiting its utility in practical test development. IRT solves the CTT interdependency problem by combining ability and item parameters in one model. The IRT equating methods are more accurate and stable than the CTT methods (Hambleton & Jones, 1993; Han, Kolen, & Pohlmann, 1997; De Ayala, 2013; Kolen & Brennan, 2014) and provide a solid basis for modern large-scale computer-based tests, such as computerized adaptive tests (Educational Testing Service, 2010; OECD, 2017).

Of course, one of the reasons that CTT is still around in general is that it works much better with smaller samples, and this is also the case for CTT test equating (Babcock, Albano, & Raymond, 2012).

How do I implement test equating?

Test equating is a mathematically complex process, regardless of which method you use.  Therefore, it requires special software.  Here are some programs to consider.

  1. CIPE performs both linear and equipercentile equating with classical test theory. It is available from the University of Iowa’s CASMA site, which also includes several other software programs.
  2. IRTEQ is an easy-to-use program which performs all major methods of IRT Conversion equating.  It is available from the University of Massachusetts website, as well as several other good programs.
  3. There are many R packages for equating and related psychometric topics. This article claims that there are 45 packages for IRT analysis alone!
  4. If you want to do IRT equating, you need IRT calibration software. We highly recommend Xcalibre since it is easy to use and automatically creates reports in Word for you. If you want to do the calibration approach to IRT equating (both anchor-item and concurrent-calibration), rather than the conversion approach, this is handled directly by IRT software like Xcalibre. For the conversion approach, you need separate software like IRTEQ.

Equating is typically performed by highly trained psychometricians; in many cases, an organization will contract out to a testing company or consultant with the relevant experience. Contact us if you’d like to discuss this.

Does equating happen before or after delivery?

Both. These are called pre-equating and post-equating (Ryan & Brockmann, 2009).  Post-equating means the calculation is done after delivery, when you have a full data set; for example, if a test is delivered twice per year on a single day, we can do it after that day.  Pre-equating is trickier, because you are trying to calculate the equating before a test form has ever been delivered to an examinee; but this is 100% necessary in many situations, especially those with continuous delivery windows.

How do I learn more about test equating?

If you are eager to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014) that provides the most complete coverage of score equating and linking.  There are other resources more readily available on the internet, like this free handbook from CCSSO. If you would like to learn more about IRT, we suggest the books by De Ayala (2008) and Embretson and Reise (2000). A brief intro of IRT equating is available on our website.

Several new ideas of general use in equating, with a focus on kernel equating, were introduced in the book by von Davier, Holland, and Thayer (2004). Holland and Dorans (2006) presented a historical background for test score linking, based on work by Angoff (1971), Flanagan (1951), and Petersen, Kolen, and Hoover (1989). If you look for a straightforward description of the major issues and procedures encountered in practice, then you should turn to Livingston (2004).



References

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). American Council on Education.

Babcock, B., Albano, A., & Raymond, M. (2012). Nominal Weights Mean Equating: A Method for Very Small Samples. Educational and Psychological Measurement, 72(4), 1-21.

Dorans, N. J., Moses, T. P., & Eignor, D. R. (2010). Principles and practices of test score equating. ETS Research Report Series, 2010(2), i-41.

De Ayala, R. J. (2008). A commentary on historical perspectives on invariant measurement: Guttman, Rasch, and Mokken.

De Ayala, R. J. (2013). Factor analysis with categorical indicators: Item response theory. In Applied quantitative analysis in education and the social sciences (pp. 220-254). Routledge.

Educational Testing Service (2010). Linking TOEFL iBT Scores to IELTS Scores: A Research Report. Educational Testing Service.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.

Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695-763). American Council on Education.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38-47.

Han, T., Kolen, M., & Pohlmann, J. (1997). A comparison among IRT true- and observed-score equatings and traditional equipercentile equating. Applied Measurement in Education, 10(2), 105-121.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187-220). Praeger.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, linking, and scaling: Methods and practices (2nd ed.). Springer-Verlag.

Kolen, M. J., & Brennan, R. L. (2014). Item response theory methods. In Test Equating, Scaling, and Linking (pp. 171-245). Springer.

Livingston, S. A. (2004). Equating test scores (without IRT). ETS.

Livingston, S. A., & Kim, S. (2009). The Circle-Arc Method for Equating in Small Samples. Journal of Educational Measurement, 46(3), 330-343.

OECD (2017). PISA 2015 Technical Report. OECD Publishing.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). Macmillan.

Ryan, J., & Brockmann, F. (2009). A Practitioner’s Introduction to Equating with Primers on Classical Test Theory and Item Response Theory. Council of Chief State School Officers.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. Springer.

von Davier, A. A., & Kong, N. (2003). A unified approach to linear equating for non-equivalent groups design. Research report 03-31 from Educational Testing Service. https://www.ets.org/Media/Research/pdf/RR-03-31-vonDavier.pdf

Certification vs Licensure exams are two terms that are used quite frequently to refer to examinations that someone has to pass to demonstrate skills in a certain profession or topic.  They are quite similar, and often confused.  This is exacerbated by even more similar terms in the field, such as accreditation, credentialing, certificate, and microcredentials.  This post will help you understand the differences.

What is Certification?

Certification is “a credential that you earn to show that you have specific skills or knowledge. They are usually tied to an occupation, technology, or industry.” (CareerOneStop)  The important aspect in this definition is the latter portion; the organization that runs the certification generally spans an industry or a profession, regardless of political boundaries.  It is almost always some sort of professional association or industry board, like the American Association of Widgetmakers (obviously not a real thing).  However, it is sometimes governed by a specific company or other organization regarding their products; perhaps the most well known is how Amazon Web Services will certify you in skills to handle their offerings.  Many other technology and software companies do the same.

What is Licensure?

Licensure is a “formal permission to do something: esp., authorization by law to do some specified thing (license to marry, practice medicine, hunt, etc.)” (Schmitt, 1995).  The key phrase here is by law.  The sponsoring organization is a governmental entity, and that is what defines licensure.  In fact, licensure is not even always about a profession; almost all of us have a Driver’s License for which we passed a simple exam.  Moreover, it does not always even have to involve an exam; many millions of people have a Fishing License, which is granted by the government (by States in the USA), for which you simply pay a small fee.  The license is still an attestation, but not of your skills, just that you have been authorized to do something.  Of course, in the context of assessment, it means that you have passed some sort of exam which is mandated by law, typically for professions that are dangerous enough or impact a wide enough range of people that the government has stepped in to provide oversight: attorneys, physicians, other medical professionals, etc.

Certification vs Licensure Exams

Usually, there is a test that you must pass, but the sponsor can differ with certification vs licensure.  The development and delivery of such tests is extremely similar, leading to the confusion.  They often will both utilize job analysis, Angoff studies, and the like.  The difference between the two is outside the test itself, and instead refers to the sponsoring organization: is it mandated/governed by a governmental entity, or is it unrelated to political/governmental boundaries?  You are awarded a credential after successful completion, but the difference is in the group that awards the credential, what it means, and where it is recognized.

However, there are many licensures that do not involve an exam; you simply need to file some paperwork with the government.  An example of this is a marriage license.  You certainly don’t have to take a test to qualify!

Can they be the same exam?

To make things even more confusing… yes.  And it does not even have to be consistent.  In the US, some professions have a wide certification, which is also required in some States as licensure, but not in all States!  Some States might have their own exams, or not even require an exam.  This muddles the difference between certification vs licensure.  ICRC notes that they are sometimes complementary or parallel processes.

Differences between Certification and Licensure

Aspect | Certification | Licensure
Mandatory? | No | Yes
Run by | Association, Board, Nonprofit, Private Company | Government
Does it use an exam? | Yes, especially if it is accredited | Sometimes, but often not (consider a marriage license)
Accreditation involved? | Yes, NCCA and ANSI provide accreditation that a certification is high quality | No; often there is no check on quality
Examples | Certified Chiropractic Sports Physician (CCSP®), Certified in Clean Needle Technique (CNT) | Marriage license; Driver’s License; Fishing License; License to practice law (Bar Exam)

How do these terms relate to other, similar terms?

This outline summarizes some of the relevant terms regarding certification vs licensure and other credentials.  This is certainly more than can be covered in a single blog post!

  • Attestation of some level of quality for a person or organization = CREDENTIALING
    • Attestation of a person
      • By government = LICENSURE
      • By independent board or company
        • High stakes, wide profession = CERTIFICATION
        • Medium stakes = CERTIFICATE
        • Low stakes, quite specific skill = MICROCREDENTIAL
      • By an educational institution = DEGREE OR DIPLOMA
    • Attestation of an organization = ACCREDITATION

Authors: 

Laila Issayeva, MS

Nathan Thompson, PhD

The Bookmark Method of standard setting (Lewis, Mitzel, & Green, 1996) is a scientifically-based approach to setting cutscores on an examination. It allows stakeholders of an assessment to make decisions and classifications about examinees that are constructive rather than arbitrary (e.g., 70%), meet the goals of the test, and contribute to overall validity. A major advantage of the bookmark method over others is that it utilizes difficulty statistics on all items, making it very data-driven; but this can also be a disadvantage in situations where such data is not available. It also has the advantage of panelist confidence (Karantonis & Sireci, 2006).

The bookmark method operates by delivering a test to a representative sample (or population) of examinees, and then calculating the difficulty statistics for each item. We line up the items in order of difficulty, and experts review the items to place a bookmark where they think a cutscore should be. Nowadays, we use computer screens, but of course in the past this was often done by printing the items in paper booklets, and the experts would literally insert a bookmark.

What is standard setting?

Standard setting (Cizek & Bunch, 2006) is an integral part of the test development process even though it has been undervalued outside of practitioners’ view in the past (Bejar, 2008). Standard setting is the methodology of defining achievement or proficiency levels and corresponding cutscores. A cutscore is a score that serves as a measure of classifying test takers into categories.

Educational assessments and credentialing examinations are often employed to distribute test takers among ordered categories according to their performance across specific content and skills (AERA, APA, & NCME, 2014; Hambleton, 2013). For instance, in tests used for certification and licensing purposes, test takers are typically classified as “pass” (those who score at or above the cutscore) or “fail” (those who score below it). In education, students are often classified in terms of proficiency; the Nation’s Report Card assessment (NAEP) in the United States classifies students as Below Basic, Basic, Proficient, or Advanced.

However, assessment results could come into question unless the cutscores are appropriately defined. This is why arbitrary cutscores are considered indefensible and lacking validity. Instead, psychometricians help test sponsors to set cutscores using methodologies from the scientific literature, driven by evaluations of item and test difficulty as well as examinee performance.

When to use the bookmark method?

Two approaches are mainly used in international practice to establish assessment standards: the Angoff method (Cizek, 2006) and the Bookmark method (Buckendahl, Smith, Impara, & Plake, 2000). The Bookmark method, unlike the Angoff method, requires the test to be administered before cutscores are defined, because the cutscores are based on test data. This provides additional weight to the validity of the process, and better informs the subject matter experts during the process. Of course, many exams require a cutscore to be set before they are published, which is impossible with the bookmark method; the Angoff procedure is very useful then.

How do I implement the bookmark method?

The process of standard setting employing the Bookmark method consists of the following stages:

  1. Identify a team of around 6-12 subject matter experts (SMEs), led by a test developer/psychometrician/statistician
  2. Analyze test takers’ responses using item response theory (IRT)
  3. Create a list of items ordered by item difficulty, in ascending order
  4. Define the competency levels for test takers; for example, have the 6-12 experts discuss what should differentiate a “pass” candidate from a “fail” candidate
  5. Experts read the items in the ascending order (they do not need to see the IRT values), and place a bookmark where appropriate based on professional judgement across well-defined levels
  6. Calculate thresholds based on the bookmarks set, across all experts
  7. If needed, discuss results and perform a second round
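The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item difficulties are made-up values, a bookmark is mapped to the IRT scale as the midpoint between the two adjacent items, and the panel result is the median across experts. Operational studies often use a response-probability criterion for the mapping instead, so treat the function names and rules here as hypothetical.

```python
from statistics import median

# Hypothetical IRT difficulty (b) parameters keyed by item ID (step 2 output).
item_difficulty = {"Q1": 0.4, "Q2": -1.2, "Q3": 1.1, "Q4": -0.3, "Q5": 0.0, "Q6": 1.8}


def ordered_item_booklet(difficulties):
    """Step 3: item IDs sorted from easiest to hardest."""
    return sorted(difficulties, key=difficulties.get)


def bookmark_to_theta(bookmark, booklet, difficulties):
    """Map one expert's bookmark to a point on the IRT scale.

    `bookmark` is the number of items placed before the bookmark.
    Assumption: the cut is the midpoint between the difficulty of the
    last item before the bookmark and the first item after it.
    """
    if not 1 <= bookmark < len(booklet):
        raise ValueError("bookmark must fall between two items")
    below = difficulties[booklet[bookmark - 1]]
    above = difficulties[booklet[bookmark]]
    return (below + above) / 2


def panel_cutscore(bookmarks, booklet, difficulties):
    """Step 6: combine all experts' cuts (assumption: use the median)."""
    return median(bookmark_to_theta(b, booklet, difficulties) for b in bookmarks)


booklet = ordered_item_booklet(item_difficulty)
# Five hypothetical experts place their bookmarks after items 3, 4, 3, 4, and 5.
cut = panel_cutscore([3, 4, 3, 4, 5], booklet, item_difficulty)
```

The median is used here because it is less sensitive than the mean to a single expert with an unusually high or low bookmark; a mean is an equally plausible choice.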

Example of the Bookmark Method

If there are four competency levels, as in the NAEP example, then SMEs need to place three bookmarks between them: the first bookmark is placed after the last item that the minimally competent candidate for the first level would be expected to answer, and the second and third follow the same logic for the next levels. This yields thresholds/cutscores from level 1 to 2, 2 to 3, and 3 to 4. SMEs perform this individually, without discussion, by reading the items.

When all SMEs have provided their opinion, the standard setting coordinator combines all results into one spreadsheet and leads a discussion in which all participants explain their opinions with reference to the bookmarks they set. This might look like the sheet below. Note that SME4 had a relatively high standard in their mind, while SME2 had a low standard in their mind – placing virtually every student above an IRT score of 0.0 into the top category!

[Figure: bookmark method 1]

After the discussion, the SMEs are given one more opportunity to set the bookmarks again. Usually, after the exchange of opinions, the picture alters. SMEs gain consensus, and the variation in the graphic is reduced.  An example of this is below.
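One simple way to check whether the second round actually moved the panel toward consensus is to compare the spread of the experts' cutscores across rounds. The sketch below uses invented values for five SMEs (they are not taken from the figures in this article) and the standard deviation as the measure of spread, which is an assumption rather than a prescribed statistic.

```python
from statistics import mean, stdev

# Hypothetical IRT-scale cutscores implied by each SME's bookmark, per round.
round1 = {"SME1": 0.10, "SME2": -0.60, "SME3": 0.25, "SME4": 0.90, "SME5": 0.20}
round2 = {"SME1": 0.15, "SME2": -0.05, "SME3": 0.20, "SME4": 0.45, "SME5": 0.20}


def summarize(cuts):
    """Return the panel mean and the spread (sample SD) of the cutscores."""
    values = list(cuts.values())
    return {"mean": mean(values), "sd": stdev(values)}


def consensus_improved(before, after):
    """True if the experts' cutscores vary less after the discussion."""
    return summarize(after)["sd"] < summarize(before)["sd"]


improved = consensus_improved(round1, round2)
```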

[Figure: bookmark method]

What to do with the results?

Based on the SMEs’ voting results, the coordinator or psychometrician calculates the final thresholds on the IRT scale, and provides them to the analytical team who would ultimately prepare reports for the assessment across competency levels. This might entail score reports to examinees, feedback reports to teachers, and aggregate reports to test sponsors, government officials, and more.

You can see how the scientific approach will directly impact the interpretations of such reports. Rather than government officials just knowing how many students scored 80-90% correct vs 90-100% correct, the results are framed in terms of how many students are truly proficient in the topic. This makes decisions from test scores – both at the individual and aggregate levels – much more defensible and informative.  They become truly criterion-referenced.  This is especially true when the scores are equated across years to account for differences in examinee distributions and test difficulty, and the standard can be demonstrated to be stable.  For high-stakes examinations such as medical certification/licensure, admissions exams, and many more situations, this is absolutely critical.

Want to talk to an expert about implementing this for your exams?  Contact us.

References

[AERA, APA, & NCME] (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bejar, I. I. (2008). Standard setting: What is it? Why is it important? R&D Connections, 7, 1-6. Retrieved from https://www.ets.org/Media/Research/pdf/RD_Connections7.pdf

Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2000). A comparison of Angoff and Bookmark standard setting methods. Paper presented at the Annual Meeting of the Mid-Western Educational Research Association, Chicago, IL: October 25-28, 2000.

Cizek, G., & Bunch, M. (2006). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests.  Thousand Oaks, CA: Sage.

Cizek, G. J. (2006). Standard setting. In Steven M. Downing and Thomas M. Haladyna (Eds.) Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers, pp. 225-258.

Hambleton, R. K. (2013). Setting performance standards on educational assessments and criteria for evaluating the process. In Setting performance standards, pp. 103-130. Routledge. Retrieved from https://www.nciea.org/publications/SetStandards_Hambleton99.pdf

Karantonis, A., & Sireci, S. (2006). The Bookmark Standard-Setting Method: A Literature Review. Educational Measurement: Issues and Practice, 25(1), 4-12.

Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996, June). Standard setting: A Book-mark approach. In D. R. Green (Chair), IRT-based standard setting procedures utilizing behavioral anchoring. Symposium conducted at the Council of Chief State School Officers National Conference on Large-Scale Assessment, Phoenix, AZ.

Certification exam administration and proctoring is a crucial component of the professional credentialing process.  Certification exams are expensive to develop well, so an organization wants to protect that investment by delivering the exam with appropriate security so that items are not stolen.  Moreover, there is an obvious incentive for candidates to cheat.  So, a certification body needs appropriate processes in place to deliver the certification exams.  Here are some tips.

1. Determine the best approach for certification exam administration and proctoring

Here are a few of the considerations to take into account.

Cohorts vs. Continuous

Do you have cohorts, where events make more sense, or do you need continuous?  For example, if the test is tied to university training programs that graduate candidates in December and May each year, that affects your need for delivery.  Alternatively, some certifications are not tied to such training; you might have to only show work experience.  In those cases, candidates are ready to take the test continuously throughout the year.

Paper vs computer

Does it make more sense to deliver the test on paper or on computer?  This used to be a cost issue, but now the cost of computerized delivery, especially with online proctoring at home, has dropped significantly while saving so much time for candidates.  Also, some exam types like clinical simulations can only be delivered on computers.

Test centers vs online proctored vs events

Some types of tests require events, such as a clinical assessment in an actual clinic with standardized patients.  Some tests can be taken anywhere.  Exam events can also coincide with other events; perhaps you have online delivery through the year but deliver a paper version of the test at your annual conference, for convenience.


Geographic dispersion

If your exam is for a small US state or a small country, it might be easy to require exams in a test center, because you can easily set up only one or two test centers to cover the geography.  Some certifications are international, and need to deliver on-demand throughout the year; those are a great fit for online.

Security level needs

If your test has extremely high stakes, there is extremely high incentive to cheat.  An entry-level certification on WordPress is different than a medical licensure exam.  The latter is a better fit for test centers, while the former might be fine with online proctoring on-demand.

2. Evaluate remote proctoring options

If you choose to explore this approach, here are three main types to evaluate.

A. AI only

AI only proctoring means that there are no humans.  The examinee is recorded on video, and AI algorithms flag potential issues, such as if they leave their seat, then notify an administrator (usually a professor) of students with a high number of flags.  This approach is usually not relevant for certifications or other credentialing exams; it is better suited to low-stakes exams like a Psychology 101 Midterm at your local university.  The vendors for this approach are interested in large-scale projects, such as proctoring all midterms and finals at a university, perhaps hundreds of thousands of exams per year.

B. Record and Review

Record and review proctoring means that the examinee is recorded on video, and that video is later watched by a real human, who flags it if they think there is cheating, theft, or other issues.  This is much higher quality, at a higher price, but has one major flaw that might be concerning for certification tests: if someone steals your test by taking pictures, you won’t find out until tomorrow.  But at least you know who it was and you are certain of what happened, with video proof.  This approach is perhaps useful for microcredentials or recertification exams.

C. Live Online Proctoring

Live online proctoring (LOP), or what I call “live human proctoring” (because some AI proctoring is also “live” in real time!) means that there is a professional human proctor on the other side of the video from the examinee.  They check the examinee in, confirm their identity, scan the room, provide instructions, and actually watch them take the test.  Some providers like MonitorEDU even have the examinee make a second video stream on their phone, which is placed on a bookshelf or similar spot to see the entire room through the test.  Certainly, this approach is a very good fit with certification exams and other credentialing.  You protect the test content as well as the validity of that individual’s score; that is not possible with the other two approaches.

3. Determine other technology, psychometric, and operational needs

Next, your organization should establish the other needs for your exams.  Do you require special item types?  Perhaps adaptive testing or linear-on-the-fly testing?  Psychometric consulting services?  Specific operational controls such as exam time/date windows or navigation limits?  Registration and payment portal?  Write all these up so that you can use the list to shop for a provider.

4. Find an integrated provider for certification exam administration


Most providers of remote proctoring are just that: remote proctoring.  They do not have a professional platform to manage item banks, schedule examinees, deliver tests, create custom score reports, and analyze psychometrics.  Some do not even integrate with such platforms, and only integrate with learning management systems like Moodle, seeing as their entire target market is only low-stakes university exams.  So if you are seeking a vendor for certification testing or other credentialing, the list of potential vendors is smaller.

Our flagship platform, FastTest, works with 6 different remote proctoring providers and can easily integrate with more.  It also supports paper exams, self-hosted events, and testing centers.

5. Establish the new process

Once you have selected a vendor, work with them to establish the new process for delivering your certification exams with remote proctoring.  Remember, this goes FAR beyond exam day!

  • Candidate Handbook
  • Registration and scheduling
  • Candidate training and practice tests
  • Exam delivery (including verification, environmental rules, materials allowed, break policy, etc.)
  • Test security plan:  What do you do if someone is caught taking pictures of the exam with their phone, or the other potential events?

Ready to start?

ASC is one of the world leaders in this endeavor.  Contact us to get a free account in our platform and experience the examinee process, or to receive a demonstration from one of our experts.

Online proctoring software refers to platforms that proctor educational or professional assessments (exams or tests) when the proctor is not in the same room as the examinee.  This means that it is done with a video stream or recording using a webcam and sometimes an additional device, which are monitored by a human and/or AI.  It is also referred to as remote proctoring or invigilation. Online proctoring offers a compelling alternative to in-person proctoring, somewhere in between unproctored at-home tests and tests delivered at an expensive testing center in an office building.  This makes it a perfect fit for medium-stakes exams, such as university placement, pre-employment screening, and many types of certification/licensure tests.

What are the types of online proctoring?

There are many types of online proctoring software on the market, spread across dozens of vendors, especially new ones that sought to capitalize on the pandemic and were not involved with assessment beforehand.  With so many options, how can you select effectively among the types of remote proctoring? There are four types of remote proctoring platforms, which can be adapted to a particular use case, sometimes varying between different tests in a single organization.  ASC supports all four types, and partners with 5 different vendors to help provide the best solution to our clients.  In descending order of security:

Live with professional proctors

What it entails for you:

  • You register a set of examinees in FastTest, and tell us when they are to take their exams and under what rules.
  • We provide the relevant information to the proctors.
  • You send all the necessary information to your examinees.
  • The most secure of the types of remote proctoring.

What it entails for the candidate:

  • Examinee goes to ascproctor.com, where they will initiate a chat with a proctor.
  • After confirmation of their identity and workspace, they are provided information on how to take the test.
  • The proctor then watches a video stream from their webcam as well as a phone on the side of the room, ensuring that the environment is secure. They do not see the screen, so your exam content is not exposed. They maintain exam invigilation continuously.
  • When the examinee is finished, they notify the proctor, and are excused.

Live, bring your own proctor (BYOP)

What it entails for you:

  • You upload examinees into FastTest, which will generate links.
  • You send relevant instructions and the links to examinees.
  • Your staff logs into the admin portal and awaits examinees.
  • Videos with AI flagging are available for later review if needed.

What it entails for the candidate:

  • Examinee will click on a link, which launches the proctoring software.
  • An automated system check is performed.
  • The proctoring is launched.  Proctors ask the examinee to provide identity verification, then launch the test.
  • Examinee is watched on the webcam and screencast.  AI algorithms help to flag irregular behavior.
  • Examinee concludes the test.

Record and Review (with option for AI)

What it entails for you:

  • You upload examinees into FastTest, which will generate links.
  • You send relevant instructions and the links to examinees.
  • After examinees take the test, your staff (or ours) logs in to review all the videos and report on any issues.  AI will automatically flag irregular behavior, making your reviews more time-efficient.

What it entails for the candidate:

  • Examinee will click on a link, which launches the proctoring software.
  • An automated system check is performed.
  • The proctoring is launched.  System asks the examinee to provide identity verification, then launches the test.
  • Examinee is recorded on the webcam and screencast.  AI algorithms help to flag irregular behavior.
  • Examinee concludes the test.

AI only

What it entails for you:

  • You upload examinees into FastTest, which will generate links.
  • You send relevant instructions and the links to examinees.
  • Videos are stored for 1 month if you need to check any.

What it entails for the candidate:

  • Examinee will click on a link, which launches the proctoring software.
  • An automated system check is performed.
  • The proctoring is launched.  System asks the examinee to provide identity verification, then launches the test.
  • Examinee is recorded on the webcam and screencast.  AI algorithms help to flag irregular behavior.
  • Examinee concludes the test.

 

Some case studies for different types of exams

We’ve worked with all types of remote proctoring software, across many types of assessment:

  • ASC delivers high-stakes certification exams for a number of certification boards, in multiple countries, using the live proctoring with professional proctors.  Some of these are available continuously on-demand, while others are on specific days where hundreds of candidates log in.
  • We partnered with a large university in South America, where their admissions exams were delivered using Bring Your Own Proctor, enabling them to drastically reduce costs by utilizing their own staff.
  • We partnered with a private company to provide AI-enhanced record-and-review proctoring for applicants, where ASC staff reviews the results and provides a report to the client.
  • We partner with an organization that delivers civil service exams for a country, and utilizes both unproctored and AI-only proctoring, differing across a range of exam titles.

 

Online Proctoring Software: Two Distinct Markets

First, I would describe the online proctoring industry as actually falling into two distinct markets, so the first step is to determine which of these fits your organization.

  1. Large scale, lower cost (when large scale), lower security systems designed to be used only as a plugin to major LMS platforms like Blackboard or Canvas. These systems are therefore designed for medium-stakes exams like an Intro to Psychology midterm at a university.
  2. Lower scale, higher cost, higher security systems designed to be used with standalone assessment platforms. These are generally for higher-stakes exams like certification or workforce, or perhaps special use at universities like Admissions and Placement exams.

How to tell the difference? The first type will advertise about easy integration with systems like Blackboard or Canvas as a key feature. They will also often focus on AI review of videos, rather than using real humans. Another key consideration is to look at the existing client base, which is often advertised.  

Other ways that online proctoring software can differ

Screen capture:

Some online proctoring providers have an option to record/stream the screen as well as the webcam. Some also provide the option to only do this (no webcam) for lower stakes exams.

Mobile phone as the second camera:

Some newer platforms provide the option to easily integrate the examinee’s mobile phone as a second camera (third stream, if you include screen capture), which effectively operates as a human proctor. Examinees will be instructed to use the video to show under the table, behind the monitor, etc., before starting the exam. They then might be instructed to stand up the phone 2 meters away with a clear view of the entire room while the test is being delivered.  This is in addition to the webcam.

API integrations:

Some systems require software developers to set up an API integration with your LMS or assessment platform. Others are more flexible, and you can just log in yourself, upload a list of examinees, and you are all set.

On-Demand vs. Scheduled:

Some platforms involve the examinee scheduling a time slot. Others are purely on-demand, and the examinee can show up whenever they are ready. MonitorEDU is a prime example of this: examinees show up at any time, present their ID to a live human, and are then started on the test immediately – no downloads/installs, no system checks, no API integrations, nothing.  

More security: A better test delivery software

A good testing delivery platform will also come with its own functionality to enhance test security: randomization, automated item generation, computerized adaptive testing, linear-on-the-fly testing, professional item banking, item response theory scoring, scaled scoring, psychometric analytics, equating, lockdown delivery, and more. In the context of online proctoring, perhaps the most salient is the lockdown delivery. In this case, the test will completely take over the examinee’s computer and they can’t use it for anything else until the test is done.

LMS systems rarely include any of this functionality, because it is not needed for a midterm exam of Intro to Psychology. However, most assessments in the world that have real stakes – university admissions, certifications, workforce hiring, etc. – depend heavily on such functionality. It’s not just out of habit or tradition, either. Such methods are considered essential by international standards including AERA/APA/NCME, ITC, and NCCA.  

ASC’s preferred online proctoring partners

ASC’s online assessment platforms are integrated with some of the leading remote proctoring software providers.

Type | Vendors
Live | MonitorEDU
AI | Alemira, Sumadi, ProctorFree
Record and Review | Alemira, ProctorFree
Bring Your Own Proctor | Alemira

 

List of Online Proctoring Software Providers

Looking to evaluate potential vendors?  Here is a great place to start.

# | Name | Website | Country | Proctor Service
1 | Aiproctor | https://www.aiproctor.com/ | USA | AI
2 | Centre Based Test (CBT) | https://www.conductexam.com/center-based-online-test-software | India | Live, Record and Review
3 | Class in Pocket | https://classinpocket.com/ | India | AI
4 | Datamatics | https://www.datamatics.com/industries/education-technology/proctoring | India | AI, Live, Record and Review
5 | DigiProctor | https://www.digiproctor.com | India | AI
6 | Disamina | https://disamina.in/ | India | AI
7 | Examity | https://www.examity.com/ | USA | Live
8 | ExamMonitor | https://examsoft.com/ | USA | Record and Review
9 | ExamOnline | https://examonline.in/remote-proctoring-solution-for-employee-hiring/ | India | AI, Live
10 | Eduswitch | https://eduswitch.com/ | India | AI
11 | Examus | https://examus.com | Russia | AI, Bring Your Own Proctor, Live
12 | EasyProctor | https://www.excelsoftcorp.com/products/assessment-and-proctoring-solutions/ | India | AI, Live, Record and Review
13 | HonorLock | https://honorlock.com/ | USA | AI, Record and Review
14 | Internet Testing Systems | https://www.testsys.com/ | USA | Bring Your Own Proctor
15 | Invigulus | https://www.invigulus.com/ | USA | AI, Live, Record and Review
16 | Iris Invigilation | https://www.irisinvigilation.com/ | Australia | AI
17 | Mettl | https://mettl.com/en/online-remote-proctoring/ | India | AI, Live, Record and Review
18 | MonitorEdu | https://monitoredu.com/proctoring | USA | Live
19 | OnVUE | https://home.pearsonvue.com/Test-takers/OnVUE-online-proctoring.aspx | USA | Live
20 | Oxagile | https://www.oxagile.com/competence/edtech-solutions/proctoring/ | USA | AI, Live, Record and Review
21 | Parakh | https://parakh.online/blog/remote-proctoring-ultimate-solution-for-secure-online-exam | India | AI, Live, Record and Review
22 | ProctorFree | https://www.proctorfree.com/ | USA | AI, Live
23 | Proctor360 | https://proctor360.com/ | USA | AI, Bring Your Own Proctor, Live, Record and Review
24 | ProctorEDU | https://proctoredu.com/ | Russia | AI, Live, Record and Review
25 | ProctorExam | https://proctorexam.com/ | Netherlands | Bring Your Own Proctor, Live, Record and Review
26 | Proctorio | https://proctorio.com/products/online-proctoring | USA | AI, Live
27 | Proctortrack | https://www.proctortrack.com/ | USA | AI, Live
28 | ProctorU | https://www.proctoru.com/ | USA | AI, Live, Record and Review
29 | Proview | https://proview.io/ | USA | AI, Live
30 | PSI Bridge | https://www.psionline.com/en-gb/platforms/psi-bridge/ | USA | Live, Record and Review
31 | Respondus Monitor | https://web.respondus.com/he/monitor/ | USA | AI, Live, Record and Review
32 | Rosalyn | https://www.rosalyn.ai/ | USA | AI, Live
33 | SmarterProctoring | https://smarterservices.com/smarterproctoring/ | USA | AI, Bring Your Own Proctor, Live
34 | Sumadi | https://sumadi.net/ | Honduras | AI, Live, Record and Review
35 | Suyati | https://suyati.com/ | India | AI, Live, Record and Review
36 | TCS iON Remote Assessments | https://learning.tcsionhub.in/hub/remote-assessment-marking-internship/ | India | AI, Live
37 | Think Exam | https://www.thinkexam.com/remoteproctoring | India | AI, Live
38 | uxpertise XP | https://uxpertise.ca/en/uxpertise-xp/ | Canada | AI, Live, Record and Review
39 | Proctor AI | https://www.visive.ai/solutions/proctor-ai | India | AI, Live, Record and Review
40 | Wise Proctor | https://wiseattend.com/wiseproctor | USA | AI, Record and Review
41 | Xobin | https://xobin.com/online-remote-proctoring | India | AI
42 | Youtestme | https://www.youtestme.com/online-proctoring/ | Canada | AI, Live

 

How do I select a vendor?

First, determine the level of security necessary, and the trade-off with costs.  Live proctoring with professionals can cost $30 to $100 or more, while AI proctoring can be as little as a few dollars.  Then, evaluate some vendors to see which group they fall into; note that some vendors can do all of them!  Then, ask for some demos so you understand the business processes involved and the UX on the examinee side, both of which could substantially impact the soft costs for your organization.  Then, start negotiating with the vendor you want!

Want some more information?

Get in touch with us; we’d love to show you a demo or introduce you to partners!

Email solutions@assess.com.

If you have worked in the field of assessment and psychometrics, you have undoubtedly encountered the word “standard.” While a relatively simple word, it has the potential to be confusing because it is used in three (and more!) completely different but very important ways. Here’s a brief discussion.

Standard = Cutscore

As noted by the well-known professor Gregory Cizek here, “standard setting refers to the process of establishing one or more cut scores on a test.” The various methods of setting a cutscore, like Angoff or Bookmark, are referred to as standard setting studies. In this context, the standard is the bar that separates a Pass from a Fail. We use methods like the ones mentioned to determine this bar in as scientific and defensible a fashion as possible, and give it more concrete meaning than an arbitrarily selected round number like 70%. Selecting a round number like that can expose you to legal challenges, since there is no criterion-referenced interpretation behind it.

Standard = Blueprint

If you work in the field of education, you often hear the term “educational standards.” These refer to the curriculum blueprints for an educational system, which also translate into assessment blueprints, because you want to assess what is on the curriculum. Several important ones in the USA are noted here, perhaps the most common of which nowadays is the Common Core State Standards, which attempted to standardize the standards across states. These standards exist to standardize the educational system, by teaching what a group of experts have agreed upon should be taught in 6th grade Math classes for example. Note that they don’t state how or when a topic should be taught, merely that 6th Grade Math should cover Number Lines, Measurement Scales, Variables, whatever – sometime in the year.

Standard = Guideline

If you work in the field of professional certification, you hear the term just as often but in a different context: accreditation standards. The two most common sets come from the National Commission for Certifying Agencies (NCCA) and the ANSI National Accreditation Board (ANAB). These two organizations accredit credentialing bodies, giving a stamp of approval which states that a Certification or Certificate program is legit. Why? Because there is no law to stop me from buying a textbook on any topic, writing 50 test questions in my basement, and selling it as a Certification. It is completely a situation of caveat emptor, and these organizations are helping the buyers by giving a stamp of approval that the certification was developed with accepted practices like a Job Analysis, Standard Setting Study, etc.

In addition, there are the professional standards for our field. These are guidelines on assessment in general rather than just credentialing. Two great examples are the AERA/APA/NCME Standards for Educational and Psychological Testing and the International Test Commission’s Guidelines (yes they switch to that term) on various topics.

Also: Standardized = Equivalent Conditions

The word is also used quite frequently in the context of standardized testing, though it is rarely chopped to the root word “standard.” In this case, it refers to the fact that the test is given under equivalent conditions to provide greater fairness and validity. A standardized test does NOT mean multiple choice, bubble sheets, or any of the other pop connotations that are carried with it. It just means that we are standardizing the assessment and the administration process. Think of it as a scientific experiment; the basic premise of the scientific method is holding all variables constant except the variable in question, which in this case is the student’s ability. So we ensure that all students receive a psychometrically equivalent exam, with equivalent (as much as possible) writing utensils, scrap paper, computer, time limit, and all other practical surroundings. The problem comes with the lack of equivalence in access to study materials, prep coaching, education, and many bigger questions… but those are a societal issue and not a psychometric one.

So despite all the bashing that the term gets, a standardized test is MUCH better than the alternatives of no assessment at all, or an assessment that is not a level playing field and has low reliability. Consider the case of hiring employees: if assessments were not used to provide objective information on applicant skills and we could only use interviews (which are famously subjective and inaccurate), all hiring would be virtually random and the number of incompetent people in jobs would increase a hundredfold. And don’t we already have enough people in jobs where they don’t belong?

A standard setting study is a formal, quantitative process for establishing a performance standard on an exam, such as what score is “proficient” or “passing.”  This is typically manifested as a cutscore which is then used for making decisions about people: hire them, pass them, accept them into university, etc.  Because it is used for such important decisions, a lot of work goes into standard setting, using methods based on scientific research.

What is NOT standard setting?

In the assessment world, there are actually three uses of the word standard:

  1. A formal definition of the content that is being tested, such as the Common Core State Standards in the USA.  
  2. A formalized process for delivering exams, as seen in the phrase “standardized testing.”
  3. A benchmark for performance, like we are discussing here.

For this reason, I prefer the term cutscore study, but the phrase standard setting is used more often.  

How is a standard setting study used?

As part of a comprehensive test development cycle, after item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification.  This cannot be done arbitrarily, such as setting it at 70% because that’s what you saw when you were in school.  That is a legal landmine!  To be legally defensible and eligible for Accreditation of a Certification Program, it must be done using one of several standard-setting approaches from the psychometric literature.  So, if your organization is classifying examinees into Pass/Fail, Hire/NotHire, Basic/Proficient/Advanced, or any other groups, you most likely need a standard setting study.  This is NOT limited to certification, although it is often discussed in that pass/fail context.

What are some methods of a standard setting study?

There have been many methods suggested in the scientific literature of psychometrics.  They are often delineated into examinee-centered and item-centered approaches.  Angoff and Bookmark are designed around evaluating items, while Contrasting Groups and Borderline Groups are designed around evaluating the distributions of actual examinee scores.  The Bookmark approach is sort of both types, however, because it uses examinee performance on the items as the object of interest.

Angoff

[Figure: Modified Angoff analysis]

In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly.  If we take the average of all raters, this then translates into the average percentage-correct score that the raters expect from a minimally competent candidate – a very compelling argument for a cutscore to pass competent examinees!  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
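The arithmetic behind an Angoff cutscore is simple enough to sketch.  The example below is a minimal illustration, not a full study workflow: the ratings are invented, and the function name `angoff_cutscore` is hypothetical.  It averages the ratings per item, then averages across items to get the expected percentage-correct score of a minimally competent candidate.

```python
from statistics import mean

# Hypothetical ratings: each row is one SME, each column is one item.
# Each value is the estimated percentage of minimally competent
# candidates who would answer that item correctly.
ratings = [
    [70, 85, 60, 90, 55],  # rater A
    [65, 80, 70, 95, 50],  # rater B
    [75, 90, 65, 85, 60],  # rater C
]

def angoff_cutscore(ratings):
    """Return (percent-correct cutscore, raw-score cutscore)."""
    n_items = len(ratings[0])
    # Average across raters for each item.
    item_means = [mean(rater[i] for rater in ratings) for i in range(n_items)]
    # Average across items: expected percent correct for the
    # minimally competent candidate.
    percent_cut = mean(item_means)
    raw_cut = percent_cut / 100 * n_items
    return percent_cut, raw_cut

percent_cut, raw_cut = angoff_cutscore(ratings)
print(f"Cutscore: {percent_cut:.1f}% ({raw_cut:.2f} of {len(ratings[0])} items)")
```

In practice the panel usually discusses items where ratings diverge and may re-rate them before the final average is taken; that step is not shown here.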

Bookmark

The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be.  Obviously, this requires enough real data to calibrate item difficulty, usually using item response theory, which requires several hundred examinees.
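As a rough sketch of the mechanics, the example below builds an ordered item booklet from difficulty estimates and converts the panel's median bookmark into a cutscore on the ability scale.  Everything specific here is an assumption for illustration: the difficulty values are invented, the Rasch model is assumed, the bookmark is taken to be the page of the last item a minimally competent candidate should master, and a response probability of 0.67 is used as the mastery criterion.  Real studies differ on these conventions.

```python
import math
from statistics import median

# Hypothetical Rasch difficulty estimates (in logits) from a prior calibration.
item_difficulties = {"Q1": -1.2, "Q2": 0.4, "Q3": -0.3, "Q4": 1.1, "Q5": 0.0}

# Ordered item booklet: easiest item first.
booklet = sorted(item_difficulties, key=item_difficulties.get)

# Hypothetical bookmarks, one per panelist: the 1-based page of the last
# item that a minimally competent candidate should be expected to master.
bookmarks = [3, 3, 4, 2, 3]

def bookmark_cut_theta(booklet, difficulties, bookmarks, rp=0.67):
    """Ability at which the bookmarked item is answered correctly with
    probability rp under the Rasch model (rp=0.67 is an assumed convention)."""
    page = int(median(bookmarks))  # odd-sized panel assumed, so median is a whole page
    b = difficulties[booklet[page - 1]]
    # Rasch: P = 1 / (1 + exp(-(theta - b)))  =>  theta = b + ln(P / (1 - P))
    return b + math.log(rp / (1 - rp))

cut_theta = bookmark_cut_theta(booklet, item_difficulties, bookmarks)
print(booklet, round(cut_theta, 3))
```

The resulting ability value would then be mapped to a reported score through the test's scoring model, which is outside this sketch.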

Contrasting Groups

[Figure: Contrasting Groups cutscore]

With the contrasting groups approach, candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other independent, external standard.  We can then compare the score distributions on our exam for the two separate groups, and pick a cutscore that best differentiates Pass vs Fail on the other standard.  The accompanying figure shows an example.  If using data from another exam, a sample of at least 50 candidates is needed, since you are comparing two score distributions.
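One simple way to pick the cutscore that "best differentiates" the two groups is to count misclassifications at every candidate cutscore and keep the one with the fewest.  The sketch below does that with invented scores; the function name `contrasting_groups_cut` is hypothetical, and other decision rules (such as the point where smoothed distributions cross, or weighting false passes more heavily than false fails) are equally legitimate.

```python
# Hypothetical scores on our exam, grouped by the external pass/fail standard.
fail_group = [12, 14, 15, 15, 17, 18, 18, 21]
pass_group = [17, 19, 20, 22, 23, 24, 25, 27]

def contrasting_groups_cut(fail_scores, pass_scores):
    """Return the cutscore that misclassifies the fewest candidates.

    A candidate passes our exam if their score is >= the cutscore.
    Ties are resolved in favor of the lowest cutscore.
    """
    all_scores = fail_scores + pass_scores
    candidates = range(min(all_scores), max(all_scores) + 2)

    def errors(cut):
        false_pass = sum(s >= cut for s in fail_scores)  # should have failed
        false_fail = sum(s < cut for s in pass_scores)   # should have passed
        return false_pass + false_fail

    return min(candidates, key=errors)

cutscore = contrasting_groups_cut(fail_group, pass_group)
print("Cutscore:", cutscore)
```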

Borderline Group

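The Borderline Group method is similar to Contrasting Groups, but instead of separate Pass and Fail groups, a single group of borderline candidates is defined using alternative information such as biodata.  The exam scores of that group are then evaluated, typically by taking a central value such as the median as the cutscore.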
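A common way to evaluate the borderline group's scores is to take their median as the cutscore.  The sketch below assumes that rule and uses invented data; the judgments of who counts as "borderline" are taken as given.

```python
from statistics import median

# Hypothetical records: (exam score, judged borderline by external information?)
candidates = [
    (48, False), (61, True), (64, True), (66, True), (67, True),
    (69, True), (70, True), (72, True), (85, False), (91, False),
]

# Keep only the scores of candidates classified as borderline.
borderline_scores = [score for score, is_borderline in candidates if is_borderline]

# The median is less sensitive to a few extreme scores than the mean.
cutscore = median(borderline_scores)
print("Cutscore:", cutscore)
```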

Hofstee

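The Hofstee approach is often used as a reality check for the modified-Angoff method, but can be done on its own.  It involves only four estimates from each SME on the panel: the lowest and highest acceptable cutscore, and the lowest and highest acceptable failure rate.  These are then compared against actual examinee score data.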
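To make the Hofstee logic concrete, here is a minimal sketch.  The panel's averaged judgments define a line running from (lowest acceptable cutscore, highest acceptable failure rate) to (highest acceptable cutscore, lowest acceptable failure rate), and the cutscore is where the observed failure-rate curve comes closest to that line.  The scores and panel values are invented, and `hofstee_cut` is a hypothetical helper that only searches whole-number cutscores.

```python
# Hypothetical percent-correct scores from 20 examinees.
scores = [52, 55, 58, 60, 62, 63, 65, 66, 68, 70,
          71, 72, 74, 75, 77, 78, 80, 83, 85, 90]

def hofstee_cut(scores, k_min, k_max, f_min, f_max):
    """Return the integer cutscore in [k_min, k_max] where the observed
    failure rate is closest to the panel's compromise line."""
    n = len(scores)

    def gap(cut):
        # Observed failure rate if this cutscore were applied.
        observed = sum(s < cut for s in scores) / n
        # Compromise line: f_max at k_min, falling to f_min at k_max.
        line = f_max - (f_max - f_min) * (cut - k_min) / (k_max - k_min)
        return abs(observed - line)

    return min(range(k_min, k_max + 1), key=gap)

# Hypothetical panel averages: cutscore between 60 and 75,
# failure rate between 10% and 40%.
cutscore = hofstee_cut(scores, k_min=60, k_max=75, f_min=0.10, f_max=0.40)
print("Cutscore:", cutscore)
```

If the observed curve never comes near the line, the panel's judgments and the data disagree, which is itself useful information for the reality check.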

Ebel

The Ebel approach categorizes items by importance as well as difficulty.  It is one of the older methods and is rarely used today.
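For completeness, the Ebel calculation can be sketched as follows.  Items are sorted into importance-by-difficulty cells, the panel estimates the proportion of items in each cell that a borderline candidate would answer correctly, and the cutscore is the sum across cells.  The category labels, counts, and proportions below are all invented for illustration.

```python
# Hypothetical classification: (importance, difficulty) -> number of items.
item_counts = {
    ("essential", "easy"): 10,
    ("essential", "hard"): 5,
    ("important", "easy"): 8,
    ("important", "hard"): 7,
}

# Hypothetical panel judgments: expected proportion correct for a
# borderline candidate on items in each cell.
expected_correct = {
    ("essential", "easy"): 0.90,
    ("essential", "hard"): 0.70,
    ("important", "easy"): 0.75,
    ("important", "hard"): 0.50,
}

def ebel_cutscore(item_counts, expected_correct):
    """Return the raw cutscore: items per cell times expected proportion correct."""
    return sum(n * expected_correct[cell] for cell, n in item_counts.items())

raw_cut = ebel_cutscore(item_counts, expected_correct)
total_items = sum(item_counts.values())
print(f"Cutscore: {raw_cut:.1f} of {total_items} items")
```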

How to choose an approach?

There is often no single correct answer.  In fact, guidelines like those from NCCA do not lay out which method to use; they just tell you to use an appropriate method.

There are several considerations.  Perhaps the most important is whether you have existing data.  The Bookmark, Contrasting Groups, and Borderline Group approaches all assume that we have data from a test already delivered, which we can analyze with the perspective of the latent standard.  The Angoff and Hofstee approaches, in contrast, can be done before a test is ever delivered.  This is arguably less defensible, but is a huge practical advantage.

The choice also depends on whether you can easily recruit a panel of subject matter experts, as that is required for Angoff and Bookmark.  The Contrasting Groups method assumes we have a gold standard, which is rare.

How can I implement a standard setting study?

If your organization has an in-house psychometrician, they can usually do this.  If, for example, you are a board of experts in a profession but lack experience in psychometrics, you need to hire a firm.  We can perform such work for you – contact us to learn more.

 

Subject matter experts are an important part of the process in developing a defensible exam.  There are several ways that their input is required.  Here is a list from highest involvement/responsibility to lowest:

  1. Serving on the Certification Committee (if relevant) to decide important things like eligibility pathways
  2. Serving on panels for psychometric steps like Job Task Analysis or Standard Setting (Angoff)
  3. Writing and reviewing the test questions
  4. Answering the survey for the Job Task Analysis

Who are Subject Matter Experts?

A subject matter expert (SME) is someone with knowledge of the exam content.  If you are developing a certification exam for widgetmakers, you need a panel of expert widgetmakers, and sometimes other stakeholders like widget factory managers.

You also need test development staff and psychometricians.  Their job is to guide the process to meet international standards, and to make the most efficient use of the SMEs’ time.

Example: Item Writing Workshop

The most obvious usage of subject matter experts in exam development is item writing and review. Again, if you are making a certification exam for experienced widgetmakers, then only experienced widgetmakers know enough to write good items.  In some cases, supervisors do as well, but then they are also SMEs.  For example, I once worked on exams for ophthalmic technicians; some of the SMEs were ophthalmic technicians, but some of the SMEs (and much of the nonprofit board) were ophthalmologists, the medical doctors for whom the technicians worked.

An item writing workshop typically starts with training on item writing, including what makes a good item, terminology, and format.  Item writers will then author questions, sometimes alone and sometimes as a group or in pairs.  For higher stakes exams, all items will then be reviewed/edited by other SMEs.

Example: Job Task Analysis

A Job Task Analysis (JTA) study is a key step in the development of a defensible certification program.  It is the second step in the process, after the initial definition, and sets the stage for everything that comes afterward.  Moreover, if you seek to get your certification accredited by organizations such as NCCA or ANSI, you need to re-perform the job task analysis study periodically. JTAs are sometimes called job analysis, practice analysis, or role delineation studies.

The job task analysis study relies heavily on the experience of Subject Matter Experts (SMEs), just like cutscore studies. The SMEs are best placed to know how the profession is evolving and what is most important, which is essential both for the initial JTA and the periodic re-set of the exam. The frequency depends on how quickly your field is evolving, but a cycle of 5 years is often recommended.

The goal of the job task analysis study is to gain quantitative data on the structure of the profession.  Therefore, it typically utilizes a survey approach to gain data from as many professionals as possible.  This starts with a group of SMEs generating an initial list of on-the-job tasks, categorizing them, and then publishing a survey.  The end goal is a formal report with a blueprint of what knowledge, skills, and abilities (KSAs) are required for certification in a given role or field, and therefore what the specifications of the certification test should be.  The process typically follows these steps:

  • Observe— Typically the psychometrician (that’s us) shadows a representative sample of people who perform the job in question (chosen through Panel Composition) to observe and take notes. After the day(s) of observation, the SMEs sit down with the observer so that he or she may ask any clarifying questions.

    Questions are held until after the observation so that the observer has an untainted view of the job.  Alternatively, your SMEs can observe job incumbents – which is often the case when the SMEs are supervisors.

  • Generate— The SMEs now have a corpus of information on what is involved with the job, and generate a list of tasks that describe the most important job-related components. Not all job analysis uses tasks, but this is the most common approach in certification testing, hence you will often hear the term job task analysis as a general term.
  • Survey— Now that we have a list of tasks, we send a survey out to a larger group of SMEs and ask them to rate various features of each task.

    How important is the task? How often is it performed? What larger category of tasks does it fall into?

  • Analyze— Next, we crunch the data and quantitatively evaluate the SMEs’ subjective ratings to determine which of the tasks and categories are most important.

  • Review— As a non-SME, the psychometrician needs to take their findings back to the SME panel to review the recommendation and make sure it makes sense.

  • Report— We put together a comprehensive report that outlines what the most important tasks/categories are for the given job.  This in turn serves as the foundation for a test blueprint, because more important content deserves more weight on the test.

    This connection is one of the fundamental links in the validity argument for an assessment.
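The Analyze and Report steps above can be sketched in a few lines.  The example is a simplified illustration rather than a prescribed procedure: the tasks, categories, and ratings are invented, and multiplying mean importance by mean frequency is just one assumed way to combine the ratings into a task weight.  Category weights are then expressed as percentages of the test blueprint.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey results on a 1-5 scale:
# task -> (category, importance ratings, frequency ratings)
survey = {
    "Calibrate widget press": ("Setup", [5, 4, 5], [3, 4, 3]),
    "Inspect raw materials": ("Setup", [4, 4, 3], [5, 5, 4]),
    "Assemble widget": ("Production", [5, 5, 5], [5, 5, 5]),
    "Document defects": ("Quality", [3, 4, 3], [2, 3, 2]),
}

def blueprint_weights(survey):
    """Return category -> percentage of the test blueprint."""
    totals = defaultdict(float)
    for category, importance, frequency in survey.values():
        # Assumed weighting rule: mean importance x mean frequency.
        totals[category] += mean(importance) * mean(frequency)
    grand_total = sum(totals.values())
    return {cat: 100 * value / grand_total for cat, value in totals.items()}

weights = blueprint_weights(survey)
for category, pct in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {pct:.1f}% of items")
```

The SME panel would still review these percentages in the Review step before they become the official blueprint.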

Example: Cutscore studies

When the JTA is completed, we have to determine who should pass the assessment, and who should fail. This is most often done using the modified Angoff process, where the SMEs conceptualize a minimally competent candidate (MCC) and then set the pass/fail point so that the MCC would just barely pass.  There are other methods too, such as Bookmark or Contrasting Groups.

For newly launching certification programs, these processes go hand-in-hand with item writing and review. The use of evidence-based practices in conducting the job task analysis, designing the test, writing items, and setting a cutscore serves as the basis for a good certification program.  Moreover, if you are seeking to achieve accreditation – a third-party stamp of approval that your credential is high quality – documentation that you completed all these steps is required.

Performing these tasks with a trained psychometrician inherently checks a lot of boxes on the accreditation to-do list, which can position your organization well for the future. When it comes to accreditation, the psychometricians and measurement specialists at Assessment Systems have been around the block a time or two. We can walk you through the lengthy process of becoming accredited, or we can help you perform these tasks a la carte.