PARCC EBSR Items

The Partnership for Assessment of Readiness for College and Careers (PARCC) is a consortium of US States working together to develop educational assessments aligned with the Common Core State Standards.  This is a daunting task, and PARCC is doing an admirable job, especially with their focus on utilizing technology.  However, one of the new item types has a serious psychometric fault that deserves a caveat with regards to scoring and validation.

What is an Evidence-Based Selected-Response (EBSR) question?

The item type is an “Evidence-Based Selected-Response” (PARCC EBSR) item format, commonly called a Part A/B item or Two-Part item.  The goal of this format is to delve deeper into student understanding and award credit for deeper knowledge while minimizing the impact of guessing.  This is obviously an appropriate goal for assessment.  To do so, the item is presented to the student in two parts, where the first part asks a simple question and the second part asks for supporting evidence for the answer given in Part A.  Students must answer Part A correctly to receive credit on Part B.  As described on the PARCC website:

In order to receive full credit for this item, students must choose two supporting facts that support the adjective chosen for Part A. Unlike tests in the past, students may not guess on Part A and receive credit; they will only receive credit for the details they’ve chosen to support Part A.

How EBSR items are scored

While this makes sense in theory, it leads to problems in data analysis, especially when using Item Response Theory (IRT).  The dependent scoring rule obviously violates the fundamental assumption of IRT: local independence (items must not depend on each other).  So when working with a client of mine, we decided to combine the two parts into one multi-point question, which matches the theoretical approach PARCC EBSR items are taking.  The goal was to calibrate the item with Muraki’s Generalized Partial Credit Model (GPCM), which is the standard approach used to analyze polytomous items in K12 assessment (learn more here).  The GPCM tries to order students based on the points they earn: 0-point students tend to have the lowest ability, 1-point students moderate ability, and 2-point students the highest ability.  Should be obvious, right?  Nope.

The first thing we noticed was that some point levels had very small sample sizes.  Suppose that Part A is worth 1 point and Part B is worth 1 point (the student selects two pieces of evidence and must get both).  Most students will get 0 points or 2 points; not many will receive 1 point.  We thought about it and realized that the only way to earn 1 point is to get Part A correct, often by guessing, and then select no correct evidence or only one piece.  This leads to issues with the GPCM.
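To see how this scoring rule concentrates examinees at 0 and 2 points, here is a minimal simulation sketch in Python.  The proportions below (how many students truly know Part A, the guessing rate, and the Part B success rates) are entirely hypothetical, chosen only to illustrate the pattern.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical: 60% of examinees actually know the answer to Part A
knows = rng.random(n) < 0.60

# Part A: knowers answer correctly; the rest guess among 4 options (p = 0.25)
part_a = knows | (rng.random(n) < 0.25)

# Part B: both evidence pieces must be selected.  Hypothetical success rates:
# 95% for students who truly know Part A, 5% for those who merely guessed it.
part_b = rng.random(n) < np.where(knows, 0.95, 0.05)

# PARCC-style dependent scoring: no credit on Part B unless Part A is correct
score = part_a.astype(int) + (part_a & part_b).astype(int)

proportions = np.bincount(score, minlength=3) / n
print(dict(zip([0, 1, 2], proportions.round(3))))
# Roughly {0: 0.30, 1: 0.13, 2: 0.57} -- the 1-point group is the smallest,
# and most examinees in it only guessed Part A correctly.
```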

Using the Generalized Partial Credit Model

Even when there was sufficient N at each level, we found that the GPCM had terrible fit statistics, meaning that the item was not performing according to the model described above.  So I ran Iteman, our classical analysis software, to obtain quantile plots that approximate the polytomous IRFs without imposing the GPCM model.  I found that the 0-2 point items tend to have the issue that not many students earn 1 point, and moreover the curve for that category is relatively flat.  The GPCM assumes that it is relatively bell-shaped.  The GPCM therefore looks for the drop-offs in that bell shape, where it crosses the adjacent CRFs – the thresholds – and they simply are not there.  The GPCM would blow up, usually not even estimating the thresholds in the correct order.
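For reference, here is a brief sketch of the GPCM category probabilities, which shows the bell shape the model expects for the middle (1-point) category.  The discrimination and threshold values are arbitrary examples, not estimates from any real data.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities for a single GPCM item.

    theta : ability value
    a     : item discrimination
    b     : threshold (step) parameters, one per score point above zero
    """
    # Cumulative sums of a*(theta - b_v); category 0 contributes a sum of 0
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
    exp_steps = np.exp(steps)
    return exp_steps / exp_steps.sum()

# Arbitrary example: a 0-2 point item with properly ordered thresholds
a, b = 1.0, [-0.5, 0.5]
for theta in (-2, -1, 0, 1, 2):
    print(theta, np.round(gpcm_probs(theta, a, b), 3))
# With ordered thresholds, the probability of scoring exactly 1 rises and falls
# in a bell shape around theta = 0.  A flat, guessing-driven 1-point curve has
# no such peak, which is why the estimated thresholds often come out reversed.
```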

PARCC EBSR Graphs

So I tried to think of this from a test development perspective.  How do students get 1 point on these PARCC EBSR items?  The only way to do so is to get Part A right but not Part B.  Given that Part B asks for the evidence behind Part A, this group consists of students who answered Part A correctly but could not support it, which means they were likely guessing.  It is then no surprise that the data for 1-point students form a flat line – much like the c parameter in the 3PL.  So the GPCM will have an extremely tough time estimating threshold parameters.

Why EBSR items don’t work

From a psychometric perspective, point levels are supposed to represent different levels of ability.  A 1-point student should be of higher ability than a 0-point student on this item, and a 2-point student of higher ability than a 1-point student.  This seems obvious and intuitive.  But this item, by definition, violates the idea that a 1-point student should have higher ability than a 0-point student.  The only way to get 1 point is to guess the first part; these examinees do not actually know the answer, and are essentially no different from the 0-point examinees.  So of course the 1-point results look strange here.

In the end, the items were calibrated as two separate dichotomous items rather than one polytomous item, and the statistics turned out much better.  This still violates the local independence assumption, but it at least produces usable IRT parameters that can be used to score students.  Nevertheless, I think the scoring of these items needs to be revisited so that the algorithm produces data which can be calibrated in IRT.

The entire goal of test items is to provide data points used to measure students; if the Evidence-Based Selected-Response item type is not providing usable data, then it is not worth using, no matter how good it seems in theory!

Test Scaling

Scaling is a psychometric term regarding the establishment of a score metric for a test, and it often has two meanings. First, it refers to defining the method used to operationally score the test, establishing the underlying scale on which people are being measured.  A common example is the T-score, which transforms raw scores onto a standardized scale with a mean of 50 and a standard deviation of 10, making it easier to compare results across different populations or test forms.  Second, it refers to score conversions used for reporting scores, especially conversions that are designed to carry specific information.  The latter is typically called scaled scoring.

Examples of Scaling

You have all been exposed to this type of scaling, though you might not have realized it at the time. Most high-stakes tests like the ACT, SAT, GRE, and MCAT are reported on scales that are selected to convey certain information, with the actual numbers selected more or less arbitrarily. The SAT and GRE have historically had a nominal mean of 500 and a standard deviation of 100, while the ACT has a nominal mean of 18 and a standard deviation of 6. These are actually the same scale, because each is nothing more than a converted z-score (standard score); the conversion exists simply because no examinee wants to receive a score report saying they got a score of -1. The target numbers were arbitrarily selected, and the score range bounds were then set using the fact that nearly all of the population (about 99.7%) falls within plus or minus three standard deviations. Hence, the SAT and GRE range from 200 to 800 and the ACT ranges from 0 to 36. This leads to the urban legend of receiving 200 points for writing your name correctly on the SAT; again, it feels better for the examinee. A score of 300 might seem like a big number, and it is 100 points above the minimum, but it just means that someone is at roughly the 2nd percentile.

Now, notice that I said “nominal.” I said that because the tests do not actually have those means in observed samples, because the samples have substantial range restriction. Because these tests are only taken by students serious about proceeding to the next level of education, the actual sample is of higher ability than the population; the lower third or so of high school students usually do not bother with the SAT or ACT. As a result, a state might see an observed ACT mean of 21 and a standard deviation of 4. This is an important issue to consider in developing any test. Consider just how restricted the population of medical school students is; it is a very select group.

How can I select a score scale?


For various reasons, the actual observed scores from tests are often not reported; only converted scores are.  If multiple forms are being equated, scaling hides the fact that the forms differ in difficulty and, in many cases, differ in cutscore.  Scaled scores can also facilitate meaningful feedback, and they help the organization avoid explanations of IRT scoring, which can be a headache for some audiences.

When deciding on the conversion calculations, there are several important questions to consider.

First, do we want to be able to make fine distinctions among examinees? If so, the range should be sufficiently wide. My personal view is that the scale should be at least as wide as the number of items; otherwise you are voluntarily giving up information. This in turn means you are giving up variance, which makes it more difficult to correlate your scaled scores with other variables, in the way that the MCAT is correlated with success in medical school. This, of course, means that you are hampering future research – unless that research is able to revert back to the actual observed scores to make sure all possible information is used. For example, suppose a test with 100 items is reported on a 5-point grade scale of A-B-C-D-F. That scale is quite restricted, and therefore difficult to correlate with other variables in research. But you have the option of reporting the grades to students and still using the original scores (0 to 100) for your research.

Along the same lines, we can swing completely in the other direction. For many tests, the purpose of the test is not to make fine distinctions, but only to broadly categorize examinees. The most common example of this is a mastery test, where the examinee is being assessed on their mastery of a certain subject, and the only possible scores are pass and fail. Licensure and certification examinations are an example. An extension of this is the “proficiency categories” used in K-12 testing, where students are classified into four groups: Below Basic, Basic, Proficient, and Advanced. This is used in the National Assessment of Educational Progress. Again, we see the care taken for reporting of low scores; instead of receiving a classification like “nonmastery” or “fail,” the failures are given the more palatable “Below Basic.”

Another issue to consider, which is very important in some settings but irrelevant in others, is vertical scaling. This refers to the chaining of scales across tests that are at quite different levels. In education, this might involve linking the scales of exams in 8th grade, 10th grade, and 12th grade (graduation), so that student progress can be accurately tracked over time. Obviously, this is of great use in longitudinal educational research, such as following students through the medical school pipeline. But for a test that awards a certification in a medical specialty, it is not relevant, because that is really a one-time deal.

Lastly, there are three common calculation options: a pure linear conversion (ScaledScore = RawScore * Slope + Intercept), a standardized conversion (old mean/SD mapped to a new mean/SD), and nonlinear approaches such as equipercentile equating.
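As a rough sketch of the first two options, the functions below apply a linear conversion and a standardized conversion; the slope, intercept, and target mean/SD values are placeholders, not recommendations.

```python
def linear_scale(raw, slope, intercept):
    """Pure linear conversion: ScaledScore = RawScore * Slope + Intercept."""
    return raw * slope + intercept

def standardized_scale(raw, old_mean, old_sd, new_mean, new_sd):
    """Standardized conversion: map the old mean/SD onto a new mean/SD."""
    z = (raw - old_mean) / old_sd
    return z * new_sd + new_mean

# Placeholder example: a raw score of 72 on a 100-item test
print(linear_scale(72, slope=6, intercept=200))                # 632
print(standardized_scale(72, old_mean=70, old_sd=10,
                         new_mean=500, new_sd=100))            # 520.0
```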

Perhaps the most important issue is whether the scores from the test will be criterion-referenced or norm-referenced. Often, this choice will be made for you because it distinctly represents the purpose of your tests. However, it is quite important and usually misunderstood, so I will discuss this in detail.

Criterion-Referenced vs. Norm-Referenced


This is a distinction between the ways test scores are used or interpreted. A criterion-referenced score interpretation means that the score is interpreted with regards to defined content, blueprint, or curriculum (the criterion), and ignores how other examinees perform (Bond, 1996). A classroom assessment is the most common example; students are scored on the percent of items correct, which is taken to imply the percent of the content they have mastered. Conversely, a norm-referenced score interpretation is one where the score provides information about the examinee’s standing in the population, but no absolute (or ostensibly absolute) information regarding their mastery of content. This is often the case with non-educational measurements like personality or psychopathology. There is no defined content which we can use as a basis for some sort of absolute interpretation. Instead, scores are often either z-scores or some linear function of z-scores.  IQ is historically scaled with a mean of 100 and standard deviation of 15.

It is important to note that this dichotomy is not a characteristic of the test, but of the test score interpretations. This fact is more apparent when you consider that a single test or test score can have several interpretations, some of which are criterion-referenced and some of which are norm-referenced. We will discuss this deeper when we reach the topic of validity, but consider the following example. A high school graduation exam is designed to be a comprehensive summative assessment of a secondary education. It is therefore specifically designed to cover the curriculum used in schools, and scores are interpreted within that criterion-referenced context. Yet scores from this test could also be used for making acceptance decisions at universities, where scores are only interpreted with respect to their percentile (e.g., accept the top 40%). The scores might even do a fairly decent job at this norm-referenced application. However, this is not what they are designed for, and such score interpretations should be made with caution.

Another important note is the definition of “criterion.” Because most tests with criterion-referenced scores are educational and involve a cutscore, a common misunderstanding is that the cutscore is the criterion. It is still the underlying content or curriculum that is the criterion, because we can have this type of score interpretation without a cutscore. Regardless of whether there is a cutscore for pass/fail, a score on a classroom assessment is still interpreted with regards to mastery of the content.  To further add to the confusion, Industrial/Organizational psychology refers to outcome variables as the criterion; for a pre-employment test, the criterion is typically Job Performance at a later time.

This dichotomy also leads to some interesting thoughts about the nature of your construct. If you have a criterion-referenced score, you are assuming that the construct is concrete enough that anybody can make interpretations regarding it, such as mastering a certain percentage of content. This is why non-concrete constructs like personality tend to be only norm-referenced. There is no agreed-upon blueprint of personality.

Multidimensional Scaling


An advanced topic worth mentioning is multidimensional scaling (see Davison, 1998). The purpose of multidimensional scaling is similar to factor analysis (a later discussion!) in that it is designed to evaluate the underlying structure of constructs and how they are represented in items. This is therefore useful if you are working with constructs that are brand new, so that little is known about them, and you think they might be multidimensional. This is a pretty small percentage of the tests out there in the world; I encountered the topic in my first year of graduate school – only because I was in a Psychological Scaling course – and have not encountered it since.

Summary of test scaling

Scaling is the process of defining the scale on which your measurements will take place. It raises fundamental questions about the nature of the construct. Fortunately, in many cases we are dealing with a simple construct that has well-defined content, like an anatomy course for first-year medical students. Because it is so well-defined, we often take criterion-referenced score interpretations at face value. But as constructs become more complex, like job performance of a first-year resident, it becomes harder to define the scale, and we start to deal more in relatives than absolutes. At the other end of the spectrum are completely ephemeral constructs where researchers still can’t agree on the nature of the construct, and we are pretty much limited to z-scores. Intelligence is a good example of this.

Some sources attempt to treat the scaling of people and the scaling of items or stimuli as separate things, but this is really impossible because they are so confounded: people define item statistics (the percent of people who get an item correct) and items define person scores (the percent of items a person gets correct). It is for this reason that item response theory, the most advanced paradigm in measurement theory, was designed to place items and people on the same scale. It is also why item writers should consider how items will be scored and how they will contribute to person scores. But because we start writing items long before the test is administered, and the nature of the construct is caught up in the scale, the issues presented here need to be addressed at the very beginning of the test development cycle.

Certification Exam Development and Delivery

Certification exams are a critical component of workforce development for many professions and play a significant role in the global Testing, Inspection, and Certification (TIC) market, which was valued at approximately $359.35 billion in 2022 and is projected to grow at a compound annual growth rate (CAGR) of 4.0% from 2023 to 2030. As such, a lot of effort goes into exam development and delivery, working to ensure that the exams are valid and fair, then delivered securely yet with enough convenience to reach the target market. If you work for a certification organization or awarding body, this article provides a guidebook to that process and how to select a vendor.

Certification Exam Development

Certification exam development is a well-defined process governed by accreditation guidelines such as NCCA’s, requiring steps such as a job task analysis and a standard-setting study.  For certification, and for other credentialing such as licensure or certificates, this process is incredibly important to establishing validity.  Such exams serve as gatekeepers into many professions, often after people have invested a ton of money and years of their lives in preparation.  Therefore, it is critical that the tests be developed well and have the necessary supporting documentation to show that they are defensible.

So what exactly goes into developing a quality exam, sound psychometrics, and the validity documentation, perhaps enough to achieve NCCA accreditation for your certification? Well, there is a well-defined and recognized process for certification exam development, though it is rarely exactly the same for every organization.  In general, the accreditation guidelines say you need to address these things, but leave the specific approach up to you.  For example, you have to do a cutscore study, but you are allowed to choose the method, such as Bookmark vs. Angoff.

Job Analysis / Practice Analysis

A job analysis study provides the vehicle for defining the important job knowledge, skills, and abilities (KSA) that will later be translated into content on a certification exam. During a job analysis, important job KSAs are obtained by directly analyzing job performance of highly competent job incumbents or surveying subject-matter experts regarding important aspects of successful job performance. The job analysis generally serves as a fundamental source of evidence supporting the validity of scores for certification exams.

Test Specifications and Blueprints

The results of the job analysis study are quantitatively converted into a blueprint for the certification exam.  Basically, it comes down to this: if the experts say that a certain topic or skill is done quite often or is very critical, then it deserves more weight on the exam, right?  There are different ways to do this.  My favorite article on the topic is Raymond and Neustel (2006).  Here’s a free tool to help.


Item Development

After important job KSAs are established, subject-matter experts write test items to assess them. The end result is an item bank from which exam forms can be constructed. The quality of the item bank also supports test validity.  A key operational step is developing an Item Writing Guide and holding an item writing workshop for the SMEs.

Pilot Testing

There should be evidence that each item in the bank actually measures the content that it is supposed to measure; in order to assess this, data must be gathered from samples of test-takers. After items are written, they are generally pilot tested by administering them to a sample of examinees in a low-stakes context—one in which examinees’ responses to the test items do not factor into any decisions regarding competency. After pilot test data is obtained, a psychometric analysis of the test and test items can be performed. This analysis will yield statistics that indicate the degree to which the items measure the intended test content. Items that appear to be weak indicators of the test content generally are removed from the item bank or flagged for item review so they can be reviewed by subject matter experts for correctness and clarity.

Note that this is not always possible, and is one of the ways that different organizations diverge in how they approach exam development.

Standard Setting

Standard setting is also a critical source of evidence supporting the validity of professional credentialing exam (i.e., pass/fail) decisions made based on test scores.  Standard setting is a process by which a passing score (or cutscore) is established; this is the point on the score scale that differentiates between examinees who are and are not deemed competent to perform the job. In order to be valid, the cutscore cannot be arbitrarily defined. Two examples of arbitrary methods are the quota (setting the cutscore to produce a certain percentage of passing scores) and the flat cutscore (such as 70% on all tests). Both of these approaches ignore the content and difficulty of the test.  Avoid these!

Instead, the cutscore must be based on one of several well-researched criterion-referenced methods from the psychometric literature.  There are two types of criterion-referenced standard-setting procedures (Cizek, 2006): examinee-centered and test-centered.

The Contrasting Groups method is one example of a defensible examinee-centered standard-setting approach. This method compares the scores of candidates previously defined as Pass or Fail. Obviously, this has the drawback that a separate method already exists for classification. Moreover, examinee-centered approaches such as this require data from examinees, but many testing programs wish to set the cutscore before publishing the test and delivering it to any examinees. Therefore, test-centered methods are more commonly used in credentialing.

The most frequently used test-centered method is the Modified Angoff Method (Angoff, 1971) which requires a committee of subject matter experts (SMEs).  Another commonly used approach is the Bookmark Method.
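As a simplified sketch of the computation at the heart of a Modified Angoff study: each SME rates, for every item, the probability that a minimally competent candidate would answer it correctly, and the aggregated ratings yield a recommended cutscore.  The ratings below are made up for illustration; a real study involves more SMEs, more items, discussion rounds, and often impact data.

```python
import numpy as np

# Made-up Angoff ratings: rows = SMEs, columns = items.
# Each value is the judged probability that a minimally competent
# candidate answers that item correctly.
ratings = np.array([
    [0.70, 0.55, 0.90, 0.60, 0.80],
    [0.65, 0.60, 0.85, 0.55, 0.75],
    [0.75, 0.50, 0.95, 0.65, 0.85],
])

item_means = ratings.mean(axis=0)   # average rating per item
raw_cutscore = item_means.sum()     # expected raw score of a borderline candidate
percent_cut = raw_cutscore / ratings.shape[1]

print("Item means:", item_means.round(2))
print(f"Recommended cutscore: {raw_cutscore:.2f} of {ratings.shape[1]} items ({percent_cut:.0%})")
```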

Equating

If the test has more than one form – which is required by NCCA Standards and other guidelines – they must be statistically equated.  If you use classical test theory, there are methods like Tucker or Levine.  If you use item response theory, you can either bake the equating into the item calibration process with software like Xcalibre, or use conversion methods like Stocking & Lord.

What does this process do?  Well, if this year’s certification exam had an average 3 points higher than last year’s, how do you know whether this year’s form was 3 points easier, this year’s cohort was 3 points smarter, or some mixture of both?  Learn more here.
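As an illustration of the idea, here is a minimal sketch of a linear mean-sigma transformation applied to IRT difficulty parameters for anchor items that appear on both forms.  This is a simpler relative of methods like Stocking & Lord, and the parameter values are invented for the example.

```python
import numpy as np

# Invented IRT difficulty (b) parameters for anchor items that appear on both
# last year's form (old scale) and this year's form (new scale).
b_old = np.array([-1.20, -0.40, 0.10, 0.75, 1.30])
b_new = np.array([-1.05, -0.30, 0.25, 0.90, 1.50])

# Mean-sigma method: find the linear transformation from the new scale to the old
A = b_old.std(ddof=1) / b_new.std(ddof=1)
B = b_old.mean() - A * b_new.mean()

# Applying this transformation to this year's item parameters (and thetas)
# places them on last year's scale, so scores are directly comparable.
b_new_equated = A * b_new + B
print(f"A = {A:.3f}, B = {B:.3f}")
print("Equated difficulties:", b_new_equated.round(3))
```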

Psychometric Analysis & Reporting

This part is an absolutely critical step in the exam development cycle for professional credentialing.  You need to statistically analyze the results to flag any items that are not performing well, so you can replace or modify them.  This looks at statistics like item p-value (difficulty), item point biserial (discrimination), option/distractor analysis, and differential item functioning.  You should also look at overall test reliability/precision and other psychometric indices.  If you are accredited, you need to perform year-end reports and submit them to the governing body.  Learn more about item and test analysis.
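For flavor, here is a minimal sketch of two of the statistics mentioned above, the item p-value and the point-biserial, computed on a tiny made-up matrix of scored responses.  Operational software such as Iteman computes these along with many other indices and corrections.

```python
import numpy as np

# Made-up scored responses: rows = examinees, columns = items (1 = correct)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

total = responses.sum(axis=1)

# p-value: proportion of examinees answering each item correctly (difficulty)
p_values = responses.mean(axis=0)

# Point-biserial: correlation of each item with the total score (discrimination)
point_biserials = np.array([
    np.corrcoef(responses[:, j], total)[0, 1] for j in range(responses.shape[1])
])

print("p-values:       ", p_values.round(2))
print("point-biserials:", point_biserials.round(2))
# Items with very low p-values or near-zero/negative point-biserials get flagged
```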

Exam Development: It’s a Vicious Cycle

Now, consider the big picture: in many cases, an exam is not a one-and-done thing.  It is re-used, perhaps continually.  Often there are new versions released, perhaps based on updated blueprints or simply to swap out questions so that they don’t get overexposed.  That’s why this is better conceptualized as an exam development cycle, like the circle shown above.  Often some steps like Job Analysis are only done once every 5 years, while the rotation of item development, piloting, equating, and psychometric reporting might happen with each exam window (perhaps you do exams in December and May each year).

ASC has extensive expertise in managing this cycle for professional credentialing exams, as well as many other types of assessments.  Get in touch with us to talk to one of our psychometricians.

Certification Exam Delivery & Administration

Certification exam administration and proctoring is a crucial component of the professional credentialing process.  Certification exams are expensive to develop well, so an organization wants to protect that investment by delivering the exam with appropriate security so that items are not stolen.  Moreover, there is an obvious incentive for candidates to cheat.  So, a certification body needs appropriate processes in place to deliver the certification exams.  Here are some tips.

1. Determine the best approach for certification exam administration and proctoring

Here are a few of the considerations to take into account.  These can be crossed with each other, such as delivering paper exams at Events vs. Test Centers.

Timing: Cohorts/Windows vs Continuous Availability

Do you have cohorts, where testing events make more sense, or do you need continuous availability?  For example, if the test is tied to university training programs that graduate candidates in December and May each year, that affects your delivery needs.  Alternatively, some certifications are not tied to such training; you might only have to show work experience.  In those cases, candidates are ready to take the test continuously throughout the year.

Mode: Paper vs Computer

Does it make more sense to deliver the test on paper or on computer?  This used to be a cost issue, but now the cost of computerized delivery, especially with online proctoring at home, has dropped significantly while saving so much time for candidates.  Also, some exam types like clinical simulations can only be delivered on computers.

Location: Test centers vs Online proctored vs Events vs Multi-Modal

Some types of tests require events, such as a clinical assessment in an actual clinic with standardized patients.  Some tests can be taken anywhere.  Exam events can also coincide with other events; perhaps you have online delivery through the year but deliver a paper version of the test at your annual conference, for convenience.

Do you have an easy way to make your own locations, if you are considering that?  One example is that you have quarterly regional conferences for your profession, where you could simply get a side room to deliver your test to candidates since they will already be there.  Another is that most of your candidates are coming from training programs at universities, and you are able to use classrooms at those universities.


Geography: State, National, or International

If your exam is for a small US state or a small country, it might be easy to require exams in a test center, because you can easily set up only one or two test centers to cover the geography.  Some certifications are international, and need to deliver on-demand throughout the year; those are a great fit for online.

Security: Low vs High

If your test has extremely high stakes, there is extremely high incentive to cheat.  An entry-level certification on WordPress is different than a medical licensure exam.  The latter is a better fit for test centers, while the former might be fine with online proctoring on-demand.

Online proctoring: AI vs Recorded vs Live

If you choose to explore this approach, here are three main types to evaluate.

A. AI only: AI-only proctoring means that there are no humans.  The examinee is recorded on video, and AI algorithms flag potential issues, such as if they leave their seat, then notify an administrator (usually a professor) of students with a high number of flags.  This approach is usually not relevant for certifications or other credentialing exams; it is more for low-stakes exams like a Psychology 101 midterm at your local university.  The vendors for this approach are interested in large-scale projects, such as proctoring all midterms and finals at a university, perhaps hundreds of thousands of exams per year.

B. Record and Review: Record-and-review proctoring means that the examinee is recorded on video, and that video is later watched by a real human, who flags it if they see cheating, theft, or other issues.  This is much higher quality, and higher price, but it has one major flaw that might concern certification tests: if someone steals your test by taking pictures, you won’t find out until tomorrow.  But at least you know who it was and you are certain of what happened, with video proof.  Perhaps useful for microcredentials or recertification exams.

C. Live Online Proctoring: Live online proctoring (LOP), or what I call “live human proctoring” (because some AI proctoring is also “live” in real time!), means that there is a professional human proctor on the other side of the video from the examinee.  They check the examinee in, confirm their identity, scan the room, provide instructions, and actually watch them take the test.  Some providers like MonitorEDU even have the examinee stream a second video from their phone, which is placed on a bookshelf or similar spot so the entire room is visible throughout the test.  Certainly, this approach is a very good fit for certification exams and other credentialing.  You protect the test content as well as the validity of that individual’s score; that is not possible with the other two approaches.

We have also prepared a list of the best online proctoring software platforms.

2. Determine other technology, psychometric, and operational needs

Next, your organization should establish any other needs for your exams that could impact the vendor selection.

  1. Do you require special item types, such that the delivery platform needs to support or integrate with them?
  2. Do you have simulations or OSCEs?
  3. Do you have specific needs around accessibility and accommodations for your candidates?
  4. Do you need adaptive testing or linear on the fly testing?
  5. Do you need extensive Psychometric consulting services?
  6. Do you need an integrated registration and payment portal?  Or a certification management system to track expirations and other important information?

Write all these up so that you can use the list to shop for a provider.

3. Find a provider – or several!


While it might seem easier to find a single provider for everything, that’s often not the best solution.  Look for those vendors that specifically fit your needs.

For example, most providers of remote proctoring are just that: remote proctoring.  They do not have a professional platform to manage item banks, schedule examinees, deliver tests, create custom score reports, and analyze psychometrics.  Some do not even integrate with such platforms, and only integrate with learning management systems like Moodle, seeing as their entire target market is only low-stakes university exams.  So if you are seeking a vendor for certification testing or other credentialing, the list of potential vendors is smaller.

Likewise, there are some vendors that only do exam development and psychometrics, but lack a software platform and proctoring services for delivery.  In these cases, they might have very specific expertise, and often have lower costs due to lower overhead.  An example is JML Testing Services.

Once you have some idea what you are looking for, start shopping for vendors that provide services for certification exam delivery, development, and scoring.  In some cases, you might not settle on a certain approach right away, and that’s OK.  See what is out there and compare prices.  Perhaps the cost of Live Remote Proctoring is more affordable than you anticipated, and you can upgrade to that.

Besides a simple Google search, some good places to start are the member listings of the Association of Test Publishers and the Institute for Credentialing Excellence.

4. Establish the new process with policies and documentation

Once you have finalized your vendors, you need to write policies and documentation around them.  For example, if your vendor has a certain login page for proctoring (we have ascproctor.com), you should take relevant screenshots and write up a walkthrough so candidates know what to expect.  Much of this should go into your Candidate Handbook.  Some of the things to cover that are specific to exam day for the candidates:

  • How to prepare for the exam
  • How to take a practice test
  • What is allowed during the exam
  • What is not allowed
  • ID needed and the check-in process
  • Details on specific locations (if using locations)
  • Rules for accessibility and accommodations
  • Time limits and other practical considerations in the exam

Next, consider all the things that are impacted other than exam day.

  • Eligibility pathways and applications
  • Registration and scheduling
  • Candidate training and practice tests
  • Reporting: just to the candidates, or perhaps to training programs as well?
  • Accounting and other operations: consider your business needs, such as how you manage money, monthly accounting reports, etc.
  • Test security plan: what do you do if someone is caught taking pictures of the exam with their phone, or if other potential incidents occur?

5. Let Everyone Know

Once you have written up everything, make sure all the relevant stakeholders know.  Publish the new Candidate Handbook and announce to the world.  Send emails to all upcoming candidates with instructions and an opportunity for a practice exam.  Put a link on your homepage.  Get in touch with all the training programs or universities in your field.  Make sure that everyone has ample opportunity to know about the new process!

6. Roll Out

Finally, of course, you can implement the new approach to certification exam delivery.  You might launch a new certification exam from scratch, or perhaps you are moving one from paper to online with remote proctoring, or some other change.  Either way, you need a date to start using it and a change management process.  The good news is that, even though it’s probably a lot of work to get here, the new approach is probably going to save you time and money in the long run.  Roll it out!

Also, remember that this is not a single point in time.  You’ll need to keep things updated into the future.  You should also consider implementing audits or quality control as a way to drive improvement.

 

Ready to start?


Certification exam delivery is the process of administering a certification test to candidates.  This might seem straightforward, but it is surprisingly complex.  The greater the scale and the stakes, the more potential threats and pitfalls.  Assessment Systems Corporation is one of the world leaders in the development and delivery of certification exams.  Contact us to get a free account in our platform and experience the examinee process, or to receive a demonstration from one of our experts.

 

 

One of my favorite quotes is from Mark Twain: “There is no such thing as a new idea. It is impossible. We simply take a lot of old ideas and put them into a sort of mental kaleidoscope.”  How can we construct a better innovation kaleidoscope for assessment?

We all attend conferences to get ideas from our colleagues in the assessment community on how to manage challenges. But ideas from across industries have been the source for some of the most radical innovations. Did you know that the inspiration for fast food drive-throughs was race car pit stops? Or that the idea for wine packaging came from egg cartons?

Most of the assessment conferences we have attended recently have been filled with sessions about artificial intelligence. AI is one of the most exciting developments to come along in our industry – as well as in other industries – in a long time. But many small- or moderate-sized organizations may feel it is out of reach for their organizations. Or they may be reluctant to adopt it for security or other concerns.

There are other worthwhile ideas that can be borrowed from other industries and adapted for use by small and moderate-sized assessment organizations. For instance, concepts from product development, design thinking, and lean manufacturing can be beneficial to assessment processes.

Agile Software Development

Many organizations use agile product methodologies for software development. While strict adherence to an agile methodology may not be appropriate for item development activities, there are pieces of the agile philosophy that might be helpful for item development processes. For instance, in the agile methodology, user stories are used to describe the end goal of a software feature from the standpoint of a customer or end user. In the same way, the user story concept could be used to delineate the construct an item is intended to measure, the requirements it must meet, or how it is intended to be scored. This can help ensure that everyone involved in test development has a clear understanding of the measurement intent of the item from the outset.

Another feature of agile development is the use of acceptance criteria. Acceptance criteria are predefined standards used to determine if user stories have been completed. In item development processes, acceptance criteria can be developed to set and communicate common standards to all involved in the item authoring process.

Agile development also uses a tool known as a Kanban Board to manage the process of software development by assigning tasks and moving development requests through various stages such as new, awaiting specs, in development, in QA, and user review. This approach can be applied to the management of item development in assessment, as you see here from our Assess.ai platform.

Design Thinking and Innovation

Design thinking is a human-centered approach to innovation. At its core is empathy for customers and users. A key design thinking tool is the journey map, which is a visual representation of a process that individuals (e.g., customers or users) go through to achieve a goal. The purpose of creating a journey map is to identify pain points in the user experience and create better user experiences. Journey maps could potentially be used by assessment organizations to diagram the volunteer SME experience and identify potential improvements. Likewise, it could be used in the candidate application and registration process.

Lean Manufacturing

Lean manufacturing is a methodology aimed at reducing production times. A key technique within the lean methodology is value stream mapping (VSM). VSM is a way of visualizing both the flow of information and materials through a process as a means of identifying waste. Admittedly, I do not know a great deal about the intricacies of the technique, but it is most helpful to understand the underlying philosophy and intentions:

• To develop a mutual understanding between all stakeholders involved in the process;
• To eliminate process steps and tasks which do not add value to the process but may contribute to user frustration and to error.

The big question for innovation: Why?

A key question to ask when examining a process is ‘why.’ So often we carry processes forward year after year, unchanged, because ‘it’s the way we’ve always done them,’ without ever questioning why – for so long that we have forgotten what the original answer to the question was. ‘Why’ is an immensely powerful and helpful question.

In addition to asking the ‘why’ question, a takeaway from value stream mapping and from journey mapping is visual representation. Being able to diagram or display a process is a fantastic way to develop a mutual understanding among all stakeholders involved in the process. We so often concentrate on pursuing shiny new tools like AI that we neglect potential efficiencies in the underlying processes. Visually displaying processes can be extremely helpful in process improvement.

T Scores

A T Score is a conversion of scores on a test to a standardized scale with a mean of 50 and standard deviation of 10.  This is a common example of a scaled score in psychometrics and assessment.  A scaled score is simply a way to present scores in a more meaningful and easier-to-digest context, especially across different types of assessment.  Therefore, a T Score is a standardized way that scores are presented to make them easier to understand.

Details and examples are below.  If you would like to explore the concept on your own, here’s a free tool in Excel that you can download!

What is a T Score?

A T score is a conversion of the standard normal distribution, aka Bell Curve or z-score.  The normal distribution places observations (of anything, not just test scores) on a scale that has a mean of 0.00 and a standard deviation of 1.00.  We simply convert this to have a mean of 50 and standard deviation of 10.  Doing so has two immediate benefits to most consumers:

  1. There are no negative scores; people generally do not like to receive a negative score!
  2. Scores are round numbers that generally range from 0 to 100, depending on whether 3, 4, or 5 standard deviations is the bound (usually 20 to 80); this somewhat fits with what most people expect from their school days, even though the numbers are entirely different.

The image below shows the normal distribution, labeled with the different scales for interpretation.

T score vs z score vs percentile

How do I calculate a T score?

Use this formula:

T = z*10 + 50

where z is the standard z-score on the normal distribution N(0,1).

Example of a T score

Suppose you have a z-score of -0.5.  If you put that into the formula, you get T = -0.5*10 + 50 = -5 + 50 = 45.  If you look at the graphic above, you can see how being half a standard deviation below the mean translates to a T score of 45.
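In practice the z-score usually comes from published norms or an IRT theta estimate; as a tiny sketch, the function below simply standardizes a set of raw scores against their own mean and SD and then rescales them, which is one common way to produce T scores.  The raw scores are made up.

```python
import numpy as np

def t_scores(raw_scores):
    """Convert raw scores to T scores: standardize, then rescale to mean 50, SD 10."""
    raw = np.asarray(raw_scores, dtype=float)
    z = (raw - raw.mean()) / raw.std()
    return 10 * z + 50

# Made-up raw scores on a 40-item test
raw = [22, 28, 31, 25, 36, 19, 27, 30]
print(np.round(t_scores(raw), 1))
```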

How to interpret a T score?

As you can see above, a T Score of 40 means that you are at approximately the 16th percentile.  This is a low score, obviously, but a student will feel better receiving a 40 than receiving a score of -1.  It is for the same reason that many educational assessments use other scaled scores.  The SAT has a scale with a mean of 500 and SD of 100 (the T score times 10), so if you receive a score of 400, it again means that you are at z = -1, or the 16th percentile.

A T Score of 70 means that you are at approximately the 98th percentile – quite high, though students who are used to receiving scores in the 90s may feel like it is low!

Since there is a 1-to-1 mapping of the T Score to the other scales, it does not actually provide any new information.  It is simply a conversion to round, positive numbers that are easier to digest and less likely to upset someone who is unfamiliar with psychometrics.  My undergraduate professor who introduced me to psychometrics used the term “repackaging” to describe scaled scores: if you take an object out of one box and put it in a different box, it looks different superficially, but the object itself and its meaning (e.g., its weight) have not changed.

Is a T Score like a t-test?

No.  Couldn’t be more unrelated.  Nothing like the t-test.

How do I implement with an assessment?

If you are using off-the-shelf psychological assessments, they will likely produce a T Score for you in the results.  If you want to utilize it for your own assessments, you need a world-class assessment platform like  FastTest  that has strong functionality for scoring methods and scaled scoring.  An example of this is below.  Here, we are utilizing item response theory for the raw score.

As with all scaled scoring, it is a good idea to provide an explanation to your examinees and stakeholders.

Scaled scores in FastTest

Item Writing Tips

Item writing (aka item authoring) is a science as well as an art, and if you have done it, you know just how challenging it can be!  You are experts at what you do, and you want to make sure that your examinees are too.  But it’s hard to write questions that are clear, reliable, unbiased, and differentiate on the thing you are trying to assess.  Here are some tips.

What is Item Writing / Item Authoring?

Item authoring is the process of creating test questions.  You have certainly seen “bad” test questions in your life, and know firsthand just how frustrating and confusing that can be.  Fortunately, there is a lot of research in the field of psychometrics on how to write good questions, and also how to have other experts review them to ensure quality.  It is best practice to make items go through a workflow, so that the test development process is similar to the software development process.

Because items are the building blocks of tests, it is likely that the test items within your tests are the greatest threat to its overall validity and reliability.  Here are some important tips in item authoring.  Want deeper guidance?  Check out our Item Writing Guide.

Anatomy of an Item

First, let’s talk a little bit about the parts of a test question.  The diagram on the right shows a reading passage with two questions on it.  Here are some of the terms used:

  • Asset/Stimulus: This is a reading passage here, but could also be an audio, video, table, PDF, or other resource
  • Item: An overall test question, usually called an “item” rather than a “question” because sometimes they might be statements.
  • Stem: The part of the item that presents the situation or poses a question.
  • Options: All of the choices to answer.
  • Key: The correct answer.
  • Distractors: The incorrect answers.

Parts of a test item

Item writing tips: The Stem

To find out whether your test items are your allies or your enemies, read through your test and identify the items that contain the most prevalent item construction flaws.  The first three of the most prevalent construction flaws are located in the item stem (i.e. question).  Look to see if your item stems contain…

1) BIAS

Nowadays, we tend to think of bias as relating to culture or religion, but there are many more subtle types of biases that oftentimes sneak into your tests.  Consider the following questions to determine the extent of bias in your tests:

  • Are there acronyms in your test that are not considered industry standard?
  • Are you testing on policies and procedures that may vary from one location to another?
  • Are you using vocabulary that is more recognizable to a female examinee than a male?
  • Are you referencing objects that are not familiar to examinees from a newer or older generation?

2) NOT

We’ve all taken tests which ask a negatively worded question. These test items are often the product of item authoring by newbies, but they are devastating to the validity and reliability of your tests, particularly for fast test-takers or individuals with lower reading skills.  If the examinee misses that one single word, they will get the question wrong even if they actually know the material.  This test item ends up penalizing the wrong examinees!

3) EXCESS VERBIAGE

Long stems can be effective and essential in many situations, but they are also more prone to two specific item construction flaws.  If the stem is unnecessarily long, it can contribute to examinee fatigue.  Because each item requires more energy to read and understand, examinees tire sooner and may begin to perform more poorly later on in the test—regardless of their competence level.

Additionally, long stems often include information that can be used to answer other questions in the test.  This could lead your test to be an assessment of whose test-taking memory is best (i.e. “Oh yeah, #5 said XYZ, so the answer to #34 is XYZ.”) rather than who knows the material.

Item writing tips:  distractors / options

Unfortunately, item stems aren’t the only offenders.  Experienced test writers know that the distractors (i.e. options) are actually more difficult to write than the stems themselves.  When you review your test items, look to see if your item distractors contain…

4) IMPLAUSIBILITY

The purpose of a distractor is to pull less qualified examinees away from the correct answer with other options that look correct.  In order to “distract” an examinee from the correct answer, distractors have to be plausible.  The closer they are to being correct, the more difficult the exam will be.  If the distractors are obviously incorrect, even unqualified examinees won’t pick them.  Then your exam will not help you discriminate between examinees who know the material and examinees who do not, which is the entire goal.

5) 3-TO-1 SPLITS

You may recall watching Sesame Street as a child.  If so, you remember the song “One of these things…”  (Either way, enjoy refreshing your memory!)   Looking back, it seems really elementary, but sometimes our test item options are written in such a way that an examinee can play this simple game with your test.  Instead of knowing the material, they can look for the option that stands out as different from the others.  Consider the following questions to determine if one of your items falls into this category:

  • Is the correct answer significantly longer than the distractors?
  • Does the correct answer contain more detail than the distractors?
  • Is the grammatical structure different for the answer than for the distractors?

6) ALL OF THE ABOVE

There are a couple of problems with having this phrase (or the opposite “None of the above”) as an option.  For starters, good test takers know that this is—statistically speaking—usually the correct answer.  If it’s there and the examinee picks it, they have a better than 50% chance of getting the item right—even if they don’t know the content.  Also, if they are able to identify two options as correct, they can select “All of the above” without knowing whether or not the third option was correct.  These sorts of questions also get in the way of good item analysis.   Whether the examinee gets this item right or wrong, it’s harder to ascertain what knowledge they have because the correct answer is so broad.

This is helpful, can I learn more?

Want to learn more about item writing?  Here’s an instructional video from one of our PhD psychometricians.  You should also check out this book.

Item authoring is easier with an item banking system

The process of reading through your exams in search of these flaws in the item authoring is time-consuming (and oftentimes depressing), but it is an essential step towards developing an exam that is valid, reliable, and reflects well on your organization as a whole.  We also recommend that you look into getting a dedicated item banking platform, designed to help with this process.

Summary Checklist

 

Issue: Key is invalid due to multiple correct answers.
Recommendation: Consider each answer option individually; the key should be fully correct, with each distractor being fully incorrect.

Issue: Item was written in a hard-to-comprehend way; examinees were unable to apply their knowledge because of poor wording.
Recommendation: Ensure that the item can be understood after just one read-through. If you have to read the stem multiple times, it needs to be rewritten.

Issue: Grammar, spelling, or syntax errors direct savvy test-takers toward the correct answer (or away from incorrect answers).
Recommendation: Read the stem, followed by each answer option, aloud. Each answer option should fit with the stem.

Issue: Information was introduced in the stem text that was not relevant to the question.
Recommendation: After writing each question, evaluate the content of the stem. It should be clear and concise without introducing irrelevant information.

Issue: Item emphasizes trivial facts.
Recommendation: Work off of a test blueprint to ensure that each of your items maps to a relevant construct. If you are using Bloom’s taxonomy or a similar approach, items should come from higher-order levels.

Issue: Numerical answer options overlap.
Recommendation: Carefully evaluate numerical ranges to ensure there is no overlap among options.

Issue: Examinees noticed the answer was most often A.
Recommendation: Distribute the key evenly among the answer options. This can be avoided with FastTest’s randomized delivery functionality.

Issue: Key was overly specific compared to distractors.
Recommendation: Answer options should all be about the same length and contain the same amount of information.

Issue: Key was the only option to include a key word from the item stem.
Recommendation: Avoid re-using key words from the stem text in your answer options. If you do use such words, evenly distribute them among all of the answer options so as not to call out individual options.

Issue: A rare exception can be argued to invalidate a true/false always/never question.
Recommendation: Avoid using “always” or “never,” as there can be unanticipated or rare scenarios. Opt for less absolute terms like “most often” or “rarely.”

Issue: Distractors were not plausible, so the key was obvious.
Recommendation: Review each answer option and ensure that it has some bearing in reality. Distractors should be plausible.

Issue: Idiom or jargon was used; non-native English speakers did not understand.
Recommendation: It is best to avoid figures of speech; keep the stem text and answer options literal to avoid introducing undue discrimination against certain groups.

Issue: Key was significantly longer than distractors.
Recommendation: There is a strong tendency to write a key that is very descriptive. Be wary of this and evaluate distractors to ensure that they are approximately the same length.
 

Validity Threats

Validity threats are issues with a test or assessment that hinder the interpretations and use of scores, such as cheating, inappropriate use of scores, unfair preparation, or non-standardized delivery.  It is important to establish a test security plan to define the threats relevant for you and address them.

Validity, in its modern conceptualization, refers to evidence that supports our intended interpretations of test scores (see Chapter 1 of APA/AERA/NCME Standards for full treatment).   The word “interpretation” is key because test scores can be interpreted in different ways, including ways that are not intended by the test designers.  For example, a test given at the end of Nursing school to prepare for a national licensure exam might be used by the school as a sort of Final Exam.  However, the test was not designed for this purpose and might not even be aligned with the school’s curriculum.  Another example is that certification tests are usually designed to demonstrate minimal competence, not differentiate amongst experts, so interpreting a high score as expertise might not be warranted.

Validity threats: Always be on the lookout!

Test sponsors, therefore, must be vigilant against any validity threats.  Some of these, like the two aforementioned examples, might be outside the scope of the organization.  While it is certainly worthwhile to address such issues, our primary focus is on aspects of the exam itself.

Which validity threats rise to the surface in psychometric forensics?

Here, we will discuss several threats to validity that typically present themselves in psychometric forensics, with a focus on security aspects.  However, I’m not just listing security threats here, as psychometric forensics is excellent at flagging other types of validity threats too.

For each threat below, the description is followed by the relevant detection approaches, an example of the signature, and typical indices.

Collusion (copying): Examinees are copying answers from one another, usually with a defined source.
  - Approach: Error similarity (looks only at incorrect responses). Example: two examinees get the same 10 items wrong and select the same distractor on each. Indices: B-B Ran, B-B Obs, K, K1, K2, S2.
  - Approach: Response similarity. Example: two examinees give the same response on 98/100 items. Indices: S2, g2, ω, Zjk.

Group-level help/issues: Similar to collusion but at a group level; could be examinees working together, or receiving answers from a teacher or proctor.  Note that many examinees using the same brain dump would have a similar signature, but across locations.
  - Approach: Group-level statistics. Example: a location has one of the highest mean scores but lowest mean times. Indices: descriptive statistics such as mean score, mean time, and pass rate.
  - Approach: Response or error similarity. Example: on a certain group of items, the entire classroom gives the same answers. Indices: roll-up analysis, such as mean collusion flags per group; also erasure analysis (paper only).

Pre-knowledge: The examinee comes in to take the test already knowing the items and answers, often purchased from a brain dump website.
  - Approach: Time-score analysis. Example: examinee has a high score and very short time. Indices: RTE or total time vs. scores.
  - Approach: Response or error similarity. Example: examinee has all the same responses as a known brain dump site. Indices: all indices.
  - Approach: Pretest item comparison. Example: examinee gets 100% on existing items but 50% on new items. Indices: pretest vs. scored results.
  - Approach: Person fit. Example: examinee gets the 10 hardest items correct but performs below average on the rest of the items. Indices: Guttman indices, lz.

Harvesting: The examinee is not actually trying to pass the test, but is sitting it to memorize items so they can be sold afterwards, often at a brain dump website.  Similar signature to Sleepers, but more likely to occur on voluntary tests, or where high scores benefit examinees.
  - Approach: Time-score analysis. Example: low score, high time, few attempts. Indices: RTE or total time vs. scores.
  - Approach: Mean vs. median item time. Example: examinee "camps" on 10 items to memorize them; mean item time is much higher than the median. Indices: mean-median index.
  - Approach: Option flagging. Example: examinee answers "C" to all items in the second half. Indices: option proportions.

Low motivation (Sleeper): Examinees are disengaged, producing data that is flagged as unusual and invalid; fortunately, this is not usually a security concern, but it could be a policy concern. Similar signature to Harvesters, but more likely to occur on mandatory tests, or where high scores do not benefit examinees.
  - Approach: Time-score analysis. Example: low score, high time, few attempts. Indices: RTE or total time vs. scores.
  - Approach: Item timeout rate. Example: if you have item time limits, the examinee hits them. Indices: proportion of items that hit the limit.
  - Approach: Person fit. Example: examinee attempts a few items and passes through the rest. Indices: Guttman indices, lz.

Low motivation (Clicker): Examinees are disengaged, producing data that is flagged as unusual and invalid; fortunately, this is not usually a security concern, but it could be a policy concern. Similar idea to the Sleeper, but the data look quite different.
  - Approach: Time-score analysis. Example: examinee quickly clicks "A" to all items, finishing with a low time and low score. Indices: RTE, total time vs. scores.
  - Approach: Option flagging. Example: see above. Indices: option proportions.
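To make a couple of these approaches concrete, here is a minimal Python sketch of the underlying ideas. It is not an implementation of any of the published indices named above (ω, g2, RTE, and so on); it just computes a raw response-agreement proportion between two examinees and a crude time-versus-score flag, with hypothetical data and arbitrary thresholds.

```python
# Illustrative sketch only: raw response agreement and a crude time-score flag.
# These are NOT the published collusion or pre-knowledge indices, just simple
# descriptive statistics that show the general idea.

def response_similarity(resp_a, resp_b):
    """Proportion of items on which two examinees gave the identical response."""
    assert len(resp_a) == len(resp_b)
    matches = sum(a == b for a, b in zip(resp_a, resp_b))
    return matches / len(resp_a)

def time_score_flag(score, total_minutes, score_cut=0.90, time_cut=20):
    """Flag a suspiciously high score earned in a suspiciously short time.
    Thresholds here are arbitrary placeholders; real programs set them empirically."""
    return score >= score_cut and total_minutes <= time_cut

# Hypothetical data: each string holds the selected option per item
examinee_1 = list("ABCDACBDABCDABCDACBD")
examinee_2 = list("ABCDACBDABCDABCDACBA")

print(response_similarity(examinee_1, examinee_2))    # 0.95 (19 of 20 items match)
print(time_score_flag(score=0.96, total_minutes=14))  # True -> worth a closer look
```

Real forensics programs set flagging thresholds empirically and combine multiple indices before acting on any single result.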

Psychometric Forensics to Find Evidence of Cheating

An emerging sector in the field of psychometrics is the area devoted to analyzing test data to find cheaters and other illicit or invalid testing behavior.  There is a distinction between primary and secondary collusion, and a range of collusion detection indices and other methods exist for investigating aberrant testing behavior.

While research on this topic is more than 50 years old, the modern era did not begin until Wollack published his paper on the Omega index in 1997. Since then, the sophistication and effectiveness of the methodology in the field have multiplied, and many more publications focus on it than in the pre-Omega era. This is evidenced by not one but three recent books on the subject:

  1. Wollack, J., & Fremer, J. (2013).  Handbook of Test Security.
  2. Kingston, N., & Clark, A. (2014).  Test Fraud: Statistical Detection and Methodology.
  3. Cizek, G., & Wollack, J. (2016). Handbook of Quantitative Methods for Detecting Cheating on Tests.

 


Item review is the process of putting newly written test questions through a rigorous peer review, to ensure that they are high quality and meet industry standards.

What is an item review workflow?

Developing a high-quality item bank is an extremely involved process, and authoring the items is just the first step.  Items need to go through a defined workflow, with multiple people providing item review.  For example, you might require all items to be reviewed by another content expert, a psychometrician, an editor, and a bias reviewer.  Each needs to give their input and pass the item along to the next in line.  You also need to record the results of each review, because part of validity is having documentation to support how the test was developed.

What to review?

You should first establish what you want reviewed.  Assessment organizations will often formalize the guidelines as an Item Writing Guide.  Here is the guide that Assessment Systems uses with our clients, but I also recommend checking out the NBME Item Writing Guide.  For an even deeper treatment, I recommend the book Developing and Validating Test Items by Haladyna and Rodriguez (2013).

Here are some aspects to consider for item review.

Content

Most importantly, other content experts should check the item’s content.  Is the correct answer actually correct?  Are all the distractors actually incorrect?  Does the stem provide all the necessary information?  You’d be surprised how many times such issues slip past even the best reviewers!

Psychometrics

Psychometricians will often review an item to confirm that it meets best practices and that there are no tip-offs.  A common one is that the correct answer is often longer (more words) than the distractors.  Some organizations avoid “all of the above” and other approaches.
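As a concrete illustration (not the actual review logic of any platform), here is a small Python sketch that screens for one such tip-off: a key that is much wordier than its distractors. The item structure and threshold are hypothetical.

```python
# Hypothetical item structure; a real item bank export would look different.
item = {
    "stem": "Which statistic is most appropriate for polytomous IRT items?",
    "options": {
        "A": "The generalized partial credit model, which handles multiple score points",
        "B": "The Rasch model",
        "C": "Coefficient alpha",
        "D": "The point-biserial",
    },
    "key": "A",
}

def key_length_flag(item, ratio=1.5):
    """Flag the item if the key has `ratio` times more words than the average distractor."""
    key_words = len(item["options"][item["key"]].split())
    distractor_lengths = [
        len(text.split()) for opt, text in item["options"].items() if opt != item["key"]
    ]
    avg_distractor = sum(distractor_lengths) / len(distractor_lengths)
    return key_words > ratio * avg_distractor

print(key_length_flag(item))  # True -- the key is much wordier than the distractors
```

A flag like this does not mean the item is bad, only that a human reviewer should take a second look.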

Format

Formal editors are sometimes brought in to work on the language and format of the item.  A common mistake is to end the stem with a colon even though that does not follow basic grammatical rules of English.

Bias/Sensitivity

For high-stakes exams that are used on diverse populations, it is important to add this step.  You don’t want items that are biased against a subset of students.  This is not just about race; it can involve any subgroup of students.  Years ago I worked on items for the US State of Alaska, which has some incredibly rural regions; we had to avoid concepts that many people take for granted, like roads or shopping malls!

How to implement an item review workflow


This is an example of how to implement the process in a professional-grade item banking platform.  Both of our platforms, FastTest and Assess.ai, have powerful functionality to manage this process.  Admin users can define the stages and the required input, then manage the team members and flow of items.  Assess.ai is unique in the industry with its use of Kanban boards – recognized as the best UI for workflow management – for item review.

An additional step, often done at the same time, is standard setting.  One of the most common approaches is the modified-Angoff method, which requires you to obtain a difficulty rating from a team of experts for each item.  The item review interfaces excel at managing this process as well, saving you the effort of tracking it manually.
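As a side note on the arithmetic, here is a minimal Python sketch of the core modified-Angoff calculation, assuming hypothetical ratings in which each expert estimates the proportion of minimally competent candidates who would answer each item correctly; the recommended cut score is simply the sum of the item means.

```python
# Hypothetical modified-Angoff ratings: rows are items, columns are expert raters.
# Each value is the estimated proportion of minimally competent candidates
# who would answer that item correctly.
ratings = [
    [0.70, 0.65, 0.75],   # item 1
    [0.90, 0.85, 0.90],   # item 2
    [0.50, 0.60, 0.55],   # item 3
    [0.80, 0.80, 0.85],   # item 4
]

item_means = [sum(r) / len(r) for r in ratings]
raw_cut = sum(item_means)                    # cut score on the raw-score metric
percent_cut = raw_cut / len(ratings) * 100   # as a percentage of this 4-item test

print(item_means)    # [0.70, 0.883..., 0.55, 0.816...]
print(raw_cut)       # about 2.95 out of 4 points
print(percent_cut)   # about 73.75%
```

A real study would also examine rater agreement and often adjusts the recommended cut with impact data, but the basic computation is this simple.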

  1. Create the workflow: Specify your stages and how items can move between them.
  2. Define your review fields: These are special item metadata fields that require input from multiple users.
  3. Move new items into the workflow: Once an item is written, it is ready for review.
  4. Assign items to users: Assign the item in the UI, with the option to send an email.
  5. Users perform reviews: They can read the item, interact as a student would, and then leave feedback and other metadata in the review fields; then push the item down the line.
  6. Admins evaluate/export the results: Admins can evaluate the results and decide if an item needs revision, or if it can be considered released.

 


Job Task Analysis (JTA) is an essential step in designing a test to be used in the workforce, such as pre-employment or certification/licensure, by analyzing data on what is actually being done in the job.  Also known as Job Analysis or Role Delineation, job task analysis is important to design a test that is legally defensible and eligible for accreditation.  It usually involves a panel of subject matter experts to develop a survey, which you then deliver to professionals in your field to get quantitative data about what is most frequently done on the job and what is most critical/important.  This data can then be used for several important purposes.

Need help? Our experts can help you efficiently produce a job task analysis study for your certification, guide the process of item writing and standard setting, then publish and deliver the exam on our secure platform.

 

Reasons to do a Job Task Analysis

Job analysis is extremely important in the field of industrial/organizational psychology, hence the memes from @iopsychmemes.  It’s not just limited to credentialing.


Exam design

The most common reason is to get quantitative data that will help you design an exam.  By knowing which knowledge, skills, or abilities (KSAs) are most commonly used, you know which deserve more questions on the test.  It can also help you with more complex design aspects, such as defining a practical exam with live patients.

Training curriculum

Similarly, that quantitative info can help design a curriculum and other training materials.  You will have data on what is most important or frequent.

Compensation analysis

You have a captive audience with the JTA survey.  Ask them other things that you want to know!  This is an excellent time to gather information about compensation.  I worked on a JTA in the past which asked about work location: clinic, hospital, private practice, or vendor/corporate.

Job descriptions

A good job analysis will help you write a job description for postings.  It will tell you the job responsibilities (common tasks), qualifications (required skills, abilities, and education), and other important aspects.  If you gather compensation data in the survey, that can be used to define the salary range of the open position.

Workforce planning

Important trends might become obvious when analyzing the data.  Are fewer people entering your profession, perhaps specific to a certain region or demographic?  Are they entering without certain skills?  Are there certain universities or training programs that are not performing well?  A JTA can help you discover such issues and then work with stakeholders to address them.  These are major potential problems for the profession.

IT IS MANDATORY

If you have a professional certification exam and want to get it accredited by a board such as NCCA or ANSI/ANAB/ISO, then you are REQUIRED to do some sort of job task analysis.

 

Why is a JTA so important for certification and licensure?  Validity.

The fundamental goal of psychometrics is validity, which is evidence that the interpretations we make from scores are actually true. In the case of certification and licensure exams, we are interpreting that someone who passes the test is qualified to work in that job role. So, the first thing we need to do is define exactly what is the job role, and to do it in a quantitative, scientific way. You can’t just have someone sit down in their basement and write up 17 bullet points as the exam blueprint.  That is a lawsuit waiting to happen.

There are other aspects that are essential as well, such as item writer training and standard setting studies.

 

The Methodology: Job Task Inventory

It’s not easy to develop a defensible certification exam, but the process of job task analysis (JTA) doesn’t require a Ph.D. in Psychometrics to understand. Here’s an overview of what to expect.

  1. Convene a panel of subject matter experts (SMEs), and provide a training on the JTA process.
  2. The SMEs then discuss the role of the certification in the profession, and establish high-level topics (domains) that the certification test should cover. Usually there are 5 to 20. Sometimes there are subdomains, and occasionally sub-subdomains.
  3. The SME panel generates a list of job tasks that are assigned to domains; the list is reviewed for duplicates and other potential issues. These tasks have an action verb, a subject, and sometimes a qualifier. Examples: “Calibrate the lensometer,” “Take out the trash”, “Perform an equating study.”  There is a specific approach to help with the generation, called the critical incident technique.  With this, you ask the SMEs to describe a critical incident that happened on the job and what skills or knowledge led to success by the professional.  While this might not generate ideas for frequent yet simple tasks, it can help generate ideas for tasks that are rarer but very important.
  4. The final list is used to generate a survey, which is sent to a representative sample of professionals who actually work in the role. The respondents take the survey, in which they rate each task, usually on its importance and time spent (sometimes called criticality and frequency). Demographics are also gathered, including age range, geographic region, work location (e.g., clinic vs hospital if medical), years of experience, educational level, and additional certifications.
  5. A psychometrician analyzes the results and creates a formal report, which is essential for validity documentation.  This report is sometimes considered confidential, sometimes published on the organization’s website for the benefit of the profession, and sometimes published in an abbreviated form.  It’s up to you.  For example, this site presents the final results, but then asks you to submit your email address for the full report.

 

Using JTA results to create test blueprints

Many corporations do a job analysis purely for in-house purposes, such as job descriptions and compensation.  This becomes important for large corporations where you might have thousands of people in the same job; it needs to be well-defined, with good training and appropriate compensation.

If you work for a credentialing organization (typically a non-profit, but sometimes the training arm of a corporation… for example, Amazon Web Services has a division dedicated to certification exams), you will need to analyze the results of the JTA to develop exam blueprints.  We will discuss this process in more detail in another blog post.  But below is an example of how this will look, and here is a free spreadsheet to perform the calculations: Job Task Analysis to Test Blueprints.

 

Job Task Analysis Example

Suppose you are an expert widgetmaker in charge of the widgetmaker certification exam.  You hire a psychometrician to guide the organization through the test development process.  The psychometrician would start by holding a webinar or in-person meeting for a panel of SMEs to define the role and generate a list of tasks.  The group comes up with a list of 20 tasks, sorted into 4 content domains.  These are listed in a survey to current widgetmakers, who rate them on importance and frequency.  The psychometrician analyzes the data and presents a table like you see below.

We can see here that Task 14 is the most frequent, while Task 2 is the least frequent.  Task 7 is the most important while Task 17 is the least.  When you combine Importance and Frequency either by adding or multiplying, you get the weights on the right-hand columns.  If we sum these and divide by the total, we get the suggested blueprints in the green cells.

 

Job task analysis to test blueprints
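Here is a small Python sketch of that calculation using made-up ratings for a handful of tasks; the linked spreadsheet does the same thing at full scale. The combination rule shown (multiplying importance by frequency) and the ratings themselves are illustrative assumptions, not results from any real study.

```python
# Made-up JTA survey means for a few tasks; real studies would have many more.
# Each task has a domain, a mean importance rating, and a mean frequency rating.
tasks = [
    {"task": "Task 1", "domain": "A", "importance": 3.2, "frequency": 2.8},
    {"task": "Task 2", "domain": "A", "importance": 2.1, "frequency": 1.5},
    {"task": "Task 3", "domain": "B", "importance": 3.8, "frequency": 3.5},
    {"task": "Task 4", "domain": "B", "importance": 2.9, "frequency": 3.9},
]

# Combine importance and frequency (multiplying here; adding is also common).
for t in tasks:
    t["weight"] = t["importance"] * t["frequency"]

total = sum(t["weight"] for t in tasks)

# Task-level blueprint percentages
for t in tasks:
    print(t["task"], round(100 * t["weight"] / total, 1), "%")

# Roll up to domain-level blueprint percentages
domains = {}
for t in tasks:
    domains[t["domain"]] = domains.get(t["domain"], 0) + t["weight"]
for d, w in domains.items():
    print("Domain", d, round(100 * w / total, 1), "%")
```

Whether you add or multiply the two ratings, and whether you weight importance more heavily than frequency, is a policy decision that the SME panel should make and document.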

 

laptop screen demonstrating automated test assembly

Question and Test Interoperability® (QTI®) is a set of standards around the format of import/export files for test questions in educational assessment and HR/credentialing exams.  This facilitates the movement of questions from one software platform to another, including item banking, test assembly, e-Learning, training, and exam delivery.  This serves two main purposes:

  1. Allows you to use multiple vendors more easily, such as one for item banking and another for exam delivery;
  2. Makes it easier to migrate to a new vendor, as you can export all your content from the old vendor and then import it into the new one.

In this blog post, we’ll discuss the significance of QTI and how it helps test sponsors in the world of certification, workforce, and educational assessment.

What is Question and Test Interoperability (QTI)?

QTI is a widely adopted standard that facilitates the exchange of assessment content and results between various learning platforms and assessment tools. Developed by the IMS Global Learning Consortium / 1EdTech, its goal is that assessments can be created, delivered, and evaluated consistently across different systems, paving the way for a more efficient and streamlined educational experience.  QTI is similar to SCORM, but while SCORM is intended for learning content, QTI is specific to assessment.

QTI uses an XML approach to content and markup, specifically adapted for educational assessment, covering things such as stems, answer options, correct answers, and scoring information.  Version 2.x packages all content into a zip file, including a manifest file that tells the importing platform what to expect, with the items as separate XML files and media files saved separately, sometimes in a subfolder.

Here is an example of the file arrangement inside the zip:

QTI files
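If you want to inspect such a package programmatically, a rough Python sketch like the one below works under the usual QTI 2.x packaging conventions, where an imsmanifest.xml at the root lists the resources; the package file name here is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# "my_item_bank_export.zip" is a hypothetical QTI 2.x package file name.
with zipfile.ZipFile("my_item_bank_export.zip") as pkg:
    # List everything in the package: manifest, item XML files, media, etc.
    for name in pkg.namelist():
        print(name)

    # The manifest declares the resources (items, tests, media) in the package.
    with pkg.open("imsmanifest.xml") as f:
        manifest = ET.parse(f).getroot()

# Resource elements carry a type and an href pointing to the item XML file.
# The {*} wildcard sidesteps the exact namespace URI, which varies by version.
for resource in manifest.iter("{*}resource"):
    print(resource.get("type"), resource.get("href"))
```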

Here is an example of what a test question would look like:

QTI example item
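To give a flavor of the markup itself, here is a hedged sketch of a QTI 2.1-style multiple-choice item, embedded in Python and parsed with the standard library. The element names follow the public QTI 2.1 specification as commonly used, but the identifiers and content are made up, and any real implementation should be checked against the official schema and your vendor's flavor.

```python
import xml.etree.ElementTree as ET

# A minimal QTI 2.1-style multiple-choice item (simplified; real exports include
# more attributes and metadata). Identifiers and text are made up for illustration.
qti_item = """<assessmentItem xmlns="http://www.imsglobal.org/xsd/imsqti_v2p1"
    identifier="ITEM001" title="Sample item" adaptive="false" timeDependent="false">
  <responseDeclaration identifier="RESPONSE" cardinality="single" baseType="identifier">
    <correctResponse><value>B</value></correctResponse>
  </responseDeclaration>
  <itemBody>
    <choiceInteraction responseIdentifier="RESPONSE" shuffle="true" maxChoices="1">
      <prompt>Which model is commonly used for polytomous items?</prompt>
      <simpleChoice identifier="A">Coefficient alpha</simpleChoice>
      <simpleChoice identifier="B">Generalized partial credit model</simpleChoice>
      <simpleChoice identifier="C">Point-biserial correlation</simpleChoice>
    </choiceInteraction>
  </itemBody>
</assessmentItem>"""

root = ET.fromstring(qti_item)
ns = {"qti": "http://www.imsglobal.org/xsd/imsqti_v2p1"}

# Pull out the stem, the options, and the keyed response.
prompt = root.find(".//qti:prompt", ns).text
choices = {c.get("identifier"): c.text for c in root.findall(".//qti:simpleChoice", ns)}
key = root.find(".//qti:correctResponse/qti:value", ns).text

print(prompt)
print(choices)
print("Key:", key)
```

The same structure scales up to other interaction types (text entry, hotspot, ordering, and so on), which is what allows different platforms to interpret each other's content.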

 

Why is QTI important?

Interoperability Across Platforms

QTI enables educators to create assessments on one platform and seamlessly transfer them to another. This cross-platform compatibility is crucial in today’s diverse educational technology landscape, where institutions often use a combination of learning management systems, assessment tools, and other applications.

Enhanced Efficiency

With QTI, the time-consuming process of manually transferring assessment content between systems is eliminated. This not only saves valuable time for educators but also ensures that the integrity of the assessment is maintained throughout the transfer process.

Adaptability to Diverse Assessment Types

QTI supports a wide range of question types, including multiple-choice, true/false, short answer, and more. This adaptability allows educators to create diverse and engaging assessments that cater to different learning styles and subject matter.

Data Standardization

The standardization of data formats within QTI ensures that assessment results are consistent and easily interpretable. This standardization not only facilitates a smoother exchange of information but also enables educators to gain valuable insights into student performance across various assessments.

Facilitating Accessibility

QTI is designed to support accessibility standards, making assessments more inclusive for all students, including those with disabilities. By adhering to accessibility guidelines, educational institutions can ensure that assessments are a fair and effective means of evaluating student knowledge.

 

How to Use QTI

Creating Assessments

qti specifications computer

QTI allows educators to author assessments in a standardized format that can be easily transferred between different platforms. When creating assessments, users adhere to the specification, ensuring compatibility and consistency.

Importing and Exporting Assessments

Educational institutions often use multiple learning management systems and assessment tools. QTI simplifies the process of transferring assessments between different platforms, eliminating the need for manual adjustments and reducing the risk of data corruption.

Adhering to QTI Specifications

To fully leverage the benefits of QTI, users must adhere to its specifications when creating and implementing assessments. Understanding the QTI schema and guidelines is essential for ensuring that assessments are interoperable across various systems.  This also depends on the vendor you select: different versions of the QTI standard have evolved over the years, and some vendors have slightly modified the format for their own purposes!

 

Examples of QTI Applications

Online Testing Platforms

QTI is widely used in online testing platforms to facilitate the seamless transfer of assessments. Whether transitioning between different learning management systems or integrating third-party assessment tools, it ensures a smooth and standardized process.

Learning Management Systems (LMS)

Educational institutions often employ different LMS platforms. QTI allows educators to create assessments in one LMS and seamlessly transfer them to another, ensuring continuity and consistency in the assessment process.

Assessment Authoring Tools

QTI is integrated into various assessment authoring tools, enabling educators to create assessments in a standardized format. This integration ensures that assessments can be easily shared and used across different educational platforms.

 

Resources for Implementation

IMS Global Learning Consortium

The official website of the IMS Global Learning Consortium provides comprehensive documentation, specifications, and updates related to QTI. Educators and developers can access valuable resources to understand and implement QTI effectively.

QTI-Compatible Platforms and Tools

Many learning platforms and assessment tools explicitly support these specifications. Exploring and adopting compatible solutions simplifies the implementation process and ensures a seamless experience for both educators and students.  Our FastTest platform provides support for QTI.

Community Forums and Support Groups

Engaging with the educational technology community through forums and support groups allows users to share experiences, seek advice, and stay updated on best practices for QTI implementation.  See this thread in Moodle forums, for example.

Wikipedia

Wikipedia has an overview of the topic.

 

Conclusion

In a world where educational technology is advancing rapidly, the Question and Test Interoperability specification stands out as a crucial standard for achieving interoperability in assessment tools, fostering a more efficient, accessible, and adaptable educational environment.  By understanding what QTI is, how to use it, exploring real-world examples, and tapping into valuable resources, educators can navigate the educational landscape more effectively, ensuring a streamlined and consistent e-Assessment experience for students and instructors alike.