Posts on psychometrics: The Science of Assessment


Setting a cutscore on a test scored with item response theory (IRT) requires some psychometric knowledge.  This post will get you started.

How do I set a cutscore with item response theory?

There are two approaches: directly with IRT, or using CTT then converting to IRT.

  1. Some standard setting methods work directly with IRT, such as the Bookmark method.  Here, you calibrate your test with IRT, rank the items by difficulty, and have an expert panel place “bookmarks” in the ranked list.  The average IRT difficulty of their bookmarks is then a defensible IRT cutscore.  The Contrasting Groups method and the Hofstee method can also work directly with IRT.
  2. Cutscores set with classical test theory, such as the Angoff, Nedelsky, or Ebel methods, are easy to implement when the test is scored classically.  But if your test is scored with the IRT paradigm, you need to convert your cutscores onto the theta scale.  The easiest way to do that is to reverse-calculate the test response function (TRF) from IRT.

The Test Response Function

The TRF (sometimes called a test characteristic curve) is an important way of characterizing test performance in the IRT paradigm.  The TRF predicts a classical score from an IRT score, as you see below.  Like the item response function and the test information function, it uses the theta scale as the X-axis.  The Y-axis can be either the number-correct metric or the proportion-correct metric.

Test response function 10 items Angoff

In this example, you can see that a theta of -0.3 translates to an estimated number-correct score of approximately 7, or 70%.

Classical cutscore to IRT

So how does this help us with the conversion of a classical cutscore?  Well, we now have a way of translating any number-correct or proportion-correct score to theta.  So any classical cutscore can be reverse-calculated to a theta value.  If your Angoff study (or Beuk compromise) recommends a cutscore of 7 out of 10 points (70%), you can convert that to a theta cutscore of -0.3 as above.  If the recommended cutscore was 8 (80%), the theta cutscore would be approximately 0.7.
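To make the reverse-calculation concrete, here is a minimal sketch in Python, assuming a 10-item form already calibrated with the 3PL model.  The item parameters below are hypothetical, so the resulting theta cuts will only roughly approximate the -0.3 and 0.7 values quoted above.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical 3PL parameters for a 10-item form (not from any real calibration)
a = np.array([0.8, 1.0, 1.2, 0.9, 1.1, 0.7, 1.0, 1.3, 0.9, 1.0])
b = np.array([-1.5, -1.0, -0.5, -0.2, 0.0, 0.3, 0.6, 1.0, 1.4, 1.8])
c = np.full(10, 0.20)

def trf(theta, D=1.7):
    """Test response function: expected number-correct score at a given theta."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return p.sum()

def theta_cut(raw_cut):
    """Reverse-calculate the theta whose expected score equals the classical cutscore."""
    return brentq(lambda t: trf(t) - raw_cut, -4.0, 4.0)

print(theta_cut(7))  # theta cutscore for an Angoff recommendation of 7/10 (70%)
print(theta_cut(8))  # theta cutscore for 8/10 (80%)
```

For a CAT pool, the same idea applies on the proportion-correct metric: divide the TRF by the number of items before solving.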

This works because IRT scores examinees on the same scale with any set of items, as long as those items have been part of a linking/equating study.  Therefore, a single study on a set of items can be equated to any other linear test form, LOFT pool, or CAT pool.  This makes it possible to apply the classically focused Angoff method to IRT-focused programs.  You can even set the cutscore with a linearly delivered subset of your item pool, with the full intention of applying it to CAT tests later.

Note that the number-correct metric only makes sense for linear or LOFT exams, where every examinee receives the same number of items.  In the case of CAT exams, only the proportion correct metric makes sense.

How do I implement IRT?

Interested in applying IRT to improve your assessments?  Download a free trial copy of  Xcalibre  here.  If you want to deliver online tests that are scored directly with IRT, in real time (including computerized adaptive testing), check out  FastTest.

modified-Angoff Beuk compromise

A modified-Angoff study is one of the most common ways to set a defensible cutscore on an exam.  It makes the pass/fail decisions made by the test more trustworthy than if you had picked an arbitrary round number like 70%.  If your doctor, lawyer, accountant, or other professional has passed an exam whose cutscore was set with this method, you can place more trust in their skills.

What is the Angoff method?

The Angoff method is a scientific way of setting a cutscore (pass point) on a test.  If you have a criterion-referenced interpretation, it is not legally defensible to just conveniently pick a round number like 70%; you need a formal process.  There are a number of acceptable methodologies in the psychometric literature for standard-setting studies, which establish cutscores or passing points.  Some examples include Angoff, modified-Angoff, Bookmark, Contrasting Groups, and Borderline.  The modified-Angoff approach is by far the most popular.  It is used especially frequently for certification, licensure, certificate, and other credentialing exams.

It was originally suggested as a mere footnote by renowned researcher William Angoff at Educational Testing Service.  Studies have found that panelists in modified-Angoff sessions typically show high agreement, with inter-rater reliability often surpassing 0.85, demonstrating the method's consistency of decisions.

How does the Angoff approach work?

First, you gather a group of subject matter experts (SMEs), with a minimum of 6, though 8-10 is preferred for better reliability, and have them define what they consider to be a Minimally Competent Candidate (MCC).  Next, you have them estimate the percentage of minimally competent candidates that will answer each item correctly.  You then analyze the results for outliers or inconsistencies.  If experts disagree, you will need to evaluate inter-rater reliability and agreement, and after that have the experts discuss and re-rate the items to gain better consensus.  The average final rating is then the expected percent-correct score for a minimally competent candidate.

Advantages of the Angoff method

  1. It is defensible.  Because it is the most commonly used approach and is widely studied in the scientific literature, it is well-accepted.
  2. You can implement it before a test is ever delivered.  Some other methods require you to deliver the test to a large sample first.
  3. It is conceptually simple, easy enough to explain to non-psychometricians.
  4. It incorporates the judgment of a panel of experts, not just one person or a round number.
  5. It works for tests with both classical test theory and item response theory.
  6. It does not take long to implement – for a short test, it can be done in a matter of hours!
  7. It can be used with different item types, including polytomously scored (multi-point) items.

Disadvantages of the Angoff method

  1. It does not use actual data, unless you implement the Beuk method alongside.
  2. It can lead to the experts overestimating the performance of entry-level candidates, as they may have forgotten what it was like to start out 20-30 years ago.  This is one reason to use the Beuk method as a “reality check”: by showing the experts that if they stay with the cutscore they just picked, the majority of candidates might fail!

Example of the Modified-Angoff Approach

First of all, do not expect a straightforward, easy process that leads to an unassailably correct cutscore.  All standard-setting methods involve some degree of subjectivity.  The goal of the methods is to reduce that subjectivity as much as possible.  Some methods focus on content, others on examinee performance data, while some try to meld the two.

Step 1: Prepare Your Team

The modified-Angoff process depends on a representative sample of SMEs, usually 6-20. By “representative” I mean they should represent the various stakeholders. For instance, a certification for medical assistants might include experienced medical assistants, nurses, and physicians, from different areas of the country. You must train them about their role and how the process works, so they can understand the end goal and drive toward it.

Step 2: Define The Minimally Competent Candidate (MCC)

This concept is the core of the modified-Angoff method, though it is known by a range of terms or acronyms, including minimally qualified candidates (MQC) or just barely qualified (JBQ).  The reasoning is that we want our exam to separate candidates that are qualified from those that are not.  So we ask the SMEs to define what makes someone qualified (or unqualified!) from a perspective of skills and knowledge. This leads to a conceptual definition of an MCC. We then want to estimate what score this borderline candidate would achieve, which is the goal of the remainder of the study. This step can be conducted in person, or via webinar.

Step 3: Round 1 Ratings

Next, ask your SMEs to read through all the items on your test form and estimate the percentage of MCCs that would answer each correctly.  A rating of 100 means the item is a slam dunk: it is so easy that every MCC would get it right.  A rating of 40 means the item is very difficult, with only 40% of MCCs expected to answer correctly.  Most ratings are in the 60-90 range if the items are well developed.  The ratings should be gathered independently; if everyone is in the same room, let them work on their own in silence.  This step can also easily be conducted remotely.

Step 4: Discussion

This is where it gets fun.  Identify items where there is the most disagreement (as defined by grouped frequency distributions or standard deviation) and make the SMEs discuss it.  Maybe two SMEs thought it was super easy and gave it a 95 and two other SMEs thought it was super hard and gave it a 45.  They will try to convince the other side of their folly. Chances are that there will be no shortage of opinions and you, as the facilitator, will find your greatest challenge is keeping the meeting on track. This step can be conducted in person, or via webinar.

Step 5: Round 2 Ratings

Raters then re-rate the items based on the discussion.  The goal is that there will be a greater consensus.  In the previous example, it’s not likely that every rater will settle on a 70.  But if your raters all end up from 60-80, that’s OK. How do you know there is enough consensus?  We recommend the inter-rater reliability suggested by Shrout and Fleiss (1979), as well as looking at inter-rater agreement and dispersion of ratings for each item. This use of multiple rounds is known as the Delphi approach; it pertains to all consensus-driven discussions in any field, not just psychometrics.
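As a concrete illustration of that reliability check, here is a minimal sketch, assuming an items-by-raters matrix of Round 2 ratings.  The ICC(2,1) formulation (two-way random effects, single rater, absolute agreement) follows Shrout and Fleiss (1979), but the ratings themselves are hypothetical.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1) from Shrout & Fleiss (1979): two-way random effects, absolute agreement.
    ratings: items (rows) by raters (columns)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()   # between items
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_error = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical Round 2 Angoff ratings: 5 items (rows) by 4 SMEs (columns)
ratings = np.array([
    [70, 75, 65, 70],
    [80, 85, 80, 75],
    [60, 65, 70, 60],
    [90, 85, 90, 95],
    [75, 70, 80, 75],
], dtype=float)

print(round(icc_2_1(ratings), 2))  # higher values indicate stronger consensus
```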

Step 6: Evaluate Results and Final Recommendation

Evaluate the results from Round 2 as well as Round 1.  An example of this is below.  What is the recommended cutscore – the average or sum of the Angoff ratings, depending on the scale you prefer?  Did the reliability improve?  Estimate the mean and SD of examinee scores (there are several methods for this).  What sort of pass rate do you expect?  Even better, utilize the Beuk Compromise as a “reality check” between the modified-Angoff approach and actual test data.  You should take multiple points of view into account, and the SMEs need to vote on a final recommendation.  They, of course, know the material and the candidates, so they have the final say.  This means that standard setting is partly a political process; again, reduce that effect as much as you can.

Some organizations do not set the cutscore at the recommended point, but at one standard error of judgment (SEJ) below the recommended point.  The SEJ is based on the inter-rater reliability; note that it is NOT the standard error of the mean or the standard error of measurement.  Some organizations use the latter; the former is just plain wrong (though I have seen it used by amateurs).

 

modified angoff
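Here is a minimal sketch of the Step 6 arithmetic, assuming a raters-by-items matrix of final ratings.  The numbers are hypothetical; they simply show how the average (or sum) of the item means becomes the recommended cutscore, and how per-item standard deviations flag remaining disagreement.

```python
import numpy as np

# Hypothetical final Angoff ratings: 6 SMEs (rows) by 10 items (columns),
# each value = estimated percent of MCCs answering the item correctly
ratings = np.array([
    [70, 80, 65, 90, 75, 60, 85, 70, 80, 75],
    [65, 85, 60, 95, 70, 65, 80, 75, 85, 70],
    [75, 80, 70, 90, 80, 55, 90, 70, 75, 80],
    [70, 75, 65, 85, 75, 60, 85, 65, 80, 75],
    [80, 85, 70, 90, 70, 65, 80, 75, 85, 70],
    [65, 80, 60, 95, 75, 60, 85, 70, 80, 75],
], dtype=float)

item_means = ratings.mean(axis=0)         # expected percent-correct per item
item_sds = ratings.std(axis=0, ddof=1)    # remaining disagreement per item

cut_percent = item_means.mean()           # cutscore on the percent-correct metric
cut_points = (item_means / 100).sum()     # cutscore on the number-correct metric

print(f"Recommended cutscore: {cut_percent:.1f}% ({cut_points:.1f} of 10 points)")
print("Items with the most disagreement:", np.argsort(-item_sds)[:3] + 1)
```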

Step 7: Write Up Your Report

Validity refers to evidence gathered to support test score interpretations.  Well, you have lots of relevant evidence here. Document it.  If your test gets challenged, you’ll have all this in place.  On the other hand, if you just picked 70% as your cutscore because it was a nice round number, you could be in trouble.

Additional Topics

In some situations, there are more issues to worry about.  Multiple forms?  You’ll need to equate in some way.  Using item response theory?  You’ll have to convert the cutscore from the modified-Angoff method onto the theta metric using the Test Response Function (TRF).  New credential and no data available? That’s a real chicken-and-egg problem there.

Where Do I Go From Here?

Ready to take the next step and actually apply the modified-Angoff process to improving your exams?  Sign up for a free account in our  FastTest item banker. You can also download our Angoff analysis tool for free.

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420.

test response functions

Item response theory (IRT) is a family of machine learning models in the field of psychometrics, which are used to design, analyze, validate, and score assessments.  It is a very powerful psychometric paradigm that allows researchers to build stronger assessments, whether they work in Education, Psychology, Human Resources, or other fields.  It also solves critical measurement problems like equating across years, designing adaptive tests, or creating vertical scales.

Want to learn more about IRT, how it works, and why it is so important for assessment?  Read on.

What is Item Response Theory?

IRT is a family of models that try to describe how examinees respond to items on a test, hence the name.  These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves.  However, item response theory ended up doing so much more.

Example item response theory function

IRT is model-driven, in that there is a specific mathematical equation that is assumed, and we fit the models based on raw data, similar to linear regression.  There are different parameters (a, b, c) that shape this equation to different needs.  That’s what defines different IRT models.  This will be discussed at length below.

The models put people and items onto a latent scale, which is usually called θ (theta).  This represents whatever is being measured, whether IQ, anxiety, or knowledge of accounting laws in Croatia.  IRT helps us understand the nature of the scale, how a person answers each question, the distribution of item difficulty, and much more.  IRT used to be known as latent trait theory and item characteristic curve theory.

IRT requires specially-designed software.  Click the link below to download our software Xcalibre, which provides a user-friendly and visual platform for implementing IRT.

 

IRT analysis with Xcalibre

 

Why do we need Item Response Theory?

IRT represents an important innovation in the field of psychometrics. While now more than 50 years old – assuming the “birth” is the classic Lord and Novick (1968) text – it is still underutilized and remains a mystery to many practitioners.

Item response theory is more than just a way of analyzing exam data; it is a paradigm that drives the entire lifecycle of designing, building, delivering, scoring, and analyzing assessments.

IRT requires larger sample sizes and is much more complex than its predecessor, classical test theory, but is also far more powerful.  IRT requires quite a lot of expertise, typically a PhD.  So it is not used for small assessments like a final exam at universities, but is used for almost all major assessments in the world.

 

The Driver: Problems with Classical Test Theory

Classical test theory (CTT) is approximately 100 years old, and still remains commonly used because it is appropriate for certain situations, and it is simple enough that it can be used by many people without formal training in psychometrics.  Most statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  A list of these is presented later.

Learn more about the differences between CTT and IRT here.

 

Item Response Theory Parameters

The foundation of IRT is a mathematical model defined by item parameters.  A parameter is an aspect of a mathematical model that can change its shape or other aspects.  For dichotomous items (those scored correct/incorrect), each item has three parameters:

 

   a: the discrimination parameter, an index of how well the item differentiates low from high examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.

   b: the difficulty parameter, an index of the examinee level for which the item is most appropriate; typically ranges from -3 to +3, with 0 being an average examinee level.

   c: the pseudo-guessing parameter, a lower asymptote; typically close to 1/k, where k is the number of options.

These parameters are used in the formula below, and are also displayed graphically.

3PL irt equation

Item response function

These parameters are used to graphically display an item response function (IRF), which models the probability of a correct answer as a function of ability.  In the example IRF, the a parameter is approximately 1.0, indicating a fairly discriminating test item.  The b parameter is approximately 0.0 (the point on the x-axis where the midpoint of the curve sits), indicating an average-difficulty item; examinees of average ability would have a 60% chance of answering correctly.  The c parameter is approximately 0.20, as expected for a 5-option multiple choice item.  Consider the x-axis to be z-scores on a standard normal scale.

In some cases, there is no guessing involved, and we only use a and b.  This is called the two-parameter model.  If we only use b, this is the one-parameter or Rasch model.  Here is how that is calculated.

One-parameter-logistic-model-IRT
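For readers who prefer code to formulas, here is a minimal sketch of the three dichotomous models in Python.  The scaling constant D = 1.7 and the example parameter values are assumptions for illustration, so the printed probabilities will only roughly match the approximate figures discussed below.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic IRF: probability of a correct response at ability theta."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def irf_2pl(theta, a, b, D=1.7):
    """Two-parameter model: the 3PL with no guessing (c = 0)."""
    return irf_3pl(theta, a, b, 0.0, D)

def irf_rasch(theta, b):
    """One-parameter / Rasch model: equal discrimination for all items and no guessing."""
    return irf_3pl(theta, 1.0, b, 0.0, D=1.0)

# An item like the example IRF discussed below: a ~ 1.0, b ~ 0.0, c ~ 0.20
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(irf_3pl(theta, a=1.0, b=0.0, c=0.20), 2))
```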

Item parameters, which are crucial within the IRT framework, might change over time or multiple testing occasions, a phenomenon known as item parameter drift.

 

Example Item Response Theory calculations

Examinees with higher ability are much more likely to respond correctly.  Look at the graph above.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only a 25% chance – barely above the 1 in 5 guessing rate of 20%.  An average person (0.0) has a 60% chance.  Why 60?  Because we are accounting for guessing.  If the curve went from 0% to 100% probability, then yes, the middle would be a 50% chance.  But here, we assume 20% as a baseline due to guessing, so halfway up is 60%.

five item response functions

Of course, the parameters can and should differ from item to item, reflecting differences in item performance.  The following graph shows five IRFs with the three-parameter model.  The dark blue line is the easiest item, with a b of -2.00.  The light blue item is the hardest, with a b of +1.80.  The purple one has a c=0.00 while the light blue has c=0.25, indicating that it is more susceptible to guessing.

These IRFs are not just a pretty graph or a way to describe how an item performs.  They are the basic building block to accomplishing those important goals mentioned earlier.  That comes next…

 

Applications of Item Response Theory to Improve Assessment

Item response theory uses the IRF for several purposes.  Here are a few.

test information function from item response theory

  1. Interpreting and improving item performance
  2. Scoring examinees with maximum likelihood or Bayesian methods
  3. Form assembly, including linear on the fly testing (LOFT) and pre-equating
  4. Calculating the accuracy of examinee scores
  5. Development of computerized adaptive tests (CAT)
  6. Post-equating
  7. Differential item functioning (finding bias)
  8. Data forensics to find cheaters or other issues

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the conditional standard error of measurement (CSEM) and the test information function (TIF).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is inversely related to the TIF (one divided by the square root of the information), and has the interpretable advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the SEM is a 95% confidence interval for their score.  The graph above shows part of the form assembly process in our  FastTest  platform.
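Here is a minimal sketch of how the TIF and CSEM can be computed from 3PL item parameters.  The item pool is hypothetical and the information formula is the standard 3PL Fisher information, so treat it as an illustration rather than a reproduction of any platform's output.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def item_info_3pl(theta, a, b, c, D=1.7):
    """Fisher information for a 3PL item at a given theta."""
    P = p_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((P - c) ** 2 / (1 - c) ** 2) * ((1 - P) / P)

# Hypothetical 5-item form (a, b, c)
a = np.array([0.9, 1.1, 0.7, 1.3, 1.0])
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
c = np.array([0.20, 0.25, 0.20, 0.22, 0.18])

theta_grid = np.linspace(-3, 3, 61)
tif = np.array([item_info_3pl(t, a, b, c).sum() for t in theta_grid])  # test information
csem = 1 / np.sqrt(tif)                                                # conditional SEM

# 95% confidence interval for an examinee with an estimated theta of 0.5
i = np.argmin(np.abs(theta_grid - 0.5))
print(0.5 - 1.96 * csem[i], 0.5 + 1.96 * csem[i])
```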

 

Assumptions of Item Response Theory

Item response theory assumes a few things about your data.

  1. The latent trait you are measuring is unidimensional.  If it is multidimensional, there is multidimensional item response theory, or you can treat the dimensions as separate traits.
  2. Items have local independence, which means that the act of answering one is not impacted by others.  This affects the use of testlets and enemy items.
  3. The probability of responding correctly to an item (or in a certain category, in the case of polytomous items like Likert scales) is a function of the examinee’s ability/trait level and the parameters of the model, following the item response function, with some allowance for random error.  As a corollary, we are assuming that the ability/trait has some distribution, with some people having higher or lower levels (e.g., intelligence), and that we are trying to find those differences.

Many texts will only postulate the first two as assumptions, because the third is just implicitly assumed.

 

Advantages and Benefits of Item Response Theory

So why does this matter?  Let’s go back to the problems with classical test theory.  Why is IRT better?

  • Sample-independence of scale: Classical statistics are all sample dependent, and unusable on a different sample; results from IRT are sample-independent up to a linear transformation.  Two samples of different ability levels can be easily converted onto the same scale.
  • Test statistics: Classical statistics are tied to a specific test form; IRT statistics are not.
  • Sparse matrices are OK: Classical test statistics do not work with sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing.
  • Linking/equating: Item response theory has much stronger equating, so if your exam has multiple forms, or if you deliver twice per year with a new form, you can have much greater validity in the comparability of scores.
  • Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect.
  • Vertical scaling: IRT can do vertical scaling but CTT cannot.
  • Accounting for guessing: CTT does not account for guessing on multiple choice exams.
  • Scoring: Scoring in classical test theory does not take into account item difficulty.  With IRT, you can score a student on any set of items and be sure it is on the same latent scale (see the scoring sketch after this list).
  • Adaptive testing: CTT does not support adaptive testing in most cases.  Adaptive testing has its own list of benefits.
  • Characterization of error: CTT assumes that every examinee has the same amount of error in their score (SEM); IRT recognizes that if the test is all middle-difficulty items, then low or high students will have inaccurate scores.
  • Stronger form building: IRT has functionality to build forms to be more strongly equivalent and meet the purposes of the exam.
  • Nonlinear function: CTT assumes a linear relationship between ability and item performance (the point-biserial), which is blatantly unrealistic; IRT models the relationship with a nonlinear function that better reflects how examinees actually behave.
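To make the scoring point above concrete, here is a minimal sketch of maximum-likelihood theta estimation via a simple grid search, assuming previously calibrated 3PL parameters.  The parameters and response string are hypothetical, and production software typically uses faster Newton-type or Bayesian routines.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def mle_theta(responses, a, b, c):
    """Maximum-likelihood theta estimate for one examinee, via a simple grid search."""
    grid = np.linspace(-4, 4, 801)
    best_theta, best_ll = grid[0], -np.inf
    for theta in grid:
        P = p_3pl(theta, a, b, c)
        ll = np.sum(responses * np.log(P) + (1 - responses) * np.log(1 - P))
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Hypothetical calibrated item parameters and one examinee's 0/1 responses
a = np.array([0.8, 1.2, 1.0, 0.9, 1.1])
b = np.array([-1.0, -0.3, 0.1, 0.7, 1.4])
c = np.array([0.20, 0.20, 0.25, 0.20, 0.20])
responses = np.array([1, 1, 1, 0, 0])

print(mle_theta(responses, a, b, c))  # theta on the same scale as any other item set
```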

 

Item Response Theory Models: One Big Happy Family

Remember: IRT is actually a family of models, making flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and the fit of the data.  If there are multipoint items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters. Learn more about the partial credit situation here.

Here’s a quick breakdown of the family tree, with the most common models.

 

How do I analyze my test with Item Response Theory?

OK item fit

First: you need to get special software.  There are some commercial packages like  Xcalibre, or you can use packages inside platforms like R and Python.

The software will analyze the data in cycles or loops to try to find the best model.  This is because, as always, the data do not always align perfectly with the model.  You might see graphs like the one above if you compared actual proportions (red) to the predictions from the item response function (black).  That’s OK!  IRT is quite robust.  And there are analyses built in to help you evaluate model fit.

Some more unpacking of the image above:

  • This was item #39 on the test
  • We are using the three parameter logistic model (3PL), as this was a multiple choice item with 4 options
  • 3422 examinees answered the item
  • 76.9% of them got it correct
  • The classical item discrimination (point biserial item-total correlation) was 0.253, which is OK but not very high
  • The a parameter was 0.432, which is OK but not very strong
  • The b parameter was -1.195, which means the item was quite easy
  • The c parameter was 0.248, which you would expect if there was a 25% chance of guessing
  • The Chi-square fit statistic rejected the null, indicating poor fit, but this statistic is sensitive to sample size
  • The z-Resid fit statistic is a bit more robust, and it did not flag the item for bad fit

Xcalibre-poly-output
The image here shows output from  Xcalibre  for the generalized partial credit model, which is a polytomous model often used for items scored with partial credit.  For example, a question might list 6 animals and ask students to click on the ones that are reptiles, of which there are 3.  The possible scores are then 0, 1, 2, or 3.

Here, the graph labels them as 1-2-3-4, but the meaning is the same.  Here is how you can interpret this.

  • Someone is likely to get 0 points if their theta is below -2.0 (bottom 3% or so of students).
  • A few low students might get 1 point (green)
  • Low-middle ability students are likely to get 2 correct (blue)
  • Anyone above average (0.0) is likely to get all 3 correct.

The boundary locations are where one level becomes more likely than another, i.e., where the curves cross.  For example, you can see that the blue and black lines cross at the boundary -0.339.
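Here is a minimal sketch of the generalized partial credit model itself, assuming a single discrimination parameter and a set of boundary (threshold) locations.  The values are hypothetical rather than the Xcalibre output shown above, but they illustrate how each score category becomes the most likely one between its boundaries.

```python
import numpy as np

def gpcm_probs(theta, a, boundaries, D=1.7):
    """Generalized partial credit model: probability of each score category at theta.
    boundaries[k] is the location where category k+1 overtakes category k."""
    # Cumulative sums of D*a*(theta - b_k); score category 0 contributes 0
    steps = np.concatenate(([0.0], np.cumsum(D * a * (theta - boundaries))))
    expo = np.exp(steps - steps.max())   # subtract the max for numerical stability
    return expo / expo.sum()

# Hypothetical 0-3 point item with three boundary locations
a = 0.9
boundaries = np.array([-1.8, -0.9, 0.1])

for theta in (-2.5, -1.2, -0.4, 1.0):
    probs = gpcm_probs(theta, a, boundaries)
    print(theta, np.round(probs, 2), "-> most likely score:", int(probs.argmax()))
```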

Where can I learn more?

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software,  Xcalibre.

Want to talk to one of our experts about how to apply IRT?  Get in touch!



Criterion-related validity is evidence that test scores relate to other variables in the way we expect them to.  This is an essential part of the larger issue of test score validity, which is providing evidence that test scores have the meaning we intend them to have.  If you’ve ever felt that a test doesn’t cover what it should be covering, or that it doesn’t reflect the skills needed to perform the job you are applying for – that’s validity.

What is criterion-related validity?

Criterion-related validity is an aspect of test score validity which refers to evidence that scores from a test correlate with an external variable that it should correlate with.  In many situations, this is the critical consideration of a test; for example, a university admissions exam would be quite suspect if scores did not correlate well with high school GPA or accurately predict university GPA.  That is literally its purpose for existence, so we want to have some proof that the test is performing that way.  A test serves its purpose, and people have faith in it, when we have such highly relevant evidence.

Incremental validity is a specific aspect of criterion-related validity that assesses the added predictive value of a new assessment or variable beyond the information provided by existing measures.  There are two approaches to establishing criterion-related validity: concurrent and predictive.  There are also two directions: discriminant and convergent.

Concurrent validity

The concurrent approach to criterion-related validity means that we are looking at variables at the same point in time, or at least very close.  In the example of university admissions testing, this would be correlating the test scores with high school GPA.  The students would most likely just be finishing high school at the time they took the test, excluding special cases like students that take a gap year before university.

Predictive validity

The predictive validity approach, as its name suggests, concerns the prediction of future variables. For instance, in university admissions testing, we use test scores to predict outcomes like university GPA or graduation rates. Studies show that SAT scores correlate with college GPA, with values typically ranging from 0.30 to 0.50, reflecting data from predictive validity research conducted by organizations like the College Board.  A common application of this is pre-employment testing, where job candidates are tested with the goal of predicting positive variables like job performance, or variables that the employer might want to avoid, like counterproductive work behavior.  Which leads us to the next point…

Convergent validity

Convergent validity refers to criterion-related validity where we want a positive correlation, such as test scores with job performance or university GPA.  This is frequently the case with criterion-related validity studies.  One thing to be careful of in this case is differential prediction, also known as predictive bias.  This is where the validity is different for one group of examinees, often a certain demographic group, even though the average score might be the same for each group.

Here is an example of the data you might evaluate for predictive convergent validity of a university admissions test.

Predictive validity
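As a minimal sketch of how such data are evaluated, the snippet below computes a predictive validity coefficient as the Pearson correlation between admissions test scores and the same students' later university GPA.  The numbers are hypothetical, not College Board data.

```python
import numpy as np

# Hypothetical admissions data: test scores and later university GPA for 10 students
test_scores = np.array([1050, 1120, 1180, 1230, 1310, 1390, 1420, 1480, 1510, 1550])
university_gpa = np.array([2.6, 2.8, 2.9, 3.4, 3.0, 3.5, 3.2, 3.7, 3.5, 3.6])

# Pearson correlation as the criterion-related (predictive) validity coefficient
r = np.corrcoef(test_scores, university_gpa)[0, 1]
print(f"Predictive validity coefficient: r = {r:.2f}")

# Computing the same correlation within demographic subgroups is one way to
# screen for differential prediction (predictive bias).
```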

Discriminant validity

Unlike convergent validity, discriminant validity is where we want the correlation with other variables to be negative or zero.  As noted above, some pre-employment tests have this case.  An integrity or conscientiousness assessment should correlate negatively with instances of counterproductive work behavior, perhaps quantified as the number of disciplinary marks on employee HR files.  In some cases, the goal might be to find a zero correlation.  That can be the case with noncognitive traits, where a measure of conscientiousness should not have a strong correlation in either direction with the other members of the Big Five.

The big picture

Validity is a complex topic with many aspects.  Criterion-related validity is only one part of the picture.  However, as seen in some of the examples above, it is profoundly critical to some types of assessment, especially where the exam exists only to predict some future variables.

Want to delve further into validity?  The classic reference is Cronbach & Meehl (1955).  We also recommend work by Messick, such as this one.  Of course, check with relevant standards to your assessment, such as AERA/APA/NCME or NCCA.


Digital assessment (DA), also known as e-Assessment or electronic assessment, is the delivery of assessments, tests, surveys, and other measures via digital devices such as computers, tablets, and mobile phones.  The primary goal is to be able to develop items, publish tests, deliver tests, and provide meaningful results – as quickly, easily, and validly as possible.  The use of computers enables many modern benefits, from adaptive testing (e.g., the adaptive SAT) to tech-enhanced items.  To deliver digital assessment, an organization typically implements a cloud-based digital assessment platform.  Such platforms do much more than just delivery, though, and modules include:

test development cycle fasttest

 

 

Why Digital Assessment / e-Assessment?

Globalization and digital technology are rapidly changing the world of education, human resources, and professional development. Teaching and learning are becoming more learner-centric, and technology provides an opportunity for assessment to be integrated into the learning process with corresponding adjustments. Furthermore, digital technology grants opportunities for teaching and learning to move their focus from content to critical thinking. Teachers are already implementing new strategies in classrooms, and assessment needs to reflect these changes, as well.

Looking for such a platform?  Request a free account in ASC’s industry-leading e-Assessment ecosystem.

 


 

Advantages of Digital Assessment

Accessibility

 

online testing platform

One of the main pros of DA is the ease of use for staff and learners – examiners can easily set up questionnaires, determine grading methods, and send invitations to examinees. In turn, examinees do not always have to be in a classroom setting to take assessments and can do so remotely in a more comfortable environment. In addition, DA gives learners the option of taking practice tests whenever it is convenient for them.

Transparency

DA allows educators to quickly evaluate the performance of a group against an individual learner for analytical and pedagogical reasons. The report-generating capabilities of DA enable educators to identify learning problem areas on both individual and group levels soon after assessments occur, in order to adapt to learners’ needs, strengths, and weaknesses. As for learners, DA provides them with instant feedback, unlike traditional paper exams.

Profitability

Conducting exams online, especially at scale, is very practical: there is no need to print innumerable question papers, involve all school staff in organizing procedures, assign invigilators, invite hundreds of students to spacious classrooms to take tests, and provide them with answer sheets and supplementary materials. Thus, flexibility of time and venue, along with lowered human, logistic, and administrative costs, gives electronic assessment a considerable advantage over traditional exam settings.

Eco-friendliness

In this digital era, our utmost priority should be minimizing the detrimental effects that pen-and-paper exams have on the environment. Cutting down trees for paper can no longer be the norm, as it has an adverse environmental impact. DA ensures that organizations and institutions can go paper-free and avoid printing exam papers and other materials. Furthermore, DA takes up less storage space, since all data can be stored on a single server rather than in paper records.

Security

Enhanced security for students is another advantage of digital assessment. A secure assessment system, supported by AI-based proctoring features, greatly reduces the probability of malicious activities such as cheating and other unlawful practices that could rig the system and lead to incorrect results. It also makes students more likely to accept test results without contesting them, which, in turn, fosters a more positive mindset toward institutions and organizations and builds stronger mutual trust between educators and learners.

Autograding

The benefits of DA include an automated grading system that is more convenient and time-efficient than standard marking and grading procedures, and that minimizes human error. Automated scoring compares examinees’ responses against model answers and makes the relevant judgements. The spread of technology in e-education and the increasing number of learners demand a sophisticated scoring mechanism that eases teachers’ burden, saves time, and ensures fairness of assessment results. For example, digital assessment platforms can include complex modules for essay scoring, or easily implement item response theory and computerized adaptive testing.

Time-efficiency

Those involved in designing, managing, and evaluating assessments are aware of the tediousness of these tasks. Probably the most routine process among assessment procedures is manual invigilation, which can easily be avoided by employing proctoring services. Smart exam software, such as FastTest, features options for automated item generation, item banking, test assembly, and publishing, saving precious time that would otherwise be wasted on repetitive tasks. Examiners need only upload the examinees’ emails or IDs to invite them to the assessment. The best part is the instant exporting of results and delivery of reports to stakeholders.

Public relations and visibility

There is considerably lower use of pen and paper in the digital age. The infusion of technology has altered human preferences, so these days the vast majority of educators rely on computers for communication, presentations, digital design, and various other tasks. Educators have an opportunity to mix question styles on exams, including graphics, to make them more interactive than paper ones. Many educational institutions utilize learning management systems (LMS) to publish study materials on cloud-based platforms and enable educators to evaluate and grade with ease. In turn, students benefit from such systems as they can submit their assignments remotely.

 

Challenges of Implementing Digital Assessment

Difficulty in grading long-answer questions

DA copes brilliantly with multiple-choice questions; however, there are still some challenges with grading long-answer questions. This is where digital assessment intersects with the traditional kind, as subjective answers call for manual grading. Luckily, technology in the education sector continues to evolve, and even essays can already be marked digitally with the help of AI features on platforms like FastTest.

Need to adapt

Implementing something new always brings disruption and demands some time to familiarize all stakeholders with it. Obviously, the transition from traditional assessment to DA will require certain investments, such as professional development of staff and funds to upgrade systems. Some staff and students might even resist the change and feel isolated without face-to-face interactions. However, this stage is inevitable and will definitely be a step forward for both educators and learners.

Infrastructural barriers & vulnerability

One of the major cons of DA is that technology is not always reliable, and some locations cannot provide all examinees with stable access to electricity, internet connection, and other basic system requirements. This is a huge problem in developing nations, and still remains a problem in many areas of well-developed nations. In addition, integrating DA technology might be very costly if the assessment design, both conceptual and aesthetic, is poorly planned. Such barriers hamper DA, which is why authorities should consider addressing them prior to implementation.

Selecting a Digital Assessment Platform

Digital assessment is a critical component in education and workforce assessment, managing and delivering exams via the internet.  It requires a cloud-based platform that is designed specifically to build, deliver, manage, and validate exams that are either large-scale or high-stakes.  It is a critical core-business tool for high-stakes professional and educational assessment programs, such as certification, licensure, or university admissions.  There are many, many software products out in the market that provide at least some functionality for online testing.

The biggest problem when you start shopping is that there is an incredible range in quality, though there are also other differentiators, such as some being made only to deliver pre-packaged employment skill tests rather than being for general usage.  This article provides some tips on how to implement e-assessment more effectively.

Type of e-Assessment tools

So how do you know what level of quality you need in an e-Assessment solution?  It mostly depends on the stakes of your test, which governs the need for quality in the test itself, which then drives the need for a quality platform to build and deliver the test.  This post helps you identify the types of functionality that set apart “real” online exam platforms, and you can evaluate which components are most critical for you once you go shopping.

Here is one way to think about what sort of solution you need; it crosses whether a system is dedicated to assessment with whether it operates at a professional level.

  • Not dedicated to assessment – Non-professional level: systems that can do minimal assessment and are inexpensive, such as survey software (LimeSurvey, QuestionPro, etc.).  Professional level: related systems like high-quality LMS platforms (Blackboard, Canvas); these have some assessment functionality but lack professional functionality like IRT, adaptive testing, and true item banking.
  • Dedicated to assessment – Non-professional level: systems designed for assessment but without professional functionality; anybody can make a simple platform for MCQ exams.  Professional level: powerful systems designed for high-stakes exams, with professional functionality like IRT/CAT.

What is a professional assessment platform, anyway?

test development cycle fasttest

A true e-Assessment system is much more than an exam module in a learning management system (LMS) or an inexpensive quiz/survey maker.  A real online exam platform is designed for professionals, that is, people whose entire job is to make assessments.  A good comparison is a customer relationship management (CRM) system.  That is a platform designed for use by people whose job is to manage customers, whether serving existing customers or managing the sales process.  While it is entirely possible to use a spreadsheet to manage such things at a small scale, any organization operating at real scale will leverage a true CRM like SalesForce or Zoho.   You wouldn’t hire a team of professional sales experts and then have them waste hours each day in a spreadsheet; you would give them SalesForce to make them much more effective.

The same is true for online testing and assessment.  If you are a teacher making math quizzes, then Microsoft Word might be sufficient.  But there are many organizations that are doing a professional level of assessment, with dedicated staff.  Some examples, by no means an exhaustive list:

  • Professional credentialing: Certification and licensure exams that a person passes to work in a profession, such as chiropractors
  • Employment: Evaluating job applicants to make sure they have relevant skills, ability, and experience
  • Universities: Not for classroom assessments, but rather for topics like placement exams of all incoming students, or for nationwide admissions exams
  • K-12 benchmark: If you are a government that tests all 8th graders at the end of the year, or a company that delivers millions of formative assessments

 

Goal 1: Item banking that makes your team more efficient

True item banking:  The platform should treat items as reusable objects that exist with persistent IDs and metadata.  Learn more about item banking.

Configurability:  The platform should allow you to configure how items are scored and presented, such as font size, answer layout, and weighting.

Multimedia management:  Audio, video, and images should be stored in their own banks, with their own metadata fields, as reusable objects.  If an image is in 7 questions, you should not have to upload 7 times… you upload once and the system tracks which items use it.

item review kanban

Statistics and other metadata:  All items should have many fields that are essential metadata: author name, date created, tests which use the item, content area, Bloom’s taxonomy, classical statistics, IRT parameters, and much more.

Custom fields:  You should be able to create any new metadata fields that you like.

Item review workflow:  Professionally built items will go through a review process, like Psychometric Review, English Editing, and Content Review. The platform should manage this, allowing you to assign items to people with due dates and email notifications.

Standard Setting:  The exam platform should include functionality to help you do standard setting like the modified-Angoff approach.

Automated item generation:  There should be functionality for automated item generation.

Powerful test assembly:  When you publish a test, there should be many options, including sections, navigation limits, paper vs online, scoring algorithms, instructional screens, score reports, etc.  You should also have aids in psychometric aspects, such as a Test Information Function.

Equation Editor:  Many math exams need a professional equation editor to write the items, embedded in the item authoring.

 

Goal 2: Professional-grade exam delivery

Scheduling options:  Date ranges for availability, retake rules, alternate forms, passwords, etc.  These are essential for maintaining the integrity of high stakes tests.

Item response theory:  Item response theory is a modern psychometric paradigm used by organizations dedicated to stronger assessment.  It is far superior to the oversimplified, classical approach based on proportions and correlations, and it underpins the adaptive testing options described below.

Linear on the fly testing (LOFT):  Suppose you have a pool of 200 questions, and you want every student to get 50 randomly picked, but balanced so that there are 10 items from each of 5 content areas.  This is known as linear-on-the-fly testing, and can greatly enhance the security and validity of the test.

Computerized adaptive testing:  This uses AI and machine learning to customize the test uniquely to every examinee.  Adaptive testing is much more secure, more accurate, more engaging, and can reduce test length by 50-90%.

Tech-enhanced item types:  Drag and drop, audio/video, hotspot, fill-in-the-blank, etc.

Scalability:  Because most “real” exams will be doing thousands, tens of thousands, or even hundreds of thousands of examinees, the online exam platform needs to be able to scale up.

Online essay marking:  The e-Assessment platform should have a module to score open-response items. Preferably with advanced options, like having multiple markers or anonymity.

 

Goal 3: Maintaining test integrity and security during e-Assessment

New test scheduler sites proctor code

Delivery security options:  There should be choices for how to create/disseminate passcodes, set time/date windows, disallow movement back to previous sections, etc.

Lockdown browser:  An option to deliver with software that locks the computer while the examinee is in the test.

Remote proctoring:  There should be an option for remote (online) proctoring.  This can be AI, record and review, or live.

Live proctoring:  There should be functionality that facilitates live human proctoring, such as in computer labs at a university.  The system might have Proctor Codes or a site management module.

User roles and content access:  There should be various roles for users, as well as options to limit them by content.  For example, limiting a Math teacher doing reviews to do nothing but review Math items.

Rescoring:  If items are compromised or challenged, you need functionality to easily remove them from scoring for an exam, and rescore all candidates

Live dashboard:  You should be able to see who is in the online exam, stop them if needed, and restart or re-register if needed.

 

Goal 4: Powerful reporting and exporting

iteman item analysis

Support for QTI:  You should be able to import and export items with QTI, as well as common formats like Word or Excel.

Psychometric analytics & data visualization:  You should be able to see reports on reliability, standard error of measurement, point-biserial item discriminations, and all the other statistics that a psychometrician needs.  Sophisticated users will need things like item response theory.

Exporting of detailed raw files:  You should be able to easily export the examinee response matrix, item times, item comments, scores, and all other result data.

API connections:  You should have options to set up APIs to other platforms, like an LMS or CRM.

 

General Considerations

Ease-of-Use:  As Albert Einstein said, “Everything should be made as simple as possible, but no simpler.”  The best e-Assessment software is one that offers sophisticated solutions in a way that anyone can use.  Power users should be able to leverage technology like adaptive testing, while there should also be simpler roles for item writers or reviewers.

Integrations:  Your platform should integrate with learning management systems, job applicant tracking systems, certification management systems, or whatever other business operations software is important to you.

Support and Training:  Does the platform have a detailed manual?  Bank of tutorial videos?  Email support from product experts?  Training webinars?

 

OK, now how do I find a Digital Assessment platform that fits my needs?

If you are out shopping, ask about the aspects in the list above.  Be sure to check vendors’ websites for documentation on these.  There is a huge range out there, from free survey software up to multi-million dollar platforms.

Want to save yourself some time?  Click here to request a free account in our platform.

 

Conclusion

To sum up, implementing DA has its merits and demerits, as outlined above. Even though technology simplifies and enhances many processes for institutions and stakeholders, it still has some limitations. Nevertheless, most drawbacks can be averted by choosing the right methodology and examination software. We cannot reject the necessity of transitioning from traditional forms of assessment to digital ones, and the benefits of DA outweigh its drawbacks and costs by far. Of course, it is up to you to choose whether to keep using hard-copy assessments or go for the online option. However, we believe that in the digital era all you need to do is plan wisely and choose an easy-to-use and robust examination platform with AI-based anti-cheating measures, such as FastTest, to secure credible outcomes.

 


 

Leadership assessments are more than just tools; they are crucial to identifying and developing effective organizational leadership. Plenty of options exist for “leadership assessments,” from off-the-shelf tools costing $15 to incredibly bespoke, intense, and sometimes invasive assessments that involve multiple psychologists and can cost upwards of $50,000. In this blog, I’ll provide a framework for leadership assessments with the right amount of measurement rigor that won’t break the bank or your candidates. I’ll focus on leadership assessments in the selection (i.e., hiring) context, although the information extends to assessments for leadership development.

The Role of Job Analysis in Leadership Assessments

In any selection context, it’s important to begin with a job analysis so you understand what you are trying to measure. In other words, what does “good” look like? We find that using historical information (e.g., job description, ONET data, industry information, etc.) followed by detailed discussions with subject matter experts within an organization appropriately balances efficiency and comprehensiveness in pinpointing the essential skills and tasks necessary for a leader to succeed. 

Designing Robust Leadership Assessments



From the job analysis, we gather a list of the critical skills along with their importance ratings. With this information, we create a comprehensive leadership assessment incorporating several methodologies to evaluate potential leaders. These include:

  • Construct Validity: Utilizing psychometric tools that are rigorously tested and validated to measure specific leadership constructs effectively. For example, measuring a candidate’s ability to influence may involve assessments like the Hogan Insight Series and the Leadership Effectiveness Analysis, as they have proven their constructs with leadership samples over the past decades.
  • Content Validity: Structured interviews reflect the competencies identified during the job analysis stage. These interviews are crafted to probe deep into the candidate’s experiences and skills in areas critical to the role.
  • Criterion-Related Validity: The real test of an assessment’s value is its ability to predict actual job performance. Correlating assessment outcomes with real job performance metrics confirms the predictive validity of the assessment tools.

At People Strategies, we integrate these approaches to ensure our leadership assessments are not just theoretical but also practical and actionable.

Effective Reporting for Informed Decision-Making

Despite the complex underpinnings, the reporting of assessment results is simplified to aid quick and informed decision-making. Each leadership skill assessed is rated and integrated into a weighted fit score. This score reflects its importance as determined in the job analysis, providing a clear, quantifiable measure of a candidate’s suitability for the leadership role.

Conclusion

Effective leadership assessments are vital for nurturing and selecting the right leaders to meet an organization’s strategic goals. At People Strategies, we emphasize a scientific approach, integrating thorough validation processes to ensure that our assessments are accurate and deeply informative. This rigorous methodology allows us to deliver assessments that are both comprehensive in their analysis and practical in their application, ensuring that organizations can confidently make crucial leadership decisions.

You may also be interested in reading two related blog posts: ‘Improving Employee Retention with Assessment: Strategies for Success’ and ‘HR Assessment Software: Approaches and Solutions.’

About the Guest Author: David Dubin, PhD

David Dubin, PhD, is founder and principal of People Strategies.  People Strategies knows that a business strategy means nothing without the people. We help busy leaders and HR professionals find and implement the tools they need to bring the benefits of people science to their organizations so they can bring hiring and employee development to the next level.

 

parcc ebsr items

The Partnership for Assessment of Readiness for College and Careers (PARCC) is a consortium of US States working together to develop educational assessments aligned with the Common Core State Standards.  This is a daunting task, and PARCC is doing an admirable job, especially with their focus on utilizing technology.  However, one of the new item types has a serious psychometric fault that deserves a caveat with regards to scoring and validation.

What is an Evidence-Based Selected-Response (EBSR) question?

The item type is an “Evidence-Based Selected-Response” (PARCC EBSR) item format, commonly called a Part A/B item or Two-Part item.  The goal of this format is to delve deeper into student understanding and award credit for deeper knowledge, while minimizing the impact of guessing.  This is obviously an appropriate goal for assessment.  To do so, the item is presented to the student in two parts, where the first part asks a simple question and the second part asks for supporting evidence for the answer to Part A.  Students must answer Part A correctly to receive credit on Part B.  As described on the PARCC website:

In order to receive full credit for this item, students must choose two supporting facts that support the adjective chosen for Part A. Unlike tests in the past, students may not guess on Part A and receive credit; they will only receive credit for the details they’ve chosen to support Part A.

How EBSR items are scored

While this makes sense in theory, it leads to problems in data analysis, especially when using item response theory (IRT), because it violates the fundamental IRT assumption of local independence (items are not dependent on each other).  So when working with a client of mine, we decided to combine the two parts into one multi-point question, which matches the theoretical approach PARCC EBSR items are taking.  The goal was to calibrate the item with Muraki’s Generalized Partial Credit Model (GPCM), which is the standard approach used to analyze polytomous items in K12 assessment (learn more here).  The GPCM tries to order students based on the points they earn: 0-point students tend to have the lowest ability, 1-point students moderate ability, and 2-point students the highest ability.  Should be obvious, right?  Nope.

The first thing we noticed was that some point levels had very small sample sizes.  Suppose that Part A is worth 1 point and Part B is worth 1 point (select two evidence pieces, but you must get both).  Most students will get 0 points or 2 points; not many will receive 1 point.  We thought about it and realized that the only way to earn 1 point is to answer Part A correctly (likely by guessing) but then select no correct evidence, or only one correct piece.  This leads to issues with the GPCM.
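To make that scoring logic concrete, here is a minimal Python sketch of a hypothetical Part A/B scoring rule of the kind described above; the function name and the requirement that both evidence pieces be selected are illustrative assumptions, not PARCC's published scoring algorithm.

```python
def score_ebsr(part_a_correct, evidence_selected, evidence_key, require_both=True):
    """Hypothetical EBSR scoring rule: Part B only counts if Part A is correct."""
    if not part_a_correct:
        return 0                                   # no credit at all without Part A
    hits = len(set(evidence_selected) & set(evidence_key))
    part_b_correct = hits == len(evidence_key) if require_both else hits > 0
    return 1 + (1 if part_b_correct else 0)

# The only way to land on 1 point is Part A correct but incomplete evidence:
print(score_ebsr(True,  ["B"],      ["B", "D"]))   # 1 point
print(score_ebsr(True,  ["B", "D"], ["B", "D"]))   # 2 points
print(score_ebsr(False, ["B", "D"], ["B", "D"]))   # 0 points
```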

Using the Generalized Partial Credit Model

Even when there was sufficient N at each level, we found that the GPCM had terrible fit statistics, meaning that the item was not performing according to the model described above.  So I ran Iteman, our classical analysis software, to obtain quantile plots that approximate the polytomous IRFs without imposing the GPCM modeling.  I found that these 0–2 point items tend to have the issue where not many students get 1 point, and moreover the curve for the 1-point category is relatively flat.  The GPCM assumes that it is relatively bell-shaped.  So the GPCM is looking for the drop-offs in that bell shape, where it crosses the adjacent CRFs – the thresholds – and they aren’t there.  The GPCM would blow up, usually not even estimating the thresholds in the correct order.
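For reference, here is a short sketch of GPCM category probabilities for a hypothetical 2-point item with made-up parameters (a = 1.0, thresholds at -0.5 and +0.5).  In a well-behaved item, the middle (1-point) curve forms a bump between the two threshold crossings; the EBSR data showed a flat line there instead, which is exactly why the threshold estimates fell apart.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Generalized Partial Credit Model category probabilities.

    theta : examinee ability
    a     : item discrimination
    b     : step/threshold parameters (length = max score)
    Returns probabilities for scores 0..len(b).
    """
    # Cumulative sums of a*(theta - b_j); score 0 has an empty sum (0).
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.array(b)))))
    expd = np.exp(steps - steps.max())          # stabilize the exponentials
    return expd / expd.sum()

# Hypothetical 2-point item with thresholds at -0.5 and +0.5
for theta in (-2, -1, 0, 1, 2):
    p0, p1, p2 = gpcm_probs(theta, a=1.0, b=[-0.5, 0.5])
    print(f"theta={theta:+d}: P(0)={p0:.2f}  P(1)={p1:.2f}  P(2)={p2:.2f}")
```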

PARCC EBSR Graphs

So I tried to think of this from a test development perspective.  How do students get 1 point on these PARCC EBSR items?  The only way to do so is to get Part A right but not Part B.  Given that Part B is the reason for Part A, this means this group is students who answer Part A correctly but don’t know the reason, which means they are guessing.  It is then no surprise that the data for 1-point students is in a flat line – it’s just like the c parameter in the 3PL.  So the GPCM will have an extremely tough time estimating threshold parameters.

Why EBSR items don’t work

From a psychometric perspective, point levels are supposed to represent different levels of ability.  A 1-point student should be of higher ability than a 0-point student on this item, and a 2-point student of higher ability than a 1-point student.  This seems obvious and intuitive.  But this item, by definition, violates the idea that a 1-point student should have higher ability than a 0-point student.  The only way to get 1 point is to guess the first part – meaning the examinee does not actually know the answer and is no different from the 0-point examinees.  So of course the 1-point results look funky here.

In the end, the items were calibrated as two separate dichotomous items rather than one polytomous item, and the statistics turned out much better.  This still violates the local independence assumption of IRT, but at least it produces usable IRT parameters that can score students.  Nevertheless, I think the scoring of these items needs to be revisited so that the algorithm produces data that can actually be calibrated with IRT.

The entire goal of test items is to provide data points used to measure students; if the Evidence-Based Selected-Response item type is not providing usable data, then it is not worth using, no matter how good it seems in theory!

test-scaling

Scaling is a psychometric term regarding the establishment of a score metric for a test, and it often has two meanings. First, it refers to defining the method used to operationally score the test, establishing an underlying scale on which people are measured.  A common example is the T-score, which transforms raw scores into a standardized scale with a mean of 50 and a standard deviation of 10, making it easier to compare results across different populations or test forms.  Second, it refers to score conversions used for reporting, especially conversions that are designed to carry specific information.  The latter is typically called scaled scoring.

Examples of Scaling

You have all been exposed to this type of scaling, though you might not have realized it at the time. Most high-stakes tests like the ACT, SAT, GRE, and MCAT are reported on scales that are selected to convey certain information, with the actual numbers chosen more or less arbitrarily. The SAT and GRE have historically had a nominal mean of 500 and a standard deviation of 100, while the ACT has a nominal mean of 18 and a standard deviation of 6. These are essentially the same scale, because each is nothing more than a converted z-score (standard score), used because no examinee wants to receive a score report saying they got a score of -1. The means and standard deviations were arbitrarily selected, and the score range bounds were then set at plus or minus three standard deviations, which captures nearly all (about 99.7%) of a normally distributed population. Hence, the SAT and GRE range from 200 to 800 and the ACT ranges from 0 to 36. This leads to the urban legend of receiving 200 points for writing your name correctly on the SAT; again, it simply feels better for the examinee. A score of 300 might seem like a big number and 100 points above the minimum, but it just means that the examinee is around the 2nd percentile.
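As a quick illustration that such reported scores are nothing more than relocated z-scores, here is a small Python sketch; the helper name and the clipping to the published bounds are illustrative conventions, not any testing program's official conversion.

```python
def z_to_reported(z, mean, sd, lo, hi):
    """Convert a z-score to a reported scale and clip to the published bounds."""
    return min(hi, max(lo, round(mean + z * sd)))

# Nominal scales discussed above
print(z_to_reported(-1.0, 500, 100, 200, 800))   # SAT/GRE-style: 400
print(z_to_reported(-1.0, 18, 6, 0, 36))         # ACT-style: 12
print(z_to_reported(-3.5, 500, 100, 200, 800))   # clipped at the floor: 200
```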

Now, notice that I said “nominal.” I said that because the tests do not actually have those means observed in samples, because the samples have substantial range restriction. Because these tests are only taken by students serious about proceeding to the next level of education, the actual sample is of higher ability than the population. The lower third or so of high school students usually do not bother with the SAT or ACT. So many states will have an observed average ACT of 21 and standard deviation of 4. This is an important issue to consider in developing any test. Consider just how restricted the population of medical school students is; it is a very select group.

How can I select a score scale?

score-scale

For various reasons, actual observed scores from tests are often not reported; only converted scores are.  If there are multiple forms being equated, scaling hides the fact that the forms differ in difficulty and, in many cases, in cutscore.  Scaled scores can also facilitate clearer feedback, since they remain comparable across forms and administrations, and they help the organization avoid having to explain IRT scoring, which can be a headache for some audiences.

When deciding on the conversion calculations, there are several important questions to consider.

First, do we want to be able to make fine distinctions among examinees? If so, the range should be sufficiently wide. My personal view is that the scale should be at least as wide as the number of items; otherwise you are voluntarily giving up information. This in turn means you are giving up variance, which makes it more difficult to correlate your scaled scores with other variables, the way MCAT scores are correlated with success in medical school. This, of course, hampers future research – unless that research can revert to the actual observed scores to make sure all available information is used. For example, suppose a test with 100 items is reported on a 5-point grade scale of A-B-C-D-F. That scale is quite restricted, and therefore difficult to correlate with other variables in research. But you have the option of reporting the grades to students while still using the original scores (0 to 100) for your research.

Along the same lines, we can swing completely in the other direction. For many tests, the purpose of the test is not to make fine distinctions, but only to broadly categorize examinees. The most common example of this is a mastery test, where the examinee is being assessed on their mastery of a certain subject, and the only possible scores are pass and fail. Licensure and certification examinations are an example. An extension of this is the “proficiency categories” used in K-12 testing, where students are classified into four groups: Below Basic, Basic, Proficient, and Advanced. This is used in the National Assessment of Educational Progress. Again, we see the care taken for reporting of low scores; instead of receiving a classification like “nonmastery” or “fail,” the failures are given the more palatable “Below Basic.”

Another issue to consider, which is very important in some settings but irrelevant in others, is vertical scaling. This refers to the chaining of scales across tests that are at quite different levels. In education, this might involve linking the scales of exams in 8th grade, 10th grade, and 12th grade (graduation), so that student progress can be accurately tracked over time. Obviously, this is of great use in longitudinal educational research, such as tracking students along the pipeline toward medical school. But for a test that awards certification in a medical specialty, it is not relevant because that exam is essentially a one-time event.

Lastly, there are three calculation options: pure linear (ScaledScore = RawScore * Slope + Intercept), standardized conversion (Old Mean/SD to New Mean/SD), and nonlinear approaches like Equipercentile.
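Here is a minimal sketch of those three options in Python; the function names are made up, and the equipercentile version is a bare-bones percentile-rank lookup rather than a full smoothed equipercentile procedure.

```python
import numpy as np

def linear_scale(raw, slope, intercept):
    """Pure linear conversion: ScaledScore = RawScore * Slope + Intercept."""
    return raw * slope + intercept

def standardized_scale(raw, old_mean, old_sd, new_mean, new_sd):
    """Standardized conversion: map the raw-score mean/SD onto a new mean/SD."""
    return (raw - old_mean) / old_sd * new_sd + new_mean

def equipercentile_scale(raw, raw_dist, target_dist):
    """Nonlinear conversion: assign the target-scale value with the same percentile rank."""
    pr = 100.0 * (np.asarray(raw_dist) <= raw).mean()   # percentile rank of the raw score
    return np.percentile(target_dist, pr)

# Example: a raw score of 72 on a 100-item test, with simulated score distributions
raw_dist = np.random.default_rng(1).normal(65, 12, 5000).clip(0, 100)
target_dist = np.random.default_rng(2).normal(500, 100, 5000).clip(200, 800)
print(linear_scale(72, slope=6, intercept=200))
print(standardized_scale(72, old_mean=65, old_sd=12, new_mean=500, new_sd=100))
print(round(equipercentile_scale(72, raw_dist, target_dist)))
```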

Perhaps the most important issue is whether the scores from the test will be criterion-referenced or norm-referenced. Often, this choice will be made for you because it distinctly represents the purpose of your tests. However, it is quite important and usually misunderstood, so I will discuss this in detail.

Criterion-Referenced vs. Norm-Referenced

data-analysis-norms

This is a distinction between the ways test scores are used or interpreted. A criterion-referenced score interpretation means that the score is interpreted with regards to defined content, blueprint, or curriculum (the criterion), and ignores how other examinees perform (Bond, 1996). A classroom assessment is the most common example; students are scored on the percent of items correct, which is taken to imply the percent of the content they have mastered. Conversely, a norm-referenced score interpretation is one where the score provides information about the examinee’s standing in the population, but no absolute (or ostensibly absolute) information regarding their mastery of content. This is often the case with non-educational measurements like personality or psychopathology. There is no defined content which we can use as a basis for some sort of absolute interpretation. Instead, scores are often either z-scores or some linear function of z-scores.  IQ is historically scaled with a mean of 100 and standard deviation of 15.

It is important to note that this dichotomy is not a characteristic of the test, but of the test score interpretations. This fact is more apparent when you consider that a single test or test score can have several interpretations, some of which are criterion-referenced and some of which are norm-referenced. We will discuss this deeper when we reach the topic of validity, but consider the following example. A high school graduation exam is designed to be a comprehensive summative assessment of a secondary education. It is therefore specifically designed to cover the curriculum used in schools, and scores are interpreted within that criterion-referenced context. Yet scores from this test could also be used for making acceptance decisions at universities, where scores are only interpreted with respect to their percentile (e.g., accept the top 40%). The scores might even do a fairly decent job at this norm-referenced application. However, this is not what they are designed for, and such score interpretations should be made with caution.

Another important note is the definition of “criterion.” Because most tests with criterion-referenced scores are educational and involve a cutscore, a common misunderstanding is that the cutscore is the criterion. It is still the underlying content or curriculum that is the criterion, because we can have this type of score interpretation without a cutscore. Regardless of whether there is a cutscore for pass/fail, a score on a classroom assessment is still interpreted with regards to mastery of the content.  To further add to the confusion, Industrial/Organizational psychology refers to outcome variables as the criterion; for a pre-employment test, the criterion is typically Job Performance at a later time.

This dichotomy also leads to some interesting thoughts about the nature of your construct. If you have a criterion-referenced score, you are assuming that the construct is concrete enough that anybody can make interpretations regarding it, such as mastering a certain percentage of content. This is why non-concrete constructs like personality tend to be only norm-referenced. There is no agreed-upon blueprint of personality.

Multidimensional Scaling

camera lenses for multidimensional item response theory

An advanced topic worth mentioning is multidimensional scaling (see Davison, 1998). The purpose of multidimensional scaling is similar to factor analysis (a later discussion!) in that it is designed to evaluate the underlying structure of constructs and how they are represented in items. This is therefore useful if you are working with constructs that are brand new, so that little is known about them, and you think they might be multidimensional. This is a pretty small percentage of the tests out there in the world; I encountered the topic in my first year of graduate school – only because I was in a Psychological Scaling course – and have not encountered it since.

Summary of test scaling

Scaling is the process of defining the scale on which your measurements will take place. It raises fundamental questions about the nature of the construct. Fortunately, in many cases we are dealing with a simple construct that has well-defined content, like an anatomy course for first-year medical students. Because it is so well-defined, we often take criterion-referenced score interpretations at face value. But as constructs become more complex, like job performance of a first-year resident, it becomes harder to define the scale, and we start to deal more in relatives than absolutes. At the other end of the spectrum are completely ephemeral constructs where researchers still can’t agree on the nature of the construct and we are pretty much limited to z-scores. Intelligence is a good example of this.

Some sources attempt to delineate the scaling of people and of items or stimuli as separate things, but this is really impossible because they are so confounded: people define item statistics (the percent of people that get an item correct) and items define person scores (the percent of items a person gets correct). It is for this reason that item response theory, the most advanced paradigm in measurement theory, was designed to place items and people on the same scale. It is also why item writing should consider how items will be scored and how they will lead to person scores. But because we start writing items long before the test is administered, and the nature of the construct is caught up in the scale, the issues presented here need to be addressed at the very beginning of the test development cycle.

certification exam development construction

Certification exams are a critical component of workforce development for many professions and play a significant role in the global Testing, Inspection, and Certification (TIC) market, which was valued at approximately $359.35 billion in 2022 and is projected to grow at a compound annual growth rate (CAGR) of 4.0% from 2023 to 2030. As such, a lot of effort goes into exam development and delivery, working to ensure that the exams are valid and fair, then delivered securely yet with enough convenience to reach the target market. If you work for a certification organization or awarding body, this article provides a guidebook to that process and how to select a vendor.

Certification Exam Development

Certification exam development is a well-defined process governed by accreditation guidelines such as NCCA, requiring steps such as job task analysis and standard setting studies.  For certification, and other credentialing like licensure or certificates, this process is incredibly important for establishing validity.  Such exams serve as gatekeepers into many professions, often after people have invested a ton of money and years of their life in preparation.  Therefore, it is critical that the tests be developed well, and have the necessary supporting documentation to show that they are defensible.

So what exactly goes into developing a quality exam, sound psychometrics, and the validity documentation, perhaps enough to achieve NCCA accreditation for your certification? Well, there is a well-defined and recognized process for certification exam development, though it is rarely exactly the same for every organization.  In general, the accreditation guidelines say you need to address these things, but leave the specific approach up to you.  For example, you have to do a cutscore study, but you are allowed to choose among methods such as Bookmark or Angoff.

Job Analysis / Practice Analysis

A job analysis study provides the vehicle for defining the important job knowledge, skills, and abilities (KSA) that will later be translated into content on a certification exam. During a job analysis, important job KSAs are obtained by directly analyzing job performance of highly competent job incumbents or surveying subject-matter experts regarding important aspects of successful job performance. The job analysis generally serves as a fundamental source of evidence supporting the validity of scores for certification exams.

Test Specifications and Blueprints

The results of the job analysis study are quantitatively converted into a blueprint for the certification exam.  Basically, it comes down to this: if the experts say that a certain topic or skill is done quite often or is very critical, then it deserves more weight on the exam, right?  There are different ways to do this.  My favorite article on the topic is Raymond & Neustel (2006).  Here’s a free tool to help.
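As one common (but by no means the only) way to quantitatively convert survey ratings into blueprint weights, here is a sketch that multiplies mean frequency by mean criticality and normalizes; the domains, ratings, and the multiplicative rule are illustrative assumptions rather than a prescribed method.

```python
# Hypothetical job-analysis survey results: mean frequency and criticality ratings per domain
domains = {
    "Patient assessment": {"frequency": 4.2, "criticality": 4.8},
    "Treatment planning": {"frequency": 3.5, "criticality": 4.1},
    "Documentation":      {"frequency": 4.6, "criticality": 2.9},
    "Ethics and law":     {"frequency": 2.1, "criticality": 4.9},
}

total_items = 100
raw = {d: v["frequency"] * v["criticality"] for d, v in domains.items()}  # multiplicative weighting
total = sum(raw.values())

# Item counts are rounded, so they may need a small manual adjustment to sum to total_items
for d, w in raw.items():
    pct = w / total
    print(f"{d:20s}  weight={pct:5.1%}  items={round(pct * total_items)}")
```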

test development cycle job task analysis

Item Development

After important job KSAs are established, subject-matter experts write test items to assess them. The end result is the development of an item bank from which exam forms can be constructed. The quality of the item bank also supports test validity.  A key operational step is the development of an Item Writing Guide and holding an item writing workshop for the SMEs.

Pilot Testing

There should be evidence that each item in the bank actually measures the content that it is supposed to measure; in order to assess this, data must be gathered from samples of test-takers. After items are written, they are generally pilot tested by administering them to a sample of examinees in a low-stakes context—one in which examinees’ responses to the test items do not factor into any decisions regarding competency. After pilot test data is obtained, a psychometric analysis of the test and test items can be performed. This analysis will yield statistics that indicate the degree to which the items measure the intended test content. Items that appear to be weak indicators of the test content generally are removed from the item bank or flagged for item review so they can be reviewed by subject matter experts for correctness and clarity.

Note that this is not always possible, and is one of the ways that different organizations diverge in how they approach exam development.

Standard Setting

Standard setting also is a critical source of evidence supporting the validity of professional credentialing exam (i.e. pass/fail) decisions made based on test scores.  Standard setting is a process by which a passing score (or cutscore) is established; this is the point on the score scale that differentiates between examinees that are and are not deemed competent to perform the job. In order to be valid, the cutscore cannot be arbitrarily defined. Two examples of arbitrary methods are the quota (setting the cut score to produce a certain percentage of passing scores) and the flat cutscore (such as 70% on all tests). Both of these approaches ignore the content and difficulty of the test.  Avoid these!

Instead, the cutscore must be based on one of several well-researched criterion-referenced methods from the psychometric literature.  There are two types of criterion-referenced standard-setting procedures (Cizek, 2006): examinee-centered and test-centered.

The Contrasting Groups method is one example of a defensible examinee-centered standard-setting approach. This method compares the scores of candidates previously defined as Pass or Fail. Obviously, this has the drawback that a separate method already exists for classification. Moreover, examinee-centered approaches such as this require data from examinees, but many testing programs wish to set the cutscore before publishing the test and delivering it to any examinees. Therefore, test-centered methods are more commonly used in credentialing.

The most frequently used test-centered method is the Modified Angoff Method (Angoff, 1971) which requires a committee of subject matter experts (SMEs).  Another commonly used approach is the Bookmark Method.
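Computationally, a modified Angoff study boils down to simple arithmetic on the SME ratings: average each item's ratings across SMEs, then sum across items to get the recommended raw cutscore.  Here is a minimal sketch with made-up ratings for a 5-item example; real studies involve many more items, training rounds, and often impact data.

```python
import numpy as np

# Hypothetical ratings: rows = SMEs, columns = items.
# Each value is the judged probability that a minimally competent candidate answers correctly.
ratings = np.array([
    [0.70, 0.55, 0.80, 0.65, 0.90],
    [0.75, 0.60, 0.85, 0.60, 0.85],
    [0.65, 0.50, 0.80, 0.70, 0.95],
])

item_means = ratings.mean(axis=0)     # consensus expectation per item
raw_cutscore = item_means.sum()       # expected raw score of a borderline candidate
print(f"Recommended raw cutscore: {raw_cutscore:.1f} out of {ratings.shape[1]}")
```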

Equating

If the test has more than one form – which is required by NCCA Standards and other guidelines – they must be statistically equated.  If you use classical test theory, there are methods like Tucker or Levine.  If you use item response theory, you can either bake the equating into the item calibration process with software like Xcalibre, or use conversion methods like Stocking & Lord.

What does this process do?  Well, if this year’s certification exam had an average 3 points higher than last year’s, how do you know whether this year’s version was 3 points easier, this year’s cohort was 3 points stronger, or a mixture of both?  Equating disentangles form difficulty from cohort ability.  Learn more here.
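As a simple illustration of IRT linking, here is a sketch using the mean/sigma method (a simpler cousin of Stocking & Lord) to put this year's theta scale onto last year's using anchor-item difficulty parameters; the parameter values are made up for the example.

```python
import numpy as np

# Hypothetical difficulty (b) parameters for anchor items that appear on both forms
b_old = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])   # calibrated on last year's scale
b_new = np.array([-1.0, -0.2, 0.3, 1.0, 1.7])   # calibrated on this year's scale

# Mean/sigma linking: find A and B such that A * b_new + B matches the old scale
A = b_old.std(ddof=1) / b_new.std(ddof=1)
B = b_old.mean() - A * b_new.mean()

theta_new = 0.45                                 # a score on this year's scale
theta_on_old_scale = A * theta_new + B
print(f"A={A:.3f}, B={B:.3f}, linked theta={theta_on_old_scale:.3f}")
```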

Psychometric Analysis & Reporting

This part is an absolutely critical step in the exam development cycle for professional credentialing.  You need to statistically analyze the results to flag any items that are not performing well, so you can replace or modify them.  This looks at statistics like item p-value (difficulty), item point biserial (discrimination), option/distractor analysis, and differential item functioning.  You should also look at overall test reliability/precision and other psychometric indices.  If you are accredited, you need to perform year-end reports and submit them to the governing body.  Learn more about item and test analysis.
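For the classical side of this analysis, here is a minimal sketch computing item p-values and corrected point-biserial correlations from a small, made-up scored response matrix; in practice you would use software like Iteman, but the underlying arithmetic looks like this.

```python
import numpy as np

# Hypothetical 0/1 scored response matrix: rows = examinees, columns = items
X = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
])

total = X.sum(axis=1)
p_values = X.mean(axis=0)                    # item difficulty (proportion correct)

for j in range(X.shape[1]):
    rest = total - X[:, j]                   # corrected item-total (exclude the item itself)
    rpb = np.corrcoef(X[:, j], rest)[0, 1]   # point-biserial discrimination
    print(f"Item {j+1}: p={p_values[j]:.2f}, corrected point-biserial={rpb:.2f}")
```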

Exam Development: It’s a Vicious Cycle

Now, consider the big picture: in many cases, an exam is not a one-and-done thing.  It is re-used, perhaps continually.  Often there are new versions released, perhaps based on updated blueprints or simply to swap out questions so that they don’t get overexposed.  That’s why this is better conceptualized as an exam development cycle, like the circle shown above.  Often some steps like Job Analysis are only done once every 5 years, while the rotation of item development, piloting, equating, and psychometric reporting might happen with each exam window (perhaps you do exams in December and May each year).

ASC has extensive expertise in managing this cycle for professional credentialing exams, as well as many other types of assessments.  Get in touch with us to talk to one of our psychometricians.

Certification Exam Delivery & Administration

Certification exam administration and proctoring is a crucial component of the professional credentialing process.  Certification exams are expensive to develop well, so an organization wants to protect that investment by delivering the exam with appropriate security so that items are not stolen.  Moreover, there is an obvious incentive for candidates to cheat.  So, a certification body needs appropriate processes in place to deliver the certification exams.  Here are some tips.

1. Determine the best approach for certification exam administration and proctoring

Here are a few of the considerations to take into account.  These can be crossed with each other, such as delivering paper exams at Events vs. Test Centers.

Timing: Cohorts/Windows vs Continuous Availability

Do you have cohorts, where events make more sense, or do you need continuous availability?  For example, if the test is tied to university training programs that graduate candidates in December and May each year, that shapes your delivery schedule.  Alternatively, some certifications are not tied to such training; you might only have to show work experience.  In those cases, candidates are ready to take the test continuously throughout the year.

Mode: Paper vs Computer

Does it make more sense to deliver the test on paper or on computer?  This used to be a cost issue, but now the cost of computerized delivery, especially with online proctoring at home, has dropped significantly while saving so much time for candidates.  Also, some exam types like clinical simulations can only be delivered on computers.

Location: Test centers vs Online proctored vs Events vs Multi-Modal

Some types of tests require events, such as a clinical assessment in an actual clinic with standardized patients.  Some tests can be taken anywhere.  Exam events can also coincide with other events; perhaps you have online delivery through the year but deliver a paper version of the test at your annual conference, for convenience.

Do you have an easy way to make your own locations, if you are considering that?  One example is that you have quarterly regional conferences for your profession, where you could simply get a side room to deliver your test to candidates since they will already be there.  Another is that most of your candidates are coming from training programs at universities, and you are able to use classrooms at those universities.

ansi accreditation certification exam candidates

Geography: State, National, or International

If your exam is for a small US state or a small country, it might be easy to require exams in a test center, because you can easily set up only one or two test centers to cover the geography.  Some certifications are international, and need to deliver on-demand throughout the year; those are a great fit for online.

Security: Low vs High

If your test has extremely high stakes, there is extremely high incentive to cheat.  An entry-level certification on WordPress is different than a medical licensure exam.  The latter is a better fit for test centers, while the former might be fine with online proctoring on-demand.

Online proctoring: AI vs Recorded vs Live

If you choose to explore this approach, here are three main types to evaluate.

A. AI only: AI-only proctoring means that there are no humans involved.  The examinee is recorded on video, AI algorithms flag potential issues, such as leaving their seat, and an administrator (usually a professor) is notified of students with a high number of flags.  This approach is usually not relevant for certifications or other credentialing exams; it is more for low-stakes exams like a Psychology 101 midterm at your local university.  The vendors for this approach are interested in large-scale projects, such as proctoring all midterms and finals at a university, perhaps hundreds of thousands of exams per year.

B. Record and Review: Record-and-review proctoring means that the examinee is recorded on video, and that video is later watched by a real human, who flags it if they see cheating, theft, or other issues.  This is much higher quality, and higher price, but has one major flaw that can be concerning for certification tests: if someone steals your test by taking pictures, you won’t find out until the next day.  At least you know who it was and what happened, with video proof.  Perhaps useful for microcredentials or recertification exams.

C. Live Online Proctoring: Live online proctoring (LOP), or what I call “live human proctoring” (because some AI proctoring is also “live” in real time!), means that there is a professional human proctor on the other side of the video from the examinee.  They check the examinee in, confirm their identity, scan the room, provide instructions, and actually watch them take the test.  Some providers like MonitorEDU even have the examinee make a second video stream on their phone, which is placed on a bookshelf or similar spot to show the entire room throughout the test.  Certainly, this approach is a very good fit for certification exams and other credentialing.  You protect the test content as well as the validity of that individual’s score; that is not possible with the other two approaches.

We have also prepared a list of the best online proctoring software platforms.

2. Determine other technology, psychometric, and operational needs

Next, your organization should establish any other needs for your exams that could impact the vendor selection.

  1. Do you require special item types, such that the delivery platform needs to support or integrate with them?
  2. Do you have simulations or OSCEs?
  3. Do you have specific needs around accessibility and accommodations for your candidates?
  4. Do you need adaptive testing or linear on the fly testing?
  5. Do you need extensive Psychometric consulting services?
  6. Do you need an integrated registration and payment portal?  Or a certification management system to track expirations and other important information?

Write all these up so that you can use the list to shop for a provider.

3. Find a provider – or several!

test development cycle fasttest

While it might seem easier to find a single provider for everything, that’s often not the best solution.  Look for those vendors that specifically fit your needs.

For example, most providers of remote proctoring are just that: remote proctoring.  They do not have a professional platform to manage item banks, schedule examinees, deliver tests, create custom score reports, and analyze psychometrics.  Some do not even integrate with such platforms, and only integrate with learning management systems like Moodle, seeing as their entire target market is only low-stakes university exams.  So if you are seeking a vendor for certification testing or other credentialing, the list of potential vendors is smaller.

Likewise, there are some vendors that only do the exam development and psychometrics, but lack a software platform and proctoring services for delivery.  In these cases, they might have very specific expertise, and often have lower costs due to lower overhead.  An example is JML Testing Services.

Once you have some idea what you are looking for, start shopping for vendors that provide services for certification exam delivery, development, and scoring.  In some cases, you might not settle on a certain approach right away, and that’s OK.  See what is out there and compare prices.  Perhaps the cost of Live Remote Proctoring is more affordable than you anticipated, and you can upgrade to that.

Besides a simple Google search, some good places to start are the member listings of the Association of Test Publishers and the Institute for Credentialing Excellence.

4. Establish the new process with policies and documentation

Once you have finalized your vendors, you need to write policies and documentation around them.  For example, if your vendor has a certain login page for proctoring (we have ascproctor.com), you should take relevant screenshots and write up a walkthrough so candidates know what to expect.  Much of this should go into your Candidate Handbook.  Some of the things to cover that are specific to exam day for the candidates:

  • How to prepare for the exam
  • How to take a practice test
  • What is allowed during the exam
  • What is not allowed
  • ID needed and the check-in process
  • Details on specific locations (if using locations)
  • Rules for accessibility and accommodations
  • Time limits and other practical considerations in the exam

Next, consider all the things that are impacted other than exam day.

  • Eligibility pathways and applications
  • Registration and scheduling
  • Candidate training and practice tests
  • Reporting: just to the candidates, or perhaps to training programs as well?
  • Accounting and other operations: consider your business needs, such as how you manage money, monthly accounting reports, etc.
  • Test security plan: What do you do if someone is caught taking pictures of the exam with their phone, or another security incident occurs?

5. Let Everyone Know

Once you have written up everything, make sure all the relevant stakeholders know.  Publish the new Candidate Handbook and announce to the world.  Send emails to all upcoming candidates with instructions and an opportunity for a practice exam.  Put a link on your homepage.  Get in touch with all the training programs or universities in your field.  Make sure that everyone has ample opportunity to know about the new process!

6. Roll Out

Finally, of course, you can implement the new approach to certification exam delivery.  You might launch a new certification exam from scratch, or perhaps you are moving one from paper to online with remote proctoring, or some other change.  Either way, you need a date to start using it and a change management process.  The good news is that, even though it’s probably a lot of work to get here, the new approach is probably going to save you time and money in the long run.  Roll it out!

Also, remember that this is not a single point in time.  You’ll need to update into the future.  You should also consider the implementation of audits or quality control as a way to drive improvement.

 

Ready to start?

exam development certification committee

Certification exam delivery is the process of administering a certification test to candidates.  This might seem straightforward, but it is surprisingly complex.  The greater the scale and the stakes, the more potential threats and pitfalls.  Assessment Systems Corporation is one of the world leaders in the development and delivery of certification exams.  Contact us to get a free account in our platform and experience the examinee process, or to receive a demonstration from one of our experts.

 

 

item-writing-tips

Item writing (aka item authoring) is a science as well as an art, and if you have done it, you know just how challenging it can be!  You are experts at what you do, and you want to make sure that your examinees are too.  But it’s hard to write questions that are clear, reliable, unbiased, and differentiate on the thing you are trying to assess.  Here are some tips.

What is Item Writing / Item Authoring ?

Item authoring is the process of creating test questions.  You have certainly seen “bad” test questions in your life, and know firsthand just how frustrating and confusing that can be.  Fortunately, there is a lot of research in the field of psychometrics on how to write good questions, and also how to have other experts review them to ensure quality.  It is best practice to make items go through a workflow, so that the test development process is similar to the software development process.

Because items are the building blocks of tests, it is likely that the test items within your tests are the greatest threat to its overall validity and reliability.  Here are some important tips in item authoring.  Want deeper guidance?  Check out our Item Writing Guide.

Anatomy of an Item

First, let’s talk a little bit about the parts of a test question.  The diagram on the right shows a reading passage with two questions on it.  Here are some of the terms used:

  • Asset/Stimulus: This is a reading passage here, but could also be an audio, video, table, PDF, or other resource
  • Item: An overall test question, usually called an “item” rather than a “question” because sometimes they might be statements.
  • Stem: The part of the item that presents the situation or poses a question.
  • Options: All of the choices to answer.
  • Key: The correct answer.
  • Distractors: The incorrect answers.

Parts of a test item

Item writing tips: The Stem

To find out whether your test items are your allies or your enemies, read through your test and identify the items that contain the most prevalent item construction flaws.  The first three of the most prevalent construction flaws are located in the item stem (i.e. question).  Look to see if your item stems contain…

1) BIAS

Nowadays, we tend to think of bias as relating to culture or religion, but there are many more subtle types of biases that oftentimes sneak into your tests.  Consider the following questions to determine the extent of bias in your tests:

  • Are there acronyms in your test that are not considered industry standard?
  • Are you testing on policies and procedures that may vary from one location to another?
  • Are you using vocabulary that is more recognizable to a female examinee than a male?
  • Are you referencing objects that are not familiar to examinees from a newer or older generation?

2) NOT

We’ve all taken tests which ask a negatively worded question. These test items are often the product of item authoring by newbies, but they are devastating to the validity and reliability of your tests, particularly for fast test-takers or individuals with lower reading skills.  If the examinee misses that one single word, they will get the question wrong even if they actually know the material.  This test item ends up penalizing the wrong examinees!

3) EXCESS VERBIAGE

Long stems can be effective and essential in many situations, but they are also more prone to two specific item construction flaws.  If the stem is unnecessarily long, it can contribute to examinee fatigue.  Because each item requires more energy to read and understand, examinees tire sooner and may begin to perform more poorly later on in the test—regardless of their competence level.

Additionally, long stems often include information that can be used to answer other questions in the test.  This could lead your test to be an assessment of whose test-taking memory is best (i.e. “Oh yeah, #5 said XYZ, so the answer to #34 is XYZ.”) rather than who knows the material.

Item writing tips:  distractors / options

Unfortunately, item stems aren’t the only offenders.  Experienced test writers know that the distractors (i.e. options) are actually more difficult to write than the stems themselves.  When you review your test items, look to see if your item distractors contain…

4) IMPLAUSIBILITY

The purpose of a distractor is to pull less qualified examinees away from the correct answer and toward other options that look correct.  In order to “distract” an examinee from the correct answer, distractors have to be plausible.  The closer they are to being correct, the more difficult the exam will be.  If the distractors are obviously incorrect, even unqualified examinees won’t pick them.  Then your exam will not help you discriminate between examinees who know the material and examinees who do not, which is the entire goal.

5) 3-TO-1 SPLITS

You may recall watching Sesame Street as a child.  If so, you remember the song “One of these things…”  (Either way, enjoy refreshing your memory!)   Looking back, it seems really elementary, but sometimes our test item options are written in such a way that an examinee can play this simple game with your test.  Instead of knowing the material, they can look for the option that stands out as different from the others.  Consider the following questions to determine if one of your items falls into this category:

  • Is the correct answer significantly longer than the distractors?
  • Does the correct answer contain more detail than the distractors?
  • Is the grammatical structure different for the answer than for the distractors?

6) ALL OF THE ABOVE

There are a couple of problems with having this phrase (or the opposite “None of the above”) as an option.  For starters, good test takers know that this is—statistically speaking—usually the correct answer.  If it’s there and the examinee picks it, they have a better than 50% chance of getting the item right—even if they don’t know the content.  Also, if they are able to identify two options as correct, they can select “All of the above” without knowing whether or not the third option was correct.  These sorts of questions also get in the way of good item analysis.   Whether the examinee gets this item right or wrong, it’s harder to ascertain what knowledge they have because the correct answer is so broad.

This is helpful, can I learn more?

Want to learn more about item writing?  Here’s an instructional video from one of our PhD psychometricians.  You should also check out this book.

Item authoring is easier with an item banking system

The process of reading through your exams in search of these flaws in the item authoring is time-consuming (and oftentimes depressing), but it is an essential step towards developing an exam that is valid, reliable, and reflects well on your organization as a whole.  We also recommend that you look into getting a dedicated item banking platform, designed to help with this process.

Summary Checklist

 

  • Issue: Key is invalid due to multiple correct answers.  Recommendation: Consider each answer option individually; the key should be fully correct, with each distractor being fully incorrect.
  • Issue: Item was written in a hard-to-comprehend way, so examinees were unable to apply their knowledge because of poor wording.  Recommendation: Ensure that the item can be understood after just one read-through. If you have to read the stem multiple times, it needs to be rewritten.
  • Issue: Grammar, spelling, or syntax errors direct savvy test takers toward the correct answer (or away from incorrect answers).  Recommendation: Read the stem, followed by each answer option, aloud. Each answer option should fit with the stem.
  • Issue: Information was introduced in the stem text that was not relevant to the question.  Recommendation: After writing each question, evaluate the content of the stem. It should be clear and concise without introducing irrelevant information.
  • Issue: Item emphasizes trivial facts.  Recommendation: Work off of a test blueprint to ensure that each of your items maps to a relevant construct. If you are using Bloom’s taxonomy or a similar approach, items should come from higher-order levels.
  • Issue: Numerical answer options overlap.  Recommendation: Carefully evaluate numerical ranges to ensure there is no overlap among options.
  • Issue: Examinees noticed the answer was most often A.  Recommendation: Distribute the key evenly among the answer options. This can be avoided with FastTest’s randomized delivery functionality.
  • Issue: Key was overly specific compared to distractors.  Recommendation: Answer options should all be about the same length and contain the same amount of information.
  • Issue: Key was the only option to include a key word from the item stem.  Recommendation: Avoid re-using key words from the stem text in your answer options. If you do use such words, distribute them evenly among all of the answer options so as not to call out individual options.
  • Issue: A rare exception can be argued to invalidate a true/false or always/never question.  Recommendation: Avoid using “always” or “never,” as there can be unanticipated or rare scenarios. Opt for less absolute terms like “most often” or “rarely.”
  • Issue: Distractors were not plausible, so the key was obvious.  Recommendation: Review each answer option and ensure that it has some bearing in reality. Distractors should be plausible.
  • Issue: Idiom or jargon was used; non-native English speakers did not understand.  Recommendation: It is best to avoid figures of speech; keep the stem text and answer options literal to avoid introducing undue discrimination against certain groups.
  • Issue: Key was significantly longer than distractors.  Recommendation: There is a strong tendency to write a key that is very descriptive. Be wary of this and evaluate distractors to ensure that they are approximately the same length.