
The distinction between a speeded test and a power test is one of the ways of differentiating psychometric or educational assessments. In educational measurement, depending on the assessment goals and time constraints, tests are categorized as speeded or power. There is also the concept of a Timed test, which is, in practice, a power test. Let's look at these types more carefully.

Speeded test

In this type of test, examinees are expected to answer as many questions as possible, but the time limit is deliberately so short that even the best examinees cannot complete the test; speed is therefore forced. Items are delivered sequentially, from the first to the last. All items are usually relatively easy, though sometimes they increase in difficulty. If the time limit and difficulty level are set correctly, none of the test takers will reach the last item before time runs out. A speeded test is intended to demonstrate how fast an examinee can respond to questions within a time limit; the correctness of answers matters less than the speed of answering. The total score is usually computed as the number of questions answered correctly when the time limit is reached, and differences in scores are mainly attributed to individual differences in speed rather than knowledge.

An example of this might be a mathematical calculation speed test. Examinees are given 100 multiplication problems and told to solve as many as they can in 20 seconds. Most examinees know the answers to all the items; it is a question of how many they can finish. Another might be a 10-key task, where examinees are given a list of 100 five-digit strings and told to type as many as they can in 20 seconds.
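To make the scoring rule concrete, here is a minimal sketch in Python, assuming hypothetical response data with timestamps; real delivery platforms track this internally.

```python
# Minimal sketch of speeded-test scoring: count correct responses submitted
# before the time limit. The responses and timestamps are hypothetical.
TIME_LIMIT = 20.0  # seconds

# Each tuple: (answered_correctly, cumulative_time_at_submission_in_seconds)
responses = [(True, 2.1), (True, 4.0), (False, 6.5), (True, 9.8),
             (True, 14.2), (True, 19.9), (True, 21.3)]

score = sum(1 for correct, t in responses if correct and t <= TIME_LIMIT)
print(score)  # 5: the last correct response arrived after the limit
```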

Pros of a speeded test

  • A speeded test is appropriate when you actually want to measure the speed of examinees; the 10-key task above would be useful in selecting data entry clerks, for example. The concept of "knowledge of 5-digit strings" in this case is not relevant and does not even make sense.
  • Tests can sometimes be very short but still discriminating.
  • When a test is a mixture of items in terms of difficulty, examinees might save time on the easier items in order to have time for the more difficult ones. This can create an increased spread in scores.

Cons of a speeded test

  • Most situations where a test is used call for evaluating knowledge, not speed.
  • The nature of the test provokes examinees to commit errors even on items they know, which can be stressful.
  • A speeded test does not account for individual differences among examinees.

 

Power test

A power test provides examinees with sufficient time so that they can attempt all items and demonstrate their true level of knowledge or ability. Therefore, this testing category focuses on assessing the knowledge, skills, and abilities of the examinees.  The total score is often computed as the number of questions answered correctly (or with item response theory), and individual differences in scores are attributed to differences in the ability under assessment, not to differences in basic cognitive abilities such as processing speed or reaction time.

There is also the concept of a Timed Test. This has a time limit, but the limit is NOT a major factor in how examinees respond to questions and does not meaningfully affect their scores. For example, the time limit might be set so that 95% of examinees are not affected at all, and the remaining 5% are slightly hurried. This is done with the CAT-ASVAB.

Pros of a power test

  • There are no meaningful time restrictions for test-takers
  • A power test is well suited to evaluating the knowledge, skills, and abilities of examinees
  • A power test reduces the chance of examinees committing errors on items they actually know
  • A power test accounts for individual differences among examinees

Cons of a power test

  • It can be time consuming (some of these exams are 8 hours long or even more!)
  • This test format sometimes does not suit competitive examinations because of administrative issues (too much test time across too many examinees)
  • A power test is sometimes poor for discriminative purposes, since all examinees have a good chance of performing well.  There are certainly some pass/fail knowledge exams where almost everyone passes.  But the purpose of those exams is not to differentiate for selection, but to make sure students have mastered the material, so in that case this is a good thing.

 

Speeded test vs power test

The categorization of a test as speeded or power depends on the assessment purpose. For instance, an arithmetic test might be a speeded test for Grade 8 students, if it contains many questions that are relatively easy for them, but the same test could be a power test for Grade 7 students. A speeded test starts to measure power if examinees can respond to all of the items correctly within the time limit; similarly, a power test turns into a speeded test when a restrictive time limit is imposed on it. Today, a purely speeded or purely power test is rare. What we usually meet in practice is a mixture of both, typically a Timed Test.

Below you may find a comparison of a speeded vs power test, in terms of the main features.

 

Speeded test | Power test
Time limit is fixed, and it affects all examinees | There is no time limit, or there is one that affects only a small percentage of examinees
The goal is to evaluate speed only, or a combination of speed and correctness | The goal is to evaluate correctness, in the sense of the knowledge, skills, and abilities of test-takers
Questions are relatively easy in nature | Questions are relatively difficult in nature
Test format increases the chance of committing errors | Test format reduces the chance of committing errors

 


Distractor analysis refers to the process of evaluating the performance of incorrect answers vs the correct answer for multiple choice items on a test.  It is a key step in the psychometric analysis process to evaluate item and test performance as part of documenting test reliability and validity.

What is a distractor?

An item distractor, also known as a foil or a trap, is an incorrect option for a selected-response item on an assessment. Multiple-choice questions have several answer options, one of which is the key (correct answer), while the remaining ones are distractors (incorrect answers). It is worth noting that distractors should not be just any wrong answers; they have to be plausible answers that an examinee might select when making a misjudgment or working from partial knowledge or understanding.  A great example is later in this article with the word "confectioner."

[Figure: Parts of an item (stem, options, distractors)]

What makes a good item distractor?

One word: plausibility.  We need the item distractor to attract examinees.  If it is so irrelevant that no one considers it, then it does not do any good to include it in the item.  Consider the following item.

 

   What is the capital of the United States of America?

 A. Los Angeles

 B. New York

 C. Washington, D.C.

 D. Mexico City

 

The last option is quite implausible – not only is it outside the USA, but it mentions another country in the name, so no student is likely to select this.  This then becomes a three-horse race, and students have a 1 in 3 chance of guessing.  This certainly makes the item easier. How much do distractors matter?  Well, how much is the difficulty affected by this new set?

 

   What is the capital of the United States of America?

 A. Paris

B. Rome

 C. Washington, D.C.

 D. Mexico City  

 

In addition, the distractors need to have negative discrimination.  That is, while we want the correct answer to attract the more capable examinees, we want the distractors to attract the less capable examinees.  If you have a distractor that you thought was clearly incorrect, and it turns out to attract all the top students, you need to take a long, hard look at that question! To calculate discrimination statistics on distractors, you will need software such as Iteman.

What makes a bad item distractor?

Obviously, implausibility and positive (or flat) discrimination are frequent offenders.  But if you think more deeply about plausibility, the key is actually plausibility without being arguably correct.  This can be a fine line to walk, and it is a common source of problems for items.  You might have a medical item that presents a scenario and asks for a likely diagnosis; perhaps one of the distractors is meant to be so unlikely as to be essentially implausible, but it might actually be correct for a small subset of patients under certain conditions.  If the author and item reviewers did not catch this, the examinees probably will, and it will be evident in the statistics.  This is one of the reasons it is important to do a psychometric analysis of test results, including distractor analysis to evaluate the effectiveness of the incorrect options in multiple-choice questions.  In fact, accreditation standards often require you to go through this process at least once a year.

Why do we need a distractor analysis?

After a test form is delivered to examinees, a distractor analysis should be performed to make sure that all answer options work well and that the item is performing well and defensibly. For example, it is generally expected that around 40-95% of examinees pick the correct answer, and that each distractor is chosen by fewer examinees than the key, with a roughly equal distribution of choices across the distractors. Distractor analysis is usually done with classical test theory, even if item response theory is used for scoring, equating, and other tasks.

How to do a distractor analysis

There are three main aspects:

  1. Option frequencies/proportions
  2. Option point-biserial
  3. Quantile plot

The option frequencies/proportions just refers to the analysis of how many examinees selected each answer.  Usually it is a proportion, labeled "P."  Did 70% choose the correct answer while the remaining 30% were evenly distributed amongst the 3 distractors?  Great.  But if only 40% chose the correct answer and 45% chose one of the distractors, you might have a problem on your hands.  Perhaps the answer specified as the key was not actually correct.

The point-biserials (Rpbis) will help you evaluate if this is the case.  The point-biserial is an item-total correlation, meaning that we correlate scores on the item with the total score on the test, which is a proxy index of examinee ability.  If it is 0.0, there is no relationship, which means the item is not correlated with ability, and therefore probably not doing any good.  If negative, it means that lower-ability examinees are selecting that option more often; if positive, it means that higher-ability examinees are selecting it more often.  We want the correct answer to have a positive value and the distractors to have negative values.  This is one of the most important points in determining whether the item is performing well.

In addition, there is a third approach, which is visual: the quantile plot.  It is very useful for diagnosing how an item is working and how it might be improved.  This splits the sample into blocks ordered by performance, such as 5 groups where Group 1 is the 0-20th percentile, Group 2 is the 21st-40th, and so on.  We expect the highest-ability group to have a high proportion of examinees selecting the correct answer and a low proportion selecting the distractors, and vice versa.  You can see how this aligns with the concept of the point-biserial.  An example of this is below.

Note that the P and point-biserial for the correct answer serve as "the" statistics for the item as a whole.  The P for the item is called the item difficulty or facility statistic.
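As a rough illustration, here is a minimal Python sketch of these three calculations for a single item, using small, hypothetical response data; in practice you would run dedicated software such as Iteman on the full response matrix.

```python
# Minimal sketch of a distractor analysis for one item, using classical
# statistics. The response data below are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "choice": ["B", "B", "C", "A", "B", "D", "B", "C", "B", "A"],  # option selected
    "total":  [ 25,  28,  15,  12,  30,  10,  27,  18,  22,  14],  # total test score
})

# Option frequencies (P) and option point-biserials (Rpbis)
for option in ["A", "B", "C", "D"]:
    selected = (df["choice"] == option).astype(float)
    p = selected.mean()
    rpbis = np.corrcoef(selected, df["total"])[0, 1]
    print(f"Option {option}: P = {p:.2f}, Rpbis = {rpbis:+.2f}")

# Quantile-plot data: split examinees into 5 ability groups by total score,
# then tabulate the proportion choosing each option within each group.
df["group"] = pd.qcut(df["total"], q=5, labels=["G1", "G2", "G3", "G4", "G5"])
print(pd.crosstab(df["group"], df["choice"], normalize="index"))
```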

Examples of a distractor analysis

Here is an example of a good item.  The P is medium (67% correct) and the Rpbis is strongly positive for the correct answer while strongly negative for the incorrect answers.  This translates to a clean quantile plot, where the curve for the correct answer (B) goes up while the curves for the incorrect answers go down.  An ideal situation.

 

[Figure: Distractor analysis quantile plot and statistics for a well-performing item]

 

Now contrast that with the following item.  Here, only 12% of examinees got this correct, and the Rpbis was negative.  Answer C had 21% and a nicely positive Rpbis, as well as a quantile curve that goes up.  This item should be reviewed to see if C is actually correct.  Or B, which had the most responses.  Most likely, this item will need a total rewrite!

 

[Figure: Quantile plot and statistics table for a poorly performing item]

 

Note that an item can be extremely difficult but still perform well.  Here is an example where the distractor analysis supports continued use of the item.  The distractor is just extremely attractive to lower-ability students; they think that a confectioner makes confetti, since those two words look the most alike.  Look how strong the Rpbis is here, and how negative it is for that distractor.  This is a good result!

 

[Figure: Distractor analysis for the "confectioner" item]

 


Multi-modal test delivery refers to an exam that can be delivered in several different ways, or to an online testing software platform designed to support this process. For example, you might provide the option for a certification exam to be taken on computer at third-party testing centers or on paper at the annual conference for the profession. The goal of multi-modal test delivery is to improve access and convenience for the examinees. In the example, perhaps the testing center approach requires an extra $60 proctoring fee as well as requiring the examinee to drive up to an hour to get there; they might be attending the annual conference next month anyway, and it would be very convenient for them to duck into a side room to take the exam.

Multi-modal test delivery requires scalable security on the part of your delivery partner. The exam platform should be able to support various types of exam delivery. Here are some approaches to consider.

Paper exams

Your platform should be able to make print-ready versions of the test. Note that this is quite different from exporting test items to Word or PDF; straight exports are often ugly and include metadata.  You might also need advanced formats like Adobe InDesign.

The system should also be able to import the results of a paper test, so that they are available for scoring and reporting along with other modes of delivery.

FastTest can do all of these things, as well as the points below.  You can sign up for a free account and try it yourself.

Online unproctored

The platform should be able to deliver exams online, without proctoring. There can be several ways for candidates to enter the exam:

1. As a direct link, without registration, such as an anonymous survey

2. As a direct link, but requiring self-registration

3. Pre-registration, with some sort of password to ensure the right person is taking the exam. This can be emailed or distributed, or perhaps is available via another software platform like a learning management system or applicant tracking system.

Online remote-proctored

The platform should be able to deliver the test online, with remote proctoring. There are several levels of remote proctoring, corresponding to increasing levels of security or stakes.

1. AI only: Video of the candidate taking the exam is recorded, and it is "reviewed" by AI algorithms. A human has the opportunity to review the flagged candidates, but in many cases this does not happen.

2. Record and review: Video is recorded, and every video is reviewed by a human. This provides stronger security than AI-only review, but it does not prevent test theft, because the theft would only be discovered a day or two later.

3. Live: Video is live-streamed and watched in real time. This provides the opportunity to stop the exam if someone is cheating. The proctors can be third-party or in some cases the organization’s staff. If you are using your staff, make sure to avoid the mistakes made by Cleveland State University.

Testing centers managed by you

Some testing platforms have functionality for you to manage your own testing centers. When candidates are registered for an exam, they are assigned to an appropriate center. In some cases, the center is also assigned a proctor. The platform might have a separate login for the proctor, requiring them to enter a password before the examinee can enter theirs (or the proctor enter it on their behalf).


Formal third-party testing centers

Some vendors will have access to a network of testing centers. These will have trained proctors, computers, and sometimes additional security considerations like video monitoring or biometric scanners when candidates arrive. There are three types of testing centers.

1. Owned: The testing company actually owns their own centers, and they are professionally staffed.

2. Independent/affiliated: The testing company might contract with professional testing centers that are owned by a different company. In some cases, these are independent.

3. Public: Some organizations will contract with public locations, such as computer labs at universities or libraries.

Summary: multi-modal test delivery

Multi-modal test delivery provides flexibility for exam sponsors. There are two situations where this is important. First, a single test can be delivered in multiple ways with equivalent security, to allow for greater convenience, like the conference example above. But it also empowers a testing organization to run multiple types of exams at different levels of security. For instance, a credentialing board might have an unproctored online exam as a practice test, a test center exam for their primary certification exam, and a remote-proctored test for annual recertification. Having a single platform makes it easier for the organization to manage its assessment activities, reducing costs while improving the experience for the people for whom it really matters – the candidates.

split-half-reliability-analysis

Split-half reliability is an internal consistency approach to quantifying the reliability of a test, in the paradigm of classical test theory. Reliability refers to the repeatability or consistency of test scores; we definitely want a test to be reliable.  The name comes from a simple description of the method: we split the test into two halves, calculate the score on each half for each examinee, then correlate those two columns of numbers.  If the two halves measure the same thing, the correlation is high, indicating a decent level of unidimensionality in the construct and reliability in measuring it.

Why do we need to estimate reliability?  Well, it is one of the easiest ways to quantify the quality of the test.  Some would argue, in fact, that it is a gross oversimplification.  However, because it is so convenient, classical indices of reliability are incredibly popular.  The most popular is coefficient alpha, which is a competitor to split half reliability.

How to Calculate Split Half Reliability

The process is simple.

  1. Take the test and split it in half
  2. Calculate the score of each examinee on each half
  3. Correlate the scores on the two halves

The correlation is best done with the standard Pearson correlation.

The standard Pearson correlation is r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² · Σ(y − ȳ)² ).

This, of course, begs the question:  How do we split the test into two halves?  There are so many ways.  Well, psychometricians generally recommend three ways:

  1. First half vs last half
  2. Odd-numbered items vs even-numbered items
  3. Random split

You can do these manually with your matrix of data, but good psychometric software will do these for you, and more (see screenshot below).

Example

Suppose this is our data set, and we want to calculate split half reliability.

Person Item1 Item2 Item3 Item4 Item5 Item6 Score
1 1 0 0 0 0 0 1
2 1 0 1 0 0 0 2
3 1 1 0 1 0 0 3
4 1 0 1 1 1 1 5
5 1 1 0 1 0 1 4

Let’s split it by first half and last half.  Here are the scores.

Score 1 Score 2
1 0
2 0
2 1
2 3
2 2

The correlation of these is 0.51.

Now, let’s try odd/even.

Score 1 Score 2
1 0
2 0
1 2
3 2
1 3

The correlation of these is -0.04!  Obviously, the different ways of splitting don’t always agree.  Of course, with such a small sample here, we’d expect a wide variation.
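Here is a minimal Python sketch that reproduces the two correlations above from the same 5-person, 6-item data set.

```python
# Reproduce the split-half example above: 5 examinees, 6 dichotomous items.
import numpy as np

data = np.array([
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
])

def split_half_r(matrix, half_a, half_b):
    # Sum item scores within each half, then correlate the two half-scores
    a = matrix[:, half_a].sum(axis=1)
    b = matrix[:, half_b].sum(axis=1)
    return np.corrcoef(a, b)[0, 1]

print(round(split_half_r(data, [0, 1, 2], [3, 4, 5]), 2))  # first/last: 0.51
print(round(split_half_r(data, [0, 2, 4], [1, 3, 5]), 2))  # odd/even: -0.04
```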

Advantages of Split Half Reliability

One advantage is that it is so simple, both conceptually and computationally.  It’s easy enough that you can calculate it in Excel if you need to.  This also makes it easy to interpret and understand.

Another advantage, which I was taught in grad school, is that split-half reliability only assumes equivalence of the two halves that you have created; on the other hand, coefficient alpha works at the item level and assumes equivalence of items.  This is of course never the case – but alpha is fairly robust and everyone uses it anyway.

Disadvantages… and the Spearman-Brown Formula

The major disadvantage is that this approach is evaluating half a test.  Because tests are more reliable with more items, having fewer items in a measure will reduce its reliability.  So if we take a 100-item test and divide it into two 50-item halves, we are essentially quantifying the reliability of a 50-item test, and therefore underestimating the reliability of the 100-item test.  Fortunately, there is a way to adjust for this: the Spearman-Brown formula.  This simple formula adjusts the correlation back up to what it should be for the full 100-item test.
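For reference, the standard two-half version of the Spearman-Brown adjustment is:

predicted full-length reliability = (2 × r_half) / (1 + r_half)

For example, a half-test correlation of 0.70 adjusts up to (2 × 0.70) / (1 + 0.70) ≈ 0.82.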

Another disadvantage was mentioned above: the different ways of splitting don’t always agree.  Again, fortunately, if you have a larger sample of people or a longer test, the variation is minimal.

OK, how do I actually implement?

Any good psychometric software will provide some estimates of split half reliability.  Below is the table of reliability analysis from Iteman.  This table actually continues for all subscores on the test as well.  You can download Iteman for free at its page and try it yourself.

This test had 100 items, of which 85 were scored (15 were unscored pilot items).  The alpha was around 0.82, which is acceptable, though it should be higher for 100 items.  Results are then shown for all three split-half methods, and then again for the Spearman-Brown (S-B) adjusted version of each.  Do they agree with alpha?  For the total test, the results don't jibe for two of the three methods.  But for the scored items, the three S-B calculations align with the alpha value.  This is most likely because some of the 15 pilot items were actually quite bad.  In fact, note that the alpha for 85 items is higher than for 100 items – which says the 15 new items were actually hurting the test!

[Figure: Reliability analysis output from Iteman]

This is a good example of using alpha and split half reliability together.  We made an important conclusion about the exam and its items, merely by looking at this table.  Next, the researcher should evaluate those items, usually with P value difficulty and point-biserial discrimination.

 


A cutscore or passing point (also known as a cut-off score or cutoff score) is a score on a test that is used to categorize examinees.  The most common example of this is pass/fail, which we are all familiar with from our school days.  For instance, a score of 70% or above will pass, while below 70% will fail.  However, many tests have more than one cutscore.  An example of this is the National Assessment of Educational Progress (NAEP) in the USA, which has 3 cutscores, creating 4 categories: Below Basic, Basic, Proficient, and Advanced.

The process of setting a cutscore is called a standard-setting study.  However, I dislike this term because the word "standard" is used to reflect other things in the assessment world.  In some cases, it is the definition of what is to be learned or covered (see Common Core State Standards), and in other cases it refers to the process of reducing construct-irrelevant variance by ensuring that all examinees take the test under standardized conditions (standardized testing).  So I prefer cutscore or passing point.  And passing point is limited to the case of an exam with only one cutscore where the classifications are pass/fail, which is not always the case: there are many situations with more than one cutscore, and many two-category situations use other decisions, like Hire/NotHire, or a clinical diagnosis like Depressed/NotDepressed.

When establishing cutscores, it is important to use scaled scores to ensure consistency and fairness.  Scaling adjusts raw scores to a common metric, which helps to accurately reflect the intended performance standards across different test forms or administrations.  You may read about setting a cutscore on a test scored with item response theory in this blog post.  For a deeper understanding of how measurement variability can affect the interpretation of cutscores, be sure to check out our blog post on confidence intervals.

Types of cutscores

There are two types of cutscores, reflecting the two ways that a test score can be interpreted: norm-referenced and criterion-referenced.  The Hofstee method represents a compromise approach that incorporates aspects of both.

Criterion-referenced Cutscore

A cutscore of this type is referenced to the material of the exam, regardless of examinee performance.  In most cases, this is the sort of cutscore you need for a high-stakes exam to be legally defensible.  Psychometricians have spent a lot of time inventing ways to do this, and scientifically studying them.

Names of some methods you might see for this type are: modified-Angoff, Nedelsky, and Bookmark.

Example

An example of this is a certification exam.  If the cutscore is 75%, everyone who scores at or above 75% passes.  In some months or years this might be most candidates; in other months it might be fewer.  The standard does not change.  In fact, the organizations that manage such exams go to great lengths to keep it stable over time, a process known as equating.

Norm-referenced Cutscore

A cutscore of this type is referenced to the examinees, regardless of their mastery of the material.

A term you might see for this is a quota, such as when a test is used to accept only the top 10% of applicants.

Example

An example of this was my college Biology class.  It was a weeder class, designed to weed out the students who start college planning to be pre-med simply because they like the idea of being a doctor or are drawn to the potential salary.  So the exams were intentionally made very hard, so that the average score might be only 50% correct.  An A was then awarded to anyone with a z-score of 1.0 or greater, which is roughly the top 16% of students – regardless of how well you actually scored on the exam.  You might get a score of 60% correct but be at the 95th percentile and get an A.
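As a minimal sketch of how such a norm-referenced cut might be applied (with made-up scores, not the actual class data):

```python
# Norm-referenced grading sketch: award an A for z >= 1.0, regardless of
# the raw percent-correct. Scores are made up for illustration.
import numpy as np

scores = np.array([42, 55, 48, 61, 50, 39, 58, 66, 45, 52])  # percent correct
z = (scores - scores.mean()) / scores.std()
grades = ["A" if zi >= 1.0 else "below the cut" for zi in z]
for s, zi, g in zip(scores, z, grades):
    print(f"score {s}%  z = {zi:+.2f}  {g}")
```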


The Nedelsky method is an approach to setting the cutscore of an exam.  Originally suggested by Nedelsky (1954), it is an early attempt to implement a quantitative, rigorous procedure to the process of standard setting.  Quantitative approaches are needed to eliminate the arbitrariness and subjectivity that would otherwise dominate the process of setting a cutscore.  The most obvious and common example of this is simply setting the cutscore at a round number like 70%, regardless of the difficulty of the test or the ability level of the examinees.  It is for this reason that a cutscore must be set with a method such as the Nedelsky approach to be legally defensible or meet accreditation standards.

How to implement the Nedelsky method

The first step, as with several other standard-setting methods, is to gather a panel of subject matter experts (SMEs).  The next step is for the panel to discuss the concept of a minimally competent candidate (MCC): the type of candidate that should just barely pass this exam, sitting on the borderline of competence. The panel then reviews a test form, paying specific attention to each of the items on the form.  For every item in the test form, each rater estimates the number of options that an MCC would be able to eliminate.  This then translates into a probability of a correct response, assuming that the candidate guesses amongst the remaining options.   If an MCC can eliminate only one of the options on a four-option item, they have a 1/3 = 33% chance of getting the item correct.  If two, then 1/2 = 50%.

These ratings are then averaged across all items and all raters.  The result represents the percentage score expected of an MCC on this test form, as defined by the panel.  This makes a compelling, quantitative argument for what the cutscore should be, because we would expect anyone who is minimally competent to score at that point or higher.

Item Rater1 Rater2 Rater3
1 33 50 33
2 25 25 25
3 25 33 25
4 33 50 50
5 50 100 50
Mean 33.2 51.6 36.6
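A quick sketch of the arithmetic, reproducing the per-rater means in the table above and the overall recommended cutscore:

```python
# Aggregate the Nedelsky ratings from the table above (5 items x 3 raters,
# values are the expected percent correct for an MCC on each item).
import numpy as np

ratings = np.array([
    [33, 50, 33],
    [25, 25, 25],
    [25, 33, 25],
    [33, 50, 50],
    [50, 100, 50],
])

print(ratings.mean(axis=0))        # per-rater means: [33.2 51.6 36.6]
print(round(ratings.mean(), 1))    # overall recommended cutscore: ~40.5%
```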

 

Drawbacks to the Nedelsky method

This approach only works on multiple-choice items, because it depends on the evaluation of option probabilities.  It is also a gross oversimplification.  If the item has four options, there are only four possible values for the Nedelsky rating: 25%, 33%, 50%, or 100%.  This is all the more striking when you consider that most items tend to have a percent-correct value between 50% and 100%, and reflecting this fact is impossible with the Nedelsky method. Obviously, more goes into answering a question than simply eliminating one or two of the distractors.  This is one reason that another method is generally preferred and supersedes this one…

Nedelsky vs Modified-Angoff

The Nedelsky method has been superseded by the modified-Angoff method.  The modified-Angoff method is essentially the same process but allows for finer variations, and can be applied to other item types.  The modified-Angoff method subsumes the Nedelsky method, as a rater can still implement the Nedelsky approach within that paradigm.  In fact, I often tell raters to use the Nedelsky approach as a starting point or benchmark.  For example, if they think that the examinee can easily eliminate two options, and is slightly more likely to guess one of the remaining two options, the rating is not 50%, but rather 60%.  The modified-Angoff approach also allows for a second round of ratings after discussion to increase consensus (Delphi Method).  Raters can slightly adjust their rating without being hemmed into one of only four possible ratings.


Enemy items is a psychometric term that refers to two test questions (items) which should not be seen by the same examinee: they should not appear together on the same test form (if linear), nor be delivered to the same person (if LOFT or adaptive).  The concept is therefore relevant to linear forms, but also pertains to linear on-the-fly testing (LOFT) and computerized adaptive testing (CAT).  There are several reasons why two items might be considered enemies:

  1. Too similar: the text of the two items is almost the same.
  2. One gives away the answer to the other.
  3. The items are on the same topic/answer, even if the text is different.

 

How do we find enemy items?

There are two ways (as is often the case): manual and automated.

Manual means that humans are reading items and intentionally mark two of them as enemies.  So maybe you have a reviewer that is reviewing new items from a pool of 5 authors, and finds two that cover the same concept.  They would mark them as enemies.

Automated means that you have a machine learning algorithm, such as one that uses natural language processing (NLP), evaluate all items in a pool and then use distance/similarity metrics to quantify how similar they are.  Of course, this could miss some situations, such as two items that cover the same topic but have fairly different text.  It is also difficult to do if items contain formulas, multimedia files, or other elements that cannot be parsed by NLP.
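Here is a minimal sketch of the automated approach, using TF-IDF vectors and cosine similarity; the item text, IDs, and similarity threshold are hypothetical, and production systems might use embeddings or other NLP models instead.

```python
# Sketch of automated enemy detection: TF-IDF vectors plus cosine similarity,
# flagging pairs above a threshold. Items and threshold are hypothetical.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "ITEM001": "What is the capital of the United States of America?",
    "ITEM002": "The capital city of the United States is which of the following?",
    "ITEM003": "Which planet is closest to the sun?",
}

ids = list(items)
tfidf = TfidfVectorizer().fit_transform(items.values())
sim = cosine_similarity(tfidf)

THRESHOLD = 0.5  # tune against pairs already known to be enemies
for i, j in combinations(range(len(ids)), 2):
    if sim[i, j] >= THRESHOLD:
        print(f"Possible enemy pair: {ids[i]} and {ids[j]} (similarity {sim[i, j]:.2f})")
```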

 

Why are enemy items a problem?

Enemy items violate the assumption of local independence: that an examinee's interaction with an item should not be affected by other items.  They also put the examinee in double jeopardy; if they don't know that topic, they will get two questions wrong, not one.  There are other potential issues as well, as discussed in this article.

 

What does this mean for test development?

We want to identify enemy items and ensure that they don’t get used together.  Your item banking and assessment platform should have functionality to track which items are enemies.  You can sign up for a free account in FastTest to see an example.

 

HR assessment is a critical part of the HR ecosystem, used to select the best candidates with pre-employment testing, assess training, certify skills, and more.  But there is a huge range in quality, as well as a wide range in the type of assessment that it is designed for.  This post will break down the different approaches and help you find the best solution.

HR assessment platforms help companies create effective assessments, thus saving valuable resources, improving candidate experience & quality, providing more accurate and actionable information about human capital, and reducing hiring bias.  But, finding software solutions that can help you reap these benefits can be difficult, especially because of the explosion of solutions in the market.  If you are lost on which tools will help you develop and deliver your own HR assessments, this guide is for you.

What is HR assessment?

HR assessment is a comprehensive process used by human resources professionals to evaluate various aspects of potential and current employees’ abilities, skills, and performance. This process encompasses a wide range of tools and methodologies designed to provide insights into an individual’s suitability for a role, their developmental needs, and their potential for future growth within the organization.


The primary goal of HR assessment is to make informed decisions about recruitment, employee development, and succession planning. During the recruitment phase, HR assessments help in identifying candidates who possess the necessary competencies and cultural fit for the organization.

There are various types of assessments used in HR.  Here are four main areas, though this list is by no means exhaustive.

  1. Pre-employment tests to select candidates
  2. Post-training assessments
  3. Certificate or certification exams (can be internal or external)
  4. 360-degree assessments and other performance appraisals

 

Pre-employment tests

Finding good employees in an overcrowded market is a daunting task. In fact, according to the Harvard Business Review, 80% of employee turnover is attributed to poor hiring decisions. Bad hires are not only expensive, but can also adversely affect cultural dynamics in the workforce. This is one area where HR assessment software shows its value.

There are different types of pre-employment assessments. Each of them achieves a different goal in the hiring process. The major types of pre-employment assessments include:

Personality tests: Despite rapidly finding their way into HR, these types of pre-employment tests are widely misunderstood. Personality tests measure traits on the social and behavioral spectrum.  One of the main goals of these tests is to predict the success of candidates based on behavioral traits.

Aptitude tests: Unlike personality tests or emotional intelligence tests, which tend to lie on the social spectrum, aptitude tests measure problem-solving, critical thinking, and agility.  These types of tests are popular because they predict job performance better than any other type, since they tap into areas that cannot be found in resumes or job interviews.

Skills testing: These kinds of tests can be considered a measure of job experience, ranging from high-end skills down to basic skills such as typing or Microsoft Excel. Skill tests can either measure specific skills, such as communication, or generalized skills, such as numeracy.

Emotional intelligence tests: These kinds of assessments are a newer concept but are becoming important in the HR industry. With strong emotional intelligence (EI) being associated with benefits such as improved workplace productivity and good leadership, many companies are investing heavily in developing these kinds of tests.  Although they can be administered to any candidate, it is recommended that they be reserved for people seeking leadership positions, or those expected to work in social contexts.

Risk tests: As the name suggests, these types of tests help companies reduce risks. Risk assessments offer assurance to employers that their workers will commit to established work ethics and not involve themselves in any activities that may cause harm to themselves or the organization.  There are different types of risk tests. Safety tests, which are popular in contexts such as construction, measure the likelihood of the candidates engaging in activities that can cause them harm. Other common types of risk tests include Integrity tests.

 

Post-training assessments

This refers to assessments that are delivered after training.  It might be a simple quiz after an eLearning module, or a full certification exam after months of training (see next section).  Often, it is somewhere in between.  For example, you might spend an afternoon sitting through a training course, after which you take a formal test that is required to do something on the job.  When I was a high school student, I worked in a lumber yard, and did exactly this to become an OSHA-approved forklift driver.

 

Certificate or certification exams

Sometimes, the exam process can be high-stakes and formal.  It is then a certificate or certification, or sometimes a licensure exam.  More on that here.  This can be internal to the organization, or external.

Internal certification: The credential is awarded by the training organization, and the exam is specifically tied to a certain product or process that the organization provides in the market.  There are many such examples in the software industry.  You can get certifications in AWS, SalesForce, Microsoft, etc.  One of our clients makes MRI and other medical imaging machines; candidates are certified on how to calibrate/fix them.

External certification: The credential is awarded by an external board or government agency, and the exam is industry-wide.  An example of this is the SIE exams offered by FINRA.  A candidate might go to work at an insurance company or other financial services company, who trains them and sponsors them to take the exam in hopes that the company will get a return by the candidate passing and then selling their insurance policies as an agent.  But the company does not sponsor the exam; FINRA does.

 

360-degree assessments and other performance appraisals

Job performance is one of the most important concepts in HR, and also one that is often difficult to measure.  John Campbell, one of my thesis advisors, was known for developing an 8-factor model of performance.  Some aspects are subjective, and some are easily measured by real-world data, such as number of widgets made or number of cars sold by a car salesperson.  Others involve survey-style assessments, such as asking customers, business partners, co-workers, supervisors, and subordinates to rate a person on a Likert scale.  HR assessment platforms are needed to develop, deliver, and score such assessments.

 

The Benefits of Using Professional-Level Exam Software

Now that you have a good understanding of what pre-employment and other HR tests are, let’s discuss the benefits of integrating pre-employment assessment software into your hiring process. Here are some of the benefits:

Saves Valuable resources

Unlike lengthy and costly traditional hiring processes, pre-employment assessment software helps companies increase their ROI by eliminating HR snags such as mandatory face-to-face interactions or geographical restrictions. Pre-employment testing tools can also reduce the amount of time it takes to make good hires while reducing the risk of facing the financial consequences of a bad hire.

Supports Data-Driven Hiring Decisions

Data runs the modern world, and hiring is no different. You are better off letting complex algorithms crunch the numbers and help you decide which talent is a fit, as opposed to hiring based on a hunch or less-accurate methods like an unstructured interview.  Pre-employment assessment software helps you analyze assessments and generate reports/visualizations to help you choose the right candidates from a large talent pool.

Improving candidate experience 

Candidate experience is an important aspect of a company's growth, especially considering that 69% of candidates admit they would not apply for a job at a company after having a negative experience. A good candidate experience means you get access to the best talent in the world.

Elimination of Human Bias

Traditional hiring processes are based on instinct. They are not effective, since it is easy for candidates to provide false information on their resumes and cover letters. The use of pre-employment assessment software has helped eliminate this hurdle. The tools have leveled the playing field, so that only the best candidates are considered for a position.

 

What To Consider When Choosing HR assessment tools

Now that you have a clear idea of what pre-employment tests are and the benefits of integrating pre-employment assessment software into your hiring process, let’s see how you can find the right tools.

Here are the most important things to consider when choosing the right pre-employment testing software for your organization.

Ease-of-use

The candidates should be your top priority when you are sourcing pre-employment assessment software, because ease of use directly correlates with a good candidate experience. Good software should be simple to navigate and easy to comprehend.

Here is a checklist to help you decide whether a pre-employment assessment software is easy to use:

  • Are the results easy to interpret?
  • What is the UI/UX like?
  • What ways does it use to automate tasks such as applicant management?
  • Does it have good documentation and an active community?

Test Delivery and Remote Proctoring

Good online assessment software should feature strong online proctoring functionality, because most remote jobs accept applications from all over the world. It is therefore advisable to choose pre-employment testing software that has secure remote proctoring capabilities. Here are some things you should look for in remote proctoring:

  • Does the platform support security processes such as IP-based authentication, lockdown browser, and AI-flagging?
  • What types of online proctoring does the software offer? Live real-time, AI review, or record and review?
  • Does it let you bring your own proctor?
  • Does it offer test analytics?

Test & data security, and compliance

Test security is a core part of defensibility. There are several layers of security associated with pre-employment testing. When evaluating this aspect, you should consider what the pre-employment testing software does to achieve the highest level of security, because data breaches are wildly expensive.

The first layer of security is the test itself. The software should support security technologies and frameworks such as lockdown browser, IP-flagging, and IP-based authentication.

The other layer of security is on the candidate’s side. As an employer, you will have access to the candidate’s private information. How can you ensure that your candidate’s data is secure? That is reason enough to evaluate the software’s data protection and compliance guidelines.

Good pre-employment testing software should be compliant with regulations such as GDPR. The software should also be flexible enough to adapt to compliance guidelines from different parts of the world.

Questions you need to ask:

  • What mechanisms does the software employ to deter cheating?
  • Is their remote proctoring function reliable and secure?
  • Are they compliant with security and privacy standards such as ISO and GDPR, and do they support SSO?
  • How does the software protect user data?

Psychometrics

Psychometrics is the science of assessment, helping to derive accurate scores from defensible tests, as well as making them more efficient, reducing bias, and providing a host of other benefits.  You should ensure that your solution supports the necessary level of psychometrics, such as classical item statistics (difficulty and point-biserial discrimination), item response theory, and adaptive testing.

 

User experience

A good user experience is a must-have when you are sourcing any enterprise software. Modern pre-employment testing software should be designed with both the candidates and the employer in mind. Some ways you can tell whether a software product offers a seamless user experience include:

  • User-friendly interface
  • Simple and easy to interact with
  • Easy to create and manage item banks
  • Clean dashboard with advanced analytics and visualizations

Customizing your user-experience maps to fit candidates’ expectations attracts high-quality talent.

 

Scalability and automation

With a single job post attracting approximately 250 candidates, scalability isn’t something you should overlook. A good pre-employment testing software should thus have the ability to handle any kind of workload, without sacrificing assessment quality.

It is also important you check the automation capabilities of the software. The hiring process has many repetitive tasks that can be automated with technologies such as Machine learning, Artificial Intelligence (AI), and robotic process automation (RPA).

Here are some questions you should consider in relation to scalability and automation:

  • Does the software offer Automated Item Generation (AIG)?
  • How many candidates can it handle?
  • Can it support candidates from different locations worldwide?

Reporting and analytics


A good pre-employment assessment software will not leave you hanging after helping you develop and deliver the tests. It will enable you to derive important insight from the assessments.

The analytics reports can then be used to make data-driven decisions on which candidates are suitable and how to improve the candidate experience. Here are some questions to ask about reporting and analytics:

  • Does the software have a good dashboard?
  • What format are reports generated in?
  • What key insights can users gather from the analytics process?
  • How good are the visualizations?

Customer and Technical Support

Customer and technical support is not something you should overlook. Good pre-employment assessment software should have an omni-channel support system that is available 24/7, mainly because some situations need a fast response. Here are some of the questions you should ask when vetting customer and technical support:

  • What channels of support does the software offer, and how prompt is the support?
  • How good is their FAQ/resources page?
  • Do they offer multi-language support mediums?
  • Do they have dedicated managers to help you get the best out of your tests?

 

Conclusion

Finding the right HR assessment software is a lengthy process, yet profitable in the long run. We hope the article sheds some light on the important aspects to look for when looking for such tools. Also, don’t forget to take a pragmatic approach when implementing such tools into your hiring process.

Are you stuck on how you can use pre-employment testing tools to improve your hiring process? Feel free to contact us and we will guide you on the entire process, from concept development to implementation. Whether you need off-the-shelf tests or a comprehensive platform to build your own exams, we can provide the guidance you need.  We also offer free versions of our industry-leading software  FastTest  and  Assess.ai  – visit our Contact Us page to get started!

If you are interested in delving deeper into leadership assessments, you might want to check out this blog post.  For more insights and an example of how HR assessments can fail, check out our blog post called Public Safety Hiring Practices and Litigation. The blog post titled Improving Employee Retention with Assessment: Strategies for Success explores how strategic use of assessments throughout the employee lifecycle can enhance retention, build stronger teams, and drive business success by aligning organizational goals with employee development and engagement.


Incremental validity is a specific aspect of criterion-related validity that refers to what an additional assessment or predictive variable adds to the information provided by existing assessments or variables.  It refers to the amount of "bonus" predictive power gained by adding another predictor.  In many cases, the new predictor is on the same or a similar trait, but often the most incremental validity comes from using a predictor/trait that is relatively unrelated to the original.  See the examples below.

Note that this is often discussed with respect to tests and assessment, but in many cases a predictor is not a test or assessment, as you will also see.

How is Incremental Validity Evaluated?

It is most often quantified with a linear regression model and correlations.  However, any predictive modeling approach could work, from support vector machines to neural networks.

Example of Incremental Validity: University Admissions

One of the most commonly used predictors for university admissions is an admissions test, or a battery of tests.  You might be required to take an assessment which includes an English/Verbal test, a Logic/Reasoning test, and a Quantitative/Math test.  These might be used individually or aggregated to create a mathematical model, based on past data, that predicts your performance at university. (There are actually several criterion variables for this, such as first-year GPA, final GPA, and 4-year graduation rate, but that's beyond the scope of this article.)

Of course, the admissions exams scores are not the only point of information that the university has on students.  It also has their high school GPA, perhaps an admissions essay which is graded by instructors, and so on.  Incremental validity poses this question: if the admissions exam correlates 0.59 with first year GPA, what happens if we make it into a multiple regression/correlation with High School GPA (HGPA) as a second predictor?  It might go up to, say, 0.64.  There is an increment of 0.05.  If the university has that data from students, they would be wasting it by not using it.

Of course, HGPA will correlate highly with the admissions exam scores, so it will likely not add a lot of incremental validity.  Perhaps the school finds that essays add a 0.09 increment to the predictive power, because they are more orthogonal to the admissions exam scores.  Does it make sense to add that, given the additional expense of scoring thousands of essays?  That's a business decision for them.
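Here is a minimal sketch of how such an increment might be quantified, using synthetic data rather than the actual admissions figures discussed above.

```python
# Sketch of quantifying incremental validity with linear regression on
# synthetic data: an admissions exam, high school GPA (HGPA), and a
# first-year GPA criterion. The numbers are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
exam = rng.normal(size=n)
hgpa = 0.7 * exam + 0.7 * rng.normal(size=n)           # correlated with the exam
fygpa = 0.5 * exam + 0.2 * hgpa + rng.normal(size=n)   # criterion variable

def multiple_r(X, y):
    """Correlation between observed and model-predicted criterion scores."""
    preds = LinearRegression().fit(X, y).predict(X)
    return np.corrcoef(preds, y)[0, 1]

r_exam = np.corrcoef(exam, fygpa)[0, 1]
r_both = multiple_r(np.column_stack([exam, hgpa]), fygpa)
print(f"Exam alone:  r = {r_exam:.2f}")
print(f"Exam + HGPA: R = {r_both:.2f}")
print(f"Incremental validity: {r_both - r_exam:.2f}")
```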

Example of Incremental Validity: Pre-Employment Testing

Another common use case is pre-employment testing, where the purpose of the test is to predict criterion variables like job performance, tenure, 6-month termination rate, or counterproductive work behavior.  You might start with a skills test; perhaps you are hiring accountants or bookkeepers and you give them a test on MS Excel.  What additional predictive power would we get by also using a quantitative reasoning test?  Probably some, but that most likely correlates highly with MS Excel knowledge.  So what about a personality assessment such as Conscientiousness?  That would be more orthogonal.  It is up to the researcher to determine what the best predictors are.  This topic, personnel selection, is one of the primary areas of Industrial/Organizational Psychology.


Summative and formative assessment are crucial components of the educational process.  If you work in the educational assessment field, or even in education generally, you have probably encountered these terms.  What do they mean?  This post will explore the differences between summative and formative assessment.

Assessment plays a crucial role in education, serving as a powerful tool to gauge student understanding and guide instructional practices. Among the various assessment methods, two approaches stand out: formative assessment and summative assessment. While both types aim to evaluate student performance, they serve distinct purposes and are applied at different stages of the learning process.

 

What is Summative Assessment?

Summative assessment refers to an assessment that comes at the end (the "sum") of an educational experience.  The "educational experience" can vary widely.  Perhaps it is a one-day training course, or something even shorter.  I worked at a lumber yard in high school, and I remember getting a rudimentary training – maybe an hour – on how to use a forklift before they had me take an exam to become OSHA certified to use a forklift.  Proctored by the guy who had just shown me the ropes, of course.  On the other end of the spectrum is board certification for a physician specialty like ophthalmology: after 4 years of undergrad, 4 years of med school, and several more years of specialty training, you finally get to take the exam.  Either way, the purpose is to evaluate what you learned in some educational experience.

Note that it does not have to be formal education.  Many certifications have multiple eligibility pathways.  For example, to be eligible to sit for the exam, you might need:

  1. A bachelor’s degree
  2. An associate degree plus 1 year of work experience
  3. 3 years of work experience.

How it is developed

Summative assessments are usually developed by assessment professionals, or a board of subject matter experts led by assessment professionals.  For example, a certification for ophthalmology is not informally developed by a teacher; there is a panel of experienced ophthalmologists led by a psychometrician.  A high school graduation exam might be developed by a panel of experienced math or English teachers, again led by a psychometrician and test developers.

The process is usually very long and time-intensive, and therefore quite expensive.  A certification will need a job analysis, an item writing workshop, a standard-setting study, and other important development steps that contribute to the validity of the exam scores.  A high school graduation exam has expensive curriculum alignment studies and other aspects.

Implementation of Summative Assessment

Let’s explore the key aspects of summative assessment:

  1. End-of-Term Evaluation: Summative assessments are administered after the completion of a unit, semester, or academic year. They aim to evaluate the overall achievement of students and determine their readiness for advancement or graduation.
  2. Formal and Standardized: Summative assessments are often formal, standardized, and structured, ensuring consistent evaluation across different students and classrooms. Common examples include final exams, standardized tests, and grading rubrics.
  3. Accountability: Summative assessment holds students accountable for their learning outcomes and provides a comprehensive summary of their performance. It also serves as a basis for grade reporting, academic placement, and program evaluation.
  4. Future Planning: Summative assessment results can guide future instructional planning and curriculum development. They provide insights into areas of strength and weakness, helping educators identify instructional strategies and interventions to improve student outcomes.

 

What is Formative Assessment?

Formative assessment is something that is used during the educational process.  Everyone is familiar with this from their school days: a quiz, an exam, or even just the teacher asking you a few questions verbally to gauge your level of knowledge.  Usually, but not always, a formative assessment is used to direct instruction.  A common example of formative assessment is the low-stakes exams given in K-12 schools purely to check on student growth, without counting towards grades.  Some of the most widely used titles are the NWEA MAP, Renaissance Learning STAR, and Imagine Learning MyPath.

Formative assessment is a great fit for computerized adaptive testing, a method that adapts the difficulty of the exam to each student.  If a student is 3 grades behind, the test will quickly adapt down to that level, providing a better experience for the student and more accurate feedback on their level of knowledge.

How it is developed

Formative assessments are typically much more informal than summative assessments.  Most of the exams we take in our life are informally developed formative assessments; think of all the quizzes and tests you ever took during courses as a student.  Even taking a test during training on the job will often count.  However, some are developed with heavy investment, such as a nationwide K-12 adaptive testing platform.

Implementation of Formative Assessment

Formative assessment refers to the ongoing evaluation of student progress throughout the learning journey. It is designed to provide immediate feedback, identify knowledge gaps, and guide instructional decisions. Here are some key characteristics of formative assessment:

  1. Timely Feedback: Formative assessments are conducted during the learning process, allowing educators to provide immediate feedback to students. This feedback focuses on specific strengths and areas for improvement, helping students adjust their understanding and study strategies.
  2. Informal Nature: Formative assessments are typically informal and flexible, offering a wide range of techniques such as quizzes, class discussions, peer evaluations, and interactive activities. They encourage active participation and engagement, promoting deeper learning and critical thinking skills.
  3. Diagnostic Function: Formative assessment serves as a diagnostic tool, enabling teachers to monitor individual and class-wide progress. It helps identify misconceptions, adapt instructional approaches, and tailor learning experiences to meet students’ needs effectively.
  4. Growth Mindset: The primary goal of formative assessment is to foster a growth mindset among students. By focusing on improvement rather than grades, it encourages learners to embrace challenges, learn from mistakes, and persevere in their educational journey.

 

Summative vs Formative Assessment

Below you may find some principal discrepancies between summative and formative assessment across the general aspects.

Aspect | Summative Assessment | Formative Assessment
Purpose | To evaluate overall student learning at the end of an instructional period. | To monitor student learning and provide ongoing feedback for improvement.
Timing | Conducted at the end of a unit, semester, or course. | Conducted throughout the learning process.
Role in Learning Process | To determine the extent of learning and achievement. | To identify learning needs and guide instructional adjustments.
Feedback Mechanism | Feedback is usually provided after the assessment is completed and is often limited to final results or scores. | Provides immediate, specific, and actionable feedback to improve learning.
Nature of Evaluation | Typically evaluative and judgmental, focusing on the outcome. | Diagnostic and supportive, focusing on the process and improvement.
Impact on Grading | Often a major component of the final grade. | Generally not used for grading; intended to inform learning.
Level of Standardization | Highly standardized to ensure fairness and comparability. | Less standardized, often tailored to individual needs and contexts.
Frequency of Implementation | Typically infrequent, such as once per term or unit. | Frequent and ongoing, integrated into daily learning activities.
Stakeholders Involved | Primarily involves educators and administrative bodies for accountability purposes. | Involves students, educators, and sometimes parents for immediate learning support.
Flexibility in Use | Rigid in format and timing; used to meet predetermined educational benchmarks. | Highly flexible; can be adapted to fit specific instructional goals and learner needs.

 

The Synergy Between Summative and Formative Assessment

While formative and summative assessments have distinct purposes, they work together in a complementary manner to enhance learning outcomes. Here are a few ways in which these assessment types can be effectively integrated:

  1. Feedback Loop: The feedback provided during formative assessments can inform and improve summative assessments. It allows students to understand their strengths and weaknesses, guiding their study efforts for better performance in the final evaluation.
  2. Continuous Improvement: By employing formative assessments throughout a course, teachers can continuously monitor student progress, identify learning gaps, and adjust instructional strategies accordingly. This iterative process can ultimately lead to improved summative assessment results.
  3. Balanced Assessment Approach: Combining both formative and summative assessments creates a more comprehensive evaluation system. It ensures that student growth and understanding are assessed both during the learning process and at the end, providing a holistic view.

 

Summative and Formative Assessment: A Validity Perspective

So what is the difference?  You will notice it is the situation and use of the exam, not the exam itself.  You could take those K-12 feedback assessments and deliver them at the end of the year, with weighting towards the student’s final grade.  That would make them summative.  But that is not what the test was designed for.  This is the concept of validity; the evidence showing that interpretations and use of test scores are supported towards their intended use.  So the key is to design a test for its intended use, provide evidence for that use, and make sure that the exam is being used in the way that it should be.