Posts on psychometrics: The Science of Assessment


A confidence interval for test scores is a common way to interpret the results of a test by reporting a range rather than a single number.  We all understand that tests provide imperfect measurements at a specific point in time, and actual performance can vary over different occasions.  The examinee might be sick or tired today and score lower than their true score, or get lucky with items on topics they have studied closely and score higher than they normally would (or vice versa with tricky items).

Psychometricians recognize this and have developed the concept of the standard error of measurement, which is an index of this variation.  The calculation of the SEM differs between classical test theory and item response theory, but in either case, we can use it to make a confidence interval around the observed score. Because tests are imperfect measurements, some psychometricians recommend always reporting scores as a range rather than a single number.

A confidence interval is a very common concept from statistics in general (not psychometrics alone): a likely range for the true value of something being estimated.  We can take 1.96 standard errors on each side of a point estimate to get a 95% confidence interval.  Start by calculating 1.96 times the SEM, then add it to and subtract it from the observed score to get the range.

Example of confidence interval with Classical Test Theory

With CTT, the confidence interval is placed on raw number-correct scores.  Suppose the reliability of a 100-item test is 0.90, with a mean of 85 and standard deviation of 5.  The SEM is then 5*sqrt(1-0.90) = 5*0.316 = 1.58.  If your score is 67, then a 95% confidence interval is 67 ± 1.96*1.58, or 63.90 to 70.10.  We are 95% sure that your true score lies in that range.

Example of confidence interval with Item Response Theory

The same concept applies to item response theory (IRT).  But the scale of numbers is quite different, because the theta scale runs from approximately -3 to +3.  Also, the SEM is calculated directly from the item parameters, in a way that is beyond the scope of this discussion.  But if your score is -1.0 and the SEM is 0.30, then the 95% confidence interval for your score is -1.588 to -0.412.  This confidence interval can also be compared to a cutscore, which is the basis of one adaptive testing approach to pass/fail tests.

Example of confidence interval with a Scaled Score

This concept also works on scaled scores.  IQ is typically reported on a scale with a mean of 100 and standard deviation of 15.  Suppose the test had an SEM of 3.2 and your score was 112.  Taking 1.96*3.2 = 6.27 and adding/subtracting it on either side gives a confidence interval of 105.73 to 118.27.
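
As a minimal sketch, here is how the three examples above can be reproduced in Python (the function name is just illustrative):

```python
import math

def confidence_interval(score, sem, z=1.96):
    """Return a (lower, upper) confidence interval around an observed score."""
    margin = z * sem
    return score - margin, score + margin

# CTT example: SD = 5, reliability = 0.90, so SEM = 5 * sqrt(1 - 0.90) = 1.58
sem_ctt = 5 * math.sqrt(1 - 0.90)
print(confidence_interval(67, sem_ctt))    # approximately (63.90, 70.10)

# IRT example: theta = -1.0, SEM = 0.30
print(confidence_interval(-1.0, 0.30))     # approximately (-1.588, -0.412)

# Scaled score example: IQ = 112, SEM = 3.2
print(confidence_interval(112, 3.2))       # approximately (105.73, 118.27)
```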

Composite Scores

A composite test score refers to a test score that is combined from the scores of multiple tests, that is, a test battery.  The purpose is to create a single number that succinctly summarizes examinee performance.  Of course, some information is lost by this, so the original scores are typically reported as well.

This is a case where multiple tests are delivered to each examinee, but an overall score is desired.  Note that this is different than the case of a single test with multiple domains; in that case, there is one latent dimension, while with a battery each test has a different dimension, though possibly highly correlated.  That is, we have four measurement situations:

  1. Single test, one domain
  2. Single test, multiple domains
  3. Multiple tests, but correlated or related
  4. Multiple tests, but unrelated latent dimensions

With regard to the composite test score, we are only considering #3.  A case of #4 where a composite score does not make sense is a Big 5 personality assessment.  There are five components (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism), but they are unrelated, and a sum of their scores would not quantify you as having a good or bad personality, or support any other meaningful interpretation!

Example of a Composite Test Score

A common example of a composite test score situation is a university admissions exam.  There are often several component tests, such as Logical Reasoning, Mathematics, and English.  These are psychometrically distinct, but there is definitely a positive manifold amongst them.  The exam sponsor will probably report each separately, but also sum all three to a total score as a way to summarize student performance in a single number.

How do you calculate a Composite Test Score?

Here are four ways that you can calculate a composite test score; a brief code sketch follows the list.  They typically use a Scaled Score rather than a Raw Score.

  1. Average – An example is the ACT assessment in the United States, for university admissions. There are four tests (English, Math, Science, Reading), each of which is reported on a scale of 0 to 36, but also the average of them is reported.  Here is a nice explanation.
  2. Sum – An example of this is the SAT, also a university admissions test in the United States.  See explanation at Khan Academy.
  3. Linear combination – You also have the option to combine like a sum, but with differential weighting. An example of this is the ASVAB, the test to enter the United States military. There are 12 tests, but the primary summary score is called AFQT and it is calculated by combining only 4 of the tests.
  4. Nonlinear transformation – There is also the possibility of any nonlinear transformation that you can think of, but this is rare.
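
As a minimal sketch of the first three approaches, here is some Python with made-up scaled scores and weights (not those of any real exam):

```python
# Hypothetical scaled scores for one examinee on a three-test battery
scores = {"Logical Reasoning": 24, "Mathematics": 30, "English": 27}

# 1. Average composite (the ACT approach)
average_composite = sum(scores.values()) / len(scores)

# 2. Sum composite (the SAT approach)
sum_composite = sum(scores.values())

# 3. Linear combination with differential weights (the AFQT-style approach)
weights = {"Logical Reasoning": 1.0, "Mathematics": 2.0, "English": 1.5}
weighted_composite = sum(weights[test] * score for test, score in scores.items())

print(average_composite, sum_composite, weighted_composite)  # 27.0 81 124.5
```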

How to implement a Composite Test Score

You will need an online testing platform that supports the concept of a test battery, provides scaled scoring, and then also provides functionality for composite scores.  An example of this screen from our platform is below.  Click here to sign up for a free account.


Inter-rater reliability and inter-rater agreement are important concepts in certain psychometric situations.  Many assessments never involve raters, but plenty certainly do.  This article will define these two concepts and discuss two psychometric situations where they are important.  For a more detailed treatment, I recommend Tinsley and Weiss (1975), which is one of the first articles that I read in grad school.

Inter-Rater Reliability

Inter-rater reliability refers to the consistency between raters, which is slightly different than agreement.  Reliability can be quantified by a correlation coefficient.  In some cases this is the standard Pearson correlation, but in others it might be tetrachoric or intraclass (Shrout & Fleiss, 1979), especially if there are more than two raters.  If raters correlate highly, then they are consistent with each other and would have a high reliability estimate.

Inter-Rater Agreement

Inter-rater agreement looks at how often the raters give exactly the same result.  There are different ways to quantify this as well, as discussed below.  Perhaps the simplest, in the two-rater case, is to calculate the proportion of rows where the two provided the same rating.  If there are more than two raters for a given case, you will need an index of dispersion amongst their ratings; the standard deviation and the mean absolute difference are two examples.
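
As a minimal sketch, here is how the two-rater versions of these indices can be computed in Python (the ratings are made up):

```python
import numpy as np

rater1 = np.array([3, 4, 2, 5, 1, 4])   # hypothetical rubric scores from Rater 1
rater2 = np.array([3, 5, 2, 4, 1, 4])   # hypothetical rubric scores from Rater 2

# Inter-rater reliability: Pearson correlation between the two raters
reliability = np.corrcoef(rater1, rater2)[0, 1]

# Inter-rater agreement: proportion of cases with exactly the same rating
agreement = np.mean(rater1 == rater2)

print(round(reliability, 2), round(agreement, 2))
```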

Situation 1: Scoring Essays with Rubrics

If you have an assessment with open-response questions like essays, they need to be scored with a rubric to convert them to numeric scores.  In some cases, there is only one rater doing this; you have all had essays graded by a single teacher in a classroom when you were a student.  But for larger scale or higher stakes exams, two raters are often used, to provide quality assurance on each other.  Moreover, this is often done at scale; if you have 10,000 essays to mark, that is a lot for two raters, so instead of two raters marking 10,000 each you might have a team of 20 marking 1,000 each.  Regardless, each essay has two ratings, so inter-rater reliability and agreement can be evaluated.  For any given rater, we can easily calculate the correlation of their 1,000 marks with the 1,000 marks from the other rater (even if the other rater rotates among the 19 remaining).  Similarly, we can calculate the proportion of times that they provided the same rating, or were within 1 point of the other rater.

Situation 2: Modified-Angoff Standard Setting

Another common assessment situation is a modified-Angoff study, which is used to set a cutscore on an exam.  Typically, there are 6 to 15 raters who rate each item on its difficulty, on a scale of 0 to 100 in multiples of 5.  This makes for a more complex situation, since there are not only many more raters per instance (item) but also many more possible ratings.

To evaluate inter-rater reliability, I typically use the intra-class correlation coefficient, which is:

ICC = (BMS − EMS) / (BMS + (JMS − EMS) / n)

Where BMS is the between items mean-square, EMS is the error mean-square, JMS is the judges mean-square, and n is the number of items.  It is like the Pearson correlation used in a two-rater situation, but aggregated across the raters and improved.  There are other indices as well, as discussed on Wikipedia.
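
As a minimal sketch, here is how these mean squares and the ICC in the average-rating form above can be computed from an items-by-raters matrix in Python; the ratings at the bottom are made up:

```python
import numpy as np

def icc_average_raters(ratings):
    """ICC for the mean of k raters across n items (ratings: n x k matrix).

    BMS, JMS, and EMS are the between-items, between-judges, and error
    mean squares from a two-way ANOVA without replication.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    item_means = ratings.mean(axis=1)
    judge_means = ratings.mean(axis=0)

    bms = k * np.sum((item_means - grand) ** 2) / (n - 1)
    jms = n * np.sum((judge_means - grand) ** 2) / (k - 1)
    resid = ratings - item_means[:, None] - judge_means[None, :] + grand
    ems = np.sum(resid ** 2) / ((n - 1) * (k - 1))

    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical Angoff-style ratings: 5 items rated by 3 judges
print(icc_average_raters([[80, 90, 85], [50, 60, 55], [65, 75, 70], [85, 95, 90], [70, 100, 75]]))
```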

For inter-rater agreement, I often use the standard deviation (as a very gross index) or quantile “buckets.”  See the Angoff Analysis Tool for more information.

 

Examples of Inter-Rater Reliability vs. Agreement

Consider these three examples with a very simple set of data: two raters scoring five students on a rubric that ranges from 0 to 5.

Reliability = 1, Agreement = 1

Student Rater 1 Rater 2
1 0 0
2 1 1
3 2 2
4 3 3
5 4 4

Here, the two are always the same, so both reliability and agreement are 1.0.

Reliability = 1, Agreement = 0

Student Rater 1 Rater 2
1 0 1
2 1 2
3 2 3
4 3 4
5 4 5

In this example, Rater 1 is always 1 point lower.  They never have the same rating, so agreement is 0.0, but they are completely consistent, so reliability is 1.0.

Reliability = -1, Agreement = 0.20 (because the two raters intersect at the middle point)

Student Rater 1 Rater 2
1 0 4
2 1 3
3 2 2
4 3 1
5 4 0

In this example, we have a perfect inverse relationship.  The correlation of the two is -1.0, while the agreement is 0.20 (they agree 20% of the time).

Now consider the modified-Angoff situation from Situation 2 above, oversimplified to only two raters.

Item Rater 1 Rater 2
1 80 90
2 50 60
3 65 75
4 85 95

This is like the second example above; one rater is always 10 points higher, so there is reliability of 1.0 but agreement of 0.  Even though agreement is an abysmal 0, the psychometrician running this workshop would be happy with the results, because the raters are rank-ordering the items consistently.  Of course, real Angoff workshops have more raters and many more items, so this is an overly simplistic example.

 

References

Tinsley, H.E.A., & Weiss, D.J. (1975).  Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 22(4), 358-376.

Shrout, P.E., & Fleiss, J.L. (1979).  Intraclass correlations: Uses in assessing rater reliability.  Psychological Bulletin, 86(2), 420-428.


Split-half reliability is an internal consistency approach to quantifying the reliability of a test, in the paradigm of classical test theory.  Reliability refers to the repeatability or consistency of test scores; we definitely want a test to be reliable.  The name comes from a simple description of the method: we split the test into two halves, calculate the score on each half for each examinee, then correlate those two columns of numbers.  If the two halves measure the same thing, then the correlation is high, indicating a decent level of unidimensionality in the construct and reliability in measuring it.

Why do we need to estimate reliability?  Well, it is one of the easiest ways to quantify the quality of the test.  Some would argue, in fact, that it is a gross oversimplification.  However, because it is so convenient, classical indices of reliability are incredibly popular.  The most popular is coefficient alpha, which is a competitor to split half reliability.

How to Calculate Split Half Reliability

The process is simple.

  1. Take the test and split it in half
  2. Calculate the score of each examinee on each half
  3. Correlate the scores on the two halves

The correlation is best done with the standard Pearson correlation.

r = Σ(x − x̄)(y − ȳ) / sqrt( Σ(x − x̄)² Σ(y − ȳ)² ), where x and y are the scores on the two halves.

This, of course, raises the question: how do we split the test into two halves?  There are many possible ways, but psychometricians generally recommend three:

  1. First half vs last half
  2. Odd-numbered items vs even-numbered items
  3. Random split

You can do these manually with your matrix of data, but good psychometric software will do these for you, and more (see the screenshot below).

Example

Suppose this is our data set, and we want to calculate split half reliability.

Person Item1 Item2 Item3 Item4 Item5 Item6 Score
1 1 0 0 0 0 0 1
2 1 0 1 0 0 0 2
3 1 1 0 1 0 0 3
4 1 0 1 1 1 1 5
5 1 1 0 1 0 1 4

Let’s split it by first half and last half.  Here are the scores.

Score 1 Score 2
1 0
2 0
2 1
2 3
2 2

The correlation of these is 0.51.

Now, let’s try odd/even.

Score 1 Score 2
1 0
2 0
1 2
3 2
1 3

The correlation of these is -0.04!  Obviously, the different ways of splitting don’t always agree.  Of course, with such a small sample here, we’d expect a wide variation.
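
Here is a minimal Python sketch that reproduces these two correlations from the example data above:

```python
import numpy as np

# Item response matrix from the example above: 5 persons x 6 items
data = np.array([
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
])

# First half vs. last half
first = data[:, :3].sum(axis=1)
last = data[:, 3:].sum(axis=1)
print(round(np.corrcoef(first, last)[0, 1], 2))   # 0.51

# Odd-numbered items vs. even-numbered items
odd = data[:, 0::2].sum(axis=1)
even = data[:, 1::2].sum(axis=1)
print(round(np.corrcoef(odd, even)[0, 1], 2))     # -0.04
```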

Advantages of Split Half Reliability

One advantage is that it is so simple, both conceptually and computationally.  It’s easy enough that you can calculate it in Excel if you need to.  This also makes it easy to interpret and understand.

Another advantage, which I was taught in grad school, is that split half reliability assumes equivalence of the two halves that you have created; on the other hand, coefficient alpha is based at an item level and assumes equivalence of items.  This of course is never the case – but alpha is fairly robust and everyone uses it anyway.

Disadvantages… and the Spearman-Brown Formula

The major disadvantage is that this approach is evaluating half a test.  Because tests are more reliable with more items, having fewer items in a measure will reduce its reliability.  So if we take a 100 item test and divide it into two 50-item halves, then we are essentially making a quantification of reliability for a 50 item test.  This means we are underestimating the reliability of the 100 item test.  Fortunately, there is a way to adjust for this.  It is called the Spearman-Brown Formula.  This simple formula adjusts the correlation back up to what it should be for a 100 item test.
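
For the common case of two half-tests, the Spearman-Brown adjustment is r_full = 2*r_half / (1 + r_half).  For example, the first/last split-half correlation of 0.51 computed earlier would adjust up to 2(0.51)/(1 + 0.51), or about 0.68, as the estimate for the full-length test.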

Another disadvantage was mentioned above: the different ways of splitting don’t always agree.  Again, fortunately, if you have a larger sample of people or a longer test, the variation is minimal.

OK, how do I actually implement?

Any good psychometric software will provide some estimates of split half reliability.  Below is the table of reliability analysis from Iteman.  This table actually continues for all subscores on the test as well.  You can download Iteman for free at its page and try it yourself.

This test had 100 items, 85 of which were scored (15 unscored pilot items).  The alpha was around 0.82, which is acceptable, though it should be higher for 100 items.  The results are then shown for all three split-half methods, and then again for the Spearman-Brown (S-B) adjusted version of each.  Do they agree with alpha?  For the total test, the results don’t jibe for two of the three methods.  But for the Scored Items, the three S-B calculations align with the alpha value.  This is most likely because some of the 15 pilot items were actually quite bad.  In fact, note that the alpha for the 85 scored items is higher than for all 100 items – which says the 15 new items were actually hurting the test!

[Screenshot: reliability analysis table from Iteman]

This is a good example of using alpha and split half reliability together.  We made an important conclusion about the exam and its items, merely by looking at this table.  Next, the researcher should evaluate those items, usually with P value difficulty and point-biserial discrimination.

 


The Nedelsky method is an approach to setting the cutscore of an exam.  Originally suggested by Nedelsky (1954), it is an early attempt to implement a quantitative, rigorous procedure for standard setting.  Quantitative approaches are needed to eliminate the arbitrariness and subjectivity that would otherwise dominate the process of setting a cutscore.  The most obvious and common example of this is simply setting the cutscore at a round number like 70%, regardless of the difficulty of the test or the ability level of the examinees.  It is for this reason that a cutscore must be set with a method such as the Nedelsky approach to be legally defensible or to meet accreditation standards.

How to implement the Nedelsky method

The first step, as with several other standard setting methods, is to gather a panel of subject matter experts (SMEs).  The next step is for the panel to discuss the concept of a minimally competent candidate (MCC): the type of candidate that should barely pass the exam, sitting right on the borderline of competence.  They then review a test form, paying specific attention to each item on the form.  For every item, each rater estimates the number of options that an MCC will be able to eliminate.  This then translates into the probability of a correct response, assuming that the candidate guesses amongst the remaining options.  If an MCC can only eliminate one of the options of a four-option item, they have a 1/3 = 33% chance of getting the item correct.  If they can eliminate two, then it is 1/2 = 50%.

These ratings are then averaged across all items and all raters.  This then represents the percentage score expected of an MCC on this test form, as defined by the panel.  This makes a compelling, quantitative argument for what the cutscore should then be, because we would expect anyone that is minimally qualified to score at that point or higher.

Item Rater1 Rater2 Rater3
1 33 50 33
2 25 25 25
3 25 33 25
4 33 50 50
5 50 100 50
Mean 33.2 51.6 36.6
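
As a minimal sketch, here is how the per-rater means and the recommended cutscore can be computed from the table above in Python:

```python
import numpy as np

# Nedelsky ratings from the table above: 5 items x 3 raters
ratings = np.array([
    [33, 50, 33],
    [25, 25, 25],
    [25, 33, 25],
    [33, 50, 50],
    [50, 100, 50],
])

rater_means = ratings.mean(axis=0)      # [33.2, 51.6, 36.6]
cutscore = ratings.mean()               # average across all items and raters
print(rater_means, round(cutscore, 1))  # recommended cutscore of about 40.5%
```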

 

Drawbacks to the Nedelsky method

This approach only works on multiple choice items, because it depends on the evaluation of option probabilities.  It is also a gross oversimplification.  If the item has four options, there are only four possible values for the Nedelsky rating: 25%, 33%, 50%, or 100%.  This is all the more striking when you consider that most items tend to have a percent-correct value between 50% and 100%, a range that the Nedelsky ratings can only represent with two values.  Obviously, more goes into answering a question than simply eliminating one or two of the distractors.  This is one reason that another method is generally preferred and supersedes this method…

Nedelsky vs Modified-Angoff

The Nedelsky method has been superseded by the modified-Angoff method.  The modified-Angoff method is essentially the same process but allows for finer variations, and can be applied to other item types.  The modified-Angoff method subsumes the Nedelsky method, as a rater can still implement the Nedelsky approach within that paradigm.  In fact, I often tell raters to use the Nedelsky approach as a starting point or benchmark.  For example, if they think that the examinee can easily eliminate two options, and is slightly more likely to guess one of the remaining two options, the rating is not 50%, but rather 60%.  The modified-Angoff approach also allows for a second round of ratings after discussion to increase consensus (Delphi Method).  Raters can slightly adjust their rating without being hemmed into one of only four possible ratings.


Scaled scoring is a process used in assessment and psychometrics to transform exam scores to another scale (set of numbers), typically to make the scores more easily interpretable but also to hide sensitive information like raw scores and differences in form difficulty (equating).  For example, the ACT test produces scores on a 0 to 36 scale; obviously, there are more than 36 questions on the test, so this is not your number correct score, but rather a repackaging.  So how does this repackaging happen, and why are we doing it in the first place?

An Example of Scales: Temperature

First, let’s talk about the definition of a scale.  A scale is a range of numbers for which you can assign values and interpretations.  Scores on a student essay might be 0 to 5 points, for example, where 0 is horrible and 5 is wonderful.  Raw scores on a test, like number-correct, are also a scale, but there are reasons to hide this, which we will discuss.

An example of scaling that we are all familiar with is temperature.  There are three scales that you have probably heard of: Fahrenheit, Celsius, and Kelvin.  Of course, the concept of temperature does not change, we are just changing the set of numbers used to report it.  Water freezes at 32 Fahrenheit and boils at 212, while these numbers are 0 and 100 with Celsius.  Same with assessment: the concept of what we are measuring does not change on a given exam (e.g., knowledge of 5th grade math curriculum in the USA, mastery of Microsoft Excel, clinical skills as a neurologist), but we can change the numbers.

What is Scaled Scoring?

In assessment and psychometrics, we can change the number range (scale) used to report scores, just like we can change the number range for temperature.  If a test is 100 items but we don’t want to report the actual score to students, we can shift the scale to something like 40 to 90.  Or 0 to 5.  Or 824,524 to 965,844.  It doesn’t matter from a mathematical perspective.  But since one goal is to make it more easily interpretable for students, the first two are much better than the third.

So, if an organization is reporting Scaled Scores, it means that they have picked some arbitrary new scale and are converting all scores to that scale.  Here are some examples…

Real Examples

Many assessments are normed on a standard normal bell curve.  Those which use item response theory do so implicitly, because scores are calculated directly on the z-score scale (there are some semantic differences, but it’s the basic idea).  Well, any scores on the z-score bell curve can be converted to other scales quite easily, and back again.  Here are some of the common scales used in the world of assessment.

z score T score IQ Percentile ACT SAT
-3 20 55 0.1 0 200
-2 30 70 2.3 6 300
-1 40 85 15.9 12 400
0 50 100 50 18 500
1 60 115 84.1 24 600
2 70 130 97.7 30 700
3 80 145 99.9 36 800

Note how the translation from normal curve based approaches to Percentile is very non-linear!  The curve-based approaches stretch out the ends.  Here is how these numbers look graphically.

[Figure: T scores and related scales plotted on the normal curve]

Why do Scaled Scoring?

There are a few good reasons:

  1. Differences in form difficulty (equating) – Many exams use multiple forms, especially across years.  What if this year’s form has a few more easy questions and we need to drop the passing score by 1 point on the raw score metric?  Well, if you are using scaled scores like 200 to 400 with a cutscore of 350, then you just adjust the scaling each year so the reported cutscore is always 350.
  2. Hiding the raw score – In many cases, even if there is only one form of 100 items, you don’t want students to know their actual score.
  3. Hiding the z scale (IRT) – IRT scores people on the z-score scale.  Nobody wants to be told they have a score of -2.  That makes it feel like you have negative intelligence or something like that.  But if you convert it to a big scale like the SAT above, that person gets a score of 300, which is a big number so they don’t feel as bad.  This doesn’t change the fact that they are only at the 2nd percentile though.  It’s just public relations and marketing, really.

 

Who uses Scaled Scoring?

Just about all the “real” exams in the world use this.  Of course, most use IRT, which makes it even more important to use scaled scoring.

Methods of Scaled Scoring

There are 4 types of scaled scoring.  The rest of this post will get into some psychometric details on these, for advanced readers.

  1. Normal/standardized
  2. Linear
  3. Linear dogleg
  4. Equipercentile

 

Normal/standardized

This is an approach to scaled scoring that many of us are familiar with due to some famous applications, including the T score, IQ, and large-scale assessments like the SAT. It starts by finding the mean and standard deviation of raw scores on a test, then converts whatever that is to another mean and standard deviation. If this seems fairly arbitrary and doesn’t change the meaning… you are totally right!

Let’s start by assuming we have a test of 50 items, and our data has a raw score average of 35 points with an SD of 5. The T score transformation – which has been around so long that a quick Googling can’t find me the actual citation – says to convert this to a mean of 50 with an SD of 10. So, 35 raw points become a scaled score of 50. A raw score of 45 (2 SDs above mean) becomes a T of 70. We could also place this on the IQ scale (mean=100, SD=15) or the classic SAT scale (mean=500, SD=100).
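
Here is a minimal sketch of that conversion in Python, using the made-up raw mean and SD from the example:

```python
def standardized_scale(raw, raw_mean, raw_sd, new_mean, new_sd):
    """Convert a raw score to a scale with a chosen mean and SD via z-scores."""
    z = (raw - raw_mean) / raw_sd
    return new_mean + new_sd * z

# Raw mean of 35 and SD of 5, as in the example above
print(standardized_scale(35, 35, 5, 50, 10))    # T score of 50.0
print(standardized_scale(45, 35, 5, 50, 10))    # T score of 70.0
print(standardized_scale(45, 35, 5, 100, 15))   # IQ-metric score of 130.0
print(standardized_scale(45, 35, 5, 500, 100))  # classic SAT-metric score of 700.0
```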

A side note about the boundaries of these scales: one of the first things you learn in any stats class is that plus/minus 3 SDs contains about 99.7% of the population, so many scaled scores adopt these as convenient boundaries. This is why the classic SAT scale went from 200 to 800, with the urban legend that “you get 200 points for putting your name on the paper.” Similarly, the ACT goes from 0 to 36 because it nominally had a mean=18 and SD=6.

The normal/standardized approach can be used with classical number-correct scoring, but makes more sense if you are using item response theory, because all scores default to a standardized metric.

Linear

The linear approach is quite simple. It employs the  y=mx+b  that we all learned as schoolkids. With the previous example of a 50 item test, we might say intercept=200 and slope=4. This then means that scores range from 200 to 400 on the test.

Yes, I know… the Normal conversion above is technically linear also, but deserves its own definition.

Linear dogleg

The Linear Dogleg approach is a special case of the previous one, where you need to stretch the scale to reach two endpoints. Let’s suppose we published a new form of the test, and a classical equating method like Tucker or Levine says that it is 2 points easier, so the slope from Form A to Form B is 3.8 rather than 4. This throws off our clean conversion to the 200 to 400 scale. So suppose we use the equation SCALED = 200 + 3.8*RAW, but only up until a raw score of 30. From 31 onwards, we use SCALED = 185 + 4.3*RAW. Note that the raw score of 50 then still comes out to a scaled score of 400, so we still go from 200 to 400, but there is now a slight bend in the line. This is called a “dogleg,” similar to the golf hole of the same name.
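
Here is a minimal sketch of that dogleg conversion in Python, using the illustrative numbers above:

```python
def dogleg_scaled_score(raw):
    """Piecewise linear (dogleg) conversion for the hypothetical form above."""
    if raw <= 30:
        return 200 + 3.8 * raw
    return 185 + 4.3 * raw

print(dogleg_scaled_score(0))   # 200.0
print(dogleg_scaled_score(30))  # 314.0
print(dogleg_scaled_score(50))  # 400.0 -- the top of the scale is preserved
```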

[Figure: dogleg example]

 

Equipercentile

Lastly, there is Equipercentile, which is mostly used for equating forms but can similarly be used for scaling.  In this conversion, we match the percentiles of the two distributions, even if it is a very nonlinear transformation.  For example, suppose our Form A had a 90th percentile of 46, which became a scaled score of 384.  We find that Form B has its 90th percentile at 44 points, so we also call that a scaled score of 384, and calculate a similar conversion for all other points.
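
Here is a minimal sketch of the idea in Python, using simulated score distributions (this is only an illustration, not a full equipercentile equating procedure): find the percentile of a Form B raw score in its own distribution, then map it to the Form A raw score at that same percentile.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated raw-score distributions for two forms given to equivalent groups
form_a_scores = rng.normal(38, 6, 5000)
form_b_scores = rng.normal(36, 6, 5000)   # Form B is a bit harder

def equipercentile_equate(raw_b):
    """Map a Form B raw score to the Form A raw score at the same percentile."""
    percentile = np.mean(form_b_scores <= raw_b) * 100
    return np.percentile(form_a_scores, percentile)

# A Form B score maps to a (typically higher) Form A score, which already has a scaled score
print(round(equipercentile_equate(44), 1))
```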

Why are we doing this again?

Well, you can kind of see it in the example of having two forms with a difference in difficulty. In the Equipercentile example, suppose there is a cut score to be in the top 10% to win a scholarship. If you get 45 on Form A you will lose, but if you get 45 on Form B you will win. Test sponsors don’t want to have this conversation with angry examinees, so they convert all scores to an arbitrary scale. The 90th percentile is always a 384, no matter how hard the test is. (Yes, that simple example assumes the populations are the same… there’s an entire portion of psychometric research dedicated to performing stronger equating.)

How do we implement scaled scoring?

Some transformations are easily done in a spreadsheet, but any good online assessment platform should handle this topic for you.  Here’s an example screenshot from our software.

[Screenshot: scaled scoring setup in FastTest]

 


Enemy items is a psychometric term that refers to two test questions (items) which should not appear together: on the same test form (if linear), or delivered to the same examinee (if LOFT or adaptive).  This is therefore relevant to linear forms, but also pertains to linear on the fly testing (LOFT) and computerized adaptive testing (CAT).  There are several reasons why two items might be considered enemies:

  1. Too similar: the text of the two items is almost the same.
  2. One gives away the answer to the other.
  3. The items are on the same topic/answer, even if the text is different.

 

How do we find enemy items?

There are two ways, as is often the case: manual and automated.

Manual means that humans are reading items and intentionally mark two of them as enemies.  So maybe you have a reviewer that is reviewing new items from a pool of 5 authors, and finds two that cover the same concept.  They would mark them as enemies.

Automated means that you have a machine learning algorithm, such as one which uses natural language processing (NLP) to evaluate all items in a pool and then uses distance/similarity metrics to quantify how similar they are.  Of course, this could miss some of the situations, like if two items have the same topic but have fairly different text.  It is also difficult to do if items have formulas, multimedia files, or other aspects that could not be caught by NLP.
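
Here is a minimal sketch of the automated approach in Python, using TF-IDF vectors and cosine similarity to flag highly similar item pairs; the items and the 0.7 threshold are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = [
    "What is the capital city of France?",
    "Which city is the capital of France?",
    "What is the boiling point of water at sea level?",
]

# Vectorize the item text and compute pairwise cosine similarity
vectors = TfidfVectorizer().fit_transform(items)
similarity = cosine_similarity(vectors)

# Flag any pair of distinct items above an arbitrary similarity threshold
threshold = 0.7
for i in range(len(items)):
    for j in range(i + 1, len(items)):
        if similarity[i, j] > threshold:
            print(f"Possible enemy items: {i} and {j} (similarity {similarity[i, j]:.2f})")
```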

 

Why are enemy items a problem?

This violates the assumption of local independence; that the interaction of an examinee with an item should not be affected by other items.  It also means that the examinee is in double jeopardy; if they don’t know that topic, they will be getting two questions wrong, not one.  There are other potential issues as well, as discussed in this article.

 

What does this mean for test development?

We want to identify enemy items and ensure that they don’t get used together.  Your item banking and assessment platform should have functionality to track which items are enemies.  You can sign up for a free account in FastTest to see an example.

 

HR assessment is a critical part of the HR ecosystem, used to select the best candidates with pre-employment testing, assess training, certify skills, and more.  But there is a huge range in quality, as well as a wide range in the types of assessment they are designed for.  This post will break down the different approaches and help you find the best solution.

HR assessment platforms help companies create effective assessments, thus saving valuable resources, improving candidate experience & quality, providing more accurate and actionable information about human capital, and reducing hiring bias.  But, finding software solutions that can help you reap these benefits can be difficult, especially because of the explosion of solutions in the market.  If you are lost on which tools will help you develop and deliver your own HR assessments, this guide is for you.

What is HR assessment?

HR assessment is a comprehensive process used by human resources professionals to evaluate various aspects of potential and current employees’ abilities, skills, and performance. This process encompasses a wide range of tools and methodologies designed to provide insights into an individual’s suitability for a role, their developmental needs, and their potential for future growth within the organization.

hr assessment software presentation

The primary goal of HR assessment is to make informed decisions about recruitment, employee development, and succession planning. During the recruitment phase, HR assessments help in identifying candidates who possess the necessary competencies and cultural fit for the organization.

There are various types of assessments used in HR.  Here are four main areas, though this list is by no means exhaustive.

  1. Pre-employment tests to select candidates
  2. Post-training assessments
  3. Certificate or certification exams (can be internal or external)
  4. 360-degree assessments and other performance appraisals

 

Pre-employment tests

Finding good employees in an overcrowded market is a daunting task. In fact, according to the Harvard Business Review, 80% of employee turnover is attributed to poor hiring decisions. Bad hires are not only expensive, but can also adversely affect cultural dynamics in the workforce. This is one area where HR assessment software shows its value.

There are different types of pre-employment assessments. Each of them achieves a different goal in the hiring process. The major types of pre-employment assessments include:

Personality tests: Despite rapidly finding their way into HR, these types of pre-employment tests are widely misunderstood.  Personality tests address questions on the social and behavioral spectrum, and one of their main goals is to predict the success of candidates based on behavioral traits.

Aptitude tests: Unlike personality tests or emotional intelligence tests, which tend to lie on the social spectrum, aptitude tests measure problem-solving, critical thinking, and agility.  These tests are popular because they can predict job performance better than any other type, tapping into areas that cannot be found in resumes or job interviews.

Skills tests: These kinds of tests can be considered a measure of job knowledge and experience, ranging from high-end skills down to basic skills such as typing or Microsoft Excel.  Skills tests can either measure specific skills, such as communication, or generalized skills, such as numeracy.

Emotional intelligence tests: These kinds of assessments are a newer concept but are becoming important in the HR industry.  With strong emotional intelligence (EI) being associated with benefits such as improved workplace productivity and good leadership, many companies are investing heavily in developing these kinds of tests.  Although they can be administered to any candidate, they are generally recommended for people seeking leadership positions or those expected to work in social contexts.

Risk tests: As the name suggests, these types of tests help companies reduce risks. Risk assessments offer assurance to employers that their workers will commit to established work ethics and not involve themselves in any activities that may cause harm to themselves or the organization.  There are different types of risk tests. Safety tests, which are popular in contexts such as construction, measure the likelihood of the candidates engaging in activities that can cause them harm. Other common types of risk tests include Integrity tests.

 

Post-training assessments

This refers to assessments that are delivered after training.  It might be a simple quiz after an eLearning module, all the way up to a certification exam after months of training (see next section).  Often, it is somewhere in between.  For example, you might spend an afternoon in a training course, after which you take a formal test that is required to do something on the job.  When I was a high school student, I worked in a lumber yard, and did exactly this to become an OSHA-approved forklift driver.

 

Certificate or certification exams

Sometimes, the exam process can be high-stakes and formal.  It is then a certificate or certification, or sometimes a licensure exam.  More on that here.  This can be internal to the organization, or external.

Internal certification: The credential is awarded by the training organization, and the exam is specifically tied to a certain product or process that the organization provides in the market.  There are many such examples in the software industry.  You can get certifications in AWS, SalesForce, Microsoft, etc.  One of our clients makes MRI and other medical imaging machines; candidates are certified on how to calibrate/fix them.

External certification: The credential is awarded by an external board or government agency, and the exam is industry-wide.  An example of this is the SIE exams offered by FINRA.  A candidate might go to work at an insurance company or other financial services company, who trains them and sponsors them to take the exam in hopes that the company will get a return by the candidate passing and then selling their insurance policies as an agent.  But the company does not sponsor the exam; FINRA does.

 

360-degree assessments and other performance appraisals

Job performance is one of the most important concepts in HR, and also one that is often difficult to measure.  John Campbell, one of my thesis advisors, was known for developing an 8-factor model of performance.  Some aspects are subjective, and some are easily measured by real-world data, such as number of widgets made or number of cars sold by a car salesperson.  Others involve survey-style assessments, such as asking customers, business partners, co-workers, supervisors, and subordinates to rate a person on a Likert scale.  HR assessment platforms are needed to develop, deliver, and score such assessments.

 

The Benefits of Using Professional-Level Exam Software

Now that you have a good understanding of what pre-employment and other HR tests are, let’s discuss the benefits of integrating pre-employment assessment software into your hiring process. Here are some of the benefits:

Saves Valuable resources

Unlike the lengthy and costly traditional hiring processes, pre-employment assessment software helps companies increase their ROI by eliminating HR snags such as face-to-face interactions or geographical restrictions. Pre-employment testing tools can also reduce the amount of time it takes to make good hires while reducing the risks of facing the financial consequences of a bad hire.

Supports Data-Driven Hiring Decisions

Data runs the modern world, and hiring is no different. You are better off letting complex algorithms crunch the numbers and help you decide which talent is a fit, as opposed to hiring based on a hunch or less-accurate methods like an unstructured interview.  Pre-employment assessment software helps you analyze assessments and generate reports/visualizations to help you choose the right candidates from a large talent pool.

Improving candidate experience 

Candidate experience is an important aspect of a company’s growth, especially considering that 69% of candidates admit they would not apply for a job at a company after having a negative experience. A good candidate experience means you get access to the best talent in the world.

Elimination of Human Bias

Traditional hiring processes are based on instinct. They are not effective, since it’s easy for candidates to provide false information on their resumes and cover letters. The use of pre-employment assessment software has helped in eliminating this hurdle. These tools level the playing field, so that only the best candidates are considered for a position.

 

What To Consider When Choosing HR assessment tools

Now that you have a clear idea of what pre-employment tests are and the benefits of integrating pre-employment assessment software into your hiring process, let’s see how you can find the right tools.

Here are the most important things to consider when choosing the right pre-employment testing software for your organization.

Ease-of-use

The candidates should be your top priority when you are sourcing pre-employment assessment software, because ease of use directly correlates with a good candidate experience. Good software should have simple navigation and be easy to comprehend.

Here is a checklist to help you decide if a pre-employment assessment software is easy to use:

  • Are the results easy to interpret?
  • What is the UI/UX like?
  • What ways does it use to automate tasks such as applicant management?
  • Does it have good documentation and an active community?

Tests Delivery and Remote Proctoring

Good online assessment software should feature robust online proctoring functionality. This is because most remote jobs accept applications from all over the world. It is therefore advisable to choose a pre-employment testing software that has secure remote proctoring capabilities. Here are some things you should look for in remote proctoring:

  • Does the platform support security processes such as IP-based authentication, lockdown browser, and AI-flagging?
  • What types of online proctoring does the software offer? Live real-time, AI review, or record and review?
  • Does it let you bring your own proctor?
  • Does it offer test analytics?

Test & data security, and compliance

Test security is a major component of defensibility. There are several layers of security associated with pre-employment testing. When evaluating this aspect, you should consider what the pre-employment testing software does to achieve the highest level of security, because data breaches are wildly expensive.

The first layer of security is the test itself. The software should support security technologies and frameworks such as lockdown browser, IP-flagging, and IP-based authentication.

The other layer of security is on the candidate’s side. As an employer, you will have access to the candidate’s private information. How can you ensure that your candidate’s data is secure? That is reason enough to evaluate the software’s data protection and compliance guidelines.

A good pre-employment testing software should be compliant with regulations and standards such as GDPR. The software should also be flexible enough to adapt to compliance guidelines from different parts of the world.

Questions you need to ask:

  • What mechanisms does the software employ to prevent cheating?
  • Is their remote proctoring function reliable and secure?
  • Are they compliant with security compliance guidelines including ISO, SSO, or GDPR?
  • How does the software protect user data?

Psychometrics

Psychometrics is the science of assessment, helping to derive accurate scores from defensible tests, as well as making them more efficient, reducing bias, and providing a host of other benefits.  You should ensure that your solution supports the necessary level of psychometrics.

 

User experience

A good user experience is a must-have when you are sourcing any enterprise software. A modern pre-employment testing software should be designed with both the candidates and the employer in mind. Some ways you can tell if a software offers a seamless user experience include:

  • User-friendly interface
  • Simple and easy to interact with
  • Easy to create and manage item banks
  • Clean dashboard with advanced analytics and visualizations

Customizing your user-experience maps to fit candidates’ expectations attracts high-quality talent.

 

Scalability and automation

With a single job post attracting approximately 250 candidates, scalability isn’t something you should overlook. A good pre-employment testing software should thus have the ability to handle any kind of workload, without sacrificing assessment quality.

It is also important you check the automation capabilities of the software. The hiring process has many repetitive tasks that can be automated with technologies such as Machine learning, Artificial Intelligence (AI), and robotic process automation (RPA).

Here are some questions you should consider in relation to scalability and automation;

  • Does the software offer Automated Item Generation (AIG)?
  • How many candidates can it handle?
  • Can it support candidates from different locations worldwide?

Reporting and analytics


A good pre-employment assessment software will not leave you hanging after helping you develop and deliver the tests. It will enable you to derive important insight from the assessments.

The analytics reports can then be used to make data-driven decisions on which candidate is suitable and how to improve candidate experience. Here are some queries to make on reporting and analytics.

  • Does the software have a good dashboard?
  • What format are reports generated in?
  • What are some key insights that prospects can gather from the analytics process?
  • How good are the visualizations?

Customer and Technical Support

Customer and technical support is not something you should overlook. A good pre-employment assessment software should have an omni-channel support system that is available 24/7, mainly because some situations need a fast response. Here are some of the questions you should ask when vetting customer and technical support:

  • What channels of support does the software offer, and how prompt is their support?
  • How good is their FAQ/resources page?
  • Do they offer multi-language support mediums?
  • Do they have dedicated managers to help you get the best out of your tests?

 

Conclusion

Finding the right HR assessment software is a lengthy process, yet profitable in the long run. We hope the article sheds some light on the important aspects to look for when looking for such tools. Also, don’t forget to take a pragmatic approach when implementing such tools into your hiring process.

Are you stuck on how you can use pre-employment testing tools to improve your hiring process? Feel free to contact us and we will guide you on the entire process, from concept development to implementation. Whether you need off-the-shelf tests or a comprehensive platform to build your own exams, we can provide the guidance you need.  We also offer free versions of our industry-leading software  FastTest  and  Assess.ai  – visit our Contact Us page to get started!

If you are interested in delving deeper into leadership assessments, you might want to check out this blog post.  For more insights and an example of how HR assessments can fail, check out our blog post called Public Safety Hiring Practices and Litigation. The blog post titled Improving Employee Retention with Assessment: Strategies for Success explores how strategic use of assessments throughout the employee lifecycle can enhance retention, build stronger teams, and drive business success by aligning organizational goals with employee development and engagement.


Incremental validity is a specific aspect of criterion-related validity that refers to what an additional assessment or predictive variable can add to the information provided by existing assessments or variables.  It refers to the amount of “bonus” predictive power gained by adding another predictor.  In many cases the new predictor measures the same or a similar trait, but often the most incremental validity comes from a predictor/trait that is relatively unrelated to the original.  See the examples below.

Note that this is often discussed with respect to tests and assessment, but in many cases a predictor is not a test or assessment, as you will also see.

How is Incremental Validity Evaluated?

It is most often quantified with a linear regression model and correlations, though any predictive modeling approach could work, from support vector machines to neural networks.

Example of Incremental Validity: University Admissions

One of the most commonly used predictors for university admissions is an admissions test, or a battery of tests.  You might be required to take an assessment which includes an English/Verbal test, a Logic/Reasoning test, and a Quantitative/Math test.  These might be used individually or aggregated to create a mathematical model, based on past data, that predicts your performance at university. (There are actually several possible criterion variables, such as first-year GPA, final GPA, and 4-year graduation rate, but that’s beyond the scope of this article.)

Of course, the admissions exams scores are not the only point of information that the university has on students.  It also has their high school GPA, perhaps an admissions essay which is graded by instructors, and so on.  Incremental validity poses this question: if the admissions exam correlates 0.59 with first year GPA, what happens if we make it into a multiple regression/correlation with High School GPA (HGPA) as a second predictor?  It might go up to, say, 0.64.  There is an increment of 0.05.  If the university has that data from students, they would be wasting it by not using it.

Of course, HGPA will correlate highly with the admissions exam scores, so it will likely not add a lot of incremental validity.  Perhaps the school finds that essays add a 0.09 increment to the predictive power, because they are more orthogonal to the admissions exam scores.  Does it make sense to add them, given the additional expense of scoring thousands of essays?  That’s a business decision for them.
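
Here is a minimal sketch of how this is typically evaluated with hierarchical regression, using simulated data in Python (the relationships below are made up, not results from any real study): fit a model with the exam alone, add HGPA, and compare the R-squared values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000

# Simulated predictors and criterion (first-year GPA); relationships are arbitrary
exam = rng.normal(size=n)
hgpa = 0.7 * exam + 0.7 * rng.normal(size=n)          # HGPA correlates with the exam
fygpa = 0.6 * exam + 0.2 * hgpa + rng.normal(size=n)

exam_only = exam.reshape(-1, 1)
both = np.column_stack([exam, hgpa])

r2_exam = LinearRegression().fit(exam_only, fygpa).score(exam_only, fygpa)
r2_both = LinearRegression().fit(both, fygpa).score(both, fygpa)

# The gain in R-squared is the incremental validity of HGPA over the exam alone
print(round(r2_exam, 3), round(r2_both, 3), round(r2_both - r2_exam, 3))
```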

Example of Incremental Validity: Pre-Employment Testing

Another common use case is pre-employment testing, where the purpose of the test is to predict criterion variables like job performance, tenure, 6-month termination rate, or counterproductive work behavior.  You might start with a skills test; perhaps you are hiring accountants or bookkeepers and you give them a test on MS Excel.  What additional predictive power would we get by also doing a quantitative reasoning test?  Probably some, but that most likely correlates highly with MS Excel knowledge.  So what about using a personality assessment like Conscientiousness?  That would be more orthogonal.  It’s up to the researcher to determine what the best predictors are.  This topic, personnel selection, is one of the primary areas of Industrial/Organizational Psychology.


Summative and formative assessment are crucial components of the educational process.  If you work in the educational assessment field, or even in education generally, you have probably encountered these terms.  What do they mean?  This post will explore the differences between summative and formative assessment.

Assessment plays a crucial role in education, serving as a powerful tool to gauge student understanding and guide instructional practices. Among the various assessment methods, two approaches stand out: formative assessment and summative assessment. While both types aim to evaluate student performance, they serve distinct purposes and are applied at different stages of the learning process.

 

What is Summative Assessment?

Summative assessment refers to an assessment that comes at the end (the “sum”) of an educational experience.  The “educational experience” can vary widely.  Perhaps it is a one-day training course, or even shorter.  I worked at a lumber yard in high school, and I remember getting rudimentary training – maybe an hour – on how to use a forklift before they had me take an exam to become OSHA certified to use a forklift.  Proctored by the guy who had just showed me the ropes, of course.  On the other end of the spectrum is board certification for a physician specialty like ophthalmology: after 4 years of undergrad, 4 years of med school, and several more years of specialty training, you finally get to take the exam.  Either way, the purpose is to evaluate what you learned in some educational experience.

Note that it does not have to be formal education.  Many certifications have multiple eligibility pathways.  For example, to be eligible to sit for the exam, you might need:

  1. A bachelor’s degree
  2. An associate degree plus 1 year of work experience
  3. 3 years of work experience.

How it is developed

Summative assessments are usually developed by assessment professionals, or a board of subject matter experts led by assessment professionals.  For example, a certification for ophthalmology is not informally developed by a teacher; there is a panel of experienced ophthalmologists led by a psychometrician.  A high school graduation exam might be developed by a panel of experienced math or English teachers, again led by a psychometrician and test developers.

The process is usually very long and time-intensive, and therefore quite expensive.  A certification will need a job analysis, item writing workshop, standard-setting study, and other important developments that contribute to the validity of the exam scores.  A high school graduation exam has expensive curriculum alignment studies and other aspects.

Implementation of Summative Assessment

Let’s explore the key aspects of summative assessment:

  1. End-of-Term Evaluation: Summative assessments are administered after the completion of a unit, semester, or academic year. They aim to evaluate the overall achievement of students and determine their readiness for advancement or graduation.
  2. Formal and Standardized: Summative assessments are often formal, standardized, and structured, ensuring consistent evaluation across different students and classrooms. Common examples include final exams, standardized tests, and grading rubrics.
  3. Accountability: Summative assessment holds students accountable for their learning outcomes and provides a comprehensive summary of their performance. It also serves as a basis for grade reporting, academic placement, and program evaluation.
  4. Future Planning: Summative assessment results can guide future instructional planning and curriculum development. They provide insights into areas of strength and weakness, helping educators identify instructional strategies and interventions to improve student outcomes.

 

What is Formative Assessment?

Formative assessment is something that is used during the educational process.  Everyone is familiar with this from their school days.  A quiz, an exam, or even just the teacher asking you a few questions verbally to understand your level of knowledge.  Usually, but not always, a formative assessment is used to direct instruction.  A common example of formative assessment is the low-stakes exams given in K-12 schools purely to check on student growth, without counting towards their grades.  Some of the most widely used titles are the NWEA MAP, Renaissance Learning STAR, and Imagine Learning MyPath.

Formative assessment is a great fit for computerized adaptive testing, a method that adapts the difficulty of the exam to each student.  If a student is 3 grades behind, the test will quickly adapt down to that level, providing a better experience for the student and more accurate feedback on their level of knowledge.

How it is developed

Formative assessments are typically much more informal than summative assessments.  Most of the exams we take in our life are informally developed formative assessments; think of all the quizzes and tests you ever took during courses as a student.  Even taking a test during training on the job will often count.  However, some are developed with heavy investment, such as a nationwide K-12 adaptive testing platform.

Implementation of Formative Assessment

Formative assessment refers to the ongoing evaluation of student progress throughout the learning journey. It is designed to provide immediate feedback, identify knowledge gaps, and guide instructional decisions. Here are some key characteristics of formative assessment:

  1. Timely Feedback: Formative assessments are conducted during the learning process, allowing educators to provide immediate feedback to students. This feedback focuses on specific strengths and areas for improvement, helping students adjust their understanding and study strategies.
  2. Informal Nature: Formative assessments are typically informal and flexible, offering a wide range of techniques such as quizzes, class discussions, peer evaluations, and interactive activities. They encourage active participation and engagement, promoting deeper learning and critical thinking skills.
  3. Diagnostic Function: Formative assessment serves as a diagnostic tool, enabling teachers to monitor individual and class-wide progress. It helps identify misconceptions, adapt instructional approaches, and tailor learning experiences to meet students’ needs effectively.
  4. Growth Mindset: The primary goal of formative assessment is to foster a growth mindset among students. By focusing on improvement rather than grades, it encourages learners to embrace challenges, learn from mistakes, and persevere in their educational journey.

 

Summative vs Formative Assessment

Below are some of the principal differences between summative and formative assessment across several general aspects.

  • Purpose: Summative evaluates overall student learning at the end of an instructional period; formative monitors student learning and provides ongoing feedback for improvement.
  • Timing: Summative is conducted at the end of a unit, semester, or course; formative is conducted throughout the learning process.
  • Role in the learning process: Summative determines the extent of learning and achievement; formative identifies learning needs and guides instructional adjustments.
  • Feedback mechanism: Summative feedback is usually provided after the assessment is completed and is often limited to final results or scores; formative provides immediate, specific, and actionable feedback to improve learning.
  • Nature of evaluation: Summative is typically evaluative and judgmental, focusing on the outcome; formative is diagnostic and supportive, focusing on the process and improvement.
  • Impact on grading: Summative is often a major component of the final grade; formative is generally not used for grading and is intended to inform learning.
  • Level of standardization: Summative is highly standardized to ensure fairness and comparability; formative is less standardized, often tailored to individual needs and contexts.
  • Frequency of implementation: Summative is typically infrequent, such as once per term or unit; formative is frequent and ongoing, integrated into daily learning activities.
  • Stakeholders involved: Summative primarily involves educators and administrative bodies for accountability purposes; formative involves students, educators, and sometimes parents for immediate learning support.
  • Flexibility in use: Summative is rigid in format and timing, used to meet predetermined educational benchmarks; formative is highly flexible and can be adapted to fit specific instructional goals and learner needs.

 

The Synergy Between Summative and Formative Assessment

While formative and summative assessments have distinct purposes, they work together in a complementary manner to enhance learning outcomes. Here are a few ways in which these assessment types can be effectively integrated:

  1. Feedback Loop: The feedback provided during formative assessments can inform and improve summative assessments. It allows students to understand their strengths and weaknesses, guiding their study efforts for better performance in the final evaluation.
  2. Continuous Improvement: By employing formative assessments throughout a course, teachers can continuously monitor student progress, identify learning gaps, and adjust instructional strategies accordingly. This iterative process can ultimately lead to improved summative assessment results.
  3. Balanced Assessment Approach: Combining both formative and summative assessments creates a more comprehensive evaluation system. It ensures that student growth and understanding are assessed both during the learning process and at the end, providing a holistic view.

 

Summative and Formative Assessment: A Validity Perspective

So what is the difference?  You will notice it is the situation and use of the exam, not the exam itself.  You could take those K-12 feedback assessments and deliver them at the end of the year, with weighting towards the student’s final grade.  That would make them summative.  But that is not what the test was designed for.  This is the concept of validity; the evidence showing that interpretations and use of test scores are supported towards their intended use.  So the key is to design a test for its intended use, provide evidence for that use, and make sure that the exam is being used in the way that it should be.