
Multi-modal test delivery refers to an exam that can be delivered in several different ways, or to an online testing platform designed to support this process. For example, you might provide the option for a certification exam to be taken on computer at third-party testing centers or on paper at the profession's annual conference. The goal of multi-modal test delivery is to improve access and convenience for examinees. In this example, perhaps the testing-center approach requires an extra $60 proctoring fee and up to an hour of driving; an examinee who is attending the annual conference next month anyway might find it very convenient to duck into a side room there to take the exam.

Multi-modal test delivery requires scalable security on the part of your delivery partner. The exam platform should be able to support various types of exam delivery. Here are some approaches to consider.

Paper exams

Your platform should be able to make print-ready versions of the test. Note that this is quite different from exporting test items to Word or PDF; straight exports are often ugly and include metadata.  You might also need advanced formats like Adobe InDesign.

The system should also be able to import the results of a paper test, so that they are available for scoring and reporting alongside the other modes of delivery.

FastTest can do all of these things, as well as the points below.  You can sign up for a free account and try it yourself.

Online unproctored

The platform should be able to deliver exams online, without proctoring. There are several ways for candidates to enter the exam.

1. As a direct link, without registration, such as an anonymous survey

2. As a direct link, but requiring self-registration

3. Pre-registration, with some sort of password to ensure the right person is taking the exam. This can be emailed or distributed, or perhaps is available via another software platform like a learning management system or applicant tracking system.

Online remote-proctored

The platform should be able to deliver the test online, with remote proctoring. There are several levels of remote proctoring, corresponding to increasing levels of security or stakes.

1. AI only: Video is recorded of the candidate taking the exam, and it is “reviewed” by AI algorithms. A human has the opportunity to review the flagged candidates, but in many cases that review never happens.

2. Record and review: Video is recorded, and every video is reviewed by a human. This provides stronger security than AI only, but it does not prevent test theft, because the theft would only be discovered a day or two later.

3. Live: Video is live-streamed and watched in real time. This provides the opportunity to stop the exam if someone is cheating. The proctors can be third-party or in some cases the organization’s staff. If you are using your staff, make sure to avoid the mistakes made by Cleveland State University.

Testing centers managed by you

Some testing platforms have functionality for you to manage your own testing centers. When candidates are registered for an exam, they are assigned to an appropriate center. In some cases, the center is also assigned a proctor. The platform might have a separate login for the proctor, requiring them to enter a password before the examinee can enter theirs (or the proctor enters it on their behalf).

[Screenshot: test scheduler showing sites and proctor codes]

Formal third-party testing centers

Some vendors will have access to a network of testing centers. These will have trained proctors, computers, and sometimes additional security considerations like video monitoring or biometric scanners when candidates arrive. There are three types of testing centers.

1. Owned: The testing company actually owns their own centers, and they are professionally staffed.

2. Independent/affiliated: The testing company might contract with professional testing centers that are owned by a different company. In some cases, these are fully independent.

3. Public: Some organizations will contract with public locations, such as computer labs at universities or libraries.

Summary: multi-modal test delivery

Multi-modal test delivery provides flexibility for exam sponsors. There are two situations where this is important. First, a single test can be delivered in multiple ways with equivalent security, to allow for greater convenience, like the conference example above. Second, it empowers a testing organization to run multiple types of exams at different levels of security. For instance, a credentialing board might have an unproctored online exam as a practice test, a test-center exam for its primary certification exam, and a remote-proctored test for annual recertification. Having a single platform makes it easier for the organization to manage its assessment activities, reducing costs while improving the experience for the people for whom it really matters – the candidates.

Split Half Reliability Analysis

Split half reliability is an internal consistency approach to quantifying the reliability of a test, in the paradigm of classical test theory. Reliability refers to the repeatability or consistency of test scores; we definitely want a test to be reliable.  The name comes from a simple description of the method: we split the test into two halves, calculate the score on each half for each examinee, then correlate those two columns of numbers.  If the two halves measure the same thing, the correlation is high, indicating a decent level of unidimensionality in the construct and reliability in measuring it.

Why do we need to estimate reliability?  Well, it is one of the easiest ways to quantify the quality of the test.  Some would argue, in fact, that it is a gross oversimplification.  However, because it is so convenient, classical indices of reliability are incredibly popular.  The most popular is coefficient alpha, which is a competitor to split half reliability.

How to Calculate Split Half Reliability

The process is simple.

  1. Take the test and split it in half
  2. Calculate the score of each examinee on each half
  3. Correlate the scores on the two halves

The correlation is best done with the standard Pearson correlation.

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}

where x and y are the scores on the two halves.

This, of course, raises the question: how do we split the test into two halves?  There are many possible ways, but psychometricians generally recommend three:

  1. First half vs last half
  2. Odd-numbered items vs even-numbered items
  3. Random split

You can do these manually with your matrix of data, but good psychometric software will do them for you, and more (see the screenshot below).

Example

Suppose this is our data set, and we want to calculate split half reliability.

Person Item1 Item2 Item3 Item4 Item5 Item6 Score
1 1 0 0 0 0 0 1
2 1 0 1 0 0 0 2
3 1 1 0 1 0 0 3
4 1 0 1 1 1 1 5
5 1 1 0 1 0 1 4

Let’s split it by first half and last half.  Here are the scores.

Score 1 Score 2
1 0
2 0
2 1
2 3
2 2

The correlation of these is 0.51.

Now, let’s try odd/even.

Score 1 Score 2
1 0
2 0
1 2
3 2
1 3

The correlation of these is -0.04!  Obviously, the different ways of splitting don’t always agree.  Of course, with such a small sample here, we’d expect a wide variation.
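If you want to verify these numbers yourself, here is a minimal Python sketch (assuming numpy is installed) that reproduces both split-half correlations for the small data set above.

```python
import numpy as np

# Item response matrix from the example above: 5 examinees x 6 dichotomous items
X = np.array([
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
])

def split_half_r(matrix, half_a, half_b):
    """Correlate total scores on two halves defined by column indices."""
    return np.corrcoef(matrix[:, half_a].sum(axis=1),
                       matrix[:, half_b].sum(axis=1))[0, 1]

r_first_last = split_half_r(X, [0, 1, 2], [3, 4, 5])  # first vs last half -> ~0.51
r_odd_even = split_half_r(X, [0, 2, 4], [1, 3, 5])    # odd vs even items  -> ~-0.04
print(round(r_first_last, 2), round(r_odd_even, 2))
```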

Advantages of Split Half Reliability

One advantage is that it is so simple, both conceptually and computationally.  It’s easy enough that you can calculate it in Excel if you need to.  This also makes it easy to interpret and understand.

Another advantage, which I was taught in grad school, is that split half reliability assumes equivalence of the two halves that you have created; on the other hand, coefficient alpha is based at an item level and assumes equivalence of items.  This of course is never the case – but alpha is fairly robust and everyone uses it anyway.

Disadvantages… and the Spearman-Brown Formula

The major disadvantage is that this approach evaluates only half a test.  Because tests are more reliable with more items, having fewer items in a measure will reduce its reliability.  So if we take a 100-item test and divide it into two 50-item halves, we are essentially estimating the reliability of a 50-item test, which underestimates the reliability of the full 100-item test.  Fortunately, there is a way to adjust for this: the Spearman-Brown formula.  This simple formula adjusts the correlation back up to what it should be for the 100-item test.
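The formula itself is simple: for a half-test correlation r, the projected full-length reliability is 2r / (1 + r), or more generally kr / (1 + (k - 1)r) for a test lengthened by a factor of k. A quick Python sketch, applied to the 0.51 first/last correlation from the worked example above:

```python
def spearman_brown(r_half, k=2):
    """Spearman-Brown prophecy formula: project reliability for a test
    lengthened by a factor of k, given the half-test correlation."""
    return k * r_half / (1 + (k - 1) * r_half)

print(round(spearman_brown(0.51), 2))  # 0.68 -- the 0.51 stepped up to full length
```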

Another disadvantage was mentioned above: the different ways of splitting don’t always agree.  Again, fortunately, if you have a larger sample of people or a longer test, the variation is minimal.

OK, how do I actually implement this?

Any good psychometric software will provide some estimates of split half reliability.  Below is the table of reliability analysis from Iteman.  This table actually continues for all subscores on the test as well.  You can download Iteman for free at its page and try it yourself.

This test had 100 items, of which 85 were scored (15 were unscored pilot items).  The alpha was around 0.82, which is acceptable, though it should be higher for 100 items.  Results are then shown for all three split-half methods, and then again for the Spearman-Brown (S-B) adjusted version of each.  Do they agree with alpha?  For the total test, the results don't jibe for two of the three methods.  But for the scored items, the three S-B calculations align with the alpha value.  This is most likely because some of the 15 pilot items were actually quite bad.  In fact, note that the alpha for 85 items is higher than for 100 items – which says the 15 new items were actually hurting the test!

[Screenshot: reliability analysis table from Iteman]

This is a good example of using alpha and split half reliability together.  We made an important conclusion about the exam and its items, merely by looking at this table.  Next, the researcher should evaluate those items, usually with P value difficulty and point-biserial discrimination.

 


A cutscore or passing point (also called a cut-off score or cutoff score) is a score on a test that is used to categorize examinees.  The most common example of this is pass/fail, which we are all familiar with from our school days.  For instance, a score of 70% and above will pass, while below 70% will fail.  However, many tests have more than one cutscore.  An example of this is the National Assessment of Educational Progress (NAEP) in the USA, which has 3 cutscores, creating 4 categories: Below Basic, Basic, Proficient, and Advanced.

The process of setting a cutscore is called a standard-setting study.  However, I dislike this term because the word “standard” is used to reflect other things in the assessment world.  In some cases, it is the definition of what is to be learned or covered (see Common Core State Standards), and in other cases it refers to the process of reducing construct-irrelevant variance by ensuring that all examinees take the test under standardized conditions (standardized testing).  So I prefer cutscore or passing point.  That said, passing point is limited to the case of an exam with only one cutscore where the classifications are pass/fail, which is not always the case – not only are there many situations with more than one cutscore, but many two-category situations use other decisions, like Hire/NotHire or a clinical diagnosis like Depressed/NotDepressed.

When establishing cutscores, it is important to use scaled scores to ensure consistency and fairness.  Scaling adjusts raw scores to a common metric, which helps to accurately reflect the intended performance standards across different test forms or administrations.  You may read about setting a cutscore on a test scored with item response theory in this blog post.  For a deeper understanding of how measurement variability can affect the interpretation of cutscores, be sure to check out our blog post on confidence intervals.

Types of cutscores

There are two types of cutscores, reflecting the two ways that a test score can be interpreted: norm-referenced and criterion-referenced.  The Hofstee method represents a compromise approach that incorporates aspects of both.

Criterion-referenced Cutscore

A cutscore of this type is referenced to the material of the exam, regardless of examinee performance.  In most cases, this is the sort of cutscore you need for a high-stakes exam to be legally defensible.  Psychometricians have spent a lot of time inventing ways to do this, and scientifically studying them.

Names of some methods you might see for this type are: modified-Angoff, Nedelsky, and Bookmark.

Example

An example of this is a certification exam.  If the cutscore is 75%, anyone who scores 75% or above passes.  In some months or years this might be most candidates; in others it might be fewer.  The standard does not change.  In fact, the organizations that manage such exams go to great lengths to keep it stable over time, a process known as equating.

Norm-referenced Cutscore

A cutscore of this type is referenced to the examinees, regardless of their mastery of the material.

One name you might see for this is a quota, such as when a test is used to accept only the top 10% of applicants.

Example

An example of this was my college Biology class.  It was a weeder class, designed to weed out the students who start college planning to be pre-med simply because they like the idea of being a doctor or are drawn to the potential salary.  So the exams were intentionally made very hard, such that the average score might be only 50% correct.  An A was then awarded to anyone with a z-score of 1.0 or greater, which is roughly the top 16% of students – regardless of how well you actually scored on the exam.  You might get a score of 60% correct but be at the 95th percentile and get an A.
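For illustration, here is a minimal Python sketch (the class mean and standard deviation are hypothetical, and it assumes scipy is available) showing how a z-score cutoff translates into a raw cutscore and the share of examinees above it.

```python
from scipy import stats

mean_pct_correct = 50.0   # hypothetical class average on the hard exam
sd_pct_correct = 10.0     # hypothetical standard deviation

z_cut = 1.0                                           # norm-referenced cutoff
raw_cut = mean_pct_correct + z_cut * sd_pct_correct   # 60% correct earns an A
prop_above = 1 - stats.norm.cdf(z_cut)                # ~0.16 of examinees

print(raw_cut, round(prop_above, 3))
```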


The Nedelsky method is an approach to setting the cutscore of an exam.  Originally suggested by Nedelsky (1954), it is an early attempt to apply a quantitative, rigorous procedure to the process of standard setting.  Quantitative approaches are needed to eliminate the arbitrariness and subjectivity that would otherwise dominate the process of setting a cutscore.  The most obvious and common example of this is simply setting the cutscore at a round number like 70%, regardless of the difficulty of the test or the ability level of the examinees.  It is for this reason that a cutscore must be set with a method such as the Nedelsky approach to be legally defensible or to meet accreditation standards.

How to implement the Nedelsky method

The first step, like several other standard-setting methods, is to gather a panel of subject matter experts (SMEs).  The next step is for the panel to discuss the concept of a minimally competent candidate (MCC): the type of candidate who should barely pass this exam, sitting on the borderline of competence.  The panel then reviews a test form, paying specific attention to each of the items on the form.  For every item in the test form, each rater estimates the number of options that an MCC will be able to eliminate.  This then translates into the probability of a correct response, assuming that the candidate guesses among the remaining options.  If an MCC can eliminate only one of the options of a four-option item, they have a 1/3 = 33% chance of getting the item correct.  If two, then 1/2 = 50%.

These ratings are then averaged across all items and all raters.  This then represents the percentage score expected of an MCC on this test form, as defined by the panel.  This makes a compelling, quantitative argument for what the cutscore should then be, because we would expect anyone that is minimally qualified to score at that point or higher.

Item Rater1 Rater2 Rater3
1 33 50 33
2 25 25 25
3 25 33 25
4 33 50 50
5 50 100 50
Average 33.2 51.6 36.6
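To make the calculation explicit, here is a minimal Python sketch of the Nedelsky averaging for the table above; the final cutscore is simply the grand mean of the ratings.

```python
# Nedelsky ratings (%) from the table above: item -> [Rater1, Rater2, Rater3]
ratings = {
    1: [33, 50, 33],
    2: [25, 25, 25],
    3: [25, 33, 25],
    4: [33, 50, 50],
    5: [50, 100, 50],
}

# Average within each rater, then across raters, to get the recommended cutscore
rater_means = [sum(col) / len(col) for col in zip(*ratings.values())]
cutscore = sum(rater_means) / len(rater_means)

print([round(m, 1) for m in rater_means])  # [33.2, 51.6, 36.6]
print(round(cutscore, 1))                  # about 40.5% correct
```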

Drawbacks to the Nedelsky method

This approach only works on multiple-choice items, because it depends on the evaluation of option probability.  It is also a gross oversimplification.  If the item has four options, there are only four possible values for the Nedelsky rating: 25%, 33%, 50%, or 100%.  This is all the more striking when you consider that most items tend to have a percent-correct value between 50% and 100%, and reflecting this fact is impossible with the Nedelsky method.  Obviously, more goes into answering a question than simply eliminating one or two of the distractors.  This is one reason that another method is generally preferred and supersedes this method…

Nedelsky vs Modified-Angoff

The Nedelsky method has been superseded by the modified-Angoff method.  The modified-Angoff method is essentially the same process but allows for finer variations, and can be applied to other item types.  The modified-Angoff method subsumes the Nedelsky method, as a rater can still implement the Nedelsky approach within that paradigm.  In fact, I often tell raters to use the Nedelsky approach as a starting point or benchmark.  For example, if they think that the examinee can easily eliminate two options, and is slightly more likely to guess one of the remaining two options, the rating is not 50%, but rather 60%.  The modified-Angoff approach also allows for a second round of ratings after discussion to increase consensus (Delphi Method).  Raters can slightly adjust their rating without being hemmed into one of only four possible ratings.


Enemy items is a psychometric term that refers to two test questions (items) which should not be on the same test form (if linear) or seen by the same examinee (if LOFT or adaptive).  The concept is therefore relevant to linear forms, but also pertains to linear on-the-fly testing (LOFT) and computerized adaptive testing (CAT).  There are several reasons why two items might be considered enemies:

  1. Too similar: the text of the two items is almost the same.
  2. One gives away the answer to the other.
  3. The items are on the same topic/answer, even if the text is different.

 

How do we find enemy items?

There are two ways (as there often are): manual and automated.

Manual means that humans read the items and intentionally mark two of them as enemies.  So maybe you have a reviewer who is reviewing new items from a pool of 5 authors, and finds two that cover the same concept.  They would mark them as enemies.

Automated means that you have a machine learning algorithm, such as one which uses natural language processing (NLP) to evaluate all items in a pool and then uses distance/similarity metrics to quantify how similar they are.  Of course, this could miss some of the situations, like if two items have the same topic but have fairly different text.  It is also difficult to do if items have formulas, multimedia files, or other aspects that could not be caught by NLP.
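As a simple illustration of the automated approach (not the algorithm of any particular platform), here is a Python sketch using TF-IDF and cosine similarity from scikit-learn; the item texts and the similarity threshold are hypothetical, and a production system would likely use more sophisticated NLP.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

item_texts = [  # hypothetical item stems
    "What is the capital city of France?",
    "Which city is the capital of France?",
    "What is the boiling point of water at sea level?",
]

# Vectorize the item text and compute pairwise cosine similarities
tfidf = TfidfVectorizer().fit_transform(item_texts)
sim = cosine_similarity(tfidf)

THRESHOLD = 0.5  # arbitrary cutoff for this illustration
for i in range(len(item_texts)):
    for j in range(i + 1, len(item_texts)):
        if sim[i, j] >= THRESHOLD:
            print(f"Possible enemy pair: items {i} and {j} (similarity {sim[i, j]:.2f})")
```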

 

Why are enemy items a problem?

This violates the assumption of local independence: that an examinee's interaction with one item should not be affected by other items.  It also means that the examinee is in double jeopardy; if they don't know that topic, they will get two questions wrong, not one.  There are other potential issues as well, as discussed in this article.

 

What does this mean for test development?

We want to identify enemy items and ensure that they don’t get used together.  Your item banking and assessment platform should have functionality to track which items are enemies.  You can sign up for a free account in FastTest to see an example.

 

HR assessment is a critical part of the HR ecosystem, used to select the best candidates with pre-employment testing, assess training, certify skills, and more.  But there is a huge range in quality, as well as a wide range in the type of assessment that it is designed for.  This post will break down the different approaches and help you find the best solution.

HR assessment platforms help companies create effective assessments, thus saving valuable resources, improving candidate experience & quality, providing more accurate and actionable information about human capital, and reducing hiring bias.  But, finding software solutions that can help you reap these benefits can be difficult, especially because of the explosion of solutions in the market.  If you are lost on which tools will help you develop and deliver your own HR assessments, this guide is for you.

What is HR assessment?

HR assessment is a comprehensive process used by human resources professionals to evaluate various aspects of potential and current employees’ abilities, skills, and performance. This process encompasses a wide range of tools and methodologies designed to provide insights into an individual’s suitability for a role, their developmental needs, and their potential for future growth within the organization.


The primary goal of HR assessment is to make informed decisions about recruitment, employee development, and succession planning. During the recruitment phase, HR assessments help in identifying candidates who possess the necessary competencies and cultural fit for the organization.

There are various types of assessments used in HR.  Here are four main areas, though this list is by no means exhaustive.

  1. Pre-employment tests to select candidates
  2. Post-training assessments
  3. Certificate or certification exams (can be internal or external)
  4. 360-degree assessments and other performance appraisals

 

Pre-employment tests

Finding good employees in an overcrowded market is a daunting task. In fact, according to the Harvard Business Review, 80% of employee turnover is attributed to poor hiring decisions. Bad hires are not only expensive, but can also adversely affect cultural dynamics in the workforce. This is one area where HR assessment software shows its value.

There are different types of pre-employment assessments. Each of them achieves a different goal in the hiring process. The major types of pre-employment assessments include:

Personality tests: Despite rapidly finding their way into HR, these types of pre-employment tests are widely misunderstood. Personality tests address the social and behavioral side of candidate fit.  One of the main goals of these tests is to predict the success of candidates based on behavioral traits.

Aptitude tests: Unlike personality tests or emotional intelligence tests, which tend to lie on the social spectrum, aptitude tests measure problem-solving, critical thinking, and agility.  These types of tests are popular because they predict job performance better than any other type, tapping into areas that cannot be gleaned from resumes or job interviews.

Skills testing: These kinds of tests can be considered a measure of job experience, ranging from high-end skills to low-end skills such as typing or Microsoft Excel. Skill tests can either measure specific skills such as communication or generalized skills such as numeracy.

Emotional intelligence tests: These kinds of assessments are a newer concept but are becoming important in the HR industry. With strong emotional intelligence (EI) being associated with benefits such as improved workplace productivity and good leadership, many companies are investing heavily in developing these kinds of tests.  Although they can be administered to any candidate, it is recommended that they be reserved for people seeking leadership positions or roles with a strong social component.

Risk tests: As the name suggests, these types of tests help companies reduce risks. Risk assessments offer assurance to employers that their workers will commit to established work ethics and not involve themselves in any activities that may cause harm to themselves or the organization.  There are different types of risk tests. Safety tests, which are popular in contexts such as construction, measure the likelihood of the candidates engaging in activities that can cause them harm. Other common types of risk tests include Integrity tests.

 

Post-training assessments

This refers to assessments that are delivered after training.  It might be a simple quiz after an eLearning module, all the way up to a certification exam after months of training (see next section).  Often, it is somewhere in between.  For example, you might sit through an afternoon training course, after which you take a formal test that is required to do something on the job.  When I was a high school student, I worked in a lumber yard, and did exactly this to become an OSHA-approved forklift driver.

 

Certificate or certification exams

Sometimes, the exam process can be high-stakes and formal.  It is then a certificate or certification, or sometimes a licensure exam.  More on that here.  This can be internal to the organization, or external.

Internal certification: The credential is awarded by the training organization, and the exam is specifically tied to a certain product or process that the organization provides in the market.  There are many such examples in the software industry.  You can get certifications in AWS, SalesForce, Microsoft, etc.  One of our clients makes MRI and other medical imaging machines; candidates are certified on how to calibrate/fix them.

External certification: The credential is awarded by an external board or government agency, and the exam is industry-wide.  An example of this is the SIE exams offered by FINRA.  A candidate might go to work at an insurance company or other financial services company, which trains them and sponsors them to take the exam, in hopes that the company will get a return when the candidate passes and then sells its insurance policies as an agent.  But the exam itself is not run by the company; it is run by FINRA.

 

360-degree assessments and other performance appraisals

Job performance is one of the most important concepts in HR, and also one that is often difficult to measure.  John Campbell, one of my thesis advisors, was known for developing an 8-factor model of performance.  Some aspects are subjective, and some are easily measured by real-world data, such as number of widgets made or number of cars sold by a car salesperson.  Others involve survey-style assessments, such as asking customers, business partners, co-workers, supervisors, and subordinates to rate a person on a Likert scale.  HR assessment platforms are needed to develop, deliver, and score such assessments.

 

The Benefits of Using Professional-Level Exam Software

Now that you have a good understanding of what pre-employment and other HR tests are, let’s discuss the benefits of integrating pre-employment assessment software into your hiring process. Here are some of the benefits:

Saves Valuable resources

Unlike the lengthy and costly traditional hiring processes, pre-employment assessment software helps companies increase their ROI by eliminating HR snags such as face-to-face interactions or geographical restrictions. Pre-employment testing tools can also reduce the amount of time it takes to make good hires while reducing the risk of facing the financial consequences of a bad hire.

Supports Data-Driven Hiring Decisions

Data runs the modern world, and hiring is no different. You are better off letting complex algorithms crunch the numbers and help you decide which talent is a fit, as opposed to hiring based on a hunch or less-accurate methods like an unstructured interview.  Pre-employment assessment software helps you analyze assessments and generate reports/visualizations to help you choose the right candidates from a large talent pool.

Improving candidate experience 

Candidate experience is an important aspect of a company's growth, especially considering that 69% of candidates admit they would not apply for a job at a company after having a negative experience. A good candidate experience means you get access to the best talent in the world.

Elimination of Human Bias

Traditional hiring processes are based on instinct. They are not effective, since it is easy for candidates to provide false information on their resumes and cover letters. The use of pre-employment assessment software has helped eliminate this hurdle. The tools level the playing field, so that only the best candidates are considered for a position.

 

What To Consider When Choosing HR assessment tools

Now that you have a clear idea of what pre-employment tests are and the benefits of integrating pre-employment assessment software into your hiring process, let’s see how you can find the right tools.

Here are the most important things to consider when choosing the right pre-employment testing software for your organization.

Ease-of-use

The candidates should be your top priority when you are sourcing pre-employment assessment software, because ease of use correlates directly with a good candidate experience. Good software should have simple navigation and be easy to understand.

Here is a checklist to help you decide if a pre-employment assessment software is easy to use;

  • Are the results easy to interpret?
  • What is the UI/UX like?
  • What ways does it use to automate tasks such as applicant management?
  • Does it have good documentation and an active community?

Tests Delivery and Remote Proctoring

Good online assessment software should feature good online proctoring functionality. This is because most remote jobs accept applications from all over the world. It is therefore advisable to choose pre-employment testing software that has secure remote proctoring capabilities. Here are some things you should look for in remote proctoring:

  • Does the platform support security processes such as IP-based authentication, lockdown browser, and AI-flagging?
  • What types of online proctoring does the software offer? Live real-time, AI review, or record and review?
  • Does it let you bring your own proctor?
  • Does it offer test analytics?

Test & data security, and compliance

Defensibility is the ultimate goal of test security. There are several layers of security associated with pre-employment testing. When evaluating this aspect, you should consider what the pre-employment testing software does to achieve the highest level of security, because data breaches are wildly expensive.

The first layer of security is the test itself. The software should support security technologies and frameworks such as lockdown browser, IP-flagging, and IP-based authentication.

The other layer of security is on the candidate’s side. As an employer, you will have access to the candidate’s private information. How can you ensure that your candidate’s data is secure? That is reason enough to evaluate the software’s data protection and compliance guidelines.

Good pre-employment testing software should be compliant with regulations such as GDPR. The software should also be flexible enough to adapt to compliance guidelines from different parts of the world.

Questions you need to ask;

  • What mechanisms does the software employ to prevent cheating?
  • Is their remote proctoring function reliable and secure?
  • Are they compliant with security standards and guidelines such as ISO or GDPR, and do they support SSO?
  • How does the software protect user data?

Psychometrics

Psychometrics is the science of assessment, helping to derive accurate scores from defensible tests, as well as making them more efficient, reducing bias, and a host of other benefits.  You should ensure that your solution supports the necessary level of psychometrics.

 

User experience

A good user experience is a must-have when you are sourcing any enterprise software. Modern pre-employment testing software should design the user experience with both the candidates and the employer in mind. Some ways you can tell if a software offers a seamless user experience include:

  • User-friendly interface
  • Simple and easy to interact with
  • Easy to create and manage item banks
  • Clean dashboard with advanced analytics and visualizations

Customizing your user-experience maps to fit candidates’ expectations attracts high-quality talent.

 

Scalability and automation

With a single job post attracting approximately 250 candidates, scalability isn’t something you should overlook. A good pre-employment testing software should thus have the ability to handle any kind of workload, without sacrificing assessment quality.

It is also important you check the automation capabilities of the software. The hiring process has many repetitive tasks that can be automated with technologies such as Machine learning, Artificial Intelligence (AI), and robotic process automation (RPA).

Here are some questions you should consider in relation to scalability and automation;

  • Does the software offer Automated Item Generation (AIG)?
  • How many candidates can it handle?
  • Can it support candidates from different locations worldwide?

Reporting and analytics

[Screenshot: Iteman item analysis report]

A good pre-employment assessment software will not leave you hanging after helping you develop and deliver the tests. It will enable you to derive important insight from the assessments.

The analytics reports can then be used to make data-driven decisions on which candidate is suitable and how to improve candidate experience. Here are some queries to make on reporting and analytics.

  • Does the software have a good dashboard?
  • What format are reports generated in?
  • What are some key insights that prospects can gather from the analytics process?
  • How good are the visualizations?

Customer and Technical Support

Customer and technical support is not something you should overlook. Good pre-employment assessment software should have an omni-channel support system that is available 24/7, mainly because some situations need a fast response. Here are some of the questions you should ask when vetting customer and technical support:

  • What channels of support does the software offer, and how prompt is their support?
  • How good is their FAQ/resources page?
  • Do they offer multi-language support mediums?
  • Do they have dedicated managers to help you get the best out of your tests?

 

Conclusion

Finding the right HR assessment software is a lengthy process, yet profitable in the long run. We hope the article sheds some light on the important aspects to look for when looking for such tools. Also, don’t forget to take a pragmatic approach when implementing such tools into your hiring process.

Are you stuck on how you can use pre-employment testing tools to improve your hiring process? Feel free to contact us and we will guide you on the entire process, from concept development to implementation. Whether you need off-the-shelf tests or a comprehensive platform to build your own exams, we can provide the guidance you need.  We also offer free versions of our industry-leading software  FastTest  and  Assess.ai  – visit our Contact Us page to get started!

If you are interested in delving deeper into leadership assessments, you might want to check out this blog post.  For more insights and an example of how HR assessments can fail, check out our blog post called Public Safety Hiring Practices and Litigation. The blog post titled Improving Employee Retention with Assessment: Strategies for Success explores how strategic use of assessments throughout the employee lifecycle can enhance retention, build stronger teams, and drive business success by aligning organizational goals with employee development and engagement.


Incremental validity is a specific aspect of criterion-related validity that refers to what an additional assessment or predictive variable adds to the information provided by existing assessments or variables.  It is the amount of “bonus” predictive power gained by adding another predictor.  In many cases, the new predictor is on the same or a similar trait, but often the most incremental validity comes from a predictor/trait that is relatively unrelated to the original.  See the examples below.

Note that this is often discussed with respect to tests and assessment, but in many cases a predictor is not a test or assessment, as you will also see.

How is Incremental Validity Evaluated?

It is most often quantified with a linear regression model and correlations.  However, any predictive modeling approach could work, from support vector machines to neural networks.

Example of Incremental Validity: University Admissions

One of the most commonly used predictors for university admissions is an admissions test, or battery of tests.  You might be required to take an assessment which includes an English/Verbal test, a Logic/Reasoning test, and a Quantitative/Math test.  These might be used individually or in aggregate to create a mathematical model, based on past data, that predicts your performance at university. (There are actually several possible criterion variables for this, such as first-year GPA, final GPA, and 4-year graduation rate, but that's beyond the scope of this article.)

Of course, the admissions exam scores are not the only piece of information that the university has on students.  It also has their high school GPA, perhaps an admissions essay graded by instructors, and so on.  Incremental validity poses this question: if the admissions exam correlates 0.59 with first-year GPA, what happens if we build a multiple regression/correlation with high school GPA (HGPA) as a second predictor?  It might go up to, say, 0.64.  That is an increment of 0.05.  If the university has that data on students, it would be wasting it by not using it.

Of course, HGPA will correlate very highly with the admissions exam scores, so it will likely not add a lot of incremental validity.  Perhaps the school finds that essays add a 0.09 increment to the predictive power, because they are more orthogonal to the admissions exam scores.  Does it make sense to add that, given the additional expense of scoring thousands of essays?  That's a business decision for them.
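As a sketch of how this is typically quantified, the following Python code simulates data (the numbers are made up purely for illustration) and compares the multiple correlation with the criterion for the exam alone versus the exam plus HGPA; the difference is the increment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
exam = rng.normal(size=n)                              # admissions exam (standardized)
hgpa = 0.7 * exam + 0.7 * rng.normal(size=n)           # correlates with the exam
fygpa = 0.5 * exam + 0.2 * hgpa + rng.normal(size=n)   # first-year GPA criterion

def multiple_r(y, *predictors):
    """Correlation between y and its least-squares prediction from the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.corrcoef(y, X @ beta)[0, 1]

r_exam = multiple_r(fygpa, exam)
r_exam_hgpa = multiple_r(fygpa, exam, hgpa)
print(f"Exam only: {r_exam:.2f}  Exam + HGPA: {r_exam_hgpa:.2f}  "
      f"Increment: {r_exam_hgpa - r_exam:.2f}")
```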

Example of Incremental Validity: Pre-Employment Testing

Another common use case is that of pre-employment testing, where the purpose of the test is to predict criterion variables like job performance, tenure, 6-month termination rate, or counterproductive work behavior.  You might start with a skills test; perhaps you are hiring accountants or bookkeepers and you give them a test on MS Excel.  What additional predictive power would we get by also doing a quantitative reasoning test?  Probably some, but that most likely correlates highly with MS Excel knowledge.  So what about using a personality assessment like Conscientiousness?  That would be more orthogonal.  It’s up to the researcher to determine what the best predictors are.  This topic, personnel selection, is one of the primary areas of Industrial/ Organizational Psychology.


Summative and formative assessment are crucial components of the educational process.  If you work in the educational assessment field, or even in education generally, you have probably encountered these terms.  What do they mean?  This post will explore the differences between summative and formative assessment.

Assessment plays a crucial role in education, serving as a powerful tool to gauge student understanding and guide instructional practices. Among the various assessment methods, two approaches stand out: formative assessment and summative assessment. While both types aim to evaluate student performance, they serve distinct purposes and are applied at different stages of the learning process.

 

What is Summative Assessment?

Summative assessment refers to an assessment that comes at the end (sum) of an educational experience.  The “educational experience” can vary widely.  Perhaps it is a one-day training course, or even shorter.  I worked at a lumber yard in high school, and I remember getting a rudimentary training – maybe an hour – on how to use a forklift before they had me take an exam to become OSHA-certified to use a forklift.  Proctored by the guy who had just showed me the ropes, of course.  On the other end of the spectrum is board certification for a physician specialty like ophthalmology: after 4 years of undergrad, 4 years of med school, and several more years of specialty training, you finally get to take the exam.  Either way, the purpose is to evaluate what you learned in some educational experience.

Note that it does not have to be formal education.  Many certifications have multiple eligibility pathways.  For example, to be eligible to sit for the exam, you might need:

  1. A bachelor’s degree
  2. An associate degree plus 1 year of work experience
  3. 3 years of work experience.

How it is developed

Summative assessments are usually developed by assessment professionals, or a board of subject matter experts led by assessment professionals.  For example, a certification for ophthalmology is not informally developed by a teacher; there is a panel of experienced ophthalmologists led by a psychometrician.  A high school graduation exam might be developed by a panel of experienced math or English teachers, again led by a psychometrician and test developers.

The process is usually very long and time-intensive, and therefore quite expensive.  A certification will need a job analysis, item writing workshop, standard-setting study, and other important developments that contribute to the validity of the exam scores.  A high school graduation exam has expensive curriculum alignment studies and other aspects.

Implementation of Summative Assessment

Let’s explore the key aspects of summative assessment:

  1. End-of-Term Evaluation: Summative assessments are administered after the completion of a unit, semester, or academic year. They aim to evaluate the overall achievement of students and determine their readiness for advancement or graduation.
  2. Formal and Standardized: Summative assessments are often formal, standardized, and structured, ensuring consistent evaluation across different students and classrooms. Common examples include final exams, standardized tests, and grading rubrics.
  3. Accountability: Summative assessment holds students accountable for their learning outcomes and provides a comprehensive summary of their performance. It also serves as a basis for grade reporting, academic placement, and program evaluation.
  4. Future Planning: Summative assessment results can guide future instructional planning and curriculum development. They provide insights into areas of strength and weakness, helping educators identify instructional strategies and interventions to improve student outcomes.

 

What is Formative Assessment?

Formative assessment is something that is used during the educational process.  Everyone is familiar with this from their school days: a quiz, an exam, or even just the teacher asking you a few questions verbally to understand your level of knowledge.  Usually, but not always, a formative assessment is used to direct instruction.  A common example of formative assessment is low-stakes exams given in K-12 schools purely to check on student growth, without counting towards their grades.  Some of the most widely used titles are the NWEA MAP, Renaissance Learning STAR, and Imagine Learning MyPath.

Formative assessment is a great fit for computerized adaptive testing, a method that adapts the difficulty of the exam to each student.  If a student is 3 grades behind, the test will quickly adapt down to that level, providing a better experience for the student and more accurate feedback on their level of knowledge.

How it is developed

Formative assessments are typically much more informal than summative assessments.  Most of the exams we take in our life are informally developed formative assessments; think of all the quizzes and tests you ever took during courses as a student.  Even taking a test during training on the job will often count.  However, some are developed with heavy investment, such as a nationwide K-12 adaptive testing platform.

Implementation of Formative Assessment

Formative assessment refers to the ongoing evaluation of student progress throughout the learning journey. It is designed to provide immediate feedback, identify knowledge gaps, and guide instructional decisions. Here are some key characteristics of formative assessment:

  1. Timely Feedback: Formative assessments are conducted during the learning process, allowing educators to provide immediate feedback to students. This feedback focuses on specific strengths and areas for improvement, helping students adjust their understanding and study strategies.
  2. Informal Nature: Formative assessments are typically informal and flexible, offering a wide range of techniques such as quizzes, class discussions, peer evaluations, and interactive activities. They encourage active participation and engagement, promoting deeper learning and critical thinking skills.
  3. Diagnostic Function: Formative assessment serves as a diagnostic tool, enabling teachers to monitor individual and class-wide progress. It helps identify misconceptions, adapt instructional approaches, and tailor learning experiences to meet students’ needs effectively.
  4. Growth Mindset: The primary goal of formative assessment is to foster a growth mindset among students. By focusing on improvement rather than grades, it encourages learners to embrace challenges, learn from mistakes, and persevere in their educational journey.

 

Summative vs Formative Assessment

Below are the principal differences between summative and formative assessment across several general aspects.

Aspect | Summative Assessment | Formative Assessment
Purpose | To evaluate overall student learning at the end of an instructional period. | To monitor student learning and provide ongoing feedback for improvement.
Timing | Conducted at the end of a unit, semester, or course. | Conducted throughout the learning process.
Role in Learning Process | To determine the extent of learning and achievement. | To identify learning needs and guide instructional adjustments.
Feedback Mechanism | Feedback is usually provided after the assessment is completed and is often limited to final results or scores. | Provides immediate, specific, and actionable feedback to improve learning.
Nature of Evaluation | Typically evaluative and judgmental, focusing on the outcome. | Diagnostic and supportive, focusing on the process and improvement.
Impact on Grading | Often a major component of the final grade. | Generally not used for grading; intended to inform learning.
Level of Standardization | Highly standardized to ensure fairness and comparability. | Less standardized, often tailored to individual needs and contexts.
Frequency of Implementation | Typically infrequent, such as once per term or unit. | Frequent and ongoing, integrated into the daily learning activities.
Stakeholders Involved | Primarily involves educators and administrative bodies for accountability purposes. | Involves students, educators, and sometimes parents for immediate learning support.
Flexibility in Use | Rigid in format and timing; used to meet predetermined educational benchmarks. | Highly flexible; can be adapted to fit specific instructional goals and learner needs.

 

The Synergy Between Summative and Formative Assessment

While formative and summative assessments have distinct purposes, they work together in a complementary manner to enhance learning outcomes. Here are a few ways in which these assessment types can be effectively integrated:

  1. Feedback Loop: The feedback provided during formative assessments can inform and improve summative assessments. It allows students to understand their strengths and weaknesses, guiding their study efforts for better performance in the final evaluation.
  2. Continuous Improvement: By employing formative assessments throughout a course, teachers can continuously monitor student progress, identify learning gaps, and adjust instructional strategies accordingly. This iterative process can ultimately lead to improved summative assessment results.
  3. Balanced Assessment Approach: Combining both formative and summative assessments creates a more comprehensive evaluation system. It ensures that student growth and understanding are assessed both during the learning process and at the end, providing a holistic view.

 

Summative and Formative Assessment: A Validity Perspective

So what is the difference?  You will notice it is the situation and use of the exam, not the exam itself.  You could take those K-12 feedback assessments and deliver them at the end of the year, with weighting towards the student's final grade.  That would make them summative.  But that is not what the test was designed for.  This is the concept of validity: the evidence showing that the interpretations and uses of test scores are supported for their intended purpose.  So the key is to design a test for its intended use, provide evidence for that use, and make sure that the exam is being used in the way that it should be.

QUESTION:   “What are the costs associated with using validated assessments in public safety hiring?”

ANSWER:       “Always cheaper than a lawsuit!”

It is not uncommon for public safety hiring practices to be called into question. There are several landmark court cases surrounding discrimination in hiring or testing that prove that point. Each year, millions and millions of dollars are spent defending or rectifying these occurrences. It is vital that steps are taken to avoid even the appearance of discrimination.

These four mistakes in public safety testing are some of the most common oversights made by human resources and public safety personnel. It is imperative that those responsible for hiring and promotional processes stay vigilant and aware of their legal responsibilities throughout the hiring and promotional process.

# 1:  Failing to validate the written test to a current job description for public safety hiring

Test questions must be related to the job description. This is one of the biggest mistakes that hiring officials make and is a frequent reason for public safety testing lawsuits. The test must either measure critical skills and abilities necessary for the job, or must predict which candidates will be most successful on the job (predictive validity). At the very least, the most important skills should be reflected on the test.

The United States filed a lawsuit against the City of New York in 2007 for unfair public safety hiring practices. The United States alleged that the examinations the City used for hiring its firefighters were not an adequate method for determining whether an applicant was qualified for the position. In this case, Judge Nicholas G. Garaufis ruled in favor of the United States. He determined that the City was in violation of Title VII, and that the written examinations used excluded certain minorities, such as Black and Hispanic candidates, and were not job-related.

# 2:  Failing to include job-related practices that mitigate adverse impact

The City of New Haven, Connecticut, found itself in hot water in 2003, when seeking to fill 15 supervisory positions for its fire department. The test consisted of an oral and a written exam. There were 118 firefighters who took the test. When the test scores were calculated, there was a distinct racial disparity: the White applicants passed the test at a rate that was twice that of the Black applicants. During the court case, it was determined that the fire department was guilty of disparate-impact discrimination.

Simply put, disparate impact discrimination occurs when hiring practice rules or tests show a distinct slant toward one race. In this case, it was asserted that the test and ranking were structured in such a way that they eliminated Black and Hispanic applicants.
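One common way to quantify this, sketched below in Python with hypothetical numbers (not the actual New Haven figures), is to compare each group's pass rate to the highest group's pass rate; ratios below 0.80 are commonly flagged for adverse-impact review under the EEOC's four-fifths guideline.

```python
# Hypothetical pass counts by group -- not data from the New Haven case
results = {
    "Group A": {"tested": 60, "passed": 30},
    "Group B": {"tested": 58, "passed": 15},
}

rates = {g: v["passed"] / v["tested"] for g, v in results.items()}
highest = max(rates.values())

for group, rate in rates.items():
    impact_ratio = rate / highest
    flag = "review for adverse impact" if impact_ratio < 0.80 else "ok"
    print(f"{group}: pass rate {rate:.2f}, impact ratio {impact_ratio:.2f} -> {flag}")
```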

# 3:  Failing to use a locally-validated, job-related Physical Ability Test (PAT)

Of all of the selection practices administered by public safety departments, the physical ability test, or PAT, is the one most likely to have a high failure rate for female candidates. It's imperative that the PAT measures the critical physical skills that a police officer must possess on day one. Departments that utilize work-sample PATs rather than fitness tests tend to have more success in court, as it is easier to demonstrate job-relatedness for a PAT that measures specific, critical job duties than for a fitness test that requires candidates to run a mile and a half or complete a number of pushups and sit-ups.

# 4:  Failing to use a structured interview with trained raters

A study was conducted analyzing the occurrence of litigation across the different tests included in most entry-level recruitments for public safety. Of the most common selection practices (i.e., a written test, a PAT, and an interview), the unstructured interview was the selection practice most commonly challenged in court, and the one most likely to result in success for the plaintiff.

Departments should ensure that the questions asked during the interview are structured, job-related and utilize structured scoring methods. Additionally, all parties who sit on the interview panel should be properly trained in how to objectively administer, assess, and score the interview. Questions like “Tell me why you want to work with our department” should never be included in a structured interview process.

About FPSI

This is a guest post on pre-employment testing and hiring practices in public safety, by one of the leaders in the field, Fire & Police Selection, Inc. (FPSI).  FPSI consultants are well-versed in public safety litigation. Contact us for assistance with your public safety testing needs.


Test score reliability and validity are core concepts in the field of psychometrics and assessment.  Both of them refer to the quality of a test, the scores it produces, and how we use those scores.  Because test scores are often used for very important purposes with high stakes, it is of course paramount that the tests be of high quality.  But because it is such a complex situation, it is not a simple yes/no answer of whether a test is good.  There is a ton of work that goes into establishing validity and reliability, and that work never ends!

This post provides an introduction to this incredibly complex topic.  For more information, we recommend you delve into books that are dedicated to the topic.  Here is a classic.

 

Why do we need reliability and validity?

To begin a discussion of reliability and validity, let us first pose the most fundamental question in psychometrics: Why are we testing people? Why are we going through an extensive and expensive process to develop examinations, inventories, surveys, and other forms of assessment? The answer is that the assessments provide information, in the form of test scores and subscores, that can be used for practical purposes to the benefit of individuals, organizations, and society. Moreover, that information is of higher quality for a particular purpose than information available from alternative sources. For example, a standardized test can provide better information about school students than parent or teacher ratings. A preemployment test can provide better information about specific job skills than an interview or a resume, and therefore be used to make better hiring decisions.

So, exams are constructed in order to draw conclusions about examinees based on their performance. The next question would be, just how supported are various conclusions and inferences we are making? What evidence do we have that a given standardized test can provide better information about school students than parent or teacher ratings? This is the central question that defines the most important criterion for evaluating an assessment process: validity. Validity, from a broad perspective, refers to the evidence we have to support a given use or interpretation of test scores. The importance of validity is so widely recognized that it typically finds its way into laws and regulations regarding assessment (Koretz, 2008).

Test score reliability is a component of validity. Reliability indicates the degree to which test scores are stable, reproducible, and free from measurement error. If test scores are not reliable, they cannot be valid, since they will not provide a good estimate of the ability or trait that the test intends to measure. Reliability is therefore a necessary but not sufficient condition for validity.

 

Test Score Reliability

Reliability refers to the precision, accuracy, or repeatability of the test scores. There is no universally accepted way to define and evaluate the concept; classical test theory provides several indices, while item response theory drops the idea of a single index (and drops the term “reliability” entirely!) and reconceptualizes it as a conditional standard error of measurement, an index of precision.  This is actually a very important distinction, though outside the scope of this article.
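Although a full treatment of the IRT approach is outside the scope of this article, a tiny numeric sketch can make the distinction concrete. The snippet below is a minimal illustration in plain Python/NumPy (not Iteman or any particular IRT package), with made-up 2PL item parameters, of how the standard error of measurement varies across the ability scale rather than being a single index.

```python
# Minimal sketch (assumed 2PL model, made-up item parameters): the IRT view of
# precision as a conditional standard error of measurement (SEM) that changes
# across the ability scale instead of being one number for the whole test.
import numpy as np

# Hypothetical item parameters: discrimination (a) and difficulty (b)
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

def test_information(theta):
    """Sum of 2PL item information, I_i(theta) = a_i^2 * P_i * (1 - P_i)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return float(np.sum(a**2 * p * (1.0 - p)))

for theta in (-2.0, 0.0, 2.0):
    info = test_information(theta)
    sem = 1.0 / np.sqrt(info)  # conditional SEM at this ability level
    print(f"theta={theta:+.1f}  information={info:.2f}  SEM={sem:.2f}")
```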

An extremely common way of evaluating classical test reliability is an internal consistency index: KR-20 for dichotomously scored items, or its generalization, coefficient α (alpha). The index ranges from 0.0 (test scores are composed entirely of random error) to 1.0 (scores have no measurement error). Of course, because human behavior is generally not perfectly reproducible, perfect reliability is not possible; the relevant standard depends on the stakes of the test. A reliability of 0.90 or higher is typically desired for high-stakes certification exams, a test for medical doctors might require 0.95 or greater, and a test for florists or a personality self-assessment might suffice with 0.80. Another method for assessing reliability is the split-half reliability index, which can also be useful depending on the context and nature of the test.
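To make the index concrete, here is a minimal sketch in plain Python/NumPy (not Iteman or any other psychometric package) that computes coefficient alpha from a tiny, made-up matrix of scored responses; for 0/1-scored items this is equivalent to KR-20.

```python
# Minimal sketch of coefficient alpha (equivalent to KR-20 for 0/1-scored items).
# The response matrix is a tiny made-up example: rows = examinees,
# columns = items, 1 = correct, 0 = incorrect.
import numpy as np

responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])

k = responses.shape[1]                          # number of items
item_vars = responses.var(axis=0, ddof=1)       # variance of each item
total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"coefficient alpha (KR-20) = {alpha:.3f}")
```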

Reliability depends on several factors, including the stability of the construct, length of the test, and the quality of the test items.

  • Stability of the construct: Reliability will be higher if the trait/ability is more stable (mood, for example, is inherently difficult to measure repeatedly). A test sponsor typically has little control over the nature of the construct – if you need to measure knowledge of algebra, that is what you have to measure, and there is no way around it.
  • Length of the test: Obviously, a test with 100 items is going to produce more reliable scores than one with 5 items, assuming the items are not worthless (see the sketch after this list).
  • Item quality: A test will have higher reliability if the items are good.  Often, this is operationalized with point-biserial discrimination coefficients.
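The effect of test length, in particular, can be quantified with the classic Spearman-Brown prophecy formula, ρ* = kρ / (1 + (k − 1)ρ). Here is a minimal sketch, assuming an illustrative starting reliability of 0.70, showing how predicted reliability changes as a test is shortened or lengthened.

```python
# Minimal sketch of the Spearman-Brown prophecy formula: predicted reliability
# when a test is made k times as long. The starting reliability of 0.70 is
# just an illustrative value.
def spearman_brown(reliability, k):
    """Predicted reliability of a test lengthened by a factor of k."""
    return (k * reliability) / (1 + (k - 1) * reliability)

current = 0.70
for k in (0.5, 1, 2, 4):
    print(f"length x{k}: predicted reliability = {spearman_brown(current, k):.2f}")
```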

How do you calculate reliability?  You need psychometric analysis software like Iteman.

 

Validity

Validity is conventionally defined as the extent to which a test measures what it purports to measure.  Test validation is the process of gathering evidence to support the inferences made from test scores. Validation is an ongoing process, which makes it difficult to know when sufficient validity evidence has been accumulated to interpret test scores appropriately.

Academically, Messick (1989) defines validity as an “integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of measurement.” This definition suggests that the concept of validity contains a number of important characteristics to review or propositions to test and that validity can be described in a number of ways. The modern concept of validity (AERA, APA, & NCME Standards) is multi-faceted and refers to the meaningfulness, usefulness, and appropriateness of inferences made from test scores.

First of all, validity is not an inherent characteristic of a test. It is the reasonableness of using the test score for a particular purpose or a particular inference. It is not correct to say a test or measurement procedure is valid or invalid. It is more reasonable to ask, “Is this a valid use of test scores, or is this a valid interpretation of the test scores?” Test score validity evidence should always be reviewed in relation to how test scores are used and interpreted.  For example, we might be tempted to use a national university admissions aptitude test as a high school graduation exam, since they occur in the same period of a student’s life.  But such a test likely does not match the curriculum of a particular state, especially since aptitude and achievement are different things!  You could theoretically use the aptitude test as a pre-employment exam as well; while it may be valid for its original purpose, it is likely not valid for that use.

Secondly, validity cannot be adequately summarized by a single numerical index like a reliability coefficient or a standard error of measurement. A validity coefficient may be reported to describe the strength of the relationship between test scores and other suitable and important measurements. However, it is only one of many pieces of empirical evidence that should be reviewed and reported by test score users. Validity for a particular test score use is supported through an accumulation of empirical, theoretical, statistical, and conceptual evidence that makes sense for the test scores.

Thirdly, there can be many aspects of validity dependent on the intended use and intended inferences to be made from test scores. Scores obtained from a measurement procedure can be valid for certain uses and inferences and not valid for other uses and inferences. Ultimately, an inference about probable job performance based on test scores is usually the kind of inference desired in test score interpretation in today’s test usage marketplace. This can take the form of making an inference about a person’s competency measured by a tested area.

Example 1: A Ruler

A standard ruler has both reliability and validity.  If you measure something that is 10 cm long, and measure it again and again, you will get the same measurement.  It is highly consistent and repeatable.  And if the object is actually 10 cm long, you have validity. (If not, you have a bad ruler.)

Example 2: A Bathroom Scale

Bathroom scales are not perfectly reliable (though this is often a function of their price), but they are usually reliable enough for the purpose of the measurement.

  • If you weigh 180 lbs, and step on the scale several times, you will likely get numbers like 179.8 or 180.1.  That is quite reliable, and valid.
  • If the numbers were more spread out, like 168.9 and 185.7, then you can consider it unreliable but valid.
  • If the results were 190.00 lbs every time, you have perfectly reliable measurement… but poor validity.
  • If the results were spread like 25.6, 2023.7, 0.000053 – then the measurement is neither reliable nor valid.

This is similar to the classic “target” example of reliability and validity, like you see below (image from Wikipedia).

[Image: the classic “target” diagram of reliability and validity, from Wikipedia]

Example 3: A Pre-Employment Test

Now, let’s get to a real example.  You have a test of quantitative reasoning that is being used to assess bookkeepers who apply for a job at a large company.  Jack has very high ability, and scores around the 90th percentile each time he takes the test.  That is reliability.  But does the test actually predict job performance?  That is validity.  Does it predict job performance better than a Microsoft Excel test?  Good question; time for some validity research.  What if we also tack on a test of conscientiousness – does it improve prediction beyond the quantitative test alone?  That is incremental validity, illustrated in the sketch below.
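As a hedged illustration of what that incremental validity check might look like, here is a minimal sketch using simulated, entirely made-up data: fit the criterion with the quantitative test alone, then with conscientiousness added, and compare the variance explained. In a real validation study you would use actual applicant scores and job performance criteria.

```python
# Minimal sketch of an incremental validity check: does adding a second
# predictor (conscientiousness) improve prediction of job performance beyond
# the quantitative reasoning test alone? All data are randomly generated
# for illustration only.
import numpy as np

rng = np.random.default_rng(42)
n = 200
quant = rng.normal(size=n)                      # quantitative reasoning scores
consc = rng.normal(size=n)                      # conscientiousness scores
# Made-up "true" relationship: both predictors matter, plus noise
performance = 0.5 * quant + 0.3 * consc + rng.normal(scale=0.8, size=n)

def r_squared(X, y):
    """Proportion of variance in y explained by a least-squares fit on X."""
    X = np.column_stack([np.ones(len(y)), X])   # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_base = r_squared(quant.reshape(-1, 1), performance)
r2_full = r_squared(np.column_stack([quant, consc]), performance)
print(f"R^2, quant test only:             {r2_base:.3f}")
print(f"R^2, quant + conscientiousness:   {r2_full:.3f}")
print(f"Incremental validity (delta R^2): {r2_full - r2_base:.3f}")
```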

 

Summary

In conclusion, validity and reliability are two essential aspects in evaluating an assessment, be it an examination of knowledge, a psychological inventory, a customer survey, or an aptitude test. Validity is an overarching, fundamental issue that drives at the heart of the reason for the assessment in the first place: the use of test scores. Reliability is an aspect of validity, as it is a necessary but not sufficient condition. Developing a test that produces reliable scores and valid interpretations is not an easy task, and progressively higher stakes indicate a progressively greater need for a professional psychometrician. High-stakes exams like national university admissions often have teams of experts devoted to producing a high quality assessment.

If you need professional psychometric consultancy and support, do not hesitate to contact us.