Test Development Archives

HR Assessment for Pre-Employment: Approaches and Solutions

AdminJune 30, 2022

HR assessment is a critical part of the HR ecosystem, used to select the best candidates with pre-employment testing, assess training, certify skills, and more. But there is a huge range in quality, as well as a wide range in the type of assessment that it is designed for. This post will break down the different approaches and help you find the best solution.

HR assessment platforms help companies create effective assessments, thus saving valuable resources, improving candidate experience & quality, providing more accurate and actionable information about human capital, and reducing hiring bias. But, finding software solutions that can help you reap these benefits can be difficult, especially because of the explosion of solutions in the market. If you are lost on which tools will help you develop and deliver your own HR assessments, this guide is for you.

What is HR assessment?

HR assessment is a comprehensive process used by human resources professionals to evaluate various aspects of potential and current employees’ abilities, skills, and performance. This process encompasses a wide range of tools and methodologies designed to provide insights into an individual’s suitability for a role, their developmental needs, and their potential for future growth within the organization.

The primary goal of HR assessment is to make informed decisions about recruitment, employee development, and succession planning. During the recruitment phase, HR assessments help in identifying candidates who possess the necessary competencies and cultural fit for the organization.

There are various types of assessments used in HR. Here are four main areas, though this list is by no means exhaustive.

Pre-employment tests to select candidates
Post-training assessments
Certificate or certification exams (can be internal or external)
360-degree assessments and other performance appraisals

Pre-employment tests

Finding good employees in an overcrowded market is a daunting task. In fact, according to the Harvard Business Review, 80% of employee turnover is attributed to poor hiring decisions. Bad hires are not only expensive, but can also adversely affect cultural dynamics in the workforce. This is one area where HR assessment software shows its value.

There are different types of pre-employment assessments. Each of them achieves a different goal in the hiring process. The major types of pre-employment assessments include:

Personality tests: Despite rapidly finding their way into HR, these types of pre-employment tests are widely misunderstood. Personality tests answer questions in the social spectrum. One of the main goals of these tests is to quantify the success of certain candidates based on behavioral traits.

Aptitude tests: Unlike personality tests or emotional intelligence tests which tend to lie on the social spectrum, aptitude tests measure problem-solving, critical thinking, and agility. These types of tests are popular because can predict job performance than any other type because they can tap into areas that cannot be found in resumes or job interviews.

Skills Testing: The kinds of tests can be considered a measure of job experience; ranging from high-end skills to low-end skills such as typing or Microsoft excel. Skill tests can either measure specific skills such as communication or measure generalized skills such as numeracy.

Emotional Intelligence tests: These kinds of assessments are a new concept but are becoming important in the HR industry. With strong Emotional Intelligence (EI) being associated with benefits such as improved workplace productivity and good leadership, many companies are investing heavily in developing these kinds of tests. Despite being able to be administered to any candidates, it is recommended they be set aside for people seeking leadership positions, or those expected to work in social contexts.

Risk tests: As the name suggests, these types of tests help companies reduce risks. Risk assessments offer assurance to employers that their workers will commit to established work ethics and not involve themselves in any activities that may cause harm to themselves or the organization. There are different types of risk tests. Safety tests, which are popular in contexts such as construction, measure the likelihood of the candidates engaging in activities that can cause them harm. Other common types of risk tests include Integrity tests.

Here’s a webinar that we hosted which talks in detail about types of pre-employment assessment, when to use them, how it fits into the big picture of recruitment, and much more.

Post-training assessments

This refers to assessments that are delivered after training. It might be a simple quiz after an eLearning module, up to a certification exam after months of training (see next section). Often, it is somewhere in between. For example you might take an afternoon sit through a training course, after which you take a formal test that is required to do something on the job. When I was a high school student, I worked in a lumber yard, and did exactly this to become an OSHA-approved forklift driver.

Certificate or certification exams

Sometimes, the exam process can be high-stakes and formal. It is then a certificate or certification, or sometimes a licensure exam. More on that here. This can be internal to the organization, or external.

Internal certification: The credential is awarded by the training organization, and the exam is specifically tied to a certain product or process that the organization provides in the market. There are many such examples in the software industry. You can get certifications in AWS, SalesForce, Microsoft, etc. One of our clients makes MRI and other medical imaging machines; candidates are certified on how to calibrate/fix them.

External certification: The credential is awarded by an external board or government agency, and the exam is industry-wide. An example of this is the SIE exams offered by FINRA. A candidate might go to work at an insurance company or other financial services company, who trains them and sponsors them to take the exam in hopes that the company will get a return by the candidate passing and then selling their insurance policies as an agent. But the company does not sponsor the exam; FINRA does.

360-degree assessments and other performance appraisals

Job performance is one of the most important concepts in HR, and also one that is often difficult to measure. John Campbell, one of my thesis advisors, was known for developing an 8-factor model of performance. Some aspects are subjective, and some are easily measured by real-world data, such as number of widgets made or number of cars sold by a car salesperson. Others involve survey-style assessments, such as asking customers, business partners, co-workers, supervisors, and subordinates to rate a person on a Likert scale. HR assessment platforms are needed to develop, deliver, and score such assessments.

The Benefits of Using Professional-Level Exam Software

Now that you have a good understanding of what pre-employment and other HR tests are, let’s discuss the benefits of integrating pre-employment assessment software into your hiring process. Here are some of the benefits:

Saves Valuable resources

Unlike the lengthy and costly traditional hiring processes, pre-employment assessment software helps companies increase their ROI by eliminating HR snugs such as face-to-face interactions or geographical restrictions. Pre-employment testing tools can also reduce the amount of time it takes to make good hires while reducing the risks of facing the financial consequences of a bad hire.

Supports Data-Driven Hiring Decisions

Data runs the modern world, and hiring is no different. You are better off letting complex algorithms crunch the numbers and help you decide which talent is a fit, as opposed to hiring based on a hunch or less-accurate methods like an unstructured interview. Pre-employment assessment software helps you analyze assessments and generate reports/visualizations to help you choose the right candidates from a large talent pool.

Improving candidate experience

Candidate experience is an important aspect of a company’s growth, especially considering the fact that 69% of candidates admitting not to apply for a job in a company after having a negative experience. Good candidate experience means you get access to the best talent in the world.

Elimination of Human Bias

Traditional hiring processes are based on instinct. They are not effective since it’s easy for candidates to provide false information on their resumes and cover letters. But, the use of pre-employment assessment software has helped in eliminating this hurdle. The tools have leveled the playing ground, and only the best candidates are considered for a position.

What To Consider When Choosing HR assessment tools

Now that you have a clear idea of what pre-employment tests are and the benefits of integrating pre-employment assessment software into your hiring process, let’s see how you can find the right tools.

Here are the most important things to consider when choosing the right pre-employment testing software for your organization.

Ease-of-use

The candidates should be your top priority when you are sourcing pre-employment assessment software. This is because the ease of use directly co-relates with good candidate experience. Good software should have simple navigation modules and easy comprehension.

Here is a checklist to help you decide if a pre-employment assessment software is easy to use:

Are the results easy to interpret?
What is the UI/UX like?
What ways does it use to automate tasks such as applicant management?
Does it have good documentation and an active community?

Tests Delivery and Remote Proctoring

Good online assessment software should feature good online proctoring functionalities. This is because most remote jobs accept applications from all over the world. It is therefore advisable to choose a pre-employment testing software that has secure remote proctoring capabilities. Here are some things you should look for on remote proctoring;

Does the platform support security processes such as IP-based authentication, lockdown browser, and AI-flagging?
What types of online proctoring does the software offer? Live real-time, AI review, or record and review?
Does it let you bring your own proctor?
Does it offer test analytics?

Test & data security, and compliance

Defensibility is what defines test security. There are several layers of security associated with pre-employment test security. When evaluating this aspect, you should consider what pre-employment testing software does to achieve the highest level of security. This is because data breaches are wildly expensive.

The first layer of security is the test itself. The software should support security technologies and frameworks such as lockdown browser, IP-flagging, and IP-based authentication.

The other layer of security is on the candidate’s side. As an employer, you will have access to the candidate’s private information. How can you ensure that your candidate’s data is secure? That is reason enough to evaluate the software’s data protection and compliance guidelines.

A good pre-employment testing software should be compliant with certifications such as GDRP. The software should also be flexible to adapt to compliance guidelines from different parts of the world.

Questions you need to ask;

What mechanisms does the software employ to eliminate infidelity?
Is their remote proctoring function reliable and secure?
Are they compliant with security compliance guidelines including ISO, SSO, or GDPR?
How does the software protect user data?

Psychometrics

Psychometrics is the science of assessment, helping to drive accurate scores from defensible tests, as well as making them more efficient, reducing bias, and a host of other benefits. You should ensure that your solution supports the necessary level of psychometrics. Some suggestions:

Storing item statistics in the item banker
Managing modified-Angoff studies
Using test statistics for form assembly, such as the test information function
Support for item response theory
Adaptive testing

User experience

A good user experience is a must-have when you are sourcing any enterprise software. A new age pre-employment testing software should create user experience maps with both the candidates and employer in mind. Some ways you can tell if a software offers a seamless user experience includes;

User-friendly interface
Simple and easy to interact with
Easy to create and manage item banks
Clean dashboard with advanced analytics and visualizations

Customizing your user-experience maps to fit candidates’ expectations attracts high-quality talent.

Scalability and automation

With a single job post attracting approximately 250 candidates, scalability isn’t something you should overlook. A good pre-employment testing software should thus have the ability to handle any kind of workload, without sacrificing assessment quality.

It is also important you check the automation capabilities of the software. The hiring process has many repetitive tasks that can be automated with technologies such as Machine learning, Artificial Intelligence (AI), and robotic process automation (RPA).

Here are some questions you should consider in relation to scalability and automation;

Does the software offer Automated Item Generation (AIG)?
How many candidates can it handle?
Can it support candidates from different locations worldwide?

Reporting and analytics

A good pre-employment assessment software will not leave you hanging after helping you develop and deliver the tests. It will enable you to derive important insight from the assessments.

The analytics reports can then be used to make data-driven decisions on which candidate is suitable and how to improve candidate experience. Here are some queries to make on reporting and analytics.

Does the software have a good dashboard?
What format are reports generated in?
What are some key insights that prospects can gather from the analytics process?
How good are the visualizations?

Customer and Technical Support

Customer and technical support is not something you should overlook. A good pre-employment assessment software should have an Omni-channel support system that is available 24/7. This is mainly because some situations need a fast response. Here are some of the questions your should ask when vetting customer and technical support:

What channels of support does the software offer/How prompt is their support?
How good is their FAQ/resources page?
Do they offer multi-language support mediums?
Do they have dedicated managers to help you get the best out of your tests?

Conclusion

Finding the right HR assessment software is a lengthy process, yet profitable in the long run. We hope the article sheds some light on the important aspects to look for when looking for such tools. Also, don’t forget to take a pragmatic approach when implementing such tools into your hiring process.

Are you stuck on how you can use pre-employment testing tools to improve your hiring process? Feel free to contact us and we will guide you on the entire process, from concept development to implementation. Whether you need off-the-shelf tests or a comprehensive platform to build your own exams, we can provide the guidance you need. We also offer free versions of our industry-leading software FastTest and Assess.ai – visit our Contact Us page to get started!

If you are interested in delving deeper into leadership assessments, you might want to check out this blog post. For more insights and an example of how HR assessments can fail, check out our blog post called Public Safety Hiring Practices and Litigation. The blog post titled Improving Employee Retention with Assessment: Strategies for Success explores how strategic use of assessments throughout the employee lifecycle can enhance retention, build stronger teams, and drive business success by aligning organizational goals with employee development and engagement.

June 30, 2022/by Admin

Incremental Validity

Nathan Thompson, PhDJune 3, 2022

Incremental validity is a specific aspect of criterion-related validity that refers to what an additional assessment or predictive variable can add to the information provided by existing assessments or variables. It refers to the amount of “bonus” predictive power by adding in another predictor. In many cases, it is on the same or similar trait, but often the most incremental validity comes from using a predictor/trait that is relatively unrelated to the original. See examples below.

Note that this is often discussed with respect to tests and assessment, but in many cases a predictor is not a test or assessment, as you will also see.

How is Incremental Validity Evaluated?

It is most often quantified with a linear regression model and correlations. However, any predictive modeling approach could work from support vector machines to neural networks.

Example of Incremental Validity: University Admissions

One of the most commonly used predictors for university admissions is an admissions test, or battery of tests. You might be required to take an assessment which includes an English/Verbal test, a Logic/Reasoning test, and a Quantitative/Math test. These might be used individually or aggregate to create a mathematical model, based on past data, that predicts your performance at university. (There are actually several variables for this, such as first year GPA, final GPA, and 4 year graduation rate, but that’s beyond the scope of this article.)

Of course, the admissions exams scores are not the only point of information that the university has on students. It also has their high school GPA, perhaps an admissions essay which is graded by instructors, and so on. Incremental validity poses this question: if the admissions exam correlates 0.59 with first year GPA, what happens if we make it into a multiple regression/correlation with High School GPA (HGPA) as a second predictor? It might go up to, say, 0.64. There is an increment of 0.05. If the university has that data from students, they would be wasting it by not using it.

Of course, HGPA will correlate very highly with the admissions exam scores. So it will likely not add a lot of incremental validity. Perhaps the school finds that essays add a 0.09 increment to the predictive power, because it is more orthogonal to the admissions exam scores. Does it make sense to add that, given the additional expense of scoring thousands of essays? That’s a business decision for them.

Example of Incremental Validity: Pre-Employment Testing

Another common use case is that of pre-employment testing, where the purpose of the test is to predict criterion variables like job performance, tenure, 6-month termination rate, or counterproductive work behavior. You might start with a skills test; perhaps you are hiring accountants or bookkeepers and you give them a test on MS Excel. What additional predictive power would we get by also doing a quantitative reasoning test? Probably some, but that most likely correlates highly with MS Excel knowledge. So what about using a personality assessment like Conscientiousness? That would be more orthogonal. It’s up to the researcher to determine what the best predictors are. This topic, personnel selection, is one of the primary areas of Industrial/ Organizational Psychology.

June 3, 2022/by Nathan Thompson, PhD

Public Safety Hiring Practices and Litigation

Jorie BoswellMay 27, 2022

QUESTION: “What are the costs associated with using validated assessments in public safety hiring?”

ANSWER: “Always cheaper than a lawsuit!”

It is not uncommon for public safety hiring practices to be called into question. There are several landmark court cases surrounding discrimination in hiring or testing that prove that point. Each year, millions and millions of dollars are spent defending or rectifying these occurrences. It is vital that steps are taken to avoid even the appearance of discrimination.

These four mistakes in public safety testing are some of the most common oversights made by human resources and public safety personnel. It is imperative that those responsible for hiring and promotional processes stay vigilant and aware of their legal responsibilities throughout the hiring and promotional process.

# 1: Failing to validate the written test to a current job description for public safety hiring

Test questions must be related to the job description. This is one of the biggest mistakes that hiring officials make and is a frequent reason for public safety testing lawsuits. The test must either measure critical skills and abilities necessary for the job, or must predict which candidates will be most successful on the job (predictive validity). At the very least, the most important skills should be reflected on the test.

The United States filed a lawsuit against the City of New York in 2007 for unfair public safety hiring practices. The United States alleged that the examinations that the City used for hiring its firefighters was not an adequate method for determining whether an applicant qualified for the position or not. In this case, Judge Nicholas G. Garaufis ruled in favor of the United States. He determined that the City was in violation of Title VII and the written examinations that were used excluded certain minorities like Black and Hispanic candidates and were not job-related.

# 2: Failing to include job-related practices that mitigate adverse impact

The City of New Haven, Connecticut, found itself in hot water in 2003, when seeking to fill 15 supervisory positions for it fire department. The test consisted of an oral and a written exam. There were 118 firefighters who took the test. When the test scores were calculated, there was a distinct racial bias. The White applicants passed the test at a rate that was twice that of the Black applicants. During the court case, it was determined that the fire department was guilty of disparate-impact discrimination.

Simply put, disparate impact discrimination occurs when hiring practice rules or tests show a distinct slant toward one race. In this case it was asserted that the test and ranking was structured in such a way that it eliminated any Black or Hispanic applicants.

# 3: Failing to use a locally-validated, job-related Physical Ability Test (PAT)

Of all of the selection practices administered by public safety departments, the physical ability test, or PAT, is most likely to have the highest failure rate to females. It’s imperative that the PAT measures the critical physical skills that a police officer must possess on day one. Those departments that utilize work-sample PATs rather than fitness tests tend to have more success in court as it is easier to demonstrate job-relatedness to a PAT that measures specific, critical job duties than a fitness test that requires candidates to run a mile-and-one-half or complete a number of pushups and sit-ups.

# 4: Failing to use a structured interview with trained raters

A study was conducted analyzing the occurrence of litigation across the different tests included in most entry-level recruitments for public safety. Of the most common selection practices (i.e., a written test, a PAT, and an interview), the unstructured interview was the selection practice that was most commonly challenged in court and which resulted in the plaintiff’s success in court.

Departments should ensure that the questions asked during the interview are structured, job-related and utilize structured scoring methods. Additionally, all parties who sit on the interview panel should be properly trained in how to objectively administer, assess, and score the interview. Questions like “Tell me why you want to work with our department” should never be included in a structured interview process.

About FPSI

This is a guest post on pre-employment testing and hiring practices in public safety, by one of the leaders in the field, Fire & Police Selection, Inc. (FPSI). FPSI consultants are well-versed in public safety litigation. Contact us for assistance with your public safety testing needs.

May 27, 2022/by Jorie Boswell

Test Score Reliability and Validity

Nathan Thompson, PhDMay 13, 2022

Test score reliability and validity are core concepts in the field of psychometrics and assessment. Both of them refer to the quality of a test, the scores it produces, and how we use those scores. Because test scores are often used for very important purposes with high stakes, it is of course paramount that the tests be of high quality. But because it is such a complex situation, it is not a simple yes/no answer of whether a test is good. There is a ton of work that goes into establishing validity and reliability, and that work never ends!

This post provide an introduction to this incredibly complex topic. For more information, we recommend you delve into books that are dedicated to the topic. Here is a classic.

Why do we need reliability and validity?

To begin a discussion of reliability and validity, let us first pose the most fundamental question in psychometrics: Why are we testing people? Why are we going through an extensive and expensive process to develop examinations, inventories, surveys, and other forms of assessment? The answer is that the assessments provide information, in the form of test scores and subscores, that can be used for practical purposes to the benefit of individuals, organizations, and society. Moreover, that information is of higher quality for a particular purpose than information available from alternative sources. For example, a standardized test can provide better information about school students than parent or teacher ratings. A preemployment test can provide better information about specific job skills than an interview or a resume, and therefore be used to make better hiring decisions.

So, exams are constructed in order to draw conclusions about examinees based on their performance. The next question would be, just how supported are various conclusions and inferences we are making? What evidence do we have that a given standardized test can provide better information about school students than parent or teacher ratings? This is the central question that defines the most important criterion for evaluating an assessment process: validity. Validity, from a broad perspective, refers to the evidence we have to support a given use or interpretation of test scores. The importance of validity is so widely recognized that it typically finds its way into laws and regulations regarding assessment (Koretz, 2008).

Test score reliability is a component of validity. Reliability indicates the degree to which test scores are stable, reproducible, and free from measurement error. If test scores are not reliable, they cannot be valid since they will not provide a good estimate of the ability or trait that the test intends to measure. Reliability is therefore a necessity but not sufficient condition for validity.

Test Score Reliability

Reliability refers to the precision, accuracy, or repeatability of the test scores. There is no universally accepted way to define and evaluate the concept; classical test theory provides several indices, while item response theory drops the idea of a single index (and drops the term “reliability” entirely!) and reconceptualizes it as a conditional standard error of measurement, an index of precision. This is actually a very important distinction, though outside the scope of this article.

An extremely common way of evaluating classical test reliability is the internal consistency index, called KR-20 or α (alpha). The KR-20 index ranges from 0.0 (test scores are comprised only of random error) to 1.0 (scores have no measurement error). Of course, because human behavior is generally not perfectly reproducible, perfect reliability is not possible; typically, a reliability of 0.90 or higher is desired for high-stakes certification exams. The relevant standard for a test depends on its stakes. A test for medical doctors might require reliability of 0.95 or greater. A test for florists or a personality self-assessment might suffice with 0.80. Another method for assessing reliability is the Split Half Reliability Index, which can also be useful depending on the context and nature of the test.

Reliability depends on several factors, including the stability of the construct, length of the test, and the quality of the test items.

Stability of the construct: Reliability will be higher if the trait/ability is more stable (mood is inherently difficult to measure repeatedly). A test sponsor typically has little control over the nature of the construct – if you need to measure knowledge of algebra, well, that’s what we have to measure, and there’s no way around that.
Length of the test: Obviously, a test with 100 items is going to produce better scores than one with 5 items, assuming the items are not worthless.
Item Quality: A test will have higher reliability if the items are good. Often, this is operationalized as point-biserial discrimination coefficients.

How to you calculate reliability? You need psychometric analysis software like Iteman.

Validity

Validity is conventionally defined as the extent to which a test measures what it purports to measure. Test validation is the process of gathering evidence to support the inferences made by test scores. Validation is an ongoing process which makes it difficult to know when one has reached a sufficient amount of validity evidence to interpret test scores appropriately.

Academically, Messick (1989) defines validity as an “integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of measurement.” This definition suggests that the concept of validity contains a number of important characteristics to review or propositions to test and that validity can be described in a number of ways. The modern concept of validity (AERA, APA, & NCME Standards) is multi-faceted and refers to the meaningfulness, usefulness, and appropriateness of inferences made from test scores.

First of all, validity is not an inherent characteristic of a test. It is the reasonableness of using the test score for a particular purpose or for a particular inference. It is not correct to say a test or measurement procedure is valid or invalid. It is more reasonable to ask, “Is this a valid use of test scores or is this a valid interpretation of the test scores?” Test score validity evidence should always be reviewed in relation to how test scores are used and interpreted. Example: we might use a national university admissions aptitude test as a high school graduation exam, since they occur in the same period of a student’s life. But it is likely that such a test does not match the curriculum of a particular state, especially since aptitude and achievement are different things! You could theoretically use the aptitude test as a pre-employment exam as well; while valid in its original use it is likely not valid in that use.

Secondly, validity cannot be adequately summarized by a single numerical index like a reliability coefficient or a standard error of measurement. A validity coefficient may be reported as a descriptor of the strength of relationship between other suitable and important measurements. However, it is only one of many pieces of empirical evidence that should be reviewed and reported by test score users. Validity for a particular test score use is supported through an accumulation of empirical, theoretical, statistical, and conceptual evidence that makes sense for the test scores.

Thirdly, there can be many aspects of validity dependent on the intended use and intended inferences to be made from test scores. Scores obtained from a measurement procedure can be valid for certain uses and inferences and not valid for other uses and inferences. Ultimately, an inference about probable job performance based on test scores is usually the kind of inference desired in test score interpretation in today’s test usage marketplace. This can take the form of making an inference about a person’s competency measured by a tested area.

Example 1: A Ruler

A standard ruler has both reliability and validity. If you measure something that is 10 cm long, and measure it again and again, you will get the same measurement. It is highly consistent and repeatable. And if the object is actually 10 cm long, you have validity. (If not, you have a bad ruler.)

Example 2: A Bathroom Scale

Bathroom scales are not perfectly reliable (though this is often a function of their price). But that meets the reliability requirements of this measurement.

If you weigh 180 lbs, and step on the scale several times, you will likely get numbers like 179.8 or 180.1. That is quite reliable, and valid.
If the numbers were more spread out, like 168.9 and 185.7, then you can consider it unreliable but valid.
If the results were 190.00 lbs every time, you have perfectly reliable measurement… but poor validity
If the results were spread like 25.6, 2023.7, 0.000053 – then it is neither reliable or valid.

This is similar to the classic “target” example of reliability and validity, like you see below (image from Wikipedia).

Example 3: A Pre-Employment Test

Now, let’s get to a real example. You have a test of quantitative reasoning that is being used to assess bookkeepers that apply for a job at a large company. Jack has very high ability, and scores around the 90th percentile each time he takes the test. This is reliability. But does it actually predict job performance? That is validity. Does it predict job performance better than a Microsoft Excel test? Good question, time for some validity research. What if we also tack on a test of conscientiousness? That is incremental validity.

Summary

In conclusion, validity and reliability are two essential aspects in evaluating an assessment, be it an examination of knowledge, a psychological inventory, a customer survey, or an aptitude test. Validity is an overarching, fundamental issue that drives at the heart of the reason for the assessment in the first place: the use of test scores. Reliability is an aspect of validity, as it is a necessary but not sufficient condition. Developing a test that produces reliable scores and valid interpretations is not an easy task, and progressively higher stakes indicate a progressively greater need for a professional psychometrician. High-stakes exams like national university admissions often have teams of experts devoted to producing a high quality assessment.

If you need professional psychometric consultancy and support, do not hesitate to contact us.

May 13, 2022/by Nathan Thompson, PhD

Test Score Equating and Linking

Laila Issayeva M.Sc.October 31, 2021

Test equating refers to the issue of defensibly translating scores from one test form to another. That is, if you have an exam where half of students see one set of items while the other half see a different set, how do you know that a score of 70 is the same one both forms? What if one is a bit easier? If you are delivering assessments in conventional linear forms – or piloting a bank for CAT/LOFT – you are likely to utilize more than one test form, and, therefore, are faced with the issue of test equating.

When two test forms have been properly equated, educators can validly interpret performance on one test form as having the same substantive meaning compared to the equated score of the other test form (Ryan & Brockmann, 2009). While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. This post will provide an overview of the topic.

Why do we need test linking and equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale. Suppose you take Form A and get a score of 72/100 while your friend takes Form B and gets a score of 74/100. Is your friend smarter than you, or did his form happen to have easier questions? What if the passing score on the exam was 73? Well, if the test designers built-in some overlap of items between the forms, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items. They are delivered to a large, representative sample. Here are the results.

Mean score on 50 overlap items	Mean score on 100 total items
30	72
32	74

Because the mean score on the anchor items was higher, we then think that the Form B group was a little smarter, which led to a higher total score.

Now suppose these are the results:

Mean score on 50 overlap items	Mean score on 100 total items
32	72
32	74

Now, we have evidence that the groups are of equal ability. The higher total score on Form B must then be because the unique items on that form are a bit easier.

What is test equating?

According to Ryan and Brockmann (2009), “Equating is a technical procedure or process conducted to establish comparable scores, with equivalent meaning, on different versions of test forms of the same test; it allows them to be used interchangeably.” (p. 8). Thus, successful equating is an important factor in evaluating assessment validity, and, therefore, it often becomes an important topic of discussion within testing programs.

Practice has shown that scores, and tests producing scores, must satisfy very strong requirements to achieve this demanding goal of interchangeability. Equating would not be necessary if test forms were assembled as strictly parallel, meaning that they would have identical psychometric properties. In reality, it is almost impossible to construct multiple test forms that are strictly parallel, and equating is necessary to attune a test construction process.

Dorans, Moses, and Eignor (2010) suggest the following five requirements towards equating of two test forms:

tests should measure the same construct (e.g. latent trait, skill, ability);
tests should have the same level of reliability;
equating transformation for mapping the scores of tests should be the inverse function;
test results should not depend on the test form an examinee actually takes;
the equating function used to link the scores of two tests should be the same regardless of the choice of (sub) population from which it is derived.

Detecting item parameter drift (IPD) is crucial for the equating process because it helps in identifying items whose parameters have changed. By addressing IPD, test developers can ensure that the equating process remains valid and reliable.

How do I calculate an equating?

Classical test theory (CTT) methods include linear equating and equipercentile equating as well as several others. Some newer approaches that work well with small samples are Circle-Arc (Livingston & Kim, 2009) and Nominal Weights (Babcock, Albano, & Raymond, 2012). Specific methods for linear equating include Tucker, Levine, and Chained (von Davier & Kong, 2003). Linear equating approaches are conceptually simple and easy to interpret; given the examples above, the equating transformation might be estimated with a slope of 1.01 and an intercept of 1.97, which would directly confirm the hypothesis that one form was about 2 points easier than the other.

Item response theory (IRT) approaches include equating through common items (equating by applying an equating constant, equating by concurrent or simultaneous calibration, and equating with common items through test characteristic curves), and common person calibration (Ryan & Brockmann, 2009). The common-item approach is quite often used, and specific methods for finding the constants (conversion parameters) include Stocking-Lord, Haebara, Mean/Mean, and Mean/Sigma. Because IRT assumes that two scales on the same construct differ by only a simple linear transformation, all we need to do is find the slope and intercept of that transformation. Those methods do so, and often produce nice looking figures like the one below from the program IRTEQ (Han, 2007). Note that the b parameters do not fall on the identity line, because there was indeed a difference between the groups, and the results clearly find that is the case.

Practitioners can equate forms with CTT or IRT. However, one of the reasons that IRT was invented was that equating with CTT was very weak. Hambleton and Jones (1993) explain that when CTT equating methods are applied, both ability parameter (i.e., observed score) and item parameters (i.e., difficulty and discrimination) are dependent on each other, limiting its utility in practical test development. IRT solves the CTT interdependency problem by combining ability and item parameters in one model. The IRT equating methods are more accurate and stable than the CTT methods (Hambleton & Jones, 1993; Han, Kolen, & Pohlmann, 1997; De Ayala, 2013; Kolen and Brennan, 2014) and provide a solid basis for modern large-scale computer-based tests, such as computerized adaptive tests (Educational Testing Service, 2010; OECD, 2017).

Of course, one of the reasons that CTT is still around in general is that it works much better with smaller samples, and this is also the case for CTT test equating (Babcock, Albano, & Raymond, 2012).

How do I implement test equating?

Test equating is a mathematically complex process, regardless of which method you use. Therefore, it requires special software. Here are some programs to consider.

CIPE performs both linear and equipercentile equating with classical test theory. It is available from the University of Iowa’s CASMA site, which also includes several other software programs.
IRTEQ is an easy-to-use program which performs all major methods of IRT Conversion equating. It is available from the University of Massachusetts website, as well as several other good programs.
There are many R packages for equating and related psychometric topics. This article claims that there are 45 packages for IRT analysis alone!
If you want to do IRT equating, you need IRT calibration software. We highly recommend Xcalibre since it is easy to use and automatically creates reports in Word for you. If you want to do the calibration approach to IRT equating (both anchor-item and concurrent-calibration), rather than the conversion approach, this is handled directly by IRT software like Xcalibre. For the conversion approach, you need separate software like IRTEQ.

Equating is typically performed by highly trained psychometricians; in many cases, an organization will contract out to a testing company or consultant with the relevant experience. Contact us if you’d like to discuss this.

Does equating happen before or after delivery?

Both. These are called pre-equating and post-equating (Ryan & Brockmann, 2009). Post-equating means the calculation is done after delivery and you have a full data set, for example if a test is delivered twice per year on a single day, we can do it after that day. Pre-equating is more tricky, because you are trying to calculate the equating before a test form has ever been delivered to an examinee; but this is 100% necessary in many situations, especially those with continuous delivery windows.

How do I learn more about test equating?

If you are eager to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014) that provides the most complete coverage of score equating and linking. There are other resources more readily available on the internet, like this free handbook from CCSSO. If you would like to learn more about IRT, we suggest the books by De Ayala (2008) and Embretson and Reise (2000). A brief intro of IRT equating is available on our website.

Several new ideas of general use in equating, with a focus on kernel equating, were introduced in the book by von Davier, Holland, and Thayer (2004). Holland and Dorans (2006) presented a historical background for test score linking, based on work by Angoff (1971), Flanagan (1951), and Petersen, Kolen, and Hoover (1989). If you look for a straightforward description of the major issues and procedures encountered in practice, then you should turn to Livingston (2004).

Want to learn more? Talk to a Psychometric Consultant!

References

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). American Council on Education.

Babcock, B., Albano, A., & Raymond, M. (2012). Nominal Weights Mean Equating: A Method for Very Small Samples. Educational and Psychological Measurement, 72(4), 1-21.

Dorans, N. J., Moses, T. P., & Eignor, D. R. (2010). Principles and practices of test score equating. ETS Research Report Series, 2010(2), i-41.

De Ayala, R. J. (2008). A commentary on historical perspectives on invariant measurement: Guttman, Rasch, and Mokken.

De Ayala, R. J. (2013). Factor analysis with categorical indicators: Item response theory. In Applied quantitative analysis in education and the social sciences (pp. 220-254). Routledge.

Educational Testing Service (2010). Linking TOEFL iBT Scores to IELTS Scores: A Research Report. Educational Testing Service.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Maheah.

Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695-763). American Council on Education.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational measurement: issues and practice, 12(3), 38-47.

Han, T., Kolen, M., & Pohlmann, J. (1997). A comparison among IRT true-and observed-score equatings and traditional equipercentile equating. Applied Measurement in Education, 10(2), 105-121.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187-220). Praeger.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, linking, and scaling: Methods and practices (2nd ed.). Springer-Verlag.

Kolen, M. J., & Brennan, R. L. (2014). Item response theory methods. In Test Equating, Scaling, and Linking (pp. 171-245). Springer.

Livingston, S. A. (2004). Equating test scores (without IRT). ETS.

Livingston, S. A., & Kim, S. (2009). The Circle‐Arc Method for Equating in Small Samples. Journal of Educational Measurement 46(3): 330-343.

OECD (2017). PISA 2015 Technical Report. OECD Publishing.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). Macmillan.

Ryan, J., & Brockmann, F. (2009). A Practitioner’s Introduction to Equating with Primers on Classical Test Theory and Item Response Theory. Council of Chief State School Officers.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. Springer.

von Davier, A. A., & Kong, N. (2003). A unified approach to linear equating for non-equivalent groups design. Research report 03-31 from Educational Testing Service. https://www.ets.org/Media/Research/pdf/RR-03-31-vonDavier.pdf

October 31, 2021/by Laila Issayeva M.Sc.

Certification, Certificate, and Licensure Exams – What is the difference?

Nathan Thompson, PhDOctober 10, 2021

Certification, Certificate, and Licensure are terms that are used quite frequently to refer to credentialing examinations that someone has to pass to demonstrate skills in a certain profession or topic. They are quite similar, and often confused. This is exacerbated by even more similar terms in the field, such as accreditation, badge, and microcredentials. This post will help you understand the differences.

What is Certification?

Certification is “a credential that you earn to show that you have specific skills or knowledge. They are usually tied to an occupation, technology, or industry.” (CareerOneStop) The important aspect in this definition is the latter portion; the organization that runs the certification is generally across an industry or a profession, regardless of political boundaries. It is almost always some sort of professional association or industry board, like the American Association of Widgetmakers (obviously not a real thing). However, it is sometimes governed by a specific company or other organization regarding their products; perhaps the most well known is how Amazon Web Services will certify you in skills to hand their offerings. Many other technology and software companies do the same.

Professional or personnel certification is a voluntary process by which individuals are evaluated against predetermined standards for knowledge, skills, or competencies. Participants who demonstrate that they meet the standards by successfully passing an assessment process are granted the certification. Note that this 1) refers to an individual (not a program or organization) and 2) refers only to showing that they have competencies (not simply attending or completing a course).

Certifications are usually put out by independent organizations, not a government or other entity (that is licensure). Usually it is a nonprofit association or board, which can be for a specific country (American Board of _______) or worldwide (International Association of ________).

A lot of work goes into developing good certifications, sometimes millions of dollars. This includes test development practices like job task analysis, item writer training workshops, modified-Angoff cutscore studies, equating, and item response theory.

What is a Certificate?

A certificate program is less rigorous than a certification, and is often just a piece of paper that you might receive after sitting through a course. For example, you might take an online course in a MOOC where you watch a few video lectures and perhaps answer a few quizzes with no proctoring, and afterwards you receive a certificate.

A step up from this is an assessment based certificate which requires that you pass a fairly exam afterwards. This exam is sometimes built with some of the stringent practices of a certification, like a modified-Angoff standard setting study.

An assessment-based certificate program is a non-degree granting program that:
(a) provides instruction and training to aid participants in acquiring specific knowledge, skills, and/or competencies associated with intended learning outcomes;
(b) evaluates participants’ achievement of the intended learning outcomes; and
(c) awards a certificate only to those participants who meet the performance, proficiency or passing standard for the assessment(s).

What is Licensure?

Licensure is a “formal permission to do something: esp., authorization by law to do some specified thing (license to marry, practice medicine, hunt, etc.)” (Schmitt, 1995). The key phrase here is by law. The sponsoring organization is a governmental entity, and that is defines what licensure is. In fact, licensure is not even always about a profession; almost all of us have a Driver’s License for which we passed a simple exam. Moreover, it does not always even have to be about a profession; many millions of people have a Fishing License, which is granted by the government (by States in the USA), for which you simply pay a small fee.

The license is still an attestation, but not of your skills, just that you have been authorized to do something by the government. Of course, in the context of assessment, it means that you have passed some sort of exam which is mandated by law, typically for professions that are dangerous enough or impact a wide range of people that the government has stepped in to provide oversight: attorneys, physicians, medical professionals, etc.

Certification vs Licensure Exams

Usually, there is a test that you must pass, but the sponsor can differ with certification vs licensure. The development and delivery of such tests is extremely similar, leading to the confusion. The difference between the two is outside the test itself, and instead refers to the sponsoring organization: is it mandated/governed by a governmental entity, or is it unrelated to political/governmental boundaries? You are awarded a credential after successful completion, but the difference is in the group that awards the credential, what it means, and where it is recognized.

However, there are many licensures that do not involve an exam, but you simply need to file some paperwork with the government. An example of this is a marriage license. You certainly don’t have to take a test to qualify!

Can they be the same exam?

To make things even more confusing… yes. And it does not even have to be consistent. In the US, some professions have a wide certification, which is also required in some States as licensure, but not in all States! Some States might have their own exams, or not even require an exam. This muddles the difference between certification vs licensure.

Differences between Certification and Licensure

Aspect	Certification	Licensure
Mandatory?	No	Yes
Run by	Association, Board, Nonprofit, Private Company	Government
Does it use an exam?	Yes, especially if it is accredited	Sometimes, but often not (consider a marriage license)
Accreditation involved?	Yes, NCCA and ANSI provide accreditation that a certification is high quality	No; often there is no check on quality
Examples	Certified Chiropractic Sports Physician (CCSP®), Certified in Clean Needle Technique (CNT)	Marriage license; Driver’s License; Fishing License; License to practice law (Bar Exam)

How do these terms relate to other, similar terms?

This outline summarizes some of the relevant terms regarding certification vs licensure and other credentials. This is certainly more than can be covered in a single blog post!

Attestation of some level of quality for a person or organization = CREDENTIALING
- Attestation of a person
  - By government = LICENSURE
  - By independent board or company
    - High stakes, wide profession = CERTIFICATION
    - Medium stakes = CERTIFICATE
    - Low stakes, quite specific skill = MICROCREDENTIAL
  - By an educational institution = DEGREE OR DIPLOMA
- Attestation of an organization = ACCREDITATION

Learn more about accreditation of certification organizations in this article.

October 10, 2021/by Nathan Thompson, PhD

The Bookmark Method of Standard Setting

Laila Issayeva M.Sc.October 8, 2021

The Bookmark Method of standard setting (Lewis, Mitzel, & Green, 1996) is a scientifically-based approach to setting cutscores on an examination. It allows stakeholders of an assessment to make decisions and classifications about examinees that are constructive rather than arbitrary (e.g., 70%), meet the goals of the test, and contribute to overall validity. A major advantage of the bookmark method over others is that it utilizes difficulty statistics on all items, making it very data-driven; but this can also be a disadvantage in situations where such data is not available. It also has the advantage of panelist confidence (Karantonis & Sireci, 2006).

The bookmark method operates by delivering a test to a representative sample (or population) of examinees, and then calculating the difficulty statistics for each item. We line up the items in order of difficulty, and experts review the items to place a bookmark where they think a cutscore should be. Nowadays, we use computer screens, but of course in the past this was often done by printing the items in paper booklets, and the experts would literally insert a bookmark.

What is standard setting?

Standard setting (Cizek & Bunch, 2006) is an integral part of the test development process even though it has been undervalued outside of practitioners’ view in the past (Bejar, 2008). Standard setting is the methodology of defining achievement or proficiency levels and corresponding cutscores. A cutscore is a score that serves as a measure of classifying test takers into categories.

Educational assessments and credentialing examinations are often employed to distribute test takers among ordered categories according to their performance across specific content and skills (AERA, APA, & NCME, 2014; Hambleton, 2013). For instance, in tests used for certification and licensing purposes, test takers are typically classified as “pass”—those who score at or above the cutscore—and those who “fail”. In education, students are often classified in terms of proficiency; the Nation’s Report Card assessment (NAEP) in the United States classifies students as Below Basic, Basic, Proficient, Advanced.

However, assessment results could come into question unless the cutscores are appropriately defined. This is why arbitrary cutscores are considered indefensible and lacking validity. Instead, psychometricians help test sponsors to set cutscores using methodologies from the scientific literature, driven by evaluations of item and test difficulty as well as examinee performance.

When to use the bookmark method?

Two approaches are mainly used in international practice to establish assessment standards: the Angoff method (Cizek, 2006) and the Bookmark method (Buckendahl, Smith, Impara, & Plake, 2000). The Bookmark method, unlike the Angoff method, requires the test to be administered prior to defining cutscores based on test data. This provides additional weight to the validity of the process, and better informs the subject matter experts during the process. Of course, many exams require a cutscore to be set before it is published, which is impossible with the bookmark; the Angoff procedure is very useful then.

How do I implement the bookmark method?

The process of standard setting employing the Bookmark method consists of the following stages:

Identify a team of subject matter experts (SMEs); their number should be around 6-12, and led by a test developer/psychometrician/statistician
Analyze test takers’ responses by means of the item response theory (IRT)
Create a list items according to item difficulty in an ascending order
Define the competency levels for test takers; for example, have the 6-12 experts discuss what should differentiate a “pass” candidate from a “fail” candidate
Experts read the items in the ascending order (they do not need to see the IRT values), and place a bookmark where appropriate based on professional judgement across well-defined levels
Calculate thresholds based on the bookmarks set, across all experts
If needed, discuss results and perform a second round

Example of the Bookmark Method

If there are four competency levels such as the NAEP example, then SMEs need to set up three bookmarks in-between: first bookmark is set after the last item in a row that fits the minimally competent candidate for the first level, then second and third. There are thresholds/cutscores from 1 to 2, 2 to 3, and 3 to 4. SMEs perform this individually without discussion, by reading the items.

When all SMEs have provided their opinion, the standard setting coordinator combines all results into one spreadsheet and leads the discussion when all participants express their opinion referring to the bookmarks set. This might look like the sheet below. Note that SME4 had a relatively high standard in their mind, while SME2 had a low standard in their mind – placing virtually every student above an IRT score of 0.0 into the top category!

After the discussion, the SMEs are given one more opportunity to set the bookmarks again. Usually, after the exchange of opinions, the picture alters. SMEs gain consensus, and the variation in the graphic is reduced. An example of this is below.

What to do with the results?

Based on the SMEs’ voting results, the coordinator or psychometrician calculates the final thresholds on the IRT scale, and provides them to the analytical team who would ultimately prepare reports for the assessment across competency levels. This might entail score reports to examinees, feedback reports to teachers, and aggregate reports to test sponsors, government officials, and more.

You can see how the scientific approach will directly impact the interpretations of such reports. Rather than government officials just knowing how many students scored 80-90% correct vs 90-100% correct, the results are framed in terms of how many students are truly proficient in the topic. This makes decisions from test scores – both at the individual and aggregate levels – much more defensible and informative. They become truly criterion-referenced. This is especially true when the scores are equated across years to account for differences in examinee distributions and test difficulty, and the standard can be demonstrated to be stable. For high-stakes examinations such as medical certification/licensure, admissions exams, and many more situations, this is absolutely critical.

Want to talk to an expert about implementing this for your exams? Contact us.

References

[AERA, APA, & NCME] (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bejar, I. I. (2008). Standard setting: What is it? Why is it important. R&D Connections, 7, 1-6. Retrieved from https://www.ets.org/Media/Research/pdf/RD_Connections7.pdf

Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2000). A comparison of Angoff and Bookmark standard setting methods. Paper presented at the Annual Meeting of the Mid-Western Educational Research Association, Chicago, IL: October 25-28, 2000.

Cizek, G., & Bunch, M. (2006). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests. Thousand Oaks, CA: Sage.

Cizek, G. J. (2007). Standard setting. In Steven M. Downing and Thomas M. Haladyna (Eds.) Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers, pp. 225-258.

Hambleton, R. K. (2013). Setting performance standards on educational assessments and criteria for evaluating the process. In Setting performance standards, pp. 103-130. Routledge. Retrieved from https://www.nciea.org/publications/SetStandards_Hambleton99.pdf

Karantonis, A., & Sireci, S. (2006). The Bookmark Standard‐Setting Method: A Literature Review. Educational Measurement Issues and Practice 25(1):4 – 12.

Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996, June). Standard setting: A Book-mark approach. In D. R. Green (Chair),IRT-based standard setting procedures utilizing behavioral anchoring. Symposium conducted at the Council of Chief State School Officers National Conference on Large-Scale Assessment, Phoenix, AZ.

October 8, 2021/by Laila Issayeva M.Sc.

What is an Assessment / Test Battery?

Nathan Thompson, PhDOctober 6, 2021

A test battery or assessment battery is a set multiple psychometrically-distinct exams delivered in one administration. In some cases, these are various tests that are cobbled together for related purposes, such as a psychologist testing a 8 year old child on their intelligence, anxiety, and autism spectrum. However, in many cases it is a single test title that we often refer to as a single test but is actually several separate tests, like a university admissions test that has English, Math, and Logical Reasoning components. Why do so? The key here is that we want to keep them psychometrically separate, but maximize the amount of information about the person to meet the purposes of the test.

Learn more about our powerful exam platform that allows you to easily develop and deliver test batteries.

Examples of a Test Battery

Test batteries are used in a variety of fields, pretty much anywhere assessment is done.

Admissions and Placement Testing

The classic example is a university admissions test that has English, Math, and Logic portions. These are separate tests, and psychometricians would calculate the reliability and other important statistics separately. However, the scores are combined at the end to get an overall picture of examinee aptitude or achievement, and use that to maximally predict 4-graduation rates and other important criterion variables.

Why is is called a battery? Because we are battering the poor student with not just one, but many exams!

Pre-Employment Testing

Exam batteries are often used in pre-employment testing. You might get tested on computer skills, numerical reasoning, and noncognitive traits such as integrity or conscientiousness. These are used together to gain incremental validity. A good example is the CAT-ASVAB, which is the selection test to get into the US Armed Forces. There are 10 tests (vocabulary, math, mechanical aptitude…).

Psychological or Psychoeducational Assessment

In a clinical setting, clinicians will often use a battery of tests, such as IQ, autism, anxiety, and depression. Some IQ tests themselves as a battery, as they might assess visual reasoning, logical reasoning, numerical reasoning, etc. However, these have a positive manifold, meaning that they correlate quite highly with each other. Another example is the Woodcock-Johnson.

K-12 Educational Assessment

Many large-scale tests that are used in schools are considered a battery, though often with only 2 or 3 aspects. A common one in the USA is the NWEA Measures of Academic Progress.

Composite Scores

A composite score is a combination of scores in a battery. If you took an admissions test like the SAT and GRE, you recall how it would add your scores on the different subtests, while the ACT test takes the average. The ASVAB takes a linear combination of the 4 most important subtests and uses them for admission; the others are used for job matching.

A Different Animal: Test with Sections

The battery is different than a single test that has distinct sections. For example, a K12 English test might have 10 vocab items, 10 sentence-completion grammar items, and 2 essays. Such tests are usually analyzed as a single test, as they are psychometrically unidimensional.

How to Deliver A Test Battery

In ASC’s platforms, Assess.ai and FastTest, all this functionality is available out of the box: test batteries, composite scores, and sections within a test. Moreover, they come with a lot of important functionality, such as separation of time limits, navigation controls, customizable score reporting, and more. Click here to request a free account and start applying best practices.

October 6, 2021/by Nathan Thompson, PhD

What is a Psychometric Test?

AdminOctober 1, 2021

Psychometric tests are assessments of people to measure psychological attributes such as personality or intelligence. Over the past century, psychometric tests have played an increasingly important part in revolutionizing how we approach important fields such as education, psychiatry, and recruitment. One of the main reasons why psychometric tests have become popular in corporate recruitment and education is their accuracy and objectivity.

However, getting the best out of psychometric tests requires one to have a concrete understanding of what they are, how they work, and why you need them. This article, therefore, aims to provide you with the fundamentals of psychometric testing, the benefits, and everything else you need to know.

Interested in talking to a psychometrician about test development and validation, or a demo of our powerful platform that empowers you to develop custom psychometric tests?

What is a psychometric test or assessment?

Psychometric tests and assessments are different from other types of tests in that they measure a person’s knowledge, abilities, interests, and other attributes. They focus on measuring “mental processes” rather than “objective facts.” Psychometric tests are used to determine suitability for employment, education, training, or placement, as well as the suitability of the person for specific situations.

A psychometric test or assessment is an evaluation of a candidate’s personality traits and cognitive abilities. They also help assess mental health status by screening the individual for potential mental disorders. In recruitment and job performance, companies use psychometric tests for reasons such as:

Make data-driven comparisons among candidates
Making leadership decisions
Reduce hiring bias and improve workforce diversification
Identify candidate strengths and weaknesses
Help complete candidate personas
Deciding management strategies

Psychometrics is different, and refers to a field of study associated with the theory and technique of psychoeducational measurement. It is not limited to the topic of recruitment and careers, but spans all assessments, from K-12 formative assessment to addiction inventories in medical clinics to university admissions to certification of nurses.

The different types of psychometric tests

The following are the main types of psychometric assessments:

Personality tests

Personality tests mainly help recruiters identify desirable personality traits that would make one fit for a certain role in a company. These tests contain a series of questions that measure and categorize important metrics such as leadership capabilities and candidate motivations as well as job-related traits such as integrity or conscientiousness. Some personality assessments seek to categorize people into relatively arbitrary “types” while some place people on a continuum of various traits.

‘Type focused’ Personality tests

Some examples of popular psychometric tests that use type theory include the Myers-Briggs Type Indicator (MBTI) and the DISC profile. Personality types are of limited usefulness in recruitment because they lack objectivity and reliability in determining important metrics that can predict the success of certain candidates in a specific role, as well as having more limited scientific backing. They are, to a large extent, Pop Psychology.

‘Trait-focused’ personality types

Personality assessments based on trait theory on the other hand tend to mainly rely on the OCEAN model, like the NEO-PI-R. These psychometric assessments determine the intensity of five traits; openness, conscientiousness, extraversion, agreeableness, and neuroticism, using a series of questions and exercises. Psychometric assessments based on this model provide more insight into the ability of candidates to perform in a certain role, compared to type-focused assessments.

Cognitive Ability and Aptitude Tests

Cognitive ability tests, also known as intelligence tests or aptitude, measure a person’s latent/unlearned cognitive skills and attributes. Common examples of this are logical reasoning, numerical reasoning, and mechanical reasoning. It is important to stress that these are generally unlearned, as opposed to achievement tests.

Job Knowledge and Achievement tests

These psychometric tests are designed to assess what people have learned. For example, if you are applying for a job as an accountant, you might be given a numerical reasoning or logical reasoning test, and a test in the use of Microsoft Excel. The former is aptitude, while the latter is job knowledge or achievement. (Though there is certainly some learning involved with basic math skills).

What are the importance and benefits of psychometric tests?

Psychometric tests have been proven to be effective in domains such as recruitment and education. In recruitment, psychometric tests have been integrated into pre-employment assessment software because of their effectiveness in the hiring process. Here are several ways psychometric tests are beneficial in corporate environments, along with Learning and Development (L & D):

Cost and Time Efficiency — Psychometric tests save organizations a lot of resources because they help eliminate the guesswork in hiring processes. Psychometric tests help employers go through thousands of resumes to find the perfect candidates.

Cultural fulfillment — In the modern business world, culture is a great determinant of success. Through psychometric tests, employees can predict the types of candidates that can fit into their company culture.

Standardization — Traditional hiring processes have a lot of hiring bias cases. However, psychometric tests can level the playing ground and give a chance for the best candidates to get what they deserve.

Effectiveness — Psychometric tests have been scientifically proven to play a critical role in hiring the best talent. This is mainly because they can spot important attributes that can’t be spotted by traditional hiring processes.

In L&D, psychometric tests can help organizations generate important insights such as learning abilities, candidate strengths and weaknesses, and learning strategy effectiveness. This can help re-write the learning strategies, for improved ROI.

What makes a good psychometric test?

As with all tests, you need reliability and validity. In the case of pre-employment testing, the validity is usually one of two things:

Content validity via job-relatedness; if the job requires several hours per day of Microsoft Excel, then a test on Microsoft Excel makes sense
Predictive validity: numerical reasoning but not be as overtly related to the job as Microsoft Excel, but if you can show that it predicts job performance, then this is helpful. This is especially true for noncognitive assessments like conscientiousness.

Conclusion

There is no doubt that psychometric tests are important in essential aspects of life such as recruitment and education. Not only do they help us understand people, but also simplify the hiring process. However, psychometric tests should be used with caution. It’s advisable to develop a concrete strategy on how you are going to integrate them into your operation mechanism.

Ready To Start Developing Your Own Psychometric Tests?

ASC’s comprehensive platform provides you with all the tools necessary to develop and securely deliver psychometric assessments. It is equipped with powerful psychometric software, online essay marking modules, advanced reporting, tech-enhanced items, and so much more! You also have access to the world’s greatest psychometricians to help you out if you get stuck in the process!

October 1, 2021/by Admin

Three Approaches for IRT Equating

Nathan Thompson, PhDSeptember 21, 2021

If you are delivering high-stakes tests in linear forms – or piloting a bank for CAT/LOFT – you are faced with the issue of how to equate the forms together. That is, how can we defensibly translate a score on Form A to a score on Form B? While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. There are a number of ways to approach this issue, and IRT equating is the strongest.

Why do we need equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale. Suppose you take Form A and get s score of 72/100 while your friend takes Form B and gets a score of 74/100. Is your friend smarter than you, or did his form happen to have easier questions? Well, if the test designers built-in some overlap, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items. Both forms are each delivered to a large, representative sample. Here are the results.

Form	Mean score on 50 overlap items	Mean score on 100 total items
A	30	72
B	30	74

Because the mean score on the anchor items was higher, we then think that the Form B group was a little smarter, which led to a higher total score.

Now suppose these are the results:

Form	Mean score on 50 overlap items	Mean score on 100 total items
A	32	72
B	32	74

Now, we have evidence that the groups are of equal ability. The higher total score on Form B must then be because the unique items on that form are a bit easier.

How do I calculate an equating?

You can equate forms with classical test theory (CTT) or item response theory (IRT). However, one of the reasons that IRT was invented was that equating with CTT was very weak. CTT methods include Tucker, Levine, and equipercentile. Right now, though, let’s focus on IRT.

IRT equating

There are three general approaches to IRT equating. All of them can be accomplished with our industry-leading software Xcalibre, though conversion equating requires an additional software called IRTEQ.

Conversion
Concurrent Calibration
Fixed Anchor Calibration

Conversion

With this approach, you need to calibrate each form of your test using IRT, completely separately. We then evaluate the relationship between IRT parameters on each form and use that to estimate the relationship to convert examinee scores. Theoretically what you do is line up the IRT parameters of the common items and perform a linear regression, so you can then apply that linear conversion to scores.

But DO NOT just do a regular linear regression. There are specific methods you must use, including mean/mean, mean/sigma, Stocking & Lord, and Haebara. Fortunately, you don’t have to figure out all the calculations yourself, as there is free software available to do it for you: IRTEQ.

Concurrent Calibration

The second approach is to combine the datasets into what is known as a sparse matrix. You then run this single data set through the IRT calibration, and it will place all items and examinees onto a common scale. The concept of a sparse matrix is typically represented by the figure below, representing the non-equivalent anchor test (NEAT) design approach.

The IRT calibration software will automatically equate the two forms and you can use the resultant scores.

Fixed Anchor Calibration

The third approach is a combination of the two above; it utilizes the separate calibration concept but still uses the IRT calibration process to perform the equating rather than separate software.

With this approach, you would first calibrate your data for Form A. You then find all the IRT item parameters for the common items and input them into your IRT calibration software when you calibrate Form B.

You can tell the software to “fix” the item parameters so that those particular ones (from the common items) do not change. Then all the item parameters for the unique items are forced onto the scale of the common items, which of course is the underlying scale from Form A. This then also forces the scores from the Form B students onto the Form A scale.

How do these IRT equating approaches compare to each other?

Concurrent calibration is arguably the easiest but has the drawback that it merges the scales of each form into a new scale somewhere in the middle. If you need to report the scores on either form on the original scale, then you must use the Conversion or Fixed Anchor approaches. This situation commonly happens if you are equating across time periods.

Suppose you delivered Form A last year and are now trying to equate Form B. You can’t just create a new scale and thereby nullify all the scores you reported last year. You must map Form B onto Form A so that this year’s scores are reported on last year’s scale and everyone’s scores will be consistent.

Where do I go from here?

If you want to do IRT equating, you need IRT calibration software. All three approaches use it. I highly recommend Xcalibre since it is easy to use and automatically creates reports in Word for you. If you want to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014). There are other resources more readily available on the internet, like this free handbook from CCSSO. If you would like to learn more about IRT, I recommend the books by de Ayala (2008) and Embretson & Reise (2000). An intro is available in our blog post.

September 21, 2021/by Nathan Thompson, PhD