Posts on psychometrics: The Science of Assessment


Psychometric tests are assessments designed to measure psychological attributes such as personality or intelligence. Over the past century, they have played an increasingly important role in fields such as education, psychiatry, and recruitment. One of the main reasons psychometric tests have become popular in corporate recruitment and education is their accuracy and objectivity.

However, getting the best out of psychometric tests requires a concrete understanding of what they are, how they work, and why you need them. This article, therefore, aims to provide you with the fundamentals of psychometric testing, its benefits, and everything else you need to know.

Interested in talking to a psychometrician about test development and validation, or a demo of our powerful platform that empowers you to develop custom psychometric tests?


What is a psychometric test or assessment?

Psychometric tests and assessments are different from other types of tests in that they measure a person’s knowledge, abilities, interests, and other attributes. They focus on measuring “mental processes” rather than “objective facts.” Psychometric tests are used to determine suitability for employment, education, training, or placement, as well as the suitability of the person for specific situations.

A psychometric test or assessment is an evaluation of a candidate’s personality traits and cognitive abilities. They also help assess mental health status by screening the individual for potential mental disorders. In recruitment and job performance, companies use psychometric tests for reasons such as:

  • Make data-driven comparisons among candidates
  • Make leadership decisions
  • Reduce hiring bias and improve workforce diversity
  • Identify candidate strengths and weaknesses
  • Complete candidate personas
  • Decide on management strategies

Psychometrics, by contrast, refers to the field of study concerned with the theory and techniques of psychoeducational measurement.  It is not limited to recruitment and careers, but spans all assessments, from K-12 formative assessment to addiction inventories in medical clinics to university admissions to certification of nurses.

The different types of psychometric tests

The following are the main types of psychometric assessments:

Personality tests

Personality tests mainly help recruiters identify desirable personality traits that would make one fit for a certain role in a company. These tests contain a series of questions that measure and categorize important metrics such as leadership capabilities and candidate motivations, as well as job-related traits such as integrity or conscientiousness. Some personality assessments seek to categorize people into relatively arbitrary “types,” while others place people on a continuum of various traits.

‘Type-focused’ personality tests

Some examples of popular psychometric tests that use type theory include the Myers-Briggs Type Indicator (MBTI) and the DISC profile.  Personality types are of limited usefulness in recruitment because they lack the objectivity and reliability needed to predict the success of candidates in a specific role, and they have more limited scientific backing. They are, to a large extent, pop psychology.

‘Trait-focused’ personality tests

Personality assessments based on trait theory, on the other hand, tend to rely mainly on the OCEAN model, like the NEO-PI-R. These psychometric assessments determine the intensity of five traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism, using a series of questions and exercises. Psychometric assessments based on this model provide more insight into the ability of candidates to perform in a certain role, compared to type-focused assessments.

Cognitive Ability and Aptitude Tests

Cognitive ability tests, also known as intelligence or aptitude tests, measure a person’s latent/unlearned cognitive skills and attributes.  Common examples are logical reasoning, numerical reasoning, and mechanical reasoning.  It is important to stress that these are generally unlearned, as opposed to achievement tests.

Job Knowledge and Achievement tests

These psychometric tests are designed to assess what people have learned.  For example, if you are applying for a job as an accountant, you might be given a numerical reasoning or logical reasoning test, and a test in the use of Microsoft Excel.  The former is aptitude, while the latter is job knowledge or achievement. (Though there is certainly some learning involved with basic math skills).

What are the benefits of psychometric tests?

Psychometric tests have been proven to be effective in domains such as recruitment and education. In recruitment, psychometric tests have been integrated into pre-employment assessment software because of their effectiveness in the hiring process. Here are several ways psychometric tests are beneficial in corporate environments, along with Learning and Development (L&D):

Cost and time efficiency — Psychometric tests save organizations a lot of resources because they help eliminate the guesswork in hiring processes. They help employers sift through thousands of resumes to find the right candidates.

Cultural fit — In the modern business world, culture is a great determinant of success. Through psychometric tests, employers can predict which types of candidates will fit into their company culture.

Standardization — Traditional hiring processes are prone to hiring bias. Psychometric tests can level the playing field and give the best candidates a fair chance.

Effectiveness — Psychometric tests have been shown to play a critical role in hiring the best talent, mainly because they can detect important attributes that traditional hiring processes miss.

In L&D, psychometric tests can help organizations generate important insights such as learning abilities, candidate strengths and weaknesses, and learning strategy effectiveness. This can help refine learning strategies for improved ROI.

What makes a good psychometric test?

As with all tests, you need reliability and validity.  In the case of pre-employment testing, the validity is usually one of two things:

  1. Content validity via job-relatedness; if the job requires several hours per day of Microsoft Excel, then a test on Microsoft Excel makes sense
  2. Predictive validity: numerical reasoning might not be as overtly related to the job as Microsoft Excel, but if you can show that it predicts job performance, then it is useful.  This is especially true for noncognitive assessments like conscientiousness.

Conclusion

There is no doubt that psychometric tests are important in essential aspects of life such as recruitment and education. Not only do they help us understand people, but they also simplify the hiring process. However, psychometric tests should be used with caution. It’s advisable to develop a concrete strategy for how you are going to integrate them into your operations.

Ready To Start Developing Your Own Psychometric Tests? 

ASC’s comprehensive platform provides you with all the tools necessary to develop and securely deliver psychometric assessments. It is equipped with powerful psychometric software, online essay marking modules, advanced reporting, tech-enhanced items, and so much more! You also have access to the world’s greatest psychometricians to help you out if you get stuck in the process!


If you are delivering high-stakes tests in linear forms – or piloting a bank for CAT/LOFT – you are faced with the issue of how to equate the forms together.  That is, how can we defensibly translate a score on Form A to a score on Form B?  While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. There are a number of ways to approach this issue, and IRT equating is the strongest.

Why do we need equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale.  Suppose you take Form A and get a score of 72/100 while your friend takes Form B and gets a score of 74/100.  Is your friend smarter than you, or did his form happen to have easier questions?  Well, if the test designers built in some overlap, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items.  Both forms are each delivered to a large, representative sample. Here are the results.

Form   Mean score on 50 overlap items   Mean score on 100 total items
A      30                               72
B      32                               74

Because the mean score on the anchor items was higher for the Form B group, we conclude that the Form B group was a little more able, which led to the higher total score.

Now suppose these are the results:

Form   Mean score on 50 overlap items   Mean score on 100 total items
A      32                               72
B      32                               74

Now, we have evidence that the groups are of equal ability.  The higher total score on Form B must then be because the unique items on that form are a bit easier.

How do I calculate an equating?

You can equate forms with classical test theory (CTT) or item response theory (IRT).  However, one of the reasons that IRT was invented was that equating with CTT was very weak.  CTT methods include Tucker, Levine, and equipercentile.  Right now, though, let’s focus on IRT.

IRT equating

There are three general approaches to IRT equating.  All of them can be accomplished with our industry-leading software Xcalibre, though conversion equating requires additional software called IRTEQ.

  1. Conversion
  2. Concurrent Calibration
  3. Fixed Anchor Calibration

Conversion

With this approach, you need to calibrate each form of your test using IRT, completely separately.  We then evaluate the relationship between IRT parameters on each form and use that to estimate the relationship to convert examinee scores.  Theoretically what you do is line up the IRT parameters of the common items and perform a linear regression, so you can then apply that linear conversion to scores.

But DO NOT just do a regular linear regression.  There are specific methods you must use, including mean/mean, mean/sigma, Stocking & Lord, and Haebara.  Fortunately, you don’t have to figure out all the calculations yourself, as there is free software available to do it for you:  IRTEQ.
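To make the idea concrete, here is a minimal sketch of the mean/sigma method in Python, assuming you already have b-parameters for the common items from two separate calibrations (the values below are hypothetical).  The Stocking & Lord and Haebara methods implemented in IRTEQ are more sophisticated, since they work with the full item characteristic curves rather than just the means and standard deviations of the difficulties.

```python
import statistics

# Hypothetical b-parameters for the same five anchor items,
# calibrated separately on Form A and Form B
b_form_a = [-1.2, -0.4, 0.1, 0.8, 1.5]
b_form_b = [-1.0, -0.1, 0.4, 1.1, 1.9]

# Mean/sigma linking: find the linear transformation theta_A = A * theta_B + B
A = statistics.stdev(b_form_a) / statistics.stdev(b_form_b)    # slope
B = statistics.mean(b_form_a) - A * statistics.mean(b_form_b)  # intercept

def to_form_a_scale(value_on_b_scale):
    """Convert a Form B theta (or b-parameter) onto the Form A scale."""
    return A * value_on_b_scale + B

print(round(A, 3), round(B, 3), round(to_form_a_scale(0.0), 3))
```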

Concurrent Calibration

The second approach is to combine the datasets into what is known as a sparse matrix.  You then run this single data set through the IRT calibration, and it will place all items and examinees onto a common scale.  The concept of a sparse matrix is typically represented by the figure below, representing the non-equivalent anchor test (NEAT) design approach.

The IRT calibration software will automatically equate the two forms and you can use the resultant scores.
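As a rough illustration of what that sparse matrix looks like, here is a small sketch using pandas; the item names, examinee IDs, and responses are hypothetical.  Items an examinee never saw are simply left missing, and the calibration software treats them as not administered.

```python
import pandas as pd

# Form A covers items 1-4, Form B covers items 3-6; items 3-4 are the anchors
form_a = pd.DataFrame(
    {"item1": [1, 0], "item2": [1, 1], "item3": [0, 1], "item4": [1, 0]},
    index=["examinee_A1", "examinee_A2"],
)
form_b = pd.DataFrame(
    {"item3": [1, 1], "item4": [0, 1], "item5": [1, 0], "item6": [0, 0]},
    index=["examinee_B1", "examinee_B2"],
)

# Stacking the two datasets produces the NEAT-style sparse matrix:
# the union of all items, with NaN wherever an examinee did not see an item
sparse_matrix = pd.concat([form_a, form_b], axis=0)
print(sparse_matrix)
```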

Fixed Anchor Calibration

The third approach is a combination of the two above; it utilizes the separate calibration concept but still uses the IRT calibration process to perform the equating rather than separate software.

With this approach, you would first calibrate your data for Form A.  You then find all the IRT item parameters for the common items and input them into your IRT calibration software when you calibrate Form B.

You can tell the software to “fix” the item parameters so that those particular ones (from the common items) do not change.  Then all the item parameters for the unique items are forced onto the scale of the common items, which of course is the underlying scale from Form A.  This then also forces the scores from the Form B students onto the Form A scale.

How do these IRT equating approaches compare to each other?

Concurrent calibration is arguably the easiest but has the drawback that it merges the scales of each form into a new scale somewhere in the middle.  If you need to report the scores on either form on the original scale, then you must use the Conversion or Fixed Anchor approaches.  This situation commonly happens if you are equating across time periods.

Suppose you delivered Form A last year and are now trying to equate Form B.  You can’t just create a new scale and thereby nullify all the scores you reported last year.  You must map Form B onto Form A so that this year’s scores are reported on last year’s scale and everyone’s scores will be consistent.

Where do I go from here?

If you want to do IRT equating, you need IRT calibration software.  All three approaches use it.  I highly recommend  Xcalibre  since it is easy to use and automatically creates reports in Word for you.  If you want to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014).  There are other resources more readily available on the internet, like this free handbook from CCSSO.  If you would like to learn more about IRT, I recommend the books by de Ayala (2008) and Embretson & Reise (2000).  An intro is available in our blog post.


The Rasch model, also known as the one-parameter logistic model, was developed by Danish mathematician Georg Rasch and published in 1960.  Over the ensuing years it has attracted many educational measurement specialists and psychometricians because of its simplicity and ease of computational implementation.  Indeed, since it predated the availability of computers for test development work, it was capable of being implemented using simple calculating equipment available at that time.  The original model, developed for achievement and ability tests scored correct or incorrect, or for dichotomously scored personality or attitude scales, has since been extended into a family of models that work with polytomously scored data, rating scales, and a number of other measurement applications. The majority of those models maintain the simplicity of the original model proposed by Rasch.

In 1960, the Rasch model represented a step forward from classical test theory that had then been the major source of methods for test and scale development, and measuring individual differences, since the early 1900s.  The model was accepted by some educational measurement specialists because of its simplicity, its relative ease of implementation, and most importantly because it maintained the use of the familiar number-correct score to quantify an individual’s performance on a test.  Beyond serving as a bridge from classical test methods, the Rasch model is notable as the first formal statement of what, about ten years later, would be known as item characteristic curve theory or item response theory (IRT).  As an IRT model, the Rasch model placed persons and items on the same scale, and introduced concepts such as item information, test information, conditional standard error of measurement, maximum likelihood estimation, and model fit. Furthermore, the evolution of IRT has given rise to multidimensional item response theory (MIRT), which extends these concepts to accommodate multiple latent traits.

What is the Rasch Model?

As an IRT model, the dichotomous Rasch model can be expressed as an equation,

P(u_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}

This equation defines the item response function (or item characteristic curve) for a single test item.  It states that the probability (P) of a correct (or keyed) response (uij = 1) to an item (i), given the trait level (θj) of a person (j) and the difficulty of the item (bi), is an exponential function of the difference between the person’s trait level and the item’s difficulty.  If the difference between those two terms is zero, the probability is 0.50.  If the person is “above” the item (the item is easy for the person) the probability is greater than 0.50; if the person is “below” the item (the item is difficult for the person) the probability will be less than 0.50.  The probability of a correct response, therefore, varies with the distance between the person and the item. This characteristic of the Rasch model is found, with some important modifications, in all later IRT models.
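As a quick sketch of this item response function in code (using θ for the person and b for the item, with hypothetical values):

```python
import math

def rasch_probability(theta, b):
    """P(correct) under the dichotomous Rasch model: exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_probability(0.0, 0.0))    # person exactly at the item's difficulty -> 0.50
print(rasch_probability(1.0, 0.0))    # person "above" the item -> about 0.73
print(rasch_probability(-1.0, 0.0))   # person "below" the item -> about 0.27
```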

Assumptions of the Rasch Model

Based on the above equation, the Rasch model can be seen to make some strong assumptions.  It obviously assumes that the only characteristic of test items that affects a person’s response to the item is its difficulty.  But anyone who has ever done a classical item analysis has observed that items also vary in their discriminations—frequently markedly so.  Yet the Rasch model ignores that information and assumes that all test items have a constant discrimination.  Also, when used with multiple-choice items (or worse, true-false or other items with only two alternatives) guessing can also affect how examinees answer test items—yet the Rasch model also ignores guessing.

Because the model assumes that all items have the same discrimination, it allows examinees to be scored by number correct. But as with classical number-correct scoring, this results in a substantial loss of capability to reflect individual differences among examinees.  For example, a 6-item number-correct scored test (too short to be useful) can make seven distinctions among a group of examinees, regardless of group size, whereas a more advanced IRT model can make 64 distinctions among those examinees by taking the response pattern into account (i.e., which questions were answered correctly and which incorrectly); a 20-item test scored by number-correct results in 21 possible scores—again regardless of group size—whereas non-Rasch IRT scoring will result in 1,048,576 potentially different scores.

Perspectives on the Model

Although the Rasch model is the simplest case of more advanced IRT models, it incorporates a fundamental difference from its more realistic expansions—the two-, three- and four-parameter logistic (or normal ogive) models for dichotomously scored items.  Applications of the Rasch model assume that the model is correct and that observed data must fit the model. Thus, items whose discriminations are not consistent with the model are eliminated in the construction of a measuring instrument.  Similarly, in estimating latent trait scores for examinees, examinees whose responses do not fit the model are eliminated from the calibration data analysis. By contrast, the more advanced and flexible IRT models fit the model to the data.  Although they evaluate model fit for each item (and similarly can evaluate it for each person) the model fit to a given dataset (whether it has two, three, or four item parameters) is the model that best fits that data.  This philosophical—and operational—difference between the Rasch and the other IRT models has important practical implications for the outcome of the test development process.

The Rasch Model and other IRT Models

Although the Rasch model was an advancement in psychometrics in 1960, over the last 60 years it has been replaced by more general models that allow test items to vary in discrimination and guessing, and even add a fourth parameter (an upper asymptote).  With the development of powerful computing capabilities, IRT has given rise to a wide-ranging family of models that function flexibly with ability, achievement, personality, attitude, and other educational and psychological variables.  These IRT models are easily implemented with a variety of readily available software packages, and are based on models that can be fit to unidimensional or multidimensional datasets, model response times, and in many respects vastly improve the development of measuring instruments and measurement of individual differences.  Given these advanced IRT models, the Rasch model can best be viewed as an early historical footnote in the history of modern psychometrics.

Implementing the Rasch Model

You will need specialized software.  The most common is WINSTEPS.  You might also be interested in  Xcalibre  (download trial version for free).

Paper-and-pencil testing used to be the only way to deliver assessments at scale.  The introduction of computer-based testing (CBT) in the 1980s was a revelation – higher fidelity item types, immediate scoring & feedback, and scalability all changed with the advent of the personal computer and then later the internet.  Delivery mechanisms including remote proctoring provided students with the ability to take their exams anywhere in the world.  This all exploded tenfold when the pandemic arrived.  So why are some exams still offline, with paper and pencil?

Many education institutions are confused about which examination model to adopt.  Should you continue with the online model you used when everyone was stuck at home?  Should you adopt multi-modal examination models, or should you go back to the traditional pen-and-paper method?

This blog post will provide you with an evaluation of whether paper-and-pencil exams are still worth it in 2021. 

 

Paper-and-pencil testing: the good, the bad, and the ugly

The Good

Offline exams have been a stepping stone toward the development of modern assessment models that are more effective. We can’t ignore the fact that there are several advantages of traditional exams.

Some advantages of paper-and-pencil testing include students’ familiarity with the format, the development of a social connection between learners, freedom from technical glitches, and affordability. Some schools don’t have the resources, and pen-and-paper assessments are the only option available.

This is especially true in areas of the world that do not have the internet bandwidth or other technology necessary to deliver internet-based testing.

Another advantage of paper exams is that they can often work better for students with special needs, such as blind students who need a reader.

Paper-and-pencil testing is often more cost-efficient in certain situations where the organization does not have access to a professional assessment platform or learning management system.

 

The Bad and The Ugly

However, paper-and-pencil testing does have a number of shortfalls.

1. Needs a lot of resources to scale

Delivery of paper-and-pencil testing at large scale requires a lot of resources. You are printing and shipping, sometimes with hundreds of trucks around the country.  Then you need to get all the exams back, which is even more of a logistical lift.

2. Prone to cheating

Most people think that offline exams are cheat-proof, but that is not the case. Offline exams count on invigilators and supervisors to make sure that cheating does not occur, yet many pen-and-paper assessments are open to leaks. A high candidate-to-invigilator ratio is another factor that contributes to cheating in offline exams.

3. Poor student engagement

We live in a world of instant gratification, and the same is true of assessments. Unlike online exams, which have options to keep students engaged, offline exams are open to constant distraction from external factors.

Offline exams also have few options when it comes to question types. 

4. Time to score

“To err is human,” but when it comes to assessments, accuracy and consistency are essential. Traditional methods of hand-scoring paper tests are slow and labor-intensive, and instructors take a long time to evaluate tests. This delay defeats much of the purpose of assessment.

5. Poor result analysis

Pen-and-paper exams depend on instructors to analyze the results and produce insights. This requires a lot of human resources and expensive software. It is also difficult to find out whether your learning strategy is working or needs adjustment.

6. Time to release results

Online exams can be immediate.  If you ship paper exams back to a single location, score them, perform psychometrics, then mail out paper result letters?  Weeks.

7. Slow availability of results to analyze

Similarly, psychometricians and other stakeholders do not have immediate access to results.  This delays psychometric analysis, timely feedback to students and teachers, and other downstream work.

8. Accessibility

Online exams can be built with tools for zooming, color contrast changes, automated text-to-speech, and other features that support accessibility.

9. Convenience


Online tests are much more easily distributed.  If you publish one on the cloud, it can immediately be taken, anywhere in the world.

10. Support for diversified question types

Unlike traditional exams which are limited to a certain number of question types, online exams offer many question types.  Videos, audio, drag and drop, high-fidelity simulations, gamification, and much more are possible.

11. Lack of modern psychometrics

Paper exams cannot use computerized adaptive testing, linear-on-the-fly testing, process data, computational psychometrics, and other modern innovations.

12. Environmental friendliness

Sustainability is an important aspect of modern civilization.  Online exams eliminate the need to use resources that are not environmentally friendly such as paper. 

 

Conclusion

Is paper-and-pencil testing still useful?  In most situations, it is not.  The disadvantages outweigh the advantages.  However, there are many situations where paper remains the only option, such as poor tech infrastructure.

How ASC Can Help 

Transitioning from paper-and-pencil testing to the cloud is not a simple task.  That is why ASC is here to help you every step of the way, from test development to delivery.  We provide you with the best assessment software and access to the most experienced team of psychometricians.  Ready to take your assessments online?  Contact us!

 


Linear on the fly testing (LOFT) is an approach to assessment delivery that increases test security by limiting item exposure. It tries to balance the advantages of linear testing (e.g., everyone sees the same number of items, which feels fairer) with the advantages of algorithmic exams (e.g., creating a unique test for everyone).

In general, there are two families of test delivery.  Static approaches deliver the same test form or forms to everyone; this is the ubiquitous and traditional “linear” method of testing.  Algorithmic approaches deliver the test to each examinee based on a computer algorithm; this includes LOFT, computerized adaptive testing (CAT), and multistage testing (MST).

What is LOFT?

The purpose of linear on the fly testing is to give every examinee a linear form that is uniquely created for them – but each one is created to be psychometrically equivalent to all others to ensure fairness.  For example, we might have a pool of 200 items, and every person only gets 100, but that 100 is balanced for each person.  This can be done by ensuring content and/or statistical equivalency, as well as ancillary metadata such as item type or cognitive level.

Content Equivalence

This portion is relatively straightforward.  If your test blueprint calls for 20 items in each of 5 domains, for a total of 100 items, then each form administered to examinees should follow this blueprint.  Sometimes the content blueprint might go 2 or even 3 levels deep.

Statistical Equivalence

There are, of course, two predominant psychometric paradigms: classical test theory (CTT) and item response theory (IRT).  With CTT, forms can easily be built to have an equivalent P value, and therefore expected mean score.  If point-biserial statistics are available for each item, you can also design the algorithm to assemble forms that have the same standard deviation and reliability.

With item response theory, the typical approach is to design forms to have the same test information function, or inversely, conditional standard error of measurement function.  To learn more about how these are implemented, read this blog post about IRT or download our Classical Form Assembly Tool.
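To illustrate the idea, here is a minimal sketch of a CTT-style LOFT assembly: items are drawn per domain at random (so each examinee gets a unique form), then swapped until the form’s mean P value is close to a target.  The item pool, domains, and target below are hypothetical, and production engines use far more sophisticated optimization, such as matching a full test information function.

```python
import random

def assemble_form(pool, per_domain, target_p, n_passes=50):
    """Greedy sketch: draw items per domain, then swap items whenever a swap
    moves the form's mean P value closer to the target."""
    form = []
    for domain, n in per_domain.items():
        candidates = [item for item in pool if item["domain"] == domain]
        random.shuffle(candidates)            # randomization -> a unique form for each examinee
        chosen, spares = candidates[:n], candidates[n:]
        for _ in range(n_passes):
            mean_p = sum(item["p"] for item in chosen) / n
            best = None
            for i, out_item in enumerate(chosen):
                for j, in_item in enumerate(spares):
                    new_mean = mean_p + (in_item["p"] - out_item["p"]) / n
                    if abs(new_mean - target_p) < abs(mean_p - target_p):
                        if best is None or abs(new_mean - target_p) < best[0]:
                            best = (abs(new_mean - target_p), i, j)
            if best is None:
                break                          # no swap improves the form; stop
            _, i, j = best
            chosen[i], spares[j] = spares[j], chosen[i]
        form.extend(chosen)
    return form

# Hypothetical pool: 40 items split across two domains, each with a classical P value
pool = [{"id": k, "domain": domain, "p": round(random.uniform(0.30, 0.90), 2)}
        for k, domain in enumerate(["algebra", "geometry"] * 20)]

form = assemble_form(pool, per_domain={"algebra": 5, "geometry": 5}, target_p=0.65)
print(sorted(item["id"] for item in form),
      round(sum(item["p"] for item in form) / len(form), 3))
```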

Implementing LOFT

LOFT is typically implemented by publishing a pool of items with an algorithm to select subsets that meet the requirements.  Therefore, you need a psychometrically sophisticated testing engine that stores the necessary statistics and item metadata, lets you define a pool of items and specify the relevant options such as target statistics and blueprints, and delivers the test in a secure manner.  Very few testing platforms can implement a quality LOFT assessment.  ASC’s platform does; click here to request a demo.

Benefits of Using LOFT in Testing

It certainly is not easy to build a strong item bank, design LOFT pools, and develop a complex algorithm that meets the content and statistical balancing needs.  So why would an organization use linear on the fly testing?

Well, it is much more secure than having a few linear forms.  Since everyone receives a unique form, it is impossible for word to get out about what the first questions on the test are.  And of course, we could simply perform a random selection of 100 items from a pool of 200, but that would be potentially unfair.  Using LOFT will ensure the test remains fair and defensible.


The two terms Norm-Referenced and Criterion-Referenced are commonly used to describe tests, exams, and assessments.  They are often some of the first concepts learned when studying assessment and psychometrics. Norm-referenced means that we are referencing how your score compares to other people.  Criterion-referenced means that we are referencing how your score compares to a criterion such as a cutscore or a body of knowledge. Test scaling is integral to both types of assessments, as it involves adjusting scores to facilitate meaningful comparisons.

Do we say a test is “Norm-Referenced” vs. “Criterion-Referenced”?


Actually, that’s a slight misuse.

The terms Norm-Referenced and Criterion-Referenced refer to score interpretations.  Most tests can actually be interpreted in both ways, though they are usually designed and validated for only one or the other.  More on that later.

Hence the shorthand of saying “this is a norm-referenced test,” even though that just means norm-referencing is the primary intended interpretation.

Examples of Norm-Referenced vs. Criterion-Referenced

Suppose you received a score of 90% on a Math exam in school.  This could be interpreted in both ways.  If the cutscore was 80%, you clearly passed; that is the criterion-referenced interpretation.  If the average score was 75%, then you performed at the top of the class; this is the norm-referenced interpretation.  Same test, both interpretations are possible.  And in this case, both are valid interpretations.

What if the average score was 95%?  Well, that changes your norm-referenced interpretation (you are now below average) but the criterion-referenced interpretation does not change.

Now consider a certification exam.  This is an example of a test that is specifically designed to be criterion-referenced.  It is supposed to measure that you have the knowledge and skills to practice in your profession.  It doesn’t matter whether all candidates pass or only a few candidates pass; the cutscore is the cutscore.

However, you could interpret your score by looking at your percentile rank compared to other examinees; it just doesn’t impact the cutscore.

On the other hand, we have an IQ test.  There is no criterion-referenced cutscore of whether you are “smart” or “passed.”  Instead, the scores are located on the standard normal curve (mean=100, SD=15), and all interpretations are norm-referenced.  Namely, where do you stand compared to others?  The scales of the T score and z-score are norm-referenced, as are Percentiles.  So are many tests in the world, like the SAT with a mean of 500 and SD of 100.
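A small sketch of how these norm-referenced scales relate to one another, assuming an approximately normal score distribution (the raw score, mean, and SD below are hypothetical):

```python
from statistics import NormalDist

def norm_referenced_report(raw, mean, sd):
    """Convert a raw score to z, T (50/10), deviation-IQ style (100/15),
    SAT-style (500/100), and percentile rank, assuming normality."""
    z = (raw - mean) / sd
    return {
        "z": round(z, 2),
        "T": round(50 + 10 * z, 1),
        "IQ-style": round(100 + 15 * z),
        "SAT-style": round(500 + 100 * z),
        "percentile": round(100 * NormalDist().cdf(z), 1),
    }

print(norm_referenced_report(raw=72, mean=75, sd=10))  # hypothetical class statistics
```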

Is this impacted by item response theory (IRT)?

If you have looked at item response theory (IRT), you know that it scores examinees on what is effectively the standard normal curve (though this is shifted if Rasch).  But IRT-scored exams can still be criterion-referenced.  They can still be designed to measure a specific body of knowledge and have a cutscore that is fixed and stable over time.

Even computerized adaptive testing can be used like this.  An example is the NCLEX exam for nurses in the United States.  It is an adaptive test, but the cutscore is -0.18 (NCLEX-PN on Rasch scale) and it is most definitely criterion-referenced.

Building and validating an exam

The process of developing a high-quality assessment is surprisingly difficult and time-consuming. The greater the stakes, volume, and incentives for stakeholders, the more effort that goes into developing and validating.  ASC’s expert consultants can help you navigate these rough waters.

Want to develop smarter, stronger exams?

Contact us to request a free account in our world-class platform, or talk to one of our psychometric experts.

 


The item-total point-biserial correlation is a common psychometric index regarding the quality of a test item, namely how well it differentiates between examinees with high vs low ability.

What is item discrimination?

While the word “discrimination” has a negative connotation, it is actually a really good thing for an item to have.  It means that it is differentiating between examinees, which is entirely the reason that an assessment item exists.  If a math item on Fractions is good, then students with good knowledge of fractions will tend to get it correct, while students with poor knowledge will get it wrong.  If this isn’t the case, and the item is essentially producing random data, then it has no discrimination.  If the reverse is the case, then the discrimination will be negative.  This is a total red flag; it means that good students are getting the item wrong and poor students are getting it right, which almost always means that there is incorrect content or the item is miskeyed.

What is the point-biserial correlation?

The point-biserial coefficient is a Pearson correlation between scores on the item (usually 0=wrong and 1=correct) and the total score on the test.  As such, it is sometimes called an item-total correlation.

Consider the example below.  There are 10 examinees that got the item wrong, and 10 that got it correct.  The scores are definitely higher for the Correct group.  If you fit a regression line, it would have a positive slope.  If you calculated a correlation, it would be around 0.10.


How do you calculate the point-biserial?

Since it is a Pearson correlation, you can easily calculate it with the CORREL function in Excel or similar software.  Of course, psychometric software like Iteman will also do it for you, and many more important things besides (e.g., the point-biserial for each of the incorrect options!).  This is an important step in item analysis.  The image below is example output from Iteman, where Rpbis is the point-biserial.  This item is very good, as it has a very high point-biserial for the correct answer and strongly negative point-biserials for the incorrect answers (which means the not-so-smart students are selecting them).
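For illustration, here is a minimal sketch of the item-total correlation in plain Python with a hypothetical scored response matrix; Iteman or a spreadsheet does the same work, and a common refinement excludes the item itself from the total score.

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical scored response matrix: rows = examinees, columns = items (1 = correct, 0 = wrong)
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 0],
]

def point_biserial(responses, item_index):
    """Pearson correlation between one item's 0/1 scores and the total test score."""
    item_scores = [row[item_index] for row in responses]
    total_scores = [sum(row) for row in responses]
    return correlation(item_scores, total_scores)

for i in range(len(responses[0])):
    print(f"Item {i + 1}: rpbis = {point_biserial(responses, i):.2f}")
```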


How do you interpret the point-biserial?

Well, most importantly consider the points above about near-zero and negative values.  Besides that, a minimal-quality item might have a point-biserial of 0.10, a good item of about 0.20, and strong items 0.30 or higher.  But, these can vary with sample size and other considerations.  Some constructs are easier to measure than others, which makes item discrimination higher.

Are there other indices?

There are two other indices commonly used in classical test theory.  There is the cousin of the point-biserial, the biserial.  There is also the top/bottom coefficient, where the sample is split into a high-performing group and a low-performing group based on total score; the P value is calculated for each group, and the difference is taken.  So if 85% of top examinees got an item right and 60% of bottom examinees got it right, the index would be 0.25.

Of course, there is also the a parameter from item response theory.  There are a number of advantages to that approach, most notably that the classical indices try to fit a linear model on something that is patently nonlinear.  For more on IRT, I recommend a book like Embretson & Reise (2000).

The California Department of Human Resources (CalHR, calhr.ca.gov/) has selected Assessment Systems Corporation (ASC, assess.com) as its vendor for an online assessment platform. CalHR is responsible for the personnel selection and hiring of many job roles for the State, and delivers hundreds of thousands of tests per year to job applicants. CalHR seeks to migrate to a modern cloud-based platform that allows it to manage large item banks, quickly publish new test forms, and deliver large-scale assessments that align with modern psychometrics like item response theory (IRT) and computerized adaptive testing (CAT).

Assess.ai as a solution

ASC’s landmark assessment platform Assess.ai was selected as a solution for this project. ASC has been providing computerized assessment platforms with modern psychometric capabilities since the 1980s, and released Assess.ai in 2019 as a successor to its industry-leading platform FastTest. It includes modules for item authoring, item review, automated item generation, test publishing, online delivery, and automated psychometric reporting.

Read the full article here.



A standard setting study is a formal, quantitative process for establishing a performance standard on an exam, such as what score is “proficient” or “passing.”  This is typically manifested as a cutscore which is then used for making decisions about people: hire them, pass them, accept them into university, etc.  Because it is used for such important decisions, a lot of work goes into standard setting, using methods based on scientific research.

What is NOT standard setting?

In the assessment world, there are actually three uses of the word standard:

  1. A formal definition of the content that is being tested, such as the Common Core State Standards in the USA.
  2. A formalized process for delivering exams, as seen in the phrase “standardized testing.”
  3. A benchmark for performance, like we are discussing here.

For this reason, I prefer the term cutscore study, but the phrase standard setting is used more often.

How is a standard setting study used?

As part of a comprehensive test development cycle, after item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification.  This cannot be done arbitrarily, such as setting it at 70% because that’s what you saw when you were in school.  That is a legal landmine!  To be legally defensible and eligible for Accreditation of a Certification Program, it must be done using one of several standard-setting approaches from the psychometric literature.  So, if your organization is classifying examinees into Pass/Fail, Hire/NotHire, Basic/Proficient/Advanced, or any other groups, you most likely need a standard setting study.  This is NOT limited to certification, although it is often discussed in that pass/fail context.

What are some methods of a standard setting study?

There have been many methods suggested in the scientific literature of psychometrics.  They are often delineated into examinee-centered and item-centered approaches. Angoff and Bookmark are designed around evaluating items, while Contrasting Groups and Borderline Groups are designed around evaluating the distributions of actual examinee scores.  The Bookmark approach is sort of both types, however, because it uses examinee performance on the items as the object of interest.  You may also be interested in reading this introductory post on setting a cutscore using item response theory.

Angoff


In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly.  If we take the average of all raters, this then translates into the average percentage-correct score that the raters expect from a minimally competent candidate – a very compelling argument for a cutscore to pass competent examinees!  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
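Here is a minimal sketch of how the panel’s ratings become a cutscore; the raters and their ratings below are hypothetical.

```python
# Each SME estimates, for every item, the probability that a minimally
# competent candidate answers it correctly (hypothetical ratings).
angoff_ratings = {
    "rater1": [0.70, 0.60, 0.85, 0.50, 0.75],
    "rater2": [0.65, 0.55, 0.90, 0.45, 0.80],
    "rater3": [0.75, 0.60, 0.80, 0.55, 0.70],
}

n_items = len(next(iter(angoff_ratings.values())))
# Sum each rater's expected score for a minimally competent candidate, then average across raters
rater_expected_scores = [sum(ratings) for ratings in angoff_ratings.values()]
cutscore = sum(rater_expected_scores) / len(rater_expected_scores)

print(f"Recommended cutscore: {cutscore:.2f} out of {n_items} "
      f"({100 * cutscore / n_items:.0f}% correct)")
```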

Bookmark

The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the booklet where they think a cutscore should be. This process requires a sufficient amount of real data to calibrate item difficulty accurately, typically using item response theory, which necessitates data from several hundred examinees. Because the ordering is grounded in empirical item difficulty, this helps support the validity and reliability of the resulting cutscore.

Contrasting Groups


With the contrasting groups approach, candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other external standard.  We can then compare the score distributions on our exam for the two groups and pick a cutscore that best differentiates Pass vs Fail on that standard.  An example of this is below.  If using data from another exam, a sample of at least 50 candidates is needed, since you are evaluating distributions.
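As a rough sketch, given exam scores for candidates already classified Pass or Fail by the external standard (hypothetical values below), one simple approach is to scan candidate cutscores and keep the one that classifies the most candidates consistently with that standard; in practice, psychometricians also examine the two distributions graphically and may weight false positives and false negatives differently.

```python
# Hypothetical exam scores for candidates classified Pass/Fail by an external standard
pass_scores = [78, 81, 74, 88, 69, 83, 77, 90, 72, 85]
fail_scores = [61, 55, 70, 58, 66, 52, 64, 71, 49, 60]

def contrasting_groups_cutscore(pass_scores, fail_scores):
    """Pick the cutscore that maximizes agreement with the external Pass/Fail classification."""
    best_cut, best_accuracy = None, -1.0
    for cut in range(min(fail_scores), max(pass_scores) + 1):
        correct = sum(s >= cut for s in pass_scores) + sum(s < cut for s in fail_scores)
        accuracy = correct / (len(pass_scores) + len(fail_scores))
        if accuracy > best_accuracy:
            best_cut, best_accuracy = cut, accuracy
    return best_cut, best_accuracy

cut, acc = contrasting_groups_cutscore(pass_scores, fail_scores)
print(f"Cutscore {cut} classifies {acc:.0%} of candidates consistently with the external standard")
```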

Borderline Group

The Borderline Group method is similar to the Contrasting Groups method, but it defines a borderline group using alternative information, such as biodata, and evaluates the scores of this group. This method involves selecting individuals who are deemed to be on the threshold of passing or failing based on external criteria. These criteria might include previous performance data, demographic information, or other relevant biodata. The scores from this borderline group are then analyzed to determine the cutscore. This approach helps in refining the accuracy of the cutscore by incorporating more nuanced and contextual information about the test-takers.

Hofstee

The Hofstee approach is often used as a reality check for the modified-Angoff method but can also stand alone as a method for setting cutscores. It involves only a few estimates from a panel of SMEs. Specifically, the SMEs provide estimates for the minimum and maximum acceptable failure rates and the minimum and maximum acceptable scores. This data is then plotted to determine a compromise cutscore that balances these criteria. The simplicity and practicality of the Hofstee approach make it a valuable tool in various testing scenarios, ensuring the cutscore is both realistic and justifiable.
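A minimal sketch of the Hofstee compromise, using hypothetical SME bounds and observed scores: the acceptable region is a line from (minimum cutscore, maximum fail rate) to (maximum cutscore, minimum fail rate), and the compromise cutscore is where that line crosses the observed fail-rate curve.

```python
import bisect

# Hypothetical SME judgments and observed total scores (percent correct)
k_min, k_max = 60, 75        # minimum and maximum acceptable cutscores
f_min, f_max = 0.05, 0.30    # minimum and maximum acceptable fail rates
observed_scores = sorted([52, 58, 61, 63, 65, 66, 68, 70, 71, 72,
                          74, 75, 77, 79, 80, 82, 84, 85, 88, 91])

def fail_rate(cut):
    """Proportion of examinees scoring below the cutscore."""
    return bisect.bisect_left(observed_scores, cut) / len(observed_scores)

def hofstee_line(cut):
    """Acceptable fail rate at this cutscore, from the line joining (k_min, f_max) and (k_max, f_min)."""
    return f_max + (cut - k_min) * (f_min - f_max) / (k_max - k_min)

# The compromise cutscore is where the observed fail-rate curve crosses the Hofstee line
cutscore = min(range(k_min, k_max + 1), key=lambda c: abs(fail_rate(c) - hofstee_line(c)))
print(cutscore, f"fail rate = {fail_rate(cutscore):.0%}")
```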

Ebel

The Ebel approach categorizes test items by both their importance and difficulty level. This method involves a panel of experts who rate each item on these two dimensions, creating a matrix that helps in determining the cutscore. Despite its thorough and structured approach, the Ebel method is considered very old and has largely fallen out of use in modern testing practices. Advances in psychometric techniques and the development of more efficient and accurate methods, such as item response theory, have led to the Ebel approach being replaced by more contemporary standard-setting techniques.

How to choose an approach?

There is often no single correct answer.  In fact, guidelines like NCCA do not lay out which method to use; they just tell you to use an appropriate method.

There are several considerations.  Perhaps the most important is whether you have existing data.  The Bookmark, Contrasting Groups, and Borderline Group approaches all assume that we have data from a test already delivered, which we can analyze with the perspective of the latent standard.  The Angoff and Hofstee approaches, in contrast, can be done before a test is ever delivered.  This is arguably less defensible, but is a huge practical advantage.

The choice also depends on whether you can easily recruit a panel of subject matter experts, as that is required for Angoff and Bookmark.  The Contrasting Groups method assumes we have a gold standard, which is rare.

How can I implement a standard setting study?

If your organization has an in-house psychometrician, they can usually do this.  If, for example, you are a board of experts in a profession but lack experience in psychometrics, you need to hire a firm.  We can perform such work for you – contact us to learn more.

Item banking refers to the purposeful creation of a database of assessment items to serve as a central repository of all test content, improving efficiency and quality. The term item refers to what many call questions; though their content need not be restricted as such and can include problems to solve or situations to evaluate in addition to straightforward questions. Regular item review is essential to ensure that each item meets content standards, is fair, and is free from bias, thereby maintaining the integrity and accuracy of the item bank. As a critical foundation to the test development cycle, item banking is the foundation for the development of valid, reliable content and defensible test forms.

Automated item banking systems, such as  Assess.ai  or  FastTest, result in significantly reduced administrative time for developing/reviewing items and assembling/publishing tests, while producing exams that have greater reliability and validity.  Contact us to request a free account.

What is Item Banking?

While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Here are the essentials you should be looking for:

  • Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally, item performance should be tracked not only within a test form but across test forms as well.
  • Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees; once exposed to many examinees, items can be flagged for retirement or revision to reduce cheating or teaching to the test.
  • Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on content area or other categorization methods, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.
  • Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked, and versions of the same item should be linked so that we can easily review the performance of earlier versions alongside current versions.
  • Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content and the users who made them are tracked. In post-test review, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item expedites that process.
  • Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common metadata fields we see are author, source, description, content area, depth of knowledge, item response theory parameters, and classical test theory statistics, but there are likely many data points specific to your organization that are worth storing.

 

Managing an Item Bank

Names are important. As you create or import your item banks, it is important to identify each item with a unique but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting.  You might also want to include additional pieces of information.  If importing, the system should be smart enough to recognize duplicates.

Search and filter. The system should also have a reliable sorting mechanism. 


 

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department or left the organization. Later in the test development life cycle, as you deliver items, you can aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests and driving reliability and validity.

Statistics are used in the assembly of test forms; classical statistics can be used to estimate the mean, standard deviation, reliability, standard error, and pass rate of a form.


Item response theory parameters can come in handy when calculating test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
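As a small sketch of how stored item parameters feed those functions, here is a Rasch/1PL test information function and conditional standard error of measurement computed from hypothetical b-parameters (2PL and 3PL versions simply add discrimination and guessing terms).

```python
import math

item_difficulties = [-1.5, -0.8, -0.2, 0.0, 0.4, 0.9, 1.3]   # hypothetical b-parameters

def test_information(theta, bs):
    """Rasch/1PL: item information is p*(1-p); test information is the sum over items."""
    info = 0.0
    for b in bs:
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        info += p * (1.0 - p)
    return info

for theta in [-2, -1, 0, 1, 2]:
    tif = test_information(theta, item_difficulties)
    sem = 1.0 / math.sqrt(tif)     # conditional standard error of measurement
    print(f"theta={theta:+d}: information={tif:.2f}, SEM={sem:.2f}")
```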

In the event that your organization decides to publish an adaptive test, utilizing computerized adaptive testing delivery, item parameters for each item will be essential. This is because they are used for intelligent selection of items and scoring examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility and the storage of metadata is one such vital piece of documentation.

 

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test-takers, as they are likely to be diverse as well!


 

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. For a high school science test, this may include an evaluation of the content taught in class. A high-stakes certification exam almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed.

Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.

 

The Benefits of Item Banking

There is no doubt that item banking is a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment. Moreover, good item banking will make the job easier and more efficient thus reducing the cost of item development and test publishing.

 

Ready to Improve assessment quality through item banking?

Visit our Contact Us page, where you can request a demonstration or a free account (up to 500 items).  I also recommend you watch this tutorial video.