Posts on psychometrics: The Science of Assessment

Linear On the Fly Testing (LOFT)

Linear on the fly testing (LOFT) is an approach to assessment delivery that increases test security by limiting item exposure. It tries to balance the advantages of linear testing (e.g., everyone sees the same number of items, which feels fairer) with the advantages of algorithmic exams (e.g., creating a unique test for everyone).

In general, there are two families of test delivery.  Static approaches deliver the same test form or forms to everyone; this is the ubiquitous and traditional “linear” method of testing.  Algorithmic approaches deliver the test to each examinee based on a computer algorithm; this includes LOFT, computerized adaptive testing (CAT), and multistage testing (MST).

What is LOFT?

The purpose of linear on the fly testing is to give every examinee a linear form that is uniquely created for them – but each one is created to be psychometrically equivalent to all others to ensure fairness.  For example, we might have a pool of 200 items, and every person only gets 100, but that 100 is balanced for each person.  This can be done by ensuring content and/or statistical equivalency, as well as ancillary metadata such as item types or cognitive level.

Content Equivalence

This portion is relatively straightforward.  If your test blueprint calls for 20 items in each of 5 domains, for a total of 100 items, then each form administered to examinees should follow this blueprint.  Sometimes the content blueprint might go 2 or even 3 levels deep.

Statistical Equivalence

There are, of course, two predominant psychometric paradigms: classical test theory (CTT) and item response theory (IRT).  With CTT, forms can easily be built to have an equivalent P value, and therefore expected mean score.  If point-biserial statistics are available for each item, you can also design the algorithm to build forms that have the same standard deviation and reliability.

With item response theory, the typical approach is to design forms to have the same test information function, or inversely, conditional standard error of measurement function.  To learn more about how these are implemented, read this blog post about IRT or download our Classical Form Assembly Tool.
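As a rough illustration of the statistical balancing idea, here is a minimal CTT-flavored sketch in Python. The pool, domain structure, target P value, and look-ahead window are all hypothetical; a production LOFT engine would also enforce exposure controls, enemy-item rules, and IRT targets such as the test information function.

```python
import random

# Hypothetical pool: 200 items across 5 content domains, each with a classical P value.
pool = [{"id": i, "domain": i % 5, "p": round(random.uniform(0.40, 0.95), 2)}
        for i in range(200)]

def assemble_loft_form(pool, n_domains=5, items_per_domain=20, target_p=0.70):
    """Greedy sketch: for each domain, repeatedly pick the item (from a random shortlist)
    that keeps the running mean P value closest to the target."""
    form = []
    for d in range(n_domains):
        candidates = [it for it in pool if it["domain"] == d]
        random.shuffle(candidates)            # randomness is what makes each form unique
        chosen = []
        while len(chosen) < items_per_domain and candidates:
            shortlist = candidates[:10]       # small look-ahead window so ties break randomly
            def mean_with(item):
                ps = [x["p"] for x in chosen] + [item["p"]]
                return sum(ps) / len(ps)
            best = min(shortlist, key=lambda it: abs(mean_with(it) - target_p))
            chosen.append(best)
            candidates.remove(best)
        form.extend(chosen)
    return form

form = assemble_loft_form(pool)
mean_p = sum(it["p"] for it in form) / len(form)
print(len(form), round(mean_p, 3))   # 100 items, mean P close to the 0.70 target
```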

Implementing LOFT

LOFT is typically implemented by publishing a pool of items with an algorithm to select subsets that meet the requirements.  Therefore, you need a psychometrically sophisticated testing engine that stores the necessary statistics and item metadata, lets you define a pool of items and specify the relevant options such as target statistics and blueprints, and delivers the test in a secure manner.  Very few testing platforms can implement a quality LOFT assessment.  ASC’s platform does; click here to request a demo.

Benefits of Using LOFT in Testing

It certainly is not easy to build a strong item bank, design LOFT pools, and develop a complex algorithm that meets the content and statistical balancing needs.  So why would an organization use linear on the fly testing?

Well, it is much more secure than having a few linear forms.  Since everyone receives a unique form, it is impossible for word to get out about what the first questions on the test are.  And of course, we could simply perform a random selection of 100 items from a pool of 200, but that would be potentially unfair.  Using LOFT will ensure the test remains fair and defensible.


The two terms Norm-Referenced and Criterion-Referenced are commonly used to describe tests, exams, and assessments.  They are often some of the first concepts learned when studying assessment and psychometrics. Norm-referenced means that we are referencing how your score compares to other people.  Criterion-referenced means that we are referencing how your score compares to a criterion such as a cutscore or a body of knowledge. Test scaling is integral to both types of assessments, as it involves adjusting scores to facilitate meaningful comparisons.

Do we say a test is “Norm-Referenced” vs. “Criterion-Referenced”?

Actually, that’s a slight misuse.

The terms Norm-Referenced and Criterion-Referenced refer to score interpretations.  Most tests can actually be interpreted in both ways, though they are usually designed and validated for only one or the other.  More on that later.

Hence the shorthand usage of saying “this is a norm-referenced test” even though it just means that this is the primary intended interpretation.

Examples of Norm-Referenced vs. Criterion-Referenced

Suppose you received a score of 90% on a Math exam in school.  This could be interpreted in both ways.  If the cutscore was 80%, you clearly passed; that is the criterion-referenced interpretation.  If the average score was 75%, then you performed at the top of the class; this is the norm-referenced interpretation.  Same test, both interpretations are possible.  And in this case, both are valid interpretations.

What if the average score was 95%?  Well, that changes your norm-referenced interpretation (you are now below average) but the criterion-referenced interpretation does not change.

Now consider a certification exam.  This is an example of a test that is specifically designed to be criterion-referenced.  It is supposed to measure that you have the knowledge and skills to practice in your profession.  It doesn’t matter whether all candidates pass or only a few candidates pass; the cutscore is the cutscore.

However, you could interpret your score by looking at your percentile rank compared to other examinees; it just doesn’t impact the cutscore.

On the other hand, we have an IQ test.  There is no criterion-referenced cutscore of whether you are “smart” or “passed.”  Instead, the scores are located on the standard normal curve (mean=100, SD=15), and all interpretations are norm-referenced.  Namely, where do you stand compared to others?  The scales of the T score and z-score are norm-referenced, as are percentiles.  So are many tests in the world, like the SAT with a mean of 500 and SD of 100.
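For concreteness, here is a minimal sketch of how these norm-referenced scales relate to each other; the raw score, mean, and SD are hypothetical.

```python
def z_score(raw, mean, sd):
    """Standardize a raw score against the norm group."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """T score: mean 50, SD 10."""
    return 50 + 10 * z_score(raw, mean, sd)

def iq_scale(raw, mean, sd):
    """IQ-style scale: mean 100, SD 15."""
    return 100 + 15 * z_score(raw, mean, sd)

# A raw score of 82 in a norm group with mean 75 and SD 10:
print(z_score(82, 75, 10), t_score(82, 75, 10), iq_scale(82, 75, 10))  # 0.7, 57.0, 110.5
```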

Is this impacted by item response theory (IRT)?

If you have looked at item response theory (IRT), you know that it scores examinees on what is effectively the standard normal curve (though this is shifted if Rasch).  But, IRT-scored exams can still be criterion-referenced.  They can still be designed to measure a specific body of knowledge and have a cutscore that is fixed and stable over time.

Even computerized adaptive testing can be used like this.  An example is the NCLEX exam for nurses in the United States.  It is an adaptive test, but the cutscore is -0.18 (NCLEX-PN on Rasch scale) and it is most definitely criterion-referenced.

Building and validating an exam

The process of developing a high-quality assessment is surprisingly difficult and time-consuming. The greater the stakes, volume, and incentives for stakeholders, the more effort that goes into developing and validating.  ASC’s expert consultants can help you navigate these rough waters.

Want to develop smarter, stronger exams?

Contact us to request a free account in our world-class platform, or talk to one of our psychometric experts.

 

Point-Biserial Discrimination

The item-total point-biserial correlation is a common psychometric index regarding the quality of a test item, namely how well it differentiates between examinees with high vs low ability.

What is item discrimination?

While the word “discrimination” has a negative connotation, it is actually a really good thing for an item to have.  It means that it is differentiating between examinees, which is entirely the reason that an assessment item exists.  If a math item on Fractions is good, then students with good knowledge of fractions will tend to get it correct, while students with poor knowledge will get it wrong.  If this isn’t the case, and the item is essentially producing random data, then it has no discrimination.  If the reverse is the case, then the discrimination will be negative.  This is a total red flag; it means that good students are getting the item wrong and poor students are getting it right, which almost always means that there is incorrect content or the item is miskeyed.

What is the point-biserial correlation?

The point-biserial coefficient is a Pearson correlation between scores on the item (usually 0=wrong and 1=correct) and the total score on the test.  As such, it is sometimes called an item-total correlation.

Consider the example below.  There are 10 examinees that got the item wrong, and 10 that got it correct.  The scores are definitely higher for the Correct group.  If you fit a regression line, it would have a positive slope.  If you calculated a correlation, it would be around 0.10.

point biserial discrimination

How do you calculate the point-biserial?

Since it is a Pearson correlation, you can easily calculate it with the CORREL function in Excel or similar software.  Of course, psychometric software like Iteman will also do it for you, and many more important things besides (e.g., the point-biserial for each of the incorrect options!).  This is an important step in item analysis.  The image below is example output from Iteman, where Rpbis is the point-biserial.  This item is very good, as it has a very high point-biserial for the correct answer and strongly negative point-biserials for the incorrect answers (which means the not-so-smart students are selecting them).

FastTest Iteman Psychometric Analysis
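If you want to compute it yourself outside of Excel or Iteman, here is a minimal sketch in Python; the response matrix is simulated, so the values are only illustrative. (A stricter “corrected” item-total correlation would exclude the item itself from the total score.)

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scored response matrix: rows = examinees, columns = items (1 = correct, 0 = wrong).
scores = rng.binomial(1, 0.7, size=(500, 40))

total = scores.sum(axis=1)   # total raw score per examinee

def point_biserial(item_col, total_scores):
    """Pearson correlation between a 0/1 item column and total scores."""
    return np.corrcoef(item_col, total_scores)[0, 1]

rpbis = [point_biserial(scores[:, j], total) for j in range(scores.shape[1])]
print([round(r, 2) for r in rpbis[:5]])
```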

How do you interpret the point-biserial?

Well, most importantly consider the points above about near-zero and negative values.  Besides that, a minimal-quality item might have a point-biserial of 0.10, a good item of about 0.20, and strong items 0.30 or higher.  But, these can vary with sample size and other considerations.  Some constructs are easier to measure than others, which makes item discrimination higher.

Are there other indices?

There are two other indices commonly used in classical test theory.  There is the cousin of the point-biserial, the biserial.  There is also the top/bottom coefficient, where the sample is split into a high-performing group and a low-performing group based on total score, the P value is calculated for each group, and the two are subtracted.  So if 85% of top examinees got it right and 60% of low examinees got it right, the index would be 0.25.
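A quick sketch of that top/bottom index, using hypothetical simulated data and the common 27% group split:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.binomial(1, 0.7, size=(500, 40))   # hypothetical person x item 0/1 matrix
total = scores.sum(axis=1)

def upper_lower_index(item_col, total_scores, fraction=0.27):
    """P value among the top scorers minus P value among the bottom scorers."""
    k = max(1, int(len(total_scores) * fraction))
    order = np.argsort(total_scores)
    bottom, top = order[:k], order[-k:]
    return item_col[top].mean() - item_col[bottom].mean()

print(round(upper_lower_index(scores[:, 0], total), 2))
```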

Of course, there is also the a parameter from item response theory.  There are a number of advantages to that approach, most notably that the classical indices try to fit a linear model on something that is patently nonlinear.  For more on IRT, I recommend a book like Embretson & Reise (2000).

The California Department of Human Resources (CalHR, calhr.ca.gov/) has selected Assessment Systems Corporation (ASC, assess.com) as its vendor for an online assessment platform. CalHR is responsible for the personnel selection and hiring of many job roles for the State, and delivers hundreds of thousands of tests per year to job applicants. CalHR seeks to migrate to a modern cloud-based platform that allows it to manage large item banks, quickly publish new test forms, and deliver large-scale assessments that align with modern psychometrics like item response theory (IRT) and computerized adaptive testing (CAT).

Assess.ai as a solution

ASC’s landmark assessment platform Assess.ai was selected as a solution for this project. ASC has been providing computerized assessment platforms with modern psychometric capabilities since the 1980s, and released Assess.ai in 2019 as a successor to its industry-leading platform FastTest. It includes modules for item authoring, item review, automated item generation, test publishing, online delivery, and automated psychometric reporting.

Read the full article here.


Standard Setting Study

A standard setting study is a formal, quantitative process for establishing a performance standard on an exam, such as what score is “proficient” or “passing.”  This is typically manifested as a cutscore which is then used for making decisions about people: hire them, pass them, accept them into university, etc.  Because it is used for such important decisions, a lot of work goes into standard setting, using methods based on scientific research.

What is NOT standard setting?

In the assessment world, there are actually three uses of the word standard:

  1. A formal definition of the content that is being tested, such as the Common Core State Standards in the USA.
  2. A formalized process for delivering exams, as seen in the phrase “standardized testing.”
  3. A benchmark for performance, like we are discussing here.

For this reason, I prefer the term cutscore study, but the phrase standard setting is used more often.

How is a standard setting study used?

As part of a comprehensive test development cycle, after item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification.  This cannot be done arbitrarily, such as setting it at 70% because that’s what you saw when you were in school.  That is a legal landmine!  To be legally defensible and eligible for Accreditation of a Certification Program, it must be done using one of several standard-setting approaches from the psychometric literature.  So, if your organization is classifying examinees into Pass/Fail, Hire/NotHire, Basic/Proficient/Advanced, or any other groups, you most likely need a standard setting study.  This is NOT limited to certification, although it is often discussed in that pass/fail context.

What are some methods of a standard setting study?

There have been many methods suggested in the scientific literature of psychometrics.  They are often delineated into examinee-centered and item-centered approaches. Angoff and Bookmark are designed around evaluating items, while Contrasting Groups and Borderline Groups are designed around evaluating the distributions of actual examinee scores.  The Bookmark approach is sort of both types, however, because it uses examinee performance on the items as the object of interest.  You may also be interested in reading this introductory post on setting a cutscore using item response theory.

Angoff

Modified Angoff analysis

In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly.  If we take the average of all raters, this then translates into the average percentage-correct score that the raters expect from a minimally competent candidate – a very compelling argument for a cutscore to pass competent examinees!  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
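Computationally, the Angoff cutscore is just an average of averages. Here is a minimal sketch with a hypothetical ratings matrix (three SMEs, five items); real studies typically involve more raters, more items, and discussion rounds between ratings.

```python
import numpy as np

# Hypothetical Angoff ratings: rows = raters (SMEs), columns = items.
# Each value is the estimated probability that a minimally competent candidate answers correctly.
ratings = np.array([
    [0.80, 0.60, 0.90, 0.55, 0.70],
    [0.75, 0.65, 0.85, 0.50, 0.65],
    [0.85, 0.55, 0.95, 0.60, 0.75],
])

item_means = ratings.mean(axis=0)    # expected P value per item for the borderline candidate
cutscore_pct = item_means.mean()     # expected proportion-correct cutscore
cutscore_raw = item_means.sum()      # expected raw-score cutscore on this 5-item form

print(round(cutscore_pct, 3), round(cutscore_raw, 2))
```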

Bookmark

The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be. This process requires a sufficient amount of real data to calibrate item difficulty accurately, typically using item response theory, which necessitates data from several hundred examinees. Additionally, the method ensures that the cutscore is both valid and reliable, reflecting the true proficiency needed for the test.

Contrasting Groups

contrasting groups cutscore

With the contrasting groups approach, candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other external standard.  We can then compare the score distributions on our exam for the two separate groups, and pick a cutscore that best differentiates Pass vs Fail on the other standard.  An example of this is below.  If using data from another exam, a sample of at least 50 candidates is obviously needed, since you are evaluating distributions.
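A minimal sketch of the contrasting groups logic, using simulated score distributions and picking the cutscore that minimizes misclassification against the external standard. Other decision rules (for example, weighting false passes more heavily than false fails) are also common; everything here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical scores on our exam for candidates already classified Pass/Fail on an external standard.
fail_scores = rng.normal(60, 8, 80)
pass_scores = rng.normal(75, 8, 120)

def best_cutscore(fail_scores, pass_scores):
    """Pick the cut that minimizes total misclassification against the external standard."""
    candidates = np.arange(0, 101)
    errors = [(np.sum(pass_scores < c) + np.sum(fail_scores >= c), c) for c in candidates]
    return min(errors)[1]

print(best_cutscore(fail_scores, pass_scores))   # cutscore between the two distributions
```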

Borderline Group

The Borderline Group method is similar to the Contrasting Groups method, but it defines a borderline group using alternative information, such as biodata, and evaluates the scores of this group. This method involves selecting individuals who are deemed to be on the threshold of passing or failing based on external criteria. These criteria might include previous performance data, demographic information, or other relevant biodata. The scores from this borderline group are then analyzed to determine the cutscore. This approach helps in refining the accuracy of the cutscore by incorporating more nuanced and contextual information about the test-takers.

Hofstee

The Hofstee approach is often used as a reality check for the modified-Angoff method but can also stand alone as a method for setting cutscores. It involves only a few estimates from a panel of SMEs. Specifically, the SMEs provide estimates for the minimum and maximum acceptable failure rates and the minimum and maximum acceptable scores. This data is then plotted to determine a compromise cutscore that balances these criteria. The simplicity and practicality of the Hofstee approach make it a valuable tool in various testing scenarios, ensuring the cutscore is both realistic and justifiable.
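Here is a rough sketch of the Hofstee computation: the panel's minimum/maximum acceptable cutscores and fail rates define a line, and the compromise cutscore is taken where that line crosses the observed fail-rate curve. All numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
scores = rng.normal(72, 10, 500)      # hypothetical observed exam scores (percent correct)

# Hypothetical panel judgments:
k_min, k_max = 60, 80                 # minimum / maximum acceptable cutscores
f_min, f_max = 0.05, 0.30             # minimum / maximum acceptable fail rates

def hofstee_cutscore(scores, k_min, k_max, f_min, f_max):
    """Find where the line from (k_min, f_max) to (k_max, f_min) crosses the observed fail-rate curve."""
    best, best_gap = None, float("inf")
    for c in np.linspace(k_min, k_max, 201):
        observed_fail = np.mean(scores < c)
        line_fail = f_max + (f_min - f_max) * (c - k_min) / (k_max - k_min)
        gap = abs(observed_fail - line_fail)
        if gap < best_gap:
            best, best_gap = c, gap
    return best

print(round(hofstee_cutscore(scores, k_min, k_max, f_min, f_max), 1))
```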

Ebel

The Ebel approach categorizes test items by both their importance and difficulty level. This method involves a panel of experts who rate each item on these two dimensions, creating a matrix that helps in determining the cutscore. Despite its thorough and structured approach, the Ebel method is considered very old and has largely fallen out of use in modern testing practices. Advances in psychometric techniques and the development of more efficient and accurate methods, such as item response theory, have led to the Ebel approach being replaced by more contemporary standard-setting techniques.

How to choose an approach?

There is often no specifically correct answer.  In fact, guidelines like NCCA do not lay out which method to use; they just tell you to use an appropriate method.

There are several considerations.  Perhaps the most important is whether you have existing data.  The Bookmark, Contrasting Groups, and Borderline Group approaches all assume that we have data from a test already delivered, which we can analyze with the perspective of the latent standard.  The Angoff and Hofstee approaches, in contrast, can be done before a test is ever delivered.  This is arguably less defensible, but is a huge practical advantage.

The choice also depends on whether you can easily recruit a panel of subject matter experts, as that is required for Angoff and Bookmark.  The Contrasting Groups method assumes we have a gold standard, which is rare.

How can I implement a standard setting study?

If your organization has an in-house psychometrician, they can usually do this.  If, for example, you are a board of experts in a profession but lack experience in psychometrics, you need to hire a firm.  We can perform such work for you – contact us to learn more.

Item banking refers to the purposeful creation of a database of assessment items to serve as a central repository of all test content, improving efficiency and quality. The term item refers to what many call questions; though their content need not be restricted as such and can include problems to solve or situations to evaluate in addition to straightforward questions. Regular item review is essential to ensure that each item meets content standards, is fair, and is free from bias, thereby maintaining the integrity and accuracy of the item bank. As a critical foundation to the test development cycle, item banking is the foundation for the development of valid, reliable content and defensible test forms.

Automated item banking systems, such as  Assess.ai  or  FastTest, result in significantly reduced administrative time for developing/reviewing items and assembling/publishing tests, while producing exams that have greater reliability and validity.  Contact us to request a free account.

What is Item Banking?

While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Here are the essentials you should be looking for:

   Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally, item performance should be tracked not only within a test form but across test forms as well.

   Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. Upon exposure to many examinees, items can then be flagged for retirement or revision to reduce cheating or teaching to the test.

   Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization methods, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.

   Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.

   Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content and the users who made these changes are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item can expedite that process.

   Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, item response theory parameters, and classical test theory statistics, but there are likely many data points specific to your organization that are worth storing.

 

Managing an Item Bank

Names are important. As you create or import your item banks it is important to identify each item with a unique, but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting.  You might want to also add additional pieces of information.  If importing, the system should be smart enough to recognize duplicates.

Search and filter. The system should also have a reliable sorting mechanism. 


 

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department, or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability and validity.

Statistics are used in the assembly of test forms; classical statistics can be used to estimate the mean, standard deviation, reliability, standard error, and pass rate.

Item banking statistics

Item response theory parameters can come in handy when calculating test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
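For example, with 2PL item parameters stored in the bank, a draft form's test information and conditional standard error of measurement can be estimated before the form is ever delivered. A minimal sketch with hypothetical parameters:

```python
import numpy as np

# Hypothetical 2PL item parameters for a draft form: discrimination (a) and difficulty (b).
a = np.array([0.9, 1.2, 0.7, 1.5, 1.0])
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.4])

def test_information(theta, a, b):
    """2PL test information function: sum of item informations at ability theta."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return np.sum(a**2 * p * (1 - p))

for theta in (-1.0, 0.0, 1.0):
    tif = test_information(theta, a, b)
    csem = 1 / np.sqrt(tif)            # conditional standard error of measurement
    print(theta, round(tif, 2), round(csem, 2))
```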

In the event that your organization decides to publish an adaptive test, utilizing computerized adaptive testing delivery, item parameters for each item will be essential. This is because they are used for intelligent selection of items and scoring examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility and the storage of metadata is one such vital piece of documentation.

 

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test-takers, as they are likely to be diverse as well!

item review kanban

 

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. For a high school science test, this may include an evaluation of the content taught in class. A high-stakes certification exam almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed.

Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.

 

The Benefits of Item Banking

There is no doubt that item banking is a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment. Moreover, good item banking will make the job easier and more efficient thus reducing the cost of item development and test publishing.

 

Ready to Improve assessment quality through item banking?

Visit our Contact Us page, where you can request a demonstration or a free account (up to 500 items).  I also recommend you watch this tutorial video.


Post-training assessment is an integral part of improving the performance and productivity of employees. To gauge the effectiveness of the training, assessments are the go-to solution for many businesses.  They ensure transfer and retention of the training knowledge, provide feedback to employees, and can be used for evaluations.  At the aggregate level, they help determine opportunities for improvement at the company. Effective test preparation can enhance the accuracy and reliability of these assessments, ensuring that employees are adequately prepared to demonstrate their knowledge and skills.


Benefits Of Post-Training Assessments

Insight On Company Strengths and Weaknesses

Testing gives businesses and organizations insight into the positives and negatives of their training programs. For example, if an organization realizes that certain employees can’t grasp certain concepts, they may decide to modify how they are delivered or eliminate them completely. The employees can also work on their areas of weakness after the assessments, hence improving productivity.


Helps in Measuring Performance

Unlike traditional testing, where it is practically impossible to perform analytics or measure performance, high-fidelity computer-based assessments can quantify initial goals such as call center skills. By measuring performance, businesses can create data-driven roadmaps for how their employees can reach their best performance.

Advocate For Ideas and Concepts That Can Be Integrated Into The Real World

Workers learn every day and sometimes what they learn is not used in driving the business towards attaining its objectives. This can lead to burnout and information overload in employees, which in turn lowers performance and work quality. By using post-training assessments, you can customize tests to focus on the skills that are in alignment with your business goals. Implementing digital assessments can streamline this process, making it easier to deploy methods such as adaptive testing that provide real-time feedback and tailored learning paths.

Other Benefits of Cloud-based Testing Include:

  • The assessments can be taken from anywhere in the world
  • Saves the company a lot of time and resources
  • Improved security compared to traditional assessments
  • Improved accuracy and reliability
  • Scalability and flexibility
  • Increases skill and knowledge transfer

 

Tips To Improve Your Post-Training Assessments

1. Personalized Testing

Most businesses have an array of different training needs. Most employees have different backgrounds and responsibilities in organizations, which makes it difficult to create effective generalized tests. To achieve the main objectives of your training, it is important to differentiate the assessments. Sales assessments, technical assessments, management assessments, etc. cannot be the same. Even in the same department, there could be diversification in terms of skills and responsibilities. One way to achieve personalized testing is by using methods such as Computerized Adaptive Testing. Through the immense power of AI and machine learning, this method gives you the power to create tests that are unique to each employee. Not only does personalized testing improve effectiveness in your workforce, but it is also cost-effective, secure, and in alignment with the best psychometric practices in the corporate world. It is also important to keep in mind the components of effective assessments when creating personalized tests.

2. Analyzing Assessment Results

Many businesses don’t see the importance of analyzing training assessment results. How do you expect to improve your training programs and assessments if you don’t check the data?  This can tell you important things like where the students are weakest and perhaps need more instruction, or if some questions are wonky.


Example of Assessment analysis on Iteman

 

Analyze assessment results using psychometric analytics software such as Iteman to get important insights such as successful participants, item performance issues, and many others. This provides you with a blueprint to improve your assessments and employee training programs. 

3. Integrating Assessment Into Company Culture

Getting the best out of assessment is not about getting it right once, but getting it right over a long period of time. Integrating assessment into company culture is one great way to achieve this. This will make assessment part of the systems and employees will always look forward to improving their skills. You can also use strategies such as gamification to make sure that your employees enjoy the process. It is also critical to give the employees the freedom to provide feedback on the training programs. 

4. Diversify Your Assessment Types

One great myth about assessments is that they are limited in terms of questions and problems you can present to your employees. However, this is not true!

By using methods such as item banking, assessment systems are able to provide users with the ability to develop assessments using different question types. Some modern question types include:

  • Drag & drop 
  • Multiple correct 
  • Embedded audio or video
  • Cloze or fill in the blank
  • Number lines
  • Situational judgment test items
  • Counter or timer for performance tests

Diversification of question types improves comprehension in employees and helps them develop skills to approach problems from multiple angles. 

5. Choose Your Assessment Tools Carefully

This is among the most important considerations you should make when creating a workforce training assessment strategy. This is because software tools are the core of how your campaigns turn out. 

There are many assessment tools available, but choosing one that meets your requirements can be a daunting task. Apart from the key considerations of budget, functionality, etc., there are many other factors to keep in mind before choosing online assessment tools. 

To help you choose an assessment tool that will help you in your assessment journey, here are a few things to consider:

Ease-of-use

Most people are new to assessments, and as much as some functionalities can be powerful, they may be overwhelming to candidates and the test development staff. This may make candidates underperform. It is, therefore, important to vet the platform and its functionalities to make sure that they are easy to use. 

Functionality

Training assessments are becoming popular and new inventions are being made every day. Does the assessment software have the latest innovations in the industry? Do you get value for your money? Does it support modern psychometrics like item response theory? These are just a few questions to ask when vetting a platform for functionality.

Assessment Reporting and Visualizations

One major advantage of online assessments over traditional ones is that they offer instant assessment reporting. You should therefore look for a platform that offers advanced reporting and visualizations on metrics such as performance, question strengths, and many others.

Cheating precautions and Security

When it comes to assessments, there are two security concerns: How secure are the assessments? And how secure is the platform? In relation to the tests, the platform should provide precautions and technologies such as a lockdown browser to deter cheating. They should also have measures in place to make sure that user data is secure.

Reliable Support System

This is one consideration that many businesses don’t keep in mind, and end up regretting in the long run. Which channels does the corporate training assessment platform use to provide its users with support? Do they have resources such as whitepapers and documentation in case you need them? How fast is their support?  These are questions you should ask before selecting a platform to take care of your assessment needs. 

Scalability

A good testing vendor should be able to provide you with resources should your needs go beyond expectation. This includes delivery volume – server scalability – but also being able to manage more item authors, more assessments, more examinees, and greater psychometric rigor.

Final Thoughts

Adopting effective post-training assessments can be a daunting task with a lot of forces at play, and we hope these tips will help you get the best out of your assessments.

Do you want to integrate smarter assessments into your corporate environment or any industry but feel overwhelmed by the process? Feel free to contact an experienced team of professionals to help you create an assessment strategy that helps you achieve your long-term goals and objectives.  

You can also sign up to get free access to our online assessment suite including 60 item types, IRT, adaptive testing, and so much more functionality!


Item analysis is the statistical evaluation of test questions to ensure they are good quality, and fix them if they are not.  This is a key step in the test development cycle; after items have been delivered to examinees (either as a pilot, or in full usage), we analyze the statistics to determine if there are issues which affect validity and reliability, such as being too difficult or biased.  This post will describe the basics of this process.  If you’d like further detail and instructions on using software, you can also check out our tutorial videos on our YouTube channel and download our free psychometric software.


Download a free copy of Iteman: Software for Item Analysis

What is Item Analysis?

Item analysis refers to the process of statistically analyzing assessment data to evaluate the quality and performance of your test items. This is an important step in the test development cycle, not only because it helps improve the quality of your test, but because it provides documentation for validity: evidence that your test performs well and score interpretations mean what you intend.  It is one of the most common applications of psychometrics, using item statistics to flag, diagnose, and fix the poorly performing items on a test.  Every item that is poorly performing is potentially hurting the examinees.

Item analysis boils down to two goals:

  1. Find the items that are not performing well (difficulty and discrimination, usually)
  2. Figure out WHY those items are not performing well, so we can determine whether to revise or retire them

There are different ways to evaluate performance, such as whether the item is too difficult/easy, too confusing (not discriminating), miskeyed, or perhaps even biased to a minority group.

Moreover, there are two completely different paradigms for this analysis: classical test theory (CTT) and item response theory (IRT). On top of that, the analyses can differ based on whether the item is dichotomous (right/wrong) or polytomous (2 or more points).

Because of the possible variations, item analysis is a complex topic. But, that doesn’t even get into the evaluation of test performance. In this post, we’ll cover some of the basics for each theory, at the item level.

 

How to do Item Analysis

1. Prepare your data for item analysis

Most psychometric software utilizes a person x item matrix.  That is, a data file where examinees are rows and items are columns.  Sometimes, it is a sparse matrix where there is a lot of missing data, like linear on the fly testing.  You will also need to provide metadata to the software, such as your Item IDs, correct answers, item types, etc.  The format for this will differ by software.

2. Run data through item analysis software

To implement item analysis, you should utilize dedicated software designed for this purpose. If you utilize an online assessment platform, it will provide you output for item analysis, such as distractor P values and point-biserials (if not, it isn’t a real assessment platform). In some cases, you might utilize standalone software. CITAS  provides a simple spreadsheet-based approach to help you learn the basics, completely for free.  A screenshot of the CITAS output is here.  However, professionals will need a level above this.  Iteman  and  Xcalibre  are two specially-designed software programs from ASC for this purpose, one for CTT and one for IRT.

CITAS output with histogram

3. Interpret results of item analysis

Item analysis software will produce tables of numbers.  Sometimes, these will be ugly ASCII-style tables from the 1980s.  Sometimes, they will be beautiful Word docs with graphs and explanations.  Either way, you need to interpret the statistics to determine which items have problems and how to fix them.  The rest of this article will delve into that.

 

Item Analysis with Classical Test Theory

Classical Test Theory provides a simple and intuitive approach to item analysis. It utilizes nothing more complicated than proportions, averages, counts, and correlations. For this reason, it is useful for small-scale exams or use with groups that do not have psychometric expertise.

Item Difficulty: Dichotomous

CTT quantifies item difficulty for dichotomous items as the proportion (P value) of examinees that correctly answer it.

It ranges from 0.0 to 1.0. A high value means that the item is easy, and a low value means that the item is difficult.  There are no hard and fast rules because interpretation can vary widely for different situations.  For example, a test given at the beginning of the school year would be expected to have low statistics since the students have not yet been taught the material.  On the other hand, a professional certification exam, where someone can not even sit unless they have 3 years of experience and a relevant degree, might have all items appear easy even though they are quite advanced topics!  Here are some general guidelines:

    0.95-1.0 = Too easy (not doing much good to differentiate examinees, which is really the purpose of assessment)

    0.60-0.95 = Typical

    0.40-0.60 = Hard

    <0.40 = Too hard (consider that a 4 option multiple choice has a 25% chance of pure guessing)

With Iteman, you can set bounds to automatically flag items.  The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, which means that 50% of the examinees have answered the item correctly.

For a test where we expect examinees to perform poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing; if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be 0.20.  The maximum P value represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees.  In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.

Note that because the scale is inverted (lower value means higher difficulty), this is sometimes referred to as item facility.
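A minimal sketch of computing P values from a scored person x item matrix and flagging items against bounds like those above; the data and bounds are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
scored = rng.binomial(1, 0.75, size=(300, 20))   # hypothetical person x item 0/1 matrix

p_values = scored.mean(axis=0)                   # classical difficulty (facility) per item

# Hypothetical flagging bounds, in the spirit of the guidelines above
too_hard = np.where(p_values < 0.40)[0]
too_easy = np.where(p_values > 0.95)[0]
print(np.round(p_values, 2))
print("Flag (too hard):", too_hard, " Flag (too easy):", too_easy)
```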

The Item Mean (Polytomous)

This refers to an item that is scored with 2 or more point levels, like an essay scored on a 0-4 point rubric or a Likert-type item that is “Rate on a scale of 1 to 5.”

  • 1=Strongly Disagree
  • 2=Disagree
  • 3=Neutral
  • 4=Agree
  • 5=Strongly Agree

The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.

You will have to adjust for your own situation, but here is an example for the 5-point Likert-style item.

    1-2 is very low; people disagree fairly strongly on average

    2-3 is low to neutral; people tend to disagree on average

    3-4 is neutral to high; people tend to agree on average

    4-5 is very high; people agree fairly strongly on average

Iteman also provides flagging bounds for this statistic.  The minimum item mean bound represents what you consider the cut point for the item mean being too low.  The maximum item mean bound represents what you consider the cut point for the item mean being too high.

The number of categories for the items must be considered when setting the bounds of the minimum/maximum values. This is important as all items of a certain type (e.g., 3-category) might be flagged.
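A quick sketch of the same idea for polytomous items, using simulated 5-category Likert responses and hypothetical flag bounds:

```python
import numpy as np

rng = np.random.default_rng(5)
likert = rng.integers(1, 6, size=(300, 8))       # hypothetical 1-5 Likert responses, 8 items

item_means = likert.mean(axis=0)

# Hypothetical bounds for a 5-category rating scale (midpoint = 3.0)
flags = [(j, round(m, 2)) for j, m in enumerate(item_means) if m < 2.0 or m > 4.0]
print(np.round(item_means, 2), flags)
```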

Item Discrimination: Dichotomous

In psychometrics, discrimination is a GOOD THING, even though the word often has a negative connotation in general. The entire point of an exam is to discriminate amongst examinees; smart students should get a high score and not-so-smart students should get a low score. If everyone gets the same score, there is no discrimination and no point in the exam! Item discrimination evaluates this concept.

CTT uses the point-biserial item-total correlation (Rpbis) as its primary statistic for this.

The Pearson point-biserial correlation (r-pbis) is a measure of the discrimination or differentiating strength, of the item. It ranges from −1.0 to 1.0 and is a correlation of item scores and total raw scores.  If you consider a scored data matrix (multiple-choice items converted to 0/1 data), this would be the correlation between the item column and a column that is the sum of all item columns for each row (a person’s score).

A good item is able to differentiate between examinees of high and low ability, and will therefore have a higher point-biserial, but rarely above 0.50. A negative point-biserial is indicative of a very poor item because it means that the high-ability examinees are answering incorrectly, while the low examinees are answering it correctly, which of course would be bizarre, and therefore typically indicates that the specified correct answer is actually wrong. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random “noise.”  Here are some general guidelines on interpretation.  Note that these assume a decent sample size; if you only have a small number of examinees, many item statistics will be flagged!

    0.20+ = Good item; smarter examinees tend to get the item correct

    0.10-0.20 = OK item; but probably review it

    0.0-0.10 = Marginal item quality; should probably be revised or replaced

    <0.0 = Terrible item; replace it

***Major red flag is if the correct answer has a negative Rpbis and a distractor has a positive Rpbis

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the Rpbis be as high as possible.

The biserial correlation is also a measure of the discrimination or differentiating strength, of the item. It ranges from −1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item was a continuous measure of the trait. Since the biserial is an estimate of Pearson’s r it will be larger in absolute magnitude than the corresponding point-biserial.

The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).

Item Discrimination: Polytomous

The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from −1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the Rpbis are equivalent for a 2-category item, so guidelines for interpretation remain unchanged.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Since the typical r correlation (0.5) will be larger than the typical Rpbis (0.3) correlation, you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the Rpbis be as high as possible.

The eta coefficient is an additional index of discrimination computed using an analysis of variance with the item response as the independent variable and total score as the dependent variable. The eta coefficient is the ratio of the between-groups sum of squares to the total sum of squares and has a range of 0 to 1. The eta coefficient does not assume that the item responses are continuous and also does not assume a linear relationship between the item response and total score.

As a result, the eta coefficient will always be equal to or greater than Pearson’s r. Note that the biserial correlation will be reported if the item has only 2 categories.

Key and Distractor Analysis

In the case of many item types, it pays to evaluate the answers. A distractor is an incorrect option. We want to make sure that more examinees are not selecting a distractor than the key (P value) and also that no distractor has higher discrimination. The latter would mean that smart students are selecting the wrong answer, and not-so-smart students are selecting what is supposedly correct. In some cases, the item is just bad. In others, the answer is just incorrectly recorded, perhaps by a typo. We call this a miskey of the item. In both cases, we want to flag the item and then dig into the distractor statistics to figure out what is wrong.
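Conceptually, distractor analysis is just the P value and point-biserial computed for every option rather than only the key. Here is a minimal sketch with simulated responses (the data and the key are hypothetical); a miskey would show up as a “distractor” with a higher P value or a positive Rpbis compared to the key.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical raw responses (A-D) for one item, plus total scores on the rest of the test.
responses = rng.choice(list("ABCD"), size=500, p=[0.62, 0.15, 0.13, 0.10])
totals = rng.normal(70, 10, 500) + 8 * (responses == "A")   # key "A" chosen more by high scorers
key = "A"

for option in "ABCD":
    chose = (responses == option).astype(float)
    p = chose.mean()                                  # proportion selecting this option
    rpbis = np.corrcoef(chose, totals)[0, 1]          # option-total point-biserial
    label = "KEY" if option == key else "distractor"
    print(f"{option} ({label}): P = {p:.2f}, Rpbis = {rpbis:+.2f}")
```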

Iteman Psychometric Item Analysis

Example

Here is an example output for one item from our  Iteman  software, which you can download for free. You might also be interested in this video.  This is a very well-performing item.  Here are some key takeaways.

  • This is a 4-option multiple choice item
  • It was on a subscore named “Example subscore”
  • This item was seen by 736 examinees
  • 70% of students answered it correctly, so it was fairly easy, but not too easy
  • The Rpbis was 0.53 which is extremely high; the item is good quality
  • The line for the correct answer in the quantile plot has a clear positive slope, which reflects the high discrimination quality
  • The proportion of examinees selecting the wrong answers was nicely distributed, not too high, and with negative Rpbis values. This means the distractors are sufficiently incorrect and not confusing.

 

Item Analysis with Item Response Theory

Item Response Theory (IRT) is a very sophisticated paradigm of item analysis and tackles numerous psychometric tasks, from item analysis to equating to adaptive testing. It requires much larger sample sizes than CTT (100-1000 responses per item) and extensive expertise (typically a PhD psychometrician). Maximum Likelihood Estimation (MLE) is a key concept in IRT used to estimate model parameters for better accuracy in assessments.

IRT isn’t suitable for small-scale exams like classroom quizzes. However, it is used by virtually every “real” exam you will take in your life, from K-12 benchmark exams to university admissions to professional certifications.

If you haven’t used IRT, I recommend you check out this blog post first.

Item Difficulty

IRT evaluates item difficulty for dichotomous items as a b-parameter, which is sort of like a z-score for the item on the bell curve: 0.0 is average, 2.0 is hard, and -2.0 is easy. (This can differ somewhat with the Rasch approach, which rescales everything.) In the case of polytomous items, there is a b-parameter for each threshold, or step between points.
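To make the b-parameter interpretation concrete, here is a tiny sketch using the 2PL model with hypothetical parameters: an examinee whose theta equals the item's b has a 50% chance of answering correctly, and lower-theta examinees fall off from there.

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# For an item with b = 1.0 (hard) and a = 1.2, an average examinee (theta = 0) struggles,
# while an examinee located exactly at theta = b has a 50% chance.
print(round(p_correct_2pl(0.0, 1.2, 1.0), 2))   # ~0.23
print(round(p_correct_2pl(1.0, 1.2, 1.0), 2))   # 0.50
```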

Item Discrimination

IRT evaluates item discrimination by the slope of its item response function, which is called the a-parameter. Often, values above 0.80 are good and below 0.80 are less effective.

Key and Distractor Analysis

Xcalibre-poly-output

In the case of polytomous items, the multiple b-parameters provide an evaluation of the different answers. For dichotomous items, the IRT modeling does not distinguish amongst correct answers. Therefore, we utilize the CTT approach for distractor analysis. This remains extremely important for diagnosing issues in multiple choice items.

Example

Here is an example of what output from an IRT analysis program (Xcalibre) looks like. You might also be interested in this video.

  • Here, we have a polytomous item, such as an essay scored from 0 to 3 points.
  • It is calibrated with the generalized partial credit model.
  • It has strong classical discrimination (0.62)
  • It has poor IRT discrimination (0.466)
  • The average raw score was 2.314 out of 3.0, so fairly easy
  • There was a sufficient distribution of responses over the four point levels
  • The boundary parameters are not in sequence; this item should be reviewed

 

Summary

This article is a very broad overview and does not do justice to the complexity of psychometrics and the art of diagnosing/revising items!  I recommend that you download some of the item analysis software and start exploring your own data.

For additional reading, I recommend some of the common textbooks.  For more on how to write/revise items, check out Haladyna (2004) and subsequent works.  For item response theory, I highly recommend Embretson & Reise (2000).

 

So, yeah, the use of “hacks” in the title is definitely on the ironic and gratuitous side, but there is still a point to be made: are you making full use of current technology to keep your tests secure?  Gone are the days when you are limited to linear test forms on paper in physical locations.  Here are some quick points on how modern assessment technology can deliver assessments more securely, effectively, and efficiently than traditional methods:

1.  AI delivery like CAT and LOFT

Psychometrics was one of the first areas to apply modern data science and machine learning (see this blog post for a story about a MOOC course).  But did you know it was also one of the first areas to apply artificial intelligence (AI)?  Early forms of computerized adaptive testing (CAT) were suggested in the 1960s and had become widely available in the 1980s.  CAT delivers a unique test to each examinee by using complex algorithms to personalize the test.  This makes it much more secure, and can also reduce test length by 50-90%.

2. Psychometric forensics

Modern psychometrics has suggested many methods for finding cheaters and other invalid test-taking behavior.  These can range from very simple rules like flagging someone for having a top 5% score in a bottom 5% time, to extremely complex collusion indices.  These approaches are designed explicitly to keep your test more secure.
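As an example of the simplest sort of rule, here is a sketch that flags examinees in the top 5% of scores but the bottom 5% of testing time, using simulated data; real forensics would combine many such flags with collusion and similarity indices.

```python
import numpy as np

rng = np.random.default_rng(7)
scores = rng.normal(70, 10, 1000)          # hypothetical percent-correct scores
times = rng.normal(45, 12, 1000)           # hypothetical total testing times (minutes)

score_cut = np.percentile(scores, 95)      # top 5% score
time_cut = np.percentile(times, 5)         # bottom 5% time

flagged = np.where((scores >= score_cut) & (times <= time_cut))[0]
print(len(flagged), "examinees flagged for unusually fast, high-scoring attempts")
```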

3. Tech enhanced items

Tech enhanced items (TEIs) are test questions that leverage technology to be more complex than is possible on paper tests.  Classic examples include drag and drop or hotspot items.  These items are harder to memorize and therefore contribute to security.

4. IP address limits

Suppose you want to make sure that your test is only delivered in certain school buildings, campuses, or other geographic locations.  You can build a test delivery platform that limits your tests to a range of IP addresses, which implements this geographic restriction.
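Conceptually, the restriction is just an allow-list check on the examinee's IP address. Here is a minimal sketch using Python's standard ipaddress module, with hypothetical campus networks:

```python
import ipaddress

# Hypothetical allow-list: only deliver the exam to devices inside these campus networks.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24"),
                    ipaddress.ip_network("198.51.100.0/25")]

def ip_allowed(client_ip: str) -> bool:
    """Return True if the examinee's IP address falls in an allowed range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)

print(ip_allowed("203.0.113.42"))   # True
print(ip_allowed("192.0.2.7"))      # False
```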

5. Lockdown browser

A lockdown browser is a special software that locks a computer screen onto a test in progress, so for example a student cannot open Google in another tab and simply search for answers.  Advanced versions can also scan the computer for software that is considered a threat, like a screen capture software.

6. Identity verification

Tests can be built to require unique login procedures, such as requiring a proctor to enter their employee ID and the test-taker to enter their student ID.  Examinees can also be required to show photo ID, and of course, there are new biometric methods being developed.

7. Remote proctoring

The days are gone when you need to hop in the car and drive 3 hours to sit in a windowless room at a community college to take a test.  Nowadays, proctors can watch you and your desktop via webcam.  This is arguably as secure as in-person proctoring, and certainly more convenient and cost-effective.

So, how can I implement these to deliver assessments more securely?

Some of these approaches are provided by vendors specifically dedicated to that space, such as ProctorExam for remote proctoring.  However, if you use ASC’s FastTest platform, all of these methods are available for you right out of the box.  Want to see for yourself?  Sign up for a free account!


Do you conduct adaptive testing research? Perhaps a thesis or dissertation? Or maybe you have developed adaptive tests and have a technical report or validity study? I encourage you to check out the Journal of Computerized Adaptive Testing as a publication outlet for your adaptive testing research. JCAT is the official journal of the International Association for Computerized Adaptive Testing (IACAT), a nonprofit organization dedicated to improving the science of assessments.

JCAT has an absolutely stellar board of editors and was founded to focus on improving the dissemination of research in adaptive testing. The IACAT website also contains a comprehensive bibliography of research in adaptive testing, across all journals and tech reports, for the past 50 years.  IACAT was founded at the 2009 conference on computerized adaptive testing and has since held conferences every other year as well as hosting the JCAT journal.

Potential research topics at the JCAT journal

Here are some of the potential research topics:


  • Item selection algorithms
  • Item exposure algorithms
  • Termination criteria
  • Cognitive diagnostic models
  • Simulation studies
  • Validation studies
  • Item response theory models
  • Multistage testing
  • Use of adaptive testing in new(er) situations, like patient reported outcomes
  • Design of actual adaptive assessments and their release into the wild

If you are not involved in CAT research but are interested, please visit the IACAT and journal website to read the articles.  Access is free.  JCAT would also appreciate it if you would share this information with colleagues so that they might consider publication.