Test Development Archives

$math educational assessment$

Paper-and-Pencil Testing: Still Around?

AdminAugust 16, 2021

Paper-and-pencil testing used to be the only way to deliver assessments at scale. The introduction of computer-based testing (CBT) in the 1980s was a revelation – higher fidelity item types, immediate scoring & feedback, and scalability all changed with the advent of the personal computer and then later the internet. Delivery mechanisms including remote proctoring provided students with the ability to take their exams anywhere in the world. This all exploded tenfold when the pandemic arrived. So why are some exams still offline, with paper and pencil?

Many education institutions are confused about which examination models to stick to. Should you go on with the online model they used when everyone was stuck in their homes? Should you adopt multi-modal examination models, or should you go back to the traditional pen-and-paper method?

This blog post will provide you with an evaluation of whether paper-and-pencil exams are still worth it in 2021.

Paper-and-pencil testing; The good, the bad, and the ugly

The Good

Answer Bubble Sheet Orange Offline exams have been a stepping stone towards the development of modern assessment models that are more effective. We can’t ignore the fact that there are several advantages of traditional exams.

Some advantages of paper-and-pencil testing include students having familiarity with the system, development of a social connection between learners, exemption from technical glitches, and affordability. Some schools don’t have the resources and pen-and-paper assessments are the only option available.

This is especially true in areas of the world that do not have the internet bandwidth or other technology necessary to deliver internet-based testing.

Another advantage of paper exams is that they can often work better for students with special needs, such as blind students which need a reader.

Paper and pencil testing is often more cost-efficient in certain situations where the organization does not have access to a professional assessment platform or learning management system.

The Bad and The Ugly

However, the paper-and-pencil testing does have a number of shortfalls.

1. Needs a lot of resources to scale

Delivery of paper-and-pencil testing at large scale requires a lot of resources. You are printing and shipping, sometimes with hundreds of trucks around the country. Then you need to get all the exams back, which is even more of a logistical lift.

2. Prone to cheating

Most people think that offline exams are cheat-proof but that is not the case. Most offline exams count on invigilators and supervisors to make sure that cheating does not occur. However, many pen-and-paper assessments are open to leakages. High candidate-to-ratio is another factor that contributes to cheating in offline exams.

3. Poor student engagement

We live in a world of instant gratification and that is the same when it comes to assessments. Unlike online exams which have options to keep the students engaged, offline exams are open to constant destruction from external factors.

Offline exams also have few options when it comes to question types.

4. Time to score

“To err is human.” But, when it comes to assessments, accuracy, and consistency. Traditional methods of hand-scoring paper tests are slow and labor-intensive. Instructors take a long time to evaluate tests. This defeats the entire purpose of assessments.

5. Poor result analysis

Pen-and-paper exams depend on instructors to analyze the results and come up with insight. This requires a lot of human resources and expensive software. It is also difficult to find out if your learning strategy is working or it needs some adjustments.

6. Time to release results

Online exams can be immediate. If you ship paper exams back to a single location, score them, perform psychometrics, then mail out paper result letters? Weeks.

7. Slow availability of results to analyze

Similarly, psychometricians and other stakeholders do not have immediate access to results. This prevents psychometric analysis, timely feedback to students/teachers, and other issues.

8. Accessibility

Online exams can be built with tools for zoom, color contrast changes, automated text-to-speech, and other things to support accessibility.

9. Convenience

Online tests are much more easily distributed. If you publish one on the cloud, it can immediately be taken, anywhere in the world.

10. Support for diversified question types

Unlike traditional exams which are limited to a certain number of question types, online exams offer many question types. Videos, audio, drag and drop, high-fidelity simulations, gamification, and much more are possible.

11. Lack of modern psychometrics

Paper exams cannot use computerized adaptive testing, linear-on-the-fly testing, process data, computational psychometrics, and other modern innovations.

12. Environmental friendliness

Sustainability is an important aspect of modern civilization. Online exams eliminate the need to use resources that are not environmentally friendly such as paper.

Conclusion

Is paper-and-pencil testing still useful? In most situations, it is not. The disadvantages outweigh the advantages. However, there are many situations where paper remains the only option, such as poor tech infrastructure.

How ASC Can Help

Transitioning from paper-and-pencil testing to the cloud is not a simple task. That is why ASC is here to help you every step of the way, from test development to delivery. We provide you with the best assessment software and access to the most experienced team of psychometricians. Ready to take your assessments online? Contact us!

August 16, 2021/by Admin

Power of linear on the fly testing

Nathan Thompson, PhDAugust 2, 2021

Linear on the fly testing (LOFT) is an approach to assessment delivery that increases test security by limiting item exposure. It tries to balance the advantages of linear testing (e.g., everyone sees the same number of items, which feels fairer) with the advantages of algorithmic exams (e.g., creating a unique test for everyone).

In general, there are two families of test delivery. Static approaches deliver the same test form or forms to everyone; this is the ubiquitous and traditional “linear” method of testing. Algorithmic approaches deliver the test to each examinee based on a computer algorithm; this includes LOFT, computerized adaptive testing (CAT), and multistage testing (MST).

What is LOFT?

The purpose of linear on the fly testing is to give every examinee a linear form that is uniquely created for them – but each one is created to be psychometrically equivalent to all others to ensure fairness. For example, we might have a pool of 200 items, and every person only gets 100, but that 100 is balanced for each person. This can be done by ensuring content and/or statistical equivalency, as well ancillary metadata such as item types or cognitive level.

Content Equivalence

This portion is relatively straightforward. If your test blueprint calls for 20 items in each of 5 domains, for a total of 100 items, then each form administered to examinees should follow this blueprint. Sometimes the content blueprint might go 2 or even 3 levels deep.

Statistical Equivalence

There are, of course, two predominant psychometric paradigms: classical test theory (CTT) and item response theory (IRT). With CTT, forms can easily be built to have an equivalent P value, and therefore expected mean score. If point-biserial statistics are available for each item, you can also design the algorithm to design forms that have the same standard deviation and reliability.

With item response theory, the typical approach is to design forms to have the same test information function, or inversely, conditional standard error of measurement function. To learn more about how these are implemented, read this blog post about IRT or download our Classical Form Assembly Tool.

Implementing LOFT

LOFT is typically implemented by publishing a pool of items with an algorithm to select subsets that meet the requirements. Therefore, you need a psychometrically sophisticated testing engine that stores the necessary statistics and item metadata, lets you define a pool of items, specify the relevant options such as target statistics and blueprints, and deliver the test in a secure manner. Very few testing platforms can implement a quality LOFT assessment. ASC’s platform does; click here to request a demo.

Benefits of Using LOFT in Testing

It certainly is not easy to build a strong item bank, design LOFT pools, and develop a complex algorithm that meets the content and statistical balancing needs. So why would an organization use linear on the fly testing?

Well, it is much more secure than having a few linear forms. Since everyone receives a unique form, it is impossible for words to get out about what the first questions on the test are. And of course, we could simply perform a random selection of 100 items from a pool of 200, but that would be potentially unfair. Using LOFT will ensure the test remains fair and defensible.

August 2, 2021/by Nathan Thompson, PhD

What is a Standard Setting Study?

Nathan Thompson, PhDMay 21, 2021

A standard setting study is a formal, quantitative process for establishing a performance standard on an exam, such as what score is “proficient” or “passing.” This is typically manifested as a cutscore which is then used for making decisions about people: hire them, pass them, accept them into university, etc. Because it is used for such important decisions, a lot of work goes into standard setting, using methods based on scientific research.

What is NOT standard setting?

In the assessment world, there are actually three uses of the word standard:

A formal definition of the content that is being tested, such as the Common Core State Standards in the USA.
A formalized process for delivering exams, as seen in the phrase “standardized testing.”
A benchmark for performance, like we are discussing here.

For this reason, I prefer the term cutscore study, but the phrase standard setting is used more often.

How is a standard setting study used?

As part of a comprehensive test development cycle, after item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualified as “pass” or a similar classification. This cannot be done arbitrarily, such as setting it at 70% because that’s what you saw when you were in school. That is a legal landmine! To be legally defensible and eligible for Accreditation of a Certification Program, it must be done using one of several standard-setting approaches from the psychometric literature. So, if your organization is classifying examinees into Pass/Fail, Hire/NotHire, Basic/Proficient/Advanced, or any other groups, you most likely need a standard setting study. This is NOT limited to certification, although it is often discussed in that pass/fail context.

What are some methods of a standard setting study?

There have been many methods suggested in the scientific literature of psychometrics. They are often delineated into examinee-centered and item-centered approaches. Angoff and Bookmark are designed around evaluating items, while Contrasting Groups and Borderline Groups are designed around evaluating the distributions of actual examinee scores. The Bookmark approach is sort of both types, however, because it uses examinee performance on the items as the object of interest. You may also be interested in reading this introductory post on setting a cutscore using item response theory.

Angoff

In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly. If we take the average of all raters, this then translates into the average percentage-correct score that the raters expect from a minimally competent candidate – a very compelling argument for a cutscore to pass competent examinees! It is often done in tandem with the Beuk Compromise. The Angoff method does not require actual examinee data, though the Beuk does.

Bookmark

The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be. This process requires a sufficient amount of real data to calibrate item difficulty accurately, typically using item response theory, which necessitates data from several hundred examinees. Additionally, the method ensures that the cutscore is both valid and reliable, reflecting the true proficiency needed for the test.

Contrasting Groups

With the contrasting groups approach, candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other unrelated standard. We can then compare the score distributions on our exam for the two separate groups, and pick a cutscore that best differentiates Pass vs Fail on the other standard. An example of this is below. If using data from another exam, a sample of at least 50 candidates is obviously needed, since you are evaluating distributions.

Borderline Group

The Borderline Group method is similar to the Contrasting Groups method, but it defines a borderline group using alternative information, such as biodata, and evaluates the scores of this group. This method involves selecting individuals who are deemed to be on the threshold of passing or failing based on external criteria. These criteria might include previous performance data, demographic information, or other relevant biodata. The scores from this borderline group are then analyzed to determine the cutscore. This approach helps in refining the accuracy of the cutscore by incorporating more nuanced and contextual information about the test-takers.

Hofstee

The Hofstee approach is often used as a reality check for the modified-Angoff method but can also stand alone as a method for setting cutscores. It involves only a few estimates from a panel of SMEs. Specifically, the SMEs provide estimates for the minimum and maximum acceptable failure rates and the minimum and maximum acceptable scores. This data is then plotted to determine a compromise cutscore that balances these criteria. The simplicity and practicality of the Hofstee approach make it a valuable tool in various testing scenarios, ensuring the cutscore is both realistic and justifiable.

Ebel

The Ebel approach categorizes test items by both their importance and difficulty level. This method involves a panel of experts who rate each item on these two dimensions, creating a matrix that helps in determining the cutscore. Despite its thorough and structured approach, the Ebel method is considered very old and has largely fallen out of use in modern testing practices. Advances in psychometric techniques and the development of more efficient and accurate methods, such as item response theory, have led to the Ebel approach being replaced by more contemporary standard-setting techniques.

How to choose an approach?

There is often no specifically correct answer. In fact, guidelines like NCCA do not lay out which method to use, they just tell you to use an appropriate method.

There are several considerations. Perhaps the most important is whether you have existing data. The Bookmark, Contrasting Groups, and Borderline Group approaches all assume that we have data from a test already delivered, which we can analyze with the perspective of the latent standard. The Angoff and Hofstee approaches, in contrast, can be done before a test is ever delivered. This is arguably less defensible, but is a huge practical advantage.

The choice also depends on whether you can easily recruit a panel of subject matter experts, as that is required for Angoff and Bookmark. The Contrasting Groups method assumes we have a gold standard, which is rare.

How can I implement a standard setting study?

If your organization has an in-house psychometrician, they can usually do this. If, for example, you are a board of experts in a profession but lack experience in psychometrics, you need to hire a firm. We can perform such work for you – contact us to learn more.

May 21, 2021/by Nathan Thompson, PhD

What is Item Banking? What are Item Banks?

AdminApril 12, 2021

Item banking refers to the purposeful creation of a database of assessment items to serve as a central repository of all test content, improving efficiency and quality. The term item refers to what many call questions; though their content need not be restricted as such and can include problems to solve or situations to evaluate in addition to straightforward questions. Regular item review is essential to ensure that each item meets content standards, is fair, and is free from bias, thereby maintaining the integrity and accuracy of the item bank. As a critical foundation to the test development cycle, item banking is the foundation for the development of valid, reliable content and defensible test forms.

Automated item banking systems, such as Assess.ai or FastTest, result in significantly reduced administrative time for developing/reviewing items and assembling/publishing tests, while producing exams that have greater reliability and validity. Contact us to request a free account.

Request demo account

What is Item Banking?

While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Here are the essentials your should be looking for:

Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally, item performance should be tracked not only within a test form but across test forms as well.

Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. Upon exposure to many examinees, items can then be flagged for retirement or revision to reduce cheating or teaching to the test.

Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization methods, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.

Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.

Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content and the users who made these changes are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item and expedite that process.

Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, item response theory parameters, and classical test theory statistics, but there are likely many data points specific to your organization that is worth storing.

Managing an Item Bank

Names are important. As you create or import your item banks it is important to identify each item with a unique, but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting. You might want to also add additional pieces of information. If importing, the system should be smart enough to recognize duplicates.

Search and filter. The system should also have a reliable sorting mechanism.

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department, or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability, and validity.

Statistics are used in the assembly of test forms while classical statistics can be used to estimate mean, standard deviation, reliability, standard error, and pass rate.

Item response theory parameters can come in handy when calculating test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.

In the event that your organization decides to publish an adaptive test, utilizing computerized adaptive testing delivery, item parameters for each item will be essential. This is because they are used for intelligent selection of items and scoring examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility and the storage of metadata is one such vital piece of documentation.

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar matter. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test-takers, as they are likely to be diverse as well!

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. In the event of a high school science test, this may include the evaluation of the content taught in class. A high-stakes certification exam, almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed.

Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.

The Benefits of Item Banking

There is no doubt that item banking is a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment. Moreover, good item banking will make the job easier and more efficient thus reducing the cost of item development and test publishing.

Ready to Improve assessment quality through item banking?

Visit our Contact Us page, where you can request a demonstration or a free account (up to 500 items). I also recommend you watch this tutorial video.

April 12, 2021/by Admin

Item Analysis and Statistics

Nathan Thompson, PhDMarch 31, 2021

Item analysis is the statistical evaluation of test questions to ensure they are good quality, and fix them if they are not. This is a key step in the test development cycle; after items have been delivered to examinees (either as a pilot, or in full usage), we analyze the statistics to determine if there are issues which affect validity and reliability, such as being too difficult or biased. This post will describe the basics of this process. If you’d like further detail and instructions on using software, you can also you can also check out our tutorial videos on our YouTube channel and download our free psychometric software.

Download a free copy of Iteman: Software for Item Analysis

What is Item Analysis?

Item analysis refers to the process of statistically analyzing assessment data to evaluate the quality and performance of your test items. This is an important step in the test development cycle, not only because it helps improve the quality of your test, but because it provides documentation for validity: evidence that your test performs well and score interpretations mean what you intend. It is one of the most common applications of psychometrics, by using item statistics to flag, diagnose, and fix the poorly performing items on a test. Every item that is poorly performing is potentially hurting the examinees.

Item analysis boils down to two goals:

Find the items that are not performing well (difficulty and discrimination, usually)
Figure out WHY those items are not performing well, so we can determine whether to revise or retire them

There are different ways to evaluate performance, such as whether the item is too difficult/easy, too confusing (not discriminating), miskeyed, or perhaps even biased to a minority group.

Moreover, there are two completely different paradigms for this analysis: classical test theory (CTT) and item response theory (IRT). On top of that, the analyses can differ based on whether the item is dichotomous (right/wrong) or polytomous (2 or more points).

Because of the possible variations, item analysis complex topic. But, that doesn’t even get into the evaluation of test performance. In this post, we’ll cover some of the basics for each theory, at the item level.

How to do Item Analysis

1. Prepare your data for item analysis

Most psychometric software utilizes a person x item matrix. That is, a data file where examinees are rows and items are columns. Sometimes, it is a sparse matrix where is a lot of missing data, like linear on the fly testing. You will also need to provide metadata to the software, such as your Item IDs, correct answers, item types, etc. The format for this will differ by software.

2. Run data through item analysis software

To implement item analysis, you should utilize dedicated software designed for this purpose. If you utilize an online assessment platform, it will provide you output for item analysis, such as distractor P values and point-biserials (if not, it isn’t a real assessment platform). In some cases, you might utilize standalone software. CITAS provides a simple spreadsheet-based approach to help you learn the basics, completely for free. A screenshot of the CITAS output is here. However, professionals will need a level above this. Iteman and Xcalibre are two specially-designed software programs from ASC for this purpose, one for CTT and one for IRT.

3. Interpret results of item analysis

Item analysis software will produce tables of numbers. Sometimes, these will be ugly ASCII-style tables from the 1980s. Sometimes, they will be beautiful Word docs with graphs and explanations. Either way, you need to interpret the statistics to determine which items have problems and how to fix them. The rest of this article will delve into that.

Item Analysis with Classical Test Theory

Classical Test Theory provides a simple and intuitive approach to item analysis. It utilizes nothing more complicated than proportions, averages, counts, and correlations. For this reason, it is useful for small-scale exams or use with groups that do not have psychometric expertise.

Item Difficulty: Dichotomous

CTT quantifies item difficulty for dichotomous items as the proportion (P value) of examinees that correctly answer it.

It ranges from 0.0 to 1.0. A high value means that the item is easy, and a low value means that the item is difficult. There are no hard and fast rules because interpretation can vary widely for different situations. For example, a test given at the beginning of the school year would be expected to have low statistics since the students have not yet been taught the material. On the other hand, a professional certification exam, where someone can not even sit unless they have 3 years of experience and a relevant degree, might have all items appear easy even though they are quite advanced topics! Here are some general guidelines”

0.95-1.0 = Too easy (not doing much good to differentiate examinees, which is really the purpose of assessment)

0.60-0.95 = Typical

0.40-0.60 = Hard

<0.40 = Too hard (consider that a 4 option multiple choice has a 25% chance of pure guessing)

With Iteman, you can set bounds to automatically flag items. The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, which means that 50% of the examinees have answered the item correctly.

For a test where we expect examinees to perform poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing; if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be 0.20. The maximum P value represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees. In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.

Note that because the scale is inverted (lower value means higher difficulty), this is sometimes referred to as item facility.

The Item Mean (Polytomous)

This refers to an item that is scored with 2 or more point levels, like an essay scored on a 0-4 point rubric or a Likert-type item that is “Rate on a scale of 1 to 5.”

1=Strongly Disagree
2=Disagree
3=Neutral
4=Agree
5=Strongly Agree

The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.

You will have to adjust for your own situation, but here is an example for the 5-point Likert-style item.

1-2 is very low; people disagree fairly strongly on average

2-3 is low to neutral; people tend to disagree on average

3-4 is neutral to high; people tend to agree on average

4-5 is very high; people agree fairly strongly on average

Iteman also provides flagging bounds for this statistic. The minimum item mean bound represents what you consider the cut point for the item mean being too low. The maximum item mean bound represents what you consider the cut point for the item mean being too high.

The number of categories for the items must be considered when setting the bounds of the minimum/maximum values. This is important as all items of a certain type (e.g., 3-category) might be flagged.

Item Discrimination: Dichotomous

In psychometrics, discrimination is a GOOD THING, even though the word often has a negative connotation in general. The entire point of an exam is to discriminate amongst examinees; smart students should get a high score and not-so-smart students should get a low score. If everyone gets the same score, there is no discrimination and no point in the exam! Item discrimination evaluates this concept.

CTT uses the point-biserial item-total correlation (Rpbis) as its primary statistic for this.

The Pearson point-biserial correlation (r-pbis) is a measure of the discrimination or differentiating strength, of the item. It ranges from −1.0 to 1.0 and is a correlation of item scores and total raw scores. If you consider a scored data matrix (multiple-choice items converted to 0/1 data), this would be the correlation between the item column and a column that is the sum of all item columns for each row (a person’s score).

A good item is able to differentiate between examinees of high and low ability yet have a higher point-biserial, but rarely above 0.50. A negative point-biserial is indicative of a very poor item because it means that the high-ability examinees are answering incorrectly, while the low examinees are answering it correctly, which of course would be bizarre, and therefore typically indicates that the specified correct answer is actually wrong. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random “noise.” Here are some general guidelines on interpretation. Note that these assume a decent sample size; if you only have a small number of examinees, many item statistics will be flagged!

0.20+ = Good item; smarter examinees tend to get the item correct

0.10-0.20 = OK item; but probably review it

0.0-0.10 = Marginal item quality; should probably be revised or replaced

<0.0 = Terrible item; replace it

***Major red flag is if the correct answer has a negative Rpbis and a distractor has a positive Rpbis

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced. The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the Rpbis be as high as possible.

The biserial correlation is also a measure of the discrimination or differentiating strength, of the item. It ranges from −1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item was a continuous measure of the trait. Since the biserial is an estimate of Pearson’s r it will be larger in absolute magnitude than the corresponding point-biserial.

The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).

Item Discrimination: Polytomous

The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from −1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the Rpbis are equivalent for a 2-category item, so guidelines for interpretation remain unchanged.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Since the typical r correlation (0.5) will be larger than the typical Rpbis (0.3) correlation, you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced. The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the Rpbis be as high as possible.

The eta coefficient is an additional index of discrimination computed using an analysis of variance with the item response as the independent variable and total score as the dependent variable. The eta coefficient is the ratio of the between-groups sum of squares to the total sum of squares and has a range of 0 to 1. The eta coefficient does not assume that the item responses are continuous and also does not assume a linear relationship between the item response and total score.

As a result, the eta coefficient will always be equal or greater than Pearson’s r. Note that the biserial correlation will be reported if the item has only 2 categories.

Key and Distractor Analysis

In the case of many item types, it pays to evaluate the answers. A distractor is an incorrect option. We want to make sure that more examinees are not selecting a distractor than the key (P value) and also that no distractor has higher discrimination. The latter would mean that smart students are selecting the wrong answer, and not-so-smart students are selecting what is supposedly correct. In some cases, the item is just bad. In others, the answer is just incorrectly recorded, perhaps by a typo. We call this a miskey of the item. In both cases, we want to flag the item and then dig into the distractor statistics to figure out what is wrong.

Example

Here is an example output for one item from our Iteman software, which you can download for free. You might also be interested in this video. This is a very well-performing item. Here are some key takeaways.

This is a 4-option multiple choice item
It was on a subscore named “Example subscore”
This item was seen by 736 examinees
70% of students answered it correctly, so it was fairly easy, but not too easy
The Rpbis was 0.53 which is extremely high; the item is good quality
The line for the correct answer in the quantile plot has a clear positive slope, which reflects the high discrimination quality
The proportion of examinees selecting the wrong answers was nicely distributed, not too high, and with negative Rpbis values. This means the distractors are sufficiently incorrect and not confusing.

Item Analysis with Item Response Theory

Item Response Theory (IRT) is a very sophisticated paradigm of item analysis and tackles numerous psychometric tasks, from item analysis to equating to adaptive testing. It requires much larger sample sizes than CTT (100-1000 responses per item) and extensive expertise (typically a PhD psychometrician). Maximum Likelihood Estimation (MLE) is a key concept in IRT used to estimate model parameters for better accuracy in assessments.

IRT isn’t suitable for small-scale exams like classroom quizzes. However, it is used by virtually every “real” exam you will take in your life, from K-12 benchmark exams to university admissions to professional certifications.

If you haven’t used IRT, I recommend you check out this blog post first.

Item Difficulty

IRT evaluates item difficulty for dichotomous items as a b-parameter, which is sort of like a z-score for the item on the bell curve: 0.0 is average, 2.0 is hard, and -2.0 is easy. (This can differ somewhat with the Rasch approach, which rescales everything.) In the case of polytomous items, there is a b-parameter for each threshold, or step between points.

Item Discrimination

IRT evaluates item discrimination by the slope of its item response function, which is called the a-parameter. Often, values above 0.80 are good and below 0.80 are less effective.

Key and Distractor Analysis

In the case of polytomous items, the multiple b-parameters provide an evaluation of the different answers. For dichotomous items, the IRT modeling does not distinguish amongst correct answers. Therefore, we utilize the CTT approach for distractor analysis. This remains extremely important for diagnosing issues in multiple choice items.

Example

Here is an example of what output from an IRT analysis program (Xcalibre) looks like. You might also be interested in this video.

Here, we have a polytomous item, such as an essay scored from 0 to 3 points.
It is calibrated with the generalized partial credit model.
It has strong classical discrimination (0.62)
It has poor IRT discrimination (0.466)
The average raw score was 2.314 out of 3.0, so fairly easy
There was a sufficient distribution of responses over the four point levels
The boundary parameters are not in sequence; this item should be reviewed

Summary

This article is a very broad overview and does not do justice to the complexity of psychometrics and the art of diagnosing/revising items! I recommend that you download some of the item analysis software and start exploring your own data.

For additional reading, I recommend some of the common textbooks. For more on how to write/revise items, check out Haladyna (2004) and subsequent works. For item response theory, I highly recommend Embretson & Riese (2000).

March 31, 2021/by Nathan Thompson, PhD

Seven Technology Hacks to Deliver Assessments More Securely

Nathan Thompson, PhDMarch 27, 2021

So, yeah, the use of “hacks” in the title is definitely on the ironic and gratuitous side, but there is still a point to be made: are you making full use of current technology to keep your tests secure? Gone are the days when you are limited to linear test forms on paper in physical locations. Here are some quick points on how modern assessment technology can deliver assessments more securely, effectively, and efficiently than traditional methods:

1. AI delivery like CAT and LOFT

Psychometrics was one of the first areas to apply modern data science and machine learning (see this blog post for a story about a MOOC course). But did you know it was also one of the first areas to apply artificial intelligence (AI)? Early forms of computerized adaptive testing (CAT) were suggested in the 1960s and had become widely available in the 1980s. CAT delivers a unique test to each examinee by using complex algorithms to personalize the test. This makes it much more secure, and can also reduce test length by 50-90%.

2. Psychometric forensics

Modern psychometrics has suggested many methods for finding cheaters and other invalid test-taking behavior. These can range from very simple rules like flagging someone for having a top 5% score in a bottom 5% time, to extremely complex collusion indices. These approaches are designed explicitly to keep your test more secure.

3. Tech enhanced items

Tech enhanced items (TEIs) are test questions that leverage technology to be more complex than is possible on paper tests. Classic examples include drag and drop or hotspot items. These items are harder to memorize and therefore contribute to security.

4. IP address limits

Suppose you want to make sure that your test is only delivered in certain school buildings, campuses, or other geographic locations. You can build a test delivery platform that limits your tests to a range of IP addresses, which implements this geographic restriction.

5. Lockdown browser

A lockdown browser is a special software that locks a computer screen onto a test in progress, so for example a student cannot open Google in another tab and simply search for answers. Advanced versions can also scan the computer for software that is considered a threat, like a screen capture software.

6. Identity verification

Tests can be built to require unique login procedures, such as requiring a proctor to enter their employee ID and the test-taker to enter their student ID. Examinees can also be required to show photo ID, and of course, there are new biometric methods being developed.

7. Remote proctoring

The days are gone when you need to hop in the car and drive 3 hours to sit in a windowless room at a community college to take a test. Nowadays, proctors can watch you and your desktop via webcam. This is arguably as secure as in-person proctoring, and certainly more convenient and cost-effective.

So, how can I implement these to deliver assessments more securely?

Some of these approaches are provided by vendors specifically dedicated to that space, such as ProctorExam for remote proctoring. However, if you use ASC’s FastTest platform, all of these methods are available for you right out of the box. Want to see for yourself? Sign up for a free account!

March 27, 2021/by Nathan Thompson, PhD

IRT Test Information Function

Nathan Thompson, PhDFebruary 2, 2021

The IRT Test Information Function is a concept from item response theory (IRT) that is designed to evaluate how well an assessment differentiates examinees, and at what ranges of ability. For example, we might expect an exam composed of difficult items to do a great job in differentiating top examinees, but it is worthless for the lower half of examinees because they will be so confused and lost.

The reverse is true of an easy test; it doesn’t do any good for top examinees. The test information function quantifies this and has a lot of other important applications and interpretations.

IRT Test Information Function: how to calculate it

The test information function is not something you can calculate by hand. First, you need to estimate item-level IRT parameters, which define the item response function. The only way to do this is with specialized software; there are a few options in the market, but we recommend Xcalibre.

Next, the item response function is converted to an item information function for each item. The item information functions can then be summed into a test information function. Lastly, the test information function is often inverted into the conditional standard error of measurement function, which is extremely useful in test design and evaluation.

IRT Item Parameters

Software like Xcalibre will estimate a set of item parameters. The parameter you use depends on the item types and other aspects of your assessment.

For example, let’s just use the 3-parameter model, which estimates a, b, and c. And we’ll use a small test of 5 items. These are ordered by difficulty: item 1 is very easy and Item 5 is very hard.

Item	a	b	c
1	1.00	-2.00	0.20
2	0.70	-1.00	0.40
3	0.40	0.00	0.30
4	0.80	1.00	0.00
5	1.20	2.00	0.25

Item Response Function

The item response function uses the IRT equation to convert the parameters into a curve. The purpose of the item parameters is to fit this curve for each item, like a regression model to describe how it performs.

Here are the response functions for those 5 items. Note the scale on the x-axis, similar to the bell curve, with the easy items to the left and hard ones to the right.

Item Information Function

The item information function evaluates the calculus derivative of the item response function. An item provides more information about examinees where it provides more slope.

For example, consider Item 5: it is difficult, so it is not very useful for examinees in the bottom half of ability. The slope of the Item 5 IRF is then nearly 0 for that entire range. This then means that its information function is nearly 0.

Test Information Function

The test information function then sums up the item information functions to summarize where the test is providing information. If you imagine adding the graphs above, you can easily imagine some humps near the top and bottom of the range where there are the prominent IIFs.

Conditional Standard Error of Measurement Function

The test information function can be inverted into an estimate of the conditional standard error of measurement. What do we mean by conditional? If you are familiar with classical test theory, you know that it estimates the same standard error of measurement for everyone that takes a test.

But given the reasonable concepts above, it is incredibly unreasonable to expect this. If a test has only difficult items, then it measures top students well, and does not measure lower students well, so why should we say that their scores are just as accurate? The conditional standard error of measurement turns this into a function of ability.

Also, note that it refers to the theta scale and not to the number-correct scale.

How can I implement all this?

For starters, I recommend delving deeper into an item response theory book. My favorite is Item Response Theory for Psychologists by Embretson and Riese. Next, you need some item response theory software.

Xcalibre can be downloaded as a free version for learning and is the easiest program to learn how to use (no 1980s-style command code… how is that still a thing?). But if you are an R fan, there are plenty of resources in that community as well.

Tell me again: why are we doing this?

The purpose of all this is to effectively model how items and tests work, namely, how they interact with examinees. This then allows us to evaluate their performance so that we can improve them, thereby enhancing reliability and validity.

Classical test theory had a lot of shortcomings in this endeavor, which led to IRT being invented. IRT also facilitates some modern approaches to assessment, such as linear on-the-fly testing, adaptive testing, and multistage testing.

February 2, 2021/by Nathan Thompson, PhD

$math educational assessment$

What is Classical Item Difficulty (P Value)?

Nathan Thompson, PhDNovember 6, 2020

One of the core concepts in psychometrics is item difficulty. This refers to the probability that examinees will get the item correct for educational/cognitive assessments or respond in the keyed direction with psychological/survey assessments (more on that later). Difficulty is important for evaluating the characteristics of an item and whether it should continue to be part of the assessment; in many cases, items are deleted if they are too easy or too hard. It also allows us to better understand how the items and test as a whole operate as a measurement instrument, and what they can tell us about examinees.

I’ve heard of “item facility.” Is that similar?

Item difficulty is also called item facility, which is actually a more appropriate name. Why? The P value is a reverse of the concept: a low value indicates high difficulty, and vice versa. If we think of the concept as facility or easiness, then the P value aligns with the concept; a high value means high easiness. Of course, it’s hard to break with tradition, and almost everyone still calls it difficulty. But it might help you here to think of it as “easiness.”

How do we calculate classical item difficulty?

There are two predominant paradigms in psychometrics: classical test theory (CTT) and item response theory (IRT). Here, I will just focus on the simpler approach, CTT.

To calculate classical item difficulty with dichotomous items, you simply count the number of examinees that responded correctly (or in the keyed direction) and divide by the number of respondents. This gets you a proportion, which is like a percentage but is on the scale of 0 to 1 rather than 0 to 100. Therefore, the possible range that you will see reported is 0 to 1. Consider this data set.

Person	Item1	Item2	Item3	Item4	Item5	Item6	Score
1	0	0	0	0	0	1	1
2	0	0	0	0	1	1	2
3	0	0	0	1	1	1	3
4	0	0	1	1	1	1	4
5	0	1	1	1	1	1	5
Diff:	0.00	0.20	0.40	0.60	0.80	1.00

Item6 has a high difficulty index, meaning that it is very easy. Item4 and Item5 are typical items, where the majority of items are responding correctly. Item1 is extremely difficult; no one got it right!

For polytomous items (items with more than one point), classical item difficulty is the mean response value. That is, if we have a 5 point Likert item, and two people respond 4 and two response 5, then the average is 4.5. This, of course, is mathematically equivalent to the P value if the points are 0 and 1 for a no/yes item. An example of this situation is this data set:

Person	Item1	Item2	Item3	Item4	Item5	Item6	Score
1	1	1	2	3	4	5	1
2	1	2	2	4	4	5	2
3	1	2	3	4	4	5	3
4	1	2	3	4	4	5	4
5	1	2	3	5	4	5	5
Diff:	1.00	1.80	2.60	4.00	4.00	5.00

Note that this is approach to calculating difficulty is sample-dependent. If we had a different sample of people, the statistics could be quite different. This is one of the primary drawbacks to classical test theory. Item response theory tackles that issue with a different paradigm. It also has an index with the right “direction” – high values mean high difficulty with IRT.

If you are working with multiple choice items, remember that while you might have 4 or 5 responses, you are still scoring the items as right/wrong. Therefore, the data ends up being dichotomous 0/1.

Very important final note: this P value is NOT to be confused with p value from the world of hypothesis testing. They have the same name, but otherwise are completely unrelated. For this reason, some psychometricians call it P+ (pronounced “P-plus”), but that hasn’t caught on.

How do I interpret classical item difficulty?

For educational/cognitive assessments, difficulty refers to the probability that examinees will get the item correct. If more examinees get the item correct, it has low difficulty. For psychological/survey type data, difficulty refers to the probability of responding in the keyed direction. That is, if you are assessing Extraversion, and the item is “I like to go to parties” then you are evaluating how many examinees agreed with the statement.

What is unique with survey type data is that it often includes reverse-keying; the same assessment might also have an item that is “I prefer to spend time with books rather than people” and an examinee disagreeing with that statement counts as a point towards the total score.

For the stereotypical educational/knowledge assessment, with 4 or 5 option multiple choice items, we use general guidelines like this for interpretation.

Range	Interpretation	Notes
0.0-0.3	Extremely difficult	Examinees are at chance level or even below, so your item might be miskeyed or have other issues
0.3-0.5	Very difficult	Items in this range will challenge even top examinees, and therefore might elicit complaints, but are typically very strong
0.5-0.7	Moderately difficult	These items are fairly common, and a little on the tougher side
0.7-0.90	Moderately easy	These are the most common range of items on most classically built tests; easy enough that examinees rarely complain
0.90-1.0	Very easy	These items are mastered by most examinees; they are actually too easy to provide much info on examinees though, and can be detrimental to reliability.

Do I need to calculate this all myself?

No. There is plenty of software to do it for you. If you are new to psychometrics, I recommend CITAS, which is designed to get you up and running quickly but is too simple for advanced situations. If you have large samples or are involved with production-level work, you need Iteman. Sign up for a free account with the button below. If that is you, I also recommend that you look into learning IRT if you have not yet.

November 6, 2020/by Nathan Thompson, PhD

What can Machine Learning tell us about our item banks?

Nathan Thompson, PhDMarch 29, 2020

Artificial intelligence (AI) and machine learning (ML) have become buzzwords over the past few years. As I already wrote about, they are actually old news in the field of psychometrics. Factor analysis is a classical example of ML, and item response theory (IRT) also qualifies as ML. Computerized adaptive testing (CAT) is actually an application of AI to psychometrics that dates back to the 1970s.

One thing that is very different about the world of AI/ML today is the massive power available in free platforms like R, Python, and TensorFlow. I’ve been thinking a lot over the past few years about how these tools can impact the world of assessment. A straightforward application is too automated essay scoring; a common way to approach that problem is through natural language processing with the “bag of words” model and utilize the document-term matrix (DTM) as predictors in a model for essay score as a criterion variable. Surprisingly simple. This got me to wondering where else we could apply that sort of modeling. Obviously, student response data on selected-response items provides a ton of data, but the research questions are less clear. So, I turned to the topic that I think has the next largest set of data and text: item banks.

Step 1: Text Mining

The first step was to explore tools for text mining in R. I found this well-written and clear tutorial on the text2vec package and used that as my springboard. Within minutes I was able to get a document term matrix, and in a few more minutes was able to prune it. This DTM alone can provide useful info to an organization on their item bank, but I wanted to delve further. Can the DTM predict item quality?

Step 2: Fit Models

To do this, I utilized both the caret and glmnet packages to fit models. I love the caret package, but if you search the literature you’ll find it has a problem with sparse matrices, which is exactly what the DTM is. One blog post I found said that anyone with a sparse matrix is pretty much stuck using glmnet.

I tried a few models on a small item bank of 500 items from a friend of mine, and my adjusted R squared for the prediction of IRT parameters (as an index of item quality) was 0.53 – meaning that I could account for more than half the variance of item quality just by knowing some of the common words in each item’s stem. I wasn’t even using the answer texts n-grams, or additional information like Author and content domain.

Want to learn more about your item banks?

I’d love to swim even deeper on this issue. If you have a large item bank and would like to work with me to analyze it so you can provide better feedback and direction to your item writers and test developers, drop me a message at solutions@assess.com! This could directly impact the efficiency of your organization and the quality of your assessments.

March 29, 2020/by Nathan Thompson, PhD

Ways the Word “Standard” is used in Assessment

Nathan Thompson, PhDDecember 6, 2019

If you have worked in the field of assessment and psychometrics, you have undoubtedly encountered the word “standard.” While a relatively simple word, it has the potential to be confusing because it is used in three (and more!) completely different but very important ways. Here’s a brief discussion.

Standard = Cutscore

As noted by the well-known professor Gregory Cizek here, “standard setting refers to the process of establishing one or more cut scores on a test.” The various methods of setting a cutscore, like Angoff or Bookmark, are referred to as standard setting studies. In this context, the standard is the bar that separates a Pass from a Fail. We use methods like the ones mentioned to determine this bar in as scientific and defensible fashion as possible, and give it more concrete meaning than an arbitrarily selected round number like 70%. Selecting a round number like that will likely get you sued since there is no criterion-referenced interpretation.

Standard = Blueprint

If you work in the field of education, you often hear the term “educational standards.” These refer to the curriculum blueprints for an educational system, which also translate into assessment blueprints, because you want to assess what is on the curriculum. Several important ones in the USA are noted here, perhaps the most common of which nowadays is the Common Core State Standards, which attempted to standardize the standards across states. These standards exist to standardize the educational system, by teaching what a group of experts have agreed upon should be taught in 6th grade Math classes for example. Note that they don’t state how or when a topic should be taught, merely that 6th Grade Math should cover Number Lines, Measurement Scales, Variables, whatever – sometime in the year.

Standard = Guideline

If you work in the field of professional certification, you hear the term just as often but in a different context, accreditation standards. The two most common are the National Commission for Certifying Agencies (NCCA) and the ANSI National Accreditation Board (ANAB). These two organizations are a consortium of credentialing bodies that give a stamp of approval to credentialing bodies, stating that a Certification or Certificate program is legit. Why? Because there is no law to stop me from buying a textbook on any topic, writing 50 test questions in my basement, and selling it as a Certification. It is completely a situation of caveat emptor, and these organizations are helping the buyers by giving a stamp of approval that the certification was developed with accepted practices like a Job Analysis, Standard Setting Study, etc.

In addition, there are the professional standards for our field. These are guidelines on assessment in general rather than just credentialing. Two great examples are the AERA/APA/NCME Standards for Educational and Psychological Measurement and the International Test Commission’s Guidelines (yes they switch to that term) on various topics.

Also: Standardized = Equivalent Conditions

The word is also used quite frequently in the context of standardized testing, though it is rarely chopped to the root word “standard.” In this case, it refers to the fact that the test is given under equivalent conditions to provide greater fairness and validity. A standardized test does NOT mean multiple choice, bubble sheets, or any of the other pop connotations that are carried with it. It just means that we are standardizing the assessment and the administration process. Think of it as a scientific experiment; the basic premise of the scientific method is holding all variables constant except the variable in question, which in this case is the student’s ability. So we ensure that all students receive a psychometrically equivalent exam, with equivalent (as much as possible) writing utensils, scrap paper, computer, time limit, and all other practical surroundings. The problem comes with the lack of equivalence in access to study materials, prep coaching, education, and many bigger questions… but those are a societal issue and not a psychometric one.

So despite all the bashing that the term gets, a standardized test is MUCH better than the alternatives of no assessment at all, or an assessment that is not a level playing field and has low reliability. Consider the case of hiring employees: if assessments were not used to provide objective information on applicant skills and we could only use interviews (which are famously subjective and inaccurate), all hiring would be virtually random and the amount of incompetent people in jobs would increase a hundredfold. And don’t we already have enough people in jobs where they don’t belong?

December 6, 2019/by Nathan Thompson, PhD