Posts on psychometrics: The Science of Assessment

concurrent calibration irt equating linking

Test equating refers to the issue of defensibly translating scores from one test form to another. That is, if you have an exam where half of students see one set of items while the other half see a different set, how do you know that a score of 70 is the same one both forms? What if one is a bit easier? If you are delivering assessments in conventional linear forms – or piloting a bank for CAT/LOFT – you are likely to utilize more than one test form, and, therefore, are faced with the issue of test equating.

When two test forms have been properly equated, educators can validly interpret performance on one test form as having the same substantive meaning compared to the equated score of the other test form (Ryan & Brockmann, 2009). While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. This post will provide an overview of the topic.

 

Why do we need test linking and equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale. Suppose you take Form A and get a score of 72/100 while your friend takes Form B and gets a score of 74/100. Is your friend smarter than you, or did his form happen to have easier questions?  What if the passing score on the exam was 73? Well, if the test designers built-in some overlap of items between the forms, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items. They are delivered to a large, representative sample. Here are the results.

Mean score on 50 overlap items Mean score on 100 total items
30 72
32 74

Because the mean score on the anchor items was higher, we then think that the Form B group was a little smarter, which led to a higher total score.

Now suppose these are the results:

Mean score on 50 overlap items Mean score on 100 total items
32 72
32 74

Now, we have evidence that the groups are of equal ability. The higher total score on Form B must then be because the unique items on that form are a bit easier.

 

What is test equating?

According to Ryan and Brockmann (2009), “Equating is a technical procedure or process conducted to establish comparable scores, with equivalent meaning, on different versions of test forms of the same test; it allows them to be used interchangeably.” (p. 8). Thus, successful equating is an important factor in evaluating assessment validity, and, therefore, it often becomes an important topic of discussion within testing programs.

Practice has shown that scores, and tests producing scores, must satisfy very strong requirements to achieve this demanding goal of interchangeability. Equating would not be necessary if test forms were assembled as strictly parallel, meaning that they would have identical psychometric properties. In reality, it is almost impossible to construct multiple test forms that are strictly parallel, and equating is necessary to attune a test construction process.

Dorans, Moses, and Eignor (2010) suggest the following five requirements towards equating of two test forms:

  • tests should measure the same construct (e.g. latent trait, skill, ability);
  • tests should have the same level of reliability;
  • equating transformation for mapping the scores of tests should be the inverse function;
  • test results should not depend on the test form an examinee actually takes;
  • the equating function used to link the scores of two tests should be the same regardless of the choice of (sub) population from which it is derived.

Detecting item parameter drift (IPD) is crucial for the equating process because it helps in identifying items whose parameters have changed. By addressing IPD, test developers can ensure that the equating process remains valid and reliable.

 

How do I calculate an equating?

Classical test theory (CTT) methods include linear equating and equipercentile equating as well as several others. Some newer approaches that work well with small samples are Circle-Arc (Livingston & Kim, 2009) and Nominal Weights (Babcock, Albano, & Raymond, 2012).  Specific methods for linear equating include Tucker, Levine, and Chained (von Davier & Kong, 2003). Linear equating approaches are conceptually simple and easy to interpret; given the examples above, the equating transformation might be estimated with a slope of 1.01 and an intercept of 1.97, which would directly confirm the hypothesis that one form was about 2 points easier than the other.

Item response theory (IRT) approaches include equating through common items (equating by applying an equating constant, equating by concurrent or simultaneous calibration, and equating with common items through test characteristic curves), and common person calibration (Ryan & Brockmann, 2009). The common-item approach is quite often used, and specific methods for finding the constants (conversion parameters) include Stocking-Lord, Haebara, Mean/Mean, and Mean/Sigma. Because IRT assumes that two scales on the same construct differ by only a simple linear transformation, all we need to do is find the slope and intercept of that transformation. Those methods do so, and often produce nice looking figures like the one below from the program IRTEQ (Han, 2007). Note that the b parameters do not fall on the identity line, because there was indeed a difference between the groups, and the results clearly find that is the case.

IRTEQ IRT equating

Practitioners can equate forms with CTT or IRT. However, one of the reasons that IRT was invented was that equating with CTT was very weak. Hambleton and Jones (1993) explain that when CTT equating methods are applied, both ability parameter (i.e., observed score) and item parameters (i.e., difficulty and discrimination) are dependent on each other, limiting its utility in practical test development. IRT solves the CTT interdependency problem by combining ability and item parameters in one model. The IRT equating methods are more accurate and stable than the CTT methods (Hambleton & Jones, 1993; Han, Kolen, & Pohlmann, 1997; De Ayala, 2013; Kolen and Brennan, 2014) and provide a solid basis for modern large-scale computer-based tests, such as computerized adaptive tests (Educational Testing Service, 2010; OECD, 2017).

Of course, one of the reasons that CTT is still around in general is that it works much better with smaller samples, and this is also the case for CTT test equating (Babcock, Albano, & Raymond, 2012).

 

How do I implement test equating?

Test equating is a mathematically complex process, regardless of which method you use.  Therefore, it requires special software.  Here are some programs to consider.

  1. CIPE performs both linear and equipercentile equating with classical test theory. It is available from the University of Iowa’s CASMA site, which also includes several other software programs.
  2. IRTEQ is an easy-to-use program which performs all major methods of IRT Conversion equating.  It is available from the University of Massachusetts website, as well as several other good programs.
  3. There are many R packages for equating and related psychometric topics. This article claims that there are 45 packages for IRT analysis alone!
  4. If you want to do IRT equating, you need IRT calibration software. We highly recommend Xcalibre since it is easy to use and automatically creates reports in Word for you. If you want to do the calibration approach to IRT equating (both anchor-item and concurrent-calibration), rather than the conversion approach, this is handled directly by IRT software like Xcalibre. For the conversion approach, you need separate software like IRTEQ.

Equating is typically performed by highly trained psychometricians; in many cases, an organization will contract out to a testing company or consultant with the relevant experience. Contact us if you’d like to discuss this.

 

Does equating happen before or after delivery?

Both. These are called pre-equating and post-equating (Ryan & Brockmann, 2009).  Post-equating means the calculation is done after delivery and you have a full data set, for example if a test is delivered twice per year on a single day, we can do it after that day.  Pre-equating is more tricky, because you are trying to calculate the equating before a test form has ever been delivered to an examinee; but this is 100% necessary in many situations, especially those with continuous delivery windows.

 

How do I learn more about test equating?

If you are eager to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014) that provides the most complete coverage of score equating and linking.  There are other resources more readily available on the internet, like this free handbook from CCSSO. If you would like to learn more about IRT, we suggest the books by De Ayala (2008) and Embretson and Reise (2000). A brief intro of IRT equating is available on our website.

Several new ideas of general use in equating, with a focus on kernel equating, were introduced in the book by von Davier, Holland, and Thayer (2004). Holland and Dorans (2006) presented a historical background for test score linking, based on work by Angoff (1971), Flanagan (1951), and Petersen, Kolen, and Hoover (1989). If you look for a straightforward description of the major issues and procedures encountered in practice, then you should turn to Livingston (2004).


Want to learn more? Talk to a Psychometric Consultant!

References

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). American Council on Education.

Babcock, B., Albano, A., & Raymond, M. (2012). Nominal Weights Mean Equating: A Method for Very Small Samples. Educational and Psychological Measurement, 72(4), 1-21.

Dorans, N. J., Moses, T. P., & Eignor, D. R. (2010). Principles and practices of test score equating. ETS Research Report Series2010(2), i-41.

De Ayala, R. J. (2008). A commentary on historical perspectives on invariant measurement: Guttman, Rasch, and Mokken.

De Ayala, R. J. (2013). Factor analysis with categorical indicators: Item response theory. In Applied quantitative analysis in education and the social sciences (pp. 220-254). Routledge.

Educational Testing Service (2010). Linking TOEFL iBT Scores to IELTS Scores: A Research Report. Educational Testing Service.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Maheah.

Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695-763). American Council on Education.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational measurement: issues and practice12(3), 38-47.

Han, T., Kolen, M., & Pohlmann, J. (1997). A comparison among IRT true-and observed-score equatings and traditional equipercentile equating. Applied Measurement in Education10(2), 105-121.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187-220). Praeger.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, linking, and scaling: Methods and practices (2nd ed.). Springer-Verlag.

Kolen, M. J., & Brennan, R. L. (2014). Item response theory methods. In Test Equating, Scaling, and Linking (pp. 171-245). Springer.

Livingston, S. A. (2004). Equating test scores (without IRT). ETS.

Livingston, S. A., & Kim, S. (2009). The Circle‐Arc Method for Equating in Small Samples. Journal of Educational Measurement 46(3): 330-343.

OECD (2017). PISA 2015 Technical Report. OECD Publishing.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). Macmillan.

Ryan, J., & Brockmann, F. (2009). A Practitioner’s Introduction to Equating with Primers on Classical Test Theory and Item Response Theory. Council of Chief State School Officers.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. Springer.

von Davier, A. A., & Kong, N. (2003). A unified approach to linear equating for non-equivalent groups design. Research report 03-31 from Educational Testing Service. https://www.ets.org/Media/Research/pdf/RR-03-31-vonDavier.pdf

student assessment cheating copying test data forensics

In the field of psychometric forensics, examinee collusion refers to cases where an examinee takes a test with some sort of external help in obtaining the correct answers.  There are several possibilities:

  1. Low ability examinees copy off a high ability examinee (known as the Source); could be unbeknownst to the source.
  2. Examinees work together and pool their answers.
  3. Examinees receive answers from a present third party, such as a friend outside with Bluetooth, a teacher, or a proctor.
  4. Multiple examinees obtaining answers from a brain dump site, known as preknowledge; while not strictly collusion, we’d likely find similar answer patterns in that group, even if they never met each other.

The research on the field has treated all of these more or less equivalently, and as a group.  I’d like to argue that we stop doing that.  Why? The data patterns can be different, and can therefore be detected (or not detected) by different types of forensics.  Most importantly, some collusion indices (such as Holland’s K index) are designed specifically for some cases, and we need to treat them as such.  More broadly, perhaps we need a theoretical background on types of collusion – part of an even broader theory on types of invalid test-taking – as a basis to fit, evaluate, and develop indices.

What do I mean by this?  Well, some collusion indices are explicitly designed around the assumption that there is a Source and a copier(s).  The strongest examples are the K-variants (Sotaridona, 2003) and Wollack’s (1997) omega.  Other indices do not assume this, such as Bellezza & Bellezza (1989); they simply look for similarities in responses, either on errors alone or on both errors and correct answers.

Collusion indices, if you’re not familiar with them, evaluate each possible pair of examinees and look for similar response patterns; if you have 1000 examinees, there are (1000 x 9999)/2 = 4,999,500 pairs.  If you use an index that evaluates the two directions separately (Examinee 1 is Source and 2 is Copier, then 2 is Source and 1 is Copier), you are calculating the index 9,999,000 times on this relatively small data set.

Primary vs. Secondary Collusion

Consider the first case above, which is generally of interest.  I’d like to delineate the difference between primary collusion and secondary collusion.  Primary collusion is what we are truly trying to find: if examinees are copying off another examinee.  Some indices are great at finding this.  But there is also secondary collusion; if multiple examinees are copiers, they will also have very similar response patterns, though the data signature will be different than if paired with the Source.  Indices designed to find primary collusion might not be as strong as finding this; rightly so, you might argue.

The other three cases above, I’d argue, are also secondary collusion.  There is no true Source present in the examinees, but we are certainly going to find unusually similar response patterns.  Indices that are explicitly designed to find the Source are not going to have the statistical power that was intended.  They require a slightly different set of assumptions, different interpretations, and different calculations.

What might the different calculations be?  That’s for another time.  I’m currently working through a few ideas.  The most recent is based on the discovery that the highly regarded S2 index (Sotaridona, 2003) makes assumptions that are regularly violated in real data, and that the calculations need to be modified.  I’ll be integrating the new index into SIFT.

An example of examinee collusion

To demonstrate, I conducted a brief simulation study, part of a much larger study to be published at a future date.  I took a data set of a national educational assessment (N=16,666, 50 MC items), randomly selected 1,000 examinees, randomly assigned them to one of 10 locations, and then created a copying situation in one location.  To do this, I found the highest ability examinee at that location and the lowest 50, then made those low 50 copy the answers off the high examinee for the 25 hardest items.  This is a fairly extreme and clear-cut example, so we’d hope that the collusion indices pick up on such an obvious data signature.

Let’s start by investigating that statement.  This is an evaluation of power, which is the strength of a statistical test to find something when something is actually there (though technically, not all of these indices are statistical tests).  We hope that we can find all 50 instances of primary collusion as well as most of the secondary collusion.  As of today, SIFT calculates 15 collusion indices; for a full description of these, download a free copy and check out the manual.

SIFT test security data forensics

What does this all mean?

Well, my original point was that we need to differentiate the Primary and Secondary cases of collusion, and evaluate them separately when we can.  An even broader point was that we need a specific framework on types of invalid test instances and the data signatures they each leave, but that’s worth an article on its own.

If the topic of index comparison is interesting to you, check out the chapter by Zopluogu (2016).  He also found the power for S2 to be low, and the Type I errors to be conservative, though the Type II were not reported there.  That research stemmed from a dissertation at the University of Minnesota, which probably has a fuller treatment; my goal here is not to fully repeat that extensive study, but to provide some interesting interpretations.

positive-manifold

Positive manifold refers to the fact that scores on cognitive assessment tend to correlate very highly with each other, indicating a common latent dimension that is very strong.  This latent dimension became known as g for general intelligence or general cognitive ability.  This post discusses what the positive manifold is, but since there are MANY other resources on the definition, the post also explains how this concept is useful in the real world.

The term ‘positive manifold’ originally came out of work in the field of intelligence testing, including research by Charles Spearman.  There are literally hundreds of studies on this topic, and over one hundred years of research has shown that this concept is scientifically supported, but it is important to remember that it is just a manifold and not a perfect relationship.  That is, we can expect verbal reasoning ability to correlate highly with quantitative reasoning or logical reasoning, but it is by no means a 1-to-1 relationship.  There are certainly some people that can be high on one but not another.  But it is very unlikely for you to be in the 90th percentile on one but 10th percentile on another.

What is Positive Manifold?

If you were to take a set of cognitive tests, either separate, or as subtests of a battery like the Wechsler Adult Intelligence Scale, and correlate their scores, the correlation matrix would be overwhelmingly positive.  For example, look at Table 2-9 in this book.  There are many, many more examples if you search for keywords like “intelligence intercorrelation.”

As you might expect, related constructs will correlate more highly.  A battery might have a Verbal Reasoning test and a Vocabulary test; we would expect these to correlate more highly with each other (maybe 0.80) than a Figural Reasoning test (maybe 0.50).  Researchers like to use a methodology called factor analysis to analyze this structure and drive interpretations.

Practical implications

Positive manifold and the structure of cognitive ability is historically an academic research topic, and remains so.  Researchers are still publishing articles like this one.  However, the concept of positive manifold has many practical implications in the real world.  It affects situations where cognitive ability testing is used to obtain information about people and make decisions about them.  Two of the most common examples are test batteries for admissions/placement or employment.

Admissions/placement exams are used in the education sector to evaluate student ability and make decisions about schools or courses that the student can/should enter.  Admissions refers to whether the student should be admitted to a school, such as a university or a prestigious high school.  Examples of this in the USA are the SAT and ACT exams.  Placement refers to sending students to the right course, such as testing them on Math and English to determine if they are ready for certain courses.  Both of these examples will typically test the student on 3 or 4 aspects, which is an example of a test battery.  The SAT discusses intercorrelations of its subtests in the technical manual (page 104).  Tests like the SAT can provide incremental validity above the predictive power of high school grade point average (HSGPA) alone, as seen in this report.  The scores from these multiple tests can be combined into a composite test score, providing a more comprehensive evaluation of the student’s abilities.

Employment testing is also often done with several cognitive tests.  You might take psychometric tests to apply for a job, and they test you on quantitative reasoning and verbal reasoning.

In both cases, the tests are validated by doing research to show that they predict a criterion of interest.  In the case of university admissions, this might be First Year GPA or Four Year Graduation Rate.  In the case of Employment Testing, it could be Job Performance Rating by a supervisor or 1-year retention rate.

Why are they using multiple tests?  They are trying to capitalize on the differences to get more predictive power for the criterion.  Success in university isn’t due to just verbal/language skills alone, but also logical reasoning and other skills.  They recognize that there is a high correlation, but the differences between the constructs can be leveraged to get more information about people.  Employment testing goes further, and tries to add incremental validity by adding other tests that are completely unrelated but relevant to the job world, like job samples or even noncognitive tests like Conscientiousness.  These also correlate with job performance, and therefore help with prediction, but correlate even lower with measures of g than another cognitive test would; this then adds more prediction power.

bookmark-method-of-standard-setting

The Bookmark Method of standard setting (Lewis, Mitzel, & Green, 1996) is a scientifically-based approach to setting cutscores on an examination. It allows stakeholders of an assessment to make decisions and classifications about examinees that are constructive rather than arbitrary (e.g., 70%), meet the goals of the test, and contribute to overall validity. A major advantage of the bookmark method over others is that it utilizes difficulty statistics on all items, making it very data-driven; but this can also be a disadvantage in situations where such data is not available. It also has the advantage of panelist confidence (Karantonis & Sireci, 2006).

The bookmark method operates by delivering a test to a representative sample (or population) of examinees, and then calculating the difficulty statistics for each item. We line up the items in order of difficulty, and experts review the items to place a bookmark where they think a cutscore should be. Nowadays, we use computer screens, but of course in the past this was often done by printing the items in paper booklets, and the experts would literally insert a bookmark.

What is standard setting?

Standard setting (Cizek & Bunch, 2006) is an integral part of the test development process even though it has been undervalued outside of practitioners’ view in the past (Bejar, 2008). Standard setting is the methodology of defining achievement or proficiency levels and corresponding cutscores. A cutscore is a score that serves as a measure of classifying test takers into categories.

Educational assessments and credentialing examinations are often employed to distribute test takers among ordered categories according to their performance across specific content and skills (AERA, APA, & NCME, 2014; Hambleton, 2013). For instance, in tests used for certification and licensing purposes, test takers are typically classified as “pass”—those who score at or above the cutscore—and those who “fail”. In education, students are often classified in terms of proficiency; the Nation’s Report Card assessment (NAEP) in the United States classifies students as Below Basic, Basic, Proficient, Advanced.

However, assessment results could come into question unless the cutscores are appropriately defined. This is why arbitrary cutscores are considered indefensible and lacking validity. Instead, psychometricians help test sponsors to set cutscores using methodologies from the scientific literature, driven by evaluations of item and test difficulty as well as examinee performance.

When to use the bookmark method?

Two approaches are mainly used in international practice to establish assessment standards: the Angoff method (Cizek, 2006) and the Bookmark method (Buckendahl, Smith, Impara, & Plake, 2000). The Bookmark method, unlike the Angoff method, requires the test to be administered prior to defining cutscores based on test data. This provides additional weight to the validity of the process, and better informs the subject matter experts during the process. Of course, many exams require a cutscore to be set before it is published, which is impossible with the bookmark; the Angoff procedure is very useful then.

How do I implement the bookmark method?

The process of standard setting employing the Bookmark method consists of the following stages:

  1. Identify a team of subject matter experts (SMEs); their number should be around 6-12, and led by a test developer/psychometrician/statistician
  2. Analyze test takers’ responses by means of the item response theory (IRT)
  3. Create a list items according to item difficulty in an ascending order
  4. Define the competency levels for test takers; for example, have the 6-12 experts discuss what should differentiate a “pass” candidate from a “fail” candidate
  5. Experts read the items in the ascending order (they do not need to see the IRT values), and place a bookmark where appropriate based on professional judgement across well-defined levels
  6. Calculate thresholds based on the bookmarks set, across all experts
  7. If needed, discuss results and perform a second round

 

Example of the Bookmark Method

If there are four competency levels such as the NAEP example, then SMEs need to set up three bookmarks in-between: first bookmark is set after the last item in a row that fits the minimally competent candidate for the first level, then second and third. There are thresholds/cutscores from 1 to 2, 2 to 3, and 3 to 4. SMEs perform this individually without discussion, by reading the items.

When all SMEs have provided their opinion, the standard setting coordinator combines all results into one spreadsheet and leads the discussion when all participants express their opinion referring to the bookmarks set. This might look like the sheet below. Note that SME4 had a relatively high standard in their mind, while SME2 had a low standard in their mind – placing virtually every student above an IRT score of 0.0 into the top category!

bookmark method 1

After the discussion, the SMEs are given one more opportunity to set the bookmarks again. Usually, after the exchange of opinions, the picture alters. SMEs gain consensus, and the variation in the graphic is reduced.  An example of this is below.

bookmark method

What to do with the results?

Based on the SMEs’ voting results, the coordinator or psychometrician calculates the final thresholds on the IRT scale, and provides them to the analytical team who would ultimately prepare reports for the assessment across competency levels. This might entail score reports to examinees, feedback reports to teachers, and aggregate reports to test sponsors, government officials, and more.

You can see how the scientific approach will directly impact the interpretations of such reports. Rather than government officials just knowing how many students scored 80-90% correct vs 90-100% correct, the results are framed in terms of how many students are truly proficient in the topic. This makes decisions from test scores – both at the individual and aggregate levels – much more defensible and informative.  They become truly criterion-referenced.  This is especially true when the scores are equated across years to account for differences in examinee distributions and test difficulty, and the standard can be demonstrated to be stable.  For high-stakes examinations such as medical certification/licensure, admissions exams, and many more situations, this is absolutely critical.

Want to talk to an expert about implementing this for your exams?  Contact us.

References

[AERA, APA, & NCME] (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bejar, I. I. (2008). Standard setting: What is it? Why is it important. R&D Connections, 7, 1-6. Retrieved from https://www.ets.org/Media/Research/pdf/RD_Connections7.pdf

Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2000). A comparison of Angoff and Bookmark standard setting methods. Paper presented at the Annual Meeting of the Mid-Western Educational Research Association, Chicago, IL: October 25-28, 2000.

Cizek, G., & Bunch, M. (2006). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests.  Thousand Oaks, CA: Sage.

Cizek, G. J. (2007). Standard setting. In Steven M. Downing and Thomas M. Haladyna (Eds.) Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers, pp. 225-258.

Hambleton, R. K. (2013). Setting performance standards on educational assessments and criteria for evaluating the process. In Setting performance standards, pp. 103-130. Routledge. Retrieved from https://www.nciea.org/publications/SetStandards_Hambleton99.pdf

Karantonis, A., & Sireci, S. (2006). The Bookmark Standard‐Setting Method: A Literature Review. Educational Measurement Issues and Practice 25(1):4 – 12.

Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996, June). Standard setting: A Book-mark approach. In D. R. Green (Chair),IRT-based standard setting procedures utilizing behavioral anchoring. Symposium conducted at the Council of Chief State School Officers National Conference on Large-Scale Assessment, Phoenix, AZ.

assessment-test-battery

A test battery or assessment battery is a set multiple psychometrically-distinct exams delivered in one administration.  In some cases, these are various tests that are cobbled together for related purposes, such as a psychologist testing a 8 year old child on their intelligence, anxiety, and autism spectrum.  However, in many cases it is a single test title that we often refer to as a single test but is actually several separate tests, like a university admissions test that has English, Math, and Logical Reasoning components.  Why do so? The key here is that we want to keep them psychometrically separate, but maximize the amount of information about the person to meet the purposes of the test.

Learn more about our powerful exam platform that allows you to easily develop and deliver test batteries.

 

Examples of a Test Battery

Test batteries are used in a variety of fields, pretty much anywhere assessment is done.

Admissions and Placement Testing

The classic example is a university admissions test that has English, Math, and Logic portions.  These are separate tests, and psychometricians would calculate the reliability and other important statistics separately.  However, the scores are combined at the end to get an overall picture of examinee aptitude or achievement, and use that to maximally predict 4-graduation rates and other important criterion variables.

Why is is called a battery?  Because we are battering the poor student with not just one, but many exams!

Pre-Employment Testing

Exam batteries are often used in pre-employment testing.  You might get tested on computer skills, numerical reasoning, and noncognitive traits such as integrity or conscientiousness. These are used together to gain incremental validity.  A good example is the CAT-ASVAB, which is the selection test to get into the US Armed Forces.  There are 10 tests (vocabulary, math, mechanical aptitude…).

Psychological or Psychoeducational Assessment

In a clinical setting, clinicians will often use a battery of tests, such as IQ, autism, anxiety, and depression.  Some IQ tests themselves as a battery, as they might assess visual reasoning, logical reasoning, numerical reasoning, etc.  However, these have a positive manifold, meaning that they correlate quite highly with each other.  Another example is the Woodcock-Johnson.

K-12 Educational Assessment

Many large-scale tests that are used in schools are considered a battery, though often with only 2 or 3 aspects.  A common one in the USA is the NWEA Measures of Academic Progress.

 

Composite Scores

A composite score is a combination of scores in a battery.  If you took an admissions test like the SAT and GRE, you recall how it would add your scores on the different subtests, while the ACT test takes the average.  The ASVAB takes a linear combination of the 4 most important subtests and uses them for admission; the others are used for job matching.

 

A Different Animal: Test with Sections

The battery is different than a single test that has distinct sections.  For example, a K12 English test might have 10 vocab items, 10 sentence-completion grammar items, and 2 essays.  Such tests are usually analyzed as a single test, as they are psychometrically unidimensional.

 

How to Deliver A Test Battery

In ASC’s platforms,  Assess.ai  and  FastTest, all this functionality is available out of the box: test batteries, composite scores, and sections within a test.  Moreover, they come with a lot of important functionality, such as separation of time limits, navigation controls, customizable score reporting, and more.  Click here to request a free account and start applying best practices.

progress monitoring education

The Test Response Function (TRF) in item response theory (IRT) is a mathematical function that describes the relationship between the latent trait that a test is measuring, which psychometricians call theta (θ), and the predicted raw score on the test in a traditional notion (percentage/proportion/number correct).  An important concept from IRT, it provides a way to evaluate the performance of the test, as well as other aspects such as relative difficulty.  In addition, it has more advanced usages, such as being used for some approaches to IRT equating.

What is the Test Response Function in IRT?

The test response function is a logistic-shaped curve like you see below.  This is an example of a test with 10 dichotomous items.  It tells us that a person with a theta of -0.30 is likely to get 7 items correct on the test, a proportion of 0.70.  Meanwhile, an examinee with a theta of +2.0 is expected to get about 9.5 items correct on average.  You can see how this provides some conceptual linkage between item response theory and classical test theory.

Test response function 10 items Angoff

How do we determine the Test Response Function?

The test response function is obtained by summing the item response functions from all items in the test.  Each item response function has a logistic shape as well, but the y-axis is the probability of getting a certain item correct and therefore ranges 0.0 to 1.0.  So if you add these over 10 items, the raw range is then 0 to 10, as you see above.

Here is a brief example.  Suppose you have a test of 5 items with the following IRT parameters.

Seq a b c
1 0.80 -2.00 0.20
2 1.00 -1.00 0.00
3 0.70 0.00 0.25
4 1.20 0.00 0.25
5 1.10 1.00 0.20

The item response functions for these look like the graphic below.  The dark blue line is the easiest item, and the light blue line is the hardest item.

5 items IRFs

If you sum up those IRFs, you would obtain this as the test response function.  Note that the y-axis is now 5, because we have summed five 1-point items.  It is telling us that a theta of -1.40 will likely get 2 items correct, and 0.60 get 4 items correct.

5 items TRF

For completeness, also note that we can take the item information functions…

item information functions

…and sum them into the Test Information Function (see here for an article on that).

test information function

And that can be inverted into a conditional standard error of measurement function:

Note how these are all a function of each other, which is one of the strengths of item response theory.  The last two are clearly an inverse of each other.  The test response function is a function of the test information function if you consider the derivative.  The TIF is a function of the derivative of the TRF, and therefore the TIF is highest where the TRF has the highest slope.  The TRF usually has a high slope in the middle and less slope on the ends, so the TIF ends up being shaped like a mountain.

How can we use this to solve measurement issues?

The test response function can be used to compare different test forms.  The image below is from our Form Building Tool, and shows 4 test forms together.  Form 3 is the easiest from a classical perspective, because it is higher than the rest for a given theta.  That is, at a theta of 0.0, the expected score is 4.5, compared to 3.5 on forms 1 and 2, and 3.0 on Form 4.  However, if you are scoring examinees with IRT, the actual form does not matter – such an examinee will receive a theta of 0.0 regardless of the form they take.  (This won’t be exact for a test of 6 items… but for a test of 100 items it would work out.)

test response functions

An important topic in psychometrics is Linking and Equating, that is, trying to determine comparable scores on different test forms.  Two important methods for this, Stocking-Lord and Haebara, utilize the test response function in their calculations.

Generalizing the TRF to CAT and LOFT

Note that the TRF can be adapted by reconceptualizing the y-axis.  Suppose we have a pool of 100 items and everyone will randomly receive 50, which is a form of linear on the fly testing (LOFT).  The test response function with the scale of 0 to 100 is not as useful, but we can take the proportion scale and use that to predict raw scores out of 50 items, with some allowance for random variation.  Of course, we could also use the actual 50 items that a given person saw to calculate the test response function for them.  This could be compared across examinees to evaluate how equivalent the test forms are across examinees.  In fact, some LOFT algorithms seek to maximize the equivalence of TRFs/TIFs across examinees rather than randomly selecting 50 items.  The same can be said for computerized adaptive testing (CAT), but the number of items can vary between examinees with CAT.

OK, how do I actually apply this?

You will need software that can perform IRT calibrations, assemble test forms with IRT, and deliver tests that are scored with IRT.  We provide all of this.  Contact us to learn more.

psychometric-tests-measure-mental-processes

Psychometric tests are assessments of people to measure psychological attributes such as personality or intelligence. Over the past century, psychometric tests have played an increasingly important part in revolutionizing how we approach important fields such as education, psychiatry, and recruitment.  One of the main reasons why psychometric tests have become popular in corporate recruitment and education is their accuracy and objectivity.

However, getting the best out of psychometric tests requires one to have a concrete understanding of what they are, how they work, and why you need them.  This article, therefore, aims to provide you with the fundamentals of psychometric testing, the benefits, and everything else you need to know.

Interested in talking to a psychometrician about test development and validation, or a demo of our powerful platform that empowers you to develop custom psychometric tests?

TALK TO US Contact

What is a psychometric test or assessment?

Psychometric tests and assessments are different from other types of tests in that they measure a person’s knowledge, abilities, interests, and other attributes. They focus on measuring “mental processes” rather than “objective facts.” Psychometric tests are used to determine suitability for employment, education, training, or placement, as well as the suitability of the person for specific situations.

A psychometric test or assessment is an evaluation of a candidate’s personality traits and cognitive abilities. They also help assess mental health status by screening the individual for potential mental disorders. In recruitment and job performance, companies use psychometric tests for reasons such as:

  • Make data-driven comparisons among candidates
  • Making leadership decisions
  • Reduce hiring bias and improve workforce diversification
  • Identify candidate strengths and weaknesses
  • Help complete candidate personas
  • Deciding management strategies

Psychometrics is different, and refers to a field of study associated with the theory and technique of psychoeducational measurement.  It is not limited to the topic of recruitment and careers, but spans all assessments, from K-12 formative assessment to addiction inventories in medical clinics to university admissions to certification of nurses.

The different types of psychometric tests

The following are the main types of psychometric assessments:

Personality tests

Personality tests mainly help recruiters identify desirable personality traits that would make one fit for a certain role in a company. These tests contain a series of questions that measure and categorize important metrics such as leadership capabilities and candidate motivations as well as job-related traits such as integrity or conscientiousness. Some personality assessments seek to categorize people into relatively arbitrary “types” while some place people on a continuum of various traits.

‘Type focused’ Personality tests

Some examples of popular psychometric tests that use type theory include the Myers-Briggs Type Indicator (MBTI) and the DISC profile.  Personality types are of limited usefulness in recruitment because they lack objectivity and reliability in determining important metrics that can predict the success of certain candidates in a specific role, as well as having more limited scientific backing. They are, to a large extent, Pop Psychology.

‘Trait-focused’ personality types

Personality assessments based on trait theory on the other hand tend to mainly rely on the OCEAN model, like the NEO-PI-R. These psychometric assessments determine the intensity of five traits; openness, conscientiousness, extraversion, agreeableness, and neuroticism, using a series of questions and exercises. Psychometric assessments based on this model provide more insight into the ability of candidates to perform in a certain role, compared to type-focused assessments.

Cognitive Ability and Aptitude Tests

Cognitive ability tests, also known as intelligence tests or aptitude, measure a person’s latent/unlearned cognitive skills and attributes.  Common examples of this are logical reasoning, numerical reasoning, and mechanical reasoning.  It is important to stress that these are generally unlearned, as opposed to achievement tests.

Job Knowledge and Achievement tests

These psychometric tests are designed to assess what people have learned.  For example, if you are applying for a job as an accountant, you might be given a numerical reasoning or logical reasoning test, and a test in the use of Microsoft Excel.  The former is aptitude, while the latter is job knowledge or achievement. (Though there is certainly some learning involved with basic math skills).

What are the importance and benefits of psychometric tests?

Psychometric tests have been proven to be effective in domains such as recruitment and education. In recruitment, psychometric tests have been integrated into pre-employment assessment software because of their effectiveness in the hiring process. Here are several ways psychometric tests are beneficial in corporate environments, along with Learning and Development (L & D):hr-manager-interviewing-a-candidate

Cost and Time Efficiency  Psychometric tests save organizations a lot of resources because they help eliminate the guesswork in hiring processes. Psychometric tests help employers go through thousands of resumes to find the perfect candidates.

Cultural fulfillment In the modern business world, culture is a great determinant of success. Through psychometric tests, employees can predict the types of candidates that can fit into their company culture.

Standardization — Traditional hiring processes have a lot of hiring bias cases. However, psychometric tests can level the playing ground and give a chance for the best candidates to get what they deserve.

Effectiveness Psychometric tests have been scientifically proven to play a critical role in hiring the best talent. This is mainly because they can spot important attributes that can’t be spotted by traditional hiring processes.

In L&D, psychometric tests can help organizations generate important insights such as learning abilities, candidate strengths and weaknesses, and learning strategy effectiveness. This can help re-write the learning strategies, for improved ROI.

What makes a good psychometric test?

As with all tests, you need reliability and validity.  In the case of pre-employment testing, the validity is usually one of two things:

  1. Content validity via job-relatedness; if the job requires several hours per day of Microsoft Excel, then a test on Microsoft Excel makes sense
  2. Predictive validity: numerical reasoning but not be as overtly related to the job as Microsoft Excel, but if you can show that it predicts job performance, then this is helpful.  This is especially true for noncognitive assessments like conscientiousness.

 

Conclusion

There is no doubt that psychometric tests are important in essential aspects of life such as recruitment and education. Not only do they help us understand people, but also simplify the hiring process. However, psychometric tests should be used with caution. It’s advisable to develop a concrete strategy on how you are going to integrate them into your operation mechanism.

Ready To Start Developing Your Own Psychometric Tests? 

test information functionASC’s comprehensive platform provides you with all the tools necessary to develop and securely deliver psychometric assessments. It is equipped with powerful psychometric software, online essay marking modules, advanced reporting, tech-enhanced items, and so much more! You also have access to the world’s greatest psychometricians to help you out if you get stuck in the process!

classroom students exam

If you are delivering high-stakes tests in linear forms – or piloting a bank for CAT/LOFT – you are faced with the issue of how to equate the forms together.  That is, how can we defensibly translate a score on Form A to a score on Form B?  While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. There are a number of ways to approach this issue, and IRT equating is the strongest.

Why do we need equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale.  Suppose you take Form A and get s score of 72/100 while your friend takes Form B and gets a score of 74/100.  Is your friend smarter than you, or did his form happen to have easier questions?  Well, if the test designers built-in some overlap, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items.  Both forms are each delivered to a large, representative sample. Here are the results.

Form Mean score on 50 overlap items Mean score on 100 total items
A 30 72
B 30 74

Because the mean score on the anchor items was higher, we then think that the Form B group was a little smarter, which led to a higher total score.

Now suppose these are the results:

Form Mean score on 50 overlap items Mean score on 100 total items
A 32 72
B 32 74

Now, we have evidence that the groups are of equal ability.  The higher total score on Form B must then be because the unique items on that form are a bit easier.

How do I calculate an equating?

You can equate forms with classical test theory (CTT) or item response theory (IRT).  However, one of the reasons that IRT was invented was that equating with CTT was very weak.  CTT methods include Tucker, Levine, and equipercentile.  Right now, though, let’s focus on IRT.

IRT equating

There are three general approaches to IRT equating.  All of them can be accomplished with our industry-leading software  Xcalibre, though conversion equating requires an additional software called IRTEQ.

  1. Conversion
  2. Concurrent Calibration
  3. Fixed Anchor Calibration

Conversion

With this approach, you need to calibrate each form of your test using IRT, completely separately.  We then evaluate the relationship between IRT parameters on each form and use that to estimate the relationship to convert examinee scores.  Theoretically what you do is line up the IRT parameters of the common items and perform a linear regression, so you can then apply that linear conversion to scores.

But DO NOT just do a regular linear regression.  There are specific methods you must use, including mean/mean, mean/sigma, Stocking & Lord, and Haebara.  Fortunately, you don’t have to figure out all the calculations yourself, as there is free software available to do it for you:  IRTEQ.

Concurrent Calibrationcommon item linking irt equating

The second approach is to combine the datasets into what is known as a sparse matrix.  You then run this single data set through the IRT calibration, and it will place all items and examinees onto a common scale.  The concept of a sparse matrix is typically represented by the figure below, representing the non-equivalent anchor test (NEAT) design approach.

The IRT calibration software will automatically equate the two forms and you can use the resultant scores.

Fixed Anchor Calibration

The third approach is a combination of the two above; it utilizes the separate calibration concept but still uses the IRT calibration process to perform the equating rather than separate software.

With this approach, you would first calibrate your data for Form A.  You then find all the IRT item parameters for the common items and input them into your IRT calibration software when you calibrate Form B.

You can tell the software to “fix” the item parameters so that those particular ones (from the common items) do not change.  Then all the item parameters for the unique items are forced onto the scale of the common items, which of course is the underlying scale from Form A.  This then also forces the scores from the Form B students onto the Form A scale.

How do these IRT equating approaches compare to each other?

concurrent calibration irt equating linking
Concurrent calibration is arguably the easiest but has the drawback that it merges the scales of each form into a new scale somewhere in the middle.  If you need to report the scores on either form on the original scale, then you must use the Conversion or Fixed Anchor approaches.  This situation commonly happens if you are equating across time periods.

Suppose you delivered Form A last year and are now trying to equate Form B.  You can’t just create a new scale and thereby nullify all the scores you reported last year.  You must map Form B onto Form A so that this year’s scores are reported on last year’s scale and everyone’s scores will be consistent.

Where do I go from here?

If you want to do IRT equating, you need IRT calibration software.  All three approaches use it.  I highly recommend  Xcalibre  since it is easy to use and automatically creates reports in Word for you.  If you want to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014).  There are other resources more readily available on the internet, like this free handbook from CCSSO.  If you would like to learn more about IRT, I recommend the books by de Ayala (2008) and Embretson & Reise (2000).  An intro is available in our blog post.

psychometrician laptop

The Rasch model, also known as the one-parameter logistic model, was developed by Danish mathematician Georg Rasch and published in 1960.  Over the ensuing years it has attracted many educational measurement specialists and psychometricians because of its simplicity and ease of computational implementation.  Indeed, since it predated the availability of computers for test development work, it was capable of being implemented using simple calculating equipment available at that time.  The original model, developed for achievement and ability tests scored correct or incorrect, or for dichotomously scored personality or attitude scales, has since been extended into a family of models that work with polytomously scored data, rating scales, and a number of other measurement applications. The majority of those models maintain the simplicity of the original model proposed by Rasch.

In 1960, the Rasch model represented a step forward from classical test theory that had then been the major source of methods for test and scale development, and measuring individual differences, since the early 1900s.  The model was accepted by some educational measurement specialists because of its simplicity, its relative ease of implementation, and most importantly because it maintained the use of the familiar number-correct score to quantify an individual’s performance on a test.  Beyond serving as a bridge from classical test methods, the Rasch model is notable as the first formal statement of what, about ten years later, would be known as item characteristic curve theory or item response theory (IRT).  As an IRT model, the Rasch model placed persons and items on the same scale, and introduced concepts such as item information, test information, conditional standard error of measurement, maximum likelihood estimation, and model fit. Furthermore, the evolution of IRT has given rise to multidimensional item response theory (MIRT), which extends these concepts to accommodate multiple latent traits.

What is the Rasch Model?

As an IRT model, the dichotomous Rasch model can be expressed as an equation,

Rasch model

This equation defines the item response function (or item characteristic curve) for a single test item.  It states that the probability (pi) of a correct (or keyed) response (uij = 1) to an item (i), given a trait level (qj) for a person and the difficulty of an item (bj), is an exponential function of the difference between a person’s trait level and the difficulty of an item.  If the difference between those two terms is zero, the probability is 0.50.  If the person is “above” the item (the item is easy for the person) the probability is greater than 0.50; if the person is “below” the item (the item is difficult for the person) the probability will be less than .50.  The probability correct, therefore, varies with the distance between the person and the item. This characteristic of the Rasch model is found, with some important modifications, in all later IRT models.

Assumptions of the Rasch Model

Based on the above equation, the Rasch model can be seen to make some strong assumptions.  It obviously assumes that the only characteristic of test items that affects a person’s response to the item is its difficulty.  But anyone who has ever done a classical item analysis has observed that items also vary in their discriminations—frequently markedly so.  Yet the Rasch model ignores that information and assumes that all test items have a constant discrimination.  Also, when used with multiple-choice items (or worse, true-false or other items with only two alternatives) guessing can also affect how examinees answer test items—yet the Rasch model also ignores guessing.

Because the model assumes that all items have the same discriminations, it allows examinees to be scored by number correct. But like classical test scoring methods using number-correct scores, their use results in a substantial loss of capability of reflecting individual differences among examinees by using number-correct scoring.  For example, a 6-item number-correct scored test (too short to be useful) can make seven distinctions among a group of examinees, regardless of group size, whereas a more advanced IRT model can make 64 distinctions among those examinees by taking response pattern into account (i.e., which questions were answered correctly and which incorrectly); a 20-item test scored by number-correct results in 21 possible scores—again regardless of group size—whereas non-Rasch IRT scoring will result in 1,048,576 potentially different scores.

Perspectives on the Model

Although the Rasch model is the simplest case of more advanced IRT models, it incorporates a fundamental difference from its more realistic expansions—the two-, three- and four-parameter logistic (or normal ogive) models for dichotomously scored items.  Applications of the Rasch model assume that the model is correct and that observed data must fit the model, Thus, items whose discriminations are not consistent with the model are eliminated in the construction of a measuring instrument.  Similarly, in estimating latent trait scores for examinees, examinees whose responses do not fit the model are eliminated from the calibration data analysis. By contrast, the more advanced and flexible IRT models fit the model to the data.  Although they evaluate model fit for each item (and similarly can evaluate it for each person) the model fit to a given dataset (whether it has two, three, or four item parameters) is the model that best fits that data.  This philosophical—and operational—difference between the Rasch and the other IRT models has important practical implications for the outcome of the test development process.

The Rasch Model and other IRT Models

Although the Rasch model was an advancement is psychometrics in 1960, over the last 60 years it has been replaced by more general models that allow test items to vary in discrimination, guessing, and a fourth parameter.  With the development of powerful computing capabilities, IRT has given rise to a wide-ranging family of models that function flexibly with ability, achievement, personality, attitude, and other educational and psychological variables.  These IRT models are easily implemented with a variety of readily available software packages, and are based on models that can be fit to unidimensional or multidimensional datasets, model response times, and in many respects vastly improve the development of measuring instruments and measurement of individual differences.  Given these advanced IRT models, the Rasch model can best be viewed as an early historical footnote in the history of modern psychometrics.

Implementing the Rasch Model

You will need specialized software.  The most common is WINSTEPS.  You might also be interested in  Xcalibre  (download trial version for free).

Paper-and-pencil testing used to be the only way to deliver assessments at scale.  The introduction of computer-based testing (CBT) in the 1980s was a revelation – higher fidelity item types, immediate scoring & feedback, and scalability all changed with the advent of the personal computer and then later the internet.  Delivery mechanisms including remote proctoring provided students with the ability to take their exams anywhere in the world.  This all exploded tenfold when the pandemic arrived.  So why are some exams still offline, with paper and pencil?

Many education institutions are confused about which examination models to stick to.  Should you go on with the online model they used when everyone was stuck in their homes?  Should you adopt multi-modal examination models, or should you go back to the traditional pen-and-paper method?  

This blog post will provide you with an evaluation of whether paper-and-pencil exams are still worth it in 2021. 

 

Paper-and-pencil testing; The good, the bad, and the ugly

The Good

Answer Bubble Sheet OrangeOffline exams have been a stepping stone towards the development of modern assessment models that are more effective. We can’t ignore the fact that there are several advantages of traditional exams. 

Some advantages of paper-and-pencil testing include students having familiarity with the system, development of a social connection between learners, exemption from technical glitches, and affordability. Some schools don’t have the resources and pen-and-paper assessments are the only option available. 

This is especially true in areas of the world that do not have the internet bandwidth or other technology necessary to deliver internet-based testing.

Another advantage of paper exams is that they can often work better for students with special needs, such as blind students which need a reader.

Paper and pencil testing is often more cost-efficient in certain situations where the organization does not have access to a professional assessment platform or learning management system.

 

The Bad and The Ugly

However, the paper-and-pencil testing does have a number of shortfalls.

1. Needs a lot of resources to scale

Delivery of paper-and-pencil testing at large scale requires a lot of resources. You are printing and shipping, sometimes with hundreds of trucks around the country.  Then you need to get all the exams back, which is even more of a logistical lift.

2. Prone to cheating

Most people think that offline exams are cheat-proof but that is not the case. Most offline exams count on invigilators and supervisors to make sure that cheating does not occur. However, many pen-and-paper assessments are open to leakages. High candidate-to-ratio is another factor that contributes to cheating in offline exams.

3. Poor student engagement

We live in a world of instant gratification and that is the same when it comes to assessments. Unlike online exams which have options to keep the students engaged, offline exams are open to constant destruction from external factors.

Offline exams also have few options when it comes to question types. 

4. Time to score

To err is human.” But, when it comes to assessments, accuracy, and consistency. Traditional methods of hand-scoring paper tests are slow and labor-intensive. Instructors take a long time to evaluate tests. This defeats the entire purpose of assessments.

5. Poor result analysis

Pen-and-paper exams depend on instructors to analyze the results and come up with insight. This requires a lot of human resources and expensive software. It is also difficult to find out if your learning strategy is working or it needs some adjustments. 

6. Time to release results

Online exams can be immediate.  If you ship paper exams back to a single location, score them, perform psychometrics, then mail out paper result letters?  Weeks.

7. Slow availability of results to analyze

Similarly, psychometricians and other stakeholders do not have immediate access to results.  This prevents psychometric analysis, timely feedback to students/teachers, and other issues.

8. Accessibility

Online exams can be built with tools for zoom, color contrast changes, automated text-to-speech, and other things to support accessibility.

9. Convenience

traditional approach vs modern approach

Online tests are much more easily distributed.  If you publish one on the cloud, it can immediately be taken, anywhere in the world.

10. Support for diversified question types

Unlike traditional exams which are limited to a certain number of question types, online exams offer many question types.  Videos, audio, drag and drop, high-fidelity simulations, gamification, and much more are possible.

11. Lack of modern psychometrics

Paper exams cannot use computerized adaptive testing, linear-on-the-fly testing, process data, computational psychometrics, and other modern innovations.

12. Environmental friendliness

Sustainability is an important aspect of modern civilization.  Online exams eliminate the need to use resources that are not environmentally friendly such as paper. 

 

Conclusion

Is paper-and-pencil testing still useful?  In most situations, it is not.  The disadvantages outweigh the advantages.  However, there are many situations where paper remains the only option, such as poor tech infrastructure.

How ASC Can Help 

Transitioning from paper-and-pencil testing to the cloud is not a simple task.  That is why ASC is here to help you every step of the way, from test development to delivery.  We provide you with the best assessment software and access to the most experienced team of psychometricians.  Ready to take your assessments online?  Contact us!