Posts on psychometrics: The Science of Assessment

Positive manifold refers to the fact that scores on cognitive assessments tend to correlate very highly with each other, indicating a very strong common latent dimension.  This latent dimension became known as g for general intelligence or general cognitive ability.  This post discusses what the positive manifold is, but since there are MANY other resources on the definition, the post also explains how this concept is useful in the real world.

The term positive manifold originally came out of work in the field of intelligence testing, including research by Charles Spearman.  There are literally hundreds of studies on this topic, and over one hundred years of research has shown that this concept is scientifically supported, but it is important to remember that it is just a manifold and not a perfect relationship.  That is, we can expect verbal reasoning ability to correlate highly with quantitative reasoning or logical reasoning, but it is by no means a 1-to-1 relationship.  There are certainly some people who can be high on one but not another.  But it is very unlikely for you to be in the 90th percentile on one but the 10th percentile on another.

What is Positive Manifold?

If you were to take a set of cognitive tests, either separate, or as subtests of a battery like the Wechsler Adult Intelligence Scale, and correlate their scores, the correlation matrix would be overwhelmingly positive.  For example, look at Table 2-9 in this book.   Or Table 4 in this article.  There are many, many more examples if you search for keywords like “intelligence intercorrelation.”

As you might expect, related constructs will correlate more highly.  A battery might have a Verbal Reasoning test and a Vocabulary test; we would expect these to correlate more highly with each other (maybe 0.80) than either does with a Figural Reasoning test (maybe 0.50).  Researchers like to use a methodology called factor analysis to analyze this structure and drive interpretations.
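To make the idea concrete, here is a minimal simulation sketch in Python.  It assumes a simple one-factor model in which every test score is a mix of a shared factor g and test-specific noise; the test names, loadings, and sample size are hypothetical values chosen only for illustration, not estimates from any real battery.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def simulate_battery(loadings, n_people=5000, seed=1):
    """Simulate standardized scores that all share one latent factor g.

    score = loading * g + sqrt(1 - loading^2) * test-specific noise
    """
    rng = random.Random(seed)
    scores = {name: [] for name in loadings}
    for _ in range(n_people):
        g = rng.gauss(0, 1)  # the person's general ability
        for name, lam in loadings.items():
            unique = rng.gauss(0, 1)  # ability specific to this test
            scores[name].append(lam * g + math.sqrt(1 - lam ** 2) * unique)
    return scores


# Hypothetical loadings on g (assumptions for illustration only)
loadings = {"verbal": 0.8, "vocabulary": 0.8, "figural": 0.6}
scores = simulate_battery(loadings)
names = list(loadings)
corr = {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}

# Print the correlation matrix; every entry should be positive
for a in names:
    print(a.ljust(12), "  ".join(f"{corr[(a, b)]:5.2f}" for b in names))
```

Under this model the expected correlation between two tests is the product of their loadings, so the matrix comes out entirely positive, and the two tests with higher loadings correlate more strongly with each other than with the third.  Real batteries are messier, which is why factor analysis is used to estimate the structure rather than assume it.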

Practical implications

Positive manifold and the structure of cognitive ability is historically an academic research topic, and remains so.  Researchers are still publishing articles like this one.  However, the concept of positive manifold has many practical implications in the real world.  It affects situations where cognitive ability testing is used to obtain information about people and make decisions about them.  Two of the most common examples are test batteries for admissions/placement or employment.

Admissions/placement exams are used in the education sector to evaluate student ability and make decisions about schools or courses that the student can/should enter.  Admissions refers to whether the student should be admitted to a school, such as a university or a prestigious high school.  Examples of this in the USA are the SAT and ACT exams.  Placement refers to sending students to the right course, such as testing them on Math and English to determine if they are ready for certain courses.  Both of these examples will typically test the student on 3 or 4 aspects, which is an example of a test battery.  The SAT discusses intercorrelations of its subtests in the technical manual (page 104).  Tests like the SAT can provide incremental validity above the predictive power of high school grade point average (HSGPA) alone, as seen in this report.

Employment testing is also often done with several cognitive tests.  You might take psychometric tests to apply for a job, and they test you on quantitative reasoning and verbal reasoning.

In both cases, the tests are validated by doing research to show that they predict a criterion of interest.  In the case of university admissions, this might be First Year GPA or Four Year Graduation Rate.  In the case of Employment Testing, it could be Job Performance Rating by a supervisor or 1-year retention rate.

Why are they using multiple tests?  They are trying to capitalize on the differences to get more predictive power for the criterion.  Success in university isn’t due to verbal/language skills alone, but also logical reasoning and other skills.  They recognize that there is a high correlation, but the differences between the constructs can be leveraged to get more information about people.  Employment testing goes further, and tries to add incremental validity by adding other tests that are unrelated to cognitive ability but relevant to the job world, like job samples or even noncognitive tests like Conscientiousness.  These also correlate with job performance, and therefore help with prediction, but correlate even lower with measures of g than another cognitive test would; because they overlap less with what is already measured, they add more predictive power.
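The arithmetic behind incremental validity can be sketched with the standard formula for the squared multiple correlation of two standardized predictors.  The correlation values below are hypothetical and chosen only to illustrate the pattern; they are not taken from any validity study.

```python
def r_squared_two(r_y1, r_y2, r_12):
    """R^2 when a criterion is regressed on two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion
    r_12:       correlation between the two predictors
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def incremental_validity(r_y1, r_y2, r_12):
    """Gain in R^2 from adding predictor 2 on top of predictor 1."""
    return r_squared_two(r_y1, r_y2, r_12) - r_y1 ** 2


# Hypothetical case A: add a second cognitive test.
# It predicts the criterion well (0.50) but overlaps heavily with the first (0.80).
gain_cognitive = incremental_validity(r_y1=0.50, r_y2=0.50, r_12=0.80)

# Hypothetical case B: add a noncognitive measure.
# It predicts less well on its own (0.30) but barely overlaps with the first (0.10).
gain_noncognitive = incremental_validity(r_y1=0.50, r_y2=0.30, r_12=0.10)

print(f"Gain from second cognitive test: {gain_cognitive:.3f}")
print(f"Gain from noncognitive measure:  {gain_noncognitive:.3f}")
```

With these assumed values, the noncognitive measure adds more than twice as much explained variance as the second cognitive test, even though its own correlation with the criterion is lower.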

 

Authors: 

Laila Issayeva, MS

Nathan Thompson, PhD

 

The Bookmark Method of standard setting (Lewis, Mitzel, & Green, 1996) is a scientifically-based approach to setting cutscores on an examination. It allows stakeholders of an assessment to make decisions and classifications about examinees that are constructive rather than arbitrary (e.g., 70%), meet the goals of the test, and contribute to overall validity. A major advantage of the bookmark method over others is that it utilizes difficulty statistics on all items, making it very data-driven; but this can also be a disadvantage in situations where such data is not available. It also has the advantage of panelist confidence (Karantonis & Sireci, 2006).

The bookmark method operates by delivering a test to a representative sample (or population) of examinees, and then calculating the difficulty statistics for each item. We line up the items in order of difficulty, and experts review the items to place a bookmark where they think a cutscore should be. Nowadays, we use computer screens, but of course in the past this was often done by printing the items in paper booklets, and the experts would literally insert a bookmark.

What is standard setting?

Standard setting (Cizek & Bunch, 2006) is an integral part of the test development process, even though it has historically been undervalued outside of practitioner circles (Bejar, 2008). Standard setting is the methodology of defining achievement or proficiency levels and corresponding cutscores. A cutscore is the score used to classify test takers into categories.

Educational assessments and credentialing examinations are often employed to distribute test takers among ordered categories according to their performance across specific content and skills (AERA, APA, & NCME, 2014; Hambleton, 2013). For instance, in tests used for certification and licensing purposes, test takers are typically classified as “pass” (those who score at or above the cutscore) or “fail” (those who score below it). In education, students are often classified in terms of proficiency; the Nation’s Report Card assessment (NAEP) in the United States classifies students as Below Basic, Basic, Proficient, or Advanced.

However, assessment results could come into question unless the cutscores are appropriately defined. This is why arbitrary cutscores are considered indefensible and lacking validity. Instead, psychometricians help test sponsors to set cutscores using methodologies from the scientific literature, driven by evaluations of item and test difficulty as well as examinee performance.

When to use the bookmark method?

Two approaches are mainly used in international practice to establish assessment standards: the Angoff method (Cizek, 2006) and the Bookmark method (Buckendahl, Smith, Impara, & Plake, 2000). The Bookmark method, unlike the Angoff method, requires the test to be administered prior to defining cutscores based on test data. This provides additional weight to the validity of the process, and better informs the subject matter experts during the process. Of course, many exams require a cutscore to be set before it is published, which is impossible with the bookmark; the Angoff procedure is very useful then.

How do I implement the bookmark method?

The process of standard setting employing the Bookmark method consists of the following stages:

  1. Identify a team of subject matter experts (SMEs); their number should be around 6-12, and led by a test developer/psychometrician/statistician
  2. Analyze test takers’ responses by means of the item response theory (IRT)
  3. Create a list of items ordered by item difficulty, from easiest to hardest
  4. Define the competency levels for test takers; for example, have the 6-12 experts discuss what should differentiate a “pass” candidate from a “fail” candidate
  5. Experts read the items in the ascending order (they do not need to see the IRT values), and place a bookmark where appropriate based on professional judgement across well-defined levels
  6. Calculate thresholds based on the bookmarks set, across all experts
  7. If needed, discuss results and perform a second round
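The calculation in step 6 can be sketched as follows.  This is a minimal illustration, not a definitive implementation: it assumes Rasch item difficulties, a single pass/fail cutscore, a response-probability criterion of 0.67 (a convention often used with the Bookmark method), and the median as the way to combine experts.  The item difficulties, SME labels, and bookmark positions are all hypothetical.

```python
import math
from statistics import median


def rp_location(b, rp=0.67):
    """Theta at which a Rasch item of difficulty b is answered
    correctly with probability rp: solve 1 / (1 + exp(-(theta - b))) = rp."""
    return b + math.log(rp / (1 - rp))


def bookmark_cutscore(difficulties, bookmarks, rp=0.67):
    """Combine SME bookmarks into one cutscore on the theta scale.

    difficulties: Rasch difficulty of every item (any order)
    bookmarks:    dict of SME -> 1-based position, in the ordered list,
                  of the last item the minimally competent candidate
                  is expected to master
    """
    ordered = sorted(difficulties)  # step 3: easiest to hardest
    per_sme = {}
    for sme, position in bookmarks.items():
        if not 1 <= position <= len(ordered):
            raise ValueError(f"{sme}: bookmark {position} is out of range")
        per_sme[sme] = rp_location(ordered[position - 1], rp)
    # Assumption: the panel's cutscore is the median of the SME values
    return median(per_sme.values()), per_sme


# Hypothetical data for illustration only
difficulties = [0.4, -1.2, 1.5, -0.3, 0.9, -2.0, 2.2, 0.0]
bookmarks = {"SME1": 4, "SME2": 3, "SME3": 5, "SME4": 4}

cut, per_sme = bookmark_cutscore(difficulties, bookmarks)
for sme, theta in per_sme.items():
    print(f"{sme}: theta = {theta:.2f}")
print(f"Panel cutscore (median): {cut:.2f}")
```

For an exam with several competency levels, the same function could be called once per bookmark, giving one threshold per boundary between levels.  Printing the per-SME values alongside the median also gives the panel something concrete to discuss before a second round.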

Example of the Bookmark Method

If there are four competency levels such as the NAEP example, then SMEs need to set up three bookmarks in-between: first bookmark is set after the last item in a row that fits the minimally competent candidate for the first level, then second and third. There are thresholds/cutscores from 1 to 2, 2 to 3, and 3 to 4. SMEs perform this individually without discussion, by reading the items.

When all SMEs have provided their opinion, the standard setting coordinator combines all results into one spreadsheet and leads a discussion in which all participants explain the reasoning behind the bookmarks they set. This might look like the sheet below. Note that SME4 had a relatively high standard in their mind, while SME2 had a low standard in their mind – placing virtually every student above an IRT score of 0.0 into the top category!

[Image: bookmark method 1]

After the discussion, the SMEs are given one more opportunity to set the bookmarks again. Usually, after the exchange of opinions, the picture alters. SMEs gain consensus, and the variation in the graphic is reduced.  An example of this is below.

[Image: bookmark method 2]

What to do with the results?

Based on the SMEs’ voting results, the coordinator or psychometrician calculates the final thresholds on the IRT scale, and provides them to the analytical team who would ultimately prepare reports for the assessment across competency levels. This might entail score reports to examinees, feedback reports to teachers, and aggregate reports to test sponsors, government officials, and more.

You can see how the scientific approach will directly impact the interpretations of such reports. Rather than government officials just knowing how many students scored 80-90% correct vs 90-100% correct, the results are framed in terms of how many students are truly proficient in the topic. This makes decisions from test scores – both at the individual and aggregate levels – much more defensible and informative.  They become truly criterion-referenced.  This is especially true when the scores are equated across years to account for differences in examinee distributions and test difficulty, and the standard can be demonstrated to be stable.  For high-stakes examinations such as medical certification/licensure, admissions exams, and many more situations, this is absolutely critical.

Want to talk to an expert about implementing this for your exams?  Contact us.

References

[AERA, APA, & NCME] (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bejar, I. I. (2008). Standard setting: What is it? Why is it important. R&D Connections, 7, 1-6. Retrieved from https://www.ets.org/Media/Research/pdf/RD_Connections7.pdf

Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2000). A comparison of Angoff and Bookmark standard setting methods. Paper presented at the Annual Meeting of the Mid-Western Educational Research Association, Chicago, IL: October 25-28, 2000.

Cizek, G., & Bunch, M. (2006). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests.  Thousand Oaks, CA: Sage.

Cizek, G. J. (2007). Standard setting. In Steven M. Downing and Thomas M. Haladyna (Eds.) Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers, pp. 225-258.

Hambleton, R. K. (2013). Setting performance standards on educational assessments and criteria for evaluating the process. In Setting performance standards, pp. 103-130. Routledge. Retrieved from https://www.nciea.org/publications/SetStandards_Hambleton99.pdf

Karantonis, A., & Sireci, S. (2006). The Bookmark Standard‐Setting Method: A Literature Review. Educational Measurement: Issues and Practice, 25(1), 4-12.

Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996, June). Standard setting: A Book-mark approach. In D. R. Green (Chair), IRT-based standard setting procedures utilizing behavioral anchoring. Symposium conducted at the Council of Chief State School Officers National Conference on Large-Scale Assessment, Phoenix, AZ.

 

 

The COVID-19 pandemic has driven a surge in the number of certification exams with remote proctoring.  Obviously, the decision to migrate to this model is not one to be taken lightly.  This post discusses some of the issues and best practices to consider.

1. Determine the best type of remote proctoring

As I discussed at length in this article, there is a surprisingly wide range of remote proctoring out there, with dozens of vendors.  It can be overwhelming.  So first, determine which of the three main types of remote proctoring you need.

A. AI only

AI only proctoring means that there are no humans.  The examinee is recorded on video, and AI algorithms flag potential issues, such as the examinee leaving their seat, then notify an administrator (usually a professor) of students with a high number of flags.  This approach is usually not relevant for certifications or other credentialing exams; it is more for low-stakes exams like a Psychology 101 Midterm at your local university.  The vendors for this approach are interested in large-scale projects, such as proctoring all midterms and finals at a university, perhaps hundreds of thousands of exams per year.

B. Record and Review

Record and review proctoring means that the examinee is recorded on video, but that video is watched by a real human, who flags it if they think there is cheating, theft, or other issues.  This is much higher quality, and higher price, but has one major flaw that might be concerning to certification tests: if someone steals your test by taking pictures, you won’t find out until tomorrow.  But at least you know who it was and you are certain of what happened, with video proof.

C. Live Online Proctoring

Live online proctoring (LOP), or what I call “live human proctoring” (because some AI proctoring is also “live” in real time!) means that there is a professional human proctor on the other side of the video from the examinee.  They check the examinee in, confirm their identity, scan the room, provide instructions, and actually watch them take the test.  Some providers like MonitorEDU even have the examinee make a second video stream on their phone, which is placed on a bookshelf or similar spot to see the entire room throughout the test.  Certainly, this approach is a very good fit with certification exams and other credentialing.  You protect the test content as well as the validity of that individual’s score.

2. Determine other technology, psychometric, and operational needs

Next, your organization should establish the other needs for your exams.  Do you require special item types?  Perhaps adaptive testing or linear on the fly testing?  Psychometric consulting services?  Specific operational controls such as exam time/date windows or navigation limits?  Write all these up so that you can use the list to shop for a provider.

3. Find an integrated provider for certification exams with remote proctoring

Most providers of remote proctoring are just that: remote proctoring.  They do not have a professional platform to manage item banks, schedule examinees, deliver tests, create custom score reports, and analyze psychometrics.  Some do not even integrate with such platforms, and only integrate with learning management systems like Moodle, seeing as their entire target market is only low-stakes university exams.  So if you are seeking a vendor for certification testing or other credentialing, the list of potential vendors is smaller.

Our flagship platform, FastTest, works with 5 different remote proctoring providers and can easily integrate with more.

[Image: FastTest exam development]

 

4. Establish the new process

Once you have selected a vendor, work with them to establish the new process for delivering your certification exams with remote proctoring.  Remember, this goes FAR beyond exam day!

  • Candidate Handbook
  • Registration and scheduling
  • Candidate training and practice tests
  • Exam delivery (including verification, environmental rules, materials allowed, break policy, etc.)
  • Test security plan:  What do you do if someone is caught taking pictures of the exam with their phone, or if other potential events occur?

 

Ready to start?

ASC is one of the world leaders in this endeavor.  Contact us to get a free account in our platform and experience the examinee process, or to receive a demonstration from one of our experts.

A test battery or assessment battery is a set of multiple psychometrically-distinct exams delivered in one administration.  In some cases, these are various tests that are cobbled together for related purposes, such as a psychologist testing an 8-year-old child on their intelligence, anxiety, and autism spectrum.  However, in many cases it is a single test title that we often refer to as a single test but is actually several separate tests, like a university admissions test that has English, Math, and Logical Reasoning components.  Why do so? The key here is that we want to keep them psychometrically separate, but maximize the amount of information about the person to meet the purposes of the test.

Examples of a Test Battery

Test batteries are used in a variety of fields, pretty much anywhere assessment is done.

Admissions and Placement Testing

The classic example is a university admissions test that has English, Math, and Logic portions.  These are separate tests, and psychometricians would calculate the reliability and other important statistics separately.  However, the scores are combined at the end to get an overall picture of examinee aptitude or achievement, and that picture is used to maximally predict 4-year graduation rates and other important criterion variables.
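As a rough sketch of what "calculating reliability separately" can look like, the example below computes Cronbach's alpha for each test in a battery on its own.  Alpha is only one common reliability index and is used here as an assumption; the tiny response matrices (rows are examinees, columns are items scored 0 or 1) are made up for illustration.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for one test.

    responses: list of rows, one per examinee, each a list of item scores.
    """
    k = len(responses[0])  # number of items
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical battery: each test keeps its own response matrix
battery = {
    "english": [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]],
    "math": [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 0], [0, 0, 0]],
}

# Reliability is estimated per test, never on the pooled items
alphas = {name: cronbach_alpha(rows) for name, rows in battery.items()}
for name, alpha in alphas.items():
    print(f"{name}: alpha = {alpha:.3f}")
```

Real analyses would use far more examinees and items, and often IRT-based statistics as well, but the structure is the same: one set of statistics per test, with scores combined only afterwards.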

Why is it called a battery?  Because we are battering the poor student with not just one, but many exams!

Pre-Employment Testing

Exam batteries are often used in pre-employment testing.  You might get tested on computer skills, numerical reasoning, and noncognitive traits such as integrity or conscientiousness. These are used together to gain incremental validity.  A good example is the CAT-ASVAB, which is the selection test to get into the US Armed Forces.  There are 10 tests (vocabulary, math, mechanical aptitude…).

Psychological or Psychoeducational Assessment

In a clinical setting, clinicians will often use a battery of tests, such as IQ, autism, anxiety, and depression.  Some IQ tests are themselves a battery, as they might assess visual reasoning, logical reasoning, numerical reasoning, etc.  However, these have a positive manifold, meaning that they correlate quite highly with each other.  Another example is the Woodcock-Johnson.

K-12 Educational Assessment

Many large-scale tests that are used in schools are considered a battery, though often with only 2 or 3 aspects.  A common one in the USA is the NWEA Measures of Academic Progress.

Composite Scores

A composite score is a combination of scores in a battery.  If you took an admissions test like the SAT or GRE, you may recall that it adds your scores on the different subtests, while the ACT takes the average.  The ASVAB takes a linear combination of the 4 most important subtests and uses it for admission; the others are used for job matching.
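The three kinds of composite described above (sum, average, and linear combination) can be sketched in a few lines.  The subtest names, scores, and weights below are hypothetical and are not the actual scales or weights used by any of the exams mentioned.

```python
def composite_sum(scores):
    """Composite as the simple sum of subtest scores."""
    return sum(scores.values())


def composite_mean(scores):
    """Composite as the average of subtest scores."""
    return sum(scores.values()) / len(scores)


def composite_weighted(scores, weights):
    """Composite as a linear combination; subtests without a weight are ignored."""
    return sum(weight * scores[name] for name, weight in weights.items())


# Hypothetical examinee with six subtest scores
scores = {
    "verbal": 50,
    "math": 60,
    "science": 40,
    "mechanical": 70,
    "electronics": 55,
    "assembly": 45,
}

# Hypothetical weights on four subtests; the other two would be
# reported separately (for example, for job matching)
weights = {"verbal": 3, "math": 2, "science": 1, "mechanical": 1}

print("Sum:     ", composite_sum(scores))
print("Mean:    ", round(composite_mean(scores), 2))
print("Weighted:", composite_weighted(scores, weights))
```

Keeping the subtest scores in a dictionary and the weights in a separate one makes it easy to report each subtest on its own while still producing whichever composite the test's purpose calls for.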

A Different Animal: Test with Sections

The battery is different than a single test that has distinct sections.  For example, a K12 English test might have 10 vocab items, 10 sentence-completion grammar items, and 2 essays.  Such tests are usually analyzed as a single test, as they are psychometrically unidimensional.

How to Deliver A Test Battery

In ASC’s platforms, Assess.ai and FastTest, all this functionality is available out of the box: test batteries, composite scores, and sections within a test.  Moreover, they come with a lot of important functionality, such as separation of time limits, navigation controls, customizable score reporting, and more.  Click here to request a free account and start applying best practices.