Psychometrics is the science of educational and psychological assessment, using data to ensure that tests are fair and accurate. Ever taken a test that felt unfair: too hard, not covering the right topics, or full of questions that were confusing or poorly written? Psychometricians are the people who help organizations fix these things using data science, as well as more advanced approaches like designing AI algorithms that adapt a test to each examinee.
Psychometrics is a critical aspect of many fields. Having accurate information on people is essential to education, human resources, workforce development, corporate training, professional certifications/licensure, medicine, and more. It scientifically studies how tests are designed, developed, delivered, validated, and scored.
Key Takeaways on Psychometrics
- Psychometrics is the study of how to measure and assess mental constructs, such as intelligence, personality, or knowledge of accounting law
- Psychometrics is NOT just screening tests for jobs
- Psychometrics is dedicated to making tests more accurate and fair
- Psychometrics is heavily reliant on data analysis and machine learning, such as item response theory
What is Psychometrics? Definition & Meaning
Psychometrics is the study of assessment itself, regardless of what type of test is under consideration. In fact, many psychometricians don’t even work on a particular test; they work on psychometrics itself, such as new methods of data analysis. Many professionals are less concerned with what a given test measures than with measuring it well, and will often move between jobs in completely unrelated areas, such as from a K-12 testing company to psychological measurement to an accountancy certification exam. We often refer to whatever we are measuring simply as “theta” – a term from item response theory.
Psychometrics tackles fundamental questions around assessment, such as how to determine if a test is reliable or if a question is of good quality, as well as much more complex questions, like how to ensure that a score today on a university admissions exam means the same thing as it did 10 years ago. It also examines phenomena like the positive manifold, the finding that different cognitive abilities tend to be positively correlated, which supports the consistency and generalizability of test scores over time.
Psychometrics is a branch of data science. In fact, it was around long before that term became a buzzword. Don’t believe me? Check out the Coursera course on Data Science: the first example it gives as one of the foundational historical projects in data science is… psychometrics! (Early research on factor analysis of intelligence.)
Even though assessment is everywhere and Psychometrics is an essential aspect of assessment, to most people it remains a black box, and professionals are referred to as “psychomagicians” in jest. However, a basic understanding is important for anyone working in the testing industry, especially those developing or selling tests.
Psychometrics is NOT limited to very narrow types of assessment. Some people use the term interchangeably with concepts like IQ testing, personality assessment, or pre-employment testing. These are each but tiny parts of the field! Also, it is not the administration of a test.
Why do we need Psychometrics?
The purpose of tests is to provide useful information about people, such as whether to hire them, certify them in a profession, or determine what to teach them next in school. Better tests mean better decisions. Why? The scientific evidence is overwhelming that tests provide better information for decision makers than many other sources, such as interviews, resumes, or educational attainment. Thus, tests serve an extremely useful role in our society.
The goal of psychometrics is to provide validity: evidence that the interpretations of scores from the test are what we intended. If passing a certification test is supposed to mean that someone meets the minimum standard to work in a certain job, we need a lot of evidence for that, especially since the test is so high stakes. Meta-analysis, a key tool in psychometrics, aggregates findings across many studies into robust evidence on the reliability and validity of tests, which is especially crucial for high-stakes certification exams where accuracy and fairness are paramount.
What does Psychometrics do?
Building and maintaining a high-quality test is not easy. A lot of big issues can arise. Much of the field revolves around solving major questions about tests: what should they cover, what is a good question, how do we set a good cutscore, how do we make sure that the test predicts job performance or student success, etc. Many of these questions align with the test development cycle – more on that later.
How do we define what should be covered by the test? (Test Design)
Before writing any items, you need to define very specifically what will be on the test. If the test is in credentialing or pre-employment, psychometricians typically run a job analysis study to form a quantitative, scientific basis for the test blueprints. A job analysis is necessary for a certification program to get accredited. In education, the test coverage is often defined by the curriculum.
How do we ensure the questions are good quality? (Item Writing)
There is a corpus of scientific literature on how to develop test items that accurately measure whatever you are trying to measure. A great overview is the book by Haladyna. This is not just limited to multiple-choice items, although that approach remains popular. Psychometricians leverage their knowledge of best practices to guide the item authoring and review process in a way that the result is highly defensible test content. Professional item banking software provides the most efficient way to develop high-quality content and publish multiple test forms, as well as store important historical information like item statistics.
How do we set a defensible cutscore? (Standard Setting)
Test scores are often used to classify candidates into groups, such as pass/fail (Certification/Licensure), hire/non-hire (Pre-Employment), and below-basic/basic/proficient/advanced (Education). Psychometricians lead studies to determine the cutscores, using methodologies such as Angoff, Beuk, Contrasting-Groups, and Borderline.
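To make the arithmetic concrete, here is a minimal sketch (with entirely made-up ratings) of how a modified-Angoff cutscore can be derived: each SME judges the probability that a minimally competent candidate would answer each item correctly, and the ratings are averaged per item and summed.

```python
import numpy as np

# Hypothetical modified-Angoff ratings: each row is one SME, each column
# is one item. A rating is the judged probability that a minimally
# competent (borderline) candidate answers that item correctly.
ratings = np.array([
    [0.60, 0.75, 0.90, 0.55, 0.80],   # SME 1
    [0.65, 0.70, 0.85, 0.50, 0.75],   # SME 2
    [0.55, 0.80, 0.95, 0.60, 0.85],   # SME 3
])

item_means = ratings.mean(axis=0)   # consensus rating per item
cutscore = item_means.sum()         # expected raw score of a borderline candidate

print(f"Item means: {np.round(item_means, 2)}")
print(f"Recommended cutscore: {cutscore:.1f} out of {ratings.shape[1]}")
```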
How do we analyze results to improve the exam? (Psychometric Analysis)
Psychometricians are essential for this step, as the statistical analyses can be quite complex. Smaller testing organizations typically utilize classical test theory, which is based on simple mathematics like proportions and correlations. Large, high-profile organizations typically use item response theory (IRT), which is based on a type of nonlinear regression analysis. Psychometricians evaluate the overall reliability of the test, the difficulty and discrimination of each item, distractor performance, possible bias, multidimensionality, the linking of multiple test forms/years, and much more. Software such as Iteman and Xcalibre is also available for organizations with enough expertise to run statistical analyses internally. Scroll down for examples.
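For a flavor of what a classical analysis involves, here is a minimal sketch using made-up response data. It computes the two workhorse statistics of classical test theory, item difficulty (the P-value) and item discrimination (the point-biserial), both defined in the glossary below.

```python
import numpy as np

# Hypothetical scored responses: rows are examinees, columns are items,
# 1 = correct, 0 = incorrect.
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

total = scores.sum(axis=1)

# Classical item difficulty: the P-value, i.e., proportion correct.
p_values = scores.mean(axis=0)

# Classical item discrimination: the point-biserial, i.e., the Pearson
# correlation between each item score and the total score (uncorrected
# here for the item's own contribution to the total).
point_biserials = np.array(
    [np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])]
)

for j, (p, rpb) in enumerate(zip(p_values, point_biserials), start=1):
    print(f"Item {j}: P = {p:.2f}, point-biserial = {rpb:+.2f}")
```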
How do we compare scores across groups or years? (Equating)
This is referred to as linking and equating. Some psychometricians devote their entire career to this topic. If you are working on a certification exam, for example, you want to make sure that the passing standard is the same this year as last year. If the pass rate was 76% last year and only 25% this year, not only will the candidates be angry, but there will be much less confidence in the meaning of the credential.
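As a simplified illustration, here is a sketch of linear equating with made-up total scores. It assumes randomly equivalent groups; real common-item designs such as Tucker equating or IRT true-score equating are more involved.

```python
import numpy as np

# Hypothetical total scores on last year's form (X) and this year's
# form (Y), from two examinee groups assumed to be equivalent.
old_form = np.array([72, 65, 80, 77, 69, 74, 71, 83])
new_form = np.array([68, 60, 75, 70, 64, 71, 66, 78])

# Linear equating: map new-form scores onto the old-form scale so that
# the two score distributions share the same mean and standard deviation.
slope = old_form.std(ddof=1) / new_form.std(ddof=1)
intercept = old_form.mean() - slope * new_form.mean()

equated = slope * new_form + intercept
print(np.round(equated, 1))
```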
How do we know the test is measuring what it should? (Validity)
Validity is the evidence provided to support score interpretations. For example, we might interpret scores on a test to reflect knowledge of English, and we need to provide documentation and research supporting this. There are several ways to provide this evidence. A straightforward approach is to establish content-related evidence, which includes the test definition, blueprints, and item authoring/review. In some situations, criterion-related evidence is important, which directly correlates test scores to another variable of interest. Delivering tests in a secure manner is also essential for validity.
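Criterion-related evidence is often summarized as a simple correlation. Here is a minimal sketch with hypothetical data, correlating test scores with later supervisor ratings of job performance.

```python
import numpy as np

# Hypothetical data: pre-employment test scores and later supervisor
# ratings of job performance for the same eight people.
test_scores = np.array([55, 62, 71, 48, 80, 66, 59, 74])
job_ratings = np.array([3.1, 3.4, 4.2, 2.8, 4.6, 3.9, 3.3, 4.0])

validity_coefficient = np.corrcoef(test_scores, job_ratings)[0, 1]
print(f"Criterion-related validity: r = {validity_coefficient:.2f}")
```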
Where is Psychometrics Used?
Certification/Licensure/Credentialing
In certification testing, psychometricians develop the test via a documented chain of evidence following a sequence of research outlined by accreditation bodies, typically: job analysis, test blueprints, item writing and review, cutscore study, and statistical analysis. Web-based item banking software like FastTest is typically useful because the exam committee often consists of experts located across the country or even throughout the world; they can then easily log in from anywhere and collaborate.
Pre-Employment
In pre-employment testing, validity evidence relies primarily on establishing appropriate content (a test on PHP programming for a PHP programming job) and the correlation of test scores with an important criterion like job performance ratings (shows that the test predicts good job performance). Adaptive tests are becoming much more common in pre-employment testing because they provide several benefits, the most important of which is cutting test time by 50% – a big deal for large corporations that test a million applicants each year. Adaptive testing is based on item response theory, and requires a specialized psychometrician as well as specially designed software like FastTest.
K-12 Education
Most assessments in education fall into one of two categories: lower-stakes formative assessment in classrooms, and higher-stakes summative assessments like year-end exams. Psychometrics is essential for establishing the reliability and validity of higher-stakes exams, and for equating scores across different years. It is also important for formative assessments, which are moving towards adaptive formats because of the 50% reduction in test time, meaning that students spend less time testing and more time learning.
Universities
Universities typically do not give much thought to psychometrics, even though a significant amount of testing occurs in higher education, especially with the move to online learning and MOOCs. Given that many of the exams are high stakes (consider a certificate exam after completing a year-long graduate program!), psychometricians should be involved in establishing legally defensible cutscores, in statistical analysis to ensure reliable tests, and in using professionally designed assessment systems to develop and deliver tests, especially with enhanced security.
Medicine/Psychology
Have you ever taken a survey at your doctor’s office, or before/after a surgery? Perhaps a depression or anxiety inventory at a psychotherapist? Psychometricians have worked on these.
The Test Development Cycle
Psychometrics is the core of the test development cycle, which is the process of developing a strong exam. It is sometimes called by similar names, such as the assessment lifecycle.
You will recognize some of the terms from the introduction earlier. What we are trying to demonstrate here is that those questions are not standalone topics, nor something you do once and simply file a report. An exam is usually a living thing. Organizations often republish a new version every 6 or 12 months, which means that much of the cycle is repeated on that timeline. Not all of it is repeated, though; for example, many organizations only do a job analysis and standard setting every 5 years.
Consider a certification exam in healthcare. The profession does not change quickly, because things like anatomy never change and medical procedures rarely do (e.g., how to measure blood pressure). So, every 5 years the certification board does a job analysis of its certificants to see what they are doing and what is important. This is then converted to test blueprints. Items are re-mapped if needed, but most likely do not need it, because there are probably only minor changes to the blueprints. Then a new cutscore is set with the modified-Angoff method, and the test is delivered this year. It is delivered again next year, but equated to this year rather than starting over. However, the item statistics are still analyzed, which leads to a new cycle of revising items and publishing a new form for next year.
Example of Psychometrics in Action
Here is some output from our Iteman software, deeply analyzing a single question on English vocabulary to see if the student knows the word alleviate. About 70% of the students answered correctly, with a very strong point-biserial. Each distractor was chosen by only a minority of students, and the distractor point-biserials were all negative, which adds validity evidence. The graph shows that the line for the correct answer goes up while the others go down, which is good. If you are familiar with item response theory, you’ll notice that the blue line is similar to an item response function. That is not a coincidence.
Now, let’s look at another one, which is more interesting. Here’s a vocabulary question about the word confectioner. Note that only 37% of the students get it right… even though there is a 25% chance of getting it right just by guessing! However, the point-biserial discrimination remains very strong at 0.49. That means it is a really good item; it’s just hard, which means it does a great job of differentiating among the top students.
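If you want to reproduce this type of distractor analysis yourself, here is a minimal sketch with invented responses to a single four-option item. For the keyed option we want a positive point-biserial; for each distractor, a negative one.

```python
import numpy as np

# Hypothetical raw responses to one four-option item (key = 'B'),
# alongside each examinee's total test score.
responses = np.array(list("BBABDCBBCB"))
totals = np.array([82, 75, 40, 88, 35, 42, 79, 71, 38, 84])

for option in "ABCD":
    chosen = (responses == option).astype(float)
    prop = chosen.mean()
    # Option-level point-biserial: positive for the key, ideally
    # negative for each distractor.
    rpb = np.corrcoef(chosen, totals)[0, 1]
    print(f"Option {option}: P = {prop:.2f}, point-biserial = {rpb:+.2f}")
```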
A Glossary of Psychometric Terms
Accreditation: Accreditation by an outside agency affirms that an organization has met a certain level of standards. Certification testing programs may become accredited by meeting specified standards in test development, psychometrics, bylaws, management, etc. Learn more.
Adaptive Test: A test that is delivered with an AI-based algorithm that personalizes it to each examinee, thereby making it much more secure and accurate while decreasing test length. Learn more.
Achievement: The psychometric term for measuring something that a student has learned, such as knowledge of the 9th grade biology curriculum, rather than an innate construct such as intelligence or conscientiousness.
Aptitude: A construct that is measured which is innate, usually in a cognitive context. For example, logical reasoning ability.
Biserial Correlation: A classical index of item discrimination, highly similar to the more commonly used point-biserial. The biserial correlation assumes that the item scores and test scores reflect an underlying normal distribution, which is not always the case.
Blueprint: A test blueprint, or test specification, details how an exam is to be constructed. It includes important information, such as the total number of items, the number of items in each content area or domain, the number of items that are recall versus reasoning, and the item formats to be utilized.
Certification: A non-mandatory testing program that certifies that candidates have achieved a minimum standard of knowledge or performance.
Classical Test Theory (CTT): A psychometric analysis and test development paradigm based on correlations, proportions, and other statistics that are relatively simple compared to IRT. It is, therefore, more appropriate for smaller samples, especially those with fewer than 100 examinees.
Classification: The use of tests for classifying candidates into categories, such as pass/fail, nonmaster/master, or basic/proficient/advanced.
Cognitive Diagnostic Models (CDMs) aka Diagnostic Measurement Models (DMMs): A relatively new psychometric paradigm that frames the measurement problem not as one latent trait, but rather as individual skills that must be mastered. So rather than treating 4th grade math achievement as a single scale, there are locations for adding fractions, dividing fractions, multiplying decimals, etc. Can be used in concert with IRT. Learn more.
Computerized Adaptive Testing (CAT): A dynamic method of test administration where items are selected one at a time to match item difficulty and candidate ability as closely as possible. This helps prevent candidates from being presented with items that are too difficult or too easy for them, which has multiple benefits. Often, the test only takes half as many items to obtain a similar level of accuracy to form-based tests. This reduces the testing time per examinee and also reduces the total number of times an item is exposed, as well as increasing security by the fact that nearly every candidate will receive a different set of items.
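Here is a deliberately simplified sketch of the core selection loop, with an invented Rasch item bank. Production CATs use maximum-information item selection and proper ability estimation (maximum likelihood or Bayesian), not the crude fixed-step update shown here.

```python
import numpy as np

def next_item(theta, bank, administered):
    """Pick the unadministered item whose difficulty (b) is closest to
    the current ability estimate -- a simplified stand-in for the
    maximum-information rules used in production CATs."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return min(candidates, key=lambda i: abs(bank[i] - theta))

# Hypothetical item bank: Rasch difficulty (b) parameters only.
bank = np.array([-2.0, -1.2, -0.5, 0.0, 0.4, 1.1, 1.8, 2.5])

theta = 0.0                          # start at an average ability estimate
administered = set()
for correct in [True, True, False]:  # a made-up response pattern
    i = next_item(theta, bank, administered)
    administered.add(i)
    # Crude fixed-step update; real CATs re-estimate theta after
    # every response.
    theta += 0.7 if correct else -0.7
    print(f"Item {i} (b = {bank[i]:+.1f}), "
          f"answered {'right' if correct else 'wrong'}, theta -> {theta:+.1f}")
```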
Computerized Classification Testing (CCT): An approach similar to CAT, but with different algorithms to reflect the fact that the purpose of the test is only to make a broad classification and not obtain a highly accurate point estimate of ability.
Concurrent Validity: An aspect of validity (see below) that correlates a test with other variables measured at the same time, to which we expect it to relate. A university admissions test should correlate with high school grade point average – but not perfectly, since if they measured exactly the same construct, what would be the point of having the test?
Cutscore: Also known as a passing score, the cutscore is the score that a candidate must achieve to obtain a certain classification, such as “pass” on a licensure or certification exam.
Criterion-Referenced: A test score (not a test) is criterion-referenced if it is interpreted with regard to a specified criterion and not compared to the scores of other candidates. For instance, providing the number-correct score does not convey any information regarding a candidate’s relative standing.
Differential Item Functioning (DIF): A specific type of analysis that evaluates whether an item functions differently for a particular subgroup, indicating potential bias. This is different from overall test bias.
Distractors: Distractors are the incorrect options of a multiple-choice item. A distractor analysis is an important part of psychometric review, as it helps determine whether a distractor is inadvertently acting as a keyed (correct) response. Learn more.
Equating: A psychometric term for the process of determining comparable scores on different forms of an examination. For example, if Form A is more difficult than Form B, it might be desirable to adjust scores on Form A upward for the purposes of comparing them to scores on Form B. Usually, this is done statistically based on items that appear on both forms, which are called equator, anchor, or common items. Because the groups who took the two forms are different, this is called a common-item nonequivalent groups design.
Factor Analysis: An approach to analyzing complex data that seeks to break it down into major components or factors. Used in many fields nowadays, but originally developed for psychometrics. Two of the most common examples are the extensive research finding that personality items/measures boil down to the Big Five, and that intelligence items/measures boil down to general cognitive ability (though there is evidence of distinct facets within the positive manifold).
Form: Forms are specific sets of items that are administered together for a test. For example, if a test included a certain set of 100 items this year and a different set of 100 items next year, these would be two distinct forms.
Item: The basic component of a test, often colloquially referred to as a “question,” but items are not necessarily phrased as a question. They can be as varied as true/false statements, rating scales, and performance task simulations, in addition to the ubiquitous multiple-choice item.
Item Bank: A repository of items for a testing program, including items at all stages, such as newly written, reviewed, pretested, active, and retired.
Item Banker: A specialized software program that facilitates the maintenance and growth of an item bank by recording item stages, statistics, notes, and other characteristics.
Item Difficulty: A statistical index of how easy/hard the item is with respect to the underlying ability/trait. That is, an item is difficult if not many people get it correct or respond in the keyed direction.
Item Discrimination: A statistical index of the quality of the item, assessing how well it differentiates examinees of high versus low ability. Items with low discrimination are considered poor quality and are candidates to be revised or retired.
Item Response Theory (IRT): A comprehensive approach to psychometric analysis and test development that utilizes complex mathematical models. This provides several benefits, including the ability to design CATs, but requires larger sample sizes. A common rule of thumb is 100 candidates for the one-parameter model and 500 for the three-parameter model.
a: The item response theory index of item discrimination, analogous to the point-biserial and biserial correlations in classical test theory. It reflects the slope of the item response function. Often ranging from 0.1 to 2.0 in practice, a higher value indicates a better-performing item.
b: The item response theory index of item difficulty or location, analogous to the P-value (P+) of classical test theory. Typically ranging from -3.0 to 3.0 in practice, a higher value indicates a more difficult item.
c: The item response theory pseudo-guessing parameter, representing the lower asymptote of the item response function. It is theoretically near the value of 1/k, where k is the number of alternatives. For example, with the typical four-option multiple-choice item, a candidate has a base chance of 25% of guessing the correct answer.
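Putting the a, b, and c parameters together, here is a minimal sketch of the three-parameter logistic (3PL) item response function, using made-up parameter values.

```python
import numpy as np

def p_correct(theta, a, b, c):
    """Three-parameter logistic (3PL) item response function:
    probability of a correct response given ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: good discrimination, moderate difficulty, and a
# guessing floor of 0.25 (a four-option multiple-choice item).
a, b, c = 1.2, 0.5, 0.25

for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}: P(correct) = {p_correct(theta, a, b, c):.2f}")
```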
Item Type: Items (test questions) can be a huge range of formats. We are all familiar with single best answer multiple choice, but there are many others. Some of these are: multiple response, drag and drop, essay, scored short answer, and equation editor.
Job Analysis: Also known as practice analysis or role delineation study, job analysis is a formal study used to determine the structure of a job and the KSAs important to success or competence. This is then used to establish the test blueprint for a professional testing program, a critical step in the chain of evidence for validity.
Key: The key is the correct response to an item.
KSA: KSA is an acronym for knowledge, skills, and abilities. A critical step in testing for employment or professional credentials is to determine the KSAs that are important in a job. This is often done via a job analysis study.
Licensure: A testing program mandated by a government body. The test must be passed in order to perform the task in question, whether it is to work in the profession or drive a car.
Norm-Referenced: A test score (not a test) is norm-referenced if it is interpreted with regard to the performance of other candidates. Percentile rank is an example of this because it does not provide any information regarding how many items the candidate got correct.
P-value: A classical index of item difficulty, presented as the proportion of candidates who correctly responded to the item. A value above 0.90 indicates an easy item, while a value below 0.50 indicates a relatively difficult item. Note that it is inverted; a higher value indicates less difficulty.
Point-Biserial Correlation: A classical index of item discrimination, calculated as the Pearson correlation between the item score and the total test score. If below 0.0, low-scoring candidates are actually doing better than high-scoring candidates, and the item should be revised or retired. Low positive values are marginal, higher positive values are ideal.
Polytomous: A psychometric term for item data with more than two possible score categories. Multiple-choice items, while having 3-5 options, are usually still dichotomous (0/1 points). Examples of polytomous items are a Likert-style rating scale (“rate on a scale of 1 to 5”) and partial-credit items or rubrics (scoring an essay as 0 to 5 points).
Power Test: A test where the goal is to measure the maximal knowledge, ability, or trait of the examinee. For example, a medical certification exam with a generous time limit.
Predictive Validity: An aspect of validity (see below) that focuses on how well the test predicts important outcomes. A university admissions test should predict 4-year graduation probability very well, and a pre-employment test on MS Excel should predict job performance for bookkeepers.
Pretest (or Pilot) Item: An item that is administered to candidates simply for the purposes of obtaining data for future psychometric analysis. The results on this item are not included in the score. It is often prudent to include a small number of pretest items in a test.
Reliability: A psychometric term for the repeatability or consistency of the measurement process. Often, this is indexed by a single number, most commonly the internal consistency index coefficient alpha or its dichotomous formulation, KR-20. Under most conditions, these range from 0.0 to 1.0, with 1.0 being perfectly reliable measurement. However, just because a test is reliable does not mean that it is valid (i.e., measures what it is supposed to measure).
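Coefficient alpha is straightforward to compute from a scored response matrix. Here is a minimal sketch with made-up dichotomous data; with 0/1 items the result is equivalent to KR-20.

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha for an examinee-by-item score matrix; with
    dichotomous (0/1) items this is equivalent to KR-20."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical scored responses: rows are examinees, columns are items.
scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
])

print(f"Coefficient alpha = {coefficient_alpha(scores):.2f}")
```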
Scaling: Scaling is a process of converting scores obtained on an exam to an arbitrary scale. This is done so that all the forms and exams used by a testing organization are on a common scale. For example, suppose an organization had two testing programs, one with 50 items and one with 150 items. All scores could be put on the same scale to standardize score reporting.
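A common approach is a simple linear transformation of standardized scores. Here is a minimal sketch, converting made-up raw scores to a hypothetical reporting scale with mean 500 and standard deviation 100.

```python
import numpy as np

# Hypothetical raw scores from a 50-item form, converted to a scale
# with mean 500 and standard deviation 100.
raw = np.array([31, 42, 27, 38, 45, 33])
z = (raw - raw.mean()) / raw.std(ddof=1)
scaled = 500 + 100 * z
print(np.round(scaled).astype(int))
```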
Speeded Test: A test where the purpose is to see how fast the examinee can answer questions. The questions are therefore not usually knowledge based. For example, seeing how many 5-digit zip codes they can correctly type in 60 seconds. Learn more.
Standard Error of Measurement: A psychometric term for a concept that quantifies the amount of error in an examinee’s score, since measurement is never perfect; even with the best math test, a student’s result might vary between today and next week. The concept differs substantially in classical test theory vs. item response theory.
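In classical test theory, the standard error of measurement follows directly from the test’s standard deviation and reliability. A minimal sketch with made-up values:

```python
import numpy as np

# Classical standard error of measurement: SEM = SD * sqrt(1 - reliability).
# Hypothetical values: observed-score SD of 8 points, alpha of 0.91.
sd, reliability = 8.0, 0.91
sem = sd * np.sqrt(1.0 - reliability)
print(f"SEM = {sem:.1f} points (about a +/- {1.96 * sem:.1f} point 95% band)")
```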
Standard-Setting Study: A formal study conducted by a testing organization to determine standards for a testing program, which are manifested as a cutscore. Common methods include the Angoff, Bookmark, Contrasting Groups, and Borderline Survey methods.
Subject Matter Expert (SME): An extremely knowledgeable person within the test development process. SMEs are necessary to write items, review items, participate in standard-setting studies and job analyses, and oversee the testing program to ensure its fidelity to its true intent.
Validity: Validity is the concept that test scores can be interpreted as intended. For example, a test for certification in a profession should reflect basic knowledge of that profession, and not intelligence or other constructs, and scores can, therefore, be interpreted as evidencing professional competence. Validity must be formally established and maintained by empirical studies as well as sound psychometric and test development practices. Learn more.
Psychometrics looks fun! How can I join the band?
You will need a graduate degree. I recommend you look at the NCME website (ncme.org), which has resources for students. Good luck!
Already have a degree and looking for a job? Here are the two sites that I recommend:
- NCME – Also has a job listings page that is really good (ncme.org)
- Horizon Search – Headhunter for Psychometricians and I/O Psychologists