
Glossary of Psychometric Terms

New to the world of testing and psychometrics? 
This list of important psychometric terms will get you started in learning the field.

Accreditation: Accreditation by an outside agency affirms that an organization has met a certain level of standards. Certification testing programs may become accredited by meeting specified standards in test development, psychometrics, bylaws, management, etc.

Adaptive Test: A test that is delivered with an AI-based algorithm that personalizes it to each examinee, thereby making it much more secure and accurate while decreasing test length.  Also called adaptive assessment, computer-adaptive testing, or computerized adaptive testing (see below).

Achievement: The psychometric term for measuring something that a student has learned, such as 9th grade biology curriculum knowledge, rather than an innate construct such as intelligence or conscientiousness.

Aptitude: A construct that is measured which is innate, usually in a cognitive context.  For example, logical reasoning ability.

Biserial Correlation: A classical index of item discrimination, highly similar to the more commonly used point-biserial. The biserial correlation assumes that the item scores and test scores reflect an underlying normal distribution, which is not always the case.

Blueprint: A test blueprint, or test specification, details how an exam is to be constructed. It includes important information, such as the total number of items, the number of items in each content area or domain, the number of items that are recall versus reasoning, and the item formats to be utilized.

Certification: A non-mandatory testing program that certifies the candidates have achieved a minimum standard of knowledge or performance.

Classical Test Theory (CTT): A psychometric analysis and test development paradigm based on correlations, proportions, and other statistics that are relatively simple compared to IRT. It is, therefore, more appropriate for smaller samples, especially for fewer than 100.

Classification: The use of tests for classifying candidates into categories, such as pass/fail, nonmaster/master, or basic/proficient/advanced.

Cognitive Diagnostic Models (CDMs) aka Diagnostic Measurement Models (DMMs): A relatively new psychometric paradigm that frames the measurement problem not as one latent trait, but rather individual skills that must be mastered.  So rather than 4th grade math achievement as a scale, there are locations for adding fractions, dividing fractions, multiplying decimals, etc.  Can be used in concert with IRT.

Computerized Adaptive Testing (CAT): A dynamic method of test administration where items are selected one at a time to match item difficulty and candidate ability as closely as possible. This helps prevent candidates from being presented with items that are too difficult or too easy for them, which has multiple benefits. Often, the test only takes half as many items to obtain a similar level of accuracy to form-based tests. This reduces the testing time per examinee and also reduces the total number of times an item is exposed, as well as increasing security by the fact that nearly every candidate will receive a different set of items.
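The core selection loop can be sketched in a few lines. Below is a minimal illustration (not a production CAT engine; the item bank and its difficulty values are invented) of the simplest rule: administer the unseen item whose difficulty is closest to the current ability estimate.

```python
# Invented item bank: item ID -> difficulty (b) on the IRT theta scale.
item_bank = {"item1": -1.0, "item2": 0.0, "item3": 0.5, "item4": 1.5}

def select_next_item(theta, administered):
    """Return the ID of the not-yet-administered item whose
    difficulty is closest to the current ability estimate theta."""
    candidates = {i: b for i, b in item_bank.items() if i not in administered}
    return min(candidates, key=lambda i: abs(candidates[i] - theta))

# An examinee estimated at theta = 0.4 who has already seen item1
# next receives item3 (b = 0.5, the closest remaining difficulty).
print(select_next_item(0.4, {"item1"}))  # item3
```

Operational CAT engines refine this with maximum-information selection, exposure control, and content balancing, but the match-difficulty-to-ability idea is the same.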

Computerized Classification Testing (CCT): An approach similar to CAT, but with different algorithms to reflect the fact that the purpose of the test is only to make a broad classification and not obtain a highly accurate point estimate of ability.

Concurrent Validity: An aspect of validity (see below) based on correlating a test with other variables, measured at the same time, to which it is expected to relate.  A university admissions test should correlate with high school grade point average – but not perfectly, since they are not exactly the same construct; otherwise, what would be the point of having the test?

Cutscore: Also known as a passing score, the cutscore is the score that a candidate must achieve to obtain a certain classification, such as “pass” on a licensure or certification exam.

Criterion-Referenced: A test score (not a test) is criterion-referenced if it is interpreted with regard to a specified criterion and not compared to scores of other candidates. For instance, providing the number-correct score does not relate any information regarding a candidate’s relative standing.

Differential item functioning: A specific type of analysis that evaluates whether an item is biased towards a subgroup.  This is different than overall test bias.

Distractors: Distractors are the incorrect options of a multiple-choice item. A distractor analysis is an important part of psychometric review, as it helps determine whether a distractor is malfunctioning – for example, attracting high-ability examinees as if it were the keyed response.

Equating: A psychometric term for the process of determining comparable scores on different forms of an examination. For example, if Form A is more difficult than Form B, it might be desirable to adjust scores on Form A upward for the purposes of comparing them to scores on Form B. Usually, this is done statistically based on items that are on both forms, which are called equator, anchor, or common items. Because the groups who took the two forms are different, this is called a common items non-equivalent groups design.
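As a toy illustration of the idea only – the common-item non-equivalent groups design described above requires more sophisticated methods such as IRT-based or equipercentile equating – the simplest case is mean equating with randomly equivalent groups, where Form A scores are shifted by the difference in form means. All numbers here are invented.

```python
# Invented mean raw scores for two randomly equivalent groups:
mean_form_a = 70.2  # the harder form, so its mean is lower
mean_form_b = 73.0

# Mean equating: shift Form A scores by the difference in means.
adjustment = mean_form_b - mean_form_a  # 2.8 points

def equate_a_to_b(score_on_form_a):
    """Place a Form A raw score on the Form B metric."""
    return score_on_form_a + adjustment

print(round(equate_a_to_b(70), 1))  # 72.8
```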

Factor Analysis: An approach to analyzing complex data that seeks to break it down into major components or factors.  Used in many fields nowadays, but originally developed for psychometrics.  Two of the most common examples are the extensive research finding that personality items/measures boil down to the Big Five, and that intelligence items/measures boil down to general cognitive ability (though there is evidence of distinct aspects within the positive manifold).

Form: Forms are specific sets of items that are administered together for a test. For example, if a test included a certain set of 100 items this year and a different set of 100 items next year, these would be two distinct forms.

Item: The basic component of a test, often colloquially referred to as a “question,” but items are not necessarily phrased as a question. They can be as varied as true/false statements, rating scales, and performance task simulations, in addition to the ubiquitous multiple-choice item.

Item Bank: A repository of items for a testing program, including items at all stages, such as newly written, reviewed, pretested, active, and retired.

Item Banker: A specialized software program that facilitates the maintenance and growth of an item bank by recording item stages, statistics, notes, and other characteristics.

Item Difficulty: A statistical index of how easy/hard the item is with respect to the underlying ability/trait. That is, an item is difficult if not many people get it correct or respond in the keyed direction.

Item Discrimination: A statistical index of the quality of the item, assessing how well it differentiates examinees of high versus low ability. Items with low discrimination are considered poor quality and are candidates to be revised or retired.

Item Response Theory (IRT): A comprehensive approach to psychometric analysis and test development that utilizes complex mathematical models. This provides several benefits, including the ability to design CATs, but requires larger sample sizes. A common rule of thumb is 100 candidates for the one-parameter model and 500 for the three-parameter model.

     a: The item response theory index of item discrimination, analogous to the point-biserial and biserial correlations in classical test theory. It reflects the slope of the item response function. Often ranging from 0.1 to 2.0 in practice, a higher value indicates a better-performing item.

     b: The item response theory index of item difficulty or location, analogous to the P-value (P+) of classical test theory. Typically ranging from -3.0 to 3.0 in practice, a higher value indicates a more difficult item.

     c: The item response theory pseudo-guessing parameter, representing the lower asymptote of the item response function. It is theoretically near the value of 1/k, where k is the number of alternatives. For example, with the typical four-option multiple-choice item, a candidate has a base chance of 25% of guessing the correct answer.
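Putting a, b, and c together: the three-parameter logistic (3PL) model gives the probability of a correct response as P(θ) = c + (1 − c) / (1 + exp(−a(θ − b))). A minimal sketch (note that some formulations also multiply the exponent by the scaling constant D = 1.7):

```python
import math

def p_correct(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b, the probability is exactly halfway between c and 1.0:
print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.25))  # 0.625
```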

Item Type: Items (test questions) can take a huge range of formats.  We are all familiar with single-best-answer multiple choice, but there are many others.  Some of these are: multiple response, drag and drop, essay, scored short answer, and equation editor.

Job Analysis: Also known as practice analysis or role delineation study, job analysis is a formal study used to determine the structure of a job and the KSAs important to success or competence. This is then used to establish the test blueprint for a professional testing program, a critical step in the chain of evidence for validity.

Key: The key is the correct response to an item.

KSA: KSA is an acronym for knowledge, skills, and abilities. A critical step in testing for employment or professional credentials is to determine the KSAs that are important in a job. This is often done via a job analysis study.

Licensure: A testing program mandated by a government body. The test must be passed in order to perform the task in question, whether it is to work in the profession or drive a car. This is in contrast to Certification, which is not mandated by government.

Norm-Referenced: A test score (not a test) is norm-referenced if it is interpreted with regard to the performance of other candidates. Percentile rank is an example of this because it does not provide any information regarding how many items the candidate got correct.

P-value: A classical index of item difficulty, presented as the proportion of candidates who correctly responded to the item. A value above 0.90 indicates an easy item, while a value below 0.50 indicates a relatively difficult item. Note that it is inverted; a higher value indicates less difficulty.

Point-Biserial Correlation: A classical index of item discrimination, calculated as the Pearson correlation between the item score and the total test score. If below 0.0, low-scoring candidates are actually doing better than high-scoring candidates, and the item should be revised or retired. Low positive values are marginal, higher positive values are ideal.
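Since it is just a Pearson correlation between the 0/1 item scores and the total scores, it can be computed directly; a small sketch with invented data:

```python
def pearson(x, y):
    """Plain Pearson correlation, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

item_scores = [1, 1, 0, 1, 0, 0]         # 0/1 responses to one item
total_scores = [48, 45, 30, 40, 28, 33]  # total test scores

# High scorers tend to answer this item correctly, so the
# point-biserial is strongly positive.
print(round(pearson(item_scores, total_scores), 3))  # 0.931
```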

Polytomous: A psychometric term for item data with more than two possible score points.  Multiple choice items, while having 3-5 options, are usually still dichotomous (0/1 points).  Examples of a polytomous item are a Likert-style rating scale (“rate on a scale of 1 to 5”) and partial credit items or rubrics (scoring an essay as 0 to 5 points).

Power Test: A test where the goal is to measure the maximal knowledge, ability, or trait of the examinee.  For example, a medical certification exam with a generous time limit.

Predictive Validity: An aspect of validity (see below) that focuses on how well the test predicts important outcomes.  A university admissions test should predict 4-year graduation probability very well, and a pre-employment test on MS Excel should predict job performance for bookkeepers.

Pretest (or Pilot) Item: An item that is administered to candidates simply for the purposes of obtaining data for future psychometric analysis. The results on this item are not included in the score. It is often prudent to include a small number of pretest items in a test.

Reliability: A psychometric term for the repeatability or consistency of the measurement process. Often, this is indexed by a single number, most commonly the internal consistency index coefficient alpha or its dichotomous formulation, KR-20. Under most conditions, these range from 0.0 to 1.0, with 1.0 being a perfectly reliable measurement. However, just because a test is reliable does not mean that it is valid (i.e., measures what it is supposed to measure).
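For dichotomous data, KR-20 can be computed directly from a person-by-item score matrix; a small sketch with invented data (operational analyses would of course use far more examinees and items):

```python
def kr20(matrix):
    """KR-20 from a list of examinee rows of 0/1 item scores."""
    n, k = len(matrix), len(matrix[0])
    totals = [sum(row) for row in matrix]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n  # item difficulty (P-value)
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_t)

data = [[1, 1, 1],   # 4 examinees x 3 items, invented
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
print(kr20(data))  # 0.75
```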

Scaling: Scaling is a process of converting scores obtained on an exam to an arbitrary scale. This is done so that all the forms and exams used by a testing organization are on a common scale. For example, suppose an organization had two testing programs, one with 50 items and one with 150 items. All scores could be put on the same scale to standardize score reporting.  There is also vertical scaling to string grades together on a common scale.
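A common approach is a simple linear transformation from the raw-score metric onto the reporting scale; a toy sketch with invented numbers:

```python
def scale_score(raw, raw_max=50, lo=200, hi=800):
    """Linearly map a 0..raw_max raw score onto a lo..hi reporting scale."""
    return lo + (hi - lo) * raw / raw_max

print(scale_score(25))  # 500.0 – a raw score at the midpoint
print(scale_score(50))  # 800.0 – a perfect raw score
```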

Speeded Test: A test where the purpose is to see how fast the examinee can answer questions.  The questions are therefore not usually knowledge based.  For example, seeing how many 5-digit zip codes they can correctly type in 60 seconds.

Standard Error of Measurement: A psychometric term quantifying the amount of error in an examinee’s score, since psychometrics is not perfect; even with the best math test, a student might score somewhat differently today versus next week.  The concept differs substantially in classical test theory vs. item response theory.
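In classical test theory, the SEM follows from the score standard deviation and the reliability: SEM = SD × sqrt(1 − reliability). A quick sketch with invented values:

```python
import math

def sem(sd, reliability):
    """Classical-test-theory standard error of measurement."""
    return sd * math.sqrt(1.0 - reliability)

# With SD = 10 and reliability = 0.91, the SEM is 3 score points,
# so a rough 68% band around an observed score is +/- 3 points.
print(round(sem(10.0, 0.91), 2))  # 3.0
```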

Standard-Setting Study: A formal study conducted by a testing organization to determine standards for a testing program, which are manifested as a cutscore. Common methods include the Angoff, Bookmark, Contrasting Groups, and Borderline Survey methods.

Subject Matter Expert (SME): An extremely knowledgeable person within the test development process. SMEs are necessary to write items, review items, participate in standard-setting studies and job analyses, and oversee the testing program to ensure its fidelity to its true intent.

Validity: Validity is the concept that test scores can be interpreted as intended. For example, a test for certification in a profession should reflect basic knowledge of that profession, and not intelligence or other constructs, and scores can, therefore, be interpreted as evidencing professional competence. Validity must be formally established and maintained by empirical studies as well as sound psychometric and test development practices.