Test validation is the process of verifying, on the basis of solid evidence, whether the specific requirements of each test development stage have been fulfilled. In particular, test validation is an ongoing process of developing an argument that a specific test, its score interpretation, or its use is valid. The interpretation and use of testing data should be validated in terms of the content, substantive, structural, external, generalizability, and consequential aspects of construct validity (Messick, 1994). Validity is the status of an argument that can be positive or negative: positive evidence supports the validity argument and negative evidence weakens it. Validity cannot be absolute and can be judged only in degrees. The American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME] (1999) state that validity is crucial for educational and psychological test development and evaluation.
Validation as part of test development
To be effective, test development has to be structured, systematic, and detail-oriented. These features help secure sufficient validity evidence to support the inferences drawn from test scores. Downing (2006) suggested a twelve-step framework for effective test development:
- Overall plan
- Content definition
- Test blueprint
- Item development
- Test design and assembly
- Test production
- Test administration
- Scoring test responses
- Standard setting
- Reporting test results
- Item bank management
- Technical report
Even though this framework is outlined as a sequential timeline, in practice some of these steps may occur simultaneously or may be ordered differently. The starting point of test development, the purpose of the test, defines the planned test and governs almost all validity-related activities. Every step of the test development process has validation as its crucial aspect.
Hypothetically, excellent execution of all steps can ensure test validity, i.e., the resulting test would estimate examinee ability fairly within the content area it is meant to measure. However, the human factor involved in test production might play a negative role, so test validation is essential.
Reasons for test validation
There are myriad possible reasons that can lead to the invalidation of a test score interpretation or use. Let us consider some obvious issues, listed by development step, that potentially jeopardize test validity and are therefore subject to validation:
- overall plan: wrong choice of a psychometric model;
- content definition: content domain is ill defined;
- test blueprint: test blueprint does not specify an exact sampling plan for the content domain;
- item development: items measure content at an inappropriate cognitive level;
- test design and assembly: unequal booklets;
- test administration: cheating;
- scoring test responses: inconsistent scoring among examiners;
- standard setting: unsuitable method of establishing passing scores;
- item bank management: inaccurate updating of item parameters.
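One of the issues above, inconsistent scoring among examiners, can be checked with an agreement statistic. The following is a minimal sketch that assumes two examiners scored the same responses on a categorical rubric and uses Cohen's kappa, a common choice that the sources cited here do not prescribe. The function name and the scores are hypothetical illustrations.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same responses."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    n = len(rater_a)
    # Observed agreement: share of responses that received identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, computed from each rater's marginal score distribution.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores on a 0-2 rubric from two examiners for eight essays.
examiner_1 = [2, 1, 0, 2, 1, 1, 0, 2]
examiner_2 = [2, 1, 0, 1, 1, 2, 0, 2]
kappa = cohens_kappa(examiner_1, examiner_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

A kappa near 1 indicates close agreement, while a value near 0 indicates agreement no better than chance. What counts as acceptable depends on the stakes of the test.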
Context for test validation
All tests share common types of validity evidence, e.g., reliability, comparability, equating, and item quality. However, tests can vary in the number of constructs measured (single or multiple) and can have different purposes, which call for unique types of validation evidence. In general, there are several major types of tests:
- Admissions tests (e.g., SAT, ACT, and GRE)
- Credentialing tests (e.g., a live-patient examination for a dentist before licensing)
- Large-scale achievement tests (e.g., Stanford Achievement Test, Iowa Test of Basic Skills, and TerraNova)
- Pre-employment tests
- Medical or psychological tests
The main idea is that the type of test usually defines a unique validation agenda that focuses on the types of validity evidence and the issues that matter most for that type of test.
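Reliability, named above as evidence common to all tests, is often summarized with an internal-consistency coefficient. The sketch below computes Cronbach's alpha under the assumption that item scores are stored as one row per examinee. The function name and the data are hypothetical, and alpha is only one of several reliability estimates.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` holds one list of item scores per examinee."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("Alpha needs at least two items.")
    # Variance of each item across examinees. Population variance is used for
    # both items and totals, so the choice of denominator cancels in the ratio.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("Total scores do not vary; alpha is undefined.")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical right/wrong (1/0) scores: five examinees, four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha: {alpha:.2f}")
```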
Categorization of test validation studies
Since test scores can be invalidated in many ways, there are many categories of test validation studies that can be applied to validate test results. In this post, we will look at the categorization suggested by Haladyna (2011):
Category 1: Test Validation Studies Specific to a Testing Program

| Subcategory of a study | Focus of a study |
|---|---|
| 1. Studies That Provide Validity Evidence in Support of the Claim for a Test Score Interpretation or Use | Content analysis; item analysis; standard setting |
| 2. Studies That Threaten a Test Score Interpretation or Use | Cheating; scoring errors; student motivation; unethical test preparation; inappropriate test administration |
| 3. Studies That Address Other Problems That Threaten Test Score Interpretation or Use | Drop in reliability; drift in item parameters over time; redesign of a published test; possible security problem |

Category 2: Test Validation Studies That Apply to More Than One Testing Program

| Subcategory of a study | Focus of a study |
|---|---|
| Studies that lead to the establishment of concepts, principles, or procedures that guide, inform, or improve test development or scoring | Introducing a concept; introducing a principle; introducing a procedure; studying a pervasive problem |
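Item analysis, listed above among the studies that provide validity evidence, can be illustrated with classical statistics. The sketch below assumes dichotomous (0/1) item scores and computes item difficulty as the proportion correct and item discrimination as the point-biserial correlation with the total score. The function name and the data are hypothetical, and operational programs often use corrected or IRT-based indices instead.

```python
from statistics import mean, pstdev


def item_analysis(responses):
    """Classical item statistics; `responses` holds one 0/1 row per examinee."""
    totals = [sum(row) for row in responses]
    total_mean, total_sd = mean(totals), pstdev(totals)
    results = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        difficulty = mean(item)  # proportion of examinees answering correctly
        item_sd = pstdev(item)
        if item_sd == 0 or total_sd == 0:
            discrimination = None  # undefined when item or totals do not vary
        else:
            # Point-biserial: Pearson correlation of item score with total score.
            products = [x * t for x, t in zip(item, totals)]
            covariance = mean(products) - difficulty * total_mean
            discrimination = covariance / (item_sd * total_sd)
        results.append(
            {"item": i + 1, "difficulty": difficulty, "discrimination": discrimination}
        )
    return results


# Hypothetical right/wrong (1/0) responses: five examinees, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
stats = item_analysis(responses)
for row in stats:
    print(row)
```

Items with very low or negative discrimination are the usual candidates for review, since they do not separate stronger from weaker examinees.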
Even though test development is a long and laborious process, test creators have to be extremely accurate in carrying out their obligations within each activity. The crowning result of this process is valid and reliable test scores, together with their adequate interpretation and use. The higher the stakes or consequences of the test scores, the greater the attention that should be paid to test validity and, therefore, to test validation. Validation, in turn, depends on integrating all reliable sources of evidence to strengthen the argument for a test score interpretation and use.
References
American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. American Educational Research Association.
Downing, S. M. (2006). Twelve steps for effective test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 3-25). Lawrence Erlbaum Associates.
Haladyna, T. M. (2011). Roles and importance of validity studies in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 739-755). Lawrence Erlbaum Associates.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
Laila is an experienced educator and an Educational Measurement specialist with expertise in item and test development, standard setting, and analyzing, interpreting, and presenting data based on Classical Test Theory (CTT) and Item Response Theory (IRT). As a professional, Laila is primarily interested in applying IRT methodology and AI technologies to educational improvement.