Enemy Items

Enemy items is a psychometric term that refers to two test questions (items) which should not appear together: on the same test form (if linear) or in the set of items seen by a given examinee (if LOFT or adaptive).  The concept is therefore relevant to linear forms, but also pertains to linear on-the-fly testing (LOFT) and computerized adaptive testing (CAT).  There are several reasons why two items might be considered enemies:

  1. Too similar: the text of the two items is almost the same
  2. One gives away the answer to the other
  3. The items are on the same topic/answer, even if the text is different.

 

How do we find enemy items?

There are two ways (as there often are): manual and automated.

Manual means that humans read the items and intentionally mark two of them as enemies.  So maybe you have a reviewer who is reviewing new items from a pool of 5 authors, and finds two that cover the same concept.  They would mark them as enemies.

Automated means that you have a machine learning algorithm, such as one which uses natural language processing (NLP) to evaluate all items in a pool and then uses distance/similarity metrics to quantify how similar they are.  Of course, this could miss some situations, such as two items that cover the same topic but with fairly different text.  It is also difficult to do if items contain formulas, multimedia files, or other elements that cannot be parsed by NLP.
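To make the automated approach concrete, here is a minimal sketch in Python of flagging potential enemy pairs with TF-IDF vectors and cosine similarity.  The item texts, IDs, and the similarity threshold are made up for illustration; a production system would use richer NLP and still route flagged pairs to human reviewers.

```python
# A minimal sketch of automated enemy-item detection, assuming plain-text item stems
# and an arbitrary similarity threshold; flagged pairs still need human review.
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "ITM001": "What is the capital of France?",
    "ITM002": "Which city is the capital city of France?",
    "ITM003": "What is the boiling point of water at sea level?",
}

ids = list(items.keys())
tfidf = TfidfVectorizer(stop_words="english").fit_transform(items.values())
sim = cosine_similarity(tfidf)

THRESHOLD = 0.5  # arbitrary cutoff; tune against pairs already marked as enemies
for i, j in combinations(range(len(ids)), 2):
    if sim[i, j] >= THRESHOLD:
        print(f"Potential enemy pair: {ids[i]} and {ids[j]} (similarity = {sim[i, j]:.2f})")
```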

 

Why are enemy items a problem?

This violates the assumption of local independence: the interaction of an examinee with an item should not be affected by other items.  It also puts the examinee in double jeopardy; if they don’t know that topic, they will get two questions wrong, not just one.  There are other potential issues as well, as discussed in this article.

 

What does this mean for test development?

We want to identify enemy items and ensure that they don’t get used together.  Your item banking and assessment platform should have functionality to track which items are enemies.  You can sign up for a free account in FastTest to see an example.

 

Incremental Validity

Incremental validity is an aspect of validity that refers to what an additional assessment or predictive variable can add to the information provided by existing assessments or variables.  It refers to the amount of “bonus” predictive power gained by adding another predictor.  In many cases, the new predictor is on the same or a similar trait, but often the most incremental validity comes from using a predictor/trait that is relatively unrelated to the original.  See the examples below.

Note that this is often discussed with respect to tests and assessment, but in many cases a predictor is not a test or assessment, as you will also see.

How is Incremental Validity Evaluated?

It is most often quantified with a linear regression model and correlations.  However, any predictive modeling approach could work, from support vector machines to neural networks.
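Here is a minimal sketch in Python of the regression approach: fit a model with the original predictor, fit a second model that adds the new predictor, and report the gain in R-squared.  The data below are simulated purely for illustration; the variable names mirror the admissions example that follows.

```python
# A minimal sketch of quantifying incremental validity as the gain in R-squared
# when a second predictor is added; the data here are simulated for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
admissions_test = rng.normal(size=n)
hs_gpa = 0.7 * admissions_test + 0.7 * rng.normal(size=n)      # correlated predictor
first_year_gpa = 0.6 * admissions_test + 0.2 * hs_gpa + rng.normal(size=n)

X1 = admissions_test.reshape(-1, 1)                  # admissions test only
X2 = np.column_stack([admissions_test, hs_gpa])      # add high school GPA

r2_base = LinearRegression().fit(X1, first_year_gpa).score(X1, first_year_gpa)
r2_both = LinearRegression().fit(X2, first_year_gpa).score(X2, first_year_gpa)

print(f"R2 with admissions test only:    {r2_base:.3f}")
print(f"R2 adding high school GPA:       {r2_both:.3f}")
print(f"Incremental validity (delta R2): {r2_both - r2_base:.3f}")
```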

Example of Incremental Validity: University Admissions

One of the most commonly used predictors for university admissions is an admissions test, or battery of tests.  You might be required to take an assessment which includes an English/Verbal test, a Logic/Reasoning test, and a Quantitative/Math test.  These might be used individually or in aggregate to create a mathematical model, based on past data, that predicts your performance at university. (There are actually several criterion variables for this, such as first-year GPA, final GPA, and 4-year graduation rate, but that’s beyond the scope of this article.)

Of course, the admissions exam scores are not the only point of information that the university has on students.  It also has their high school GPA, perhaps an admissions essay which is graded by instructors, and so on.  Incremental validity poses this question: if the admissions exam correlates 0.59 with first-year GPA, what happens if we build a multiple regression/correlation with High School GPA (HGPA) as a second predictor?  It might go up to, say, 0.64.  There is an increment of 0.05.  If the university has that data on students, it would be wasting it by not using it.

Of course, HGPA will correlate very highly with the admissions exam scores.  So it will likely not add a lot of incremental validity.  Perhaps the school finds that essays add a 0.09 increment to the predictive power, because they are more orthogonal to the admissions exam scores.  Does it make sense to add that, given the additional expense of scoring thousands of essays?  That’s a business decision for them.

Example of Incremental Validity: Pre-Employment Testing

Another common use case is that of pre-employment testing, where the purpose of the test is to predict criterion variables like job performance, tenure, 6-month termination rate, or counterproductive work behavior.  You might start with a skills test; perhaps you are hiring accountants or bookkeepers and you give them a test on MS Excel.  What additional predictive power would we get by also doing a quantitative reasoning test?  Probably some, but that most likely correlates highly with MS Excel knowledge.  So what about using a personality assessment like Conscientiousness?  That would be more orthogonal.  It’s up to the researcher to determine what the best predictors are.  This topic, personnel selection, is one of the primary areas of Industrial/Organizational Psychology.

Formative vs. Summative Assessment

Summative and formative assessment are crucial components of the educational process.  If you work in the educational assessment field, or even in education generally, you have probably encountered these terms.  What do they mean?  This post will explore the differences between summative and formative assessment.

Assessment plays a crucial role in education, serving as a powerful tool to gauge student understanding and guide instructional practices. Among the various assessment methods, two approaches stand out: formative assessment and summative assessment. While both types aim to evaluate student performance, they serve distinct purposes and are applied at different stages of the learning process.

Summative Assessment

Summative assessment refers to an assessment that comes at the end (sum) of an educational experience.  The “educational experience” can vary widely.  Perhaps it is a one-day training course, or even shorter.  I worked at a lumber yard in high school, and I remember getting a rudimentary training – maybe an hour – on how to use a forklift before they had me take an exam to become OSHA-certified to use a forklift.  Proctored by the guy who had just showed me the ropes, of course.  On the other end of the spectrum is board certification for a physician specialty like ophthalmology: after 4 years of undergrad, 4 years of med school, and several more years of specialty training, you finally get to take the exam.  Either way, the purpose is to evaluate what you learned in some educational experience.

Note that it does not have to be formal education.  Many certifications have multiple eligibility pathways.  For example, to be eligible to sit for the exam, you might need:

  1. A bachelor’s degree
  2. An associate degree plus 1 year of work experience
  3. 3 years of work experience.

How it is developed

Summative assessments are usually developed by assessment professionals, or a board of subject matter experts led by assessment professionals.  For example, a certification for ophthalmology is not informally developed by a teacher; there is a panel of experienced ophthalmologists led by a psychometrician.  A high school graduation exam might be developed by a panel of experienced math or English teachers, again led by a psychometrician and test developers.

The process is usually very long and time-intensive, and therefore quite expensive.  A certification will need a job analysis, an item writing workshop, a standard-setting study, and other important steps that contribute to the validity of the exam scores.  A high school graduation exam requires expensive curriculum alignment studies and other work.

Implementation of Summative Assessment

Let’s explore the key aspects of summative assessment:

  1. End-of-Term Evaluation: Summative assessments are administered after the completion of a unit, semester, or academic year. They aim to evaluate the overall achievement of students and determine their readiness for advancement or graduation.
  2. Formal and Standardized: Summative assessments are often formal, standardized, and structured, ensuring consistent evaluation across different students and classrooms. Common examples include final exams, standardized tests, and grading rubrics.
  3. Accountability: Summative assessment holds students accountable for their learning outcomes and provides a comprehensive summary of their performance. It also serves as a basis for grade reporting, academic placement, and program evaluation.
  4. Future Planning: Summative assessment results can guide future instructional planning and curriculum development. They provide insights into areas of strength and weakness, helping educators identify instructional strategies and interventions to improve student outcomes.

Formative Assessment

Formative assessment is something that is used during the educational process.  Everyone is familiar with this from their school days: a quiz, an exam, or even just the teacher asking you a few questions verbally to understand your level of knowledge.  Usually, but not always, a formative assessment is used to direct instruction.  A common example of formative assessment is low-stakes exams given in K-12 schools purely to check on student growth, without counting towards their grades.  Some of the most widely used titles are the NWEA MAP, Renaissance Learning STAR, and Imagine Learning MyPath.

Formative assessment is a great fit for computerized adaptive testing, a method that adapts the difficulty of the exam to each student.  If a student is 3 grades behind, the test will quickly adapt down to that level, providing a better experience for the student and more accurate feedback on their level of knowledge.
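To illustrate the adaptivity, here is a toy sketch in Python of one simple item-selection rule: pick the unseen item whose difficulty is closest to the current ability estimate.  The item bank and parameter values are made up; real CAT engines also use item information functions, content constraints, and exposure control.

```python
# A toy sketch of how an adaptive test picks its next item: choose the unseen item
# whose IRT difficulty (b parameter) is closest to the current ability estimate.
# Real CAT engines use information functions, content balancing, and exposure control.

def next_item(item_bank, current_theta, administered):
    """item_bank: dict of item_id -> b parameter; administered: set of item_ids."""
    candidates = {i: b for i, b in item_bank.items() if i not in administered}
    return min(candidates, key=lambda i: abs(candidates[i] - current_theta))

bank = {"A": -2.0, "B": -1.0, "C": 0.0, "D": 1.0, "E": 2.0}
print(next_item(bank, current_theta=-1.8, administered={"C"}))  # picks the easy item "A"
```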

How it is developed

Formative assessments are typically much more informal than summative assessments.  Most of the exams we take in our life are informally developed formative assessments; think of all the quizzes and tests you ever took during courses as a student.  Even taking a test during training on the job will often count.  However, some are developed with heavy investment, such as a nationwide K-12 adaptive testing platform.

Implementation of Formative Assessment

Formative assessment refers to the ongoing evaluation of student progress throughout the learning journey. It is designed to provide immediate feedback, identify knowledge gaps, and guide instructional decisions. Here are some key characteristics of formative assessment:

  1. Timely Feedback: Formative assessments are conducted during the learning process, allowing educators to provide immediate feedback to students. This feedback focuses on specific strengths and areas for improvement, helping students adjust their understanding and study strategies.
  2. Informal Nature: Formative assessments are typically informal and flexible, offering a wide range of techniques such as quizzes, class discussions, peer evaluations, and interactive activities. They encourage active participation and engagement, promoting deeper learning and critical thinking skills.
  3. Diagnostic Function: Formative assessment serves as a diagnostic tool, enabling teachers to monitor individual and class-wide progress. It helps identify misconceptions, adapt instructional approaches, and tailor learning experiences to meet students’ needs effectively.
  4. Growth Mindset: The primary goal of formative assessment is to foster a growth mindset among students. By focusing on improvement rather than grades, it encourages learners to embrace challenges, learn from mistakes, and persevere in their educational journey.

The Synergy Between Formative and Summative Assessments

While formative and summative assessments have distinct purposes, they work together in a complementary manner to enhance learning outcomes. Here are a few ways in which these assessment types can be effectively integrated:

  1. Feedback Loop: The feedback provided during formative assessments can inform and improve summative assessments. It allows students to understand their strengths and weaknesses, guiding their study efforts for better performance in the final evaluation.
  2. Continuous Improvement: By employing formative assessments throughout a course, teachers can continuously monitor student progress, identify learning gaps, and adjust instructional strategies accordingly. This iterative process can ultimately lead to improved summative assessment results.
  3. Balanced Assessment Approach: Combining both formative and summative assessments creates a more comprehensive evaluation system. It ensures that student growth and understanding are assessed both during the learning process and at the end, providing a holistic view of student achievement.

Summative and Formative Assessment: A Validity Perspective

So what is the difference?  You will notice it is the situation and use of the exam, not the exam itself.  You could take those K-12 feedback assessments and deliver them at the end of the year, with weighting towards the student’s final grade.  That would make them summative.  But that is not what the test was designed for.  This is the concept of validity: the evidence showing that the interpretations and uses of test scores are supported for their intended purpose.  So the key is to design a test for its intended use, provide evidence for that use, and make sure that the exam is being used in the way that it should be.

QUESTION:   “What are the costs associated with using validated assessments in public safety hiring?”

ANSWER:       “Always cheaper than a lawsuit!”

It is not uncommon for public safety hiring practices to be called into question. There are several landmark court cases surrounding discrimination in hiring or testing that prove that point. Each year, millions and millions of dollars are spent defending or rectifying these occurrences. It is vital that steps are taken to avoid even the appearance of discrimination.

These four mistakes in public safety testing are some of the most common oversights made by human resources and public safety personnel. It is imperative that those responsible for hiring and promotional processes stay vigilant and aware of their legal responsibilities throughout the hiring and promotional process.

# 1:  Failing to validate the written test to a current job description for public safety hiring

Test questions must be related to the job description. This is one of the biggest mistakes that hiring officials make and is a frequent reason for public safety testing lawsuits. The test must either measure critical skills and abilities necessary for the job, or must predict which candidates will be most successful on the job (predictive validity). At the very least, the most important skills should be reflected on the test.

The United States filed a lawsuit against the City of New York in 2007 for unfair public safety hiring practices. The United States alleged that the examinations that the City used for hiring its firefighters were not an adequate method for determining whether an applicant was qualified for the position. In this case, Judge Nicholas G. Garaufis ruled in favor of the United States. He determined that the City was in violation of Title VII because the written examinations that were used excluded minority candidates, such as Black and Hispanic applicants, and were not job-related.

# 2:  Failing to include job-related practices that mitigate adverse impact

The City of New Haven, Connecticut, found itself in hot water in 2003, when seeking to fill 15 supervisory positions for its fire department. The test consisted of an oral and a written exam. There were 118 firefighters who took the test. When the test scores were calculated, there was a distinct racial disparity: the White applicants passed the test at a rate that was twice that of the Black applicants. During the court case, it was determined that the fire department was guilty of disparate-impact discrimination.

Simply put, disparate impact discrimination occurs when hiring practice rules or tests disproportionately screen out one racial group. In this case it was asserted that the test and ranking were structured in such a way that they eliminated any Black or Hispanic applicants.

# 3:  Failing to use a locally-validated, job-related Physical Ability Test (PAT)

Of all of the selection practices administered by public safety departments, the physical ability test, or PAT, is most likely to have the highest failure rate among female candidates. It’s imperative that the PAT measures the critical physical skills that a police officer must possess on day one. Departments that utilize work-sample PATs rather than fitness tests tend to have more success in court, as it is easier to demonstrate job-relatedness for a PAT that measures specific, critical job duties than for a fitness test that requires candidates to run a mile and a half or complete a number of pushups and sit-ups.

# 4:  Failing to use a structured interview with trained raters

A study was conducted analyzing the occurrence of litigation across the different tests included in most entry-level recruitments for public safety. Of the most common selection practices (i.e., a written test, a PAT, and an interview), the unstructured interview was the practice most commonly challenged in court and the one that most often resulted in success for the plaintiff.

Departments should ensure that the questions asked during the interview are structured, job-related, and scored with structured scoring methods. Additionally, all parties who sit on the interview panel should be properly trained in how to objectively administer, assess, and score the interview. Questions like “Tell me why you want to work with our department” should never be included in a structured interview process.

About FPSI

This is a guest post on pre-employment testing and hiring practices in public safety, by one of the leaders in the field, Fire & Police Selection, Inc. (FPSI).  FPSI consultants are well-versed in public safety litigation. Contact us for assistance with your public safety testing needs.

Test Score Reliability and Validity

Test score reliability and validity are core concepts in the field of psychometrics and assessment.  Both of them refer to the quality of a test, the scores it produces, and how we use those scores.  Because test scores are often used for very important purposes with high stakes, it is of course paramount that the tests be of high quality.  But because it is such a complex situation, it is not a simple yes/no answer of whether a test is good.  There is a ton of work that goes into establishing validity and reliability, and that work never ends!

This post provides an introduction to this incredibly complex topic.  For more information, we recommend you delve into books that are dedicated to the topic.  Here is a classic.

 

Why do we need reliability and validity?

To begin a discussion of reliability and validity, let us first pose the most fundamental question in psychometrics: Why are we testing people? Why are we going through an extensive and expensive process to develop examinations, inventories, surveys, and other forms of assessment? The answer is that the assessments provide information, in the form of test scores and subscores, that can be used for practical purposes to the benefit of individuals, organizations, and society. Moreover, that information is of higher quality for a particular purpose than information available from alternative sources. For example, a standardized test can provide better information about school students than parent or teacher ratings. A preemployment test can provide better information about specific job skills than an interview or a resume, and therefore be used to make better hiring decisions.

So, exams are constructed in order to draw conclusions about examinees based on their performance. The next question would be, just how supported are various conclusions and inferences we are making? What evidence do we have that a given standardized test can provide better information about school students than parent or teacher ratings? This is the central question that defines the most important criterion for evaluating an assessment process: validity. Validity, from a broad perspective, refers to the evidence we have to support a given use or interpretation of test scores. The importance of validity is so widely recognized that it typically finds its way into laws and regulations regarding assessment (Koretz, 2008).

Test score reliability is a component of validity. Reliability indicates the degree to which test scores are stable, reproducible, and free from measurement error. If test scores are not reliable, they cannot be valid since they will not provide a good estimate of the ability or trait that the test intends to measure. Reliability is therefore a necessary but not sufficient condition for validity.

 

Test Score Reliability

Reliability refers to the precision, accuracy, or repeatability of the test scores. There is no universally accepted way to define and evaluate the concept; classical test theory provides several indices, while item response theory drops the idea of a single index (and drops the term “reliability” entirely!) and reconceptualizes it as a conditional standard error of measurement, an index of precision that varies across the score scale.  This is actually a very important distinction, though outside the scope of this article.

An extremely common way of evaluating classical test reliability is the internal consistency index, called KR-20 or α (alpha). The KR-20 index ranges from 0.0 (test scores are comprised only of random error) to 1.0 (scores have no measurement error). Of course, because human behavior is generally not perfectly reproducible, perfect reliability is not possible; typically, a reliability of 0.90 or higher is desired for high-stakes certification exams. The relevant standard for a test depends on its stakes. A test for medical doctors might require reliability of 0.95 or greater. A test for florists or a personality self-assessment might suffice with 0.80.

Reliability depends on several factors, including the stability of the construct, length of the test, and the quality of the test items.

  • Stability of the construct: Reliability will be higher if the trait/ability is more stable (mood is inherently difficult to measure repeatedly). A test sponsor typically has little control over the nature of the construct – if you need to measure knowledge of algebra, well, that’s what we have to measure, and there’s no way around that.
  • Length of the test: Obviously, a test with 100 items is going to produce better scores than one with 5 items, assuming the items are not worthless.
  • Item Quality: A test will have higher reliability if the items are good.  Often, this is operationalized as point-biserial discrimination coefficients.

How do you calculate reliability?  You need psychometric analysis software like Iteman.
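For a sense of what such software is doing under the hood, here is a minimal sketch in Python of KR-20 for dichotomously scored (0/1) items; the tiny response matrix is made up for illustration, and dedicated tools add much more (distractor analysis, subscores, reporting).

```python
# A minimal sketch of KR-20 for dichotomously scored (0/1) item responses;
# the small response matrix here is made up for illustration.
import numpy as np

# rows = examinees, columns = items
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
])

k = responses.shape[1]                         # number of items
p = responses.mean(axis=0)                     # proportion correct per item
q = 1 - p
total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 = {kr20:.3f}")
```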

 

Validity

Validity is conventionally defined as the extent to which a test measures what it purports to measure.  Test validation is the process of gathering evidence to support the inferences made from test scores. Validation is an ongoing process, which makes it difficult to know when one has accumulated sufficient validity evidence to interpret test scores appropriately.

Academically, Messick (1989) defines validity as an “integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of measurement.” This definition suggests that the concept of validity contains a number of important characteristics to review or propositions to test and that validity can be described in a number of ways. The modern concept of validity (AERA, APA, & NCME Standards) is multi-faceted and refers to the meaningfulness, usefulness, and appropriateness of inferences made from test scores.

First of all, validity is not an inherent characteristic of a test. It is the reasonableness of using the test score for a particular purpose or for a particular inference. It is not correct to say a test or measurement procedure is valid or invalid. It is more reasonable to ask, “Is this a valid use of test scores, or is this a valid interpretation of the test scores?” Test score validity evidence should always be reviewed in relation to how test scores are used and interpreted.  Example: we might use a national university admissions aptitude test as a high school graduation exam, since they occur in the same period of a student’s life.  But it is likely that such a test does not match the curriculum of a particular state, especially since aptitude and achievement are different things!  You could theoretically use the aptitude test as a pre-employment exam as well; while valid for its original use, it is likely not valid for that use.

Secondly, validity cannot be adequately summarized by a single numerical index like a reliability coefficient or a standard error of measurement. A validity coefficient may be reported as a descriptor of the strength of relationship between other suitable and important measurements. However, it is only one of many pieces of empirical evidence that should be reviewed and reported by test score users. Validity for a particular test score use is supported through an accumulation of empirical, theoretical, statistical, and conceptual evidence that makes sense for the test scores.

Thirdly, there can be many aspects of validity dependent on the intended use and intended inferences to be made from test scores. Scores obtained from a measurement procedure can be valid for certain uses and inferences and not valid for other uses and inferences. Ultimately, an inference about probable job performance based on test scores is usually the kind of inference desired in test score interpretation in today’s test usage marketplace. This can take the form of making an inference about a person’s competency measured by a tested area.

Example 1: A Ruler

A standard ruler has both reliability and validity.  If you measure something that is 10 cm long, and measure it again and again, you will get the same measurement.  It is highly consistent and repeatable.  And if the object is actually 10 cm long, you have validity. (If not, you have a bad ruler.)

Example 2: A Bathroom Scale

Bathroom scales are not perfectly reliable (though this is often a function of their price), but they are usually reliable enough for the requirements of this measurement.

  • If you weigh 180 lbs, and step on the scale several times, you will likely get numbers like 179.8 or 180.1.  That is quite reliable, and valid.
  • If the numbers were more spread out, like 168.9 and 185.7, then you can consider it unreliable but valid.
  • If the results were 190.00 lbs every time, you have perfectly reliable measurement… but poor validity
  • If the results were spread like 25.6, 2023.7, 0.000053 – then it is neither reliable nor valid.

This is similar to the classic “target” example of reliability and validity, like you see below (image from Wikipedia).

[Image: the classic target diagram of reliability and validity, from Wikipedia]

Example 3: A Pre-Employment Test

Now, let’s get to a real example.  You have a test of quantitative reasoning that is being used to assess bookkeepers that apply for a job at a large company.  Jack has very high ability, and scores around the 90th percentile each time he takes the test.  This is reliability.  But does it actually predict job performance?  That is validity.  Does it predict job performance better than a Microsoft Excel test?  Good question, time for some validity research.  What if we also tack on a test of conscientiousness?  That is incremental validity.

 

Summary

In conclusion, validity and reliability are two essential aspects of evaluating an assessment, be it an examination of knowledge, a psychological inventory, a customer survey, or an aptitude test. Validity is an overarching, fundamental issue that drives at the heart of the reason for the assessment in the first place: the use of test scores. Reliability is an aspect of validity, as it is a necessary but not sufficient condition. Developing a test that produces reliable scores and valid interpretations is not an easy task, and progressively higher stakes indicate a progressively greater need for a professional psychometrician.  High-stakes exams like national university admissions often have teams of experts devoted to producing a high-quality assessment.

Certification Exam Development

Certification exam development, as well as other credentialing like licensure or certificates, is incredibly important.  Such exams serve as gatekeepers into many professions, often after people have invested a ton of money and years of their life in preparation.  Therefore, it is critical that the tests be developed well, and have the necessary supporting documentation to show that they are defensible.  So what exactly goes into developing a quality exam, sound psychometrics, and establishing the validity documentation, perhaps enough to achieve NCCA accreditation for your certification?

Well, there is a well-defined and recognized process for certification exam development, though it is rarely exactly the same for every organization.  In general, the accreditation guidelines say you need to address these things, but leave the specific approach up to you.  For example, you have to do a cutscore study, but you are allowed to choose the Bookmark method vs. the Angoff method vs. another approach.

 

Job Analysis / Practice Analysis

A job analysis study provides the vehicle for defining the important job knowledge, skills, and abilities (KSA) that will later be translated into content on a certification exam. During a job analysis, important job KSAs are obtained by directly analyzing job performance of highly competent job incumbents or surveying subject-matter experts regarding important aspects of successful job performance. The job analysis generally serves as a fundamental source of evidence supporting the validity of scores for certification exams.

 

Test Specifications and Blueprints

The results of the job analysis study are quantitatively converted into a blueprint for the exam.  Basically, it comes down to this: if the experts say that a certain topic or skill is done quite often or is very critical, then it deserves more weight on the exam, right?  There are different ways to do this.  My favorite article on the topic is Raymond and Neustel (2006).  Here’s a free tool to help.
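Here is a hedged sketch in Python of one simple weighting scheme: multiply mean frequency and criticality ratings for each domain and normalize.  The domain names and ratings are hypothetical, and Raymond and Neustel (2006) discuss more sophisticated options.

```python
# A hedged sketch of turning job-analysis ratings into blueprint weights, assuming a
# simple frequency-times-criticality scheme. The domains and ratings below are
# hypothetical mean SME ratings on a 1-5 scale.
domains = {
    "Patient assessment":  {"frequency": 4.6, "criticality": 4.8},
    "Treatment planning":  {"frequency": 3.9, "criticality": 4.2},
    "Documentation":       {"frequency": 4.8, "criticality": 2.9},
    "Professional ethics": {"frequency": 2.7, "criticality": 4.9},
}

raw = {d: r["frequency"] * r["criticality"] for d, r in domains.items()}
total = sum(raw.values())

test_length = 100
for domain, value in raw.items():
    weight = value / total
    print(f"{domain:22s} weight = {weight:5.1%}  ->  {round(weight * test_length)} items")
```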

 

[Image: the test development cycle]

Item Development

After important job KSAs are established, subject-matter experts write test items to assess them. The end result is the development of an item bank from which exam forms can be constructed. The quality of the item bank also supports test validity.  A key operational step is the development of an Item Writing Guide and holding an item writing workshop for the SMEs.

 

Pilot Testing

There should be evidence that each item in the bank actually measures the content that it is supposed to measure; in order to assess this, data must be gathered from samples of test-takers. After items are written, they are generally pilot tested by administering them to a sample of examinees in a low-stakes context—one in which examinees’ responses to the test items do not factor into any decisions regarding competency. After pilot test data are obtained, a psychometric analysis of the test and test items can be performed. This analysis will yield statistics that indicate the degree to which the items measure the intended test content. Items that appear to be weak indicators of the test content are generally removed from the item bank, or flagged so they can be reviewed by subject matter experts for correctness and clarity.

Note that this is not always possible, and is one of the ways that different organizations diverge in how they approach exam development.

 

Standard Setting

Standard setting also is a critical source of evidence supporting the validity of professional credentialing exam (i.e. pass/fail) decisions made based on test scores.  Standard setting is a process by which a passing score (or cutscore) is established; this is the point on the score scale that differentiates between examinees that are and are not deemed competent to perform the job. In order to be valid, the cutscore cannot be arbitrarily defined. Two examples of arbitrary methods are the quota (setting the cut score to produce a certain percentage of passing scores) and the flat cutscore (such as 70% on all tests). Both of these approaches ignore the content and difficulty of the test.  Avoid these!

Instead, the cutscore must be based on one of several well-researched criterion-referenced methods from the psychometric literature.  There are two types of criterion-referenced standard-setting procedures (Cizek, 2006): examinee-centered and test-centered.

The Contrasting Groups method is one example of a defensible examinee-centered standard-setting approach. This method compares the scores of candidates previously defined as Pass or Fail. Obviously, this has the drawback that a separate method already exists for classification. Moreover, examinee-centered approaches such as this require data from examinees, but many testing programs wish to set the cutscore before publishing the test and delivering it to any examinees. Therefore, test-centered methods are more commonly used in credentialing.

The most frequently used test-centered method is the Modified Angoff Method (Angoff, 1971) which requires a committee of subject matter experts (SMEs).  Another commonly used approach is the Bookmark Method.
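To show how Modified Angoff ratings become a cutscore, here is a minimal sketch in Python: each SME estimates the proportion of minimally competent candidates expected to answer each item correctly, the ratings are averaged across SMEs, and the item averages are summed.  The ratings below are made up for illustration; operational studies add discussion rounds, impact data, and rater agreement checks.

```python
# A minimal sketch of aggregating Modified Angoff ratings into a raw cutscore:
# each SME estimates the proportion of minimally competent candidates expected to
# answer each item correctly; the ratings below are hypothetical.
import numpy as np

# rows = SMEs, columns = items (ratings are proportions between 0 and 1)
ratings = np.array([
    [0.80, 0.60, 0.90, 0.55, 0.70],
    [0.75, 0.65, 0.85, 0.50, 0.65],
    [0.85, 0.55, 0.95, 0.60, 0.75],
])

item_means = ratings.mean(axis=0)   # consensus expectation per item
cutscore = item_means.sum()         # expected raw score of a borderline candidate
print(f"Recommended raw cutscore: {cutscore:.1f} out of {ratings.shape[1]}")
```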

 

Equating

If the test has more than one form – which is required by NCCA Standards and other guidelines – they must be statistically equated.  If you use classical test theory, there are methods like Tucker or Levine.  If you use item response theory, you can either bake the equating into the item calibration process with software like Xcalibre, or use conversion methods like Stocking & Lord.

What does this process do?  Well, if this year’s certification exam had an average 3 points higher than last year’s, how do you know whether this year’s version was 3 points easier, this year’s cohort was 3 points smarter, or a mixture of both?  Learn more here.

 

Psychometric Analysis & Reporting

This part is an absolutely critical step in the exam development cycle for professional credentialing.  You need to statistically analyze the results to flag any items that are not performing well, so you can replace or modify them.  This looks at statistics like item p-value (difficulty), item point biserial (discrimination), option/distractor analysis, and differential item functioning.  You should also look at overall test reliability/precision and other psychometric indices.  If you are accredited, you need to perform year-end reports and submit them to the governing body.  Learn more about item and test analysis.
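As a concrete illustration, here is a minimal sketch in Python of the two most basic classical item statistics, the p-value and the point-biserial; the response matrix is made up, and tools like Iteman layer distractor analysis, DIF, reliability, and reporting on top of this.

```python
# A minimal sketch of classical item statistics (difficulty p-value and point-biserial
# discrimination) on a tiny made-up 0/1 response matrix. The point-biserial here uses
# the full total score; a common refinement is to exclude the item from its own total.
import numpy as np

responses = np.array([   # rows = examinees, columns = items
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
])
total = responses.sum(axis=1)

for j in range(responses.shape[1]):
    p_value = responses[:, j].mean()
    r_pbis = np.corrcoef(responses[:, j], total)[0, 1]
    print(f"Item {j + 1}: p = {p_value:.2f}, point-biserial = {r_pbis:.2f}")
```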

 

Exam Development: It’s a Vicious Cycle

Now, consider the big picture: in many cases, an exam is not a one-and-done thing.  It is re-used, perhaps continually.  Often there are new versions released, perhaps based on updated blueprints or simply to swap out questions so that they don’t get overexposed.  That’s why this is better conceptualized as an exam development cycle, like the circle shown above.  Often some steps like Job Analysis are only done once every 5 years, while the rotation of item development, piloting, equating, and psychometric reporting might happen with each exam window (perhaps you do exams in December and May each year).

ASC has extensive expertise in managing this cycle for professional credentialing exams, as well as many other types of assessments.  Get in touch with us to talk to one of our psychometricians.

Test Equating

Test equating refers to the issue of defensibly translating scores from one test form to another. That is, if you have an exam where half of students see one set of items while the other half see a different set, how do you know that a score of 70 means the same thing on both forms? What if one is a bit easier? If you are delivering assessments in conventional linear forms – or piloting a bank for CAT/LOFT – you are likely to utilize more than one test form, and therefore are faced with the issue of test equating.

When two test forms have been properly equated, educators can validly interpret performance on one test form as having the same substantive meaning compared to the equated score of the other test form (Ryan & Brockmann, 2009). While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. This post will provide an overview of the topic.

Why do we need test linking and equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale. Suppose you take Form A and get a score of 72/100 while your friend takes Form B and gets a score of 74/100. Is your friend smarter than you, or did his form happen to have easier questions?  What if the passing score on the exam was 73? Well, if the test designers built in some overlap of items between the forms, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items. They are delivered to a large, representative sample. Here are the results.

  • Form A: mean of 30 on the 50 overlap items; mean of 72 on the 100 total items
  • Form B: mean of 32 on the 50 overlap items; mean of 74 on the 100 total items

Because the mean score on the anchor items was higher, we then think that the Form B group was a little smarter, which led to a higher total score.

Now suppose these are the results:

  • Form A: mean of 32 on the 50 overlap items; mean of 72 on the 100 total items
  • Form B: mean of 32 on the 50 overlap items; mean of 74 on the 100 total items

Now, we have evidence that the groups are of equal ability. The higher total score on Form B must then be because the unique items on that form are a bit easier.

What is test equating?

According to Ryan and Brockmann (2009), “Equating is a technical procedure or process conducted to establish comparable scores, with equivalent meaning, on different versions of test forms of the same test; it allows them to be used interchangeably.” (p. 8). Thus, successful equating is an important factor in evaluating assessment validity, and, therefore, it often becomes an important topic of discussion within testing programs.

Practice has shown that scores, and the tests producing those scores, must satisfy very strong requirements to achieve this demanding goal of interchangeability. Equating would not be necessary if test forms were assembled as strictly parallel, meaning that they would have identical psychometric properties. In reality, it is almost impossible to construct multiple test forms that are strictly parallel, so equating is necessary to compensate for the differences that remain after test construction.

Dorans, Moses, and Eignor (2010) suggest the following five requirements towards equating of two test forms:

  • tests should measure the same construct (e.g. latent trait, skill, ability);
  • tests should have the same level of reliability;
  • the transformation for mapping the scores of one test onto the other should be the inverse of the reverse transformation (symmetry);
  • test results should not depend on the test form an examinee actually takes;
  • the equating function used to link the scores of two tests should be the same regardless of the choice of (sub) population from which it is derived.

How do I calculate an equating?

Classical test theory (CTT) methods include linear equating and equipercentile equating as well as several others. Some newer approaches that work well with small samples are Circle-Arc (Livingston & Kim, 2009) and Nominal Weights (Babcock, Albano, & Raymond, 2012).  Specific methods for linear equating include Tucker, Levine, and Chained (von Davier & Kong, 2003). Linear equating approaches are conceptually simple and easy to interpret; given the examples above, the equating transformation might be estimated with a slope of 1.01 and an intercept of 1.97, which would directly confirm the hypothesis that one form was about 2 points easier than the other.
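As an illustration of the linear idea, here is a hedged sketch in Python of the simplest case, a random-groups linear equating that matches means and standard deviations; Tucker, Levine, and chained methods extend this logic to anchor-item designs.  The score vectors are made up for illustration.

```python
# A hedged sketch of the simplest linear equating (random-groups design): map Form X
# scores onto the Form Y scale by matching means and standard deviations. Tucker,
# Levine, and chained methods extend this idea to anchor-item designs.
import numpy as np

form_x = np.array([61, 65, 70, 72, 74, 78, 80, 84, 88, 90])  # made-up raw scores
form_y = np.array([63, 66, 71, 74, 76, 79, 83, 86, 89, 93])

slope = form_y.std(ddof=1) / form_x.std(ddof=1)
intercept = form_y.mean() - slope * form_x.mean()

def equate_x_to_y(x_score):
    """Convert a Form X raw score to the Form Y scale."""
    return slope * x_score + intercept

print(f"y = {slope:.3f} * x + {intercept:.3f}")
print(f"A Form X score of 72 is equivalent to {equate_x_to_y(72):.1f} on Form Y")
```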

Item response theory (IRT) approaches include equating through common items (equating by applying an equating constant, equating by concurrent or simultaneous calibration, and equating with common items through test characteristic curves), and common person calibration (Ryan & Brockmann, 2009). The common-item approach is quite often used, and specific methods for finding the constants (conversion parameters) include Stocking-Lord, Haebara, Mean/Mean, and Mean/Sigma. Because IRT assumes that two scales on the same construct differ by only a simple linear transformation, all we need to do is find the slope and intercept of that transformation. Those methods do so, and often produce nice looking figures like the one below from the program IRTEQ (Han, 2007). Note that the b parameters do not fall on the identity line, because there was indeed a difference between the groups, and the results clearly find that is the case.

[Image: IRT equating output from IRTEQ]
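For the IRT common-item approach, here is a minimal sketch in Python of the Mean/Sigma method for finding the linking constants from anchor-item b parameters; the parameter values are made up, and Stocking-Lord or Haebara, which use the full item characteristic curves, are generally preferred in operational work.

```python
# A minimal sketch of the Mean/Sigma method for finding the IRT linking constants
# (slope A and intercept B) from anchor-item b parameters estimated on two forms;
# the parameter values below are made up.
import numpy as np

b_old = np.array([-1.20, -0.40, 0.10, 0.65, 1.30])   # anchor b's on the old (base) scale
b_new = np.array([-1.05, -0.30, 0.25, 0.80, 1.55])   # same anchors from the new calibration

A = b_old.std(ddof=1) / b_new.std(ddof=1)
B = b_old.mean() - A * b_new.mean()

# Transform the new calibration onto the old scale
# (a parameters on the new form would be divided by A)
b_rescaled = A * b_new + B
theta_new = 0.50
theta_rescaled = A * theta_new + B

print(f"A = {A:.3f}, B = {B:.3f}")
print(f"New-form theta of {theta_new} corresponds to {theta_rescaled:.3f} on the base scale")
```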

Practitioners can equate forms with CTT or IRT. However, one of the reasons that IRT was invented was that equating with CTT was very weak. Hambleton and Jones (1993) explain that when CTT equating methods are applied, both ability parameter (i.e., observed score) and item parameters (i.e., difficulty and discrimination) are dependent on each other, limiting its utility in practical test development. IRT solves the CTT interdependency problem by combining ability and item parameters in one model. The IRT equating methods are more accurate and stable than the CTT methods (Hambleton & Jones, 1993; Han, Kolen, & Pohlmann, 1997; De Ayala, 2013; Kolen and Brennan, 2014) and provide a solid basis for modern large-scale computer-based tests, such as computerized adaptive tests (Educational Testing Service, 2010; OECD, 2017).

Of course, one of the reasons that CTT is still around in general is that it works much better with smaller samples, and this is also the case for CTT test equating (Babcock, Albano, & Raymond, 2012).

How do I implement test equating?

Test equating is a mathematically complex process, regardless of which method you use.  Therefore, it requires special software.  Here are some programs to consider.

  1. CIPE performs both linear and equipercentile equating with classical test theory. It is available from the University of Iowa’s CASMA site, which also includes several other software programs.
  2. IRTEQ is an easy-to-use program which performs all major methods of IRT conversion equating.  It is available from the University of Massachusetts website, which also hosts several other good programs.
  3. There are many R packages for equating and related psychometric topics. This article claims that there are 45 packages for IRT analysis alone!
  4. If you want to do IRT equating, you need IRT calibration software. We highly recommend Xcalibre since it is easy to use and automatically creates reports in Word for you. If you want to do the calibration approach to IRT equating (both anchor-item and concurrent-calibration), rather than the conversion approach, this is handled directly by IRT software like Xcalibre. For the conversion approach, you need separate software like IRTEQ.

Equating is typically performed by highly trained psychometricians; in many cases, an organization will contract out to a testing company or consultant with the relevant experience. Contact us if you’d like to discuss this.

Does equating happen before or after delivery?

Both. These are called pre-equating and post-equating (Ryan & Brockmann, 2009).  Post-equating means the calculation is done after delivery, when you have a full data set; for example, if a test is delivered twice per year on a single day, we can do it after that day.  Pre-equating is trickier, because you are trying to calculate the equating before a test form has ever been delivered to an examinee; but this is absolutely necessary in many situations, especially those with continuous delivery windows.

How do I learn more about test equating?

If you are eager to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014) that provides the most complete coverage of score equating and linking.  There are other resources more readily available on the internet, like this free handbook from CCSSO. If you would like to learn more about IRT, we suggest the books by De Ayala (2008) and Embretson and Reise (2000). A brief intro of IRT equating is available on our website.

Several new ideas of general use in equating, with a focus on kernel equating, were introduced in the book by von Davier, Holland, and Thayer (2004). Holland and Dorans (2006) presented a historical background for test score linking, based on work by Angoff (1971), Flanagan (1951), and Petersen, Kolen, and Hoover (1989). If you look for a straightforward description of the major issues and procedures encountered in practice, then you should turn to Livingston (2004).


Want to learn more? Talk to a Psychometric Consultant

References

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). American Council on Education.

Babcock, B., Albano, A., & Raymond, M. (2012). Nominal Weights Mean Equating: A Method for Very Small Samples. Educational and Psychological Measurement, 72(4), 1-21.

Dorans, N. J., Moses, T. P., & Eignor, D. R. (2010). Principles and practices of test score equating. ETS Research Report Series, 2010(2), i-41.

De Ayala, R. J. (2008). A commentary on historical perspectives on invariant measurement: Guttman, Rasch, and Mokken.

De Ayala, R. J. (2013). Factor analysis with categorical indicators: Item response theory. In Applied quantitative analysis in education and the social sciences (pp. 220-254). Routledge.

Educational Testing Service (2010). Linking TOEFL iBT Scores to IELTS Scores: A Research Report. Educational Testing Service.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.

Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695-763). American Council on Education.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38-47.

Han, T., Kolen, M., & Pohlmann, J. (1997). A comparison among IRT true- and observed-score equatings and traditional equipercentile equating. Applied Measurement in Education, 10(2), 105-121.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187-220). Praeger.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, linking, and scaling: Methods and practices (2nd ed.). Springer-Verlag.

Kolen, M. J., & Brennan, R. L. (2014). Item response theory methods. In Test Equating, Scaling, and Linking (pp. 171-245). Springer.

Livingston, S. A. (2004). Equating test scores (without IRT). ETS.

Livingston, S. A., & Kim, S. (2009). The circle-arc method for equating in small samples. Journal of Educational Measurement, 46(3), 330-343.

OECD (2017). PISA 2015 Technical Report. OECD Publishing.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). Macmillan.

Ryan, J., & Brockmann, F. (2009). A Practitioner’s Introduction to Equating with Primers on Classical Test Theory and Item Response Theory. Council of Chief State School Officers.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. Springer.

von Davier, A. A., & Kong, N. (2003). A unified approach to linear equating for non-equivalent groups design. Research report 03-31 from Educational Testing Service. https://www.ets.org/Media/Research/pdf/RR-03-31-vonDavier.pdf

Certification vs. Licensure

Certification and licensure exams are two terms that are used quite frequently to refer to examinations that someone has to pass to demonstrate skills in a certain profession or topic.  They are quite similar, and often confused.  This is exacerbated by even more similar terms in the field, such as accreditation, credentialing, certificate, and microcredential.  This post will help you understand the differences.

What is Certification?

Certification is “a credential that you earn to show that you have specific skills or knowledge. They are usually tied to an occupation, technology, or industry.” (CareerOneStop)  The important aspect in this definition is the latter portion; the organization that runs the certification generally spans an industry or a profession, regardless of political boundaries.  It is almost always some sort of professional association or industry board, like the American Association of Widgetmakers (obviously not a real thing).  However, it is sometimes governed by a specific company or other organization regarding their products; perhaps the most well known is how Amazon Web Services will certify you in skills related to their offerings.  Many other technology and software companies do the same.

What is Licensure?

Licensure is a “formal permission to do something: esp., authorization by law to do some specified thing (license to marry, practice medicine, hunt, etc.)” (Schmitt, 1995).  The key phrase here is by law.  The sponsoring organization is a governmental entity, and that is what defines licensure.  In fact, licensure is not always about a profession; almost all of us have a driver’s license, for which we passed a simple exam, and many millions of people have a fishing license, which is granted by the government (by states in the USA) for nothing more than a small fee.  The license is still an attestation, but not of your skills; it simply means you have been authorized to do something.  Of course, in the context of assessment, it means that you have passed some sort of exam which is mandated by law, typically for professions that are dangerous enough, or impact a wide enough range of people, that the government has stepped in to provide oversight: attorneys, physicians, other medical professionals, etc.

Certification vs Licensure Exams

Usually, there is a test that you must pass, but the sponsor differs between certification and licensure.  The development and delivery of such tests is extremely similar, which leads to the confusion.  Both will often utilize job analysis, Angoff studies, and the like.  The difference between the two is outside the test itself, and instead refers to the sponsoring organization: is it mandated/governed by a governmental entity, or is it unrelated to political/governmental boundaries?  You are awarded a credential after successful completion in either case, but the difference is in the group that awards the credential, what it means, and where it is recognized.

However, there are many licensures that do not involve an exam, but you simply need to file some paperwork with the government.  An example of this is a marriage license.  You certainly don’t have to take a test to qualify!

Can they be the same exam?

To make things even more confusing… yes.  And it is not even consistent.  In the US, some professions have a nationwide certification, which is also required in some states as licensure, but not in all states!  Some states might have their own exams, or not even require an exam.  This muddles the difference between certification and licensure.  The ICRC notes that they are sometimes complementary or parallel processes.

Differences between Certification and Licensure

  • Mandatory? — Certification: No. Licensure: Yes.
  • Run by — Certification: Association, board, nonprofit, or private company. Licensure: Government.
  • Does it use an exam? — Certification: Yes, especially if it is accredited. Licensure: Sometimes, but often not (consider a marriage license).
  • Accreditation involved? — Certification: Yes, NCCA and ANSI provide accreditation that a certification is high quality. Licensure: No; often there is no check on quality.
  • Examples — Certification: Certified Chiropractic Sports Physician (CCSP®), Certified in Clean Needle Technique (CNT). Licensure: Marriage license; driver’s license; fishing license; license to practice law (bar exam).

How do these terms relate to other, similar terms?

This outline summarizes some of the relevant terms regarding certification vs licensure and other credentials.  This is certainly more than can be covered in a single blog post!

  • Attestation of some level of quality for a person or organization = CREDENTIALING
    • Attestation of a person
      • By government = LICENSURE
      • By independent board or company
        • High stakes, wide profession = CERTIFICATION
        • Medium stakes = CERTIFICATE
        • Low stakes, quite specific skill = MICROCREDENTIAL
      • By an educational institution = DEGREE OR DIPLOMA
    • Attestation of an organization = ACCREDITATION

The Bookmark Method of Standard Setting

Authors: 

Laila Issayeva, MS

Nathan Thompson, PhD

The Bookmark Method of standard setting (Lewis, Mitzel, & Green, 1996) is a scientifically-based approach to setting cutscores on an examination. It allows stakeholders of an assessment to make decisions and classifications about examinees that are constructive rather than arbitrary (e.g., 70%), meet the goals of the test, and contribute to overall validity. A major advantage of the bookmark method over others is that it utilizes difficulty statistics on all items, making it very data-driven; but this can also be a disadvantage in situations where such data is not available. It also has the advantage of panelist confidence (Karantonis & Sireci, 2006).

The bookmark method operates by delivering a test to a representative sample (or population) of examinees, and then calculating the difficulty statistics for each item. We line up the items in order of difficulty, and experts review the items to place a bookmark where they think a cutscore should be. Nowadays, we use computer screens, but of course in the past this was often done by printing the items in paper booklets, and the experts would literally insert a bookmark.

What is standard setting?

Standard setting (Cizek & Bunch, 2006) is an integral part of the test development process even though it has been undervalued outside of practitioners’ view in the past (Bejar, 2008). Standard setting is the methodology of defining achievement or proficiency levels and corresponding cutscores. A cutscore is a score that serves as a measure of classifying test takers into categories.

Educational assessments and credentialing examinations are often employed to distribute test takers among ordered categories according to their performance across specific content and skills (AERA, APA, & NCME, 2014; Hambleton, 2013). For instance, in tests used for certification and licensing purposes, test takers are typically classified as “pass”—those who score at or above the cutscore—and those who “fail”. In education, students are often classified in terms of proficiency; the Nation’s Report Card assessment (NAEP) in the United States classifies students as Below Basic, Basic, Proficient, Advanced.

However, assessment results could come into question unless the cutscores are appropriately defined. This is why arbitrary cutscores are considered indefensible and lacking validity. Instead, psychometricians help test sponsors to set cutscores using methodologies from the scientific literature, driven by evaluations of item and test difficulty as well as examinee performance.

When to use the bookmark method?

Two approaches are mainly used in international practice to establish assessment standards: the Angoff method (Cizek, 2006) and the Bookmark method (Buckendahl, Smith, Impara, & Plake, 2000). The Bookmark method, unlike the Angoff method, requires the test to be administered prior to defining cutscores based on test data. This provides additional weight to the validity of the process, and better informs the subject matter experts during the process. Of course, many exams require a cutscore to be set before it is published, which is impossible with the bookmark; the Angoff procedure is very useful then.

How do I implement the bookmark method?

The process of standard setting employing the Bookmark method consists of the following stages:

  1. Identify a team of subject matter experts (SMEs); typically around 6-12, led by a test developer/psychometrician/statistician
  2. Analyze test takers’ responses by means of item response theory (IRT)
  3. Create a list of the items ordered by difficulty, ascending
  4. Define the competency levels for test takers; for example, have the 6-12 experts discuss what should differentiate a “pass” candidate from a “fail” candidate
  5. Have the experts read the items in ascending order of difficulty (they do not need to see the IRT values) and place a bookmark where appropriate, based on professional judgment against the defined levels
  6. Calculate thresholds based on the bookmarks set across all experts (see the sketch after this list)
  7. If needed, discuss results and perform a second round
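As a rough illustration of steps 6 and 7, the sketch below converts a set of bookmark placements into a single panel cutscore on the theta scale. The RP locations, SME names, and placements are hypothetical, and the median-across-panelists rule is just one common convention; a real study might use the mean or run a second round before finalizing anything.

```python
from statistics import median

# RP67 locations (theta) of the booklet items, in ascending order
# (hypothetical values, continuing the earlier sketch)
rp_locations = [-0.51, 0.29, 1.04, 1.79, 2.33]

# Each SME's bookmark: index of the last item that a minimally competent
# examinee should be expected to answer correctly (hypothetical placements)
bookmarks = {"SME1": 2, "SME2": 1, "SME3": 2, "SME4": 3}

# One common convention: each SME's implied cutscore is the RP location of
# the bookmarked item, and the panel cutscore is the median across SMEs
individual_cuts = [rp_locations[i] for i in bookmarks.values()]
panel_cutscore = median(individual_cuts)
print(round(panel_cutscore, 2))  # 1.04
```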

Example of the Bookmark Method

If there are four competency levels, as in the NAEP example, then the SMEs need to place three bookmarks: the first bookmark goes after the last item that a minimally competent candidate at the first threshold would be expected to answer correctly, and the second and third bookmarks are placed the same way. This produces thresholds/cutscores between levels 1 and 2, 2 and 3, and 3 and 4. Each SME does this individually, without discussion, by reading through the items.

When all SMEs have placed their bookmarks, the standard setting coordinator combines the results into one spreadsheet and leads a discussion in which each participant explains the reasoning behind their placements. This might look like the sheet below. Note that SME4 had a relatively high standard in mind, while SME2 had a low standard, placing virtually every student above an IRT score of 0.0 into the top category!

[Figure: spreadsheet of bookmark placements by each SME, round 1]

After the discussion, the SMEs are given one more opportunity to set their bookmarks. Usually, after the exchange of opinions, the picture changes: the SMEs move toward consensus, and the variation across bookmarks is reduced. An example of this is below.

[Figure: spreadsheet of bookmark placements by each SME, round 2]

What to do with the results?

Based on the SMEs’ final bookmarks, the coordinator or psychometrician calculates the final thresholds on the IRT scale and provides them to the analytical team that will prepare reports for the assessment across competency levels. This might entail score reports to examinees, feedback reports to teachers, and aggregate reports to test sponsors, government officials, and more.
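For example, once the final thresholds are fixed on the theta scale, classifying an examinee is a simple lookup; here is a minimal sketch with hypothetical cutscores and the NAEP-style level labels used above.

```python
import bisect

# Final cutscores on the IRT theta scale (hypothetical values); a real
# program would use the thresholds produced by the panel
cutscores = [-0.8, 0.2, 1.1]
levels = ["Below Basic", "Basic", "Proficient", "Advanced"]

def classify(theta):
    # Scoring at or above a cutscore places the examinee in the higher level
    return levels[bisect.bisect_right(cutscores, theta)]

print(classify(-1.3))  # Below Basic
print(classify(0.2))   # Proficient
print(classify(1.5))   # Advanced
```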

You can see how the scientific approach will directly impact the interpretations of such reports. Rather than government officials just knowing how many students scored 80-90% correct vs 90-100% correct, the results are framed in terms of how many students are truly proficient in the topic. This makes decisions from test scores – both at the individual and aggregate levels – much more defensible and informative.  They become truly criterion-referenced.  This is especially true when the scores are equated across years to account for differences in examinee distributions and test difficulty, and the standard can be demonstrated to be stable.  For high-stakes examinations such as medical certification/licensure, admissions exams, and many more situations, this is absolutely critical.

Want to talk to an expert about implementing this for your exams?  Contact us.

References

[AERA, APA, & NCME] (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bejar, I. I. (2008). Standard setting: What is it? Why is it important? R&D Connections, 7, 1-6. Retrieved from https://www.ets.org/Media/Research/pdf/RD_Connections7.pdf

Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2000). A comparison of Angoff and Bookmark standard setting methods. Paper presented at the Annual Meeting of the Mid-Western Educational Research Association, Chicago, IL, October 25-28, 2000.

Cizek, G., & Bunch, M. (2006). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests.  Thousand Oaks, CA: Sage.

Cizek, G. J. (2006). Standard setting. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 225-258). Mahwah, NJ: Lawrence Erlbaum Associates.

Hambleton, R. K. (2013). Setting performance standards on educational assessments and criteria for evaluating the process. In Setting performance standards, pp. 103-130. Routledge. Retrieved from https://www.nciea.org/publications/SetStandards_Hambleton99.pdf

Karantonis, A., & Sireci, S. (2006). The Bookmark standard-setting method: A literature review. Educational Measurement: Issues and Practice, 25(1), 4-12.

Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996, June). Standard setting: A bookmark approach. In D. R. Green (Chair), IRT-based standard setting procedures utilizing behavioral anchoring. Symposium conducted at the Council of Chief State School Officers National Conference on Large-Scale Assessment, Phoenix, AZ.

What is a Test Battery?

A test battery or assessment battery is a set of multiple psychometrically distinct exams delivered in one administration. In some cases, these are various tests cobbled together for related purposes, such as a psychologist testing an 8-year-old child on intelligence, anxiety, and the autism spectrum. In many cases, however, it is what we refer to as a single test title but is actually several separate tests, like a university admissions test that has English, Math, and Logical Reasoning components. Why do this? The key is that we want to keep the components psychometrically separate while maximizing the amount of information about the person, to meet the purposes of the test.

Learn more about our powerful exam platform that allows you to easily develop and deliver test batteries.

 

Examples of a Test Battery

Test batteries are used in a variety of fields, pretty much anywhere assessment is done.

Admissions and Placement Testing

The classic example is a university admissions test that has English, Math, and Logic portions.  These are separate tests, and psychometricians calculate the reliability and other important statistics for each one separately.  However, the scores are combined at the end to get an overall picture of examinee aptitude or achievement, and that combined score is used to maximally predict 4-year graduation rates and other important criterion variables.

Why is it called a battery?  Because we are battering the poor student with not just one, but many exams!

Pre-Employment Testing

Exam batteries are often used in pre-employment testing.  You might get tested on computer skills, numerical reasoning, and noncognitive traits such as integrity or conscientiousness. These are used together to gain incremental validity.  A good example is the CAT-ASVAB, which is the selection test to get into the US Armed Forces.  There are 10 tests (vocabulary, math, mechanical aptitude…).

Psychological or Psychoeducational Assessment

In a clinical setting, clinicians will often use a battery of tests, such as IQ, autism, anxiety, and depression.  Some IQ tests are themselves a battery, as they might assess visual reasoning, logical reasoning, numerical reasoning, etc.  However, these subtests have a positive manifold, meaning that they correlate quite highly with each other.  Another example is the Woodcock-Johnson.

K-12 Educational Assessment

Many large-scale tests that are used in schools are considered a battery, though often with only 2 or 3 aspects.  A common one in the USA is the NWEA Measures of Academic Progress.

 

Composite Scores

A composite score is a combination of scores from the tests in a battery.  If you took an admissions test like the SAT or GRE, you may recall that it adds your scores on the different subtests, while the ACT takes the average.  The ASVAB takes a linear combination of the four most important subtests and uses that composite for admission; the other subtests are used for job matching.
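As a quick illustration of these combination rules, here is a minimal sketch of a sum composite, a mean composite, and a weighted linear combination; the subtest names, scores, and weights are hypothetical and not any operational scoring formula.

```python
# Hypothetical standardized (z-score) subtest scores for one examinee
subtest_scores = {"english": 0.6, "math": 1.2, "logic": -0.1, "science": 0.4}

# Sum composite (SAT/GRE style) and mean composite (ACT style)
sum_composite = sum(subtest_scores.values())
mean_composite = sum_composite / len(subtest_scores)

# Weighted linear combination; the weights are purely illustrative and
# could be chosen by regression against a criterion such as first-year GPA
weights = {"english": 2.0, "math": 1.0, "logic": 1.0, "science": 0.0}
weighted_composite = sum(weights[k] * score for k, score in subtest_scores.items())

print(round(sum_composite, 2), round(mean_composite, 3), round(weighted_composite, 2))
# 2.1 0.525 2.3
```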

 

A Different Animal: Test with Sections

The battery is different than a single test that has distinct sections.  For example, a K12 English test might have 10 vocab items, 10 sentence-completion grammar items, and 2 essays.  Such tests are usually analyzed as a single test, as they are psychometrically unidimensional.

 

How to Deliver A Test Battery

In ASC’s platforms, Assess.ai and FastTest, all of this functionality is available out of the box: test batteries, composite scores, and sections within a test.  Moreover, they come with a lot of important functionality, such as separate time limits, navigation controls, customizable score reporting, and more.  Click here to request a free account and start applying best practices.