Tag Archive for: psychometrics

Psychometrics is the science of educational and psychological assessment, using data to ensure that tests are fair and accurate.  Ever felt like you took a test which was unfair, too hard, didn’t cover the right topics, or was full of questions that were simply confusing or poorly written?  Psychometricians are the people who help organizations fix these things using data science, as well as more advanced topics like how to design an AI algorithm that adapts to each examinee.

Psychometrics is a critical aspect of many fields.  Having accurate information on people is essential to education, human resources, workforce development, corporate training, professional certifications/licensure, medicine, and more.  It scientifically studies how tests are designed, developed, delivered, validated, and scored.

Key Takeaways on Psychometrics

  • Psychometrics is the study of how to measure and assess mental constructs, such as intelligence, personality, or knowledge of accounting law
  • Psychometrics is NOT just screening tests for jobs
  • Psychometrics is dedicated to making tests more accurate and fair
  • Psychometrics is heavily reliant on data analysis and machine learning, such as item response theory

 

What is Psychometrics?

Psychometrics is the study of assessment itself, regardless of what type of test is under consideration. In fact, many psychometricians don’t work on a particular test at all; they work on psychometrics itself, such as new methods of data analysis.  Most professionals are not tied to what the test is measuring, and will often switch to new jobs in completely unrelated areas, such as moving from a K-12 testing company to psychological measurement to an accounting certification exam.  We often refer to whatever we are measuring simply as “theta”, a term from item response theory.

Psychometrics tackles fundamental questions around assessment, such as how to determine if a test is reliable or if a question is of good quality, as well as much more complex questions like how to ensure that a score today on a university admissions exam means the same thing as it did 10 years ago.  Additionally, it examines phenomena like the positive manifold, where different cognitive abilities tend to be positively correlated, supporting the consistency and generalizability of test scores over time.

Psychometrics is a branch of data science.  In fact, it has been around since long before that term became a buzzword.  Don’t believe me?  Check out this Coursera course on Data Science: the first example they give of a foundational historical project in data science is… psychometrics!  (Specifically, early research on factor analysis of intelligence.)

Even though assessment is everywhere and Psychometrics is an essential aspect of assessment, to most people it remains a black box, and professionals are referred to as “psychomagicians” in jest. However, a basic understanding is important for anyone working in the testing industry, especially those developing or selling tests.

Psychometrics is NOT limited to very narrow types of assessment.  Some people use the term interchangeably with concepts like IQ testing, personality assessment, or pre-employment testing.  These are each but tiny parts of the field!  Also, it is not the administration of a test.

 

Why do we need Psychometrics?

The purpose of tests is to provide useful information about people, such as whether to hire them, certify them in a profession, or determine what to teach them next in school.  Better tests mean better decisions.  Why?  The scientific evidence is overwhelming that tests provide better information for decision makers than many other types of information, such as interviews, resumes, or educational attainment.  Thus, tests serve an extremely useful role in our society.

The goal of psychometrics is to provide validity: evidence to support that interpretations of scores from the test are what we intended.  If a certification test is supposed to mean that someone passing it meets the minimum standard to work in a certain job, we need a lot of evidence about that, especially since the test is so high stakes in that case.  Meta-analysis, a key tool in psychometrics, aggregates research findings across studies to provide robust evidence on the reliability and validity of tests. By synthesizing data from multiple studies, meta-analysis strengthens the validity claims of tests, especially crucial in high-stakes certification exams where accuracy and fairness are paramount.

 

What does Psychometrics do?


Building and maintaining a high-quality test is not easy.  A lot of big issues can arise.  Much of the field revolves around solving major questions about tests: what should they cover, what is a good question, how do we set a good cutscore, how do we make sure that the test predicts job performance or student success, etc.  Many of these questions align with the test development cycle – more on that later.

How do we define what should be covered by the test? (Test Design)

Before writing any items, you need to define very specifically what will be on the test.  If the test is in credentialing or pre-employment, psychometricians typically run a job analysis study to form a quantitative, scientific basis for the test blueprints.  A job analysis is necessary for a certification program to get accredited.  In Education, the test coverage is often defined by the curriculum.

How do we ensure the questions are good quality? (Item Writing)

There is a corpus of scientific literature on how to develop test items that accurately measure whatever you are trying to measure.  A great overview is the book by Haladyna.  This is not just limited to multiple-choice items, although that approach remains popular.  Psychometricians leverage their knowledge of best practices to guide the item authoring and review process in a way that the result is highly defensible test content.  Professional item banking software provides the most efficient way to develop high-quality content and publish multiple test forms, as well as store important historical information like item statistics.

How do we set a defensible cutscore? (Standard Setting)

Test scores are often used to classify candidates into groups, such as pass/fail (Certification/Licensure), hire/non-hire (Pre-Employment), and below-basic/basic/proficient/advanced (Education).  Psychometricians lead studies to determine the cutscores, using methodologies such as Angoff, Beuk, Contrasting-Groups, and Borderline.

How do we analyze results to improve the exam? (Psychometric Analysis)

Psychometricians are essential for this step, as the statistical analyses can be quite complex.  Smaller testing organizations typically utilize classical test theory, which is based on simple mathematics like proportions and correlations.  Large, high-profile organizations typically use item response theory (IRT), which is based on a type of nonlinear regression analysis.  Psychometricians evaluate overall reliability of the test, difficulty and discrimination of each item, distractor analysis, possible bias, multidimensionality, linking multiple test forms/years, and much more.  Software such as  Iteman  and  Xcalibre  is also available for organizations with enough expertise to run statistical analyses internally.  Scroll down below for examples.
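To make the classical test theory part concrete, here is a minimal sketch of the "proportions and correlations" mentioned above.  The scored data are hypothetical, the helper name pearson is ours, and the point-biserial here is the simple uncorrected version (the item is included in the total score).  Software such as Iteman reports these statistics and more, so treat this only as an illustration of the arithmetic.

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation; with a 0/1 item this is the point-biserial."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scored data: rows are examinees, columns are items (1 = correct).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in scores]
items = list(zip(*scores))  # one tuple of responses per item

# Item difficulty: proportion of examinees answering correctly (the P value).
p_values = [mean(item) for item in items]

# Item discrimination: correlation between the item score and the total score.
point_biserials = [pearson(item, totals) for item in items]

# Reliability: coefficient alpha (equal to KR-20 for dichotomous items).
k = len(items)
alpha = k / (k - 1) * (1 - sum(variance(item) for item in items) / variance(totals))

print(p_values, point_biserials, alpha)
```

In this toy data set the items range from easy (P = 0.8) to hard (P = 0.2), and every item correlates positively with the total score, which is what you hope to see.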

How do we compare scores across groups or years? (Equating)

This is referred to as linking and equating.  There are some psychometricians who devote their entire career to this topic.  If you are working on a certification exam, for example, you want to make sure that the passing standard is the same this year as last year.  If 76% of candidates passed last year and only 25% pass this year, not only will the candidates be angry, but there will be much less confidence in the meaning of the credential.

How do we know the test is measuring what it should? (Validity)

Validity is the evidence provided to support score interpretations.  For example, we might interpret scores on a test to reflect knowledge of English, and we need to provide documentation and research supporting this.  There are several ways to provide this evidence.  A straightforward approach is to establish content-related evidence, which includes the test definition, blueprints, and item authoring/review.  In some situations, criterion-related evidence is important, which directly correlates test scores to another variable of interest.  Delivering tests in a secure manner is also essential for validity.

 

Where is Psychometrics Used?

Certification/Licensure/Credentialing

In certification testing, psychometricians develop the test via a documented chain of evidence following a sequence of research outlined by accreditation bodies, typically: job analysis, test blueprints, item writing and review, cutscore study, and statistical analysis.  Web-based item banking software like  FastTest  is typically useful because the exam committee often consists of experts located across the country or even throughout the world; they can then easily log in from anywhere and collaborate.

Pre-Employment

In pre-employment testing, validity evidence relies primarily on establishing appropriate content (a test on PHP programming for a PHP programming job) and the correlation of test scores with an important criterion like job performance ratings (shows that the test predicts good job performance).  Adaptive tests are becoming much more common in pre-employment testing because they provide several benefits, the most important of which is cutting test time by 50% – a big deal for large corporations that test a million applicants each year. Adaptive testing is based on item response theory, and requires a specialized psychometrician as well as specially designed software like  FastTest.
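As a rough illustration of how an adaptive test uses item response theory, the sketch below picks the next item by maximum Fisher information under a two-parameter logistic (2PL) model.  The item bank, the parameter values, and the function names are all hypothetical, and operational adaptive testing platforms add exposure control, content constraints, and stopping rules that are not shown here.

```python
import math

def prob_correct(theta, a, b):
    """2PL item response function: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, bank, administered):
    """Return the index of the unused item with the most information at theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))

# Hypothetical bank of (discrimination a, difficulty b) pairs.
bank = [(1.0, -1.0), (1.0, 0.0), (2.0, 0.1)]

first = select_next_item(0.0, bank, set())
second = select_next_item(0.0, bank, {first})
print(first, second)
```

Because each item is chosen to be informative at the examinee's current ability estimate, fewer items are needed to reach a given precision, which is where the savings in test time come from.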

K-12 Education

Most assessments in education fall into one of two categories: lower-stakes formative assessment in classrooms, and higher-stakes summative assessments like year-end exams.  Psychometrics is essential for establishing the reliability and validity of higher-stakes exams, and for equating the scores across different years.  It is also important for formative assessments, which are moving towards adaptive formats because of the 50% reduction in test time, meaning that students spend less time testing and more time learning.

Universities

Universities typically do not give much thought to psychometrics, even though a significant amount of testing occurs in higher education, especially with the move to online learning and MOOCs.  Many of these exams are high stakes (consider a certificate exam after completing a year-long graduate program!).  Psychometricians should therefore be involved in establishing legally defensible cutscores and in the statistical analysis that ensures reliable tests, and professionally designed assessment systems, especially ones with enhanced security, should be used for developing and delivering the tests.

Medicine/Psychology

Have you ever taken a survey at your doctor’s office, or before/after a surgery?  Perhaps a depression or anxiety inventory at a psychotherapist?  Psychometricians have worked on these.

 

The Test Development Cycle

Psychometrics is the core of the test development cycle, which is the process of developing a strong exam.  It is sometimes called by similar names, such as the assessment lifecycle.

You will recognize some of the terms from the introduction earlier.  What we are trying to demonstrate here is that those questions are not standalone topics, or something you do once and simply file a report.  An exam is usually a living thing.  Organizations will often be republishing a new version every year or 6 months, which means that much of the cycle is repeated on that timeline.  Not all of it is; for example, many orgs only do a job analysis and standard setting every 5 years.

Consider a certification exam in healthcare.  The profession does not change quickly because things like anatomy never change and medical procedures rarely change (e.g., how to measure blood pressure).  So, every 5 years it does a job analysis of its certificants to see what they are doing and what is important.  This is then converted to test blueprints.  Items are re-mapped if needed, but most likely do not need it because there are probably only minor changes to the blueprints.  Then a new cutscore is set with the modified-Angoff method, and the test is delivered this year.  It is delivered again next year, but equated to this year rather than starting again.  However, the item statistics are still analyzed, which leads to a new cycle of revising items and publishing a new form for next year.

 

Example of Psychometrics in Action

Here is some output from our Iteman software.  This is deeply analyzing a single question on English vocabulary, to see if the student knows the word alleviate.  About 70% of the students answered correctly, with a very strong point-biserial.  The distractor P values were all in the minority and the distractor point-biserials were negative, which adds evidence to the validity.  The graph shows that the line for the correct answer is going up while the others are going down, which is good.  If you are familiar with item response theory, you’ll notice how the blue line is similar to an item response function.  That is not a coincidence.

[Figure: Iteman item analysis output for the vocabulary item on alleviate]

Now, let’s look at another one, which is more interesting.  Here’s a vocab question about the word confectioner.  Note that only 37% of the students get it right… even though there is a 25% chance just of guessing!!!  However, the point-biserial discrimination remains very strong at 0.49.  That means it is a really good item.  It’s just hard, which means it does a great job to differentiate amongst the top students.
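The distractor statistics discussed in these examples can be sketched in a few lines.  The responses, total scores, and answer key below are hypothetical (they are not the data behind the Iteman output), and the helper name pearson is ours; the point is only to show what "option P value" and "option point-biserial" mean.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation; with a 0/1 indicator this is the point-biserial."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical data for one four-option item: the option each examinee chose
# and that examinee's total test score.
key = "A"
choices = ["A", "A", "A", "B", "C", "D", "A", "B", "A", "C"]
totals = [9, 8, 8, 3, 4, 2, 7, 5, 9, 3]

analysis = {}
for option in "ABCD":
    chose = [1 if c == option else 0 for c in choices]
    # (proportion selecting the option, correlation of selecting it with total score)
    analysis[option] = (mean(chose), pearson(chose, totals))

for option, (p, rpbis) in analysis.items():
    flag = "key" if option == key else "distractor"
    print(option, flag, round(p, 2), round(rpbis, 2))
```

A healthy item shows a positive point-biserial for the keyed answer and negative point-biserials for the distractors, meaning stronger examinees pick the key and weaker examinees are drawn to the wrong options.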

[Figure: Iteman item analysis output for the vocabulary item on confectioner]

Psychometrics looks fun!  How can I join the band?

You will need a graduate degree.  I recommend you look at the NCME website (ncme.org) with resources for students.  Good luck!

Already have a degree and looking for a job?  Here are the two sites that I recommend:

  • NCME – Also has a job listings page that is really good (ncme.org)
  • Horizon Search – Headhunter for Psychometricians and I/O Psychologists

The Ebel method of standard setting is a psychometric approach to establish a cutscore for tests consisting of multiple-choice questions. It is usually used for high-stakes examinations in the fields of higher education, medical and health professions, and for selecting applicants.

How is the Ebel method performed?

The Ebel method requires a panel of judges who first categorize each item on the test by two criteria: level of difficulty and relevance or importance. The panel then agrees upon an expected percentage of items that should be answered correctly for each group of items according to their categorization.

It is crucial that the judges are experts in the examined field; otherwise, their judgments would not be valid and reliable. Prior to the item rating process, the panelists should be given a sufficient amount of information about the purpose and procedures of the Ebel method. In particular, it is important that the judges understand the meaning of difficulty and relevance in the context of the current assessment.

The next stage is to determine what “minimally competent” performance means in the specific case, depending on the content. When everything is clear and all definitions are agreed upon, the experts classify each item by difficulty (easy, medium, or hard) and relevance (minimal, acceptable, important, or essential). In order to minimize the influence of the judges’ opinions on each other, individual ratings are recommended over consensus ratings.

Afterwards, judgments on the proportion of items expected to be answered correctly by minimally competent candidates need to be collected for each item category, e.g., easy and essential. However, to simplify the rating and save time, the grid proposed by Ebel and Frisbie (1972) might be used. It is worth mentioning, though, that Ebel ratings are content-specific, so the values in the grid might turn out to be too low or too high for a particular test.

[Figure: Ebel rating grid of expected percent correct by difficulty and relevance]

At the end, the Ebel method, like the modified-Angoff method, identifies a cut-off score for an examination based on the performance of candidates in relation to a defined standard (absolute), rather than how they perform in relation to their peers (relative). Ebel scores for each item and for the whole exam are calculated as the average of the scores provided by each expert: the number of items in each category is multiplied by the expected percentage of correct answers, and the total results are added to calculate the cutscore.
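The calculation just described can be sketched as follows.  The grid values, the judge names, and the five-item test are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values from Ebel and Frisbie.  Summing the grid value for each item is equivalent to multiplying the number of items in each category by its expected proportion and adding the results.

```python
from statistics import mean

# Hypothetical grid: expected proportion correct for a minimally competent
# candidate, by relevance (rows) and difficulty (columns).
GRID = {
    "essential":  {"easy": 0.95, "medium": 0.80, "hard": 0.60},
    "important":  {"easy": 0.90, "medium": 0.70, "hard": 0.50},
    "acceptable": {"easy": 0.80, "medium": 0.60, "hard": 0.40},
    "minimal":    {"easy": 0.70, "medium": 0.50, "hard": 0.30},
}

# Each judge independently classifies every item as (relevance, difficulty).
ratings = {
    "judge_1": [("essential", "easy"), ("essential", "medium"), ("important", "hard"),
                ("acceptable", "easy"), ("minimal", "medium")],
    "judge_2": [("essential", "easy"), ("essential", "hard"), ("important", "hard"),
                ("acceptable", "medium"), ("minimal", "medium")],
}

def judge_cutscore(item_ratings):
    """Expected raw score of a minimally competent candidate for one judge."""
    return sum(GRID[relevance][difficulty] for relevance, difficulty in item_ratings)

# The panel cutscore is the average of the judges' individual cutscores.
cutscore = mean(judge_cutscore(r) for r in ratings.values())
n_items = len(ratings["judge_1"])
print(f"Cutscore: {cutscore:.2f} of {n_items} items ({100 * cutscore / n_items:.0f}%)")
```

Here judge 1 implies a cutscore of 3.55 and judge 2 implies 3.15, so the panel cutscore is 3.35 out of 5 items, or 67%.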

Pros of using Ebel

  • This method provides an overview of a test difficulty
  • Cut-off score is identified prior to an examination
  • It is relatively easy for experts to perform

Cons of using Ebel

  • This method is time-consuming and costly
  • Evaluation grid is hard to get right
  • Digital software is required
  • Back-up is necessary

Conclusion

The Ebel method is a fairly complex standard-setting process compared to others because it requires an analysis of the content, and it therefore imposes a burden on the standard-setting panel. However, Ebel considers both the relevance of the test items and the expected proportion of correct answers from minimally competent candidates, including borderline candidates. Thus, even though the procedure is complicated, the results are reported to be stable and close to the actual cut-off scores.

References

Ebel, R. L., & Frisbie, D. A. (1972). Essentials of educational measurement.


The distinction between a Speeded and a Power Test is one of the ways of differentiating psychometric or educational assessments. Depending on the assessment goals and time constraints, tests in educational measurement are categorized as speeded or power. There is also the concept of a Timed test, which is really a Power test. Let’s look at these types more carefully.

Speeded test

In a speeded test, examinees are expected to answer as many questions as possible, but the time limit is deliberately so short that even the best examinees cannot complete the test, which forces the speed.  Items are delivered sequentially, from the first one to the last one. The items are usually all relatively easy, though sometimes they increase in difficulty.  If the time limit and difficulty level are set correctly, none of the test takers will be able to reach the last item before the time limit is reached. A speeded test is supposed to demonstrate how fast an examinee can respond to questions within a time limit. In this case, examinees’ answers are not as important as their speed of answering questions. The total score is usually computed as the number of questions answered correctly when the time limit is reached, and differences in scores are mainly attributed to individual differences in speed rather than knowledge.

An example of this might be a mathematical calculation speed test. Examinees are given 100 multiplication problems and told to solve as many as they can in 20 seconds. Most examinees know the answers to all the items; it is a question of how many they can finish. Another might be a 10-key task, where examinees are given a list of 100 5-digit strings and told to type as many as they can in 20 seconds.

Pros of a speeded test:

  • A speeded test is appropriate when you actually want to test the speed of examinees; the 10-key task above would be useful in selecting data entry clerks, for example. The concept of “knowledge of a 5-digit string” in this case is not relevant and doesn’t even make sense.
  • Tests can sometimes be very short but still discriminating.
  • When a test mixes items of different difficulty, examinees might save some time by responding quickly to easier items in order to spend it on more difficult items. This can create an increased spread in scores.

Cons of a speeded test:

  • In most situations, a test is used to evaluate knowledge, not speed.
  • The nature of the test provokes examinees to commit errors even if they know the answers, which can be stressful.
  • A speeded test does not consider individual peculiarities of examinees.

Power test

A power test provides examinees with sufficient time so that they could attempt all items and express their true level of knowledge or ability. Therefore, this testing category focuses on assessing knowledge, skills, and abilities of the examinees.  The total score is often computed as a number of questions answered correctly (or with item response theory), and individual differences in scores are attributed to differences in ability under assessment, not to differences in basic cognitive abilities such as processing speed or reaction time.

There is also the concept of a Timed Test. This has a time limit, but the limit is NOT a major factor in how examinees respond to questions, and it does not meaningfully affect their scores. For example, the time limit might be set so that 95% of examinees are not affected at all, and the remaining 5% are slightly hurried. This is done with the CAT-ASVAB.
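One way to set such a limit, sketched here under assumptions, is to take a high percentile of the completion times observed in an untimed pilot.  The completion times below are hypothetical, the function name is ours, and the nearest-rank percentile is just one reasonable choice; this is not a description of how the CAT-ASVAB limits were actually derived.

```python
import math

def time_limit_for(times, unaffected_percent=95):
    """Smallest observed time that at least `unaffected_percent` of examinees meet."""
    ordered = sorted(times)
    rank = math.ceil(len(ordered) * unaffected_percent / 100)  # nearest-rank percentile
    return ordered[rank - 1]

# Hypothetical completion times (minutes) from an untimed pilot of 20 examinees.
times = [31, 34, 35, 36, 38, 39, 40, 41, 42, 43,
         44, 45, 46, 47, 48, 50, 52, 55, 61, 75]

limit = time_limit_for(times)
finished = sum(t <= limit for t in times)
print(f"Limit of {limit} minutes: {finished} of {len(times)} examinees unaffected")
```

With these numbers the limit is 61 minutes, which 19 of the 20 pilot examinees (95%) would have met without being hurried.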

Pros of a power test:

  • There are no time restrictions for test-takers
  • A power test is great for evaluating the knowledge, skills, and abilities of examinees
  • A power test reduces the chance that examinees commit errors on items they actually know
  • A power test considers individual peculiarities of examinees

Cons of a power test:

  • It can be time consuming (some of these exams are 8 hours long or even more!)
  • This test format sometimes does not suit competitive examinations because of administrative issues (too much test time across too many examinees)
  • Power test is sometimes bad for discriminative purposes, since all examinees have high chances to perform well.  There are certainly some pass/fail knowledge exams where almost everyone passes.  But the purpose of those exams is not to differentiate for selection, but to make sure students have mastered the material, so this is a good thing in that case.

Speeded test vs power test

Whether a test is speeded or power depends on the assessment purpose and on the examinees. For instance, an arithmetic test containing many relatively easy questions might be a speeded test for Grade 8 students, while the same test could be a power test for Grade 7 students, for whom the questions are harder. In other words, a speeded test starts to measure power when the items are difficult enough that knowledge, rather than speed, drives the score. Similarly, a power test moves toward a speeded test once its time limit is tight enough to affect how examinees respond. Today, a pure speeded or pure power test is rare. Usually, what we meet in practice is a mixture of both, typically a Timed Test.

Below you may find a comparison of a speeded vs power test, in terms of the main features.

 

Speeded test | Power test
Time limit is fixed, and it affects all examinees | There is no time limit, or there is one and it only affects a small percentage of examinees
The goal is to evaluate speed only, or a combination of speed and correctness | The goal is to evaluate correctness, in the sense of the knowledge, skills, and abilities of test-takers
Questions are relatively easy in nature | Questions are relatively difficult in nature
Test format increases chances of committing errors | Test format reduces chances of committing errors

 


Educational assessment of Mathematics achievement is a critical aspect of most educational ministries and programs across the world. One might say that all subjects at school are equally important and that would be relatively true. However, Mathematics stands out amongst the remaining ones, because it is more than just an academic subject. Here are three reasons why Math is so important:

Math is everywhere. Almost any job is tough to do without mathematical knowledge. Executives, musicians, accountants, fashion designers, and even mothers use Math in their daily lives. In particular, Math is essential for decision-making in the fast-growing digital world.

Math designs thinking paths. Math enables people, especially children, to analyze and solve real-world problems by developing logical and critical thinking. Einstein’s words describe this fact inimitably, “Pure mathematics is, in its way, the poetry of logical ideas”.

Math is a language of science. Math gives tools for understanding and developing engineering, science, and technology. Mathematical language, including symbols and their meanings, is the same throughout the world, so scientists use math to communicate concepts.

No matter which profession a student chooses, they will likely need solid knowledge of Math to enter an undergraduate or graduate program. Some well-known tests that contain a Math section are TIMSS, PISA, ACT, SAT, SET, and GRE.

The role of educational assessment in Math

Therefore, an important subject like Math needs careful and accurate assessment approaches starting from school. Educational assessment is the process of collecting data on student progress in knowledge acquisition to inform future academic decisions towards learning goals. This is true at the individual student level, teacher or school level, district level, and state or national level. There are different types of assessment depending on the scale, the purpose, and how the collected data will be used. Effective test preparation is crucial to help students perform to the best of their abilities and gain confidence in their mathematical skills.

In general, educational authorities in many countries apply a criteria-based approach to classroom and external assessment of Mathematics. Criteria help divide a construct of knowledge into digestible portions, so that students understand what they have to acquire and teachers can constructively intervene in individual learning paths to make sure that students achieve the learning goals in the end.

Classroom assessment or assessment for learning is curriculum-based. Teachers use learning objectives from Math curriculum to form assessment criteria and make tasks according to the latter. Teachers employ assessment results for making informed decisions on the student level.

External assessment, or assessment of learning, is also curriculum-based, but it covers many more topics than classroom assessment. Tasks are written by external specialists, usually from an independent educational institution. The assessment procedure itself is likely to be invigilated, and its results are used by different authorities, not just teachers, to evaluate not only student progress in learning Math but also the curriculum.

Applications of educational assessment of Mathematics

The aforementioned types of assessment are classroom- and school-level, and both are mostly formatted as paper-and-pencil tests. There are also internationally recognized assessment programs that cover Math, such as the Programme for International Student Assessment (PISA). PISA set a global trend of applying knowledge and skills in Math to solving real-world problems.

In 2018, PISA began using computerized adaptive testing, which is a significant shift that benefits students across the full range of ability. Applying adaptive technologies to the assessment and evaluation of Math could also help motivate students, many of whom are not big fans of Math, because each student sees questions matched to their level. As a result, teachers and other stakeholders could get more valid and reliable data on student progress in learning Math.

Implementation

The first steps towards implementation of modern technologies for educational assessment of Math at schools and colleges are extensive research and planning. Second, there has to be a pool of good items written according to the best international practices. Third, assessment procedures have to be standardized. Last but not least, schools would need a consultant with rich expertise in adaptive technologies and psychometrics.

An important consideration is the item types or formats to be used.  FastTest allows you to use not only traditional formats like multiple choice, but also advanced formats like drag and drop or the presentation of an equation editor to the student.  An example of that is below.

 

[Figure: equation editor item type]

 

Why is educational assessment of Math so important?

Educational assessment of Math is one of the major focuses of PISA and other assessments for good reason.  Since Math skills translate to job success in many fields, especially STEM fields, a well-educated workforce is one of the necessary components of a modern economy.  So an educational system needs to know that it is preparing students for the future needs of the economy.  One aspect of this is progress monitoring, which tracks learning over time so that we can not only help individual students but also effect the aggregate changes needed to improve the educational system.

 


If you are delivering high-stakes tests in linear forms – or piloting a bank for CAT/LOFT – you are faced with the issue of how to equate the forms together.  That is, how can we defensibly translate a score on Form A to a score on Form B?  While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. There are a number of ways to approach this issue, and IRT equating is the strongest.

Why do we need equating?

The need is obvious: to adjust for differences in difficulty to ensure that all examinees receive a fair score on a stable scale.  Suppose you take Form A and get a score of 72/100 while your friend takes Form B and gets a score of 74/100.  Is your friend smarter than you, or did his form happen to have easier questions?  Well, if the test designers built in some overlap, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items.  Both forms are each delivered to a large, representative sample. Here are the results.

Form | Mean score on 50 overlap items | Mean score on 100 total items
A | 30 | 72
B | 32 | 74

Because the mean score on the anchor items was higher, we then think that the Form B group was a little smarter, which led to a higher total score.

Now suppose these are the results:

Form | Mean score on 50 overlap items | Mean score on 100 total items
A | 32 | 72
B | 32 | 74

Now, we have evidence that the groups are of equal ability.  The higher total score on Form B must then be because the unique items on that form are a bit easier.
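The reasoning for this second set of results is simple arithmetic: subtract the anchor mean from the total mean to get the mean on each form's 50 unique items.  The snippet below does exactly that with the numbers from the table directly above; the variable names are ours.

```python
# Mean scores from the second table above (50 anchor items, 100 items in total).
forms = {
    "A": {"anchor_mean": 32, "total_mean": 72},
    "B": {"anchor_mean": 32, "total_mean": 74},
}

# Mean score on the 50 items that are unique to each form.
unique_means = {name: f["total_mean"] - f["anchor_mean"] for name, f in forms.items()}

# Anchor items are identical on both forms, so a gap here reflects group ability.
anchor_gap = forms["B"]["anchor_mean"] - forms["A"]["anchor_mean"]

# Unique items differ between forms, so with equal ability a gap here reflects difficulty.
unique_gap = unique_means["B"] - unique_means["A"]

print(unique_means, anchor_gap, unique_gap)
```

The anchor gap is 0 while the unique-item gap is 2 points (40 on Form A versus 42 on Form B), which is why the 2-point difference in total score is attributed to easier unique items on Form B rather than to a stronger group.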

How do I calculate an equating?

You can equate forms with classical test theory (CTT) or item response theory (IRT).  However, one of the reasons that IRT was invented was that equating with CTT was very weak.  CTT methods include Tucker, Levine, and equipercentile.  Right now, though, let’s focus on IRT.

IRT equating

There are three general approaches to IRT equating.  All of them can be accomplished with our industry-leading software Xcalibre, though conversion equating requires an additional program called IRTEQ.

  1. Conversion
  2. Concurrent Calibration
  3. Fixed Anchor Calibration

Conversion

With this approach, you calibrate each form of your test using IRT, completely separately.  You then evaluate the relationship between the IRT parameters on each form and use that relationship to convert examinee scores.  Conceptually, you line up the IRT parameters of the common items and find the linear transformation between the two sets, so you can then apply that linear conversion to scores.

But DO NOT just run an ordinary linear regression.  There are specific methods you must use, including mean/mean, mean/sigma, Stocking & Lord, and Haebara.  Fortunately, you don’t have to figure out all the calculations yourself, as there is free software available to do it for you:  IRTEQ.
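To give a feel for what these methods compute, here is a minimal sketch of the mean/sigma method, the simplest of the four to write down.  It finds the slope A and intercept B that place Form B's scale onto Form A's scale using only the difficulty (b) parameters of the common items.  The parameter values and function names are hypothetical, and Stocking & Lord or Haebara, which work with the whole response curves, are generally preferred in operational work.

```python
from statistics import mean, stdev

def mean_sigma(b_new, b_base):
    """Linking constants that map the new form's scale onto the base form's scale."""
    A = stdev(b_base) / stdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B

# Hypothetical difficulty estimates for the same three anchor items,
# calibrated separately on each form.
b_on_form_b = [-1.0, 0.0, 1.0]   # new form (to be converted)
b_on_form_a = [-1.0, 1.0, 3.0]   # base form (the reporting scale)

A, B = mean_sigma(b_on_form_b, b_on_form_a)

def to_base_scale(theta):
    """Convert a Form B theta (or b parameter) to the Form A scale."""
    return A * theta + B

def a_to_base_scale(a):
    """Discrimination parameters are divided by the slope."""
    return a / A

print(A, B, to_base_scale(0.5), a_to_base_scale(1.2))
```

In this toy example A = 2 and B = 1, so a Form B theta of 0.5 becomes 2.0 on the Form A scale.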

Concurrent Calibration

The second approach is to combine the datasets into what is known as a sparse matrix.  You then run this single data set through the IRT calibration, and it will place all items and examinees onto a common scale.  The concept of a sparse matrix is typically represented by the figure below, representing the nonequivalent groups anchor test (NEAT) design.

The IRT calibration software will automatically equate the two forms and you can use the resultant scores.
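To make the sparse matrix idea concrete, here is a minimal Python sketch that stacks two forms into one data set.  The item IDs, the dict-per-examinee layout, and the function name are assumptions for illustration; your calibration software will have its own required input format and missing-data code.

```python
def build_sparse_matrix(form_a, form_b):
    """Stack responses from two forms into one matrix for concurrent calibration.

    form_a, form_b: lists of dicts mapping item ID -> scored response (0/1),
    one dict per examinee.  Items an examinee never saw are stored as None.
    """
    everyone = form_a + form_b
    # Columns are the union of all items across both forms.
    item_ids = sorted({item for person in everyone for item in person})
    rows = [[person.get(item) for item in item_ids] for person in everyone]
    return item_ids, rows


# Hypothetical data: A1/A2 are unique to Form A, B1/B2 to Form B, C1/C2 are anchors.
form_a = [{"A1": 1, "A2": 0, "C1": 1, "C2": 1}]
form_b = [{"B1": 0, "B2": 1, "C1": 1, "C2": 0}]

item_ids, rows = build_sparse_matrix(form_a, form_b)
print(item_ids)
for row in rows:
    print(row)
```

Only the anchor columns are filled in for every examinee, and that shared block is what ties the two forms to a single scale.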

Fixed Anchor Calibration

The third approach is a combination of the two above; it utilizes the separate calibration concept but still uses the IRT calibration process to perform the equating rather than separate software.

With this approach, you would first calibrate your data for Form A.  You then find all the IRT item parameters for the common items and input them into your IRT calibration software when you calibrate Form B.

You can tell the software to “fix” the item parameters so that those particular ones (from the common items) do not change.  Then all the item parameters for the unique items are forced onto the scale of the common items, which of course is the underlying scale from Form A.  This then also forces the scores from the Form B students onto the Form A scale.

How do these IRT equating approaches compare to each other?

Concurrent calibration is arguably the easiest but has the drawback that it merges the scales of each form into a new scale somewhere in the middle.  If you need to report the scores on either form on the original scale, then you must use the Conversion or Fixed Anchor approaches.  This situation commonly happens if you are equating across time periods.

Suppose you delivered Form A last year and are now trying to equate Form B.  You can’t just create a new scale and thereby nullify all the scores you reported last year.  You must map Form B onto Form A so that this year’s scores are reported on last year’s scale and everyone’s scores will be consistent.

Where do I go from here?

If you want to do IRT equating, you need IRT calibration software.  All three approaches use it.  I highly recommend  Xcalibre  since it is easy to use and automatically creates reports in Word for you.  If you want to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014).  There are other resources more readily available on the internet, like this free handbook from CCSSO.  If you would like to learn more about IRT, I recommend the books by de Ayala (2008) and Embretson & Reise (2000).  An intro is available in our blog post.

Item banking refers to the purposeful creation of a database of assessment items to serve as a central repository of all test content, improving efficiency and quality. The term item refers to what many call questions, though items need not be restricted to questions and can include problems to solve or situations to evaluate. Regular item review is essential to ensure that each item meets content standards, is fair, and is free from bias, thereby maintaining the integrity and accuracy of the item bank. As a critical part of the test development cycle, item banking is the foundation for the development of valid, reliable content and defensible test forms.

Automated item banking systems, such as  Assess.ai  or  FastTest, result in significantly reduced administrative time for developing/reviewing items and assembling/publishing tests, while producing exams that have greater reliability and validity.  Contact us to request a free account.

What is Item Banking?

While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Here are the essentials you should be looking for:

   Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally, item performance should be tracked not only within a test form but across test forms as well.

   Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. Upon exposure to many examinees, items can then be flagged for retirement or revision to reduce cheating or teaching to the test.

   Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization methods, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.

   Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.

   Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content and the users who made these changes are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item can expedite that process.

   Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, item response theory parameters, and classical test theory statistics, but there are likely many data points specific to your organization that are worth storing.
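As a rough sketch of what these essentials imply for the underlying data, here is one possible item record in Python.  The class and field names are hypothetical and are not taken from any particular item banking platform; a real system would store far more.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ItemVersion:
    version: int
    stem: str
    edited_by: str  # who made the change, for the review audit trail


@dataclass
class Item:
    name: str
    content_area: str
    author: str
    source: str = ""
    versions: List[ItemVersion] = field(default_factory=list)
    forms_used_on: List[str] = field(default_factory=list)  # usage history
    p_value: Optional[float] = None  # classical statistic, if available
    irt_b: Optional[float] = None    # IRT difficulty, if calibrated

    def revise(self, stem, edited_by):
        """Add a new version instead of overwriting the old one."""
        self.versions.append(ItemVersion(len(self.versions) + 1, stem, edited_by))

    @property
    def current(self):
        return self.versions[-1]


item = Item(name="ACC-001", content_area="Accounting law", author="J. Doe")
item.revise("Which statement is the best definition of accrual?", "J. Doe")
item.revise("Which statement best defines accrual accounting?", "Reviewer 1")
item.forms_used_on.append("Form A")
print(item.current.version, len(item.versions), item.forms_used_on)
```

Keeping every version linked to the same item record is what makes it possible to review the performance of earlier versions alongside the current one.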

 

Managing an Item Bank

Names are important. As you create or import your item banks it is important to identify each item with a unique, but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting.  You might want to also add additional pieces of information.  If importing, the system should be smart enough to recognize duplicates.
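A quick Python illustration of why leading zeros matter when names are sorted as text.  The "ACC" prefix and the make_name helper are hypothetical.

```python
def make_name(prefix, number, width=3):
    """Build an item name with a zero-padded number, e.g. ACC-007."""
    return f"{prefix}-{number:0{width}d}"


unpadded = ["ACC-10", "ACC-2", "ACC-1"]
padded = [make_name("ACC", n) for n in (10, 2, 1)]

print(sorted(unpadded))  # text sort puts ACC-10 before ACC-2
print(sorted(padded))    # padded names sort in true numerical order
```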

Search and filter. The system should also have a reliable sorting mechanism. 


 

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department, or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability, and validity.

Item statistics are used in the assembly of test forms; classical statistics can be used to estimate a form's mean, standard deviation, reliability, standard error, and pass rate.


Item response theory parameters can come in handy when calculating test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
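As a minimal sketch of how stored IRT parameters feed these functions, the Python below computes test information and the conditional standard error for a set of items.  It assumes the two-parameter logistic model without the 1.7 scaling constant; the function names and the parameter values are hypothetical.

```python
import math


def prob_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def form_information(theta, items):
    """Test information at theta: the sum of the item information values."""
    total = 0.0
    for a, b in items:
        p = prob_2pl(theta, a, b)
        total += a ** 2 * p * (1.0 - p)
    return total


def standard_error(theta, items):
    """Conditional standard error of measurement at theta."""
    return 1.0 / math.sqrt(form_information(theta, items))


# Four hypothetical items, each with a = 1.0 and b = 0.0.
items = [(1.0, 0.0)] * 4
print(form_information(0.0, items), standard_error(0.0, items))
```

Evaluating these functions across a range of theta values for two candidate forms is one way to check that the forms measure with similar precision before either is delivered.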

In the event that your organization decides to publish an adaptive test, utilizing computerized adaptive testing delivery, item parameters for each item will be essential. This is because they are used for intelligent selection of items and scoring examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility and the storage of metadata is one such vital piece of documentation.

 

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test-takers, as they are likely to be diverse as well!


 

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. In the case of a high school science test, this may include the evaluation of the content taught in class. A high-stakes certification exam almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed.

Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.
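Here is a minimal Python sketch of selecting items against a blueprint.  The bank, the content areas, and the function name are hypothetical, and the selection rule (take the first items found in each area) is deliberately naive; real form assembly would also weigh item statistics and exposure.

```python
def assemble_form(bank, blueprint):
    """Pick items from the bank to satisfy a blueprint.

    bank: list of (item_name, content_area) tuples.
    blueprint: dict mapping content area -> number of items required.
    """
    form = []
    for area, needed in blueprint.items():
        candidates = [name for name, item_area in bank if item_area == area]
        if len(candidates) < needed:
            raise ValueError(f"Not enough items in content area: {area}")
        form.extend(candidates[:needed])
    return form


bank = [
    ("BIO-001", "Biology"), ("BIO-002", "Biology"), ("BIO-003", "Biology"),
    ("CHE-001", "Chemistry"), ("CHE-002", "Chemistry"),
    ("PHY-001", "Physics"),
]
blueprint = {"Biology": 2, "Chemistry": 1, "Physics": 1}
print(assemble_form(bank, blueprint))
```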

 

The Benefits of Item Banking

There is no doubt that item banking is a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment. Moreover, good item banking will make the job easier and more efficient thus reducing the cost of item development and test publishing.

 

Ready to improve assessment quality through item banking?

Visit our Contact Us page, where you can request a demonstration or a free account (up to 500 items).


Classical test theory is a century-old paradigm for psychometrics – using quantitative and scientific processes to develop and analyze assessments to improve their quality.  (Nobody likes unfair tests!)  The most basic and frequently used item statistic from classical test theory is the P-value.  It is usually called item difficulty but is sometimes called item facility, which can lead to possible confusion.

The P-Value Statistic

The classical P-value is the proportion of examinees that respond correctly to a question, or respond in the “keyed direction” for items where the notion of correct is not relevant (imagine a personality assessment where all questions are Yes/No statements such as “I like to go to parties” … Yes is the keyed direction for an Extraversion scale).  Note that this is NOT the same as the p-value that is used in hypothesis testing from general statistical methods.  This P-value is almost universally agreed upon in terms of calculation.  But some people call it item difficulty and others call it item facility.  Why?

It has to do with the clarity of interpretation.  It usually makes sense to think of difficulty as an important aspect of the item.  The P-value presents this, but in a reverse manner.  We usually expect higher values to indicate more of something, right?  But a P-value of 1.00 is high, and it means that there is not much difficulty; everyone gets the item correct, so it is actually no difficulty whatsoever.  A P-value of 0.25 is low, but it means that there is a lot of difficulty; only 25% of examinees are getting it correct, so it has quite a lot of difficulty.

So where does “item facility” come in?

See how the meaning is reversed?  It’s for this reason that some psychometricians prefer to call it item facility or item easiness.  We still use the P-value, but 1.00 means high facility/easiness, and 0.25 means low facility/easiness.  The direction of the semantics fits much better.
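To make the calculation concrete, here is a minimal Python sketch that computes the classical P-value for each item from a scored response matrix.  The data are made up for illustration, and the function name is hypothetical.

```python
def p_values(responses):
    """Proportion of examinees answering each item correctly.

    responses: list of rows, one per examinee, each a list of 0/1 scores
    (1 = correct or keyed direction).  All rows must have the same length.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]


# Four hypothetical examinees, three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
print(p_values(responses))
```

The first item has a P-value of 1.00 (high facility, no difficulty), while the third has 0.25 (low facility, high difficulty).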

Nevertheless, only a minority of psychometricians use this term.  There’s too much momentum to change an entire field at this point!  It’s similar to the 3 dichotomous IRT parameters (a, b, c); some of you might have noticed that they are actually in the wrong order because the 1-parameter model does not use the a parameter; it uses the b parameter.

At the end of the day, it doesn’t really matter, but it’s another good example of how we all just got used to doing something and it’s now too far down the road to change it.  Tradition is a funny thing.