
Job Task Analysis (JTA) is an essential step in designing a test to be used in the workforce, such as pre-employment or certification/licensure, by analyzing data on what is actually being done in the job.  Also known as Job Analysis or Role Delineation, job task analysis is important to design a test that is legally defensible and eligible for accreditation.  It usually involves a panel of subject matter experts to develop a survey, which you then deliver to professionals in your field to get quantitative data about what is most frequently done on the job and what is most critical/important.  This data can then be used for several important purposes.

Need help? Our experts can help you efficiently produce a job task analysis study for your certification, guide the process of item writing and standard setting, then publish and deliver the exam on our secure platform.

 

Reasons to do a Job Task Analysis

Job analysis is extremely important in the field of industrial/organizational psychology, hence the meme here from @iopsychmemes.  It’s not just limited to credentialing.

Job analysis I/O Psychology

Exam design

The most common reason is to get quantitative data that will help you design an exam.  By knowing which knowledge, skills, or abilities (KSAs) are most commonly used, you then know which deserve more questions on the test.  It can also help you with more complex design aspects, such as defining a practical exam with live patients.

Training curriculum

Similarly, that quantitative info can help design a curriculum and other training materials.  You will have data on what is most important or frequent.

Compensation analysis

You have a captive audience with the JTA survey.  Ask them other things that you want to know!  This is an excellent time to gather information about compensation.  I worked on a JTA in the past which asked about work location: clinic, hospital, private practice, or vendor/corporate.

Job descriptions

A good job analysis will help you write a job description for postings.  It will tell you the job responsibilities (common tasks), qualifications (required skills, abilities, and education), and other important aspects.  If you gather compensation data in the survey, that can be used to define the salary range of the open position.

Workforce planning

Important trends might become obvious when analyzing the data.  Are fewer people entering your profession, perhaps specific to a certain region or demographic?  Are they entering without certain skills?  Are there certain universities or training programs that are not performing well?  A JTA can help you discover such issues and then work with stakeholders to address them.  These are major potential problems for the profession.

IT IS MANDATORY

If you have a professional certification exam and want to get it accredited by a board such as NCCA or ANSI/ANAB/ISO, then you are REQUIRED to do some sort of job task analysis.

 

Why is a JTA so important for certification and licensure?  Validity.

The fundamental goal of psychometrics is validity, which is evidence that the interpretations we make from scores are actually true. In the case of certification and licensure exams, we are interpreting that someone who passes the test is qualified to work in that job role. So, the first thing we need to do is define exactly what is the job role, and to do it in a quantitative, scientific way. You can’t just have someone sit down in their basement and write up 17 bullet points as the exam blueprint.  That is a lawsuit waiting to happen.

There are other aspects that are essential as well, such as item writer training and standard setting studies.

 

The Methodology: Job Task Inventory

It’s not easy to develop a defensible certification exam, but the process of job task analysis (JTA) doesn’t require a Ph.D. in Psychometrics to understand. Here’s an overview of what to expect.

  1. Convene a panel of subject matter experts (SMEs), and provide a training on the JTA process.
  2. The SMEs then discuss the role of the certification in the profession, and establish high-level topics (domains) that the certification test should cover. Usually, there are 5-20. Sometimes there are subdomains, and occasionally sub-subdomains.
  3. The SME panel generates a list of job tasks that are assigned to domains; the list is reviewed for duplicates and other potential issues. These tasks have an action verb, a subject, and sometimes a qualifier. Examples: “Calibrate the lensometer,” “Take out the trash”, “Perform an equating study.”  There is a specific approach to help with the generation, called the critical incident technique.  With this, you ask the SMEs to describe a critical incident that happened on the job and what skills or knowledge led to success by the professional.  While this might not generate ideas for frequent yet simple tasks, it can help generate ideas for tasks that are rarer but very important.
  4. The final list is used to generate a survey, which is sent to a representative sample of professionals that actually work in the role. The respondents take the survey, whereby they rate each task, usually on its importance and time spent (sometimes called criticality and frequency). Demographics are also gathered, which include age range, geographic region, work location (e.g., clinic vs hospital if medical), years of experience, educational level, and additional certifications.
  5. A psychometrician analyzes the results and creates a formal report, which is essential for validity documentation.  This report is sometimes considered confidential, sometimes published on the organization’s website for the benefit of the profession, and sometimes published in an abbreviated form.  It’s up to you.  For example, this site presents the final results, but then asks you to submit your email address for the full report.

 

Using JTA results to create test blueprints

Many corporations do a job analysis purely for in-house purposes, such as job descriptions and compensation.  This becomes important for large corporations where you might have thousands of people in the same job; it needs to be well-defined, with good training and appropriate compensation.

If you work for a credentialing organization (typically a non-profit, but sometimes the training arm of a corporation; for example, Amazon Web Services has a division dedicated to certification exams), you will need to analyze the results of the JTA to develop exam blueprints.  We will discuss this process in more detail in another blog post.  But below is an example of how this will look, and here is a free spreadsheet to perform the calculations: Job Task Analysis to Test Blueprints.

 

Job Task Analysis Example

Suppose you are an expert widgetmaker in charge of the widgetmaker certification exam.  You hire a psychometrician to guide the organization through the test development process.  The psychometrician would start by holding a webinar or in-person meeting for a panel of SMEs to define the role and generate a list of tasks.  The group comes up with a list of 20 tasks, sorted into 4 content domains.  These are listed in a survey to current widgetmakers, who rate them on importance and frequency.  The psychometrician analyzes the data and presents a table like you see below.

We can see here that Task 14 is the most frequent, while Task 2 is the least frequent.  Task 7 is the most important while Task 17 is the least.  When you combine Importance and Frequency either by adding or multiplying, you get the weights on the right-hand columns.  If we sum these and divide by the total, we get the suggested blueprints in the green cells.

 

Job task analysis to test blueprints
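To make the arithmetic concrete, here is a minimal Python sketch of that calculation.  The tasks, domains, and ratings below are hypothetical, not the actual survey data behind the table above.

    # Minimal sketch: convert hypothetical JTA survey ratings into blueprint weights.
    tasks = [
        # (task, domain, mean importance, mean frequency) -- made-up numbers
        ("Task 1", "Domain A", 3.2, 2.1),
        ("Task 2", "Domain A", 2.0, 1.2),
        ("Task 3", "Domain B", 4.5, 3.8),
        ("Task 4", "Domain B", 3.9, 4.6),
    ]

    # Combine importance and frequency; adding or multiplying are both common choices.
    weights = [imp * freq for (_, _, imp, freq) in tasks]
    total = sum(weights)

    # Roll the task weights up to domains and convert to blueprint percentages.
    domain_weights = {}
    for (task, domain, imp, freq), w in zip(tasks, weights):
        domain_weights[domain] = domain_weights.get(domain, 0.0) + w

    for domain, w in domain_weights.items():
        print(f"{domain}: {100 * w / total:.1f}% of the exam")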

 


Question and Test Interoperability® (QTI®) is a set of standards around the format of import/export files for test questions in educational assessment and HR/credentialing exams.  This facilitates the movement of questions from one software platform to another, including item banking, test assembly, e-Learning, training, and exam delivery.  This serves two main purposes:

  1. Allows you to use multiple vendors more easily, such as one for item banking and another for exam delivery;
  2. Allows you to migrate to a new vendor, as you can export all your content from the old vendor and then import it into the new one easily.

In this blog post, we’ll discuss the significance of QTI and how it helps test sponsors in the world of certification, workforce, and educational assessment.

What is Question and Test Interoperability (QTI)?

QTI is a widely adopted standard that facilitates the exchange of assessment content and results between various learning platforms and assessment tools. Developed by the IMS Global Learning Consortium / 1EdTech, its goal is that assessments can be created, delivered, and evaluated consistently across different systems, paving the way for a more efficient and streamlined educational experience.  QTI is similar to SCORM; SCORM is intended for learning content, while QTI is specific to assessment.

QTI uses an XML approach to content and markup, specifically adapted for educational assessment, covering stems, answer options, correct answers, and scoring information.  Version 2.x creates a zip file of all content, including a manifest file that lets the importing platform know what is supposed to be coming in, with items as separate XML files and media files saved separately, sometimes in a subfolder.

Here is an example of the file arrangement inside the zip:

QTI files
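Such a package can also be inspected programmatically.  Here is a minimal Python sketch using only the standard library; the package file name is hypothetical, and real packages vary by vendor and QTI version.

    # Sketch: list the contents of a QTI 2.x package and read its manifest.
    import zipfile
    import xml.etree.ElementTree as ET

    package_path = "my_item_bank_qti.zip"   # hypothetical file name

    with zipfile.ZipFile(package_path) as pkg:
        # List everything in the package: manifest, item XML files, media files.
        for name in pkg.namelist():
            print(name)

        # The manifest tells the importing platform what resources to expect.
        with pkg.open("imsmanifest.xml") as manifest:
            tree = ET.parse(manifest)
            for elem in tree.iter():
                if elem.tag.endswith("resource"):
                    print(elem.attrib.get("identifier"), elem.attrib.get("href"))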

Here is an example of what a test question would look like:

QTI example item

 

Why is QTI important?

Interoperability Across Platforms

QTI enables educators to create assessments on one platform and seamlessly transfer them to another. This cross-platform compatibility is crucial in today’s diverse educational technology landscape, where institutions often use a combination of learning management systems, assessment tools, and other applications.

Enhanced Efficiency

With QTI, the time-consuming process of manually transferring assessment content between systems is eliminated. This not only saves valuable time for educators but also ensures that the integrity of the assessment is maintained throughout the transfer process.

Adaptability to Diverse Assessment Types

QTI supports a wide range of question types, including multiple-choice, true/false, short answer, and more. This adaptability allows educators to create diverse and engaging assessments that cater to different learning styles and subject matter.

Data Standardization

The standardization of data formats within QTI ensures that assessment results are consistent and easily interpretable. This standardization not only facilitates a smoother exchange of information but also enables educators to gain valuable insights into student performance across various assessments.

Facilitating Accessibility

QTI is designed to support accessibility standards, making assessments more inclusive for all students, including those with disabilities. By adhering to accessibility guidelines, educational institutions can ensure that assessments are a fair and effective means of evaluating student knowledge.

 

How to Use QTI

Creating Assessments


QTI allows educators to author assessments in a standardized format that can be easily transferred between different platforms. When creating assessments, users adhere to the specification, ensuring compatibility and consistency.

Importing and Exporting Assessments

Educational institutions often use multiple learning management systems and assessment tools. QTI simplifies the process of transferring assessments between different platforms, eliminating the need for manual adjustments and reducing the risk of data corruption.

Adhering to QTI Specifications

To fully leverage the benefits of QTI, users must adhere to its specifications when creating and implementing assessments. Understanding the QTI schema and guidelines is essential for ensuring that assessments are interoperable across various systems.  This is dependent on the vendor you select.  Note that several versions of the QTI standard have evolved over the years, and some vendors have slightly modified the format for their own purposes!

 

Examples of QTI Applications

Online Testing Platforms

QTI is widely used in online testing platforms to facilitate the seamless transfer of assessments. Whether transitioning between different learning management systems or integrating third-party assessment tools, it ensures a smooth and standardized process.

Learning Management Systems (LMS)

Educational institutions often employ different LMS platforms. QTI allows educators to create assessments in one LMS and seamlessly transfer them to another, ensuring continuity and consistency in the assessment process.

Assessment Authoring Tools

QTI is integrated into various assessment authoring tools, enabling educators to create assessments in a standardized format. This integration ensures that assessments can be easily shared and used across different educational platforms.

 

Resources for Implementation

IMS Global Learning Consortium

The official website of the IMS Global Learning Consortium provides comprehensive documentation, specifications, and updates related to QTI. Educators and developers can access valuable resources to understand and implement QTI effectively.

QTI-Compatible Platforms and Tools

Many learning platforms and assessment tools explicitly support these specifications. Exploring and adopting compatible solutions simplifies the implementation process and ensures a seamless experience for both educators and students.  Our FastTest platform provides support for QTI.

Community Forums and Support Groups

Engaging with the educational technology community through forums and support groups allows users to share experiences, seek advice, and stay updated on best practices for QTI implementation.  See this thread in Moodle forums, for example.

Wikipedia

Wikipedia has an overview of the topic.

 

Conclusion

In a world where educational technology is advancing rapidly, the Question and Test Interoperability specification stands out as a crucial standard for achieving interoperability in assessment tools, fostering a more efficient, accessible, and adaptable educational environment. By understanding what QTI is, how to use it, exploring real-world examples, and tapping into valuable resources, educators can navigate the educational landscape more effectively, ensuring a streamlined and consistent e-Assessment experience for students and instructors alike.


Classical Test Theory (CTT) is a psychometric approach to analyzing, improving, scoring, and validating assessments.  It is based on relatively simple concepts, such as averages, proportions, and correlations.  One of the most frequently used aspects is item statistics, which provide insight into how an individual test question is performing.  Is it too easy, too hard, too confusing, miskeyed, or potentially another issue?  Item statistics are what tell you these things.

What are classical test theory item statistics?

They are indices of how a test item, or components of it, is performing.  Items can be hard vs easy, strong vs weak, and other important aspects.  Below is the output from the  Iteman  report in our  FastTest  online assessment platform, showing an English vocabulary item with real student data.  How do we interpret this?

FastTest Iteman Psychometric Analysis

Interpreting Classical Test Theory Item Statistics: Item Difficulty

The P value (Multiple Choice)

The P value is the classical test theory index of difficulty, and is the proportion of examinees that answered an item correctly (or in the keyed direction). It ranges from 0.0 to 1.0. A high value means that the item is easy, and a low value means that the item is difficult.  There are no hard and fast rules because interpretation can vary widely for different situations.  For example, a test given at the beginning of the school year would be expected to have low statistics since the students have not yet been taught the material.  On the other hand, a professional certification exam, where someone cannot even sit unless they have 3 years of experience and a relevant degree, might have all items appear easy even though they are quite advanced topics!  Here are some general guidelines:

    0.95-1.0 = Too easy (not doing much good to differentiate examinees, which is really the purpose of assessment)

    0.60-0.95 = Typical

    0.40-0.60 = Hard

    <0.40 = Too hard (consider that a 4 option multiple choice has a 25% chance of pure guessing)

With Iteman, you can set bounds to automatically flag items.  The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, which means that 50% of the examinees have answered the item correctly.

For a test where we expect examinees to perform poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing; if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be 0.20.  The maximum P value represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees.

In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.
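For illustration, here is a minimal Python sketch that computes P values from a scored 0/1 matrix and flags items against example bounds.  The data and bounds are made up; this is a simplified version of what software like Iteman does, not its actual algorithm.

    import numpy as np

    # Rows = examinees, columns = items; 1 = correct, 0 = incorrect (made-up data).
    scores = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 1, 0, 1],
        [1, 1, 0, 1],
    ])

    p_values = scores.mean(axis=0)           # proportion correct for each item

    min_p, max_p = 0.40, 0.95                # example flagging bounds
    for i, p in enumerate(p_values, start=1):
        flag = "TOO HARD" if p < min_p else "TOO EASY" if p > max_p else "OK"
        print(f"Item {i}: P = {p:.2f}  {flag}")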

The Item Mean (Polytomous)

This refers to an item that is scored with 2 or more point levels, like an essay scored on a 0-4 point rubric or a Likert-type item that is “Rate on a scale of 1 to 5.”

  • 1=Strongly Disagree
  • 2=Disagree
  • 3=Neutral
  • 4=Agree
  • 5=Strongly Agree

The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.

You will have to adjust for your own situation, but here is an example for the 5-point Likert-style item.

1-2 is very low; people disagree fairly strongly on average

2-3 is low to neutral; people tend to disagree on average

3-4 is neutral to high; people tend to agree on average

4-5 is very high; people agree fairly strongly on average

Iteman also provides flagging bounds for this statistic.  The minimum item mean bound represents what you consider the cut point for the item mean being too low.  The maximum item mean bound represents what you consider the cut point for the item mean being too high.

The number of categories for the items must be considered when setting the bounds of the minimum/maximum values. This is important as all items of a certain type (e.g., 3-category) might be flagged.

Interpreting Classical Test Theory Item Statistics: Item Discrimination

Multiple-Choice Items

The Pearson point-biserial correlation (r-pbis) is a classical test theory measure of the discrimination, or differentiating strength, of the item. It ranges from −1.0 to 1.0 and is a correlation of item scores and total raw scores.  If you consider a scored data matrix (multiple-choice items converted to 0/1 data), this would be the correlation between the item column and a column that is the sum of all item columns for each row (a person’s score).
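Here is a minimal Python sketch of that calculation on a made-up scored matrix.  Note that real software typically also reports a corrected version that excludes the item itself from the total score.

    import numpy as np

    # Scored 0/1 matrix: rows = examinees, columns = items (made-up data).
    scores = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 0, 1],
    ])

    total = scores.sum(axis=1)               # each person's raw score

    for i in range(scores.shape[1]):
        # Pearson correlation between the item column and the total score column.
        r_pbis = np.corrcoef(scores[:, i], total)[0, 1]
        print(f"Item {i + 1}: point-biserial = {r_pbis:.2f}")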

A good item is able to differentiate between examinees of high and low ability, and will therefore have a higher point-biserial, but rarely above 0.50. A negative point-biserial is indicative of a very poor item because it means that the high-ability examinees are answering incorrectly, while the low-ability examinees are answering it correctly, which of course would be bizarre, and therefore typically indicates that the specified correct answer is actually wrong. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random “noise.”  Here are some general guidelines on interpretation.  Note that these assume a decent sample size; if you only have a small number of examinees, many item statistics will be flagged!

0.20+ = Good item; smarter examinees tend to get the item correct

0.10-0.20 = OK item; but probably review it

0.0-0.10 = Marginal item quality; should probably be revised or replaced

<0.0 = Terrible item; replace it

***Major red flag is if the correct answer has a negative Rpbis and a distractor has a positive Rpbis

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.

The biserial correlation is also a measure of the discrimination, or differentiating strength, of the item. It ranges from −1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item was a continuous measure of the trait. Since the biserial is an estimate of Pearson’s r, it will be larger in absolute magnitude than the corresponding point-biserial.

The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).

Polytomous Items

The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from −1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the r-pbis are equivalent for a 2-category item, so guidelines for interpretation remain unchanged.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Since the typical r correlation (0.5) will be larger than the typical rpbis (0.3) correlation, you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.

The eta coefficient is an additional index of discrimination computed using an analysis of variance with the item response as the independent variable and total score as the dependent variable. The eta coefficient is the square root of the ratio of the between-groups sum of squares to the total sum of squares and has a range of 0 to 1. The eta coefficient does not assume that the item responses are continuous and also does not assume a linear relationship between the item response and total score.

As a result, the eta coefficient will always be equal to or greater than Pearson’s r. Note that the biserial correlation will be reported if the item has only 2 categories.

Classical Test Theory vs. Item Response Theory

Classical Test Theory and Item Response Theory (CTT & IRT) are the two primary psychometric paradigms.  That is, they are mathematical approaches to how tests are analyzed and scored.  They differ quite substantially in substance and complexity, even though they both nominally do the same thing, which is statistically analyze test data to ensure reliability and validity.  CTT is quite simple, easily understood, and works with small samples, but IRT is far more powerful and effective, so it is used by most big exams in the world.

So how are they different, and how can you effectively choose the right solution?  First, let’s start by defining the two.  This is just a brief intro; there are entire books dedicated to the details!

Classical Test Theory

CTT is an approach that is based on simple mathematics; primarily averages, proportions, and correlations.  It is more than 100 years old, but is still used quite often, with good reason. In addition to working with small sample sizes, it is very simple and easy to understand, which makes it useful for working directly with content experts to evaluate, diagnose, and improve items or tests.

Download free version of Iteman for CTT Analysis

 


 

Item Response Theory

IRT is a much more complex approach to analyzing tests. Moreover, it is not just for analyzing; it is a complete psychometric paradigm that changes how item banks are developed, test forms are designed, tests are delivered (adaptive or linear-on-the-fly), and scores produced. There are many benefits to this approach that justify the complexity, and there is good reason that all major examinations in the world utilize IRT.  Learn more about IRT here.

 

Download free version of Xcalibre for IRT Analysis

 

Similarities between Classical Test Theory and Item Response Theory

CTT & IRT are both foundational frameworks in psychometrics aimed at improving the reliability and validity of psychological assessments. Both methodologies involve item analysis to evaluate and refine test items, ensuring they effectively measure the intended constructs. Additionally, IRT and CTT emphasize the importance of test standardization and norm-referencing, which facilitate consistent administration and meaningful score interpretation. Despite differing in specific techniques, both frameworks ultimately strive to produce accurate and consistent measurement tools. These shared goals highlight the complementary nature of IRT and CTT in advancing psychological testing.

Differences between Classical Test Theory and Item Response Theory

Test-Level and Subscore-Level Analysis

CTT statistics for total scores and subscores include coefficient alpha reliability, standard error of measurement (a function of reliability and SD), descriptive statistics (average, SD…), and roll-ups of item statistics (e.g., mean Rpbis).

With IRT, we utilize the same descriptive statistics, but the scores are now different (theta, not number-correct).  The standard error of measurement is now a conditional function, not a single number. The entire concept of reliability is dropped and replaced with the concept of precision, which is also expressed as that same conditional function.

Item-Level Analysis

Item statistics for CTT include proportion-correct (difficulty), point-biserial (Rpbis) correlation (discrimination), and a distractor/answer analysis. If there is demographic information, CTT analysis can also provide a simple evaluation of differential item functioning (DIF).

IRT replaces the difficulty and discrimination with its own quantifications, called simply b and a.  In addition, it can add a c parameter for guessing effects. More importantly, it creates entirely new classes of statistics for partial credit or rating scale items.

Scoring

CTT scores tests with traditional scoring: number-correct, proportion-correct, or sum-of-points.  CTT interprets test scores based on the total number of correct responses, assuming all items contribute equally.  IRT scores examinees directly on a latent scale, which psychometricians call theta, allowing for more nuanced and precise ability estimates.
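As a rough sketch of this difference, the snippet below compares a number-correct score with a maximum-likelihood theta estimate under the 2PL, using hypothetical item parameters and a simple grid search (real IRT software uses more refined estimation methods).

    import numpy as np

    # Hypothetical 2PL item parameters: discrimination (a) and difficulty (b).
    a = np.array([1.0, 0.8, 1.5, 1.2])
    b = np.array([-1.0, 0.0, 0.5, 1.5])
    responses = np.array([1, 1, 0, 0])       # one examinee's scored responses

    # CTT scoring: simple number-correct.
    print("Number-correct score:", responses.sum())

    # IRT scoring: maximum-likelihood theta via a coarse grid search.
    thetas = np.linspace(-4, 4, 801)
    p = 1 / (1 + np.exp(-a * (thetas[:, None] - b)))     # P(correct) at each theta
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    print("ML theta estimate:", thetas[np.argmax(log_lik)])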

Linking and Equating

Linking and equating is a statistical analysis to determine comparable scores on different forms; e.g., Form A is “two points easier” than Form B and therefore a 72 on Form A is comparable to a 70 on Form B. CTT has several methods for this, including the Tucker and Levine methods, but there are methodological issues with these approaches. These issues, and other issues with CTT, eventually led to the development of IRT in the 1960s and 1970s.

IRT has methods to accomplish linking and equating which are much more powerful than CTT, including anchor-item calibration or conversion methods like Stocking-Lord. There are other advantages as well.

Vertical Scaling

One major advantage of IRT, as a corollary to the strong linking/equating, is that we can link/equate not just across multiple forms in one grade, but from grade to grade. This produces a vertical scale. A vertical scale can span across multiple grades, making it much easier to track student growth, or to measure students that are off-grade in their performance (e.g., 7th grader that is at a 5th grade level). A vertical scale is a substantial investment, but is extremely powerful for K-12 assessments.

Sample Sizes

Classical test theory can work effectively with 50 examinees, and provide useful results with as few as 20.  Depending on the IRT model you select (there are many), the minimum sample size can be 100 to 1,000.

Sample- and Test-Dependence

CTT analyses are sample-dependent and test-dependent, which means that such analyses are performed on a single test form and set of students. It is possible to combine data across multiple test forms to create a sparse matrix, but this has a detrimental effect on some of the statistics (especially alpha), even if the test is of high quality, and the results will not reflect reality.

For example, if Grade 7 Math has 3 forms (beginning, middle, end of year), it is conceivable to combine them into one “super-matrix” and analyze together. The same is true if there are 3 forms given at the same time, and each student randomly receives one of the forms. In that case, 2/3 of the matrix would be empty, which psychometricians call sparse.

Distractor Analysis

Classical test theory will analyze the distractors of a multiple choice item.  IRT models, except for the rarely-used Nominal Response Model, do not.  So even if you primarily use IRT, psychometricians will also use CTT for this.

Guessing


IRT has a parameter to account for guessing, though some psychometricians argue against its use.  CTT has no effective way to account for guessing.

Adaptive Testing

There are rare cases where adaptive testing (personalized assessment) can be done with classical test theory.  However, it pretty much requires the use of item response theory for one important reason: IRT puts people and items onto the same latent scale.

Linear Test Design

CTT and IRT differ in how test forms are designed and built.  CTT works best when there are lots of items of middle difficulty, as this maximizes the coefficient alpha reliability.  However, there are definitely situations where the purpose of the assessment is otherwise.  IRT provides stronger methods for designing such tests, and then scoring as well.

So… How to Choose?

There is no single best answer to the question of CTT vs. IRT.  You need to evaluate the aspects listed above, and in some cases other aspects (e.g., financial, or whether you have staff available with the expertise in the first place).  In many cases, BOTH are necessary.  This is especially true because IRT does not provide an effective and easy-to-understand distractor analysis that you can use to discuss with subject matter experts.  It is for this reason that IRT software will typically produce CTT analysis too, though the reverse is not true.

IRT is very powerful, and can provide additional information about tests if used just for analyzing results to evaluate item and test performance.  A researcher might choose IRT over CTT for its ability to provide detailed item-level data, handle varying item characteristics, and improve the precision of ability estimates.  IRT’s flexibility and advanced modeling capabilities make it suitable for complex assessments and adaptive testing scenarios.

However, IRT is really only useful if you are going to make it your psychometric paradigm, thereby using it in the list of activities above, especially IRT scoring of examinees. Otherwise, IRT analysis is merely another way of looking at test and item performance that will correlate substantially with CTT.

Contact Us To Talk With An Expert

 


Coefficient alpha reliability, sometimes called Cronbach’s alpha, is a statistical index that is used to evaluate the internal consistency or reliability of an assessment. That is, it quantifies how consistent we can expect scores to be, by analyzing the item statistics. A high value indicates that the test is of high reliability, and a low value indicates low reliability. This is one of the most fundamental concepts in psychometrics, and alpha is arguably the most common index. You may also be interested in reading about its competitor, the Split Half Reliability Index.

What is coefficient alpha, aka Cronbach’s alpha?

The classic reference to alpha is Cronbach (1951). He defines it as:

   alpha = [k / (k - 1)] * [1 - (sum of sigma-i) / sigma-X]

where k is the number of items, sigma-i is variance of item i, and sigma-X is total score variance.

Kuder-Richardson 20

While Cronbach tends to get the credit, to the point that the index is often called “Cronbach’s Alpha,” he really did not invent it. Kuder and Richardson (1937) suggested the following equation to estimate the reliability of a test with dichotomous (right/wrong) items.

   KR-20 = [k / (k - 1)] * [1 - (sum of p-i * q-i) / sigma-X]

where p-i is the proportion of examinees answering item i correctly and q-i = 1 - p-i.

Note that it is the same as Cronbach’s equation, except that Cronbach replaced the binomial variance pq with the more general notation of variance (sigma). This just means that you can use Cronbach’s equation on polytomous data such as Likert rating scales. In the case of dichotomous data such as multiple choice items, Cronbach’s alpha and KR-20 are exactly the same.

Additionally, Cyril Hoyt defined reliability in an equivalent approach using ANOVA in 1941, a decade before Cronbach’s paper.

How to interpret coefficient alpha

In general, alpha will range from 0.0 (random number generator) to 1.0 (perfect measurement). However, in rare cases, it can go below 0.0, such as if the test is very short or if there is a lot of missing data (sparse matrix). This, in fact, is one of the reasons NOT to use alpha in some cases. If you are dealing with linear-on-the-fly tests (LOFT), computerized adaptive tests (CAT), or a set of overlapping linear forms for equating (non-equivalent anchor test, or NEAT design), then you will likely have a large proportion of sparseness in the data matrix and alpha will be very low or negative. In such cases, item response theory provides a much more effective way of evaluating the test.

What is “perfect measurement?”  Well, imagine using a ruler to measure a piece of paper.  If it is American-sized, that piece of paper is always going to be 8.5 inches wide, no matter how many times you measure it with the ruler.  A bathroom scale is slightly less reliable; you might step on it, see 190.2 pounds, then step off and on again, and see 190.4 pounds.  This is a good example of how we often accept unreliability in measurement.

Of course, we never have this level of accuracy in the world of psychoeducational measurement.  Even a well-made test is something where a student might get 92% today and 89% tomorrow (assuming we could wipe their brain of memory of the exact questions).

Reliability can also be interpreted as the ratio of true score variance to total score variance. That is, all test score distributions have a total variance, which consists of variance due to the construct of interest (i.e., smart students do well and poor students do poorly), but also some error variance (random error, kids not paying attention to a question, a second dimension in the test… it could be many things).

What is a good value of coefficient alpha?

As psychometricians love to say, “it depends.” The rule of thumb that you generally hear is that a value of 0.70 is good and below 0.70 is bad, but that is terrible advice. A higher value indeed indicates higher reliability, but you don’t always need high reliability. A test to certify surgeons, of course, deserves all the items it needs to make it quite reliable. Anything below 0.90 would be horrible. However, the survey you take from a car dealership will likely have the statistical results analyzed, and a reliability of 0.60 isn’t going to be the end of the world; it will still provide much better information than not doing a survey at all!

Here’s a general depiction of how to evaluate levels of coefficient alpha.

Coefficient (Cronbach's) alpha interpretation

Using alpha: the classical standard error of measurement

Coefficient alpha is also often used to calculate the classical standard error of measurement (SEM), which provides a related method of interpreting the quality of a test and the precision of its scores. The SEM can be interpreted as the standard deviation of scores that you would expect if a person took the test many times, with their brain wiped clean of the memory each time. If the test is reliable, you’d expect them to get almost the same score each time, meaning that SEM would be small.

   SEM=SD*sqrt(1-r)

Note that SEM is a direct function of alpha, so that if alpha is 0.99, SEM will be small, and if alpha is 0.1, then SEM will be very large.

Coefficient alpha and unidimensionality

It can also be interpreted as a measure of unidimensionality. If all items are measuring the same construct, then scores on them will align, and the value of alpha will be high. If there are multiple constructs, alpha will be reduced, even if the items are still high quality. For example, if you were to analyze data from a Big Five personality assessment with all five domains at once, alpha would be quite low. Yet if you took the same data and calculated alpha separately on each domain, it would likely be quite high.

How to calculate the index

Because the calculation of coefficient alpha reliability is so simple, it can be done quite easily if you need to calculate it from scratch, such as using formulas in Microsoft Excel. However, any decent assessment platform or psychometric software will produce it for you as a matter of course. It is one of the most important statistics in psychometrics.
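Because the calculation is so simple, here is a minimal Python sketch that computes alpha from a scored data matrix, along with the classical SEM using the formula given above.  The data are made up; the same code works for polytomous scores such as Likert ratings.

    import numpy as np

    # Rows = examinees, columns = items (made-up 0/1 data; Likert scores work the same way).
    scores = np.array([
        [1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 0],
        [1, 1, 0, 1, 1],
        [0, 0, 0, 1, 0],
    ])

    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_scores = scores.sum(axis=1)

    alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_scores.var(ddof=1))
    sem = total_scores.std(ddof=1) * np.sqrt(1 - alpha)

    print(f"Coefficient alpha: {alpha:.3f}")
    print(f"Classical SEM:     {sem:.3f}")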

Cautions on overuse

Because alpha is just so convenient – boiling down the complex concept of test quality and accuracy to a single easy-to-read number – it is overused and over-relied upon. There are papers out in the literature that describe the cautions in detail; here is a classic reference.

One important consideration is the over-simplification of precision with coefficient alpha, and the classical standard error of measurement, when juxtaposed to the concept of conditional standard error of measurement from item response theory. This refers to the fact that most traditional tests have a lot of items of middle difficulty, which maximizes alpha. This measures students of middle ability quite well. However, if there are no difficult items on a test, it will do nothing to differentiate amongst the top students. Therefore, that test would have a high overall alpha, but have virtually no precision for the top students. In an extreme example, they’d all score 100%.

Also, alpha will completely fall apart when you calculate it on sparse matrices, because the total score variance is artifactually reduced.

Limitations of coefficient alpha

Cronbach’s alpha has several limitations. Firstly, it assumes that all items on a scale measure the same underlying construct and have equal variances, which is often not the case. Secondly, it is sensitive to the number of items on the scale; longer scales tend to produce higher alpha values, even if the additional items do not necessarily improve measurement quality. Thirdly, Cronbach’s alpha assumes that item errors are uncorrelated, an assumption that is frequently violated in practice. Lastly, it provides only a lower bound estimate of reliability, which means it can underestimate the true reliability of the test.

Summary

In conclusion, coefficient alpha is one of the most important statistics in psychometrics, and for good reason. It is quite useful in many cases, and easy enough to interpret that you can discuss it with test content developers and other non-psychometricians. However, there are cases where you should be cautious about its use, and some cases where it completely falls apart. In those situations, item response theory is highly recommended.

Juggling-statistics

What is the difference between the terms dichotomous and polytomous in psychometrics?  Well, these terms represent two subcategories within item response theory (IRT), which is the dominant psychometric paradigm for constructing, scoring, and analyzing assessments.  Virtually all large-scale assessments utilize IRT because of its well-documented advantages.  In many cases, however, it is referred to as a single way of analyzing data.  But IRT is actually a fast-growing family of models, each requiring rigorous item fit analysis to ensure that each question functions appropriately within the model.   The models operate quite differently based on whether the test questions are scored right/wrong or yes/no (dichotomous) vs. complex items like an essay that might be scored on a rubric of 0 to 6 points (polytomous).  This post will describe the differences and when to use one or the other.

 

Ready to use IRT?  Download Xcalibre for free

 

Dichotomous IRT Models

Dichotomous IRT models are those with two possible item scores.  Note that I say “item scores” and not “item responses” – the most common example of a dichotomous item is multiple choice, which typically has 4 to 5 options, but only two possible scores (correct/incorrect).  

True/False or Yes/No items are also obvious examples and are more likely to appear in surveys or inventories, as opposed to the ubiquity of the multiple-choice item in achievement/aptitude testing. Other item types that can be dichotomous are Scored Short Answer and Multiple Response (all or nothing scoring).  

What models are dichotomous?

The three most common dichotomous models are the 1PL/Rasch, the 2PL, and the 3PL.  Which one to use depends on the type of data you have, as well as your doctrine of course.  A great example is Scored Short Answer items: there should be no effect of guessing on such an item, so the 2PL is a logical choice.  Here is a broad overgeneralization:

  • 1PL/Rasch: Uses only the difficulty (b) parameter and does not take into account guessing effects or the possibility that some items might be more discriminating than others; however, it can be useful with small samples and in other situations
  • 2PL: Uses difficulty (b) and discrimination (a) parameters, but no guessing (c); relevant for the many types of assessment where there is no guessing
  • 3PL: Uses all three parameters, typically relevant for achievement/aptitude testing.

What do dichotomous models look like?

Dichotomous models, graphically, will have one S-shaped curve with a positive slope, as seen here.  This reflects that the probability of responding in the keyed direction increases with higher levels of the trait or ability.

item response function

Technically, there is also a line for the probability of an incorrect response, which goes down, but this is obviously the 1-P complement, so it is rarely drawn in graphs.  It is, however, used in scoring algorithms (check out this white paper).

In the example, a student with theta = -3 has about a 0.28 chance of responding correctly, while theta = 0 has about 0.60 and theta = 1 has about 0.90.
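Here is a minimal sketch of a dichotomous item response function in Python, using a 3PL with hypothetical parameters (not the exact item plotted above).

    import numpy as np

    def irf_3pl(theta, a=1.0, b=0.0, c=0.25):
        # Three-parameter logistic model: probability of a correct response,
        # given ability theta, discrimination a, difficulty b, and guessing c.
        return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

    # Probability of a correct response at a few ability levels.
    for theta in (-3, 0, 1):
        print(f"theta = {theta:+d}:  P(correct) = {irf_3pl(theta):.2f}")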

Polytomous IRT Models

Polytomous models are for items that have more than two possible scores.  The most common examples are Likert-type items (Rate on a scale of 1 to 5) and partial credit items (score on an Essay might be 0 to 5 points). IRT models typically assume that the item scores are integers.

What models are polytomous?

Unsurprisingly, the most common polytomous models use names like rating scale and partial credit.

  • Rating Scale Model (Andrich, 1978)
  • Partial Credit Model (Masters, 1982)
  • Generalized Rating Scale Model (Muraki, 1990)
  • Generalized Partial Credit Model (Muraki, 1992)
  • Graded Response Model (Samejima, 1969)
  • Nominal Response Model (Bock, 1972)

What do polytomous models look like?

Polytomous models have a line that dictates each possible response.  The line for the highest point value is typically S-shaped like a dichotomous curve.  The line for the lowest point value is typically sloped down like the 1-P dichotomous curve.  Point values in the middle typically have a bell-shaped curve. The example is for an Essay that scored 0 to 5 points.  Only students with theta >2 are likely to get the full points (blue), while students 1<theta<2 are likely to receive 4 points (green).
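As an illustration of how such curves are generated, here is a minimal Python sketch of category probabilities under a generalized partial credit model, with hypothetical parameters for an essay scored 0 to 5 (not the parameters behind the example described above).

    import numpy as np

    def gpcm_probs(theta, a, thresholds):
        # Generalized partial credit model: probability of each score category
        # (0..m) given theta, discrimination a, and step thresholds b_1..b_m.
        steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(thresholds)))))
        numerators = np.exp(steps)
        return numerators / numerators.sum()

    a = 1.0
    thresholds = [-1.5, -0.5, 0.5, 1.0, 2.0]   # hypothetical step difficulties

    for theta in (-2, 0, 2):
        probs = gpcm_probs(theta, a, thresholds)
        print(f"theta = {theta:+d}: " + "  ".join(f"P({k})={p:.2f}" for k, p in enumerate(probs)))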

I’ve seen “polychotomous.”  What does that mean?

It means the same as polytomous.  

How is IRT used in our platform?

We use it to support the test development cycle, including form assembly, scoring, and adaptive testing.  You can learn more on this page.

How can I analyze my tests with IRT?

You need specially designed software, like  Xcalibre.  Classical test theory is so simple that you can do it with Excel functions.

Recommended Readings

Item Response Theory for Psychologists by Embretson and Reise (2000).


Meta-analysis is a research process of collating data from multiple independent but similar scientific studies in order to identify common trends and findings by means of statistical methods. To put it simply, it is a method where you can accumulate all of your research findings and analyze them statistically. It is often used in psychometrics and industrial-organizational psychology to help validate assessments. Meta-analysis not only serves as a summary of a research question but also provides a quantitative evaluation of the relationship between two variables or the effectiveness of an experiment. It can also work for examining theoretical assumptions that compete with each other.

Background of Meta-Analysis

An American statistician and researcher, Gene Glass, devised the term ‘meta-analysis’ in 1976, using it to describe the statistical analysis of a large collection of results from individual studies in order to integrate the findings. Medical researchers began employing meta-analysis a few years later. One of the first influential applications of this method was when Elwood and Cochrane used meta-analysis to examine the effect of aspirin on reducing recurrences of heart attacks.


Purpose of Meta-Analysis

In general, meta-analysis is aimed at two things:

  • to establish whether a study has an effect and to determine whether it is positive or negative,
  • to analyze the results of previously conducted studies to find out common trends.

 

Performing Meta-Analysis

Even though there could be various ways of conducting meta-analysis depending on the research purpose and field, there are eight major steps:

  1. Set a research question and propose a hypothesis
  2. Conduct a systematic review of the relevant studies
  3. Extract data from the studies to include in the meta-analysis, considering sample sizes and data variability measures for intervention and control groups (the control group is only observed, whilst the intervention group receives the experimental treatment)
  4. Calculate summary measures, called effect sizes (the difference in average values between intervention and control groups), and standardize estimates if necessary for making comparisons between the groups
  5. Choose a meta-analytical method: quantitative (traditional univariate meta-analysis, meta-regression, meta-analytic structural equation modeling) or qualitative
  6. Choose software depending on the complexity of the methods used and the dataset (e.g. templates for Microsoft Excel, Stata, SPSS, SAS, R, Comprehensive Meta-Analysis, RevMan), and code the effect sizes
  7. Do analyses by employing an appropriate model for comparing effect sizes using fixed effects (assumes that all observations share a common mean effect size) or random effects (assumes heterogeneity and allows for variation of the true effect sizes across observations); a simple fixed-effects example is sketched after this list
  8. Synthesize results and report them
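As a simple illustration of step 7, here is a minimal Python sketch of a fixed-effects (inverse-variance weighted) combination of effect sizes.  The study results are made up.

    import math

    # Made-up effect sizes and their sampling variances from four studies.
    effect_sizes = [0.42, 0.51, 0.38, 0.47]
    variances    = [0.010, 0.020, 0.015, 0.008]

    # Fixed-effects model: weight each study by the inverse of its variance.
    weights = [1 / v for v in variances]
    pooled = sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)
    se_pooled = math.sqrt(1 / sum(weights))

    print(f"Pooled effect size: {pooled:.3f}")
    print(f"Standard error:     {se_pooled:.3f}")
    print(f"95% CI: [{pooled - 1.96 * se_pooled:.3f}, {pooled + 1.96 * se_pooled:.3f}]")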

Prior to making any conclusions and reporting results, it would be helpful to use the checklist suggested by DeSimone et al. (2021) to ensure that all crucial aspects of the meta-analysis have been addressed in your study.

Meta-Analysis in Assessment & Psychometrics: Test Validation & Validity Generalization

Due to its versatility, meta-analysis is used in various fields of research, in particular as a test validation strategy in psychology and psychometrics. The most common situation where meta-analysis is applied is validating the use of tests in the workplace, in the fields of personnel psychology and pre-employment testing. The classic example of such an application is the work done by Schmidt and Hunter (1998), who analyzed 85 years of research on what best predicts job performance. This is one of the most important articles on that topic. It has recently been updated by Sackett et al. (2021) with slightly different results.


How is meta-analysis applied to such a situation?  Well, start by reconceptualizing a “sample” as a set of studies, not a set of people. So let’s say we find 100 studies that use pre-employment tests to select examinees by predicting job performance (obviously, there are far more). Because most studies use more than one test, there might be 77 that use a general cognitive ability test, 63 that use a conscientiousness assessment, 24 that use a situational judgment test, etc. We look at the correlation coefficients reported for those first 77 studies and find that the average is 0.51, while the average correlation for conscientiousness is 0.44 and for SJTs is 0.39. You can see how this is extremely useful in a practical sense, as a practitioner who might be tasked with selecting an assessment battery!

Meta-analysis studies will often go further and clean up the results, by tossing studies with poor methodology or skewed samples, and applying corrections for things like range restriction and unreliability. This enhances the validity of the overall results. To see such an example, visit the Sackett et al. (2021) article.

Such research has led to the concept of validity generalization. This suggests that if a test has been validated for many uses, or similar uses, you can consider it validated for your particular use without having to do a validation study. For example, if you are selecting clerical workers and you can see that there are literally hundreds of studies which show that numeracy or quantitative tests will predict job performance, there is no need for you to do ANOTHER study. If challenged, you can just point to the hundreds of studies already done. Obviously, this is a reasonable argument, but you should not take it too far, i.e., generalize too much.

Conclusion

As you might have understood so far, conducting a meta-analysis is not a piece of cake. However, it is very efficient when the researcher intends to evaluate effects in diverse participants, generate new hypotheses that set a precedent for future research studies, demonstrate statistical significance, or surmount the issue of a small sample size in research.

References

Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2021). Introduction to meta-analysis. John Wiley & Sons.

DeSimone, J. A., Brannick, M. T., O’Boyle, E. H., & Ryu, J. W. (2021). Recommendations for reviewing meta-analyses in organizational research. Organizational Research Methods, 24(4), 694-717.

Field, A. P., & Gillett, R. (2010). How to do a meta‐analysis. British Journal of Mathematical and Statistical Psychology, 63(3), 665-694.

Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5(10), 3-8.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Sage Publications.

Gurevitch, J., Koricheva, J., Nakagawa, S., & Stewart, G. (2018). Meta-analysis and the science of research synthesis. Nature, 555(7695), 175-182.

Hansen, C., Steinmetz, H., & Block, J. (2022). How to conduct a meta-analysis in eight steps: A practical guide. Management Review Quarterly, 72(1), 1-19.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Academic Press.

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Sage Publications.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Sage.

Peto, R., & Parish, S. (1980). Aspirin after myocardial infarction. Lancet, 1(8179), 1172-1173.

Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2021). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology.

Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262.

 


Test validation is the process of verifying, based on solid evidence, whether the specific requirements of the test development stages have been fulfilled. In particular, test validation is an ongoing process of developing an argument that a specific test, its score interpretation, or its use is valid. The interpretation and use of testing data should be validated in terms of content, substantive, structural, external, generalizability, and consequential aspects of construct validity (Messick, 1994). Validity is the status of an argument that can be positive or negative: positive evidence supports and negative evidence weakens the validity argument, accordingly. Validity cannot be absolute and can be judged only in degrees. Meta-analysis, a technique frequently employed in psychometrics, aggregates research findings across multiple studies to assess the overall validity and reliability of a test. By synthesizing data from diverse sources, meta-analysis provides a comprehensive evaluation of the test’s construct validity, supporting its use in educational and psychological assessments (AERA, APA, & NCME, 1999).

Validation as part of test development

To be effective, test development has to be structured, systematic, and detail-oriented. These features can guarantee sufficient validity evidence supporting the inferences made from the test scores obtained via the assessment. Downing (2006) suggested a twelve-step framework for effective test development:

  1. Overall plan
  2. Content definition
  3. Test blueprint
  4. Item development
  5. Test design and assembly
  6. Test production
  7. Test administration
  8. Scoring test responses
  9. Standard setting
  10. Reporting test results
  11. Item bank management
  12. Technical report

Even though this framework is outlined as a sequential timeline, in practice some of these steps may occur simultaneously or may be ordered differently. A starting point of the test development – the purpose – defines the planned test and regulates almost all validity-related activities. Each step of the test development process focuses on its crucial aspect – validation.

Hypothetically, excellent performance of all steps can ensure test validity, i.e., the produced test would estimate examinee ability fairly within the content area to be measured by the test. However, the human factor involved in test production might play a negative role, so there is an essential need for test validation.

Reasons for test validation

There are myriad possible reasons that can lead to the invalidation of test score interpretation or use. Let us consider some obvious issues that potentially jeopardize test validity and are subject to validation:

  • overall plan: wrong choice of a psychometric model;
  • content definition: content domain is ill defined;
  • test blueprint: test blueprint does not specify an exact sampling plan for the content domain;
  • item development: items measure content at an inappropriate cognitive level;
  • test design and assembly: unequal booklets;
  • test administration: cheating;
  • scoring test responses: inconsistent scoring among examiners;
  • standard setting: unsuitable method of establishing passing scores;
  • item bank management: inaccurate updating of item parameters.

 

Context for test validation

All tests draw on common types of validity evidence, e.g., reliability, comparability, equating, and item quality. However, tests vary in the number of constructs they measure (single or multiple) and in their purposes, which call for unique types of validation evidence. In general, there are several major types of tests:

  • Admissions tests (e.g., SAT, ACT, and GRE)
  • Credentialing tests (e.g., a live-patient examination for a dentist before licensing)
  • Large-scale achievement tests (e.g., Stanford Achievement Test, Iowa Test of Basic Skills, and TerraNova)
  • Pre-employment tests
  • Medical or psychological tests
  • Language tests

The main idea is that the type of test usually defines a unique validation agenda, focusing on the types of validity evidence and the issues that are most salient for that type of test.

Categorization of test validation studies

Since there are many ways test score interpretations can be invalidated, there are many categories of test validation studies that can be applied to validate test results. Here we will look at the categorization suggested by Haladyna (2006):

Category 1: Test Validation Studies Specific to a Testing Program

    1. Studies That Provide Validity Evidence in Support of the Claim for a Test Score Interpretation or Use
  • Content analysis
  • Item analysis
  • Standard setting
  • Equating
  • Reliability
    2. Studies That Address Threats to a Test Score Interpretation or Use
  • Cheating
  • Scoring errors
  • Student motivation
  • Unethical test preparation
  • Inappropriate test administration
    3. Studies That Address Other Problems That Threaten Test Score Interpretation or Use
  • Drop in reliability
  • Drift in item parameters over time
  • Redesign of a published test
  • Possible security problems

Category 2: Test Validation Studies That Apply to More Than One Testing Program

    Studies that lead to the establishment of concepts, principles, or procedures that guide, inform, or improve test development or scoring
  • Introducing a concept
  • Introducing a principle
  • Introducing a procedure
  • Studying a pervasive problem

Summary

Even though test development is a long and laborious process, test developers have to be meticulous in executing each activity. The culmination of this process is obtaining valid and reliable test scores, and interpreting and using them appropriately. The higher the stakes or consequences of the test scores, the greater the attention that should be paid to validity and, therefore, to test validation. Validation is strengthened by integrating all reliable sources of evidence into the argument for test score interpretation and use.

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. American Educational Research Association.

Downing, S. M. (2006). Twelve steps for effective test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 3-25). Lawrence Erlbaum Associates.

Haladyna, T. M. (2006). Roles and importance of validity studies in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 739-755). Lawrence Erlbaum Associates.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.

 

automated-item-generation

Automated item generation (AIG) is a paradigm for developing assessment items (test questions), utilizing principles of artificial intelligence and automation. As the name suggests, it tries to automate some or all of the effort involved with item authoring, as that is one of the most time-intensive aspects of assessment development – which is no news to anyone who has authored test questions!

What is Automated Item Generation?

Automated item generation involves the use of computer algorithms to create new test questions, or variations of them.  It can also be used for item review, or the generation of answers, or the generation of assets such as reading passages.  Items still need to be reviewed and edited by humans, but this still saves a massive amount of time in test development.

Why Use Automated Item Generation?

Items can cost up to $2000 each to develop, so even cutting the average cost in half could provide massive time and money savings to an organization.  ASC provides AIG functionality, with no limits, to anyone who signs up for a free item banking account in our platform, Assess.ai.

Types of Automated Item Generation

There are two types of automated item generation.  The Item Templates approach was developed before large language models (LLMs) were widely available.  The second approach is to use LLMs, which became widely available at the end of 2022.

Type 1: Item Templates

The first type is based on the concept of item templates to create a family of items using dynamic, insertable variables. There are three stages to this work. For more detail, read this article by Gierl, Lai, and Turner (2012).

  • Authors, or a team, create a cognitive model by isolating exactly what they are trying to assess and the different ways that the knowledge could be presented or evidenced. This might include information such as which variables are important vs. incidental, and what a correct answer should include.
  • They then develop templates for items based on this model, like the example you see below.
  • An algorithm then turns this template into a family of related items, often by producing all possible permutations (a minimal code sketch of this step follows below).
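
To make the template idea concrete, here is a minimal sketch in Python of a hypothetical item template (a simple dosage calculation, not taken from Gierl et al.) whose insertable variables are enumerated to produce a small item family.

```python
from itertools import product

# Minimal sketch of template-based item generation: a stem with insertable
# variables, plus the value sets each variable can take. The dosage scenario
# and numbers are hypothetical, purely for illustration.

stem = ("A patient weighing {weight} kg is prescribed a drug at "
        "{dose} mg per kg of body weight. How many mg should be administered?")

variables = {
    "weight": [50, 70, 90],
    "dose": [2, 5, 10],
}

# Enumerate every combination of variable values (the "item family").
items = []
for weight, dose in product(variables["weight"], variables["dose"]):
    items.append({
        "stem": stem.format(weight=weight, dose=dose),
        "key": weight * dose,   # correct answer computed from the cognitive model
    })

print(f"Generated {len(items)} items from one template")
print(items[0]["stem"], "->", items[0]["key"])
```

In practice, the cognitive model would also drive distractor generation (e.g., common miscalculations), and, as noted below, SMEs would still review each permutation.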

Obviously, you can’t use more than one item from the same family on a given test form. And in some cases, some of the permutations will describe an unlikely scenario or be completely irrelevant. But the savings can still be quite real. I saw a conference presentation by Andre de Champlain from the Medical Council of Canada stating that overall efficiency improved sixfold, and that the generated items were of higher quality than traditionally written items because the process made the authors think more deeply about what they were assessing and how. He also recommended that template permutations not be moved automatically into the item bank; instead, each should be reviewed by SMEs, for reasons such as those stated above.

You might think, “Hey, that’s not really AI…” But AI is, broadly, software doing things that humans used to do, and the definition gets pushed further every year. Remember, AI used to be just having the Atari be able to play Pong with you!

[Figure: example item template (AIG-CPR)]

Type 2: AI Generation or Processing of Source Text

The second type is what the phrase “automated item generation” more likely brings to mind: upload a textbook or similar source to some software, and it spits back drafts of test questions. For example, see this article by von Davier (2019). Or alternatively, simply state a topic as a prompt and the AI will generate test questions.

Until the release of ChatGPT and other publicly available platforms implementing large language models (LLMs), this approach was only available to experts at large organizations.  Now, it is available to everyone with an internet connection.  If you use such products directly, you can provide a prompt such as “Write me 10 exam questions on glaucoma, in a 4-option multiple choice format” and it will do so.  You can also make the prompt more specific, for example by asking for the output in your preferred format, such as QTI or JSON.
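
For illustration, here is a minimal sketch of sending such a prompt programmatically, assuming the OpenAI Python client; the model name, prompt wording, and requested JSON fields are illustrative choices, not a recommended recipe.

```python
# Minimal sketch: asking an LLM to draft multiple-choice items as JSON.
# Assumes the OpenAI Python client (pip install openai) and an API key in
# the OPENAI_API_KEY environment variable; model name and prompt are
# illustrative assumptions, not recommendations.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Write 10 exam questions on glaucoma in a 4-option multiple-choice format. "
    "Return JSON: a list of objects with fields 'stem', 'options', and 'key'. "
    "Keep stems concise and clinically accurate."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)

draft_items = response.choices[0].message.content
print(draft_items)  # drafts still require SME review before use
```

Whatever tool is used, the output consists of drafts that still need SME review and editing before they enter the item bank.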

Alternatively, many assessment platforms now integrate with these products directly, so you can do the same thing but have the items appear in the item banker under New status, rather than in a raw file on your local computer that you then have to clean and upload.  FastTest has such functionality available.

This technology has completely revolutionized how we develop test questions.  I’ve seen several research presentations on this, and they all find that AIG produces more items, of quality as good as or even better than human-written items, in a fraction of the time!  But they have also found that prompt engineering is critical, and even one word – like including “concise” in your prompt – can affect the quality of the items.

[Screenshot: automated item generation in FastTest]

The Limitations of Automated Item Generation

Automated item generation (AIG) has revolutionized the way educational and psychological assessments are developed, offering increased efficiency and consistency. However, this technology comes with several limitations that can impact the quality and effectiveness of the items produced.

One significant limitation is the challenge of ensuring content validity. AIG relies heavily on algorithms and pre-defined templates, which may not capture the nuanced and comprehensive understanding of subject matter that human experts possess. This can result in items that are either too simplistic or fail to fully address the depth and breadth of the content domain.

Another limitation is the potential for over-reliance on statistical properties rather than pedagogical soundness. While AIG can generate items that meet certain psychometric criteria, such as difficulty and discrimination indices, these items may not always align with best practices in educational assessment or instructional design. This can lead to tests that are technically robust but lack relevance or meaningfulness to the learners.

Furthermore, the use of AIG can inadvertently introduce bias. Algorithms used in item generation are based on historical data and patterns, which may reflect existing biases in the data. Without careful oversight and adjustment, AIG can perpetuate or even exacerbate these biases, leading to unfair assessment outcomes for certain groups of test-takers.

Lastly, there is the issue of limited creativity and innovation. Automated systems generate items based on existing templates and rules, which can result in a lack of variety and originality in the items produced. This can make assessments predictable and less engaging for test-takers, potentially impacting their motivation and performance.

In conclusion, while automated item generation offers many benefits, it is crucial to address these limitations through continuous oversight, integration of expert input, and regular validation studies to ensure the development of high-quality assessment items.

How Can I Implement Automated Item Generation?

If you are a user of AI products like ChatGPT or Bard, you can work with them directly.  Advanced users can use the APIs to upload documents or fine-tune the underlying models.  The aforementioned article by von Davier discusses such usage.

If you want to save time, FastTest provides a direct ChatGPT integration, so you can provide the prompt using the screen shown above, and items will then be automatically created in the item banking folder you specify, with the item naming convention you specify, tagged as Status=New and ready for review.  Items can then be routed through our configurable Item Review Workflow process, including functionality to gather modified-Angoff ratings.

Ready to improve your test development process?  Click here to talk to a psychometric expert.

borderline method educational assessment

The borderline group method of standard setting is one of the most common approaches to establishing a cutscore for an exam.  In comparison with item-centered standard setting methods such as modified-Angoff, Nedelsky, and Ebel, there are two well-known examinee-centered methods (Jaeger, 1989): the contrasting groups method and the borderline group method (Livingston & Zieky, 1989). This post will focus on the latter.

The concept of the borderline group method

Examinee-centered methods require participants to judge whether individual examinees possess adequate knowledge, skills, and abilities across specific content standards. The borderline group method is based on the idea of setting the passing score equal to the score expected from an examinee whose competence sits on the borderline between adequate and inadequate.

How to perform the borderline group method

First, judges are selected from those who are thoroughly familiar with the content being examined and who know the knowledge, skills, and abilities of the individual examinees. Next, the judges discuss and develop a description of an examinee who is on the borderline between mastery and non-mastery. Alternatively, the judges may be asked to sort examinees into three categories: clearly competent, clearly incompetent, and those in between.

After the description is agreed upon, the borderline examinees are identified. The borderline group method then takes the distribution of those examinees’ test scores and finds its median (the 50th percentile), which becomes the recommended cutscore.

Why is the median used and not the mean, you might ask? The reason is that the median is much less affected by extremely high or extremely low values. This property is particularly important for the borderline group method, because an examinee with a very high or very low score probably does not really belong in the group, as the small example below shows.
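
Here is a quick sketch with made-up borderline scores: one misclassified examinee with an extreme score barely moves the median but pulls the mean noticeably.

```python
from statistics import mean, median

# Hypothetical test scores of examinees judged to be "borderline".
borderline_scores = [61, 63, 64, 65, 66, 68, 70]

# Suppose one clearly competent examinee (score 95) was misclassified
# as borderline and added to the group.
with_outlier = borderline_scores + [95]

print(median(borderline_scores), mean(borderline_scores))  # 65 and ~65.3
print(median(with_outlier), mean(with_outlier))            # 65.5 and ~69.0
```

In this toy example the median shifts by only half a point while the mean shifts by nearly four points, so the recommended cutscore stays close to 65 either way.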

Analyzing the borderline group method

Advantages of this method:

  • Time efficient
  • Straightforward to implement

Disadvantages of this method:

  • Difficult to achieve consensus on the nature of borderline examinees
  • The cutscore may have low validity if the number of borderline examinees is small

When could the borderline group method work poorly, and how can this be tackled?

  • Possible issue: The judges may identify some examinees as borderline by mistake (e.g., because their skills were difficult to judge), so the borderline group might contain examinees who do not belong in it. Probable solution: Remind the judges not to include in the borderline group any examinees whose competence they are unsure about.
  • Possible issue: The judges may base their judgements on something other than what the exam measures. Probable solution: Give the judges clear instructions and have them agree with each other when defining a borderline examinee.
  • Possible issue: Judges’ individual standards regarding examinees’ skills and abilities may differ greatly.

There is also a risk that judges will be prone to errors of central tendency and therefore assign a disproportionately large number of examinees to the borderline group if they do not know individual examinees’ performances well. Picking judges who know the examinees well is therefore key to implementing the borderline group method.

Conclusion

Let’s summarize the steps needed to implement the borderline group method:

  • Select the competent judges
  • Define the borderline level of examinees’ knowledge, skills, and abilities
  • Identify the borderline examinees
  • Obtain the test scores of the borderline examinees
  • Calculate the cutscore as the median of the distribution of the borderline examinees’ test scores

 

References

Jaeger, R. M. (1989). Certification of student competence.

Livingston, S. A., & Zieky, M. J. (1989). A comparative study of standard-setting methods. Applied Measurement in Education, 2(2), 121-141.