T scores

A T Score (sometimes hyphenated T-Score) is a common example of a scaled score in psychometrics and assessment.  A scaled score is simply a way to present scores in a more meaningful and easier-to-digest context, with the benefit of hiding the sometimes obtuse technicalities of psychometrics.  In short, a T Score is a standardized way of presenting scores so that they are easier to understand.

What is a T Score?

A T score is a conversion of the standard normal distribution, aka Bell Curve.  The normal distribution places observations (of anything, not just test scores) on a scale that has a mean of 0.00 and a standard deviation of 1.00.  We simply convert this to have a mean of 50 and standard deviation of 10.  Doing so has two immediate benefits to most consumers:

  1. There are no negative scores; people generally do not like to receive a negative score!
  2. Scores are round numbers that generally range from 0 to 100, depending on whether 3, 4, or 5 standard deviations is used as the bound (usually 20 to 80); this roughly fits what most people expect from their school days, even though the meaning of the numbers is entirely different.

The image below shows the normal distribution, labeled with the different scales for interpretation.

T score vs z score vs percentile

How to interpret a T score?

As you can see above, a T Score of 40 means that you are at approximately the 16th percentile.  This is a low score, obviously, but a student will feel better about it than about a score of -1.  It is for the same reason that many educational assessments use other scaled scores.  The SAT has a scale of mean=500 and SD=100 (T score x 10), so a score of 400 again means that you are at z=-1, or the 16th percentile.

A 70 means that you are at approximately the 98th percentile – so it is actually quite high, though students who are used to receiving scores in the 90s will feel like it is low!

Since there is a 1-to-1 mapping of T Score to the other rows, you can see that it does not actually provide any new information.  It is simply a conversion to round, positive numbers that is easier to digest and less likely to upset someone who is unfamiliar with psychometrics.  My undergraduate professor who introduced me to psychometrics used the term “repackaging” to describe scaled scores.  If you take an object out of one box and put it in a different box, it looks different superficially, but the object itself and its meaning (e.g., weight) have not changed.

How do I calculate a T score?

Use this formula:

T = z*10 + 50

where z is the standard z-score on the normal distribution N(0,1).

Example of a T score

Suppose you have a z-score of -0.5.  If you put that into the formula, you get T = -0.5*10 + 50 = -5 + 50 = 45.  If you look at the graphic above, you can see how being half a standard deviation below the mean translates to a T score of 45.
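If you want to compute this yourself, here is a minimal Python sketch of the conversion (the function name is just for illustration):

```python
def z_to_t(z: float) -> float:
    """Convert a standard z-score to a T score (mean 50, SD 10)."""
    return z * 10 + 50

# Values from the examples above
print(z_to_t(-0.5))  # 45.0
print(z_to_t(-1.0))  # 40.0, approximately the 16th percentile
print(z_to_t(2.0))   # 70.0, approximately the 98th percentile
```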

Is a T Score like a t-test?

No.  Despite the similar name, it has nothing to do with the t-test.

How do I implement with an assessment?

If you are using off-the-shelf psychological assessments, they will likely produce a T Score for you in the results.  If you want to utilize it for your own assessments, you need a world-class assessment platform like  FastTest  that has strong functionality for scoring methods and scaled scoring.  An example of this is below.  Here, we are utilizing item response theory for the raw score.

As with all scaled scoring, it is a good idea to provide an explanation to your examinees and stakeholders.

Scaled scores in FastTest

Item Writing Tips

Item writing (aka item authoring) is a science as well as an art, and if you have done it, you know just how challenging it can be!  You are experts at what you do, and you want to make sure that your examinees are too.  But it’s hard to write questions that are clear, reliable, unbiased, and differentiate on the thing you are trying to assess.  Here are some tips.

What is Item Writing / Item Authoring ?

Item authoring is the process of creating test questions.  You have certainly seen “bad” test questions in your life, and know firsthand just how frustrating and confusing that can be.  Fortunately, there is a lot of research in the field of psychometrics on how to write good questions, and also how to have other experts review them to ensure quality.  It is best practice to make items go through a workflow, so that the test development process is similar to the software development process.

Because items are the building blocks of tests, the test items within your tests are likely the greatest threat to their overall validity and reliability.  Here are some important tips in item authoring.  Want deeper guidance?  Check out our Item Writing Guide.

Anatomy of an Item

First, let’s talk a little bit about the parts of a test question.  The diagram below shows a reading passage with two questions on it.  Here are some of the terms used:

  • Asset/Stimulus: This is a reading passage here, but could also be an audio, video, table, PDF, or other resource
  • Item: An overall test question, usually called an “item” rather than a “question” because sometimes they might be statements.
  • Stem: The part of the item that presents the situation or poses a question.
  • Options: All of the choices to answer.
  • Key: The correct answer.
  • Distractors: The incorrect answers.

Parts of a test item

 

Item writing tips: The Stem

To find out whether your test items are your allies or your enemies, read through your test and identify the items that contain the most prevalent item construction flaws.  The first three of the most prevalent construction flaws are located in the item stem (i.e. question).  Look to see if your item stems contain…

1) BIAS

Nowadays, we tend to think of bias as relating to culture or religion, but there are many more subtle types of biases that oftentimes sneak into your tests.  Consider the following questions to determine the extent of bias in your tests:

  • Are there acronyms in your test that are not considered industry standard?
  • Are you testing on policies and procedures that may vary from one location to another?
  • Are you using vocabulary that is more recognizable to a female examinee than a male?
  • Are you referencing objects that are not familiar to examinees from a newer or older generation?

2) NOT

We’ve all taken tests which ask a negatively worded question. These test items are often the product of item authoring by newbies, but they are devastating to the validity and reliability of your tests—particularly for fast test-takers or individuals with lower reading skills.  If the examinee misses that one single word, they will get the question wrong even if they actually know the material.  This test item ends up penalizing the wrong examinees!

3) EXCESS VERBIAGE

Long stems can be effective and essential in many situations, but they are also more prone to two specific item construction flaws.  If the stem is unnecessarily long, it can contribute to examinee fatigue.  Because each item requires more energy to read and understand, examinees tire sooner and may begin to perform more poorly later on in the test—regardless of their competence level.

Additionally, long stems often include information that can be used to answer other questions in the test.  This could lead your test to be an assessment of whose test-taking memory is best (i.e. “Oh yeah, #5 said XYZ, so the answer to #34 is XYZ.”) rather than who knows the material.

Item writing tips:  distractors / options

Unfortunately, item stems aren’t the only offenders.  Experienced test writers know that the distractors (i.e. options) are actually more difficult to write than the stems themselves.  When you review your test items, look to see if your item distractors contain…

4) IMPLAUSIBILITY

The purpose of a distractor is to pull less qualified examinees away from the correct answer by offering other options that look correct.  In order for them to “distract” an examinee from the correct answer, they have to be plausible.  The closer they are to being correct, the more difficult the exam will be.  If the distractors are obviously incorrect, even unqualified examinees won’t pick them.  Then your exam will not help you discriminate between examinees who know the material and examinees who do not, which is the entire goal.

5) 3-TO-1 SPLITS

You may recall watching Sesame Street as a child.  If so, you remember the song “One of these things…”  (Either way, enjoy refreshing your memory!)   Looking back, it seems really elementary, but sometimes our test item options are written in such a way that an examinee can play this simple game with your test.  Instead of knowing the material, they can look for the option that stands out as different from the others.  Consider the following questions to determine if one of your items falls into this category:

  • Is the correct answer significantly longer than the distractors?
  • Does the correct answer contain more detail than the distractors?
  • Is the grammatical structure different for the answer than for the distractors?

6) ALL OF THE ABOVE

There are a couple of problems with having this phrase (or the opposite “None of the above”) as an option.  For starters, good test takers know that this is—statistically speaking—usually the correct answer.  If it’s there and the examinee picks it, they have a better than 50% chance of getting the item right—even if they don’t know the content.  Also, if they are able to identify two options as correct, they can select “All of the above” without knowing whether or not the third option was correct.  These sorts of questions also get in the way of good item analysis.   Whether the examinee gets this item right or wrong, it’s harder to ascertain what knowledge they have because the correct answer is so broad.

This is helpful, can I learn more?

Want to learn more about item writing?  Here’s an instructional video from one of our PhD psychometricians.  You should also check out this book.

Item authoring is easier with an item banking system

The process of reading through your exams in search of these flaws in the item authoring is time-consuming (and oftentimes depressing), but it is an essential step towards developing an exam that is valid, reliable, and reflects well on your organization as a whole.  We also recommend that you look into getting a dedicated item banking platform, designed to help with this process.

Summary Checklist

 

| Issue | Recommendation |
|---|---|
| Key is invalid due to multiple correct answers. | Consider each answer option individually; the key should be fully correct with each distractor being fully incorrect. |
| Item was written in a hard-to-comprehend way; examinees were unable to apply their knowledge because of poor wording. | Ensure that the item can be understood after just one read-through. If you have to read the stem multiple times, it needs to be rewritten. |
| Grammar, spelling, or syntax errors direct savvy test takers toward the correct answer (or away from incorrect answers). | Read the stem, followed by each answer option, aloud. Each answer option should fit with the stem. |
| Information was introduced in the stem text that was not relevant to the question. | After writing each question, evaluate the content of the stem. It should be clear and concise without introducing irrelevant information. |
| Item emphasizes trivial facts. | Work off of a test blueprint to ensure that each of your items maps to a relevant construct. If you are using Bloom’s taxonomy or a similar approach, items should be from higher-order levels. |
| Numerical answer options overlap. | Carefully evaluate numerical ranges to ensure there is no overlap among options. |
| Examinees noticed the answer was most often A. | Distribute the key evenly among the answer options. This can be avoided with FastTest’s randomized delivery functionality. |
| Key was overly specific compared to distractors. | Answer options should all be about the same length and contain the same amount of information. |
| Key was the only option to include a key word from the item stem. | Avoid re-using key words from the stem text in your answer options. If you do use such words, distribute them evenly among all of the answer options so as not to call out individual options. |
| Rare exception can be argued to invalidate a true/false always/never question. | Avoid using “always” or “never”, as there can be unanticipated or rare scenarios. Opt for less absolute terms like “most often” or “rarely”. |
| Distractors were not plausible; key was obvious. | Review each answer option and ensure that it has some bearing in reality. Distractors should be plausible. |
| Idiom or jargon was used; non-native English speakers did not understand. | It is best to avoid figures of speech; keep the stem text and answer options literal to avoid introducing undue discrimination against certain groups. |
| Key was significantly longer than distractors. | There is a strong tendency to write a key that is very descriptive. Be wary of this and evaluate distractors to ensure that they are approximately the same length. |

Validity threats

Validity threats are issues with a test or assessment that hinder the interpretations and use of scores, such as cheating, inappropriate use of scores, unfair preparation, or non-standardized delivery.  It is important to establish a test security plan to define the threats relevant for you and address them.

Validity, in its modern conceptualization, refers to evidence that supports our intended interpretations of test scores (see Chapter 1 of APA/AERA/NCME Standards for full treatment).   The word “interpretation” is key because test scores can be interpreted in different ways, including ways that are not intended by the test designers.  For example, a test given at the end of Nursing school to prepare for a national licensure exam might be used by the school as a sort of Final Exam.  However, the test was not designed for this purpose and might not even be aligned with the school’s curriculum.  Another example is that certification tests are usually designed to demonstrate minimal competence, not differentiate amongst experts, so interpreting a high score as expertise might not be warranted.

Validity threats: Always be on the lookout!

Test sponsors, therefore, must be vigilant against any validity threats.  Some of these, like the two aforementioned examples, might be outside the scope of the organization.  While it is certainly worthwhile to address such issues, our primary focus is on aspects of the exam itself.

Which validity threats rise to the surface in psychometric forensics?

Here, we will discuss several threats to validity that typically present themselves in psychometric forensics, with a focus on security aspects.  However, I’m not just listing security threats here, as psychometric forensics is excellent at flagging other types of validity threats too.

| Threat | Description | Approach | Example | Indices |
|---|---|---|---|---|
| Collusion (copying) | Examinees are copying answers from one another, usually with a defined Source. | Error similarity (only looks at incorrect responses) | 2 examinees get the same 10 items wrong, and select the same distractor on each | B-B Ran, B-B Obs, K, K1, K2, S2 |
| | | Response similarity | 2 examinees give the same response on 98/100 items | S2, g2, ω, Zjk |
| Group level help/issues | Similar to collusion but at a group level; could be examinees working together, or receiving answers from a teacher/proctor.  Note that many examinees using the same brain dump would have a similar signature but across locations. | Group level statistics | Location has one of the highest mean scores but lowest mean times | Descriptive statistics such as mean score, mean time, and pass rate |
| | | Response or error similarity | On a certain group of items, the entire classroom gives the same answers | Roll-up analysis, such as mean collusion flags per group; also erasure analysis (paper only) |
| Pre-knowledge | Examinee comes in to take the test already knowing the items and answers, often purchased from a brain dump website. | Time-Score analysis | Examinee has a high score and very short time | RTE or total time vs. scores |
| | | Response or error similarity | Examinee has all the same responses as a known brain dump site | All indices |
| | | Pretest item comparison | Examinee gets 100% on existing items but 50% on new items | Pretest vs. scored results |
| | | Person fit | Examinee gets the 10 hardest items correct but performs below average on the rest of the items | Guttman indices, lz |
| Harvesting | Examinee is not actually taking the test, but is sitting it to memorize items so they can be sold afterwards, often at a brain dump website.  Similar signature to Sleepers, but more likely to occur on voluntary tests, or where high scores benefit examinees. | Time-Score analysis | Low score, high time, few attempts | RTE or total time vs. scores |
| | | Mean vs. median item time | Examinee “camps” on 10 items to memorize them; mean item time much higher than the median | Mean-median index |
| | | Option flagging | Examinee answers “C” to all items in the second half | Option proportions |
| Low motivation: Sleeper | Examinees are disengaged, producing data that is flagged as unusual and invalid; fortunately, not usually a security concern, but could be a policy concern.  Similar signature to Harvesters, but more likely to occur on mandatory tests, or where high scores do not benefit examinees. | Time-Score analysis | Low score, high time, few attempts | RTE or total time vs. scores |
| | | Item timeout rate | If you have item time limits, examinee hits them | Proportion of items that hit the limit |
| | | Person fit | Examinee attempts a few items, passes through the rest | Guttman indices, lz |
| Low motivation: Clicker | Examinees are disengaged, producing data that is flagged as unusual and invalid; fortunately, not usually a security concern, but could be a policy concern.  Similar idea to Sleeper, but the data is quite different. | Time-Score analysis | Examinee quickly clicks “A” to all items, finishing with a low time and low score | RTE, total time vs. scores |
| | | Option flagging | See above | Option proportions |
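To make the Time-Score analysis row concrete, here is a rough Python sketch that flags examinees with an unusually high score combined with an unusually short total time, which is one possible pre-knowledge signature.  The data, column names, and thresholds are made up for illustration; operational indices such as RTE are considerably more sophisticated.

```python
import pandas as pd

# Hypothetical results: one row per examinee, with total score and total time in minutes
results = pd.DataFrame({
    "examinee": ["A", "B", "C", "D", "E"],
    "score":    [95, 62, 74, 97, 70],
    "minutes":  [12, 55, 48, 14, 51],
})

# Standardize score and time within the group
z_score = (results["score"] - results["score"].mean()) / results["score"].std()
z_time = (results["minutes"] - results["minutes"].mean()) / results["minutes"].std()

# Flag high-score/low-time combinations (thresholds are arbitrary for this sketch)
results["preknowledge_flag"] = (z_score > 1.0) & (z_time < -1.0)
print(results)
```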

Item review is the process of putting newly written test questions through a rigorous peer review, to ensure that they are high quality and meet industry standards.

What is an item review workflow?

Developing a high-quality item bank is an extremely involved process, and authoring of the items is just the first step.  Items need to go through a defined workflow, with multiple people providing item review.  For example, you might require all items to be reviewed by another content expert, a psychometrician, an editor, and a bias reviewer.  Each needs to give their input and pass the item along to the next in line.  You need to record the results of the review for posterity, as part of the concept of validity is that we have documentation to support the development of a test.

What to review?

You should first establish what you want reviewed.  Assessment organizations will often formalize the guidelines as an Item Writing Guide.  Here is the guide that Assessment Systems uses with our clients, but I also recommend checking out the NBME Item Writing Guide.  For an even deeper treatment, I recommend the book Developing and Validating Test Items by Haladyna and Rodriguez (2013).

Here are some aspects to consider for item review.

Content

Most importantly, other content experts should check the item’s content.  Is the correct answer actually correct?  Are all the distractors actually incorrect?  Does the stem provide all the necessary info?  You’d be surprised how many times such issues slip past even the best reviewers!

Psychometrics

Psychometricians will often review an item to confirm that it meets best practices and that there are no tip-offs.  A common one is that the correct answer is often longer (more words) than the distractors.  Some organizations avoid “all of the above” and other approaches.

Format

Formal editors are sometimes brought in to work on the language and format of the item.  A common mistake is to end the stem with a colon even though that does not follow basic grammatical rules of English.

Bias/Sensitivity

For high-stakes exams that are used on diverse populations, it is important to add this step.  You don’t want items that are biased against a subset of students.  This is not just racial; it can include other differentiations of students.  Years ago I worked on items for the US State of Alaska, which has some incredibly rural regions; we had to avoid concepts that many people take for granted, like roads or shopping malls!

How to implement an item review workflow


This is an example of how to implement the process in a professional-grade item banking platform.  Both of our platforms,  FastTest  and  Assess.ai, have powerful functionality to manage this process.  Admin users can define the stages and the required input, then manage the team members and flow of items.  Assess.ai is unique in the industry with its use of Kanban boards – recognized as the best UI for workflow management – for item review.

An additional step, often at the same time, is standard setting.  One of the most common approaches is called the modified-Angoff method, which requires you to obtain a difficulty rating from a team of experts for each item.  The Item Review interfaces excel in managing this process as well, saving you all the effort of manually managing that process!

CREATE WORKFLOW
Specify your stages and how items can move between them

DEFINE YOUR REVIEW FIELDS
These are special item metadata fields that require input from multiple users

MOVE NEW ITEMS INTO THE WORKFLOW
Once an item is written, it is ready for review

ASSIGN ITEMS TO USERS
Assign the item in the UI, with the option to send an email

USERS PERFORM REVIEWS
They can read the item, interact as a student would, and then leave feedback and other metadata in the review fields; then push the item down the line

ADMINS EVALUATE/EXPORT THE RESULTS
Admins can evaluate the results and decide if an item needs revision, or if it can be considered released.

 


e-Assessment is a critical component in education and workforce assessment, managing and delivering exams via the internet.  It requires a cloud-based platform that is designed specifically to build, deliver, manage, and validate exams that are either large-scale or high-stakes.  It is a core business tool for high-stakes professional and educational assessment programs, such as certification, licensure, or university admissions.  There are many, many software products out on the market that provide at least some functionality for online testing.

The biggest problem when you start shopping is that there is an incredible range in quality, though there are also other differentiators, such as some being made only to deliver pre-packaged employment skill tests rather than being for general usage.  This article provides some tips on how to implement e-assessment more effectively.

Type of e-Assessment tools

So how do you know what level of quality you need in an e-Assessment solution?  It mostly depends on the stakes of your test, which governs the need for quality in the test itself, which then drives the need for a quality platform to build and deliver the test.  This post helps you identify the types of functionality that set apart “real” online exam platforms, and you can evaluate which components are most critical for you once you go shopping.

This table depicts one way to think about what sort of solution you need.

| | Non-professional level | Professional level |
|---|---|---|
| Not dedicated to assessment | Systems that can do minimal assessment and are inexpensive, such as survey software (LimeSurvey, QuestionPro, etc.) | Related systems like LMS platforms that are high quality (Blackboard, Canvas); these have some assessment functionality but lack professional functionality like IRT, adaptive testing, and true item banking |
| Dedicated to assessment | Systems designed for assessment but without professional functionality; anybody can make a simple platform for MCQ exams, etc. | Powerful systems designed for high-stakes exams, with professional functionality like IRT/CAT |

 

This post will discuss some of the “real” functionality that separates a true e-Assessment solution that is designed for assessment professionals, from the other 3 cells of the table.

 

Prefer to get your hands dirty?  Sign up for a free account in our platform or request a personalized demonstration.

 


What is a professional e-Assessment tool, anyway?


An e-Assessment system is much more than an exam module in a learning management system (LMS) or an inexpensive quiz/survey maker.  A real online exam platform is designed for professionals, that is, people whose entire job is to make assessments.  A good comparison is a customer relationship management (CRM) system.  That is a platform designed for use by people whose job is to manage customers, whether serving existing customers or managing the sales process.  While it is entirely possible to use a spreadsheet to manage such things at a small scale, any organization operating at scale will leverage a true CRM like SalesForce or Zoho.  You wouldn’t hire a team of professional sales experts and then have them waste hours each day in a spreadsheet; you would give them SalesForce to make them much more effective.

The same is true for online testing and assessment.  If you are a teacher making math quizzes, then Microsoft Word might be sufficient.  But there are many organizations that are doing a professional level of assessment, with dedicated staff.  Some examples, by no means an exhaustive list:

  • Professional credentialing: Certification and licensure exams that a person passes to work in a profession, such as chiropractors
  • Employment: Evaluating job applicants to make sure they have relevant skills, ability, and experience
  • Universities: Not for classroom assessments, but rather for topics like placement exams of all incoming students, or for nationwide admissions exams
  • K-12 benchmark: If you are a government that tests all 8th graders at the end of the year, or a company that delivers millions of formative assessments

 

The traditional vs modern approach to e-Assessment

For starters, one important thing to consider is the approach that the exam software takes to assessment.  Some of the aspects listed here are points in the detailed discussion below.


 

Goal 1: Item banking that makes your team more efficient

True item banking:

The platform should treat items as reusable objects that exist with persistent IDs and metadata.  Learn more about item banking.

Configurability:

The platform should allow you to configure how items are scored and presented, such as font size, answer layout, and weighting.

Multimedia management:

Audio, video, and images should be stored in their own banks, with their own metadata fields, as reusable objects.  If an image is in 7 questions, you should not have to upload 7 times… you upload once and the system tracks which items use it.

Statistics and other metadata:

All items should have many fields that are essential metadata: author name, date created, tests which use the item, content area, Bloom’s taxonomy, classical statistics, IRT parameters, and much more.

Custom fields:

You should be able to create any new metadata fields that you like.

Item review workflow:

Professionally built items will go through a review process, like Psychometric Review, English Editing, and Content Review. The platform should manage this, allowing you to assign items to people with due dates and email notifications.

Standard Setting:

The exam platform should include functionality to help you do standard setting like the modified-Angoff approach.

Automated item generation:

There should be functionality for automated item generation.

Powerful test assembly:

When you publish a test, there should be many options, including sections, navigation limits, paper vs online, scoring algorithms, instructional screens, score reports, etc.  You should also have aids in psychometric aspects, such as a Test Information Function.

Equation Editor:

Many math exams need a professional equation editor to write the items, embedded in the item authoring.

 

Goal 2: Professional exam delivery with e-Assessment

Scheduling options:

Date ranges for availability, retake rules, alternate forms, passwords, etc.  These are essential for maintaining the integrity of high stakes tests.

Item response theory:

Item response theory is the modern psychometric paradigm used by organizations dedicated to stronger assessment.  It is far superior to the oversimplified, classical approach based on proportions and correlations.

Linear on the fly testing (LOFT):

Suppose you have a pool of 200 questions, and you want every student to get 50 randomly picked, but balanced so that there are 10 items from each of 5 content areas.  This is known as linear-on-the-fly testing, and can greatly enhance the security and validity of the test.
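Here is a toy Python sketch of that kind of content-balanced random selection (the pool structure and function name are assumptions for illustration; real LOFT engines also balance on statistical properties such as difficulty and exposure):

```python
import random

def build_loft_form(pool, items_per_area=10):
    """Randomly pick items_per_area items from each content area in the pool."""
    by_area = {}
    for item in pool:
        by_area.setdefault(item["content_area"], []).append(item)
    form = []
    for area_items in by_area.values():
        form.extend(random.sample(area_items, items_per_area))
    random.shuffle(form)
    return form

# Hypothetical pool: 200 items spread evenly across 5 content areas
pool = [{"id": i, "content_area": i % 5} for i in range(200)]
form = build_loft_form(pool)
print(len(form))  # 50 items: 10 from each of the 5 content areas
```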

Computerized adaptive testing:

This uses AI and machine learning to customize the test uniquely to every examinee.  Adaptive testing is much more secure, more accurate, more engaging, and can reduce test length by 50-90%.

Tech-enhanced item types:

Drag and drop, audio/video, hotspot, fill-in-the-blank, etc.

Scalability:

Because most “real” exams will be doing thousands, tens of thousands, or even hundreds of thousands of examinees, the online exam platform needs to be able to scale up.

Online essay marking:

The e-Assessment platform should have a module to score open-response items, preferably with advanced options such as multiple markers or marker anonymity.

 

Goal 3: Maintaining test integrity and security during e-Assessment

Delivery security options:

There should be choices for how to create/disseminate passcodes, set time/date windows, disallow movement back to previous sections, etc.

Lockdown browser:

An option to deliver with software that locks the computer while the examinee is in the test.

Remote proctoring:

There should be an option for remote (online) proctoring.  This can be AI, record and review, or live.

Live proctoring:

There should be functionality that facilitates live human proctoring, such as in computer labs at a university.  The system might have Proctor Codes or a site management module.

User roles and content access:

There should be various roles for users, as well as options to limit them by content.  For example, limiting a Math teacher doing reviews to do nothing but review Math items.

Rescoring:

If items are compromised or challenged, you need functionality to easily remove them from scoring for an exam, and rescore all candidates.

Live dashboard:

You should be able to see who is in the online exam, stop them if needed, and restart or re-register if needed.

 

Goal 4: Powerful reporting and exporting

Support for QTI:

You should be able to import and export items with QTI, as well as common formats like Word or Excel.

Psychometric analytics & data visualization:

You should be able to see reports on reliability, standard error of measurement, point-biserial item discriminations, and all the other statistics that a psychometrician needs.  Sophisticated users will need things like item response theory.

Exporting of detailed raw files:

You should be able to easily export the examinee response matrix, item times, item comments, scores, and all other result data.

API connections:

You should have options to set up APIs to other platforms, like an LMS or CRM.

 

General Considerations

Ease-of-Use

As Albert Einstein said, “Everything should be made as simple as possible, but no simpler.”  The best e-Assessment software is one that offers sophisticated solutions in a way that anyone can use.  Power users should be able to leverage technology like adaptive testing, while there should also be simpler roles for item writers or reviewers.

Integrations

Your platform should integrate with learning management systems, job applicant tracking systems, certification management systems, or whatever other business operations software is important to you.

Support and Training

Does the platform have a detailed manual?  Bank of tutorial videos?  Email support from product experts?  Training webinars?

 

OK, now how do I find an e-Assessment solution that fits my needs?

If you are out shopping, ask about the aspects in the list above.  Be sure to check vendor websites for documentation on these.  There is a huge range out there, from free survey software up to multi-million dollar platforms.

Want to save yourself some time?  Click here to request a free account in our platform.

Job Task Analysis

Job Task Analysis (JTA) is an essential step in designing a test to be used in the workforce, such as pre-employment or certification/licensure, by analyzing data on what is actually being done in the job.  Also known as Job Analysis or Role Delineation, job task analysis is important to design a test that is legally defensible and eligible for accreditation.  It usually involves a panel of subject matter experts to develop a survey, which you then deliver to professionals in your field to get quantitative data about what is most frequently done on the job and what is most critical/important.  This data can then be used for several important purposes.

Need help? Our experts can help you efficiently produce a job task analysis study for your certification, guide the process of item writing and standard setting, then publish and deliver the exam on our secure platform.

 

Reasons to do a Job Task Analysis

Job analysis is extremely important in the field of industrial/organizational psychology, hence the meme here from @iopsychmemes.  It’s not just limited to credentialing.

Job analysis I/O Psychology

Exam design

The most common reason is to get quantitative data that will help you design an exam.  By knowing which knowledge, skills, or abilities (KSAs) are most commonly used, you know which deserve more questions on the test.  It can also help you with more complex design aspects, such as defining a practical exam with live patients.

Training curriculum

Similarly, that quantitative info can help design a curriculum and other training materials.  You will have data on what is most important or frequent.

Compensation analysis

You have a captive audience with the JTA survey.  Ask them other things that you want to know!  This is an excellent time to gather information about compensation.  I worked on a JTA in the past which asked about work location: clinic, hospital, private practice, or vendor/corporate.

Job descriptions

A good job analysis will help you write a job description for postings.  It will tell you the job responsibilities (common tasks), qualifications (required skills, abilities, and education), and other important aspects.  If you gather compensation data in the survey, that can be used to define the salary range of the open position.

Workforce planning

Important trends might become obvious when analyzing the data.  Are fewer people entering your profession, perhaps specific to a certain region or demographic?  Are they entering without certain skills?  Are there certain universities or training programs that are not performing well?  A JTA can help you discover such issues and then work with stakeholders to address them.  These are major potential problems for the profession.

IT IS MANDATORY

If you have a professional certification exam and want to get it accredited by a board such as NCCA or ANSI/ANAB/ISO, then you are REQUIRED to do some sort of job task analysis.

 

Why is a JTA so important for certification and licensure?  Validity.

The fundamental goal of psychometrics is validity, which is evidence that the interpretations we make from scores are actually true. In the case of certification and licensure exams, we are interpreting that someone who passes the test is qualified to work in that job role. So, the first thing we need to do is define exactly what is the job role, and to do it in a quantitative, scientific way. You can’t just have someone sit down in their basement and write up 17 bullet points as the exam blueprint.  That is a lawsuit waiting to happen.

There are other aspects that are essential as well, such as item writer training and standard setting studies.

 

The Methodology: Job Task Inventory

It’s not easy to develop a defensible certification exam, but the process of job task analysis (JTA) doesn’t require a Ph.D. in Psychometrics to understand. Here’s an overview of what to expect.

  1. Convene a panel of subject matter experts (SMEs), and provide a training on the JTA process.
  2. The SMEs then discuss the role of the certification in the profession, and establish high-level topics (domains) that the certification test should cover. Usually, there are 5-20. Sometimes there are subdomains, and occasionally sub-subdomains.
  3. The SME panel generates a list of job tasks that are assigned to domains; the list is reviewed for duplicates and other potential issues. These tasks have an action verb, a subject, and sometimes a qualifier. Examples: “Calibrate the lensometer,” “Take out the trash”, “Perform an equating study.”  There is a specific approach to help with the generation, called the critical incident technique.  With this, you ask the SMEs to describe a critical incident that happened on the job and what skills or knowledge led to success by the professional.  While this might not generate ideas for frequent yet simple tasks, it can help generate ideas for tasks that are rarer but very important.
  4. The final list is used to generate a survey, which is sent to a representative sample of professionals that actually work in the role. The respondents take the survey, whereby they rate each task, usually on its importance and time spent (sometimes called criticality and frequency). Demographics are also gathered, which include age range, geographic region, work location (e.g., clinic vs hospital if medical), years of experience, educational level, and additional certifications.
  5. A psychometrician analyzes the results and creates a formal report, which is essential for validity documentation.  This report is sometimes considered confidential, sometimes published on the organization’s website for the benefit of the profession, and sometimes published in an abbreviated form.  It’s up to you.  For example, this site presents the final results, but then asks you to submit your email address for the full report.

 

Using JTA results to create test blueprints

Many corporations do a job analysis purely for in-house purposes, such as job descriptions and compensation.  This becomes important for large corporations where you might have thousands of people in the same job; it needs to be well-defined, with good training and appropriate compensation.

If you work for a credentialing organization (typically a non-profit, but sometimes the training arm of a corporation; for example, Amazon Web Services has a division dedicated to certification exams), you will need to analyze the results of the JTA to develop exam blueprints.  We will discuss this process in more detail in another blog post.  But below is an example of how this will look, and here is a free spreadsheet to perform the calculations: Job Task Analysis to Test Blueprints.

 

Job Task Analysis Example

Suppose you are an expert widgetmaker in charge of the widgetmaker certification exam.  You hire a psychometrician to guide the organization through the test development process.  The psychometrician would start by holding a webinar or in-person meeting for a panel of SMEs to define the role and generate a list of tasks.  The group comes up with a list of 20 tasks, sorted into 4 content domains.  These are listed in a survey to current widgetmakers, who rate them on importance and frequency.  The psychometrician analyzes the data and presents a table like you see below.

We can see here that Task 14 is the most frequent, while Task 2 is the least frequent.  Task 7 is the most important while Task 17 is the least.  When you combine Importance and Frequency either by adding or multiplying, you get the weights on the right-hand columns.  If we sum these and divide by the total, we get the suggested blueprints in the green cells.

 

Job task analysis to test blueprints
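Here is a minimal Python sketch of that arithmetic, assuming you have mean importance and frequency ratings per task from the survey (the numbers and column names are made up; the linked spreadsheet performs the same calculation across all 20 tasks):

```python
import pandas as pd

# Hypothetical mean survey ratings for a few tasks (1-5 scales)
tasks = pd.DataFrame({
    "task":       ["Task 1", "Task 2", "Task 3", "Task 4"],
    "domain":     ["A", "A", "B", "B"],
    "importance": [4.2, 2.1, 4.8, 3.5],
    "frequency":  [3.9, 1.8, 4.1, 4.4],
})

# Combine importance and frequency (here by multiplying; adding is another option),
# then normalize so the weights sum to 1
tasks["weight"] = tasks["importance"] * tasks["frequency"]
tasks["blueprint_pct"] = tasks["weight"] / tasks["weight"].sum()

# Roll up task weights into domain-level blueprint percentages
print(tasks.groupby("domain")["blueprint_pct"].sum())
```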

 


Question and Test Interoperability® (QTI®) is a set of standards around the format of import/export files for test questions in educational assessment and HR/credentialing exams.  This facilitates the movement of questions from one software platform to another, including item banking, test assembly, e-Learning, training, and exam delivery.  This serves two main purposes:

  1. It allows you to use multiple vendors more easily, such as one for item banking and another for exam delivery;
  2. It makes migrating to a new vendor easier, as you can export all your content from the old vendor and then import it into the new one.

In this blog post, we’ll discuss the significance of QTI and how it helps test sponsors in the world of certification, workforce, and educational assessment.

What is Question and Test Interoperability (QTI)?

QTI is a widely adopted standard that facilitates the exchange of assessment content and results between various learning platforms and assessment tools. Developed by the IMS Global Learning Consortium / 1EdTech, its goal is that assessments can be created, delivered, and evaluated consistently across different systems, paving the way for a more efficient and streamlined educational experience.  QTI is similar to SCORM, which is intended for learning content, while QTI is specific for assessment.

QTI uses an XML approach to content and markup, specifically tailored to educational assessment, covering stems, answers, correct answers, and scoring information.  Version 2.x creates a zip file of all content, including a manifest file that lets the importing platform know what is supposed to be coming in, with items then as separate XML files, and media files saved separately, sometimes into a subfolder.

Here is an example of the file arrangement inside the zip:

QTI files

Here is an example of what a test question would look like:

QTI example item
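As a rough illustration of how a QTI 2.x package can be inspected programmatically, here is a Python sketch that opens the zip and lists the resources declared in the manifest.  The file name is hypothetical, and the manifest handling is simplified (XML namespaces are ignored), since the exact schema varies by QTI version and vendor.

```python
import zipfile
import xml.etree.ElementTree as ET

package_path = "example_qti_package.zip"  # hypothetical export from an item bank

with zipfile.ZipFile(package_path) as pkg:
    # The package contains imsmanifest.xml, item XML files, and media files
    print(pkg.namelist())

    with pkg.open("imsmanifest.xml") as f:
        manifest = ET.parse(f).getroot()

    # Each <resource> entry tells the importing platform what to expect
    for node in manifest.iter():
        if node.tag.endswith("resource"):
            print(node.get("identifier"), node.get("type"), node.get("href"))
```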

 

Why is QTI important?

Interoperability Across Platforms

QTI enables educators to create assessments on one platform and seamlessly transfer them to another. This cross-platform compatibility is crucial in today’s diverse educational technology landscape, where institutions often use a combination of learning management systems, assessment tools, and other applications.

Enhanced Efficiency

With QTI, the time-consuming process of manually transferring assessment content between systems is eliminated. This not only saves valuable time for educators but also ensures that the integrity of the assessment is maintained throughout the transfer process.

Adaptability to Diverse Assessment Types

QTI supports a wide range of question types, including multiple-choice, true/false, short answer, and more. This adaptability allows educators to create diverse and engaging assessments that cater to different learning styles and subject matter.

Data Standardization

The standardization of data formats within QTI ensures that assessment results are consistent and easily interpretable. This standardization not only facilitates a smoother exchange of information but also enables educators to gain valuable insights into student performance across various assessments.

Facilitating Accessibility

QTI is designed to support accessibility standards, making assessments more inclusive for all students, including those with disabilities. By adhering to accessibility guidelines, educational institutions can ensure that assessments are a fair and effective means of evaluating student knowledge.

 

How to Use QTI

Creating Assessments


QTI allows educators to author assessments in a standardized format that can be easily transferred between different platforms. When creating assessments, users adhere to the specification, ensuring compatibility and consistency.

Importing and Exporting Assessments

Educational institutions often use multiple learning management systems and assessment tools. QTI simplifies the process of transferring assessments between different platforms, eliminating the need for manual adjustments and reducing the risk of data corruption.

Adhering to QTI Specifications

To fully leverage the benefits of QTI, users must adhere to its specifications when creating and implementing assessments. Understanding the QTI schema and guidelines is essential for ensuring that assessments are interoperable across various systems.  This is dependent on the vendor you select.  Note that there have been different sets of QTI standards that have evolved over the years, and some vendors have slightly modified their own format!

 

Examples of QTI Applications

Online Testing Platforms

QTI is widely used in online testing platforms to facilitate the seamless transfer of assessments. Whether transitioning between different learning management systems or integrating third-party assessment tools, it ensures a smooth and standardized process.

Learning Management Systems (LMS)

Educational institutions often employ different LMS platforms. QTI allows educators to create assessments in one LMS and seamlessly transfer them to another, ensuring continuity and consistency in the assessment process.

Assessment Authoring Tools

QTI is integrated into various assessment authoring tools, enabling educators to create assessments in a standardized format. This integration ensures that assessments can be easily shared and used across different educational platforms.

 

Resources for Implementation

IMS Global Learning Consortium

The official website of the IMS Global Learning Consortium provides comprehensive documentation, specifications, and updates related to QTI. Educators and developers can access valuable resources to understand and implement QTI effectively.

QTI-Compatible Platforms and Tools

Many learning platforms and assessment tools explicitly support these specifications. Exploring and adopting compatible solutions simplifies the implementation process and ensures a seamless experience for both educators and students.  Our FastTest platform provides support for QTI.

Community Forums and Support Groups

Engaging with the educational technology community through forums and support groups allows users to share experiences, seek advice, and stay updated on best practices for QTI implementation.  See this thread in Moodle forums, for example.

Wikipedia

Wikipedia has an overview of the topic.

 

Conclusion

In a world where educational technology is advancing rapidly, the Question and Test Interoperability specification stands out as a crucial standard for achieving interoperability in assessment tools, fostering a more efficient, accessible, and adaptable educational environment.  By understanding what QTI is, how to use it, exploring real-world examples, and tapping into valuable resources, educators can navigate the educational landscape more effectively, ensuring a streamlined and consistent e-Assessment experience for students and instructors alike.


Classical Test Theory (CTT) is a psychometric approach to analyzing, improving, scoring, and validating assessments.  It is based on relatively simple concepts, such as averages, proportions, and correlations.  One of the most frequently used aspects is item statistics, which provide insight into how an individual test question is performing.  Is it too easy, too hard, too confusing, miskeyed, or potentially another issue?  Item statistics are what tell you these things.

What are classical test theory item statistics?

They are indices of how a test item, or components of it, is performing.  Items can be hard vs easy, strong vs weak, and other important aspects.  Below is the output from the  Iteman  report in our  FastTest  online assessment platform, showing an English vocabulary item with real student data.  How do we interpret this?

FastTest Iteman Psychometric Analysis

Interpreting Classical Test Theory Item Statistics: Item Difficulty

The P value (Multiple Choice)

The P value is the classical test theory index of difficulty, and is the proportion of examinees that answered an item correctly (or in the keyed direction). It ranges from 0.0 to 1.0. A high value means that the item is easy, and a low value means that the item is difficult.  There are no hard and fast rules because interpretation can vary widely for different situations.  For example, a test given at the beginning of the school year would be expected to have low statistics since the students have not yet been taught the material.  On the other hand, a professional certification exam, where someone cannot even sit for it unless they have 3 years of experience and a relevant degree, might have all items appear easy even though they cover quite advanced topics!  Here are some general guidelines:

    0.95-1.0 = Too easy (not doing much good to differentiate examinees, which is really the purpose of assessment)

    0.60-0.95 = Typical

    0.40-0.60 = Hard

    <0.40 = Too hard (consider that a 4 option multiple choice has a 25% chance of pure guessing)

With Iteman, you can set bounds to automatically flag items.  The minimum P value bound represents what you consider the cut point for an item being too difficult. For a relatively easy test, you might specify 0.50 as a minimum, which means that 50% of the examinees have answered the item correctly.

For a test where we expect examinees to perform poorly, the minimum might be lowered to 0.4 or even 0.3. The minimum should take into account the possibility of guessing; if the item is multiple-choice with four options, there is a 25% chance of randomly guessing the answer, so the minimum should probably not be 0.20.  The maximum P value represents the cut point for what you consider to be an item that is too easy. The primary consideration here is that if an item is so easy that nearly everyone gets it correct, it is not providing much information about the examinees.

In fact, items with a P of 0.95 or higher typically have very poor point-biserial correlations.
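Here is a minimal Python sketch of the P value calculation on a scored (0/1) response matrix, using the example bounds discussed above to flag items (the data are made up):

```python
import numpy as np

# Hypothetical scored responses: rows = examinees, columns = items (1 = correct)
scored = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
])

p_values = scored.mean(axis=0)     # proportion correct for each item
too_hard = p_values < 0.40         # below the lower bound
too_easy = p_values > 0.95         # above the upper bound

print(p_values)  # [0.8 0.6 0.2 1.0]
print(too_hard)  # third item flagged as too hard
print(too_easy)  # fourth item flagged as too easy
```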

The Item Mean (Polytomous)

This refers to an item that is scored with 2 or more point levels, like an essay scored on a 0-4 point rubric or a Likert-type item that is “Rate on a scale of 1 to 5.”

  • 1=Strongly Disagree
  • 2=Disagree
  • 3=Neutral
  • 4=Agree
  • 5=Strongly Agree

The item mean is the average of the item responses converted to numeric values across all examinees. The range of the item mean is dependent on the number of categories and whether the item responses begin at 0. The interpretation of the item mean depends on the type of item (rating scale or partial credit). A good rating scale item will have an item mean close to ½ of the maximum, as this means that on average, examinees are not endorsing categories near the extremes of the continuum.

You will have to adjust for your own situation, but here is an example for the 5-point Likert-style item.

1-2 is very low; people disagree fairly strongly on average

2-3 is low to neutral; people tend to disagree on average

3-4 is neutral to high; people tend to agree on average

4-5 is very high; people agree fairly strongly on average

Iteman also provides flagging bounds for this statistic.  The minimum item mean bound represents what you consider the cut point for the item mean being too low.  The maximum item mean bound represents what you consider the cut point for the item mean being too high.

The number of categories for the items must be considered when setting the bounds of the minimum/maximum values. This is important as all items of a certain type (e.g., 3-category) might be flagged.

Interpreting Classical Test Theory Item Statistics: Item Discrimination

Multiple-Choice Items

The Pearson point-biserial correlation (r-pbis) is a classical test theory measure of the discrimination, or differentiating strength, of the item. It ranges from −1.0 to 1.0 and is a correlation of item scores and total raw scores.  If you consider a scored data matrix (multiple-choice items converted to 0/1 data), this would be the correlation between the item column and a column that is the sum of all item columns for each row (a person’s score).

A good item is able to differentiate between examinees of high and low ability, and will therefore have a higher point-biserial, though rarely above 0.50. A negative point-biserial is indicative of a very poor item, because it means that the high-ability examinees are answering incorrectly while the low-ability examinees are answering it correctly, which of course would be bizarre, and therefore typically indicates that the specified correct answer is actually wrong. A point-biserial of 0.0 provides no differentiation between low-scoring and high-scoring examinees, essentially random “noise.”  Here are some general guidelines on interpretation.  Note that these assume a decent sample size; if you only have a small number of examinees, many item statistics will be flagged!

0.20+ = Good item; smarter examinees tend to get the item correct

0.10-0.20 = OK item; but probably review it

0.0-0.10 = Marginal item quality; should probably be revised or replaced

<0.0 = Terrible item; replace it

***A major red flag is if the correct answer has a negative r-pbis and a distractor has a positive r-pbis.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. This is typically a small positive number, like 0.10 or 0.20. If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.
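Continuing the sketch above, the point-biserial is just the correlation of each item’s 0/1 column with the total raw score.  This simple version includes the item in the total; many programs also report a corrected version that excludes it.

```python
import numpy as np

def point_biserial(scored: np.ndarray) -> np.ndarray:
    """Correlation of each 0/1 item column with the total raw score."""
    totals = scored.sum(axis=1)
    return np.array([np.corrcoef(scored[:, j], totals)[0, 1]
                     for j in range(scored.shape[1])])

scored = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
])

print(point_biserial(scored))  # an item everyone answers correctly returns nan (zero variance)
```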

The biserial correlation is also a measure of the discrimination, or differentiating strength, of the item. It ranges from −1.0 to 1.0. The biserial correlation is computed between the item and total score as if the item was a continuous measure of the trait. Since the biserial is an estimate of Pearson’s r, it will be larger in absolute magnitude than the corresponding point-biserial.

The biserial makes the stricter assumption that the score distribution is normal. The biserial correlation is not recommended for traits where the score distribution is known to be non-normal (e.g., pathology).

Polytomous Items

The Pearson’s r correlation is the product-moment correlation between the item responses (as numeric values) and total score. It ranges from −1.0 to 1.0. The r correlation indexes the linear relationship between item score and total score and assumes that the item responses for an item form a continuous variable. The r correlation and the r-pbis are equivalent for a 2-category item, so guidelines for interpretation remain unchanged.

The minimum item-total correlation bound represents the lowest discrimination you are willing to accept. Since the typical r correlation (0.5) will be larger than the typical rpbis (0.3) correlation, you may wish to set the lower bound higher for a test with polytomous items (0.2 to 0.3). If your sample size is small, it could possibly be reduced.  The maximum item-total correlation bound is almost always 1.0, because it is typically desired that the r-pbis be as high as possible.

The eta coefficient is an additional index of discrimination computed using an analysis of variance with the item response as the independent variable and total score as the dependent variable. The eta coefficient is the ratio of the between-groups sum of squares to the total sum of squares and has a range of 0 to 1. The eta coefficient does not assume that the item responses are continuous and also does not assume a linear relationship between the item response and total score.

As a result, the eta coefficient will always be equal to or greater than Pearson’s r. Note that the biserial correlation will be reported if the item has only 2 categories.

z-scores

A z-score measures the distance between a raw score and the mean in standard deviation units. The z-score is also known as a standard score, since it enables comparing scores on various variables by standardizing the distribution of scores. It is worth mentioning that a standard normal distribution (also known as the z-score distribution or probability distribution) is a normally shaped distribution with a mean of 0 and a standard deviation of 1. The T score is another example of a standardized score.

The z-score can be positive or negative. The sign depends on whether the observation is above or below the mean. For instance, a z of +2 indicates that the raw score (data point) is two standard deviations above the mean, while a z of -1 signifies that it is one standard deviation below the mean. A z of 0 means the score is exactly at the mean. Z-scores generally range from -3 standard deviations (which would fall to the far left of the normal distribution curve) up to +3 standard deviations (which would fall to the far right of the normal distribution curve). This covers about 99.7% of the population; there are people outside that range (e.g., extremely gifted students), but in most cases it is difficult to measure the extremes and there is little practical difference.  It is for this reason that scaled scores on exams are often produced with this paradigm; the SAT has a mean of 500 and a standard deviation of 100, so the reported range is 200 to 800.

How to calculate a z-score

Here is a formula for calculating the z:

z = (x – μ) / σ

where

     x – individual value

     μ – mean

     σ – standard deviation.

Interpretation of the formula:

  • Subtract the mean of the values from the individual value
  • Divide the difference by the standard deviation.
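
The formula translates directly into code; here is a minimal sketch (the function name is my own, not from any package):

```python
def z_score(x, mean, sd):
    """Distance of a raw score x from the mean, in standard deviation units."""
    return (x - mean) / sd

print(z_score(650, 500, 100))   # 1.5 -- the SAT example worked through below
```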

Here is a graphical depiction of the standard normal curve and how the z-score relates to other metrics.

 


Advantages of using a z-score

When you standardize the raw data by transforming them into z-scores, you receive the following benefits:

  • Identify outliers
  • Understand where an individual score fits into a distribution
  • Normalize scores for statistical decision-making (e.g., grading on a curve)
  • Calculate probabilities and percentiles using the standard normal distribution
  • Compare scores on different distributions with different means and standard deviations

Example of using a z-score in real life situation

Let’s imagine that there is a set of SAT scores from students, and this data set follows a normal distribution with a mean of 500 and a standard deviation of 100. Suppose we need to find the probability that these SAT scores exceed 650. In order to standardize our data, we have to find the z-score for 650. The z will tell us how many standard deviations away from the mean 650 is.

  • Subtracting the mean from the individual value:

x – 650

μ – 500

x – μ = 650 – 500 = 150

  • Dividing the obtained difference by the standard deviation:

σ – 100

z = 150 ÷ 100 = 1.5

The z for the value of 650 is 1.5, i.e., 650 is 1.5 standard deviations above the mean in our distribution.

If you look up this z-score on a conversion table, you will find 0.93319.  This means that a score of 650 is at approximately the 93rd percentile of students.
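
Rather than a printed conversion table, you can get the same value from the standard normal cumulative distribution function; here is a quick check with SciPy.

```python
from scipy.stats import norm

z = (650 - 500) / 100      # 1.5
print(norm.cdf(z))         # ~0.9332 -> roughly the 93rd percentile
print(1 - norm.cdf(z))     # ~0.0668 -> probability of an SAT score above 650
```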

Additional resources

Khan Academy

Normal Distribution (Wikipedia)

Classical Test Theory vs. Item Response Theory

Classical Test Theory and Item Response Theory (CTT & IRT) are the two primary psychometric paradigms.  That is, they are mathematical approaches to how tests are analyzed and scored.  They differ quite substantially in substance and complexity, even though they both nominally do the same thing, which is statistically analyze test data to ensure reliability and validity.  CTT is quite simple, easily understood, and works with small samples, but IRT is far more powerful and effective, so it is used by most big exams in the world.

So how are they different, and how can you effectively choose the right solution?  First, let’s start by defining the two.  This is just a brief intro; there are entire books dedicated to the details!

Classical Test Theory

CTT is an approach that is based on simple mathematics; primarily averages, proportions, and correlations.  It is more than 100 years old, but is still used quite often, with good reason. In addition to working with small sample sizes, it is very simple and easy to understand, which makes it useful for working directly with content experts to evaluate, diagnose, and improve items or tests.

Download free version of Iteman for CTT Analysis

 


 

Item Response Theory

IRT is a much more complex approach to analyzing tests. Moreover, it is not just for analyzing; it is a complete psychometric paradigm that changes how item banks are developed, test forms are designed, tests are delivered (adaptive or linear-on-the-fly), and scores produced. There are many benefits to this approach that justify the complexity, and there is good reason that all major examinations in the world utilize IRT.  Learn more about IRT here.

 

Download free version of Xcalibre for IRT Analysis

 

Similarities between Classical Test Theory and Item Response Theory

CTT & IRT are both foundational frameworks in psychometrics aimed at improving the reliability and validity of psychological assessments. Both methodologies involve item analysis to evaluate and refine test items, ensuring they effectively measure the intended constructs. Additionally, IRT and CTT emphasize the importance of test standardization and norm-referencing, which facilitate consistent administration and meaningful score interpretation. Despite differing in specific techniques, both frameworks ultimately strive to produce accurate and consistent measurement tools. These shared goals highlight the complementary nature of IRT and CTT in advancing psychological testing.

Differences between Classical Test Theory and Item Response Theory

Test-Level and Subscore-Level Analysis

CTT statistics for total scores and subscores include coefficient alpha reliability, standard error of measurement (a function of reliability and SD), descriptive statistics (average, SD…), and roll-ups of item statistics (e.g., mean Rpbis).

With IRT, we utilize the same descriptive statistics, but the scores are now different (theta, not number-correct).  The standard error of measurement is now a conditional function of theta, not a single number. The entire concept of reliability is dropped and replaced with the concept of precision, which is likewise expressed as a conditional function, as the sketch below illustrates.
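
As a hedged sketch of what “precision as a conditional function” means, here is a small example of test information and the conditional standard error of measurement under the 2PL model; the item parameters are invented for illustration.

```python
import numpy as np

# Invented 2PL item parameters for a 5-item test
a = np.array([1.0, 1.2, 0.8, 1.5, 0.9])    # discriminations
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # difficulties

def test_information(theta):
    """Sum of 2PL item information, a^2 * P * (1 - P), across the test."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(a**2 * p * (1 - p))

# The conditional SEM is 1 / sqrt(information), so it varies across the theta scale
for theta in (-2, -1, 0, 1, 2):
    info = test_information(theta)
    print(theta, round(info, 2), round(1 / np.sqrt(info), 2))
```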

Item-Level Analysis

Item statistics for CTT include proportion-correct (difficulty), point-biserial (Rpbis) correlation (discrimination), and a distractor/answer analysis. If there is demographic information, CTT analysis can also provide a simple evaluation of differential item functioning (DIF).

IRT replaces the difficulty and discrimination with its own quantifications, called simply b and a.  In addition, it can add a c parameter for guessing effects. More importantly, it creates entirely new classes of statistics for partial credit or rating scale items.
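
As a quick illustration of what the a, b, and c parameters do, here is a minimal sketch of the three-parameter logistic (3PL) item response function; the example parameter values are invented.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL model:
    a guessing floor c, plus the remaining probability scaled by a logistic
    curve centered at difficulty b with slope governed by a."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# An item with average difficulty, good discrimination, and a slight guessing floor
for theta in (-3, -1, 0, 1, 3):
    print(theta, round(p_correct_3pl(theta, a=1.2, b=0.0, c=0.20), 3))
```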

Scoring

CTT scores tests with traditional scoring: number-correct, proportion-correct, or sum-of-points.  CTT interprets test scores based on the total number of correct responses, assuming all items contribute equally.  IRT scores examinees directly on a latent scale, which psychometricians call theta, allowing for more nuanced and precise ability estimates.
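
To contrast the two scoring approaches, here is a hedged sketch: a number-correct score alongside a simple maximum-likelihood theta estimate under the 2PL model. The item parameters and responses are invented, and operational programs use more robust estimators (e.g., EAP).

```python
import numpy as np
from scipy.optimize import minimize_scalar

a = np.array([1.0, 1.2, 0.8, 1.5, 0.9])    # discriminations (invented)
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # difficulties (invented)
responses = np.array([1, 1, 1, 0, 0])      # one examinee's 0/1 responses

# CTT: every correct answer counts the same
print("Number-correct:", responses.sum())

# IRT: find the theta that makes this response pattern most likely
def neg_log_likelihood(theta):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded")
print("Theta estimate:", round(result.x, 2))
```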

Linking and Equating

Linking and equating are statistical analyses used to determine comparable scores on different forms; e.g., Form A is “two points easier” than Form B, and therefore a 72 on Form A is comparable to a 70 on Form B. CTT has several methods for this, including the Tucker and Levine methods, but there are methodological issues with these approaches. These issues, and other issues with CTT, eventually led to the development of IRT in the 1960s and 1970s.

IRT has methods to accomplish linking and equating that are much more powerful than those of CTT, including anchor-item calibration and conversion methods like Stocking-Lord. There are other advantages as well.

Vertical Scaling

One major advantage of IRT, as a corollary to the strong linking/equating, is that we can link/equate not just across multiple forms in one grade, but from grade to grade. This produces a vertical scale. A vertical scale can span multiple grades, making it much easier to track student growth, or to measure students who are off-grade in their performance (e.g., a 7th grader performing at a 5th grade level). A vertical scale is a substantial investment, but is extremely powerful for K-12 assessments.

Sample Sizes

Classical test theory can work effectively with 50 examinees, and can provide useful results with as few as 20.  Depending on the IRT model you select (there are many), the minimum sample size can be 100 to 1,000.

Sample- and Test-Dependence

CTT analyses are sample-dependent and test-dependent, which means that the results apply only to the particular test form and group of students analyzed. It is possible to combine data across multiple test forms to create a sparse matrix, but this has a detrimental effect on some of the statistics (especially alpha), even if the test is of high quality, and the results will not reflect reality.

For example, if Grade 7 Math has 3 forms (beginning, middle, end of year), it is conceivable to combine them into one “super-matrix” and analyze together. The same is true if there are 3 forms given at the same time, and each student randomly receives one of the forms. In that case, 2/3 of the matrix would be empty, which psychometricians call sparse.
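
Here is a tiny sketch of what that sparse “super-matrix” looks like when three forms are stacked together; the data are invented, and NaN marks items a student never saw.

```python
import numpy as np

n_forms, students_per_form, items_per_form = 3, 2, 3
matrix = np.full((n_forms * students_per_form, n_forms * items_per_form), np.nan)

rng = np.random.default_rng(0)
for form in range(n_forms):
    rows = slice(form * students_per_form, (form + 1) * students_per_form)
    cols = slice(form * items_per_form, (form + 1) * items_per_form)
    matrix[rows, cols] = rng.integers(0, 2, size=(students_per_form, items_per_form))

print(matrix)                      # block-diagonal; everything off the blocks is NaN
print(np.isnan(matrix).mean())     # about 2/3 of the cells are empty, i.e. "sparse"
```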

Distractor Analysis

Classical test theory will analyze the distractors of a multiple choice item.  IRT models, except for the rarely-used Nominal Response Model, do not.  So even if you primarily use IRT, psychometricians will also use CTT for this.

Guessing


IRT has a parameter to account for guessing, though some psychometricians argue against its use.  CTT has no effective way to account for guessing.

Adaptive Testing

There are rare cases where adaptive testing (personalized assessment) can be done with classical test theory.  However, it pretty much requires the use of item response theory for one important reason: IRT puts people and items onto the same latent scale.

Linear Test Design

CTT and IRT differ in how test forms are designed and built.  CTT works best when there are lots of items of middle difficulty, as this maximizes the coefficient alpha reliability.  However, there are definitely situations where the purpose of the assessment is otherwise.  IRT provides stronger methods for designing such tests, and then scoring as well.

So… How to Choose?

There is no single best answer to the question of CTT vs. IRT.  You need to evaluate the aspects listed above, and in some cases other aspects (e.g., financial, or whether you have staff available with the expertise in the first place).  In many cases, BOTH are necessary.  This is especially true because IRT does not provide an effective and easy-to-understand distractor analysis that you can use to discuss with subject matter experts.  It is for this reason that IRT software will typically produce CTT analysis too, though the reverse is not true.

IRT is very powerful, and can provide additional information about tests if used just for analyzing results to evaluate item and test performance.  A researcher might choose IRT over CTT for its ability to provide detailed item-level data, handle varying item characteristics, and improve the precision of ability estimates.  IRT’s flexibility and advanced modeling capabilities make it suitable for complex assessments and adaptive testing scenarios.

However, IRT is really only useful if you are going to make it your psychometric paradigm, thereby using it in the list of activities above, especially IRT scoring of examinees. Otherwise, IRT analysis is merely another way of looking at test and item performance that will correlate substantially with CTT.

Contact Us To Talk With An Expert