Psychometrics is the science of educational and psychological assessment.  It scientifically studies how tests are developed, delivered, and scored, regardless of the test topic.  The goal is to provide validity: evidence to support that interpretations of scores from the test are trustworthy.  This makes the tests more effective for their purpose of providing useful information about people.

Psychometrics tackles fundamental questions around assessment, such as how to determine if a test is reliable or if a question is of good quality, as well as much more complex questions like how to ensure that a score today is on the same scale as a score 10 years ago.  The goal of psychometrics is to make test scores fairer, more precise, and more valid – because test scores are used to make decisions about people (pass a course, hire for a job…), and better tests mean better decisions.  Why?  The scientific evidence is overwhelming that tests provide better information for decision makers than many other types of information, such as interviews, resumes, or educational attainment.

What is psychometrics? An introduction / definition.

Psychometrics is the study of assessment itself, regardless of what type of test is under consideration. In fact, many psychometricians don’t even work on a particular test; they work on psychometrics itself, such as new methods of data analysis.  Most professionals don’t care about what the test is measuring, and will often switch to new jobs in completely unrelated areas, such as moving from a K-12 testing company to psychological measurement to an accounting certification exam.  We often refer to whatever we are measuring simply as “theta” – a term from item response theory.

Psychometrics is a branch of data science.  In fact, it was around long before that term became a buzzword.  Don’t believe me?  Check out this Coursera course on Data Science: the first example they give as one of the foundational historical projects in data science is… psychometrics!  (early research on factor analysis of intelligence)

Even though assessment is everywhere and Psychometrics is an essential aspect of assessment, to most people it remains a black box, and professionals are referred to as “psychomagicians” in jest. However, a basic understanding is important for anyone working in the testing industry, especially those developing or selling tests.  It’s also important for many areas that use assessments, like human resources and education.

What is not psychometrics?

Psychometrics is NOT limited to very narrow types of assessment.  Some people use the term interchangeably with concepts like IQ testing, personality assessment, or pre-employment testing.  These are each but tiny parts of the field!  Also, it is not the administration of a test.

 

What questions does the field of Psychometrics address?

Building and maintaining a high-quality test is not easy.  A lot of big issues can arise.  Much of the field revolves around solving major questions about tests: what should they cover, what is a good question, how do we set a good cutscore, how do we make sure that the test predicts job performance or student success, etc.

 

How do we define what should be covered by the test? (Test Design)

Before writing any items, you need to define very specifically what will be on the test.  If the test is in credentialing or pre-employment, psychometricians typically run a job analysis study to form a quantitative, scientific basis for the test blueprints.  A job analysis is necessary for a certification program to get accredited.  In Education, the test coverage is often defined by the curriculum.

 

How do we ensure the questions are good quality? (Item Writing)

There is a corpus of scientific literature on how to develop test items that accurately measure whatever you are trying to measure.  A great overview is the book by Haladyna.  This is not just limited to multiple-choice items, although that approach remains popular.  Psychometricians leverage their knowledge of best practices to guide the item authoring and review process in a way that the result is highly defensible test content.  Professional item banking software provides the most efficient way to develop high-quality content and publish multiple test forms, as well as store important historical information like item statistics.

 

How do we set a defensible cutscore? (Standard Setting)

Test scores are often used to classify candidates into groups, such as pass/fail (Certification/Licensure), hire/non-hire (Pre-Employment), and below-basic/basic/proficient/advanced (Education).  Psychometricians lead studies to determine the cutscores, using methodologies such as Angoff, Beuk, Contrasting-Groups, and Borderline.
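
To make the Angoff idea concrete, here is a minimal sketch in Python.  It assumes the common modified-Angoff setup, where each subject-matter expert estimates the probability that a minimally competent candidate answers each item correctly, and the recommended cutscore is the sum of the average ratings.  The ratings and the function name are hypothetical, made up purely for illustration; a real study involves rater training, discussion rounds, and review of the results.

```python
from statistics import mean

# Hypothetical ratings: one row per subject-matter expert, one column per item.
# Each value is that expert's estimate of the probability that a minimally
# competent candidate answers the item correctly.
ratings = [
    [0.60, 0.75, 0.40, 0.90, 0.55],  # rater 1
    [0.65, 0.70, 0.50, 0.85, 0.60],  # rater 2
    [0.55, 0.80, 0.45, 0.95, 0.50],  # rater 3
]


def angoff_cutscore(ratings):
    """Return the raw cutscore: the sum over items of the mean rating."""
    n_items = len(ratings[0])
    item_means = [mean(rater[i] for rater in ratings) for i in range(n_items)]
    return sum(item_means)


cutscore = angoff_cutscore(ratings)
print(f"Recommended cutscore: {cutscore:.2f} out of {len(ratings[0])} points")
```

In practice the panel's recommendation is usually reviewed against actual candidate data before a final cutscore is adopted.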

 

How do we analyze results to improve the exam? (Psychometric Analysis)

Psychometricians are essential for this step, as the statistical analyses can be quite complex.  Smaller testing organizations typically utilize classical test theory, which is based on simple mathematics like proportions and correlations.  Large, high-profile organizations typically use item response theory (IRT), which is based on a type of nonlinear regression analysis.  Psychometricians evaluate overall reliability of the test, difficulty and discrimination of each item, distractor analysis, possible bias, multidimensionality, linking multiple test forms/years, and much more.  Software such as  Iteman  and  Xcalibre  is also available for organizations with enough expertise to run statistical analyses internally.  Scroll down below for examples.
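
As a rough illustration of the "proportions and correlations" behind classical test theory, the sketch below computes an item P value (proportion correct), an uncorrected point-biserial (correlation of the item score with the total score), and KR-20 reliability.  The response matrix and function names are hypothetical; this is not how Iteman is implemented, only the textbook formulas under the assumption of dichotomously scored items.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical scored responses: rows are examinees, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]


def pearson(x, y):
    """Pearson correlation computed from population moments."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))


def item_statistics(responses):
    """Return a (P value, point-biserial) pair for each item.

    The point-biserial here is uncorrected: the item is included in the total.
    """
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        stats.append((mean(item), pearson(item, totals)))
    return stats


def kr20(responses):
    """KR-20 reliability for dichotomous items, using population variance."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    pq = sum(p * (1 - p) for p, _ in item_statistics(responses))
    return (k / (k - 1)) * (1 - pq / pvariance(totals))


stats = item_statistics(responses)
reliability = kr20(responses)
for number, (p, rpbis) in enumerate(stats, start=1):
    print(f"Item {number}: P = {p:.2f}, point-biserial = {rpbis:.2f}")
print(f"KR-20 = {reliability:.2f}")
```

Note that an item everyone answers the same way has zero variance, so its point-biserial is undefined; a production tool would need to handle that case.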

 

How do we compare scores across groups or years? (Equating)

This is referred to as linking and equating.  Some psychometricians devote their entire career to this topic.  If you are working on a certification exam, for example, you want to make sure that the passing standard is the same this year as last year.  If 76% of candidates passed last year and only 25% pass this year, not only will the candidates be angry, but there will be much less confidence in the meaning of the credential.
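
One of the simplest approaches is linear equating, sketched below under the assumption that the two groups of examinees are equivalent in ability, so that any difference in score distributions reflects the difficulty of the forms.  The score lists and function name are hypothetical.  Operational programs more often rely on anchor items or IRT-based methods, so treat this only as an illustration of the idea of putting scores on a common scale.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_new, scores_old):
    """Place a raw score x from the new form onto the old form's scale.

    Matches the mean and standard deviation of the two score distributions.
    """
    slope = pstdev(scores_old) / pstdev(scores_new)
    return slope * (x - mean(scores_new)) + mean(scores_old)


# Hypothetical raw scores from two equivalent groups.
last_year = [70, 75, 80, 85, 90]  # old form
this_year = [60, 65, 70, 75, 80]  # new form, which turned out to be harder

# A 65 on this year's harder form corresponds to a 75 on last year's form.
equated = linear_equate(65, this_year, last_year)
print(f"A raw score of 65 this year is equivalent to {equated:.1f} last year")
```

With these made-up numbers, holding candidates to the same raw cutscore on both forms would unfairly penalize this year's group; equating adjusts for the difference in form difficulty.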

 

How do we know the test is measuring what it should? (Validity)

Validity is the evidence provided to support score interpretations.  For example, we might interpret scores on a test to reflect knowledge of English, and we need to provide documentation and research supporting this.  There are several ways to provide this evidence.  A straightforward approach is to establish content-related evidence, which includes the test definition, blueprints, and item authoring/review.  In some situations, criterion-related evidence is important, which directly correlates test scores to another variable of interest.  Delivering tests in a secure manner is also essential for validity.

 

Where is Psychometrics Used?

Certification/Licensure/Credentialing

In certification testing, psychometricians develop the test via a documented chain of evidence following a sequence of research outlined by accreditation bodies, typically: job analysis, test blueprints, item writing and review, cutscore study, and statistical analysis.  Web-based item banking software like  FastTest  is typically useful because the exam committee often consists of experts located across the country or even throughout the world; they can then easily log in from anywhere and collaborate.

 

Pre-Employment

In pre-employment testing, validity evidence relies primarily on establishing appropriate content (a test on PHP programming for a PHP programming job) and the correlation of test scores with an important criterion like job performance ratings (shows that the test predicts good job performance).  Adaptive tests are becoming much more common in pre-employment testing because they provide several benefits, the most important of which is cutting test time by 50% – a big deal for large corporations that test a million applicants each year. Adaptive testing is based on item response theory, and requires a specialized psychometrician as well as specially designed software like  FastTest.

 

K-12 Education

Most assessments in education fall into one of two categories: lower-stakes formative assessment in classrooms, and higher-stakes summative assessments like year-end exams.  Psychometrics is essential for establishing the reliability and validity of higher-stakes exams, and for equating the scores across different years.  It is also important for formative assessments, which are moving towards adaptive formats because of the 50% reduction in test time, meaning that students spend less time testing and more time learning.

 

Universities

Universities typically do not give much thought to psychometrics even though a significant amount of testing occurs in higher education, especially with the move to online learning and MOOCs.  Given that many of the exams are high stakes (consider a certificate exam after completing a year-long graduate program!), psychometricians should be used in the establishment of legally defensible cutscores and in statistical analysis to ensure reliable tests, and professionally designed assessment systems used for developing and delivering tests, especially with enhanced security.

 

Medicine/Psychology

Have you ever taken a survey at your doctor’s office, or before/after a surgery?  Perhaps a depression or anxiety inventory at a psychotherapist?  Psychometricians have worked on these.

 

The Test Development Cycle

Psychometrics is the core of the test development cycle, which is the process of developing a strong exam.  The cycle sometimes goes by similar names, such as the assessment lifecycle.

[Figure: the test development cycle]

You will recognize some of the terms from the introduction earlier.  What we are trying to demonstrate here is that those questions are not standalone topics, or something you do once and simply file a report.  An exam is usually a living thing.  Organizations will often be republishing a new version every year or 6 months, which means that much of the cycle is repeated on that timeline.  Not all of it is; for example, many orgs only do a job analysis and standard setting every 5 years.

Consider a certification exam in healthcare.  The profession does not change quickly because things like anatomy never change and medical procedures rarely change (e.g., how to measure blood pressure).  So, every 5 years it does a job analysis of its certificants to see what they are doing and what is important.  This is then converted to test blueprints.  Items are re-mapped if needed, but most likely do not need it because there are probably only minor changes to the blueprints.  Then a new cutscore is set with the modified-Angoff method, and the test is delivered this year.  It is delivered again next year, but equated to this year rather than starting again.  However, the item statistics are still analyzed, which leads to a new cycle of revising items and publishing a new form for next year.

 

Example of Psychometrics in Action

Here is some output from our Iteman software.  It is a detailed analysis of a single question on English vocabulary, which checks whether the student knows the word alleviate.  About 70% of the students answered correctly, with a very strong point-biserial.  The distractor P values were all in the minority and the distractor point-biserials were negative, which adds evidence to the validity.  The graph shows that the line for the correct answer is going up while the others are going down, which is good.  If you are familiar with item response theory, you’ll notice how the blue line is similar to an item response function.  That is not a coincidence.

[Figure: Iteman output for the vocabulary item on alleviate]

 

Now, let’s look at another one, which is more interesting.  Here’s a vocab question about the word confectioner.  Note that only 37% of the students get it right, even though there is a 25% chance of getting it right just by guessing!  However, the point-biserial discrimination remains very strong at 0.49.  That means it is a really good item.  It’s just hard, which means it does a great job of differentiating amongst the top students.

[Figure: Iteman output for the vocabulary item on confectioner]

 

Psychometrics looks fun!  How can I join the band?

You will need a graduate degree.  I recommend you look at the NCME website with resources for students.  Good luck!

Already have a degree and looking for a job?  Here are the two sites that I recommend:

NCME – Also has a job listings page that is really good (ncme.org)

Horizon Search – Headhunter for Psychometricians and I/O Psychologists

Computerized adaptive testing is an AI-based approach to testing where the difficulty of the test is adapted to you based on your performance as you take the test.  If you do well, the items get more difficult, and if you do poorly, the items get easier.  If an accurate score is reached, the test stops early.  This means that the test becomes shorter, more accurate, more secure, more engaging, and fairer. 

The AI algorithms are almost always based on item response theory (IRT), an application of machine learning to assessment, but can be based on other models as well.  CAT is also called computer-adaptive testing or adaptive assessment, but “computerized adaptive testing” is used more often in the scientific literature.

This post will cover the following topics:

  1. What is computerized adaptive testing?
  2. How does an adaptive test adapt?
  3. What is an example of computerized adaptive testing?
  4. What are the advantages of computerized adaptive testing?
  5. How to develop a CAT that is valid and defensible
  6. What do I need to implement adaptive testing?

 

Quick FAQ

Let’s start with some quick FAQ.  Afterwards, we will delve into details about the machine learning algorithm.

How do computer adaptive tests work?

Computer adaptive tests adjust the difficulty of upcoming questions based on a test-taker's previous answers. The process starts with a question of medium difficulty; if answered correctly, a more difficult question follows. An incorrect answer leads to an easier question. This dynamic adjustment continues throughout the exam, creating a tailored testing experience that accurately measures the individual's ability level.

What is the purpose of computerized adaptive testing?

The purpose of Computerized Adaptive Testing (CAT) is to accurately measure an individual's proficiency with fewer questions and in less time. By tailoring question difficulty to each test-taker's performance, CAT ensures an efficient and secure testing process.

What are the pros and cons of computer adaptive testing?

Pros of computer adaptive testing include more efficient assessments (potentially saving millions of hours of time), greater student engagement, and enhanced test security. The main cons are the high cost and complexity of test development, which precludes CAT for small exams.

Is computer adaptive testing fair?

Yes. It is psychometrically more fair than a traditional, static test. Even though test-takers encounter different questions, the adaptive algorithm adjusts for question difficulty and they are scored on a percentile basis. This allows for an accurate assessment of each student's ability and provides employers with a fair basis to compare qualifications among candidates. A traditional test typically has mostly average items, leading to inaccurate scores for top or low students.

What is an example of an adaptive test?

The GRE (Graduate Record Examinations) is a prime example of an adaptive test. So are the NCLEX (nursing exam in the USA), the GMAT (business school admissions), and many formative assessments like the NWEA MAP.

 

Prefer to learn by doing?  Request a free account in FastTest, our powerful adaptive testing platform.


 

Computerized adaptive testing: What is it?

Computerized adaptive testing is an algorithm that personalizes how an assessment is delivered to each examinee.  It is coded into a software platform, using the machine-learning approach of IRT to select items and score examinees.  The algorithm proceeds in a loop until the test is complete.  This makes the test smarter, shorter, fairer, and more precise.

[Figure: diagram of the computerized adaptive testing algorithm]

The steps in the diagram above are adapted from Kingsbury and Weiss (1984), and are based on these components.

Components of a CAT

  1. Item bank calibrated with IRT
  2. Starting point (theta level before someone answers an item)
  3. Item selection algorithm (usually maximum Fisher information)
  4. Scoring method (e.g., maximum likelihood)
  5. Termination criterion (stop the test at 50 items, or when standard error is less than 0.30?  Both?)

How the components work

Let’s step through how it works.

For starters, you need an item bank that has been calibrated with a relevant psychometric or machine learning model.  That is, you can’t just write a few items and subjectively rank them as Easy, Medium, or Hard difficulty.  That’s an easy way to get sued.  Instead, you need to write a large number of items (rule of thumb is 3x your intended test length) and then pilot them on a representative sample of examinees.  The sample must be large enough to support the psychometric model you choose, and can range from 100 to 1000.  You then need to perform simulation research – more on that later.

Once you have an item bank ready, here is how the computerized adaptive testing algorithm works for a student that sits down to take the test.

  1. Starting point: there are three options to select the starting score, which psychometricians call theta
    1. Everyone gets the same value, like 0.0 (average, in the case of non-Rasch models)
    2. Randomized within a range, to help test security and item exposure
    3. Predicted value, perhaps from external data, or from a previous exam
  2. Select item
    1. Find the item in the bank that has the highest information value
    2. Often, you need to balance this with practical constraints such as Item Exposure or Content Balancing
  3. Score examinee
    1. Score the examinee; if using IRT, perhaps maximum likelihood or Bayes modal
  4. Evaluate termination criterion: using a predefined rule supported by your simulation research
    1. Is a certain level of precision reached, such as a standard error of measurement <0.30
    2. Are there no good items left in the bank
    3. Has a time limit been reached
    4. Has a Max Items limit been reached

The algorithm works by looping through 2-3-4 until the termination criterion is satisfied.
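
The loop above can be sketched in a few dozen lines of Python.  This is a minimal illustration under stated assumptions, not the FastTest implementation: it assumes a two-parameter logistic (2PL) IRT model, selects items by maximum Fisher information, and stops at a standard error below 0.30 or a cap of 50 items.  For scoring it swaps in expected a posteriori (EAP) estimation with a standard normal prior instead of maximum likelihood or Bayes modal, because EAP is easy to compute on a grid and is defined even when all responses so far are correct or incorrect.  The item bank, the simulated examinee, and all function names are hypothetical, and practical constraints like item exposure and content balancing are left out.

```python
import math
import random


def prob_correct(theta, a, b):
    """2PL item response function: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def information(theta, a, b):
    """Fisher information of a 2PL item at a given theta."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)


GRID = [g / 10.0 for g in range(-40, 41)]  # theta values from -4.0 to 4.0


def eap_score(administered, responses):
    """Return the EAP theta estimate and posterior SD (used as the SE)."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized standard normal prior
        for (a, b), u in zip(administered, responses):
            p = prob_correct(t, a, b)
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    theta = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - theta) ** 2 * w for t, w in zip(GRID, weights)) / total
    return theta, math.sqrt(var)


def run_cat(bank, answer, start_theta=0.0, se_target=0.30, max_items=50):
    """Loop through select, score, and evaluate until a stopping rule is met."""
    theta, se = start_theta, float("inf")  # Step 1: starting point
    remaining = list(range(len(bank)))
    administered, responses = [], []
    # Step 4: continue while items remain, the cap is not hit, and SE is too high
    while remaining and len(administered) < max_items and se > se_target:
        # Step 2: unused item with the most information at the current theta
        best = max(remaining, key=lambda i: information(theta, *bank[i]))
        remaining.remove(best)
        administered.append(bank[best])
        responses.append(answer(bank[best]))
        # Step 3: rescore the examinee
        theta, se = eap_score(administered, responses)
    return theta, se, len(administered)


# Hypothetical calibrated bank of 200 items as (a, b) parameter pairs.
rng = random.Random(42)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for _ in range(200)]
true_theta = -1.0  # simulated examinee of below-average ability


def simulated_examinee(item):
    """Answer an item at random according to the 2PL model."""
    a, b = item
    return 1 if rng.random() < prob_correct(true_theta, a, b) else 0


theta, se, n_items = run_cat(bank, simulated_examinee)
print(f"Estimated theta = {theta:.2f}, SE = {se:.2f}, items used = {n_items}")
```

Running many simulated examinees through a loop like this, across a range of true theta values, is the basic idea behind the simulation research mentioned earlier.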

 

Computer adaptive testing software: How to implement CAT

Our revolutionary platform, FastTest, makes it easy to publish a CAT.  Once you upload your item texts and the IRT parameters, you can choose whatever options you please for steps 2-3-4 of the algorithm, simply by clicking on elements in our easy-to-use interface.  Want to try it yourself?  Contact us to set up a free account and demo.

But of course, there are many technical considerations that affect the quality and defensibility of your CAT – we’ll be talking about those later in this post.

[Figure: computerized adaptive testing options in FastTest]

How does the test adapt? By Difficulty or Quantity?

CATs operate by adapting both the difficulty and quantity of items seen by each examinee.

Difficulty
Most characterizations of computerized adaptive testing focus on how item difficulty is matched to examinee ability. High-ability examinees receive more difficult items, while low-ability examinees receive easier items, which has important benefits to the student and the organization. An adaptive test typically begins by delivering an item of medium difficulty; if you get it correct, you get a tougher item, and if you get it incorrect, you get an easier item. This basic algorithm continues until the test is finished, though it usually includes sub-algorithms for important things like content distribution and item exposure.

Quantity
A less publicized facet of adaptation is the number of items. Adaptive tests can be designed to stop when certain psychometric criteria are reached, such as a specific level of score precision. Some examinees finish very quickly with few items, so that adaptive tests are typically about half as many questions as a regular test, with at least as much accuracy. Since some examinees have longer tests, these adaptive tests are referred to as variable-length. Obviously, this makes for a massive benefit: cutting testing time in half, on average, can substantially decrease testing costs.

Some adaptive tests use a fixed length, and only adapt item difficulty. This is merely for public relations issues, namely the inconvenience of dealing with examinees who feel they were unfairly treated by the CAT, even though it is arguably more fair and valid than conventional tests.

In general, it is best practice to meld the two: allow test length to be shorter or longer, but put caps on either end that prevent inadvertently too-short tests or tests that could potentially go on to 400 items.  For example, the NCLEX has a minimum length of 75 items and a maximum length of 145 items.

 

An example of the computerized adaptive testing algorithm

Let’s walk through an oversimplified example.  Here, we have an item bank with 5 questions.  We will start with an item of average difficulty, and answer as would a student of below-average ability.

Below are the item information functions for five items in a bank.  Let’s suppose the starting theta is 0.0.  

[Figure: item information functions for the five items]

 

  1. We find the first item to deliver.  Which item has the highest information at 0.0?  It is Item 4.
  2. Suppose the student answers incorrectly.
  3. We run the IRT scoring algorithm, and suppose the score is -2.0.  
  4. Check the termination criterion; we certainly aren’t done yet, after 1 item.
  5. Find the next item.  Which has the highest information at -2.0?  Item 2.
  6. Suppose the student answers correctly.
  7. We run the IRT scoring algorithm, and suppose the score is -0.8.  
  8. Evaluate termination criterion; not done yet.
  9. Find the next item.  Item 2 is the highest at -0.8 but we already used it.  Item 4 is next best, but we already used it.  So the next best is Item 1.
  10. Item 1 is very easy, so the student gets it correct.
  11. New score is -0.2.
  12. Best remaining item at -0.2 is Item 3.
  13. Suppose the student gets it incorrect.
  14. New score is perhaps -0.4.
  15. Evaluate termination criterion.  Suppose that the test has a max of 4 items, an extremely simple criterion.  We have met it.  The test is now done and automatically submitted.

 

Want to take an adaptive test yourself and see how it adapts?  Here is a link to take an English Vocabulary test.

TAKE EXAMPLE ADAPTIVE TEST

 

Advantages/benefits of computerized adaptive testing

By making the test more intelligent, adaptive testing provides a wide range of benefits.  Some of the well-known advantages of adaptive testing, recognized by scholarly psychometric research, are listed below.  
 
However, the development of an adaptive test is a very complex process that requires substantial expertise in item response theory (IRT) and CAT simulation research.  Our experienced team of psychometricians can provide your organization with the requisite experience to implement adaptive testing and help your organization benefit from these advantages. Contact us to learn more.
 

Shorter tests

Research has found that adaptive tests produce anywhere from a 50% to 90% reduction in test length.  This is no surprise.  Suppose you have a pool of 100 items.  A top student is practically guaranteed to get the easiest 70 correct; only the hardest 30 will make them think.  Vice versa for a low student.  Middle-ability students do not need the super-hard or the super-easy items.

Why does this matter?  Primarily, it can greatly reduce costs.  Suppose you are delivering 100,000 exams per year in testing centers, and you are paying $30/hour.  If you can cut your exam from 2 hours to 1 hour, you just saved $3,000,000.  Yes, there will be increased costs from the use of adaptive assessment, but you will likely save money in the end.

For the K12 assessment, you aren’t paying for seat time, but there is the opportunity cost of lost instruction time.  If students are taking formative assessments 3 times per year to check on progress, and you can reduce each by 20 minutes, that is 1 hour; if there are 500,000 students in your State, then you just saved 500,000 hours of learning.
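
The arithmetic behind both estimates is simple enough to put in a couple of helper functions, which makes it easy to plug in your own volumes and rates.  The function names are hypothetical, and the sketch ignores the added costs of developing and running an adaptive test.

```python
def seat_time_savings(exams_per_year, hourly_rate, hours_before, hours_after):
    """Dollars saved per year when a proctored exam gets shorter."""
    return exams_per_year * hourly_rate * (hours_before - hours_after)


def instruction_hours_saved(students, tests_per_year, minutes_saved_per_test):
    """Hours of instruction time recovered per year across all students."""
    return students * tests_per_year * minutes_saved_per_test / 60


# Testing-center example: 100,000 exams at $30/hour, cut from 2 hours to 1 hour.
dollars = seat_time_savings(100_000, 30, 2, 1)

# K-12 example: 500,000 students, 3 tests per year, 20 minutes saved per test.
hours = instruction_hours_saved(500_000, 3, 20)

print(f"Seat time saved: ${dollars:,.0f}")
print(f"Instruction time saved: {hours:,.0f} hours")
```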

More precise scores

CAT will make tests more accurate, in general.  It does this by designing the algorithms specifically around how to get more accurate scores without wasting examinee time.

More control of score precision (accuracy)

CAT ensures that all students will have the same accuracy, making the test much fairer.  Traditional tests measure the middle students well but not the top or bottom students.  Is it better that A) students see the same items but can have drastically different accuracy of scores, or B) students have equivalent accuracy of scores, but see different items?

Better test security

Since all students are essentially getting an assessment that is tailored to them, there is better test security than everyone seeing the same 100 items.  Item exposure is greatly reduced; note, however, that this introduces its own challenges, and adaptive assessment algorithms have item exposure considerations of their own.

A better experience for examinees, with reduced fatigue

Adaptive assessments will tend to be less frustrating for examinees across all ranges of ability.  Moreover, implementing variable-length stopping rules (e.g., once we know you are a top student, we don’t give you the 70 easy items) reduces fatigue.

Increased examinee motivation

Since examinees only see items relevant to them, this provides an appropriate challenge.  Low-ability examinees will feel more comfortable and get many more items correct than with a linear test.  High-ability students will get the difficult items that make them think.

Frequent retesting is possible

The whole “unique form” idea applies to the same student taking the same exam twice.  Suppose you take the test in September, at the beginning of a school year, and take the same one again in November to check your learning.  You’ve likely learned quite a bit and are higher on the ability range; you’ll get more difficult items, and therefore a new test.  If it was a linear test, you might see the same exact test.

This is a major reason that adaptive assessment plays a formative role in K-12 education, delivered several times per year to millions of students in the US alone.

Individual pacing of tests

Examinees can move at their own speed.  Some might move quickly and be done in only 30 items.  Others might waver, also seeing 30 items but taking more time.  Still, others might see 60 items.  The algorithms can be designed to maximize the process.

Advantages of computerized testing in general

Of course, the advantages of using a computer to deliver a test are also relevant.  Here are a few:
  • Immediate score reporting
  • On-demand testing can reduce printing, scheduling, and other paper-based concerns
  • Storing results in a database immediately makes data management easier
  • Computerized testing facilitates the use of multimedia in items
  • You can immediately run psychometric reports
  • Timelines are reduced with an integrated item banking system

 

How to develop an adaptive assessment that is valid and defensible

CATs are the future of assessment. They operate by adapting both the difficulty and number of items to each individual examinee. The development of an adaptive test is no small feat, and requires five steps integrating the expertise of test content developers, software engineers, and psychometricians.

The development of a quality adaptive test is complex and requires experienced psychometricians in both item response theory (IRT) calibration and CAT simulation research. FastTest can provide you the psychometrician and software; if you provide test items and pilot data, we can help you quickly publish an adaptive version of your test.

   Step 1: Feasibility, applicability, and planning studies. First, extensive Monte Carlo simulation research must occur, and the results formulated as business cases, to evaluate whether adaptive testing is feasible, applicable, or even possible.

   Step 2: Develop item bank. An item bank must be developed to meet the specifications recommended by Step 1.

   Step 3: Pretest and calibrate item bank. Items must be pilot tested on 200-1000 examinees (depends on IRT model) and analyzed by a Ph.D. psychometrician.

   Step 4: Determine specifications for final CAT. Data from Step 3 is analyzed to evaluate CAT specifications and determine most efficient algorithms using CAT simulation software such as CATSim.

   Step 5: Publish live CAT. The adaptive test is published in a testing engine capable of fully adaptive tests based on IRT.  There are not very many of them out in the market.  Sign up for a free account in our platform FastTest and try for yourself!

Want to learn more about our one-of-a-kind model? Click here to read the seminal article by our two co-founders.  More adaptive testing research is available here.

 

Minimum requirements for computerized adaptive testing

Here are some minimum requirements to evaluate if you are considering a move to the CAT approach.

  • A large item bank piloted so that each item has at least 100 valid responses (Rasch model) or 500 (3PL model)
  • 500 examinees per year
  • Specialized IRT calibration and CAT simulation software like Xcalibre and CATSim.
  • Staff with a Ph.D. in psychometrics or an equivalent level of experience. Or, leverage our internationally recognized expertise in the field.
  • Items (questions) that can be scored objectively correct/incorrect in real-time
  • An item banking system and CAT delivery platform
  • Financial resources: Because it is so complex, the development of a CAT will cost at least $10,000 (USD) — but if you are testing large volumes of examinees, it will be a significantly positive investment. If you pay $20/hour for proctoring seats and cut a test from 2 hours to 1 hour for just 1,000 examinees… that’s a $20,000 savings.  If you are doing 200,000 exams?  That is $4,000,000 in seat time that is saved.

 

Adaptive testing: Resources for further reading

Visit the links below to learn more about adaptive assessment.  

 

How can I start developing a CAT?

Contact us to sign up for a free account in our industry-leading CAT platform or to discuss with one of our PhD psychometricians.

 

Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment.  In fact, it’s been around far longer than “machine learning” and “artificial intelligence” have been buzzwords in the general public!  The field of psychometrics has been doing such groundbreaking work for decades.

So how does AES work, and how can you apply it?

 

 

 

What is automated essay scoring?

The first and most critical thing to know is that there is not an algorithm that “reads” the student essays.  Instead, you need to train an algorithm.  That is, if you are a teacher and don’t want to grade your essays, you can’t just throw them in an essay scoring system.  You have to actually grade the essays (or at least a large sample of them) and then use that data to fit a machine learning algorithm.  Data scientists use the term train the model, which sounds complicated, but if you have ever done simple linear regression, you have experience with training models.

 

There are three steps for automated essay scoring:

  1. Establish your data set (collate student essays and grade them).
  2. Determine the features (predictor variables that you want to pick up on).
  3. Train the machine learning model.

 

Here’s an extremely oversimplified example:

  1. You have a set of 100 student essays, which you have scored on a scale of 0 to 5 points.
  2. The essay is on Napoleon Bonaparte, and you want students to know certain facts, so you want to give them “credit” in the model if they use words like: Corsica, Consul, Josephine, Emperor, Waterloo, Austerlitz, St. Helena.  You might also add other features such as word count, number of grammar errors, number of spelling errors, etc.
  3. You create a map of which students used each of these words, as 0/1 indicator variables.  You can then fit a multiple regression with 7 predictor variables (did they use each of the 7 words) and the 0 to 5 score as your criterion variable.  You can then use this model to predict each student’s score from just their essay text.
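
The three steps in this example can be sketched in plain Python. This is only an illustration under loud assumptions: the eight one-sentence “essays” and their scores are invented so that the regression has an exact solution, the names `indicators`, `fit_ols`, and `predict` are hypothetical, and a real study would use far more essays and a statistics package rather than a hand-written solver.

```python
import re

KEYWORDS = ["corsica", "consul", "josephine", "emperor",
            "waterloo", "austerlitz", "helena"]


def indicators(essay):
    """0/1 flags: does the essay use each keyword at least once?"""
    words = set(re.findall(r"[a-z]+", essay.lower()))
    return [1 if k in words else 0 for k in KEYWORDS]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(X, y):
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + [float(v) for v in x] for x in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(weights, essay):
    """Predicted score: intercept plus the weights of the keywords used."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], indicators(essay)))


# Invented training data: tiny "essays" with human scores on the 0 to 5 scale.
essays = ["Napoleon was a French general.", "He was born on Corsica.",
          "He became First Consul.", "He married Josephine.",
          "He crowned himself Emperor.", "He lost at Waterloo.",
          "He won at Austerlitz.", "He died on St. Helena."]
scores = [0, 1, 1, 1, 2, 2, 2, 1]

weights = fit_ols([indicators(e) for e in essays], scores)
predicted = predict(weights, "Born on Corsica, he became Emperor and lost at Waterloo.")
print(round(predicted, 2))
```

Note that the solver will fail with a division by zero if the design matrix is not full rank (for example, if a keyword never appears in any essay), which is one reason real projects rely on established regression routines.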

 

Obviously, this example is too simple to be of use, but the same general idea is applied in massive, complex studies.  Establishing the core features (predictor variables) can be much more complex, and the models are going to be much more complex than multiple regression (neural networks, random forests, support vector machines).

Here’s an example of the very start of a data matrix for features, from an actual student essay.  Imagine that you also have data on the final scores, 0 to 5 points.  You can see how this is then a regression situation.

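| Examinee | Word Count | i_have | best_jump | move | and_that | the_kids | well |
|---|---|---|---|---|---|---|---|
| 1 | 307 | 0 | 1 | 2 | 0 | 0 | 1 |
| 2 | 164 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 348 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 371 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 | 446 | 0 | 0 | 0 | 0 | 0 | 2 |
| 6 | 364 | 1 | 0 | 0 | 0 | 1 | 1 |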
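
A row of such a matrix could be produced along these lines. This is a minimal sketch, assuming that underscore-joined column names such as `i_have` denote two-word phrases; the sample sentence and the function name `feature_row` are invented for illustration.

```python
import re
from collections import Counter

# Column names taken from the matrix above; underscores are assumed to mark two-word phrases.
TERMS = ["i_have", "best_jump", "move", "and_that", "the_kids", "well"]


def feature_row(essay, terms=TERMS):
    """Return [word count, count of each term] for one essay."""
    tokens = re.findall(r"[a-z']+", essay.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    counts = Counter(tokens + bigrams)
    return [len(tokens)] + [counts[t] for t in terms]


# Invented one-sentence "essay" for demonstration.
row = feature_row("I have the best jump and that move went well")
print(row)
```

Stacking one such row per examinee, next to the human scores, gives the regression-style data set described above.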

 

How do you score the essay?

If they are on paper, then automated essay scoring won’t work unless you have extremely good character recognition software that converts them into a digital database of text.  Most likely, you have delivered the exam as an online assessment and already have the database.  If so, your platform should include functionality to manage the scoring process, including multiple custom rubrics.  An example from our FastTest platform is provided below.

[Image: FastTest essay marking screen]

Some rubrics you might use:

  • Grammar
  • Spelling
  • Content
  • Style
  • Supporting arguments
  • Organization
  • Vocabulary / word choice

 

How do you pick the Features?

This is one of the key research problems.  In some cases, it might be something similar to the Napoleon example.  Suppose you had a complex item on Accounting, where examinees review reports and spreadsheets and need to summarize a few key points.  You might pull out a few key terms (mortgage amortization) or numbers (2.375%) and treat them as features.  I saw a presentation at Innovations In Testing 2022 that did exactly this.  Think of it as giving the students “points” for using those keywords, though because you are using complex machine learning models, it is not simply a single point per keyword.  Each keyword contributes towards a regression-like model with a positive slope.

In other cases, you might not know.  Maybe it is an item on an English test being delivered to English language learners, and you ask them to write about what country they want to visit someday.  You have no idea what they will write about.  But what you can do is tell the algorithm to find the words or terms that are used most often, and try to predict the scores with those.  Maybe words like “jetlag” or “edification” show up in essays from students who tend to get high scores, while words like “clubbing” or “someday” tend to be used by students with lower scores.  The AI might also pick up on spelling errors.  I worked as an essay scorer in grad school, and I can’t tell you how many times I saw kids use “ludacris” (name of an American rap artist) instead of “ludicrous” when trying to describe an argument.  They had literally never seen the word used or spelled correctly.  Maybe the AI model learns to give that a negative weight.   That’s the next section!
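
The idea of letting the data suggest the features can be sketched as follows. Everything here is an assumption for illustration: the four short essays and their scores are invented, the helper names are hypothetical, and the simple “mean score with the term minus mean score without it” is a stand-in for the weights that a real machine learning model would estimate.

```python
import re
from collections import Counter
from statistics import mean


def tokens(essay):
    """Set of lowercase words used in one essay."""
    return set(re.findall(r"[a-z]+", essay.lower()))


def candidate_terms(essays, min_df=2):
    """Terms used in at least `min_df` essays, most common first."""
    df = Counter()
    for essay in essays:
        df.update(tokens(essay))
    keep = [t for t in df if df[t] >= min_df]
    return sorted(keep, key=lambda t: (-df[t], t))


def score_gap(term, essays, scores):
    """Mean score of essays using the term minus mean score of the rest."""
    used = [s for e, s in zip(essays, scores) if term in tokens(e)]
    unused = [s for e, s in zip(essays, scores) if term not in tokens(e)]
    return mean(used) - mean(unused)


# Invented essays and human scores (0 to 5).
essays = ["I want to visit Japan despite the jetlag",
          "The jetlag is worth it to visit Peru",
          "I want to go clubbing in Spain",
          "Someday I will go clubbing"]
scores = [5, 4, 2, 1]

terms = candidate_terms(essays)
for term in ("jetlag", "clubbing"):
    print(term, score_gap(term, essays, scores))
```

A positive gap suggests a term that tends to appear in higher-scoring essays, and a negative gap the opposite; with real data you would also want to filter out very common words before modeling.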

How do you train a model?


Well, if you are familiar with data science, you know there are TONS of models, and many of them have a bunch of parameterization options.  This is where more research is required.  What model works the best on your particular essay, and doesn’t take 5 days to run on your data set?  That’s for you to figure out.  There is a trade-off between simplicity and accuracy.  Complex models might be accurate but take days to run.  A simpler model might take 2 hours but with a 5% drop in accuracy.  It’s up to you to evaluate.

If you have experience with Python and R, you know that there are many packages which provide this analysis out of the box – it is a matter of selecting a model that works.

How well does automated essay scoring work?

Well, as psychometricians love to say, “it depends.”  You need to do the model fitting research for each prompt and rubric.  It will work better for some than others.  The general consensus in research is that AES algorithms work as well as a second human rater, and therefore serve very well in that role.  Ideally you shouldn’t use them as the only score, although having a human score every essay is not feasible in many cases.

Here’s a graph from some research we did on our algorithm, showing the correlation of human to AES.  The three lines are for the proportion of sample used in the training set; we saw decent results from only 10% in this case!  Some of the models correlated above 0.80 with humans, even though this is a small data set.   We found that the Cubist model took a fraction of the time needed by complex models like Neural Net or Random Forest; in this case it might be sufficiently powerful.

[Figure: Automated essay scoring results]
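
Agreement between human and machine scores can be quantified as in the sketch below. The Pearson correlation is the statistic described above; quadratic weighted kappa is added here as an assumption, since it is another agreement index often reported in AES research but is not mentioned in the study above. The two score vectors are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Chance-corrected agreement that penalizes large disagreements more."""
    n = max_score - min_score + 1
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    total = len(a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / total
    return 1 - num / den


# Invented scores on the 0 to 5 scale for eight essays.
human = [0, 1, 2, 3, 4, 5, 3, 2]
machine = [0, 1, 2, 3, 4, 5, 2, 3]

r = pearson(human, machine)
qwk = quadratic_weighted_kappa(human, machine, 0, 5)
print(round(r, 3), round(qwk, 3))
```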

 

How can I implement automated essay scoring without writing code from scratch?

There are several products on the market.  Some are standalone, and some are integrated with a human-based essay scoring platform.  ASC’s platform for automated essay scoring is SmartMarq.  It is currently offered as a standalone tool like you see below, making it extremely easy to use.  It is also in the process of being integrated into our online assessment platform, alongside human scoring, to provide an efficient and easy way of obtaining a second or third rater for QA purposes.

Want to learn more?  Contact us to request a demonstration.

 

[Image: SmartMarq automated essay scoring]

 

Artificial intelligence (AI) is poised to address some of the challenges that education deals with today by innovating teaching and learning processes. By applying AI in education technologies, educators can determine student needs more precisely, keep students more engaged, improve learning, and adapt teaching accordingly to boost learning outcomes. The use of AI in education began in the 1970s with the search for a substitute for one-on-one tutoring and has seen many improvements since then. This article looks at some of the latest AI developments used in education, their potential impact, and their drawbacks.

Application of AI


Recently, AI technologies have permeated all aspects of the educational process. Research going back to 2009 shows that AI has been extensively employed in management, instruction, and learning. In management, AI tools are used to review and grade student assignments, sometimes even more accurately than educators do. There are AI-based interactive tools that teachers use to build and share student knowledge. Learning can be enhanced through customization and personalization of content, enabled by new technological systems that leverage machine learning (ML) and adaptability.

Below is a list of major educational areas where AI technologies are actively involved and that merit further development.

  • Personalized learning: This educational approach tailors the learning trajectory to individual student needs and interests. AI algorithms analyze student information (e.g. learning style and performance) to create customized learning paths. Based on student strengths and weaknesses, AI recommends exercises and learning materials.  AI technologies are increasingly pivotal in online learning apps, personalizing education and making it more accessible to a diverse learner base.
  • Adaptive learning: This approach does the same as personalized learning but in real time, which keeps learners engaged and motivated. ALEKS is a good example of an adaptive learning program.
  • Learning courses: These are AI-powered online platforms designed for eLearning and course management that enable learners to browse for specific courses and study at their own pace. These platforms offer learning activities in increasing order of difficulty, aimed at the ultimate educational goals. Examples include advanced Learning Management Systems (LMS) and Massive Open Online Courses (MOOCs).
  • Learning assistants/Teaching robots: AI-based assistants can supply support and resources to learners upon request. They can respond to questions, provide personalized feedback, and guide students through learning content. Such virtual assistants might be especially helpful for learners who cannot access offline support.
  • Adaptive testing: In this mode of test delivery, each examinee responds to questions that correspond to their level of expertise, based on their previous responses. This is possible thanks to AI algorithms enabled by ML and psychometric methods such as item response theory (IRT). You can get more information about adaptive testing from Nathan Thompson’s blog post.
  • Remote proctoring: This is a type of software that allows examiners to coordinate an assessment process remotely while maintaining confidentiality and preventing examinees from cheating. In addition, a virtual proctor can assist examinees in resolving any issues that arise during the process. The functionality of proctoring software can differ substantially depending on the stakes of the exam and the preferences of stakeholders. You can read more on this topic on the ASC blog.
  • Test assembly: Automated test assembly (ATA) is a widely used, valid, and efficient method of test construction based on either classical test theory (CTT) or item response theory (IRT). ATA lets you assemble test forms that are equivalent in terms of content distribution and psychometric statistics in seconds. ASC has designed TestAssembler to minimize the laborious and time-consuming process of form building.
  • Automated grading: Grading student assignments is one of the biggest challenges that educators face. AI-powered grading systems automate this routine work, reducing bias and inconsistencies in assessment results and increasing validity. ASC has developed an AI essay scoring system, SmartMarq. If you are interested in automated essay scoring, you should definitely read this post.
  • Item generation: Teachers are often asked to write a bunch of items for assessment purposes, as if they were not already busy with lesson planning and other drudgery. Automated item generation saves time and helps produce quality items.
  • Search engine: The days when libraries were the main source of information are long gone, so now we mostly rely on huge search engines built to carry out web searches. AI-powered search engines help us find an abundance of information; search results heavily depend on how we formulate our queries, choose keywords, and navigate between different sites. One of the biggest search engines is Google.
  • Chatbot: Last but not least, chatbots are software applications that employ AI and natural language processing (NLP) to hold human-like conversations with people. AI-powered chatbots can provide learners with additional personalized support and resources. ChatGPT is probably the best-known example of a chatbot today.

 

Highlights of AI and challenges to address


Today, AI-powered functions such as speech recognition, NLP, and emotion detection are revolutionizing education. AI technologies make it possible to identify patterns, build algorithms, represent knowledge, sense, make and follow plans, maintain true-to-life interactions with people, manage complex learning activities, magnify human abilities in learning contexts, and support learners in accordance with their individual interests and needs. AI allows students to use handwriting, gestures, or speech as input while studying or taking a test.

Along with numerous opportunities, the evolution of AI brings risks and challenges that should be thoroughly investigated and addressed. When using AI in education, it is important to proceed with caution so that it is done in a responsible and ethical way, and not to get swept up in the hype, since some AI tools draw on billions of data points that are available to everyone on the web. Another challenge is the variability in AI performance: some functions are performed at a superior level (such as identifying patterns in data), while others are still quite primitive (such as sustaining an in-depth conversation). Even though AI is very powerful, human beings still play a crucial role in verifying AI’s output to avoid plagiarism and falsification of information.

Conclusion

AI is already massively applied in education around the world. With the right guidance and frameworks in place, AI-powered technologies can help build more efficient and equitable learning experiences. Today we have an opportunity to witness how AI- and ML-based approaches contribute to development of individualized, personalized, and adaptive learning.

ASC’s CEO, Dr Thompson, presented several topics on AI at the 2023 ATP Conference in Dallas, TX. If you are interested in utilizing AI-powered services provided by ASC, please do not hesitate to contact us!


Gamification in assessment and psychometrics presents new opportunities to improve the quality of exams. While many adults view games with caution because of their potentially detrimental and addictive effect on young minds, games can be extremely beneficial for learning and assessment if employed thoughtfully. Gamification not only provides learners with multiple opportunities to learn in context, but is also instrumental in developing the digital literacy skills that are so necessary in modern times.

What is Gamification?

Gamification means that elements of games (such as point-scoring, team collaboration, competition, and prizes) are incorporated into processes that would not otherwise have them. For example, software for managing a Sales team might award points for the number of phone calls and emails, split the group into two “teams” that compete against each other on those points, and award a prize at the end of the month. Such ideas can also be incorporated into learning and assessment. A student might get points for each module they complete correctly, and a badge for each test they pass to show mastery of a skill, which are then displayed on their profile in the learning system.

Gamification equals motivation?

It is a fact that learning is much more effective when learners are motivated. What can motivate learners, you might ask? Engagement comes first, because it is the core of learning. Engaged learners grasp knowledge because they are interested in the learning process and the material itself, and they are curious to discover more. In contrast, unengaged learners just wait for the lesson to end.

A traditional educational process usually involves several lessons where students learn one unit, and at the end of this unit, they take a cumulative test that gauges their level of acquisition. This model usually provides minimal context for learning throughout the unit, so learners are expected simply to learn and memorize things until they are given a chance to succeed or fail on the test.

Gamification can change this approach. When lessons and tests are gamified, learners obtain an opportunity to learn in context and use their associations and imagination—they become participants of the process, not just executors of instructions.

Gamification: challenges and ways to overcome them

While gamified learning and assessment can be very effective, they might be challenging for educators in terms of development and implementation. Below are some challenges and how they can be tackled.

| Challenge | Solution |
|---|---|
| More work | Interactive lessons containing gamified elements demand more time and effort from educators, which is why many of them, overwhelmed with other obligations, give up and stick with a traditional style of teaching. However, if the whole team does the planning and preparation before starting a new unit, there will be less work and less stress. |
| Preparation | Gamified learning and assessment can be difficult for educators who lack creativity or experience. Senior managers, such as heads of departments, should take the lead here by organizing courses and supporting their staff. |
| Distraction | When developing gamified learning or assessment, it is important not to get distracted by fancy features and to stay focused on the targeted learning objectives. |
| Individual needs | Gamified learning and assessment cannot be one-size-fits-all, so educators will have to customize their materials to meet learner needs. |

Gamified assessment

Psychometric tests have been evolving over time to provide more benefits to educators and learners, employers and candidates, and other stakeholders. Gamification is the next stage in the evolutionary process after having gained positive feedback from scientists and practitioners.

Human resources departments apply gamified assessment in the hiring process, much like psychometric tests that evaluate a candidate’s knowledge and skills. However, game-based assessment is quicker and more engaging than traditional aptitude tests due to its user-friendly and interactive format. The same is true of computerized adaptive testing (CAT), and I believe the two can complement each other to double the benefits.

There are several ways to incorporate gamification into assessment. Here are some ideas, but this is by no means exhaustive.

| Aspect | Example |
|---|---|
| High fidelity items and/or assignments | Instead of using multiple choice items to ask about a task (e.g., operating a construction crane), create a simulation that is similar to a game. |
| Badging | Candidates win badges for passing exams, which can be displayed in places like their LinkedIn profile or email signature. |
| Points | Obviously, most tests have “points” as part of the exam score, but points can be used in other ways, such as how many modules/quizzes you pass per month. |
| Teams | Subdivide a class or other group into teams, and have them compete on other aspects. |

Drawing on my personal experience, I remember using the kahoot.it tool in my Math classes to interact with students and make them more engaged in formative assessment activities. Students were highly motivated to take such tests because they were rewarding: it felt like a competition, and sometimes they got sweets. It was fun!

Summary

Obviously, gamified learning and assessment require more time and effort from creators than traditional non-gamified ones, but they are worth it. Both educators and learners are likely to benefit from this experience in different ways. If you are ready to apply gamified assessment by employing CAT technologies, our experts are ready to help. Contact us!

 

Multistage testing (MST) is a type of computerized adaptive testing (CAT).  This means it is an exam delivered on computers which dynamically personalize it for each examinee or student.  Typically, this is done with respect to the difficulty of the questions, by making the exam easier for lower-ability students and harder for high-ability students.  Doing this makes the test shorter and more accurate while providing additional benefits.  This post will provide more information on multistage testing so you can evaluate if it is a good fit for your organization.

Already interested in MST and want to implement it?  Contact us to talk to one of our experts and get access to our powerful online assessment platform, where you can create your own MST and CAT exams in a matter of hours.

 

What is multistage testing?

[Figure: Multistage testing algorithm]

Like CAT, multistage testing adapts the difficulty of the items presented to the student. But while adaptive testing works by adapting each item one by one using item response theory (IRT), multistage works in blocks of items.  That is, CAT will deliver one item, score it, pick a new item, score it, pick a new item, etc.  Multistage testing will deliver a block of items, such as 10, score them, then deliver another block of 10.

The design of a multistage test is often referred to as panels.  There is usually a single routing test or routing stage which starts the exam, and then students are directed to different levels of panels for subsequent stages.  The number of levels is sometimes used to describe the design; the example on the right is a 1-3-3 design.  Unlike CAT, there are only a few potential paths, unless each stage has a pool of available testlets.
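
Routing through a 1-3-3 panel can be sketched as below. This is an illustration under stated assumptions: the level names, the 10-item block size, and the number-correct cutoffs of 4 and 7 are all hypothetical, and operational programs may route on IRT-based scores instead of number-correct.

```python
from itertools import product

LEVELS = ("easy", "medium", "hard")


def route(num_correct, cutoffs=(4, 7)):
    """Pick the next-stage level from the score on a 10-item block (hypothetical cutoffs)."""
    low, high = cutoffs
    if num_correct < low:
        return "easy"
    if num_correct < high:
        return "medium"
    return "hard"


def mst_path(block_scores):
    """Path through a 1-3-3 panel given number-correct on stages 1 and 2."""
    return ["routing"] + [route(score) for score in block_scores]


# An examinee who gets 8 of 10 on the routing stage and 5 of 10 on stage 2.
path = mst_path([8, 5])
print(path)

# Every possible path through a 1-3-3 design: 1 x 3 x 3 = 9 in total.
all_paths = list(product(["routing"], LEVELS, LEVELS))
print(len(all_paths))
```

The small, fixed number of paths is what makes it practical to review every form an examinee could see, something that is not feasible with item-by-item CAT.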

As with item-by-item CAT, multistage testing is almost always done using IRT as the psychometric paradigm, selection algorithm, and scoring method.  This is because IRT can score examinees on a common scale regardless of which items they see, which is not possible using classical test theory.
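
To see how IRT puts examinees on a common scale even when they see different items, here is a minimal sketch. It assumes a two-parameter logistic (2PL) model and expected a posteriori (EAP) scoring with a standard normal prior, which are common choices but are not specified above; the item parameters are invented.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response (a = discrimination, b = difficulty)."""
    return 1 / (1 + math.exp(-a * (theta - b)))


def eap_theta(responses, items):
    """EAP ability estimate on a grid from -4 to 4 with a standard normal prior."""
    grid = [g / 10 for g in range(-40, 41)]
    posterior = []
    for theta in grid:
        weight = math.exp(-theta * theta / 2)  # unnormalized normal prior
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            weight *= p if u else 1 - p
        posterior.append(weight)
    return sum(t * w for t, w in zip(grid, posterior)) / sum(posterior)


# Invented item parameters: one easy testlet and one hard testlet.
easy_items = [(1.0, -1.0)] * 3
hard_items = [(1.0, 1.0)] * 3

# Both examinees answer all three items correctly, but on different testlets.
theta_easy = eap_theta([1, 1, 1], easy_items)
theta_hard = eap_theta([1, 1, 1], hard_items)
print(round(theta_easy, 2), round(theta_hard, 2))
```

Both examinees have the same number-correct score, yet the one who answered the hard items correctly receives the higher theta. A raw classical score cannot make that distinction.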

 

Why multistage testing?

Item-by-item CAT is not the best fit for all assessments, especially those that naturally tend towards testlets, such as language assessments where there is a reading passage with 3-5 associated questions.

Multistage testing allows you to realize some of the well-known benefits of adaptive testing (see below), with more control over content and exposure.  In addition to controlling content at an examinee level, it also can make it easier to manage item bank usage for the organization.

 

How do I implement multistage testing?

[Figure: Multistage testing]

 

1. Develop your item banks using items calibrated with item response theory

2. Assemble a test with multiple stages, defining pools of items in each stage as testlets

3. Evaluate the test information functions for each testlet

4. Run simulation studies to validate the delivery algorithm with your predefined testlets

5. Publish for online delivery

Our industry-leading assessment platform manages much of this process for you.  The image to the right shows our test assembly screen where you can evaluate the test information functions for each testlet.

 

Benefits of MST

There are a number of benefits to this approach, which are mostly shared with CAT.

  • Shorter exams: because difficulty is targeted, you waste less time
  • Increased security: There are many possible configurations, unlike a linear exam where everyone sees the same set of items
  • Increased engagement: Lower ability students are not discouraged, and high ability students are not bored
  • Control of content: CAT has some content control algorithms, but they are sometimes not sufficient
  • Supports testlets: CAT does not support tests that have testlets, like a reading passage with 5 questions
  • Allows for review: CAT does not usually allow for review (students can go back a question to change an answer), while MST does

 

Examples of multistage testing

MST is often used in language assessment, which means that it is often used in educational assessment, such as benchmark K-12 exams, university admissions, or language placement/certification.  One of the most famous examples is the Scholastic Aptitude Test from The College Board; it is moving to an MST approach in 2023.

Because of the complexity of item response theory, most organizations that implement MST have a full-time psychometrician on staff.  If your organization does not, we would love to discuss how we can work together.

 

The concept of Speeded vs Power Test is one of the ways of differentiating psychometric or educational assessments. In the context of educational measurement and depending on the assessment goals and time constraints, tests are categorized as speeded and power. There is also the concept of a Timed test, which is really a Power test. Let’s look at these types more carefully.

Speeded test

 

In this test, examinees are expected to answer as many questions as possible, but the time limit is deliberately so short that even the best examinees cannot complete the test, which forces speed.  Items are delivered sequentially, from the first to the last.  All items are usually relatively easy, though sometimes they increase in difficulty.  If the time limit and difficulty level are set correctly, none of the test takers will be able to reach the last item before time runs out. A speeded test is supposed to demonstrate how fast an examinee can respond to questions within a time limit. In this case, examinees’ answers are not as important as their speed of answering. The total score is usually computed as the number of questions answered correctly when the time limit is reached, and differences in scores are mainly attributed to individual differences in speed rather than knowledge.

An example of this might be a mathematical calculation speed test. Examinees are given 100 multiplication problems and told to solve as many as they can in 20 seconds. Most examinees know the answers to all the items; it is a question of how many they can finish. Another might be a 10-key task, where examinees are given a list of 100 5-digit strings and told to type as many as they can in 20 seconds.
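
Scoring such a test can be sketched as below. The response times are invented, `speeded_score` is a hypothetical helper, and an item counts only if it is finished at or before the time limit.

```python
def speeded_score(responses, time_limit):
    """Return (items reached, items correct) within the time limit.

    `responses` is a list of (seconds_spent, is_correct) in delivery order.
    """
    elapsed = 0.0
    reached = 0
    correct = 0
    for seconds, is_correct in responses:
        elapsed += seconds
        if elapsed > time_limit:
            break
        reached += 1
        correct += int(is_correct)
    return reached, correct


# Two invented examinees who know every answer but work at different speeds.
fast = [(1.5, True)] * 100   # 1.5 seconds per item
slow = [(2.5, True)] * 100   # 2.5 seconds per item

print(speeded_score(fast, time_limit=20))
print(speeded_score(slow, time_limit=20))
```

Both examinees answer everything they reach correctly, so the difference in their scores reflects speed alone.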

Pros of a speeded test:

  • A speeded test is appropriate when you actually want to test the speed of examinees; the 10-key task above would be useful in selecting data entry clerks, for example. The concept of “knowledge of a 5-digit string” in this case is not relevant and doesn’t even make sense.
  • Tests can sometimes be very short but still discriminating.
  • When a test is a mixture of items of varying difficulty, examinees might save time on easier items in order to respond to more difficult items. This can create an increased spread in scores.

Cons of a speeded test:

  • In most situations, a test is used to evaluate knowledge, not speed.
  • The nature of the test provokes examinees to commit errors even if they know the answers, which can be stressful.
  • A speeded test does not take individual differences among examinees into account.

Power Test

A power test provides examinees with sufficient time so that they could attempt all items and express their true level of knowledge or ability. Therefore, this testing category focuses on assessing knowledge, skills, and abilities of the examinees.  The total score is often computed as a number of questions answered correctly (or with item response theory), and individual differences in scores are attributed to differences in ability under assessment, not to differences in basic cognitive abilities such as processing speed or reaction time.

There is also the concept of a Timed Test. This has a time limit, but the limit is NOT a major factor in how examinees respond to questions, and it does not meaningfully affect their scores. For example, the time limit might be set so that 95% of examinees are not affected at all, and the remaining 5% are slightly hurried. This is done with the CAT-ASVAB.
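
One way to choose such a limit is to take a percentile of completion times from an untimed pilot administration. The sketch below assumes that approach; the pilot data and the function name are invented, and this is not a description of how any particular program sets its limits.

```python
def time_limit_for(completion_minutes, unaffected_pct=95):
    """Shortest observed time that lets `unaffected_pct` percent of examinees finish."""
    ordered = sorted(completion_minutes)
    k = -(-unaffected_pct * len(ordered) // 100)  # ceiling division
    return ordered[k - 1]


# Invented pilot data: 20 examinees who needed 41 to 60 minutes without a limit.
pilot_minutes = list(range(41, 61))

limit = time_limit_for(pilot_minutes)
unaffected = sum(t <= limit for t in pilot_minutes) / len(pilot_minutes)
print(limit, unaffected)
```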

Pros of a power test:

  • There are no time restrictions for test-takers
  • A power test is great for evaluating the knowledge, skills, and abilities of examinees
  • A power test reduces the chance that examinees commit errors on items they actually know
  • A power test accommodates individual differences among examinees

Cons of a power test:

  • It can be time consuming (some of these exams are 8 hours long or even more!)
  • This test format sometimes does not suit competitive examinations because of administrative issues (too much test time across too many examinees)
  • Power test is sometimes bad for discriminative purposes, since all examinees have high chances to perform well.  There are certainly some pass/fail knowledge exams where almost everyone passes.  But the purpose of those exams is not to differentiate for selection, but to make sure students have mastered the material, so this is a good thing in that case.

Speeded vs power test

The categorization as a speeded or power test depends on the assessment purpose. For instance, an arithmetic test with many relatively easy questions might be a speeded test for Grade 8 students, but the same test could be a power test for Grade 7 students. Thus, a speeded test measures power when all of the items are answered correctly within the limited time period. Similarly, a power test might turn into a speeded test when easy items are answered correctly in a shorter time period. Once a restrictive time limit is fixed for a power test, it starts to behave like a speeded test. Today, a pure speeded or power test is rare. Usually, what we meet in practice is a mixture of both, typically a Timed Test.

Below you may find a comparison of a speeded vs power test, in terms of the main features.

 

| Speeded test | Power test |
|---|---|
| Time limit is fixed, and it affects all examinees | There is no time limit, or there is one and it only affects a small percentage of examinees |
| The goal is to evaluate speed only, or a combination of speed and correctness | The goal is to evaluate correctness in the sense of knowledge, skills, and abilities of test-takers |
| Questions are relatively easy in nature | Questions are relatively difficult in nature |
| Test format increases chances of committing errors | Test format reduces chances of committing errors |

 

Educational assessment of Mathematics achievement is a critical aspect of most educational ministries and programs across the world.  One might say that all subjects at school are equally important and that would be relatively true. However, Mathematics stands out amongst the remaining ones, because it is more than just an academic subject. Here are three reasons why Math is so important:

Math is everywhere. Almost any job is tough to do without mathematical knowledge. Executives, musicians, accountants, fashion designers, and even mothers use Math in their daily lives. In particular, Math is essential for decision-making in the fast-growing digital world.

Math designs thinking paths. Math enables people, especially children, to analyze and solve real-world problems by developing logical and critical thinking. Einstein’s words describe this fact inimitably, “Pure mathematics is, in its way, the poetry of logical ideas”.

Math is a language of science. Math provides tools for understanding and developing engineering, science, and technology. Mathematical language, including symbols and their meanings, is the same around the world, so scientists use math to communicate concepts.

No matter which profession students choose, they will likely need some solid knowledge of Math to enter an undergraduate or graduate program. Some well-known tests that contain a Math section are TIMSS, PISA, ACT, SAT, SET, and GRE.

The role of educational assessment in Math

Therefore, an important subject like Math needs careful and accurate assessment approaches starting from school. Educational assessment is the process of collecting data on student progress in knowledge acquisition to inform future academic decisions towards learning goals. This is true at the individual student level, teacher or school level, district level, and state or national level. There are different types of assessment depending on its scale, purpose, and functionality of the data collected.

In general, educational authorities in many countries apply a criteria-based approach to classroom and external assessment of Mathematics. Criteria help divide a construct of knowledge into digestible portions, so that students understand what they have to learn and teachers can intervene constructively in students' individual learning paths to make sure that students achieve the learning goals in the end.

Classroom assessment, or assessment for learning, is curriculum-based. Teachers use learning objectives from the Math curriculum to form assessment criteria and design tasks that match those criteria. Teachers use the assessment results to make informed decisions at the student level.

External assessment, or assessment of learning, is also curriculum-based, but it covers many more topics than classroom assessment. Tasks are written by external specialists, usually from an independent educational institution. The assessment itself is likely to be invigilated, and its results are used by different authorities, not just teachers, to evaluate not only student progress in learning Math but also the curriculum itself.

Applications of educational assessment of Mathematics

The aforementioned types of assessment are classroom- and school-level, and both are mostly formatted as paper-and-pencil tests. There are also internationally recognized assessment programs that include Math, such as the Programme for International Student Assessment (PISA). PISA set a global trend of applying knowledge and skills in Math to solving real-world problems.

In 2018, PISA began using computerized adaptive testing, a major shift that benefits students at every level of knowledge. Because an adaptive test matches the difficulty of the items to each student, applying adaptive technologies to Math assessment could help keep students engaged, which matters because the majority of them are not big fans of Math. As a result, teachers and other stakeholders could get more valid and reliable data on student progress in learning Math.

Implementation

The first step towards implementing modern technologies for educational assessment of Math at schools and colleges is extensive research and planning. Second, there has to be a pool of good items written according to the best international practices. Third, assessment procedures have to be standardized. Last but not least, schools would need a consultant with deep expertise in adaptive technologies and psychometrics.

An important consideration is the item types or formats to be used.  FastTest allows you to use not only traditional formats like multiple choice, but also advanced formats like drag and drop or an equation editor presented to the student.  An example of that is below.

 

Equation editor item type

 

Why is educational assessment of Math so important?

Educational assessment of Math is one of the major focuses of PISA and other assessments for good reason.  Since Math skills translate to job success in many fields, especially STEM fields, a well-educated workforce is one of the necessary components of a modern economy.  So an educational system needs to know that it is preparing students for the future needs of the economy.  One aspect of this is progress monitoring, which tracks learning over time so that we can not only help individual students but also effect the aggregate changes needed to improve the educational system.

 

An Objective Structured Clinical Examination (OSCE Exam) is an assessment designed to measure performance of tasks, typically medical, in a high-fidelity way.  It is more a test of skill than knowledge.  For example, I used to work at a certification board for ophthalmic assistants; there were 3 levels, and the top two levels included both a knowledge test (200 multiple choice items) and an OSCE (level 2 was a digital simulation, level 3 was live human patients).

OSCE exams serve a very important purpose in many fields, forging a critical bridge between learning and practice.  This post will cover some of the basics.

What is an Objective Structured Clinical Examination?

An OSCE exam typically works by defining very specific tasks that the examinee is required to do, while examiners (often professors) watch and grade them via a rubric or checklist.  Each of the tasks is often called a station, and the OSCE will often have multiple stations.  Consider the components of the name:

  • Objective: We are trying to be as objective as possible, boiling down a potentially very complex patient scenario and task into a checklist or rubric. We want to make it quantitative, measurable, and reliable.
  • Structured: The task itself is tightly bounded, such as using retinoscopy to measure astigmatism (perhaps one thing of 20 that might happen at a visit to your ophthalmologist).
  • Clinical: The task is something to be done in a clinical setting; this is to increase fidelity and validity.

A great summary is provided by Zayyan (2011):

The Objective Structured Clinical Examination is a versatile multipurpose evaluative tool that can be utilized to assess health care professionals in a clinical setting. It assesses competency, based on objective testing through direct observation. It is precise, objective, and reproducible allowing uniform testing of students for a wide range of clinical skills. Unlike the traditional clinical exam, the OSCE could evaluate areas most critical to performance of health care professionals such as communication skills and ability to handle unpredictable patient behavior.

There are a few key points here.

  • It is a clinical setting, rather than a lecture hall setting (though in non-medical fields, “clinical setting” is relative!)
  • It is assessing competency of clinical skills
  • It is based on observation, where examiners rate the examinee
  • It will often include assessment of “soft skills” or other non-knowledge aspects

Where are OSCE Exams used?

OSCE exams are very important in the medical professions.  This report shows that many medical schools use it, though it curiously does not say how many schools were part of the survey.

However, it is most certainly not limited to medical fields.  You don’t hear the term very often outside medical education, but the approach is used widely.  Professions where someone is physically doing something are more likely to use OSCEs.  An accountant, on the other hand, does not physically do something, and their equivalent of an OSCE is more like a complex accounting scenario that needs to be completed in MS Excel and then graded.

Examples of OSCE exams

Of course, there are many medical examples.  I work with the American Board of Chiropractic Sports Physicians, who have a practical exam.  Check out their DACBSP webpage and scroll down to the Practical Exam resources, including instructional videos for some stations.


I once worked with a crane operator certification.  They had a performance test where you had to drive the crane into a certain position, lift and place certain objects, and then move a wrecking ball through a path of oil drums without knocking anything over – all while being rated by an examiner with a checklist.  Sounds a lot like an OSCE?

Perhaps the most common OSCE is one that you have likely taken: a driver’s test.  In addition to taking a knowledge test, you were also likely asked to drive a car alongside an examiner armed with a checklist, who told you to do various “stations” like parallel parking, perpendicular parking, or navigating a stoplight.

Tell me more!

There are dedicated resources in the world of medical education and assessment, such as Yudkowsky, Park, and Downing (2019) Assessment in Health Professions Education (https://www.routledge.com/Assessment-in-Health-Professions-Education/Yudkowsky-Park-Downing/p/book/9781315166902).  You might also be interested in my Lecture Notes from a course taught using that textbook.

One of the primary goals of psychometrics and assessment research is to ensure that tests, their scores, and interpretations of the scores are reliable, valid, and fair. The concepts of reliability and validity are discussed quite often and are well-defined, but what do we mean when we say that a test is fair or unfair? We’ll discuss it here. Note, though, that fairness is technically part of validity: if there is bias, then the interpretations being made from scores are usually biased as well.

What do we mean by bias?

Well, there are actually three types of bias in assessment.

1. Differential item functioning / differential test functioning

This type of bias occurs when a single item, or sometimes a test, is biased against a group when ability/trait level is held constant. For example, suppose that the reference group (usually the majority) and focal group (usually a minority) perform similarly on the test overall, but on one item we find that the focal group was less likely to get the item correct after adjusting for total score performance. This is known as differential item functioning. When an item is flagged this way, content experts should review it.

2. Overall test bias

With this type of bias, we find that the entire test is biased against the focal group, so that they receive lower scores (ability/trait estimates) than the reference group. This is especially concerning if there is data from another test or variable that shows the two groups should be of equal ability. However, there are many cases where the focal group has lower scores not because the test is biased, but because of some other reason. For example, if the focal group is economically disadvantaged and receives subpar educational opportunities, the test could very well be valid and simply reflect these well-known inequities.

3. Predictive bias

This is a complex situation. Suppose that the test itself was not biased, but it is used to predict something like job performance or success at university, and the test scores systematically underpredict performance for the focal group. This is manifested in the predictive model, such as a linear regression, and not in the test scores. There is also selection bias, where a focal group ends up not being selected as often.  In the USA, a rule of thumb is the four-fifths rule, which flags possible adverse impact when a group's selection rate is less than four-fifths (80%) of the rate for the group with the highest selection rate.
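To make the four-fifths rule concrete, here is a minimal sketch in Python. It divides the focal group's selection rate by the reference group's rate and flags the result if it falls below 0.8. The function names and the applicant counts are hypothetical, made up purely for illustration, and a real adverse impact analysis would also consider sample sizes and statistical significance.

```python
def selection_rate(selected: int, applicants: int) -> float:
    """Proportion of applicants in a group who were selected."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return selected / applicants


def adverse_impact_ratio(focal_rate: float, reference_rate: float) -> float:
    """Focal group's selection rate relative to the reference group's rate."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return focal_rate / reference_rate


def violates_four_fifths(focal_rate: float, reference_rate: float,
                         threshold: float = 0.8) -> bool:
    """True when the impact ratio falls below the four-fifths threshold."""
    return adverse_impact_ratio(focal_rate, reference_rate) < threshold


# Hypothetical counts, for illustration only.
reference = selection_rate(selected=60, applicants=100)
focal = selection_rate(selected=21, applicants=50)
ratio = adverse_impact_ratio(focal, reference)
flagged = violates_four_fifths(focal, reference)
print(f"impact ratio = {ratio:.2f}, flagged = {flagged}")
```

Here the reference group is selected at 60% and the focal group at 42%, so the ratio is about 0.70 and the selection process would be flagged for a closer look.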

Other types of unfairness

There are other ways that a test can be considered unfair. One is the case of unequal precision. Almost all traditional exams have plenty of items of middle difficulty, but not as many items that are very easy or very difficult. This can lead to very inaccurate scores for examinees at the top or bottom of the distribution. It is one reason that scaled scores are often capped on the ends of the spectrum; the difference between a person at the 98th percentile vs 99th percentile is most likely not meaningful, even if there is a wide difference in the raw scores.

Another is the case of test adaptation and translation. Here, the original test and its items might be unbiased, but when the test is translated or adapted to a different language or culture, it becomes biased. In such cases, the bias might show up in the data as DIF/DTF or test bias as described above. I recall a story that a friend of mine told me about an item that was translated to Spanish, where the original item in English was quite strong and unbiased, but when used in Latin America it touched on a cultural aspect that was not present in USA/Canada, and performed poorly.

How can we find test bias?

Psychometricians have a number of statistical methods that are designed to specifically look for the situations described here. Differential item functioning in particular has a ton of scientific literature devoted to it. One example method, which is older but still commonly used, is the Mantel-Haenszel statistic. For predictive bias, I remember learning about the partial F-test in graduate school, but have not had the opportunity to perform such analyses since then.
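As a rough illustration of the Mantel-Haenszel approach, the sketch below computes the common odds ratio for a single item. Examinees are stratified by total score; within each stratum we count reference correct (A), reference incorrect (B), focal correct (C), and focal incorrect (D), and the estimate is the sum of A·D/N divided by the sum of B·C/N across strata. It also converts the result to the ETS delta scale, -2.35 × ln(alpha). The data layout, function names, and counts are hypothetical, and the sketch leaves out the accompanying chi-square significance test and any binning of score levels that an operational analysis would normally include.

```python
import math
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item.

    records: iterable of (group, total_score, correct), where group is
    "reference" or "focal" and correct is a bool for the studied item.
    """
    # One 2x2 table per score stratum, stored as [A, B, C, D].
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        if group not in ("reference", "focal"):
            raise ValueError(f"unknown group: {group!r}")
        cell = (0 if group == "reference" else 2) + (0 if correct else 1)
        tables[score][cell] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator


def mh_d_dif(alpha: float) -> float:
    """ETS delta metric; negative values mean the item favors the reference group."""
    return -2.35 * math.log(alpha)


def expand(score, ref_right, ref_wrong, focal_right, focal_wrong):
    """Turn hypothetical cell counts into individual examinee records."""
    return ([("reference", score, True)] * ref_right
            + [("reference", score, False)] * ref_wrong
            + [("focal", score, True)] * focal_right
            + [("focal", score, False)] * focal_wrong)


# Hypothetical counts for two score strata, for illustration only.
records = expand(1, 20, 20, 10, 30) + expand(2, 30, 10, 20, 20)
alpha = mantel_haenszel_odds_ratio(records)
print(f"MH odds ratio = {alpha:.2f}, MH D-DIF = {mh_d_dif(alpha):.2f}")
```

In this made-up data the odds ratio works out to 3.0, meaning that among examinees matched on total score, the reference group has three times the odds of answering the item correctly. An item like that would be sent to content experts for review.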

How do we address or avoid test bias?

As with many things, an ounce of prevention is worth a pound of cure. High-stakes exams such as university admissions will invest heavily in avoiding bias. They will create detailed item writing guidelines, heavily train the item writers, and pay for items to be reviewed not only by experts but by people who are representative of target populations. Of course, some issues will always slip through this process, which is why it is important to perform the statistical analyses afterwards to validate the items, the test, and predictive models.

Where can I learn more?

Here are some relevant resources to help you learn more about test bias.

Handbook of Methods for Detecting Test Bias

Test Bias in Employment Selection Testing: A Visual Introduction

Differential Item Functioning