
What is a rubric? It’s a rule for converting unstructured responses on an assessment, such as essays that students write, into structured data that we can use psychometrically.

Why do we need rubrics?

Measurement is a quantitative endeavor.  In psychometrics, we are trying to measure things like knowledge, achievement, aptitude, or skills.  So we need a way to convert qualitative data into quantitative data.  We can still keep the qualitative data on hand for certain uses, but we typically need the quantitative data for the primary use.  For example, an essay written in school needs to be converted to a score, but the teacher might also want to talk with the student about the essay itself to provide a learning opportunity.

A rubric is a defined set of rules to convert open-response items like essays into usable quantitative data, such as scoring the essay 0 to 4 points.

How many rubrics do I need?

In some cases, a single rubric will suffice.  This is typical in mathematics, where the goal is a single correct answer.  In writing, the goal is often more complex.  You might be assessing writing and argumentative ability at the same time you are assessing language skills.  For example, you might have rubrics for spelling, grammar, paragraph structure, and argument structure – all on the same essay.

Examples

Spelling rubric for an essay

Points Description
0 Essay contains 5 or more spelling mistakes
1 Essay contains 1 to 4 spelling mistakes
2 Essay does not contain any spelling mistakes
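Because a rubric is a set of rules, it can be written down as a small function.  The sketch below encodes the spelling rubric above; the function name is hypothetical, and it assumes the number of spelling mistakes has already been counted by a rater or a spell checker.

```python
def spelling_points(num_mistakes: int) -> int:
    """Convert a count of spelling mistakes into rubric points (0 to 2)."""
    if num_mistakes < 0:
        raise ValueError("num_mistakes must be non-negative")
    if num_mistakes == 0:
        return 2  # essay does not contain any spelling mistakes
    if num_mistakes <= 4:
        return 1  # essay contains 1 to 4 spelling mistakes
    return 0      # essay contains 5 or more spelling mistakes
```

Writing the rule this way makes the category boundaries explicit, which helps when several raters need to apply the same rubric consistently.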


Argument rubric for an essay

“Your school is considering the elimination of organized sports.  Write an essay to provide to the School Board that provides 3 reasons to keep sports, with a supporting explanation for each.”

Points Description
0 Student does not include any reasons with explanation (includes providing 3 reasons but no explanations)
1 Student provides 1 reason with a clear explanation
2 Student provides 2 reasons with clear explanations
3 Student provides 3 reasons with clear explanations


Answer rubric for math

Points Description
0 Student provides no response or a response that does not indicate understanding of the problem.
1 Student provides a response that indicates understanding of the problem, but does not arrive at correct answer OR provides the correct answer but no supporting work.
2 Student provides a response with the correct answer and supporting work that explains the process.


How do I score tests with a rubric?

The traditional approach is to take the integers supplied by the rubric and add them to the number-correct score. This is consistent with classical test theory, and therefore fits with conventional statistics such as coefficient alpha for reliability and Pearson correlation for discrimination. However, the modern paradigm of assessment is item response theory, which analyzes the rubric data much more deeply and applies advanced mathematical modeling like the generalized partial credit model (Muraki, 1992).
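As a rough illustration of the classical approach, the sketch below adds rubric points to a number-correct score and computes coefficient alpha from a matrix of item scores.  The function names and the data are hypothetical, made up only for this example.

```python
from statistics import pvariance

def total_scores(number_correct, rubric_points):
    """Add each examinee's rubric points to their number-correct score."""
    return [nc + sum(pts) for nc, pts in zip(number_correct, rubric_points)]

def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of examinees, each a list of item scores."""
    k = len(item_scores[0])
    # Variance of each item (column) across examinees.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the summed score across examinees.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: number-correct on multiple-choice items, plus two
# rubric-scored essay traits (0-2 spelling, 0-3 argument) per examinee.
number_correct = [18, 25, 12, 30]
rubric_points = [[1, 2], [2, 3], [0, 1], [2, 3]]
totals = total_scores(number_correct, rubric_points)
alpha_rubrics = coefficient_alpha(rubric_points)
```

Population variance is used for both the items and the total; using sample variance for both gives the same alpha, because the correction factor cancels in the ratio.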

An example of this is below.  Imagine that you have an essay which is scored 0-4 points.  This graph shows the probability of earning each point level, as a function of the student's ability (Theta), which is the latent trait estimated by the model rather than the raw total score.  Someone who is average (Theta=0.0) is most likely to get 2 points, the yellow line.  Someone at Theta=1.0 is most likely to get 3 points.  Note that the middle curves are always bell-shaped, while the curves on the ends are monotonic: the curve for 0 points falls from an upper asymptote of 1.0 as Theta increases, and the curve for 4 points rises toward an upper asymptote of 1.0.  That is, the higher the student's ability, the more likely they are to get 4 out of 4 points, but the probability of that can never go above 100%.

Generalized-partial-credit-model
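Curves like these can be computed directly from the model.  Below is a minimal sketch of the generalized partial credit model category probabilities, where the probability of score k is proportional to exp of the sum of a*(Theta - b_v) over steps v = 1..k.  The item parameters are hypothetical, chosen only so that the most likely scores match the description above; they are not taken from the graph, and the optional scaling constant D = 1.7 is omitted.

```python
import math

def gpcm_probabilities(theta, a, b):
    """Category probabilities under the generalized partial credit model.

    theta: examinee ability; a: item discrimination;
    b: step parameters b_1..b_m for an item scored 0..m.
    """
    # Cumulative sums of a * (theta - b_v); score 0 has an empty sum of 0.
    z = [0.0]
    for b_v in b:
        z.append(z[-1] + a * (theta - b_v))
    z_max = max(z)  # subtract the max before exponentiating, for stability
    expz = [math.exp(v - z_max) for v in z]
    denom = sum(expz)
    return [e / denom for e in expz]

# Hypothetical parameters for an essay scored 0-4.
a, steps = 1.0, [-1.5, -0.5, 0.5, 1.5]
for theta in (-3.0, 0.0, 1.0, 3.0):
    probs = gpcm_probabilities(theta, a, steps)
    most_likely = max(range(len(probs)), key=probs.__getitem__)
    print(f"theta={theta:+.1f}  most likely score={most_likely}  "
          + " ".join(f"{p:.2f}" for p in probs))
```

Evaluating the function over a grid of Theta values and plotting each category gives the family of curves described above.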

How can I efficiently implement a scoring rubric?

It is much easier to implement a scoring rubric if your online assessment platform supports them in an online marking module, especially if the platform already has integrated psychometrics like the generalized partial credit model.  Below is an example of what an online essay marking system would look like, allowing you to efficiently implement rubrics.  It should have advanced functionality, such as allowing multiple rubrics per item, multiple raters per response, anonymity, and more.

Online marking essays


What about automated essay scoring?

You also have the option of using automated essay scoring; once you have some data from human raters on rubrics, you can train machine learning models to help.  Unfortunately, the world is not yet to the state where we have a droid that you can just feed a pile of student papers to grade!


SIFT test security data forensics

Test fraud is an extremely common occurrence.  We’ve all seen articles about examinee cheating.  However, there are very few defensible tools to help detect it.  I once saw a webinar from an online testing provider that proudly touted their reports on test security… but it turned out that all they provided was a simple export of student answers that you could subjectively read and form conjectures.  The goal of SIFT is to provide a tool that implements real statistical indices from the corpus of scientific research on statistical detection of test fraud, yet is user-friendly enough to be used by someone without a PhD in psychometrics and experience in data forensics.  SIFT still provides more collusion indices and other analysis than any other software on the planet, making it the standard in the industry from the day of its release.  The science behind SIFT is also being implemented in our world-class online testing platform, FastTest.  It is also worth noting that FastTest supports computerized adaptive testing, which is known to increase test security.

Interested?  Download a free trial version of SIFT!

What is Test Fraud?

As long as tests have been around, people have been trying to cheat them.  This is only natural; anytime there is a system with some sort of stakes or incentive involved (and maybe even when not), people will try to game that system.  Note that the root culprit is the system itself, not the test. Blaming the test is just shooting the messenger.  However, in most cases, the system serves a useful purpose.  In the realm of assessment, that means that K12 assessments provide useful information on curriculum and teachers, certification tests identify qualified professionals, and so on.  In such cases, we must minimize the amount of test fraud in order to preserve the integrity of the system.

When it comes to test fraud, the old cliche is true: an ounce of prevention is worth a pound of cure. You’ll undoubtedly see that phrase at conferences and in other resources.  So I of course recommend that your organization implement reasonable preventative measures to deter test fraud.  Nevertheless, there will still always be some cases.  SIFT is intended to help find those.  Some examinees might also be deterred by the knowledge that such analysis is being done at all.

How can SIFT help me with statistical detection of test fraud?

Like other psychometric software, SIFT does not interpret results for you.  For example, software for item analysis like Iteman and Xcalibre does not specifically tell you which items to retire or revise, or how to revise them.  But it provides the output necessary for a practitioner to do so.  SIFT provides you a wide range of output that can help you find different types of test fraud, like copying, proctor help, suspect test centers, brain dump usage, etc.  It can also help find other issues, like low examinee motivation.  But YOU have to decide what is important to you regarding statistical detection of test fraud, and look for relevant evidence.  More information on this is provided in the manual, but here is a glimpse.

SIFT test security data forensics

First, there are a number of indices you can evaluate, as you see above.  SIFT will calculate those collusion indices for each pair of students, and summarize the number of flags.

sift collusion index analysis
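To make the idea of a pairwise index concrete, here is a minimal sketch of one simple descriptive statistic: the number of items on which two examinees chose the same wrong answer.  This is not SIFT's implementation; the function names, the data, and the flagging threshold are all hypothetical, and probabilistic indices go further by modeling how likely such agreement is by chance.

```python
from itertools import combinations

def matching_incorrect(resp_a, resp_b, key):
    """Count items where two examinees chose the same wrong answer."""
    return sum(1 for x, y, k in zip(resp_a, resp_b, key) if x == y and x != k)

def flag_pairs(responses, key, threshold):
    """Return (id_a, id_b, count) for every pair at or above the threshold."""
    flags = []
    for (id_a, ra), (id_b, rb) in combinations(sorted(responses.items()), 2):
        count = matching_incorrect(ra, rb, key)
        if count >= threshold:
            flags.append((id_a, id_b, count))
    return flags

# Hypothetical 10-item test with one response string per examinee.
key = "ABCDABCDAB"
responses = {
    "E1": "ABCDABCDAB",
    "E2": "ABDDACCAAC",
    "E3": "ABDDACCAAB",
    "E4": "BBCDABCDAA",
}
flags = flag_pairs(responses, key, threshold=3)
```

A raw count like this ignores examinee ability and item difficulty, which is one reason different indices can reach different conclusions about the same pair.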

A certification organization could use SIFT to look for evidence of brain dump makers and takers by evaluating similarity between examinee response vectors and answers from a brain dump site – especially if those were intentionally seeded by the organization!  We also might want to find adjacent examinees or examinees in the same location that group together in the collusion index output.  Unfortunately, these indices can differ substantially in their conclusions.
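The seeded brain dump idea can be sketched as follows, under the assumption that the organization has leaked an answer key that is deliberately wrong on a few items.  An examinee who reproduces those wrong answers is more likely to have studied the leaked content.  The function name and data are hypothetical, and this is only an illustration of the concept, not a description of how SIFT computes it.

```python
def seeded_matches(responses, true_key, leaked_key):
    """Count responses that agree with the leaked key where it is deliberately wrong."""
    return sum(
        1
        for resp, true, leaked in zip(responses, true_key, leaked_key)
        if leaked != true and resp == leaked
    )

# Hypothetical 8-item test; the leaked key is wrong on items 5 and 7.
true_key = "ABCDABCD"
leaked_key = "ABCDBBDD"
honest = seeded_matches("ABCDABCD", true_key, leaked_key)
suspect = seeded_matches("ABCDBBDD", true_key, leaked_key)
```

A single match could easily be an ordinary mistake, so counts like these are evidence to weigh alongside other indices rather than proof on their own.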

Additionally, you might want to evaluate time data.  SIFT  provides this as well.

sift time analysis

Finally, we can roll up many of these statistics to the group level.  Below is an example that provides a portion of SIFT output regarding teachers.  Note that Gutierrez has suspiciously high scores but without spending much more time.  Cheating?  Possibly.  On the other hand, that is the smallest N, so perhaps the teacher just had a group of accelerated students.  Worthington also had high scores but had notably shorter times – perhaps the teacher was helping?

sift group analysis
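A group-level roll-up of this kind can be sketched in a few lines.  The example below computes N, mean score, and mean time per group from examinee records; the function name, the record layout, and the data are hypothetical and do not reproduce the SIFT output.

```python
from collections import defaultdict
from statistics import mean

def summarize_groups(records):
    """Return {group: (n, mean score, mean minutes)} from (group, score, minutes) tuples."""
    by_group = defaultdict(list)
    for group, score, minutes in records:
        by_group[group].append((score, minutes))
    return {
        group: (len(rows), mean(s for s, _ in rows), mean(m for _, m in rows))
        for group, rows in by_group.items()
    }

# Hypothetical examinee records: (teacher, score, minutes spent).
records = [
    ("Teacher A", 80, 40), ("Teacher A", 90, 50),
    ("Teacher B", 60, 45), ("Teacher B", 70, 55), ("Teacher B", 65, 50),
]
summary = summarize_groups(records)
```

Keeping N next to the means matters for interpretation, since a small group can produce an extreme average without any misconduct.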


The Story of SIFT

I started SIFT in 2012.  Years ago, ASC sold a software program called Scrutiny!  We had to stop selling it because it did not work on recent versions of Windows, but we still received inquiries for it.  So I set out to develop a program that could perform the analysis from Scrutiny! (the Bellezza & Bellezza index) but also much more.  I quickly finished a few collusion indices.  Then, unfortunately, I had to spend a few years dealing with the realities of business, wasting hundreds of hours in pointless meetings and other pitfalls.  I finally set a goal to release SIFT in July 2016.

Version 1.0 of  SIFT  includes 10 collusion indices (5 probabilistic, 5 descriptive), response time analysis, group level analysis, and much more to aid in the statistical detection of test fraud.  This is obviously not an exhaustive list of the analyses from the literature, but still far surpasses other options for the practitioner, including the choice to write all your own code.  Suggestions?  I’d love to hear them.