Posts on psychometrics: The Science of Assessment

The Standard Error of the Mean

The standard error of the mean is one of the three main standard errors in psychometrics and psychology.  Its purpose is to help conceptualize the error in estimating the mean of some population based on a sample.  The SEM is a well-known concept from the general field of statistics, used in an untold number of applications.

For example, a biologist might catch a number of fish from a lake, measure their length, and use that data to determine the average size of fish in the lake.  In psychometrics and psychology, we usually utilize data from some measurement of people.

For example, suppose we have a population of 5 employees with the scores below on a 10-item assessment on safety procedures.

Employee   Score
1          4
2          6
3          6
4          8
5          7

Then let's suppose we are drawing a sample of 4.  There are 5 different ways to do this, but let's just say it's the first 4.  The average score is (4+6+6+8)/4=6, and the standard deviation for the sample is 1.63.  The standard error of the mean is then

   SEM=SD/sqrt(n) = 1.63/sqrt(4) = 0.815.

Because a 95% confidence interval is 1.96 standard errors around the average, and the average is 6, this says we expect the true mean of the distribution to be somewhere between 4.40 and 7.60.  Obviously, this is not very exact!  There are two reasons for that in this case: the SD is relatively large (1.63 on a scale of 0 to 10), and the N is only 4.  If you had a sample N of 10,000, for example, you'd be dividing the 1.63 by 100, leading to an SEM of only 0.0163.
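If you want to check these numbers yourself, here is a minimal Python sketch of the same calculation, using only the standard library and the sample of four scores above.

```python
# Minimal sketch of the standard error of the mean and its 95% confidence interval
import math

scores = [4, 6, 6, 8]           # the sample of 4 employees from above
n = len(scores)
mean = sum(scores) / n

# Sample standard deviation (n - 1 in the denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

se_mean = sd / math.sqrt(n)      # standard error of the mean
ci_low, ci_high = mean - 1.96 * se_mean, mean + 1.96 * se_mean

print(f"Mean = {mean:.2f}, SD = {sd:.2f}, SE of the mean = {se_mean:.3f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")   # roughly 4.40 to 7.60
```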

How is this useful?  Well, even with the wide range of 4.40 to 7.60, we can tell that the average score is fairly low; our employees probably need some training on safety procedures.  The other two standard errors in psychology/psychometrics are more complex and more useful, though.  I'll be covering those in future posts.

The Three Standard Errors

One of my graduate school mentors once said in class that there are three standard errors that everyone in the assessment or I/O Psych field needs to know: mean, measurement, and estimate.  They are quite distinct in concept and application but easily confused by someone with minimal training.

I’ve personally seen the standard error of the mean reported as the standard error of measurement, which is completely unacceptable.

So in this post, I’ll briefly describe each so that the differences are clear.  In later posts, I’ll delve deeper into each of the standard errors.

Standard Error of the Mean

This is the standard error that you learned about in Introduction to Statistics back in your sophomore year of college/university.  It is related to the Central Limit Theorem, the cornerstone of statistics.  Its purpose is to provide an index of accuracy (or conversely, error) in a sample mean.  Any sample drawn from a population will have an average, but these averages will vary from sample to sample.  The standard error of the mean estimates the variation we might expect in these different means from different samples and is defined as

   SEmean = SD/sqrt(n)

Where SD is the sample’s standard deviation and n is the number of observations in the sample.  This can be used to create a confidence interval for the true population mean.

The most important thing to note, with respect to psychometrics, is that this has nothing to do with psychometrics.  This is just general statistics.  You could be weighing a bunch of hay bales and calculating their average; anything where you are making observations.  It can be used, however, with assessment data.

For example, if you do not want to make every student in a country take a test, and instead sample 50,000 students who have a mean of 71 items correct with an SD of 12.3, then the standard error of the mean is 12.3/sqrt(50000) = 0.055.  A 95% confidence interval is 71 ± 1.96*0.055, so you can be 95% certain that the true population mean lies in the narrow range of roughly 70.89 to 71.11.

Click here to read more.

Standard Error of Measurement

More important in the world of assessment is the standard error of measurement.  Its purpose is to provide an index of the accuracy of a person’s score on a test.  That is a single person, rather than a group like with the standard error of the mean.  It can be used in both the classical test theory perspective and item response theory perspective, though it is defined quite differently in both.

In classical test theory, it is defined as

   SEM = SD*sqrt(1-r)

Where SD is the standard deviation of scores for everyone who took the test, and r is the reliability of the test.  It can be interpreted as the standard deviation of scores that you would find if you had the person take the test over and over, with a fresh mind each time.  A confidence interval built with it is then interpreted as the band within which you would expect the person's true score on the test to fall.
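As a quick illustration, here is a minimal Python sketch of that formula; the SD, reliability, and observed score used here are made-up values, not from any dataset in this post.

```python
# Classical SEM and a 95% band for the true score (illustrative values only)
import math

sd = 10.0           # SD of scores for everyone who took the test (assumed)
reliability = 0.90  # e.g., coefficient alpha (assumed)
observed = 75       # one examinee's observed score (assumed)

sem = sd * math.sqrt(1 - reliability)                  # about 3.16
band = (observed - 1.96 * sem, observed + 1.96 * sem)  # about 68.8 to 81.2

print(f"SEM = {sem:.2f}; 95% band for the true score: {band[0]:.1f} to {band[1]:.1f}")
```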

Item Response Theory conceptualizes the SEM as a continuous function across the range of student abilities.  A test form will have more accuracy – less error – in a range of abilities where there are more items or items of higher quality.  That is, a test with most items of middle difficulty will produce accurate scores in the middle of the range, but not measure students on the top or bottom very well.  The example below is a test that has many items above the average examinee score (θ) of 0.0 so that any examinee with a score of less than 0.0 has a relatively inaccurate score, namely with an SEM greater than 0.50.

Standard error of measurement and test information function
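As a rough sketch of where a curve like that comes from: under a 2PL IRT model, the SEM at a given θ is one divided by the square root of the test information at that θ.  The five sets of item parameters below are invented, and with so few items the absolute SEM values are large, but the shape (more precision where the items are located, less precision for low-ability examinees) mirrors the figure.

```python
# Sketch of an IRT standard error function: SEM(theta) = 1 / sqrt(test information)
import math

# (a, b) = discrimination, difficulty for a 2PL model; mostly harder items (b > 0)
items = [(1.0, 0.2), (1.2, 0.5), (0.9, 0.8), (1.1, 1.0), (1.3, 1.5)]

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def sem(theta):
    info = sum(a * a * p_correct(theta, a, b) * (1 - p_correct(theta, a, b))
               for a, b in items)
    return 1.0 / math.sqrt(info)

for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: SEM = {sem(theta):.2f}")
# SEM is smallest around theta = 1, where these items are concentrated,
# and much larger for low-ability examinees
```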

 

For a deeper discussion of SEM, click here. 

Standard Error of the Estimate

Lastly, we have the standard error of the estimate.  This is an estimate of the accuracy of a prediction that is made, usually in the paradigm of linear regression.  Suppose we are using scores on a 40 item job knowledge test to predict job performance, and we have data on a sample of 1,000 job incumbents that took the test last year and have job performance ratings from this year on a measure that entails 20 items scored on a 5 point scale for a total of 100 points.

There might have been 86 incumbents that scored 30/40 on the test, and they will have a range of job performance, let’s say from 61 to 89.  If a new person takes the test and scores 30/40, how would we predict their job performance?

The SEE is defined as

       SEE = SDy*sqrt(1 - r^2)

Where SDy is the standard deviation of the criterion (job performance ratings, y), and r is the correlation of x and y, not reliability.  Many statistical packages can estimate linear regression, the SEE, and many other related statistics for you.  In fact, Microsoft Excel comes with a free package to implement simple linear regression.  Excel estimates the SEE as 4.69 in the example above, and the regression intercept and slope are 29.93 and 1.76, respectively.

Given this, we can estimate the job performance of a person with a test score of 30 to be 29.93 + 1.76*30 = 82.73.  A 95% confidence interval for that candidate is then 82.73 - (1.96*4.69) to 82.73 + (1.96*4.69), or 73.54 to 91.92.
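Here is a small Python sketch that reproduces those numbers.  The intercept, slope, and SEE come from the text above; the SDy and r used to illustrate the formula itself are made-up values.

```python
# SEE and a 95% prediction interval for a new candidate
import math

# Definition: SEE = SD of the criterion times sqrt(1 - r^2)
sd_y = 9.0   # assumed SD of job performance ratings (illustrative)
r = 0.85     # assumed test-criterion correlation (illustrative)
print(f"SEE from the formula (illustrative values): {sd_y * math.sqrt(1 - r**2):.2f}")

# Prediction using the regression estimates reported in the text
intercept, slope, see = 29.93, 1.76, 4.69
x = 30
y_hat = intercept + slope * x                       # 82.73
low, high = y_hat - 1.96 * see, y_hat + 1.96 * see  # about 73.5 to 91.9

print(f"Predicted performance = {y_hat:.2f}")
print(f"95% interval: {low:.2f} to {high:.2f}")
```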

You can see how this might be useful in prediction situations.  Suppose we wanted to hire only people who are likely to have a job performance rating of 80 or better.  A cutscore of 30 on the test would then be quite feasible.

OK, so now what?

Well, remember that these three standard errors are quite different and are not even used in related situations.  When you see a standard error requested – for example, if you must report the standard error for an assessment – make sure you use the right one!

Machine Learning in Psychometrics: Old News?

In the past decade, terms like machine learning, artificial intelligence, and data mining have become ever-bigger buzzwords as computing power, APIs, and the massively increased availability of data enable new technologies like self-driving cars. However, we've been using methodologies like machine learning in psychometrics for decades. So much of the hype is just hype.

So, what exactly is Machine Learning?

Unfortunately, there is no widely agreed-upon definition, and as Wikipedia notes, machine learning is often conflated with data mining. A broad definition from Wikipedia is that machine learning explores the study and construction of algorithms that can learn from and make predictions on data. It’s often divided into supervised learning, where a researcher drives the process, and unsupervised learning, where the computer is allowed to creatively run wild and look for patterns. The latter isn’t of much use to us, at least yet.

Supervised learning includes specific topics like regression, dimensionality reduction, and anomaly detection that we obviously have in Psychometrics. But it's the general definition above that really fits what Psychometrics has been doing for decades.

What is Machine Learning in Psychometrics?

We can't cover all the ways that machine learning and related topics are used in psychometrics and test development, but here's a sampling. My goal is not to cover them all but to point out that this is old news and that we should not get hung up on buzzwords, fads, and marketing schticks – but by all means, we should continue to drive in this direction.

Dimensionality Reduction

One of the first, and most straightforward, areas is dimensionality reduction. Given a bunch of unstructured data, how can we find some sort of underlying structure, especially based on latent dimension? We’ve been doing this, utilizing methods like cluster analysis and factor analysis, since Spearman first started investigating the structure of intelligence 100 years ago. In fact, Spearman helped invent those approaches to solve the problems that he was trying to address in psychometrics, which was a new field at the time and had no methodology yet. How seminal was this work in psychometrics for the field of machine learning in general? The Coursera MOOC on Machine Learning uses Spearman’s work as an example in one of the early lectures!

Classification

Classification is a typical problem in machine learning. A common example is classifying images, and the classic dataset is the MNIST handwriting set (though Silicon Valley fans will think of the “not hot dog” algorithm). Given a bunch of input data (image files) and labels (what number is in the image), we develop an algorithm that most effectively can predict future image classification.  A closer example to our world is the iris dataset, where several quantitative measurements are used to predict the species of a flower.

The contrasting groups method of setting a test cutscore is a simple example of classification in psychometrics. We have a training set where examinees are already classified as pass/fail by a criterion other than test score (which of course rarely happens, but that’s another story), and use mathematical models to find the cutscore that most efficiently divides them. Not all standard setting methods take a purely statistical approach; understandably, the cutscores cannot be decided by an arbitrary computer algorithm like support vector machines or they’d be subject to immediate litigation. Strong use of subject matter experts and integration of the content itself is typically necessary.
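To make the classification framing concrete, here is a toy Python sketch: given examinees already labeled pass/fail by an external criterion, it searches for the cutscore that misclassifies the fewest of them.  The data are invented, and a real contrasting groups study involves far more judgment, but the underlying idea is the same.

```python
# Toy sketch of the classification idea behind contrasting groups:
# choose the cutscore that minimizes misclassifications against external labels.
scores = [12, 14, 15, 16, 17, 18, 19, 20, 21, 23]
labels = ["fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]

def misclassifications(cut):
    # Classify as "pass" if score >= cut, then count disagreements with the labels
    return sum((s >= cut) != (lab == "pass") for s, lab in zip(scores, labels))

candidate_cuts = range(min(scores), max(scores) + 2)
best_cut = min(candidate_cuts, key=misclassifications)
print(f"Cutscore = {best_cut}, misclassified = {misclassifications(best_cut)}")
```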

Of course, all tests that seek to assign examinees into categories like Pass/Fail are addressing the classification problem. Some of my earliest psychometric work on the sequential probability ratio test and the generalized likelihood ratio was in this area.

One of the best examples of supervised learning for classification, but much more advanced than the contrasting groups method, is automated essay scoring, which has been around for about two decades. It has all the classic trappings: a training set where the observations are classified by humans first, and then mathematical models are trained to best approximate the humans. What makes it more complex is that the predictor data is now long strings of text (student essays) rather than a single number.

Anomaly Detection

The most obvious way this is used in our field is psychometric forensics: trying to find examinees who are cheating or engaging in some other behavior that warrants attention. But we also use it to evaluate model fit, possibly removing items or examinees from our data set.

Using Algorithms to Learn/Predict from Data

Item response theory is a great example of the general definition. With IRT, we are certainly using a training set, which we call a calibration sample. We use it to train some models, which are then used to make decisions in future observations, primarily scoring examinees that take the test by predicting where those examinees would fall in the score distribution of the calibration sample. IRT is also applied to solve more sophisticated algorithmic problems: computerized adaptive testing and automated test assembly are fantastic examples. We use IRT more generally to learn from the data: which items are most effective, which are not, the ability range where the test provides the most precision, and so on.
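A small sketch of that "predict on new observations" step: score a new examinee by maximum likelihood, using 2PL item parameters that were already estimated from a calibration sample.  The parameters and response vector here are invented, and the grid search is just for transparency; production software uses faster numerical methods.

```python
# Scoring a new examinee with IRT: maximum likelihood estimate of theta
import math

items = [(1.0, -1.0), (1.2, -0.5), (0.9, 0.0), (1.1, 0.5), (1.3, 1.0)]  # (a, b)
responses = [1, 1, 1, 0, 0]   # one new examinee's scored responses

def log_likelihood(theta):
    ll = 0.0
    for (a, b), u in zip(items, responses):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        ll += math.log(p) if u == 1 else math.log(1 - p)
    return ll

# Brute-force grid search over theta, fine for an illustration
grid = [g / 100 for g in range(-400, 401)]
theta_hat = max(grid, key=log_likelihood)
print(f"Estimated theta = {theta_hat:.2f}")
```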

What differs from the Classification problem is that we don’t have a “true state” of labels for our training set. That is, we don’t know what the true scores are of the examinees, or if they are truly a “pass” or a “fail” – especially because those terms can be somewhat arbitrary. It is for this reason we rely on a well-defined model with theoretical reasons for it fitting our data, rather than just letting a machine learning toolkit analyze it with any model it feels like.

Arguably, classical test theory also fits this definition. We have a very specific mathematical model that is used to learn from the data, including which items are stronger or more difficult than others, and how to construct test forms to be statistically equivalent. However, its use of prediction is much weaker. We do not predict where future examinees would fall in the distribution of our calibration set. The fact that it is test-form-specific hampers its generalizability.

Reinforcement learning

The Wikipedia article also mentions reinforcement learning. This is used less often in psychometrics because test forms are typically published with some sort of finality. That is, they might be used in the field for a year or two before being retired, and no data is analyzed in that time except perhaps some high-level checks like the NCCA Annual Statistical Report. Online IRT calibration is a great example, though it is rarely used in practice: response data is analyzed algorithmically over time and used to estimate or update the IRT parameters. Evaluation of parameter drift also fits in this definition.

Use of Test Scores

We also use test scores “outside” the test in a machine learning approach. A classic example of this is using pre-employment test scores to predict job performance, especially with additional variables to increase the incremental validity. But I’m not going to delve into that topic here.

Automation

Another huge opportunity for machine learning in psychometrics that is highly related is automation. That is, programming computers to do tasks more effectively or efficiently than humans. Automated test assembly and automated essay scoring are examples of this, but there are plenty of ways that automation can help that are less "cool" but have more impact. My favorite is the creation of psychometric reports; Iteman and Xcalibre do not produce any numbers that are unavailable in other software, but they automatically build you a draft report in MS Word, with all the tables, graphs, and narratives already embedded. That is unique. Without that automation, organizations would typically pay a PhD psychometrician to spend hours on copy-and-paste, which is an absolute shame. The goal of my mentor, Prof. David Weiss, and myself is to automate the test development cycle as a whole: driving job analysis, test design, item writing, item review, standard setting, form assembly, test publishing, test delivery, and scoring. There's no reason people should be allowed to continue making bad tests, and then using those tests to ruin people's lives, when we know so much about what makes a decent test.

Summary

I am sure there are other areas of psychometrics and the testing industry that are soon to be disrupted by technological innovations such as this. What’s next?

As this article notes, the future direction is about the systems being able to learn on their own rather than being programmed; that is, more towards unsupervised learning than supervised learning. I’m not sure how well that fits with psychometrics.

But back to my original point: psychometrics has been a data-driven field since its inception a century ago. In fact, we contributed some of the methodology that is used generally in the field of machine learning and data analytics. So it shouldn’t be any big news when you hear terms like machine learning, data mining, AI, or dimensionality reduction used in our field! In contrast, I think it’s more important to consider how we remove roadblocks to more widespread use.

One of the hurdles we still need to overcome for machine learning in psychometrics is simply getting more organizations to do what has been considered best practice for decades. There are two types of problem organizations. The first type is one that does not have the sample sizes or budget to deal with methodologies like I've discussed here. The salient example I always think of is a state licensure test required by law for a niche profession that might have only 3 examinees per year (I have talked with such programs!). Not much we can do there. The second type is those organizations that indeed have large sample sizes and a decent budget, but are still doing things the same way they did them 30 years ago. How can we bring modern methods and innovations to these organizations? Doing so will only make their tests more effective and fairer.

Student Profiles with Cognitive Diagnostic Models

Cognitive diagnostic models are a psychometric paradigm for designing and scoring tests with the goal of providing a profile of examinee skill mastery rather than just an overall test score.

CDMs are an area of psychometric research that has seen substantial growth in the past decade, though the mathematics behind them dates back to Macready and Dayton (1977).  The reason that they have been receiving more attention is that in many assessment situations, a simple overall score does not serve our purposes and we want a finer evaluation of the examinee's skills or traits.  For example, the purpose of formative assessment in education is to provide feedback to students on their strengths and weaknesses, so an accurate map of these is essential.  In contrast, a professional certification/licensure test focuses on a single overall score with a pass/fail decision.

What are cognitive diagnostic models?

The predominant psychometric paradigm since the 1980s is item response theory (IRT), which is also known as latent trait theory.  Cognitive diagnostic models are part of a different paradigm known as latent class theory.  Instead of assuming that we are measuring a single, neatly unidimensional factor, latent class theory tries to assign examinees to more qualitative groups by determining how they are categorized along a number of axes.

What this means is that the final "score" we hope to obtain on each examinee is not a single number, but a profile of which axes (skills) they have mastered and which they have not.  The axes could be a number of different psychoeducational constructs, but they are often used to represent cognitive skills examinees have learned.  Because we are trying to diagnose strengths vs. weaknesses, we call it a cognitive diagnostic model.

Example: Fractions

A classic example you might see in the literature is a formative assessment on dealing with fractions in mathematics. Suppose you are designing such a test, and the curriculum includes these teaching points, which are fairly distinct skills or pieces of knowledge.

  1. Find the lowest common denominator
  2. Add fractions
  3. Subtract fractions
  4. Multiply fractions
  5. Divide fractions
  6. Convert mixed number to improper fraction

Now suppose this is one of the questions on the test.

 What is 2 3/4 + 1 1/2?

 

This item utilizes skills 1, 2, and 6.  We can apply a similar mapping to all items and obtain a table, which researchers call the "Q Matrix."  Our example item is Item 1 below; the tags for Items 2-4 are just illustrative placeholders.  You'd create your own items and tag them appropriately.

Item     Skill 1   Skill 2   Skill 3   Skill 4   Skill 5   Skill 6
Item 1      X         X                                       X
Item 2      X                   X
Item 3                                    X                   X
Item 4                                              X         X

 

So how do we obtain the examinee’s skill profile?

This is where the fun starts.  I used the plural "cognitive diagnostic models" because there are a number of available models.  Just as item response theory has the Rasch, two-parameter, three-parameter, generalized partial credit, and other models, the choice of CDM is up to the researcher and depends on the characteristics of the test.

The simplest model is the DINA model, which has two parameters per item.  The slippage parameter s refers to the probability that a student will get the item wrong if they do have the skills.  The guessing parameter g refers to the probability a student will get the item right if they do not have the skills.

The mathematical calculations for determining the skill profile are complex, and are based on maximum likelihood.  To determine the skill profile, we need to first find all possible profiles, calculate the likelihood of each (based on item parameters and the examinee response vector), then select the profile with the highest likelihood.
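Here is a minimal Python sketch of that process for the DINA model; the Q-matrix, slip and guess parameters, and response vector are all invented for illustration.

```python
# DINA skill-profile estimation: enumerate all profiles, keep the most likely
from itertools import product

# Which of 3 skills each item requires (rows = items, columns = skills)
Q = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
]
slip  = [0.10, 0.15, 0.10, 0.20]   # s: P(wrong | has all required skills)
guess = [0.20, 0.25, 0.20, 0.15]   # g: P(right | missing a required skill)
responses = [1, 0, 1, 1]           # one examinee's scored responses

def likelihood(profile):
    lik = 1.0
    for q, s, g, u in zip(Q, slip, guess, responses):
        has_all = all(p >= r for p, r in zip(profile, q))  # mastered every required skill?
        p_correct = (1 - s) if has_all else g
        lik *= p_correct if u == 1 else (1 - p_correct)
    return lik

profiles = list(product([0, 1], repeat=3))   # all 2^3 possible skill profiles
best = max(profiles, key=likelihood)
print(f"Most likely skill profile: {best}")
```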

Estimating the item parameters is an order of magnitude more complex.  Again, compare to item response theory: brute-force calculation of theta with maximum likelihood is complex, but can still be done using Excel formulas.  Item parameter estimation for IRT with marginal maximum likelihood can only be done by specialized software like Xcalibre.  For CDMs, item parameter estimation can be done in software like MPlus or R (see this article).

In addition to providing the most likely skill profile for each examinee, the CDMs can also provide the probability that a given examinee has mastered each skill.  This is what can be extremely useful in certain contexts, like formative assessment.

How can I implement cognitive diagnostic models?

The first step is to analyze your data to evaluate how well CDMs work by estimating one or more of the models.  As mentioned, this can be done in software like MPlus or R.  Actually publishing a real assessment that scores examinees with CDMs is a greater hurdle.

Most tests that use cognitive diagnostic models are proprietary.  That is, a large K12 education company might offer a bank of prefabricated formative assessments for students in grades 3-12.  That, of course, is what most schools need, because they don’t have a PhD psychometrician on staff to develop new assessments with CDMs.  And the testing company likely has several on staff.

On the other hand, if you want to develop your own assessments that leverage CDMs, your options are quite limited.  I recommend our  FastTest platform for test development, delivery, and analytics.

This is cool!  I want to learn more!

I like this article by Alan Huebner, which talks about adaptive testing with the DINA model, but has a very informative introduction on CDMs.

Jonathan Templin, a professor at the University of Iowa, is one of the foremost experts on the topic.  Here is his website.  Lots of fantastic resources.

Here is a textbook on CDMs.

 

Scoring Rubrics and the Generalized Partial Credit Model

What is a rubric? It’s a rule for converting unstructured responses on an assessment, such as essays that students write, into structured data that we can use psychometrically.

Why do we need rubrics?

Measurement is a quantitative endeavor.  In psychometrics, we are trying to measure things like knowledge, achievement, aptitude, or skills.  So we need a way to convert qualitative data into quantitative data.  We can still keep the qualitative data on hand for certain uses, but typically need the quantitative data for the primary use.  For example, writing essays in school will need to be converted to a score, but the teacher might also want to talk to the student to provide a learning opportunity.

A rubric is a defined set of rules to convert open-response items like essays into usable quantitative data, such as scoring the essay 0 to 4 points.

How many rubrics do I need?

In some cases, a single rubric will suffice.  This is typical in mathematics, where the goal is a single correct answer.  In writing, the goal is often more complex.  You might be assessing writing and argumentative ability at the same time you are assessing language skills.  For example, you might have rubrics for spelling, grammar, paragraph structure, and argument structure – all on the same essay.

Examples

Spelling rubric for an essay

Points Description
0 Essay contains 5 or more spelling mistakes
1 Essay contains 1 to 4 spelling mistakes
2 Essay does not contain any spelling mistakes

 

Argument rubric for an essay

“Your school is considering the elimination of organized sports.  Write an essay to provide to the School Board that provides 3 reasons to keep sports, with a supporting explanation for each.”

Points Description
0 Student does not include any reasons with explanations (this includes providing 3 reasons but no explanations)
1 Student provides 1 reason with a clear explanation
2 Student provides 2 reasons with clear explanations
3 Student provides 3 reasons with clear explanations

 

Answer rubric for math

Points Description
0 Student provides no response or a response that does not indicate understanding of the problem.
1 Student provides a response that indicates understanding of the problem, but does not arrive at correct answer OR provides the correct answer but no supporting work.
2 Student provides a response with the correct answer and supporting work that explains the process.

 

How do I score tests with a rubric?

Well, the traditional approach is to just take the integers supplied by the rubric and add them to the number-correct score. This is consistent with classical test theory, and therefore fits with conventional statistics such as coefficient alpha for reliability and Pearson correlation for discrimination. However, the modern paradigm of assessment is item response theory, which analyzes the rubric data much more deeply and applies advanced mathematical modeling like the generalized partial credit model (Muraki, 1992; resources on that here and here).

An example of this is below.  Imagine that you have an essay which is scored 0-4 points.  This graph shows the probability of earning each point level, as a function of total score (Theta).  Someone who is average (Theta=0.0) is likely to get 2 points, the yellow line.  Someone at Theta=1.0 is likely to get 3 points.  Note that the curves for the middle categories are bell-shaped, while the curves for the lowest and highest categories are monotonic and rise toward an upper asymptote of 1.0 at the extremes of the scale.  That is, the smarter the student, the more likely they are to get 4 out of 4 points, but the probability of that can never go above 100%, obviously.

Generalized partial credit model: category response functions
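Here is a short Python sketch of how those category probabilities are computed under the generalized partial credit model.  The discrimination and step parameters are made up to roughly mimic the curves described above (category 2 most likely at Theta = 0.0, category 3 most likely at Theta = 1.0).

```python
# Generalized partial credit model: category probabilities for one 0-4 point item
import math

a = 1.0                           # discrimination (assumed)
steps = [-2.0, -0.75, 0.75, 2.0]  # step parameters b_1..b_4 (assumed)

def category_probs(theta):
    # Cumulative sums of a*(theta - b_k); the sum for category 0 is defined as 0
    sums = [0.0]
    for b in steps:
        sums.append(sums[-1] + a * (theta - b))
    exps = [math.exp(s) for s in sums]
    total = sum(exps)
    return [e / total for e in exps]

for theta in (-1.0, 0.0, 1.0):
    probs = category_probs(theta)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(f"Theta = {theta:+.1f}: most likely score = {best}, "
          + ", ".join(f"P({k}) = {p:.2f}" for k, p in enumerate(probs)))
```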

How can I efficiently implement a scoring rubric?

It is much easier to implement a scoring rubric if your online assessment platform supports them in an online marking module, especially if the platform already has integrated psychometrics like the generalized partial credit model.  Below is an example of what an online essay marking system would look like, allowing you to efficiently implement rubrics.  It should have advanced functionality, such as allowing multiple rubrics per item, multiple raters per response, anonymity, and more.

Online marking essays

 

What about automated essay scoring?

You also have the option of using automated essay scoring; once you have some data from human raters on rubrics, you can train machine learning models to help.  Unfortunately, the world is not yet at the point where we have a droid that you can just feed a pile of student papers to grade!

 

What is Psychometrics Supposed to be Doing?

Today I read an article in The Industrial-Organizational Psychologist (the colloquial journal published by the Society for Industrial Organizational Psychology) that really resonated with me.

Has Industrial-Organizational Psychology Lost Its Way?
-Deniz S. Ones, Robert B. Kaiser, Tomas Chamorro-Premuzic, Cicek Svensson

Why?  Because I think a lot of the points they are making are also true about the field of Psychometrics and our innovation.  They summarize their point in six bullet points that they suggest present a troubling direction for their field.  Though honestly, I suppose a lot of Academia falls under these, while some great innovation is happening over on some free MOOCs and the like because they aren’t fettered by the chains of the purely or partially academic world.

  • An overemphasis on theory
  • A proliferation of, and fixation on, trivial methodological minutiae
  • A suppression of exploration and a repression of innovation
  • An unhealthy obsession with publication while ignoring practical issues
  • A tendency to be distracted by fads
  • A growing habit of losing real-world influence to other fields

 

So what is psychometrics supposed to be doing?

The part that has irked me the most about Psychometrics over the years is the overemphasis on theory and minutiae rather than solving practical problems.  This is the main reason I stopped attending the NCME conference and instead attend practical conferences like ATP.  It stems from my desire to improve the quality of assessment throughout the world.  Development of esoteric DIF methodology, new multidimensional IRT models, or a new CAT sub-algorithm when there are already dozens and the new one offers a 0.5% increase in efficiency… stuff like that isn't going to impact all the terrible assessment being done in the world and the terrible decisions being made about people based on those assessments.  Don't get me wrong, there is a place for that substantive research, but I feel the practical side is underserved.

The Goal: Quality Assessment


And it's that point that is driving the work that I do.  There is a lot of mediocre or downright bad assessment out there in the world.  I once talked to a pre-employment testing company and asked if I could help implement strong psychometrics to improve their tests as well as validity documentation.  Their answer?  It was essentially "No thanks, we've never been sued, so we're OK where we are."  Thankfully, they fell in the mediocre category rather than the downright bad category.

Of course, in many cases, there is simply a lack of incentive to produce quality assessment.  Higher Education is a classic case of this.  Professional schools (e.g., Medicine) often have accreditation tied in some part to demonstrating quality assessment of their students.  There is typically no such constraint on undergraduate education, so your Intro to Psychology and freshman English Comp classes still do assessment the same way they did 40 years ago… with no psychometrics whatsoever.  Many small credentialing organizations lack incentive too, until they decide to pursue accreditation.

I like to describe the situation this way: take all the assessments of the world and give each a percentile rank in psychometric quality.  The top 5% are the big organizations, such as Nursing licensure in the US, that have in-house psychometricians, large volumes, and huge budgets.  We don't have to worry about them, as they will be doing good assessment (and that substantive research I mentioned might be of use to them!).  The bottom 50% or more are like university classroom assessments.  They'll probably never use real psychometrics.  I'm concerned about the 50th to 95th percentiles.

Example: Credentialing

A great example of this level is the world of credentialing.  There are a TON of poorly constructed licensure and certification tests that are being used to make incredibly important decisions about people's lives.  Some are poor simply because the organization is for-profit and doesn't care.  Some are caused by external constraints.  I once worked with the Department of Agriculture of a western US state, where the legislature mandated that licensure tests be given for certain professions, even though only about 3 people per year took some tests.

So how do we get groups like that to follow best practices in assessment?  In the past, the only way to get psychometrics done was for them to pay a consultant a ton of money that they don't have.  Why spend $5k on an Angoff study or classical test report for 3 people/year?  I don't blame them.  The field of Psychometrics needs to find a way to help such groups.  Otherwise, the tests are low quality and they are giving licenses to unqualified practitioners.

There are some bogus providers out there, for sure.  I’ve seen Certification delivery platforms that don’t even store the examinee responses, which would be necessary to do any psychometric analysis whatsoever.  Obviously they aren’t doing much to help the situation.  Software platforms that focus on things like tracking payments and prerequisites simply miss the boat too.  They are condoning bad assessment.

Similarly, mathematically complex advancements such as multidimensional IRT are of no use to this type of organization.  They're not helping the situation.

An Opportunity for Innovation


I think there is still a decent amount of innovation in our field.  There are organizations that are doing great work to develop innovative items, psychometrics, and assessments.  However, it is well known that large corporations will snap up fresh PhDs in Psychometrics and then lock them in a back room to do uninnovative work like run SAS scripts or conduct Angoff studies over and over and over.  This happened to me and after only 18 months I was ready for more.

Unfortunately, I have found that a lot of innovation is not driven by producing good measurement.  I was in a discussion on LinkedIn where someone was pushing gamification for assessments and declared that measurement precision was of no interest.  This, of course, is ludicrous.  It’s OK to produce random numbers as long as the UI looks cool for students?

Innovation in Psychometrics at ASC

Much of the innovation at ASC is targeted towards the issue I have presented here.  I originally developed Iteman 4 and Xcalibre 4 to meet this type of usage.  I wanted to enable an organization to produce professional psychometric analysis reports on their assessments without having to pay massive amounts of money to a consultant.  Additionally, I wanted to save time; there are other software programs which can produce similar results, but they drop them in text files or Excel spreadsheets instead of Microsoft Word, which is of course what everyone would use to draft a report.

Much of our FastTest platform is designed with a similar bent.  Tired of running an Angoff study with items on a projector and the SMEs writing all their ratings with pencil and paper, only to be transcribed later?  Well, you can do this online.  Moreover, because it is online, you can use the SMEs remotely rather than paying to fly them into a central office.  Want to publish an adaptive (CAT) exam without writing code?  We have it built directly into our test publishing interface.

Back to My Original Point

So the title is "What is Psychometrics Supposed to be Doing?" with regard to psychometric innovation.  My answer, of course, is improving assessment.  The issue I take with the mathematically advanced research is that it is only relevant for that top 5% of organizations mentioned earlier.  It's also our duty as psychometricians to find better ways to help the other 95%.

What else can we be doing?  I think the future here is automation.  Iteman 4 and Xcalibre 4, as well as FastTest, were really machine learning and automation platforms before those things became so in vogue.  As the SIOP article mentioned at the beginning talks about, other scholarly areas like Big Data are gaining more real-world influence even if they are doing things that Psychometrics has done for a long time.  Item Response Theory is a form of machine learning, and it's been around for 50 years!

 

SIFT: Test Security and Data Forensics

Test fraud is an extremely common occurrence.  We’ve all seen articles about examinee cheating.  However, there are very few defensible tools to help detect it.  I once saw a webinar from an online testing provider that proudly touted their reports on test security… but it turned out that all they provided was a simple export of student answers that you could subjectively read and form conjectures.  The goal of SIFT is to provide a tool that implements real statistical indices from the corpus of scientific research on statistical detection of test fraud, yet is user-friendly enough to be used by someone without a PhD in psychometrics and experience in data forensics.  SIFT still provides more collusion indices and other analysis than any other software on the planet, making it the standard in the industry from the day of its release.  The science behind SIFT is also being implemented in our world-class online testing platform, FastTest.  It is also worth noting that FastTest supports computerized adaptive testing, which is known to increase test security.

Interested?  Download a free trial version of SIFT!

What is Test Fraud?

As long as tests have been around, people have been trying to cheat them.  This is only natural; anytime there is a system with some sort of stakes/incentive involved (and maybe even when not), people will try to game that system.  Note that the root culprit is the system itself, not the test. Blaming the test is just shooting the messenger.  However, in most cases, the system serves a useful purpose.  In the realm of assessment, that means that K12 assessments provide useful information on curriculum and teachers, certification tests identify qualified professionals, and so on.  In such cases, we must minimize the amount of test fraud in order to preserve the integrity of the system.

When it comes to test fraud, the old cliche is true: an ounce of prevention is worth a pound of cure. You'll undoubtedly see that phrase at conferences and in other resources.  So I of course recommend that your organization implement reasonable preventative measures to deter test fraud.  Nevertheless, there will still always be some cases.  SIFT is intended to help find those.  Some examinees might also be deterred by the knowledge that such analysis is even being done.

How can SIFT help me with statistical detection of test fraud?

Like other psychometric software, SIFT does not interpret results for you.  For example, software for item analysis like  Iteman  and  Xcalibre  do not specifically tell you which items to retire or revise, or how to revise them.  But they provide the output necessary for a practitioner to do so.  SIFT provides you a wide range of output that can help you find different types of test fraud, like copying, proctor help, suspect test centers, brain dump usage, etc.  It can also help find other issues, like low examinee motivation.  But YOU have to decide what is important to you regarding statistical detection of test fraud, and look for relevant evidence.  More information on this is provided in the manual, but here is a glimpse.

SIFT test security data forensics

First, there are a number of indices you can evaluate, as you see above.  SIFT will calculate those collusion indices for each pair of students and summarize the number of flags.

sift collusion index analysis

A certification organization could use  SIFT  to look for evidence of brain dump makers and takers by evaluating similarity between examinee response vectors and answers from a brain dump site – especially if those were intentionally seeded by the organization!  We also might want to find adjacent examinees or examinees in the same location that group together in the collusion index output.  Unfortunately, these indices can differ substantially in their conclusions.
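To illustrate the pairwise idea, here is a toy Python sketch of one simple descriptive statistic: counting identical responses and identical incorrect responses for each pair of examinees.  This is only an illustration of the general approach, not any specific index implemented in SIFT.

```python
# Toy pairwise comparison of response strings: identical and identical-incorrect counts
from itertools import combinations

key = "ABCDA"                 # correct answers for 5 items (invented)
responses = {
    "Exam1": "ABCCA",
    "Exam2": "ABCCA",         # matches Exam1 exactly, including the error on item 4
    "Exam3": "ABDDA",
}

def pair_stats(r1, r2):
    same = sum(x == y for x, y in zip(r1, r2))
    same_incorrect = sum(x == y and x != k for x, y, k in zip(r1, r2, key))
    return same, same_incorrect

for (name1, r1), (name2, r2) in combinations(responses.items(), 2):
    same, same_inc = pair_stats(r1, r2)
    print(f"{name1} vs {name2}: {same} identical responses, "
          f"{same_inc} identical incorrect responses")
```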

Additionally, you might want to evaluate time data.  SIFT  provides this as well.

sift time analysis

Finally, we can roll up many of these statistics to the group level.  Below is an example that provides a portion of SIFT output regarding teachers.  Note that Gutierrez has suspiciously high scores without spending much more time.  Cheating?  Possibly.  On the other hand, that is the smallest N, so perhaps the teacher just had a group of accelerated students.  Worthington also had high scores but notably shorter times – perhaps the teacher was helping?

sift group analysis

 

The Story of SIFT

I started  SIFT  in 2012.  Years ago, ASC sold a software program called  Scrutiny!  We had to stop selling it because it did not work on recent versions of Windows, but we still received inquiries for it.  So I set out to develop a program that could perform the analysis from  Scrutiny! (the Bellezza & Bellezza index) but also much more.  I quickly finished a few collusion indices.  Then unfortunately I had to spend a few years dealing with the realities of business, wasting hundreds of hours in pointless meetings and other pitfalls.  I finally set a goal to release SIFT in July 2016.

Version 1.0 of  SIFT  includes 10 collusion indices (5 probabilistic, 5 descriptive), response time analysis, group level analysis, and much more to aid in the statistical detection of test fraud.  This is obviously not an exhaustive list of the analyses from the literature, but still far surpasses other options for the practitioner, including the choice to write all your own code.  Suggestions?  I’d love to hear them.