Posts on psychometrics: The Science of Assessment


Classical test theory is a century-old paradigm for psychometrics – using quantitative and scientific processes to develop and analyze assessments so as to improve their quality.  (Nobody likes unfair tests!)  The most basic and most frequently used item statistic from classical test theory is the P-value.  It is usually called item difficulty, but is sometimes called item facility, which can lead to confusion.

The P-Value Statistic

The classical P-value is the proportion of examinees that respond correctly to a question, or respond in the “keyed direction” for items where the notion of correct is not relevant (imagine a personality assessment where all questions are Yes/No statements such as “I like to go to parties” … Yes is the keyed direction for an Extraversion scale).  Note that this is NOT the same as the p-value that is used in hypothesis testing from general statistical methods.  This P-value is almost universally agreed upon in terms of calculation.  But some people call it item difficulty and others call it item facility.  Why?

It has to do with clarity of interpretation.  It usually makes sense to think of difficulty as an important aspect of the item.  The P-value presents this, but in a reversed manner.  We usually expect higher values to indicate more of something, right?  But a P-value of 1.00 is high, and it means that there is not much difficulty; everyone gets the item correct, so there is no difficulty whatsoever.  A P-value of 0.25 is low, but it means that there is a lot of difficulty; only 25% of examinees are getting it correct, so it has quite a lot of difficulty.
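Computationally, the P-value could not be simpler.  Here is a minimal sketch in Python, assuming a hypothetical scored response matrix where rows are examinees, columns are items, and 1 means a response in the keyed direction:

```python
# Classical P-value per item: the proportion of examinees answering in
# the keyed direction.  Hypothetical data: 4 examinees x 3 items.
scored_responses = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]

num_examinees = len(scored_responses)
p_values = [
    sum(row[item] for row in scored_responses) / num_examinees
    for item in range(len(scored_responses[0]))
]
print(p_values)  # [1.0, 0.5, 0.5] -- the first item is very easy (high facility)
```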

So where does “item facility” come in?

See how the meaning is reversed?  It’s for this reason that some psychometricians prefer to call it item facility or item easiness.  We still use the P-value, but 1.00 means high facility/easiness, and 0.25 means low facility/easiness.  The direction of the semantics fits much better.

Nevertheless, this is a minority of psychometricians.  There’s too much momentum to change an entire field at this point!  It’s similar to the three dichotomous IRT parameters (a, b, c); some of you might have noticed that they are actually in the wrong order, because the 1-parameter model does not use the a parameter, it uses the b.

At the end of the day, it doesn’t really matter, but it’s another good example of how we all just got used to doing something and it’s now too far down the road to change it.  Tradition is a funny thing.

Have you heard about standard setting approaches such as the Hofstee method, or perhaps the Angoff, Ebel, Nedelsky, or Bookmark methods?  There are certainly various ways to set a defensible cutscore for a professional credentialing or pre-employment test.  Today, we are going to discuss the Hofstee method.  You may also be interested in reading this introductory post on setting a cutscore using item response theory.

Why Standard Setting?

Certification organizations that care about the quality of their examinations need to follow best practices and international standards for test development, such as the Standards laid out by the National Commission for Certifying Agencies (NCCA).  One component of that is standard setting, also known as cutscore studies.  One of the most common and respected approaches for that is the modified-Angoff methodology.

However, the Angoff approach has one flaw: the subject matter experts (SMEs) tend to expect too much of minimally competent candidates, and sometimes set a cutscore so high that even they themselves would not pass the exam.  There are several reasons this can occur.  For example, raters might think “I would expect anyone that worked for me to know how to do this” and not consider the fact that people who work for them might have 10 years of experience, while test candidates could be fresh out of training/school and might have had the topic touched on for only 5 minutes.  SMEs often forget what it was like to be a much younger and inexperienced version of themselves.

For this reason, several compromise methods have been suggested to compare the Angoff-recommended cutscore with a “reality check” of actual score performance on the exam, allowing the SMEs to make a more informed decision when setting the official cutscore of the exam.  I like to use the Beuk method and the Hofstee method.

The Hofstee Method

One method of adjusting the cutscore based on raters’ impressions of the difficulty of the test and possible pass rates is the Hofstee method (Mills & Melican, 1987; Cizek, 2006; Burr et al., 2016).  This method requires the raters to estimate four values:

  1. The minimum acceptable failure rate
  2. The maximum acceptable failure rate
  3. The minimum cutscore, even if all examinees failed
  4. The maximum cutscore, even if all examinees passed

The first two values are failure rates, and are therefore between 0% and 100%, with 100% indicating a test that is too difficult for anyone to pass.  The latter two values are on the raw score scale, and therefore range between 0 and the number of items in the test, again with a higher value indicating a more difficult cutscore to achieve.

These values are paired, and the line that passes through the two points is estimated.  The intersection of this line with the observed failure-rate function is the recommended adjusted cutscore.
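Here is a rough sketch of that computation in Python.  The raw scores, the four rater estimates, and the 100-item test length are all hypothetical; the idea is simply to find where the Hofstee line crosses the observed failure-rate curve:

```python
# Hofstee sketch: intersect the line through (min cutscore, max failure rate)
# and (max cutscore, min failure rate) with the observed failure-rate function.
import numpy as np

rng = np.random.default_rng(42)
scores = rng.binomial(n=100, p=0.7, size=1000)   # hypothetical raw scores, 100-item test

min_fail, max_fail = 0.10, 0.50                  # rater estimates (failure rates)
min_cut, max_cut = 55, 80                        # rater estimates (raw score scale)

best_cut, best_gap = None, float("inf")
for cut in range(min_cut, max_cut + 1):
    observed_fail = float(np.mean(scores < cut))   # proportion failing at this cutscore
    # height of the Hofstee line at this cutscore (falls from max_fail to min_fail)
    line_fail = max_fail + (min_fail - max_fail) * (cut - min_cut) / (max_cut - min_cut)
    gap = abs(observed_fail - line_fail)
    if gap < best_gap:
        best_cut, best_gap = cut, gap

print(best_cut)   # the adjusted cutscore recommendation
```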

[Figure: the Hofstee method plot]

How can I use the Hofstee Method?

Unlike the Beuk, the Hofstee method does not utilize the Angoff ratings, so it represents a completely independent reality check.  In fact, it is sometimes used as a standalone cutscore setting method itself, but because it does not involve rating of every single item, I recommend it be used in concert with the Angoff and Beuk approaches.


Simulation studies are an essential step in the development of a computerized adaptive test (CAT) that is defensible and meets the needs of your organization or other stakeholders. There are three types of simulations: Monte Carlo, Real Data (post hoc), and Hybrid.

Monte Carlo simulation is the most general-purpose approach, and the one most often used early in the process of developing a CAT.  This is because it requires no actual data, either on test items or examinees – although real data is welcome if available – which makes it extremely useful in evaluating whether CAT is even feasible for your organization before any money is invested in moving forward.

Let’s begin with an overview of how Monte Carlo simulation works before we return to that point.

How a Monte Carlo simulation works: An overview

First of all, what do we mean by CAT simulation?  Well, a CAT is a test that is administered to students via an algorithm.  We can use that same algorithm on imaginary examinees, or real examinees from the past, and simulate how well a CAT performs on them.

Best of all, we can change the specifications of the algorithm to see how it impacts the examinees and the CAT performance.

Each simulation approach requires three things:

  1. Item parameters from item response theory (IRT), though new CAT methods such as diagnostic models are now being developed.
  2. Examinee scores (theta) from IRT.
  3. A way to determine how an examinee responds to an item if the CAT algorithm says it should be delivered to the examinee.

The Monte Carlo simulation approach is defined by how it addresses the third requirement: it generates a response using some sort of mathematical model, while the other two simulation approaches look up actual responses for past examinees (real-data approach) or a mix of the two (hybrid).

The Monte Carlo simulation approach only uses the response generation process.  The item parameters can either be from a bank of actual items or generated.

Likewise, the examinee thetas can be from a database of past data, or generated.

How does the response generation process work? 

Well, it differs based on the model that is used as the basis for the CAT algorithm.  Here, let’s assume that we are using the three-parameter logistic model.  Start by supposing we have a fake examinee with a true theta of 0.0.  The CAT algorithm looks in the bank and says that we need to administer item #17 as the first item, which has the following item parameters: a=1.0, b=0.0, and c=0.20.

Well, we can simply plug those numbers into the equation for the three-parameter model and obtain the probability that this person would correctly answer this item.

[Figure: item response function (IRF) for a=1.0, b=0.0, c=0.20]

The probability, in this case, is 0.6.  The next step is to generate a random number from the set of all real numbers between 0.0 and 1.0.  If that number is less than the probability of correct response, the examinee “gets” the item correct.  If greater, the examinee gets the item incorrect.  Either way, the examinee is scored and the CAT algorithm proceeds.
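As a concrete sketch, here is that step in Python, using the item parameters from the example (this formulation omits the optional 1.7 scaling constant, which does not change the 0.60 in this case because theta equals b):

```python
# Generate one simulated response under the 3PL model.
import math
import random

def prob_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

theta = 0.0                                          # true theta of the simulated examinee
p = prob_correct_3pl(theta, a=1.0, b=0.0, c=0.20)    # = 0.60 for item #17
response = 1 if random.random() < p else 0           # 1 = correct, 0 = incorrect
print(p, response)
```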

For every item that comes up to be used, we utilize this same process.  Of course, the true theta does not change, but the item parameters are different for each item.  Each time, we generate a new random number and compare it to the probability to determine a response of correct or incorrect.

The CAT algorithm proceeds as if a real examinee is on the other side of the computer screen, actually responding to questions, and stops whenever the termination criterion is satisfied.  However, the same process can be used to “deliver” linear exams to examinees; instead of the CAT algorithm selecting the next item, we just process sequentially through the test.

A road to research

For a single examinee, this process is not much more than a curiosity.  Where it becomes useful is at a large-scale, aggregate level.  Imagine the process above as part of a much larger loop.  First, we establish a pool of 200 items pulled from items used in the past by your program.  Next, we generate a set of 1,000 examinees by pulling numbers from a random distribution.

Finally, we loop through each examinee and administer a CAT by using the CAT algorithm and generating responses with the Monte Carlo simulation process.  We then have extensive data on how the CAT algorithm performed, which can be used to evaluate the algorithm and the item bank.  The two most important are the length of the CAT and its accuracy, which are a trade-off in most cases.
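A rough end-to-end sketch of that loop is below.  Everything here is an assumption for illustration – generated item parameters, simple maximum-information item selection, EAP scoring on a theta grid, and a fixed test length – whereas a real study would vary such conditions deliberately:

```python
# Monte Carlo CAT simulation: 200 generated items, 1,000 generated examinees.
import numpy as np

rng = np.random.default_rng(1)
n_items, n_examinees, test_length = 200, 1000, 30

a = rng.lognormal(0.0, 0.3, n_items)        # discrimination
b = rng.normal(0.0, 1.0, n_items)           # difficulty
c = rng.uniform(0.1, 0.25, n_items)         # pseudo-guessing
true_thetas = rng.normal(0.0, 1.0, n_examinees)

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

grid = np.linspace(-4, 4, 81)               # theta grid for EAP scoring
prior = np.exp(-grid**2 / 2)                # standard normal prior

estimates = []
for true_theta in true_thetas:
    theta_hat, used, responses = 0.0, [], []
    for _ in range(test_length):
        p_hat = p3pl(theta_hat, a, b, c)
        info = a**2 * ((p_hat - c) / (1 - c))**2 * (1 - p_hat) / p_hat   # 3PL item information
        info[used] = -np.inf                # do not reuse items
        item = int(np.argmax(info))
        used.append(item)
        # Monte Carlo response generation, as described above
        p_true = p3pl(true_theta, a[item], b[item], c[item])
        responses.append(int(rng.random() < p_true))
        # EAP update of the provisional theta estimate
        like = prior.copy()
        for it, u in zip(used, responses):
            pg = p3pl(grid, a[it], b[it], c[it])
            like *= pg**u * (1 - pg)**(1 - u)
        theta_hat = float(np.sum(grid * like) / np.sum(like))
    estimates.append(theta_hat)

print(np.corrcoef(true_thetas, estimates)[0, 1])   # accuracy of the CAT thetas
```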

So how is this useful for evaluating the feasibility of CAT?

Well, you can evaluate the performance of the CAT algorithm by setting up an experiment to compare different conditions.  What if you don’t have past items and are not even sure how many items you need?  Well, you can create several different fake item banks and administer a CAT to the same set of fake examinees.

Or you might know the item bank to be used, but need to establish that a CAT will outperform the linear tests you currently use.  There is a wide range of research questions you can ask, and since all the data is being generated, you can design a study to answer many of them.  In fact, one of the greatest problems you might face is that you can get carried away and start creating too many conditions!

How do I actually do a Monte Carlo simulation study?

Fortunately, there is software to do all the work for you.  The best option is CATSim, which provides all the options you need in a straightforward user interface (beware, this makes it even easier to get carried away).  The advantage of CATSim is that it collates the results for you and presents most of the summary statistics you need without you having to calculate them.  For example, it calculates the average test length (number of items used by a variable-length CAT) and the correlation of CAT thetas with true thetas.  Other software exists that is useful in generating data sets using Monte Carlo simulation (see SimulCAT), but it does not include this important feature.



If you are involved with certification testing and are accredited by the National Commission for Certifying Agencies (NCCA), you have come across the term decision consistency.  NCCA requires you to submit an annual report of 11 important statistics, each computed for all active test forms.  These 11 statistics provide a high-level summary of the psychometric health of each form; more on that report here.  One of the 11 is decision consistency.

What is Decision Consistency?

Decision consistency is an estimate of how consistent the pass/fail decision is on your test.  That is, if someone took your test today, had their brain wiped of that memory, and took the test again next week, what is the probability that they would obtain the same classification both times?  This is often estimated as a proportion or percentage, and we would of course hope that this number is high, but if the test is unreliable it might not be.
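As a minimal sketch, here is how you would compute it directly if you actually had two administrations for the same examinees (hypothetical scores and cutscore below); the estimation methods discussed later exist precisely because this design is rarely feasible:

```python
# Decision consistency from a test-retest design: the proportion of
# examinees receiving the same pass/fail classification both times.
def decision_consistency(scores_time1, scores_time2, cutscore):
    same = sum(
        (s1 >= cutscore) == (s2 >= cutscore)
        for s1, s2 in zip(scores_time1, scores_time2)
    )
    return same / len(scores_time1)

# hypothetical data: 6 examinees, cutscore of 70
print(decision_consistency([72, 65, 80, 69, 90, 71],
                           [75, 62, 78, 72, 88, 68], 70))   # 0.667
```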

The reasoning behind the need for an index specifically on this is that the psychometric aspect we are trying to estimate is different from the reliability of point scores (Moltner, Timbil, & Junger, 2015; Downing & Mehrens, 1978).  The argument is that examinees near the cutscore are of interest, while reliability evaluates the entire scale.  It’s for this reason that if you are using item response theory (IRT), the NCCA allows you to instead submit the conditional standard error of measurement function at the cutscore.  But all of the classical decision consistency indices evaluate all examinees, and since most candidates are not near the cutscore, this inflates the baseline.  Only the CSEM – from IRT – follows the line of reasoning of focusing on examinees near the cutscore.

An important distinction that stems from this dichotomy is that of decision consistency vs. accuracy.  Consistency refers to receiving the same pass/fail classification each time if you take the test twice.  But what we really care about is whether your pass/fail based on the test matches with your true state.  For a more advanced treatment on this, I recommend Lathrop (2015).

Indices of Decision Consistency

There are a number of classical methods for estimating an index of decision consistency that have been suggested in the psychometric literature.  A simple and classic approach is Hambleton (1972), which is based on an assumption that examinees actually take the same test twice (or equivalent forms).  Of course, this is rarely feasible in practice, so a number of methods were suggested over the next few years for estimating it from a single test administration to a given set of examinees.  These include Huynh (1976), Livingston (1972), and Subkoviak (1976).  They are fairly complex.  I once reviewed a report from a psychometrician who faked the Hambleton index because they didn’t have the skills to figure out any of the indices.

How does Decision Consistency relate to reliability?

The note I made above about unreliability is worth another visit, however.  After the rash of publications on the topic, Mellenbergh and van der Linden (1978; 1980) pointed out that if you assume a linear loss function for misclassification, the conventional estimate of reliability – coefficient alpha – serves as a solid estimate of decision consistency.  What is a linear loss function?  It means that a misclassification is worse if the person’s score is further from the cutscore.  That is, if the cutscore is 70, failing someone with a true score of 80 is twice as bad as failing someone with a true score of 75.  Of course, we never know someone’s true score, so this is a theoretical assumption, but the researchers make an excellent point.

But while research amongst psychometricians on the topic has cooled since they made that point, NCCA still requires one of these statistics – most of which date from the 1970s – to be reported.  The only other well-known index on the topic is Hanson and Brennan (1990).  While the indices have been shown to be different from classical reliability, I have yet to be convinced that they are the right approach.  Of course, I’m not much of a fan of classical test theory at all in the first place; the acceptance of the CSEM from IRT is definitely aligned with my views on how psychometrics should tackle measurement problems.


Sympson-Hetter is a method of item exposure control within the algorithm of computerized adaptive testing (CAT).  It prevents the algorithm from over-using the best items in the pool.

CAT is a powerful paradigm for delivering tests that are smarter, faster, and fairer than the traditional linear approach.  However, CAT is not without its challenges.  One is that it is a greedy algorithm that always selects your best items from the pool if it can.  The way that CAT researchers address this issue is with item exposure controls.  These are sub-algorithms that are injected into the main item selection algorithm to keep it from always using the best items.  The Sympson-Hetter method is one such approach.  Another is the randomesque method.

The Randomesque Method

[Figure: five item information functions (IIFs)]

The simplest approach is called the randomesque method.  This selects from the top X items in terms of item information (a term from item response theory), usually for the first Y items in a test.  For example, instead of always selecting the top item, the algorithm finds the 3 top items and then randomly selects between those.

The figure on the right displays item information functions (IIFs) for a pool of 5 items.  Suppose an examinee had a theta estimate of 1.40.  The 3 items with the highest information are the light blue, purple, and green lines (5, 4, 3).  The algorithm would first identify this and randomly pick amongst those three.  Without item exposure controls, it would always select Item 4.
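A minimal sketch of randomesque selection is below.  The item parameters are hypothetical, and I use the 2PL information function for simplicity; the key idea is just the random choice among the top few items:

```python
# Randomesque exposure control: pick randomly among the top X most
# informative items at the current theta estimate.
import math
import random

def iif_2pl(theta, a, b):
    """Item information under the 2PL model."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

def randomesque_select(items, theta, top_x=3):
    """items: list of (item_id, a, b) tuples; returns one of the top_x items."""
    ranked = sorted(items, key=lambda it: iif_2pl(theta, it[1], it[2]), reverse=True)
    return random.choice(ranked[:top_x])

bank = [(1, 0.8, -1.0), (2, 1.0, 0.0), (3, 1.2, 0.8), (4, 1.5, 1.4), (5, 1.1, 1.6)]
print(randomesque_select(bank, theta=1.40))   # item 4, 3, or 5 for these parameters
```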

The Sympson-Hetter Method

A more sophisticated method is the Sympson-Hetter method.

Here, the user specifies a target proportion as a parameter for the selection algorithm.  For example, we might decide that we do not want an item seen by more than 75% of examinees.  So, every time that the CAT algorithm goes into the item pool to select a new item, we generate a random number between 0 and 1, which is then compared to the threshold.  If the number is between 0 and 0.75 in this case, we go ahead and administer the item.  If the number is from 0.75 to 1.0, we skip over it and go on to the next most informative item in the pool, though we then do the same comparison for that item.
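Here is that logic as a small sketch, following the description above (the operational Sympson-Hetter procedure additionally tunes these parameters through iterative simulations):

```python
# Sympson-Hetter administration check: each candidate item must pass a
# random lottery against its exposure parameter before being delivered.
import random

def select_with_sympson_hetter(ranked_items, exposure_params):
    """ranked_items: item ids sorted by information, best first.
    exposure_params: dict of item id -> target proportion (0 to 1)."""
    for item in ranked_items:
        if random.random() < exposure_params.get(item, 1.0):
            return item              # lottery passed: administer this item
        # lottery failed: skip to the next most informative item
    return ranked_items[-1]          # fallback if every lottery fails

ranked = [17, 42, 8, 23]                    # hypothetical, best item first
params = {17: 0.75, 42: 0.40, 8: 1.0}       # item 23 defaults to 1.0 (no limit)
print(select_with_sympson_hetter(ranked, params))
```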

Why do this?  It obviously limits the exposure of the item.  But just how much it limits it depends on the difficulty of the item.  A very difficult item is likely only going to be a candidate for selection for very high-ability examinees.  Let’s say it’s the top 4%… well, then the approach above will limit it to 3% of the sample overall, but 75% of the examinees in its neighborhood.

On the other hand, an item of middle difficulty is used not only for middle examinees but often for any examinee.  Remember, unless there are some controls, the first item for the test will be the same for everyone!  So if we apply the Sympson-Hetter rule to that item, it limits it to 75% exposure in a more absolute sense.

Because of this, you don’t have to set that threshold parameter to the same value for each item.  The original recommendation was to do some CAT simulation studies, then set the parameters thoughtfully for different items.  Items that are likely to be highly exposed (middle difficulty with high discrimination) might deserve a more strict parameter like 0.40.  On the other hand, that super-difficult item isn’t an exposure concern because only the top 4% of students see it anyway… so we might leave its parameter at 1.0 and therefore not limit it at all.

Is this the only method available?

No.  As mentioned, there’s that simple randomesque approach.  But there are plenty more.  You might be interested in this paper, this paper, or this paper.  The last one reviews the research literature from 1983 to 2005.

What is the original reference?

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

How can I apply this to my tests?

Well, you certainly need a CAT platform first.  Our platform at ASC allows this method right out of the box – that is, all you need to do is enter the target proportion when you publish your exam, and the Sympson-Hetter method will be implemented.  No need to write any code yourself!  Click here to sign up for a free account.


The standard error of the mean is one of the three main standard errors in psychometrics and psychology.  Its purpose is to help conceptualize the error in estimating the mean of some population based on a sample.  The SEM is a well-known concept from the general field of statistics, used in an untold number of applications.

For example, a biologist might catch a number of fish from a lake, measure their length, and use that data to determine the average size of fish in the lake.  In psychometrics and psychology, we usually utilize data from some measurement of people.

For example, suppose we have a population of 5 employees with the scores below on a 10-item assessment on safety procedures.

Employee   Score
1          4
2          6
3          6
4          8
5          7

Then let’s suppose we are drawing a sample of 4.  There are 5 different ways to do this, but let’s just say it’s the first 4.  The average score is (4+6+6+8)/4=6, and the standard deviation for the sample is 1.63.  The standard error of the mean says that

   SEM=SD/sqrt(n) = 1.63/sqrt(4) = 0.815.

Because a 95% confidence interval is 1.96 standard errors around the average, and the average is 6, this says we expect the true mean of the distribution to be somewhere between 4.40 and 7.60.  Obviously, this is not very exact!  There are two reasons for that in this case: the SD is relatively large (1.63 on a scale of 0 to 10), and the N is only 4.  If you had a sample N of 10,000, for example, you’d be dividing the 1.63 by 100, leading to an SEM of only 0.0163.
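The worked example, as a quick sketch in Python (the 0.816 here matches the 0.815 above up to rounding of the SD):

```python
# Standard error of the mean for the sample of 4 employees.
import statistics

sample = [4, 6, 6, 8]
mean = statistics.mean(sample)                  # 6.0
sd = statistics.stdev(sample)                   # 1.633 (n-1 denominator)
se_mean = sd / len(sample) ** 0.5               # 0.816
ci = (mean - 1.96 * se_mean, mean + 1.96 * se_mean)
print(round(se_mean, 3), [round(x, 2) for x in ci])   # 0.816 [4.4, 7.6]
```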

How is this useful?  Well, even with the wide range of 4.40 to 7.60, this tells us that the average score is fairly low, and that our employees probably need some training on safety procedures.  The other two standard errors in psychology/psychometrics are more complex and more useful, though.  I’ll be covering those in future posts.


One of my graduate school mentors once said in class that there are three standard errors that everyone in the assessment or I/O psych field needs to know: the standard errors of the mean, of measurement, and of the estimate.  They are quite distinct in concept and application, but easily confused by someone with minimal training.

I’ve personally seen the standard error of the mean reported as the standard error of measurement, which is completely unacceptable.

So in this post, I’ll briefly describe each so that the differences are clear.  In later posts, I’ll delve deeper into each of the standard errors.

Standard Error of the Mean

This is the standard error that you learned about in Introduction to Statistics back in your sophomore year of college/university.  It is related to the Central Limit Theorem, the cornerstone of statistics.  Its purpose is to provide an index of accuracy (or conversely, error) in a sample mean.  Any sample drawn from a population will have an average, but these averages vary from sample to sample.  The standard error of the mean estimates the variation we might expect in these different means from different samples and is defined as

   SEmean = SD/sqrt(n)

Where SD is the sample’s standard deviation and n is the number of observations in the sample.  This can be used to create a confidence interval for the true population mean.

The most important thing to note, with respect to psychometrics, is that this has nothing to do with psychometrics.  This is just general statistics.  You could be weighing a bunch of hay bales and calculating their average; anything where you are making observations.  It can be used, however, with assessment data.

For example, if you do not want to make every student in a country take a test, and instead sample 50,000 students, who obtain a mean of 71 items correct with an SD of 12.3, then the SEmean is 12.3/sqrt(50000) = 0.055.  You can be 95% certain that the true population mean lies in the narrow range of 71 +- 0.055.

Click here to read more.

Standard Error of Measurement

More important in the world of assessment is the standard error of measurement.  Its purpose is to provide an index of the accuracy of a person’s score on a test – a single person, rather than a group as with the standard error of the mean.  It can be used in both the classical test theory perspective and the item response theory perspective, though it is defined quite differently in the two.

In classical test theory, it is defined as

   SEM = SD*sqrt(1-r)

Where SD is the standard deviation of scores for everyone who took the test, and r is the reliability of the test.  It can be interpreted as the standard deviation of scores that you would find if you had the person take the test over and over, with a fresh mind each time.  A confidence interval with this is then interpreted as the band where you would expect the person’s true score on the test to fall.
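As a quick sketch, with hypothetical values of SD = 10 and reliability = 0.91:

```python
# Classical SEM and a 95% confidence band around an observed score.
sd, reliability = 10.0, 0.91
sem = sd * (1 - reliability) ** 0.5      # 10 * sqrt(0.09) = 3.0
observed_score = 75
band = (observed_score - 1.96 * sem, observed_score + 1.96 * sem)
print(sem, band)                         # 3.0 (69.12, 80.88)
```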

Item Response Theory conceptualizes the SEM as a continuous function across the range of student abilities.  A test form will have more accuracy – less error – in a range of abilities where there are more items or items of higher quality.  That is, a test with most items of middle difficulty will produce accurate scores in the middle of the range, but not measure students on the top or bottom very well.  The example below is a test that has many items above the average examinee score (θ) of 0.0 so that any examinee with a score of less than 0.0 has a relatively inaccurate score, namely with an SEM greater than 0.50.

[Figure: standard error of measurement and test information function]

 

For a deeper discussion of SEM, click here. 

Standard Error of the Estimate

Lastly, we have the standard error of the estimate.  This is an estimate of the accuracy of a prediction that is made, usually in the paradigm of linear regression.  Suppose we are using scores on a 40 item job knowledge test to predict job performance, and we have data on a sample of 1,000 job incumbents that took the test last year and have job performance ratings from this year on a measure that entails 20 items scored on a 5 point scale for a total of 100 points.

There might have been 86 incumbents that scored 30/40 on the test, and they will have a range of job performance, let’s say from 61 to 89.  If a new person takes the test and scores 30/40, how would we predict their job performance?

The SEE is defined as

       SEE = SDy * sqrt(1 - r^2)

Here, r is the correlation of x and y, not a reliability.  Many statistical packages can estimate linear regression, the SEE, and many other related statistics for you.  In fact, Microsoft Excel comes with a free add-in to implement simple linear regression.  Excel estimates the SEE as 4.69 in the example above, and the regression slope and intercept as 1.76 and 29.93, respectively.

Given this, we can estimate the job performance of a person with a test score of 30 to be 29.93 + 1.76*30 = 82.73.  A 95% confidence interval for that candidate is then 82.73-(4.69*1.96) to 82.73+(4.69*1.96), or 73.54 to 91.92.
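The same prediction as a sketch, plugging in the slope, intercept, and SEE reported above:

```python
# Prediction and 95% interval from a simple linear regression.
slope, intercept, see = 1.76, 29.93, 4.69     # estimates from the example above

test_score = 30
predicted = intercept + slope * test_score    # 82.73
interval = (predicted - 1.96 * see, predicted + 1.96 * see)
print(round(predicted, 2), [round(x, 2) for x in interval])   # 82.73 [73.54, 91.92]
```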

You can see how this might be useful in prediction situations.  Suppose we wanted to be sure that we hired only people who are likely to have a job performance rating of 80 or better.  A cutscore of 30 on the test would then be quite feasible.

OK, so now what?

Well, remember that these three standard errors are quite different and are not even used in the same situations.  When you see a standard error requested – for example, if you must report the standard error for an assessment – make sure you use the right one!


In the past decade, terms like machine learning, artificial intelligence, and data mining have become major buzzwords as computing power, APIs, and the massively increased availability of data enable new technologies like self-driving cars.  However, we’ve been using methodologies like machine learning in psychometrics for decades.  So much of the hype is just hype.

So, what exactly is Machine Learning?

Unfortunately, there is no widely agreed-upon definition, and as Wikipedia notes, machine learning is often conflated with data mining. A broad definition from Wikipedia is that machine learning explores the study and construction of algorithms that can learn from and make predictions on data. It’s often divided into supervised learning, where a researcher drives the process, and unsupervised learning, where the computer is allowed to creatively run wild and look for patterns. The latter isn’t of much use to us, at least yet.

Supervised learning includes specific topics like regression, dimensionality reduction, and anomaly detection, which we obviously have in psychometrics.  But it’s the general definition above that really fits what psychometrics has been doing for decades.

What is Machine Learning in Psychometrics?

We can’t cover all the ways that machine learning and related topics are used in psychometrics and test development, but here’s a sampling.  My goal is not to cover them all but to point out that this is old news and that we should not get hung up on buzzwords, fads, and marketing schticks – but by all means, we should continue to drive in this direction.

Dimensionality Reduction

One of the first, and most straightforward, areas is dimensionality reduction.  Given a bunch of unstructured data, how can we find some sort of underlying structure, especially one based on latent dimensions?  We’ve been doing this, utilizing methods like cluster analysis and factor analysis, since Spearman first started investigating the structure of intelligence 100 years ago.  In fact, Spearman helped invent those approaches to solve the problems that he was trying to address in psychometrics, which was a new field at the time and had no methodology yet.  How seminal was this work in psychometrics for the field of machine learning in general?  The Coursera MOOC on Machine Learning uses Spearman’s work as an example in one of the early lectures!

Classification

Classification is a typical problem in machine learning. A common example is classifying images, and the classic dataset is the MNIST handwriting set (though Silicon Valley fans will think of the “not hot dog” algorithm). Given a bunch of input data (image files) and labels (what number is in the image), we develop an algorithm that most effectively can predict future image classification.  A closer example to our world is the iris dataset, where several quantitative measurements are used to predict the species of a flower.

The contrasting groups method of setting a test cutscore is a simple example of classification in psychometrics. We have a training set where examinees are already classified as pass/fail by a criterion other than test score (which of course rarely happens, but that’s another story), and use mathematical models to find the cutscore that most efficiently divides them. Not all standard setting methods take a purely statistical approach; understandably, the cutscores cannot be decided by an arbitrary computer algorithm like support vector machines or they’d be subject to immediate litigation. Strong use of subject matter experts and integration of the content itself is typically necessary.
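To make the parallel concrete, here is a deliberately naive sketch of the purely statistical version – scan candidate cutscores and keep the one that minimizes misclassification against the external labels.  The data are hypothetical, and as noted above, a defensible study would never stop at this step:

```python
# Contrasting groups as a classification problem: minimize misclassification.
def contrasting_groups_cutscore(scores, labels):
    """scores: test scores; labels: 1 = competent per the external criterion."""
    best_cut, best_errors = None, float("inf")
    for cut in range(min(scores), max(scores) + 2):
        errors = sum(
            1 for s, lab in zip(scores, labels)
            if (s >= cut) != bool(lab)     # passed-but-not-competent, or vice versa
        )
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

scores = [12, 15, 18, 20, 22, 25, 27, 30]     # hypothetical
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(contrasting_groups_cutscore(scores, labels))   # 19 for this data
```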

Of course, all tests that seek to assign examinees into categories like Pass/Fail are addressing the classification problem. Some of my earliest psychometric work on the sequential probability ratio test and the generalized likelihood ratio was in this area.

One of the best examples of supervised learning for classification, but much more advanced than the contrasting groups method, is automated essay scoring, which has been around for about two decades.  It has all the classic trappings: a training set where the observations are classified by humans first, and then mathematical models are trained to best approximate the humans.  What makes it more complex is that the predictor data is now long strings of text (student essays) rather than a single number.

Anomaly Detection

The most obvious way this is used in our field is psychometric forensics, trying to find examinees that are cheating or some other behavior that warrants attention. But we also use it to evaluate model fit, possibly removing items or examinees from our data set.

Using Algorithms to Learn/Predict from Data

[Figure: five item response functions]

Item response theory is a great example of the general definition.  With IRT, we are certainly using a training set, which we call a calibration sample.  We use it to train some models, which are then used to make decisions in future observations, primarily scoring examinees that take the test by predicting where those examinees would fall in the score distribution of the calibration sample.  IRT is also applied to solve more sophisticated algorithmic problems: computerized adaptive testing and automated test assembly are fantastic examples.  We use IRT more generally to learn from the data: which items are most effective, which are not, the ability range where the test provides the most precision, and so on.

What differs from the Classification problem is that we don’t have a “true state” of labels for our training set. That is, we don’t know what the true scores are of the examinees, or if they are truly a “pass” or a “fail” – especially because those terms can be somewhat arbitrary. It is for this reason we rely on a well-defined model with theoretical reasons for it fitting our data, rather than just letting a machine learning toolkit analyze it with any model it feels like.

Arguably, classical test theory also fits this definition.  We have a very specific mathematical model that is used to learn from the data, including which items are stronger or more difficult than others, and how to construct test forms to be statistically equivalent.  However, its use of prediction is much weaker.  We do not predict where future examinees would fall in the distribution of our calibration set.  The fact that it is test-form-specific hampers its generalizability.

Reinforcement learning

The Wikipedia article also mentions reinforcement learning. This is used less often in psychometrics because test forms are typically published with some sort of finality. That is, they might be used in the field for a year or two before being retired, and no data is analyzed in that time except perhaps some high level checks like the NCCA Annual Statistical Report. Online IRT calibration is a great example, but is rarely used in practice. There, response data is analyzed algorithmically over time, and used to estimate or update the IRT parameters. Evaluation of parameter drift also fits in this definition.

Use of Test Scores

We also use test scores “outside” the test in a machine learning approach. A classic example of this is using pre-employment test scores to predict job performance, especially with additional variables to increase the incremental validity. But I’m not going to delve into that topic here.

Automation

Another huge opportunity for machine learning in psychometrics, and one that is highly related, is automation.  That is, programming computers to do tasks more effectively or efficiently than humans.  Automated test assembly and automated essay scoring are examples of this, but there are plenty of ways that automation can help that are less “cool” but have more impact.  My favorite is the creation of psychometric reports; Iteman and Xcalibre do not produce numbers that are unavailable in other software, but they automatically build you a draft report in MS Word, with all the tables, graphs, and narratives already embedded.  That is unique.  Without that automation, organizations would typically pay a PhD psychometrician to spend hours of time on copy-and-paste, which is an absolute shame.  The goal of my mentor, Prof. David Weiss, and myself is to automate the test development cycle as a whole: job analysis, test design, item writing, item review, standard setting, form assembly, test publishing, test delivery, and scoring.  There’s no reason people should be allowed to continue making bad tests, and then using those tests to ruin people’s lives, when we know so much about what makes a decent test.

Summary

I am sure there are other areas of psychometrics and the testing industry that are soon to be disrupted by technological innovations such as this. What’s next?

As this article notes, the future direction is about the systems being able to learn on their own rather than being programmed; that is, more towards unsupervised learning than supervised learning. I’m not sure how well that fits with psychometrics.

But back to my original point: psychometrics has been a data-driven field since its inception a century ago. In fact, we contributed some of the methodology that is used generally in the field of machine learning and data analytics. So it shouldn’t be any big news when you hear terms like machine learning, data mining, AI, or dimensionality reduction used in our field! In contrast, I think it’s more important to consider how we remove roadblocks to more widespread use.

One of the hurdles we still need to overcome for machine learning in psychometrics is simply how to get more organizations doing what has been considered best practice for decades.  There are two types of problem organizations.  The first type is one that does not have the sample sizes or budget to deal with methodologies like I’ve discussed here.  The salient example I always think of is a state licensure test required by law for a niche profession that might have only 3 examinees per year (I have talked with such programs!).  Not much we can do there.  The second type is those organizations that indeed have large sample sizes and a decent budget, but are still doing things the same way they did them 30 years ago.  How can we bring modern methods and innovations to these organizations?  Doing so will definitely only make their tests more effective and fairer.


Cognitive diagnostic models are a psychometric paradigm for designing and scoring tests with the goal of providing a profile of examinee skill mastery rather than just an overall test score.

CDMs are an area of psychometric research that has seen substantial growth in the past decade, though the mathematics behind them date back to Macready and Dayton (1977).  The reason that they have been receiving more attention is that in many assessment situations, a simple overall score does not serve our purposes and we want a finer evaluation of the examinee’s skills or traits.  For example, the purpose of formative assessment in education is to provide feedback to students on their strengths and weaknesses, so an accurate map of these is essential.  In contrast, a professional certification/licensure test focuses on a single overall score with a pass/fail decision.

What are cognitive diagnostic models?

The predominant psychometric paradigm since the 1980s is item response theory (IRT), which is also known as latent trait theory.  Cognitive diagnostic models are part of a different paradigm known as latent class theory.  Instead of assuming that we are measuring a single, neatly unidimensional factor, latent class theory tries to assign examinees into more qualitative groups by determining how they are categorized along a number of axes.

What this means is that the final “score” we hope to obtain for each examinee is not a single number, but a profile of which axes they have mastered and which they have not.  The axes could be a number of different psychoeducational constructs, but are often used to represent cognitive skills examinees have learned.  Because we are trying to diagnose strengths vs. weaknesses, we call it a cognitive diagnostic model.

Example: Fractions

A classic example you might see in the literature is a formative assessment on dealing with fractions in mathematics. Suppose you are designing such a test, and the curriculum includes these teaching points, which are fairly distinct skills or pieces of knowledge.

  1. Find the lowest common denominator
  2. Add fractions
  3. Subtract fractions
  4. Multiply fractions
  5. Divide fractions
  6. Convert mixed number to improper fraction

Now suppose this is one of the questions on the test.

 What is 2 3/4 + 1 1/2?

 

This item utilizes skills 1, 2, and 6.  We can apply a similar mapping to all items and obtain a table, with one column per skill from the list above.  Researchers call this the “Q-matrix.”  Our example item is Item 1 here.  You’d create your own items and tag them appropriately.

Item      Skill 1   Skill 2   Skill 3   Skill 4   Skill 5   Skill 6
Item 1       X         X                                        X
Item 2       X                    X
Item 3                                      X                   X
Item 4                                                X         X
 

So how do we obtain the examinee’s skill profile?

This is where the fun starts.  I used the plural cognitive diagnostic models because there are a number of available models, just like in item response theory we have the Rasch, 2-parameter, 3-parameter, and generalized partial credit models, and more.  Choice of model is up to the researcher and depends on the characteristics of the test.

The simplest model is the DINA model, which has two parameters per item.  The slippage parameter s refers to the probability that a student will get the item wrong even though they have all the required skills.  The guessing parameter g refers to the probability that a student will get the item right even though they do not.

The mathematical calculations for determining the skill profile are complex, and are based on maximum likelihood.  To determine the skill profile, we need to first find all possible profiles, calculate the likelihood of each (based on item parameters and the examinee response vector), then select the profile with the highest likelihood.
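Here is a minimal sketch of that scoring logic, using the Q-matrix row for Item 1 from the example above plus two hypothetical items, hypothetical slip/guess values, and one response vector:

```python
# DINA scoring: enumerate all skill profiles and keep the most likely one.
from itertools import product

Q = [
    [1, 1, 0, 0, 0, 1],    # Item 1 requires skills 1, 2, and 6 (as above)
    [1, 0, 1, 0, 0, 0],    # hypothetical
    [0, 0, 0, 1, 0, 0],    # hypothetical
]
slip = [0.10, 0.15, 0.10]      # s parameter per item (hypothetical)
guess = [0.20, 0.25, 0.20]     # g parameter per item (hypothetical)
responses = [1, 0, 1]          # one examinee's scored responses

best_profile, best_like = None, 0.0
for profile in product([0, 1], repeat=6):       # all 2^6 = 64 skill profiles
    likelihood = 1.0
    for j, x in enumerate(responses):
        has_all = all(profile[k] for k in range(6) if Q[j][k])
        p = (1 - slip[j]) if has_all else guess[j]    # P(correct) under DINA
        likelihood *= p if x == 1 else (1 - p)
    if likelihood > best_like:                  # ties keep the first profile found
        best_profile, best_like = profile, likelihood

print(best_profile, round(best_like, 4))        # (1, 1, 0, 1, 0, 1) 0.6075
```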

Calculation of item parameters is an order of magnitude more complex.  Again, compare to item response theory: brute-force calculation of theta with maximum likelihood is complex, but can still be done using Excel formulas.  Item parameter estimation for IRT with marginal maximum likelihood can only be done by specialized software like Xcalibre.  For CDMs, item parameter estimation can be done in software like MPlus or R (see this article).

In addition to providing the most likely skill profile for each examinee, the CDMs can also provide the probability that a given examinee has mastered each skill.  This is what can be extremely useful in certain contexts, like formative assessment.

How can I implement cognitive diagnostic models?

The first step is to analyze your data to evaluate how well CDMs work by estimating one or more of the models.  As mentioned, this can be done in software like MPlus or R.  Actually publishing a real assessment that scores examinees with CDMs is a greater hurdle.

Most tests that use cognitive diagnostic models are proprietary.  That is, a large K12 education company might offer a bank of prefabricated formative assessments for students in grades 3-12.  That, of course, is what most schools need, because they don’t have a PhD psychometrician on staff to develop new assessments with CDMs.  And the testing company likely has several on staff.

On the other hand, if you want to develop your own assessments that leverage CDMs, your options are quite limited.  I recommend our  FastTest platform for test development, delivery, and analytics.

This is cool!  I want to learn more!

I like this article by Alan Huebner, which talks about adaptive testing with the DINA model, but has a very informative introduction on CDMs.

Jonathan Templin, a professor at the University of Iowa, is one of the foremost experts on the topic.  Here is his website.  Lots of fantastic resources.

Here is a textbook on CDMs.

 


Today I read an article in The Industrial-Organizational Psychologist (the colloquial journal published by the Society for Industrial Organizational Psychology) that really resonated with me.

Has Industrial-Organizational Psychology Lost Its Way?
-Deniz S. Ones, Robert B. Kaiser, Tomas Chamorro-Premuzic, Cicek Svensson

Why?  Because I think a lot of the points they make are also true about the field of psychometrics and our innovation.  They summarize their point in six bullets that they suggest represent a troubling direction for their field.  Honestly, I suppose a lot of academia falls under these, while some great innovation is happening over on free MOOCs and the like, because they aren’t fettered by the chains of the purely or partially academic world.

  • An overemphasis on theory
  • A proliferation of, and fixation on, trivial methodological minutiae
  • A suppression of exploration and a repression of innovation
  • An unhealthy obsession with publication while ignoring practical issues
  • A tendency to be distracted by fads
  • A growing habit of losing real-world influence to other fields

 

So what is psychometrics supposed to be doing?

The part that has irked me the most about psychometrics over the years is the overemphasis on theory and minutiae rather than solving practical problems.  This is the main reason I stopped attending the NCME conference and instead attend practical conferences like ATP.  It stems from my desire to improve the quality of assessment throughout the world.  Development of esoteric DIF methodology, new multidimensional IRT models, or a new CAT sub-algorithm when there are already dozens and the new one offers a 0.5% increase in efficiency… stuff like that isn’t going to impact all the terrible assessment being done in the world, or the terrible decisions being made about people based on those assessments.  Don’t get me wrong, there is a place for the substantive research, but I feel the practical side is underserved.

The Goal: Quality Assessment


And it’s that point that is driving the work that I do.  There is a lot of mediocre or downright bad assessment out there in the world.  I once talked to a pre-employment testing company and asked if I could help implement strong psychometrics to improve their tests as well as their validity documentation.  Their answer?  It was essentially “No thanks, we’ve never been sued, so we’re OK where we are.”  Thankfully, they fell in the mediocre category rather than the downright bad category.

Of course, in many cases, there is simply a lack of incentive to produce quality assessment.  Higher Education is a classic case of this.  Professional schools (e.g., Medicine) often have accreditation tied in some part to demonstrating quality assessment of their students.  There is typically no such constraint on undergraduate education, so your Intro to Psychology and freshman English Comp classes still do assessment the same way they did 40 years ago… with no psychometrics whatsoever.  Many small credentialing organizations lack incentive too, until they decide to pursue accreditation.

I like to describe the situation this way: take all the assessments of the world and assign each a percentile rank in psychometric quality.  The top 5% are the big organizations, such as nursing licensure in the US, that have in-house psychometricians, large volumes, and huge budgets.  We don’t have to worry about them, as they will be doing good assessment (and that substantive research I mentioned might be of use to them!).  The bottom 50% or more are like university classroom assessments.  They’ll probably never use real psychometrics.  I’m concerned about the 50th-95th percentile.

Example: Credentialing

A great example of this level is the world of credentialing.  There are a TON of poorly constructed licensure and certification tests that are being used to make incredibly important decisions about people’s lives.  Some are bad simply because the organization is for-profit and doesn’t care.  Some are hampered by external constraints.  I once worked with a Department of Agriculture for a western US state, where the legislature mandated that licensure tests be given for certain professions, even though only about 3 people per year took some of the tests.

So how do we get groups like that to follow best practices in assessment?  In the past, the only way to get psychometrics done was for them to pay a consultant a ton of money that they don’t have.  Why spend $5k on an Angoff study or classical test report for 3 people per year?  I don’t blame them.  The field of psychometrics needs to find a way to help such groups.  Otherwise, the tests are low quality and they are giving licenses to unqualified practitioners.

There are some bogus providers out there, for sure.  I’ve seen Certification delivery platforms that don’t even store the examinee responses, which would be necessary to do any psychometric analysis whatsoever.  Obviously they aren’t doing much to help the situation.  Software platforms that focus on things like tracking payments and prerequisites simply miss the boat too.  They are condoning bad assessment.

Similarly, mathematically complex advancements such as multidimensional IRT are of no use to this type of organization; they do nothing to help the situation.

An Opportunity for Innovation


I think there is still a decent amount of innovation in our field.  There are organizations that are doing great work to develop innovative items, psychometrics, and assessments.  However, it is well known that large corporations will snap up fresh PhDs in psychometrics and then lock them in a back room to do uninnovative work like running SAS scripts or conducting Angoff studies over and over and over.  This happened to me, and after only 18 months I was ready for more.

Unfortunately, I have found that a lot of innovation is not driven by producing good measurement.  I was in a discussion on LinkedIn where someone was pushing gamification for assessments and declared that measurement precision was of no interest.  This, of course, is ludicrous.  Is it OK to produce random numbers as long as the UI looks cool for students?

Innovation in Psychometrics at ASC

Much of the innovation at ASC is targeted towards the issue I have presented here.  I originally developed Iteman 4 and Xcalibre 4 to meet this type of usage.  I wanted to enable an organization to produce professional psychometric analysis reports on their assessments without having to pay massive amounts of money to a consultant.  Additionally, I wanted to save time; there are other software programs which can produce similar results, but they drop them into text files or Excel spreadsheets instead of Microsoft Word, which is of course what everyone would use to draft a report.

Much of our FastTest platform is designed with a similar bent.  Tired of running an Angoff study with items on a projector and the SMEs writing all their ratings with pencil and paper, only to be transcribed later?  Well, you can do this online.  Moreover, because it is online, you can use the SMEs remotely rather than paying to fly them into a central office.  Want to publish an adaptive (CAT) exam without writing code?  We have it built directly into our test publishing interface.

Back to My Original Point

So the title is “What is psychometrics supposed to be doing?” with regard to innovation.  My answer, of course, is improving assessment.  The issue I take with the mathematically advanced research is that it is only relevant for the top 5% of organizations mentioned above.  It is also our duty as psychometricians to find better ways to help the other 95%.

What else can we be doing?  I think the future here is automation.  Iteman 4 and Xcalibre 4, as well as FastTest, were really machine learning and automation platforms before those things became so en vogue.  As the SIOP article mentioned at the beginning discusses, other scholarly areas like Big Data are gaining more real-world influence even if they are doing things that psychometrics has done for a long time.  Item response theory is a form of machine learning, and it’s been around for 50 years!