Posts on psychometrics: The Science of Assessment

Job Analysis

Subject matter experts are an important part of the process in developing a defensible exam.  There are several ways that their input is required.  Here is a list from highest involvement/responsibility to lowest:

  1. Serving on the Certification Committee (if relevant) to decide important things like eligibility pathways
  2. Serving on panels for psychometric steps like Job Task Analysis or Standard Setting (Angoff)
  3. Writing and reviewing the test questions
  4. Answering the survey for the Job Task Analysis

Who are Subject Matter Experts?

A subject matter expert (SME) is someone with knowledge of the exam content.  If you are developing a certification exam for widgetmakers, you need a panel of expert widgetmakers, and sometimes other stakeholders like widget factory managers.

You also need test development staff and psychometricians.  Their job is to guide the process to meet international standards and to make the most efficient use of SME time.

Example: Item Writing Workshop

The most obvious use of subject matter experts in exam development is item writing and review. Again, if you are making a certification exam for experienced widgetmakers, then only experienced widgetmakers know enough to write good items.  In some cases, supervisors do as well, but then they are also SMEs.  For example, I once worked on exams for ophthalmic technicians; some of the SMEs were ophthalmic technicians, but some of the SMEs (and much of the nonprofit board) were ophthalmologists, the medical doctors for whom the technicians worked.

An item writing workshop typically starts with training on item writing, including what makes a good item, terminology, and format.  Item writers will then author questions, sometimes alone and sometimes as a group or in pairs.  For higher stakes exams, all items will then be reviewed/edited by other SMEs.

Example: Job Task Analysis

Job Task Analysis studies are a key step in the development of a defensible certification program.  It is the second step in the process, after the initial definition, and sets the stage for everything that comes afterward.  Moreover, if you seek to get your certification accredited by organizations such as NCCA or ANSI, you need to re-perform the job task analysis study periodically. JTAs are sometimes called job analysis, practice analysis, or role delineation studies.

The job task analysis study relies heavily on the experience of Subject Matter Experts (SMEs), just like cutscore studies. The SMEs have the best sense of where the profession is evolving and what is most important, which is essential both for the initial JTA and for the periodic refresh of the exam. The frequency depends on how quickly your field is evolving, but a cycle of 5 years is often recommended.

The goal of the job task analysis study is to gain quantitative data on the structure of the profession.  Therefore, it typically utilizes a survey approach to gain data from as many professionals as possible.  This starts with a group of SMEs generating an initial list of on-the-job tasks, categorizing them, and then publishing a survey.  The end goal is a formal report with a blueprint of what knowledge, skills, and abilities (KSAs) are required for certification in a given role or field, and therefore what are the specifications of the certification test.

  • Observe— Typically the psychometrician (that’s us) shadows a representative sample of people who perform the job in question (chosen through Panel Composition) to observe and take notes. After the day(s) of observation, the SMEs sit down with the observer so that he or she may ask any clarifying questions.

    The goal is to avoid doing this during the observation so that the observer has an untainted view of the job.  Alternatively, your SMEs can observe job incumbents – which is often the case when the SMEs are supervisors.

  • Generate— The SMEs now have a corpus of information on what is involved with the job, and generate a list of tasks that describe the most important job-related components. Not all job analyses use tasks, but this is the most common approach in certification testing, hence you will often hear job task analysis used as a general term.
  • Survey— Now that we have a list of tasks, we send a survey out to a larger group of SMEs and ask them to rate various features of each task.

    How important is the task? How often is it performed? What larger category of tasks does it fall into?

  • Analyze— Next, we crunch the data and quantitatively evaluate the SMEs’ subjective ratings to determine which of the tasks and categories are most important (a simple example of this weighting is sketched after this list).

  • Review— As a non-SME, the psychometrician needs to take their findings back to the SME panel to review the recommendation and make sure it makes sense.

  • Report— We put together a comprehensive report that outlines what the most important tasks/categories are for the given job.  This in turn serves as the foundation for a test blueprint, because more important content deserves more weight on the test.

    This connection is one of the fundamental links in the validity argument for an assessment.
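
To make the Analyze step concrete, here is a minimal Python sketch of one common way to turn survey ratings into blueprint weights: multiply mean importance by mean frequency for each task, then normalize.  The task names and rating values are hypothetical, and real studies often use more elaborate weighting schemes.

    # Minimal sketch: converting hypothetical JTA survey ratings into blueprint weights.
    # Each task has mean importance and frequency ratings (e.g., on 1-5 scales).
    tasks = {
        "Calibrate widget press": {"importance": 4.6, "frequency": 3.9},
        "Inspect finished widgets": {"importance": 4.2, "frequency": 4.8},
        "Document safety incidents": {"importance": 3.1, "frequency": 1.7},
    }

    # One common (but not the only) approach: criticality = importance x frequency.
    criticality = {name: r["importance"] * r["frequency"] for name, r in tasks.items()}
    total = sum(criticality.values())

    # Normalize to percentages, which become the blueprint weights for the test.
    for name, value in sorted(criticality.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {100 * value / total:.1f}% of the test")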

Example: Cutscore studies

When the JTA is completed, we have to determine who should pass the assessment, and who should fail. This is most often done using the modified Angoff process, where the SMEs conceptualize a minimally competent candidate (MCC) and then set the pass/fail point so that the MCC would just barely pass.  There are other methods too, such as Bookmark or Contrasting Groups.
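
To make the Angoff arithmetic concrete, here is a minimal sketch with made-up ratings: each SME estimates the probability that the MCC answers each item correctly, the ratings are averaged per item, and the sum of those averages becomes the recommended raw cutscore.

    # Minimal sketch of a modified-Angoff calculation with hypothetical ratings.
    # Rows = raters, columns = items; each value is the estimated probability that
    # a minimally competent candidate (MCC) answers the item correctly.
    ratings = [
        [0.60, 0.80, 0.95, 0.40],   # Rater 1
        [0.70, 0.75, 0.90, 0.50],   # Rater 2
        [0.65, 0.85, 0.85, 0.45],   # Rater 3
    ]

    n_raters = len(ratings)
    n_items = len(ratings[0])

    # Average the ratings for each item, then sum across items.
    item_means = [sum(r[j] for r in ratings) / n_raters for j in range(n_items)]
    cutscore = sum(item_means)

    print(f"Recommended raw cutscore: {cutscore:.2f} out of {n_items} items")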

For newly-launching certification programs, these processes go hand-in-hand with item writing and review. The use of evidence-based practices in conducting the job task analysis, designing the test, writing items, and setting a cutscore serves as the basis for a good certification program.  Moreover, if you are seeking to achieve accreditation – a third-party stamp of approval that your credential is high quality – documentation that you completed all these steps is required.

Performing these tasks with a trained psychometrician inherently checks a lot of boxes on the accreditation to-do list, which can position your organization well for the future. When it comes to accreditation, the psychometricians and measurement specialists at Assessment Systems have been around the block a time or two. We can walk you through the lengthy process of becoming accredited, or we can help you perform these tasks a la carte.


One of the most cliche phrases associated with assessment is “teaching to the test.”  I’ve always hated this phrase, because it is only used in a derogatory manner, almost always by people who do not understand the basics of assessment and psychometrics.  I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.

Why teaching to the test is usually a good thing

If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum. Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

In the world of credentialing assessment, this is an extremely important link.  Credentialing tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills on the job.  This data is then transformed into test blueprints. Instructors for the profession, as well as aspiring students who are studying to pass the test, then focus on what is in the blueprints.  This, of course, still covers the skills that are most important and frequently used on the job!

So what is the problem then?

Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then power to choose how they teach, as noted in the PISA article.

As a counterexample, my high school math department made an edict starting my sophomore year that all teachers had to use the “Chicago Method.” It was pure bunk, based on the premise that students should be doing as much busy work as possible instead of the teachers actually teaching. I think some salesman convinced the department head to make the switch so that the school would buy a thousand brand-new textbooks.  The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school), but I think we ended up with a bastardization of it, as the edict was primarily:

  1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
  2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.
  3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.

What about all the bad tests out there?

Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints.  There are, most certainly, plenty of cases where an assessment is not designed or built well.  That’s an entirely different problem, and an entirely valid concern. I have seen a number of these in my career.  This danger is why we have international standards on assessments, like AERA/APA/NCME and NCCA.  These provide guidelines on how a test should be built, sort of like how you need to build a house according to building code rather than just throwing up some walls and a roof.


For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world.  It’s a matter of caveat emptor. But just because you had some cheap furniture in college that broke right away doesn’t mean you swear off all furniture.  You stay away from bad furniture.

There’s also the problem of tests being misused, but again that’s not a problem with the test itself.  Usually it means that someone making decisions with the scores is uninformed. It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application, then it’s still not a good situation.  For example, imagine taking a very well-made exam for high school graduation and using it for employment decisions with adults. Psychometricians call this validity – having evidence to support the intended use of the test and interpretations of scores.  It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.

So where do we go from here?

Put it this way: if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining unspecifically about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  Of course, we also need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

The Two Parameter IRT Model (IRT 2PL)

Item response theory is the predominant psychometric paradigm for mid or large scale assessment.  As noted in my introductory blog post, it is actually a family of models.  In this post, we discuss the two parameter IRT model (IRT 2PL).

Consider the following 3PL equation (simplified from Hambleton & Swaminathan, 1985, Eq. 3.3).  The IRT 2PL simply removes the c and (1-c) elements, so that probability is only a function of a and b.

    P(\theta) = c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}

This equation is predicting the probability of a certain response based on the examinee trait/ability level, the item discrimination parameter a, and the item difficulty/location parameter b.  If the examinee’s trait level is higher than the item location, the person has more than a 50% chance of responding in the keyed direction.
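
For reference, removing c leaves the standard two parameter logistic form (written here without the optional D = 1.7 scaling constant):

    P(\theta) = \frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}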

This phrase “in the keyed direction” is one you might often hear with the IRT 2PL.  This is because it is not often used with education/knowledge/ability assessments where items usually have a correct answer and guessing is often possible.  The IRT 2PL is used more often in attitudinal or other psychological assessments where guessing is irrelevant and there is no correct answer.  For example, consider an Extroversion scale, where examinees are responding Yes/No to statements like “I love to go to parties” or “I prefer to read books in my free time.”  There is not much to guess here, and the sense of “correct” is not relevant.

However, it is quite clear that the first statement is keyed in the direction of extroversion while the second statement is keyed in the reverse direction.  In fact, you would get the 1 point for responding No to that second statement rather than Yes.  This is often called reverse-scoring.

There are other aspects that go into whether you should use the 2PL model, but this is one of the most important.  In addition, you should also examine model fit indices and take sample size into account.

How do I implement the two parameter IRT model?

Like other IRT models, the 2PL requires specialized software.  Not all statistical packages will do it.  And while you can easily calculate classical statistics in Excel, there is no way to do IRT (well, unless you want to write your own VBA programs to do so).  As mentioned in this article on the three parameter model, there are a lot of IRT software programs available, but not all meet the required standards.

You should evaluate cost and functionality.  If you are a fan of R, there are packages to estimate IRT there.  However, I recommend our Xcalibre program for both newbies and professionals.  For newbies, it is much easier to use, which means you spend more time learning the concepts of IRT and not fighting command code that might be 30 years old.  For professionals, Xcalibre saves you from having to create reports by copy and paste which is incredibly expensive.

The Three Parameter IRT Model (3PL)

Item response theory (IRT) is an extremely powerful psychometric paradigm that addresses many of the inadequacies of classical test theory (CTT).  If you are new to the topic, there is a broad intro here, where you will learn that IRT is actually a family of mathematical models rather than one specific one.  Today, I’m talking about the 3PL.

One of the most commonly used models is called the three parameter IRT model (3PM), or the three parameter logistic model (3PL or 3PLM) because it is almost always expressed in a logistic form.  The equation for this is below (Hambleton & Swaminathan, 1985, Eq. 3.3).

    P(\theta) = c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}

 

Like all IRT models, it is seeking to predict the probability of a certain response based on examinee ability/trait level and some parameters which describe the performance of the item.  With the 3PL, those parameters are a (discrimination), b (difficulty or location), and c (pseudo-guessing).  For more on these, check out the descriptions in my general IRT article.
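
A quick consequence of this form, which helps with interpreting the parameters: c is the lower asymptote of the curve, and at theta = b the probability sits exactly halfway between c and 1.

    \lim_{\theta \to -\infty} P(\theta) = c, \qquad P(b) = c + \frac{1 - c}{2} = \frac{1 + c}{2}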

The remaining point then is what we mean by the probability of a certain response.  The 3PL is a dichotomous model which means that it is predicting a binary outcome such as correct/incorrect or agree/disagree.

When should I use the three parameter IRT model?

The applicability of the 3PL to a certain assessment depends on the relevance of the components just discussed.  First, the response to the items must be binary.  This eliminates Likert-type items (“Rate on a scale of 1 to 5”), partial credit items (scoring an essay as 0 to 5 points), and performance assessments where scoring might include a range of points, deductions, or timing (number of words typed per minute).

Next, you should evaluate the applicability of the use of all three parameters.  Most notably, are the items in your assessment susceptible to guessing?  Because the thing that differentiates the 3PL from its sisters the 1PL and 2PL is that it attempts to model for guessing.  This, of course, is highly relevant for multiple-choice items on knowledge or ability assessments, so the 3PL is often a great fit for those.

Even in this case, though, there are a number of practitioners and researchers that still prefer to use the 1PL or 2PL models.  There are some deeper methodological issues driving this choice.  The 2PL is sometimes chosen because it works well with an estimation method called Joint Maximum Likelihood.

The 1PL, also known as the Rasch model (yes, I know the Rasch people will say they are not the same, I am grouping them together for simplicity in comparison), is often selected because adherents to the model believe in certain advantages such as it providing “objective measurement.”  Also, the Rasch model works far better for smaller samples (see this technical report by Guyer & Thompson and this one by Yoes).  Regardless, you should probably evaluate model fit when selecting models.

I am from a camp that is pragmatic in choice rather than dogmatic.  While I was trained on the 3PL in graduate school, I have no qualms about using the 2PL or 1PL/Rasch if the test type and sample size warrant it or if fit statistics indicate they are sufficient.

How do I implement the three parameter IRT model?

If you want to implement the three parameter IRT model, you need specialized software.  General statistical software such as SPSS does not always produce IRT analyses, though some packages do.  Even in the realm of IRT-specific software, not all programs produce the 3PL.  And, of course, the software can vary greatly in terms of quality.  Here are three important ways it can vary:

  1. Accuracy of results: check out this research study which shows that some programs are inaccurate
  2. User-friendliness: some programs require you to write extensive code, and some have a purely graphical interface
  3. Output usability and interpretability: some programs just give simple ASCII text, others provide extensive Word or HTML reports with many beautiful tables and graphs.

For more on this topic, head over to my post on how to implement IRT in general.

Want to get started immediately?  Download a free copy of our IRT software Xcalibre.


Classical test theory is a century-old paradigm for psychometrics – using quantitative and scientific processes to develop and analyze assessments to improve their quality.  (Nobody likes unfair tests!)  The most basic and frequently used item statistic from classical test theory is the P-value.  It is usually called item difficulty but is sometimes called item facility, which can lead to possible confusion.

The P-Value Statistic

The classical P-value is the proportion of examinees that respond correctly to a question, or respond in the “keyed direction” for items where the notion of correct is not relevant (imagine a personality assessment where all questions are Yes/No statements such as “I like to go to parties” … Yes is the keyed direction for an Extraversion scale).  Note that this is NOT the same as the p-value that is used in hypothesis testing from general statistical methods.  This P-value is almost universally agreed upon in terms of calculation.  But some people call it item difficulty and others call it item facility.  Why?

It has to do with the clarity of interpretation.  It usually makes sense to think of difficulty as an important aspect of the item.  The P-value presents this, but in a reversed manner.  We usually expect higher values to indicate more of something, right?  But a P-value of 1.00 is high, and it means that there is not much difficulty; everyone gets the item correct, so there is actually no difficulty whatsoever.  A P-value of 0.25 is low, but it means that there is a lot of difficulty; only 25% of examinees are getting it correct, so the item has quite a lot of difficulty.
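
Whatever you call it, the calculation itself is trivial.  Here is a minimal Python sketch using a small made-up scored response matrix (1 = correct or keyed direction, 0 = not):

    # Minimal sketch: classical P-values from a hypothetical scored response matrix.
    # Rows = examinees, columns = items.
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 1, 0, 1],
    ]

    n_examinees = len(responses)
    n_items = len(responses[0])

    # P-value = proportion of examinees responding correctly to each item.
    p_values = [sum(row[j] for row in responses) / n_examinees for j in range(n_items)]

    for j, p in enumerate(p_values, start=1):
        print(f"Item {j}: P = {p:.2f}")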

So where does “item facility” come in?

See how the meaning is reversed?  It’s for this reason that some psychometricians prefer to call it item facility or item easiness.  We still use the P-value, but 1.00 means high facility/easiness, and 0.25 means low facility/easiness.  The direction of the semantics fits much better.

Nevertheless, this is a minority of psychometricians.  There’s too much momentum to change an entire field at this point!  It’s similar to the 3 dichotomous IRT parameters (a, b, c); some of you might have noticed that they are actually in the wrong order, because the 1-parameter model does not use the a parameter, it uses the b.

At the end of the day, it doesn’t really matter, but it’s another good example of how we all just got used to doing something and it’s now too far down the road to change it.  Tradition is a funny thing.

Have you heard about standard setting approaches such as the Hofstee method, or perhaps the Angoff, Ebel, Nedelsky, or Bookmark methods?  There are certainly various ways to set a defensible cutscore on a professional credentialing or pre-employment test.  Today, we are going to discuss the Hofstee method.  You may also be interested in reading this introductory post on setting a cutscore using item response theory.

Why Standard Setting?

Certification organizations that care about the quality of their examinations need to follow best practices and international standards for test development, such as the Standards laid out by the National Commission for Certifying Agencies (NCCA).  One component of that is standard setting, also known as cutscore studies.  One of the most common and respected approaches for that is the modified-Angoff methodology.

However, the Angoff approach has one flaw: the subject matter experts (SMEs) tend to expect too much out of minimally competent candidates, and sometimes set a cutscore so high that even they themselves would not pass the exam.  There are several reasons this can occur.  For example, raters might think “I would expect anyone that worked for me to know how to do this” and not consider the fact that people who work for them might have 10 years of experience while test candidates could be fresh out of training/school and have only had the topic touched on for 5 minutes.  SMEs often forget what it was like to be a much younger and inexperienced version of themselves.

For this reason, several compromise methods have been suggested to compare the Angoff-recommended cutscore with a “reality check” of actual score performance on the exam, allowing the SMEs to make a more informed decision when setting the official cutscore of the exam.  I like to use the Beuk method and the Hofstee method.

The Hofstee Method

One method of adjusting the cutscore based on raters’ impressions of the difficulty of the test and possible pass rates is the Hofstee method (Mills & Melican, 1987; Cizek, 2006; Burr et al., 2016).  This method requires the raters to estimate four values:

  1. The minimum acceptable failure rate
  2. The maximum acceptable failure rate
  3. The minimum acceptable cutscore, even if all examinees passed
  4. The maximum acceptable cutscore, even if all examinees failed

The first two values are failure rates, and are therefore between 0% and 100%, with 100% meaning that no examinee would pass.  The latter two values are on the raw score scale, and therefore range between 0 and the number of items in the test, again with a higher value indicating a more difficult cutscore to achieve.

These values define two points, and the line that passes through them is estimated.  The intersection of this line with the observed failure rate function (the cumulative distribution of scores) is the recommended adjusted cutscore.

[Figure: The Hofstee method, showing the intersection of the rater-defined line with the observed failure rate function]

How can I use the Hofstee Method?

Unlike the Beuk, the Hofstee method does not utilize the Angoff ratings, so it represents a completely independent reality check.  In fact, it is sometimes used as a standalone cutscore setting method itself, but because it does not involve rating of every single item, I recommend it be used in concert with the Angoff and Beuk approaches.
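
If you want to see the arithmetic, here is a minimal Python sketch of the computation described above: build the observed failure-rate function from a (hypothetical) score distribution, draw the line through the two rater-defined points, and find where the two cross.  Real implementations differ in details such as interpolation and smoothing.

    # Minimal sketch of the Hofstee compromise method with hypothetical inputs.
    import random

    random.seed(1)
    n_items = 50

    # Hypothetical observed raw scores for 200 examinees.
    scores = [min(n_items, max(0, int(random.gauss(33, 6)))) for _ in range(200)]

    # Rater-supplied Hofstee values (hypothetical).
    k_min, k_max = 25, 38      # minimum/maximum acceptable cutscore
    f_min, f_max = 0.05, 0.40  # minimum/maximum acceptable failure rate

    def failure_rate(cutscore):
        """Proportion of examinees who would fail at a given cutscore."""
        return sum(s < cutscore for s in scores) / len(scores)

    def hofstee_line(cutscore):
        """Line through the points (k_min, f_max) and (k_max, f_min)."""
        slope = (f_min - f_max) / (k_max - k_min)
        return f_max + slope * (cutscore - k_min)

    # Find the cutscore (checked in 0.1 steps) where the observed failure-rate
    # curve crosses the Hofstee line.
    best = min(
        (k_min + 0.1 * i for i in range(int((k_max - k_min) * 10) + 1)),
        key=lambda c: abs(failure_rate(c) - hofstee_line(c)),
    )
    print(f"Hofstee-adjusted cutscore: {best:.1f}")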

 

Spearman-Brown

 

The Spearman-Brown formula, also known as the Spearman-Brown Prophecy Formula or Correction, is a method used in evaluating test reliability.  It is based on the idea that split-half reliability has better assumptions than coefficient alpha but only estimates reliability for a half-length test, so you need to implement a correction that steps it up to a true estimate for a full-length test.

Looking for software to help you analyze reliability?  Download a free copy of Iteman.

 

Coefficient Alpha vs. Split Half

The most commonly used index of test score reliability is coefficient alpha.  However, it’s not the only index of internal consistency.  Another common approach is split-half reliability, where you split the test into two halves (first/last, even/odd, or random split) and then correlate scores on the two halves.  The reasoning is that if both halves of the test measure the same construct at a similar level of precision and difficulty, then scores on one half should correlate highly with scores on the other half.  More information on split-half reliability is found here.

However, split-half reliability presents an inconvenient situation: we are effectively gauging the reliability of half a test.  It is a well-known fact that reliability is increased by more items (observations); we can all agree that a 100-item test is more reliable than a 10-item test comprised of similar quality items.  So the split-half correlation blatantly underestimates the reliability of the full-length test.

The Spearman-Brown Formula

To adjust for this, psychometricians use the Spearman-Brown prophecy formula.  It takes the split half correlation as input and converts it to an estimate of the equivalent level of reliability for the full-length test.  While this might sound complex, the actual formula is quite simple.

    r_{\mathrm{full}} = \frac{2\, r_{\mathrm{half}}}{1 + r_{\mathrm{half}}}

As you can see, the formula takes the split-half reliability (r_half) as input and produces the full-length estimate (r_full).  This can then be interpreted alongside the ubiquitously used coefficient alpha.
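
As a minimal illustration (a sketch with made-up responses, not a replacement for a real item analysis program), here is how an odd-even split-half correlation and its Spearman-Brown correction can be computed in Python:

    # Minimal sketch: odd-even split-half reliability with Spearman-Brown correction.
    # Rows = examinees, columns = items, values = scored responses (0/1).
    import statistics  # statistics.correlation requires Python 3.10+

    responses = [
        [1, 1, 0, 1, 1, 0],
        [1, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1, 0],
        [1, 1, 1, 0, 1, 1],
    ]

    # Score each half: odd-numbered items vs. even-numbered items.
    odd_scores = [sum(row[0::2]) for row in responses]
    even_scores = [sum(row[1::2]) for row in responses]

    # Pearson correlation between the two half-test scores.
    r_half = statistics.correlation(odd_scores, even_scores)

    # Spearman-Brown correction to estimate full-length reliability.
    r_full = (2 * r_half) / (1 + r_half)

    print(f"Split-half r = {r_half:.3f}, Spearman-Brown corrected = {r_full:.3f}")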

While the calculation is quite simple, you still shouldn’t have to do it yourself.  Any decent software for classical item analysis will produce it for you.  As an example, here is the output of the Reliability Analysis table from our Iteman software for automated reporting and assessment intelligence with CTT.  This lists the various split-half estimates alongside the coefficient alpha (and its associated SEM) for the total score as well as the domains, so you can evaluate if there are domains that are producing unusually unreliable scores. 

Note: There is an ongoing argument amongst psychometricians whether domain scores are even worthwhile, since the assumed unidimensionality of most tests means that the domain scores are less reliable estimates of the total score, but that’s a whole other blog post!

Score      N Items  Alpha  SEM    Split-Half (Random)  Split-Half (First-Last)  Split-Half (Odd-Even)  S-B Random  S-B First-Last  S-B Odd-Even
All items  50       0.805  3.058  0.660                0.537                    0.668                  0.795       0.699           0.801
1          10       0.522  1.269  0.338                0.376                    0.370                  0.506       0.547           0.540
2          18       0.602  1.860  0.418                0.309                    0.448                  0.590       0.472           0.619
3          12       0.605  1.496  0.449                0.417                    0.383                  0.620       0.588           0.553
4          10       0.485  1.375  0.300                0.329                    0.297                  0.461       0.495           0.457

You can see that, as mentioned earlier, there are 3 ways to do the split in the first place, and Iteman reports all three.  It then reports the Spearman-Brown corrected estimate for each.  These generally align with the alpha estimates, which overall provides a cohesive picture of the structure of the exam and the reliability of its scores.  As you might expect, domains with more items are slightly more reliable, but none are highly reliable since they all have fewer than 20 items.

So, what does this mean in the big scheme of things?  Well, in many cases the Spearman-Brown estimates might not differ much from the alpha estimates, but it’s still good to check whether they do.  In the case of high-stakes tests, you want to go through every effort you can to ensure that the scores are highly reliable and precise.

Tell me more!

If you’d like to learn more, here is an article on the topic.  Or, contact solutions@assess.com to discuss consulting projects with our Ph.D. psychometricians.


Simulation studies are an essential step in the development of a computerized adaptive test (CAT) that is defensible and meets the needs of your organization or other stakeholders. There are three types of simulations: Monte Carlo, Real Data (post hoc), and Hybrid.

Monte Carlo simulation is the most general-purpose approach, and the one most often used early in the process of developing a CAT.  This is because it requires no actual data, either on test items or examinees – although real data is welcome if available – which makes it extremely useful in evaluating whether CAT is even feasible for your organization before any money is invested in moving forward.

Let’s begin with an overview of how Monte Carlo simulation works before we return to that point.

How a Monte Carlo simulation works: An overview

First of all, what do we mean by CAT simulation?  Well, a CAT is a test that is administered to students via an algorithm.  We can use that same algorithm on imaginary examinees, or real examinees from the past, and simulate how well a CAT performs on them.

Best of all, we can change the specifications of the algorithm to see how it impacts the examinees and the CAT performance.

Each simulation approach requires three things:

  1. Item parameters from item response theory (IRT), though new CAT methods such as diagnostic models are now being developed.
  2. Examinee scores (theta) from IRT.
  3. A way to determine how an examinee responds to an item if the CAT algorithm says it should be delivered to the examinee.

The Monte Carlo simulation approach is defined by how it addresses the third requirement: it generates a response using some sort of mathematical model, while the other two simulation approaches look up actual responses for past examinees (real-data approach) or a mix of the two (hybrid).

The Monte Carlo simulation approach only uses the response generation process.  The item parameters can either be from a bank of actual items or generated.

Likewise, the examinee thetas can be from a database of past data, or generated.

How does the response generation process work? 

Well, it differs based on the model that is used as the basis for the CAT algorithm.  Here, let’s assume that we are using the three-parameter logistic model.  Start by supposing we have a fake examinee with a true theta of 0.0.  The CAT algorithm looks in the bank and says that we need to administer item #17 as the first item, which has the following item parameters: a=1.0, b=0.0, and c=0.20.

Well, we can simply plug those numbers into the equation for the three-parameter model and obtain the probability that this person would correctly answer this item.

[Figure: Item response function for a = 1.0, b = 0.0, c = 0.20]

The probability, in this case, is 0.6.  The next step is to generate a random number from the set of all real numbers between 0.0 and 1.0.  If that number is less than the probability of correct response, the examinee “gets” the item correct.  If greater, the examinee gets the item incorrect.  Either way, the examinee is scored and the CAT algorithm proceeds.

For every item that comes up to be used, we utilize this same process.  Of course, the true theta does not change, but the item parameters are different for each item.  Each time, we generate a new random number and compare it to the probability to determine a response of correct or incorrect.
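
Here is a minimal Python sketch of that generation loop for a single simulated examinee responding to a handful of items; the item parameters are made up, and a real CAT would of course select each item adaptively rather than in a fixed order.

    # Minimal sketch of Monte Carlo response generation under the 3PL.
    import math
    import random

    def p_3pl(theta, a, b, c):
        """Probability of a correct response under the three-parameter logistic model."""
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    true_theta = 0.0
    # Hypothetical item parameters (a, b, c) for a handful of items.
    items = [(1.0, 0.0, 0.20), (0.8, -0.5, 0.25), (1.3, 0.4, 0.20), (0.9, 1.1, 0.15)]

    responses = []
    for a, b, c in items:
        p = p_3pl(true_theta, a, b, c)
        # Draw a uniform random number; a draw below p means a simulated correct response.
        responses.append(1 if random.random() < p else 0)

    print(responses)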

The CAT algorithm proceeds as if a real examinee is on the other side of the computer screen, actually responding to questions, and stops whenever the termination criterion is satisfied.  However, the same process can be used to “deliver” linear exams to examinees; instead of the CAT algorithm selecting the next item, we just process sequentially through the test.

A road to research

For a single examinee, this process is not much more than a curiosity.  Where it becomes useful is at a large scale aggregate level.  Imagine the process above as part of a much larger loop.  First, we establish a pool of 200 items pulled from items used in the past by your program.  Next, we generate a set of 1,000 examinees by pulling numbers from a random distribution.

Finally, we loop through each examinee and administer a CAT by using the CAT algorithm and generating responses with the Monte Carlo simulation process.  We then have extensive data on how the CAT algorithm performed, which can be used to evaluate the algorithm and the item bank.  The two most important are the length of the CAT and its accuracy, which are a trade-off in most cases.

So how is this useful for evaluating the feasibility of CAT?

Well, you can evaluate the performance of the CAT algorithm by setting up an experiment to compare different conditions.  Suppose you don’t have past items and are not even sure how many items you need?  Well, you can create several different fake item banks and administer a CAT to the same set of fake examinees.

Or you might know the item bank to be used, but need to establish that a CAT will outperform the linear tests you currently use.  There is a wide range of research questions you can ask, and since all the data is being generated, you can design a study to answer many of them.  In fact, one of the greatest problems you might face is that you can get carried away and start creating too many conditions!

How do I actually do a Monte Carlo simulation study?

Fortunately, there is software to do all the work for you.  The best option is CATSim, which provides all the options you need in a straightforward user interface (beware, this makes it even easier to get carried away).  The advantage of CATSim is that it collates the results for you and presents most of the summary statistics you need without you having to calculate them.  For example, it calculates the average test length (number of items used by a variable-length CAT), and the correlation of CAT thetas with true thetas.  Other software exists which is useful in generating data sets using Monte Carlo simulation (see SimulCAT), but they do not include this important feature.


Decision Consistency

If you are involved with certification testing and are accredited by the National Commission for Certifying Agencies (NCCA), you have come across the term decision consistency.  NCCA requires you to submit a report of 11 important statistics each year, for each active test form.  These 11 provide a high-level summary of the psychometric health of each form; more on that report here.  One of the 11 is decision consistency.

What is Decision Consistency?

Decision consistency is an estimate of how consistent the pass/fail decision is on your test.  That is, if someone took your test today, had their brain wiped of that memory, and took the test again next week, what is the probability that they would obtain the same classification both times?  This is often estimated as a proportion or percentage, and we would of course hope that this number is high, but if the test is unreliable it might not be.

The reasoning behind the need for an index specifically on this is that the psychometric aspect we are trying to estimate is different than the reliability of point scores (Moltner, Timbil, & Junger, 2015; Downing & Mehrens, 1978).  The argument is that examinees near the cutscore are of interest, while reliability evaluates the entire scale.  It’s for this reason that if you are using item response theory (IRT), the NCCA allows you to instead submit the conditional standard error of measurement function at the cutscore.  But all of the classical decision consistency indices evaluate all examinees, and since most candidates are not near the cutscore, this inflates the baseline.  Only the CSEM – from IRT – follows the line of reasoning of focusing on examinees near the cutscore.

An important distinction that stems from this dichotomy is that of decision consistency vs. accuracy.  Consistency refers to receiving the same pass/fail classification each time if you take the test twice.  But what we really care about is whether the pass/fail decision based on the test matches your true state.  For a more advanced treatment on this, I recommend Lathrop (2015).

Indices of Decision Consistency

There are a number of classical methods for estimating an index of decision consistency that have been suggested in the psychometric literature.  A simple and classic approach is Hambleton (1972), which is based on an assumption that examinees actually take the same test twice (or equivalent forms).  Of course, this is rarely feasible in practice, so a number of methods were suggested over the next few years on how to estimate this with a single test administration to a given set of examinees.  These include Huynh (1976), Livingston (1972), and Subkoviak (1976).  These are fairly complex.  I once reviewed a report from a psychometrician that faked the Hambleton index because they didn’t have the skills to figure out any of the indices.
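
For intuition, here is a minimal Python sketch of the simplest two-administration idea: classify each examinee on two (hypothetical) parallel forms and compute the proportion classified the same way both times.  The single-administration indices cited above are considerably more involved.

    # Minimal sketch: decision consistency as the agreement rate across two
    # hypothetical parallel forms (scores are made up).
    cutscore = 70

    form1_scores = [62, 75, 71, 88, 69, 74, 90, 55, 73, 68]
    form2_scores = [65, 73, 68, 85, 72, 76, 91, 58, 70, 66]

    decisions1 = [s >= cutscore for s in form1_scores]
    decisions2 = [s >= cutscore for s in form2_scores]

    # Proportion of examinees receiving the same pass/fail decision both times.
    consistency = sum(d1 == d2 for d1, d2 in zip(decisions1, decisions2)) / len(decisions1)
    print(f"Decision consistency: {consistency:.2f}")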

How does decision consistency relate to reliability?

The note I made above about unreliability is worth another visit, however.  After the rash of publications on the topic, Mellenbergh and van der Linden (1978; 1980) pointed out that if you assume a linear loss function for misclassification, the conventional estimate of reliability – coefficient alpha – serves as a solid estimate of decision consistency.  What is a linear loss function?  It means that a misclassification is worse if the person’s score is further from the cutscore.  That is, if the cutscore is 70, failing someone with a true score of 80 is twice as bad as failing someone with a true score of 75.  Of course, we never know someone’s true score, so this is a theoretical assumption, but the researchers make an excellent point.

But while research amongst psychometricians on the topic has cooled since they made that point, NCCA still requires one of the statistics – most from the 1970s – to be reported.  The only other well-known index on the topic is Hanson and Brennan (1990).  While the indices have been shown to be different from classical reliability, I remain unconvinced that they are the right approach.  Of course, I’m not much of a fan of classical test theory in the first place; that acceptance of the CSEM from IRT is definitely aligned with my views on how psychometrics should tackle measurement problems.


Sympson-Hetter is a method of item exposure control within the algorithm of computerized adaptive testing (CAT).  It prevents the algorithm from over-using the best items in the pool.

CAT is a powerful paradigm for delivering tests that are smarter, faster, and fairer than the traditional linear approach.  However, CAT is not without its challenges.  One is that it is a greedy algorithm that always selects your best items from the pool if it can.  The way that CAT researchers address this issue is with item exposure controls.  These are sub-algorithms that are injected into the main item selection algorithm to keep it from always using the best items. The Sympson-Hetter method is one such approach.  Another is the randomesque method.

The Randomesque Method

[Figure: Item information functions (IIFs) for a pool of five items]

The simplest approach is called the randomesque method.  This selects from the top X items in terms of item information (a term from item response theory), usually for the first Y items in a test.  For example, instead of always selecting the top item, the algorithm finds the 3 top items and then randomly selects between those.

The figure above displays item information functions (IIFs) for a pool of 5 items.  Suppose an examinee had a theta estimate of 1.40.  The 3 items with the highest information at that theta are the light blue, purple, and green lines (Items 5, 4, and 3).  The algorithm would first identify this and randomly pick amongst those three.  Without item exposure controls, it would always select Item 4.
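
Here is a minimal Python sketch of the randomesque rule, assuming a small hypothetical 2PL item bank and a function for item information at the current theta estimate:

    # Minimal sketch of randomesque item selection.
    import math
    import random

    def item_information(theta, a, b):
        """Fisher information for a 2PL item (a 2PL bank is assumed for simplicity)."""
        p = 1 / (1 + math.exp(-a * (theta - b)))
        return a**2 * p * (1 - p)

    # Hypothetical item bank: item id -> (a, b).
    bank = {1: (0.7, -1.0), 2: (0.9, 0.0), 3: (1.1, 0.8), 4: (1.4, 1.5), 5: (1.2, 1.9)}

    def select_randomesque(theta, available, top_k=3):
        # Rank available items by information at the current theta estimate...
        ranked = sorted(available, key=lambda i: item_information(theta, *bank[i]), reverse=True)
        # ...then pick randomly among the top K instead of always taking the best.
        return random.choice(ranked[:top_k])

    print(select_randomesque(theta=1.40, available=list(bank)))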

The Sympson-Hetter Method

A more sophisticated method is the Sympson-Hetter method.

Here, the user specifies a target proportion as a parameter for the selection algorithm.  For example, we might decide that we do not want an item seen by more than 75% of examinees.  So, every time that the CAT algorithm goes into the item pool to select a new item, we generate a random number between 0 and 1, which is then compared to the threshold.  If the number is between 0 and 0.75 in this case, we go ahead and administer the item.  If the number is from 0.75 to 1.0, we skip over it and go on to the next most informative item in the pool, though we then do the same comparison for that item.
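
Here is a minimal Python sketch of that administer-or-skip check, assuming the candidate items are already ranked by information and each has a hypothetical exposure parameter:

    # Minimal sketch of the Sympson-Hetter exposure check during item selection.
    import random

    def select_with_sympson_hetter(ranked_items, exposure_k):
        """ranked_items: item ids sorted from most to least informative.
        exposure_k: dict of item id -> target exposure parameter (0 to 1)."""
        for item in ranked_items:
            # Administer the item only if a uniform random draw falls below its parameter.
            if random.random() < exposure_k[item]:
                return item
        # Fallback: if every candidate was skipped, use the most informative item.
        return ranked_items[0]

    ranked = [17, 42, 8, 23]                     # hypothetical ranking at the current theta
    k = {17: 0.75, 42: 0.40, 8: 1.00, 23: 1.00}  # hypothetical exposure parameters
    print(select_with_sympson_hetter(ranked, k))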

Why do this?  It obviously limits the exposure of the item.  But just how much it limits it depends on the difficulty of the item.  A very difficult item is likely only going to be a candidate for selection for very high-ability examinees.  Let’s say it’s the top 4%… well, then the approach above will limit it to 3% of the sample overall, but 75% of the examinees in its neighborhood.

On the other hand, an item of middle difficulty is used not only for middle examinees but often for any examinee.  Remember, unless there are some controls, the first item for the test will be the same for everyone!  So if we apply the Sympson-Hetter rule to that item, it limits it to 75% exposure in a more absolute sense.

Because of this, you don’t have to set that threshold parameter to the same value for each item.  The original recommendation was to do some CAT simulation studies, then set the parameters thoughtfully for different items.  Items that are likely to be highly exposed (middle difficulty with high discrimination) might deserve a more strict parameter like 0.40.  On the other hand, that super-difficult item isn’t an exposure concern because only the top 4% of students see it anyway… so we might leave its parameter at 1.0 and therefore not limit it at all.

Is this the only method available?

No.  As mentioned, there’s that simple randomesque approach.  But there are plenty more.  You might be interested in this paper, this paper, or this paper.  The last one reviews the research literature from 1983 to 2005.

What is the original reference?

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

How can I apply this to my tests?

Well, you certainly need a CAT platform first.  Our platform at ASC allows this method right out of the box – that is, all you need to do is enter the target proportion when you publish your exam, and the Sympson-Hetter method will be implemented.  No need to write any code yourself!  Click here to sign up for a free account.