Posts on psychometrics: The Science of Assessment

I was recently asked about scaled scoring and whether the transformation must be based on the normal curve. This is an important question, especially since most human traits fall in a fairly normal distribution, and item response theory places items and people on that latent scale. The short answer is “no”: there are other options for the scaled scoring transformation, and your situation can help you select the right method.

First of all: if you are new to the concept of scaled scoring, start out by reading this blog post. In short: it is a way of converting scores on a test to another scale for reporting to examinees, to hide certain important aspects such as differences in test form difficulty.

There are 4 types of scaled scoring, in general:

  1. Normal/standardized
  2. Linear
  3. Linear dogleg
  4. Equipercentile


Normal/standardized

This is an approach to scaled scoring that many of us are familiar with due to some famous applications, including the T score, IQ, and large-scale assessments like the SAT. It starts by finding the mean and standard deviation of raw scores on a test, then re-expresses each score on a scale with a different, chosen mean and standard deviation. If this seems fairly arbitrary and doesn’t change the meaning… you are totally right!

Let’s start by assuming we have a test of 50 items, and our data has a raw score average of 35 points with an SD of 5. The T score transformation – which has been around so long that a quick Googling can’t find me the actual citation – says to convert this to a mean of 50 with SD of 10. So, 35 raw points becomes a scaled score of 50. A raw score of 45 (2 SDs above mean) becomes a T of 70. We could also place this on the IQ scale (mean=100, SD=15) or the classic SAT scale (mean=500, SD=100).
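The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration under the numbers given above (raw mean 35, SD 5); the function name is my own and does not come from any package.

```python
def standardized_scale(raw, raw_mean, raw_sd, new_mean, new_sd):
    """Convert a raw score to a reporting scale with a chosen mean and SD."""
    z = (raw - raw_mean) / raw_sd      # how many SDs the raw score is from the mean
    return new_mean + new_sd * z       # same z, expressed on the new scale

# Example from the text: raw score of 45 on a test with mean 35 and SD 5
t_score = standardized_scale(45, 35, 5, new_mean=50, new_sd=10)      # T scale
iq_score = standardized_scale(45, 35, 5, new_mean=100, new_sd=15)    # IQ scale
sat_score = standardized_scale(45, 35, 5, new_mean=500, new_sd=100)  # classic SAT scale
print(t_score, iq_score, sat_score)
```

Note that the examinee is 2 SDs above the mean on every one of these scales; only the labels on the number line change.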

A side note about the boundaries of these scales… one of the first things you learn in any stats class is that plus/minus 3 SDs contains more than 99% of the population, so many scaled scores adopt these as convenient boundaries. This is why the classic SAT scale went from 200 to 800, with the urban legend that “you get 200 points for putting your name on the paper.” Similarly, the ACT tops out at 36 because it nominally had a mean=18 and SD=6, so plus/minus 3 SDs spans 0 to 36.

The normal/standardized approach can be used with classical number-correct scoring, but makes more sense if you are using item response theory, because all scores default to a standardized metric.


Linear

The linear approach is quite simple. It employs the y=mx+b that we all learned as schoolkids. With the previous example of a 50 item test, we might say intercept=200 and slope=4, so SCALED = 200 + 4*RAW. This then means that scores range from 200 to 400 on the test.

Yes, I know… the Normal conversion above is technically linear also, but deserves its own definition.
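Here is a small sketch of both points, assuming the numbers used so far (slope 4 and intercept 200 for the linear scale; raw mean 35 and SD 5 for the T score). The function name is hypothetical.

```python
def linear_scale(raw, slope=4.0, intercept=200.0):
    """Plain y = mx + b conversion from raw score to scaled score."""
    return intercept + slope * raw

# Endpoints of the 50-item test on the 200-400 scale
lowest = linear_scale(0)
highest = linear_scale(50)

# The T-score conversion written as a line:
# slope = new SD / raw SD, intercept = new mean - slope * raw mean
t_slope = 10 / 5
t_intercept = 50 - t_slope * 35
t_of_45 = linear_scale(45, slope=t_slope, intercept=t_intercept)
print(lowest, highest, t_slope, t_intercept, t_of_45)
```

The only real difference is where the slope and intercept come from: chosen by hand for the linear approach, derived from the observed mean and SD for the normal approach.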

Linear dogleg

The Linear Dogleg approach is a special case of the previous one, where you need to stretch the scale to reach two endpoints. Let’s suppose we published a new form of the test, and a classical equating method like Tucker or Levine says that it is 2 points easier and the slope of Form A to Form B is 3.8 rather than 4. This throws off our clean conversion to the 200 to 400 scale. So suppose we use the equation SCALED = 200 + 3.8*RAW, but only up to a raw score of 30. From 31 onwards, we use SCALED = 185 + 4.3*RAW. Note that the two lines meet at a raw score of 30 (both give 314), and the raw score of 50 still comes out to a scaled score of 400, so we still go from 200 to 400 but there is now a slight bend in the line. This is called the “dogleg,” after the golf hole of the same name.
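As a sketch, the dogleg is just a piecewise function. The two equations and the break at a raw score of 30 are taken from the example above; the function name and the `breakpoint_raw` parameter are mine.

```python
def dogleg_scale(raw, breakpoint_raw=30):
    """Piecewise-linear ("dogleg") conversion from the example in the text."""
    if raw <= breakpoint_raw:
        return 200 + 3.8 * raw     # lower segment
    return 185 + 4.3 * raw         # upper segment, steeper so that 50 still maps to 400

for raw in (0, 30, 31, 50):
    print(raw, round(dogleg_scale(raw), 1))
```

A quick check worth doing with any dogleg is that the two segments agree at the breakpoint; otherwise one raw point would produce a jump in the scaled score.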


Equipercentile

Lastly, there is Equipercentile, which is mostly used for equating forms but can similarly be used for scaling. In this conversion, we match scores that sit at the same percentile on each form, even if that makes for a very nonlinear transformation. For example, suppose our Form A had a 90th percentile of 46, which became a scaled score of 384. We find that Form B has a 90th percentile at 44 points, so we call that a scaled score of 384, and calculate a similar conversion for all other points.
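To make the idea concrete, here is a minimal sketch of percentile matching. It assumes mid-percentile ranks and linear interpolation between observed score points, which is one simple choice among several; operational equating usually adds smoothing. The data are made up (Form B is simply Form A shifted down 2 points), and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(sorted_scores, x):
    """Mid-percentile rank: percent of scores below x plus half the percent at x."""
    below = bisect_left(sorted_scores, x)
    at = bisect_right(sorted_scores, x) - below
    return 100.0 * (below + 0.5 * at) / len(sorted_scores)

def equipercentile_equivalent(x_b, form_b_scores, form_a_scores):
    """Form A raw score that has the same percentile rank as x_b has on Form B."""
    a_sorted, b_sorted = sorted(form_a_scores), sorted(form_b_scores)
    target = percentile_rank(b_sorted, x_b)
    a_points = sorted(set(a_sorted))                         # distinct Form A scores
    a_ranks = [percentile_rank(a_sorted, s) for s in a_points]
    if target <= a_ranks[0]:
        return float(a_points[0])
    if target >= a_ranks[-1]:
        return float(a_points[-1])
    i = bisect_left(a_ranks, target)                         # first rank >= target
    lo_r, hi_r = a_ranks[i - 1], a_ranks[i]
    lo_s, hi_s = a_points[i - 1], a_points[i]
    return lo_s + (hi_s - lo_s) * (target - lo_r) / (hi_r - lo_r)

# Made-up raw scores; Form B is Form A shifted 2 points lower
form_a = [30, 33, 35, 36, 38, 40, 42, 44, 46, 48]
form_b = [s - 2 for s in form_a]

a_equivalent = equipercentile_equivalent(44, form_b, form_a)
scaled = 200 + 4 * a_equivalent        # Form A's scaled-score conversion from earlier
print(a_equivalent, scaled)
```

With these toy data, a 44 on Form B lands on the same scaled score as a 46 on Form A, which mirrors the 384 in the example.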

Why are we doing this again?

Well, you can kind of see it in the example of having two forms with a difference in difficulty. In the Equipercentile example, suppose there is a cutscore to be in the top 10% to win a scholarship… if you get 45 on Form A you will lose, but if you get 45 on Form B you will win. Test sponsors don’t want to have this conversation with angry examinees, so they convert all scores to an arbitrary scale. The 90th percentile is always a 384, no matter how hard the test is. (Yes, that simple example assumes the populations are the same… there’s an entire portion of psychometric research dedicated to performing stronger equating.)

A Standard Setting Study is a formal process for establishing a performance standard. In the assessment world, there are actually two uses of the word standard; the other one refers to a formal definition of the content that is being tested, such as the Common Core State Standards in the USA. For this reason, I prefer the term cutscore study.

After item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification. This cannot be done arbitrarily (e.g., setting it at 70% because that’s what you saw when you were in school). To be legally defensible and eligible for Accreditation, it must be done using one of several standard setting approaches from the psychometric literature. The choice of method depends upon the nature of the test, the availability of pilot data, and the availability of subject matter experts.

Some types of Cutscore Studies:

  • Angoff – In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly.  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
  • Bookmark – The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be.  Obviously, this requires enough real data to calibrate item difficulty, usually using item response theory, which requires several hundred examinees.
  • Contrasting Groups – Candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other unrelated standard.  If using data from another exam, a sample of at least 50 candidates is obviously needed.
  • Borderline Group – Similar to Contrasting Groups, but a borderline group is defined using alternative information such as biodata, and the scores of the group are evaluated.
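For the Angoff method, the core calculation is short enough to sketch. Assuming each expert rates each item as the proportion of minimally competent candidates expected to answer correctly, the recommended raw cutscore is the sum of the average rating per item. The ratings below are invented, and this sketch leaves out the Beuk Compromise and any discussion rounds.

```python
from statistics import mean

# Hypothetical ratings: one row per expert, one column per item.
# Each value is the estimated proportion of minimally competent
# candidates who would answer that item correctly.
ratings = [
    [0.70, 0.55, 0.90, 0.60],
    [0.65, 0.60, 0.85, 0.50],
    [0.75, 0.50, 0.95, 0.55],
]

item_means = [mean(item) for item in zip(*ratings)]  # average rating per item
cutscore = sum(item_means)                           # expected raw score of a borderline candidate
percent_cut = 100 * cutscore / len(item_means)

print([round(m, 2) for m in item_means], round(cutscore, 2), round(percent_cut, 1))
```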

If you are involved with any sort of data science, which psychometrics most definitely is, you’ve probably used R. R is an environment that allows you to implement packages for many different types of analysis, which are built by a massive community of data scientists around the world. R has become one of the two main languages for data science and machine learning (the other being Python), and continues to grow in popularity in both those general areas. R for psychometrics is also becoming much more common.

I was extremely anti-R for a number of years, but have recently started using it, for several important reasons.  However, for some even more important reasons, I don’t use it for all of my work.  I recommend you do the same.  Let’s talk a bit about why.

What is R?

R is a programming-language-like environment for statistical analysis.  Its Wikipedia article defines it as a “programming language and free software environment for statistical computing and graphics” but I use the term “programming-language-like environment” because it is more like command scripting from DOS than an actual compiled language like Java or Pascal.  R has an extremely steep learning curve compared to software that provides a decent UI; it claims that RStudio is a UI, but it really is just a more advanced window to see the same command code!

R can be maddeningly frustrating for other, relatively simple reasons. For example, it will not recognize a missing value in data when running a simple correlation, and is unable to give you a decent error message explaining this. This was my first date with R, and turned me off for years. A similar thing occurred to me the first time I used PARSCALE in 2009 and couldn’t get it to work for days. Eventually I discovered it was because the original code base was DOS, which limits you to 8-character file names, and they never bothered to tell you that. They literally expected all users to be knowledgeable on 1980s DOS rules. In 2009.

BUT… R is free, and everybody likes free.  Even though free never means there is no cost.

What are packages?

R comes with some analysis out of the box, but the vast majority is available in packages.  For example, if you want to do factor analysis or item response theory, you install one of several packages that does those.  These packages are written by contributors and uploaded to an R server somewhere.  There is no code review or anything else to check the packages, so it is entirely a caveat emptor situation.  This isn’t malicious, they’re just taking the scientific approach that assumes other researchers will replicate, disprove, or alternativize work.  For important, commonly used packages (I am a huge fan of caret), this is most definitely the case.  For rarely used packages and pet projects, it is the opposite.

Why do I use R for psychometrics or elsewhere?

As mentioned above, I use R for when there are well-known packages that are accepted in the community.  The caret package is a great example.  Just Google “r caret” and you can get a glimpse of the many resources, blog posts, papers, and other uses of the package.  Actually, it isn’t an analysis package in itself, it just makes it easier to call existing, proven packages.  Another favorite is the text2vec package, and of course there is the ubiquitous tidyverse.

I love to use R in cases of more general data science problems, because this means a community several orders of magnitude above psychometricians, which definitely contributes to the higher quality.  The caret package is for regression and classification, which are used in just about every field.  The text2vec package is for natural language processing, used in fields as diverse as marketing, political science, and education.  One of my favorite projects I’ve heard about there was the analysis of the Jane Austen corpus.  Fascinating.

When would I use R packages that I might consider less stellar? I don’t mind using R when it is a low stakes situation, such as exploratory data analysis for a client. I would also consider it an acceptable alternative to commercial software when the analysis is something I do very rarely. No way am I going to pay $10,000 or whatever for something I do 2 hours per year. Finally, I would consider it for niche analyses where no other option exists except to write my own code, and it does not make financial sense to do so. However, in these cases, I still try to perform due diligence.


Why do I not use R?

In many cases, it comes down to a single word: quality.

For niche packages, the code might be 100% from a grad student who was a complete newbie on a topic for their thesis, with no background in software development or ancillary topics like writing a user manual.  Wow, does it show.  Moreover, no one has ever validated a single line of the code.  Because of this, I am very wary when using R packages.  If you use one like this, I highly recommend you do some QA or background research on it first!   I would love it if R had a community rating system, like exists for WordPress plugins.  With those, you can see that one plugin might be used on 1,000,000 sites with a 4.5/5.0 rating, while another is used on 43 sites with 2.7/5.0 rating.

This quality thing is, of course, a continuum.  There is a massive gap between the grad student project and something like caret.  In between you might have an R package that is a hobby of a professor, who devotes some time to it and has extremely deep knowledge of the subject matter, but it remains a part-time endeavor by someone with no experience in commercial software.  For examples of this situation, please see this review of an R package, or this comparison of IRT results with R vs professional tools.

The issue on User Manuals is of particular concern to me, as someone that provides commercial software and knows what it is like to support users.  I have seen user manuals in the R world that literally do not tell the users how to use the package.  They might provide a long-winded description of some psychometrics, obviously copied from a dissertation, as a “manual” when at best it only belongs as an appendix.  No info on formatting of input files, no provision of example input, no examples of usage, and no description of interpreting output.

Even in the cases of an extremely popular package that has high quality code, the documentation is virtually unreadable.  Check out the official landing page for tidyverse.  How welcoming is that?  I’ve found that the official documentation is almost guaranteed to be worthless – instead, head over to popular blogs or YouTube channels on your favorite topic.

The output is also famously bad quality.  R stores its output as objects, a sort of mini-database behind the scenes.  If you want to make graphs or dump results to something like a CSV file, you have to write more code just for such basics.  And if you want a nice report in Word or PDF, get ready to write a ton of code, or spend a week doing copy-and-paste.  I noticed that there was a workshop a few weeks ago at NCME (April 2019) that was specifically on how to get useful output reports from R, since this is a known issue.

Is R turning the corner?

I’ve got another post coming about R and how it has really turned the corner because of 3 things: Shiny, RStudio, and availability of quality packages.  More on that in the future, but for now:

  • Shiny allows you to make applications out of R code, so that the power of R can be available to end-users without them having to write & run code themselves.  Until Shiny, R was limited to people who wanted to write & run code.
  • RStudio makes it easier to develop R code, by overlaying an integrated development environment (IDE) on top of R.  If you have ever used an IDE, you know how important this is.  You’ve got to be incredibly clueless to not use an IDE for development.  Yet the first release of RStudio did not happen until 2011.  This shows how rooted R was in academia.
  • As you might surmise from my rant above, it is the quality packages (and third-party documentation!) that are really opening the floodgates.

Another, newer direction is that the world of R is hopping on the bandwagon of the API economy.  It might become the lingua franca of the data analytics world from an integration perspective.  This might be the true future of R for psychometrics.

But there are still plenty of issues. One of my pet peeves is the lack of quality error trapping. For example, if you make simple errors, the system will crash with completely worthless error messages. I found this to happen if I run an analysis, open my output file, and then run the analysis again without closing the output file. As previously mentioned, there is also the issue with a single missing data point in a correlation.

Nevertheless, R is still not really consumer facing.  That is, actual users will always be limited to people that have strong coding skills AND deep content knowledge on a certain area of data science or psychometrics.  Just like there will always be a home for more user-friendly statistical software like SPSS, there will always be a home for true psychometric software like Xcalibre.

Psychometric forensics is a surprisingly deep and complex field.  Many of the indices are incredibly sophisticated, but a good high-level and simple analysis to start with is overall time vs. scores, which I call Time-Score Analysis.  This approach uses simple flagging on two easily interpretable metrics (total test time in minutes and number correct raw score) to identify possible pre-knowledge, clickers, and harvester/sleepers.  Consider the four quadrants that a bivariate scatterplot of these variables would produce.


| Quadrant | Interpretation | Possible threat? | Suggested flagging |
|---|---|---|---|
| Upper right | High scores and taking their diligent time | Good examinees | NA |
| Upper left | High scores with low time | Pre-knowledge | Top 50% score and bottom 5% time |
| Lower left | Low scores with low time | “Clickers” or other low motivation | Bottom 5% time and score |
| Lower right | Low scores with high time | Harvesters, sleepers, or just very low ability | Top 5% time and bottom 5% scores |
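The suggested flags translate fairly directly into code. The sketch below is one possible reading of them: it treats “bottom 5%” as the lowest 5% of examinees by count (at least one person), uses made-up data, and uses hypothetical function and label names. Ties at a threshold will flag a few extra examinees, which seems acceptable for a screening step.

```python
def tail_size(n, pct=5):
    """Number of examinees in a pct% tail (at least 1), using integer math."""
    return max(1, n * pct // 100)

def flag_examinees(records):
    """records: list of (examinee_id, minutes, raw_score) tuples.
    Returns {examinee_id: flag} for examinees matching a suggested flag."""
    n = len(records)
    times = sorted(r[1] for r in records)
    scores = sorted(r[2] for r in records)
    k = tail_size(n)
    t_lo, t_hi = times[k - 1], times[-k]       # bottom / top 5% of time
    s_lo = scores[k - 1]                       # bottom 5% of score
    s_top_half = scores[-max(1, n // 2)]       # top 50% of score
    flags = {}
    for examinee_id, minutes, score in records:
        if minutes <= t_lo and score >= s_top_half:
            flags[examinee_id] = "possible pre-knowledge"
        elif minutes <= t_lo and score <= s_lo:
            flags[examinee_id] = "possible clicker"
        elif minutes >= t_hi and score <= s_lo:
            flags[examinee_id] = "possible harvester/sleeper"
    return flags

# Made-up data: 36 typical examinees plus four hand-built cases
records = [(f"typical{i:02d}", 45 + i % 11, 60 + (i * 7) % 30) for i in range(36)]
records += [("fast_low", 15, 20), ("fast_high", 18, 95),
            ("slow_low", 60, 22), ("slow_high", 59, 96)]

flags = flag_examinees(records)
print(flags)
```

The examinee with a long time and a high score is deliberately left unflagged, matching the “good examinees” quadrant.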

An example of Time-Score Analysis

Consider the example data below.  What can this tell us about the performance of the test in general, and about specific examinees?

This test had 100 items, scored classically (number-correct), and a time limit of 60 minutes. Most examinees took 45-55 minutes, so the time limit was appropriate. A few examinees spent 58-59 minutes; there will usually be some diligent students like that. There was a fairly strong relationship between time and score, in that examinees who took longer tended to score higher.

Now, what about the individuals?  I’ve highlighted 5 examples.

  1. This examinee had the shortest time, and one of the lowest scores.  They apparently did not care very much.  They are an example of a low motivation examinee that moved through quickly.  One of my clients calls these “clickers.”
  2. This examinee also took a short time, but had a suspiciously high score.  They definitely are an outlier on the scatterplot, and should perhaps be investigated.
  3. This examinee is simply super-diligent.  They went right up to the 60 minute limit, and achieved one of the highest scores.
  4. This examinee also went right up to the 60 minute limit, but had one of the lowest scores.  They are likely low ability or low motivation.  That same client of mine calls these “sleepers”: a candidate that is forced to take the exam but doesn’t care, so just sits there and dozes.  Alternatively, it might be a harvester: someone who has been assigned to memorize test content, so they spend all the time they can, but only look at half the items so they can focus on memorization.
  5. This examinee had by far the lowest score, and one of the lowest times.  Perhaps they didn’t even answer every question.  Again, there is a motivation/effort issue here, most likely.

How useful is time-score analysis?

Like other aspects of psychometric forensics, this is primarily useful for flagging purposes. We do not know yet if #4 is a Harvester or just low motivation. Instead of accusing them, we open an investigation. How many items did they attempt? Are they a repeat test-taker? At what location did they take the test? Do we have proctor notes, site video, remote proctoring video, or other evidence that we can review? There is a lot that can go into such an investigation. Moreover, simple analyses such as this are merely the tip of the iceberg when it comes to psychometric forensics. In fact, there is so much to do that I’ve heard some organizations simply stick their heads in the sand and don’t even bother checking out someone like #4. It just isn’t in the budget.

However, test security is an essential aspect of validity. If someone has stolen your test items, the test is now compromised, and you are guaranteed that scores do not mean the same thing they meant when the test was published. It’s now apples and oranges, even though the items on the test are the same. You might not challenge individual examinees, but you could institute a plan to publish new test forms every 6 months. Regardless, your organization needs to have some difficult internal discussions and establish a test security plan.


Whether you are a newly-launched credentialing program or a mature certification body, it is important to perform frequent “checkups” on your assessments, to ensure that they’re not only accurate, but also legally defensible.  The primary component of this process is a psychometric performance report, which provides important statistics on the test like reliability, and item statistics like difficulty and discrimination.  This work is primarily done by a psychometrician, though particular items flagged for poor performance should be reviewed by Subject Matter Experts (SMEs).  However, checkups should also sometimes include Job Task Analysis studies (JTAs) and Cutscore studies. This is where your SMEs really come in.  The frequency depends on how quickly your field is evolving, but a cycle of 5 years is often recommended. JTAs are sometimes called job analysis, practice analysis, or role delineation studies.

Your SMEs play a pivotal role in getting new assessments off the ground and keeping existing assessments fair and accurate. Whether they keep your program abreast with current innovations and industry standards or help you quantify the knowledge and various skills measured in your assessment, your SMEs work side-by-side with your psychometric experts through the job task analysis and cutscore process to ensure fair and accurate decisions are made.

If your program or assessment is in its infant stages, you will need to perform a Job Task Analysis to kick things off. The JTA is all about surveying on-the-job tasks, creating a list of tasks, and then devising a blueprint of what knowledge, skills, and abilities (KSAs) are required for certification in a given role or field.

The Basics of Job Task Analysis

  • Observe— Typically the psychometrician (that’s us) shadows a representative sample of people who perform the job in question (chosen through Panel Composition) to observe and take notes. After the day(s) of observation, the SMEs sit down with the observer so that he or she may ask any clarifying questions. The goal is to avoid doing this during the observation so that the observer has an untainted view of the job.  Alternatively, your SMEs can observe job incumbents – which is often the case when the SMEs are supervisors.

  • Generate— The SMEs now have a corpus of information on what is involved with the job, and generate a list of tasks that describe the most important job-related components.  Not all job analysis uses tasks, but this is the most common approach in certification testing, hence you will often hear the term job task analysis as a general term.
  • Survey— Now that we have a list of tasks, we send a survey out to a larger group of SMEs and ask them to rate various features of each task. How important is the task? How often is it performed? What larger category of tasks does it fall into?

  • Analyze— Next, we crunch the data and quantitatively evaluate the SMEs’ subjective ratings to determine which of the tasks and categories are most important.

  • Review— As a non-SME, the psychometrician needs to take their findings back to the SME panel to review the recommendation and make sure it makes sense.

  • Report— We put together a comprehensive report that outlines what the most important tasks/categories are for the given job.  This in turn serves as the foundation for a test blueprint, because more important content deserves more weight on the test.  This connection is one of the fundamental links in the validity argument for an assessment.
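To give a feel for the Analyze and Report steps, here is a rough sketch of turning survey ratings into blueprint weights. It assumes a criticality index of mean importance times mean frequency per task, summed within each category; that is one common convention rather than the only option, and your study may combine the ratings differently. The tasks, domains, ratings, and function name are all made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey results: task -> (category, importance ratings, frequency ratings)
# Ratings are on 1-5 scales, one value per responding SME.
survey = {
    "Task A": ("Domain 1", [5, 4, 3], [5, 5, 5]),
    "Task B": ("Domain 1", [2, 3, 1], [5, 5, 5]),
    "Task C": ("Domain 2", [5, 5, 5], [2, 3, 1]),
    "Task D": ("Domain 2", [3, 2, 1], [5, 5, 5]),
}

def blueprint_weights(survey):
    """Share of total criticality (mean importance x mean frequency) per category."""
    totals = defaultdict(float)
    for category, importance, frequency in survey.values():
        totals[category] += mean(importance) * mean(frequency)
    grand_total = sum(totals.values())
    return {category: value / grand_total for category, value in totals.items()}

weights = blueprint_weights(survey)
n_items = 50
items_per_domain = {category: round(w * n_items) for category, w in weights.items()}
print(weights, items_per_domain)
```

In practice the rounded item counts may not add up exactly to the test length, and the SME panel would still review and adjust the result, as in the Review step.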

Cutscore studies after job task analysis

When the JTA is completed, we have to determine who should pass the assessment, and who should fail. This is most often done using the modified Angoff process, where the SMEs conceptualize a minimally competent candidate (MCC) and then set the pass/fail point so that the MCC would just barely pass. There are other methods too, such as Bookmark or Contrasting Groups.

For newly-launching certification programs, these processes go hand-in-hand with item writing and review. The use of evidence-based practices in conducting the job task analysis, test design, writing items, and setting a cutscore serves as the basis for a good certification program. Moreover, if you are seeking to achieve accreditation (a third-party stamp of approval that your credential is high quality), documentation that you completed all these steps is required.

Performing these tasks with a trained psychometrician inherently checks a lot of boxes on the accreditation to-do list, which can position your organization well for the future. When it comes to accreditation, the psychometricians and measurement specialists at Assessment Systems have been around the block a time or two. We can walk you through the lengthy process of becoming accredited, or we can help you perform these tasks a la carte.

One of the most cliche phrases associated with assessment is “teaching to the test.” I’ve always hated this phrase, because it is only used in a derogatory manner, almost always by people who do not understand the basics of assessment and psychometrics. I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.


Why teaching to the test is usually a good thing

If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum.  Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

In the world of credentialing assessment, this is an extremely important link. Credential tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills in the job. This data is then transformed into test blueprints. Instructors for the profession, as well as aspiring students that are studying to pass the test, then focus on what is in the blueprints. This, of course, still contains the skills that are most important and frequently used in the job!


So what is the problem then?

Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then power to choose how they teach, as noted in the PISA article.

As a counterexample, my high school math department made an edict starting my sophomore year that all teachers had to use the “Chicago Method.” It was pure bunk and based on the idea that students should be doing as much busy work as possible instead of the teachers actually teaching. I think it is because some salesman convinced the department head to make the switch so that they would buy a thousand brand new textbooks. The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school) but I think we ended up with a bastardization of it, as the edict was primarily:

  1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
  2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.  
  3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.


What about all the bad tests out there?

Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints. There are, most certainly, plenty of cases where an assessment is not designed or built well. That’s an entirely different problem, and is an entirely valid concern. I have seen a number of these in my career. This danger is why we have international standards on assessments, like AERA/APA/NCME and NCCA. These provide guidelines on how a test should be built, sort of like how you need to build a house according to building code and not just throw up some walls and a roof.

For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world. It’s a matter of caveat emptor. But just because you had some cheap furniture in college that broke right away doesn’t mean you swear off all furniture. You stay away from bad furniture.

There’s also the problem of tests being misused, but again that’s not a problem with the test itself. More likely, someone making the decisions is uninformed. It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application then it’s still not a good situation. For example, suppose you took a very well-made exam for high school graduation and started using it for employment decisions with adults. Psychometricians call this validity: that we have evidence to support the intended use of the test and interpretations of scores. It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.


So where do we go from here?

Put it this way, if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining unspecifically about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  Nevertheless, of course, we need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

Item response theory is the predominant psychometric paradigm for mid or large scale assessment.  As noted in my introductory blog post, it is actually a family of models.  In this post, we discuss the two parameter IRT model (2PL).

The 2PL is described by the following equation (simplified from Hambleton & Swaminathan, 1985, Eq. 3.3):

This equation is predicting the probability of a certain response based on the examinee trait/ability level, the item discrimination parameter a, and the item difficulty/location parameter b.  If the examinee trait level is higher than the item location, the person has more than a 50% chance of responding in the keyed direction.
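For reference, the 2PL is commonly written in the form below, where θ is the examinee trait level and a_i and b_i are the discrimination and location parameters of item i. The notation here is my assumption and may differ from the cited source; in particular, many textbook presentations multiply the exponent by a scaling constant D = 1.7, which is omitted in this sketch.

```latex
P(X_i = 1 \mid \theta) = \frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}
% When \theta = b_i the exponent is 0, so P = 1/2.
% For a_i > 0, \theta > b_i gives P > 1/2 and \theta < b_i gives P < 1/2.
```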

This phrase “in the keyed direction” is one you might often hear with the 2PL.  This is because the 2PL is not often used with education/knowledge/ability assessments, where items usually have a correct answer and guessing is often possible.  The 2PL is used more often in attitudinal or other psychological assessments, where guessing is irrelevant and there is no correct answer.  For example, consider an Extraversion scale, where examinees are responding Yes/No to statements like “I love to go to parties” or “I prefer to read books in my free time.”  There is not much to guess here, and the sense of “correct” is not relevant.  However, it is quite clear that the first statement is keyed in the direction of Extraversion, while the second statement is the reverse.  In fact, you would earn the 1 point on the second statement for saying No rather than Yes.  Such an item is often called reverse-scored.
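
Reverse scoring is simple to apply before any analysis.  The sketch below assumes a hypothetical two-item scale with responses coded 1 = Yes and 0 = No; the item labels and the `score_item` helper are made up for illustration.

```python
# Hypothetical Yes/No responses from one examinee (1 = Yes, 0 = No)
responses = {"parties": 1, "books": 0}

# Items keyed in the reverse direction: a No response earns the point
reverse_keyed = {"books"}

def score_item(item, response):
    """Return 1 if the response is in the keyed direction, else 0."""
    return 1 - response if item in reverse_keyed else response

scored = {item: score_item(item, r) for item, r in responses.items()}
total = sum(scored.values())
print(scored, total)
```

This examinee said Yes to parties and No to books, so both responses are in the keyed direction and the total is 2.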

There are other aspects that go into whether you should use the 2PL model, but this is one of the most important.  In addition, you should also examine model fit indices and take sample size into account.

How do I implement the two parameter IRT model?

Like other IRT models, the 2PL requires specialized software.  Not all statistical packages will do it.  And while you can easily calculate classical statistics in Excel, there is no way to do IRT (well, unless you want to write your own VBA programs to do so).  As mentioned in this article on the three parameter model, there are a number of IRT software programs available but not all are created equal.  You should evaluate cost and functionality.  If you are a fan of R, there are packages to estimate IRT there.  However, I recommend our Xcalibre program for both newbies and professionals.  For newbies, it is much easier to use, which means you spend more time learning the concepts of IRT and not fighting command code that might be 30 years old.  For professionals, Xcalibre saves you from having to create reports by copy and paste; if you think about how much your hourly rate costs, that copy-and-paste time is incredibly expensive.

Item response theory (IRT) is an extremely powerful psychometric paradigm that addresses many of the inadequacies of classical test theory (CTT).  If you are new to the topic, there is a broad intro here, where you will learn that IRT is actually a family of mathematical models rather than one specific one.  Today, I’m talking about the 3PL.

One of the most commonly used models is called the three parameter IRT model (3PM), or  the three parameter logistic model (3PL or 3PLM) because it is almost always expressed in a logistic form.  The equation for this is below (Hambleton & Swaminathan, 1985, Eq. 3.3).
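
As a sketch of the standard logistic form (the item subscript i is notation assumed here, and the scaling constant D is commonly set to 1.7 or dropped entirely):

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{D a_i (\theta - b_i)}}{1 + e^{D a_i (\theta - b_i)}}
```

Setting c_i = 0 reduces this to the two parameter model.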

Like all IRT models, it is seeking to predict the probability of a certain response based on examinee ability/trait level and some parameters which describe the performance of the item.  With the 3PL, those parameters are a (discrimination), b (difficulty or location) and c (pseudoguessing).  For more on these, check out the descriptions in my general IRT article.
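
A minimal Python sketch shows how the three parameters act together.  The function name `prob_3pl`, the optional scaling constant `D`, and the parameter values (a four-option multiple choice item with c = 0.25) are hypothetical.

```python
import math

def prob_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct (keyed) response under the 3PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical four-option multiple choice item
a, b, c = 1.0, 0.0, 0.25

low = prob_3pl(-6.0, a, b, c)   # approaches c for very low ability
mid = prob_3pl(b, a, b, c)      # equals c + (1 - c) / 2 when theta equals b
high = prob_3pl(6.0, a, b, c)   # approaches 1 for very high ability

print(f"low={low:.3f}  mid={mid:.3f}  high={high:.3f}")
```

Note that with c above zero, an examinee located exactly at b has more than a 50% chance of answering correctly, because c acts as a floor on the probability.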

The remaining point then is what we mean by probability of a certain response.  The 3PL is a dichotomous model which means that it is predicting a binary outcome such as correct/incorrect or agree/disagree.

When should I use the three parameter IRT model?

The applicability of the 3PL to a certain assessment depends on the relevance of the components just discussed.  First, the response to the items must be binary.  This eliminates Likert-type items (“Rate on a scale of 1 to 5”), partial credit items (scoring an essay as 0 to 5 points), and performance assessments where scoring might include a range of points, deductions, or timing (number of words typed per minute).

Next, you should evaluate the applicability of the use of all three parameters.  Most notably, are the items in your assessment susceptible to guessing?  Because the thing that differentiates the 3PL from its sisters the 1PL and 2PL is that it attempts to model for guessing.  This, of course, is highly relevant for multiple choice items on knowledge or ability assessments, so the 3PL is often a great fit for those.

Even in this case, though, there are a number of practitioners and researchers who still prefer to use the 1PL or 2PL models.  There are some deeper methodological issues driving this choice.  The 2PL is sometimes chosen because it works well with an estimation method called Joint Maximum Likelihood.  The 1PL, also known as the Rasch model (yes, I know the Rasch people will say they are not the same; I am grouping them together for simplicity in comparison), is often selected because adherents to the model believe in certain advantages such as it providing “objective measurement.”  Also, the Rasch model works far better for smaller samples (see this technical report by Guyer & Thompson and this one by Yoes).  Regardless, you should probably evaluate model fit when selecting models.

I am from the camp that is pragmatic in choice rather than dogmatic.  While trained on the 3PL in graduate school, I have no qualms about using the 2PL or 1PL/Rasch if the test type and sample size warrant it or if fit statistics indicate they are sufficient.

How do I implement the three parameter IRT model?

If you want to implement the three parameter IRT model, you need specialized software.  General statistical packages such as SPSS do not always offer IRT analysis, though some do.  Even in the realm of IRT-specific software, not all programs produce the 3PL.  And, of course, software can vary greatly in terms of quality.  Here are three important ways it can vary:

  1. Accuracy of results: check out this research study which shows that some programs are inaccurate
  2. User-friendliness: some programs require you to write extensive code, and some have a purely graphical interface
  3. Output usability and interpretability: some programs just give simple ASCII text, others provide extensive Word or HTML reports with many beautiful tables and graphs.

For more on this topic, head over to my post on how to implement IRT in general.

Want to get started immediately?  Download a free copy of our IRT software Xcalibre.

Classical test theory is a century-old paradigm for psychometrics – using quantitative and scientifically-based processes to develop and analyze assessments to maximize their quality.  (Nobody likes unfair tests!)  The most basic and frequently used item statistic from classical test theory is the P-value.  It is usually called item difficulty but is sometimes called item facility, which can lead to possible confusion.

The P-Value Statistic

The classical P-value is the proportion of examinees that respond correctly to a question, or respond in the “keyed direction” for items where the notion of correct is not relevant (imagine a personality assessment where all questions are Yes/No statements such as “I like to go to parties” … Yes is the keyed direction for an Extraversion scale).  Note that this is NOT the same as the p-value that is used in hypothesis testing from general statistical methods.  This P-value is almost universally agreed upon in terms of calculation.  But some people call it item difficulty and others call it item facility.  Why?

It has to do with the clarity of interpretation.  It usually makes sense to think of difficulty as an important aspect of the item.  The P-value presents this, but in a reverse manner.  We usually expect higher values to indicate more of something, right?  But a P-value of 1.00 is high, and it means that there is not much difficulty; everyone gets the item correct, so it is actually no difficulty whatsoever.  A P-value of 0.25 is low, but it means that there is a lot of difficulty; only 25% of examinees are getting it correct, so it has quite a lot of difficulty.
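
The calculation itself is just a column mean.  The sketch below uses a small made-up matrix of scored responses (rows are examinees, columns are items, 1 = correct or keyed); the data are hypothetical.

```python
# Hypothetical scored responses: rows are examinees, columns are items
scored = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]

n_examinees = len(scored)

# zip(*scored) transposes the matrix so each tuple holds one item's responses
p_values = [sum(item) / n_examinees for item in zip(*scored)]
print(p_values)  # [1.0, 0.75, 0.25, 0.75]
```

The first item has the highest P-value (1.00) and is the easiest; the third has the lowest (0.25) and is the hardest.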

So where does “item facility” come in?

See how the meaning is reversed?  It’s for this reason that some psychometricians prefer to call it item facility or item easiness.  We still use the P-value, but 1.00 means high facility/easiness and 0.25 means low facility/easiness.  The direction of the semantics fits much better.

Nevertheless, this is a minority of psychometricians.  There’s too much momentum to change an entire field at this point!  It’s similar to the 3 dichotomous IRT parameters (a,b,c); some of you might have noticed that they are actually in the wrong order, because the 1-parameter model does not use the a parameter, it uses the b.  At the end of the day, it doesn’t really matter, but it’s another good example of how we all just got used to doing something and it’s now too far down the road to change it.  Tradition is a funny thing.

The modified-Angoff method is arguably the most common method of setting a cutscore on a test.  The Angoff cutscore is legally defensible and meets international standards such as AERA/APA/NCME, ISO 17024, and NCCA.  It also has the benefit that it does not require the test to be administered to a sample of candidates first; methods like Contrasting Groups, Borderline Group, and Bookmark do require this.

There are, of course, some drawbacks to the Angoff cutscore process.  The most significant is the fact that the subject matter experts (SMEs) tend to overestimate their conceptualization of a minimally competent candidate, and therefore overestimate the cutscore.  Sometimes to the point that the expected pass rate is zero!

Another drawback is that the Angoff cutscore process only works in the classical psychometric paradigm – the recommended cutscores are on the number-correct metric or percentage-correct metric.  If your tests are developed and scored in the item response theory (IRT) paradigm, you need to convert the classical cutscore to the IRT theta scale.  The easiest way to do that is to reverse-calculate the test response function (TRF) from IRT.

The Test Response Function

The TRF (sometimes called a test characteristic curve) is an important method of characterizing test performance in the IRT paradigm.  The TRF predicts a classical score from an IRT score, as you see below.  Like the item response function and test information function (these need blog posts too), it uses the theta scale as the X-axis.  The Y-axis can be either the number-correct metric or proportion-correct metric.

In this example, you can see that a theta of -0.6 translates to an estimated number-correct score of approximately 10, and +1 to 15.5.  Note that the number-correct metric only makes sense for linear or LOFT exams, where every examinee receives the same number of items.  In the case of CAT exams, only the proportion correct metric makes sense.

Angoff cutscore to IRT

So how does this help us with the conversion of a cutscore?  Well, we hereby have a way of translating any number-correct score or proportion-correct score.  So any Angoff-recommended cutscore can be reverse-calculated to a theta value.  If your Angoff study (or Beuk) recommends a cutscore of 10 out of 20 points, you can convert that to a theta cutscore of -0.6.  If the recommended cutscore was 15.5, the theta cutscore would be 1.0.
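
The reverse calculation can be sketched in a few lines of Python.  This is one possible approach, not a description of any particular software: it assumes 3PL items, computes the TRF as the sum of the item response functions, and inverts it by bisection.  The function names, the theta search range of -4 to +4, and the 20 item parameter sets are all hypothetical, so the resulting theta will not match the example values above.

```python
import math

def prob_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response to one item under the 3PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def trf(theta, items):
    """Expected number-correct score at theta: the sum of the item response functions."""
    return sum(prob_3pl(theta, a, b, c) for a, b, c in items)

def theta_for_score(cut, items, lo=-4.0, hi=4.0, tol=1e-6):
    """Find the theta whose expected score equals cut, by bisection.

    This relies on the TRF increasing with theta, which holds when every a is positive.
    """
    if not trf(lo, items) <= cut <= trf(hi, items):
        raise ValueError("cutscore is outside the range the TRF reaches on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if trf(mid, items) < cut:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical 20-item form as (a, b, c) tuples, with difficulties spread from -2.0 to 1.8
items = [(1.0, -2.0 + 0.2 * i, 0.2) for i in range(20)]

theta_cut = theta_for_score(10.0, items)
print(f"theta cutscore: {theta_cut:.3f}")
print(f"expected score at that theta: {trf(theta_cut, items):.3f}")
```

With guessing in the model, very low number-correct cutscores cannot be converted at all, because the TRF never drops below the sum of the c parameters; the range check guards against that case.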

IRT scores examinees on the same scale with any set of items, as long as those items have been part of a linking/equating study.  Therefore, a single Angoff study on a set of items can be equated to any other linear test form, LOFT pool, or CAT pool.  This makes it possible to apply the classically-focused Angoff method to IRT-focused programs.