Posts on psychometrics: The Science of Assessment

Automated item generation (AIG) is a paradigm for developing assessment items, aka test questions, utilizing principles of artificial intelligence and automation. As the name suggests, it tries to automate some or all of the effort involved with item authoring, as that is one of the most time-intensive aspects of assessment development – which is no news to anyone who has authored test questions! Items can cost up to $2000 to develop, so even cutting the average cost in half could provide massive time/money savings to an organization.

There are two types of automated item generation:

Type 1: Item Templates (Current Technology)

The first type is based on the concept of item templates to create a family of items using dynamic, insertable variables. Authors, or a team, create an item model by isolating exactly what they are trying to assess and the different ways that the knowledge could be presented or evidenced. An algorithm then turns this template into a family of related items, often by producing all possible permutations (see below).
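
As a rough illustration of the template idea, here is a minimal Python sketch. The stem, the field names (distance, hours), and their values are all hypothetical; they are not taken from any real item bank or from Ada. It simply expands one template into every combination of field values and computes the key for each.

```python
from itertools import product

# Hypothetical item model: a stem with two dynamic fields.
TEMPLATE = ("A train travels {distance} km in {hours} hours. "
            "What is its average speed in km/h?")

# Hypothetical values that each field may take.
FIELDS = {"distance": [120, 180, 240], "hours": [2, 3]}


def generate_items(template, fields):
    """Build one item (stem plus computed key) per combination of field values."""
    names = list(fields)
    items = []
    for combo in product(*(fields[name] for name in names)):
        values = dict(zip(names, combo))
        items.append({
            "stem": template.format(**values),
            "key": values["distance"] / values["hours"],
        })
    return items


items = generate_items(TEMPLATE, FIELDS)
for item in items:
    print(item["stem"], "->", item["key"])
```

Three distances times two durations gives six items in the family. In practice each one would still go to SME review rather than straight into the bank.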

Obviously, you can’t use more than one of these on a given test form. And in some cases, some of the permutations will be an unlikely scenario or possibly completely irrelevant. But the savings can still be quite real. I saw a conference presentation by Andre de Champlain from the Medical Council of Canada, stating that overall efficiency improved by 6x and the generated items were higher quality than traditionally written items because the process made the authors think more deeply about what they were assessing and how. He also recommended that template permutations not be automatically moved to the item bank but instead that each be reviewed by SMEs, for reasons such as those stated above.

You might think “Hey, that’s not really AI…” – AI is doing things that have been in the past done by humans, and the definition gets pushed further every year. Remember, AI used to be just having the Atari be able to play Pong with you!

Type 2: AI Processing of Source Text (Future Technology)

The second type is what the phrase “automated item generation” more likely brings to mind: upload a textbook or similar source to some software, and it spits back drafts of test questions. For example, see this article by von Davier (2019). This technology is still cutting edge and working through issues. For example, how do you automatically come up with quality, plausible distractors for a multiple choice item? This might be automated in some cases like mathematics, but in most cases the knowledge of plausibility lies with subject matter expertise. Moreover, this approach is certainly not accessible for the typical educator.

How Can I Implement Automated Item Generation?

AIG has been used by the large testing companies for years, but is no longer limited to their domain. It is now available off the shelf as part of ASC’s nextgen assessment platform, Ada. Best of all, that component is available at the free subscription level; all you need to do is register with a valid email address. Click here to sign up.

Ada provides a clean, intuitive interface to implement Type 1 AIG, in a way that is accessible to all organizations. Develop your item templates, insert dynamic fields, and then export the results to review then implement in an item banking system, which is also available for free in Ada.

If you have worked in the field of assessment and psychometrics, you have undoubtedly encountered the word “standard.” While a relatively simple word, it has the potential to be confusing because it is used in three (and more!) completely different but very important ways. Here’s a brief discussion.

Standard = Cutscore

As noted by the well-known professor Gregory Cizek here, “standard setting refers to the process of establishing one or more cut scores on a test.” The various methods of setting a cutscore, like Angoff or Bookmark, are referred to as standard setting studies. In this context, the standard is the bar that separates a Pass from a Fail. We use methods like the ones mentioned to determine this bar in as scientific and defensible a fashion as possible, and give it more concrete meaning than an arbitrarily selected round number like 70%. Selecting a round number like that will likely get you sued since there is no criterion-referenced interpretation.

Standard = Blueprint

If you work in the field of education, you often hear the term “educational standards.” These refer to the curriculum blueprints for an educational system, which also translate into assessment blueprints, because you want to assess what is on the curriculum. Several important ones in the USA are noted here, perhaps the most common of which nowadays is the Common Core State Standards, which attempted to standardize the standards across states. These standards exist to standardize the educational system, by teaching what a group of experts have agreed upon should be taught in 6th grade Math classes for example. Note that they don’t state how or when a topic should be taught, merely that 6th Grade Math should cover Number Lines, Measurement Scales, Variables, whatever – sometime in the year.

Standard = Guideline

If you work in the field of professional certification, you hear the term just as often but in a different context, accreditation standards. The two most common are the National Commission for Certifying Agencies (NCCA) and the ANSI National Accreditation Board (ANAB). These two organizations are a consortium of credentialing bodies that give a stamp of approval to credentialing bodies, stating that a Certification or Certificate program is legit. Why? Because there is no law to stop me from buying a textbook on any topic, writing 50 test questions in my basement, and selling it as a Certification. It is completely a situation of caveat emptor, and these organizations are helping the buyers by giving a stamp of approval that the certification was developed with accepted practices like a Job Analysis, Standard Setting Study, etc.

In addition, there are the professional standards for our field. These are guidelines on assessment in general rather than just credentialing. Two great examples are the AERA/APA/NCME Standards for Educational and Psychological Testing and the International Test Commission’s Guidelines (yes they switch to that term) on various topics.

Also: Standardized = Equivalent Conditions

The word is also used quite frequently in the context of standardized testing, though it is rarely chopped to the root word “standard.” In this case, it refers to the fact that the test is given under equivalent conditions to provide greater fairness and validity. A standardized test does NOT mean multiple choice, bubble sheets, or any of the other pop connotations that are carried with it. It just means that we are standardizing the assessment and the administration process. Think of it as a scientific experiment; the basic premise of the scientific method is holding all variables constant except the variable in question, which in this case is the student’s ability. So we ensure that all students receive a psychometrically equivalent exam, with equivalent (as much as possible) writing utensils, scrap paper, computer, time limit, and all other practical surroundings. The problem comes with the lack of equivalence in access to study materials, prep coaching, education, and many bigger questions… but those are a societal issue and not a psychometric one.

So despite all the bashing that the term gets, a standardized test is MUCH better than the alternatives of no assessment at all, or an assessment that is not a level playing field and has low reliability. Consider the case of hiring employees: if assessments were not used to provide objective information on applicant skills and we could only use interviews (which are famously subjective and inaccurate), all hiring would be virtually random and the number of incompetent people in jobs would increase a hundredfold. And don’t we already have enough people in jobs where they don’t belong?

The generalized partial credit model (GPCM; Muraki, 1992) is one of the family of models from item response theory. It is designed to work, as you might have guessed, with items that are partial credit. That is, instead of just right/wrong as possible scoring, an examinee can receive partial points for completing some aspects of the item correctly. For example, a typical multiple choice item is scored as 0 points for incorrect and 1 point for correct. A GPCM item might consist of 3 aspects and be 0 points for incorrect, 3 points for fully correct, and 1 or 2 points if the examinee only completes 1 or 2 of the aspects but not all three.

Examples of GPCM items

GPCM items therefore contain multiple point levels, starting at 0.  There are several examples that are common in the world of educational assessment.

The first example, which nearly everyone is familiar with, is essay rubrics. A student might be instructed to write an essay on why extracurriculars are important in school, with at least 3 supporting points. Such an essay might be scored with number of points presented (0,1,2,3) as well as on grammar (0 = 10 or more errors, 1 = 3-9 errors, and 2 = 2 or fewer errors). Here’s a shorter example.

Another example is multiple response items. For example, a student might be presented a list of 5 animals and be asked to identify which are Mammals. There are 2 correct answers, so the possible points are 0,1,2. Note that this also includes their tech-enhanced equivalents such as drag and drop; such items might be reconfigured to dragging the animal names into boxes, but that’s just window dressing to make the item look sexier.

The National Assessment of Educational Progress and many other K-12 assessments utilize the GPCM since they so often use item types like this.

Why use the generalized partial credit model?

Well, the first part of the answer is a more general question: why use polytomous items? Well, these items are generally regarded to be higher-fidelity and assess deeper thinking than multiple choice items. They also provide much more information than multiple choice items in an IRT paradigm.

The second part of the answer is the specific question: If we have polytomous items, why use the GPCM rather than other models? There are two parts to that answer, which refer to the name generalized partial credit model. First, partial credit models are appropriate for items where the scoring starts at 0, and different polytomous items could have very different performance. In contrast, Likert-style items are also polytomous (almost always) but start at 1 and utilize the same psychological response process on every item. For example, a survey where statements are presented and examinees are to “Rate each on a scale of 1 to 5.” Second, the “generalized” part of the name means that it includes a discrimination parameter for evaluating the measurement quality of an item. This is similar to using the 2PL or 3PL for dichotomous items rather than using the Rasch model and assuming items are of equal discrimination. There is also a Rasch partial credit model that is equivalent, and can be used alongside Rasch dichotomous items, but this post is just focusing on GPCM.

Definition of the Generalized Partial Credit Model

The generalized partial credit model is defined by the equation below (Embretson & Reise, 2000).

In this equation:

  • m = number of possible points
  • x = the student’s score on the item
  • i = index for item
  • θ = student ability
  • a = discrimination parameter for item i
  • gij = the boundary parameter for step j on item i; there are always m-1 boundaries
  • r = an index used to manage the summation
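
For reference, the GPCM is commonly written as follows, using the symbols defined above. This is a sketch under one assumption: that m counts the score levels (0 through m-1), which is what gives m-1 boundaries. It also follows the usual convention that the inner sum for a score of 0 is defined as 0.

```latex
P_{ix}(\theta) =
  \frac{\exp\left[\sum_{j=0}^{x} a_i\,(\theta - g_{ij})\right]}
       {\sum_{r=0}^{m-1} \exp\left[\sum_{j=0}^{r} a_i\,(\theta - g_{ij})\right]},
\qquad
\sum_{j=0}^{0} a_i\,(\theta - g_{ij}) \equiv 0
```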

What do these mean?  The a parameter is the same concept as the a parameter in dichotomous IRT, where 0.5 might be low and 1.2 might be high.  The boundary parameters define the steps or thresholds that explain how the GPCM works, which will become clearer when you see the graph below.

As an example, let us consider a 4 point item with the following parameters.

If you utilize those numbers to graph the functions for each point level as a function of theta, you would see a graph like the one below.  Here, consider Option 1 to be the probability of getting 0 points; this is very high probability for the lowest examinees, but drops as ability increases.  Conversely, the Option 5 line is for receiving all possible points; high probability for the best examinees, but probability decreases as ability does.  In between, we have probability curves for 1, 2, and 3 points.  If an examinee has a theta of -0.5, they have a high probability of getting 2 points on the item (yellow curve).

The boundary parameters mentioned earlier have a very real interpretation with this graph; they are literally the boundaries between the curves. That is, the theta level at which 1 point (purple) becomes more likely than 0 points (red) is at -2.4, as you can see where the two lines cross. Note that this is the first boundary parameter, b1 in the image earlier.
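
To make the curves concrete, here is a minimal Python sketch that computes the GPCM probability of each point level at a given theta. The parameter values are hypothetical: only the first boundary (-2.4) comes from the discussion above, while the discrimination and the other three boundaries are made up for illustration.

```python
import math


def gpcm_probabilities(theta, a, boundaries):
    """Return [P(0 points), P(1 point), ...] for one item under the GPCM."""
    # Running sums of a * (theta - g_j); the sum for 0 points is defined as 0.
    exponents = [0.0]
    for g in boundaries:
        exponents.append(exponents[-1] + a * (theta - g))
    # Subtract the largest exponent before exp() for numerical stability.
    top = max(exponents)
    numerators = [math.exp(e - top) for e in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]


# Hypothetical 4-point item: 5 point levels (0-4), so 4 boundaries.
A = 1.0                              # assumed discrimination
BOUNDARIES = [-2.4, -1.0, 0.3, 1.5]  # only -2.4 is from the text

probs = gpcm_probabilities(-0.5, A, BOUNDARIES)
for points, p in enumerate(probs):
    print(points, round(p, 3))
```

With these assumed values, 2 points is the most likely outcome at theta = -0.5, and the curves for 0 and 1 points cross exactly at the first boundary.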

How to use the GPCM

As mentioned before, the GPCM is appropriate to use as your IRT model for multi-point items in an educational context, as opposed to Likert-style psychological items. They are almost always used in conjunction with the 2PL or 3PL dichotomous models; consider a test of 25 multiple choice items, 3 multiple response items, and an essay with 2 rubrics.

To implement, you need an IRT software program that can estimate dichotomous and polytomous items jointly, such as Xcalibre.  Consider the screenshot below to specify these. 

If you implement IRT with Xcalibre, it produces a page like this for each GPCM item.

To score students with the GPCM, you either need to use the IRT program like Xcalibre to score students, or a test delivery system that has been specifically designed to support the GPCM in the item banker and implement GPCM in scoring routines.  The former only works when you are doing the IRT analysis after all examinees have completed a test; if you have continuous deployment of assessments, you will need to use the latter approach.

Where can I learn more?

IRT textbooks will provide a treatment of polytomous models like the generalized partial credit model. Examples are de Ayala (2010) and Embretson & Reise (2000). In addition, I recommend the 2010 book by Nering and Ostini, which was previously available as a monograph.

I was recently asked about scaled scoring and if the transformation must be based on the normal curve. This is an important question, especially since most human traits fall in a fairly normal distribution, and item response theory places items and people on that latent scale. The short answer is “no” – there are other options for the scaled scoring transformation, and your situation can help you select the right method.

First of all: if you are new to the concept of scaled scoring, start out by reading this blog post. In short: it is a way of converting scores on a test to another scale for reporting to examinees, to hide certain important aspects such as differences in test form difficulty.

There are 4 types of scaled scoring, in general:

  1. Normal/standardized
  2. Linear
  3. Linear dogleg
  4. Equipercentile

Normal/standardized

This is an approach to scaled scoring that many of us are familiar with due to some famous applications, including the T score, IQ, and large-scale assessments like the SAT. It starts by finding the mean and standard deviation of raw scores on a test, then converts whatever that is to another mean and standard deviation. If this seems fairly arbitrary and doesn’t change the meaning… you are totally right!

Let’s start by assuming we have a test of 50 items, and our data has a raw score average of 35 points with an SD of 5. The T score transformation – which has been around so long that a quick Googling can’t find me the actual citation – says to convert this to a mean of 50 with SD of 10. So, 35 raw points becomes a scaled score of 50. A raw score of 45 (2 SDs above mean) becomes a T of 70. We could also place this on the IQ scale (mean=100, SD=15) or the classic SAT scale (mean=500, SD=100).
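
The arithmetic above can be sketched in a few lines of Python. The function name is just an illustrative choice; the means and SDs are the ones from the example.

```python
def standardized_scale(raw, raw_mean, raw_sd, new_mean, new_sd):
    """Convert a raw score to a scale with the given mean and SD."""
    z = (raw - raw_mean) / raw_sd   # distance from the mean in SD units
    return new_mean + new_sd * z


# Raw scores from the example: mean 35, SD 5.
t_score = standardized_scale(45, 35, 5, new_mean=50, new_sd=10)
iq_like = standardized_scale(45, 35, 5, new_mean=100, new_sd=15)
sat_like = standardized_scale(45, 35, 5, new_mean=500, new_sd=100)
print(t_score, iq_like, sat_like)
```

The same raw score of 45 lands 2 SDs above the mean on every scale; only the reported number changes.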

A side note about the boundaries of these scales… one of the first things you learn in any stats class is that plus/minus 3 SDs contains about 99.7% of the population, so many scaled scores adopt these as convenient boundaries. This is why the classic SAT scale went from 200 to 800, with the urban legend that “you get 200 points for putting your name on the paper.” Similarly, the ACT goes from 0 to 36 because it nominally had a mean=18 and SD=6.

The normal/standardized approach can be used with classical number-correct scoring, but makes more sense if you are using item response theory, because all scores default to a standardized metric.

Linear

The linear approach is quite simple. It employs the y=mx+b that we all learned as schoolkids. With the previous example of a 50 item test, we might say intercept=200 and slope=4. This then means that scores range from 200 to 400 on the test.

Yes, I know… the Normal conversion above is technically linear also, but deserves its own definition.

Linear dogleg

The Linear Dogleg approach is a special case of the previous one, where you need to stretch the scale to reach two endpoints. Let’s suppose we published a new form of the test, and a classical equating method like Tucker or Levine says that it is 2 points easier and the slope of Form A to Form B is 3.8 rather than 4. This throws off our clean conversion to the 200 to 400 scale. So suppose we use the equation SCALED = 200 + 3.8*RAW but only up to a raw score of 30. From 31 onwards, we use SCALED = 185 + 4.3*RAW. Note that a raw score of 50 then still comes out to a scaled score of 400, so we still go from 200 to 400 but there is now a slight bend in the line. This is called the “dogleg,” similar to the golf hole of the same name.
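
Here is a small Python sketch of the plain linear conversion and the dogleg, using the slopes and intercepts from the examples above. The function names and the `bend` parameter are illustrative choices, not part of any particular software.

```python
def linear_scale(raw, intercept=200.0, slope=4.0):
    """Plain linear conversion: SCALED = intercept + slope * RAW."""
    return intercept + slope * raw


def dogleg_scale(raw, bend=30):
    """Two line segments that meet at the bend and still end at 400."""
    if raw <= bend:
        return 200 + 3.8 * raw
    return 185 + 4.3 * raw


for raw in (0, 30, 31, 50):
    print(raw, linear_scale(raw), round(dogleg_scale(raw), 1))
```

Both segments give 314 at a raw score of 30, so the conversion has a bend but no jump.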

Equipercentile

Lastly, there is Equipercentile, which is mostly used for equating forms but can similarly be used for scaling.  In this conversion, we match the percentile for each, even if it is a very nonlinear transformation.  For example, suppose our Form A had 90th percentile of 46, which became scaled of 384.  We find that Form B has a 90th percentile at 44 points, so we call that a scaled score of 384, and calculate a similar conversion for all other points.
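
A very simplified Python sketch of the idea follows. The two score lists are hypothetical, chosen so that the 90th percentile falls at 46 on Form A and 44 on Form B as in the example. It matches percentile ranks directly, with none of the smoothing or interpolation that an operational equating would typically involve, and it assumes Form A is scaled with SCALED = 200 + 4*RAW.

```python
from bisect import bisect_right


def percentile_rank(sorted_scores, x):
    """Percent of examinees scoring at or below x."""
    return 100.0 * bisect_right(sorted_scores, x) / len(sorted_scores)


def equipercentile_equivalent(x_b, scores_a, scores_b):
    """Lowest Form A score whose percentile rank reaches that of x_b on Form B."""
    a_sorted, b_sorted = sorted(scores_a), sorted(scores_b)
    target = percentile_rank(b_sorted, x_b)
    for candidate in sorted(set(a_sorted)):
        if percentile_rank(a_sorted, candidate) >= target:
            return candidate
    return a_sorted[-1]


# Hypothetical raw scores; Form B runs about 2 points lower than Form A.
form_a = [30, 34, 36, 38, 40, 41, 42, 43, 46, 48]
form_b = [28, 32, 34, 36, 38, 39, 40, 41, 44, 46]

a_equivalent = equipercentile_equivalent(44, form_a, form_b)
scaled = 200 + 4 * a_equivalent   # assumed Form A scaling
print(a_equivalent, scaled)
```

With this toy data, a 44 on Form B maps to a 46 on Form A and therefore reports as 384.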

Why are we doing this again?

Well, you can kind of see it in the example of having two forms with a difference in difficulty. In the Equipercentile example, suppose there is a cutscore to be in the top 10% to win a scholarship… if you get 45 on Form A you will lose, but if you get 45 on Form B you will win. Test sponsors don’t want to have this conversation with angry examinees, so they convert all scores to an arbitrary scale. The 90th percentile is always a 384, no matter how hard the test is. (Yes, that simple example assumes the populations are the same… there’s an entire portion of psychometric research dedicated to performing stronger equating.)

A Standard Setting Study is a formal process for establishing a performance standard. In the assessment world, there are actually two uses of the word standard – the other one refers to a formal definition of the content that is being tested, such as the Common Core State Standards in the USA. For this reason, I prefer the term cutscore study.

After item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification. This cannot be done arbitrarily (e.g., setting it at 70% because that’s what you saw when you were in school). To be legally defensible and eligible for Accreditation, it must be done using one of several standard setting approaches from the psychometric literature. The choice of method depends upon the nature of the test, the availability of pilot data, and the availability of subject matter experts.

Some types of Cutscore Studies:

  • Angoff – In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly.  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
  • Bookmark – The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be.  Obviously, this requires enough real data to calibrate item difficulty, usually using item response theory, which requires several hundred examinees.
  • Contrasting Groups – Candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other unrelated standard.  If using data from another exam, a sample of at least 50 candidates is obviously needed.
  • Borderline Group – Similar to Contrasting Groups, but a borderline group is defined using alternative information such as biodata, and the scores of the group are evaluated.
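
To illustrate the Angoff approach from the list above, here is a minimal Python sketch of one common way to turn panel ratings into a raw cutscore: average the ratings for each item across raters, then sum across items. The ratings are hypothetical, and a real study would also involve steps such as rater training, discussion rounds, and possibly the Beuk adjustment, none of which are shown.

```python
from statistics import mean


def angoff_cutscore(ratings):
    """ratings[r][i] is rater r's estimate of the proportion of minimally
    competent candidates who would answer item i correctly.
    Returns the recommended raw cutscore."""
    n_items = len(ratings[0])
    item_means = [mean(rater[i] for rater in ratings) for i in range(n_items)]
    return sum(item_means)


# Hypothetical panel: 3 raters, 4 items.
ratings = [
    [0.6, 0.7, 0.5, 0.8],
    [0.7, 0.8, 0.4, 0.9],
    [0.5, 0.6, 0.6, 0.7],
]

cutscore = angoff_cutscore(ratings)
print(round(cutscore, 2), "out of", len(ratings[0]))
```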

If you are involved with any sort of data science, which psychometrics most definitely is, you’ve probably used R. R is an environment that allows you to implement packages for many different types of analysis, which are built by a massive community of data scientists around the world. R has become one of the two main languages for data science and machine learning (the other being Python), and continues to grow in popularity in both those general areas. R for psychometrics is also becoming much more common.

I was extremely anti-R for a number of years, but have recently started using it, for several important reasons.  However, for some even more important reasons, I don’t use it for all of my work.  I recommend you do the same.  Let’s talk a bit about why.

What is R?

R is a programming-language-like environment for statistical analysis.  Its Wikipedia article defines it as a “programming language and free software environment for statistical computing and graphics” but I use the term “programming-language-like environment” because it is more like command scripting from DOS than an actual compiled language like Java or Pascal.  R has an extremely steep learning curve compared to software that provides a decent UI; it claims that RStudio is a UI, but it really is just a more advanced window to see the same command code!

R can be maddeningly frustrating because of other relatively straightforward reasons.  For example, it will not recognize a missing value in data when running a simple correlation, and is unable to give you a decent error message explaining this.  This was my first date with R, and turned me off for years.  A similar thing occurred to me the first time I used PARSCALE in 2009 and couldn’t get it to work for days.  Eventually I discovered it was because the original code base was DOS, which limits you to 8-character file names, and they never bothered to tell you that.  They literally expected all users to be knowledgeable on 1980s DOS rules.  In 2009.

BUT… R is free, and everybody likes free.  Even though free never means there is no cost.

What are packages?

R comes with some analysis out of the box, but the vast majority is available in packages.  For example, if you want to do factor analysis or item response theory, you install one of several packages that does those.  These packages are written by contributors and uploaded to an R server somewhere.  There is no code review or anything else to check the packages, so it is entirely a caveat emptor situation.  This isn’t malicious, they’re just taking the scientific approach that assumes other researchers will replicate, disprove, or alternativize work.  For important, commonly used packages (I am a huge fan of caret), this is most definitely the case.  For rarely used packages and pet projects, it is the opposite.

Why do I use R for psychometrics or elsewhere?

As mentioned above, I use R for when there are well-known packages that are accepted in the community.  The caret package is a great example.  Just Google “r caret” and you can get a glimpse of the many resources, blog posts, papers, and other uses of the package.  Actually, it isn’t an analysis package in itself, it just makes it easier to call existing, proven packages.  Another favorite is the text2vec package, and of course there is the ubiquitous tidyverse.

I love to use R in cases of more general data science problems, because this means a community several orders of magnitude above psychometricians, which definitely contributes to the higher quality.  The caret package is for regression and classification, which are used in just about every field.  The text2vec package is for natural language processing, used in fields as diverse as marketing, political science, and education.  One of my favorite projects I’ve heard about there was the analysis of the Jane Austen corpus.  Fascinating.

When would I use R packages that I might consider less stellar? I don’t mind using R when it is a low stakes situation, such as exploratory data analysis for a client. I would also consider it an acceptable alternative to commercial software when the analysis is something I do very rarely. No way am I going to pay $10,000 or whatever for something I do 2 hours per year. Finally, I would consider it for niche analyses where no other option exists except to write my own code, and it does not make financial sense to do so. However, in these cases, I still try to perform due diligence.

Why do I not use R?

In many cases, it comes down to a single word: quality.

For niche packages, the code might be 100% from a grad student who was a complete newbie on a topic for their thesis, with no background in software development or ancillary topics like writing a user manual.  Wow, does it show.  Moreover, no one has ever validated a single line of the code.  Because of this, I am very wary when using R packages.  If you use one like this, I highly recommend you do some QA or background research on it first!   I would love it if R had a community rating system, like exists for WordPress plugins.  With those, you can see that one plugin might be used on 1,000,000 sites with a 4.5/5.0 rating, while another is used on 43 sites with 2.7/5.0 rating.

This quality thing is, of course, a continuum.  There is a massive gap between the grad student project and something like caret.  In between you might have an R package that is a hobby of a professor, who devotes some time to it and has extremely deep knowledge of the subject matter, but it remains a part-time endeavor by someone with no experience in commercial software.  For examples of this situation, please see this review of an R package, or this comparison of IRT results with R vs professional tools.

The issue on User Manuals is of particular concern to me, as someone that provides commercial software and knows what it is like to support users.  I have seen user manuals in the R world that literally do not tell the users how to use the package.  They might provide a long-winded description of some psychometrics, obviously copied from a dissertation, as a “manual” when at best it only belongs as an appendix.  No info on formatting of input files, no provision of example input, no examples of usage, and no description of interpreting output.

Even in the cases of an extremely popular package that has high quality code, the documentation is virtually unreadable.  Check out the official landing page for tidyverse.  How welcoming is that?  I’ve found that the official documentation is almost guaranteed to be worthless – instead, head over to popular blogs or YouTube channels on your favorite topic.

The output is also famously bad quality.  R stores its output as objects, a sort of mini-database behind the scenes.  If you want to make graphs or dump results to something like a CSV file, you have to write more code just for such basics.  And if you want a nice report in Word or PDF, get ready to write a ton of code, or spend a week doing copy-and-paste.  I noticed that there was a workshop a few weeks ago at NCME (April 2019) that was specifically on how to get useful output reports from R, since this is a known issue.

Is R turning the corner?

I’ve got another post coming about R and how it has really turned the corner because of 3 things: Shiny, RStudio, and availability of quality packages.  More on that in the future, but for now:

  • Shiny allows you to make applications out of R code, so that the power of R can be available to end-users without them having to write & run code themselves.  Until Shiny, R was limited to people who wanted to write & run code.
  • RStudio makes it easier to develop R code, by overlaying an integrated development environment (IDE) on top of R. If you have ever used an IDE, you know how important this is. You’ve got to be incredibly clueless to not use an IDE for development. Yet the first release of RStudio did not happen until 2011. This shows how rooted R was in academia.
  • As you might surmise from my rant above, it is the quality packages (and third-party documentation!) that are really opening the floodgates.

Another, newer direction is that the world of R is hopping on the bandwagon of the API economy.  It might become the lingua franca of the data analytics world from an integration perspective.  This might be the true future of R for psychometrics.

But there are still plenty of issues. One of my pet peeves is the lack of quality error trapping. For example, if you make simple errors, the system will crash with completely worthless error messages. I found this to happen if I run an analysis, open my output file, and then run it again without closing the output file. As previously mentioned, there is also the issue with a single missing data point in a correlation.

Nevertheless, R is still not really consumer facing.  That is, actual users will always be limited to people that have strong coding skills AND deep content knowledge on a certain area of data science or psychometrics.  Just like there will always be a home for more user-friendly statistical software like SPSS, there will always be a home for true psychometric software like Xcalibre.

Psychometric forensics is a surprisingly deep and complex field.  Many of the indices are incredibly sophisticated, but a good high-level and simple analysis to start with is overall time vs. scores, which I call Time-Score Analysis.  This approach uses simple flagging on two easily interpretable metrics (total test time in minutes and number correct raw score) to identify possible pre-knowledge, clickers, and harvester/sleepers.  Consider the four quadrants that a bivariate scatterplot of these variables would produce.

 

| Quadrant | Interpretation | Possible threat? | Suggested flagging |
|---|---|---|---|
| Upper right | High scores and taking their diligent time | Good examinees | NA |
| Upper left | High scores with low time | Pre-knowledge | Top 50% score and bottom 5% time |
| Lower left | Low scores with low time | “Clickers” or other low motivation | Bottom 5% time and score |
| Lower right | Low scores with high time | Harvesters, sleepers, or just very low ability | Top 5% time and bottom 5% scores |
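
The flagging rules in the table above can be sketched in Python as follows. This is an illustration under stated assumptions, not a production routine: the data are hypothetical, the percentile cutoffs use a simple nearest-rank rule, “bottom 5%” is read as at or below the 5th percentile, and “top” is read as strictly above the corresponding percentile.

```python
def nearest_rank(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    ordered = sorted(values)
    k = max(1, -(-pct * len(ordered) // 100))   # integer ceiling
    return ordered[k - 1]


def flag_examinees(records):
    """records: list of (examinee_id, minutes, raw_score). Returns {id: flag}."""
    times = [r[1] for r in records]
    scores = [r[2] for r in records]
    t_low, t_high = nearest_rank(times, 5), nearest_rank(times, 95)
    s_low, s_mid = nearest_rank(scores, 5), nearest_rank(scores, 50)
    flags = {}
    for examinee, minutes, score in records:
        if minutes <= t_low and score <= s_low:
            flags[examinee] = "clicker / low motivation"
        elif minutes <= t_low and score > s_mid:
            flags[examinee] = "possible pre-knowledge"
        elif minutes > t_high and score <= s_low:
            flags[examinee] = "possible harvester or sleeper"
    return flags


# Hypothetical data: four unusual examinees plus 36 typical ones.
records = [("A", 15, 20), ("B", 18, 95), ("C", 60, 25), ("D", 59, 92)]
records += [(f"E{i:02d}", 40 + i % 18, 50 + i) for i in range(36)]

flags = flag_examinees(records)
for examinee, reason in flags.items():
    print(examinee, reason)
```

Examinee D, who is slow but high scoring, is not flagged, which matches the “good examinees” quadrant. A flag here is only a prompt to investigate, not a conclusion.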

An example of Time-Score Analysis

Consider the example data below.  What can this tell us about the performance of the test in general, and about specific examinees?

This test had 100 items, scored classically (number-correct), and a time limit of 60 minutes. Most examinees took 45-55 minutes, so the time limit was appropriate. A few examinees spent 58-59 minutes; there will usually be some diligent students like that. There was a fairly strong relationship of time with score, in that examinees who took longer tended to score higher.

Now, what about the individuals?  I’ve highlighted 5 examples.

  1. This examinee had the shortest time, and one of the lowest scores.  They apparently did not care very much.  They are an example of a low motivation examinee that moved through quickly.  One of my clients calls these “clickers.”
  2. This examinee also took a short time, but had a suspiciously high score.  They definitely are an outlier on the scatterplot, and should perhaps be investigated.
  3. This examinee is simply super-diligent.  They went right up to the 60 minute limit, and achieved one of the highest scores.
  4. This examinee also went right up to the 60 minute limit, but had one of the lowest scores.  They are likely low ability or low motivation.  That same client of mine calls these “sleepers” – a candidate that is forced to take the exam but doesn’t care, so just sits there and dozes.  Alternatively, it might be a harvester: someone who has been assigned to memorize test content, so they spend all the time they can, but only look at half the items so they can focus on memorization.
  5. This examinee had by far the lowest score, and one of the lowest times.  Perhaps they didn’t even answer every question.  Again, there is a motivation/effort issue here, most likely.

How useful is time-score analysis?

Like other aspects of psychometric forensics, this is primarily useful for flagging purposes.  We do not know yet if #4 is a harvester or just low motivation.  Instead of accusing them, we open an investigation.  How many items did they attempt?  Are they a repeat test-taker?  At what location did they take the test?  Do we have proctor notes, site video, remote proctoring video, or other evidence that we can review?  There is a lot that can go into such an investigation.  Moreover, simple analyses such as this are merely the tip of the iceberg when it comes to psychometric forensics.  In fact, there is so much involved that I’ve heard some organizations simply stick their heads in the sand and don’t even bother checking out someone like #4.  It just isn’t in the budget.

However, test security is an essential aspect of validity.  If someone has stolen your test items, the test is now compromised, and you are guaranteed that scores do not mean the same thing they meant when the test was published.  It’s now apples and oranges, even though the items on the test are the same.  You might not challenge individual examinees, but you could instead institute a plan to publish new test forms every 6 months.  Regardless, your organization needs to have some difficult internal discussions and establish a test security plan.

 

Whether you are a newly-launched credentialing program or a mature certification body, it is important to perform frequent “checkups” on your assessments, to ensure that they’re not only accurate, but also legally defensible.  The primary component of this process is a psychometric performance report, which provides important statistics on the test like reliability, and item statistics like difficulty and discrimination.  This work is primarily done by a psychometrician, though particular items flagged for poor performance should be reviewed by Subject Matter Experts (SMEs).  However, checkups should also sometimes include Job Task Analysis studies (JTAs) and Cutscore studies. This is where your SMEs really come in.  The frequency depends on how quickly your field is evolving, but a cycle of 5 years is often recommended. JTAs are sometimes called job analysis, practice analysis, or role delineation studies.

Your SMEs play a pivotal role in getting new assessments off the ground and keeping existing assessments fair and accurate. Whether they keep your program abreast with current innovations and industry standards or help you quantify the knowledge and various skills measured in your assessment, your SMEs work side-by-side with your psychometric experts through the job task analysis and cutscore process to ensure fair and accurate decisions are made.

If your program or assessment is in its infant stages, you will need to perform a Job Task Analysis to kick things off. The JTA is all about surveying on-the-job tasks, creating a list of tasks, and then devising a blueprint of what knowledge, skills, and abilities (KSAs) are required for certification in a given role or field.

The Basics of Job Task Analysis

  • Observe— Typically the psychometrician (that’s us) shadows a representative sample of people who perform the job in question (chosen through Panel Composition) to observe and take notes. After the day(s) of observation, the SMEs sit down with the observer so that he or she may ask any clarifying questions. The goal is to avoid doing this during the observation so that the observer has an untainted view of the job.  Alternatively, your SMEs can observe job incumbents – which is often the case when the SMEs are supervisors.

  • Generate— The SMEs now have a corpus of information on what is involved with the job, and generate a list of tasks that describe the most important job-related components.  Not all job analysis uses tasks, but this is the most common approach in certification testing, hence you will often hear the term job task analysis as a general term.
  • Survey— Now that we have a list of tasks, we send a survey out to a larger group of SMEs and ask them to rate various features of each task. How important is the task? How often is it performed? What larger category of tasks does it fall into?

  • Analyze— Next, we crunch the data and quantitatively evaluate the SMEs’ subjective ratings to determine which of the tasks and categories are most important.

  • Review— As a non-SME, the psychometrician needs to take their findings back to the SME panel to review the recommendation and make sure it makes sense.

  • Report— We put together a comprehensive report that outlines what the most important tasks/categories are for the given job.  This in turn serves as the foundation for a test blueprint, because more important content deserves more weight on the test.  This connection is one of the fundamental links in the validity argument for an assessment.
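
To make the Analyze and Report steps concrete, here is a small Python sketch that turns survey ratings into blueprint weights.  It assumes each SME rates every task on importance and frequency, and that a task's weight is the product of its mean importance and mean frequency.  That product rule is only one possible choice, and the task names, categories, and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def blueprint_weights(ratings, task_category):
    """Convert SME survey ratings into percent-of-test weights per category.

    ratings: {task: [(importance, frequency), ...]}, one tuple per SME.
    task_category: {task: category name}.
    """
    # Assumed criticality index: mean importance times mean frequency.
    criticality = {
        task: mean(imp for imp, _ in rs) * mean(freq for _, freq in rs)
        for task, rs in ratings.items()
    }
    totals = defaultdict(float)
    for task, crit in criticality.items():
        totals[task_category[task]] += crit
    grand_total = sum(totals.values())
    return {cat: 100.0 * t / grand_total for cat, t in totals.items()}


# Hypothetical survey results from two SMEs on a 1-5 scale.
ratings = {
    "Take patient history": [(4, 4), (4, 4)],
    "Administer medication": [(5, 3), (3, 5)],
    "Maintain records": [(2, 4), (2, 4)],
}
task_category = {
    "Take patient history": "Patient care",
    "Administer medication": "Patient care",
    "Maintain records": "Administration",
}
weights = blueprint_weights(ratings, task_category)
```

The resulting percentages are a starting point for the SME panel to review in the Review step, not a final blueprint.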

Cutscore studies after job task analysis

When the JTA is completed, we have to determine who should pass the assessment, and who should fail. This is most often done using the modified Angoff process, where the SMEs conceptualize a minimally competent candidate (MCC) and then set the pass/fail point so that the MCC would just barely pass.  There are other methods too, such as Bookmark or Contrasting Groups.
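
As a rough sketch of the arithmetic behind a modified Angoff study, assume each SME estimates, for every item, the probability that the MCC answers it correctly.  A commonly described way to get the recommended raw cutscore is to average those ratings per item and sum the averages.  The function name and the ratings below are hypothetical, and real studies typically add discussion rounds and review of the ratings that are not shown here.

```python
from statistics import mean


def angoff_cutscore(ratings):
    """Recommended raw cutscore from a rater-by-item matrix of ratings.

    ratings[r][i] is rater r's estimated probability (0 to 1) that a
    minimally competent candidate answers item i correctly.
    """
    n_items = len(ratings[0])
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every rater must rate every item")
    # Average across raters for each item, then sum across items.
    item_means = [mean(row[i] for row in ratings) for i in range(n_items)]
    return sum(item_means)


# Hypothetical panel: three SMEs rating a four-item test.
panel = [
    [0.50, 0.75, 0.25, 1.00],
    [0.50, 0.75, 0.75, 0.50],
    [0.50, 0.75, 0.50, 0.75],
]
cutscore = angoff_cutscore(panel)
```

Here the panel expects the MCC to get about 2.5 of the 4 items correct, so the pass/fail point would be set near that raw score.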

For newly-launching certification programs, these processes go hand-in-hand with item writing and review. The use of evidence-based practices in conducting the job task analysis, designing the test, writing items, and setting a cutscore serves as the basis for a good certification program.  Moreover, if you are seeking to achieve accreditation – a third-party stamp of approval that your credential is high quality – documentation that you completed all these steps is required.

Performing these tasks with a trained psychometrician inherently checks a lot of boxes on the accreditation to-do list, which can position your organization well for the future. When it comes to accreditation, the psychometricians and measurement specialists at Assessment Systems have been around the block a time or two. We can walk you through the lengthy process of becoming accredited, or we can help you perform these tasks a la carte.

One of the most cliche phrases associated with assessment is “teaching to the test.”  I’ve always hated this phrase, because it is only used in a derogatory manner, almost always by people who do not understand the basics of assessment and psychometrics.  I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.

 

Why teaching to the test is usually a good thing

If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum.  Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

In the world of credentialing assessment, this is an extremely important link.  Credential tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills in the job.  This data is then transformed into test blueprints. Instructors for the profession, as well as aspiring students that are studying to pass the test, then focus on what is in the blueprints.  This, of course, still contains the skills that are most important and frequently used in the job!

 

So what is the problem then?

Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then power to choose how they teach, as noted in the PISA article.

As a counterexample, my high school math department made an edict starting my sophomore year that all teachers had to use the “Chicago Method.”  It was pure bunk and based on the idea that students should be doing as much busy work as possible instead of the teachers actually teaching. I think it is because some salesman convinced the department head to make the switch so that they would buy a thousand brand new textbooks.  The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school) but I think we ended up with a bastardization of it, as the edict was primarily:

  1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
  2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.  
  3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.

 

What about all the bad tests out there?

Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints.  There are, most certainly, plenty of cases where an assessment is not designed or built well.  That’s an entirely different problem, and is an entirely valid concern. I have seen a number of these in my career.  This danger is why we have international standards on assessments, like AERA/APA/NCME and NCCA.  These provide guidelines on how a test should be built, sort of like how you need to build a house according to building code and not just throw up some walls and a roof.

For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world.  It’s a matter of caveat emptor. But just because you had some cheap furniture in college that broke right away, doesn’t mean you swear off all furniture.  You stay away from bad furniture.

There’s also the problem of tests being misused, but again that’s not a problem with the test itself.  Certainly, someone making decisions is uninformed. It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application then it’s still not a good situation.  For example, you might take a very well-made exam for high school graduation and start using it for employment decisions with adults. Psychometricians call this validity – that we have evidence to support the intended use of the test and interpretations of scores.  It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.

 

So where do we go from here?

Put it this way: if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining vaguely about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  Of course, we still need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

Item response theory is the predominant psychometric paradigm for mid or large scale assessment.  As noted in my introductory blog post, it is actually a family of models.  In this post, we discuss the two parameter IRT model (2PL).

The 2PL is described by the following equation (simplified from Hambleton & Swaminathan, 1985, Eq. 3.3):
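
In the usual logistic notation, with θ for the examinee trait level and a_i and b_i for the discrimination and location of item i, the 2PL can be written as follows.  Leaving out the scaling constant D is an assumption here about what the simplification involves.

```latex
P_i(\theta) = \frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}
```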

This equation is predicting the probability of a certain response based on the examinee trait/ability level, the item discrimination parameter a, and the item difficulty/location parameter b.  If the examinee trait level is higher than the item location, the person has more than a 50% chance of responding in the keyed direction.
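
As a quick numerical sketch, the function below evaluates the standard logistic form of the 2PL, P = 1 / (1 + exp(-a(θ - b))).  The function name is hypothetical, and omitting the scaling constant D is an assumption.

```python
import math


def prob_2pl(theta, a, b):
    """Probability of responding in the keyed direction under the 2PL.

    theta: examinee trait level; a: item discrimination; b: item location.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# When the trait level equals the item location, the probability is 50%.
p_at_location = prob_2pl(theta=0.0, a=1.2, b=0.0)

# When the trait level is above the item location, it exceeds 50%.
p_above_location = prob_2pl(theta=1.0, a=1.2, b=0.0)
```

A larger a makes the probability change more steeply as θ moves away from b, which is why a is described as the discrimination parameter.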

This phrase “in the keyed direction” is one you might often hear with the 2PL.  This is because it is not often used with education/knowledge/ability assessments, where items usually have a correct answer and guessing is often possible.  The 2PL is used more often in attitudinal or other psychological assessments, where guessing is irrelevant and there is no correct answer.  For example, consider an Extroversion scale, where examinees are responding Yes/No to statements like “I love to go to parties” or “I prefer to read books in my free time.”  There is not much to guess here, and the sense of “correct” is not relevant.  However, it is quite clear that the first statement is keyed in the direction of Extroversion, while the second statement is the reverse.  In fact, you would get the 1 point of response for saying No to that statement rather than Yes.  This is often called reverse-scored.
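
Reverse scoring is simple to express in code.  The sketch below assumes Yes/No responses stored as strings and a per-item flag marking reverse-scored statements; the function names and item labels are hypothetical.

```python
def score_item(response, reverse_scored=False):
    """Return 1 if a Yes/No response is in the keyed direction, else 0."""
    endorsed = response.strip().lower() == "yes"
    # A reverse-scored item earns the point for "No" instead of "Yes".
    return int(endorsed != reverse_scored)


def scale_score(responses, reverse_items):
    """Sum keyed-direction points over a dict of {item_id: response}."""
    return sum(
        score_item(resp, reverse_scored=(item_id in reverse_items))
        for item_id, resp in responses.items()
    )


# Hypothetical two-item Extroversion scale based on the statements above.
responses = {"parties": "Yes", "books": "No"}
extroversion = scale_score(responses, reverse_items={"books"})
```

With these responses the examinee earns a point on both items, because saying No to the book statement is in the direction of Extroversion.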

There are other aspects that go into whether you should use the 2PL model, but this is one of the most important.  In addition, you should also examine model fit indices and take sample size into account.

How do I implement the two parameter IRT model?

Like other IRT models, the 2PL requires specialized software.  Not all statistical packages will do it.  And while you can easily calculate classical statistics in Excel, there is no way to do IRT (well, unless you want to write your own VBA programs to do so).  As mentioned in this article on the three parameter model, there are a number of IRT software programs available but not all are created equal.  You should evaluate cost and functionality.  If you are a fan of R, there are packages to estimate IRT there.  However, I recommend our Xcalibre program for both newbies and professionals.  For newbies, it is much easier to use, which means you spend more time learning the concepts of IRT and not fighting command code that might be 30 years old.  For professionals, Xcalibre saves you from having to create reports by copy and paste; if you think about your hourly rate, that copy-and-paste time is incredibly expensive.