Posts on psychometrics: The Science of Assessment

FastTest has been ASC's flagship platform for the past decade, securely delivering millions of exams, loaded with best practices like computerized adaptive testing and item response theory. FastTest is based on decades of experience in computerized test delivery and item banking, from MICROCAT in the 1980s to FastTest PC/LAN in the 1990s. And now the time has come for FastTest to be replaced by its next-generation successor: Assess.ai.

With Assess.ai, we started by redesigning everything from the ground up, rather than just giving FastTest a facelift. This leads to some differences in capability. Moreover, FastTest has seen more than 10 years of continuous development, so there is a massive amount of powerful functionality that has not yet been built into Assess.ai. So we've provided this guide to help you understand the advancements in Assess.ai and select the right solution for your organization.

Will FastTest be riding into the proverbial sunset? Yes, but not anytime soon. For current users of FastTest, we’ll be working with you to ensure a smooth transition.

Important differences between FastTest and Assess.ai

Aspect | FastTest | Assess.ai
Availability | Cloud or On-Premise | Cloud only
UI/UX | 2010 design with right-click menus | Modern Angular with completely new UX
Item types | 12 | 50+
Automated item generation | No | Yes
Test delivery methods | Linear, LOFT, CAT | Linear (LOFT and CAT in development)
Examinees | Not reusable (must upload for each test) | Reusable (can take more than one test)
Examinee testcode emails | Not customizable | Customizable
Accessibility | Time | Time, zoom, color
Widgets | Calculator, protractor | Protractor, calculator, scientific calculator
Content management | Folders | Orthogonal tags
Delivery languages | English, Spanish, Arabic | English, Arabic, Chinese, French, German, Italian, Russian, Spanish, Tagalog

There are of course many more differences. Want to hear more? Email solutions@assess.com to set up a demo. You might also be interested in this outline.

Online proctoring has been around for over a decade by now. But given the recent outbreak of COVID-19, educational and workforce/certification institutions are scrambling to change their operations, and a huge part of this is an incredible surge in online proctoring. This blog post is intended to provide an overview of the online proctoring industry for someone who is new to the topic or is starting to shop and is just overwhelmed by all the options out there!

Online Proctoring: Two Distinct Markets

First, I would describe the online proctoring industry as actually falling into two distinct markets, so the first step is to determine which of these fits your organization:

  1. Large-scale, lower-cost (at least when delivered at scale), lower-security systems designed to be used only as a plugin to major LMS platforms like Blackboard or Canvas. These online proctoring systems are therefore designed for medium-stakes exams like an Intro to Psychology midterm at a university.
  2. Lower-scale, higher-cost, higher-security systems designed to be used with standalone assessment platforms. These are generally for higher-stakes exams like certification or workforce, or perhaps special use at universities such as admissions and placement exams.

How to tell the difference? The first type will advertise about easy integration with systems like Blackboard or Canvas as a key feature. They will also often focus on AI review of videos, rather than using real humans. Another key consideration is to look at the existing client base, which is often advertised.

Other ways that Online Proctoring systems can differ

AI vs humans: Some systems rely purely on artificial intelligence algorithms to flag video recordings of examinees. Other systems utilize real humans.

Record & Review vs. Real-Time Humans: If live humans are used, there are two approaches. First, it can be live and real-time, meaning that there is a human on the other end of the video who can confirm identity before allowing the test to start, and stop the test if there is obviously illicit activity. Record & Review instead records the session (audio and video) and a human checks it within 24-48 hours. This is more scalable, but you can’t stop the test if someone is stealing the content – you probably won’t know until the next day.

Screencapture: Some online proctoring providers have an option to record/stream the screen as well as the webcam. Some also provide the option to only do this (no webcam) for lower stakes exams.

Mobile phone as third camera: Some newer platforms provide the option to easily integrate the examinee’s mobile phone as a third camera, which effectively gives the remote proctor the view that an in-person proctor would have. Examinees will be instructed to use the video to show under the table, behind the monitor, etc., before starting the exam. They then might be instructed to stand the phone up 2 meters away with a clear view of the entire room while the test is being delivered.

Using your own proctors: Some online proctoring systems allow you to utilize your own staff as proctors, which is especially useful if the test is delivered in a small time window. If continuously delivered 24×7 all year, you probably want to use the vendor’s highly trained staff.

API integrations: Some systems require software developers to set up an API integration with your LMS or assessment platform. Others are more flexible, and you can just log in yourself, upload a list of examinees, and you are all set.

On-Demand vs. Scheduled: Some platforms involve the examinee scheduling a time slot. Others are purely on-demand, and the examinee can show up whenever they are ready. MonitorEDU is a prime example of this: examinees show up at any time, present their ID to a live human, and are then started on the test immediately – no downloads/installs, no system checks, no API integrations, nothing.

More security: A better test delivery system

A good testing delivery platform will also come with its own functionality to enhance test security: randomization, automated item generation, computerized adaptive testing, linear-on-the-fly testing, professional item banking, item response theory scoring, scaled scoring, psychometric analytics, equating, lockdown delivery, and more. In the context of online proctoring, perhaps the most salient is the lockdown delivery. In this case, the test will completely take over the examinee’s computer and they can’t use it for anything else until the test is done.

LMS systems rarely include any of this functionality, because it is not needed for an Intro to Psychology midterm. However, most assessments in the world that have real stakes – university admissions, certifications, workforce hiring, etc. – depend heavily on such functionality. It’s not just out of habit or tradition, either. Such methods are considered essential by international standards including AERA/APA/NCME, ITC, and NCCA.

ASC’s online proctoring Partners

ASC partners with some of the leaders in the space to give an out-of-the-box solution to our clients. These include: MonitorEDU, ProctorExam, Examity, and Proctor360. Learn more at our webpage regarding that functionality and another that explains the concept of scalable test security.

An item distractor, also known as a foil or a trap, is an incorrect option for a selected-response item on an assessment.

What makes a good item distractor?

One word: plausibility.  We need the item distractor to attract examinees.  If it is so irrelevant that no one considers it, then it does not do any good to include it in the item.  Consider the following item.

What is the capital of the United States of America?

A. Los Angeles

B. New York

C. Washington, D.C.

D. Mexico City

The last option is quite implausible – not only is it outside the USA, but it mentions another country in the name, so no student is likely to select this.  This then becomes a three-horse race, and students have a 1 in 3 chance of guessing.  This certainly makes the item easier.

In addition, the distractor needs to have negative discrimination.  That is, while we want the correct answer to attract the more capable examinees, we want the distractors to attract the lower-ability examinees.  If you have a distractor that you thought was incorrect, and it turns out to attract all the top students, you need to take a long, hard look at that question! To calculate discrimination statistics on distractors, you will need software such as Iteman.
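If you want a rough sense of how such statistics are computed, here is a minimal Python sketch. The function and its name are hypothetical, for illustration only; dedicated software such as Iteman reports these statistics directly, with additional detail. For each option, it computes the proportion of examinees selecting it and the point-biserial correlation between selecting that option and total score.

```python
import numpy as np

def distractor_stats(choices, total_scores, options=("A", "B", "C", "D")):
    """Proportion choosing each option and its point-biserial with total score.

    choices: the option each examinee selected on one item
    total_scores: each examinee's total test score
    """
    choices = np.asarray(choices)
    total_scores = np.asarray(total_scores, dtype=float)
    stats = {}
    for opt in options:
        selected = (choices == opt).astype(float)
        p = selected.mean()
        # Point-biserial: correlation between selecting this option and total score.
        # The keyed answer should be positive; a healthy distractor should be negative.
        rpb = np.corrcoef(selected, total_scores)[0, 1] if 0 < p < 1 else float("nan")
        stats[opt] = {"proportion": p, "point_biserial": rpb}
    return stats
```

A distractor that comes back with a positive point-biserial is exactly the situation described above, and deserves that long, hard look.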

What makes a bad item distractor?

Obviously, implausibility and positive discrimination (a distractor that attracts the top examinees rather than the bottom) are frequent offenders.  But if you think more deeply about plausibility, the key is actually plausibility without being arguably correct.  This can be a fine line to walk, and is a common source of problems for items.  You might have a medical item that presents a scenario and asks for a likely diagnosis; perhaps one of the distractors is very unlikely so as to be essentially implausible, but it might actually be possible for a small subset of patients under certain conditions.  If the author and item reviewers did not catch this, the examinees probably will, and this will be evident in the statistics.  This is one of the reasons it is important to do psychometric analysis of test results; in fact, accreditation standards often require you to go through this process at least once a year.

What is item review?  It is the process of performing quality control on items before they are ever delivered to examinees.  This is an absolutely essential step in the development of items for medium and high stakes exams; while a teacher might not have other teachers review questions on a 4th grade math quiz, items that are part of an admissions exam or professional certification exam will go through multiple layers of independent item review before a single examinee sees them.  This blog post will discuss some important aspects of the item review process.

Why item review?

Assessment items are, when you look at it from a business perspective, a work product.  They are component parts of a larger machine, the test or assessment; in some cases interchangeable, in other cases very intentional and specific.  It is obviously common practice to perform quality assurance on work products, and the item review process simply applies this concept to test questions.

Who does the item review?

This can differ greatly based on the type of assessment and the stakes involved.  In a medium stakes situation, it might be just one other reviewer.  A professional certificate exam might have all items reviewed by one content expert other than the person who wrote the item, and this could be considered sufficient.  In higher stakes exams that are developed by large organizations, the item might go through two content reviewers, a psychometric reviewer, a bias reviewer, and an editor.  Additionally, it then might go through additional stages for formatting.  You can see how this can then become a very big deal, with dozens of people and hundreds of items floating around.

What do the reviewers check?

It depends on who the reviewer is, but there are often checklists that the organization provides.  A content reviewer might check that the stem is clear, the key is fully correct, the distractors fully incorrect, and all answers of reasonably equivalent length.  The psychometric reviewer might check for aspects that inadvertently tip off the correct answer.  The bias reviewer might look for a specific set of situations that potentially disadvantage some subgroup of the populations.  An editor might look for correct usage of punctuation, such as the fact that the stem should never end in a colon.

For example, during my graduate school years I used to write items that were eventually used in the US State of Alaska for K-12 assessments.  The reviewers not only looked for straightforward issues like answer correctness, but for potential bias in the case of Alaskans.  As item writers, we were warned to be careful about mentioning any objects that we take for granted in the Lower 48: roads, shopping malls, indoor plumbing, and farms are examples that come to mind. Checking this was a stage of item review.

How do we manage the work?

Best practice to manage the process is to implement stages.  An organization might decide that all items go to the reviewers listed previously, and in the order that I described them.  Each one must complete their review checklist before the item can be moved onto the next stage.  This might seem like a coldhearted assembly line, given that there certainly is an art to writing good items, but assembly lines unarguably lead to greater quality and increased productivity.

Is there software that makes the item review process easier?

Yes.  You have likely used some form of work process management software in your own job, such as Trello, JIRA, or GitHub. These are typically based on the concept of swimlanes, which as a whole is often referred to as a Kanban board.  Back in the day, Kanban boards were actual boards with post-its on them, as you might have seen on shows like Silicon Valley.  This presents the aforementioned stages as columns in a user interface, and tasks (items) are moved through the stages.  Once Content Reviewer 1 is done with their work and leaves comments on the item, the software provides a way for them to change the stage to Content Review 2 and assign someone as Content Reviewer 2.

Below is an example of this from ASC’s online assessment platform, Assess.ai.  Because Assess.ai is designed for organizations that are driven by best practices and advanced psychometrics, there is an entire portion of the system dedicated to management of item review via the swimlanes interface.

[Image: Kanban board for item review in Assess.ai]

To implement this process, an administrator at the organization defines the stages that they want all items to receive, and Assess.ai will present these as columns in the swimlane interface.  Administrators can then track and manage the workflow visually.  The reviewers themselves don’t need access to everything, but instead are instructed to click on the items they are supposed to review, and they will be presented an interface like the one below.

[Image: item review interface in Assess.ai]

Can I implement Kanban item review at my organization?

Absolutely!  Assess.ai is available as a free version (sign up here), with a limit of 500 items and 1 user.  While this means that the free version won’t let you manage dozens of users, you can still implement some aspects of the process to improve item quality in your organization.  Once you are ready to expand, you can simply upgrade your account and add the users. 

Want to learn more?  Drop us an email at solutions@assess.com.

Technology enhanced items are assessment items (questions) that utilize technology to improve the interaction of the item, over and above what is possible with paper.  Technology enhanced items can improve examinee engagement (important with K12 assessment), assess complex concepts with higher fidelity, improve precision/reliability, and enhance face validity/sellability.  To some extent, the last word is the key one; tech enhanced items simply look sexier and therefore make an assessment platform easier to sell, even if they don’t actually improve assessment.  I’d argue that there are also technology enabled items, which are distinct, as discussed below.

What is the goal of technology enhanced items?

The goal is to improve assessment, by increasing things like reliability/precision, validity, and fidelity. However, there are a number of TEIs that are actually designed more for sales purposes than psychometric purposes. So, how to know if TEIs improve assessment?  That, of course, is an empirical question that is best answered with an experiment.  But let me suggest one metric to address this question: how far does the item go beyond just reformulating a traditional item format to use current user-interface technology?  I would define the reformulating of traditional format to be a fake TEI, while going beyond would define a true TEI.  An alternative nomenclature might be to call the reformulations technology enhanced items and the true tech usage to be technology enabled items (Almond et al, 2010; Bryant, 2017), as they would not be possible without technology.

A great example of this is the relationship between a traditional multiple response item and certain types of drag and drop items.  There are a number of different ways that drag and drop items can be created, but for now, let’s use the example of a format that asks the examinee to drag text statements into a box.  An example of this is K12 assessment items from PARCC that ask the student to read a passage, then present a list of statements about the story, asking the student to drag all true statements into a box.  Take this drag-and-drop tech-enhanced item, for example.

[Image: drag-and-drop item with statements about Brian's Winter]

Now, consider the following item, often called multiple response.

[Image: the same item presented as a multiple response item]

You can see how this item is the exact same in terms of psychometric interaction: the student is presented a list of statements, and selects those they think are true.  The item is scored with integers from 0 to K, where K is the number of correct statements; the integers are often then used to implement the generalized partial credit model for final scoring.  This would be true regardless of whether the item was presented as multiple response vs. drag and drop. The multiple response item, of course, could just as easily be delivered via paper and pencil. Converting it to drag and drop enhances the item with technology, but the interaction of the student with the item, psychometrically, remains the same.
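To make that scoring concrete, here is a minimal Python sketch of simple count-of-correct-selections scoring. The function name is hypothetical, and the note about penalties is an assumption rather than a universal rule; some programs score these items differently.

```python
def score_multiple_response(selected, keyed_correct):
    """Polytomous scoring for a multiple-response (or drag-and-drop) item.

    Returns an integer from 0 to K, where K is the number of keyed-correct
    statements. This simple count ignores incorrect selections; some programs
    subtract a penalty for those instead.
    """
    return len(set(selected) & set(keyed_correct))

# Example: three statements are keyed correct; the student selected two of them.
print(score_multiple_response({"S1", "S4"}, {"S1", "S3", "S4"}))  # -> 2
```

Whether the statements are checked with checkboxes or dragged into a box, the score that feeds the generalized partial credit model is the same integer.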

Some True TEIs, or Technology Enabled Items

Of course, the past decade or so has witnessed stronger innovation in item formats. Gamified assessments change how the interaction of person and item is approached, though this is arguably not as relevant for high stakes assessment due to concerns of validity. There are also simulation items. For example, a test for a construction crane operator might provide an interface with crane controls and ask the examinee to complete a task. Even at the K-12 level there can be such items, such as the simulation of a science experiment where the student is given various test tubes or other instruments on the screen.

Both of these approaches are extremely powerful but have a major disadvantage: cost. They are typically custom-designed. In the case of the crane operator exam or even the science experiment, you would need to hire software developers to create this simulation. There are now some simulation-development ecosystems that make this process more efficient, but the items still involve custom authoring and custom scoring algorithms.

To address this shortcoming, there is a new generation of self-authored item types that are true TEIs. By “self-authored” I mean that a science teacher would be able to create these items themselves, just like they would a multiple choice item. The amount of technology leveraged is somewhere between a multiple choice item and a custom-designed simulation, providing a compromise of reduced cost but still increasing the engagement for the examinee. An example of this is shown below from ASC’s Assess.ai assessment platform. A major advantage of this approach is that the items do not need custom scoring algorithms, and instead are typically scored via point integers, which enables the use of polytomous item response theory.

[Image: a self-authored tech-enhanced (graphing) item in Assess.ai]

Are we at least moving forward?  Not always!

There is always pushback against technology, and on this topic the counterexample is the gridded item type.  It goes in reverse of innovation: rather than taking a traditional format and reformulating it for current UI, it ignores the capabilities of current UI (actually, UI of the past 20+ years) and is therefore a step backward. With that item type, students are presented a bubble sheet from a 1960s-style paper exam, on a computer screen, and asked to fill in the bubbles by clicking on them rather than using a pencil on paper.

Another example is the EBSR item type from the artist formerly known as PARCC. It was a new item type that intended to assess deeper understanding, but it did not use any tech-enhancement or -enablement, instead asking two traditional questions in a linked manner. As any psychometrician can tell you, this approach ignored basic assumptions of psychometrics, so you can guess the quality of measurement that it put out.

How can I implement TEIs?

It takes very little software development expertise to develop a platform that supports multiple choice items. An item like the graphing one above, though, takes substantial investment. So there are relatively few platforms that can support these, especially with best practices like workflow item review or item response theory. You can try authoring them for free in our Assess.ai assessment platform, or if you have more questions, contact solutions@assess.com.

Automated item generation (AIG) is a paradigm for developing assessment items, aka test questions, utilizing principles of artificial intelligence and automation. As the name suggests, it tries to automate some or all of the effort involved with item authoring, as that is one of the most time-intensive aspects of assessment development – which is no news to anyone who has authored test questions! Items can cost up to $2000 to develop, so even cutting the average cost in half could provide massive time/money savings to an organization.

There are two types of automated item generation:

Type 1: Item Templates (Current Technology)

The first type is based on the concept of item templates to create a family of items using dynamic, insertable variables. There are three stages to this work. For more detail, read this article by Gierl, Lai, and Turner (2012).

  • Authors, or a team, create a cognitive model by isolating exactly what they are trying to assess and the different ways that the knowledge could be presented or evidenced. This might include information such as which variables are important vs. incidental, and what a correct answer should include.
  • They then develop templates for items based on this model, like the example you see below.
  • An algorithm then turns this template into a family of related items, often by producing all possible permutations (see the sketch after this list).
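Here is a minimal Python sketch of that permutation step, using an entirely hypothetical math template and variable lists; real AIG tools handle this through an authoring interface rather than code.

```python
from itertools import product

# Hypothetical template: {distance} and {hours} are the dynamic, insertable
# variables identified in the cognitive model.
template = "A train travels {distance} km in {hours} hours. What is its average speed in km/h?"
variables = {
    "distance": [120, 150, 180],
    "hours": [2, 3],
}

def generate_items(template, variables):
    """Expand the template into its family of items, one per permutation."""
    names = list(variables)
    items = []
    for combo in product(*(variables[n] for n in names)):
        values = dict(zip(names, combo))
        items.append({
            "stem": template.format(**values),
            "key": values["distance"] / values["hours"],  # answer defined by the model
        })
    return items

for item in generate_items(template, variables):
    print(item["stem"], "->", item["key"])
```

This tiny example produces six sibling items; the review recommendation described below is why a human still decides which of those siblings are worth keeping.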

Obviously, you can’t use more than one of these on a given test form. And in some cases, some of the permutations will be an unlikely scenario or possibly completely irrelevant. But the savings can still be quite real. I saw a conference presentation by Andre de Champlain from the Medical Council of Canada, stating that overall efficiency improved by 6x and the generated items were higher quality than traditionally written items because the process made the authors think more deeply about what they were assessing and how. He also recommended that template permutations not be automatically moved to the item bank but instead that each be reviewed by SMEs, for reasons such as those stated above.

You might think “Hey, that’s not really AI…” – but AI is simply doing things that have historically been done by humans, and the definition gets pushed further every year. Remember, AI used to be just having the Atari be able to play Pong with you!

Type 2: AI Processing of Source Text (Future Technology)

The second type is what the phrase “automated item generation” more likely brings to mind: upload a textbook or similar source to some software, and it spits back drafts of test questions. For example, see this article by von Davier (2019). This technology is still cutting edge and working through issues. For example, how do you automatically come up with quality, plausible distractors for a multiple choice item? This might be automated in some cases like mathematics, but in most cases the knowledge of plausibility lies with subject matter expertise. Moreover, this approach is certainly not accessible for the typical educator. It is currently in use, but by massive organizations that spend millions of dollars.

How Can I Implement Automated Item Generation?

AIG has been used by the large testing companies for years, but is no longer limited to their domain. It is now available off the shelf as part of ASC’s nextgen assessment platform, Assess.ai. Best of all, that component is available at the free subscription level; all you need to do is register with a valid email address.

Assess.ai provides a clean, intuitive interface to implement Type 1 AIG, in a way that is accessible to all organizations. Develop your item templates, insert dynamic fields, and then export the results for review and implementation in an item banking system, which is also available for free in Assess.ai.

If you have worked in the field of assessment and psychometrics, you have undoubtedly encountered the word “standard.” While a relatively simple word, it has the potential to be confusing because it is used in three (and more!) completely different but very important ways. Here’s a brief discussion.

Standard = Cutscore

As noted by the well-known professor Gregory Cizek here, “standard setting refers to the process of establishing one or more cut scores on a test.” The various methods of setting a cutscore, like Angoff or Bookmark, are referred to as standard setting studies. In this context, the standard is the bar that separates a Pass from a Fail. We use methods like the ones mentioned to determine this bar in as scientific and defensible fashion as possible, and give it more concrete meaning than an arbitrarily selected round number like 70%. Selecting a round number like that will likely get you sued since there is no criterion-referenced interpretation.

Standard = Blueprint

If you work in the field of education, you often hear the term “educational standards.” These refer to the curriculum blueprints for an educational system, which also translate into assessment blueprints, because you want to assess what is on the curriculum. Several important ones in the USA are noted here, perhaps the most common of which nowadays is the Common Core State Standards, which attempted to standardize the standards across states. These standards exist to standardize the educational system, by teaching what a group of experts have agreed upon should be taught in 6th grade Math classes for example. Note that they don’t state how or when a topic should be taught, merely that 6th Grade Math should cover Number Lines, Measurement Scales, Variables, whatever – sometime in the year.

Standard = Guideline

If you work in the field of professional certification, you hear the term just as often but in a different context: accreditation standards. The two most common sources are the National Commission for Certifying Agencies (NCCA) and the ANSI National Accreditation Board (ANAB). These organizations give a stamp of approval to credentialing bodies, stating that a Certification or Certificate program is legit. Why? Because there is no law to stop me from buying a textbook on any topic, writing 50 test questions in my basement, and selling it as a Certification. It is completely a situation of caveat emptor, and these organizations are helping the buyers by giving a stamp of approval that the certification was developed with accepted practices like a Job Analysis, Standard Setting Study, etc.

In addition, there are the professional standards for our field. These are guidelines on assessment in general rather than just credentialing. Two great examples are the AERA/APA/NCME Standards for Educational and Psychological Testing and the International Test Commission’s Guidelines (yes, they switch to that term) on various topics.

Also: Standardized = Equivalent Conditions

The word is also used quite frequently in the context of standardized testing, though it is rarely chopped to the root word “standard.” In this case, it refers to the fact that the test is given under equivalent conditions to provide greater fairness and validity. A standardized test does NOT mean multiple choice, bubble sheets, or any of the other pop connotations that are carried with it. It just means that we are standardizing the assessment and the administration process. Think of it as a scientific experiment; the basic premise of the scientific method is holding all variables constant except the variable in question, which in this case is the student’s ability. So we ensure that all students receive a psychometrically equivalent exam, with equivalent (as much as possible) writing utensils, scrap paper, computer, time limit, and all other practical surroundings. The problem comes with the lack of equivalence in access to study materials, prep coaching, education, and many bigger questions… but those are a societal issue and not a psychometric one.

So despite all the bashing that the term gets, a standardized test is MUCH better than the alternatives of no assessment at all, or an assessment that is not a level playing field and has low reliability. Consider the case of hiring employees: if assessments were not used to provide objective information on applicant skills and we could only use interviews (which are famously subjective and inaccurate), all hiring would be virtually random and the amount of incompetent people in jobs would increase a hundredfold. And don’t we already have enough people in jobs where they don’t belong?

The generalized partial credit model (GPCM; Muraki, 1992) is one of the family of models from item response theory.  It is designed to work, as you might have guessed, with items that are scored with partial credit.  That is, instead of just right/wrong as possible scoring, an examinee can receive partial points for completing some aspects of the item correctly.  For example, a typical multiple choice item is scored as 0 points for incorrect and 1 point for correct.  A GPCM item might consist of 3 aspects and be 0 points for incorrect, 3 points for fully correct, and 1 or 2 points if the examinee only completes 1 or 2 of the aspects but not all three.

Examples of GPCM items

GPCM items therefore contain multiple point levels, starting at 0.  There are several examples that are common in the world of educational assessment.

The first example, which nearly everyone is familiar with, is essay rubrics.  A student might be instructed to write an essay on why extracurriculars are important in school, with at least 3 supporting points.  Such an essay might be scored on the number of points presented (0,1,2,3) as well as on grammar (0 = 10 or more errors, 1 = 3-9 errors, and 2 = 2 errors or fewer). Here’s a shorter example.

Another example is multiple response items.  For example, a student might be presented a list of 5 animals and asked to identify which are mammals.  There are 2 correct answers, so the possible points are 0, 1, and 2. Note that this also includes their tech-enhanced equivalents such as drag and drop; such items might be reconfigured to dragging the animal names into boxes, but that’s just window dressing to make the item look sexier.

The National Assessment of Educational Progress and many other K-12 assessments utilize the GPCM since they so often use item types like this.

Why use the generalized partial credit model?

Well, the first part of the answer is a more general question: why use polytomous items?  These items are generally regarded to be higher-fidelity and to assess deeper thinking than multiple choice items. They also provide much more information than multiple choice items in an IRT paradigm.

The second part of the answer is the specific question: if we have polytomous items, why use the GPCM rather than other models?  There are two parts to that answer, which refer to the name generalized partial credit model.  First, partial credit models are appropriate for items where the scoring starts at 0, and different polytomous items could have very different performance.  In contrast, Likert-style items are also polytomous (almost always) but start at 1 and utilize the same psychological response process on every item.  For example, a survey where statements are presented and examinees are to “Rate each on a scale of 1 to 5.”  Second, the “generalized” part of the name means that it includes a discrimination parameter for evaluating the measurement quality of an item.  This is similar to using the 2PL or 3PL for dichotomous items rather than using the Rasch model and assuming items are of equal discrimination.  There is also a Rasch partial credit model, the analogous model without the discrimination parameter, which can be used alongside Rasch dichotomous items, but this post is just focusing on the GPCM.

Definition of the Generalized Partial Credit Model

The generalized partial credit model is defined by the equation below (Embretson & Reise, 2000).
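Using the notation defined below, a standard way of writing the model (the probability that an examinee with ability θ earns score x on item i) is:

$$
P_{ix}(\theta) = \frac{\exp\left[\sum_{j=0}^{x} a_i\,(\theta - g_{ij})\right]}{\sum_{r=0}^{m-1} \exp\left[\sum_{j=0}^{r} a_i\,(\theta - g_{ij})\right]}, \qquad g_{i0} \equiv 0.
$$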

In this equation:

  • m = number of possible points
  • x = the student's score on the item
  • i = index for the item
  • θ = student ability
  • a_i = discrimination parameter for item i
  • g_ij = the boundary parameter for step j on item i; there are always m-1 boundaries
  • r = an index used to manage the summation

What do these mean?  The a parameter is the same concept as the a parameter in dichotomous IRT, where 0.5 might be low and 1.2 might be high.  The boundary parameters define the steps or thresholds that explain how the GPCM works, which will become clearer when you see the graph below.

As an example, let us consider a 4 point item with the following parameters.

If you utilize those numbers to graph the functions for each point level as a function of theta, you would see a graph like the one below.  Here, consider Option 1 to be the probability of getting 0 points; this is very high probability for the lowest examinees, but drops as ability increases.  Conversely, the Option 5 line is for receiving all possible points; high probability for the best examinees, but probability decreases as ability does.  In between, we have probability curves for 1, 2, and 3 points.  If an examinee has a theta of -0.5, they have a high probability of getting 2 points on the item (yellow curve).

The boundary parameters mentioned earlier have a very real interpretation with this graph; they are literally the boundaries between the curves.  That is, the theta level at which 1 point (purple) becomes more likely than 0 points (red) is -2.4, as you can see where the two lines cross.  Note that this is the first boundary parameter, g_i1 in the notation above.
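If you want to reproduce curves like these, here is a minimal Python sketch of the category probability calculation. The discrimination value and the last three boundaries below are assumed purely for illustration; only the first boundary (-2.4) comes from the example above.

```python
import numpy as np

def gpcm_probabilities(theta, a, boundaries):
    """Category probabilities for one GPCM item at a given ability level.

    boundaries: the step/boundary parameters g_i1 ... g_i,m-1
    Returns probabilities for scores 0, 1, ..., m-1.
    """
    steps = np.concatenate(([0.0], np.asarray(boundaries, dtype=float)))  # g_i0 = 0
    numerators = np.exp(np.cumsum(a * (theta - steps)))
    return numerators / numerators.sum()

# Illustrative parameters for a 4-point (5-category) item. The first boundary
# matches the -2.4 mentioned above; the rest are assumed for demonstration.
a = 1.0
boundaries = [-2.4, -1.1, 0.3, 1.6]

for theta in (-2.0, -0.5, 1.0, 2.5):
    print(theta, np.round(gpcm_probabilities(theta, a, boundaries), 3))
```

With these assumed values, a theta of -0.5 gives the 2-point category the highest probability, consistent with the description of the graph above.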

How to use the GPCM

As mentioned before, the GPCM is appropriate to use as your IRT model for multi-point items in an educational context, as opposed to Likert-style psychological items.  They are almost always used in conjunction with the 2PL or 3PL dichotomous models; consider a test of 25 multiple choice items, 3 multiple response items, and an essay with 2 rubrics.

To implement, you need an IRT software program that can estimate dichotomous and polytomous items jointly, such as Xcalibre.  The screenshot below shows how to specify these models.

If you implement IRT with Xcalibre, it produces a page like this for each GPCM item.

To score students with the GPCM, you either need to use an IRT program like Xcalibre to score students, or a test delivery system that has been specifically designed to support the GPCM in the item banker and implement it in scoring routines.  The former only works when you are doing the IRT analysis after all examinees have completed a test; if you have continuous deployment of assessments, you will need to use the latter approach.

Where can I learn more?

IRT textbooks will provide a treatment of polytomous models like the generalized partial credit model. Examples are de Ayala (2010) and Embretson & Reise (2000). In addition, I recommend the 2010 book by Nering and Ostini, which was previously available as a monograph.

I was recently asked about scaled scoring and if the transformation must be based on the normal curve. This is an important question, especially since most human traits fall in a fairly normal distribution, and item response theory places items and people on that latent scale. The short answer is “no” – there are other options for the scaled scoring transformation, and your situation can help you select the right method.

First of all: if you are new to the concept of scaled scoring, start out by reading this blog post. In short: it is a way of converting scores on a test to another scale for reporting to examinees, to hide certain important aspects such as differences in test form difficulty.

There are 4 types of scaled scoring, in general:

  1. Normal/standardized
  2. Linear
  3. Linear dogleg
  4. Equipercentile

Normal/standardized

This is an approach to scaled scoring that many of us are familiar with due to some famous applications, including the T score, IQ, and large-scale assessments like the SAT. It starts by finding the mean and standard deviation of raw scores on a test, then converts whatever that is to another mean and standard deviation. If this seems fairly arbitrary and doesn’t change the meaning… you are totally right!

Let’s start by assuming we have a test of 50 items, and our data has a raw score average of 35 points with an SD of 5. The T score transformation – which has been around so long that a quick Googling can’t find me the actual citation – says to convert this to a mean of 50 with SD of 10. So, 35 raw points becomes a scaled score of 50. A raw score of 45 (2 SDs above mean) becomes a T of 70. We could also place this on the IQ scale (mean=100, SD=15) or the classic SAT scale (mean=500, SD=100).
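Here is the same arithmetic as a small Python sketch; the function name is just for illustration.

```python
def standardized_scaled_score(raw, raw_mean, raw_sd, new_mean, new_sd):
    """Linear z-score conversion used for T scores, IQ-style scales, etc."""
    z = (raw - raw_mean) / raw_sd
    return new_mean + z * new_sd

# The 50-item example from the text: raw mean of 35, SD of 5.
print(standardized_scaled_score(45, 35, 5, 50, 10))    # T score           -> 70.0
print(standardized_scaled_score(45, 35, 5, 100, 15))   # IQ-style scale    -> 130.0
print(standardized_scaled_score(45, 35, 5, 500, 100))  # classic SAT scale -> 700.0
```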

A side note about the boundaries of these scales… one of the first things you learn in any stats class is that plus/minus 3 SDs contains about 99.7% of the population, so many scaled scores adopt these as convenient boundaries. This is why the classic SAT scale went from 200 to 800, with the urban legend that “you get 200 points for putting your name on the paper.” Similarly, the ACT goes from 0 to 36 because it nominally had a mean=18 and SD=6.

The normal/standardized approach can be used with classical number-correct scoring, but makes more sense if you are using item response theory, because all scores default to a standardized metric.

Linear

The linear approach is quite simple. It employs the y=mx+b that we all learned as schoolkids. With the previous example of a 50 item test, we might say intercept=200 and slope=4. This then means that scores range from 200 to 400 on the test.

Yes, I know… the Normal conversion above is technically linear also, but deserves its own definition.

Linear dogleg

The Linear Dogleg approach is a special case of the previous one, where you need to stretch the scale to reach two endpoints. Let’s suppose we published a new form of the test, and a classical equating method like Tucker or Levine says that it is 2 points easier and the slope of Form A to Form B is 3.8 rather than 4. This throws off our clean conversion to the 200-to-400 scale. So suppose we use the equation SCALED = 200 + 3.8*RAW, but only up until the score of 30. From 31 onwards, we use SCALED = 185 + 4.3*RAW. Note that a raw score of 50 then still comes out to a scaled score of 400, so we still go from 200 to 400, but there is now a slight bend in the line. This is called a “dogleg,” similar to the golf hole of the same name.
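Here is that piecewise conversion as a small sketch of the arithmetic above (not production equating code):

```python
def dogleg_scaled_score(raw):
    """Piecewise linear ('dogleg') conversion from the Form B example above."""
    if raw <= 30:
        return 200 + 3.8 * raw
    return 185 + 4.3 * raw

# The two segments meet at a raw score of 30, and a perfect raw score of 50
# still maps to the top of the 200-400 reporting scale.
print(dogleg_scaled_score(30))  # -> 314.0
print(dogleg_scaled_score(50))  # -> 400.0
```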

Equipercentile

Lastly, there is Equipercentile, which is mostly used for equating forms but can similarly be used for scaling.  In this conversion, we match the percentile for each score, even if it is a very nonlinear transformation.  For example, suppose our Form A had a 90th percentile of 46, which became a scaled score of 384.  We find that Form B has a 90th percentile at 44 points, so we call that a scaled score of 384, and calculate a similar conversion for all other points.
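A bare-bones sketch of the idea, assuming you simply match percentile ranks on observed raw-score distributions (operational equipercentile methods add smoothing and handle ties and the tails of the distribution more carefully):

```python
import numpy as np

def equipercentile_equivalent(raw_b, form_b_scores, form_a_scores):
    """Find the Form A raw score at the same percentile rank as raw_b on Form B."""
    pct = np.mean(np.asarray(form_b_scores) <= raw_b) * 100      # percentile rank on Form B
    return float(np.percentile(np.asarray(form_a_scores), pct))  # matching Form A score
```

The matched Form A score then receives whatever scaled score Form A already maps to; in the example above, 44 on Form B sits at the same percentile as 46 on Form A, so both report as 384.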

Why are we doing this again?

Well, you can kind of see it in the example of having two forms with a difference in difficulty. In the Equipercentile example, suppose there is a cutscore to be in the top 10% to win a scholarship… if you get 45 on Form A you will lose, but if you get 45 on Form B you will win. Test sponsors don’t want to have this conversation with angry examinees, so they convert all scores to an arbitrary scale. The 90th percentile is always a 384, no matter how hard the test is. (Yes, that simple example assumes the populations are the same… there’s an entire portion of psychometric research dedicated to performing stronger equating.)

A Standard Setting Study is a formal process for establishing a performance standard. In the assessment world, there are actually two uses of the word standard – the other one refers to a formal definition of the content that is being tested, such as the Common Core State Standards in the USA. For this reason, I prefer the term cutscore study.

After item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification.  This cannot be done arbitrarily (e.g., setting it at 70% because that’s what you saw when you were in school).  To be legally defensible and eligible for accreditation, it must be done using one of several standard setting approaches from the psychometric literature.  The choice of method depends upon the nature of the test, the availability of pilot data, and the availability of subject matter experts.

Some types of Cutscore Studies:

  • Angoff – In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly (a minimal computation sketch follows this list).  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
  • Bookmark – The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be.  Obviously, this requires enough real data to calibrate item difficulty, usually using item response theory, which requires several hundred examinees.
  • Contrasting Groups – Candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other unrelated standard.  If using data from another exam, a sample of at least 50 candidates is obviously needed.
  • Borderline Group – Similar to Contrasting Groups, but a borderline group is defined using alternative information such as biodata, and the scores of the group are evaluated.
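To make the Angoff computation concrete, here is a minimal Python sketch. The ratings matrix is entirely hypothetical, and real studies typically add rater training, discussion rounds, and impact data before finalizing the cutscore.

```python
import numpy as np

# Hypothetical Angoff ratings: one row per SME, one column per item. Each value
# is the judged probability that a minimally competent candidate answers correctly.
ratings = np.array([
    [0.70, 0.55, 0.80, 0.60],
    [0.65, 0.60, 0.75, 0.55],
    [0.75, 0.50, 0.85, 0.65],
])

item_means = ratings.mean(axis=0)        # consensus expectation for each item
raw_cutscore = item_means.sum()          # expected raw score of a borderline candidate
percent_cutscore = 100 * raw_cutscore / ratings.shape[1]

print(item_means)        # per-item means: 0.70, 0.55, 0.80, 0.60
print(raw_cutscore)      # 2.65 correct out of 4 items in this toy example
print(percent_cutscore)  # 66.25
```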