Posts on psychometrics: The Science of Assessment


One of the core concepts in psychometrics is item difficulty.  This refers to the probability that examinees will get the item correct for educational/cognitive assessments or respond in the keyed direction with psychological/survey assessments (more on that later).  Difficulty is important for evaluating the characteristics of an item and whether it should continue to be part of the assessment; in many cases, items are deleted if they are too easy or too hard.  It also allows us to better understand how the items and test as a whole operate as a measurement instrument, and what they can tell us about examinees.

I’ve heard of “item facility.” Is that similar?

Item difficulty is also called item facility, which is actually a more appropriate name.  Why?  The P value is a reverse of the concept: a low value indicates high difficulty, and vice versa.  If we think of the concept as facility or easiness, then the P value aligns with the concept; a high value means high easiness.  Of course, it’s hard to break with tradition, and almost everyone still calls it difficulty.  But it might help you here to think of it as “easiness.”

How do we calculate classical item difficulty?

There are two predominant paradigms in psychometrics: classical test theory (CTT) and item response theory (IRT).  Here, I will just focus on the simpler approach, CTT.

To calculate classical item difficulty with dichotomous items, you simply count the number of examinees that responded correctly (or in the keyed direction) and divide by the number of respondents.  This gets you a proportion, which is like a percentage but is on the scale of 0 to 1 rather than 0 to 100.  Therefore, the possible range that you will see reported is 0 to 1.  Consider this data set.

Person Item1 Item2 Item3 Item4 Item5 Item6 Score
1 0 0 0 0 0 1 1
2 0 0 0 0 1 1 2
3 0 0 0 1 1 1 3
4 0 0 1 1 1 1 4
5 0 1 1 1 1 1 5
Diff: 0.00 0.20 0.40 0.60 0.80 1.00

Item6 has a high difficulty index, meaning that it is very easy.  Item4 and Item5 are typical items, where the majority of examinees respond correctly.  Item1 is extremely difficult; no one got it right!
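If you want to compute this yourself, here is a minimal Python sketch (using NumPy purely for illustration) that reproduces the P values and scores from the table above; the variable names are just for the example and not tied to any particular software.

```python
import numpy as np

# Response matrix from the table above: 5 examinees x 6 dichotomous items (1 = correct)
responses = np.array([
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1],
])

# Classical item difficulty (P value): proportion of examinees answering correctly
p_values = responses.mean(axis=0)
print(p_values)  # [0.  0.2 0.4 0.6 0.8 1. ]

# Person scores: number of items answered correctly
print(responses.sum(axis=1))  # [1 2 3 4 5]
```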

For polytomous items (items with more than one point), classical item difficulty is the mean response value.  That is, if we have a 5-point Likert item, and two people respond 4 and two respond 5, then the average is 4.5.  This, of course, is mathematically equivalent to the P value if the points are 0 and 1 for a no/yes item.  An example of this situation is this data set:

Person Item1 Item2 Item3 Item4 Item5 Item6 Score
1 1 1 2 3 4 5 16
2 1 2 2 4 4 5 18
3 1 2 3 4 4 5 19
4 1 2 3 4 4 5 19
5 1 2 3 5 4 5 20
Diff: 1.00 1.80 2.60 4.00 4.00 5.00
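As with the dichotomous case, a quick sketch along the same lines (again just an illustration) reproduces the item means from this table:

```python
import numpy as np

# Polytomous (Likert-style) responses from the table above: 5 examinees x 6 items
responses = np.array([
    [1, 1, 2, 3, 4, 5],
    [1, 2, 2, 4, 4, 5],
    [1, 2, 3, 4, 4, 5],
    [1, 2, 3, 4, 4, 5],
    [1, 2, 3, 5, 4, 5],
])

# Classical difficulty for polytomous items: mean response per item
print(responses.mean(axis=0))  # [1.  1.8 2.6 4.  4.  5. ]
```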

Note that this approach to calculating difficulty is sample-dependent.  If we had a different sample of people, the statistics could be quite different.  This is one of the primary drawbacks of classical test theory.  Item response theory tackles that issue with a different paradigm.  It also has an index with the right “direction”: with IRT, high values mean high difficulty.

If you are working with multiple choice items, remember that while you might have 4 or 5 responses, you are still scoring the items as right/wrong.  Therefore, the data ends up being dichotomous 0/1.

Very important final note: this P value is NOT to be confused with p value from the world of hypothesis testing.  They have the same name, but otherwise are completely unrelated.  For this reason, some psychometricians call it P+ (pronounced “P-plus”), but that hasn’t caught on.

How do I interpret classical item difficulty?

For educational/cognitive assessments, difficulty refers to the probability that examinees will get the item correct.  If more examinees get the item correct, it has low difficulty.  For psychological/survey type data, difficulty refers to the probability of responding in the keyed direction.  That is, if you are assessing Extraversion, and the item is “I like to go to parties” then you are evaluating how many examinees agreed with the statement.

What is unique with survey type data is that it often includes reverse-keying; the same assessment might also have an item that is “I prefer to spend time with books rather than people” and an examinee disagreeing with that statement counts as a point towards the total score.

For the stereotypical educational/knowledge assessment, with 4 or 5 option multiple choice items, we use general guidelines like this for interpretation.

Range Interpretation Notes
0.00-0.30 Extremely difficult Examinees are at chance level or even below, so your item might be miskeyed or have other issues
0.30-0.50 Very difficult Items in this range will challenge even top examinees, and therefore might elicit complaints, but are typically very strong
0.50-0.70 Moderately difficult These items are fairly common, and a little on the tougher side
0.70-0.90 Moderately easy This is the most common range for items on classically built tests; easy enough that examinees rarely complain
0.90-1.00 Very easy These items are mastered by most examinees; they are too easy to provide much information about examinees, and can be detrimental to reliability

Do I need to calculate this all myself?

No.  There is plenty of software to do it for you.  If you are new to psychometrics, I recommend CITAS, which is designed to get you up and running quickly but is too simple for advanced situations.  If you have large samples or are involved in production-level work, you need Iteman; sign up for a free account with the button below.  If that is you, I also recommend that you look into learning IRT if you have not yet.

Ways to Improve Item Banks

The foundation of a decent assessment program is the ability to develop and manage strong item banks. An item bank is a central repository of test questions, each stored with important metadata such as author or difficulty. Item banks are designed to treat items as reusable objects, which makes it easier to publish new exam forms.

Of course, the storage of metadata is very useful as well and provides evidence for validity documentation. Most importantly, a true item banking system will make the process of developing new items more efficient (lower cost) and effective (higher quality).

1. Item writers are screened for expertise

Make sure the item writers (authors) recruited for the program meet minimum levels of expertise. Often this means a minimum number of years of experience in the field. You might also want to make sure that characteristics such as specialty area or geographic region are sufficiently distributed.

2. Item writers are trained on best practices

Item writers must be trained on best practices in item writing, as well as any guidelines provided by the organization. A great example is this book from TIMSS. ASC has provided their guidelines for download here. This facilitates higher quality item banks.

3. Items go through review workflow to check best practices

After items are written, they should proceed through a standardized workflow and quality assurance. This is the best practice in developing any products. The field of software development uses a concept called the Kanban Board, which ASC has implemented in its item banking platform.

Review steps can include psychometric review, bias review, language editing, and content review.

4. Items are all linked to blueprint/standards

All items in the item banks should be appropriately categorized. This guarantees that no items are measuring an unknown or unneeded concept. Items should be written to meet blueprints or standards.


5. Items are piloted

Items are all written with good intent. However, we all know that some items are better than others. Items need to be given to some actual examinees so we can obtain feedback, and also obtain data for psychometric analysis.

Often, they are piloted as unscored items before eventual use as “live” scored items. But this isn’t always possible.

6. Psychometric analysis of items

After items are piloted, you need to analyze them with classical test theory and/or item response theory to evaluate their performance. I like to say there are three possible choices after this evaluation: hold, revise, and retire. Items that perform well are preserved as-is.

Those of moderate quality might be modified and re-piloted. Those that are unsalvageable are slated for early retirement.

How to accomplish all this?

This process can be extremely long, involved, and expensive. Many organizations hire in-house test development managers or psychometricians; those without that option will hire organizations such as ASC to serve as consultants.

Regardless, it is important to have a software platform in place that can effectively manage this process. Such platforms have been around since the 1980s, but many organizations still struggle by managing their item banks with Word, Excel, PowerPoint, and Email!

ASC provides an item banking platform for free, which is used by hundreds of organizations. Click below to sign up for your own account.



Responses in Common (RIC)

The Responses in Common (RIC) collusion detection (test cheating) index simply counts the number of identical responses between a given pair of examinees; for example, both answered “B” to a certain item, regardless of whether it was correct or incorrect.  There is no probabilistic evaluation that can be used to flag examinees, but the index can be of good use from a descriptive or investigative perspective.

It has a major flaw: we expect it to be very high for high-ability examinees.  If two strong examinees both get 99/100 correct, the minimum RIC they could have is 98/100, even if they have never met each other and had no opportunity to collude or cheat.

Note that RIC is not standardized in any way, so its range and relevant flag cutoff will depend on the number of items in your test, and how much your examinee responses vary.  For a 100-item test, you might want to set the flag at 90 items.  But for a 50-item test, this is obviously irrelevant, and you might want to set it at 45.
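For illustration, here is a minimal Python sketch of the RIC count for one pair of examinees; the response vectors are made up for the example.

```python
def responses_in_common(resp_a, resp_b):
    """Count items on which two examinees gave the same response (correct or not)."""
    return sum(a == b for a, b in zip(resp_a, resp_b))

# Hypothetical selected options for two examinees on a 10-item test
examinee_1 = ["B", "C", "A", "D", "B", "A", "C", "D", "A", "B"]
examinee_2 = ["B", "C", "A", "D", "B", "A", "C", "A", "A", "C"]

print(responses_in_common(examinee_1, examinee_2))  # 8
```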

Problems such as these with Responses In Common have led to the development of much more sophisticated indices of examinee collusion and copying, such as Holland’s K index and variants.

Need an easy way to calculate this?  Download our SIFT software for free.

Exact Errors in Common (EEIC)

Exact Errors in Common (EEIC) is an extremely basic collusion detection index that simply counts the number of items on which a given pair of examinees selected the same incorrect response.

For example, suppose two examinees got 80/100 correct on a test. Of the 20 each got wrong, they had 10 in common. Of those, they gave the same wrong answer on 5 items. This means that the EEIC would be 5. Why does this index provide evidence of possible collusion? Well, if you and I both get 20 items wrong on a test (same score), that’s not going to raise any eyebrows. But what if we get the same 20 items wrong? A little more concerning. What if we gave the same exact wrong answers on all of those 20? Definitely cause for concern!
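Here is a small Python sketch of the EEIC count; the answer key and response vectors are hypothetical, and production software such as SIFT handles the pairwise bookkeeping for you.

```python
def exact_errors_in_common(resp_a, resp_b, key):
    """Count items where both examinees were wrong AND chose the same wrong option."""
    return sum(a == b and a != k for a, b, k in zip(resp_a, resp_b, key))

# Hypothetical 6-item test with answer key
key        = ["A", "B", "C", "D", "A", "B"]
examinee_1 = ["A", "C", "C", "B", "A", "D"]   # wrong on items 2, 4, 6
examinee_2 = ["A", "C", "C", "B", "B", "C"]   # wrong on items 2, 4, 5, 6

print(exact_errors_in_common(examinee_1, examinee_2, key))  # 2 (items 2 and 4)
```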

There is no probabilistic evaluation that can be used to flag examinees.  However, it could be of good use from a descriptive or investigative perspective. Because it is of limited use by itself, it was incorporated into more advanced indices, such as Harpp, Hogan, and Jennings (1996).

Note that because Exact Errors in Common is not standardized in any way, its range and relevant flag cutoff will depend on the number of items in your test and how much your examinee responses vary.  For a 100-item test, you might want to set the flag at 10 items.  But for a 20-item test, that cutoff is obviously irrelevant, and you might want to set it at 5 (because most examinees will probably not even have more than 10 errors).

EEIC is easy to calculate by hand, but you can also download the SIFT software to do it for you, for free.

Errors in Common (EIC)

This exam cheating index (collusion detection) simply counts the number of errors in common between a given pair of examinees.  For example, if two examinees each got 80/100 correct (20 errors) and missed exactly the same questions, the EIC would be 20. If they both scored 80/100 but had only 10 wrong questions in common, the EIC would be 10.  There is no probabilistic evaluation that can be used to flag examinees, as with more advanced indices; in fact, EIC is used inside some other indices, such as Harpp & Hogan.  However, this index can be of good use from a descriptive or investigative perspective.
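A minimal Python sketch of the EIC count, using a hypothetical answer key and responses:

```python
def errors_in_common(resp_a, resp_b, key):
    """Count items that BOTH examinees answered incorrectly, regardless of which wrong option."""
    return sum(a != k and b != k for a, b, k in zip(resp_a, resp_b, key))

# Hypothetical 6-item test with answer key
key        = ["A", "B", "C", "D", "A", "B"]
examinee_1 = ["A", "C", "C", "B", "A", "D"]   # errors on items 2, 4, 6
examinee_2 = ["A", "C", "C", "B", "B", "C"]   # errors on items 2, 4, 5, 6

print(errors_in_common(examinee_1, examinee_2, key))  # 3 (items 2, 4, and 6)
```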

Note that EIC is not standardized in any way, so its range and relevant flag cutoff will depend on the number of items in your test, and how much your examinee responses vary.  For a 100-item test, you might want to set the flag at 10 items.  But for a 30-item test, this is obviously irrelevant, and you might want to set it at 5 (because most examinees will probably not even get more than 10 errors).

Learn more about applying EIC with SIFT, a free software program for exam cheating detection and other assessment issues.

Harpp, Hogan, and Jennings (1996) revised the Response Similarity Index of Harpp and Hogan (1993), producing a new equation for a statistic to detect collusion and other forms of exam cheating.

Explanation of Response Similarity Index

The revised index is the ratio EEIC / D, where:

EEIC denotes the number of exact errors in common (identically wrong responses), and

D is the number of items on which the two examinees gave a different response.

Note that D is calculated across all items, not just incorrect responses, so it is possible (and likely) that D>EEIC.  Therefore, the authors suggest utilizing a flag cutoff of 1.0 (Harpp, Hogan, & Jennings, 1996):

Analyses of well over 100 examinations during the past six years have shown that when this number is ~1.0 or higher, there is a powerful indication of cheating.  In virtually all cases to date where the exam has ~30 or more questions, has a class average <80% and where the minimum number of EEIC is 6, this parameter has been nearly 100% accurate in finding highly suspicious pairs.

However, Nelson (2006) has evaluated this index in comparison to Wesolowsky’s (2000) index and strongly recommends against using the HHJ.  It is notable that neither index makes any attempt to evaluate probabilities or to standardize.  Cizek (1999) notes that neither Harpp-Hogan method even receives attention in the psychometric literature.

This approach has very limited ability to detect cheating when the source has a high ability level. While individual classroom instructors might find the EEIC/D straightforward and useful, there are much better indices for use in large-scale, high-stakes examinations.
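For an instructor who does want to compute it anyway, here is a minimal Python sketch of the EEIC/D ratio; the answer key and responses are hypothetical.

```python
def hhj_index(resp_a, resp_b, key):
    """Harpp, Hogan, & Jennings (1996) ratio: EEIC / D.
    EEIC = items where both examinees chose the same wrong option;
    D    = items where the two examinees gave different responses (across all items)."""
    eeic = sum(a == b and a != k for a, b, k in zip(resp_a, resp_b, key))
    d = sum(a != b for a, b in zip(resp_a, resp_b))
    return eeic / d if d else float("inf")

# Hypothetical 6-item example
key    = ["A", "B", "C", "D", "A", "B"]
exam_1 = ["A", "C", "C", "B", "A", "D"]
exam_2 = ["A", "C", "C", "B", "B", "C"]

print(hhj_index(exam_1, exam_2, key))  # 2 EEIC / 2 different responses = 1.0, right at the suggested cutoff
```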

Harpp and Hogan (1993) Response Similarity Index

Harpp and Hogan (1993) suggested a response similarity index defined as   

Response Similarity Index = EEIC / EIC

 

Response Similarity Index Explanation

where EEIC denotes the number of exact errors in common (identically wrong responses), and EIC denotes the number of errors in common.

This is calculated for all pairs of examinees that the researcher wishes to compare. 

One advantage of this approach is that it is extremely simple to interpret: if examinee A and B each get 10 items wrong, 5 of which are in common, and gave the same answer on 4 of those 5, then the index is simply 4/5 = 0.80.  A value of 1.0 would therefore be perfect “cheating” – on all items that both examinees answered incorrectly, they happened to select the same distractor.
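As a quick sketch, the ratio can be computed in a few lines of Python; the answer key and responses below are hypothetical.

```python
def hh_index(resp_a, resp_b, key):
    """Harpp & Hogan (1993) ratio: EEIC / EIC.
    EEIC = items where both examinees chose the same wrong option;
    EIC  = items that both examinees answered incorrectly."""
    eeic = sum(a == b and a != k for a, b, k in zip(resp_a, resp_b, key))
    eic = sum(a != k and b != k for a, b, k in zip(resp_a, resp_b, key))
    return eeic / eic if eic else None

# Hypothetical 6-item example
key    = ["A", "B", "C", "D", "A", "B"]
exam_1 = ["A", "C", "C", "B", "A", "D"]
exam_2 = ["A", "C", "C", "B", "B", "C"]

print(round(hh_index(exam_1, exam_2, key), 2))  # 2 EEIC / 3 EIC = 0.67
```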

The authors suggest utilizing a flag cutoff of 0.75, with the following reasoning (Harpp & Hogan, 1993, p. 307):

The choice of 0.75 is derived empirically because pairs with less than this fraction were not found to sit adjacent to one another while pairs with greater than this ratio almost always were seated adjacently.

The cutoff can differ from dataset to dataset, so SIFT allows you to specify the cutoff you wish to use for flagging pairs of examinees.  However, because this cutoff is completely arbitrary, a very high value (e.g., 0.95) is recommended, as this index can easily lead to many flags, especially if the test is short.  False positives are likely, and this index should be used with great caution.  Wesolowsky (unpublished PowerPoint presentation) called this method “better but not good.”

You may also be interested in the revised version of this index produced by Harpp, Hogan, and Jennings in 1996.

This index evaluates error similarity analysis (ESA), namely estimating the probability that a given pair of examinees would have the same exact errors in common (EEIC), given the total number of errors they have in common (EIC) and the aggregated probability P of selecting the same distractor.  Bellezza and Bellezza utilize the notation of k=EEIC and N=EIC, and calculate the probability

P(\mathrm{EEIC} \ge k) = \sum_{j=k}^{N} \binom{N}{j} P^{j} (1-P)^{N-j}

Note that this is summed from k to N; the example in the original article is that a pair of examinees had N=20 and k=18, so the summand above is evaluated three times (j = 18, 19, 20) to estimate the probability of having 18 or more EEIC out of 20 EIC.  For readers of the Cizek (1999) book, note that N and k are presented correctly in the equation but their definitions in the text are transposed.

The calculation of P is left to the researcher to some extent.  Published resources on the topic note that if examinees always selected randomly amongst distractors, the probability of an examinee selecting a given distractor is 1/d, where d is the number of incorrect answers, usually one less than the total number of possible responses.  Two examinees randomly selecting the same distractor would be (1/d)(1/d).  Summing across d distractors by multiplying by d, the calculation of P would be

P = d (1/d)(1/d) = 1/d

That is, for a four-option multiple choice item, d=3 and P=0.3333.  For a five-option item, d=4 and P=0.25.
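As a sketch of the calculation described above, the binomial tail can be computed directly in Python; using P = 1/3 below is just the random-selection value for a four-option item, not a claim about the original article’s data.

```python
from math import comb

def bb_probability(k, n, p):
    """Upper-tail binomial probability: chance of k or more identical wrong answers
    out of n errors in common, given probability p of a match on any one of them."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# N = 20 errors in common, k = 18 identically wrong, random-selection P for a 4-option item
print(bb_probability(18, 20, 1/3))  # a very small probability; flag if below your cutoff (e.g., 0.001)
```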

However, examinees most certainly do not select randomly amongst distractors. Suppose a four-option multiple-choice item was answered correctly by 50% (0.50) of the sample.  The first distractor might be chosen by 0.30 of the sample, the second by 0.15, and the third by 0.05.  SIFT calculates these observed probabilities and uses them to provide a more realistic estimate of P.

SIFT therefore calculates this error similarity analysis index using both the observed probabilities and the random-selection assumption, labeling the results as B&B Obs and B&B Ran, respectively.  The indices are calculated for all possible pairs of examinees, or for all pairs in the same location, depending on the option selected in SIFT.

How to interpret this index?  It is estimating a probability, so a smaller number means that the event can be expected to be very rare under the assumption of no collusion (that is, independent test taking).  So a very small number is flagged as possible collusion.  SIFT defaults to 0.001.  As mentioned earlier, implementation of a Bonferroni correction might be prudent.

The software program Scrutiny! also calculates this ESA index.  However, it utilizes a normal approximation rather than exact calculations, and details are not given regarding the calculation of P, so its results will not agree exactly with SIFT.

Cizek (1999) notes:

          “Scrutiny! uses an approach to identifying copying called “error similarity analysis” or ESA—a method which, unfortunately, has not received strong recommendation in the professional literature. One review (Frary, 1993) concluded that the ESA method: 1) fails to utilize information from correct response similarity; 2) fails to consider total test performance of examinees; and 3) does not take into account the attractiveness of wrong options selected in common. Bay (1994) and Chason (1997) found that ESA was the least effective index for detecting copying of the three methods they compared.”

Want to implement this statistic? Download the SIFT software for free.

Frary g2 Index

The Frary, Tideman, and Watts (1977) g2 index is a collusion (cheating) detection index that standardizes the number of common responses between two examinees in the typical fashion: the observed number of common responses, minus the expected number of common responses, divided by the expected standard deviation of the number of common responses.  It compares each pair of examinees twice: evaluating whether examinee a copied from examinee b, and vice versa.

Frary, Tideman, and Watts (1977) g2 Index

The g2 collusion index starts by finding the probability, for each item, that the Copier would choose (based on their ability) the answer that the Source actually chose.  The sum of these probabilities is then the expected number of equivalent responses.  We then compare this to the observed number of equivalent responses and standardize the difference with the standard deviation.  A very positive value is possibly indicative of copying.

 

g_2 = \frac{C_{ab} - \sum_{i=1}^{k} P(U_{ia} = X_{ib})}{\sqrt{\sum_{i=1}^{k} P(U_{ia} = X_{ib})\left[1 - P(U_{ia} = X_{ib})\right]}}

Where

Cab = Observed number of common responses (e.g., both examinees selected answer D)

k = number of items

Uia = Random variable for examinee a’s response to item i

Xib = Observed response of examinee b to item i.

Frary et al. estimated P using classical test theory; the definitions are provided in the original paper, while slightly clearer definitions are provided in Khalid, Mehmood, and Rehman (2011).
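To make the standardization concrete, here is a minimal Python sketch; the match probabilities below are hypothetical placeholders, since estimating them (via CTT, per Frary et al.) is the real work.

```python
import numpy as np

def g2_index(p_match, c_ab):
    """Standardize the observed number of common responses against its expectation.
    p_match[i] = estimated probability that the Copier would give the response
    that the Source actually gave on item i; c_ab = observed common responses."""
    p = np.asarray(p_match, dtype=float)
    expected = p.sum()
    sd = np.sqrt((p * (1 - p)).sum())
    return (c_ab - expected) / sd

# Hypothetical match probabilities for a 5-item test, with 5 observed common responses
print(round(g2_index([0.6, 0.5, 0.4, 0.7, 0.3], c_ab=5), 2))  # about 2.33
```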

The g2 approach produces two half-matrices, which SIFT presents as a single matrix separated by a blank diagonal.  That is, the lower half of the matrix evaluates whether examinee a copied off b, and the upper half whether b copied off a.  More specifically, the row number is the copier and the column number is the source.  So Row1/Column2 evaluates whether 1 copied off 2, while Row2/Column1 evaluates whether 2 copied off 1.

For g2 and Wollack’s (1997) ω, the flagging procedure counts all values in the matrix greater than the critical value, so it is possible – likely actually – that each pair will be flagged twice.  So the numbers in those flag total columns will be greater than those in the unidirectional indices.

How to interpret?  This collusion index is standardized onto a z-metric, and therefore can easily be converted to the probability you wish to use.  A standardized value of 3.09 is default for g2, ω, and Zjk because this translates to a probability of 0.001.  A value beyond 3.09 then represents an event that is expected to be very rare under the assumption of no collusion.

Want to implement this statistic? Download the SIFT software for free.

Wollack Omega

Wollack (1997) adapted the standardized collusion index of Frary, Tideman, and Watts (1977), g2, to item response theory (IRT), producing the Wollack Omega (ω) index.  The probability calculations in the original Frary, Tideman, and Watts (1977) approach were crude classical approximations of an item response function, so Wollack replaced them with probabilities from IRT.

The probabilities could be calculated with any IRT model.  Wollack suggested Bock’s Nominal Response Model since it is appropriate for multiple-choice data, but that model is rarely used in practice and very few IRT software packages support it.  SIFT instead supports the use of dichotomous models: 1-parameter, 2-parameter, 3-parameter, and Rasch.

Because of using IRT, implementation of ω requires additional input.  You must include the IRT item parameters in the control tab, as well as examinee theta values in the examinee tab.  If any of that input is missing, the omega output will not be produced.

The ω index is defined as

\omega = \frac{c_{ab} - \sum_{i=1}^{k} P(U_{ia} = X_{ib} \mid \theta_a)}{\sqrt{\sum_{i=1}^{k} P(U_{ia} = X_{ib} \mid \theta_a)\left[1 - P(U_{ia} = X_{ib} \mid \theta_a)\right]}}

where P is the probability that an examinee with ability θa would select the response that examinee b actually selected, and cab is the Responses in Common (RIC) index.  Summing these probabilities across items gives the expected RIC for the copier.

Note: This uses all responses, not just errors.
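To make the logic concrete, here is a rough Python sketch. It assumes scored 0/1 data with a 3PL-style dichotomous model (one of the options SIFT supports, and a simplification of Wollack’s nominal-model formulation); all item parameters and responses below are hypothetical.

```python
import numpy as np

def irt_p(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def omega_sketch(copier_resp, source_resp, theta_copier, a, b, c):
    """Standardize observed responses in common against the model-based expectation."""
    copier_resp = np.asarray(copier_resp)
    source_resp = np.asarray(source_resp)
    p_correct = irt_p(theta_copier, np.asarray(a, float), np.asarray(b, float), np.asarray(c, float))
    # Probability the copier would match the source's observed 0/1 response on each item
    p_match = np.where(source_resp == 1, p_correct, 1 - p_correct)
    observed = np.sum(copier_resp == source_resp)   # responses in common (RIC)
    expected = p_match.sum()                        # expected RIC
    sd = np.sqrt((p_match * (1 - p_match)).sum())
    return (observed - expected) / sd

# Hypothetical item parameters and responses for a 5-item test
a = [1.0, 1.2, 0.8, 1.5, 1.0]
b = [-1.0, 0.0, 0.5, 1.0, 2.0]
c = [0.2, 0.2, 0.2, 0.2, 0.2]
value = omega_sketch(copier_resp=[1, 1, 1, 1, 0], source_resp=[1, 1, 1, 1, 0],
                     theta_copier=-0.5, a=a, b=b, c=c)
print(round(value, 2))  # compare against a critical value such as 3.09
```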

How to interpret?  The value will be higher when the copier had more responses in common with the source than we’d expect from a person of that (probably lower) ability.  This index is standardized onto a z-metric, and therefore can easily be converted to the probability you wish to use. 

A standardized value of 3.09 is the default for g2, ω, and Zjk Collusion Detection Index because this translates to a probability of 0.001.  A value beyond 3.09, then, represents an event that is expected to be very rare under the assumption of no collusion.

Interested in applying the Wollack Omega index to your data? Download the SIFT software for free.