Coefficient alpha reliability, sometimes called Cronbach’s alpha, is a statistical index used to evaluate the internal consistency, or reliability, of an assessment. That is, it quantifies how consistent we can expect scores to be by analyzing the item statistics. A high value indicates that the test is highly reliable, and a low value indicates low reliability.  This is one of the most fundamental concepts in psychometrics, and alpha is arguably the most common index.

Cronbach’s Alpha

The classic reference to alpha is Cronbach (1951). He defines it as:

α = (k / (k − 1)) × (1 − Σ σᵢ² / σ_X²)

where k is the number of items, σᵢ² is the variance of item i, and σ_X² is the variance of total scores.
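A minimal Python sketch of this calculation (the function name `cronbach_alpha` is ours, and using sample variances with ddof=1 is one common convention):

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an examinees-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

For example, two perfectly parallel items yield an alpha of 1.0, while two uncorrelated items yield an alpha near 0.0.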

Kuder-Richardson 20

While Cronbach tends to get the credit, to the point that the index is often called “Cronbach’s Alpha,” he really did not invent it. Kuder and Richardson (1937) suggested the following equation to estimate the reliability of a test with dichotomous (right/wrong) items.

KR-20 = (k / (k − 1)) × (1 − Σ pᵢqᵢ / σ_X²)

Note that it is the same as Cronbach’s equation, except that Cronbach replaced the binomial item variance pᵢqᵢ with the more general notation of variance (σᵢ²). This just means that you can use Cronbach’s equation on polytomous data such as Likert rating scales. In the case of dichotomous data such as multiple choice items, Cronbach’s alpha and KR-20 are identical.
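This equivalence is easy to demonstrate numerically. Below is a sketch (function names and data are ours; both functions use population variances, ddof=0, under which the variance of a 0/1 item is exactly p·q):

```python
import numpy as np

def kr20(scores01):
    """KR-20 for an examinees-by-items matrix of 0/1 item scores."""
    x = np.asarray(scores01, dtype=float)
    k = x.shape[1]
    p = x.mean(axis=0)                  # proportion correct per item
    pq = p * (1 - p)                    # binomial item variance p*q
    total_var = x.sum(axis=1).var()     # population variance of total scores
    return (k / (k - 1)) * (1 - pq.sum() / total_var)

def alpha(scores):
    """Coefficient alpha: general item variances in place of p*q."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0)           # population variance of each item
    total_var = x.sum(axis=1).var()
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# On dichotomous data, the two formulas produce the same value
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
```

Running `kr20(data)` and `alpha(data)` on any 0/1 matrix gives identical results.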

Additionally, Cyril Hoyt defined reliability in an equivalent approach using ANOVA in 1941, a decade before Cronbach’s paper.

How to interpret alpha

In general, alpha will range from 0.0 (random number generator) to 1.0 (perfect measurement). However, in rare cases, it can go below 0.0, such as if the test is very short or if there is a lot of missing data (sparse matrix). This, in fact, is one of the reasons NOT to use alpha in some cases. If you are dealing with linear-on-the-fly tests (LOFT), computerized adaptive tests (CAT), or a set of overlapping linear forms for equating (non-equivalent anchor test, or NEAT design), then you will likely have a large proportion of sparseness in the data matrix and alpha will be very low or negative. In such cases, item response theory provides a much more effective way of evaluating the test.

What is “perfect measurement?”  Well, imagine using a ruler to measure a piece of paper.  If it is American-sized, that piece of paper is always going to be 8.5 inches wide, no matter how many times you measure it with the ruler.  A bathroom scale is slightly less reliable; you might step on it, see 190.2 pounds, then step off and on again, and see 190.4 pounds.  This is a good example of how we often accept unreliability in measurement.

Of course, we never have this level of accuracy in the world of psychoeducational measurement.  Even a well-made test is something where a student might get 92% today and 89% tomorrow (assuming we could wipe their brain of memory of the exact questions).

Reliability can also be interpreted as the ratio of true score variance to total score variance. That is, all test score distributions have a total variance, which consists of variance due to the construct of interest (i.e., smart students do well and poor students do poorly) plus some error variance (random error, kids not paying attention to a question, a second dimension in the test… it could be many things).
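This decomposition can be illustrated with a quick simulation (all numbers here are made up for illustration; reliability is the ratio of true variance to total variance):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_score = rng.normal(50, 10, n)   # construct-driven variance (10^2 = 100)
error = rng.normal(0, 5, n)          # random error variance (5^2 = 25)
observed = true_score + error        # what we actually see on the test

reliability = true_score.var() / observed.var()
# theoretical value: 100 / (100 + 25) = 0.8
```

With a large sample, the simulated ratio lands very close to the theoretical 0.80.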

What is a good value of coefficient alpha?

As psychometricians love to say, “it depends.” The rule of thumb that you generally hear is that a value of 0.70 is good and below 0.70 is bad, but that is terrible advice. A higher value indeed indicates higher reliability, but you don’t always need high reliability. A test to certify surgeons, of course, deserves all the items it needs to make it quite reliable. Anything below 0.90 would be horrible. However, the survey you take from a car dealership will likely have the statistical results analyzed, and a reliability of 0.60 isn’t going to be the end of the world; it will still provide much better information than not doing a survey at all!

Here’s a general depiction of how to evaluate levels of coefficient alpha.

(Figure: guidelines for interpreting levels of coefficient alpha)

Using Alpha: The classical standard error of measurement

Coefficient alpha is also often used to calculate the classical standard error of measurement (SEM), which provides a related method of interpreting the quality of a test and the precision of its scores. The SEM can be interpreted as the standard deviation of scores that you would expect if a person took the test many times, with their brain wiped clean of the memory each time. If the test is reliable, you’d expect them to get almost the same score each time, meaning that SEM would be small.


Note that SEM is a direct function of alpha: SEM = SD × √(1 − α), where SD is the standard deviation of the test scores. So if alpha is 0.99, SEM will be small, and if alpha is 0.10, SEM will be very large.
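Given alpha and a set of total scores, the classical SEM takes only a few lines to compute (a sketch; the helper name `classical_sem` and the scores are ours):

```python
import numpy as np

def classical_sem(total_scores, alpha):
    """Classical SEM: score standard deviation times sqrt(1 - reliability)."""
    sd = np.std(total_scores, ddof=1)
    return sd * np.sqrt(1 - alpha)

scores = [80, 90, 100, 110, 120]          # hypothetical total scores
high_rel = classical_sem(scores, 0.99)    # reliable test -> small SEM
low_rel = classical_sem(scores, 0.10)     # unreliable test -> large SEM
```

The same score distribution yields a much larger SEM when alpha is low.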

Coefficient Alpha and Unidimensionality

It can also be interpreted as a measure of unidimensionality. If all items are measuring the same construct, then scores on them will align, and the value of alpha will be high. If there are multiple constructs, alpha will be reduced, even if the items are still high quality. For example, if you were to analyze data from a Big Five personality assessment with all five domains at once, alpha would be quite low. Yet if you took the same data and calculated alpha separately on each domain, it would likely be quite high.

How to calculate the index

Because the calculation of coefficient alpha reliability is so simple, it can be done quite easily if you need to calculate it from scratch, such as using formulas in Microsoft Excel. However, any decent assessment platform or psychometric software will produce it for you as a matter of course. It is one of the most important statistics in psychometrics.

Cautions on Overuse

Because alpha is just so convenient – boiling down the complex concept of test quality and accuracy to a single easy-to-read number – it is overused and over-relied upon. There are papers in the literature that describe these cautions in detail.

One important consideration is the over-simplification of precision by coefficient alpha and the classical standard error of measurement, compared with the conditional standard error of measurement from item response theory. Most traditional tests have a lot of items of middle difficulty, which maximizes alpha and measures students of middle ability quite well. However, if there are no difficult items on a test, it will do nothing to differentiate amongst the top students. That test would therefore have a high overall alpha, but virtually no precision for the top students. In an extreme example, they’d all score 100%.

Also, alpha will completely fall apart when you calculate it on sparse matrices, because the total score variance is artifactually reduced.


In conclusion, coefficient alpha is one of the most important statistics in psychometrics, and for good reason. It is quite useful in many cases, and easy enough to interpret that you can discuss it with test content developers and other non-psychometricians. However, there are cases where you should be cautious about its use, and some cases where it completely falls apart. In those situations, item response theory is highly recommended.

Test validation is the process of verifying, based on solid evidence, whether the specific requirements of the test development stages are fulfilled. In particular, test validation is an ongoing process of developing an argument that a specific test, its score interpretation, or its use is valid. The interpretation and use of testing data should be validated in terms of content, substantive, structural, external, generalizability, and consequential aspects of construct validity (Messick, 1994). Validity is the status of an argument that can be positive or negative: positive evidence supports the validity argument and negative evidence weakens it, accordingly. Validity cannot be absolute and can be judged only in degrees. The American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME] (1999) claim that validity is crucial for educational and psychological test development and evaluation.

Validation as part of test development

To be effective, test development has to be structured, systematic, and detail-oriented. These features can guarantee sufficient validity evidence supporting inferences proposed by test scores obtained via assessment. Downing (2006) suggested a twelve-step framework for effective test development:

  1. Overall plan
  2. Content definition
  3. Test blueprint
  4. Item development
  5. Test design and assembly
  6. Test production
  7. Test administration
  8. Scoring test responses
  9. Standard setting
  10. Reporting test results
  11. Item bank management
  12. Technical report

Even though this framework is outlined as a sequential timeline, in practice some of these steps may occur simultaneously or may be ordered differently. The starting point of test development – the purpose – defines the planned test and regulates almost all validity-related activities. Each step of the test development process focuses on its own crucial aspect of validation.

Hypothetically, excellent performance of all steps can ensure test validity, i.e., the produced test would estimate examinee ability fairly within the content area it is intended to measure. However, the human factor involved in test production might play a negative role, so there is an essential need for test validation.

Reasons for test validation

There are myriad possible reasons that can lead to the invalidation of test score interpretation or use. Let us consider some obvious issues that potentially jeopardize validity and are subject to validation:

  • overall plan: wrong choice of a psychometric model;
  • content definition: content domain is ill defined;
  • test blueprint: test blueprint does not specify an exact sampling plan for the content domain;
  • item development: items measure content at an inappropriate cognitive level;
  • test design and assembly: unequal booklets;
  • test administration: cheating;
  • scoring test responses: inconsistent scoring among examiners;
  • standard setting: unsuitable method of establishing passing scores;
  • item bank management: inaccurate updating of item parameters.

Context for test validation

All tests share common types of purported validity evidence, e.g., reliability, comparability, equating, and item quality. However, tests can vary in the number of constructs measured (single, multiple) and can have different purposes, which call for unique types of test validation evidence. In general, there are several major types of tests:

  • Admissions tests (e.g., SAT, ACT, and GRE)
  • Credentialing tests (e.g., a live-patient examination for a dentist before licensing)
  • Large-scale achievement tests (e.g., Stanford Achievement Test, Iowa Test of Basic Skills, and TerraNova)
  • Pre-employment tests
  • Medical or psychological tests
  • Language tests

The main idea is that the type of test usually defines a unique validation agenda that focuses on appropriate types of validity evidence and issues that are challenged in that type of test.

Categorization of test validation studies

Since there are multiple precedents for test score invalidation, there are many categories of test validation studies that can be applied to validate test results. In this post, we will look at the categorization suggested by Haladyna (2006):

Category 1: Test Validation Studies Specific to a Testing Program

    1. Studies That Provide Validity Evidence in Support of the Claim for a Test Score Interpretation or Use
  • Content analysis
  • Item analysis
  • Standard setting
  • Equating
  • Reliability
    2. Studies That Threaten a Test Score Interpretation or Use
  • Cheating
  • Scoring errors
  • Student motivation
  • Unethical test preparation
  • Inappropriate test administration
    3. Studies That Address Other Problems That Threaten Test Score Interpretation or Use
  • Drop in reliability
  • Drift in item parameters over time
  • Redesign of a published test
  • Possible security problem

Category 2: Test Validation Studies That Apply to More Than One Testing Program

    Studies that lead to the establishment of concepts, principles, or procedures that guide, inform, or improve test development or scoring
  • Introducing a concept
  • Introducing a principle
  • Introducing a procedure
  • Studying a pervasive problem


Even though test development is a long and laborious process, test creators have to be extremely accurate while executing their obligations within each activity. The culmination of this process is obtaining valid and reliable test scores, and their adequate interpretation and use. The higher the stakes or consequences of the test scores, the greater the attention that should be paid to test validity and, therefore, to test validation. The latter is strengthened by integrating all reliable sources of evidence into the argument for test score interpretation and use.


American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. American Educational Research Association.

Downing, S. M. (2006). Twelve steps for effective test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 3-25). Lawrence Erlbaum Associates.

Haladyna, T. M. (2006). Roles and importance of validity studies in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 739-755). Lawrence Erlbaum Associates.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.


The borderline group method of standard setting is one of the most common approaches to establishing a cutscore for an exam.  In comparison with item-centered standard setting methods such as modified-Angoff, Nedelsky, and Ebel, there are two well-known examinee-centered methods (Jaeger, 1989): the contrasting groups method and the borderline group method (Livingston & Zieky, 1989). This post will focus on the latter.

The concept of the borderline group method

Examinee-centered methods require participants to judge whether an individual examinee possesses adequate knowledge, skills, and abilities across specific content standards. The borderline group method is based on determining a common passing score: the score that would be expected from an examinee whose competencies are on the borderline between adequate and inadequate.

How to perform the borderline group method

First of all, the judges are selected from those who are thoroughly familiar with the content examined and are knowledgeable about knowledge, skills, and abilities of individual examinees. Next, the judges engage in a discussion to develop a description of an examinee who is on the borderline between two extremes, mastery and non-mastery. Alternatively, the judges may be tasked to sort examinees into three categories: clearly competent, clearly incompetent, and those in-between.

After the description is agreed upon, borderline examinees need to be identified. The ultimate goal of the borderline group method is to obtain the distribution of the borderline examinees’ test scores and to find the median of that distribution (50th percentile), which becomes the recommended cutscore.

Why is the median used and not the mean, you might ask? The reason is that the median is much less affected by extremely high or extremely low values. This feature of the median is particularly important for the borderline group method, because an examinee with a very high or very low score is likely to not really belong in the group.
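The robustness of the median is easy to see with a small numeric sketch (the scores below are made up; the 95 represents an examinee who probably does not belong in the group):

```python
import numpy as np

# Hypothetical scores of examinees the judges placed in the borderline group
borderline_scores = np.array([61, 63, 64, 65, 65, 66, 68, 70, 95])

cutscore = np.median(borderline_scores)   # barely affected by the outlying 95
mean_cut = borderline_scores.mean()       # pulled upward by the outlier
```

The median lands at 65, while the mean is dragged several points higher by the single outlier.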

Analyzing the borderline group method

Advantages of this method:

  • Time efficient
  • Straightforward to implement

Disadvantages of this method:

  • Difficult to achieve consensus on the nature of borderline examinees
  • The cut score could have low validity if the number of borderline examinees is small

Why could the borderline group method work poorly, and how can this be tackled?

Possible issue: The judges could identify some examinees as borderline by mistake (e.g., their skills were difficult to judge), so the borderline group might contain examinees who do not belong in it.
Probable solution: Remind the judges not to include in the borderline group any examinees whose competencies they are not sure about.

Possible issue: The judges may base their judgements on something other than what the exam measures.
Probable solution: Give the judges appropriate instructions and get them to agree with each other when defining a borderline examinee.

Possible issue: The judgements in terms of individual standards regarding the examinees’ skills and abilities may differ greatly.

There is also a risk that judges would be susceptible to errors of central tendency and, therefore, might assign a disproportionately large number of examinees to the borderline group if they do not have sufficient knowledge about individual examinees’ performances. Thus, picking highly competent judges is key to implementing the borderline group method.


Let’s summarize the steps needed to implement the borderline group method:

  • Select the competent judges
  • Define the borderline level of examinees’ knowledge, skills, and abilities
  • Identify the borderline examinees
  • Obtain the test scores of the borderline examinees
  • Calculate the cutscore as the median of the distribution of the borderline examinees’ test scores



Jaeger, R. M. (1989). Certification of student competence.

Livingston, S. A., & Zieky, M. J. (1989). A comparative study of standard-setting methods. Applied Measurement in Education, 2(2), 121-141.


The Ebel method of standard setting is a psychometric approach to establish a cutscore for tests consisting of multiple-choice questions. It is usually used for high-stakes examinations in the fields of higher education, medical and health professions, and for selecting applicants.

How is the Ebel method performed?

The Ebel method requires a panel of judges who would first categorize each item in a data set by two criteria: level of difficulty and relevance or importance. Then the panel would agree upon an expected percentage of items that should be answered correctly for each group of items according to their categorization.

It is crucial that the judges are experts in the examined field; otherwise, their judgement would not be valid and reliable. Prior to the item rating process, the panelists should be given a sufficient amount of information about the purpose and procedures of the Ebel method. In particular, it is important that the judges understand the meaning of difficulty and relevance in the context of the current assessment.

The next stage is to determine what “minimally competent” performance means in the specific case, depending on the content. When everything is clear and all definitions are agreed upon, the experts should classify each item by difficulty (easy, medium, or hard) and relevance (minimal, acceptable, important, or essential). In order to minimize the influence of the judges’ opinions on each other, individual ratings are recommended rather than consensus ratings.

Afterwards, judgements on the proportion of items expected to be answered correctly by minimally competent candidates need to be collected for each item category, e.g., easy and essential. To save time in the rating process, the grid proposed by Ebel and Frisbie (1972) might be used. It is worth mentioning, though, that Ebel ratings are content-specific, so values in the grid might happen to be too low or too high for a given test.


In the end, the Ebel method, like the modified-Angoff method, identifies a cutscore for an examination based on the performance of candidates in relation to a defined standard (absolute), rather than how they perform in relation to their peers (relative). Ebel scores for each item and for the whole exam are calculated as the average of the scores provided by each expert: the number of items in each category is multiplied by the expected percentage of correct answers, and these products are summed to calculate the cutscore.
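The arithmetic for one panel can be sketched as follows (the category counts and expected proportions below are entirely hypothetical; in practice each expert's cutscore would be computed this way and then averaged):

```python
# Hypothetical Ebel ratings: for each (difficulty, relevance) cell, the number
# of items in the cell and the panel's expected proportion correct for a
# minimally competent candidate
cells = {
    ("easy", "essential"):    (10, 0.90),
    ("medium", "essential"):  (15, 0.70),
    ("hard", "essential"):    (5, 0.50),
    ("easy", "acceptable"):   (10, 0.80),
    ("medium", "acceptable"): (8, 0.60),
}

# cutscore = sum over cells of (items in cell * expected proportion correct)
cutscore = sum(n * p for n, p in cells.values())
total_items = sum(n for n, _ in cells.values())
percent_cut = cutscore / total_items
```

With these made-up numbers, the raw cutscore is 34.8 points out of 48 items, i.e., about 72.5%.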

Pros of using Ebel

  • This method provides an overview of a test difficulty
  • Cut-off score is identified prior to an examination
  • It is relatively easy for experts to perform

Cons of using Ebel

  • This method is time-consuming and costly
  • Evaluation grid is hard to get right
  • Digital software is required
  • Back-up is necessary


The Ebel method is a rather complex standard-setting process compared to others, due to the need for an analysis of the content, and it therefore imposes a burden on the standard-setting panel. However, it considers the relevance of the test items and the expected proportion of correct answers from minimally competent candidates, including borderline candidates. Thus, even though the procedure is complicated, the results are very stable and very close to the actual cutscores.


Ebel, R. L., & Frisbie, D. A. (1972). Essentials of educational measurement.

Item parameter drift (IPD) refers to the phenomenon in which the parameter values of a given test item change over multiple testing occasions within the item response theory (IRT) framework. This phenomenon is often relevant to student progress monitoring assessments, where a set of items is used several times in one year, or across years, to track student growth. Observing trends in student academic achievement depends upon stable linking (anchoring) between assessment moments over time; if the item parameters are not stable, the scale is not stable, and time-to-time comparisons are not either. Some psychometricians consider IPD a special case of differential item functioning (DIF), but the two are different issues and should not be confused with each other.

Reasons for Item Parameter Drift

IRT modeling is attractive for the assessment field because of its property of item parameter invariance across samples of test-takers, which is fundamental for parameter estimation; that assumption enables important things like strong equating of tests across time and the possibility of computerized adaptive testing. However, item parameters are not always invariant. There are many reasons that could stand behind IPD. One possibility is curricular change, based on assessment results or more concentrated instruction. Other feasible reasons are item exposure, cheating, or curricular misalignment with some standards. No matter what has led to IPD, its presence can cause biased estimates of student ability. In particular, IPD can be highly detrimental to reliability and validity in the case of high-stakes examinations. Therefore, it is crucial to detect item parameter drift when anchoring assessment occasions over time, especially when the same anchor items are used repeatedly.

Perhaps the simplest example is item exposure.  Suppose a 100-item test is delivered twice per year, with 20 items always remaining as anchors.  Eventually students will share memories and the topics of those will become known.  More students will get them correct over time, making the items appear easier.

Identifying IPD

There are several methods for detecting IPD. Some are simpler because they do not require estimation of anchoring constants, and some are more difficult because they require that estimation. Simple methods include the “3-sigma p-value,” the “0.3 logits,” and the “3-sigma IRT” approaches. Complex methods involve the “3-sigma scaled IRT,” the “Mantel-Haenszel,” and the “area between item characteristic curves” approaches, where the last two are based on the consideration that IPD is a special case of DIF, and therefore there is an opportunity to draw upon a massive body of existing research on DIF methodologies.
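The simplest of these, the “0.3 logits” rule, just flags anchor items whose difficulty estimate shifts by more than 0.3 logits between occasions. A sketch with made-up b-parameters (assuming both occasions are already on a common scale):

```python
import numpy as np

# Hypothetical difficulty (b) parameters for the same anchor items estimated
# on two testing occasions, on a common scale
b_time1 = np.array([-1.2, 0.0, 0.4, 1.1, 2.0])
b_time2 = np.array([-1.6, 0.1, 0.3, 1.2, 2.1])

drift = b_time2 - b_time1                    # change in difficulty per item
flagged = np.where(np.abs(drift) > 0.3)[0]   # the "0.3 logits" rule
```

Here only the first item exceeds the threshold (it drifted 0.4 logits easier), so it would be referred for review or removed from the anchor set.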

Handling IPD

Even though not all psychometricians think that removal of outlying anchor items is the best solution for item parameter drift, if we do not eliminate drifting items from the process of equating test scores, they will affect transformations of ability estimates, not only item parameters. Imagine an examination that classifies examinees as either failing or passing, or into four performance categories; in the case of IPD, 10-40% of students could be misclassified. In high-stakes testing situations where classification of examinees implies certain sanctions or rewards, IPD scenarios should be minimized as much as possible. As soon as some items are found to exhibit IPD, they should be referred to the subject-matter experts for further investigation. Alternatively, if a faster decision is needed, such flagged anchor items should be removed immediately. Afterwards, psychometricians need to re-estimate linking constants and evaluate IPD again. This process should repeat until none of the anchor items shows item parameter drift.

The concept of Speeded vs Power Test is one of the ways of differentiating psychometric or educational assessments. In the context of educational measurement, and depending on the assessment goals and time constraints, tests are categorized as speeded or power. There is also the concept of a Timed test, which is really a Power test. Let’s look at these types more carefully.

Speeded test


In a speeded test, examinees are expected to answer as many questions as possible within the time limit, but the time limit is unreasonably short, preventing even the best examinees from completing the test and therefore forcing speed.  Items are delivered sequentially, starting from the first one until the last one. All items are usually relatively easy, though sometimes they increase in difficulty.  If the time limit and difficulty level are correctly set, none of the test-takers will be able to reach the last item before the time limit is reached. A speeded test is supposed to demonstrate how fast an examinee can respond to questions within a time limit. In this case, examinees’ answers are not as important as their speed of answering questions. The total score is usually computed as the number of questions answered correctly when the time limit is met, and differences in scores are mainly attributed to individual differences in speed rather than knowledge.

An example of this might be a mathematical calculation speed test. Examinees are given 100 multiplication problems and told to solve as many as they can in 20 seconds. Most examinees know the answers to all the items; it is a question of how many they can finish. Another might be a 10-key task, where examinees are given a list of 100 five-digit strings and told to type as many as they can in 20 seconds.

Pros of a speeded test:

  • A speeded test is appropriate when you actually want to test the speed of examinees; the 10-key task above would be useful in selecting data entry clerks, for example. The concept of “knowledge of 5-digit strings” in this case is not relevant and doesn’t even make sense.
  • Tests can sometimes be very short but still discriminating.
  • When a test is a mixture of items in terms of difficulty, examinees might save time on easier items in order to respond to more difficult items. This can create an increased spread in scores.

Cons of a speeded test:

  • Most situations where a test is used call for evaluating knowledge, not speed.
  • The nature of the test provokes examinees to commit errors even when they know the answers, which can be stressful.
  • A speeded test does not consider individual peculiarities of examinees.

Power Test

A power test provides examinees with sufficient time so that they could attempt all items and express their true level of knowledge or ability. Therefore, this testing category focuses on assessing knowledge, skills, and abilities of the examinees.  The total score is often computed as a number of questions answered correctly (or with item response theory), and individual differences in scores are attributed to differences in ability under assessment, not to differences in basic cognitive abilities such as processing speed or reaction time.

There is also the concept of a Timed Test. This has a time limit, but it is NOT a major factor in how examinees respond to questions, nor does it affect their scores much. For example, the time limit might be set so that 95% of examinees are not affected at all, and the remaining 5% are slightly hurried. This is done with the CAT-ASVAB.

Pros of a power test:

  • There are no time restrictions for test-takers
  • A power test is great for evaluating the knowledge, skills, and abilities of examinees
  • A power test reduces the chances of examinees committing errors even when they know the answers
  • A power test considers individual peculiarities of examinees

Cons of a power test:

  • It can be time consuming (some of these exams are 8 hours long or even more!)
  • This test format sometimes does not suit competitive examinations because of administrative issues (too much test time across too many examinees)
  • A power test is sometimes bad for discriminative purposes, since all examinees have a high chance of performing well.  There are certainly some pass/fail knowledge exams where almost everyone passes.  But the purpose of those exams is not to differentiate for selection, but to make sure students have mastered the material, so this is a good thing in that case.

Speeded vs power test

The categorization of a test as speeded or power depends on the assessment purpose. For instance, an arithmetic test for Grade 8 students might be a speeded test when containing many relatively easy questions, but the same test could be a power test for Grade 7 students. Thus, a speeded test measures power if all of the items can be correctly answered within the limited time period. Similarly, a power test might turn into a speeded test: once a restrictive time limit is fixed for a power test, it becomes a speeded test. Today, a purely speeded or purely power test is rare. What we usually meet in practice is a mixture of both, typically a Timed Test.

Below you may find a comparison of a speeded vs power test, in terms of the main features.


  • Time limit: fixed and affects all examinees (speeded) vs. none, or one that affects only a small percentage of examinees (power)
  • Goal: evaluate speed only, or a combination of speed and correctness (speeded) vs. evaluate correctness in the sense of knowledge, skills, and abilities (power)
  • Question difficulty: relatively easy (speeded) vs. relatively difficult (power)
  • Errors: the test format increases chances of committing errors (speeded) vs. reduces them (power)


Distractor analysis refers to the process of evaluating the performance of incorrect answers vs the correct answer for multiple choice items on a test.  It is a key step in the psychometric analysis process to evaluate item and test performance as part of documenting test reliability and validity.

What is a distractor?

Multiple-choice questions always have a few options for an answer, one of which is the key/correct answer, and the remaining ones are distractors/wrong answers. It is worth noting that distractors should not be just any wrong answers but have to be plausible answers in case an examinee makes a mistake when looking for the right option. In short, distractors are feasible answers that an examinee might select when making misjudgments or having partial knowledge/understanding.  A great example is later in this article with the word “confectioner.”

Parts of an item - stem options distractor

After a test form is delivered to examinees, distractor analysis should be implemented to make sure that all answer options work well, and that the item is performing well and defensibly. For example, it is expected that around 40-95% of examinees pick the correct answer, and that each distractor is chosen by a smaller number of examinees than the key, with an approximately equal distribution of choices across the distractors.

Distractor analysis is usually done with classical test theory, even if item response theory is used for scoring, equating, and other tasks.

How to do a distractor analysis

There are three main aspects:

  1. Option frequencies/proportions
  2. Option point-biserial
  3. Quantile plot

The option frequencies/proportions just refers to the analysis of how many examinees selected each answer.  Usually it is a proportion and labeled as “P.”  Did 70% choose the correct answer while the remaining 30% were evenly distributed amongst the 3 distractors?  Great.  But if only 40% chose the correct answer and 45% chose one of the distractors, you might have a problem on your hands.  Perhaps the answer specified as the Key was not actually correct.
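The option proportions are easy to compute from raw response data. Below is a minimal sketch in Python; the responses and key are made-up example data for a single item, not from a real test.

```python
# Illustrative sketch: option proportions ("P" values) for one multiple-choice
# item, using hypothetical raw responses.
from collections import Counter

responses = list("AABCABADAB" * 2)  # hypothetical responses for one item
key = "A"                           # the designated correct answer

counts = Counter(responses)
n = len(responses)
# proportion of examinees selecting each option
proportions = {option: counts[option] / n for option in sorted(counts)}
# proportions[key] is the item's P value (difficulty/facility statistic)
```

If `proportions[key]` is low while a distractor's proportion is high, that is the red flag described above.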

The point-biserials (Rpbis) will help you evaluate if this is the case.  The point-biserial is an item-total correlation, meaning that we correlate scores on the item with the total score on the test, which is a proxy index of examinee ability.  If 0.0, there is no relationship, which means the item is not correlated with ability, and therefore probably not doing any good.  If negative, it means that the lower-ability students are selecting it more often; if positive, it means that the higher-ability students are selecting it more often.  We want the correct answer to have a positive value and the distractors to have a negative value.  This is one of the most important points in determining if the item is performing well.
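The point-biserial for an option is simply a Pearson correlation between a 0/1 indicator of selecting that option and the total score. A minimal sketch, using invented data:

```python
# Sketch of an item-total point-biserial (Rpbis). The item and total-score
# vectors below are invented for illustration.
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

item = [1, 0, 1, 1, 0, 1, 0, 1]   # 1 = examinee selected this option
total = [9, 4, 8, 7, 3, 9, 5, 8]  # total test scores (proxy for ability)
rpbis = pearson(item, total)      # positive here: higher-ability examinees select it
```

Run this with the indicator for each distractor in turn; a distractor with a positive Rpbis deserves review.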

In addition, there is a third approach, which is visual: the quantile plot.  It is very useful for diagnosing how an item is working and how it might be improved.  This splits the sample into blocks ordered by performance, such as 5 groups where Group 1 is the 0-20th percentile, Group 2 is the 21-40th, etc.  We expect the smartest group to have a high proportion of examinees selecting the correct answer and a low proportion selecting the distractors, and vice versa.  You can see how this aligns with the concept of the point-biserial.  An example of this is below.
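The grouping behind a quantile plot can be sketched as follows. The function name and data are hypothetical, assuming examinees are ranked by total score and split into equal-sized blocks:

```python
# Sketch: form performance-ordered groups and compute, for each group, the
# proportion of examinees who chose a given option. Hypothetical helper.
def quantile_proportions(totals, picked_option, n_groups=5):
    """totals: total scores; picked_option: 1 if the examinee chose the option."""
    # rank examinees from lowest to highest total score
    ranked = sorted(zip(totals, picked_option))
    size = len(ranked) // n_groups
    props = []
    for g in range(n_groups):
        block = ranked[g * size:(g + 1) * size]
        props.append(sum(flag for _, flag in block) / len(block))
    return props  # plot these per option: key should rise, distractors fall
```

For the key, the returned proportions should increase from the lowest group to the highest; for a distractor, they should decrease.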

Note that the P and point-biserial for the correct answer serve as “the” statistics for the item as a whole.  The P for the item is called the item difficulty or facility statistic.

Examples of distractor analysis

Here is an example of a good item.  The P is medium (67% correct) and the Rpbis is strongly positive for the correct answer while strongly negative for the incorrect answers.  This translates to a clean quantile plot where the curve for the correct answer (B) goes up while the curves for the incorrect answers go down.  An ideal situation.

Distractor analysis quantile plot classical

Now contrast that with the following item.  Here, only 12% of examinees got this correct, and the Rpbis was negative.  Answer C had 21% and a nicely positive Rpbis, as well as a quantile curve that goes up.  This item should be reviewed to see if C is actually correct.  Or B, which had the most responses.  Most likely, this item will need a total rewrite!

Bad quantile plot and table for distractor analysis

Note that an item can be extremely difficult but still perform well.  Here is an example where the distractor analysis supports continued use of the item.  The distractor is just extremely attractive to lower-ability students; they think that a confectioner makes confetti, since those two words look the closest.  Look how strong the Rpbis is here, positive for the key and very negative for that distractor.  This is a good result!

Confectioner confetti distractor analysis

Multi-modal test delivery refers to an exam that is capable of being delivered in several different ways, or to an online testing software platform designed to support this process. For example, you might provide the option for a certification exam to be taken on computer at third-party testing centers, or via paper at the annual conference for the profession. The goal of multi-modal test delivery is to improve access and convenience for the examinees. In the example, perhaps the testing center approach requires an extra $60 proctoring fee as well as requiring the examinee to drive up to an hour to get there; they might be attending the annual conference next month anyway, and it would be very convenient for them to duck into a side room to take the exam.

Multi-modal test delivery requires scalable security on the part of your delivery partner. The exam platform should be able to support various types of exam delivery. Here are some approaches to consider.

Paper exams

Your platform should be able to make print-ready versions of the test. Note that this is quite different from exporting test items to Word or PDF; straight exports are often ugly and include metadata.  You might also need advanced formats like Adobe InDesign.

Additionally, the system should also be able to import the results of a paper test back in, so that it is available for scoring and reporting along with other modes of delivery.

FastTest can do all of these things, as well as the points below.  You can sign up for a free account and try it yourself.

Online unproctored

The platform should be able to deliver exams online, without proctoring. There are several ways for candidates to enter the exam.

1. As a direct link, without registration, such as an anonymous survey

2. As a direct link, but requiring self-registration

3. Pre-registration, with some sort of password to ensure the right person is taking the exam. This can be emailed or distributed, or perhaps is available via another software platform like a learning management system or applicant tracking system.

Online remote-proctored

The platform should be able to deliver the test online, with remote proctoring. There are several levels of remote proctoring, corresponding to increasing levels of security or stakes.

1. AI only: Video is recorded of the candidate taking the exam, and it is “reviewed” by AI algorithms. A human has the opportunity to review the flagged candidates, but in many cases this review never happens.

2. Record and review: Video is recorded, and every video is reviewed by a human. This provides stronger security than AI only, but it does not prevent test theft because it would only be found a day or two later.

3. Live: Video is live-streamed and watched in real time. This provides the opportunity to stop the exam if someone is cheating. The proctors can be third-party or in some cases the organization’s staff. If you are using your staff, make sure to avoid the mistakes made by Cleveland State University.

Testing centers managed by you

Some testing platforms have functionality for you to manage your own testing centers. When candidates are registered for an exam, they are assigned to an appropriate center. In some cases, the center is also assigned a proctor. The platform might have a separate login for the proctor, requiring them to enter a password before the examinee can enter theirs (or the proctor enter it on their behalf).

New test scheduler sites proctor code

Formal third-party testing centers

Some vendors will have access to a network of testing centers. These will have trained proctors, computers, and sometimes additional security considerations like video monitoring or biometric scanners when candidates arrive. There are three types of testing centers.

1. Owned: The testing company actually owns their own centers, and they are professionally staffed.

2. Independent/affiliated: The testing company might contract with professional testing centers that are owned by a different company. In some cases, these are independent.

3. Public: Some organizations will contract with public locations, such as computer labs at universities or libraries.

Summary: multi-modal test delivery

Multi-modal test delivery provides flexibility for exam sponsors. There are two situations where this is important. First, a single test can be delivered in multiple ways with equivalent security, to allow for greater convenience, like the conference example above. But it also empowers a testing organization to run multiple types of exams at different levels of security. For instance, a credentialing board might have an unproctored online exam as a practice test, a test center exam for their primary certification exam, and a remote-proctored test for annual recertification. Having a single platform makes it easier for the organization to manage their assessment activities, reducing costs while improving the experience for the people for whom it really matters – the candidates.

Split Half Reliability is an internal consistency approach to quantifying the reliability of a test, in the paradigm of classical test theory.  Reliability refers to the repeatability or consistency of the test scores; we definitely want a test to be reliable.  The name comes from a simple description of the method: we split the test into two halves, calculate the score on each half for each examinee, then correlate those two columns of numbers.  If the two halves measure the same thing, then the correlation is high, indicating a decent level of unidimensionality in the construct and reliability in measuring the construct.

Why do we need to estimate reliability?  Well, it is one of the easiest ways to quantify the quality of the test.  Some would argue, in fact, that it is a gross oversimplification.  However, because it is so convenient, classical indices of reliability are incredibly popular.  The most popular is coefficient alpha, which is a competitor to split half reliability.

How to Calculate Split Half Reliability

The process is simple.

  1. Take the test and split it in half
  2. Calculate the score of each examinee on each half
  3. Correlate the scores on the two halves

The correlation is best done with the standard Pearson correlation.


This, of course, raises the question: how do we split the test into two halves?  There are many possible ways, but psychometricians generally recommend three:

  1. First half vs last half
  2. Odd-numbered items vs even-numbered items
  3. Random split

You can do these manually with your matrix of data, but good psychometric software will do these for you, and more (see screenshot below).


Suppose this is our data set, and we want to calculate split half reliability.

Person | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Score
1 | 1 | 0 | 0 | 0 | 0 | 0 | 1
2 | 1 | 0 | 1 | 0 | 0 | 0 | 2
3 | 1 | 1 | 0 | 1 | 0 | 0 | 3
4 | 1 | 0 | 1 | 1 | 1 | 1 | 5
5 | 1 | 1 | 0 | 1 | 0 | 1 | 4

Let’s split it by first half and last half.  Here are the scores.

Score 1 Score 2
1 0
2 0
2 1
2 3
2 2

The correlation of these is 0.51.

Now, let’s try odd/even.

Score 1 Score 2
1 0
2 0
1 2
3 2
1 3

The correlation of these is -0.04!  Obviously, the different ways of splitting don’t always agree.  Of course, with such a small sample here, we’d expect a wide variation.
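Both calculations can be reproduced in a few lines of code. The sketch below uses the same 5-person, 6-item data set from the table above:

```python
# Reproducing the split-half correlations for the example data set above.
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

data = [  # rows = persons, columns = Item1..Item6
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
]

first = [sum(row[:3]) for row in data]   # items 1-3
last = [sum(row[3:]) for row in data]    # items 4-6
odd = [sum(row[0::2]) for row in data]   # items 1, 3, 5
even = [sum(row[1::2]) for row in data]  # items 2, 4, 6

print(round(pearson(first, last), 2))  # 0.51
print(round(pearson(odd, even), 2))    # -0.04
```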

Advantages of Split Half Reliability

One advantage is that it is so simple, both conceptually and computationally.  It’s easy enough that you can calculate it in Excel if you need to.  This also makes it easy to interpret and understand.

Another advantage, which I was taught in grad school, is that split half reliability assumes equivalence of the two halves that you have created; on the other hand, coefficient alpha is based at an item level and assumes equivalence of items.  This of course is never the case – but alpha is fairly robust and everyone uses it anyway.

Disadvantages… and the Spearman-Brown Formula

The major disadvantage is that this approach is evaluating half a test.  Because tests are more reliable with more items, having fewer items in a measure will reduce its reliability.  So if we take a 100 item test and divide it into two 50-item halves, then we are essentially making a quantification of reliability for a 50 item test.  This means we are underestimating the reliability of the 100 item test.  Fortunately, there is a way to adjust for this.  It is called the Spearman-Brown Formula.  This simple formula adjusts the correlation back up to what it should be for a 100 item test.
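The Spearman-Brown formula projects the reliability of a test lengthened by a factor k as r_adjusted = k*r / (1 + (k-1)*r); for two halves projected up to the full test, k = 2. A small sketch:

```python
# Spearman-Brown adjustment: project a half-test correlation up to the
# reliability of the full-length test (k = 2 for two halves).
def spearman_brown(r, k=2.0):
    return k * r / (1 + (k - 1) * r)

# e.g., the first-half/last-half correlation of 0.51 from the example above
print(round(spearman_brown(0.51), 3))  # 0.675
```

Note that a perfect half-test correlation stays at 1.0, and any r below 1.0 is adjusted upward, which matches the intuition that longer tests are more reliable.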

Another disadvantage was mentioned above: the different ways of splitting don’t always agree.  Again, fortunately, if you have a larger sample of people or a longer test, the variation is minimal.

OK, how do I actually implement?

Any good psychometric software will provide some estimates of split half reliability.  Below is the table of reliability analysis from Iteman.  This table actually continues for all subscores on the test as well.  You can download Iteman for free at its page and try it yourself.

This test had 100 items, of which 85 were scored (the other 15 being unscored pilot items).  The alpha was around 0.82, which is acceptable, though it should be higher for 100 items.  The results are then shown for all three split-half methods, and then again for the Spearman-Brown (S-B) adjusted version of each.  Do they agree with alpha?  For the total test, the results don't jibe for two of the three methods.  But for the scored items, the three S-B calculations align with the alpha value.  This is most likely because some of the 15 pilot items were actually quite bad.  In fact, note that the alpha for 85 items is higher than for 100 items – which says the 15 new items were actually hurting the test!

Reliability analysis Iteman

This is a good example of using alpha and split half reliability together.  We made an important conclusion about the exam and its items, merely by looking at this table.  Next, the researcher should evaluate those items, usually with P value difficulty and point-biserial discrimination.


A cutscore or passing point (also called a cut-off score or cutoff score) is a score on a test that is used to categorize examinees.  The most common example of this is pass/fail, which we are all familiar with from our school days.  For instance, a score of 70% and above will pass, while below 70% will fail.  However, many tests have more than one cutscore.  An example of this is the National Assessment of Educational Progress (NAEP) in the USA, which has 3 cutscores, creating 4 categories: Below Basic, Basic, Proficient, and Advanced.

The process of setting a cutscore is called a standard-setting study.  However, I dislike this term because the word “standard” is used to reflect other things in the assessment world.  In some cases, it is the definition of what is to be learned or covered (see Common Core State Standards), and in other cases it refers to the process of reducing construct-irrelevant variance by ensuring that all examinees take the test under standardized conditions (standardized testing).  So I prefer cutscore or passing point.  Even then, passing point is limited to the case of an exam with only one cutscore where the classifications are pass/fail, which is not always the case: not only are there many situations with more than one cutscore, but many two-category situations use other decisions, like Hire/NotHire, or a clinical diagnosis like Depressed/NotDepressed.

Types of cutscores

There are two types of cutscores, reflecting the two ways that a test score can be interpreted: Norm-referenced and criterion-referenced.

Criterion-referenced Cutscore

A cutscore of this type is referenced to the material of the exam, regardless of examinee performance.  In most cases, this is the sort of cutscore that you need to be legally defensible for high stakes exams.  Psychometricians have spent a lot of time inventing ways to do this, and scientifically studying them.

Names of some methods you might see for this type are: modified-Angoff, Nedelsky, and Bookmark.


An example of this is a certification exam.  If the cutscore is 75%, you pass.  In some months or years, this might be most candidates, in other months it might be fewer.  The standard does not change.  In fact, the organizations that manage such exams go to great lengths to keep it stable over time, a process known as equating.

Norm-referenced Cutscore

A cutscore of this type is referenced to the examinees, regardless of their mastery of the material.

A name you might see for this is a quota, such as when a test is used to accept only the top 10% of applicants.


An example of this was in my college Biology class.  It was a weeder class, to weed out the students who start college planning to be pre-med simply because they like the idea of being a doctor or are drawn to the potential salary.  So, the exams were intentionally made very hard, so that the average score might only be 50% correct.  They then awarded an A to anyone who had a z-score of 1.0 or greater, which is the top 15% of students – regardless of how well you actually scored on the exam.  You might get a score of 60% correct but be 95th percentile and get an A.
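The "top 15%" figure comes from the normal distribution: roughly 15.9% of a normal distribution lies above z = 1.0. A quick check, assuming normally distributed scores:

```python
# Sketch: proportion of a standard normal distribution above a z-score cutoff,
# assuming scores are approximately normally distributed.
import math

def upper_tail(z):
    # survival function of the standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

print(round(upper_tail(1.0) * 100, 1))  # 15.9
```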