What are cognitive diagnostic models?

Cognitive diagnostic models are an area of psychometric research that has seen substantial growth in the past decade, though the mathematics behind them dates back to MacReady and Dayton (1977).  The reason that they have been receiving more attention is that in many assessment situations, a simple overall score does not serve our purposes and we want a finer evaluation of the examinee’s skills or traits.  For example, the purpose of formative assessment in education is to provide feedback to students on their strengths and weaknesses, so an accurate map of these is essential.  In contrast, a professional certification/licensure test focuses on a single overall score with a pass/fail decision.

What are cognitive diagnostic models?

The predominant psychometric paradigm since the 1980s is item response theory, which is also known as latent trait theory.  Cognitive diagnostic models are part of a different paradigm known as latent class theory.  Instead of assuming that we are measuring a single neatly unidimensional factor, latent class theory tries to assign examinees to more qualitative groups by determining how they are categorized along a number of axes.

What this means is that the final “score” we hope to obtain on each examinee is not a single number, but a profile of which axes they have and which they do not.  The axes could be a number of different psychoeducational constructs, but are often used to represent cognitive skills examinees have learned.  Because we are trying to diagnose strengths vs. weaknesses, we call it a cognitive diagnostic model.

Example: Fractions

A classic example you might see in the literature is a formative assessment on dealing with fractions in mathematics.  Suppose you are designing such a test, and the curriculum includes these teaching points, which are fairly distinct skills or pieces of knowledge.

1. Find the lowest common denominator
2. Add fractions
3. Subtract fractions
4. Multiply fractions
5. Divide fractions
6. Convert mixed number to improper fraction

Now suppose this is one of the questions on the test.

What is 2 3/4 + 1 1/2?

This item utilizes skills 6, 1, and 2.  We can apply a similar mapping to all items, and obtain a table that looks like this.  Researchers call this the “Q Matrix.”

Item      1    2    3    4    5
Skill 1   1    0    1    0    0
Skill 2   1    0    0    1    0
Skill 3   0    1    0    0    1
Skill 4   0    0    1    0    0
Skill 5   0    0    0    1    0
Skill 6   1    1    0    0    1
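
To make the later examples concrete, the Q matrix can be stored as a simple binary array.  One common convention (used here, with purely illustrative variable names) is to transpose the table above so that rows are items and columns are skills:

```python
import numpy as np

# Q matrix from the table above, transposed so that rows are the
# 5 items and columns are the 6 skills (1 = item requires the skill).
Q = np.array([
    [1, 1, 0, 0, 0, 1],  # Item 1: skills 1, 2, 6
    [0, 0, 1, 0, 0, 1],  # Item 2: skills 3, 6
    [1, 0, 0, 1, 0, 0],  # Item 3: skills 1, 4
    [0, 1, 0, 0, 1, 0],  # Item 4: skills 2, 5
    [0, 0, 1, 0, 0, 1],  # Item 5: skills 3, 6
])
```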

So how do we obtain the examinee’s skill profile?

This is where the fun starts.  I used the plural “cognitive diagnostic models” because there are a number of available models, just as item response theory has the Rasch, 2-parameter, 3-parameter, generalized partial credit, and other models.  Choice of model is up to the researcher and depends on the characteristics of the test.

The simplest model is the DINA model, which has two parameters per item.  The slippage parameter s refers to the probability that a student will get the item wrong even if they do have all the required skills.  The guessing parameter g refers to the probability a student will get the item right if they do not have all the required skills.
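
In code, the DINA item response function is only a few lines.  Here is a minimal sketch (the function name and inputs are my own, not from any particular package):

```python
def dina_prob(profile, q_row, s, g):
    """Probability of a correct response under the DINA model.

    profile: examinee skill profile (0/1 list)
    q_row:   the item's row of the Q matrix (0/1 list)
    s:       slippage parameter, g: guessing parameter
    """
    # eta = 1 only if the examinee has every skill the item requires
    eta = all(a >= q for a, q in zip(profile, q_row))
    return (1 - s) if eta else g
```

An examinee with every required skill answers correctly with probability 1 - s; an examinee missing any required skill answers correctly with probability g.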

The mathematical calculations for determining the skill profile are complex, and are based on maximum likelihood.  To determine the skill profile, we need to first find all possible profiles, calculate the likelihood of each (based on item parameters and the examinee response vector), then select the profile with the highest likelihood.
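
For a test with K skills there are 2^K possible profiles, so for small K the search can simply be brute force.  Here is a sketch under the DINA model (function names are hypothetical, not from a real package):

```python
from itertools import product

def dina_likelihood(profile, responses, Q, s, g):
    """Likelihood of a 0/1 response vector given a skill profile (DINA)."""
    L = 1.0
    for x, q_row, s_i, g_i in zip(responses, Q, s, g):
        eta = all(a >= qk for a, qk in zip(profile, q_row))
        p = (1 - s_i) if eta else g_i  # P(correct) for this item
        L *= p if x == 1 else (1 - p)
    return L

def best_profile(responses, Q, s, g):
    """Maximum likelihood skill profile: try all 2^K profiles."""
    K = len(Q[0])
    return max(product([0, 1], repeat=K),
               key=lambda prof: dina_likelihood(prof, responses, Q, s, g))
```

Real software uses smarter estimation (EM, MCMC), but for a handful of skills this brute-force scoring is entirely workable.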

Calculations of item parameters are an order of magnitude more complex.  Again, compare to item response theory: brute force calculation of theta with maximum likelihood is complex, but can still be done using Excel formulas.  Item parameter estimation for IRT with marginal maximum likelihood can only be done by specialized software like Xcalibre.  For CDMs, item parameter estimation can be done in software like MPlus or R (see this article).

In addition to providing the most likely skill profile for each examinee, CDMs can also provide the probability that a given examinee has mastered each skill.  This can be extremely useful in certain contexts, like formative assessment.
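
One simple way to obtain such mastery probabilities is Bayes’ theorem: normalize the profile likelihoods into posterior probabilities, then sum across the profiles that contain each skill.  The sketch below assumes a uniform prior over profiles for simplicity; real implementations estimate the prior from data.

```python
from itertools import product

def mastery_probs(responses, Q, s, g):
    """Posterior probability that each skill is mastered, under the
    DINA model with a uniform prior over skill profiles."""
    K = len(Q[0])
    profiles = list(product([0, 1], repeat=K))
    likes = []
    for prof in profiles:
        L = 1.0
        for x, q_row, s_i, g_i in zip(responses, Q, s, g):
            eta = all(a >= qk for a, qk in zip(prof, q_row))
            p = (1 - s_i) if eta else g_i
            L *= p if x == 1 else (1 - p)
        likes.append(L)
    total = sum(likes)
    # P(skill k mastered) = sum of posteriors of profiles containing it
    return [sum(L / total for prof, L in zip(profiles, likes) if prof[k])
            for k in range(K)]
```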

How can I implement cognitive diagnostic models?

The first step is to analyze your data to evaluate how well CDMs work by estimating one or more of the models.  As mentioned, this can be done in software like MPlus or R.  Actually publishing a real assessment that scores examinees with CDMs is a greater hurdle.

Most tests that use cognitive diagnostic models are proprietary.  For example, a large K-12 education company might offer a bank of prefabricated formative assessments for students in grades 3-12.  That, of course, is what most schools need, because they don’t have a PhD psychometrician on staff to develop new assessments with CDMs.  And the testing company likely has several on staff.

On the other hand, if you want to develop your own assessments that leverage CDMs, your options are quite limited.  I recommend our FastTest platform for test development, delivery, and analytics.  You can sign up for a free account here.

I like this article by Alan Huebner, which talks about adaptive testing with the DINA model, but has a very informative introduction on CDMs.

Jonathan Templin, a professor at the University of Kansas, is one of the foremost experts on the topic.  Here is his website.  Lots of fantastic resources.

This article has an introduction to different CDM models, and guidelines on estimating parameters in R.

How do I implement item response theory?

I recently received an email from a researcher who wanted to implement item response theory, but was not sure where to start.  It occurred to me that there are plenty of resources out there which describe IRT but few, if any, that provide guidance for how someone new to the topic could apply it.  That is, there are plenty of resources that define the a-b-c parameters and discuss the item response function, but few that tell you how to calculate those parameters or what to do with them.

Why do I need to implement item response theory?

First of all, you might want to ask yourself this question.  Don’t use IRT just because you heard it is an advanced psychometric paradigm.  IRT was invented to address shortcomings in classical test theory, and works best in the situations where those shortcomings are highlighted.  For example, you might want to design adaptive tests, assemble parallel forms, or equate score scales across years.

What sort of tests/data work with IRT?

The next question you need to ask yourself is whether your test can work with IRT.  IRT assumes unidimensionality and local independence.  Unidimensionality means that all items intercorrelate highly and, from a factor analysis perspective, load highly on one primary factor.  Local independence means that items are independent of one another – so testlets and “innovative” item types that violate this might not work well.

IRT assumes that items are scored dichotomously (correct/incorrect) or polytomously (integer points where smarter or high-trait examinees earn higher points).  Surprisingly, this isn’t always the case.  This blog post explores how a certain PARCC item type violated the should-be-obvious assumption that smarter students earn higher points, a great example of pedagogues trying to do psychometrics.

And, of course, IRT has sample size requirements.  I’ve received plenty of email questions from people who wonder why Xcalibre doesn’t work on their data set… of 6 students.  Well, IRT requires at least 100 examinees for the simplest model and a minimum of around 1,000 for more complex models.  Six students isn’t even enough for classical test theory, for that matter.

How do I calculate IRT analytics?

Classical test theory is super-super-simple, so that anyone can easily calculate things like P, Rpbis, and coefficient alpha with Microsoft Excel formulas.  CITAS does this.  IRT calculations are much more complex, and it takes hundreds of lines of real code to estimate item parameters like a, b, and c.  I recommend the program Xcalibre to do so.  It has a straightforward, user-friendly interface and will automatically create MS Word reports for you.  If you are a member of the Rasch club, the go-to software is Winsteps.  You can also try R packages, but to do so you will need to learn to program in the R language, and the output is greatly inferior to commercial software.
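
To make the a-b-c parameters concrete, here is the 3PL item response function itself.  This is the standard formula found in any IRT text, not anything specific to Xcalibre:

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta:
    c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))
```

At theta = b, the probability is exactly halfway between c and 1, which is why b is called the difficulty (location) parameter.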

Some of the secondary analyses in IRT can be calculated easily enough that Excel formulas are an option.  The IRT Scoring Spreadsheet scores a single student with IRT item parameters you supply, in an interactive way that helps you learn how IRT scoring works. I also have a spreadsheet that helps you build parallel forms by calculating the test information function (TIF) and conditional standard error of measurement (CSEM).  However, my TestAssembler program does that with automation, saving you hours of manual labor.
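
The TIF and CSEM calculations behind such a spreadsheet are simple enough to sketch in a few lines.  For brevity this uses the 2PL information formula; the 3PL version adds a correction for the c parameter:

```python
import math

def item_info_2pl(theta, a, b):
    """Fisher information for a 2PL item at theta: a^2 * p * (1 - p)."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

def tif_and_csem(theta, items):
    """Test information function and conditional SEM at theta,
    given a list of (a, b) item parameter pairs."""
    tif = sum(item_info_2pl(theta, a, b) for a, b in items)
    return tif, 1 / math.sqrt(tif)  # CSEM = 1 / sqrt(TIF)
```

Summing item information across the form and comparing TIF curves is exactly how parallel forms are matched.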

There are also a few specific-use tools available on the web.  One of my favorites is IRTEQ, which performs conversion-style equating such as mean/sigma and Stocking-Lord.  That is, it links together scores from different forms of an exam onto a common scale, even if the forms are delivered in different years.
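
Mean/sigma linking itself boils down to two numbers: a slope and an intercept computed from the means and standard deviations of the common items’ b parameters on the two forms.  A sketch of just that piece (IRTEQ does much more, including Stocking-Lord):

```python
def mean_sigma(b_target_form, b_source_form):
    """Mean/sigma linking constants from common-item b parameters.

    Returns (A, B) such that theta_target = A * theta_source + B.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def sd(xs):
        m = mean(xs)
        return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

    A = sd(b_target_form) / sd(b_source_form)
    B = mean(b_target_form) - A * mean(b_source_form)
    return A, B
```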

So where do I go from here?

What is a rubric?

What is a rubric? It’s a rule for converting unstructured responses on an assessment into structured data that we can use psychometrically.

Why do we need rubrics?

Measurement is a quantitative endeavor.  In psychometrics, we are trying to measure things like knowledge, achievement, aptitude, or skills.  So we need a way to convert qualitative data into quantitative data.  We can still keep the qualitative data on hand for certain uses, but typically need the quantitative data for the primary use.  For example, writing essays in school will need to be converted to a score, but the teacher might also want to talk to the student to provide a learning opportunity.

How many rubrics do I need?

In some cases, a single rubric will suffice.  This is typical in mathematics, where the goal is a single correct answer.  In writing, the goal is often more complex.  You might be assessing writing and argumentative ability at the same time you are assessing language skills.  For example, you might have rubrics for spelling, grammar, paragraph structure, and argument structure – all on the same essay.

Examples

Spelling rubric for an essay

Points  Description
0       Essay contains 5 or more spelling mistakes
1       Essay contains 1 to 4 spelling mistakes
2       Essay does not contain any spelling mistakes
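
A rubric like this translates directly into a scoring function; a trivial sketch:

```python
def spelling_points(num_mistakes):
    """Convert a count of spelling mistakes to rubric points,
    per the spelling rubric table above."""
    if num_mistakes >= 5:
        return 0
    if num_mistakes >= 1:
        return 1
    return 2
```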

Argument rubric for an essay

“Your school is considering the elimination of organized sports.  Write an essay to provide to the School Board that provides 3 reasons to keep sports, with a supporting explanation for each.”

Points  Description
0       Student does not include any reasons with explanations (this includes providing 3 reasons but no explanations)
1       Student provides 1 reason with a clear explanation
2       Student provides 2 reasons with clear explanations
3       Student provides 3 reasons with clear explanations

Math rubric for a problem-solving item

Points  Description
0       Student provides no response or a response that does not indicate understanding of the problem.
1       Student provides a response that indicates understanding of the problem, but does not arrive at the correct answer OR provides the correct answer but no supporting work.
2       Student provides a response with the correct answer and supporting work that explains the process.

How do I score tests with a rubric?

Well, the traditional approach is to just take the integers supplied by the rubric and add them to the number-correct score. This is consistent with classical test theory, and therefore fits with conventional statistics such as coefficient alpha for reliability and Pearson correlation for discrimination. However, the modern paradigm of assessment is item response theory, which analyzes the rubric data much more deeply and applies advanced mathematical modeling like the generalized partial credit model (Muraki, 1992; resources on that here and here).
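
For reference, the generalized partial credit model gives the probability of each rubric category as a function of theta.  Here is the standard formula in code (parameter names are generic, not tied to any package):

```python
import math

def gpcm_probs(theta, a, steps):
    """Category probabilities under the generalized partial credit model.

    steps: step difficulty parameters b_1..b_m for an item with
    m + 1 score categories (0..m).
    """
    # cumulative sums of a * (theta - b_j); category 0 contributes 0
    sums = [0.0]
    for b in steps:
        sums.append(sums[-1] + a * (theta - b))
    exps = [math.exp(s) for s in sums]
    total = sum(exps)
    return [e / total for e in exps]
```

A high-theta examinee concentrates probability in the top categories; summing k * P(k) gives the expected rubric score for that examinee.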

How can I efficiently implement rubrics?

It is much easier to implement rubrics if your online assessment platform supports them in an online marking module, especially if the platform already has integrated psychometrics like the generalized partial credit model.  Check out this blog post to learn more.

6 Reasons for Online Essay Marking

Why online essay marking? Essay questions and other extended constructed response (ECR) items remain a mainstay of educational assessment.  From a purely psychometric perspective, they are usually not beneficial – that is, from an item response theory paradigm, the amount of information added per minute of testing time will be less than other item types – but because ECR items have extensive face validity for assessing deeper constructs or other aspects not easily measured by traditional item types, there will likely always be a place for them.

So, if we are going to keep using ECR items such as those used on PARCC, PISA, and other assessments, it would behoove us to find more effective ways of implementing them.  The days of a handwritten essay on paper, graded on a global rubric, are long gone.  To get the most out of ECR items, you need to be leveraging both technology and psychometrics.  For this reason, we’ve built an online essay marking system that’s directly integrated into our educational assessment platform, FastTest.  So how is using such a system better than the old paper-based methods?

1. It’s more efficient… way more efficient

Raters can rate more responses per hour online than on paper, and you are eliminating manual processes that would take a lot of time, such as managing and distributing the stacks of paper essays, gathering the bubble sheets with marks, scanning the bubble sheets, and uploading the results of the scan into the system. When marking online, the marks are immediately entered into the database.  How much more efficient is it?  One of our clients estimated it provided a 60% reduction in the number of teacher hours needed to complete all their essay marking.

2. Faster turnaround

Because of the increase in efficiency, and the fact that marks are stored directly in the database, the time to turn around scores for reporting to students will be drastically reduced.  The 60% reduction mentioned above was only for actual essay marking time – when you include the time saved for shipping/distributing papers and gathering/scanning the bubble sheets, the overall turnaround time for student scores will be reduced by 80% or more.

3. You can use remote raters

Are some teachers home sick?  Has summer/spring/winter break already started?  Are you working with a national or statewide test?  Do you want essays marked by markers in a location different than the students?  All of these are much easier to do when markers can simply log in online.

4. It saves paper

This one is fairly obvious.  You’ll easily be saving tens of thousands of sheets of paper, even for a relatively small project.

5. Easier to flag responses

Do you flag responses for use as anchor items next year, or if you see a possible critical student issue?  This is done automatically in our online essay marking module.

6. Integrated psychometrics and real-time reporting

Because all the numbers are going directly into the database, you can act on those numbers immediately.  Want to track the performance of your markers by calculating inter-rater reliability and agreement every day, as well as velocity?  No problem.  Want to export the data matrix as soon as marking is complete so you can run an item response theory calibration with Xcalibre, then score all the students with IRT the next day?  Once again, the system is built for that.  You can attach IRT parameters to your rubrics, ensuring a stronger test with better student feedback.  And pretty soon, we’ll have Xcalibre right there in the online platform too!
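
The inter-rater statistics mentioned here are straightforward to compute once marks sit in a database.  A sketch of two common ones, exact percent agreement and Cohen's kappa:

```python
from collections import Counter

def percent_agreement(marks1, marks2):
    """Proportion of responses where two raters gave the same mark."""
    return sum(a == b for a, b in zip(marks1, marks2)) / len(marks1)

def cohens_kappa(marks1, marks2):
    """Cohen's kappa: observed agreement corrected for chance."""
    n = len(marks1)
    po = percent_agreement(marks1, marks2)
    c1, c2 = Counter(marks1), Counter(marks2)
    # expected chance agreement from each rater's marginal distribution
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2
    return (po - pe) / (1 - pe)
```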

What is Item Banking? What are item banks?

Item banking refers to the purposeful creation of a database of assessment items to serve as a central repository of all test content, improving efficiency and quality. The term item refers to what many call questions, though their content need not be restricted as such: items can include problems to solve or situations to evaluate in addition to straightforward questions. As a critical part of the test development cycle, item banking is the foundation for the development of valid, reliable content and defensible test forms.

Automated item banking systems, such as Assess.ai or FastTest, result in significantly reduced administrative time for developing/reviewing items and assembling/publishing tests.  Contact us to request a free account.

What is Item Banking?

While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Here are the essentials you should be looking for:

Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally, item performance should be tracked not only within a test form but across test forms as well.

Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. Upon exposure to many examinees, items can then be flagged for retirement or revision to reduce cheating or teaching to the test.

Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization methods, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.

Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.

Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content and the users who made these changes are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item and expedite that process.

Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, IRT parameters, and CTT statistics, but there are likely many data points specific to your organization that are worth storing.

Managing an Item Bank

Names are important. As you create or import your item banks it is important to identify each item with a unique, but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting.  You might want to also add additional pieces of information.  If importing, the system should be smart enough to recognize duplicates.

Search and filter. The system should also have a reliable sorting mechanism.

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability and validity.

Statistics are used in the assembly of test forms: classical statistics can be used to estimate a form’s mean, standard deviation, reliability, standard error, and pass rate, while item response theory parameters come in handy when calculating test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
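
For example, a form’s expected number-correct mean is simply the sum of its items’ classical P values.  A rough pre-assembly projection might look like the sketch below; note that the SD shown assumes independent items, so treat it as a lower bound (correlated items push the real SD higher):

```python
import math

def projected_form_stats(p_values):
    """Project a form's number-correct mean from classical item
    difficulties (P values). The SD assumes independent items,
    so it is only a lower bound for the real form SD."""
    mean = sum(p_values)
    sd = math.sqrt(sum(p * (1 - p) for p in p_values))
    return mean, sd
```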

In the event that your organization decides to publish an adaptive test, utilizing CAT delivery, item parameters for each item will be essential. This is because they are used for intelligent selection of items and scoring of examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility, and the storage of metadata is one such vital piece of documentation.

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test-takers, as they are likely to be diverse as well!

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. For a high school science test, this may include an evaluation of the content taught in class. A high-stakes certification exam almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed.

Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.

Why Item Banking?

There is no doubt that item banking is a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment. Moreover, good item banking will make the job easier and more efficient thus reducing the cost of item development and test publishing.

Item Writing Tips

With so many things to consider, it’s no wonder psychometricians often recommend the retirement of poor-performing items. Here are some of the most common issues we see, along with our tried-and-true methods for designing good, psychometrically sound items.  We could all use some reminders on good item writing tips.  These are made easier if you use a strong item banking system like FastTest.

Issue: The key is invalid due to multiple correct answers.
Recommendation: Consider each answer option individually; the key should be fully correct, with each distractor being fully incorrect.

Issue: The item was written in a hard-to-comprehend way; examinees were unable to apply their knowledge because of poor wording.
Recommendation: Ensure that the item can be understood after just one read-through. If you have to read the stem multiple times, it needs to be rewritten.

Issue: Grammar, spelling, or syntax errors direct savvy test-takers toward the correct answer (or away from incorrect answers).
Recommendation: Read the stem, followed by each answer option, aloud. Each answer option should fit with the stem.

Issue: Information was introduced in the stem text that was not relevant to the question.
Recommendation: After writing each question, evaluate the content of the stem. It should be clear and concise, without introducing irrelevant information.

Issue: The item emphasizes trivial facts.
Recommendation: Work off of a test blueprint to ensure that each of your items maps to a relevant construct. If you are using Bloom’s taxonomy or a similar approach, items should be from higher-order levels.

Issue: Numerical answer options overlap.
Recommendation: Carefully evaluate numerical ranges to ensure there is no overlap among options.

Issue: Examinees noticed the answer was most often A.
Recommendation: Distribute the key evenly among the answer options. This can be avoided with FastTest’s randomized delivery functionality.

Issue: The key was overly specific compared to the distractors.
Recommendation: Answer options should all be about the same length and contain the same amount of information.

Issue: The key was the only option to include a key word from the item stem.
Recommendation: Avoid re-using key words from the stem text in your answer options. If you do use such words, distribute them evenly among all of the answer options so as not to call out individual options.

Issue: A rare exception can be argued to invalidate a true/false always/never question.
Recommendation: Avoid using “always” or “never,” as there can be unanticipated or rare scenarios. Opt for less absolute terms like “most often” or “rarely.”

Issue: The distractors were not plausible, so the key was obvious.
Recommendation: Review each answer option and ensure that it has some bearing in reality. Distractors should be plausible.

Issue: An idiom or jargon was used; non-native English speakers did not understand.
Recommendation: It is best to avoid figures of speech; keep the stem text and answer options literal to avoid introducing undue discrimination against certain groups.

Issue: The key was significantly longer than the distractors.
Recommendation: There is a strong tendency to write a key that is very descriptive. Be wary of this, and evaluate distractors to ensure that they are approximately the same length.

What is Scaling?

I often hear this question about scaling, especially regarding the scaled scoring functionality found in software like FastTest and Xcalibre.  The following is adapted from lecture notes I wrote while teaching a course in Measurement and Assessment at the University of Cincinnati.

Scaling: Sort of a Tale of Two Cities

Scaling at the test level really has two meanings in psychometrics. First, it refers to defining the method used to operationally score the test, establishing an underlying scale on which people are being measured.  It also refers to score conversions used for reporting scores, especially conversions that are designed to carry specific information.  The latter is typically called scaled scoring.

You have all been exposed to this type of scaling, though you might not have realized it at the time. Most high-stakes tests like the ACT, SAT, GRE, and MCAT are reported on scales that are selected to convey certain information, with the actual numbers selected more or less arbitrarily. The SAT and GRE have historically had a nominal mean of 500 and a standard deviation of 100, while the ACT has a nominal mean of 18 and standard deviation of 6. These are actually the same scale, because they are nothing more than a converted z-score (standard or zed score), simply because no examinee wants to receive a score report that says you got a score of -1. The numbers above were arbitrarily selected, and then the score range bounds were selected based on the fact that 99.7% of the population is within plus or minus three standard deviations. Hence, the SAT and GRE range from 200 to 800 and the ACT ranges from 0 to 36. This leads to the urban legend of receiving 200 points for writing your name correctly on the SAT; again, it feels better for the examinee. A score of 300 might seem like a big number and 100 points above the minimum, but it just means that someone is in roughly the 2nd percentile.

Now, notice that I said “nominal.” I said that because the tests do not actually have those means observed in samples, because the samples have substantial range restriction. Because these tests are only taken by students serious about proceeding to the next level of education, the actual sample is of higher ability than the population. The lower third or so of high school students usually do not bother with the SAT or ACT. So many states will have an observed average ACT of 21 and standard deviation of 4. This is an important issue to consider in developing any test. Consider just how restricted the population of medical school students is; it is a very select group.

How can I select a score scale?

For various reasons, actual observed scores from tests are often not reported, and only converted scores are reported.  If there are multiple forms which are being equated, scaling will hide the fact that the forms differ in difficulty, and in many cases, differ in cutscore.  Scaled scores can facilitate feedback.  They can also help the organization avoid explanations of IRT scoring, which can be a headache to some.

When deciding on the conversion calculations, there are several important questions to consider.

First, do we want to be able to make fine distinctions among examinees? If so, the range should be sufficiently wide. My personal view is that the scale should be at least as wide as the number of items; otherwise you are voluntarily giving up information. This in turn means you are giving up variance, which makes it more difficult to correlate your scaled scores with other variables, the way the MCAT is correlated with success in medical school. This, of course, means that you are hampering future research – unless that research is able to revert back to actual observed scores to make sure all possible information is used. For example, suppose a test with 100 items is reported on a 5-point grade scale of A-B-C-D-F. That scale is quite restricted, and therefore difficult to correlate with other variables in research. But you have the option of reporting the grades to students and still using the original scores (0 to 100) for your research.

Along the same lines, we can swing completely in the other direction. For many tests, the purpose of the test is not to make fine distinctions, but only to broadly categorize examinees. The most common example of this is a mastery test, where the examinee is being assessed on their mastery of a certain subject, and the only possible scores are pass and fail. Licensure and certification examinations are an example. An extension of this is the “proficiency categories” used in K-12 testing, where students are classified into four groups: Below Basic, Basic, Proficient, and Advanced. This is used in the National Assessment of Educational Progress (http://nces.ed.gov/nationsreportcard/). Again, we see the care taken for reporting of low scores; instead of receiving a classification like “nonmastery” or “fail,” the failures are given the more palatable “Below Basic.”

Another issue to consider, which is very important in some settings but irrelevant in others, is vertical scaling. This refers to the chaining of scales across various tests that are at quite different levels. In education, this might involve linking the scales of exams in 8th grade, 10th grade, and 12th grade (graduation), so that student progress can be accurately tracked over time. Obviously, this is of great use in educational research, such as the medical school process. But for a test to award a certification in a medical specialty, it is not relevant because it is really a one-time deal.

Lastly, there are three calculation options: pure linear (ScaledScore = RawScore * Slope + Intercept), standardized conversion (old mean/SD to new mean/SD), and nonlinear approaches like equipercentile equating.
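
The first two options are one-liners; here is what they look like in code, with clipping to the reporting bounds in the spirit of the SAT example above:

```python
def linear_scaled(raw, slope, intercept, lo, hi):
    """Pure linear conversion, clipped to the reporting bounds."""
    return min(hi, max(lo, raw * slope + intercept))

def standardized_scaled(raw, old_mean, old_sd, new_mean, new_sd):
    """Standardized conversion: map the old mean/SD to a new mean/SD."""
    z = (raw - old_mean) / old_sd
    return new_mean + z * new_sd
```

For instance, a raw score one standard deviation above the old mean lands exactly one new standard deviation above the new mean (e.g., 600 on a 500/100 scale).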

Perhaps the most important issue is whether the scores from the test will be criterion-referenced or norm-referenced. Often, this choice will be made for you because it distinctly represents the purpose of your tests. However, it is quite important and usually misunderstood, so I will discuss this in detail.

Criterion-Referenced vs. Norm-Referenced

This is a distinction between the ways test scores are used or interpreted. A criterion-referenced score interpretation means that the score is interpreted with regard to defined content, a blueprint, or a curriculum (the criterion), and ignores how other examinees perform (Bond, 1996). A classroom assessment is the most common example; students are scored on the percent of items correct, which is taken to imply the percent of the content they have mastered. Conversely, a norm-referenced score interpretation is one where the score provides information about the examinee’s standing in the population, but no absolute (or ostensibly absolute) information regarding their mastery of content. This is often the case with non-educational measurements like personality or psychopathology. There is no defined content which we can use as a basis for some sort of absolute interpretation. Instead, scores are often either z-scores or some linear function of z-scores. IQ, for example, is historically scaled with a mean of 100 and a standard deviation of 15.

It is important to note that this dichotomy is not a characteristic of the test, but of the test score interpretations. This fact is more apparent when you consider that a single test or test score can have several interpretations, some of which are criterion-referenced and some of which are norm-referenced. We will discuss this deeper when we reach the topic of validity, but consider the following example. A high school graduation exam is designed to be a comprehensive summative assessment of a secondary education. It is therefore specifically designed to cover the curriculum used in schools, and scores are interpreted within that criterion-referenced context. Yet scores from this test could also be used for making acceptance decisions at universities, where scores are only interpreted with respect to their percentile (e.g., accept the top 40%). The scores might even do a fairly decent job at this norm-referenced application. However, this is not what they are designed for, and such score interpretations should be made with caution.

Another important note is the definition of “criterion.” Because most tests with criterion-referenced scores are educational and involve a cutscore, a common misunderstanding is that the cutscore is the criterion. It is still the underlying content or curriculum that is the criterion, because we can have this type of score interpretation without a cutscore. Regardless of whether there is a cutscore for pass/fail, a score on a classroom assessment is still interpreted with regard to mastery of the content. To further add to the confusion, industrial/organizational psychology refers to outcome variables as the criterion; for a pre-employment test, the criterion is typically job performance at a later time.

This dichotomy also leads to some interesting thoughts about the nature of your construct. If you have a criterion-referenced score, you are assuming that the construct is concrete enough that anybody can make interpretations regarding it, such as mastering a certain percentage of content. This is why non-concrete constructs like personality tend to be only norm-referenced. There is no agreed-upon blueprint of personality.

Multidimensional Scaling

An advanced topic worth mentioning is multidimensional scaling (see Davison, 1998). The purpose of multidimensional scaling is similar to factor analysis (a later discussion!) in that it is designed to evaluate the underlying structure of constructs and how they are represented in items. This is therefore useful if you are working with constructs that are brand new, so that little is known about them, and you think they might be multidimensional. This is a pretty small percentage of the tests out there in the world; I encountered the topic in my first year of graduate school – only because I was in a Psychological Scaling course – and have not encountered it since.

Summary of scaling

Scaling is the process of defining the scale on which your measurements will take place. It raises fundamental questions about the nature of the construct. Fortunately, in many cases we are dealing with a simple construct that has well-defined content, like an anatomy course for first-year medical students. Because it is so well-defined, we often take criterion-referenced score interpretations at face value. But as constructs become more complex, like the job performance of a first-year resident, it becomes harder to define the scale, and we start to deal more in relatives than absolutes. At the other end of the spectrum are completely ephemeral constructs where researchers still can’t agree on the nature of the construct, and we are pretty much limited to z-scores. Intelligence is a good example of this.

Some sources attempt to delineate the scaling of people and the scaling of items or stimuli as separate things, but they are thoroughly confounded: people define item statistics (the percent of people that get an item correct), and items define people’s scores (the percent of items a person gets correct). It is for this reason that IRT, the most advanced paradigm in measurement theory, was designed to place items and people on the same scale. It is also why item writing should consider how items will be scored and therefore how they lead to person scores. But because we start writing items long before the test is administered, and the nature of the construct is caught up in the scale, the issues presented here need to be addressed at the very beginning of the test development cycle.
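The confounding is easy to see in code: with a small hypothetical response matrix, item p-values are column proportions over people, and person scores are row proportions over items.

```python
# Illustration with hypothetical data: item difficulty (p-value) is the
# percent of people answering an item correctly, while a person's score
# is the percent of items that person answers correctly.

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]

n_people = len(responses)
n_items = len(responses[0])

# Item p-values: proportion of people correct on each item (column means)
p_values = [sum(row[j] for row in responses) / n_people
            for j in range(n_items)]

# Person scores: proportion of items correct for each person (row means)
scores = [sum(row) / n_items for row in responses]

print("item p-values:", p_values)   # [0.75, 0.5, 0.25, 1.0]
print("person scores:", scores)     # [0.75, 0.5, 1.0, 0.25]
```

The two sets of statistics are computed from the very same matrix, just summed along different axes, which is why they cannot be treated as independent.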

SIFT: Software for Investigating Test Fraud

Test fraud is an extremely common occurrence; we’ve all seen news stories about cheating scandals.  However, there are very few defensible tools to help detect it.  I once saw a webinar from an online testing provider that proudly touted their reports on test security… but it turned out that all they provided was a simple export of student answers that you could subjectively read and use to form conjectures.  The goal of SIFT is to provide a tool that implements real statistical indices from the corpus of scientific research on statistical detection of test fraud, yet is user-friendly enough to be used by someone without a PhD in psychometrics and experience in data forensics.  SIFT provides more collusion indices and other analyses than any other software on the planet, making it the standard in the industry from the day of its release.  The science behind SIFT is also being implemented in our world-class online testing platform, FastTest.  It is also worth noting that FastTest supports computerized adaptive testing, which is known to increase test security.

What is Test Fraud?

As long as tests have been around, people have been trying to cheat them.  This is only natural; anytime there is a system with some sort of stakes or incentive involved (and maybe even when there is not), people will try to game that system.  Note that the root culprit is the system itself, not the test.  Blaming the test is just shooting the messenger.  However, in most cases, the system serves a useful purpose.  In the realm of assessment, that means that K-12 assessments provide useful information on curricula and teachers, certification tests identify qualified professionals, and so on.  In such cases, we must minimize the amount of test fraud in order to preserve the integrity of the system.

When it comes to test fraud, the old cliche is true: an ounce of prevention is worth a pound of cure.  You’ll undoubtedly see that phrase at conferences and in other resources.  So I of course recommend that your organization implement reasonable preventative measures to deter test fraud.  Nevertheless, there will still always be some cases.  SIFT is intended to help find those.  Also, some examinees might also be deterred by the knowledge that such analysis is even being done.

How can SIFT help me with statistical detection of test fraud?

Like other psychometric software, SIFT does not interpret results for you.  For example, software for item analysis like Iteman and Xcalibre do not specifically tell you which items to retire or revise, or how to revise them.  But they provide the output necessary for a practitioner to do so.  SIFT provides you a wide range of output that can help you find different types of test fraud, like copying, proctor help, suspect test centers, brain dump usage, etc.  It can also help find other issues, like low examinee motivation.  But YOU have to decide what is important to you regarding statistical detection of test fraud, and look for relevant evidence.  More information on this is provided in the manual, but here is a glimpse.

First, there are a number of intra-individual indices to evaluate.  Consider the third examinee here: they took less than half the time of most examinees, had a very low score, and were flagged for answering Option 4 too often – likely a case of a student giving up and answering D for most of the test.
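A rough sketch of such a flag might look like the following. The data, thresholds, and flagging rule here are illustrative assumptions, not SIFT’s actual indices:

```python
# Hypothetical sketch of a simple intra-individual flag: mark examinees
# whose testing time is unusually short AND who chose one response option
# far more often than expected. Thresholds are illustrative assumptions.

examinees = [
    {"id": "E1", "minutes": 42, "score": 0.78, "answers": "ABDCABDCAB"},
    {"id": "E2", "minutes": 45, "score": 0.65, "answers": "BADCBACDBA"},
    {"id": "E3", "minutes": 18, "score": 0.20, "answers": "DDDDDADDDD"},
]

# Median testing time across the group
median_time = sorted(e["minutes"] for e in examinees)[len(examinees) // 2]

flags = []
for e in examinees:
    # Proportion of responses using the single most common option
    option_counts = {c: e["answers"].count(c) for c in "ABCD"}
    max_option_rate = max(option_counts.values()) / len(e["answers"])
    # Flag: under half the median time, and one option used > 60% of the time
    if e["minutes"] < 0.5 * median_time and max_option_rate > 0.6:
        flags.append(e["id"])

print(flags)  # ['E3']
```

Only the third examinee trips both conditions, matching the giving-up pattern described above.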

A certification organization could use SIFT to look for evidence of brain dump makers and takers by evaluating similarity between examinee response vectors and answers from a brain dump site – especially if those were intentionally seeded by the organization!  We also might want to find adjacent examinees or examinees in the same location that group together in the collusion index output.  Unfortunately, these indices can differ substantially in their conclusions.
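The seeded-brain-dump idea can be sketched as a simple match rate against the planted key; the data and threshold below are hypothetical, and real collusion indices are considerably more sophisticated:

```python
# Hypothetical sketch: compare each examinee's response vector to answers
# intentionally seeded on a brain dump site. A very high match rate
# (threshold here is an illustrative assumption) suggests dump usage.

seeded_key = "BDACBDAC"  # planted answers, some deliberately wrong

response_vectors = {
    "E1": "BDACBDAC",   # matches the seeded key exactly -> suspicious
    "E2": "ADCCBAAC",
    "E3": "BDCABDCA",
}

def match_rate(responses, key):
    """Proportion of items where the response equals the seeded answer."""
    return sum(r == k for r, k in zip(responses, key)) / len(key)

suspects = [eid for eid, resp in response_vectors.items()
            if match_rate(resp, seeded_key) >= 0.9]
print(suspects)  # ['E1']
```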

Finally, we can roll up many of these statistics to the group level.  Below is an example that provides a portion of SIFT output regarding teachers.  Note that Gutierrez has suspiciously high scores without having spent much more time.  Cheating?  Possibly.  On the other hand, that is the smallest N, so perhaps the teacher just had a group of accelerated students.  Worthington also had high scores but with notably shorter times – perhaps the teacher was helping?

These are only the descriptive statistics – this doesn’t even touch on the collusion indices yet!

The Story of SIFT

I started SIFT in 2012.  Years ago, ASC sold a software program called Scrutiny!  We had to stop selling it because it did not work on recent versions of Windows, but we still received inquiries for it.  So I set out to develop a program that could perform the analysis from Scrutiny! (the Bellezza & Bellezza index) but also much more.  I quickly finished a few collusion indices and planned to publish SIFT in March 2013, as my wife and I were expecting our first child on March 25.  Alas, he arrived a full month early and all plans went out the window!  Then unfortunately I had to spend a few years dealing with the realities of business, wasting hundreds of hours in pointless meetings and other pitfalls.  I finally set a goal to release SIFT before the second child arrived in July 2016.  I unfortunately failed at that too, but the delay this time was 3 weeks, not 3 years.  Whew!

Version 1.0 of SIFT includes 10 collusion indices (5 probabilistic, 5 descriptive), response time analysis, group level analysis, and much more to aid in the statistical detection of test fraud.  This is obviously not an exhaustive list of the analyses from the literature, but still far surpasses other options for the practitioner, including the choice to write all your own code.  Suggestions?  I’d love to hear them.  Email me at nthompson@54.89.150.95.

Where does the “Opt Out of testing” movement come from? You’d be surprised.

The “opt out” movement is a supposedly-grass-roots movement against K-12 standardized testing, primarily focusing action on encouraging parents to refuse to allow their kids to take tests, i.e., opt out of testing.  The absolutely bizarre part of this is that large scale test scores are rarely used for individual impact on the student, and that tests take up only a tiny fraction of school time throughout the year.  An extremely well-written paper was recently released that explored this befuddling situation, written by Randy E. Bennett at Educational Testing Service (ETS).  Dr. Bennett is an internationally-renowned researcher whose opinion is quite respected.  He came to an interesting conclusion about the opt out of testing topic.

After a brief background, he states the situation:

Despite the fact that reducing testing time is a recurring political response, the evidence described thus far suggests that the actual time devoted to testing might not provide the strongest rationale for opting out, especially in the suburban low-poverty schools in which test refusal appears to occur more frequently.

A closer look at New York, the state with the highest opt-out rates, found a less obvious but stronger relationship (page 7):

It appears to have been the confluence of a revamped teacher evaluation system with a dramatically harder, Common Core-aligned test that galvanized the opt-out movement in New York State (Fairbanks, 2015; Harris & Fessenden, 2015; PBS Newshour, 2015). For 2014, 96% of the state’s teachers had been rated as effective or highly effective, even though only 31% of students had achieved proficiency in ELA and only 36% in mathematics (NYSED, 2014; Taylor, 2015). These proficiency rates were very similar to ones achieved on the 2013 NAEP for Grades 4 and 8 (USDE, 2013a, 2013b, 2013c, 2013d). The rates were also remarkably lower than on New York’s pre-Common-Core assessments. The new rates might be taken to imply that teachers were doing a less-than-adequate job and that supervisors, perhaps unwittingly, were giving them inflated evaluations for it.

That view appears to have been behind a March 2015 initiative from New York Governor Andrew Cuomo (Harris & Fessenden, 2015; Taylor, 2015). At his request, the legislature reduced the role of the principal’s judgment, favored by teachers, and increased from 20% to 50% the role of test-score growth indicators in evaluation and tenure decisions (Rebora, 2015). As a result, the New York State United Teachers union urged parents to boycott the assessment so as to subvert the new teacher evaluations and disseminated information to guide parents specifically in that action (Gee, 2015; Karlin, 2015).

I am certainly sympathetic to the issues facing teachers today, being the son of two teachers and having a sibling who is a teacher, as well as having wanted to be a high school teacher myself until I was 18.  The lack of resources and low pay facing most educators is appalling.  However, the situation described above is simply an extension of the soccer syndrome that many in our society decry: all kids should be allowed to play and rewarded equally, merely for participation and not performance.  With no measure of performance, there is no external impetus to perform – and we all know the role that motivation plays in performance.

It will be interesting to see the role that the Opt Out of Testing movement plays in the post-NCLB world.

Doing it Right: Selecting an Online Testing Platform

If you are looking around for an online testing platform, you have certainly discovered there are many out there.  How do you choose?  Well, for starters, it is important to understand that the range of quality and sophistication is incredible.  This post will help you identify some salient features and benefits that you should consider in first establishing your requirements and then selecting an online testing platform.

Types of online testing platforms

There are many, many systems in existence that can perform some sort of online assessment delivery.  From a high level, they differ substantially in terms of sophistication.

1. There are, of course, many survey engines that are designed for just that: unproctored, unscored surveys.  No stakes involved, no quality required.
2. At a slightly higher level are platforms for simple or formative assessment.  For example, a learning management system will typically include a component to deliver multiple choice questions.  However, this is obviously not their strength – you should use an LMS for assessment no more than you should use an assessment platform as an LMS.
3. At the top end of the spectrum are assessment platforms that are designed for real assessment.  That is, they implement best practices in the field, like Angoff studies, item response theory, and adaptive testing.  The type of assessment effort being done by school teachers is quite different from that being done by a company producing high-stakes international tests.  FastTest is an example of such a platform.

This post describes some of the aspects that separate the third level from the lower two levels.  If you need these aspects, then you likely need a “Level 3” system.  Of course, there are many testing situations where such a high quality system is complete overkill.

Another consideration is that many testing platforms are closed-content.  That is, they are 100% proprietary and used only within the organization that built it.  You likely need an open-content system, which allows anyone to build and deliver tests.

Test development aspects

Reusable items: If you write an item for this semester’s test, you should be able to easily reuse it next semester.  Or let another instructor use it.  Surprisingly, many systems do not have this basic functionality.

Item metadata: All items should have extensive metadata fields, such as author, content area, depth of knowledge, etc.  The system should also store item statistics or IRT parameters, and more importantly, actually use them.  For example, test form assembly should utilize item statistics to evaluate form difficulty and reliability.

Form assembly: The system needs advanced functionality for form assembly, including extensive search functionality, automation, support for parallel forms, etc.

Publication options: The system needs to support all the publication situations that your organization might use: paper vs. online, implementation of time limits, control of widgets like calculators, etc.

Item types: The field of assessment is moving rapidly toward technology-enhanced items (TEIs) in an effort to assess more deeply and authentically.  While this brings challenges, online testing platforms obviously need to support them.

Best practices in online testing

Supports standards: The workflow and reports of the system should facilitate the following of industry standards like APA/AERA/NCME, NCCA, ANSI, and ITC.

Psychometrics:  Psychometrics is the Science of Assessment.  The system needs to implement real psychometrics like item response theory, computerized adaptive testing, distractor analysis, and test fraud analysis.  Note: just looking at P values and point-biserials is extremely outdated.

Logs: The online testing platform should log user and examinee activity, and have it available for queries and reports.

Reporting: You need reports on various aspects of the system, including item banks, users, tests, examinees, and domains.

System aspects

Security: Security is obviously essential to high stakes testing organizations.  There are many aspects though: user roles, content access control, browser lockdown, options for virtual proctoring, examinee access, proctor management, and more.

Reliability: The online testing platform needs to have very little downtime.

Scalability: The online testing platform needs to be able to scale up to large volumes.

Configurability: The functionality throughout the system needs to be configurable to meet the needs of your organization and individual users.