A Standard Setting Study is a formal process for establishing a performance standard. In the assessment world, there are actually two uses of the word standard – the other one refers to a formal definition of the content that is being tested, such as the Common Core State Standards in the USA. For this reason, I prefer the term cutscore study.

After item authoring, item review, and test form assembly, a cutscore or passing score will often be set to determine what level of performance qualifies as “pass” or a similar classification.  This cannot be done arbitrarily (e.g., setting it at 70% because that’s what you saw when you were in school).  To be legally defensible and eligible for accreditation, it must be done using one of several standard setting approaches from the psychometric literature.  The choice of method depends upon the nature of the test, the availability of pilot data, and the availability of subject matter experts.

Some types of Cutscore Studies:

  • Angoff – In an Angoff study, a panel of subject matter experts rates each item, estimating the percentage of minimally competent candidates that would answer each item correctly.  It is often done in tandem with the Beuk Compromise.  The Angoff method does not require actual examinee data, though the Beuk does.
  • Bookmark – The bookmark method orders the items in a test form in ascending difficulty, and a panel of experts reads through and places a “bookmark” in the book where they think a cutscore should be.  Obviously, this requires enough real data to calibrate item difficulty, usually using item response theory, which requires several hundred examinees.
  • Contrasting Groups – Candidates are sorted into Pass and Fail groups based on their performance on a different exam or some other unrelated standard.  If using data from another exam, a sample of at least 50 candidates is obviously needed.
  • Borderline Group – Similar to Contrasting Groups, but a borderline group is defined using alternative information such as biodata, and the scores of the group are evaluated.
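As a rough illustration of the Angoff arithmetic, here is a minimal sketch. It assumes the common convention of averaging the panel's ratings for each item and summing those averages to get a raw cutscore; the ratings and the function name `angoff_cutscore` are hypothetical, made up for this example.

```python
from statistics import mean

def angoff_cutscore(ratings):
    """Raw-score cutscore from Angoff ratings.

    ratings[r][i] is rater r's estimate (0-100) of the percentage of
    minimally competent candidates who would answer item i correctly.
    """
    n_items = len(ratings[0])
    # Average the panel's ratings for each item, converted to a proportion.
    item_means = [mean(rater[i] for rater in ratings) / 100 for i in range(n_items)]
    # The sum of the item-level expectations is the expected raw score
    # of a minimally competent candidate.
    return sum(item_means)

# Hypothetical ratings: 3 raters x 5 items.
ratings = [
    [70, 60, 85, 50, 90],
    [65, 55, 80, 60, 95],
    [75, 65, 90, 40, 85],
]
cutscore = angoff_cutscore(ratings)
print(f"Angoff cutscore: {cutscore:.2f} out of {len(ratings[0])} items")
```

In practice a panel would also review rater agreement before accepting the result, which this sketch leaves out.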

One of the most clichéd phrases associated with assessment is “teaching to the test.”  I’ve always hated this phrase, because it is only used in a derogatory manner, almost always by people who do not understand the basics of assessment and psychometrics.  I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.

 

Why teaching to the test is usually a good thing

If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum.  Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

In the world of credentialing assessment, this is an extremely important link.  Credential tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills in the job.  This data is then transformed into test blueprints. Instructors for the profession, as well as aspiring students that are studying to pass the test, then focus on what is in the blueprints.  This, of course, still contains the skills that are most important and frequently used in the job!

 

So what is the problem then?

Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then power to choose how they teach, as noted in the PISA article.

As a counterexample, my high school math department made an edict starting my sophomore year that all teachers had to use the “Chicago Method.”  It was pure bunk and based on the fact that students should be doing as much busy work as possible instead of the teachers actually teaching. I think it is because some salesman convinced the department head to make the switch so that they would buy a thousand brand new textbooks.  The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school) but I think we ended up with a bastardization of it, as the edict was primarily:

  1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
  2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.  
  3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.

 

What about all the bad tests out there?

Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints.  There are, most certainly, plenty of cases where an assessment is not designed or built well.  That’s an entirely different problem, and is an entirely valid concern. I have seen a number of these in my career.  This danger is why we have international standards on assessments, like AERA/APA/NCME and NCCA.  These provide guidelines on how a test should be built, sort of like how you need to build a house according to building code and not just throw up some walls and a roof.

For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world.  It’s a matter of caveat emptor. But just because you had some cheap furniture in college that broke right away doesn’t mean you swear off all furniture.  You stay away from bad furniture.

There’s also the problem of tests being misused, but again that’s not a problem with the test itself.  Certainly, someone making decisions is uninformed. It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application then it’s still not a good situation.  For example, if you took a very well-made exam for high school graduation and started using it for employment decisions with adults. Psychometricians call this validity – that we have evidence to support the intended use of the test and interpretations of scores.  It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.

 

So where do we go from here?

Put it this way, if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining unspecifically about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  Nevertheless, of course, we need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

Classical test theory is a century-old paradigm for psychometrics – using quantitative and scientifically-based processes to develop and analyze assessments to maximize their quality.  (nobody likes unfair tests!)  The most basic and frequently used item statistic from classical test theory is the P-value.  It is usually called item difficulty but is sometimes called item facility, which can lead to possible confusion.

The P-Value Statistic

The classical P-value is the proportion of examinees that respond correctly to a question, or respond in the “keyed direction” for items where the notion of correct is not relevant (imagine a personality assessment where all questions are Yes/No statements such as “I like to go to parties” … Yes is the keyed direction for an Extraversion scale).  Note that this is NOT the same as the p-value that is used in hypothesis testing from general statistical methods.  This P-value is almost universally agreed upon in terms of calculation.  But some people call it item difficulty and others call it item facility.  Why?

It has to do with the clarity of interpretation.  It usually makes sense to think of difficulty as an important aspect of the item.  The P-value presents this, but in a reverse manner.  We usually expect higher values to indicate more of something, right?  But a P-value of 1.00 is high, and it means that there is not much difficulty; everyone gets the item correct, so it is actually no difficulty whatsoever.  A P-value of 0.25 is low, but it means that there is a lot of difficulty; only 25% of examinees are getting it correct, so it has quite a lot of difficulty.
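To make the calculation concrete, here is a small sketch that computes the classical P-value for each item from a scored response matrix. The data and the function name `item_p_values` are hypothetical; responses are assumed to be already scored as 1 (correct or keyed direction) and 0 (otherwise).

```python
def item_p_values(responses):
    """Proportion of examinees answering each item correctly.

    responses[e][i] is 1 if examinee e answered item i correctly
    (or in the keyed direction), else 0.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]

# Hypothetical scored data: 4 examinees x 4 items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
]
p_values = item_p_values(responses)
for item_number, p in enumerate(p_values, start=1):
    print(f"Item {item_number}: P = {p:.2f}")
```

Here the first item has P = 1.00 (everyone answered it correctly, so no difficulty) and the third has P = 0.25 (quite difficult), matching the reversed reading described above.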

So where does “item facility” come in?

See how the meaning is reversed?  It’s for this reason that some psychometricians prefer to call it item facility or item easiness.  We still use the P-value, but 1.00 means high facility/easiness and 0.25 means low facility/easiness.  The direction of the semantics fits much better.

Nevertheless, this is a minority of psychometricians.  There’s too much momentum to change an entire field at this point!  It’s similar to the 3 dichotomous IRT parameters (a,b,c); some of you might have noticed that they are actually in the wrong order, because the 1-parameter model does not use the a parameter, it uses the b.  At the end of the day, it doesn’t really matter, but it’s another good example of how we all just got used to doing something and it’s now too far down the road to change it.  Tradition is a funny thing.

The modified-Angoff method is arguably the most common method of setting a cutscore on a test.  The Angoff cutscore is legally defensible and meets international standards such as AERA/APA/NCME, ISO 17024, and NCCA.  It also has the benefit that it does not require the test to be administered to a sample of candidates first; methods like Contrasting Groups, Borderline Group, and Bookmark do so.

There are, of course, some drawbacks to the Angoff cutscore process.  The most significant is the fact that the subject matter experts (SMEs) tend to overestimate their conceptualization of a minimally competent candidate, and therefore overestimate the cutscore.  Sometimes to the point that the expected pass rate is zero!

Another drawback is that the Angoff cutscore process only works in the classical psychometric paradigm – the recommended cutscores are on the number-correct metric or percentage-correct metric.  If your tests are developed and scored in the item response theory (IRT) paradigm, you need to convert the classical cutscore to the IRT theta scale.  The easiest way to do that is to reverse-calculate the test response function (TRF) from IRT.

The Test Response Function

The TRF (sometimes called a test characteristic curve) is an important method of characterizing test performance in the IRT paradigm.  The TRF predicts a classical score from an IRT score, as you see below.  Like the item response function and test information function (these need blog posts too), it uses the theta scale as the X-axis.  The Y-axis can be either the number-correct metric or proportion-correct metric.

In this example, you can see that a theta of -0.6 translates to an estimated number-correct score of approximately 10, and +1 to 15.5.  Note that the number-correct metric only makes sense for linear or LOFT exams, where every examinee receives the same number of items.  In the case of CAT exams, only the proportion correct metric makes sense.

Angoff cutscore to IRT

So how does this help us with the conversion of a cutscore?  Well, we hereby have a way of translating any number-correct score or proportion-correct score.  So any Angoff-recommended cutscore can be reverse-calculated to a theta value.  If your Angoff study (or Beuk) recommends a cutscore of 10 out of 20 points, you can convert that to a theta cutscore of -0.6.  If the recommended cutscore was 15.5, the theta cutscore would be 1.0.
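Here is a minimal sketch of that reverse calculation. It assumes a three-parameter logistic model without a scaling constant, uses hypothetical item parameters, and solves for theta with simple bisection (which works because the TRF rises monotonically with theta when all a parameters are positive). The function names are made up for the example.

```python
import math

def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (logistic metric)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def trf(theta, items):
    """Test response function: expected number-correct score at theta."""
    return sum(irf_3pl(theta, a, b, c) for a, b, c in items)

def theta_for_raw_cutscore(raw_cut, items, lo=-4.0, hi=4.0, tol=1e-6):
    """Find the theta whose expected number-correct score equals raw_cut."""
    if not trf(lo, items) <= raw_cut <= trf(hi, items):
        raise ValueError("Cutscore is outside the range the TRF can reach.")
    # Bisection: keep halving the interval that brackets the cutscore.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if trf(mid, items) < raw_cut:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical (a, b, c) parameters for a 5-item form.
items = [
    (1.0, -1.0, 0.20),
    (1.2, -0.5, 0.20),
    (0.8, 0.0, 0.25),
    (1.5, 0.5, 0.20),
    (1.0, 1.0, 0.20),
]
theta_cut = theta_for_raw_cutscore(3.0, items)
print(f"A raw cutscore of 3.0 corresponds to theta = {theta_cut:.3f}")
```

For a proportion-correct cutscore, divide the TRF by the number of items before comparing.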

IRT scores examinees on the same scale with any set of items, as long as those items have been part of a linking/equating study.  Therefore, a single Angoff study on a set of items can be equated to any other linear test form, LOFT pool, or CAT pool.  This makes it possible to apply the classically-focused Angoff method to IRT-focused programs.

Linear on the fly testing (LOFT) is an approach to delivering assessments to examinees.  In general, there are two families of test delivery.  Static approaches deliver the same test form or forms to everyone; this is the ubiquitous and traditional “linear” method of testing.  Algorithmic approaches deliver the test to each examinee based on a computer algorithm; this includes LOFT, computerized adaptive testing (CAT), and multistage testing (MST).

What is linear on the fly testing?

The purpose of linear on the fly testing is to give every examinee a linear form that is uniquely created for them – but each one is created to be psychometrically equivalent to all others to ensure fairness.  For example, we might have a pool of 200 items, and every person only gets 100, but that 100 is balanced for each person.  This can be done by ensuring content and/or statistical equivalency, as well as ancillary metadata such as item types or cognitive level.

Content Equivalence

This portion is relatively straightforward.  If your test blueprint calls for 20 items in each of 5 domains, for a total of 100 items, then each form administered to examinees should follow this blueprint.  Sometimes the content blueprint might go 2 or even 3 levels deep.

Statistical Equivalence

There are, of course, two predominant psychometric paradigms: classical test theory and item response theory.  With CTT, forms can easily be built to have an equivalent P value, and therefore expected mean score.  If point-biserial statistics are available for each item, you can also design the algorithm to design forms that have the same standard deviation and reliability.  With item response theory, the typical approach is to design forms to have the same test information function, or inversely, conditional standard error of measurement function.  To learn more about how these are implemented, download our IRT Scoring Spreadsheet or Classical Form Assembly Tool.

Implementing LOFT

LOFT is typically implemented by publishing a pool of items with an algorithm to select subsets that meet the requirements.  Therefore, you need a psychometrically sophisticated testing engine that stores the necessary statistics and item metadata, lets you define a pool of items, specify the relevant options such as target statistics and blueprints, and deliver the test in a secure manner.  Very few testing platforms can implement a quality LOFT assessment.
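To give a feel for what such an algorithm does, here is a deliberately simple sketch, not a production approach. It draws the blueprint's number of items at random from each domain and keeps redrawing until the form's mean classical P-value is within a tolerance of a target. The pool, the blueprint, the tolerance, and the function name `assemble_loft_form` are all hypothetical.

```python
import random

def assemble_loft_form(pool, blueprint, target_mean_p, tolerance=0.02,
                       max_tries=10_000, rng=None):
    """Draw a form that matches the content blueprint and a target mean P-value."""
    rng = rng or random.Random()
    # Group the pool by content domain so the blueprint can be honored.
    by_domain = {}
    for item in pool:
        by_domain.setdefault(item["domain"], []).append(item)
    for _ in range(max_tries):
        form = []
        for domain, n_needed in blueprint.items():
            form.extend(rng.sample(by_domain[domain], n_needed))
        # Statistical check: mean P-value drives the expected mean score.
        mean_p = sum(item["p"] for item in form) / len(form)
        if abs(mean_p - target_mean_p) <= tolerance:
            return form
    raise RuntimeError("No form met the target; widen the tolerance or the pool.")

# Hypothetical pool: 200 items, 5 domains of 40 items, made-up P-values.
rng = random.Random(42)
pool = [{"id": i, "domain": i % 5, "p": round(rng.uniform(0.40, 0.90), 2)}
        for i in range(200)]
blueprint = {domain: 20 for domain in range(5)}  # 20 items in each of 5 domains
target = sum(item["p"] for item in pool) / len(pool)

form = assemble_loft_form(pool, blueprint, target, rng=rng)
print(len(form), "items; mean P =", round(sum(i["p"] for i in form) / len(form), 3))
```

A real engine would typically also balance the spread of difficulty, item types, and (under IRT) the test information function rather than a single mean.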

Why all this?

It certainly is not easy to build a strong item bank, design LOFT pools, and develop a complex algorithm that meets content and statistical balancing needs.  So why would an organization use linear on the fly testing?  Well, it is much more secure than having a few linear forms.  Since everyone receives a unique form, it is impossible for word to get out about what the first questions on the test are.  And of course, we could simply perform a random selection of 100 items from a pool of 200, but that would be potentially unfair.  Using LOFT will ensure the test remains fair and defensible.

Fraudulent testing data is everywhere. In academic testing, students cheat by looking at other students’ responses or informing their friends in the next section what questions are on the test. In professional credentialing, candidates will sit for the exam simply to steal the content for posting on brain dump sites, while other candidates purchasing the content from these sites never pause to consider the ethical ramifications of trading in stolen property.

Threats to test security are also threats to validity and, by extension, the entire existence and integrity of the assessment. What’s worse? The greater the stakes, the greater the incentive to cheat. Has your organization ever taken a deep dive into your assessment data to search for evidence of cheating or other invalid behavior?

Dr. Nathan Thompson, Assessment Systems co-founder and VP of Psychometrics, has long recognized the value of psychometric forensics to an assessment program, but also the lack of software to implement it. Because of this, Dr. Thompson developed Software for Investigating Fraud in Testing (SIFT) in 2016.

“The software is easy to run because of its friendly UI, but the results are so complex that only a small percentage of Ph.D. psychometricians can understand the output,” Dr. Thompson said.

That is why Assessment Systems is proud to offer a Psychometric Forensics service, leveraging Dr. Thompson’s expertise (and our love for test security) to bring this customized consulting to organizations who wish to protect the integrity of their assessments.

“The cliché holds true here: an ounce of prevention is worth a pound of cure,” Dr. Thompson said. “We can work with you to identify areas of concern and explore policies, procedures, and practices that will help you.”

If you provide us a dataset, we’ll analyze it with a range of collusion indices and other statistics, evaluating your examinees individually as well as groups such as test centers or classrooms. ASC’s mission is to improve the quality of as many assessments as we can.

Have you heard about standard setting approaches such as the Hofstee method, or perhaps the Angoff, Ebel, Nedelsky, or Bookmark methods?  There are certainly various ways to set a defensible cutscore on a professional credentialing or pre-employment test.  Today, we are going to discuss the Hofstee method.

Why Standard Setting?

Certification organizations that care about the quality of their examinations need to follow best practices and international standards for test development, such as  the Standards laid out by the National Commission for Certifying Agencies (NCCA).  One component of that is standard setting, also known as cutscore studies.  One of the most common and respected approaches for that is the modified-Angoff methodology.

However, the Angoff approach has one flaw: the subject matter experts (SMEs) tend to expect too much out of minimally competent candidates, and sometimes set a cutscore so high that even they themselves would not pass the exam.  There are several reasons this can occur.  For example, raters might think “I would expect anyone that worked for me to know how to do this” and not consider the fact that people who work for them might have 10 years of experience while test candidates could be fresh out of training/school and have the topic only touched on for 5 minutes.  SMEs often forget what it was like to be a much younger and inexperienced version of themselves.

For this reason, several compromise methods have been suggested to compare the Angoff-recommended cutscore with a “reality check” of actual score performance on the exam, allowing the SMEs to make a more informed decision when setting the official cutscore of the exam.  I like to use the Beuk method and the Hofstee method.

The Hofstee Method

One method of adjusting the cutscore based on raters’ impressions of the difficulty of the test and possible pass rates is the Hofstee method (Mills & Melican, 1987; Cizek, 2006; Burr et al., 2016).  This method requires the raters to estimate four values:

 

  1. The minimum acceptable failure rate
  2. The maximum acceptable failure rate
  3. The minimum cutscore, even if all examinees failed
  4. The maximum cutscore, even if all examinees passed

 

The first two values are failure rates, and are therefore between 0% and 100%, with 100% indicating a test that is too difficult for anyone to pass.  The latter two values are on the raw score scale, and therefore range between 0 and the number of items in the test, again with a higher value indicating a more difficult cutscore to achieve.

These values are paired into two points, (minimum cutscore, maximum failure rate) and (maximum cutscore, minimum failure rate), and the line that passes through them is estimated.  The intersection of this line with the failure rate function is the recommended adjusted cutscore.
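The calculation can be sketched as follows. This is a simplified illustration that assumes the usual pairing of the four values into the points (minimum cutscore, maximum failure rate) and (maximum cutscore, minimum failure rate), treats a score below the cutscore as a fail, and only considers whole-number cutscores, picking the one where the line and the observed failure rate are closest. The scores and panel judgments are hypothetical.

```python
def hofstee_cutscore(scores, min_fail, max_fail, min_cut, max_cut):
    """Approximate the Hofstee cutscore on the raw score scale.

    min_fail and max_fail are proportions (0 to 1); min_cut and max_cut are raw scores.
    """
    def fail_rate(cut):
        # Observed proportion of examinees who would fail at this cutscore.
        return sum(score < cut for score in scores) / len(scores)

    def line(cut):
        # Line through (min_cut, max_fail) and (max_cut, min_fail).
        slope = (min_fail - max_fail) / (max_cut - min_cut)
        return max_fail + slope * (cut - min_cut)

    # The failure rate rises with the cutscore while the line falls, so the
    # crossing is where the two are closest.
    candidates = range(min_cut, max_cut + 1)
    return min(candidates, key=lambda cut: abs(fail_rate(cut) - line(cut)))

# Hypothetical raw scores for 20 examinees on a 20-item test.
scores = [8, 9, 10, 11, 11, 12, 12, 13, 13, 13,
          14, 14, 14, 15, 15, 16, 16, 17, 18, 19]

# Hypothetical panel judgments (typically averaged across raters).
cut = hofstee_cutscore(scores, min_fail=0.10, max_fail=0.40, min_cut=10, max_cut=16)
print("Hofstee-adjusted cutscore:", cut)
```

With real data you might interpolate between score points instead of restricting the search to integers.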

How can I use the Hofstee Method?

Unlike the Beuk, the Hofstee method does not utilize the Angoff ratings, so it represents a completely independent reality check.  In fact, it is sometimes used as a standalone cutscore setting method itself, but because it does not involve rating of every single item, I recommend it be used in concert with the Angoff and Beuk approaches.

How can you perform all the calculations that go into the Hofstee method?  Well, you don’t need to program it all from scratch.  Just head over to our Angoff Analysis Tool page and download a copy for yourself.

The Spearman-Brown Prediction Formula, also known as the Spearman-Brown Prophecy Formula or Correction, is a method used in evaluating test reliability.  It is based on the idea that split-half reliability has better assumptions than coefficient alpha, but only estimates reliability for a half-length test, so we need to implement a correction that steps it up to a true estimate for a full length test.

Coefficient Alpha vs. Split Half

The most commonly used index of test score reliability is coefficient alpha.  However, it’s not the only index of internal consistency.  Another common approach is split-half reliability, where we split the test into two halves (first/last, even/odd, or random split) and then correlate scores on each.  The reasoning is that if both halves of the test measure the same construct at a similar level of precision and difficulty, then scores on one half should correlate highly with scores on the other half.  More information on split-half is found here.
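As a small illustration of the odd/even variant, the sketch below scores each half and computes the Pearson correlation between the two half scores. The response matrix is hypothetical and far smaller than anything you would analyze in practice, and the function names are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def odd_even_split_half(responses):
    """Correlate scores on odd-numbered items with scores on even-numbered items."""
    odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson(odd_scores, even_scores)

# Hypothetical scored data: 6 examinees x 6 items.
responses = [
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
]
r_half = odd_even_split_half(responses)
print(f"Odd-even split-half correlation: {r_half:.3f}")
```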

However, split-half reliability provides an inconvenient situation: we are effectively gauging the reliability of half a test.  It is a well-known fact that reliability is increased by more items (observations); we can all agree that a 100-item test is more reliable than a 10-item test composed of similar quality items.  So the split half correlation is blatantly underestimating the reliability of the full length test.

The Spearman-Brown Prediction Formula

To adjust for this, psychometricians use the Spearman-Brown prophecy formula.  It takes the split half correlation as input and converts it to an estimate of the equivalent level of reliability for the full length test.  While this might sound complex, the actual formula is quite simple.

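The formula takes the split half reliability ($\rho_{xx'}$) and the factor by which the test is lengthened ($n$, which equals 2 when stepping up from a half to the full length test) as input and produces the full-length equivalent ($\rho^{*}_{xx'}$): $\rho^{*}_{xx'} = \frac{n\,\rho_{xx'}}{1 + (n - 1)\,\rho_{xx'}}$.  This can then be interpreted alongside the ubiquitously used coefficient alpha.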

While the calculation is quite simple, you still shouldn’t have to do it yourself.  Any decent software for classical item analysis will produce it for you.  As an example, here is the output of the Reliability Analysis table from our Iteman software for automated reporting and assessment intelligence with CTT.  This lists the various split-half estimates alongside the coefficient alpha (and its associated SEM) for the total score as well as the domains, so you can evaluate if there are domains that are producing unusually unreliable scores.  (Note: there is an ongoing argument amongst psychometricians whether domain scores are even worthwhile, since the assumed unidimensionality of most tests means that the domain scores are just less-reliable estimates of the total score, but that’s a whole ‘nother blog post!)

 

Score | N Items | Alpha | SEM | Split-Half (Random) | Split-Half (First-Last) | Split-Half (Odd-Even) | S-B Random | S-B First-Last | S-B Odd-Even
All items | 50 | 0.805 | 3.058 | 0.660 | 0.537 | 0.668 | 0.795 | 0.699 | 0.801
1 | 10 | 0.522 | 1.269 | 0.338 | 0.376 | 0.370 | 0.506 | 0.547 | 0.540
2 | 18 | 0.602 | 1.860 | 0.418 | 0.309 | 0.448 | 0.590 | 0.472 | 0.619
3 | 12 | 0.605 | 1.496 | 0.449 | 0.417 | 0.383 | 0.620 | 0.588 | 0.553
4 | 10 | 0.485 | 1.375 | 0.300 | 0.329 | 0.297 | 0.461 | 0.495 | 0.457

 

You can see that, as mentioned earlier, there are 3 ways to do the split in the first place, and Iteman reports all three.  It then reports the Spearman-Brown formula for each.  These generally align with the results of the alpha estimates, which overall provides a cohesive picture about the structure of the exam and its reliability of scores.  As you might expect, domains with more items are slightly more reliable, but not super reliable since they are all less than 20 items.
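To see the step-up in action, this short sketch applies the standard Spearman-Brown formula, n·r / (1 + (n − 1)·r) with n = 2 for a split half, to the three split-half values reported for the total score ("All items") in the table. The function name `spearman_brown` is just for illustration.

```python
def spearman_brown(r_half, n=2):
    """Step a reliability estimate up to a test n times as long (n=2 for split-half)."""
    return n * r_half / (1 + (n - 1) * r_half)

# Split-half correlations from the "All items" row of the table.
split_half = {"Random": 0.660, "First-Last": 0.537, "Odd-Even": 0.668}

stepped_up = {name: spearman_brown(r) for name, r in split_half.items()}
for name, value in stepped_up.items():
    print(f"S-B {name}: {value:.3f}")
```

The results agree with the S-B columns of that row to three decimals; the domain rows can differ in the last digit because the split-half values shown are already rounded.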

So, what does this mean in the big scheme of things?  Well, in many cases the Spearman-Brown estimates might not differ much from the alpha estimates, but it’s still good to know whether they do.  In the case of high-stakes tests, you want to go through every effort you can to ensure that the scores are highly reliable and precise.

If you’d like to learn more, here is a recent article on the topic.

Psychometrics is the cornerstone of any high-quality assessment program.  Most organizations do not have an in-house PhD psychometrician, which then necessitates the search for psychometric consulting.  Most organizations, when first searching, are new to the topic and not sure what role the psychometrician plays.  In this article, we’ll talk about how psychometricians and their tools can help improve your assessments, whether you just want to check on test reliability or pursue the lengthy process of accreditation.

Why ASC?

Whether you are establishing or expanding a credentialing program, streamlining operations, or moving from paper to online testing, ASC has a proven track record of providing practical, cost-efficient solutions with uncompromising quality. We offer a free consultation with our team of experts to discuss your needs and determine which solutions are the best fit, including our enterprise SaaS platforms, consulting on sound psychometrics, or recommending you to one of our respected partners.
 

At the heart of our business is our people.

Our collaborative team of Ph.D. psychometricians, accreditation experts, and software developers have diverse experience developing solutions that drive best practices in assessment. This real-world knowledge enables us to consult your organization with solutions tailored specifically to your goals, timeline, and budget.
 

Comprehensive Solutions to Address Specific Measurement Problems

Much of psychometric consulting is project-based around solving a specific problem.  For example, you might be wondering how to set a cutscore on a certification/licensure exam that is legally defensible and meets accreditation standards.  This is a very specific issue, and the scientific literature has suggested a number of sound approaches.  Here are some of the topics where psychometricians can really help:

  • Test Design: Job Analysis & Blueprints
  • Standard and Cutscore Setting Studies
  • Item Writing and Review Workshops
  • Test and Item Statistical Analysis
  • Equating Across Years and Forms
  • Adaptive Testing Research
  • Test Security Evaluation
  • NCCA/ANSI Accreditation

 

Why psychometric consulting?

All areas of assessment can be smarter, faster and fairer.

Develop Reliable and Valid Assessments
We’ll help you understand what needs to be done to develop defensible tests and how to implement them in a cost-efficient manner.  Much of the work revolves around establishing a sound test development cycle.

Increase Test Security
We have specific expertise in psychometric forensics, allowing you to flag suspicious candidates or groups in real time, using our automated forensics report.

Achieve Accreditation
Our dedicated experts will assist in setting your organization up for success with NCCA/ANSI accreditation of professional certification programs.

Comprehensive Psychometric Analytics
We use CTT and IRT with principles of machine learning and AI to deeply understand your data and provide actionable recommendations.

We can help your organization develop and publish certification and licensure exams, based on best practices and accreditation standards, in a matter of months.

If you’re looking for a way to add these best practices to your assessments, here’s how:

Item and Test Statistical Analysis
If you are not doing this process at least annually, you are not meeting best practices or accreditation standards. But don’t worry, we can help! In addition to performing these analyses for you, you also have the option of running them yourself in our FastTest platform or using our psychometric software like Iteman and Xcalibre.

Job Analysis
How do you know what a professional certification test should cover?  Well, let’s get some hard data by surveying job incumbents. Knowing and understanding this information and how to use it is essential if you want to test people on whether they are prepared for the job or profession.

Cutscore Studies (Standard Setting)
When you use sound psychometric practices like the modified-Angoff, Beuk Compromise, Bookmark, and Contrasting Groups methods, it will help you establish a cutscore that meets professional standards.

 

It’s all much easier if you use the right software!

Once we help you determine the best solutions for your organization, we can train you on best practices, and it’s extremely easy to use our software yourself.  Software like Iteman and Xcalibre is designed to replace much of the manual work done by psychometricians for item and test analysis, and FastTest automates many aspects of test development and publishing.  We even offer free software like the Angoff Analysis Tool.  However, our ultimate goal is your success: Assessment Systems is a full-service company that continues to provide psychometric consulting and support even after you’ve made a purchase. Our team of professionals is available to provide you with additional support at any point in time. We want to ensure you’re getting the most out of our products!  Click below to sign up for a free account in FastTest and see for yourself.

 

Artificial intelligence (AI) and machine learning (ML) have become buzzwords over the past few years.  As I already wrote about, they are actually old news in the field of psychometrics.   Factor analysis is a classical example of ML, and item response theory also qualifies as ML. Computerized adaptive testing is actually an application of AI to psychometrics that dates back to the 1970s.

One thing that is very different about the world of AI/ML today is the massive power available in free platforms like R, Python, and TensorFlow.  I’ve been thinking a lot over the past few years about how these tools can impact the world of assessment.  A straightforward application is automated essay scoring; a common way to approach that problem is through natural language processing with the “bag of words” model, using the document-term matrix (DTM) as predictors in a model with essay score as the criterion variable.  Surprisingly simple.  This got me to wondering where else we could apply that sort of modeling.  Obviously, student response data on selected-response items provides a ton of data, but the research questions are less clear.  So, I turned to the topic that I think has the next largest set of data and text: item banks.

Step 1: Text Mining

The first step was to explore tools for text mining in R.  I found this well-written and clear tutorial on the text2vec package and used that as my springboard.  Within minutes I was able to get a document-term matrix, and in a few more minutes was able to prune it.  This DTM alone can provide useful info to an organization on their item bank, but I wanted to delve further.  Can the DTM predict item quality?
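For readers who have not seen a document-term matrix, here is a bare-bones sketch of the idea. It is a plain-Python stand-in, not the text2vec workflow described above: it tokenizes each item stem, counts terms, and prunes terms that appear in fewer than a chosen number of documents. The item stems, the `min_doc_count` threshold, and the function name `build_dtm` are hypothetical.

```python
import re
from collections import Counter

def build_dtm(documents, min_doc_count=2):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in document d."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Prune rare terms, then fix a column order.
    vocab = sorted(term for term, n in doc_freq.items() if n >= min_doc_count)
    dtm = []
    for tokens in tokenized:
        counts = Counter(tokens)
        dtm.append([counts.get(term, 0) for term in vocab])
    return vocab, dtm

# Hypothetical item stems.
stems = [
    "Which of the following is the best treatment for hypertension?",
    "Which of the following is NOT a symptom of hypertension?",
    "What is the recommended dose of aspirin?",
]
vocab, dtm = build_dtm(stems)
print(vocab)
for row in dtm:
    print(row)
```

Each row of the matrix can then serve as a set of predictors for an item-level outcome such as an item statistic.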

Step 2: Fit Models

To do this, I utilized both the caret and glmnet packages to fit models.  I love the caret package, but if you search the literature you’ll find it has problems with sparse matrices, which is exactly what the DTM is.  One blog post I found said that anyone with a sparse matrix is pretty much stuck using glmnet.

I tried a few models on a small item bank of 500 items from a friend of mine, and my adjusted R squared for the prediction of IRT a parameters (as an index of item quality) was 0.53 – meaning that I could account for more than half the variance of item quality just by knowing some of the common words in each item’s stem.  I wasn’t even using the answer texts, n-grams, or additional information like author and content domain.

Want to learn more about your item bank?

I’d love to swim even deeper on this issue.  If you have a large item bank and would like to work with me to analyze it so you can provide better feedback and direction to your item writers and test developers, drop me a message at nthompson@54.89.150.95!  This could directly impact the efficiency of your organization and the quality of your assessments.