If you have worked in the field of assessment and psychometrics, you have undoubtedly encountered the word “standard.” While a relatively simple word, it has the potential to be confusing because it is used in three (and more!) completely different but very important ways. Here’s a brief discussion.

Standard = Cutscore

As noted by the well-known professor Gregory Cizek here, “standard setting refers to the process of establishing one or more cut scores on a test.” The various methods of setting a cutscore, like Angoff or Bookmark, are referred to as standard setting studies. In this context, the standard is the bar that separates a Pass from a Fail. We use methods like the ones mentioned to determine this bar in as scientific and defensible fashion as possible, and give it more concrete meaning than an arbitrarily selected round number like 70%. Selecting a round number like that will likely get you sued since there is no criterion-referenced interpretation.

Standard = Blueprint

If you work in the field of education, you often hear the term “educational standards.” These refer to the curriculum blueprints for an educational system, which also translate into assessment blueprints, because you want to assess what is on the curriculum. Several important ones in the USA are noted here, perhaps the most common of which nowadays is the Common Core State Standards, which attempted to standardize the standards across states. These standards exist to standardize the educational system, by teaching what a group of experts have agreed upon should be taught in 6th grade Math classes for example. Note that they don’t state how or when a topic should be taught, merely that 6th Grade Math should cover Number Lines, Measurement Scales, Variables, whatever – sometime in the year.

Standard = Guideline

If you work in the field of professional certification, you hear the term just as often but in a different context, accreditation standards. The two most common are the National Commission for Certifying Agencies (NCCA) and the ANSI National Accreditation Board (ANAB). These two organizations are a consortium of credentialing bodies that give a stamp of approval to credentialing bodies, stating that a Certification or Certificate program is legit. Why? Because there is no law to stop me from buying a textbook on any topic, writing 50 test questions in my basement, and selling it as a Certification. It is completely a situation of caveat emptor, and these organizations are helping the buyers by giving a stamp of approval that the certification was developed with accepted practices like a Job Analysis, Standard Setting Study, etc.

In addition, there are the professional standards for our field. These are guidelines on assessment in general rather than just credentialing. Two great examples are the AERA/APA/NCME Standards for Educational and Psychological Measurement and the International Test Commission’s Guidelines (yes they switch to that term) on various topics.

Also: Standardized = Equivalent Conditions

The word is also used quite frequently in the context of standardized testing, though it is rarely chopped to the root word “standard.” In this case, it refers to the fact that the test is given under equivalent conditions to provide greater fairness and validity. A standardized test does NOT mean multiple choice, bubble sheets, or any of the other pop connotations that are carried with it. It just means that we are standardizing the assessment and the administration process. Think of it as a scientific experiment; the basic premise of the scientific method is holding all variables constant except the variable in question, which in this case is the student’s ability. So we ensure that all students receive a psychometrically equivalent exam, with equivalent (as much as possible) writing utensils, scrap paper, computer, time limit, and all other practical surroundings. The problem comes with the lack of equivalence in access to study materials, prep coaching, education, and many bigger questions… but those are a societal issue and not a psychometric one.

So despite all the bashing that the term gets, a standardized test is MUCH better than the alternatives of no assessment at all, or an assessment that is not a level playing field and has low reliability. Consider the case of hiring employees: if assessments were not used to provide objective information on applicant skills and we could only use interviews (which are famously subjective and inaccurate), all hiring would be virtually random and the amount of incompetent people in jobs would increase a hundredfold. And don’t we already have enough people in jobs where they don’t belong?

What is computerized adaptive testing?  Computerized adaptive tests (CATs), also known as computer-adaptive tests or simply adaptive tests, are a sophisticated method of test delivery based that uses AI algorithms to personalize the test to every examinee.  This means that the test becomes shorter, more accurate, more secure, and fairer.  The AI algorithms are almost always based on item response theory (IRT), an application of machine learning to assessment, but can be based on other models as well. 

This post will cover the following topics:

  1. What is computerized adaptive testing?
  2. How does the test adapt?
  3. An example of computerized adaptive testing
  4. Advantages of computerized adaptive testing
  5. How to develop an CAT that is valid and defensible
  6. What do I need for adaptive testing?

click here to sign up for adaptive testing platform

What is computerized adaptive testing?

Computerized adaptive testing is an algorithm that drives how a test is delivered.  It is coded into a software platform, using the machine-learning approach of IRT to select items and score examinees.  The algorithm proceeds in a loop until the test is complete.

computerized adaptive testing

The steps in the diagram above are adapted from Kingsbury and Weiss (1984).  Let’s step through how it works.

For starters, you need an item bank that has been calibrated with a relevant psychometric or machine learning model.  That is, you can’t just write a few items and subjectively rank them as Easy, Medium, or Hard difficulty.  That’s an easy way to get sued.  Instead, you need to write a large number of items (rule of thumb is 3x your intended test length) and then pilot them on a representative sample of examinees.  The sample must be large enough to support the psychometric model you choose, and can range from 100 to 1000.  You then need to perform simulation research – more on that later.

Once you have an item bank ready, here is how the computerized adaptive testing algorithm works for a student that sits down to take the test.

  1. Starting point: there are three option to select the starting score, which psychometricians call theta
    1. Everyone gets the same value, like 0.0 (average, in the case of non-Rasch models)
    2. Randomized within a range, to help test security and item exposure
    3. Predicted value, perhaps from external data, or from a previous exam
  2. Select item
    1. Find the item in the bank that has the highest information value
    2. Often, you need to balance this with practical constraints such as Item Exposure or Content Balancing
  3. Score examinee
    1. Score the examinee; if using IRT, perhaps maximum likelihood or Bayes modal
  4. Evaluate termination criterion: using a predefined rule supported by your simulation research
    1. Is a certain level of precision reached, such as a standard error of measurement <0.30
    2. Are there no good items left in the bank
    3. Has a time limit been reached
    4. Has a Max Items limit been reached

The algorithm works by looping through 2-3-4 until the termination criterion is satisfied.

 

Do I need to program all that myself?

No.  Our revolutionary platform, FastTest, makes it easy to publish a CAT.  Once you upload the IRT parameters, you can choose whatever options you please for steps 2-3-4 of the algorithm, simply by clicking on elements in our easy-to-use interface.  Want to try it yourself?  Contact us to set up a free account and demo.

But of course, there are many technical considerations that affect the quality and defensibility of your CAT – we’ll be talking about those in this post.

Adaptive testing options

How does the test adapt? By Difficulty and/or Quantity

They operate by adapting both the difficulty and quantity of items seen by each examinee.

Difficulty
Most characterizations of adaptive testing focus on how item difficulty is matched to examinee ability. High-ability examinees receive more difficult items, while low ability examinees receive easier items, which has important benefits to the student and the organization. An adaptive test typically begins by delivering an item of medium difficulty; if you get it correct, you get a tougher item, and if you get it incorrect, you get an easier item. This basic algorithm continues until the test is finished, though it usually includes sub algorithms for important things like content distribution and item exposure.

Quantity
A less publicized facet of adaptation is the number of items. Adaptive tests can be designed to stop when certain psychometric criteria are reached, such as a specific level of score precision. Some examinees finish very quickly with few items, so that adaptive tests are typically about half as many questions as a regular test, with at least as much accuracy. Since some examinees have longer tests, these adaptive tests are referred to as variable-length. Obviously, this makes for a massive benefit: cutting testing time in half, on average, can substantially decrease testing costs.

Some adaptive tests use a fixed length, and only adapt item difficulty. This is merely for public relations issues, namely the inconvenience of dealing with examinees who feel they were unfairly treated by the CAT, even though it is arguably more fair and valid than conventional tests.

In general, it is best practice to meld the two: allow test length to be shorter or longer, but put caps on either end that prevent inadvertently too-short tests or tests that could potentially go on to 400 items.  For example, the NCLEX has a minimum length exam of 75 items and the maximum length exam of 145 items.

 

An example of computerized adaptive testing

Let’s walk through an oversimplified example.  Below are the item information functions for five items in a bank.  Let’s suppose the starting theta is 0.0.  

item information functions

 

  1. We find the first item to deliver.  Which item has the highest information at 0.0?  It is Item 4.
  2. Suppose the student answers incorrectly.
  3. We run the IRT scoring algorithm, and suppose the score is -2.0.  
  4. Check the termination criterion; we certainly aren’t done yet, after 1 item.
  5. Find the next item.  Which has the highest information at -2.0?  Item 2.
  6. Suppose the student answers correctly.
  7. We run the IRT scoring algorithm, and suppose the score is -0.8.  
  8. Evaluate termination criterion; not done yet.
  9. Find the next item.  Item 2 is the highest at -0.8 but we already used it.  Item 4 is next best, but we already used it.  So the next best is Item 1.
  10. Item 1 is very easy, so the student gets it correct.
  11. New score is -0.2.
  12. Best remaining item at -0.2 is Item 3.
  13. Suppose the student gets it incorrect.
  14. New score is perhaps -0.4.
  15. Evaluate termination criterion.  Suppose that the test has a max of 3 items, an extremely simple criterion.  We have met it.  The test is now done and automatically submitted.

 

Advantages of computerized adaptive testing

By making the test more intelligent, adaptive testing provides a wide range of benefits.  Some of the well-known advantages of adaptive testing, recognized by scholarly psychometric research, are listed below.  However, the development of an adaptive test is a very complex process that requires substantial expertise in item response theory (IRT) and CAT simulation research.  

Our experienced team of psychometricians can provide your organization with the requisite experience to implement adaptive testing and help your organization benefit from these advantages. Contact us or read this white paper to learn more.

  • Shorter tests, anywhere from a 50% to 90% reduction; reduces cost, examinee fatigue, and item exposure
  • More precise scores: CAT will make tests more accurate
  • More control of score precision (accuracy): CAT ensures that all students will have the same accuracy, making the test much more fair.  Traditional tests measure the middle students well but not the top or bottom students.
  • Increased efficiency
  • Greater test security because everyone is not seeing the same form
  • A better experience for examinees, as they only see items relevant for them, providing an appropriate challenge
  • The better experience can lead to increased examinee motivation
  • Immediate score reporting
  • More frequent retesting is possible; minimize practice effects, which makes this extremely useful for K-12 formative assessment.
  • Individual pacing of tests; examinees move at their own speed
  • On-demand testing can reduce printing, scheduling, and other paper-based concerns
  • Storing results in a database immediately makes data management easier
  • Computerized testing facilitates the use of multimedia in items
 

How to develop an CAT that is valid and defensible

CATs are the future of assessment. They operate by adapting both the difficulty and number of items to each individual examinee. The development of an adaptive test is no small feat, and requires five steps integrating the expertise of test content developers, software engineers, and psychometricians.

The development of a quality adaptive test is complex and requires experienced psychometricians in both item response theory (IRT) calibration and CAT simulation research. FastTest can provide you the psychometrician and software; if you provide test items and pilot data, we can help you quickly publish an adaptive version of your test.

   Step 1: Feasibility, applicability, and planning studies. First, extensive monte carlo simulation research must occur, and the results formulated as business cases, to evaluate whether adaptive testing is feasible, applicable, or even possible.

   Step 2: Develop item bank. An item bank must be developed to meet the specifications recommended by Step 1.

   Step 3: Pretest and calibrate item bank. Items must be pilot tested on 200-1000 examinees (depends on IRT model) and analyzed by a Ph.D. psychometrician.

   Step 4: Determine specifications for final CAT. Data from Step 3 is analyzed to evaluate CAT specifications and determine most efficient algorithms using CAT simulation software such as CATSim.

   Step 5: Publish live CAT. The adaptive test is published in a testing engine capable of fully adaptive tests based on IRT.  There are not very many of them out in the market.  Sign up for a free account in our platform FastTest and try for yourself!

Want to learn more about our one-of-a-kind model? Click here to read the seminal article by our two co-founders., or read this blog post on developing an adaptive test. More adaptive testing research is available here.

What do I need for adaptive testing?

Here are some minimum requirements to evaluate if you are considering a move to the CAT approach.

  • A large item bank piloted so that each item has at least 100 valid responses (Rasch model) or 500 (3PL model)
  • 500 examinees per year
  • Specialized IRT calibration and CAT simulation software like Xcalibre and CATsim.
  • Staff with a Ph.D. in psychometrics or an equivalent level of experience. Or, leverage our internationally recognized expertise in the field.
  • Items (questions) that can be scored objectively correct/incorrect in real-time
  • An item banking system and CAT delivery platform
  • Financial resources: Because it is so complex, the development of a CAT will cost at least $10,000 (USD) — but if you are testing large volumes of examinees, it will be a significantly positive investment. If you pay $20/hour for proctoring seats and cut a test from 2 hours to 1 hour for just 1,000 examinees… that’s a $20,000 savings.  If you are doing 200,000 exams?  That is $4,000,000 in seat time that is saved.

Adaptive testing: Resources for further reading

Visit the links below to learn more about adaptive testing.  

How can I start developing a CAT?

Sign up below for a free account in our industry-leading CAT platform.

 

One of the most cliche phrases associated with assessment is “teaching to the test.”  I’ve always hated this phrase, because it is only used in a derogatory matter, almost always by people who do not understand the basics of assessment and psychometrics.  I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.

 

Why teaching to the test is usually a good thing

If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum.  Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

The the world of credentialing assessment, this is an extremely important link.  Credential tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills in the job.  This data is then transformed into test blueprints. Instructors for the profession, as well as aspiring students that are studying to pass the test, then focus on what is in the blueprints.  This, of course, still contains the skills that are most important and frequently used in the job!

 

So what is the problem then?

Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then power to choose how they teach, as noted in the PISA article.

As a counterexample, my high school math department made an edict starting my sophomore year that all teachers had to use the “Chicago Method.”  It was pure bunk and based on the fact that students should be doing as much busy work as possible instead of the teachers actually teaching. I think it is because some salesman convinced the department head to make the switch so that they would buy a thousand brand new textbooks.  The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school) but I think we ended up with a bastardization of it, as the edict was primarily:

  1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
  2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.  
  3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.

 

What about all the bad tests out there?

Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints.  There are, most certainly, plenty of cases where an assessment is not designed or built well.  That’s an entirely different problem, and is an entirely valid concern. I have seen a number of these in my career.  This danger why we have international standards on assessments, like AERA/APA/NCME and NCCA.  These provide guidelines on how a test should be build, sort of like how you need to build a house according to building code and not just throwing up some walls and a roof.

For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world.  It’s a matter of caveat emptor. But just because you had some cheap furniture on college that broke right away, doesn’t mean you swear off on all furniture.  You stay away from bad furniture.

There’s also the problem of tests being misused, but again that’s not a problem with the test itself.  Certainly, someone making decisions is uninformed. It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application then it’s still not a good situation.  For example, if you took a very well-made exam for high school graduation and started using it for employment decisions with adults. Psychometricians call this validity – that we have evidence to support the intended use of the test and interpretations of scores.  It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.

 

So where do we go from here?

Put it this way, if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining unspecifically about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  Nevertheless, of course, we need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

The field of psychometrics is definitely a small niche in the academic world, despite being an integral part of everyone’s life. When I’m trying to explain what I do to people from outside the field, I’m often asked something like, “Where do you even go to study something like that?”  I’m also frequently asked by people already in the field where they can go to get a graduate degree in psychometrics, especially on sophisticated topics like item response theory or adaptive testing

Well, there are indeed a good number of Ph.D. programs in psychometrics, though they rarely appear with that straightforward name, as you can see below.  This can make them tough to find even if you are specifically looking for them.

Note: This list is not intended to be comprehensive, but rather a sampling of the most well-known or unique programs.

  If you want to do deeper research and are actually shopping for a grad school, I highly recommend you check out a comprehensive list of programs on the NCME website.   I also recommend the SIOP list of grad programs; they are for I/O psychology but many of them have professors with expertise in things like assessment validation or item response theory.

How to choose a graduate degree in psychometrics or assessment?

Here’s an oversimplification of how I see the selection of education…

  1. When you are in high school and selecting a university or college, you are selecting a school.
  2. When you are 18-20 and selecting a major, you are selecting a department.
  3. When you are selecting where to pursue a Master’s, you are selecting a program.
  4. When you are selecting where to pursue a Ph.D., you are selecting an advisor.

The key point: When you do a Ph.D., you are going to spend a lot of time working one on one with your advisor, both for the dissertation but also likely for research projects.  It is therefore vital that you selected someone who not only aligns with your interests (otherwise you’ll be bored and disengaged) but also whom you quite simply like enough at a personal level to work one on one for several years!

University of Minnesota: Quantitative/Psychometrics Program (Psychology) and Quantitative Foundations of Educational Research (Education)

I’m partial to this one since it is where I completed my Ph.D., with Prof. David J. Weiss in the Psychology Department.  The UMN is interesting in that it actually has two separate graduate programs in psychometrics: the one in Psychology, which has since become more focused on quantitative psychology, but also one in the Education department.

Website: https://cla.umn.edu/psychology/graduate/areas-specialization/quantitativepsychometric-methods-qpm

http://www.cehd.umn.edu/edpsych/programs/qme/

University of Massachusetts: Research, Educational Measurement, and Psychometrics (REMP)

For many years, if you wanted to learn item response theory, you read Item Response Theory. Principles and Applications by Hambleton and Swaminathan (1985).  These were two longtime professors at UMass, and it speaks to the quality of that program.  Both have since retired but the faculty remains excellent.  Also, note that the program website has a nice page on psychometric resources and software.

Website: https://www.umass.edu/remp/

University of Iowa: Center for Advanced Studies in Measurement and Assessment

This program is in the Education department, and has the advantage of being in one of the epicenters of the industry: the testing giant ACT is headquartered only a few miles away, the giant Pearson has an office in town, and the Iowa Test of Basic Skills is an offshoot of the university itself.  Like UMass, Iowa also has a website with educational materials and useful software.

Website: https://education.uiowa.edu/centers/casma

University of Wisconsin-Madison

UW has well-known professors like Daniel Bolt and James Wollack.  Plus, Madison is well-known for being a fun city given its small size.  The large K-12 testing company, Renaissance Learning, is headquartered only a few miles away.

https://edpsych.education.wisc.edu/category/quantitative-methods/

University of Nebraska – Lincoln: Quantitative, Qualitative & Psychometric Methods

For many years, the cornerstones of this program were the husband-and-wife duo of James Impara and Barbara Plake.  They’ve now retired, but excellent new professors have joined.  In addition, UNL is the home of the Buros Institute.

https://cehs.unl.edu/edpsych/quantitative-qualitative-psychometric-methods/

University of Kansas: Research, Evaluation, Measurement, and Statistics

Not far from Lincoln, NE is Lawrence, Kansas.  The program here has been around a long time, with excellent faculty.  Students have an option for practical experience working at the Achievement and Assessment Institute.

https://epsy.ku.edu/academics/educational-psychology-research/phd/overview-benefits

Michigan State University: Measurement and Quantitative Methods

Like most of the rest of these programs, it is in a vibrant college town.  The focus is more on quantitative methods than assessment.

https://education.msu.edu/cepse/mqm/

UNC-Greensboro: Educational Research, Measurement, and Evaluation

While most programs listed here are in the northern USA, this one is in the southern part of the country, where such programs are smaller and fewer.  UNCG is quite strong however.  https://soe.uncg.edu/academics/departments/erm/erm-programs/ph-d-in-educational-research-measurement-and-evaluation/

University of Texas: Quantitative Methods

UT, like some of the other programs, has an advantage in that the educational assessment arm of Pearson is located there.

https://education.utexas.edu/departments/educational-psychology/graduate-programs/quantitative-methods

Boston College: Measurement, Evaluation, Statistics, and Assessment (MESA)

This program is involved in international research such as IMSS & PIRLS.

https://www.bc.edu/bc-web/schools/lynch-school/academics/departments/mesa.html

Morgan State University: Graduate Program in Psychometrics

Morgan State is unique in that it is a historically black institution that has an excellent program dedicated to psychometrics. https://www.morgan.edu/psychometrics

Fordham University: Psychometrics and Quantitative Psychology

Fordham has an excellent program, located in New York City.

https://www.fordham.edu/info/21665/phd_in_psychometrics_and_quantitative_psychology

James Madison University: Assessment and Measurement

While not as large as the major public universities on this list, JMU has a strong, practically focused program in psychometrics.

https://www.jmu.edu/grad/programs/snapshots/psychology-assessment-and-measurement.shtml

Outside the US

University of Alberta:  Measurement, Evaluation, and Data Science

This is arguably the leading program in all of Canada.

https://www.ualberta.ca/educational-psychology/graduate-programs/measurement-evaluation-and-data-sciences/index.html

University of British Columbia: Measurement, Evaluation, and Research Methodology

UBC is home to Bruno Zumbo, one of the most prolific researchers in the field.

http://ecps.educ.ubc.ca/program/measurement-evaluation-and-research-methodology/

University of Twente: Research Methodology, Measurement, and Data Analysis

For decades, Twente has been the center of psychometrics in Europe, with professors like Wim van der Linden, Theo Eggen, Cees Glas, and Bernard Veldkamp.  It’s also linked with Cito, the premier testing company in Europe, which provides excellent opportunities to apply your skills.

https://www.utwente.nl/en/bms/omd/

University of Amsterdam: Psychological Methods

This program has a number of well-known professors, with expertise in both psychometrics and quantitative psychology.

Website: https://psyres.uva.nl/content/research-groups/programme-group-psychological-methods/programme-group-psychological-methods.html?cb

University of Cambridge: The Psychometrics Centre

The Psychometrics Centre at Cambridge includes professors John Rust and David Stillwell.  It hosted the 2015 IACAT conference and is the home to the open-source CAT platform Concerto.

https://www.psychometrics.cam.ac.uk/

KU Leuven: Research Group of Quantitative Psychology and Individual Differences

This is home to well-known researchers such as Paul De Boeck.

https://ppw.kuleuven.be/okp/home/

University of Western Australia: Pearson Psychometrics Laboratory

This is home to David Andrich, best known for the Rasch Rating Scale Model.

Website: http://www.education.uwa.edu.au/ppl

University of Oslo: Assessment, Measurement, and Evaluation

This program provides an opportunity in the Nordic/Scandinavian countries, with a program in assessment and psychometrics.

Website: https://www.uio.no/english/studies/programmes/assessment-evaluation-master

Online

There are very few programs that offer graduate training in psychometrics that is 100% online.  Here’s the only one I know of.  If you know of another one, please get in touch with me.

The University of Illinois at Chicago: Measurement, Evaluation, Statistics, and Assessment

This program is of particular because it has an online Master’s program, which allows you to get a high-quality graduate degree in psychometrics from just about anywhere in the world.  One of my colleagues here at ASC has recently enrolled in this program.

Website: https://education.uic.edu/academics-admissions/programs/measurement-evaluation-statistics-and-assessment-mesa

We hope the article helps you find the best institution to pursue your graduate degree in psychometrics.

Estimated reading time: 3 minutes

Artificial intelligence (AI) and machine learning (ML) have become buzzwords over the past few years.  As I already wrote about, they are actually old news in the field of psychometrics.   Factor analysis is a classical example of ML, and item response theory also qualifies as ML.  Computerized adaptive testing is actually an application of AI to psychometrics that dates back to the 1970s.

One thing that is very different about the world of AI/ML today is the massive power available in free platforms like R, Python, and TensorFlow.  I’ve been thinking a lot over the past few years about how these tools can impact the world of assessment.  A straightforward application is too automated essay scoring; a common way to approach that problem is through natural language processing with the “bag of words” model and utilize the document-term matrix (DTM) as predictors in a model for essay score as a criterion variable.  Surprisingly simple.  This got me to wondering where else we could apply that sort of modeling.  Obviously, student response data on selected-response items provides a ton of data, but the research questions are less clear.  So, I turned to the topic that I think has the next largest set of data and text: item banks.

Step 1: Text Mining

The first step was to explore tools for text mining in R.  I found this well-written and clear tutorial on the text2vec package and used that as my springboard.  Within minutes I was able to get a document term matrix, and in a few more minutes was able to prune it.  This DTM alone can provide useful info to an organization on their item bank, but I wanted to delve further.  Can the DTM predict item quality?

Step 2: Fit Models

To do this, I utilized both the caret and glmnet packages to fit models.  I love the caret package, but if you search the literature you’ll find it has a problem with sparse matrices, which is exactly what the DTM is.  One blog post I found said that anyone with a sparse matrix is pretty much stuck using glmnet.

I tried a few models on a small item bank of 500 items from a friend of mine, and my adjusted R squared for the prediction of IRT parameters (as an index of item quality) was 0.53 – meaning that I could account for more than half the variance of item quality just by knowing some of the common words in each item’s stem.  I wasn’t even using the answer texts n-grams, or additional information like Author and content domain.

Want to learn more about your item banks?

I’d love to swim even deeper on this issue.  If you have a large item bank and would like to work with me to analyze it so you can provide better feedback and direction to your item writers and test developers, drop me a message at nthompson@54.89.150.95!  This could directly impact the efficiency of your organization and the quality of your assessments.

 

 

The traditional Learning Management System (LMS) is designed to serve as a portal between educators and their learners. Platforms like Moodle are successful in facilitating cooperative online learning in a number of groundbreaking ways: course management, interactive discussion boards, assignment submissions, and delivery of learning content. While all of this is great, we’ve yet to see an LMS that implements best practices in assessment and psychometrics to ensure that medium or high stakes tests meet international standards.

To put it bluntly, LMS systems have assessment functionality that is usually good enough for short classroom quizzes but falls far short of what is required for a test that is used to award a credential.  A white paper on this topic is available here, but some examples include:

  • Treatment of items as reusable objects
  • Item metadata and historical use
  • Collaborative item review and versioning
  • Test assembly based on psychometrics
  • Psychometric forensics to search for non-independent test-taking behavior
  • Deeper score reporting and analytics

Assessment Systems is pleased to announce the launch of an easy-to-use bridge between FastTest and Moodle that will allow users to seamlessly deliver sound assessments from within Moodle while taking advantage of the sophisticated test development and psychometric tools available within FastTest. In addition to seamless delivery for learners, all candidate information is transferred to FastTest, eliminating the examinee import process.  The bridge makes use of the international Learning Tools Interoperability standards.

If you are already a FastTest user, watch a step-by-step tutorial on how to establish the connection, in the FastTest User Manual by logging into your FastTest workspace and selecting Manual in the upper right-hand corner. You’ll find the guide in Appendix N.

If you are not yet a FastTest user and would like to discuss how it can improve your assessments while still allowing you to leverage Moodle or other LMS systems for learning content, sign up for a free account here.

Computerized adaptive testing (CAT) is a powerful paradigm for delivering tests that are smarter, faster, and fairer than the traditional linear approach.  However, CAT is not without its challenges.  One is that it is a greedy algorithm that always selects your best items from the pool if it can.

The way that CAT researchers address this issue is with item exposure controls.  These are sub algorithms that are injected into the main item selection algorithm, to alter it from always using the best items. The Sympson-Hetter method is one such approach.

The simplest approach is called the randomesque method.

This selects from the top X items in terms of item information (a term from item response theory), usually for the first Y items in a test.  For example, instead of always selecting the top item, the algorithm finds the 3 top items and then randomly selects between those.

The Sympson-Hetter Method

A more sophisticated method is the Sympson-Hetter method.

Here, the user specifies a target proportion as a parameter for the selection algorithm.  For example, we might decide that we do not want an item seen by more than 75% of examinees.  So, every time that the CAT algorithm goes into the item pool to select a new item, we generate a random number between 0 and 1, which is then compared to the threshold.  If the number is between 0 and 0.75 in this case, we go ahead and administer the item.  If the number is from 0.75 to 1.0, we skip over it and go on to the next most informative item in the pool, though we then do the same comparison for that item.

Why do this?  It obviously limits the exposure of the item.  But just how much it limits it depends on the difficulty of the item.  A very difficult item is likely only going to be a candidate for selection for very high-ability examinees.  Let’s say it’s the top 4%… well, then the approach above will limit it to 3% of the sample overall, but 75% of the examinees in its neighborhood.

On the other hand, an item of middle difficulty is used not only for middle examinees but often for any examinee.  Remember, unless there are some controls, the first item for the test will be the same for everyone!  So if we apply the Sympson-Hetter rule to that item, it limits it to 75% exposure in a more absolute sense.

Because of this, you don’t have to set that threshold parameter to the same value for each item.  The original recommendation was to do some CAT simulation studies, then set the parameters thoughtfully for different items.  Items that are likely to be highly exposed (middle difficulty with high discrimination) might deserve a more strict parameter like 0.40.  On the other hand, that super-difficult item isn’t an exposure concern because only the top 4% of students see it anyway… so we might leave its parameter at 1.0 and therefore not limit it at all.

Is this the only method available?

No.  As mentioned, there’s that simple randomesque approach.  But there are plenty more.  You might be interested in this paper, this paper, or this paper.  The last one reviews the research literature from 1983 to 2005.

What is the original reference?

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

How can I apply this to my tests?

Well, you certainly need a CAT platform first.  Our platform at ASC allows this method right out of the box – that is, all you need to do is enter the target proportion when you publish your exam, and the Sympson-Hetter method will be implemented.  No need to write any code yourself!  Click here to sign up for a free account.

Desperation is seldom fun to see.

Some years ago, having recently released our online marking functionality I was reviewing some of the functionality in a customer workspace I was intrigued to see “Beyonce??” mentioned in a marker’s comments on an essay. The student’s essay was evaluating some poetry and had completely misunderstood the use of metaphor in the poem in question. The student also clearly knew that her interpretation was way off, but didn’t know how and had reached the end of her patience. So after a desultory attempt at answering, with a cry from the heart, reminiscent of William Wallace’s call for freedom, she wrote “BEYONCE” with about seventeen exclamation points. It felt good to see that her spirit was not broken, and it was a moment of empathy that drove home the damage that standardized tests are inflicting on our students. That vignette is playing itself out millions of time each year in this country, the following explains why.

What are “Standardized Tests”?

We use standardized tests for a variety of reasons, but underlying every reason (curriculum effectiveness, college/career preparedness, teacher effectiveness, etc.) is the understanding that the test is measuring what a student has learned. In order to know how all our students are doing, we give them all standardized tests, meaning every student receives essentially the same set of tests. So, a standardized test is a test where all students take essentially the same test. This is a difficult endeavor given the wide range of students and number of tests, and raises the question “How do we do this reliably and in a reasonable amount of time?”

Accuracy and Difficulty vs Length

We all want tests to reliably measure the students’ learning. In order to make these tests reliable, we need to supply questions of varying difficulty, from very easy to very difficult, to cover a wide range of abilities. In order to reduce the length of the test, most of the questions fall in the medium easy to medium difficulty range because that is where most of the students’ ability level will fall. So the test that best balances length and accuracy for the whole population should be constructed such that the amount of questions of any difficulty is proportionate to the number of students of that ability.

Why are most questions in the medium difficulty range? Imagine creating a test to measure 10th graders’ math ability. A small number of the students might have a couple years of calculus. If the test covered those topics, imagine the experience of most students who would often not even understand the notation in the question. Frustrating, right? On the other hand, if the test was also constructed to measure students with only rudimentary math knowledge, these average to advanced students would be frustrated and bored from answering a lot of questions on basic math facts. The solution most organizations use is to present only a few questions that are really easy or difficult, and accept that this score is not as accurate as they would prefer for the students at either end of the ability range.

These Tests are Inaccurate and Mean Spirited

The problem is that while this might work OK for a lot of kids, it exacts a pretty heavy toll on others. Almost one in five students will not know the answer to 80% of the questions on these tests, and scoring about 20% on a test certainly feels like failing. It feels like failing every time a student takes such a test. Over the course of an academic career, students in the bottom quintile will guess on or skip 10,000 questions. That is 10,000 times the student is told that school, learning, or success is not for them. Even biasing the test to be easier only makes a slight improvement.

Computerized Adaptive Testing, Test Performance with Bell Curve

The shaded area represents students who will miss at least 80% of questions.

It isn’t necessarily better for the top students whose every testing experience assures them that they are already very successful when the reality is that they are likely being outperformed by a significant percentage of their future colleagues.

In other words, at both ends of the Bell Curve, we are serving our students very poorly, inadvertently encouraging lower performing students to give up (there is some evidence that the two correlate) and higher performing students to take it easy. It is no wonder that people dislike standardized tests.

There is a Solution

A computerized adaptive test (CAT) solves all the problems outlined above. Properly constructed, a CAT has the ability to make the following faster, fairer, and more valid:

  • Every examinee completes the test in less time (fast)
  • Every examinee gets a more accurate score (valid)
  • Every examinee receives questions tuned to their ability so gets about half right (fair)

Given all the advantages of CAT, it may seem hard to believe that they are not used more often. While they are starting to catch on, it is not fast enough given the heavy toll that the old methods exact on our students. It is true that few testing providers can enable CATs, but that is simply making an excuse. If a standardized test is delivered to as few as 500 students it can be made adaptive. It probably isn’t, but it could be. All that is needed are computers or tablets, an Internet connection, and some effort. We should expect more.

How can my organization implement CAT?

While CAT used to only be feasible for large organizations that tested hundreds of thousands or millions of examinees per year, a number of advances have changed this landscape.  If you’d like to do something about your test, it might be worthwhile for you to evaluate CAT.  We can help you with that evaluation; if you’d like to chat, here is a link to schedule a meeting. Or contact me if you’d like to discuss the math or related ideas please drop me a note.

Estimated reading time: 5 minutes

Since the first tests were developed 2000 years ago for entry into the civil service of Imperial China, test security has been a concern.  The reason is quite straightforward: most threats to test security are also validity threats. The decisions we make with test scores could therefore be invalid, or at least suboptimal.  It is therefore imperative that organizations that use or develop tests should develop a Test Security Plan (TSP). 

The test security plan is a document that helps an organization anticipate test security issues, establish deterrent and detection methods, and plan responses.  It can also include validity threats not security-related, such as how to deal with examinees that have low motivation.

There are several reasons to develop a test security plan.  First, it drives greater security and therefore validity.  The TSP will enhance the legal defensibility of the testing program.  It helps to safeguard the content, which is typically an expensive investment for any organization that develops tests themselves.  If incidents do happen, they can be dealt with more swiftly and effectively.  It helps to manage all the security-related efforts.

The development of such a complex document requires a strong framework.  We advocate a framework with three phases: planning, implementation, and response.  In addition, the TSP should be revised periodically.

 

Phase 1: Planning

The first step in this phase is to list all potential threats to each assessment program at your organization.  This could include harvesting of test content, preknowledge of test content from past harvesters, copying other examinees, proxy testers, proctor help, and outside help.  Next, these should be rated on axes that are important to the organization; a simple approach would be to rate on potential impact to score validity, cost to the organization, and likelihood of occurrence.  This risk assessment exercise will help the remainder of the framework.

Next, the organization should develop the TSP.  The first piece is to identify deterrents and procedures to reduce the possibility of issues.  This includes delivery procedures (such as a lockdown browser or proctoring), proctor training manuals, a strong candidate agreement, anonymous reporting pathways, confirmation testing, and candidate identification requirements.  The second piece is to explicitly plan for psychometric forensics. 

This can range from complex collusion indices based on item response theory to simple flags, such as a candidate responding to a certain multiple choice option more than 50% of the time or obtaining a score in the top 10% but in the lowest 10% of time.  The third piece is to establish planned responses.  What will you do if a proctor reports that two candidates were copying each other?  What if someone obtains a high score in an unreasonably short time? 

What if someone obviously did not try to pass the exam, but still sat there for the allotted time?  If a candidate were to lose a job opportunity due to your response, it helps you defensibility to show that the process was established ahead of time with the input of important stakeholders.

 

Phase 2: Implementation

The second phase is to implement the relevant aspects of the Test Security Plan, such as training all proctors in accordance with the manual and login procedures, setting IP address limits, or ensuring that a new secure testing platform with lockdown is rolled out to all testing locations.  There are generally two approaches.  Proactive approaches attempt to reduce the likelihood of issues in the first place, and reactive methods happen after the test is given.  The reactive methods can be observational, quantitative, or content-focused.  Observational methods include proctor reports or an anonymous tip line.  Quantitative methods include psychometric forensics, for which you will need software like SIFT.  Content-focused methods include automated web crawling.

Both approaches require continuous attention.  You might need to train new proctors several times per year, or update your lockdown browser.  If you use a virtual proctoring service based on record-and-review, flagged candidates must be periodically reviewed.  The reactive methods are similar: incoming anonymous tips or proctor reports must be dealt with at any given time.  The least continuous aspect is some of the psychometric forensics, which depend on a large-scale data analysis; for example, you might gather data from tens of thousands of examinees in a testing window and can only do a complete analysis at that point, which could take several weeks.

 

Phase 3: Response

The third phase, of course, to put your planned responses into motion if issues are detected.  Some of these could be relatively innocuous; if a proctor is reported as not following procedures, they might need some remedial training, and it’s certainly possible that no security breach occurred.  The more dramatic responses include actions taken against the candidate.  The most lenient is to provide a warning or simply ask them to retake the test.  The most extreme methods include a full invalidation of the score with future sanctions, such as a five-year ban on taking the test again, which could prevent someone from entering a profession for which they spent 8 years and hundreds of thousands of dollars in educative preparation.

 

What does a test security plan mean for me?

It is clear that test security threats are also validity threats, and that the extensive (and expensive!) measures warrant a strategic and proactive approach in many situations.  A framework like the one advocated here will help organizations identify and prioritize threats so that the measures are appropriate for a given program.  Note that the results can be quite different if an organization has multiple programs, from a practice test to an entry level screening test to a promotional test to a professional certification or licensure.

Another important difference between test sponsors/publishers and test consumers.  In the case of an organization that purchases off-the-shelf pre-employment tests, the validity of score interpretations is of more direct concern, while the theft of content might not be an immediate concern.  Conversely, the publisher of such tests has invested heavily in the content and could be massively impacted by theft, while the copying of two examinees in the hiring organization is not of immediate concern.

In summary, there are more security threats, deterrents, procedures, and psychometric forensic methods than can be discussed in one blog post, so the focus here rather on the framework itself.  For starters, start thinking strategically about test security and how it impacts their assessment programs by using the multi-axis rating approach, then begin to develop a Test Security Plan.  The end goal is to improve the health and validity of your assessments.

Want to implement some of the security aspects discussed here, like online delivery lockdown browser, IP address limits, and proctor passwords? Sign up for a free account in FastTest!

 

One of the best aspects of my position is the opportunity to travel the world and talk with many experts about psychometrics and educational assessment.  In December 2017, I was lucky enough to travel to Monterrey, Mexico, with the dual purpose of a conference and a Psychometrics Seminar.  It was an exciting week that taught me a lot about education in Latin America.

The first half of the week was the Congreso Internacional de Innovacion Educativa, the premier conference in educational technology in Latin America, with more than 3,000 attendees.  I had the opportunity to present on a project we are doing with Tecnologico de Monterrey to implement adaptive testing in admissions exams.  Tecnologico is the #2 university in Mexico, and therefore a leader in all of Latin America, as well as being the host of the conference.  In addition, I heard of number of interesting talks, including one by the CEO of Coursera on the role of MOOCs in higher education.  I’m personally a MOOC addict, though I tend to start new courses far faster than I ever finish one.

 

adaptive testing seminar

The second half of the week was the 2017 ASC Learning Summit, a 2-day seminar to teach item response theory and adaptive testing to anyone that wanted to learn.  We had 22 attendees, which was extremely successful given the short notice. (Tec had to move the entire conference from Mexico City to Monterrey after an earthquake)  We had 18 from Mexico, two from the US, one from Barbados, and one from the UK.  These included university professors, K12 assessment professionals, higher education admissions staff, certification test psychometricians, and more.

 

 

If time allows, I hope to record my presentations from that summit as a series of webinars or videos, so join our mailing list to stay tuned if you aren’t already on it.  In addition, I’ll be holding similar seminars in the future.  If you are interested in having me visit your country for such an event, get in touch with me at nthompson@54.89.150.95.