
A test battery or assessment battery is a set of multiple psychometrically distinct exams delivered in one administration.  In some cases, these are various tests that are cobbled together for related purposes, such as a psychologist testing an 8-year-old child on intelligence, anxiety, and the autism spectrum.  However, in many cases it is a single test title that we often refer to as a single test but is actually several separate tests, like a university admissions test that has English, Math, and Logical Reasoning components.  Why do so? The key here is that we want to keep them psychometrically separate, but maximize the amount of information about the person to meet the purposes of the test.

Learn more about our powerful exam platform that allows you to easily develop and deliver test batteries.

 

Examples of a Test Battery

Test batteries are used in a variety of fields, pretty much anywhere assessment is done.

Admissions and Placement Testing

The classic example is a university admissions test that has English, Math, and Logic portions.  These are separate tests, and psychometricians would calculate the reliability and other important statistics separately.  However, the scores are combined at the end to get an overall picture of examinee aptitude or achievement, which is then used to maximally predict 4-year graduation rates and other important criterion variables.
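To illustrate what "calculating reliability separately" can look like, here is a minimal sketch that computes Cronbach's alpha for each subtest on its own.  The choice of alpha as the reliability index, the function name `cronbach_alpha`, and the tiny response matrices are all assumptions for illustration, not data from a real exam.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's alpha for one subtest.

    rows: list of examinee response vectors (one score per item).
    """
    n_items = len(rows[0])
    # Variance of each item (column) across examinees
    item_vars = [pvariance([row[j] for row in rows]) for j in range(n_items)]
    # Variance of examinees' total scores on this subtest
    total_var = pvariance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical scored responses (1 = correct, 0 = incorrect), one matrix per subtest
battery = {
    "English": [[1, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0]],
    "Math":    [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 0, 0]],
}

# Each subtest is analyzed on its own, never pooled into one item set
for subtest, rows in battery.items():
    print(subtest, round(cronbach_alpha(rows), 3))
```

The point of the loop is that each subtest gets its own coefficient; pooling all items into one analysis would treat the battery as a single test.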

Why is it called a battery?  Because we are battering the poor student with not just one, but many exams!

Pre-Employment Testing

Exam batteries are often used in pre-employment testing.  You might get tested on computer skills, numerical reasoning, and noncognitive traits such as integrity or conscientiousness. These are used together to gain incremental validity.  A good example is the CAT-ASVAB, which is the selection test to get into the US Armed Forces.  There are 10 tests (vocabulary, math, mechanical aptitude…).

Psychological or Psychoeducational Assessment

In a clinical setting, clinicians will often use a battery of tests, such as IQ, autism, anxiety, and depression.  Some IQ tests are themselves a battery, as they might assess visual reasoning, logical reasoning, numerical reasoning, etc.  However, these have a positive manifold, meaning that they correlate quite highly with each other.  Another example is the Woodcock-Johnson.

K-12 Educational Assessment

Many large-scale tests that are used in schools are considered a battery, though often with only 2 or 3 aspects.  A common one in the USA is the NWEA Measures of Academic Progress.

 

Composite Scores

A composite score is a combination of scores in a battery.  If you took an admissions test like the SAT or GRE, you may recall that it adds your scores on the different subtests, while the ACT takes the average.  The ASVAB takes a linear combination of the 4 most important subtests and uses it for admission; the others are used for job matching.
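The three kinds of composite described above (sum, average, weighted linear combination) can be sketched as follows.  The subtest names, scores, and weights are hypothetical and are not the actual scoring rules of the SAT, ACT, or ASVAB.

```python
def sum_composite(scores):
    """Composite as the simple sum of all subtest scores."""
    return sum(scores.values())

def mean_composite(scores):
    """Composite as the average of all subtest scores."""
    return sum(scores.values()) / len(scores)

def weighted_composite(scores, weights):
    """Composite as a linear combination of selected subtests.

    Only subtests named in `weights` contribute; the rest are ignored.
    """
    return sum(weight * scores[name] for name, weight in weights.items())

# Hypothetical examinee scores on a three-part battery
scores = {"English": 24, "Math": 30, "Logic": 27}

# Hypothetical weights: Math counts double, Logic is left out of this composite
weights = {"English": 1.0, "Math": 2.0}

print(sum_composite(scores))
print(mean_composite(scores))
print(weighted_composite(scores, weights))
```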

 

A Different Animal: Test with Sections

The battery is different than a single test that has distinct sections.  For example, a K12 English test might have 10 vocab items, 10 sentence-completion grammar items, and 2 essays.  Such tests are usually analyzed as a single test, as they are psychometrically unidimensional.

 

How to Deliver A Test Battery

In ASC’s platforms,  Assess.ai  and  FastTest, all this functionality is available out of the box: test batteries, composite scores, and sections within a test.  Moreover, they come with a lot of important functionality, such as separation of time limits, navigation controls, customizable score reporting, and more.  Click here to request a free account and start applying best practices.

Paper-and-pencil testing used to be the only way to deliver assessments at scale.  The introduction of computer-based testing (CBT) in the 1980s was a revelation – higher fidelity item types, immediate scoring & feedback, and scalability all changed with the advent of the personal computer and then later the internet.  Delivery mechanisms including remote proctoring provided students with the ability to take their exams anywhere in the world.  This all exploded tenfold when the pandemic arrived.  So why are some exams still offline, with paper and pencil?

Many education institutions are confused about which examination models to stick to.  Should you go on with the online model you used when everyone was stuck in their homes?  Should you adopt multi-modal examination models, or should you go back to the traditional pen-and-paper method?  

This blog post will provide you with an evaluation of whether paper-and-pencil exams are still worth it in 2021. 

 

Paper-and-pencil testing: The good, the bad, and the ugly

The Good

Offline exams have been a stepping stone towards the development of modern assessment models that are more effective. We can’t ignore the fact that there are several advantages of traditional exams. 

Some advantages of paper-and-pencil testing include students having familiarity with the system, development of a social connection between learners, exemption from technical glitches, and affordability. Some schools don’t have the resources and pen-and-paper assessments are the only option available. 

This is especially true in areas of the world that do not have the internet bandwidth or other technology necessary to deliver internet-based testing.

Another advantage of paper exams is that they can often work better for students with special needs, such as blind students who need a reader.

Paper and pencil testing is often more cost-efficient in certain situations where the organization does not have access to a professional assessment platform or learning management system.

 

The Bad and The Ugly

However, paper-and-pencil testing does have a number of shortfalls.

1. Needs a lot of resources to scale

Delivery of paper-and-pencil testing at large scale requires a lot of resources. You are printing and shipping, sometimes with hundreds of trucks around the country.  Then you need to get all the exams back, which is even more of a logistical lift.

2. Prone to cheating

Most people think that offline exams are cheat-proof but that is not the case. Most offline exams count on invigilators and supervisors to make sure that cheating does not occur. However, many pen-and-paper assessments are open to leakages. A high candidate-to-invigilator ratio is another factor that contributes to cheating in offline exams.

3. Poor student engagement

We live in a world of instant gratification and that is the same when it comes to assessments. Unlike online exams, which have options to keep the students engaged, offline exams are open to constant distraction from external factors.

Offline exams also have few options when it comes to question types. 

4. Time to score

“To err is human.” But when it comes to assessments, accuracy and consistency are essential. Traditional methods of hand-scoring paper tests are slow and labor-intensive. Instructors take a long time to evaluate tests. This defeats the entire purpose of assessments.

5. Poor result analysis

Pen-and-paper exams depend on instructors to analyze the results and come up with insight. This requires a lot of human resources and expensive software. It is also difficult to find out whether your learning strategy is working or needs some adjustments. 

6. Time to release results

Online exams can be immediate.  If you ship paper exams back to a single location, score them, perform psychometrics, then mail out paper result letters?  Weeks.

7. Slow availability of results to analyze

Similarly, psychometricians and other stakeholders do not have immediate access to results.  This delays psychometric analysis and timely feedback to students and teachers, among other issues.

8. Accessibility

Online exams can be built with tools for zoom, color contrast changes, automated text-to-speech, and other things to support accessibility.

9. Convenience


Online tests are much more easily distributed.  If you publish one on the cloud, it can immediately be taken, anywhere in the world.

10. Support for diversified question types

Unlike traditional exams which are limited to a certain number of question types, online exams offer many question types.  Videos, audio, drag and drop, high-fidelity simulations, gamification, and much more are possible.

11. Lack of modern psychometrics

Paper exams cannot use computerized adaptive testing, linear-on-the-fly testing, process data, computational psychometrics, and other modern innovations.

12. Environmental friendliness

Sustainability is an important aspect of modern civilization.  Online exams eliminate the need to use resources that are not environmentally friendly such as paper. 

 

Conclusion

Is paper-and-pencil testing still useful?  In most situations, it is not.  The disadvantages outweigh the advantages.  However, there are many situations where paper remains the only option, such as poor tech infrastructure.

How ASC Can Help 

Transitioning from paper-and-pencil testing to the cloud is not a simple task.  That is why ASC is here to help you every step of the way, from test development to delivery.  We provide you with the best assessment software and access to the most experienced team of psychometricians.  Ready to take your assessments online?  Contact us!

 


Student assessment tools are a critical part of educational technology and are often integrated into Learning Management Systems (LMS), providing a seamless platform for both instruction and evaluation. These tools serve the important role of evaluating what the student has learned, so that the instructor and other stakeholders can adjust the instruction, either individually or at the aggregate level (school, district, state, country). The assessment of mathematical skills is particularly important, as it helps identify students’ proficiency and areas needing improvement in a subject that is foundational to many academic and career paths. However, there is a massive range of student assessment tools, from free software like Google Forms, which by its name is obviously not designed for educational assessment, to enterprise platforms designed for nationwide high-stakes exams. These tools can include various types of assessments, such as speeded and power tests, which measure different aspects of student performance. In recent years, there has been a growing trend towards incorporating gamification elements into assessment tools, aiming to enhance student engagement and motivation.

Here are some aspects to consider when evaluating student assessment tools.

Reporting, analysis, and visualization


This is the most important consideration to make when evaluating student assessment tools. Reports are a measure of progress. They help educational institutions and businesses adjust their learning processes to improve assessment effectiveness.

Some tools to look for in reporting and analysis include psychometric software such as Xcalibre (IRT analysis), Iteman (classical analysis), CITAS, and many others. These tools should have visualization capabilities such as creating graphs and charts in relation to the assessment process.  The output should also be in a format that is easy to interpret.

Interested in getting free access to some of these psychometric analytical tools including Xcalibre, Iteman, and many others? Fill out this  form  to get free access to the tools.

Scalability

The most common mistake you can make when looking for student assessment tools is not evaluating how robust the platform is. This can disrupt the learning process and become very costly, since you will have to get another system.  The ideal platform should be robust enough to handle any form of workload.

Ease-of-use 

“Everything should be made as simple as possible, but no simpler.”

- Albert Einstein

The best software is one that offers sophisticated solutions in a way that anyone can use. Student assessment software, especially in education, should be as simple as possible to use.

The interface shouldn’t be intimidating, and it should have important functions such as autosaving answers to avoid frustrating the examinee. The software should be cloud-based with no need to install it on devices.   The process of creating and managing item banks should be as simple as possible.

Item banking

Item Banking refers to the development and management of a large pool of high-quality test questions.  Items are treated as reusable objects, which allows you to more efficiently publish new test forms.  Items are stored with extensive historical metadata to drive validity.

The right student assessment tool should also support the best practices in workflow management and support collaboration.

Automated item generation

Automated item generation (AIG) refers to software tools that make it easier to generate new questions.  These can be template-based, as seen below, or generative based on LLMs like ChatGPT.

item template cpr.001
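As a rough illustration of the template-based approach, the sketch below fills a stem template with every combination of variable values and computes the answer key for each generated item.  The template text, the variable values, and the helper name `generate_items` are hypothetical; real AIG tools offer far more control over distractors and constraints.

```python
from itertools import product

# Hypothetical item template with two variable slots
TEMPLATE = "A train travels {distance} km in {hours} hours. What is its average speed in km/h?"

def generate_items(template, variables, key_fn):
    """Create one item per combination of variable values.

    variables: dict mapping slot name -> list of allowed values
    key_fn:    function that computes the correct answer from the slot values
    """
    names = list(variables)
    items = []
    for values in product(*(variables[name] for name in names)):
        slots = dict(zip(names, values))
        items.append({"stem": template.format(**slots), "key": key_fn(**slots)})
    return items

items = generate_items(
    TEMPLATE,
    {"distance": [120, 180], "hours": [2, 3]},
    key_fn=lambda distance, hours: distance / hours,
)

for item in items:
    print(item["stem"], "->", item["key"])
```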

 

Compatibility with existing systems

Most businesses and education institutions already have a Learning Management System (LMS) in their workflow. The right student assessment tool should therefore be easy to sync with the existing system.  This is important because it would be costly and time-consuming to re-develop their entire system to integrate an assessment tool into the process.

Enhanced student assessment security

Cheating is one of the greatest concerns when it comes to student assessments. It is important to check the technologies and methods used by the software to curb cheating.  Here are some functionalities you should look for in student assessment security:

Lockdown browser

This is a feature that limits the examinees to one screen. This stops them from accessing files from local storage or getting help. If an examinee attempts to access external software or a private tab in the browser, a notification is sent to the proctor, who will take action.

AI flagging

AI flagging helps supervisors spot any suspicious behaviors using audio and video captured during the examination period. Some actions that may indicate cheating include background audio, extra faces on the screen, and suspicious body language.

AI-Flagging In Action (Assess.com)

 

IP-based authentication

This is an interesting feature since it helps deter impersonation by using the examinees’ IP addresses as part of user identification. This can also help detect cheating through remote access tools.

These are a few functionalities to look out for when vetting the security level of a student assessment platform. If you want to learn more about assessment security, feel free to check out this blog post.

Good customer support  

We all get stuck once in a while and good customer support shouldn’t be ignored when looking for an assessment tool.

  • How long do they take to reply to a query by a customer?
  • Do they have an FAQ page?
  • How well is the software documented?
  • Have they implemented self-service support into their process?

Consider asking these questions when vetting customer support in student assessment tools.

Student assessment tool checklist

To sum up the article, here is a checklist to help you find the right platform for your needs.

  • What cheating prevention methods does it offer? (Lockdown browser, IP-based authentication, and AI flagging)
  • How good is their item authoring functionality? Go for the one with tech-enhanced item types, classical item statistics storage, and a separate module for managing multimedia files.
  • How does the software offer online delivery? Look for adaptive testing capabilities, customizable options, and brand-ability.
  • What is their reputation? Always be sure to check out what other people say about the brand and the software. How are their reviews online? Have they won any Ed-tech awards?
  • How good is their reporting? Choose the tool that offers classical item performance reports with  Iteman  and has visualization capabilities.
  • Does the software support remote proctoring?
  • Are all their test development modules in alignment with the best psychometric practices?
  • Do they offer multichannel support? How good is their documentation?
  • Is the software easy to use? Is it accessible from anywhere?  Always go for user-friendly software.
  • How well does it integrate with existing systems?
  • What type of assessments (formative, diagnostic, summative, synoptic, ipsative, or work-integrated assessments) are you looking for? Use this resource to help you differentiate between the types of online assessments.
  • Do they have an experienced team to help in test development and other consulting services?

Finding the right software for your needs is hard, especially in this competitive market. We hope this article, the long checklist specifically, helps you find the right exam software. If you are interested in having access to the best student assessment tools and psychometrics consulting, feel free to Contact us to discuss your needs.

If you are interested in leadership assessments, you might want to check out this post.

Item banking refers to the purposeful creation of a database of assessment items to serve as a central repository of all test content, improving efficiency and quality. The term item refers to what many call questions; though their content need not be restricted as such and can include problems to solve or situations to evaluate in addition to straightforward questions. Regular item review is essential to ensure that each item meets content standards, is fair, and is free from bias, thereby maintaining the integrity and accuracy of the item bank. As a critical foundation to the test development cycle, item banking is the foundation for the development of valid, reliable content and defensible test forms.

Automated item banking systems, such as  Assess.ai  or  FastTest, result in significantly reduced administrative time for developing/reviewing items and assembling/publishing tests, while producing exams that have greater reliability and validity.  Contact us to request a free account.

What is Item Banking?

While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Here are the essentials you should be looking for:

  • Items are reusable objects. When selecting an item banking platform, it is important to ensure that items can be used more than once; ideally, item performance should be tracked not only within a test form but across test forms as well.
  • Item history and usage are tracked. The usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. Upon exposure to many examinees, items can then be flagged for retirement or revision to reduce cheating or teaching to the test.
  • Items can be sorted. As test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization methods, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.
  • Item versions are tracked. As items appear on test forms, their content may be revised for clarity. Any such changes should be tracked, and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.
  • Review process workflow is tracked. As items are revised and versioned, it is imperative that the changes in content and the users who made these changes are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item can expedite that process.
  • Metadata is recorded. Any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, item response theory parameters, and classical test theory statistics, but there are likely many data points specific to your organization that are worth storing.
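One way to picture these essentials is as a small data structure.  The sketch below is only an illustration under assumed names (`Item`, `ItemVersion`, `needs_review`, the 500-exposure threshold); it is not how any particular item banking platform stores its data.

```python
from dataclasses import dataclass, field

@dataclass
class ItemVersion:
    number: int   # 1, 2, 3, ... in order of revision
    text: str     # item content at this version
    editor: str   # who made the change (review workflow tracking)

@dataclass
class Item:
    name: str
    content_area: str
    author: str
    versions: list = field(default_factory=list)   # linked version history
    exposures: int = 0                              # examinees who have seen the item
    metadata: dict = field(default_factory=dict)    # e.g. statistics, source, depth of knowledge

    def revise(self, text, editor):
        """Store a new version instead of overwriting the old one."""
        self.versions.append(ItemVersion(len(self.versions) + 1, text, editor))

    @property
    def current_text(self):
        return self.versions[-1].text

    def record_administration(self, n_examinees):
        """Track usage each time the item is delivered on a form."""
        self.exposures += n_examinees

    def needs_review(self, max_exposures):
        """Flag the item for retirement or revision once it is over-exposed."""
        return self.exposures >= max_exposures

item = Item(name="BIO.0001", content_area="Biology", author="J. Doe")
item.revise("Which organelle produces ATP?", editor="J. Doe")
item.revise("Which organelle produces most of a cell's ATP?", editor="A. Reviewer")
item.record_administration(600)
print(item.current_text, len(item.versions), item.needs_review(max_exposures=500))
```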

 

Managing an Item Bank

Names are important. As you create or import your item banks it is important to identify each item with a unique, but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting.  You might want to also add additional pieces of information.  If importing, the system should be smart enough to recognize duplicates.
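The reason for leading zeros shows up as soon as names are sorted as text.  Here is a small sketch; the `CPR` prefix, the four-digit width, and the helper `item_name` are assumed conventions for illustration only.

```python
def item_name(prefix, number, width=4):
    """Build a name like 'CPR.0007' with a zero-padded sequence number."""
    return f"{prefix}.{number:0{width}d}"

unpadded = ["CPR.10", "CPR.2", "CPR.1"]
padded = [item_name("CPR", n) for n in (10, 2, 1)]

# Text sorting puts 'CPR.10' before 'CPR.2' when numbers are not padded
print(sorted(unpadded))  # ['CPR.1', 'CPR.10', 'CPR.2']

# With leading zeros, text order matches numerical order
print(sorted(padded))    # ['CPR.0001', 'CPR.0002', 'CPR.0010']
```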

Search and filter. The system should also have a reliable sorting mechanism. 


 

Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department, or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability, and validity.
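For readers less familiar with these statistics, here is a minimal sketch of classical difficulty (proportion correct) and discrimination (point-biserial correlation between the item and the total score).  The response matrix is made up, and the point-biserial shown is the uncorrected form, in which the item is included in the total score.

```python
from statistics import mean, pstdev

def item_difficulty(responses, j):
    """Proportion of examinees answering item j correctly (the classical P value)."""
    return mean(row[j] for row in responses)

def point_biserial(responses, j):
    """Correlation between item j (0/1) and the total score.

    Uncorrected: item j is part of the total. Undefined (division by zero)
    if every examinee has the same score on the item or on the test.
    """
    item = [row[j] for row in responses]
    total = [sum(row) for row in responses]
    mean_item, mean_total = mean(item), mean(total)
    covariance = mean((i - mean_item) * (t - mean_total) for i, t in zip(item, total))
    return covariance / (pstdev(item) * pstdev(total))

# Hypothetical scored responses: rows are examinees, columns are items
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]

for j in range(3):
    print(j, item_difficulty(responses, j), round(point_biserial(responses, j), 3))
```

Values like these are what would be stored in the item's metadata after each administration.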

Statistics are used in the assembly of test forms while classical statistics can be used to estimate mean, standard deviation, reliability, standard error, and pass rate. 


Item response theory parameters can come in handy when calculating test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
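As a sketch of how stored parameters feed those functions, the code below assumes a two-parameter logistic (2PL) model without the 1.7 scaling constant; the model choice, the function names, and the item parameters are assumptions for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2PL item: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def form_information(theta, items):
    """Test information is the sum of item information across the form."""
    return sum(item_information(theta, a, b) for a, b in items)

def standard_error(theta, items):
    """Conditional standard error of measurement: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(form_information(theta, items))

# Hypothetical (a, b) parameters pulled from item metadata
form = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]

for theta in (-2, -1, 0, 1, 2):
    print(theta, round(form_information(theta, form), 3), round(standard_error(theta, form), 3))
```

Comparing these curves across draft forms before delivery is one way the stored parameters support pre-equating.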

In the event that your organization decides to publish an adaptive test, utilizing computerized adaptive testing delivery, item parameters for each item will be essential. This is because they are used for intelligent selection of items and scoring examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility and the storage of metadata is one such vital piece of documentation.
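To make "intelligent selection of items" concrete, here is a minimal sketch of one common rule: pick the unused item with the most information at the examinee's current ability estimate.  It assumes a 2PL model and hypothetical item names and parameters; operational adaptive tests typically add exposure control and content constraints on top of this.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_next_item(theta, bank, administered):
    """Return the name of the unused item with maximum information at theta.

    bank:         dict mapping item name -> (a, b) parameters
    administered: set of item names already given to this examinee
    """
    candidates = {name: params for name, params in bank.items() if name not in administered}
    return max(candidates, key=lambda name: item_information(theta, *candidates[name]))

# Hypothetical bank with equal discrimination and spread-out difficulty
bank = {"I1": (1.0, -1.0), "I2": (1.0, 0.0), "I3": (1.0, 1.0)}

print(select_next_item(0.0, bank, administered=set()))
print(select_next_item(0.9, bank, administered={"I3"}))
```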

 

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test-takers, as they are likely to be diverse as well!


 

Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. In the event of a high school science test, this may include the evaluation of the content taught in class. A high-stakes certification exam almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed.

Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.
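A blueprint-driven selection can be sketched as below.  The bank, the blueprint counts, and the random draw within each content area are assumptions for illustration; in practice, test developers usually also weigh item statistics and exposure when choosing.

```python
import random

def assemble_form(bank, blueprint, seed=0):
    """Draw the number of items the blueprint requires from each content area.

    bank:      list of dicts with at least 'name' and 'content_area'
    blueprint: dict mapping content area -> number of items required
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    form = []
    for area, count in blueprint.items():
        pool = [item for item in bank if item["content_area"] == area]
        if len(pool) < count:
            raise ValueError(f"Not enough items in {area}: need {count}, have {len(pool)}")
        form.extend(rng.sample(pool, count))
    return form

# Hypothetical categorized item bank
bank = [
    {"name": "BIO.0001", "content_area": "Biology"},
    {"name": "BIO.0002", "content_area": "Biology"},
    {"name": "BIO.0003", "content_area": "Biology"},
    {"name": "CHM.0001", "content_area": "Chemistry"},
    {"name": "CHM.0002", "content_area": "Chemistry"},
    {"name": "PHY.0001", "content_area": "Physics"},
]

# Hypothetical blueprint: how many items each area contributes to the form
blueprint = {"Biology": 2, "Chemistry": 1, "Physics": 1}

form = assemble_form(bank, blueprint)
print([item["name"] for item in form])
```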

 

The Benefits of Item Banking

There is no doubt that item banking is a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment. Moreover, good item banking will make the job easier and more efficient thus reducing the cost of item development and test publishing.

 

Ready to Improve assessment quality through item banking?

Visit our Contact Us page, where you can request a demonstration or a free account (up to 500 items).

Artificial intelligence (AI) and machine learning (ML) have become buzzwords over the past few years.  As I already wrote about, they are actually old news in the field of psychometrics.   Factor analysis is a classical example of ML, and item response theory (IRT) also qualifies as ML.  Computerized adaptive testing (CAT) is actually an application of AI to psychometrics that dates back to the 1970s.

One thing that is very different about the world of AI/ML today is the massive power available in free platforms like R, Python, and TensorFlow.  I’ve been thinking a lot over the past few years about how these tools can impact the world of assessment.  A straightforward application is to automated essay scoring; a common way to approach that problem is through natural language processing with the “bag of words” model, utilizing the document-term matrix (DTM) as predictors in a model with essay score as the criterion variable.  Surprisingly simple.  This got me to wondering where else we could apply that sort of modeling.  Obviously, student response data on selected-response items provides a ton of data, but the research questions are less clear.  So, I turned to the topic that I think has the next largest set of data and text: item banks.

Step 1: Text Mining

The first step was to explore tools for text mining in R.  I found this well-written and clear tutorial on the text2vec package and used that as my springboard.  Within minutes I was able to get a document term matrix, and in a few more minutes was able to prune it.  This DTM alone can provide useful info to an organization on their item bank, but I wanted to delve further.  Can the DTM predict item quality?
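For readers who have not seen a DTM, here is a minimal pure-Python sketch of building and pruning one from item stems.  It is a stand-in for illustration only, not the text2vec workflow I used in R; the stems, the tokenizer, and the `min_doc_count` pruning rule are all assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_dtm(documents, min_doc_count=1):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in document d.

    Terms appearing in fewer than min_doc_count documents are pruned.
    """
    token_lists = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vocab = sorted(term for term, n in doc_freq.items() if n >= min_doc_count)
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            if term in index:
                row[index[term]] = count
        matrix.append(row)
    return vocab, matrix

# Hypothetical item stems
stems = [
    "Which of the following is a mammal?",
    "Which planet is the largest?",
    "What is the capital of France?",
]

vocab, dtm = build_dtm(stems, min_doc_count=2)
print(vocab)
for row in dtm:
    print(row)
```

Each row is one item and each column is one term, which is why the matrix becomes very sparse for a real bank with thousands of distinct words.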

Step 2: Fit Models

To do this, I utilized both the caret and glmnet packages to fit models.  I love the caret package, but if you search the literature you’ll find it has a problem with sparse matrices, which is exactly what the DTM is.  One blog post I found said that anyone with a sparse matrix is pretty much stuck using glmnet.

I tried a few models on a small item bank of 500 items from a friend of mine, and my adjusted R squared for the prediction of IRT parameters (as an index of item quality) was 0.53 – meaning that I could account for more than half the variance of item quality just by knowing some of the common words in each item’s stem.  I wasn’t even using the answer texts, n-grams, or additional information like author and content domain.
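Since a DTM can easily have many predictors relative to the number of items, the adjustment matters.  The standard formula is sketched below; the R squared value and predictor count in the example are hypothetical and are not the figures from my analysis.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Penalizes R^2 for the number of predictors p relative to sample size n.
    """
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical: 500 items, 50 retained terms, raw R^2 of 0.60
print(round(adjusted_r_squared(0.60, n_obs=500, n_predictors=50), 3))
```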

Want to learn more about your item banks?

I’d love to swim even deeper on this issue.  If you have a large item bank and would like to work with me to analyze it so you can provide better feedback and direction to your item writers and test developers, drop me a message at solutions@assess.com!  This could directly impact the efficiency of your organization and the quality of your assessments.

standard setting

If you have worked in the field of assessment and psychometrics, you have undoubtedly encountered the word “standard.” While a relatively simple word, it has the potential to be confusing because it is used in three (and more!) completely different but very important ways. Here’s a brief discussion.

Standard = Cutscore

As noted by the well-known professor Gregory Cizek here, “standard setting refers to the process of establishing one or more cut scores on a test.” The various methods of setting a cutscore, like Angoff or Bookmark, are referred to as standard setting studies. In this context, the standard is the bar that separates a Pass from a Fail. We use methods like the ones mentioned to determine this bar in as scientific and defensible a fashion as possible, and give it more concrete meaning than an arbitrarily selected round number like 70%. Selecting a round number like that will likely get you sued since there is no criterion-referenced interpretation.
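As a rough sketch of the arithmetic behind a modified-Angoff study: each judge estimates, for every item, the probability that a minimally competent candidate would answer correctly; the ratings are averaged across judges and summed across items to give a raw cutscore.  The ratings below are hypothetical, and real studies add steps such as rater training, discussion rounds, and rater agreement checks.

```python
from statistics import mean

def angoff_cutscore(ratings):
    """Raw cutscore from Angoff ratings.

    ratings[judge][item] is the judged probability (0 to 1) that a minimally
    competent candidate answers that item correctly.
    """
    n_items = len(ratings[0])
    # Average across judges for each item, then sum across items
    item_means = [mean(judge[j] for judge in ratings) for j in range(n_items)]
    return sum(item_means)

# Hypothetical panel: 3 judges rating a 5-item test
ratings = [
    [0.60, 0.80, 0.70, 0.90, 0.50],
    [0.70, 0.70, 0.60, 0.80, 0.60],
    [0.65, 0.75, 0.65, 0.85, 0.55],
]

cut = angoff_cutscore(ratings)
print(round(cut, 2), "out of", len(ratings[0]), "points")
```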

Standard = Blueprint

If you work in the field of education, you often hear the term “educational standards.” These refer to the curriculum blueprints for an educational system, which also translate into assessment blueprints, because you want to assess what is on the curriculum. Several important ones in the USA are noted here, perhaps the most common of which nowadays is the Common Core State Standards, which attempted to standardize the standards across states. These standards exist to standardize the educational system, by teaching what a group of experts have agreed upon should be taught in 6th grade Math classes for example. Note that they don’t state how or when a topic should be taught, merely that 6th Grade Math should cover Number Lines, Measurement Scales, Variables, whatever – sometime in the year.

Standard = Guideline

If you work in the field of professional certification, you hear the term just as often but in a different context, accreditation standards. The two most common are the National Commission for Certifying Agencies (NCCA) and the ANSI National Accreditation Board (ANAB). These two organizations are a consortium of credentialing bodies that give a stamp of approval to credentialing bodies, stating that a Certification or Certificate program is legit. Why? Because there is no law to stop me from buying a textbook on any topic, writing 50 test questions in my basement, and selling it as a Certification. It is completely a situation of caveat emptor, and these organizations are helping the buyers by giving a stamp of approval that the certification was developed with accepted practices like a Job Analysis, Standard Setting Study, etc.

In addition, there are the professional standards for our field. These are guidelines on assessment in general rather than just credentialing. Two great examples are the AERA/APA/NCME Standards for Educational and Psychological Testing and the International Test Commission’s Guidelines (yes they switch to that term) on various topics.

Also: Standardized = Equivalent Conditions

The word is also used quite frequently in the context of standardized testing, though it is rarely chopped to the root word “standard.” In this case, it refers to the fact that the test is given under equivalent conditions to provide greater fairness and validity. A standardized test does NOT mean multiple choice, bubble sheets, or any of the other pop connotations that are carried with it. It just means that we are standardizing the assessment and the administration process. Think of it as a scientific experiment; the basic premise of the scientific method is holding all variables constant except the variable in question, which in this case is the student’s ability. So we ensure that all students receive a psychometrically equivalent exam, with equivalent (as much as possible) writing utensils, scrap paper, computer, time limit, and all other practical surroundings. The problem comes with the lack of equivalence in access to study materials, prep coaching, education, and many bigger questions… but those are a societal issue and not a psychometric one.

So despite all the bashing that the term gets, a standardized test is MUCH better than the alternatives of no assessment at all, or an assessment that is not a level playing field and has low reliability. Consider the case of hiring employees: if assessments were not used to provide objective information on applicant skills and we could only use interviews (which are famously subjective and inaccurate), all hiring would be virtually random and the number of incompetent people in jobs would increase a hundredfold. And don’t we already have enough people in jobs where they don’t belong?


One of the most cliche phrases associated with assessment is “teaching to the test.”  I’ve always hated this phrase, because it is only used in a derogatory manner, almost always by people who do not understand the basics of assessment and psychometrics.  I recently saw it mentioned in this article on PISA, and that was one time too many, especially since it was used in an oblique, vague, and unreferenced manner.

So, I’m going to come out and say something very unpopular: in most cases, TEACHING TO THE TEST IS A GOOD THING.

Why teaching to the test is usually a good thing

If the test reflects the curriculum – which any good test will – then someone who is teaching to the test will be teaching to the curriculum. Which, of course, is the entire goal of teaching. The phrase “teaching to the test” is used in an insulting sense, especially because the alliteration is resounding and sellable, but it’s really not a bad thing in most cases.  If a curriculum says that 4th graders should learn how to add and divide fractions, and the test evaluates this, what is the problem? Especially if it uses modern methodology like adaptive testing or tech-enhanced items to make the process more engaging and instructional, rather than oversimplifying to a text-only multiple choice question on paper bubble sheets?

In the world of credentialing assessment, this is an extremely important link.  Credential tests start with a job analysis study, which surveys professionals to determine what they consider to be the most important and frequently used skills in the job.  This data is then transformed into test blueprints. Instructors for the profession, as well as aspiring students who are studying to pass the test, then focus on what is in the blueprints.  This, of course, still contains the skills that are most important and frequently used in the job!

So what is the problem then?

Now, telling teachers how to teach is more concerning, and more likely to be a bad thing.  Finland does well because it gives teachers lots of training and then power to choose how they teach, as noted in the PISA article.

As a counterexample, my high school math department made an edict starting my sophomore year that all teachers had to use the “Chicago Method.” It was pure bunk and based on the fact that students should be doing as much busy work as possible instead of the teachers actually teaching. I think it is because some salesman convinced the department head to make the switch so that they would buy a thousand brand new textbooks.  The method makes some decent points (here’s an article from, coincidentally, when I was a sophomore in high school) but I think we ended up with a bastardization of it, as the edict was primarily:

  1. Assign students to read the next chapter in class (instead of teaching them!); go sit at your desk.
  2. Assign students to do at least 30 homework questions overnight, and come back tomorrow with any questions they have.
  3. Answer any questions, then assign them the next chapter to read.  Whatever you do, DO NOT teach them about the topic before they start doing the homework questions.  Go sit at your desk.

Isn’t that preposterous?  Unsurprisingly, after two years of this, I went from being a leader of the Math Team to someone who explicitly said “I am never taking Math again”.  And indeed, I managed to avoid all math during my senior year of high school and first year of college. Thankfully, I had incredible professors in my years at Luther College, leading to me loving math again, earning a math major, and applying to grad school in psychometrics.  This shows the effect that might happen with “telling teachers how to teach.” Or in this case, specifically – and bizarrely – to NOT teach.

What about all the bad tests out there?

Now, let’s get back to the assumption that a test does reflect a curriculum/blueprints.  There are, most certainly, plenty of cases where an assessment is not designed or built well.  That’s an entirely different problem, and is an entirely valid concern. I have seen a number of these in my career.  This danger is why we have international standards on assessments, like AERA/APA/NCME and NCCA.  These provide guidelines on how a test should be built, sort of like how you need to build a house according to building code and not just throw up some walls and a roof.


For example, there is nothing that is stopping me from identifying a career that has a lot of people looking to gain an edge over one another to get a better job… then buying a textbook, writing 50 questions in my basement, and throwing it up on a nice-looking website to sell as a professional certification.  I might sell it for $395, and if I get just 100 people to sign up, I’ve made $39,500!!!! This violates just about every NCCA guideline, though. If I wanted to get a stamp of approval that my certification was legit – as well as making it legally defensible – I would need to follow the NCCA guidelines.

My point here is that there are definitely bad tests out there, just like there are millions of other bad products in the world.  It’s a matter of caveat emptor. But just because you had some cheap furniture in college that broke right away doesn’t mean you swear off all furniture.  You stay away from bad furniture.

There’s also the problem of tests being misused, but again that’s not a problem with the test itself.  Certainly, someone making decisions is uninformed. It could actually be the best test in the world, with 100% precision, but if it is used for an invalid application then it’s still not a good situation.  For example, imagine taking a very well-made exam for high school graduation and using it for employment decisions with adults. Psychometricians call this validity – that we have evidence to support the intended use of the test and interpretations of scores.  It is the #1 concern of assessment professionals, so if a test is being misused, it’s probably by someone without a background in assessment.

So where do we go from here?

Put it this way, if an overweight person is trying to become fitter, is success more likely to come from changing diet and exercise habits, or from complaining about their bathroom scale?  Complaining unspecifically about a high school graduation assessment is not going to improve education; let’s change how we educate our children to prepare them for that assessment, and ensure that the assessment reflects the goals of the education.  Nevertheless, of course, we need to invest in making the assessment as sound and fair as we can – which is exactly why I am in this career.

Want to get a graduate degree in psychometrics, measurement, and assessment?  This field is definitely a small niche in the academic world, despite being an integral part of everyone’s life. When I’m trying to explain what I do to people from outside the field, I’m often asked something like, “Where do you even go to study something like that?”  I’m also frequently asked by people already in the field where they can go to get a graduate degree, especially on sophisticated topics like item response theory or adaptive testing.

Well, there are indeed a good number of Ph.D. programs, though they have a range of titles, as you can see below.  This can make them tough to find even if you are specifically looking for them.

Note: This list is not intended to be comprehensive, but rather a sampling of the most well-known or unique programs.

If you want to do deeper research and are actually shopping for a grad school, I highly recommend you check out a comprehensive list of programs on the NCME website.   I also recommend the SIOP list of grad programs; they are for I/O psychology but many of them have professors with expertise in things like assessment validation or item response theory.

 

How to choose a graduate degree in psychometrics?

Here’s an oversimplification of how I see the selection of education…

  1. When you are in high school and selecting a university or college, you are selecting a school.
  2. When you are 18-20 and selecting a major, you are selecting a department.
  3. When you are selecting where to pursue a Master’s, you are selecting a program.
  4. When you are selecting where to pursue a Ph.D., you are selecting an advisor.

The key point: When you do a Ph.D., you are going to spend a lot of time working one on one with your advisor, both for the dissertation and likely for research projects.  It is therefore vital that you select someone who not only aligns with your interests (otherwise you’ll be bored and disengaged) but also whom you quite simply like enough at a personal level to work with one on one for several years!  This is arguably the most important thing to consider when choosing where to attain your graduate degree.

 

University of Minnesota: Quantitative/Psychometrics Program (Psychology) and Quantitative Foundations of Educational Research (Education)

I’m partial to this one since it is where I completed my Ph.D., with Prof. David J. Weiss in the Psychology Department.  The UMN is interesting in that it actually has two separate graduate programs in psychometrics: the one in Psychology, which has since become more focused on quantitative psychology, but also one in the Education department.

Website: https://cla.umn.edu/psychology/graduate/areas-specialization/quantitativepsychometric-methods-qpm

https://edpsych.umn.edu/academics/quantitative-methods

University of Massachusetts: Research, Educational Measurement, and Psychometrics (REMP)

For many years, if you wanted to learn item response theory, you read Item Response Theory: Principles and Applications by Hambleton and Swaminathan (1985).  These were two longtime professors at UMass, and it speaks to the quality of that program.  Both have since retired but the faculty remains excellent.  Also, note that the program website has a nice page on psychometric resources and software.

Website: https://www.umass.edu/remp/

University of Iowa: Center for Advanced Studies in Measurement and Assessment

This program is in the Education department, and has the advantage of being in one of the epicenters of the industry: the testing giant ACT is headquartered only a few miles away, the giant Pearson has an office in town, and the Iowa Test of Basic Skills is an offshoot of the university itself.  Like UMass, Iowa also has a website with educational materials and useful software.

Website: https://education.uiowa.edu/casma

University of Wisconsin-Madison

UW has well-known professors like Daniel Bolt and James Wollack.  Plus, Madison is well-known for being a fun city given its small size.  The large K-12 testing company, Renaissance Learning, is headquartered only a few miles away.

Website: https://edpsych.education.wisc.edu/category/quantitative-methods/

University of Nebraska – Lincoln: Quantitative, Qualitative & Psychometric Methods

For many years, the cornerstones of this program were the husband-and-wife duo of James Impara and Barbara Plake.  They’ve now retired, but excellent new professors have joined.  In addition, UNL is the home of the Buros Institute.

Website: https://cehs.unl.edu/edpsych/quantitative-qualitative-psychometric-methods/

University of Kansas: Research, Evaluation, Measurement, and Statistics

Not far from Lincoln, NE is Lawrence, Kansas.  The program here has been around a long time, with excellent faculty.  Students have an option for practical experience working at the Achievement and Assessment Institute.

Website: https://epsy.ku.edu/academics/educational-psychology-research/phd

Michigan State University: Measurement and Quantitative Methods

Like most of the rest of these programs, it is in a vibrant college town.  The focus is more on quantitative methods than assessment.

Website: https://education.msu.edu/ 

UNC-Greensboro: Educational Research, Measurement, and Evaluation

While most programs listed here are in the northern USA, this one is in the southern part of the country, where such programs are smaller and fewer.  UNCG is quite strong however.

Website: https://www.uncg.edu/degrees/educational-research-measurement-and-evaluation-ph-d/

University of Texas: Quantitative Methods

UT, like some of the other programs, has an advantage in that the educational assessment arm of Pearson is located there.

Website: https://education.utexas.edu/departments/educational-psychology/edp-programs/quantitative-methods/

Boston College: Measurement, Evaluation, Statistics, and Assessment (MESA)

This program is involved in international research such as TIMSS & PIRLS.

Website: https://www.bc.edu/bc-web/schools/lynch-school/academics/departments/mesa.html

Morgan State University: Graduate Program in Psychometrics

Morgan State is unique in that it is a historically black institution that has an excellent program dedicated to psychometrics.

Website: https://www.morgan.edu/psychometrics

Fordham University: Psychometrics and Quantitative Psychology

Fordham has an excellent program, located in New York City.

Website: https://www.fordham.edu/academics/departments/psychology/graduate-program/phd-in-psychometrics-and-quantitative-psychology/

James Madison University: Assessment and Measurement

While not as large as the major public universities on this list, JMU has a strong, practically focused program in psychometrics.

Website: https://www.jmu.edu/grad/programs/snapshots/psychology-assessment-and-measurement.shtml

Outside the US

University of Alberta:  Measurement, Evaluation, and Data Science

This is arguably the leading program in all of Canada.

Website: https://www.ualberta.ca/en/educational-psychology/graduate-programs/measurement-evaluation-and-data-sciences/index.html 

University of British Columbia: Measurement, Evaluation, and Research Methodology

UBC is home to Bruno Zumbo, one of the most prolific researchers in the field.

Website: http://ecps.educ.ubc.ca/program/measurement-evaluation-and-research-methodology/

University of Twente: Research Methodology, Measurement, and Data Analysis

For decades, Twente has been the center of psychometrics in Europe, with professors like Wim van der Linden, Theo Eggen, Cees Glas, and Bernard Veldkamp.  It’s also linked with Cito, the premier testing company in Europe, which provides excellent opportunities to apply your skills.

Website: https://www.utwente.nl/en/bms/omd/

University of Amsterdam: Psychological Methods

This program has a number of well-known professors, with expertise in both psychometrics and quantitative psychology.

Website: https://psyres.uva.nl/content/research-groups/programme-group-psychological-methods/programme-group-psychological-methods.html?cb

University of Cambridge: The Psychometrics Centre

The Psychometrics Centre at Cambridge includes professors John Rust and David Stillwell.  It hosted the 2015 IACAT conference and is the home to the open-source CAT platform Concerto.

Website: https://www.psychometrics.cam.ac.uk/

KU Leuven: Research Group of Quantitative Psychology and Individual Differences

This is home to well-known researchers such as Paul De Boeck.

Website: https://ppw.kuleuven.be/okp/home/

University of Western Australia: Pearson Psychometrics Laboratory

This is home to David Andrich, best known for the Rasch Rating Scale Model.

Website: https://www.uwa.edu.au/schools/medicine/psychometric-laboratory

University of Oslo: Assessment, Measurement, and Evaluation

This program provides an opportunity in the Nordic/Scandinavian countries, with a program in assessment and psychometrics.

Website: https://www.uio.no/english/studies/programmes/assessment-evaluation-master

Online

There are very few programs that offer graduate training in psychometrics that is 100% online.  Here’s the only one I know of.  If you know of another one, please get in touch with me.

The University of Illinois at Chicago: Measurement, Evaluation, Statistics, and Assessment

This program is of particular interest because it has an online Master’s program, which allows you to get a high-quality graduate degree in psychometrics from just about anywhere in the world.  One of my colleagues here at ASC has recently enrolled in this program.

Website: https://mesaonline.ec.uic.edu/programs/master-education-measurement-evaluation-statistics-assessment/ 

We hope the article helps you find the best institution to pursue your graduate degree in psychometrics.


Sympson-Hetter is a method of item exposure control within the algorithm of computerized adaptive testing (CAT).  It prevents the algorithm from over-using the best items in the pool.

CAT is a powerful paradigm for delivering tests that are smarter, faster, and fairer than the traditional linear approach.  However, CAT is not without its challenges.  One is that it is a greedy algorithm that always selects your best items from the pool if it can.  The way that CAT researchers address this issue is with item exposure controls.  These are sub-algorithms that are injected into the main item selection algorithm, to keep it from always using the best items. The Sympson-Hetter method is one such approach.  Another is the Randomesque method.

The Randomesque Method

The simplest approach is called the randomesque method.  This selects from the top X items in terms of item information (a term from item response theory), usually for the first Y items in a test.  For example, instead of always selecting the top item, the algorithm finds the 3 top items and then randomly selects between those.

The figure on the right displays item information functions (IIFs) for a pool of 5 items.  Suppose an examinee had a theta estimate of 1.40.  The 3 items with the highest information are the light blue, purple, and green lines (5, 4, 3).  The algorithm would first identify this and randomly pick amongst those three.  Without item exposure controls, it would always select Item 4.
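To make the idea concrete, here is a minimal sketch of randomesque selection in Python. It is only an illustration, not the code of any particular CAT platform: the function names, the 2PL information formula, and the item parameters in the pool are all assumptions, chosen so that Items 4, 3, and 5 come out as the most informative at a theta of 1.40.

```python
import math
import random

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta (assumed model)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def randomesque_select(theta, pool, k=3, rng=random):
    """Pick one item at random from the k most informative items at theta.

    pool maps item id -> (a, b); k is the "top X" setting described above.
    """
    ranked = sorted(pool, key=lambda i: item_information(theta, *pool[i]), reverse=True)
    return rng.choice(ranked[:k])

# Hypothetical 5-item pool: item id -> (discrimination a, difficulty b)
pool = {1: (1.0, -2.0), 2: (1.2, -0.5), 3: (1.1, 0.6), 4: (1.5, 1.4), 5: (1.0, 2.2)}

# With k=3 the pick is one of Items 3, 4, 5; with k=1 it is always Item 4.
chosen = randomesque_select(1.40, pool, k=3, rng=random.Random(0))
```

Setting k to 1 turns the exposure control off, which is a handy way to compare exposure rates with and without it in a simulation.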

The Sympson-Hetter Method

A more sophisticated method is the Sympson-Hetter method.

Here, the user specifies a target proportion as a parameter for the selection algorithm.  For example, we might decide that we do not want an item seen by more than 75% of examinees.  So, every time that the CAT algorithm goes into the item pool to select a new item, we generate a random number between 0 and 1, which is then compared to the threshold.  If the number is between 0 and 0.75 in this case, we go ahead and administer the item.  If the number is from 0.75 to 1.0, we skip over it and go on to the next most informative item in the pool, though we then do the same comparison for that item.
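The selection step just described can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the original authors' implementation: the function name, the item IDs, and the fallback used when every candidate fails its random check are all hypothetical.

```python
import random

def sympson_hetter_select(ranked_items, exposure_param, rng=random):
    """Walk items from most to least informative and administer the
    first one that passes its random check.

    ranked_items: item ids sorted by information at the current theta, best first.
    exposure_param: item id -> threshold between 0 and 1 (1.0 means no limit).
    """
    for item in ranked_items:
        # rng.random() is uniform on [0, 1); the item passes with
        # probability equal to its threshold.
        if rng.random() < exposure_param.get(item, 1.0):
            return item
    # Assumption: if every candidate is skipped, fall back to the last one.
    return ranked_items[-1]

# Hypothetical ranking and thresholds for one examinee
ranked = [4, 3, 5]
params = {4: 0.75, 3: 0.75, 5: 1.0}

rng = random.Random(42)
picks = [sympson_hetter_select(ranked, params, rng) for _ in range(10_000)]
share_item4 = picks.count(4) / len(picks)  # lands near the 0.75 target
```

Because the thresholds are passed in per item, the same function covers the case where different items get different parameters.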

Why do this?  It obviously limits the exposure of the item.  But just how much it limits it depends on the difficulty of the item.  A very difficult item is likely only going to be a candidate for selection for very high-ability examinees.  Let’s say it’s the top 4%… well, then the approach above will limit it to 3% of the sample overall, but 75% of the examinees in its neighborhood.

On the other hand, an item of middle difficulty is used not only for middle examinees but often for any examinee.  Remember, unless there are some controls, the first item for the test will be the same for everyone!  So if we apply the Sympson-Hetter rule to that item, it limits it to 75% exposure in a more absolute sense.

Because of this, you don’t have to set that threshold parameter to the same value for each item.  The original recommendation was to do some CAT simulation studies, then set the parameters thoughtfully for different items.  Items that are likely to be highly exposed (middle difficulty with high discrimination) might deserve a more strict parameter like 0.40.  On the other hand, that super-difficult item isn’t an exposure concern because only the top 4% of students see it anyway… so we might leave its parameter at 1.0 and therefore not limit it at all.

Is this the only method available?

No.  As mentioned, there’s that simple randomesque approach.  But there are plenty more.  You might be interested in this paper, this paper, or this paper.  The last one reviews the research literature from 1983 to 2005.

What is the original reference?

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

How can I apply this to my tests?

Well, you certainly need a CAT platform first.  Our platform at ASC allows this method right out of the box – that is, all you need to do is enter the target proportion when you publish your exam, and the Sympson-Hetter method will be implemented.  No need to write any code yourself!  Click here to sign up for a free account.


Cognitive diagnostic models are a psychometric paradigm for designing and scoring tests with the goal of providing a profile of examinee skill mastery rather than just an overall test score.

CDMs are an area of psychometric research that has seen substantial growth in the past decade, though the mathematics behind them dates back to Macready and Dayton (1977).  The reason that they have been receiving more attention is that in many assessment situations, a simple overall score does not serve our purposes and we want a finer evaluation of the examinee’s skills or traits.  For example, the purpose of formative assessment in education is to provide feedback to students on their strengths and weaknesses, so an accurate map of these is essential.  In contrast, a professional certification/licensure test focuses on a single overall score with a pass/fail decision.

What are cognitive diagnostic models?

The predominant psychometric paradigm since the 1980s is item response theory (IRT), which is also known as latent trait theory.  Cognitive diagnostic models are part of a different paradigm known as latent class theory.  Instead of assuming that we are measuring a single neatly unidimensional factor, latent class theory tries to assign examinees into more qualitative groups by determining how they are categorized along a number of axes.

What this means is that the final “score” we hope to obtain on each examinee is not a single number, but a profile of which axes they have and which they do not.  The axes could be a number of different psychoeducational constructs, but are often used to represent cognitive skills examinees have learned.  Because we are trying to diagnose strengths vs. weaknesses, we call it a cognitive diagnostic model.

Example: Fractions

A classic example you might see in the literature is a formative assessment on dealing with fractions in mathematics. Suppose you are designing such a test, and the curriculum includes these teaching points, which are fairly distinct skills or pieces of knowledge.

  1. Find the lowest common denominator
  2. Add fractions
  3. Subtract fractions
  4. Multiply fractions
  5. Divide fractions
  6. Convert mixed number to improper fraction

Now suppose this is one of the questions on the test.

 What is 2 3/4 + 1 1/2?

 

This item utilizes skills 1, 2, and 6.  We can apply a similar mapping to all items, and obtain a table.  Researchers call this the “Q Matrix.”  Our example item is Item 1 here.  You’d create your own items and tag appropriately.

Item Find the lowest common denominator Add fractions Subtract fractions Multiply fractions Divide fractions Convert mixed number to improper fraction
 Item 1  X X  X
 Item 2  X  X
 Item 3  X  X
 Item 4  X  X
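In code, a Q-matrix is just a table of 0s and 1s with one row per item and one column per skill. Here is a minimal Python sketch; the variable and function names are my own, and only the row for Item 1 is filled in, since that is the one item whose skill mapping (skills 1, 2, and 6) is spelled out above.

```python
SKILLS = [
    "Find the lowest common denominator",
    "Add fractions",
    "Subtract fractions",
    "Multiply fractions",
    "Divide fractions",
    "Convert mixed number to improper fraction",
]

# One row per item, one column per skill; 1 means the item requires that skill.
# Item 1 is "What is 2 3/4 + 1 1/2?", which uses skills 1, 2, and 6.
Q_MATRIX = {
    "Item 1": [1, 1, 0, 0, 0, 1],
}

def required_skills(item):
    """Return the names of the skills an item is tagged with."""
    return [skill for skill, q in zip(SKILLS, Q_MATRIX[item]) if q == 1]

item1_skills = required_skills("Item 1")
```

You would add one row for each of your own items as you tag them.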

 

So how do we obtain the examinee’s skill profile?

This is where the fun starts.  I used the plural cognitive diagnostic models because there are a number of available models.  This is just like item response theory, where we have the Rasch, 2-parameter, 3-parameter, generalized partial credit, and more.  Choice of model is up to the researcher and depends on the characteristics of the test.

The simplest model is the DINA model, which has two parameters per item.  The slippage parameter s refers to the probability that a student will get the item wrong if they do have the skills.  The guessing parameter g refers to the probability a student will get the item right if they do not have the skills.
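Here is a small Python sketch of how those two parameters turn into a probability of a correct response under DINA. The function name and the parameter values in the example are assumptions for illustration; the Q-matrix row is the one for the fractions item above (skills 1, 2, and 6).

```python
def dina_p_correct(alpha, q, slip, guess):
    """Probability of a correct response under the DINA model.

    alpha: the examinee's skill profile, one 0/1 entry per skill.
    q: the item's Q-matrix row, one 0/1 entry per skill.
    slip: probability of a wrong answer despite having every required skill.
    guess: probability of a right answer while missing a required skill.
    """
    has_all_required = all(a == 1 for a, req in zip(alpha, q) if req == 1)
    return 1.0 - slip if has_all_required else guess

q_item1 = [1, 1, 0, 0, 0, 1]      # requires skills 1, 2, and 6
master = [1, 1, 0, 0, 0, 1]       # has all three required skills
non_master = [1, 0, 0, 0, 0, 1]   # missing skill 2

# Hypothetical item parameters: slip = 0.10, guess = 0.20
p_master = dina_p_correct(master, q_item1, slip=0.10, guess=0.20)          # 1 - slip
p_non_master = dina_p_correct(non_master, q_item1, slip=0.10, guess=0.20)  # guess
```

Note that DINA treats the examinee who is missing one required skill the same as the examinee who is missing all of them; that is the "AND" in the model's name.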

The mathematical calculations for determining the skill profile are complex, and are based on maximum likelihood.  To determine the skill profile, we need to first find all possible profiles, calculate the likelihood of each (based on item parameters and the examinee response vector), then select the profile with the highest likelihood.
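The brute-force search described here can be sketched in Python as follows, assuming the DINA model and known item parameters. Everything in the example (two skills, three items, the slip and guess values, the response vector) is hypothetical, and the function names are my own.

```python
from itertools import product

def dina_p_correct(alpha, q, slip, guess):
    """DINA: 1 - slip if the examinee has every required skill, else guess."""
    has_all_required = all(a == 1 for a, req in zip(alpha, q) if req == 1)
    return 1.0 - slip if has_all_required else guess

def profile_likelihood(alpha, responses, q_matrix, slips, guesses):
    """Likelihood of a 0/1 response vector given skill profile alpha."""
    likelihood = 1.0
    for x, q, s, g in zip(responses, q_matrix, slips, guesses):
        p = dina_p_correct(alpha, q, s, g)
        likelihood *= p if x == 1 else 1.0 - p
    return likelihood

def most_likely_profile(responses, q_matrix, slips, guesses):
    """Enumerate all 2^K profiles and return the one with the highest likelihood.

    Ties go to the first profile encountered.
    """
    n_skills = len(q_matrix[0])
    profiles = product([0, 1], repeat=n_skills)
    return max(profiles, key=lambda a: profile_likelihood(a, responses, q_matrix, slips, guesses))

# Hypothetical test: 2 skills, 3 items
q_matrix = [[1, 0], [0, 1], [1, 1]]
slips = [0.1, 0.1, 0.1]
guesses = [0.2, 0.2, 0.2]
responses = [1, 0, 0]  # right on the skill-1 item, wrong on the other two

best = most_likely_profile(responses, q_matrix, slips, guesses)
```

In this example the search returns the profile (1, 0): the examinee has most likely mastered skill 1 but not skill 2. The number of profiles doubles with every added skill, so full enumeration is only practical for a modest number of skills.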

Calculating item parameters is an order of magnitude more complex.  Again, compare to item response theory: brute force calculation of theta with maximum likelihood is complex, but can still be done using Excel formulas.  Item parameter estimation for IRT with marginal maximum likelihood can only be done by specialized software like Xcalibre.  For CDMs, item parameter estimation can be done in software like Mplus or R (see this article).

In addition to providing the most likely skill profile for each examinee, the CDMs can also provide the probability that a given examinee has mastered each skill.  This is what can be extremely useful in certain contexts, like formative assessment.
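One common way to get those per-skill probabilities is to turn the profile likelihoods into a posterior distribution and then add up the posterior over every profile that includes the skill. The Python sketch below does this for the DINA model and assumes a uniform prior over profiles, which is my simplification; operational software may estimate the prior from the data. The test layout, item parameters, and responses are hypothetical.

```python
from itertools import product

def dina_p_correct(alpha, q, slip, guess):
    """DINA: 1 - slip if the examinee has every required skill, else guess."""
    has_all_required = all(a == 1 for a, req in zip(alpha, q) if req == 1)
    return 1.0 - slip if has_all_required else guess

def profile_likelihood(alpha, responses, q_matrix, slips, guesses):
    """Likelihood of a 0/1 response vector given skill profile alpha."""
    likelihood = 1.0
    for x, q, s, g in zip(responses, q_matrix, slips, guesses):
        p = dina_p_correct(alpha, q, s, g)
        likelihood *= p if x == 1 else 1.0 - p
    return likelihood

def skill_mastery_probabilities(responses, q_matrix, slips, guesses):
    """Posterior probability of mastery for each skill.

    Assumes a uniform prior over all 2^K profiles, so the posterior of a
    profile is its likelihood divided by the sum of all likelihoods.
    """
    n_skills = len(q_matrix[0])
    profiles = list(product([0, 1], repeat=n_skills))
    likes = [profile_likelihood(a, responses, q_matrix, slips, guesses) for a in profiles]
    total = sum(likes)
    return [
        sum(like for a, like in zip(profiles, likes) if a[k] == 1) / total
        for k in range(n_skills)
    ]

# Hypothetical test: 2 skills, 3 items
q_matrix = [[1, 0], [0, 1], [1, 1]]
slips = [0.1, 0.1, 0.1]
guesses = [0.2, 0.2, 0.2]
responses = [1, 0, 0]

mastery = skill_mastery_probabilities(responses, q_matrix, slips, guesses)
```

For this response pattern the probability of mastery comes out high for skill 1 and low for skill 2, which is the kind of graded feedback that a single most-likely profile cannot give.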

How can I implement cognitive diagnostic models?

The first step is to analyze your data to evaluate how well CDMs work by estimating one or more of the models.  As mentioned, this can be done in software like Mplus or R.  Actually publishing a real assessment that scores examinees with CDMs is a greater hurdle.

Most tests that use cognitive diagnostic models are proprietary.  That is, a large K12 education company might offer a bank of prefabricated formative assessments for students in grades 3-12.  That, of course, is what most schools need, because they don’t have a PhD psychometrician on staff to develop new assessments with CDMs.  And the testing company likely has several on staff.

On the other hand, if you want to develop your own assessments that leverage CDMs, your options are quite limited.  I recommend our  FastTest platform for test development, delivery, and analytics.

This is cool!  I want to learn more!

I like this article by Alan Huebner, which talks about adaptive testing with the DINA model, but has a very informative introduction on CDMs.

Jonathan Templin, a professor at the University of Iowa, is one of the foremost experts on the topic.  Here is his website.  Lots of fantastic resources.

Here is a textbook on CDMs.