Posts on psychometrics: The Science of Assessment


Psychometrics is the science of educational and psychological assessment, using data to ensure that tests are fair and accurate.  Ever felt like you took a test that was unfair, too hard, didn’t cover the right topics, or was full of questions that were simply confusing or poorly written?  Psychometricians are the people who help organizations fix these problems using data science, as well as more advanced methods such as AI algorithms that adapt to each examinee.

Psychometrics is a critical aspect of many fields.  Having accurate information on people is essential to education, human resources, workforce development, corporate training, professional certifications/licensure, medicine, and more.  It scientifically studies how tests are designed, developed, delivered, validated, and scored.

Key Takeaways on Psychometrics

  • Psychometrics is the study of how to measure and assess mental constructs, such as intelligence, personality, or knowledge of accounting law
  • Psychometrics is NOT just screening tests for jobs
  • Psychometrics is dedicated to making tests more accurate and fair
  • Psychometrics is heavily reliant on data analysis and machine learning, such as item response theory

 

What is Psychometrics?

Psychometrics is the study of assessment itself, regardless of what type of test is under consideration. In fact, many psychometricians don’t even work on a particular test; they work on psychometrics itself, such as new methods of data analysis.  Most professionals are agnostic about what a given test measures, and will often switch to new jobs in completely unrelated areas, such as moving from a K-12 testing company to psychological measurement to an accountancy certification exam.  We often refer to whatever we are measuring simply as “theta” – a term from item response theory.

Psychometrics tackles fundamental questions around assessment, such as how to determine if a test is reliable or if a question is of good quality, as well as much more complex questions like how to ensure that a score today on a university admissions exam means the same thing as it did 10 years ago.  Additionally, it examines phenomena like the positive manifold, where different cognitive abilities tend to be positively correlated, supporting the consistency and generalizability of test scores over time.

Psychometrics is a branch of data science.  In fact, it had been around long before that term was even a buzzword.  Don’t believe me?  Check out this Coursera course on Data Science: the first example they give of a foundational historical project in data science is… psychometrics!  (early research on factor analysis of intelligence).

Even though assessment is everywhere and Psychometrics is an essential aspect of assessment, to most people it remains a black box, and professionals are referred to as “psychomagicians” in jest. However, a basic understanding is important for anyone working in the testing industry, especially those developing or selling tests.

Psychometrics is NOT limited to very narrow types of assessment.  Some people use the term interchangeably with concepts like IQ testing, personality assessment, or pre-employment testing.  These are each but tiny parts of the field!  Also, it is not the administration of a test.

 

Why do we need Psychometrics?

The purpose of tests is to provide useful information about people, such as whether to hire them, certify them in a profession, or determine what to teach them next in school.  Better tests mean better decisions.  Why?  The scientific evidence is overwhelming that tests provide better information for decision makers than many other sources, such as interviews, resumes, or educational attainment.  Thus, tests serve an extremely useful role in our society.

The goal of psychometrics is to provide validity: evidence to support that the interpretations of scores from the test are what we intended.  If a certification test is supposed to mean that someone passing it meets the minimum standard to work in a certain job, we need a lot of evidence for that, especially since the test is so high stakes in that case.  Meta-analysis, a key tool in psychometrics, aggregates research findings across studies to provide robust evidence on the reliability and validity of tests, which is especially crucial for high-stakes certification exams where accuracy and fairness are paramount.
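
To make this concrete, here is a minimal sketch of one simple meta-analytic calculation: a sample-size-weighted mean validity coefficient across several studies.  The correlations and sample sizes are hypothetical, and real validity generalization work adds corrections for artifacts like range restriction and unreliability.

```python
# Minimal sketch: sample-size-weighted mean validity coefficient across studies.
# The correlations (r) and sample sizes (n) below are hypothetical.
studies = [
    {"n": 120, "r": 0.35},  # e.g., test score vs. job performance, study 1
    {"n": 450, "r": 0.28},
    {"n": 80,  "r": 0.41},
]

total_n = sum(s["n"] for s in studies)
weighted_mean_r = sum(s["n"] * s["r"] for s in studies) / total_n
print(f"N-weighted mean validity: {weighted_mean_r:.3f}")
```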

 

What does Psychometrics do?


Building and maintaining a high-quality test is not easy.  A lot of big issues can arise.  Much of the field revolves around solving major questions about tests: what should they cover, what is a good question, how do we set a good cutscore, how do we make sure that the test predicts job performance or student success, etc.  Many of these questions align with the test development cycle – more on that later.

How do we define what should be covered by the test? (Test Design)

Before writing any items, you need to define very specifically what will be on the test.  If the test is in credentialing or pre-employment, psychometricians typically run a job analysis study to form a quantitative, scientific basis for the test blueprints.  A job analysis is necessary for a certification program to get accredited.  In Education, the test coverage is often defined by the curriculum.

How do we ensure the questions are good quality? (Item Writing)

There is a corpus of scientific literature on how to develop test items that accurately measure whatever you are trying to measure.  A great overview is the book by Haladyna.  This is not just limited to multiple-choice items, although that approach remains popular.  Psychometricians leverage their knowledge of best practices to guide the item authoring and review process in a way that the result is highly defensible test content.  Professional item banking software provides the most efficient way to develop high-quality content and publish multiple test forms, as well as store important historical information like item statistics.

How do we set a defensible cutscore? (Standard Setting)

Test scores are often used to classify candidates into groups, such as pass/fail (Certification/Licensure), hire/non-hire (Pre-Employment), and below-basic/basic/proficient/advanced (Education).  Psychometricians lead studies to determine the cutscores, using methodologies such as Angoff, Beuk, Contrasting-Groups, and Borderline.

How do we analyze results to improve the exam? (Psychometric Analysis)

Psychometricians are essential for this step, as the statistical analyses can be quite complex.  Smaller testing organizations typically utilize classical test theory, which is based on simple mathematics like proportions and correlations.  Large, high-profile organizations typically use item response theory (IRT), which is based on a type of nonlinear regression analysis.  Psychometricians evaluate the overall reliability of the test, the difficulty and discrimination of each item, distractor performance, possible bias, multidimensionality, linking of multiple test forms/years, and much more.  Software such as Iteman and Xcalibre is also available for organizations with enough expertise to run statistical analyses internally.  See the examples further below.
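
As a simple illustration of the classical side of this work, here is a minimal sketch of item analysis in Python: the proportion-correct (P value) and point-biserial discrimination for each item, computed against the rest-score.  The response matrix is hypothetical; production analyses would use dedicated software like Iteman.

```python
import numpy as np

# Minimal sketch of classical item analysis. Rows are examinees, columns are
# items, 1 = correct; the matrix below is hypothetical.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

total_scores = responses.sum(axis=1)

for i in range(responses.shape[1]):
    item = responses[:, i]
    p_value = item.mean()                   # classical difficulty (P value)
    rest = total_scores - item              # rest-score to avoid inflating the correlation
    r_pbis = np.corrcoef(item, rest)[0, 1]  # point-biserial discrimination
    print(f"Item {i + 1}: P = {p_value:.2f}, r_pbis = {r_pbis:.2f}")
```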

How do we compare scores across groups or years? (Equating)

This is referred to as linking and equating.  There are some psychometricians who devote their entire career to this topic.  If you are working on a certification exam, for example, you want to make sure that the passing standard is the same this year as last year.  If you passed 76% of candidates last year and only 25% this year, not only will the candidates be angry, but there will be much less confidence in the meaning of the credential.
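
To illustrate one common linking approach, here is a minimal sketch of the mean/sigma method under item response theory, which uses the difficulty parameters of anchor items that appear on both forms; the parameter values are hypothetical.

```python
import numpy as np

# Minimal sketch of mean/sigma IRT linking: place this year's theta scale onto
# last year's using the difficulty (b) parameters of common anchor items.
b_old = np.array([-1.2, -0.4, 0.3, 1.1])   # anchor items calibrated last year (hypothetical)
b_new = np.array([-1.0, -0.1, 0.6, 1.5])   # same items calibrated this year (hypothetical)

A = b_old.std() / b_new.std()               # slope of the linking transformation
B = b_old.mean() - A * b_new.mean()         # intercept

def to_old_scale(theta_new):
    """Transform a theta estimated on the new form to last year's scale."""
    return A * theta_new + B

print(to_old_scale(0.0))
```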

How do we know the test is measuring what it should? (Validity)

Validity is the evidence provided to support score interpretations.  For example, we might interpret scores on a test to reflect knowledge of English, and we need to provide documentation and research supporting this.  There are several ways to provide this evidence.  A straightforward approach is to establish content-related evidence, which includes the test definition, blueprints, and item authoring/review.  In some situations, criterion-related evidence is important, which directly correlates test scores to another variable of interest.  Delivering tests in a secure manner is also essential for validity.

 

Where is Psychometrics Used?

Certification/Licensure/Credentialing

In certification testing, psychometricians develop the test via a documented chain of evidence following a sequence of research outlined by accreditation bodies, typically: job analysis, test blueprints, item writing and review, cutscore study, and statistical analysis.  Web-based item banking software like  FastTest  is typically useful because the exam committee often consists of experts located across the country or even throughout the world; they can then easily log in from anywhere and collaborate.

Pre-Employment

In pre-employment testing, validity evidence relies primarily on establishing appropriate content (a test on PHP programming for a PHP programming job) and the correlation of test scores with an important criterion like job performance ratings (shows that the test predicts good job performance).  Adaptive tests are becoming much more common in pre-employment testing because they provide several benefits, the most important of which is cutting test time by 50% – a big deal for large corporations that test a million applicants each year. Adaptive testing is based on item response theory, and requires a specialized psychometrician as well as specially designed software like  FastTest.
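
As a rough illustration of how an adaptive test picks questions, here is a minimal sketch of one item-selection step under the two-parameter IRT model: choose the unadministered item with the most Fisher information at the examinee's provisional theta estimate.  The item parameters are hypothetical, and real CAT engines add exposure controls and content balancing.

```python
import numpy as np

# Minimal sketch of one step of adaptive item selection under the 2PL model.
# Item parameters (a, b) below are hypothetical.
items = {
    "item1": (1.2, -0.5),
    "item2": (0.8,  0.0),
    "item3": (1.5,  0.4),
    "item4": (1.0,  1.2),
}

def information_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

theta_hat = 0.3           # provisional ability estimate
administered = {"item1"}  # items this examinee has already seen

candidates = {k: v for k, v in items.items() if k not in administered}
next_item = max(candidates, key=lambda k: information_2pl(theta_hat, *candidates[k]))
print(next_item)
```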

K-12 Education

Most assessments in education fall into one of two categories: lower-stakes formative assessment in classrooms, and higher-stakes summative assessments like year-end exams.  Psychometrics is essential for establishing the reliability and validity of higher-stakes exams, and for equating scores across different years.  It is also important for formative assessments, which are moving towards adaptive formats because of the 50% reduction in test time, meaning that students spend less time testing and more time learning.

Universities

Universities typically do not give much thought to psychometrics even though a significant amount of testing occurs in higher education, especially with the move to online learning and MOOCs.  Given that many of the exams are high stakes (consider a certificate exam after completing a year-long graduate program!), psychometricians should be involved in setting legally defensible cutscores and in statistical analysis to ensure reliable tests, and professionally designed assessment systems should be used to develop and deliver tests, especially with enhanced security.

Medicine/Psychology

Have you ever taken a survey at your doctor’s office, or before/after a surgery?  Perhaps a depression or anxiety inventory at a psychotherapist?  Psychometricians have worked on these.

 

The Test Development Cycle

Psychometrics is the core of the test development cycle, which is the process of developing a strong exam.  It is sometimes called similar names like assessment lifecycle.

You will recognize some of the terms from the introduction earlier.  What we are trying to demonstrate here is that those questions are not standalone topics, or something you do once and simply file a report.  An exam is usually a living thing.  Organizations will often be republishing a new version every year or 6 months, which means that much of the cycle is repeated on that timeline.  Not all of it is; for example, many orgs only do a job analysis and standard setting every 5 years.

Consider a certification exam in healthcare.  The profession does not change quickly because things like anatomy never change and medical procedures rarely change (e.g., how to measure blood pressure).  So, every 5 years it does a job analysis of its certificants to see what they are doing and what is important.  This is then converted to test blueprints.  Items are re-mapped if needed, but most likely do not need it because there are probably only minor changes to the blueprints.  Then a new cutscore is set with the modified-Angoff method, and the test is delivered this year.  It is delivered again next year, but equated to this year rather than starting again.  However, the item statistics are still analyzed, which leads to a new cycle of revising items and publishing a new form for next year.

 

Example of Psychometrics in Action

Here is some output from our Iteman software.  This is a deep analysis of a single question on English vocabulary, checking whether the student knows the word alleviate.  About 70% of the students answered correctly, with a very strong point-biserial.  Each distractor was selected by only a minority of students, and the distractor point-biserials were negative, which adds evidence to the validity.  The graph shows that the line for the correct answer goes up while the others go down, which is good.  If you are familiar with item response theory, you’ll notice how the blue line is similar to an item response function.  That is not a coincidence.

[Figure: Iteman analysis output for the “alleviate” item]

Now, let’s look at another one, which is more interesting.  Here’s a vocabulary question about the word confectioner.  Note that only 37% of the students got it right… even though there is a 25% chance of getting it right just by guessing!  However, the point-biserial discrimination remains very strong at 0.49.  That means it is a really good item.  It’s just hard, which means it does a great job of differentiating amongst the top students.

[Figure: Iteman analysis output for the “confectioner” item]

Psychometrics looks fun!  How can I join the band?

You will need a graduate degree.  I recommend you look at the NCME website (ncme.org) with resources for students.  Good luck!

Already have a degree and looking for a job?  Here are the two sites that I recommend:

  • NCME – Also has a job listings page that is really good (ncme.org)
  • Horizon Search – Headhunter for Psychometricians and I/O Psychologists

In today’s digital-first world, educational institutions and organizations are leveraging technology to deliver training and instruction in more dynamic and efficient ways. A core component of this shift is the Learning Management System (LMS). But what exactly is an LMS, and why is it so critical to modern education and training? Let’s explore this transformative technology and its key features.

Understanding the Basics: What is a Learning Management System?

An LMS is a software application or platform used to plan, implement, and assess a specific learning process. It provides educators, administrators, and learners with a single location for communication, course material, and assessment tools. LMS platforms are commonly used in schools, universities, corporate training programs, and online learning environments. LMS usage grew massively due to the emphasis on remote learning during the COVID-19 pandemic.

The core function of an LMS is to make educational content accessible to users anytime, anywhere, and often at their own pace. This flexibility is crucial in accommodating the diverse needs of learners and organizations.

Key Features of a Learning Management System

Learning Management Systems are designed to simplify the process of delivering training and educational content. Here are some of the primary features that make LMS platforms so valuable:


  1. Course Management: Create, organize, and manage courses with ease. This feature often includes the ability to upload different types of content, such as videos, presentations, PDFs, and quizzes.
  2. Assessment and Tracking: LMS allows for automated assessments and grading. It can track progress, monitor engagement, and provide insights through data analytics.
  3. User Management: Manage user roles and permissions to control access to different parts of the platform. Instructors, administrators, and learners each have unique permissions and access.
  4. Communication Tools: Many LMS platforms include integrated messaging, discussion forums, and video conferencing, fostering communication between learners and educators.
  5. Learning Analytics: LMS platforms often incorporate dashboards to track student progress and performance, reporting key metrics like completion rates and success likelihood. Administrators, educators, and learners can use these metrics to better understand gaps in knowledge.

Examples of Popular Learning Management System Platforms


There are hundreds of LMS platforms available on the market, catering to various educational and corporate needs. The options range from open-source platforms like Moodle and Chamilo, which offer extensive customization but require technical expertise, to commercial solutions such as Blackboard and Canvas, known for their robust feature sets and support services. Pricing can vary significantly based on factors like the number of users, features, and deployment options.

Some platforms, like Google Classroom, are free for qualifying institutions. There are three paid plans. First, the Google Workspace for Education Standard plan costs $3 per student, per year and adds on a security center, advanced device and app management features, Gmail and Classroom logs for export into BigQuery, and audit logs. Then there’s the Teaching and Learning Upgrade plan that costs $4 per license, per month and includes additional features like advanced Google Meet features, unlimited originality reports and the ability to check for peer matches across a private repository. Finally, the Google Workspace for Education Plus plan costs $5 per student, per year and includes all of the features of the other plans, plus live streams with up to 100,000 in-domain viewers, syncing rosters from SISs to Google Classroom, personalized cloud search and prioritized support (Better Buys, 2023).

It’s essential to evaluate your needs and budget before choosing an LMS, as costs can quickly escalate with additional modules and support services.

Below are some widely used options:

  • Moodle: An open-source platform favored by educational institutions due to its flexibility and community support. Moodle is highly customizable and can be tailored to meet specific learning needs.


  • Canvas: A popular choice for both K-12 and higher education, Canvas offers a clean interface and extensive integrations with third-party tools, making it ideal for tech-savvy institutions.


  • Blackboard: Widely adopted by universities and colleges, Blackboard focuses on providing comprehensive features for large-scale educational organizations.


  • Google Classroom: A simple and intuitive tool, Google Classroom is popular in K-12 settings. It integrates seamlessly with other Google products, making it a convenient option for schools already using Google Workspace.


When implementing an LMS, there are several additional expenses to consider beyond the platform’s base pricing. These include:

  1. Implementation and Setup Costs: Depending on the complexity of the LMS and your organization’s specific requirements, there may be initial setup costs. This could involve customizing the platform, integrating it with existing systems, and migrating existing content and user data.
  2. Training and Support: It’s crucial to allocate a budget for training administrators, instructors, and learners to use the LMS effectively. Some platforms offer onboarding and support as part of their package, while others charge separately for these services.
  3. Content Creation and Licensing: Developing new courses, multimedia content, or interactive assessments can be time-consuming and expensive. Additionally, if you’re using third-party content or e-learning modules, you may need to pay licensing fees.
  4. Maintenance and Upgrades: Keeping the LMS up-to-date with software patches, security updates, and new feature releases often incurs ongoing costs. Organizations that opt for self-hosted solutions will also need to consider server maintenance and IT support costs.
  5. Integration with Other Tools: If you plan to integrate the LMS with other systems like HR software, CRM platforms, or data analytics tools, there may be costs associated with custom integrations or purchasing additional licenses for these tools.
  6. Compliance and Security: Ensuring that your LMS complies with regulations (e.g., GDPR, ADA) may involve additional expenses for compliance assessments, legal consultations, and security enhancements.
  7. Scalability: If your organization grows, you might need to expand your LMS capacity, which could mean upgrading your plan, adding new features, or expanding server capacity—all of which can increase costs.

By considering these additional expenses, organizations can develop a more accurate budget and avoid unexpected costs during the LMS implementation process.

Why Your Organization Needs a Learning Management System

Whether you’re running a university, a corporate training program, or a small online course, an LMS can streamline your educational process. With the ability to host and organize content, track learner progress, and provide insights through analytics, an LMS offers much more than just a place to upload learning materials. It can be a strategic tool to enhance the learning experience, increase engagement, and ensure that your educational objectives are met.

Advantages of Using a Learning Management System

Learning Management Systems have become a cornerstone for modern education and corporate training environments. Here are six key benefits that define the value and effectiveness of an LMS.

  1. Interoperability: Seamless Integration Across Systems

One of the most significant advantages of an LMS is its ability to integrate seamlessly with other systems through standardized data formats and protocols. LMS platforms adhere to standards such as SCORM (Sharable Content Object Reference Model), xAPI (Experience API), and LTI (Learning Tools Interoperability), which enable the exchange of content and data between different applications. This level of interoperability simplifies the process of sharing resources and tracking learner progress across multiple platforms, ensuring a cohesive learning experience.
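
For a sense of what these standards look like in practice, here is a minimal sketch of an xAPI statement built as a Python dictionary; the learner, activity ID, and score are hypothetical, and a real LMS would send this JSON to a Learning Record Store (LRS).

```python
import json

# Minimal sketch of an xAPI ("Tin Can") statement. The actor, activity ID, and
# score values are hypothetical illustrations of the actor/verb/object format.
statement = {
    "actor": {"mbox": "mailto:learner@example.com", "name": "Example Learner"},
    "verb": {
        "id": "http://adlnet.gov/expapi/verbs/completed",
        "display": {"en-US": "completed"},
    },
    "object": {
        "id": "http://example.com/courses/intro-statistics/quiz-1",
        "definition": {"name": {"en-US": "Intro Statistics Quiz 1"}},
    },
    "result": {"score": {"scaled": 0.85}, "success": True},
}

print(json.dumps(statement, indent=2))
```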

  2. Accessibility: Inclusive Learning for All Students

Accessibility is a critical factor in modern education, and LMS platforms are designed to support students with diverse needs, including those with disabilities. Most LMS platforms adhere to accessibility standards like the Web Content Accessibility Guidelines (WCAG), providing features such as screen reader support, keyboard navigation, and closed captioning for videos. Consistent layouts and interfaces make it easier for all users to navigate the platform and access content. By fostering an inclusive environment, an LMS can help organizations comply with legal requirements such as the Americans with Disabilities Act (ADA) and ensure that learning opportunities are available to everyone, regardless of physical or cognitive limitations.

  3. Reusability: Maximizing the Value of Educational Content

Reusability is a key strength of LMS platforms, enabling organizations to develop educational content once and reuse it across different courses, training programs, or departments. This feature significantly reduces the time and costs associated with creating new content for each learning module. Content created within an LMS can be structured into reusable learning objects that can be easily updated, repurposed, and shared. This flexibility is especially valuable for large organizations and educational institutions looking to standardize training materials and curricula while keeping them up-to-date with minimal effort.

  4. Durability: A Sustainable Solution for Long-Term Growth

As technology continues to transform education and training, the LMS market is poised for significant growth. Reports suggest that the global LMS market is expected to grow at a compound annual growth rate (CAGR) of 17.1% through 2028 (Valuates Reports, 2022). This growth is driven by the increasing demand for flexible learning solutions, remote training, and the incorporation of new technologies like artificial intelligence and virtual reality into the learning process. By choosing a durable and scalable LMS, organizations can ensure that their investment remains relevant and adaptable to future educational trends and technologies.

  5. Maintainability: Ensuring a Continuously Evolving Platform

LMS platforms are designed with maintainability in mind, allowing developers to make updates, add new features, and fix bugs without disrupting the user experience. This is crucial in a rapidly changing educational landscape where learner needs and technological standards are constantly evolving. With cloud-based LMS platforms, maintenance is often handled automatically by the provider, ensuring that the system is always up-to-date with the latest security patches and performance optimizations. This continuous improvement cycle enables organizations to keep their learning environments modern, secure, and aligned with user expectations.

  6. Adaptability: Evolving with the Needs of Learners

Since their inception in the 1990s, LMS platforms have evolved significantly to keep up with changing societal needs and educational practices. Modern LMS platforms are highly adaptable, supporting a wide range of learning methodologies, such as blended learning, flipped classrooms, and competency-based learning. They also offer extensive customization options, allowing organizations to tailor the platform’s look and feel to match their branding and pedagogical approaches. As educational trends and technologies continue to evolve, LMS platforms are equipped to integrate emerging tools and approaches, such as gamification, microlearning, and artificial intelligence-driven personalized learning paths, making them a future-proof solution for delivering high-quality education and training.

By understanding these key advantages, organizations and institutions can leverage LMS platforms to create impactful learning experiences that not only meet current needs but are also prepared for the future of education and training.

Weaknesses of Using a Learning Management System

While Learning Management Systems offer many benefits, there are some limitations to be aware of, especially in specific contexts where advanced features are needed. Here are several key weaknesses to consider:

  1. Limited Functionality for Assessments
    Many LMS platforms lack sophisticated assessment tools. While most systems support basic quizzes and exams, they may not include advanced features like item banking, Item Response Theory (IRT), or adaptive testing capabilities. This limits their use for institutions or organizations looking to implement more complex testing methodologies, such as those used in standardized assessments or psychometric evaluations. In such cases, additional software or integrations with specialized assessment platforms may be required.
  2. Ineffective Student Management
    An LMS is not designed to function as a full-fledged Student Management System (SMS). It typically lacks the robust database management features necessary for handling complex student records, attendance tracking, and detailed progress reporting. This limitation means that many organizations must integrate the LMS with a separate SMS or a Customer Relationship Management (CRM) system to gain comprehensive student management capabilities. Without these integrations, tracking student progress and managing enrollment data can become cumbersome.
  3. Lack of e-Commerce Functionality
    Not all LMS platforms include built-in e-Commerce capabilities, making it difficult to monetize courses directly within the system. For organizations looking to sell courses, certifications, or training materials, the lack of e-Commerce features can be a significant drawback. While some platforms offer plugins or third-party integrations to support payment processing and course sales, these solutions can add complexity and additional costs to the system. If selling courses or certifications is a priority, it’s crucial to choose an LMS with robust e-Commerce support or consider integrating it with an external e-Commerce platform.
  4. Steep Learning Curve for Administrators and Instructors
    LMS platforms can be complex to navigate, especially for administrators and instructors who may not have a technical background. Setting up courses, managing user roles, configuring permissions, and integrating third-party tools often require specialized training and expertise. This learning curve can lead to inefficiencies, particularly in organizations without dedicated IT or instructional design support. Training costs and time investment can add up, reducing the overall efficiency of the platform.
  5. High Implementation and Maintenance Costs
    Implementing an LMS can be expensive, especially when accounting for customization, setup, training, and content creation. Self-hosted solutions may require ongoing IT support, server maintenance, and regular updates, all of which add to the cost. Even cloud-based solutions can have hidden fees for additional features, support, or upgrades. For organizations with limited budgets, these expenses can quickly become a barrier to effective implementation and long-term use.
  6. User Engagement and Retention Challenges
    While LMS platforms offer tools for tracking engagement and participation, they can sometimes struggle to keep learners motivated, especially in self-paced or online-only environments. If the courses are not designed with engaging content or interactive features, learners may lose interest and drop out. This issue is compounded when the LMS interface is not user-friendly, leading to poor user experience and decreased retention rates.
  7. Lack of Support for Personalized Learning Paths
    While some LMS platforms offer rudimentary support for personalized learning, most struggle to deliver truly customized learning paths that adapt to individual learner needs. This limitation can hinder the ability to address diverse learning styles, knowledge levels, or specific skill gaps. As a result, organizations may need to supplement their LMS with other tools or platforms that provide adaptive learning technologies, which adds complexity to the learning ecosystem.
  8. Data Privacy and Compliance Concerns
    Depending on the region and type of data being stored, LMS platforms may not always comply with data privacy regulations such as GDPR, CCPA, or FERPA. Organizations must carefully evaluate the platform’s data security features and ensure compliance with relevant standards. Failure to meet these requirements can result in significant legal and financial repercussions.

Final Thoughts

Understanding what a Learning Management System is and how it can benefit your organization is crucial in today’s education and training landscape. With platforms like Moodle, Canvas, and Blackboard, it’s easier than ever to create engaging and effective learning experiences. Ready to explore your options? Check out some of these LMS comparisons to find the best platform for your needs.

An LMS isn’t just a tool—it’s a bridge to more effective and scalable learning solutions.

References

Valuates Reports. (2022). “Learning Management System (LMS) Market to Grow USD 40360 Million by 2028 at a CAGR of 17.1% | Valuates Reports”. www.prnewswire.com (Press release). https://www.prnewswire.com/news-releases/learning-management-system-lms-market-to-grow-usd-40360-million-by-2028-at-a-cagr-of-17-1–valuates-reports-301588142.html

Better Buys. (2023). How Much Does an LMS Cost? 2024 Pricing Guide. https://www.betterbuys.com/lms/lms-pricing-guide/


A RIASEC assessment is a type of personality assessment used to help individuals identify their career interests and strengths. Based on the theory of John Holland, a renowned psychologist, this type of assessment works from the premise that people perform best in environments that align with their levels on six personality factors: Realistic, Investigative, Artistic, Social, Enterprising, and Conventional (RIASEC).

In this blog post, we’ll dive deeper into what RIASEC assessment is, how it works, and why it’s useful for career planning.

Understanding the RIASEC Model

The RIASEC model, often depicted as the Holland Hexagon, is structured around six personality factors:

  • Realistic: People who enjoy working with their hands, using tools, and engaging in physical activity. Careers in engineering, construction, or athletics are typical for this type.
  • Investigative: Individuals who are analytical, curious, and enjoy solving complex problems. These people often thrive in science, research, and technical fields.
  • Artistic: Creative thinkers who express themselves through art, music, writing, or design. These individuals prefer jobs in the creative industries.
  • Social: Compassionate and helpful individuals who are drawn to teaching, counseling, or healthcare. Social types enjoy working with others and making a positive impact.
  • Enterprising: These people are confident, persuasive, and like to lead. They often excel in business, sales, or management roles.
  • Conventional: Detail-oriented individuals who enjoy structure and organization. Jobs in accounting, administration, or data management typically attract this type.

You can find more in-depth descriptions of the RIASEC personality types on trusted career exploration platforms like O*Net Online’s Interest Profiler.

How the RIASEC Assessment Works

Taking a RIASEC assessment typically involves answering a series of questions that measure your preferences for different types of work activities. These questions might ask how much you enjoy tasks like solving math problems, drawing, or managing a project. Based on your responses, the test assigns you a score in each of the six categories. The higher your score in a category, the more likely that personality type fits you.

The results usually highlight your top three RIASEC codes, which are referred to as your Holland Code. This combination helps to suggest career paths or work environments that align with your preferences and strengths.

Example items:

  • Realistic
    • I enjoy building things with my hands
    • I prefer a job where I am physically active
  • Investigative
    • I would enjoy a job where I need to think hard every day
    • I like crossword puzzles and mind teasers
  • Artistic
    • I enjoy making my own art
    • I like to create infographics
  • Social
    • I would like a job where I can make a personal impact on people
    • I like to help people
  • Enterprising
    • People tend to follow me
    • I would enjoy a job where I talk to people a lot
  • Conventional
    • I like to perform tasks where there is a clear right answer
    • I would like a job where there are a lot of numerical calculations
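
Scoring an inventory like this is typically straightforward: sum the ratings for the items on each scale, then take the top three scales as the Holland Code.  Here is a minimal sketch with hypothetical items and responses.

```python
# Minimal sketch of scoring a RIASEC-style inventory: sum the Likert ratings
# (1-5) for the items on each scale, then take the top three scales as the
# Holland Code. The item-to-scale mapping and responses are hypothetical.
responses = {
    "R1": 4, "R2": 5,
    "I1": 3, "I2": 2,
    "A1": 5, "A2": 4,
    "S1": 2, "S2": 3,
    "E1": 1, "E2": 2,
    "C1": 4, "C2": 3,
}

scales = {"R": ["R1", "R2"], "I": ["I1", "I2"], "A": ["A1", "A2"],
          "S": ["S1", "S2"], "E": ["E1", "E2"], "C": ["C1", "C2"]}

scores = {scale: sum(responses[item] for item in items)
          for scale, items in scales.items()}

holland_code = "".join(sorted(scores, key=scores.get, reverse=True)[:3])
print(scores)
print("Holland Code:", holland_code)  # 'RAC' for this hypothetical respondent
```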

Why Take a RIASEC Assessment?

The RIASEC assessment is valuable for people of all ages. Whether you’re a high school student exploring future career options or a professional considering a career change, the RIASEC model can help clarify which fields best align with your personality. Understanding your Holland Code provides direction on potential job satisfaction, helping to avoid career mismatches that might lead to dissatisfaction or burnout.

Sites like Truity offer free RIASEC assessments that give immediate feedback. They can provide useful insights even if you’re in the early stages of career planning.

Applying Your Results

 


Once you’ve taken the RIASEC assessment, it’s important to use your results thoughtfully. Review your top three personality types and start exploring careers that align with those interests. Many resources, like the U.S. Department of Labor’s CareerOneStop, offer tools to match your RIASEC profile with specific career options.

You should also consider combining your RIASEC results with other career planning tools, such as skill assessments or personality tests like the Myers-Briggs Type Indicator (MBTI). Doing so can provide a fuller picture of how your interests and abilities overlap.  Make sure you use assessments that have predictive validity.

In some cases, a career counselor at your university or other professional might help you interpret results, recommend professions for you to consider, and then help you select an educational pathway to achieve your goals.

Conclusion

The RIASEC assessment is a useful and widely recognized tool for identifying careers that align with your personality and interests. By understanding the six personality types and discovering where your preferences lie, you can make more informed decisions about your career path. Whether you’re just starting out or making a mid-career switch, this assessment provides valuable guidance for finding a job that suits you best.

Factor analysis is a statistical technique widely used in research to understand and evaluate the underlying structure of assessment data. In fields such as education, psychology, and medicine, this approach to unsupervised machine learning helps researchers and educators identify latent variables, called factors, and which items or tests load on these factors.

For instance, when students take multiple tests, factor analysis can reveal whether these assessments are influenced by common underlying abilities, like verbal reasoning or mathematical reasoning. This insight is crucial for developing reliable and valid assessments, as it helps ensure that test items are measuring the intended constructs.  It can also be used to evaluate whether items in an assessment are unidimensional, which is an assumption of both item response theory and classical test theory.

Why Do We Need Factor Analysis?

Factor analysis is a powerful tool for test validation. By analyzing the data, educators and psychometricians can confirm whether the items on a test align with the theoretical constructs they are designed to measure. This ensures that the test is not only reliable but also valid, meaning it accurately reflects the abilities or knowledge it intends to assess. Through this process, factor analysis contributes to the continuous improvement of educational tools, helping to enhance both teaching and learning outcomes.

What is Factor Analysis?

Factor analysis is a comprehensive statistical technique employed to uncover the latent structure underlying a set of observed variables. In the realms of education and psychology, these observed variables are often test scores or scores on individual test items. The primary goal of factor analysis is to identify underlying dimensions, or factors, that explain the patterns of intercorrelations among these variables. By analyzing these intercorrelations, factor analysis helps researchers and test developers understand which variables group together and may be measuring the same underlying construct.

One of the key outputs of factor analysis is the loading table or matrix (see below), which displays the correlations between the observed variables with the latent dimensions, or factors. These loadings indicate how strongly each variable is associated with a particular factor, helping to reveal the structure of the data. Ideally, factor analysis aims to achieve a “simple structure,” where each variable loads highly on one factor and has minimal loadings on others. This clear pattern makes it easier to interpret the results and understand the underlying constructs being measured. By providing insights into the relationships between variables, factor analysis is an essential tool in test development and validation, helping to ensure that assessments are both reliable and valid.

Confirmatory vs. Exploratory Factor Analysis

Factor analysis comes in two main forms: Confirmatory Factor Analysis (CFA) and Exploratory Factor Analysis (EFA), each serving distinct purposes in research.

Exploratory Factor Analysis (EFA) is typically used when researchers have little to no prior knowledge about the underlying structure of their data. It is a data-driven approach that allows researchers to explore the potential factors that emerge from a set of observed variables. In EFA, the goal is to uncover patterns and identify how many latent factors exist without imposing any preconceived structure on the data. This approach is often used in the early stages of research, where the objective is to discover the underlying dimensions that might explain the relationships among variables.

On the other hand, Confirmatory Factor Analysis (CFA) is a hypothesis-driven approach used when researchers have a clear theoretical model of the factor structure they expect to find. In CFA, researchers specify the number of factors and the relationships between the observed variables and these factors before analyzing the data. The primary goal of CFA is to test whether the data fit the hypothesized model. This approach is often used in later stages of research or in validation studies, where the focus is on confirming the structure that has been previously identified or theoretically proposed. By comparing the model fit indices, researchers can determine how well their proposed factor structure aligns with the actual data, providing a more rigorous test of their hypotheses.
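
Here is a minimal sketch of an EFA with varimax rotation on simulated data, assuming the open-source factor_analyzer package is available (scikit-learn's FactorAnalysis would work similarly).  The simulated data have two underlying factors, so the loading matrix should show a simple structure.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package, assumed installed

# Minimal sketch of exploratory factor analysis (EFA) with varimax rotation.
# The data are simulated: two underlying factors, three observed variables each.
rng = np.random.default_rng(42)
n = 500
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
data = pd.DataFrame({
    "verbal_1": 0.8 * f1 + rng.normal(scale=0.6, size=n),
    "verbal_2": 0.7 * f1 + rng.normal(scale=0.6, size=n),
    "verbal_3": 0.6 * f1 + rng.normal(scale=0.6, size=n),
    "quant_1":  0.8 * f2 + rng.normal(scale=0.6, size=n),
    "quant_2":  0.7 * f2 + rng.normal(scale=0.6, size=n),
    "quant_3":  0.6 * f2 + rng.normal(scale=0.6, size=n),
})

fa = FactorAnalyzer(n_factors=2, rotation="varimax")
fa.fit(data)

# Loading matrix: rows are observed variables, columns are latent factors.
loadings = pd.DataFrame(fa.loadings_, index=data.columns,
                        columns=["Factor 1", "Factor 2"])
print(loadings.round(2))
```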

Factor Analysis of Test Batteries or Sections, or Multiple Predictors

Factor analysis is particularly valuable when dealing with test batteries, which are collections of tests designed to measure various aspects of student cognitive abilities, skills, or knowledge. In the context of a test battery, factor analysis helps to identify the underlying structure of the tests and determine whether they measure distinct yet related constructs.

For example, a cognitive ability test battery might include subtests for verbal reasoning, quantitative reasoning, and spatial reasoning. Through factor analysis, researchers can examine how these subtests correlate and whether they load onto separate factors, indicating they measure distinct abilities, or onto a single factor, suggesting a more general underlying ability, often referred to as the “g factor” or general intelligence.

This approach can also incorporate non-assessment data. For example, a researcher working on employee selection might look at a set of assessments (cognitive ability, job knowledge, quantitative reasoning, MS Word skills, integrity, counterproductive work behavior), but also variables such as interview scores or resume ratings. Below is an oversimplified example of how the loading matrix might look for this.

Table 1

| Variable | Dimension 1 | Dimension 2 |
|---|---|---|
| Cognitive ability | 0.42 | 0.09 |
| Job knowledge | 0.51 | 0.02 |
| Quantitative reasoning | 0.36 | -0.02 |
| MS Word skills | 0.49 | 0.07 |
| Integrity | 0.03 | 0.26 |
| Counterproductive work behavior | -0.01 | 0.31 |
| Interview scores | 0.16 | 0.29 |
| Resume ratings | 0.11 | 0.12 |

Readers who are familiar with the topic will recognize this as a nod to the work by Walter Borman and Steve Motowidlo on task vs. contextual aspects of job performance.  A variable like job knowledge would load highly on a factor reflecting task aspects of performing a job.  However, an assessment of counterproductive work behavior might not predict how well someone performs tasks, but rather how well they contribute to company culture and other contextual aspects.

This analysis is crucial for ensuring that the test battery provides comprehensive and valid measurements of the constructs it aims to assess. By confirming that each subtest contributes unique information, factor analysis supports the interpretation of composite scores and aids in the design of more effective assessment tools. The process of validating test batteries is essential to maintain the integrity and utility of the test results in educational and psychological settings.

This approach typically uses “regular” factor analysis, which assumes that scores for each input variable are normally distributed. That is usually a reasonable assumption for something like scores on an intelligence test. But if you are analyzing scores on individual test items, these are rarely normally distributed; for dichotomous data, where the only possible scores are 0 and 1, normality is impossible. Therefore, other mathematical approaches must be applied.
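
One common route for dichotomous items is to factor-analyze tetrachoric rather than Pearson correlations.  As a rough illustration, here is a minimal sketch of the classic cosine-pi approximation to the tetrachoric correlation for a 2x2 table of item responses; dedicated software uses full maximum-likelihood estimation, and the counts here are hypothetical.

```python
import math

# Minimal sketch of the "cosine-pi" approximation to the tetrachoric
# correlation between two dichotomous (0/1) items. The cell counts below are
# hypothetical; full ML estimation is preferred in practice.
def tetrachoric_approx(n11, n10, n01, n00):
    """Approximate tetrachoric correlation from a 2x2 table of item responses.

    n11 = both items correct, n00 = both incorrect,
    n10 / n01 = one correct and the other incorrect.
    """
    if n10 == 0 or n01 == 0:
        raise ValueError("Approximation breaks down with empty discordant cells")
    odds_ratio = (n11 * n00) / (n10 * n01)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))

print(round(tetrachoric_approx(n11=40, n10=10, n01=15, n00=35), 3))
```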

Factor Analysis on the Item Level

Factor analysis at the item level is a more granular approach, focusing on the individual test items rather than entire subtests or batteries. This method is used to ensure that each item contributes appropriately to the overall construct being measured and to identify any items that do not align well with the intended factors.

For instance, in a reading comprehension test, factor analysis at the item level can reveal whether each question accurately measures the construct of reading comprehension or whether some items are more aligned with other factors, such as vocabulary knowledge or reasoning skills. Items that do not load strongly onto the intended factor may be flagged for revision or removal, as they could distort the accuracy of the test scores.

This item-level analysis is crucial for developing high-quality educational or knowledge assessments, as it helps to ensure that every question is both valid and reliable, contributing meaningfully to the overall test score. It also aids in identifying “enemy items,” which are questions that could undermine the test’s consistency and fairness.

Similarly, in personality assessments like the Big Five Personality Test, factor analysis is used to confirm the structure of personality traits, ensuring that the test accurately captures the five broad dimensions: openness, conscientiousness, extraversion, agreeableness, and neuroticism. This process ensures that each trait is measured distinctly while also considering how they may interrelate.  Note that the result here was not to show overall unidimensionality in personality, but evidence to support five factors.  An assessment of a given factor is then more or less unidimensional.

An example of this is shown in Table 2 below.  Consider a survey where all the descriptive statements are items that people rate on a Likert scale of 1 to 5.  The survey might have hundreds of statements, but these would align themselves with the Big Five through factor analysis, and the simple structure would look something like what you see below (2 items per factor in this small example).

 

Table 2

| Statement | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 |
|---|---|---|---|---|---|
| I like to try new things | 0.63 | 0.02 | 0.00 | -0.03 | -0.02 |
| I enjoy exciting sports | 0.71 | 0.00 | 0.11 | -0.08 | 0.07 |
| I consider myself neat and tidy | 0.02 | 0.56 | 0.08 | 0.11 | 0.08 |
| I am a perfectionist | -0.05 | 0.69 | -0.08 | 0.09 | -0.09 |
| I like to go to parties | 0.11 | 0.15 | 0.74 | 0.08 | 0.00 |
| I prefer to spend my free time alone (reverse scored) | 0.13 | 0.07 | 0.81 | 0.01 | 0.05 |
| I tend to “go with the flow” | -0.14 | 0.02 | -0.04 | 0.68 | 0.08 |
| I enjoy arguments and debates (reverse scored) | 0.03 | -0.04 | -0.05 | 0.72 | 0.11 |
| I get stressed out easily (reverse scored) | -0.05 | 0.03 | 0.03 | 0.05 | 0.81 |
| I perform well under pressure | 0.02 | 0.02 | 0.02 | -0.01 | 0.77 |

 

Tools like MicroFACT, a specialized software for evaluating unidimensionality, are invaluable in this process. MicroFACT enables psychometricians to assess whether each item in a test measures a single underlying construct, ensuring the test’s coherence and effectiveness.

Summary

Factor analysis plays a pivotal role in the field of psychometrics, offering deep insights into the structure and validity of educational assessments. Whether applied to test batteries or individual items, factor analysis helps ensure that tests are both reliable and meaningful.

Overall, factor analysis is indispensable for developing effective educational tools and improving assessment practices. It ensures that tests not only measure what they are supposed to but also do so in a way that is fair and consistent across different groups and over time. As educational assessments continue to evolve, the insights provided by factor analysis will remain crucial in maintaining the integrity and effectiveness of these tools.


 


Setting a cutscore on a test scored with item response theory (IRT) requires some psychometric knowledge.  This post will get you started.

How do I set a cutscore with item response theory?

There are two approaches: directly with IRT, or using CTT then converting to IRT.

  1. Some standard setting methods work directly with IRT, such as the Bookmark method.  Here, you calibrate your test with IRT, rank the items by difficulty, and have an expert panel place “bookmarks” in the ranked list.  The average IRT difficulty of their bookmarks is then a defensible IRT cutscore.  The Contrasting Groups method and the Hofstee method can also work directly with IRT.
  2. Cutscores set with classical test theory, such as the Angoff, Nedelsky, or Ebel methods, are easy to implement when the test is scored classically.  But if your test is scored with the IRT paradigm, you need to convert your cutscores onto the theta scale.  The easiest way to do that is to reverse-calculate the test response function (TRF) from IRT.

The Test Response Function

The TRF (sometimes called a test characteristic curve) is an important way of characterizing test performance in the IRT paradigm.  The TRF predicts a classical score from an IRT score, as you see below.  Like the item response function and test information function, it uses the theta scale as the X-axis.  The Y-axis can be either the number-correct metric or the proportion-correct metric.

[Figure: Test response function for a 10-item test]

In this example, you can see that a theta of -0.3 translates to an estimated number-correct score of approximately 7, or 70%.

Classical cutscore to IRT

So how does this help us with the conversion of a classical cutscore?  Well, we hereby have a way of translating any number-correct score or proportion-correct score.  So any classical cutscore can be reverse-calculated to a theta value.  If your Angoff study (or Beuk) recommends a cutscore of 7 out of 10 points (70%), you can convert that to a theta cutscore of -0.3 as above.  If the recommended cutscore was 8 (80%), the theta cutscore would be approximately 0.7.

IRT scores examinees on the same scale with any set of items, as long as those items have been part of a linking/equating study.  Therefore, a single cutscore study on a set of items can be applied to any other linear test form, LOFT pool, or CAT pool.  This makes it possible to apply the classically focused Angoff method to IRT-focused programs.  You can even set the cutscore with a subset of your item pool, in a linear sense, with the full intention of applying it to CAT tests later.

Note that the number-correct metric only makes sense for linear or LOFT exams, where every examinee receives the same number of items.  In the case of CAT exams, only the proportion correct metric makes sense.
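
Here is a minimal sketch of this reverse calculation: build the TRF by summing 3PL item response functions, then invert it numerically (bisection works because the TRF is monotonically increasing) to find the theta that corresponds to a raw cutscore.  The item parameters are hypothetical.

```python
import numpy as np

# Minimal sketch: build a test response function (TRF) from 3PL item parameters
# and reverse-calculate the theta for a classical cutscore. Parameters are hypothetical.
params = np.array([
    # a,    b,    c
    [1.0, -1.5, 0.20],
    [1.2, -0.8, 0.20],
    [0.9, -0.3, 0.25],
    [1.1,  0.0, 0.20],
    [1.3,  0.2, 0.20],
    [0.8,  0.5, 0.25],
    [1.0,  0.8, 0.20],
    [1.2,  1.0, 0.20],
    [0.9,  1.3, 0.25],
    [1.1,  1.6, 0.20],
])

def trf(theta):
    """Expected number-correct score at theta: sum of 3PL probabilities."""
    a, b, c = params[:, 0], params[:, 1], params[:, 2]
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    return p.sum()

def theta_for_cutscore(raw_cutscore, lo=-4.0, hi=4.0, tol=1e-4):
    """Invert the TRF by bisection: find theta where the expected score equals the cutscore."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if trf(mid) < raw_cutscore:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(theta_for_cutscore(7.0), 2))  # theta equivalent of 7 out of 10 correct
```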

How do I implement IRT?

Interested in applying IRT to improve your assessments?  Download a free trial copy of  Xcalibre  here.  If you want to deliver online tests that are scored directly with IRT, in real time (including computerized adaptive testing), check out  FastTest.


Technology-enhanced items are assessment items (questions) that utilize technology to improve the interaction of a test question in digital assessment, over and above what is possible with paper.  Tech-enhanced items can improve examinee engagement (important with K12 assessment), assess complex concepts with higher fidelity, improve precision/reliability, and enhance face validity/sellability. 

To some extent, the last word is the key one; tech-enhanced items simply look sexier and therefore make an assessment platform easier to sell, even if they don’t actually improve assessment.  I’d argue that there are also technology-enabled items, which are distinct, as discussed below.

What is the goal of technology enhanced items?

The goal is to improve assessment by increasing things like reliability/precision, validity, and fidelity. However, a number of TEIs are actually designed more for sales purposes than psychometric purposes. So, how do we know if TEIs improve assessment?  That, of course, is an empirical question that is best answered with an experiment.  But let me suggest one metric to address this question: how far does the item go beyond just reformulating a traditional item format to use current user-interface technology?  I would call a mere reformulation of a traditional format a fake TEI, while an item that goes beyond that is a true TEI.

An alternative nomenclature might be to call the reformulations technology-enhanced items and the true tech usage to be technology-enabled items (Almond et al, 2010; Bryant, 2017), as they would not be possible without technology.

A great example of this is the relationship between a traditional multiple response item and certain types of drag and drop items.  There are a number of different ways that drag and drop items can be created, but for now, let’s use the example of a format that asks the examinee to drag text statements into a box. 

An example of this is K12 assessment items from PARCC that ask the student to read a passage, then ask questions about it.

[Figure: Drag-and-drop sequencing item]

The item is scored with integers from 0 to K where K is the number of correct statements; the integers are often then used to implement the generalized partial credit model for final scoring.  This would be true regardless of whether the item was presented as multiple response vs. drag and drop. The multiple response item, of course, could just as easily be delivered via paper and pencil. Converting it to drag and drop enhances the item with technology, but the interaction of the student with the item, psychometrically, remains the same.
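
For readers curious about the scoring model mentioned above, here is a minimal sketch of the generalized partial credit model: the probability of each score category 0 through K for a polytomously scored item at a given theta.  The discrimination and step difficulties are hypothetical.

```python
import numpy as np

# Minimal sketch of the generalized partial credit model (GPCM): probability of
# each score category 0..K for a polytomously scored item (e.g., a drag-and-drop
# item worth up to 3 points). The parameters below are hypothetical.
def gpcm_probs(theta, a, thresholds):
    """Return P(X = k | theta) for k = 0..K under the GPCM.

    thresholds: step difficulties b_1..b_K (one per score step above zero).
    """
    # Cumulative sums of a*(theta - b_j); the empty sum for k = 0 is defined as 0.
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(thresholds)))))
    numerators = np.exp(steps)
    return numerators / numerators.sum()

probs = gpcm_probs(theta=0.5, a=1.1, thresholds=[-0.7, 0.1, 0.9])
for k, p in enumerate(probs):
    print(f"P(score = {k}) = {p:.3f}")
```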

Some True TEIs, or Technology Enabled Items

Of course, the past decade or so has witnessed stronger innovation in item formats. Gamified assessments change how the interaction of person and item is approached, though this is arguably not as relevant for high stakes assessment due to concerns of validity. There are also simulation items. For example, a test for a construction crane operator might provide an interface with crane controls and ask the examinee to complete a tasks. Even at the K-12 level there can be such items, such as the simulation of a science experiment where the student is given various test tubes or other instruments on the screen.

Both of these approaches are extremely powerful but have a major disadvantage: cost. They are typically custom-designed. In the case of the crane operator exam or even the science experiment, you would need to hire software developers to create this simulation. There are now some simulation-development ecosystems that make this process more efficient, but the items still involve custom authoring and custom scoring algorithms.

To address this shortcoming, there is a new generation of self-authored item types that are true TEIs. By “self-authored” I mean that a science teacher would be able to create these items themselves, just like they would a multiple choice item. The amount of technology leveraged is somewhere between a multiple choice item and a custom-designed simulation, providing a compromise of reduced cost but still increasing the engagement for the examinee. A major advantage of this approach is that the items do not need custom scoring algorithms, and instead are typically scored via point integers, which enables the use of polytomous item response theory.

Are we at least moving forward?  Not always!

There is always pushback against technology, and in this topic the counterexample is the gridded item type.  It actually moves in the reverse direction of innovation: rather than taking a traditional format and reformulating it for a current UI, it ignores the capabilities of current UI (actually, UI of the past 20+ years) and is therefore a step backward. With that item type, students are presented with a bubble sheet from a 1960s-style paper exam, on a computer screen, and asked to fill in the bubbles by clicking on them rather than using a pencil on paper.

Another example is the EBSR item type from the artist formerly known as PARCC. It was a new item type intended to assess deeper understanding, but it did not use any tech-enhancement or -enablement, instead asking two traditional questions in a linked manner. As any psychometrician can tell you, this approach violated basic assumptions of psychometrics (notably local independence), so you can guess the quality of measurement it produced.

How can I implement TEIs?

It takes very little software development expertise to develop a platform that supports multiple choice items. An item like the graphing one above, though, takes substantial investment. So there are relatively few platforms that can support these, especially with best practices like workflow item review or item response theory. 

modified-Angoff Beuk compromise

A modified-Angoff study is one of the most common ways to set a defensible cutscore on an exam.  Using it means that the pass/fail decisions made by the test are more trustworthy than if you had picked an arbitrary round number like 70%.  If your doctor, lawyer, accountant, or other professional has passed an exam where the cutscore was set with this method, you can place more trust in their skills.

What is the Angoff method?

The Angoff method is a scientific way of setting a cutscore (pass point) on a test.  If you have a criterion-referenced interpretation, it is not legally defensible to just conveniently pick a round number like 70%; you need a formal process.  There are a number of acceptable methodologies in the psychometric literature for standard-setting studies, which establish cutscores or passing points.  Some examples include Angoff, modified-Angoff, Bookmark, Contrasting Groups, and Borderline.  The modified-Angoff approach is by far the most popular.  It is used especially frequently for certification, licensure, certificate, and other credentialing exams.

It was originally suggested as a mere footnote by renowned researcher William Angoff, at Educational Testing Service.

How does the Angoff approach work?

First, you gather a group of subject matter experts (SMEs), with a minimum of 6, though 8-10 is preferred for better reliability, and have them define what they consider to be a Minimally Competent Candidate (MCC).  Next, you have them estimate the percentage of minimally competent candidates that will answer each item correctly.  You then analyze the results for outliers or inconsistencies.  If experts disagree, you will need to evaluate inter-rater reliability and agreement, and after that have the experts discuss and re-rate the items to gain better consensus.  The average final rating is then the expected percent-correct score for a minimally competent candidate.

Advantages of the Angoff method

  1. It is defensible.  Because it is the most commonly used approach and is widely studied in the scientific literature, it is well-accepted.
  2. You can implement it before a test is ever delivered.  Some other methods require you to deliver the test to a large sample first.
  3. It is conceptually simple, easy enough to explain to non-psychometricians.
  4. It incorporates the judgment of a panel of experts, not just one person or a round number.
  5. It works for tests with both classical test theory and item response theory.
  6. It does not take long to implement – for a short test, it can be done in a matter of hours!
  7. It can be used with different item types, including polytomously scored items (multi-points).

Disadvantages of the Angoff method

  1. It does not use actual data, unless you implement the Beuk method alongside it.
  2. It can lead to the experts overestimating the performance of entry-level candidates, as they have forgotten what it was like to start out 20-30 years ago.  This is one reason to use the Beuk method as a “reality check”: showing the experts that if they stay with the cutscore they just picked, the majority of candidates might fail!

Example of the Modified-Angoff Approach

First of all, do not expect a straightforward, easy process that leads to an unassailably correct cutscore.  All standard-setting methods involve some degree of subjectivity.  The goal of the methods is to reduce that subjectivity as much as possible.  Some methods focus on content, others on examinee performance data, while some try to meld the two.

Step 1: Prepare Your Team

The modified-Angoff process depends on a representative sample of SMEs, usually 6-20. By “representative” I mean they should represent the various stakeholders. For instance, a certification for medical assistants might include experienced medical assistants, nurses, and physicians, from different areas of the country. You must train them about their role and how the process works, so they can understand the end goal and drive toward it.

Step 2: Define The Minimally Competent Candidate (MCC)

This concept is the core of the modified-Angoff method, though it is known by a range of terms or acronyms, including minimally qualified candidates (MQC) or just barely qualified (JBQ).  The reasoning is that we want our exam to separate candidates that are qualified from those that are not.  So we ask the SMEs to define what makes someone qualified (or unqualified!) from a perspective of skills and knowledge. This leads to a conceptual definition of an MCC. We then want to estimate what score this borderline candidate would achieve, which is the goal of the remainder of the study. This step can be conducted in person, or via webinar.

Step 3: Round 1 Ratings

Next, ask your SMEs to read through all the items on your test form and estimate the percentage of MCCs that would answer each correctly.  A rating of 100 means the item is a slam dunk; it is so easy that every MCC would get it right.  A rating of 40 means the item is very difficult.  Most ratings are in the 60-90 range if the items are well-developed. The ratings should be gathered independently; if everyone is in the same room, have them work on their own in silence. This can easily be conducted remotely, though.

Step 4: Discussion

This is where it gets fun.  Identify items where there is the most disagreement (as defined by grouped frequency distributions or standard deviation) and make the SMEs discuss it.  Maybe two SMEs thought it was super easy and gave it a 95 and two other SMEs thought it was super hard and gave it a 45.  They will try to convince the other side of their folly. Chances are that there will be no shortage of opinions and you, as the facilitator, will find your greatest challenge is keeping the meeting on track. This step can be conducted in person, or via webinar.

Step 5: Round 2 Ratings

Raters then re-rate the items based on the discussion.  The goal is that there will be greater consensus.  In the previous example, it is not likely that every rater will settle on a 70, but if your raters all end up between 60 and 80, that’s OK. How do you know there is enough consensus?  We recommend the inter-rater reliability suggested by Shrout and Fleiss (1979), as well as looking at inter-rater agreement and the dispersion of ratings for each item. This use of multiple rounds is known as the Delphi approach; it applies to consensus-driven discussions in any field, not just psychometrics.
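If you want to compute that reliability yourself, here is a minimal numpy sketch of the two-way random-effects intraclass correlation, ICC(2,k), from Shrout and Fleiss (1979), applied to an items-by-raters matrix of ratings; the small ratings matrix is invented for illustration.

```python
import numpy as np

def icc_2k(ratings: np.ndarray) -> float:
    """ICC(2,k): two-way random effects, reliability of the mean of k raters.
    ratings is an (n items) x (k raters) matrix of Angoff ratings."""
    n, k = ratings.shape
    grand = ratings.mean()
    item_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)

    # Mean squares from the two-way ANOVA decomposition
    ms_items  = k * np.sum((item_means - grand) ** 2) / (n - 1)
    ms_raters = n * np.sum((rater_means - grand) ** 2) / (k - 1)
    resid = ratings - item_means[:, None] - rater_means[None, :] + grand
    ms_error = np.sum(resid ** 2) / ((n - 1) * (k - 1))

    return (ms_items - ms_error) / (ms_items + (ms_raters - ms_error) / n)

# Hypothetical Round 2 ratings: 5 items x 4 raters (percent-correct estimates for the MCC)
ratings = np.array([
    [70, 75, 72, 68],
    [85, 90, 88, 84],
    [60, 55, 65, 58],
    [78, 80, 75, 82],
    [90, 92, 88, 95],
], dtype=float)

print(round(icc_2k(ratings), 3))
```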

Step 6: Evaluate Results and Final Recommendation

Evaluate the results from Round 2 as well as Round 1.  An example of this is below.  What is the recommended cutscore (the average or sum of the Angoff ratings, depending on the scale you prefer)?  Did the reliability improve?  Estimate the mean and SD of examinee scores (there are several methods for this). What sort of pass rate do you expect?  Even better, utilize the Beuk Compromise as a “reality check” between the modified-Angoff approach and actual test data.  You should take multiple points of view into account, and the SMEs need to vote on a final recommendation. They, of course, know the material and the candidates, so they have the final say.  This means that standard setting is partly a political process; again, reduce that effect as much as you can.
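As a rough sketch of the arithmetic in this step, here is how the recommended cutscore can be computed from the final ratings (using the same invented ratings matrix as in the reliability sketch above): the average of the item means gives the cutscore as a percent, and their sum gives it in raw points.

```python
import numpy as np

# Hypothetical final (Round 2) ratings: rows = items, columns = SMEs,
# each entry = estimated % of MCCs answering the item correctly.
ratings = np.array([
    [70, 75, 72, 68],
    [85, 90, 88, 84],
    [60, 55, 65, 58],
    [78, 80, 75, 82],
    [90, 92, 88, 95],
], dtype=float)

item_means = ratings.mean(axis=1)          # expected MCC performance per item
cutscore_pct = item_means.mean()           # cutscore expressed as percent-correct
cutscore_points = item_means.sum() / 100   # same cutscore expressed in raw points

print(f"Recommended cutscore: {cutscore_pct:.1f}% "
      f"({cutscore_points:.2f} of {len(item_means)} points)")
```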

Some organizations do not set the cutscore at the recommended point, but at one standard error of judgment (SEJ) below the recommended point.  The SEJ is based on the inter-rater reliability; note that it is NOT the standard error of the mean or the standard error of measurement.  Some organizations set the cutscore one standard error of measurement below instead; using the standard error of the mean is just plain wrong (though I have seen it done by amateurs).
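For reference, one common formulation of the SEJ (check your own standard-setting references, as conventions vary) treats each panelist’s mean rating as that panelist’s recommended cutscore and divides the standard deviation of those values by the square root of the number of panelists, J:

$$ SEJ = \frac{SD_{\text{panelist cutscores}}}{\sqrt{J}} $$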

 

modified angoff

Step 7: Write Up Your Report

Validity refers to evidence gathered to support test score interpretations.  Well, you have lots of relevant evidence here. Document it.  If your test gets challenged, you’ll have all this in place.  On the other hand, if you just picked 70% as your cutscore because it was a nice round number, you could be in trouble.

Additional Topics

In some situations, there are more issues to worry about.  Multiple forms?  You’ll need to equate in some way.  Using item response theory?  You’ll have to convert the cutscore from the modified-Angoff method onto the theta metric using the Test Response Function (TRF).  New credential and no data available? That’s a real chicken-and-egg problem there.
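To make that conversion concrete, here is a minimal Python sketch of moving an Angoff cutscore onto the theta metric by inverting the Test Response Function under a 3PL model. The item parameters are invented, and an operational study would use your calibrated item bank and your psychometrician's preferred root-finding method.

```python
import numpy as np

# Hypothetical 3PL item parameters for a 5-item form: columns are (a, b, c)
params = np.array([
    [1.0, -1.0, 0.20],
    [0.8,  0.0, 0.25],
    [1.2,  0.5, 0.20],
    [0.9, -0.5, 0.20],
    [1.1,  1.0, 0.25],
])

def trf(theta: float) -> float:
    """Test Response Function: expected proportion-correct score at a given theta."""
    a, b, c = params[:, 0], params[:, 1], params[:, 2]
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return p.mean()

def cutscore_to_theta(cut_pct: float, lo: float = -4.0, hi: float = 4.0) -> float:
    """Find the theta whose expected proportion-correct equals the cutscore, by bisection."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if trf(mid) < cut_pct:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

angoff_cut = 0.74   # e.g., a 74% recommended cutscore
print(round(cutscore_to_theta(angoff_cut), 3))
```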

Where Do I Go From Here?

Ready to take the next step and actually apply the modified-Angoff process to improving your exams?  Sign up for a free account in our  FastTest item banker. You can also download our Angoff analysis tool for free.

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.

test response functions

Item response theory (IRT) is a family of machine learning models in the field of psychometrics, which are used to design, analyze, validate, and score assessments.  It is a very powerful psychometric paradigm that allows researchers to build stronger assessments, whether they work in Education, Psychology, Human Resources, or other fields.  It also solves critical measurement problems like equating across years, designing adaptive tests, or creating vertical scales.

Want to learn more about IRT, how it works, and why it is so important for assessment?  Read on.

What is Item Response Theory?

IRT is a family of models that try to describe how examinees respond to items on a test, hence the name.  These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves.  However, item response theory ended up doing so much more.

Example item response theory function

IRT is model-driven, in that there is a specific mathematical equation that is assumed, and we fit the models based on raw data, similar to linear regression.  There are different parameters (a, b, c) that shape this equation to different needs.  That’s what defines different IRT models.  This will be discussed at length below.

The models put people and items onto a latent scale, which is usually called θ (theta).  This represents whatever is being measured, whether IQ, anxiety, or knowledge of accounting laws in Croatia.  IRT helps us understand the nature of the scale, how a person answers each question, the distribution of item difficulty, and much more.  IRT used to be known as latent trait theory and item characteristic curve theory.

IRT requires specially-designed software.  Click the link below to download our software Xcalibre, which provides a user-friendly and visual platform for implementing IRT.

 

IRT analysis with Xcalibre

 

Why do we need Item Response Theory?

IRT represents an important innovation in the field of psychometrics. While now more than 50 years old – assuming the “birth” is the classic Lord and Novick (1968) text – it is still underutilized and remains a mystery to many practitioners.

Item response theory is more than just a way of analyzing exam data; it is a paradigm for driving the entire lifecycle of designing, building, delivering, scoring, and analyzing assessments.

IRT requires larger sample sizes and is much more complex than its predecessor, classical test theory, but it is also far more powerful.  IRT requires quite a lot of expertise, typically a PhD.  So it is not used for small assessments like university final exams, but it is used for almost all major assessments in the world.

 

The Driver: Problems with Classical Test Theory

Classical test theory (CTT) is approximately 100 years old, and still remains commonly used because it is appropriate for certain situations, and it is simple enough that it can be used by many people without formal training in psychometrics.  Most statistics are limited to means, proportions, and correlations.  However, its simplicity means that it lacks the sophistication to deal with a number of very important measurement problems.  A list of these is presented later.

Learn more about the differences between CTT and IRT here.

 

Item Response Theory Parameters

The foundation of IRT is a mathematical model defined by item parameters.  A parameter is an aspect of a mathematical model that can change its shape or other aspects.  For dichotomous items (those scored correct/incorrect), each item has three parameters:

 

   a: the discrimination parameter, an index of how well the item differentiates low-ability from high-ability examinees; typically ranges from 0 to 2, where higher is better, though not many items are above 1.0.

   b: the difficulty parameter, an index of the examinee level for which the item is most appropriate; typically ranges from -3 to +3, with 0 being an average examinee level.

   c: the pseudo-guessing parameter, which is a lower asymptote; typically near 1/k, where k is the number of options.

These parameters are used in the formula below, but are also displayed graphically.

3PL irt equation
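For readers who cannot view the image, the standard 3PL model can be written as

$$ P(X_i = 1 \mid \theta) \;=\; c_i + (1 - c_i)\,\frac{e^{\,a_i(\theta - b_i)}}{1 + e^{\,a_i(\theta - b_i)}} $$

(some presentations also include a scaling constant D ≈ 1.7 in the exponent to approximate the normal ogive).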

Item response function

These parameters are used to graphically display an item response function (IRF), which models the probability of a correct answer as a function of ability.  In the example IRF, the a parameter is approximately 1.0, indicating a fairly discriminating test item.  The b parameter is approximately 0.0 (the point on the x-axis where the midpoint of the curve is), indicating an average-difficulty item; examinees of average ability would have a 60% chance of answering correctly.  The c parameter is approximately 0.20, as expected for a 5-option multiple choice item.  Consider the x-axis to be z-scores on a standard normal scale.

In some cases, there is no guessing involved, and we only use a and b.  This is called the two-parameter model.  If we only use b, this is the one-parameter or Rasch model.  Here is how that is calculated.

One-parameter-logistic-model-IRT
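The nesting of these models is easy to see in code. Here is a minimal Python sketch of the item response function, where setting c = 0 gives the 2PL and additionally fixing a = 1 gives the 1PL/Rasch form; the parameter values are illustrative only.

```python
import numpy as np

def irf(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.
    c=0 reduces it to the 2PL; c=0 and a=1 reduce it to the 1PL/Rasch form."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.array([-2.0, 0.0, 2.0])
print(irf(theta, a=1.0, b=0.0, c=0.20))  # 3PL with guessing
print(irf(theta, a=1.0, b=0.0))          # 2PL (no guessing)
print(irf(theta, b=0.0))                 # 1PL / Rasch
```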

Item parameters, which are crucial within the IRT framework, might change over time or multiple testing occasions, a phenomenon known as item parameter drift.

 

Example Item Response Theory calculations

Examinees with higher ability are much more likely to respond correctly.  Look at the graph above.  Someone at +2.0 (97th percentile) has about a 94% chance of getting the item correct.  Meanwhile, someone at -2.0 has only a 25% chance – barely above the 1 in 5 guessing rate of 20%.  An average person (0.0) has a 60% chance.  Why 60?  Because we are accounting for guessing.  If the curve went from 0% to 100% probability, then yes, the middle would be a 50% chance.  But here, we assume 20% as a baseline due to guessing, so halfway up is 60%.

five item response functions

Of course, the parameters can and should differ from item to item, reflecting differences in item performance.  The following graph shows five IRFs with the three-parameter model.  The dark blue line is the easiest item, with a b of -2.00.  The light blue item is the hardest, with a b of +1.80.  The purple one has a c=0.00 while the light blue has c=0.25, indicating that it is more susceptible to guessing.

These IRFs are not just a pretty graph or a way to describe how an item performs.  They are the basic building block to accomplishing those important goals mentioned earlier.  That comes next…

 

Applications of Item Response Theory to Improve Assessment

Item response theory uses the IRF for several purposes.  Here are a few.

test information function from item response theory

  1. Interpreting and improving item performance
  2. Scoring examinees with maximum likelihood or Bayesian methods
  3. Form assembly, including linear on the fly testing (LOFT) and pre-equating
  4. Calculating the accuracy of examinee scores
  5. Development of computerized adaptive tests (CAT)
  6. Post-equating
  7. Differential item functioning (finding bias)
  8. Data forensics to find cheaters or other issues

In addition to being used to evaluate each item individually, IRFs are combined in various ways to evaluate the overall test or form.  The two most important approaches are the test information function (TIF) and the conditional standard error of measurement (CSEM).  The test information function is higher where the test is providing more measurement information about examinees; if it is relatively low in a certain range of examinee ability, those examinees are not being measured accurately.  The CSEM is inversely related to the TIF (it is 1 divided by the square root of the information), and has the interpretable advantage of being usable for confidence intervals; a person’s score plus or minus 1.96 times the SEM is an approximate 95% confidence interval for their score.  The graph on the right shows part of the form assembly process in our  FastTest  platform.
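Here is a minimal Python sketch of how a TIF and CSEM could be computed from a set of 3PL item parameters; the parameters are invented, and the item information formula used is the standard 3PL one.

```python
import numpy as np

# Hypothetical 3PL parameters (a, b, c) for a short form
params = np.array([
    [1.0, -1.0, 0.20],
    [0.8,  0.0, 0.25],
    [1.2,  0.5, 0.20],
    [0.9, -0.5, 0.20],
    [1.1,  1.0, 0.25],
])

def test_information(theta: np.ndarray) -> np.ndarray:
    """Test Information Function: sum of 3PL item information across the form."""
    a, b, c = params[:, 0:1], params[:, 1:2], params[:, 2:3]
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))   # item response functions
    q = 1 - p
    info = a**2 * (q / p) * ((p - c) / (1 - c))**2     # standard 3PL item information
    return info.sum(axis=0)

theta = np.linspace(-3, 3, 7)
tif = test_information(theta)
csem = 1 / np.sqrt(tif)   # conditional standard error of measurement

for t, i, s in zip(theta, tif, csem):
    print(f"theta={t:+.1f}  TIF={i:.2f}  CSEM={s:.2f}")
```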

 

Assumptions of Item Response Theory

Item response theory assumes a few things about your data.

  1. The latent trait you are measuring is unidimensional.  If it is multidimensional, there is multidimensional item response theory, or you can treat the dimensions as separate traits.
  2. Items have local independence, which means that the act of answering one is not impacted by others.  This affects the use of testlets and enemy items.
  3. The probability of responding correctly to an item (or giving a certain response, in the case of polytomous items such as Likert scales) is a function of the examinee’s ability/trait level and the parameters of the model, following the item response function, with some allowance for random error.  As a corollary, we are assuming that the ability/trait has some distribution, with some people having higher or lower levels (e.g., intelligence), and that we are trying to find those differences.

Many texts will only postulate the first two as assumptions, because the third is just implicitly assumed.

 

Advantages and Benefits of Item Response Theory

So why does this matter?  Let’s go back to the problems with classical test theory.  Why is IRT better?

  • Sample-independence of scale: Classical statistics are sample dependent, and unusable on a different sample; results from IRT are sample-independent within a linear transformation.  Two samples of different ability levels can be easily converted onto the same scale.
  • Test statistics: Classical statistics are tied to a specific test form; IRT parameters are not.
  • Sparse matrices are OK: Classical test statistics do not work with the sparse matrices introduced by multiple forms, linear on the fly testing, or adaptive testing.
  • Linking/equating: Item response theory has much stronger equating, so if your exam has multiple forms, or if you deliver twice per year with a new form, you can have much greater validity in the comparability of scores.
  • Measuring the range of students: Classical tests are built for the average student, and do not measure high or low students very well; conversely, statistics for very difficult or easy items are suspect.
  • Vertical scaling: IRT can do vertical scaling but CTT cannot.
  • Accounting for guessing: CTT does not account for guessing on multiple choice exams.
  • Scoring: Scoring in classical test theory does not take into account item difficulty.  With IRT, you can score a student on any set of items and be sure it is on the same latent scale (see the sketch after this list).
  • Adaptive testing: CTT does not support adaptive testing in most cases.  Adaptive testing has its own list of benefits.
  • Characterization of error: CTT assumes that every examinee has the same amount of error in their score (SEM); IRT recognizes that if the test is all middle-difficulty items, then low or high students will have inaccurate scores.
  • Stronger form building: IRT has functionality to build forms to be more strongly equivalent and meet the purposes of the exam.
  • Nonlinear function: IRT does not force a linear function onto the student-item relationship; CTT implicitly assumes one (the point-biserial) even though the true relationship is nonlinear.
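As referenced in the scoring bullet above, here is a minimal sketch of maximum-likelihood scoring under the 3PL using a simple grid search; the item parameters and response pattern are invented, and operational software typically uses Newton-Raphson or Bayesian estimation instead.

```python
import numpy as np

# Hypothetical 3PL parameters (a, b, c) and one examinee's 0/1 responses to those items
params = np.array([
    [1.0, -1.0, 0.20],
    [0.8,  0.0, 0.25],
    [1.2,  0.5, 0.20],
    [0.9, -0.5, 0.20],
    [1.1,  1.0, 0.25],
])
responses = np.array([1, 1, 0, 1, 0])

def log_likelihood(theta: float) -> float:
    """Log-likelihood of the response pattern at a given theta under the 3PL."""
    a, b, c = params[:, 0], params[:, 1], params[:, 2]
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

# Crude maximum-likelihood estimate: evaluate the likelihood on a dense theta grid
grid = np.linspace(-4, 4, 801)
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]
print(round(float(theta_hat), 2))
```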

 

Item Response Theory Models: One Big Happy Family

Remember: IRT is actually a family of models, making flexible use of the parameters.  In some cases, only two parameters (a, b) or one parameter (b) are used, depending on the type of assessment and the fit of the data.  If there are multi-point items, such as Likert rating scales or partial credit items, the models are extended to include additional parameters. Learn more about the partial credit situation here.

Here’s a quick breakdown of the family tree, with the most common models.
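The exact taxonomy varies by author, but the members you will most often encounter are roughly these:

  • Dichotomous models: the Rasch / one-parameter (1PL), two-parameter (2PL), and three-parameter (3PL) logistic models
  • Polytomous models: the rating scale and partial credit Rasch models, the generalized partial credit model, and the graded response model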

 

How do I analyze my test with Item Response Theory?

OK item fit

First: you need to get special software.  There are some commercial packages like  Xcalibre, or you can use packages inside platforms like R and Python.

The software will analyze the data in cycles or loops to try to find the best model parameters.  This is because, as always, the data do not always fit the model perfectly.  You might see graphs like the one below if you compared actual proportions correct (red) to the predicted ones from the item response function (black).  That’s OK!  IRT is quite robust.  And there are analyses built in to help you evaluate model fit.

Some more unpacking of the image above:

  • This was item #39 on the test
  • We are using the three parameter logistic model (3PL), as this was a multiple choice item with 4 options
  • 3422 examinees answered the item
  • 76.9% of them got it correct
  • The classical item discrimination (point biserial item-total correlation) was 0.253, which is OK but not very high
  • The a parameter was 0.432, which is OK but not very strong
  • The b parameter was -1.195, which means the item was quite easy
  • The c parameter was 0.248, which you would expect if there was a 25% chance of guessing
  • The Chi-square fit statistic rejected the null, indicating poor fit, but this statistic is highly sensitive to sample size
  • The z-Resid fit statistic is a bit more robust, and it did not flag the item for bad fit

Xcalibre-poly-output
The image here shows output from  Xcalibre  for the generalized partial credit model, which is a polytomous model often used for items scored with partial credit.  For example, a question might list 6 animals and ask students to click on the ones that are reptiles, of which there are 3.  The possible scores are then 0, 1, 2, 3.

Here, the graph labels them as 1-2-3-4, but the meaning is the same.  Here is how you can interpret this.

  • Someone is likely to get 0 points if their theta is below -2.0 (bottom 3% or so of students).
  • A few low students might get 1 point (green)
  • Low-middle ability students are likely to get 2 correct (blue)
  • Anyone above average (0.0) is likely to get all 3 correct.

The boundary locations are where one level becomes more likely than another, i.e., where the curves cross.  For example, you can see that the blue and black lines cross at the boundary -0.339.
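For the curious, here is a minimal Python sketch of how generalized partial credit model category probabilities (and the boundaries where adjacent curves cross) can be computed; the a and step parameters are invented for illustration and are not the values from the Xcalibre output above.

```python
import numpy as np

a = 0.9                            # discrimination (hypothetical)
b = np.array([-1.8, -0.9, -0.3])   # step/boundary parameters for scores 1, 2, 3 (hypothetical)

def gpcm_probs(theta: float) -> np.ndarray:
    """Category probabilities for scores 0..3 under the generalized partial credit model."""
    # Cumulative sums of a*(theta - b_j); the score-0 term is defined as 0
    numerators = np.exp(np.concatenate(([0.0], np.cumsum(a * (theta - b)))))
    return numerators / numerators.sum()

# Adjacent categories are equally likely exactly at theta = b_j (the boundary location)
for t in (-2.5, -1.0, 0.0, 1.0):
    print(t, np.round(gpcm_probs(t), 3))
```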

Where can I learn more?

For more information, we recommend the textbook Item Response Theory for Psychologists by Embretson & Reise (2000) for those interested in a less mathematical treatment, or de Ayala (2009) for a more mathematical treatment.  If you really want to dive in, you can try the 3-volume Handbook of Item Response Theory edited by van der Linden, which contains a chapter discussing ASC’s IRT analysis software,  Xcalibre.

Want to talk to one of our experts about how to apply IRT?  Get in touch!



Criterion-related validity is evidence that test scores are related to other variables that we expect them to be related to.  This is an essential part of the larger issue of test score validity, which is providing evidence that test scores have the meaning we intend them to have.  If you’ve ever felt that a test doesn’t cover what it should be covering, or that it doesn’t reflect the skills needed to perform the job you are applying for – that’s validity.

What is criterion-related validity?

Criterion-related validity is an aspect of test score validity which refers to evidence that scores from a test correlate with an external variable that it should correlate with.  In many situations, this is the critical consideration of a test; for example, a university admissions exam would be quite suspect if scores did not correlate well with high school GPA or accurately predict university GPA.  That is literally its purpose for existence, so we want to have some proof that the test is performing that way.  A test serves its purpose, and people have faith in it, when we have such highly relevant evidence.

Incremental validity is a specific aspect of criterion-related validity that assesses the added predictive value of a new assessment or variable beyond the information provided by existing measures.  There are two approaches to establishing criterion-related validity: concurrent and predictive.  There are also two directions: discriminant and convergent.

Concurrent validity

The concurrent approach to criterion-related validity means that we are looking at variables at the same point in time, or at least very close.  In the example of university admissions testing, this would be correlating the test scores with high school GPA.  The students would most likely just be finishing high school at the time they took the test, excluding special cases like students that take a gap year before university.

Predictive validity

The predictive validity approach, as its name suggests, regards the prediction of future variables.  In the example of university admissions testing, we would be using test scores to predict university GPA or graduation rates.  A common application of this is pre-employment testing, where job candidates are tested with the goal of predicting positive variables like job performance, or variables that the employer might want to avoid, like counterproductive work behavior.  Which leads us to the next point…

Convergent validity

Convergent validity refers to criterion-related validity where we want a positive correlation, such as test scores with job performance or university GPA.  This is frequently the case with criterion-related validity studies.  One thing to be careful of in this case is differential prediction, also known as predictive bias.  This is where the validity is different for one group of examinees, often a certain demographic group, even though the average score might be the same for each group.

Here is an example of the data you might evaluate for predictive convergent validity of a university admissions test.

Predictive validity
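A minimal Python sketch of how such predictive convergent validity evidence might be quantified, using invented admissions-test scores and later university GPAs; the Pearson correlation serves as the validity coefficient.

```python
import numpy as np

# Hypothetical matched records: admissions test score and first-year university GPA
test_scores = np.array([410, 450, 480, 500, 530, 560, 590, 620, 660, 700])
uni_gpa     = np.array([2.1, 2.4, 2.3, 2.8, 2.9, 3.1, 3.0, 3.4, 3.6, 3.8])

# Pearson correlation is the usual validity coefficient for this design
r = np.corrcoef(test_scores, uni_gpa)[0, 1]
print(f"Predictive validity coefficient: r = {r:.2f}")
```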

Discriminant validity

Unlike convergent validity, discriminant validity is where we want the correlation with other variables to be negative or zero.  As noted above, some pre-employment tests have this case.  An integrity or conscientiousness assessment should correlate negatively with instances of counterproductive work behavior, perhaps quantified as the number of disciplinary marks on employee HR files.  In some cases, the goal might be to find a zero correlation.  That can be the case with noncognitive traits, where a measure of conscientiousness should not have a strong correlation in either direction with the other members of the Big Five.

The big picture

Validity is a complex topic with many aspects.  Criterion-related validity is only one part of the picture.  However, as seen in some of the examples above, it is critical to some types of assessment, especially where the exam exists primarily to predict some future variable.

Want to delve further into validity?  The classic reference is Cronbach & Meehl (1955).  We also recommend work by Messick, such as this one.  Of course, check the standards relevant to your assessment, such as AERA/APA/NCME or NCCA.


Digital assessment (DA), aka e-Assessment or electronic assessment, is the delivery of assessments, tests, surveys, and other measures via digital devices such as computers, tablets, and mobile phones.  The primary goal is to be able to develop items, publish tests, deliver tests, and provide meaningful results – as quickly, easily, and validly as possible.  The use of computers enables many modern benefits, from adaptive testing (e.g., the adaptive SAT) to tech-enhanced items.  To deliver digital assessment, an organization typically implements a cloud-based digital assessment platform.  Such platforms do much more than just the delivery, though; modules include those shown in the test development cycle below:

test development cycle fasttest

 

 

Why Digital Assessment / e-Assessment?

Globalization and digital technology are rapidly changing the world of education, human resources, and professional development. Teaching and learning are becoming more learner-centric, and technology provides an opportunity for assessment to be integrated into the learning process with corresponding adjustments. Furthermore, digital technology grants opportunities for teaching and learning to move their focus from content to critical thinking. Teachers are already implementing new strategies in classrooms, and assessment needs to reflect these changes, as well.

Looking for such a platform?  Request a free account in ASC’s industry-leading e-Assessment ecosystem.

 


 

Advantages of Digital Assessment

Accessibility

One of the main pros of DA is its ease of use for staff and learners—examiners can easily set up questionnaires, determine grading methods, and send invitations to examinees. In turn, examinees do not always have to be in a classroom setting to take assessments and can do so remotely in a more comfortable environment. In addition, DA gives learners the option of taking practice tests whenever it is convenient for them.

Transparency

DA allows educators to quickly evaluate the performance of a group against an individual learner for analytical and pedagogical reasons. The report-generating capabilities of DA enable educators to identify learning problem areas on both individual and group levels soon after assessments occur, in order to adapt to learners’ needs, strengths, and weaknesses. As for learners, DA provides them with instant feedback, unlike traditional paper exams.

Cost-effectiveness

Conducting exams online, especially at scale, is very practical since there is no need to print innumerable question papers, involve all school staff in organizing procedures, assign invigilators, invite hundreds of students to spacious classrooms to take tests, and provide them with answer sheets and supplementary materials. Thus, flexibility of time and venue and lowered human, logistical, and administrative costs give electronic assessment a considerable advantage over traditional exam settings.

Eco-friendliness

In this digital era, minimizing the detrimental environmental effects of pen-and-paper exams should be a priority. Cutting down trees for paper can no longer be the norm, given its adverse environmental impact. DA ensures that organizations and institutions can go paper-free and avoid printing exam papers and other materials. Furthermore, DA takes up less storage space, since all data can be stored on a single server rather than in paper records.

Security

Enhanced security and privacy for students is another advantage of digital assessment. A secure assessment system supported by AI-based proctoring features leaves only a tiny probability of malicious activities, such as cheating and other unlawful practices that could rig the system and lead to incorrect results. It also helps students accept test results without contesting them, which in turn fosters a more positive mindset toward institutions and organizations, building stronger mutual trust between educators and learners.

Autograding

The benefits of DA include an automated grading system that is more convenient and time-efficient than standard marking and grading procedures and that minimizes human error. Automated scoring compares examinees’ responses against model answers and makes the relevant judgments. The spread of technology in e-education and the increasing number of learners demand a sophisticated scoring mechanism that eases teachers’ burden, saves time, and ensures the fairness of assessment results. For example, digital assessment platforms can include complex modules for essay scoring, or easily implement item response theory and computerized adaptive testing.

Time-efficiency

Those involved in designing, managing, and evaluating assessments are aware of the tediousness of these tasks. Probably the most routine process among assessment procedures is manual invigilation, which can easily be avoided by employing remote proctoring services. Smart exam software, such as FastTest, features automated item generation, item banking, test assembly, and publishing, saving precious time that would otherwise be spent on repetitive tasks. Examiners only need to upload examinees’ emails or IDs to invite them to an assessment. The best part is the instant export of results and delivery of reports to stakeholders.

Public relations and visibility

There is considerably lower use of pen and paper in the digital age. The infusion of technology has considerably altered human preferences, so these days the immense majority of educators rely on computers for communication, presentations, digital design, and various other tasks. Educators have the opportunity to mix question styles on exams, including graphics, making them more interactive than paper ones. Many educational institutions utilize learning management systems (LMS) to publish study materials on cloud-based platforms and enable educators to evaluate and grade with ease. In turn, students benefit from such systems as they can submit their assignments remotely.

 

Challenges of Implementing Digital Assessment

Difficulty in grading long-answer questions

DA copes brilliantly with multiple-choice questions; however, there are still some challenges in grading long-answer questions. This is where digital assessment intersects with traditional assessment, as subjective answers call for manual grading. Luckily, technology in the education sector continues to evolve, and even essays can now be marked digitally with the help of AI features on platforms like FastTest.

Need to adapt

Implementing something new always brings disruption and demands time to familiarize all stakeholders with it. Obviously, the transition from traditional assessment to DA will require certain investments, such as system upgrades, staff professional development, and other expenses. Some staff and students might even resist the change and feel isolated without face-to-face interactions. However, this stage is inevitable and will definitely be a step forward for both educators and learners.

Infrastructural barriers & vulnerability

One of the major cons of DA is that technology is not always reliable, and some locations cannot provide all examinees with stable access to electricity, internet connection, and other basic system requirements. This is a huge problem in developing nations, and it remains a problem in many areas of well-developed nations. In addition, integrating DA technology can be very costly if the assessment design, both conceptual and aesthetic, is planned poorly. Such barriers hamper DA, which is why authorities should consider addressing them prior to implementation.

 

Conclusion

To sum up, implementing DA has its merits and demerits, as outlined above. Even though technology simplifies and enhances many processes for institutions and stakeholders, it still has some limitations. Nevertheless, most drawbacks can be averted by choosing the right methodology and examination software. We cannot deny the need to transition from traditional assessment to digital assessment, and the benefits of DA far outweigh its drawbacks and costs. Of course, it is up to you to choose whether to keep using hard-copy assessments or go for the online option. However, we believe that in the digital era all you need to do is plan wisely and choose an easy-to-use and robust examination platform with AI-based anti-cheating measures, such as FastTest, to secure credible outcomes.

 
