
It is not easy to find an online testing platform that provides a professional level of functionality.  There are many, many software products out in the wild that provide at least some functionality for online testing.  The biggest problem is that there is an incredible range in quality, though there are also other differentiators, such as some being built only to deliver pre-built employment skill tests rather than for general use.

So how do you know what level of quality you need in an online testing platform?  It mostly depends on the stakes of your test, which govern the need for quality in the test itself, which in turn drives the need for a quality platform to build and deliver the test.  This post helps you identify the types of functionality that set apart “real” online testing platforms, so you can evaluate which components are most critical for you once you go shopping.

Prefer to get your hands dirty?  Sign up for a free account in our platform or request a personalized demonstration.

What is a professional online testing platform, anyway?

An online testing platform is much more than an assessment module in a learning management system (LMS) or an inexpensive quiz/survey maker.  A real online testing platform is designed for professionals, that is, people whose entire job is to make assessments.  A good comparison is a customer relationship management (CRM) system: a platform designed for use by people whose job is to manage customers, whether serving existing customers or managing the sales process.  While it is entirely possible to use a spreadsheet for such things at a small scale, any organization operating at scale will leverage a true CRM like SalesForce or Zoho.  You wouldn’t hire a team of professional sales experts and then have them waste hours each day in a spreadsheet; you would give them SalesForce to make them much more effective.

The same is true for online testing and assessment.  If you are a teacher making math quizzes, then Microsoft Word might be sufficient.  But there are many organizations that are doing a professional level of assessment, with dedicated staff.  Some examples, by no means an exhaustive list:

  • Professional credentialing: Certification and licensure exams that a person passes to work in a profession, such as chiropractors
  • Employment: Testing job applicants to make sure they have relevant skills, ability, and experience
  • Universities: Not so much for classroom assessments, but rather for topics like placement testing of all incoming students, or for nationwide admissions exams
  • K-12 benchmark: If you are a government that tests all 8th graders at the end of the year, or a company that delivers millions of formative assessments

 

The traditional vs modern approach

For starters, one important consideration is the overall approach that the platform takes to assessment.  Several of the aspects in the comparison below are discussed in more detail later in this post.

[Figure: Comparison of the traditional vs. modern approach to assessment]

 

OK, here is the list!  As with the one above, this is by no means exhaustive.

Goal 1: Item banking that makes your team more efficient

True item banking: The platform should treat items as reusable objects that exist with persistent IDs and metadata.

Configurability: The platform should allow you to configure how items are scored and presented, such as font size, answer layout, and weighting.

Multimedia management: Audio, video, and images should be stored in their own banks, with their own metadata fields, as reusable objects.  If an image is used in 7 questions, you should not have to upload it 7 times; you upload it once and the system tracks which items use it.

Statistics and other metadata: All items should have many fields that are essential metadata: author name, date created, tests which use the item, content area, Bloom’s taxonomy, classical statistics, IRT parameters, and much more.

Custom fields: You should be able to create any new metadata fields that you like.

Item review workflow: Professionally built items will go through a review process, like Psychometric Review, English Editing, and Content Review. The platform should manage this, allowing you to assign items to people with due dates and email notifications.

Automated item generation: There should be functionality to generate items automatically, for example from item templates.

Powerful test assembly: When you publish a test, there should be many options, including sections, navigation limits, paper vs online, scoring algorithms, instructional screens, score reports, etc.

Equation editor: Many math exams need a professional equation editor to write the items, embedded in the item authoring interface.

Goal 2: Professional test delivery

Scheduling options: Date ranges, retakes, alternate forms, passwords, etc.

Item response theory: A modern psychometric paradigm for scoring examinees and analyzing items, used by organizations dedicated to best practices in assessment.

Linear on-the-fly testing (LOFT): Suppose you have a pool of 200 questions, and you want every student to get 50 randomly picked, but balanced so that there are 10 items from each of 5 content areas.  A sketch of this kind of constrained random selection is shown below.
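To make this concrete, here is a minimal sketch of constrained random form assembly under the scenario above. It is not any particular platform's algorithm: the item-bank structure and function name are hypothetical, and a real system would also enforce exposure controls and other constraints.

```python
import random
from collections import defaultdict

def assemble_loft_form(item_bank, items_per_area=10, seed=None):
    """Randomly draw a fixed number of items from each content area.

    item_bank: list of dicts with hypothetical keys "id" and "content_area".
    Returns a shuffled list of item IDs balanced across content areas.
    """
    rng = random.Random(seed)
    by_area = defaultdict(list)
    for item in item_bank:
        by_area[item["content_area"]].append(item["id"])

    form = []
    for area, ids in by_area.items():
        if len(ids) < items_per_area:
            raise ValueError(f"Not enough items in content area {area}")
        form.extend(rng.sample(ids, items_per_area))
    rng.shuffle(form)  # mix the content areas in delivery order
    return form

# Example: a 200-item pool spread evenly over 5 content areas -> a 50-item form
pool = [{"id": i, "content_area": f"Area {i % 5 + 1}"} for i in range(200)]
print(len(assemble_loft_form(pool, items_per_area=10, seed=42)))  # 50
```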

Computerized adaptive testing: This uses AI and machine learning to customize the test uniquely to every examinee.  CATs are much more secure, more accurate, more engaging, and can reduce test length by 50-90%.

Tech-enhanced item types: Drag and drop, audio/video, hotspot, fill-in-the-blank, etc.

Scalability: Because most “real” exams will serve thousands, tens of thousands, or even hundreds of thousands of examinees, the online testing platform needs to be able to scale up.

Online essay marking: The platform should have a module to score open-response items, preferably with advanced options such as multiple markers or anonymized marking.

Goal 3: Maintaining test integrity and security

Delivery security options: There should be choices for how to create/disseminate passcodes, set time/date windows, disallow movement back to previous sections, etc.

Lockdown browser: An option to deliver with software that locks the computer while the examinee is in the test.

Remote proctoring: There should be an option for remote (online) proctoring.

Live proctoring: There should be functionality that facilitates live human proctoring, such as in computer labs at a university.

User roles and content access: There should be various roles for users, as well as options to limit them by content.  For example, a Math teacher assigned to reviews could be limited to reviewing Math items and nothing else.

Rescoring: If items are compromised or challenged, you need functionality to easily remove them from scoring for an exam and rescore all candidates.

Live dashboard: You should be able to see who is currently taking the exam, stop a session if needed, and restart or re-register an examinee when necessary.

Goal 4: Powerful reporting and exporting

Support for QTI: You should be able to import and export items with QTI, as well as common formats like Word or Excel.

Detailed psychometric analytics: You should be able to see reports on reliability, standard error of measurement, point-biserial item discriminations, and all the other statistics that a psychometrician needs.
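For readers who want to see what some of these look like under the hood, here is a minimal sketch (with a tiny made-up response matrix) of two of them: the point-biserial discrimination and coefficient alpha with its standard error of measurement. A real platform reports many more statistics, and typically uses the corrected point-biserial that excludes the item from the total.

```python
import numpy as np

# Scored response matrix: rows = examinees, columns = items (1 = correct, 0 = incorrect).
# These data are made up purely for illustration.
X = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
])
total = X.sum(axis=1)

# Point-biserial discrimination: correlation of each item with the total score
pbis = [np.corrcoef(X[:, j], total)[0, 1] for j in range(X.shape[1])]

# Coefficient alpha (internal-consistency reliability) and the standard error of measurement
k = X.shape[1]
alpha = (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum() / total.var(ddof=1))
sem = total.std(ddof=1) * np.sqrt(1 - alpha)

print(np.round(pbis, 2), round(alpha, 2), round(sem, 2))
```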

Exporting of detailed raw files: You should be able to easily export the examinee response matrix, item times, item comments, scores, and all other result data.

API connections: You should have options to set up APIs to other platforms, like an LMS or CRM.

 

OK, now how do I find a real online testing platform?

If you are out shopping, ask about the aspects in the list above, and make sure to check vendor websites for documentation on these features.

Want to save yourself some time?  Click here to request a free account in our platform.

Digital assessment (DA) is the delivery of assessments, tests, surveys, and other measures via digital devices such as computers, tablets, and mobile phones.  It typically leverages additional technology such as the Internet or intranets.  The primary goal is to develop items, publish tests, deliver tests, and provide meaningful results as quickly, easily, and validly as possible.  In digital assessment, the design, delivery, and feedback of the assessment are all mediated by technology.

 

Why is Digital Assessment Getting So Popular?

Obviously, it is not solely because of the pandemic; it is because people have seen that things can be done differently and more efficiently than before. Globalization and digital technology are rapidly changing the world of education. Teaching and learning are becoming more learner-centric, and technology provides an opportunity for assessment to be integrated into the learning process, with corresponding adjustments. Furthermore, digital technology allows teaching and learning to move their focus from content to critical thinking. Teachers are already implementing new strategies in classrooms, and assessment needs to reflect these changes as well. Even after the pandemic ends, education will never be the way it was before, and the world will have to acknowledge the benefits that DA brings. Let’s look critically at the pros and cons of DA.

 

Advantages of Digital Assessment

  • Accessibility

One of the main pros of DA is the ease of use for staff and learners: examiners can easily set up questionnaires, determine grading methods, and send invitations to examinees. In turn, examinees do not always have to be in a classroom setting to take assessments and can do so remotely in a more comfortable environment. In addition, DA gives learners the option of taking practice tests whenever it suits them.

  • Transparency

DA allows educators to quickly evaluate the performance of a group against that of an individual learner for analytical and pedagogical purposes. The report-generating capabilities of DA enable educators to identify problem areas at both the individual and group levels soon after assessments occur, in order to adapt to learners’ needs, strengths, and weaknesses. As for learners, DA provides them with instant feedback, unlike traditional paper exams.

  • Cost-effectiveness

Conducting exams online, especially at scale, is very practical: there is no need to print innumerable question papers, involve all school staff in organizing procedures, assign invigilators, bring hundreds of students into spacious classrooms, and provide them with answer sheets and supplementary materials. Flexibility of time and venue, together with lower human, logistic, and administrative costs, gives DA a considerable advantage over traditional exam settings.

  • Eco-friendliness

In this digital era, minimizing the detrimental environmental effects of pen-and-paper exams should be a priority. Cutting down trees for paper need not remain the norm, given its adverse environmental impact. DA ensures that organizations and institutions can go paper-free and avoid printing exam papers and other materials. Furthermore, DA takes up far less storage space, since all data can be stored on a server rather than as paper records.

  • Security

Enhanced security and privacy for students is another advantage of DA. With the right safeguards in place, the probability of malicious activities, such as cheating and other practices that could rig the system and lead to incorrect results, is greatly reduced. A secure assessment system supported by AI-based proctoring features makes students more likely to accept test results without contesting them, which in turn fosters a more positive attitude toward institutions and organizations and builds stronger mutual trust between educators and learners.

  • Autograding

The benefits of DA include an automated grading system that is more convenient and time-efficient than standard marking procedures and minimizes human error. Automated scoring compares examinees’ responses against model answers and makes the relevant judgements. The spread of technology in e-education and the increasing number of learners demand a sophisticated scoring mechanism that eases teachers’ burden, saves time, and ensures fairness of assessment results. For example, digital assessment platforms can include complex modules for essay scoring, or easily implement item response theory and computerized adaptive testing.

  • Time-efficiency

Those involved in designing, managing, and evaluating assessments are aware of the tediousness of these tasks. Probably the most routine of all assessment procedures is manual invigilation, which can easily be avoided by employing proctoring services. Smart exam software, such as FastTest, offers automated item generation, item banking, test assembly, and publishing, saving precious time that would otherwise be spent on repetitive tasks. Examiners only need to upload examinees’ emails or IDs to invite them to an assessment. The best part is the instant export of results and delivery of reports to stakeholders.

  • Trendiness

There is considerably lower use of pen and paper in the digital age. The infusion of technology has altered human preferences, and these days the vast majority of educators rely on computers for communication, presentations, digital design, and various other tasks. Educators have the opportunity to mix question styles on exams, including graphics, to make them more interactive than paper ones. Many educational institutions use learning management systems (LMS) to publish study materials on cloud-based platforms and enable educators to evaluate and grade with ease. In turn, students benefit from such systems because they can submit their assignments remotely.

 

Disadvantages of Digital Assessment

  • Difficulty in grading long-answer questions

DA copes brilliantly with multiple-choice questions; however, there are still challenges in grading long-answer questions. This is where DA overlaps with traditional assessment, as subjective answers call for manual grading. Luckily, technology in the education sector continues to evolve, and even essays can already be marked digitally with the help of AI features on platforms like FastTest.

  • Need to adapt

Implementing something new always brings disruption and demands time to familiarize all stakeholders with it. Obviously, the transition from traditional assessment to DA requires investment, such as professional development for staff and upgrades to infrastructure. Some staff and students might even resist the change and feel isolated without face-to-face interaction. However, this stage is inevitable and will be a step forward for both educators and learners.

  • Infrastructural barriers & vulnerability

One of the major cons of DA is that technology is not always reliable, and some locations cannot provide all examinees with stable access to electricity, internet connection, and other basic system requirements. This is a huge problem in developing nations, and it remains a problem in many areas of well-developed nations. In addition, integrating DA technology can be very costly if the assessment design, both conceptual and aesthetic, is poorly planned. Such barriers hamper DA, which is why authorities should address them before implementing it.

 

Conclusion

To sum up, implementing DA has its merits and demerits, as outlined above. Even though technology simplifies and enhances many processes for institutions and stakeholders, it still has limitations. Nevertheless, most drawbacks can be averted by choosing the right methodology and examination software. We cannot ignore the need to transition from traditional assessment to digital assessment, given that the benefits of DA far outweigh its drawbacks and costs. Of course, it is up to you whether to keep using hard-copy assessments or go for the online option. However, we believe that in the digital era all we need to do is plan wisely and choose an easy-to-use, robust examination platform with AI-based anti-cheating measures, such as FastTest, to secure credible outcomes.

 


Assessment Systems Corporation (ASC; https://assess.com) has announced a new partnership with Sumadi (https://sumadi.net/), a leading provider of automated proctoring and exam integrity solutions.  ASC’s cutting-edge assessment platforms will deliver tests with Sumadi’s lockdown browser and proctoring based on artificial intelligence (AI), protecting the validity of exam scores and the investment made in test content.  This combines with ASC’s AI functionality in the assessment itself, including computerized adaptive testing (CAT), multistage testing (MST), item response theory (IRT), and psychometric forensics.

The two platforms are perfectly complementary, jointly providing an ideal solution for large-scale assessments that need more security than a standard browser without proctoring.  This provides a highly scalable platform that is ideal for educational assessment and pre-employment testing where there might be thousands, or tens of thousands, of examinees.  The two companies recently partnered to deliver 40,000 exams in a weekend for high stakes national assessments in Colombia.

“Sumadi is a perfect partner on a number of levels,” says Nathan Thompson, PhD, CEO of ASC.  “The products fit very well together, providing the end-to-end solution that our clients are asking for.  Some of our largest clients are in Latin America, for which a Latin American partner is ideal.  And finally, they are a great culture fit, given the focus on improving assessment with artificial intelligence and making such solutions available around the world in a cost-effective manner.  Sumadi truly aligns with our core mission of improving assessment with user-friendly technology.”

“We’re thrilled to announce this new partnership, because like Sumadi, ASC is dedicated to improving online assessments,” said Raúl Rivera, Executive Director of Sumadi.  “ASC’s online testing platform and our automated proctoring solutions will enable clients to successfully develop, deliver and safeguard the integrity of assessments, anywhere in the world.”

 

About Sumadi

Sumadi (https://sumadi.net/) provides secure automated proctoring solutions to clients around the world that safeguard learning outcomes and ensure the integrity of online assessments. Powered by the latest advancements in artificial intelligence and machine learning, these solutions ensure exam integrity with facial and typing pattern recognition, authentication, object detection, and browser tracking solutions. Sumadi offers the only multilingual automated proctoring solutions capable of being delivered simultaneously, at scale, anywhere in the world, with real-time reporting capability. Recent accolades are testament to Sumadi’s success, including being named one of the Top 10 EdTech startups in Europe in 2020 and one of the Top 10 Biometric Solution Providers in Europe in 2021 (according to Enterprise Security Magazine).  With headquarters in Honduras, Sumadi is uniquely positioned to provide best-in-class proctoring solutions across Latin America and the world.  

Sumadi comes from the ethnic Garifuna language of Honduras, meaning ‘intelligence’.

 

About ASC

Assessment Systems Corporation (ASC; https://assess.com) provides its software to organizations across many verticals, based on the common thread of leveraging better psychometric science.  EdTech companies, K-12 school districts, national Ministries of Education, national Ministries of Defence, certification/licensure boards, employment testing companies, and language assessment organizations are some of the groups that ASC partners with.  ASC’s underlying mission is to improve educational and career opportunities for people worldwide, by ensuring that assessment organizations are providing effective measurements.  In the age of data, having accurate information about humans is paramount.

Assessment Systems Corporation (ASC) is excited to announce that Chris Dufour, EdD, has joined the team as Director of Business Development.  Chris has more than 10 years of experience in building growth for EdTech, SaaS, artificial intelligence, and consulting solutions – which makes him a perfect fit for ASC, which provides all of those. ASC’s focus is on cloud-based platforms that integrate best practices in assessment, like computerized adaptive testing, multistage testing, item response theory, and automated item generation.

“I’m incredibly delighted to add someone as knowledgeable and talented as Chris to our team,” says Nathan Thompson, PhD, the CEO of ASC.  “His expertise spans both the Education market and the Certification/Continuing Education market, and he truly understands our clients and partners.  He will help bring ASC’s AI-based assessment software to more organizations, improving the business efficiency and psychometric rigor of exams around the world.”

Chris’ experience includes the Online Learning Consortium, Othot (predictive analytics SaaS), Engage2Serve (higher education CRM), and Element451 (higher ed marketing and CRM).  Before his decade in EdTech business development, Chris was an administrator at Penn State University, where he led a team in charge of statewide distance learning and professional certification programs.  He also earned a Doctorate in Education from Penn State.

Chris Dufour remarks: “I am excited to be joining Assessment Systems Corporation (ASC) as Director of Business Development. The company’s mission to positively impact the educational and career opportunities of adult learners by facilitating quality online assessments aligns well with both my academic and professional background. Having worked in the field of adult and continuing education, studied adult learning theory and assessment, and sold online professional development, training, and consulting services, I am dedicated to serving the company’s mission and making a difference in the lives of adult learners everywhere.”

For more information on ASC and its AI-based assessment platforms, visit assess.com.

Progress monitoring is an essential component of a modern educational system.  Are you interested in tracking learners’ academic achievements during a period of learning, such as a school year? Then you need to design a valid and reliable progress monitoring system that enables educators to assist students in reaching a performance target. Progress monitoring is a standardized process of assessing a specific construct or skill that should take place often enough to support pedagogical decisions and prompt appropriate action.

Why Progress monitoring?

Progress monitoring mainly serves two purposes: to identify students in need and to adjust instruction based on assessment results. Such adjustments can be made at both the individual and aggregate levels of learning. Educators should use progress monitoring data to decide whether interventions are needed to ensure that students receive support that propels their learning and matches their needs (Issayeva, 2017).

This assessment is usually criterion-referenced, not normed. Data collected after administration can show the discrepancy between students’ performance and the expected outcomes, and can be graphed to display a change in the rate of progress over time.

Progress monitoring dates back to the 1970s, when Deno and his colleagues at the University of Minnesota initiated research on applying this type of assessment to observe student progress and identify the effectiveness of instructional interventions (Deno, 1985, 1986; Foegen et al., 2008). Positive research results suggested progress monitoring as a potential solution to the educational assessment issues of the late 1980s and early 1990s (Will, 1986).

Approaches to development of measures

Two approaches to item development are highly applicable these days: robust indicators and curriculum sampling (Fuchs, 2004). It is interesting to note that the advantages of one approach tend to mirror the disadvantages of the other.

According to Foegen et al. (2008), robust indicators represent core competencies that integrate a variety of concepts and skills. Classic examples of robust indicator measures are oral reading fluency in reading and estimation in Mathematics. The most popular illustration of this approach is the Programme for International Student Assessment (PISA), which evaluates how prepared students worldwide are to apply the knowledge and skills they have obtained, regardless of the curriculum they study at school (OECD, 2012).

When using the second approach, a curriculum is analyzed and sampled in order to construct measures based on its proportional representation. Due to the direct link to the instructional curriculum, this approach enables teachers to evaluate student learning outcomes, consider instructional changes, and determine eligibility for other educational services. Progress monitoring is especially applicable when the curriculum is spiral (Bruner, 2009), since it allows students to revisit the same topics with increasing complexity.

CBM and CAT

Curriculum-based measures (CBMs) are commonly used for progress monitoring purposes. They typically follow standardized procedures for item development, administration, scoring, and reporting. CBMs are usually conducted under timed conditions, as this provides evidence of a student’s fluency within a targeted skill.

Computerized adaptive tests (CATs) are gaining more and more popularity these days, particularly within the progress monitoring framework. CATs were originally developed to replace traditional fixed-length paper-and-pencil tests and have proven to be a helpful tool for determining each learner’s achievement level (Weiss & Kingsbury, 1984).

CATs utilize item response theory (IRT) and select each subsequent item for a student in real time, based on item difficulty and the student’s answers so far. In brief, IRT is a statistical method that parameterizes items and examinees on the same scale and facilitates stronger psychometric approaches such as CAT (Weiss, 2004). Thompson and Weiss (2011) provide step-by-step guidance on how to build CATs.
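As a rough illustration of the core CAT mechanic under the Rasch model, here is a minimal sketch of maximum-information item selection: the next item is simply the unadministered item whose difficulty is closest to the current ability estimate. This is a toy example with made-up difficulties, not production code; real CAT engines add content balancing, exposure control, and proper ability estimation between items.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def next_item(theta, bank, administered):
    """Pick the unadministered item with maximum Fisher information at theta.

    Under the Rasch model, information is p * (1 - p), which is largest
    when the item difficulty b is closest to theta.
    """
    candidates = [b for b in bank if b not in administered]
    return max(candidates, key=lambda b: rasch_prob(theta, b) * (1 - rasch_prob(theta, b)))

# Toy example: a small bank identified by its difficulty values (in logits)
bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
theta_hat = 0.3          # current ability estimate
print(next_item(theta_hat, bank, administered={0.0}))  # 0.5 is closest to 0.3
```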

Progress monitoring vs. traditional assessments

Progress monitoring differs from traditional classroom assessments in several ways. First, it provides objective, reliable, and valid data on student performance, e.g., on mastery of a curriculum. Subjective judgement is unavoidable when teachers prepare classroom assessments for their students. In contrast, progress monitoring measures and procedures are standardized, which guarantees relative objectivity as well as the reliability and validity of assessment results (Deno, 1985; Foegen & Morrison, 2010). In addition, progress monitoring results are not graded, and there is no preparation prior to the test. Second, it leads to thorough feedback from teachers to students. Competent feedback helps teachers adapt their teaching methods or instruction in response to their students’ needs (Fuchs & Fuchs, 2011). Third, progress monitoring enables teachers to help students achieve long-term curriculum goals by tracking their progress in learning (Deno et al., 2001; Stecker et al., 2005). According to Hintze, Christ, and Methe (2005), progress monitoring data assist teachers in identifying specific instructional changes that help students master all learning objectives in the curriculum. Ultimately, this results in more effective preparation of students for final high-stakes exams.

 

References

Bruner, J. S. (2009). The process of education. Harvard University Press.

Deno, S. L. (1985). Curriculum-based measurement: The emerging alternative. Exceptional Children, 52, 219-232.

Deno, S. L. (1986). Formative evaluation of individual student programs: A new role of school psychologists. School Psychology Review, 15, 358-374.

Deno, S. L., Fuchs, L. S., Marston, D., & Shin, J. (2001). Using curriculum-based measurement to establish growth standards for students with learning disabilities. School Psychology Review, 30(4), 507-524.

Foegen, A., & Morrison, C. (2010). Putting algebra progress monitoring into practice: Insights from the field. Intervention in School and Clinic, 46(2), 95-103. Retrieved from http://isc.sagepub.com/content/46/2/95

Foegen, A., Olson, J. R., & Impecoven-Lind, L. (2008). Developing progress monitoring measures for secondary mathematics: An illustration in algebra. Assessment for Effective Intervention, 33(4), 240-249.

Fuchs, L. S. (2004). The past, present, and future of curriculum-based measurement research. School Psychology Review, 33, 188-192.

Fuchs, L. S., & Fuchs, D. (2011). Using CBM for Progress Monitoring in Reading. National Center on Student Progress Monitoring. Retrieved from http://files.eric.ed.gov/fulltext/ED519252.pdf

Hintze, J. M., Christ, T. J., & Methe, S. A. (2005). Curriculum-based assessment. Psychology in the School, 43, 45–56. doi: 10.1002/pits.20128

Issayeva, L. B. (2017). A qualitative study of understanding and using student performance monitoring reports by NIS Mathematics teachers [Unpublished master’s thesis]. Nazarbayev University.

OECD (2012). Lessons from PISA for Japan, Strong Performers and Successful Reformers in Education. OECD Publishing. http://dx.doi.org/10.1787/9789264118539-en.

Stecker, P. M., Fuchs, L. S., & Fuchs, D. (2005). Using curriculum-based measurement to improve student achievement: Review of research. Psychology in the Schools, 42(8), 795-819.

Thompson, N. A., & Weiss, D. J. (2011). A framework for the development of computerized adaptive tests. Practical Assessment, Research, and Evaluation, 16(1), 1.

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21(4), 361-375.

Weiss, D. J. (2004). Computerized adaptive testing for effective and efficient measurement in counseling and education. Measurement and Evaluation in Counseling and Development, 37(2), 70-84.

Will, M. C. (1986). Educating children with learning problems: A shared responsibility. Exceptional Children, 52(5), 411-415.

 

A T Score (sometimes hyphenated T-Score) is a common example of a scaled score in psychometrics and assessment.  A scaled score is simply a way to present scores in a more meaningful and easier-to-digest context, with the benefit of hiding the sometimes arcane technicalities of psychometrics.  A T Score is therefore a standardized way of presenting scores to make them easier to understand.

What is a T Score?

A T score is a linear conversion of the standard normal distribution, aka the Bell Curve.  The normal distribution places observations (of anything, not just test scores) on a scale that has a mean of 0.00 and a standard deviation of 1.00.  The T Score simply converts this to a mean of 50 and a standard deviation of 10; that is, T = 50 + 10z.  This has two immediate benefits for most consumers (a short conversion sketch follows the list below):

  1. There are no negative scores; people generally do not like to receive a negative score!
  2. Scores are round numbers that generally range from 20 to 80 (or from 0 to 100 if the bounds are set at 5 standard deviations rather than 3); this somewhat fits with what most people expect from their school days, even though the numbers mean something entirely different.
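For anyone who wants to play with the numbers, here is a minimal sketch of the conversion and of the percentile mapping discussed below, assuming a normal distribution (the function names are just for illustration, and scipy is used for the normal CDF).

```python
from scipy.stats import norm

def t_score(z):
    """Convert a z-score (mean 0, SD 1) to a T Score (mean 50, SD 10)."""
    return 50 + 10 * z

def t_to_percentile(t):
    """Percentile rank implied by a T Score, assuming a normal distribution."""
    return 100 * norm.cdf((t - 50) / 10)

print(t_score(-1.0))               # 40.0
print(round(t_to_percentile(40)))  # ~16th percentile
print(round(t_to_percentile(70)))  # ~98th percentile
```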

The image below shows the normal distribution, labeled with the T Score and also Percentile Rank for interpretation.

[Figure: The normal distribution labeled with T Scores and percentile ranks]

How to interpret a T score?

As you can see above, a T Score of 40 means that you are at approximately the 16th percentile.  A 70 means that you are at approximately the 98th percentile – so it is actually quite high, though students who are used to receiving scores in the 90s may feel like it is low!

Since there is a 1-to-1 mapping of the T Score to the other rows, you can see that it does not actually provide any new information.  It is simply a conversion to round, positive numbers that are easier to digest.

Is a T Score like a t-test?

No.  Despite the shared letter, it could not be more unrelated; the t-test is a statistical hypothesis test, not a score scale.

How do I implement a T Score with an assessment?

If you are using off-the-shelf psychological assessments, they will likely produce a T Score for you in the results.  If you want to utilize it for your own assessments, you need a world-class assessment platform like FastTest that has strong functionality for scoring methods and scaled scoring.  An example of this is below.

[Figure: Scaled score configuration in FastTest]

 

There are many remote proctoring software providers on the market, with a surprisingly wide range of functionality. The COVID-19 pandemic has accelerated this industry even further.  Because of this, ASC combed the internet to compile a list of remote proctoring software for our clients. Are we missing any?  Contact sales@assess.com to suggest one.

Note that this does not include general video software, such as Google Meet, Microsoft Teams, or Zoom – all can be used for proctoring as well, but are not specifically designed for it.

ASC’s Preferred Partners

ASC’s online assessment platforms are integrated with some of the leading remote proctoring software providers.

Live: MonitorEDU, Examity
AI: Examus, Sumadi
Record and Review: Examus, ProctorExam
Bring Your Own Proctor: Examus, ProctorExam

List of Remote Proctoring Software Providers

 

# Name Website Country Proctor Service
1 Aiproctor https://www.aiproctor.com/ USA AI
2 Centre Based Test (CBT) https://www.conductexam.com/center-based-online-test-software India Live, Record and Review
3 Class in Pocket https://classinpocket.com/ India AI
4 Datamatics https://www.datamatics.com/industries/education-technology/proctoring India AI, Live, Record and Review
5 DigiProctor https://www.digiproctor.com India AI
6 Disamina https://disamina.in/ India AI
7 Examity https://www.examity.com/ USA Live
8 ExamMonitor https://examsoft.com/ USA Record and Review
9 ExamOnline https://examonline.in/remote-proctoring/ India AI, Live
10 ExamRoom.AI https://examroom.ai/ USA AI, Live
11 Examus https://examus.com Russia AI, Bring Your Own Proctor, Live
12 EasyProctor https://www.excelsoftcorp.com/products/assessment-and-proctoring-solutions/ India AI, Live, Record and Review
13 HonorLock https://honorlock.com/ USA AI, Record and Review
14 Invigulus https://www.invigulus.com/ USA AI, Live, Record and Review
15 Iris Invigilation https://www.irisinvigilation.com/ Australia AI
16 Mettl https://mettl.com/en/online-remote-proctoring/ India AI, Live, Record and Review
17 MonitorEdu https://monitoredu.com/proctoring USA Live
18 OnVUE https://home.pearsonvue.com/Test-takers/OnVUE-online-proctoring.aspx USA Live
19 Oxagile https://www.oxagile.com/competence/edtech-solutions/proctoring/ USA AI, Live, Record and Review
20 Parakh https://parakh.online/blog/remote-proctoring-ultimate-solution-for-secure-online-exam India AI, Live, Record and Review
21 ProctorFree https://www.proctorfree.com/ USA AI, Live
22 Proctor360 https://proctor360.com/ USA AI, Bring Your Own Proctor, Live, Record and Review
23 ProctorEDU https://proctoredu.com/ Russia AI, Live, Record and Review
24 ProctorExam https://proctorexam.com/ Netherlands Bring Your Own Proctor, Live, Record and Review
25 Proctorio https://proctorio.com/products/online-proctoring USA AI, Live
26 Proctortrack https://www.proctortrack.com/ USA AI, Live
27 ProctorU https://www.proctoru.com/ USA AI, Live, Record and Review
28 Proview https://proview.io/ USA AI, Live
29 PSI Bridge https://www.psionline.com/en-gb/platforms/psi-bridge/ USA Live, Record and Review
30 Respondus Monitor https://web.respondus.com/he/monitor/ USA AI, Live, Record and Review
31 Rosalyn https://www.rosalyn.ai/ USA AI, Live
32 SmarterProctoring https://smarterservices.com/smarterproctoring/ USA AI, Bring Your Own Proctor, Live
33 Sumadi https://sumadi.net/ Honduras AI
34 Suyati https://suyati.com/solutions/online-proctoring-solution/ India AI, Live, Record and Review
35 TCS iON Remote Assessments https://learning.tcsionhub.in/hub/remote-assessment-marking-internship/ India AI, Live
36 Think Exam https://www.thinkexam.com/remoteproctoring India AI, Live
37 uxpertise XP https://uxpertise.ca/en/uxpertise-xp/ Canada AI, Live, Record and Review
38 Proctor AI https://www.visive.ai/solutions/proctor-ai India AI, Live, Record and Review
39 Wise Proctor https://wiseattend.com/wiseproctor USA AI, Record and Review
40 Xobin https://xobin.com/online-remote-proctoring India AI
41 Youtestme https://www.youtestme.com/using-professional-tool/ Canada AI, Live

 

 

Laila Issayeva, MS

Nathan Thompson, PhD

Vertical scaling is the process of placing scores from educational assessments that measure the same or similar knowledge domains, but at different ability levels, onto a common scale (Tong & Kolen, 2008). The most common example is putting K-12 Mathematics or Language assessments from multiple grades onto a single scale per subject. While general information about scaling can be found at What is Scaling?, this article focuses specifically on vertical scaling.

Why vertical scaling?

A vertical scale is incredibly important, as it enables inferences about student progress from one point in time to another, e.g., from elementary grades through high school, and can be considered a developmental continuum of student academic achievement. In other words, students move along that continuum as they develop new abilities, and their scale score changes as a result (Briggs, 2010).

This is important not only for individual students, because we can track learning and assign appropriate interventions or enrichment, but also in an aggregate sense.  Which schools are growing more than others?  Are certain teachers more effective?  Is there a noticeable difference between instructional methods or curricula?  Here we come to the fundamental purpose of assessment: just as you need a bathroom scale to track your weight in a fitness regime, if a government implements a new Math instructional method, how does it know whether students are learning more effectively?

Using a vertical scale can create a common interpretive framework for test results across grades and, therefore, provide important data that inform individual and classroom instruction. To be valid and reliable, these data have to be gathered based on properly constructed vertical scales.

Vertical scales can be compared to rulers that measure student growth in a subject area from one testing occasion to another. Similarly to height or weight, student capabilities are assumed to grow over time.  However, if you have a ruler that is only 1 meter long and you are trying to measure growth from 3-year-olds to 10-year-olds, you would need to link several rulers together.

Construction of Vertical Scales

Construction of a vertical scale is a complicated process which involves making decisions on test design, scaling design, scaling methodology, and scale setup. Interpretation of progress on a vertical scale depends on the resulting combination of such scaling decisions (Harris, 2007; Briggs & Weeks, 2009). Once a vertical scale is established, it needs to be maintained over different forms and time. According to Hoskens et al. (2003), a method chosen for maintaining vertical scales affects the resulting scale, and, therefore, is very important.

The measurement model used to place student abilities on a vertical scale is typically item response theory (IRT; Lord, 2012; De Ayala, 2009) or the Rasch model (Rasch, 1960).  This approach allows direct comparisons of assessment results based on different item sets (Berger et al., 2019). Thus, each student can take a different selection of items than other students, yet the results remain comparable with theirs, as well as with the student’s own results from other assessment occasions.

The image below shows how student results from different grades can be conceptualized on a common vertical scale.  Suppose you calibrate data from each grade separately, but have anchor items between the three groups.  A linking analysis might suggest that Grade 4 is 0.5 logits above Grade 3, and Grade 5 is 0.7 logits above Grade 4.  You can think of the bell curves as overlapping, as shown below.  A theta of 0.0 on the Grade 5 scale is then equivalent to 0.7 on the Grade 4 scale, and 1.2 on the Grade 3 scale.  If you have a strong linking, you can put Grade 3 and Grade 4 items/students onto the Grade 5 scale, as well as all other grades using the same approach; a small sketch of this conversion follows the figure.

[Figure: Overlapping grade-level distributions on a common vertical scale]
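To make that arithmetic concrete, here is a minimal sketch of converting thetas between grade scales using the hypothetical linking constants above (0.7 and 0.5 logits). The dictionary and function are purely illustrative, not taken from any particular software.

```python
# Hypothetical grade means expressed on a common scale anchored at Grade 5:
# Grade 4 sits 0.7 logits below Grade 5, and Grade 3 a further 0.5 logits below Grade 4.
grade_mean = {3: -1.2, 4: -0.7, 5: 0.0}

def convert_theta(theta, from_grade, to_grade):
    """Re-express a within-grade theta on another grade's scale."""
    common = theta + grade_mean[from_grade]   # move to the common (Grade 5) scale
    return common - grade_mean[to_grade]      # shift into the target grade's frame

print(convert_theta(0.0, 5, 4))  # 0.7  (the Grade 5 mean, on the Grade 4 scale)
print(convert_theta(0.0, 5, 3))  # 1.2  (the Grade 5 mean, on the Grade 3 scale)
```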

Test design

Kolen and Brennan (2014) describe three types of test designs for collecting the student response data that will be calibrated:

  •  Equivalent group design. Student groups with presumably comparable ability distributions within a grade are randomly assigned to answer items related to their own or an adjacent grade;
  •  Common item design. Identical items are administered to students from adjacent grades (not requiring equivalent groups) to establish a link between the two grades and to align overlapping item blocks within a grade, such as putting some Grade 5 items on the Grade 6 test, some Grade 6 items on the Grade 7 test, etc.;
  •  Scaling test design. This type is very similar to common item design but, in this case, common items are shared not only between adjacent grades; there is a block of items administered to all involved grades besides items related to the specific grade.

From a theoretical perspective, the design most consistent with a domain definition of growth is the scaling test design. The common item design is the easiest to implement in practice, but only if administering the same items to adjacent grades is reasonable from a content perspective. The equivalent group design requires more complicated administration procedures within a grade to ensure samples with equivalent ability distributions.

Scaling design

The scaling procedure can use observed scores or be IRT-based. The most commonly used scaling procedures in vertical scale settings are Hieronymus, Thurstone, and IRT scaling (Yen, 1986; Yen & Burket, 1997; Tong & Harris, 2004). An interim scale is chosen in all three methodologies (von Davier et al., 2006).

  • Hieronymus scaling. This method uses a total number-correct score for dichotomously scored tests or a total number of points for polytomously scored items (Petersen et al., 1989). The scaling test is constructed in a way to represent content in an increasing order in terms of level of testing, and it is administered to a representative sample from each testing level or grade. The within- and between-level variability and growth are set on an external scaling test, which is the special set of common items.
  • Thurstone scaling. According to Thurstone (1925, 1938), this method first creates an interim score scale and then normalizes the distributions of variables at each level or grade. It assumes that scores on an underlying scale are normally distributed within each group of interest and, therefore, uses a total number-correct score for dichotomously scored tests or a total number of points for polytomously scored items to conduct scaling. Thus, Thurstone scaling normalizes and linearly equates raw scores, and it is usually conducted within equivalent groups.
  • IRT scaling. This method of scaling considers person-item interactions. Theoretically, IRT scaling can be applied with any IRT model, including multidimensional IRT models or diagnostic models. In practice, only unidimensional models, such as the Rasch and/or partial credit model (PCM) or the 3PL model, are used (von Davier et al., 2006).

Data calibration

Once all decisions are made, including the test design and scaling design, and the tests have been administered to students, the items need to be calibrated with software like Xcalibre to establish the vertical measurement scale. According to Eggen and Verhelst (2011), item calibration within the context of the Rasch model involves establishing model fit and estimating the difficulty parameter of each item from response data by means of maximum likelihood estimation procedures.

Two procedures, concurrent and grade-by-grade calibration, are employed to link IRT-based item difficulty parameters to a common vertical scale across multiple grades (Briggs & Weeks, 2009; Kolen & Brennan, 2014). Under concurrent calibration, all item parameters are estimated in a single run by means of the items shared by adjacent grades (Wingersky & Lord, 1983).  In contrast, under grade-by-grade calibration, item parameters are estimated separately for each grade and then transformed onto one common scale via linear methods. The most accurate method for determining the linking constants, by minimizing differences between the linking items’ characteristic curves across grades, is the Stocking and Lord method (Stocking & Lord, 1983). This is accomplished with software like IRTEQ.  A simpler alternative, the mean/sigma method, is sketched below.
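Stocking-Lord linking is best left to dedicated software, but the underlying idea of estimating a slope and intercept from the common items can be illustrated with the simpler mean/sigma method. The sketch below uses made-up difficulty parameters for a handful of anchor items shared by two grade-level calibrations; it is not the Stocking-Lord procedure itself.

```python
import numpy as np

# Difficulty (b) parameters of the common items, estimated separately in two calibrations.
# These numbers are invented for illustration.
b_grade5 = np.array([-0.8, -0.2, 0.1, 0.6, 1.2])    # reference scale
b_grade4 = np.array([-1.4, -0.9, -0.6, -0.1, 0.5])  # scale to be transformed

# Mean/sigma linking: slope and intercept of the transformation onto the reference scale
A = b_grade5.std(ddof=1) / b_grade4.std(ddof=1)
B = b_grade5.mean() - A * b_grade4.mean()

# The same A and B are then applied to all Grade 4 item parameters and thetas
b_grade4_on_grade5 = A * b_grade4 + B
print(round(A, 2), round(B, 2))  # roughly 1.04 and 0.70 for these made-up values
```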

Summary of Vertical Scaling

Vertical scaling is an extremely important topic in the world of educational assessment, especially K-12 education.  As mentioned above, this is not only because it facilitates instruction for individual students, but is the basis for information on education at the aggregate level.

There are several approaches to implementing vertical scaling, but the IRT-based approach is very compelling.  A vertical IRT scale represents student ability across multiple school grades and item difficulty across a broad range.  Moreover, items and people are located on the same latent scale. Thanks to this feature, the IRT approach supports purposeful item selection and, therefore, algorithms for computerized adaptive testing (CAT), which use preliminary ability estimates to pick the most appropriate and informative items for each individual student (Wainer, 2000; van der Linden & Glas, 2010).  Therefore, even if the pool of items is 1,000 questions stretching from kindergarten to Grade 12, you can deliver a single test to any student in that range and it will adapt to them.  Even better, you can deliver the same test several times per year, and because students are learning, they will receive a different set of items.  As such, CAT with a vertical scale is an incredibly fitting approach for K-12 formative assessment.

 

Additional Reading

Reckase (2010) states that the literature on vertical scaling is scarce, despite going back to the 1920s, and recommends some contemporary practice-oriented research studies:

Paek and Young (2005). This study dealt with the effects of Bayesian priors on the estimation of student locations on the continuum when using a fixed item parameter linking method. First, a within-group calibration was done for one grade level; then the parameters of the common items from that calibration were fixed in order to calibrate the next grade level. This approach forces the parameter estimates to be the same for the common items at adjacent grade levels. The results showed that the prior distributions could affect the outcomes, and that careful checks should be done to minimize these effects.

Reckase and Li (2007). This book chapter describes a simulation study of the impact of dimensionality on vertical scaling. Both multidimensional and unidimensional IRT models were employed to simulate data in order to observe growth across three achievement constructs. The results showed that the multidimensional model recovered the gains better than the unidimensional models, but those gains were underestimated, mostly due to the common item selection. This emphasizes the importance of using common items that cover all of the content assessed at adjacent grade levels.

Li (2007). The goal of this doctoral dissertation was to determine whether multidimensional IRT methods could be used for vertical scaling and what factors might affect the results. The study was based on a simulation designed to match state assessment data in Mathematics. The results showed that multidimensional approaches were feasible, but it was important that the common items cover all the dimensions assessed at the adjacent grade levels.

Ito, Sykes, and Yao (2008). This study compared concurrent and separate grade-group calibration while developing a vertical scale spanning nine consecutive grades, tracking student competencies in Reading and Mathematics. The study used the BMIRT software, which implements Markov chain Monte Carlo estimation. The results showed that concurrent and separate grade-group calibrations produced different results for Mathematics than for Reading. This, in turn, confirms that implementing vertical scaling is very challenging, and that combinations of decisions about its construction can have noticeable effects on the results.

Briggs and Weeks (2009). This study was based on real data, using item responses from the Colorado Student Assessment Program. The study compared vertical scales based on the 3PL model with those based on the Rasch model. In general, the 3PL model produced vertical scales with greater rises in performance from year to year, but also greater increases in within-grade variability, than the Rasch-based scale did. All methods resulted in growth curves showing smaller gains at higher grade levels, whereas the standard deviations did not differ much in size across grade levels.

References

Berger, S., Verschoor, A. J., Eggen, T. J., & Moser, U. (2019, October). Development and validation of a vertical scale for formative assessment in mathematics. In Frontiers in Education (Vol. 4, p. 103). Frontiers. Retrieved from https://www.frontiersin.org/articles/10.3389/feduc.2019.00103/full

Briggs, D. C., & Weeks, J. P. (2009). The impact of vertical scaling decisions on growth interpretations. Educational Measurement: Issues and Practice, 28(4), 3–14.

Briggs, D. C. (2010). Do Vertical Scales Lead to Sensible Growth Interpretations? Evidence from the Field. Online Submission. Retrieved from https://files.eric.ed.gov/fulltext/ED509922.pdf

De Ayala, R. J. (2009). The Theory and Practice of Item Response Theory. New York: Guilford Publications Incorporated.

Eggen, T. J. H. M., & Verhelst, N. D. (2011). Item calibration in incomplete testing designs. Psicológica 32, 107–132.

Harris, D. J. (2007). Practical issues in vertical scaling. In Linking and aligning scores and scales (pp. 233–251). Springer, New York, NY.

Hoskens, M., Lewis, D. M., & Patz, R. J. (2003). Maintaining vertical scales using a common item design. In annual meeting of the National Council on Measurement in Education, Chicago, IL.

Ito, K., Sykes, R. C., & Yao, L. (2008). Concurrent and separate grade-groups linking procedures for vertical scaling. Applied Measurement in Education, 21(3), 187–206.

Kolen, M. J., & Brennan, R. L. (2014). Item response theory methods. In Test Equating, Scaling, and Linking (pp. 171–245). Springer, New York, NY.

Li, T. (2007). The effect of dimensionality on vertical scaling (Doctoral dissertation, Michigan State University. Department of Counseling, Educational Psychology and Special Education).

Lord, F. M. (2012). Applications of item response theory to practical testing problems. Routledge.

Paek, I., & Young, M. J. (2005). Investigation of student growth recovery in a fixed-item linking procedure with a fixed-person prior distribution for mixed-format test data. Applied Measurement in Education, 18(2), 199–215.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York: Macmillan.

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danmarks Paedagogiske Institut.

Reckase, M. D., & Li, T. (2007). Estimating gain in achievement when content specifications change: a multidimensional item response theory approach. Assessing and modeling cognitive development in school. JAM Press, Maple Grove, MN.

Reckase, M. (2010). Study of best practices for vertical scaling and standard setting with recommendations for FCAT 2.0. Unpublished manuscript. Retrieved from https://www.fldoe.org/core/fileparse.php/5663/urlt/0086369-studybestpracticesverticalscalingstandardsetting.pdf

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied psychological measurement, 7(2), 201–210. doi:10.1177/014662168300700208

Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of educational psychology, 16(7), 433–451.

Thurstone, L. L. (1938). Primary mental abilities (Psychometric monographs No. 1). Chicago: University of Chicago Press.

Tong, Y., & Harris, D. J. (2004, April). The impact of choice of linking and scales on vertical scaling. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Tong, Y., & Kolen, M. J. (2008). Maintenance of vertical scales. In annual meeting of the National Council on Measurement in Education, New York City.

van der Linden, W. J., & Glas, C. A. W. (eds.). (2010). Elements of Adaptive Testing. New York, NY: Springer.

von Davier, A. A., Carstensen, C. H., & von Davier, M. (2006). Linking competencies in educational settings and measuring growth. ETS Research Report Series, 2006(1), i–36. Retrieved from https://files.eric.ed.gov/fulltext/EJ1111406.pdf

Wainer, H. (Ed.). (2000). Computerized adaptive testing: A Primer, 2nd Edn. Mahwah, NJ: Lawrence Erlbaum Associates.

Wingersky, M. S., & Lord, F. M. (1983). An Investigation of Methods for Reducing Sampling Error in Certain IRT Procedures (ETS Research Reports Series No. RR-83-28-ONR). Princeton, NJ: Educational Testing Service.

Yen, W. M. (1986). The choice of scale for educational measurement: An IRT perspective. Journal of Educational Measurement, 23(4), 299–325.

Yen, W. M., & Burket, G. R. (1997). Comparison of item response theory and Thurstone methods of vertical scaling. Journal of Educational Measurement, 34(4), 293–313.

Laila Issayeva, MSc

Nathan Thompson, PhD

Test equating refers to the issue of defensibly translating scores from one test form to another.  That is, if you have an exam where half of the students see one set of items while the other half see a different set, how do you know that a score of 70 means the same thing on both forms?  What if one form is a bit easier?  If you are delivering assessments in conventional linear forms – or piloting a bank for CAT/LOFT – you are likely to utilize more than one test form, and therefore you are faced with the issue of test equating.

When two test forms have been properly equated, educators can validly interpret performance on one test form as having the same substantive meaning compared to the equated score of the other test form (Ryan & Brockmann, 2009). While the concept is simple, the methodology can be complex, and there is an entire area of psychometric research devoted to this topic. This post will provide an overview of the topic.

Why do we need test equating?

The need is obvious: to adjust for differences in difficulty so that all examinees receive a fair score on a stable scale. Suppose you take Form A and get a score of 72/100, while your friend takes Form B and gets a score of 74/100. Is your friend smarter than you, or did his form happen to have easier questions?  What if the passing score on the exam is 73? Well, if the test designers built in some overlap of items between the forms, we can answer this question empirically.

Suppose the two forms overlap by 50 items, called anchor items or equator items. They are delivered to a large, representative sample. Here are the results.

Exam Form | Mean score on 50 overlap items | Mean score on 100 total items
A | 30 | 72
B | 32 | 74

Because the Form B group's mean score on the anchor items was higher, we conclude that this group was a little more able, which led to the higher total score.

Now suppose these are the results:

Exam Form | Mean score on 50 overlap items | Mean score on 100 total items
A | 32 | 72
B | 32 | 74

Now we have evidence that the groups are of equal ability.  The higher total score on Form B must then be because the unique items on that form are a bit easier.

What is test equating?

According to Ryan and Brockmann (2009), “Equating is a technical procedure or process conducted to establish comparable scores, with equivalent meaning, on different versions of test forms of the same test; it allows them to be used interchangeably.” (p. 8).  Thus, successful equating is an important factor in evaluating assessment validity, and, therefore, it often becomes an important topic of discussion within testing programs.

Practice has shown that scores, and the tests producing those scores, must satisfy very strong requirements to achieve this demanding goal of interchangeability. Equating would not be necessary if test forms were assembled to be strictly parallel, meaning that they have identical psychometric properties. In reality, it is almost impossible to construct multiple test forms that are strictly parallel, so equating is needed to compensate for the differences that remain after test construction.

Dorans, Moses, and Eignor (2010) suggest the following five requirements for equating two test forms:

  • the tests should measure the same construct (e.g., latent trait, skill, ability);
  • the tests should have the same level of reliability;
  • the transformation mapping scores from one test to the other should be the inverse of the transformation in the opposite direction (symmetry);
  • an examinee's reported result should not depend on which test form they actually take (equity);
  • the equating function used to link the scores of the two tests should be the same regardless of the choice of (sub)population from which it is derived (population invariance).

How do I calculate an equating?

CTT methods include linear equating and equipercentile equating, as well as several others.  Newer approaches that work well with small samples include Circle-Arc (Livingston & Kim, 2009) and Nominal Weights (Babcock, Albano, & Raymond, 2012).  Specific methods for linear equating include Tucker, Levine, and Chained (von Davier & Kong, 2003).  Linear equating approaches are conceptually simple and easy to interpret; given the examples above, the equating transformation might be estimated with a slope of 1.01 and an intercept of 1.97, which would directly confirm the hypothesis that one form was about 2 points easier than the other.
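As an illustration, here is a minimal sketch of linear equating in Python (not from the original post). It uses the simplest design, random groups; anchor-based methods such as Tucker and Levine make further adjustments for group differences, and the function name and simulated scores are hypothetical.

```python
import numpy as np

def linear_equate(scores_a, scores_b):
    """Linear equating under a random-groups design: return the slope and
    intercept that map Form A raw scores onto the Form B scale so that the
    equated Form A scores match Form B's mean and standard deviation."""
    mu_a, sd_a = np.mean(scores_a), np.std(scores_a, ddof=1)
    mu_b, sd_b = np.mean(scores_b), np.std(scores_b, ddof=1)
    slope = sd_b / sd_a
    intercept = mu_b - slope * mu_a
    return slope, intercept

# Simulated scores roughly matching the example above (Form A is slightly harder)
rng = np.random.default_rng(7)
form_a = rng.normal(72, 10, size=2000)
form_b = rng.normal(74, 10, size=2000)

slope, intercept = linear_equate(form_a, form_b)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"A raw score of 70 on Form A is about {slope * 70 + intercept:.1f} on the Form B scale")
```

With these simulated data, the slope comes out near 1 and the intercept near 2, matching the hypothetical transformation described above.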

IRT approaches include equating through common items (equating by applying an equating constant, equating by concurrent or simultaneous calibration, and equating with common items through test characteristic curves) and common-person calibration (Ryan & Brockmann, 2009). The common-item approach is used quite often, and specific methods for finding the constants (conversion parameters) include Stocking-Lord, Haebara, Mean/Mean, and Mean/Sigma.  Because IRT assumes that two scales on the same construct differ by only a simple linear transformation, all we need to do is find the slope and intercept of that transformation.  Those methods do so, and often produce figures like the one below.  Note that the b parameters do not fall on the identity line, because there was indeed a difference between the groups, and the linking results clearly reflect that difference.

[Figure: IRTEQ test equating output, plotting the anchor items' b parameters from the two forms against the identity line]
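Here is a minimal sketch of the Mean/Sigma method, the simplest of the linking approaches named above, in Python (not from the original post); the anchor-item difficulty values are hypothetical.

```python
import numpy as np

def mean_sigma_constants(b_new, b_ref):
    """Mean/Sigma IRT linking: find the slope A and intercept B that place
    the new form's scale onto the reference scale, using the anchor items'
    difficulty (b) parameters from the two separate calibrations."""
    b_new, b_ref = np.asarray(b_new), np.asarray(b_ref)
    A = np.std(b_ref, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_ref) - A * np.mean(b_new)
    return A, B

# Hypothetical anchor-item b parameters estimated in two separate calibrations
b_new = [-1.2, -0.4, 0.1, 0.8, 1.5]   # new form's scale
b_ref = [-1.0, -0.2, 0.3, 1.0, 1.7]   # reference scale
A, B = mean_sigma_constants(b_new, b_ref)
print(f"A = {A:.3f}, B = {B:.3f}")
# All new-form parameters are then rescaled: b* = A*b + B, a* = a/A, theta* = A*theta + B
```

With these values, A comes out to 1.0 and B to 0.2, meaning the two calibrations differ only by a shift of 0.2 on the theta scale.  The characteristic-curve methods (Stocking-Lord and Haebara) find A and B by minimizing differences between the two forms' characteristic curves instead of simply matching the mean and standard deviation of the b parameters.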

Practitioners can equate forms with classical test theory (CTT) or item response theory (IRT). However, one of the motivations for developing IRT was that CTT equating is relatively weak. Hambleton and Jones (1993) explain that when CTT equating methods are applied, the ability parameter (i.e., the observed score) and the item parameters (i.e., difficulty and discrimination) are dependent on each other, limiting their utility in practical test development. IRT solves this interdependency problem by combining ability and item parameters in one model. The IRT equating methods are more accurate and stable than the CTT methods (Hambleton & Jones, 1993; Han, Kolen, & Pohlmann, 1997; De Ayala, 2013; Kolen & Brennan, 2014) and provide a solid basis for modern large-scale computer-based tests, such as computerized adaptive tests (Educational Testing Service, 2010; OECD, 2017).

Of course, one of the reasons that CTT is still around in general is that it works much better with smaller samples, and this is also the case for CTT test equating (Babcock, Albano, & Raymond, 2012).

How do I implement test equating?

Test equating is a mathematically complex process, regardless of which method you use.  Therefore, it requires special software.  Here are some programs to consider.

  1. CIPE performs both linear and equipercentile equating with classical test theory. It is available from the University of Iowa’s CASMA site, which also hosts several other useful software programs.
  2. IRTEQ is an easy-to-use program that performs all of the major IRT conversion (linking) methods. It is available from the University of Massachusetts website, which also hosts several other good programs.
  3. There are many R packages for equating and related psychometric topics; this article claims that there are 45 packages for IRT analysis alone!
  4. If you want to do IRT equating, you need IRT calibration software; we highly recommend Xcalibre, since it is easy to use and automatically creates reports in Word for you. The calibration approaches to IRT equating (anchor-item and concurrent calibration) are handled directly by IRT software like Xcalibre, while the conversion approach requires separate software like IRTEQ.

Equating is typically performed by highly trained psychometricians; in many cases, an organization will contract out to a testing company or consultant with the relevant experience.  Contact us if you’d like to discuss this.

Does equating happen before or after delivery?

Both.  These are called pre-equating and post-equating (Ryan & Brockmann, 2009).  Post-equating means the calculation is done after delivery, once you have a full data set; for example, if a test is delivered twice per year on a single day, we can equate after that day.  Pre-equating is trickier, because you are trying to calculate the equating before a test form has ever been delivered to an examinee; but it is absolutely necessary in many situations, especially those with continuous delivery windows.

How do I learn more about test equating?

If you are eager to learn more about the topic of equating, the classic reference is the book by Kolen and Brennan (2004; 2014), which provides the most complete coverage of score equating and linking.  There are other resources more readily available on the internet, like this free handbook from CCSSO. If you would like to learn more about IRT, we suggest the books by De Ayala (2008) and Embretson and Reise (2000). A brief intro to IRT equating is available on our website.

Several new ideas of general use in equating, with a focus on kernel equating, were introduced in the book by von Davier, Holland, and Thayer (2004). Holland and Dorans (2006) presented a historical background for test score linking, based on work by Angoff (1971), Flanagan (1951), and Petersen, Kolen, and Hoover (1989). If you look for a straightforward description of the major issues and procedures encountered in practice, then you should turn to Livingston (2004).

References

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington DC: American Council on Education.

Babcock, B., Albano, A., & Raymond, M. (2012). Nominal Weights Mean Equating: A Method for Very Small Samples. Educational and Psychological Measurement, 72(4), 1-21.

Dorans, N. J., Moses, T. P., & Eignor, D. R. (2010). Principles and practices of test score equating. ETS Research Report Series, 2010(2), i-41. Retrieved from https://www.ets.org/Media/Research/pdf/RR-10-29.pdf

De Ayala, R. J. (2008). A commentary on historical perspectives on invariant measurement: Guttman, Rasch, and Mokken.

De Ayala, R. J. (2013). Factor analysis with categorical indicators: Item response theory. In Applied quantitative analysis in education and the social sciences (pp. 220-254). Routledge.

Educational Testing Service (2010). Linking TOEFL iBT Scores to IELTS Scores: A Research Report. Princeton, NJ: Educational Testing Service.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695-763). Washington DC: American Council on Education.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38-47.

Han, T., Kolen, M., & Pohlmann, J. (1997). A comparison among IRT true- and observed-score equatings and traditional equipercentile equating. Applied Measurement in Education, 10(2), 105-121.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187-220). Westport, CT: Praeger.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, linking, and scaling: Methods and practices (2nd ed.). New York, NY: Springer-Verlag.

Kolen, M. J., & Brennan, R. L. (2014). Item response theory methods. In Test Equating, Scaling, and Linking (pp. 171-245). New York, NY: Springer.

Livingston, S. A. (2004). Equating test scores (without IRT). Princeton, NJ: ETS.

Livingston, S. A., & Kim, S. (2009). The Circle-Arc Method for Equating in Small Samples. Journal of Educational Measurement, 46(3), 330-343.

OECD (2017). PISA 2015 Technical Report. Paris: OECD Publishing.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York, NY: Macmillan.

Ryan, J., & Brockmann, F. (2009). A Practitioner’s Introduction to Equating with Primers on Classical Test Theory and Item Response Theory. Council of Chief State School Officers.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York, NY: Springer.

von Davier, A. A., & Kong, N. (2003). A unified approach to linear equating for non-equivalent groups design (ETS Research Report No. RR-03-31). Princeton, NJ: Educational Testing Service. https://www.ets.org/Media/Research/pdf/RR-03-31-vonDavier.pdf

 

Positive manifold refers to the fact that scores on cognitive assessments tend to correlate positively, and often highly, with each other, indicating a strong common latent dimension.  This latent dimension became known as g, for general intelligence or general cognitive ability.  This post discusses what the positive manifold is, but since there are many other resources on the definition, it also explains how the concept is useful in the real world.

The term positive manifold originally came out of work in the field of intelligence testing, including research by Charles Spearman.  There are literally hundreds of studies on this topic, and over one hundred years of research have shown that the concept is scientifically supported, but it is important to remember that it is a general pattern of positive correlations, not a perfect relationship.  That is, we can expect verbal reasoning ability to correlate highly with quantitative reasoning or logical reasoning, but it is by no means a 1-to-1 relationship.  There are certainly some people who are high on one but not another.  However, it is very unlikely for someone to be in the 90th percentile on one and the 10th percentile on another.

What is Positive Manifold?

If you were to take a set of cognitive tests, either separately or as subtests of a battery like the Wechsler Adult Intelligence Scale, and correlate their scores, the correlation matrix would be overwhelmingly positive.  For example, look at Table 2-9 in this book, or Table 4 in this article.  There are many, many more examples if you search for keywords like “intelligence intercorrelation.”

As you might expect, related constructs correlate more highly.  A battery might have a Verbal Reasoning test and a Vocabulary test; we would expect these to correlate more highly with each other (maybe 0.80) than with a Figural Reasoning test (maybe 0.50).  Researchers use a methodology called factor analysis to analyze this structure and drive interpretations.
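The pattern is easy to see in a small simulation. The sketch below (not from the original post) uses a single-factor model with hypothetical loadings chosen to roughly reproduce the correlations mentioned above; real intelligence structure is more complex, so treat this as an illustration only.

```python
import numpy as np

# One common g factor drives four hypothetical subtests, so every pairwise
# correlation comes out positive (the positive manifold), and the subtests
# with the highest loadings correlate most strongly with each other.
rng = np.random.default_rng(0)
n = 5000
g = rng.normal(size=n)                                     # general ability
loadings = {"Verbal": 0.90, "Vocabulary": 0.90, "Quantitative": 0.70, "Figural": 0.55}

scores = np.column_stack([
    lam * g + np.sqrt(1 - lam**2) * rng.normal(size=n)     # common + unique variance
    for lam in loadings.values()
])

print(list(loadings))
print(np.round(np.corrcoef(scores, rowvar=False), 2))      # all entries are positive
```

With these loadings, Verbal and Vocabulary correlate around 0.8, while Verbal and Figural correlate around 0.5, mirroring the hypothetical values above.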

Practical implications

Positive manifold and the structure of cognitive ability have historically been academic research topics, and remain so; researchers are still publishing articles like this one.  However, the concept of positive manifold has many practical implications in the real world.  It affects situations where cognitive ability testing is used to obtain information about people and make decisions about them.  Two of the most common examples are test batteries for admissions/placement and for employment.

Admissions/placement exams are used in the education sector to evaluate student ability and make decisions about the schools or courses that a student can or should enter.  Admissions refers to whether the student should be admitted to a school, such as a university or a prestigious high school; examples of this in the USA are the SAT and ACT exams.  Placement refers to assigning students to the right course, such as testing them on Math and English to determine whether they are ready for certain courses.  Both of these examples typically test the student on 3 or 4 areas, which makes them test batteries.  The SAT discusses the intercorrelations of its subtests in its technical manual (page 104).  Tests like the SAT can provide incremental validity above the predictive power of high school grade point average (HSGPA) alone, as seen in this report.

Employment testing is also often done with several cognitive tests.  You might take psychometric tests to apply for a job, and they test you on quantitative reasoning and verbal reasoning.

In both cases, the tests are validated by research showing that they predict a criterion of interest.  In the case of university admissions, this might be first-year GPA or the four-year graduation rate.  In the case of employment testing, it could be a job performance rating from a supervisor or a 1-year retention rate.

Why use multiple tests?  The goal is to capitalize on the differences between constructs to get more predictive power for the criterion.  Success in university is not due to verbal/language skills alone, but also to logical reasoning and other skills.  Test developers recognize that there is a high correlation, but the differences between the constructs can be leveraged to get more information about people.  Employment testing goes further and tries to add incremental validity by including assessments that are less related to g but still relevant to the job, such as work samples or noncognitive measures like conscientiousness.  These also correlate with job performance, and therefore help with prediction, but they correlate far less with measures of g than another cognitive test would; this adds more predictive power.
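To illustrate the incremental-validity logic, here is a minimal sketch (not from the original post) using the standard two-predictor multiple correlation formula; all of the correlation values are hypothetical.

```python
# The gain in multiple R^2 from adding a second predictor depends on how
# redundant it is with the first predictor, not just on its own validity.
def multiple_r2(r1, r2, r12):
    """R^2 for two standardized predictors with criterion validities r1 and r2
    and predictor intercorrelation r12 (standard two-predictor formula)."""
    return (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)

# One cognitive test alone with validity 0.50 gives R^2 = 0.25.
# Adding a second, highly redundant cognitive test (validity 0.50, r12 = 0.80):
print(round(multiple_r2(0.50, 0.50, 0.80), 3))   # about 0.278
# Adding a less redundant conscientiousness measure instead (validity 0.30, r12 = 0.20):
print(round(multiple_r2(0.50, 0.30, 0.20), 3))   # about 0.292
```

Even though the conscientiousness measure has a lower validity on its own, it adds more to the prediction because it overlaps less with the cognitive test.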