Posts on psychometrics: The Science of Assessment

The security and validity of online tests and exams are extremely important. The COVID-19 pandemic drastically changed every aspect of our world, and one of the most affected areas is educational assessment and other types of assessment. Many organizations were still testing with 50-year-old methodologies, such as placing 200 examinees in a large room with desks, paper exams, and a pencil. COVID-19 is forcing many organizations to pivot, which provides an opportunity to modernize assessments.

But how can we maintain assessment security, and therefore validity, through these changes? Below are some suggestions, all of which can be easily implemented on ASC's industry-leading assessment platforms. Get started by signing up for a free account at https://assess.com/assess-ai/.

True item banking with content access

Good online assessment starts with good items. While Learning Management Systems (LMS) and other platforms that are not really assessment platforms include some item authoring functionality, they usually do not meet the basic requirements for true item banking. There are best practices for item banking that are standard at large-scale assessment organizations (e.g., State Departments of Education in the US), but they are surprisingly rare for professional certification/licensure exams, universities, and other organizations. Here are some examples.

• Items are reusable (you do not need to upload them for each test in which they are used).

• Item version tracking.

• Tracking and auditing of edits made by users.

• Author content controls (math teachers can only see math items).

• Storage of metadata such as Item Response Theory (IRT) parameters and classical statistics.

• Tracking of item usage across tests.

• Item review workflow.

Role-based access

All users should be limited by roles, such as Item Author, Item Reviewer, Test Editor, and Examinee Administrator. So, for example, someone in charge of managing the roster of examinees/students might never see any exam questions.
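The idea of limiting users by role can be sketched as a simple permission lookup. This is only an illustration: the role keys and permission names below are hypothetical and are not taken from any ASC platform.

```python
# Hypothetical role-to-permission map; names are illustrative only.
ROLE_PERMISSIONS = {
    "item_author": {"view_items", "edit_items"},
    "item_reviewer": {"view_items", "comment_on_items"},
    "test_editor": {"view_items", "assemble_tests"},
    "examinee_admin": {"manage_examinees"},
}


def can(role, permission):
    """Return True if the given role includes the given permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())


# The person managing the examinee roster cannot see exam questions.
print(can("examinee_admin", "view_items"))
print(can("item_author", "view_items"))
```

Unknown roles fall back to an empty permission set, so access is denied by default.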

Data forensics

There are many ways to analyze your test results to look for possible security/validity threats. Our SIFT software provides a free software platform to help you implement this modern methodology. You can evaluate collusion indices, which quantify how similar the responses are for any pair of examinees. You can also evaluate response times, group performance, and aggregate statistics.
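To make the idea of a pairwise similarity index concrete, here is a minimal sketch that computes the proportion of identical responses for every pair of examinees. It is not the algorithm used by SIFT; real collusion indices adjust for chance agreement and for how likely each response is. The examinee IDs and response strings are made up.

```python
from itertools import combinations


def response_similarity(responses):
    """Proportion of items answered identically, for each pair of examinees."""
    result = {}
    for (id_a, resp_a), (id_b, resp_b) in combinations(responses.items(), 2):
        matches = sum(x == y for x, y in zip(resp_a, resp_b))
        result[(id_a, id_b)] = matches / len(resp_a)
    return result


# Hypothetical data: one character per item, same test length for everyone.
responses = {
    "E1": "ABCDA",
    "E2": "ABCDA",
    "E3": "BADCB",
}

for pair, similarity in response_similarity(responses).items():
    print(pair, similarity)
```

Pairs with unusually high similarity would then be flagged for a closer look, not treated as proof of collusion.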

Randomization

When tests are delivered online, you should have the option to randomize the order of items and also the order of answers. When printing to paper, there should also be an option to randomize the order. But of course, you are much more limited in this respect when using paper.
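A minimal sketch of both kinds of randomization follows. The item structure (`id`, `stem`, `options`, `key`) is an assumption made for illustration; the key is stored as the text of the correct option so that shuffling the answer order does not break scoring.

```python
import random


def randomize_form(items, seed=None):
    """Return a new list with both item order and answer order shuffled."""
    rng = random.Random(seed)
    shuffled = []
    for item in items:
        options = list(item["options"])  # copy so the bank itself is untouched
        rng.shuffle(options)
        shuffled.append({**item, "options": options})
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical items for illustration.
items = [
    {"id": "Q1", "stem": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "key": "4"},
    {"id": "Q2", "stem": "3 x 3 = ?", "options": ["6", "8", "9", "12"], "key": "9"},
    {"id": "Q3", "stem": "10 / 2 = ?", "options": ["2", "4", "5", "20"], "key": "5"},
]

for item in randomize_form(items, seed=42):
    print(item["id"], item["options"])
```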

Linear on-the-fly testing (LOFT)

LOFT creates a unique randomized test for each examinee. For example, you might have a pool of 300 items spread across 4 domains, and each examinee receives 100 items with 25 from each domain. This greatly increases security.
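The 300-item example can be sketched as stratified random sampling. This is a simplified illustration: it assumes the pool is split evenly (75 items per domain) and balances only on content, whereas operational LOFT typically also balances statistical properties of the form. The domain labels and item IDs are hypothetical.

```python
import random
from collections import Counter


def assemble_loft_form(pool, per_domain, seed=None):
    """Draw a random form with a fixed number of items from each domain."""
    rng = random.Random(seed)
    form = []
    for domain, n_items in per_domain.items():
        candidates = [item for item in pool if item["domain"] == domain]
        form.extend(rng.sample(candidates, n_items))  # sampling without replacement
    rng.shuffle(form)  # mix the domains together
    return form


# Hypothetical pool: 4 domains x 75 items = 300 items.
domains = ["A", "B", "C", "D"]
pool = [{"id": f"{d}{i:03d}", "domain": d} for d in domains for i in range(75)]

form = assemble_loft_form(pool, {d: 25 for d in domains}, seed=1)
print(len(form), Counter(item["domain"] for item in form))
```

Calling the function with a different seed (or none) for each examinee yields a different 100-item form from the same pool.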

Computerized adaptive testing (CAT)

CAT takes personalization even further and adapts the difficulty of the exam and the number of items each examinee sees, based on certain psychometric algorithms and goals. This makes the test extremely secure.

Lockdown browser

Want to make sure the examinee cannot browse for answers or take screenshots of items? You need a lockdown browser. ASC's assessment platforms, Assess.ai and FastTest, come with this out of the box and at no additional cost.

Examinee test codes

Want to make sure the right person takes the right exam? Generate unique one-time passcodes to be given out by a proctor after identity verification. This is especially useful in remote proctoring; the student never receives any information before the exam about how to enter, other than how to start the virtual proctoring session. Once the proctor verifies the examinee's identity, they provide the unique one-time passcode.
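As a rough sketch of how one-time passcodes could work, the following uses Python's standard `secrets` module. The `PasscodeRegistry` class, the code length, and the alphabet are assumptions for illustration, not a description of how any ASC platform implements this.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits


def generate_passcode(length=8):
    """Generate a random passcode from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


class PasscodeRegistry:
    """Hypothetical in-memory store of unused one-time passcodes."""

    def __init__(self):
        self._unused = {}

    def issue(self, examinee_id):
        code = generate_passcode()
        self._unused[examinee_id] = code
        return code

    def redeem(self, examinee_id, code):
        expected = self._unused.get(examinee_id)
        if expected is None:
            return False
        # Constant-time comparison; bytes avoid errors on non-ASCII input.
        if secrets.compare_digest(expected.encode(), code.encode()):
            del self._unused[examinee_id]  # a code can be used only once
            return True
        return False


registry = PasscodeRegistry()
code = registry.issue("examinee-001")
print(registry.redeem("examinee-001", code))  # first use succeeds
print(registry.redeem("examinee-001", code))  # second use is rejected
```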

Proctor codes

Want an extra step in the test start procedure? Once a student's identity is verified and they enter their code, the proctor must also enter a different passcode that is unique to that proctor on that day.

Date/time windows

Want to prevent examinees from logging in early or late? Set up a specific time window, such as Friday from 9 am to 12 pm.
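A delivery window boils down to a simple comparison of timestamps. The sketch below assumes naive local datetimes and an arbitrary example date; a production system would also need to handle time zones.

```python
from datetime import datetime


def within_window(now, start, end):
    """Return True if `now` falls inside the delivery window (inclusive)."""
    return start <= now <= end


# Hypothetical window: a Friday morning, 9:00 to 12:00 noon.
window_start = datetime(2024, 5, 3, 9, 0)
window_end = datetime(2024, 5, 3, 12, 0)

print(within_window(datetime(2024, 5, 3, 8, 45), window_start, window_end))   # too early
print(within_window(datetime(2024, 5, 3, 10, 30), window_start, window_end))  # inside
print(within_window(datetime(2024, 5, 3, 12, 15), window_start, window_end))  # too late
```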

AI-based proctoring

This level of proctoring is relatively inexpensive and does a great job of validating the results of an individual examinee. However, it does not protect the intellectual property of your exam questions. If an examinee steals all the questions, you will not know right away. It is therefore very useful for low- or mid-stakes exams, but not as useful for high-stakes exams such as certification or licensure. Learn more about our remote proctoring options. I also recommend this blog post for an overview of the remote proctoring industry.

Live online proctoring

If you cannot use in-person test centers because of COVID, this is the next best option. Live proctors can check in the candidate, verify identity, and implement all the other measures above. In addition, they can check the examinee's environment and stop the exam if they see the examinee stealing questions or other major issues. MonitorEDU is a great example of this.

How can I get started?

Need help implementing some of these measures? Or just want to talk about the possibilities? Email ASC at solutions@assess.com.

Online proctoring has been around for more than a decade. But given the recent COVID-19 outbreak, educational and workforce/certification institutions are scrambling to change their operations, and a big part of this is an incredible increase in online proctoring. This blog post is intended to provide an overview of the online proctoring industry for someone who is new to the topic or is starting to shop and is overwhelmed by all the options out there.

Online Proctoring: Two Distinct Markets

First, I would describe the online proctoring industry as falling into two distinct markets, so the first step is to determine which of them fits your organization.

1. Larger-scale, lower-cost (when at scale), lower-security systems designed to be used only as a plug-in to major LMS platforms like Blackboard or Canvas. These online proctoring systems are therefore designed for medium-stakes exams, such as an Intro to Psychology midterm at a university.

2. Smaller-scale, higher-cost, higher-security systems designed to be used with standalone assessment platforms. These are generally for higher-stakes exams such as certification or workforce, or perhaps for special use at universities such as Admissions and Placement exams.

How do you tell the difference? The first type will advertise easy integration with systems like Blackboard or Canvas as a key feature. They will also often focus on AI review of videos rather than using live humans. Another key consideration is to look at the existing client base, which is usually advertised.

Other ways that online proctoring systems can differ

AI vs. humans: Some systems rely exclusively on artificial intelligence algorithms to flag video recordings of examinees. Other systems use real humans.

Record and Review vs. Live Humans: If humans are used, there are two approaches. First, it can be live and in real time, meaning there is a human on the other end of the video who can confirm identity before allowing the test to start and stop the test if there is illicit activity. Record and Review records the session, and a human checks it within 24 to 48 hours. This is more flexible, but you cannot stop the test if someone is stealing the content; you probably will not know until the next day.

Screen capture: Some online proctoring providers have the option to record/stream the screen as well as the webcam. Some also offer the option to do only this (no webcam) for lower-stakes exams.

Mobile phone as a third camera: Some newer platforms offer the option to easily integrate the examinee's mobile phone as a third camera, which effectively operates as a human proctor. Examinees are instructed to use the video to show under the table, behind the monitor, and so on, before starting the exam. They can then be instructed to place the phone 2 meters away with a clear view of the whole room while the test is being taken.

Using your own proctors: Some online proctoring systems allow you to use your own staff as proctors, which is especially useful if the test is delivered within a short time window. If you deliver continuously 24×7 all year round, you will probably want to use the vendor's highly trained staff.

API integrations: Some systems require software developers to set up an API integration with your LMS or assessment platform. Others are more flexible: you can log in yourself, upload a list of examinees, and be ready for the test.

On-demand vs. Scheduled: Some platforms require a time slot to be scheduled for examinees to take the test. Others are purely on-demand, and the examinee can show up whenever they are ready. MonitorEDU is an excellent example of this: examinees show up at any time, present their ID to a live human, and then start the test immediately, with no downloads/installs, no system checks, no API integrations, nothing.

More Security: A Better Test Delivery System

A good test delivery platform will also come with its own functionality to enhance test security: randomization, automated item generation, computerized adaptive testing, linear on-the-fly testing, professional item banking, item response theory scoring, scaled scoring, psychometric analytics, equating, lockdown delivery, and more. In the context of online proctoring, perhaps the most salient is lockdown delivery. In this case, the test completely takes over the examinee's computer, and they cannot use it for anything else until the test is done.

LMS systems rarely include this functionality, because it is not needed for an Intro to Psychology midterm. However, most assessments in the world have much more at stake (university admissions, certifications, workforce hiring, etc.), and these tests depend heavily on such functionality. Nor is it just a habit or a tradition. Such methods are considered essential under international standards, including AERA/APA/NCME, ITC, and NCCA.

ASC's Online Proctoring Partners

ASC provides its clients with a turnkey solution because it partners with some of the leaders in the space. These include MonitorEDU, ProctorExam, Examity, and Proctor360. Learn more on our webpage about that functionality and on another page that explains the concept of configurable test security.

Translated from the blog post written by Dr. Nathan Thompson.

Nathan Thompson earned his PhD in Psychometrics from the University of Minnesota, with a focus on computerized adaptive testing. His undergraduate degree was from Luther College with a triple major in Mathematics, Psychology, and Latin. He is primarily interested in the use of AI and software automation to augment and replace the work done by psychometricians, which has given him extensive experience in software design and programming. Dr. Thompson has published over 100 journal articles and conference presentations, but his favorite remains https://pareonline.net/getvn.asp?v=16&n=1.

Test information function

The IRT Test Information Function is a concept from item response theory (IRT) that is designed to evaluate how well an assessment differentiates examinees, and at what ranges of ability. For example, we might expect an exam composed of difficult items to do a great job in differentiating top examinees, but it is worthless for the lower half of examinees because they will be so confused and lost.

The reverse is true of an easy test; it doesn’t do any good for top examinees. The test information function quantifies this and has a lot of other important applications and interpretations.

IRT Test Information Function: how to calculate it

The test information function is not something you can calculate by hand. First, you need to estimate item-level IRT parameters, which define the item response function. The only way to do this is with specialized software; there are a few options on the market, but we recommend Xcalibre.

Next, the item response function is converted to an item information function for each item. The item information functions can then be summed into a test information function. Lastly, the test information function is often inverted into the conditional standard error of measurement function, which is extremely useful in test design and evaluation.

IRT Item Parameters

Software like Xcalibre will estimate a set of item parameters. The parameters you use depend on the item types and other aspects of your assessment.

For example, let’s just use the 3-parameter model, which estimates a, b, and c. And we’ll use a small test of 5 items. These are ordered by difficulty: Item 1 is very easy and Item 5 is very hard.

Item a b c
1 1.00 -2.00 0.20
2 0.70 -1.00 0.40
3 0.40 0.00 0.30
4 0.80 1.00 0.00
5 1.20 2.00 0.25

 

Item Response Function

The item response function uses the IRT equation to convert the parameters into a curve. The purpose of the item parameters is to fit this curve for each item, like a regression model to describe how it performs.
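For readers who want to see the curve as a formula, here is a minimal sketch of the 3-parameter logistic item response function applied to the five items in the table above. The scaling constant D = 1.7 is an assumption; some software uses D = 1.0, so check the convention of whatever program produced your parameters.

```python
import math

# (a, b, c) for Items 1-5, copied from the table above.
ITEMS = {
    1: (1.00, -2.00, 0.20),
    2: (0.70, -1.00, 0.40),
    3: (0.40, 0.00, 0.30),
    4: (0.80, 1.00, 0.00),
    5: (1.20, 2.00, 0.25),
}


def irf(theta, a, b, c, D=1.7):
    """3PL probability of a correct response at ability level theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Tabulate each item's curve at a few points on the theta scale.
for item, (a, b, c) in ITEMS.items():
    row = "  ".join(f"{irf(theta, a, b, c):.2f}" for theta in (-3, -2, -1, 0, 1, 2, 3))
    print(f"Item {item}: {row}")
```

Each curve rises from its lower asymptote c toward 1.0, and at theta = b the probability is halfway between c and 1.0.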

Here are the response functions for those 5 items. Note the scale on the x-axis, similar to the bell curve, with the easy items to the left and hard ones to the right.

[Figure: item response functions for the five items]

 

Item Information Function

The item information function is based on the slope (the calculus derivative) of the item response function. An item provides more information about examinees where its curve is steeper.

For example, consider Item 5: it is difficult, so it is not very useful for examinees in the bottom half of ability. The slope of the Item 5 IRF is then nearly 0 for that entire range. This then means that its information function is nearly 0.
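The Item 5 example can be checked numerically. The sketch below uses the standard 3PL item information formula, again assuming a scaling constant of D = 1.7, which is not stated in the post.

```python
import math


def irf(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (D = 1.7 is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c, D=1.7):
    """3PL item information: (Da)^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = irf(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


ITEM_5 = (1.20, 2.00, 0.25)  # a, b, c from the table above

# Information is close to zero across the bottom half of the ability range
# and rises sharply near the item's difficulty of 2.0.
for theta in (-3, -2, -1, 0, 1, 2, 3):
    print(f"theta={theta:+d}  information={item_information(theta, *ITEM_5):.4f}")
```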

[Figure: item information functions for the five items]

 

Test Information Function

The test information function then sums up the item information functions to summarize where the test is providing information. If you imagine adding the graphs above, you can easily imagine some humps near the top and bottom of the range where there are the prominent IIFs. 

[Figure: test information function]

 

Conditional Standard Error of Measurement Function

The test information function can be inverted into an estimate of the conditional standard error of measurement: at any given ability level, the CSEM is 1 divided by the square root of the test information at that level. What do we mean by conditional? If you are familiar with classical test theory, you know that it estimates the same standard error of measurement for everyone that takes a test.

But given the reasonable concepts above, it is incredibly unreasonable to expect this. If a test has only difficult items, then it measures top students well, and does not measure lower students well, so why should we say that their scores are just as accurate? The conditional standard error of measurement turns this into a function of ability.

Also, note that it refers to the theta scale and not to the number-correct scale.
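Putting the pieces together, the following sketch sums the item information functions of the five example items into a test information function and converts it to the CSEM as 1/sqrt(TIF). As before, D = 1.7 is an assumption, and the function names are just illustrative.

```python
import math

# (a, b, c) for Items 1-5 from the example table.
ITEMS = [
    (1.00, -2.00, 0.20),
    (0.70, -1.00, 0.40),
    (0.40, 0.00, 0.30),
    (0.80, 1.00, 0.00),
    (1.20, 2.00, 0.25),
]


def irf(theta, a, b, c, D=1.7):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c, D=1.7):
    """3PL item information at theta."""
    p = irf(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def tif(theta, items=ITEMS):
    """Test information: the sum of the item information functions."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def csem(theta, items=ITEMS):
    """Conditional standard error of measurement on the theta scale."""
    return 1.0 / math.sqrt(tif(theta, items))


for theta in (-3, -2, -1, 0, 1, 2, 3):
    print(f"theta={theta:+d}  TIF={tif(theta):.3f}  CSEM={csem(theta):.3f}")
```

With only five items the CSEM values are large everywhere; the point is the shape, with less error where the test has more information.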

[Figure: conditional standard error of measurement function]

 

How can I implement all this?

For starters, I recommend delving deeper into an item response theory book. My favorite is Item Response Theory for Psychologists by Embretson and Reise. Next, you need some item response theory software.

Xcalibre can be downloaded as a free version for learning and is the easiest program to learn how to use (no 1980s-style command code… how is that still a thing?). But if you are an R fan, there are plenty of resources in that community as well.

Tell me again: why are we doing this?

The purpose of all this is to effectively model how items and tests work, namely, how they interact with examinees. This then allows us to evaluate their performance so that we can improve them, thereby enhancing reliability and validity.

Classical test theory had a lot of shortcomings in this endeavor, which led to IRT being invented. IRT also facilitates some modern approaches to assessment, such as linear on-the-fly testing, adaptive testing, and multistage testing.

conditional standard error of measurement

The standard error of measurement (SEM) is one of the core concepts in psychometrics.  One of the primary assumptions of any assessment is that it is accurately and consistently measuring whatever it is we want to measure.  We, therefore, need to demonstrate that it is doing so.  There are a number of ways of quantifying this, and one of the most common is the SEM.

The SEM can be used in both the classical test theory (CTT) perspective and the item response theory (IRT) perspective, though it is defined quite differently in each.

 

What is measurement error?

We can all agree that assessments are not perfect, from a 4th grade math quiz to a Psych 101 exam at university to a driver’s license test.  Suppose you got 80% on an exam today.  If we wiped your brain clean and you took the exam tomorrow, what score would you get?  Probably a little higher or lower.  Psychometricians consider you to have a true score which is what would happen if the test was perfect, you had no interruptions or distractions, and everything else fell into place.  But in reality, you, of course, do not get that score each time.  So psychometricians try to estimate the error in your score, and use this in various ways to improve the assessment and how scores are used.

 

The Standard Error of Measurement in Classical Test Theory

In CTT, it is defined as

SEM = SD*sqrt(1-r),

where SD is the standard deviation of scores for everyone who took the test, and r is the reliability of the test.  It is interpreted as the standard deviation of scores that you would find if you had the person take the test over and over, with a fresh mind each time.  A confidence interval with this is then interpreted as the band where you would expect the person’s true score on the test to fall.
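As a quick worked example of the formula, with made-up values for the standard deviation and reliability:

```python
import math


def classical_sem(sd, reliability):
    """Classical SEM = SD * sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical test: SD of 10 score points and reliability of 0.91.
sem = classical_sem(sd=10.0, reliability=0.91)
print(round(sem, 2))  # about 3 score points
```

Note how the SEM shrinks as reliability approaches 1 and equals the full SD when reliability is 0.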

This has some conceptual disadvantages.  For one, it assumes that SEM is the same for all examinees, which is unrealistic.  The interpretation focuses only on this single test form rather than the accuracy of measuring someone’s true standing on the trait.  Moreover, it does not utilize the examinee’s responses in any way.  Lord (1984) suggested a conditional standard error of measurement based on classical test theory, but it focuses on the error of the examinee taking the same test again, rather than the measurement of the true latent value as is done with IRT below.

The classical SEM is reported in Iteman for each subscore, the total score, score on scored items only, and score on pretest items.

Item Response Theory: Conditional Standard Error of Measurement 

Early researchers realized that this assumption is unreasonable.  Suppose that a test has a lot of easy questions.  It will therefore measure low-ability examinees quite well.  Imagine that it is a Math placement exam for university, and has a lot of Geometry and Algebra questions at a high school level.  It will measure students well who are at that level, but do a very poor job of measuring top students.  In an extreme case, let’s say the top 20% of students get every item correct, and there is no way to differentiate them; that defeats the purpose of the test.

The weaknesses of the classical SEM are one of the reasons that IRT was developed.  IRT conceptualizes the SEM as a continuous function across the range of student ability, which is an inversion of the test information function (TIF).  A test form will have more accuracy – less error – in a range of ability where there are more items or items of higher quality.  That is, a test with most items of middle difficulty will produce accurate scores in the middle of the range, but not measure students on the top or bottom very well.  

An example of this is shown below.  On the right is the conditional standard error of measurement function, and on the left is its inverse, the test information function.  Clearly, this test has a lot of items around -1.0 on the theta spectrum, which is around the 15th percentile.  Students above 1.0 (85th percentile) are not being measured well.

[Figure: test information function (left) and conditional standard error of measurement function (right)]

This is actually only the predicted SEM based on all the items in a test/pool.  The observed SEM can differ for each examinee based on the items that they answered, and which ones they answered correctly.  If you want to calculate the IRT SEM on a test of yours, you need to download Xcalibre and implement a full IRT calibration study.

How is CSEM used?

A useful way to think about conditional standard error of measurement is with confidence intervals.  Suppose your score on a test is 0.5 with item response theory.  If the CSEM is 0.25 (see above) then we can get a 95% confidence interval by taking plus or minus 2 standard errors.  This means that we are 95% certain that your true score lies between 0.0 and 1.0.  For a theta of 2.5 with a CSEM of 0.5, that band is then 1.5 to 3.5, which might seem wide, but remember that it runs from roughly the 93rd percentile to beyond the 99th percentile.
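The plus-or-minus-two-standard-errors rule is easy to script. The sketch below also converts the band to percentiles, under the assumption that theta follows a standard normal distribution.

```python
from statistics import NormalDist


def theta_interval(theta, csem, z=2.0):
    """Approximate 95% band: theta plus or minus z standard errors."""
    return theta - z * csem, theta + z * csem


def percentile(theta):
    """Percentile rank of theta, assuming a standard normal distribution."""
    return 100.0 * NormalDist().cdf(theta)


for theta, csem in [(0.5, 0.25), (2.5, 0.5)]:
    low, high = theta_interval(theta, csem)
    print(f"theta={theta}: {low} to {high} "
          f"(percentiles {percentile(low):.1f} to {percentile(high):.1f})")
```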

You will sometimes see scores reported in this manner.  I once saw a report on an IQ test that did not give a single score, but instead said “we can expect that 9 times out of 10 that you would score between X and Y.”

There are various ways to use the CSEM and related functions in the design of tests, including the assembly of parallel linear forms and the development of computerized adaptive tests. To learn more about this, I recommend you delve into a book on IRT, such as Embretson and Reise (2000).  That’s more than I can cover here.

question bank

What is a question bank? A question bank refers to a pool of test questions to be used on various assessments across time.  For example, a Certified Widgetmaker Exam might have a pool of 500 questions developed over the past 10 years. Suppose the exam is delivered in June and December of every year, and each time 150 questions are used. This strong pool of items allows the organization to easily select questions and publish a new form of the exam each time, maintaining security and validity.

A question bank is more commonly called an item bank. The term ‘question’ is avoided because many assessment items are not actually questions; they might be statements, vignettes, simulations, or many things other than the traditional question-and-4-answers. It is important to regularly review the item bank to identify and address any ‘enemy items,’ which are items that should not appear on the same form (for example, because one gives away the answer to another) and can therefore harm the test’s reliability and fairness.

What goes into a question bank?  Metadata.

A question bank is actually much more than the questions themselves. If you ran the Certified Widgetmaker Exam, you would want to keep track of some additional important information. This is all based on the concept of treating the question as a reusable object; if you use the item 4 times, you should never need to type/upload it 4 times. It should be in the system only once, with all its associated metadata!

What to track | Examples
Which exam forms used each question | Dec 2017, May 2018, May 2019, Dec 2020
Unique item ID | Math.Algebra.078
Source/Reference | Wilson (2016) p. 123
Status | New, Under Review, Active, Retired
Statistics | Classical difficulty and discrimination; item response theory parameters
Reviewer comments | Jake Smith 2020/11/22: “I think that D is arguably correct, and we need to provide greater detail in the stem.”
Content area, domain, blueprint | Math / Algebra / Quadratic
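To illustrate the idea of an item as a reusable object that carries its metadata, here is a small sketch using a Python dataclass. The class and field names are hypothetical and only mirror the table above; they do not describe the data model of FastTest or any other product.

```python
from dataclasses import dataclass, field


@dataclass
class BankItem:
    """One reusable item plus the metadata tracked alongside it."""
    item_id: str
    content_area: str
    source: str = ""
    status: str = "New"  # New, Under Review, Active, Retired
    forms_used: list = field(default_factory=list)
    statistics: dict = field(default_factory=dict)
    reviewer_comments: list = field(default_factory=list)

    def record_use(self, form_name):
        """Log that a form used this item, without duplicating the item."""
        if form_name not in self.forms_used:
            self.forms_used.append(form_name)


item = BankItem(
    item_id="Math.Algebra.078",
    content_area="Math / Algebra / Quadratic",
    source="Wilson (2016) p. 123",
    status="Active",
)
for form in ["Dec 2017", "May 2018", "May 2019", "Dec 2020"]:
    item.record_use(form)

print(item.item_id, item.forms_used)
```

The item exists once in the bank, and each form that uses it is just another entry in `forms_used`.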

 

The Solution: Question Banking Software

As you can see, there’s actually quite a bit of functionality and data that goes into a true question bank system. And this is only regarding the questions themselves – it doesn’t get into additional topics such as media file management, Workflow Management, Automated Item Generation, or Test Assembly & Publishing. A professional question banking software system will have much, much more than just a way to store the questions.  FastTest provides a powerful alternative solution to some older platforms on the market.

Looking for a deeper treatment of the topic? Check out the chapter Computerized Item Banking by ASC’s cofounder, C. David Vale, in the 2006 Handbook of Test Development.

Want to learn more about how question banking software can help your organization? Check out our other posts on the topic, or fill out our contact form for a demonstration.

 

psychometrician psychometrist

Psychometrists fill an important role within the world of assessment and psychology.  Their primary role is to deliver and interpret assessments, typically the sorts of assessments that are delivered in a one-on-one clinical situation.  For example, they might give IQ tests to kids to identify those who qualify as Gifted, then explain the results to parents and teachers.  Obviously, there are many assessments which do not require one-on-one in-person delivery like this; psychometrists are unique in that they are trained on how to deliver these complex types of assessments.  This post will describe more about the role of a psychometrist.

What is a Psychometrist?

A psychometrist is someone involved in the use and administration of assessments, and in most cases is working in the field of psychological testing. This is someone who uses tests every day and is familiar with how to administer such tests (especially complex ones like IQ) and interpret their results to provide feedback to individuals. Some have doctoral degrees as a clinical/counseling psychologist and have extensive expertise in that role; for example, the use of an Autism-spectrum screening test to effectively diagnose patients and develop individualized plans.

Consider the following definition from the National Association of Psychometrists:

A psychometrist is responsible for the administration and scoring of psychological and neuropsychological tests under the supervision of a clinical psychologist or clinical neuropsychologist. 

Source: https://www.napnet.org/what

 

Where do psychometrists work?

The vast majority of psychometrists work in a clinical setting.  One might work in an Autism center.  One might be at a psychiatric hospital.  One might be at a neurological clinic.  Some school psychologists also perform this work, working directly in schools.  In all cases, they are working directly with the examinee (patient, student, etc.).

Psychometrist Training and Certification

Psychometrists have at least a Bachelor’s degree in psychology or a related field, and often a Master’s.  There is typically a clinical training component.  Learn more at the National Association of Psychometrists.

There is a specific certification for psychometrists, offered by the Board of Certified Psychometrists.  This involves passing a certification exam of 120 questions over 2.5 hours; the test is professionally designed and administered to meet best practices for credentialing exams.

Career Opportunities for Psychometrists

Psychometrists have excellent career prospects, given the general shortage of healthcare personnel.  However, as their training is much less extensive than that of doctoral-level roles like a psychologist or psychiatrist, the pay rate is far lower.

Psychometrist vs. Related Roles

One misconception that I often see on the internet concerns the distinction, or lack thereof, between the related job titles.  Some professionals are only involved with the engineering of assessments, usually not even in the field of psychology.  They do not work with patients.  Others work with patients but focus on counseling rather than assessment.  The most flagrant offender, curiously, is Google. Like most companies, we utilize AdWords, and find that some job titles and terms are treated interchangeably when they are not related.

A psychometrist usually works under the direction of a psychiatrist or psychologist, though sometimes a psychologist serves as their own psychometrist.  For example, a psychologist at a mental health clinic is in charge of screening patients and treating them, but might have staff to deliver psychological assessments.  But a psychologist in a school might not have staff for that, and also delivers IQ tests to students.

For clarification, here is a comparison of related job titles:

 

Aspect | Psychometrist | Psychometrician | Psychologist | Psychiatrist
How are they involved with assessment? | Administration & interpretation | Engineering & validation | Patient treatment | Medical treatment
Education | Bachelor’s/Master’s in Psychology (often Counseling) | PhD in Psychometrics, Psychology, or Education | PhD in Psychology (often Counseling or Clinical) | MD (Doctor of Medicine or Osteopathy)
Quantitative skills | Interpreting scores with summary statistics (mean, standard deviation, z-scores, correlations) | Complex analyses like item response theory or factor analysis; complex designs such as adaptive testing | Quantitative research outside of assessment, such as comparing treatment methods | Some training, but primary purpose is patient care
Soft skills | Works extensively with patients and students, often in a counseling role, and can be highly trained on those aspects | Often a pure data analyst, but some work with expert panels for topics like job analysis or Angoff studies; never with patients or students | Works extensively with patients and students, often in a counseling role, and can be highly trained on those aspects | Works extensively with patients and students, often in a counseling role, and can be highly trained on those aspects
Example | Staff in a clinic that delivers IQ and other assessments to patients | Researcher involved in designing high-stakes exams such as medical certification or university admissions | Clinical therapist in private practice | Supervisory staff in a clinic or inpatient facility that treats patients

 

Need help in designing an assessment?  Contact us.


One of the core concepts in psychometrics is item difficulty.  This refers to the probability that examinees will get the item correct for educational/cognitive assessments or respond in the keyed direction with psychological/survey assessments (more on that later).  Difficulty is important for evaluating the characteristics of an item and whether it should continue to be part of the assessment; in many cases, items are deleted if they are too easy or too hard.  It also allows us to better understand how the items and test as a whole operate as a measurement instrument, and what they can tell us about examinees.

I’ve heard of “item facility.” Is that similar?

Item difficulty is also called item facility, which is actually a more appropriate name.  Why?  The P value is a reverse of the concept: a low value indicates high difficulty, and vice versa.  If we think of the concept as facility or easiness, then the P value aligns with the concept; a high value means high easiness.  Of course, it’s hard to break with tradition, and almost everyone still calls it difficulty.  But it might help you here to think of it as “easiness.”

How do we calculate classical item difficulty?

There are two predominant paradigms in psychometrics: classical test theory (CTT) and item response theory (IRT).  Here, I will just focus on the simpler approach, CTT.

To calculate classical item difficulty with dichotomous items, you simply count the number of examinees that responded correctly (or in the keyed direction) and divide by the number of respondents. This gets you a proportion, the P value, which is like a percentage but on a scale of 0 to 1 rather than 0 to 100. Consider this data set.

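| Person | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Score |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| 3 | 0 | 0 | 0 | 1 | 1 | 1 | 3 |
| 4 | 0 | 0 | 1 | 1 | 1 | 1 | 4 |
| 5 | 0 | 1 | 1 | 1 | 1 | 1 | 5 |
| Diff: | 0.00 | 0.20 | 0.40 | 0.60 | 0.80 | 1.00 | |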

Item6 has a high difficulty index, meaning that it is very easy. Item4 and Item5 are typical items, where the majority of examinees are responding correctly. Item1 is extremely difficult; no one got it right!
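The calculation above can be sketched in a few lines of Python. This is only a minimal illustration using the example data set, not the method of any particular software package; the names `responses` and `item_p_values` are made up for this sketch.

```python
# Rows are examinees, columns are Item1..Item6 (1 = correct, 0 = incorrect).
responses = [
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1],
]


def item_p_values(matrix):
    """Return the proportion of examinees answering each item correctly."""
    n_examinees = len(matrix)
    # zip(*matrix) transposes rows (examinees) into columns (items).
    return [sum(column) / n_examinees for column in zip(*matrix)]


p_values = item_p_values(responses)
scores = [sum(row) for row in responses]  # number-correct score per examinee

print([f"{p:.2f}" for p in p_values])
print(scores)
```

The formatted P values should match the "Diff" row of the table, and the scores should match its "Score" column.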

For polytomous items (items worth more than one point), classical item difficulty is the mean response value. That is, if we have a 5-point Likert item, and two people respond 4 and two respond 5, then the average is 4.5. This, of course, is mathematically equivalent to the P value if the points are 0 and 1 for a no/yes item. An example of this situation is this data set:

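| Person | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Score |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 3 | 4 | 5 | 16 |
| 2 | 1 | 2 | 2 | 4 | 4 | 5 | 18 |
| 3 | 1 | 2 | 3 | 4 | 4 | 5 | 19 |
| 4 | 1 | 2 | 3 | 4 | 4 | 5 | 19 |
| 5 | 1 | 2 | 3 | 5 | 4 | 5 | 20 |
| Diff: | 1.00 | 1.80 | 2.60 | 4.00 | 4.00 | 5.00 | |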
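As a rough sketch, the item means for this polytomous data set can be reproduced with `statistics.mean` from the Python standard library. The variable names are illustrative only.

```python
from statistics import mean

# Rows are examinees, columns are Item1..Item6 (responses on a 1-5 scale).
likert = [
    [1, 1, 2, 3, 4, 5],
    [1, 2, 2, 4, 4, 5],
    [1, 2, 3, 4, 4, 5],
    [1, 2, 3, 4, 4, 5],
    [1, 2, 3, 5, 4, 5],
]

# Classical difficulty of a polytomous item is its mean response.
item_means = [mean(column) for column in zip(*likert)]
print([f"{m:.2f}" for m in item_means])

# With 0/1 scoring, the same mean is simply the P value.
yes_no_item = [0, 1, 1, 1, 1]
print(f"{mean(yes_no_item):.2f}")
```

The last two lines illustrate the equivalence mentioned above: the mean of a 0/1 item is the proportion of examinees who scored 1.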

Note that this approach to calculating difficulty is sample-dependent. If we had a different sample of people, the statistics could be quite different. This is one of the primary drawbacks to classical test theory. Item response theory tackles that issue with a different paradigm. It also has an index with the right “direction”: high values mean high difficulty with IRT.

If you are working with multiple choice items, remember that while you might have 4 or 5 responses, you are still scoring the items as right/wrong.  Therefore, the data ends up being dichotomous 0/1.

Very important final note: this P value is NOT to be confused with the p value from the world of hypothesis testing. They have the same name, but otherwise are completely unrelated. For this reason, some psychometricians call it P+ (pronounced “P-plus”), but that hasn’t caught on.

How do I interpret classical item difficulty?

For educational/cognitive assessments, difficulty refers to the probability that examinees will get the item correct.  If more examinees get the item correct, it has low difficulty.  For psychological/survey type data, difficulty refers to the probability of responding in the keyed direction.  That is, if you are assessing Extraversion, and the item is “I like to go to parties” then you are evaluating how many examinees agreed with the statement.

What is unique with survey type data is that it often includes reverse-keying; the same assessment might also have an item that is “I prefer to spend time with books rather than people” and an examinee disagreeing with that statement counts as a point towards the total score.
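Reverse-keying can be sketched as follows. This is a hypothetical two-item Extraversion scale built from the two example statements; the item labels, the agree/disagree coding, and the three response patterns are all invented for illustration.

```python
# True means agreeing is the keyed direction; False means the item is
# reverse-keyed, so disagreeing earns the point. (Hypothetical key.)
keyed_agree = {"parties": True, "books": False}


def score_item(item, agreed):
    """Return 1 if the response is in the keyed direction, else 0."""
    return int(agreed == keyed_agree[item])


# Hypothetical responses: did each examinee agree with each statement?
responses = [
    {"parties": True, "books": False},
    {"parties": True, "books": True},
    {"parties": False, "books": True},
]

scored = [
    {item: score_item(item, agreed) for item, agreed in person.items()}
    for person in responses
]
totals = [sum(person.values()) for person in scored]

# Classical difficulty: proportion responding in the keyed direction.
difficulty = {
    item: sum(person[item] for person in scored) / len(scored)
    for item in keyed_agree
}
print(totals)
print(difficulty)
```

Once every item is scored in the keyed direction, the difficulty calculation is the same as for right/wrong items.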

For the stereotypical educational/knowledge assessment, with 4 or 5 option multiple choice items, we use general guidelines like this for interpretation.

| Range | Interpretation | Notes |
|---|---|---|
| 0.0-0.3 | Extremely difficult | Examinees are at chance level or even below, so your item might be miskeyed or have other issues |
| 0.3-0.5 | Very difficult | Items in this range will challenge even top examinees, and therefore might elicit complaints, but are typically very strong |
| 0.5-0.7 | Moderately difficult | These items are fairly common, and a little on the tougher side |
| 0.7-0.9 | Moderately easy | These are the most common range of items on most classically built tests; easy enough that examinees rarely complain |
| 0.9-1.0 | Very easy | These items are mastered by most examinees; they are actually too easy to provide much info on examinees though, and can be detrimental to reliability |
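These guidelines could be applied programmatically along the following lines. The table's ranges share their endpoints, so this sketch assumes each lower bound is inclusive and each upper bound exclusive (except 1.0); that boundary rule and the function name `interpret_p` are assumptions, not part of the guidelines themselves.

```python
def interpret_p(p):
    """Map a classical P value to the guideline labels in the table.

    Assumption: each band includes its lower bound and excludes its upper
    bound, except the last band, which includes 1.0.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("P value must be between 0 and 1")
    if p < 0.3:
        return "Extremely difficult"
    if p < 0.5:
        return "Very difficult"
    if p < 0.7:
        return "Moderately difficult"
    if p < 0.9:
        return "Moderately easy"
    return "Very easy"


# P values from the dichotomous example data set earlier in the post.
for p in [0.00, 0.20, 0.40, 0.60, 0.80, 1.00]:
    print(f"{p:.2f}", interpret_p(p))
```

Labels like these are a screening aid; an item flagged as extremely difficult still needs a human check for miskeying or other problems.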

 

Do I need to calculate this all myself?

No. There is plenty of software to do it for you. If you are new to psychometrics, I recommend CITAS, which is designed to get you up and running quickly but is too simple for advanced situations. If you have large samples or are involved with production-level work, you need Iteman; if that is you, I also recommend that you look into learning IRT if you have not yet. You can sign up for a free account.


The foundation of a decent assessment program is the ability to develop and manage strong item banks. An item bank is a central repository of test questions, each stored with important metadata such as Author or Difficulty. Item banks are designed to treat items as reusable objects, which makes it easier to publish new exam forms.

Of course, the storage of metadata is very useful as well, and provides documentation that serves as validity evidence. Most importantly, a true item banking system will make the process of developing new items more efficient (lower cost) and effective (higher quality).
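To make the idea of items as reusable objects concrete, here is a minimal sketch of an item bank as a data structure. It is purely illustrative and does not describe how any real item banking platform is implemented; the class names, fields, and item IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Item:
    """One test question plus its metadata (hypothetical fields)."""
    item_id: str
    stem: str
    author: str
    difficulty: Optional[float] = None  # e.g. classical P value, once known


@dataclass
class ItemBank:
    """Central repository: each item is stored once and referenced by ID."""
    items: Dict[str, Item] = field(default_factory=dict)

    def add(self, item: Item) -> None:
        self.items[item.item_id] = item

    def build_form(self, item_ids: List[str]) -> List[Item]:
        # A form is just a selection of existing items; nothing is copied.
        return [self.items[item_id] for item_id in item_ids]


bank = ItemBank()
bank.add(Item("MATH-001", "What is 2 + 2?", author="jsmith", difficulty=0.95))
bank.add(Item("MATH-002", "What is 7 x 8?", author="jsmith", difficulty=0.72))
bank.add(Item("MATH-003", "Solve 3x = 12 for x.", author="alee"))

form_a = bank.build_form(["MATH-001", "MATH-002"])
form_b = bank.build_form(["MATH-002", "MATH-003"])
print(form_a[1] is form_b[0])  # both forms share the same stored item
```

Because both forms point at the same stored item, an edit or a new statistic recorded on that item is visible everywhere it is used.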

1. Item writers are screened for expertise

Make sure the item writers (authors) recruited for the program meet minimum levels of expertise. Often this means a minimum number of years of experience in the field. You also might want to make sure that characteristics such as specialty area or geographic region are sufficiently distributed among them.

2. Item writers are trained on best practices

Item writers must be trained on best practices in item writing, as well as any guidelines provided by the organization. A great example is this book from TIMSS. ASC has provided their guidelines for download here. This facilitates higher quality item banks.

3. Items go through review workflow to check best practices

After items are written, they should proceed through a standardized workflow with quality assurance. This is a best practice in developing any product. The field of software development uses a concept called the Kanban Board, which ASC has implemented in its item banking platform.

Review steps can include psychometric review, bias review, language editing, and course content review.

4. Items are all linked to blueprint/standards

All items in the item banks should be appropriately categorized. This helps ensure that no items are measuring an unknown or unneeded concept. Items should be written to meet blueprints or standards.


5. Items are piloted

Items are all written with good intent. However, we all know that some items are better than others. Items need to be given to some actual examinees so we can obtain feedback, and also obtain data for psychometric analysis.

Often, they are piloted as unscored items before eventual use as “live” scored items. But this isn’t always possible.

6. Psychometric analysis of items

After items are piloted, you need to analyze them with classical test theory and/or item response theory to evaluate their performance. I like to say there are three possible choices after this evaluation: hold, revise, and retire. Items that perform well are preserved as-is.

Those of moderate quality might be modified and re-piloted. Those that are unsalvageable are slated for early retirement.

How to accomplish all this?

This process can be extremely long, involved, and expensive. Many organizations hire in-house test development managers or psychometricians; those without that option will hire organizations such as ASC to serve as consultants.

Regardless, it is important to have a software platform in place that can effectively manage this process. Such platforms have been around since the 1980s, but many organizations still struggle by managing their item banks with Word, Excel, PowerPoint, and Email!

ASC provides an item banking platform for free, which is used by hundreds of organizations. Sign up for your own free account.


Responses in Common (RIC) is a collusion detection (test cheating) index that simply counts the number of responses in common between a given pair of examinees. For example, both answered ‘B’ to a certain item, regardless of whether it was correct or incorrect. There is no probabilistic evaluation that can be used to flag examinees. However, it could be of good use from a descriptive or investigative perspective.

It has a major flaw in that we expect it to be very high for high-ability examinees. If two smart examinees both get 99/100 correct, the minimum RIC they could have is 98/100, even if they have never met each other and have no possibility of collusion or cheating.

Note that RIC is not standardized in any way, so its range and relevant flag cutoff will depend on the number of items in your test, and how much your examinee responses vary. For a 100-item test, you might want to set the flag at 90 items. But for a 50-item test, that cutoff obviously cannot apply, and you might want to set it at 45.
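A minimal sketch of the RIC calculation, including the 99/100 scenario described above, might look like this. The function name, the all-"A" answer key, and the examinee labels are assumptions made for illustration; it is not the implementation used by any particular software.

```python
from itertools import combinations


def responses_in_common(a, b):
    """Count the items on which two examinees chose the same response."""
    if len(a) != len(b):
        raise ValueError("both response lists must cover the same items")
    return sum(1 for x, y in zip(a, b) if x == y)


# Hypothetical 100-item test where the keyed answer is always "A".
key = ["A"] * 100
# Two strong examinees who each miss exactly one (different) item.
examinees = {
    "E1": ["B"] + ["A"] * 99,        # misses the first item
    "E2": ["A", "C"] + ["A"] * 98,   # misses the second item
}

# Comparing against the key gives the number-correct score.
scores = {name: responses_in_common(r, key) for name, r in examinees.items()}

flag_cutoff = 90  # the 100-item cutoff suggested in the text
flagged = [
    (p, q)
    for p, q in combinations(examinees, 2)
    if responses_in_common(examinees[p], examinees[q]) >= flag_cutoff
]
print(scores)
print(responses_in_common(examinees["E1"], examinees["E2"]))
print(flagged)
```

Here the pair shares 98 responses and is flagged even though the two response patterns are exactly what independent high scorers would produce, which is the flaw described above.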

Problems such as these with Responses In Common have led to the development of much more sophisticated indices of examinee collusion and copying, such as Holland’s K index and variants.

Need an easy way to calculate this?  Download our SIFT software for free.


Exact Errors in Common (EEIC) is an extremely basic collusion detection index that simply counts the number of items on which a given pair of examinees gave the same incorrect response.

For example, suppose two examinees each got 80/100 correct on a test. Of the 20 items each got wrong, 10 were the same items. On 5 of those, they gave the same wrong answer. This means that the EEIC would be 5.

Why does this index provide evidence of collusion? Well, if you and I both get 20 items wrong on a test (same score), that’s not going to raise any eyebrows. But what if we get the same 20 items wrong? A little more concerning. What if we gave the exact same wrong answers on all 20 of those? Definitely cause for concern!
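The example above can be reproduced with a short sketch. The response patterns are constructed by hand to match the numbers in the example (80/100 correct each, 10 errors on the same items, 5 identical wrong answers); the function names and the all-"A" answer key are assumptions for illustration.

```python
def errors_in_common(a, b, key):
    """Count items that both examinees answered incorrectly."""
    return sum(1 for x, y, k in zip(a, b, key) if x != k and y != k)


def exact_errors_in_common(a, b, key):
    """Count items where both examinees chose the same incorrect option."""
    return sum(1 for x, y, k in zip(a, b, key) if x == y and x != k)


# Hypothetical 100-item test where the keyed answer is always "A".
key = ["A"] * 100
# Examinee A is wrong on items 1-20, always choosing "B".
examinee_a = ["B"] * 20 + ["A"] * 80
# Examinee B is wrong on items 11-30: "B" on 11-15, "C" on 16-20, "B" on 21-30.
examinee_b = ["A"] * 10 + ["B"] * 5 + ["C"] * 5 + ["B"] * 10 + ["A"] * 70

score_a = sum(1 for x, k in zip(examinee_a, key) if x == k)
score_b = sum(1 for x, k in zip(examinee_b, key) if x == k)
print(score_a, score_b)
print(errors_in_common(examinee_a, examinee_b, key))
print(exact_errors_in_common(examinee_a, examinee_b, key))
```

Unlike a plain count of matching responses, matching correct answers contribute nothing here, so two high scorers are not penalized simply for both knowing the material.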

There is no probabilistic evaluation that can be used to flag examinees.  However, it could be of good use from a descriptive or investigative perspective. Because it is of limited use by itself, it was incorporated into more advanced indices, such as Harpp, Hogan, and Jennings (1996).

Note that because Exact Errors in Common is not standardized in any way, its range and relevant flag cutoff will depend on the number of items in your test, and how much your examinee responses vary. For a 100-item test, you might want to set the flag at 10 items. But for a 20-item test, this is obviously irrelevant, and you might want to set it at 5 (because most examinees will probably not even get more than 10 errors).

EEIC is easy to calculate by hand, but you can also download the SIFT software for free to do it for you.