Automated Item Generation
Automated item generation (AIG) is a paradigm for developing assessment items (test questions) using principles of artificial intelligence and automation. As the name suggests, it seeks to automate some or all of the effort involved in item authoring, one of the most time-intensive aspects of assessment development – which is no news to anyone who has authored test questions!
What is Automated Item Generation?
Automated item generation involves the use of computer algorithms to create new test questions, or variations of them. It can also be used to review items, generate answer options, or generate assets such as reading passages. Items still need to be reviewed and edited by humans, but AIG nevertheless saves a massive amount of time in test development.
Why Use Automated Item Generation?
Items can cost up to $2,000 each to develop, so even cutting the average cost in half could provide massive time and money savings to an organization. ASC provides AIG functionality, with no limits, to anyone who signs up for a free item banking account in our platform Assess.ai.
Types of Automated Item Generation
There are two types of automated item generation. The first, the item-template approach, was developed before large language models (LLMs) were widely available. The second uses LLMs, which became widely available at the end of 2022.
Type 1: Item Templates
The first type is based on the concept of item templates: a single template creates a family of items through dynamic, insertable variables. There are three stages to this work; for more detail, see the article by Gierl, Lai, and Turner (2012).
- Authors, or a team, create a cognitive model by isolating exactly what they are trying to assess and the different ways that the knowledge could be presented or evidenced. This might include information such as which variables are important vs. incidental, and what a correct answer should include.
- They then develop templates for items based on this model, like the example you see below.
- An algorithm then turns this template into a family of related items, often by producing all possible permutations, as in the sketch below.
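To make the permutation step concrete, here is a minimal sketch in Python. The dosage-calculation template, the variable lists, and the answer rule are all hypothetical illustrations (not taken from the Gierl, Lai, and Turner model); the point is simply that one template plus a permutation algorithm yields a whole family of items.

```python
# A minimal sketch of the item-template approach. The template, the
# variable lists, and the answer rule are all hypothetical; a real
# cognitive model would define these much more carefully.
from itertools import product

TEMPLATE = ("A patient weighs {weight} kg and the ordered dose is "
            "{dose} mg/kg. How many mg should be administered?")

weights = [50, 70, 90]   # incidental surface features
doses = [2, 5, 10]       # the variable that drives the computation

items = []
for weight, dose in product(weights, doses):
    items.append({
        "stem": TEMPLATE.format(weight=weight, dose=dose),
        "key": weight * dose,   # correct answer per the cognitive model
    })

print(len(items))  # 9 items from one template (3 weights x 3 doses)
print(items[0])
```

Note that only one variable here is construct-relevant; the others just change the surface story of each sibling item.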
Obviously, you can’t use more than one of these on a given test form, and in some cases, some of the permutations will be unlikely scenarios or completely irrelevant. But the savings can still be quite real. I saw a conference presentation by Andre de Champlain from the Medical Council of Canada stating that overall efficiency improved by 6x, and that the generated items were higher quality than traditionally written items, because the process made the authors think more deeply about what they were assessing and how. He also recommended that template permutations not be automatically moved to the item bank, but instead that each be reviewed by SMEs, for reasons such as those stated above.
You might think “Hey, that’s not really AI…” – but AI simply means doing things that have historically been done by humans, and the definition gets pushed further every year. Remember, AI used to mean an Atari being able to play Pong with you!
Type 2: AI Generation or Processing of Source Text
The second type is what the phrase “automated item generation” more likely brings to mind: upload a textbook or similar source to software, and it spits back drafts of test questions. For example, see the article by von Davier (2019). Alternatively, simply state a topic as a prompt and the AI will generate test questions.
Until the release of ChatGPT and other publicly available AI platforms implementing large language models (LLMs), this approach was only available to experts at large organizations. Now, it is available to everyone with an internet connection. If you use such products directly, you can provide a prompt such as “Write me 10 exam questions on Glaucoma, in a 4-option multiple choice format” and it will do so. You can also make the instructions more specific, such as requesting output formatted in your preferred standard, such as QTI or JSON.
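As a concrete illustration, here is a minimal sketch of doing this through an API rather than a chat interface, assuming the OpenAI Python client (the openai package) and an OPENAI_API_KEY environment variable; the model name and prompt wording are illustrative, and other LLM providers work similarly.

```python
# A minimal sketch of prompting an LLM for draft items via an API.
# Assumes the OpenAI Python client and an OPENAI_API_KEY environment
# variable; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write me 10 exam questions on Glaucoma, in a 4-option multiple "
    "choice format. Return a JSON array where each item has the "
    "fields: stem, options, key, rationale."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; use whatever model you have access to
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)  # raw drafts, pending SME review
```

Asking for structured output like JSON up front is what makes the next step – cleaning the drafts and importing them into an item bank – much less painful.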
Alternatively, many assessment platforms now integrate with these products directly, so you can do the same thing but have the items appear in the item banker under New status, rather than landing in a raw file on your local computer that you then have to clean and upload. FastTest has such functionality available.
This technology has completely revolutionized how we develop test questions. I’ve seen several research presentations on this, and they all find that AIG produces more items, of quality as good as or better than human-written items, in a fraction of the time! But they have also found that prompt engineering is critical: even one word – like including “concise” in your prompt – can affect the quality of the items.
The Limitations of Automated Item Generation
Automated item generation (AIG) has revolutionized the way educational and psychological assessments are developed, offering increased efficiency and consistency. However, this technology comes with several limitations that can impact the quality and effectiveness of the items produced.
One significant limitation is the challenge of ensuring content validity. AIG relies heavily on algorithms and pre-defined templates, which may not capture the nuanced and comprehensive understanding of subject matter that human experts possess. This can result in items that are too simplistic or that fail to fully address the depth and breadth of the content domain.
Another limitation is the potential for over-reliance on statistical properties rather than pedagogical soundness. While AIG can generate items that meet certain psychometric criteria, such as difficulty and discrimination indices, these items may not always align with best practices in educational assessment or instructional design. This can lead to tests that are technically robust but lack relevance or meaningfulness to the learners.
Furthermore, the use of AIG can inadvertently introduce bias. Algorithms used in item generation are based on historical data and patterns, which may reflect existing biases in the data. Without careful oversight and adjustment, AIG can perpetuate or even exacerbate these biases, leading to unfair assessment outcomes for certain groups of test-takers.
Lastly, there is the issue of limited creativity and innovation. Automated systems generate items based on existing templates and rules, which can result in a lack of variety and originality in the items produced. This can make assessments predictable and less engaging for test-takers, potentially impacting their motivation and performance.
In conclusion, while automated item generation offers many benefits, it is crucial to address these limitations through continuous oversight, integration of expert input, and regular validation studies to ensure the development of high-quality assessment items.
How Can I Implement Automated Item Generation?
If you are a user of AI products like ChatGPT or Bard, you can work with them directly. Advanced users can implement the APIs to upload documents or fine-tune the underlying machine learning models. The aforementioned article by von Davier discusses such usage.
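For instance, here is a minimal sketch of turning the JSON output from the earlier API call into records ready for import into an item bank. The field names, the naming convention, and the “New” status are hypothetical placeholders, so map them to whatever your item banker expects.

```python
# A minimal sketch of converting LLM JSON output into item-bank records.
# Field names, the naming convention, and the "New" status are all
# hypothetical placeholders for your own item banker's import format.
import json

raw = """[
  {"stem": "Which finding is most characteristic of open-angle glaucoma?",
   "options": ["Sudden eye pain", "Gradual peripheral vision loss",
               "Double vision", "Eyelid drooping"],
   "key": "Gradual peripheral vision loss",
   "rationale": "Open-angle glaucoma progresses slowly and painlessly."}
]"""  # stand-in for the API response shown earlier

records = []
for i, item in enumerate(json.loads(raw), start=1):
    records.append({
        "name": f"GLAUCOMA-{i:03d}",  # your naming convention
        "status": "New",              # flag every draft for human review
        "stem": item["stem"],
        "options": item["options"],
        "key": item["key"],
    })

print(records[0]["name"], records[0]["status"])
```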
If you want to save time, FastTest provides a direct ChatGPT integration: you provide the prompt, and items are then automatically created in the item banking folder you specify, with the item naming convention you specify, tagged as Status=New and ready for review. Items can then be routed through our configurable Item Review Workflow process, including functionality to gather modified-Angoff ratings.
Ready to improve your test development process? Click here to talk to a psychometric expert.
Nathan Thompson, PhD