Artificial intelligence (AI) and machine learning (ML) have become buzzwords over the past few years. As I've written about before, they are actually old news in the field of psychometrics. Factor analysis is a classic example of ML, and item response theory also qualifies. Computerized adaptive testing is an application of AI to psychometrics that dates back to the 1970s.
One thing that is very different about the world of AI/ML today is the massive power available in free platforms like R, Python, and TensorFlow. I've been thinking a lot over the past few years about how these tools can impact the world of assessment. A straightforward application is automated essay scoring: a common approach is natural language processing with the "bag of words" model, using the document-term matrix (DTM) as predictors in a model with essay score as the criterion variable. Surprisingly simple. This got me wondering where else we could apply that sort of modeling. Obviously, student response data on selected-response items provide a ton of data, but the research questions are less clear. So I turned to the topic that I think has the next-largest set of data and text: item banks.
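To make the "DTM as predictors" idea concrete, here is a minimal sketch in Python with scikit-learn (my own work here used R, so treat this as an illustrative translation; the essays and scores below are made-up toy data):

```python
# Toy sketch of bag-of-words essay scoring: build a document-term matrix,
# then regress human-assigned scores on it. All data here is fabricated.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

essays = [
    "The mitochondria is the powerhouse of the cell",
    "Cells contain mitochondria which produce energy",
    "The capital of France is Paris",
    "Paris is a city in France",
]
scores = [4.0, 5.0, 2.0, 1.0]  # hypothetical human scores

vectorizer = CountVectorizer()          # "bag of words": word counts, order ignored
dtm = vectorizer.fit_transform(essays)  # sparse document-term matrix

model = Ridge(alpha=1.0).fit(dtm, scores)  # penalized regression on the DTM
new_essay = ["Mitochondria produce energy for the cell"]
predicted = model.predict(vectorizer.transform(new_essay))
```

The model then scores unseen essays by projecting them onto the same vocabulary; in production you would obviously need far more essays and cross-validation.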
Step 1: Text Mining
The first step was to explore tools for text mining in R. I found a well-written and clear tutorial on the text2vec package and used that as my springboard. Within minutes I was able to get a document-term matrix, and in a few more minutes was able to prune it. This DTM alone can provide useful information to an organization about its item bank, but I wanted to delve further. Can the DTM predict item quality?
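The build-then-prune pipeline looks roughly like this. I did it in R with text2vec; this is a hedged Python equivalent, where CountVectorizer's `min_df` plays the role of text2vec's vocabulary pruning, and the item stems are invented stand-ins:

```python
# Sketch: build a DTM from item stems, then prune rare terms.
# Stems below are hypothetical examples, not a real item bank.
from sklearn.feature_extraction.text import CountVectorizer

stems = [
    "Which of the following is a measure of central tendency?",
    "Which of the following best describes standard deviation?",
    "A researcher computes a correlation coefficient. Which value is strongest?",
    "Which measure of central tendency is most affected by outliers?",
]

# Full DTM: one row per item, one column per term.
full_dtm = CountVectorizer().fit_transform(stems)

# Pruned DTM: drop terms appearing in fewer than 2 items
# (analogous to text2vec's prune_vocabulary with a minimum term count).
pruned_dtm = CountVectorizer(min_df=2).fit_transform(stems)

print(full_dtm.shape, pruned_dtm.shape)  # pruning shrinks the column count
```

Pruning matters because an item bank's vocabulary is full of one-off words that add columns without adding signal.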
Step 2: Fit Models
To do this, I used the caret and glmnet packages to fit models. I love the caret package, but if you search around you'll find it doesn't handle sparse matrices well, and the DTM is exactly that. One blog post I found said that anyone with a sparse matrix is pretty much stuck using glmnet.
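The key point is that glmnet's penalized regression accepts a sparse DTM directly, without densifying it. As a rough Python analogue (scikit-learn's ElasticNet, which uses the same penalty family as glmnet, also accepts SciPy sparse input), here is a simulated sketch with fabricated data:

```python
# Sketch: penalized regression on a simulated sparse "DTM",
# mimicking the glmnet-on-DTM workflow. Data is simulated.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Simulated DTM: 200 items x 1000 terms, ~1% of entries nonzero.
X = sparse_random(200, 1000, density=0.01, format="csr", random_state=0)

# Only a handful of "words" actually relate to the outcome.
true_beta = np.zeros(1000)
true_beta[:10] = 2.0
y = X @ true_beta + rng.normal(scale=0.1, size=200)

# ElasticNet fits the sparse matrix as-is, no dense conversion needed.
fit = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X, y)
print(fit.score(X, y))  # in-sample R-squared
```

The L1 component of the penalty also performs variable selection, which is handy when the columns are thousands of vocabulary terms.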
I tried a few models on a small item bank of 500 items from a friend of mine, and my adjusted R-squared for predicting IRT parameters (as an index of item quality) was 0.53, meaning I could account for more than half the variance in item quality just by knowing some of the common words in each item's stem. I wasn't even using the answer text, n-grams, or additional information like author and content domain.
Want to learn more about your item banks?
I'd love to dive even deeper into this issue. If you have a large item bank and would like to work with me to analyze it so you can provide better feedback and direction to your item writers and test developers, drop me a message at email@example.com! This could directly impact the efficiency of your organization and the quality of your assessments.