Automated essay scoring with machine learning

Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment.  In fact, it’s been around far longer than “machine learning” and “artificial intelligence” have been buzzwords in the general public!  The field of psychometrics has been doing such groundbreaking work for decades.

So how does AES work, and how can you apply it?

 

 

What is automated essay scoring?

The first and most critical thing to know is that there is not an algorithm that “reads” the student essays.  Instead, you need to train an algorithm.  That is, if you are a teacher and don’t want to grade your essays, you can’t just throw them into an essay scoring system.  You have to actually grade the essays (or at least a large sample of them) and then use that data to fit a machine learning algorithm.  Data scientists call this “training the model,” which sounds complicated, but if you have ever done simple linear regression, you have experience with training models.

 

There are three steps for automated essay scoring:

  1. Establish your data set. Begin by gathering a substantial collection of student essays, ensuring a diverse range of topics and writing styles. Each essay should be meticulously graded by human experts to create a reliable and accurate benchmark. This data set forms the foundation of your automated scoring system, providing the necessary examples for the machine learning model to learn from.
  2. Determine the features. Identify the key features that will serve as predictor variables in your model. These features might include grammar, syntax, vocabulary usage, coherence, structure, and argument strength. Carefully selecting these attributes is crucial as they directly impact the model’s ability to assess essays accurately. The goal is to choose features that are indicative of overall writing quality and are relevant to the scoring criteria.
  3. Train the machine learning model. Use the established data set and selected features to train your machine learning model. This involves feeding the graded essays into the model, allowing it to learn the relationship between the features and the assigned grades. Through iterative training and validation, the model adjusts its parameters to improve accuracy. Continued refinement and testing help confirm that the model can reliably score new, unseen essays.

 

Here’s an extremely oversimplified example:

  • You have a set of 100 student essays, which you have scored on a scale of 0 to 5 points.
  • The essay is on Napoleon Bonaparte, and you want students to know certain facts, so you want to give them “credit” in the model if they use words like: Corsica, Consul, Josephine, Emperor, Waterloo, Austerlitz, St. Helena.  You might also add other Features such as Word Count, number of grammar errors, number of spelling errors, etc.
  • You create a map of which students used each of these words, as 0/1 indicator variables.  You can then fit a multiple regression with 7 predictor variables (whether they used each of the 7 words) and the 0-to-5 score as your criterion variable.  You can then use this model to predict each student’s score from just their essay text.
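The Napoleon example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the essays and scores are made up, only three of the seven keywords are used to keep the data small, and the helper names (`indicators`, `fit_least_squares`, `predict`) are hypothetical. The regression is plain ordinary least squares solved by hand, so no packages are needed.

```python
import re

# Hypothetical subset of the keyword list, kept short for the sketch.
KEYWORDS = ["corsica", "emperor", "waterloo"]

def indicators(essay):
    """Return one 0/1 flag per keyword: 1 if the word appears in the essay."""
    words = set(re.findall(r"[a-z]+", essay.lower()))
    return [1 if k in words else 0 for k in KEYWORDS]

def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, weight_1, ..., weight_k]. Assumes the features are
    not collinear; otherwise the pivot division fails or the weights are
    meaningless.
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # add intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(p):
            if r != c:
                A[r] = [a - A[r][c] * b for a, b in zip(A[r], A[c])]
    return [A[i][p] for i in range(p)]

def predict(essay, coefs):
    """Predicted score = intercept + sum of weights for the keywords used."""
    return coefs[0] + sum(w * x for w, x in zip(coefs[1:], indicators(essay)))

# Made-up essays with human scores on the 0-to-5 scale, for illustration only.
graded = [
    ("Napoleon was born in Corsica.", 2),
    ("He became Emperor and lost at Waterloo.", 3),
    ("Born in Corsica, he became Emperor, then lost at Waterloo.", 5),
    ("Napoleon was a French general.", 0),
    ("He crowned himself Emperor.", 2),
    ("Waterloo was his last battle.", 1),
]
X = [indicators(text) for text, _ in graded]
y = [score for _, score in graded]

coefs = fit_least_squares(X, y)
print([round(c, 2) for c in coefs])
print(round(predict("He left Corsica and later fell at Waterloo.", coefs), 2))
```

Each fitted weight is the number of points the model associates with using that word, which is the “credit” idea described above.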

 

Obviously, this example is too simple to be of use, but the same general idea is applied in massive, complex studies.  Establishing the core features (predictor variables) can be much more involved, and the models are usually much more complex than multiple regression (neural networks, random forests, support vector machines).

Here’s an example of the very start of a data matrix for features, from actual student essays.  Imagine that you also have data on the final scores, 0 to 5 points.  You can see how this is then a regression situation.

Examinee  Word Count  i_have  best_jump  move  and_that  the_kids  well
1         307         0       1          2     0         0         1
2         164         0       0          1     0         0         0
3         348         1       0          1     0         0         0
4         371         0       1          1     0         0         0
5         446         0       0          0     0         0         2
6         364         1       0          0     0         1         1
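A row of such a matrix can be produced with a few lines of Python. This is a minimal sketch, assuming that a column like `i_have` counts the two-word sequence “I have” and that a column like `move` counts a single word; the essay text and the function names are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def feature_row(text, terms):
    """Count each term in the essay; two-word terms are written word1_word2."""
    tokens = tokenize(text)
    # Adjacent word pairs; this simple version also pairs words across
    # sentence boundaries.
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    pool = tokens + bigrams
    row = {"word_count": len(tokens)}
    for term in terms:
        row[term] = pool.count(term)
    return row

TERMS = ["i_have", "best_jump", "move", "and_that", "the_kids", "well"]
essay = "I have the best jump. The kids move well and that is why I have fun."
print(feature_row(essay, TERMS))
```

Running this over every essay and stacking the rows gives the predictor matrix; the human scores form the criterion column.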

 

How do you score the essay?

If the essays are on paper, then automated essay scoring won’t work unless you have extremely good character recognition software that converts them to a digital database of text.  Most likely, you have delivered the exam as an online assessment and already have the database.  If so, your platform should include functionality to manage the scoring process, including multiple custom rubrics.  An example from our FastTest platform is provided below.

 

[Image: FastTest essay marking screen]

Some rubrics you might use:

  • Grammar
  • Spelling
  • Content
  • Style
  • Supporting arguments
  • Organization
  • Vocabulary / word choice

 

How do you pick the Features?

This is one of the key research problems.  In some cases, it might be something similar to the Napoleon example.  Suppose you had a complex item on Accounting, where examinees review reports and spreadsheets and need to summarize a few key points.  You might pull out a few key terms (mortgage amortization) or numbers (2.375%) and treat them as Features.  I saw a presentation at Innovations In Testing 2022 that did exactly this.  Think of it as giving the students “points” for using those keywords, though because you are using complex machine learning models, it is not simply a single unit point.  Each keyword contributes towards a regression-like model with a positive slope.

In other cases, you might not know.  Maybe it is an item on an English test being delivered to English language learners, and you ask them to write about what country they want to visit someday.  You have no idea what they will write about.  But what you can do is tell the algorithm to find the words or terms that are used most often, and try to predict the scores with those.  Maybe words like “jetlag” or “edification” show up in essays from students who tend to get high scores, while words like “clubbing” or “someday” tend to be used by students with lower scores.  The AI might also pick up on spelling errors.  I worked as an essay scorer in grad school, and I can’t tell you how many times I saw kids use “ludacris” (the name of an American rap artist) instead of “ludicrous” when trying to describe an argument.  They had literally never seen the word used or spelled correctly.  Maybe the AI model learns to give that a negative weight.   That’s the next section!
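Here is a minimal sketch of that idea, assuming made-up essays and scores. It finds the terms used in the most essays and then compares the mean score of essays that use a term with the mean score of those that do not, which is a crude stand-in for the sign of the weight a real model would learn. The function names and the `min_len` filter (a rough substitute for removing common short words) are assumptions for illustration.

```python
import re
from collections import Counter

def words_in(text):
    """Set of distinct lowercase words in an essay."""
    return set(re.findall(r"[a-z]+", text.lower()))

def top_terms(essays, k=5, min_len=4):
    """Terms used in the most essays; ties are broken alphabetically."""
    df = Counter()
    for text in essays:
        df.update(w for w in words_in(text) if len(w) >= min_len)
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def score_gap(term, essays, scores):
    """Mean score of essays using the term minus mean score of the rest."""
    with_term = [s for t, s in zip(essays, scores) if term in words_in(t)]
    without = [s for t, s in zip(essays, scores) if term not in words_in(t)]
    if not with_term or not without:
        return None  # the term is in every essay or in none
    return sum(with_term) / len(with_term) - sum(without) / len(without)

# Made-up essays and human scores, for illustration only.
essays = [
    "I would visit Japan despite the jetlag.",
    "Someday I want to go clubbing in Ibiza.",
    "The jetlag is worth it for the edification.",
    "Someday I will travel, it is ludacris not to.",
]
scores = [5, 2, 5, 1]

print(top_terms(essays, k=2))                 # ['jetlag', 'someday']
print(score_gap("jetlag", essays, scores))    # 3.5
print(score_gap("ludacris", essays, scores))  # -3.0
```

A positive gap suggests a term the model would reward, and a negative gap a term it would penalize. In practice, the selected terms become columns in the feature matrix and the model estimates the weights.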

 

How do you train a model?


Well, if you are familiar with data science, you know there are TONS of models, and many of them have a bunch of parameterization options.  This is where more research is required.  What model works the best on your particular essay, and doesn’t take 5 days to run on your data set?  That’s for you to figure out.  There is a trade-off between simplicity and accuracy.  Complex models might be accurate but take days to run.  A simpler model might take 2 hours but with a 5% drop in accuracy.  It’s up to you to evaluate.

If you have experience with Python or R, you know that there are many packages which provide this analysis out of the box; it is a matter of selecting a model that works.

 

How effective is automated essay scoring?

Well, as psychometricians love to say, “it depends.”  You need to do the model fitting research for each prompt and rubric.  It will work better for some than others.  The general consensus in research is that AES algorithms work as well as a second human, and therefore serve very well in that role.  Ideally you shouldn’t use them as the only score, though of course having a human score every essay is not feasible in many cases.

Here’s a graph from some research we did on our algorithm, showing the correlation of human to AES.  The three lines are for the proportion of sample used in the training set; we saw decent results from only 10% in this case!  Some of the models correlated above 0.80 with humans, even though this is a small data set.   We found that the Cubist model took a fraction of the time needed by complex models like Neural Net or Random Forest; in this case it might be sufficiently powerful.

 

[Figure: Automated essay scoring results, showing the correlation of human and AES scores by model and training-set proportion]
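The human-to-AES correlation in a study like this is typically computed on essays that were not used for training. The sketch below shows only that evaluation step, under stated assumptions: the score lists are made up, `aes` stands in for the predictions of whatever model was fit on the training essays, and the function names are hypothetical. It is not the analysis behind the graph.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def split_indices(n, train_fraction, seed=0):
    """Shuffle essay indices and split them into training and held-out sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

# Made-up scores on the 0-to-5 scale, for illustration only.
human = [0, 1, 2, 2, 3, 3, 4, 4, 5, 5]
aes = [1, 1, 2, 3, 3, 2, 4, 5, 4, 5]  # stand-in for model predictions

train, held_out = split_indices(len(human), train_fraction=0.3)
r = pearson([human[i] for i in held_out], [aes[i] for i in held_out])
print(f"held-out essays: {len(held_out)}, human-AES correlation: {r:.2f}")
```

Repeating this with different values of `train_fraction` and different models gives one point per combination, which is the kind of comparison the graph summarizes.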

 

How can I implement automated essay scoring without writing code from scratch?

There are several products on the market.  Some are standalone, and some are integrated with a human-based essay scoring platform.  ASC’s platform for automated essay scoring is SmartMarq.  It is currently offered as a standalone tool, as shown below, making it extremely easy to use.  It is also in the process of being integrated into our online assessment platform, alongside human scoring, to provide an efficient and easy way of obtaining a second or third rater for QA purposes.

Want to learn more?  Contact us to request a demonstration.

 

[Image: SmartMarq automated essay scoring]


Even people who do not consider themselves technology buffs have heard about ChatGPT. Today, everybody is talking about ChatGPT, a modern megastar of the artificial intelligence (AI) and machine learning (ML) industries. This is how ChatGPT introduces itself:

“I am an AI language model called ChatGPT. My purpose is to assist and communicate with users through text-based conversations, and to provide helpful and informative responses to their questions and inquiries.”

I understood its purpose straight away but still decided to interact with ChatGPT more closely while working on item writing for a Math test. I opened ChatGPT and started a new conversation; it was dead simple. Below I share my observations of ChatGPT’s features, classifying them as strengths and imperfections.  I will then discuss my experience in item writing with ChatGPT.

Strengths of ChatGPT

ChatGPT is an intelligent chatbot that

  • is polite,
  • declines inappropriate requests,
  • generates responses very quickly,
  • allows the user to provide follow-up adjustments,
  • remembers what the user said earlier in the conversation,
  • can significantly reduce the time subject matter experts spend on item writing.

 


 

Imperfections of ChatGPT

General

  • The free version of ChatGPT limits the length of each response (around 2,000 characters). Therefore, if you need to construct a test, as I did, you need to give the chatbot its task in portions or keep asking it to finish the assignment you have given. Your interaction with ChatGPT will be more efficient if you plan your assignment accordingly.
  • ChatGPT has limited knowledge of the world after 2021. You need to consider this fact when designing a prompt.
  • ChatGPT can generate inaccurate responses to questions. You may find several examples below.

 

Math Item Writing

ChatGPT’s Math skills are not excellent. Below I have outlined some striking examples:

  • ChatGPT “thinks” that 11/2 equals 2 and 6/5 equals 1;
  • ChatGPT does not realize that, for a quadratic equation with two different roots, one root alone is not the solution, yet it lists both roots as separate answer options;
  • ChatGPT makes mistakes when naming the correct response among the options it has generated;
  • ChatGPT makes up distractors (incorrect answers in multiple-choice questions) without considering the mistakes that test takers are likely to make;
  • ChatGPT provides a correct solution to a question but does not include it among the answer options;
  • ChatGPT cannot differentiate between levels of difficulty, even when a sample is given in the prompt.

 

Let’s have a look at a couple of Math questions written by ChatGPT according to my prompts.

Example 1

Prompt: ChatGPT, please, write 1 multiple-choice Math question of low difficulty for 16-17 year old high school American students for the following topic in Algebra: Linear equations in two variables. Multiple-choice questions must have only one correct answer (key) and three incorrect answers (distractors) based on test takers’ incorrect reasoning. Please, provide an explanation to each distractor and a solution to the question.

Response:

Which point satisfies the equation y = -2x + 7?

A) (5, 1)

B) (-1, 5)

C) (2, 5)

D) (1, -5)

Explanation: To find a point that satisfies the equation, we can plug in values for x and y that make the equation true. For example, if we let x = 1, then y = -2(1) + 7 = 5. So, the point (1, 5) satisfies the equation.

Comments: ChatGPT did not include the correct response among the answer options, even though it solved the question correctly.

Example 2

Prompt: ChatGPT, please, write 1 student-produced response Math question of medium difficulty for 16-17 year old high school American students for the following topic in Algebra: Systems of two linear equations in two variables. Student-produced response questions must have only one correct answer as a real number. Please, provide a solution to this question.

Response:

Solve the system of equations:

2x + 3y = 8

x – y = 1.

Explanation: To solve the system of equations, we can use either substitution or elimination method. Using substitution, we can solve one equation for one variable and substitute it into the other equation. For example, we can solve the second equation for x and get x = y + 1. Then, we substitute x = y + 1 into the first equation and get 2(y + 1) + 3y = 8. Simplifying this equation gives us 5y + 2 = 8, which gives us y = 1. Substituting y = 1 into x = y + 1 gives us x = 2. Therefore, the solution to the system of equations is (2, 1).

Comments: Solving 5y + 2 = 8 does not give y = 1 but y = 6/5, or 1.2, and then x = y + 1 = 2.2. Therefore, the point (2, 1) is not a solution of this system of equations.
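Errors like these are easy to catch mechanically before an item reaches a reviewer. The sketch below re-checks both items above: it plugs each answer option into the equation, and it solves the system exactly with Cramer's rule. The function names are hypothetical, and this is an illustration of the checking idea rather than part of any item-writing tool.

```python
from fractions import Fraction

def keyed_options(options, satisfies):
    """Return the labels of all options that satisfy the given condition."""
    return [label for label, point in options.items() if satisfies(*point)]

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 exactly (Cramer's rule)."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the system has no unique solution")
    return Fraction(c1 * b2 - c2 * b1, det), Fraction(a1 * c2 - a2 * c1, det)

# Item 1: which point satisfies y = -2x + 7?
options = {"A": (5, 1), "B": (-1, 5), "C": (2, 5), "D": (1, -5)}
on_line = lambda x, y: y == -2 * x + 7
print(keyed_options(options, on_line))  # [] : no option is a key
print(on_line(1, 5))                    # True: the point from the explanation

# Item 2: 2x + 3y = 8 and x - y = 1
x, y = solve_2x2(2, 3, 8, 1, -1, 1)
print(x, y)              # 11/5 6/5
print((x, y) == (2, 1))  # False: ChatGPT's answer is wrong
```

A well-formed multiple-choice item should return exactly one label from `keyed_options`; an empty list or more than one label flags the item for review.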

 

English Item Writing

ChatGPT’s English skills are not as perfect as I thought they would be, given that it is a native English speaker (I suppose). I had ChatGPT design several tasks to test students’ English skills. As expected, ChatGPT is more productive with chunks of text. I asked it to write some Reading and Writing questions for SAT and IELTS, and the results were much better than those for Math. However, there were some shortcomings in terms of grammar errors and poor lexical choice. For instance, ChatGPT suggested option A as the correct response, which is arguable:

Which of the following options best replaces the underlined phrase in the sentence: “The company’s profits have increased by 20% over the last quarter.”

A) The company’s profits have risen by 20% over the last quarter.

B) The company’s profits have grown by 20% over the last quarter.

C) The company’s profits have expanded by 20% over the last quarter.

D) The company’s profits have developed by 20% over the last quarter.

 

Russian Item Writing

I was curious how proficient ChatGPT was in Russian. Well, it knows how many letters the Russian alphabet contains and some other facts, but when I asked ChatGPT to write a question to check some grammar, I was disappointed: ChatGPT gave a wrong answer when asked to identify a gerund (деепричастие, an adverbial participle). The word it keyed, Летящий, is a present participle, not a gerund.

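Какое из перечисленных слов является деепричастием? (Which of the following words is a gerund?)

A) Зеленеющий (turning green)

B) Шумно (noisily)

C) Летящий (flying)

D) Бежавший (having run)

Правильный ответ: C) Летящий – это деепричастие настоящего времени. Ошибочные ответы: A) – прилагательное, B) – наречие, D) – причастие прошедшего времени. (Correct answer: C) Летящий is a present-tense gerund. Incorrect answers: A) is an adjective, B) is an adverb, D) is a past participle.)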

 

Summary

Since AI-enabled tools do not have real brains and only regurgitate information based on what they “learned” from billions of data points, it is to be expected that their responses will contain some inaccuracies, even though they look human-like.

In its own introduction, ChatGPT announced straight away that it was intended to ASSIST us, humans, not to DO things instead of us. ChatGPT can provide helpful responses but is not capable of independent thought or emotions. Like any AI, ChatGPT is vulnerable to multiple issues, including bias and potential inaccuracies.

I would like to re-emphasize the point I made in the previous post: any AI-powered tool, no matter how “good” it is, needs a smart user to operate it. Otherwise, it is going to be a complete disaster! However, if users provide ChatGPT with a well-structured question and the right guidance, there is a good chance of obtaining an accurate response.

One more thing to consider is that Math is a way of thinking. Therefore, I would not expect AI-powered chatbots to be super-efficient in this field, at least in the near future.

Another important consideration is the legal situation. Current opinion in the field is this: Copyright laws are designed to protect creative development of new things by humans, but if ChatGPT is doing the heavy lifting, then it is not human-created and therefore not fully protected by copyright law. If the items are stolen, you might not have legal recourse. However, this topic is, of course, quite new and continues to evolve.

In the end, is it worth it to use ChatGPT for item writing? Absolutely. Even though there are errors which are quickly found, such as having two correct answers to a question or not having a key specified, these sorts of things are easily fixed. The average time to develop new items can be significantly reduced. Moreover, ChatGPT will continue to get even more powerful! So, stay tuned!

 

Artificial intelligence (AI) is poised to address some of the challenges that education faces today by innovating teaching and learning processes. By applying AI in education technologies, educators can determine student needs more precisely, keep students more engaged, improve learning, and adapt teaching accordingly to boost learning outcomes. The use of AI in education began in the 1970s as a search for a substitute for one-on-one tutoring and has seen many improvements since then. This article will look at some of the latest AI developments used in education, their potential impact, and their drawbacks.

Application of AI


Recently, AI technologies have permeated all aspects of the educational process. Research conducted since 2009 shows that AI has been extensively employed in management, instruction, and learning. In management, AI tools are used to review and grade student assignments, sometimes even more accurately than educators do. Teachers apply some AI-based interactive tools to build and share student knowledge. Learning can be enhanced through customization and personalization of content enabled by new technological systems that leverage machine learning (ML) and adaptability.

Below you may find a list of major educational areas where AI technologies are actively involved and that are worthy of being further developed.

  • Personalized learning: This educational approach tailors the learning trajectory to individual student needs and interests. AI algorithms analyze student information (e.g. learning style and performance) to create customized learning paths. Based on student weaknesses and strengths, AI recommends exercises and learning materials.  AI technologies are increasingly pivotal in online learning apps, personalizing education and making it more accessible to a diverse learner base.
  • Adaptive learning: This approach does the same as personalized learning but in real time, which helps keep learners engaged and motivated. ALEKS is a good example of an adaptive learning program.
  • Learning courses: These are AI-powered online platforms designed for eLearning and course management that enable learners to browse for specific courses and study at their own pace. These platforms offer learning activities in increasing order of difficulty, aimed at ultimate educational goals. Examples include advanced Learning Management Systems (LMS) and Massive Open Online Courses (MOOCs).
  • Learning assistants/Teaching robots: AI-based assistants can supply support and resources to learners upon request. They can respond to questions, provide personalized feedback, and guide students through learning content. Such virtual assistants might be especially helpful for learners who cannot access offline support.
  • Adaptive testing: In this mode of test delivery, each examinee responds to questions that correspond to their level of expertise, based on their previous responses. This is possible due to AI algorithms enabled by ML and psychometric methods such as item response theory (IRT). You can get more information about adaptive testing from Nathan Thompson’s blog post.
  • Remote proctoring: This is a type of software that allows examiners to coordinate an assessment process remotely while maintaining confidentiality and preventing examinees from cheating. In addition, a virtual proctor can assist examinees in resolving any issues that arise during the process. The functionality of proctoring software can differ substantially depending on the stakes of the exam and the preferences of stakeholders. You can read more on this topic on ASC’s blog.
  • Test assembly: Automated test assembly (ATA) is a widely used, valid, and efficient method of test construction based on either classical test theory (CTT) or item response theory (IRT). ATA lets you assemble test forms that are equivalent in terms of content distribution and psychometric statistics in seconds. ASC has designed TestAssembler to minimize the laborious and time-consuming process of form building.
  • Automated grading: Grading student assignments is one of the biggest challenges that educators face. AI-powered grading systems automate this routine work, reducing bias and inconsistencies in assessment results and increasing validity. ASC has developed an AI essay scoring system, SmartMarq. If you are interested in automated essay scoring, you should definitely read this post.
  • Item generation: Teachers are often asked to write a bunch of items for assessment purposes, as if they were not busy enough with lesson planning and other drudgery. Automated item generation is very helpful for saving time and producing quality items.
  • Search engine: The era of libraries has faded, so now we mostly rely on huge search engines built to carry out web searches. AI-powered search engines help us find an abundance of information; search results depend heavily on how we formulate our queries, choose keywords, and navigate between different sites. One of the biggest search engines so far is Google.
  • Chatbot: Last but not least, chatbots are software applications that employ AI and natural language processing (NLP) to hold human-like conversations with people. AI-powered chatbots can provide learners with additional personalized support and resources. ChatGPT can truly be considered the most prominent example of a chatbot today.
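To make the adaptive testing entry above more concrete, here is a minimal sketch of one common way an adaptive test picks the next question: under the two-parameter logistic (2PL) IRT model, choose the unused item with the most information at the examinee's current ability estimate. The item bank, its parameters, and the function names are made up for illustration; real adaptive tests also re-estimate ability after each response and apply content and exposure constraints.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability of a correct response at ability theta,
    for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, bank, used):
    """Pick the unused item that is most informative at the current theta."""
    candidates = [name for name in bank if name not in used]
    return max(candidates, key=lambda name: information(theta, *bank[name]))

# Hypothetical item bank: name -> (discrimination a, difficulty b)
bank = {"easy": (1.0, -1.5), "medium": (1.0, 0.0), "hard": (1.0, 1.5)}

print(next_item(0.0, bank, used=set()))       # medium
print(next_item(1.2, bank, used={"medium"}))  # hard
```

An examinee who starts at an average ability estimate gets the medium item first; after the estimate rises, the harder item becomes the most informative choice.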

 

Highlights of AI and challenges to address


Today, AI-powered functions such as speech recognition, NLP, and emotion detection are revolutionizing education. AI technologies enable identifying patterns, building algorithms, presenting knowledge, sensing, making and following plans, maintaining true-to-life interactions with people, managing complex learning activities, magnifying human abilities in learning contexts, and supporting learners in accordance with their individual interests and needs. AI allows students to use handwriting, gestures, or speech as input while studying or taking a test.

Along with numerous opportunities, the evolution of AI brings risks and challenges that should be thoroughly investigated and addressed. When applying AI in education, it is important to proceed with caution and consideration to make sure that it is done in a responsible and ethical way, and not to get swept up in the hype, since some AI tools draw on billions of data points available to everyone on the web. Another challenge associated with AI is variability in its performance: some functions are performed at a superior level (such as identifying patterns in data), while others are quite primitive (such as sustaining an in-depth conversation). Even though AI is very powerful, human beings still play a crucial role in verifying AI’s output to avoid plagiarism and falsification of information.

 

Conclusion

AI is already massively applied in education around the world. With the right guidance and frameworks in place, AI-powered technologies can help build more efficient and equitable learning experiences. Today we have an opportunity to witness how AI- and ML-based approaches contribute to development of individualized, personalized, and adaptive learning.

ASC’s CEO, Dr Thompson, presented several topics on AI at the 2023 ATP Conference in Dallas, TX. If you are interested in utilizing AI-powered services provided by ASC, please do not hesitate to contact us!

 
