
Research

Our research focuses on understanding what actually works when AI meets education, from automated assessment to tutoring. Below is a selection of our published research.

 

Follow us on LinkedIn to stay up-to-date with our latest work.

Can LLMs Grade Short-answer Reading Comprehension Questions: Generative large language models can reliably evaluate student responses to short-answer reading comprehension questions.

Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana: Assessing oral reading fluency (ORF) with large-scale speech models may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
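To make the ORF idea concrete, here is a minimal Python sketch that estimates words correct per minute from a speech-model transcript. It is not the study's pipeline: the Whisper model size, the "recording.wav" path, the sample passage, and the difflib-based word alignment are all illustrative assumptions.

```python
# Minimal sketch of estimating oral reading fluency (words correct per minute)
# from a speech-model transcript; not the study's actual scoring pipeline.
# Assumptions: the openai-whisper package is installed, "recording.wav" is a
# placeholder path, and difflib alignment stands in for the real scoring rules.
import difflib
import whisper

def words_correct_per_minute(audio_path: str, passage_text: str) -> float:
    """Transcribe a reading and estimate words read correctly per minute."""
    model = whisper.load_model("base")          # small model as a placeholder
    result = model.transcribe(audio_path)
    read_words = result["text"].lower().split()
    passage_words = passage_text.lower().split()

    # Count transcript words that align with the reference passage.
    matcher = difflib.SequenceMatcher(a=passage_words, b=read_words)
    correct = sum(block.size for block in matcher.get_matching_blocks())

    # Take the reading time from the end of the last transcribed segment.
    duration_minutes = result["segments"][-1]["end"] / 60.0
    return correct / duration_minutes

print(words_correct_per_minute(
    "recording.wav",
    "The cat sat on the mat and looked at the moon.",
))
```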

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs' Ability to Mark Short Answer Questions in K-12 Education: LLMs approach human-level performance across a variety of subjects and grade levels, suggesting they could be a valuable tool for supporting low-stakes formative assessment tasks in K-12 education.
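As a rough illustration of LLM-based marking (not the study's actual prompts, rubric, or models), the sketch below asks a chat model to award a mark against a reference answer. The model name, 0-2 mark scale, and example question are placeholders.

```python
# Minimal sketch of LLM-based short-answer marking; not the authors' pipeline.
# Assumptions: the OpenAI Python SDK is installed, OPENAI_API_KEY is set, and
# "gpt-4o-mini" is a placeholder model name; the rubric and example are invented.
from openai import OpenAI

client = OpenAI()

def mark_short_answer(question: str, reference_answer: str, student_answer: str) -> str:
    """Ask an LLM to award a 0-2 mark for a short-answer response."""
    prompt = (
        "You are marking a short-answer question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n"
        "Award a mark of 0, 1, or 2 and give a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic marking
    )
    return response.choices[0].message.content

print(mark_short_answer(
    "Why does the narrator hide the letter?",
    "She is afraid her brother will read it.",
    "Because she doesn't want her brother to see it.",
))
```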

Safe Generative Chats in a WhatsApp Intelligent Tutoring System: Designing and testing safety guardrails for LLM-powered tutoring conversations, with field evidence from 8,000+ student interactions showing that content moderation should focus primarily on student inputs rather than AI outputs.

Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference: Using vetted textbook content to ground LLM responses improves accuracy but may reduce naturalness, revealing key trade-offs in designing AI tutoring systems.
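A minimal sketch of the grounding idea, not the paper's system: TF-IDF retrieval stands in for whatever retriever was actually used, and the textbook passages and student question below are invented placeholders.

```python
# Minimal retrieval-augmented generation sketch: retrieve textbook content,
# then ground the model's prompt in it. Not the paper's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented placeholder passages standing in for vetted textbook content.
textbook_passages = [
    "A fraction represents a part of a whole, written as a numerator over a denominator.",
    "To add fractions with unlike denominators, first rewrite them with a common denominator.",
    "A prime number has exactly two factors: one and itself.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k textbook passages most similar to the query (TF-IDF)."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(textbook_passages + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    return [textbook_passages[i] for i in top]

def build_grounded_prompt(question: str) -> str:
    """Assemble a prompt that grounds the model in retrieved textbook content."""
    context = "\n".join(retrieve(question))
    return (
        "Answer the student's question using only the textbook excerpt below.\n"
        f"Textbook: {context}\n"
        f"Question: {question}"
    )

print(build_grounded_prompt("How do I add 1/3 and 1/4?"))
```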

Leveraging Human Feedback to Scale Educational Datasets: Combining Crowdworkers and Comparative Judgement: Comparative judgement, where crowdworkers choose which of two answers is better, gives non-expert raters substantially higher inter-rater reliability than traditional categorical scoring when evaluating student responses to open-ended reading comprehension questions.
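For readers unfamiliar with comparative judgement, the sketch below shows one common way to turn pairwise preferences into scores for answers: a simple Bradley-Terry fit, which may differ from the paper's analysis. The comparison data are invented placeholders.

```python
# Minimal sketch of scoring answers from pairwise comparative judgements using
# a simple Bradley-Terry fit (iterative minorization-maximization updates).
# Not the paper's analysis; the comparison data below are invented.
from collections import defaultdict

# Each tuple is (winner, loser): the crowdworker preferred the first answer.
comparisons = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]

answers = {a for pair in comparisons for a in pair}
strength = {a: 1.0 for a in answers}  # initial Bradley-Terry strengths

for _ in range(50):  # fixed number of update iterations
    wins = defaultdict(float)
    denom = defaultdict(float)
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_total = strength[winner] + strength[loser]
        denom[winner] += 1.0 / pair_total
        denom[loser] += 1.0 / pair_total
    strength = {a: wins[a] / denom[a] if denom[a] else strength[a] for a in answers}
    total = sum(strength.values())
    strength = {a: s / total for a, s in strength.items()}  # keep scale fixed

# Higher strength means the answer was judged better more often.
for answer, s in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"{answer}: {s:.3f}")
```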
