Research
Our research focuses on understanding what actually works when AI meets education, from automated assessment to tutoring. Below is a selection of our published research.
Follow us on LinkedIn to stay up to date with our latest work.
Can LLMs Grade Short-answer Reading Comprehension Questions: Finds that generative large language models can reliably evaluate student answers to short-answer reading comprehension questions.
Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana: Using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education: Near-human-level marking performance across a variety of subjects and grade levels suggests that LLMs could be a valuable tool for supporting low-stakes formative assessment in K-12 education.
AI-Powered Maths Tutor in Ghana: Students who used an AI math tutor accessible via WhatsApp showed significantly improved math performance compared to control groups, demonstrating positive learning gains.
Safe Generative Chats in a WhatsApp Intelligent Tutoring System: Designing and testing safety guardrails for LLM-powered tutoring conversations, with field evidence from 8,000+ student interactions showing that content moderation should focus primarily on student inputs rather than AI outputs.
Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference: Using vetted textbook content to ground LLM responses improves accuracy but may reduce naturalness, revealing key trade-offs in designing AI tutoring systems.
Leveraging Human Feedback to Scale Educational Datasets: Combining Crowdworkers and Comparative Judgement: Demonstrates that comparative judgement (where crowdworkers choose which of two answers is better) substantially improves inter-rater reliability over traditional categorical scoring when using non-expert crowdworkers to evaluate student responses to open-ended reading comprehension questions.
