Research
Our research focuses on understanding what actually works when AI meets education, from automated assessment to tutoring. Below is a selection of our published research.
Follow us on LinkedIn to stay up to date with our latest work.
Can LLMs Grade Short-answer Reading Comprehension Questions: Finds that generative large language models can reliably evaluate student answers to short-answer reading comprehension questions.
Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana: Using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education: Near-human-level marking performance across a variety of subjects and grade levels suggests that LLMs could be a valuable tool for supporting low-stakes formative assessment in K-12 education.
AI-Powered Maths Tutor in Ghana: Students who used an AI math tutor accessible via WhatsApp showed significantly improved math performance compared to control groups, demonstrating positive learning gains.
Safe Generative Chats in a WhatsApp Intelligent Tutoring System: Designing and testing safety guardrails for LLM-powered tutoring conversations, with field evidence from 8,000+ student interactions showing that content moderation should focus primarily on student inputs rather than AI outputs.
Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference: Using vetted textbook content to ground LLM responses improves accuracy but may reduce naturalness, revealing key trade-offs in designing AI tutoring systems.
Leveraging Human Feedback to Scale Educational Datasets: Combining Crowdworkers and Comparative Judgement: Demonstrates that comparative judgement (where crowdworkers choose which of two answers is better) substantially improves inter-rater reliability over traditional categorical scoring when using non-expert crowdworkers to evaluate student responses to open-ended reading comprehension questions.
