TEMPLEs: Teacher Evaluation using Metamorphic-mediated Personality-aware LLM Evaluators
IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE), December 2025
Dave Towey, Anthony Bellotti, and Matthew Pike. 2025. TEMPLEs: Teacher Evaluation using Metamorphic-mediated Personality-aware LLM Evaluators. In IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE). DOI: https://doi.org/10.1109/TALE66047.2025.11346704
Dave Towey, Anthony Bellotti, and Matthew Pike. (2025). TEMPLEs: Teacher Evaluation using Metamorphic-mediated Personality-aware LLM Evaluators. IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE). https://doi.org/10.1109/TALE66047.2025.11346704
Dave Towey, Anthony Bellotti, and Matthew Pike. "TEMPLEs: Teacher Evaluation using Metamorphic-mediated Personality-aware LLM Evaluators." IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE), 2025. https://doi.org/10.1109/TALE66047.2025.11346704
Dave Towey, Anthony Bellotti, and Matthew Pike. 2025. TEMPLEs: Teacher Evaluation using Metamorphic-mediated Personality-aware LLM Evaluators. IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE). doi:10.1109/TALE66047.2025.11346704
Dave Towey, Anthony Bellotti, and Matthew Pike, "TEMPLEs: Teacher Evaluation using Metamorphic-mediated Personality-aware LLM Evaluators," IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE), 2025. doi: 10.1109/TALE66047.2025.11346704
@inproceedings{tale-2025,
  title={TEMPLEs: Teacher Evaluation using Metamorphic-mediated Personality-aware LLM Evaluators},
  author={Dave Towey and Anthony Bellotti and Matthew Pike},
  booktitle={IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE)},
  year={2025},
  doi={10.1109/TALE66047.2025.11346704}
}
teaching evaluation, large language models, metamorphic testing, oracle problem, higher education, open educational resource
Abstract
This paper proposes a novel framework for evaluating teaching quality in higher education (HE) using multiple large language models (LLMs). It draws an analogy between the distinct "personalities" of different LLMs and the varied perspectives of human peer reviewers, and uses this analogy to explore the inherent difficulty of objectively assessing teaching effectiveness, which we describe as an instance of the "oracle problem": a well-studied challenge in software testing and engineering that arises when correctness cannot be categorically determined. Traditional teaching and teacher evaluation methods are notoriously subjective and prone to bias. Our proposed solution uses a suite of LLMs as a "panel of peer reviewers," taking advantage of their diverse analytical styles (personalities). The framework incorporates a validation layer based on metamorphic testing (MT), a software-testing approach with an excellent record of alleviating the oracle problem: rather than checking the correctness of individual outputs, MT checks relations that should hold across multiple executions (called metamorphic relations, MRs), with a violated relation revealing an error. This work is contextualised within the recent, rapid adoption of generative AI in HE, particularly within Sino-foreign HE institutions (SfHEIs). A key aim of this project is to package the framework as a dual-purpose open educational resource (OER): a formal tool to support teaching and teacher evaluation, and a resource to help both in-service and pre-service educators with self-reflection and continuous professional development (CPD).
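To make the MT validation layer concrete, the minimal sketch below shows one possible metamorphic relation (MR) applied to an LLM-based teaching evaluator. All names here are hypothetical assumptions for illustration: score_teaching stands in for whatever interface wraps an LLM evaluator, and the segment-reordering relation is one plausible example MR, not a relation taken from the paper.

"""Illustrative MT check over a hypothetical LLM teaching evaluator.

`score_teaching` is an assumed interface (transcript -> numeric
quality score); it is not the paper's implementation.
"""

from typing import Callable


def mr_segment_order(
    score_teaching: Callable[[str], float],
    segments: list[str],
    tolerance: float = 0.5,
) -> bool:
    """Example MR: reordering independent lesson segments should not
    materially change the overall teaching-quality score.

    Returns True if the relation holds (no violation detected).
    """
    source_score = score_teaching("\n".join(segments))
    follow_up_score = score_teaching("\n".join(reversed(segments)))
    # A violation does not pinpoint which score is wrong (the oracle
    # problem remains), but it flags the evaluator as suspect.
    return abs(source_score - follow_up_score) <= tolerance


if __name__ == "__main__":
    # Stub evaluator for demonstration: scores by keyword density.
    def stub_evaluator(transcript: str) -> float:
        keywords = ("example", "feedback", "question")
        hits = sum(transcript.lower().count(k) for k in keywords)
        return min(10.0, hits / max(len(transcript.split()), 1) * 100)

    lesson = [
        "Today we work through an example of recursion.",
        "Pause here: what question would you ask the class?",
        "I give feedback on each submitted solution.",
    ]
    print("MR satisfied:", mr_segment_order(stub_evaluator, lesson))

Note that a violated MR flags a suspect evaluation without identifying which individual output is wrong; this is how MT alleviates, rather than solves, the oracle problem described above.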