Large language models (LLMs) have recently gained attention in automated writing evaluation (AWE) due to their flexibility, ease of use, and free accessibility. However, most existing studies have relied on standardized rubrics and detailed scoring guidelines to guide model outputs. Recent evidence suggests that LLMs can adapt their scoring behavior through example-based calibration. Building on this insight, the present study examines whether ChatGPT-4o can mirror individual instructors’ evaluative tendencies. The data consisted of 100 previously graded final exam writing samples from Saudi students of English as a second language (ESL), provided by five instructors in a Saudi university’s Bachelor of Arts program. GPT (Generative Pre-trained Transformer) was calibrated with a subset of the instructor-graded writing samples to improve its alignment with human grading criteria; the subsequent analysis covered the 82 samples not used in calibration. Results revealed a strong, statistically significant positive correlation (r = 0.816, p < .001) between GPT scores and teacher-assigned scores. Descriptive analyses further indicated differential scoring tendencies: GPT was more generous toward lower-quality writing samples, assigning higher mean scores than the human raters, whereas teachers tended to award higher scores than GPT for high-quality writing samples. These findings suggest that GPT, particularly when effectively calibrated, can mirror teacher grading practices, though with notable differences at the performance extremes. Consequently, this study highlights GPT’s potential as a complementary assessment tool in ESL writing instruction.
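The paper does not publish its calibration prompt or analysis scripts, but the procedure the abstract describes (conditioning GPT-4o on instructor-graded samples, then correlating GPT scores with teacher scores on the held-out essays) can be sketched as below. The rubric text, score scale, and identifiers such as `RUBRIC`, `calibration_examples`, and `score_essay` are illustrative assumptions, not the study’s materials; only the OpenAI chat API and SciPy’s `pearsonr` are real interfaces.

```python
# A minimal sketch of the two steps the abstract describes: (1) example-based
# calibration of GPT-4o with instructor-graded essays supplied as few-shot
# messages, and (2) a Pearson correlation check between GPT and teacher scores.
# The prompt wording, rubric, and score scale are assumptions, not the study's.
# Requires the OpenAI Python SDK (>= 1.x, with OPENAI_API_KEY set) and SciPy.
from openai import OpenAI
from scipy.stats import pearsonr

client = OpenAI()

RUBRIC = ("You are grading ESL final-exam essays. Reply with a single numeric "
          "score, matching the grading behavior shown in the examples.")

# Hypothetical held-back instructor-graded samples (the study reserved 18 of
# its 100 essays for calibration and analyzed the other 82).
calibration_examples = [
    ("Essay text graded by the instructor ...", 14),
    ("Another instructor-graded essay ...", 9),
]

def score_essay(essay: str) -> float:
    """Score one essay with GPT-4o, conditioned on the graded examples."""
    messages = [{"role": "system", "content": RUBRIC}]
    for text, grade in calibration_examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": str(grade)})
    messages.append({"role": "user", "content": essay})
    reply = client.chat.completions.create(
        model="gpt-4o", messages=messages, temperature=0
    )
    return float(reply.choices[0].message.content.strip())

# Agreement on the held-out essays, mirroring the reported analysis
# (the paper finds r = 0.816, p < .001 across 82 samples).
held_out_essays = ["First ungraded essay ...", "Second ungraded essay ..."]
teacher_scores = [12.0, 17.0]  # matching instructor grades
gpt_scores = [score_essay(e) for e in held_out_essays]
r, p = pearsonr(teacher_scores, gpt_scores)
print(f"Pearson r = {r:.3f}, p = {p:.3g}")
```

Setting `temperature=0` keeps the scoring as deterministic as the API allows, which matters when comparing model scores against fixed human grades.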