The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own

University of Illinois Urbana-Champaign

Abstract

This paper presents a comprehensive investigation into the capability of Large Language Models (LLMs) to successfully complete a semester-long undergraduate control systems course. Through evaluation of 115 course deliverables, we assess LLM performance using ChatGPT under a "minimal effort" protocol that simulates realistic student usage patterns. The investigation employs a rigorous testing methodology across multiple assessment formats, from auto-graded multiple choice questions to complex Python programming tasks and long-form analytical writing. Our analysis provides quantitative insights into AI's strengths and limitations in handling mathematical formulations, coding challenges, and theoretical concepts in control systems engineering. The LLM achieved a B-grade performance (82.24%), approaching but not exceeding the class average (84.99%), with strongest results in structured assignments and greatest limitations in open-ended projects. The findings inform discussions about course design adaptation in response to AI advancement, moving beyond simple prohibition towards thoughtful integration of these tools in engineering education.

Overview of Results

Overall performance: The figure below shows LLM performance across assessment types and prompting methods, based on three runs per question. While ChatGPT's exact wording varied between runs, these variations had minimal impact on scoring. Several key patterns emerge. First, context-enhanced prompting consistently outperforms the other methodologies across all question types. Second, moving from image-based to text-based inputs yields systematic improvement, particularly on questions involving mathematical notation. The LLM achieved an overall score of 82.24% using context-enhanced prompting, compared to the class average of 84.99% -- both corresponding to a 'B' grade. This gap varies by assessment type: it is minimal in structured assignments but substantial in open-ended projects.

[Figure: LLM performance by assessment type. Panels: (a) Overall performance, (b) HW performance, (c) Project performance, (d) Examination performance.]
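For concreteness, the sketch below illustrates how the prompting methodologies described above and the three-runs-per-question averaging could be driven programmatically. It is a minimal sketch under stated assumptions: query_llm is a hypothetical placeholder for the chat-model call, and the prompt templates and course-context string are illustrative, not the exact ones used in the study.

# Hypothetical sketch of the evaluation loop: each question is posed three
# times per prompting methodology and the grader's scores are averaged.
# query_llm stands in for whatever chat-model API is used; it is not the
# interface from the study, and the prompt templates are illustrative only.
from statistics import mean


def build_prompt(question, method, examples=None):
    """Assemble a text prompt for one of the three text-based methodologies."""
    if method == "zero-shot":
        return question
    if method == "multi-shot":
        shots = "\n\n".join(examples or [])
        return f"{shots}\n\n{question}"
    if method == "context-enhanced":
        # Assumed course-context string; the actual context used may differ.
        context = ("Course context: undergraduate control systems "
                   "(state-space models, transfer functions, feedback design).")
        shots = "\n\n".join(examples or [])
        return f"{context}\n\n{shots}\n\n{question}"
    raise ValueError(f"unknown prompting method: {method}")


def query_llm(prompt):
    """Placeholder for the actual chat-model call."""
    raise NotImplementedError


def score_question(question, grader, method, examples=None, runs=3):
    """Average the grader's score over several independent runs of the model."""
    prompt = build_prompt(question, method, examples)
    return mean(grader(query_llm(prompt)) for _ in range(runs))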

Performance Data

Table 1: LLM performance across assessment types using various prompting methodologies. All values are percentages.

Category  Sub-type              Image Based             Simplified Text         Context-Enhanced
                                Zero-shot   Multi-shot  Zero-shot   Multi-shot  Multi-shot
HW        MCQ                   89.5        93.2        91.2        94.8        96.5
          MCMCQ                 85.3        88.4        86.7        90.2        92.1
          Numerical             82.4        85.6        84.2        87.5        89.3
          Code-Based            86.5        89.8        88.2        91.3        93.2
Projects  Code                  --          --          56.2        57.6        58.5
          Report                --          --          62.8        64.9        65.8
Exam      Mid-Term              85.3        87.5        86.7        88.8        89.8
          Finals: Written       83.0        84.1        83.8        84.6        86.5
          Finals: Auto-graded   93.5        95.3        94.8        96.2        97.4
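To make the grade roll-up concrete, the short sketch below computes a weighted course total from the category totals reported in the following subsections. The syllabus weights are assumed placeholders for illustration; they are not the actual course weighting and do not reproduce the exact 82.24% overall figure.

# Hypothetical weighted-grade aggregation. The component scores are the
# category totals reported in this section; the weights are assumed for
# illustration and may differ from the actual syllabus.
component_scores = {
    "homework": 90.38,  # reported homework total
    "projects": 64.34,  # reported project total
    "exams": 89.72,     # reported examination total
}

# Assumed weights (must sum to 1.0); the real weighting may differ.
weights = {
    "homework": 0.30,
    "projects": 0.30,
    "exams": 0.40,
}

overall = sum(component_scores[k] * weights[k] for k in component_scores)
print(f"Weighted course score: {overall:.2f}%")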

Homework Performance:

Homework performance reveals subtle patterns across question types. The LLM achieved 90.38% against a class average of 91.44%. MCQs showed the highest success rate, followed by code-based questions, MCMCQs, and numerical problems; this hierarchy persisted across all prompting methodologies, though the size of the gaps varied. The multi-shot approach proved particularly effective for MCQs. Analysis of the 92 homework questions shows that performance degradation correlated strongly with question complexity: single-concept questions saw higher success rates than integration-heavy problems. A sketch of one way to quantify this correlation follows.
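The complexity-performance relationship noted above could be checked with a rank correlation over per-question results. The sketch below is purely illustrative: the complexity ratings and scores are made-up placeholders standing in for the 92 graded homework questions, and the 1-3 complexity scale is an assumption.

# Hypothetical check of the complexity-performance relationship using a
# Spearman rank correlation. The (complexity, score) pairs are placeholders.
from scipy.stats import spearmanr

# complexity: 1 = single-concept, 3 = integration-heavy (assumed scale)
complexity = [1, 1, 2, 2, 2, 3, 3, 3]
scores = [1.00, 0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60]

rho, p_value = spearmanr(complexity, scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")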

Examination Performance:

Examination analysis provides critical insights into LLM capabilities under different assessment conditions. The model achieved 89.72% overall compared to the class average of 84.81%, but this aggregate masks important variations. The auto-graded components of the final (97.4%) significantly outperformed the written sections (86.5%), with midterm performance (89.8%) falling in between. This pattern was consistent across prompting methodologies. The gap between written and auto-graded components suggests fundamental differences in the model's ability to handle structured versus open-ended problems.

Project Performance:

Project evaluation exposed systematic limitations in LLM capabilities, producing the largest performance gap observed (64.34% versus a class average of 80.99%). The split between code implementation and report writing reveals distinct challenges. Code submissions maintained basic functional correctness but failed consistently in system integration, error handling, and optimization. Report analysis indicates stronger performance in methodology description and result presentation but weaker performance in critical analysis and design justification. Neither image-based nor multi-shot prompting provided significant improvements in project performance, suggesting fundamental limitations rather than methodology-dependent constraints.

Example Questions and Responses

Acknowledgments

This work was supported by the Grants for Advancement of Teaching in Engineering program at the Grainger College of Engineering, University of Illinois Urbana-Champaign. The authors thank Prof. Timothy Bretl for developing the course materials, assignments, projects, and PrairieLearn infrastructure. We appreciate Grayson Schaer for creating the project environments critical to our evaluation methodology, and Pranay Thangeda for his contributions to PrairieLearn questions and lectures. We also thank all others who contributed to the course materials and supported this research.