The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own

University of Illinois Urbana-Champaign

Abstract

This paper presents a comprehensive investigation into the capability of Large Language Models (LLMs) to successfully complete a semester-long undergraduate control systems course. Through evaluation of 115 course deliverables, we assess LLM performance using ChatGPT under a "minimal effort" protocol that simulates realistic student usage patterns. The investigation employs a rigorous testing methodology across multiple assessment formats, from auto-graded multiple choice questions to complex Python programming tasks and long-form analytical writing. Our analysis provides quantitative insights into AI's strengths and limitations in handling mathematical formulations, coding challenges, and theoretical concepts in control systems engineering. The LLM achieved a B-grade performance (82.24%), approaching but not exceeding the class average (84.99%), with strongest results in structured assignments and greatest limitations in open-ended projects. The findings inform discussions about course design adaptation in response to AI advancement, moving beyond simple prohibition towards thoughtful integration of these tools in engineering education.

Overview of Results

Overall performance: The figure below shows LLM performance across assessment types and prompting methods, based on three runs per question. While ChatGPT's exact wording varied between runs, these variations had minimal impact on scoring. Several key patterns emerge. First, context-enhanced prompting consistently outperforms the other methodologies across all question types. Second, moving from image-based to text-based inputs yields systematic improvement, particularly on questions involving mathematical notation. The LLM achieved an overall score of 82.24% using context-enhanced prompting, compared to the class average of 84.99% -- both corresponding to a 'B' grade. This gap varies by assessment type: it is minimal in structured assignments but substantial in open-ended projects.

[Figure: LLM performance by assessment type. Panels: (a) Overall performance, (b) HW performance, (c) Project performance, (d) Examination performance.]
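For concreteness, the sketch below illustrates how the prompting methodologies described above and the three-runs-per-question averaging could be driven programmatically. It is a minimal sketch under stated assumptions: query_llm is a hypothetical placeholder for the chat-model call, and the prompt templates and course-context string are illustrative, not the exact ones used in the study.

# Hypothetical sketch of the evaluation loop: each question is posed three
# times per prompting methodology and the grader's scores are averaged.
# query_llm stands in for whatever chat-model API is used; it is not the
# interface from the study, and the prompt templates are illustrative only.
from statistics import mean


def build_prompt(question, method, examples=None):
    """Assemble a text prompt for one of the three text-based methodologies."""
    if method == "zero-shot":
        return question
    if method == "multi-shot":
        shots = "\n\n".join(examples or [])
        return f"{shots}\n\n{question}"
    if method == "context-enhanced":
        # Assumed course-context string; the actual context used may differ.
        context = ("Course context: undergraduate control systems "
                   "(state-space models, transfer functions, feedback design).")
        shots = "\n\n".join(examples or [])
        return f"{context}\n\n{shots}\n\n{question}"
    raise ValueError(f"unknown prompting method: {method}")


def query_llm(prompt):
    """Placeholder for the actual chat-model call."""
    raise NotImplementedError


def score_question(question, grader, method, examples=None, runs=3):
    """Average the grader's score over several independent runs of the model."""
    prompt = build_prompt(question, method, examples)
    return mean(grader(query_llm(prompt)) for _ in range(runs))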

Performance Data

Table 1: LLM performance across assessment types using various prompting methodologies. All values are percentages.

Category  Sub-type              Image Based             Simplified Text         Context-Enhanced
                                Zero-shot   Multi-shot  Zero-shot   Multi-shot  Multi-shot
HW        MCQ                   89.5        93.2        91.2        94.8        96.5
          MCMCQ                 85.3        88.4        86.7        90.2        92.1
          Numerical             82.4        85.6        84.2        87.5        89.3
          Code-Based            86.5        89.8        88.2        91.3        93.2
Projects  Code                  --          --          56.2        57.6        58.5
          Report                --          --          62.8        64.9        65.8
Exam      Mid-Term              85.3        87.5        86.7        88.8        89.8
          Finals: Written       83.0        84.1        83.8        84.6        86.5
          Finals: Auto-graded   93.5        95.3        94.8        96.2        97.4
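To make the grade roll-up concrete, the short sketch below computes a weighted course total from the category totals reported in the following subsections. The syllabus weights are assumed placeholders for illustration; they are not the actual course weighting and do not reproduce the exact 82.24% overall figure.

# Hypothetical weighted-grade aggregation. The component scores are the
# category totals reported in this section; the weights are assumed for
# illustration and may differ from the actual syllabus.
component_scores = {
    "homework": 90.38,  # reported homework total
    "projects": 64.34,  # reported project total
    "exams": 89.72,     # reported examination total
}

# Assumed weights (must sum to 1.0); the real weighting may differ.
weights = {
    "homework": 0.30,
    "projects": 0.30,
    "exams": 0.40,
}

overall = sum(component_scores[k] * weights[k] for k in component_scores)
print(f"Weighted course score: {overall:.2f}%")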

Homework Performance:

Homework performance reveals subtle patterns across question types. The LLM achieved 90.38% against a class average of 91.44%. MCQs showed the highest success rate, followed by code-based questions, MCMCQs, and numerical problems; this hierarchy persisted across all prompting methodologies, though the size of the gaps varied. The multi-shot approach proved particularly effective for MCQs. Analysis of the 92 homework questions shows that performance degradation correlated strongly with question complexity: single-concept questions saw higher success rates than integration-heavy problems. A sketch of one way to quantify this correlation follows.
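The complexity-performance relationship noted above could be checked with a rank correlation over per-question results. The sketch below is purely illustrative: the complexity ratings and scores are made-up placeholders standing in for the 92 graded homework questions, and the 1-3 complexity scale is an assumption.

# Hypothetical check of the complexity-performance relationship using a
# Spearman rank correlation. The (complexity, score) pairs are placeholders.
from scipy.stats import spearmanr

# complexity: 1 = single-concept, 3 = integration-heavy (assumed scale)
complexity = [1, 1, 2, 2, 2, 3, 3, 3]
scores = [1.00, 0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60]

rho, p_value = spearmanr(complexity, scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")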

Examination Performance:

Examination analysis provides critical insights into LLM capabilities under different assessment conditions. The model achieved 89.72% overall compared to the class average of 84.81%, but this aggregate masks important variations. The auto-graded components of the final (97.4%) significantly outperformed the written sections (86.5%), with midterm performance (89.8%) falling in between. This pattern was consistent across prompting methodologies. The gap between written and auto-graded components suggests fundamental differences in the model's ability to handle structured versus open-ended problems.

Project Performance:

Project evaluation exposed systematic limitations in LLM capabilities, producing the largest performance gap observed (64.34% versus a class average of 80.99%). The split between code implementation and report writing reveals distinct challenges. Code submissions maintained basic functional correctness but failed consistently in system integration, error handling, and optimization. Report analysis indicates stronger performance in methodology description and result presentation but weaker performance in critical analysis and design justification. Neither image-based nor multi-shot prompting provided significant improvements in project performance, suggesting fundamental limitations rather than methodology-dependent constraints.

Example Questions and Responses

Acknowledgments

This work was supported by the Grants for Advancement of Teaching in Engineering program at the Grainger College of Engineering, University of Illinois Urbana-Champaign. The authors thank Prof. Timothy Bretl for developing the course materials, assignments, projects, and PrairieLearn infrastructure. We appreciate Grayson Schaer for creating the project environments critical to our evaluation methodology, and Pranay Thangeda for his contributions to PrairieLearn questions and lectures. We also thank all others who contributed to the course materials and supported this research.