The Practical Limits of AI Reasoning
Investigating AI reasoning through the real-world task of scheduling student defenses for a capstone project
Thanks for joining our growing community of educators exploring AI in learning! With 200-300 new subscribers each week, you're now part of an important conversation about how AI is transforming education at all levels.
Everything we share remains freely available, though paid subscriptions help support our work. Questions about anything I cover? Just DM me - always happy to chat!
Nick
What Scheduling Student Defenses Taught Me About AI's Real Capabilities
In education circles, there's growing enthusiasm about AI's potential to transform how we work. Articles abound about AI writing assistants, content generators, and personalized learning tools. I've been increasingly impressed with AI's writing and editorial skills myself, which fed my perception that AI capabilities were advancing at a remarkable pace. But using AI for writing is one thing; testing its organizational, analytical, and logical capacities revealed a different story entirely.
The Challenge: A Real-World Test Case
At my school, we run a senior capstone program that begins halfway through junior year. Students select a topic of interest, study scholarly sources, write an annotated bibliography that evolves into a literature review, formulate research questions, conduct fieldwork, and design their own expression of research. The process culminates in April when students present their findings to a panel of 2-3 school staff members before giving a final presentation to the whole community.
Our school is small, but scheduling 40 student defenses in a single week—coordinating faculty, staff, and students to find available time and space—is an arduous task that seemed perfect for AI assistance. I decided to test one of the most powerful systems available to me: ChatGPT's o3 model, which is specifically touted for its logic and reasoning capabilities.
Setting Up the Experiment
If you follow AI experts like Lance Cummings, you know that working with AI for research and analysis requires meticulously curating your documents so the AI can effectively use your information. I spent about an hour breaking down three different spreadsheets containing vital scheduling information.
One note before going further: I meticulously anonymized all of my data. Student and faculty names were coded before being entered into the AI system. In educational settings, protecting student and faculty information is non-negotiable, especially when using external AI systems.
Student availability: Approximately 40 seniors, each with 2 free periods in an 8-period block schedule that runs on a 6-day cycle. Over the 4-day defense week, each student had two opportunities to be scheduled during a given free period.
Faculty/staff availability: Our generous faculty volunteered their free periods. Some offered 6-8 possible slots, others just 2-3. With a target of 2-3 faculty members per defense panel, we needed between 80 and 120 volunteer slots in total.
Room availability: Though I eventually removed this variable to simplify the process, it remained an important practical consideration.
After the anonymization process, I converted the spreadsheet information into straightforward lists of available periods for each coded participant.
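For the technically inclined, the lists I handed the model looked roughly like the sketch below. The codes, days, and periods here are hypothetical stand-ins, not my actual data, but the shape is the same: each coded person maps to the (day, period) slots they have free during the defense week.

```python
# Hypothetical anonymized availability lists (illustrative only).
# Each code maps to the (day, period) slots free during the 4-day defense week.
student_free = {
    "S01": [(1, 3), (3, 7)],   # each senior has 2 free periods in the cycle
    "S02": [(2, 1), (4, 5)],
    # ... roughly 40 students in total
}

faculty_free = {
    "F01": [(1, 3), (2, 1), (2, 6), (4, 5)],   # some volunteers offered 6-8 slots
    "F02": [(3, 7), (4, 5)],                   # others just 2-3
    # ... enough volunteers to cover 80-120 panel seats
}
```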
The Approach: A Multi-Step Process
I conceptualized the task as a data flow with discrete steps:
Have the AI study and internalize the student schedule
Have the AI analyze faculty/staff availabilities
Ask the AI to build rough schedules based on both datasets
For each step, I wrote distinct, straightforward prompts that included what I call "meta-rules" to govern the process—constraints like "do not schedule faculty and staff for more than 2 defenses each" and "ensure each student is assigned exactly one defense time."
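In hindsight, those meta-rules are easy to state as machine-checkable constraints. Here is a minimal sketch, using the hypothetical structures above, of the checks any generated schedule would have to pass; it is an illustration of the rules, not something I asked the AI to run.

```python
from collections import Counter

def meta_rule_violations(schedule, students, max_per_faculty=2):
    """schedule: list of (student, slot, panel) tuples, where panel is a list
    of faculty codes. Returns a list of human-readable rule violations."""
    problems = []

    # Meta-rule: each student is assigned exactly one defense time.
    per_student = Counter(student for student, _, _ in schedule)
    for s in students:
        if per_student[s] != 1:
            problems.append(f"{s} scheduled {per_student[s]} times (should be exactly 1)")

    # Meta-rule: no faculty/staff member sits on more than max_per_faculty panels.
    per_faculty = Counter(f for _, _, panel in schedule for f in panel)
    for f, n in per_faculty.items():
        if n > max_per_faculty:
            problems.append(f"{f} assigned {n} defenses (limit is {max_per_faculty})")

    return problems
```

Checks like these take a few lines of deterministic code, which is worth keeping in mind as you read what happened next.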
What Actually Happened: AI's Processing Limitations
I've never seen an AI model process for as long as ChatGPT did on some of these steps—and then always with results that completely ignored my meta-rules. In the generated schedules, some teachers had zero defenses assigned while others had ten. Some students weren't scheduled at all.
With each iteration, I tried to refine my prompts and simplify the constraints. At first, I thought the AI should easily accomplish what seemed like a straightforward organizational task. But I soon realized that what I was trying to make happen with AI would probably be better handled through a more traditional coding approach. Peering into the "black box" of AI, I could see that the steps accumulating in my process created too diffuse an "action space" for the model to navigate successfully.
I've experienced this elsewhere in my AI work, for instance, when trying to create multi-step AI mentor prompts for my students. Three or four steps in a pre-programmed AI inquiry experience work reasonably well—these current models can hold onto that structure. But push out to six or seven steps, and everything gets mushy. The model simply can't maintain coherence across all the constraints.
The Implications: Pattern Matching vs. Human Reasoning
This experience raised questions about how we understand current AI systems. What we might call AI "reasoning," if we even dare call it that, is reasoning of a fundamentally different kind from human reasoning. You get the clearest sense of this when you're face to face with its misalignment with your own purposes.
What seemed straightforward to me likely broke down in one of two places: the AI over- or under-reading some key term in my meta-rules, or a disjuncture, still imperceptible to me, in how I configured the day and period schedules across the different lists. Either would impinge on the model's ability to match multiple patterns simultaneously.
A pattern-based form of reasoning feels fundamentally different from a purpose-based form of reasoning. Purpose perhaps is made up of patterns, but it is not entirely reducible to them. Or so it feels right now from this side of the black box.
These experiences also had me thinking a lot about the "vibe coding" phenomenon that's being discussed. Hang on, readers! Don't drop out of your coding classes yet. It seems that certain processes can be "vibe coded" to harness AI's generative capacities, usually with the help of an application that hones and directs that generativity toward particular purposes. But other things require a different tack. Perhaps I could use an AI coding tool to write a program that could do this work for me. But even then, I would be working beyond my comfort zone and already way too invested in a process that, at this point, has only managed to schedule 10 defenses successfully.
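To make that last thought concrete: stripped of the spreadsheets, this is a small constraint-satisfaction problem, the kind a short backtracking routine or an off-the-shelf solver handles deterministically. The sketch below reuses the hypothetical structures from earlier; it is not a program I actually wrote, just an illustration of what a "traditional coding approach" might look like.

```python
from collections import Counter

def schedule_defenses(students, student_free, faculty_free,
                      panel_size=2, max_per_faculty=2):
    """Backtracking search: give each student one slot and a panel of free
    faculty, respecting the per-faculty load limit. Returns a dict of
    {student: (slot, panel)} or None if the constraints can't be satisfied."""
    load = Counter()        # defenses assigned to each faculty member so far
    assignment = {}

    def backtrack(i):
        if i == len(students):
            return True     # every student placed
        student = students[i]
        for slot in student_free[student]:
            # Faculty free in this slot and still under their load limit.
            candidates = [f for f, slots in faculty_free.items()
                          if slot in slots and load[f] < max_per_faculty]
            if len(candidates) < panel_size:
                continue
            panel = candidates[:panel_size]   # naive pick; a real solver branches here too
            assignment[student] = (slot, panel)
            for f in panel:
                load[f] += 1
            if backtrack(i + 1):
                return True
            for f in panel:                   # undo and try the next slot
                load[f] -= 1
            del assignment[student]
        return False

    return assignment if backtrack(0) else None
```

The point isn't this particular code; it's that every constraint is enforced by construction rather than hoped for, which is exactly where the language model kept slipping.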
What This Means for Educators
For educators exploring AI integration, this experience offers valuable lessons:
Know the boundaries: Current AI excels at generating content and answering questions but struggles with multi-step planning problems that require maintaining numerous constraints.
Consider the task type: AI works better for tasks with fewer variables and constraints. Writing assistance? Great. Complex scheduling with multiple interdependent rules? Not so much.
Augment rather than replace: Some processes still require human judgment to resolve conflicts between competing constraints. AI can help analyze data but may not successfully optimize solutions on its own.
Specialized tools matter: General AI models may be less effective than purpose-built software for specific administrative tasks. Sometimes a traditional programming approach or specialized scheduling software remains the better solution.
In the end, I managed to schedule only 10 defenses successfully with AI assistance before reverting to my traditional approach. The experience hasn't diminished my enthusiasm for AI in education, but it has certainly sharpened my understanding of its current capabilities and limitations. As we continue integrating these tools into our schools, recognizing these boundaries will help us apply AI where it truly enhances, rather than complicates, our work.
Check out some of our favorite Substacks:
Mike Kentz’s AI EduPathways: Insights from one of the most insightful, creative, and eloquent AI educators in the business!!!
Terry Underwood’s Learning to Read, Reading to Learn: The most penetrating investigation of the intersections of compositional theory, literacy studies, and AI on the internet!!!
Suzi’s When Life Gives You AI: A cutting-edge exploration of the intersection of computer science, neuroscience, and philosophy
Alejandro Piad Morffis’s Mostly Harmless Ideas: Unmatched investigations into coding, machine learning, computational theory, and practical AI applications
Michael Woudenberg’s Polymathic Being: Polymathic wisdom brought to you every Sunday morning with your first cup of coffee
Rob Nelson’s AI Log: Incredibly deep and insightful essays about AI’s impact on higher ed, society, and culture.
Michael Spencer’s AI Supremacy: The most comprehensive and current analysis of AI news and trends, featuring numerous intriguing guest posts
Daniel Bashir’s The Gradient Podcast: The top interviews with leading AI experts, researchers, developers, and linguists.
Daniel Nest’s Why Try AI?: The most amazing updates on AI tools and techniques
Riccardo Vocca’s The Intelligent Friend: An intriguing examination of the diverse ways AI is transforming our lives and the world around us.
Jason Gulya’s The AI Edventure: An important exploration of cutting-edge innovations in AI-responsive curriculum and pedagogy.