What Is the Matter with Grading in an AI-Mediated Classroom?
So many students never develop the resilience to separate their sense of self from the inconsistency of institutional evaluation.
The response to the Thinking with AI Student Workbook (e-book in progress!!!) and the accompanying training sessions has been genuinely humbling. Thank you. This week I am releasing the Parent and Guardian Workshop, which brings that seven-part paid series to a close. If you have been following along, this final installment completes the arc. If you are coming to it fresh, the full series is available in the archive.
And then, in April: Talking with Machines: The Subtle Art of Working with AI. The premise is simple and I think radical — that the breakthrough moments in how we interact with LLMs are simultaneously the best diagnostic tools we have for understanding what these machines actually are. AI literacy, at its deepest, is not a framework you learn. It is something you accumulate through close, honest attention to what happens in the room between a human and a machine. More soon.
She already knew something was wrong before she handed me the essay.
She is an eighth grader in a mid-sized Ohio public school — driven, intellectually curious, the kind of student who pushes back on an argument not because she is obstinate but because she is actually thinking. She writes with an elevated vocabulary. She has a voice, and she knows it. She came to our tutoring session having already run her argumentative essay through an AI-powered feedback tool her teacher had set up on a major ed-tech platform. The tool had scored it 100%.
She wanted me to look anyway.
What I found was a competent essay, well-scaffolded, formulaic in the way that state-testing environments reward, and built substantially on the sentence frames her teacher had provided in class. She told me, matter-of-factly, that she had noticed over time that her scores improved the more directly she incorporated her teacher’s language. So she had. The AI agreed. One hundred percent.
The Formula and What It Leaves Out
But there was slippage in the counterargument paragraph. The assignment, like so many argumentative essays assigned to students over the past fifteen years, asked her to weigh in on the impacts of social media. She had not chosen her side; the pro-social-media-as-business-tool position had been assigned to her. In the counterargument, the "some say" sentence raised the superiority of personal websites for promotion and advertising. The "others respond" sentence drifted back to a point already made in body paragraph one about the potential lower cost of social media, rather than engaging directly with what the "some say" had actually argued.
We spent twenty minutes on it. We talked through the real tension: the personalization of a website versus the algorithmic reach of social media. We noticed together that the formula had no room for a blended approach. The either/or logic was baked into the instruction, and likely into the state rubric beneath it. We revised the counterclaim. We returned to a quoted statistic that 50% of companies use social media to promote their brands, and I asked the pointed question: so what about the other 50%? What followed was a genuine conversation about how to word a justification carefully, how to make a number feel significant without overstating it, how to argue honestly inside a constrained form.
It was, in other words, exactly the kind of instruction writing teachers hope to provide.
When we finished, she ran the revised essay through the AI tool again. It scored her an 80%. The feedback was general to the point of near-uselessness. The tool attempted to quote her essay to identify specific problems, but the quoted passages did not accurately reflect her actual text. She ran it again, same essay, new chat window. This time it scored a 90%.
She looked at the screen. She looked at me.
What an Inconsistent Score Actually Costs
This is the part worth pausing on, because what happened next is not simply a story about a flawed tool. It is a story about what flawed tools do to the ecosystems around them.
AI feedback tools in educational settings exist on a wide spectrum of reliability. At the more rigorous end, automated scoring systems used by state assessments are normed against human-scored samples, calibrated carefully, and tested for consistency before deployment. What this student encountered was something different: a general-purpose AI embedded in a classroom platform, handed a rubric, and asked to approximate summative evaluation. The distinction matters enormously, because students and teachers cannot always see it. The tool presents with the same confidence regardless of its actual precision. It scores. It comments. It feels authoritative.
When it contradicts itself across two runs of the same essay, that authority does not quietly recede. It explodes outward. And what it takes with it is not just the student’s confidence in the tool. In that moment, sitting with my tutee, I watched it erode her confidence in her teacher’s approach, in the state testing apparatus the formula was built to serve, in the twenty minutes of genuine intellectual work we had just done together, and, most quietly and most damagingly, in her own judgment as a writer.
Why did we spend all that time, her silence was asking, if the result is a lower grade?
I have taught in public school classrooms under similar conditions. I know what it is to pitch instruction toward a struggling cohort while a student like her sits in the back, capable of far more. I understand why a teacher reaches for a tool that promises to lighten the feedback load. I am not here to assign blame. I am here to say that an insufficiently optimized AI tool, deployed in a high-stakes evaluative role, does not simply fail to help. It actively destabilizes the feedback ecosystem that learning depends on. The connective tissue between effort, quality, and outcome, which has always been imperfect, now has a new and visible fault line running through it.
There is a version of this story that ends with a call to action: we need better tools, better teacher training, a clearer distinction between formative feedback and summative evaluation. All of that is true. All of that is necessary. But those conditions are a horizon, not a resolution, and I want to be honest about the distance between where we are and where we need to be.
Resilience, or Something We Are Mistaking for It
Because here is what actually happened at the end of our session.
She thought about it for a moment. She restored the original essay, the one that had scored 100%, and she submitted it. Then she closed her laptop and moved on.
I have been thinking about that moment ever since. There is something in it that reads like maturity. She assessed the situation clearly, made a pragmatic decision, declined to be derailed by an unreliable signal. That is a genuine cognitive competency, and one that many students never develop.
But I am not sure I am allowed to call it maturity.
Because what she also did, in that same gesture, was disengage from the process entirely. She did not advocate. She did not push back. She looked at a system that had failed her and decided, reasonably, that the most efficient response was to stop trusting it and comply with its most forgiving version. She is twelve years old, and she has already learned that lesson.
So many students never develop the resilience to separate their sense of self from the inconsistency of institutional evaluation. They internalize the grade. They disengage. School becomes a game you are winning or losing, and the game has no reliable referee. What worries me is not just the students who will be broken by that. What worries me is the students who will handle it exactly the way she did.
I want to call what she did that afternoon maturity.
I am not sure I am allowed to.
Nick Potkalitsky, Ph.D.
Check out some of our favorite Substacks:
Mike Kentz’s AI EduPathways: Insights from one of our most insightful, creative, and eloquent AI educators in the business!!!
Terry Underwood’s Learning to Read, Reading to Learn: The most penetrating investigation of the intersections between compositional theory, literacy studies, and AI on the internet!!!
Suzi’s When Life Gives You AI: A cutting-edge exploration of the intersection among computer science, neuroscience, and philosophy
Alejandro Piad Morffis’s The Computerist Journal: Unmatched investigations into coding, machine learning, computational theory, and practical AI applications
Michael Woudenberg’s Polymathic Being: Polymathic wisdom brought to you every Sunday morning with your first cup of coffee
Rob Nelson’s AI Log: Incredibly deep and insightful essays about AI’s impact on higher ed, society, and culture.
Michael Spencer’s AI Supremacy: The most comprehensive and current analysis of AI news and trends, featuring numerous intriguing guest posts
Daniel Bashir’s The Gradient Podcast: The top interviews with leading AI experts, researchers, developers, and linguists.
Daniel Nest’s Why Try AI?: The most amazing updates on AI tools and techniques
Jason Gulya’s The AI Edventure: An important exploration of cutting-edge innovations in AI-responsive curriculum and pedagogy

What the inconsistent scoring did was to sever the connection between effort and outcome at the exact moment genuine thinking had produced real work. She spent twenty minutes doing the thing we actually want students to do, then watched it cost her marks. The lesson that teaches is about whether the game is worth playing honestly.
I've watched a version of this across 40 years in NZ classrooms. Students who had to work hardest to meet expectations often developed the tenacity that carried them through transition points. Students for whom the system consistently rewarded compliance over thinking often hadn't needed to build it. AI now offers every student the compliance pathway at scale. When the feedback ecosystem can't distinguish between them, the rational move is exactly what your student did.
The fix isn't better tools, though you're right that we need them. It's what you already did in that session: make the thinking the assessed thing, not the product. When the trace matters more than the score, the tool's inconsistency stops being decisive.
The worst part is that this has nothing to do with AI. AI is just grading a predefined bias. It's also interesting that the 'drive to the algo' was something I first noticed in resumes: the over-inflation of tasks and the reframing to slip past the ATS and lazy screeners, optimized for the filter rather than what the hiring company actually wanted.