"If Testing Companies Use AI to Grade, Why Can't We?"
Understanding the Technology Behind Automated Essay Scoring
Thank you. The response to Thinking With AI: A Student’s Guide to Literacy in an AI-Rich World has been more than I could have hoped for. Your comments, your shares, your willingness to put these ideas into practice with your students. That’s what makes this community something special.
The final two chapters dropped on Monday. Chapter 6 (Writing) and Chapter 7 (Conclusion) close out the guide by tackling the literacy domain where AI anxiety runs highest and pulling everything together into a vision for what disciplinary AI literacy actually looks like in practice.
After that, I’m turning my attention to the next wave of paid subscriber content: a process manual for running half-day and full-day DSAIL workshops (March) and a teacher manual for using the student workbook to integrate AI literacy into existing disciplinary contexts (April or May). Practical materials for people doing the work.
If you’ve been finding value here, a paid subscription is the best way to support it and get access to everything that’s coming.
The Conversation That Sparked This Investigation
In a recent professional development session, I watched a conversation about AI and grading spiral into confusion. Some teachers were convinced that Ohio’s standardized tests use AI to score student writing, though no one could say what kind. Another educator shared that their district was training a popular AI tool on past student samples to help teachers grade faster. Beneath it all lurked an unspoken anxiety: Are we handing over the evaluation of student writing to machines?
What struck me wasn’t the concern; that was reasonable. It was that we were all using “AI” to mean completely different things. No one could articulate what was actually happening when a computer “scored” student writing.
So I decided to find out.
The Ohio Reality: It’s Not What You Think
Yes, Ohio uses AI to score writing on standardized tests. But it’s not ChatGPT, and it’s not what most people imagine.
According to the Ohio Department of Education’s documentation (updated January 2026), the state uses a hybrid human-AI system. Ohio educators first review student responses and select examples representing the full range of scores. Then Data Recognition Corporation (DRC) trains human scorers using detailed rubrics.
Here’s the crucial part: 2,500 randomly selected responses are hand-scored a second time, with every discrepancy resolved by a third human scorer. Only after this intensive validation does AI enter the picture, learning from these carefully vetted human scores.
The AI component, Cambium Assessment’s Autoscore, uses “a mix of expert-designed features to assess writing quality and Latent Semantic Analysis (LSA) to assess concepts in essays.” LSA dates back to the 1990s. This isn’t the shiny new AI everyone’s talking about.
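To make that concrete, here is a minimal sketch of how LSA-style concept matching works, using scikit-learn. This illustrates the general 1990s-era technique, not Cambium’s actual implementation, and the essays are invented.

```python
# A minimal sketch of Latent Semantic Analysis (LSA): represent essays
# as weighted word counts, then compress into a low-dimensional
# "concept" space so essays about the same ideas land close together
# even when they share few exact words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

essays = [
    "The water cycle moves water through evaporation and rain.",
    "Evaporation lifts water into clouds, which return it as rain.",
    "My favorite sport is basketball because it is fast and fun.",
]

# Step 1: TF-IDF turns each essay into weighted word counts.
tfidf = TfidfVectorizer().fit_transform(essays)

# Step 2: truncated SVD compresses those counts into 2 "concept" dimensions.
lsa = TruncatedSVD(n_components=2, random_state=0)
concepts = lsa.fit_transform(tfidf)

# The two water-cycle essays end up far more similar to each other
# than either is to the basketball essay.
sim = cosine_similarity(concepts)
print(round(sim[0][1], 2), round(sim[0][2], 2))
```

A real scorer would compare a student essay against vetted exemplar essays in this concept space; the point here is only that LSA measures conceptual overlap, not writing ability per se.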
Even during operational testing, the first 500 responses are both machine-scored and human-scored to verify accuracy, and 25 percent of all responses get double-checked by humans throughout the testing window.
The Distinction That Changes Everything
Here’s what was missing from our workshop: Not all AI does the same thing.
Ohio uses discriminative AI. Its job is to classify and score existing text. You give it an essay, it returns a number: 1, 2, 3, or 4 points.
The AI that teachers worry about, tools like ChatGPT, is generative AI. Its job is to create new text. You give it a prompt, and it writes an essay.
Think of it this way: Ohio’s system is a reading comprehension expert who analyzes student writing. ChatGPT is a writer who creates content. Same AI family, completely different jobs. This distinction matters enormously for grading student work.
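If the abstraction helps, the two jobs can be sketched as two function signatures. Both functions below are invented toy stand-ins, not real systems; the fake scoring rule exists only to show the shape of the interface.

```python
# Toy illustration: a discriminative model maps existing text to a
# score; a generative model maps a prompt to new text. Neither function
# resembles any real product's internals.

def discriminative_score(essay: str) -> int:
    """Stand-in for a trained classifier: existing text in, 1-4 score out."""
    # Real systems use learned features; this fake rule just counts words.
    words = len(essay.split())
    return min(4, max(1, words // 50 + 1))

def generative_write(prompt: str) -> str:
    """Stand-in for a text generator: prompt in, new text out."""
    return f"Essay responding to: {prompt} ..."

print(discriminative_score("word " * 120))               # → 3
print(generative_write("Describe the water cycle"))
```

The interface alone tells you the risk profile: a classifier can only mis-score what a student wrote, while a generator can produce the writing itself.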
The Generative AI Experiment: Not Ready for Prime Time
Researchers are testing whether ChatGPT and GPT-4 can score essays. These studies circulate in education networks and sometimes get misinterpreted as describing systems already in use. But they’re experiments, not operational programs. And the results are troubling.
The Prompt Problem
A Harvard study found that simply changing how you ask ChatGPT to grade changes the scores. Told to grade “as an elementary school teacher,” its scores showed an R² of 0.42 against human scores. Told to grade “as a college professor,” the R² dropped to 0.38. Same essays, different scores, just from rephrasing the instruction.
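For readers unfamiliar with R², here is a sketch of what those agreement numbers measure. The score lists below are invented to show the mechanics; they are not data from the Harvard study.

```python
# R² measures how much of the variation in human scores the AI's
# scores explain: near 1.0 means close agreement, near 0 means the
# AI's scores tell you little about what humans would say.
from sklearn.metrics import r2_score

human        = [3, 2, 4, 1, 3, 2, 4, 3]  # hypothetical human rubric scores
ai_teacher   = [3, 2, 3, 2, 3, 2, 4, 2]  # hypothetical "grade as a teacher" run
ai_professor = [2, 2, 3, 1, 2, 1, 3, 2]  # hypothetical "grade as a professor" run

print(round(r2_score(human, ai_teacher), 2))    # → 0.6
print(round(r2_score(human, ai_professor), 2))  # → 0.2
```

Values in the 0.3–0.4 range, like those the study reports, mean most of the variation in human judgment goes unexplained, and a one-word prompt change moves the needle further.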
The Consistency Problem
A 2025 study in Education found GPT-4’s performance drifts as the model updates. “GPT-4’s accuracy in identifying prime numbers dropped from 84% in March 2023 to 51% in June 2023.” If it can’t consistently identify prime numbers, should we trust it with nuanced writing evaluation?
The Variability Problem
Different studies reach opposite conclusions about whether generative AI grades too harshly or too leniently. A 2024 dental education study concluded that while ChatGPT showed promise, “an appropriate rubric design is essential for optimal reliability.” It works sometimes, if you set it up just right, but we’re not sure when or why.
The Bias Question: It’s About Training Data
Someone in our workshop mentioned hearing that AI scores the same essay differently depending on whether it’s written by a native English speaker or an English learner. The reality is more systemic.
A January 2025 study found that “current Transformer-based regression models trained primarily on native-speaker corpora often learn spurious correlations between surface-level L2 linguistic features and essay quality.” High-proficiency English learner essays received scores 10.3% lower than native speaker essays that human raters judged to be identical in quality.
The AI isn’t discriminating because it “knows” a student is an English learner. As researchers explain, “transformer attention heads often disproportionately attend to distinct L2 markers such as prepositional misuse or specific sentence structures as proxies for predicting lower scores, ignoring the semantic vector.”
The AI learned that certain grammatical patterns mean poor writing, when really those patterns just mean “written by someone whose first language isn’t English.”
The good news? Research from May 2025 found that “no AI bias and distorted disparities between ELLs and non-ELLs were found when the training dataset was large enough (ELL≈30,000 and ELL≈1,000), but concerns could exist if the sample size is limited (ELL≈200).”
The solution is straightforward: train AI on diverse data. Which means districts experimenting with AI tools must ask: What was this trained on? Who’s represented? Who isn’t?
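A district could start answering those questions with an audit as simple as this sketch: compare mean AI scores across groups on essays that human raters judged equivalent in quality. All numbers below are invented for illustration.

```python
# A minimal bias check: if human raters judged these two sets of essays
# equally strong, the AI's mean scores for the groups should match.
# A persistent gap is a red flag worth escalating to the vendor.
native_scores = [3.1, 2.9, 3.0, 3.2]  # hypothetical AI scores, native speakers
ell_scores = [2.8, 2.6, 2.7, 2.9]     # hypothetical AI scores, English learners

native_mean = sum(native_scores) / len(native_scores)
ell_mean = sum(ell_scores) / len(ell_scores)
gap = native_mean - ell_mean

print(f"mean score gap on human-equivalent essays: {gap:.2f}")
```

A real audit would use far more essays and test whether the gap is statistically significant, but even this rough check requires data most vendors never volunteer.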
The District Reality: Where Oversight Is Weakest
That teacher’s story about their district training an AI tool on past student work? This is where the real action is, and where oversight is weakest.
Unlike Ohio’s heavily validated, publicly documented system, local experiments often have:
No standardized validation
No transparency about training data
No formal bias testing
No external accountability
Ohio’s traditional AI system has extensive human oversight and multiple validation checkpoints. Generative AI experiments in districts often have none of these safeguards.
What Teachers Need to Know
Ask Which Kind of AI
When someone says “AI is grading,” ask: discriminative (classification) or generative (text creation)? They work differently and carry different risks.
Demand Transparency
If your district uses AI for grading, ask:
What specific system?
What was it trained on?
What validation has been done?
What happens when it’s wrong?
Protect English Learners
If AI scores work from English language learners, ask:
What percentage of training data came from ELL writers?
What testing was done for bias?
If these questions can’t be answered, the system isn’t ready.
The Conversation We Should Be Having
The teachers in that workshop weren’t wrong to worry. We were just asking the wrong questions.
Not: “Is Ohio using AI?” (Yes, for years)
But: “What kind of AI, with what safeguards, validated how?”
Not: “Should we use AI for grading?” (We already are)
But: “Which students does this AI serve well, and which does it disadvantage?”
The technology isn’t going away. Our obligation is to understand it well enough to protect the students who encounter it. That means getting specific about what we mean by “AI,” demanding transparency, and staying vigilant about who gets harmed when technology makes mistakes.
Because it will make mistakes. The question is whether we’ll notice, and whether we’ll care enough to do something about it.
Nick Potkalitsky, Ph.D.
Key Sources
Mike Kentz’s AI EduPathways: Insights from one of our most insightful, creative, and eloquent AI educators in the business!!!
Terry Underwood’s Learning to Read, Reading to Learn: The most penetrating investigation of the intersections between compositional theory, literacy studies, and AI on the internet!!!
Suzi’s When Life Gives You AI: A cutting-edge exploration of the intersection among computer science, neuroscience, and philosophy
Alejandro Piad Morffis’s The Computerist Journal: Unmatched investigations into coding, machine learning, computational theory, and practical AI applications
Michael Woudenberg’s Polymathic Being: Polymathic wisdom brought to you every Sunday morning with your first cup of coffee
Rob Nelson’s AI Log: Incredibly deep and insightful essays about AI’s impact on higher ed, society, and culture.
Michael Spencer’s AI Supremacy: The most comprehensive and current analysis of AI news and trends, featuring numerous intriguing guest posts
Daniel Bashir’s The Gradient Podcast: The top interviews with leading AI experts, researchers, developers, and linguists.
Daniel Nest’s Why Try AI?: The most amazing updates on AI tools and techniques
Jason Gulya’s The AI Edventure: An important exploration of cutting-edge innovations in AI-responsive curriculum and pedagogy.
Stephen Fitzpatrick’s Teaching in the Age of AI: Essential reflections from a veteran high school educator on the challenges and opportunities of generative AI in the classroom!!!