Why AI Can't Stop Using Em Dashes
A Deep Dive into Machine Learning's Punctuation Problem
If you find this content valuable and shareable, please consider becoming a paid subscriber to support the deep research and nuanced analysis these complex issues deserve.
The Measurement of Machine Habits
In the rapidly evolving landscape of AI-generated content, one peculiar pattern has emerged as perhaps the most reliable tell of machine authorship: an overwhelming fondness for the em dash. What began as casual observations among writers and editors has evolved into a full-blown internet phenomenon, with Reddit threads, academic analyses, and social media discussions all focused on this single punctuation mark's curious prominence in AI writing.
The numbers tell a striking story. Research comparing scientific abstracts from 2021 to 2025 found that em dash usage more than doubled during precisely the period when AI writing tools became mainstream. Editors report seeing them "in every third sentence" of AI-generated content. The pattern has become so pronounced that some readers claim they can identify artificial intelligence authorship simply by counting dashes.
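For readers who want to try the counting experiment themselves, here is a minimal sketch of that kind of measurement in Python. The sample texts and the choice of "per 1,000 words" as a unit are illustrative assumptions on my part, not figures taken from the research above.

```python
import re

def em_dash_rate(text: str) -> float:
    """Return em dashes per 1,000 words for a piece of text."""
    words = len(re.findall(r"\b\w+\b", text))
    dashes = text.count("\u2014")  # the em dash character
    return 1000 * dashes / words if words else 0.0

# Two invented snippets, purely for illustration
plain_draft = "The results were mixed, but the trend held across both cohorts."
dashy_draft = ("The results were mixed\u2014but the trend held\u2014and it held "
               "across both cohorts\u2014a pattern worth noting.")

print(f"plain draft: {em_dash_rate(plain_draft):.1f} per 1,000 words")
print(f"dashy draft: {em_dash_rate(dashy_draft):.1f} per 1,000 words")
```

On real documents you would want a much larger sample, and a human-written baseline to compare against, before reading anything into the numbers.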
But dismissing this as a simple quirk would be a mistake. The em dash phenomenon represents something far more complex: a convergence of linguistic patterns, training methodologies, technical constraints, and stylistic inheritance that reveals how AI systems process and generate human language. Understanding why AI gravitates toward this particular punctuation mark offers a window into the deeper mechanics of how these systems work and what happens when human writing patterns meet algorithmic optimization.
Why AI Can't Stop Using Em Dashes
The reasons behind AI's dash obsession are surprisingly complex, involving everything from linguistic patterns to training objectives:
Linguistic Patterns and Sentence Structure
Large language models have developed a particular way of constructing sentences that naturally leads to em dash usage. These models often produce complex, highly structured sentences and frequently insert extra clauses or asides. The em dash becomes a natural tool to weave these elements into a single sentence without breaking flow.
This creates what researchers call a "rhythmic pattern": the em dash becomes part of the model's default sentence flow. Unlike human writers, who vary their punctuation for readability, LLMs lack an innate sense of "too much of a good thing." They may insert em dashes at a rate no human would normally use because they optimize for local coherence rather than global stylistic balance.
The versatility of the em dash particularly appeals to AI systems. Linguistically, an em dash can substitute for commas, parentheses, or a colon, introducing a pause or shift in tone. As one writing instructor notes, the em dash "captures natural inflections of speech in a way that other punctuation doesn't," giving AI-generated sentences a conversational yet polished cadence. Models exploit this versatility to keep sentences fluent without ending them abruptly.
Stylistic Conventions from Training Data
LLMs have no inherent writing style; they derive their approach entirely from the vast human texts in their training data. The em dash is a longstanding staple of sophisticated writing, so models internalize it as a normal, even desirable, stylistic device.
Several factors from human writing conventions contribute to this pattern:
Literary and Journalistic Influence: In literature and journalism, em dashes are often used to add emphasis or voice. Writers from Emily Dickinson to modern bloggers have embraced the em dash for its dramatic effect and clarity of thought. Magazines and newspapers, especially since the 1970s, increasingly used em dashes to create a narrative, conversational tone in reporting. Because models like GPT-4 were trained on many such sources, they absorbed these conventions. As one journalism professor points out, if an AI's training leaned heavily on magazine or blog writing, "those two styles were quite fond of the em dash," which could explain AI's fondness for it.
Authority and Polish: Users often prompt LLMs to produce well-structured, formally written answers. To sound "refined" or authoritative, models lean on punctuation that appears in serious, well-edited prose. The em dash contributes to a confident and clear tone, and models associate em dashes with high-quality writing style. OpenAI's team even acknowledged having a "soft spot for the em dash," since it can help communicate ideas clearly in many contexts. The punctuation isn't taboo or slangy; it's formal but lively, which aligns with the stylistic register these models aim to match.
Deep Training Data Bias
The corpora used to train major language models contain what one researcher called "entire civilizations' worth of text," and within those mountains of words, em dashes abound. The models picked up on this statistical reality, turning the sheer prevalence of em dashes in human writing into a learned bias.
This bias runs remarkably deep. One Reddit moderator attempting to "de-AI" his writing found the em dash nearly impossible to eliminate: "Even when I prohibit em-dashes at the level of the system prompt, the LLM keeps inserting them into text." Power users have found that the most reliable workaround may be post-hoc revision: a second AI pass that rewrites every phrase and sentence containing an em dash.
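The workaround those users describe is a second AI pass, but the underlying idea of post-hoc cleanup is easy to illustrate with a much cruder, rule-based sketch. The function below is a hypothetical stand-in, not anyone's actual workflow: it simply swaps dash constructions for more conventional punctuation, which is far less graceful than rephrasing the sentence.

```python
import re

def de_dash(text: str) -> str:
    """Rewrite em dash constructions with more conventional punctuation.

    A crude, rule-based stand-in for the post-hoc revision pass described
    above; a real revision would rephrase each sentence, not just swap
    punctuation marks.
    """
    # Paired dashes acting like parentheses: "a \u2014 b \u2014 c" -> "a (b) c"
    text = re.sub(r"\u2014\s*([^\u2014]+?)\s*\u2014", r" (\1) ", text)
    # Any remaining lone dash introducing a clause: replace with a comma
    text = re.sub(r"\s*\u2014\s*", ", ", text)
    # Tidy doubled spaces left behind by the substitutions
    return re.sub(r"\s{2,}", " ", text).strip()

sample = ("The model keeps inserting them\u2014even when told not to\u2014"
          "because the habit is baked in\u2014deeply.")
print(de_dash(sample))
```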
Since em dashes were never treated as undesirable during training (unlike profanity or certain disallowed content), models had no reason to suppress them. OpenAI community discussions have described the em dash habit as a "deep bias" embedded in how models understand written flow. The model predicts an em dash simply because many real texts would have one in that spot.
Technical Training Objectives
The design of model training objectives plays a crucial role in punctuation choices. LLMs are trained via next-token prediction: they aim to produce the most statistically likely continuation of given text. This objective, combined with later fine-tuning for helpfulness, indirectly encourages more em dashes.
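To make "most statistically likely continuation" concrete, here is a toy sketch of a single next-token step. The candidate tokens and their scores are invented for illustration; a real model scores tens of thousands of candidates at once.

```python
# A toy illustration of next-token prediction with invented scores.
import math

# Hypothetical unnormalized scores ("logits") for a few candidate tokens
# that could follow the fragment "The results were striking"
logits = {"\u2014": 2.1, ",": 1.8, ".": 1.2, " because": 0.9}

# Softmax: convert the scores into a probability distribution
total = sum(math.exp(score) for score in logits.values())
probs = {token: math.exp(score) / total for token, score in logits.items()}

# The model then samples from (or greedily picks the top of) this distribution
for token, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{token!r}: {p:.2f}")
```

If the training data made the dash the highest-scoring continuation in contexts like this, the model will keep reaching for it, one token at a time.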
Token Efficiency: Recent analysis suggests that using an em dash can be a sort of "hack" for efficiency from the model's perspective. The training objective rewards lowering perplexity (better predicting training data), and often an em dash lets the model condense information into fewer tokens. Instead of writing verbose connective phrases that might require multiple tokens, the model can use a single em dash token to seamlessly attach a clause. Because "—" is typically a single token in the model's vocabulary, leaning on it may minimize the number of tokens needed to express a thought, thereby marginally improving the training objective. A short tokenizer sketch at the end of this section shows one way to check that intuition.
Clarity Over Style: In reinforcement learning from human feedback (RLHF), human evaluators reward outputs that are clear, well-structured, and thorough. This inadvertently encourages em dash usage. When a model wants to add clarification or an example to an ongoing sentence, a dash is often the clearest way to do so without creating run-on sentences or confusing comma usage. The model's priority is helping users communicate clearly, and if a dash improves clarity, the model will use it without concern for seeming repetitive or pretentious.
Lack of Global Editing: Human writers often revise drafts to avoid quirky overuse, but LLMs generate text in one pass without global editorial judgment. They don't "hear" that they've used five dashes in one paragraph and decide to vary the next sentence. The model's objective is local optimization, not holistic style equilibrium. Any subtle preference learned during training can repeat unchecked.
Resistance to Override: Once an LLM has learned to heavily use a pattern, it proves difficult to train that habit away without explicit retraining or constraints. The penchant for dashes is essentially baked into the model's neural weights from its vast reading of human text. From the model's perspective, using an em dash is never an error to be avoided; it's often the most natural choice given its training distribution.
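Here is the tokenizer sketch promised above. It assumes the tiktoken library and OpenAI's cl100k_base encoding; the exact counts will differ for other tokenizers and other sentences, and this is only a rough probe of the intuition, not evidence about any particular model's training.

```python
# A rough check of the token-efficiency intuition. Assumes the `tiktoken`
# library (pip install tiktoken) and the cl100k_base encoding; counts will
# vary with the tokenizer and the surrounding text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

dash_version = "The model prefers the dash\u2014it is cheap to emit."
wordy_version = "The model prefers the dash, which is to say that it is cheap to emit."

for label, text in [("dash", dash_version), ("wordy", wordy_version)]:
    print(f"{label:5s}: {len(enc.encode(text))} tokens")
```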
The Human Connection
The most fascinating aspect of this phenomenon is what it reveals about our own writing. The same analysis of scientific abstracts mentioned earlier found that em dash usage more than doubled between 2021 and 2025, precisely the period when AI writing tools became mainstream. We're not just noticing AI's dash habit; we're inadvertently adopting it ourselves.
This creates an interesting feedback loop. Writers who use AI assistance absorb its dash-heavy style, then produce training data for future models, potentially amplifying the trend. Meanwhile, other writers consciously avoid em dashes to distance themselves from "AI-sounding" prose, creating a strange punctuation arms race.
The Irony of Detection
Here's where it gets really interesting: using em dashes to detect AI writing is fundamentally flawed. The dash isn't an alien invasion of our language but a beloved tool of accomplished writers from Emily Dickinson to modern journalists. Publications like The New Yorker and major newspapers have used em dashes extensively for decades.
As noted earlier, the em dash captures natural inflections of speech in a way that other punctuation doesn't. It's aesthetically elegant and at home in both formal and conversational writing. When we criticize AI for using dashes, we're essentially criticizing it for imitating good writing.
What This Means for Writers
The em dash controversy highlights a broader truth about AI and creativity: these systems are sophisticated mirrors of human expression. Their quirks often reveal our own patterns and preferences, sometimes in uncomfortably concentrated doses.
For writers, this offers both warning and opportunity. Understanding how AI generates text — including its punctuation preferences — can help us make more intentional stylistic choices. Whether you embrace the em dash as a powerful tool or avoid it to maintain a distinctly human voice, the key is making that choice consciously.
The Bigger Picture
The em dash debate is really about authenticity in the age of AI. As these tools become more prevalent, we're grappling with questions about what makes writing uniquely human. The irony is that by obsessing over AI's punctuation habits, we might be overlooking the deeper elements that truly distinguish human creativity: genuine insight, lived experience, and authentic voice.
Perhaps the real lesson isn't that we should avoid em dashes but that we should focus on what no AI can replicate: our individual perspectives, experiences, and the messy humanity that gives writing its soul.
After all, the humble em dash's journey from prized literary device to alleged AI signature reminds us that these models are ultimately reflections of ourselves. Their quirks are our quirks, their patterns our patterns, just amplified, concentrated, and served back to us with algorithmic precision.
The question isn't whether AI uses too many em dashes. The question is: what does that tell us about how we write, and how we want to write in the future?
What's your take on the em dash debate? Have you noticed this pattern in AI writing, or do you think it's overblown? Reply and let us know. We'd love to hear your thoughts.
Nick Potkalitsky, Ph.D.
Check out some of our favorite Substacks:
Mike Kentz’s AI EduPathways: Insights from one of the most creative and eloquent AI educators in the business!
Terry Underwood’s Learning to Read, Reading to Learn: The internet’s most penetrating investigation of the intersections among compositional theory, literacy studies, and AI!
Suzi’s When Life Gives You AI: A cutting-edge exploration of the intersection among computer science, neuroscience, and philosophy
Alejandro Piad Morffis’s The Computerist Journal: Unmatched investigations into coding, machine learning, computational theory, and practical AI applications
Michael Woudenberg’s Polymathic Being: Polymathic wisdom brought to you every Sunday morning with your first cup of coffee
Rob Nelson’s AI Log: Incredibly deep and insightful essays about AI’s impact on higher ed, society, and culture.
Michael Spencer’s AI Supremacy: The most comprehensive and current analysis of AI news and trends, featuring numerous intriguing guest posts
Daniel Bashir’s The Gradient Podcast: The top interviews with leading AI experts, researchers, developers, and linguists.
Daniel Nest’s Why Try AI?: The most amazing updates on AI tools and techniques
Jason Gulya’s The AI Edventure: An important exploration of cutting-edge innovations in AI-responsive curriculum and pedagogy.
People can complain all they want, but I've learned more about English grammar in the last few months than in the 57 years before. I never knew em dashes existed, but now that I know, I really like them. :)
Very good insights, Nick. Especially the ones about training objectives and reward; I hadn't thought about how this kind of human supervision can embed very subtle biases.
Lately I've become more and more interested in seeing LLMs as objects of study in themselves, as a kind of dynamic corpus that can be poked to understand how language is used across different contexts. I'm especially interested in smaller LLMs trained on regional or historical variants of languages.
What do you think about this? Is there something useful to learn about language itself by analyzing LLMs?