Understanding AI in 2026: A Two-Part Deep Dive into Current Realities and Future Directions
Why educators and researchers need to listen to AI experts, not AI hypers or doomsters, about the future of AI development.
Thank you for reading and engaging with my work. Right now I am offering a 20% Forever Discount to readers who sign up for a yearly paid subscription before the end of the year. Thanks for all the support over the years. Together we are creating a very special collaborative community. In the meantime, I am continuing my research into discipline-specific practices for AI implementation. Expect to hear more about that in the new year!
Introduction to the Series
It seems like a good time to revisit the comparison between AI and human intelligence, and to examine where LLM development is actually headed. I haven’t done anything like this since 2023, and the landscape has shifted considerably, though perhaps not in the ways the hype cycle would suggest.
Since Daniel Bashir reduced his output at The Gradient this year after an impressive three-year run of almost weekly deep dives with AI experts, I’ve migrated over to Dwarkesh Patel’s podcast to fill that vacuum. For educators and researchers outside the AI field, it’s crucial to understand what experts are actually saying, especially as the AI hype-doom cycle continues its volatile swings, generating misconceptions at every turn.
For this series, I’ve drawn on Dwarkesh’s recent interviews with three researchers whose work I’ve followed for years and whose opinions I deeply respect: Ilya Sutskever (co-founder of OpenAI, now leading Safe Superintelligence Inc.), Andrej Karpathy (former director of AI at Tesla, creator of influential educational content on neural networks), and Richard Sutton (recipient of the 2024 Turing Award, often called the father of reinforcement learning).
What’s particularly striking about these 2025 conversations is that while Dwarkesh continues to believe in the possibility of AGI, he has interviewed a growing number of AI experts who push that possibility further and further into the future. All three of these experts agree on something fundamental: LLMs are not the path forward to general intelligence. LLMs have fundamental limitations that no pretraining scheme, no amount of compute, and no dataset will overcome.
This two-part series will explore what those limitations are and what they tell us about both the current state of AI capabilities and the future direction of the field.
Part 1: AI vs Human Intelligence—Where We Actually Stand
The Generalization Gap: When Brilliance Becomes Brittleness
The most profound difference between current AI and human intelligence isn’t about raw test performance. Models routinely exceed human scores on difficult benchmarks. The problem runs deeper: these systems are spectacular pattern-matchers that struggle when confronted with anything genuinely novel.
Sutskever captures this with a pointed analogy. Imagine two students learning competitive programming. The first dedicates 10,000 hours to the craft, memorizing every algorithm, every proof technique, becoming lightning-fast at implementation. The second casually practices for 100 hours. Both do well in competitions, but which one will have a better career?
Obviously the second. They learned principles that transfer, while the first overfit to a specific domain. “The models are much more like the first student,” Sutskever explains, “but even more.”
This isn’t abstract. Sutskever describes a frustratingly common interaction with coding models: “You go to some place and then you get a bug. Then you tell the model, ‘Can you please fix the bug?’ And the model says, ‘Oh my God, you’re so right. I have a bug. Let me go fix that.’ And it introduces a second bug. Then you tell it, ‘You have this new second bug,’ and it tells you, ‘Oh my God, how could I have done it? You’re so right again,’ and brings back the first bug.”
You can ping-pong between these two errors indefinitely. How is this possible from a system that can solve complex math problems?
Sutskever’s hypothesis centers on how models are trained with reinforcement learning. Labs create specific RL training environments, often taking inspiration from their evaluation benchmarks (the tests they’ll use to measure success). Models become exceptional at these exact scenarios but fail to develop robust, generalizable understanding.
“If you combine this with generalization of the models actually being inadequate,” he argues, “that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance.”
A 35-Million-Fold Information Density Gap
Karpathy reveals a technical detail that illuminates just how differently humans and current AI systems process information. During pre-training (when models learn from massive text datasets), information gets compressed dramatically into the model’s parameters, the numbers that define its behavior.
For a large model like Llama 3’s 70-billion parameter version, trained on 15 trillion tokens, this works out to storing just 0.07 bits per token it encounters.
Compare that to what happens during inference, when the model is actually being used. Information processed through the context window (the text the model is actively working with) gets stored at approximately 320 kilobytes per token through something called the KV cache.
That’s a 35-million-fold difference in information density.
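Both of Karpathy’s figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes Llama-3-70B-like dimensions (80 layers, 8 key-value heads, head dimension 128, 16-bit precision); those architectural numbers are assumptions drawn from the model’s published specifications, not from the interview itself.

```python
# Pretraining information density: parameter bits spread over training tokens.
params = 70e9            # Llama 3 70B parameters
tokens = 15e12           # pretraining tokens
bits_per_param = 16      # assuming 16-bit (bf16) weights
pretrain_bits_per_token = params * bits_per_param / tokens
print(f"{pretrain_bits_per_token:.2f} bits per training token")  # ~0.07

# KV-cache footprint per token of context (assumed Llama-3-70B-like config).
layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
# Factor of 2: one key vector and one value vector per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(f"{kv_bytes_per_token / 1024:.0f} KB per context token")   # ~320

# Ratio between the two information densities.
ratio = (kv_bytes_per_token * 8) / pretrain_bits_per_token
print(f"ratio: {ratio:.1e}")  # ~3.5e7, i.e. roughly 35 million
```

The exact ratio shifts with the assumed precision and architecture, but the order of magnitude is robust: context is tens of millions of times denser than pretrained weights.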
As Karpathy explains: “Anything that happens during the training of the neural network, the knowledge is only a hazy recollection of what happened in training time. That’s because the compression is dramatic... Whereas anything that happens in the context window of the neural network, you’re plugging in all the tokens and building up all those KV cache representations, is very directly accessible to the neural net.”
This architectural reality creates a fundamental mismatch with human learning. Somehow, we extract deeper and more transferable knowledge from far less data. A child doesn’t need to see 15 trillion examples to learn language or understand cause and effect.
This is why you need to input a source directly into a chat in order to get good recall and analysis: what sits in the context window is available at full fidelity, while pretrained knowledge is only that hazy recollection.
The Missing Loop: Continual Learning
Perhaps the starkest capability gap is the one most easily overlooked: humans learn continuously from ongoing experience, while AI systems fundamentally don’t.
Sutton emphasizes this constantly in his interview. When discussing how animals learn, he points out: “Squirrels don’t go to school. Squirrels can learn all about the world. It’s absolutely obvious, I would say, that supervised learning doesn’t happen in animals.”
Animals (and humans) learn through direct interaction with the world. They try things, observe consequences, and adjust. Current AI maintains a rigid separation between training time (when learning happens) and deployment time (when the system is used). You can’t genuinely teach ChatGPT something new mid-conversation in a way that persists or updates its core knowledge.
Karpathy points to something humans do that AI systems completely lack: memory consolidation during sleep. While awake, we accumulate experiences and build up what he calls a “context window of stuff that’s happening during the day.” But during sleep, “something magical happens where I don’t think that context window stays around. There’s some process of distillation into the weights of my brain.”
He continues: “We don’t have an equivalent of that in large language models... These models don’t really have a distillation phase of taking what happened, analyzing it obsessively, thinking through it, doing some synthetic data generation process and distilling it back into the weights.”
This isn’t a minor missing feature. It’s fundamental to how biological intelligence works.
The Goal Problem: Prediction Isn’t Intelligence
Sutton raises a philosophical point that cuts to the heart of what’s wrong with the LLM paradigm. These systems don’t have real goals in any meaningful sense.
When Dwarkesh pushes back, suggesting that “next token prediction” constitutes a goal, Sutton is blunt: “That’s not a goal. It doesn’t change the world. Tokens come at you, and if you predict them, you don’t influence them.”
Think about what a goal actually means. A chess-playing AI has a genuine goal: win the game. It takes actions (moves pieces), observes consequences (how the board state changes), and learns to achieve its objective. A robot vacuum has a real goal: clean the floor. It acts in the physical world and receives feedback about whether surfaces are clean.
But an LLM just predicts what text should come next based on patterns. It doesn’t act on the world and observe consequences. As Sutton puts it: “You can’t look at a system and say it has a goal if it’s just sitting there predicting and being happy with itself that it’s predicting accurately.”
This matters because goal-directedness is central to what we mean by intelligence. “You have to have goals or you’re just a behaving system,” Sutton argues. “You’re not anything special, you’re not intelligent.”
His alternative vision centers on what he calls “the experiential paradigm,” systems that learn from continuous streams of sensation, action, and reward. “Intelligence is about taking that stream and altering the actions to increase the rewards in the stream,” he explains.
Crucially, the knowledge such systems build is fundamentally different: “Your knowledge is about if you do some action, what will happen. Or it’s about which events will follow other events. It’s about the stream. The content of the knowledge is statements about the stream. Because it’s a statement about the stream, you can test it by comparing it to the stream, and you can learn it continually.”
This is a completely different paradigm from predicting text.
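Sutton’s experiential paradigm can be made concrete with the simplest possible sketch: a two-armed bandit agent whose only “knowledge” is a pair of testable predictions about its own stream of actions and rewards. This is a toy illustration of the incremental value updates at the heart of reinforcement learning, not anything from the interview; the payoff probabilities are invented.

```python
import random

# Toy experiential learner: the agent acts, observes consequences in its
# stream, and revises its predictions about what each action will yield.
random.seed(0)
true_p = [0.3, 0.8]          # hidden payoff probability of each action
q = [0.0, 0.0]               # the agent's learned estimates (its "knowledge")
alpha, epsilon = 0.1, 0.1    # step size and exploration rate

for step in range(5000):
    # Act: mostly exploit current knowledge, occasionally explore.
    a = random.randrange(2) if random.random() < epsilon else q.index(max(q))
    # Observe a consequence delivered by the world.
    r = 1.0 if random.random() < true_p[a] else 0.0
    # Test the prediction against the stream and nudge it toward reality.
    q[a] += alpha * (r - q[a])

print(q)  # estimates drift toward the true payoff rates
```

Note the contrast with next-token prediction: here every piece of knowledge is a claim about the stream (“if I do this, that happens”) that the agent can check and correct continually.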
When Memory Undermines Intelligence
Karpathy makes a counterintuitive observation: models might actually be too good at memorization, and this hurts their intelligence.
Unlike humans, who forget most details while retaining general principles, LLMs can recite passages verbatim from their training data. Feed them a sequence of random numbers once or twice, and they can suddenly recite the entire sequence back. “There’s no way a person can read a single sequence of random numbers and recite it to you,” Karpathy notes.
But here’s the insight: “That’s a feature, not a bug, because it forces you to only learn the generalizable components. Whereas LLMs are distracted by all the memory that they have of the pre-training documents, and it’s probably very distracting to them in a certain sense.”
This leads him to an unexpected proposal. When he talks about building what he calls a “cognitive core” (the essential intelligence stripped of unnecessary components), he wants to “remove the memory.”
“I’d love to have them have less memory so that they have to look things up, and they only maintain the algorithms for thought,” he explains.
Consider how human experts actually work. A skilled doctor doesn’t have every medical fact memorized. They know how to think about medical problems, what questions to ask, where to look for information, and how to reason from evidence to diagnosis. The thinking algorithms matter more than encyclopedic recall.
The Collapse Problem: Why AI Can’t Learn From AI
Karpathy identifies another fundamental limitation that constrains how models can improve: model collapse when training on synthetic data.
The problem is subtle but severe. When models generate text, the outputs “are silently collapsed,” meaning “they occupy a very tiny manifold of the possible space of thoughts about content.”
He demonstrates with a simple experiment: “Go to ChatGPT and ask it, ‘Tell me a joke.’ It only has like three jokes. It’s not giving you the whole breadth of possible jokes.”
This matters enormously for a common proposal: having models learn from their own outputs or from other AI-generated data. “Any individual sample will look okay,” Karpathy explains, “but the distribution of it is quite terrible. It’s quite terrible in such a way that if you continue training on too much of your own stuff, you actually collapse.”
Interestingly, he thinks humans experience this too: “I also think humans collapse over time... This is why children, they haven’t overfit yet. They will say stuff that will shock you because you can see where they’re coming from, but it’s just not the thing people say, because they’re not yet collapsed. But we’re collapsed. We end up revisiting the same thoughts.”
The difference is that humans at least start with high diversity before collapsing over decades. Models start collapsed.
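The asymmetry Karpathy describes, individually fine samples drawn from a terribly narrow distribution, can be made concrete with a toy entropy calculation. The joke counts below are invented for illustration only.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A "diverse" generator: 300 outputs spread evenly over 100 distinct jokes.
diverse = [f"joke_{i % 100}" for i in range(300)]
# A "collapsed" generator: 300 outputs recycling the same 3 jokes.
collapsed = [f"joke_{i % 3}" for i in range(300)]

print(entropy(diverse))    # ~6.64 bits: a rich distribution
print(entropy(collapsed))  # ~1.58 bits: each sample looks fine in isolation,
                           # but the distribution covers a sliver of joke-space
```

No single collapsed sample reveals the problem; only the distribution does, which is why training on your own outputs quietly narrows the manifold further.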
What Actually Works in Practice
These limitations manifest in concrete, everyday interactions. Karpathy recently built nanochat (a complete but simplified ChatGPT clone) and paid close attention to where coding assistants helped versus hindered.
Models excel at boilerplate code: “Boilerplate code that’s just copy-paste stuff, they’re very good at that. They’re very good at stuff that occurs very often on the Internet because there are lots of examples of it in the training sets of these models.”
But for genuinely novel work requiring integration into a specific codebase with particular architectural choices? “The models are not very good at code that has never been written before... They don’t know how to fully integrate it into the repo and your style and your code and your place, and some of the custom things that you’re doing and how it fits with all the assumptions of the repository.”
His bottom line assessment: “You’re not going to hire this thing as an intern. It’s missing a lot of it because it comes with a lot of these cognitive deficits that we all intuitively feel when we talk to the models.”
The Expert Consensus
Despite coming from different backgrounds and working on different problems, all three experts converge on a shared assessment: current LLMs represent impressive engineering achievements but are fundamentally limited in ways that matter for general intelligence.
Karpathy offers the most balanced take: “I feel like the industry is making too big of a jump and is trying to pretend like this is amazing, and it’s not. It’s slop. They’re not coming to terms with it... We’re at this intermediate stage. The models are amazing. They still need a lot of work.”
Sutton cuts to the philosophical core of the limitation: “To mimic what people say is not really to build a model of the world at all. You’re mimicking things that have a model of the world: people.”
And Sutskever, despite leading a company explicitly aimed at building superintelligence, emphasizes what he sees as the central gap: “These models somehow just generalize dramatically worse than people. It’s a very fundamental thing.”
In Part 2, we’ll explore what comes next: the specific technical barriers preventing progress, why current approaches to RL training are flawed, and what alternative paradigms these experts see as necessary to move beyond LLMs toward more capable systems.
Nick Potkalitsky, Ph.D.
When writing a prompt with Gemini for a Deep Research project, I requested that terms be introduced with their abbreviations in parentheses: Electric Reliability Council of Texas (ERCOT). I requested this in multiple iterations of the prompt, and Gemini would sometimes leave it out. Even in its responses to questions about the prompt, it would just use the abbreviation.
This is the type of thing that prevents LLMs from taking over the world. When AI cannot follow a simple instruction within the current context window, how can it be trusted in agent mode? Unless there are versions of AI beyond the access of the general public, I don't see how the Shopify CEO was able to create his keynote speech using AI agents. The idea of the AI available to me being able to rewrite and refine something one hundred times overnight is preposterous.
The brain, like a PC, is an assembly of intelligences: linguistic, visual, logical, and so on.
LLMs, in their present form, can be compared to one of those intelligences (the linguistic one), though I don't know whether the underlying mechanisms are the same. The Bayesian-brain hypothesis is not universally accepted.
A computer can produce or perceive an image/picture (through the agency of language, AI, or no AI). But it is quite dissimilar to how the brain does it. The brain doesn't need language as a medium, but a computer does.
Yes, the foundation of LLMs was "attention," and the brain has a dedicated compartment to address that. We call it the "mind": a filtration device that selects a cognitive task.
Assuming we draw from that design, for computers to achieve AGI, an LLM would have to be the attention orchestrator for multi-modal perception. (Unlike the brain, though, the medium is still language alone.) With agents, we are slowly heading in that direction.
But to be "equivalent" to the brain, AI will have to accept that while language is paramount, it isn't everything. We can aim to achieve functional equivalence, but the costs will be enormous.