Minds and Machines: Decoding the Enigma of Learning: Part 2
An In-Depth Analysis of Human and Artificial Cognitive Processes
Introduction
Welcome to the second part of our series exploring the fascinating world of artificial intelligence (AI) and its relationship to human cognition. In the first part, we laid the groundwork by delving into the complexities of human learning, providing a context for understanding the unique aspects of human intelligence.
Now, we embark on a captivating journey into the realm of AI, with a special focus on machine learning. In this installment, we are privileged to have the expertise of Alejandro Piad Morffis, a renowned expert in the field of AI and natural language processing. Alejandro’s insights will illuminate the incredible advancements in machine learning and shed light on the potential impact of AI on various domains, including education.
Through Alejandro’s contributions, we will explore the intricacies of machine learning algorithms, their ability to process and analyze vast amounts of data, and their potential to revolutionize the way we approach problem-solving and decision-making. We will also delve into the fascinating world of natural language processing, where machines are learning to understand and generate human language with increasing sophistication.
As we navigate through this second part, we will draw parallels and contrasts between human and artificial intelligence, highlighting the unique strengths and limitations of each. We will explore the ethical considerations surrounding the development and deployment of AI systems and discuss the importance of responsible AI practices.
By the end of this installment, you will have gained a deeper understanding of the current state of AI and machine learning, as well as a glimpse into the future possibilities that these technologies hold. You will be equipped with the knowledge to critically evaluate the impact of AI on society and to engage in meaningful discussions about its potential benefits and challenges.
So, buckle up and get ready for an exhilarating ride as we explore the frontiers of artificial intelligence and machine learning with Alejandro Piad Morffis as our guide. Whether you are a technology enthusiast, an educator, or simply curious about the future of AI, this series promises to be an enlightening and thought-provoking experience.
Let's dive in and uncover the mysteries of AI together!
Part 2: Machine Learning
“The Science of Machine Learning: From N-Grams to LLMs” by Alejandro Piad Morffis
1. The Basics of Machine Learning
Let’s begin with a high-level definition of machine learning. It is a unique approach in computer science, focusing on specific problems that are difficult to solve by traditional means. So, why do we need machine learning?
When solving a traditional problem with a computer, a software engineer must consider how humans would solve it and provide precise instructions for the computer to emulate that behavior. This method has been used for decades to build most of our computational systems, from operating systems to financial management systems to manufacturing robots. The key point is that in all these tasks we know exactly which steps to take and which tools to use; the steps are simply too difficult or tedious for humans to carry out themselves.
However, what if the task is something you can’t easily explain how to do? A prime example is creating a chess player. World champion chess players perform amazing calculations in their minds, but it’s incredibly hard to write down specific rules for a computer to follow. Traditional computer science methods aren’t suitable for this kind of task.
Similarly, when examining a picture of a lung to detect a tumor, radiologists can expertly identify problematic tissue, but it’s impossible to write down the precise rules they use.
In these situations, machine learning comes into play. Instead of providing precise instructions, we give the computer examples of the desired outcome and let it figure out the rules on its own. This approach allows computers to learn from experience and improve their performance over time, making them more adept at complex tasks that are challenging to explain or document step by step.
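To make this concrete, here is a minimal sketch of the “learn from examples” idea using scikit-learn. The tiny dataset and feature names are invented purely for illustration; real medical-imaging models are, of course, far more involved.

```python
# A minimal sketch: learning rules from labeled examples instead of writing them by hand.
# The toy data below is invented for illustration only.
from sklearn.tree import DecisionTreeClassifier

# Each example: [lesion_size_mm, tissue_density]; label: 1 = suspicious, 0 = healthy
X = [[2.1, 0.30], [1.8, 0.20], [7.5, 0.90], [6.9, 0.80], [3.0, 0.40], [8.2, 0.95]]
y = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier()
model.fit(X, y)                      # the "rules" are inferred from the examples
print(model.predict([[7.0, 0.85]]))  # -> [1], flagged as suspicious
```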
2. Machine Learning Paradigms
Now that we understand the basics, let’s dive into different machine learning and modeling paradigms suitable for various problems. There are two fundamental ways to learn from experience in machine learning: learning by imitation and learning by trial and error.
In learning by imitation, you have access to examples of the task performed correctly, like brain scans annotated by expert radiologists or tweets labeled with their sentiment. This leads to two primary learning paradigms: supervised and unsupervised learning. Both involve a static source of experience, such as a large dataset, but the difference lies in whether the data has explicit annotations related to the task.
Supervised classification, for example, includes labeled texts or images. In unsupervised scenarios, no explicit output annotation exists, but you can still transform the problem into a machine learning scenario by choosing an internal target to predict. Language models trained on internet data illustrate this approach – predicting the next word in a phrase without explicit annotation.
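As a rough sketch of how unannotated text becomes a learning problem, here is how raw text can be turned into (context, next word) pairs; the sentence and window size are arbitrary choices made for illustration.

```python
# Self-supervision: each training pair is (a few previous words, the next word).
text = "the sun rises in the east and sets in the west"
words = text.split()

context_size = 3
pairs = [(words[i:i + context_size], words[i + context_size])
         for i in range(len(words) - context_size)]

for context, target in pairs:
    print(context, "->", target)
# e.g. ['the', 'sun', 'rises'] -> 'in'
```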
Reinforcement learning represents the other major paradigm – learning by trial and error. Instead of correct demonstrations of task performance, you have a judge evaluating your performance without telling you how to do it. This involves interacting with an environment and receiving feedback through rewards or penalties, enabling the model to improve over time.
Reinforcement learning is great for situations where performing a task is more difficult than judging it. For example, in self-driving cars, it’s challenging to gather many demonstrations of correct driving. However, it’s easier to put a machine learning driver in a simulation and evaluate its performance based on factors like crashes, speed, and traffic adherence.
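Here is a deliberately simplified trial-and-error loop in Python. The toy “driving” environment and its reward are invented for illustration; real reinforcement learning uses far richer environments and much smarter update rules.

```python
import random

# A toy environment, invented for illustration: the agent picks a speed and is
# rewarded for staying close to a safe speed it does not know in advance.
class ToyDrivingEnv:
    SAFE_SPEED = 50

    def step(self, action):
        return -abs(action - self.SAFE_SPEED)  # penalty grows with deviation

# A trivial trial-and-error learner: try random speeds, remember the best one.
env = ToyDrivingEnv()
best_action, best_reward = None, float("-inf")
for episode in range(1000):
    action = random.randint(0, 120)
    reward = env.step(action)  # feedback from a judge, not a demonstration
    if reward > best_reward:
        best_action, best_reward = action, reward

print("learned speed:", best_action)  # ends up near 50
```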
3. Understanding Language Models
Now let’s take a look at the largest and most successful subfamily of gradient-based machine learning methods: language models. In machine learning, language modeling means guessing how likely a sentence is to “exist.” For example, “the sun rises in the east and sets in the west” is a common sentence with a high chance of existing. But a sentence made of random words that do not mean anything has a very low probability of ever having been uttered by anyone.
Language modeling can be tricky: what does it even mean for a sentence to be “likely to exist”? In machine learning, we use a collection of texts called a corpus to pin this down. Instead of the abstract, ontological question, we can ask something much more straightforward: how likely is this sentence to appear in all the written text on the internet?
However, if we only looked at sentences that already exist on the internet, language modeling wouldn’t be very useful. We’d just say a sentence is either there or not, probability 0 or 1. So instead, we can think about it in statistical, frequentist terms: if the internet were created and erased many times over, how often would this sentence show up?
To answer this question, we can think about if a word is likely to come after a group of words in a sentence. For example, “The sun rises in the east and sets in the…” What word would most likely come next? We want our language model to be able to guess that word.
Thus, we need to know how likely a word is to appear after a group of words. If we can do that, we can find the best word to complete the sentence. We keep doing this over and over again to create full sentences or even conversations.
But if we always choose the most likely word, the output could get boring. So, instead, we pick randomly from, say, the top 15 or 50 most likely words. That way, our program can produce different sentences and be more interesting.
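Here is a small sketch of that top-k sampling idea; the candidate words and their probabilities are made up for illustration.

```python
import random

# Toy next-word distribution for "The sun rises in the east and sets in the ..."
# (probabilities invented for illustration).
next_word_probs = {"west": 0.85, "evening": 0.08, "sea": 0.04,
                   "morning": 0.02, "sky": 0.01}

def sample_top_k(probs, k=3):
    # Keep only the k most likely words, then sample among them by weight.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words, weights = zip(*top)
    return random.choices(words, weights=weights, k=1)[0]

print(sample_top_k(next_word_probs))  # usually "west", but not always
```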
Now, let’s talk about one way to make this language modeling program. It’s called statistical language modeling. We start with lots of text and learn how words connect with each other. In simple words, a context is a group of words around a specific word in a sentence. For example, in the sentence “the sun rises in the east and sets in the west,” the word “east” is in the context of “the sun rises and sets.” If we look at many sentences, we can find words that are often in the same context. This helps us understand which words are related to each other.
For example, if we see “the capital of France is Paris” and “the capital of the United States is Washington,” we can learn that Paris and France, as well as Washington and the United States, are related. They all have the same relationship: being the capital of a country. We might not know what to call this relationship, but we know it’s the same type.
Statistical language modeling means building a model that, from lots of data, can estimate how likely a word is to appear in a certain context. This doesn’t necessarily mean the model truly understands the words’ meanings. But with enough data, it starts to look like the model can indeed capture some of the semantics. Whether this means the model really understands language is another discussion. But at least it looks like it knows some meanings in different contexts.
3.1 The Simplest Language Model: N-Grams
We’ve been building statistical language models since the early days of AI. The n-gram model is one of the simplest ones, storing the probability of each n-gram’s occurrence. An n-gram is a collection of n words that appear together in common sentences. For example, in a 2-gram model, we count how many times pairs of words appear together in a large corpus, creating a table showing their frequency.
As we increase n to 3, 4, or 5, the table becomes extremely large. Before the deep learning revolution, Google built a massive n-gram model from the entire web, with up to 5-grams. However, since the number of possible 5-word combinations in English is enormous, we only store probabilities for the most common combinations, compressing the table and keeping only the larger counts. This makes our statistical language model an approximation of language.
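Here is a minimal 2-gram model in Python to make the counting idea concrete; the three-sentence corpus is a toy example.

```python
from collections import Counter, defaultdict

# A tiny 2-gram (bigram) model: count which word follows which,
# then turn the counts into probabilities.
corpus = [
    "the sun rises in the east",
    "the sun sets in the west",
    "the moon rises in the east",
]

counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def next_word_probs(word):
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_probs("sun"))  # {'rises': 0.5, 'sets': 0.5}
print(next_word_probs("the"))  # 'sun' and 'east' are more likely than 'west' or 'moon'
```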
This simple model counts how often words appear together within a specific window. It’s very explicit, as each n-gram has its probability or frequency recorded. To compress the model further, we use embeddings: representing discrete objects in a continuous space. For instance, words can be represented as vectors in a 300-dimensional space.
Embeddings aim to transform semantic properties from the original space into numerical properties of the embedding space. In the case of words, we want those that occur together in context to map to similar vectors and cluster in the embedding space where they’re often used together.
Word2Vec, in 2013, was the first massively successful use of embeddings. Its authors trained a large embedding model using statistics from text all over the internet and discovered an amazing property: directions in the embedding space can encode semantic properties.
For instance, if you go from France to Paris, the same vector needed to add to the word France to reach Paris is similar to the vector needed to add to the word United States to reach Washington. This showed that the semantic property is-capital-of was encoded as a specific direction in this space. Many other semantic properties were found encoded this way too.
This was an early example of how encoding words in a dense vector space can capture some of their semantics.
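A toy numerical sketch of that “directions encode meaning” idea follows. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions and are learned from huge corpora rather than written by hand.

```python
import numpy as np

# Invented toy "embeddings"; real ones are learned, not hand-written.
vec = {
    "france":     np.array([0.9, 0.1, 0.0]),
    "paris":      np.array([0.9, 0.1, 1.0]),
    "usa":        np.array([0.1, 0.9, 0.0]),
    "washington": np.array([0.1, 0.9, 1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The "is-capital-of" direction: capital minus country.
direction_fr = vec["paris"] - vec["france"]
direction_us = vec["washington"] - vec["usa"]
print(cosine(direction_fr, direction_us))  # 1.0 here: the same direction

# Analogy arithmetic: france -> paris as usa -> ?
guess = vec["usa"] + direction_fr
print(cosine(guess, vec["washington"]))  # highest similarity among our toy words
```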
3.2 Contextual Word Embeddings
The issue with Word2Vec is that it assigns a unique vector to each word, regardless of context. Since words have different meanings in different contexts, many attempts were made to create contextual embeddings instead of static ones. The most successful is the transformer architecture, with BERT among the first widely adopted examples. The first transformer paper revolutionized natural language processing (NLP) in artificial intelligence, providing a single tool to tackle a wide range of NLP problems.
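As a small, hedged illustration of what “contextual” means in practice, here is a sketch using the Hugging Face transformers library with a BERT checkpoint (this assumes transformers and PyTorch are installed; the sentences are arbitrary). The same word, “bank,” receives a different vector in each context.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v_money = embedding_of("she deposited the check at the bank", "bank")
v_river = embedding_of("they sat on the bank of the river", "bank")
print(torch.cosine_similarity(v_money, v_river, dim=0))  # related, but not identical, vectors
```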
The transformer generates a text representation, or embedding, that takes into account the entire content of a sentence or even a larger fragment of text. This means each word’s embedding changes according to its context. Additionally, a global embedding for an entire sentence or paragraph can be computed. Why does this matter? It connects to our previous discussion on vector representations and neural networks.
Neural networks are among the most powerful machine learning paradigms we have. By using a single representation, we can find embeddings for text, images, audio, categories, and programming code. This enables machine learning across multiple domains using a consistent approach.
With neural networks, you can transform images to text, text to image, text to code or audio, and so on. The first idea of the transformer was to take a large chunk of text, obtain an embedding, and then use a specific neural network for tasks like text classification or translation. Sequence-to-sequence architectures then went further, allowing a neural network to receive a chunk of text, embed it into a real-valued vector, and generate a completely different chunk of text.
For example, in translation, you can encode a sentence in English with a transformer that embeds it into a real-valued vector and then decode it with another transformer that “speaks” French. The real-valued vector in the middle represents the meaning of the text independently of language. So you can have different encoders and decoders for various languages and translate any language pair.
One cool aspect is that you can train on pairs of languages like English-Spanish and German-French and then translate from English to French without ever training on that specific pair. This is due to the internal representation used by all languages. The sequence-to-sequence transformer is a fundamental piece behind technologies like ChatGPT. The next step is training it on massive amounts of text and teaching it to generate similar text.
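To round off the translation example, here is a short sketch with the Hugging Face pipeline API, assuming the library is installed and using “t5-small” as one small public checkpoint that supports English-to-French translation.

```python
from transformers import pipeline

# An encoder-decoder (sequence-to-sequence) transformer applied to translation.
translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("The sun rises in the east and sets in the west.")
print(result[0]["translation_text"])  # a French rendering of the sentence
```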
3.3 Large Language Models
Large language models are the latest development in statistical language modeling, evolving from n-gram models, embeddings, and transformers. These advanced architectures can compute contextual embeddings for very long text contexts, thanks to innovations that efficiently fit thousands of words in memory. This capacity has grown continuously: the first version of ChatGPT held something like 4,000 words of context, and Google Gemini recently claimed to handle over 1 million.
A significant change is the scale of data these models are trained on. BERT was trained on a vast dataset for its time, but it pales in comparison to GPT-2, 3, and 4. Large language models learn from a massive amount of internet text, including technical texts, books, Wikipedia articles, blog posts, social media, news, and more. This exposure to diverse text styles and content allows them to understand various mainstream languages.
Large language models, like GPT-2, generate text by predicting the next word in a sentence or paragraph, just like all previous language models. This standard technique can complete simple tasks, such as finishing a paragraph. As the scale increases, these models become more creative. For example, they can produce coherent narratives about, say, a group of scientists discovering unicorns in the Peruvian mountains.
However, with GPT-3’s size, emerging capabilities like “in-context learning” appear. This means the model can complete tasks based on provided examples. For instance, you can give English-to-French translation examples and the model will generate a French translation for a new English sentence. This works for various tasks such as summarization, translation, and general knowledge question answering.
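In-context learning looks something like the following prompt; the examples are illustrative, and exact completions vary from model to model.

```python
# The task is defined only by the examples inside the prompt itself.
prompt = """Translate English to French.

English: Good morning.
French: Bonjour.

English: Where is the train station?
French: Où est la gare ?

English: I would like a coffee, please.
French:"""

# A sufficiently large language model typically continues this with something
# like "Je voudrais un café, s'il vous plaît." with no translation-specific training.
```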
The GPT-3 paper, “Language Models are Few-Shot Learners,” demonstrates how GPT-3 can be turned into a question-answering assistant by using proper prompts and context. The cool thing is that this starts to turn language models from completion machines on steroids into fully-fledged chatbots. But we need something else to get all the way there.
The next phase, instruction fine-tuning, involves taking a pre-trained language model like GPT-3 and further fine-tuning it to follow instructions. This way, the model is trained to follow instructions and provide solutions or responses to them. You can fine-tune the model with as few as 50,000 to 100,000 instructions covering various tasks, from text normalization to generating code. So instead of simple completion, the model now expects a question or an instruction and completes it with the answer.
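An instruction-tuning dataset is, at heart, a long list of (instruction, response) pairs. The records below are invented to show the general shape of such data, not taken from any real dataset.

```python
# Illustrative records in the style of an instruction-tuning dataset.
instruction_data = [
    {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "Machine learning lets computers infer rules from examples...",
        "response": "Machine learning infers rules from data instead of being explicitly programmed.",
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "input": "",
        "response": "def reverse(s):\n    return s[::-1]",
    },
]
# Fine-tuning on many pairs like these teaches the model to treat its input as
# an instruction to follow rather than a text to merely continue.
```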
The final step is reinforcement learning from human feedback (RLHF), which helps steer the model towards answers that align with certain principles and values. This is necessary because there are many ways to answer the same instruction, each with different characteristics. Some answers may be shorter or longer, more chatty or more focused, inquisitive or direct, and potentially biased or even racist.
With RLHF, you can guide the model towards responses that better suit your users’ principles and values. To create an instruction set that encodes values, have the model produce, say, 10 answers for each question. A human evaluator assesses these responses based on friendliness, depth, informativeness, interest, language quality, politeness, potential biases, etc. Using reinforcement learning, we can teach the model to evaluate its own responses based on these dimensions and generate answers that align with user intentions and values. This process turns ChatGPT into a safe, friendly, and polite assistant that refuses to engage in harmful or discriminatory tasks.
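The human feedback itself can be pictured as ranked comparisons like the record below; the structure and answers are invented for illustration. In practice, a reward model is trained on many such comparisons and then used to steer the language model with reinforcement learning.

```python
# One illustrative human-preference record used to encode values and preferences.
preference_record = {
    "prompt": "Explain what an n-gram is.",
    "responses": [
        "An n-gram is a sequence of n consecutive words; counting n-grams gives a simple language model.",
        "idk, just google it",
        "An n-gram is a thing in NLP.",
    ],
    "human_ranking": [0, 2, 1],  # indices of responses, from best to worst
}
```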
Of course, this doesn’t solve all problems. Alignment remains one of the hardest challenges in computer science. However, it transforms ChatGPT from mostly unusable to mostly usable for daily users without exhibiting extreme behaviors.
Reaching ChatGPT required a combination of clever engineering to scale a well-known architecture with massive data and compute power, fine-tuning for instruction-following and alignment with developer and user values, and strong marketing and design to create a user-friendly tool. This is just the beginning of LLMs, however, with many more innovations expected in the future. The core ideas behind its machine learning paradigm provide a solid foundation for ongoing improvements.
4. Language Learning: Humans vs. Machines
Regarding the differences between how large language models and humans learn language, a few key points stand out. First, the scale is vastly different - large models require billions of training examples to generate grammatically correct sentences, while human children need far fewer. This has sparked debate in linguistics about nature versus nurture in language acquisition. Is there a pre-programmed structure in the brain for learning grammar, or is it all learned during childhood? Although I’m not an expert in linguistics, we know that machine learning models with built-in biases for specific problems can learn more easily and with less data. It makes sense that there might be similar architectural designs in the human brain.
However, we also know that backpropagation, the learning algorithm used in large models, is biologically implausible. The neural networks in our brains function differently from those in artificial models. This fundamental difference in learning algorithms could affect how much these models truly understand.
Another important distinction is symbolic grounding. When humans learn language, they not only learn relationships between words but also connections to real-world objects. Large models lack this grounding, which could affect their understanding of language.
Symbols, like the spoken sound “cat” or the written word “cat,” connect to real-life anchors, such as an actual cat. Grounding is crucial in language acquisition because children first learn concepts they experience daily, like mom, dad, food, and home. Only later do they make connections with abstract ideas, like values and metacognition, that can’t be directly tied to experiences.
So, humans learn language while interacting with the real world. Language is grounded in physical experiences and other sensory inputs like images, touch, sound, and taste. This contrasts with large language models that only learn correlations between words without grounding meanings in experiential things.
For example, when you say “the apple is red” to a large language model, it recognizes this as a likely true sentence due to context. However, for a human, the same phrase connects the abstract symbols with experiences of what an apple looks like, tastes like, and feels like. This shows that language models reason differently than humans when it comes to real-world concepts.
One could argue that humans can reason about purely symbolic things, like math. Even though numbers might be grounded in physical notions of quantity, abstract fields of math involve human reasoning by manipulating symbols and learning correlations between them. In this sense, there may be a case for language models to be able to reason similarly in certain contexts. Language reasoning, in its limited form, involves understanding how words, sentences, and contexts relate and appear together. This provides a certain level of linguistic comprehension. However, this understanding differs greatly from that of humans or even animals with primitive understanding.
5. Conclusion
To extend Alejandro's insights to an AI-responsive writing curriculum, we must delve deeper into the fundamental differences between human and machine learning processes. Human learning is a multifaceted journey that involves the active construction of knowledge through social interactions, cognitive development, and the formation of neural connections in the brain.
It is a process that is deeply rooted in our experiences, emotions, and the cultural contexts in which we live. Machine learning, in contrast, relies on the processing of vast amounts of data through complex algorithms and computational power to identify patterns, generate outputs, and make predictions. While machine learning has made remarkable strides in recent years, it is essential to recognize that it operates on a fundamentally different level than human learning.
To design an effective AI-responsive writing curriculum, we must place context and grounding at the forefront. Human language learning is intricately tied to our real-world experiences and sensory inputs, allowing us to form deep, symbolic connections between words and their meanings. When we learn a new word, we associate it with the sights, sounds, smells, and feelings that we have encountered in our lives.
This experiential grounding is what gives human language its richness, nuance, and depth. Machine learning models, on the other hand, learn from vast corpora of text data but lack the experiential grounding that is so crucial to human understanding. While they can identify patterns and generate coherent outputs, they do not have the same level of symbolic understanding that humans possess. An AI-responsive writing curriculum must therefore encourage students to explore the importance of context and grounding in language understanding, while also examining the limitations of AI models in this regard.
Moreover, human cognition is uniquely capable of symbolic reasoning and abstraction. We have the ability to take the concrete experiences and sensory inputs that we encounter in the world and transform them into abstract concepts and ideas. This is what allows us to engage in higher-order thinking, to analyze complex problems, and to generate creative solutions.
An AI-responsive writing curriculum should place a strong emphasis on developing these critical thinking skills, encouraging students to engage with language and ideas in ways that complement and extend the capabilities of AI. By fostering symbolic reasoning and abstraction, we can help students to become more effective communicators, problem-solvers, and creative thinkers.
Another key insight that emerges from Alejandro's analysis is the potential for collaboration between human and machine intelligence. While human and machine learning operate on fundamentally different levels, there is enormous potential for synergy between the two. Machine learning models can process vast amounts of data and generate outputs at a speed and scale that is far beyond human capabilities.
At the same time, human intelligence brings a level of context, nuance, and symbolic understanding that is essential for effective communication and problem-solving. An AI-responsive writing curriculum should explore how human and machine intelligence can work together in complementary ways, leveraging the strengths of each to enhance learning and creativity.
As machine learning models become increasingly sophisticated, issues of alignment with human values and principles become ever more pressing. While AI has the potential to transform many aspects of our lives, it is crucial that we ensure that these technologies are developed and deployed in ways that are consistent with our ethical principles and values.
An AI-responsive writing curriculum must engage students in deep, thoughtful discussions about the ethical implications of AI, encouraging them to consider the social and moral dimensions of these technologies. By fostering a strong sense of ethical reasoning and responsibility, we can help to ensure that the development and deployment of AI is guided by a commitment to human values and the greater good.
Ultimately, by weaving together these insights, we can create an AI-responsive writing curriculum that empowers students to navigate the complex landscape of language and learning in the age of artificial intelligence. Through a deep understanding of the interplay between human and machine cognition, and a commitment to developing critical thinking, ethical reasoning, and creative expression, we can cultivate a generation of learners who are well-equipped to harness the potential of AI while remaining grounded in the richness of human experience.
Thank you, Alejandro, for your brilliant insights that have illuminated the path forward in designing an AI-responsive writing curriculum. Your contributions have been invaluable in shaping our understanding of the complex relationship between human and machine learning, and in helping us to envision a future in which the power of AI is harnessed in service of human flourishing.
As we move forward in this exciting and rapidly-evolving field, let us continue to be guided by a deep commitment to the values of curiosity, creativity, and compassion, and to the belief that by working together, we can create a brighter future for all.