What are Large Language Models?
This is the first of a multi-part series exploring exciting new developments in the field of AI. It is intended for anyone who is trying to get a deeper understanding of the ins and outs of large language models and their implications. We'll explore the following:
- What are large language models (LLMs) and diffusion models?
- Current landscape of LLMs
- Use cases for LLMs
- Recent history of LLMs
- What's next? What does this mean for the world at large? What does this mean for software, product, data?
Part 1: What are large language models?
Basics
Before we get into large language models, we'll go over some basics. The terms machine learning and artificial intelligence are often used interchangeably but have some differences. Artificial intelligence is any technique that enables machines to mimic human behavior. Machine learning is the process of training machines to learn from data without explicit instructions or programming. Within machine learning there are multiple subfields, including classical machine learning and deep learning. For the purposes of this series, we'll focus on deep learning: machine learning that uses algorithms called neural networks trained on huge amounts of data. Neural networks, as the name suggests, were inspired by the structure and function of the human brain.
Any machine learning project includes the following three pillars:
- Data
- Models
- Infrastructure
As the field of machine learning has evolved, each pillar has grown more sophisticated. For example, in infrastructure, beyond the compute needed to train and run a model, there's a burgeoning area of MLOps that provides insight into ML models, manages deployments, offers monitoring capabilities, enables "explainable AI", and so forth.
It's akin to what has already happened to software over the last few decades and to hardware in the decades before that. Modern software and hardware are instrumented with "sensors" that provide insight into how they are performing.
Back to Large Language Models
Okay, now that we have some of the basics covered, let's move on to the exciting stuff: what are large language models (LLMs for short)?
LLMs are neural networks (a type of deep learning model) that use the transformer architecture and are trained on a very large amount of data. There are many different kinds of neural networks (feedforward networks, convolutional neural networks, recurrent neural networks, and more). Transformers are the most recent kind.
In case you are interested in understanding neural networks better, here's a detailed analogy for how neural networks work courtesy of an LLM:
Fruit basket analogy
Let's use the analogy of a group of people trying to guess the weight of a fruit basket to explain how a simple neural network works.
Input Layer: The fruit basket contains different types of fruits, such as apples, oranges, and bananas. Each type of fruit represents an input in the neural network.
Hidden Layer(s): Each person in the group uses their own method to estimate the weight of the fruit basket based on the number of each type of fruit. These people represent the neurons in the hidden layer(s) of the neural network.
Activation Functions: Each person decides whether or not to share their estimate based on their level of confidence. In the neural network, this decision-making process is represented by activation functions.
Weights and Biases: The group assigns different levels of importance to each person's estimate based on their past accuracy. In the neural network, these levels of importance are represented by weights and biases.
Output Layer: The group combines everyone's estimates and comes up with a final guess for the weight of the fruit basket. In the neural network, this is represented by the output layer, which combines the information from the hidden layer(s) to produce the final prediction.
Training and Learning: To improve their guessing accuracy, the group learns from past experiences and adjusts the importance they assign to each person's estimate. In the neural network, this learning process involves adjusting the weights and biases to minimize the error between the network's predictions and the actual outcomes.
In summary, a simple neural network works much like a group of people guessing the weight of a fruit basket. They process the input information (types of fruit), apply their own judgment (activation functions) and importance (weights and biases), and combine their estimates to make a final prediction. The network continuously learns and improves its performance over time.
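The analogy maps directly onto code. Here's a minimal sketch of a one-hidden-layer network making a single prediction, in plain Python. All of the numbers (fruit counts, weights, biases) are made up for illustration, not learned values:

```python
# Inputs: counts of each fruit type in the basket (apples, oranges, bananas)
inputs = [3.0, 2.0, 5.0]

# Hidden layer: two "people", each with their own per-fruit weights and a bias
hidden_weights = [[0.2, 0.3, 0.15], [0.25, 0.28, 0.12]]
hidden_biases = [0.1, -0.05]

# Output layer: how much the group trusts each person's estimate
output_weights = [0.6, 0.4]

def relu(x):
    # Activation function: only "share the estimate" if it's positive
    return max(0.0, x)

def forward(inputs):
    # Each hidden neuron computes a weighted sum of the inputs plus its bias,
    # then passes it through the activation function
    hidden = []
    for w, b in zip(hidden_weights, hidden_biases):
        z = sum(wi * xi for wi, xi in zip(w, inputs)) + b
        hidden.append(relu(z))
    # The output layer combines the hidden estimates into a final guess
    return sum(wo * h for wo, h in zip(output_weights, hidden))

prediction = forward(inputs)  # a single scalar guess for the basket's weight
```

Training would then compare `prediction` against the basket's actual weight and nudge the weights and biases to shrink the error, which is the adjustment step the analogy describes.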
There are many different types of neural networks, and depending on the task to be accomplished, different architectures might be better suited to it. The most dominant type in computer vision, for example, is the CNN.
Machine learning models are typically organized by task and domain. For example, in computer vision, for the task of image classification, Google's "vit-base" might be the way to go. In natural language processing, for the task of question answering, the model "roberta-base" might be a good choice.
Because certain architectures (namely transformers) are better suited to being trained on larger sets of data, they have become good at many different tasks. These are also commonly referred to as foundation models, given that they serve as a solid base for so many tasks. This is what we're seeing with models like GPT-3 and GPT-4, which can handle multiple tasks fairly well out of the box.
So… what are transformers?
The novel concept in transformers is self-attention: every element in a transformer is connected to every other element, unlike prior neural network architectures that process inputs locally or step by step.
Transformers were introduced by researchers at Google in a 2017 paper called "Attention Is All You Need". Their defining characteristic is this attention mechanism, which lets the model learn how much each part of the input matters to every other part.
Using the fruit basket analogy, let's describe how the different types of neural networks differ:
Transformer: Each person can see the entire fruit basket at once and assess the relationships between all the fruits in the basket, regardless of their positions. This allows them to better understand the overall structure and dependencies within the data. Transformers capture both short and long-range dependencies effectively through the self-attention mechanism.
In contrast:
- Feedforward networks focus on individual fruit types
- CNNs consider local patterns
- RNNs and LSTMs process the sequence step-by-step
- GANs involve a generator and discriminator competing to create realistic fruit baskets
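To make the "everyone sees the whole basket at once" idea concrete, here's a minimal sketch of scaled dot-product self-attention in NumPy. It's heavily simplified (a real transformer adds learned query/key/value projections, multiple heads, and positional encodings), and the input values are made up:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: rows become weights that sum to 1
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d)."""
    d = X.shape[-1]
    Q, K, V = X, X, X  # in a real transformer these are learned linear projections of X
    scores = Q @ K.T / np.sqrt(d)       # how strongly each element attends to every other
    weights = softmax(scores, axis=-1)  # attention weights: each row sums to 1
    return weights @ V                  # each output mixes information from the whole sequence

# Toy "sequence" of 3 elements with 4 features each
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0, 1.0]])
out = self_attention(X)  # shape (3, 4); every row depends on all three inputs
```

Note that `scores` compares every element against every other element in one matrix product; that all-pairs comparison is exactly what the step-by-step RNN and local-pattern CNN approaches lack.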
Examples of large language models
The most popular example is GPT, or Generative Pre-trained Transformer. This model powers OpenAI's viral conversational product, ChatGPT.
Other examples include BERT and T5.
A new era
LLMs are just the beginning. There's a lot more on the way as researchers continue to push the frontier of what we imagine is possible.
It really does give new meaning to Bill Gates's popular quote: "We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten".
This post barely scratches the surface of machine learning and large language models, but hopefully it serves as an introduction to what's under the hood of this new tech that is shaking up the world.
Next, we'll explore the current landscape of LLMs and who the major players are.
If you're curious to learn more about neural networks, check out my favorite explainer by 3blue1brown: Neural Networks Series
The original Google blog where it started: Transformer: A Novel Neural Network Architecture
Sources
- Transformers for Natural Language Processing, Dennis Rothman
- Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf