Large Language Model Principles
Introduction
Large Language Models (LLMs) are statistical models designed to interpret and generate text. They produce the “most probable response” given the patterns learned from their extensive training datasets. The underlying technology emerged in 2017 with the Transformer architecture and gained significant prominence with the release of ChatGPT in 2022, which made it widely accessible through a chatbot interface.
LLMs have unlocked a myriad of professional applications, markedly accelerating tasks such as coding, writing, and editing, and opening new possibilities in fields such as text data mining.
The objective of this section is to elucidate the underlying principles of these models, thereby clarifying how they produce their outputs and demystifying their functionality.
Principle
The fundamental principle behind LLMs lies in predicting the subsequent word in a given sentence, as demonstrated in the Figure below. Starting from an initial sequence of words, the model predicts the next word, incorporates it into the text, and continues this process iteratively until the text reaches a logical conclusion.
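To make this iterative loop concrete, here is a minimal sketch in Python; predict_next_token and the END_TOKEN marker are hypothetical placeholders standing in for the model itself.

```python
# Minimal sketch of autoregressive text generation (hypothetical helpers).
END_TOKEN = "<end>"

def predict_next_token(tokens):
    """Placeholder for the model: returns the most probable next token."""
    ...

def generate(initial_tokens, max_length=100):
    tokens = list(initial_tokens)
    while len(tokens) < max_length:
        next_token = predict_next_token(tokens)   # pick the most probable continuation
        if next_token == END_TOKEN:               # the model signals a logical conclusion
            break
        tokens.append(next_token)                 # feed the extended sequence back in
    return tokens
```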

Training
The training of such sophisticated models, capable of encapsulating meaning and forecasting future words, is feasible due to the vast and diverse textual data available on the World Wide Web. Leveraging this extensive corpus, the model learns to predict the continuation of a text based on its initial segment. Throughout this training phase, billions of parameters are meticulously optimized to fulfill this complex task.
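As a rough illustration of this training objective, the sketch below computes a next-token cross-entropy loss on a toy sequence; the vocabulary size, token IDs, and random weight matrices are purely illustrative stand-ins for a real model's billions of parameters.

```python
import numpy as np

# Toy example: next-token prediction loss on one sequence of token IDs.
vocab_size, embed_dim = 50, 16
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(vocab_size, embed_dim))    # token embeddings (illustrative)
W_out = rng.normal(size=(embed_dim, vocab_size))      # projection back to the vocabulary

token_ids = np.array([3, 17, 42, 7, 21])              # a toy training sequence
inputs, targets = token_ids[:-1], token_ids[1:]       # predict each next token

logits = W_embed[inputs] @ W_out                      # (sequence length - 1, vocab_size) scores
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)             # softmax over the vocabulary

loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
print(f"next-token cross-entropy: {loss:.3f}")        # the quantity minimized during training
```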
After the initial training, the model undergoes further specialization, including training to respond to queries, thereby transforming it into an effective chatbot. Additional training steps are undertaken to ensure the quality and relevance of its responses.
Prompt
To guide an LLM chatbot to respond in specific ways, one can provide instructions through a prompt preceding the question. This approach tailors the model’s responses to align with the desired style or content.
Example:
Prompt: “Your responses should be concise, objective, and incorporate geological terminology.”
Question: “Where is lithium found?”
LLM’s Answer: […]
This structure enables the customization of LLM responses, making them adaptable to various thematic contexts and conversational styles.
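In practice, the instructions and the question are simply concatenated into one piece of text that the model continues. The sketch below shows this assembly with plain strings; the ask_llm function is a hypothetical stand-in for whatever chatbot interface is actually used.

```python
def ask_llm(text):
    """Hypothetical stand-in for a call to an LLM chatbot."""
    ...

prompt = "Your responses should be concise, objective, and incorporate geological terminology."
question = "Where is lithium found?"

# The instructions are placed before the question, and the model continues this text.
full_input = f"{prompt}\n\nQuestion: {question}\nAnswer:"
answer = ask_llm(full_input)
```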
From Words to Values
To process text through an LLM, the initial step involves converting words into tokens, as depicted in the Figure below. Typically, a token represents a group of characters, and a single word might consist of one or multiple tokens.
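As an illustration, the toy tokenizer below greedily matches pieces of a word against a small hand-made vocabulary; real tokenizers (byte-pair encoding, for example) learn their vocabulary from data, and the vocabulary shown here is purely hypothetical.

```python
# Toy greedy subword tokenizer: the vocabulary below is purely illustrative.
VOCAB = {"geo", "log", "ical", "lith", "ium", "where", "is", "found", " ", "?"}

def tokenize(text):
    tokens, i = [], 0
    text = text.lower()
    while i < len(text):
        # Greedily take the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB or length == 1:   # fall back to single characters
                tokens.append(piece)
                i += length
                break
    return tokens

print(tokenize("Geological"))   # ['geo', 'log', 'ical']: one word, several tokens
```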

A specialized module then projects these tokens into a continuous vector space. Each token is represented by a point, a vector, within a high-dimensional space (for example, 768 dimensions in early GPT models). This process, known as embedding, captures the semantic essence of each word.
Additionally, positional encoding is applied to these values to reflect the token’s position within the sequence. This modification ensures that the model recognizes the order of words, which is crucial for understanding the context and structure of the text.
As a result, what emerges is a sequence of vectors, one per token, each with as many dimensions as the embedding space. The projection of tokens into this space is learned and refined during the model’s extensive training phase, allowing the LLM to interpret and generate text with a nuanced understanding of language.
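A minimal sketch of these two steps is shown below, using a random embedding table and the sinusoidal positional encoding of the original Transformer; the dimensions and token IDs are illustrative only.

```python
import numpy as np

vocab_size, K = 1000, 768                             # K: embedding dimension (illustrative)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, K))    # learned during training

def positional_encoding(n_tokens, dim):
    """Sinusoidal positional encoding, as in the original Transformer."""
    positions = np.arange(n_tokens)[:, None]
    dims = np.arange(dim)[None, :]
    angles = positions / np.power(10000, (2 * (dims // 2)) / dim)
    enc = np.zeros((n_tokens, dim))
    enc[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions
    enc[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions
    return enc

token_ids = np.array([12, 305, 77, 9])                # a toy token sequence
x = embedding_table[token_ids]                        # (N tokens, K) embedding vectors
x = x + positional_encoding(len(token_ids), K)        # inject word-order information
print(x.shape)                                        # (4, 768)
```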
Attention Mechanism
LLMs adeptly process and interpret the information encoded in a sequence of text. This interpretation hinges on the definitions of words and their interrelationships within the text. LLMs utilize an attention mechanism, as illustrated in Figure below, to analyze and compare words, extracting meaningful patterns and relationships.

For each vector in the sequence, three new vectors are generated through multiplication with the model’s parameters:
A Query, representing what the token is looking for in the other tokens.
A Key, representing what the token offers to the other tokens, against which Queries are matched.
A Value, which carries the information of the original vector to be passed on.
The attention mechanism involves a dot product between Queries and Keys. For a specific token, its Query is multiplied by the Keys of all the tokens (including itself), and this process is replicated for each token in the sequence. The result is an \(N_{\text{tokens}} \times N_{\text{tokens}}\) matrix that encapsulates the relationships between words. After scaling and a softmax normalization, this matrix of attention weights is multiplied by the Values of the vectors.
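A minimal sketch of this computation follows, with random matrices standing in for the model's learned parameters and a single attention head for simplicity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence x of shape (N_tokens, K)."""
    Q, Kmat, V = x @ W_q, x @ W_k, x @ W_v            # queries, keys, values
    scores = Q @ Kmat.T / np.sqrt(Kmat.shape[-1])     # (N_tokens, N_tokens) matrix
    weights = softmax(scores)                         # how much each token attends to the others
    return weights @ V                                # weighted combination of the values

# Illustrative dimensions and random parameters.
rng = np.random.default_rng(0)
N_tokens, K, d = 4, 768, 64
x = rng.normal(size=(N_tokens, K))
W_q, W_k, W_v = (rng.normal(size=(K, d)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)         # (4, 64)
```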
A skip connection then adds the original input vectors to the output of this attention process. This step preserves previously computed information and eases training. The result is then normalized (layer normalization) to keep values in a stable range.
These values are further passed through a small feed-forward neural network, typically two layers, again followed by a skip connection and normalization.
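The sketch below illustrates the skip connection, normalization, and feed-forward part of such a block; the attention output is assumed to have already been projected back to the embedding dimension, and all weights are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward network applied to each token independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2       # ReLU nonlinearity

def encoder_block(x, attention_output, W1, b1, W2, b2):
    h = layer_norm(x + attention_output)              # skip connection + normalization
    return layer_norm(h + feed_forward(h, W1, b1, W2, b2))

# Illustrative usage with random placeholders.
rng = np.random.default_rng(0)
N_tokens, K, hidden = 4, 768, 3072
x = rng.normal(size=(N_tokens, K))
attn_out = rng.normal(size=(N_tokens, K))             # stand-in for the attention output
W1, b1 = rng.normal(size=(K, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, K)), np.zeros(K)
print(encoder_block(x, attn_out, W1, b1, W2, b2).shape)   # (4, 768)
```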
This entire process is iteratively repeated multiple times, as determined by the specific architecture of the model in use. Each iteration enhances the model’s ability to extract and understand the nuanced linguistic patterns present in the input text.
Encoder
After undergoing multiple iterations of the aforementioned process, the “encoder” component of the model produces a refined embedding of the input values. This resulting embedding represents a transformative analysis of the original token values, adeptly extracting and encoding the information intrinsic to their meaning and sequential arrangement. It is illustrated in the Figure below.

This transformation is crucial, as it distills the essence of the text, capturing not just the standalone significance of each word but also the collective context and nuances shaped by their order and interactions.
Decoder
Once the embedding is processed, the LLM’s “decoder” begins to generate an answer, assembling it word by word and leveraging the encoder’s embedding, as illustrated in the Figure below.
Initially, the decoder predicts the first token, selecting the one most likely to commence the response. Subsequently, it predicts the second token, informed by both the first token and the encoder’s embedding. As in the encoder, the previously generated tokens, preceded by a special start token, are passed through the embedding layer before entering the decoder.

The decoder’s architecture includes several attention blocks (as shown in the Attention Block Figure) but with notable differences. It comprises two layers of attention: the first features masked self-attention, allowing each token to attend only to the preceding tokens in the sequence. The output of this layer is then combined with the previous values through a skip connection and normalized.
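Masking is typically implemented by setting the scores of future positions to a very large negative number before the softmax, so that their attention weights become effectively zero; a minimal sketch under that assumption:

```python
import numpy as np

def causal_mask(n_tokens):
    """Upper-triangular mask: position i may only attend to positions <= i."""
    return np.triu(np.ones((n_tokens, n_tokens)), k=1).astype(bool)

def masked_scores(scores):
    """Apply the causal mask to an (N_tokens, N_tokens) matrix of attention scores."""
    masked = scores.copy()
    masked[causal_mask(scores.shape[0])] = -1e9       # future tokens get ~zero weight after softmax
    return masked

scores = np.arange(16, dtype=float).reshape(4, 4)     # toy attention scores
print(masked_scores(scores))
```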
The second layer of attention involves cross-attention with the encoder. Here, the queries from the decoder’s sequence interact with the keys from the encoder’s embedding to calculate attention, which is then applied to the encoder’s embedding values. This output, too, is merged with a skip connection and normalized.
Additionally, the sequence passes through a block of several neural network layers, followed by a skip connection and normalization.
This process iterates multiple times. At the end of the decoder, an embedding sequence encapsulating the answer is generated. A final linear layer then uses the last token’s values to calculate the probabilities for the next token. This layer outputs a vector that assigns a probability to each potential token, indicating its likelihood of being the next choice.
Typically, the token with the highest probability is selected. To introduce variability, the model may randomly choose one of the top probable tokens. Once a token is chosen, it is appended to the answer’s token sequence, which is then processed again by the decoder to predict the subsequent token. The sequence concludes with a special “end” token, signaling the logical completion of the text.
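A minimal sketch of this final step is shown below, converting the last token's scores into probabilities and sampling among the k most probable tokens (top-k sampling); the scores are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_next_token(logits, k=5, rng=np.random.default_rng()):
    """Sample the next token among the k most probable ones (top-k sampling)."""
    probs = softmax(logits)
    top_ids = np.argsort(probs)[-k:]                  # indices of the k most likely tokens
    top_probs = probs[top_ids] / probs[top_ids].sum() # renormalize over the top k
    return rng.choice(top_ids, p=top_probs)

logits = np.array([0.1, 2.3, -1.0, 4.2, 0.7, 3.1])    # arbitrary scores over a 6-token vocabulary
print(sample_next_token(logits, k=3))                 # index of the chosen token
```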
In summary, the LLM predicts each word (token) sequentially, factoring in the preceding words and the information contained in the embedding. This prediction process aims to reflect one of the most likely responses based on the initial input processed by the encoder.
Conclusion
Language is one of the most ancient and fundamental means of storing and transmitting information. It is a complex system formed by an association of sounds, which coalesce into words based on their sequencing. The meanings of these words can vary depending on the context, that is, the other words in the sequence with which they are used. Language enables us to encapsulate high-level concepts, logical structures, and much more.
Large Language Models (represented in the Figure below) embody these intricate concepts and logical structures through statistical representations. They transform diverse textual forms, be it mathematical problems, emails, poems, or geological log descriptions, into a point cloud in a high-dimensional space.

Utilizing this representation, Large Language Models can respond to complex queries within a matter of milliseconds, taking into account the context provided and the words they have already generated.
Large Language Models empower us to efficiently navigate through vast text databases. They can search for specific information and provide comprehensive responses almost instantaneously, without succumbing to fatigue or boredom. This capability represents a significant leap forward in our ability to process and make sense of large volumes of textual data.
References
To learn more about Large Language Models, please refer to the following resources:
Naveed, H., et al. (2023). “A Comprehensive Overview of Large Language Models.”