Updated on Apr 14, 2025

Large Language Model Training in 2025


While using existing LLMs in enterprise workflows is table stakes, leading enterprises are building their custom models. However, building custom models can cost millions and require investing in an internal AI team.

Follow the links based on what needs to change:

  • If you need to change the tone or format of LLM responses, LLM fine-tuning can achieve this starting from a few dollars.
  • If you need to change the information your LLM has access to, RAG can provide the right information at runtime.

If these and other low-cost approaches like prompt engineering don’t fix your problem, learn what it takes to build a custom model, and what the lower-cost alternatives are, before you invest substantial amounts:

LLM training as a service

Outsourcing training can be the fastest LLM training approach, but it can also be the costliest. Given the nascent nature of the market, pricing varies widely between providers:

OpenAI’s service includes not just fine-tuning pre-trained models, which is already available via its API, but also training models from scratch on a customer’s internal data. Interested parties are told that it may cost $2-3M and take several months.1

MosaicML claims to offer similar services for $200-800k within days or weeks.2

If your business has a strong machine learning team, they can also follow the steps below to build a model:

Large language model training

There are four steps to training large language models:

1. Data collection and preprocessing 

The first step is to gather the training dataset. The data can come from various sources such as documents, websites, articles, etc. The biggest advantage of a custom model is that it leverages internal company data; therefore, preparing high-quality proprietary data is the most important step.

Once private data is prepared, it can be enriched with public data in adjacent domains. Popular public sources to find datasets are:

  • Kaggle
  • Google Dataset Search
  • Hugging Face
  • Data.gov
  • Wikipedia database

The data then needs to be cleaned and prepared for training. This may involve converting the text to lowercase, removing stop words, and tokenizing it into sequences of tokens.
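As a minimal illustration of this step, the sketch below cleans a couple of raw documents and tokenizes them with a pre-trained tokenizer from Hugging Face. The model name ("gpt2") and the cleaning steps are illustrative assumptions, not a prescribed pipeline; production preprocessing is usually more involved.

```python
# Minimal preprocessing sketch; tokenizer choice and cleaning steps are illustrative.
import re
from transformers import AutoTokenizer

def clean(text: str) -> str:
    """Lowercase and collapse extra whitespace before tokenization."""
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

raw_documents = [
    "Quarterly revenue grew 12%   compared to last year.",
    "The support team resolved 1,200 tickets in March.",
]

# Any pre-trained tokenizer works here; "gpt2" is just a small, widely available example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

cleaned = [clean(doc) for doc in raw_documents]
encoded = tokenizer(cleaned, truncation=True, max_length=512)

print(encoded["input_ids"][0])                                    # token IDs for the first document
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))   # the same IDs as readable tokens
```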

2. Model selection and configuration

Large models such as Google’s Gemini and OpenAI’s GPT-4 are reportedly based on the transformer deep learning architecture, with multiple expert sub-models collaborating in a Mixture-of-Experts (MoE) approach. Some key elements of the model, such as:

  • Number of experts
  • Number of layers in transformer blocks
  • Number of attention heads
  • Loss function
  • Hyperparameters

need to be specified when configuring a transformer neural network. The configuration depends on the desired use case and the training data, and it directly influences the training time of the model.
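As a rough sketch of what such a configuration looks like in code, the snippet below defines a small decoder-style transformer configuration. The field names and values are illustrative assumptions, not a recipe for any particular production model.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Architectural choices and hyperparameters fixed before training starts.
    vocab_size: int = 32_000       # size of the tokenizer vocabulary
    n_layers: int = 24             # number of transformer blocks
    n_heads: int = 16              # attention heads per block
    d_model: int = 2048            # hidden (embedding) dimension
    d_ff: int = 8192               # feed-forward inner dimension
    n_experts: int = 8             # experts per MoE layer (1 = dense, non-MoE model)
    context_length: int = 4096     # maximum input sequence length in tokens
    dropout: float = 0.1           # regularization hyperparameter

config = TransformerConfig()
print(config)
```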

3. Model training

The model is trained on the pre-processed text data using self-supervised next-token prediction. During training, the model is presented with a sequence of words and is trained to predict the next word in the sequence. The model adjusts its weights based on the difference between its prediction and the actual next word. This process is repeated millions of times until the model reaches a satisfactory level of performance.
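Conceptually, a single training step looks like the PyTorch sketch below: shift the token sequence by one position, compare the model’s predictions with the actual next tokens, and adjust the weights to reduce the error. The model and optimizer here stand in for any decoder-only language model; the names are placeholders.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch: torch.Tensor) -> float:
    """One next-token prediction step on a batch of token IDs shaped (batch, seq_len)."""
    inputs = batch[:, :-1]             # tokens the model sees
    targets = batch[:, 1:]             # the actual "next word" at every position
    logits = model(inputs)             # (batch, seq_len - 1, vocab_size)

    # Cross-entropy between the predicted distribution and the actual next token.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()                    # weights move in proportion to the prediction error
    optimizer.step()
    return loss.item()
```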

Since both the models and the datasets are enormous, training requires immense computational power. To decrease training time, a technique called model parallelism is used. Model parallelism spreads different parts of a large model across multiple GPUs, allowing the model to be trained in a distributed manner on AI chips.

By dividing the model into smaller parts, each part can be trained in parallel, resulting in a faster training process than training the entire model on a single GPU or processor. This leads to faster convergence and makes it possible to train even larger language models than before. Common types of parallelism include (see the data-parallel sketch below the list):

  • Data parallelism splits the training mini-batches across model replicas, increasing processing speed.
  • Pipeline parallelism assigns separate layers of the model to different GPUs, extending model size beyond a single GPU.
  • Tensor parallelism splits a single layer across many GPUs, usually inside the same server.

Source: AWS3
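Of these, data parallelism is the simplest to set up. The sketch below shows roughly how a model can be wrapped with PyTorch’s DistributedDataParallel so that each GPU trains a full replica on its own shard of the data; the launch command and variable names are illustrative. Pipeline and tensor parallelism typically require frameworks such as Megatron-LM or DeepSpeed and more invasive changes to the model code.

```python
# Hedged sketch of data parallelism with PyTorch DistributedDataParallel (DDP).
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_data_parallel(model: torch.nn.Module) -> DDP:
    """Wrap a model so every GPU trains a full replica on its own data shard."""
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun for each process
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    # Gradients are averaged across replicas after every backward pass,
    # so all copies of the model stay in sync.
    return DDP(model, device_ids=[local_rank])
```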

Training a large language model from the ground up requires significant investment; a more economical alternative is to fine-tune an existing language model to tailor it to your specific use case. OpenAI’s CEO, Sam Altman, stated that the cost of training GPT-4 exceeded $100 million.4

4. Evaluation and fine-tuning

After training, the model is evaluated on a test dataset that was not used during training to measure its performance. Based on the evaluation results, the model may require some fine-tuning: adjusting its hyperparameters, changing the architecture, or training on additional data to improve its performance.
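One common evaluation metric for language models is perplexity on the held-out test set, computed from the average next-token loss. The sketch below shows the idea; the model and data loader are placeholders, and real evaluations usually also include task-specific benchmarks.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, test_loader) -> float:
    """Exponentiated average next-token loss on held-out data (lower is better)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in test_loader:                      # batches of token IDs shaped (batch, seq_len)
        logits = model(batch[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            batch[:, 1:].reshape(-1),
            reduction="sum",                       # sum so we can divide by total token count
        )
        total_loss += loss.item()
        total_tokens += batch[:, 1:].numel()
    return math.exp(total_loss / total_tokens)
```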

For example, reinforcement learning from human feedback (RLHF) is a common technique for fine-tuning models. RLHF helps with alignment, steering the LLM’s output toward human values and preferences.

Training LLMs for specific use cases

Training of an LLM consists of two parts: pre-training and task-specific training. Task-specific training is also called LLM fine-tuning.

Pre-training is the part of training that enables the model to learn the general rules and dependencies of a language. This requires a significant amount of data, as well as:

  • Computational power: supercomputer-scale systems built with hardware from leading AI chip makers (e.g. NVIDIA). Once maintenance and power costs are added, pre-training a large language model is an investment on the order of millions of dollars.
  • Time: GPT-4 training reportedly took about half a year.
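A rough sense of why comes from a common back-of-envelope estimate of training compute: roughly 6 floating-point operations per parameter per training token. The sketch below turns that into GPU-hours and cost; every number here is an illustrative assumption, not a measurement of any specific model.

```python
# Back-of-envelope pre-training cost estimate (all inputs are illustrative assumptions).
params = 175e9               # model parameters
tokens = 1e12                # training tokens
flops = 6 * params * tokens  # common approximation: ~6 FLOPs per parameter per token

gpu_flops = 300e12           # assumed sustained throughput per GPU, in FLOP/s (varies widely)
gpu_hours = flops / gpu_flops / 3600

cost_per_gpu_hour = 2.0      # assumed cloud price in USD
print(f"{gpu_hours:,.0f} GPU-hours, roughly ${gpu_hours * cost_per_gpu_hour:,.0f}")
# -> on the order of a million GPU-hours and a seven-figure compute bill
```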

To make large language models more accessible for enterprises, most LLM developers and LLMOps platform providers offer fine-tuning services for businesses looking to leverage language models.

In addition, hardware providers are also becoming active in this domain. NVIDIA’s NeMo is an example of these services, offering pre-trained LLMs that can be fine-tuned and trained on specific tasks to suit specific use cases.

This task-specific training adds an additional layer to the model that requires much less data, computing power, and time to train, making large models accessible for enterprise use.
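A widely used version of this idea is parameter-efficient fine-tuning. The sketch below adds small trainable LoRA adapter weights on top of a frozen pre-trained model using the Hugging Face peft library; the base model name is a placeholder, and this illustrates the general approach rather than the specific method behind any vendor’s service.

```python
# Hedged sketch: parameter-efficient fine-tuning with LoRA (Hugging Face peft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_name = "gpt2"    # placeholder; swap in any causal LM you have access to

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Only the small adapter matrices are trained; the pre-trained weights stay frozen.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the full parameter count
```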


What is the architecture of large language models?

The architecture of large language models, such as OpenAI’s GPT-4, is based on a type of deep learning called the Transformer architecture. It consists of the following main components (see Figure 1):

Figure 1: Transformer architecture 

Source:5

1. Input embedding

 The input sequence is first transformed into a dense vector representation, known as an embedding, which captures the relationships between words in the input.

2. Multi-head self-attention

The core component of the transformer block architecture is the multi-head self-attention mechanism, which allows the model to attend to different parts of the input sequence to capture its relationships and dependencies.

3. Feed-forward network

After the self-attention mechanism, the output is fed into a feed-forward neural network, which performs a non-linear transformation to generate a new representation.

4. Normalization and residual connections

To stabilize the training process, the output of each layer is normalized, and a residual connection adds the block’s input back to its output, which helps information and gradients flow through the deep network during training.

These components are repeated several times to form a deep neural network, which can process long sequences of text and generate high-quality outputs for various language tasks, such as text generation, question answering, and translation.
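Putting these pieces together, the sketch below implements a minimal transformer block in PyTorch with multi-head self-attention, a feed-forward network, layer normalization, and residual connections. The dimensions are illustrative, and many production details (causal masking, positional encodings, MoE routing) are omitted.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal transformer block: self-attention + feed-forward, each with norm and residual."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention with a residual connection and normalization.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network, again with residual and normalization.
        return self.norm2(x + self.ff(x))

# Embedded input: (batch, sequence length, model dimension); blocks like this are stacked many times.
tokens = torch.randn(2, 16, 512)
print(TransformerBlock()(tokens).shape)   # torch.Size([2, 16, 512])
```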

Developers continue to improve large language models by implementing new techniques to:

  • Simplify the model (decrease the model size or memory required to train),
  • Improve performance,
  • Lower cost,
  • Decrease model training time.

What are the top large language models by parameter size?

We compiled the 10+ LLMs by parameter size in the table below.6

Last updated: 09-10-2024

Model | Developer | Parameter size
Claude 3 | Anthropic | 2T
GPT-4 | OpenAI | 1.8T
WuDao 2.0 | Beijing Academy of Artificial Intelligence | 1.75T
Gemini | Google | 1T
MT-NLG | Nvidia and Microsoft | 530B
NeMo LLM | Nvidia | 530B
Llama 3.1 | Meta | 405B
PaLM 2 | Google | 340B
Falcon | Technology Innovation Institute | 180B
Bloom | Hugging Face and BigScience | 176B
OPT | Meta | 175B
GPT-3 | OpenAI | 175B
LaMDA | Google | 137B
DBRX | Databricks and Mosaic | 132B

Check our article on large language model examples for more models with in-depth information.

FAQ

What is a large language model?

Large language models (LLMs) took the internet by storm at the end of 2022 as ChatGPT from OpenAI reached 1 million users just 5 days after its launch. ChatGPT’s capabilities and wide range of applications can be attributed to the scale of the models behind it; GPT-4 reportedly has about 1.8T parameters.

A large language model is a type of machine learning model that is trained on a large corpus of text data to generate outputs for various natural language processing (NLP) tasks.

Large language models are typically based on deep learning neural networks such as the Transformer architecture and are trained on massive amounts of text data, often involving billions of words.

If you are new to large language models, check our “Large Language Models: Complete Guide”.

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
Mert Palazoglu is an industry analyst at AIMultiple focused on customer service and network security with a few years of experience. He holds a bachelor's degree in management.
