While using existing LLMs in enterprise workflows is table stakes, leading enterprises are building their own custom models. However, building a custom model can cost millions and requires investing in an internal AI team.
Follow the links based on what needs to change:
- The tone or format of LLM responses: LLM fine-tuning can achieve this starting from a few dollars.
- The information your LLM has access to: RAG can provide the right information at runtime.
If these and other low-cost approaches like prompt engineering don’t solve your problem, learn what it takes to build a custom model before you invest substantial amounts:
LLM training as a service
Outsourcing training can be the fastest LLM training approach, but it can also be the costliest. Given the nascent nature of the market, there is large variation in pricing among providers:
- OpenAI’s service includes not just fine-tuning pre-trained models, which is already available via its API, but also training models from scratch on internal data. Interested parties are told that it may cost $2-3M and take several months.1
- MosaicML claims to offer similar services for $200-800k within days or weeks.2
If your business has a strong machine learning team, they can also follow the steps below to build a model:
Large language model training
There are four steps to training large language models:
1. Data collection and preprocessing
The first step is to gather the training dataset. The data can come from various sources such as documents, websites, and articles. The biggest advantage of a custom model is that it leverages internal company data, so preparing high-quality proprietary data is the most important step.
Once private data is prepared, it can be enriched with public data in adjacent domains. Popular public sources to find datasets are:
- Kaggle
- Google Dataset Search
- Hugging Face
- Data.gov
- Wikipedia database
The data then needs to be cleaned and prepared for training. This may involve converting the text to lowercase, removing stop words, and tokenizing it into sequences of tokens.
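As an illustration, here is a minimal preprocessing sketch in Python. The cleaning rules and the tiny stop-word list are illustrative assumptions rather than a prescribed pipeline; production LLM pipelines typically rely on subword tokenizers (e.g. BPE) instead of word-level splitting.

```python
import re

# Illustrative stop-word list; real pipelines usually take one from a library such as NLTK.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def preprocess(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, drop stop words, and split into tokens."""
    text = text.lower()                          # convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # remove punctuation and symbols
    tokens = text.split()                        # naive whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The model adjusts its weights, based on the prediction error."))
# ['model', 'adjusts', 'its', 'weights', 'based', 'on', 'prediction', 'error']
```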
2. Model selection and configuration
Large models such as Google’s Gemini and OpenAI’s GPT-4 are reportedly based on the transformer deep learning architecture, with multiple expert models collaborating in a Mixture-of-Experts (MoE) approach. Some key elements of the model, such as:
- Number of experts
- Number of layers in transformer blocks
- Number of attention heads
- Loss function
- Hyperparameters
need to be specified when configuring a transformer neural network. The configuration depends on the desired use case and the training data, and it directly influences the model’s training time.
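As a rough sketch, such a configuration can be captured in a plain Python dataclass. The field names and values below are illustrative assumptions, not the settings of any production model.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    vocab_size: int = 32_000      # tokenizer vocabulary size
    n_experts: int = 8            # experts per MoE layer (1 = dense model)
    n_layers: int = 24            # number of transformer blocks
    n_heads: int = 16             # attention heads per block
    d_model: int = 2_048          # embedding / hidden dimension
    d_ff: int = 8_192             # feed-forward inner dimension
    loss: str = "cross_entropy"   # next-token prediction objective
    learning_rate: float = 3e-4   # example hyperparameter
    batch_size: int = 1_024       # example hyperparameter (sequences per step)

print(TransformerConfig())
```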
3. Model training
The model is trained on the pre-processed text data using supervised learning. During training, the model is presented with a sequence of words and is trained to predict the next word in the sequence. The model adjusts its weights based on the difference between its prediction and the actual next word. This process is repeated millions of times until the model reaches a satisfactory level of performance.
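The sketch below shows a single training step of this next-token objective using a toy PyTorch transformer; the model size, random token batch, and hyperparameters are placeholders chosen only to keep the example small and runnable.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 128, 32

class TinyLM(nn.Module):
    """Toy language model: embed tokens, apply transformer layers, predict the next token."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        L = tokens.size(1)
        # Causal mask so each position only attends to earlier tokens.
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)                       # logits over the vocabulary at every position

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (4, seq_len))   # a batch of 4 token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]       # predict token t+1 from tokens up to t

logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()        # adjust weights based on the gap between prediction and the actual next token
optimizer.step()
optimizer.zero_grad()
```

Real pre-training repeats this step over billions of tokens, adding learning-rate schedules, gradient clipping, and regular checkpointing.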
Since the models and datasets are very large, training requires immense computational power. To decrease training time, a technique called model parallelism is used. Model parallelism spreads different parts of a large model across multiple GPUs, allowing the model to be trained in a distributed manner on AI chips.
Dividing the model into smaller parts allows each part to be trained in parallel, which is faster than training the entire model on a single GPU or processor and makes it possible to train even larger language models than before. Common types of model parallelism include (see the sketch after the figure below):
- Data parallelism splits and transmits the training mini-batches to model replicas, increasing processing speed.
- Pipeline parallelism assigns separate layers of the model to different GPUs, to extend model size beyond a single GPU.
- Tensor parallelism splits a single layer across many GPUs, usually inside the same server.

Figure: Model parallelism approaches (Source: AWS3)
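To make the first approach concrete, here is a minimal data-parallel training sketch using PyTorch’s DistributedDataParallel. The stand-in linear model, dummy loss, and launch command are assumptions for illustration; pipeline and tensor parallelism in practice typically rely on frameworks such as Megatron-LM or DeepSpeed.

```python
# Minimal data-parallelism sketch; assumed to be launched with
#   torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(1024, 1024).to(device)     # stand-in for a large transformer
    model = DDP(model, device_ids=[local_rank])        # replicate the model on each GPU

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):                                # each rank trains on its own mini-batch
        x = torch.randn(8, 1024, device=device)
        loss = model(x).pow(2).mean()                  # dummy loss for illustration
        loss.backward()                                # gradients are averaged across GPUs
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```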
Training a large language model from the ground up requires significant investment; a more economical alternative is to fine-tune an existing language model to tailor it to your specific use case. OpenAI’s CEO Sam Altman stated that the cost of training GPT-4 exceeded $100 million.4
4. Evaluation and fine-tuning
After training, the model is evaluated on a test dataset that was not used during training to measure its performance. Based on the evaluation results, the model may require fine-tuning: adjusting its hyperparameters, changing the architecture, or training on additional data to improve its performance.
For example, reinforcement learning from human feedback (RLHF) is a common fine-tuning technique that helps align the LLM’s output with human values and preferences.
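A common way to quantify the evaluation step is perplexity on the held-out test set, shown in the minimal sketch below; the random logits and targets are placeholders standing in for a trained model’s outputs on real test data.

```python
import torch
import torch.nn as nn

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean cross-entropy of the next-token predictions)."""
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    return torch.exp(loss).item()

# Placeholders: in practice, logits come from running the trained model over a
# test dataset that was never seen during training.
vocab_size = 1_000
logits = torch.randn(4, 31, vocab_size)            # model outputs for 4 test sequences
targets = torch.randint(0, vocab_size, (4, 31))    # the actual next tokens

print(f"test perplexity: {perplexity(logits, targets):.1f}")
# Random logits score near vocab_size; a well-trained model should score far lower.
```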
Training LLMs for specific use cases
Training an LLM consists of two parts: pre-training and task-specific training. Task-specific training is also called LLM fine-tuning.
Pre-training is the part of training that enables the model to learn the general rules and dependencies of a language. This takes a significant amount of data and
- Computational power: supercomputer systems with hardware from leading AI chip builders (e.g., NVIDIA). Once maintenance and power costs are added, pre-training a large language model is an investment on the order of millions of dollars.
- Time: GPT-4 training reportedly took about half a year.
To make large language models more accessible, LLM developers offer fine-tuning services for enterprises looking to leverage these models.
Most LLM developers and LLMOps platform providers offer fine-tuning services.
In addition, hardware providers are also becoming active in this domain. NVIDIA’s NeMo is an example of these services, offering pre-trained LLMs for fine-tuning and task-specific training to suit specific use cases.
Task-specific training adds an additional layer to the model, which requires much less data, power, and time to train, making large models accessible for enterprise use (see the sketch below).
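A minimal sketch of that idea in PyTorch: freeze a pre-trained backbone and train only a small task-specific head. The tiny backbone, the three-class task, and the random batch are illustrative placeholders; managed services such as NeMo and parameter-efficient methods such as LoRA follow the same spirit of updating only a small fraction of the parameters.

```python
import torch
import torch.nn as nn

# Placeholder "pre-trained" backbone; in practice this would be a large pre-trained LLM.
backbone = nn.Sequential(nn.Embedding(32_000, 256), nn.Flatten(), nn.Linear(64 * 256, 256))
for p in backbone.parameters():
    p.requires_grad = False                  # freeze the expensive pre-trained weights

num_labels = 3                               # e.g. a 3-class ticket-routing task (illustrative)
task_head = nn.Linear(256, num_labels)       # the small additional task-specific layer
optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)   # train only the head

tokens = torch.randint(0, 32_000, (8, 64))   # a small labeled batch: 8 sequences of 64 tokens
labels = torch.randint(0, num_labels, (8,))

with torch.no_grad():
    features = backbone(tokens)              # reuse the frozen pre-trained representations

loss = nn.functional.cross_entropy(task_head(features), labels)
loss.backward()
optimizer.step()
```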
The video below introduces NVIDIA’s NeMo LLM service.
What is the architecture of large language models?
The architecture of large language models, such as OpenAI’s GPT-4, is based on a type of deep learning called the Transformer architecture. It consists of the following main components (see Figure 1):
Figure 1: Transformer architecture (Source: Vaswani et al.5)
1. Input embedding
The input sequence is first transformed into a dense vector representation, known as an embedding, which captures the relationships between words in the input.
2. Multi-head self-attention
The core component of the transformer block architecture is the multi-head self-attention mechanism, which allows the model to attend to different parts of the input sequence to capture its relationships and dependencies.
3. Feed-forward network
After the self-attention mechanism, the output is fed into a feed-forward neural network, which performs a non-linear transformation to generate a new representation.
4. Normalization and residual connections
To stabilize the training process, the output of each layer is normalized, and a residual connection is added so that a layer’s input can be passed directly to its output, which helps gradients flow through the deep network during training.
These components are repeated several times to form a deep neural network, which can process long sequences of text and generate high-quality outputs for various language tasks, such as text generation, question answering, and translation.
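The minimal PyTorch sketch below wires these components (multi-head self-attention, feed-forward network, normalization, and residual connections) into a single block and stacks it. The dimensions and layer count are illustrative, and details such as positional encodings and causal masking are omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention over the sequence
        x = self.norm1(x + attn_out)       # residual connection + normalization
        x = self.norm2(x + self.ff(x))     # feed-forward network + residual + normalization
        return x

x = torch.randn(2, 16, 512)                # embedded input: 2 sequences of 16 tokens
blocks = nn.Sequential(*[TransformerBlock() for _ in range(6)])   # blocks repeated several times
print(blocks(x).shape)                     # torch.Size([2, 16, 512])
```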
Developers continue to improve large language models by implementing new techniques to:
- Simplify the model (decrease the model size or memory required to train),
- Improve performance,
- Lower price,
- Decrease model training time.
What are the top large language models by parameter size?
We compiled 10+ LLMs by parameter size in the table below.6
Model | Developer | Parameter size |
---|---|---|
Claude 3 | Anthropic | 2T |
GPT-4 | OpenAI | 1.8T |
WuDao 2.0 | Beijing Academy of Artificial Intelligence | 1.75T |
Gemini | Google | 1T |
MT-NLG | Nvidia and Microsoft | 530B |
NeMo L | Nvidia | 530B |
Llama 3.1 | Meta | 405B |
PaLM 2 | Google | 340B |
Falcon | Technology Innovation Institute | 180B |
Bloom | Hugging Face and BigScience | 176B |
OPT | Meta | 175B |
GPT-3 | OpenAI | 175B |
LaMDA | Google | 137B |
DBRX | Databricks and Mosaic | 132B |
Check our article on large language model examples for more models with in-depth information.
FAQ
What is a large language model?
Large language models (LLMs) took the internet by storm at the end of 2022 as ChatGPT from OpenAI reached 1 million users just 5 days after its launch. ChatGPT’s capabilities and wide range of applications can be attributed to its underlying GPT models, with GPT-4 reported to have around 1.8T parameters.
A large language model is a type of machine learning model that is trained on a large corpus of text data to generate outputs for various natural language processing (NLP) tasks.
Large language models are typically based on deep learning neural networks such as the Transformer architecture and are trained on massive amounts of text data, often involving billions of words.
If you are new to large language models, check our “Large Language Models: Complete Guide”.
External Links
- 1. OpenAI Platform.
- 2. #generativeai | Margaret Amori.
- 3. Training large language models on Amazon SageMaker: Best practices | AWS Machine Learning Blog.
- 4. OpenAI’s CEO Says the Age of Giant AI Models Is Already Over | WIRED.
- 5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is All you Need. Neural Information Processing Systems, 30, 5998–6008. https://arxiv.org/pdf/1706.03762v5
- 6. 7 Language Models You Need to Know | AI Business.