
The Future of Large Language Models in 2024

Cem Dilmegani
Updated on Jan 10
8 min read

Interest in large language models (LLMs) is on the rise, especially after the release of ChatGPT in November 2022 (see Figure 1). In recent years, LLMs have transformed various industries, generating human-like text and addressing a wide range of applications. However, their effectiveness is hindered by concerns about bias, inaccuracy, and toxicity, which limit their broader adoption and raise ethical questions.

Figure 1. Google search trend for large language models over a year

Source: Google Trends

This article explores the future of large language models by delving into promising approaches, such as self-training, fact-checking, and sparse expertise, to mitigate these issues and unlock the full potential of these models.

What is a large language model?

A large language model is a type of artificial intelligence model designed to generate and understand human-like text by analyzing vast amounts of data. These foundational models are based on deep learning techniques and typically involve neural networks with many layers and a large number of parameters, allowing them to capture complex patterns in the data they are trained on.

The primary goal of a large language model is to understand the structure, syntax, semantics, and context of natural language, so it can generate coherent and contextually appropriate responses or complete given text inputs with relevant information.

These models are trained on diverse sources of text data, including books, articles, websites, and other textual content, which enables them to generate responses to a wide range of topics.
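To make this concrete, here is a minimal sketch of prompting a small open-source language model through Hugging Face's transformers library. The gpt2 checkpoint is used purely as a lightweight stand-in for a modern LLM:

```python
# A minimal sketch of prompting a small open-source language model.
# Uses Hugging Face's transformers library; "gpt2" is a small stand-in
# here -- any causal language model checkpoint could be substituted.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model continues the prompt with likely next tokens, which is
# the core mechanic behind all large language models.
output = generator(
    "Large language models are trained on",
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,
)
print(output[0]["generated_text"])
```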

BERT (Google)

BERT, an acronym for Bidirectional Encoder Representations from Transformers, is a foundational model developed by Google in 2018. Based on the Transformer Neural Network architecture introduced by Google in 2017, BERT marked a departure from the prevalent natural language processing (NLP) approach that relied on recurrent neural networks (RNNs).

Before BERT, RNNs typically processed text in a left-to-right manner or combined both left-to-right and right-to-left analyses. In contrast, BERT is trained bidirectionally, allowing it to gain a more comprehensive understanding of language context and flow compared to its unidirectional predecessors.
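As a rough illustration of bidirectional training, the sketch below uses the fill-mask pipeline from transformers with the bert-base-uncased checkpoint. BERT fills in the masked word using context from both sides:

```python
# A short sketch of BERT's bidirectional masked-language-modeling
# objective, using the transformers fill-mask pipeline.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token using context from BOTH sides,
# unlike left-to-right models that only see the preceding words.
for prediction in fill("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```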

GPT-3 & GPT-4 (OpenAI)

GPT-3

OpenAI’s GPT-3, or Generative Pre-trained Transformer 3, is a large language model that has garnered significant attention for its remarkable capabilities in natural language understanding and generation. Released in June 2020, GPT-3 is the third iteration in the GPT series, building on the success of its predecessors, GPT and GPT-2.

GPT-3 reached widespread public use after being developed into GPT-3.5, the model behind the conversational AI tool ChatGPT, which was released in November 2022.

GPT-3 uses 175 billion parameters, dwarfing its contemporaries (Figure 2). This made it the largest language model until the arrival of its successor, GPT-4.

Figure 2. GPT-3’s parameter count compared with other large NLP models

GPT-4

OpenAI’s GPT-4, released in March 2023, succeeded GPT-3 as the company’s flagship model. Although it is believed to be larger and more complex than its predecessors, OpenAI has not disclosed technical details such as its parameter count.

GPT-4 is a multimodal large language model of significant size that can handle inputs of both images and text and provide outputs of text. Although it may not perform as well as humans in many real-world situations, the new model has demonstrated performance levels on several professional and academic benchmarks that are comparable to those of humans.1

The model has various distinctive features compared to other LLMs, including:

  • Visual input option
  • Higher word limit
  • Advanced reasoning capability
  • Steerability, etc.

For a more detailed account of these capabilities of GPT-4, check our in-depth guide.
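As a hedged sketch of the visual input option, the snippet below calls GPT-4 through the official OpenAI Python client. The model name and the image URL are illustrative assumptions; consult OpenAI's documentation for current values:

```python
# A hedged sketch of GPT-4's visual input feature via the OpenAI
# Python client. The model name and image URL are illustrative
# assumptions; check OpenAI's documentation for current values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chart in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```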

BLOOM (BigScience)

BLOOM, an autoregressive large language model, is trained on massive amounts of text data using extensive computational resources to continue text prompts. Released in July 2022, it has 176 billion parameters and was built as an open-access competitor to GPT-3. It can generate coherent text in 46 natural languages and 13 programming languages.
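A small sketch of BLOOM's multilingual generation is shown below. The full 176-billion-parameter model is far too large for most machines, so the sketch assumes the small bloom-560m checkpoint from the same model family as a stand-in:

```python
# A sketch of BLOOM's multilingual generation. The full 176B-parameter
# model is far too large for most machines, so this uses the small
# bigscience/bloom-560m checkpoint from the same family as a stand-in.
from transformers import pipeline

bloom = pipeline("text-generation", model="bigscience/bloom-560m")

# The same model continues prompts in different languages.
for prompt in ["The weather today is", "Le temps aujourd'hui est"]:
    print(bloom(prompt, max_new_tokens=20)[0]["generated_text"])
```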

For a comparative analysis of the current LLMs, check our large language models examples article.

What is the current stage of large language models?

The current stage of large language models is marked by their impressive ability to understand and generate human-like text across a wide range of topics and applications. Built using advanced deep learning techniques and trained on vast amounts of data, these models, such as OpenAI’s GPT-3 and Google’s BERT, have significantly impacted the field of natural language processing. 

Current LLMs have achieved state-of-the-art performance on various tasks, such as: 

  • Text generation and summarization
  • Machine translation
  • Question answering
  • Sentiment analysis

Despite these achievements, language models still have several limitations that need to be addressed in future models.

1- Accuracy

Large language models generate responses by statistical inference over their training data rather than by consulting verified sources, which raises concerns about potential inaccuracies. Additionally, pre-trained models cannot adapt to new information dynamically, which can lead to erroneous responses. Figure 3 shows an accuracy comparison of several LLMs.

Figure 3. Results for a wide variety of language models on the 5-shot HELM benchmark for accuracy

Source: “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model”

2- Bias

Large language models facilitate human-like communication through speech and text. However, recent findings indicate that larger, more capable systems tend to absorb the social biases present in their training data, producing outputs with sexist, racist, or ableist tendencies (Figure 4).

Figure 4. Large language models’ toxicity index

Source: Stanford University Artificial Intelligence Index Report 2022

For instance, a recent 280 billion-parameter model exhibited a substantial 29% increase in toxicity levels compared to a 117 million-parameter model from 2018. As these systems continue to advance and become more powerful tools for AI research and development, the potential for escalating bias risks also grows. Figure 5 compares the bias potential of some LLMs.

Figure 5. Results for a wide variety of language models on the 5-shot HELM benchmark for bias

Source: “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model”

3- Toxicity

The toxicity problem of large language models refers to the issue where these models inadvertently generate harmful, offensive, or inappropriate content in their responses. This problem arises because these models are trained on vast amounts of text data from the internet, which may contain biases, offensive language, or controversial opinions.

Figure 6. Results for a wide variety of language models on the 5-shot HELM benchmark for toxicity

Source: “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model”

Addressing the toxicity problem in future large language models requires a multifaceted approach involving research, collaboration, and continuous improvement. Some potential strategies to mitigate toxicity in future models can include:

  • Curating and improving training data
  • Developing better fine-tuning techniques
  • Incorporating user feedback
  • Content moderation strategies
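As a simplified illustration of the last strategy, the sketch below screens model output with a separate open-source toxicity classifier before showing it to users. The unitary/toxic-bert checkpoint and the 0.5 threshold are illustrative assumptions, not recommendations:

```python
# A hedged sketch of one moderation strategy: screening model output
# with a separate toxicity classifier before showing it to users.
# "unitary/toxic-bert" is an openly available classifier used here
# as an illustrative choice, not a recommendation.
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def moderate(text: str, threshold: float = 0.5) -> str:
    """Return the text unchanged, or a refusal if it scores as toxic."""
    result = toxicity(text)[0]
    if result["label"] == "toxic" and result["score"] > threshold:
        return "[response withheld by content filter]"
    return text

print(moderate("Have a wonderful day!"))
```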

4- Capacity limitations

Every large language model has a fixed context window, which restricts the number of tokens it can process as input. For example, ChatGPT has a 2,048-token limit (approximately 1,500 words), preventing it from comprehending inputs, or producing outputs, that exceed this threshold.

GPT-4 extends this capacity to about 25,000 words, far exceeding ChatGPT, which is based on GPT-3.5 (Figure 7).

Figure 7. Word limit comparison between ChatGPT and GPT-4

Source: OpenAI
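To see why token limits differ from word counts, the short sketch below counts tokens with OpenAI's tiktoken library; cl100k_base is the encoding used by GPT-3.5 and GPT-4 class models:

```python
# A small sketch of why token limits differ from word limits, using
# OpenAI's tiktoken library to count tokens the way GPT models do.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4

text = "Tokenization splits text into subword units, not whole words."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
# Long inputs must fit within the model's context window or be truncated.
```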

5- Pre-trained knowledge set

Language models are trained on a fixed data set that represents a snapshot of knowledge at a certain point in time. Once training is complete, the model’s knowledge is frozen; it cannot access up-to-date information. Any information or changes that emerge after the training data was collected will not be reflected in the model’s responses.

This leads to several problems, such as:

  • Outdated or incorrect information
  • Inability to handle recent events
  • Less relevance in dynamic domains like technology, finance or medicine

What is the future of large language models?

It is impossible to foresee exactly how future language models will evolve. However, promising research is already targeting the common problems explained above, and we can pinpoint three substantial changes likely to shape future language models.

1- Fact-checking themselves

A collection of promising advancements aims to alleviate the factual unreliability and static knowledge limitations of large language models. These novel techniques are crucial for preparing LLMs for extensive real-world implementation. Doing this requires two abilities:

  • The ability to access external sources
  • The ability to provide citations and references for answers

Significant preliminary research in this domain features models such as Google’s REALM and Facebook’s RAG, both introduced in 2020.
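The retrieve-then-cite pattern behind these systems can be sketched in a few lines. In the example below, search() is a hypothetical stub standing in for a real search engine or document index:

```python
# A minimal sketch of the retrieve-then-cite pattern behind models like
# REALM, RAG, and WebGPT. The search() function is a hypothetical stub;
# a real system would call a search engine or a document index.
def search(query: str) -> list[dict]:
    """Hypothetical retriever returning (title, url, snippet) records."""
    return [
        {"title": "Example source", "url": "https://example.com/a",
         "snippet": "Relevant passage retrieved for the query."},
    ]

def build_prompt(question: str) -> str:
    """Ground the prompt in retrieved passages and ask for citations."""
    docs = search(question)
    context = "\n".join(
        f"[{i + 1}] {d['title']} ({d['url']}): {d['snippet']}"
        for i, d in enumerate(docs)
    )
    return (
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the sources above and cite them as [n]."
    )

print(build_prompt("When was BLOOM released?"))
```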

In late 2021, OpenAI introduced WebGPT, a fine-tuned version of its GPT-3 model that uses Microsoft Bing to browse the internet and generate more precise and comprehensive answers to prompts. WebGPT operates much like a human user:

  • Submitting search queries to Bing
  • Clicking on links
  • Scrolling web pages
  • Employing functions like Ctrl+F to locate terms

When the model incorporates relevant information from the internet into its output, it includes citations, allowing users to verify the source of the information. The research results show that all WebGPT models surpass every GPT-3 model in both the proportion of accurate responses and the percentage of truthful and informative answers provided (Figure 8).

Figure 8. TruthfulQA results comparing GPT-3 and WebGPT models

Source: “WebGPT: Browser-assisted question-answering with human feedback”

DeepMind is actively exploring similar research avenues. In September 2022, it introduced a model called Sparrow. Like ChatGPT, Sparrow operates in a dialogue-based manner, and like WebGPT, it can search the internet for new information and offer citations to support its claims.

Figure 9. Sparrow provides up-to-date answers and evidence for factual claims

Source: “Improving alignment of dialogue agents via targeted human judgements”

Although it is too early to conclude that the accuracy, fact-checking, and static knowledge problems will be fully overcome in near-future models, current research results are promising. Such capabilities may also reduce the need for prompt engineering to cross-check model output, since the model will already have verified its results against external sources.

2- Synthetic training data

To fix some of the limitations mentioned above, such as those stemming from the training data, researchers are working on large language models that can generate their own training data sets, i.e., synthetic training data.

In a recent study, Google researchers developed a large language model capable of creating questions, generating comprehensive answers, filtering its responses for the highest quality output, and fine-tuning itself using the curated answers. Impressively, this resulted in new state-of-the-art performance across multiple language tasks.

Figure 10. Overview of Google’s self-improving model

Source: “Large Language Models Can Self-Improve”

For example, the model’s performance improved from 74.2% to 82.1% on GSM8K and from 78.2% to 83.0% on DROP, which are two widely used benchmarks for evaluating LLM performance.

A recent study focuses on enhancing a crucial LLM technique called “instruction fine-tuning,” which forms the foundation of products like ChatGPT. While ChatGPT and similar instruction fine-tuned models depend on human-crafted instructions, the research team developed a model capable of generating its own natural language instructions and subsequently fine-tuning itself using those instructions.

The performance improvements are substantial, as this method boosts the base GPT-3 model’s performance by 33%, nearly equaling the performance of OpenAI’s own instruction-tuned model (Figure 11).

Figure 11. Performance of the GPT-3 model and its instruction-tuned variants, evaluated by human experts

Source: “Self-Instruct: Aligning Language Model with Self Generated Instructions”
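A simplified sketch of the Self-Instruct bootstrap is shown below: the model proposes new instructions from a pool of seed examples, and near-duplicates are filtered out before fine-tuning. generate_instruction() is a hypothetical stub, and the word-overlap filter is a crude stand-in for the paper's ROUGE-L similarity check:

```python
# A simplified sketch of the Self-Instruct bootstrap: prompt the model
# with seed instructions, ask it to invent new ones, and drop
# near-duplicates before fine-tuning. generate_instruction() is a
# hypothetical stub for the underlying model call.
def generate_instruction(seed_examples: list[str]) -> str:
    """Hypothetical stub: model proposes a new instruction."""
    return "Summarize the given paragraph in one sentence."

def is_novel(candidate: str, pool: list[str], max_overlap: float = 0.7) -> bool:
    """Crude word-overlap filter standing in for the ROUGE-L check."""
    cand = set(candidate.lower().split())
    for existing in pool:
        ex = set(existing.lower().split())
        if len(cand & ex) / max(len(cand | ex), 1) > max_overlap:
            return False
    return True

pool = ["Translate the sentence into French.", "List three synonyms for a word."]
candidate = generate_instruction(pool)
if is_novel(candidate, pool):
    pool.append(candidate)  # grows the instruction set for fine-tuning
print(pool)
```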

With such models, it may become possible to reduce the bias and toxicity of model outputs and to make fine-tuning with desired data sets more efficient; in effect, models learn to optimize themselves.

3- Sparse expertise

Although each model’s parameters, training data, algorithms etc. cause performance differences, all of the widely recognized language models today—such as OpenAI’s GPT-3, Nvidia/Microsoft’s Megatron-Turing, Google’s BERT—share a fundamental design in the end. They are:

  • Autoregressive
  • Self-supervised
  • Pre-trained
  • Employ densely activated transformer-based architectures

“Densely activated” means that the model uses all of its parameters to produce a response to every prompt. As you might guess, this is computationally inefficient and costly.

A sparse expert model, by contrast, activates only the subset of its parameters that is relevant to a given prompt. LLMs currently in development with more than 1 trillion parameters are assumed to be sparse models.2 One example is Google’s GLaM, with 1.2 trillion parameters.
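A toy PyTorch sketch of this routing idea follows: a small gating network picks one expert per input, so only a fraction of the total parameters runs for any given prompt. This is a minimal illustration of sparse routing, not how GLaM is actually implemented:

```python
# A toy sketch of sparse expert routing in PyTorch: a gating network
# picks the top-1 expert per input, so only a fraction of the total
# parameters is active for any prompt (unlike a dense model).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim: int = 16, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                # route each input
        idx = scores.argmax(dim=-1)          # top-1 expert choice
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = idx == i
            if mask.any():
                out[mask] = expert(x[mask])  # only chosen experts run
        return out

moe = TinyMoE()
print(moe(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```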

According to Forbes, Google’s GLaM is seven times bigger than GPT-3 but consumes two-thirds less energy for training. It demands only half the computing resources for inference and exceeds GPT-3’s performance on numerous natural language tasks. 

Sparse expert models thus offer a more efficient and less environmentally damaging way to develop future language models.


