Vector databases (VDBs) and large language models (LLMs) such as the GPT series are gaining significance. The figure above shows that interest in both concepts began rising at the beginning of 2023 and has followed a similar upward trajectory since.
Data reigns supreme, and computational advancements dictate technological trends. Vector databases play a pivotal role in contemporary artificial intelligence (AI) applications, so their significance and their interplay with LLMs should not be understated.
Executives may prioritize generative AI projects without recognizing the infrastructure that supports them. In light of recent AI and machine learning developments, we explain why VDBs matter to LLM projects, where the two intersect, and how they are transforming modern-day computing.
How do LLMs utilize vector databases?
A basic interaction with an LLM application, such as a ChatGPT-style assistant backed by a vector database, can follow this process:
- A user types a question or statement into the interface.
- An embedding model processes this input and transforms it into a vector embedding that represents its content.
- This query vector is then matched against the vectors stored in the vector database, which were generated from the content you want to reference.
- The vector database returns the closest matches, and the LLM uses the associated content as context to generate the answer presented to the user.
- Subsequent queries from the user follow the same method: they pass through the embedding model to form vectors, which are used to query the database for matching or similar vectors. The similarity between vectors reflects the similarity of the original content from which they were formed.
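The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, `ToyVectorStore` is a hypothetical in-memory class that searches by brute force, and `build_prompt` only assembles the text that would be handed to an LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (vector, original text) pairs

    def add(self, text):
        self.items.append((embed(text), text))

    def search(self, query, k=1):
        # Brute force: score every stored vector against the query vector.
        q = embed(query)
        ranked = sorted(
            self.items, key=lambda item: cosine_similarity(q, item[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


def build_prompt(question, passages):
    """Combine retrieved passages and the question into one LLM prompt."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


store = ToyVectorStore()
for doc in [
    "Vector databases index high-dimensional embeddings for similarity search.",
    "Bananas are rich in potassium.",
    "Word2Vec learns word embeddings from text.",
]:
    store.add(doc)

question = "How do vector databases search embeddings?"
matches = store.search(question, k=1)
prompt = build_prompt(question, matches)  # this text would be sent to the LLM
print(prompt)
```

In a real system the embedding model, the database, and the LLM call would each be separate services; the sketch only shows how the pieces connect.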
Below, we explain some key areas where LLMs can benefit from vector databases.
Word Embeddings Storage
LLM-based applications often represent words as vectors in a multi-dimensional space. Word2Vec, GloVe, and FastText are popular algorithms for learning such word embeddings in natural language processing (NLP), while LLMs typically learn their own embeddings during training. Vector databases can store these embeddings and fetch them efficiently during real-time operations.
Semantic similarity is a concept used in natural language processing, linguistics, and cognitive science to quantify how similar two pieces of text (words, phrases, or sentences) are in meaning. Once words or sentences are represented as vectors, a vector database can find semantically similar ones: given a query vector, it quickly returns the nearest vectors (i.e., the semantically closest words or sentences).
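As a rough illustration of nearest-vector lookup, the sketch below ranks words by cosine similarity. The three-dimensional vectors in `word_vectors` are hand-picked, hypothetical values; real embeddings come from a trained model and have far more dimensions.

```python
import math

# Hypothetical 3-D word vectors, hand-picked for illustration only.
word_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.1],
    "kitten": [0.95, 0.7, 0.15],
    "car": [0.1, 0.2, 0.9],
    "truck": [0.15, 0.1, 0.95],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    query = word_vectors[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in word_vectors.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


print(nearest("cat"))
print(nearest("car", k=1))
```

With these toy values, "cat" sits closest to "kitten" and "dog", while "car" sits closest to "truck".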
Efficient Large-Scale Retrieval
LLMs may need to find the best matching documents from a large corpus for tasks like information retrieval or recommendation. If documents are represented as vectors, vector databases can help retrieve the most relevant documents rapidly.
In machine translation, previous translations can be stored as vectors in a database. When a new sentence needs to be translated, the database can be queried for similar sentences, and their translations can be reused or adapted, improving translation speed and consistency.
Knowledge Graph Embeddings
Knowledge graphs can be represented using embeddings, where entities and relations are transformed into vectors. Vector databases can help store and retrieve these embeddings, facilitating tasks like link prediction, entity resolution, and relation extraction.
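To make link prediction concrete, here is a small sketch that assumes a TransE-style scoring rule, in which a true fact (head, relation, tail) should satisfy head + relation ≈ tail. The article does not name a specific embedding model, so TransE is an assumption, and the two-dimensional vectors are hand-picked, hypothetical values.

```python
import math

# Hypothetical 2-D embeddings, chosen so that head + relation lands on the tail.
entities = {
    "Paris": [1.0, 0.0],
    "France": [1.0, 1.0],
    "Berlin": [3.0, 0.0],
    "Germany": [3.0, 1.0],
}
relations = {"capital_of": [0.0, 1.0]}


def transe_distance(head, relation, tail):
    """TransE-style score ||h + r - t||; smaller means more plausible."""
    h, r, t = entities[head], relations[relation], entities[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def predict_tail(head, relation):
    """Link prediction: pick the entity that best completes (head, relation, ?)."""
    candidates = [e for e in entities if e != head]
    return min(candidates, key=lambda e: transe_distance(head, relation, e))


print(predict_tail("Paris", "capital_of"))
print(predict_tail("Berlin", "capital_of"))
```

In a vector database, the `min` over all candidates would be replaced by a nearest-neighbor query for the vector head + relation.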
In tasks like text classification or spam detection, vector representations of texts can be used to detect anomalies. Vector databases can facilitate efficient searching for anomalies in a high-dimensional space.
Here’s a basic example using word embeddings (a type of vector representation for text) to detect anomalies in a dataset of sentences:
- Data Collection:
  - Gather a set of sentences. For simplicity, let’s consider the following:
    - “Cats are great pets.”, “Dogs love to play fetch.”, “Elephants are the largest land animals.”, “Bananas are rich in potassium.”, “Birds can fly.”, “Fish live in water.”
- Vector Representation:
  - Use a pre-trained word embedding model (like Word2Vec or FastText) to convert each sentence into a vector representation, for example by averaging the vectors of its words.
- Building a Reference Vector:
  - Calculate the mean vector of all the sentence vectors related to animals. This mean vector represents the “centroid” or central point of the topic.
- Compute Distances:
  - For each sentence vector, compute the cosine distance (or any other distance metric) to the reference vector.
- Thresholding and Detection:
  - Set a distance threshold. Any sentence vector with a distance greater than this threshold from the reference vector can be considered an anomaly.
  - In our example, the sentence “Bananas are rich in potassium.” would likely have a higher distance to the reference vector than the other sentences, identifying it as an anomaly.
- Verification:
  - Check the results to confirm if the identified anomalies are indeed anomalies based on domain knowledge.
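The steps above can be sketched as follows. To keep the example self-contained, `WORD_VECTORS` holds hand-picked, hypothetical two-dimensional vectors in place of a pre-trained Word2Vec or FastText model, sentence vectors are simple averages of known word vectors, and the threshold of 0.5 is an arbitrary choice for this toy data.

```python
import math

# Hypothetical 2-D word vectors standing in for a pre-trained embedding model.
WORD_VECTORS = {
    "cats": [0.9, 0.1], "dogs": [0.9, 0.1], "pets": [0.9, 0.1],
    "elephants": [0.95, 0.05], "animals": [1.0, 0.0],
    "birds": [0.9, 0.1], "fish": [0.8, 0.3],
    "bananas": [0.1, 0.9], "potassium": [0.05, 0.95],
}


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sentence_vector(sentence):
    """Average the vectors of the words we have embeddings for."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not known:
        raise ValueError(f"no known words in: {sentence!r}")
    return mean_vector(known)


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


animal_sentences = [
    "Cats are great pets.",
    "Dogs love to play fetch.",
    "Elephants are the largest land animals.",
    "Birds can fly.",
    "Fish live in water.",
]
all_sentences = animal_sentences + ["Bananas are rich in potassium."]

# Reference vector: centroid of the sentences known to be about animals.
reference = mean_vector([sentence_vector(s) for s in animal_sentences])

THRESHOLD = 0.5  # arbitrary cut-off chosen for this toy data
anomalies = [
    s for s in all_sentences
    if cosine_distance(sentence_vector(s), reference) > THRESHOLD
]
print(anomalies)
```

With real embeddings the distances are less clear-cut, so the threshold usually needs tuning against examples reviewed by someone with domain knowledge.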
For applications that require real-time user interaction, like chatbots or virtual assistants, vector databases can ensure that response generation, which might depend on fetching relevant context or information represented as vectors, is quick.
What are vector databases?
A vector database holds data as high-dimensional vectors, which are numerical representations of specific features or characteristics. In the context of large language models or natural language processing, these vectors can vary in dimensionality, spanning from just a few to several thousand, based on the intricacy and detail of the information. Typically, these vectors originate from transforming or embedding raw data like text, pictures, sound, video, etc.
Vector databases gained prominence in recent years due to the rise of machine learning, especially with the widespread use of embeddings. Vector embeddings convert complex data, such as text, images, and other unstructured data, into high-dimensional vectors so that similar items are closer to each other in the vector space.
Why LLMs need vector databases: Similarity search in high-dimensional vectors
Similarity searches in high-dimensional spaces refer to the problem of finding items in a dataset that are “similar” to a given query item when the data is represented in a multi-dimensional space. This search type is common in various domains, including machine learning, computer vision, and information retrieval.
Traditional databases are generally inefficient when handling similarity searches in high-dimensional spaces. To address this challenge, vector databases have been developed to efficiently index and search through extensive collections of high-dimensional vectors.
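The article does not say which indexing technique a given vector database uses. As one illustrative possibility, the sketch below uses random-hyperplane locality-sensitive hashing (LSH): each vector gets a short bit signature, and only vectors sharing the query's signature are treated as candidates, which avoids scanning the whole collection. All function names here are hypothetical, and production systems often rely on other approximate nearest-neighbor structures.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; a fixed seed keeps the demo repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(v * p for v, p in zip(vector, plane)) >= 0)
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group item ids into buckets keyed by their signature."""
    buckets = {}
    for item_id, vec in vectors.items():
        buckets.setdefault(signature(vec, hyperplanes), []).append(item_id)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Return only the items that share the query's bucket."""
    return buckets.get(signature(query, hyperplanes), [])


vectors = {
    "a": [1.0, 0.2, 0.0],
    "b": [0.9, 0.3, 0.1],
    "c": [-1.0, 0.1, 0.5],
}
planes = make_hyperplanes(num_planes=4, dim=3)
index = build_index(vectors, planes)

# A query pointing in the same direction as "a" hashes to the same bucket.
query = [2.0 * x for x in vectors["a"]]
print(candidates(query, index, planes))
```

The trade-off is that results are approximate: a true neighbor can land in a different bucket, so real indexes typically combine several hash tables or probe nearby buckets.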
To conduct a similarity search in a vector database, you must utilize a query vector that encapsulates your search criteria. This query vector can originate from the same data type as the database vectors or a different type, such as using text to search an image database.
The next step is to employ a similarity metric to determine the proximity between two vectors in this space. This can include metrics like cosine similarity, Euclidean distance, or, for set-like or binary data, the Jaccard index. The outcome is typically a list of vectors ranked by their resemblance to the query vector. Subsequently, you can retrieve the raw data linked to each vector from the primary source or index.
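For reference, here is a minimal sketch of the three metrics and of ranking stored vectors against a query. The stored vectors and their ids (`x`, `y`, `z`) are made-up values for illustration.

```python
import math


def cosine_similarity(a, b):
    """Angle-based similarity in [-1, 1]; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_index(a, b):
    """Overlap between two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Made-up stored vectors and a query vector.
stored = {"x": [2.0, 0.0], "y": [0.0, 1.0], "z": [1.0, 1.0]}
query = [1.0, 0.0]

# Rank stored ids from most to least similar to the query.
ranked = sorted(
    stored, key=lambda item_id: cosine_similarity(query, stored[item_id]), reverse=True
)
print(ranked)
print(euclidean_distance(query, stored["x"]))
print(jaccard_index({"vector", "database"}, {"database", "index"}))
```

Note that the metrics can disagree: by Euclidean distance, `x` and `z` are equally far from the query, while cosine similarity ranks `x` clearly first because it points in the same direction.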
Until recently, vector databases were used mainly by major tech companies with the resources to create and maintain them. Because they can be costly to run, optimizing them correctly is crucial to guarantee top performance.