More About Vector Databases
Jan 24, 2024
By Kyle (Jake) Gellatly, Director of Product
What is a Vector Database?
To understand vector databases, it is easiest to start with a more familiar concept: structured data. Structured data is any data that can easily be represented in a tabular format. For example, a CSV or Excel file where rows are entries, columns are descriptors, and each cell takes on some value would be considered structured data. Starting from structured data, it is relatively easy to run statistics and comparisons within the data, for example, finding the sample most similar to some other sample in your dataset.
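As a quick illustration, here is a minimal sketch of that "most similar sample" comparison on structured data. It assumes a small numeric table in a hypothetical file named samples.csv and uses plain Euclidean distance:

```python
import numpy as np
import pandas as pd

# Load a small table of numeric features (hypothetical file name).
df = pd.read_csv("samples.csv")  # rows are samples, columns are numeric descriptors

query = df.iloc[0].to_numpy()    # treat the first row as our "query" sample
others = df.iloc[1:].to_numpy()  # compare against the remaining rows

# Euclidean distance from the query to every other row.
distances = np.linalg.norm(others - query, axis=1)

# The row with the smallest distance is the most similar sample.
most_similar = df.iloc[1:].iloc[int(np.argmin(distances))]
print(most_similar)
```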
However, other data formats, such as images or text, are not as easily stored in a tabular format, and are referred to as unstructured data. Even simple queries such as “find the most similar image to some input image” become very difficult when dealing with unstructured data. For this reason, we often embed, or transform, unstructured data into a vector embedding. At the most basic level, embedding is the process of using various ML techniques to transform unstructured data into a series of numbers [1, 5, 3, 4, …, n]. Methods are available to transform images, text, and audio into vector representations.
For example, we could use ML embedding techniques to transform simple words into a 3D space with X, Y, and Z dimensions. Similar words such as “wolf” and “dog” would be located near each other in the embedding space, while fruits such as “apple” and “banana” would sit far apart from the animals.
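To make this concrete, here is a minimal sketch of embedding a few words and comparing them with cosine similarity. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which are illustrative choices rather than anything prescribed here (real embeddings typically have hundreds of dimensions, not three):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative model choice; any text-embedding model would work similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

words = ["wolf", "dog", "apple", "banana"]
embeddings = model.encode(words)  # one vector per word

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The animals should score closer to each other than to the fruits.
print("wolf vs dog:  ", cosine_similarity(embeddings[0], embeddings[1]))
print("wolf vs apple:", cosine_similarity(embeddings[0], embeddings[2]))
```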
A vector database allows you to store, index, search, and retrieve vector embeddings and the associated unstructured (or structured) data.
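To ground those four operations, here is a minimal sketch of a vector index using FAISS, an open-source similarity-search library. The vectors and metadata are made up for illustration; a full vector database would also handle persistence, filtering, and updates:

```python
import faiss
import numpy as np

dim = 384  # dimensionality of our (made-up) embeddings

# Store + index: add vectors to an index built for inner-product search.
index = faiss.IndexFlatIP(dim)
vectors = np.random.rand(1000, dim).astype("float32")  # placeholder embeddings
faiss.normalize_L2(vectors)                            # normalize so inner product = cosine
index.add(vectors)

# Keep the associated data alongside the vectors, keyed by position in the index.
metadata = [{"id": i, "caption": f"item {i}"} for i in range(1000)]

# Search + retrieve: find the 5 nearest neighbors of a query vector.
query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)
results = [metadata[i] for i in ids[0]]
print(results)
```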
Why Use a Vector Database?
There are many potential reasons to use a vector database, but one of the most common is similarity search. For example, any software that can recommend similar songs or images based on some input is likely employing embeddings and a vector database under the hood! Some vector databases can even handle both text and images by leveraging an ML model that can “co-embed” these data types. These databases can be incredibly powerful, as they allow you to search for images similar to some input image, or for images similar to some input text query! This flexibility allows natural language to be used to query image-based data.
In a more traditional relational database, if we wanted to find all images that contain a “feline”, we would need to query for images that had been tagged with “cat”, “kitten”, “kitty”, etc. A prerequisite to running this query would be annotating, or labeling, every image with a variety of tags related to its animal content. In contrast, a vector database allows the user to search for “feline” and receive back the images most similar to this query, without ever having labeled the data with these tags in the first place!
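As a sketch of how this text-to-image search works, the snippet below co-embeds a text query and a few images into the same space with CLIP, here via Hugging Face transformers and the openai/clip-vit-base-patch32 checkpoint (both illustrative choices; the image file names are hypothetical). In a real system, the image embeddings would be stored in a vector database and only the query embedding would be computed at search time:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint for a model that co-embeds text and images.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image files; in practice these embeddings would live in a vector database.
paths = ["cat.jpg", "dog.jpg", "banana.jpg"]
images = [Image.open(p) for p in paths]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)

    text_inputs = processor(text=["feline"], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)

# Normalize and rank images by cosine similarity to the query text.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
print("Best match for 'feline':", paths[int(scores.argmax())])
```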
In our toy example, we will be able to perform both image similarity search and language-to-image search (example results below!).
In a future blog, we will show how to implement a vector database using CLIP, a multimodal embedding model, to handle the embeddings, along with Pinecone, a vector database cloud service provider.