A new PostgreSQL extension is now available in Supabase: pgvector
, an open-source vector similarity search.
The exponential progress of AI functionality over the past year has inspired many new real world applications. One specific challenge has been the ability to store and query embeddings at scale.
In this post we'll explain what embeddings are, why we might want to use them, and how we can store and query them in PostgreSQL using pgvector
.
What are embeddings?
Embeddings capture the “relatedness” of text, images, video, or other types of information. This relatedness is most commonly used for:
- Search: how similar is a search term to a body of text?
- Recommendations: how similar are two products?
- Classifications: how do we categorize a body of text?
- Clustering: how do we identify trends?
Let's explore an example of text embeddings. Say we have three phrases:
- “The cat chases the mouse”
- “The kitten hunts rodents”
- “I like ham sandwiches”
Your job is to group phrases with similar meaning. If you are a human, this should be obvious. Phrases 1 and 2 are almost identical, while phrase 3 has a completely different meaning.
Although phrases 1 and 2 are similar, they share no common vocabulary (besides “the”). Yet their meanings are nearly identical. How can we teach a computer that these are the same?
Human language
Humans use words and symbols to communicate language. But words in isolation are mostly meaningless - we need to draw from shared knowledge & experience in order to make sense of them. The phrase “You should Google it” only makes sense if you know that Google is a search engine and that people have been using it as a verb.
In the same way, we need to train a neural network model to understand human language. An effective model should be trained on millions of different examples to understand what each word, phrase, sentence, or paragraph could mean in different contexts.
So how does this relate to embeddings?
How do embeddings work?
Embeddings compress discrete information (words & symbols) into distributed continuous-valued data (vectors). If we took our phrases from before and plot them on a chart, it might look something like this:
Phrases 1 and 2 would be plotted close to each other, since their meanings are similar. We would expect phrase 3 to live somewhere far away since it isn't related. If we had a fourth phrase, “Sally ate Swiss cheese”, this might exist somewhere between phrase 3 (cheese can go on sandwiches) and phrase 1 (mice like Swiss cheese).
In this example we only have 2 dimensions: the X and Y axis. In reality, we would need many more dimensions to effectively capture the complexities of human language.
OpenAI embeddings
OpenAI offers an API to generate embeddings for a string of text using its language model. You feed it any text information (blog articles, documentation, your company's knowledge base), and it will output a vector of floating point numbers that represents the “meaning” of that text.
Compared to our 2-dimensional example above, their latest embedding model text-embedding-ada-002
will output 1536 dimensions.
Why is this useful? Once we have generated embeddings on multiple texts, it is trivial to calculate how similar they are using vector math operations like cosine distance. A perfect use case for this is search. Your process might look something like this:
- Pre-process your knowledge base and generate embeddings for each page
- Store your embeddings to be referenced later (more on this)
- Build a search page that prompts your user for input
- Take user's input, generate a one-time embedding, then perform a similarity search against your pre-processed embeddings.
- Return the most similar pages to the user
Embeddings in practice
At a small scale, you could store your embeddings in a CSV file, load them into Python, and use a library like numPy
to calculate similarity between them using something like cosine distance or dot product. OpenAI has a cookbook example that does just that. Unfortunately this likely won't scale well:
- What if I need to store and search over a large number of documents and embeddings (more than can fit in memory)?
- What if I want to create/update/delete embeddings dynamically?
- What if I'm not using Python?
Using PostgreSQL
Enter pgvector
, an extension for PostgreSQL that allows you to both store and query vector embeddings within your database. Let's try it out.
First we'll enable the Vector extension. In Supabase, this can be done from the web portal through Database
→ Extensions
. You can also do this in SQL by running:
_10create extension vector;
Next let's create a table to store our documents and their embeddings:
_10create table documents (_10 id bigserial primary key,_10 content text,_10 embedding vector(1536)_10);
pgvector
introduces a new data type called vector
. In the code above, we create a column named embedding
with the vector
data type. The size of the vector defines how many dimensions the vector holds. OpenAI's text-embedding-ada-002
model outputs 1536 dimensions, so we will use that for our vector size.
We also create a text
column named content
to store the original document text that produced this embedding. Depending on your use case, you might just store a reference (URL or foreign key) to a document here instead.
Soon we're going to need to perform a similarity search over these embeddings. Let's create a function to do that:
_21create or replace function match_documents (_21 query_embedding vector(1536),_21 match_threshold float,_21 match_count int_21)_21returns table (_21 id bigint,_21 content text,_21 similarity float_21)_21language sql stable_21as $$_21 select_21 documents.id,_21 documents.content,_21 1 - (documents.embedding <=> query_embedding) as similarity_21 from documents_21 where 1 - (documents.embedding <=> query_embedding) > match_threshold_21 order by similarity desc_21 limit match_count;_21$$;
pgvector
introduces 3 new operators that can be used to calculate similarity:
Operator | Description |
---|---|
<-> | Euclidean distance |
<#> | negative inner product |
<=> | cosine distance |
OpenAI recommends cosine similarity on their embeddings, so we will use that here.
Now we can call match_documents()
, pass in our embedding, similarity threshold, and match count, and we'll get a list of all documents that match. And since this is all managed by Postgres, our application code becomes very simple.
Indexing
Once your table starts to grow with embeddings, you will likely want to add an index to speed up queries. Vector indexes are particularly important when you're ordering results because vectors are not grouped by similarity, so finding the closest by sequential scan is a resource-intensive operation.
Each distance operator requires a different type of index. We expect to order by cosine distance, so we need vector_cosine_ops
index. A good starting number of lists is 4 * sqrt(table_rows):
_10create index on documents using ivfflat (embedding vector_cosine_ops)_10with_10 (lists = 100);
You can read more about indexing on pgvector
's GitHub page here.
Generating embeddings
Let's use JavaScript to generate embeddings and store them in Postgres:
_29import { createClient } from '@supabase/supabase-js'_29import { Configuration, OpenAIApi } from 'openai'_29import { supabaseClient } from './lib/supabase'_29_29async function generateEmbeddings() {_29 const configuration = new Configuration({ apiKey: '<YOUR_OPENAI_KEY>' })_29 const openAi = new OpenAIApi(configuration)_29_29 const documents = await getDocuments() // Your custom function to load docs_29_29 // Assuming each document is a string_29 for (const document of documents) {_29 // OpenAI recommends replacing newlines with spaces for best results_29 const input = document.replace(/\n/g, ' ')_29_29 const embeddingResponse = await openai.createEmbedding({_29 model: 'text-embedding-ada-002',_29 input,_29 })_29_29 const [{ embedding }] = embeddingResponse.data.data_29_29 // In production we should handle possible errors_29 await supabaseClient.from('documents').insert({_29 content: document,_29 embedding,_29 })_29 }_29}
Building a simple search function
Finally, let's create an Edge Function to perform our similarity search:
_45import { serve } from 'https://deno.land/std@0.170.0/http/server.ts'_45import 'https://deno.land/x/xhr@0.2.1/mod.ts'_45import { createClient } from 'https://esm.sh/@supabase/supabase-js@2.5.0'_45import { Configuration, OpenAIApi } from 'https://esm.sh/openai@3.1.0'_45import { supabaseClient } from './lib/supabase'_45_45export const corsHeaders = {_45 'Access-Control-Allow-Origin': '*',_45 'Access-Control-Allow-Headers': 'authorization, x-client-info, apikey, content-type',_45}_45_45serve(async (req) => {_45 // Handle CORS_45 if (req.method === 'OPTIONS') {_45 return new Response('ok', { headers: corsHeaders })_45 }_45_45 // Search query is passed in request payload_45 const { query } = await req.json()_45_45 // OpenAI recommends replacing newlines with spaces for best results_45 const input = query.replace(/\n/g, ' ')_45_45 const configuration = new Configuration({ apiKey: '<YOUR_OPENAI_KEY>' })_45 const openai = new OpenAIApi(configuration)_45_45 // Generate a one-time embedding for the query itself_45 const embeddingResponse = await openai.createEmbedding({_45 model: 'text-embedding-ada-002',_45 input,_45 })_45_45 const [{ embedding }] = embeddingResponse.data.data_45_45 // In production we should handle possible errors_45 const { data: documents } = await supabaseClient.rpc('match_documents', {_45 query_embedding: embedding,_45 match_threshold: 0.78, // Choose an appropriate threshold for your data_45 match_count: 10, // Choose the number of matches_45 })_45_45 return new Response(JSON.stringify(documents), {_45 headers: { ...corsHeaders, 'Content-Type': 'application/json' },_45 })_45})
Building a smarter search function
ChatGPT doesn't just return existing documents. It's able to assimilate a variety of information into a single, cohesive answer. To do this, we need to provide GPT with some relevant documents, and a prompt that it can use to formulate this answer.
One of the biggest challenges of OpenAI's text-davinci-003
completion model is the 4000 token limit. You must fit both your prompt and the resulting completion within the 4000 tokens. This makes it challenging if you wanted to prompt GPT-3 to answer questions about your own custom knowledge base that would never fit in a single prompt.
Embeddings can help solve this by splitting your prompts into a two-phased process:
- Query your embedding database for the most relevant documents related to the question
- Inject these documents as context for GPT-3 to reference in its answer
Here's another Edge Function that expands upon the simple example above:
_100import { serve } from 'https://deno.land/std@0.170.0/http/server.ts'_100import 'https://deno.land/x/xhr@0.2.1/mod.ts'_100import { createClient } from 'https://esm.sh/@supabase/supabase-js@2.5.0'_100import GPT3Tokenizer from 'https://esm.sh/gpt3-tokenizer@1.1.5'_100import { Configuration, OpenAIApi } from 'https://esm.sh/openai@3.1.0'_100import { stripIndent } from 'https://esm.sh/common-tags@1.8.2'_100import { supabaseClient } from './lib/supabase'_100_100export const corsHeaders = {_100 'Access-Control-Allow-Origin': '*',_100 'Access-Control-Allow-Headers': 'authorization, x-client-info, apikey, content-type',_100}_100_100serve(async (req) => {_100 // Handle CORS_100 if (req.method === 'OPTIONS') {_100 return new Response('ok', { headers: corsHeaders })_100 }_100_100 // Search query is passed in request payload_100 const { query } = await req.json()_100_100 // OpenAI recommends replacing newlines with spaces for best results_100 const input = query.replace(/\n/g, ' ')_100_100 const configuration = new Configuration({ apiKey: '<YOUR_OPENAI_KEY>' })_100 const openai = new OpenAIApi(configuration)_100_100 // Generate a one-time embedding for the query itself_100 const embeddingResponse = await openai.createEmbedding({_100 model: 'text-embedding-ada-002',_100 input,_100 })_100_100 const [{ embedding }] = embeddingResponse.data.data_100_100 // Fetching whole documents for this simple example._100 //_100 // Ideally for context injection, documents are chunked into_100 // smaller sections at earlier pre-processing/embedding step._100 const { data: documents } = await supabaseClient.rpc('match_documents', {_100 query_embedding: embedding,_100 match_threshold: 0.78, // Choose an appropriate threshold for your data_100 match_count: 10, // Choose the number of matches_100 })_100_100 const tokenizer = new GPT3Tokenizer({ type: 'gpt3' })_100 let tokenCount = 0_100 let contextText = ''_100_100 // Concat matched documents_100 for (let i = 0; i < documents.length; i++) {_100 const document = documents[i]_100 const content = document.content_100 const encoded = tokenizer.encode(content)_100 tokenCount += encoded.text.length_100_100 // Limit context to max 1500 tokens (configurable)_100 if (tokenCount > 1500) {_100 break_100 }_100_100 contextText += `${content.trim()}\n---\n`_100 }_100_100 const prompt = stripIndent`${oneLine`_100 You are a very enthusiastic Supabase representative who loves_100 to help people! Given the following sections from the Supabase_100 documentation, answer the question using only that information,_100 outputted in markdown format. If you are unsure and the answer_100 is not explicitly written in the documentation, say_100 "Sorry, I don't know how to help with that."`}_100_100 Context sections:_100 ${contextText}_100_100 Question: """_100 ${query}_100 """_100_100 Answer as markdown (including related code snippets if available):_100 `_100_100 // In production we should handle possible errors_100 const completionResponse = await openai.createCompletion({_100 model: 'text-davinci-003',_100 prompt,_100 max_tokens: 512, // Choose the max allowed tokens in completion_100 temperature: 0, // Set to 0 for deterministic results_100 })_100_100 const {_100 id,_100 choices: [{ text }],_100 } = completionResponse.data_100_100 return new Response(JSON.stringify({ id, text }), {_100 headers: { ...corsHeaders, 'Content-Type': 'application/json' },_100 })_100})
Streaming results
OpenAI API responses take longer to depending on the length of the “answer”. ChatGPT has a nice UX for this by streaming the response to the user immediately. You can see a similar effect for the Supabase docs:
The OpenAI API supports completion streaming with Server Side Events. Supabase Edge Functions are run Deno, which also supports Server Side Events. Check out this commit to see how we modified the Function above to build a streaming interface.
Wrap up
Storing embeddings in Postgres opens a world of possibilities. You can combine your search function with telemetry functions, add an user-provided feedback (thumbs up/down), and make your search feel more integrated with your products.
The pgvector extension is available on all new Supabase projects today. If you want to try it out, launch a new Postgres database today: database.new