Working with AI data stored in Postgres tables
Suggest editsThe examples on this page are about working with AI data stored in columns in the Postgres table.
To see how to use AI data stored in S3-compatible object storage, skip to working with AI data in S3.
Begin by creating a Postgres table for some test AI data:
CREATE TABLE products ( product_id SERIAL PRIMARY KEY, product_name TEXT NOT NULL, description TEXT, last_updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP );
CREATE TABLE
Working with auto embedding
Next, you are going to create a retriever with the just created products table as the source using the aidb.create_pg_retriever
function which has this syntax:
aidb.create_pg_retriever( retriever_name text, schema_name text, primary_key text, model_name text, data_type text, source_table text, columns text[], auto_embedding boolean )
- The
retriever_name
is used to identify and reference the retriever; set it toproduct_embeddings_auto
for this example. - The
schema_name
is the schema where the source table is located; set this topublic
. - The
primary_key
is the primary key column of the source table. - The
model_name
is the name of the embeddings encoder model for similarity data; set it toall-MiniLM-L6-v2
to use the open encoder model for text data from HuggingFace. - The
data_type
is the type of data in the source table, which could be eitherimg
ortext
. Set it totext
. - The
source_table
is the name of the source table. The source table created previously, isproducts
so set it to that. - The
columns
is an array of columns to use for the similarity search by the retriever. Set this toARRAY['product_name', 'description']
to use the product_name and description columns. - The
auto_embedding
is a boolean value to set a trigger for auto embeddings. Set it to TRUE so that any future insert, update or delete to the source table shall automatically generate, update or delete also the corresponding embedding.
This gives the following SQL command:
SELECT aidb.create_pg_retriever( 'product_embeddings_auto', -- Retriever name 'public', -- Schema 'product_id', -- Primary key 'all-MiniLM-L6-v2', -- embedding model 'text', -- data type 'products', -- Source table ARRAY['product_name', 'description'], -- Columns to vectorize TRUE -- auto embeddings TRUE to set trigger );
create_pg_retriever --------------------- (1 row)
You have now created a retriever for the products table. The next step is to insert some AI data records into the products table.
Since you set auto_embedding
to true, the retriever shall automatically generate all embeddings in real-time for each inserted record:
INSERT INTO products (product_name, description) VALUES ('Hamburger', 'Tasty'), ('Cheesburger', 'Very tasty'), ('Fish n Chips', 'Naa'), ('Fries', 'Dunno'), ('Burrito', 'Always'), ('Pizza', 'Mkay'), ('Sandwich', 'So what'), ('Veggie Burger', 'Go away'), ('Kebab', 'Maybe');
INSERT 0 9
Now you can use the retriever, by specifying the retriever name, to perform a similarity retrieval of the top K most relevant, in this case most similar, AI data items. You can do this by running the aidb.retrieve
function with the required parameters:
aidb.retrieve( query text, top_k integer, retriever_name text )
- The
query
is the text to use to retrieve the top similar data. Set it toI like it
. - The
top_k
is the number of top similar data items to retrieve. Set this to 5 - The
retriever_name
is the name of the retriever. The retriever's name isproduct_embeddings_auto
.
This gives the following SQL command:
SELECT data FROM aidb.retrieve( 'I like it', -- The query text to retrieve the top similar data 5, -- top K 'product_embeddings_auto' -- retriever's name );
data -------------------------------------- {'data': 'Hamburger - Tasty'} {'data': 'Cheesburger - Very tasty'} {'data': 'Pizza - Mkay'} {'data': 'Sandwich - So what'} {'data': 'Kebab - Maybe'} (5 rows)
Working without auto embedding
You can now create a retriever without auto embedding. This means that the application has control over when the embeddings computation occurs. It also means that the computation is a bulk operation. For demonstration you can simply create a second retriever for the same products table that you just previously created the first retriever for, but setting auto_embedding
to false.
SELECT aidb.create_pg_retriever( 'product_embeddings_bulk', -- Retriever name 'public', -- Schema 'product_id', -- Primary key 'all-MiniLM-L6-v2', -- embedding model 'text', -- data type 'products', -- Source table ARRAY['product_name', 'description'], -- Columns to vectorize FALSE -- auto embeddings FALSE );
create_pg_retriever --------------------- (1 row)
The AI records are already in the table though. As this second retriever is newly created, it won't have created any embeddings. Running aidb.retrieve
using the retriever now doesn't return any results:
SELECT data FROM aidb.retrieve( 'I like it', -- The query text to retrieve the top similar data 5, -- top K 'product_embeddings_bulk' -- retriever's name );
data ------ (0 rows)
You need to run a bulk generation of embeddings before performing any retrieval. You can do this using the aidb.refresh_retriever
function:
aidb.refresh_retriever( retriever_name text )
The retriever_name
is the name of the retriever. Our retriever's name is product_embeddings_bulk
. So the SQL command is:
SELECT aidb.refresh_retriever( 'product_embeddings_bulk' -- name of the retriever );
INFO: inserted table name public._aidb_embeddings_product_embeddings_bulk refresh_retriever ------------------- (1 row)
You can now run that retrieve operation using the second retriever and get the same results as with the first retriever:
SELECT data FROM aidb.retrieve( 'I like it', -- The query text to retrieve the top similar data 5, -- top K 'product_embeddings_bulk' -- retriever's name );
data -------------------------------------- {'data': 'Hamburger - Tasty'} {'data': 'Cheesburger - Very tasty'} {'data': 'Pizza - Mkay'} {'data': 'Sandwich - So what'} {'data': 'Kebab - Maybe'} (5 rows)
The next step is to see what happens if when you add more AI data records:
INSERT INTO products (product_name, description) VALUES ('Chicken Nuggets', 'Never'), ('Ramen', 'Delicious');
INSERT 0 2
The new data is automatically picked up in the retrieval from the first retriever with auto embeddings:
SELECT data FROM aidb.retrieve( 'I like it', -- The query text to retrieve the top similar data 5, -- top K 'product_embeddings_auto' -- retriever's name );
data -------------------------------------- {'data': 'Hamburger - Tasty'} {'data': 'Cheesburger - Very tasty'} {'data': 'Pizza - Mkay'} {'data': 'Sandwich - So what'} {'data': 'Ramen - Delicious'} (5 rows)
The second retriever without auto embedding doesn't reflect the new data. It can only do so when once there has been another explicit call to aidb.refresh_retriever
. Until then, the results don't change:
SELECT data FROM aidb.retrieve( 'I like it', -- The query text to retrieve the top similar data 5, -- top K 'product_embeddings_bulk' -- retriever's name );
data -------------------------------------- {'data': 'Hamburger - Tasty'} {'data': 'Cheesburger - Very tasty'} {'data': 'Pizza - Mkay'} {'data': 'Sandwich - So what'} {'data': 'Kebab - Maybe'} (5 rows)
If you now call aidb.refresh_retriever()
again, the embeddings computation uses the new data to refresh the embeddings:
SELECT aidb.refresh_retriever( 'product_embeddings_bulk' -- name of the retriever );
INFO: inserted table name public._aidb_embeddings_product_embeddings_bulk refresh_retriever -------------------
And the new data shows up in the results of the query when you call the aidb.retrieve
function again:
SELECT data FROM aidb.retrieve( 'I like it', -- The query text to retrieve the top similar data 5, -- top K 'product_embeddings_bulk' -- retriever's name );
data -------------------------------------- {'data': 'Hamburger - Tasty'} {'data': 'Cheesburger - Very tasty'} {'data': 'Pizza - Mkay'} {'data': 'Sandwich - So what'} {'data': 'Ramen - Delicious'} (5 rows)
You used the two different retrievers for the same source data just to demonstrate the workings of auto embedding compared to explicit refresh_retriever()
.
In practice you may want to combine auto embedding and refresh_retriever() in a single retriever to conduct an initial embedding of data that existed before you created the retriever and then rely on auto embedding for any future data that's ingested, updated, or deleted.
You should consider relying on aidb.refresh_retriever
, and not using auto embedding, if you typically ingest a lot of AI data at once as a batch.
Could this page be better? Report a problem or suggest an addition!