AI_EMBEDDING_VECTOR

Creating embeddings using the ai_embedding_vector function in PlaidCloud Lakehouse

This document provides an overview of the ai_embedding_vector function in PlaidCloud Lakehouse and demonstrates how to create document embeddings using this function.

The main code implementation can be found here.

By default, PlaidCloud Lakehouse leverages the text-embedding-ada model for generating embeddings.

Overview of ai_embedding_vector

The ai_embedding_vector function in PlaidCloud Lakehouse is a built-in function that generates vector embeddings for text data. It is useful for natural language processing tasks, such as document similarity, clustering, and recommendation systems.

The function takes a text input and returns a high-dimensional vector that represents the semantic meaning and context of that text. The embeddings are produced by models pre-trained on large text corpora, which capture the relationships between words and phrases in a continuous vector space.
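
To make the idea of a continuous vector space more concrete, here is a minimal Python sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors and their names are made-up illustrative values, not output of ai_embedding_vector; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score closer to 1.0.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

In this toy setup the first pair scores higher than the second, which is the property that similarity search and clustering rely on.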

Creating embeddings using ai_embedding_vector

To create embeddings for a text document using the ai_embedding_vector function, follow the example below.

  1. Create a table to store the documents:
CREATE TABLE documents (
    id INT,
    title VARCHAR,
    content VARCHAR,
    embedding ARRAY(FLOAT32)
);
  2. Insert example documents into the table:
INSERT INTO documents(id, title, content)
VALUES
    (1, 'A Brief History of AI', 'Artificial intelligence (AI) has been a fascinating concept of science fiction for decades...'),
    (2, 'Machine Learning vs. Deep Learning', 'Machine learning and deep learning are two subsets of artificial intelligence...'),
    (3, 'Neural Networks Explained', 'A neural network is a series of algorithms that endeavors to recognize underlying relationships...');
  3. Generate the embeddings for rows that do not have one yet:
UPDATE documents SET embedding = ai_embedding_vector(content) WHERE embedding IS NULL OR length(embedding) = 0;

After running the query, the embedding column in the table will contain the generated embeddings.

The embeddings are stored as an array of FLOAT32 values in the embedding column, which has the ARRAY(FLOAT32) column type.

You can now use these embeddings for various natural language processing tasks, such as finding similar documents or clustering documents based on their content.

  4. Inspect the embeddings:
SELECT length(embedding) FROM documents;
+-------------------+
| length(embedding) |
+-------------------+
|              1536 |
|              1536 |
|              1536 |
+-------------------+

The query above shows that each document's embedding has a length of 1536, that is, 1536 dimensions.
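
As an illustration of the "finding similar documents" use case mentioned earlier, the following Python sketch ranks stored documents against a query embedding. It assumes the rows have already been fetched from the documents table by some client. The `rows` list, the 3-dimensional vectors, the `query_embedding` values, and the `rank_by_similarity` helper are all hypothetical stand-ins; real embeddings would hold 1536 values each.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, rows, top_k=2):
    """Return (id, title, score) tuples for the top_k most similar rows."""
    expected = len(query)
    scored = []
    for doc_id, title, embedding in rows:
        # Skip rows whose embedding is missing or has the wrong dimension.
        if embedding is None or len(embedding) != expected:
            continue
        scored.append((doc_id, title, cosine_similarity(query, embedding)))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Hypothetical rows shaped like (id, title, embedding); values are invented.
rows = [
    (1, "A Brief History of AI", [0.7, 0.2, 0.1]),
    (2, "Machine Learning vs. Deep Learning", [0.2, 0.8, 0.1]),
    (3, "Neural Networks Explained", [0.3, 0.6, 0.4]),
]
query_embedding = [0.1, 0.9, 0.2]

for doc_id, title, score in rank_by_similarity(query_embedding, rows):
    print(doc_id, title, round(score, 3))
```

The dimension check mirrors the length(embedding) inspection above: comparing vectors of different lengths is meaningless, so such rows are skipped rather than scored.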

Last modified June 11, 2024 at 9:00 PM EST: clean up cautions and notes (d4a1b9a)