AI_EMBEDDING_VECTOR

Creating embeddings using the ai_embedding_vector function in PlaidCloud Lakehouse

This document provides an overview of the ai_embedding_vector function in PlaidCloud Lakehouse and demonstrates how to create document embeddings using this function.

The main code implementation can be found here.

By default, PlaidCloud Lakehouse leverages the text-embedding-ada model for generating embeddings.

Note:

Starting from PlaidCloud Lakehouse v1.1.47, PlaidCloud Lakehouse supports the Azure OpenAI service.

This integration offers improved data privacy.

To use Azure OpenAI, add the following configurations to the [query] section:

# Azure OpenAI
openai_api_chat_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_embedding_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_version = "2023-03-15-preview"

Caution:

PlaidCloud Lakehouse relies on (Azure) OpenAI for AI_EMBEDDING_VECTOR and sends the embedding column data to (Azure) OpenAI.

They will only work when the PlaidCloud Lakehouse configuration includes the openai_api_key, otherwise they will be inactive.

This function is available by default on PlaidCloud Lakehouse using an Azure OpenAI key. If you use them, you acknowledge that your data will be sent to Azure OpenAI by us.

Overview of ai_embedding_vector

The ai_embedding_vector function in PlaidCloud Lakehouse is a built-in function that generates vector embeddings for text data. It is useful for natural language processing tasks, such as document similarity, clustering, and recommendation systems.

The function takes a text input and returns a high-dimensional vector that represents the input text's semantic meaning and context. The embeddings are created using pre-trained models on large text corpora, capturing the relationships between words and phrases in a continuous space.

Creating embeddings using ai_embedding_vector

To create embeddings for a text document using the ai_embedding_vector function, follow the example below.

Create a table to store the documents:

CREATE TABLE documents (
                           id INT,
                           title VARCHAR,
                           content VARCHAR,
                           embedding ARRAY(FLOAT32)
);

Insert example documents into the table:

INSERT INTO documents(id, title, content)
VALUES
    (1, 'A Brief History of AI', 'Artificial intelligence (AI) has been a fascinating concept of science fiction for decades...'),
    (2, 'Machine Learning vs. Deep Learning', 'Machine learning and deep learning are two subsets of artificial intelligence...'),
    (3, 'Neural Networks Explained', 'A neural network is a series of algorithms that endeavors to recognize underlying relationships...'),

Generate the embeddings:

UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0;

After running the query, the embedding column in the table will contain the generated embeddings.

The embeddings are stored as an array of FLOAT32 values in the embedding column, which has the ARRAY(FLOAT32) column type.

You can now use these embeddings for various natural language processing tasks, such as finding similar documents or clustering documents based on their content.

Inspect the embeddings:

SELECT length(embedding) FROM documents;
+-------------------+
| length(embedding) |
+-------------------+
|              1536 |
|              1536 |
|              1536 |
+-------------------+

The query above shows that the generated embeddings have a length of 1536(dimensions) for each document.