
Best Practices for RAG on Relyt

This article introduces the best practices for implementing RAG (Retrieval-Augmented Generation) in the Relyt Vector Edition Data Warehouse (DW) service units.


Key concepts

Collection

In RAG, a collection is the basic unit for users to manage and organize information. Each collection can contain various types of data, such as papers, Excel files, and PowerPoint slides. This flexibility allows users to categorize and organize data based on their needs. Each time a user initiates a chat, one or more collections are associated with it for RAG queries. This query method retrieves the best answers from the relevant collections based on the user's question, improving query accuracy and efficiency.

Chunk

Chunks are small data segments or text blocks extracted from source files. These blocks are typically obtained through specific processing steps aimed at improving the efficiency of information retrieval and processing. A source file (such as a PDF paper) can be divided into multiple chunks, each containing a certain amount of text or information.

Each source file in a collection (such as a PDF paper) passes through a processing pipeline with the following steps:

  • Text extraction: extracts text content from PDFs, removing formatting and redundant information.

  • Chunking: divides the extracted text into multiple chunks, typically based on paragraphs, sentences, or fixed word counts.

  • Semantic embedding generation: generates semantic embeddings for each chunk. These embeddings are derived using deep learning models (such as BERT or other natural language processing models) and capture the semantic information of the chunk.
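
The result of these steps is one row per chunk. Below is a minimal sketch of how a processed chunk might be stored, assuming the table layout described later in this article; the IDs, metadata, and truncated embedding literal are illustrative, and the sparse embedding is omitted for brevity:

```sql
-- One row per chunk: the chunk text, its dense embedding, and the
-- metadata extracted during processing.
INSERT INTO llama_index_vectors_1024
    (id, collection_id, datasource_id, index_name, text_type, text,
     embedding, meta)
VALUES
    ('chunk-0001',                             -- illustrative chunk ID
     '20240803-clzqw6zd70iv501l1k9cmhb5p',     -- collection ID with date prefix
     'paper-001',                              -- illustrative source-file ID
     'EMBEDDING',
     'Chunk',
     'Retrieval-Augmented Generation combines retrieval with ...',
     '[0.1,0.3, ...., 0.6]',                   -- 1024-dim vector from the model
     '{"author": "example", "source": "paper-001.pdf"}');
```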

Recall

After a user initiates a chat on a specific collection, the RAG recall system retrieves relevant information from all chunks corresponding to that collection. This process generally includes the following steps:

  1. Query embedding: The model is called to generate an embedding of the user's query text, resulting in a query vector.

  2. Vector recall: Compare the query vector with the chunk embeddings using approximate nearest-neighbor matching. Relevant chunks are retrieved from the database, each being a text segment highly relevant to the user's query. These chunks may originate from different source files but belong to the same collection.

  3. Re-ranking: Re-rank the recalled chunks to determine which ones best fulfill the user's query. Re-ranking is typically based on factors such as chunk relevance and quality.

  4. Providing context for the query: Return the selected chunks as the query context. This context is supplied to the language model alongside the user's question, helping it deliver more accurate, grounded answers.
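
The vector-recall step (step 2) maps directly onto a distance-ordered query. A minimal sketch, assuming the table layout described later in this article; the `<=>` cosine-distance operator is the pgvecto.rs-style syntax, and the truncated query vector is illustrative:

```sql
-- Vector recall: find the chunks semantically closest to the query
-- embedding within one collection.
SELECT id, datasource_id, text, meta,
       embedding <=> '[0.1,0.3, ...., 0.6]' AS distance  -- 1024-dim query vector
FROM llama_index_vectors_1024
WHERE collection_id = '20240803-clzqw6zd70iv501l1k9cmhb5p'
ORDER BY distance
LIMIT 50;  -- recall more candidates than needed; re-ranking trims them
```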


Table schema design

1. Field requirements

| Field name | Description | Field type |
| --- | --- | --- |
| collection_id | The ID that uniquely identifies a collection. This can be a user collection or a public collection. | text |
| id | The ID that uniquely identifies a record within the collection. | text |
| datasource_id | The ID of a source file (such as a PDF file). | text |
| text | The original text, returned as context to the LLM. | text |
| embedding | The dense embedding vector of the text, corresponding to the text field. It encapsulates the semantic information generated by the model, enabling semantic matching. | vecf16(1024) |
| sparse_embedding | The sparse embedding vector of the text, corresponding to the text field. It contains keyword information and associated weights, used for keyword-based matching. | svector(30052) |
| meta | The metadata of the chunk, such as the author, source, and citations. | jsonb |
| text_type | The original type of the text, such as Chunk or Summary, extracted from metadata (meta), used for point-based filtering. | text |
| index_name | The index type corresponding to this record, extracted from metadata (meta), used specifically for point-based filtering. | text |

2. Design and select partition keys

When designing the table schema, it is crucial to consider how to efficiently manage and query datasets of varying sizes. To balance table count and query performance, we select the user's dataset ID (collection_id) as the partition key (partition by range(collection_id)). The collection_id includes a date prefix to ensure chronological ordering. This design strategy effectively manages both private user datasets and public platform datasets, optimizing query performance.

Using collection_id with a date prefix as the partition key offers several advantages:

  • Natural temporal sequence: Since collection_id is inherently ordered by date, using it as the partition key ensures data is stored in a manner that mirrors typical access patterns, promoting efficient cold and hot data separation.

  • Avoid metadata inflation: Unlike the naive approach of creating a separate table for each user's private dataset, this strategy helps prevent metadata inflation as the number of users and datasets grows. It also simplifies the database structure, reduces maintenance complexity, and enhances overall system performance.

  • Customization for special datasets: For datasets that require isolation or specialized indexing, adding a sub-partition expression is sufficient. This eliminates the need to rebuild tables or modify business logic, streamlining the process of accommodating unique dataset requirements.

Examples of collection_id with date prefixes:

  • "20240803-clzqw6zd70iv501l1k9cmhb5p"
  • "20240804-clzqw6zd70iv501l1k9cmsfhw"
  • "20240805-clzqw6zd70iv501l1k9cmvert"

Examples of table partitioning by month:

CREATE TABLE llama_index_vectors_1024_part_202407 partition of llama_index_vectors_1024 FOR VALUES FROM ('20240701') TO ('20240801');
CREATE TABLE llama_index_vectors_1024_part_202408 partition of llama_index_vectors_1024 FOR VALUES FROM ('20240801') TO ('20240901');

Examples of table partitioning by day:

CREATE TABLE llama_index_vectors_1024_part_20240701 partition of llama_index_vectors_1024 FOR VALUES FROM ('20240701') TO ('20240702');
CREATE TABLE llama_index_vectors_1024_part_20240702 partition of llama_index_vectors_1024 FOR VALUES FROM ('20240702') TO ('20240703');

Depending on data growth and access patterns, we recommend partitioning tables by day or by month to maintain an optimal balance between the number of partitioned tables and query performance.
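
For the special-dataset case above, a dedicated partition can be carved out for a single collection. A hedged sketch, with an illustrative collection ID; note that range partitions must not overlap, so this range has to be created before, or instead of, a broader partition that would cover it:

```sql
-- Isolate one collection in its own partition. The upper bound
-- increments the last character of the collection_id so the range
-- covers exactly this ID (and any suffixes of it).
CREATE TABLE llama_index_vectors_1024_part_special
    PARTITION OF llama_index_vectors_1024
    FOR VALUES FROM ('20240803-clzqw6zd70iv501l1k9cmhb5p')
               TO ('20240803-clzqw6zd70iv501l1k9cmhb5q');
```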

3. Design distribution columns

Most queries use equality conditions such as collection_id = '20240803-xxx'. To optimize queries per second (QPS) and improve compute node utilization, we choose collection_id as the distribution column. Note that because all rows of a collection hash to the same node, bulk-loading a single collection concentrates write load on that node, which may impact real-time ingestion performance.

4. Indexing design

4.1 Vector indexing

For embedding vector fields, we use the Hierarchical Navigable Small World (HNSW) index to achieve sub-millisecond query latency while maintaining a high recall rate, optimizing performance for large-scale vector searches.

4.2 Indexing for point queries

For frequently queried filter fields such as index_name and text_type, we create composite B-tree secondary indexes led by collection_id, such as (collection_id, index_name) and (collection_id, text_type). These indexes significantly accelerate queries with highly selective filters.

5. Full table creation statement

5.1 Create the main table

CREATE TABLE llama_index_vectors_1024
(
id text not null,
collection_id text not null, -- With date prefix, e.g., "20240803-clzqw6zd70iv501l1k9cmhb5p"
datasource_id text,
index_name text,
text_type text,
text text,
embedding vecf16(1024),
sparse_embedding svector(30052),
meta jsonb,
primary key (collection_id, id)
)
DISTRIBUTED BY (collection_id)
PARTITION BY range(collection_id);

-- Set storage as plain
ALTER TABLE llama_index_vectors_1024
ALTER COLUMN id set storage plain,
ALTER COLUMN collection_id set storage plain,
ALTER COLUMN index_name set storage plain,
ALTER COLUMN text_type set storage plain,
ALTER COLUMN embedding set storage plain,
ALTER COLUMN datasource_id set storage plain;

5.2 Partition tables by month

-- Partition by month
CREATE TABLE llama_index_vectors_1024_part_202407 partition of llama_index_vectors_1024 FOR VALUES FROM ('20240701') TO ('20240801');
CREATE TABLE llama_index_vectors_1024_part_202408 partition of llama_index_vectors_1024 FOR VALUES FROM ('20240801') TO ('20240901');

-- Default partition
CREATE TABLE llama_index_vectors_1024_part_default partition of llama_index_vectors_1024 default;
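
Partitions for a new time range should be created before its data arrives: in PostgreSQL-style partitioning, adding a partition whose range already has rows sitting in the default partition is rejected until those rows are moved out. A maintenance sketch:

```sql
-- Create next month's partition ahead of time so new rows land in it
-- instead of accumulating in the default partition.
CREATE TABLE llama_index_vectors_1024_part_202409
    PARTITION OF llama_index_vectors_1024
    FOR VALUES FROM ('20240901') TO ('20241001');
```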

5.3 Vector indexing

CREATE INDEX llama_index_vectors_1024_embedding on
llama_index_vectors_1024
USING vectors (embedding vecf16_cos_ops)
WITH (options = $$
optimizing.optimizing_threads = 10
segment.max_growing_segment_size = 2000
segment.max_sealed_segment_size = 30000000
[indexing.hnsw]
m = 30
ef_construction = 500
$$);
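
The sparse_embedding column can be indexed the same way to accelerate keyword-style recall. A sketch assuming a pgvecto.rs-style extension that provides an svector_dot_ops operator class; verify the class name and available options against your extension's documentation:

```sql
-- Sparse-vector index for keyword matching on sparse_embedding.
CREATE INDEX llama_index_vectors_1024_sparse_embedding on
llama_index_vectors_1024
USING vectors (sparse_embedding svector_dot_ops);
```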

5.4 Indexing for point queries

CREATE INDEX llama_index_vectors_1024_index_name on
llama_index_vectors_1024 using btree

(collection_id, index_name);
CREATE INDEX llama_index_vectors_1024_text_type on
llama_index_vectors_1024 using btree(collection_id, text_type);
CREATE INDEX llama_index_vectors_1024_datasource_id on
llama_index_vectors_1024 using btree(collection_id, datasource_id);
CREATE INDEX llama_index_vectors_1024_meta on llama_index_vectors_1024
USING gin((collection_id::tsvector), meta);

6. A query example

SELECT id, text, meta,
embedding <=> '[0.1,0.3, ...., 0.6]' -- A 1024-dimensional query vector
as distance
FROM llama_index_vectors_1024
WHERE collection_id = '20240803-clzqw6zd70iv501l1k9cmhb5p'
AND index_name = 'EMBEDDING'
ORDER BY distance -- Sort by cosine distance, matching the vecf16_cos_ops index
LIMIT 16; -- Restrict the result to 16 entries
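
Dense and sparse recall can also be combined: run both queries, then merge the candidate lists before re-ranking. A sketch of the sparse (keyword) side, assuming the `<#>` dot-product distance operator and the `{index:value}/dimensions` sparse-vector literal format of pgvecto.rs-style extensions; the indices and weights below are illustrative, so check the exact syntax against your extension's documentation:

```sql
-- Sparse (keyword) recall: match on the sparse embedding of the query.
SELECT id, text, meta,
       sparse_embedding <#> '{11:0.7, 4031:0.4}/30052' AS distance
FROM llama_index_vectors_1024
WHERE collection_id = '20240803-clzqw6zd70iv501l1k9cmhb5p'
ORDER BY distance
LIMIT 16;
```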