Vector Indexing

Creating a vector index accelerates retrieval over large datasets and in scenarios that require fast access, such as query optimization, machine learning, data mining, and image or spatial data search. An index improves query performance and response times, speeding up analysis and search tasks across the system.


Context

Vector DPS leverages the Hierarchical Navigable Small World (HNSW) algorithm, which combines the concepts of skip lists and Navigable Small World (NSW) graphs to enable efficient approximate nearest neighbor searches using a hierarchical structure. The upper layers of the graph contain longer edges for fast target location, while the lower layers use shorter edges to enhance search precision.

Graph construction

The m parameter controls the number of connections each new node establishes with its nearest neighbors. A higher m value results in a denser graph with more connections, improving search performance at the cost of increased memory usage and longer insertion times. During node insertion, the algorithm locates the nearest m nodes and creates bidirectional connections between them.

Dimensionality reduction

Vector DPS also supports Product Quantization (PQ) to reduce the dimensionality of high-dimensional vectors. By storing the quantized vectors in the index, PQ minimizes table lookups, boosting the performance of both vector insertion and query operations.


Syntax

CREATE INDEX <index_name>
ON <schema_name>.<table_name>
USING vectors (<column_name> <distance_measure>)
WITH (options = $$
<common_option_key1> = <common_option_value1>
<common_option_key2> = <common_option_value2>
...
[indexing.hnsw]
<hnsw_option_key1> = <hnsw_option_value1>
<hnsw_option_key2> = <hnsw_option_value2>
...
$$);


Parameters

  • <index_name>

    The name of the index. This parameter is optional.

  • <schema_name>

    The schema name.

  • <table_name>

    The table name.

  • <column_name>

    The name of the vector column.

  • <distance_measure>

    The similarity distance measure algorithm, formatted as <vector_data_type>_<distance_type>_ops. Available options include:

    • vecf16_l2_ops: squared Euclidean distance for 16-bit floating-point type.

    • vecf16_dot_ops: negative dot product distance for 16-bit floating-point type.

    • vecf16_cos_ops: cosine distance for 16-bit floating-point type.

    • vector_l2_ops: squared Euclidean distance for 32-bit floating-point type.

    • vector_dot_ops: negative dot product distance for 32-bit floating-point type.

    • vector_cos_ops: cosine distance for 32-bit floating-point type.

  • Other optional common vector index parameters (<common_option_n>)

    Key | Data Type | Value Range | Default | Description
    optimizing.optimizing_threads | integer | [1, 65535] | 1 | The maximum number of threads for indexing.
    optimizing.sealing_secs | integer | [1, 60] | 60 | The merge detection time for indexing.
    segment.max_growing_segment_size | integer | [1, 4,000,000,000] | 20,000 | The maximum size of vectors without an index.
    segment.max_sealed_segment_size | integer | [1, 4,000,000,000] | 1,000,000 | The maximum size of vectors for indexing.

  • HNSW algorithm parameters (<hnsw_option_n>)

    Key | Data Type | Value Range | Default | Description
    m | integer | [4, 128] | 12 | The maximum degree of each node.
    ef_construction | integer | [10, 2000] | 300 | The search scope during construction.
    quantization | table | trivial, scalar, or product | N/A | The distance quantization algorithm: trivial (quantization not used), scalar (scalar quantization), or product (product quantization, recommended). See the quantization.product options table below for detailed parameter descriptions.

quantization.product options

Key | Data Type | Value Range | Default | Description
sample | integer | [1, 1,000,000] | 65535 | The number of samples for quantization.
ratio | enum | "x4", "x8", "x16", "x32", "x64" | "x4" | The compression ratio for quantization.
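
For illustration, the following sketch combines the HNSW parameters with product quantization. It assumes that the quantization settings are supplied as a nested [indexing.hnsw.quantization.product] section, following the TOML-style option layout shown under Syntax; the items table and its feature column are hypothetical.

-- A minimal sketch, assuming product quantization is configured through a nested
-- [indexing.hnsw.quantization.product] section; verify the exact key path in your release.
CREATE INDEX idx_items_feature_pq
ON items
USING vectors (feature vecf16_l2_ops)
WITH (options = $$
[indexing.hnsw]
m = 16
ef_construction = 300
[indexing.hnsw.quantization.product]
sample = 65535
ratio = "x16"
$$);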


Examples

Imagine a text-based knowledge base where documents are segmented into chunks and transformed into 512-dimensional embedding vectors for database storage. The resulting docs table includes the following fields:

Field | Data Type | Description
id | serial | The ID.
chunk | varchar(1024) | The chunk.
intime | timestamp | The timestamp when the document was stored.
url | varchar(1024) | The URL of the document to which the chunk belongs.
embedding | vecf16(512) | The embedding vector of the chunk.

  1. Create a vector table named docs to store vector data.

    CREATE TABLE docs (
        id SERIAL PRIMARY KEY,
        chunk VARCHAR(1024),
        intime TIMESTAMP,
        url VARCHAR(1024),
        embedding VECF16(512)
    ) DISTRIBUTED BY (id);
  2. Set the storage mode for the vector column to PLAIN to reduce row scan cost and improve performance.

    ALTER TABLE docs ALTER COLUMN embedding SET STORAGE PLAIN;
  3. Create a vector index on the vector column.

    -- Create a vector index using Euclidean distance.
    CREATE INDEX idx_docs_feature_l2 ON docs USING vectors(embedding vecf16_l2_ops)
    WITH (options = $$
    optimizing.optimizing_threads = 3
    segment.max_growing_segment_size = 100000
    segment.max_sealed_segment_size = 8000000
    [indexing.hnsw]
    m=30
    ef_construction=500
    $$);
  4. Create indexes on frequently used structured columns to accelerate hybrid queries.

    CREATE INDEX ON docs(intime);
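
With the vector index and the index on intime in place, a hybrid query can filter on the structured column and rank the remaining rows by vector distance. The sketch below truncates the query embedding for readability (a real query supplies all 512 dimensions), and the time window and LIMIT values are arbitrary; the ::vecf16 cast is assumed to work the same way as the ::vector casts shown under Supported operator types.

-- Hybrid query sketch: filter recent chunks, then rank by squared Euclidean
-- distance so the vecf16_l2_ops index created above can serve the ORDER BY.
SELECT id, chunk, url
FROM docs
WHERE intime > now() - interval '30 days'
ORDER BY embedding <-> '[0.12, 0.03, ...]'::vecf16   -- truncated 512-dimensional query vector
LIMIT 10;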


Supported operator types

Name | Description | Formula
<-> | Squared Euclidean distance. | Σ (x_i - y_i)²
<#> | Negative dot product. | -Σ x_i · y_i
<=> | Cosine distance. | 1 - (Σ x_i · y_i) / (√(Σ x_i²) · √(Σ y_i²))

Examples of using vector operators:

-- Squared Euclidean distance
SELECT '[1, 2, 3]'::vector <-> '[3, 2, 1]'::vector;

-- Negative dot product
SELECT '[1, 2, 3]'::vector <#> '[3, 2, 1]'::vector;

-- Cosine distance
SELECT '[1, 2, 3]'::vector <=> '[3, 2, 1]'::vector;
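
Following standard PostgreSQL operator-class behavior, the operator used in ORDER BY is expected to match the distance measure the index was built with, for example <=> with a *_cos_ops index. The sketch below assumes a hypothetical cosine-distance index on the docs.embedding column from the examples above, with a truncated query vector.

-- Operator / operator-class pairing (assumed to follow standard PostgreSQL behavior):
--   <->  pairs with *_l2_ops indexes
--   <#>  pairs with *_dot_ops indexes
--   <=>  pairs with *_cos_ops indexes
CREATE INDEX idx_docs_embedding_cos ON docs USING vectors (embedding vecf16_cos_ops)
WITH (options = $$
[indexing.hnsw]
m = 12
ef_construction = 300
$$);

-- Rank by cosine distance so the cosine index above can be used.
SELECT id, chunk
FROM docs
ORDER BY embedding <=> '[0.12, 0.03, ...]'::vecf16
LIMIT 10;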