Vector Indexing
Creating a vector index accelerates retrieval in large datasets or scenarios requiring fast access, such as query optimization, machine learning, data mining, and image or spatial data searches. This improves query performance, speeds up analysis, and optimizes search tasks, enhancing overall system efficiency and response times.
Context
Vector DPS leverages the Hierarchical Navigable Small World (HNSW) algorithm, which combines the concepts of skip lists and Navigable Small World (NSW) graphs to enable efficient approximate nearest neighbor searches using a hierarchical structure. The upper layers of the graph contain longer edges for fast target location, while the lower layers use shorter edges to enhance search precision.
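The layered search described above can be sketched in a few lines of Python. This is a minimal illustration only, not the Vector DPS implementation: the function name `layered_greedy_search`, the toy one-dimensional data, and the hand-built two-layer graph are all assumptions made for the example. A real HNSW search also keeps a candidate list rather than a single current node.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def layered_greedy_search(layers, vectors, entry_id, query):
    """Greedy descent through a list of graphs, sparsest (top) layer first.

    Hypothetical helper: in each layer, keep moving to the neighbor closest
    to the query until no neighbor improves, then drop to the next layer.
    """
    current = entry_id
    for graph in layers:
        while True:
            neighbors = graph.get(current, ())
            best = min(
                neighbors,
                key=lambda n: squared_l2(vectors[n], query),
                default=current,
            )
            if squared_l2(vectors[best], query) < squared_l2(vectors[current], query):
                current = best
            else:
                break
    return current


# Toy data: nine points on a line at positions 0..8.
vectors = {i: [float(i)] for i in range(9)}
# Upper layer: few nodes joined by long edges, used to get near the target quickly.
top = {0: [4], 4: [0, 8], 8: [4]}
# Lower layer: every node joined to its immediate neighbors by short edges.
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}

result = layered_greedy_search([top, bottom], vectors, entry_id=0, query=[6.2])
print(result)  # 6
```

In this toy run the upper layer moves from node 0 to node 8 in two hops, and the lower layer then refines the result to node 6 using the short edges.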
Graph construction
The m parameter controls the number of connections each new node establishes with its nearest neighbors. A higher m value results in a denser graph with more connections, improving search performance at the cost of increased memory usage and longer insertion times. During node insertion, the algorithm locates the nearest m nodes and creates bidirectional connections between them.
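The insertion step can be illustrated with a simplified, single-layer sketch. It assumes a brute-force scan to find the nearest nodes and omits the layering and neighbor pruning that a real HNSW index performs; `insert_node` and the sample points are hypothetical names chosen for this example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def insert_node(graph, vectors, new_id, new_vec, m):
    """Link new_id to its m nearest existing nodes with bidirectional edges.

    Simplified sketch: nearest nodes are found by scanning every stored
    vector, which a real index avoids by searching the graph itself.
    """
    nearest = sorted(vectors, key=lambda nid: squared_l2(vectors[nid], new_vec))[:m]
    vectors[new_id] = new_vec
    graph[new_id] = set(nearest)
    for nid in nearest:
        graph[nid].add(new_id)  # the reverse edge makes the connection bidirectional
    return nearest


graph, vectors = {}, {}
insert_node(graph, vectors, 0, [0.0, 0.0], m=2)
insert_node(graph, vectors, 1, [1.0, 0.0], m=2)
insert_node(graph, vectors, 2, [5.0, 5.0], m=2)
linked = insert_node(graph, vectors, 3, [0.9, 0.1], m=2)
print(linked)  # [1, 0]
```

With m=2, node 3 is linked to nodes 1 and 0 only, so the distant node 2 gains no new edge. Raising m would add more edges per insertion, which is where the extra memory and insertion time come from.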
Dimensionality reduction
Vector DPS also supports Product Quantization (PQ) to reduce the dimensionality of high-dimensional vectors. By storing the quantized vectors in the index, PQ minimizes table lookups, boosting the performance of both vector insertion and query operations.
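As a rough illustration of how product quantization shrinks a vector, the sketch below splits a vector into sub-vectors and replaces each one with the ID of its nearest centroid. The codebooks here are hard-coded and purely hypothetical; in practice they are learned from sample vectors, and nothing here reflects the internal layout used by Vector DPS.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Return one centroid ID per sub-vector (hypothetical helper).

    codebooks[i] is the list of centroids for the i-th sub-vector; the
    vector length is assumed to be divisible by len(codebooks).
    """
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for i, centroids in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        distances = [squared_l2(sub, c) for c in centroids]
        codes.append(distances.index(min(distances)))
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx = []
    for code, centroids in zip(codes, codebooks):
        approx.extend(centroids[code])
    return approx


# Two sub-spaces of two dimensions each, with two made-up centroids per sub-space.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.2, 0.8], codebooks)
print(codes)                        # [1, 0]
print(pq_decode(codes, codebooks))  # [1.0, 1.0, 0.0, 1.0]
```

The four-component vector is stored as two small integers, and distances can then be estimated against the centroids instead of the full-precision vector.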
Syntax
CREATE INDEX <index_name>
ON <schema_name>.<table_name>
USING vectors (<column_name> <distance_measure>)
WITH (options = $$
<common_option_key1> = <common_option_value1>
<common_option_key2> = <common_option_value2>
...
[indexing.hnsw]
<hnsw_option_key1> = <hnsw_option_value1>
<hnsw_option_key2> = <hnsw_option_value2>
...
$$);
Parameters
- <index_name>: The name of the index. This parameter is optional.
- <schema_name>: The schema name.
- <table_name>: The table name.
- <column_name>: The name of the vector column.
- <distance_measure>: The similarity distance measure algorithm, formatted as <vector_data_type>_<distance_type>_ops. Available options include:
  - vecf16_l2_ops: squared Euclidean distance for the 16-bit floating-point type.
  - vecf16_dot_ops: negative dot product distance for the 16-bit floating-point type.
  - vecf16_cos_ops: cosine distance for the 16-bit floating-point type.
  - vector_l2_ops: squared Euclidean distance for the 32-bit floating-point type.
  - vector_dot_ops: negative dot product distance for the 32-bit floating-point type.
  - vector_cos_ops: cosine distance for the 32-bit floating-point type.
- Other optional common vector index parameters (<common_option_n>):

Key | Data Type | Value Range | Default | Description |
---|---|---|---|---|
optimizing.optimizing_threads | integer | [1, 65535] | 1 | The maximum number of threads for indexing. |
optimizing.sealing_secs | integer | [1, 60] | 60 | The merge detection time for indexing. |
segment.max_growing_segment_size | integer | [1, 4,000,000,000] | 20,000 | The maximum size of vectors without an index. |
segment.max_sealed_segment_size | integer | [1, 4,000,000,000] | 1,000,000 | The maximum size of vectors for indexing. |
- HNSW algorithm parameters (<hnsw_option_n>):

Key | Data Type | Value Range | Default | Description |
---|---|---|---|---|
m | integer | [4, 128] | 12 | The maximum degree of each node. |
ef_construction | integer | [10, 2000] | 300 | The search scope during construction. |
quantization | table | trivial, scalar, or product | N/A | The distance quantization algorithm. Supported options include: trivial (quantization not used), scalar (scalar quantization), and product (product quantization, recommended). See quantization.product options for detailed parameter description. |
quantization.product options
Key | Data Type | Value Range | Default | Description |
---|---|---|---|---|
sample | integer | [1, 1,000,000] | 65535 | The number of samples for quantization. |
ratio | enum | "x4", "x8", "x16", "x32", "x64" | "x4" | The compression ratio for quantization. |
Examples
Imagine a text-based knowledge base where documents are segmented into chunks and transformed into 512-dimensional embedding vectors for database storage. The resulting docs table includes the following fields:
Field | Data Type | Description |
---|---|---|
id | serial | The ID. |
chunk | varchar(1024) | The chunk. |
intime | timestamp | The timestamp when the document was stored. |
url | varchar(1024) | The URL of the document to which the chunk belongs. |
embedding | vecf16(512) | The embedding vector of the chunk. |
- Create a vector table named docs to store vector data.
CREATE TABLE docs (
    id SERIAL PRIMARY KEY,
    chunk VARCHAR(1024),
    intime TIMESTAMP,
    url VARCHAR(1024),
    embedding VECF16(512)
) DISTRIBUTED BY (id);
- Set the storage mode for the vector column to PLAIN to reduce row scan cost and improve performance.
ALTER TABLE docs ALTER COLUMN embedding SET STORAGE PLAIN;
- Create a vector index on the vector column.
-- Create a vector index using squared Euclidean distance.
CREATE INDEX idx_docs_feature_l2 ON docs USING vectors(embedding vecf16_l2_ops)
WITH (options = $$
optimizing.optimizing_threads = 3
segment.max_growing_segment_size = 100000
segment.max_sealed_segment_size = 8000000
[indexing.hnsw]
m=30
ef_construction=500
$$);
- Create indexes on frequently used structured columns to accelerate hybrid queries.
CREATE INDEX ON docs(intime);
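When index statements are generated from application code, it can help to check option values against the documented ranges before sending the statement. The following is a minimal sketch under that assumption: `build_create_index` and the two range dictionaries are hypothetical helpers written for this example, not part of any Vector DPS client library, and the ranges are copied from the parameter tables in this document.

```python
# Value ranges copied from the parameter tables; names are illustrative only.
COMMON_RANGES = {
    "optimizing.optimizing_threads": (1, 65535),
    "optimizing.sealing_secs": (1, 60),
    "segment.max_growing_segment_size": (1, 4_000_000_000),
    "segment.max_sealed_segment_size": (1, 4_000_000_000),
}
HNSW_RANGES = {"m": (4, 128), "ef_construction": (10, 2000)}


def _check(options, ranges):
    """Raise if an option key is unknown or its value is out of range."""
    for key, value in options.items():
        if key not in ranges:
            raise KeyError(f"unknown option: {key}")
        low, high = ranges[key]
        if not low <= value <= high:
            raise ValueError(f"{key}={value} is outside [{low}, {high}]")


def build_create_index(index_name, table, column, ops, common=None, hnsw=None):
    """Return a CREATE INDEX statement as text (hypothetical helper).

    Identifiers are inserted as-is, so they must come from trusted input.
    """
    common = common or {}
    hnsw = hnsw or {}
    _check(common, COMMON_RANGES)
    _check(hnsw, HNSW_RANGES)
    lines = [f"{key} = {value}" for key, value in common.items()]
    if hnsw:
        lines.append("[indexing.hnsw]")
        lines.extend(f"{key} = {value}" for key, value in hnsw.items())
    body = "\n".join(lines)
    return (
        f"CREATE INDEX {index_name} ON {table} USING vectors ({column} {ops})\n"
        f"WITH (options = $$\n{body}\n$$);"
    )


sql = build_create_index(
    "idx_docs_feature_l2", "docs", "embedding", "vecf16_l2_ops",
    common={"optimizing.optimizing_threads": 3},
    hnsw={"m": 30, "ef_construction": 500},
)
print(sql)
```

Passing a value such as m=2 raises a ValueError before any statement is built, which surfaces range mistakes earlier than a failed index build would.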
Related reference
Supported operator types
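Name | Description | Formula |
---|---|---|
<-> | Squared Euclidean distance. | $\sum_{i=1}^{n} (x_i - y_i)^2$ |
<#> | Negative dot product. | $-\sum_{i=1}^{n} x_i y_i$ |
<=> | Cosine distance. | $1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$ |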
Examples of using vector operators:
-- Squared Euclidean distance
SELECT '[1, 2, 3]'::vector <-> '[3, 2, 1]'::vector;
-- Negative dot product
SELECT '[1, 2, 3]'::vector <#> '[3, 2, 1]'::vector;
-- Cosine distance
SELECT '[1, 2, 3]'::vector <=> '[3, 2, 1]'::vector;
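To make the three measures concrete, the sketch below computes them in plain Python for the same pair of vectors. It assumes the operators follow the textbook definitions named in the table above; the printed values come from that arithmetic, not from running Vector DPS.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def negative_dot(a, b):
    """Negative dot product: a larger dot product gives a smaller distance."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]
print(squared_l2(a, b))       # 8.0
print(negative_dot(a, b))     # -10.0
print(cosine_distance(a, b))  # approximately 0.2857
```

All three measures are arranged so that a smaller value means a closer match, which is why the dot product is negated.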