Relyt Best Practices
Relyt AI-ready Data Cloud is a fully managed cloud data service built for rapid queries and advanced analysis of large-scale datasets. Featuring optimized data structures and a highly efficient query engine, it can handle even the most demanding big data workloads. This topic outlines key best practices to help you get the most out of Relyt, ensuring high efficiency and stability.
Table design and optimization
Clustering keys
When creating a table, you can specify one or more columns as clustering keys. When data is first loaded into an empty table, the rows are stored on disk in the order defined by these keys. This clustering information is then provided to the query planner, enabling it to generate a query plan that takes full advantage of the sorted data.
Below is a simple example:
CREATE TABLE orders (...)
CLUSTER BY (o_orderdate, o_orderkey)
For syntax details, see CREATE TABLE.
Sorting data significantly improves query performance for range predicates. In Relyt, columnar data is stored in files within object storage, with each file's minimum and maximum values recorded as metadata. When a query includes range predicates, the query processor utilizes this metadata to skip over files that don't match the specified range, reducing the number of files scanned. For example, if a table contains five years of date-sorted data and the query targets a one-month range, sorting can eliminate up to 98% of the files from being scanned. Without sorting, the query might need to scan most or all of the files, leading to inefficiencies.
Best practices for selecting clustering keys:
- If your most frequently queried data is recent, consider setting timestamp columns (day, month, year) as the first column in the clustering keys. This lets the system efficiently skip data blocks outside the specified time range (see the example query after this list).
- For columns frequently used in range or equality filters, designate them as clustering keys. Relyt tracks the minimum and maximum values of each data block, allowing it to quickly skip files that don't fall within the range, reducing unnecessary reads.
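For example, with the orders table above clustered by (o_orderdate, o_orderkey), a one-month range query like the following lets the query processor skip every file whose min/max o_orderdate metadata falls outside January 2024. Column names other than the clustering keys are illustrative:
SELECT o_orderkey, o_totalprice
FROM orders
WHERE o_orderdate >= DATE '2024-01-01'
  AND o_orderdate < DATE '2024-02-01';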
Distribution keys
Distribution keys determine how data is distributed, and an improper choice can lead to data skew. Therefore, the key principle is to select a distribution key that ensures an even spread of data.
Below is a simple example:
CREATE TABLE orders (...)
DISTRIBUTED BY (o_orderkey)
To check whether the data is evenly distributed, compare the row counts of the largest and smallest groups on the distribution key:
SELECT o_orderkey, count(*) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
SELECT o_orderkey, count(*) FROM orders GROUP BY 1 ORDER BY 2 LIMIT 10;
If the count(*) values of the top 10 and bottom 10 groups do not show significant variation, the chosen distribution key is reasonable.
If no suitable distribution key can be identified in the table, use RANDOM distribution to achieve complete uniformity, effectively preventing data skew.
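As a minimal sketch, assuming Relyt follows the same Greenplum-style DISTRIBUTED syntax as the example above (confirm against the CREATE TABLE reference):
CREATE TABLE events (...)
DISTRIBUTED RANDOMLY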
Compression strategies
Choose an appropriate compression encoding scheme for each column, with options including LZ4, ZSTD, or no compression. Proper compression can significantly reduce storage space, and different compression levels can be specified for optimization.
Compression impacts query performance in two key ways:
- It can improve query performance by reducing the total amount of data read (I/O).
- However, decompression consumes CPU resources, which can negatively affect query performance.
Therefore, when selecting a compression scheme, it is important to evaluate both factors through testing and make a balanced trade-off to determine the optimal compression method and level.
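The sketch below assumes a Greenplum-style per-column ENCODING clause; the exact compression DDL is product-specific, so check the CREATE TABLE reference before relying on it. It pairs a fast codec with a frequently read key column and a higher-ratio codec with bulky text:
CREATE TABLE lineitem (
    l_orderkey bigint ENCODING (compresstype=lz4),                  -- fast decompression for a hot column
    l_comment  text   ENCODING (compresstype=zstd, compresslevel=5) -- higher ratio for bulky text
);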
Encoding strategies
Choosing different encoding methods for columns can impact query performance, so test and compare candidates to identify the optimal encoding strategy. By default, no encoding is applied, but options such as automatic encoding or dictionary encoding are available for better efficiency.
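A practical way to compare encodings is to load the same data into two table variants and time a representative query against each. The table names below (orders_plain, orders_dict) are hypothetical, and EXPLAIN ANALYZE is assumed to be available as in PostgreSQL-compatible systems:
-- orders_plain: default (no encoding); orders_dict: dictionary-encoded low-cardinality columns
EXPLAIN ANALYZE SELECT o_orderstatus, count(*) FROM orders_plain GROUP BY 1;
EXPLAIN ANALYZE SELECT o_orderstatus, count(*) FROM orders_dict GROUP BY 1;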
Query optimization
- Query optimization is a continuous process, and designing tables based on best practices can greatly improve both physical storage and query efficiency.
- Avoid using SELECT * and instead query only the necessary columns to reduce data I/O.
- For complex aggregations, use CASE expressions instead of selecting from the same table multiple times (see the sketch after this list).
- When a table in a query is only used for predicate conditions and the subquery returns a small number of rows (e.g., a few hundred), consider using a subquery.
- Apply predicates to limit the dataset size as much as possible.
- Choose the least expensive operators in predicates. The equality (=) operator is more efficient than comparison operators, and comparison operators are preferable to the LIKE operator.
- Avoid using functions in query predicates, as they may require scanning many rows to perform intermediate steps, increasing the query cost.
- When performing multi-table JOINs, add predicates to filter rows from the tables involved in the JOIN. For example, suppose you want to query ticket sales listed after December 1st by joining t1 (listings) and t2 (sales) and grouping by seller, where both tables are sorted by date. The following query joins the two tables on their common key and filters rows where t1.listtime is later than December 1st:

  SELECT t1.sellerid, sum(t2.qtysold)
  FROM t1, t2
  WHERE t2.salesid = t1.listid
  AND t1.listtime > '2008-12-01'
  GROUP BY 1 ORDER BY 1;

  Since the WHERE clause lacks a condition for t2.saletime, the execution engine needs to scan the entire t2 table. Including filtering conditions that reduce the number of rows participating in the join can significantly improve performance:

  SELECT t1.sellerid, sum(t2.qtysold)
  FROM t1, t2
  WHERE t2.salesid = t1.listid
  AND t1.listtime > '2008-12-01'
  AND t2.saletime > '2008-12-01'
  GROUP BY 1 ORDER BY 1;
- Whenever possible, include the clustering key in the GROUP BY clause to enable the query planner to use more efficient aggregation methods.
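To illustrate the CASE point above: conditional aggregation computes several filtered sums in one scan instead of one scan per condition. The table and columns (sales_fact, region, amount) are hypothetical:
-- one scan with conditional aggregation...
SELECT
    sum(CASE WHEN region = 'US' THEN amount ELSE 0 END) AS us_sales,
    sum(CASE WHEN region = 'EU' THEN amount ELSE 0 END) AS eu_sales
FROM sales_fact;
-- ...instead of two separate scans of the same table:
-- SELECT sum(amount) FROM sales_fact WHERE region = 'US';
-- SELECT sum(amount) FROM sales_fact WHERE region = 'EU';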
Cost optimization
- Delete unnecessary data: Use the TTL feature to periodically remove historical data that is no longer needed. This reduces storage costs and improves query performance by minimizing data volume (a plain-SQL sketch of the idea follows this list).
- Use cache to improve performance: Configure the cache TTL to keep hot data cached, reducing cache misses and the resulting object storage scans, and thereby improving performance.
- Optimize queries: Regularly review and optimize queries to eliminate unnecessary computations and data scans; this is a key way to reduce operational costs and improve efficiency.
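Relyt's TTL feature expires data declaratively, and its exact configuration is product-specific (see the TTL documentation). As a plain-SQL illustration of the same idea, a scheduled job could prune a hypothetical events table; the table name, column, and 90-day retention window are all assumptions:
-- delete rows older than the assumed 90-day retention window
DELETE FROM events
WHERE event_date < current_date - INTERVAL '90 days';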