Skip to main content

Aggregate Functions

Aggregate functions perform a calculation on a set of values and return a single value. Extreme DPS provides many aggregate functions to help you analyze data.


ARRAY_AGG

Combines a set of values (including null values) into an array and returns the array.

Syntax

ARRAY_AGG ( [ DISTINCT ] <expression> [ ORDER BY <sort_expression> ])

Arguments

<expression>: the expression or column name used to determine the values to be included in the array. Supported data types include boolean, smallint, integer, bigint, real, decimal, varchar, date, timestamp, timestamp with time zone.

Returns

An array.

Examples

SELECT ARRAY_AGG(order_amount)
FROM orders;

BOOL_AND or BOOL_EVERY

Determines whether all non-null input values are true. If yes, the function returns true. Otherwise, the function returns false.

info

BOOL_EVERY is an alias for BOOL_AND, so their behaviors are identical.

Syntax

BOOL_AND(<expression>)

BOOL_EVERY(<expression>)

Arguments

<expression>: the name of a column of type boolean or an expression that results in a boolean value.

Returns

A boolean value.

Examples

  1. Create a table named employee_attendance, where the field present records attendance, with true for present and false for absent.

    CREATE TABLE employee_attendance (
    employee_id integer,
    present boolean
    );
  2. Insert attendance data.

    INSERT INTO employee_attendance (employee_id, present) VALUES
    (1, TRUE),
    (1, TRUE),
    (2, FALSE),
    (2, TRUE),
    (3, TRUE),
    (3, FALSE),
    (3, TRUE);
  3. Query for employees who were present all days.

    SELECT employee_id, 
    bool_and(present) AS all_days_present
    FROM employee_attendance
    GROUP BY employee_id;

Only the employee whose ID is 1 was present every day.


BOOL_OR

Determines if there is at least one true value among non-null values. If yes, the function returns true. Otherwise, the function returns false.

Syntax

BOOL_OR(<expression>)

Arguments

<expression>: the name of a column of type boolean or an expression that results in a boolean value.

Returns

A boolean value.

Examples

  1. Create a table named customer_feedback to record customer feedback, where the field positive_feedback is true for positive feedback and false for negative feedback.

    CREATE TABLE customer_feedback (
    customer_id INT,
    positive_feedback BOOLEAN
    );
  2. Insert customer feedback data.

    INSERT INTO customer_feedback (customer_id, positive_feedback) VALUES
    (1, TRUE),
    (1, FALSE),
    (2, FALSE),
    (2, FALSE),
    (3, TRUE),
    (3, TRUE),
    (3, FALSE);
  3. Query for customers who have given positive feedback at least once.

    SELECT customer_id,
    bool_or(positive_feedback) AS has_positive_feedback
    FROM customer_feedback
    GROUP BY customer_id;

Customers 1 and 3 have both given positive feedback.


BOOL_AND or BOOL_EVERY

Returns true if all non-null input values are true. Otherwise, it returns false.

info

BOOL_EVERY is an alias for BOOL_AND, and their behaviors are identical.

Syntax

BOOL_AND(<expression>)

BOOL_EVERY(<expression>)

Parameters

<expression>: the name of the column with the boolean type or an expression that produces a boolean value.

Returns

A boolean value.

Examples

  1. Create a table named employee_attendance, where the field present specifies whether the employee is present at the specific date.

    CREATE TABLE employee_attendance (
    employee_id integer,
    present boolean
    );
  2. Insert attendance data.

    INSERT INTO employee_attendance (employee_id, present) VALUES
    (1, TRUE),
    (1, TRUE),
    (2, FALSE),
    (2, TRUE),
    (3, TRUE),
    (3, FALSE),
    (3, TRUE);
  3. Query for employees who have been present every day.

    SELECT employee_id, 
    bool_and(present) AS all_days_present
    FROM employee_attendance
    GROUP BY employee_id;

    In this example, only the employee whose ID is 1 has been present every day.

BOOL_OR

Returns true if any non-null values is true. Otherwise, it returns false.

Syntax

BOOL_OR(<expression>)

Parameters

<expression>: the name of the column with the boolean type or an expression that produces a boolean value.

Examples

  1. Create a table named customer_feedback to record customer feedback, where the field positive_feedback specifies whether

    CREATE TABLE customer_feedback (
    customer_id INT,
    positive_feedback BOOLEAN
    );
  2. Insert customer feedback data.

    INSERT INTO customer_feedback (customer_id, positive_feedback) VALUES
    (1, TRUE),
    (1, FALSE),
    (2, FALSE),
    (2, FALSE),
    (3, TRUE),
    (3, TRUE),
    (3, FALSE);
  3. Query for customers who have given positive feedback at least once.

    SELECT customer_id,
    bool_or(positive_feedback) AS has_positive_feedback
    FROM customer_feedback
    GROUP BY customer_id;

    Customers 1 and 3 have both given positive feedback.

SUM

Returns the sum of non-null input values.

info

SUM automatically ignores null values. If all records are null values, null will be returned.

Syntax

SUM( [ DISTINCT ] <column_name> | <expression> )

Arguments

  • <column_name>

    The name of the column of which the non-null values will be summed up.

  • <expression>

    The expression that evaluates to an input type supported by SUM.

    Supported input types include real, double precision, smallint, bigint, integer, and decimal.

Returns

The return type varies with the input type:

  • If the input type is real or double precision, the return type is the same as the input type.

  • If the input type is smallint, the return type is bigint.

  • If the input type is integer, the return type is bigint.

  • If the input type is bigint, the return type is decimal(19,0).

  • If the input type is decimal, the return type is decimal featuring a precision of 38 and maintaining the same scale as the input decimals.

Examples

SELECT SUM(price) FROM orders;

SELECT SUM(price * (1 + tax_rate)) FROM orders;

MIN / MAX

Returns the minimum or maximum value for the specified records.

info

MIN or MAX automatically ignores null values. If all records are null values, null will be returned.

Syntax

MAX(<column_name> | <expression>)
MIN(<cloumn_name> | <expression>)

Arguments

  • <column_name>

    The name of the column from which the maximum value or minimum value will be be extracted.

  • <expression>

    The expression that evaluates to a numeric data type.

Returns

The return type is the same as the input type.

Examples

SELECT MIN(price) FROM orders;
SELECT MAX(price) FROM orders;
SELECT MIN(price), MAX(price) FROM orders;

COUNT

Returns the number of non-null values or the number of all values.

Syntax

COUNT ( [ DISTINCT ]  <column_name> | <expression> )

COUNT( * )

COUNT( "any" )

Arguments

  • <column_name>

    The name of the column from which the non-null values will be counted.

  • <expression>

    The expression that evaluates to a numeric data type.

  • *

    Sepcifies to return the number of input rows.

  • any

    Specifies to return the number of input rows in which the input value is not null.

Examples

SELECT COUNT(order_id) FROM orders;
SELECT COUNT(*) FROM orders;
SELECT COUNT("any") FROM orders;

AVG

Returns the average of non-null values.

info

AVG automatically ignores null values. If all values are null, null will be returned.

Syntax

AVG( [ DISTINCT ] <column_name> | <expression> )

Arguments

  • <column_name>

    The name of the column in which the values you want to compute the average.

  • <expression>

    The expression that evaluates to an input type supported by AVG.

    Supported input types include double precision, smallint, integer, and bigint.

Returns

The return type varies with the input type:

  • If the input type is double precision, the return type is double precision.

  • If the input type is smallint, the return type is decimal(25,6).

  • If the input type is integer, the return type is decimal(25,6).

  • If the input type is bigint, the return type is decimal(25,6).

Examples

SELECT AVG(price) FROM orders;

SELECT AVG(price * quantity) FROM orders;

PERCENTILE_CONT

Returns a percentile value based on the continuous distribution of values in the input column specified in <sort_expression>. If no exact value is found at the desired percentile, the result is calculated through linear interpolation between the two closest input values. Null values are ignored in the calculation.

info

PERCENTILE_CONT operates as follows:

  1. Sorts the values in the dataset in ascending order based on sort_expression.
  2. Calculates a position based on the desired percentage. If a value exists at that position, it returns that value. If not, proceed to step 3.
  3. Calculates and returns a value using linear interpolation between the actual neighboring data values.

Syntax

PERCENTILE_CONT(<fraction>) WITHIN GROUP (ORDER BY <sort_expression>)

Arguments

  • <fraction>

    The percentile, which is double precision value between 0 and 1.

  • <sort_expression>

    The sort expression that produces a double precision, timestamp, or timestamp with time zone expression.

Returns

The return type is the same as the data type of <sort_expression>.

Examples

  1. Create a table named Scores that is used to record students' scores:

    CREATE TABLE Scores (
    id INT,
    score DECIMAL (10, 1)
    );

    INSERT INTO Scores (id, score) VALUES
    (1, 88.5),
    (2, 75.0),
    (3, 92.0);
  2. Calculate the median of the students' scores:

    SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY score) 
    OVER ()
    AS MedianScore
    FROM Scores;

The result is 88.5.


PERCENTILE_DISC

Sorts the dataset based on the specified expression and returns the first value in the sorted dataset that is at or beyond the specified percentile.

PERCENTILE_DISC is an inverse distribution function that assumes a discrete distribution model. This function takes a percentile value and a sort expression specified by the ORDER BY clause and returns the value with the smallest cumulative distribution that is greater than or equal to the specified percentile.

note

Unlike PERCENTILE_CONT, which can return non-existent values, PERCENTILE_DISC will return the first value greater than the specified percentile if the exact percentile value does not exist.

Syntax

PERCENTILE_DISC(<fraction>) WITHIN GROUP (ORDER BY <sort_expression>)

Parameters

  • <fraction>

    The percentile, which is double precision value between 0 and 1.

  • <sort_expression>

    The sort expression that produces a boolean, smallint, integer, bigint, real, double precision, decimal, varchar, date, timestamp, or timestamp with time zone value.

Returns

The return type is the same as the data type of <sort_expression>.

Examples

  1. Create a table named employees recording employees' names and salaries:

    CREATE TABLE employees (
    name varchar(100),
    salary decimal(10, 2)
    );
    INSERT INTO employees (name, salary) VALUES
    ('Alice', 70000.00),
    ('Bob', 48000.50),
    ('Charlie', 54000.00),
    ('David', 65000.00),
    ('Eve', 50000.00);
  2. Use the PERCENTILE_DISC function to find the median salary of employees:

    SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) 
    AS median_salary
    FROM employees;

    The result is 54000.00.


RETENTION

Accepts 2 to 32 input arguments as conditions to indicate whether an event meets specific criteria:

  • If condition 1 is true, and condition N is true, in the returned array, the first element is 1 (true) and the Nth element is 1 (true).

  • If condition 1 is false, regardless of whether condition N is true or false, the Nth element in the returned array is 0 (false), returning an all 0 array.

Syntax

RETENTION(<cond1>, <cond2>[,..., <cond32>])

Arguments

<condN>: an expression that returns 0 or 1.

Returns

An array composed of 0s and 1s:

  • 0: The event does not meet the condition.

  • 1: The event meets the condition.

Examples

--Create a test table and insert data.
CREATE TABLE retention_test(on_date date, uid int);

INSERT INTO retention_test(uid, on_date) VALUES
(1, '2024-01-01'),
(1, '2024-01-02'),
(1, '2024-01-03'),
(2, '2024-01-02'),
(2, '2024-01-03');


--Check the retention.
SELECT uid, RETENTION(on_date = '2024-01-01', on_date ='2024-01-02', on_date = '2024-01-03')
FROM retention_test GROUP BY uid;


--The returned result is as follows:
uid | retention
-----+-----------
1 | {1,1,1}
2 | {0,0,0}

WINDOW_FUNNEL

Searches within a sliding time window for a list of events, and calculates the maximum number of consecutive events that occurred from the chain.

Working principles

The function operates as follows:

  • It searchs for data that triggers the initial condition in the chain and then sets the event counter to 1. This marks the start of the sliding window.

  • If events within the chain occur consecutively within the window, the counter increases. However, if the sequence of events is interrupted, the counter remains unchanged.

  • In cases where the data contains multiple event chains at different stages of completion, the function only returns the length of the longest chain.

Syntax

WINDOW_FUNNEL(<window>, <mode>, <timestamp>, <cond1>[, <cond2>, ...])

Arguments

  • <window>

    The length of the sliding window, which specifies the maximum interval between the first and the last event in the chain, in units determined by <timestamp>, with a type of bigint.

  • <mode>

    The filter mode of the event chain, with a type of integer. It can be a combination of the following flags:

    • 'strict_deduplication': Strict deduplication, if an event that has been accepted before is encountered, stop traversing, and slide the window forward.

    • 'strict_order': Strict order, if a timestamp that does not contain any events or a non-sequential event is encountered, stop traversing, and slide the window forward.

    • 'strict_increase': Timestamps strictly increase, not allowing the timestamps of adjacent events in the chain to be equal.

  • <timestamp>

    The timestamp column, with types of date, bigint, integer, smallint, or timestamp.

  • <condN>

    The conditions of the event chain, with up to 32 conditions specifiable at the same time.

Returns

A bigint value. The value is the maximum number of consecutive events within the sliding window that meets the conditions.

Examples

  1. Create a test table and insert data:

    DROP TABLE IF EXISTS window_funnel_test_table;
    CREATE TABLE window_funnel_test_table (
    timestamp timestamp,
    event bigint
    )USING pg_zdb;

    INSERT INTO window_funnel_test_table (timestamp, event)
    VALUES
    ('2022-02-01 00:00:01.000000', 1),
    ('2022-02-01 00:00:01.000100', 2),
    ('2022-02-01 00:00:01.000200', 3),
    ('2022-02-01 00:00:01.000300', 1),
    ('2022-02-01 00:00:01.000400', 2),
    ('2022-02-01 00:00:01.000500', 1),
    ('2022-02-01 00:00:01.000600', 2),
    ('2022-02-01 00:00:01.000700', 3),
    ('2022-02-01 00:00:01.000800', 4);

    --Check the table content:
    SELECT * FROM window_funnel_test_table ORDER BY timestamp;

    --The result returned is as follows:

    timestamp | event
    --------------------------+-------
    2022-02-01 00:00:01 | 1
    2022-02-01 00:00:01.0001 | 2
    2022-02-01 00:00:01.0002 | 3
    2022-02-01 00:00:01.0003 | 1
    2022-02-01 00:00:01.0004 | 2
    2022-02-01 00:00:01.0005 | 1
    2022-02-01 00:00:01.0006 | 2
    2022-02-01 00:00:01.0007 | 3
    2022-02-01 00:00:01.0008 | 4
  2. Search for events in strict deduplication mode:

    SELECT WINDOW_FUNNEL(1000, 1, time, event=1, event=2, event=3, event=4) 
    FROM window_funnel_test_table;

    --The query result is as follows:
    window_funnel
    ---------------
    4

APPROX_DISTINCT

Returns the approximate number of unique values.

This function provides an approximation to COUNT(DISTINCT x). If all input values are null, it returns zero.

Explanation

This function is expected to produce a standard error of 2.3% which is the standard deviation of the error distribution across all possible sets (approximately normal). It does not guarantee an upper bound on the error for any specific input set.

Syntax

APPROX_DISTINCT (<expression>)

Arguments

<expression>: the expression for which to count unique values. Supported data types include boolean, smallint, integer, bigint, real, double precision, decimal, varchar, date, timestamp, timestamp with time zone.

Returns

A bigint value.

Examples

SELECT APPROX_DISTINCT(product_id)
FROM sales_data_table;

STDDEV or STDDEV_SAMP

Returns the sample standard deviation of a set of values. If all values are null, null will be returned.

info

STDDEV_SAMP is an alias for STDDEV, so their behaviors are identical.

Syntax

{ STDDEV | STDDEV_SAMP } ( [ DISTINCT ] <expression> )

Arguments

<expression>: the expression that produces a numeric value. Supported numeric types include smallint, integer, bigint, real, and double precision.

Returns

The return type varies based on the input type:

  • When the input type is double precision or real, the return type is double precision.

  • When the input type is smallint, integer, or bigint, the return type is decimal(38,6).

Examples

SELECT STDDEV(salary) FROM employee_salaries;
SELECT STDDEV_SAMP(salary) FROM employee_salaries;

STDDEV_POP

Returns the population standard deviation of a set of values. If all values are null, null will be returned.

Syntax

STDDEV_POP( [ DISTINCT ] <expression> )

Arguments

<expression>: the expression that produces a numeric value. Supported numeric types include smallint, integer, bigint, real, and double precision.

Returns

The return type varies based on the input type:

  • When the input type is double precision or real, the return type is double precision.

  • When the input type is smallint, integer, or bigint, the return type is decimal(38,6).

Examples

SELECT STDDEV_POP(salary) FROM employee_salaries;

VARIANCE or VAR_SAMP

Returns the sample variance of a set of values. If all values are null, null will be returned.

info

VAR_SAMP is an alias for VARIANCE, and both functions behave identically.

Syntax

{ VARIANCE | VAR_SAMP }( [ DISTINCT ] <expression> )

Arguments

<expression>: the expression that produces a numeric value. Supported numeric types include smallint, integer, bigint, real, and double precision.

Returns

The return type varies based on the input type:

  • When the input type is double precision or real, the return type is double precision.

  • When the input type is smallint, integer, or bigint, the return type is decimal(38,6).

Examples

SELECT VARIANCE(salary) FROM employee_salaries;
SELECT VAR_SAMP(salary) FROM employee_salaries;

VAR_POP

Returns the population variance of a set of values. If all values are null, null will be returned.

Syntax

VAR_POP( [ DISTINCT ] <expression> )

Arguments

<expression>: the expression that produces a numeric value. Supported numeric types include smallint, integer, bigint, real, and double precision.

Returns

The return type varies based on the input type:

  • When the input type is double precision or real, the return type is double precision.

  • When the input type is smallint, integer, or bigint, the return type is decimal(38,6).

Examples

SELECT VAR_POP(salary) FROM employee_salaries;

STRING_AGG

Concatenates multiple string values into one string and returns the result.

Syntax

STRING_AGG ( [ DISTINCT ] <expression> , <delimiter> ) [
ORDER BY <sort_expression>]

Arguments

  • <expression>

    The varchar column of which values are to be concatenated.

  • <delimiter>

    The varchar string used as the delimeter.

  • <sort_expression>

    The non-constant sorting expression.

Returns

Same as the input type.

Examples

SELECT category, STRING_AGG(name, ', ' ORDER BY name)
FROM products
GROUP BY category;