Aggregate Functions
Aggregate functions perform a calculation on a set of values and return a single value. Extreme DPS provides many aggregate functions to help you analyze data.
ARRAY_AGG
Combines a set of values (including null values) into an array and returns the array.
Syntax
ARRAY_AGG ( [ DISTINCT ] <expression> [ ORDER BY <sort_expression> ])
Arguments
<expression>
: the expression or column name used to determine the values to be included in the array. Supported data types include boolean
, smallint
, integer
, bigint
, real
, decimal
, varchar
, date
, timestamp
, timestamp with time zone
.
Returns
An array.
Examples
SELECT ARRAY_AGG(order_amount)
FROM orders;
BOOL_AND or BOOL_EVERY
Determines whether all non-null input values are true. If yes, the function returns true. Otherwise, the function returns false.
BOOL_EVERY
is an alias for BOOL_AND
, so their behaviors are identical.
Syntax
BOOL_AND(<expression>)
BOOL_EVERY(<expression>)
Arguments
<expression>
: the name of a column of type boolean
or an expression that results in a boolean
value.
Returns
A boolean
value.
Examples
-
Create a table named
employee_attendance
, where the fieldpresent
records attendance, with true for present and false for absent.CREATE TABLE employee_attendance (
employee_id integer,
present boolean
); -
Insert attendance data.
INSERT INTO employee_attendance (employee_id, present) VALUES
(1, TRUE),
(1, TRUE),
(2, FALSE),
(2, TRUE),
(3, TRUE),
(3, FALSE),
(3, TRUE); -
Query for employees who were present all days.
SELECT employee_id,
bool_and(present) AS all_days_present
FROM employee_attendance
GROUP BY employee_id;
Only the employee whose ID is 1
was present every day.
BOOL_OR
Determines if there is at least one true value among non-null values. If yes, the function returns true. Otherwise, the function returns false.
Syntax
BOOL_OR(<expression>)
Arguments
<expression>
: the name of a column of type boolean
or an expression that results in a boolean
value.
Returns
A boolean
value.
Examples
-
Create a table named
customer_feedback
to record customer feedback, where the fieldpositive_feedback
is true for positive feedback and false for negative feedback.CREATE TABLE customer_feedback (
customer_id INT,
positive_feedback BOOLEAN
); -
Insert customer feedback data.
INSERT INTO customer_feedback (customer_id, positive_feedback) VALUES
(1, TRUE),
(1, FALSE),
(2, FALSE),
(2, FALSE),
(3, TRUE),
(3, TRUE),
(3, FALSE); -
Query for customers who have given positive feedback at least once.
SELECT customer_id,
bool_or(positive_feedback) AS has_positive_feedback
FROM customer_feedback
GROUP BY customer_id;
Customers 1 and 3 have both given positive feedback.
BOOL_AND or BOOL_EVERY
Returns true if all non-null input values are true. Otherwise, it returns false.
BOOL_EVERY
is an alias for BOOL_AND
, and their behaviors are identical.
Syntax
BOOL_AND(<expression>)
BOOL_EVERY(<expression>)
Parameters
<expression>
: the name of the column with the boolean
type or an expression that produces a boolean
value.
Returns
A boolean
value.
Examples
-
Create a table named
employee_attendance
, where the fieldpresent
specifies whether the employee is present at the specific date.CREATE TABLE employee_attendance (
employee_id integer,
present boolean
); -
Insert attendance data.
INSERT INTO employee_attendance (employee_id, present) VALUES
(1, TRUE),
(1, TRUE),
(2, FALSE),
(2, TRUE),
(3, TRUE),
(3, FALSE),
(3, TRUE); -
Query for employees who have been present every day.
SELECT employee_id,
bool_and(present) AS all_days_present
FROM employee_attendance
GROUP BY employee_id;In this example, only the employee whose ID is
1
has been present every day.
BOOL_OR
Returns true if any non-null values is true. Otherwise, it returns false.
Syntax
BOOL_OR(<expression>)
Parameters
<expression>
: the name of the column with the boolean
type or an expression that produces a boolean
value.
Examples
-
Create a table named
customer_feedback
to record customer feedback, where the fieldpositive_feedback
specifies whetherCREATE TABLE customer_feedback (
customer_id INT,
positive_feedback BOOLEAN
); -
Insert customer feedback data.
INSERT INTO customer_feedback (customer_id, positive_feedback) VALUES
(1, TRUE),
(1, FALSE),
(2, FALSE),
(2, FALSE),
(3, TRUE),
(3, TRUE),
(3, FALSE); -
Query for customers who have given positive feedback at least once.
SELECT customer_id,
bool_or(positive_feedback) AS has_positive_feedback
FROM customer_feedback
GROUP BY customer_id;Customers 1 and 3 have both given positive feedback.
SUM
Returns the sum of non-null input values.
SUM
automatically ignores null values. If all records are null values, null will be returned.
Syntax
SUM( [ DISTINCT ] <column_name> | <expression> )
Arguments
-
<column_name>
The name of the column of which the non-null values will be summed up.
-
<expression>
The expression that evaluates to an input type supported by
SUM
.Supported input types include
real
,double precision
,smallint
,bigint
,integer
, anddecimal
.
Returns
The return type varies with the input type:
-
If the input type is
real
ordouble precision
, the return type is the same as the input type. -
If the input type is
smallint
, the return type isbigint
. -
If the input type is
integer
, the return type isbigint
. -
If the input type is
bigint
, the return type isdecimal(19,0)
. -
If the input type is
decimal
, the return type isdecimal
featuring a precision of38
and maintaining the same scale as the input decimals.
Examples
SELECT SUM(price) FROM orders;
SELECT SUM(price * (1 + tax_rate)) FROM orders;
MIN / MAX
Returns the minimum or maximum value for the specified records.
MIN
or MAX
automatically ignores null values. If all records are null values, null will be returned.
Syntax
MAX(<column_name> | <expression>)
MIN(<cloumn_name> | <expression>)
Arguments
-
<column_name>
The name of the column from which the maximum value or minimum value will be be extracted.
-
<expression>
The expression that evaluates to a numeric data type.
Returns
The return type is the same as the input type.
Examples
SELECT MIN(price) FROM orders;
SELECT MAX(price) FROM orders;
SELECT MIN(price), MAX(price) FROM orders;
COUNT
Returns the number of non-null values or the number of all values.
Syntax
COUNT ( [ DISTINCT ] <column_name> | <expression> )
COUNT( * )
COUNT( "any" )
Arguments
-
<column_name>
The name of the column from which the non-null values will be counted.
-
<expression>
The expression that evaluates to a numeric data type.
-
*
Sepcifies to return the number of input rows.
-
any
Specifies to return the number of input rows in which the input value is not null.
Examples
SELECT COUNT(order_id) FROM orders;
SELECT COUNT(*) FROM orders;
SELECT COUNT("any") FROM orders;
AVG
Returns the average of non-null values.
AVG
automatically ignores null values. If all values are null, null will be returned.
Syntax
AVG( [ DISTINCT ] <column_name> | <expression> )
Arguments
-
<column_name>
The name of the column in which the values you want to compute the average.
-
<expression>
The expression that evaluates to an input type supported by
AVG
.Supported input types include
double precision
,smallint
,integer
, andbigint
.
Returns
The return type varies with the input type:
-
If the input type is
double precision
, the return type isdouble precision
. -
If the input type is
smallint
, the return type isdecimal(25,6)
. -
If the input type is
integer
, the return type isdecimal(25,6)
. -
If the input type is
bigint
, the return type isdecimal(25,6)
.
Examples
SELECT AVG(price) FROM orders;
SELECT AVG(price * quantity) FROM orders;
PERCENTILE_CONT
Returns a percentile value based on the continuous distribution of values in the input column specified in <sort_expression>
. If no exact value is found at the desired percentile, the result is calculated through linear interpolation between the two closest input values. Null values are ignored in the calculation.
PERCENTILE_CONT
operates as follows:
- Sorts the values in the dataset in ascending order based on
sort_expression
. - Calculates a position based on the desired percentage. If a value exists at that position, it returns that value. If not, proceed to step 3.
- Calculates and returns a value using linear interpolation between the actual neighboring data values.
Syntax
PERCENTILE_CONT(<fraction>) WITHIN GROUP (ORDER BY <sort_expression>)
Arguments
-
<fraction>
The percentile, which is
double precision
value between 0 and 1. -
<sort_expression>
The sort expression that produces a
double precision
,timestamp
, ortimestamp with time zone
expression.
Returns
The return type is the same as the data type of <sort_expression>
.
Examples
-
Create a table named
Scores
that is used to record students' scores:CREATE TABLE Scores (
id INT,
score DECIMAL (10, 1)
);
INSERT INTO Scores (id, score) VALUES
(1, 88.5),
(2, 75.0),
(3, 92.0); -
Calculate the median of the students' scores:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY score)
OVER ()
AS MedianScore
FROM Scores;
The result is 88.5
.
PERCENTILE_DISC
Sorts the dataset based on the specified expression and returns the first value in the sorted dataset that is at or beyond the specified percentile.
PERCENTILE_DISC
is an inverse distribution function that assumes a discrete distribution model. This function takes a percentile value and a sort expression specified by the ORDER BY
clause and returns the value with the smallest cumulative distribution that is greater than or equal to the specified percentile.
Unlike PERCENTILE_CONT, which can return non-existent values, PERCENTILE_DISC
will return the first value greater than the specified percentile if the exact percentile value does not exist.
Syntax
PERCENTILE_DISC(<fraction>) WITHIN GROUP (ORDER BY <sort_expression>)
Parameters
-
<fraction>
The percentile, which is
double precision
value between 0 and 1. -
<sort_expression>
The sort expression that produces a
boolean
,smallint
,integer
,bigint
,real
,double precision
,decimal
,varchar
,date
,timestamp
, ortimestamp with time zone
value.
Returns
The return type is the same as the data type of <sort_expression>
.
Examples
-
Create a table named
employees
recording employees' names and salaries:CREATE TABLE employees (
name varchar(100),
salary decimal(10, 2)
);
INSERT INTO employees (name, salary) VALUES
('Alice', 70000.00),
('Bob', 48000.50),
('Charlie', 54000.00),
('David', 65000.00),
('Eve', 50000.00); -
Use the
PERCENTILE_DISC
function to find the median salary of employees:SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary)
AS median_salary
FROM employees;The result is
54000.00
.
RETENTION
Accepts 2 to 32 input arguments as conditions to indicate whether an event meets specific criteria:
-
If condition 1 is true, and condition N is true, in the returned array, the first element is
1
(true) and the Nth element is1
(true). -
If condition 1 is false, regardless of whether condition N is true or false, the Nth element in the returned array is
0
(false), returning an all0
array.
Syntax
RETENTION(<cond1>, <cond2>[,..., <cond32>])
Arguments
<condN>
: an expression that returns 0
or 1
.
Returns
An array composed of 0
s and 1
s:
-
0
: The event does not meet the condition. -
1
: The event meets the condition.
Examples
--Create a test table and insert data.
CREATE TABLE retention_test(on_date date, uid int);
INSERT INTO retention_test(uid, on_date) VALUES
(1, '2024-01-01'),
(1, '2024-01-02'),
(1, '2024-01-03'),
(2, '2024-01-02'),
(2, '2024-01-03');
--Check the retention.
SELECT uid, RETENTION(on_date = '2024-01-01', on_date ='2024-01-02', on_date = '2024-01-03')
FROM retention_test GROUP BY uid;
--The returned result is as follows:
uid | retention
-----+-----------
1 | {1,1,1}
2 | {0,0,0}
WINDOW_FUNNEL
Searches within a sliding time window for a list of events, and calculates the maximum number of consecutive events that occurred from the chain.
The function operates as follows:
-
It searchs for data that triggers the initial condition in the chain and then sets the event counter to 1. This marks the start of the sliding window.
-
If events within the chain occur consecutively within the window, the counter increases. However, if the sequence of events is interrupted, the counter remains unchanged.
-
In cases where the data contains multiple event chains at different stages of completion, the function only returns the length of the longest chain.
Syntax
WINDOW_FUNNEL(<window>, <mode>, <timestamp>, <cond1>[, <cond2>, ...])
Arguments
-
<window>
The length of the sliding window, which specifies the maximum interval between the first and the last event in the chain, in units determined by
<timestamp>
, with a type ofbigint
. -
<mode>
The filter mode of the event chain, with a type of
integer
. It can be a combination of the following flags:-
'strict_deduplication': Strict deduplication, if an event that has been accepted before is encountered, stop traversing, and slide the window forward.
-
'strict_order': Strict order, if a timestamp that does not contain any events or a non-sequential event is encountered, stop traversing, and slide the window forward.
-
'strict_increase': Timestamps strictly increase, not allowing the timestamps of adjacent events in the chain to be equal.
-
-
<timestamp>
The timestamp column, with types of
date
,bigint
,integer
,smallint
, ortimestamp
. -
<condN>
The conditions of the event chain, with up to 32 conditions specifiable at the same time.
Returns
A bigint
value. The value is the maximum number of consecutive events within the sliding window that meets the conditions.
Examples
-
Create a test table and insert data:
DROP TABLE IF EXISTS window_funnel_test_table;
CREATE TABLE window_funnel_test_table (
timestamp timestamp,
event bigint
)USING pg_zdb;
INSERT INTO window_funnel_test_table (timestamp, event)
VALUES
('2022-02-01 00:00:01.000000', 1),
('2022-02-01 00:00:01.000100', 2),
('2022-02-01 00:00:01.000200', 3),
('2022-02-01 00:00:01.000300', 1),
('2022-02-01 00:00:01.000400', 2),
('2022-02-01 00:00:01.000500', 1),
('2022-02-01 00:00:01.000600', 2),
('2022-02-01 00:00:01.000700', 3),
('2022-02-01 00:00:01.000800', 4);
--Check the table content:
SELECT * FROM window_funnel_test_table ORDER BY timestamp;
--The result returned is as follows:
timestamp | event
--------------------------+-------
2022-02-01 00:00:01 | 1
2022-02-01 00:00:01.0001 | 2
2022-02-01 00:00:01.0002 | 3
2022-02-01 00:00:01.0003 | 1
2022-02-01 00:00:01.0004 | 2
2022-02-01 00:00:01.0005 | 1
2022-02-01 00:00:01.0006 | 2
2022-02-01 00:00:01.0007 | 3
2022-02-01 00:00:01.0008 | 4 -
Search for events in strict deduplication mode:
SELECT WINDOW_FUNNEL(1000, 1, time, event=1, event=2, event=3, event=4)
FROM window_funnel_test_table;
--The query result is as follows:
window_funnel
---------------
4
APPROX_DISTINCT
Returns the approximate number of unique values.
This function provides an approximation to COUNT(DISTINCT x)
. If all input values are null, it returns zero.
This function is expected to produce a standard error of 2.3% which is the standard deviation of the error distribution across all possible sets (approximately normal). It does not guarantee an upper bound on the error for any specific input set.
Syntax
APPROX_DISTINCT (<expression>)
Arguments
<expression>
: the expression for which to count unique values. Supported data types include boolean
, smallint
, integer
, bigint
, real
, double precision
, decimal
, varchar
, date
, timestamp
, timestamp with time zone
.
Returns
A bigint
value.
Examples
SELECT APPROX_DISTINCT(product_id)
FROM sales_data_table;
STDDEV or STDDEV_SAMP
Returns the sample standard deviation of a set of values. If all values are null, null will be returned.
STDDEV_SAMP
is an alias for STDDEV
, so their behaviors are identical.
Syntax
{ STDDEV | STDDEV_SAMP } ( [ DISTINCT ] <expression> )
Arguments
<expression>
: the expression that produces a numeric value. Supported numeric types include smallint
, integer
, bigint
, real
, and double precision
.
Returns
The return type varies based on the input type:
-
When the input type is
double precision
orreal
, the return type isdouble precision
. -
When the input type is
smallint
,integer
, orbigint
, the return type isdecimal(38,6)
.
Examples
SELECT STDDEV(salary) FROM employee_salaries;
SELECT STDDEV_SAMP(salary) FROM employee_salaries;
STDDEV_POP
Returns the population standard deviation of a set of values. If all values are null, null will be returned.
Syntax
STDDEV_POP( [ DISTINCT ] <expression> )
Arguments
<expression>
: the expression that produces a numeric value. Supported numeric types include smallint
, integer
, bigint
, real
, and double precision
.
Returns
The return type varies based on the input type:
-
When the input type is
double precision
orreal
, the return type isdouble precision
. -
When the input type is
smallint
,integer
, orbigint
, the return type isdecimal(38,6)
.
Examples
SELECT STDDEV_POP(salary) FROM employee_salaries;
VARIANCE or VAR_SAMP
Returns the sample variance of a set of values. If all values are null, null will be returned.
VAR_SAMP
is an alias for VARIANCE
, and both functions behave identically.
Syntax
{ VARIANCE | VAR_SAMP }( [ DISTINCT ] <expression> )
Arguments
<expression>
: the expression that produces a numeric value. Supported numeric types include smallint
, integer
, bigint
, real
, and double precision
.
Returns
The return type varies based on the input type:
-
When the input type is
double precision
orreal
, the return type isdouble precision
. -
When the input type is
smallint
,integer
, orbigint
, the return type isdecimal(38,6)
.
Examples
SELECT VARIANCE(salary) FROM employee_salaries;
SELECT VAR_SAMP(salary) FROM employee_salaries;
VAR_POP
Returns the population variance of a set of values. If all values are null, null will be returned.
Syntax
VAR_POP( [ DISTINCT ] <expression> )
Arguments
<expression>
: the expression that produces a numeric value. Supported numeric types include smallint
, integer
, bigint
, real
, and double precision
.
Returns
The return type varies based on the input type:
-
When the input type is
double precision
orreal
, the return type isdouble precision
. -
When the input type is
smallint
,integer
, orbigint
, the return type isdecimal(38,6)
.
Examples
SELECT VAR_POP(salary) FROM employee_salaries;
STRING_AGG
Concatenates multiple string values into one string and returns the result.
Syntax
STRING_AGG ( [ DISTINCT ] <expression> , <delimiter> ) [
ORDER BY <sort_expression>]
Arguments
-
<expression>
The
varchar
column of which values are to be concatenated. -
<delimiter>
The
varchar
string used as the delimeter. -
<sort_expression>
The non-constant sorting expression.
Returns
Same as the input type.
Examples
SELECT category, STRING_AGG(name, ', ' ORDER BY name)
FROM products
GROUP BY category;