A common analytic use case is to find the number of distinct items in a data set.
Vertica performs well at solving COUNT DISTINCT in a few ways. Since Vertica stores all data in columns, it is possible to optimize for COUNT DISTINCT by building a projection that is tuned for this use case. Vertica can quickly COUNT DISTINCT over a column that is ordered and encoded with RLE – then one can simply count the number of distinct RLE groups in the range.
This might mean building a new projection to optimize count distinct patterns. This could mean additional overhead for load time and storage. However, if you only need a projection to optimize the most recent data – suppose you only need distinct counts for the last month, you can build a partition range projection over only the most recent range of data. This speeds up both data load and query operations. Alternately, there are some documented ways to tune queries to improve count distinct performance linked in the references.
There are cases where projection or query tuning aren’t the answer. You might not know all the queries users will run, or you might need to roll up daily counts into weekly or monthly counts, and do not want to re-run counts over longer ranges or try to guess the ideal projection (ranged or not) for future workloads. For these cases, Vertica provides a set of approximate count distinct functions based on LogLogBeta that allow you to compute aggregates, store them, and perform aggregates of aggregates to combine daily numbers into weekly numbers, for example.
Let us consider a set of air traffic data where we have timestamps, aircraft transponder identifiers (a serial number in hexadecimal), and aircraft callsigns (the carrier and flight number, such as “UAL1234”). The transponder identifiers do not change, but the callsign changes regularly as the same aircraft flies different routes. Let’s work out the number of distinct aircraft and distinct flights in a day, which is straightforward with SQL COUNT DISTINCT:
d2=> select date_generated, count(distinct hex_ident), count(distinct callsign) from dump1090new where date_generated = '2022-10-10' group by date_generated;
date_generated | count | count
----------------+-------+-------
2022-10-10 | 3967 | 6331
Time: First fetch (1 row): 2853.199 ms. All rows formatted: 2853.264 ms
This may be tolerable for one day, but let’s do the same COUNT DISTINCT over 10 days:
d2=> select min(date_generated), max(date_generated), COUNT(DISTINCT hex_ident), COUNT(DISTINCT callsign) from dump1090new where date_generated between '2022-10-01' and '2022-10-10';
min | max | COUNT | COUNT
------------+------------+-------+-------
2022-10-01 | 2022-10-10 | 12401 | 13299
Time: First fetch (1 row): 16757.459 ms. All rows formatted: 16757.517 ms
This is not fast, and it would be handy to have a way to store the daily counts to roll up later. Let’s create a COUNT DISTINCT synopsis, which is a LONG VARBINARY that we can use to compute the current estimated count, or save for later:
d2=> insert into dump1090_approximate select date_generated, APPROXIMATE_COUNT_DISTINCT_SYNOPSIS (hex_ident), APPROXIMATE_COUNT_DISTINCT_SYNOPSIS (callsign) from dump1090new where date_generated = '2022-10-10' group by date_generated;
OUTPUT
--------
1
Time: First fetch (1 row): 3329.641 ms. All rows formatted: 3329.692 ms
The approximate function does not seem faster. However, we saved the synopsis and can use it later. If we run the approximate function and store the synopsis daily, we’ve done the work up front and can use it again here from a table with synopsis from 10 days of data:
d2=> select min, max, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(hexmerge) as hexmerge, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(csmerge) as csmerge from (select min(day) as min, max(day) as max, APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(hex_ident) as hexmerge, APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(callsign) as csmerge from dump1090_approximate) a1 group by min, max;
min | max | hexmerge | csmerge
------------+------------+----------+---------
2022-10-01 | 2022-10-10 | 12328 | 13191
Time: First fetch (1 row): 56.257 ms. All rows formatted: 56.319 ms
The approximate library and its synopsis functions allow you to save aggregates and perform arbitrary rollups later, going from daily to every 10 days in this case, and can considerably speed up future aggregation calculations if an estimate is acceptable. The values shown here are not far off, within 1% of the exact count; Vertica targets an error tolerance of 1.25% by default.
The built-in approximate functions are like Apache Datasketches, which implements several algorithms that can run to produce a datasketch as a VARBINARY that can be saved and manipulated later to compute counts, frequencies, and more. If you’d like to go beyond count distinct and try other aggregates, there is a community project to implement Apache Datasketches as a Vertica UDX library at https://github.com/bryanherger/vertica-datasketch. Currently only a few algorithms are imported though contributions are welcome!
References
- The ADS-B air traffic dataset with DDL and approximate count distinct examples is posted at https://github.com/bryanherger/vertica-datasets/tree/main/adsb
- Vertica query optimization for DISTINCT: DISTINCT in a SELECT Query List
- Vertica approximate functions: Approximate Count Distinct Functions