How approximate is APPROX_COUNT_DISTINCT? And some other aggregate documentation questions

driscoll42 · December 7, 2021, 10:42pm

I see the function APPROX_COUNT_DISTINCT, which looks interesting, but the description makes me a bit wary of using it. How approximate is it? And how is it different from NDV?

Also, for the CORR function, none of the examples actually show an output.

For VAR_POP and VAR_SAMP, what is the actual difference? I get that population variance normally divides by n and sample variance by n - 1, but does VAR_SAMP treat the entire column it is given as the sample, or does it take a subset? Is the only difference between the two whether the sum of squared differences is divided by the count or by count - 1? I would also suggest that the documentation for the two use the same columns/datasets so they are easier to compare. E.g. for int64, VAR_POP uses

SELECT VAR_POP(pop) FROM "zips.json"

and VAR_SAMP uses

SELECT VAR_SAMP(passenger_count) FROM Samples."samples.dremio.com"."NYC-taxi-trips"

It'd be helpful to see

SELECT VAR_SAMP(pop) FROM "zips.json"

instead, or the other way around.
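
To make the /n versus /(n-1) distinction I am asking about concrete, here is a minimal Python sketch of the textbook definitions. It is not Dremio code, the values are made up rather than taken from zips.json, and it assumes the whole list is used in both cases. Whether VAR_SAMP behaves this way is exactly my question.

```python
import statistics

# Hypothetical values standing in for a numeric column (not real data).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Sum of squared differences from the mean, shared by both formulas.
sum_sq_diff = sum((x - mean) ** 2 for x in values)

# Population variance divides by n; sample variance divides by n - 1.
var_pop = sum_sq_diff / n
var_samp = sum_sq_diff / (n - 1)

print("population variance:", var_pop)
print("sample variance:", var_samp)

# The standard library exposes the same two definitions.
print("statistics.pvariance:", statistics.pvariance(values))
print("statistics.variance:", statistics.variance(values))
```

For these eight values the sum of squared differences is 32, so the two results are 32/8 = 4 and 32/7 (about 4.57). If VAR_POP and VAR_SAMP were documented against the same column, the same kind of side-by-side comparison would be possible.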
