I see the function APPROX_COUNT_DISTINCT, which looks interesting, but the description makes me a bit wary of using it. How approximate is it? Also, how is it different from NDV?
Also, for the CORR function, none of the examples actually shows an output.
For VAR_POP and VAR_SAMP, what is the actual difference? I get that population variance normally divides by n and sample variance by (n-1), but does VAR_SAMP treat the entire column it is given as the sample, or does it take a subset? If it uses the whole column, is the only difference between the two whether the sum of squared differences is divided by the count or by count - 1? I would also suggest that the documentation for the two use the same columns/datasets so they are easier to compare. E.g. for int64, VAR_POP uses
SELECT VAR_POP(pop) FROM "zips.json"
and VAR_SAMP uses
SELECT VAR_SAMP(passenger_count) FROM Samples."samples.dremio.com"."NYC-taxi-trips"
It’d be helpful to see
SELECT VAR_SAMP(pop) FROM "zips.json"
instead, or the other way around.
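
To make the variance question concrete, here is a small Python sketch of what I assume the two functions compute. The values are made up, and the assumption that VAR_SAMP uses every row of the column (rather than drawing a subset) is exactly the part I'd like confirmed, so please read this as my guess and not as a description of the actual implementation.

```python
from statistics import pvariance, variance

# Made-up stand-in for a numeric column; not taken from zips.json.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Sum of squared differences from the mean, shared by both formulas.
sum_sq_diff = sum((x - mean) ** 2 for x in values)

# Assumption: VAR_POP divides by n.
var_pop = sum_sq_diff / n

# Assumption: VAR_SAMP uses the same rows and divides by n - 1.
var_samp = sum_sq_diff / (n - 1)

print("population variance:", var_pop)   # 32 / 8 = 4.0
print("sample variance:", var_samp)      # 32 / 7, about 4.57

# Cross-check against Python's standard library definitions.
print("statistics.pvariance:", pvariance(values))
print("statistics.variance:", variance(values))
```

If that guess is right, the two results on the same column should differ only by a factor of n / (n - 1), which is why seeing both functions run on the same dataset in the docs would make the comparison obvious.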