Reflection with count distinct

dfleckinger · January 11, 2018, 2:34pm

I want to use aggregation reflection to calculate the number of distinct visitors by a dimension.

let’s say my source data is composed of 3 columns :
_Table event_pid _
- pid (productId)
- ts (timestamp)
- visitor_id

I was supposing that creating an aggregation reflection with pid as dimension, visitor_id as metric,

would enable to use acceleration
for a query such as
select pid, ndv(visitor_id) from event_pid group by pid

Unfortunately Dremio does not use acceleration for such a query.

Any advise on a way to use acceleration four count(distinct …) or ndv queries ?

Thanks

evan_b · January 11, 2018, 7:42pm

What about creating another virtual dataset based on the count distinct query, and using a raw reflection of that?

can · January 12, 2018, 12:18am

@dfleckinger you have a few options:

You can add visitor_id as a dimension (not measure) to your existing list of dimensions. This is a bit counter intuitive but eventually count distinct queries can be thought as group by/count queries – hence setting this field as a dimension.
As @evan_b suggested, you can create a virtual dataset for the query you want to accelerate and create a Raw Reflection on that. This would work both with count(distinct X) and NDV(x).

dfleckinger · January 12, 2018, 8:43am

Thanks for your answers. I do not find very elegant the solution to create a VDS for the query I want to accelerate.
For the first one, adding visitor_id as a dimension of the reflection would increase a lot the storage size of the reflection.

can · January 12, 2018, 6:49pm

@dfleckinger creating a VDS should be considered as a secondary option, especially if you’d like a NDV based solution as there isn’t currently an option to configure approximate count distincts for reflections. (We do have plans to support this going forward.)

As for the reflection sizes – that’s going to be a trade-off between query-time performance and storage that you’ll need to determine for your use-case. The same trade-off would have applied if we hypothetically supported adding visitor_id as a measure to accelerate count distinct queries. We also have plans to support even more memory efficient (focus on query-time memory, not reflection storage space) count distinct acceleration.

dfleckinger · January 15, 2018, 1:35pm

Thanks Can.
At the moment I’m not able to add visitor_id as a dimension. When Dremio tries to create the reflection,
out of memory error happens :
OUT_OF_MEMORY ERROR: One or more nodes ran out of memory while executing the query.

My dataset has more than 50 millions distinct visitor_id already in a parquet dataset.

So at the moment I will use the VDS for the query.

Happy to see that there are plans to improve the performance of count distinct queries.

Topic		Replies	Views
How to accelerate count distinct queries with aggregation reflection	2	1442	May 26, 2022
Setting up reflections on distinct count	1	1295	February 25, 2019
Optimization/Acceleration of select distinct query on attribute	4	2954	December 7, 2017
Dremio Reflection Choosing Algorithm	5	1421	November 4, 2019
Accelaration not effective on select distinct * from table Apache Iceberg	3	70	September 19, 2024

Reflection with count distinct

Related topics