Reflection with count distinct

I want to use aggregation reflection to calculate the number of distinct visitors by a dimension.

let’s say my source data is composed of 3 columns :
_Table event_pid _
- pid (productId)
- ts (timestamp)
- visitor_id

I was supposing that creating an aggregation reflection with pid as dimension, visitor_id as metric,

would enable to use acceleration
for a query such as
select pid, ndv(visitor_id) from event_pid group by pid

Unfortunately Dremio does not use acceleration for such a query.

Any advise on a way to use acceleration four count(distinct …) or ndv queries ?

Thanks

What about creating another virtual dataset based on the count distinct query, and using a raw reflection of that?

@dfleckinger you have a few options:

  • You can add visitor_id as a dimension (not measure) to your existing list of dimensions. This is a bit counter intuitive but eventually count distinct queries can be thought as group by/count queries – hence setting this field as a dimension.
  • As @evan_b suggested, you can create a virtual dataset for the query you want to accelerate and create a Raw Reflection on that. This would work both with count(distinct X) and NDV(x).

Thanks for your answers. I do not find very elegant the solution to create a VDS for the query I want to accelerate.
For the first one, adding visitor_id as a dimension of the reflection would increase a lot the storage size of the reflection.

@dfleckinger creating a VDS should be considered as a secondary option, especially if you’d like a NDV based solution as there isn’t currently an option to configure approximate count distincts for reflections. (We do have plans to support this going forward.)

As for the reflection sizes – that’s going to be a trade-off between query-time performance and storage that you’ll need to determine for your use-case. The same trade-off would have applied if we hypothetically supported adding visitor_id as a measure to accelerate count distinct queries. We also have plans to support even more memory efficient (focus on query-time memory, not reflection storage space) count distinct acceleration.

Thanks Can.
At the moment I’m not able to add visitor_id as a dimension. When Dremio tries to create the reflection,
out of memory error happens :
OUT_OF_MEMORY ERROR: One or more nodes ran out of memory while executing the query.

My dataset has more than 50 millions distinct visitor_id already in a parquet dataset.

So at the moment I will use the VDS for the query.

Happy to see that there are plans to improve the performance of count distinct queries.

1 Like