You can add visitor_id as a dimension (not measure) to your existing list of dimensions. This is a bit counter intuitive but eventually count distinct queries can be thought as group by/count queries – hence setting this field as a dimension.
As @evan_b suggested, you can create a virtual dataset for the query you want to accelerate and create a Raw Reflection on that. This would work both with count(distinct X) and NDV(x).
Thanks for your answers. I do not find very elegant the solution to create a VDS for the query I want to accelerate.
For the first one, adding visitor_id as a dimension of the reflection would increase a lot the storage size of the reflection.
@dfleckinger creating a VDS should be considered as a secondary option, especially if you’d like a NDV based solution as there isn’t currently an option to configure approximate count distincts for reflections. (We do have plans to support this going forward.)
As for the reflection sizes – that’s going to be a trade-off between query-time performance and storage that you’ll need to determine for your use-case. The same trade-off would have applied if we hypothetically supported adding visitor_id as a measure to accelerate count distinct queries. We also have plans to support even more memory efficient (focus on query-time memory, not reflection storage space) count distinct acceleration.
Thanks Can.
At the moment I’m not able to add visitor_id as a dimension. When Dremio tries to create the reflection,
out of memory error happens :
OUT_OF_MEMORY ERROR: One or more nodes ran out of memory while executing the query.
My dataset has more than 50 millions distinct visitor_id already in a parquet dataset.
So at the moment I will use the VDS for the query.
Happy to see that there are plans to improve the performance of count distinct queries.