Optimization/Acceleration of select distinct query on attribute

jasiek · November 21, 2017, 11:46am

I have a large flat data file (about 10M rows). The file consists of several attribute columns and several measure columns. I need a fast query that returns the distinct values of each attribute. It should give results similar to the query below:

SELECT DISTINCT 'dims_1' as attr, "dims_1" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_2' as attr, "dims_2" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_3' as attr, "dims_3" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs

I'm going to use the results of this query to build an attribute filter in a GUI.
I tried to use an aggregation reflection for this purpose and then ran SELECT * FROM the dataset, but Dremio reports that the aggregation reflection does not match that query and uses only the raw reflection on the file.

doron · November 21, 2017, 4:37pm

Could you share your reflection definition (a screenshot perhaps)? That would help us figure out what is going on.

thanks,
Doron

jasiek · November 22, 2017, 8:57am

Hi,
The case looks like this:

I query the data source test1.PERFORMANCE_TEST_10M_ATTR_QUERY_TO_CUBE to get all rows:
SELECT *
FROM test1.PERFORMANCE_TEST_10M_ATTR_QUERY_TO_CUBE

I use this data source to get all (attribute name, value) tuples.

The data source definition looks like this:

SELECT attr, val, 1 as cnt
FROM test1.PERFORMANCE_TEST_10M_ATTR_QUAERY

(I added an artificial column cnt to have a measure in the aggregation reflection.)

The parent data source (PERFORMANCE_TEST_10M_ATTR_QUAERY) is defined as follows:
SELECT DISTINCT 'dims_1' as attr, "dims_1" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_2' as attr, "dims_2" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_3' as attr, "dims_3" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_4' as attr, "dims_4" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_5' as attr, "dims_5" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_6' as attr, "dims_6" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_7' as attr, "dims_7" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_8' as attr, "dims_8" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_9' as attr, "dims_9" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dims_10' as attr, "dims_10" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dim_big_1' as attr, "dim_big_1" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT 'dim_big_2' as attr, "dim_big_2" as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs

PERFORMANCE_TEST_10M_conv_big_attrs refers to a CSV file.

I enabled the following aggregation reflections:

[Screenshot: aggregation_Dremio.png]

When I execute the query

SELECT *
FROM test1.PERFORMANCE_TEST_10M_ATTR_QUERY_TO_CUBE

it looks like Dremio performs a full scan on the raw reflection of the CSV file.

Reflection Outcome
Query was accelerated
17095cb9-d5e4-4ea5-a3c4-bcb840d03647 (agg): considered, not matched.
11d95c44-3db3-4a8c-a5de-573439be563e (raw): considered, matched, chosen.

balaji.ramaswamy · December 5, 2017, 6:41pm

Hi @jasiek,

I am a little confused here. I see your final query is a plain "select *". What happens if you try select count(*), val and group by, so the agg kicks in?

Thanks,
@balaji.ramaswamy
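
For readers following along, the suggestion above might look roughly like the query below. This is only a sketch: it assumes the attr and val columns of the dataset defined earlier in the thread, and the alias cnt_rows is made up for illustration. Whether Dremio actually matches the aggregation reflection also depends on which dimensions and measures the reflection was configured with.

```sql
-- Sketch only: aggregate instead of SELECT * so the planner sees a
-- GROUP BY that it can try to match against the aggregation reflection.
-- cnt_rows is a hypothetical alias, not a column from the thread.
SELECT attr, val, COUNT(*) AS cnt_rows
FROM test1.PERFORMANCE_TEST_10M_ATTR_QUERY_TO_CUBE
GROUP BY attr, val
```

Each (attr, val) pair appears once in the result, so the output can still feed a filter list; the count column can simply be ignored.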

jasiek · December 7, 2017, 9:02am

Hi,
I wanted a list of the distinct elements in a column, to know what I can use later in filters. I expected that "distinct" would be answered from some metadata in the aggregations/accelerations.

I rewrote the query with "group by" and "count(*)" and split the aggregation so that there is a single aggregation per column. That helps a lot: the query now uses the accelerations and is much faster.

Thanks
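
The rewritten query itself was not posted. A minimal sketch of what one per-column query could look like is shown below, assuming it runs directly against the source dataset and that an aggregation reflection exists with that single column as a dimension. The choice of dims_1 and the alias cnt are illustrative only.

```sql
-- Hypothetical per-column form; one such query (and one aggregation
-- reflection) per attribute column, e.g. dims_1 through dim_big_2.
SELECT "dims_1" AS val, COUNT(*) AS cnt
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
GROUP BY "dims_1"
```

Grouping by a single column returns the same set of values as SELECT DISTINCT on that column, but expressed as an aggregation.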
