Optimization/Acceleration of select distinct query on attribute

I have big flat data file (about 10M rows). File consist of several attribute columns and several measure columns. I need fast query that give me distinct values of attributes. This query should give results similar to the query below:

SELECT DISTINCT ‘dims_1’ as attr, “dims_1” as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT ‘dims_2’ as attr, “dims_2” as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL
SELECT DISTINCT ‘dims_3’ as attr, “dims_3” as val
FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
UNION ALL

I’m going to use results of this query to build attribute filter in GUI.
I tried to use aggregate reflection for this purpose and then SELECT * FROM, but dremio claims that aggregated reflection not match for that query and use only raw reflection for file.

Could you share your reflection definition (a screenshot perhaps)? That would help us figure out what is going on.

thanks,
Doron

Hi,
The case looks like this:

  1. I query data source test1.PERFORMANCE_TEST_10M_ATTR_QUAERY to get all rows:
    SELECT *
    FROM test1.PERFORMANCE_TEST_10M_ATTR_QUERY_TO_CUBE

I use this data source to get all (attribute name, value) tuples.

  1. Data source definition looks like this:

SELECT attr, val, 1 as cnt
FROM test1.PERFORMANCE_TEST_10M_ATTR_QUAERY

(i added artificial column cnt to have measure in aggregated reflection)

  1. Parent data source (PERFORMANCE_TEST_10M_ATTR_QUAERY) is defined as follows:
    SELECT DISTINCT ‘dims_1’ as attr, “dims_1” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_2’ as attr, “dims_2” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_3’ as attr, “dims_3” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_4’ as attr, “dims_4” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_5’ as attr, “dims_5” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_6’ as attr, “dims_6” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_7’ as attr, “dims_7” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_8’ as attr, “dims_8” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_9’ as attr, “dims_9” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dims_10’ as attr, “dims_10” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dim_big_1’ as attr, “dim_big_1” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs
    UNION ALL
    SELECT DISTINCT ‘dim_big_2’ as attr, “dim_big_2” as val
    FROM test1.PERFORMANCE_TEST_10M_conv_big_attrs

PERFORMANCE_TEST_10M_conv_big_attrs refers to csv file.

  1. I enabled following Aggregation Reflections:

  2. When i execute query

SELECT *
FROM test1.PERFORMANCE_TEST_10M_ATTR_QUERY_TO_CUBE

it looks like dremio perform full scan on raw reflection of csv file.

Reflection Outcome
Query was accelerated
17095cb9-d5e4-4ea5-a3c4-bcb840d03647 (agg): considered, not matched.
11d95c44-3db3-4a8c-a5de-573439be563e (raw): considered, matched, chosen.

Hi @jasiek,

I a little confused here. I see your final query is a plain "select ", what happens if you try select count(), val and group by so the agg kicks in?

Thanks,
@balaji.ramaswamy

Hi,
I wanted to have a list of “distinct” elements in column to know what can I use later in filters. I expected that “distinct” will check in some meta data in aggregations/accelerations.

I rewrote query with “group” by and “count(*)” and split aggregation to have single aggregation for single column. That helps a lot. Now it use accelerations and speed up a lot.

Thanks

1 Like