Dremio seems to be scanning all the partitoned data from s3

I am new to dremio, having gone through the documentation I tried to evaluate the community edition in terms of latency between presto and athena but facing following issues

  1. My data is in parquet format partitioned on date
    the preview show the column as dir0 with values as
    dir0
    eventdate=20200927
    eventdate=20200930
    eventdate=20200919
    eventdate=20200918
    eventdate=20200922
    eventdate=20200929
    eventdate=20200921
    eventdate=20200920
    .
    .
    .
    I am trying to run a query with left join over 3 such data set filtered over above date partitioned over for 10 date. partition

I ran same query on similar config presto cluster it takes around 7-8 min on presto , while same query when I am trying on the dremio takes more than 30 min (then I cancelled without waiting for the result)

few things I noticed were
presto scanned around 200GB of data to give the result
while when dremio query was cancelled showed the data scan
ie
Input bytes : 600GB +

I am having hardtime to understand as I have why dremio is not doing partiton pruning/ filtering and scanning the whole lot amount of the data.

Note: I have not tried using reflections

@v-shaal, can you attach a query profile for the job?

Hello!

I have some problem with Dremio and S3.

We have a parquet table partitioned with folders in S3 and a simple “SELECT COUNT(DISTINCT COL_PARTITION)” run a full scan on table.

I’ve tried to mapped the table across GLUE and S3 with the same result.

@fbelchior Can you please send me the profile for the above query and also just count (without the distinct)?

@balaji.ramaswamy

The count without distinct run in 5 minutes and the distinct partitions run in 13 minutes.

Profile_With_Distinct.zip (25,7,KB)
Profile_Count_Without_Distinct.zip (22,3,KB)

Count(distinct) is generally expensive and there are 18K files

Do you need an exact count distinct? or would Approximate count distinct help?

https://docs.dremio.com/software/sql-reference/sql-functions/functions/APPROX_COUNT_DISTINCT/

Also can agg reflect the query using NDV (Near Distinct Values)

1 Like