Dremio seems to be scanning all the partitioned data from S3

v-shaal · October 29, 2020, 10:18pm

I am new to Dremio. Having gone through the documentation, I tried to evaluate the Community Edition against Presto and Athena in terms of latency, but I am facing the following issues.

My data is in Parquet format, partitioned on date.
The preview shows the partition column as dir0, with values such as:
dir0
eventdate=20200927
eventdate=20200930
eventdate=20200919
eventdate=20200918
eventdate=20200922
eventdate=20200929
eventdate=20200921
eventdate=20200920
.
.
.
I am trying to run a query with a left join over 3 such datasets, filtered on the date partition shown above to cover 10 date partitions.
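
For readers following along, here is a minimal sketch of what a filter on the directory column itself could look like for a 10-date window. It assumes the partition is exposed as the string column dir0 with values of the form eventdate=YYYYMMDD, as in the preview above. The helper names `dir0_values` and `dir0_predicate` are hypothetical, and the original query in this thread was not posted, so this is not a reconstruction of it.

```python
from __future__ import annotations

from datetime import date, timedelta


def dir0_values(start: date, days: int, key: str = "eventdate") -> list[str]:
    """Return directory-style partition values such as 'eventdate=20200918'."""
    return [f"{key}={(start + timedelta(days=i)):%Y%m%d}" for i in range(days)]


def dir0_predicate(start: date, days: int, column: str = "dir0") -> str:
    """Build an IN-list predicate on the directory column (assumed to be dir0)."""
    quoted = ", ".join(f"'{value}'" for value in dir0_values(start, days))
    return f"{column} IN ({quoted})"


# Ten consecutive dates; the start date is only an example taken from the preview.
print(dir0_predicate(date(2020, 9, 18), 10))
```

The point of the sketch is that the predicate compares dir0 directly against literal values. If the query instead filters on an expression derived from dir0, or on a date column stored inside the Parquet files, the planner may not be able to prune directories; the query profile would show which case applies.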

I ran the same query on a Presto cluster with a similar configuration, where it takes around 7-8 minutes. The same query on Dremio ran for more than 30 minutes, at which point I cancelled it without waiting for the result.

A few things I noticed:
Presto scanned around 200GB of data to produce the result,
while the Dremio query, at the time it was cancelled, showed the following data scanned:
Input bytes : 600GB +

I am having a hard time understanding why Dremio is not doing partition pruning/filtering and is instead scanning the whole dataset.

Note: I have not tried using reflections

ben · October 30, 2020, 12:04am

@v-shaal, can you attach a query profile for the job?

fbelchior · December 7, 2022, 2:25pm

Hello!

I have a problem with Dremio and S3.

We have a Parquet table partitioned by folders in S3, and a simple “SELECT COUNT(DISTINCT COL_PARTITION)” runs a full scan of the table.

I’ve tried mapping the table through both Glue and S3, with the same result.

balaji.ramaswamy · December 9, 2022, 5:30am

@fbelchior Can you please send me the profile for the above query and also just count (without the distinct)?

fbelchior · December 9, 2022, 12:47pm

@balaji.ramaswamy

The count without distinct runs in 5 minutes, and the count of distinct partitions runs in 13 minutes.

Profile_With_Distinct.zip (25.7 KB)
Profile_Count_Without_Distinct.zip (22.3 KB)

balaji.ramaswamy · December 14, 2022, 5:31pm

Count(distinct) is generally expensive, and there are 18K files.

Do you need an exact count distinct, or would an approximate count distinct help?

https://docs.dremio.com/software/sql-reference/sql-functions/functions/APPROX_COUNT_DISTINCT/

You can also use an aggregation reflection for the query using NDV (Near Distinct Values).
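
As a rough illustration of the exact versus approximate choice, the sketch below builds both query strings side by side. APPROX_COUNT_DISTINCT is the function from the linked documentation and COL_PARTITION is the column named earlier in the thread. The table path `my_source.my_table` and the helper `count_distinct_sql` are hypothetical placeholders, since the real table name was not posted.

```python
def count_distinct_sql(table: str, column: str, approximate: bool = False) -> str:
    """Return an exact or approximate distinct-count query as a string."""
    if approximate:
        # Approximate variant, using the function from the linked docs page.
        expr = f"APPROX_COUNT_DISTINCT({column})"
    else:
        # Exact variant, as run in the thread.
        expr = f"COUNT(DISTINCT {column})"
    return f"SELECT {expr} FROM {table}"


# Hypothetical table path; replace with the actual dataset.
TABLE = "my_source.my_table"

print(count_distinct_sql(TABLE, "COL_PARTITION"))
print(count_distinct_sql(TABLE, "COL_PARTITION", approximate=True))
```

Running both variants and comparing their profiles would show whether the cost comes from the distinct aggregation or from reading the 18K files.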
