We are testing on our staging cluster, which has 4 machines (1 coordinator and 3 executors, all m5d.4xlarge) and a separate 3-node ZooKeeper ensemble.
We have a set of queries that just run forever… they seem to be stuck. When we run this same job locally it completes fine.
Other queries we issue run fine.
Here’s a profile of one of the stuck ones (they are all the same type of query, just with different date ranges):
145c7d78-fdb1-4feb-b71c-b685f45c7ed4.zip (50.4 KB)
Also, looking at the main machine and one of the other machines, CPU usage is almost nothing.
We also get this periodically.
Trying to cancel one of the queries does nothing…
One of my manual queries (of the same type) ran for a while and eventually got this error:
Query cancelled by Workload Manager. Queue enqueued time of 300.00 seconds exceeded for ‘large’ queue
Here’s the profile for it:
f63123f9-ae33-4536-b41f-0b3b59cd91ea.zip (63.1 KB)
The other queries above are still running (>1 hour now)
Can you filter on running and enqueued jobs, and also select UI and External and check all job types? How many jobs are running then? See the screenshot below.
We just bounced the main box to see if we could reproduce it. We’re kind of stuck at this point and not sure what to do. There were ~8 jobs that had been running for >1 hour when we bounced it. I’m going to try re-running one of the jobs now.
Yes, start with one and throttle up. Please let us know when it starts to slow down.
We are testing this on another machine with an identical configuration, except that it is still running 3.0 instead of the 3.1 on the machines above.
This simple query against 4 Parquet files stored in S3 took 1.35 minutes; there are only ~1.5K records in total.
Here is a profile. … It could be a different issue than the one above, but we’re trying to pare the queries down to see if we can figure it out.
1589206e-7faa-4cd7-b52d-75d0c5995a7c.zip (25.4 KB)
Looking at the profile, it seems like one of the threads (01-03-05) in the PARQUET_ROW_GROUP_SCAN took almost 1 minute to read 671 records.
We spent a good bit of last night and today trying to debug this further … super frustrating.
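As a rough sanity check on how slow that scan really is, here is a back-of-the-envelope calculation using the two numbers from the profile above (671 records, taking "almost 1 minute" as ~60 seconds):

```python
# Effective scan rate from the profile numbers above:
# 671 records in roughly 60 seconds on thread 01-03-05.
records = 671
seconds = 60.0  # "almost 1 minute"

rate = records / seconds
print(f"~{rate:.1f} records/s")
```

That works out to roughly 11 records per second, which for a 4-file, ~1.5K-record dataset strongly suggests the time is going to per-request overhead or waiting on the source rather than actual decoding work.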
We decided to experiment with moving the S3 data for one customer into HDFS and connecting to that.
The performance difference is STAGGERING. Same source format (Parquet), same query:
HDFS: 1.27 GB in ~43 seconds
S3: ~8 minutes!
Here are the two profiles.
I guess S3 just isn’t very performant?
FWIW, we can download the files from S3 onto this box almost instantly, so we’ve verified there’s really no difference in raw network performance between pulling from S3 and from HDFS.
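For reference, this is the kind of quick measurement that rules out raw network throughput (a minimal sketch, not anything Dremio-specific; the `fetch` callable is a placeholder for whatever actually pulls the bytes, e.g. an S3 GET or an HDFS read, neither of which is shown here):

```python
import time

def measure_throughput(fetch):
    """Time a callable that returns bytes and report (MB, seconds, MB/s)."""
    start = time.perf_counter()
    data = fetch()
    elapsed = time.perf_counter() - start
    mb = len(data) / (1024 * 1024)
    return mb, elapsed, mb / elapsed if elapsed > 0 else float("inf")

# Dummy in-memory "download" so the sketch is runnable as-is;
# in practice fetch() would perform the real S3 or HDFS read.
mb, secs, rate = measure_throughput(lambda: b"x" * (4 * 1024 * 1024))
print(f"{mb:.1f} MB in {secs:.4f}s -> {rate:.1f} MB/s")
```

If a plain timed download from the same box is fast, the bottleneck is more likely in how the reader issues requests (many small range reads per row group) than in the pipe itself.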
dremio-s3-source-profile.zip (26.1 KB)
dremio-hdfs-source-profile.zip (21.8 KB)
If you open your profile and scroll down to the initial Parquet Row Group Scan, you will see that the entire time is spent waiting on S3.
HDFS seems slightly better, but the wait time is still high. Is this an on-prem HDFS cluster, and are your executors co-located with the HDFS nodes?
We are running these (as I said above) on AWS EC2 using the recommended config. The HDFS is just an EMR cluster.
OK, we are moving the executors into YARN and will re-run the queries and compare here soon.