Dremio queries are hanging

jhaynie · February 7, 2019, 12:55am

We are testing on our staging cluster, which has 4 machines (1 m5d.4xlarge and 3 m5d.4xlarge) and 3 separate ZooKeeper nodes.

We have a set of queries that just run forever… They seem to be stuck. When we run the same jobs locally, they complete fine.

We can issue other queries and they run fine.

Here’s a profile of one of them (they are all the same type of query just different date ranges)

145c7d78-fdb1-4feb-b71c-b685f45c7ed4.zip (50.4 KB)

Also, looking at the main machine and one of the other machines, CPU usage is almost nothing.

jhaynie · February 7, 2019, 12:56am

We also get this periodically.

Trying to cancel one of the queries does nothing…

jhaynie · February 7, 2019, 1:17am

One of my manual queries (of the same type) ran for a while and eventually got this error:

Query cancelled by Workload Manager. Queue enqueued time of 300.00 seconds exceeded for ‘large’ queue

Here’s profile for it.

f63123f9-ae33-4536-b41f-0b3b59cd91ea.zip (63.1 KB)

The other queries above are still running (>1 hour now)

balaji.ramaswamy · February 7, 2019, 1:28am

Hi @jhaynie

Can you filter on running and enqueued jobs, and also check all job types (UI, external, and so on)? How many jobs are running? See the screenshot below.

[Screenshot: running-enqueued all-job-types]

jhaynie · February 7, 2019, 1:30am

we just bounced the main box to see if we could reproduce. we’re kind of stuck at this point and not sure what to do. there were ~8 jobs running for >1hr when we just bounced. I’m going to try and re-run one of the jobs now.

balaji.ramaswamy · February 7, 2019, 1:32am

@jhaynie

Yes, start with one and throttle up. Please let us know when it starts to get slow.

Thanks
@balaji.ramaswamy

jhaynie · February 7, 2019, 2:29am

We are testing this on another machine with an identical configuration, except that it is still running 3.0 instead of the 3.1 on the machines above.

This simple query against 4 parquet files stored in S3 took 1.35m. There are only ~1.5K records total.

here is a profile. … could be different than above but we’re trying to reduce the queries down to see if we can figure it out.

1589206e-7faa-4cd7-b52d-75d0c5995a7c.zip (25.4 KB)

looking at the profile, it seems like 1 of the threads (01-03-05) in the PARQUET_ROW_GROUP_SCAN took almost 1 minute to read 671 records.

jhaynie · February 7, 2019, 8:07pm

spent a good bit of last night and today trying to debug this more … super frustrating.

we decided to experiment with moving the S3 data for one customer into HDFS and connecting to that.

the performance differences are STAGGERING. same source (parquet), same query.

the HDFS results are 1.27GB ~43 seconds

the S3 results are ~8 minutes!
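To put the two numbers side by side, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the 1.27GB figure is the amount of data processed and that the S3 run touched the same amount (the post only says "same source, same query"), and it uses end-to-end query times, not raw transfer times.

```python
# Figures quoted in the post. DATA_GB applying to the S3 run as well
# is an assumption, not something the post states.
DATA_GB = 1.27
HDFS_SECONDS = 43
S3_SECONDS = 8 * 60  # "~8 minutes"


def throughput_mb_per_s(size_gb: float, seconds: float) -> float:
    """Effective throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / seconds


slowdown = S3_SECONDS / HDFS_SECONDS
hdfs_rate = throughput_mb_per_s(DATA_GB, HDFS_SECONDS)
s3_rate = throughput_mb_per_s(DATA_GB, S3_SECONDS)

print(f"S3 run took about {slowdown:.1f}x as long as the HDFS run")
print(f"HDFS: ~{hdfs_rate:.1f} MB/s effective, S3: ~{s3_rate:.1f} MB/s effective")
```

Under those assumptions the S3 run takes roughly 11 times as long, which is far more than a small per-request overhead would explain.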

here are the 2 different profiles

I guess s3 just isn’t very performant?

FWIW, we downloaded the files from S3 onto this box and it was almost immediate, so we’ve verified that there’s really no difference between pulling from S3 and pulling from HDFS as far as network performance, etc. goes.
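For anyone wanting to repeat that download check in a measurable way, here is a minimal sketch of a timing helper. The function name `time_stream_read` is hypothetical, and the example reads an in-memory buffer as a stand-in. If boto3 is in use (an assumption), the `Body` returned by `get_object` is a readable stream that could be passed in the same way.

```python
import io
import time
from typing import BinaryIO, Tuple


def time_stream_read(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Tuple[int, float]:
    """Read a binary stream to the end and return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        total += len(chunk)
    return total, time.perf_counter() - start


# Stand-in for a real object: 8 MiB of zero bytes held in memory.
payload = io.BytesIO(bytes(8 * 1024 * 1024))
n_bytes, seconds = time_stream_read(payload)
print(f"read {n_bytes} bytes in {seconds:.4f}s")
```

Note that a single sequential download like this may not reflect how a Parquet scan reads from S3, which typically issues many smaller ranged reads, so a fast download does not by itself rule out S3 wait time during the scan.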

dremio-s3-source-profile.zip (26.1 KB)
dremio-hdfs-source-profile.zip (21.8 KB)

balaji.ramaswamy · February 7, 2019, 8:37pm

Hi @jhaynie

If you open your profile and scroll down to the initial Parquet Row Group Scan, you will see that the entire time is spent waiting on S3.

HDFS seems slightly better, but the wait time is still high. Is this on-prem HDFS, and are your executors on HDFS?

jhaynie · February 7, 2019, 9:10pm

We are running these (like I said above) on AWS EC2 using the recommended config. The HDFS is just an EMR cluster.

jhaynie · February 7, 2019, 9:33pm

OK, we are moving the executors into YARN and will re-run the queries and compare here soon
