we just bounced the main box to see if we could reproduce the issue. we’re kind of stuck at this point and not sure what to do. there were ~8 jobs that had been running for >1hr when we bounced it. I’m going to try to re-run one of the jobs now.
we are testing this on another machine with an identical configuration, except that it is still running 3.0 instead of the 3.1 on the machines above.
this simple query against 4 parquet files stored in S3 took 1.35m. there are only ~1.5K records total.
here is a profile. … could be different than above but we’re trying to reduce the queries down to see if we can figure it out.
spent a good bit of last night and today trying to debug this more … super frustrating.
we decided to experiment with moving the S3 data for one customer into HDFS and connecting to that.
the performance differences are STAGGERING. same source (parquet), same query.
the HDFS results are 1.27GB, ~43 seconds
the S3 results are ~8 minutes!
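for a rough sense of scale, here's the arithmetic on those two timings as a quick python sketch. it assumes the S3 run covered the same 1.27GB as the HDFS run (same source, same query) and reads "~8 minutes" as 480 seconds, so treat the numbers as approximate.

```python
# Rough comparison of the two runs reported above.
# Assumptions: both runs cover the same 1.27GB, and "~8 minutes" means 480 seconds.
DATA_MB = 1.27 * 1000            # 1.27GB expressed in MB (decimal units)
HDFS_SECONDS = 43
S3_SECONDS = 8 * 60

hdfs_rate = DATA_MB / HDFS_SECONDS     # effective MB/s for the HDFS run
s3_rate = DATA_MB / S3_SECONDS         # effective MB/s for the S3 run
slowdown = S3_SECONDS / HDFS_SECONDS   # how many times longer the S3 run took

print(f"HDFS: {hdfs_rate:.1f} MB/s")
print(f"S3:   {s3_rate:.1f} MB/s")
print(f"S3 took {slowdown:.1f}x as long")
```

under those assumptions the S3 run took roughly 11x as long as the HDFS run.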
here are the 2 different profiles
I guess s3 just isn’t very performant?
FWIW - we downloaded the files from s3 on this box and it’s almost immediate, so we’ve verified that there’s really no difference between pulling from s3 and pulling from hdfs as far as network performance goes.
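to put numbers on a download check like this, a small timing harness could look like the sketch below. `time_fetches`, `fake_fetch`, and the file names are all hypothetical and not from our setup; `fake_fetch` is only a stand-in so the sketch runs anywhere, and you'd swap in whatever actually reads the file from s3 or hdfs.

```python
import time
from typing import Callable, Dict, Iterable


def time_fetches(fetch: Callable[[str], bytes], keys: Iterable[str]) -> Dict[str, dict]:
    """Call fetch(key) for each key and record seconds, megabytes, and MB/s."""
    results = {}
    for key in keys:
        start = time.perf_counter()
        data = fetch(key)
        elapsed = time.perf_counter() - start
        size_mb = len(data) / 1_000_000
        results[key] = {
            "seconds": elapsed,
            "megabytes": size_mb,
            "mb_per_s": size_mb / elapsed if elapsed > 0 else float("inf"),
        }
    return results


def fake_fetch(key: str) -> bytes:
    # Hypothetical stand-in: replace with a real s3 or hdfs read of `key`.
    time.sleep(0.01)
    return b"x" * 2_000_000


if __name__ == "__main__":
    # File names here are placeholders, not the real objects.
    stats = time_fetches(fake_fetch, ["part-0.parquet", "part-1.parquet"])
    for key, row in stats.items():
        print(key, f"{row['seconds']:.3f}s", f"{row['mb_per_s']:.1f} MB/s")
```

one caveat: a whole-file download like this mainly measures bulk throughput, so it may not say much about how the query engine itself reads the same files. it could be worth timing both.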