Reflection in AWS S3 is slow? store in EBS?

poonhs-esquel · June 20, 2018, 2:40am

Hi, my understanding of a Reflection is that it keeps a snapshot of the result data so that queries run fast instead of working on the raw data sources.
However, in our Dremio deployment hosted in AWS with S3 as distributed storage, a query took 1m:03s when served from the reflection, while the same query run directly without the reflection needed only 20s. (The raw data source is in S3 too.)
Is there any reason for this?

Also, I see that the documentation suggests storing reflections in EBS or EFS for better performance:

For additional performance benefits running on AWS, reflections can be stored on EBS or EFS instead of S3.

How do I set this up? I believe paths.dist must be accessible from multiple nodes, whereas an EBS volume is a single mount point. Please advise what we should put in the configuration. Thanks.

With Reflection (1m:03s)

Without Reflection (20s)

kelly · June 20, 2018, 7:07am

Hi - this is surprising.

What is the raw data source, and what options did you use for your raw reflection?

Can you share the query profile for each query?

poonhs-esquel · June 21, 2018, 2:38am

Thanks for your prompt reply.

The query joins two Parquet files in S3. The SQL is:

SELECT * FROM S3."aws-dremio-dev-ap-southeast-2".test.AGPO_LINE."cdc_snapshot_date=2018-06-05" a
LEFT JOIN S3."aws-dremio-dev-ap-southeast-2".test.AGPO_HD."cdc_snapshot_date=2018-06-05" b
ON CONVERT_TO_FLOAT(a.AGPO_ID, 1, 1, 0) = CONVERT_TO_INTEGER(b.AGPO_ID, 1, 1, 0)

The query profile is as below:

without reflection (38s)
query_from_dataset_without_acceleration.zip (25.3 KB)

with reflection (2m:03s)
query_from_dataset_with_acceleration.zip (22.4 KB)


kelly · June 21, 2018, 2:40am

Thanks, we will take a closer look over the next day or two.

Aditya_Chandra · June 25, 2018, 3:13pm

Was this investigated?

tomasienrbc · June 25, 2018, 6:25pm

I’m not part of team Dremio, so I’ll leave debugging this to them. I’m just curious, since I had a similar issue: can you see in the “Profile” what is causing the delay?

balaji.ramaswamy · June 25, 2018, 8:41pm

Hi @poonhs-esquel

Would you know if the reflection files on S3, the actual S3 buckets and the Dremio servers (I assume EC2) are all in the same region?

Can you also please go to the S3 distributed storage location and check the number of reflection files actually created for this query? How does it compare to the number of raw Parquet files (the non-Dremio ones) you are reading?

Thanks
@balaji.ramaswamy

poonhs-esquel · June 28, 2018, 3:18am

Hi, I am sure the Dremio accelerator folder in S3 is in the same region as the EC2 instances. They are all in the Sydney region.
Not having found a fix, I have disabled the distributed storage in the configuration, so the reflections are now stored in local storage. The same query now returns within a few seconds.

So I hope the Dremio team can share the correct configuration for an EC2-hosted Dremio cluster with dist storage, for our reference. Thanks.

balaji.ramaswamy · June 28, 2018, 8:31pm

Hi @poonhs-esquel,

We have filed an improvement on this with our engineering team. Can you please tell me the # of reflection Parquet files?

Thanks,
@balaji.ramaswamy

poonhs-esquel · June 29, 2018, 1:10am

Sorry, I have already removed the Parquet files from S3. As I remember, though, my data only took up 1-3 Parquet files in the accelerator folder in S3.
