Reflection stored in AWS S3 is slow? Store it in EBS instead?

Hi, my understanding is that a Reflection keeps a snapshot of the result data so that queries run fast without touching the raw data sources.
However, in our Dremio cluster hosted on AWS with S3 as distributed storage, querying the data from the reflection took 1m:03s, while the same query run directly without the reflection needed only 20s. [The raw data source is in S3 too.]
Is there a reason for this?

Also, I see that the documentation suggests storing reflections on EBS or EFS for better performance:

For additional performance benefits running on AWS, reflections can be stored on EBS or EFS instead of S3.

How do we set this up? I believe paths.dist must be accessible from multiple nodes, rather than being a single EBS mount point. Please advise what we should put in the configuration. Thanks.

With Reflection (1m:03s)

Without Reflection (20s)

Hi - this is surprising.

What is the raw data source, and what options did you use for your raw reflection?

Can you share the query profile for each query?

Thanks for your prompt reply.

The query joins two S3 Parquet files. The SQL is:

SELECT * FROM S3."aws-dremio-dev-ap-southeast-2".test.AGPO_LINE."cdc_snapshot_date=2018-06-05" a
LEFT JOIN S3."aws-dremio-dev-ap-southeast-2".test.AGPO_HD."cdc_snapshot_date=2018-06-05" b
ON CONVERT_TO_FLOAT(a.AGPO_ID, 1, 1, 0) = CONVERT_TO_INTEGER(b.AGPO_ID, 1, 1, 0)

The query profiles are attached below:

  1. without reflection (38s)
    query_from_dataset_without_acceleration.zip (25.3 KB)

  2. with reflection (2m:03s)
    query_from_dataset_with_acceleration.zip (22.4 KB)

Thanks, we will take a closer look over the next day or two.

Was this investigated?


I’m not part of team Dremio so I’ll leave debugging this to them - I’m just curious since I had a similar issue, can you see in the “Profile” what is causing the delay?

Hi @poonhs-esquel

Would you know if the reflection files on S3, the actual S3 buckets and the Dremio servers (I assume EC2) are all in the same region?

Can you also please go to the S3 distributed storage location and check the number of reflection files actually created for this query? How does it compare to the number of raw (non-Dremio) Parquet files you are reading?

Thanks
@balaji.ramaswamy

Hi, I am sure the Dremio accelerator folder in S3 is in the same region as the EC2 instances. They are all in the Sydney region.
Since I could not find a fix, I have disabled distributed storage in the configuration, so the reflections are now stored in local storage. The same query now returns within a few seconds.
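
For anyone comparing the setups discussed here, the difference can be sketched in dremio.conf roughly as follows. This is a minimal sketch, not the configuration actually used in this thread: the bucket name, folder, and mount path are hypothetical placeholders, and the URI forms are written from memory of Dremio's distributed storage documentation, so please verify them against the docs for your version. The S3 option also needs credentials supplied separately (in core-site.xml, as far as I recall).

```
# dremio.conf (HOCON). All values below are illustrative placeholders.
paths: {
  # Node-local working directory
  local: "/var/lib/dremio"

  # Option A: distributed storage on S3 (hypothetical bucket and folder)
  # dist: "dremioS3:///my-bucket/dremio-dist"

  # Option B: a shared file system such as EFS/NFS, mounted at the
  # same path on every node (hypothetical mount path)
  # dist: "file:///mnt/dremio-dist"

  # Option C: the default local store under paths.local, which is
  # roughly what "disabling" distributed storage falls back to
  dist: "pdfs://"${paths.local}"/pdfs"
}
```

Only one dist entry should be active at a time. Option B is the one that addresses the multi-node concern raised earlier, since a single EBS volume is normally attached to one instance while an EFS mount can be shared by all nodes.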

I hope the Dremio team can share the correct configuration for an EC2-hosted Dremio cluster with distributed storage for our reference. Thanks.

Hi @poonhs-esquel,

We have filed an improvement request for this with our engineering team. Can you please tell me the number of reflection Parquet files?

Thanks,
@balaji.ramaswamy

Sorry, I have already removed the Parquet files from S3. As far as I remember, my data only used 1-3 Parquet files in the accelerator folder in S3.