Hi, my understanding is that a Reflection keeps a snapshot of the result data so that queries run fast without touching the raw data sources.
However, in our Dremio deployment hosted on AWS with S3 as distributed storage, the query took 1m 03s when it used the reflection, while the direct query without the reflection needs only 20s. [The raw data source is in S3 too.]
Is there any reason?
Also, I see in the documentation that storing reflections on EBS or EFS is suggested for better performance:
For additional performance benefits running on AWS, reflections can be stored on EBS or EFS instead of S3.
How do I set this up? I believe paths.dist has to be accessible by multiple nodes, rather than being a single EBS mount point. Please advise what we should put in the configuration. Thanks.
The query joins two Parquet files in S3. The SQL is:
SELECT * FROM S3."aws-dremio-dev-ap-southeast-2".test.AGPO_LINE."cdc_snapshot_date=2018-06-05" a
LEFT JOIN S3."aws-dremio-dev-ap-southeast-2".test.AGPO_HD."cdc_snapshot_date=2018-06-05" b
ON CONVERT_TO_FLOAT(a.AGPO_ID, 1, 1, 0) = CONVERT_TO_INTEGER(b.AGPO_ID, 1, 1, 0)
I’m not part of team Dremio, so I’ll leave debugging this to them. I’m just curious since I had a similar issue: can you see in the “Profile” what is causing the delay?
Would you know if the reflection files on S3, the actual S3 buckets and the Dremio servers (I assume EC2) are all in the same region?
Can you also please go to the S3 distributed storage location and check the number of reflection files actually created for this query? How does it compare to the number of raw Parquet files (the non-Dremio ones) you are reading?
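If it helps, here is a rough sketch of how that file count could be done with boto3 instead of clicking through the console. It is only an illustration: the helper names (count_keys, list_pages, compare_counts) are made up for this post, and the bucket and prefix arguments are placeholders, since I don't know where your accelerator folder or raw files actually live. It assumes boto3 is installed and AWS credentials are already configured.

```python
from typing import Dict, Iterable


def count_keys(pages: Iterable[Dict], suffix: str = "") -> int:
    """Count object keys across list_objects_v2 result pages.

    An empty suffix counts every key; e.g. suffix=".parquet" counts
    only keys that end in ".parquet".
    """
    total = 0
    for page in pages:
        # "Contents" is missing from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                total += 1
    return total


def list_pages(bucket: str, prefix: str):
    """Return the list_objects_v2 pages for a bucket/prefix (calls AWS)."""
    import boto3  # imported lazily so count_keys works without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return paginator.paginate(Bucket=bucket, Prefix=prefix)


def compare_counts(bucket: str, reflection_prefix: str, raw_prefix: str) -> Dict[str, int]:
    """Count files under the reflection prefix and the raw-data prefix.

    All three arguments are placeholders to be replaced with real values.
    """
    return {
        "reflection": count_keys(list_pages(bucket, reflection_prefix)),
        "raw": count_keys(list_pages(bucket, raw_prefix), ".parquet"),
    }
```

Calling compare_counts with your own bucket name and the two prefixes returns both counts side by side. A reflection made of many more (and much smaller) files than the raw data would be one thing worth looking at, though that is a guess on my part and not a diagnosis.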
Hi, I am sure the Dremio accelerator folder in S3 is in the same region as the EC2 instances. They are all in the Sydney region.
I have not found the fix, so for now I have disabled distributed storage in the configuration, which means the reflections are stored in local storage. The same query now returns within a few seconds.
I hope the Dremio team can share the correct configuration for an EC2-hosted Dremio cluster with distributed storage for our reference. Thanks.