Why use reflections when reading data from S3?

Hello!

We are trying to read S3 data in Parquet format with gzip compression through Dremio. When creating the Parquet files, before writing them to S3, we set row group size = 512 MB and page size = 8 KB.
When reading through Dremio, we set fs.s3a.connection.maximum = 10000 and fs.s3a.experimental.input.fadvise = random, apart from other standard S3A properties.
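For context, here is a sketch of the configuration described above, assuming the properties are applied via Hadoop configuration (e.g. core-site.xml); parquet.block.size and parquet.page.size are the standard writer-side Parquet property names and take values in bytes:

```xml
<!-- Sketch only: the settings described above, expressed as Hadoop
     configuration properties. Writer-side Parquet sizes are in bytes:
     512 MB row groups, 8 KB pages. -->
<property>
  <name>parquet.block.size</name>
  <value>536870912</value> <!-- row group size: 512 MB -->
</property>
<property>
  <name>parquet.page.size</name>
  <value>8192</value> <!-- page size: 8 KB -->
</property>

<!-- Reader-side S3A tuning used when querying through Dremio -->
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>10000</value>
</property>
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```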
When creating a dataset over a folder of almost 22 GB in total, containing about 15 files ranging from 30 to 250 MB in size, we observe that it takes almost 8 minutes to create the dataset.

This time is already very large, considering that we will be working on much larger data in the future. Dremio also downloads the entire 22 GB to disk, leading to space issues.
The raw reflection takes almost 40 minutes to complete and again consumes a lot of disk space.
This could be forgiven if it were a one-time problem. But on running queries on the data, we found the following issues:
• If a query is accelerated by the raw reflection, the time taken does not improve; it is almost the same.
• Queries never use the aggregation reflection (the profile says: Did not cover query) but use the raw reflection instead, even for aggregation queries.
• On running a query, the entire output of the query is again written to disk, causing space issues.
• For some queries, even the raw reflection is not used, with the message that “it is too expensive”.

No particularly helpful information can be found in the logs.

Since we are not getting any substantial improvement on either the space or the time front, I have to ask: is using reflections really of any help in the case of S3?
If there is something we are missing or not configuring right, we would really appreciate your input.

Thanks and Regards
Akshat Chitransh.

Hey @Akshat_Chitransh! Sorry to hear that you’ve run into issues here. I’ll try to take a first pass on some of your questions. Initial responses in-line:

• If a query is accelerated by the raw reflection, the time taken does not improve; it is almost the same.
Are you creating raw reflections on virtual datasets or directly on top of your tables?
Are these raw reflections on tables/physical datasets? If so, and if your original source format is columnar (Parquet, ORC, etc.), I’d recommend trying out partitioning and sorting on the fields that you expect filters on. Otherwise, you’ll see similar performance to your original datasets. In cases where you’re running a needle-in-the-haystack type of query, you may want to create a raw reflection with both partitioning and sorting on the fields you are filtering by. Typically, creating raw reflections directly on physical datasets (i.e. tables) without sorts/partitions is recommended in cases where your sources are slow (e.g. relational) or you want to offload an operational system.
However, if these reflections are on a VDS with somewhat expensive one-time operations (joins, aggregates, filters, etc.) that you don’t want to repeat on the fly, creating a raw reflection in itself should give you a big performance boost.
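As a concrete sketch of the partition/sort suggestion, a raw reflection can be defined in SQL along these lines (dataset and column names here are hypothetical, and the exact DDL may vary by Dremio version):

```sql
-- Hypothetical raw reflection, partitioned and locally sorted on the
-- fields most commonly used in filters, for needle-in-the-haystack queries.
ALTER DATASET s3source."events"
CREATE RAW REFLECTION events_raw
USING
  DISPLAY (event_date, customer_id, amount)
  PARTITION BY (event_date)
  LOCALSORT BY (customer_id)
```

A query filtering on event_date could then prune partitions instead of scanning the full dataset.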

• Queries never use the aggregation reflection (the profile says: Did not cover query) but use the raw reflection instead, even for aggregation queries.
Would be great to get an example w/ query profile so that we can discuss/brainstorm. Aggregation reflections are typically where our users see 10-100+ times improvements. Have you had the chance to check docs for determining candidates for dimensions/measures?
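To illustrate, an aggregation reflection’s dimensions and measures can be declared in SQL roughly like this (dataset and column names are hypothetical, and the exact DDL may vary by Dremio version):

```sql
-- Hypothetical sketch: an aggregation reflection whose dimensions cover
-- the GROUP BY columns, and whose measures cover the aggregated columns,
-- of the queries you want accelerated.
ALTER DATASET s3source."sales"
CREATE AGGREGATE REFLECTION sales_agg
USING
  DIMENSIONS (region, sale_date)
  MEASURES (amount)
```

A query grouping by region and summing amount could then be matched against this reflection, while a query grouping by a column not listed under DIMENSIONS would typically report “Did not cover query”.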

• On running a query, the entire output of the query is again written to disk, causing space issues.
Do you mean in the UI? UI Runs are automatically capped at around 1M rows – we truncate results automatically once we observe around this many results. If running through JDBC/ODBC/REST, results are not capped/persisted. Also, I’d recommend not doing RUNs of select * type queries in the UI, as these will be persisted. Typically there is no way to consume this much data in the UI anyway, so a Preview is the recommended approach over a Run, unless you are actually interested in consuming the results.

• For some queries, even the raw reflection is not used, with the message that “it is too expensive”.
Since original format is Parquet, raw reflections that directly select the original datasets as-is will not be “costed” less than going against your original files by Dremio’s optimizer (hence “too expensive”). Again, I’d recommend creating either 1) agg. reflections 2) reflections on VDSs 3) reflections on physical datasets with sorts/partitions corresponding to common filters/access patterns.

Also, I’d recommend keeping the default values for row group size and page size. These are values we’ve determined to be optimized for Dremio’s consumption. Curious, what’s your experience with these values? What’s the motivation for your changes? Is it optimizing for consumption in another tool?

Hello Akshat, a few questions:

  1. where is Dremio deployed?
  2. how many nodes are in your Dremio cluster, and how much RAM and how many CPU cores per node?
  3. are you only running queries through the SQL console in Dremio, or have you tried via ODBC/JDBC?

things that may help:

  1. If you are creating Parquet files for Dremio, please see these recommendations on configurations for Parquet: https://docs.dremio.com/advanced-administration/parquet-files.html

  2. If your raw data is already in Parquet, then a Raw Reflection may not provide any benefits, as it is also stored in Parquet. It can still be helpful in some cases: a) the Raw Reflection may be sorted or partitioned in a way that is different from the raw data, which can accelerate some queries; b) the Raw Reflection may be closer to your Dremio cluster or on a faster storage sub-system; c) it may contain a subset of the columns/rows of the source data; d) it may perform joins ahead of time, removing the need to perform the join at query time (denormalization). There are other examples, but hopefully you get the idea.

  3. Aggregation Reflections can be a very significant performance improvement. It sounds like your particular reflection isn’t configured to cover the queries you are issuing. Can you describe how you have it configured and provide a sample query that isn’t being accelerated? Normally, if the query profile says that it wasn’t covered by the reflection, that means you are missing columns, or there is a join condition in the virtual dataset that makes it not cover your query. Another example is that you don’t have the correct aggregation operators enabled for a specific measure (e.g. MAX, MIN).

  4. Also, if you haven’t seen this tutorial it may be helpful: https://www.dremio.com/tutorials/getting-started-with-data-reflections/
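The join-ahead-of-time case in point 2(d) above can be sketched in Dremio SQL: define a virtual dataset that performs the join once, then attach a raw reflection so the join is materialized ahead of time (all names here are hypothetical, and the DDL may differ across Dremio versions):

```sql
-- Hypothetical VDS that denormalizes a join ahead of time
CREATE VDS analytics.orders_enriched AS
SELECT o.order_id, o.order_date, o.amount, c.customer_name, c.region
FROM s3source.orders o
JOIN s3source.customers c ON o.customer_id = c.customer_id;

-- Raw reflection on the VDS: queries against orders_enriched can be
-- served from the materialization instead of re-running the join.
ALTER DATASET analytics.orders_enriched
CREATE RAW REFLECTION orders_enriched_raw
USING DISPLAY (order_id, order_date, amount, customer_name, region)
```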