Why use reflection on reading data from S3?

Hello Akshat, a few questions:

  1. Where is Dremio deployed?
  2. How many nodes are in your Dremio cluster, and how much RAM and how many CPU cores does each node have?
  3. Are you only running queries through the SQL console in Dremio, or have you also tried ODBC/JDBC?

Things that may help:

  1. If you are creating Parquet files for Dremio, please see these recommendations on configurations for Parquet: https://docs.dremio.com/advanced-administration/parquet-files.html

  2. If your raw data is already in Parquet, then a Raw Reflection may not provide any benefit, since the reflection is also stored as Parquet. It can still help in some cases: a) the Raw Reflection may be sorted or partitioned differently from the raw data, which can accelerate some queries; b) the Raw Reflection may be stored closer to your Dremio cluster or on a faster storage subsystem; c) it may contain only a subset of the columns/rows of the source data; d) it may perform joins ahead of time (denormalizing the data), removing the need to perform the join at query time. There are other examples, but hopefully you get the idea.

  3. Aggregation Reflections can provide a very significant performance improvement. It sounds like your particular reflection isn’t configured to cover the queries you are issuing. Can you describe how you have it configured and provide a sample query that isn’t being accelerated? Normally, if the query profile says that the query wasn’t covered by the reflection, that means the reflection is missing columns the query uses, or there is a join condition in the virtual dataset that prevents it from covering your query. Another possibility is that you don’t have the correct aggregation operators enabled for a specific measure (e.g., MAX, MIN).

  4. Also, if you haven’t seen this tutorial, it may be helpful: https://www.dremio.com/tutorials/getting-started-with-data-reflections/
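
To make point 3 a bit more concrete, here is a rough Python sketch of the kind of checklist I would go through by hand when an Aggregation Reflection doesn't cover a query. This is only a toy model, not Dremio's actual matching logic and not a Dremio API: the names `AggReflection`, `Query` and `coverage_gaps`, as well as the columns in the example, are made up for illustration. It only compares the reflection's dimensions and enabled measure operators against what the query asks for, and it ignores joins entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class AggReflection:
    """Hypothetical description of an Aggregation Reflection's configuration."""
    dimensions: Set[str]
    # measure column -> aggregation operators enabled for it, e.g. {"SUM", "COUNT"}
    measures: Dict[str, Set[str]]


@dataclass
class Query:
    """Hypothetical summary of a query: GROUP BY columns and (operator, column) pairs."""
    group_by: Set[str]
    aggregates: List[Tuple[str, str]]


def coverage_gaps(reflection: AggReflection, query: Query) -> List[str]:
    """Return the reasons this toy model thinks the reflection cannot cover the query."""
    gaps = []
    # Every GROUP BY column has to be a dimension of the reflection.
    for col in sorted(query.group_by - reflection.dimensions):
        gaps.append(f"missing dimension: {col}")
    # Every aggregated column has to be a measure with the right operator enabled.
    for op, col in query.aggregates:
        enabled = reflection.measures.get(col)
        if enabled is None:
            gaps.append(f"missing measure: {col}")
        elif op.upper() not in enabled:
            gaps.append(f"{op.upper()} not enabled for measure: {col}")
    return gaps


# Made-up example: the reflection only enables SUM and COUNT on "amount".
reflection = AggReflection(
    dimensions={"region", "order_date"},
    measures={"amount": {"SUM", "COUNT"}},
)
query = Query(group_by={"region"}, aggregates=[("MAX", "amount"), ("SUM", "discount")])

for gap in coverage_gaps(reflection, query):
    print(gap)
```

If you list your reflection's dimensions and measures next to the columns and aggregates in the sample query, gaps like these are usually the first thing to look for before digging into the query profile.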