How does reflection work

Hey,

I’m wondering how reflections work internally.

For example, if I have some data stored in Amazon S3 or GCS, does Dremio copy all the data in order to create the raw reflections? Or does it just read the data once, collect some metadata, and use that?

@nkz A few things:

  • A raw reflection on the entire table (no filters, all columns selected) will create a copy of the data in the dist location defined for reflections
  • Every time you refresh, Dremio will try to do an incremental refresh whenever possible, so you will not see multiple copies
  • Now, with that said, a few recommendations
  • Always create a VDS on top of your Iceberg table that covers only the columns and rows you query often. For example, the table may hold historical data, but if the dashboard only queries the last 2 years of data, create the VDS on only the last 2 years
  • The PDS may contain many columns while the dashboard only queries a specific set of them; create the VDS on only those columns
  • Create the raw reflection on the VDS
  • Now, on raw versus agg reflections: if your queries use aggregates (which is usually the case for dashboard queries), prefer agg reflections over raw reflections. An agg reflection is focused and has a smaller footprint, so it is fast to create and the dashboard queries that use it will really benefit.
  • Depending on the size of the dataset, it sometimes helps to create a raw reflection first and then an agg reflection, as creation of the agg reflection would be accelerated by the raw one
  • Lastly, refresh only when needed, as every refresh consumes CPU and memory
  • Remember to create a separate engine and route reflection creations to that engine
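
To make the sizing argument in the recommendations above concrete, here is a toy sketch in plain Python. It is not Dremio's API and does not show how Dremio stores reflections; the sample rows and the helper names (`make_vds`, `raw_reflection`, `agg_reflection`) are all made up for illustration. It only shows why a narrow view plus an aggregate materialization ends up much smaller than a full copy of the table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical "PDS": several columns and several years of history.
PDS = [
    {"order_date": date(2019, 3, 1), "region": "EU", "amount": 10.0, "notes": "n1", "customer": "c1"},
    {"order_date": date(2020, 7, 15), "region": "US", "amount": 20.0, "notes": "n2", "customer": "c2"},
    {"order_date": date(2023, 2, 10), "region": "EU", "amount": 30.0, "notes": "n3", "customer": "c3"},
    {"order_date": date(2023, 8, 5), "region": "US", "amount": 40.0, "notes": "n4", "customer": "c4"},
    {"order_date": date(2024, 1, 20), "region": "EU", "amount": 50.0, "notes": "n5", "customer": "c5"},
    {"order_date": date(2024, 6, 30), "region": "US", "amount": 60.0, "notes": "n6", "customer": "c6"},
]


def make_vds(rows, columns, cutoff):
    """Model a VDS: keep rows on or after `cutoff` and only the listed columns."""
    return [{c: r[c] for c in columns} for r in rows if r["order_date"] >= cutoff]


def raw_reflection(vds_rows):
    """Model a raw reflection: a materialized copy of the view's rows."""
    return [dict(r) for r in vds_rows]


def agg_reflection(raw_rows, dimension, measure):
    """Model an agg reflection: one pre-summed entry per dimension value."""
    totals = defaultdict(float)
    for r in raw_rows:
        totals[r[dimension]] += r[measure]
    return dict(totals)


# Only the columns and the date range the dashboard actually uses.
vds = make_vds(PDS, ["order_date", "region", "amount"], cutoff=date(2023, 1, 1))
raw = raw_reflection(vds)
agg = agg_reflection(raw, "region", "amount")  # built from the raw copy, not the PDS

print(len(PDS), len(raw), len(agg))  # prints: 6 4 2
```

In this toy data the full table has 6 rows and 5 columns, the raw copy of the view has 4 rows and 3 columns, and the aggregate has just 2 entries. Building the aggregate from the raw copy rather than from the source also mirrors the point about an agg reflection being accelerated by a raw one.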

Thanks
Bali
