Hi, I am pretty new to Dremio, so these questions might be basic. Apologies in advance.
The documentation on reflections says:
“Dremio maintains the data in a highly compressed, columnar form based on Apache Parquet. Dremio stores this data near Dremio’s query engine which can be scaled out with additional nodes to support larger data volumes, greater concurrency, and lower latency”
My questions are:
If I create a reflection based on a dataset that fetches 1 TB of records, does Dremio store the entire 1 TB in an optimized format after compression? (I understand that Parquet can compress data by up to 95%.)
From an infrastructure planning perspective, how much storage should we allocate in such a scenario? Is there a best-practices document we can refer to for storage planning in general?
You need to run a test query on a subset of the physical dataset and compare the resulting size in Dremio to get an estimate of the storage needed. If the records reside in a database, run the same query there to see the compressed size at the source; if the source is raw files, check the size of a subset of those files. The storage needed will be roughly proportional to what your test query showed, and compression often does even better if your data contains repeated values or a sequential date or ID column.
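As a concrete sketch of that approach: create a virtual dataset over a small, date-bounded slice of the source, enable a raw reflection on it, and scale the observed footprint by the ratio of total rows to sample rows. The source, space, dataset, and column names below are placeholders I made up for illustration, and depending on your Dremio version the statement may be `CREATE VIEW` rather than `CREATE VDS`:

```sql
-- Hypothetical sample VDS over one day of a table with a date column.
-- Names (estimation_space, my_source, events, event_date) are placeholders.
CREATE VDS estimation_space.events_sample AS
SELECT *
FROM my_source."events"
WHERE event_date = DATE '2024-01-15';

-- Row counts for the extrapolation ratio.
SELECT COUNT(*) AS sample_rows FROM estimation_space.events_sample;
SELECT COUNT(*) AS total_rows  FROM my_source."events";
```

After enabling a raw reflection on the sample VDS and letting it materialize, read its footprint (from the UI, or see the system-table query below) and multiply: estimated full footprint ≈ sample footprint × (total_rows / sample_rows). This assumes the sample is representative of the whole; heavily skewed data can throw the estimate off.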
As for checking actual sizes: the Reflections tab shows the size of each reflection; in the admin Jobs view, selecting a job shows the size it produced; and the admin Reflections list shows the footprint of each reflection itself.
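If you prefer SQL to the UI, reflection metadata is also exposed through a system table. A minimal sketch, assuming your Dremio version has `sys.reflections` (the available columns vary across versions, so start with `SELECT *` to see what yours exposes, including any size or footprint columns):

```sql
-- Inspect reflection metadata; look for size/footprint columns
-- in the output, where your version exposes them.
SELECT *
FROM sys.reflections;
```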