Is there a way to identify the particular parquet file that is a reflection of a particular VDS?
For example, how would I programmatically determine that the file
/accelerator/0c37988c-44d0-4732-8d94-5a49a8857456/7e1fb5bd-6b54-400b-becf-45e5e35542b9_0/0_0_0.parquet is a reflection of the VDS named CUSTOMER_LIST?
In that case, 0c37988c-44d0-4732-8d94-5a49a8857456 is the reflection id and 7e1fb5bd-6b54-400b-becf-45e5e35542b9_0 is the materialization id. You can use the system table sys.reflections to link reflections with the dataset(s) they are based on.
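To do this programmatically, a small helper can pull the two ids out of the file path. This is only a sketch in Python: it assumes the layout accelerator/<reflection id>/<materialization id>/<file> seen in this thread, and the function name parse_reflection_path is made up for illustration. The materialization folder name is returned unchanged, including its _0 suffix. The reflection id it returns is the value you would then look up in sys.reflections (column names there can differ between Dremio versions, so no query is shown).

```python
from pathlib import PurePosixPath


def parse_reflection_path(path):
    """Split an accelerator file path into its reflection id,
    materialization folder name, and parquet file name.

    Assumes the layout accelerator/<reflection id>/<materialization id>/<file>.
    """
    parts = PurePosixPath(path).parts
    if "accelerator" not in parts:
        raise ValueError("path does not contain an 'accelerator' folder: " + path)
    idx = parts.index("accelerator")
    # The three components after 'accelerator' are the ids and the file name.
    if len(parts) < idx + 4:
        raise ValueError("path is too short to contain both ids and a file: " + path)
    return {
        "reflection_id": parts[idx + 1],
        "materialization_id": parts[idx + 2],
        "file_name": parts[idx + 3],
    }


info = parse_reflection_path(
    "/accelerator/0c37988c-44d0-4732-8d94-5a49a8857456/"
    "7e1fb5bd-6b54-400b-becf-45e5e35542b9_0/0_0_0.parquet"
)
print(info["reflection_id"])
print(info["materialization_id"])
```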
Thanks for the information! I’m a little confused about the materialization id. I see that some accelerator/reflection folders contain multiple sub folders with what look like materialization ids.
For example, we have a folder named accelerator/e82526e3-7ffb-4bc4-aa63-448835f3ea1d that contains three sub folders: 73d0acda-dc86-4823-bab9-07a65f233e13_0, 837f4ee5-aa02-4fc2-a96a-e3f9545a038a_0, and f671677b-c2bb-4351-9d3e-ff2dc5758f65_0. Each of those sub folders contains multiple files: 1_0_0.parquet, 1_1_0.parquet, and 1_2_0.parquet. Can you help me understand what the sub folders and multiple files contain? Are reflections broken up into multiple files and folders?
If so, what is the process to re-assemble these?
Subfolders are materializations of the reflection. Each time a reflection is refreshed, we build a new materialization (the system table sys.materializations contains more data about them). Old materializations will eventually be cleaned up and removed by Dremio after a certain time period (the delay ensures any queries still using them can finish). Reflections may be partitioned into multiple parquet files depending on various factors (the reflection definition, for example).
What exactly do you mean by reassembly?
I’d like to read a parquet file created by Dremio in an external application (such as a python script). Where a folder contains a single parquet file (0_0_0.parquet), I am able to read it in python. However, when a folder contains multiple parquet files (1_0_0.parquet, 1_1_0.parquet, 1_2_0.parquet), do I need to combine these files into one in order to load the entire reflection into memory?
If you are accessing the multi-file dataset from an external application via Dremio (that is, your app queries Dremio tables for the data rather than reading the files directly), you should promote the whole directory as a PDS in Dremio.
If you are accessing the files outside of Dremio with your app, then it depends on which app you are using. Most likely you would read the files one at a time and combine the results.
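For the python case, reading one file at a time and then combining might look roughly like the sketch below. It assumes pandas with a parquet engine (pyarrow or fastparquet) is installed and that the script can see the materialization folder as a local directory. The function names and the example path are hypothetical, not anything Dremio provides.

```python
from pathlib import Path


def list_parquet_parts(materialization_dir):
    """Return the .parquet files in one materialization folder, sorted by name."""
    return sorted(Path(materialization_dir).glob("*.parquet"))


def read_materialization(materialization_dir):
    """Read every part file in the folder and concatenate them
    into a single pandas DataFrame."""
    # Imported here so the listing helper above works without pandas installed.
    import pandas as pd

    parts = list_parquet_parts(materialization_dir)
    if not parts:
        raise FileNotFoundError(
            "no .parquet files found in " + str(materialization_dir)
        )
    # Read one file at a time, then stack the rows into one frame.
    frames = [pd.read_parquet(part) for part in parts]
    return pd.concat(frames, ignore_index=True)


# Example usage (hypothetical local path to one materialization folder):
# df = read_materialization(
#     "accelerator/e82526e3-7ffb-4bc4-aa63-448835f3ea1d/"
#     "73d0acda-dc86-4823-bab9-07a65f233e13_0"
# )
# print(len(df))
```

Only read the files from a single materialization folder. The other sub folders are older or newer refreshes of the same reflection, so mixing them would duplicate rows.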