Physical Dataset Icon created on datalake directory disappears

Hi,
We are evaluating Dremio Community Edition.
We use OBDA HDFS as the store, where we create parquet files in a specific directory. We format this directory into a physical dataset with the JSON format, so the purple folder icon appears, and we build virtual datasets on top of that physical dataset.
We remove and recreate the directory during our batch load, and load the Parquet files afterwards.
All good so far.
However, we notice that, for some unclear reason, the formatting of the HDFS folder sometimes disappears (not daily), so the purple physical dataset icon disappears, causing all virtual datasets to fail.
We unchecked the setting ‘Remove dataset definitions if the underlying data is unavailable’, hoping that this would solve the problem (at least the documentation suggests it should), but the problem remains.

Any ideas on this problem would be really appreciated.

Krgds,

Danny

Hi @DannyPannemans

I am a little confused by your statement “we create parquet files in a specific directory. We format this directory into a physical dataset with the JSON format”. Why would you format Parquet files as JSON?

The formatting can get lost for different reasons. Are you making any source-level changes for which you get a WARNING about metadata being lost?

Thanks
Bali

Hi, I am sorry, you are right: we do format the folder as Parquet, of course :-).
So to rephrase what we do:

  1. remove directory on HDFS if it exists
  2. load parquet files in it
  3. format the directory with Parquet as the format, which makes the directory appear with the purple physical dataset icon
  4. build virtual datasets on top of the PDS

Steps 1) and 2) are repeated on a regular basis.

Sometimes we notice that the purple PDS icon gets lost, causing all VDS on top of it to fail.
The only thing we can do then is to repeat step 3).
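
Until the root cause is found, repeating step 3) could in principle be scripted instead of done by hand in the UI. The following is a minimal sketch, assuming the Dremio v3 Catalog REST API accepts a `POST` to `/api/v3/catalog/{id}` with a `PHYSICAL_DATASET` body to format a folder; the endpoint, payload fields, and `_dremio` token header should be checked against the documentation for your Dremio version. The host, source name `hdfs_source`, folder path, and token are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Dremio REST API reachable on the default port 9047.
DREMIO_URL = "http://localhost:9047"


def build_promote_request(base_url, token, folder_id, path):
    """Build (but do not send) a request that formats a folder as Parquet.

    folder_id: catalog id of the unformatted folder (assumed to look like
               "dremio:/<source>/<folder>"), URL-encoded into the request path.
    path:      list of path components, e.g. ["hdfs_source", "landing", "sales"].
    """
    payload = {
        "entityType": "dataset",
        "id": folder_id,
        "path": path,
        "type": "PHYSICAL_DATASET",
        "format": {"type": "Parquet"},
    }
    url = f"{base_url}/api/v3/catalog/{urllib.parse.quote(folder_id, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"_dremio{token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_promote_request(
    DREMIO_URL,
    token="example-token",
    folder_id="dremio:/hdfs_source/landing/sales",
    path=["hdfs_source", "landing", "sales"],
)
```

Sending it would be a matter of calling `urllib.request.urlopen(request)` at the end of the batch load, after step 2). Building the request separately from sending it keeps the payload easy to inspect before it touches a live system.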

Hope this clarifies our problem.

Danny

@DannyPannemans

Thanks for the explanation. A PDS can lose its formatting for two reasons:

  • If “Remove dataset definitions if underlying data is unavailable.” is checked in the source properties, and you remove all rows from the PDS (it looks like you do this in step #1; I assume that removes the files and not the folder) while a metadata refresh happens, we would remove the folder as a PDS. Try unchecking the flag and see what happens.
  • If a source property is changed and, on clicking “Save”, you get a WARNING stating “This is a metadata impacting change” and you continue to submit, the PDS formatting will be lost.

Thanks
Bali

Hi Bali,

Thanks for your answer.
We have already unchecked the checkbox (see my original post). And yes, we remove the folder and recreate it in every batch run.
Your second reason doesn’t apply, as we don’t make metadata-impacting changes to the source.

Isn’t there another reason?

Krgds,

Danny

Hi @DannyPannemans

Even after unchecking the box, are you still seeing PDS getting unpromoted? Is there a chance the source goes offline?

Yes, Bali. The problem remains. Some PDS remain OK, others are lost.
They are all on the same source, which never goes down.
I don’t understand why the checkbox doesn’t do what it promises…

Rgds,

Danny

@DannyPannemans The checkbox only prevents a dataset from losing its formatting when the data is not present, but formatting could be lost for other reasons:

  • Are there any REST API scripts that remove formatting?
  • Are there any REST API scripts that update the source?

Another option you can try is to check “Automatically format files into physical datasets when users issue queries.”, so that when a query is issued on a PDS that is unformatted, Dremio automatically formats it into a dataset.
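
To narrow down when the formatting is lost, one option is to poll the catalog after each batch run and log whether the folder is still a physical dataset. The sketch below assumes that the v3 Catalog API's `by-path` lookup returns `entityType` `"dataset"` with `type` `"PHYSICAL_DATASET"` for a formatted folder and `entityType` `"folder"` for an unformatted one; those field values are my understanding of the API and should be verified against your Dremio version. The sample responses and the path are hypothetical.

```python
import urllib.parse


def catalog_by_path_url(base_url, path):
    """Return the catalog lookup URL for a list of path components."""
    quoted = "/".join(urllib.parse.quote(part, safe="") for part in path)
    return f"{base_url}/api/v3/catalog/by-path/{quoted}"


def is_promoted_pds(entity):
    """Return True if a catalog entity (parsed JSON dict) is a physical dataset."""
    return (
        entity.get("entityType") == "dataset"
        and entity.get("type") == "PHYSICAL_DATASET"
    )


# Hypothetical responses, trimmed to the fields the check relies on.
formatted = {"entityType": "dataset", "type": "PHYSICAL_DATASET"}
unformatted = {"entityType": "folder"}

lookup_url = catalog_by_path_url(
    "http://localhost:9047", ["hdfs_source", "landing", "sales"]
)
still_promoted = is_promoted_pds(formatted)
lost_formatting = not is_promoted_pds(unformatted)
```

Logging the result with a timestamp next to the batch-load and metadata-refresh times may help correlate the unpromotion with a specific event.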