Streamlining Query Access to Parquet Tables in Dremio Connected to HDFS

I connect to a Parquet table in HDFS, but I have to format it manually in Dremio before I can query it. Is it possible to skip this step?


@bigfacewo Absolutely. Edit your HDFS source, click on the Metadata tab, and check the option “Automatically format files into physical datasets when users issue queries.” See the screenshot below.

Important: if you have CSV files, do not use this option. It assumes the field delimiter is a comma, so a file with any other delimiter may be read as one large field.
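For anyone who manages sources through the REST API rather than the UI, here is a minimal sketch of flipping the same setting on a source definition. It assumes the checkbox maps to the `autoPromoteDatasets` field under `metadataPolicy` in the source JSON; please verify that field name against the API docs for your Dremio version. The `example_source` dict is a hypothetical, heavily trimmed definition, not real output.

```python
import copy
import json


def enable_auto_promotion(source_definition: dict) -> dict:
    """Return a copy of a source definition with auto-promotion switched on.

    Assumption: the UI checkbox corresponds to
    metadataPolicy.autoPromoteDatasets in the source JSON.
    """
    updated = copy.deepcopy(source_definition)
    # Create the policy section if the definition does not carry one yet.
    policy = updated.setdefault("metadataPolicy", {})
    policy["autoPromoteDatasets"] = True
    return updated


# Hypothetical, trimmed source definition used only for illustration.
example_source = {
    "entityType": "source",
    "name": "hdfs_source",
    "type": "HDFS",
    "metadataPolicy": {"autoPromoteDatasets": False},
}

if __name__ == "__main__":
    print(json.dumps(enable_auto_promotion(example_source), indent=2))
```

The function only edits the dictionary; sending the updated definition back to Dremio is left out on purpose, since that step depends on your authentication setup.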

Thanks, I will give it a try.

This partially solves my problem, but not entirely. It still doesn’t address the issue I mentioned earlier: there is no way to actively refresh the metadata for the entire connection, so Dremio does not promptly pick up tables and directories newly added by users.

@bigfacewo When you say “entire connection”, do you mean the source level? You would have set a metadata refresh frequency on the source; is it not refreshing at that interval? How long does each background refresh take? You can look at metadata_refresh.log to find out.

For new tables, you have one of three choices:

  • If you do not know that a new table has been added, you have to wait for the background (BG) refresh.
  • If you know a new table has been added, you can either
    #1 Run ALTER PDS on it to refresh its metadata
    #2 Just query the table
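As a rough illustration of option #1, the sketch below builds an `ALTER PDS ... REFRESH METADATA` statement for a dataset path. The source, folder, and table names are hypothetical, and the `AUTO PROMOTION` clause is included on the assumption that your Dremio version supports it; check the SQL reference for your release before relying on it.

```python
def quote_path(parts):
    """Double-quote each path component, doubling any embedded quotes
    (the standard SQL convention for quoted identifiers)."""
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)


def refresh_metadata_sql(path_parts, auto_promote=True):
    """Build an ALTER PDS statement that refreshes one dataset's metadata."""
    sql = f"ALTER PDS {quote_path(path_parts)} REFRESH METADATA"
    if auto_promote:
        # Assumed optional clause; drop it if your version rejects it.
        sql += " AUTO PROMOTION"
    return sql


if __name__ == "__main__":
    # Hypothetical path: source "hdfs_source", folder "warehouse", table "new_table".
    print(refresh_metadata_sql(["hdfs_source", "warehouse", "new_table"]))
```

You can paste the resulting statement into the Dremio SQL editor or submit it through whatever client you already use.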

I have found that the latest version of Dremio can directly query newly added tables in relational databases without waiting for a metadata refresh, but I have not tested HDFS yet. Can this also be done with HDFS, that is, querying directly without refreshing metadata?

@bigfacewo Yes. If you know the dataset name, the checkbox I mentioned previously is checked, and you query the folder, it will get automatically promoted (it changes from a folder to a purple grid in the Dremio UI) and the results will be displayed.
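If you want to script that "just query the folder" step, here is a minimal sketch that prepares a request for Dremio's SQL REST endpoint (`POST /api/v3/sql`) using only the Python standard library. The base URL, the authorization value, and the dataset path are placeholders, and the exact form of the `Authorization` header depends on how you authenticate, so treat this as an outline rather than a drop-in client.

```python
import json
import urllib.request


def build_sql_request(base_url, auth_header, sql):
    """Prepare (but do not send) a POST to the SQL endpoint."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v3/sql",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": auth_header,
        },
        method="POST",
    )


# Hypothetical folder path inside an HDFS source with auto-promotion enabled.
QUERY = 'SELECT * FROM "hdfs_source"."warehouse"."new_table" LIMIT 10'

if __name__ == "__main__":
    req = build_sql_request("http://localhost:9047", "<your auth header value>", QUERY)
    print(req.get_method(), req.full_url)
    # To actually submit the job: urllib.request.urlopen(req)
```

The endpoint starts a job rather than returning rows inline, so fetching the results is a separate step that this sketch leaves out.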