Files and Folders

We have data in Parquet with the folder structure shown below for store_sales:

/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452635
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452636
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452637
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452638
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452639
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452640
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452641
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452642

After importing the dataset into Dremio, the partition folder for store_sales shows up as dir0 with the values
ss_sold_date_sk=2452635
ss_sold_date_sk=2452636
ss_sold_date_sk=2452637
ss_sold_date_sk=2452638
ss_sold_date_sk=2452639
ss_sold_date_sk=2452640
ss_sold_date_sk=2452641
ss_sold_date_sk=2452642

We have a date dimension table, date_dim, with a column d_date_sk containing the values

2452635
2452636
2452637
2452638
2452639
2452640
2452641
2452642

What are the best practices in such situations? Hive handles this column by stripping the 'ss_sold_date_sk=' prefix, so the join is straightforward.
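For reference, the Hive-side join looks roughly like this (table and column names as in the example above):

-- Hive: the partition key ss_sold_date_sk is already exposed as a plain column.
SELECT d.d_date_sk, COUNT(*) AS sales
FROM store_sales s
JOIN date_dim d
  ON s.ss_sold_date_sk = d.d_date_sk
GROUP BY d.d_date_sk;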

Did you try creating a virtual dataset that extracts the date key out of the dir0 column? You could then use that new dataset to perform your join the same way as in Hive.
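A minimal sketch of such a virtual dataset, assuming the HDFS source is mounted as hdfs and the space is called myspace (both hypothetical names), and that stripping the ss_sold_date_sk= prefix from dir0 recovers the key:

-- Strip the "ss_sold_date_sk=" prefix from dir0 and cast it so it can be
-- joined against date_dim.d_date_sk. The dataset path is only an example.
CREATE VDS myspace.store_sales_with_date_sk AS
SELECT
  s.*,
  CAST(REPLACE(s.dir0, 'ss_sold_date_sk=', '') AS INTEGER) AS ss_sold_date_sk
FROM hdfs.hive.warehouse."tpcds_pqt.db".store_sales AS s

The same transformation can also be done in the UI (extract/replace on dir0, then save as a virtual dataset); after that, the join against date_dim is written exactly as in Hive.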

Here are a couple of links which might be of interest to you:

Is there a plan to make this part of the discovery process so that it is compatible with Hive? We would like the same queries to run on both without any custom UDFs in Dremio.

Also, is there an easy way to generate the full SELECT statement? Is there IntelliSense built into the UI?

We would love to have 'select fielda, fieldb from table1' generated with a click on the table. That would make it easy to create UDFs across the 100-odd tables we might have.
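One way we could approximate this today, sketched below on the assumption that Dremio exposes the standard INFORMATION_SCHEMA.COLUMNS view (the schema and table names are only examples), is to pull the column list out and paste it into a SELECT:

-- List the columns of one table in order so a full SELECT list can be assembled.
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'hdfs.hive.warehouse.tpcds_pqt.db'
  AND TABLE_NAME = 'store_sales'
ORDER BY ORDINAL_POSITION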

Try connecting to Hive directly from Dremio (choose the Hive source). In that case you should get your partition keys as columns directly in your table.
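With a Hive source (hypothetically named hive_prod below, pointing at the tpcds_pqt database), the partition key comes through as an ordinary column, so the join reads the same as it does in Hive:

-- ss_sold_date_sk is surfaced via the Hive metastore; no dir0 handling needed.
SELECT d.d_date_sk, COUNT(*) AS sales
FROM hive_prod.tpcds_pqt.store_sales s
JOIN hive_prod.tpcds_pqt.date_dim d
  ON s.ss_sold_date_sk = d.d_date_sk
GROUP BY d.d_date_sk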

That works fine when we deploy Dremio with YARN. We are also trying to do it on an isolated, stand-alone cluster for workload management purposes.

Do you mean that you don't have access to the Hive metastore on your standalone cluster, or that you deploy Dremio using YARN (which you don't need to do if you don't want to)?

We have both: one deployment with YARN and another stand-alone installation of Dremio, both reading data from remote storage.

If you use the Hive source instead of the HDFS source, the metadata is fetched from Hive via the Hive metastore, but the query itself is still executed by Dremio, which accesses the underlying data directly.

Yes. In the standalone cluster there is no Hive; it's just a Dremio coordinator and workers.

As long as you provide the metastore URI of your remote Hive when configuring the Hive source, it should be fine. You don't need a Hive installation on your standalone cluster.

Due to multi-tenancy restrictions, this cluster cannot access the Hive metastore. It just discovers the data and creates datasets in Dremio.