Files and Folders

Madhu · March 1, 2018, 7:05pm

We have data in parquet with a folder structure as shown below for store_sales as

/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452635
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452636
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452637
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452638
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452639
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452640
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452641
/hive/warehouse/tpcds_pqt.db/store_sales/ss_sold_date_sk=2452642

After importing the dataset into dremio, the folder for store_sales shows up as dir0 with values
ss_sold_date_sk=2452635
ss_sold_date_sk=2452636
ss_sold_date_sk=2452637
ss_sold_date_sk=2452638
ss_sold_date_sk=2452639
ss_sold_date_sk=2452640
ss_sold_date_sk=2452641
ss_sold_date_sk=2452642

We have a date dimension table date_dim with a column d_date_sk with values

2452635
2452636
2452637
2452638
2452639
2452640
2452641
2452642

What are our best practices in such situations? Hive handles the new column by stripping the ‘ss_sold_date_sk=’ part so that when we join, it is straight forward

laurent · March 1, 2018, 7:11pm

Did you try to create a virtual dataset to extract the date out of the dir0 column? Then, you could use that new dataset to perform your join the same way as in hive

Here are a couple of links which might be of interest to you:

Madhu · March 1, 2018, 7:33pm

Is there a plan to make it part of the discovery process to be compatible with hive. We would like to have the same queries to be executed on both with any custom UDFs in Dremio

Madhu · March 1, 2018, 7:47pm

Also, is there an easy way to generate the full select statement. Is there intellisense built into the UI

We would love to have select fielda, fieldb from table1 generated with a click on the table. This makes it easy to create UDFs across 100 odd tables we might have

yufeldman · March 1, 2018, 8:01pm

Try to connect to Hive directly from Dremio (choose Hive source). In this case you should get your partition keys as columns directly in your table

Madhu · March 1, 2018, 8:07pm

That works fine if we embed dremio with yarn. We are also trying to do it in an isolated/stand-alone cluster for workload management purposes.

yufeldman · March 1, 2018, 8:12pm

Do you mean you don’t have access to Hive Metastore on your standalone cluster or you mean you deploy Dremio using YARN (which you don’t need to do if you don’t want to)

Madhu · March 1, 2018, 8:14pm

we have both. One with yarn and other stand-alone installation of dremio. Both reading data from a remote storage

laurent · March 1, 2018, 8:19pm

If you use the Hive source instead of the HDFS source, the metadata is fetched from Hive using the Hive metastore, but the actual query is still managed by Dremio, by directly accessing the underlying data.

Madhu · March 1, 2018, 8:21pm

yes. In the standalone cluster there is no hive. Its just dremio coordinator and worker

yufeldman · March 1, 2018, 8:28pm

As long as you provide metastore URI to your remote Hive while configuring Hive source it should be fine. You don’t need Hive installation on your standalone cluster

Madhu · March 2, 2018, 3:08pm

Due to multi-tenancy restrictions, this cluster cannot access the hive metastore. Its discover the data and create datasets in dremio

Topic		Replies	Views
Dremio OS hive metadata error (Windows)	5	1390	November 9, 2018
How to create dataset for daily data report	3	1228	January 17, 2019
Impala and Kudu as Source	9	3690	November 29, 2018
Can configure the custom Dremio metadata storage using hive metastore?	5	1279	November 23, 2020
Working with big dataset	7	2560	November 16, 2018

Files and Folders

Related topics