Stacking files?

fish747 · January 18, 2018, 8:10pm

Hi, I’m new to Dremio. Is there an easy way in Dremio to stack a bunch of large files that have the exact same layout (same columns)? Basically a union query across 20 tables with around 50 identical columns.

Note: I can do this in Alteryx really quickly, was given Dremio as an alternative to Alteryx.

Thanks for any feedback.

can · January 18, 2018, 8:29pm

Hey @fish747, glad to have you onboard! You can easily achieve this by converting a folder of files (with the same layouts) to Physical Datasets in Dremio. Check the Directories section of this article from our documentation for the details. Let us know if you have any other questions.

fish747 · January 18, 2018, 9:49pm

I have two Excel files as an example. They have the same columns. I have defined all the files in the directory.

The directory will not show the button on the right when I hover over it (the one with the directory pointing to the other directory).

can · January 18, 2018, 10:09pm

Got it. Converting folders to datasets is currently supported for folders within sources (e.g. NAS, HDFS, S3, etc.) where users often deal with 1000s of files. We also have a roadmap item to support this within user spaces.

Some thoughts on working around this:

You can add the files you have to a source (e.g. S3) and then convert the folder to a dataset.
If you are running Dremio locally on a single node, add your local filesystem as a NAS source (point Root to your hard drive’s root) and then convert the folder to a dataset.
As you mentioned, you can manually create a Virtual Dataset that is a UNION of all the datasets you have.
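
For the manual UNION option, writing out 20 SELECT statements by hand gets tedious, so a small script could generate the query text to paste into a new Virtual Dataset. This is only a rough sketch: the space name `myspace` and the dataset names `sales_01`, `sales_02`, ... are hypothetical placeholders, and it uses UNION ALL on the assumption that duplicate rows should be kept when stacking (plain UNION would remove them).

```python
def build_union_query(dataset_paths, columns=None):
    """Return a UNION ALL query that stacks the given datasets.

    dataset_paths: already-quoted dataset paths (hypothetical names here).
    columns: optional list of column names; defaults to SELECT *.
    """
    if not dataset_paths:
        raise ValueError("at least one dataset path is required")
    # Quote column names so names with spaces or mixed case survive.
    select_list = ", ".join(f'"{c}"' for c in columns) if columns else "*"
    selects = [f"SELECT {select_list} FROM {path}" for path in dataset_paths]
    return "\nUNION ALL\n".join(selects)


# Hypothetical datasets in a space called "myspace".
paths = [f'myspace."sales_{i:02d}"' for i in range(1, 4)]
query = build_union_query(paths)
print(query)
```

Passing an explicit `columns` list instead of relying on `SELECT *` may be safer if the files ever differ in column order.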

JuergenD · August 28, 2018, 10:29pm

Hi,

on stacking files, I’ve read in the documentation that I can use partitioning via directories as well, and that dir0, dir1 etc. then become available. I wonder:
a) Is the file name available in the query when I have defined a directory with stacked files?
b) Is the line number from within the file available in the query?

thanks a lot

Juergen

can · August 30, 2018, 1:18am

Hey @JuergenD, we don’t currently support either. Would be curious to understand the purpose though. Is this for compliance/GDPR reasons?

JuergenD · August 30, 2018, 3:40pm

Hi, yes, compliance in the wider sense, and tracing: knowing exactly where the data originates from, like the file name and line number.
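
To illustrate the kind of tracing meant here, one possible approach outside Dremio (not a Dremio feature, just a sketch under assumptions) is to stamp each row with its file name and line number in a preprocessing step before the files are stacked. The function name and the column names `source_file` and `source_line` are made up for this example, and it assumes UTF-8 CSV files that each start with the same header row.

```python
import csv
from pathlib import Path


def stack_with_provenance(csv_paths, out_path):
    """Concatenate CSV files with identical headers into one file,
    appending the source file name and line number to every row."""
    header = None
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in csv_paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    header = file_header
                    writer.writerow(header + ["source_file", "source_line"])
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    # reader.line_num counts physical lines read so far,
                    # so the header is line 1 and the first data row is line 2.
                    writer.writerow(row + [Path(path).name, reader.line_num])
```

For records that span several physical lines (quoted newlines), `line_num` points at the last line of the record rather than the first.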

JuergenD · December 10, 2019, 6:15pm

Hi,

has anything changed in the meantime regarding:

Is the file name of CSV files available in Dremio?
Is the line number from within the CSV file available? Or is there an idea for a workaround?

thanks a lot
Juergen
