Hi, I’m new to Dremio. Is there an easy way in Dremio to stack a bunch of large files that have the exact same layout (same columns)? Basically a union query across 20 tables with around 50 identical columns.
Note: I can do this in Alteryx really quickly; I was given Dremio as an alternative to Alteryx.
Thanks for any feedback.
Hey @fish747, glad to have you onboard! You can easily achieve this by converting a folder of files (with the same layouts) to Physical Datasets in Dremio. Check the Directories section of this article from our documentation for the details. Let us know if you have any other questions.
I have two Excel files as an example. They have the same columns. I have defined all the files in the directory.
The directory will not show the button on the right when I hover over it (the one with the directory pointing to the other directory).
Got it. Converting folders to datasets is currently supported for folders within sources (e.g. NAS, HDFS, S3, etc.) where users often deal with 1000s of files. We also have a roadmap item to support this within user spaces.
Some thoughts on working around this:
- You can add the 50 files you have to a source (e.g. S3) and then convert the folder to a dataset.
- If you are running Dremio locally on a single node, add your local filesystem as a NAS source (point Root to your hard drive’s root) and then convert the folder to a dataset.
- As you mentioned, you can manually create a Virtual Dataset that is a UNION of all the datasets you have.
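
For the manual UNION option, writing 20 nearly identical SELECT statements by hand is error-prone, so here is a minimal sketch that generates the query text, which could then be pasted into a new query and saved as a Virtual Dataset. The space name `myspace`, the table names `sales_01` etc. and the column names are hypothetical placeholders, not anything from this thread. It uses `UNION ALL` on the assumption that duplicate rows should be kept when stacking files; plain `UNION` would remove them.

```python
def build_union_all(tables, columns):
    """Build one SELECT per table and join them with UNION ALL."""
    if not tables:
        raise ValueError("need at least one table")
    # Quote column names with double quotes and list them explicitly,
    # so every SELECT returns the columns in the same order.
    column_list = ", ".join(f'"{c}"' for c in columns)
    selects = [f"SELECT {column_list} FROM {t}" for t in tables]
    return "\nUNION ALL\n".join(selects)


# Hypothetical dataset paths and column names, for illustration only.
tables = [f'myspace."sales_{i:02d}"' for i in range(1, 4)]
columns = ["order_id", "amount"]

query = build_union_all(tables, columns)
print(query)
```

With 20 tables and 50 columns, only the `tables` and `columns` lists need to change.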
On stacking files: I’ve read in the documentation that I can use partitioning via directories as well, and that dir0, dir1, etc. become available. I wonder:
a) Is the file name available in the query when I define a directory with stacked files?
b) Is the line number from within the file available in the query?
thanks a lot
Hey @JuergenD, we don’t currently support either. I would be curious to understand the purpose, though. Is this for compliance/GDPR reasons?
Hi, yes, compliance in the wider sense, and tracing: knowing exactly where the data originates from, such as the file name and line number.
Has anything changed in the meantime regarding:
- Is the file name of CSV files available in Dremio?
- Is the line number from within the CSV file available? Or is there an idea for a workaround?
thanks a lot
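
One possible workaround, independent of any Dremio feature, is to pre-process the CSV files before placing them in the source folder so that each row carries its own file name and line number as ordinary columns. The sketch below assumes each file has a header row; the function name `annotate_csv` and the added column names `source_file` and `line_number` are made up for illustration.

```python
import csv
from pathlib import Path


def annotate_csv(src: Path, dst: Path) -> int:
    """Copy src to dst, appending source_file and line_number columns.

    Returns the number of data rows written (header excluded).
    """
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:
            return 0  # empty input file, nothing to annotate
        writer.writerow(header + ["source_file", "line_number"])
        count = 0
        for row in reader:
            # reader.line_num counts physical lines read so far, so it is
            # the line on which this record ends (the header is line 1).
            writer.writerow(row + [src.name, reader.line_num])
            count += 1
        return count
```

After running this over every file, the annotated copies can be stacked as described earlier in the thread, and the two extra columns can be queried like any other. Note that for records spanning several physical lines (quoted fields containing newlines), the recorded number is the last line of the record.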