Exclude certain subfolders when creating a dataset

Madhu · January 28, 2019, 4:26am

We have Spark writing Parquet files. Because the application is a streaming one, it creates a _spark_metadata folder with some JSON files under it. When we try to query the top-level folder, Dremio complains about the contents of the _spark_metadata folder. Any suggestions on how to ignore them?

Here is the file layout

Dataset1
_spark_metadata/0
_spark_metadata/1
part001.snappy.parquet
part002.snappy.parquet

The _spark_metadata subfolder contains JSON files that are metadata, not data.

Madhu · January 28, 2019, 4:30am

Here is the error when creating a dataset:

dataset1/_spark_metadata/0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [100, 100, 34, 125]
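
For readers decoding the error: the byte values [80, 65, 82, 49] are the ASCII characters "PAR1", the marker that closes a Parquet file, while [100, 100, 34, 125] decodes to `dd"}`, which looks like the tail of a JSON document. Below is a minimal Python sketch of that tail check, assuming files on a local filesystem. The helper name `has_parquet_magic` is made up for illustration and is not part of Dremio or Spark.

```python
PARQUET_MAGIC = b"PAR1"  # bytes([80, 65, 82, 49])


def has_parquet_magic(path):
    """Return True if the last four bytes of the file are b'PAR1'."""
    with open(path, "rb") as f:
        f.seek(0, 2)          # jump to the end of the file
        if f.tell() < len(PARQUET_MAGIC):
            return False      # too short to carry the magic number
        f.seek(-len(PARQUET_MAGIC), 2)
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC


# The two byte sequences quoted in the error message
expected_tail = bytes([80, 65, 82, 49])
found_tail = bytes([100, 100, 34, 125])
```

Running this check over every file in the folder would flag the _spark_metadata entries as non-Parquet, which matches what the error reports.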

balaji.ramaswamy · January 28, 2019, 6:40am

Hi @Madhu

Currently we cannot promote a folder with mixed file types by ignoring certain files. Is there any chance you can move the Parquet files under a separate folder and then promote that folder, so that it contains only Parquet files?

Thanks
@balaji.ramaswamy
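
One way to picture the suggestion above is a small job that mirrors only the Parquet data files into a separate folder, which is then the one promoted. This is a rough Python sketch, not a tested recipe: it assumes a local filesystem, assumes the target folder lives outside the source folder, and copies rather than moves so that the original output stays where Spark wrote it. The function name `mirror_parquet_files` and both folder arguments are hypothetical.

```python
import shutil
from pathlib import Path


def mirror_parquet_files(source_dir, target_dir):
    """Copy *.parquet files from source_dir to target_dir, skipping
    anything under a folder or file name starting with '_' or '.'."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.rglob("*.parquet")):
        relative = path.relative_to(source)
        # Skip metadata locations such as _spark_metadata/
        if any(part.startswith(("_", ".")) for part in relative.parts):
            continue
        destination = target / relative
        if destination.exists():
            continue  # already mirrored by an earlier run
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination)
        copied.append(destination)
    return copied
```

Because new files arrive with each micro-batch, a job like this would have to be rerun regularly, and it doubles the storage used, so it is a workaround rather than a fix.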

Madhu · January 28, 2019, 1:58pm

The _spark_metadata folder is needed for Spark Structured Streaming, and new files are added for each micro-batch. I don't think we have the option of removing them. We are also checking whether Spark can write that folder elsewhere.

Thanks,
Madhu
