I’m creating a table using CTAS and reading it, but I get the error below while refreshing metadata.
java.io.IOException: com.dremio.exec.store.parquet.Metadata$TooManySplitsException: Too many splits encountered when processing parquet metadata at file 1_4_0.parquet, maximum is 60000 but encountered 60002 splits thus far.
As I understand it, this limit cannot be changed. Is it possible to reduce the Parquet row group (split) size of the CTAS output instead?
@Dalai It looks like one of the PDSs used in the CTAS query has hit the limit. With current versions of Dremio you should no longer hit this limit. What version of Dremio are you on?
My Dremio version is 21.2.0-202205262146080444-038d6d1b.
@Dalai Send me the job profile and also your dremio.conf (with any password references removed).
Here is the profile.
429c6c3b-e6c5-4aa4-9c0e-f148e157580f.zip (61.8 KB)
dremio.conf file is here.
dremio.zip (672 Bytes)
@Dalai A few things:
- The profile you sent is from a job that completed successfully, not the one with the error
- Even though you are doing a CTAS, the source dataset is text, and that is what is exceeding 60K splits for a particular dataset
- I also see that dist:/// in your dremio.conf is commented out, so the unlimited splits feature will not be used
So, in summary:
- Your source datasets need to be Parquet
- You have to enable dist:/// in dremio.conf
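For reference, enabling distributed storage in dremio.conf looks roughly like the sketch below. The bucket name and folder are placeholders, and the `dremioS3` scheme assumes an S3-compatible store (such as MinIO, configured per Dremio's distributed-storage docs):

```hocon
paths: {
  # Distributed storage used for reflections, uploads, and
  # unlimited-splits metadata. "my-bucket/dremio-dist" is a
  # placeholder; point this at a bucket/folder of your own.
  dist: "dremioS3:///my-bucket/dremio-dist"
}
```

After changing this, the coordinator and executors need a restart for the setting to take effect.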
My source data is CSV on MinIO, partitioned hourly, and I’m creating a summarized table on MinIO using CTAS; the output is Parquet by default. My data is on MinIO S3 storage, configured as a data lake source in Dremio. But the target CTAS table’s partition metadata is not refreshed because of this limit, and I can’t query the target dataset.
Here is a sample query to refresh a partition of the target table:
ALTER TABLE "s3-source"."mydata_data" REFRESH METADATA FOR PARTITIONS ("dir1" = '12M', "dir2" = '21D')
java.io.IOException: com.dremio.exec.store.parquet.Metadata$TooManySplitsException: Too many splits encountered when processing parquet metadata at file /prs/processed_data/prs_hourly/2022Y/12M/20D/03H/1_4_0.parquet, maximum is 60000 but encountered 60002 splits thus far.
@Dalai Since your source data is CSV, you are hitting the 60K split limit. In Dremio 24.0 the COPY INTO feature will help; until then, see if you can move to Parquet in batches, or use larger CSV files so you do not hit the limit.
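One way to get larger CSV files, as suggested above, is to merge the many small hourly files into fewer, bigger ones before Dremio reads them. A minimal sketch in Python using only the standard library (the file pattern, paths, and `merge_csvs` helper are hypothetical, for illustration; in practice you would read from and write back to MinIO):

```python
import csv
import glob


def merge_csvs(pattern, out_path, has_header=True):
    """Concatenate many small CSV files matching `pattern` into one
    larger file at `out_path`, keeping a single header row.
    Fewer, larger files mean fewer splits during metadata refresh."""
    files = sorted(glob.glob(pattern))
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in files:
            with open(path, newline="") as f:
                rows = list(csv.reader(f))
            if has_header and rows:
                if not header_written:
                    writer.writerow(rows[0])  # keep the first header only
                    header_written = True
                rows = rows[1:]  # drop repeated headers
            writer.writerows(rows)
    return len(files)  # number of input files merged
```

For example, merging one day's 24 hourly files into a single daily file cuts that directory's file count by 24x, which keeps the total split count for the dataset well under the limit.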