HDFS source with different file schema

Monika_Goel · October 4, 2018, 1:45pm

Hi Team,

Can you please clarify my understanding of the HDFS source functionality? What will happen if I try to map an HDFS directory that has files with different schemas (let's say CSV files)? Will Dremio union the columns from all files, or will it result in an error?
My understanding was that Dremio will union the fields from all files, but it results in the error “Schema change detected but unable to learn schema, query failed. A full table scan may be necessary to fully learn the schema.” With Parquet files the observation is a little different: there, Dremio does union the schema across multiple files.

Any insight will be helpful. Thanks.

can · October 4, 2018, 6:10pm

@Monika_Goel this is typically caused by too many variations in the schema (different types, columns, etc.). Dremio caps its automatic schema learning retries at 10 per query to avoid wasting resources. In your case, it looks like 10 retries weren't enough because more changes were detected. Schema learning is continuous; it does not start from the beginning every time you run a query. So you could try running something like select * from table where random() = random() (any query that will scan the whole table, ideally without returning many results) a few times and see whether that is sufficient to cover all the variations in the schema.
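To illustrate why re-running the query can help, here is a toy Python model of the behaviour described above. It is only a sketch under stated assumptions, not Dremio's actual implementation: the names `run_query`, `learn_until_stable`, and `SchemaChangeError` are hypothetical, each file is reduced to a list of column names, and type changes are ignored. The only details taken from the reply are the cap of 10 retries per query and the fact that the learned schema carries over between queries.

```python
MAX_RETRIES = 10  # per-query cap mentioned in the reply above


class SchemaChangeError(Exception):
    """Hypothetical stand-in for the 'unable to learn schema' failure."""


def run_query(files, learned, max_retries=MAX_RETRIES):
    """Scan every file, merging unseen columns into `learned` in place.

    Each file that introduces new columns costs one retry. Once the cap is
    used up, the next schema change fails the query, but everything learned
    so far is kept for the next run.
    """
    retries = 0
    for columns in files:
        new_columns = set(columns) - learned
        if new_columns:
            if retries == max_retries:
                raise SchemaChangeError(
                    "Schema change detected but unable to learn schema"
                )
            learned |= new_columns
            retries += 1
    return retries


def learn_until_stable(files, learned):
    """Re-run the full scan until it succeeds; return the number of runs."""
    runs = 0
    while True:
        runs += 1
        try:
            run_query(files, learned)
            return runs
        except SchemaChangeError:
            continue  # learned schema persists, so the next run gets further


# 25 files, each adding one column the previous files did not have.
files = [[f"col_{i}"] for i in range(25)]
learned = set()
runs = learn_until_stable(files, learned)
```

In this toy setup the first two runs fail after learning 10 columns each, and the third run learns the remaining 5 and succeeds. That mirrors the suggestion to run a full-scan query a few times.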
