Inferring data from a CSV file

Hello again,

I’m trying to query a CSV file I uploaded to Dremio through JDBC and SparkSQL. Usually, when querying a database, I am able to infer the schema of the tables I’m reading. However, when reading a CSV file through JDBC, the column types cannot be inferred and every column is read as a string.

Is this functionality that you have chosen to omit, or is it a feature that is going to be added in the future? Or am I simply doing something wrong?

Note: Spark itself has an option called inferSchema for reading CSV files that allows it to infer the schema; however, this option is not available when reading through JDBC.

Automatically inferring the schema of a CSV is not currently supported in Dremio. The best way to handle this right now is to create a Virtual Dataset through the Dremio UI. See https://docs.dremio.com/working-with-datasets/virtual-datasets.html.
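
As a rough illustration of what such a virtual dataset might contain, the sketch below builds a SELECT statement that casts each string column to an explicit type. The source path, column names, and chosen types are all hypothetical, and the helper `build_cast_select` is not part of any Dremio API; it only assembles standard SQL `CAST` expressions that you could paste into the SQL editor and save as a virtual dataset.

```python
def build_cast_select(table_path, column_types):
    """Return a SELECT that casts each string column to a target SQL type."""
    casts = [
        f'CAST("{name}" AS {sql_type}) AS "{name}"'
        for name, sql_type in column_types.items()
    ]
    return "SELECT\n  " + ",\n  ".join(casts) + f"\nFROM {table_path}"


# Hypothetical CSV with three columns, all initially read as strings.
sql = build_cast_select(
    '"my_source"."orders.csv"',
    {"order_id": "INTEGER", "amount": "DOUBLE", "placed_on": "DATE"},
)
print(sql)
```

Once the query is saved as a virtual dataset, a JDBC client such as Spark should see the cast types rather than strings when it reads that dataset.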

Of course, you could use the Python Pandas library which supports CSV file schema inference and happens to be tightly integrated with/into Arrow and Parquet… :grinning:

Given that you guys never alter the source data (in this case a CSV file), won’t creating a virtual dataset always be required to achieve anything like this? More importantly, what good would inference be if you didn’t want to create a virtual dataset? Simply to guide things like join recommendations?

I can think of a couple of ways we might incorporate schema inference:

  1. We could continue with the current workflow of creating a virtual dataset, but use schema inference to make the process easier. For tables with a large number of columns, it can be a bit cumbersome to add all of the type information manually.

  2. It’s already possible to query a file/directory without first setting the format settings. In this case, maybe it would make sense to use an inferred schema.
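
To make the idea of inference concrete, here is a minimal sketch of how column types could be guessed from sample rows. It is not how Dremio works or plans to work; the function names, the sample data, and the small set of candidate types (INTEGER, DOUBLE, DATE, VARCHAR) are assumptions chosen for illustration, and it uses only the Python standard library.

```python
import csv
import io
from datetime import date


def _parses(value, parser):
    """Return True if parser accepts value without raising ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def infer_sql_type(values):
    """Pick the narrowest candidate type that every non-empty value fits."""
    samples = [v for v in values if v]  # ignore empty or missing cells
    if not samples:
        return "VARCHAR"
    if all(_parses(v, int) for v in samples):
        return "INTEGER"
    if all(_parses(v, float) for v in samples):
        return "DOUBLE"
    if all(_parses(v, date.fromisoformat) for v in samples):
        return "DATE"
    return "VARCHAR"


def infer_schema(csv_text):
    """Map each header name to an inferred type, using every row in csv_text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {name: infer_sql_type([row[name] for row in rows]) for name in rows[0]}


# Hypothetical sample data; the last row has an empty "note" cell.
sample = (
    "order_id,amount,placed_on,note\n"
    "1,9.50,2024-01-05,first\n"
    "2,12,2024-01-06,\n"
)
schema = infer_schema(sample)
print(schema)
```

A real implementation would have to decide how many rows to sample and what to do when a later row contradicts the guess, which is part of why an inferred schema works better as a suggestion the user can override than as a silent default.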

Note that I’m just brainstorming here, and this is not in any way meant to be a product road map. But I think this forum is a great place to get this sort of feedback.

Understood. I was more making the point that since CSVs are schemaless, the only way to apply a schema to them is in a virtual dataset. This is especially true when you consider that you may want a single CSV file to use different schemas in different virtual datasets (perhaps to make joins or other operations involving another virtual dataset easier).