Inferring data from a CSV file

supert165 · July 21, 2017, 10:37pm

Hello again,

I’m trying to query a CSV file I uploaded to Dremio through JDBC and SparkSQL. Usually, when querying a database, I am able to infer the schema of the database I’m reading. However, when reading a CSV file through JDBC, the column types cannot be inferred and every column is read as a string.

Is this a functionality that you have chosen to omit or is it a feature that is going to be added in the future? Or am I simply doing something wrong?

Note: Spark itself has an option called inferSchema that lets it infer the schema when reading a CSV directly; however, this option is not available when reading through JDBC.

steven · July 21, 2017, 10:49pm

Automatically inferring the schema of a CSV is not currently supported in Dremio. The best way to handle this right now is to create a Virtual Dataset through the Dremio UI. See https://docs.dremio.com/working-with-datasets/virtual-datasets.html.

jeffknupp · July 22, 2017, 12:25am

Of course, you could use the Python Pandas library which supports CSV file schema inference and happens to be tightly integrated with/into Arrow and Parquet…
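As a rough illustration of what that inference looks like, here is a minimal sketch. The sample data and column names are made up for the example and stand in for the uploaded file; read_csv picks a dtype per column by inspecting the values.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the uploaded CSV file.
csv_text = "id,price,name\n1,9.99,apple\n2,0.50,banana\n"

# read_csv inspects the values and chooses a dtype for each column,
# rather than leaving everything as strings.
df = pd.read_csv(io.StringIO(csv_text))

# Map each column name to the name of the dtype pandas inferred for it.
inferred = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(inferred)
```

Here id should come back as an integer column and price as a floating-point column, while name stays text.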

jeffknupp · July 22, 2017, 12:29am

Given that you guys never alter the source data (in this case a CSV file), won’t creating a virtual dataset always be required to achieve anything like this? More importantly, what good would inference be if you didn’t want to create a virtual dataset? Simply to guide things like join recommendations?

steven · July 22, 2017, 12:40am

I can think of a couple of ways we might incorporate schema inference:

1. We could continue with the current workflow of creating a virtual dataset, but use schema inference to make the process easier. For tables with a large number of columns, it can be a bit cumbersome to add all of the type information manually.
2. It’s already possible to query a file/directory without first setting the format settings. In this case, maybe it would make sense to use an inferred schema.
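To make the first idea concrete, here is a minimal sketch of how a column-to-type mapping (typed by hand or produced by some inference step) could be turned into the SELECT statement behind a virtual dataset. This is not a Dremio feature: build_cast_query, the dataset path, and the column names are hypothetical, and the SQL type names and double-quoted identifiers are assumptions about what the target engine accepts.

```python
def build_cast_query(source, column_types):
    """Return a SELECT that casts each string column to the given SQL type.

    `source` is the table or file path as it should appear in the FROM clause;
    `column_types` maps column names to SQL type names (assumed to be valid
    for the target engine).
    """
    select_list = ",\n  ".join(
        f'CAST("{name}" AS {sql_type}) AS "{name}"'
        for name, sql_type in column_types.items()
    )
    return f"SELECT\n  {select_list}\nFROM {source}"


# Hypothetical dataset path and columns, for illustration only.
query = build_cast_query(
    '"my_space"."sales.csv"',
    {"id": "INTEGER", "price": "DOUBLE", "sold_on": "DATE"},
)
print(query)
```

The generated statement could then be pasted into the SQL editor and saved as a virtual dataset, which avoids typing a CAST per column for wide tables.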

Note that I’m just brainstorming here, and this is not in any way meant to be a product road map. But I think this forum is a great place to get this sort of feedback.

jeffknupp · July 24, 2017, 7:07am

Understood. I was more making the point that since CSVs are schemaless, the only way to apply a schema to them is in a virtual dataset. That matters especially when you consider that you may want a single CSV file to use different schemas in different virtual datasets (perhaps to make joins or other operations involving another virtual dataset easier).
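A small sketch of why one file might want two schemas, using pandas as a stand-in for two virtual datasets over the same CSV. The data and column names are invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV with a ZIP code column that has a leading zero.
csv_text = "customer,zip\nalice,02134\nbob,90210\n"

# "Schema" 1: keep zip as text, e.g. to join against another text key.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

# "Schema" 2: read zip as a number, e.g. to join against a numeric key.
as_number = pd.read_csv(io.StringIO(csv_text), dtype={"zip": "int64"})

print(as_text["zip"][0])    # text value keeps the leading zero
print(as_number["zip"][0])  # numeric value drops it
```

Neither reading is wrong; which one is useful depends on what the other side of the join looks like.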
