JDBC driver returns a large set of unexpected schemas

dqmdev · November 10, 2020, 5:56pm

When you connect to a Dremio instance a large number of schema names are returned.

This includes mappings to external Parquet files in S3 under data lake section of Admin UX.

Dremio server 4.9.3-202010281843560195-edc49b6d
Dremio driver 4.9.1-202010230218060541-2e764ed0

From a consuming application perspective, many of the schemas returned are not expected and there appears to be no way to make Dremio suppress them by default.

balaji.ramaswamy · November 11, 2020, 8:51am

@dqmdev

Via JDBC it lists all schemas, via Tableau (ODBC) we can filter to certain tables/schemas etc

dqmdev · November 11, 2020, 11:57am

I am questioning the implementation of the driver responses.

Consider the case where you define a “data lake” A which is mapped to an S3 bucket B.

Within the bucket you have a folder C and within it a folder per Parquet file.

Dremio is returning a distinct schema for each table that is mapped.

An application which calls DatabaseMetadata.getSchemas could be many tens or hundreds of schemas each containing a single table.

A.B.C.D1.T1
A.B.C.D2.T2
…
A.B.C.D99.T99
A.B.C.D100.T100
…
A.B.C.D499.T499
A.B.C.D500.T500

The Resultset from DatabaseMetadata.getTables returns a null value for the schema name.

Persons familiar with other database systems would not expect to see the large number of schemas. The would expect to have some means to group those objects into fewer (i.e. one) logical schema. They certainly would not be expecting to see a potentially very large list of schemas by default.

The Dremio JDBC driver response from DatabaseMetadata.getTables claims qualified names is “DREMIO”.“A.B.C.D1.T1” and the simple name is DREMIO.A.B.C.D1.T1

balaji.ramaswamy · November 11, 2020, 9:27pm

@dqmdev

Not able to completely follow your concern. Let me try to understand and help

Is one of your concerns about full qualified names? From JDBC that is mandatory
Is your other concern about, too many datasets returned by your call? What is the call you are trying to make? Is it something like everything under INFORMATION_SCHEMA.“TABLES”?

dqmdev · November 11, 2020, 10:36pm

A simple JDBC program calls the DatabaseMetadata.getSchemas method.

ResultSet rs = dbMeta.getSchemas(null ,null);

The ResultSet returned by the Dremio is expected to include a row for each schema that is in the system.

There is no documentation for the driver etc that define that this is how Dremio is designed to work.

[$scratch],[DREMIO]
[@dbcert],[DREMIO]
…

[A],[DREMIO] <---- this would be reference to S3
[A.B],[DREMIO] <------ this is the bucket in S3
[A.B.C],[DREMIO] <-- this is a folder in S3
[A.B.C.D1],[DREMIO] <-- this would be a folder in S3
[A.B.C.D2],[DREMIO]
[A.B.C.D3],[DREMIO]
…
[A.B.C.D.T9999],[DREMIO]

[dbcert],[DREMIO] <---- THIS is my schema

[sys],[DREMIO]
[sys.cache],[DREMIO]

I also mentioned that Dremio returns a CATALOG name of Dremio in the JDBC driver interface. However, the SQL parser is not expecting 3-part CATALOG.SCHEMA.TABLE names. So why is it returning a catalog etc?

Keep in mind, applications which use dynamic systems to get metadata from the DatabaseMetadata interface etc, will see these issues. People who hand code will typically be unaware etc.

dqmdev · March 14, 2021, 9:44pm

Same issue with Dremio 14.

balaji.ramaswamy · March 15, 2021, 3:48am

@dqmdev

I have checked to see if we can filter this out

tshiran · April 19, 2021, 10:09pm

Can you elaborate on the issue here.

The schemas returned by the driver are indeed all the schemas that have tables in them. Dremio supports a hierarchy of folders (not just two levels like most databases) so you could have a folder a.b.c and in it two tables foo and bar. In that case a.b.c would be one of the schemas returned.

WRT the original question, this won’t be a problem because files in the object store (e.g., a Parquet file in a folder in S3) are not tables unless someone actually promoted the file to a table.

Topic		Replies	Views
JDBC driver ResultsetMetadata does not return schema name	2	952	November 11, 2020
Add_files and schema.name-mapping.default	4	248	September 7, 2024
Dremio is not able to infer the complete schema from gz compressed json files	1	1304	September 8, 2020
Not all databases listed in SQL Server datasource (only 4)	17	3264	July 31, 2019
JSON schema infers only on a subset Dremio University	1	950	October 30, 2022

JDBC driver returns a large set of unexpected schemas

Related topics