JDBC driver returns a large set of unexpected schemas

When you connect to a Dremio instance a large number of schema names are returned.

This includes mappings to external Parquet files in S3 under data lake section of Admin UX.

Dremio server 4.9.3-202010281843560195-edc49b6d
Dremio driver 4.9.1-202010230218060541-2e764ed0

From a consuming application perspective, many of the schemas returned are not expected and there appears to be no way to make Dremio suppress them by default.

image

@dqmdev

Via JDBC it lists all schemas, via Tableau (ODBC) we can filter to certain tables/schemas etc

I am questioning the implementation of the driver responses.

Consider the case where you define a “data lake” A which is mapped to an S3 bucket B.

Within the bucket you have a folder C and within it a folder per Parquet file.

Dremio is returning a distinct schema for each table that is mapped.

An application which calls DatabaseMetadata.getSchemas could be many tens or hundreds of schemas each containing a single table.

A.B.C.D1.T1
A.B.C.D2.T2

A.B.C.D99.T99
A.B.C.D100.T100

A.B.C.D499.T499
A.B.C.D500.T500

The Resultset from DatabaseMetadata.getTables returns a null value for the schema name.

Persons familiar with other database systems would not expect to see the large number of schemas. The would expect to have some means to group those objects into fewer (i.e. one) logical schema. They certainly would not be expecting to see a potentially very large list of schemas by default.

The Dremio JDBC driver response from DatabaseMetadata.getTables claims qualified names is “DREMIO”.“A.B.C.D1.T1” and the simple name is DREMIO.A.B.C.D1.T1

@dqmdev

Not able to completely follow your concern. Let me try to understand and help

  • Is one of your concerns about full qualified names? From JDBC that is mandatory
  • Is your other concern about, too many datasets returned by your call? What is the call you are trying to make? Is it something like everything under INFORMATION_SCHEMA.“TABLES”?

A simple JDBC program calls the DatabaseMetadata.getSchemas method.

ResultSet rs = dbMeta.getSchemas(null ,null);

The ResultSet returned by the Dremio is expected to include a row for each schema that is in the system.

There is no documentation for the driver etc that define that this is how Dremio is designed to work.

[$scratch],[DREMIO]
[@dbcert],[DREMIO]

[A],[DREMIO] <---- this would be reference to S3
[A.B],[DREMIO] <------ this is the bucket in S3
[A.B.C],[DREMIO] <-- this is a folder in S3
[A.B.C.D1],[DREMIO] <-- this would be a folder in S3
[A.B.C.D2],[DREMIO]
[A.B.C.D3],[DREMIO]

[A.B.C.D.T9999],[DREMIO]

[dbcert],[DREMIO] <---- THIS is my schema

[sys],[DREMIO]
[sys.cache],[DREMIO]

I also mentioned that Dremio returns a CATALOG name of Dremio in the JDBC driver interface. However, the SQL parser is not expecting 3-part CATALOG.SCHEMA.TABLE names. So why is it returning a catalog etc?

Keep in mind, applications which use dynamic systems to get metadata from the DatabaseMetadata interface etc, will see these issues. People who hand code will typically be unaware etc.

Same issue with Dremio 14.

@dqmdev

I have checked to see if we can filter this out

Can you elaborate on the issue here.

The schemas returned by the driver are indeed all the schemas that have tables in them. Dremio supports a hierarchy of folders (not just two levels like most databases) so you could have a folder a.b.c and in it two tables foo and bar. In that case a.b.c would be one of the schemas returned.

WRT the original question, this won’t be a problem because files in the object store (e.g., a Parquet file in a folder in S3) are not tables unless someone actually promoted the file to a table.