Oracle dataset metadata collection and use

Hello,

I am interested in knowing some more details about metadata that Dremio collects for datasets in Oracle database and how the metadata is used for query planning.

Specifically I would like to get some better understanding of precisely what metadata is collected about tables, how the metadata is used and what are the implications of stale metadata on query execution.

Based on some investigation, apart of basic schema information about tables (columns, data types, nullability) which is stored in ‘INFROMATION_SCHEMA.COLUMNS’ table, Dremio seems also to collect the following information about each column of a table: max/min/average and counts (total and distinct) of values. Is this correct? What additional information about datasets Dremio also collects? Where this data is stored? Can it be queried as well via Dremio in similar way as INFORMATION_SCHEMA can?

Could you let me know which loggers to enable to get the queries executed by metadata collector printed out in the log please? Alternatively can you provide some example/reference SQL(s) that are executed?

What I am also interested in is where the metadata collection queries are actually executed? Is the metadata collection query ALWAYS executed by the source (i.e. the SQL is pushed down to Oracle for example) or may it happen that the query is executed by Dremio (i.e. full table contents is transferred from Oracle to Dremio and Dremio collects the data stats locally)?

Considering that collection of metadata can take up some time (for some tables we got times longer than 60 minutes) - what is the impact of stale metadata on query plans? E.g. we have a table ANIMAL(SERIAL ID, NAME) with 200 rows, but Dremio collected metadata for the table when there were just 50 rows - i.e. in Dremio metadata the maximum value for the ‘ID’ column is ‘50’. We then execute the query ‘SELECT * FROM ANIMAL WHERE ID = 100’. Will query execution be short-circuited in such case - as the value used in selection clause (100) is outside the range for the specific column - and Dremio will instantly return empty result set or will Dremio still execute the query? If Dremio executes the query anyways then presumably in case of a dataset with reflection(s) the result will be an empty row set (as the reflection(s) for the dataset will have stale data in such case); however for a dataset without reflection the query will be executed against actual source and will return the matching row(s), is this correct?

Is there a way to control what metadata is collected by Dremio for a specific source/table? e.g. can we configure Dremio to collect just the total record count for a table but skip the collection of detailed per-column stats to reduce the metadata collection times? Is it possible to turn off stats collection completely and let Dremio collect just the basic dataset metadata (schema)?

Dremio version: 4.1.4
Source: Oracle

Many thanks

@useless

  • Metadata collection frequency can be controlled at the source metadata settings

http://docs.dremio.com/advanced-administration/metadata-caching.html

  • Metadata can be collected per dataset on demand

http://docs.dremio.com/sql-reference/sql-commands/datasets.html#refreshing-physical-dataset-metadata

  • Metadata for RDBMS sources like Oracle, Dremio collects
  • get row count
  • get column metadata

This is important so the optimizer can create an optimized pushdown, the other option is if you upgrade to 4.7.3 or higher, you can use External Query

In RDBMS since queries are pushdown to Oracle, metadata is only for new columns, data type changes, new tables. If new rows are added then even if metadata is not collected, Dremio should return the latest data as the query is pushdown. Unlike Hive where each insert is a new row requires a metadata refresh even for new rows

If there is a reflection on top of the Oracle dataset and rows are inserted into Oracle and you run a query without refreshing the reflection and if the query is accelerate by that reflection that is not refreshed then you will see old data, refreshing the reflection will bring fresh data

This is too get you started, if you have more questions, feel free to reach out

Thanks
Bali

Hi @balaji.ramaswamy ,

thank you for your reply. It is very helpful.

So effectively - for Oracle data source at least - if the database schema doesn’t change then the metadata refresh frequency can be configured to run relatively rarely as the actual data/row count should not affect the push down functionality.

Just out of interest, I saw queries similar to this in Dremio log (assuming ANIMAL(NUMBER(12,0) ID, VARCHAR2(255) NAME):

SELECT NDV("ID") as "count_distinct_ID", min("ID") as "min_ID", avg("ID") as "avg_ID", count(ID) as "count_ID", max("ID") as max_ID, NDV("NAME") as "count_distinct_NAME", min(octet_length("NAME")) as "min_len_NAME", avg(octet_length("NAME")) as "avg_len_NAME", count("NAME") as "count_NAME", max(octet_length("NAME")) as max_len_NAME FROM ANIMAL;

what are these? Initially I thought that they were executed by Metadata collector, but these seem to be executed against a sample of records (e.g. ~1000 records from a table with ~500M of records) from the target table? Are the stats collected by these queries just used to speed up previews or do they affect the ‘non-preview’ execution?

Many thanks

So effectively - for Oracle data source at least - if the database schema doesn’t change then the metadata refresh frequency can be configured to run relatively rarely as the actual data/row count should not affect the push down functionality.

Yes ^^^

The query is an internal query Dremio runs to suggest measures for Aggregate reflections

Thanks
Bali