Hello,
I am interested in knowing some more details about metadata that Dremio collects for datasets in Oracle database and how the metadata is used for query planning.
Specifically I would like to get some better understanding of precisely what metadata is collected about tables, how the metadata is used and what are the implications of stale metadata on query execution.
Based on some investigation, apart of basic schema information about tables (columns, data types, nullability) which is stored in ‘INFROMATION_SCHEMA.COLUMNS’ table, Dremio seems also to collect the following information about each column of a table: max/min/average and counts (total and distinct) of values. Is this correct? What additional information about datasets Dremio also collects? Where this data is stored? Can it be queried as well via Dremio in similar way as INFORMATION_SCHEMA can?
Could you let me know which loggers to enable to get the queries executed by metadata collector printed out in the log please? Alternatively can you provide some example/reference SQL(s) that are executed?
What I am also interested in is where the metadata collection queries are actually executed? Is the metadata collection query ALWAYS executed by the source (i.e. the SQL is pushed down to Oracle for example) or may it happen that the query is executed by Dremio (i.e. full table contents is transferred from Oracle to Dremio and Dremio collects the data stats locally)?
Considering that collection of metadata can take up some time (for some tables we got times longer than 60 minutes) - what is the impact of stale metadata on query plans? E.g. we have a table ANIMAL(SERIAL ID, NAME) with 200 rows, but Dremio collected metadata for the table when there were just 50 rows - i.e. in Dremio metadata the maximum value for the ‘ID’ column is ‘50’. We then execute the query ‘SELECT * FROM ANIMAL WHERE ID = 100’. Will query execution be short-circuited in such case - as the value used in selection clause (100) is outside the range for the specific column - and Dremio will instantly return empty result set or will Dremio still execute the query? If Dremio executes the query anyways then presumably in case of a dataset with reflection(s) the result will be an empty row set (as the reflection(s) for the dataset will have stale data in such case); however for a dataset without reflection the query will be executed against actual source and will return the matching row(s), is this correct?
Is there a way to control what metadata is collected by Dremio for a specific source/table? e.g. can we configure Dremio to collect just the total record count for a table but skip the collection of detailed per-column stats to reduce the metadata collection times? Is it possible to turn off stats collection completely and let Dremio collect just the basic dataset metadata (schema)?
Dremio version: 4.1.4
Source: Oracle
Many thanks