We have a Hive source configured with a full metadata refresh every 1 day, and every full refresh takes almost 7 to 9 hours.
Below are the logs for a full metadata refresh:
Source ‘XXXXX’ refreshed in 24666 seconds. Details:
Shallow probed 102798 datasets: 1 added, 102795 unchanged, 2 deleted
Deep probed 1030 queried datasets: 786 changed, 242 unchanged, 0 deleted, 2 unreadable
A few tables have around 1600 partitions, and each of those tables takes approximately 10 minutes to refresh.
Are there any optimisations in the latest Dremio version, or is this normal behaviour?
Please also suggest any way to reduce the processing time.
We are using Dremio 3.3.1.
Later this year we are releasing more optimizations to reduce the time taken for metadata refresh. By any chance, in your source's metadata settings, have you set “Fetch mode” to “All Datasets”? See the attached screenshot.
Thanks for the response.
No, we have not configured it as ‘All Datasets’; it is set to ‘Queried Datasets’ only.
Try this: add the logger below to logback.xml and restart the coordinator. After a day, review metadata_refresh.log to see how much time each dataset took. You can bring that file into Dremio as a CSV (with space as the delimiter) and find the datasets that take the most time. If there are any datasets that you do not require, you can run “ALTER PDS <dataset> FORGET METADATA” on them.
<logger name="com.dremio.exec.catalog.MetadataSynchronizer" additivity="false">
  <level value="debug" />
</logger>
Note: Entries listed as 0 ms can be ignored.
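Because the logger sets additivity="false", its output only reaches a file if an appender is attached to it. A minimal sketch of what the full logback.xml addition might look like is given here; the appender name "metadata-refresh", the use of the dremio.log.path property, and the log pattern are assumptions for illustration, and only the logger name and the metadata_refresh.log file name come from this thread.

```xml
<!-- Sketch only: the appender name, file location, and pattern are assumptions. -->
<appender name="metadata-refresh" class="ch.qos.logback.core.FileAppender">
  <!-- Assumes dremio.log.path is defined in your logback.xml; otherwise use an absolute path. -->
  <file>${dremio.log.path}/metadata_refresh.log</file>
  <encoder>
    <!-- Space-separated fields, so the file can be loaded as a space-delimited CSV. -->
    <pattern>%date{ISO8601} %-5level %logger{0} %msg%n</pattern>
  </encoder>
</appender>

<logger name="com.dremio.exec.catalog.MetadataSynchronizer" additivity="false">
  <level value="debug" />
  <!-- Route this logger's debug output to the appender defined above. -->
  <appender-ref ref="metadata-refresh" />
</logger>
```

Both elements go inside the existing <configuration> element. If your logback.xml already defines a suitable file appender, you could reference that one instead of adding a new appender.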
Thanks, Balaji. We will try this and validate.