I did some more extensive testing of Dremio's reflection refresh behavior with S3 sources by pushing a simple id dataset up to S3 in batches of 100 and polling the dataset in Dremio to see what max id was returned. This was on a small (3 node) Dremio cluster. I wanted to share the results here and ask whether this is all to be expected, or if there are additional optimizations I'm not aware of that could make this better.
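For context, here is roughly the shape of the harness, as a minimal sketch assuming boto3 and the Dremio v3 REST SQL API. The coordinator URL, credentials, bucket, and dataset path are all placeholders, not the real values from my environment:

```python
# Minimal sketch of the test harness: upload id batches to S3, poll Dremio
# for MAX(id) through the v3 REST SQL API, and print timestamped results.
import datetime
import time

import boto3
import requests

DREMIO = "http://dremio-coordinator:9047"        # placeholder coordinator URL
BUCKET = "my-test-bucket"                        # placeholder S3 bucket
DATASET = '"s3source"."my-test-bucket"."ids"'    # placeholder dataset path

s3 = boto3.client("s3")


def login(user, password):
    """Fetch an auth token; Dremio expects it prefixed with '_dremio'."""
    r = requests.post(f"{DREMIO}/apiv2/login",
                      json={"userName": user, "password": password})
    r.raise_for_status()
    return {"Authorization": "_dremio" + r.json()["token"]}


def push_batch(batch_no, batch_size=100):
    """Upload one CSV of sequential ids into the folder Dremio is querying."""
    start = batch_no * batch_size
    body = "id\n" + "\n".join(str(i) for i in range(start + 1, start + batch_size + 1))
    s3.put_object(Bucket=BUCKET, Key=f"ids/batch-{batch_no:04d}.csv",
                  Body=body.encode())


def poll_max_id(headers):
    """Submit SELECT MAX(id), wait for the job to finish, print the result."""
    job = requests.post(f"{DREMIO}/api/v3/sql", headers=headers,
                        json={"sql": f"SELECT MAX(id) AS max_id FROM {DATASET}"}).json()
    while True:
        state = requests.get(f"{DREMIO}/api/v3/job/{job['id']}",
                             headers=headers).json()["jobState"]
        if state in ("COMPLETED", "FAILED", "CANCELED"):
            break
        time.sleep(0.2)
    rows = requests.get(f"{DREMIO}/api/v3/job/{job['id']}/results",
                        headers=headers).json().get("rows", [])
    value = rows[0]["max_id"] if rows else None
    print(f"({datetime.datetime.now()}) {value}")


if __name__ == "__main__":
    headers = login("admin", "password123")      # placeholder credentials
    for batch in range(40):
        push_batch(batch)
        poll_max_id(headers)
```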
Test 1 - No Reflections
The first test I ran was with no reflections anywhere (and hence no reflection refreshes):
(2019-10-17 13:14:05.141636) 0
(2019-10-17 13:14:06.517629) 0
(2019-10-17 13:14:07.924899) 0
(2019-10-17 13:14:17.602598) 0
(2019-10-17 13:14:18.982750) 0
(2019-10-17 13:14:20.360365) 0
(2019-10-17 13:14:21.757077) 0
(2019-10-17 13:14:23.142910) 0
(2019-10-17 13:14:24.533491) 0
(2019-10-17 13:14:25.915443) 0
(2019-10-17 13:14:27.296775) 0
(2019-10-17 13:14:28.801653) 0
(2019-10-17 13:14:30.195872) 300
(2019-10-17 13:14:31.583417) 300
(2019-10-17 13:14:32.962828) 300
(2019-10-17 13:14:34.340646) 300
(2019-10-17 13:14:35.721036) 300
(2019-10-17 13:14:37.097542) 300
(2019-10-17 13:14:38.476944) 300
(2019-10-17 13:14:39.858609) 300
The query results updated after about a minute (on average), which I think matches the minimum time it takes for metadata to be discovered in Dremio according to the metadata refresh setting on the source?
This result was not unexpected; however, we can't get away with not having reflections for most of our datasets.
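For reference, that interval can be inspected programmatically through the Catalog API. This is a small sketch, reusing the DREMIO constant and auth headers from the harness above; "s3source" is a placeholder and the metadataPolicy field names should be checked against your version's API docs:

```python
# Inspect the source's metadata policy (the refresh intervals) via the
# Catalog API. "s3source" is a placeholder source name; DREMIO and headers
# come from the harness sketch above.
import requests

src = requests.get(f"{DREMIO}/api/v3/catalog/by-path/s3source",
                   headers=headers).json()
# In the versions I've looked at, metadataPolicy carries fields like
# datasetRefreshAfterMs, which appears to bound how quickly new files show up.
print(src["metadataPolicy"])
```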
Test 2 - Reflection w/o Refresh
With a reflection in place on one of the middle tables in the pipeline (regardless of whether incremental refresh is turned on for the source), the query results will not update again until the next time the reflection is refreshed by the system. The minimum time span for this in Dremio is set on the physical data source and is an hour.
You can hit "Refresh Now" to refresh the reflection manually (which refreshes all downstream dependent reflections in the tree). This causes the results to update in about 10-30 seconds, depending on how long the reflection takes to build. The same thing can be scripted, as sketched below.
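A minimal sketch of that manual refresh against the catalog refresh endpoint, again reusing DREMIO and headers from the harness above (the dataset path is a placeholder):

```python
# Trigger the equivalent of "Refresh Now": look up the physical dataset's
# catalog id, then POST to its refresh endpoint, which refreshes the
# reflections that depend on it. The dataset path is a placeholder.
import requests

pds = requests.get(
    f"{DREMIO}/api/v3/catalog/by-path/s3source/my-test-bucket/ids",
    headers=headers).json()
resp = requests.post(f"{DREMIO}/api/v3/catalog/{pds['id']}/refresh",
                     headers=headers)
resp.raise_for_status()  # raises if the refresh request was rejected
```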
Notes:
- This is regardless of the incremental refresh option. Apparently incremental refresh simply makes the reflection builder diff the files between one refresh run and the next; the refresh frequency is still capped at the 1 hour minimum for automatically rebuilt reflections and is not affected by how often the data is updated.
- Contrary to what I was expecting, the results are actually stale. This means that Dremio uses the reflection parquet in its entirety and does not try to diff or union it with the current unreflected data. This is important, as it will produce stale results. (Question: is this something you are exploring to improve? I was expecting accurate results (data consistency) at the cost of a query performance penalty.)
- If incremental refresh is turned on and you delete data, you have to go into the reflection and turn it off and back on, or delete and rebuild it (a scripted version of the toggle is sketched below). This is because incremental refresh only works in an "append only" context, as noted elsewhere. This generally isn't an issue outside of testing, as our data is append only.
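Here is one way that toggle might be scripted against the Reflection API. This is a sketch based on my reading of the API; the "enabled" and "tag" fields and the PUT semantics are assumptions worth verifying, and REFLECTION_ID is a placeholder:

```python
# Disable and re-enable a reflection to force a full rebuild after deletes.
# DREMIO and headers come from the harness sketch; REFLECTION_ID is a
# placeholder (visible in the UI, or list via GET /api/v3/reflection).
import requests

REFLECTION_ID = "<your-reflection-id>"
url = f"{DREMIO}/api/v3/reflection/{REFLECTION_ID}"

body = requests.get(url, headers=headers).json()
for enabled in (False, True):
    body["enabled"] = enabled
    # The PUT expects the full reflection body back, including the current
    # "tag" for optimistic concurrency; the response carries the new tag.
    body = requests.put(url, headers=headers, json=body).json()
```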
Test 3 - Reflection w/ Refresh
Triggering a manual refresh of the physical data source via the REST API (the refresh endpoint sketched above) before running the query resulted in the following:
(2019-10-17 17:24:33.508418) 0,null,null
(2019-10-17 17:24:39.148795) 0,null,null
(2019-10-17 17:24:44.736047) 0,null,null
(2019-10-17 17:24:50.368392) 0,null,null
(2019-10-17 17:24:55.952825) 0,null,null
(2019-10-17 17:25:01.545077) 400,null,null
(2019-10-17 17:25:07.135184) 400,null,null
(2019-10-17 17:25:12.705676) 400,null,null
(2019-10-17 17:25:18.273818) 400,null,null
(2019-10-17 17:25:24.053751) 400,null,null
(2019-10-17 17:25:29.621542) 400,null,null
(2019-10-17 17:25:35.180298) 400,null,null
(2019-10-17 17:25:40.752258) 400,null,null
(2019-10-17 17:25:46.313424) 400,null,null
(2019-10-17 17:25:51.861592) 400,null,null
(2019-10-17 17:25:57.461728) 400,null,null
(2019-10-17 17:26:03.042781) 400,null,null
(2019-10-17 17:26:08.617842) 1500,null,null
(2019-10-17 17:26:14.195623) 1500,null,null
(2019-10-17 17:26:19.769558) 1500,null,null
(2019-10-17 17:26:25.370713) 1500,null,null
(2019-10-17 17:26:30.920955) 1500,null,null
(2019-10-17 17:26:36.498327) 1500,null,null
(2019-10-17 17:26:42.074627) 1500,null,null
(2019-10-17 17:26:47.635900) 1500,null,null
(2019-10-17 17:26:53.253580) 1500,null,null
(2019-10-17 17:26:58.850476) 1500,null,null
(2019-10-17 17:27:04.413253) 1500,null,null
(2019-10-17 17:27:09.958870) 1500,null,null
(2019-10-17 17:27:15.495383) 3200,null,null
The uploader's log from the same run, for comparison:
(2019-10-17 17:24:36.622792) Copying file
(2019-10-17 17:24:39.174707) Copying file 1
(2019-10-17 17:24:41.644146) Copying file 2
(2019-10-17 17:24:44.074194) Copying file 3
(2019-10-17 17:24:46.563777) Copying file 4
(2019-10-17 17:24:49.033014) Copying file 5
(2019-10-17 17:24:51.509967) Copying file 6
(2019-10-17 17:24:54.935158) Copying file 7
(2019-10-17 17:24:57.405324) Copying file 8
(2019-10-17 17:25:00.329522) Copying file 9
(2019-10-17 17:25:02.873334) Copying file 10
(2019-10-17 17:25:05.474731) Copying file 11
(2019-10-17 17:25:08.055041) Copying file 12
(2019-10-17 17:25:10.593861) Copying file 13
(2019-10-17 17:25:13.190899) Copying file 14
(2019-10-17 17:25:15.679632) Copying file 15
(2019-10-17 17:25:18.156999) Copying file 16
(2019-10-17 17:26:20.862639) Copying file 17
(2019-10-17 17:26:23.338329) Copying file 18
(2019-10-17 17:26:25.947379) Copying file 19
(2019-10-17 17:26:28.510195) Copying file 20
(2019-10-17 17:26:31.032345) Copying file 21
(2019-10-17 17:26:33.515189) Copying file 22
(2019-10-17 17:26:36.036805) Copying file 23
(2019-10-17 17:26:38.642543) Copying file 24
(2019-10-17 17:26:41.108855) Copying file 25
(2019-10-17 17:26:43.511439) Copying file 26
(2019-10-17 17:26:45.957778) Copying file 27
(2019-10-17 17:26:49.459563) Copying file 28
(2019-10-17 17:26:52.092022) Copying file 29
(2019-10-17 17:26:54.718526) Copying file 30
(2019-10-17 17:26:57.305769) Copying file 31
(2019-10-17 17:26:59.864569) Copying file 32
(2019-10-17 17:27:02.392413) Copying file 33
(2019-10-17 17:27:04.882784) Copying file 34
(2019-10-17 17:27:07.708058) Copying file 35
(2019-10-17 17:27:10.230805) Copying file 36
(2019-10-17 17:27:12.733081) Copying file 37
(2019-10-17 17:27:15.218653) Copying file 38
(2019-10-17 17:27:17.698528) Copying file 39
In this setup you can see that accurate results run about 10-30 seconds behind the dataset.
Summary
As expected, Dremio still generally fits a batch context best, which is how we intended to use it when we set out.
Generally speaking, if the data lands at a different time of day from when the batch task runs (at least an hour apart), there shouldn't be any issue. We could potentially use this as a "micro-batch" processing engine by triggering reflection refreshes before queries as I have done here, but I generally wouldn't recommend it.
In that case, the best-case latency we can hope for is somewhere between 10 seconds and 1 minute, depending on how long reflection building takes. Reflection build time can be shortened for larger datasets by leveraging the incremental refresh option.
Question: if we wanted to continue using Dremio as our metadata specification layer over all our data sources, would you recommend improving on batch-level SLAs by leveraging a different tool beneath Dremio as a data source, such as Snowflake, to do some of the heavy lifting/data warehousing and/or near-real-time SLAs (using https://docs.snowflake.net/manuals/user-guide/streams.html)? Can you comment on the update behavior of Snowflake (Snowpipe, for example) with respect to Dremio reflections?