I understand the C3 cache is an in-memory cache generated from query results, so the next time the same query is executed it is served from the cache and returns faster. When a reflection is created on the same query, the results are stored on disk in Parquet format and are also fetched faster. Given that, how do the C3 cache and reflections differ, and which method is optimal?
Also, how do I purge generated C3 cache entries that are holding stale results?
C3 is a disk-based acceleration feature of Dremio. If you have Parquet datasets sitting on one or more data lakes like S3, Dremio caches the Parquet row groups locally on the disks of the executor nodes. This happens automatically based on the queries issued against Dremio, so we reduce the dependency on constantly hitting S3 for queries that refer to the same row groups and improve performance (it can also reduce S3 I/O costs). Certain EC2 instance types come with locally attached NVMe/SSD storage that works really well with C3. See docs here.
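If you want to see where that cache lives, the paths are set on each executor in dremio.conf. A minimal sketch follows; the key names are my recollection of the Cloud Cache settings in recent Dremio versions and the mount points are made up, so verify against the C3 configuration docs for your release:

```
# Executor-side dremio.conf excerpt (hypothetical mount points, verify key names):
services.executor.cache.path.db: "/mnt/c3/db"    # cache metadata store
services.executor.cache.path.fs: ["/mnt/c3/fs"]  # cached parquet row groups
```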
Here’s something from our website that’s relevant:
> C3 only caches data required to satisfy your workloads and can even cache individual microblocks within datasets. If your table has 1,000 columns and you only query a subset of those columns and filter for data within a certain timeframe, then C3 will just cache that portion of your table.
Reflections are similar to materialized views. The two most common types are Raw and Aggregate. Raw Reflections materialize the VDS definition as-is (you can add sorting and partitioning if required). Aggregate Reflections create pre-aggregated materializations based on the dimensions and measures you set (this is the more common type). Both types are stored back to Dremio’s distributed store (which will be on your data lake like S3, not on local disk the way C3 is).
You can have more than one reflection per VDS or PDS. Dremio’s query optimizer automatically chooses the best reflections for a given query, so you don’t have to manage or remember which Reflections to use (as you would with traditional materialized views). All you have to do is query the base VDS or PDS.
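For concreteness, here’s a sketch of what defining both types looks like in SQL. The dataset and column names (sales.v_orders and friends) are made up, and the exact ALTER DATASET grammar varies by Dremio version, so treat this as illustrative and check the Reflections docs:

```sql
-- Raw Reflection: materializes the VDS as-is (optional sort/partition)
ALTER DATASET sales.v_orders
CREATE RAW REFLECTION orders_raw
USING DISPLAY (order_id, customer_id, order_date, amount)
PARTITION BY (order_date)
LOCALSORT BY (customer_id);

-- Aggregate Reflection: pre-aggregated on the dimensions/measures you pick
ALTER DATASET sales.v_orders
CREATE AGGREGATE REFLECTION orders_agg
USING DIMENSIONS (customer_id, order_date)
MEASURES (amount (SUM, COUNT));

-- Queries still target the base VDS; the optimizer substitutes a reflection
SELECT customer_id, SUM(amount) AS total_amount
FROM sales.v_orders
GROUP BY customer_id;
```

Note that the last query never names a reflection; that’s the point of querying the base VDS.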
The cool thing is that C3 works seamlessly with Reflections. A typical query that gets accelerated by a Reflection (stored on S3) can cache the Reflection data in C3 (stored on local disk). The next time a similar query uses the same Reflection, it gets the cached data from C3 instead of going to S3.
You can disable C3 per source in Source Settings (Advanced Options → uncheck “Enable asynchronous access when possible”).
Generally, the idea is to look at your working data set size and size C3 accordingly.
For example, you may have terabytes of data in your data lake, but your users are mostly interested in only the last 2 months of data. You then try to figure out what this 2-month working set looks like (this is probably the hardest part). Say it’s 1.5TB. You would then size C3 at 150GB per node for a 10-executor cluster (1.5TB ÷ 10 executors = 150GB per node). This keeps most of your workloads from hitting your data lake all the time.
Ideally, you would have dedicated engines for different workloads so that C3 doesn’t get thrashed too much (i.e., route reflection refreshes to a separate engine, queries belonging to similar workloads/use cases to another, etc.).
If all that is too hard, start with at least 150GB per executor, monitor disk usage and query performance over time, and size up or down accordingly.
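If you go the dremio.conf route, the cache caps are expressed as a percentage of each volume rather than an absolute size. A hedged sketch, again with key names from memory (verify against the docs for your version); the 70% figure assumes roughly 215GB NVMe volumes, which nets out near the 150GB-per-executor starting point above:

```
# Executor-side dremio.conf excerpt (hypothetical values, verify key names):
services.executor.cache.pctquota.db: 70    # % of the metadata volume C3 may use
services.executor.cache.pctquota.fs: [70]  # % of each data volume C3 may use
```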