I am evaluating Dremio for a use case.
Background on the use case:
We have an S3 bucket into which AWS Kinesis Firehose is dumping data.
The ingestion rate is roughly 1 GB per hour, around 450k records. We currently have around 1 TB of data in the S3 bucket, and it takes around 3 hours to create a reflection on that data. The final data size we are targeting is 10 TB.
Current cluster configuration:
Coordinator nodes - 4 cores x 16 GB RAM - 2 nodes (with HA)
Executor nodes - 4 cores x 16 GB RAM - 3 nodes
I couldn’t find answers to the following questions.
Can we see the rate at which data is being pulled/scanned when creating a reflection?
Is there any way to minimize reflection creation time?
Can the refresh interval be set to less than 1 hour?
Let’s say we create a reflection on an S3 bucket where the data is around 100 MB, and within an hour it grows to 10 GB. How long will it take to update the reflection to the current data? Is there any standard way to estimate the time required to update a reflection?
Will reducing dimensions, measures, and partitions reduce reflection creation/update time?
What cluster size do you suggest for this volume of data?
Any suggestions for optimization?
We are also facing the following error while creating a reflection -
“Failed to spill to disk. Please check space availability”
Can we see the rate at which data is being pulled/scanned when creating a reflection?
You can go to Jobs, find the reflection creation job (you need to include the “Accelerator” filter), then click on “Profile”. In that modal, you can find the “SCAN” operator and see its metrics per thread, for example for a query that scans text files.
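If you want to turn those per-thread metrics into an overall scan rate, a quick back-of-the-envelope calculation is enough. The numbers below are placeholders standing in for whatever your profile reports; substitute the actual byte/record counts and wall-clock time from your own job.

```python
# Rough scan-rate estimate from job profile metrics.
# All inputs are placeholders; replace them with the values reported by the
# SCAN operator (bytes/records read) and the job's wall-clock duration.

bytes_scanned = 1 * 1024**4        # e.g. ~1 TB read during the reflection build
records_scanned = 450_000 * 2160   # e.g. ~90 days of ingestion at 450k records/hour
scan_seconds = 3 * 3600            # e.g. the ~3 hour build mentioned above

mb_per_sec = bytes_scanned / scan_seconds / 1024**2
records_per_sec = records_scanned / scan_seconds

print(f"Scan throughput: {mb_per_sec:,.0f} MiB/s, {records_per_sec:,.0f} records/s")
```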
Is there any way to minimize reflection creation time?
A few factors here:
Underlying data source format/type. For example, creating reflections from CSV files will be a lot slower than creating reflections on Parquet files (see the conversion sketch after this list).
Type and configuration of reflections. For example, aggregation reflections with many high-cardinality dimensions may take longer to build than ones with low-cardinality dimensions. Another example: when reflections are partitioned, the data needs to be sorted first, which can increase overall processing time.
Cluster resources compared to dataset size and reflection complexity.
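On the format point: if your Firehose delivery writes CSV (or JSON), one option is to convert the delivered files to Parquet before Dremio scans them, so both reflection builds and raw queries read a columnar, compressed format. Here is a minimal sketch with pyarrow; the file paths are made up, and in practice you would read from and write back to S3 rather than local disk.

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Convert one delivered CSV file to Parquet. Paths are illustrative only;
# a real pipeline would list new objects in the Firehose prefix and write
# the Parquet output to a separate S3 prefix that Dremio points at.
table = pv.read_csv("firehose-delivery/2019/06/01/events.csv")
pq.write_table(table, "parquet-staging/2019/06/01/events.parquet", compression="snappy")

print(f"Converted {table.num_rows} rows ({table.nbytes / 1024**2:.1f} MiB in memory)")
```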
Can the refresh interval be set to less than 1 hour?
This is not a recommended configuration. Curious, what type of reflections are you creating? How much of a lift are you getting compared to running without reflections? Is the underlying data append-only or mutating (upserts, deletes)?
Let’s say we create a reflection on an S3 bucket where the data is around 100 MB, and within an hour it grows to 10 GB. How long will it take to update the reflection to the current data? Is there any standard way to estimate the time required to update a reflection?
This is a quick test you’ll need to run yourself with your cluster resources and reflection configurations.
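One way to structure that test is to time refreshes at a few dataset sizes and extrapolate. The sketch below assumes refresh time scales roughly linearly with data volume, which is an assumption you should validate against your own measurements before trusting any projection.

```python
# Naive extrapolation of reflection refresh time from measured samples.
# The sample timings are placeholders; replace them with your own test runs.
measured = [
    (0.1, 40),     # (dataset size in GB, refresh time in seconds)
    (1.0, 380),
    (10.0, 3700),
]

# Derive a seconds-per-GB rate from the largest sample and project forward,
# assuming linear scaling (verify this against at least two sample points).
size_gb, seconds = measured[-1]
rate = seconds / size_gb

for target_gb in (100, 1_024, 10_240):
    print(f"~{target_gb:>6} GB -> estimated refresh ~{target_gb * rate / 3600:.1f} h")
```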
Will reducing dimensions, measures, and partitions reduce reflection creation/update time?
Yes, all of these factors affect reflection build times. I’d suspect dimensions and partitions have more of an effect.
What cluster size do you suggest for this volume of data?
I’d recommend adding more memory and CPU first (double or triple both to start with) and comparing against your current baseline. Our approach for POCs is to incrementally increase or decrease resources based on desired query response SLAs, reflection build times, and so on. Think of it as a function: dataset sizes, workload profiles, concurrency, and resources together determine the latency you can target.
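As a rough way to frame that iteration, you can estimate how many executors of your current shape would be needed to keep build times flat as the data grows. The model below assumes build time scales linearly with data volume and inversely with executor count, which is optimistic, so treat the output only as a starting point for the next sizing test.

```python
# Starting-point executor estimate. Assumes build time grows linearly with
# data volume and shrinks linearly with executor count; real scaling is
# rarely this clean, so use the result to pick the next configuration to test.

baseline_executors = 3        # current 4-core / 16 GB executors
baseline_data_tb = 1.0
baseline_build_hours = 3.0

target_data_tb = 10.0
target_build_hours = 3.0      # e.g. hold build time roughly where it is today

needed = baseline_executors * (target_data_tb / baseline_data_tb) \
                            * (baseline_build_hours / target_build_hours)
print(f"Estimated executors at the same node shape: ~{needed:.0f}")
```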
“Failed to spill to disk. Please check space availability”
Have you configured the spill directories (/var/lib/dremio/spill/) with enough space? It looks like the reflection you are trying to create first fails due to a lack of cluster memory (the aggregation appears to be large/high cardinality). When aggregations run out of memory, we automatically retry the query with the streaming aggregation operator instead of hash aggregation, which consumes less memory but needs to sort the data first. It looks like this sort operation runs into trouble when it tries to spill.
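A quick way to sanity-check free space on each executor’s spill path is shown below; the path matches the default mentioned above, so adjust it if your dremio.conf points the spill directories somewhere else.

```python
import shutil

# Report total/used/free space on an executor's spill directory.
# /var/lib/dremio/spill/ is the path referenced above; change it if your
# deployment configures different spill locations.
usage = shutil.disk_usage("/var/lib/dremio/spill/")

gib = 1024**3
print(f"total: {usage.total / gib:.1f} GiB")
print(f"used:  {usage.used / gib:.1f} GiB")
print(f"free:  {usage.free / gib:.1f} GiB")
```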
A good starting point is this article from the documentation. You’ll need to determine whether large queries are being placed in the small query queue and vice versa. Then, based on those queries, adjust the thresholds as described in the docs. We have some features in the pipeline that should simplify some of this work.