I have a question about physical and virtual data sets. Suppose the raw data is in an S3 bucket in AWS. When we create physical and virtual data sets, is the data stored locally in Dremio, or is it fetched from S3 every time we access it?
PDS - Dremio does not store anything locally; it reads from S3 every time
VDS - It is just a view, no physical existence
Reflection - Converts to Parquet and stores in S3 (can be local but not recommended)
C3 Cache - Local
Thanks @balaji.ramaswamy for the answer. My understanding is that a data reflection creates a new copy of the data in Parquet format. But what if the file is already in Parquet? How useful is a data reflection in that case?
How do I enable the C3 cache? Is it enabled by default? And do we need a large disk for it?
Also, below is the current data storage layout in the cloud. Do I have to create a physical data set at the jobid folder level, as mentioned below? We will have multiple jobid folders created (around 200) every day, so I am wondering how to use Dremio in this case… Thanks
Data reflections can be created on a subset of rows and columns, so even if the source is Parquet, you can create one on a subset. Moreover, Dremio writes Parquet following industry best practices for block sizes, encoding, etc.
C3 on the AWS edition is enabled by default. On other deployments, please follow the documentation to configure it. The disk size used for C3 is configurable; please go through the documentation.
The folder on which you create the PDS depends on the number of sub-partitions and how the queries are generated. It is very easy to promote and de-promote, so based on your application needs you can enable it at the right level.
Thanks a lot @balaji.ramaswamy for the quick response. My use case is that a minimum of 200 sub-partitions will be created every day (like jobid-101, jobid-102, etc., as mentioned in my previous post), and I have to retain the data for 7 years. So there would be a minimum of 500K sub-partitions.
Will Dremio be able to handle this, and how would the performance be? I am trying to evaluate Dremio and a few other query engines for this use case.
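For reference, the 500K figure can be checked with a quick calculation. This is only a rough sketch using the numbers quoted in the post; the 365-day year (ignoring leap days) and the function name are assumptions made for illustration.

```python
# Rough estimate of the sub-partition count, using the figures quoted in the post.
FOLDERS_PER_DAY = 200   # minimum jobid folders created per day (from the post)
RETENTION_YEARS = 7     # retention period (from the post)
DAYS_PER_YEAR = 365     # assumption: leap days are ignored


def estimate_sub_partitions(folders_per_day, years, days_per_year=DAYS_PER_YEAR):
    """Return the minimum number of sub-partitions accumulated over the retention period."""
    return folders_per_day * years * days_per_year


total = estimate_sub_partitions(FOLDERS_PER_DAY, RETENTION_YEARS)
print(total)  # 511000
```

So the lower bound is a little over 500K folders, in line with the estimate above.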
As long as your query has a filter on the partition/sub-partition, only those sub-partitions will be scanned, as the others will be pruned. Running a select * without a filter on such a PDS may not be the best idea.
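The idea of pruning can be illustrated with a small sketch. This is a conceptual toy, not Dremio's implementation: the object keys, the `data/` prefix, and the `prune` and `scan_all` helpers are all hypothetical, and only the `jobid-<n>` folder naming comes from the earlier post.

```python
import re

# Hypothetical object keys. The real folder layout is not shown in the thread,
# so everything except the "jobid-<n>" naming is invented for illustration.
KEYS = [
    "data/jobid-101/part-0.parquet",
    "data/jobid-101/part-1.parquet",
    "data/jobid-102/part-0.parquet",
    "data/jobid-103/part-0.parquet",
]

JOBID_PATTERN = re.compile(r"/jobid-(\d+)/")


def prune(keys, wanted_jobids):
    """Keep only the files whose jobid folder matches the filter."""
    kept = []
    for key in keys:
        match = JOBID_PATTERN.search(key)
        if match and int(match.group(1)) in wanted_jobids:
            kept.append(key)
    return kept


def scan_all(keys):
    """Without a filter, every file has to be read."""
    return list(keys)


filtered = prune(KEYS, {102})
print(len(filtered), "of", len(KEYS), "files read with a jobid filter")
print(len(scan_all(KEYS)), "of", len(KEYS), "files read without a filter")
```

With roughly 500K sub-partitions, the gap between the two cases is what makes an unfiltered select * expensive.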
If my reflection store is local, I have an S3 source with CSV data, and I create a raw reflection on the PDS, will the reflection be available in the C3 cache if caching is enabled?
In this scenario what is the difference between local reflection and C3 cache?
Not much of a gain, but it is strongly recommended to move the reflection store to distributed storage like S3: if you store reflections locally and an executor goes down, the reflections are no longer available.
The only downside of storing reflections on S3 is that there could be some latency, and that is why we need C3.
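The general idea of putting a local cache in front of remote storage can be sketched as follows. This is a generic read-through cache, not a description of how C3 works internally; the `ReadThroughCache` class, the `fake_s3_get` function, and the key name are all hypothetical.

```python
class ReadThroughCache:
    """Toy read-through cache: serve from local storage when possible, else fetch remotely."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable that performs the slow remote read
        self._local = {}                   # stands in for local disk on an executor
        self.remote_reads = 0              # counts how often the remote store was hit

    def read(self, key):
        if key not in self._local:
            # Cache miss: pay the remote latency once and keep a local copy.
            self._local[key] = self._fetch_remote(key)
            self.remote_reads += 1
        return self._local[key]


def fake_s3_get(key):
    # Stand-in for a remote object read; a real read would go over the network.
    return f"bytes-of-{key}"


cache = ReadThroughCache(fake_s3_get)
first = cache.read("reflection/block-0")
second = cache.read("reflection/block-0")
print(cache.remote_reads)  # 1
```

The durable copy stays in S3, so losing the local copy only costs a re-fetch. That is different from a local reflection store, where the local copy is the only one.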