Some newbie questions. I have set up Dremio in single-instance mode in AWS, pointing at delimited data in S3, with reflections stored locally.
Works great, except with a larger dataset: 2 billion rows, about 100 GB. Just kicking the tires on this, so I decided to see how it scales.
Questions:
Are there best practices for partitioning the data or the reflection? I noticed that creating the reflection is a single-threaded process; it's been running for two hours now on the large dataset. Should I break the underlying data into smaller files to speed up S3 access and get multiple processes running?
Is it better to create a reflection on the underlying data, or on a view that modifies values/types or unions the data? If a reflection is created on a table that already has a reflection, is that reflection leveraged when building the secondary reflection, say for performance as a union of smaller tables?
The underlying file is gzipped on S3; does that affect performance?
Is it better to place the reflection data on S3 – for portability if extending to a multi-node system? I suppose that would slow it down a bit.
Thank you -- interesting product. Just getting the hang of it.
If your data is a single gzipped JSON file, things will probably go faster with multiple smaller files. For example, I typically work with a 300GB CSV data set on S3 and building reflections takes about 20 min in a 4 node cluster.
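Since gzip is not a splittable format, each gzipped file is scanned by a single thread, so splitting one large file into many smaller gzipped parts lets the scan run in parallel. Here is a minimal sketch of doing that split before uploading to S3; the file names and rows-per-part value are placeholders you would tune for your data.

```python
# Minimal sketch: split one large gzipped CSV into smaller gzipped parts.
# SOURCE and ROWS_PER_PART are placeholders; aim for parts of a few hundred MB.
import gzip

SOURCE = "large_dataset.csv.gz"
ROWS_PER_PART = 10_000_000

with gzip.open(SOURCE, "rt") as src:
    header = src.readline()          # repeat the header in every part
    part, rows, out = 0, 0, None
    for line in src:
        if out is None or rows >= ROWS_PER_PART:
            if out:
                out.close()
            out = gzip.open(f"part_{part:04d}.csv.gz", "wt")
            out.write(header)
            part += 1
            rows = 0
        out.write(line)
        rows += 1
    if out:
        out.close()
```

After uploading the parts under a single S3 prefix, you can promote that folder to a dataset in Dremio and query it the same way you queried the single file.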
You can have a reflection on the raw data or the transformed data. This works well for “last mile ETL” - we recommend Hive or Spark for transformations that take hours to perform, for example.
If you have a reflection on the raw data, it can still typically be used to accelerate a virtual dataset derived from the raw data. OTOH, if your VDS is a filtered, transformed, or joined form of the raw data, then having that work done ahead of time in a reflection built on the VDS may be significantly more efficient.
Sibling VDSs can be accelerated by the same reflection if it is on their parent or an ancestor. It is also possible for a reflection on one VDS to accelerate a sibling. Dremio maintains a dependency graph of the relationships between datasets and uses this information at query time to find the lowest-cost query plan.
If you have related reflections, the dependency graph is also used to automatically optimize reflection maintenance. For example, updates are daisy-chained to minimize load on the source system, since updating one reflection from another is typically far more efficient than scanning the source data.
Typically you would store your reflections on a common file system so that all nodes have access. This allows you to scale compute and storage independently, among other benefits.
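As a concrete illustration (a sketch only, assuming the S3 distributed-store plugin; the bucket and folder are placeholders), the distributed store is set in dremio.conf via paths.dist, and every coordinator and executor needs access to it:

```
# dremio.conf (sketch): keep reflection/accelerator data on S3 so all nodes share it.
paths.dist: "dremioS3:///your-bucket-name/dremio/dist"
```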
Is there any way to track the progress of reflection builds? I am interested in knowing the total time taken to build the reflections and the percentage completion at a given point in time.
Every process that builds or updates a data reflection is tracked as a job. You can review the jobs for a data reflection to get a sense of how long they generally take for a given dataset.
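If you want to check on a build programmatically, here is a rough sketch that polls a job's state over Dremio's REST API; the host, credentials, and job id are placeholders, and the endpoints assume the v3 API is available in your version.

```python
# Rough sketch: poll the state of a Dremio job (e.g. a reflection build) via REST.
# Host, credentials, and JOB_ID are placeholders.
import time
import requests

HOST = "http://localhost:9047"

# Authenticate to get a token.
login = requests.post(f"{HOST}/apiv2/login",
                      json={"userName": "admin", "password": "changeme"})
headers = {"Authorization": "_dremio" + login.json()["token"]}

JOB_ID = "<job-id-from-the-jobs-page>"

start = time.time()
while True:
    job = requests.get(f"{HOST}/api/v3/job/{JOB_ID}", headers=headers).json()
    state = job.get("jobState")
    print(f"{state} after {time.time() - start:.0f}s")
    if state in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(30)
```

This gives you the total elapsed time once the job finishes.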