Can't setup a raw reflection on a virtual dataset out of memory

Data source: SQL Server 2016
Dremio install configuration: standalone

My team members and I are working with a data source that has well over 10s of billions of records. The data source is used for reporting and we want to optimize query execution times. We are evaluating Dremio and are coming across some issues optimizing virtual datasets with reflections.

We set Dremio Max Memory to 64gb. The virtual dataset in question joins on several tables that have raw reflections. Those child reflections appear to be optimized and we are happy with the query times. This virtual dataset is based off a query that returns around 17 billion records in the source database.

Dremio bombs when the input records reach around 200 million and throws out of memory exception.

We could increase the memory, but we’re not sure if that’s the real solution here. We installed dremio as standalone. Is it preferable to install in a clustered environment with containers? Would that solve this issue with out of memory and the large datasets we’re dealing with?

@raoufosman, if you want to perform large joins, then Dremio will require a large amount of direct memory on executor nodes. In production environments, we don’t suggest using a standalone (coordinator + executor in one JVM) because it’s difficult to guarantees on the memory available for planning and metadata operations (coordinator tasks) and query execution.

I suggests configuring once master coordinator and one executor and scaling the executors vertically and horizontally until you can complete the join.