Large join fails. Possible solutions?

Hello All,

I have 4 large datasets in parquet format and I would like to join them into a new virtual dataset. However, the join never completes because it runs out of RAM. The query looks something like this:

-- select a subset of the fields of all 4 tables
select ...
from transactions tr
inner join tickets ti on <join condition>
inner join events e on <join condition>
inner join users u on <join condition>

The cardinality of <table> is 200,000,000
The cardinality of <table> is 1,000,000
The cardinality of <table> is 300,000

Is there any way I can partition the datasets or create reflections that would make this join possible? I am thinking, for example, of creating 100,000 buckets for <the join key> and partitioning the tickets dataset based on those buckets. The same could be done for users, with a smaller number of buckets. Would those partitions help? If the transactions and tickets have a date field, maybe all datasets could be partitioned by date?
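To make the bucketing idea concrete, here is a minimal sketch in plain Python of what I mean (the column name `ticket_id` and the bucket count are just examples): rows from both sides that share a key always hash into the same bucket, so each bucket can be joined independently and the whole 200M-row table never has to fit in memory at once.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 100_000  # example value from above; would need tuning

def bucket_of(key: str) -> int:
    """Deterministic bucket for a join key: the same key always maps to
    the same bucket, so matching rows from both tables co-locate.
    crc32 is used instead of hash() because Python's hash() is salted
    per process and would not be stable across writers."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

def bucketwise_join(transactions, tickets):
    """transactions / tickets: iterables of dicts carrying a 'ticket_id'.
    Groups each side by bucket, then joins one bucket at a time."""
    tx_buckets, ti_buckets = defaultdict(list), defaultdict(list)
    for row in transactions:
        tx_buckets[bucket_of(row["ticket_id"])].append(row)
    for row in tickets:
        ti_buckets[bucket_of(row["ticket_id"])].append(row)
    for b in tx_buckets:
        # build a small hash index for just this bucket of tickets
        index = {t["ticket_id"]: t for t in ti_buckets.get(b, [])}
        for tx in tx_buckets[b]:
            if tx["ticket_id"] in index:
                yield {**tx, **index[tx["ticket_id"]]}
```

In reality each bucket would of course be a parquet partition on disk rather than an in-memory list; the sketch is only meant to show why co-partitioning both sides by the same key bounds the memory needed per bucket.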

What other methods are available, apart from adding more nodes to the cluster or adding more RAM? Even if the join succeeds, it would be terribly slow. How could the result set be sped up once the virtual dataset is created? What kind of reflection should be added? Just a raw reflection? I have read the docs, but it's not really clear to me how to perform, and then later speed up, this huge join.
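For the reflection part, what I had in mind is something along these lines, as far as I understand the reflection DDL from the docs (dataset and column names here are made up, and I may well have the syntax wrong): a raw reflection on the joined virtual dataset, partitioned by date and sorted on the hot join key.

```sql
-- hypothetical: raw reflection on the joined virtual dataset
ALTER DATASET myspace.big_join
CREATE RAW REFLECTION big_join_raw
USING DISPLAY (tx_id, ticket_id, event_id, user_id, tx_date, amount)
PARTITION BY (tx_date)
LOCALSORT BY (ticket_id)
```

Would a reflection like this actually be maintained without the same out-of-memory problem, given that refreshing it presumably has to run the same join?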