I have 4 large datasets in parquet format and I would like to join them and create a new virtual dataset out of them. However the join can never complete since it runs out of RAM. The query looks something like this.
– select a subset of the fields of all 4 tables
from transactions tr
inner join tickets ti on ti.id=tr.ticket_id
inner join events e on e.id=tr.event_id
inner join users u on u.id=ti.user_id
The cardinality of tickets.id is 200.000.000
The cardinality of users.id is 1.000.000
The cardinality of events.id is 300.000
Is there any way I can partition the datasets or create any reflections that could make this join possible ? I am thinking for example of creating 100.000 buckets for tickets.id and partitioning the tickets dataset based on those buckets. Same could be done for users with a smaller number of buckets. Would those partitions help ? If the transactions and tickets have a date field maybe all datasets could be partitioned by date ?
What other methods are there available apart from adding more nodes to the cluster or adding more ram ? Even if the join succeeds it would be terribly slow. How could the result set be sped up if the virtual dataset could be created ? What kind of reflection should be added ? Just a raw reflection ? I have read the docs but its not really clear to me how to peform and then later speed up this huge join.