Large join fail. Possible solutions

traboukos · August 21, 2019, 1:57pm

Hello All,

I have 4 large datasets in parquet format and I would like to join them and create a new virtual dataset out of them. However the join can never complete since it runs out of RAM. The query looks something like this.

select
– select a subset of the fields of all 4 tables
from transactions tr
inner join tickets ti on ti.id=tr.ticket_id
inner join events e on e.id=tr.event_id
inner join users u on u.id=ti.user_id

The cardinality of tickets.id is 200.000.000
The cardinality of users.id is 1.000.000
The cardinality of events.id is 300.000

Is there any way I can partition the datasets or create any reflections that could make this join possible ? I am thinking for example of creating 100.000 buckets for tickets.id and partitioning the tickets dataset based on those buckets. Same could be done for users with a smaller number of buckets. Would those partitions help ? If the transactions and tickets have a date field maybe all datasets could be partitioned by date ?

What other methods are there available apart from adding more nodes to the cluster or adding more ram ? Even if the join succeeds it would be terribly slow. How could the result set be sped up if the virtual dataset could be created ? What kind of reflection should be added ? Just a raw reflection ? I have read the docs but its not really clear to me how to peform and then later speed up this huge join.

Thanks,
@traboukos

Topic		Replies	Views
Understanding how the parquet reader works and a suggestion	10	3422	May 23, 2018
Dremio Join of large table in S3 and RedShift	1	1037	December 20, 2021
How to Optimize Query Performance for Large Datasets in Dremio Dremio Cloud	1	417	June 21, 2024
Aggregation Reflection Limits	6	2133	October 16, 2019
Extremely slow query performance under concurrency	9	2822	November 13, 2020

Large join fail. Possible solutions

Related topics