Out of memory on relatively small join

Environment:

  • Dremio-oss v22.0 in Docker
  • Host: Windows 10 Pro
  • Environment changes in docker-compose.yaml file:
    - DREMIO_MAX_MEMORY_SIZE_MB=16384
    - DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=12288

Scenario:

  • Processing two directories A and B, containing subdirectories with json files
  • Total size of all files in both directories is 12 MB, but the individual files have a relatively deep hierarchical structure
  • I want to extract some information out of the json files
  • Directory A: successfully created a query that extracts some info from the json and still keeps the original json in a column
  • Directory B: successfully created a query that extracts some info from the json and keeps a part of the original json in a column

Issue:

  • I then want to join both queries on a shared key
  • This fails with the message “Query was cancelled because it exceeded the memory limits set by the administrator.”

Rephrased join query:

The join query is essentially this one (I renamed tables and fields):

SELECT A.procedure_name AS procedure_name,
       A.jsondoc AS jsondoc,
       A.jsonref AS jsonref,
       B.jsondoc2 AS jsondoc2
FROM A
LEFT OUTER JOIN B ON A.jsonref = B.jsonref

Question:

Given the relatively small size of those jsons and the comparatively large amount of available memory (16 GB), I expect the issue to have another origin. How can we debug this?
Both sides of the join contain about 500 records with an overall size of 12 MB.

Extra info:

  • The first directory with jsons required us to disable the support key exec.operator.copier.complex.vectorize
  • The second directory with jsons required us to set the support key store.json.all_text_mode=true
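For reference, support keys like these can also be changed from the SQL console instead of the Admin > Support UI. A sketch, assuming the ALTER SYSTEM SET syntax available in our Dremio version:

```sql
-- Disable vectorized copying of complex types (workaround for directory A)
ALTER SYSTEM SET "exec.operator.copier.complex.vectorize" = false;

-- Read all JSON scalar values as text (workaround for directory B)
ALTER SYSTEM SET "store.json.all_text_mode" = true;
```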

@qootec Are you running out of heap or direct? Can we please have the profile?

Thanks for your reply.

To answer your question, I was about to retry today.
I see, though, that the schema of the jsons I receive is not stable.
This is not an “evolution”, but rather different kinds of jsons.
I only need to process one of these kinds, and I could filter textually on the occurrence of a given string to make sure I only process the files I need.

This might also avoid the issue I submitted, so I think it would be good to get that working first.

I do not immediately see how to configure Dremio to first filter the offered jsons “textually” and only then start interpreting the schema of the files that satisfy the filter.

Obviously, I can filter them before offering them to Dremio, but that would require an additional processing/ETL step at the location where they are stored.

Any suggestions?

Hi Balaji,

My reaction could give you the impression that I’m “too lazy” to filter the relevant files manually. That is not the problem :slight_smile:

I would need to know whether Dremio would be capable of determining its schema in a filtered way:

  • Schema 1 on files that satisfy some filter A
  • Schema 2 on files that satisfy some filter B

The customer will store this mix of differently structured jsons and will keep doing so. I would need to make sure Dremio doesn’t get confused during schema learning. Additional ETL is obviously possible, but it would be much better to avoid it.
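To make the question concrete, what I have in mind is roughly this sketch. The names (myspace, mysource.mixed_jsons, kind_a_marker) are hypothetical, and it assumes store.json.all_text_mode is on so the mixed directory can be read as a single dataset:

```sql
-- Hypothetical sketch: one physical dataset over the mixed directory,
-- split into two VDSes by a field that only one kind of JSON contains.
CREATE VDS myspace.kind_a AS
SELECT * FROM mysource.mixed_jsons
WHERE kind_a_marker IS NOT NULL;   -- filter A: field present only in kind A

CREATE VDS myspace.kind_b AS
SELECT * FROM mysource.mixed_jsons
WHERE kind_a_marker IS NULL;       -- filter B: everything else
```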

How do you see this?

Thanks for your time,
Johan

I assume the query will be

schema1 join schema2 on column(s), with additional filters on columns

You should create a VDS containing just the join, then create aggregate/raw reflections on that VDS so that at runtime only the filter needs to be applied.
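A minimal sketch of that suggestion, reusing the rephrased join from above. The space and dataset names are hypothetical, and the ALTER DATASET ... CREATE RAW REFLECTION form should be checked against the docs for your Dremio version:

```sql
-- VDS containing just the join
CREATE VDS myspace.joined AS
SELECT A.procedure_name, A.jsondoc, A.jsonref, B.jsondoc2
FROM myspace.A
LEFT OUTER JOIN myspace.B ON A.jsonref = B.jsonref;

-- Raw reflection on the VDS, so runtime queries only apply the filter
ALTER DATASET myspace.joined
CREATE RAW REFLECTION joined_raw
USING DISPLAY (procedure_name, jsondoc, jsonref, jsondoc2);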