Out of memory on relatively small join

Environment:

  • Dremio-oss v22.0 in Docker
  • Host: Windows 10 Pro
  • Environment changes in docker-compose.yaml file:
    - DREMIO_MAX_MEMORY_SIZE_MB=16384
    - DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=12288

Scenario:

  • Processing two directories A and B, containing subdirectories with json files
  • Total size of all files in both directories is 12 MB, but the individual files have a relatively deep hierarchical structure
  • I want to extract some information out of the json files
  • Directory A: successfully created a query that extracts some info from the json and still keeps the original json in a column
  • Directory B: successfully created a query that extracts some info from the json and keeps a part of the original json in a column

Issue:

  • I then want to join both queries on a shared key
  • This fails with the message “Query was cancelled because it exceeded the memory limits set by the administrator.”

Rephrased join query:

The join query is essentially this one (I renamed tables and fields):

SELECT A.procedure_name AS procedure_name,
       A.jsondoc AS jsondoc,
       A.jsonref AS jsonref,
       B.jsondoc2 AS jsondoc2
FROM A
LEFT OUTER JOIN B ON A.jsonref = B.jsonref

Question:

Given the relatively small size of those jsons and the comparatively large amount of available memory (16 GB), I expect the issue to have another origin. How can we debug this?
Both sides of the join contain about 500 records with an overall size of 12 MB.

Extra info:

  • The first directory with jsons required us to disable the support key exec.operator.copier.complex.vectorize
  • The second directory with jsons required us to set the support key store.json.all_text_mode=true
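For reference, support keys like these can also be changed from the SQL console instead of the Admin > Support UI. A sketch, assuming the ALTER SYSTEM SET syntax available in our Dremio version:

```sql
-- Disable vectorized copying of complex types (workaround for directory A)
ALTER SYSTEM SET "exec.operator.copier.complex.vectorize" = false;

-- Read all JSON scalar values as text (workaround for directory B)
ALTER SYSTEM SET "store.json.all_text_mode" = true;
```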

@qootec Are you running out of heap or direct? Can we please have the profile?

Thanks for your reply.

To answer your question, I was about to retry today.
I see, though, that the schema of the jsons I receive is not stable.
This is not an “evolution”, but rather different kinds of jsons.
I only need to process one of these kinds, and I could filter textually on the occurrence of a given string to make sure I only process the files I need.

This might also avoid the issue I submitted, so I think it would be good to get that working first.

I do not immediately see how to configure Dremio to first filter the offered jsons “textually” and only then start interpreting the schema of the files that satisfy the filter.

Obviously, I can filter them before offering them to Dremio, but that would require an additional processing/ETL step at the location where they are stored.

Any suggestions?

Hi Balaji,

My reaction could give you the impression that I’m “too lazy” to filter the relevant files manually. That is not the problem :slight_smile:

I would need to know whether Dremio would be capable of determining its schema in a filtered way:

  • Schema 1 on files that satisfy some filter A
  • Schema 2 on files that satisfy some filter B

The customer will store this mix of differently structured jsons and will keep doing so. I would need to make sure Dremio doesn’t get confused during schema learning. Additional ETL is obviously possible, but it would be much better to avoid it.
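To make the question concrete, what I have in mind is roughly this sketch. The names (myspace, mysource.mixed_jsons, kind_a_marker) are hypothetical, and it assumes store.json.all_text_mode is on so the mixed directory can be read as a single dataset:

```sql
-- Hypothetical sketch: one physical dataset over the mixed directory,
-- split into two VDSes by a field that only one kind of JSON contains.
CREATE VDS myspace.kind_a AS
SELECT * FROM mysource.mixed_jsons
WHERE kind_a_marker IS NOT NULL;   -- filter A: field present only in kind A

CREATE VDS myspace.kind_b AS
SELECT * FROM mysource.mixed_jsons
WHERE kind_a_marker IS NULL;       -- filter B: everything else
```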

How do you see this?

Thanks for your time,
Johan

I assume the query will be

schema1 join schema2 on column(s), with additional filters on columns

You should create a VDS containing just the join, then create aggregate/raw reflections on that VDS so that at runtime only the filter needs to be applied.
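A minimal sketch of that suggestion, reusing the rephrased join from above. The space and dataset names are hypothetical, and the ALTER DATASET ... CREATE RAW REFLECTION form should be checked against the docs for your Dremio version:

```sql
-- VDS containing just the join
CREATE VDS myspace.joined AS
SELECT A.procedure_name, A.jsondoc, A.jsonref, B.jsondoc2
FROM myspace.A
LEFT OUTER JOIN myspace.B ON A.jsonref = B.jsonref;

-- Raw reflection on the VDS, so runtime queries only apply the filter
ALTER DATASET myspace.joined
CREATE RAW REFLECTION joined_raw
USING DISPLAY (procedure_name, jsondoc, jsonref, jsondoc2);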