Environment changes in docker-compose.yaml file:
- DREMIO_MAX_MEMORY_SIZE_MB=16384
- DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=12288
Scenario:
Processing two directories A and B, each containing subdirectories with json files
Total size of all files in both directories = 12 MB, but the individual files have a relatively deep hierarchical structure
I want to extract some information out of the json files
Directory A: successfully created a query that extracts some info from the json and still keeps the original json in a column
Directory B: successfully created a query that extracts some info from the json and keeps a part of the original json in a column
Issue:
I then want to join both queries on a shared key
This fails with the message “Query was cancelled because it exceeded the memory limits set by the administrator.”
Rephrased join query:
The join query is essentially this one (I renamed tables and fields):
SELECT A.procedure_name AS procedure_name,
       A.jsondoc AS jsondoc,
       A.jsonref AS jsonref,
       B.jsondoc2 AS jsondoc2
FROM A
LEFT OUTER JOIN B ON A.jsonref = B.jsonref
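To narrow down which part of the plan exhausts memory, a first step could be to run each side on its own and then a stripped-down join (a sketch, using the renamed table and column names from the query above):

```sql
-- Check that each side really is small before joining
SELECT COUNT(*) FROM A;
SELECT COUNT(*) FROM B;

-- Join on the key alone, leaving out the wide json columns,
-- to see whether the json payloads are what pushes the join over the limit
SELECT A.procedure_name, A.jsonref
FROM A
LEFT OUTER JOIN B ON A.jsonref = B.jsonref;
```

If the key-only join succeeds but adding jsondoc / jsondoc2 back makes it fail, the wide nested columns, not the row count, are the likely culprit.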
Question:
Given the relatively small size of those jsons and the comparatively large amount of available memory (16 GB), I expect the issue to have another origin. How can we debug this?
Both sides of the join contain about 500 records with overall size of 12 MB.
Extra info:
The first directory of jsons required us to disable the support key exec.operator.copier.complex.vectorize.
The second directory of jsons required us to set the support key store.json.all_text_mode=true.
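For reference, and assuming administrator privileges, both support keys can also be toggled from the SQL console instead of the Support settings page (a sketch following Dremio's ALTER SYSTEM syntax):

```sql
-- Disable vectorized copying of complex types (needed for the first directory)
ALTER SYSTEM SET "exec.operator.copier.complex.vectorize" = false;

-- Read all json leaf values as text instead of inferring types (needed for the second directory)
ALTER SYSTEM SET "store.json.all_text_mode" = true;
```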
In order to answer your question, I was about to retry today.
I see though that the schema of the jsons that I receive is not stable.
This is not an “evolution”, but rather different kinds of jsons.
I only need to process one of these kinds, and I could filter textually on the occurrence of a given string in the files to make sure that I only process the ones I need.
Potentially this also avoids the issue that I submitted, so I think it would be good to get that done first.
I do not immediately see how I can configure Dremio to first “textually” filter the offered jsons and only afterwards interpret the schema of the files that satisfy the filter.
Obviously, I can filter them before offering them to Dremio, but that would require an additional kind of processing / ETL at the location where they are stored.
My reaction could give you the impression that I’m “too lazy” to filter the relevant files manually. That is not the problem.
I would need to know whether Dremio is capable of determining the schema in a filtered way:
Schema 1 on files that satisfy some filter A
Schema 2 on files that satisfy some filter B
The customer will keep storing this mix of differently structured jsons. So I would need to make sure Dremio doesn’t get confused during schema inference. Additional ETL is obviously possible, but it would be much better to avoid it.
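One possible workaround, sketched under the assumption that store.json.all_text_mode stays enabled and that the wanted kind of json carries a distinguishing field: with everything read as text, differently typed fields cannot clash during inference, so a view can filter down to one kind before any further interpretation. The source path mysource.dirB, the space myspace, and the column name kind are hypothetical placeholders:

```sql
-- Hypothetical: a view that keeps only the wanted kind of json,
-- so downstream queries never touch the other structures.
CREATE VDS myspace.only_wanted AS
SELECT *
FROM mysource.dirB
WHERE kind = 'wanted-kind';
```

Note that this does not pre-filter at the file level: Dremio still scans and infers over all files underneath, which is why keeping all_text_mode on matters here. If the payload of interest is itself stored as a json string, CONVERT_FROM(payload, 'JSON') could then parse it after the filter.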