Null values are not supported in lists by default. Please set `store.json.all_text_mode` to true to read lists containing nulls. Be advised that this will treat JSON null values as a string containing the word 'null'

I downloaded this file (about 3GB):
https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.json?accessType=DOWNLOAD

Then I tried to open it in Dremio, and this error message was shown:
Null values are not supported in lists by default. Please set store.json.all_text_mode to true to read lists containing nulls. Be advised that this will treat JSON null values as a string containing the word ‘null’.

Can you please advise how I can resolve this issue?

If you scroll to the bottom of http://host:9047/admin/advanced under “Dremio Support”, there will be an area to copy/paste that setting > click Show > then toggle it

Thank you so much, I was able to turn this option on
However, Dremio is still unable to load this file; it shows the error message: “Error parsing JSON - Unable to expand the buffer”
Have you ever tried to open a 3GB file? Is a computer with 16GB of memory not enough?

How do you have Dremio deployed? Windows app? Linux server?

I used Linux, specifically Ubuntu 16

This may need further investigation. A 16GB single node is indeed a bit small, and keep in mind some of that goes to heap, not direct memory. Note that we have plenty of users working with larger files, though. If I have some free time, maybe I’ll try to load the dataset as well…

So is there any option to change the heap size for Dremio?

Yes, you can change the dremio-env file. Here is the documentation:
http://docs.dremio.com/deployment/dremio-config.html#environment-setup

Thank you, but this option doesn’t resolve the problem.
I changed
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=12000
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=12000

I checked the RAM, and it looks to me like Dremio isn’t using it for processing; MemFree stays at 12645252 kB

Did you restart the node after making the change? That is required.

Yes, I did, of course

My first comment here is that, given your 16GB of RAM, those settings are too high: together they ask for 24GB of memory on a 16GB machine. Try setting them to 8GB each and see what happens.
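For example, in dremio-env (values are in MB; this assumes you leave everything else at the defaults):

```
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=8192
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192
```

And remember to restart the node again after the change.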

If that still doesn’t work, it would be interesting to see what happens if you bisect the file into two 1.5GB pieces. I’d like to get a better idea of when this becomes an issue.

Would you mind spending a little time loading this file to make sure it works on your machine (a single machine)?
https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.json?accessType=DOWNLOAD
It doesn’t make sense that an analytics tool cannot process a 3GB file, since we’re working with big data
Thank you so much

Hey Hai,

I can see the issue. It’s due to a very deep and wide schema in the JSON file; whilst trying to schema-learn, it’s coming up against an internal buffer limit, which doesn’t seem to be settable via the UI.

Let me talk to support. I’ll report back once I have more news.

Christy

Hey Hai,

Looking at the JSON file again, I noticed that the file is actually a single Object.

Dremio is fundamentally a “row”-based technology. Essentially, it wants an array of objects, so it can treat each entry as a row. Here, Dremio is trying to fit the entire file into a single row and schema-learn across the entire file.

I would suggest that in this instance, some pre-processing of the file is required to extract the data you want and turn it into a new file that is an array of objects. For example, I notice there is a large meta section at the start of the file. You could strip this out and instead take the actual “row” data (found later in the file as the “data” property) and use that to create a new file, as in the sketch below.
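If it helps, here’s a minimal sketch of that pre-processing in Python. It assumes the file follows the usual Socrata rows.json layout, with column definitions under meta.view.columns and the rows as arrays under data; verify that against your copy before relying on it. It streams the file with the third-party ijson parser, so the full 3GB never has to sit in memory, and writes one JSON object per line, which Dremio can read as one record per row:

```python
import json
import ijson  # streaming JSON parser: pip install ijson

SRC = "rows.json"    # the downloaded file
DST = "rows.ndjson"  # output: one JSON object per line

# Pass 1: pull the column names out of the meta section.
with open(SRC, "rb") as f:
    columns = [c["name"] for c in ijson.items(f, "meta.view.columns.item")]

# Pass 2: stream the "data" array and emit each row as its own object,
# e.g. {"ID": ..., "Case Number": ..., ...} on a single line.
with open(SRC, "rb") as f, open(DST, "w") as out:
    for row in ijson.items(f, "data.item"):
        out.write(json.dumps(dict(zip(columns, row))) + "\n")
```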

Thank you so much, Christy; that’s very informative.
By the way, can you please advise which large public dataset I should use to show our data science team Dremio’s ability to work with big data files?
Thank you so much

I’m glad I could help :smiley:

I often use the Yelp data: https://www.yelp.com/dataset

It’s also a few GB in size and spread over multiple JSON files, so you can show joins too.
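For instance, once you’ve added the business and review files as datasets, a quick demo query could look something like this (the dataset paths are just illustrative; adjust them to wherever the files live in your source):

```sql
SELECT b.name, COUNT(*) AS review_count
FROM yelp.review AS r
JOIN yelp.business AS b ON r.business_id = b.business_id
GROUP BY b.name
ORDER BY review_count DESC
LIMIT 10
```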

Hope this helps :slight_smile:

Here is a tutorial that uses Yelp, including instructions on how to make the data available:

Also, here’s a discussion on the multi-record JSON file format Christy mentioned: Feature requests + Bug reporting

I changed the dataset and saw that Dremio on an AWS t2.medium instance can load a 4GB file perfectly
Thank you so much
