Can Dremio actually process Big Data?

I just created about 10,361 JSON files, together totaling about 1.2 MB of data in HDFS.
Then I navigated to the HDFS folder in Dremio via 'Data Settings' and set the format to JSON, but Dremio was unable to load all the data in the folder. It showed the error message 'Schema change detected but unable to learn schema, query failed. A full table scan may be necessary to fully learn the schema.'
Loading the files one by one works fine.
I used a t2.xlarge on AWS, which has 16 GB of RAM.

Note: using the 'Unknown' and 'Text (delimited)' formats produces the same error.
The log file is below:
log.zip (4.4 KB)

Are there multiple JSON records per document or a single very large record?

If multiple records, how are they delimited? Perhaps you can share one here?

Just one record per file; the format of the record is shown below:
{
  "type": "Metric",
  "metric": "sys.cpu.user",
  "tags": {
    "host": "web01"
  },
  "timestamp": 1492641000,
  "value": 42
}
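
For reference, here is a minimal sketch of how such a batch of single-record files could be generated (plain Python writing to a local directory; the original test wrote to HDFS, and the "metrics" directory name here is made up):

import json
import os

OUT_DIR = "metrics"  # hypothetical output directory
N_FILES = 10_361     # roughly the file count from the test above

os.makedirs(OUT_DIR, exist_ok=True)
for i in range(N_FILES):
    # One record per file, matching the format shown above.
    record = {
        "type": "Metric",
        "metric": "sys.cpu.user",
        "tags": {"host": "web01"},
        "timestamp": 1492641000 + i,
        "value": 42,
    }
    with open(os.path.join(OUT_DIR, f"metric_{i:05d}.json"), "w") as f:
        json.dump(record, f)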

We have no issues reading/parsing the sample you provided. Are you trying to create a source from a directory? What I believe is happening is that some JSON files in the directory have a completely different schema from the one you provided, and you may be trying to group them together.

Walking through 10K files to confirm that looks like a big effort.
I will delete all of them tomorrow and test again.

Are you seeing the error in the dialog where you choose the formatting (JSON, etc.)? You can actually press Save there; for performance reasons, that preview query does not handle schema relearning.

Each time Dremio discovers a file with a new schema, it has to rescan all files, so if many files have different schemas this can become slow.
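
To get a feel for the cost at this scale, here is a toy model based only on the rescan behavior just described (not on Dremio's actual code):

# If every newly discovered schema forces a rescan of all n files, then k
# distinct schemas among n files cost on the order of k * n file reads.
def estimated_reads(n_files: int, n_schemas: int) -> int:
    return n_schemas * n_files

print(estimated_reads(10_361, 2))  # 20722 reads for just two schemas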

Loading more than 1K files works fine.
I will try to reproduce the error with more than 10K files.
Update:
2K files also load fine.

Loading more than 4K files also works fine.
Then I decided to change the schema, from:
{"type": "Metric", "metric": "Temperature", "timestamp": 1529294232334.94, "value": "20.7225", "tags": {"device_id": "xxx-usdfs-xxdfs", "device_name": "My Sensor 1"}}
to:
{"type": "Metric", "metric": "Temperature", "timestamp": 1529294318142.44, "value": "21.79", "tags": {"host": "xxx-usdfs-xxdfs"}}
and the load failed.
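
For what it's worth, the difference between the two records is easy to see programmatically. Below is a minimal sketch (plain Python, nothing Dremio-specific) that extracts each record's set of key paths and prints what changed:

def schema_signature(obj, prefix=""):
    """Recursively collect the set of key paths present in a JSON object."""
    keys = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            path = f"{prefix}.{k}" if prefix else k
            keys.add(path)
            keys |= schema_signature(v, path)
    return keys

old = {"type": "Metric", "metric": "Temperature", "timestamp": 1529294232334.94,
       "value": "20.7225",
       "tags": {"device_id": "xxx-usdfs-xxdfs", "device_name": "My Sensor 1"}}
new = {"type": "Metric", "metric": "Temperature", "timestamp": 1529294318142.44,
       "value": "21.79", "tags": {"host": "xxx-usdfs-xxdfs"}}

# The symmetric difference is exactly the set of key paths that changed,
# which is what trips the schema-change detection.
print(schema_signature(old) ^ schema_signature(new))
# {'tags.device_id', 'tags.device_name', 'tags.host'}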

Would it make sense to notify the user which files have a new schema, so they can handle those files more efficiently instead of manually validating them one by one?

Hi,
if you need to obtain the schema of your JSON data set, you can use BaseX, a great piece of software.
It's a native XML database that can ingest JSON files. After ingestion, you will have access to the schema of your dataset (specifically, the schema of the collection created from it). Using the GUI is easy.
Hope it helps.

Best

Yes, but it takes time to find out which files have a different schema; in this example there are more than 10K files.
What I want is just a notification, so the user can quickly see which files have a different schema. After that, they can use whatever tools they like to pre-process those files.
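
In the meantime, a report like this is easy enough to script outside Dremio. Here is a minimal sketch (plain Python; the "metrics" directory is a made-up example, and files in HDFS would first need to be copied down or accessed through a mount) that groups files by schema and prints the outliers:

import json
from collections import defaultdict
from pathlib import Path

def schema_signature(obj, prefix=""):
    """Recursively collect the set of key paths present in a JSON object."""
    keys = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            path = f"{prefix}.{k}" if prefix else k
            keys.add(path)
            keys |= schema_signature(v, path)
    return frozenset(keys)

# Group files by their key-path signature.
groups = defaultdict(list)
for path in Path("metrics").glob("*.json"):
    with open(path) as f:
        groups[schema_signature(json.load(f))].append(path)

# Report every file whose schema differs from the most common one.
majority = max(groups, key=lambda sig: len(groups[sig]))
for sig, files in groups.items():
    if sig != majority:
        print(f"{len(files)} file(s) deviate from the majority schema:")
        for p in files[:10]:
            print("  ", p)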
