Can Dremio actually process Big Data?

I just created about 10,361 JSON files, together totaling about 1.2 MB of data in HDFS.
Then I navigated to the HDFS folder in Dremio via 'Data Settings' and set the format to JSON, but Dremio was unable to load all the data in the folder. It showed the error message 'Schema change detected but unable to learn schema, query failed. A full table scan may be necessary to fully learn the schema.'
Loading the files one by one works fine.
I used a t2.xlarge on AWS, which has 16 GB of RAM.

Note: using the 'Unknown' and 'Text (delimited)' formats produces the same error.
The log file is below:
log.zip (4.4 KB)

Are there multiple JSON records per document or a single very large record?

If multiple records, how are they delimited? Perhaps you can share one here?

Just one record per file; the format of the record is shown below:
{
  "type": "Metric",
  "metric": "sys.cpu.user",
  "tags": {
    "host": "web01"
  },
  "timestamp": 1492641000,
  "value": 42
}
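
For reference, here is a minimal sketch of how such a batch of single-record files could be generated (plain Python writing to a local directory; the original test wrote to HDFS, and the "metrics" directory name here is made up):

import json
import os

OUT_DIR = "metrics"  # hypothetical output directory
N_FILES = 10_361     # roughly the file count from the test above

os.makedirs(OUT_DIR, exist_ok=True)
for i in range(N_FILES):
    # One record per file, matching the format shown above.
    record = {
        "type": "Metric",
        "metric": "sys.cpu.user",
        "tags": {"host": "web01"},
        "timestamp": 1492641000 + i,
        "value": 42,
    }
    with open(os.path.join(OUT_DIR, f"metric_{i:05d}.json"), "w") as f:
        json.dump(record, f)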

We have no issues reading/parsing the sample you provided. Are you trying to create a source from a directory? What I believe is happening is that some JSON files in the directory have a completely different schema from the one you provided, and you may be trying to group them together.

Walking through 10K files to confirm that looks like a big effort.
I will delete all of them tomorrow and test again.

Are you seeing the error in the dialog where you choose the formatting (JSON, etc.)? You can actually press Save there; for performance reasons, that preview query does not handle schema relearning.

Each time Dremio discovers a file with a new schema, it has to rescan all files, so if many files have different schemas this can become slow.
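
To get a feel for the cost at this scale, here is a toy model based only on the rescan behavior just described (not on Dremio's actual code):

# If every newly discovered schema forces a rescan of all n files, then k
# distinct schemas among n files cost on the order of k * n file reads.
def estimated_reads(n_files: int, n_schemas: int) -> int:
    return n_schemas * n_files

print(estimated_reads(10_361, 2))  # 20722 reads for just two schemas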

Loading more than 1K files works fine.
I will try to reproduce the error with more than 10K files.
Update:
2K files also load fine.

Loading more than 4K files also works fine.
Then I decided to change the schema, from:
{"type": "Metric", "metric": "Temperature", "timestamp": 1529294232334.94, "value": "20.7225", "tags": {"device_id": "xxx-usdfs-xxdfs", "device_name": "My Sensor 1"}}
to:
{"type": "Metric", "metric": "Temperature", "timestamp": 1529294318142.44, "value": "21.79", "tags": {"host": "xxx-usdfs-xxdfs"}}
and the load failed.
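
For what it's worth, the difference between the two records is easy to see programmatically. Below is a minimal sketch (plain Python, nothing Dremio-specific) that extracts each record's set of key paths and prints what changed:

def schema_signature(obj, prefix=""):
    """Recursively collect the set of key paths present in a JSON object."""
    keys = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            path = f"{prefix}.{k}" if prefix else k
            keys.add(path)
            keys |= schema_signature(v, path)
    return keys

old = {"type": "Metric", "metric": "Temperature", "timestamp": 1529294232334.94,
       "value": "20.7225",
       "tags": {"device_id": "xxx-usdfs-xxdfs", "device_name": "My Sensor 1"}}
new = {"type": "Metric", "metric": "Temperature", "timestamp": 1529294318142.44,
       "value": "21.79", "tags": {"host": "xxx-usdfs-xxdfs"}}

# The symmetric difference is exactly the set of key paths that changed,
# which is what trips the schema-change detection.
print(schema_signature(old) ^ schema_signature(new))
# {'tags.device_id', 'tags.device_name', 'tags.host'}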

Would it make sense to notify the user which files have a new schema, so they can handle those files more efficiently instead of manually validating them one by one?

Hi,
if you need to obtain the schema of your JSON data set, you can use BaseX, a great piece of software.
It's a native XML database that can ingest JSON files. After ingestion, you will have access to the schema of your dataset (specifically, the schema of the collection created from it). Using the GUI is easy.
Hope it helps.

Best

Yes, but it takes time to find out which files have a different schema; in this example there are more than 10K files.
What I want is just a notification, so the user can quickly see which files have a different schema. After that, they can use whatever tools they like to pre-process those files.
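
In the meantime, a report like this is easy enough to script outside Dremio. Here is a minimal sketch (plain Python; the "metrics" directory is a made-up example, and files in HDFS would first need to be copied down or accessed through a mount) that groups files by schema and prints the outliers:

import json
from collections import defaultdict
from pathlib import Path

def schema_signature(obj, prefix=""):
    """Recursively collect the set of key paths present in a JSON object."""
    keys = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            path = f"{prefix}.{k}" if prefix else k
            keys.add(path)
            keys |= schema_signature(v, path)
    return frozenset(keys)

# Group files by their key-path signature.
groups = defaultdict(list)
for path in Path("metrics").glob("*.json"):
    with open(path) as f:
        groups[schema_signature(json.load(f))].append(path)

# Report every file whose schema differs from the most common one.
majority = max(groups, key=lambda sig: len(groups[sig]))
for sig, files in groups.items():
    if sig != majority:
        print(f"{len(files)} file(s) deviate from the majority schema:")
        for p in files[:10]:
            print("  ", p)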
