I’ve been reading a bit about Dremio and working through the basics, but there are two things I can’t figure out whether they’re possible.
Currently running on Windows, but connecting to HDP running on a Linux box, so the Windows bit is temporary.
My general HDF looks something like this:
//*.json
When I create a datasource for this, I can see ProjectName and EventName as dir0 and dir1, but I’m curious if it’s possible to add the filename that the row was found in as a column?
My second question is that my json files look like:
{"DocumentId":123, "Document": "{"Column1":"A", "Column2":"B"}"}
When I query, I get two columns, as I’d expect, but is it possible to pull that inner json out into columns? Googling this, I see the FLATTEN command recommended, but that doesn’t work in this scenario (and doesn’t create additional columns). Any thoughts on how to do that, other than reformatting the data?
Hey Ahamilton, I’m going to take a stab at your second question. You can create columns out of your nested field by “Extracting” the field that you want to turn into a column. Take a look at the “Extracting map elements” section from this tutorial on data curation with Dremio.
I also cover that and other tricks in our Data Consumers course in Dremio U.
Hope that helps!
I’m not sure this is what I’m looking for, to be honest.
The problem I’m having is that the ACTUAL document is nested inside a wrapper container. So I need Dremio to look at the internal document as the document, not the outside.
I could pull out and extract each column individually, but I’d have to do that for hundreds of columns in hundreds of files. And this is a task that Dremio is already doing, because it works for the external wrapper just fine.
In theory, I could use regular expressions to get it so that my first column is just the nested json, but then it’s still all in one column. I need to tell Dremio to process THAT as if it were the entire json document.
At this point I’ll probably just remove the wrapper before saving the json data, but it seems like this is something that should be do-able, and that tutorial doesn’t cover what I’m asking, unfortunately.
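For anyone going the same route, here is a minimal sketch of that "remove the wrapper before saving" step in Python. It assumes each record sits on its own line and that the Document value is a JSON-encoded string with properly escaped quotes; the function names (unwrap_record, unwrap_lines) are made up for illustration and are not part of Dremio.

```python
import json


def unwrap_record(line, id_field="DocumentId", doc_field="Document"):
    """Parse one wrapper record and return the inner document as a flat dict."""
    wrapper = json.loads(line)
    inner = wrapper[doc_field]
    # Assumption: the inner document is stored as a JSON-encoded string.
    # If it is already a nested object, it is used as-is.
    if isinstance(inner, str):
        inner = json.loads(inner)
    # Keep the wrapper's id next to the inner columns. An inner key with
    # the same name as id_field would overwrite the wrapper's value.
    return {id_field: wrapper[id_field], **inner}


def unwrap_lines(lines):
    """Yield one flattened JSON string per non-empty input line."""
    for line in lines:
        if line.strip():
            yield json.dumps(unwrap_record(line))


# Sample record shaped like the one in the question, with escaped inner quotes.
sample = r'{"DocumentId": 123, "Document": "{\"Column1\": \"A\", \"Column2\": \"B\"}"}'
print(next(unwrap_lines([sample])))
```

Writing the output of unwrap_lines back out, one record per line, should give files where DocumentId, Column1 and Column2 are all top-level fields, so the hundreds of inner columns would not need to be extracted one by one.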