I’ve been kicking the tires on Dremio and I keep running into various IndexOutOfBoundsExceptions.
I have a nested parquet dataset, 290+MM records of family, List of individuals, List of other topic areas (e.g., autos). The dataset spans 8 shards and roughly 40 parquet files, about 1.2 TB of data in total.
I’ve created a flattened version of individuals that contains just the familyid and the Individuals struct.
When I query from this view,
SELECT count(distinct individual['Individualid']) FROM flat_ind
I get an IndexOutOfBoundsException. I get this with any query that touches a column inside ‘individual’.
My first thought was that this must be a bad file in the dataset. So, I uploaded all of it to a bucket on Google and attached the parquet as an external table in BigQuery. Though BigQuery support for Parquet is strange (it inserts an extra level of struct into the data structure), I was able to execute select count(distinct) using the very same parquet files (roughly 5 seconds to run).
Also, we have a 1% sample dataset with the exact same schema, and queries against it run flawlessly.
Attached is the profile for the query. I’m hoping that Dremio can be a central hub for our datasets. But this is a simple use case and it only gets much more complicated from here.
35d7efb0-77af-47e8-b762-096375f726d1.zip (94.2 KB)
@rpurdom I would like to narrow down the Parquet file that is hitting this issue. If you add a filter on a partition column, do you still see the error? Meanwhile, I will check internally here.
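One way to run that narrowing test is to issue the same count once per shard directory and note which shard fails. The sketch below only builds the SQL strings. It assumes the shard sub-directory is exposed as a column named `dir0` and that the `flat_ind` view passes that column through; the shard names (`shard_0` to `shard_7`) are hypothetical placeholders, so substitute whatever the source actually shows.

```python
# Sketch only: `dir0` and the shard directory names are assumptions,
# not details taken from this thread. Adjust them to match the source.
QUERY = (
    "SELECT COUNT(DISTINCT individual['Individualid']) "
    "FROM flat_ind WHERE {column} = '{shard}'"
)


def per_shard_queries(shards, column="dir0"):
    """Return one filtered COUNT(DISTINCT ...) query per shard directory."""
    return [QUERY.format(column=column, shard=shard) for shard in shards]


if __name__ == "__main__":
    # Hypothetical shard names for the eight sub-directories.
    for sql in per_shard_queries([f"shard_{i}" for i in range(8)]):
        print(sql)
```

If only one shard's query fails, the same idea can be repeated with a narrower filter inside that shard.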
@balaji.ramaswamy I don’t believe the problem is in the data. I wrote a quick Go program that simply reads the family and individual IDs and counts instances of both. It runs very quickly (~4 seconds) and returns the correct count. I’m literally counting the instances, not just looking at metadata. And, as I mentioned in my original post, it also works flawlessly on BigQuery as an external table. BigQuery is not an option, though, as I’m really looking for a solution in a local environment to support data science and data discovery.
It’s not partitioned in the traditional sense. It’s simply using hash partitions (sharding) based on family id, split into 8 sub-directories and then split out into files of (4096*1024) family rows; the associated individuals and topics are lists of structs within family. Technically, individuals could have topics, but this particular file does not. The end result is 48 parquet files spread across eight ‘buckets’, seven files of 1024*4096 families and an 8th file of the remainder.
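For readers trying to picture that layout, here is a rough sketch of the sharding arithmetic. It assumes the file size quoted above means 4096 × 1024 families per file, and it uses CRC32 as a stand-in hash; the actual hash function used to build the dataset is not stated in this thread.

```python
import zlib

NUM_BUCKETS = 8                # eight sub-directories ("buckets")
ROWS_PER_FILE = 4096 * 1024    # assumed reading of the per-file family count


def bucket_for(family_id: str) -> int:
    """Map a family id to a bucket index in [0, NUM_BUCKETS).

    CRC32 is an assumed stand-in; the real hash function is unknown.
    """
    return zlib.crc32(family_id.encode("utf-8")) % NUM_BUCKETS


def files_needed(families_in_bucket: int) -> int:
    """Number of files a bucket needs: full files plus one for any remainder."""
    return -(-families_in_bucket // ROWS_PER_FILE)  # integer ceiling division


if __name__ == "__main__":
    print(bucket_for("family-000123"))
    print(files_needed(7 * ROWS_PER_FILE + 5))
```

Because every family lands in exactly one bucket, a failure that reproduces at the same point each time should map back to a specific bucket and file.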
However, saying this, Dremio errors at the same point every time. I’m simply running the service using docker run, though I do override memory constraints, otherwise it crashes out with memory constraint issues. I’m giving the container 64GB of memory, but it ticks just over 13GB of usage when it errors out the query.
I’ve had the suspicion that I need to deploy this in Kubernetes and that running it as a single Docker container just isn’t a good idea. For one, it’s way too slow as I’m running it now. I just need to get to a point where I have time to continue testing.
@rpurdom Is this data sensitive? If it is not, could you pass us the actual file so we can get to the fix quickly? Meanwhile, I will check internally for any open issues on this.
It’s very sensitive; it belongs to a client. I can probably create a scrubbed version. Let me think of a quick way to knock it out.