IndexOutOfBoundsException: Index: 6, Size: 3

I’ve been kicking the tires on Dremio and I keep running into various IndexOutOfBoundsException errors.

I have a nested Parquet dataset: 290+ million family records, each holding a list of individuals and lists of other topic areas (e.g., autos). The dataset spans 8 shards and roughly 40 Parquet files, about 1.2 TB of data in total.

I’ve created a flattened view of individuals that contains just the familyid and the individual struct.

When I query from this view,

SELECT count(distinct individual['Individualid']) FROM flat_ind

I get an IndexOutOfBoundsException. I get this with any query that touches a column inside ‘individual’.

My first thought was that this must be a bad file in the dataset. So I uploaded all of it to a bucket on Google and attached the Parquet files as an external table in BigQuery. Though BigQuery’s support for Parquet is strange (it inserts an extra level of struct into the data structure), I was able to execute the select count(distinct) against the very same Parquet files (roughly 5 seconds to run).

Also, we have a 1% sample dataset with the exact same schema, and queries against it run flawlessly.

Attached is the profile for the query. I’m hoping that Dremio can be a central hub for our datasets. But this is a simple use case and it only gets much more complicated from here.

Robert Purdom
Co-founding Managing Director
Intelafy LLC
35d7efb0-77af-47e8-b762-096375f726d1.zip (94.2 KB)

@rpurdom I would like to narrow down which Parquet file is hitting this issue. If you add a filter on a partition column, do you still see the error? Meanwhile, I will check internally here.
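A filter along these lines could restrict the failing aggregate to one shard directory at a time. This is only a sketch under several assumptions: that the Parquet folder is promoted as a dataset from a file-system source, so that the first sub-directory level is exposed as the dir0 column; that the underlying dataset is queried directly, since a view holding only familyid and individual would not carry dir0; and that the names "source"."families", individuals, and 'shard_0' are hypothetical placeholders, not names taken from this thread.

```sql
-- Sketch only: run the same COUNT(DISTINCT ...) against a single shard directory.
-- "source"."families", the list column "individuals", and 'shard_0' are
-- hypothetical names; dir0 is assumed to be the sub-directory column that
-- Dremio exposes for folder-based datasets.
SELECT COUNT(DISTINCT t.individual['Individualid']) AS distinct_individuals
FROM (
  SELECT FLATTEN(individuals) AS individual   -- one row per list element
  FROM "source"."families"
  WHERE dir0 = 'shard_0'                      -- limit the scan to one shard
) t;
```

Repeating the query once per shard directory should show whether the error follows a particular directory or appears regardless of which files are read.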

@balaji.ramaswamy I don’t believe the problem is in the data. I wrote a quick Go program that simply reads the family and individual ids and counts instances of both. It runs very quickly (~4 seconds) and returns the correct count. I’m literally counting the instances, not just looking at metadata. And, as I mentioned in my original post, it also works flawlessly on BigQuery as an external table. BigQuery is not an option, though, as I’m really looking for a solution that runs in a local environment to support data science and data discovery.

It’s not partitioned in the traditional sense. It simply uses hash partitioning (sharding) on family id, split into 8 sub-directories and then into files of 4096*1024 family rows; the associated individuals and topics are lists of structs within family. Technically, individuals could have topics, but this particular file does not. The end result is 48 Parquet files spread across eight ‘buckets’: seven files of 1024*4096 families and an 8th file with the remainder.

However, saying this, Dremio errors at the same point every time. I’m simply running the service using docker run, though I do override memory constraints, otherwise it crashes out with memory constraint issues. I’m giving the container 64GB of memory, but it ticks just over 13GB of usage when it errors out the query.

I’ve had the suspicion that I need to deploy this in Kubernetes and that a single Docker container just isn’t a good idea. For one, it’s way too slow the way I’m running it now. I just need to get to a point where I have time to continue testing.

Robert

@rpurdom Is this data sensitive? If not, could you pass us the actual file so we can get to a fix quickly? Meanwhile, I will check internally for any open issues on this.

It’s very sensitive, it belongs to a client. I can probably create a scrubbed version, let me think of a quick way to knock it out.

Robert

Was this problem ever resolved? It looks like I’m hitting a similar issue.

I can query the table, but any time I do a distinct or some type of aggregate function, I get an IndexOutOfBoundsException.

@aki0086 I want to check whether this is the same stack as @rpurdom’s. Are you able to provide the profile?

@rpurdom This stack looks very similar to an issue fixed in 20.3, would you be able to upgrade and test?

@balaji.ramaswamy
aefbe3ae-5333-47e1-bcde-a89ebdf5c825.zip (14.1 KB)

Here is the trace for the version below:

version: 20.1.0-202202061055110045-36733c65
commit_id: 36733c65bab0f37953474319d90a4cfff7f2a2ca
commit_message: [NULL]
commit_time: 2022-02-04T00:48:35+0000
build_email: (blank)
build_time: 2022-02-06T11:05:25+0000

When I query on an older version we have, I don’t get the same issue. The same data is loaded into both environments.

version: 14.0.0-202103011714040666-9a0c2e10
commit_id: 0073ec1dd24c23e2b6ef4652ca3fea0a93e06873
commit_message: [NULL]
commit_time: 2021-03-01T13:44:39-0500
build_email: (blank)
build_time: 2021-03-29T19:28:08-0400

@aki0086 This seems to have been fixed in 20.3 or 21.x or 22.x, can you please validate?

Meanwhile can you please try to disable “store.parquet.read_column_indexes” and see if that helps?
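For reference, one way this support key might be turned off from the SQL editor is sketched below. It assumes that support keys can be changed with ALTER SYSTEM on your version and that your user has admin rights; if that is not the case, the same key can usually be set from the Support section of the admin settings instead.

```sql
-- Sketch only: disable reading of Parquet column indexes system-wide.
-- Assumes ALTER SYSTEM is permitted for this support key on your Dremio version.
ALTER SYSTEM SET "store.parquet.read_column_indexes" = false;
```

After changing the key, re-run the failing query and capture a fresh profile so the two runs can be compared.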

Unfortunately just disabling didn’t help.

For the bug you mentioned…is it referring to one of these fixes?
https://docs.dremio.com/software/release-notes/200-release/
20.4

  • When attempting to download certain query results as JSON or Parquet files, the downloaded file size was zero bytes and resulted in an IndexOutofBounds exception.

or
20.3

  • Intermittent jobs were failing with an IndexOutOfBounds exception while preparing operator details information for runtime filters.

@aki0086 Are you able to send me the profile when it failed after you disabled store.parquet.read_column_indexes?

sent to you via direct msg

@aki0086 I do not seem to find the profile, would you mind sending it again please? balaji@dremio.com