"Number of splits (432658) in dataset XXX.2019 exceeds dataset split limit of 300000" when trying to use the "Format Folder" feature on aws load balancer access logs in s3 at the load balancer level and the year level s3 folder

In short:
When you use Dremio’s “Format Folder” feature on the S3 bucket that AWS writes all the load balancer access logs to, even formatting at the year-folder level exceeds the 300k dataset split limit. Please allow this limit to be raised through configuration.

In more detail:
We have been trying to set up Dremio to query our load balancer access logs instead of using the AWS Athena service that AWS recommends. Our fear is that Athena could become very expensive for this use case if we roll it out across the company, at $5 USD per 1 TB of data scanned. Each of our queries scans around 300 GB of access log data, regardless of how restrictive the query condition is or how narrow the time range is.
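
To make the cost concern concrete, here is a rough back-of-the-envelope calculation. The $5/TB price and the ~300 GB scan size come from the paragraph above; the daily query volume is a purely hypothetical number for illustration:

    # Rough Athena cost estimate; queries_per_day is an assumed, illustrative figure
    price_per_tb = 5.00           # USD per TB scanned (Athena list price)
    gb_scanned_per_query = 300    # what we observe per query, regardless of filters
    queries_per_day = 500         # hypothetical company-wide volume

    cost_per_query = gb_scanned_per_query / 1024 * price_per_tb
    monthly_cost = cost_per_query * queries_per_day * 30
    print(f"~${cost_per_query:.2f} per query, ~${monthly_cost:,.0f} per month")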

When we tried doing this in Dremio 21.1.1, we ran into the 300k splits limit error:

  • Number of splits (432658) in dataset XXXLoadBalancerNameXXX-s3.2019 exceeds dataset split limit of 300000.

AWS’s load balancer access logs are stored in this kind of folder structure:
S3 bucket → load balancer name bucket prefix → … → region → year → month → day
We initially encountered this error when we used the “Format Folder” feature at the region folder level, i.e. the level whose contents are one folder per year.
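
For context, ALB access log keys follow AWS’s documented layout; here is an anonymised, illustrative example (the account ID, region and load balancer names are placeholders):

    s3://my-alb-logs/my-prefix/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2019/06/30/123456789012_elasticloadbalancing_us-east-1_app.my-lb.abc123_20190630T2340Z_10.0.1.5_5wpmyhle.log.gz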

We tried to avoid the 300k splits error by running the “Format Folder” operation on each year folder instead. However, for some busier load balancers even that level hit the same 300k splits limit. Doing the “Format Folder” operation at the month folder level is not really an option: doing it at the year level across our many load balancers was already tedious, each run is very time consuming, and essentially the same processing is repeated for every folder and every load balancer. Having to repeat that 12 times per year, per load balancer folder, before we can run any queries would simply be too much work for Dremio to be considered useful.

Please offer a way for this 300k splits limit to be raised somehow through configuration.

Alternatively, offer us a way to apply the same “Format Folder” operation across a number of folders without Dremio spending ages sampling the contents of each folder every time the operation is run. The format of the access log files in every year/month/day folder, for every load balancer bucket prefix, is the same.
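
If there were a supported way to script the promotion, we could loop over the year folders ourselves. Below is a rough sketch of what we have in mind, assuming Dremio’s v3 Catalog REST API can promote a folder by POSTing a dataset body with a Text format. The endpoint details, ID encoding, format fields, and every name in this snippet are assumptions to be checked against the API docs for your Dremio version, not a tested recipe:

    # Hypothetical sketch: promote ("format") many year folders via Dremio's REST API
    # instead of clicking "Format Folder" for each one in the UI.
    # Assumptions to verify: the v3 Catalog API, the by-path lookup, the promotion
    # payload fields, and the Text format options.
    import urllib.parse
    import requests

    DREMIO = "http://dremio-coordinator:9047"   # placeholder host
    TOKEN = requests.post(f"{DREMIO}/apiv2/login",
                          json={"userName": "admin", "password": "secret"}).json()["token"]
    HEADERS = {"Authorization": f"_dremio{TOKEN}", "Content-Type": "application/json"}

    def promote_folder(path_components):
        """Promote one S3 folder to a physical dataset with a Text (space-delimited) format."""
        by_path = "/".join(urllib.parse.quote(p, safe="") for p in path_components)
        entity = requests.get(f"{DREMIO}/api/v3/catalog/by-path/{by_path}", headers=HEADERS).json()
        body = {
            "entityType": "dataset",
            "type": "PHYSICAL_DATASET",
            "path": path_components,
            "format": {"type": "Text", "fieldDelimiter": " ", "lineDelimiter": "\n"},
        }
        item_id = urllib.parse.quote(entity["id"], safe="")
        requests.post(f"{DREMIO}/api/v3/catalog/{item_id}", headers=HEADERS, json=body).raise_for_status()

    # Loop over every load balancer / year folder (all names below are placeholders)
    for year in ["2019", "2020", "2021"]:
        promote_folder(["my-s3-source", "my-alb-logs", "AWSLogs", "123456789012",
                        "elasticloadbalancing", "us-east-1", year])

Even with something like this, each promotion would still trigger Dremio’s sampling, so the splits limit itself remains the real blocker.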

Please don’t take the approach Elasticsearch took in the past, which was to do nothing and ask the customer to change their data. From my perspective, Elasticsearch focused too much on its headline claim of being fast. When a customer’s data pushed some of Elasticsearch’s limits and it couldn’t be “fast”, Elasticsearch would simply return an error rather than take longer and finish the task. Requests to Elasticsearch support about this would often result in one of the following sharp rebukes:

  • the customer was using Elasticsearch wrong
  • the customer needed to delete some data
  • the customer needed to restructure their Elasticsearch indexes by deleting them all and starting again
  • the customer needed to add more hardware

Dremio looked very promising. It seemed to offer a way to interface with all kinds of customer data, and it implied respect for the customer’s data. Dremio’s literature showed an understanding that customer data might be messy and all over the place, but that Dremio could deal with it and stitch it all together. That is why I am very surprised to be running into this 300k limit and to find out through past forum threads that the limit is not configurable, even though Dremio claims to handle petabytes of data.

Please consider making it configurable. We don’t mind things taking longer; we do mind Dremio not working for this use case at all. Given Dremio’s headline claims, I suspect a lot of people have sold the idea of Dremio to management in their organisations like I have. I wonder how many others are silently running into these limits and having to drop Dremio entirely while looking embarrassed in front of their management. Please don’t let us down.

Thanks

Hey guys. This issue still exists in version 24.0.0. Can someone revisit these limitations please? Thanks.

@asdf01 That is a product limit. For Parquet/ORC/Avro it is unlimited. The issue with increasing this limit is that the coordinator (for Parquet it is the executors) does the metadata collection for these formats, which causes increased heap usage. What is the physical RAM on your coordinator?

Hi @balaji.ramaswamy.

Thanks for your interest in this issue. We run our coordinator / executor nodes on machines with 128 GB of RAM. We currently have 9 of these machines available, and we have tried running 9 coordinator / executor nodes, but that seems to have had no effect on the 300k splits limit.

Our Dremio environment is probably not configured to use all the RAM available on those machines: DREMIO_JAVA_OPTS is set to -Xmx4g -XX:MaxDirectMemorySize=8g. I think that is just a default recommended setting I found somewhere on the Dremio forums.
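
For reference, this is roughly what we would try in conf/dremio-env on the coordinator if more heap would help. This is a sketch only; the variable names may differ between Dremio versions, so check the dremio-env shipped with your install:

    # Give the coordinator more heap for metadata/planning work and more direct
    # memory for execution on a 128 GB machine (values are illustrative)
    DREMIO_MAX_HEAP_MEMORY_SIZE_MB=16384
    DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=65536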

If changing the Java memory settings, using larger machines, or adding more machines can raise the splits limit beyond 300k, then we would definitely do that. Back to the point I was making in my original post: please don’t treat customer data the way Elasticsearch did in the past. Consider behaving more like a database server.

A database server wouldn’t give up part way through fulfilling a customer query just because it was taking too long; it will either succeed or die trying.

If Dremio made the splits limit configurable, and we bumped the limit and our query made the server fall over, we could find our own solutions by increasing the instance size or reconfiguring the environment to allocate more resources.

By imposing these hard limits without exception, Dremio completely excludes some workloads from consideration.

Please consider allowing this limit to be configurable. We wouldn’t mind any of the following consequences:

  • work taking much longer
  • the server crashing while trying to fulfill a request

These are the kinds of problems we can manage ourselves. Changing Dremio source code is not something we are capable of.

Thanks

@asdf01 This is a product limitation, hence increasing the number of nodes will not help. Also, you typically would not need more than one coordinator unless there is a very high level of concurrency. Increasing this limit is possible, but you run the risk of high heap usage and Full GC pauses. This only affects the coordinator for this file format. Here is what you can do:

  • Send me an email at balaji@dremio.com and I will send the required JVM flags + key
  • Increase the limit
  • Monitor GC on the coordinator, and if it hits a Full GC pause, send us the GC logs
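
For the monitoring step, standard HotSpot GC-logging flags can be appended through the extra JVM options hook in conf/dremio-env. This is a generic example, not Dremio-specific guidance: the flag syntax depends on the Java version the coordinator runs on, and the log path is illustrative.

    Java 8:   -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/dremio/gc.log
    Java 11+: -Xlog:gc*:file=/var/log/dremio/gc.log:time,uptime:filecount=5,filesize=20m

A “Full GC” entry with a multi-second pause time in those logs is the kind of event to capture and send along.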