"Number of splits (432658) in dataset XXX.2019 exceeds dataset split limit of 300000" when trying to use the "Format Folder" feature on aws load balancer access logs in s3 at the load balancer level and the year level s3 folder

In short:
When you try to use dremio’s “Format Folder” feature on the s3 bucket that aws writes all the load balancer access logs to, even doing it at the year folder level fails the 300k splits limit restriction. Please allow this limit to be bumped through config.

In more detail:
We have been trying to setup dremio to query our load balancer access logs instead of using the AWS Athena service which aws recommends. Our fear is that Athena might be very expensive for this use case if we use it across the company at $5 usd for every 1tb of data scanned. Each one of our queries scans around 300gb of access log data regardless of how restrictive the query condition or how narrow the time range is.

When we tried doing this in dremio 21.1.1 and we ran into the 300k splits limit error:

  • Number of splits (432658) in dataset XXXLoadBalancerNameXXX-s3.2019 exceeds dataset split limit of 300000.

AWS’s load balancer access logs are stored in this kind of folder structure:
s3 bucket → load balancer name bucket prefix → … → region → year → month → day
Initially we encountered this error when we tried using the “Format Folder” feature at the region folder level. In other words, the level where the folder contents has a folder for each year.

We tried to avoid the 300k splits error by instead doing the “Format Folder” operation on each year folder. However for some busier load balancers, even this level ran into the same 300k splits limit. Doing the “Format Folder” operation at the month folder level is not really an option as doing it at the year level across our many load balancers was already annoying enough. Each instance of the process was very time consuming and essentially the same processing is being done for each folder and each load balancer. Having to repeat that 12 times for each year and each load balancer folder before we can run any queries on it would simply be too much work for dremio to be considered useful.

Please offer a way for this 300k splits limit to be bumped some how through config.

Alternatively, offer us a way to apply the same “Format Folder” operation across a number of folders without dremio spending ages sampling the contents of each folder each time this operation is done. The format for all the access log files in all the year month day folders for all load balancer bucket prefix folders are the same.

Please don’t do what Elasticsearch’s approach was in the past. Their approach was to do nothing and ask the customer to change their data. From my perspective, Elasticsearch focused too much on their headline claim of being fast. So when the customer’s data pushed some of elasticsearch’s limits and elasticsearch couldn’t be “fast”, elasticsearch would simply return an error rather than just taking longer and finishing the task. Any requests to elasticsearch support along the lines of addressing this issue would often result in one of the following sharp rebukes:

  • the customer was using elasticsearch wrong
  • the customer needed to delete some data
  • the customer needs to restructure their elasticsearch indexes, by deleting it all and starting again
  • add more hardware

Dremio looked very promising. It seemed to offer a way to interface with all kinds of customer data. Dremio implied a respect towards the customer’s data. Dremio’s literature illustrated an understanding that the customer data might be messy and all over the place but dremio can deal with it and stitch it all together. So that is why I am very surprised to be running into this 300k limit and to find out through past forum threads that the limit is not configurable, yet dremio claims to be able to deal with petabytes of data.

Please consider making it configurable. We don’t mind things taking longer. However we do mind dremio not working for this use case at all. With dremio’s headline claims, I suspect a lot of people would have sold the idea of dremio to management in their organisations like I have. I wonder how many other people are silently running into these limits and having to completely drop dremio as a solution while looking embarrassed in front of their management. Please don’t let us down.

Thanks