Akismet has blocked 2 of my posts for over 3 weeks now

Hey Guys

I posted 2 questions 3 weeks ago. Because I edited each post as soon as it was created to correct some spelling mistakes, that seemed to trigger the Akismet automated spam filter, and both of my posts got hidden. The message said a staff member would review my post soon, but it’s been 3 weeks now and the questions are still blocked. Can a human look into unblocking those posts and consider putting a leash on Akismet?

Thanks

@asdf01 I am not able to recover the post but I read through your question. In versions 18.0 and above, Dremio supports unlimited splits for PARQUET/ORC and AVRO formats. Is your dataset one of these?

Hi @balaji.ramaswamy

Thanks for your interest in this issue.

“I am not able to recover the post”

It would be good to know why that is. I spent hours typing up the original posts. The message from Akismet suggested that the posts were simply “temporarily hidden”. Is Akismet hiding things from the human staff at Dremio? Is AI taking over?!

“but I read through your question”

I’m confused. So is the situation that you can see the original post contents but can’t unhide it? If so, it would be good if you could send me the original post contents so that I can make it part of a new post. Hours of my life went into those posts. It would be good if I didn’t have to do it all over again.

“In versions 18.0 and above, Dremio supports unlimited splits for PARQUET/ORC and AVRO formats”

Again, thanks for your help on this issue. We experienced this issue with the latest 21.2.0 release. The data seems to be in CSV format that is then gzipped. I go into more detail in my original post, but we are trying to query AWS load balancer access logs in S3.

It would be good if these limits were either configurable or could simply be removed. We don’t mind if unreasonable queries take unreasonable amounts of time to run. If that’s the case, we would simply cancel those queries and find optimisations ourselves through the query conditions.

We don’t even mind if unreasonable queries make the server fall over. That would at least give us an opportunity to resolve the issue ourselves by adding more resources.

What we do have issues with is Dremio preventing us from having any access to our data after making claims of being able to handle data lakes with petabytes of data. We also have issues with Dremio suggesting that we change our data structure after previously expressing an understanding of and respect for the divergent data formats of your customers.

In this case, it’s not that we are resisting changes to our data structure; it’s just that we have no control over the data structure that AWS produces for its load balancer access logs.

All this discussion might be very confusing for anyone reading this thread without the context from the original post. So if you have access to the original post contents, please send it to me somehow and I will repost the original post contents and we can continue the discussion there.

Thanks again for your interest.

Hi @balaji.ramaswamy.

Are you having any luck with unblocking my previous posts?

Thanks

@asdf01 Currently the only file formats for which unlimited splits are supported are PARQUET/ORC and AVRO. How many files are you trying to promote? Can you split them into 2 folders and promote them as 2 PDSs?

Hi @balaji.ramaswamy

Thanks for your continued support.

All the details are in my previous posts. If you could unblock those original posts, or send me their original contents so that I can repost them, we could continue the discussion in that thread with all the context established for any potential readers.

The AWS load balancer access logs folder structure has year, month, and day subfolders. Because of this 300k splits limit, I’ve resorted to creating a PDS using the “Format Folder” feature at the month folder level, but that still seems to trip the 300k splits limit. It is not really feasible to create a PDS for every single day that we might want to query the access logs for. That’s 365 PDSs per year per load balancer.
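For anyone trying to work out which folder level stays under the limit, here is a rough sketch that counts log files per month or day folder from a list of object keys. It assumes the keys end in `/<yyyy>/<mm>/<dd>/<file>`, and it assumes each file counts as roughly one split, which may not match how Dremio actually counts splits. The function names and the sample keys are made up for illustration; only the 300k figure comes from this thread.

```python
import re
from collections import Counter

# Assumption: keys end in /<yyyy>/<mm>/<dd>/<file>, e.g. ".../2022/06/01/x.log.gz"
DATE_FOLDERS = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/[^/]+$")


def count_files_per_folder(keys, level="month"):
    """Count object keys per year/month folder (or per year/month/day folder)."""
    counts = Counter()
    for key in keys:
        match = DATE_FOLDERS.search(key)
        if match is None:
            continue  # key does not follow the assumed layout
        year, month, day = match.groups()
        if level == "month":
            folder = f"{year}/{month}"
        else:
            folder = f"{year}/{month}/{day}"
        counts[folder] += 1
    return counts


def folders_over_limit(counts, limit=300_000):
    """Return the folders whose file count exceeds the limit, sorted by name."""
    return sorted(folder for folder, n in counts.items() if n > limit)


if __name__ == "__main__":
    # Hypothetical keys; a real list would come from an S3 listing.
    sample_keys = [
        "logs/2022/06/01/a.log.gz",
        "logs/2022/06/02/b.log.gz",
        "logs/2022/07/01/c.log.gz",
    ]
    monthly = count_files_per_folder(sample_keys, level="month")
    print(dict(monthly))
    print(folders_over_limit(monthly))
```

Running the same count with `level="day"` would show whether day-level folders are the coarsest level that fits, under the one-file-per-split assumption.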

Even when I tried creating the PDS at the day level just to see what the performance is like, I ran into another issue, which my other blocked post raised with Dremio. The other issue we encountered was Dremio’s handling of 0 byte *.log.gz files.

AWS load balancer access logging seems to create these 0 byte *.log.gz files for some reason. When one of these files is in one of these folders, the “Format Folder” feature doesn’t fail at PDS definition time, but the query of the PDS fails at query time.
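To find which folders contain these files, something like the sketch below could list every 0 byte *.log.gz file under a directory. It assumes a local copy of the logs (for example, a synced subset of the bucket); scanning S3 directly would need an S3 listing instead. The function name is made up for illustration, and whether removing the listed files avoids the query failure is untested.

```python
from pathlib import Path


def find_empty_gz_logs(root):
    """Return the sorted paths of all 0 byte *.log.gz files under root."""
    return sorted(
        path
        for path in Path(root).rglob("*.log.gz")
        if path.is_file() and path.stat().st_size == 0
    )


if __name__ == "__main__":
    # "." is a placeholder; point this at the local copy of the log folder.
    for empty_file in find_empty_gz_logs("."):
        print(empty_file)
```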

So it would be really good if Dremio could solve both of these issues. As I stated in my original posts, which have been blocked, we don’t mind queries taking a long time. We just object to these data sources being completely unusable due to these performance-orientated limitations.

Thanks