We are trying to read gzip-compressed Parquet data from S3 through Dremio. When creating the Parquet files, before writing them to S3, we set the row group size to 512 MB and the page size to 8 KB.
For reading through Dremio, we set fs.s3a.connection.maximum = 10000 and fs.s3a.experimental.input.fadvise = random, in addition to the other standard S3A properties.
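For reference, here is a minimal sketch of how the two S3A settings above look when written as Hadoop-style properties. It only renders the names and values quoted in this post as a core-site.xml style fragment; the helper name to_core_site_xml is hypothetical, and whether the properties live in core-site.xml or in the source's connection properties depends on the deployment, so treat this as an illustration rather than our exact setup.

```python
import xml.etree.ElementTree as ET

# Property names and values exactly as quoted in this post.
S3A_PROPERTIES = {
    "fs.s3a.connection.maximum": "10000",
    "fs.s3a.experimental.input.fadvise": "random",
}


def to_core_site_xml(properties):
    """Render a dict of properties as a Hadoop-style <configuration> fragment.

    Hypothetical helper for illustration only.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent exists from Python 3.9; skip pretty-printing on older versions.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_core_site_xml(S3A_PROPERTIES))
```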
When we create a dataset over a folder of almost 22 GB in total, containing about 15 files ranging from 30 to 250 MB each, it takes almost 8 minutes to create the dataset.
This is already too long, considering that we will be working with much larger data in the future. Dremio also downloads the entire 22 GB to disk, which leads to disk space issues.
The raw reflection takes almost 40 minutes to complete and again consumes a lot of disk space.
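To put those timings in perspective, here is a rough back-of-the-envelope calculation of the effective throughput. It assumes the full 22 GB is processed exactly once per operation (we observed the full download for dataset creation; for the reflection this is only an assumption) and treats 1 GB as 1024 MB. The function name is hypothetical.

```python
def effective_throughput_mb_s(total_gb, minutes):
    """Effective rate in MB/s, assuming all data is processed once (1 GB = 1024 MB)."""
    return (total_gb * 1024) / (minutes * 60)


# Figures taken from this post: ~22 GB, ~8 min dataset creation, ~40 min raw reflection.
dataset_rate = effective_throughput_mb_s(22, 8)
reflection_rate = effective_throughput_mb_s(22, 40)

print(f"dataset creation: {dataset_rate:.1f} MB/s")    # about 46.9 MB/s
print(f"raw reflection:   {reflection_rate:.1f} MB/s")  # about 9.4 MB/s
```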
This could be forgiven if it were a one-time cost. But on running queries against the data, we found the following issues:
• When a query is accelerated by the raw reflection, the time taken does not improve; it is almost the same.
• Queries never use the aggregation reflection [it says: Did not cover query] and fall back to the raw reflection, even for aggregation queries.
• When a query runs, its entire output is again written to disk, causing disk space issues.
• For some queries, even the raw reflection is not used, with the message that “it is too expensive”.
No particularly helpful information can be found in the logs.
Since we are not seeing any substantial improvement in either space or time, I have to ask whether using reflections is really of any help [in the case of S3].
If there is something we are missing or not configuring correctly, we would really appreciate your input.
Thanks and Regards