Dremio should be able to read standard Parquet files, so to understand what is going on, could you provide a query profile (more information here)? Also, what version of Dremio are you running?
As the Parquet files can’t be loaded as datasets, I can’t query them and there are no jobs. Maybe it will help if I send you one of the files, a very small one: 003.zip (959 Bytes)
Or could you tell me how I can check whether the Turbodbc Parquet files are standard…
Thanks for the answer. In my case, I’m trying out and comparing various solutions for the data lake at my workplace. I’m testing Dremio vs. Azure Data Lake (Analytics). I want Parquet files because:
Parquet is readable from Hadoop and from the pandas ecosystem.
It’s a compressed format. In my head, Dremio can be the SQL layer to query old data and new data in a uniform way, as if they were the same data. The idea is to move the old data from SQL Server to Parquet files and then remove it from SQL Server. Dremio would then allow querying the data (old and new) as a single data source. I’m testing Dremio to see if that’s true.
If Dremio is not accepted by the team, the Parquet files can be used with other tools too.
I’m very new to these tools, and maybe I’m misunderstanding things, or there are better ways. I’ll continue testing.
P.S.: Does Dremio store the Arrow data from reflections in a compressed way?
I can reproduce the issue with your 003.parquet file; the error we get is Required field 'codec' was not present! (you can also see it on the Jobs page if you filter for Internal jobs). We will have someone with more Parquet knowledge look into this.
Dremio persists Data Reflections as Parquet. Dremio’s Parquet reader is highly optimized and reads the data into Arrow in-memory buffers.
Your approach makes sense if you’re trying to ‘retire’ the legacy system. Data Reflections aren’t intended to be a long-term persistence strategy. Instead, they are used to optimize query performance. For example, you could have Dremio maintain one or more Data Reflections that are sorted/partitioned/aggregated in different ways to optimize different query patterns. Users wouldn’t ever need to know about these Data Reflections - the query planner would automatically pick the best one for a given query.
A common use case is that the source isn’t being retired and will remain as the system of record. In this case, Data Reflections 1) optimize the analytical processing (most sources aren’t optimized for scan-heavy workloads), and 2) offload the analytical workload from the source, so the SLA isn’t impacted.
You could still use Dremio to generate the Parquet files, even for long-term storage, by following the instructions in this thread: Converting JSON to Parquet
You might find that this 1) performs better than turbodbc (assuming you have a Dremio cluster), and 2) works with non-relational sources (e.g., JSON).
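In case it helps, here is a rough sketch of that approach from Python, since you are already using turbodbc. It assumes a Dremio ODBC DSN named "Dremio", a SQL Server source named "SqlServer", and a CTAS-enabled target called "datalake"; all of these are placeholders for your own names.

```python
# Hypothetical sketch: run Dremio's CREATE TABLE AS SELECT (CTAS) over ODBC so
# that Dremio writes the query result out as Parquet files in the target source.
# The DSN, credentials, source and table names are placeholders.
import turbodbc

connection = turbodbc.connect(dsn="Dremio", uid="user", pwd="password")
cursor = connection.cursor()

# Materialize the old SQL Server rows as Parquet in a CTAS-enabled source,
# following the approach from the "Converting JSON to Parquet" thread.
cursor.execute("""
    CREATE TABLE datalake.archive.orders_2015 AS
    SELECT *
    FROM SqlServer.dbo.orders
    WHERE order_date < DATE '2016-01-01'
""")

connection.close()
```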
I see, thank you very much. I’ll try using Dremio to extract more data from other tables as compressed Parquet files. We need to free up disk space.
Figured out the issue with help from the Parquet dev community. The file you’re producing with Turbodbc uses Brotli compression, which was only introduced recently in Parquet (format 2.3.2), and Dremio doesn’t support it yet. (You’d probably have a similar problem with most other tools as well.) You should be able to tell Turbodbc to write using a more common compression format, such as Snappy.
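If you write the files from Python, a minimal sketch of the Snappy variant could look like this, assuming your turbodbc build has Arrow support (fetchallarrow) and that pyarrow does the actual file writing; the DSN and table names are placeholders:

```python
# Sketch: write Snappy-compressed Parquet instead of Brotli so Dremio can read it.
# Assumes turbodbc was built with Apache Arrow support and that pyarrow is used
# to write the file; the DSN and table names are placeholders.
import turbodbc
import pyarrow.parquet as pq

connection = turbodbc.connect(dsn="SqlServer")
cursor = connection.cursor()
cursor.execute("SELECT * FROM dbo.orders")

# Fetch the result set as an Arrow table...
table = cursor.fetchallarrow()

# ...and write it out with Snappy, a codec Dremio's Parquet reader supports.
pq.write_table(table, "orders.parquet", compression="snappy")

connection.close()
```

Snappy is the most widely supported Parquet codec, so the files should also read fine from Hadoop and pandas.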
I’ll file an enhancement request to keep track of this in Dremio.