Can I query values inside a Parquet structure?

Hello,

I am trying to put together a small scale Dremio-Minio-Nessie-Iceberg pathfinder. I was given several dozen Parquet files that all use the same schema, and these have been ingested to a Minio bucket. In turn, these have been registered in Dremio.

When I query my files in Dremio with something like “SELECT * FROM MyFile.parquet”, I see three columns that are of types (DOUBLE, ROW, ROW). The second and third columns are visualized in the query output as JSON-like structures. Using an independent tool, I have viewed the schema definition embedded in the Parquet files, so I am at least oriented.

I have been going crazy for a couple of days attempting to write SQL that extracts elements from the second and third columns. Is it possible to do so?

Any advice will be very appreciated, as I am getting behind.

@Latitude42 It looks like the 2nd and 3rd columns are either a LIST or STRUCT or a combination. For List, you would need to flatten (SQL) or via UI use the unlist option, for struct on the UI you have to use extract, sample SQL below

SELECT operatorprofile, nested_1.operatorprofile.outputRecords AS outputRecords
FROM (
  SELECT flatten(nested_0.fragmentProfile.minorFragmentProfile[0].operatorProfile) AS operatorprofile
  FROM (
    SELECT flatten(fragmentProfile) AS fragmentProfile
    FROM "@dremio".profile_attempt_0 AS profile_attempt_0
  ) nested_0
) nested_1;

OK, I have solved my problem, although I don’t understand why it makes SQL notational sense.

Recall my original post in which I said the Parquet files upon ingestion show multiple rows and three columns (DOUBLE, ROW, ROW) in which the second and third columns are structures. Call these columns A, B, C, respectively

Days ago, I was trying to make this work (and it does not work):

SELECT A, B.* FROM “MyFile.parquet”

But today I stumbled on this. Horray! What a relief, it works!

SELECT A, “MyFile.parquet”.B.* from “MyFile.parquet”

And also this works:

SELECT A, “MyFile.parquet”.B.* , “MyFile.parquet”.C.* FROM "MyFile.parquet*

But why is it that the first form of the SQL (which makes notational sense to me) does not work, but the second and third forms of the SQL (which do not make notational sense to me) work?

@Latitude42 Are you able to send us the job profiles as only upon reviewing the plan we can see what the issue is, especially for the one that does not work

SELECT A, B.* FROM “MyFile.parquet”

I’ve generated two zipfiles (failure and sucess) that I believe contain the plans you are offering to view. I am reluctant to post them in the clear. Is there a way to submit them privately?

@Latitude42 You should be able to send me a private message