Serialize selection of columns into Map

Is it possible to select a series of columns in dremio and serialize that to a Map or JSON type in a new column? How would you go about doing that?

@balaji.ramaswamy do you have any thoughts on this. Can I use Arrows Map or Struct syntax in Dremio, Are there any examples of this?

@kprifogle,

Just so I understand the requirement, if you have the below table:

col_a, col_b, col_c
val1,val2,val3
val4,val5,val6
…

And you want to add a new column with records from the others, like so:

col_a, col_b, col_c, col_new
val1,val2,val3, {col_a: val1, col_b: val2, col_c: val3}
val4,val5,val6, {col_a: val4, col_b: val5, col_c: val6}
…

Is this correct?

@ben Yes that is correct. Im hacking around it by manually constructing the string as a Calculated Field, but I was wondering if there is a better way, or better yet a way to construct an arrow Map type in dremio queries.

@kprifogle we do not currently have a function like to_struct(col_a, col_b), or something similar.

If you have a map/struct type column in your dataset, the in-memory image of those records will be have Arrow struct type. Then when Dremio stores query results, it writes that same image to disk.

However, if you use an ODBC or JDBC client tool to consume records from a Dremio dataset, these will not be (de)serialized in the standard Apache Arrow format. What is your requirement for the arrow struct type?

I am using dremio to gradually transform heavily structured application specific data into a more general schema where several columns can be grouped together as “attributes” which can be treated as unstructured. This allows me to union together several different schemas that share the more general columns and ignore the less general “metadata” columns. I know the question of “why are you doing this” always comes up, but in this case it is the most sensible path forward. Basically I need to access the Arrow Map or Struct type from a SQL query. I notice that in dremio when I type “Map” or “Struct” it is recognized as a valid command by its syntax checker, but I cannot seem to produce the right syntax in order to leverage the command.

If there is no way to do this, then I guess I will continue manually constructing a JSON string with Concat in a calculated field, but its a very unsavory solution.

It seems like Dremio supports “complex types” but only in the very limited capacity that it can leverage complex types internally when reading data, and that you cannot actually construct queries which leverage complex types :frowning: For example you cannot write complex types at all, which makes the advertisement of complex types support kind of misleading.

Im needing something equivalent to this Impala primitive: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_struct.html

Thanks for sharing your use case @kprifogle. Definitely agree that having a way to build structs from column values would be a useful function. We will consider this as a possible enhancement request and update community if we implement this.

1 Like

Wanted to add a me too here. I went in expecting either array[,,,] to work or map[k,v] to work in my select statement, but was met by an error from RextoExpr that they have not been implemented yet.

Hi, has anything been implemented to build structs from columns?

Right now Dremio does not support literal struct or list values, so if you want to insert some literal value into these complex types you can try a different method:

  1. Construct some JSON and insert into a VARCHAR field
  2. Use the CONVERT_FROM() function, to convert it to a different type.

(Thanks to adamr for this suggestion.)

Thanks @Scott_Gurney for the suggestion.

for 1, if we want to construct JSON, is there a function like json_dump() in Dremio which creates the JSON from a table for us?

Using a similar example here: (Serialize selection of columns into Map - #3 by ben)

col_a, col_b, col_c
val1,val2,val3
val4,val5,val6

How do I get the json form of the above as

col_a, col_b, col_c, col_new
val1,val2,val3, {col_a: val1, col_b: val2, col_c: val3}
val4,val5,val6, {col_a: val4, col_b: val5, col_c: val6}