HDFS connector interprets VARCHAR as VARBINARY

LucGth · March 11, 2020, 6:26pm

Hi,
When connecting to an HDFS source, I noticed that the VARCHAR columns are interpreted as VARBINARY.
This does not happen for columns with a date format, as you can see from the picture.
Another strange thing: the column dir0 has the correct type. It is the partition column of the Impala table.
With the Hive connector this does not happen.
All of this causes problems when casting these VARBINARY columns to VARCHAR.
Using API requests, I extracted the information about the PDS:
{
  "entityType" : "dataset",
  "id" : "8c096312-4df7-4f16-abf7-54c0b870038b",
  "type" : "PHYSICAL_DATASET",
  "path" : [ "HDFS", "test", "sample_20200110" ],
  "createdAt" : "2020-02-04T10:41:07.193Z",
  "tag" : "31",
  "format" : {
    "type" : "Parquet",
    "ctime" : 0,
    "isFolder" : true,
    "location" : "/test/sample_20200110",
    "autoCorrectCorruptDates" : true
  },
  "approximateStatisticsAllowed" : false,
  "fields" : [ {
    "name" : "xxx",
    "type" : {
      "name" : "VARBINARY"
    }
  }, {
    "name" : "scheda",
    "type" : {
      "name" : "VARBINARY"
    }
  }, {
    "name" : "ts",
    "type" : {
      "name" : "TIMESTAMP"
    }
  }, {
    "name" : "transaction_id",
    "type" : {
      "name" : "TIMESTAMP"
    }
  }, {
    "name" : "dir0",
    "type" : {
      "name" : "VARCHAR"
    }
  } ]
}
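
For anyone who wants to check the reported types without reading the whole response by eye, here is a minimal sketch that lists the fields of a given type. It assumes the response has the same shape as the JSON above (trimmed here to the relevant keys); the helper name fields_of_type is hypothetical and not part of any Dremio client.

```python
import json

# Trimmed copy of the catalog response shown above (only the keys used here).
catalog_response = """
{
  "entityType": "dataset",
  "type": "PHYSICAL_DATASET",
  "path": ["HDFS", "test", "sample_20200110"],
  "fields": [
    {"name": "xxx", "type": {"name": "VARBINARY"}},
    {"name": "scheda", "type": {"name": "VARBINARY"}},
    {"name": "ts", "type": {"name": "TIMESTAMP"}},
    {"name": "transaction_id", "type": {"name": "TIMESTAMP"}},
    {"name": "dir0", "type": {"name": "VARCHAR"}}
  ]
}
"""


def fields_of_type(dataset, type_name):
    """Return the names of fields whose reported type equals type_name.

    Hypothetical helper: it only walks the "fields" list of the response.
    """
    return [
        field["name"]
        for field in dataset.get("fields", [])
        if field.get("type", {}).get("name") == type_name
    ]


dataset = json.loads(catalog_response)
varbinary_fields = fields_of_type(dataset, "VARBINARY")
print(".".join(dataset["path"]), "->", varbinary_fields)
```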

thanks!

ben · March 11, 2020, 7:43pm

@LucGth, what is the file type you are accessing? Parquet or ORC or something else?

LucGth · March 11, 2020, 10:04pm

Hi Ben, they are Parquet files.

ben · March 11, 2020, 11:10pm

You should have parquet-tools installed with your Hadoop distribution.

Try running: $ parquet-tools meta <path/to/sample_20200110> and attach the output here.

LucGth · April 1, 2020, 3:59pm

Hi Ben, thanks for your reply.

creator: impala version 2.12.0-cdh5.16.1 (build 4a3775ef6781301af81b23bca45a9faeca5e761d)

file schema: schema

xx: OPTIONAL BINARY R:0 D:1
yy: OPTIONAL BINARY R:0 D:1
zz: OPTIONAL INT96 R:0 D:1
kk: OPTIONAL INT96 R:0 D:1

row group 1: RC:1 TS:3039

xx: BINARY SNAPPY DO:4 FPO:33 SZ:57/53/0.93 VC:1 ENC:PLAIN_DICTIONARY,RLE
yy: BINARY SNAPPY DO:134 FPO:2978 SZ:2872/5957/2.07 VC:1 ENC:PLAIN_DICTIONARY,RLE
zz: INT96 SNAPPY DO:3059 FPO:3086 SZ:55/51/0.93 VC:1 ENC:PLAIN_DICTIONARY,RLE
kk: INT96 SNAPPY DO:3193 FPO:3220 SZ:55/51/0.93 VC:1 ENC:PLAIN_DICTIONARY,RLE
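
One possible reading of the schema above: the two BINARY columns carry no string annotation, so a reader that relies only on the file metadata has no way to tell text from raw bytes. The sketch below flags such columns. It assumes the schema lines follow the layout shown in this output and that an annotated string column would appear with an O:UTF8 marker after the physical type (the form older parquet-tools builds print, as far as I know); the regular expression and the function name are my own, not part of parquet-tools.

```python
import re

# Schema section copied from the parquet-tools output above.
schema_text = """\
xx: OPTIONAL BINARY R:0 D:1
yy: OPTIONAL BINARY R:0 D:1
zz: OPTIONAL INT96 R:0 D:1
kk: OPTIONAL INT96 R:0 D:1
"""

# Assumed line layout: name: REPETITION PHYSICAL_TYPE [O:ANNOTATION] R:n D:n
SCHEMA_LINE = re.compile(
    r"^(?P<name>\S+):\s+(?P<repetition>OPTIONAL|REQUIRED|REPEATED)\s+"
    r"(?P<physical>\S+)(?:\s+O:(?P<annotation>\S+))?\s+R:\d+\s+D:\d+$"
)


def unannotated_binary_columns(text):
    """Return names of BINARY columns that have no O:<annotation> marker."""
    columns = []
    for line in text.splitlines():
        match = SCHEMA_LINE.match(line.strip())
        if (
            match
            and match.group("physical") == "BINARY"
            and match.group("annotation") is None
        ):
            columns.append(match.group("name"))
    return columns


flagged = unannotated_binary_columns(schema_text)
print(flagged)
```

If that reading is right, two things may be worth trying: Impala has a query option, PARQUET_ANNOTATE_STRINGS_UTF8, that makes it write the annotation for STRING columns in newly written files, and on the Dremio side CONVERT_FROM(col, 'UTF8') can decode a VARBINARY column to text. I have not verified either against this dataset.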

LucGth · April 17, 2020, 10:28am

Is this normal, in your opinion?

thanks
