Dremio 3.0 Bug - Dates coming back as 1970-01-01


#1

This works:

select portfolioDate from MyDataSet.MyTable limit 5


This fails: the same portfolioDate column comes back as 1970-01-01

select * from MyDataSet.MyTable limit 5


If I replace * with a comma-separated list of column names, it also fails, with 1970-01-01 dates coming back…


#2

OK, this is odd… there seems to be some sort of limit on the number of columns.

If I cut the result set down from 46 columns to 43 columns, the portfolioDate column comes back correctly as 9/30/2018. I can exclude any three columns and it returns the correct date.


This is reading a parquet file generated by pyarrow with a pa.date32() column type.
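For context, here is a minimal sketch of the kind of write involved. The column and file names below are illustrative, not the actual dataset:

import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# A single pa.date32() column written out to parquet with pyarrow.
table = pa.Table.from_arrays(
    [pa.array([datetime.date(2018, 9, 30)] * 5, type=pa.date32())],
    names=['portfolioDate'])
pq.write_table(table, 'example.parquet')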


#3

I started up an Apache Drill server and ran the same query; it comes back with the right date using “Select * from”.



#4

Is anyone looking into this one yet? Basically, running a “select *” statement on a directory of parquet files isn’t working for parquet date columns.


#5

Attaching some dummy parquet files I generated to reproduce this error.

test_files.zip (6.4 KB)

There are two files: test_good.parquet and test_bad.parquet.

Both files have 30 rows, made up of 3 row groups of 10 rows each.

The good file has 10 rows of A, 10 rows of B, and 10 rows of A.
The bad file has 10 rows of B, 10 rows of A, and 10 rows of A.

Same rows in both files. Only the order is different.

The only difference between A rows and B rows is that the B rows have NULL values for my_date.

select * from test_good gives me good dates
my_date
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)

select * from test_bad gives me 1970 dates instead of 12/25/2018
my_date
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)

If I read test_bad.parquet back with either Apache Drill or pyarrow, I don’t see the bad 1970 dates.
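For example, reading the attached file back with pyarrow (a quick check outside of Dremio):

import pyarrow.parquet as pq

# my_date comes back as 2018-12-25 or null, never 1970-01-01.
table = pq.read_table('test_bad.parquet')
print(table.to_pandas()['my_date'])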


#6

Thanks @david.lee,

I was able to reproduce this using the sample files you provided. Did you have any issues formatting these in Dremio?

Also, can you share a profile for both the correct results run and the incorrect results run?


#7

Also,

Can you share the version/build of pyarrow you’re using to generate these? From parquet-tools I see the parquet writer is parquet-cpp version 1.5.1-SNAPSHOT, but I would like a more complete picture of the workflow.
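In case it helps, something like this (a rough sketch against the attached sample file) prints the installed pyarrow version and the writer string recorded in the file:

import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)  # installed pyarrow version
# Writer string embedded in the file, e.g. "parquet-cpp version 1.5.1-SNAPSHOT"
print(pq.ParquetFile('test_bad.parquet').metadata.created_by)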


#8

This should be reproducible using the latest pyarrow 0.11.1 release. (I’m using a version compiled from the latest source, which fixes an unrelated Pandas metadata issue.)

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# Current UTC time truncated to millisecond precision, to match the
# pa.timestamp('ms') type used for column 'vv' below.
modified = datetime.utcnow()
modified = datetime(modified.year, modified.month, modified.day, modified.hour,
                    modified.minute, modified.second,
                    modified.microsecond - modified.microsecond % 1000)

# One sample value per column; each "good" row repeats these values.
good_columns = {
    'a': 100, 'b': 100, 'c': 'hello', 'd': 'hello', 'e': 'hello',
    'my_date': datetime.date(datetime(2018, 12, 25)),
    'f': 100.01000213623047, 'g': 'hello', 'h': 'hello',
    'i': datetime.date(datetime(2018, 12, 25)),
    'j': 100.01000213623047, 'k': 'hello', 'l': 100.01000213623047,
    'm': datetime.date(datetime(2018, 12, 25)),
    'n': 'hello', 'o': 'hello', 'p': 'hello', 'q': 'hello', 'r': 'hello',
    's': datetime.date(datetime(2018, 12, 25)),
    't': 100.01000213623047, 'u': 'hello', 'v': 'hello', 'w': True,
    'x': 100.01000213623047, 'y': 'hello', 'z': 100.01000213623047,
    'aa': 100.01000213623047, 'bb': 100.01000213623047, 'dd': 100.01000213623047,
    'ee': 100.01000213623047, 'ff': 100.01000213623047, 'gg': 100.01000213623047,
    'hh': False, 'ii': 100.01000213623047, 'jj': 'hello', 'kk': 'hello',
    'll': 'hello', 'mm': 'hello', 'nn': 'hello', 'oo': 'hello', 'pp': 'hello',
    'qq': 'hello', 'rr': 100, 'ss': 100.01000213623047, 'tt': 'hello',
    'uu': 'hello', 'vv': modified
}

schema = pa.schema([
    pa.field('a', pa.int32()),
    pa.field('b', pa.int32()),
    pa.field('c', pa.string()),
    pa.field('d', pa.string()),
    pa.field('e', pa.string()),
    pa.field('my_date', pa.date32()),
    pa.field('f', pa.float32()),
    pa.field('g', pa.string()),
    pa.field('h', pa.string()),
    pa.field('i', pa.date32()),
    pa.field('j', pa.float32()),
    pa.field('k', pa.string()),
    pa.field('l', pa.float32()),
    pa.field('m', pa.date32()),
    pa.field('n', pa.string()),
    pa.field('o', pa.string()),
    pa.field('p', pa.string()),
    pa.field('q', pa.string()),
    pa.field('r', pa.string()),
    pa.field('s', pa.date32()),
    pa.field('t', pa.float32()),
    pa.field('u', pa.string()),
    pa.field('v', pa.string()),
    pa.field('w', pa.bool_()),
    pa.field('x', pa.float32()),
    pa.field('y', pa.string()),
    pa.field('z', pa.float32()),
    pa.field('aa', pa.float32()),
    pa.field('bb', pa.float32()),
    pa.field('dd', pa.float32()),
    pa.field('ee', pa.float32()),
    pa.field('ff', pa.float32()),
    pa.field('gg', pa.float32()),
    pa.field('hh', pa.bool_()),
    pa.field('ii', pa.float32()),
    pa.field('jj', pa.string()),
    pa.field('kk', pa.string()),
    pa.field('ll', pa.string()),
    pa.field('mm', pa.string()),
    pa.field('nn', pa.string()),
    pa.field('oo', pa.string()),
    pa.field('pp', pa.string()),
    pa.field('qq', pa.string()),
    pa.field('rr', pa.int32()),
    pa.field('ss', pa.float32()),
    pa.field('tt', pa.string()),
    pa.field('uu', pa.string()),
    pa.field('vv', pa.timestamp('ms'))
])

arrow_columns = list()

# Build one 10-row array per column, using the type declared for it in the schema.
for column in schema.names:
    arrow_columns.append(
        pa.array([good_columns[column]] * 10,
                 type=schema.types[schema.get_field_index(column)]))

good_arrow_table = pa.Table.from_arrays(arrow_columns, names=schema.names)

# Swap the my_date column (index 5 in the schema) for all-null values to make the "bad" rows.
bad_column = pa.array([None] * 10, type=schema.types[schema.get_field_index('my_date')])
arrow_columns[5] = bad_column

bad_arrow_table = pa.Table.from_arrays(arrow_columns, names=schema.names)

writer = pq.ParquetWriter("test_good.parquet", schema=schema, use_dictionary=True, compression='snappy')
writer.write_table(good_arrow_table)
writer.write_table(bad_arrow_table)
writer.write_table(good_arrow_table)
writer.close()

writer = pq.ParquetWriter("test_bad.parquet", schema=schema, use_dictionary=True, compression='snappy')
writer.write_table(bad_arrow_table)
writer.write_table(good_arrow_table)
writer.write_table(good_arrow_table)
writer.close()
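After running the script, the row-group layout can be double-checked with pyarrow’s file metadata (a quick sanity check, not part of the original repro); each file should report 3 row groups and 30 rows:

import pyarrow.parquet as pq

for name in ('test_good.parquet', 'test_bad.parquet'):
    meta = pq.ParquetFile(name).metadata
    # Expect 3 row groups of 10 rows each per file.
    print(name, meta.num_row_groups, meta.num_rows)
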

#9

Thanks @david.lee,

I’ll give it a try and see what I can find.


#10

The preview window where you initially choose the file format (Parquet) shows 2018-12-25 correctly, but once you click “Save” and switch to the Dremio SQL editor / UI, the 2018-12-25 values are replaced with 1970-01-01.


#11

What version of Dremio are you using? You can find the build number under “Help” --> “About Dremio” in the menu bar.


#12

I have this problem in both 3.0 and 3.0.6.

Build: 3.0.6-201812082352540436-1f684f9