Dremio 3.0 Bug - Dates coming back as 1970-01-01


#1

This Works:

select portfolioDate from MyDataSet.MyTable limit 5

image

This fails: the same portfolioDate column is coming back with 1970s

select * from MyDataSet.MyTable limit 5

image

If I replace * with a comma separated list of column names it also fails with 1970s coming back…


#2

Ok this is odd… There seems to be some sort of limit on the number of columns…

If I cut down the result set down from 46 columns to 43 columns the portfolioDate column comes back correctly with 9/30/2018. I can exclude any three columns and it returns back the correct date.

image

This is reading a parquet file generated by pyarrow with a pa.date32() column type.


#3

I started up an Apache Drill server and ran the same query and it comes back with the right date using “Select * from”

image


#4

Is anyone looking into this one yet? Basically running a " select * " statement on a directory of parquet files isn’t working on parquet date columns.


#5

Attaching some dummy parquet files I generated for this error

test_files.zip (6.4 KB)

There are two files. test_good.parquet and test_bad.parquet.

Both files have 30 rows each made up from 3 row groups of 10 rows each

The good file has 10 rows for A, 10 rows of B and 10 rows of A
The bad file has 10 rows for B, 10 rows of A and 10 rows of A

Same rows in both files. Only the order is different.

The only difference between A rows and B rows is that the B rows have NULL values for my_date.

select * from test_good gives me good dates
my_date
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)

select * from test_bad gives me 1970 dates instead of 12/25/2018
my_date
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)

If I read test_bad.parquet with either Apache Drill or read it back using pyarrow I don’t see the bad 1970s.


#6

Thanks @david.lee,

I was able to reproduce this using the sample files your provided. Did you have any issue formatting these in Dremio?

Also, can you give a profile for the wrong results run and the incorrect results run?


#7

Also,

Can you share the version/build of pyarrow you’re using to generate these? From parquet-tools I see the parquet writer is parquet-cpp version 1.5.1-SNAPSHOT but I would like a more complete picture of the workflow.


#8

This should be re-produceable using the latest pyarrow 0.11.1 version. (I’m using a compiled from latest source version which fixes an unrelated Pandas metadata issue)

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

modified = datetime.utcnow()
modified = datetime(modified.year, modified.month, modified.day, modified.hour, modified.minute, modified.second, modified.microsecond - modified.microsecond % 1000)

good_columns = {'a': 100, 'b': 100, 'c': 'hello', 'd': 'hello', 'e': 'hello', 'my_date': datetime.date(datetime(2018, 12, 25)), 'f': 100.01000213623047, 'g': 'hello', 'h': 'hello', 'i': datetime.date(datetime(2018, 12, 25)), 'j': 100.01000213623047, 'k': 'hello', 'l': 100.01000213623047, 'm': datetime.date(datetime(2018, 12, 25)), 'n': 'hello', 'o': 'hello', 'p': 'hello', 'q': 'hello', 'r': 'hello', 's': datetime.date(datetime(2018, 12, 25)), 't': 100.01000213623047, 'u': 'hello', 'v': 'hello', 'w': True, 'x': 100.01000213623047, 'y': 'hello', 'z': 100.01000213623047, 'aa': 100.01000213623047, 'bb': 100.01000213623047, 'dd': 100.01000213623047, 'ee': 100.01000213623047, 'ff': 100.01000213623047, 'gg': 100.01000213623047, 'hh': False, 'ii': 100.01000213623047, 'jj': 'hello', 'kk': 'hello', 'll': 'hello', 'mm': 'hello', 'nn': 'hello', 'oo': 'hello', 'pp': 'hello', 'qq': 'hello', 'rr': 100, 'ss': 100.01000213623047, 'tt': 'hello', 'uu': 'hello', 'vv': modified}

schema = pa.schema([
    pa.field('a', pa.int32()),
    pa.field('b', pa.int32()),
    pa.field('c', pa.string()),
    pa.field('d', pa.string()),
    pa.field('e', pa.string()),
    pa.field('my_date', pa.date32()),
    pa.field('f', pa.float32()),
    pa.field('g', pa.string()),
    pa.field('h', pa.string()),
    pa.field('i', pa.date32()),
    pa.field('j', pa.float32()),
    pa.field('k', pa.string()),
    pa.field('l', pa.float32()),
    pa.field('m', pa.date32()),
    pa.field('n', pa.string()),
    pa.field('o', pa.string()),
    pa.field('p', pa.string()),
    pa.field('q', pa.string()),
    pa.field('r', pa.string()),
    pa.field('s', pa.date32()),
    pa.field('t', pa.float32()),
    pa.field('u', pa.string()),
    pa.field('v', pa.string()),
    pa.field('w', pa.bool_()),
    pa.field('x', pa.float32()),
    pa.field('y', pa.string()),
    pa.field('z', pa.float32()),
    pa.field('aa', pa.float32()),
    pa.field('bb', pa.float32()),
    pa.field('dd', pa.float32()),
    pa.field('ee', pa.float32()),
    pa.field('ff', pa.float32()),
    pa.field('gg', pa.float32()),
    pa.field('hh', pa.bool_()),
    pa.field('ii', pa.float32()),
    pa.field('jj', pa.string()),
    pa.field('kk', pa.string()),
    pa.field('ll', pa.string()),
    pa.field('mm', pa.string()),
    pa.field('nn', pa.string()),
    pa.field('oo', pa.string()),
    pa.field('pp', pa.string()),
    pa.field('qq', pa.string()),
    pa.field('rr', pa.int32()),
    pa.field('ss', pa.float32()),
    pa.field('tt', pa.string()),
    pa.field('uu', pa.string()),
    pa.field('vv', pa.timestamp('ms'))
])

arrow_columns = list()

for column in schema.names:
    arrow_columns.append(pa.array([good_columns[column]] * 10, type=schema.types[schema.get_field_index(column)]))

good_arrow_table = pa.Table.from_arrays(arrow_columns, names=schema.names)

bad_column = pa.array([None] * 10, type=schema.types[schema.get_field_index('my_date')])
arrow_columns[5] = bad_column

bad_arrow_table = pa.Table.from_arrays(arrow_columns, names=schema.names)

writer = pq.ParquetWriter("test_good.parquet", schema=schema, use_dictionary=True, compression='snappy')
writer.write_table(good_arrow_table)
writer.write_table(bad_arrow_table)
writer.write_table(good_arrow_table)
writer.close()

writer = pq.ParquetWriter("test_bad.parquet", schema=schema, use_dictionary=True, compression='snappy')
writer.write_table(bad_arrow_table)
writer.write_table(good_arrow_table)
writer.write_table(good_arrow_table)
writer.close()

#9

Thanks @david.lee,

I’ll give it a try and see what I can find.


#10

The preview window where you initially choose the file format (Parquet), shows 2018-12-25 correctly, but once you click “Save” and switch to the Dremio SQL Editor / UI, the 2018-12-25 values get swapped out with 1970-01-01.


#11

What version of Dremio are you using? You can find the build number under “Help” --> “About Dremio” in the menu bar.


#12

Have this problem in both 3.0 and 3.0.6

Build
3.0.6-201812082352540436-1f684f9


#13

Upgraded to 3.1.3 and now I’m seeing both 1970-01-01 and in some cases dates with year value of 211346… These just seems to be problems reading parquet date32 types…

select * from … gives me bad dates:
image

But select SummaryDate, Symbol, Country from… gives me the correct dates:

image


#14

Hi @david.lee,

That seems like a different error. Can you supply a sample again?


#15

Unfortunately, I cannot provide a sample. I tried both scrubbing out sensitive columns and reducing the number of rows into a smaller parquet file and the smaller parquet file wouldn’t duplicate the problem.

I will say that the problem appears to be related to NULL INT32 Date values. I have three date columns in my parquet file, but one of the date columns has NULL values across the board. If I remove that column from my select list then the date values in the other two columns show up fine.

The original parquet file with this problem is 128 megs, snappy compressed and has 22 row groups across 4.2 million rows.


#16

Thanks @david.lee for the information you have shared. Like the previous example, you are using calls to pyarrow.date32() to create that field, correct?
https://arrow.apache.org/docs/python/generated/pyarrow.date32.html


#17

Correct… Using pyarrow 0.12.0…

I also tried to download the bad results to a parquet file from Dremio and it looks like it saved the dates somehow with -11346-11-17 dates which couldn’t be read by pyarrow.

I’ve ran the same queries on the same files using Apache Drill and Spark without any date issues.