Dremio 3.0 Bug - Dates coming back as 1970-01-01

david.lee · November 8, 2018, 11:39pm

This Works:

select portfolioDate from MyDataSet.MyTable limit 5

This fails: the same portfolioDate column is coming back with 1970s

select * from MyDataSet.MyTable limit 5

If I replace * with a comma separated list of column names it also fails with 1970s coming back…

david.lee · November 8, 2018, 11:45pm

Ok this is odd… There seems to be some sort of limit on the number of columns…

If I cut down the result set down from 46 columns to 43 columns the portfolioDate column comes back correctly with 9/30/2018. I can exclude any three columns and it returns back the correct date.

This is reading a parquet file generated by pyarrow with a pa.date32() column type.

david.lee · November 9, 2018, 12:03am

I started up an Apache Drill server and ran the same query and it comes back with the right date using “Select * from”

david.lee · January 5, 2019, 6:21am

Is anyone looking into this one yet? Basically running a " select * " statement on a directory of parquet files isn’t working on parquet date columns.

david.lee · January 5, 2019, 8:28am

Attaching some dummy parquet files I generated for this error

test_files.zip (6.4 KB)

There are two files. test_good.parquet and test_bad.parquet.

Both files have 30 rows each made up from 3 row groups of 10 rows each

The good file has 10 rows for A, 10 rows of B and 10 rows of A
The bad file has 10 rows for B, 10 rows of A and 10 rows of A

Same rows in both files. Only the order is different.

The only difference between A rows and B rows is that the B rows have NULL values for my_date.

select * from test_good gives me good dates
my_date
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
12/25/2018
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)

select * from test_bad gives me 1970 dates instead of 12/25/2018
my_date
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
1/1/1970
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)
(null)

If I read test_bad.parquet with either Apache Drill or read it back using pyarrow I don’t see the bad 1970s.

ben · January 11, 2019, 6:20pm

Thanks @david.lee,

I was able to reproduce this using the sample files your provided. Did you have any issue formatting these in Dremio?

Also, can you give a profile for the wrong results run and the incorrect results run?

ben · January 11, 2019, 6:36pm

Also,

Can you share the version/build of pyarrow you’re using to generate these? From parquet-tools I see the parquet writer is parquet-cpp version 1.5.1-SNAPSHOT but I would like a more complete picture of the workflow.

david.lee · January 11, 2019, 11:25pm

This should be re-produceable using the latest pyarrow 0.11.1 version. (I’m using a compiled from latest source version which fixes an unrelated Pandas metadata issue)

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

modified = datetime.utcnow()
modified = datetime(modified.year, modified.month, modified.day, modified.hour, modified.minute, modified.second, modified.microsecond - modified.microsecond % 1000)

good_columns = {'a': 100, 'b': 100, 'c': 'hello', 'd': 'hello', 'e': 'hello', 'my_date': datetime.date(datetime(2018, 12, 25)), 'f': 100.01000213623047, 'g': 'hello', 'h': 'hello', 'i': datetime.date(datetime(2018, 12, 25)), 'j': 100.01000213623047, 'k': 'hello', 'l': 100.01000213623047, 'm': datetime.date(datetime(2018, 12, 25)), 'n': 'hello', 'o': 'hello', 'p': 'hello', 'q': 'hello', 'r': 'hello', 's': datetime.date(datetime(2018, 12, 25)), 't': 100.01000213623047, 'u': 'hello', 'v': 'hello', 'w': True, 'x': 100.01000213623047, 'y': 'hello', 'z': 100.01000213623047, 'aa': 100.01000213623047, 'bb': 100.01000213623047, 'dd': 100.01000213623047, 'ee': 100.01000213623047, 'ff': 100.01000213623047, 'gg': 100.01000213623047, 'hh': False, 'ii': 100.01000213623047, 'jj': 'hello', 'kk': 'hello', 'll': 'hello', 'mm': 'hello', 'nn': 'hello', 'oo': 'hello', 'pp': 'hello', 'qq': 'hello', 'rr': 100, 'ss': 100.01000213623047, 'tt': 'hello', 'uu': 'hello', 'vv': modified}

schema = pa.schema([
    pa.field('a', pa.int32()),
    pa.field('b', pa.int32()),
    pa.field('c', pa.string()),
    pa.field('d', pa.string()),
    pa.field('e', pa.string()),
    pa.field('my_date', pa.date32()),
    pa.field('f', pa.float32()),
    pa.field('g', pa.string()),
    pa.field('h', pa.string()),
    pa.field('i', pa.date32()),
    pa.field('j', pa.float32()),
    pa.field('k', pa.string()),
    pa.field('l', pa.float32()),
    pa.field('m', pa.date32()),
    pa.field('n', pa.string()),
    pa.field('o', pa.string()),
    pa.field('p', pa.string()),
    pa.field('q', pa.string()),
    pa.field('r', pa.string()),
    pa.field('s', pa.date32()),
    pa.field('t', pa.float32()),
    pa.field('u', pa.string()),
    pa.field('v', pa.string()),
    pa.field('w', pa.bool_()),
    pa.field('x', pa.float32()),
    pa.field('y', pa.string()),
    pa.field('z', pa.float32()),
    pa.field('aa', pa.float32()),
    pa.field('bb', pa.float32()),
    pa.field('dd', pa.float32()),
    pa.field('ee', pa.float32()),
    pa.field('ff', pa.float32()),
    pa.field('gg', pa.float32()),
    pa.field('hh', pa.bool_()),
    pa.field('ii', pa.float32()),
    pa.field('jj', pa.string()),
    pa.field('kk', pa.string()),
    pa.field('ll', pa.string()),
    pa.field('mm', pa.string()),
    pa.field('nn', pa.string()),
    pa.field('oo', pa.string()),
    pa.field('pp', pa.string()),
    pa.field('qq', pa.string()),
    pa.field('rr', pa.int32()),
    pa.field('ss', pa.float32()),
    pa.field('tt', pa.string()),
    pa.field('uu', pa.string()),
    pa.field('vv', pa.timestamp('ms'))
])

arrow_columns = list()

for column in schema.names:
    arrow_columns.append(pa.array([good_columns[column]] * 10, type=schema.types[schema.get_field_index(column)]))

good_arrow_table = pa.Table.from_arrays(arrow_columns, names=schema.names)

bad_column = pa.array([None] * 10, type=schema.types[schema.get_field_index('my_date')])
arrow_columns[5] = bad_column

bad_arrow_table = pa.Table.from_arrays(arrow_columns, names=schema.names)

writer = pq.ParquetWriter("test_good.parquet", schema=schema, use_dictionary=True, compression='snappy')
writer.write_table(good_arrow_table)
writer.write_table(bad_arrow_table)
writer.write_table(good_arrow_table)
writer.close()

writer = pq.ParquetWriter("test_bad.parquet", schema=schema, use_dictionary=True, compression='snappy')
writer.write_table(bad_arrow_table)
writer.write_table(good_arrow_table)
writer.write_table(good_arrow_table)
writer.close()

ben · January 11, 2019, 11:41pm

Thanks @david.lee,

I’ll give it a try and see what I can find.

david.lee · January 11, 2019, 11:42pm

The preview window where you initially choose the file format (Parquet), shows 2018-12-25 correctly, but once you click “Save” and switch to the Dremio SQL Editor / UI, the 2018-12-25 values get swapped out with 1970-01-01.

ben · January 11, 2019, 11:51pm

What version of Dremio are you using? You can find the build number under “Help” --> “About Dremio” in the menu bar.

david.lee · January 12, 2019, 12:00am

Have this problem in both 3.0 and 3.0.6

Build
3.0.6-201812082352540436-1f684f9

david.lee · February 22, 2019, 8:57pm

Upgraded to 3.1.3 and now I’m seeing both 1970-01-01 and in some cases dates with year value of 211346… These just seems to be problems reading parquet date32 types…

select * from … gives me bad dates:

But select SummaryDate, Symbol, Country from… gives me the correct dates:

ben · February 25, 2019, 4:03am

Hi @david.lee,

That seems like a different error. Can you supply a sample again?

david.lee · February 26, 2019, 7:02pm

Unfortunately, I cannot provide a sample. I tried both scrubbing out sensitive columns and reducing the number of rows into a smaller parquet file and the smaller parquet file wouldn’t duplicate the problem.

I will say that the problem appears to be related to NULL INT32 Date values. I have three date columns in my parquet file, but one of the date columns has NULL values across the board. If I remove that column from my select list then the date values in the other two columns show up fine.

The original parquet file with this problem is 128 megs, snappy compressed and has 22 row groups across 4.2 million rows.

ben · February 26, 2019, 8:26pm

Thanks @david.lee for the information you have shared. Like the previous example, you are using calls to pyarrow.date32() to create that field, correct?
https://arrow.apache.org/docs/python/generated/pyarrow.date32.html

david.lee · February 26, 2019, 9:46pm

Correct… Using pyarrow 0.12.0…

I also tried to download the bad results to a parquet file from Dremio and it looks like it saved the dates somehow with -11346-11-17 dates which couldn’t be read by pyarrow.

I’ve ran the same queries on the same files using Apache Drill and Spark without any date issues.

david.lee · May 20, 2019, 7:31pm

I can confirm this has been fixed now in the 3.2 release…

Topic		Replies	Views
Dremio can't read datetime values in parquet files created with Drill 1.11	1	1337	September 19, 2017
Dates are shifted by 1 day using JDBC Driver	6	2584	November 29, 2017
DateTime column returning no results	3	1652	October 7, 2022
Dremio can not read Parquet produced by Arrow	0	1468	March 27, 2020
Regression in parquet reader in version 3.3.1	27	3201	October 1, 2019

Dremio 3.0 Bug - Dates coming back as 1970-01-01

Related topics