Chaotic situation after upgrade to v11

Hi,
I’m calling for help.

A few weeks after upgrading to v11 Community, my platform started to become unstable.
My source database is Oracle, and I knew I would have to rewrite some of my queries because of the date format change for Oracle sources.
I did so by using to_date() in all my VDSs that read date fields from PDSs.
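For instance (just a sketch, with made-up table and column names, not my real ones), a VDS filter that used to compare the raw date column now looks roughly like this:

select *
from PRH."S9704_GL_00".SOME_TABLE
where to_date(DATE_COL, 'YYYY-MM-DD') >= to_date('2020-01-01', 'YYYY-MM-DD')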
It worked fine for about 2 weeks and then, all of a sudden, things went haywire.
Joins on date ranges using WITH clauses returned inconsistent data; a reflection refresh on an Oracle view with dates failed with an "illegal month" error while other reflections on tables with dates worked; reflections on VDSs that used to be accelerated by PDS and/or VDS reflections no longer get accelerated; data returned by VDSs became incoherent with the real source data; and so on.
I'm at the point where the complete data lake is no longer operational.

Any idea why it worked well for a few weeks (and much faster than the 4.7 version I was running before) and has been crawling ever since?
Which path should I follow to troubleshoot, or should I go back and restore my 2- or 3-week-old backup?

Any help will be more than welcome in this emergency situation.

Thanks,
Berni

Hi @Berni

Sorry to hear that you are having issues. Let's go through them one by one:

  • For Oracle, send us the profile. Can you also try an external query instead (see the example below)?
  • For the data lake, let's start with the profile. Any chance you still have the old profile?
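For the external query test, something like this goes straight to Oracle and bypasses most of the Dremio planning (source, schema and table names below are only placeholders):

SELECT * FROM TABLE(ORACLE_SRC.external_query('SELECT * FROM MY_SCHEMA.MY_TABLE'))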

Hi @balaji.ramaswamy
Thanks for responding to my alarming message.
Since yesterday evening, I’ve made some progress.
After creating a new source connection to my Oracle server (same configuration as the current production one) and querying some PDSs with a simple SELECT, a query that times out in the planning phase on the production connector runs blazingly fast on the new one.
The same happens when recreating reflections on that new source.
I then modified some of my VDSs to use the new source name, and they also run smoothly.

I then thought of dropping the old source (as it cannot be renamed) and recreating it under the same name, thinking that this might have to do with the Oracle JDBC drivers.
Obviously I lost all my PDS reflections.
Sadly, after pointing my VDSs back to the old-name connector, I was back to PDS queries timing out in the planning phase.
So I created the source under a brand-new name, and it started to work again.
I'm now in the process of recreating all my PDS reflections and changing the source name to the new one in all my VDSs.

At this point I think the problem has to do with the stats, dependencies, catalog, … of the old source name.
Of course, I ran all the dremio-admin clean operations before starting all of this.

Attached are profiles for a query on a 2-row PDS using the old and the new source.
newsource.zip (11,0 Ko) oldsource.zip (676,3 Ko)

Thx

@balaji.ramaswamy,
I answered a bit too fast :frowning:

It turns out that while changing the source referenced in a VDS fixed the issue for VDS creation or reflection creation on that same VDS, querying the VDS itself (select * from VDS) has the same problem as querying the PDS on the old source.

Again, same thing: recreating the VDS under a different name (same query) and querying it fixed the problem.
Dropping the old-named VDS and then renaming the new one back to the old name brings the trouble back.

This time it's not going to be possible to rename the VDSs, as they are used in multiple QlikSense dashboards that would all have to be rebuilt :dizzy_face:

Is there any command to just clear all the caches, stats, data, … and keep only the VDS and reflection definitions?

Cheers,

@Berni

To remove metadata for a PDS, just FORGET METADATA on the dataset. But if you have ACLs or reflections on the PDS, they will be removed. If they are on a VDS, they should be fine. You can test on one PDS and then proceed.
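For example, something like this (the dataset path is only a placeholder):

ALTER PDS ORACLE_SRC.MY_SCHEMA.MY_TABLE FORGET METADATA

Dremio will pick the metadata up again from the source the next time the table is queried or its metadata is refreshed.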

@balaji.ramaswamy,

I found out that my main issue was linked to the number of reflections considered during the planning phase.
After removing them all, I realized that every PDS and VDS worked and that most of the reflections were quite useless, probably thanks to the improvements introduced in 4.9 (default reflections), I think.
Now that I keep reflections only on the biggest PDSs and the most complex VDSs, I'm back to a working data lake, I think.
I just have to let it run for a few days to check that the refreshes are stable.
I’ll let you know how it goes.

Thanks for your support on the issue.

Lesson learned: reflections are good as long as you keep their number low and avoid dependencies between them.

By the way, is there any reason why a Run on a VDS works while creating a reflection on it fails with two out-of-memory attempts?

Cheers,

@Berni

It depends on where we ran out of memory. Are you able to provide a profile?

Thanks
Bali

Hi,
First of all, happy new year to you and all your loved ones.
Second, sorry for the late feedback, but I had a very busy end and start of the year.

Back to the Dremio subject.
After more than a month of very stable and fast running, this morning we are back in a quite similar situation, but this time the cause doesn't seem to be linked to reflections, even though the consequences look very similar.

The situation is quite simple: every query on a PDS that has been queried before fails with a timeout in the planning phase.
E.g., a simple SELECT on a two-row table fails with a timeout.

An extract from the profile shows "durationMillis" : 68122 (see below).

If I create a new source pointing to the same DB and run the same query, the result is immediate.

I also tried forgetting the metadata and refreshing it again, and the result is the same.
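For reference, that was roughly:

ALTER PDS PRH3."S9704_GL_00".AGENCE FORGET METADATA
ALTER PDS PRH3."S9704_GL_00".AGENCE REFRESH METADATA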

As additional information, I'm still running the 11.0 Community version on a single host.

I'd prefer not to have to go back through all my queries and change the source name, as I did last time, in order to start fresh.

Any help very much appreciated.

{
“phaseName” : “Logical Planning”,
“durationMillis” : 68122,
“plan” : “LogicalProject(CODE=[$0], LIBE=[$1], RESP=[$2], ADR1=[$3], ADR2=[$4], PTT=[$5], LCOM=[$6], TELE=[$7], DEFI=[$8]): rowcount = 2.0, cumulative cost = {inf}, id = 1821925\n SampleCrel: rowcount = 2.0, cumulative cost = {inf}, id = 1821923\n ScanCrel(table=[PRH3.S9704_GL_00.AGENCE], columns=[CODE, LIBE, RESP, ADR1, ADR2, PTT, LCOM, TELE, DEFI], splits=[1]): rowcount = 2.0, cumulative cost = {inf}, id = 1821902\n”,
“plannerDump” : “Root: rel#1821932:Subset#2.LOGICAL.ANY([]).[]\nOriginal rel:\nLogicalProject(subset=[rel#1821932:Subset#2.LOGICAL.ANY([]).[]], CODE=[$0], LIBE=[$1], RESP=[$2], ADR1=[$3], ADR2=[$4], PTT=[$5], LCOM=[$6], TELE=[$7], DEFI=[$8]): rowcount = 2.0, cumulative cost = {2.0 rows, 18.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1821930\n SampleCrel(subset=[rel#1821929:Subset#1.NONE.ANY([]).[]]): rowcount = 2.0, cumulative cost = {2.0 rows, 2.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1821928\n ScanCrel(subset=[rel#1821927:Subset#0.NONE.ANY([]).[]], table=[PRH3.S9704_GL_00.AGENCE], columns=[CODE, LIBE, RESP, ADR1, ADR2, PTT, LCOM, TELE, DEFI], splits=[1]): rowcount = 2.0, cumulative cost = {inf}, id = 1821902\n\nSets:\nSet#0, type: RecordType(VARCHAR(65536) CODE, VARCHAR(65536) LIBE, VARCHAR(65536) RESP, VARCHAR(65536) ADR1, VARCHAR(65536) ADR2, VARCHAR(65536) PTT, VARCHAR(65536) LCOM, VARCHAR(65536) TELE, VARCHAR(65536) DEFI)\n\trel#1821927:Subset#0.NONE.ANY([]).[], best=null, importance=0.7290000000000001\n\t\trel#1821902:ScanCrel.NONE.ANY([]).[](table=PRH3.S9704_GL_00.AGENCE,columns=CODE, LIBE, RESP, ADR1, ADR2, PTT, LCOM, TELE, DEFI,splits=1,tableDigest=13744457|PRH3|81d9e42f-e3ee-4c1f-900b-b18544f794ca), rowcount=2.0, cumulative cost={inf}\nSet#1, type: RecordType(VARCHAR(65536) CODE, VARCHAR(65536) LIBE, VARCHAR(65536) RESP, VARCHAR(65536) ADR1, VARCHAR(65536) ADR2, VARCHAR(65536) PTT, VARCHAR(65536) LCOM, VARCHAR(65536) TELE, VARCHAR(65536) DEFI)\n\trel#1821929:Subset#1.NONE.ANY([]).[], best=null, importance=0.81\n\t\trel#1821928:SampleCrel.NONE.ANY([])., rowcount=2.0, cumulative cost={inf}\nSet#2, type: RecordType(VARCHAR(65536) CODE, VARCHAR(65536) LIBE, VARCHAR(65536) RESP, VARCHAR(65536) ADR1, VARCHAR(65536) ADR2, VARCHAR(65536) PTT, VARCHAR(65536) LCOM, VARCHAR(65536) TELE, VARCHAR(65536) DEFI)\n\trel#1821931:Subset#2.NONE.ANY([]).[], best=null, importance=0.9\n\t\trel#1821930:LogicalProject.NONE.ANY([])., rowcount=2.0, cumulative cost={inf}\n\trel#1821932:Subset#2.LOGICAL.ANY([]).[], best=null, importance=1.0\n\t\trel#1821933:AbstractConverter.LOGICAL.ANY([])., rowcount=2.0, cumulative cost={inf}\n\n”
}

@Berni

Happy new year !

Kindly send us the job profile and the server.gc log that covers the time the query was run. It may also have rolled over to server.gc.1 and so on.

Thanks
Bali

Thanks for your feedback.
Attached are the working and non-working profiles.
The only difference is the source name; both have the same configuration and point to the same Oracle DB.
The working source is PRH.
The non-working one is PRH3.

And the gc file.

Archive 2.zip (773,0 Ko)

Cheers,

Checking back again this morning, the queries that were failing with the 60 s timeout during the planning phase started to work again.
The reason seems to be that planning now stays under 60 s.
Most of the queries, whatever their complexity, seem to have a planning phase between 47 and 58 s.
Doing the same on the new source, planning stays below 20 to 30 s, and under 2 s for the SELECT query I sent you.
Does that help?

Cheers,

@Berni

The difference between the working and non-working profiles is that in the non-working one a bunch of reflections are considered, while in the working one no reflections are considered.

Try this:

Create a VDS in a space from "select * from PRH3.S9704_GL_00.AGENCE", then turn on a raw reflection on the VDS and see if planning and execution are fast. This is just a test; ideally you should build a semantic layer and create raw/agg reflections on a VDS that already joins the most common tables, minus filters and aggregates.
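In SQL, that test would look roughly like this (the space and reflection names are placeholders; the column list is the one from your AGENCE profile):

CREATE VDS TestSpace.AGENCE_TEST AS SELECT * FROM PRH3."S9704_GL_00".AGENCE

ALTER DATASET TestSpace.AGENCE_TEST CREATE RAW REFLECTION agence_raw USING DISPLAY (CODE, LIBE, RESP, ADR1, ADR2, PTT, LCOM, TELE, DEFI)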

Thanks
Bali

Hi,
The point is that I used to have lots of reflections on the PRH source.
Following the initial issue, I created a new source PRH3 for the same DB, removed all reflections on the PRH PDSs, and fine-tuned the reflections on PRH3 to keep only the necessary ones.
Looking at the AGENCE PDS, there was a reflection on the PRH source, but it was removed a long time ago, and that reflection never existed on the PRH3 source.
Moreover, the AGENCE PDS has only 2 records.

The only thing I notice is that on the PRH source, even though all reflections have been removed, I still have "never expire" checked in the reflection refresh policy.
I just unchecked it and set the expiry to 0 hours.

My guess is that the PRH3 catalog or search indices became unstable, or that I have I/O issues on the storage.
The second option can be ruled out, as my Dremio configuration uses the same storage for both sources.
I then ran both queries while monitoring the accessed folders:
select * FROM "PRH"."S9704_GL_00".AGENCE => less than a second
Establishing watches…
Finished establishing watches, now collecting statistics.
total access modify attrib close_write close_nowrite open create delete filename
1004 0 741 0 53 36 89 51 34 /mnt/data/db/search/jobs/core/
88 64 24 0 0 0 0 0 0 /mnt/data/db/catalog/
12 2 2 1 1 2 3 1 0 /mnt/data/pdfs/results/1fe9c6fd-e6cc-eaa6-1d09-5a7848af1b00/
5 1 0 1 0 1 1 1 0 /mnt/data/pdfs/results/
4 0 4 0 0 0 0 0 0 /mnt/data/zk/zookeeper/version-2/

select * FROM "PRH3"."S9704_GL_00".AGENCE => timeout, or a bit less than 60 s
Establishing watches…
Finished establishing watches, now collecting statistics.
total access modify attrib close_write close_nowrite open moved_from moved_to create delete filename
2456 2294 162 0 0 0 0 0 0 0 0 /mnt/data/db/catalog/
301 0 50 0 69 26 98 1 1 31 25 /mnt/data/db/search/jobs/core/
9 1 2 1 1 1 2 0 0 1 0 /mnt/data/pdfs/results/1fe9c46f-8739-9c9a-f386-96d8ddbe5100/
5 1 0 1 0 1 1 0 0 1 0 /mnt/data/pdfs/results/
4 0 4 0 0 0 0 0 0 0 0 /mnt/data/zk/zookeeper/version-2/
2 0 0 0 0 1 1 0 0 0 0 /mnt/data/db/search/jobs/

As we can see, the main difference is on /mnt/data/db/catalog/.
On the PRH source, only 88 accesses.
On the PRH3 source, 2456.

I then tried ./bin/dremio-admin reset-catalog-search, hoping to see a change.
There are fewer accesses to the /mnt/data/db/catalog/ folder, but still 15 times more than on the PRH source, and still 57 s in the planning state and less than 1 ms in all other operations.
Attached are the profiles of the last 2 queries.
I'll keep looking through the metadata, unless you have other advice or ideas.

Archive 3.zip (478,2 Ko)

Cheers,

As further info, I do have some PDSs in PRH3 that still behave correctly even though they contain more records.
E.g., attached is the profile of a select * from AGENT.
PRH3-Working-profile.zip (12,9 Ko)

Going in that direction, it looks like all the PDSs in the PRH3 source that used to have a reflection in the PRH source (but no longer do) spend a long time in planning.
PDSs that never had a reflection in the PRH source respond very fast in PRH3.

It sounds like the engine gets confused between the two sources that point to the same Oracle DB.

I'll try to remove the PRH source in the source config, but as far as I remember that doesn't clean up RocksDB.

cheers,

Good evening,
I think I’ve fixed the issue.
After activating the debug log level, I noticed that on some PDSs, when doing a simple SELECT, the foreman-planning task was launching a bunch of these:
2021-01-31 12:02:24,729 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG o.apache.calcite.plan.RelOptPlanner
on all the VDSs that use the tables listed in this series of messages:
2021-01-31 12:02:24,144 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.sql.handlers.PrelTransformer - HEP_BOTTOM_UP:Pre-Logical Filter Pushdown (0ms):

2021-01-31 12:02:24,710 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.DremioSubstitutionProvider - Eliminate extra starflake tables

2021-01-31 12:02:24,710 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,710 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,710 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,710 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, BATIMENT]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, MONTEE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, BATIMENT]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, BATIMENT]

2021-01-31 12:02:24,711 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, MONTEE]

2021-01-31 12:02:24,712 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,712 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

2021-01-31 12:02:24,712 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, BATIMENT]

2021-01-31 12:02:24,712 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,712 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

2021-01-31 12:02:24,712 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,712 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,714 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, CONTACT]

2021-01-31 12:02:24,714 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,714 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

2021-01-31 12:02:24,714 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, BATIMENT]

2021-01-31 12:02:24,714 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, MONTEE]

2021-01-31 12:02:24,714 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,714 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

2021-01-31 12:02:24,715 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, BATIMENT]

2021-01-31 12:02:24,715 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, ENSEMBLE]

2021-01-31 12:02:24,715 [1fe963af-41a1-b5eb-fb14-fde42afb8400/0:foreman-planning] DEBUG c.d.e.p.a.s.SnowflakeUtilities - Duplicate table found: [PRH3, S9704_GL_00, AGENCE]

All these tables are PDSs; most of them are small, but they are used in complex VDSs that have reflections.

I then realized that all these tables are used in a new VDS whose reflection was created the day before the problem arose.
I dropped the reflection and everything seems to run smoothly again.
I then recreated it, as the VDS is quite complex, and everything keeps running fine, under 1 s for a simple SELECT over a PDS.
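For reference, the drop/recreate can also be done in SQL rather than in the UI, roughly like this (VDS path, reflection name and columns are placeholders for my real ones):

ALTER DATASET MySpace.MY_COMPLEX_VDS DROP REFLECTION my_raw
ALTER DATASET MySpace.MY_COMPLEX_VDS CREATE RAW REFLECTION my_raw USING DISPLAY (COL1, COL2, COL3)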

I'm not sure where the problem is, or why a reflection on a VDS can cause that kind of situation for PDS access.

Maybe you have some insight.

Cheers,

@Berni

I do not see an access error in any of the profiles. Is there one I missed? Also attaching best practices for reflections and building a semantic layer.

White-Paper.zip (1.9 MB)