My views are created from Parquet files stored in MapR FS. For some of the views, accessing them through Dremio gives an error like "Error while expanding view…"; checking the Job Profile indicates the error is caused by '(java.security.AccessControlException) Permission Denied'.
And sometimes it gives the error "Access denied reading dataset Findatalake.mnr.cts."
Some of these errors get resolved by refreshing the metadata, but sometimes that does not help.
I want to know whether this is always actually related to permissions and access. How do I check the metadata, whether it is corrupted, and for what reasons it can get corrupted.
View corruption – how can we find out if a view is corrupt?
Will a view become corrupt if a new column is added at the end of the current file? Will this ever happen, given that we always do this via a schema?
What happens if the schema changes?
Will the file read fail if a new column is added at the end of the file?
Do we need to recreate views, and if yes, when?
Dremio runs a VDS as the user who last saved it, not as the logged-in user. It is possible that the user who last saved the VDS lost access. You can re-save the VDS; that makes you the last saver, and the query will then run as you.
Do you have impersonation turned on?
I guess not, as it's a production server. Is there any option to check from the Dremio web page whether impersonation is enabled?
Also, for the access-denied error in production, here is what I found.
select count(*) AS "LIQ_POOL_DEAL" FROM Fin.Liq.udm."LIQ_POOL_DEAL" where dir0 = '20210129_0'
It's throwing this error:
Access denied reading dataset Fin.Liq.udm.LIQ_POOL_DEAL
ERROR: Caused By (java.io.FileNotFoundException) maprfs://xyz/abc/LIQUIDITY_POOL_DEAL/20200918_0/part-00000-160b1e7c-4d43-414c-8ebc-7b4862a7cb80-c000.snappy.parquet
On investigating, I found that the other Parquet files are there, but this particular file is not; hence the error is thrown.
My question is: where is Dremio getting the name of a Parquet file that is not present in that folder, and how can I find that out?
Dremio would have learned about the file when it was originally there. Just run the command below (shown here against the dataset from your query) and retry:
ALTER PDS Fin.Liq.udm."LIQ_POOL_DEAL" REFRESH METADATA FORCE UPDATE
Thanks for your response.
It's giving an ERROR while trying to REFRESH:
java.io.IOException: maprfs://abc/xyz/part-00000-f7a7f1c5-34d2-41cf-962c-0a00d71bf9d8-c000.snappy.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-92, -61, -12, -105].
Seems like an issue with the Parquet file format, or the file is corrupted. For what reasons could this Parquet file get corrupted, and would moving it to some other folder resolve the issue, or is there another way to resolve it? I need to be clear before trying anything in production.
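As the error message indicates, a valid Parquet file ends with the 4-byte magic number `PAR1` (bytes [80, 65, 82, 49]). A minimal stdlib-only Python sketch to check that tail signature before involving Dremio (the function name and paths are my own, not part of any Dremio tooling):

```python
import os

PARQUET_MAGIC = b"PAR1"  # expected last 4 bytes of a valid Parquet file

def looks_like_parquet(path):
    """Return True if the file ends with the Parquet footer magic."""
    size = os.path.getsize(path)
    if size < 8:  # a real file has at least header magic + footer magic
        return False
    with open(path, "rb") as f:
        f.seek(size - 4)
        return f.read(4) == PARQUET_MAGIC
```

A file failing this check would produce exactly the "expected magic number at tail" error you are seeing; a file passing it can still be corrupt internally, so this is only a quick first triage.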
@Vinod What happens if you remove the format and add it again?
I have not tried to remove or add anything. I have not got access to production yet today, but I am thinking of moving that particular Parquet file to some temporary location to see if the REFRESH works then.
What I need to know, to avoid such errors in the future, is: in which cases does a Parquet file become corrupted or unreadable from Dremio's side? What actions or situations could cause an error where the file appears to be "not a Parquet file"?
Even if moving the Parquet file to another location helps, wouldn't that mean I have lost part of the data? Is there any way to recover it?
I know there are many questions, but they are hounding me, as I am not able to find appropriate answers anywhere. Thanks a ton for all your time and help.
After promotion, this can happen if the schema changes, or if you try to read the file while the ETL job is still writing to it.
Would either of the above be true?
I think the second option could be the case in my production environment, as there is a job that reads content from Oracle tables and dumps it as Parquet into the MapR filesystem.
So if, during the job run, the production guys query the Dremio space to check counts, that could be a situation where data is being written as Parquet while the path is being queried from Dremio.
But another point: the data causing the issue is from 20200918, so it should have caused this problem soon after 20200918. Why now?
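One common way to avoid readers ever seeing a half-written file is for the ETL job to write to a temporary name and rename into place only when the write is complete (rename is atomic on POSIX filesystems within the same directory). A hedged stdlib-only Python sketch of the pattern; the function name and suffix are my own illustration, not what your ETL job actually does:

```python
import os

def write_atomically(final_path, data):
    """Write bytes under a temporary name, then rename, so a reader
    either sees no file at final_path or a fully written one."""
    tmp_path = final_path + ".inprogress"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ensure the bytes reach disk before the rename
    os.rename(tmp_path, final_path)  # atomic swap into the final name
```

If the ETL job instead writes directly to the final name, a concurrent Dremio read can observe a truncated file, which would explain a tail that is not `PAR1`.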
I am not that deep into the Dremio world; could you tell me what exactly PROMOTION stands for in Dremio?
Folders on your lake cannot be queried until they are converted to a physical dataset, which is also called promotion. When a folder is promoted, its icon becomes purple.
The tutorial below explains this in detail.
Thanks for the info, Bala.
The situation now is that I have to run a query on one of the views in Dremio, but it fails with access denied because it is trying to access a file that is no longer present.
On trying to refresh the metadata, it fails with another error saying one of the Parquet files does not seem to be a Parquet file.
What I have in mind is to move the corrupted Parquet file to a temp folder, then try the refresh, and then try the actual query.
But is there anything else I can try? Also, is there any way to handle the scenario where the schema changes or some Parquet file gets corrupted?
I mean, how do we avoid that, or, if it occurs, is there a way to recover the Parquet file? We do not want to lose any data.
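The quarantine step described above (move any bad file aside, then refresh) can be sketched as a small stdlib-only Python script that scans a folder and moves every `*.parquet` file whose last 4 bytes are not the `PAR1` magic into an archive directory. The function and directory names are hypothetical, not part of Dremio or MapR tooling:

```python
import os
import shutil

PARQUET_MAGIC = b"PAR1"  # valid Parquet files end with these 4 bytes

def quarantine_bad_parquet(src_dir, archive_dir):
    """Move *.parquet files in src_dir that lack the Parquet footer
    magic into archive_dir; return the list of moved file names."""
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not name.endswith(".parquet") or not os.path.isfile(path):
            continue
        tail = b""
        size = os.path.getsize(path)
        if size >= 4:
            with open(path, "rb") as f:
                f.seek(size - 4)
                tail = f.read(4)
        if tail != PARQUET_MAGIC:
            shutil.move(path, os.path.join(archive_dir, name))
            moved.append(name)
    return moved
```

Moving rather than deleting keeps the raw bytes available in case partial recovery or comparison with a re-exported copy is needed later.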
I am curious to see what this Parquet error is. Do you have a query profile?
The error pasted in the previous comment is from the Job Profile's Error tab.
I cannot access the production Job Profile now, as we are not allowed to.
But I got the corrupted Parquet file moved to an archive directory with production support's help,
and the issue is resolved now.
But the question remains: what led to the file getting corrupted?
Dremio only reads from these files and does not write to them, so it is best to check with your ETL team.