Working with a big dataset

Hi!

I'm trying to use Dremio with a large dataset of events in Hive (40 TB). It's partitioned by date. When I open the dataset it displays "Preparing Results…" forever. Only a single coordinator node is running the job, and when I try to open the job plan I see {"errorMessage":"Something went wrong","moreInfo":"Failed to get query profile."}

Is it possible to work with such big datasets in Dremio? How can I import it?

From what you are saying, it sounds like the planning phase is not completing.
Could you quantify "forever", though? 60 sec, 10 min, other?
Do you have just a single node to handle this volume of data, or do you have a single coordinator and multiple executors?
Do you run "Preview" or an actual "Run"?

The cluster is a single master and 13 c+e (coordinator + executor) nodes. The executors and other coordinators did nothing. I waited for more than 10 minutes. It was a preview.
I noticed a full GC every few seconds, each lasting more than a second, on the master, along with disconnections from ZooKeeper.

It seems to me that the metadata of such a dataset is too large for planning and no partition pruning is done.

This problem is not limited to the Hive source. When the data volume is really large, Dremio cannot create the dataset at all.

I've just tried to access an S3 dataset of gzipped JSON files (60 TB).

After about 10 minutes, any attempt to preview or save the dataset ends with "Table '…' not found".

I tried a query that restricts the directories, as described in the documentation, and got the same result:
SELECT *
FROM …
WHERE
  dir0 = 'partition_date=2017-05-10'
  AND dir1 = 'partition_hour=12'

The following query does work, but that's not what I want:
SELECT *
FROM …"partition_date=2017-05-10"."partition_hour=12"

During execution of such a query, the coordinator tries to recursively list all files in this directory:

"25613b81-d1ba-017f-93cb-87652cbff000:foreman" #190 daemon prio=10 os_prio=0 tid=0x00007fa558bb9730 nid=0x6b44 runnable [0x00007fa528e1b000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:82)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1190)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4221)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4168)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4162)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:821)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listObjects(S3AFileSystem.java:918)
at org.apache.hadoop.fs.s3a.Listing$ObjectListingIterator.<init>(Listing.java:397)
at org.apache.hadoop.fs.s3a.Listing.createFileStatusListingIterator(Listing.java:72)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:1403)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:1369)
at com.dremio.plugins.s3.store.S3FileSystem.listStatus(S3FileSystem.java:318)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1536)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1579)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1263)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1265)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1265)
at com.dremio.exec.store.dfs.FileSystemWrapper.listRecursive(FileSystemWrapper.java:1233)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:241)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:235)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:224)
at com.dremio.exec.store.dfs.FileSystemPlugin.getDatasetWithFormat(FileSystemPlugin.java:331)
at com.dremio.exec.store.dfs.FileSystemPlugin.getDataset(FileSystemPlugin.java:310)
at com.dremio.exec.store.SimpleSchema.getTableFromSource(SimpleSchema.java:351)
at com.dremio.exec.store.SimpleSchema.getTableWithRegistry(SimpleSchema.java:284)
at com.dremio.exec.store.SimpleSchema.getTable(SimpleSchema.java:406)
at org.apache.calcite.jdbc.SimpleCalciteSchema.getImplicitTable(SimpleCalciteSchema.java:67)
at org.apache.calcite.jdbc.CalciteSchema.getTable(CalciteSchema.java:219)
at org.apache.calcite.prepare.CalciteCatalogReader.getTableFrom(CalciteCatalogReader.java:117)
at org.apache.calcite.prepare.CalciteCatalogReader.getTable(CalciteCatalogReader.java:106)
at org.apache.calcite.prepare.CalciteCatalogReader.getTable(CalciteCatalogReader.java:73)
at org.apache.calcite.sql.validate.EmptyScope.getTableNamespace(EmptyScope.java:71)
at org.apache.calcite.sql.validate.DelegatingScope.getTableNamespace(DelegatingScope.java:189)
at org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl(IdentifierNamespace.java:104)
at org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:910)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:891)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom(SqlValidatorImpl.java:2859)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom(SqlValidatorImpl.java:2844)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect(SqlValidatorImpl.java:3077)
at org.apache.calcite.sql.validate.SelectNamespace.validateImpl(SelectNamespace.java:60)
at org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:910)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:891)
at org.apache.calcite.sql.SqlSelect.validate(SqlSelect.java:208)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression(SqlValidatorImpl.java:866)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validate(SqlValidatorImpl.java:577)
at com.dremio.exec.planner.sql.SqlConverter.validate(SqlConverter.java:188)
at com.dremio.exec.planner.sql.handlers.PrelTransformer.validateNode(PrelTransformer.java:165)
at com.dremio.exec.planner.sql.handlers.PrelTransformer.validateAndConvert(PrelTransformer.java:153)
at com.dremio.exec.planner.sql.handlers.query.NormalHandler.getPlan(NormalHandler.java:43)
at com.dremio.exec.planner.sql.handlers.commands.HandlerToExec.plan(HandlerToExec.java:66)
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

In the case of a Hive table, the end of the stack trace is exactly the same. The problem seems to be in the following part of the stack (a standalone sketch of this listing pattern follows the excerpt):
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:1369)
at com.dremio.plugins.s3.store.S3FileSystem.listStatus(S3FileSystem.java:318)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1536)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1579)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1263)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1265)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1265)
at com.dremio.exec.store.dfs.FileSystemWrapper.listRecursive(FileSystemWrapper.java:1233)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:241)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:235)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:224)
at com.dremio.exec.store.dfs.FileSystemPlugin.getDatasetWithFormat(FileSystemPlugin.java:331)
at com.dremio.exec.store.dfs.FileSystemPlugin.getDataset(FileSystemPlugin.java:310)
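For reference, here is a minimal standalone sketch of that listing pattern using Hadoop's FileSystem API. This is not Dremio's actual code; the s3a path is a placeholder and it assumes hadoop-aws is on the classpath. Running something like this against the dataset root gives a rough idea of how long a single-threaded recursive listing of that many objects takes, which is what the foreman thread above appears to spend its time on during planning.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Standalone sketch of the listing pattern seen in the trace above:
// listStatus on a directory, then recurse into every child directory.
public class RecursiveListingTimer {

  static long countFiles(FileSystem fs, Path dir) throws Exception {
    long files = 0;
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDirectory()) {
        files += countFiles(fs, status.getPath()); // depth-first recursion
      } else {
        files++;
      }
    }
    return files;
  }

  public static void main(String[] args) throws Exception {
    Path root = new Path("s3a://my-bucket/events"); // placeholder dataset root
    FileSystem fs = FileSystem.get(root.toUri(), new Configuration());
    long start = System.currentTimeMillis();
    long files = countFiles(fs, root);
    System.out.printf("%d files listed in %d ms%n",
        files, System.currentTimeMillis() - start);
  }
}

Even outside Dremio, a layout like partition_date=…/partition_hour=… means at least one S3 LIST request per directory (more if a directory holds many objects), so the total listing time is driven by the number of directories and files, not by the selectivity of the query.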

Thank you @Alexander_Sterligov for the detailed report. We are aware of this issue and will keep you posted on the progress.

Hi, is there any update on this yet? I seem to be facing the same issue with massive datasets in MongoDB.

Same here with just a 2.5 GB file on Azure Data Lake… :frowning:

@dbrys and @djpirra, can you open separate threads? I don't think your issues are related to the underlying issue in this thread.

Thanks!