Working with a big dataset

Hi!

I'm trying to use Dremio with a large dataset of events in Hive (40 TB). It's partitioned by date. When I open the dataset it displays "Preparing Results…" forever. Only a single coordinator node is running the job, and when I try to open the job plan I see {"errorMessage":"Something went wrong","moreInfo":"Failed to get query profile."}

Is it possible to work with such big datasets in Dremio? How can I import it?

From what you are saying, it sounds like the planning phase is not completing.
Could you quantify "forever", though? 60 sec, 10 min, other?
Do you have just a single node to handle this volume of data, or do you have a single coordinator and multiple executors?
Do you run "Preview" or an actual "Run"?

The cluster is a single master and 13 c+e (coordinator + executor) nodes. The executors and other coordinators did nothing. I waited for more than 10 minutes. It was a preview.
I noticed a full GC every few seconds, each lasting more than a second, on the master, along with disconnections from ZooKeeper.

It seems to me that the metadata of such a dataset is too large for planning and no partition pruning is done.

This problem is not limited to the Hive source. When the data volume is really large, Dremio cannot create the dataset at all.

I've just tried to access an S3 dataset of gzipped JSON files (60 TB).

After about 10 minutes, any attempt to preview or save the dataset ends with "Table '…' not found".

I tried a query that restricts the directories, as described in the documentation, and got the same result:
SELECT *
FROM …
WHERE
  dir0 = 'partition_date=2017-05-10'
  AND dir1 = 'partition_hour=12'

The following query does work, but that's not what I want:
SELECT *
FROM …"partition_date=2017-05-10"."partition_hour=12"

During execution of such a query, the coordinator tries to recursively list all files in this directory:

"25613b81-d1ba-017f-93cb-87652cbff000:foreman" #190 daemon prio=10 os_prio=0 tid=0x00007fa558bb9730 nid=0x6b44 runnable [0x00007fa528e1b000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:82)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1190)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4221)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4168)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4162)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:821)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listObjects(S3AFileSystem.java:918)
at org.apache.hadoop.fs.s3a.Listing$ObjectListingIterator.<init>(Listing.java:397)
at org.apache.hadoop.fs.s3a.Listing.createFileStatusListingIterator(Listing.java:72)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:1403)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:1369)
at com.dremio.plugins.s3.store.S3FileSystem.listStatus(S3FileSystem.java:318)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1536)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1579)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1263)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1265)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1265)
at com.dremio.exec.store.dfs.FileSystemWrapper.listRecursive(FileSystemWrapper.java:1233)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:241)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:235)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:224)
at com.dremio.exec.store.dfs.FileSystemPlugin.getDatasetWithFormat(FileSystemPlugin.java:331)
at com.dremio.exec.store.dfs.FileSystemPlugin.getDataset(FileSystemPlugin.java:310)
at com.dremio.exec.store.SimpleSchema.getTableFromSource(SimpleSchema.java:351)
at com.dremio.exec.store.SimpleSchema.getTableWithRegistry(SimpleSchema.java:284)
at com.dremio.exec.store.SimpleSchema.getTable(SimpleSchema.java:406)
at org.apache.calcite.jdbc.SimpleCalciteSchema.getImplicitTable(SimpleCalciteSchema.java:67)
at org.apache.calcite.jdbc.CalciteSchema.getTable(CalciteSchema.java:219)
at org.apache.calcite.prepare.CalciteCatalogReader.getTableFrom(CalciteCatalogReader.java:117)
at org.apache.calcite.prepare.CalciteCatalogReader.getTable(CalciteCatalogReader.java:106)
at org.apache.calcite.prepare.CalciteCatalogReader.getTable(CalciteCatalogReader.java:73)
at org.apache.calcite.sql.validate.EmptyScope.getTableNamespace(EmptyScope.java:71)
at org.apache.calcite.sql.validate.DelegatingScope.getTableNamespace(DelegatingScope.java:189)
at org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl(IdentifierNamespace.java:104)
at org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:910)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:891)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom(SqlValidatorImpl.java:2859)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom(SqlValidatorImpl.java:2844)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect(SqlValidatorImpl.java:3077)
at org.apache.calcite.sql.validate.SelectNamespace.validateImpl(SelectNamespace.java:60)
at org.apache.calcite.sql.validate.AbstractNamespace.validate(AbstractNamespace.java:84)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace(SqlValidatorImpl.java:910)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery(SqlValidatorImpl.java:891)
at org.apache.calcite.sql.SqlSelect.validate(SqlSelect.java:208)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression(SqlValidatorImpl.java:866)
at org.apache.calcite.sql.validate.SqlValidatorImpl.validate(SqlValidatorImpl.java:577)
at com.dremio.exec.planner.sql.SqlConverter.validate(SqlConverter.java:188)
at com.dremio.exec.planner.sql.handlers.PrelTransformer.validateNode(PrelTransformer.java:165)
at com.dremio.exec.planner.sql.handlers.PrelTransformer.validateAndConvert(PrelTransformer.java:153)
at com.dremio.exec.planner.sql.handlers.query.NormalHandler.getPlan(NormalHandler.java:43)
at com.dremio.exec.planner.sql.handlers.commands.HandlerToExec.plan(HandlerToExec.java:66)
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

In the case of a Hive table, the end of the stack trace is exactly the same. The problem seems to be in the following part of the stack (a standalone sketch of this listing pattern follows the excerpt):
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:1369)
at com.dremio.plugins.s3.store.S3FileSystem.listStatus(S3FileSystem.java:318)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1536)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1579)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1263)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1265)
at com.dremio.exec.store.dfs.FileSystemWrapper.populateRecursiveStatus(FileSystemWrapper.java:1265)
at com.dremio.exec.store.dfs.FileSystemWrapper.listRecursive(FileSystemWrapper.java:1233)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:241)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:235)
at com.dremio.exec.store.dfs.FileSelection.create(FileSelection.java:224)
at com.dremio.exec.store.dfs.FileSystemPlugin.getDatasetWithFormat(FileSystemPlugin.java:331)
at com.dremio.exec.store.dfs.FileSystemPlugin.getDataset(FileSystemPlugin.java:310)
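For reference, here is a minimal standalone sketch of that listing pattern using Hadoop's FileSystem API. This is not Dremio's actual code; the s3a path is a placeholder and it assumes hadoop-aws is on the classpath. Running something like this against the dataset root gives a rough idea of how long a single-threaded recursive listing of that many objects takes, which is what the foreman thread above appears to spend its time on during planning.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Standalone sketch of the listing pattern seen in the trace above:
// listStatus on a directory, then recurse into every child directory.
public class RecursiveListingTimer {

  static long countFiles(FileSystem fs, Path dir) throws Exception {
    long files = 0;
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDirectory()) {
        files += countFiles(fs, status.getPath()); // depth-first recursion
      } else {
        files++;
      }
    }
    return files;
  }

  public static void main(String[] args) throws Exception {
    Path root = new Path("s3a://my-bucket/events"); // placeholder dataset root
    FileSystem fs = FileSystem.get(root.toUri(), new Configuration());
    long start = System.currentTimeMillis();
    long files = countFiles(fs, root);
    System.out.printf("%d files listed in %d ms%n",
        files, System.currentTimeMillis() - start);
  }
}

Even outside Dremio, a layout like partition_date=…/partition_hour=… means at least one S3 LIST request per directory (more if a directory holds many objects), so the total listing time is driven by the number of directories and files, not by the selectivity of the query.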

Thank you @Alexander_Sterligov for the detailed report. We are aware of this issue and will keep you posted on the progress.

Hi, is there any update on this yet? I seem to be facing the same issue with massive datasets in MongoDB.

Same here with just a 2.5 GB file on Azure Data Lake… :frowning:

@dbrys and @djpirra, can you open separate threads? I don't think your issues are related to the underlying issue in this thread.

Thanks!