Dremio read data from HDFS


How do we find the size of a physical dataset and a virtual dataset in Dremio?

Right now, the way I find the data size is to run ‘SELECT * FROM dataset’ and check the input_bytes in the Dremio job.
But I found that the input_bytes shown in the Dremio job is very different from the size shown in HDFS. For example, 16GB of input_bytes in Dremio vs 6GB in HDFS.

Let me know if you know any way to check the size of a physical dataset and a virtual dataset in Dremio.

Thank you!

Hello @carol

Virtual datasets are more like views in a database, so they don’t really have a size in the way you’re thinking. For physical datasets, Dremio doesn’t keep size information in its metadata store, so your SELECT * method might be the “best” way to get that number, but you’d have to be sure to use Run instead of Preview, and in either case it’s an expensive way of getting that information. You’re best served by going to the source directly for the table/file size.
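If it helps, here is a minimal sketch of going to the source directly for an HDFS-backed dataset. It assumes the `hdfs` CLI is on your PATH and shells out to `hdfs dfs -du -s`. The function names and the example path `/data/events` are hypothetical, and the output layout varies by Hadoop version (some print only size and path, others also print the space consumed across replicas), so treat the parsing as an assumption to check against your cluster.

```python
import subprocess


def parse_du_summary(output: str) -> dict:
    """Parse the single summary line printed by `hdfs dfs -du -s <path>`.

    Assumed layouts (check against your Hadoop version):
      <size> <space_consumed_with_replication> <path>
      <size> <path>
    """
    line = output.strip()
    tokens = line.split(None, 2)
    if len(tokens) == 3 and tokens[1].isdigit():
        # Three-column layout: logical size, size across replicas, path.
        return {
            "logical_bytes": int(tokens[0]),
            "bytes_with_replication": int(tokens[1]),
            "path": tokens[2],
        }
    tokens = line.split(None, 1)
    if len(tokens) < 2:
        raise ValueError(f"unexpected du output: {output!r}")
    # Two-column layout: logical size, path.
    return {"logical_bytes": int(tokens[0]), "path": tokens[1]}


def hdfs_dataset_size(path: str) -> dict:
    """Run `hdfs dfs -du -s <path>` and return the parsed sizes in bytes."""
    completed = subprocess.run(
        ["hdfs", "dfs", "-du", "-s", path],
        capture_output=True,
        text=True,
        check=True,  # raise if the hdfs command fails
    )
    return parse_du_summary(completed.stdout)


# Example call (hypothetical path; needs a reachable HDFS cluster):
#   sizes = hdfs_dataset_size("/data/events")
#   print(sizes["logical_bytes"] / 1024**3, "GiB on HDFS")
```

The `logical_bytes` value is the one to compare against what the job reports, since the replicated figure counts every copy of each block.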

But now the problem is that the file size I get from ‘SELECT *’ is different from the size at the source.

When I run a ‘SELECT *’ job, why is the input_bytes much larger than the file size at the source?

Thank you.

Can you share the query profile for the “SELECT * FROM dataset” job that shows the 16 GB input?