A dataset belongs to one and only one space. Users belong to multiple spaces. If you want a dataset to be available in multiple spaces you can do that with virtual datasets.
We are working on this and should have it available shortly via our API. Will let you know.
This is also something we are looking to add, via a UDF. If you have examples of REST services we should consider when designing this feature, that would be great to hear.
I still think that it is useful for a user to be able to import or share a virtual dataset with some other user (this is just my point of view, of course).
For example, we have datasets containing entities to reconcile (as in OpenRefine), so we would ask that service whether it can give us some info about an entity with some attributes (each attribute is mapped to a column). Does that make sense?
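To make this concrete, a reconciliation request in the OpenRefine style might look like the following sketch; the endpoint URL, property IDs, and values are placeholders, not a real service:

```python
import json
import requests

# Hypothetical OpenRefine-style reconciliation endpoint (placeholder URL).
RECON_URL = "https://example.org/reconcile"

# One reconciliation query per row: the main cell value plus the other
# columns mapped to service properties (property IDs are placeholders).
queries = {
    "q0": {
        "query": "ACME S.p.A.",
        "properties": [
            {"pid": "country", "v": "IT"},
            {"pid": "vat_number", "v": "IT01234567890"},
        ],
    }
}

# The OpenRefine protocol sends the queries as a form field containing JSON.
response = requests.post(RECON_URL, data={"queries": json.dumps(queries)})
response.raise_for_status()

# Each result lists candidate entities with ids, names, and match scores.
for key, result in response.json().items():
    for candidate in result.get("result", []):
        print(key, candidate["id"], candidate["name"], candidate.get("score"))
```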
“Reflection” is a concept that is not entirely clear to me… is it a term indicating that the query optimizer chooses the sampling technique?
In the Enterprise Edition users can share virtual datasets with other users. You can click the config button (it looks like a gear) and then navigate to the sharing options.
Data Reflections accelerate your analytics with physically optimized representations of your source data. You can always push queries down into the source, but Data Reflections give the query planner an alternative representation of the data that is optimized for different query patterns: sorting, aggregation, columnarization, compression, filtering, etc. They are a little like materialized views, but invisible to end users (no need to change your SQL). You can read more about them here: http://docs.dremio.com/acceleration/
Thanks for the details on OpenRefine - we’ll take a look!
You’re right that reflections are like a materialized view. They are actually persisted as Parquet, then read into Arrow in-memory buffers during query execution.
I think the magic is that the query planner can automatically rewrite user queries to take advantage of reflections and speed up queries. In this sense they are like an index: you don’t change your query or the schema you connect to, and you can use the same query whether you have a reflection or not. When it is more cost-effective, the cost-based optimizer will substitute a reflection for the underlying data source.
And reflections can be used together to answer queries they were not necessarily designed for (e.g., queries on derived virtual datasets).
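To make that concrete, here is a minimal sketch with pyodbc and the Dremio ODBC driver (the DSN name, credentials, and dataset/column names are placeholders); the SQL is exactly what you would write if no reflections existed:

```python
import pyodbc

# Placeholder DSN and credentials; assumes the Dremio ODBC driver is installed
# and a DSN named "Dremio" points at your coordinator.
conn = pyodbc.connect("DSN=Dremio;UID=your_user;PWD=your_password", autocommit=True)
cursor = conn.cursor()

# The same SQL runs with or without reflections on the dataset;
# the cost-based optimizer decides whether a reflection is used.
cursor.execute("""
    SELECT region, SUM(amount) AS total_amount
    FROM some_space.some_dataset
    GROUP BY region
""")

for region, total_amount in cursor.fetchall():
    print(region, total_amount)

conn.close()
```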
Great…Dremio sounds very exciting!
Any possibility that the path of the produced materialized view (Parquet) could be retrieved via the API you’re currently developing? For example, I create a dataset in one of the user spaces and apply some data curation/transformation. I’d need an API to tell Dremio to refresh/update the materialized Parquet files (because I can upsert data via Apache Phoenix) and, once the job is finished (so I should also get the job’s ID), to retrieve the materialized dataset’s location. Do you think this could be possible with the new APIs?
Today you cannot access reflections directly - they are resources that Dremio manages internally.
However, maybe what you’re trying to do is already possible today. In your scenario, what is the source? Let’s say it is HBase.
1. Connect to HBase via Dremio.
2. Create a virtual dataset called “my_dataset” that returns a subset of your HBase records, applying some transformation logic. Save it in a space called “my_space” (a SQL sketch of this step follows the list).
3. Now you can run SELECT * FROM my_space.my_dataset to get your results. You can run any query you like, including joins to other physical or virtual datasets.
4. You want these queries to be faster, so you create one or more reflections on my_space.my_dataset.
5. Now SELECT * FROM my_space.my_dataset comes back much faster.
6. You update your data in HBase via Phoenix or some other process.
7. Dremio updates the reflections for you based on the SLA you configure for my_space.my_dataset; the SLA determines how “fresh” the data in a reflection needs to be, and you set it in the dataset’s settings.

Subsequent queries on my_space.my_dataset will see results with data as fresh as your SLA specifies. Dremio manages this process for you in the background.
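For step 2, here is a sketch of doing the same thing with SQL instead of the UI, again over pyodbc. The DSN, source name “hbase”, table, and column names are placeholders, and whether the CREATE VDS statement is available depends on your Dremio version; saving the dataset from the UI achieves the same result.

```python
import pyodbc

# Placeholder DSN/credentials and placeholder source/table/column names.
conn = pyodbc.connect("DSN=Dremio;UID=your_user;PWD=your_password", autocommit=True)
cursor = conn.cursor()

# Save curated/transformed results from the HBase source as a virtual dataset
# in the space "my_space". (If your version lacks CREATE VDS, use "Save As"
# in the Dremio UI instead.)
cursor.execute("""
    CREATE VDS my_space.my_dataset AS
    SELECT customer_id,
           UPPER(customer_name) AS customer_name,
           CAST(score AS INTEGER) AS score
    FROM hbase.customers
    WHERE score IS NOT NULL
""")

conn.close()
```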
Does this seem to meet your needs? If not, tell me more so we understand what you have in mind.
I think I understood correctly… what I’d like to achieve, in the end, is the ability to scan the produced dataset via Spark or Flink, avoiding a JDBC layer. Is that possible?
Ah, yes. Today you must pass through JDBC from Spark. A parallel API that goes against our Arrow in-memory buffers is coming soon. There’s nothing to stop you from accessing the Parquet files directly, but we do not currently publish a catalog API that would tell you which files correspond to each dataset.
I think you will find that going over JDBC isn’t so bad right now, and that the situation will only improve once we release the parallel API.
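In the meantime, reading a Dremio dataset from Spark over JDBC looks roughly like this sketch (PySpark). The host, port, credentials, and jar path are placeholders, and the driver class and URL format should be checked against the Dremio JDBC driver you download:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dremio-jdbc-read")
    # Placeholder path to the Dremio JDBC driver jar downloaded locally.
    .config("spark.jars", "/path/to/dremio-jdbc-driver.jar")
    .getOrCreate()
)

# Host, port, user, and password are placeholders; 31010 is Dremio's default
# client port, but check your deployment and driver documentation.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:dremio:direct=dremio-coordinator:31010")
    .option("driver", "com.dremio.jdbc.Driver")
    .option("user", "your_user")
    .option("password", "your_password")
    # The query is pushed to Dremio, so reflection substitution happens there.
    .option("dbtable", "my_space.my_dataset")
    .load()
)

df.show(5)
```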
Any news on an API that would allow accessing the Arrow buffers directly? I’m really interested in being able to access data that way, without the cost of JDBC. Thanks!
Hey @Mark, we don’t have an exact timeline just yet, but this is something we also consider to be important. I’ll follow up with more information on timing in this thread.