Because I was restricted to uploading files of 500 MB or less, I needed to split my dataset into 25 files. Now that they’re in Dremio, I’d like to combine all the tables into a single table. According to this answer, we can do this with a simple SQL query that combines two tables.
I’d like to do this in a single SQL query if possible, or at least not something that needs to spell out each of the 25 tables, i.e. not something ugly like this:
select * from
  (select * from
    (select * from table1
     union all
     select * from table2)
   union all
   select * from table3)
union all
select * from table4
...
Is it possible to do something like “select the union of all rows from all tables named like X” without resorting to something convoluted like the above? For example, if I know the naming/numbering scheme of the directory structure, then in PySpark I could read all Parquet files under multiple directories like this:
# build the list of 25 directory paths and read them all into one DataFrame
path_prefix = "tables/directory_"
dir_paths = [f"{path_prefix}{i}" for i in range(25)]
df = spark.read.parquet(*dir_paths)
Can you do something like this in a single/simple SQL query? I’ve seen another user reference using “UNION ALL” here, but with no example – is this a possible solution?
To clarify here’s what I’d like to do:
Let’s say I have three tables with two columns each, i.e. the same schema for all three tables. I want to combine them all together into a single table (named table_combined below):
+------------------+
| table_1          |
+----------+-------+
| name     | age   |
+----------+-------+
| Jane     | 31    |
+----------+-------+

+------------------+
| table_2          |
+----------+-------+
| name     | age   |
+----------+-------+
| Mark     | 37    |
+----------+-------+

+------------------+
| table_3          |
+----------+-------+
| name     | age   |
+----------+-------+
| Vera     | 38    |
+----------+-------+

+------------------+
| table_combined   |
+----------+-------+
| name     | age   |
+----------+-------+
| Jane     | 31    |
| Mark     | 37    |
| Vera     | 38    |
+----------+-------+
Thanks in advance for any comments or suggestions!
Maybe this is as simple as just stacking the unions? The query gets a bit long after a few tables, but it isn’t nested or complicated at all. Something like this:
CREATE VDS "workspace"."unioned_all_tables" AS
SELECT *
FROM table1
UNION ALL
SELECT *
FROM table2
UNION ALL
SELECT *
FROM table3
-- ... UNION ALL for the remaining tables
Then for performance considerations, I will create a reflection:
ALTER DATASET "workspace"."unioned_all_tables"
CREATE EXTERNAL REFLECTION "reflected_unioned_all_tables"
USING "workspace"."unioned_all_tables"
Then I can create a table for export using CTAS:
CREATE TABLE "data_lake_source"."root_dir".sub_dir.table_name
AS
SELECT *
FROM "reflected_unioned_all_tables"
Does this seem a reasonable way to do this (querying against the reflection)? (I know this is a bit of an extension of the original question.)
I ended up using this SQL (with different workspace and table names):
CREATE VDS "@james".combined_table
AS
SELECT *
FROM "@james"."table_1"
UNION ALL
SELECT *
FROM "@james"."table_2"
UNION ALL
SELECT *
FROM "@james"."table_3"
-- ... UNION ALL for the remaining tables
Next I created the table directly from this new, combined VDS; no reflection was required, and it was plenty fast. Oddly enough, the CTAS query on the combined dataset with ~125M rows took about the same time as the earlier CTAS queries I ran on the smaller ~5M-row datasets.
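As an aside, since the uploaded tables follow a simple numbering scheme, the statement itself doesn’t have to be typed out by hand; a short script can generate it. This is just a sketch, and the workspace and table names below are placeholders matching my example:

# Placeholder names: adjust the workspace ("@james") and table prefix to match your setup.
tables = [f'"@james"."table_{i}"' for i in range(1, 26)]
selects = "\nUNION ALL\n".join(f"SELECT *\nFROM {t}" for t in tables)
query = f'CREATE VDS "@james".combined_table AS\n{selects}'
print(query)  # paste into the Dremio SQL editor (or submit via the SQL REST API)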
Hi,
Maybe I completely missed the point, but why not upload the files to a folder and then promote the folder to a dataset? For example, upload them to a folder “combined_table” and then just click the “Format Folder” button, or use the SQL API to do so?
Yes, agreed, I’d rather do it like that. Unfortunately, where I work we are just getting started with Dremio and all the kinks aren’t worked out yet. For example, I don’t have an easy way to get files onto the S3 buckets used as S3 data sources, and I can’t use a NAS since the server isn’t running locally. The best I can do is upload files one at a time via the web console; I wish it allowed multiple files or an entire folder to be uploaded at once. However, I’ve been told that we will soon use S3 replication to get files into the S3 source, and then I can more easily promote an entire directory of files using the REST API.
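For future reference, here’s a rough sketch of what that promotion might look like against the catalog REST API. The coordinator URL, auth token, source name, and folder name below are all placeholders, and the exact request shape may differ by Dremio version:

import json
from urllib.parse import quote
import requests

DREMIO_URL = "http://localhost:9047"          # placeholder coordinator URL
HEADERS = {
    "Authorization": "_dremio<auth-token>",   # placeholder token from the login endpoint
    "Content-Type": "application/json",
}

# Look up the folder's catalog entry by path (placeholder source/folder names).
folder = requests.get(
    f"{DREMIO_URL}/api/v3/catalog/by-path/data_lake_source/combined_table",
    headers=HEADERS,
).json()

# Promote ("format") the folder so Dremio exposes it as a single physical dataset.
promotion = {
    "entityType": "dataset",
    "id": folder["id"],
    "path": folder["path"],
    "type": "PHYSICAL_DATASET",
    "format": {"type": "Parquet"},
}
resp = requests.post(
    f"{DREMIO_URL}/api/v3/catalog/{quote(folder['id'], safe='')}",
    headers=HEADERS,
    data=json.dumps(promotion),
)
resp.raise_for_status()
print("Promoted:", resp.json()["path"])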
You make a very interesting point: if I have all these files in the same folder and they all have the same schema, then I can just format the folder and it will create a single dataset out of the whole tranche? If so, then this accomplishes the task much more easily than the “union all” query I resorted to in my final answer above.
I appreciate your feedback, thanks for the idea, Tim!
Great point by @tid. Just to reconfirm: promoting applies only to a PDS, so if you upload files individually then UNION ALL is the route, whereas if you can get the files onto NAS/S3/ADLS/HDFS you can promote the folder instead.