How to combine multiple tables (same schema for all) into a single table with a single query?

Because I was restricted to uploading files of 500 MB or less, I needed to split my dataset into 25 files. Now that the data is in Dremio, I'd like to combine all the tables into a single table. According to this answer, two tables can be combined with a simple SQL query.

I’d like to do this in a single SQL query if possible, or at least not something that needs to spell out each of the 25 tables, i.e. not something ugly like this:

select * from
(select * from
  (select * from table1
   union all
   select * from table2)
 union all
 select * from table3)
union all
select * from table4
...

Is it possible to do something like “select the union of all rows from all tables named like X” without resorting to something convoluted like the above? For example, if I know the naming/numbering scheme of the directory structure, then in PySpark I could read all Parquet files under multiple directories like this:

path_prefix = "tables/directory_"
dir_paths = [ f"{path_prefix}{i}" for i in range(25) ]
df = spark.read.parquet(*dir_paths)

Can you do something like this in a single/simple SQL query? I've seen another user mention using "UNION ALL" here, but with no example. Is this a possible solution?

To clarify, here's what I'd like to do:

Let's say I have three tables with two columns each, i.e. the same schema for all three. I want to combine them all into a single table (named table_combined below):

+------------------+
| table_1          |
+----------+-------+
| name     |  age  |
+----------+-------+
| Jane     |  31   | 
+----------+-------+

+------------------+
| table_2          |
+----------+-------+
| name     |  age  |
+----------+-------+
| Mark     |  37   | 
+----------+-------+

+------------------+
| table_3          |
+----------+-------+
| name     |  age  |
+----------+-------+
| Vera     |  38   | 
+----------+-------+

+------------------+
| table_combined   |
+----------+-------+
| name     |  age  |
+----------+-------+
| Jane     |  31   | 
| Mark     |  37   | 
| Vera     |  38   | 
+----------+-------+

Thanks in advance for any comments or suggestions!

Maybe this is as simple as just stacking the unions? The query gets a bit long after a few tables, but it's not nested or complicated at all. Something like this:

CREATE VDS "workspace"."unioned_all_tables"
AS
SELECT *
FROM table1
UNION ALL
SELECT *
FROM table2
UNION ALL
SELECT *
FROM table3
UNION ALL
...

Then for performance considerations, I will create a reflection:

ALTER DATASET "workspace"."unioned_all_tables"
CREATE EXTERNAL REFLECTION "reflected_unioned_all_tables"
USING "workspace"."unioned_all_tables"

Then I can create a table for export using CTAS:

CREATE TABLE "data_lake_source"."root_dir".sub_dir.table_name
AS
SELECT *
FROM  "reflected_unioned_all_tables"

Does this seem a reasonable way to do this (querying against the reflection)? (I know this is a bit of an extension of the original question.)

I ended up using this SQL (with different workspace and table names):

CREATE VDS "@james".combined_table
AS 
    SELECT *
    FROM "@james"."table_1"
    UNION ALL 
    SELECT *
    FROM "@james"."table_2"
    UNION ALL 
    SELECT *
    FROM "@james"."table_3"
    UNION ALL
    ...

Next I created the table directly from this new, combined VDS, no reflection required; it was plenty fast enough. Oddly enough, the CTAS query on the combined dataset with ~125M rows took the same time as the previous CTAS queries I had been running for the smaller datasets with ~5M rows.
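
For reference, the export was just the CTAS pattern from earlier pointed at the combined VDS. A rough sketch (the target path and table name below are placeholders, not the real ones I used):

-- target path and table name are illustrative placeholders
CREATE TABLE "data_lake_source"."root_dir".sub_dir.combined_table_export
AS
SELECT *
FROM "@james".combined_table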

Hi,
Maybe I completely missed the point, but why don't you upload the files to a folder and then promote the folder to a dataset? For example, upload them to a folder called "combined_table" and then just click the "Format folder" button, or use the SQL API to do so.

Best, Tim

Yes, agreed, I'd rather do it that way. Unfortunately, where I work we're just getting started with Dremio and all the kinks aren't worked out yet. For example, I don't have an easy way to get files onto the S3 buckets backing the S3 data sources, and I can't use a NAS since the server isn't running locally. The best I can do is upload files one at a time via the web console; I wish multiple files or an entire folder could be uploaded at once. However, I've been told that we will soon use S3 replication to get files into the S3 source, and then I'll be able to promote an entire directory of files more easily using the REST API.

You make a very interesting point – if I have all these files in the same folder and they all have the same schema, then I can just format the folder and it will create a single VDS out of the whole tranche? If so, this accomplishes the task much more easily than the "union all" query I resorted to in my answer above.
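
If that's the case, I'd expect the whole UNION ALL step to go away: the formatted folder would show up as one dataset that I could query or CTAS directly. A sketch of what I have in mind (the source name "s3_source" is just a placeholder for illustration; "combined_table" is the folder you suggested):

-- "s3_source" is a placeholder for the S3 source; "combined_table" is the promoted folder
SELECT *
FROM "s3_source"."combined_table"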

I appreciate your feedback, thanks for the idea, Tim!

@monocongo

Great point by @tid. Just to reconfirm: promoting a dataset only applies to a PDS, so if you upload files individually then UNION ALL is the route; alternatively, you can upload to NAS/S3/ADLS/HDFS and promote the folder.

Thanks
Bali