Currently this is not in the roadmap. The only way currently is to follow release notes, let me check internally if there is a plan to expose roadmap and enhancement tickets to the community
Greetings. Googling on the error dremio 22.1 handed back after my ETL tool tried to create a prepared statement brought me here. What is the proper way to diagnose the issue?
For context, this is a Junk Dimension transformation step in Apache Hop 2.1 which should insert new records if a match on all fields is not found. It has an option to lookup on a hash field instead of just comparing across fields. The transform is generating prepared SQL statements.
For example:
SELECT DUDD_SK
FROM "Empower.data-staging".Drug_Utilization_By_Drug_per_Date
WHERE row_fingerprint = ?
AND ( ( Filled_Date = ? ) OR ( Filled_Date IS NULL AND ? IS NULL ) )
AND ( ( Deactivation_Date = ? ) OR ( Deactivation_Date IS NULL AND ? IS NULL ) )
AND ( ( Drug_Name = ? ) OR ( Drug_Name IS NULL AND ? IS NULL ) )
AND ( ( Drug_Form = ? ) OR ( Drug_Form IS NULL AND ? IS NULL ) )
AND ( ( Drug_Strength = ? ) OR ( Drug_Strength IS NULL AND ? IS NULL ) )
AND ( ( Drug_ID = ? ) OR ( Drug_ID IS NULL AND ? IS NULL ) )
AND ( ( PMS_ID = ? ) OR ( PMS_ID IS NULL AND ? IS NULL ) )
AND ( ( Qty = ? ) OR ( Qty IS NULL AND ? IS NULL ) )
AND ( ( Number_of_Prescriptions = ? ) OR ( Number_of_Prescriptions IS NULL AND ? IS NULL ) )
AND ( ( Total_Price = ? ) OR ( Total_Price IS NULL AND ? IS NULL ) )
AND ( ( Compound = ? ) OR ( Compound IS NULL AND ? IS NULL ) )
@Brandon_Jackson I am able to reproduce the error with a similar SQL, What is the expected output from this SQL?
SELECT DUDD_SK
FROM "Empower.data-staging".Drug_Utilization_By_Drug_per_Date
WHERE row_fingerprint = ?
AND ( ( Filled_Date = ? ) OR ( Filled_Date IS NULL AND ? IS NULL ) )
AND ( ( Deactivation_Date = ? ) OR ( Deactivation_Date IS NULL AND ? IS NULL ) )
AND ( ( Drug_Name = ? ) OR ( Drug_Name IS NULL AND ? IS NULL ) )
AND ( ( Drug_Form = ? ) OR ( Drug_Form IS NULL AND ? IS NULL ) )
AND ( ( Drug_Strength = ? ) OR ( Drug_Strength IS NULL AND ? IS NULL ) )
AND ( ( Drug_ID = ? ) OR ( Drug_ID IS NULL AND ? IS NULL ) )
AND ( ( PMS_ID = ? ) OR ( PMS_ID IS NULL AND ? IS NULL ) )
AND ( ( Qty = ? ) OR ( Qty IS NULL AND ? IS NULL ) )
AND ( ( Number_of_Prescriptions = ? ) OR ( Number_of_Prescriptions IS NULL AND ? IS NULL ) )
AND ( ( Total_Price = ? ) OR ( Total_Price IS NULL AND ? IS NULL ) )
AND ( ( Compound = ? ) OR ( Compound IS NULL AND ? IS NULL ) )
A single integer value would be returned that represents a row that could be updated or an empty result which would indicate the need for a newly inserted row.
@Brandon_Jackson Sorry for the dumb question, you are giving something like total_price=?, is total_price a number and equating to a ‘?’ should yield what?
Hi Balaji, No dumb questions at all. It is a specific price inserted replacing the question mark. First a bit of background. I am using a Java based ETL tool called Apache Hop. All Java based tools used JDBC drivers. When dealing with any data manipulation, like inserting, updating, and even selecting, java applications will craft prepared statements that include parameters like filter conditions. You end up with SQL queries that look like mine with a bunch of question marks, each one takes on a specific value per executed statement. Prepared statements exist in the Windows world, on MS SQL and every other database I know of. The particular SQL query in question asks the database if the destination table has precisely each value for each column listed there for a given fingerprinted row (hashcoded aka all columns values asking like a huge password resulting in a unique code identifying the row).
What I found after browsing the Dremio Java code and experimentation is that Dremio only supports sending a fully filled out query with concrete values as text, no matter what kind of query it is. If you use a Java based tool that programmatically tries to interact with Dremio and separate out the parameter values from the query and communicate over a query pattern for Dremio to chew on, that fails. That is what was meant by “Dynamic Parameters” or “Parameters” generally. That means that Dremio cannot be treated like a true database yet compared to all the others like Trino/Presto/Starburst, MySQL, Postgres, MS SQL. Dremio only processes a query without any variability at all. I venture to say only one statement at a time.
I wish Dremio supported prepared statements with parameters so it would be easier to use a wider variety of tools to interact with data it could query. I found this disappointing.
I imagine the best help you could provide is to communicate the desire to use prepared statements with parameters to the development team. I can see they are using Calcite to process queries and that it should provide a good JDBC driver, so it is really a feature implementation beyond Calcite itself in whatever the engine is that Dremio has.
Does Dremio, the company, publish features they are working on anywhere?
Is the forum here the only way to communicate back with the company in a meaningful way?
Hey @Brandon_Jackson ! I hope you are well. Dremio doesn’t support parameterized SQL statements yet. For large complex queries on huge datasets, it really doesn’t make sense to use parameterized statements as you need the actual values to read partition stats and estimate row counts for join ordering and other query optimizations. These stats are also needed to determine degrees of parallelism for various phases of execution. A plan for one set of values could be totally different than a plan for a different set of values. Reflections can be used to accelerate queries and reflections need to consider actual filter values when trying to match.
Parameterized SQLs make more sense in the context of simple SQL statements that are executed 10-100s of times like small update/insert/delete statements. Dremio just released DML support for Apache Iceberg in June 2022: Announcing DML Support for Apache Iceberg | Dremio However, issuing 1,000 small insert statements is going to result in a ton of small Parquet files and table maintenance will become a real concern: Maintenance
Sorry, this statement doesn’t make a lot of sense. One use case for parameterized queries is repeated inserts, sure, and I’m not sure that has to be supported, here. However, one of the biggest use cases for parameterized queries is to prevent SQL injection on selects. Parameters, in this case, are so useful for large, complex queries on huge datasets. Brandon has provided ample examples. This is extremely standard stuff in dealing with data.
Having users sanitize their own input leaves the Dremio connection methods very open to SQL injection attacks. Most database engines will not actually create a named plan for the query unless it’s run repeatedly in a short amount of time, so just supporting parameterization without fully preparing the statements on the backend would be a huge boon to the security of Dremio drivers.
It is very disheartening to see the flippant attitude of Dremio toward basic SQL injection prevention.