'Azure Databricks - Write to parquet file using spark.sql with union and subqueries
Issue:
I'm trying to write to parquet file using spark.sql, however I encounter issues when having unions or subqueries. I know there's some syntax I can't seem to figure out.
Ex.
%python
df = spark.sql("SELECT
sha2(Code, 256) as COUNTRY_SK,
Code as COUNTRY_CODE,
Name as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM raw.EXTR_COUNTRY)
UNION ALL
SELECT
-1 as COUNTRY_SK,
'Unknown' as COUNTRY_CODE,
'Unknown' as COUNTRY_NAME,
current_date() as EXTRACT_DATE")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country",
mode="overwrite")
WHEN doing a simple query I have no issues at all, such as:
%python
df = spark.sql("select * from raw.EXTR_COUNTRY")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country/",
mode="overwrite")
Solution 1:[1]
The quotes solved the issue, the sql-script itself wasn't the issue. So using tripple quotes (""" or ''') solved the issue.
%python
df = spark.sql("""SELECT
sha2(Code, 256) as COUNTRY_SK,
Code as COUNTRY_CODE,
Name as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM raw.EXTR_COUNTRY)
UNION ALL
SELECT
-1 as COUNTRY_SK,
'Unknown' as COUNTRY_CODE,
'Unknown' as COUNTRY_NAME,
current_date() as EXTRACT_DATE""")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country",
mode="overwrite")
Solution 2:[2]
There are few problems with your code that needs to be fixed:
- you're using single quotes (
"
) for multi-line string. Instead you need to use tripple quotes ("""
or'''
) - your SQL syntax is incorrect for second part of the query (after
union all
) - you didn't specifyFROM
which table you need to pull that data. See docs for details of the SQL syntax.
I really recommend to debug each subquery separately, maybe first using the %sql
, and only after it works, put it into the spark.sql
string.
Also, because you're overwriting the data, it could be easier to use create or replace table
syntax to perform everything in SQL (docs), something like this:
create or replace table delta.`/mnt/devstorage/landing/companyx/country/` AS (
SELECT
sha2(Code, 256) as COUNTRY_SK,
Code as COUNTRY_CODE,
Name as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM raw.EXTR_COUNTRY)
UNION ALL
SELECT
-1 as COUNTRY_SK,
'Unknown' as COUNTRY_CODE,
'Unknown' as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM ....
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | user3052850 |
Solution 2 | Alex Ott |