Azure Databricks - Write to parquet file using spark.sql with union and subqueries

Issue:

I'm trying to write to a parquet file using spark.sql, but I run into issues when the query contains unions or subqueries. There must be some syntax I can't seem to figure out.

Example:

%python
    
df = spark.sql("SELECT
          sha2(Code, 256) as COUNTRY_SK,
          Code as COUNTRY_CODE,
          Name as COUNTRY_NAME,
          current_date() as EXTRACT_DATE
       
        FROM raw.EXTR_COUNTRY)
       
    UNION ALL
    
    SELECT
          -1 as COUNTRY_SK,
          'Unknown' as COUNTRY_CODE,
          'Unknown' as COUNTRY_NAME,
          current_date() as EXTRACT_DATE")
          
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country", 
   mode="overwrite")

When running a simple query I have no issues at all, for example:

%python
    
df = spark.sql("select * from raw.EXTR_COUNTRY")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country/", 
   mode="overwrite")


Solution 1:[1]

The quoting solved the issue; the SQL script itself wasn't the problem. Using triple quotes (""" or ''') fixed it.

%python

df = spark.sql("""SELECT
          sha2(Code, 256) as COUNTRY_SK,
          Code as COUNTRY_CODE,
          Name as COUNTRY_NAME,
          current_date() as EXTRACT_DATE
    FROM raw.EXTR_COUNTRY

    UNION ALL

    SELECT
          -1 as COUNTRY_SK,
          'Unknown' as COUNTRY_CODE,
          'Unknown' as COUNTRY_NAME,
          current_date() as EXTRACT_DATE""")

df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country",
    mode="overwrite")
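
As a quick sanity check, the parquet output can be read straight back; a small sketch reusing the path from the snippet above:

%python

# Read the parquet files written above and inspect a few rows
df_check = spark.read.parquet("dbfs:/mnt/devstorage/landing/companyx/country")
df_check.show(5)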

Solution 2:[2]

There are a few problems with your code that need to be fixed:

  • you're using regular quotes (") for a multi-line string; instead you need to use triple quotes (""" or ''')
  • the SQL syntax of the second part of the query (after UNION ALL) is incorrect: you didn't specify a FROM clause saying which table that data should be pulled from. See the docs for details of the SQL syntax.

I really recommend debugging each subquery separately, maybe first using a %sql cell, and only after it works, putting it into the spark.sql string.
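
For example, each branch of the UNION ALL can be tried in its own %sql cell before being combined; a minimal sketch for the first branch (the LIMIT is only there to keep the debug output small):

%sql

-- Debug the first branch of the UNION ALL on its own
SELECT
      sha2(Code, 256) as COUNTRY_SK,
      Code as COUNTRY_CODE,
      Name as COUNTRY_NAME,
      current_date() as EXTRACT_DATE
FROM raw.EXTR_COUNTRY
LIMIT 10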

Also, because you're overwriting the data, it could be easier to use the CREATE OR REPLACE TABLE syntax to perform everything in SQL (docs), something like this:

create or replace table delta.`/mnt/devstorage/landing/companyx/country/` AS (
    SELECT
          sha2(Code, 256) as COUNTRY_SK,
          Code as COUNTRY_CODE,
          Name as COUNTRY_NAME,
          current_date() as EXTRACT_DATE
    FROM raw.EXTR_COUNTRY

    UNION ALL

    SELECT
          -1 as COUNTRY_SK,
          'Unknown' as COUNTRY_CODE,
          'Unknown' as COUNTRY_NAME,
          current_date() as EXTRACT_DATE
    FROM ....
)
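
Note that this creates a Delta table at that location rather than the plain parquet files produced by df.write.parquet. A short sketch of reading it back from Python, assuming the standard Delta reader available in Databricks:

%python

# Load the Delta table created above by its storage path and inspect a few rows
df = spark.read.format("delta").load("/mnt/devstorage/landing/companyx/country/")
df.show(5)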

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: user3052850
Solution 2: Alex Ott