'reading and converting SQL QUERY to Azure Databricks/Spark python

I am using azure databricks and I have the following sql query that I would like to convert into a spark python code:

SELECT DISTINCT 
        personID,
        SUM(quantity) as total_shipped
FROM(
SELECT  p.personID,
        p.systemID,
        s.quantity
FROM shipped s
LEFT JOIN ordered p

on (s.OrderId = p.OrderNumber OR
substr(s.OrderId,1,6) = p.OrderNumber )

and p.ndcnum = s.ndc

where s.Dateshipped <= "2022-04-07"
AND personID is not null

group by personID

I intend to merge the spark dataframes first, then perform the aggregated sum. However, I think I am making it more complicated than it is. So far, this is what I have but I am getting InvalidSyntax error:

ordered.join(shipped, ((ordered("OrderId").or(ordered.select(substring(ordered.OrderId, 1, 6)))) === ordered("ORDERNUMBER")) &&
    (ordered("ndcnumber") === ordered("ndc")),"left")
      .show()

The part I am getting confused is on the OR statement from the SQL query, how do I convert that into a spark python statement?



Solution 1:[1]

There is beauty in using databricks. you can directly use the same code by calling spark.sql(""" {your sql query here} """) and you will still get the same results. You can assign it to a variable and you will have a dataframe.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dennis David