'reading and converting SQL QUERY to Azure Databricks/Spark python
I am using azure databricks and I have the following sql query that I would like to convert into a spark python code:
SELECT DISTINCT
personID,
SUM(quantity) as total_shipped
FROM(
SELECT p.personID,
p.systemID,
s.quantity
FROM shipped s
LEFT JOIN ordered p
on (s.OrderId = p.OrderNumber OR
substr(s.OrderId,1,6) = p.OrderNumber )
and p.ndcnum = s.ndc
where s.Dateshipped <= "2022-04-07"
AND personID is not null
group by personID
I intend to merge the spark dataframes first, then perform the aggregated sum. However, I think I am making it more complicated than it is. So far, this is what I have but I am getting InvalidSyntax error:
ordered.join(shipped, ((ordered("OrderId").or(ordered.select(substring(ordered.OrderId, 1, 6)))) === ordered("ORDERNUMBER")) &&
(ordered("ndcnumber") === ordered("ndc")),"left")
.show()
The part I am getting confused is on the OR statement from the SQL query, how do I convert that into a spark python statement?
Solution 1:[1]
There is beauty in using databricks. you can directly use the same code by calling spark.sql(""" {your sql query here} """) and you will still get the same results. You can assign it to a variable and you will have a dataframe.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dennis David |
