'SQL order of execution

I wonder how this query is executing successfully. As we know 'having' clause execute before the select one then here how alias name used in 'select' statement working in having condition and not giving any error. Running sql queries in Azure DataBricks



Solution 1:[1]

Spark SQL engine is obviously different than the normal SQL engine because it is a distributed SQL engine. The normal SQL order of execution does not applied here because when you execute a query via Spark SQL, the engine converts it into optimized DAG before it is distributed across your worker nodes. The worker nodes then do map, shuffle, and reduce tasks before the result is aggregated and returned to the driver node. Read more about Spark DAG here.

Therefore, there are more than just one selecting, filtering, aggregation happening before it returns any result. You can see it yourself by clicking on Spark job view on the Databricks query result panel and then select Associated SQL Query.

So, when it comes to Spark SQL, I recommend we refer to Spark document which clearly indicates that Having clause can refer to aggregation function by its alias.

Solution 2:[2]

As we know 'having' clause execute before the select one

This affirmation is wrong. The HAVING clause is used to apply filters in aggregation functions (such as SUM, AVG, COUNT, MIN and MAX). Since they need to be calculated BEFORE applying any filter, in fact, the SELECT statement is done when the HAVING clause start to be processed.


Even if the previous paragraph was not true, it is important to consider that SQL statements are interpreted as a whole before any processing. Due to this, it doesn't really matter the order of the instructions: the interpreter can link all references so they make sense in runtime.

So it would be perfectly feasible to put the HAVING clause before the SELECT or in any part of the instruction, because this is just a syntax decision. Currently, HAVING clause is after GROUP BY clause because someone decided that this syntax makes more sense in SQL.

Finally, you should consider that allowing you to reference something by an alias is much more a language feature than a rational on how the instruction is processed.

Solution 3:[3]

the order of exution is

  1. Getting Data (From, Join)
  2. Row Filter (Where)
  3. Grouping (Group by)
  4. Group Filter (Having)
  5. Return Expressions (Select)
  6. Order & Paging (Order by & Limit / Offset)

I still don't get, why you are asking about, syntactially your seelect qiery is correct, but if it the correct result we can not know

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 nbk