What is the accepted syntax for PySpark's SQL expression-based filters?
The PySpark documentation for filters says that it accepts "a string of SQL expression".
Is there a reference for the accepted syntax of this parameter? The best I could find is the page about the WHERE clause in the Spark SQL docs. Obviously some examples, like "id > 200", "length(name) > 3", or "id BETWEEN 200 AND 300", would work. But what about others? Filters like "age > (SELECT 42)" seem to work, so I assume nested expressions are OK. This just raises more questions:
- What databases can these nested expressions refer to? Is there a way I can create a nested SELECT expression referring to the current dataframe, e.g. to do something like "age > (SELECT avg(age) FROM <current_dataframe>)" as a filter? (I know there are other ways of achieving this; I am only interested in what SQL expressions can do. A minimal sketch of what I mean is below, after this list.)
- Are there other, more advanced things that are allowed in filter expressions?
- Finally, is there an online resource explaining this in more detail?
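For concreteness, here is a minimal sketch of the kinds of filters I am asking about. The DataFrame and its column names are just made-up placeholders, not taken from any real dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A toy DataFrame whose column names match the examples above.
df = spark.createDataFrame(
    [(150, "Al", 30), (250, "Bridget", 47), (300, "Cho", 61)],
    ["id", "name", "age"],
)

# Simple comparisons, built-in functions, and BETWEEN clearly work:
df.filter("id > 200").show()
df.filter("length(name) > 3").show()
df.filter("id BETWEEN 200 AND 300").show()

# A trivial scalar subquery also seems to be accepted:
df.filter("age > (SELECT 42)").show()

# What I would like to be able to write, if the current DataFrame could
# somehow be referenced from inside the expression (the name in angle
# brackets is a placeholder; this line does not work as written):
# df.filter("age > (SELECT avg(age) FROM <current_dataframe>)").show()
```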
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow