What is the accepted syntax for PySpark's SQL expression-based filters?

The PySpark documentation for DataFrame.filter says that it accepts "a string of SQL expression".
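
For concreteness, a minimal sketch of both call forms, assuming a local SparkSession and a toy DataFrame I made up for this question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "Ann", 25), (2, "Bob", 42), (3, "Cecile", 17)],
        ["id", "name", "age"],
    )

    # Column-based filter and its SQL-expression-string equivalent
    df.filter(df.age > 21).show()
    df.filter("age > 21").show()

Both calls return the same rows, so the string form clearly covers at least simple comparison expressions.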

Is there a reference for the accepted syntax of this parameter? The best I could find is the page about the WHERE clause in the Spark SQL docs. Obviously some examples, like "id > 200", "length(name) > 3", or "id BETWEEN 200 AND 300", would work. But what about others? Filters like "age > (SELECT 42)" seem to work, so I assume nested expressions are OK. This just raises more questions:

  • What databases can these nested expressions refer to? Is there a way I can create a nested SELECT expression referring to the current dataframe, e.g. to do something like "age > (SELECT avg(age) FROM <current_dataframe>)" as a filter? (I know there are other ways of achieving this; I am only interested in what SQL expressions can do. A sketch of my experiments follows this list.)
  • Are there other, more advanced things that are allowed in filter expressions?
  • Finally, is there an online resource explaining this in more detail?
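
For reference, here is a minimal sketch of my experiments, using the same kind of toy DataFrame; the commented-out lines at the end show the self-referencing subquery I am asking about ("people" is just a view name I picked for the sketch):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(100, "Al", 30), (250, "Beatrice", 45), (300, "Cy", 22)],
        ["id", "name", "age"],
    )

    # Simple expressions from the examples above
    df.filter("id > 200").show()
    df.filter("length(name) > 3").show()
    df.filter("id BETWEEN 200 AND 300").show()

    # A constant scalar subquery is also accepted
    df.filter("age > (SELECT 42)").show()

    # What I would like: a subquery that refers back to the DataFrame
    # being filtered. Registering a temp view is the closest workaround
    # I know of, but my question is whether the filter string alone can
    # express this, and what such subqueries may refer to.
    # df.createOrReplaceTempView("people")
    # df.filter("age > (SELECT avg(age) FROM people)").show()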

