PySpark: using groupBy and orderBy together

Hi there, I want to achieve something like this:

SAS SQL: select * from flightData2015 group by DEST_COUNTRY_NAME order by count

My data looks like this: [image not included]

This is my spark code:

flightData2015.selectExpr("*").groupBy("DEST_COUNTRY_NAME").orderBy("count").show()

I received this error:

AttributeError: 'GroupedData' object has no attribute 'orderBy'

I am new to PySpark. Do PySpark's groupBy and orderBy not work the same way as in SAS SQL?

I also tried sort:

flightData2015.selectExpr("*").groupBy("DEST_COUNTRY_NAME").sort("count").show()

and I received much the same error: "AttributeError: 'GroupedData' object has no attribute 'sort'". Please help!



Solution 1:[1]

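In Spark, groupBy returns a GroupedData object, not a DataFrame, and GroupedData has no orderBy or sort method. It only exposes aggregations (count, sum, agg and so on), and it is the aggregation that gives you a DataFrame back, which you can then sort. So in this case, even though the SAS SQL doesn't have any aggregation, you still have to define one (and drop it later if you want).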

(flightData2015
    .groupBy("DEST_COUNTRY_NAME")
    .count() # this is the "dummy" aggregation
    .orderBy("count")
    .show()
)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    pltc