Spark SQL grouping: Add to group by or wrap in first() if you don't care which value you get.
I have a query in Spark SQL like:

```sql
select count(ts), truncToHour(ts)
from myTable
group by truncToHour(ts)
```
Here `ts` is of timestamp type, and `truncToHour` is a UDF that truncates the timestamp to the hour. This query does not work. If I try:
```sql
select count(ts), ts from myTable group by truncToHour(ts)
```
I get the error `expression 'ts' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get.`, but `first()` is not defined if I do:
```sql
select count(ts), first(ts) from myTable group by truncToHour(ts)
```
Is there any way to get what I want without using a subquery? Also, why does the error say to "wrap in first()" when first() is not defined?
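For context, here is a minimal PySpark sketch of the setup described above. The sample rows, the Python body of `truncToHour`, and the Spark version are assumptions, not part of the original question; older Spark releases rejected this query as described, while current versions resolve it.

```python
# Hypothetical reproduction of the setup: sample data, the UDF body, and the
# Spark version are assumptions, not taken from the original question.
from pyspark.sql import SparkSession
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.appName("truncToHour-demo").getOrCreate()

# Register a UDF that truncates a timestamp to the start of its hour.
def trunc_to_hour(ts):
    return ts.replace(minute=0, second=0, microsecond=0) if ts is not None else None

spark.udf.register("truncToHour", trunc_to_hour, TimestampType())

# Hypothetical stand-in for myTable.
df = spark.createDataFrame(
    [("2015-06-01 10:15:00",), ("2015-06-01 10:45:00",), ("2015-06-01 11:05:00",)],
    ["ts_str"],
).selectExpr("CAST(ts_str AS TIMESTAMP) AS ts")
df.createOrReplaceTempView("myTable")

# Selecting the grouping expression itself resolves in current Spark versions;
# older releases rejected it, which matches the behaviour described above.
spark.sql("""
    SELECT truncToHour(ts) AS hr, COUNT(ts) AS cnt
    FROM myTable
    GROUP BY truncToHour(ts)
""").show()
```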
Solution 1:[1]
Solution 2:[2]
I got a solution (using `truncToHour`, the UDF name from the question):

```sql
SELECT max(truncToHour(ts)), COUNT(ts) FROM myTable GROUP BY truncToHour(ts)
```

or

```sql
SELECT truncToHour(max(ts)), count(ts) FROM myTable GROUP BY truncToHour(ts)
```
Is there any better solution?
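If the DataFrame API is an option, a possibly cleaner variant is to materialize the truncated column once and group on it. This is a sketch under the same assumed setup as the snippet above; the variable and column names here are illustrative, not from the original answer.

```python
# Hedged DataFrame-API sketch, assuming truncToHour was registered as shown
# earlier; "hourly", "truncHrTs", and "cnt" are made-up names.
from pyspark.sql import functions as F

hourly = (
    df.withColumn("truncHrTs", F.expr("truncToHour(ts)"))  # apply the UDF once
      .groupBy("truncHrTs")                                 # group on the derived column
      .agg(F.count("ts").alias("cnt"))
)
hourly.show()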
Solution 3:[3]
This seems better, but it requires nesting:
```sql
select truncHrTs, count(ts)
from (
    select ts, truncToHour(ts) AS truncHrTs
    from myTable
)
group by truncHrTs
```
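Run through `spark.sql()` under the same assumed setup as the earlier sketch, this could look like the snippet below; the subquery alias `t` is an addition of mine, since some Spark/Hive versions require derived tables in `FROM` to be named.

```python
# The nested-query approach above, executed via spark.sql(); the alias "t" on
# the derived table is an assumption, added because some Spark/Hive versions
# insist that subqueries in FROM be named.
spark.sql("""
    SELECT truncHrTs, COUNT(ts) AS cnt
    FROM (
        SELECT ts, truncToHour(ts) AS truncHrTs
        FROM myTable
    ) t
    GROUP BY truncHrTs
""").show()
```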
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Kumar Deepak |
| Solution 2 | Mike Sukmanowsky |
| Solution 3 | alwaysLearning |
