'Using AVG Function & Top Clause
I am working with a weather data set in SQL Server and to try determine the average temperature when it was snowing and order by them DESC visibility (kilometers).
I tried using the TOP clause and AVG function together, but I am getting errors. Any ideas why?
Here is the code I was using:
SELECT TOP 20 *, AVG(Temp)
FROM weather
WHERE Precip_Type = 'Snow'
ORDER BY Visibility_km DESC
This is the error I get:
Msg 8120, Level 16, State 1, Line 1
Column 'weather.Formatted_Date' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Solution 1:[1]
When you use an aggregate function such as AVG() any non-agregrated functions must be in a group by. You cannot just use *. As you want to see the visibility we will use it. You can add other columns to the SELECT as long as you add them to the GROUP BY.
As we are only showing results when it's snowing it's not useful to show this column. If we group by the date you will get the average of 1 row which is not helpful. I have added COUNT() which could be interesting in your analysis.
SELECT TOP 20
Count(*) number_rows,
Visibility_km,
AVG(Temp)
FROM weather
WHERE Precip_Type = 'Snow'
GROUP BY Visibility_km
ORDER BY Visibility_km DESC
Solution 2:[2]
It's nothing to do with TOP
You need to understand that AVG needs to work on some dataset that you pick by choosing some rows to go into a group, with WHERE, and then choosing what keys to group on with GROUP BY
Databases work on rectangular blocks of data:
Person
Name Age
John 10
Jane 20
You can do this:
SELECT AVG(age) FROM Person
15
You cannot do this:
SELECT *, AVG(age) FROM Person
I suppose you might expect the DB to give:
John 10 15
Jane 20 15
But if you want that you do something like one of these:
SELECT *
FROM
Person
CROSS JOIN
(SELECT AVG(age) FROM Person) x
SELECT *, (SELECT AVG(age) FROM Person)
FROM Person
SELECT *, AVG(age) OVER()
FROM Person
We haven't yet got to the complexity of grouping on a key set yet, but already there is this notion that one cannot just straight up mix grouped data and nongrouped data together without being very specific about how it should be done. This is because when we make the grouping scenario even just slightly more complex, all those queries above cease to make sense:
Person
Name Age Gender
John 10 M
Jane 20 F
Paul 30 M
Sara 40 F
Average age by gender
SELECT *, AVG(age)
FROM person
GROUP BY gender
To calculate the avg age per gender we throw away everything other than gender and age, we partition the dataset into N different genders and then we do the average of each partition. We end up with N records resulting, each with a different gender and average
How should the database interpret
SELECT *, AVG(age)
FROM person
GROUP BY gender
?
Perhaps it should retrieve all the records for each person, and then separately do a group on gender, then attach to each record the average age for the gender, based on the gender of the person?
That could be possible.. but then what if you wanted the max weight by city, and the average age by gender ...
SELECT *, AVG(age), MAX(weight)
FROM person
GROUP BY gender, city
How is the database going to know what you want, that it's average age by gender and max weight by city rather than vice verse?
So, with increasingly complex requirements an auto magical system isn't going to work..
..we can however, be explicit. We can run groupings separately and join them back, or we can use window functions to say what we want
SELECT
*,
(SELECT AVG(age) FROM person pp WHERE pp.gender = p.gender)
FROM person p
SELECT
*,
avg_age
FROM
person p
INNER JOIN
(SELECT gender, AVG(age) FROM person GROUP BY gender) x ON pp.gender = p.gender
SELECT
*,
AVG(age) OVER(PARTITION BY gender)
FROM person p
These are all ways of doing a group and then connecting the grouped info back to the full dataset. They're explicit and the database knows exactly what to do with them..
So, in summary, there are quite specific rules for grouping data. When you write a group query anything you mention in the SELECT either has to also be mentioned in the GROUP BY or it has to be contained in a function call that performs some aggregate operation. You can't add extra columns in that aren't part of the grouping key; you either add them to the GEOUP BU and partition yoir dataset up into more unique combinations of values, or you put the expression in a MIN, MAX etc
--acceptable
SELECT gender, AVG(age)
FROM person
GROUP BY gender
--not acceptable - city is selected but not grouped or aggregated
SELECT gender, city, AVG(age)
FROM person
GROUP BY gender
--acceptable. You get average age per gender per city
--if there were 2 genders and 3 cities you get 6 rows
SELECT gender, city, AVG(age)
FROM person
GROUP BY gender, city
--acceptable. You get average age and max city (alphabetically latest) per gender
--if there were 2 genders and 3 cities you get 2 rows
SELECT gender, MAX(city), AVG(age)
FROM person
GROUP BY gender
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |
