'Using AVG Function & Top Clause

I am working with a weather data set in SQL Server and to try determine the average temperature when it was snowing and order by them DESC visibility (kilometers).

I tried using the TOP clause and AVG function together, but I am getting errors. Any ideas why?

Here is the code I was using:

SELECT TOP 20 *, AVG(Temp)
FROM weather
WHERE Precip_Type = 'Snow'
ORDER BY Visibility_km DESC

This is the error I get:

Msg 8120, Level 16, State 1, Line 1
Column 'weather.Formatted_Date' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Photo of sample data

sql


Solution 1:[1]

When you use an aggregate function such as AVG() any non-agregrated functions must be in a group by. You cannot just use *. As you want to see the visibility we will use it. You can add other columns to the SELECT as long as you add them to the GROUP BY.
As we are only showing results when it's snowing it's not useful to show this column. If we group by the date you will get the average of 1 row which is not helpful. I have added COUNT() which could be interesting in your analysis.

SELECT TOP 20 
  Count(*) number_rows,
  Visibility_km, 
  AVG(Temp)
FROM weather
WHERE Precip_Type = 'Snow'
GROUP BY Visibility_km
ORDER BY Visibility_km DESC

Solution 2:[2]

It's nothing to do with TOP

You need to understand that AVG needs to work on some dataset that you pick by choosing some rows to go into a group, with WHERE, and then choosing what keys to group on with GROUP BY

Databases work on rectangular blocks of data:

Person
Name Age
John 10
Jane 20

You can do this:

SELECT AVG(age) FROM Person

15

You cannot do this:

SELECT *, AVG(age) FROM Person

I suppose you might expect the DB to give:

John 10 15
Jane 20 15

But if you want that you do something like one of these:

SELECT * 
FROM
  Person 
  CROSS JOIN
  (SELECT AVG(age) FROM Person) x

SELECT *, (SELECT AVG(age) FROM Person) 
FROM Person

SELECT *, AVG(age) OVER() 
FROM Person

We haven't yet got to the complexity of grouping on a key set yet, but already there is this notion that one cannot just straight up mix grouped data and nongrouped data together without being very specific about how it should be done. This is because when we make the grouping scenario even just slightly more complex, all those queries above cease to make sense:

Person
Name Age Gender
John 10  M
Jane 20  F
Paul 30  M
Sara 40  F

Average age by gender

SELECT *, AVG(age)
FROM person
GROUP BY gender

To calculate the avg age per gender we throw away everything other than gender and age, we partition the dataset into N different genders and then we do the average of each partition. We end up with N records resulting, each with a different gender and average

How should the database interpret

SELECT *, AVG(age)
FROM person
GROUP BY gender

?

Perhaps it should retrieve all the records for each person, and then separately do a group on gender, then attach to each record the average age for the gender, based on the gender of the person?

That could be possible.. but then what if you wanted the max weight by city, and the average age by gender ...

SELECT *, AVG(age), MAX(weight)
FROM person
GROUP BY gender, city

How is the database going to know what you want, that it's average age by gender and max weight by city rather than vice verse?

So, with increasingly complex requirements an auto magical system isn't going to work..

..we can however, be explicit. We can run groupings separately and join them back, or we can use window functions to say what we want

SELECT
  *, 
  (SELECT AVG(age) FROM person pp WHERE pp.gender = p.gender)
FROM person p


SELECT
  *, 
  avg_age
FROM 
  person p
  INNER JOIN
  (SELECT gender, AVG(age) FROM person GROUP BY gender) x ON pp.gender = p.gender


SELECT
  *, 
  AVG(age) OVER(PARTITION BY gender) 
FROM person p

These are all ways of doing a group and then connecting the grouped info back to the full dataset. They're explicit and the database knows exactly what to do with them..


So, in summary, there are quite specific rules for grouping data. When you write a group query anything you mention in the SELECT either has to also be mentioned in the GROUP BY or it has to be contained in a function call that performs some aggregate operation. You can't add extra columns in that aren't part of the grouping key; you either add them to the GEOUP BU and partition yoir dataset up into more unique combinations of values, or you put the expression in a MIN, MAX etc

--acceptable 
SELECT gender, AVG(age)
FROM person 
GROUP BY gender


--not acceptable - city is selected but not grouped or aggregated 
SELECT gender, city, AVG(age)
FROM person 
GROUP BY gender


--acceptable. You get average age per gender per city
--if there were 2 genders and 3 cities you get 6 rows 
SELECT gender, city, AVG(age)
FROM person 
GROUP BY gender, city 


--acceptable. You get average age and max city (alphabetically latest) per gender
--if there were 2 genders and 3 cities you get 2 rows 
SELECT gender, MAX(city), AVG(age)
FROM person 
GROUP BY gender 

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2