How to assign to each row the number of times a value appears in the whole table?
I'm trying to run an SQL query on Vertica but I can't find a way to get the results I need.
Let's say I have a table showing:
- productID
- campaignID (ID of the sales campaign)
- calendarYearWeek (calendar week when the campaign was active; campaigns are usually active for 5 days)
- countryOrigin (the country where the product was sold, as these are international sales)
- valueLocal (price in local currency)
What I need to do is to find products sold in different countries and compare their prices between markets.
Some campaigns are available in only one country and some in several. To avoid having hundreds of thousands of unnecessary rows that I can't compare to others, I want to keep only those products that were available in more than one countryOrigin.
What's important - a product can be available in different campaigns with a different price.
That's why in my SELECT statement I added a new column:
calendarYearWeek||productID||campaignID AS uniqueItem - that way I know that I'm checking the price only for a specific product in a specific campaign during a specific week of year.
The table is also joined with another table to get exchange rates etc., so it's also GROUPed BY, so in each row I have a price and average exchange rate for a given uniqueItem in a specific country.
If I run this query, it works but even just for this year it gives me several million results, most of which I don't need because these are products sold only in one country and I need to compare prices across different markets.
So what I think I need is to assign to each row the number of times its uniqueItem value appears in the whole table. If it's 1, the product is sold in only one country and I can ignore it. If it's 2 or 3, that is what I need. Then I can filter out the unnecessary rows in the WHERE clause (> 1) and work on a smaller, better data set.
I tried different combinations of COUNT, and I tried ROW_NUMBER with OVER (PARTITION BY). That works only partially: when a product is available in 2 or more countries it numbers the rows, but I still cannot filter out "1" because then I would lose the "first" country on the list. I thought about MATCH_RECOGNIZE, but I've never used it before and I think it's not available in Vertica.
Sorry if it's messy, but I'm not really advanced in SQL and English is not my native language.
Do you have any ideas how to get only the data I need?
What I have now is:
SELECT
a.originCountry,
a.calendarYearWeek,
a.productID,
a.campaignId,
a.valueLocal,
ROUND(AVG(b.exchange_rate),4),
a.calendarYearWeek||a.productID||a.campaignID AS uniqueItem
FROM table1 a
LEFT JOIN table2 b
ON a.reportDate = b.reportDate
AND a.originCountry = b.originCountry
WHERE a.originCountry IN ('ES', 'DE', 'FR')
GROUP BY 3, 4, 7, 1, 5, 2
ORDER BY 3, 4, 1
----------
Solution 1:[1]
I need some sample data, so I made up a few rows.
- In a subselect or a common table expression, find the combinations of the identifying grouping columns that occur more than once, then join that result with table1.
- Express the average as an OLAP (window) function if you want to keep the country in the report.
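Taken on its own, the first point might look like the sketch below. It is a variation, not the query from this answer: it counts distinct countries instead of rows, on the assumption that table1 could hold several rows per country for the same item (for example one per reportDate), in which case COUNT(*) would exceed 1 even for a single-country item. The alias country_count is made up for illustration. The complete query follows.

```sql
-- Sketch only: keys of the items sold in more than one country.
-- Assumes table1 has the columns named in the question.
SELECT
  calendarYearWeek
, productID
, campaignId
, COUNT(DISTINCT UPPER(originCountry)) AS country_count  -- hypothetical alias
FROM table1
WHERE UPPER(originCountry) IN ('ES', 'DE', 'FR')
GROUP BY
  calendarYearWeek
, productID
, campaignId
HAVING COUNT(DISTINCT UPPER(originCountry)) > 1;
```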
WITH
-- input, don't use in final query ..
table1(originCountry,calendarYearWeek,productID,campaignId,valuelocal,reportDate) AS (
SELECT 'ES',202203,43,142,100.50, DATE '2022-01-19'
UNION ALL SELECT 'DE',202203,43,142,135.00, DATE '2022-01-19'
UNION ALL SELECT 'FR',202203,43,142, 98.75, DATE '2022-01-19'
UNION ALL SELECT 'ES',202203,44,147,198.75, DATE '2022-01-19'
UNION ALL SELECT 'DE',202203,44,147,205.00, DATE '2022-01-19'
UNION ALL SELECT 'FR',202203,44,147,198.75, DATE '2022-01-19'
UNION ALL SELECT 'es',202203,49,150, 1.25, DATE '2022-01-19'
)
,
table2(originCountry,reportDate,exchange_rate) AS (
SELECT 'ES',DATE '2022-01-19', 1
UNION ALL SELECT 'DE',DATE '2022-01-19', 1
UNION ALL SELECT 'FR',DATE '2022-01-19', 1
)
-- end of input; real query starts here, replace following comma with "WITH" ..
,
-- you need the unique ident grouping values to join with ..
selgrp AS (
SELECT
a.calendarYearWeek
, a.productID
, a.campaignId
FROM table1 a
GROUP BY
a.calendarYearWeek
, a.productID
, a.campaignId
HAVING COUNT(*) > 1
-- chk calendarYearWeek | productID | campaignId
-- chk ------------------+-----------+------------
-- chk 202203 | 43 | 142
-- chk 202203 | 44 | 147
)
SELECT
a.originCountry
, a.calendarYearWeek
, a.productID
, a.campaignId
, a.valueLocal
, AVG(b.exchange_rate) OVER w::NUMERIC(9,4) AS avg_exch_rate
-- a.calendarYearWeek||a.productID||a.campaignID AS uniqueItem
FROM table1 a
JOIN selgrp USING(calendarYearWeek,productID,campaignId)
LEFT JOIN table2 b
ON a.reportDate = b.reportDate
AND a.originCountry = b.originCountry
WHERE UPPER(a.originCountry) IN ('ES', 'DE', 'FR')
WINDOW w AS (PARTITION BY a.calendarYearWeek,a.productID,a.campaignID)
ORDER BY 3, 4, 1
-- out originCountry | calendarYearWeek | productID | campaignId | valueLocal | avg_exch_rate
-- out ---------------+------------------+-----------+------------+------------+---------------
-- out DE | 202203 | 43 | 142 | 135.00 | 1.0000
-- out ES | 202203 | 43 | 142 | 100.50 | 1.0000
-- out FR | 202203 | 43 | 142 | 98.75 | 1.0000
-- out DE | 202203 | 44 | 147 | 205.00 | 1.0000
-- out ES | 202203 | 44 | 147 | 198.75 | 1.0000
-- out FR | 202203 | 44 | 147 | 198.75 | 1.0000
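The idea from the question, attaching to every row the number of times its item appears and then filtering on that number, can also be sketched with an analytic COUNT instead of a join. This is an alternative under stated assumptions, not part of the answer above: it assumes one row per country per item in table1, and the names item_rows and counted are invented for the example. Unlike ROW_NUMBER, COUNT(*) OVER gives every row in the partition the same value, so no country is lost when filtering.

```sql
-- Sketch only: count the rows per item with a window function,
-- then filter in an outer query (a window function cannot be used in WHERE).
SELECT *
FROM (
  SELECT
    a.originCountry
  , a.calendarYearWeek
  , a.productID
  , a.campaignId
  , a.valueLocal
  , COUNT(*) OVER (
      PARTITION BY a.calendarYearWeek, a.productID, a.campaignId
    ) AS item_rows                                   -- hypothetical alias
  FROM table1 a
  WHERE UPPER(a.originCountry) IN ('ES', 'DE', 'FR')
) counted                                            -- hypothetical alias
WHERE item_rows > 1
ORDER BY productID, campaignId, originCountry;
```

With the sample rows above, product 49 appears in a single row, so it would be dropped, while products 43 and 44 keep all three of their rows. The exchange-rate join and average can be added to the inner query in the same way as in the main query.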
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | marcothesane |
