'SQL query to replace multiple occurrences with the most frequent ones
I have a products table, which has the product description in two languages, one in English and one in an alternate language.
Let's say:
| Product_Desc | Product_Desc_Alt |
|---|---|
| A | A1 |
| A | A2 |
| A | A1 |
| A | A3 |
| B | B1 |
| B | B2 |
| B | B2 |
| C | C1 |
If I do a GROUP BY statement, there are multiple alternate language occurrences for the same product. So, let's say:
SELECT Product_Desc, COUNT(DISTINCT Product_Desc_Alt) AS CNT
FROM products
GROUP BY Product_Desc
ORDER BY CNT DESC
| Product_Desc | CNT |
|---|---|
| A | 3 |
| B | 2 |
| C | 1 |
I would like to replace the Product_Desc_Alt occurrences with the most frequent ones,
so for example I would like the output to be:
| Product_Desc | Product_Desc_Alt |
|---|---|
| A | A1 |
| A | A1 |
| A | A1 |
| A | A1 |
| B | B2 |
| B | B2 |
| B | B2 |
| C | C1 |
Obviously, if a product has only one alter lang description, just keep that one.
There may be lots of ways to do that, but I can't think of one.
I am using Azure Databricks so this could also happen with PySpark, but I am interested in doing this the SQL way.
Thanks a lot!
Solution 1:[1]
Is this what you mean?
UPDATE products
SET Product_Desc_Alt = (
SELECT TOP 1 Product_Desc_Alt
FROM products P2
WHERE P2.Product_Desc = products.Product_Desc
GROUP BY Product_Desc_Alt
ORDER BY COUNT(*) DESC
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Gert-Jan |
