Generate all possible combinations after grouping by in SQL
Let's say I have a table called my_table in this format
| key | animal |
|---|---|
| 1 | Cat |
| 2 | Cat |
| 3 | Cat |
| 4 | Dog |
| 5 | Dog |
I would like to get a table of this format
| animal | key1 | key2 |
|---|---|---|
| Cat | 1 | 2 |
| Cat | 1 | 3 |
| Cat | 2 | 3 |
| Dog | 4 | 5 |
Note that I DON'T want to use a join from the beginning, because it's very expensive for the size of table that I have; that is to say, I don't want to do
SELECT m1.animal, m1.key AS key1, m2.key AS key2
FROM my_table m1
INNER JOIN my_table m2
ON (m1.key < m2.key)
WHERE m1.animal = m2.animal
I would like to do something like GROUP BY animal or PARTITION BY animal, and then generate all possible combinations of the key column within each animal group, because I don't want to do the unnecessary work of combining keys across different animals.
EDIT: Note that |keys| >> |animals|
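The group-then-combine idea the question is asking for can be sketched outside SQL; here is a minimal Python version of the logic (using the question's five example rows), which never considers a cross-animal pair:

```python
from itertools import combinations
from collections import defaultdict

# The question's example rows: (key, animal)
rows = [(1, "Cat"), (2, "Cat"), (3, "Cat"), (4, "Dog"), (5, "Dog")]

# Group keys by animal first...
groups = defaultdict(list)
for key, animal in rows:
    groups[animal].append(key)

# ...then emit all key pairs within each group only.
pairs = [(animal, k1, k2)
         for animal, keys in groups.items()
         for k1, k2 in combinations(sorted(keys), 2)]

for p in pairs:
    print(p)
```

This produces exactly the desired output table: the three Cat pairs and the one Dog pair, with key1 < key2 in every row.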
Solution 1:[1]
So let's use some data.
Let's create some tables of animals and keys:
CREATE TABLE data_100Kx10 as
SELECT * FROM (
SELECT
seq8() as animal
FROM table(generator(ROWCOUNT => 100000))
) a, (
SELECT
seq8() as key
FROM table(generator(ROWCOUNT => 10))
) b;
CREATE TABLE data_100Kx100 as
SELECT * FROM (
SELECT
seq8() as animal
FROM table(generator(ROWCOUNT => 100000))
) a, (
SELECT
seq8() as key
FROM table(generator(ROWCOUNT => 100))
) b;
CREATE TABLE data_1Mx10 as
SELECT * FROM (
SELECT
seq8() as animal
FROM table(generator(ROWCOUNT => 1000000))
) a, (
SELECT
seq8() as key
FROM table(generator(ROWCOUNT => 10))
) b;
Those take, on an extra-small warehouse:
| name | rows | time to create |
|---|---|---|
| 100Kx10 | 1,000,000 | 1.4s |
| 100Kx100 | 10,000,000 | 2.5s |
| 1Mx10 | 10,000,000 | 2.5s |
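As a sanity check on those row counts, and to see why the 100Kx100 join is so much bigger: each table yields animals × C(keys, 2) within-group pairs. A quick arithmetic sketch in Python:

```python
from math import comb

# (animals, keys per animal) for the three test tables above.
tables = {
    "100Kx10": (100_000, 10),
    "100Kx100": (100_000, 100),
    "1Mx10": (1_000_000, 10),
}

# rows = animals * keys; within-group pairs = animals * C(keys, 2)
rows = {name: a * k for name, (a, k) in tables.items()}
pairs = {name: a * comb(k, 2) for name, (a, k) in tables.items()}

for name in tables:
    print(f"{name}: {rows[name]:,} rows -> {pairs[name]:,} pairs")
```

This matches the timings table below: 100Kx10 and 1Mx10 produce 4.5M and 45M pairs, while 100Kx100 explodes to 495M pairs from the same 10M input rows.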
Nice, now let's compare two ways of querying that data.
How I would write the join (I'll call this form "= <"):
SELECT m1.animal, m1.key AS key1, m2.key AS key2
FROM data_100Kx100 m1
INNER JOIN data_100Kx100 m2
ON m1.animal = m2.animal and m1.key < m2.key
versus how you wrote the join (I'll call this form "< ="):
SELECT m1.animal, m1.key AS key1, m2.key AS key2
FROM data_100Kx100 m1
INNER JOIN data_100Kx100 m2
ON (m1.key < m2.key)
WHERE m1.animal = m2.animal
| form | table | time | rows |
|---|---|---|---|
| = < | 100Kx10 | 2.7s | 4.5M |
| < = | 100Kx10 | 2.8s | 4.5M |
| = < | 100Kx100 | 120s | 495M |
| < = | 100Kx100 | 50ms* | 4.5M |
* Just proves they are the same query.
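That the two forms really are the same query can be checked on the question's toy data; here is a small sketch using Python's built-in SQLite as a stand-in engine (not Snowflake, but the join semantics are the same for an inner join):

```python
import sqlite3

# The question's five example rows, in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (key INTEGER, animal TEXT)")
con.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "Cat"), (2, "Cat"), (3, "Cat"), (4, "Dog"), (5, "Dog")],
)

# "= <" form: both predicates in the ON clause.
eq_lt = con.execute("""
    SELECT m1.animal, m1.key, m2.key
    FROM my_table m1
    INNER JOIN my_table m2
      ON m1.animal = m2.animal AND m1.key < m2.key
    ORDER BY 1, 2, 3
""").fetchall()

# "< =" form: key predicate in ON, animal predicate in WHERE.
lt_eq = con.execute("""
    SELECT m1.animal, m1.key, m2.key
    FROM my_table m1
    INNER JOIN my_table m2
      ON m1.key < m2.key
    WHERE m1.animal = m2.animal
    ORDER BY 1, 2, 3
""").fetchall()

print(eq_lt == lt_eq)  # the two forms return identical rows
```

For an inner join, moving an equality predicate between ON and WHERE changes nothing about the result; a good optimizer plans them identically.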
But the point is that it is slow "to fetch large amounts of data"; we can see this by changing the query to sum all the results:
SELECT sum(animal), sum(key1), sum(key2) FROM(
SELECT m1.animal, m1.key AS key1, m2.key AS key2
FROM data_100Kx10 m1
INNER JOIN data_100Kx10 m2
ON m1.animal = m2.animal and m1.key < m2.key
);
gives:
| SUM(ANIMAL) | SUM(KEY1) | SUM(KEY2) |
|---|---|---|
| 224,997,750,000 | 12,000,000 | 28,500,000 |
| table | time |
|---|---|
| 100Kx10 | 0.08s |
| 100Kx100 | 0.1s |
| 1Mx10 | 1.4s |
So these tables are small, but the lesson holds: if you are moving all rows around, it's slow; whereas if you are doing something meaningful with the data, full table joins are not really that slow.
Which is to say: write the SQL, and if it's slow, work out whether it can go faster, rather than avoiding an approach because "we think it's slow".
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Simeon Pilgrim |
