'Generate all possible combinations after grouping by in SQL

Let's say I have a table called my_table in this format

key | animal
------------
1   |  Cat
2   |  Cat
3   |  Cat
4   |  Dog
5   |  Dog

I would like to get a table of this format

animal | key1 | key2
--------------------
Cat    |  1   |  2
Cat    |  1   |  3
Cat    |  2   |  3
Dog    |  4   |  5

Note that I DON'T want to use join from the beginning because it's very expensive for the size of the table that I have; that's to say I don't wanna do

SELECT m1.animal, m1.key AS key1, m2.key AS key2
FROM my_table m1
INNER JOIN my_table m2
 ON (m1.key < m2.key)
WHERE m1.animal = m2.animal

I would like to do something like GROUP BY animal or PARTITION BY animal, and then generate all possible combinations for the key column inside the grouped animal col, because I don't wanna do unnecessary computations of combining keys for different animals.

EDIT: Note that |keys| >> |animals|



Solution 1:[1]

So lets use some data:

lets create some tables of animals and keys:

CREATE TABLE data_100Kx10 as  
 SELECT * FROM (
        SELECT 
            seq8() as animal
        FROM table(generator(ROWCOUNT => 100000))
    ) a, (
        SELECT 
            seq8() as key
        FROM table(generator(ROWCOUNT => 100))
    ) b;

CREATE TABLE data_100Kx100 as  
 SELECT * FROM (
        SELECT 
            seq8() as animal
        FROM table(generator(ROWCOUNT => 100000))
    ) a, (
        SELECT 
            seq8() as key
        FROM table(generator(ROWCOUNT => 100))
    ) b;

CREATE TABLE data_1Mx10 as  
 SELECT * FROM (
        SELECT 
            seq8() as animal
        FROM table(generator(ROWCOUNT => 100000))
    ) a, (
        SELECT 
            seq8() as key
        FROM table(generator(ROWCOUNT => 100))
    ) b;

so those take on an xtra small warehouse

name rows time to create
100Kx10 1,000,000 1.4s
100Kx100 10,000,000 2.5s
1Mx10 10,000,000 2.5s

nice now lets compare two ways of querying that data:

How I would write the join (will call = <)

SELECT sum(animal), sum(key1), sum(key2) FROM(
    SELECT m1.animal, m1.key AS key1, m2.key AS key2
    FROM data_100Kx100 m1
    INNER JOIN data_100Kx100 m2
     ON m1.animal = m2.animal and m1.key < m2.key   
);

verse how you wrote the join (will call < =)

SELECT m1.animal, m1.key AS key1, m2.key AS key2
FROM data_100Kx100 m1
INNER JOIN data_100Kx100 m2
 ON (m1.key < m2.key)
WHERE m1.animal = m2.animal
form table time rows
= < 100Kx10 2.7s 4.5M
< = 100Kx10 2.8s 4.5M
= < 100Kx100 120s 495M
< = 100Kx100 50ms* 4.5M
  • just proves they are the same query.

But the point is is slow "to fetch large amounts of data" we can see this but changing to summing all the results:

SELECT sum(animal), sum(key1), sum(key2) FROM(
    SELECT m1.animal, m1.key AS key1, m2.key AS key2
    FROM data_100Kx10 m1
    INNER JOIN data_100Kx10 m2
     ON m1.animal = m2.animal and m1.key < m2.key   
);

gives:

SUM(ANIMAL) SUM(KEY1) SUM(KEY2)
224,997,750,000 12,000,000 28,500,000
table time
100Kx10 0.08s
100Kx100 0.1s
1Mx10 1.4s

So these tables are small, but if you are moving all rows around, it's slow, but if you are doing something meaningful with the data, full table joins are not really that slow.

Which is to say, write the SQL and if it's slow work out can it go faster, verse avoiding doing something, because "we think it's slow"

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Simeon Pilgrim