Random sampling based on one column after groupBy

I have a Spark table containing 400+ million records/rows. I used spark.table to load it into a DataFrame. The DataFrame looks like this:

    id        pub_date  version  unique_id     c_id  p_id  type      source
    lni001    20220301  1        64WP-UI-POLI  002   P02   org       internet
    lni001    20220301  1        64WP-UI-POLI  002   P02   org       internet
    lni001    20220301  1        64WP-UI-POLI  002   P02   org       internet
    lni001    20220301  2        64WP-UI-CFGT  012   K21   location  internet
    lni001    20220301  2        64WP-UI-CFGT  012   K21   location  internet
    lni001    20220301  3        64WP-UI-CFGT  012   K21   location  internet
    lni001    20220301  3        64WP-UI-POLI  002   P02   org       internet
    lni002    20220301  85       64WP-UI-POLI  002   P02   org       internet
    lni002    20220301  85       64WP-UI-POLI  002   P02   org       internet
    lni002    20220301  5        64WP-UI-CFGT  012   K21   location  internet
    lni002    20220301  1        64WP-UI-CFGT  012   K21   location  internet
 ::
 ::

I am trying to randomly select rows based on the id column. I want to randomly select a number of id groups, after doing a groupBy or partitionBy on the id column, and keep every row belonging to each selected id.

If I want 2 random samples, I should get back all the rows associated with the 2 sampled ids. For example, "lni001" has 7 records and "lni002" has 4 records, so if those two ids are sampled I need all 11 of those rows.

I have been trying to use groupBy and partitionBy but still couldn't figure out how to do it (see the sketch of my intent below). It would be great if you could give me some ideas or suggestions. Thanks!
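To make the goal concrete, here is a rough sketch of what I think might work, assuming the table is named "my_table" and I want 2 random ids (both are placeholders). I'm not sure whether this is the right or efficient approach on 400+ million rows:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Load the table into a DataFrame ("my_table" is a placeholder name).
    df = spark.table("my_table")

    # 1. Get the distinct ids, shuffle them randomly, and keep 2 of them.
    sampled_ids = (
        df.select("id")
          .distinct()
          .orderBy(F.rand())  # random order over the distinct ids
          .limit(2)           # number of id groups to sample
    )

    # 2. Keep every row whose id is in the sampled set,
    #    so each sampled id comes back with all of its records.
    sampled_rows = df.join(sampled_ids, on="id", how="inner")

    sampled_rows.show()

The orderBy(rand()) over the distinct ids is the part I'm unsure about at this scale, which is why I was looking at groupBy/partitionBy instead.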



Sources

This content is from Stack Overflow and is licensed under CC BY-SA 3.0.