Sample dataframe by column with duplicate values in Spark

Let's say I have 500 rows in a DF with 100 unique IDs. I want to sample 2% of the IDs, i.e., keep 2% of the IDs and all of the rows for those IDs. How can we do that?



Solution 1:[1]

It sounds like you want to use sample:

If you only want to look at IDs, you may want to filter your dataset down to the unique set of IDs first; sample then gives you the ability to randomly sample that set:

sample(fraction : scala.Double, seed : scala.Long) : Dataset[T]
sample(fraction : scala.Double) : Dataset[T]
sample(withReplacement : scala.Boolean, fraction : scala.Double, seed : scala.Long) : Dataset[T]
sample(withReplacement : scala.Boolean, fraction : scala.Double) : Dataset[T]

Scala Parameters

fraction – Fraction of rows to generate, in the range [0.0, 1.0]. Note that it does not guarantee the exact fraction of records.

seed – Seed for sampling (default: a random seed). Used to reproduce the same random sampling.

withReplacement – Whether to sample with replacement (default false).
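Putting the pieces together, a minimal sketch of the approach described above: sample the distinct IDs, then join back to keep every row for the sampled IDs. This assumes a DataFrame named df with an ID column named "id" (hypothetical names, not from the original question); the seed is optional and shown only for reproducibility.

```scala
import org.apache.spark.sql.DataFrame

// df: your DataFrame with an "id" column (assumed names)
def sampleByIds(df: DataFrame): DataFrame = {
  // 1. Reduce to the unique set of IDs and sample 2% of them
  val sampledIds = df.select("id").distinct().sample(fraction = 0.02, seed = 42L)
  // 2. Inner join keeps all rows belonging to a sampled ID
  df.join(sampledIds, Seq("id"), "inner")
}
```

Because sample works on a fraction of partitions' rows rather than an exact count, the number of sampled IDs is approximate; if you need exactly 2 of the 100 IDs, collect the distinct IDs and pick from them on the driver instead.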

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Matt Andruff