Sample dataframe by column with duplicate values in Spark
Let's say I have 500 rows in a DataFrame with 100 unique IDs. I want to sample 2% of the IDs, i.e., keep 2% of the IDs and all the rows for those IDs. How can we do that?
Solution 1:[1]
It sounds like you want to use sample:
If you only want to look at IDs, filter your dataset down to the unique set of IDs first; this gives you the ability to randomly sample at the ID level rather than the row level (see the sketch after the parameter list below):
```scala
sample(fraction: scala.Double, seed: scala.Long): Dataset[T]
sample(fraction: scala.Double): Dataset[T]
sample(withReplacement: scala.Boolean, fraction: scala.Double, seed: scala.Long): Dataset[T]
sample(withReplacement: scala.Boolean, fraction: scala.Double): Dataset[T]
```
Parameters:
- `fraction` – Fraction of rows to generate, in the range [0.0, 1.0]. Note that it is not guaranteed to return exactly that fraction of the records.
- `seed` – Seed for sampling (defaults to a random seed); use a fixed seed to reproduce the same random sample.
- `withReplacement` – Whether to sample with replacement (defaults to `false`).
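
Putting that together for the question above: sample the distinct IDs, then join back to the original DataFrame so that every row belonging to a sampled ID is kept. Below is a minimal sketch, assuming a DataFrame with an `id` column; the column name, toy data, and seed are illustrative, not from the original answer:

```scala
import org.apache.spark.sql.SparkSession

object SampleByIdExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SampleByIdExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: 500 rows with ~100 unique values in an "id" column.
    val df = Seq.tabulate(500)(i => (i % 100, s"value_$i")).toDF("id", "value")

    // Sample roughly 2% of the distinct IDs. The fraction is approximate
    // (sample does not guarantee an exact count); the seed makes it reproducible.
    val sampledIds = df.select("id").distinct().sample(0.02, 42L)

    // Inner join keeps every row whose ID was sampled.
    val result = df.join(sampledIds, Seq("id"))

    result.show()
    spark.stop()
  }
}
```

The distinct-then-join step is what makes this an ID-level sample: calling `df.sample(0.02)` directly would sample 2% of the rows, which could split an ID's rows between kept and dropped.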
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Author |
|---|---|
| Solution 1 | Matt Andruff |
