Sklearn train_test_split() equivalent in PySpark

This is the Sklearn function signature for splitting data:

sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
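
For comparison, a minimal runnable sketch of that call on a toy dataset (the array contents here are made up purely for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix
y = np.array([0] * 5 + [1] * 5)    # toy binary target

# Shuffles, then splits 80/20 while preserving the class balance of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42)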

I need a PySpark equivalent of Sklearn's train_test_split() that accepts arguments to stratify on the target, offers the option to shuffle the data or not, and so on. train_test_split() is a fantastically handy function, and the closest possible implementation would be ideal. The randomSplit() function doesn't match it.
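
For context, randomSplit() is PySpark's built-in splitter; the short sketch below (using a made-up toy DataFrame) shows why it falls short of train_test_split(): it partitions rows randomly by weight but exposes no stratify option:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(i, i % 2) for i in range(10)], ["id", "label"])

# Rows are assigned to weighted partitions at random; there is no stratify argument
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)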



Solution 1:[1]

I found a solution to the stratification part of the PySpark data-splitting problem.

Stratified sampling in PySpark can be achieved with sampleBy. Let's say the column on which you want to stratify the sample is "colname":

sampled = df.sampleBy("colname", fractions={4: 0.2, 6: 0.4, 8: 0.2}, seed=0)

Also, a little searching shows that it samples without replacement.

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.sampleBy.html
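
Building on sampleBy, here is a sketch of a full stratified train/test split: sample each class at the train fraction, then take the complement as the test set. The helper name, the "label" column, and the 80/20 ratio are illustrative assumptions; note also that sampleBy draws each row independently, so the per-class fractions are approximate rather than exact:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def stratified_train_test_split(df, strat_col, train_frac=0.8, seed=42):
    # Build one sampling fraction per distinct value of the stratification column
    classes = [row[strat_col] for row in df.select(strat_col).distinct().collect()]
    fractions = {c: train_frac for c in classes}
    # sampleBy keeps roughly train_frac of each class, without replacement
    train_df = df.sampleBy(strat_col, fractions=fractions, seed=seed)
    # The unsampled remainder becomes the test set; exceptAll keeps duplicate rows intact
    test_df = df.exceptAll(train_df)
    return train_df, test_df

# Toy DataFrame with three classes, just for illustration
df = spark.createDataFrame([(i, i % 3) for i in range(30)], ["id", "label"])
train_df, test_df = stratified_train_test_split(df, "label", train_frac=0.8, seed=42)

One caveat: exceptAll requires Spark 2.4+; on older versions, subtract() works as the complement step provided the DataFrame has no duplicate rows.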

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Dev_Man