Spark ML basic operations in Java

I have a Dataset<Row> dataset and want to perform some basic operations on it.
For example, suppose it has three columns, "Id", "Name", and "Age", with data in each. I want to perform the following operations on the Name column:
[1] Remove whitespace from the Name column
[2] Remove digits from the Name column
[3] Remove special characters from the Name column

I am using Java 8, Apache Spark, and the Apache Spark ML library.

Please suggest the best way to do this.



Solution 1:[1]

Use regexp_replace() to strip whitespace, digits, and special characters in a single pass. The pattern below removes everything that is not an ASCII letter (a-z, A-Z), so only letters are retained.

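import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sample rows, one per case from the question
List<Row> rows = Arrays.asList(
    RowFactory.create("validName"),
    RowFactory.create("name with whitespace    "),
    RowFactory.create("name with numbers 1234"),
    RowFactory.create("name with special chars !@#$%")
);

StructField[] structFields = new StructField[]{
        new StructField("Name", DataTypes.StringType, false, Metadata.empty()),
};

// Create sample data; spark() is assumed to return an existing SparkSession
Dataset<Row> input = spark().createDataFrame(rows, new StructType(structFields));
// Replace every run of non-letter characters with the empty string
input.withColumn("cleanedName", functions.regexp_replace(functions.col("Name"), "[^a-zA-Z]+", "")).show(100, false);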

The cleanedName column contains the expected values:

+-----------------------------+--------------------+
|Name                         |cleanedName         |
+-----------------------------+--------------------+
|validName                    |validName           |
|name with whitespace         |namewithwhitespace  |
|name with numbers 1234       |namewithnumbers     |
|name with special chars !@#$%|namewithspecialchars|
+-----------------------------+--------------------+
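If the three operations from the question need to be applied separately rather than in one pass, the same idea can be split into one pattern per step. The following is a minimal sketch in plain Java (no Spark required) so the patterns can be checked in isolation. The class and method names are hypothetical, and "special character" is assumed here to mean anything that is not an ASCII letter, digit, or whitespace. Since Spark's regexp_replace() takes Java regular expressions, each pattern string should be usable as its second argument in the same way as "[^a-zA-Z]+" above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only
public class NameCleaner {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");
    // Assumption: "special" means not an ASCII letter, digit, or whitespace
    private static final Pattern SPECIAL = Pattern.compile("[^a-zA-Z0-9\\s]+");
    // Same pattern as in the answer: everything except ASCII letters
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z]+");

    // [1] Remove whitespace
    static String removeWhitespace(String s) {
        return WHITESPACE.matcher(s).replaceAll("");
    }

    // [2] Remove digits
    static String removeDigits(String s) {
        return DIGITS.matcher(s).replaceAll("");
    }

    // [3] Remove special characters
    static String removeSpecialChars(String s) {
        return SPECIAL.matcher(s).replaceAll("");
    }

    // All three at once: keep ASCII letters only
    static String keepLettersOnly(String s) {
        return NON_LETTERS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "name with numbers 1234";
        // Applying the three steps in sequence
        String stepwise = removeSpecialChars(removeDigits(removeWhitespace(raw)));
        System.out.println(stepwise);
        System.out.println(keepLettersOnly(raw));
    }
}
```

Note that "[^a-zA-Z]+" also strips accented and other non-ASCII letters. If such names should be preserved, a Unicode-aware class such as "[^\\p{L}]+" may be a better fit.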

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 vdep