Spark ML basic operations in Java
I have a Dataset<Row> dataset and want to perform some basic operations on it.
For example, suppose it has three columns, "Id", "Name", and "Age",
with data in each. I want to perform the following operations on the Name column:
[1] Remove whitespace from the Name column
[2] Remove digits from the Name column
[3] Remove special characters from the Name column
I am using Java 8, Apache Spark, and the Apache Spark ML library.
Please suggest the best way to do this.
Solution 1:[1]
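Use regexp_replace() to strip whitespace, digits, and special characters in a single pass, which amounts to keeping only the letters. The pattern [^a-zA-Z]+ matches every run of characters that are not ASCII letters, and replacing each match with an empty string removes it.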
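import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
// Create sample data: one row per kind of unwanted character
List<Row> rows = Arrays.asList(
RowFactory.create("validName"),
RowFactory.create("name with whitespace "),
RowFactory.create("name with numbers 1234"),
RowFactory.create("name with special chars !@#$%")
);
StructField[] structFields = new StructField[]{
new StructField("Name", DataTypes.StringType, false, Metadata.empty()),
};
// spark() is assumed to be a helper that returns the active SparkSession
Dataset<Row> input = spark().createDataFrame(rows, new StructType(structFields));
// Replace every run of non-letter characters in Name with an empty string
input.withColumn("cleanedName", functions.regexp_replace(functions.col("Name"), "[^a-zA-Z]+", "")).show(100, false);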
The cleanedName column has the expected values:
+-----------------------------+--------------------+
|Name                         |cleanedName         |
+-----------------------------+--------------------+
|validName                    |validName           |
|name with whitespace         |namewithwhitespace  |
|name with numbers 1234       |namewithnumbers     |
|name with special chars !@#$%|namewithspecialchars|
+-----------------------------+--------------------+
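The question lists three separate clean-up steps, while the answer folds them into one pattern. If the steps need to stay separate (for example, to remove digits but keep spaces), one regular expression per step is an option. The plain-Java sketch below illustrates this without needing a Spark session. Spark's regexp_replace uses Java regular expressions, so the same pattern strings should carry over unchanged. The class name NameCleaner, its method names, and the \p{L} variant are illustrative assumptions and not part of the original answer.

```java
// Illustrative sketch: NameCleaner and its members are hypothetical names,
// not part of Spark or of the original answer.
final class NameCleaner {
    // One pattern per step listed in the question.
    static final String WHITESPACE = "\\s+";            // [1] spaces, tabs, newlines
    static final String DIGITS = "[0-9]+";              // [2] ASCII digits
    static final String SPECIALS = "[^a-zA-Z0-9\\s]+";  // [3] anything except ASCII letters, digits, whitespace

    // The answer's single pattern: every run of characters outside a-z and A-Z.
    static final String NON_LETTERS = "[^a-zA-Z]+";

    // Assumed variant that keeps accented and non-Latin letters as well.
    static final String NON_UNICODE_LETTERS = "[^\\p{L}]+";

    static String removeWhitespace(String s)   { return s.replaceAll(WHITESPACE, ""); }
    static String removeDigits(String s)       { return s.replaceAll(DIGITS, ""); }
    static String removeSpecialChars(String s) { return s.replaceAll(SPECIALS, ""); }
    static String keepLettersOnly(String s)    { return s.replaceAll(NON_LETTERS, ""); }
    static String keepUnicodeLetters(String s) { return s.replaceAll(NON_UNICODE_LETTERS, ""); }

    public static void main(String[] args) {
        String[] samples = {
            "validName",
            "name with whitespace ",
            "name with numbers 1234",
            "name with special chars !@#$%"
        };
        for (String s : samples) {
            // Applying the three steps in sequence gives the same result
            // as the single letters-only pattern.
            String stepwise = removeSpecialChars(removeDigits(removeWhitespace(s)));
            System.out.println("[" + s + "] -> [" + stepwise + "] / [" + keepLettersOnly(s) + "]");
        }
    }
}
```

In Spark, the same constants could be passed to chained calls such as functions.regexp_replace(functions.col("Name"), NameCleaner.WHITESPACE, ""), one call per step. Note that [^a-zA-Z]+ also drops accented letters, so a name like "José" becomes "Jos"; the \p{L} variant is one way to avoid that if the data contains such names.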
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | vdep |
