How do I remove numeric words in PySpark?
I'm trying to remove only the words that are purely numeric from my words array, but the function I created is not working correctly. When I try to view the data from my dataframe, the error message below appears.
First I converted my strings into word tokens:
from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(
    inputCol="description",
    outputCol="words_withnumber",
    pattern="\\W"
)
data = regexTokenizer.transform(data)
Then I created a function to remove only the numbers:
from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType
def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False
is_digit_udf = udf(is_digit, BooleanType())
Call function
data = data.withColumn(
    'words_withoutnumber',
    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Sample Dataframe
+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja", "40567","detalhe","couro"] |
| 22033|["multipane","6768686868686867868888","220v","branco"] |
+-----------+-----------------------------------------------------------+
Expected result:
+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja","detalhe","couro"] |
| 22033|["multipane","220v","branco"] |
+-----------+-----------------------------------------------------------+
Solution 1:[1]
With help from @pault, the solution was this.
from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType
def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())
Call function
from pyspark.sql.functions import col
from pyspark.sql.types import ArrayType, StringType

filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))
data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))
Solution 2:[2]
If you want to avoid udf() for performance reasons, and if commas won't appear in your "description" column, then the Scala solution below will work. The df.withColumn() call should be similar in PySpark.
Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.
scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]
scala> df.show(false)
+----------+-------------------------------------------------+
|categoryid|description |
+----------+-------------------------------------------------+
|33004 |[short, sarja, 40567, detalhe, couro] |
|22033 |[multipane, 6768686868686867868888, 220v, branco]|
|33033 |[0123, x220, 220v, 889] |
+----------+-------------------------------------------------+
scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""\b\d+\b""",""),"""^,|,$""",""),",,",","),",")).show(false)
+----------+-------------------------------------------------+------------------------------+
|categoryid|description |newc |
+----------+-------------------------------------------------+------------------------------+
|33004 |[short, sarja, 40567, detalhe, couro] |[short, sarja, detalhe, couro]|
|22033 |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco] |
|33033 |[0123, x220, 220v, 889] |[x220, 220v] |
+----------+-------------------------------------------------+------------------------------+
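The regexp chain above can be mirrored in plain Python to see what each step does. Like the Scala version, this assumes the tokens contain no commas; `drop_numeric_tokens` is a name chosen here for illustration:

```python
import re

def drop_numeric_tokens(tokens):
    # Join with commas, blank out standalone numbers, then tidy the commas
    s = ",".join(tokens)
    s = re.sub(r"\b\d+\b", "", s)   # remove purely numeric tokens
    s = re.sub(r"^,|,$", "", s)     # trim a leading/trailing comma
    s = re.sub(r",,", ",", s)       # collapse doubled commas
    return s.split(",")

print(drop_numeric_tokens(["0123", "x220", "220v", "889"]))  # ['x220', '220v']
```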
Spark 2.4 answer
Using Spark SQL in version 2.4 onwards, you can use the filter() higher-order function to get the results:
scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]
scala> df.createOrReplaceTempView("tab")
scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
+----------+------------------------------+
|categoryid|fw |
+----------+------------------------------+
|33004 |[short, sarja, detalhe, couro]|
|22033 |[multipane, 220v, branco] |
|33033 |[x220, 220v] |
+----------+------------------------------+
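The lower(x) != upper(x) predicate keeps any token containing at least one cased letter, since digits and punctuation are unchanged by case conversion. The same trick in plain Python:

```python
def has_letter(token):
    # A token changes under case conversion only if it contains a cased letter
    return token.lower() != token.upper()

tokens = ["multipane", "6768686868686867868888", "220v", "branco"]
print([t for t in tokens if has_letter(t)])  # ['multipane', '220v', 'branco']
```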
Solution 3:[3]
Now that you mentioned PySpark, let's include Python.
DataFrame
data = [(33004, ['short', 'sarja', '40567', 'detalhe', 'couro']),
        (22033, ['multipane', '6768686868686867868888', '220v', 'branco']),
        (33033, ['123', 'x220', '220v', '889'])]

df = spark.createDataFrame(data, ('categoryid', 'description'))
Code
from pyspark.sql.functions import expr

# Keep tokens containing at least one lowercase letter
df.withColumn("description_filtered", expr("filter(description, x -> x rlike '([a-z]+)')")).show()
Outcome
+----------+--------------------+--------------------+
|categoryid| description|description_filtered|
+----------+--------------------+--------------------+
| 33004|[short, sarja, 40...|[short, sarja, de...|
| 22033|[multipane, 67686...|[multipane, 220v,...|
| 33033|[123, x220, 220v,...| [x220, 220v]|
+----------+--------------------+--------------------+
# Keep tokens containing two consecutive non-digit characters
df.withColumn("description_filtered", expr("filter(description, x -> x rlike '([^0-9]{2})')")).show()
Outcome
+----------+--------------------+--------------------+
|categoryid| description|description_filtered|
+----------+--------------------+--------------------+
| 33004|[short, sarja, 40...|[short, sarja, de...|
| 22033|[multipane, 67686...| [multipane, branco]|
| 33033|[123, x220, 220v,...| []|
+----------+--------------------+--------------------+
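The two rlike patterns behave differently: '([a-z]+)' keeps any token with at least one lowercase letter, while '([^0-9]{2})' requires two consecutive non-digit characters, which also drops 'x220' and '220v'. Since rlike is a substring match, it can be simulated with re.search:

```python
import re

tokens = ["123", "x220", "220v", "889"]

# rlike matches anywhere in the string, like re.search
print([t for t in tokens if re.search(r"[a-z]+", t)])     # ['x220', '220v']
print([t for t in tokens if re.search(r"[^0-9]{2}", t)])  # []
```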
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | user3661384 |
| Solution 2 | |
| Solution 3 | |
