PySpark: remove leading 0s from a DataFrame column, unless the value is just 0
(1) 00781 should be -> 781
(2) 00001 should be -> 1
(3) 00000 should be -> 0 (just one 0 for this special case)
(4) 10101 should be -> 10101 (unchanged)
What I have so far is
from pyspark.sql import functions as F
df = df.withColumn('ID',F.regexp_replace('ID', r'^[0]*',''))
But this is not entirely right, as it does not work for the 00000 -> 0 case: stripping every leading zero from 00000 leaves an empty string instead of 0.
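The failure mode can be reproduced with Python's `re` module (Spark's `regexp_replace` uses Java regex, but the semantics match for these patterns). One possible fix, sketched here under the assumption that Spark's Java regex engine accepts the same negative lookahead, is to refuse to strip a zero that is the last character:

```python
import re

def strip_all(s):
    """Mirror of the question's regex: strip every leading zero."""
    return re.sub(r'^0*', '', s)

def strip_keep_last(s):
    """Strip leading zeros, but keep a zero that is the final
    character, so an all-zeros input collapses to a single '0'."""
    return re.sub(r'^0+(?!$)', '', s)

print(strip_all('00781'))       # 781   - works
print(strip_all('00000'))       # empty - the reported bug

print(strip_keep_last('00781'))  # 781
print(strip_keep_last('00000'))  # 0
print(strip_keep_last('10101'))  # 10101
```

If the lookahead works the same way in Spark, the equivalent column expression would be `F.regexp_replace('ID', r'^0+(?!$)', '')`, though that is an assumption worth testing on your cluster.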
Solution 1:[1]
I find it easier to regexp_extract it, although there's almost certainly a way to do it with regexp_replace.
df.withColumn('id2', F.regexp_extract(F.col('ID'), '^0*(([^0].*)|0$)', 1)).show()
+-----+-----+
| ID| id2|
+-----+-----+
|00781| 781|
|00001| 1|
|00000| 0|
|10101|10101|
+-----+-----+
That regex simply captures either the substring starting at the first non-zero character,
or the final 0 when the string contains only 0s.
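The capture-group behavior can be checked with Python's `re` module (Spark uses Java regex, but these constructs behave identically here): the greedy `^0*` backtracks by one on an all-zeros input so that the `0$` alternative can claim the final zero.

```python
import re

# The answer's pattern: skip leading zeros, then capture either the
# rest of the string starting at a non-zero, or a lone trailing zero.
pattern = re.compile(r'^0*(([^0].*)|0$)')

for s in ['00781', '00001', '00000', '10101']:
    print(s, '->', pattern.match(s).group(1))
```

For '00000', `^0*` first consumes all five zeros, the group then fails to match, and the engine backtracks so `0$` can match the last character, yielding '0'.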
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ernest_k |
