PySpark: remove leading 0s from a data frame column unless the value is just 0

(1) 00781 should be -> 781

(2) 00001 should be -> 1

(3) 00000 should be -> 0 (just one 0 for this special case)

(4) 10101 should be -> 10101 (unchanged)

What I have so far is

from pyspark.sql import functions as F
df = df.withColumn('ID',F.regexp_replace('ID', r'^[0]*',''))

But this is not quite right: `^[0]*` strips every zero, so the 00000 case produces an empty string instead of 0



Solution 1:[1]

I find it easier to use regexp_extract here, although there's almost certainly a way to do it with regexp_replace.

df.withColumn('id2', F.regexp_extract(F.col('ID'), '^0*(([^0].*)|0$)', 1)).show()

+-----+-----+
|   ID|  id2|
+-----+-----+
|00781|  781|
|00001|    1|
|00000|    0|
|10101|10101|
+-----+-----+

That regex captures either the substring starting at the first non-zero character, or the final 0 when the value consists only of zeros.
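For completeness, a regexp_replace variant is also possible: strip leading zeros only when at least one character would remain, using a lookahead such as `^0+(?=.)`. Spark's regexp_replace uses Java regular expressions, which support this lookahead syntax, so in PySpark that would look like `df.withColumn('ID', F.regexp_replace('ID', r'^0+(?=.)', ''))` (a sketch, not tested against your data). The pattern itself can be checked with Python's `re` module:

```python
import re

# Strip leading zeros only when at least one character remains
# afterwards, so "00000" keeps its final 0.
pattern = r'^0+(?=.)'

for s in ['00781', '00001', '00000', '10101']:
    print(re.sub(pattern, '', s))  # 781, 1, 0, 10101
```

On "00000" the greedy `0+` backtracks one step so the lookahead can see a remaining character, leaving a single 0.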


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ernest_k