How to split comma-separated values in a CSV column into separate rows using PySpark
I have a log file in CSV format in which one column contains a list of file paths separated by commas. I want to split those file paths into separate rows using PySpark (or Excel). The original data looks like this:
+----------+----------------------------------------------------------------------------+
|time |message |
+----------+----------------------------------------------------------------------------+
|4-19 20:00|[info] Delete object in ['03-26/abc/123.jpg', '03-26/abc/456.jpg'] |
+----------+----------------------------------------------------------------------------+
|4-19 21:00|[info] Delete object in ['03-27/def/789.jpg', '03-27/def/012.jpg'] |
+----------+----------------------------------------------------------------------------+
I'd like it to be converted into this:
+-----------------+
|path |
+-----------------+
|03-26/abc/123.jpg|
+-----------------+
|03-26/abc/456.jpg|
+-----------------+
|03-27/def/789.jpg|
+-----------------+
|03-27/def/012.jpg|
+-----------------+
Solution 1:[1]
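Extract the bracketed list of paths from the message column with a regular expression, parse it as a JSON array of strings, and explode the array into one row per path:
from pyspark.sql import functions as F

(df
    # Raw string keeps the regex backslashes from being read as Python escape sequences.
    .withColumn('paths', F.explode(F.from_json(F.regexp_extract('message', r"\['[^\]]+]", 0), 'array<string>')))
    .show(10, False)
)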
+----------+------------------------------------------------------------------+-----------------+
|time |message |paths |
+----------+------------------------------------------------------------------+-----------------+
|4-19 20:00|[info] Delete object in ['03-26/abc/123.jpg', '03-26/abc/456.jpg']|03-26/abc/123.jpg|
|4-19 20:00|[info] Delete object in ['03-26/abc/123.jpg', '03-26/abc/456.jpg']|03-26/abc/456.jpg|
|4-19 21:00|[info] Delete object in ['03-27/def/789.jpg', '03-27/def/012.jpg']|03-27/def/789.jpg|
|4-19 21:00|[info] Delete object in ['03-27/def/789.jpg', '03-27/def/012.jpg']|03-27/def/012.jpg|
+----------+------------------------------------------------------------------+-----------------+
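To check the pattern without a Spark session, the same extract, parse, and explode steps can be sketched in plain Python. This is only an illustration under assumptions: `extract_paths`, `LIST_PATTERN`, and the in-memory `rows` list are hypothetical names that are not part of the answer above, and `ast.literal_eval` stands in for Spark's `from_json`.

```python
import ast
import re

# Same idea as the Spark pattern: a '[' immediately followed by a quote,
# then everything up to the closing ']'. The leading "[info]" is skipped
# because its '[' is not followed by a quote.
LIST_PATTERN = re.compile(r"\['[^\]]+\]")


def extract_paths(message):
    """Return the file paths listed in one log message (empty list if none)."""
    match = LIST_PATTERN.search(message)
    if match is None:
        return []
    # The matched text is a list literal of quoted strings, so it can be
    # parsed safely without eval().
    return ast.literal_eval(match.group(0))


# Sample rows copied from the question (time, message).
rows = [
    ("4-19 20:00", "[info] Delete object in ['03-26/abc/123.jpg', '03-26/abc/456.jpg']"),
    ("4-19 21:00", "[info] Delete object in ['03-27/def/789.jpg', '03-27/def/012.jpg']"),
]

# "Explode": flatten to one entry per path, dropping the other columns.
paths = [path for _, message in rows for path in extract_paths(message)]

for path in paths:
    print(path)
```

Messages without a bracketed list yield no paths here, so it is worth checking how such rows should be handled before running the Spark version on the full log.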
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pltc |
