'extract a certain string from a value in a CSV and rewrite the CSV

I have a file that has fields like this

"1652476614-55","https://tr.uspoloassn.com/erkek-yesil-polo-yaka-t-shirt-basic-50249149-vr083/?integration_color=VR004","Erkek Açık Sarı Polo Yaka T-Shirt Basic","299,95 TL","50249149-VR004","<a href=""#"" class=""js-variant "" data-name=""integration_size"" data-value=""XL"" data-isvariant=""true"" data-pk=""165742"">XL</a>"

I want to remove everything in the last element and keep only what's between the quotes after data-value and rewrite the file again to look like this

"1652476614-55","https://tr.uspoloassn.com/erkek-yesil-polo-yaka-t-shirt-basic-50249149-vr083/?integration_color=VR004","Erkek Açık Sarı Polo Yaka T-Shirt Basic","299,95 TL","50249149-VR004","XL"

Any suggestions (python, shell script, etc ..)

csv


Solution 1:[1]

One method is to use awk.

BEGIN{
   FS="\",\""   # input is quoted fields with comma
   OFS="\";\""  # set output to be quoted, with semicolon
}
{
 # match value inside link tag 
 # match($6, />(.*?)<\/a>/, arr)  # capture content in HTML <a> tag

 # match value in data-value:   data-value=""CONTENT HERE""
 match($6, /data-value=""([^\"]*?)/, arr)

 gsub(/^"/, "", $1)         # remove the first quote in field 1. 
 print "\""$1,$2,$3,$4,$5,arr[1]"\""     # print with start and end quotes
}

Input: (abbreviated fields for simplicity!)

"55","http","TS","9,95 TL","VR004","<a href=""#"" data-value=""XL"" data-isvariant=""true"" data-pk=""165742"">XL</a>"

Result:

"55";"http";"TS";"9,95 TL";"VR004";"XL"

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1