'extract a certain string from a value in a CSV and rewrite the CSV
I have a file that has fields like this
"1652476614-55","https://tr.uspoloassn.com/erkek-yesil-polo-yaka-t-shirt-basic-50249149-vr083/?integration_color=VR004","Erkek Açık Sarı Polo Yaka T-Shirt Basic","299,95 TL","50249149-VR004","<a href=""#"" class=""js-variant "" data-name=""integration_size"" data-value=""XL"" data-isvariant=""true"" data-pk=""165742"">XL</a>"
I want to remove everything in the last element and keep only what's between the quotes after data-value and rewrite the file again to look like this
"1652476614-55","https://tr.uspoloassn.com/erkek-yesil-polo-yaka-t-shirt-basic-50249149-vr083/?integration_color=VR004","Erkek Açık Sarı Polo Yaka T-Shirt Basic","299,95 TL","50249149-VR004","XL"
Any suggestions (python, shell script, etc ..)
Solution 1:[1]
One method is to use awk.
BEGIN{
FS="\",\"" # input is quoted fields with comma
OFS="\";\"" # set output to be quoted, with semicolon
}
{
# match value inside link tag
# match($6, />(.*?)<\/a>/, arr) # capture content in HTML <a> tag
# match value in data-value: data-value=""CONTENT HERE""
match($6, /data-value=""([^\"]*?)/, arr)
gsub(/^"/, "", $1) # remove the first quote in field 1.
print "\""$1,$2,$3,$4,$5,arr[1]"\"" # print with start and end quotes
}
Input: (abbreviated fields for simplicity!)
"55","http","TS","9,95 TL","VR004","<a href=""#"" data-value=""XL"" data-isvariant=""true"" data-pk=""165742"">XL</a>"
Result:
"55";"http";"TS";"9,95 TL";"VR004";"XL"
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |