'How to remove spaces between tags in a delimited file?
I have this table dump from a MySQL system, and although it follows RFC standards, it appears to have added unwanted space in columns where HTML text are stored. For example:
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
This is one out of about 30K rows, so I am trying to figure out a smart way to remove the space between the " and <div (and possibly others) here. I tried out:
awk '{$1=$1;printf $0}'
And this kind of works, but it mashes everything into one line which is not what I want. I would like to preserve the line breaks in the CSV dump. I am very curious to hear your ideas on how to tackle this.
Solution 1:[1]
The following using GNU awk for multi-char RS, RT, and gensub() will work even if your input file is huge as it doesn't read the whole file into memory, it just reads the strings separated by "<spaces>< or newline one at a time:
$ awk -v RS='"\\s+<|\n' '{printf "%s%s", $0, gensub(/"\s+</,"\"<",1,RT)}' file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
I'm assuming that when you say and possibly others in your question you mean other case like "<spaces><div> where there's a " then spaces then a tag starting with < but that's obviously just a guess.
Solution 2:[2]
With your shown samples only, please try following awk code. Written and tested in GNU awk. Simple explanation would be, setting RS(record separator) as null and in main program, globally substituting new lines followed by spaces followed by <div with <div in lines and printing the lines by awkish way by using 1.
awk -v RS="" '{gsub(/\n+[[:space:]]+<div/,"<div")} 1' Input_file
Solution 3:[3]
Assuming your request is to remove the space before the start of the <div tag, you can try this GNU sed
$ sed -z 's/\(\"\)[[:space:]]\+\(<div .*\)/\1\n\2/' input_file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | RavinderSingh13 |
| Solution 3 |
