Regex - Remove space chars whenever there are more than one in succession, but exclude all lines commented out
Let's say I have a few lines as follows:
01090 C -------CALCULATION OF SOMETHING--
01100 "SOME.VARIABLE"   =  "SOME.OTHER.VARIABLE" + 2
01110 IF("SOME.VARIABLE" .NE. "SOME.VALUE") THEN ON("SOME.MACHINE")
I would like to go through the program and collapse every run of two or more consecutive space characters into a single space. For example, line 01100 has three (3) space characters before the "=" and two (2) after. In line 01110, there are several locations with more than one consecutive space character. I do NOT want to remove or alter the spaces contained within the commented line 01090.
All lines begin with 5 digits followed by a tab, and only commented lines have a "C" or a "c" right after that tab, which marks them as commented out.
I am using Sublime Text 3, which uses the Boost regex engine. I have tried things like:
(?!\t[Cc] )[ ]{2,}
(?!\t[Cc])[ ]{2,}
I can't seem to determine how to negate an entire line without also capturing an entire line.
I tried putting a caret in the beginning as well, but that didn't seem to help.
Basically, if the line has a "TAB" followed by a "c" or a "C", then ignore the entire thing. Otherwise, any two or more consecutive space chars are located and replaced with a single space char.
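The rule above can also be applied one line at a time, which avoids having to exclude comment lines inside a single pattern. The following is a minimal C++ sketch of that idea, not what the question itself uses: it relies on `std::regex` rather than Sublime's Boost engine, and the function name `collapse_spaces` is made up for illustration. It assumes lines are separated by `\n` and that a comment line starts with digits, a tab, and then `C` or `c`.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): collapses every run of
// two or more spaces to a single space, except on comment lines.
std::string collapse_spaces(const std::string& program) {
    // Assumption: a comment line is digits, a tab, then 'C' or 'c'.
    static const std::regex comment_prefix(R"(^\d+\t[Cc])");
    static const std::regex space_run(" {2,}");

    std::istringstream in(program);
    std::string line, result;
    bool first = true;
    while (std::getline(in, line)) {
        if (!first) result += '\n';
        first = false;
        if (std::regex_search(line, comment_prefix))
            result += line;  // comment line: copied unchanged
        else
            result += std::regex_replace(line, space_run, " ");
    }
    // std::getline drops the final newline, so restore it if there was one.
    if (!program.empty() && program.back() == '\n') result += '\n';
    return result;
}
```

Calling `collapse_spaces` on the whole program text returns a new string and leaves the input untouched.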
EDIT
--------- solution ---------
Thanks to the input from Wiktor and The fourth bird, I was able to determine the solution. Many thanks to both. Here's what I ended up with:
^\d+\t[cC].*\K|[ ]{2,}
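The pattern works because the first alternative consumes a whole comment line before the `[ ]{2,}` alternative can see it, and `\K` then discards what was consumed, so nothing on that line is replaced. `std::regex` has no `\K`, so the sketch below imitates the same "match what you want to skip first" idea with a capture group instead: if group 1 matched, the text is written back unchanged, otherwise the match is a run of spaces and becomes one space. The function name `collapse_outside_comments` is hypothetical, and `(?:^|\n)` stands in for the multiline `^` of the original pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): one pass over the whole
// text, emulating the "skip alternative | replace alternative" trick.
std::string collapse_outside_comments(const std::string& text) {
    // Group 1: a full comment line (optionally with the newline before it).
    // Otherwise: a run of two or more spaces.
    static const std::regex token(R"(((?:^|\n)\d+\t[cC].*)| {2,})");

    std::string out;
    auto last = text.cbegin();
    for (std::sregex_iterator it(text.cbegin(), text.cend(), token), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);  // text between matches, unchanged
        if (m[1].matched)
            out.append(m[1].first, m[1].second);  // comment line: keep as-is
        else
            out += ' ';  // space run: collapse to one space
        last = m[0].second;
    }
    out.append(last, text.cend());  // tail after the last match
    return out;
}
```

The two alternatives can never start at the same position (one begins with a digit or newline, the other with a space), so the result does not depend on how the engine orders alternatives.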
I also determined that should there be extra spaces at the end of a line, I might want to ignore those as well so I can remove them completely with a different regex search. The final product looks like this:
^\d+\t[cC].*\K|[ ]*\n\K|[ ]{2,}
Had I not been limited to the Boost or PCRE engines, I believe one of my earlier failed attempts would have worked. I'll include it here in case it helps someone else. It can't be used with Boost or PCRE because they don't support variable-length (unbounded) lookbehinds:
(?<!\t[cC].*)[ ]{2,}
Solution 1:[1]
You have to add a negative lookahead and a negative lookbehind to your regex. Try something like this:
(?<![Cc])\s{3,}(?![Cc])
Solution 2:[2]
I think you might actually be preparing to parse this language. Parsers are rarely convenient to write with regular expressions.
Also, you didn't ask, but in this case the transformation can be done in place, since the output is never longer than the input.
I'd suggest a PEG grammar like this (using Boost Spirit X3):
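#include <boost/spirit/home/x3.hpp>

// Copies [f, l) to out, collapsing runs of blanks on non-comment lines.
template <typename In, typename Out>
Out compress_whitespace(In f, In l, Out out) {
    // Semantic action: append the parsed attribute (a char or a raw range) to out.
    auto copy = [&out](auto& ctx) {
        struct Append {
            static void call(Out& out, char ch) { *out++ = ch; }
            static void call(Out& out, boost::iterator_range<In> raw) {
                for (auto ch : raw) *out++ = ch;
            }
        };
        Append::call(out, _attr(ctx));
    };

    using namespace boost::spirit::x3;
    auto prefix  = raw[uint_ >> " "][copy];             // line number and separator
    auto comment = raw["C " >> *(char_ - eol)][copy];   // comment text, copied verbatim
    auto code_ch = omit[+blank] >> attr(' ')[copy]      // run of blanks -> one space
                 | (char_ - eol)[copy];                 // any other character as-is
    auto line    = prefix >> (comment | *code_ch);
    auto newline = raw[eol][copy];

    parse(f, l, -line % newline);                       // lines may be empty
    return out;
}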
To disallow empty lines:
parse(f, l, line % newline);
To throw on incomplete or invalid input, change the parse line to:
parse(f, l, expect[line % newline >> *newline >> eoi]);
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main(int argc, char** argv)
{
    std::ostreambuf_iterator out(std::cout);

    // Process every file named on the command line.
    for (std::string file : std::vector(argv+1, argv+argc)) {
        std::ifstream s(file, std::ios::binary);
        std::string const program(std::istreambuf_iterator<char>{s}, {});

        compress_whitespace(begin(program), end(program), out);
    }
}
To compare the output with the input side by side, run vim -d input.txt <(./sotest input.txt).
BONUS: In-place processing
Since the output is never longer than the input, you can afford to process in place:
std::string program = R"~(
01090 C -------CALCULATION OF SOMETHING--
01100 "SOME.VARIABLE" = "SOME.OTHER.VARIABLE" + 2
01110 IF("SOME.VARIABLE" .NE. "SOME.VALUE") THEN ON("SOME.MACHINE"))~";
auto b = begin(program), e = end(program),
new_e = compress_whitespace(b, e, b);
std::cout << "Shorter by " << (e - new_e) << " chars\n";
program.erase(new_e, e);
std::cout << program << "\n";
See it Live On Coliru, printing:
Shorter by 7 chars
01090 C -------CALCULATION OF SOMETHING--
01100 "SOME.VARIABLE" = "SOME.OTHER.VARIABLE" + 2
01110 IF("SOME.VARIABLE" .NE. "SOME.VALUE") THEN ON("SOME.MACHINE")
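Why in-place processing is safe can be shown without Spirit: the write position never gets ahead of the read position, so no unread character is ever overwritten. The following is a hand-written sketch under stated assumptions, not the Spirit parser above. The name `compress_in_place` is hypothetical, the separator after the line number is assumed to be a tab as described in the question, and only space characters are collapsed (the Spirit version collapses any `blank`, which includes tabs).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original answer): collapses space runs
// on non-comment lines, reusing the input buffer for the output.
void compress_in_place(std::string& program) {
    const std::size_t n = program.size();
    std::size_t read = 0;   // start of the line currently being read
    std::size_t write = 0;  // next output position; always <= read position

    while (read < n) {
        std::size_t eol = program.find('\n', read);
        if (eol == std::string::npos) eol = n;

        // Classify the line before writing anything that belongs to it.
        // Assumption: comment lines are digits, a tab, then 'C' or 'c'.
        std::size_t p = read;
        while (p < eol && std::isdigit(static_cast<unsigned char>(program[p]))) ++p;
        const bool comment = p > read && p + 1 < eol && program[p] == '\t' &&
                             (program[p + 1] == 'C' || program[p + 1] == 'c');

        bool prev_space = false;
        for (std::size_t i = read; i < eol; ++i) {
            const char ch = program[i];
            const bool is_space = (ch == ' ');
            if (!comment && is_space && prev_space) continue;  // drop extra space
            program[write++] = ch;
            prev_space = is_space;
        }
        if (eol < n) program[write++] = '\n';  // keep the line terminator
        read = (eol < n) ? eol + 1 : n;
    }
    program.resize(write);  // cut off the leftover tail
}
```

Because characters are only ever dropped, `write` trails or equals the index being read, which is the same property that lets `compress_whitespace(b, e, b)` reuse its input range.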
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Danyer Dominguez |
| Solution 2 | sehe |

