Need regex (using C#) to condense all whitespace into single whitespaces

I need to replace multiple whitespaces with a single whitespace (per iteration) in a document. It doesn't matter whether they are spaces, tabs or newlines; any combination of any kind of whitespace needs to be collapsed to a single whitespace.

Let's say we have the string: "Hello,\t \t\n  \t    \n world", (where \t and \n represent tabs and newlines respectively) then I'd need it to become "Hello, world".

I'm so completely bewildered by regex more generally that I ended up just asking.

Considerations:

  • I have no control over the document, since it could be any document on the internet.

  • I'm using C#, so if anyone knows how to do this in C# specifically, that would be even more awesome.

  • I don't really have to use regex (before someone asks), but I figured it's probably the optimal way, since regex is designed for this sort of stuff, and my own strpos/str_replace/substr soup would probably not perform as well. Performance is important on this one so what I'm essentially looking for is an efficient way to do this to any random text file on the internet (remember, I can't predict the size!).

Thanks in advance!



Solution 1:[1]

newString = Regex.Replace(oldString, @"\s+", " ");

The "\s" is a regex character class for any whitespace character, and the + means "one or more". It replaces each occurrence with a single space character.
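
As a minimal sketch of how that one-liner might sit in a complete program, the example below runs it on the sample string from the question. The names `WhitespaceDemo`, `WhitespaceRun` and `Condense` are made up for illustration, and caching the `Regex` in a static field with `RegexOptions.Compiled` is an optional choice that may help when the same pattern is applied to many documents; it is not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Hypothetical helper field: one cached Regex instance reused across calls.
    private static readonly Regex WhitespaceRun =
        new Regex(@"\s+", RegexOptions.Compiled);

    // Hypothetical helper: every run of one or more whitespace
    // characters is replaced by a single space.
    public static string Condense(string input)
    {
        return WhitespaceRun.Replace(input, " ");
    }

    public static void Main()
    {
        string oldString = "Hello,\t \t\n  \t    \n world";
        Console.WriteLine(Condense(oldString)); // prints "Hello, world"
    }
}
```

Note that leading or trailing whitespace is also condensed to one space rather than removed; call `Trim()` on the result if that is not wanted.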

Solution 2:[2]

As someone who sympathizes with Jamie Zawinski's position on Regex, I'll offer an alternative for what it's worth.

Not wanting to be religious about it, but I'd say it's faster than Regex, though whether you'll ever be processing strings long enough to see the difference is another matter.

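    // Requires: using System.Text; (for StringBuilder)
    public static string CompressWhiteSpace(string value)
    {
        if (value == null) return null;

        // True while the scan is inside a run of whitespace characters.
        bool inWhiteSpace = false;
        // The result is never longer than the input, so pre-size the buffer.
        StringBuilder builder = new StringBuilder(value.Length);

        foreach (char c in value)
        {
            if (Char.IsWhiteSpace(c))
            {
                // Defer output until the run ends.
                inWhiteSpace = true;
            }
            else
            {
                // Emit one space for the whole run that just ended.
                if (inWhiteSpace) builder.Append(' ');
                inWhiteSpace = false;
                builder.Append(c);
            }
        }
        // Note: a trailing whitespace run is dropped, while a leading run
        // still becomes a single space.
        return builder.ToString();
    }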

Solution 3:[3]

I would suggest you replace your chomp with
 $line =~ s/\s+$//;

which will strip off all trailing whitespace: tabs, spaces, newlines and carriage returns as well.

Taken from: http://www.wellho.net/forum/Perl-Programming/New-line-characters-beware.html

I'm aware it's Perl, but it should be helpful enough for you.
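
For readers who don't use Perl, a rough C# counterpart of that substitution might look like the sketch below. It only strips whitespace at the end of a string and does not condense runs in the middle, so it covers just part of what the question asks. `TrailingWhitespaceDemo` and `StripTrailing` are illustrative names, not from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingWhitespaceDemo
{
    // Hypothetical helper: C# counterpart of Perl's s/\s+$//
    public static string StripTrailing(string line)
    {
        return Regex.Replace(line, @"\s+$", "");
    }

    public static void Main()
    {
        string line = "some text \t\r\n";
        Console.WriteLine("[" + StripTrailing(line) + "]"); // prints "[some text]"
        // string.TrimEnd() with no arguments also removes trailing whitespace.
        Console.WriteLine("[" + line.TrimEnd() + "]");      // prints "[some text]"
    }
}
```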

Solution 4:[4]

Actually I think an extension method would probably be more efficient as you don't have the state machine overhead of the regex. Essentially, it becomes a very specialized pattern matcher.

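// Requires: using System.Text; (for StringBuilder).
// As an extension method, this must be declared inside a static class.
public static string Collapse( this string source )
{
    if (string.IsNullOrEmpty( source ))
    {
        return source;
    }

    StringBuilder builder = new StringBuilder();
    // True while the scan is inside a run of whitespace characters.
    bool inWhiteSpace = false;
    // True once the first non-whitespace character has been written.
    bool sawFirst = false;
    foreach (var c in source)
    {
        if (char.IsWhiteSpace(c))
        {
            inWhiteSpace = true;
        }
        else
        {
            // only output a whitespace if followed by non-whitespace
            // except at the beginning of the string
            if (inWhiteSpace && sawFirst)
            {
                builder.Append(" ");
            }
            inWhiteSpace = false;
            sawFirst = true;
            builder.Append(c);
        }
    }
    // Leading and trailing whitespace runs are dropped entirely.
    return builder.ToString();
}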
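
Since the question mentions documents of unpredictable size, the same state-flag idea could also be applied to a stream rather than a string. The following is only a sketch under that assumption: `StreamingCollapseDemo` and its `Collapse(TextReader, TextWriter)` method are hypothetical names, and the code has not been benchmarked against the regex approach.

```csharp
using System;
using System.IO;

public static class StreamingCollapseDemo
{
    // Hypothetical helper: same logic as the string version, but it reads
    // and writes one character at a time, so the whole document never
    // has to be held in memory at once.
    public static void Collapse(TextReader reader, TextWriter writer)
    {
        bool inWhiteSpace = false;
        bool sawFirst = false;
        int next;
        while ((next = reader.Read()) != -1) // Read() returns -1 at end of input
        {
            char c = (char)next;
            if (char.IsWhiteSpace(c))
            {
                inWhiteSpace = true;
            }
            else
            {
                // One space per whitespace run, none before the first character.
                if (inWhiteSpace && sawFirst) writer.Write(' ');
                inWhiteSpace = false;
                sawFirst = true;
                writer.Write(c);
            }
        }
    }

    public static void Main()
    {
        using (var reader = new StringReader("  Hello,\t \n world  "))
        using (var writer = new StringWriter())
        {
            Collapse(reader, writer);
            Console.WriteLine("[" + writer.ToString() + "]"); // prints "[Hello, world]"
        }
    }
}
```

In practice the reader could be a `StreamReader` over a file or network stream; a `StringReader` is used here only to keep the example self-contained.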

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 womp
Solution 2 Joe
Solution 3 Woot4Moo
Solution 4 tvanfosson