'Match non printable/non ascii characters and remove from text

My JavaScript is quite rusty so any help with this would be great. I have a requirement to detect non printable characters (control characters like SOH, BS etc) as well extended ascii characters such as Ž in a string and remove them but I am not sure how to write the code?

Can anyone point me in the right direction for how to go about this? This is what I have so far:

$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        var pattern = /[^\000-\031]+/gi;
        var val = $(this).val();
        if (pattern.test(val)) {    
        for (var i = 0; i < val.length; i++) {
            var res = val.charAt([i]);
                alert("Character " + [i] + " " + res);              
        }          
    }
    else {
         alert("It failed");
     }

    });
});

Solution 1:^[1]

To target characters that are not part of the printable basic ASCII range, you can use this simple regex:

[^ -~]+

Explanation: in the first 128 characters of the ASCII table, the printable range starts with the space character and ends with a tilde. These are the characters you want to keep. That range is expressed with [ -~], and the characters not in that range are expressed with [^ -~]. These are the ones we want to replace. Therefore:

result = string.replace(/[^ -~]+/g, "");

Solution 2:^[2]

No need to test, you can directly process the text box content:

textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');

where the range \x20-\x7E covers the printable part of the ascii table.

Example with your code:

$('.jsTextArea').blur(function() {
    this.value = this.value.replace(/[^\x20-\x7E]+/g, '');
});

Solution 3:^[3]

You have to assign a pattern (instead of string) into isNonAscii variable, then use test() to check if it matches. test() returns true or false.

$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        var pattern = /[^\000-\031]+/gi;
        var val = $(this).val();
        if (pattern.test(val)) {
            alert("It matched");
        }
        else {
            alert("It did NOT match");
        }
    });
});

Check jsFiddle

Solution 4:^[4]

For anyone looking for a solution that works beyond ascii and does not strip out Unicode chars:

function stripNonPrintableAndNormalize(text) {
    // strip control chars
    text = text.replace(/\p{C}/gu, '');

    // other common tasks are to normalize newlines and other whitespace

    // normalize newline
    text = text.replace(/\n\r/g, '\n');
    text = text.replace(/\p{Zl}/gu, '\n');
    text = text.replace(/\p{Zp}/gu, '\n');

    // normalize space
    text = text.replace(/\p{Zs}/gu, ' ');

    return text;
}

The various unicode class identifiers (e.g. Zl for line separator) are defined at https://www.unicode.org/reports/tr44/ as also shown below:

Abbr	Long	Description
Lu	Uppercase_Letter	an uppercase letter
Ll	Lowercase_Letter	a lowercase letter
Lt	Titlecase_Letter	a digraphic character, with first part uppercase
LC	Cased_Letter	Lu \| Ll \| Lt
Lm	Modifier_Letter	a modifier letter
Lo	Other_Letter	other letters, including syllables and ideographs
L	Letter	Lu \| Ll \| Lt \| Lm \| Lo
Mn	Nonspacing_Mark	a nonspacing combining mark (zero advance width)
Mc	Spacing_Mark	a spacing combining mark (positive advance width)
Me	Enclosing_Mark	an enclosing combining mark
M	Mark	Mn \| Mc \| Me
Nd	Decimal_Number	a decimal digit
Nl	Letter_Number	a letterlike numeric character
No	Other_Number	a numeric character of other type
N	Number	Nd \| Nl \| No
Pc	Connector_Punctuation	a connecting punctuation mark, like a tie
Pd	Dash_Punctuation	a dash or hyphen punctuation mark
Ps	Open_Punctuation	an opening punctuation mark (of a pair)
Pe	Close_Punctuation	a closing punctuation mark (of a pair)
Pi	Initial_Punctuation	an initial quotation mark
Pf	Final_Punctuation	a final quotation mark
Po	Other_Punctuation	a punctuation mark of other type
P	Punctuation	Pc \| Pd \| Ps \| Pe \| Pi \| Pf \| Po
Sm	Math_Symbol	a symbol of mathematical use
Sc	Currency_Symbol	a currency sign
Sk	Modifier_Symbol	a non-letterlike modifier symbol
So	Other_Symbol	a symbol of other type
S	Symbol	Sm \| Sc \| Sk \| So
Zs	Space_Separator	a space character (of various non-zero widths)
Zl	Line_Separator	U+2028 LINE SEPARATOR only
Zp	Paragraph_Separator	U+2029 PARAGRAPH SEPARATOR only
Z	Separator	Zs \| Zl \| Zp
Cc	Control	a C0 or C1 control code
Cf	Format	a format control character
Cs	Surrogate	a surrogate code point
Co	Private_Use	a private-use character
Cn	Unassigned	a reserved unassigned code point or a noncharacter
C	Other	Cc \| Cf \| Cs \| Co \| Cn

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Jonathan
Solution 2
Solution 3	kosmos
Solution 4

'Match non printable/non ascii characters and remove from text

Solution 1:[1]

Solution 2:[2]

Solution 3:[3]

Solution 4:[4]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]

Solution 3:^[3]

Solution 4:^[4]