Match non-printable/non-ASCII characters and remove from text
My JavaScript is quite rusty, so any help with this would be great. I have a requirement to detect non-printable characters (control characters like SOH, BS, etc.) as well as extended ASCII characters such as Ž in a string and remove them, but I am not sure how to write the code.
Can anyone point me in the right direction for how to go about this? This is what I have so far:
$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        var pattern = /[^\000-\031]+/gi;
        var val = $(this).val();
        if (pattern.test(val)) {
            for (var i = 0; i < val.length; i++) {
                var res = val.charAt([i]);
                alert("Character " + [i] + " " + res);
            }
        }
        else {
            alert("It failed");
        }
    });
});
Solution 1:[1]
To target characters that are not part of the printable basic ASCII range, you can use this simple regex:
[^ -~]+
Explanation: in the first 128 characters of the ASCII table, the printable range starts with the space character and ends with a tilde. These are the characters you want to keep. That range is expressed with [ -~], and the characters not in that range are expressed with [^ -~]. These are the ones we want to replace. Therefore:
result = string.replace(/[^ -~]+/g, "");
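As a minimal sketch of the replacement above, wrapped in a helper so it can be tried in isolation (the name `stripToPrintableAscii` and the sample string are hypothetical, not from the answer):

```javascript
// Hypothetical helper: keeps only printable ASCII (space through tilde).
function stripToPrintableAscii(text) {
  return text.replace(/[^ -~]+/g, "");
}

// "\x01" is SOH, "\b" is BS, "\x7F" is DEL, and "\u017D" is Ž.
console.log(stripToPrintableAscii("He\x01llo\b, \u017Dwor\x7Fld"));
// prints: Hello, world
```

Note that tabs and line breaks are below the space character in the ASCII table, so this regex removes them too.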
Solution 2:[2]
No need to test, you can directly process the text box content:
textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');
where the range \x20-\x7E covers the printable part of the ASCII table (space through tilde).
Example with your code:
$('.jsTextArea').blur(function() {
    this.value = this.value.replace(/[^\x20-\x7E]+/g, '');
});
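One thing to watch with a textarea: line breaks (`\n`, `\r`) and tabs fall outside `\x20-\x7E`, so the replacement above removes them as well. If multi-line input should survive, a possible variant adds those characters to the kept set. This is a sketch under that assumption rather than part of the original answer, and the helper name is hypothetical:

```javascript
// Hypothetical helper: printable ASCII plus tab, LF and CR are kept.
function stripKeepingLineBreaks(text) {
  return text.replace(/[^\x20-\x7E\t\n\r]+/g, '');
}

// "\x08" is BS and "\u017D" is Ž; both are removed, the line break and tab stay.
const cleaned = stripKeepingLineBreaks("line 1\r\nline\x08 2\t\u017D");
console.log(JSON.stringify(cleaned));
// prints: "line 1\r\nline 2\t"
```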
Solution 3:[3]
You have to assign a regular expression (not a string) to the pattern variable, then call test() on it to check whether the text matches. test() returns true or false.
$(document).ready(function() {
    $('.jsTextArea').blur(function() {
        // Matches any single character outside printable ASCII (\x20-\x7E).
        var pattern = /[^\x20-\x7E]/;
        var val = $(this).val();
        if (pattern.test(val)) {
            alert("It matched");
        }
        else {
            alert("It did NOT match");
        }
    });
});
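A caveat when combining `test()` with the `g` flag: a global regex stores the end of its last match in `lastIndex`, so calling `test()` repeatedly on the same regex object can return different results for the same input. The handler above builds its regex on every blur, so it is not affected, but the effect appears as soon as the pattern is hoisted into a shared variable. A small illustration, with variable names made up for the demo:

```javascript
const globalPattern = /[^\x20-\x7E]/g; // stateful: remembers lastIndex
const plainPattern = /[^\x20-\x7E]/;   // stateless: always searches from 0
const input = "\u017D";                // "Ž", a single non-ASCII character

const globalResults = [globalPattern.test(input), globalPattern.test(input)];
const plainResults = [plainPattern.test(input), plainPattern.test(input)];

console.log(globalResults.join(", ")); // prints: true, false
console.log(plainResults.join(", "));  // prints: true, true
```

When only a yes/no answer is needed, a regex without `g` sidesteps this state entirely.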
Solution 4:[4]
For anyone looking for a solution that works beyond ASCII and does not strip out printable Unicode characters:
function stripNonPrintableAndNormalize(text) {
    // other common tasks are to normalize newlines and other whitespace
    // normalize newlines first: CRLF and lone CR become LF
    text = text.replace(/\r\n?/g, '\n');
    // U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR become LF
    text = text.replace(/[\p{Zl}\p{Zp}]/gu, '\n');
    // strip control chars and the rest of category C, but keep LF
    // (\n is itself a control char, so it has to be excluded explicitly)
    text = text.replace(/(?!\n)\p{C}/gu, '');
    // normalize space
    text = text.replace(/\p{Zs}/gu, ' ');
    return text;
}
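To see what "beyond ASCII" means in practice, the following sketch compares the ASCII-only regex from the earlier solutions with `\p{C}` on the same input. The sample string and helper names are made up for illustration. Unicode property escapes need the `u` flag and an engine that supports ES2018.

```javascript
// Hypothetical helpers for a side-by-side comparison.
function stripNonAscii(text) {
  return text.replace(/[^\x20-\x7E]+/g, '');
}

function stripOtherCategory(text) {
  return text.replace(/\p{C}/gu, '');
}

// Ž, SOH (Cc), space, "caf", ZERO WIDTH SPACE (Cf), é
const sample = '\u017D\u0001 caf\u200B\u00E9';

console.log(JSON.stringify(stripNonAscii(sample)));      // prints: " caf"
console.log(JSON.stringify(stripOtherCategory(sample))); // prints: "Ž café"
```

Keep in mind that `\p{C}` also covers `\n`, `\r` and `\t`, since they are control characters (Cc).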
The various Unicode general category identifiers (e.g. Zl for line separator) are defined at https://www.unicode.org/reports/tr44/ and are listed below:
| Abbr | Long | Description |
|---|---|---|
| Lu | Uppercase_Letter | an uppercase letter |
| Ll | Lowercase_Letter | a lowercase letter |
| Lt | Titlecase_Letter | a digraphic character, with first part uppercase |
| LC | Cased_Letter | Lu \| Ll \| Lt |
| Lm | Modifier_Letter | a modifier letter |
| Lo | Other_Letter | other letters, including syllables and ideographs |
| L | Letter | Lu \| Ll \| Lt \| Lm \| Lo |
| Mn | Nonspacing_Mark | a nonspacing combining mark (zero advance width) |
| Mc | Spacing_Mark | a spacing combining mark (positive advance width) |
| Me | Enclosing_Mark | an enclosing combining mark |
| M | Mark | Mn \| Mc \| Me |
| Nd | Decimal_Number | a decimal digit |
| Nl | Letter_Number | a letterlike numeric character |
| No | Other_Number | a numeric character of other type |
| N | Number | Nd \| Nl \| No |
| Pc | Connector_Punctuation | a connecting punctuation mark, like a tie |
| Pd | Dash_Punctuation | a dash or hyphen punctuation mark |
| Ps | Open_Punctuation | an opening punctuation mark (of a pair) |
| Pe | Close_Punctuation | a closing punctuation mark (of a pair) |
| Pi | Initial_Punctuation | an initial quotation mark |
| Pf | Final_Punctuation | a final quotation mark |
| Po | Other_Punctuation | a punctuation mark of other type |
| P | Punctuation | Pc \| Pd \| Ps \| Pe \| Pi \| Pf \| Po |
| Sm | Math_Symbol | a symbol of mathematical use |
| Sc | Currency_Symbol | a currency sign |
| Sk | Modifier_Symbol | a non-letterlike modifier symbol |
| So | Other_Symbol | a symbol of other type |
| S | Symbol | Sm \| Sc \| Sk \| So |
| Zs | Space_Separator | a space character (of various non-zero widths) |
| Zl | Line_Separator | U+2028 LINE SEPARATOR only |
| Zp | Paragraph_Separator | U+2029 PARAGRAPH SEPARATOR only |
| Z | Separator | Zs \| Zl \| Zp |
| Cc | Control | a C0 or C1 control code |
| Cf | Format | a format control character |
| Cs | Surrogate | a surrogate code point |
| Co | Private_Use | a private-use character |
| Cn | Unassigned | a reserved unassigned code point or a noncharacter |
| C | Other | Cc \| Cf \| Cs \| Co \| Cn |
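The two-letter abbreviations in the table can be used directly inside `\p{...}` in a JavaScript regex with the `u` flag. As a rough sketch of how to look up the category of a character when deciding what to strip, the hypothetical helper below (not part of the answer) tries each abbreviation in turn:

```javascript
// Two-letter General_Category abbreviations from the table above.
const CATEGORIES = [
  'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Mn', 'Mc', 'Me', 'Nd', 'Nl', 'No',
  'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po', 'Sm', 'Sc', 'Sk', 'So',
  'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Cs', 'Co', 'Cn',
];

// Hypothetical helper: returns the abbreviation for a single code point,
// or undefined if the argument is not exactly one code point.
function generalCategory(ch) {
  return CATEGORIES.find(
    (abbr) => new RegExp('^\\p{' + abbr + '}$', 'u').test(ch)
  );
}

console.log(generalCategory('\u017D')); // Ž, prints: Lu
console.log(generalCategory('\x01'));   // SOH, prints: Cc
console.log(generalCategory('\u2028')); // LINE SEPARATOR, prints: Zl
```

Building a regex per lookup is slow, so this is meant for exploring inputs, not for bulk processing.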
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jonathan |
| Solution 2 | |
| Solution 3 | kosmos |
| Solution 4 | |
