C++: how to remove surrogate Unicode values from a string? [closed]
How do you remove surrogate values from a std::string in C++? I am looking for a regular expression like this:
string pattern = u8"[\uD800-\uDFFF]";
regex regx(pattern);
name = regex_replace(name, regx, "_");
How do you do it in a multiplatform C++ project (e.g. one built with CMake)?
Solution 1:[1]
First off, you can't store UTF-16 surrogates in a std::string (char-based). You would need a std::u16string (char16_t-based), or a std::wstring (wchar_t-based) on Windows only, where wchar_t is 16 bits wide. JavaScript strings are UTF-16 strings.
For those string types, you can use either:
std::remove_if() + std::basic_string::erase():
#include <string>
#include <algorithm>
std::u16string str; // or std::wstring on Windows
...
str.erase(
std::remove_if(str.begin(), str.end(),
[](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
),
str.end()
);
std::erase_if() (C++20 and later only):
#include <string>
std::u16string str; // or std::wstring on Windows
...
std::erase_if(str,
[](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
);
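For reference, the erase/remove_if approach can be wrapped into a small compilable unit. This is only a sketch: the names is_surrogate, strip_surrogates and strip_surrogates_demo are made up for illustration and are not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// True for any UTF-16 surrogate code unit (high or low): 0xD800..0xDFFF.
constexpr bool is_surrogate(char16_t ch) {
    return ch >= 0xD800 && ch <= 0xDFFF;
}

// Hypothetical helper: returns a copy of `in` with every surrogate
// code unit removed (erase/remove_if idiom, works before C++20).
std::u16string strip_surrogates(std::u16string in) {
    in.erase(std::remove_if(in.begin(), in.end(), is_surrogate), in.end());
    return in;
}

// Usage sketch: an unpaired surrogate and a full pair are both dropped.
inline void strip_surrogates_demo() {
    std::u16string s = u"a";
    s += char16_t(0xD800);   // unpaired high surrogate
    s += u"b";
    s += u"\U0001F600";      // encoded as the surrogate pair D83D DE00
    s += u"c";
    assert(strip_surrogates(s) == u"abc");
}
```

Keep in mind that this removes well-formed surrogate pairs too, so every character outside the Basic Multilingual Plane disappears along with any stray surrogates.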
UPDATE: You edited your question to change its semantics. Originally, you asked how to remove surrogates; now you are asking how to replace them instead. You can use std::replace_if() for that task, e.g.:
#include <string>
#include <algorithm>
std::u16string str; // or std::wstring on Windows
...
std::replace_if(str.begin(), str.end(),
[](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); },
u'_'
);
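Note that std::replace_if() works per code unit, so a well-formed surrogate pair (one character outside the BMP) ends up as two underscores. If you would rather get one replacement per character, a manual loop is one option. The following is a minimal sketch, assuming that is the behaviour you want; replace_surrogates and replace_surrogates_demo are hypothetical names, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: replaces each well-formed surrogate pair, and each
// unpaired surrogate, with a single `fill` code unit.
std::u16string replace_surrogates(const std::u16string& in, char16_t fill = u'_') {
    auto is_high = [](char16_t c) { return c >= 0xD800 && c <= 0xDBFF; };
    auto is_low  = [](char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; };

    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (is_high(in[i])) {
            // Skip the trailing low surrogate so the pair yields one `fill`.
            if (i + 1 < in.size() && is_low(in[i + 1])) ++i;
            out += fill;
        } else if (is_low(in[i])) {
            out += fill;      // unpaired low surrogate
        } else {
            out += in[i];     // ordinary BMP code unit, copied as-is
        }
    }
    return out;
}

// Usage sketch: the pair D83D DE00 collapses into a single underscore.
inline void replace_surrogates_demo() {
    assert(replace_surrogates(u"x\U0001F600y") == u"x_y");
}
```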
Or, if you really want a regex-based approach, you can use std::regex_replace(), e.g.:
#include <string>
#include <regex>
std::wstring str; // std::basic_regex does not support char16_t strings!
...
std::wstring newstr = std::regex_replace(
str,
std::wregex(L"[\\uD800-\\uDFFF]"),
L"_"
);
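The regex variant can be wrapped the same way. This is a sketch under the assumption that your std::wstring actually holds UTF-16 code units (as on Windows); where wchar_t is 32 bits wide, a std::wstring normally holds whole code points and contains surrogate values only if an earlier conversion put them there. The names replace_surrogates_regex and replace_surrogates_regex_demo are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: replaces every code unit in 0xD800..0xDFFF with '_'.
std::wstring replace_surrogates_regex(const std::wstring& in) {
    // Built once and reused; constructing a std::wregex is comparatively costly.
    static const std::wregex surrogates(L"[\\uD800-\\uDFFF]");
    return std::regex_replace(in, surrogates, L"_");
}

// Usage sketch: a single stray surrogate value becomes one underscore.
inline void replace_surrogates_regex_demo() {
    std::wstring s = L"a";
    s += static_cast<wchar_t>(0xD800);
    s += L"b";
    assert(replace_surrogates_regex(s) == L"a_b");
}
```

Like std::replace_if(), this works per code unit, so a well-formed pair turns into two underscores.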
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
