Split a paragraph containing words in different languages
Given input
let sentence = `browser's
emoji 🤖
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;
Needed output
I want every word and every run of spacing wrapped in a <span> that indicates whether it is a word or a space.
Each <span> has a type attribute with one of these values:
- w for word
- t for space or non-word
Examples
<span type="w">D</span><span type="t">-</span>
<span type="w">er</span><span type="t"> </span>
<span type="w">går</span>
<span type="t"> </span><span type="w">en</span>
<span type="w">المسجد</span>
<span type="t"> </span><span type="w">الحرام</span>
<span type="t"> </span>
<span type="w">তার</span><span type="t"> </span>
<span type="w">মধ্যে</span><span type="t"> </span>
<span type="w">আশ্চর্য</span>
Ideas investigated
Searching Stack Exchange
The question "Unicode string with diacritics split by chars" led me to an answer that uses the Unicode property Grapheme_Base.
Using split(/\w/) and split(/\W/)
These only split on ASCII characters, as MDN reports for the RegExp classes \w and \W:
\w and \W only matches ASCII based characters; for example, a to z, A to Z, 0 to 9, and _.
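A minimal sketch of that limitation, using a few words from the input. The names asciiWord, unicodeWord, samples and report are illustrative only, and treating letters, combining marks and numbers as "word" characters is an assumption, not something the question specifies.

```javascript
// \w is ASCII-only, so accented and non-Latin letters are not matched.
const asciiWord = /^\w+$/;
// Assumption: a word is made of letters (L), combining marks (M) and numbers (N).
const unicodeWord = /^[\p{L}\p{M}\p{N}]+$/u;

const samples = ["en", "rød", "a_1"];
const report = samples.map((text) => ({
  text,
  ascii: asciiWord.test(text),     // true only for pure ASCII word characters
  unicode: unicodeWord.test(text), // true for letters of any script
}));
console.log(report);
```

Note that the underscore in "a_1" is accepted by \w but not by the Unicode class, since it is punctuation rather than a letter or number.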
Using split("")
Using sentence.split("") splits the emoji into its two UTF-16 code units (a surrogate pair) instead of keeping it as one character.
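A small check of that behaviour, as a sketch. The variable names are illustrative. Iterating by code point with Array.from keeps the emoji intact, but it still separates a base letter from a combining mark.

```javascript
const robot = "🤖"; // U+1F916, outside the Basic Multilingual Plane

const codeUnits = robot.split("");    // splits between the two surrogates
const codePoints = Array.from(robot); // iterates by code point instead

console.log(codeUnits.length, codePoints.length); // 2 1

// Code points are still not graphemes: "e" + U+0301 (combining acute) is two items.
const accented = Array.from("e\u0301");
console.log(accented.length); // 2
```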
Unicode codepoint properties Grapheme_Base and Grapheme_Extend
const matchGrapheme =
/\p{Grapheme_Base}\p{Grapheme_Extend}|\p{Grapheme_Base}/gu;
let result = sentence.match(matchGrapheme);
console.log("Grapheme_Base (+Grapheme_Extend)", result);
This splits the text into individual characters (graphemes) rather than words, and the result still mixes word and non-word characters.
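To see what this pattern does on a small input, here is a sketch with a base letter followed by two combining marks. The pattern from the attempt accepts at most one Grapheme_Extend character per base, so a second mark is silently dropped; adding * keeps the whole cluster. The names firstMarkOnly and wholeCluster are illustrative.

```javascript
// "e" + U+0323 (combining dot below) + U+0301 (combining acute), then "x"
const text = "e\u0323\u0301x";

// Pattern from the attempt: a base plus at most ONE extending mark.
const firstMarkOnly = text.match(
  /\p{Grapheme_Base}\p{Grapheme_Extend}|\p{Grapheme_Base}/gu
);

// Variant: a base plus any number of extending marks.
const wholeCluster = text.match(/\p{Grapheme_Base}\p{Grapheme_Extend}*/gu);

console.log(firstMarkOnly.map((s) => s.length)); // [ 2, 1 ]  (U+0301 is lost)
console.log(wholeCluster.map((s) => s.length));  // [ 3, 1 ]
```

Either way the result is a list of characters, so a separate step is still needed to group them into words and non-words.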
Unicode properties Punctuation and White_Space
const matchPunctuation = /[\p{Punctuation}\p{White_Space}]+/gu; // no "|" inside a character class: it would match a literal pipe
let punctuationAndWhiteSpace = sentence.match(matchPunctuation);
console.log("Punctuation/White_Space", punctuationAndWhiteSpace);
This seems to fetch the non-words.
Solution 1:[1]
You could also use a combination of replace(), split() and join().
const sentence = `browser's
emoji 🤖
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;
const splitP = (sentence) => {
const oneLine = sentence.replace(/[\r\n]/g, " "); // replace all \r\ns by spaces
const splitted = oneLine.split(" ").filter(x => x); // split & filter out falsy values
return `<span>${splitted.join("</span><span>")}</span>`; // join with span tags
}
console.log(splitP(sentence));
If you prefer a one-line solution:
const sentence = `browser's
emoji 🤖
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;
const result = `<span>${sentence.replace(/[\r\n]/g, " ").split(" ").filter(x => x).join("</span><span>")}</span>`;
console.log(result);
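The snippets above wrap whitespace-separated chunks in plain <span> tags, without the type attribute or the separate spans for spaces and punctuation that the question asks for. A minimal sketch of one way to get that output follows. It assumes a "word" is any run of letters, combining marks and numbers (\p{L}, \p{M}, \p{N}) and that everything else is a non-word. WORD_OR_OTHER, escapeHtml and wrapTokens are hypothetical names, not part of the answer or of any library.

```javascript
// Assumption: words are runs of letters, marks and numbers; the rest is "t".
const WORD_OR_OTHER = /([\p{L}\p{M}\p{N}]+)|([^\p{L}\p{M}\p{N}]+)/gu;

// Escape characters that are special in HTML text content.
const escapeHtml = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// Wrap every token in a span; the first capture group is set only for words.
const wrapTokens = (text) =>
  Array.from(text.matchAll(WORD_OR_OTHER), ([token, word]) => {
    const type = word !== undefined ? "w" : "t";
    return `<span type="${type}">${escapeHtml(token)}</span>`;
  }).join("");

console.log(wrapTokens("D-er går en"));
```

With this definition, hyphens, apostrophes and symbols such as + end a word, so "D-er" becomes three spans as in the examples, and "browser's" is split at the apostrophe. Line breaks end up inside "t" spans together with other whitespace.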
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | axtcknj |
