How to find and remove a starting string from an Arabic string that has diacritics while preserving the original diacritics of the remaining string

The aim is to find and remove a starting string/chars/word from an Arabic string that may or may not contain diacritics, while preserving any and all diacritics of the remaining string.

There are many answers for removing the first/starting string/chars from an English string on StackOverflow, but there is no existing solution to this problem found on StackOverflow that maintains the balance of the Arabic string in its original form.

If the original string is normalized (removing the diacritics, tanween, etc.) before processing it, then the string returned will be the remainder of the normalized string, not the remainder of the original string.
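The effect described above can be reproduced with a short sketch. It assumes that the Unicode "Mark" category (`\p{M}`) is an acceptable stand-in for the diacritics to strip; the variable names are illustrative only.

```javascript
// Assumption: \p{M} (any Unicode mark) approximates "Arabic diacritics".
// "السَّلَامُ عَلَيْكُم" written with escapes so the marks are explicit.
const original =
  "\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F \u0639\u064E\u0644\u064E\u064A\u0652\u0643\u064F\u0645";
const prefix = "\u0627\u0644\u0633\u0644\u0627\u0645"; // "السلام", no diacritics

// Normalize first, then slice: the prefix is found, but the marks are gone.
const normalized = original.replace(/\p{M}/gu, "");
const remainder = normalized.startsWith(prefix)
  ? normalized.slice(prefix.length)
  : normalized;

console.log(remainder);                 // " عليكم" (diacritics lost)
console.log(/\p{M}/u.test(remainder));  // false
```

The prefix is matched, but the result no longer carries the diacritics of the original, which is exactly what the question wants to avoid.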

Example. Assume the following original string which can be in any of the following forms (i.e. the same string but different diacritics):

1. "السلام عليكم ورحمة الله"

2. "السَلام عليكمُ ورحمةُ الله"

3. "السَلامُ عَليكمُ ورَحمةُ الله"

4. "السَّلَامُ عَلَيْكُمُ وَرَحْمَةُ الله"

Now let us say we want to remove the starting characters "السلام" only if the string starts with them (which it does), and return the remainder of the "original" string with its original diacritics.

Of course, we are looking for the characters "السلام" without diacritics because we don't know how the original string is formatted with diacritics.

So, in this case, the remainder returned for each string must be:

1. " عليكم ورحمة الله"

2. " عليكمُ ورحمةُ الله"

3. " عَليكمُ ورَحمةُ الله"

4. " عَلَيْكُمُ وَرَحْمَةُ الله"

The following code works for an English string (there are many other solutions) but not for an Arabic string as explained above.

function removeStartWord(string,word) {
if (string.startsWith(word)) string=string.slice(word.length);
return string;
}

The above code slices the starting characters off the original string based on their length, which works fine for English text.

For an Arabic string, we don't know the form of diacritics of the original string and thus the length of the string/characters we are looking for in the original string will be different and unknown.
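A minimal illustration of the length problem, using the word from the example above written with Unicode escapes (variable names are illustrative only):

```javascript
const bare = "\u0627\u0644\u0633\u0644\u0627\u0645"; // "السلام"
const vocalized =
  "\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F"; // "السَّلَامُ"

console.log(bare.length);                 // 6
console.log(vocalized.length);            // 10 (each diacritic is its own code unit)
console.log(vocalized.startsWith(bare));  // false

// Slicing by the length of the bare word cuts the vocalized word in the middle.
const wrongRemainder = vocalized.slice(bare.length);
console.log(wrongRemainder.length);       // 4 code units of the word are left over
```

Both startsWith() and slice(word.length) operate on UTF-16 code units, so they break as soon as the source carries diacritics that the search word lacks.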

Edit: The original post included an image table with further examples (image not reproduced here).



Solution 1:[1]

I have come up with the following as a possible solution.

The solution has two parts. First, the function startsWithAr() "partially" mimics the JavaScript startsWith() method, but for an Arabic string.

Instead of returning 'true' or 'false', however, it returns the index just past the characters we are looking for at the start of the Source String (i.e. the length of the match in the Source String, including its Tashkeel (diacritics) if any). If the Source String does not start with the characters of the specified string, it returns -1.

Using the startsWithAr() function, we then create (in the 2nd part) a function that removes the characters of the specified string if found at the start of the Source String using the slice() method; the removeStartString() function.

This approach permits not only maintaining the Tashkeel (diacritics) of the remainder of the Source String but also allows strings with Tahmeez to be searched and removed.

The function ignores Tashkeel (diacritics) and Tahmeez in both the Source String and the Look-For String, and returns the remaining part of the Source String with its original Tashkeel (diacritics) intact after removing the specified starting characters from the beginning of the Source String.

This way we can use the function to handle all Unicode characters in the Arabic script without limiting it to a defined range, because any other characters of whatever language are ignored.
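The idea of returning the index after the match and then slicing the original string can be sketched as follows. This is not the answer's code: the helper names are hypothetical, `\p{M}` (any Unicode mark) is assumed as a stand-in for the Tashkeel character class, and Tahmeez folding is left out to keep the sketch short.

```javascript
// Sketch only: hypothetical helper names; \p{M} stands in for the tashkeel range.
// No hamza (Tahmeez) folding is done here.
const MARK = /\p{M}/u;                      // no "g" flag, so test() is stateless
const stripMarks = s => s.replace(/\p{M}/gu, "");

// Returns the index in `source` just past the matched prefix, or -1.
function matchedPrefixEnd(source, lookFor) {
  const target = stripMarks(lookFor);
  if (target === "") return -1;             // nothing to look for
  let i = 0;                                // index into the ORIGINAL source
  for (let t = 0; t < target.length; t++) {
    while (i < source.length && MARK.test(source[i])) i++;  // skip marks
    if (i >= source.length || source[i] !== target[t]) return -1;
    i++;
  }
  // Also consume the marks attached to the last matched letter.
  while (i < source.length && MARK.test(source[i])) i++;
  return i;
}

// Slices the original string, so the remainder keeps its own diacritics.
function removePrefixKeepingMarks(source, lookFor) {
  const end = matchedPrefixEnd(source, lookFor);
  return end > -1 ? source.slice(end) : source;
}

const source =
  "\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F \u0639\u064E\u0644"; // "السَّلَامُ عَل"
console.log(matchedPrefixEnd(source, "\u0627\u0644\u0633\u0644\u0627\u0645")); // 10
```

Because only indices into the original string are computed and nothing is normalized in place, slice() returns the remainder exactly as it was written.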

We can also improve it easily by matching "?" with "?" so we can remove the string "??????" even if it is written as "??????" by adding .replace(/[?]/g,'?') at the 2 .replace() lines.

I have included below separate test cases on the use of the startsWithAr() function and the removeStartString() functions.

The two functions can be combined into one function if needed.

Please improve as necessary; any suggestions are appreciated.


1st Part: startsWithAr()


//=====================================================================
// startsWithAr() function
// Purpose:
// Determines whether an Arabic string (the "Source String") begins with the characters
// of a specified string (the "Look-For String").
// Return the position (index) after the Look-For String if found, else return -1 if not found.
// Ignores Tashkeel (diacritics) and Tahmeez in both the Source and Look-For Strings.
// The returned position index is zero based.
// By knowing the position (index) after the Look-For String, one can remove the
// starting string using the slice() method while maintaining the remainder of the Source String with
// its original tashkeel (diacritics) unchanged.
//
// Parameters:
// str     : The Source String to search in.
// lookFor : The characters to be searched for at the start of this string.
//=====================================================================
function startsWithAr(str,lookFor) {
let indexLookFor=0, tshk=/[?-??-??-??-????]/, w=/[?]/g,hamz=/[???????]/g;
lookFor=lookFor.replace(hamz,'?').replace(w,'?').replace(/[?-??-??-??-????]/g,''); // normalize the lookFor string
for (let indexStr=0; indexStr<str.length;indexStr++) {
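while(indexStr<str.length&&tshk.test(str[indexStr]))++indexStr; // skip tashkeel & increase index
if (indexStr>=str.length) return-1; // Source String ended (only tashkeel was left) before lookFor was fully matched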
if (lookFor[indexLookFor]!==str[indexStr].replace(hamz,'?').replace(w,'?')) return-1; // no match, so exit -1
indexLookFor++;                               // match found so next char in lookFor String
    if (indexLookFor>=lookFor.length) {       // if end of Look-For String then WE FOUND IT
      indexStr+=1;                            // point after source char
      while(tshk.test(str[indexStr])&&indexStr<str.length)++indexStr; // skip tashkeel after Source String if any
    return indexStr;      // return index in Source String after lookFor string and after any tashkeel
    }
}
return-1; // not found end of string reached
}
//=========================================
// test cases for startsWithAr() function
//=========================================
var r =0; // test tracking flag
r |= test("?????? ?????????? ????? ????","??????",6);  // find the start letters '??????'
r |= test("?????????? ?????????? ????? ????","??????",10); // find the start letters '??????'
r |= test("?????????? ?????????? ?????????? ????","????????",10); // find the start letters '????????'
r |= test("?????????? ?????????? ?????????? ????","????????",10); // find the start letters '????????'
r |= test("?????? ?? ??????","??????",6);      // find the start letters '??????'
r |= test("?????/???","?????",5);           // find the start letters '?????'
r |= test("?????/???","?",-1);           // find the start letters '?????'
r |= test(" ?????"," ",1);               // find the start letter ' ' (space)
r |= test("????? ???","??",2);             // find the start letters '??'
r |= test("????? ???","?",1);              // find the start letter  '?'
r |= test("????? ???","??",2);             // find the start letters '??'
r |= test("????? ???","??",2);             // find the start letters '??'
r |= test("????? ???","???",2);             // find the start letters '???'
r |= test("??????? ????","???",3);             // find the start letters '???'
r |= test("","?",-1);                  // empty Source String
r |= test("","",-1);                  // empty Source String and Look-For String

if (r==0) console.log("? All startsWithAr() test cases passed");


//-----------------------------------
function test(str,lookfor,should) {
  let result= startsWithAr(str,lookfor);
  if (result !== should) {
    console.log(`
  ${str} Output   :${result}
  ${str} Should be:${should}
  `);
    return 1; // flag the failed test case
  }
  return 0;   // test case passed
}

2nd Part: removeStartString()


//=====================================================================
// removeStartString() function
// Purpose:
// Determines whether an Arabic string (the "Source String") begins with the characters
// of a specified string (the "Look-For String").
// If found the Look-For String is removed and the remainder of the Source String is returned
// with its original Tashkeel (diacritics);
// If no match then return original Source String.
//
// Ignores Tashkeel (diacritics) and Tahmeez in both the Source and Look-For Strings.
// The function uses the startsWithAr() function to determine the index after the matched
// starting string/characters.
//
// Parameters:
// str     : The Source String to search in.
// toRemove: The characters to be searched for and removed if at the start of this string.
//=====================================================================
function removeStartString(str,toRemove) {
let index=startsWithAr(str,toRemove);
if (index>-1) str=str.slice(index);
return str;
}

//=========================================
// test cases for removeStartString() function
//=========================================
var r =0; // test tracking flag
r |= test2("?????? ?????????? ????? ????","??????"," ?????????? ????? ????");  // remove the start letters '??????'
r |= test2("?????????? ?????????? ????? ????","??????"," ?????????? ????? ????");  // remove the start letters '??????????'
r |= test2("?????? ?????????? ????? ????","??????????"," ?????????? ????? ????");  // remove the start letters '??????????'
r |= test2(" ?????? ?????????? ????? ????"," ??????????"," ?????????? ????? ????");// remove the start letters '?????????? '
r |= test2("?????? ?????????? ????? ????","??","???? ?????????? ????? ????"); // remove the start letters '??'
r |= test2("??????? ????????","?","????? ????????");             // remove the start letter '?'
r |= test2("??????? ????????"," ","??????? ????????");                // remove the start letter ' '
r |= test2("??????? ????????","","??????? ????????");             // remove the start letter ''
r |= test2("??????? ????????","???","??????? ????????");           // remove the start letters '???'

if (r==0) console.log("? All removeStartString() test cases passed");

//-----------------------------------
function test2(str,toRemove,should) {
  let result= removeStartString(str,toRemove);
  if (result !== should) {
    console.log(`
  ${str} Output   :${result}
  ${str} Should be:${should}
  `);
    return 1; // flag the failed test case
  }
  return 0;   // test case passed
}

Solution 2:[2]

Working with regex Unicode property escapes might already be good enough for what the OP is looking for. Note that JavaScript does not accept the bare shorthand \p{Arabic}; a script has to be named explicitly, as in \p{Script=Arabic} (together with the u flag).

A category-based pattern like /^[\p{L}\p{M}]+\p{Z}+/gmu together with replace already does exactly what the OP asked for ...

find and remove first starting word from an arabic string having diacritics

The pattern ... ^[\p{L}\p{M}]+\p{Z}+ ... reads like this ...

  • ^... starting at the beginning of a new line ...
  • [ ... ]+ ... find at least one character of the specified character class ...
    • \p{L} ... either any kind of Letter from any language,
    • \p{M} ... or a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
  • ... followed by \p{Z}+ ... at least one of any kind of whitespace or invisible separator.
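The pattern explained above can be tried on a real Arabic line. The sketch below also shows a script-restricted variant using `\p{Script=Arabic}`, which is an addition for comparison and not part of the answer; the variable names are illustrative.

```javascript
// The answer's pattern (single line, so without the g and m flags).
const firstWordAnyScript = /^[\p{L}\p{M}]+\p{Z}+/u;
// Assumed variant: only letters of the Arabic script. \p{M} stays in the class
// because the Arabic vowel marks are not assigned Script=Arabic themselves.
const firstWordArabic = /^[\p{Script=Arabic}\p{M}]+\p{Z}+/u;

const arabicLine =
  "\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F \u0639\u064E\u0644\u064E\u064A\u0652\u0643\u064F\u0645"; // "السَّلَامُ عَلَيْكُم"
const latinLine = "Hello world";

console.log(arabicLine.replace(firstWordAnyScript, "")); // "عَلَيْكُم", diacritics kept
console.log(latinLine.replace(firstWordAnyScript, ""));  // "world"
console.log(latinLine.replace(firstWordArabic, ""));     // "Hello world" (untouched)
```

Both patterns remove the first word without looking at what the word is, so they only answer the "remove the first word" reading of the question.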

console.log(`?????? ????? ????? ????
??????? ?????? ?????? ????
???????? ??????? ??????? ????
?????????? ?????????? ?????????? ????`.replace(/^[\p{L}\p{M}]+\p{Z}+/gmu, ''));

Edit

Now that it is clear what the OP really wants, the above approach stays and is taken one step further: a replacer function adds comparison logic based on an Intl.Collator object, which takes the Arabic locale and base-letter comparison into account.

The collator is initialized in its least strict mode by passing, in addition to the 'ar' locale, an option that sets the sensitivity to 'base'. Thus, when two similar (but not identical) strings are compared via the collator's compare method, e.g. '??????' and '??????????', they are considered equal despite the latter featuring (a lot of) diacritics.
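As a small sketch of that comparison, the word "السلام" is written here once without and once with diacritics (as Unicode escapes). The result depends on the collation data shipped with the JavaScript engine, so treat it as expected behavior on an engine with full ICU data rather than a guarantee.

```javascript
const bare = "\u0627\u0644\u0633\u0644\u0627\u0645"; // "السلام"
const vocalized =
  "\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F"; // "السَّلَامُ"

// Base sensitivity: only differences in base letters make strings unequal.
const baseCollator = new Intl.Collator("ar", { sensitivity: "base" });

console.log(bare === vocalized);                          // false
console.log(baseCollator.compare(bare, vocalized) === 0); // true with full ICU data
```

The collator only says whether two strings are equivalent; it does not say where a match ends inside a longer string, which is why the answer pairs it with a regex that isolates the first word.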

proof / examples ...

const baseLetterCollator = new Intl.Collator('ar', { sensitivity: 'base' } );

console.log(
  "('?????? ????? ????? ????' === '?????????? ?????????? ?????????? ????') ?..",
  ('?????? ????? ????? ????' === '?????????? ?????????? ?????????? ????')
);
console.log('\n');

console.log(`new Intl.Collator()
  .compare('?????? ????? ????? ????' ,'?????????? ?????????? ?????????? ????') === 0

  ?..`,
  new Intl.Collator()
    .compare('?????? ????? ????? ????' ,'?????????? ?????????? ?????????? ????') === 0
);
console.log(`new Intl.Collator('ar', { sensitivity: 'base' } )
  .compare('?????? ????? ????? ????' ,'?????????? ?????????? ?????????? ????') === 0

  ?..`,
  new Intl.Collator('ar', { sensitivity: 'base' } )
    .compare('?????? ????? ????? ????' ,'?????????? ?????????? ?????????? ????') === 0
);

Based on all the above said ... the final solution ...

function removeFirstMatchingWordFromEveryNewLine(search, multilineString) {
  const baseLetterCollator
    // - [ar]abic
    // - base sensitivity
    //   ... only strings that differ in base letters compare as unequal.
    = new Intl.Collator('ar', { sensitivity: 'base' } );

  const replacer = word => {
    return (baseLetterCollator.compare(search, word.trim()) === 0)
      ? ''    // - remove the matching word (whitespace included).
      : word; // - keep the word since there was no match. 
  }
  const regXFirstLineWord = /^[\p{L}\p{M}]+\p{Z}+/gmu;

  search = String(search).trim();

  return String(multilineString).replace(regXFirstLineWord, replacer);  
}
const sampleData = `?????? ????? ????? ????
??????? ?????? ?????? ????
???? ??????
???????? ??????? ??????? ????
?????????? ?????????? ?????????? ????`;

console.log('sampleData ...', sampleData);
console.log(
  "removeFirstMatchingWordFromEveryNewLine('??????', sampleData) ...",
  removeFirstMatchingWordFromEveryNewLine('??????', sampleData)
);

Solution 3:[3]

Since requirements change(d) and information come in slice by slice, ...

"The [...] answer removes the first matching word assuming a space after the word. But the string of characters we are looking for may not necessarily be followed by a space (i.e. not a standalone word). For example, removing the characters "?????" from the sentence "?????/???? ???????" and returning only "/???? ???????". – Mohsen Alyafei"

... I will also start from a blank sheet.

The combined approach (an Intl.Collator based locale compare run against the match of a Unicode property escapes based regex, which matches any Arabic word regardless of combining characters such as accents, umlauts, etc.) cannot be used anymore when it comes to finding/matching an arbitrary string (here, at the beginning of a new line).

But any approach which tries to just naively iterate over strings with the intention of comparing two strings to one another character by character will fail.

Example code says more than words ... let's see it ...

console.log(`
  ... remember ...
  new Intl.Collator('ar', { sensitivity: 'base' } )
    .compare('??????????' ,'??????') === 0

  ?..`, new Intl.Collator('ar', { sensitivity: 'base' } )
    .compare('??????????' ,'??????') === 0, `

  ... but ...
  new Intl.Collator('ar')
    .compare('??????????' ,'??????') === 0

  ?..`, new Intl.Collator('ar')
    .compare('??????????' ,'??????') === 0
);
console.log('\n... explanation ...\n\n');

console.log("'??????'.length ...", '??????'.length);
console.log("'??????????'.length ...", '??????????'.length);

console.log("'??????'.split('') ...", '??????'.split(''));
console.log("'??????????'.split('') ...", '??????????'.split(''));

Fortunately Intl, ECMAScript's Internationalization API, can help here too. There is Intl.Segmenter which will help breaking down the string(s) into comparable segments. For the OP's use case it will be good enough to do it on the default granularity level of 'grapheme', which seems to equal a segmentation into locale comparable letters ...
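A minimal sketch of grapheme segmentation on a vocalized word, written with Unicode escapes so that the combining marks are explicit. It assumes an engine that ships Intl.Segmenter; the variable names are illustrative.

```javascript
const vocalized =
  "\u0627\u0644\u0633\u064E\u0651\u0644\u064E\u0627\u0645\u064F"; // "السَّلَامُ"

const segmenter = new Intl.Segmenter("ar", { granularity: "grapheme" });
const graphemes = Array.from(
  segmenter.segment(vocalized),
  ({ segment }) => segment
);

console.log(vocalized.length);            // 10 UTF-16 code units
console.log(graphemes.length);            // 6 graphemes, one per base letter
console.log(graphemes[2].length);         // 3: the letter "س" plus its two marks
console.log(graphemes.join("") === vocalized); // true, nothing is lost
```

Each segment carries a base letter together with its diacritics, so segments can be compared one by one against the search letters and the untouched segments can be joined back into the original remainder.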

console.log(`[
  ...new Intl.Segmenter('ar', { granularity: 'grapheme' }).segment('??????')
]
.map(({ segment }) => segment) ...`, [

  ...new Intl.Segmenter('ar', { granularity: 'grapheme' }).segment('??????')
  ]
  .map(({ segment }) => segment)
);
console.log(`[
  ...new Intl.Segmenter('ar').segment('??????????')
]
.map(({ segment }) => segment) ...`, [

    ...new Intl.Segmenter('ar').segment('??????????')
  ]
  .map(({ segment }) => segment)
);

Thus the last step was to implement a function which meets the most recent requirements of the OP by combining the above introduced Intl.Segmenter with the by now already familiar Intl.Collator ...

function removeEveryMatchingNewLineStart(search, multilineString) {
  const letterSegmenter
    // - [ar]abic
    // - default grapheme granularity (locale comparable letters).
    = new Intl.Segmenter('ar'/*, { granularity: 'grapheme' }*/);

  const letterCollator
    // - [ar]abic
    // - base sensitivity
    //   ... Non-zero comparator result value for strings only
    //   that for a base letter comparison are considered unequal.
    = new Intl.Collator('ar', { sensitivity: 'base' } );

  const getLocaleComparableLetterList = str =>
    [...letterSegmenter.segment(str)].map(({ segment }) => segment);

  function replaceLineStartByBoundComparableLetters(line) {
    const searchLetters = this;
    let lineLetters = getLocaleComparableLetterList(line);

    if (searchLetters.every((searchLetter, idx/*, arr*/) =>
      (letterCollator.compare(searchLetter, lineLetters[idx]) === 0)
    )) {
      lineLetters = lineLetters.slice(searchLetters.length);

      let leadingBlanks = '';
      while (lineLetters[0] === ' ') {
        leadingBlanks = leadingBlanks + lineLetters.shift();
      }
      line = `${ lineLetters.join('') }${ leadingBlanks }`;

      // // due to keeping/restoring leading whitespace sequences ...
      // // ... all the above additional computation instead of ...
      // // ... a simple ...
      // line = lineLetters.slice(searchLetters.length).join('')
    }
    return line;
  }
  return String(multilineString)
    .split(/(\n)/)
    .map(
      replaceLineStartByBoundComparableLetters.bind(
        getLocaleComparableLetterList(String(search))
      )
    )
    .join('');
}
const sampleData = `?????? ????? ????? ????
??????? ?????? ?????? ????
???? ??????
???????? ??????? ??????? ????
?????????? ?????????? ?????????? ????`;

console.log('sampleData ...', sampleData);
console.log(
  "removeEveryMatchingNewLineStart('??????', sampleData) ...",
  removeEveryMatchingNewLineStart('??????', sampleData)
);
console.log(
  "removeEveryMatchingNewLineStart('???', sampleData) ...",
  removeEveryMatchingNewLineStart('???', sampleData)
);
console.log(
  "removeEveryMatchingNewLineStart('?????? ', sampleData) ...",
  removeEveryMatchingNewLineStart('?????? ', sampleData)
);

Solution 4:[4]

I don't see what's wrong with your code, but here is another approach:

function removeStartWord(string, word) {
  return string.split(' ').filter((_word, index) => index !== 0 || _word.replace(/[^a-zA-Z?-?]+/g, '') !== word).join(' ');
}

const sampleData = `?????????? ?????????? ?????????? ????`;

console.log('sampleData ...', sampleData);
console.log(
  "removeStartWord(sampleData, '??????') ...",
  removeStartWord(sampleData, '??????')
);
console.log(
  "removeStartWord(sampleData, '???') ...",
  removeStartWord(sampleData, '???')
);
console.log(
  "removeStartWord(sampleData, '?????? ') ...",
  removeStartWord(sampleData, '?????? ')
);
console.log(
  "removeStartWord(sampleData, ' ??????') ...",
  removeStartWord(sampleData, ' ??????')
);

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mohsen Alyafei
Solution 2
Solution 3
Solution 4 Peter Seliger