How to extract only the English words and leave out the Devanagari words in a bash script?

The text file looks like this:

#एक
1के
अंकगणित8IU
अधोरेखाunderscore
$thatऔर
%redएकyellow
$चिह्न
अंडरस्कोर@_

The desired output should look like this:

#
1
8IU
underscore
$that
%redyellow
$
@_

This is what I have tried so far, using awk:

awk -F"[अ-ह]*" '{print $1}' filename.txt

The output I am getting is:

#
1


$that
%red
$

Using this instead:

awk -F"[अ-ह]*" '{print $1,$2}' filename.txt

I get output like this:

# 
1 े
 ं
 ो
$that 
%red yellow
$ ि
 ं

Is there any way to solve this in a bash script?

Solution 1:^[1]

Using perl:

$ perl -CSD -lpe 's/\p{Devanagari}+//g' input.txt
#
1
8IU
underscore
$that
%redyellow
$
@_

-CSD tells perl that standard streams and any opened files are encoded in UTF-8. -p loops over input files printing each line to standard output after executing the script given by -e. If you want to modify the file in place, add the -i option.

The regular expression matches any codepoints assigned to the Devanagari script in the Unicode standard and removes them. Use \P{Devanagari} to do the opposite and remove the non-Devanagari characters.

Solution 2:^[2]

Using awk you can do:

awk '{gsub(/[^\x00-\x7F]+/, "")} 1' file
#
1
8IU
underscore
$that
%redyellow
$
@_

See documentation: https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html

The bracket expression [\x00-\x7F] matches every value from 0 to 127, which is the defined range of the ASCII character set. The complemented list [^\x00-\x7F] therefore matches any character outside the ASCII range, and replacing each run of such characters with an empty string removes the Devanagari text.
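
For context, the range [अ-ह] in the question covers only the independent vowels and consonants (U+0905 to U+0939), so dependent vowel signs and marks such as े, ं, ो and ि fall outside it, which is why they show up in the question's second output. A byte-oriented variant that avoids depending on the locale could look like the following sketch. It assumes UTF-8 input, where every non-ASCII character is encoded using only bytes in the range 0x80 to 0xFF; the function name strip_non_ascii and the sample lines are made up for illustration.

```shell
# Hypothetical helper: delete every byte outside the ASCII range.
# LC_ALL=C makes awk compare single bytes; \200-\377 is octal for 0x80-0xFF.
strip_non_ascii() {
  LC_ALL=C awk '{ gsub(/[\200-\377]+/, "") } 1' "$@"
}

# gsub (unlike sub) removes every run on the line, not just the first one.
result=$(printf '%s\n' '%redएकyellowऔरblue' '$चिह्न' | strip_non_ascii)
printf '%s\n' "$result"
```

The helper reads standard input when called without arguments, and file names can be passed instead, for example strip_non_ascii filename.txt.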

Solution 3:^[3]

tr is a very good fit for this task:

LC_ALL=C tr -c -d '[:cntrl:][:graph:]' < input.txt

LC_ALL=C selects the POSIX C locale, in which tr works on single bytes and only ASCII characters belong to the character classes.

The -c option takes the complement of [:cntrl:][:graph:] (the control and visible character classes), and -d deletes every byte in that complement. Because all locale settings are forced to C, every byte of a non-ASCII character falls outside both classes and is discarded.
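
One consequence worth checking: the space character belongs to neither [:cntrl:] nor [:graph:] in the C locale, so the command above also deletes spaces. The sample input has none, but if spaces should survive, [:print:] (which is [:graph:] plus the space) might be used instead, as in this small sketch with a made-up input line:

```shell
line='red एकyellow'

# [:graph:] excludes the space, so the space is deleted along with the Devanagari bytes.
with_graph=$(printf '%s\n' "$line" | LC_ALL=C tr -cd '[:cntrl:][:graph:]')

# [:print:] includes the space, so the word separator is kept.
with_print=$(printf '%s\n' "$line" | LC_ALL=C tr -cd '[:cntrl:][:print:]')

printf '%s\n' "$with_graph" "$with_print"
```

In both variants the newline survives because it is a control character.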

Solution 4:^[4]

Does this sed work?

sed 's/\([0-9a-zA-Z[:punct:]]*\)[^0-9a-zA-Z[:punct:]]*/\1/g' input_file
#
1
8IU
underscore
$that
%redyellow
$
@_
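
The expression above captures each run of ASCII letters, digits, and punctuation and drops whatever non-matching run follows it. A shorter formulation that simply deletes the unwanted characters may be easier to read. This is a sketch rather than the answer's own method; it assumes the C locale so that sed treats each non-ASCII byte as a separate character, and, like the original, it also removes spaces and tabs.

```shell
# Delete every byte that is not an ASCII letter, digit, or punctuation mark.
cleaned=$(printf '%s\n' 'अंकगणित8IU' '%redएकyellow' |
  LC_ALL=C sed 's/[^0-9a-zA-Z[:punct:]]//g')
printf '%s\n' "$cleaned"
```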

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	Carlos Pascual
Solution 3	Léa Gris
Solution 4
