Extract words within curly quotes but keep quotes used as apostrophes
I have a UTF-8 file in which curly quotes wrap some words, like ‘Awaara’, and in other places serve as apostrophes, as in don’t. Straight apostrophes such as don't also occur. The problem arises when I convert the curly quotes to straight single quotes: after the conversion, I cannot strip the quotes around words like 'Awaara' without also removing the single quotes used as apostrophes in don't and I'm.
GOAL: Convert curly quotes to straight single quotes, then remove the quotes that wrap words while keeping the ones used as apostrophes.
Here is the code I have written. It converts the quotes but fails to strip the single quotes from around words:
#!/bin/bash
cat $1 |
  sed -e "s/\’/'/g" -e "s/\‘/'/g" |
  sed -e "s/^'/ /g" -e "s/'$/ /g" |
  sed "s/\…/ /g" |
  tr '>' ' ' | tr '?' ' ' | tr ',' ' ' | tr ';' ' ' | tr '.' ' ' | tr '!' ' ' | tr '′' ' ' | tr ':' ' ' |
  sed -e "s/\[/ /g" -e "s/\]/ /g" -e 's/(/ /g' -e "s/)/ /g" |
  tr ' ' '\n' |
  sort -u | uniq |
  tr 'a-z' 'A-Z' >our_vocab.txt
The output is:
'AWAARA --> Should be AWAARA
25
50
70
800
A
AD
AI
AMITABH
AND
ANYWAY
ARE
BACHCHAN
BECAUSE
BUT
C++
CAN
CHECK
COMPUTER
DEVAKI
DIFFICULT
.
.
.
HOON' --> Should be HOON
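One possible fix, offered as a sketch rather than a definitive answer: in the script above, `s/^'/ /g` and `s/'$/ /g` run before the text is split into words, so `^` and `$` only match at the start and end of each input line, not of each word. Moving the quote stripping after the split lets the same anchors match the edges of each token, which leaves apostrophes inside words such as don't and I'm untouched. The function name `extract_vocab` is hypothetical, and the punctuation set simply mirrors the original script.

```shell
#!/bin/bash
# Sketch only: extract_vocab is a hypothetical helper, not part of the original script.
# Reads text on stdin; prints one upper-cased, de-duplicated word per line.
extract_vocab() {
  sed -e "s/‘/'/g" -e "s/’/'/g" |                  # curly quotes -> straight quotes
    sed -e "s/…/ /g" -e "s/′/ /g" |                # ellipsis and prime -> space
    sed -e 's/[][>?,;.!:()]/ /g' |                 # same punctuation set as the original
    tr -s ' \t' '\n' |                             # one token per line
    sed -e "s/^'*//" -e "s/'*\$//" -e '/^$/d' |    # strip quotes at token edges only, drop empties
    tr 'a-z' 'A-Z' |
    sort -u
}
```

Assuming the function is defined in the script, it could be called as `extract_vocab < "$1" > our_vocab.txt`. One caveat: a trailing possessive apostrophe (as in dogs') is stripped as well, because at this level it cannot be told apart from a closing quote.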
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow