How to remove the lines which appear on file B from another file A even with caps
Here is my code
from collections import Counter

with open('Downloads/book_25.txt', encoding = 'utf-8-sig') as f: # Open txt file
    book = f.read()
with open('Downloads/stop_words_english.txt', encoding = 'utf-8-sig') as f: # Open txt file
    stopfile = f.read()

# For loop to iterate over the txt file to check and remove punctuations
punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~“‘'''
no_punct = ""
for char in book:
    if char not in punctuations:
        no_punct = no_punct + char
no_punct = no_punct.split()

# For loop to iterate over the txt file to check and remove stop words
cleantxt = []
for word in no_punct:
    if word not in stopfile:
        cleantxt.append(word)
print(cleantxt)
I have book_25.txt, which is a book, and stop_words_english.txt, which contains a list of words, all in lowercase. I want to remove the words in stop_words_english.txt from book_25.txt, but some words are not removed because they are capitalized in the book.
For example, the word "the" is in stop_words_english.txt, but "The" is not removed from book_25.txt.
How can I change this? (Ignore the first for loop, as it only removes punctuation.)
Solution 1:[1]
You can convert everything to lowercase before comparing.
If the words from stop_words_english.txt are in a list, you can lowercase them like this (skip this step if they are already lowercase):
stopfile = [x.lower() for x in stopfile]
Then when you are running the final code:
if word.lower() not in stopfile:
    cleantxt.append(word)
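Putting the two snippets together, a minimal sketch might look like the following. It assumes the stop-word file holds whitespace-separated words; the sample strings and the helper name remove_stop_words are made up for illustration and are not part of the original answer.

```python
def remove_stop_words(words, stop_text):
    """Return the words whose lowercase form is not a stop word."""
    # Split the stop-word text into individual words and lowercase them.
    # A set gives exact-word matching, unlike a substring test on the raw string.
    stop_words = {w.lower() for w in stop_text.split()}
    return [word for word in words if word.lower() not in stop_words]


# Hypothetical sample data standing in for the contents of the two files.
no_punct = "The cat sat on the mat".split()
stopfile = "the on a"

cleantxt = remove_stop_words(no_punct, stopfile)
print(cleantxt)  # ['cat', 'sat', 'mat']
```

Only the comparison is lowercased, so the words that remain keep their original capitalization.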
Solution 2:[2]
You can achieve this with this code:
from pathlib import Path

# book = Path('Downloads/book_25.txt').read_text(encoding='utf-8-sig')
# stopfile = Path('Downloads/stop_words_english.txt').read_text(encoding='utf-8-sig')
book: str = "Hello my World!"
stopfile: str = "world"

punctuations: str = r"""!()-[]{};:'"\,<>./?@#$%^&*_~“‘"""

# Strip every punctuation character from both texts
book = "".join(filter(lambda char: char not in punctuations, book))
stopfile = "".join(filter(lambda char: char not in punctuations, stopfile))

# Uppercase the stop words so the comparison ignores case
exclude: set[str] = {word.upper() for word in stopfile.split()}
cleantext = [word for word in book.split() if word.upper() not in exclude]
print(cleantext) # ['Hello', 'my']
Little explanation:
- When handling paths/files you should use the pathlib library. I use it here to read the content of the file as text.
- I use a set of strings, exclude, for membership testing. This is what sets are made for. They also have some nice methods for that, like union, intersection, etc.
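As a rough illustration of the set methods mentioned above, the sketch below combines two stop-word sets and checks which stop words occur in a text. The word lists and variable names are hypothetical sample data, not taken from the files in the question.

```python
stop_words = {"the", "a", "on"}
extra_stop_words = {"of", "and"}  # hypothetical second stop-word list

# union: combine both stop-word lists into a single set
all_stop_words = stop_words.union(extra_stop_words)

# Lowercase the text's words so the comparison ignores case
book_words = {w.lower() for w in "The cat sat on the mat".split()}

# intersection: the stop words that actually occur in the text
found = book_words.intersection(all_stop_words)
print(sorted(found))  # ['on', 'the']

# Membership testing on a set is a hash lookup
print("the" in all_stop_words)  # True
```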
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
