How to remove the words which appear in file B from another file A, regardless of capitalization

Here is my code

from collections import Counter

with open('Downloads/book_25.txt', encoding = 'utf-8-sig') as f: # Open txt file
    book = f.read() 
    
with open('Downloads/stop_words_english.txt', encoding = 'utf-8-sig') as f: # Open txt file
    stopfile = f.read()
    
# For loop to iterate over the txt file to check and remove punctuations
punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~“‘'''
no_punct = ""
for char in book:
    if char not in punctuations:
        no_punct = no_punct + char 
no_punct = no_punct.split()

# For loop to iterate over the txt file to check and remove stop words
cleantxt = []
for word in no_punct:
    if word not in stopfile:
        cleantxt.append(word)
        
print(cleantxt)

I have book_25.txt, which is a book, and stop_words_english.txt, which contains a list of words, all in lowercase. I want to remove the words in stop_words_english.txt from book_25.txt, but some words are not removed because they are capitalized in the book.

For example, there is the word "the" in stop_words_english.txt but it doesn't remove "The" from book_25.txt.

How can I change this? (Ignore the other for loop, as it only removes punctuation.)



Solution 1:[1]

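You can convert everything to lowercase before comparing.

If the words from stop_words_english.txt are in a list, you can lowercase them as shown here (skip this step if they are already lowercase). Note that in the question stopfile is a single string returned by f.read(), so call stopfile.split() first to get a list of words; otherwise the in operator performs a substring test rather than a whole-word test.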

stopfile = [x.lower() for x in stopfile]

Then, in the final loop, compare the lowercased word against the list:

if word.lower() not in stopfile:
    cleantxt.append(word)
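
Putting these pieces together, a minimal sketch might look like the following. It assumes the stop words are separated by whitespace; the sample strings and the helper name remove_stop_words are hypothetical stand-ins, not part of the original code.

```python
def remove_stop_words(words, stop_words):
    """Return the words whose lowercase form is not a stop word."""
    # Lowercase the stop words once and keep them in a set for fast lookups.
    stop_set = {w.lower() for w in stop_words}
    # Compare in lowercase, but keep each surviving word's original casing.
    return [w for w in words if w.lower() not in stop_set]


# Hypothetical sample data standing in for the two files.
stopfile = "the a of".split()
no_punct = "The cat sat at the edge of a Mat".split()

cleantxt = remove_stop_words(no_punct, stopfile)
print(cleantxt)  # ['cat', 'sat', 'at', 'edge', 'Mat']
```

Only the comparison is lowercased, so the words that remain keep the capitalization they had in the book.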

Solution 2:[2]

You can achieve this with this code:

from pathlib import Path

# To read the real files, replace the sample strings below with:
# book = Path('Downloads/book_25.txt').read_text(encoding='utf-8-sig')
# stopfile = Path('Downloads/stop_words_english.txt').read_text(encoding='utf-8-sig')
book: str = "Hello my World!"
stopfile: str = "world"
punctuations: str = r"""!()-[]{};:'"\,<>./?@#$%^&*_~“‘"""

# Strip punctuation from both texts (a single pass is enough).
book = "".join(char for char in book if char not in punctuations)
stopfile = "".join(char for char in stopfile if char not in punctuations)

# Normalize case on both sides so that "World" matches "world".
exclude: set[str] = {word.upper() for word in stopfile.split()}
cleantext = [word for word in book.split() if word.upper() not in exclude]
print(cleantext)  # ['Hello', 'my']

A brief explanation:

  1. When handling paths and files, you should use the pathlib library. I use it here to read the content of each file as text.
  2. I use a set of strings, exclude, for membership testing, which is what sets are designed for. Sets also provide useful methods such as union and intersection.
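
To illustrate the second point, the sketch below contrasts a membership test on a string with one on a set. The sample stop words are made up for illustration. On a string, in checks for a substring, so "he" counts as present merely because it occurs inside "the"; on a set, only whole words match.

```python
stop_text = "the other then"        # hypothetical stop-word file contents
stop_set = set(stop_text.split())   # the three words as a set

# Substring test on the raw string: "he" is found inside "the".
in_string = "he" in stop_text
# Whole-word test on the set: "he" is not one of the stop words.
in_set = "he" in stop_set
print(in_string, in_set)  # True False

# Sets also support operations such as union and intersection.
extra = {"a", "the"}
print(sorted(stop_set | extra))  # ['a', 'other', 'the', 'then']
print(sorted(stop_set & extra))  # ['the']
```

This is also why testing against the raw file contents, as in the question, can silently drop words that merely appear inside a stop word.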

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2