I'm trying to scrape the most frequent words on a web page and filter out stop words.

My code works for scraping the most frequent words, but once I introduce the code to remove or filter out stop words, my output is all funky. Here is my full code:

import requests
from bs4 import BeautifulSoup
from collections import Counter
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stopwords = stopwords.words('english')
print(stopwords)

def start(url):
    wordlist = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')
    for each_text in soup.findAll('div', {'class': 'centerPar'}):
        content = each_text.text
        words = content.lower().split()
        for each_word in words:
            wordlist.append(each_word)
    clean_wordlist(wordlist)

def clean_wordlist(wordlist):
    clean_list = []
    for word in wordlist:
        symbols = "!@#$%^&*()_-+={[}]|\;:\"<>?/., "
        for i in range(len(symbols)):
            word = word.replace(symbols[i], '')
        if len(word) > 0:
            clean_list.append(word)
    filter_list(clean_list)

def filter_list(clean_list):
    filtered_list = []
    for word in clean_list:
        for i in range(len(stopwords)):
            word = word.replace(stopwords[i], '')
        if len(word) > 0:
            filtered_list.append(word)
    create_dict(filtered_list)

def create_dict(filtered_list):
    word_count = {}
    for word in filtered_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    c = Counter(word_count)
    top = c.most_common(20)
    print(top)

if __name__ == '__main__':
    url = "https://www.pwc.com/us/en/about-us/purpose-and-values.html"
    start(url)

I'm having issues with the filter_list function.

My output is as follows:

[('n', 82), ('e', 23), ('f', 17), ('r', 14), ('ce', 14), ('wk', 14), ('pwc', 13), ('gl', 13), ('c', 12), ('wh', 12), ('u', 12), ('l', 11), ('cn', 10), ('peple', 10), ('purpe', 9), ('’', 9), ('ke', 7), ('en', 7), ('h', 7), ('p', 7)]


Solution 1:[1]

You replace stopwords *within* tokens with an empty string.

So if the token is exactly a stopword, it ends up with length 0 and is filtered out correctly. If it doesn't contain any stopword as a substring, it is appended unchanged, which is also correct.

In all other cases (and that covers most tokens in any text), stopword substrings within the tokens get replaced, leaving behind nonsense strings.

E.g. suppose your word is programming, and your stopword list contains short words such as ['it', 'am', 'in', 'i', 'a', ...]. The repeated replacement results in:

programming --> progrmg
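The replacement chain can be reproduced in isolation (a minimal sketch with a shortened, hypothetical stopword list):

```python
stop_list = ['it', 'am', 'in', 'i', 'a']

word = 'programming'
for sw in stop_list:
    # str.replace deletes the stopword *substring*, not whole-word matches:
    # 'am' turns 'programming' into 'progrming', then 'in' leaves 'progrmg'.
    word = word.replace(sw, '')

print(word)  # progrmg
```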

To fix this, check whether the whole word string is in the stopwords list, and skip the temporary modification of the word variable entirely:

def filter_list(clean_list):
    filtered_list = []
    for word in clean_list:
        if word not in stopwords:
            filtered_list.append(word)
    create_dict(filtered_list)

Or as a list comprehension:

def filter_list(clean_list):
    filtered_list = [word for word in clean_list if word not in stopwords]
    create_dict(filtered_list)
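As a side note, `word not in stopwords` scans the whole NLTK stopword list for every word, because it is a plain Python list. Converting it to a set once makes each lookup O(1) on average. A minimal sketch (this version returns the list instead of calling create_dict, and uses a small illustrative stopword subset; in your code you would build the set from stopwords.words('english')):

```python
# Build the lookup set once at module level; each `in` test is then O(1)
# on average instead of scanning a ~180-entry list per word.
stop_set = {'the', 'a', 'an', 'in', 'it', 'is', 'of'}

def filter_list(clean_list):
    # Whole-word membership test against the set, never substring replacement.
    return [word for word in clean_list if word not in stop_set]

print(filter_list(['the', 'purpose', 'of', 'pwc', 'is', 'trust']))
# ['purpose', 'pwc', 'trust']
```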

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ewz93