How to take user input of a file name and only count letters as words instead of punctuation?

I have written this code for a word frequency calculator. It works, but it counts 'not,' as a different word from 'not'. I am also trying to make the program ask the user for the filename and print 'wrong file' if the user enters a file that does not exist. I am unsure how to code the user input and how to make sure the program counts only letters (not punctuation).

file = open('document (1).txt')

empty_dictionary = dict()

for sentence in file:
    
    sentence = sentence.strip()
    sentence = sentence.lower()
    words = sentence.split(" ")
    for word in words:
        if word in empty_dictionary:
            empty_dictionary[word] = empty_dictionary[word] + 1
        else:
            empty_dictionary[word] = 1

 
for key in list(empty_dictionary.keys()):
    print(key, "- ", empty_dictionary[key])


Solution 1:[1]

If there are a few specific characters you want to remove, you can do it this way:

word = word.replace(',', '')
word = word.replace(';', '')
word = word.replace(':', '')

Or a more general solution that removes any character that isn't a letter:

word = ''.join(ch for ch in word if ch.isalpha())
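
As a rough sketch of how the letter filter could slot into the counting loop from the question: the function name count_words and the sample lines below are made up for illustration. It assumes that tokens left empty after filtering (pure punctuation or digits) should be skipped.

```python
def count_words(lines):
    """Count words in an iterable of lines, keeping only the letters of each word."""
    counts = {}
    for line in lines:
        # split() with no argument splits on any run of whitespace,
        # so double spaces do not produce empty strings
        for token in line.lower().split():
            word = ''.join(ch for ch in token if ch.isalpha())
            if not word:
                continue  # the token held no letters at all
            counts[word] = counts.get(word, 0) + 1
    return counts


sample = ["Not? Not!", "Not, not. ... why not"]
print(count_words(sample))
```

Note that this filter also drops apostrophes and hyphens, so "don't" would be counted as "dont".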

Solution 2:[2]

Regular expressions are actually perfectly suited for this! They have the notion of a word boundary.

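The regular expression /\w+\b/ will match one or more "word characters" (\w+) followed by a word boundary (\b). Note that "word characters" means letters, digits and the underscore, so punctuation is excluded but numbers are not. Python provides regex functionality through its re module and you can use it like this: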

# import the regex module
import re

# just some example text
text = """Not? Not!
Not, not."""

# pre-compile the regex
pattern = re.compile(r'\w+\b')
empty_dictionary = dict()

for sentence in text.split('\n'):
    sentence = sentence.strip()
    sentence = sentence.lower()
    # `findall`, as you might imagine, finds all matches in a given string
    for word in pattern.findall(sentence):
        if word in empty_dictionary:
            empty_dictionary[word] = empty_dictionary[word] + 1
        else:
            empty_dictionary[word] = 1

 
for key in list(empty_dictionary.keys()):
    print(key, "- ", empty_dictionary[key])
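
If digits and underscores should not count as part of a word either, one possible variation is the character class [^\W\d_], which reads as "a word character that is neither a digit nor an underscore". This is only a sketch; the name letters_only and the sample text are invented here, and unusual Unicode numerals may still slip through.

```python
import re
from collections import Counter

# r'\w+' would treat "not_2" and "42" as words.
# [^\W\d_] keeps word characters but excludes digits and the underscore.
letters_only = re.compile(r'[^\W\d_]+')

text = "Not? not_2 NOT! 42 times"
counts = Counter(letters_only.findall(text.lower()))
print(counts)
```

For plain English text, lowercasing first and matching [a-z]+ would be a stricter alternative.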

Solution 3:[3]

Regex will help you out, and context managers are nice to use because they close the file for you whether the block ends normally or with an unexpected error. A Counter is probably something you will find useful :) (it works like a dict when it comes to loops and can be updated). It also has some nice methods like most_common, which might be beneficial to the work you are doing.

import re
from collections import Counter

# with open("document (1).txt") as fp:
#    text = fp.read().lower()

# mock file content
text = """Not? Not!
Not, not."""

# lowercase the mock text too, so "Not" and "not" are counted together
print(Counter(re.findall(r'\w+\b', text.lower())))

For file input control you could go for something like

import re
from collections import Counter
from pathlib import Path

while True:
    candidate = Path(input("Yo give me that file name"))
    if candidate.is_file():
        break
    print(f"{candidate} is not a file on the system")

text = candidate.read_text().lower()

# use .items() instead of .most_common() if order is not important
for k, v in Counter(re.findall(r'\w+\b', text)).most_common():
    print(k, v, sep=" -  ")
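
The question also asks for a 'wrong file' message. One way to get that, sketched here under the assumption that any failure to open the file should be reported the same way, is to attempt the open and catch the error. The function name word_counts_from_file is hypothetical, and a temporary file stands in for the real document.

```python
import os
import re
import tempfile
from collections import Counter


def word_counts_from_file(filename):
    """Return a Counter of the words in the file, or None if it cannot be opened."""
    try:
        with open(filename, encoding="utf-8") as fp:
            text = fp.read().lower()
    except OSError:
        # FileNotFoundError, IsADirectoryError and PermissionError
        # are all subclasses of OSError
        print("wrong file")
        return None
    return Counter(re.findall(r"\w+", text))


# Demo: one file that exists and one that does not
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("Not? Not!\nNot, not.")
    good = word_counts_from_file(path)
    missing = word_counts_from_file(os.path.join(tmp, "no_such_file.txt"))

print(good)
print(missing)
```

In the real program the argument would come from the user, for example word_counts_from_file(input("File name: ")).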

Solution 4:[4]

As some of the others said, regex is probably one of the best solutions, but for your user input issue you could use a file dialog:

import os
import re
import sys
import tkinter as tk
from tkinter import filedialog
from os.path import exists

# Create the root window and hide it so only the file dialog is shown
root = tk.Tk()
root.withdraw()

# Returns the chosen path, or an empty string if the dialog is cancelled
file_path = filedialog.askopenfilename()

pattern = re.compile(r'\w+\b')
if exists(file_path):
   with open(file_path, 'r') as reader:
      text = reader.read()  # one string, so it can be split into lines below

   empty_dictionary = dict()
   for sentence in text.split('\n'):
      sentence = sentence.strip().lower()
      for word in pattern.findall(sentence):
         if word in empty_dictionary:
            empty_dictionary[word] = empty_dictionary[word] + 1
         else:
            empty_dictionary[word] = 1
   for key in list(empty_dictionary.keys()):
      print(key, "- ", empty_dictionary[key])
else:
   from colorama import Fore  # third-party package (pip install colorama)
   print(Fore.RED, "The File does not exist.\nHit Enter to retry.")
   input()
   # Restart the script with the same interpreter to ask again
   os.execv(sys.executable, [sys.executable] + sys.argv)

All of this works like so: tkinter creates a window which is instantly withdrawn (hidden), so only the file dialog is shown, and the chosen path is stored in file_path. Then exists() checks that the file is real; if it returns True, your counting code runs. If not, the script prints red text like an error and waits for Enter before restarting to try again. The compiled regex finds all the words that match, and those are what get counted.
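
Restarting the whole process is one option; a plain retry loop is another. The sketch below is not part of the answer above. The helper ask_for_existing_file and its max_attempts parameter are hypothetical, and it takes the "ask" step as a callable so that either filedialog.askopenfilename or a lambda around input() could be passed in. Canned answers replace the real dialog in the demo.

```python
import os
import tempfile


def ask_for_existing_file(ask, max_attempts=3):
    """Call ask() until it returns the path of an existing file, or give up.

    `ask` is any zero-argument callable that returns a path string.
    """
    for _ in range(max_attempts):
        path = ask()
        # An empty string is what a cancelled file dialog returns
        if path and os.path.isfile(path):
            return path
        print("The file does not exist, try again.")
    return None


# Demo: two bad answers followed by a real temporary file
with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
    answers = iter(["", tmp.name + ".missing", tmp.name])
    chosen = ask_for_existing_file(lambda: next(answers))
    found = chosen == tmp.name

print(found)
```

With tkinter this could be called as ask_for_existing_file(filedialog.askopenfilename) after the root window has been created and withdrawn.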

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 John Gordon
Solution 2 isaactfa
Solution 3
Solution 4 Halfow