Tokenizing strings without punctuation in Python and putting the punctuation back afterwards

After reading here for a while, I have decided to post because I am not getting anywhere with my problem. I am just a "finance guy" and need some help coding in Python. I have posts from social media platforms and would like to tokenize the sentences for NLP purposes (without punctuation), lemmatize the tokens, and then reinstate the punctuation. For tokenizing and restoring the punctuation, I have used the following code so far:

from nltk.tokenize import word_tokenize

The join_punctuation function is meant to reattach the punctuation tokens to the words they belong to:

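def join_punctuation(seq, characters=r'''!"#%&'()*+-.,/:;<=>?@[\]^_`{|}~'''):
    # Raw string so the backslash is taken literally, not as an escape.
    characters = set(characters)
    seq = iter(seq)
    current = next(seq, None)
    if current is None:  # empty input: nothing to yield
        return

    for nxt in seq:
        if nxt in characters:
            # Punctuation token: glue it onto the preceding token.
            current += nxt
        else:
            yield current
            current = nxt
    yield current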

So far, tokenizing works the way I want it to: punctuation ends up in separate tokens, so I can lemmatize the words without punctuation getting in the way. Here is an example:

text = "Today, (I) bought $GME and sold $TSLA!!"
tokens = word_tokenize(text)
print(tokens)
# ['Today', ',', '(', 'I', ')', 'bought', '$', 'GME', 'and', 'sold', '$', 'TSLA', '!', '!']

But when I then put the punctuation back, it works fine for punctuation that follows a word, but not for punctuation that precedes a word (I get "( I)" and "$ GME" instead of "(I)" and "$GME"):

newtext = " ".join(join_punctuation(tokens))
print(newtext)
# Today,( I) bought $ GME and sold $ TSLA!!

Does anyone have an idea how to solve this? Thank you in advance!
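
One possible direction, offered only as a minimal sketch: treat punctuation that normally follows a word (such as "," or "!") differently from punctuation that normally precedes one (such as "(" or "$"), and glue the latter onto the next token instead of the previous one. The function name join_punctuation_both_sides and the two character sets closing and opening are assumptions made for illustration; they are not part of NLTK or of the code above, and which symbols count as "opening" would need to be adjusted to the data.

```python
def join_punctuation_both_sides(tokens, closing="!%),.:;?]}", opening="([{$#@"):
    """Hypothetical helper: attach closing punctuation to the previous
    token and opening punctuation to the following token."""
    closing = set(closing)
    opening = set(opening)
    pieces = []
    attach_next = False  # True when the previous token was opening punctuation

    for tok in tokens:
        if pieces and (tok in closing or attach_next):
            # Glue onto the last piece: either this token is closing
            # punctuation, or the last piece ends with an opening symbol.
            pieces[-1] += tok
        else:
            pieces.append(tok)
        attach_next = tok in opening

    return pieces


# Token list copied from the word_tokenize output shown in the question.
tokens = ['Today', ',', '(', 'I', ')', 'bought', '$', 'GME',
          'and', 'sold', '$', 'TSLA', '!', '!']

newtext = " ".join(join_punctuation_both_sides(tokens))
print(newtext)
# Today, (I) bought $GME and sold $TSLA!!
```

Because the helper only looks at whether a token is in one of the two sets, it should behave the same way when the word tokens have been lemmatized in between, as long as the punctuation tokens are left in place.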



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
