Divide a sentence into words using regex

I want to divide a sentence into words using regex. I'm using this code:

import re
sentence='<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully.'
sentence = re.split(r'\s|,|>|<|\[|\]:', sentence)

but I'm not getting what I expect.

The expected output is:

['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', 'tester-test.service: activation successfully.']

but what I'm getting is:

['', '30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1', '', 'tester-test.service:', 'activation', 'successfully.']

I tried to ignore the whitespace, but it should be ignored only in the last long "word", and I have no idea how to do that. Any suggestions or help would be appreciated. Thank you in advance.



Solution 1:[1]

It appears from the "expected output" for your example that as soon as a character is encountered that is preceded by ': ', the string consisting of that character and all characters that follow (to the end of the string) is to be returned. I assume that is one of the rules.

That suggests to me that you want to return matches (rather than the result of splitting) and that the regular expression to be matched should be a two-part alternation (that is, having the form ...|...) with the first part being

(?<=: ).+

That reads, "match one or more characters, greedily, the first being preceded by a colon followed by a space". (?<=: ) is a positive lookbehind.
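As a quick check of what this first part does on its own, here is a minimal sketch that applies it to the sentence from the question. The variable name `tail` is only illustrative.

```python
import re

sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."

# First alternative only: the lookbehind succeeds at the first character
# that is preceded by ': ', and .+ then runs to the end of the string.
tail = re.search(r"(?<=: ).+", sentence).group()
print(tail)  # tester-test.service: activation successfully.
```

The colons inside 11:45:50 are not followed by a space, so the first position where the lookbehind succeeds is the one right after "systemd[1]: ".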

Before reaching the first character that is preceded by a colon followed by a space we need to match strings comprised of digits, letters, and hyphens, and colons preceded and followed by a digit. The needed regular expression is therefore

rgx = r'(?<=: ).+|(?:[\da-zA-Z-]|(?<=\d):(?=\d))+'

You therefore may write

import re
sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."
re.findall(rgx, sentence)
  # => ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd',
  #     '1', 'tester-test.service: activation successfully.']
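To see why the first alternative is needed, the following sketch runs only the second alternative against the same sentence. The names `token_only` and `tokens` are illustrative, not part of the answer's code.

```python
import re

sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."

# Second alternative only: runs of digits, letters and hyphens, plus colons
# that sit between two digits (so the timestamp stays in one piece).
token_only = r"(?:[\da-zA-Z-]|(?<=\d):(?=\d))+"
tokens = re.findall(token_only, sentence)
print(tokens)
# ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1',
#  'tester-test', 'service', 'activation', 'successfully']
```

Without the lookbehind branch the trailing message is broken into separate words and loses its punctuation. In the full pattern the lookbehind branch is listed first, so once a position preceded by ': ' is reached, the rest of the string is consumed as a single match.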


The components of the regular expression are as follows.

(?<=: )        # the preceding string must be ': '
.+             # match one or more characters (greedily)
|              # or
(?:            # begin a non-capture group
  [\da-zA-Z-]  # match one character in the character class
  |            # or
  (?<=\d)      # the previous character must be a digit
  :            # match literal
  (?=\d)       # the next character must be a digit
)+             # end the non-capture group and execute one or more times

(?=\d) is a positive lookahead.
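The commented breakdown above can also be kept in the code itself by compiling the pattern with re.VERBOSE. This is a sketch, not part of the original answer, and the names `RGX_VERBOSE` and `result` are illustrative. One detail differs from the compact pattern: in verbose mode unescaped whitespace is ignored, so the space in the lookbehind has to be written as [ ].

```python
import re

RGX_VERBOSE = re.compile(r"""
    (?<=:[ ])        # the preceding string must be ': ' (space written as [ ])
    .+               # match one or more characters (greedily)
    |                # or
    (?:              # begin a non-capture group
      [\da-zA-Z-]    # match one character in the character class
      |              # or
      (?<=\d)        # the previous character must be a digit
      :              # match a literal colon
      (?=\d)         # the next character must be a digit
    )+               # end the group and repeat one or more times
""", re.VERBOSE)

sentence = "<30>Jan 11 11:45:50 test-tt systemd[1]: tester-test.service: activation successfully."
result = RGX_VERBOSE.findall(sentence)
print(result)
# ['30', 'Jan', '11', '11:45:50', 'test-tt', 'systemd', '1',
#  'tester-test.service: activation successfully.']
```

Because the group is non-capturing, findall returns the full text of each match rather than group contents.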

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Cary Swoveland