Extract a substring from a long text
I have a string like this:
string = """
(2021-07-04 11:58:43 PM BST)
---
le ) says tosen
Hi yohan
(2021-07-05 12:04:42 AM BST)
---
len (Trade ) says to sen
okay -5 / 0 .
(2021-07-04 11:47:14 PM BST)
---
Keun says to
HanSo
hello
---
(2021-07-05 12:09:41 AM BST)
---
len (Trade) says to sen
yes -5 / 0 TN -- / +2.5
---
* * *
Processe | 2021-07-05 12:26:44 AM
BST
---
"""
All I want is to extract the text after "says to" and before the next timestamp.
Expected output:
text=['yoh Hi yo','sen okay -5 / 0 ','sen yes -5 / 0 TN -- / +2.5']
What I have tried:
text=re.findall(r'says to (\D+)(',string)
Solution 1:[1]
There are digits between says to and the next timestamp in parentheses, so \D+ stops matching as soon as it reaches a digit.
Instead, after matching says to, you can capture what you want in group 1: the rest of that line plus all following lines that do not start with, for example, ( and a digit, or with --- (or make it more specific):
\bsays to (.*(?:\n(?!\(\d|---).*?)*?)\s*\n(?:\(\d|---)
For example:
import re

pattern = r"\bsays to (.*(?:\n(?!\(\d|---).*?)*?)\s*\n(?:\(\d|---)"
# `string` is the input text from the question
text = re.findall(pattern, string)
print(text)
Output
['yohan sen \n[[:Conversations will be recorded and may be monitored by the participants and\ntheir employers:]] Hi yohan', 'yohan sen \nokay -5 / 0', 'yohan sen \nyes -5 / 0 TN -- / +2.5']
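To see the difference between the two approaches in isolation, here is a minimal sketch. It assumes a shortened sample in the same layout as the string in the question; the sample text and the names sample, naive, matches and flattened are made up for illustration. The last step is one possible way to collapse the newlines inside each match into single spaces, which gets closer to the expected output:

```python
import re

# Shortened, hypothetical sample in the same layout as the question's string.
sample = (
    "(2021-07-04 11:58:43 PM BST)\n"
    "---\n"
    "len (Trade) says to sen\n"
    "Hi yohan\n"
    "(2021-07-05 12:04:42 AM BST)\n"
    "---\n"
    "len (Trade) says to sen\n"
    "okay -5 / 0 .\n"
    "(2021-07-05 12:09:41 AM BST)\n"
    "---\n"
)

# The \D+ idea with the trailing parenthesis escaped so the regex compiles.
# \D cannot cross the digit in "-5", so the second message is never matched.
naive = re.findall(r"says to (\D+)\(", sample)
print(naive)  # ['sen\nHi yohan\n']

# The pattern from the answer: take the rest of the "says to" line, then any
# following lines that do not start with "(" + digit or with "---".
pattern = r"\bsays to (.*(?:\n(?!\(\d|---).*?)*?)\s*\n(?:\(\d|---)"
matches = re.findall(pattern, sample)
print(matches)  # ['sen\nHi yohan', 'sen\nokay -5 / 0 .']

# Optional clean-up: replace every run of whitespace (including newlines)
# with a single space.
flattened = [" ".join(m.split()) for m in matches]
print(flattened)  # ['sen Hi yohan', 'sen okay -5 / 0 .']
```

The \D+ version only returns the first message here because the second one contains a digit, while the line-based pattern returns both.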
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
