extract sub-string from long text

I have a string as:

string="
(2021-07-04 11:58:43 PM BST)  
---  
le  ) says tosen  

Hi yohan

(2021-07-05 12:04:42 AM BST)  
---  
len (Trade ) says to sen  
okay -5 / 0  .



(2021-07-04 11:47:14 PM BST)  
---  
Keun says to
HanSo 
hello 
  
  
  
  
--- 


  
  

(2021-07-05 12:09:41 AM BST)  
---  
len (Trade) says to sen  
yes -5 / 0 TN -- / +2.5  
  
  
---  
  
* * *

Processe | 2021-07-05 12:26:44 AM
BST  
---

"

I want to extract the text after "says to" and before the next timestamp.

Expected output as:

text=['yoh Hi yo','sen okay -5 / 0 ','sen yes -5 / 0 TN -- / +2.5']

What I have tried:

text=re.findall(r'says to (\D+)\(',string)



Solution 1:[1]

There are digits between "says to" and the next timestamp in parentheses, so \D+ stops matching as soon as it reaches a digit.
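
To see the effect in isolation, here is a minimal sketch on a shortened, hypothetical one-message sample (the variable `sample` is not part of the question's data, only shaped like it):

```python
import re

# Hypothetical one-message sample, shaped like the data in the question.
sample = "len (Trade) says to sen  \nokay -5 / 0  .\n\n(2021-07-04 11:47:14 PM BST)"

# \D+ consumes non-digits only (newlines included), so it halts right
# before the "5" in "-5".
m = re.search(r"says to (\D+)", sample)
print(repr(m.group(1)))  # 'sen  \nokay -'

# Requiring a literal "(" straight after the \D+ run therefore finds nothing.
print(re.findall(r"says to (\D+)\(", sample))  # []
```

The capture ends at the first digit of the message body, long before the opening parenthesis of the next timestamp.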

Instead, after matching "says to", you can capture what you want in group 1: the rest of that line plus all following lines that do not start with, for example, ( and a digit, or with --- (or make it more specific).

\bsays to (.*(?:\n(?!\(\d|---).*?)*?)\s*\n(?:\(\d|---)
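
The same pattern can be written with re.VERBOSE so that each part carries a comment. This is only a sketch for reading the pattern: the `sample` text is a hypothetical, cleaned-up excerpt rather than the exact string from the question, and the literal spaces are written as `[ ]` because verbose mode ignores bare whitespace.

```python
import re

pattern = re.compile(r"""
    \bsays[ ]to[ ]        # literal "says to " (spaces must be explicit here)
    (                     # group 1: the text to capture
      .*                  #   rest of the "says to" line
      (?:                 #   then, lazily, zero or more further lines
        \n(?!\(\d|---)    #   that do not start with "(" + digit or "---"
        .*?
      )*?
    )
    \s*\n                 # optional trailing whitespace, then a newline
    (?:\(\d|---)          # next line starts a timestamp or a "---" rule
""", re.VERBOSE)

# Hypothetical, cleaned-up excerpt of the chat log.
sample = (
    "(2021-07-05 12:04:42 AM BST)\n"
    "---\n"
    "len (Trade) says to sen  \n"
    "okay -5 / 0\n"
    "\n"
    "(2021-07-05 12:09:41 AM BST)\n"
    "---\n"
    "len (Trade) says to sen  \n"
    "yes -5 / 0 TN -- / +2.5  \n"
    "---\n"
)

result = pattern.findall(sample)
print(result)
# ['sen  \nokay -5 / 0', 'sen  \nyes -5 / 0 TN -- / +2.5']
```

Note that the pattern needs a space after "says to", so a line that ends directly after "to" is not matched by it.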


For example:

import re

# "string" is the input text from the question
pattern = r"\bsays to (.*(?:\n(?!\(\d|---).*?)*?)\s*\n(?:\(\d|---)"
text = re.findall(pattern, string)
print(text)

Output

['yohan sen  \n[[:Conversations will be recorded and may be monitored by the participants and\ntheir employers:]] Hi yohan', 'yohan sen  \nokay -5 / 0', 'yohan sen  \nyes -5 / 0 TN -- / +2.5']
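
The captures keep their internal newlines and double spaces, while the expected output in the question is on a single line. One possible post-processing step, sketched here on hypothetical captures (`matches` is an illustrative list, not the output above), is to collapse every whitespace run into a single space:

```python
# Hypothetical captures, shaped like what re.findall returns for this pattern.
matches = ["sen  \nokay -5 / 0", "sen  \nyes -5 / 0 TN -- / +2.5"]

# str.split() with no arguments splits on any whitespace run and drops
# leading/trailing whitespace, so joining with " " flattens each capture.
flattened = [" ".join(m.split()) for m in matches]
print(flattened)
# ['sen okay -5 / 0', 'sen yes -5 / 0 TN -- / +2.5']
```

This drops trailing whitespace as well, so it will not reproduce a trailing space like the one in the question's second expected item.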

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1