Regex in combination with a list of keywords from a text file to parse into another text file
I have a simulation output with many lines; parts of it look like this:
</GraphicData>
</Connection>
<Connection>
<Name>ES1</Name>
<Type>Port</Type>
<From>Windfarm.Out</From>
<To>BR1.In</To>
<GraphicData>
<Icon>
<Points>
</GraphicData>
</Connection>
<Connection>
<Name>S2</Name>
<Type>Port</Type>
<From>BR1.Out</From>
<To>C1.In</To>
<GraphicData>
<Icon>
<Points>
The word between <Name> and </Name> varies from output to output. These names (here: ES1 and S2) are stored in a text file (keywords.txt).
What I need is a regex that takes the keywords from the list (keywords.txt), searches Simulationoutput.txt for matches, captures everything from each match up to and including </To>, and writes the results into another text file (finaloutput.txt).
Here is what I've done so far:
import ast
import re

with open("keywords.txt", 'r') as f:
    keywords = ast.literal_eval(f.read())
pattern = '|'.join(keywords)
results = []
with open('Simulationoutput.txt', 'r') as f:
    for line in f:
        matches = re.findall(pattern, line)
        if matches:
            results.append((line, len(matches)))
results = sorted(results, key=lambda x: x[1], reverse=True)
with open('finaloutput.txt', 'w') as f:
    for line, num_matches in results:
        f.write('{} {}\n'.format(num_matches, line))
The finaloutput.txt looks like this now:
<Name>ES1</Name>
<Name>S2</Name>
So the code almost does what I want, but the output should look like this:
<Name>ES1</Name>
<Type>Port</Type>
<From>Windfarm.Out</From>
<To>BR1.In</To>
<Name>S2</Name>
<Type>Port</Type>
<From>BR1.Out</From>
<To>C1.In</To>
Thanks in advance.
Solution 1:[1]
Although I strongly advise you to use xml.etree.ElementTree to do this, here's how you could do it using regex:
import re
keywords = ["ES1", "S2"]
pattern = "|".join([re.escape(key) for key in keywords])
pattern = fr"<Name>(?:{pattern}).*?<\/To>"
with open("Simulationoutput.txt", "r") as f:
    matches = re.findall(pattern, f.read(), flags=re.DOTALL)
with open("finaloutput.txt", "w") as f:
    f.write("\n\n".join(matches).replace("\n ", "\n"))
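The snippet above hard-codes the keyword list. To load it from keywords.txt as in the question, something like the following sketch might work. It assumes the file contains a Python list literal such as ["ES1", "S2"], which is what the question's use of ast.literal_eval suggests. The helper names load_keywords and extract_blocks are made up for this example, and the pattern additionally requires </Name> right after the keyword, which is a small deviation from the regex above.

```python
import ast
import re


def load_keywords(path):
    """Read a Python list literal such as ["ES1", "S2"] from a text file (assumed format)."""
    with open(path, "r") as f:
        return ast.literal_eval(f.read())


def extract_blocks(text, keywords):
    """Return every <Name>keyword</Name> ... </To> block found in text."""
    alternatives = "|".join(re.escape(key) for key in keywords)
    # Requiring </Name> right after the keyword keeps "S2" from also matching "S21".
    pattern = rf"<Name>(?:{alternatives})</Name>.*?</To>"
    # re.DOTALL lets "." cross line breaks, so one match can span several lines.
    return re.findall(pattern, text, flags=re.DOTALL)
```

The result of extract_blocks(f.read(), load_keywords("keywords.txt")) could then be joined with blank lines and written to finaloutput.txt exactly as in the snippet above.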
The regex used is the following:
<Name>(?:ES1|S2).*?<\/To>
<Name>: Matches the literal text <Name>.
(?:): Non-capturing group.
ES1|S2: Matches either ES1 or S2.
.*?: Matches any character, between zero and unlimited times, as few as possible (lazy). Note that . does not match newlines by default; it does here only because the re.DOTALL flag is set.
<\/To>: Matches </To>.
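For comparison, the xml.etree.ElementTree approach recommended at the start of this answer might look like the sketch below. It assumes the simulation output is a well-formed XML document in which each <Connection> has <Name>, <Type>, <From> and <To> as direct children. The <Model> root element, the sample data, and the function name connection_lines are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for Simulationoutput.txt.
SAMPLE = """<Model>
  <Connection>
    <Name>ES1</Name>
    <Type>Port</Type>
    <From>Windfarm.Out</From>
    <To>BR1.In</To>
    <GraphicData><Icon/></GraphicData>
  </Connection>
  <Connection>
    <Name>S2</Name>
    <Type>Port</Type>
    <From>BR1.Out</From>
    <To>C1.In</To>
  </Connection>
  <Connection>
    <Name>Other</Name>
    <Type>Port</Type>
    <From>C1.Out</From>
    <To>D1.In</To>
  </Connection>
</Model>"""


def connection_lines(root, keywords):
    """Rebuild the Name/Type/From/To lines for each Connection whose Name is a keyword."""
    wanted = set(keywords)
    blocks = []
    for conn in root.iter("Connection"):
        # Exact comparison on the element text, so "S2" never matches "S21".
        if conn.findtext("Name") not in wanted:
            continue
        lines = []
        for tag in ("Name", "Type", "From", "To"):
            value = conn.findtext(tag, default="")
            lines.append(f"<{tag}>{value}</{tag}>")
        blocks.append("\n".join(lines))
    return blocks


# For a real file: root = ET.parse("Simulationoutput.txt").getroot()
root = ET.fromstring(SAMPLE)
result = connection_lines(root, ["ES1", "S2"])
print("\n\n".join(result))
```

Unlike the regex, this compares the element text exactly and does not depend on how the file is indented or where its line breaks fall, but it only works if the whole file parses as XML.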
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Cubix48 |
