XML parsing to retrieve specific tags
I have an XML annotation file that contains <action> tags. For each action, I want to find its <origin> tag and read the value (to check whether it is Blur or not), and I also want to return that action's <start_time> and <stop_time>. How can I do this? Is there a toolkit for it? Do I need to read every tag and walk through all of its children?
<action>
<temporal_region>
<start_time>2683480</start_time>
<stop_time>2684448</stop_time>
</temporal_region>
<action_type/>
<state>1</state>
<actuator>Incident</actuator>
<description/><verb/><affected_list/><instrument_list/><recipient/>
<origin>Blur</origin>
<destination/>
</action>
Edit:
Here is the suggested code, slightly extended to contain multiple actions:
from bs4 import BeautifulSoup as bs
xml = """
<action>
<temporal_region>
<start_time>2683480</start_time>
<stop_time>2684448</stop_time>
</temporal_region>
<action_type/>
<state>1</state>
<actuator>Incident</actuator>
<description/><verb/><affected_list/><instrument_list/><recipient/>
<origin>Blur</origin>
<destination/>
</action>
<action>
<temporal_region>
<start_time>2683480</start_time>
<stop_time>2684448</stop_time>
</temporal_region>
<action_type/>
<state>1</state>
<actuator>Incident</actuator>
<description/><verb/><affected_list/><instrument_list/><recipient/>
<origin>Blur</origin>
<destination/>
</action>"""
soup = bs(xml, 'html.parser')
origin = soup.find('origin').text
print(len(origin))
start_time = soup.find('start_time').text
stop_time = soup.find('stop_time').text
if origin == 'Blur':
    print("success")
This prints 4, which I suppose counts the opening and closing tags of origin, even though I have only 2 elements.
Solution 1:[1]
Another solution is to use SimplifiedDoc from the simplified_scrapy package.
from simplified_scrapy.simplified_doc import SimplifiedDoc
xml = """
<action>
<temporal_region>
<start_time>2683480</start_time>
<stop_time>2684448</stop_time>
</temporal_region>
<action_type/>
<state>1</state>
<actuator>Incident</actuator>
<description/><verb/><affected_list/><instrument_list/><recipient/>
<origin>Blur</origin>
<destination/>
</action>
<action>
<temporal_region>
<start_time>2683480</start_time>
<stop_time>2684448</stop_time>
</temporal_region>
<action_type/>
<state>1</state>
<actuator>Incident</actuator>
<description/><verb/><affected_list/><instrument_list/><recipient/>
<origin>Blur</origin>
<destination/>
</action>"""
doc = SimplifiedDoc(xml)
actions = doc.selects('action')
for action in actions:
    print(action.start_time)
    print(action.stop_time)
    print(action.origin)
More SimplifiedDoc examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
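If adding a third-party package is not an option, the same per-action loop can probably be written with the standard library's xml.etree.ElementTree. The following is only a sketch under a few assumptions: the data is a shortened copy of the question's snippet, the second action's values are made up to show the non-Blur case, and the <root> wrapper is a hypothetical element added because the snippet has several top-level <action> elements and would otherwise not be well-formed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's data. The second action is hypothetical:
# its times and empty <origin/> are invented to show the non-Blur case.
xml = """
<action>
  <temporal_region>
    <start_time>2683480</start_time>
    <stop_time>2684448</stop_time>
  </temporal_region>
  <origin>Blur</origin>
</action>
<action>
  <temporal_region>
    <start_time>2690000</start_time>
    <stop_time>2691000</stop_time>
  </temporal_region>
  <origin/>
</action>
"""

# Wrap the fragment in a hypothetical <root> element so that the parser
# sees a single top-level element.
root = ET.fromstring(f"<root>{xml}</root>")

results = []
for action in root.iter("action"):
    # findtext() returns the text of the first matching descendant path;
    # an element without text yields an empty string.
    origin = action.findtext("origin", default="")
    start_time = action.findtext("temporal_region/start_time")
    stop_time = action.findtext("temporal_region/stop_time")
    results.append((origin == "Blur", start_time, stop_time))

for is_blur, start_time, stop_time in results:
    print(is_blur, start_time, stop_time)
```

For a real annotation file that already has a single root element, ET.parse(path).getroot() would take the place of the wrapping step.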
Solution 2:[2]
You can use BeautifulSoup for that.
from bs4 import BeautifulSoup as bs
xml = """
<action>
<temporal_region>
<start_time>2683480</start_time>
<stop_time>2684448</stop_time>
</temporal_region>
<action_type/>
<state>1</state>
<actuator>Incident</actuator>
<description/><verb/><affected_list/><instrument_list/><recipient/>
<origin>Blur</origin>
<destination/>
</action>"""
soup = bs(xml, 'html.parser')
origin = soup.find('origin').text
start_time = soup.find('start_time').text
stop_time = soup.find('stop_time').text
if origin == 'Blur':
    print("success")
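The snippet above reads only the first action, because find() returns the first match. It also explains the 4 seen in the question's edit: .text is a string, so len() counts the characters of "Blur" rather than the number of tags. A minimal sketch of a per-action loop with find_all() follows; the data is a shortened copy of the question's snippet, and the second action (its times and empty <origin/>) is a made-up example added to show the non-Blur case.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's data. The second action is hypothetical:
# its values are invented to show an action whose origin is not Blur.
xml = """
<action>
<temporal_region>
<start_time>2683480</start_time>
<stop_time>2684448</stop_time>
</temporal_region>
<origin>Blur</origin>
</action>
<action>
<temporal_region>
<start_time>2690000</start_time>
<stop_time>2691000</stop_time>
</temporal_region>
<origin/>
</action>
"""

soup = BeautifulSoup(xml, "html.parser")

# find() gives only the first <origin>; its .text is the string "Blur",
# so len() of it is the character count.
first_origin = soup.find("origin").text
print(len(first_origin))

records = []
for action in soup.find_all("action"):
    # Search inside this action only, not the whole document.
    origin_tag = action.find("origin")
    origin = origin_tag.get_text(strip=True) if origin_tag else ""
    records.append({
        "is_blur": origin == "Blur",
        "start_time": action.find("start_time").get_text(strip=True),
        "stop_time": action.find("stop_time").get_text(strip=True),
    })

for record in records:
    print(record)
```

Calling find() on each action rather than on soup keeps every start_time and stop_time paired with the action it belongs to.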
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | dabingsou |
| Solution 2 |
