How to extract XML tags from text in Python 3.x?

I have a text file that contains XML-like tags, and I want to extract the contents of those tags. I can extract some tags but not all of them, and I am not sure what I am doing wrong. My current code is below.

from bs4 import BeautifulSoup
txt = '''<Settings>
  <TRF>DailyCallActivity.TRF</TRF>
  <Heading>Dagoverzicht KCC</Heading>
</Settings>

<VisibleColoums>
  0,1,2,3,4,5,6,7,8,9
</VisibleColoums>


<SelectedValues>
<DATEFILTER> and Call_date  >= convert(date, dateadd(mi,$TZOFFSET,getutcdate()-1)) and  Call_date  < convert(date, dateadd(mi,$TZOFFSET,getutcdate()))  and 1 = (case when call_time >= (select value from config_settings where value_type='Tradinghours' and left(value_name,1)=datepart(dw,call_date) and right(value_name,1)=1 and config_settings.tenantid=tenants.tenantid )and call_time <=(select value from config_settings where value_type='Tradinghours' and left(value_name,1)=datepart(dw,call_date) and right(value_name,1)=2 and config_settings.tenantid=tenants.tenantid)then 1 else 0 end) and datepart(dw,call_date)<>0</DATEFILTER>

<TENANTSUPPORT> and tenants.TenantId = 233  </TENANTSUPPORT>

<FILTER> and dir_extensions.Extno in ('[email protected]') and data_calls.Group_no in ('[email protected]') and Vpn = '0'</FILTER>

<RT>30</RT>


</SelectedValues>

<ControlVal>
<ddlDate>Yesterday</ddlDate>
<chkTime>true</chkTime>
<rdbTradingHour>true</rdbTradingHour>
<tabExtension>true</tabExtension>
<rdbSelectedExtension>true</rdbSelectedExtension>
<rcbSelExtn>[email protected]</rcbSelExtn>
<tabDDI>true</tabDDI>
<rbdAllDDIs>true</rbdAllDDIs>
<tabGroup>true</tabGroup>
<chkAllDivisions>true</chkAllDivisions>
<chkAllDepartments>true</chkAllDepartments>
<chkAllCostcenters>true</chkAllCostcenters>
<chkAllSites>true</chkAllSites>
<chkAllContactGroups>false</chkAllContactGroups>
<txtContactGroupsValues>[email protected]</txtContactGroupsValues>
<chkAllAccounts>true</chkAllAccounts>
<chkAllAccountGroups>true</chkAllAccountGroups>
<tabCallType>true</tabCallType>
<chkIncomingCalls>true</chkIncomingCalls>
<rdbAnyAnswerStatus>true</rdbAnyAnswerStatus>
<rdbAnyRouting>true</rdbAnyRouting>
<rdbIncludeBouncedCalls>true</rdbIncludeBouncedCalls>
<chkOutgoingCalls>true</chkOutgoingCalls>
<chkLocal>true</chkLocal>
<chkNational>true</chkNational>
<chkInternational>true</chkInternational>
<chkMobile>true</chkMobile>
<chkOther>true</chkOther>
<rdbAnyOutgoingCalls>true</rdbAnyOutgoingCalls>
<chkInternalCalls>true</chkInternalCalls>
<rdbAnyInternalCalls>true</rdbAnyInternalCalls>
<chkMultimedia>true</chkMultimedia>
<chkIncludeImCalls>true</chkIncludeImCalls>
<chkIncludeSmsCalls>true</chkIncludeSmsCalls>
<tabRestriction>true</tabRestriction>
<chkRestrict>false</chkRestrict>
<chkCallDuRangeIncoming>false</chkCallDuRangeIncoming>
<chkCallDuRangeOutgoing>false</chkCallDuRangeOutgoing>
<chkRingTimeRangeAnswered>false</chkRingTimeRangeAnswered>
<chkRingTimeRangeUnAnswered>false</chkRingTimeRangeUnAnswered>
<splQryTag></splQryTag>
<sort></sort>

</ControlVal>
'''
soup = BeautifulSoup(txt, 'xml')
result = soup.find("TENANTSUPPORT")
print(result)

The result I get is 'None'. However, when I look for 'Settings' or 'TRF', I get the right result.



Solution 1:[1]

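If you change your parser to "lxml" and use soup.select_one('TENANTSUPPORT'), it will pick up the "TENANTSUPPORT" tag. The likely reason the 'xml' parser fails is that XML allows only one root element, and this text has several top-level elements (Settings, VisibleColoums, SelectedValues, ControlVal), so parsing effectively ends after </Settings>. That would explain why 'Settings' and 'TRF' are found but nothing after them is. The "lxml" HTML parser is lenient about this. Note that it lowercases tag names, so the tag is stored as "tenantsupport"; select_one still matches because CSS type selectors are case-insensitive for HTML documents.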

# Use the lenient lxml HTML parser instead of the strict 'xml' parser
soup = BeautifulSoup(txt, 'lxml')
result = soup.select_one("TENANTSUPPORT")
print(result)
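
If only the text inside a few known tags is needed, a regular expression can serve as a fallback that does not depend on any parser. This is not part of the original answer; it is a minimal sketch that assumes the tags carry no attributes and are not nested inside tags of the same name. The helper name extract_tag and the shortened sample text are made up for illustration.

```python
import re


def extract_tag(text, tag):
    """Return the stripped contents of the first <tag>...</tag> pair, or None.

    Assumes tags have no attributes and are not nested within themselves.
    """
    # Non-greedy match so we stop at the first closing tag; DOTALL lets
    # the contents span multiple lines.
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Shortened, hypothetical sample in the same shape as the question's text:
# several top-level elements, which is what trips up a strict XML parser.
sample = """<Settings>
  <TRF>DailyCallActivity.TRF</TRF>
</Settings>

<SelectedValues>
<TENANTSUPPORT> and tenants.TenantId = 233  </TENANTSUPPORT>
<RT>30</RT>
</SelectedValues>
"""

print(extract_tag(sample, "TENANTSUPPORT"))  # and tenants.TenantId = 233
print(extract_tag(sample, "RT"))             # 30
```

Unlike the "lxml" approach, this match is case-sensitive and returns a plain string rather than a Tag object, so it suits simple lookups better than general tree navigation.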

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1