How do I scrape HTML without closing tags using Python? [duplicate]

I got the following information from EDGAR:

<SERIES-AND-CLASSES-CONTRACTS-DATA>
<EXISTING-SERIES-AND-CLASSES-CONTRACTS>
<SERIES>
<OWNER-CIK>0000074663
<SERIES-ID>S000004984
<SERIES-NAME>Eaton Vance Income Fund of Boston
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013484
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class A
<CLASS-CONTRACT-TICKER-SYMBOL>EVIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013485
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class B
<CLASS-CONTRACT-TICKER-SYMBOL>EBIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013486
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class C
<CLASS-CONTRACT-TICKER-SYMBOL>ECIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013487
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class R
<CLASS-CONTRACT-TICKER-SYMBOL>ERIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013488
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class I
<CLASS-CONTRACT-TICKER-SYMBOL>EIBIX
</CLASS-CONTRACT>
</SERIES>
</EXISTING-SERIES-AND-CLASSES-CONTRACTS>
</SERIES-AND-CLASSES-CONTRACTS-DATA>

I would ideally like to scrape all the information for each tag and its subtags. It seems that the tags within a class contract (e.g., class-contract-id) do not have closing tags.

Possibly for this reason, I get the following result when I try this out:

from bs4 import BeautifulSoup

with open("temp.txt",'r') as html_file:
    content = html_file.read()
    soup = BeautifulSoup(content, 'lxml')
        
    series = soup.find('series') 
    
    for item in series:
        cik = item.find('owner-cik')
        print(cik)

Result:

-1
None

Is there any possible way to sort this out?



Solution 1:[1]

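The issue is that iterating over series yields its direct children, so item is never the SERIES tag. The first child is the newline text after <SERIES>; it is a string, so item.find('owner-cik') is the ordinary string method and returns -1. The second child is the OWNER-CIK tag itself, which contains no other OWNER-CIK tag, so find returns None. Calling series.find('owner-cik') directly will probably do what you want, as page 33 of the specification seems to say there's only one OWNER-CIK per SERIES.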
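
As a rough sketch of that suggestion, the snippet below searches from the SERIES tag instead of looping over its children. It makes a few assumptions that are not in the question: the data is inlined and abbreviated rather than read from temp.txt, "html.parser" replaces "lxml" so nothing beyond bs4 is needed, and value_of is a hypothetical helper name. Because the value tags are never closed, the parser nests each one inside the previous one, so the helper takes only the first piece of text under a tag rather than its full text.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the EDGAR snippet from the question.
content = """<SERIES>
<OWNER-CIK>0000074663
<SERIES-ID>S000004984
<SERIES-NAME>Eaton Vance Income Fund of Boston
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013484
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class A
<CLASS-CONTRACT-TICKER-SYMBOL>EVIBX
</CLASS-CONTRACT>
<CLASS-CONTRACT>
<CLASS-CONTRACT-ID>C000013485
<CLASS-CONTRACT-NAME>Eaton Vance Income Fund of Boston Class B
<CLASS-CONTRACT-TICKER-SYMBOL>EBIBX
</CLASS-CONTRACT>
</SERIES>
"""

# Assumption: "html.parser" keeps the sketch dependency-free beyond bs4;
# the question used "lxml".
soup = BeautifulSoup(content, "html.parser")
series = soup.find("series")


def value_of(parent, name):
    """Hypothetical helper: first text found under the first <name> tag in parent."""
    tag = parent.find(name)
    if tag is None:
        return None
    # Unclosed tags end up nested inside each other, so tag.get_text() would
    # also include the text of the nested tags. Take only the first string.
    return next(tag.stripped_strings, None)


# Search from the SERIES tag itself instead of iterating over its children.
cik = value_of(series, "owner-cik")
series_name = value_of(series, "series-name")

# CLASS-CONTRACT does have closing tags, so each one is a separate subtree.
classes = [
    {
        "id": value_of(contract, "class-contract-id"),
        "name": value_of(contract, "class-contract-name"),
        "ticker": value_of(contract, "class-contract-ticker-symbol"),
    }
    for contract in series.find_all("class-contract")
]

print(cik, series_name)
for entry in classes:
    print(entry)
```

The same helper can be reused for the other unclosed tags; only the tag name changes. Results with "lxml" may differ slightly in how the tree is wrapped, so it is worth printing soup.prettify() once to check the nesting.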

It looks like there are also a number of existing python libraries for downloading/parsing EDGAR data. You may be able to use or modify one of those instead.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: yut23