How can I get TREC documents extracted?

I've been trying to extract TREC documents into separate text files using the code below, but I'm not getting the expected output. Here is an example of the content of my TREC file, which contains two documents between <DOC> and </DOC> tags:

<DOC>
    <DOCNO>
       WSJ910102-0145
    </DOCNO>
    <DOCID>
       910102-0145.
    </DOCID>
    <HL>
       xxxx
    </HL>
    <DATE>
        01/02/91
    </DATE>
    <LP>
       text LP1
    </LP>
    <TEXT>
       text1
    </TEXT>
</DOC>
<DOC>
    <DOCNO>
        WSJ910102-0144
    </DOCNO>
    <DOCID>
        910102-0144.
    </DOCID>
    <HL>
       ....
    </HL>
    <DATE>
       01/02/91
    </DATE>
    <LP>
       text LP2
    </LP>
    <TEXT>
       text2
    </TEXT>
</DOC>

I want to extract each document into a separate text file. I need the content of the "LP" and "TEXT" tags together with the document number "DOCNO". Here is my code:

import re

text = text.replace('\n', ' ').replace('\t', ' ')
i = 0
txtDoc = ''
regexTxt = '(<LP>(.*?)</LP>)? <TEXT>(.*?)</TEXT>'
regexDoc = '<DOC>(.*?)</DOC>'
regexDocNo = '<DOCNO>(.*?)</DOCNO>'
pattern = re.compile(r'<DOC>(.*?)</DOC>')
iterator = re.finditer(pattern, text)
count = 0
for match in iterator:
    count += 1
res=re.search(regexDoc,text)
while (i<count):
    txtDoc=res.group(i)
    resNo=re.search(regexDocNo,txtDoc)
    docNo=resNo.group()
    docNo=docNo.replace('<DOCNO>', ' ').replace('</DOCNO>', ' ')
    res2=re.search(regexTxt,txtDoc)
    txt=res2.group()
    txt=txt.replace('<TEXT>', ' ').replace('</TEXT>', ' ').replace('<LP>',' ').replace('</LP>',' ')
    print("Document : %s \n %s" %(docNo,txt))
    i+=1

print ("Fin")

Here is the printed result:

Document :       WSJ910102-0145
          text1
Document :       WSJ910102-0145
          text1
Fin

And here is what I want to get:

Document :       WSJ910102-0145
           text LP1
           text1 
Document :       WSJ910102-0144
           text LP2
           text2
Fin


Solution 1:[1]

I would try to use an XML parser. Here's some sample code showing how to parse such a structure:

import xml.etree.ElementTree as ElementTree

with open('test.trec', 'r') as f:   # Reading file
    xml = f.read()

xml = '<ROOT>' + xml + '</ROOT>'   # Let's add a root tag

root = ElementTree.fromstring(xml)

# Simple loop through each document
for doc in root:
    print(
        'DOC NO: {}, DOC ID: {}, HL: {}, LP: {}, DATE: {}, TEXT: {}'.format(
            doc.find('DOCNO').text.strip(),
            doc.find('DOCID').text.strip(),
            doc.find('HL').text.strip(),
            doc.find('LP').text.strip(),
            doc.find('DATE').text.strip(),
            doc.find('TEXT').text.strip(),
        )
    )

Adding a root tag is required because the raw file contains multiple top-level <DOC> elements, and valid XML must have exactly one root element.

Sample output:

DOC NO: WSJ910102-0145, DOC ID: 910102-0145., HL: xxxx, LP: text LP1, DATE: 01/02/91, TEXT: text1
DOC NO: WSJ910102-0144, DOC ID: 910102-0144., HL: ...., LP: text LP2, DATE: 01/02/91, TEXT: text2
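
Since the goal in the question is one file per document, the same parsing approach can also write each document out, keyed by its DOCNO. This is a minimal sketch that inlines the question's two sample documents as a string; in practice you would read that string from your .trec file instead:

```python
import xml.etree.ElementTree as ElementTree

# The two sample documents from the question, inlined for a self-contained demo;
# normally this string would come from open('test.trec').read().
trec = """<DOC>
    <DOCNO>
       WSJ910102-0145
    </DOCNO>
    <LP>
       text LP1
    </LP>
    <TEXT>
       text1
    </TEXT>
</DOC>
<DOC>
    <DOCNO>
        WSJ910102-0144
    </DOCNO>
    <LP>
        text LP2
    </LP>
    <TEXT>
        text2
    </TEXT>
</DOC>"""

root = ElementTree.fromstring('<ROOT>' + trec + '</ROOT>')  # same root-tag trick
for doc in root:
    doc_no = doc.find('DOCNO').text.strip()
    # One output file per document, named after its DOCNO
    with open(doc_no + '.txt', 'w') as out:
        out.write(doc.find('LP').text.strip() + '\n')
        out.write(doc.find('TEXT').text.strip() + '\n')
```

This produces WSJ910102-0145.txt and WSJ910102-0144.txt, each containing the LP line followed by the TEXT line, which matches the output the question asks for.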

Solution 2:[2]

Since not all TREC collections put the document content in a single fixed tag, code that relies entirely on a specific XML structure will not work for every collection.

As an alternative, the following code uses the xml package to get the content between <DOC> and </DOC>, and then strips all the tags from that content.

import xml.etree.ElementTree as ElementTree
import re
tag_exp = re.compile('<.*?>') 

def cleanTag(rawDoc):
    cleanDoc = re.sub(tag_exp, '', rawDoc)
    return cleanDoc

# this function needs to be called for each of the files in the directory
def processFile(filePath):
    with open(filePath, 'r') as f:   # Reading file
        xml = f.read()

    xml = '<ROOT>' + xml + '</ROOT>'   # Needed to make the file as proper XML.

    root = ElementTree.fromstring(xml)
    for doc in root.findall('DOC'):
        docid = doc.find('DOCNO').text.strip() # field 1
        content = ElementTree.tostring(doc, encoding='utf8').decode('utf8')
        cleanContent = cleanTag(content)    # REMOVING ALL THE TAGS.  # field 2
        # NOW YOU NEED TO INDEX cleanContent and docid
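
For completeness, here is a quick standalone check of the tag-stripping regex used above; the input string is a made-up one-line snippet in the question's format:

```python
import re

tag_exp = re.compile('<.*?>')

def cleanTag(rawDoc):
    # Remove every <...> tag, keeping only the text between tags
    return re.sub(tag_exp, '', rawDoc)

raw = '<DOC><DOCNO>WSJ910102-0145</DOCNO><TEXT>text1</TEXT></DOC>'
print(cleanTag(raw))  # -> WSJ910102-0145text1
```

Note that the non-greedy `<.*?>` pattern works here because TREC tags never contain a literal `<` or `>` inside them.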

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 HamZa
Solution 2 Doi