How can I get TREC documents extracted?
I've been trying to extract TREC documents into separate text files using the code below, but I don't get the expected result. Here is an example of the content of my TREC file, which contains 2 documents, each between the tags <DOC> and </DOC>:
<DOC>
<DOCNO>
WSJ910102-0145
</DOCNO>
<DOCID>
910102-0145.
</DOCID>
<HL>
xxxx
</HL>
<DATE>
01/02/91
</DATE>
<LP>
text LP1
</LP>
<TEXT>
text1
</TEXT>
</DOC>
<DOC>
<DOCNO>
WSJ910102-0144
</DOCNO>
<DOCID>
910102-0144.
</DOCID>
<HL>
....
</HL>
<DATE>
01/02/91
</DATE>
<LP>
text LP2
</LP>
<TEXT>
text2
</TEXT>
</DOC>
I want to extract each document into a separate text file. I need the content of the "LP" and "TEXT" tags together with the document number from "DOCNO". Here is my code:
text=text.replace('\n',' ').replace('\t', ' ')
i=0
txtDoc=''
regexTxt='(<LP>(.*?)</LP>)? <TEXT>(.*?)</TEXT>'
regexDoc='<DOC>(.*?)</DOC>'
regexDocNo='<DOCNO>(.*?)</DOCNO>'
pattern = compile(r'<DOC>(.*?)</DOC>')
iterator = finditer(pattern, text)
count = 0
for match in iterator:
    count +=1
res=re.search(regexDoc,text)
while (i<count):
    txtDoc=res.group(i)
    resNo=re.search(regexDocNo,txtDoc)
    docNo=resNo.group()
    docNo=docNo.replace('<DOCNO>', ' ').replace('</DOCNO>', ' ')
    res2=re.search(regexTxt,txtDoc)
    txt=res2.group()
    txt=txt.replace('<TEXT>', ' ').replace('</TEXT>', ' ').replace('<LP>',' ').replace('</LP>',' ')
    print("Document : %s \n %s" %(docNo,txt))
    i+=1
print ("Fin")
Here is the printed result:
Document : WSJ910102-0145
text1
Document : WSJ910102-0145
text1
Fin
And this is what I want to get:
Document : WSJ910102-0145
text LP1
text1
Document : WSJ910102-0144
text LP2
text2
Fin
Solution 1:[1]
I would use an XML parser. Here is sample code showing how to parse such a structure:
import xml.etree.ElementTree as ElementTree

with open('test.trec', 'r') as f:  # Read the whole file
    xml = f.read()

xml = '<ROOT>' + xml + '</ROOT>'  # Wrap everything in a single root tag
root = ElementTree.fromstring(xml)

# Simple loop through each document
for doc in root:
    print(
        'DOC NO: {}, DOC ID: {}, HL: {}, LP: {}, DATE: {}, TEXT: {}'.format(
            # The arguments must follow the order of the labels above
            doc.find('DOCNO').text.strip(),
            doc.find('DOCID').text.strip(),
            doc.find('HL').text.strip(),
            doc.find('LP').text.strip(),
            doc.find('DATE').text.strip(),
            doc.find('TEXT').text.strip(),
        )
    )
Adding a root tag is required to make the file parseable: an XML document must have exactly one root element, and the TREC file contains several top-level <DOC> elements.
Sample output:
DOC NO: WSJ910102-0145, DOC ID: 910102-0145., HL: xxxx, LP: text LP1, DATE: 01/02/91, TEXT: text1
DOC NO: WSJ910102-0144, DOC ID: 910102-0144., HL: blabla, LP: text LP2, DATE: 01/02/91, TEXT: text2
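The question asks for one text file per document, holding the LP and TEXT content and identified by DOCNO. The parser approach above could be extended along these lines. This is only a sketch: the function name `split_trec`, the `out_dir` parameter and the `<DOCNO>.txt` file naming are assumptions made for illustration, not part of the original answer.

```python
import os
import xml.etree.ElementTree as ElementTree


def split_trec(trec_text, out_dir):
    """Write one text file per <DOC>, named after its DOCNO; return the paths."""
    # Wrap in a root tag so the parser sees a single root element.
    root = ElementTree.fromstring('<ROOT>' + trec_text + '</ROOT>')
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for doc in root.findall('DOC'):
        doc_no = doc.findtext('DOCNO', default='').strip()
        # A tag such as <LP> may be missing, so fall back to an empty string.
        parts = [doc.findtext(tag, default='').strip() for tag in ('LP', 'TEXT')]
        body = '\n'.join(p for p in parts if p)
        path = os.path.join(out_dir, doc_no + '.txt')  # hypothetical naming scheme
        with open(path, 'w', encoding='utf-8') as out:
            out.write(body + '\n')
        paths.append(path)
    return paths
```

Passing the contents of the TREC file and a target directory should produce one file per document. Using `findtext` with a default avoids the `AttributeError` that `doc.find('LP').text` would raise for a document without an LP tag.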
Solution 2:[2]
Not all TREC collections use the same tag (such as <TEXT>) for the document content, so code that relies entirely on looking up specific tags with an XML parser will not work on every collection.
As an alternative, the following code uses the xml package to get everything between <DOC> and </DOC>, and then extracts the content by stripping all the tags.
import xml.etree.ElementTree as ElementTree
import re

tag_exp = re.compile('<.*?>')

def cleanTag(rawDoc):
    # Remove anything that looks like a tag, keeping only the text
    cleanDoc = re.sub(tag_exp, '', rawDoc)
    return cleanDoc

# this function needs to be called for each of the files in the directory
def processFile(filePath):
    with open(filePath, 'r') as f:  # Reading file
        xml = f.read()
    xml = '<ROOT>' + xml + '</ROOT>'  # Needed to make the file proper XML.
    root = ElementTree.fromstring(xml)
    for doc in root.findall('DOC'):
        docid = doc.find('DOCNO').text.strip()  # field 1
        content = ElementTree.tostring(doc, encoding='utf8').decode('utf8')
        cleanContent = cleanTag(content)  # REMOVING ALL THE TAGS.  # field 2
        # NOW YOU NEED TO INDEX cleanContent and docid
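For comparison, the regex approach from the question can also be made to work without an XML parser. The question's loop printed the first document twice because `re.search` returns only the first match, and `res.group(0)` and `res.group(1)` are the whole match and its first capture group, both taken from that same first document. Iterating with `finditer` gives one match per <DOC> block instead. A minimal sketch, assuming well-nested tags; the function name `extract_docs` and the pattern names are illustrative:

```python
import re

# DOTALL lets '.' match newlines, so the text does not need to be flattened first.
DOC_RE = re.compile(r'<DOC>(.*?)</DOC>', re.DOTALL)
DOCNO_RE = re.compile(r'<DOCNO>(.*?)</DOCNO>', re.DOTALL)
FIELD_RE = re.compile(r'<(LP|TEXT)>(.*?)</\1>', re.DOTALL)


def extract_docs(text):
    """Return a list of (docno, content) pairs, one per <DOC> block."""
    docs = []
    # finditer yields a separate match object for every <DOC>...</DOC> block.
    for doc_match in DOC_RE.finditer(text):
        raw = doc_match.group(1)
        no_match = DOCNO_RE.search(raw)
        doc_no = no_match.group(1).strip() if no_match else ''
        # Collect LP and TEXT content in the order it appears in the document.
        parts = [m.group(2).strip() for m in FIELD_RE.finditer(raw)]
        docs.append((doc_no, '\n'.join(parts)))
    return docs
```

Each returned pair can then be printed or written to its own file. Because no XML parsing is involved, no root tag has to be added.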
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | HamZa |
| Solution 2 | Doi |
