NLTK PlaintextCorpusReader: reading files in and splitting them on delimiters

I would like to split input text on delimiters and extract only a specific part of each line for processing with NLTK. Here is an example of the input:

[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
##repost from january 13 , 2004 with a better fit title . 
support[-3][u]##apex does n't answer the phone . 
player[-2][p]##unfortunately it turns out to be the " disposable " type . 
battery[+2]##i treat the battery well and it has lasted . 
sound quality[+2], fm[+1], earpiece[+1]##while i had the phone , the positive features were : good sound quality and an excellent fm phone and earpiece . 
speakerphone[+3][u]##you can be up to about 3 feet away from it and it will still work perfectly . 
size[+2],weight[+2]##i like the size and weight of this little critter . 
[t]excellent picture quality / color 
canon g3[+3]##i bought my canon g3 about a month ago and i have to say i am very satisfied . 
zoom[+2],lense[+2]##the extended zoom range and faster lense put it at the top of it 's class . 

I am trying to use NLTK to split the file into lines and then keep only the part after the ##. Here is my attempt; however, I could not find a way to split the file based on the delimiter:

# Import the Natural Language Toolkit library
import nltk
# Import the operator module
import operator

from nltk.corpus import PlaintextCorpusReader

# Root folder where the text files are located
corpus_root = "Data"

# Read the list of files
filelists = PlaintextCorpusReader(corpus_root, '.*', encoding='utf-8')

# List down the IDs of the files read from the local storage
filelists.fileids()

# Read the text from a specific file.
# Plaintext corpus readers can return the corpus as
# raw text, a list of words, a list of sentences, or a list of paragraphs.
rawlist = filelists.raw('text.txt')
wordslist = filelists.words('text.txt')
sentslist = filelists.sents('text.txt')
paraslist = filelists.paras('text.txt')

print("a list of filenames:")
print(filelists.fileids(),'\n')
print("a list of words:")
print(wordslist,'\n')
print("a list of sentences:")
print(sentslist,'\n')
print("a list of paragraphs:")
print(paraslist,'\n')
print("the raw text:")
print(rawlist,'\n')

Desired output:

troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
repost from january 13 , 2004 with a better fit title .
apex does n't answer the phone . 
unfortunately it turns out to be the " disposable " type . 
i treat the battery well and it has lasted . 
while i had the phone , the positive features were : good sound quality and an excellent fm phone and earpiece . 
you can be up to about 3 feet away from it and it will still work perfectly . 
i like the size and weight of this little critter . 
excellent picture quality / color 
i bought my canon g3 about a month ago and i have to say i am very satisfied . 
the extended zoom range and faster lense put it at the top of it 's class . 


Solution 1:[1]

I used NLTK's existing corpus import mechanism to load the files for this project. First, I located the corpus directory with from nltk.corpus import product_reviews_1, since product_reviews_1 is a known corpus in the current NLTK data package. Then I ran nltk.corpus.product_reviews_1.abspaths() to get the exact paths of its files. After that, I copied my folders into the corpora directory.
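
If copying the files into the corpora directory is not an option, the split on ## from the question can also be sketched with plain string handling on the raw text. The following is a minimal sketch, not part of NLTK: the function name extract_text and the shortened sample are hypothetical, and it assumes that title lines start with [t] and that a line without ## should be kept whole.

```python
# Leading "[t]" marks a review title in this annotation format.
TITLE_TAG = "[t]"


def extract_text(raw):
    """Return the free-text part of each non-empty line of raw.

    Title lines lose their leading "[t]" tag; every other line keeps
    only what follows the first "##" delimiter.
    """
    results = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(TITLE_TAG):
            results.append(line[len(TITLE_TAG):].strip())
        else:
            # partition splits on the first "##" only
            _, sep, text = line.partition("##")
            # Assumption: a line without "##" is kept unchanged
            results.append(text.strip() if sep else line)
    return results


# Shortened sample taken from the input shown in the question
sample = """[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
##repost from january 13 , 2004 with a better fit title .
support[-3][u]##apex does n't answer the phone .
size[+2],weight[+2]##i like the size and weight of this little critter .
[t]excellent picture quality / color
"""

for sentence in extract_text(sample):
    print(sentence)
```

With the script from the question, the same function could be applied to the string returned by the reader, for example extract_text(filelists.raw('text.txt')). The resulting lines can then be tokenized with NLTK as needed.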

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 N K