Grouping NLTK entities
I have the following code:
import nltk
page = '''
EDUCATION
University
Won first prize for the best second year group project, focused on software engineering.
Sixth Form
Mathematics, Economics, French
UK, London
'''
for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(''.join(c[0] for c in chunk), ' ', chunk.label())
Returns:
EDUCATION ORGANIZATION
UniversityWon ORGANIZATION
Sixth PERSON
FormMathematics ORGANIZATION
Economics PERSON
FrenchUK GPE
London GPE
I'd like to group these into a data structure keyed by entity label, maybe as lists:
ORGANIZATION = [EDUCATION, UniversityWon, FormMathematics]
PERSON = [Sixth, Economics]
GPE = [FrenchUK, London]
Or maybe as a dictionary with the keys ORGANIZATION, PERSON and GPE, whose values are the lists above.
Solution 1:[1]
A dictionary makes more sense, perhaps something like this:
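from collections import defaultdict
import nltk

# Maps each entity label to the list of chunk texts found with that label.
entities = defaultdict(list)
for sent in nltk.sent_tokenize(page):  # `page` is the string from the question
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):  # only named-entity subtrees have a label
            entities[chunk.label()].append(''.join(c[0] for c in chunk))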
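To see what the resulting dictionary looks like without downloading the NLTK models, here is a minimal sketch that applies the same defaultdict grouping to the (text, label) pairs printed in the question. The helper name `group_by_label` and the hard-coded `pairs` list are made up for illustration and are not part of NLTK.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (text, label) pairs into {label: [text, ...]}, keeping input order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    # Return a plain dict so that looking up an unknown label raises KeyError
    # instead of silently creating an empty list.
    return dict(grouped)


# Stand-in data: the chunks printed in the question, hard-coded here.
pairs = [
    ("EDUCATION", "ORGANIZATION"),
    ("UniversityWon", "ORGANIZATION"),
    ("Sixth", "PERSON"),
    ("FormMathematics", "ORGANIZATION"),
    ("Economics", "PERSON"),
    ("FrenchUK", "GPE"),
    ("London", "GPE"),
]

entities = group_by_label(pairs)
print(entities["ORGANIZATION"])
print(entities["PERSON"])
print(entities["GPE"])
```

Note that ''.join glues the tokens of a multi-word chunk together, which is why the question shows UniversityWon; using ' '.join instead would keep a space between the tokens.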
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | fsimonjetz |
