Grouping NLTK entities

I have the following code:

import nltk
 
page = """
EDUCATION   
University
Won first prize for the best second year group project, focused on software engineering.
Sixth Form
Mathematics, Economics, French
UK, London
"""


for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Only named-entity chunks have a label; plain tokens are (word, tag) tuples.
        if hasattr(chunk, 'label'):
            print(''.join(c[0] for c in chunk), ' ', chunk.label())

Returns:

EDUCATION   ORGANIZATION
UniversityWon   ORGANIZATION
Sixth   PERSON
FormMathematics   ORGANIZATION
Economics   PERSON
FrenchUK   GPE
London   GPE

I'd like to group these results by entity label into some data structure, perhaps one list per label:

ORGANIZATION = [EDUCATION, UniversityWon, FormMathematics]
PERSON = [Sixth, Economics]
GPE = [FrenchUK, London]

Or maybe a dictionary with the keys ORGANIZATION, PERSON and GPE, whose values are the lists above.



Solution 1:[1]

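A dictionary makes more sense. With a defaultdict(list) you can append each entity under its label without first checking whether the key exists, perhaps something like this: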

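from collections import defaultdict

import nltk

# Maps each entity label (e.g. 'GPE') to the list of entity strings found.
entities = defaultdict(list)

# `page` is the text defined in the question.
for sent in nltk.sent_tokenize(page):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Only named-entity chunks have a label; plain tokens are (word, tag) tuples.
        if hasattr(chunk, 'label'):
            entities[chunk.label()].append(''.join(c[0] for c in chunk))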
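
To see what the grouped result looks like without running the NLTK models, here is a minimal sketch of the same defaultdict pattern. It assumes the (text, label) pairs shown under "Returns:" in the question; the name labelled_chunks is a hypothetical stand-in for the chunker output, not part of NLTK.

```python
from collections import defaultdict

# Hypothetical stand-in for the NLTK output: (entity text, label) pairs
# copied from the "Returns:" listing in the question.
labelled_chunks = [
    ("EDUCATION", "ORGANIZATION"),
    ("UniversityWon", "ORGANIZATION"),
    ("Sixth", "PERSON"),
    ("FormMathematics", "ORGANIZATION"),
    ("Economics", "PERSON"),
    ("FrenchUK", "GPE"),
    ("London", "GPE"),
]

entities = defaultdict(list)
for text, label in labelled_chunks:
    # A missing label is created with an empty list on first access.
    entities[label].append(text)

# Convert to a plain dict, which is convenient for printing or serialising.
grouped = dict(entities)
print(grouped)
```

Each label ends up as a key holding the list requested in the question, with entities in the order they were found. Converting to a plain dict at the end is optional; it just avoids creating new empty lists by accident when looking up a label that never occurred.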

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 fsimonjetz