Python: Find the vocabulary of bigrams
I have a list of tweets (tokenized and preprocessed). It's like this:
['AT_TOKEN',
'what',
'AT_TOKEN',
'said',
'END',
'AT_TOKEN',
'plus',
'you',
've',
'added',
'commercials',
'to',
'the',
'experience',
'tacky',
'END',
'AT_TOKEN',
'i',
'did',
'nt',
'today',
'must',
'mean',
'i',
'need',
'to',
'take',
'another',
'trip',
'END']
END signifies that a tweet has ended and a new one has begun.
I want to find the bigram vocabulary for this list, but I'm having a hard time figuring out how to do it efficiently. I have already worked out how to do it for unigrams, like this:
from collections import defaultdict

def unigram_vocab(data):
    unique_words = defaultdict(int)
    for word in data:
        unique_words[word] = 1
    return list(unique_words.keys())
The problem is that I first need to convert this list into bigrams and then find the vocabulary of those bigrams.
Can anybody help me figure this out?
Solution 1:[1]
To complement furas' answer: you can use collections.Counter together with itertools.pairwise (available since Python 3.10) to count bigrams very efficiently:
from collections import Counter
from itertools import pairwise
# c = Counter(zip(data, data[1:])) on Python < 3.10
c = Counter(pairwise(data))
print(c)
Output:
Counter({('END', 'AT_TOKEN'): 2, ('AT_TOKEN', 'what'): 1, ('what', 'AT_TOKEN'): 1, ('AT_TOKEN', 'said'): 1, ('said', 'END'): 1, ...
Counter works just like a dictionary, but extends it with some useful methods. See https://docs.python.org/3/library/collections.html#collections.Counter
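Since the question asks for the vocabulary (the distinct bigrams) rather than their frequencies, one way to get it from the Counter above, or directly without counting, could look like the following minimal sketch. The variable name bigram_vocab is just for illustration; c and data are the names used in the answer and question above.
from itertools import pairwise

# The distinct bigrams are simply the Counter's keys
bigram_vocab = list(c.keys())

# Or build the vocabulary directly without counting
# (use set(zip(data, data[1:])) on Python < 3.10)
bigram_vocab = list(set(pairwise(data)))

# Counter also offers handy methods, e.g. the most frequent bigrams
print(c.most_common(3))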
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mwo |
