Python: Find vocabulary of a bigram

I have a list of tweets (tokenized and preprocessed). It's like this:

['AT_TOKEN',
 'what',
 'AT_TOKEN',
 'said',
 'END',
 'AT_TOKEN',
 'plus',
 'you',
 've',
 'added',
 'commercials',
 'to',
 'the',
 'experience',
 'tacky',
 'END',
 'AT_TOKEN',
 'i',
 'did',
 'nt',
 'today',
 'must',
 'mean',
 'i',
 'need',
 'to',
 'take',
 'another',
 'trip',
 'END']

END signifies that a tweet has ended and a new one has begun.

I want to find the bigram vocabulary for this list, but I'm having a hard time working out how to do it efficiently. I have already figured out how to do this for unigrams:

from collections import defaultdict

def unigram_vocabulary(data):
    unique_words = defaultdict(int)
    for word in data:
        unique_words[word] = 1
    return list(unique_words.keys())

The problem is that I first need to convert this list into bigrams and then find the vocabulary for those bigrams.

Can anybody help me figure this out?
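One straightforward adaptation of the unigram approach above is to pair each token with its successor using zip and collect the unique pairs. A minimal sketch, assuming data is the flat token list shown earlier (bigram_vocabulary is just an illustrative name):

data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END']  # shortened token list as in the question

def bigram_vocabulary(tokens):
    # Pair each token with the one that follows it
    bigrams = zip(tokens, tokens[1:])
    # dict.fromkeys keeps only the unique pairs, preserving first-seen order
    return list(dict.fromkeys(bigrams))

print(bigram_vocabulary(data))
# [('AT_TOKEN', 'what'), ('what', 'AT_TOKEN'), ('AT_TOKEN', 'said'), ('said', 'END')]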



Solution 1:[1]

To complement furas' answer: if you are on Python 3.10 or newer, you can use collections.Counter together with itertools.pairwise to count bigrams very efficiently:

from collections import Counter
from itertools import pairwise  

# c = Counter(zip(data, data[1:])) on Python < 3.10
c = Counter(pairwise(data))

print(c)

Output:

Counter({('END', 'AT_TOKEN'): 2, ('AT_TOKEN', 'what'): 1, ('what', 'AT_TOKEN'): 1, ('AT_TOKEN', 'said'): 1, ('said', 'END'): 1, ...

Counter works just like a dictionary, but extends it with some useful methods. See https://docs.python.org/3/library/collections.html#collections.Counter
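Since the question asks for the bigram vocabulary rather than the counts, note that the distinct bigrams are simply the Counter's keys. A small follow-up sketch, building on the c from the snippet above:

# The vocabulary is the set of distinct bigrams, i.e. the keys of the Counter
bigram_vocab = list(c.keys())
print(bigram_vocab)
# [('AT_TOKEN', 'what'), ('what', 'AT_TOKEN'), ('AT_TOKEN', 'said'), ('said', 'END'), ('END', 'AT_TOKEN'), ...]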

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mwo