Bag-of-words format from two different lists
I have two lists: A = [['a','b','c'],['a','b','c']] and B = ['a','b','c','a','b','c']. I would like to convert them into a bag-of-words format, i.e. a list of (token_id, token_count) 2-tuples. I would like to keep the structure of list A but use list B for counting the tokens. The code I currently use is: corpus = [id2word2.doc2bow(text) for text in texts], where texts is a dictionary of the A list. The result I would like to have is the following:
BoW = [[(1,2),(2,2),(3,2)],[(1,2),(2,2),(3,2)]]
and not like this:
BoW = [[(1,1),(2,1),(3,1)],[(1,1),(2,1),(3,1)]]
BoW = [(1,2),(2,2),(3,2),(1,2),(2,2),(3,2)]
EDIT: That was a bad example on my side. The words 'a', 'b', 'c' should be replaced by identifiers for each specific word: every 'a' should be referred to as 1, every 'b' as 2, and so on. So suppose we have two lists A = [['a','z','c'],['z','b','e']] and B = ['a','b','c','a','b','c','z','a','e'].
The result I would like to have is the following:
BoW = [[(1,3),(2,1),(3,2)],[(2,1),(4,2),(5,1)]]
All words will be identified with the same integer. I am creating a corpus (term document frequency) from a dictionary where there are unique ids for each unique word.
Solution 1:[1]
A fairly simple way I could come up with is this:
A = [['a','b','c'],['a','b','c']]
B = ['a','b','c','a','b','c']
out = []
for ls in A:
    newls = []
    for i,j in enumerate(ls):
        newls.append((i+1,B.count(j)))
    out.append(newls)
print(out)
The output this gives is:
[[(1, 2), (2, 2), (3, 2)], [(1, 2), (2, 2), (3, 2)]]
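The loop above numbers tokens by their position inside each sublist, so it does not cover the edited example, where a word must keep the same id in every sublist. One possible adaptation is sketched below. It assumes ids are handed out in order of first appearance in A, starting at 1, and it counts B once with a Counter instead of calling B.count for every token. The function name bow_with_global_ids is hypothetical and not part of the original answer.

```python
from collections import Counter

def bow_with_global_ids(docs, reference):
    """Keep the nesting of docs, but take each token's count from reference."""
    counts = Counter(reference)  # single pass over the reference list
    token_ids = {}               # token -> id, assigned on first appearance (assumption)
    result = []
    for doc in docs:
        row = []
        for token in doc:
            # setdefault stores the next free id only if the token is new
            token_id = token_ids.setdefault(token, len(token_ids) + 1)
            row.append((token_id, counts[token]))
        result.append(row)
    return result

A = [['a', 'z', 'c'], ['z', 'b', 'e']]
B = ['a', 'b', 'c', 'a', 'b', 'c', 'z', 'a', 'e']
print(bow_with_global_ids(A, B))
# [[(1, 3), (2, 1), (3, 2)], [(2, 1), (4, 2), (5, 1)]]
```

A Counter returns 0 for a token that never occurs in the reference list, so a word present in A but missing from B would appear with a count of 0 rather than raising an error.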
Solution 2:[2]
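import itertools
from collections import Counter

A = [['a','b','c'], ['a','b','c']]
B = ['a','b','c','a','b','c']

def cvt(lst):
    # Map each token to (id, count); ids follow order of first appearance, starting at 1.
    # Note: for a flat list such as B, chain(*lst) iterates over the characters of
    # each string, so this only works as shown for single-character tokens.
    enum = enumerate(Counter(itertools.chain(*lst)).items())
    return {k: (i+1, c) for i, (k, c) in enum}

def replace(lst, cnt):
    # Recursively swap every token for its (id, count) tuple, keeping the nesting.
    return [replace(x, cnt) if isinstance(x, list) else cnt[x] for x in lst]

print(cvt(A))
print(cvt(B))
cnt = cvt(A)
print(replace(A, cnt))
print(replace(B, cnt))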
output:
{'a': (1, 2), 'b': (2, 2), 'c': (3, 2)}
{'a': (1, 2), 'b': (2, 2), 'c': (3, 2)}
[[(1, 2), (2, 2), (3, 2)], [(1, 2), (2, 2), (3, 2)]]
[(1, 2), (2, 2), (3, 2), (1, 2), (2, 2), (3, 2)]
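The question mentions that a dictionary with a unique id per word already exists. In that situation the ids do not need to be derived from A at all. The sketch below assumes such a mapping is available as a plain dict. The token2id values here are made up for illustration (gensim's Dictionary exposes a comparable token2id attribute, with ids that start at 0), and bow_from_mapping is a hypothetical helper name. Skipping unknown tokens, listing each token once per document, and sorting rows by id are design choices of this sketch, not requirements stated in the question.

```python
from collections import Counter

def bow_from_mapping(docs, reference, token2id):
    """Build (token_id, count) rows from an existing token -> id mapping."""
    counts = Counter(reference)  # counts come from the reference list, not from docs
    corpus = []
    for doc in docs:
        # dict.fromkeys drops repeated tokens while keeping their order
        unique_tokens = dict.fromkeys(doc)
        # tokens missing from the mapping are skipped (assumption)
        row = [(token2id[t], counts[t]) for t in unique_tokens if t in token2id]
        corpus.append(sorted(row))  # order each row by token id
    return corpus

# Hypothetical mapping standing in for an existing dictionary
token2id = {'a': 1, 'z': 2, 'c': 3, 'b': 4, 'e': 5}
A = [['a', 'z', 'c'], ['z', 'b', 'e']]
B = ['a', 'b', 'c', 'a', 'b', 'c', 'z', 'a', 'e']
print(bow_from_mapping(A, B, token2id))
# [[(1, 3), (2, 1), (3, 2)], [(2, 1), (4, 2), (5, 1)]]
```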
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | FoundABetterName |
| Solution 2 | Kang San Lee |
