Bag-of-words format from two different lists

I have two lists: A = [['a','b','c'],['a','b','c']] and B = ['a','b','c','a','b','c']. I would like to convert them into a bag-of-words format, i.e. a list of (token_id, token_count) 2-tuples. I would like to keep the structure of list A but use list B for counting tokens. The code I currently use is: corpus = [id2word2.doc2bow(text) for text in texts] where texts is a dictionary of the A list. The result I would like to have is the following:

BoW = [[(1,2),(2,2),(3,2)],[(1,2),(2,2),(3,2)]]

and not like this:

BoW = [[(1,1),(2,1),(3,1)],[(1,1),(2,1),(3,1)]]
BoW = [[(1,2),(2,2),(3,2),(1,2),(2,2),(3,2)]]

EDIT: Bad example on my side. The words 'a','b','c' should be replaced by identifiers for that specific word: all 'a' should be referred to as 1, all 'b' as 2, and so on. So if we have two lists A = [['a','z','c'],['z','b','e']] and B = ['a','b','c','a','b','c','z','a','e'], the result I would like to have is the following:

BoW = [[(1,3),(2,1),(3,2)],[(2,1),(4,2),(5,1)]]

Each word will always be identified by the same integer. I am creating a corpus (term-document frequency) from a dictionary that holds a unique id for each unique word.



Solution 1:[1]

A pretty easy way I could come up with is this:

A = [['a','b','c'],['a','b','c']]
B = ['a','b','c','a','b','c']
out = []
for ls in A:
    newls = []
    for i, j in enumerate(ls):
        # id is the token's 1-based position in its sub-list; the count comes from B
        newls.append((i + 1, B.count(j)))
    out.append(newls)
print(out)

The output this gives is:

[[(1, 2), (2, 2), (3, 2)], [(1, 2), (2, 2), (3, 2)]]
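The loop above numbers tokens by their position inside each sub-list. That works for this example, but it would not give consistent ids for the edited example in the question. Below is a minimal sketch of one way to handle that case, assuming ids are handed out in order of first appearance in A. The helper name bow_from_reference is made up for illustration, and a token repeated inside one sub-list would simply appear twice in the output.

```python
from collections import Counter


def bow_from_reference(docs, reference):
    """Keep the nesting of `docs`, but take each token's count from `reference`."""
    counts = Counter(reference)  # one pass over B instead of B.count() per token
    token_ids = {}               # token -> id, assigned in order of first appearance
    result = []
    for doc in docs:
        row = []
        for token in doc:
            # Reuse the id if the token was seen before, otherwise assign the next one
            token_id = token_ids.setdefault(token, len(token_ids) + 1)
            row.append((token_id, counts[token]))
        result.append(row)
    return result


A = [['a', 'z', 'c'], ['z', 'b', 'e']]
B = ['a', 'b', 'c', 'a', 'b', 'c', 'z', 'a', 'e']
print(bow_from_reference(A, B))
# [[(1, 3), (2, 1), (3, 2)], [(2, 1), (4, 2), (5, 1)]]
```

Counting B once with Counter also avoids rescanning B for every token, which may matter for a larger corpus.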

Solution 2:[2]

import itertools
from collections import Counter

A = [['a','b','c'], ['a','b','c']]
B = ['a','b','c','a','b','c']

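def cvt(lst):
    # Flatten lst by one level, count each token, and map token -> (1-based id, count)
    enum = enumerate(Counter(itertools.chain(*lst)).items())
    return {k: (i+1, c) for i, (k, c) in enum}

def replace(lst, cnt):
    # Recursively swap each token for its (id, count) tuple, keeping the nesting
    return [replace(x, cnt) if isinstance(x, list) else cnt[x] for x in lst]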

print(cvt(A))
print(cvt(B))

cnt = cvt(A)
print(replace(A, cnt))
print(replace(B, cnt))

output:

{'a': (1, 2), 'b': (2, 2), 'c': (3, 2)}
{'a': (1, 2), 'b': (2, 2), 'c': (3, 2)}
[[(1, 2), (2, 2), (3, 2)], [(1, 2), (2, 2), (3, 2)]]
[(1, 2), (2, 2), (3, 2), (1, 2), (2, 2), (3, 2)]
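
Note that cnt = cvt(A) takes its counts from A itself. In this example they happen to match the counts in B. As a hedged sketch, the same recursive idea could take counts from B and ids from an existing mapping. The token2id dict below is a hypothetical stand-in for such a mapping. In gensim, a Dictionary exposes a token2id attribute with a similar role, though its ids start at 0. The name replace_with_bow is likewise made up for illustration.

```python
from collections import Counter


def replace_with_bow(nested, token2id, counts):
    """Recursively swap each token for (token_id, count), preserving nesting."""
    return [
        replace_with_bow(item, token2id, counts) if isinstance(item, list)
        else (token2id[item], counts[item])
        for item in nested
    ]


A = [['a', 'z', 'c'], ['z', 'b', 'e']]
B = ['a', 'b', 'c', 'a', 'b', 'c', 'z', 'a', 'e']

# Hypothetical id mapping, chosen to match the ids in the question's edited example
token2id = {'a': 1, 'z': 2, 'c': 3, 'b': 4, 'e': 5}
counts = Counter(B)  # counts come from B, not from A

print(replace_with_bow(A, token2id, counts))
# [[(1, 3), (2, 1), (3, 2)], [(2, 1), (4, 2), (5, 1)]]
```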

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 FoundABetterName
Solution 2 Kang San Lee