Apply Regex Pattern on key from nested list of objects of custom dictionaries

I have a nested list of objects called "words". Each object is an instance of a class with the fields conf (float), end (float), start (float) and word (string). I want to apply the regex pattern "\b(\w+)\b(?=.*?\b\1\b)" to "word" and remove the objects that match the pattern.

class Word:
    ''' A class representing a word from the JSON format for vosk speech recognition API '''

    def __init__(self, dict):
        '''
        Parameters:
          dict (dict) dictionary from JSON, containing:
            conf (float): degree of confidence, from 0 to 1
            end (float): end time of the pronouncing the word, in seconds
            start (float): start time of the pronouncing the word, in seconds
            word (str): recognized word
        '''

        self.conf = dict["conf"]
        self.end = dict["end"]
        self.start = dict["start"]
        self.word = dict["word"]

    def to_string(self):
        ''' Returns a string describing this instance '''
        return "{:20} from {:.2f} sec to {:.2f} sec, confidence is {:.2f}%".format(
            self.word, self.start, self.end, self.conf*100)


    def compare(self, other):
        if self.word == other.word:
            return True
        else:
            return False

Each object in the collection contains data like this:

{'conf': 0.0, 'end': 0.00, 'start': 0.00, 'word': 'hello'} 

{'conf': 0.0, 'end': 1.00, 'start': 0.00, 'word': 'hello'} 

{'conf': 0.0, 'end': 2.00, 'start': 0.00, 'word': 'to'} 

I tried to apply the regex pattern this way, but couldn't get it working:

pattern = re.compile("\b(\w+)\b(?=.*?\b\1\b)")
for w in words:
    lst = [x for x in w.word if not re.match(pattern, x)]
print(lst)

I tested the regex online.

Can someone guide me on how to apply the regex pattern to "word" and remove the objects that match it? Thanks in advance!



Solution 1:[1]

Try this:

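import re

# Raw string, so \b is a word boundary rather than a backspace character
pattern = re.compile(r"\b(\w+)\b(?=.*?\b\1\b)")
lst = []
for i, w in enumerate(words):
    if not pattern.match(w.word):
        lst.append(i)
print(lst)
# lst holds the indices of the objects whose word does not match the pattern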

Because the condition is negated, these are the indices of the objects to keep; dropping every other index removes the objects that match the pattern.
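
A likely reason the attempt in the question never matches is worth checking before anything else. This is a minimal sketch, assuming the pattern was written as a normal (non-raw) string literal and applied to one word at a time; the sample sentence is made up from the example data, and the names sentence_hits and single_hit are only for illustration.

```python
import re

# In a normal string literal, "\b" is a backspace and "\1" is the character 0x01,
# so the regex engine never sees a word boundary or a backreference.
assert "\b" == "\x08"
assert "\1" == "\x01"

# A raw string keeps the backslashes, so the engine receives \b, \w and \1.
pattern = re.compile(r"\b(\w+)\b(?=.*?\b\1\b)")

# On a whole sentence, the lookahead can find the later repetition.
sentence_hits = pattern.findall("hello hello to")

# On a single word there is no later text to look ahead into, so nothing matches.
single_hit = pattern.match("hello")

print(sentence_hits, single_hit)
```

So even with a raw string, the pattern cannot detect repeats when each "word" is tested on its own, because the repetition lives in a different object.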

EDIT: according to your comments, I've updated the answer:

distinct_words = {}
lst = []
for i in range(len(words)):
    if isinstance(distinct_words.get(words[i].word), int):
        lst.append(i)
    else:
        distinct_words[words[i].word] = i
print(lst)

Each word that is seen for the first time is stored in the distinct_words dict together with its index. If the same word is found again, its index is appended to lst instead.

At the end, lst contains the indices of every repeated occurrence, that is, each occurrence after the first. Use the indices in lst to remove those objects from the list.
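
The removal step can also be done in a single pass by building a new list instead of collecting indices. This is only a sketch under assumptions: the dataclass below is a hypothetical stand-in for the Word class in the question, the sample values are taken from the example data, and seen and unique_words are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in for the Word class from the question
    conf: float
    end: float
    start: float
    word: str


words = [
    Word(0.0, 0.0, 0.0, "hello"),
    Word(0.0, 1.0, 0.0, "hello"),
    Word(0.0, 2.0, 0.0, "to"),
]

seen = set()        # words encountered so far
unique_words = []   # first occurrence of each word, in original order
for w in words:
    if w.word not in seen:
        seen.add(w.word)
        unique_words.append(w)

print([w.word for w in unique_words])
```

Building a new list avoids deleting from words while iterating over it, which would shift the remaining indices.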

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1