Use a list and str.extract() to get multiple string matches on the same row in a data frame

The title is somewhat confusing, but it was the best I could come up with. I am trying to match a list of names, cnn_journalist, against every row of the column tucker_transcripts["Tekst"]. The code below does match the names in cnn_journalist against tucker_transcripts["Tekst"]. However, when a row in Tekst contains more than one name from cnn_journalist, I only get the first match. Is there a way to get every name from cnn_journalist that appears in a row of Tekst? Each matched name should be added to a new column, cnn, as in the code below, with one name per row.

# Scraped transcripts from Tucker Carlson Tonight: URL, date of the show, and transcript text
tucker_transcripts.head()
                                                 URL        Dato  \
0  https://www.foxnews.com/opinion/tucker-carlson...  2022-02-11   
1  https://www.foxnews.com/opinion/tucker-carlson...  2022-02-10   
2  https://www.foxnews.com/opinion/tucker-carlson...  2022-02-09   
3  https://www.foxnews.com/opinion/tucker-how-lon...  2022-02-07   
4  https://www.foxnews.com/opinion/tucker-carlson...  2022-02-05   
                                               Tekst  
0  n We’ve been covering this truck strike in Can...  
1  When Democratic politicians across this countr...  
2  It'd be pretty fascinating to see the Democrat...  
3  Workers of the world unite. You have nothing t...  
4  There aren’t a lot of amazing stories left in ...

# CNN journalists: column x contains the names of 372 journalists
cnn_journalist.head()
               x
0  Sarah Aarthun
1   Maribel Aber
2     Jim Acosta
3   Beryl Adcock
4      Jon Adler

import re

cnn_journalist = cnn_journalist["x"].tolist()  # convert the CNN DataFrame column to a list
cnn_journalist = '(%s)' % '|'.join(cnn_journalist)  # join the names with "|" inside one capture group
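
A side note on the pattern itself: joining raw names with "|" works as long as no name contains a regex metacharacter (for example a "." in an initial). A minimal sketch of a more defensive pattern, assuming a short hypothetical list called names in place of the real 372 names, escapes each name and adds word boundaries:

```python
import re

# Hypothetical sample names; the real list would come from cnn_journalist["x"].tolist()
names = ["Jim Acosta", "Jon Adler", "Maribel Aber"]

# re.escape makes any special characters in a name match literally;
# \b on both sides avoids matching inside a longer word.
pattern = r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"

# Quick check on a plain string: findall returns every match, not just the first
print(re.findall(pattern, "Jim Acosta spoke to Jon Adler", flags=re.IGNORECASE))
```

Whether the word boundaries are wanted depends on how the names appear in the transcripts, so treat them as an optional refinement.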

Match the names in the cnn_journalist list against Tucker's transcripts. If a journalist has been called out on Tucker's show, he or she will appear by name in the new column cnn. The corresponding row in column Dato gives the date on which the journalist was called out.

tucker_transcripts['cnn'] = tucker_transcripts['Tekst'].str.extract(cnn_journalist, flags=re.IGNORECASE)
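
One possible way to get every match rather than only the first, sketched here under assumptions: str.extract returns only the first match per row, whereas str.findall returns a list of all matches, and DataFrame.explode can then spread that list over one row per name. The miniature data frame and the names list below are hypothetical stand-ins for the real data.

```python
import re
import pandas as pd

# Hypothetical miniature data standing in for the real transcripts
tucker_transcripts = pd.DataFrame({
    "Dato": ["2022-02-11", "2022-02-10", "2022-02-09"],
    "Tekst": [
        "Jim Acosta and Jon Adler were both mentioned tonight.",
        "No journalists are named in this transcript.",
        "Here is what maribel aber said.",
    ],
})
names = ["Jim Acosta", "Jon Adler", "Maribel Aber"]  # hypothetical subset
pattern = "(" + "|".join(re.escape(n) for n in names) + ")"

# str.findall returns a list with every match found in each row
matches = tucker_transcripts["Tekst"].str.findall(pattern, flags=re.IGNORECASE)

# explode gives each matched name its own row; rows without a match get NaN
result = tucker_transcripts.assign(cnn=matches).explode("cnn")
print(result[["Dato", "cnn"]])
```

The exploded rows keep their original index and Dato value, so each date is repeated once per journalist named on that show. Matched names keep the casing found in the transcript, so they may need normalising before being compared with the original list.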

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
