'Pandas dataframe reports no matching string when the string is present

Fairly new to python. This seems to be a really simple question but I can't find any information about it. I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually in a particular column of the dataframe. Not whether a substring is present, but the whole exact string.

So my dataframe is something like the following:

A=pd.DataFrame(["ancestry","time","history"])

I should simply be able to use the "string in dataframe" method, as in

"time" in A

This returns False however. If I run

"time" == A.iloc[1]

it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is. Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?



Solution 1:[1]

The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was time, False otherwise. Then, you can use .any() to check if there are any True values:

>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
          0
0  ancestry
1      time
2   history

>>> A == "time"  # or A.eq("time")
       0
0  False
1   True
2  False

>>> (A == "time").any()
0    True
dtype: bool

Notice in the above output, (A == "time").any() returns a Series where each entry is a column and whether or not that column contained time. If you want to check the entire dataframe (across all columns), call .any() twice:

>>> (A == "time").any().any()
True

Solution 2:[2]

Add .to_numpy() to the end:

'time' in A.to_numpy() 

As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a Numpy array, and x in numpy.array works as you expect.

Solution 3:[3]

I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs behavior of in) checks whether your string is a column of the DataFrame, e.g.

>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False

Solution 4:[4]

I would slightly modify your dataframe and use .str.contains for checking where the string is present in your series.

df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])

df['A'].str.contains("time")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 richardec
Solution 2 richardec
Solution 3 Mose Wintner
Solution 4 Daniel Weigel