numpy.where does not work for empty strings

As an SEO manager, I am using this Python code to check whether the H1 tags are the same on the desktop and mobile versions of a website's pages:

## Print the path of your current working directory
import os
print(os.getcwd())
# What you get here is where you should save your CSV crawls

## Import the pandas and NumPy libraries
import pandas as pd
import numpy

## Load the crawls into pandas, keeping only the URL and first H1 columns
dfTextonly = pd.read_csv('mobile.csv', low_memory=False, header=0)
dfTextonly = dfTextonly[['Address', 'H1-1']].copy()
dfJS = pd.read_csv('desktop.csv', low_memory=False, header=0)
dfJS = dfJS[['Address', 'H1-1']].copy()
# Combine the two crawls into one dataframe (H1-1_x = mobile, H1-1_y = desktop)
df = pd.merge(dfTextonly, dfJS, on='Address', how='outer')

## Check the differences
df["H1s are equal"] = numpy.where((df["H1-1_y"] == df["H1-1_x"]), "yes", "no")

## Export to Excel
df.to_excel("test-results.xlsx")

However, numpy.where in this code returns "no" whenever H1-1_y and H1-1_x are both NaN (that is, the H1 cell is empty in both crawls), while it should return "yes", since in that case they are the same. Can somebody help me with this?
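A minimal reproduction of this behaviour, using a small made-up DataFrame in place of the merged crawls. The addresses and H1 values are invented for illustration; only the column names match the code above.

```python
import numpy
import pandas as pd

# Hypothetical stand-in for the merged crawl; all values are invented.
df = pd.DataFrame({
    "Address": ["/a", "/b", "/c"],
    "H1-1_x": ["Home", "About", numpy.nan],     # mobile H1
    "H1-1_y": ["Home", "About us", numpy.nan],  # desktop H1
})

# Same comparison as in the script above.
df["H1s are equal"] = numpy.where(df["H1-1_y"] == df["H1-1_x"], "yes", "no")

# The last row has no H1 in either crawl, yet it is reported as "no".
print(df["H1s are equal"].tolist())  # ['yes', 'no', 'no']
```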


Solution 1:[1]

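If it is about handling NaNs, as you've mentioned in a comment: NaN never compares equal to NaN, in NumPy or in pandas, so the == comparison on its own is False whenever both cells are empty. One way around this is to fill the missing values with the same placeholder on both sides before comparing, here combined with pandas where. The code looks a bit hackish, so you can decide whether you want that, but you could try

df["H1s are equal"] = pd.Series(["yes"] * len(df), index=df.index).where(df["H1-1_y"].fillna("") == df["H1-1_x"].fillna(""), "no")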
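For comparison, here is a sketch that stays with numpy.where and spells out the "both values missing" case as its own mask. The sample frame is made up, and the names same_text and both_missing are illustrative, not part of the original code.

```python
import numpy
import pandas as pd

# Hypothetical merged crawl; all values are invented for illustration.
df = pd.DataFrame({
    "Address": ["/a", "/b", "/c", "/d"],
    "H1-1_x": ["Home", "About", numpy.nan, "Contact"],
    "H1-1_y": ["Home", "About us", numpy.nan, numpy.nan],
})

# True where both H1s hold the same text (always False if either is NaN).
same_text = df["H1-1_y"] == df["H1-1_x"]
# True where the H1 is missing in both crawls.
both_missing = df["H1-1_y"].isna() & df["H1-1_x"].isna()

df["H1s are equal"] = numpy.where(same_text | both_missing, "yes", "no")
print(df["H1s are equal"].tolist())  # ['yes', 'no', 'yes', 'no']
```

Keeping the two masks separate means a row with an H1 in only one crawl (the last row here) still counts as a difference.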

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Simon Hawe