'Too many columns resulting in `PerformanceWarning: DataFrame is highly fragmented`

I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.

import pandas as pd
import numpy as np

data = {'filepath':  ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
        }

df = pd.DataFrame(data)

df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))

df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))

df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))

df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)

def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]

    return squeezed.reindex(original_columns, fill_value=np.nan)

df = df.apply(squeeze_nan, axis=1) 
print(df)

This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get the error:

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()

To fix this I have tried:

dfAnimal = df.copy

dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)

dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)

df = pd.concat(dfAnimal,dfFish)

The above gives me errors such as method object is not iterable and method object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])] but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance error?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source