Filter a dataframe by both column value and row number

I have a large dataframe with over 4 million rows and multiple columns. Column X may have a value of NaN. I want to first filter the rows based on whether column X has a value, then split the dataframe into smaller segments for processing. However, if I use both loc and iloc, the SettingWithCopyWarning is raised. How can I code around this problem?

The reason for segmenting is to export the dataframe to CSV every time a segment is processed, to prevent extensive data loss if an error occurs.

My code is the following:

filtered_df = initdf.loc[initdf['x'].isnull(), :]      # keep only the rows where column 'x' is NaN
for i in range(0, len(filtered_df.index), 2000):
    filtered_df_chunk = filtered_df.iloc[i:i+2000]      # take the next 2000-row chunk
    # Code to edit the chunk
    initdf.update(filtered_df_chunk, overwrite=False)   # overwrite=False only fills cells that are NaN in initdf

Is there a better way to avoid the SettingWithCopyWarning while still being able to filter and segment the initial dataframe?

Edit: An initial omission, although I don't think it changes the answer: the exported dataframe is the initial one, after the chunk changes have been integrated into it using df.update.
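To make that concrete, the full loop I have in mind looks roughly like this (a sketch; the CSV file name is just a placeholder):

filtered_df = initdf.loc[initdf['x'].isnull(), :]
for i in range(0, len(filtered_df.index), 2000):
    filtered_df_chunk = filtered_df.iloc[i:i+2000]
    # Code to edit the chunk
    initdf.update(filtered_df_chunk, overwrite=False)    # integrate the chunk changes into the original
    initdf.to_csv('initdf_checkpoint.csv')               # export the updated original after each segment (placeholder name)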

Many thanks!



Solution 1:[1]

Here's my initial take on this, using a simplified example.

import numpy as np
import pandas

list_a = {
    "a": [1, 7, 3, np.nan, 8, 3, 9, 9, 3, np.nan, 4, 3],
    "b": np.arange(12),
}  # toy data with NaN values in column "a"

df = pandas.DataFrame(list_a)

df_no_nan = df[df["a"].notna()]  # drop the rows where column "a" is NaN

def chunk_operation(df, chunk_size):
    for split in range(0, len(df), chunk_size):           # start index of every chunk
        chunk = df.iloc[split:split + chunk_size].copy()  # explicit copy, so edits don't trigger SettingWithCopyWarning
        chunk["a"] = chunk["a"] * 5                       # example edit on the chunk
        chunk.to_csv(r"\some_path")                       # export the processed chunk (placeholder path)

chunk_operation(df_no_nan, 3)
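If, as the question's edit describes, the changes should also be folded back into the original dataframe before exporting, the chunk edits can be written back with update and the full frame exported each time. A sketch along those lines, reusing the toy data above (the function name and the CSV path are again placeholders):

def chunk_operation_with_update(full_df, filtered_df, chunk_size):
    for split in range(0, len(filtered_df), chunk_size):
        chunk = filtered_df.iloc[split:split + chunk_size].copy()  # work on a copy of the chunk
        chunk["a"] = chunk["a"] * 5                                 # example edit
        full_df.update(chunk)                                       # write edits back, aligned on index (the question uses overwrite=False to fill only NaN cells)
        full_df.to_csv(r"\some_path")                               # export the updated original after each chunk (placeholder path)

chunk_operation_with_update(df, df_no_nan, 3)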

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

[1] Solution 1: Stack Overflow