Filter a dataframe by both column value and row number
I have a large dataframe with over 4 million rows and multiple columns. Column X may have a value of NaN. I first want to filter out any row where column X already has a value (i.e. keep only the rows where X is NaN), then split the dataframe into smaller segments for processing. However, if I use both loc and iloc, a SettingWithCopyWarning is raised. How can I code around this problem?
The reason for segmenting is to export the dataframe to CSV every time a segment is processed, to prevent extensive data loss if an error occurs.
My code is the following:
```python
filtered_df = initdf.loc[initdf['x'].isnull(), :]

for i in range(0, len(filtered_df.index), 2000):
    filtered_df_chunk = filtered_df.iloc[i:i+2000]
    # Code to edit the chunk
    initdf.update(filtered_df_chunk, overwrite=False)
```
Is there a better way to avoid the SettingWithCopyWarning while still being able to filter and segment the initial dataframe?
Edit: An initial omission, although I don't think it changes the answer: the exported dataframe is the initial one, after the chunk changes have been integrated into it using df.update.
Many thanks!
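For reference, one common way to avoid the warning in this kind of loc/iloc pipeline is to take an explicit `.copy()` of the filtered frame before slicing it, so later writes never target a view of the original. The sketch below is only illustrative, not from the original post: the toy `initdf`, the placeholder edit, and the `checkpoint.csv` filename are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real 4-million-row dataframe (assumed structure).
initdf = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "y": list("abcdef"),
})

# .copy() makes filtered_df an independent frame, so later writes to its
# slices no longer raise SettingWithCopyWarning about the original.
filtered_df = initdf.loc[initdf["x"].isnull(), :].copy()

chunk_size = 2  # 2000 in the real case
for i in range(0, len(filtered_df.index), chunk_size):
    filtered_df_chunk = filtered_df.iloc[i:i + chunk_size].copy()
    filtered_df_chunk["x"] = 0.0                       # placeholder for the real edit
    initdf.update(filtered_df_chunk, overwrite=False)  # fill the NaNs back in initdf
    initdf.to_csv("checkpoint.csv")                    # export the whole frame after each chunk
```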
Solution 1:[1]
Here's my initial take on this, using a simplified example:
```python
import numpy as np
import pandas as pd

# Create a small DataFrame with NaN values in column "a"
list_a = {
    "a": [1, 7, 3, np.nan, 8, 3, 9, 9, 3, np.nan, 4, 3],
    "b": np.arange(12),
}
df = pd.DataFrame(list_a)

# Remove the rows where column "a" is NaN
df_no_nan = df[df["a"].notna()]

def chunk_operation(df, chunk_size):
    # Start index of every chunk: 0, chunk_size, 2*chunk_size, ...
    split_points = np.arange(len(df))[::chunk_size]
    for chunk in [df.iloc[split:split + chunk_size] for split in split_points]:
        chunk["a"] * 5                  # placeholder for the real per-chunk processing
        chunk.to_csv(r"\some_path")     # export each processed chunk

chunk_operation(df_no_nan, 3)
```
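Per the edit in the question (the exported file should be the initial dataframe after df.update, not the chunk), the same chunking idea can be combined with a write-back step. The snippet below is a sketch under assumptions: the `chunk_update` helper name, the `* 5` placeholder edit, and the `checkpoint.csv` path are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

def chunk_update(original_df, working_df, chunk_size, csv_path="checkpoint.csv"):
    """Process working_df in chunks, merge each chunk back into original_df
    with DataFrame.update, then export the full original_df as a checkpoint."""
    for split in np.arange(len(working_df))[::chunk_size]:
        chunk = working_df.iloc[split:split + chunk_size].copy()
        chunk["a"] = chunk["a"] * 5     # placeholder for the real per-chunk edit
        original_df.update(chunk)       # aligns on the chunk's index labels
        original_df.to_csv(csv_path)    # checkpoint the whole frame, not the chunk

df = pd.DataFrame({
    "a": [1, 7, 3, np.nan, 8, 3, 9, 9, 3, np.nan, 4, 3],
    "b": np.arange(12),
})
df_no_nan = df[df["a"].notna()]
chunk_update(df, df_no_nan, chunk_size=3)
```

Because each chunk is an explicit `.copy()`, assigning into it does not raise SettingWithCopyWarning, and `DataFrame.update` writes the results back into the original frame by index before the CSV export.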
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
