Appending rows to pandas dataframe results in duplicate rows
Here's an MWE that illustrates a problem I'm having: incrementally saving values to a dataframe inside nested loops appears to overwrite previously saved rows.
import pandas as pd
import numpy as np
saved = pd.DataFrame(columns = ['value1', 'value2'])
m = np.zeros(2)
for t in range(5):
    for i in range(2):
        m[i] = m[i] + i + 1
    print(t)
    print(m)
    saved.loc[t] = m
print(saved)
The output I get is:
0
[1. 2.]
1
[2. 4.]
2
[3. 6.]
3
[4. 8.]
4
[ 5. 10.]
value1 value2
0 2.0 4.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
Why is the first row of the saved dataframe not 1.0, 2.0?
Edit: Here's another version of the problem, this time saving to a list and building the dataframe at the end. The following code in a .py script
import numpy as np
import pandas as pd
saved_list = []
m = np.zeros(2)
for t in range(5):
    for i in range(2):
        m[i] = m[i] + i + 1
    print(t)
    print(m)
    saved_list.append(m)
saved = pd.DataFrame(saved_list, columns = ['value1', 'value2'])
print(saved)
gives this output from the command line:
0
[1. 2.]
1
[2. 4.]
2
[3. 6.]
3
[4. 8.]
4
[ 5. 10.]
value1 value2
0 5.0 10.0
1 5.0 10.0
2 5.0 10.0
3 5.0 10.0
4 5.0 10.0
Why are the previous saved_list items being overwritten?
Solution 1:[1]
Well, it seems that making a copy of the array within the loop for saving solves both scenarios.
For the first, I used
saved.loc[t] = m.copy()
and for the second I used
saved_list.append(m.copy())
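For reference, here is a minimal sketch of the list-based version from the question with the copy applied. It only restates the fix above in runnable form; the progress prints are dropped for brevity, and the comments are mine.

```python
import numpy as np
import pandas as pd

saved_list = []
m = np.zeros(2)
for t in range(5):
    for i in range(2):
        m[i] = m[i] + i + 1      # in-place update of the same array
    saved_list.append(m.copy())  # store a snapshot, not the array itself

saved = pd.DataFrame(saved_list, columns=['value1', 'value2'])
print(saved)
```

With the copy in place, each row keeps the values the array had at the time it was saved, so value1 runs from 1.0 to 5.0 and value2 from 2.0 to 10.0.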
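It may be obvious to some, but when the array is created once outside the loop and then updated in place, what gets saved to the list (and, apparently, sometimes to the frame) is a reference to that same array rather than a snapshot of its values. Later in-place updates therefore show through: in the list case, every entry ends up showing the final version.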
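A small sketch of that behaviour for the list case, using only NumPy. The names aliases and snapshots are made up for this illustration and are not from the question.

```python
import numpy as np

m = np.zeros(2)
aliases = []    # will hold references to m itself
snapshots = []  # will hold independent copies of m

for t in range(3):
    m += np.array([1.0, 2.0])  # in-place update: m stays the same object
    aliases.append(m)
    snapshots.append(m.copy())

# Every element of aliases is the very same array object as m.
same_object = all(a is m for a in aliases)
alias_rows = [a.tolist() for a in aliases]
snapshot_rows = [s.tolist() for s in snapshots]

print(same_object)    # True
print(alias_rows)     # [[3.0, 6.0], [3.0, 6.0], [3.0, 6.0]]
print(snapshot_rows)  # [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
```

Appending m stores the same object three times, which is why all three entries show the last values; appending m.copy() stores three separate arrays.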
Now I know.
Solution 2:[2]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | zazizoma |
Solution 2 | Yanirmr |