Can cast pandas Series to `int64` but not to `Int64`
I am stuck with a weird type conversion issue.
I have a pandas DataFrame pp with a column Value. All values are whole numbers stored as floats. Converting them all to int64 works, but I get an error when attempting to convert to nullable Int64. Help!
Some details: pp['Value'] is of dtype float64 and may contain NaN. In my specific case, all values are integer-valued (see my debugging attempts below), and I want them to become Int64. This has worked for months with thousands of series, but now one series suddenly fails, and I cannot find any non-integer value that would explain it.
What I am trying: convert the Series below to Int64:
print(f"pp['Value']:\n{pp['Value']}")
pp['Value']:
0 3500000.0
1 600000.0
2 400000.0
3 8300000.0
4 5700000.0
5 4400000.0
6 3000000.0
7 2700000.0
8 2000000.0
9 800000.0
10 300000.0
11 300000.0
12 5300000.0
13 2500000.0
14 11000000.0
15 1000000.0
16 18000000.0
17 6250000.0
18 5000000.0
19 4400000.0
20 4200000.0
21 2000000.0
22 1750000.0
23 900000.0
24 4000000.0
25 800000.0
26 9250000.0
27 5200000.0
28 600000.0
29 5700000.0
30 13500000.0
31 10000000.0
32 3300000.0
33 3200000.0
34 2000000.0
35 750000.0
Name: Value, dtype: float64
For some reason, pp['Value'].astype('Int64') raises the error TypeError: cannot safely cast non-equivalent float64 to int64
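For context, the two casts differ in strictness; this can be shown with a minimal standalone sketch (synthetic data, not my actual series):

```python
import pandas as pd

# second value is almost, but not exactly, integral
s = pd.Series([3500000.0, 600000.000000001])

print(s.astype("int64"))  # plain int64: truncates silently, no error

try:
    s.astype("Int64")     # nullable Int64: requires the cast to be value-preserving
except TypeError as e:
    print(e)              # cannot safely cast non-equivalent float64 to int64
```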
I debugged two alternative approaches, and both work:
A: Convert the series to 'int64' - works like a charm (the numbers really all are integers):
pp['Value'] = pp['Value'].astype('int64')
print(f"pp['Value']:\n{pp['Value']}")
pp['Value']:
0 3500000
1 600000
2 400000
3 8300000
4 5700000
5 4400000
6 3000000
7 2700000
8 2000000
9 800000
10 300000
11 300000
12 5300000
13 2500000
14 11000000
15 1000000
16 18000000
17 6250000
18 5000000
19 4400000
20 4200000
21 2000000
22 1750000
23 900000
24 4000000
25 800000
26 9250000
27 5200000
28 600000
29 5700000
30 13500000
31 10000000
32 3300000
33 3200000
34 2000000
35 750000
Name: Value, dtype: int64
B: Convert each element individually, and check whether any single value has some weird floating-point arithmetic issue. Not the case either: all values can be converted:
for idx, row in pp.iterrows():
    print(f"{idx}: value = {row['Value']}, residual vs. int: {row['Value']%row['Value']}, int value: {int(row['Value'])}")
0: value = 3500000.0, residual vs. int: 0.0, int value: 3500000
1: value = 600000.0, residual vs. int: 0.0, int value: 600000
2: value = 400000.0, residual vs. int: 0.0, int value: 400000
3: value = 8300000.000000001, residual vs. int: 0.0, int value: 8300000
4: value = 5700000.0, residual vs. int: 0.0, int value: 5700000
5: value = 4400000.0, residual vs. int: 0.0, int value: 4400000
6: value = 3000000.0, residual vs. int: 0.0, int value: 3000000
7: value = 2700000.0, residual vs. int: 0.0, int value: 2700000
8: value = 2000000.0, residual vs. int: 0.0, int value: 2000000
9: value = 800000.0, residual vs. int: 0.0, int value: 800000
10: value = 300000.0, residual vs. int: 0.0, int value: 300000
11: value = 300000.0, residual vs. int: 0.0, int value: 300000
12: value = 5300000.0, residual vs. int: 0.0, int value: 5300000
13: value = 2500000.0, residual vs. int: 0.0, int value: 2500000
14: value = 11000000.0, residual vs. int: 0.0, int value: 11000000
15: value = 1000000.0, residual vs. int: 0.0, int value: 1000000
16: value = 18000000.0, residual vs. int: 0.0, int value: 18000000
17: value = 6250000.0, residual vs. int: 0.0, int value: 6250000
18: value = 5000000.0, residual vs. int: 0.0, int value: 5000000
19: value = 4400000.0, residual vs. int: 0.0, int value: 4400000
20: value = 4200000.0, residual vs. int: 0.0, int value: 4200000
21: value = 2000000.0, residual vs. int: 0.0, int value: 2000000
22: value = 1750000.0, residual vs. int: 0.0, int value: 1750000
23: value = 900000.0, residual vs. int: 0.0, int value: 900000
24: value = 4000000.0, residual vs. int: 0.0, int value: 4000000
25: value = 800000.0, residual vs. int: 0.0, int value: 800000
26: value = 9250000.0, residual vs. int: 0.0, int value: 9250000
27: value = 5200000.0, residual vs. int: 0.0, int value: 5200000
28: value = 600000.0, residual vs. int: 0.0, int value: 600000
29: value = 5700000.0, residual vs. int: 0.0, int value: 5700000
30: value = 13500000.0, residual vs. int: 0.0, int value: 13500000
31: value = 10000000.0, residual vs. int: 0.0, int value: 10000000
32: value = 3300000.0, residual vs. int: 0.0, int value: 3300000
33: value = 3200000.0, residual vs. int: 0.0, int value: 3200000
34: value = 2000000.0, residual vs. int: 0.0, int value: 2000000
35: value = 750000.0, residual vs. int: 0.0, int value: 750000
I am lost. Every value is integer-valued, I can convert each value to int, and I can convert the whole Series to int64. But converting to Int64 raises an error. Why? What is wrong here?
Edit note:
pp['Value'] = pp['Value'].round().astype('Int64')
solves the problem, but I would love to understand why. As you can see above, the set is guaranteed to contain only integers; each value is an 'int' down to machine accuracy. Why on earth is the 'non-safe conversion' error raised?
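A side note on the residual check in attempt B: x % x is 0.0 for any nonzero x, so that column can never expose a fractional part, whereas x % 1 can. A small sketch using the value printed at index 3:

```python
x = 8300000.000000001   # the value shown at index 3 in attempt B

print(x % x)                   # 0.0 -- always zero for nonzero x, so it proves nothing
print(x % 1)                   # a tiny nonzero remainder: the fractional part the check missed
print(x == 8300000.0)          # False: the float is not exactly integral
print(round(x) == 8300000.0)   # True: rounding removes the residue
```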
Solution 1:
As Jason suggested in his comment, your edit solves the problem because rounding changes 8300000.000000001 (index 3 in your element-by-element output) to 8300000.0.
This matters because, after the cast to int64, the rounded values are still equal to the originals, so they satisfy the "safe" casting rule that pandas applies when converting to 'Int64': pandas casts the underlying array with numpy.ndarray.astype and rejects the result if it is not equivalent to the input. The details of "safe" casting are described in the NumPy casting documentation.
As far as I am aware, there is no way to ask pandas to use a looser casting mode here, so rounding the values first is the solution to your problem.
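The check can be imitated directly with NumPy (a sketch of what pandas does internally, inferred from the error message rather than from the pandas source):

```python
import numpy as np

values = np.array([3500000.0, 8300000.000000001])

# NumPy's strict 'safe' rule refuses float64 -> int64 outright...
print(np.can_cast(values.dtype, np.int64, casting="safe"))   # False

# ...so pandas falls back to casting and then comparing for equivalence:
casted = values.astype(np.int64)
print((casted == values).all())   # False: 8300000 != 8300000.000000001

# rounding first makes the cast value-preserving:
rounded = values.round()
print((rounded.astype(np.int64) == rounded).all())   # True: the cast is accepted
```

This is why pp['Value'].round().astype('Int64') succeeds while the direct cast fails.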
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | FluxZA |
