Memory efficient way to store bool and NaN values in pandas

I am working with quite a large dataset (over 4 GB), which I imported into pandas. Quite a few columns in this dataset are simple True/False indicators, and naturally the most memory-efficient way to store these would be a bool dtype. However, these columns also contain some NaN values that I want to preserve. Right now, this leads to the columns having dtype float (with values 1.0, 0.0 and np.nan) or object, and both use far too much memory.

As an example:

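import numpy as np
import pandas as pd

df = pd.DataFrame([[True, True, True], [False, False, False],
                   [np.nan, np.nan, np.nan]])
df[1] = df[1].astype(bool)   # NaN is truthy, so it is converted to True
df[2] = df[2].astype(float)  # True/False become 1.0/0.0, NaN is kept
print(df)
print(df.memory_usage(index=False, deep=True))
print(df.memory_usage(index=False, deep=False))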

results in

       0      1    2
0   True   True  1.0
1  False  False  0.0
2    NaN   True  NaN

0       100
1         3
2        24
dtype: int64

0        24
1         3
2        24
dtype: int64

What would be the most efficient way to store these kinds of values, knowing they can only take on 3 different values: True, False and <undefined>?



Solution 1:[1]

Building upon the previous answer, it might be worth mentioning that pandas has nullable integer dtypes (e.g. "Int8") which, as of v1.0.0, use pd.NA as a kind of "integer NaN"; its presence allows a column's dtype to remain integer. From the linked documentation page:

In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers.

This might be slightly more readable than encoding NaNs as some known-to-be-invalid integer value, and of course pd.isna returns True for them.
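
As a minimal sketch of the behaviour described above (the column values are made up for illustration, and "Int8" is just one of the nullable integer dtypes one could pick):

```python
import pandas as pd

# Hypothetical indicator column: 1 = True, 0 = False, missing = undefined.
s = pd.Series([1, 0, None], dtype="Int8")

print(s.dtype)     # the column stays integer instead of becoming float
print(pd.isna(s))  # the missing entry is reported as NA
```

The missing entry is stored as pd.NA rather than as a sentinel integer, so ordinary missing-data helpers such as pd.isna keep working.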

I do not know what effect this has in terms of memory, compared to a simple integer.
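
One way to find out is to measure it. The sketch below compares float64, nullable "Int8" and the nullable "boolean" dtype (also available since pandas 1.0.0) on a made-up column. My understanding is that the nullable dtypes keep a values array plus a separate boolean mask, so the exact numbers may vary between pandas versions and are best checked on your own data.

```python
import numpy as np
import pandas as pd

# Made-up data: 3000 rows cycling through True, False and missing.
values = [True, False, np.nan] * 1000

as_float = pd.Series(values, dtype="float64")    # 1.0 / 0.0 / NaN
as_int8 = as_float.astype("Int8")                # 1 / 0 / <NA>
as_boolean = pd.Series(values, dtype="boolean")  # True / False / <NA>

# Print the number of bytes each representation uses, excluding the index.
for name, s in [("float64", as_float), ("Int8", as_int8), ("boolean", as_boolean)]:
    print(name, s.memory_usage(index=False, deep=True))
```

If the nullable columns come out smaller than float64 on your data, converting with astype("boolean") (or astype("Int8")) keeps the three states True, False and missing while reducing memory.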

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dev-iL