How do you represent missing data in a Pandas DataFrame?

Does Pandas have an equivalent of R's NA (meaning "not available")? If not, what is the convention for representing a missing value, as opposed to NaN, which represents a mathematically impossible value such as a division by zero?



Solution 1:[1]

When this answer was written, there was no dedicated NA value available in Pandas or NumPy. From the section "Working with missing data" in the Pandas manual (http://pandas.pydata.org/pandas-docs/stable/missing_data.html):

The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas.

Also, this part of the documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html#nan-integer-na-values-and-na-type-promotions) provides more details on the trade-offs in this choice of NA representation.
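To make the type-promotion trade-off concrete, here is a minimal sketch (the variable names are only illustrative). It shows an integer Series becoming float64 once a missing value appears, because NaN is a floating-point value. The last part assumes pandas 1.0 or later, where the nullable "Int64" dtype and pd.NA are available; on older versions that part will not work.

```python
import numpy as np
import pandas as pd

# A plain integer Series cannot hold NaN, so introducing a missing
# value promotes the whole Series to float64.
ints = pd.Series([1, 2, 3])
with_gap = ints.reindex([0, 1, 2, 3])  # label 3 does not exist, so it becomes NaN

print(ints.dtype)
print(with_gap.dtype)
print(with_gap.isna().tolist())

# Assumption: pandas >= 1.0. The nullable "Int64" extension dtype keeps
# integers as integers and marks the gap with pd.NA instead of NaN.
nullable = pd.Series([1, None], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().tolist())
```

If you need to stay on plain NumPy dtypes, the float promotion is the price of using NaN as the missing-value marker.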

Solution 2:[2]

You can use NaN from numpy:

import numpy as np
np.nan

or simply

float('NaN')

The pandas docs mostly use the np.nan version: http://pandas.pydata.org/pandas-docs/dev/missing_data.html
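As a rough illustration of how both spellings behave once they are inside a DataFrame, the sketch below builds a small made-up frame (the column names and values are invented for the example) and uses isna, dropna and fillna. Note that NaN never compares equal to itself, so missing values should be detected with isna rather than ==.

```python
import numpy as np
import pandas as pd

# np.nan, float("nan") and None are all treated as missing by pandas.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [float("nan"), 5.0, 6.0],
    "c": ["x", None, "z"],
})

# NaN is not equal to itself, so an equality test cannot find it.
nan_equals_itself = (np.nan == np.nan)

# Count missing values per column.
n_missing = df.isna().sum().to_dict()

# Keep only rows with no missing values at all.
complete_rows = df.dropna()

# Replace missing values in one column with a default.
filled = df["a"].fillna(0.0)

print(n_missing)
print(complete_rows)
print(filled.tolist())
```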

Solution 3:[3]

NaN comes from numpy:

from numpy import nan
x = nan
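For example, a value imported this way can be assigned into an existing DataFrame to mark a cell as missing. This is only a sketch with an invented "price" column; it uses a float column so that no dtype change is needed. Aggregations such as mean and count skip NaN by default.

```python
from numpy import nan
import pandas as pd

# Hypothetical data: a float column with no missing values yet.
df = pd.DataFrame({"price": [1.5, 2.5, 3.5]})

# Mark the second row as missing.
df.loc[1, "price"] = nan

# count() ignores NaN, and mean() skips it by default (skipna=True).
n_present = df["price"].count()
avg = df["price"].mean()

print(df)
print(n_present, avg)
```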

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jon Lund Steffensen
Solution 2 tamasgal
Solution 3