Navigating The Hell of NaNs in Python

I recently had a lot of headaches caused by NaNs. Every programmer knows what they are and why they happen, but I did not know their characteristics well enough to avoid the struggle. Hoping to find solutions and spare myself another headache, I looked further into the behaviour of NaN values in Python. After playing with a few statements in a Jupyter Notebook, I found the results quite surprising and extremely confusing. Here is what I got using np.nan from NumPy.

np.nan in [np.nan] is True

So far so good, okay but …

np.nan == np.nan is False

Huh? And …

np.nan is np.nan is True

So what the hell is going on with NaNs in Python?

Short Intro

NaN stands for Not A Number and is a common missing data representation. It is a special floating-point value, so it only exists as a float and cannot be converted to an integer. It was introduced by the IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) before Python even existed and is used in all systems following this standard. NaN can be seen as some sort of data virus that infects all operations it touches.
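To make the "data virus" idea concrete, here is a minimal sketch using only the standard library. The variable names are mine and purely illustrative.

```python
import math

nan = float("nan")

# Arithmetic with NaN yields NaN, whatever the other operand is.
total = 1.0 + nan
product = 0.0 * nan
print(math.isnan(total), math.isnan(product))  # True True

# Aggregations are affected too: one NaN is enough to spoil a sum.
values = [1.0, 2.0, nan]
print(math.isnan(sum(values)))  # True

# Ordering comparisons that involve NaN are always False.
print(nan < 1.0, nan > 1.0)  # False False
```

A single NaN in a column is therefore enough to turn a sum or a mean into NaN unless it is handled first.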

None vs NaN

None and NaN sound similar and look similar, but they are actually quite different. None is a built-in Python object that can be considered the equivalent of NULL. The [None](https://www.w3schools.com/python/ref_keyword_none.asp) keyword is used to define a null value, or no value at all. None is not the same as 0, False, or an empty string. It has a datatype of its own (NoneType) and only None can be … None. While missing values are NaN in numerical arrays, they are None in object arrays. It is best to check for None by using foo is None instead of foo == None, which brings us back to the peculiar results I found in my NaN operations.
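The difference between numerical arrays and object arrays can be sketched as follows. The array contents are made up for illustration, and the example assumes NumPy and pandas are installed.

```python
import numpy as np
import pandas as pd

# An array that holds None falls back to the generic object dtype.
obj_arr = np.array([1, None, 3])
print(obj_arr.dtype)  # object

# A float array represents the gap with NaN instead.
num_arr = np.array([1.0, np.nan, 3.0])
print(num_arr.dtype)  # float64

# pandas converts None to NaN when it builds a numeric Series.
s = pd.Series([1.0, None, 3.0])
print(s.dtype)  # float64
print(s.isna().tolist())  # [False, True, False]

# None itself is checked with `is`, not `==`.
missing = None
print(missing is None)  # True
```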

nan is NOT equal to nan

At first, reading that np.nan == np.nan is False can trigger a reaction of confusion and frustration. It looks weird, sounds really weird but if you give it a little bit of thought, the logic starts to appear and even starts to make some sense.

Even though we do not know what every NaN is, not every NaN is the same.

Let’s imagine that instead of nan values, we are looking at a group of people that we do not know. They are completely unknown people to us. Unknown people can be seen as all the same to us, meaning that we describe them all as unknown. However, in reality, it does not mean that one unknown person is equal to another unknown person.

To leave this strange metaphor of mine and go back to Python, NaN cannot be equal to itself because NaN is the result of a failure, but that failure can happen in multiple ways. The result of one failure cannot be equal to the result of any other failure and unknown values cannot be equal to each other.
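As a small illustration of "failure can happen in multiple ways", the sketch below produces NaN through three different routes and compares the results. The names a, b and c are just placeholders.

```python
import math

inf = float("inf")

# Three different routes that all end in NaN.
a = inf - inf      # undefined subtraction
b = 0.0 * inf      # undefined multiplication
c = float("nan")   # explicit NaN

print(all(math.isnan(x) for x in (a, b, c)))  # True

# None of them compares equal to another, or even to itself.
print(a == b, b == c, a == a)  # False False False

# The flip side: NaN is the only float for which x != x holds.
print(a != a)  # True
```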

Equality vs Identity

Now, to understand why np.nan in [np.nan] is True, we have to look at the difference between equality and identity.

Equality

Equality refers to the concept that most Python programmers know as “==”. This is used to ask Python whether the content of the variable is the same as the content of another variable.

num = 1
num2 = 1
num == num2 

The last line will result in True. The content of both variables is the same. As I said previously, the content of NaN is never equal to the content of another NaN.

Identity

Identity is when you are asking Python whether a variable is the same as another variable, meaning you are asking whether the two variables refer to the same object. Python assigns an id to each object that is created, and ids are compared when Python looks at identity in an operation. np.nan is a single object that always has the same id, no matter which variable you assign it to.

import numpy as np
one = np.nan
two = np.nan
one is two

np.nan is np.nan is True and one is two is also True.

If you check the id of one and two using id(one) and id(two), the same id will be displayed.

np.nan in [np.nan] is True because the list container in Python checks identity before checking equality. However, there are different "flavors" of nans depending on how they are created. float('nan') creates a new object with a different id on every call, so float('nan') is float('nan') actually gives False! We will mention these differences again later.
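The identity-before-equality rule can be checked with plain floats. The contains function below is a hypothetical helper of mine that only approximates what the built-in list membership test does; it is not CPython's actual implementation.

```python
x = float("nan")

print(x == x)    # False: NaN is never equal to itself
print(x is x)    # True: it is still the same object
print(x in [x])  # True: the identity check succeeds first

# Two separate NaN objects fail both the identity and the equality check.
print(float("nan") in [float("nan")])  # False


def contains(items, target):
    """Rough equivalent of `target in items` for a list (illustrative only)."""
    return any(item is target or item == target for item in items)


print(contains([x], x))             # True
print(contains([float("nan")], x))  # False
```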

Dealing with nan without getting a headache

The full nan concept can be quite difficult to grasp and very annoying to deal with at first. Thankfully, pandas and numpy are fantastic when it comes to dealing with nan values and provide several functions that easily select, replace or delete the nan values in your variables.

Testing if a value is nan

As I said, whenever you want to know if a value is a nan, you cannot check whether it is equal to nan. However, there are many other ways to do so, and the ones I propose are not the only ones available out there.

import numpy as np
import pandas as pd
var = float('nan')
var is np.nan #results in False: float('nan') is a different object, so do not rely on identity
#use one of these instead
np.isnan(var) #results in True
#or
pd.isna(var) #results in True
#or
pd.isnull(var) #results in True

pd.isnull() and pd.isna() behave identically (pd.isnull is an alias of pd.isna). pandas provides both names because its DataFrames are an adaptation of R dataframes in Python. In R, null and na are two different types with different behaviours.

Besides numpy, as of Python 3.5 you can also use math.nan. The reason why I wrote both nan and NaN in this article (apart from my lack of consistency) is that the string passed to float() is not case sensitive. Both float('nan') and float('NAN') produce the same result.

import math
var = float('nan')
math.isnan(var) #results in True

A little warning:

import math
import numpy as np
math.nan is math.nan #results in True
math.nan is np.nan #results in False
math.nan is float('nan') #results in False

The last two statements give False because math.nan, np.nan and float('nan') are all different objects with different ids. They do not have the same identity.
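Since identity depends on where the NaN came from, a function-based check is the safer habit. The sketch below runs the three checks mentioned earlier against each flavor. The candidates dictionary is just an illustrative container.

```python
import math

import numpy as np
import pandas as pd

# The three "flavors" discussed above, keyed by how they were created.
candidates = {
    "math.nan": math.nan,
    "np.nan": np.nan,
    "float('nan')": float("nan"),
}

# Each check recognises every flavor, whatever its id.
for name, value in candidates.items():
    print(name, math.isnan(value), bool(np.isnan(value)), pd.isna(value))

# pd.isna additionally treats None as missing.
print(pd.isna(None))  # True
```

For mixed data that may also contain None or non-numeric values, pd.isna is the more forgiving choice of the three.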

For Dataframes

import numpy as np
import pandas as pd
some_data = {'a': [1.0, np.nan], 'b': ['x', None]} #placeholder data for illustration
df = pd.DataFrame(some_data)
df.dropna()
#returns a copy without the rows of your dataset that contain nan values
#use the subset parameter to drop rows with nan values in specific columns
df.fillna(0)
#returns a copy with nan values filled with the value of your choice (here 0)
df.isnull()
#same as pd.isnull() for dataframes
df.isna()
#same as pd.isna() for dataframes
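Here is a slightly fuller sketch of these methods on a small DataFrame. The column names and values are invented for the example, and filling age with the column mean is only one possible choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "score": [1.0, np.nan, 3.0],
        "age": [np.nan, 30.0, 40.0],
    }
)

# Count the missing values per column.
print(df.isna().sum())

# Drop only the rows where "score" is missing.
only_scored = df.dropna(subset=["score"])

# Fill each column with its own replacement value.
filled = df.fillna({"score": 0.0, "age": df["age"].mean()})

# These methods return new objects, so df itself still contains its NaNs.
print(len(only_scored), int(filled.isna().sum().sum()))  # 2 0
```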

Unfortunately, I do not find the pandas documentation on missing data extremely helpful. However, I really appreciate this excerpt from the Python Data Science Handbook, which gives a great overview of how to deal with missing data in pandas.

What to watch out for

TypeError: ‘float’ object is not iterable

While NoneType errors are quite clear, errors caused by nan values can be a little confusing. NaN values often cause errors (more specifically TypeErrors) that mention their type, 'float'. The error message can be surprising, especially when you believe that your data contains absolutely no floats. Your dataframe might not seem to include any, but it really does: it probably has NaN values you did not know about, and you need to handle those nan values in order to get rid of this error!
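A minimal way to reproduce this error, assuming a hypothetical tags column that holds lists with one missing entry:

```python
import numpy as np
import pandas as pd

# A column of lists where one entry is missing (hypothetical data).
df = pd.DataFrame({"tags": [["python", "numpy"], np.nan, ["pandas"]]})

message = ""
try:
    # Flattening the lists fails on the NaN row, because NaN is a float.
    all_tags = [tag for tags in df["tags"] for tag in tags]
except TypeError as err:
    message = str(err)
    print(message)

# Dropping the rows with a missing value first makes the loop safe.
clean = df.dropna(subset=["tags"])
all_tags = [tag for tags in clean["tags"] for tag in tags]
print(all_tags)  # ['python', 'numpy', 'pandas']
```

The column looks like it only contains lists, yet the hidden NaN is what the error message is talking about.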
