I recently had a lot of headaches caused by NaNs. Every programmer knows what they are and why they happen, but I did not know their characteristics well enough to avoid the struggle. Hoping to find solutions and spare myself another headache, I looked further into the behaviour of NaN values in Python. After playing with a few statements in a Jupyter Notebook, I found the results quite surprising and extremely confusing. Here is what I got using np.nan from NumPy.
`np.nan in [np.nan]` is True.
So far so good, okay but …
`np.nan == np.nan` is False.
Huh? And …
`np.nan is np.nan` is True.
So what the hell is going on with NaNs in Python?
NaN stands for Not a Number and is a common representation of missing data. It is a special floating-point value, so it only exists as a float; trying to convert it to an integer, for example, raises an error. It was introduced by the IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) before Python even existed and is used in all systems following this standard. NaN can be seen as some sort of data virus that infects all operations it touches.
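To make the "data virus" idea concrete, here is a small sketch using only the standard library. The variable names are mine and purely illustrative; the point is that arithmetic touching NaN gives NaN back, and that NaN refuses to become an integer.

```python
import math

nan = float("nan")

# Arithmetic that touches NaN produces NaN again
results = [nan + 1, nan * 0, nan / 2, math.sqrt(nan), sum([1.0, nan, 2.0])]
all_nan = all(math.isnan(r) for r in results)
print(all_nan)  # True

# NaN is a float and cannot be turned into an integer
try:
    int(nan)
    converted = True
except ValueError:
    converted = False
print(converted)  # False
```

A single NaN in a sum is enough to turn the whole total into NaN, which is why it tends to spread silently through a calculation.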
None and NaN sound similar and look similar, but they are actually quite different. None is a built-in Python constant that can be considered the equivalent of NULL. The [None](https://www.w3schools.com/python/ref_keyword_none.asp) keyword is used to define a null value, or no value at all. None is not the same as 0, False, or an empty string. It has a datatype of its own (NoneType) and only None can be … None. While missing values are NaN in numerical arrays, they are often None in object arrays. It is best to check for None by using `foo is None` instead of `foo == None`, which brings us back to our previous issue with the peculiar results I found in my NaN operations.
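A quick way to see the difference between the two is to compare their types and how they react to arithmetic. This is only an illustrative sketch with made-up variable names, not an exhaustive comparison.

```python
missing_object = None
missing_number = float("nan")

print(type(missing_object).__name__)  # NoneType
print(type(missing_number).__name__)  # float

# None is a single object, so an identity check is the reliable test
print(missing_object is None)  # True

# NaN takes part in arithmetic (and stays NaN), None does not
print(missing_number + 1)  # nan
try:
    missing_object + 1
    none_arithmetic_ok = True
except TypeError:
    none_arithmetic_ok = False
print(none_arithmetic_ok)  # False
```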
At first, reading that `np.nan == np.nan` is False can trigger a reaction of confusion and frustration. It looks weird and sounds really weird, but if you give it a little bit of thought, the logic starts to appear and even starts to make some sense.
Even though we do not know what every NaN is, not every NaN is the same.
Let’s imagine that instead of nan values, we are looking at a group of people that we do not know. They are completely unknown people to us. Unknown people can be seen as all the same to us, meaning that we describe them all as unknown. However, in reality, it does not mean that one unknown person is equal to another unknown person.
To leave this strange metaphor of mine and go back to Python: NaN is not equal to itself because IEEE 754 defines every equality comparison involving NaN as false. One way to make sense of this rule is that NaN is the result of a failure, and that failure can happen in multiple ways. The result of one failure cannot be assumed equal to the result of any other failure, and unknown values cannot be assumed equal to each other.
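Here is a small sketch of two different "failures" that both end in NaN. The variable names are only for illustration; the two operations are standard examples of undefined floating-point results.

```python
import math

inf = float("inf")
from_subtraction = inf - inf    # infinity minus infinity is undefined
from_multiplication = inf * 0   # infinity times zero is undefined

both_nan = math.isnan(from_subtraction) and math.isnan(from_multiplication)
print(both_nan)  # True

# Two NaNs never compare equal, not even a NaN with itself
print(from_subtraction == from_multiplication)  # False
print(from_subtraction == from_subtraction)     # False
print(from_subtraction != from_subtraction)     # True
```

The last line is the basis of an old trick: a value that is not equal to itself must be a NaN.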
Now, to understand why `np.nan in [np.nan]` is True, we have to look at the difference between equality and identity.
Equality refers to the concept that most Python programmers know as “==”. This is used to ask Python whether the content of the variable is the same as the content of another variable.
num = 1
num2 = 1
num == num2
The last line will result in True. The content of both variables is the same. As I said previously, the content of NaN is never equal to the content of another NaN.
Identity is when you are asking Python whether two variables refer to the very same object, meaning whether they share the same identity. Python assigns an id to each object that is created, and ids are compared when Python looks at identity in an operation. However, `np.nan` is a single object that always has the same id, no matter which variable you assign it to.
import numpy as np
one = np.nan
two = np.nan
one is two
`np.nan is np.nan` is True and `one is two` is also True.
If you check the id of `one` and `two` using `id(one)` and `id(two)`, the same id will be displayed.
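The same behaviour can be reproduced without NumPy. In the sketch below I use `math.nan` from the standard library as a stand-in for `np.nan` (my substitution, on the assumption that both are module-level float objects created once), and compare identity, ids and equality side by side.

```python
import math

one = math.nan
two = math.nan

same_identity = one is two       # both names point at the same object
same_id = id(one) == id(two)     # so their ids match
equal = one == two               # yet equality is still False for NaN

print(same_identity, same_id, equal)  # True True False
```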
`np.nan in [np.nan]` is True because the list container in Python checks identity before checking equality. However, there are different “flavors” of nans depending on how they are created. Each call to `float('nan')` creates a new object with its own id, so `float('nan') is float('nan')` actually gives False!! We will mention these differences again later.
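The membership test can be imitated by hand to make the "identity first, equality second" rule visible. The `contains` function below is a hypothetical helper of mine that follows the equivalence given in the Python language reference (`any(x is e or x == e for e in y)`); it is a sketch, not the actual list implementation.

```python
def contains(container, target):
    """Hypothetical re-implementation of `target in container`."""
    return any(item is target or item == target for item in container)

nan = float("nan")

# Same object on both sides: the identity check succeeds
print(nan in [nan])            # True
print(contains([nan], nan))    # True

# A freshly created NaN is a different object and is never equal
print(float("nan") in [nan])           # False
print(contains([nan], float("nan")))   # False
```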
The full nan concept can be quite difficult to grasp and very annoying to deal with at first. Thankfully, pandas and numpy are fantastic when it comes to dealing with nan values and bring several functions that will easily select, replace or delete the nan values in your variables.
As I said, whenever you want to know if a value is a nan, you cannot check whether it is equal to nan. However, there are many other ways to do so, and the ones I propose are not the only ones available out there.
import numpy as np
import pandas as pd
var = float('nan')
var is np.nan #results in False: float('nan') creates a new object, so do not rely on identity
#use one of these instead
np.isnan(var) #results in True
#or
pd.isna(var) #results in True
#or
pd.isnull(var) #results in True
`pd.isnull()` and `pd.isna()` behave identically. Pandas provides the `.isnull()` function because its dataframes are an adaptation of R dataframes in Python. In R, null and na are two different types with different behaviours.
Other than numpy, and as of Python 3.5, you can also use `math.nan`. The reason why I wrote both nan and NaN in this article (apart from my lack of consistency) is the fact that the value is not case sensitive. Both `float('nan')` and `float('NAN')` will produce the same result.
import math
var = float('nan')
math.isnan(var) #results in True
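To check the case-insensitivity claim, here is a short sketch that parses a few spellings and tests each result with `math.isnan`. The list of spellings is just my own sample.

```python
import math

spellings = ["nan", "NaN", "NAN"]
values = [float(s) for s in spellings]

# Every spelling parses to a NaN
all_nan = all(math.isnan(v) for v in values)
print(all_nan)  # True

# math.nan (Python 3.5+) is a NaN as well
print(math.isnan(math.nan))  # True
```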
A little warning:
import math
import numpy as np
math.nan is math.nan #results in True
math.nan is np.nan #results in False
math.nan is float('nan') #results in False
The last two statements give False because `math.nan`, `np.nan` and `float('nan')` all have different ids. They do not have the same identity.
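Since identity depends on where a NaN came from, a check based on the value is safer. The `is_nan` function below is a hypothetical helper of mine (not part of any library) that also guards against non-float inputs, because `math.isnan` raises a TypeError for values such as None or strings.

```python
import math

def is_nan(value):
    """Hypothetical helper: True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)

candidates = [math.nan, float("nan"), float("inf") - float("inf"), 1.0, None, "nan"]

# Value-based check: finds every NaN regardless of its origin
flags = [is_nan(c) for c in candidates]
print(flags)  # [True, True, True, False, False, False]

# Identity-based check: only finds the math.nan object itself
identity_flags = [c is math.nan for c in candidates]
print(identity_flags)  # [True, False, False, False, False, False]
```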
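import numpy as np
import pandas as pd
#some_data is a small made-up example; replace it with your own data
some_data = {'name': ['a', 'b', 'c'], 'score': [1.0, np.nan, 3.0]}
df = pd.DataFrame(some_data)
df.dropna()
#returns a copy without the rows that contain nan values
#use the subset parameter to only look at specific columns, e.g. df.dropna(subset=['score'])
df.fillna(0)
#returns a copy where nan values are filled with the value of your choice (here 0)
df.isnull()
#same as pd.isnull() for dataframes
df.isna()
#same as pd.isna() for dataframes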
Unfortunately, I do not find the pandas documentation extremely helpful when it comes to missing data. However, I really appreciate this excerpt from the Python Data Science Handbook, which gives a great overview of how to deal with missing data in pandas.
TypeError: ‘float’ object is not iterable
While NoneType errors are quite clear, errors caused by nan values can be a little confusing. NaN values often cause errors (more specifically TypeErrors) that mention their type, 'float'. The error message can be surprising, especially when you believe that your data has absolutely no floats. Your dataframe might not seem to include any floats, but actually, it does. It probably has NaN values you did not know about, and you simply need to get rid of them (or handle them) in order to get rid of this error!
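Here is a minimal way to reproduce that error without pandas. The `rows` list is a made-up stand-in for a dataframe column that should contain lists of tags but has one missing entry stored as NaN; the clean-up at the end is just one possible way of handling it.

```python
import math

# Made-up column: every entry should be a list, one is missing (NaN)
rows = [["python", "numpy"], float("nan"), ["pandas"]]

errors = []
for row in rows:
    try:
        for tag in row:   # looping over a NaN fails, it is a float
            pass
    except TypeError as exc:
        errors.append(str(exc))
print(errors)  # ["'float' object is not iterable"]

# Dropping the NaN entries first makes the loop safe
clean = [row for row in rows if not (isinstance(row, float) and math.isnan(row))]
print(clean)  # [['python', 'numpy'], ['pandas']]
```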