NaN, None and Experimental NA

Tables and dataframes commonly represent missing values in one of two ways. The first uses a mask to flag the missing entries, whereas the second uses a datatype-specific sentinel value to stand in for a missing entry.
A mask can be either global or local. A global mask consists of a separate boolean array for each data array (Figure 1), whereas a local mask reserves a single bit in each element’s bit-wise representation. A signed integer works in a similar way: it reserves a single bit to indicate the positive/negative sign of the value.

Figure 1: A global boolean mask approach. Note that MV denotes a missing value. Image by author, using diagrams.
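To make the global mask idea concrete, here is a minimal sketch using NumPy. The array contents, the placeholder value 99, and the variable names are made up for illustration; only `numpy.ma.masked_array` is an existing NumPy facility.

```python
import numpy as np

# Data array; 99 is an arbitrary placeholder sitting in the missing slots.
values = np.array([4, 99, 7, 99, 1], dtype=np.int32)

# Global mask: one boolean per element, True marks a missing value (MV).
is_missing = np.array([False, True, False, True, False])

# Every computation has to consult the mask to skip the missing entries.
valid_sum = int(values[~is_missing].sum())

# numpy.ma packages the same idea: the data plus a separate boolean mask.
masked = np.ma.masked_array(values, mask=is_missing)
masked_sum = int(masked.sum())

print(valid_sum, masked_sum)  # 12 12
```

Note that the data array itself still holds a value in every slot; it is the mask alone that says which of those values are meaningful.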
On the other hand, the sentinel approach defines a datatype-specific sentinel value. This can be either a conventional value chosen by best practice or a uniquely defined bit-wise representation. For missing floating-point values, libraries typically use NaN (Not a Number), a special value defined by the IEEE 754 floating-point standard (see Figure 2). Some languages and libraries, R for example, also define unique bit-wise patterns for other data types.

Figure 2: The bit-wise representation of an IEEE 754 single-precision (32-bit) NaN. According to Wikipedia, IEEE 754 NaNs are encoded with the exponent field filled with ones (as in the representations of infinity) and some non-zero number “x” in the significand field (“x” equal to zero denotes infinity). This allows for multiple distinct NaN values, depending on which bits are set in the significand field and on the value of the leading sign bit “s”. In single precision, this gives 16,777,214 (2²⁴ − 2) bit patterns that are NaNs, or about 0.4% of all possible values. The two subtracted values are positive and negative infinity. Also note that the first bit of x determines the type of NaN: “quiet NaN” or “signaling NaN”. The remaining bits encode a payload (most often ignored in applications). Image by author, made using diagrams.
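The encoding described in the caption can be checked with a few lines of standard-library Python. This is only a sketch: the helper names `float32_from_bits` and `is_nan_pattern` are made up for illustration, and the three bit patterns are hand-picked examples.

```python
import math
import struct

def float32_from_bits(bits):
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def is_nan_pattern(bits):
    """NaN: exponent field (8 bits) all ones and a non-zero significand (23 bits)."""
    exponent = (bits >> 23) & 0xFF
    significand = bits & 0x7FFFFF
    return exponent == 0xFF and significand != 0

quiet_nan = 0x7FC00000         # s=0, exponent all ones, first significand bit set
negative_payload = 0xFFC00001  # s=1, same quiet bit plus a payload of 1
positive_inf = 0x7F800000      # exponent all ones, significand zero: infinity

for bits in (quiet_nan, negative_payload, positive_inf):
    print(hex(bits), is_nan_pattern(bits), float32_from_bits(bits))

# Two sign values times (2**23 - 1) non-zero significands.
nan_count = 2 * (2**23 - 1)
nan_share = nan_count / 2**32
print(nan_count, round(100 * nan_share, 2))  # 16777214 0.39
```

The count matches the 2²⁴ − 2 figure in the caption: all 2²⁴ patterns with an all-ones exponent, minus the two patterns reserved for positive and negative infinity.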
Although the masking and sentinel approaches are widely used, both have trade-offs. A separate global boolean mask adds overhead in both storage and computation, whereas a sentinel reduces the range of valid values that can be represented. In addition, type-specific bit-wise sentinel patterns require extra logic for bit-level operations.
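As a rough illustration of the storage cost, the sketch below compares the size of a NumPy boolean mask with the size of the data it accompanies. The array length and variable names are arbitrary; the point is only that a NumPy boolean takes one byte per element.

```python
import numpy as np

n = 1_000_000

# One boolean (one byte in NumPy) per data element.
mask = np.zeros(n, dtype=bool)

# For 8-bit data, the mask is as large as the data itself.
data_int8 = np.zeros(n, dtype=np.int8)
overhead_int8 = mask.nbytes / data_int8.nbytes

# For 64-bit data, the same mask adds 12.5% on top.
data_float64 = np.zeros(n, dtype=np.float64)
overhead_float64 = mask.nbytes / data_float64.nbytes

print(overhead_int8, overhead_float64)  # 1.0 0.125
```

So the relative cost of a global mask is highest for exactly the small data types where memory is usually the reason for choosing them.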
As pandas is built on NumPy, it simply adopts the IEEE standard NaN as the sentinel value for floating-point data types. However, NumPy does not have built-in sentinels for non-floating-point data types. This means that pandas has to use either a mask or a sentinel for the non-floating-point types. That is, pandas could use a global boolean mask, locally reserve one bit in each element’s bit-wise representation, or define unique type-specific bit-wise representations similar to the IEEE’s NaN.
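The gap for non-floating-point types is easy to observe. The following sketch, with made-up example values, shows what happens with the default NumPy-backed dtypes when a missing value lands in floating-point data versus integer data.

```python
import numpy as np
import pandas as pd

# Floating-point data: NaN is itself a float, so the dtype is unchanged.
floats = pd.Series([1.5, np.nan, 3.0])
print(floats.dtype)  # float64

# Integer data without gaps keeps an integer dtype.
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# Integers have no NaN, so with the default dtypes pandas falls back
# to float64 in order to store the NaN sentinel.
ints_with_gap = pd.Series([1, None, 3])
print(ints_with_gap.dtype)  # float64
print(ints_with_gap.isna().tolist())  # [False, True, False]

# Plain NumPy has no sentinel either and falls back to generic objects.
as_objects = np.array([1, None, 3])
print(as_objects.dtype)  # object
```

In other words, the integer column silently becomes a floating-point column as soon as a single value is missing.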
However, as mentioned earlier, each of these three options (boolean mask, bit-level mask, and bit-wise sentinel patterns) comes at a price. For global boolean masks, pandas could build upon NumPy’s masked array (ma) module. But the required upkeep of the code base, the memory allocations, and the computational effort make this less practical. Similarly, on a local level, pandas could reserve a single bit in each element’s bit-wise representation. But for smaller 8-bit data units, losing a bit to a local mask halves the range of values they can store. Both global and local masking are therefore less favourable. This brings us to the third option, type-specific sentinels. Although possible in principle, pandas’ dependence on NumPy makes type-specific sentinels impractical. For example, NumPy supports 14 different integer types once precision, endianness, and signedness are accounted for. If unique IEEE-like bit-wise representations had to be specified and maintained for every data type NumPy supports, pandas would again face a mammoth development task.
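The cost of a local mask for 8-bit data is simple arithmetic. The sketch below assumes two's-complement signed integers; `signed_range` is a made-up helper for illustration, not a pandas or NumPy function.

```python
def signed_range(bits):
    """Smallest and largest value of a two's-complement integer with `bits` bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# A full 8-bit signed integer.
full = signed_range(8)

# The same storage with one bit reserved as a local missing-value mask.
reduced = signed_range(7)

full_count = full[1] - full[0] + 1
reduced_count = reduced[1] - reduced[0] + 1

print(full, full_count)        # (-128, 127) 256
print(reduced, reduced_count)  # (-64, 63) 128
```

Reserving one bit halves the number of representable values, which is a much larger relative loss for an 8-bit type than the single NaN-reserved exponent pattern is for a 32-bit or 64-bit float.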


towardsdatascience.com
