What are FP8, FP16, FP32, and FP64?
FP stands for “floating point”. It is a way to represent numbers in a computer using a scientific notation format that allows for a wide range of values to be represented, including very large and very small numbers, as well as numbers with a fractional component.
In a floating-point representation, a number is expressed in scientific notation as a mantissa (also known as significand or coefficient) multiplied by a power of the base (typically 2). The mantissa is a binary fraction that is normalized, meaning it has a single nonzero bit to the left of the binary point. The exponent specifies the power of the base by which the mantissa is multiplied.
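As a quick illustration, here is a minimal sketch in Python (standard library only) that splits a number into a mantissa and a power-of-two exponent. Note that Python's math.frexp normalizes the mantissa into the range [0.5, 1.0) rather than placing a single nonzero bit to the left of the binary point, but the idea is the same.

```python
import math

# Every finite float can be written as mantissa * 2**exponent.
# math.frexp returns the mantissa normalized into [0.5, 1.0).
mantissa, exponent = math.frexp(6.5)
print(mantissa, exponent)         # 0.8125 3
print(mantissa * 2 ** exponent)   # 6.5
```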
Floating-point representations are used in a variety of applications, including scientific computing, numerical analysis, and computer graphics. They are also used in the implementation of many programming languages, such as C and Python, and in hardware such as microprocessors and graphics processing units (GPUs).
Floating Points and Their Features
FP16 and FP32 are common precision modes used to perform computations on Graphics Processing Units (GPUs) such as NVIDIA GPUs. They refer to the precision of floating-point numbers used in computations.
- FP16 (Half-Precision): In this mode, the GPU uses 16-bit floating-point numbers to perform computations. This results in a smaller memory footprint and lower memory bandwidth requirements compared to FP32, but at the cost of reduced numerical precision.
- FP32 (Single-Precision): In this mode, the GPU uses 32-bit floating-point numbers to perform computations. This provides a higher level of numerical precision compared to FP16, but also requires more memory and higher memory bandwidth.
- FP64 (Double-Precision): In this mode, the GPU uses 64-bit floating-point numbers to perform computations. This provides an even higher level of numerical precision compared to FP32, but also requires even more memory and bandwidth compared to FP32.
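As a rough illustration of these trade-offs, the short Python sketch below (it assumes NumPy is available) stores the same value at each precision and prints the storage size and the rounded result:

```python
import numpy as np

# The same value stored at different precisions: each step up doubles
# the storage per element and keeps more significant digits.
x = 1 / 3
for dtype in (np.float16, np.float32, np.float64):
    v = dtype(x)
    print(dtype.__name__, v.nbytes, "bytes ->", v)
# float16 2 bytes -> 0.3333
# float32 4 bytes -> 0.33333334
# float64 8 bytes -> 0.3333333333333333
```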
Precision Mode Examples
Keep in mind these are just examples. The bit patterns below follow the standard IEEE 754 encodings; what may differ between GPU architectures is how the bytes are ordered in memory (endianness), not the encoding itself.
- FP16:
- 0.25 (in binary: 0011010000000000)
- -1.25 (in binary: 1011110100000000)
- FP32:
- 0.25 (in binary: 0x3e800000)
- -1.25 (in binary: 0xbfa00000)
- FP64:
- 0.25 (in binary: 0x3fd0000000000000)
- -1.25 (in binary: 0xbff4000000000000)
- INT8:
- 5 (in binary: 00000101)
- -5 (in binary: 11111011)
- INT16:
- 5 (in binary: 0000000000000101)
- -5 (in binary: 1111111111111011)
- INT32:
- 5 (in binary: 0x00000005)
- -5 (in binary: 0xfffffffb)
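If you want to check these encodings yourself, here is a small sketch using Python's standard struct module (the "e", "f", "d", and "h" format codes correspond to FP16, FP32, FP64, and INT16):

```python
import struct

# Pack a value in a given format, then reinterpret the raw bytes as an
# unsigned integer of the same width to see the underlying bit pattern.
def bits(value, pack_fmt, unpack_fmt):
    return struct.unpack(">" + unpack_fmt, struct.pack(">" + pack_fmt, value))[0]

print(format(bits(0.25, "e", "H"), "016b"))   # FP16:  0011010000000000
print(format(bits(-1.25, "e", "H"), "016b"))  # FP16:  1011110100000000
print(hex(bits(0.25, "f", "I")))              # FP32:  0x3e800000
print(hex(bits(-1.25, "f", "I")))             # FP32:  0xbfa00000
print(hex(bits(0.25, "d", "Q")))              # FP64:  0x3fd0000000000000
print(hex(bits(-1.25, "d", "Q")))             # FP64:  0xbff4000000000000
print(format(bits(-5, "h", "H"), "016b"))     # INT16: 1111111111111011
```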
In the binary representations I provided, “0x” is a hexadecimal prefix that indicates the number following it is in hexadecimal format.
For example, “0x3F2” is a hexadecimal representation of a binary number. To see what it represents in decimal form, you can convert it from hexadecimal to decimal. Here’s how you can do that:
- First, convert each hexadecimal digit to 4 binary digits. So, “3” becomes “0011”, “F” becomes “1111”, and “2” becomes “0010”.
- Then, concatenate these binary digits to form the full binary number: 001111110010.
- Finally, interpret this binary number as an ordinary binary integer and convert it to decimal (two’s complement only comes into play for signed values whose leading bit is 1).
So, the decimal representation of 0x3F2 is 1010.
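The same conversion takes a couple of lines in Python:

```python
# Hexadecimal to decimal, and the matching 12-digit binary string.
print(int("3F2", 16))          # 1010
print(format(0x3F2, "012b"))   # 001111110010
```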
Differences Between FP16 and INT16
For FP16, the binary representation consists of:
- 1 bit for the sign (0 for positive numbers, 1 for negative numbers)
- 5 bits for the exponent
- 10 bits for the mantissa
For INT16, the binary representation is a 16-bit two’s complement integer:
- the most significant bit indicates the sign (0 for non-negative numbers, 1 for negative numbers)
- the remaining 15 bits, combined with the sign bit, encode whole numbers from -32768 to 32767, with no exponent or fractional part
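The sketch below (plain Python, using struct again) pulls an FP16 value apart into the three fields described above and, for contrast, prints the two’s complement bits of an INT16. The helper name fp16_fields is just for illustration.

```python
import struct

# Split a half-precision value into its sign, exponent, and mantissa fields.
def fp16_fields(value):
    raw = struct.unpack(">H", struct.pack(">e", value))[0]
    sign = raw >> 15                 # 1 bit
    exponent = (raw >> 10) & 0x1F    # 5 bits (biased by 15)
    mantissa = raw & 0x3FF           # 10 bits
    return sign, exponent, mantissa

print(fp16_fields(-1.25))  # (1, 15, 256): -1.25 = -(1 + 256/1024) * 2**(15 - 15)

# An INT16, by contrast, is simply a 16-bit two's complement integer.
print(format(struct.unpack(">H", struct.pack(">h", -5))[0], "016b"))  # 1111111111111011
```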
FP8 Format
FP8 is not part of the IEEE 754 standard, and it is rarely used in general-purpose computing because it has a very limited range and precision compared to the other formats. An 8-bit floating-point format necessarily has a smaller range and precision than larger formats such as FP16, FP32, and FP64.
In an 8-bit floating-point format, there are only 3 bits available for the mantissa, which severely limits the precision of the representation. This means that small changes in the value of the number being represented may not be accurately captured, which can lead to rounding errors and other inaccuracies in calculations.
In addition, with only 4 bits available for the exponent, an 8-bit floating-point format can only represent numbers within a limited range. This makes it unsuitable for many applications that require a wider range of values to be represented.
For these reasons, 8-bit floating-point formats are generally not used in scientific computing, numerical analysis, or other applications that require high precision and accuracy. They do appear in some specialized settings, most notably low-precision machine-learning workloads, where the memory and bandwidth savings outweigh the loss of accuracy, but these remain specific use cases.
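To make those limits concrete, here is a minimal sketch of a decoder for a hypothetical 8-bit float with a 1-4-3 split (1 sign bit, 4 exponent bits with an assumed bias of 7, 3 mantissa bits) and no infinities or NaNs. The layout and bias are illustrative assumptions, not a description of any particular hardware format.

```python
# Decode one byte of a hypothetical 1-4-3 FP8 format (assumed bias of 7,
# no infinities or NaNs). Real 8-bit formats differ in the details.
def decode_fp8(byte):
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0:                                   # subnormal numbers
        return sign * (mantissa / 8) * 2.0 ** -6
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)

print(decode_fp8(0b0_0101_000))   # 0.25
print(decode_fp8(0b0_1111_111))   # 480.0, the largest value in this layout
print(len({decode_fp8(b) for b in range(256)}))  # 255 (+0 and -0 collapse)
```

In this sketch there are only 255 distinct values to cover everything from roughly 0.002 up to 480, so the gaps between representable numbers are large, which is exactly why such a format only makes sense where coarse values are acceptable.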