Accelerating NumPy, Pandas, and Scikit-Learn with GPU
GPU-accelerated data analytics is one way to improve the speed and scalability of your machine learning (ML) projects. It delivers rapid insights through higher performance and accelerates both computation and model training. RAPIDS achieves this by tapping into the massive parallelism of NVIDIA GPUs to process data.
What is RAPIDS?
RAPIDS is an open-source collection of Python libraries, developed by NVIDIA, that supercharges data science and analytics workflows using GPU acceleration. Designed to integrate seamlessly with widely used data science tools, RAPIDS leverages NVIDIA CUDA primitives to optimize low-level computations.
This means that users can harness GPU parallelism and high-bandwidth GPU memory directly from Python, resulting in significantly better performance and scalability across a range of data-processing tasks.
What is cuDF?
cuDF is a specialized Python GPU DataFrame library, constructed on the foundation of the Apache Arrow columnar memory format. It facilitates various data operations such as loading, joining, aggregating, and filtering.
What sets cuDF apart is its API, which mirrors that of pandas—a renowned Python library dedicated to data manipulation and analysis. This similarity in design makes cuDF an invaluable asset in data analytics, streamlining preprocessing and exploratory activities, and priming data frames for machine learning applications.
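As a brief illustration, here is a minimal sketch of the pandas-style API in cuDF; the file name and column names are placeholders, not part of any real dataset.

import cudf

# Read a CSV directly into GPU memory (file name is a placeholder)
df = cudf.read_csv("transactions.csv")

# Familiar pandas-style operations: filtering and groupby aggregation
high_value = df[df["amount"] > 100.0]
summary = high_value.groupby("customer_id")["amount"].mean()

print(summary.head())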
RAPIDS cuDF, being a GPU library built on top of NVIDIA CUDA, cannot simply take regular Python code and run it on a GPU. Under the hood, cuDF uses Numba to convert and compile Python code, such as user-defined functions, into CUDA kernels. At the same time, cuDF replicates the pandas API and functionality almost entirely. While it is not 100% feature-equivalent with pandas, the NVIDIA team and external contributors continually work on closing the feature-parity gap.
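For example, a plain Python function passed to Series.apply is JIT-compiled by Numba into a CUDA kernel behind the scenes. A minimal sketch with made-up data:

import cudf

s = cudf.Series([1.0, 2.0, 3.0, 4.0])

# A regular Python function; cuDF uses Numba to compile it for the GPU
def scale_and_shift(x):
    return x * 2.0 + 1.0

result = s.apply(scale_and_shift)
print(result)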
What is CuPy?
CuPy is an open-source library tailored for GPU-accelerated computing within the Python programming environment. It supports multi-dimensional arrays, sparse matrices, and numerous numerical algorithms built atop these structures. Remarkably, CuPy boasts an API set that mirrors those of NumPy and SciPy, enabling it to serve as a direct GPU-based substitute for running NumPy/SciPy codes.
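In practice, swapping NumPy for CuPy is often a one-line change. A minimal sketch (array sizes are arbitrary):

import numpy as np
import cupy as cp

# Same API, different device: these two lines are interchangeable
x_cpu = np.random.rand(1000, 1000)
x_gpu = cp.random.rand(1000, 1000)

# Familiar NumPy-style operations run on the GPU
y_gpu = cp.matmul(x_gpu, x_gpu.T).sum()

# Move the result back to host memory when needed
print(cp.asnumpy(y_gpu))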
What is cuML?
cuML is a library of GPU-accelerated machine learning algorithms that offers an API closely resembling the widely used scikit-learn API, so it slots into existing ML workflows with minimal changes. Once you have trained your cuML model, you can deploy it to NVIDIA Triton.
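To illustrate, here is a minimal sketch of a cuML estimator used in the familiar fit/predict style of scikit-learn; the data is synthetic and the parameters are arbitrary.

import cupy as cp
from cuml.cluster import KMeans

# Synthetic data living in GPU memory
X = cp.random.rand(10000, 16, dtype=cp.float32)

# Same constructor/fit/predict pattern as sklearn.cluster.KMeans
model = KMeans(n_clusters=8, random_state=0)
model.fit(X)
labels = model.predict(X)

print(labels[:10])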
In conjunction with cuDF and CuPy, cuML empowers data scientists and analysts by combining the user-friendly interactivity of leading open-source data science tools with the power of GPU acceleration. This combination ensures that ML projects benefit from both intuitive interfaces and accelerated data processing throughout the pipeline.
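As a sketch of how the pieces fit together in one pipeline, cuDF output can feed a cuML estimator directly; the file, column names, and model choice below are illustrative assumptions.

import cudf
from cuml.linear_model import LinearRegression

# Preprocess on the GPU with cuDF (file and columns are placeholders)
df = cudf.read_csv("housing.csv").dropna()
X = df[["sqft", "bedrooms"]]
y = df["price"]

# Train a GPU model with a scikit-learn-style API
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)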
How to Install RAPIDS?
You can install components of NVIDIA RAPIDS such as CuPy, cuDF, and cuML using pip, conda, or Docker. You can also specify the CUDA version according to your project needs; take a look at the RAPIDS documentation for other options.
pip install cudf-cu12 cuml-cu12 --extra-index-url=https://pypi.nvidia.com
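After installation, a quick sanity check confirms that the libraries import and can see the GPU (a minimal sketch; the exact versions reported will depend on your environment):

import cudf
import cuml
import cupy as cp

# Report library versions and the number of visible GPUs
print("cuDF:", cudf.__version__)
print("cuML:", cuml.__version__)
print("CuPy:", cp.__version__)
print("GPUs:", cp.cuda.runtime.getDeviceCount())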
Conclusion
CuPy, cuDF, and cuML significantly enhance the efficiency of machine learning pipelines. cuDF expedites data preprocessing, while cuML’s compatibility with the scikit-learn API simplifies the transition to GPU-powered machine learning. Together, they offer a seamless way to harness the immense computational prowess of GPUs in the realm of data science.
We will cover how to use cuDF, CuPy, and cuML, along with benchmarks, in upcoming posts.