Scaling AI Computing with Ray: Large-Scale Implementation and Optimization

conquerym

ym qu

Posted on November 23, 2024


Large-scale Practice of the Next-generation Supercomputing Framework Ray in AI Computing

1. Background

AI computing is widely used in various scenarios like traffic distribution, product operations, and content creation. The demands for AI computing involve core feature generation for search, ads, and recommendations, as well as AI-generated content (AIGC) related tasks, such as text-to-image, image generation, and AI effects. However, existing infrastructure faces the following challenges when applied to AI computing:

  1. Resource Issues: AI applications are compute-intensive, and using online resources directly incurs high costs.
  2. Deployment Issues: AI applications need to adapt to a large amount of heterogeneous hardware, making deployment complex.
  3. Application Orchestration: Complex feature dependencies and asynchronous processes lead to low development efficiency and high risks during changes.
  4. Platform Issues: The lack of platform support results in slow algorithm iteration and high barriers for using model capabilities.

For example, OCR, a core feature for recommendation and search, requires massive resources (over 1 million CPU cores) and has strict real-time and reliability requirements that the existing infrastructure cannot efficiently support. Thus, a low-cost, high-efficiency, and easy-to-use AI computing platform is needed.

2. Why Introduce Ray into AI Computing?

Ray is a general-purpose distributed computing engine widely adopted by many large companies. Its advantages include:

  1. Simple and Intuitive Distributed Programming: Using Ray’s simple API, developers do not need to deeply understand communication and scheduling details. By simply adding Python decorators, functions or classes can be transformed into distributed tasks, greatly simplifying the development of distributed applications.

Example: a local OCR pipeline is split into multiple model modules. With Ray, adding the @ray.remote decorator to each function turns it into a distributed task, significantly simplifying deployment.

   def detect(image_data):
       model = load_detect_model()
       return model(image_data)

   def recognize(det_result):
       model = load_recognize_model()
       return model(det_result)

   def ocr(image_data):
       det_result = detect(image_data)
       return recognize(det_result)

   image_data = load_image_data()
   ocr_result = ocr(image_data)

After applying Ray:

   import ray

   @ray.remote(num_gpus=1, num_cpus=16)
   def detect(image_data):
       model = load_detect_model()
       return model(image_data)

   @ray.remote(num_gpus=2, num_cpus=16)
   def recognize(det_result):
       model = load_recognize_model()
       return model(det_result)

   @ray.remote(num_cpus=4)
   def ocr(image_data):
       # Passing the ObjectRef directly lets Ray pipeline the two stages.
       det_result = detect.remote(image_data)
       return ray.get(recognize.remote(det_result))

   image_data = load_image_data()
   ocr_result = ray.get(ocr.remote(image_data))
  2. Extensive AI Framework Integration: Ray integrates seamlessly with popular machine learning libraries such as XGBoost, HuggingFace, PyTorch, and TensorFlow, making it easy to handle distributed data processing, training, inference, and service deployment.

  3. Efficient Scaling: Developers can develop on their laptops and then scale the application to Ray clusters with just a single line of code.

   RAY_ADDRESS=ray://<cluster>:<port> python your_script.py

Ray provides a high-performance distributed framework and a simple programming model, offering a unified platform for distributed computing. As such, Ray is an ideal choice as the foundation for large-scale AI computing in various applications.

3. Large-scale Practice of Ray in AI Computing

Platform 1 solves high real-time operational issues for online services through automated orchestration and flexible scaling. However, it does not support large-scale batch processing. Platform 2 is suitable for large-scale offline data processing but does not meet the needs for high real-time and high-throughput AI computing scenarios. To achieve a high real-time, high-throughput, high-reliability, and low-cost AI computing platform, I built Platform 3 based on Ray, addressing the following key issues:

  1. Low Cost: Supports heterogeneous resource extensions, fully utilizing low-cost resources.
  2. High Throughput: Supports scheduling of computing nodes at the scale of millions.
  3. Simplified Deployment: Reduces the complexity of deploying AI applications with multiple models.

3.1 Architecture

Compared to the community version of KubeRay, the self-developed Ray solution solves problems like small cluster size, difficulty in expanding heterogeneous resources, and slow scaling, making it suitable for large-scale internal AI applications.

The architecture of AstraRay addresses three core technical challenges:

  1. Managing Million-scale Pod Clusters: Internal AI applications far exceed the size limits of a single K8s cluster, requiring a single Ray application to support millions of pods.
  2. Ensuring Stability with Unstable Resources: Many low-cost, idle resources have unstable performance, making it difficult to ensure service stability.
  3. Simplified Application Deployment: AI applications must deal with deployment complexity across multiple models, hardware types, and modules.

By leveraging service discovery, load balancing, and disaster recovery scheduling, AstraRay builds a platform that supports large-scale resource management for AI computing.

3.2 Challenges of Supporting Million-scale Nodes in a Single Cluster

To support scaling to millions of nodes, AstraRay adopts a shared scheduling architecture. It resolves resource allocation conflicts using optimistic concurrent scheduling locks, avoiding the performance bottlenecks of a single scheduler. A is the core scheduler, supporting cross-platform scheduling across multiple resources and allowing different nodes to operate in heterogeneous resource environments.

The core components of the A scheduling architecture include:

  • Node: Handles application deployments.
  • Resource: Manages node heartbeats and resource status.
  • App: The resource scheduler within an application.
  • Scheduler: Responsible for load balancing and disaster recovery.
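The optimistic concurrent scheduling described above can be sketched as follows. This is a minimal illustration of the version-check-then-commit pattern, not AstraRay's actual data structures: the `Node` fields and the in-process lock standing in for the store's compare-and-swap are assumptions for the example.

```python
import threading

class Node:
    def __init__(self, node_id, free_cpus):
        self.node_id = node_id
        self.free_cpus = free_cpus
        self.version = 0  # bumped on every committed placement

class OptimisticScheduler:
    """Each app-level scheduler reads cluster state without locks, then
    commits its placement only if the node's version is unchanged."""
    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}
        self._commit_lock = threading.Lock()  # stands in for the store's CAS

    def try_place(self, cpus_needed):
        # Optimistic read phase: pick a candidate without holding any lock.
        candidate = max(self.nodes.values(), key=lambda n: n.free_cpus)
        seen_version = candidate.version
        if candidate.free_cpus < cpus_needed:
            return None
        # Commit phase: succeed only if no other scheduler touched the node.
        with self._commit_lock:
            if candidate.version != seen_version:
                return None  # conflict detected; the caller simply retries
            candidate.free_cpus -= cpus_needed
            candidate.version += 1
            return candidate.node_id
```

Because conflicts only cost a retry rather than serializing all placements through one scheduler, many application-level schedulers can place work concurrently.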

3.3 Challenges of Building Stable Services with Unstable Resources

Through fast disaster recovery scheduling and a dynamically weighted SWRR (smooth weighted round-robin) routing algorithm, AstraRay handles unstable nodes quickly and schedules requests efficiently, reducing service failure rates and improving overall resource utilization.

3.3.1 Fast Disaster Recovery Scheduling

By using K8s’ PreStop Hook mechanism and Resource pre-aggregated broadcast, AstraRay achieves node eviction within 4 seconds, significantly reducing application failure rates.
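The heartbeat side of this can be sketched as follows. The interval and missed-beat threshold below are illustrative assumptions chosen to match the stated 4-second eviction target, not AstraRay's actual parameters:

```python
import time

HEARTBEAT_INTERVAL_S = 1.0  # assumed heartbeat period (illustrative)
MISSED_BEATS_LIMIT = 3      # evict after >3s of silence, within the 4s target

class HeartbeatTracker:
    """Minimal sketch: the Resource component records a timestamp per node
    and evicts (and would broadcast) any node that goes silent."""
    def __init__(self, now=time.monotonic):
        self.now = now  # injectable clock, handy for testing
        self.last_beat = {}

    def beat(self, node_id):
        self.last_beat[node_id] = self.now()

    def evict_stale(self):
        deadline = HEARTBEAT_INTERVAL_S * MISSED_BEATS_LIMIT
        now = self.now()
        stale = [n for n, t in self.last_beat.items() if now - t > deadline]
        for n in stale:
            del self.last_beat[n]  # the real system broadcasts the eviction
        return stale
```

The pre-aggregated broadcast means routers learn of the eviction directly, rather than each discovering the dead node through their own failed requests.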

3.3.2 Dynamic Weighted SWRR Routing Algorithm

AstraRay uses a self-adaptive weighted SWRR algorithm to quickly adjust request distribution, balancing node utilization while reducing request latency.
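A minimal sketch of smooth weighted round-robin (the SWRR in the text) is below. The selection loop is the standard algorithm; the weight-update hook is where the "dynamic" part would plug in, and its policy here is an assumption, not AstraRay's actual rule:

```python
class SmoothWeightedRoundRobin:
    """Standard smooth weighted round-robin: higher-weight nodes are picked
    more often, and picks are interleaved rather than bursty."""
    def __init__(self, weights):
        self.weights = dict(weights)           # node -> effective weight
        self.current = {n: 0 for n in weights}

    def pick(self):
        total = sum(self.weights.values())
        for node, w in self.weights.items():
            self.current[node] += w            # everyone gains its weight
        best = max(self.current, key=self.current.get)
        self.current[best] -= total            # winner pays the total
        return best

    def set_weight(self, node, weight):
        # Dynamic part (illustrative): e.g. lower a node's weight when its
        # observed latency or utilization rises, raise it when it recovers.
        self.weights[node] = max(1, weight)
        self.current.setdefault(node, 0)
```

With weights {a: 5, b: 1, c: 1} the pick sequence is a, a, b, a, c, a, a: node a receives 5 of every 7 requests, but never 5 in a row, which keeps per-node load smooth.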

3.4 Challenges of Reducing Application Deployment Complexity

The complexity of deploying AI applications involves multi-model, multi-hardware, and multi-module extensions. AstraRay simplifies AI application deployment through the following approaches:

  1. Multi-model Extension: Through Conda environment isolation and packaging, AstraRay supports runtime dynamic switching of Python environments, addressing the complexity of model extensions.
  2. Fast Model Distribution: AstraRay accelerates large model distribution through an embedded P2P network.
  3. Multi-module Extension: AstraRay achieves horizontal scaling and disaster recovery capability through the Ray Federation cluster architecture.
  4. Multi-hardware Extension: Based on the TFCC framework, AstraRay unifies engine and hardware adaptation, supporting inference tasks on various heterogeneous hardware platforms.

3.4.1 Multi-model Extension Challenge

The core of the multi-model extension challenge is the dynamic switching of model runtime environments, which involves two problems:

  1. Runtime Dynamic Switching: Ray’s RuntimeEnv provides environment management but cannot switch Python versions. Additionally, dependencies outside of the Python environment rely solely on Docker, limiting flexibility.

AstraRay solves this by supporting Conda as the isolation and packaging tool for Python runtime environments. Unlike Ray’s Conda mechanism, AstraRay initializes the runtime environment before starting Ray, allowing different applications to customize their Python versions. Users can specify dependencies using requirements.txt and package them with conda-pack, enabling fast distribution during large-scale scaling. This approach avoids the pressure of source downloads and allows users to customize environments (such as TensorRT), providing greater flexibility.

  2. Fast Model Distribution:

As models grow larger in the era of large models (LLMs), the size of model files can reach tens of GBs, which may take tens of minutes to download. While Ray provides the working_dir feature to distribute code and models, it relies on GCS nodes and has a default size limit of 500MB, making it unsuitable for production environments with large models.

To address this, AstraRay embeds a P2P network into nodes. The P2P network consists of server and SDK components. The server manages P2P nodes via heartbeats and maintains seed information in the cluster. P2P nodes provide file fragment caching and transfer capabilities, accelerating large model distribution.

P2P Network Optimizations:

  • NAT Traversal: P2P supports NAT detection and traversal, minimizing network connectivity issues.
  • Node Auto-throttling: P2P measures the bandwidth of nodes and sets appropriate thresholds to avoid impacting regular services.
  • Global Throttling: The server can issue global throttling commands to prevent network congestion from affecting other services.
  • Download Acceleration: For new files, P2P distributes requests across different nodes by randomizing fragment download sequences, preventing bottlenecks and improving download speed.
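The download-acceleration idea can be sketched as follows. The fragment size and helper names are illustrative assumptions, not the actual P2P protocol:

```python
import random

FRAGMENT_SIZE = 4 << 20  # assumed 4 MiB fragments (illustrative)

def fragment_count(file_size):
    """Number of fragments a file is split into (last one may be short)."""
    return (file_size + FRAGMENT_SIZE - 1) // FRAGMENT_SIZE

def download_order(file_size, peer_seed):
    """Each peer shuffles the fragment indices with its own seed, so the
    first requests for a brand-new file spread across different fragments
    instead of every peer fetching fragment 0 from the origin at once."""
    order = list(range(fragment_count(file_size)))
    random.Random(peer_seed).shuffle(order)
    return order
```

Once a few peers hold different fragments, subsequent peers can fetch most of the file from each other, which is what turns cold-start distribution of a multi-GB model from minutes of origin bandwidth into a fan-out across the cluster.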

3.4.2 Multi-module Extension Challenge

To enhance the scalability of Ray applications, AstraRay implements a Ray Federation cluster architecture. Each Ray application can span multiple Ray clusters, each with full functionality. Users can resize individual Ray clusters as needed (vertical scaling), improving application processing power and resource utilization, and the number of Ray clusters can be scaled horizontally to handle larger computing workloads.

For disaster recovery, AstraRay enhances Ray's built-in capabilities with the following strategies:

  • Head node disaster recovery: If the head node goes offline, the system will horizontally scale up a new Ray cluster.
  • Worker node disaster recovery: If any worker node goes offline, the system will spin up a new worker within the same Ray cluster.

With these disaster recovery strategies, AstraRay minimizes the service failure impact caused by unstable or low-priority resources.
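The two recovery rules can be sketched as a reconcile step. The cluster representation and spawn helpers here are hypothetical, introduced only to illustrate the control flow:

```python
def reconcile(clusters, spawn_cluster, spawn_worker, desired_workers):
    """Sketch of federation disaster recovery: a dead head node is handled
    by scaling out a whole new Ray cluster; a dead worker is replaced
    inside its own cluster. `spawn_cluster`/`spawn_worker` are illustrative
    callbacks, not AstraRay APIs."""
    healthy = []
    for cluster in clusters:
        if not cluster["head_alive"]:
            # Head gone: the cluster is unrecoverable, start a fresh one.
            healthy.append(spawn_cluster())
            continue
        while len(cluster["workers"]) < desired_workers:
            # Worker gone: top the same cluster back up to its target size.
            cluster["workers"].append(spawn_worker(cluster))
        healthy.append(cluster)
    return healthy
```

Treating head failure as "replace the cluster" rather than "repair the head" is what makes cheap, preemptible resources usable: any single cluster is disposable, and the federation as a whole absorbs the loss.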

3.4.3 Multi-hardware Extension

The challenges related to multi-hardware extension primarily include:

  1. Diverse Inference Business Types: Different applications use a variety of inference engines and models (such as PyTorch, ONNX, TensorRT, etc.).
  2. Complex and Repetitive Hardware Adaptation: Adapting to different GPU types (e.g., NVIDIA, AMD, Huawei) is labor-intensive and repetitive.
  3. High Costs for Engine Support and Model Switching: Switching between different inference engines can be complex and increase development and deployment costs.

AstraRay addresses these challenges by building on the TFCC (TensorFlow Compute Cluster) framework, which standardizes the service framework and unifies the access mode for inference engines. The TFCC framework abstracts the underlying hardware complexity, making it transparent to the application layer. As a result, developers need only declare their models without writing complex inference code. The framework also internalizes the hardware adaptation work, allowing a single codebase to run inference on multiple hardware platforms, supporting a wide range of AI application scenarios.

4. Conclusion

The advent of the AI era has posed numerous challenges to the backend infrastructure. By introducing Ray, I built a platform that adapts to the foundational environment and provides an efficient, streamlined development process for AI applications. On top of Ray, I simplified cluster management and, by utilizing low-cost, idle resources, significantly reduced machine costs.

AstraRay, despite being a relatively new project, has already laid a solid foundation for AI applications in production environments. The platform continues to undergo optimization and improvements, ensuring it is well-prepared for future AI computing use cases across various business scenarios.
