Integration of Machine Learning through AWS Batch

Raghul G

Posted on August 11, 2023

Introduction:

AWS Batch is a fully managed batch processing service designed for scalability. It provides batch management capabilities that plan, schedule, and run your submitted jobs on AWS. Key features of AWS Batch include:

Fully Managed: AWS Batch provisions, manages, and scales your infrastructure, eliminating the need for software installation or server management.
Integrated with AWS: AWS Batch interacts seamlessly and securely with other AWS services such as Amazon S3, Amazon DynamoDB, and Amazon Rekognition.
Cost-optimized Resource Provisioning: AWS Batch automatically provisions compute resources using EC2 On-Demand Instances or EC2 Spot Instances.

Let’s see how AWS Batch supports and integrates with machine learning models.

AWS Batch Main Components:
Fig. 1 depicts the main components of AWS Batch.

Fig. 1

Scheduler – The scheduler is the core of AWS Batch. It evaluates when, where, and how to run the jobs that have been submitted to a job queue. Jobs run approximately in the order in which they are submitted, as long as all dependencies on other jobs have been met.

Jobs – Jobs are units of work executed by AWS Batch. They run as containerized applications on EC2. A containerized job references a container image together with the commands and parameters to run; users can also import the application as a zip file into an Amazon Linux container and run it. Jobs can be dependent on or independent of one another. In a machine learning context, each job can be thought of as a machine learning task within a larger machine learning workload, and the scheduler runs each of these tasks at the appropriate time based on its dependencies and resource needs.
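
To make this concrete, here is a minimal sketch using boto3 (the AWS SDK for Python) that submits a containerized job. The queue name "ml-queue", job definition "ml-training-job", script, and S3 path are assumptions for illustration only.

```python
import boto3

# Submit a containerized training job to AWS Batch.
# "ml-queue" and "ml-training-job" are assumed to already exist.
batch = boto3.client("batch")

response = batch.submit_job(
    jobName="train-model-run-1",
    jobQueue="ml-queue",
    jobDefinition="ml-training-job",
    containerOverrides={
        "command": ["python", "train.py", "--epochs", "10"],
        "environment": [{"name": "S3_DATA_PATH", "value": "s3://my-bucket/train/"}],
    },
)
print("Submitted job:", response["jobId"])
```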

States of a Job:
Fig. 2 provides an overview of the different job states. Let’s review them; a short sketch for checking a job’s state programmatically follows the list.

Fig. 2
Submitted: Jobs that have been accepted into the queue but have not yet been evaluated by the scheduler. These jobs wait before proceeding to the runnable state.
Pending: Jobs that depend on other unfinished jobs remain pending until those dependencies are complete.
Runnable: Jobs that have been evaluated by the scheduler and are ready to be placed on compute resources.
Starting: Jobs that are in the process of being scheduled onto compute resources.
Running: Jobs whose containers are currently executing on compute resources.
Succeeded: Jobs that completed with an exit code of 0.
Failed: Jobs that finished with a non-zero exit code, or that were cancelled or terminated.
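
As a rough sketch of how these states can be observed programmatically, the snippet below polls a submitted job with boto3 until it reaches a terminal state; the job ID is assumed to come from an earlier submit_job call.

```python
import time
import boto3

batch = boto3.client("batch")

def wait_for_job(job_id, poll_seconds=30):
    """Poll an AWS Batch job until it reaches SUCCEEDED or FAILED."""
    while True:
        job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
        status = job["status"]  # SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, SUCCEEDED, FAILED
        print(f"{job_id}: {status}")
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_seconds)
```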

Array Jobs:
Now that we’ve reviewed how jobs are evaluated by the scheduler and move through the seven job states, recall that these jobs represent machine learning tasks within a larger machine learning workload. In some use cases, a very large number of such tasks must be executed, each of which would otherwise be an independent “simple job”. Submitting and managing them as a large number of independent simple jobs is inefficient and can lead to failures. AWS Batch provides a solution through “Array Jobs”: instead of submitting many independent simple jobs, an array job runs many copies of one application against an array of elements (see the submission sketch after the list below).

Array Jobs are an efficient way to run:
•Parametric sweeps
•Monte Carlo simulations
•Processing a large collection of objects
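
As a sketch, the snippet below submits a single array job with 1,000 child jobs instead of 1,000 independent simple jobs; each child reads the AWS_BATCH_JOB_ARRAY_INDEX environment variable to pick its own slice of work. The queue, job definition, and script names are illustrative assumptions.

```python
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="mc-simulation",
    jobQueue="ml-queue",                # assumed to exist
    jobDefinition="ml-simulation-job",  # assumed to exist
    arrayProperties={"size": 1000},     # 1,000 child jobs from one submission
    containerOverrides={"command": ["python", "simulate.py"]},
)
print("Array job:", response["jobId"])
```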

Job Queues:
Machine learning jobs are submitted to job queues, where they reside until they can be scheduled onto a compute resource. Information about completed jobs persists in the queue for 24 hours. Job scheduling can follow either FCFS (First Come, First Served) or priority-based rules.
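
For illustration, here is a hedged sketch of creating a priority-based job queue with boto3; the compute environment names are placeholders that would need to exist in your account.

```python
import boto3

batch = boto3.client("batch")

batch.create_job_queue(
    jobQueueName="ml-queue",
    state="ENABLED",
    priority=10,  # queues with higher priority are scheduled first on shared environments
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "ml-spot-env"},      # tried first
        {"order": 2, "computeEnvironment": "ml-ondemand-env"},  # fallback
    ],
)
```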

Compute Environments:
Job queues are mapped to one or more compute environments containing EC2 instances used to run containerized batch jobs. Two compute environment types exist:

Managed Compute Environment – In a managed compute environment, you describe your business requirements and define the instance types, desired number of virtual CPUs (vCPUs), and the EC2 Spot bid as a percentage of the On-Demand price. Container instances launch from an ECS-optimized AMI that includes the ECS (Elastic Container Service) agent, which connects them to the cluster; alternatively, the ECS container agent can be installed on any Amazon EC2 instance that supports the configuration.

Unmanaged Compute Environment – In an unmanaged compute environment, you bring your own instances and resources instead of using the prescribed ones. These instances need to include the ECS agent and run supported versions of Linux and Docker. AWS Batch creates an Amazon ECS cluster that accepts the instances you launch, and jobs can be scheduled as soon as the instances are healthy and registered with the cluster.

Resources are important when deploying a machine learning model and tuning the hyperparameters needed for training and testing. In some scenarios, the amount of resources needed to train and test the model is not known up front. In those situations, an unmanaged compute environment can be used to run your machine learning tasks: the tasks are packaged as containerized batch jobs, the ECS agent connects your instances to the cluster, and execution remains flexible. Managed compute environments suit scenarios where a pre-defined amount of resources will be used for the containerized batch jobs; for example, anomaly detection in supply chain management via machine learning can run in a managed compute environment.
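
As a rough sketch of a managed compute environment that uses Spot Instances with an On-Demand price cap, the boto3 call below could be used; the subnet, security group, role ARNs, and instance families are placeholder assumptions.

```python
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="ml-spot-env",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        "bidPercentage": 60,            # max Spot price as a percentage of On-Demand
        "minvCpus": 0,
        "desiredvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["c5", "p3"],  # illustrative instance families
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```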

Job Definitions:
Job definitions describe how a job should be executed, similar to ECS task definitions. For machine learning jobs, a job definition captures what the machine learning algorithm is going to do, the hyperparameters used by the task, the data preprocessing techniques applied to improve accuracy, the hyperparameter tuning methodology, and more.
Example attributes specified in a job definition are listed below (a registration sketch follows the list):
•IAM role associated with the Job
•vCPUs and memory requirements
•Retry Strategy
•Mount Points
•Container Properties
•Environment Variables
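
Putting some of these attributes together, a job definition for a training job might be registered with boto3 roughly as follows; the image URI, role ARN, and hyperparameter values are illustrative assumptions.

```python
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="ml-training-job",
    type="container",
    retryStrategy={"attempts": 2},  # retry strategy
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-train:latest",
        "vcpus": 4,
        "memory": 16384,  # MiB
        "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",  # IAM role for the job
        "command": ["python", "train.py", "--learning-rate", "0.001"],
        "environment": [{"name": "MODEL_OUTPUT", "value": "s3://my-bucket/models/"}],
    },
)
```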

Workflows, Pipelines, and Job Dependencies:
Jobs can express a dependency on the successful completion of other jobs or on specific elements of an array job.
Two types of workflow engines and languages can be used to submit jobs:
•Flow-based systems submit jobs sequentially.
•DAG (Directed Acyclic Graph) based systems submit jobs simultaneously and can identify dependencies between them.

We can use our preferred workflow engine and language to submit machine learning jobs, and as a result AWS Batch supports a wide range of dependency models.
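
As a minimal sketch of a flow-based pipeline (preprocess, then train, then evaluate) expressed with job dependencies, each stage below starts only after the previous one succeeds; the queue and job definition names are assumptions.

```python
import boto3

batch = boto3.client("batch")

preprocess = batch.submit_job(
    jobName="preprocess", jobQueue="ml-queue", jobDefinition="ml-preprocess-job",
)
train = batch.submit_job(
    jobName="train", jobQueue="ml-queue", jobDefinition="ml-training-job",
    dependsOn=[{"jobId": preprocess["jobId"]}],  # runs after preprocess succeeds
)
evaluate = batch.submit_job(
    jobName="evaluate", jobQueue="ml-queue", jobDefinition="ml-evaluate-job",
    dependsOn=[{"jobId": train["jobId"]}],       # runs after train succeeds
)
```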

Array Job Dependency Models:
Figs. 3, 4, 5, 6, and 7 depict the different types of dependency models.

Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
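
One of these dependency models, the N_TO_N relationship between two array jobs, can be sketched with boto3 as follows: child index i of the second array job starts as soon as child index i of the first succeeds. The names used here are illustrative assumptions.

```python
import boto3

batch = boto3.client("batch")

stage1 = batch.submit_job(
    jobName="featurize", jobQueue="ml-queue",
    jobDefinition="ml-featurize-job", arrayProperties={"size": 100},
)
stage2 = batch.submit_job(
    jobName="score", jobQueue="ml-queue",
    jobDefinition="ml-score-job", arrayProperties={"size": 100},
    dependsOn=[{"jobId": stage1["jobId"], "type": "N_TO_N"}],  # per-index dependency
)
```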

How do EC2 instances allocate compute resources for batch jobs? An instance can run a single container for a batch job, or it can host multiple containers, depending on how many functionally related jobs are packed onto it. Multi-Node Parallel (MNP) jobs enable AWS Batch to run a single job across multiple EC2 instances. The instances can communicate with each other and can be integrated with the Elastic Fabric Adapter (EFA) for low latency between nodes.
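
A multi-node parallel job is declared in the job definition rather than at submission time. Here is a rough sketch under assumed names: a 4-node definition whose node range shares one container configuration; the image URI and resource sizes are placeholders.

```python
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="ml-distributed-training",
    type="multinode",  # multi-node parallel (MNP) job
    nodeProperties={
        "numNodes": 4,
        "mainNode": 0,  # index of the node that coordinates the job
        "nodeRangeProperties": [
            {
                "targetNodes": "0:3",  # all four nodes share this container config
                "container": {
                    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-train:latest",
                    "vcpus": 8,
                    "memory": 32768,
                    "command": ["python", "distributed_train.py"],
                },
            }
        ],
    },
)
```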

Conclusion:
AWS Batch is a fully managed service that can be integrated with machine learning models, providing efficient resource scaling for various use cases, at a competitive price.

For more content like this, follow Raghul Gopal.
Raghul G
AWS Cloud Captain | India
AWS Cloud Club St. Joseph’s Institute of Technology
