AWS Machine Learning exam guide


Michael Stainsbury

Posted on August 2, 2023


A guide to the guide
Syllabus, specification and blueprint are all terms used to describe the knowledge domain of an exam or course. However, AWS calls the description of the content of its Machine Learning exam the Exam Guide. Perhaps this is a telling choice, since the information provided is far from comprehensive: it is just a guide.

If you come from an AWS Machine Learning background the Exam Guide PDF will be sufficient for you. However, if you are a Data Scientist who wishes to move into AWS, or you use AWS and want to learn SageMaker and Machine Learning, then large chunks of the Exam Guide will be unintelligible. This guide to the guide fills the gaps and explains the high level concepts.

The Exam Guide is where the exam subjects are listed, split into four domains and fifteen sub-domains. This article describes each sub-domain in enough detail for the complete newbie to get a good idea of what it is about. If you intend to study for the AWS Machine Learning certification, this will give you an overview of what you are getting yourself into.

AWS pdf: https://d1.awsstatic.com/training-and-certification/docs-ml/AWS-Certified-Machine-Learning-Specialty_Exam-Guide.pdf

Domain 1: Data Engineering

Domain 1 Data Engineering is concerned with obtaining the data, transforming it and putting it in a repository. It comprises 20% of the exam marks. There are three sub-domains that can be summarised as:

1.1 Data repositories
1.2 Data ingestion
1.3 Data transformation

1.1 Data repositories

Create data repositories for machine learning

The data repository is where raw and processed data is stored. S3 is the repository of choice for Machine Learning in AWS and all built-in algorithms and services can consume data from S3. Other data stores are also mentioned in the exam guide:

  • Database (Relational Database Service)
  • Data Lake (Lake Formation)
  • EFS
  • EBS

Often data is generated by the business itself, but sometimes data from other sources is needed to train the model, for example libraries of image data to train the Object Detection algorithm. Many data sources are publicly available.
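As a minimal sketch of the repository in practice, assuming boto3 is configured with credentials and using an invented bucket name, raw data typically lands in and is read back from S3 like this:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file of raw training data to the (invented) bucket.
s3.upload_file("train.csv", "my-ml-bucket", "raw/train.csv")

# A training job, Glue crawler or notebook later reads it back.
s3.download_file("my-ml-bucket", "raw/train.csv", "/tmp/train.csv")
```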

1.2 Data ingestion

Identify and implement a data ingestion solution

The data ingestion sub-domain is concerned with gathering the raw data into the repository. This can happen via batch processing or streaming. With batch processing, data is collected and grouped at a point in time and passed to the data store. Streaming data is constantly being collected and fed into the data store. The AWS streaming services are the Kinesis family:

  • Kinesis Data Streams
  • Kinesis Data Firehose
  • Kinesis Data Analytics
  • Kinesis Video Streams

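As a hedged sketch of the streaming path, a producer can push a record onto a Kinesis data stream with boto3; the stream name and event fields here are invented:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# One JSON event; records sharing a partition key land on the same shard.
event = {"sensor_id": "s-42", "temperature": 21.7}
kinesis.put_record(
    StreamName="ml-ingest-stream",          # invented stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["sensor_id"],
)
```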

Batch processing requires a way to schedule or trigger the processing, also called job scheduling. Examples are Glue Workflow and Step Functions. AWS batch services include:

  • EMR (Hadoop)
  • Glue (Spark)

1.3 Data transformation

Identify and implement a data transformation solution

The third Data Engineering sub-domain focuses on how raw data is transformed into data that can be used for ML processing. The transformation process changes the data structure. The data may also need to be cleaned up, de-duplicated, have incomplete records managed and have its attributes standardised. The AWS services are similar to those used for data ingestion:

  • Glue (Spark)
  • EMR (Hadoop, Spark, Hive)
  • AWS Batch

Once these data engineering processes are complete the data is ready for further pre-processing prior to being fed into a Machine Learning algorithm. This pre-processing is covered by the second knowledge domain, Exploratory Data Analysis.
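To make the transformation step concrete, here is a minimal PySpark sketch of the kind of job Glue or EMR would run. The bucket, paths and column names are invented, and reading s3:// paths relies on the Hadoop S3 connector that Glue and EMR clusters provide:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

# Read raw CSV, de-duplicate, manage incomplete data, standardise an attribute.
df = spark.read.csv("s3://my-ml-bucket/raw/", header=True, inferSchema=True)
df = (
    df.dropDuplicates()
      .na.fill({"age": 0})                          # fill missing values
      .withColumn("name", F.lower(F.trim("name")))  # standardise formatting
)

# Write the transformed data back to the repository as Parquet.
df.write.mode("overwrite").parquet("s3://my-ml-bucket/processed/")
```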

Domain 2: Exploratory Data Analysis

In the Exploratory Data Analysis domain the data is analysed so it can be understood and cleaned up. It comprises 24% of the exam marks. There are three sub-domains:

2.1 Prep and sanitise data
2.2 Feature engineering
2.3 Analyse and visualize data

2.1 Prep and sanitise data

Sanitize and prepare data for modeling

In Sanitize and prepare data for modeling, the data is cleaned up using techniques that remove distortions and fill in gaps:

  • missing data
  • corrupt data
  • stop words
  • formatting
  • normalizing
  • augmenting
  • scaling data

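A minimal pandas/scikit-learn sketch of a few of these clean-up techniques, assuming invented column names in a local copy of the data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")  # assumed local copy of the raw data

# Missing data: fill numeric gaps with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Corrupt data: drop rows with no label at all.
df = df.dropna(subset=["label"])

# Scaling: zero mean and unit variance for the numeric features.
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
```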

Data labeling is the process of identifying raw data and adding one or more meaningful and informative labels to provide context. (AWS)

Data labeling can be costly and time consuming because it involves applying the labels manually. AWS provides the Amazon Mechanical Turk service to reduce the cost and speed up the labelling process.

2.2 Feature engineering

Perform feature engineering

Feature Engineering is about creating new features from existing ones to make the Machine Learning algorithms more powerful. Feature Engineering techniques are used to reduce the number of features and categorise the data.

  • binning
  • tokenization
  • outliers
  • synthetic features
  • one-hot encoding
  • reducing dimensionality of data

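A small sketch of two of these techniques, binning and one-hot encoding, using pandas on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 37, 58, 71], "city": ["NY", "LA", "NY", "SF"]})

# Binning: collapse a continuous feature into a few categories.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                        labels=["young", "mid", "senior"])

# One-hot encoding: one binary column per category value.
df = pd.get_dummies(df, columns=["city", "age_band"])
print(df)
```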

2.3 Analyse and visualize data

Analyze and visualize data for machine learning

Analyzing and visualizing the data overlaps with the other two sub-domains, which use these techniques. Before data can be sanitized and prepared it has to be understood. This is done using statistics that focus on specific aspects of the data, and using graphs and charts that allow relationships and distributions to be seen:

  • scatter plot
  • histogram
  • box plot

Statistical measures and diagnostics include:

  • correlation
  • summary statistics
  • p value
  • elbow plot
  • cluster size
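A minimal matplotlib/pandas sketch of these techniques on toy data:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(df["x"], df["y"])  # relationship between two features
axes[1].hist(df["y"], bins=5)      # distribution of one feature
axes[2].boxplot(df["y"])           # spread and outliers
plt.show()

print(df.corr())      # correlation matrix
print(df.describe())  # summary statistics
```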
You now understand your data and have cleaned it up, ready for the next stage: modeling.

Domain 3: Modeling

When people talk about Machine Learning they are mostly thinking about Modeling. Modeling is selecting and testing the algorithms that process data to find the information of value. It comprises 36% of the exam marks. This domain has five sub-domains:

3.1 Frame the business problem
3.2 Select the appropriate model
3.3 Train the model
3.4 Tune the model
3.5 Evaluate the model

3.1 Frame the business problem

Frame business problems as machine learning problems

First, decide whether Machine Learning is appropriate for the problem at all. Machine Learning is good for data driven problems involving large amounts of data where the rules cannot easily be coded. The business problem can usually be framed in several ways, and the framing determines what kind of Machine Learning problem is being solved. For example, the business problem could be framed to require a yes/no answer, as in fraud detection, or a value, as in a share price prediction. This sub-domain also covers identifying the type of data available, and therefore whether the algorithm will use a supervised or unsupervised paradigm. From the type of problem to be solved, the required features of the algorithm can be identified: classification, regression, forecasting, clustering or recommendation.


3.2 Select the appropriate model

Select the appropriate model(s) for a given machine learning problem

Many models are available through the AWS Machine Learning services, with SageMaker alone having over seventeen built-in algorithms. Each model has its own use cases and requirements. Once a model has been chosen, an iterative process of training, tuning and evaluation is undertaken.

The exam guide lists only the SageMaker built-in algorithms XGBoost and K-means; since there are many built-in algorithms, perhaps these are just the most important. Modeling concepts are also listed, each paired here with related built-in algorithms:

  • linear regression — Linear Learner, K-Nearest Neighbors, Factorization Machines
  • logistic regression — XGBoost
  • decision trees — XGBoost
  • random forests — Random Cut Forest
  • RNN — DeepAR forecasting, Sequence to Sequence
  • CNN — Sequence to Sequence
  • ensemble learning — XGBoost
  • transfer learning — Image Classification
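As a hedged sketch of picking up a built-in algorithm, this is roughly how the SageMaker Python SDK launches XGBoost. The role ARN, bucket, paths and container version are invented and will differ in a real account:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role

# Resolve the built-in XGBoost container image for the current region.
image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/models/",  # invented bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Built-in XGBoost expects the content type to be declared for CSV input.
estimator.fit({"train": TrainingInput("s3://my-ml-bucket/processed/train.csv",
                                      content_type="text/csv")})
```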

3.3 Train the model

Train machine learning models

Model training is the process of providing a model with data to learn from. During training the data is split into three parts: most is used as training data, with the remainder used for validation and testing. Cross validation is a technique used when training data is limited. By understanding the internal workings of algorithms, model training can be optimised; the concepts involved include gradient descent, loss functions, local minima, convergence, batches, optimizers and probability.
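A minimal scikit-learn sketch of the three-way split and of cross validation, using generated toy data rather than an AWS API:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20% for testing, then 20% of the remainder for validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# With limited data, k-fold cross validation reuses it instead of a fixed split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_rest, y_rest, cv=5)
print(scores.mean())
```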

The speed and cost of training depend on choices about the compute resources used. The instance type determines the processing hardware available. Graphics Processing Units (GPUs) can provide more compute power, but not all algorithms can utilise them, and some run just as well on cheaper CPU instances. For heavy training loads, distributed processing options may be available to speed up training. Spark and non-Spark data processing can be used to pre-process training data.

Model training is also concerned with how and when models are updated and retrained.

3.4 Tune the model

Perform hyperparameter optimization

Model tuning is also known as hyperparameter optimisation. Machine Learning algorithms can be thought of as black boxes, with hyperparameters being the exposed controls that can be changed and optimised. Hyperparameter settings do not change during training. They can be tuned manually before training commences, by using search methods, or automatically by using SageMaker's guided search. Model tuning can be improved by using:

  • Regularization
  • Drop out
  • L1/L2
  • Model initialization

Models that utilise a neural network architecture use other hyperparameters:

  • layers / nodes
  • learning rate
  • activation functions

Tree-based models have hyperparameters that influence the number of trees and the number of levels. The learning rate is used to optimise linear models.
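SageMaker Automatic Model Tuning plays this role on AWS; as a self-contained, non-AWS illustration of the same idea, here is a scikit-learn grid search over L1/L2 regularization:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Search over regularization type (L1/L2) and strength C; the model never
# changes these during training, which is what makes them hyperparameters.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```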

3.5 Evaluate the model

Evaluate machine learning models

Model evaluation is used to find out how well a model will perform in predicting the desired outcome. This is done using metrics to measure the performance of the model. Metrics measure accuracy, precision and other qualities of the model by comparing its predictions with the known labels in held-out test data.

Metrics commonly used:

  • AUC-ROC
  • accuracy
  • precision
  • recall
  • RMSE
  • F1 score

A confusion matrix is used to tabulate predicted classes against actual classes, showing exactly where the model confuses one class with another.
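A minimal scikit-learn sketch of these metrics and of a confusion matrix, on invented predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                  # known labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                  # model's class predictions
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]  # model's scores, for AUC-ROC

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
```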

Evaluation can be performed offline or online, and A/B testing can be used to compare the performance of model variants. Metrics allow the detection of a poorly fitting model, caused by bias (underfitting) or variance (overfitting), where the model performs poorly on real world data.

Other metrics allow models and model variants to be compared using measures that are not directly related to the data:

  • time to train a model
  • quality of model
  • engineering costs
  • cross validation

Your model is now ready to be used with real data. But before it can be let loose on your corporate data it has to be deployed into the production environment.

Domain 4: Machine Learning Implementation and Operations

This domain is about the Systems Architecture and DevOps skills needed to make everything work in production. It comprises 20% of the exam marks. There are four sub-domains:

4.1 Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
4.2 Recommend and implement the appropriate machine learning services and features for a given problem.
4.3 Apply basic AWS security practices to machine learning solutions.
4.4 Deploy and operationalize machine learning solutions.

4.1 The production environment

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Designing AWS production environments for performance, availability, scalability, resiliency, and fault tolerance is part of AWS best practice. Resilience and availability are provided by deploying models in multiple AWS Regions and multiple Availability Zones. Auto Scaling groups and load balancing provide scalability for compute resources. Performance is optimised by rightsizing EC2 instances, volumes and provisioned IOPS. There are a variety of deployment options, including EC2, SageMaker managed EC2 via endpoints, and Docker containers. CloudTrail and CloudWatch are used for AWS environment logging and monitoring, which assists in creating fault tolerant systems and building error monitoring.
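As a hedged sketch of the scalability piece, SageMaker endpoint variants can be auto scaled through the Application Auto Scaling API; the endpoint and variant names here are invented:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register a SageMaker endpoint variant as a scalable target (names invented).
resource_id = "endpoint/my-endpoint/variant/AllTraffic"
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking on invocations per instance scales the fleet up and down.
autoscaling.put_scaling_policy(
    PolicyName="invocations-policy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```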

4.2 ML services and features

Recommend and implement the appropriate machine learning services and features for a given problem

AWS provides a range of services and features to choose from for a given Machine Learning problem. The AI services are highly optimised algorithms deployed on AWS managed infrastructure, and some contain pre-trained models ready for production inferencing. Some examples are:

  • Polly, text to speech
  • Lex, chatbot
  • Transcribe, speech to text

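A hedged example of calling one of these pre-trained services, here Polly via boto3; the text, voice and output file name are arbitrary:

```python
import boto3

polly = boto3.client("polly")

# A pre-trained model behind an API call: no training or servers to manage.
response = polly.synthesize_speech(
    Text="Hello from the AI services layer.",
    OutputFormat="mp3",
    VoiceId="Joanna",
)
with open("hello.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```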

When using the AI services, AWS does all the heavy lifting of managing infrastructure, models and training. There are other options if you need more control of these aspects: SageMaker built-in algorithms can be used, or you can bring your own model. This allows cost considerations to influence the choice of compute services. Even more sophisticated cost control can be achieved by using spot instances to train deep learning models with AWS Batch.

AWS service limits cap the amount of resources that can be used, for example the number of instances of a service in an account. Service limits can be increased by AWS on request, although some are hard limits that represent the maximum for that service in a single AWS account or Region.
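Current quotas can be inspected programmatically through the Service Quotas API; a small boto3 sketch, where the output depends entirely on the account and Region:

```python
import boto3

quotas = boto3.client("service-quotas")

# Page through the quotas that apply to SageMaker in this account and Region.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        print(quota["QuotaName"], quota["Value"])
```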

4.3 Security

Apply basic AWS security practices to machine learning solutions

Security in AWS starts with the ubiquitous IAM (Identity and Access Management), which controls the activities of all AWS services. Since S3 is the most common storage for Machine Learning services, S3 bucket policies are also included. It may seem that Amazon VPCs (Virtual Private Clouds) and VPC security groups are not needed if you are implementing serverless applications; however, under the hood SageMaker uses these services and their security has to be configured. As well as configuring security for the services, data security also has to be considered. This includes encryption of data both at rest and in transit. Anonymisation can be used to protect PII (personally identifiable information).
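As a small boto3 sketch of encryption at rest, default bucket encryption can be switched on for the training-data bucket (the bucket name is invented):

```python
import boto3

s3 = boto3.client("s3")

# Default encryption at rest for the (invented) training-data bucket,
# using the AWS managed KMS key.
s3.put_bucket_encryption(
    Bucket="my-ml-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)
```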

4.4 Deploy and operationalize

Deploy and operationalize machine learning solutions

There are many ways to deploy Machine Learning models in production; one method is to use SageMaker endpoints. Despite the name, a SageMaker endpoint is more than an isolated interface: it sits on top of serious processing power, provided by SageMaker managed EC2 instances that are set up by the endpoint configuration. SageMaker endpoints can host multiple variants of the same model, enabling variants to be compared using testing strategies such as A/B testing.
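A hedged boto3 sketch of an endpoint configuration hosting two weighted variants for an A/B test; all names are invented and the models must already exist in SageMaker:

```python
import boto3

sm = boto3.client("sagemaker")

# Two variants of the same model behind one endpoint; the weights split
# traffic 80/20 for a simple A/B test.
sm.create_endpoint_config(
    EndpointConfigName="churn-config",
    ProductionVariants=[
        {"VariantName": "variant-a", "ModelName": "churn-model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.8},
        {"VariantName": "variant-b", "ModelName": "churn-model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.2},
    ],
)
sm.create_endpoint(EndpointName="churn-endpoint", EndpointConfigName="churn-config")
```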

Once in production the model is monitored, because its performance may degrade over time as real world data changes. This drop in performance can be detected and used to trigger retraining of the model via a retraining pipeline.

Summary

The AWS Certified Machine Learning - Specialty exam guide is good for outlining the breadth of the exam and how it is divided into four domains and fifteen sub-domains. While it lists and mentions many subjects, only a few are described in any detail, and even those are a little light. This article provides additional description of the subjects so that someone considering studying for the exam can understand what has to be learnt to achieve exam success.
