MLOps journey with AWS - part 2 (Visibility is job zero)
almamon rasool abdali
Posted on January 3, 2022
Welcome again.
In the previous article, we got a general overview of MLOps.
Today we will cover the next step in our MLOps implementation.
The first thing we need is visibility. Some of you may think that visibility (monitoring) belongs at the end, after deployment.
But first, what do I mean by visibility?
It is monitoring, tracking, and collaboration across the team, plus insight into the journey of the data, code, and models from the beginning to the end of the pipeline.
So we need continuous visibility over the following things:
- visibility over code
- visibility over data
- visibility over the model training process and all ongoing experiments
- visibility over inference and feedback
- visibility over activities
Now, let's go through the visibility list one by one.
1. visibility over code changes
For ordinary software developers this is not an issue, but for someone managing a team of data scientists and ML researchers it can be a headache.
In such projects the team mostly works in notebooks, and bad coding habits tend to develop that hurt version control, code change tracking, CI/CD, and many other things.
Many tools try to solve these problems, but to me the notebook itself is not the problem; the bad coding habits of the team are.
All of the above can be solved if you require the team to write good code that fulfills at least three main points: modularity, high cohesion, and loose coupling.
So basically, we use notebooks only for importing and calling classes and methods.
Also, separate the scripts by the nature of their work: for example, the pre-processing script has to be fully functional without the training code, and vice versa.
To make the work more scalable and portable, we also need to containerize each script.
But what if the environment you use helps you and the team do all of the above?
The best-practice way of using SageMaker to run our scripts requires you to separate each phase into its own script file.
Each phase is also containerized and run separately. The notebook in SageMaker is used for calling functions, while the heavy code lives inside the scripts shipped in each stage's container.
Let's take an example to get into the SageMaker mindset, starting by shipping a preprocessing script inside a prebuilt AWS container for scikit-learn.
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# get the region and execution role
role = get_execution_role()
region = boto3.session.Session().region_name

# set the machine type and number of machines
sk_proc = SKLearnProcessor(
    framework_version="0.20.0", role=role, instance_type="ml.m5.xlarge", instance_count=2
)

# input_data is assumed to hold the S3 URI of the raw dataset (not defined in this snippet)
# SageMaker copies the data from that S3 location to /opt/ml/processing/input
# your script reads the data from /opt/ml/processing/input
# SageMaker then expects the script to write the preprocessed data
# to /opt/ml/processing/train and /opt/ml/processing/test
# we also pass a command-line argument, --train-test-split-ratio, to control the split ratio

# run the processing job
sk_proc.run(
    code="preproc.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

# get information about the job we just ran
preproc_job_info = sk_proc.jobs[-1].describe()

# read the output configuration to get the final S3 URI for train and test
out_cfg = preproc_job_info["ProcessingOutputConfig"]
for output in out_cfg["Outputs"]:
    if output["OutputName"] == "train_data":
        train_preprco_s3 = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        test_preprco_s3 = output["S3Output"]["S3Uri"]
As you can see, we only provide our script (a script is easier to track than a notebook) and SageMaker ships it in a container for us.
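For reference, here is a minimal sketch of what a `preproc.py` compatible with the call above might look like. It is not the script from the original project: the input file name `data.csv`, the output file names, and the helper names `split_rows`, `write_csv`, and `main` are assumptions for illustration, and it uses only the standard library where a real script would more likely use pandas and scikit-learn. Only the `/opt/ml/processing/...` directories and the `--train-test-split-ratio` argument come from the example above.

```python
import argparse
import csv
import random
from pathlib import Path


def split_rows(rows, test_ratio, seed=42):
    """Shuffle the rows and split them into (train, test) by test_ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, header, rows):
    """Write a header plus rows to path, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def main(argv=None, base_dir="/opt/ml/processing"):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    args = parser.parse_args(argv)

    base = Path(base_dir)
    # Assumption: the raw dataset is a single CSV named data.csv.
    # SageMaker only fixes the directory, not the file name.
    with (base / "input" / "data.csv").open(newline="") as f:
        header, *rows = list(csv.reader(f))

    train, test = split_rows(rows, args.train_test_split_ratio)
    # Anything written under these folders is uploaded to S3 by SageMaker.
    write_csv(base / "train" / "train.csv", header, train)
    write_csv(base / "test" / "test.csv", header, test)
```

Inside the container the script would end with a call to `main()` under the usual `if __name__ == "__main__":` guard. Keeping `base_dir` as a parameter means the same function can be exercised locally against a temporary folder, which fits the goal of keeping the preprocessing step independent from the training code.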
If we want to train a model on the processed data, that has to happen in a different container.
Let's see an example for training.
from sagemaker.sklearn.estimator import SKLearn

# send our script to the scikit-learn container provided by AWS
sklearn_model = SKLearn(
    entry_point="train.py",
    framework_version="0.20.0",
    instance_type="ml.m5.xlarge",
    role=role,
)

# SageMaker copies the data from S3 to /opt/ml/input/data/train for you
# your script must write the final model to /opt/ml/model so SageMaker can copy it to S3
sklearn_model.fit({"train": train_preprco_s3})

# get the job info
training_job_info = sklearn_model.jobs[-1].describe()

# build the S3 URI of the final model artifact
model_data_s3_uri = "{}{}/{}".format(
    training_job_info["OutputDataConfig"]["S3OutputPath"],
    training_job_info["TrainingJobName"],
    "output/model.tar.gz",
)
Once the work is organized as above, the code can be part of any normal CI/CD pipeline, and the team can work together and collaborate following any normal software lifecycle.
Let's move on to the next section, data visibility.
2. visibility over data
Here I want to cover three things:
- collaborating on features created by team members
- versioning the data or features
- monitoring data quality and detecting drift
We solve the first two with a feature store (Amazon SageMaker Feature Store).
We solve the third by monitoring statistical information about the data, and here we will use Amazon SageMaker Model Monitor (Monitor Data Quality).
So let's explore them one by one.
feature store
Say you work in a team and you have finished preprocessing the data and have the features ready for modeling. Now you may ask how to share the features with the team, how to reuse them across different projects, and how to make them fast to reach and fast to query without redoing the work.
A feature store helps you create, share, and manage features. It works as a single source of truth to store, retrieve, remove, track, share, discover, and control access to features.
Before we start working with SageMaker Feature Store we need to understand a few concepts:
Feature group – the main Feature Store resource; it contains the metadata for all the data stored in Amazon SageMaker Feature Store.
Feature definition – the schema definition for that data, for example a feature named price is a float and a feature named age is an integer.
Record identifier name – each feature group is defined with a record identifier name. It must refer to the name of one of the features in the feature group's feature definitions.
Record – a collection of feature values for a single record identifier value. The combination of record identifier value and event time uniquely identifies a record within a feature group.
Event time – the point in time when a new event occurs that corresponds to the creation or update of a record in a feature group.
Online store – the low-latency, high-availability cache for a feature group that enables real-time lookup of records.
Offline store – stores historical data in your S3 bucket. It is used when low (sub-second) latency reads are not needed.
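To make these concepts concrete, here is a small in-memory toy model. It is not the SageMaker Feature Store API: the class `ToyFeatureGroup` and its methods are hypothetical names invented for this sketch. It only illustrates how a record identifier and an event time interact, with the online store keeping the latest record per identifier and the offline store keeping the full history.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureGroup:
    """Toy stand-in for a feature group (illustration only, not the AWS API)."""

    record_identifier_name: str
    event_time_name: str
    online: dict = field(default_factory=dict)   # latest record per identifier
    offline: list = field(default_factory=list)  # every record ever written

    def put_record(self, record):
        key = record[self.record_identifier_name]
        # The offline store is append-only history.
        self.offline.append(dict(record))
        # The online store is only updated by records with a newer event time.
        latest = self.online.get(key)
        if latest is None or record[self.event_time_name] > latest[self.event_time_name]:
            self.online[key] = dict(record)

    def get_record(self, key):
        """Low-latency lookup of the latest record for one identifier."""
        return self.online.get(key)


customers = ToyFeatureGroup(record_identifier_name="customer_id", event_time_name="event_time")
customers.put_record({"customer_id": 1, "event_time": 100, "age": 30})
customers.put_record({"customer_id": 1, "event_time": 200, "age": 31})
# A late-arriving historical record: kept offline, ignored online.
customers.put_record({"customer_id": 1, "event_time": 150, "age": 99})
print(customers.get_record(1))
print(len(customers.offline))
```

The split is the design point: training jobs read the complete history from the offline side, while inference reads only the current value from the online side.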
Now let's see how to work with feature stores in AWS.
This video shows the main idea of using the feature store after preprocessing with SageMaker Data Wrangler: the data flows from raw data, through analysis and preprocessing in Data Wrangler, to a feature store created from the data flow pipeline.
Now let's see how we can deal with data drift.
But first, let's understand what drift is.
Let's first ask ourselves a logical question: if the deployed model is static, with all its code and artifacts, what makes things break, and why does model accuracy degrade over time?
In any system the input always needs to be checked and validated, and in ML the input must be checked both for drift and for security issues.
So what can happen to the data that stops things from working as they should?
Drift in general happens when the data changes after deployment. Examples are a change in clothing trends and fashion that affects your clothes recommender system, changes in a country's economy and salaries that affect house price ranges, or a CCTV system where some cameras have a problem and send a damaged stream, or where a new type of camera arrives with a different video format or a different output range.
To be more precise, we have:
Concept drift is a type of model drift where the relationship or mapping between x and y changes. For example, in an ML-based WAF, new attacks emerge that the previously learned patterns can no longer detect, so what the model knows as an attack has changed.
Data drift is a type of drift where the data distribution changes while the relationship between x and y stays valid, for example through natural changes in temperature, new clothing trends, or changes in customer preferences.
Upstream data changes refer to changes in the data pipeline, such as a CCTV system where some cameras have a problem and send a damaged stream.
So how do we detect these drifts?
Note that not all drifts can be detected automatically, and many need a human in the loop.
Generally, though, it is all about capturing the decay in model performance whenever we can.
So, if possible, we compare the model's accuracy against some ground truth.
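As a minimal sketch of that comparison, assuming ground-truth labels eventually arrive for a batch of live predictions: the function names and the `tolerance` of 0.05 are assumptions chosen for illustration, not values from any AWS service.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def has_decayed(baseline_accuracy, y_true, y_pred, tolerance=0.05):
    """Return (live_accuracy, decayed) for one batch of labelled live traffic.

    decayed is True when live accuracy falls more than `tolerance`
    below the accuracy measured at training time.
    """
    live = accuracy(y_true, y_pred)
    return live, (baseline_accuracy - live) > tolerance


# Hypothetical batch: labels collected after the fact vs. what the model predicted.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predictions = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
live_accuracy, decayed = has_decayed(0.9, labels, predictions)
print(live_accuracy, decayed)
```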
For tasks where ground truth is not available, there are other common methods to check for drift.
Kolmogorov-Smirnov method: we compare the cumulative distributions of two datasets. The test statistic is the largest gap between the two empirical cumulative distributions; if that gap is too large for both datasets to plausibly come from the same distribution, we have data drift.
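Here is a small standard-library sketch of the two-sample Kolmogorov-Smirnov check. The function names are made up for this example, and the decision threshold uses the usual large-sample approximation for the critical value, so treat it as an illustration. In practice a library routine such as `scipy.stats.ks_2samp` would be the more common choice.

```python
import math
from bisect import bisect_right


def ks_statistic(baseline, current):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of baseline values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of current values <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_drift(baseline, current, alpha=0.05):
    """Return (statistic, drifted) using the large-sample critical value."""
    n, m = len(baseline), len(current)
    d = ks_statistic(baseline, current)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return d, d > critical


baseline_feature = list(range(100))
shifted_feature = [x + 50 for x in range(100)]
print(ks_drift(baseline_feature, baseline_feature))
print(ks_drift(baseline_feature, shifted_feature))
```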
For more, refer to https://www.sciencedirect.com/topics/engineering/kolmogorov-smirnov
Population stability index (PSI): it measures how much a variable has shifted over time.
As a rule of thumb:
PSI < 0.10 means "little change".
0.10 ≤ PSI < 0.25 means "moderate change".
PSI ≥ 0.25 means "significant change, action required".
For more, refer to https://www.risk.net/journal-of-risk-model-validation/7725371/statistical-properties-of-the-population-stability-index
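A minimal sketch of a PSI calculation follows, using the common definition PSI = Σ (actual_i − expected_i) · ln(actual_i / expected_i) over bins. The equal-width binning on the baseline range, the number of bins, and the small `eps` used to avoid dividing by zero are assumptions of this sketch; other binning schemes (for example quantile bins) are also widely used.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def interpret_psi(value):
    """Map a PSI value to the rule-of-thumb categories listed above."""
    if value < 0.10:
        return "little change"
    if value < 0.25:
        return "moderate change"
    return "significant change, action required"


training_values = list(range(100))
live_values = [x + 50 for x in range(100)]
print(interpret_psi(psi(training_values, training_values)))
print(interpret_psi(psi(training_values, live_values)))
```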
Now let's get back to SageMaker Model Monitor and how it can help us here.
It can help us monitor drift in data quality, monitor drift in model quality metrics, monitor bias in the model's predictions, and monitor drift in feature attribution.
Let's take data quality as an example.
The idea is that we create a baseline from our data, and SageMaker compares new data against it using a set of rules that help detect drift.
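To illustrate the baseline idea in isolation, here is a conceptual sketch for a single numeric column. It does not reproduce the statistics or rules that Model Monitor actually evaluates; the function names, the two checks, and both thresholds are assumptions chosen to show the pattern of "compute a baseline once, then compare every new batch against it".

```python
import statistics


def build_baseline(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "completeness": len(present) / len(values),
        "mean": statistics.fmean(present),
        "std": statistics.pstdev(present),
    }


def check_against_baseline(baseline, values, max_mean_shift_in_std=3.0,
                           max_completeness_drop=0.1):
    """Return the names of the checks that the new batch violates."""
    current = build_baseline(values)
    violations = []
    if baseline["completeness"] - current["completeness"] > max_completeness_drop:
        violations.append("completeness")
    std = baseline["std"] or 1.0  # avoid dividing by zero for a constant column
    if abs(current["mean"] - baseline["mean"]) / std > max_mean_shift_in_std:
        violations.append("mean_shift")
    return violations


baseline_stats = build_baseline([10, 11, 9, 10, 12, 8, 10, 11, 9, 10])
print(check_against_baseline(baseline_stats, [10, None, None, 11, None, 9, 10, None]))
print(check_against_baseline(baseline_stats, [20, 21, 19, 20]))
```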
The steps are as follows.
First, you must enable data capture for your model when you deploy it for inference.
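from sagemaker.model_monitor import DataCaptureConfig

# set the configuration
# s3_capture_path is assumed to hold the S3 URI where captured requests are stored
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=s3_capture_path,
)

# add the config to your model deployment
# model is assumed to be an existing SageMaker model object;
# endpoint names cannot contain spaces, so use a hyphenated placeholder
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    endpoint_name='endpoint-name',
    data_capture_config=capture_config,
)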
Next, we must create a baseline from the main data. This gives us baseline statistics, so we can tell when new data deviates from them.
Here is an example of creating the baseline:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

data_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

data_monitor.suggest_baseline(
    baseline_dataset=baseline_maindata_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_result,
    wait=True,
)
For more, please check out https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html
We have now reached the end of this part. The next part will cover the remaining items in the visibility section. See you next time.
Note:
Some of you will say that, according to AWS, "security is job zero". Of course it is, but following the principle of least privilege, our responsibility as "MLOps using AWS" is to secure code, models, and data.
And as you can see, visibility is the enabling tool that lets us do that security work over:
1- code security checks: by enabling a code sharing and tracking methodology via visibility over code.
2- pre-checks for ML and data attacks and security (done via visibility over training plus visibility over data). This happens before we go live: when we have visibility over model training, we can attack the model while we train it.
You can't secure a model without training it, because before training you don't have a model to attack, and before getting data you can't build a model to try to secure.
Refer to these links to learn more about the attack layers on ML models:
https://venturebeat.com/2021/05/29/adversarial-attacks-in-machine-learning-what-they-are-and-how-to-stop-them/
https://openai.com/blog/adversarial-example-research/
https://persagen.com/files/misc/goodfellow2017attacking.pdf
https://arxiv.org/abs/1705.00564
https://ieeexplore.ieee.org/abstract/document/9089095
https://ieeexplore.ieee.org/abstract/document/6868201
3- post-checks and during-checks (visibility over inference, visibility over activities, and also data visibility).
Finally, MLOps work must be integrated with the company's other teams. It does not replace them; we are here to integrate with their work.