Data Science and the Cloud
Michellebuchiokonicha
Posted on June 18, 2024
There are many good reasons to move data science projects to the cloud. Clusters of many machines give us enormous computing power that a single workstation cannot match.
Also, data science is nothing without data. In a distributed cloud system, our data can be stored and processed reliably, even when the datasets are enormous and constantly changing. There are drawbacks too:
Data science projects in the cloud tend to be more complex, especially when collaborating in large teams and using many services and technologies. We can, however, choose from a whole range of services supporting our aims in the cloud, ranging from fully managed services to services giving us full control and flexibility over the environment.
Provider-Independent Services and Tools
MLflow
MLflow is a framework you will find supported in most cloud data services, independent of the cloud provider. It consists of four components that streamline our ML projects and help us conduct them systematically and collaboratively.
It lets us systematically track experiment runs and package projects easily using Conda and Docker.
Tracking: used during model training to log and track model parameters and KPIs like model performance metrics, experiment artifacts, code versions, etc.
Projects: can be used to package ML model training and trigger it remotely with varying parameters.
Models: a standard format for packaging a trained model; once we are satisfied with its performance, we can register it with MLflow.
Model Registry: a centralized repository for trained models, from which we can easily deploy them to a production environment.
Here is an applied example of using an MLflow Project to share our ML development easily and trigger runs with specific parameters over the network.
Say someone has written a Python script to train a machine-learning model. The resulting model looks very promising, but you have a feeling that tweaking a parameter might make it perform even better. An MLflow Project lets you do this without rewriting the code.
For example, the GitHub repository you use might contain:
- the training data
- a conda environment file
- a README file
- the training Python script
- a license
The training script imports the required modules and defines the evaluation metrics.
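To make this concrete, here is a minimal sketch of triggering a run of such a project with a tweaked parameter via the MLflow Python API (the repository URL and the `alpha` parameter are hypothetical placeholders):

```python
import mlflow

# Trigger a run of the shared training project, overriding one
# hyperparameter without touching the original code.
submitted_run = mlflow.projects.run(
    uri="https://github.com/<user>/<ml-training-repo>",  # placeholder repository
    parameters={"alpha": 0.3},  # hypothetical parameter exposed by the MLproject file
)
print(submitted_run.run_id, submitted_run.get_status())
```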
A typical MLflow workflow:
1. Start the MLflow server.
2. Conduct and track runs: train models and log their parameters and model metrics.
3. Evaluate the results: use the graphical user interface or the SDK to query the best-performing models programmatically. Once satisfied with a model, we register it.
4. Register the model in the model registry.
5. Deploy the model: we deploy a registered model for inference to a production system, typically behind a RESTful API that we can call to obtain predictions.
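As a minimal sketch of steps 2-4 (the experiment name, parameter, and registered model name are made up for illustration), a training run could be tracked and registered like this:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

mlflow.set_tracking_uri("http://localhost:5000")   # the MLflow server started in step 1
mlflow.set_experiment("demo-experiment")           # hypothetical experiment name

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=200).fit(X, y)

    # Log the parameter and a performance metric so runs can be compared later.
    mlflow.log_param("C", C)
    mlflow.log_metric("train_accuracy", accuracy_score(y, model.predict(X)))

    # Log the model and register it in the model registry in one step.
    mlflow.sklearn.log_model(model, "model", registered_model_name="demo-classifier")
```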
Databricks: Processing massive amounts of data requires parallel computation on several cluster nodes. Spark is a well-known and widely used solution for data processing in data-intensive projects. It is open source, but the downside lies in its maintenance and operation overhead: it might not be that simple to set up a Spark cluster and keep it running on custom machines.
If we lack the resources to do this, we can use Databricks.
Databricks is an overarching technology that many cloud service providers (CSPs) can host. The company was founded in 2013 by some of the creators of Spark, so it is not surprising that its core offering is managed Spark clusters. Databricks is active in open source and hosts the Data + AI Summit yearly. At its core, it offers managed Spark clusters that can be hosted on Azure, AWS, GCP, etc.
The billing model depends on the cloud platform of choice, but in general usage is measured in Databricks Units (DBUs), calculated from the machines, processing, and storage used. We can also use the Community Edition at no cost. The managed Spark clusters are the core of Databricks, but additional features are designed for data-intensive projects.
Additional features are:
Lakehouse: describes the unified data storage that comes with Databricks, designed for both structured and unstructured data, so it can be considered a mixture of a data lake and a data warehouse. It stores the data as files in a Delta Lake.
Delta Lake: the data storage layer used in Databricks. It is a distributed data store, but thanks to an elaborate metadata mechanism, Delta Lake is ACID-compliant.
Delta Live Tables (DLT): automatically propagates updates to the underlying data and provides a GUI to design and manage DLT pipelines in a straightforward way. We can also set up data quality checks in DLT that are conducted automatically on a regular basis.
Delta Engine: A query optimizer that automatically and periodically optimizes data queries depending on the data access pattern.
Unity Catalog: with Unity Catalog, Databricks offers an easy-to-use tool for data governance. We can use the Unity Catalog graphical user interface or SQL queries to perform data governance tasks; for example, we can set up role-based access control for a specific dataset, as sketched below.
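As a small illustration (the catalog, schema, table, and group names are hypothetical), such a governance task could be run as SQL from a Databricks notebook:

```python
# Grant read access on one table to an analyst group via Unity Catalog SQL.
spark.sql("GRANT SELECT ON TABLE main.sales.gold_orders TO `analysts`")
```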
Data in Delta Lake is commonly structured by processing maturity into bronze, silver, and gold tables.
Bronze tables: contain raw data as it was loaded from external sources.
Silver tables: contain processed data, including established joins.
Gold tables: contain completely pre-processed data ready for specific use cases like machine learning.
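A minimal PySpark sketch of this bronze/silver/gold layering in a Databricks notebook might look as follows (paths, table names, and columns are hypothetical; `spark` is the notebook's SparkSession):

```python
# Bronze: raw data exactly as loaded from the external source.
raw = spark.read.json("/mnt/raw/orders/")                    # hypothetical source path
raw.write.format("delta").mode("overwrite").saveAsTable("bronze_orders")

# Silver: cleaned and joined data.
orders = spark.table("bronze_orders").dropDuplicates(["order_id"])
customers = spark.table("bronze_customers")                  # hypothetical second table
silver = orders.join(customers, "customer_id")
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: fully pre-processed data, ready for a specific use case.
gold = silver.groupBy("customer_id").count().withColumnRenamed("count", "order_count")
gold.write.format("delta").mode("overwrite").saveAsTable("gold_customer_orders")
```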
We can use Python, R, SQL, and Scala to develop the ML and data-processing logic in scripts and notebooks, or use remote compute targets.
Analysis components
Databricks SQL: similar to Spark SQL, it lets us use the SQL language to query our data.
Databricks Data Science and Engineering environment: similar to MLlib in Spark, it comes with a prepared environment for typical data science and data engineering use cases.
Databricks Machine Learning environment: uses MLflow to track, version, and deploy machine learning models.
Programming languages: Python, R, SQL, and Scala, as in Spark.
AutoML: automatically performs feature selection, algorithm choice, and hyperparameter tuning.
Google Data Science and ML services
Some services give us maximum control and flexibility over the environment, leaving us with responsibilities such as security patches and environment maintenance. For example, we can run our data science projects on Compute Engine, Google's cloud-based virtual machine service.
Using containerization instead of plain VMs, we can develop a containerized application, e.g. for model training or inference, on a local machine and then deploy that container to:
Compute Engine: the cloud-based virtual machines on GCP.
Kubernetes Engine: the cloud-based container orchestration service on GCP.
Deep Learning VM: a specialized VM with GPU support and pre-installed libraries that are typically used in data science projects.
All cloud providers offer specialized ML services that give us more comfort and handle some tedious maintenance tasks for us, but also take away a little control and flexibility.
On the Google platform, the specialized ML offering is called Cloud AI, with Vertex AI as its main component. It includes:
- Dataflow
- Composer
- Dataproc
- BigQuery
- Google Cloud AI for ML training etc
- Google Cloud console
- Vertex AI workbench
- Vertex AI (formerly Datalab)
- AutoML
- API endpoints
- Visual reporting
Ready-to-use services on GCP: these are pre-trained machine learning models hosted on GCP that can be used by calling standardized RESTful APIs to obtain predictions. For example, we can send a sound file to the Speech-to-Text API and receive the transcribed text (see the sketch after the list below).
- Natural language AI
- Teachable machine
- Dialogflow
- Translations
- Speech-to-text
- Text-to-speech
- Timeseries insights API
- Vision AI
- Video AI
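As a minimal sketch of calling one of these pre-trained services, the Python client for the Speech-to-Text API could be used like this (the file name, encoding, and language are placeholder assumptions):

```python
from google.cloud import speech  # pip install google-cloud-speech

client = speech.SpeechClient()

# Read a short local audio file and send it to the Speech-to-Text API.
with open("sample.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```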
Data science and ML services on AWS
As on GCP, there are services that maximize control and others geared toward more comfort and instant application. Examples:
- Cloud-based VMs: Elastic Compute Cloud (EC2)
- Container deployment: Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS)
- Managed big data processing: Elastic MapReduce (EMR)
Specialized ML Service
It is called SageMaker.
SageMaker is a service family, an entire collection of sub-services dedicated to supporting typical DS and ML projects. For example, there are graphical user interfaces, auxiliary services for data wrangling and data labeling, prepared scripts and templates, and also AutoML features (a minimal training sketch follows the list of sub-services below). The sub-services include:
- Notebook instances
- Data labeling
- Data wrangler
- Feature Store
- Clarify
- Pipelines
- Studio
- Jumpstart
- Canvas
- Autopilot
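As a minimal, hedged sketch of launching a managed training job with the SageMaker Python SDK (the role ARN, S3 path, and `train.py` script are hypothetical placeholders):

```python
from sagemaker.sklearn.estimator import SKLearn

# Hypothetical execution role and training script supplied by the user.
role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"

estimator = SKLearn(
    entry_point="train.py",        # your training script
    framework_version="1.2-1",     # scikit-learn container version
    instance_type="ml.m5.large",
    role=role,
)

# Launch the training job on managed SageMaker infrastructure.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```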
Ready-to-use services on AWS:
Translate, Transcribe, computer vision, and other services are pre-trained ML models callable via RESTful APIs for inference (a minimal example of calling one of them follows the list below). These services include:
- Comprehend
- Rekognition
- Lookout for vision, panorama
- Textract
- A2I
- Personalize
- Translate, Transcribe
- Polly, Lex
- Forecast
- Fraud Detector
- Lookout for Metrics
- Kendra
Auxiliary services are DevOps Guru and CodeGuru.
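For example, here is a minimal sketch of calling Amazon Translate through boto3 (the region is an arbitrary choice, and AWS credentials are assumed to be configured locally):

```python
import boto3

translate = boto3.client("translate", region_name="eu-west-1")

# Translate a short English sentence into German via the pre-trained service.
result = translate.translate_text(
    Text="Data science in the cloud",
    SourceLanguageCode="en",
    TargetLanguageCode="de",
)
print(result["TranslatedText"])
```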
Data science and ML services on Microsoft Azure.
VM: For maximum control and flexibility, we can use Azure VMs to decide how to create and manage our development environment.
ACI/AKS: we can use Azure Container Instances (ACI) for containerized model training for testing purposes, then use Azure Kubernetes Service (AKS) for production settings.
HDInsight: a managed service for open-source analytics frameworks such as Hadoop and Spark.
Databricks: managed Databricks workspaces hosted on Azure (see the Databricks section above).
Synapse Analytics: Azure's analytics service combining data warehousing and big data processing.
Specialized ML service.
This is called Azure Machine Learning. Similar to AWS SageMaker, Azure ML comprises several sub-services, for example the GUI called Designer and auxiliary services for data labeling, among other features.
- Azure machine learning
- Studio
- Workspace
- Notebooks/RStudio
- Data labelling
- Designer
Ready-to-use services on Azure.
These give us maximum comfort: pre-trained models callable via RESTful APIs. They are called Cognitive Services and include the following (a minimal example of calling one of them follows the list):
- Computer vision, Face
- Azure cognitive service for language
- Language understanding models
- Translators and other services
- QnA Maker, Translator
- Speech Service
- Anomaly Detector
- Content Moderator
- Personalizer
- Cognitive Services for Big Data
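As a minimal sketch of calling one of these Cognitive Services with the Python SDK (the endpoint and key are placeholders for your own Language resource):

```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key of a Cognitive Services / Language resource.
client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

docs = ["The cloud makes collaborating on data science projects much easier."]
for result in client.analyze_sentiment(docs):
    print(result.sentiment, result.confidence_scores)
```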
Note: This is a four-part series on cloud computing, virtualization, containerization, and data processing.
This is the fourth article; you can find the other three on my blog.
Here is the link to the third, which focuses on cloud platforms:
https://dev.to/michellebuchiokonicha/cloud-computing-platforms-4667
Follow me on Twitter Handle: https://twitter.com/mchelleOkonicha
Follow me on LinkedIn Handle: https://www.linkedin.com/in/buchi-michelle-okonicha-0a3b2b194/
Follow me on Instagram: https://www.instagram.com/michelle_okonicha/