Checklist for designing cloud-native applications – Part 1: Introduction

eyalestrin

Eyal Estrin

Posted on March 4, 2024

This post was originally published by the Cloud Security Alliance.

In the past, organizations building legacy applications aligned the infrastructure and application layers with business requirements, reviewing hardware requirements and limitations, team knowledge, security, legal considerations, and more.

In this series of blog posts, we will review considerations when building today's cloud-native applications.

Readers can use the information shared in this series as a checklist to embed in a design document.

Introduction

Building a new application requires a thorough design process.

It is OK to try, fail, and fix mistakes during the process, but you still need to design.

Because technology keeps evolving, new services are released constantly, and many organizations now use multiple cloud providers, it is crucial to avoid biased decisions.

During the design phase, avoid locking yourself into a specific cloud provider. Instead, fully understand the requirements and constraints, and only then select the technology and services you will use to architect your application’s workload.

Business Requirements

The first thing we need to understand is the business goal: what is the business trying to achieve?

Business requirements will impact architectural decisions.

Below are some of the common business requirements:

  • Service availability – If an application needs to be available for customers around the globe, design a multi-region architecture.
  • Data sovereignty – If there is a regulatory requirement to store customers' data in a specific country, make sure it is possible to deploy all infrastructure components in a cloud region located in that country. Example of data sovereignty service: AWS Digital Sovereignty.
  • Response time – If the business requirement is to respond quickly to customer requests, consider the use of APIs or caching mechanisms.
  • Scalability – If the business requirement is to provide customers with highly scalable applications that can handle unpredictable loads, consider an event-driven architecture (such as the use of message queues, streaming services, and more).
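
As a rough illustration of the caching idea in the response-time item above, the sketch below keeps recently loaded values in memory for a limited time so that repeated requests skip the slow back-end call. The names `TTLCache`, `get_or_load`, and `loader` are hypothetical and not part of any cloud SDK; in a real workload a managed caching service would usually play this role.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache with a time-to-live per entry (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        """Return a fresh cached value, or call the slow loader and cache it."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = loader(key)
        self._entries[key] = (now, value)
        return value
```

The time-to-live is the design choice that matters here: a longer value reduces load on the back end but lets customers see older data.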

Compute Considerations

Compute may be the most important part of any modern application, and today there are many alternatives for running the front-end and business logic of our applications:

  • Virtual Machines – The same model we used to run legacy applications on-premises, which is also suitable for running applications in the cloud. In most cases, use VMs if you are migrating an application from on-premises to the cloud. Example of service: Amazon EC2.
  • Containers and Kubernetes – Most modern applications are wrapped inside containers, and very often are scheduled using the Kubernetes orchestrator. Migrating container-based workloads between cloud providers is considered moderately challenging (you still need to take into consideration the integration with other managed services in the CSP's ecosystem). Example of Kubernetes service: Amazon EKS.
  • Serverless / Functions-as-a-Service – A modern way to run various parts of applications. The underlying infrastructure is fully managed by the cloud provider (no need to deal with scaling or maintenance of the infrastructure). Considered a vendor lock-in risk, since migrating between CSPs is difficult due to the unique characteristics of each CSP's offering. Example of FaaS: AWS Lambda.
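
To make the FaaS item above more concrete, here is a minimal sketch of a function in the shape AWS Lambda uses for Python handlers (`handler(event, context)`). The event and response layout (a JSON string under `body`, a `statusCode` in the reply) assumes an HTTP-style trigger, and `calculate_order_total` is a made-up piece of business logic. The thin adapter is the provider-specific part; the plain function underneath it is not.

```python
import json
from typing import Any, Dict, List


def calculate_order_total(items: List[Dict[str, float]]) -> float:
    """Provider-neutral business logic (hypothetical example domain)."""
    return round(sum(item["price"] * item["quantity"] for item in items), 2)


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Thin adapter: parse the provider's event, call the logic, shape the reply."""
    body = json.loads(event.get("body") or "{}")
    total = calculate_order_total(body.get("items", []))
    return {"statusCode": 200, "body": json.dumps({"total": total})}
```

Keeping the adapter this small means that most of the code does not depend on one provider's event format.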

Data Store Considerations

Most applications require a persistent data store for storing and retrieving data.

Cloud-native applications (and specifically microservice-based architectures) allow you to select the most suitable back-end data store for your application.

In a microservice-based architecture, you can select a different data store for each microservice.

Alternatives for persistent data can be:

  • Object storage – The most common managed storage service that most cloud applications use to store data (such as logs, archives, data lakes, and more). Example of object storage service: Amazon S3.
  • File storage – Most CSPs support managed NFS services (for Unix workloads) or SMB/CIFS (for Windows workloads). Example of file storage service: Amazon EFS.

When designing an architecture, consider your application requirements such as:

  • Fast data retrieval requirements – Requirements for fast read/write (measured in IOPS)
  • File sharing requirements – Ability to connect to the storage from multiple sources
  • Data access pattern – Some workloads require constant access to the storage, while others connect to the storage only occasionally (such as a file archive)
  • Data replication – Ability to replicate data over multiple AZs or even multiple regions

Database Considerations

Most applications have at least one back-end database for storing and retrieving data.

When designing an application, understand the application requirements to select the most suitable database:

  • Relational database – Database for structured data stored in tables, rows, and columns. Suitable for complex queries. When selecting a relational database, prefer a managed database that supports open-source engines such as MySQL or PostgreSQL over a commercially licensed database engine (to decrease the chance of vendor lock-in). Example of relational database service: Amazon RDS.
  • Key-value database – Database for structured or unstructured data, suitable when you need to store large amounts of data with fast access time. Example of key-value database: Amazon DynamoDB.
  • In-memory database – Database optimized for sub-millisecond data access, such as a caching layer. Example of in-memory database: Amazon ElastiCache.
  • Document database – Database suitable for storing JSON documents. Example of document database: Amazon DocumentDB.
  • Graph database – Database optimized for storing and navigating relationships between entities (such as a recommendation engine). Example of graph database: Amazon Neptune.
  • Time-series database – Database optimized for storing and querying data that changes over time (such as application metrics, data from IoT devices, etc.). Example of time-series database: Amazon Timestream.
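
The list above can be captured as a small lookup so that each microservice's data need is recorded next to the database category chosen for it. This is only a sketch: the need descriptions paraphrase the bullets, and `categories_for_services` and the service names in the usage example are hypothetical.

```python
from typing import Dict

# Data need -> database category, paraphrased from the list above.
CATEGORY_BY_NEED: Dict[str, str] = {
    "structured data, complex queries": "relational",
    "large volumes, fast key lookups": "key-value",
    "sub-millisecond access or caching": "in-memory",
    "JSON documents": "document",
    "relationships between entities": "graph",
    "data that changes over time": "time-series",
}


def categories_for_services(needs_by_service: Dict[str, str]) -> Dict[str, str]:
    """Map each microservice to a database category based on its stated need."""
    unknown = sorted(set(needs_by_service.values()) - set(CATEGORY_BY_NEED))
    if unknown:
        raise ValueError(f"no category recorded for: {unknown}")
    return {svc: CATEGORY_BY_NEED[need] for svc, need in needs_by_service.items()}
```

For example, `categories_for_services({"orders": "structured data, complex queries", "metrics": "data that changes over time"})` pairs the two made-up services with the relational and time-series categories.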

One of the considerations when designing highly scalable applications is data replication. Replicating data across multiple AZs is common, but replicating data across multiple regions is more challenging.

Only a few managed database services support global tables or built-in replication across multiple regions, while most databases will require a separate mechanism for replicating database updates between regions.

Automation and Development

Automation allows us to perform repetitive tasks in a fast and predictable way.

In cloud-native applications, automation allows us to create a CI/CD pipeline that takes developed code, integrates the various application components and the underlying infrastructure, performs various tests (from QA to security tests), and eventually deploys new versions of our production application.

Whether you are using a single cloud provider, managing environments on a large scale, or even across multiple cloud providers, you should align the tools that you are using across the different development environments:

  • Code repositories – Select a central place to store all your development team’s code, ideally one that lets you use the same code repository for both on-premises and multiple cloud environments. Example of code repository: AWS CodeCommit.
  • Container image repositories – Select a central image repository, and sync it between regions and, if needed, also between cloud providers, to keep a single source of truth. Example of container image repository: Amazon ECR.
  • CI/CD and build process – Select a tool that allows you to manage the CI/CD pipeline for all deployments, whether you are using a single cloud provider or a multi-cloud environment. Example of CI/CD build service: AWS CodePipeline.
  • Infrastructure as Code – Mature organizations choose an IaC tool to provision infrastructure for both single and multi-cloud scenarios, lowering the burden on the DevOps, IT, and development teams. Examples of IaC: AWS CloudFormation, and HashiCorp Terraform.
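
The pipeline flow described in this section (build, test, then deploy, stopping as soon as a stage fails) can be sketched in a few lines. This is a toy model, not the API of any CI/CD service; `run_pipeline`, `PipelineResult`, and the stage names in the usage note are assumptions made for illustration.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class PipelineResult(NamedTuple):
    completed: List[str]         # stages that finished successfully, in order
    failed_stage: Optional[str]  # name of the first failing stage, if any


def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> PipelineResult:
    """Run stages in order and stop at the first one that reports failure."""
    completed: List[str] = []
    for name, step in stages:
        if not step():
            return PipelineResult(completed, name)
        completed.append(name)
    return PipelineResult(completed, None)
```

A call such as `run_pipeline([("build", build), ("security-tests", scan), ("deploy", deploy)])` never reaches the deploy step if the security tests return `False`, which is the property a real pipeline should also guarantee.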

Resiliency Considerations

Although cloud providers offer many managed services that are resilient by design, you still need to consider resiliency when designing production applications.

Design all layers of the infrastructure to be resilient.

Regardless of the computing service you choose, always deploy VMs or containers in a cluster, behind a load-balancer.

Prefer to use a managed storage service, deployed over multiple availability zones.

For a persistent database, prefer a managed service, and deploy it in a cluster over multiple AZs, or even better, look for a serverless database offering, so you won’t need to maintain the database availability yourself.

Do not leave things to fate. Embed chaos engineering experiments as part of your workload resiliency tests, to have a better understanding of how your workload will survive a failure. Example of managed chaos engineering service: AWS Fault Injection Service.
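
A chaos experiment can be reduced to three steps: inject a failure, probe the service, and restore the original state. The sketch below simulates this in memory for replicas behind a load balancer. `Replica`, `LoadBalancer`, and `survives_replica_failure` are hypothetical names, and a managed fault-injection service would act on real infrastructure rather than on Python objects.

```python
from typing import List


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class LoadBalancer:
    def __init__(self, replicas: List[Replica]) -> None:
        self._replicas = replicas

    def route(self, request: str) -> str:
        """Try each replica in turn and return the first successful response."""
        for replica in self._replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy replica available")


def survives_replica_failure(balancer: LoadBalancer, victim: Replica) -> bool:
    """Take one replica down, probe the service, then restore the replica."""
    victim.healthy = False
    try:
        balancer.route("health-probe")
        return True
    except RuntimeError:
        return False
    finally:
        victim.healthy = True
```

With two replicas the experiment passes, and with a single replica it fails, which is the kind of finding such a test is meant to surface before production does.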

Business Continuity Considerations

One of the most important requirements of production applications is the ability to survive failure and continue functioning as expected.

It is crucial to design for business continuity in advance.

For any service that supports backups or snapshots (such as VMs, databases, and storage services), enable scheduled backup mechanisms, and test randomly selected backups to make sure they can be restored.
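
The random backup test above can be sketched as two small helpers: one picks which backups to test, the other checks that restored data matches a checksum recorded when the backup was taken. The function names are hypothetical, and the use of SHA-256 checksums is an assumption about how integrity is verified; the restore step itself depends on the service in use.

```python
import hashlib
import random
from typing import List, Optional


def sha256_hex(data: bytes) -> str:
    """Checksum to record at backup time and compare after a restore."""
    return hashlib.sha256(data).hexdigest()


def pick_backups_to_test(backup_ids: List[str], sample_size: int,
                         seed: Optional[int] = None) -> List[str]:
    """Choose a random subset of backups for a restore test."""
    rng = random.Random(seed)
    return rng.sample(backup_ids, min(sample_size, len(backup_ids)))


def restore_matches(expected_checksum: str, restored_data: bytes) -> bool:
    """True when the restored bytes are identical to what was backed up."""
    return sha256_hex(restored_data) == expected_checksum
```

A backup that restores without errors but fails `restore_matches` is still a failed backup, so the comparison is worth keeping even when the restore job reports success.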

For objects stored inside an object storage service that requires resiliency, configure cross-region replication.

For a container registry that requires resiliency, configure image replication across regions.

For applications deployed in a multi-region architecture, use DNS records to allow traffic redirection between regions.

Observability Considerations

Monitoring and logging give you insights into your application and infrastructure behavior.

Telemetry allows you to collect real-time information about your running application, such as customer experience.

While designing an application, consider all the options available for enabling logging, both from infrastructure services and from the application layer.

It is crucial to stream all logs to a central system, where they are aggregated and time-synchronized.
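
One way to make logs easier to aggregate centrally is to emit them as structured records with timestamps in a single time zone. The sketch below uses Python's standard `logging` module. The class name `JsonUtcFormatter` and the chosen field names are assumptions, not a format required by any particular log platform.

```python
import json
import logging
from datetime import datetime, timezone


class JsonUtcFormatter(logging.Formatter):
    """Render each log record as one JSON object with a UTC timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)
```

To use it, attach the formatter to a handler, for example `handler = logging.StreamHandler()` followed by `handler.setFormatter(JsonUtcFormatter())`, and ship the handler's output to the central log system.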

Logging by itself is not enough – you need to be able to gain actionable insights, to be able to anticipate issues before they impact your customers.

It is crucial to define KPIs for monitoring an application's performance, such as CPU/memory usage, latency, uptime, average response time, etc.

Many modern tools use machine learning capabilities to review large volumes of logs, correlate events from multiple sources, and provide recommendations for improvements.

Cost Considerations

Cost is an important factor when designing architectures in the cloud.

As such, it must be embedded in every aspect of the design, implementation, and ongoing maintenance of the application and its underlying infrastructure.

Cost should be the responsibility of every team member (IT, developers, DevOps, architects, security staff, etc.), covering both the initial service cost and the operational aspects.

A FinOps mindset helps make sure we choose the right service for the right purpose, whether that is the compute service, the data store, or the database.

It is not enough to select a service – make sure any service selected is tagged, regularly monitored for its cost, and perhaps even replaced with a better, more cost-effective alternative during the lifecycle of the workload.
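
Tagging can be checked automatically. The sketch below takes an inventory of resources with their tags and reports which required tags are missing or empty. The tag keys in `REQUIRED_TAGS` are a hypothetical policy, and the inventory is a plain dictionary here; in practice it would come from the cloud provider's inventory or billing export.

```python
from typing import Dict, List, Set

# Hypothetical tagging policy; adjust to your organization's standard.
REQUIRED_TAGS: Set[str] = {"owner", "environment", "cost-center"}


def missing_tags(resources: Dict[str, Dict[str, str]],
                 required: Set[str] = REQUIRED_TAGS) -> Dict[str, List[str]]:
    """Return, per resource, the required tags that are absent or empty."""
    report: Dict[str, List[str]] = {}
    for resource_id, tags in resources.items():
        present = {key for key, value in tags.items() if value.strip()}
        gaps = sorted(required - present)
        if gaps:
            report[resource_id] = gaps
    return report
```

Running a check like this on a schedule makes untagged resources visible early, before their cost can no longer be attributed to a team.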

Sustainability Considerations

The architectural decisions we make have an environmental impact.

When developing modern applications, take this impact into account.

Choosing the right compute service allows you to run a workload with a minimal carbon footprint – the use of containers or serverless/FaaS typically wastes less energy in the cloud provider's data centers.

The same applies when selecting a data store according to an application’s data access patterns (from hot or real-time tier, up to archive tier).

Designing event-driven applications, adding caching layers, shutting down idle resources, and continuously monitoring workload resources will allow you to design an efficient and sustainable workload.
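
Finding idle resources to shut down can start from utilization metrics that are already being collected. In the sketch below, a resource counts as idle when its average CPU usage stays below a threshold. The 5% default and the name `find_idle_resources` are assumptions for illustration, and real decisions would usually look at more than CPU.

```python
from statistics import mean
from typing import Dict, List


def find_idle_resources(cpu_percent_by_resource: Dict[str, List[float]],
                        idle_below_percent: float = 5.0) -> List[str]:
    """List resources whose average CPU usage is under the idle threshold.

    Resources with no samples are skipped rather than treated as idle.
    """
    return sorted(
        name
        for name, samples in cpu_percent_by_resource.items()
        if samples and mean(samples) < idle_below_percent
    )
```

Resources returned by a check like this are candidates for review, not for automatic deletion, since low CPU usage alone does not prove a resource is unused.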

Sustainability related reference: AWS Sustainability.

Employee Knowledge Considerations

Deciding to build a new application in the cloud is the easy part.

The challenging part is to make sure all teams are aligned in terms of the path to achieving business goals and the knowledge to build modern applications in the cloud.

Organizations should invest the necessary resources in employee training, making sure all team members have the required knowledge to build and maintain modern applications in the cloud.

Before beginning the actual project, it is crucial to verify that all team members have the necessary knowledge to maintain applications and infrastructure in the cloud, to avoid unpredictable costs, long learning curves while running in production, or an inefficient workload caused by knowledge gaps.

Training related reference: AWS Skill Builder.

Summary

In this first blog post in the series, we talked about many aspects organizations should consider when designing new applications in the cloud, from understanding business requirements to selecting the right infrastructure, automation, resiliency, cost, and more.

When creating the documentation for a new development project, organizations can use the information in this series to form a checklist, making sure all important aspects and decisions are documented.

In the next chapter of this series, we will discuss security aspects when designing and building a new application in the cloud.

About the Author

Eyal Estrin is a cloud and information security architect, and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry. You can connect with him on Twitter.

Opinions are his own and not the views of his employer.
