Estimate production-grade Infrastructure

_anshuman

Anshuman Abhishek

Posted on October 19, 2021

Building production-grade infrastructure is difficult. And stressful. And time-consuming. By production-grade infrastructure, I mean the servers, data stores, load balancers, security functionality, monitoring and alerting tools, build pipelines, and all the other pieces of your technology that are necessary to run a business. Your company is placing a bet on you: it's betting that your infrastructure won't fall over if traffic goes up, or lose your data if there's an outage, or allow that data to be compromised when hackers try to break in. If that bet doesn't work out, your company can go out of business. The checklist below lists the tasks to account for when estimating that work.

| Task | Description | Example tools |
| --- | --- | --- |
| Install | Install the software binaries and all dependencies. | Bash, Chef, Ansible, Puppet |
| Configure | Configure the software at runtime. Includes port settings, TLS certs, service discovery, leaders, followers, replication, etc. | Bash, Chef, Ansible, Puppet |
| Provision | Provision the infrastructure. Includes servers, load balancers, network configuration, firewall settings, IAM permissions, etc. | Terraform, CloudFormation |
| Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime. Includes blue-green, rolling, and canary deployments. | Terraform, CloudFormation, Kubernetes, ECS |
| High availability | Withstand outages of individual processes, servers, services, data centers, and regions. | Multidatacenter, multiregion, replication, auto scaling, load balancing |
| Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | Auto scaling, replication, sharding, caching, divide and conquer |
| Performance | Optimize CPU, memory, disk, network, and GPU usage. Includes query tuning, benchmarking, load testing, and profiling. | Dynatrace, Valgrind, VisualVM, ab, JMeter |
| Networking | Configure static and dynamic IPs, ports, service discovery, firewalls, DNS, SSH access, and VPN access. | VPCs, firewalls, routers, DNS registrars, OpenVPN |
| Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, Let's Encrypt, KMS, Cognito, Vault, CIS |
| Metrics | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, and alerting. | CloudWatch, Datadog, New Relic, Honeycomb |
| Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail |
| Backup and Restore | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account. | RDS, ElastiCache, replication |
| Cost optimization | Pick proper instance types, use spot and reserved instances, use auto scaling, and nuke unused resources. | Auto scaling, spot instances, reserved instances |
| Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack |
| Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest, InSpec, Serverspec, kitchen-terraform |
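
One way to put the checklist to work when estimating is to keep it as data and track which tasks a given service already covers. The following is only a minimal sketch in Python: the `ServiceReview` class, its methods, and the `example-api` service name are hypothetical and used for illustration. Only the task names come from the checklist.

```python
from dataclasses import dataclass, field

# Task names copied from the checklist, in the same order.
CHECKLIST = (
    "Install", "Configure", "Provision", "Deploy", "High availability",
    "Scalability", "Performance", "Networking", "Security", "Metrics",
    "Logs", "Backup and Restore", "Cost optimization", "Documentation",
    "Tests",
)


@dataclass
class ServiceReview:
    """Hypothetical record of which checklist tasks a service has covered."""
    name: str
    done: set = field(default_factory=set)

    def mark_done(self, task: str) -> None:
        # Reject names that are not on the checklist so typos surface early.
        if task not in CHECKLIST:
            raise ValueError(f"unknown checklist task: {task!r}")
        self.done.add(task)

    def remaining(self) -> list:
        # Keep the checklist's order so the output reads like the table.
        return [task for task in CHECKLIST if task not in self.done]

    def coverage(self) -> float:
        # Fraction of checklist tasks marked as done.
        return len(self.done) / len(CHECKLIST)


# Example: a made-up service that has only the first three tasks in place.
review = ServiceReview("example-api")
for task in ("Install", "Configure", "Provision"):
    review.mark_done(task)

print(review.remaining())
print(f"{review.coverage():.0%} of the checklist covered")
```

The list returned by `remaining()` is the part of the work that still needs an estimate. Each task here carries equal weight, which is a simplification. In practice some tasks, such as security or high availability, may take far longer than others.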