It seems like everyone today wants to build the next great scalable web app. And, at least according to Gartner, Amazon Web Services has emerged as the undisputed leading cloud provider where you should build it.
But AWS can look overwhelming. There are now over 40 different AWS services, and many concepts to think about when building your app. AWS provides excellent documentation on each of its services, helpful guides and white papers, reference architectures for common apps, and even a program specifically for startups, which offers one month of free access to AWS Cloud Support Engineers and, in select cases, free AWS credits.
But how do you put it all together? What’s the relationship between, say, the Elastic Load Balancer (ELB) that you hear about that “just works” and the seemingly unrelated DevOps concept of Service Discovery? When does it make sense to use a non-AWS solution (e.g. using nginx instead of an ELB)?
Also, how do you balance the need to build an app quickly and inexpensively today against the need for it to scale tomorrow?
My goal in this guide is both to answer these questions and, more importantly, to provide an overall framework you can use to think about how to answer them.
My target audience is the developer who has a few web or mobile apps under their belt and understands core programming concepts, but now wants to learn how to build their scalable app in AWS. This guide will also be valuable to developers working in larger organizations who are already running an AWS app, or those who wish to migrate to AWS.
I’ll cover both high-level concepts and details. I won’t describe the very basics of AWS like “How to Launch an EC2 Instance” but I’ll provide best practices and perspective on these areas. Finally, I won’t be explicitly discussing Big Data, Event-Driven, or other types of architectures, though there is certainly lots of material here that will apply to those.
Let’s get started.
1. What We’ll Cover
This guide exists in two parts. You’re reading Part 1, where I’ll focus on the high-level concepts in AWS and how to put together an AWS architecture. In Part 2, our focus will be on DevOps and maintaining the infrastructure.
1.1 How to Think About AWS & Scalability (Part 1)
We’ll start with high-level guidelines about how to approach and think about AWS.
1.2 AWS Concepts (Part 1)
You’ll need to learn some — but not all! — AWS services in depth. I’ll give you brief summaries of the AWS services you’ll need specifically to build a scalable web app:
- Route 53
- CloudFormation vs. Elastic Beanstalk vs. OpsWorks
1.3 Architecture Concepts (Part 1)
As you think about what your app will do and how it will scale, you will need to architect a way for your data to flow and be persisted, along with many ancillary concerns. We’ll discuss:
- Architecture Paradigms
- Application Layers
- Architecting for Scalability
- Architecting for High Availability
- Docker & Containers
1.4 DevOps Concepts (Part 2)
In Part 2, I’ll cover the operational aspects of running your infrastructure day-to-day with a special emphasis on automation.
Recently, the term “DevOps” has become a fashionable way to refer to these concepts. Technically, DevOps refers to the idea that application developers must actively think about the operational aspects of their code while ops people are using “development” concepts like source control and automated testing to manage infrastructure. But people now use DevOps to generally mean “infrastructure automation”.
We’ll cover these concepts:
- Configuration Management
- Streamlining Dev Environment Setup
- Service Discovery
- Log & Error Management
- Monitoring & Alerting
- Automated Deployment
- Backup & Disaster Recovery
- Email Service
2. How to Think About AWS & Scalability
I want to start by building your mental model so you know how to think about your scalable web app on AWS. I’ll share some fundamental ideas that aren’t very hard to understand, but they’ll make your ramp-up process easier.
2.1 You don’t need to learn all 40+ AWS services
The first good news is that, of the 40+ AWS services, you’ll only need to dive deep for a small fraction of them. As of December 2014, here’s the latest list of all AWS Services:
If you’re building a standard scalable web app that does not qualify as “Big Data”, here are the AWS services you’re likely to use:
Among those services, the bulk of your learning will be in EC2, VPC, S3, and one or more of the persistence services including RDS or DynamoDB.
A great way to identify the specific set of services you need for your unique app is by reviewing the AWS Solutions page.
2.2 Making the Right Architecture Decisions
Most of the architecture decisions you make will be a tradeoff of these factors:
- Time: How long it takes you to set up
- Team: How productive your team will be with this decision
- Cost: How much you’ll pay to AWS for these services
- Risk: How much down time / data loss / security risk you’re exposed to
- Scale: How many users you can serve / how fast your app is
AWS is a platform, but it’s up to you to assemble the pieces. Essentially, AWS allows you to “purchase” the level of availability, redundancy, security or scale that you need.
For example, consider how you might set up a WordPress site in AWS. Here are two different ways of hosting the same site on AWS. Don’t worry about the meaning of every acronym or concept. We’ll cover that later; the point now is to show how, for the same app, different needs drive different decisions.
- Low Power WordPress: Everything is hosted on a single virtual machine known as an EC2 instance (Cost). You can use a pre-built and tested Amazon Machine Image (AMI) like Bitnami WordPress to be up and running in minutes (Time). Bitnami is well-documented, and we’ll set up automated snapshots of the EC2 instance every 24 hours so that if the server goes offline, we can restore to 24 hours ago (Team, Risk). But we’ll be exposed to data loss up to 24 hours (Risk). Price will be low (Cost), and we can upgrade the instance type if we get more traffic (Team). If the server goes down, the site will go offline until we restore it (Risk).
- High Power WordPress: We’ll use S3 and CloudFront to serve all static files so we can improve site load time (Scale). We’ll set up a WordPress server in each Availability Zone so that if any one server or Availability Zone dies, the site stays up (Risk). We’ll use Amazon’s managed Relational Database Service (RDS) to host our database with the Multi-Availability-Zone option so that we have automated backups for our database, good performance, and low risk of data loss (Risk, Scale). Because we’ll have multiple WordPress servers, we’ll need to set up a special deployment method to ensure that all WordPress servers are always in sync (Team, Cost). Since we are using multiple AWS technologies like CloudFront, S3, EC2, and RDS plus multiple EC2 instances, we’ll pay more (Cost).
I believe the most important factor is “Team.” I like how Edmond Lau puts it in one of his blog posts:
“The guiding heuristic for evaluating most tradeoffs should ultimately be: ‘What course of action will ultimately increase the probability that the team succeeds?’”
2.3 Recognize the “spectrums” inherent to AWS services
Many AWS services exist along a spectrum. Recognize these to help guide your choice on which AWS services to use.
For example, AWS Elastic Beanstalk, OpsWorks, and CloudFormation all provide the same core service of orchestrating your infrastructure (i.e. launching your EC2 instances, setting up Security Groups, etc.), but you trade convenience vs. control.
IMAGE SOURCE: DevOps, PaaS and Everything in Between
Or, when it comes to storing data, you can store files in S3, where files can be retrieved within milliseconds of requesting them, or in Glacier, where files are available 2 – 6 hours after requesting them but at 1/3 the cost of S3. Here the spectrum is “Speed vs. Cost.”
2.4 Recognize the third-party alternatives to AWS services
AWS has been quite aggressive about “horizontally” expanding its services. As an example, did you know you can buy domain names through AWS? It’s part of AWS Route53 (AWS’s DNS-as-a-Service).
AWS also recently released a competitor to Dropbox (AWS Zocalo) and even an alternative to GitHub (AWS CodeCommit).
In fact, here’s a table of services provided by AWS which are also provided with great success by other third-party companies:
My default position is to start with the AWS service. First, it’s integrated into my AWS environment. Second, if applicable, it will generate service-specific alerts to the right people using email, text message, or other means via AWS Simple Notification Service (SNS). Third, I can probably grant granular permissions on that service to my existing AWS IAM users (i.e. the other members of my team who need to log in to the AWS console or access AWS via the API). Also, it’s usually the cheapest.
But there are also many cases where third-party services are superior. Log management is a good example. AWS CloudWatch Logs lets you define your own custom metrics (“metric filters”) based on your application log data. You can then view these metrics in chart format or set alerts on them. It also lets you view real-time log data as it streams in.
But tools like Loggly and SumoLogic (both of which have free versions) allow for sophisticated searching, and can summarize your log data in more interesting and useful ways than AWS CloudWatch Logs.
2.5 AWS will let you scale both up and out, but plan to scale out
As your traffic grows, you can either “scale up” by using increasingly powerful EC2 instances, or “scale out” by simply adding more instances to handle your app’s load.
2.5.1 Scaling Out
The classic wisdom is that you should architect your app to scale out. Scaling out means you can adjust your system capacity (by adding or removing EC2 instances) as your app gets more or less load. This works especially nicely for apps that experience an occasional surge in activity, like an eCommerce site on Cyber Monday.
Scaling out also implies that if one of your servers dies, this becomes a non-event because you’ve architected your app to be independent of any single EC2 instance.
One of the best things about scaling out is that you can automate it. AWS offers a set of building blocks to facilitate this. For example, Auto Scaling Groups are an EC2 feature where a metric of your choice (e.g. an instance’s CPU load or a custom app metric) can trigger launching or terminating EC2 instances when it crosses a threshold you define.
IMAGE SOURCE: Auto Scaling with the AWS Management Console
But architecting your application to scale out is non-trivial. We’ll cover this in more detail later.
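To make the Auto Scaling idea above a little more concrete, here is a minimal sketch of the kind of decision an Auto Scaling Group makes. This is plain Python, not the AWS API: the `ScalingPolicy` class, its thresholds, and the one-instance-at-a-time step are all assumptions I chose for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy; every name and threshold here is illustrative."""
    scale_out_above: float = 70.0  # average CPU % that triggers adding an instance
    scale_in_below: float = 25.0   # average CPU % that triggers removing one
    min_instances: int = 2
    max_instances: int = 10


def desired_capacity(current: int, cpu_percentages: list[float], policy: ScalingPolicy) -> int:
    """Return the instance count the group should move to, one step at a time."""
    average = sum(cpu_percentages) / len(cpu_percentages)
    if average > policy.scale_out_above:
        target = current + 1
    elif average < policy.scale_in_below:
        target = current - 1
    else:
        target = current
    # Clamp to the configured bounds so the group never drops below the
    # minimum or grows past the maximum.
    return max(policy.min_instances, min(policy.max_instances, target))


policy = ScalingPolicy()
print(desired_capacity(2, [85.0, 90.0], policy))        # busy fleet: grow
print(desired_capacity(3, [10.0, 15.0, 12.0], policy))  # idle fleet: shrink
print(desired_capacity(2, [5.0, 5.0], policy))          # already at the minimum
```

Note the minimum of two instances: even when the fleet is idle, the sketch keeps enough capacity that losing one server is still a non-event.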
2.5.2 Scaling Up
Be sure to leverage scaling up, too. Once you create an EC2 instance, it’s possible to stop it, upgrade the instance type, and then start it.
Sometimes, an exclusively “scale up” strategy can work wonders. StackOverflow is famous for having executed well on this. But an exclusively scale-up strategy usually can’t dynamically adjust capacity up and down, and is usually more expensive in public cloud platforms like AWS.
2.5.3 What I Do
When I build my own apps, I architect them from day 1 to support horizontal scaling, primarily by ensuring my web/app tier is completely stateless (more on this later). I then launch them with t2 instances (AWS’s entry-level instance type) to start, and if my metrics like CPU Load, Memory Usage, Network I/O, or Average API Response Time aren’t to my liking, I’ll upgrade to more powerful instances. At some point, I stop scaling up and start scaling out.
AWS also recently announced support for Docker as part of its EC2 Container Service. This changes the calculus of when/how to scale out, and we’ll cover this later as well.
2.6 Always build Multi-Availability Zone (Multi-AZ) architectures
When you launch a single EC2 instance, you are launching a virtual machine running on a physical server at one of Amazon’s data centers. That means any of the following events could take down your instance:
- Your instance itself could fail, or its underlying hard drive volume (Elastic Block Store volume) could become corrupted
- The physical machine on which your EC2 instance resides could fail
- The data center within which the physical machine is located could fail
In AWS-speak, the “data center” is called an Availability Zone (AZ). About 5 – 15 miles away from one AZ is at least one more AZ. A whole cluster of AZs is known as a Region. There are 3 Regions in the USA and 11 worldwide.
IMAGE SOURCE: Official AWS Documentation
2.6.1 MultiAZ for High Availability
You should assume that any of your EC2 instances will fail at any time in any of the ways outlined above. When a single EC2 instance or an entire AZ goes offline, your architecture should detect this and simply stop routing traffic to the affected instances. This is the idea behind a Multi-AZ setup, and we’ll cover it in more detail later.
Some AWS services such as AWS RDS (where AWS manages a relational database instance for you) have built-in support for Multi-AZ, and enabling it requires simply checking a box in the configuration.
You can go beyond Multi-AZ and use a Multi-Region architecture, where your app can survive an entire region going offline, but a multi-Region setup introduces new complexities and additional cost and is typically only implemented by larger organizations.
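As a rough illustration of “stop routing traffic to the affected instances,” here is a small simulation. It is not how an Elastic Load Balancer is implemented; the `Instance` class, the instance IDs, and the zone names are hypothetical, and the health flag stands in for a real health check.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Instance:
    instance_id: str          # hypothetical IDs, for illustration only
    availability_zone: str
    healthy: bool = True


def healthy_targets(instances):
    """Only instances that pass their health check should receive traffic."""
    return [i for i in instances if i.healthy]


def mark_zone_down(instances, zone):
    """Simulate an entire Availability Zone going offline."""
    for instance in instances:
        if instance.availability_zone == zone:
            instance.healthy = False


fleet = [
    Instance("web-1", "us-west-2a"),
    Instance("web-2", "us-west-2b"),
    Instance("web-3", "us-west-2a"),
]

mark_zone_down(fleet, "us-west-2a")

# Round-robin over whatever is still healthy; the site stays up on web-2.
rotation = cycle(healthy_targets(fleet))
served_by = [next(rotation).instance_id for _ in range(3)]
print(served_by)
```

If every instance had been in `us-west-2a`, the list of healthy targets would be empty and the site would be down, which is the whole argument for spreading instances across zones.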
2.7 You hardly ever pay upfront, and pay only for what you need
AWS repeatedly emphasizes their goal to shift your thinking “from CapEx to OpEx” or “Pay for what you use.” In other words, in the past to achieve the kind of infrastructure AWS offers, you would need to invest a significant sum upfront (a “Capital Expense” or CapEx). Today, AWS is philosophically committed to eliminating the CapEx so that you pay only ongoing “Operating Expenses” (OpEx) to run your infrastructure.
This means AWS services are almost always priced by usage. For example, you pay by the hour when you run an EC2 instance. If it’s stopped, you pay nothing for the instance itself (only for any EBS storage attached to it). S3 is charged per GB of data stored along with some bandwidth costs.
A notable exception to this rule is the concept of a “Reserved Instance.” This is a billing option that AWS offers where you can “purchase” a lower hourly rate for EC2 instances by paying upfront. We’ll cover this shortly.
2.8 Plan out and monitor your costs and use Reserved Instances
One of the hardest things with AWS is managing your costs. Most of us love the “Pay for what you use” model, but in practice you wind up using a wide variety of services and it’s hard to keep track of all of them.
AWS recognized this challenge and, to their credit, has provided several tools to help you manage your costs.
2.8.1 Estimate Costs Upfront
AWS provides an Online Cost Calculator you can use to estimate your monthly (and if applicable upfront) costs based on the services you plan to use.
2.8.2 Use Billing Alerts
Your AWS bill should be in line with what you expect from the calculator, and should remain pretty stable unless you use more services. You can check that these assumptions hold by setting up Billing Alerts and Notifications.
Rich Adams has an outstanding article on AWS Tips I Wish I’d Known Before I Started where he advises setting up a billing alert for his expected usage each week.
“So the first week’s alert [is] for say $1,000, the second for $2,000, third for $3,000, etc. If the week-2 alarm goes off before the 14th/15th of the month, then I know something is probably going wrong.”
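Here is one way the weekly-threshold idea could be sketched. This is only the arithmetic, not a call to the AWS billing tools; the function names and the assumption that each “week” ends on day 7, 14, 21, and 28 are mine.

```python
def weekly_thresholds(expected_monthly_spend: float, weeks: int = 4) -> list[float]:
    """Split the expected monthly bill into cumulative weekly alert levels."""
    per_week = expected_monthly_spend / weeks
    return [round(per_week * week, 2) for week in range(1, weeks + 1)]


def alarms_fired_early(spend_to_date: float, day_of_month: int, thresholds: list[float]) -> list[float]:
    """Return the thresholds already crossed before their budgeted week has ended."""
    early = []
    for week, threshold in enumerate(thresholds, start=1):
        budgeted_day = week * 7  # assumption: week N is expected to end on day 7 * N
        if spend_to_date >= threshold and day_of_month < budgeted_day:
            early.append(threshold)
    return early


thresholds = weekly_thresholds(4000.0)
print(thresholds)

# On day 9 we have already spent $2,100: the week-2 alert fired about 5 days early.
print(alarms_fired_early(2100.0, 9, thresholds))
```

An empty list means spending is on track; anything else is a hint that something is probably going wrong.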
2.8.3 The Right EC2 Pricing Model
When you launch an EC2 instance, you pay per hour. But the hourly rate you pay can vary depending on which pricing model you use.
2.8.3.1 On Demand Instances
If you just launch an instance, you are paying the On Demand rate. For example, if you run an m3.large instance 24/7 for the month, it costs $0.140/hour, or about $100/month, plus a negligible charge for the EBS volumes your instance uses ($2/month for 20GB). If you stop your instance, you pay only for the cost of your EBS volume, and pay nothing for the EC2 instance.
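The arithmetic behind those numbers is simple enough to sketch. The hourly rate and EBS charge are the ones quoted above; the 30-day month and the “business hours only” schedule are simplifying assumptions of mine.

```python
HOURS_PER_MONTH = 24 * 30  # simplifying assumption: a 30-day month


def monthly_on_demand_cost(hourly_rate: float,
                           hours_running: int = HOURS_PER_MONTH,
                           ebs_monthly: float = 0.0) -> float:
    """EC2 is billed only for hours the instance runs; EBS is billed regardless."""
    return round(hourly_rate * hours_running + ebs_monthly, 2)


# m3.large at $0.140/hour with a 20GB EBS volume at about $2/month.
always_on = monthly_on_demand_cost(0.140, ebs_monthly=2.0)
business_hours = monthly_on_demand_cost(0.140, hours_running=8 * 22, ebs_monthly=2.0)
stopped = monthly_on_demand_cost(0.140, hours_running=0, ebs_monthly=2.0)

print(always_on, business_hours, stopped)
```

The gap between the first two numbers is why stopping dev and test instances outside working hours is usually worth the small effort.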
2.8.3.2 Reserved Instances
If you expect that you will keep your EC2 instance running for an extended period of time (e.g. 24/7 for a year), you can sign up for a Reserved Instance and save up to 60%! This is a billing concept only. It will affect how you’re billed and it will not affect the technical performance of the EC2 instance in any way.
The basic idea of a Reserved Instance is that you commit to AWS that you will use a particular profile of EC2 instance (e.g. Oregon region, m3 instance type, etc.) for a 1-year or 3-year period. You signify your commitment by paying in one of three ways:
- 100% upfront, nothing ongoing (highest savings)
- 0% upfront, but you commit to pay monthly through your term (lowest savings)
- Some upfront in exchange for a lower hourly rate (medium savings)
Savings here are significant, ranging from around 30% – 60% depending on the length of your commitment and your utilization of the services. If you know you’re going to use EC2 instances for a year, this is a no-brainer.
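To compare the three payment options, it helps to convert each one to an effective hourly rate over the term. In the sketch below, only the $0.140 On Demand rate comes from this guide; the upfront and hourly figures for the reserved offers are hypothetical numbers I made up to show the calculation, not real AWS prices.

```python
HOURS_PER_YEAR = 24 * 365


def effective_hourly_rate(upfront: float, hourly: float, term_years: int = 1) -> float:
    """Spread the upfront payment across every hour of the term."""
    hours = HOURS_PER_YEAR * term_years
    return (upfront + hourly * hours) / hours


def savings_percent(on_demand_hourly: float, reserved_effective_hourly: float) -> float:
    """Savings versus running On Demand 24/7 for the whole term."""
    return round(100 * (1 - reserved_effective_hourly / on_demand_hourly), 1)


on_demand = 0.140  # m3.large On Demand rate quoted earlier

# Hypothetical 1-year offers, for illustration only.
all_upfront = effective_hourly_rate(upfront=700.0, hourly=0.0)
partial_upfront = effective_hourly_rate(upfront=350.0, hourly=0.045)
no_upfront = effective_hourly_rate(upfront=0.0, hourly=0.098)

for name, rate in [("all upfront", all_upfront),
                   ("partial upfront", partial_upfront),
                   ("no upfront", no_upfront)]:
    print(name, round(rate, 4), savings_percent(on_demand, rate))
```

The comparison assumes 24/7 usage. If the instance only runs part of the time, the On Demand side of the comparison shrinks while the reserved commitment does not, so the savings can disappear.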
The caveat here is that you can change some but not all of the conditions of your Reserved Instance and there is no official refund from AWS. See the AWS Reserved Instances FAQ for more information.
While AWS won’t issue any refunds, you can actually sell your remaining contract on the open market for a price that you set through the AWS-provided Reserved Instance Marketplace.
2.8.3.3 Spot Instances
Finally, you can purchase Spot Instances, which means you bid an hourly rate you’re willing to pay, and if your bid is above the current “spot” rate set by AWS, an instance will launch. As of November 28, 2014, that same m3.large instance goes for just $0.0165/hour (a monthly equivalent of about $12!). But the catch is that once the spot rate goes above your bid, your instance will immediately terminate. This model is suited to large ongoing computations and obviously requires some special architecture.
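The bid mechanic is easy to simulate. In this sketch the price history is a hypothetical list I invented (only the first value matches the rate quoted above), and I assume one price per hour, which is a simplification of how spot prices actually move.

```python
def hours_until_termination(bid: float, spot_prices: list[float]) -> int:
    """Count the hours an instance survives before the spot price exceeds the bid.

    Returns len(spot_prices) if the instance is never interrupted, and 0 if the
    bid is too low for the instance to launch in the first hour.
    """
    for hour, price in enumerate(spot_prices):
        if price > bid:
            return hour
    return len(spot_prices)


# Hypothetical hourly spot prices for an m3.large.
hourly_prices = [0.0165, 0.0170, 0.0210, 0.0500, 0.0180]

print(hours_until_termination(0.03, hourly_prices))  # interrupted by the price spike
print(hours_until_termination(0.10, hourly_prices))  # survives the whole window
print(hours_until_termination(0.01, hourly_prices))  # never launches
```

The “special architecture” mostly comes down to this: work has to be checkpointed or broken into small units so that an interruption at any hour loses very little.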
See the EC2 Pricing Page for more information.
2.8.4 AWS Trusted Advisor
If you happen to have a Business Support plan or better, you get access to the full version of AWS Trusted Advisor, which will automatically analyze your account for instances which are good candidates for Reserved Instances, or which are sitting idle.
3. Key AWS Services
Now that you have a high-level perspective on AWS, let’s dive one level deeper and give a brief summary of the AWS services essential to a scalable web app. Use the AWS documentation or other resources to learn the “how”; my goal is to give you the “what” and “why.”
3.1 EC2
Elastic Compute Cloud (EC2) is where you launch servers. In AWS, “servers” are actually virtual machines, and in AWS lingo a single virtual machine is an “EC2 Instance.”
Each EC2 instance is backed by one or more “virtual hard drives”, which are known as Elastic Block Store (EBS) Volumes. You can launch EC2 instances based on a “hard drive snapshot” which is known as an Amazon Machine Image (AMI). For example, you can select an Ubuntu 14.04 AMI, a Windows Server 2008 AMI, etc.
EC2 Instances are protected with “firewalls” known as Security Groups. For example, to SSH into your instance, the Security Group associated with your instance must permit port 22 from your IP address.
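Conceptually, a Security Group is an allow-list: traffic gets in only if some rule matches it. The sketch below models that idea with Python’s standard `ipaddress` module. The `IngressRule` class and the addresses (taken from the documentation-reserved ranges) are illustrative assumptions, not the AWS API.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical model of one inbound rule."""
    protocol: str
    port: int
    cidr: str  # source address range allowed by this rule


def is_allowed(rules, protocol: str, port: int, source_ip: str) -> bool:
    """Traffic is denied unless at least one rule explicitly allows it."""
    return any(
        rule.protocol == protocol
        and rule.port == port
        and ip_address(source_ip) in ip_network(rule.cidr)
        for rule in rules
    )


rules = [
    IngressRule("tcp", 22, "203.0.113.10/32"),  # SSH from a single IP address
    IngressRule("tcp", 443, "0.0.0.0/0"),       # HTTPS from anywhere
]

print(is_allowed(rules, "tcp", 22, "203.0.113.10"))   # your IP: SSH permitted
print(is_allowed(rules, "tcp", 22, "198.51.100.7"))   # anyone else: SSH denied
print(is_allowed(rules, "tcp", 443, "198.51.100.7"))  # HTTPS open to the world
```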
Every EC2 instance has a private IP address, which is private in the sense that it’s only visible within the network in which your EC2 instance resides (known as a VPC, see below). Some EC2 instances can also have one or more public IP addresses. A public IP address that you can reassign from one EC2 instance to another is known as an Elastic IP Address.
EC2 instances can be stopped (like turning off the server), started (like turning it on), or terminated (like destroying it).
3.2 VPC
A Virtual Private Cloud (VPC) is a private network where you place your EC2 instances.
Originally in AWS, every EC2 instance was directly exposed to the public Internet, guarded only by the Security Group rules set on it. Many users recognized that in a traditional data center you could set up private networks using direct wiring; VPC attempts to achieve a version of this in the cloud.
Each EC2 instance is placed into a particular Subnet. A Subnet specifies a range of IP addresses and exists in a particular Availability Zone. A Subnet is either directly exposed to the Internet (a “public subnet”) or unreachable from the Internet (a “private subnet”). For example, you probably do not want your database server to be exposed to the public Internet, so you should place it in a private subnet.
When you want to set up high-availability architectures, you can spread EC2 instances across different subnets, and thereby different Availability Zones.
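Here is a small sketch of what a subnet layout might look like, again using the standard `ipaddress` module to do the address math. The VPC range, subnet names, and zone names are all hypothetical; the point is only that each tier gets a subnet in more than one Availability Zone.

```python
from ipaddress import ip_network
from itertools import islice

vpc = ip_network("10.0.0.0/16")  # hypothetical VPC address range

# Carve the first four /24 blocks out of the VPC range.
blocks = list(islice(vpc.subnets(new_prefix=24), 4))

# Hypothetical layout: one public and one private subnet in each of two AZs.
subnets = [
    {"name": "public-a", "az": "us-west-2a", "cidr": blocks[0], "public": True},
    {"name": "public-b", "az": "us-west-2b", "cidr": blocks[1], "public": True},
    {"name": "private-a", "az": "us-west-2a", "cidr": blocks[2], "public": False},
    {"name": "private-b", "az": "us-west-2b", "cidr": blocks[3], "public": False},
]


def zones_for(subnets, public: bool):
    """List the Availability Zones covered by the public or private tier."""
    return sorted({s["az"] for s in subnets if s["public"] == public})


for s in subnets:
    print(s["name"], s["az"], s["cidr"])

# The database tier lives in private subnets and still spans two zones.
print(zones_for(subnets, public=False))
```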
3.3 S3 and Glacier
Simple Storage Service (S3) is for storing files in the cloud.
Glacier is just like S3 but cheaper, with the tradeoff that once you request a file, it may take 2 – 6 hours before it’s available for download.
To upload a file to S3, first you select or create an S3 Bucket, which is simply a namespace for a group of files. The file you upload in S3 can either be publicly downloadable via a URL, or private and only available in certain cases.
A common pattern for web and mobile apps is to store all user-generated files in S3. Such files should generally not be public, so when a user needs to download a file you can issue a temporary access token which can be used to download a file from S3 from its URL for a defined period of time (e.g. 60 seconds).
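The idea behind a temporary download link can be illustrated with a signed, expiring URL. To be clear, this is not S3’s actual signing scheme, and in practice you would let the AWS SDK generate the link for you. The secret key, the `files.example.com` host, and the function names below are all hypothetical; the sketch only shows why such a link stops working after its deadline and cannot be altered to point at another file.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-secret"  # hypothetical; known only to the server


def sign(path: str, expires_at: int) -> str:
    """Sign the path together with its expiry so neither can be tampered with."""
    message = f"{path}:{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_temporary_url(path: str, now: int, lifetime_seconds: int = 60) -> str:
    """Build a link that is only valid for lifetime_seconds after now."""
    expires_at = now + lifetime_seconds
    signature = sign(path, expires_at)
    return f"https://files.example.com{path}?expires={expires_at}&signature={signature}"


def is_valid(path: str, expires_at: int, signature: str, now: int) -> bool:
    """Reject expired links and links whose path or expiry was altered."""
    if now > expires_at:
        return False
    return hmac.compare_digest(signature, sign(path, expires_at))


print(make_temporary_url("/uploads/report.pdf", now=1000))
```

Timestamps are passed in explicitly (rather than read from the clock) so the behavior is easy to test.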
You can assign “policies” to an S3 bucket regarding what happens to files in a bucket. For example, files in a bucket can be automatically deleted after a certain period, automatically archived to Glacier, or both.
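A lifecycle policy of the “archive, then delete” kind might look roughly like the dictionary below. The overall shape follows S3’s lifecycle configuration as I understand it, but treat the field names as an approximation and check the S3 documentation for the exact schema. The rule ID, prefix, and day counts are hypothetical, and `state_at_age` is my own helper that simply simulates what the rule would do to a file.

```python
# Approximate shape of a lifecycle configuration; values are hypothetical.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-delete-uploads",
            "Filter": {"Prefix": "uploads/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}


def state_at_age(rule: dict, age_days: int) -> str:
    """Simulate where a file ends up once it is age_days old."""
    if age_days >= rule["Expiration"]["Days"]:
        return "DELETED"
    for transition in rule["Transitions"]:
        if age_days >= transition["Days"]:
            return transition["StorageClass"]
    return "STANDARD"


rule = lifecycle_configuration["Rules"][0]
for age in (1, 45, 400):
    print(age, state_at_age(rule, age))
```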
Files uploaded to S3 can be encrypted using a key you provide, or one transparently managed by AWS.
3.4 RDS and DynamoDB
Relational Database Service (RDS) is a managed relational database service for MySQL, PostgreSQL, SQL Server, or Oracle.
DynamoDB is a managed NoSQL service that uses a proprietary AWS NoSQL engine.
There are many challenges to administering your own database servers. RDS and DynamoDB automatically handle backups, scaling, master-standby replication, read replicas, database version updates, and security patches.
The primary alternative to RDS and DynamoDB is managing your own EC2 instances running a database software of your choice.
AWS recently announced Aurora, a massively scalable MySQL-compatible database as a service.
3.5 IAM
Identity and Access Management (IAM) is a set of services for managing permissions among your human team members and AWS resources.
When you first create your AWS account, the email address you use will be the “root” account and will have super-admin access. It is a best practice to never log in with this account, and instead create individual accounts for each member of your team using IAM.
You can then assign permissions to each account so that, for example, some team members can see all information but cannot start or stop an EC2 instance.
IAM permissions can also be assigned to AWS resources such as an EC2 instance itself. This enables an EC2 instance with the right permissions to automatically have access to, for example, a particular folder in S3 without the need for separate authentication.
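Permissions in IAM are expressed as JSON policy documents. As a sketch, a policy granting read-only access to one “folder” in S3 might look like this. The bucket name and prefix are hypothetical, and building the document in Python is just a convenient way to show its structure.

```python
import json

# Hypothetical bucket and prefix; the policy allows reading objects under
# reports/ and nothing else.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-uploads/reports/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Attached to the role an EC2 instance runs under, a policy like this is what lets the instance read those files without any credentials stored on the server.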
Recently, AWS added Key Management Service to IAM so that you can centrally manage symmetric encryption keys.
3.6 Route 53
Route 53 is DNS as a service, and also includes the ability to purchase domain names.
Using Route 53, you can set up public DNS records for any domain name. Route 53 recently introduced the ability to set up private DNS as well (DNS that only your private network can query but which is unavailable to the outside world).
You can use Route 53 in conjunction with many other AWS services. For example, using Route 53 in conjunction with S3 enables you to host a static website on S3.
3.7 CloudFront
CloudFront is a Content Delivery Network managed by AWS.
CloudFront contains Edge Locations throughout the world which can cache certain content. As a result, when users access CloudFront-cached content, no load is placed on your app, and fewer physical network hops are required for the user to receive the data.
CloudFront can work with either S3 or an HTTP Server as the Origin Server from which it will source the original content.
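The effect of an Edge Location on your origin can be shown with a toy cache. This is a deliberately simplified model with hypothetical names: a real CDN also expires cached content after a time-to-live and runs many edge caches, neither of which is modeled here.

```python
class EdgeCache:
    """Toy model of one edge location sitting in front of an origin server."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin
        self._cache = {}

    def get(self, path: str) -> str:
        # Only the first request for a path reaches the origin.
        if path not in self._cache:
            self._cache[path] = self._fetch_from_origin(path)
        return self._cache[path]


origin_requests = []


def origin(path: str) -> str:
    """Stand-in for S3 or your HTTP server; records every request it receives."""
    origin_requests.append(path)
    return f"contents of {path}"


edge = EdgeCache(origin)
for _ in range(1000):
    edge.get("/logo.png")

# A thousand users were served, but the origin only did the work once.
print(len(origin_requests))
```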
CloudFront also supports streaming media files.
3.8 CloudWatch
CloudWatch provides monitoring and alerting for your AWS resources. CloudWatch also offers basic features for managing application log data on your EC2 instances.
Each AWS service has a unique set of metrics it exposes to CloudWatch. For example, EC2 instances expose CPU Utilization, Disk I/O, and Network I/O (memory utilization is not reported by default and must be published as a custom metric), while RDS instances expose Database Connections, Read IOPS, and CPU Utilization.
You can set automated alerts that trigger when a metric exceeds a threshold you define. The result of the alert can be that someone gets notified, or that an action within AWS is taken. For example, a classic pattern is to automatically launch additional EC2 instances when average CPU utilization on existing EC2 instances exceeds a certain threshold.
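An alarm usually should not fire on a single noisy data point, so a common approach is to require the threshold to be breached for several consecutive periods. The function below sketches that evaluation in plain Python. The state names mirror the ones CloudWatch uses, but the function itself, its parameters, and the sample numbers are my own illustration rather than the CloudWatch API.

```python
def alarm_state(datapoints: list[float], threshold: float, evaluation_periods: int) -> str:
    """Go to ALARM only when the last N datapoints all exceed the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Average CPU % across the fleet, one datapoint per period (hypothetical values).
cpu = [40.0, 95.0, 42.0, 81.0, 84.0, 90.0]

print(alarm_state(cpu[:3], threshold=80.0, evaluation_periods=3))  # one spike is ignored
print(alarm_state(cpu, threshold=80.0, evaluation_periods=3))      # sustained load
print(alarm_state(cpu[:2], threshold=80.0, evaluation_periods=3))  # not enough data yet
```

Wiring the ALARM state to “launch another instance” is what turns monitoring into the automated scale-out pattern described above.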
3.9 CloudFormation vs. Elastic Beanstalk vs. OpsWorks
CloudFormation is a way to capture everything you can do in the AWS Web Console as a single JSON file. This means you can “version control” and “code review” your AWS infrastructure settings.
OpsWorks accomplishes the same infrastructure management as CloudFormation, but at one higher level of abstraction, and with a special focus on application deployment.
Elastic Beanstalk accomplishes the same infrastructure management as CloudFormation and OpsWorks, but at the highest level of abstraction. For example, you can launch a NodeJS stack with just a few clicks.
The most primitive (and most common!) way of setting up your infrastructure on AWS is to use the web console. But when you work in large teams, this can cause problems. Imagine that Engineer A changes a security group setting and then goes home. An hour later, Engineer B receives an alert that a service is down. If Engineer B doesn’t have a record of what was changed, discovering the root cause of the issue becomes difficult.
A best practice for managing AWS infrastructure is to run every single change through CloudFormation and to commit your changes to version control. In the example above, Engineer B can now immediately check the CloudFormation template for updates to see if that caused the issue.
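To give a feel for what gets committed to version control, here is a sketch of a very small CloudFormation template: a single security group that allows SSH from one office network. The resource name, description, and address range are hypothetical, and I build the JSON from Python only so the example is easy to run and inspect.

```python
import json

# Hypothetical template; the logical name and CIDR are illustrative.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Security group that allows SSH from one office network",
    "Resources": {
        "WebServerSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Allow SSH from the office",
                "SecurityGroupIngress": [
                    {
                        "IpProtocol": "tcp",
                        "FromPort": 22,
                        "ToPort": 22,
                        "CidrIp": "203.0.113.0/24",
                    }
                ],
            },
        }
    },
}

# Sorted keys keep the file stable, so a diff shows only real changes.
template_json = json.dumps(template, indent=2, sort_keys=True)
print(template_json)
```

If Engineer A widens that `CidrIp` range, the change shows up as a one-line diff in the commit history, which is exactly the record Engineer B needs.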
But sometimes CloudFormation is overkill, especially for AWS beginners. At the other end of the spectrum is Elastic Beanstalk, which gives you “point and click” access to deploying an entire stack of common technologies, or even of Docker Containers! While you may use “point and click” to provision the infrastructure, you can still see all the associated AWS resources (e.g. EC2 instances) in their usual place.
OpsWorks, which AWS acquired when it was known as Scalarium, is a sort of middle ground. OpsWorks offers more control than Elastic Beanstalk and also introduces “point and click” deployment options, but it is more limiting than CloudFormation.
3.10 Other AWS Services
To keep things short, I’ve only summarized the AWS services that are most essential to the scalable web app, though certainly you will use others beyond the ones I’ve listed.
4. Architecture Concepts
Now it’s time to discuss the decisions you will make as you create your system architecture on AWS.
For a complementary perspective on how your architecture might evolve over time from a single EC2 instance to hundreds of auto-scaled instances, watch the excellent presentation from Chris Munns at AWS re:Invent 2014 on Scaling Up to your First 10 Million Users.
4.1 Architecture Paradigms
One of the first decisions to make is whether to build your app as a “single stack” or “many small single stacks.”
4.1.1 Single Stack: The Monolithic Architecture
Traditionally, the default architecture for a new software project was the classic n-tier architecture. Wikipedia gives a nice general overview of this.
As the Wikipedia link describes, this means setting up your persistence tier as a database (e.g. PostgreSQL or MongoDB) and/or cache (e.g. Redis or Memcached), setting up your middle tier to handle business logic, and setting up your frontend tier, typically with a web server like Nginx or Apache.
The defining feature of the single-stack architecture is that all or most of your code is contained within these tiers. Your persistence tier will have all your database tables or document types. Your middle tier will contain your entire domain model and business logic. Each tier may have a single EC2 instance or multiple EC2 instances spread across multiple Availability Zones.
A few years ago, many software teams began to complain that their N-tier architectures had become so large that enhancing and maintaining them was becoming increasingly painful and expensive. In such situations, they pejoratively labeled the N-tier architecture a “Monolithic Architecture.”
4.1.2 Many Small Single Stacks: The Microservices Architecture
So what is a microservices architecture? The microservices architecture is a paradigm where each microservice is a standalone “single stack” that, as Sam Newman concisely summarizes in Building Microservices:
- Is small, and focused on doing one thing well
- Is a separate, independent process
- Communicates via language agnostic APIs
- Is highly decoupled
A microservices architecture can be thought of as an “approach” to a Service-Oriented Architecture (SOA), though there is often no clear distinction between the two terms.
As an example of a microservices architecture, if you were building a grocery shopping app, you might have a microservice just for creating, managing, and authenticating users, one for managing your catalog of SKUs, and one for managing your inventory levels. Each of these microservices would be on their own independent stack of EC2 instances, could be managed by separate teams and even use different technology choices, connecting to each other via RESTful APIs.
There are many benefits to a microservices architecture and it is certainly the hipster way to build a new app today. But, as with everything in software, there are pros and cons. Check out Martin Fowler’s Microservices article for a good overview and comparison. I also like Armon Dadgar’s summary.
Here’s my take on the issue, albeit with some broad generalizations:
As you can see, there are many benefits to a Microservices architecture, but it has its downsides, too.
4.1.3 Microservices and Infrastructure Automation
A critical point about a microservices architecture is that infrastructure automation (which I discuss in Part 2) becomes extremely important. It’s hard enough to manually manage a single stack infrastructure with its N tiers. Imagine managing 5 single stacks manually.
Part of the reason microservices is even a viable architecture today is because of the recent advances in infrastructure automation. Ideally, when setting up your microservices architecture, you would “template” your full stack so that it can be easily reused.
For example, you would use Configuration Management technologies like Chef, Puppet, Ansible, or Saltstack to automate the configuration of servers at each of your tiers, and Orchestration tools like AWS CloudFormation or Terraform to automate the deployment of AWS infrastructure like EC2 instances, RDS instances (Amazon’s managed relational database service), or an Elastic Load Balancer. You would bake into your server configurations a common standard for handling logging, alerting, monitoring, process management, security procedures, and more. The details of this are the heart of our Part 2 article.
Actually, the requirement to streamline your infrastructure is itself a benefit, but, per our decision making guidelines discussed earlier, if your team isn’t prepared to set up this level of infrastructure automation, microservices may be a very expensive architecture.
Personally, I start my greenfield projects today with a microservices architecture, but that’s only because I’m willing to invest the heavy overhead upfront to automate as much of my infrastructure as possible.
4.1.4 Microservices and Asynchronous Message Passing
As we’ve said, each microservice may call other microservices. This raises an important question: when someone calls one of your microservices, does it need to be a synchronous or an asynchronous call?
Imagine you’re running an eCommerce site. If you are collecting a credit card from a user, when the user presses “submit” we need to let them know immediately whether their credit card was approved. This business rule means that “process credit card for approval” must be a synchronous call. The “submit button” code should pause while it waits for a response from the “process credit card” service (by the way, it hopefully pauses in a “non-blocking” way so that the thread on which it’s operating is freed up to do other things). This is a classic example of a synchronous call, and you would typically use an HTTP RESTful web service to make the call.
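To illustrate what pausing in a "non-blocking" way can look like, here is a minimal sketch using Python's asyncio. The function names and the toy approval rule are hypothetical, and the network call is simulated with a short sleep rather than a real HTTP request.

```python
import asyncio

async def process_credit_card(card_number: str) -> bool:
    """Hypothetical stand-in for an HTTP call to a card-processing service."""
    await asyncio.sleep(0.01)  # simulates network latency without blocking the thread
    return card_number.startswith("4")  # toy approval rule, for illustration only

async def handle_submit(card_number: str) -> str:
    # The caller waits for the answer (a synchronous call from the user's
    # point of view), but the event loop can serve other requests meanwhile.
    approved = await process_credit_card(card_number)
    return "approved" if approved else "declined"

async def main() -> list:
    # Two submissions are in flight at once on a single thread.
    return await asyncio.gather(handle_submit("4111"), handle_submit("5500"))

results = asyncio.run(main())
```

Each user still waits for their own answer, but neither request holds up the other.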
But once that credit card is approved and the user is ready to submit an order, this can be done asynchronously. Your business will rarely “reject” a newly submitted order, so in this case we could submit the order not synchronously via a RESTful API call, but asynchronously using a message queue. With a message queue we simply submit our order to the queue. Meanwhile, some other service like the “order processing service” is consuming messages off the queue as fast as it can handle them. In our example, the “order processing service” might use a First-In-First-Out (FIFO) processing order.
In AWS, you can implement such a queue either by using a key-value store like Redis (or AWS’s managed Redis, ElastiCache), or you can use AWS Simple Queue Service (SQS) which offers the benefit of a fully managed queue and the detriment of some additional lock-in to AWS.
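The asynchronous flow described above can be sketched with an in-process queue. This is only an illustration: Python's standard library queue stands in for Redis or SQS, and the function names are made up for the example.

```python
import queue
import threading

order_queue = queue.Queue()  # queue.Queue is FIFO
processed = []

def order_processing_service() -> None:
    """Consumes orders off the queue as fast as it can handle them."""
    while True:
        order = order_queue.get()
        if order is None:  # sentinel value: stop the worker
            break
        processed.append(order)

def submit_order(order_id: str) -> None:
    """Returns immediately; the consumer handles the order later."""
    order_queue.put(order_id)

worker = threading.Thread(target=order_processing_service)
worker.start()
for order_id in ["order-1", "order-2", "order-3"]:
    submit_order(order_id)
order_queue.put(None)  # tell the worker there is nothing more to do
worker.join()
```

The producer never waits on the consumer, so a burst of orders simply makes the queue longer instead of slowing down the "submit" path.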
4.1.5 Microservices and AWS
AWS is especially well-suited to the infrastructure automation that microservices require. For example:
- You can represent an entire application stack, including EC2 instances launched, auto-scaling groups, security groups (individual firewalls for servers), and just about everything else all as code using AWS CloudFormation.
- One option for streamlining deployment is to use a tool like Packer to automate the creation of an Amazon Machine Image (AMI) that’s used to launch a new EC2 instance.
- AWS recently announced the EC2 Container Service to streamline deploying multiple microservice stacks across a cluster of EC2 instances using Docker containers.
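To give a feel for "infrastructure as code", here is a small sketch that builds a CloudFormation-style template as a Python dictionary and serializes it to JSON. The logical names (WebSecurityGroup, WebServer), the placeholder AMI ID, and the instance type are assumptions for illustration, and the template has not been validated against CloudFormation.

```python
import json

def build_template(ami_id: str, instance_type: str = "t2.micro") -> dict:
    """Return a minimal template: one EC2 instance plus its security group."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: one web server behind its own security group",
        "Resources": {
            "WebSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Allow inbound HTTP",
                    "SecurityGroupIngress": [{
                        "IpProtocol": "tcp",
                        "FromPort": 80,
                        "ToPort": 80,
                        "CidrIp": "0.0.0.0/0",
                    }],
                },
            },
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": ami_id,  # placeholder; use a real AMI ID
                    "InstanceType": instance_type,
                    "SecurityGroupIds": [
                        {"Fn::GetAtt": ["WebSecurityGroup", "GroupId"]}
                    ],
                },
            },
        },
    }

template_json = json.dumps(build_template("ami-00000000"), indent=2)
```

Because the stack is plain data, it can live in version control and be reused for each microservice with different parameters.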
4.2 Application Layers
Whether you’re dealing with a single stack or a microservices architecture, you’re most likely going to have at least one instance of an N-Tier architecture, that is, at least one full stack. The most common tiers in an architecture are as follows, with a discussion on handling SSL toward the end.
4.2.1 Load Balancing Tier
The role of this tier is to evenly distribute incoming traffic to the next tier, which is likely to be either the Web Tier or App Tier, but could be any tier. The official AWS solution (and the most popular solution) for this tier is an Elastic Load Balancer (ELB), but you can also set up your own EC2 instances to handle this function using software like HAProxy or Nginx.
Your load balancer is often the “entry point” to your app so high availability here is essential. AWS manages high availability for you when using an ELB, whereas managing your own EC2 instance may require setting up things like automatic failover.
If you use AWS Route 53 to handle your DNS, you can take advantage of Route 53 Health Checks, which means that Route 53 checks the health status of the endpoint it’s routing to, and if it fails the standard you specify, will route traffic to healthy endpoints.
But be careful here, because DNS responses from Route 53 include a “Time To Live” property that indicates for how long the DNS response should be cached by a local Internet Service Provider’s DNS servers. Not all DNS servers respect this value, so you can’t guarantee instant fail-over for all your clients. It’s still a handy option, though, since it’s an automated way of re-routing traffic.
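As a back-of-the-envelope sketch of why DNS fail-over is not instant, the estimate below adds the time to detect a failure to the time cached DNS answers may live. It assumes an endpoint is marked unhealthy after a number of consecutive failed checks; the function name and the example values are illustrative, and resolvers that ignore the TTL can make the real figure worse.

```python
def worst_case_failover_seconds(check_interval: int,
                                failure_threshold: int,
                                ttl: int) -> int:
    """Rough upper bound for clients whose resolvers respect the TTL."""
    detection = check_interval * failure_threshold  # time to declare the endpoint unhealthy
    return detection + ttl                          # plus time for cached answers to expire

# Example values only: checks every 30 s, 3 failures to trip, 60 s TTL.
estimate = worst_case_failover_seconds(check_interval=30, failure_threshold=3, ttl=60)
```

With these example numbers the estimate comes to 150 seconds, which is why a lower TTL is a common choice for records that participate in fail-over.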
4.2.2 Web Tier
The role of this tier is to serve static files to users, and often to route traffic to the right endpoint in the App Tier.
A popular option for the Web Tier is to use software like Apache or Nginx installed on your EC2 instances. These servers are often the final arbiters of the HTTP response that your app sends back to the client, defining properties like HTTP headers, Cross-Origin Resource Sharing (CORS) preferences, and more. Sometimes they double as a load balancer, as is the case with Nginx, but you can still use an ELB for load balancing and Nginx as your web server.
In fact, it’s a best practice to leverage the High Availability of an ELB and to have a Multi-AZ web tier so that if one EC2 instance in your web tier goes down, no one will notice.
While web server software is designed to serve static files very quickly, it’s still best practice to use S3 or AWS CloudFront (AWS’s Content Delivery Network) to deliver static files.
Web Servers are also an excellent place to use Auto Scaling Groups. If CPU load, network traffic, or I/O exceeds a certain threshold, you can configure your ASG to simply launch more instances.
4.2.3 App Tier
The role of this tier is to run your app’s main process. For example, if you’ve built a Java app, this tier is running a Java program that’s listening for incoming connections. If you’ve built a PHP app, this tier is running a web server with PHP support and listens for incoming connections.
Basically, your technology choices will dictate how this server is set up. For that reason, AWS is the least “opinionated” about this tier, providing you with nothing more than a server (the EC2 instance plus its underlying EBS volumes) and the supporting tooling like Elastic IP Addresses, Security Groups, and AMIs.
Your App Server is sometimes where “state” is stored in your app. For example, a user logging in may have his session information on a particular EC2 instance in the App Tier. The problem with this is that this user is now “pinned” to one EC2 instance, and if it fails, your user’s session dies with it.
The best practice here is to make your App Tier instances stateless and store things like session state in a different tier. We discuss this more in 4.3 Architecting for Scalability.
4.2.4 Cache Tier
The role of the Cache Tier is to store ephemeral data like user session information, or the results of commonly requested queries. When data is requested from the Cache Tier, either the Cache Tier has the requested data and serves it back (known as a “hit”) or it does not have the requested data (known as a “miss”) and your app must either write new data to the Cache Tier (e.g. a session key) or fetch existing data from the Database Tier (e.g. a commonly run query) and store it in the Cache Tier.
The idea behind a cache is that accessing memory (i.e. RAM) is orders of magnitude faster than accessing data on disk and avoids repeatedly querying the database for the same information. So we shift load away from our Database Tier to the Cache Tier when possible.
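The hit/miss flow can be sketched in a few lines. In this illustration a plain dictionary stands in for Redis or Memcached, and the class and function names are hypothetical.

```python
class CacheAside:
    """Check the cache first; on a miss, load from the database and store the result."""

    def __init__(self, load_from_database):
        self._cache = {}                 # stands in for the Cache Tier
        self._load = load_from_database  # stands in for a Database Tier query
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1               # "hit": served from memory
            return self._cache[key]
        self.misses += 1                 # "miss": fall back to the database
        value = self._load(key)
        self._cache[key] = value         # store for subsequent requests
        return value

database_calls = []

def slow_query(key):
    database_calls.append(key)           # record each trip to the database
    return f"result-for-{key}"

cache = CacheAside(slow_query)
first = cache.get("top-products")        # miss: goes to the database
second = cache.get("top-products")       # hit: database is not touched
```

A real cache would also need expiry and invalidation, which this sketch leaves out.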
Note that nothing prevents you from storing your session data in a relational database, but you’re unnecessarily adding extra load to a future bottleneck.
In theory, because a cache server handles ephemeral data, it should be able to be completely reset without affecting your app (other than resetting all active sessions). But in practice, if the Cache Tier fails, suddenly an enormous amount of load is put on the Database Tier, sometimes causing a cascade of problems. For this reason, some cache servers like Redis support persisting their data to disk to support rapid recovery.
4.2.4.1 The Cache Tier in AWS
You can set up any cache server of your choice on an EC2 instance, or you can use AWS ElastiCache, a managed Cache Tier that lets you choose whether it should use Memcached or Redis as the underlying software. Redis is generally the newer and more popular choice on newer projects.
When your Cache Tier holds all ephemeral state like session values, your App Tier no longer has to maintain this state itself. This means you can launch additional EC2 instances, and as long as they are configured to look to the Cache Tier for ephemeral data and the Database Tier for persistent data, they will “just work.” For this reason, a Cache Tier is usually an essential part of auto scaling.
4.2.5 Database Tier
The Database Tier is where all your persistent data is managed. In general, data is persisted either to a Relational Database Management System (RDBMS) such as PostgreSQL, MySQL, SQL Server, or Oracle, or to a “document store” (more commonly known as NoSQL) database such as MongoDB or AWS’s proprietary DynamoDB.
Deciding whether to use an RDBMS (relational) or NoSQL database is a very important architecture decision that is outside the scope of this guide.
4.2.5.1 AWS DynamoDB
If you decide to use AWS DynamoDB you are entering a world that comes as close to zero administration as I have seen. DynamoDB has a single metric — throughput capacity — that you modify as your app scales, plus settings around data consistency and other items that are mostly one-time settings.
Otherwise, there isn’t much more to manage. Perhaps most telling is that AWS released backup/restore functionality for DynamoDB two years after DynamoDB itself was released. Users were concerned that their own app might write bad data or otherwise corrupt the database, but AWS takes primary responsibility for ensuring your data itself is durable.
That being said, you should still consider taking DynamoDB backups and storing them offsite. When AWS events occur, they tend to cause a chain reaction of events, leading to emergent behavior that can be hard for AWS to predict. See the October 2012 AWS Outage Post-Mortem to get a sense of what I’m talking about.
4.2.5.2 Everything Else
If you’re using DynamoDB you should be paranoid about things going wrong in your Database Tier. If you’re not using DynamoDB, you should be very, very paranoid about things going wrong. If your application goes offline, you can eventually restore it, but losing even a small amount of data is usually considered catastrophic. The most common issues to worry about are:
- You could lose data because…
  - A database server failed and you had no real-time data replication
  - Your real-time replication was set up but did not function properly
  - Your backups failed to occur
  - Your backups occurred but are corrupted or otherwise unusable
  - Your backups occurred but failed to transfer to your offsite storage
  - A rogue employee deleted backup data
- You could lose availability because…
  - A database server failed and you had no automatic fail-over set up
  - Automatic fail-over took far longer than expected
  - Automatic fail-over itself failed
  - Automatic fail-over completed, but Service Discovery issues prevented your app from reaching the new primary
- You could suffer from slow performance because…
  - Your database server is receiving too much load for the EC2 instance type it’s running on
  - You are not offloading enough load to Read Replicas
  - You are not offloading enough load to the Cache Tier
And I haven’t even covered everything. So how do you handle these issues? The main solutions here are:
- Use real-time data replication for High Availability. This can enable a hot standby so that if your master database server fails the hot standby immediately becomes the new master.
- Use real-time data replication to create Read Replicas. As your database load increases, you can send all writes to one server and put all reads on a “Read Replica” server to reduce load on the “Write” server. This can introduce data consistency issues, though, so make sure you understand the consistency guarantees your architecture makes about reading data from a read replica.
- Set up automated data backups. Each database server comes with built-in backup functionality to take a full backup of the database. For example, in the PostgreSQL world this is the pg_dump command. In MongoDB, this is the mongodump command. Sometimes your data set is so large that only continuous backups are an option. Sometimes an alternative backup strategy, like file system snapshots in combination with log replay, is the right one.
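To make the Read Replica idea concrete, here is a minimal sketch of a router that sends writes to the primary and spreads reads across replicas. The class name, the endpoint strings, and the SELECT-prefix check are simplifying assumptions; real drivers and proxies are more careful about classifying queries.

```python
import itertools

class ReplicatedDatabaseRouter:
    """Send writes to the primary and rotate reads across read replicas."""

    def __init__(self, primary: str, replicas: list):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql: str, needs_latest: bool = False) -> str:
        is_read = sql.lstrip().upper().startswith("SELECT")
        # Replicas may lag behind the primary, so reads that must see the
        # most recent write are sent to the primary as well.
        if is_read and not needs_latest and self._replica_cycle is not None:
            return next(self._replica_cycle)
        return self.primary

router = ReplicatedDatabaseRouter("primary", ["replica-1", "replica-2"])
```

The needs_latest flag is one simple way to express the consistency concern: the caller decides which reads can tolerate slightly stale data.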
4.2.5.3 The Database Tier in AWS
As you can see, there is a lot to manage in your Database Tier. AWS recognized this and created the AWS RDS service for Relational Databases and AWS DynamoDB for Document Store Databases, which means that AWS manages almost all the above issues for you, and your job is merely to configure, minimally administer, and occasionally validate things.
AWS RDS is available for MySQL, PostgreSQL, SQL Server, and Oracle. RDS will give you “one-click” Multi-AZ replication, simple steps to creating a Read Replica, automated backup, point-in-time recovery, control over the database configuration options, and the ability to scale storage and CPU as needed.
AWS RDS is an excellent choice for your relational database, but it’s still not completely hands off. In Section 4.4.2 High Availability for Your Stateful Tiers, we’ll discuss some of the “gotcha” issues of RDS. Also, remember that you can always set up all the above yourself on self-managed EC2 instances.
If you use a NoSQL database, AWS DynamoDB is the fully managed NoSQL option and requires virtually no administration. Keep in mind that DynamoDB is not a managed version of MongoDB or other third-party product. This is AWS’s proprietary product. As before, the primary alternative is to use your favorite NoSQL database on self-managed EC2 instances.
4.2.6 Ancillary Tiers
There may be additional instances that are ancillary to your app. For example, you may use a Bastion Host to more securely connect to instances, a Continuous Integration / Continuous Deployment instance, an instance to run background jobs, an instance for displaying a dashboard, or an instance for accumulating data, like Elasticsearch.
I don’t discuss these here because these are incidental, not fundamental to our app. But in Part 2 when we cover the DevOps concepts, we’ll go over these in more detail.
4.2.7 Handling SSL
Most applications today need to offer their service via SSL. Since we’re discussing this in the context of “Application Layers,” the obvious question is: on which layer should SSL terminate? (If you’re not familiar with the word “terminate” in relation to SSL, see this short Wikipedia article.)
The answer is “you probably want to terminate SSL on your ELB, but this may not be sufficient.”
If you can, I recommend terminating SSL on your ELB. It’s relatively simple to set up, and you remove the CPU load of terminating SSL from your self-managed EC2 instances.
There are some scenarios where regulations require that every connection between every one of your servers is encrypted, even within your private network. In such cases, you may still use SSL termination on your ELB, but it may also make sense to just forward HTTPS traffic from your ELB to your next tier.
4.2.8 We Have a Common Set of Considerations for Each Tier
In addition to the inherent functionality we want from a tier, we also care about:
- High Availability: If one instance in the tier fails, or if an entire AZ fails, the tier should keep running.
- Load Balancing: Load should be distributed evenly across the instances we do have in a given tier.
- Scaling: Each tier should be capable of scaling up and down to meet demand.
- Backup & Disaster Recovery: In the cloud, failure shouldn’t be a crisis, just a state we plan for. For any tier, we have to prepare for how we recover from a single instance failing.
- Logging: Log data that’s generated may be stored in a file by default, but it’s best if we view it as a stream of events we can aggregate, search, and proactively monitor.
- Monitoring & Alerting: If one or more of our instances runs into issues, we want to be alerted so we can take action.
We’ll cover the first 3 concepts below. We’ll discuss the remaining concepts in Part 2 when we cover DevOps.
4.3 Architecting for Scalability
As we discussed earlier, you can always upgrade your EC2 instance type to “scale up”, but eventually you will want to “scale out” by adding more EC2 instances to your infrastructure.
When it comes to scaling out, high-level presentations about AWS always trumpet the ability to dynamically scale your cluster of EC2 instances up or down depending on load. To actually achieve this, the most common solution is Auto Scaling Groups.
Auto Scaling Groups can launch (or terminate) EC2 instances either according to a pre-set schedule (e.g. every day at 9am, launch 4 new instances), or based on metrics from AWS CloudWatch, such as when CPU utilization exceeds a certain threshold.
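The metric-based case boils down to a simple rule evaluated over and over. The sketch below is a simplified model, not how Auto Scaling is actually configured (in AWS you would define scaling policies and CloudWatch alarms); the function name and all thresholds are made-up example values.

```python
def desired_capacity(current: int, avg_cpu: float,
                     minimum: int = 2, maximum: int = 10,
                     scale_out_above: float = 70.0,
                     scale_in_below: float = 30.0) -> int:
    """Return the instance count the group should move toward."""
    if avg_cpu > scale_out_above:
        target = current + 1      # under pressure: add an instance
    elif avg_cpu < scale_in_below:
        target = current - 1      # mostly idle: remove an instance
    else:
        target = current          # within the comfortable band: do nothing
    # Never go below the minimum or above the maximum group size.
    return max(minimum, min(maximum, target))
```

The minimum and maximum bounds matter as much as the thresholds: the minimum protects availability and the maximum protects your bill.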
You create a unique Auto Scaling Group (ASG) for each tier of your app that you want to scale. This graphic from the official AWS documentation illustrates this nicely, showing one ASG for the Web Tier and one for the App Tier:
Automatically launching or terminating EC2 instances is one piece of the puzzle. The other is that your architecture must support dynamically adding/removing instances. There are two key considerations (and sometimes more depending on your architecture) to take into account here:
4.3.1 Service Discovery
In the illustration above, imagine that an ASG has just added a new EC2 instance to your App Tier. This means your Web Tier needs to know that this server exists before it can send any traffic to it! This is a key DevOps concept known as Service Discovery.
Hard-coding an IP address in your Web Tier configuration is the most basic form of Service Discovery and suffers from the problems of not being dynamically updated when new servers get added, and not auto-removing unhealthy instances. The classic way to handle Service Discovery for ASGs is using an Elastic Load Balancer as described in the official documentation.
I’ll talk in detail about your options for achieving Service Discovery in Part 2. For now, note that you’ll either have to leverage AWS built-in features for handling this, or implement it yourself.
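To make the idea less abstract, here is a toy in-memory registry showing the two behaviors that hard-coded IP addresses lack: new instances become visible when they register, and unhealthy ones drop out of the results. The class and method names are hypothetical and do not correspond to any particular tool.

```python
class ServiceRegistry:
    """Tracks which instances exist and which of them are currently healthy."""

    def __init__(self):
        self._instances = {}  # address -> healthy flag

    def register(self, address: str) -> None:
        self._instances[address] = True

    def deregister(self, address: str) -> None:
        self._instances.pop(address, None)

    def report_health(self, address: str, healthy: bool) -> None:
        if address in self._instances:
            self._instances[address] = healthy

    def healthy_instances(self) -> list:
        """Only these addresses should receive traffic."""
        return sorted(a for a, ok in self._instances.items() if ok)

registry = ServiceRegistry()
registry.register("10.0.1.10")
registry.register("10.0.2.10")               # e.g. an instance an ASG just launched
registry.report_health("10.0.1.10", False)   # a failed health check
```

After the failed health check, only 10.0.2.10 is returned, so callers stop sending traffic to the unhealthy instance without any configuration change.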
4.3.2 Managing State
One of the most significant considerations for architecting in the cloud is state. Some parts of your app will be inherently stateful. For example, your persistence tier is designed to be stateful, to permanently store your data. Your caching tier is also stateful, but is meant to hold ephemeral data such as information about a user’s session.
But everything else should be stateless. Any user request that comes in to your Web Tier should be agnostic about which instance in the Web Tier handles it, and agnostic about which instance in the App Tier handles it. Since these tiers are stateless, we can launch or terminate instances, and as soon as our Service Discovery mechanism routes traffic their way, they should “just work.”
But how do you handle user sessions if the app tier is stateless? The most common solution here is to use a Cache Tier, as described earlier in the section “Application Layers.” Caches store their data in memory (and in some cases can persist it to disk for backup or scaling purposes) so they are very fast. Now all instances in your app tier just query the Cache or Database for state.
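A small sketch shows why this makes instances interchangeable. Here a dictionary-backed class stands in for the Cache Tier, and the class and method names are invented for the example.

```python
class SessionStore:
    """Stands in for the Cache Tier (for example, a Redis or Memcached cluster)."""

    def __init__(self):
        self._data = {}

    def save(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)

    def load(self, session_id: str):
        return self._data.get(session_id)

class AppInstance:
    """A stateless app server: it keeps no session data of its own."""

    def __init__(self, name: str, sessions: SessionStore):
        self.name = name
        self._sessions = sessions

    def login(self, session_id: str, user: str) -> None:
        self._sessions.save(session_id, {"user": user})

    def whoami(self, session_id: str):
        session = self._sessions.load(session_id)
        return session["user"] if session else None

store = SessionStore()
instance_a = AppInstance("app-a", store)
instance_b = AppInstance("app-b", store)
instance_a.login("session-1", "alice")        # login handled by one instance
user_seen_by_b = instance_b.whoami("session-1")  # next request lands on another
```

Because the session lives in the shared store, instance_a could be terminated right after the login and the user would not notice.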
4.3.3 Scaling Your Database Tier
Of course, your Cache Tier and Database Tier need to scale, too, but since these are both stateful, we handle scaling these in a different way. We can’t just “add servers.”
First, let’s discuss NoSQL databases. If you use AWS DynamoDB, scaling is as simple as specifying the provisioned throughput your app needs. More throughput costs more money, but AWS claims no limit as to what you can request.
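As a rough sizing sketch, provisioned throughput is expressed in capacity units. The calculation below reflects my understanding of the model (one read unit covers one strongly consistent read per second of an item up to 4 KB, eventually consistent reads cost half, and one write unit covers one write per second of an item up to 1 KB); check the current AWS documentation before relying on it. The function names are illustrative.

```python
import math

def read_capacity_units(reads_per_second: int, item_size_kb: float,
                        strongly_consistent: bool = True) -> int:
    """Estimate read capacity units for a steady read rate."""
    units_per_read = math.ceil(item_size_kb / 4)   # reads are billed in 4 KB steps
    total = reads_per_second * units_per_read
    # Eventually consistent reads consume half as much capacity.
    return total if strongly_consistent else math.ceil(total / 2)

def write_capacity_units(writes_per_second: int, item_size_kb: float) -> int:
    """Estimate write capacity units for a steady write rate."""
    return writes_per_second * math.ceil(item_size_kb)  # writes are billed in 1 KB steps
```

For example, 100 strongly consistent reads per second of 6 KB items would need about 200 read units, since each read spans two 4 KB steps.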
Now let’s discuss relational databases. Scaling relational databases is a deep and rich topic, the details of which are outside the scope of this guide. But I’ll cover some highlights.
Basically, strategies for scaling the database amount to:
- Using more powerful hardware, and
- Reducing load on an individual database server in myriad ways.
4.3.3.1 Using More Powerful Hardware
Using more powerful hardware equates to using more powerful EC2 instances and faster I/O (i.e. more IOPS on EBS Volumes). Currently, the most powerful database instance — the db.r3.8xlarge — can give you 32 cores and 244 GB of RAM! It will also cost you about $3,000/month per instance.
On the I/O side, you can pay more to use EBS Provisioned IOPS, which increases throughput and decreases latency to achieve higher database performance. If you’re managing your own EC2 instances for the Database Tier, you can even implement RAID across multiple EBS volumes to increase throughput.
AWS RDS makes it easy to upgrade the instance types using the API or web console. Recently, AWS introduced AWS RDS for Aurora, which is an AWS-proprietary MySQL-compatible database (notably, not MySQL itself). Aurora is designed to eke more performance out of the same hardware than MySQL does. This means you can get even more leverage out of “use more powerful hardware.”
4.3.3.2 Reducing Load on an Individual Database Server
You can’t just “reduce traffic” on your app, but you can certainly distribute and manage the load caused by that traffic in different ways. Here are some common options:
- Read Replicas: Instead of using one server for all database traffic, send all writes to one server and all reads to one or more “Read Replicas.”
- Shard: Instead of using one server for all database traffic, partition your data set across multiple database servers. Each partition has its own master and may also leverage Read Replicas.
- Use a Task & Queue Model: By default, every call your code makes to the database is synchronous, meaning it waits (occupying a valuable database connection and the corresponding resources) until it gets a response. Instead of making 100% of database calls synchronous, your team could implement some calls to the database asynchronously by creating a “database processing task” and then adding it to a queue. This makes your overall database load more predictable, instead of having peaks and troughs of load.
- Optimize the Data Itself: You can de-normalize data to avoid expensive JOINs, review existing indexes, implement partial indexes, etc.
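Of the options above, sharding is the one that most benefits from a concrete picture. Here is a minimal sketch of hash-based shard selection; the function name and the choice of SHA-256 are assumptions for illustration, not a recommendation of a specific scheme.

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Map a record key (for example, a user ID) to a shard number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then wrap it
    # into the range 0..shard_count-1.
    return int.from_bytes(digest[:8], "big") % shard_count

shard = shard_for("user-42", shard_count=4)
```

The same key always maps to the same shard, which is what lets your app find the data again. Note that with this simple modulo approach, changing shard_count remaps most keys, which is one reason re-sharding is painful in practice.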
4.4 Architecting for High Availability
If an individual EC2 instance or an entire AWS Availability Zone fails, your app should stay up. This is the essence of architecting for High Availability (HA).
As we’ve said, the key idea for achieving HA is eliminating a single point of failure by placing multiple EC2 instances in multiple Availability Zones. If you can afford it (in time and money) you can also consider a Multi-Region or even Multi-Cloud strategy. I’ll limit this discussion to Multi-AZ.
When dealing with availability, an essential question is “do we just need to restore function, or do we also need to restore state?”. Let’s take a look at both.
4.4.1 High Availability for Your Stateless Tiers
For your stateless tiers, HA is straightforward:
- Each tier must have more than one EC2 instance
- Each tier must have EC2 instances in at least 2 Availability Zones (ideally more)
- When an EC2 instance fails, your architecture must be able to dynamically discover only healthy, active EC2 instances
To elaborate on the third point, this, again, is a discussion about Service Discovery.
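The three rules above can be expressed as a simple check. This is only an illustrative sketch: the function name and the tuple layout are invented, and in practice the health information would come from your load balancer or Service Discovery tool.

```python
def tier_is_highly_available(instances: list) -> bool:
    """instances: (instance_id, availability_zone, is_healthy) tuples for one tier."""
    # Only healthy instances count, since unhealthy ones receive no traffic.
    healthy_zones = [zone for _instance_id, zone, healthy in instances if healthy]
    has_multiple_instances = len(healthy_zones) >= 2
    spans_multiple_zones = len(set(healthy_zones)) >= 2
    return has_multiple_instances and spans_multiple_zones

web_tier = [
    ("i-aaa", "us-east-1a", True),
    ("i-bbb", "us-east-1b", True),
    ("i-ccc", "us-east-1b", False),
]
```

Running a check like this against each stateless tier is a quick way to spot a tier that has quietly collapsed into a single Availability Zone.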
To achieve effective Service Discovery, I recommend using an Elastic Load Balancer (ELB) in front of every stateless tier. This way, you direct all traffic to the ELB (which is itself designed to be highly available) and the ELB then takes responsibility for Service Discovery, offering you flexibility on what it means for an EC2 instance to be unhealthy plus other configuration options.
The primary alternatives are to use a third-party Service Discovery tool like Consul, or to use a third-party load balancer like HAProxy. I’ll discuss the pros and cons of these approaches in Part 2 when we cover Service Discovery.
As a final side tip, you can use Auto Scaling Groups to monitor EC2 instance health and automatically terminate and re-launch unhealthy instances.
4.4.2 High Availability for Your Stateful Tiers
For your stateful tiers, HA depends on the technology you’ve chosen and the AWS services you’re using.
For the Cache Tier, if you use AWS ElastiCache, HA is managed for you. When configuring your cache cluster, you can select the number of nodes (for Memcached) or the number of Read Replicas (Redis). You can place these in multiple Availability Zones.
For the Database Tier, if you use DynamoDB, AWS automatically replicates DynamoDB data across three Availability Zones, all behind the scenes. Essentially, you get HA for free with this service.
If you use AWS RDS, RDS offers you the option of doing a Multi-AZ deployment. This means (quoting from AWS documentation) that:
RDS will maintain a synchronous standby replica in a different Availability Zone than the DB instance. Amazon RDS will automatically fail over to the standby in the case of a planned or unplanned outage of the primary.
RDS with Multi-AZ is a great option, and it’s what I recommend to my clients, but it has some caveats:
- Amazon’s Service-Level Agreement guarantees 99.95% availability, or about 20 minutes of downtime per month.
- It typically takes about 2 – 3 minutes for a failover to complete.
- Your app will use a DNS query to discover the correct database IP address, so once the failover completes and updates the private DNS record, your app’s DNS cache has to expire (typically 3 minutes as well, but ultimately dependent on your app).
- There are (increasingly rare) edge cases where automatic failover itself failed
So, to be clear, RDS with Multi-AZ does not mean zero downtime if your primary database instance has an issue. But at least the failover should be fully automated.
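The "about 20 minutes" figure follows directly from the availability percentage. As a quick check, assuming a 30-day month (the function name is just for illustration):

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Downtime permitted over a period at a given availability level."""
    minutes_in_period = days * 24 * 60
    return minutes_in_period * (100 - availability_percent) / 100

monthly_budget = allowed_downtime_minutes(99.95)
```

For 99.95% this works out to roughly 21.6 minutes per 30-day month, so a single failover of a few minutes plus DNS cache expiry can use up a meaningful share of that budget.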
If this is too much exposure to availability risk for your liking, your options are to use AWS RDS for Aurora for improved availability, use DynamoDB for improved availability, or to manage your own database and implement more responsive mechanisms for avoiding downtime.
4.6 Docker & Containers
Five years ago, there was an industry-wide paradigm shift as we went from physical servers to virtual machines (VMs). The idea that a single physical server could run many “virtual” servers made it possible to offer “Infrastructure as a Service” (IaaS), the ability to provision a new server by an API call instead of a phone call. You could say that AWS is a major vendor in the IaaS space.
Now in December 2014, another paradigm shift is already underway: Containers.
What if I told you that the Infrastructure-as-a-Service industry had a dirty secret that, on average, only 8 – 15% of a data center’s deployed and active resources are being used (Source)? That means you pay for 100% of the resources (i.e. CPU, memory, disk space), even if you use only 15% of them.
In addition, developers have realized that they no longer want to build applications for single servers. We want to build highly available, load balanced apps that are distributed across many VMs, each located in different Availability Zones. But packaging and distributing apps to do this can be hard. If you have a new app, how do you easily deploy it to new servers?
These are some of the underlying reasons why Docker and the “containers” paradigm it represents have taken off so rapidly. Indeed, when I was at the AWS re:Invent conference in November 2014, most sessions on Docker were so crowded that the conference rooms reached their fire code limits and people (including me) were turned away.
4.6.1 Learning Docker
One of the consequences of Docker’s meteoric rise has been an explosion in solution offerings for the “containerized world.” Frankly, it’s a confusing jungle out there and just understanding the ecosystem alone could warrant its own article. But here are some brief guidelines about where to get started.
First, you should learn about Docker itself. Docker represents the container paradigm and learning all about Docker will help you to understand how to work with containers in general. Note that, recently, a Docker competitor has arisen called Rocket. Rocket is very early (currently in 0.1), but it highlights that it’s the containers concept that is profoundly important, more so than the particular implementation of it.
4.6.2 Learning the Docker Ecosystem
Once you start working with Docker, you’ll quickly discover that there are management tasks associated with containers. You want to deploy your containers across a cluster or “fleet” of servers, and you need tooling to help with that. As containers get deployed across a cluster, you have to figure out how to handle:
- Securing containers (especially in multitenant setups)
- Linking containers dynamically (i.e. Docker’s version of Service Discovery)
- Monitoring containers and the cluster as a whole
- Handling container logging
- Auto scaling containers
- Auto healing when a VM/host dies
- Hosting a private container registry where your containers are stored
AWS recognized these challenges and released EC2 Container Service (ECS) to address some of them. ECS is designed to help you provision a cluster of EC2 instances and then deploy Docker containers across the cluster. Watch the official demo from AWS re:Invent. It’s an exciting technology, but it’s still a first-generation product and best practices here are not yet established.
But the more I’ve gotten into the container ecosystem, the more I see ECS as a “side story” in this exciting paradigm shift. One alternative to ECS is CoreOS, which is a stripped-down version of Linux (so stripped down you can’t even install new packages on it!) that comes with the docker software pre-installed. It includes a built-in service discovery tool (etcd) that runs across your cluster, and a built-in tool for deploying containers across your cluster (fleet). At least at this point, it seems more “vertically integrated” than ECS does.
Docker itself has recently thrown its hat in the ring with Docker Machine, Docker Swarm, and Docker Compose. These also look like promising technologies, but Docker claims they won’t be generally available until Q2-2015.
There are yet more offerings. Apache Mesos is an open source “distributed systems kernel” which aims to expose an API to allow developers to work at the abstraction level of a cluster rather than a single server. On Dec 7, 2014, the Mesosphere Datacenter Operating System was announced. Interestingly, this tool can run on top of any Linux distribution, including CoreOS.
And just in case that’s not enough for you, there’s also Dokku, a very lightweight program that enables you to push Heroku-compatible applications to your own private containers; Flynn, an attempt to offer fleet management as a Platform-as-a-Service offering, somewhat like a more “private Heroku” for your Docker containers; and many others.
4.6.3 Docker & AWS
As I mentioned, AWS offers its EC2 Container Service. Beyond that, AWS also supports deploying an application from a Docker container using AWS Elastic Beanstalk, or bootstrapping the Docker software installation using AWS CloudFormation.
You can also of course just launch EC2 instances with an AMI that is ready to support Docker containers. CoreOS provides such an AMI. Or you can leverage a feature like CloudInit, which is available for most Linux distributions and can run arbitrary scripts at boot, including a script to install Docker on first boot.
Best practices for running Docker in EC2 using, for example, CoreOS or Ubuntu, are outside the scope of this article, but running Docker is absolutely something you should consider as you build your AWS app.
4.7 Security
If JP Morgan, Home Depot and Target can get hacked, so can you. I encourage you to take security very seriously and to think about it from Day 1. A rogue party with unauthorized access to your AWS account could single-handedly destroy your company.
I’ll list a subset of best practices, but this is not a complete list; it is merely a starting point.
4.7.1 Helpful Resources
Security is very difficult to get right. I’ll describe mostly infrastructure-level security relevant to AWS, but of course your application itself needs to be secured, too. For app-level security guidelines, check out the very useful OWASP Top 10 Security Vulnerabilities report.
For a good reference point on end-to-end security, check out the PCI DSS 3.0 standard. Anyone who stores credit card data is subject to these requirements, but most of them represent good security practices anyway.
And of course, AWS itself publishes a White Paper on Security Best Practices that is a must-read.
4.7.2 Never Log In with Your Master Account
When you create your AWS account, you have created your master AWS account, which has super-admin privileges. Like root on Linux, this account is so powerful that we want to avoid using it unless absolutely necessary.
For that reason, immediately after creating your master account, create a set of IAM Groups that represent the different permission levels you want to assign to your team (e.g. Admin, Dev, Test, Read-Only) and then create IAM Users for each member of your team. You can assign IAM Users to IAM Groups to give them the right permissions.
You should get ready to lock your master account username & password in a safe and distribute the keys only to the most trusted parties, but before you do…
4.7.3 Set Up Multi-Factor Authentication & Require Strong Passwords
If your AWS login password is compromised, you want a fail-safe. For that reason, I recommend setting up Multi-Factor Authentication (MFA) for your master account and mandating it for all IAM Users.
In addition, IAM lets you set password policies for IAM Users. You should enforce long and difficult passwords that would make a brute-force attack computationally infeasible (even though Amazon protects against this already).
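As a rough sketch of what “mandating MFA” and a password policy can look like in practice, the snippet below builds an IAM policy document that denies nearly everything when a request was not authenticated with MFA, plus a set of password policy settings. The policy is modeled on the commonly used `aws:MultiFactorAuthPresent` condition; the helper name `to_json` and the specific password values (14 characters, 90 days, and so on) are illustrative assumptions, not recommendations from AWS.

```python
import json

# Deny every action (except the ones a user needs to enroll an MFA device)
# when the request was not authenticated with MFA. "BoolIfExists" also
# matches requests where the condition key is missing altogether.
REQUIRE_MFA_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAllWithoutMFA",
            "Effect": "Deny",
            "NotAction": [
                "iam:CreateVirtualMFADevice",
                "iam:EnableMFADevice",
                "iam:ListMFADevices",
                "iam:ListVirtualMFADevices",
                "iam:ResyncMFADevice",
                "sts:GetSessionToken",
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
            },
        }
    ],
}

# Keys mirror the parameters of IAM's UpdateAccountPasswordPolicy API.
# The values are assumptions chosen for illustration only.
PASSWORD_POLICY = {
    "MinimumPasswordLength": 14,
    "RequireSymbols": True,
    "RequireNumbers": True,
    "RequireUppercaseCharacters": True,
    "RequireLowercaseCharacters": True,
    "MaxPasswordAge": 90,
    "PasswordReusePrevention": 5,
}


def to_json(document):
    """Serialize a policy dict to the JSON text that IAM expects."""
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    print(to_json(REQUIRE_MFA_POLICY))
```

You would attach a policy like this to every IAM Group so that it applies to all of your IAM Users; the master account’s MFA is configured separately in the console.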
But even MFA won’t protect you if you fail to address the next point.
4.7.4 Never Commit Your AWS API Access Keys to Source Code
This person’s experience shows that publishing AWS credentials to a GitHub repo is the equivalent of publishing your username and password to the repo.
Actually, credentials are a special case of “configuration information” and in general, you want to keep configuration information out of code. Adam Wiggins explains this in more detail on 12factor.net. Instead, store passwords in Environment Variables. Some people argue in favor of passing config values through command-line parameters, but I worry that this then includes the password in your bash command history.
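Here is a minimal sketch of reading secrets from Environment Variables instead of source code. The variable names `DATABASE_PASSWORD` and `AWS_REGION`, the default region, and the helpers `require_env` and `load_config` are all hypothetical; the point is simply that the app fails fast when a secret is missing rather than falling back to a value baked into the repo.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised when a required environment variable is not set."""


def require_env(name, environ=None):
    """Return the value of a required variable, or fail loudly."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise MissingConfigError(
            f"required environment variable {name} is not set"
        )
    return value


def load_config(environ=None):
    """Build the app's config from the environment (names are examples)."""
    env = os.environ if environ is None else environ
    return {
        # Secrets have no default: a missing value is a deployment error.
        "database_password": require_env("DATABASE_PASSWORD", env),
        # Non-secret settings may have a sensible default.
        "aws_region": env.get("AWS_REGION", "us-east-1"),
    }
```

Passing `environ` explicitly is only there to make the function easy to test; in the running app you would call `load_config()` with no arguments.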
Also, you should never generate API Access Keys for your master account. Instead, only generate API keys for your IAM accounts. You may wish to create one IAM user for each third-party service that needs API access so you can explicitly keep track of how API keys are used and assign minimal permissions.
Think about it: if you use a third-party vendor to send data to S3, then once you provide them with your API keys, you have no control over how they manage your secrets. If your vendor screws up, you want to limit the damage that can be done.
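To make “minimal permissions” concrete, the sketch below generates an IAM policy for a vendor-specific IAM User that may only upload objects under a single S3 prefix and can do nothing else. The function name `vendor_upload_policy` and the bucket and prefix names are hypothetical placeholders.

```python
import json


def vendor_upload_policy(bucket, prefix):
    """Return an IAM policy that only allows uploads under one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VendorUploadOnly",
                "Effect": "Allow",
                # No read, list, or delete permissions are granted.
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    policy = vendor_upload_policy("example-app-data", "vendor-uploads")
    print(json.dumps(policy, indent=2))
```

If the vendor’s copy of the key leaks, the attacker can write objects under that one prefix, but cannot read your data or touch any other AWS resource.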
4.7.5 Use IAM Roles instead of API Keys for EC2 Instances
Many AWS users will generate an API key and then pass it directly to the EC2 instance via environment variables or, worse, by hard-coding it.
AWS has a great tool to enable these permissions in a much cleaner way. You can create an IAM Role and assign it to a specific EC2 instance. You can then give that IAM Role permissions to access a particular S3 folder, or virtually any other AWS resource. When you do this, AWS automatically makes temporary, regularly rotated API credentials available to the instance through the EC2 instance metadata service, and the AWS SDKs and CLI pick them up without any configuration.
This is the most secure way to enable access to AWS resources because there’s simply no long-lived key for you to manage! Note that, at the time of writing, you can only assign an IAM Role to an EC2 instance when launching it.
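To show what “temporary credentials” means, here is a sketch of how the credentials attached to a role can be read from the EC2 instance metadata service and how their expiry can be checked. In practice the AWS SDKs and CLI do this lookup for you, so application code normally never touches these values. The function names and the role name you pass in are hypothetical, and `fetch_role_credentials` only works when run on an EC2 instance that has a role attached.

```python
import json
import urllib.request
from datetime import datetime, timezone

METADATA_URL = "http://169.254.169.254"


def fetch_role_credentials(role_name, timeout=2):
    """Fetch the role's credential document (works only on EC2)."""
    # Request a short-lived session token for the metadata service first.
    token_req = urllib.request.Request(
        f"{METADATA_URL}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    cred_req = urllib.request.Request(
        f"{METADATA_URL}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(cred_req, timeout=timeout) as resp:
        return resp.read().decode()


def parse_role_credentials(document):
    """Extract the fields an SDK needs from the credential document."""
    data = json.loads(document)
    expires = datetime.strptime(
        data["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "access_key_id": data["AccessKeyId"],
        "secret_access_key": data["SecretAccessKey"],
        "session_token": data["Token"],
        "expires": expires,
    }


def seconds_until_expiry(credentials, now=None):
    """Return how many seconds remain before the credentials expire."""
    now = now or datetime.now(timezone.utc)
    return int((credentials["expires"] - now).total_seconds())
```

Because every credential set carries an expiration and is replaced before it lapses, a leaked copy is only useful to an attacker for a limited time.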
4.7.6 Centrally Manage Employee Access
Ideally, all employee identity is managed in a single central place. For example, you might have a company-wide Active Directory server, or a centrally hosted LDAP server, or even use a third-party service like OneLogin.
In a Linux environment, you can configure your servers to only permit login from certain groups of users as listed in the LDAP server. This has the benefit of enabling audit logs for every single login by any employee. It also means that there are no root SSH keys for servers floating around.
But central identity servers are sometimes not priority #1 when building a new app, especially with a smaller team of trusted people. So this may be a “Phase 2” security measure.
4.7.7 Have a Policy for Employees Who Leave
If you have a trusted employee who leaves, you should know exactly which resources that person had access to so that you can reset any passwords as needed. Ideally, offboarding an employee would amount to nothing more than revoking their SSH Key as stored on the LDAP Server and deactivating their IAM User Account. Of course, many third-party services that dev teams use don’t offer individual accounts, so you may need to reset multiple third-party accounts as well.
4.7.8 Use a Bastion Host or Connect to Your VPC with a VPN
Ideally, your infrastructure is automated enough that you rarely need to log in to a server directly. But when you must log in, it’s too risky to open login ports (22 for Linux SSH, 3389 for Windows Remote Desktop) directly to the public Internet.
One option is to lock down each server’s ports to the specific IP address logging in, but since you will manage many servers, this quickly becomes cumbersome to maintain.
A better option is either to lock out these ports completely from the public Internet and connect via a VPN, or to use a Bastion Host. A Bastion Host (sometimes called a “Jump Box”) is a single server that does permit login from the public Internet, but it is the only such server in your network that permits direct login.
This allows you to harden a single server, versus every server. Some examples of extra hardening on this server include permitting access only from specified IP addresses and even setting up two-factor authentication with tools like Duo.
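The Bastion Host arrangement boils down to two sets of inbound rules, which the following sketch models in plain Python. It is a simplified stand-in for Security Group rules, not an AWS API: the `IngressRule` type, the `allows` helper, the office CIDR range, and the bastion’s private IP are all hypothetical. A real Security Group can also name another Security Group as the source instead of a single IP.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """One inbound rule: allow `port` from any address in `source_cidr`."""
    port: int
    source_cidr: str


def allows(rules, source_ip, port):
    """Return True if any rule permits this source IP on this port."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        rule.port == port and addr in ipaddress.ip_network(rule.source_cidr)
        for rule in rules
    )


# Hypothetical addresses, for illustration only.
OFFICE_CIDR = "203.0.113.0/24"
BASTION_PRIVATE_IP = "10.0.0.10"

# The bastion accepts SSH only from the office network.
BASTION_RULES = [IngressRule(22, OFFICE_CIDR)]

# Every other server accepts SSH only from the bastion itself.
APP_SERVER_RULES = [IngressRule(22, f"{BASTION_PRIVATE_IP}/32")]
```

With these rules, an engineer in the office can reach the bastion, and only a session originating from the bastion can reach an app server; nothing on the public Internet can SSH to an app server directly.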
There is a good discussion on Hacker News about the relative merits of the Bastion Host vs the VPN, as well as some ideas for further hardening the Bastion Host.
4.7.9 Pay Attention to Key Management
In my experience, most developers consider encryption and key management a nuisance and an afterthought. In fact, there is a whole body of knowledge on proper key management. One of my favorite resources is the OWASP Cryptographic Storage Cheat Sheet.
You’ll notice there’s a great deal of discussion on where and how to store keys. Until recently, there were few good options for this in AWS. Essentially, you could store keys in S3 (hopefully encrypted, hopefully locked down by IAM permissions), or you could spend thousands of dollars (and man-hours) using AWS CloudHSM, which is impractical for many companies.
So in November 2014, AWS released AWS Key Management Service. This service is a perfect complement to the OWASP recommendations and provides a centralized place to manage all of your keys. Of course, you’ll still need to implement best practices in your app itself.
4.7.10 Storing Passwords & Secrets & Encrypting Data at Rest
As you build your app, you will accumulate secrets. This includes your database password, symmetric encryption keys, asymmetric private keys, credentials used to access third-party services, and generally any other piece of information that, if known, grants access to sensitive data.
So where do we store all these secrets? Your first thought might have been to encrypt the secrets themselves, but then where do you store the master encryption key? Ultimately, important secrets have to be stored somewhere.
It turns out that where you store the keys and how you secure them is the essential part of encrypting data at rest. Once the keys themselves are secured, you have many options of where to store the actual data and how to secure it, whether on S3, in your own EC2 instance, etc.
This is, again, a much bigger topic than I can cover here, and I encourage you to review the AWS White Paper Securing Data at Rest with Encryption.
4.7.11 Encrypting Data in Transit
When users use your app, you want to ensure that (a) no one can see their traffic en route, (b) their message to you has not been altered in transit, even by a party who can’t read it, and (c) no one is doing a man-in-the-middle attack pretending to be you to the user (and pretending to be the user to you!). In short, we care about privacy, integrity, and authenticity.
The solution to these problems is simple: Use SSL/TLS at all times unless you can specifically guarantee you are within a trusted network (e.g. from your Web Tier to App Tier). But, of course, there are nuances and edge cases to consider here, especially when dealing with peer-to-peer connections.
I will defer again to the AWS Security Best Practices White Paper.
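As a small illustration of “use SSL/TLS at all times” from the client side, the sketch below builds a TLS context that verifies the server’s certificate chain and hostname (which addresses the man-in-the-middle case) and refuses to send a request over plain HTTP. The function names and the choice of TLS 1.2 as the minimum version are assumptions made for this example.

```python
import ssl
import urllib.request


def strict_client_context():
    """Return a TLS context that verifies certificates and hostnames."""
    # create_default_context() enables certificate and hostname checks.
    ctx = ssl.create_default_context()
    # Refuse old protocol versions (the minimum chosen here is an assumption).
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def fetch_https(url, timeout=5):
    """Fetch a URL, refusing anything that is not HTTPS."""
    if not url.lower().startswith("https://"):
        raise ValueError("refusing to send a request over an unencrypted connection")
    with urllib.request.urlopen(
        url, timeout=timeout, context=strict_client_context()
    ) as resp:
        return resp.read()
```

The important design choice is what the code does not do: it never disables certificate verification to “make an error go away”, which would silently give up authenticity.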
4.7.12 Additional Security Measures
I can’t possibly cover all of security in an article not even dedicated to it. In addition to what I’ve listed above, you should consider setting egress rules on your Security Groups (after all, there’s never a reason for your database to connect to IP addresses in Russia), setting up vulnerability scanning, using Host-Based Intrusion Detection Systems (HIDS), and much more.
Congratulations on making it to the end of this guide!
I want to emphasize that there are many, many important AWS technologies and techniques we did not cover. AWS is a vast terrain that took years to build and consequently takes time to learn.
But we covered a lot of ground and you now have a mental framework within which to incorporate new AWS services and techniques you discover. More importantly, you have the foundation to build and scale your app on AWS.