
Part 6. How to Work with Multiple Teams and Environments


Yevgeniy Brikman

JUN 25, 2024

Update, June 25, 2024: This blog post series is now also available as a book called Fundamentals of DevOps and Software Delivery: A hands-on guide to deploying and managing production software, published by O’Reilly Media!

This is Part 6 of the Fundamentals of DevOps and Software Delivery series. In Part 5, you learned how to set up CI/CD to allow developers to work together efficiently and safely. This will get you pretty far, but as your company grows, you’ll start to hit problems that cannot be solved by CI/CD alone. Some of these problems will be due to pressure from the outside world: more users from more places around the world means you have to deal with more traffic, more data, and more local laws and regulations. Some of these problems will be due to pressure from within: more developers, more teams, and more products means it’s going to be harder to code, test, and deploy without hitting lots of bugs, outages, and bottlenecks.

All of these are problems of scale, and for the most part, these are good problems to have, as they are typically signs that your business is becoming more successful. But to paraphrase the philosopher The Notorious B.I.G., more money means more problems. The most common approach companies use to solve problems of scale is divide and conquer. That is, you break things up into multiple smaller pieces, where each piece is easier to manage in isolation, typically using one or both of the following approaches:

Break up your deployments

You deploy your software into multiple separate, isolated environments.

Break up your codebase

You break up your code base into multiple libraries and/or (micro)services.

In this blog post, you’ll learn the advantages and drawbacks of these approaches and how to implement them. You’ll also go through several hands-on examples, including setting up multiple AWS accounts and running microservices in Kubernetes.

Let’s start with the approach you’re likely to see at almost every company, which is breaking up your deployments.

Breaking Up Your Deployments

Throughout this blog post series, you’ve deployed just about everything—servers, Kubernetes clusters, serverless functions, and so on—into a single AWS account. This is fine for learning and testing, but in the real world, it’s more common to have multiple deployment environments, where each environment has its own set of isolated infrastructure. In the next several sections, you’ll learn why you may want to deploy across multiple environments, how to set up multiple deployment environments, some of the challenges with multiple environments, and finally, you’ll go through an example of setting up multiple environments in AWS.

Why Deploy Across Multiple Environments

Here are the most common reasons to break up your deployments into multiple environments:

  • Isolating tests

  • Isolating products and teams

  • Reducing latency

  • Complying with local laws and regulations

  • Increasing resiliency

Let’s dive into each of these, starting with testing.

Isolating tests

You typically need a way to test changes to your software (a) before you expose those changes to your users and (b) in a way that limits the blast radius, so if something goes wrong during testing, the damage is constrained, and doesn’t affect users or your production environment.

To some extent, as soon as you deployed your app onto a server (in Part 1), you already had two environments: your local development environment (LDE), which is your own computer, and production, which is the server. This may be enough for testing some types of software, but in most cases, the differences between your LDE and production are so large that testing only in the LDE is not sufficient. What you need is one or more environments that closely resemble production, but are completely isolated and only accessible to your team.

A common setup you’ll see at many companies is to have the following three environments:

Production

This is the environment that is exposed to your users.

Staging

This environment is more or less identical to production, though typically scaled down to save money: i.e., you have the same architecture in staging and production, but staging uses fewer and smaller servers. The staging environment is only exposed to employees at your company, so they can test new versions of the software just before those new versions are deployed to production: that is, you stage releases in this environment.

Development

This environment is also a scaled-down clone of production, and is only exposed to your dev team for testing out code changes during the development process, before those changes make it to staging.

This trio of development, staging, and production, often shortened to dev, stage, and prod, shows up at most companies, although sometimes with slightly different names: e.g., stage is sometimes called QA, as that’s where the quality assurance (QA) team does testing before a release to production.

Isolating products and teams

Larger companies often have multiple products and multiple product teams, and at a certain scale, having all of them work in the same environment or even the same set of environments can lead to a number of problems: e.g., different products may have different requirements in terms of security, compliance, uptime, deployment frequency, and so on.

Therefore, it’s common in larger companies for each team or product to have its own isolated set of environments. For example, the search team might have their software deployed in the search-dev, search-stage, and search-prod environments, while the profile team might have their software deployed in the profile-dev, profile-stage, and profile-prod environments. This ensures that teams can customize their environments to their own needs, limits the blast radius if one team or product has issues, and allows teams to work mostly in isolation from each other.

Key takeaway #1

Breaking up your deployment into multiple environments allows you to isolate tests from production and teams from each other.

Reducing latency

If you have users in multiple locations around the world, you may want to run your software on servers (and data centers) that are geographically close to those users. One of the big reasons for this is latency: that is, the amount of time it takes to send data between your servers and your users' devices. This information is traveling at nearly the speed of light, but when you’re building software used across the globe, the speed of light can be too slow! If you’re going to build software, it’s a good idea to familiarize yourself with the latency of common computer operations, as shown in Table 9:

Table 9. Typical latency numbers of common computer operations[1]
Operation                                              Time in ns
Random read from CPU cache (L1)                                 1
Random read from main memory (DRAM)                           100
Compress 1 kB with Snappy                                   2,000
Read 1 MB sequentially from DRAM                            3,000
Random read from solid state disk (SSD)                    16,000
Read 1 MB sequentially from SSD                            49,000
TCP packet round trip within same datacenter              500,000
Random read from rotational disk                        2,000,000
Read 1 MB sequentially from rotational disk             5,000,000
TCP packet round trip from California to New York      40,000,000
TCP packet round trip from California to Australia    183,000,000

These numbers are useful for doing back-of-the-envelope calculations, such as estimating the impact of having one data center on one continent, where your users may end up connecting from a different continent, versus having multiple data centers on multiple continents, where your users always end up connecting to a data center on the same continent: in this scenario, you can expect the multi-data-center approach to reduce latency by around 100,000,000 ns (100 ms). This might not seem like much, but remember, this is the overhead for a single TCP packet: these are typically around 1 KB in size, and most web pages and mobile apps these days send hundreds or thousands of KB of data, across many requests, so the extra latency can quickly add up to many seconds of additional overhead for every page load and button press.

Therefore, companies with a global reach often end up deploying software across multiple data centers across the globe: for example, you might have one production environment in Ireland (prod-ie) to give your European users lower latency and one production environment in the US (prod-us) to give your North American users lower latency.

Complying with local laws and regulations

If you operate in certain countries, work in certain industries, or work with certain customers, you may be subject to laws and regulations that require you to set up your environments in very specific ways. For example, if you store and process credit card information, you may be subject to PCI DSS (Payment Card Industry Data Security Standard); if you store and process healthcare information, you may be subject to HIPAA (Health Insurance Portability and Accountability Act) and HITRUST (Health Information Trust Alliance); if you are building software for the US government, you may be subject to FedRAMP (Federal Risk and Authorization Management Program); and if you are building software in certain countries, you may be subject to data residency laws, such as the EU’s GDPR (General Data Protection Regulation), which require businesses that operate in that country, or have customers in that country, to store and process data on servers physically located within that country.

A common pattern is to set up a dedicated environment for complying with laws and regulations: for example, if you’re subject to PCI DSS, you might have one prod environment, perhaps called prod-pci, that meets all the PCI DSS requirements, and is used solely to run your payment processing software, and another prod environment, perhaps just called prod, that isn’t as locked down, and is used to run all your other software.

Increasing resiliency

In Part 3, you saw that a single server can be a single point of failure; the solution was to deploy multiple servers. It turns out that, even if you have multiple servers, if all of them are in a single data center (a single environment), that one data center can be a single point of failure, too. It’s possible for a power outage, cooling problem, network connectivity issue, and a variety of other problems to disrupt the functionality of an entire data center, and all the servers within it.

The solution, for companies that need a higher degree of resiliency, is to deploy across multiple data centers that are in separate locations around the world (e.g., prod-ie and prod-us, as in the previous section), so that whatever causes an outage in one is unlikely to affect the others.

Now that you’ve seen a few of the most common reasons why you might want to break up your deployment across multiple environments, let’s talk about how to actually do it.

How to Set Up Multiple Environments

There are different ways to define an "environment." Here are a few of the most common approaches:

Logical environments

A logical environment is one defined solely in software (i.e., through naming and permissions), whereas the underlying hardware (servers, networks, data centers) is unchanged. For example, you could create multiple logical environments in a single Kubernetes cluster by using namespaces: in Part 3, since you didn’t specify a namespace, everything you deployed into Kubernetes went into the default namespace, but you can also create a custom namespace for each environment using the kubectl create namespace <NAME> command. You can then add the --namespace flag to other Kubernetes commands: e.g., you could use kubectl apply --namespace development to deploy an app into the development environment, kubectl apply --namespace staging to deploy the same app into the staging environment, and so on.
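For example, a minimal sketch of using namespaces as logical environments might look like this (the file name is a placeholder for whatever Kubernetes configuration you created in Part 3):

$ kubectl create namespace development
$ kubectl create namespace staging

$ kubectl apply -f <YOUR_APP_CONFIG>.yml --namespace development
$ kubectl apply -f <YOUR_APP_CONFIG>.yml --namespace staging

$ kubectl get pods --namespace development

Each namespace gets its own copy of the app, and commands scoped to one namespace won’t show or modify resources in another.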

Separate servers

One notch above logical environments is to set up each environment on separate servers. For example, instead of a single Kubernetes cluster, you deploy one cluster per environment, including separate control plane and worker nodes for each cluster.

Separate networks

One step above separate servers is to put the servers for each environment in a separate, isolated network: e.g., the servers in the development environment can only communicate with other servers in development, the servers in staging can only communicate with other servers in staging, and so on. You’ll see an example of how to set up separate networks in Part 7 [coming soon].

Separate accounts

If you deploy into the cloud, many cloud providers allow you to create multiple accounts: note that different cloud providers use different terminology here, such as projects in Google Cloud and subscriptions in Azure; I’ll just use the term "account" throughout this blog post series. By default, accounts are completely isolated from each other, including the servers, networks, and permissions you grant in each one, so a common approach is to define one environment per account: e.g., one account for dev, one account for stage, and one account for prod.

Separate data centers in the same geographical region

The next level up is to run different environments in different data centers that are all in the same geographical region: e.g., multiple data centers on the US east coast.

Separate data centers in different geographical regions

The final level is to run different environments in different data centers that are all in different geographical regions: e.g., one data center on the US east coast, one on the US west coast, one in Europe, and so on.

These approaches all have advantages and drawbacks. One dimension to consider is how isolated one environment is from another: e.g., could a bug in the dev environment somehow affect prod with this approach? Another dimension to consider is resiliency: e.g., how well does this approach tolerate a server, network, or even entire data center going down? The preceding list is roughly sorted from least isolated and resilient to most isolated and resilient: that is, logical environments offer the least isolation and resiliency, whereas separate data centers in multiple regions offer the most. Separate data centers in multiple regions is also the only approach that can reduce latency to your users and allow you to comply with local laws and regulations.

However, the flip side of the coin is operational overhead: e.g., how many extra servers, networks, accounts, and data centers do you have to set up, maintain, and pay for? The preceding list is also roughly sorted from least to most overhead: that is, logical environments entail very little overhead, whereas separate data centers in multiple regions is the most time-consuming and expensive. Separate data centers in multiple regions is also an approach that may require you to redesign your entire architecture, something you’ll learn more about in the next section.

Challenges with Multiple Environments

Having multiple environments can offer a lot of benefits—e.g., as you just saw, it helps you to isolate tests, isolate products and teams, reduce latency, and so on—but multiple environments can also introduce a number of new challenges. Here are a few of the most common ones:

  • Increased operational overhead

  • Increased data storage complexity

  • Increased application configuration complexity

Let’s go through these one at a time, starting with increased operational overhead.

Increased operational overhead

Perhaps the most obvious challenge with multiple environments is that you now have more moving parts to set up and maintain: you may need to run more servers, set up more data centers, hire more people around the world, and so on. Using the cloud allows you to offload much of this overhead onto the cloud provider, but even creating and managing multiple AWS accounts results in more overhead: each account needs its own authentication, authorization, networking, security tooling, audit logging, and so on.

But even this overhead may be just a drop in the bucket compared to the overhead of having to change your entire architecture to work across environments that are geographically separated, as discussed in the next section.

Increased data storage complexity

Having multiple data centers around the world, so they are closer to your users, reduces the latency between the data center and those users, but it may also increase the latency between the different parts of your software running in different data centers. This may force you to rework your software architecture completely, especially when it comes to data storage.

For example, let’s say you had a web app that needed to look up data in a database before sending a response. If the database you were talking to was in the same data center as the web app, then as per Table 9, the networking overhead for the database query would be roughly 500,000 ns (0.5 ms) for each packet round trip, which is negligible for most web apps. However, if you had multiple data centers around the world, and the database you were talking to was on a different continent, now the networking overhead could be as high as 183,000,000 ns (183 ms), a 366x increase for every single packet you send. Even a single database query will typically require multiple packets to make round trips, so this extra overhead adds up very quickly, and it can make your webapp unacceptably slow.

No problem, you say, you’ll just ensure that the database is always in the same data center as the web app. But that means you now need one database per environment, rather than just one database total, and that may require you to radically change how you store and retrieve data, including how you generate primary keys (the typical auto incrementing primary key will no longer work with multiple data stores), how you handle data consistency and concurrency (uniqueness constraints, foreign key constraints, locking, and transactions all become very difficult with multiple databases), how you look up data (querying and joining multiple databases is complicated), and so on.
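One common mitigation for the primary key problem, to pick just one of these issues, is to generate globally unique IDs in the application rather than relying on the database’s auto incrementing counter; here’s a minimal sketch using Node.js’s built-in crypto module (a general technique, not something specific to this blog post series):

const { randomUUID } = require("crypto");

// A UUID can be generated in any data center without coordinating with the
// others, unlike an auto incrementing database key
const userId = randomUUID(); // e.g., "3b241101-e2bb-4255-8caf-4136c566a962"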

Some companies choose to avoid these challenges by only running in active/standby mode: that is, one data center is active and is serving live traffic, and the other is a standby that only serves live traffic if the active data center goes down, in which case you make the standby data center the new active one. That way, you are only ever reading/writing data in one location at a time. This is useful to boost resiliency, but doesn’t help with latency or local laws and regulations. If you have to have multiple data centers live at the same time, known as active/active, then you will most likely have to rearchitect your data storage patterns to work across multiple geographies. You’ll learn more about data storage in Part 9 [coming soon].

Key takeaway #2

Breaking up your deployment into multiple regions allows you to reduce latency, increase resiliency, and comply with local laws and regulations, but usually at the cost of having to rework your entire architecture.

Increased application configuration complexity

One of the unexpected costs of multiple environments is figuring out how to configure your application differently in each environment. In the early stages of a company, you typically only have a handful of configuration settings, so managing them is pretty straightforward. However, as your company grows, you typically end up with more environments, a more complex architecture, and more demanding security and performance requirements, and as a result, the number of configuration settings can explode. Here are just a few of the most common settings that may differ from environment to environment:

Performance settings

CPU, memory, hard-drive, garbage collection.

Security settings

Database passwords, API keys, TLS certificates.

Networking settings

IP addresses, ports, domain names.

Service discovery settings

The IP addresses, ports, and domain names to use for services you rely on.

Feature settings

Features to turn on and off (i.e., feature toggles).

In a large company, it’s not unusual to have thousands of configuration settings to manage, and without the right tooling and processes, this is a common source of problems. In fact, based on Google’s analysis of thousands of postmortems, configuration changes are one of the biggest causes of outages, as shown in Table 10:

Table 10. Top causes of outages at Google, 2010-2017[33]
Cause                      Percent of outages
Binary push                37%
Configuration push         31%
User behavior change       9%
Processing pipeline        6%
Service provider change    5%
Performance decay          5%
Capacity management        5%
Hardware                   2%

Google found that pushing configuration changes is just as risky as pushing code changes (pushing a new binary), and the longer a system has been around, the more configuration changes tend to become the dominant cause of outages. That means it’s worth taking some time to think through how you’re going to manage and validate your application configuration.

Key takeaway #3

Configuration changes are just as likely to cause outages as code changes.

Broadly speaking, there are two methods for configuring applications:

At build time: configuration files checked into version control

The most common way to handle configuration is to have configuration files that are checked into version control, along with the rest of the code for the app. These files can be in the same language as the app itself: e.g., Ruby on Rails apps use configuration files defined in Ruby. However, as config files are often shared across software written in multiple different languages, it’s more common to use language-agnostic formats such as JSON, YAML, TOML, XML, Cue, Jsonnet, or Dhall.

At run time: configuration data read from a data store

Another way to configure your app is to have the app read from a data store when the app is booting up or while it is running. One option is to use a general-purpose data store, such as reading from MySQL, PostgreSQL, or Redis. However, the more common option is to use a data store specifically designed for configuration data, and in particular, a data store that can update your app quickly when a configuration value changes. For example, data stores such as Consul, etcd, and ZooKeeper allow you to subscribe to change notifications, so your app is notified as soon as any configuration changes, which is why these data stores are often used for service discovery.
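For example, here’s a rough sketch of reading a configuration value from Consul’s key/value store over its HTTP API, assuming a Consul agent running on localhost and a hypothetical key name:

# Read a config value (returned as a JSON payload with a base64-encoded Value)
$ curl http://localhost:8500/v1/kv/my-app/feature-toggles

# Long-poll for changes by passing the X-Consul-Index value from the previous
# response as the index parameter (a Consul "blocking query")
$ curl "http://localhost:8500/v1/kv/my-app/feature-toggles?index=42&wait=60s"

The second request doesn’t return until the value changes (or the wait time elapses), which is what lets your app react to configuration changes almost immediately.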

I recommend using build-time configuration for as much of your configuration as possible. That way, you can treat it just like the rest of your code: that is, every configuration change ends up in version control, gets code reviewed, and goes through your entire CI/CD pipeline (including all the automated tests), as per Part 4. I only recommend using run-time configuration for use cases where the configuration changes very frequently: e.g., service discovery and feature toggles. In those cases, having to do a commit and deploy to keep up is too slow, so using a data store designed for those use cases is the better fit.

Now that you’ve seen the reasons to deploy into multiple environments, the different options for setting up multiple environments, and the challenges involved with multiple environments, let’s try an example: using multiple AWS accounts as multiple environments.

Example: Set Up Multiple AWS Accounts

When you first start using AWS, you create a single account, and deploy everything into it. This works well up to a point, but as your company grows, you’ll want to set up multiple environments due to the requirements mentioned earlier: isolating tests, isolating products and teams, latency, resiliency, and so on. While you can meet some of these requirements in a single AWS account—e.g., it’s easy to use multiple availability zones and regions in a single AWS account to get better latency and resiliency—some of the other requirements can be tricky.

In particular, isolating tests and isolating products and teams can be hard to do in a single account. This is because just about everything in an AWS account is managed via API calls, and by default, AWS APIs have no first-class notion of environments, so your changes can affect anything in the entire account. That is, the default behavior of IAM does not give you isolation between environments (as IAM has no notion of environments). For example, if you give one team permissions to manage EC2 instances, it’s possible that, due to human error, someone will accidentally modify the EC2 instances of the wrong team; or perhaps your automated tests have permissions to spin up and tear down EC2 instances, but due to a bug in your test code, you accidentally modify the EC2 instances in production.

Don’t get me wrong: IAM is powerful, and using various IAM features such as tags, conditions, permission boundaries, and SCPs, it is possible to create your own notion of environments and enforce isolation between them, even in a single account. However, precisely because IAM is powerful, it’s hard to get this right. For example, Figure 63 is an image from the AWS IAM policy evaluation logic documentation that shows just one small part of how IAM policies are evaluated:

Figure 63. IAM policy flow chart

To put it mildly, IAM is complicated. Many teams have gotten IAM permissions wrong—especially IAM permissions related to secrets management and IAM permissions that grant IAM permissions, where now you have to think at multiple levels—and this sometimes leads to disastrous results.
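To give a sense of the complexity, here’s a sketch (an illustration, not a policy from this blog post series) of the kind of tag-based IAM policy you’d have to write, and get exactly right, just to scope a few EC2 actions to instances tagged for a dev environment:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StartInstances",
        "ec2:StopInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/environment": "dev" }
      }
    }
  ]
}

You’d need conditions like this for every action, resource type, and team, the tags would have to be applied consistently everywhere, and anyone with broader permissions could simply change the tags, which is part of why this approach is so easy to get wrong.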

While you can’t avoid IAM entirely, the good news is that, for the very common use case of creating separate environments, there is a simpler alternative: use separate AWS accounts. By default, granting someone permissions in one AWS account does not give them any permissions in any other account. In other words, using multiple AWS accounts gives you isolation between environments by default, so you’re much less likely to get it wrong.

This is why AWS itself recommends a multi-account strategy. With this strategy, you use AWS Organizations to create and manage your AWS accounts, with one account at the root of the organization, called the management account, and all other accounts (e.g., dev, stage, prod) as child accounts of the root, as shown in Figure 64:

Figure 64. AWS multi-account structure

Let’s follow the multi-account strategy and create some child accounts.

Create child accounts

In Part 1, you created an AWS account by signing up on the AWS website. To create more AWS accounts, instead of signing up again and again on the website, let’s treat that initial AWS account you created as the management account, and use AWS Organizations to create all the other accounts as child accounts. This will give you centralized management of all your accounts, as you’ll be able to see and access them all from your management account, and centralized billing, as all the charges will roll up to the management account, so everything goes through one credit card (rather than a dozen cards if you have a dozen accounts).

Typically, the management account is only used to create and manage other AWS accounts. Since the account has such powerful permissions, you strictly limit who can access it, and do not run any other workloads or environments in that account.

Therefore, as a first step, undeploy everything from your AWS account that you deployed in earlier chapters: e.g., run tofu destroy on any OpenTofu modules you previously deployed and use the EC2 Console to manually undeploy anything you deployed via Ansible, Bash, etc. When you’re done, you should essentially have an empty AWS account, with your IAM user as the only one who can access it.

Example Code

As a reminder, you can find all the code examples in the blog post series’s sample code repo in GitHub.

Next, let’s use AWS Organizations to configure your current account as a management account and create three child accounts below it: development, staging, and production. Head into the folder where you’ve been working on the code samples for this blog post series and make sure you’re on the main branch, with the latest code:

$ cd fundamentals-of-devops

$ git checkout main

$ git pull origin main

Next, create a new ch6 folder for this blog post’s code examples, and within the ch6 folder, create a tofu/live/child-accounts folder:

$ mkdir -p ch6/tofu/live/child-accounts

$ cd ch6/tofu/live/child-accounts

Within the child-accounts folder, you will create an OpenTofu root module that can create the child accounts. Under the hood, this root module will use an OpenTofu module called aws-organizations that’s in the blog post series’s sample code repo in the ch6/tofu/modules/aws-organizations folder. The aws-organizations module uses AWS Organizations to create three child accounts: development, staging, and production.

Within the child-accounts folder, create a main.tf file with the contents shown in Example 110:

Example 110. Create child accounts using the aws-organizations module (ch6/tofu/live/child-accounts/main.tf)
provider "aws" {

  region = "us-east-2"

}



module "child_accounts" {

  (1)

  source = "github.com/brikis98/devops-book//ch6/tofu/modules/aws-organization"



  (2)

  # Set to false if you already enabled AWS Organizations in your account

  create_organization = true





  (3)

  # TODO: fill in your own account emails!

  dev_account_email   = "username+dev@email.com"

  stage_account_email = "username+stage@email.com"

  prod_account_email  = "username+prod@email.com"

}

The preceding code does the following:

(1) Use the aws-organizations module from the series’s sample code repo.
(2) Before you can use AWS Organizations, you must enable it in your AWS account. If you already enabled it, set create_organization to false; otherwise, leave it set to true and the aws-organizations module will enable it for you.
(3) Configure the root user email addresses for the dev, stage, and prod accounts. Note that you’ll have to fill in your own email addresses here, and that each email address must be different: AWS requires a globally unique email address for the root user of each AWS account.
Create multiple email aliases for a single email address

Some email providers, such as GMail, ignore any text in an email address after a plus sign, which allows you to create multiple aliases for a single email address: e.g., if your email address is username@gmail.com, you can use username+dev@gmail.com, username+stage@gmail.com, and username+prod@gmail.com as three separate, unique email addresses with AWS, but all the emails will go to the same underlying account in GMail.

You should also create an outputs.tf file with the output variables shown in Example 111:

Example 111. Proxy through output variables from the aws-organizations module (ch6/tofu/live/child-accounts/outputs.tf)
(1)



output "dev_account_id" {

  description = "The ID of the dev account"

  value       = module.child_accounts.dev_account_id

}



output "stage_account_id" {

  description = "The ID of the stage account"

  value       = module.child_accounts.stage_account_id

}



output "prod_account_id" {

  description = "The ID of the prod account"

  value       = module.child_accounts.prod_account_id

}



(2)



output "dev_role_arn" {

  description = "The ARN of the IAM role you can use to manage dev from mgmt"

  value       = module.child_accounts.dev_role_arn

}



output "stage_role_arn" {

  description = "The ARN of the IAM role you can use to manage stage from mgmt"

  value       = module.child_accounts.stage_role_arn

}



output "prod_role_arn" {

  description = "The ARN of the IAM role you can use to manage prod from mgmt"

  value       = module.child_accounts.prod_role_arn

}

The preceding code adds two types of output variables:

(1) The account IDs for the dev, stage, and prod accounts.
(2) The ARNs of IAM roles you can use to manage the dev, stage, and prod accounts. When you create child accounts using AWS Organizations, it automatically creates an IAM role named OrganizationAccountAccessRole within each child account, and it configures that IAM role so you can assume it from the management account to get admin permissions in that child account.
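For example, once the child accounts exist, you can get temporary credentials for one of them from the management account by assuming the role with the AWS CLI (the account ID below is a placeholder for the dev_account_id output):

$ aws sts assume-role \
    --role-arn arn:aws:iam::<DEV_ACCOUNT_ID>:role/OrganizationAccountAccessRole \
    --role-session-name dev-admin

You’ll see a more convenient way to do this, via AWS profiles, later in this blog post.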

Deploy the child-accounts module as usual: authenticate to AWS as described in Authenticating to AWS on the command line, and run init and apply:

$ tofu init

$ tofu apply

After apply completes, you should see your output variables:

Apply complete! Resources: 4 added, 0 changed, 0 destroyed.



Outputs:



dev_role_arn = "arn:aws:iam::222222222222:role/OrganizationAccountAccessRole"

dev_account_id = "222222222222"

prod_role_arn = "arn:aws:iam::444444444444:role/OrganizationAccountAccessRole"

prod_account_id = "444444444444"

stage_role_arn = "arn:aws:iam::333333333333:role/OrganizationAccountAccessRole"

stage_account_id = "333333333333"

Congrats, you just created new AWS accounts using code! And you can use code like this to create new accounts and manage them in the future.

Get your hands dirty

Here are some exercises you can try at home to get a better feel for managing multiple AWS accounts:

  • When you create a child account using AWS Organizations, it will configure a root user for the account with the email address you specify, but that root user will not have a password. To access the root user, you need to go through the root user password reset flow, and while you’re at it, enable MFA for the root user, too.

  • As part of a multi-account strategy, in addition to workload accounts such as dev, stage, and prod, AWS recommends several foundational accounts, such as a log archive account for consolidating operational and audit logs, a security tooling (audit) account for consolidating data from security tools (e.g., Security Hub, GuardDuty, etc.), a backup account for consolidating all your backups, and so on. Consider creating your own aws-organizations module that sets up all these foundational accounts for you.

  • Configure the child-accounts module to store its state in an S3 backend in the management account.

Of course, these new AWS accounts aren’t useful until you deploy infrastructure into them, and to do that, you need to learn how to access them.

Access your child accounts

Now that you’ve created child accounts, to access them, you can assume the IAM role that AWS Organizations creates for you automatically in those accounts. There are many different ways to assume an IAM role. For example, in the AWS Web Console, you can click on your username in the top right corner and choose "Switch role," as shown in Figure 65:

Figure 65. Switching roles in the AWS Web Console

On the next page, enter the ID of one of your child accounts (e.g., grab the dev account ID from the dev_account_id output variable), enter OrganizationAccountAccessRole as the role name, give it a display name (e.g., dev-admin), and optionally pick a color to use in the console, as shown in Figure 66:

Figure 66. Fill in the IAM role information for one of the child accounts

Click the Switch Role button, and you should see the AWS Web Console for that child account. At this point, you have admin permissions in that child account, and you can do whatever you want. Feel free to repeat this process with the staging and production accounts, picking a different display name and display color for each one.

There are several ways to assume IAM roles in the terminal, too. One option is to configure an AWS profile for each child account. For example, to create a profile for your dev account, open up the AWS config file, which lives at ~/.aws/config (if the file doesn’t exist already, create it), and add the code shown in Example 112 to it:

Example 112. Configure an AWS profile for the dev account (~/.aws/config)
[profile dev-admin]                                           (1)

role_arn=arn:aws:iam::<ID>:role/OrganizationAccountAccessRole (2)

credential_source=Environment                                 (3)

The preceding code does the following:

(1) Create a profile called dev-admin. You can name profiles whatever you want.
(2) Configure the profile to assume this IAM role. This should be the ARN from the dev_role_arn output variable, and <ID> should be the dev account ID.
(3) Look for AWS credentials in the environment. This allows you to use the same environment variables from your management account for authentication.

Most tools that talk to AWS APIs give you a way to specify the profile to use. For example, with the AWS CLI, you can use the --profile argument, as shown in Example 113:

Example 113. Use the AWS CLI with a profile
$ aws sts get-caller-identity --profile dev-admin

{

    "UserId": "<USER>",

    "Account": "<ACCOUNT_ID>",

    "Arn": "<ARN>"

}

The get-caller-identity command returns information about the authenticated user, so if you configured the profile correctly, you should see your dev account’s ID in Account, and the ARN of the OrganizationAccountAccessRole dev IAM role in Arn.

Create analogous profiles for the stage and prod accounts. In the next section, you’ll see how to use these profiles to deploy infrastructure into the dev, stage, and prod accounts.

Deploy into your child accounts

Let’s now try to deploy the lambda-sample module from earlier blog posts into the dev, stage, and prod accounts. Copy the lambda-sample module from Part 5, as well as the test-endpoint module it relies on, into a new ch6/tofu/ folder:

$ cd fundamentals-of-devops

$ mkdir -p ch6/tofu/live

$ cp -r ch5/tofu/live/lambda-sample ch6/tofu/live

$ mkdir -p ch6/tofu/modules

$ cp -r ch5/tofu/modules/test-endpoint ch6/tofu/modules

Next, make three changes to the lambda-sample module:

  1. Update the backend configuration: Update the key in the backend configuration in backend.tf to point to ch6 instead of ch5, as shown in Example 114:

    Example 114. Update the backend configuration (ch6/tofu/live/lambda-sample/backend.tf)
        key = "ch6/tofu/live/lambda-sample"
  2. Add support for profiles: To be able to use different AWS profiles to authenticate to different AWS accounts, add a new input variable in variables.tf that allows you to specify the profile name, as shown in Example 115:

    Example 115. Add an input variable for specifying the AWS profile to use (ch6/tofu/live/lambda-sample/variables.tf)
    variable "aws_profile" {
    
      description = "If specified, the profile to use to authenticate to AWS."
    
      type        = string
    
      default     = null
    
    }

    Next, in main.tf, set the profile parameter in the AWS provider block to use the new aws_profile input variable, as shown in Example 116:

    Example 116. Configure the profile in the AWS provider using the new input variable (ch6/tofu/live/lambda-sample/main.tf)
    provider "aws" {
    
      region  = "us-east-2"
    
      profile = var.aws_profile
    
    }

    This will allow you to optionally specify the AWS profile to use via a -var aws_profile=xxx flag when running tofu apply.

  3. Show the environment name. Update the text the Lambda function returns in index.js to include the value of the NODE_ENV environment variable, as shown in Example 117:

    Example 117. Include the current environment name in the text returned by the Lambda function (ch6/tofu/live/lambda-sample/src/index.js)
    exports.handler = (event, context, callback) => {
    
      callback(null, {statusCode: 200, body: `Hello from ${process.env.NODE_ENV}!`});
    
    };

    Next, instead of hard-coding the NODE_ENV environment variable to "production" in main.tf, set it dynamically based on the value of terraform.workspace, as shown in Example 118:

    Example 118. Set NODE_ENV dynamically to the value of terraform.workspace (ch6/tofu/live/lambda-sample/main.tf)
    module "function" {
    
      source = "github.com/brikis98/devops-book//ch3/tofu/modules/lambda"
    
    
    
      # ... (other params omitted) ...
    
    
    
      environment_variables = {
    
        NODE_ENV = terraform.workspace
    
      }
    
    }

    What is terraform.workspace? That’s what we’ll discuss next.

In OpenTofu, you can use workspaces to manage multiple deployments of the same configuration. Each workspace has its own state file, so it represents a separate copy of all the infrastructure, and each workspace has a unique name, which is returned by terraform.workspace. If you don’t specify a workspace explicitly, as you’ve been doing so far throughout this blog post series, then you end up using a workspace called "default." In this blog post, instead of using the default workspace, let’s create one workspace per environment.

First, authenticate to your management account as usual (as described in Authenticating to AWS on the command line), and run tofu init to initialize the backend, modules, and providers:

$ cd ch6/tofu/live/lambda-sample

$ tofu init

Next, use the tofu workspace new command to create a new workspace:

$ tofu workspace new development

Created and switched to workspace "development"!

The preceding command creates a workspace called "development," which you can use to store the state for your development environment. The idea is to deploy the development environment into the new development account you created earlier. You can do this by running tofu apply and telling OpenTofu to authenticate to your development account by setting the aws_profile input variable to the name of the profile you created for the development account in the previous section:

$ tofu apply -var aws_profile=dev-admin

You should see a plan output to create the Lambda function, API Gateway route, and so on. If everything looks good, type in yes and hit Enter. When apply completes, you should see the api_endpoint output variable, which contains a URL you can try to access the Lambda function. Try this URL out:

$ curl <DEV_URL>

Hello from development!

Congrats, you now have a serverless web app running in your development account! Let’s now try to deploy the same app into the staging account. First, create a new workspace for staging:

$ tofu workspace new staging

Created and switched to workspace "staging"!

Next, run apply again, but this time, set aws_profile to the name of the profile you created for your staging account:

$ tofu apply -var aws_profile=stage-admin

You should see a plan output that shows OpenTofu will create all the resources (Lambda function, API Gateway route, etc.) again from scratch. That’s because each workspace has its own state file, so when you’re in the staging workspace, OpenTofu doesn’t look at any of the infrastructure you deployed in the development workspace. If everything looks good with the plan, type in yes and hit Enter. When apply completes, you should have a different URL you can try:

$ curl <STAGE_URL>

Hello from staging!

And there you go, you now have a second environment running in a second AWS account! Complete the picture by deploying into the third environment, production (note the use of the full name "production" for this workspace, rather than "prod," as that results in NODE_ENV being set to "production," which is the value you want to use with Node.js apps in production):

$ tofu workspace new production

$ tofu apply -var aws_profile=prod-admin

$ curl <PROD_URL>

When you’re done, you should see "Hello from production!" At this point, you have three environments, across three AWS accounts, with a separate copy of the serverless webapp in each one, and the OpenTofu code to manage it all.

Get your hands dirty

Here are some exercises you can try at home to get a better feel for managing multiple environments with OpenTofu and AWS:

  • Using workspaces to manage multiple environments has some drawbacks: one of the biggest is that all the state is stored in a single backend (a single S3 bucket in the management account), so your environments aren’t as isolated or independent as they should be. See this blog post to learn about some of the other drawbacks to workspaces, as well as alternative approaches for managing multiple environments, such as Terragrunt and Git branches.

  • Update the CI / CD configurations from Part 5 to work with multiple AWS accounts. You’ll need to create OIDC providers and IAM roles in each AWS account and have the CI/CD configuration authenticate to the right account depending on the change: for example, you could configure tofu test to run in the development account for changes on any branch; you could run plan and apply in the staging account for any PR against main; and you could run plan and apply in the production account whenever you push a Git tag of the format release-xxx (e.g., release-v3.1.0).

Use different configurations for different environments

You now have three copies of the serverless webapp running, but all three are configured exactly the same way. Let’s see what it might look like to configure the app differently in each environment. To keep things simple, we’ll use JSON configuration files checked into version control. First, create a folder called config for the configuration files:

$ mkdir -p src/config

Within the config folder, create a file called development.json, with the contents shown in Example 119:

Example 119. Create a config file for the development environment (ch6/tofu/live/lambda-sample/src/config/development.json)
{

  "text": "dev config"

}

This file contains just a single config entry, text, which is the text the web app should return in that environment. Create analogous config/staging.json and config/production.json files, but with text updated to different values in each environment.
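For example, to match the output you’ll check later in this section, staging.json and production.json could look like this:

ch6/tofu/live/lambda-sample/src/config/staging.json:

{
  "text": "stage config"
}

ch6/tofu/live/lambda-sample/src/config/production.json:

{
  "text": "prod config"
}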

Next, update index.js to load the config file for the current environment and return the text value in the response, as shown in Example 120:

Example 120. Load the config file for the current environment (ch6/tofu/live/lambda-sample/src/index.js)
const config = require(`./config/${process.env.NODE_ENV}.json`)          (1)



exports.handler = (event, context, callback) => {

  callback(null, {statusCode: 200, body: `Hello from ${config.text}!`}); (2)

};

There are two updates to the app:

(1) Read the NODE_ENV environment variable and load the .json file of the same name from the config folder. This should result in reading in development.json in the development environment, staging.json in the staging environment, and so on.
(2) Read the text value from the config file and return it as part of the HTTP response.

Now it’s time to deploy this change in each environment—that is, in each workspace. To see all your workspaces, use the workspace list command:

$ tofu workspace list

  default

  development

  staging

* production

You can switch to any existing workspace using the workspace select command:

$ tofu workspace select development

Switched to workspace "development".

Now, any OpenTofu command you run will run against the development workspace. Run apply with the aws_profile variable set to the dev profile to deploy the changes to the development environment:

$ tofu apply -var aws_profile=dev-admin

When apply completes, open the URL in the api_endpoint output variable, and you should see "Hello from dev config!" Use workspace select and apply (with the aws_profile variable properly set) to deploy the changes in staging and production as well. When you test the URLs for those environments, you should see the text values you put into those configs: e.g., "Hello from stage config!" and "Hello from prod config!" Congrats, you’re now loading different configuration files in different environments!

Close your child accounts

When you’re done testing and experimenting with multiple AWS accounts, you may wish to close some or all of the new child accounts. Going forward, just about all the examples in this blog post series will deploy into just a single account (to keep things simple), so you don’t need all three running. Note that AWS does not charge anything extra for the accounts themselves, but you may want to clean them up to keep your security surface area smaller, and to ensure you don’t accidentally leave resources running in those accounts (e.g., EC2 instances), as AWS does charge for those as usual.

First, commit all your code changes to Git: that way, if you ever want to bring back the three accounts, you’ll have all the code to do it.

Second, undeploy the infrastructure in each workspace. To do that, use workspace select to select each environment and then run tofu destroy, making sure to set the aws_profile variable to the profile you created for that environment. For example, here is how you undeploy the infrastructure in the development workspace:

$ tofu workspace select development

$ tofu destroy -var aws_profile=dev-admin

Repeat the same workspace select and tofu destroy commands for the staging and production environments.
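That is, assuming the profile names you configured earlier:

$ tofu workspace select staging
$ tofu destroy -var aws_profile=stage-admin

$ tofu workspace select production
$ tofu destroy -var aws_profile=prod-admin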

Third, run tofu destroy on the child-accounts module to start the process of closing the child accounts:

$ cd ../child-accounts

$ tofu destroy

When you run destroy, AWS will initially mark the child accounts as "suspended" for 90 days, which is a fail-safe that gives you a chance to recover anything you may have forgotten in those accounts before they are closed forever. After 90 days, AWS will automatically close those accounts.

Destroy may temporarily fail if you created a new AWS Organization

If you set create_organization to true in the child-accounts module, the destroy operation will initially fail, as you can’t disable AWS Organizations until all child accounts in the Organization are closed. That’s OK: despite the error, the accounts will still be marked as suspended, and after 90 days, you should be able to run destroy successfully (or you can simply ignore the error, as there’s no charge for AWS Organizations, so it’s also OK to leave it enabled).

Breaking Up Your Codebase

Now that you’ve seen how to break up your deployments into multiple environments, let’s talk about how to break up your codebase. In the next several sections, you’ll learn why you may want to break up your codebase, how to break up your codebase, some of the challenges with breaking up a codebase, and finally, you’ll go through an example of deploying several microservices in Kubernetes.

Why Break Up Your Codebase

Here are the most common reasons to break up your codebase:

  • Managing complexity

  • Isolating products and teams

  • Handling different scaling requirements

  • Using different programming languages

Let’s dive into each of these, starting with managing complexity.

Managing complexity

Software development doesn’t happen in a chart, an IDE, or a design tool; it happens in your head.

— Venkat Subramaniam and Andy Hunt
Practices of an Agile Developer (Pragmatic Programmers)

The most common reason to break up a codebase is that it doesn’t fit in your head. Once a codebase gets big enough, no one can understand all of it. There are just too many parts, too many interactions, and too many concepts to keep straight, and if you have to deal with all of them at once, your pace of development will slow to a crawl, and the number of bugs will explode. Consider Table 11, which is a table from the book Code Complete that shows the typical number of bugs in software projects of various sizes:

Table 11. Bug density in software projects of various sizes[34]
Project size (lines of code)    Bug density (bugs per 1K lines of code)
< 2K                            0 – 25
2K – 6K                         0 – 40
16K – 64K                       0.5 – 50
64K – 512K                      2 – 70
> 512K                          4 – 100

It’s no surprise that larger software projects have more bugs, but note that Table 11 shows that larger projects also have a higher bug density, or the number of bugs per 1,000 lines of code. To put this into perspective, take a developer, and have them add 100 lines of code to a small software project (<2K lines of code), and on average, you’ll find that the new code has no new bugs, or maybe one or two. Take the same developer and have them add 100 lines of code to a large software project (>512K lines of code), and on average, you’ll find that they have introduced as many as ten new bugs. Same developer, same number of lines of code, but 5-10x as many bugs. That’s the cost of complexity.

There is a limit to how much code complexity the human mind can handle. In fact, in that same Code Complete book, author Steve McConnell defines "managing complexity" as "the most important technical topic in software development." There are many techniques for managing complexity, but almost all of them come down to one basic principle: divide and conquer. That is, find a way to organize your code so that you can focus on one small part at a time, while being able to safely ignore the rest. One of the main goals of most software abstractions, including object-oriented programming, functional programming, libraries, and microservices is to break up the codebase into discrete pieces, so that each piece can hide its implementation details, which are fairly complicated, and expose to you some sort of interface, which is much simpler.

Isolating products and teams

Another common reason to break up a codebase has less to do with technical challenges, and more to do with social and human challenges: the desire to allow teams to work independently from each other and to have total ownership of their part of the product. As your company grows, different teams will start to develop preferences for different product development practices: that is, different practices around how they design their systems and architecture, how they test and review their code, how often they deploy, and how much tolerance they have for bugs and outages.

If you do all your work in a single, tightly-coupled codebase, then a problem in any one team or product can affect all the other teams and products, and that’s not always desirable. For example, if you open a pull request, and an automated test fails in some totally unrelated product, should that block you from merging? If you deploy new code that includes changes to ten different products, and one of them has a bug, should you roll back the changes for all ten products? If one team wants to deploy dozens of times per day, but another team has a product in a regulated industry where they can only deploy once per quarter, should everyone be stuck with the slower deployment cadence? Splitting up the codebase allows you to avoid difficult questions like these and set up separate processes for each team that meet their specific needs.

Note that teams working independently from each other doesn’t mean they never interact. It’s just that the interactions are now limited to well-defined interfaces: e.g., the API of a library or a web service. This lets you benefit from the output of that team’s work (e.g., the data returned by their API) without being subject to the particular inputs they need to make that work possible. In fact, you do this all the time, even in small companies whenever you add a dependency on a third party, such as an open source library or a vendor’s API: you’re able to benefit from the work they are doing, while keeping all your coding practices (testing, code reviews, deployment cadence, etc.) largely separate.

Handling different scaling requirements

As your user base grows, you will hit more and more scaling challenges to handle the extra load. In some cases, you may find that some parts of your software have different scaling requirements than other parts: for example, one part of your code may benefit from distributing work across a large number of CPUs on many servers, whereas another part of your code may benefit from a large amount of memory on a single server. If everything is in one codebase—and more to the point, if everything is deployed together—meeting these conflicting scaling requirements can be difficult. As a result, many companies break up their codebase so that the different parts of the code can be deployed and scaled independently.

Using different programming languages

Most companies start with a single programming language, but as you grow, you may end up using multiple programming languages: sometimes, this is because different developers at your company prefer different languages; sometimes, this is because you acquired a company that uses a different programming language; sometimes, this is because different languages may be a better fit for different problems (use the right tool for the job). Each time you introduce a new language, you have a new app to deploy, configure, update, and so on, and as you’ll see shortly, this typically means your codebase now consists of multiple services to manage.

Now that you’ve seen the most common reasons to break up a codebase, let’s talk about how to actually do it.

How to Break Up Your Codebase

Broadly speaking, there are two approaches to breaking up a codebase:

  • Split into multiple libraries.

  • Split into multiple services.

Note that these are not mutually exclusive options: many companies break their codebase up into both multiple libraries and multiple services. The following sections will go into detail on these two options, starting with breaking up the codebase into multiple libraries.

Breaking a codebase into multiple libraries

Just about all codebases are broken up into various abstractions, such as functions, interfaces, classes, and modules (depending on the programming language you’re using). However, if the codebase gets big enough, you may choose to break it up even further into libraries. When I say library, what I mean is a unit of code that can be developed independently from the other units because it meets the following properties:

It exposes a well-defined API to the outside world

The library exposes an interface with well-defined inputs and outputs. All the code outside the library must use this interface to interact with the library.

Its implementation can be developed independently of the rest of the codebase

The internals of the library are hidden from the outside world and can be developed independently from all other libraries, so long as the library still fulfills the promises it makes in its interface.

You can only depend on versioned artifacts from a library and not its source code

The different parts of your codebase depend on versioned artifacts produced by other parts of the codebase, rather than depending directly on each other’s code. The exact type of artifact depends on the programming language: for example, in Java, that might be a .jar file; in Ruby, that might be a Ruby Gem; and in JavaScript, that might be an NPM module. As long as you use artifact dependencies, the underlying code can all live in a single repo or in multiple repos; however, using multiple repos tends to be more common with libraries, as it ensures you don’t accidentally create source code dependencies, and it gives teams even more independence.

For example, you might start with a code base that has three parts, A, B, and C. Initially, part A depends directly on the source code of B and C, as shown in Figure 67:

Part A depends on the source code of parts B and C
Figure 67. Part A depends on the source code of parts B and C

You could break up this codebase by turning B and C into libraries that publish artifacts (e.g., if this was Java, the artifacts would be b.jar and c.jar), and update A to depend on a specific version of these artifacts, instead of the source code, as shown in Figure 68:

Part A depends on artifacts published by libraries B and C
Figure 68. Part A depends on artifacts published by libraries B and C
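In the Node.js world, for example, here’s a minimal sketch of what artifact dependencies might look like. The package names lib-b and lib-c, their version numbers, and the functions they export are made up purely for illustration:

// Hypothetical sketch: part A depends on versioned NPM artifacts of B and C.
// In A's package.json, you would declare something like (made-up names/versions):
//   "dependencies": { "lib-b": "^2.1.0", "lib-c": "^1.4.2" }
// A's code then imports the published artifacts, never B's or C's source files:
const b = require('lib-b');
const c = require('lib-c');

console.log(b.doSomething(), c.doSomethingElse());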

The advantage of breaking up your code into libraries is that it allows you to focus on one library at a time—one small part of your codebase—and safely ignore everything else. Each team can develop the internals of a library using whatever practices they want (e.g., for testing, code reviews, etc.), so long as the public interface fulfills its promises. Moreover, whereas a source code change immediately affects everyone that depends directly on that source code, a change to a library only affects users after (a) you’ve released a new versioned artifact and (b) users of your library have explicitly and deliberately chosen to pull in that new version. This allows teams to work more independently from each other.

Key takeaway #4

Breaking up your codebase into libraries allows developers to focus on one smaller part of the codebase at a time.

Almost all software projects these days depend on libraries: namely, open source libraries. For example, the Node.js sample app you’ve been working on throughout this blog post series now depends on Express.js, an open source web framework that you pull in through a versioned artifact (an NPM module). The maintainers of Express.js are able to develop this library completely independently from all the projects that depend on it, following their own coding conventions, testing practices, release cadence, and so on. That doesn’t mean you need to open source your own code, but if you break up your codebase into libraries, you can likewise benefit from being able to develop each piece independently.

If you do break your codebase up into libraries, I recommend following two best practices: semantic versioning and automatic updates.

Semantic versioning (SemVer) is a set of rules for how to assign version numbers to your code. The goal is to communicate to users if a new version of your library has backward incompatible changes: that is, changes that would require the user to update how they use your library in their code in order to make use of this new version. Typically, this happens when you make changes to the API: e.g., you remove something that was in the API before, or you add something new to the API that is now required. With SemVer, you use version numbers of the format MAJOR.MINOR.PATCH (e.g., 1.2.3), where you increment these three parts of the version number as follows:

  • Increment the MAJOR version when you make incompatible API changes.

  • Increment the MINOR version when you add functionality in a backward compatible manner.

  • Increment the PATCH version when you make backward compatible bug fixes.

For example, if your library is currently at version 1.2.3, and you have made a backward incompatible change to the API, then to communicate this to your users, the next release would be 2.0.0. On the other hand, if you made a backward compatible bug fix, the next release would be 1.2.4. It’s also worth mentioning that 1.0.0 is typically treated as the first release that provides backward compatibility guarantees: new software can start at 0.x.y version numbers to indicate that it does not yet provide those guarantees.
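As a minimal illustration of these rules, here’s a small JavaScript helper (not part of the sample app) that computes the next version number from the current version and the kind of change you made:

// Minimal illustration of the SemVer bump rules; not part of the sample app.
function nextVersion(current, changeType) {
  const [major, minor, patch] = current.split('.').map(Number);
  switch (changeType) {
    case 'breaking': return `${major + 1}.0.0`;               // incompatible API change
    case 'feature':  return `${major}.${minor + 1}.0`;        // backward compatible functionality
    case 'fix':      return `${major}.${minor}.${patch + 1}`; // backward compatible bug fix
    default: throw new Error(`Unknown change type: ${changeType}`);
  }
}

console.log(nextVersion('1.2.3', 'breaking')); // => 2.0.0
console.log(nextVersion('1.2.3', 'fix'));      // => 1.2.4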

Automatic updates is a way to keep your dependencies up to date. One of the benefits of using library dependencies is that changes to that library only affect you when you explicitly and deliberately pull in a new version of that library. However, this strength is also a drawback: it’s easy to forget to update a library and to be stuck with an old version for years. This can be a problem as the old version may have bugs or security vulnerabilities, and if you don’t update for a while, updating to the latest version to pick up a fix can be difficult, especially if there have been many breaking changes since your last update.

This is yet another place where, if it hurts, you need to do it more often. In particular, you want to set up a process where you automatically update your dependencies and roll those updates out to production (sometimes called software patching). This applies to all the different types of software you depend on, including open source libraries, internal libraries, the operating systems your software runs on, software from cloud providers like AWS, GCP, and Azure, and so on. The automation you set up can either run on a schedule (e.g., update weekly) or in response to new versions being released.

You can set up automated updates using tools such as Dependabot, Renovate, Snyk, and Patcher, which detect the dependencies in your code and automatically open pull requests to update them to new versions. That way, instead of having to remember to do updates yourself, the updates come to you: all you have to do is check that each update passes your suite of tests (as per Part 4), merge the pull request, and let the code deploy automatically (as per Part 5).
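To get a feel for what this kind of automation does under the hood, here’s a rough sketch (no substitute for the tools above, which also open pull requests for you) of a script you could run on a schedule to flag outdated NPM dependencies:

// Rough sketch of a scheduled dependency check; relies on `npm outdated --json`.
const { execSync } = require('child_process');

let output;
try {
  // npm outdated exits with a non-zero code when there are outdated packages,
  // so capture its stdout whether or not the command "fails."
  output = execSync('npm outdated --json', { encoding: 'utf8' });
} catch (err) {
  output = err.stdout;
}

const outdated = JSON.parse(output || '{}');
for (const [name, info] of Object.entries(outdated)) {
  console.log(`${name}: current ${info.current}, latest ${info.latest}`);
}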

Breaking a codebase into multiple services

Consider parts A, B, and C from the previous section: whether you use source code dependencies or library (artifact) dependencies, all the parts of your codebase run in a single process and communicate with each other via in-memory function calls, as shown in Figure 69:

All the parts of the codebase run in a single process and communicate via in-memory function calls
Figure 69. All the parts of the codebase run in a single process and communicate via in-memory function calls

Another way to break up the codebase is to run the different parts of your codebase as separate services: that is, you run each part in a separate process, typically on separate servers, and they communicate with each other by sending messages over the network, as shown in Figure 70:

All the parts of the codebase run in separate processes and communicate via messages over the network
Figure 70. All the parts of the codebase run in separate processes and communicate via messages over the network

Instead of a single, monolithic application, you move to multiple services, where each service meets the following properties:

It exposes a well-defined API to the outside world

The service exposes an interface with well-defined inputs and outputs. All the code outside the service must use this interface to interact with the service. Just as with libraries, the API usually needs to provide backward compatibility guarantees, and it may be versioned.

Its implementation can be developed independently of the rest of the codebase

The internals of the service are hidden from the outside world and can be developed independently from all other services, so long as the service still fulfills the promises it makes in its interface.

It can be deployed independently of the rest of the codebase

Not only can you develop the internals of your service independently, but you can also deploy that service independently of the rest of your codebase. So one team may choose to deploy their service five times per day, whereas another team may choose to deploy their service only once per quarter.

You can only talk to a service via messages over the network

You no longer depend on the code (whether directly or via artifacts) of a service at all. Instead, you can only communicate with the service via messages over the network.

Over the years, there have been many different approaches to building services—and also many buzzwords and fads, which can make it hard to nail down concrete definitions. One approach is service-oriented architecture (SOA), which typically refers to building relatively large services that handle all the logic for an entire business line or product within your company; this was also sometimes called Web 2.0, when it referred to services exposed between different companies (e.g., when you use APIs from services exposed by Twitter, Facebook, Google Maps, etc.). A slightly more recent approach that arose around the same time as DevOps is microservices, which typically refers to smaller, more fine-grained services that handle one domain within a company: e.g., one microservice to handle user profiles, one microservice to handle search, one microservice to do fraud detection, and so on. Yet another approach is event-driven architecture, where instead of services interacting synchronously by messaging each other and waiting for responses, they interact asynchronously, with each service listening for events, which are typically messages on some sort of event bus, processing those events, and creating new events by writing back to the event bus.
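To make the event-driven flavor a bit more concrete, here’s a toy sketch that uses Node’s built-in EventEmitter as a stand-in for a real, durable event bus such as Kafka; the services and event names are made up:

// Toy sketch of event-driven interaction. A real system would use a durable
// event bus (e.g., Kafka) instead of an in-process EventEmitter; the services
// and event names here are made up for illustration.
const EventEmitter = require('events');
const eventBus = new EventEmitter();

// The orders service emits an event instead of calling other services directly.
function placeOrder(orderId) {
  eventBus.emit('order-placed', { orderId });
}

// The fraud-detection service listens for that event and emits its own in turn.
eventBus.on('order-placed', (event) => {
  console.log(`Checking order ${event.orderId} for fraud`);
  eventBus.emit('order-approved', event);
});

// The fulfillment service reacts to approved orders.
eventBus.on('order-approved', (event) => {
  console.log(`Shipping order ${event.orderId}`);
});

placeOrder(42);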

Whichever model of services you choose, there are typically three main advantages to breaking up your code into services:

Isolating teams

A common pattern is to have each service owned by a different team, which allows that team to focus more or less entirely on just their service—just their small part of the codebase—and safely ignore everything else. Each team can develop the internals of a service using whatever practices they want (e.g., for testing, code reviews, etc.), so long as the public interface fulfills its promises.

Using multiple programming languages

Since services run in separate processes, you can build them in different programming languages. This allows you to pick the programming languages that are the best fits for certain problem domains. It also makes it easier to integrate codebases from acquisitions and other companies that used different programming languages without having to rewrite all the code.

Scaling services independently

Since services run in separate processes, you can run them on separate servers, and scale those servers independently. For example, you might scale one service horizontally, deploying it across more and more servers as CPU load goes up, and another service vertically, deploying it on a single server with a large amount of RAM.

Almost all large companies eventually move to services for these three advantages, especially the ability to isolate teams. To some extent, using services allows each team to operate like its own, independent company, which is essential to scaling.

Key takeaway #5

Breaking up your codebase into services allows different teams to own, develop, and scale each part independently.

Moving to services can be an essential ingredient in helping a company scale, but beware: breaking up the codebase, whether into libraries or services, comes with a number of costs and challenges, so most companies should avoid it until they have no other choice, as described in the next section.

Challenges with Breaking Up Your Codebase

In recent years, it became trendy to break up a codebase, especially into microservices, almost to the extent where "monolith" became a dirty word. At a certain scale, moving to services is inevitable: every large company has a story of breaking up their monolith. But until you get to that scale, a monolith is a good thing. That’s because breaking up a codebase introduces a number of challenges, including the following:

  • Challenges with backward compatibility

  • Challenges with global changes

  • Challenges with where to split the code

  • Challenges with testing and integration

  • Dependency hell

  • Operational overhead

  • Dependency overhead

  • Debugging overhead

  • Infrastructure overhead

  • Performance overhead

  • Distributed system complexities

Let’s go through these one at a time, starting with increased challenges with backward compatibility.

Challenges with backward compatibility

Both libraries and services consist of two parts: the public API and the internal implementation details. Breaking up your codebase allows you to make changes much more quickly to the internal implementation details, as each team can maintain those however they want. However, making changes to the public API becomes slower and more difficult, as you now need to worry about backward compatibility. Making backward incompatible changes (AKA breaking changes) in a library or service can cause headaches, bugs, and outages for everyone who depends on your library or service, so you have to be careful about changes to the public API.

For example, imagine that in part B of your codebase, you have a function called foo that you want to rename to bar. If all the code that depends on B is in one codebase, renaming is straightforward:

  1. Rename foo to bar in B.

  2. Find all the places that reference foo and update them to bar. Many IDEs can do this rename automatically. If there are too many places to update in one commit, use branch by abstraction (as introduced in Part 5).

  3. Done.

If B is a separate library, the process is more complicated:

  1. Discuss with your team if you really want to do a backward incompatible change. Some libraries make compatibility promises, and can only break them rarely: e.g., some libraries batch all breaking changes into releases they do once per quarter or once per year, so you might have to wait a long time to do the rename.

  2. Rename foo to bar in B.

  3. Create a new release of B, updating the MAJOR version number to make it clear that there are breaking changes, and add release notes with migration instructions.

  4. Every team that relies on B now chooses when to update to the new version. If they see there is a breaking change, they may wait longer before updating. When those teams finally decide to upgrade, they have to find all usages of foo and rename them to bar.

  5. Done.

If B is a separate service, the process is even more complicated:

  1. Discuss with your team if you really want to do a backward incompatible change. These are expensive changes to make in a service, so you may choose not to do it, or you may have to wait a long time before doing the rename.

  2. Add a new version of your API and/or a new endpoint that has bar. Note that you do not remove foo at this point: if you did, you might break the services that rely on foo, causing bugs or outages.

  3. Deploy the new version of your service that has both bar and foo endpoints.

  4. Notify all users and update your docs to indicate there is a new bar endpoint and that foo is deprecated.

  5. Wait for every team to switch from foo to bar in their code and to deploy a new version of their service. You might even monitor the access logs of B to see if the foo endpoint is still being used, identify the teams responsible, and bargain with them to switch to bar. Depending on the company and competing priorities, this could take weeks or months.

  6. At some point, if usage of foo goes to zero, you can finally remove it from your code, and deploy a new version of your service. Sometimes, especially with public APIs, you might have to keep the old foo endpoint forever.

  7. Done.

Phew. That’s a lot of work. If you spend enough time maintaining a library or service, you quickly learn how important it is to get the public API right, and you’ll likely spend a lot of time obsessing over your public API design. But no matter how much time you spend on it, you’ll never get it exactly right, and you’ll always have to evolve it over time anyway, so expect public API maintenance to be one of the overheads of splitting up the codebase.
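For instance, in an Express.js service like the sample app, step 2 of the service version of the rename might look roughly like this (a sketch that assumes foo and bar are simple HTTP GET endpoints):

// Sketch: expose the new bar endpoint while keeping the deprecated foo endpoint
// around so existing callers don't break (assumes a simple Express app).
const express = require('express');
const app = express();

app.get('/bar', (req, res) => {
  res.json({ text: 'result' });
});

// Deprecated alias: log its usage so you can track which callers still need to
// migrate to /bar before you can remove it.
app.get('/foo', (req, res) => {
  console.warn(`Deprecated /foo endpoint called by ${req.ip}`);
  res.json({ text: 'result' });
});

app.listen(8080);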

Challenges with global changes

The reason it’s hard to maintain a public API in libraries and services is that the API is the place where you have to interact with many other teams at your company. As it turns out, this is just one specific type of change that becomes much harder if you split up your codebase: the more general problem is that any global changes—changes that require updating multiple libraries or services—become considerably harder.

For example, LinkedIn, like almost all companies, started with a single monolithic application: this one was written in Java and called Leo. Eventually, Leo became a bottleneck to scaling, both in terms of scaling to handle more developers, and scaling to handle more traffic, so we started to break it up into dozens of libraries and services. For the most part, this was a huge win, as each team was able to iterate on features within their library or service much faster than when those features were mixed with everyone else’s features within Leo. However, we also had to do the occasional global change: for example, almost every single service relied on some security utilities in a library called util-security.jar. When we found a vulnerability in that library, rolling out the new version to all services took a gargantuan effort:

  • A few developers were assigned to lead the effort.

  • They had to dig through dozens of services, many of which were defined in different repos, to find everyone who depended on util-security.jar.

  • Next, they had to update each of those services to the new version. In some cases, this was a simple version number bump, but in many cases, the service was on an ancient version of util-security.jar, so they had to upgrade it through a large number of breaking changes, which required making changes throughout that service’s codebase.

  • Then they opened up pull requests and waited for code reviews, prodding each team to accelerate the process.

  • Eventually, the code was merged, and then they had to bargain with each team to deploy their service.

  • Some of the deployments had bugs or caused outages, which required rolling things back, fixing the issues, and deploying again.

What would’ve been a single commit and deploy within a monolith became a multi-week slog when dealing with dozens of microservices. To some extent, this is by design: the whole point of splitting up a codebase is to make it hard for changes in other parts of the codebase to affect you.

Key takeaway #6

The trade-off you make when you split up a codebase is that you are optimizing for being able to make changes much faster within each part of the codebase, but this comes at the cost of it taking much longer to make changes across the entire codebase.

If you split up your codebase and find that the vast majority of the changes each team makes are within a single part of the codebase owned by that team, then this split will allow you to go much faster; but if you find that the majority of the changes require teams to make updates across multiple parts of the codebase, then this split will make you go much slower. Unfortunately, knowing where to split up the codebase can be surprisingly challenging, as discussed in the next section.

Challenges with where to split the code

One of the challenges of splitting up a codebase is knowing where to put the seams. You know you’ve split it correctly when the vast majority of the changes done by each team are within their own part of the codebase, as that will allow them to go much faster. However, it’s very easy to get the split wrong, and end up in a situation where most changes require updating multiple libraries or services, and as you learned in the previous section, these sorts of global changes will make you go much slower.

One place I see teams get this wrong all the time is splitting up the codebase way too early. It’s much easier to identify the seams in a codebase that has been around for a long time than it is to guess where to put the seams in something totally new. When you’ve been working with a codebase for years, you can usually identify several patterns that are good hints for where the codebase could be split:

Files that change together

If every time you make a change of type X, you update a group of files A, and every time you make a change of type Y, you update a group of files B, then A and B are good candidates to be broken out into separate libraries or services.

Files that teams focus on

If 90% of the changes by team X are in a group of files A and 90% of the changes by team Y are in a group of files B, then A and B are good candidates to be broken out into separate libraries or services.

Parts that could be open sourced or outsourced

Are there parts of your code that you could envision as successful, standalone open source projects? Or parts of your code that could be exposed as successful, standalone web APIs? I’m not saying you actually need to open source your code or open up APIs, but merely use this as a litmus test: anything that would work well as an open source project is a good candidate to be broken out into a library; anything that would work well as a standalone web API is a good candidate to be broken out into a service. Note that this litmus test works well in reverse, too: anything that would not work well as a standalone open source project or web API—perhaps because it only makes sense as part of a larger whole—is probably not a good candidate to break out into a library or service.

Performance bottlenecks

If you know that 90% of the time it takes to serve a request is spent in part A of your code, and it’s mostly limited by RAM, then that might be a good candidate to break out into a service that you scale separately.

Trying to predict any of these items ahead of time for a new codebase is futile. This is especially true of performance bottlenecks, which you can never really predict without running a profiler against real code and real data. The only way to get these seams right is to start with a monolith, grow it as far as you can, and break it up into smaller pieces only when you can’t scale it any further.

This is one of the reasons that I shake my head when I see a tiny startup with a three-person engineering team launch their product with 12 microservices: not only are you going to pay a high price in terms of operational overhead (something you’ll learn more about shortly), but you almost certainly have put the seams in the wrong places. Inevitably, these teams find that every time they go to make the slightest change in their product, they have to update 7 different microservices, and deploy them all in just the right order. Meanwhile, the startup team who built on top of a Ruby on Rails or PHP monolith is running circles around them, shipping changes 10x faster.

Challenges with testing and integration

In Part 5, you learned all about continuous integration, and its central role in helping teams move faster. Well, here’s a fun fact: splitting up your codebase into libraries and services is the opposite of continuous integration. What’s the difference between a long-lived feature branch that you only merge into main after 8 months and a library dependency that you only update once every 8 months? Not much.

Once you’ve split up your codebase, what you’re effectively doing is late integration. And that’s by design: one of the main reasons to split up a codebase is to allow teams to work more independently from each other, which means you are going to be integrating your work together much less frequently.

Key takeaway #7

Splitting up a codebase into multiple parts means you are choosing to do late integration instead of continuous integration between those parts, so only do it when those parts are truly independent.

This is a good trade-off to make if the different teams are truly decoupled from each other: e.g., they work on totally separate products within your company. However, if the teams are actually tightly coupled, and have to interact often, then splitting them up into separate codebases will lead to problems. Either the teams will try to work mostly independently, and due to the lack of integration and proper testing run into lots of conflicts, bugs, and outages, or the teams will try to integrate their work all the time, and due to the frequent need to make global changes across multiple parts of the codebase, they will find development is very slow.

Dependency hell

One challenge that is unique to libraries is what’s sometimes referred to as dependency hell, which is where using versioned dependencies can lead to one of a number of frustrating situations, such as the following:

Too many dependencies

If you depend on dozens of libraries, and each of those libraries depends on dozens more libraries, and each of those depends on even more, then the mere act of downloading all your dependencies can take up a ton of time, disk space, and bandwidth.

Long dependency chains

You sometimes get long chains of dependencies: e.g., library A depends on B, B depends on C, C depends on D, and so on, until finally you get to some library Z. Now imagine that you made an important security fix to Z and you want to roll it out to A. To do this, you have to update Y, and release a new version of that, then update X, and release a new version of that, and so on, all the way up the chain, until you finally get back to A.

Diamond dependencies

Imagine A depends on B and C, and B and C, in turn, each depend on D. This is all fine unless B and C each depend on different, incompatible versions of D: e.g., B needs D at version 1.0.0, whereas C needs D at version 2.0.0, as shown in Figure 71. You can’t have two conflicting versions at once, so now you’re stuck unless B or C are updated, and these may be libraries you don’t control.

Diamond dependencies
Figure 71. Diamond dependencies

Just about all codebases run into these issues from time to time due to dependencies on open source libraries, but if you break your own codebase up into many libraries, these problems may become exponentially worse.

Operational overhead

If you split a monolith into services, instead of having just a single type of app to manage, you now have many different types, possibly written in different languages, each with its own mechanisms for testing, deployment, monitoring, configuration, and so on. Think of all the work you’ve done so far in this blog post series to deploy a single app and a CI/CD pipeline for it, add all the work in upcoming posts, such as networking, secrets management, data storage, and monitoring, and then multiply that by the number of services, and you’ll get an inkling of the operational overhead involved. But that’s not all. There is even more operational overhead from dependencies between services, debugging of multiple services, infrastructure for multiple services, and the performance impact from services, as discussed in the next several sections.

Dependency overhead

With N services, you not only have N things to deploy and manage, but you also have to consider the interactions between services, which grow at a rate of N². For example, let’s say you have a service A that depends on a service B. As part of developing a new feature, you add a new endpoint called foo to B, and you update the code in A to make calls to the foo endpoint. Now consider what happens at deployment time: if you deploy the new version of A before the new version of B is out, then when A tries to use the foo endpoint, it’ll fail, as the old version of B doesn’t have that endpoint yet. So now you have a dependency on deployment ordering: B must be deployed before A. But of course, B itself may depend on new functionality in services C and D, and those services may depend on new functionality in other services, and so on. So now you have a deployment graph to maintain to ensure the right services are deployed in the right order. And this gets really messy if one of those services has a bug and you have to roll back: for example, if the new version of C had to be rolled back, you’d also have to know to roll back the new versions of B and A.

One way to mitigate this problem is to ban deployment ordering entirely: that is, you require your code to be written so that services can be deployed in any order, and rolled back at any time. One way to do that is to use feature flags, which you saw in Part 5: you wrap the new functionality in A—the part of the code that calls the new foo endpoint in B—in an if-statement which is off by default. That way, you can deploy A and B at any time and in any order, as the new functionality won’t be visible to any users. When you’re sure the new versions of both A and B are deployed, you turn the feature toggle on, and the new functionality should start working; if you hit any issues with the new functionality, or if B or one of its dependencies has to be rolled back, you can turn the feature toggle off again. Once again, you are using feature toggles to separate deployment from release, which is usually a more effective solution than trying to implement deployment ordering—but still nowhere near as simple as avoiding the problem entirely by sticking with a monolith as long as you can.
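Here’s a rough sketch of that pattern in a Node.js service like service A: the call to B’s new foo endpoint is wrapped in a flag that’s off by default. The environment variable, hostname, and endpoint names are only for illustration:

// Rough sketch: gate the call to B's new foo endpoint behind a feature flag that
// is off by default; the env var, hostname, and endpoints are illustrative.
const express = require('express');
const app = express();

const useNewFooEndpoint = process.env.FEATURE_NEW_FOO === 'true';

app.get('/', async (req, res) => {
  if (useNewFooEndpoint) {
    // New code path: only exercised once both A and B are deployed and the flag is on.
    const response = await fetch('http://service-b/foo');
    res.json(await response.json());
  } else {
    // Old code path: works against the currently deployed version of B.
    res.json({ text: 'existing behavior' });
  }
});

app.listen(8080);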

Debugging overhead

If you have a single monolithic app, and your users report a bug, you know the bug is in the app. If you have dozens of services, and your users report a bug, now you have to do an investigation to figure out which service is at fault. This is considerably harder for several reasons. One reason is the natural tendency for each team to immediately blame other teams, so no one will want to take ownership of the bug. Another reason is that when you have services that communicate over the network, rather than a monolith where everything happens in a single process, there are a large number of new, complicated failure conditions that are tricky to debug (you’ll learn more about this shortly).

One more reason is that, whereas debugging a single app can be hard, trying to track down a bug across dozens of separate services can be a nightmare: you can no longer look at the logs of a single app, and instead have to go look at logs from dozens of apps, each potentially in a different place and format; you can no longer reproduce the error by running a single app on your computer, and instead have to fire up dozens of services locally; you can no longer hook up a debugger to a single process and go through all the code step-by-step, and instead have to use all sorts of tracing tools to identify the dozens of different processes that end up processing a single request. A bug that could take an hour to figure out in a monolith can take weeks to track down in a microservices architecture.

Infrastructure overhead

The bureaucracy is expanding to meet the needs of the expanding bureaucracy.

— Oscar Wilde

Moving from a monolith to multiple services isn’t just about deploying a bunch of services: you typically also need to deploy a bunch of extra infrastructure to support the services themselves—and the more services you have, the more infrastructure you need to support them. For example, to help manage the deployments of 12 services, rather than 1 monolith, you may have to deploy a more complicated orchestration tool (e.g., Kubernetes); to help your services communicate with each other securely, you may have to deploy a service mesh tool (e.g., Istio); to help your services communicate with each other asynchronously, you may have to deploy an event bus (e.g., Kafka); to help with debugging and monitoring your microservices architecture, you may have to deploy a distributed tracing tool (e.g., Jaeger), and integrate a tracing library (e.g., OpenTracing) into all your services; and so on. All of this infrastructure takes a lot of time and money to deploy and manage, and you can avoid most of it by sticking with a monolith for as long as possible.

Performance overhead

One of the benefits of services is that they help you deal with performance bottlenecks by allowing you to scale different parts of your codebase independently. One of the drawbacks of services is that, in almost every other way, they actually make performance considerably worse. This is due to the following reasons:

Networking overhead

When all the parts of your codebase run in a single monolith, those parts all run in a single process, so they can communicate with each other via function calls. When those same parts are running in separate processes, they have to communicate with each other over the network. If you refer once more to Table 9, you’ll see that a random read from main memory takes roughly 100 ns, whereas the roundtrip for a single TCP packet in a data center takes 500,000 ns. That means that the mere act of moving a part of your code to a separate service makes it at least 5,000 times slower!

Serialization overhead

Communicating over the network is not only slower in terms of the time it takes for the message to do a roundtrip, but also in terms of all the serialization you have to do on that message: that is, all the packing, encoding, unpacking, and decoding to send a message over the network. This includes the format of the messages (e.g., JSON, Protobuf, XML, Thrift), the format of the application layer protocol (e.g., HTTP), the format for encryption (e.g., TLS), the format for compression (e.g., Snappy), and so on. Just to put it into perspective, per Table 9, you can see that compressing 1KB with Snappy takes around 2,000 ns, so just the compression step by itself is at least 20x slower than a random read from main memory.

When you split a monolith into services, you often have to rewrite a lot of your code to use concurrency, caching, batching, and de-duping to minimize this performance overhead. However, this makes your code considerably more complicated, and it’ll still be orders of magnitude slower than keeping everything in a single process.
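For example, here’s a rough sketch of caching responses from another service to cut down on network round trips, a complication you rarely need when the call is just an in-process function call:

// Rough sketch of caching responses from another service. A production cache
// would also need expiration, size limits, and invalidation; this is just to
// illustrate the extra complexity the network forces on you.
const cache = new Map();

async function fetchWithCache(url) {
  if (cache.has(url)) {
    return cache.get(url); // served from memory: roughly 100 ns territory
  }
  const response = await fetch(url); // network round trip: ~500,000 ns or more
  const body = await response.json();
  cache.set(url, body);
  return body;
}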

Distributed system complexities

Splitting a monolith into services is a major shift: you’re turning a single app into a distributed system. Distributed systems are hard, and dealing with them introduces a number of new challenges, such as the following:

New failure modes

When all the parts of your code run in a single process and communicate via function calls, the vast majority of function calls succeed, and when they fail, there are typically only a few types of errors you need to consider: e.g., the function may return an expected error, or it may throw an unexpected error, or the whole process may crash. When those parts of the code run in separate processes that communicate over the network, you now have a whole new set of possible errors you need to handle: for example, the network request may fail because the network is down, or it may fail because the network is misconfigured and sends your request to the wrong place, or it may fail because the service you’re trying to talk to is down, or the service may take too long to respond, or it may start responding, but crash part of the way through, or it may send multiple responses, or responses that aren’t serialized the way you’re expecting, and so on. Dealing with all of these can be tricky, and it makes your code more complicated.
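For instance, even a single call between two services needs code along these lines to cope with timeouts, bad status codes, and transient failures (a sketch using the fetch API and AbortSignal.timeout available in recent Node.js versions; the URL is hypothetical):

// Sketch of handling just a few of these new failure modes: timeouts, unexpected
// status codes, and transient network errors, with a bounded number of retries.
async function callServiceWithRetries(url, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const response = await fetch(url, { signal: AbortSignal.timeout(1000) });
      if (!response.ok) {
        throw new Error(`Unexpected status code: ${response.status}`);
      }
      return await response.json();
    } catch (err) {
      console.error(`Attempt ${attempt} failed: ${err.message}`);
      if (attempt === attempts) throw err; // give up after the last attempt
    }
  }
}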

I/O complexity

When you break out a part of the code into a separate service, you typically have to update all the places that used that part of the code from making a simple function call to sending a request over the network. This is a type of I/O (input/output), and since most types of I/O are orders of magnitude slower than operations on the CPU or in memory (refer to Table 9), most programming languages use special code to handle that I/O. One common approach is to use a thread pool, where you use synchronous I/O that blocks the thread until the I/O completes. This allows you to keep your code structure mostly the same, but it requires you to carefully size your thread pools: too many threads and your CPU will spend all its time context switching between them (thrashing); too few threads, and you’ll spend all your time waiting, which will decrease throughput. Another common approach is to use asynchronous I/O that is non-blocking, so the code can keep executing while waiting on I/O, and you’ll be notified when that I/O completes. This approach allows you to avoid having to fight with thread pool sizes, but it requires you to rewrite your code to handle those notifications via mechanisms such as callbacks, promises, or actors.
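For example, in Node.js, the same call to another service looks quite different in callback style versus promise/async-await style (a sketch; the hostname is hypothetical):

// Sketch contrasting callback-style I/O with promise/async-await style; the
// hostname is hypothetical.
const http = require('http');

// Callback style: you're notified via callbacks as the response arrives.
http.get('http://some-backend-service/', (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => console.log('callback style got:', body));
}).on('error', (err) => console.error('request failed:', err.message));

// Promise/async-await style: the code reads top to bottom, but every await is a
// point where execution is suspended until the I/O completes.
async function callBackend() {
  const response = await fetch('http://some-backend-service/');
  console.log('async style got:', await response.json());
}
callBackend().catch((err) => console.error('request failed:', err.message));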

Data storage complexity

When you break your code into services, each service typically manages its own, separate data store. This is mostly a good thing, as it allows each team to store and manage data as best fits their needs, and to work independently. However, it comes at a cost: moving to microservices with multiple data stores typically means you either sacrifice the consistency of your data (as it’s hard to implement referential integrity and transactions across microservices), or you keep your data consistent, but you end up with microservices that are tightly coupled, slow, and not resilient to outages. In the distributed systems world, you can’t have both. You’ll learn more about this topic in Part 9 [coming soon].

Now that you’ve seen all the challenges with splitting your codebase into libraries and services, you might be feeling a little less excited about that sexy microservices architecture you saw at Google or Netflix. If so, that’s a good thing. You’re probably not working at Google or Netflix, so you shouldn’t blindly copy their architecture, as much of it was designed to handle problems of extraordinary scale, and if you don’t have those problems, then that architecture is more likely to slow you down than to help you.

Key takeaway #8

Splitting up a codebase into libraries and services has a considerable cost: you should only do it when the benefits outweigh those costs, which typically only happens at a larger scale.

Let’s assume that you’re at a company of a large enough scale to merit splitting up the codebase, and see what it looks like to run multiple services in Kubernetes.

Example: Deploy Microservices in Kubernetes

These days, Kubernetes is a popular orchestration tool for managing microservices, so let’s give it a shot. You’re going to convert the simple Node.js sample app you’ve seen throughout the blog post series into two apps, as shown in Figure 72:

A frontend microservice fetches data from a backend microservice and presents that data to the user
Figure 72. A frontend microservice fetches data from a backend microservice and presents that data to the user
Backend sample app

This app will represent a backend microservice, which is typically a microservice that is responsible for managing the data for some domain within your company and for exposing that data via an API (e.g., JSON over HTTP) only to other microservices within your company (and not directly to users).

Frontend sample app

This app will represent a frontend microservice, which is typically a microservice that is responsible for presentation, gathering data from backends and showing that data to users in some sort of user interface (e.g., HTML rendered in a web browser).

The following two sections will walk you through how to create these services and deploy them in Kubernetes, starting with the backend sample app.

Creating a backend sample app

As a first step, copy the Node.js sample app that you last saw in Part 5 into a new folder called sample-app-backend:

$ cd fundamentals-of-devops

$ cp -r ch5/sample-app ch6/sample-app-backend

Since you’ll be deploying this backend into a Kubernetes cluster, you should also copy the Kubernetes Deployment and Service configurations from Part 3 into sample-app-backend:

$ cp ch3/kubernetes/sample-app-deployment.yml ch6/sample-app-backend/

$ cp ch3/kubernetes/sample-app-service.yml ch6/sample-app-backend/

Next, update the files in sample-app-backend as follows:

app.js

Update this sample app to act like a backend that exposes a simple API where it responds to HTTP requests with JSON, as shown in Example 121:

Example 121. Update the backend to return JSON (ch6/sample-app-backend/app.js)
app.get('/', (req, res) => {

  res.json({text: "backend microservice"}); (1)

});
(1) Normally, a backend microservice would look up data in a database of some kind, but to keep this example simple, this sample app uses res.json to return JSON that sets text to "backend microservice."
package.json

Update the name to "sample-app-backend," update the description to match, and set the version to 0.0.1, as shown in Example 122:

Example 122. Update the app name, description, and version (ch6/sample-app-backend/package.json)
  "name": "sample-app-backend",

  "version": "0.0.1",

  "description": "Backend app for 'Fundamentals of DevOps and Software Delivery'",
sample-app-deployment.yml

Update the app name as shown in Example 123:

Example 123. Update the app name (ch6/sample-app-backend/sample-app-deployment.yml)
metadata:

  name: sample-app-backend-deployment     (1)

spec:

  replicas: 3

  template:

    metadata:

      labels:

        app: sample-app-backend-pods      (2)

    spec:

      containers:

        - name: sample-app-backend        (3)

          image: sample-app-backend:0.0.1 (4)

          ports:

            - containerPort: 8080

          env:

            - name: NODE_ENV

              value: production

  selector:

    matchLabels:

      app: sample-app-backend-pods        (5)
(1) Update the name of the Deployment to "sample-app-backend-deployment."
(2) Update the labels on the pods to "sample-app-backend-pods."
(3) Update the name of the container to "sample-app-backend."
(4) Update the Docker image to deploy to "sample-app-backend" at version 0.0.1, which is a Docker image you’ll build shortly.
(5) Update the selector to target pods labeled "sample-app-backend-pods."
sample-app-service.yml

Update the app name and switch to a ClusterIP Service, as shown in Example 124:

Example 124. Update app name and switch to a ClusterIP Service (ch6/sample-app-backend/sample-app-service.yml)
metadata:

  name: sample-app-backend-service (1)

spec:

  type: ClusterIP                  (2)

  selector:

    app: sample-app-backend-pods   (3)

  ports:

    - protocol: TCP

      port: 80

      targetPort: 8080
(1) Update the name of the Service to "sample-app-backend-service."
(2) Switch the service type from LoadBalancer to ClusterIP. This is a type of Service that is only reachable from within the Kubernetes cluster, and not from the outside world, which is typically what you want for a backend.
(3) Update the selector to target pods labeled "sample-app-backend-pods."

Build a Docker image for the backend app using the dockerize command you added back in Part 4:

$ npm run dockerize

This should result in a new Docker image called "sample-app-backend" with version 0.0.1.

Let’s deploy this Docker image into a Kubernetes cluster. The easiest one to test with is Kubernetes running locally, as part of Docker Desktop, just as you did back in Part 3. You can authenticate to the Kubernetes cluster in Docker Desktop as follows:

$ kubectl config use-context docker-desktop

Now you can use kubectl apply to deploy the Deployment and Service:

$ kubectl apply -f sample-app-deployment.yml

$ kubectl apply -f sample-app-service.yml

If you run kubectl get services, you should see the Service for the backend:

$ kubectl get services

NAME                          TYPE         CLUSTER-IP       EXTERNAL-IP   PORT(S)

sample-app-backend-service    ClusterIP    10.99.156.12     <none>        80/TCP

Note the backend Service name; you’ll need this in the frontend app, which is the focus of the next section.

Creating a frontend sample app

Create the frontend app using a process similar to the one you just used for the backend app. First, copy the Node.js sample app from Part 5 and the Kubernetes Deployment and Service configurations from Part 3 into a new folder called sample-app-frontend:

$ cd fundamentals-of-devops

$ cp -r ch5/sample-app ch6/sample-app-frontend

$ cp ch3/kubernetes/sample-app-deployment.yml ch6/sample-app-frontend/

$ cp ch3/kubernetes/sample-app-service.yml ch6/sample-app-frontend/

Next, update the files in sample-app-frontend as follows:

app.js

Update the frontend to make an HTTP request to the backend and to render the response using HTML, as shown in Example 125:

Example 125. Update the frontend to make an HTTP request to the backend and to return HTML (ch6/sample-app-frontend/app.js)
const backendHost = 'sample-app-backend-service';             (1)



app.get('/', async (req, res) => {

  const response = await fetch(`http://${backendHost}`);      (2)

  const responseBody = await response.json();                 (3)

  res.send(`<p>Hello from <b>${responseBody.text}</b>!</p>`); (4)

});
(1) What you’re seeing here is an example of service discovery in Kubernetes. Whenever you create a Service in Kubernetes named foo, Kubernetes creates a DNS entry for that Service, so you can use foo as a hostname, and Kubernetes will automatically route any requests to that hostname (e.g., HTTP requests to http://foo) to that Service. So this code sets the hostname for the backend to the name of the Service you created in the previous section. You’ll learn more about service discovery and DNS in Part 7 [coming soon].
(2) Use the fetch function to make an HTTP request to the backend microservice, using the hostname from (1).
(3) Read the body of the response from the backend and parse it as JSON.
(4) Send back HTML that includes the text from the backend’s JSON response.
Watch out for snakes: injection attacks

To keep the frontend app simple and avoid introducing new tools, I used a JavaScript template literal in (4) to render HTML. The problem with this approach is that it leaves you open to injection attacks: if you insert dynamic data into the template literal, as the preceding code does with responseBody.text, and if that data comes from your users, then those users could include malicious code in that data, and you’d end up executing their malicious code in the browser. The solution is to sanitize all user input, which you get out-of-the-box with most dedicated templating languages: for example, Express.js has native support for Pug, Mustache, and EJS, any of which would be a better choice than template literals for rendering HTML in production applications.
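For example, here’s a hedged sketch of what the frontend handler from Example 125 might look like with EJS. It assumes you install the ejs package and create a views/index.ejs template containing <p>Hello from <b><%= text %></b>!</p>, where <%= %> HTML-escapes the value:

// Sketch of rendering with EJS instead of a template literal, so that dynamic
// data is HTML-escaped. Assumes `npm install ejs`, a views/index.ejs template,
// and the same Express app and backendHost as Example 125.
app.set('view engine', 'ejs');

app.get('/', async (req, res) => {
  const response = await fetch(`http://${backendHost}`);
  const responseBody = await response.json();
  res.render('index', { text: responseBody.text }); // EJS escapes text for us
});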

package.json

Update the name to "sample-app-frontend," update the description to match, and set the version to 0.0.1, as shown in Example 126:

Example 126. Update the app name, description, and version (ch6/sample-app-frontend/package.json)
  "name": "sample-app-frontend",

  "version": "0.0.1",

  "description": "Frontend app for 'Fundamentals of DevOps and Software Delivery'",
sample-app-deployment.yml

Update the app name as shown in Example 127:

Example 127. Update the app name (ch6/sample-app-frontend/sample-app-deployment.yml)
metadata:

  name: sample-app-frontend-deployment       (1)

spec:

  replicas: 3

  template:

    metadata:

      labels:

        app: sample-app-frontend-pods        (2)

    spec:

      containers:

        - name: sample-app-frontend          (3)

          image: sample-app-frontend:0.0.1   (4)

          ports:

            - containerPort: 8080

          env:

            - name: NODE_ENV

              value: production

  selector:

    matchLabels:

      app: sample-app-frontend-pods          (5)
(1) Update the name of the Deployment to "sample-app-frontend-deployment."
(2) Update the labels on the pods to "sample-app-frontend-pods."
(3) Update the name of the container to "sample-app-frontend."
(4) Update the Docker image to deploy to "sample-app-frontend" at version 0.0.1, which is a Docker image you’ll build shortly.
(5) Update the selector to target pods labeled "sample-app-frontend-pods."
sample-app-service.yml

Update the app name, as shown in Example 128:

Example 128. Update app name (ch6/sample-app-frontend/sample-app-service.yml)
metadata:

  name: sample-app-frontend-loadbalancer (1)

spec:

  type: LoadBalancer                     (2)

  selector:

    app: sample-app-frontend-pods        (3)
(1) Update the name of the Service to "sample-app-frontend-loadbalancer."
(2) Keep the service type as LoadBalancer so that you can access this Service from the outside world, which is typically what you want for a frontend.
(3) Update the selector to target pods labeled "sample-app-frontend-pods."

Build a Docker image for the frontend app using the dockerize command you added back in Part 4:

$ npm run dockerize

This should result in a new Docker image called "sample-app-frontend" with version 0.0.1.

Let’s deploy this Docker image into a Kubernetes cluster. Use kubectl apply to deploy the Deployment and Service:

$ kubectl apply -f sample-app-deployment.yml

$ kubectl apply -f sample-app-service.yml

If you run kubectl get services, you should now see the Services for both the backend and the frontend:

$ kubectl get services

NAME                               TYPE           EXTERNAL-IP   PORT(S)

kubernetes                         ClusterIP      <none>        443/TCP

sample-app-backend-service         ClusterIP      <none>        80/TCP

sample-app-frontend-loadbalancer   LoadBalancer   localhost     80:32081/TCP

Notice how EXTERNAL-IP for the frontend is set to localhost and that it’s listening on port 80, so you can test it by going to http://localhost. If you open this URL in a web browser, you should see the HTML rendered, as shown in Figure 73:

The HTML response from the frontend
Figure 73. The HTML response from the frontend

Congrats, you’re now running two microservices in Kubernetes that are talking to each other! A separate team could own each service, developing, deploying, and scaling the service completely independently.

When you’re done testing, you may want to run kubectl delete on each of the Deployments and Services to undeploy them from your Kubernetes cluster. You should also commit your changes to Git, as you will continue to iterate on this code in subsequent chapters.

Get your hands dirty

Here are a few exercises you can try at home to get a better feel for running microservices:

  • The frontend and backend both listen on port 8080. This works fine when running the apps in Docker containers, but if you wanted to test the apps without Docker (e.g., by running npm start directly), the ports will clash. Consider updating one of the apps to listen on a different port.

  • After all these updates, the automated tests in app.test.js for both the frontend and backend are now failing. Fix the test failures. Also, look into dependency injection and test doubles (AKA mocks) to find ways to test the frontend without having to run the backend.

  • Update the frontend app to handle errors: for example, the HTTP request to the backend could fail for any number of reasons, and right now, if it does, the app will simply crash. You should instead catch these errors and show users a reasonable error message.

  • Deploy these microservices into a remote Kubernetes cluster: e.g., the EKS cluster you ran in AWS in Part 3.

Conclusion

You’ve now seen how to address some of the problems of scale that affect a company as it grows: you can break up your deployment into multiple environments, and you can break up your codebase into multiple libraries and services. These approaches have a number of benefits and costs, as you learned in the 8 key takeaways from this blog post:

  • Breaking up your deployment into multiple environments allows you to isolate tests from production and teams from each other.

  • Breaking up your deployment into multiple regions allows you to reduce latency, increase resiliency, and comply with local laws and regulations, but usually at the cost of having to rework your entire architecture.

  • Configuration changes are just as likely to cause outages as code changes.

  • Breaking up your codebase into libraries allows developers to focus on one smaller part of the codebase at a time.

  • Breaking up your codebase into services allows different teams to own, develop, and scale each part independently.

  • The trade-off you make when you split up a codebase is that you are optimizing for being able to make changes much faster within each part of the codebase, but this comes at the cost of it taking much longer to make changes across the entire codebase.

  • Splitting up a codebase into multiple parts means you are choosing to do late integration instead of continuous integration between those parts, so only do it when those parts are truly independent.

  • Splitting up a codebase into libraries and services has a considerable cost: you should only do it when the benefits outweigh those costs, which typically only happens at a larger scale.

One topic that came up again and again as you looked at multiple environments and multiple services is the key role of networking: you’ve seen networking as a key part of how services communicate and of how you define environments. Networking also plays a key role in security: so far, just about everything you’ve deployed throughout this blog post series—all the EC2 instances, EKS clusters, and so on—has been directly accessible over the public Internet. That’s convenient for learning and testing, but it means that any slight lapse in security—e.g., leaving a port open by accident in a firewall or running out-of-date software that has a vulnerability—can be immediately exploited by malicious actors.

In Part 7 [coming soon], you’ll learn how to set up your network to give you extra layers of protection, so you’re never just one mistake away from disaster, as well as how to use networking to define environments, do service discovery, connect to servers for debugging, and more.
