Hello, Terraform

At work, my team owns and maintains a large lab environment for the development and testing of Rubrik Build projects. It was built in a hurry, causing some of our original design principles to be compromised. My team and I have decided to use this no-travel period as an opportunity to redesign and redeploy our lab environment. I will review our design in a later post. 

One of our design goals is to leverage infrastructure as code principles (where possible). The team’s primary tool of choice for provisioning is Terraform.

Terraform allows us to define what resources we need in a declarative manner, where we simply define the end state needed for our infrastructure. Here are a few reasons why we like using Terraform:

  • Multi-platform, similar operations across a number of providers
  • Easy provisioning and deprovisioning of resources
  • Idempotent, saves current state as a file
  • Detects diffs from current state when applying changes

This post will dive into Terraform syntax, architecture, and operations.

Terraform Syntax

The low-level syntax of Terraform is defined in HashiCorp Configuration Language (HCL). The following example shows a generic configuration code block for Terraform:

command_type "provider_resource_label" "resource_label" {
  argument_name = "argument_value"
  argument_name = "argument_value"
}

Let’s dig into the syntax:

  • Command: the block type. The resource block type tells Terraform you want to create a resource, such as an S3 bucket or an EC2 instance.
  • Provider Resource Label: the type of resource you want to create. The resource type name is defined by the provider. For example, you may use aws_instance to provision an EC2 instance using the AWS provider.
  • Resource Label: what you want to call the resource within your Terraform configuration. This label must be unique for a given resource type within the module, because it is used later when referencing the resource.
  • Arguments: the configuration details for the resource being provisioned, each defined as an argument name and an argument value. As an example, when provisioning an EC2 instance, you may want to specify which AMI is used.

Note that comments are supported: # or // for a single line, and /* ... */ for multiple lines.

To put these concepts together, an example configuration code block may resemble:

resource "aws_instance" "my-first-instance" {
  ami = "ami-008c6427c8facbe08"
  instance_type = "t2.micro"
  availability_zone = "us-west-2c"
  
  tags = {
    Name = "my-first-instance"
    Environment = "test"
  }
}

This example will provision a single EC2 instance in the us-west-2c availability zone, using the AMI specified, and assign the two tags.

Most of your Terraform configuration is written in these code blocks. Once you master this, then you’ll be able to quickly write and provision more resources.

Terraform Architecture

A typical Terraform module may have the following structure:

project-terraform-files
│
└─── terraform-module-example01
│   │   main.tf
│   │   variables.tf
│   │   terraform.tfvars
│   │   outputs.tf
│   
└─── terraform-module-example02
│   │   provider.tf
│   │   data-sources.tf
│   │   main.tf
│   │   variables.tf
│   │   terraform.tfvars
│   │   outputs.tf

The names of the files are not important. Terraform loads every configuration file ending in .tf within the directory.

Providers

A provider is the core construct that allows Terraform to interact with the APIs across various platforms (PaaS, IaaS, SaaS). Think of this as the translator between the platform API and the HCL syntax. Before you can begin provisioning resources, you must first define which platform to use by specifying the provider:

provider "aws" {
  region = "us-west-2"
}

Place the provider block in your main.tf file or create a separate provider.tf file.

Resources

I previously covered how to structure resource code blocks in the Terraform Syntax section. 

This example defines the creation of an instance based off the defined AMI, sized as t2.micro, and properly tagged:

resource "aws_instance" "my-first-instance" {
  ami = "ami-008c6427c8facbe08"
  instance_type = "t2.micro"
  availability_zone = "us-west-2c"
  
  tags = {
    Name = "my-first-instance"
    Environment = "test"
  }
}

Define the desired outcome for your resources in the main.tf file.

Data Sources

Data sources enable you to reference resources that already exist outside of Terraform or that are defined by a separate Terraform configuration. This allows you to extract information that can then be fed into a new resource. First, define the data source, and then reference it as an argument value:

data "aws_ami" "ubuntu" {
  most_recent = true
  owners = ["099720109477"] # Canonical's AWS account ID, the publisher of the official Ubuntu AMIs

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "my-first-instance" {
  ami = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
  availability_zone = "us-west-2c"
  
  tags = {
    Name = "my-first-instance"
    Environment = "test"
  }
}

In this example, I am again creating a new EC2 instance. However, this time I am gathering AMI information using a data source to find and use the latest Ubuntu version instead of manually defining that AMI value. This allows my configuration to be more flexible because I no longer need to manually find and input the appropriate AMI value.

A data source is declared similarly to resources, except that the information provided is used by Terraform to discover existing resources rather than provision. Once defined, data sources can be referenced repeatedly to pass information to new resources. 

Place the data source blocks in your main.tf file or create a separate data-sources.tf file. 

Variables

To make your code more modular, you can choose to use variables instead of hard-coding values. Once defined, variables can be referenced:

provider "aws" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region = var.aws_region
}

I typically declare my essential variables in a separate variables.tf file. This may resemble:

variable "aws_access_key" {
  description = "AWS access key for authorization"
  type = string
}

variable "aws_secret_key" {
  description = "AWS secret key for authorization"
  type = string
}

variable "aws_region" {
  description = "AWS region in which resources will be provisioned"
  type = string
  default = "us-west-2"
}

In this example, I have declared a default value for the AWS region to be reused when provisioning the infrastructure defined. The descriptions are optional, and for the developer’s benefit only, but I always recommend being kind to the next person using your code. Common variable types include string, number, bool, list, and map; if no type is given, any type is accepted. Variables can also be declared without a default, with their values set through environment variables (named TF_VAR_ followed by the variable name) or a .tfvars file.

Sometimes the variable definition may be specified as a default in the variables.tf file. Otherwise, this value should be defined by creating a file named terraform.tfvars, which allows variable values to persist across multiple executions. This is especially valuable for sensitive information such as secret keys. 

For example, the contents of the terraform.tfvars file may resemble the following variable definition:

aws_access_key = "ABC0101010101CBA"
aws_secret_key = "abc87654321zyxw"
aws_region = "us-west-2"

Terraform automatically loads all files in the current directory with the exact name of terraform.tfvars or any variation of *.auto.tfvars. If the file is named something else, you can use the -var-file flag to specify a file name.

However, keep in mind that these persistent variable definitions often contain sensitive information, such as passwords or API tokens, and should be treated with care. Consider adding these files to your .gitignore file.

Outputs

Outputs can be used to display information needed or export information after Terraform completes a terraform apply command. An example output may resemble:

output "instance_id" {
  value = aws_instance.my-first-instance.id
}

You can keep your output blocks in a dedicated file such as outputs.tf.

State

When you use Terraform to build resources, a state file is created that contains configuration information for the resources provisioned. This is what allows Terraform to determine which parts of the configuration have changed. It is also what provides idempotency: Terraform can see that a resource is already present and does not create it again.

After the terraform apply command is executed, the affiliated directory will contain two new files:

  • terraform.tfstate
  • terraform.tfstate.backup

Note: manual changes made to Terraform-provisioned infrastructure will be overwritten by the next terraform apply wherever they conflict with the configuration.

Modules

Terraform configuration files can be packaged as modules and used as building blocks to create new infrastructure resources without having to put forth much effort. Modules are available publicly in the Terraform registry, and can be directly added to configuration files for quickly provisioning resources.

If I were to use a pre-packaged module to provision an AWS S3 bucket, the code may resemble:

module "s3_bucket" { 
  source = "terraform-aws-modules/s3-bucket/aws" 
  bucket = "my-s3-bucket" 
  acl = "private" 
  versioning = { 
    enabled = true 
  } 
}

In this case, you are reusing the configurations specified by the module. All you need to input are the configuration values.

Terraform Operations

Terraform is managed through a simple CLI. It is a single command-line application, terraform, and you specify the action through a subcommand such as apply or plan.

To view a list of the available commands at any time, just run terraform with no arguments.

To get started, run terraform init. This initializes a number of settings that create the required working environment, and it downloads the necessary plugins for the selected provider.

Before provisioning, you may want to generate an execution plan, otherwise known as a dry run of your changes. Generate one by running terraform plan. Terraform outputs a delta, showing you which resources will be destroyed (marked with a -), which will be added (marked with a +), and which will be updated in-place (marked with a ~).
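To make the plan markers concrete, here is a small conceptual model in Python of how a plan can be derived by comparing recorded state with the desired configuration. This is only a sketch of the idea, not how Terraform is implemented: the plan function and the resource dictionaries are hypothetical, and real Terraform also handles cases this ignores, such as changes that force a resource to be replaced.

```python
def plan(state: dict, desired: dict) -> list:
    """Compare recorded state with the desired config and return plan lines."""
    lines = []
    for name in sorted(state.keys() | desired.keys()):
        if name not in state:
            lines.append(f"+ {name}")  # in config, not in state: create
        elif name not in desired:
            lines.append(f"- {name}")  # in state, not in config: destroy
        elif state[name] != desired[name]:
            lines.append(f"~ {name}")  # in both, attributes differ: update
        # identical entries produce no line, which is the idempotent case
    return lines


# Hypothetical state (what exists) and configuration (what we want).
state = {
    "aws_instance.web": {"instance_type": "t2.micro"},
    "aws_instance.old": {"instance_type": "t2.micro"},
}
desired = {
    "aws_instance.web": {"instance_type": "t2.small"},
    "aws_instance.db": {"instance_type": "t2.micro"},
}

print("\n".join(plan(state, desired)))
# + aws_instance.db
# - aws_instance.old
# ~ aws_instance.web
```

Running plan with identical state and configuration returns an empty list, which mirrors what happens when you apply an unchanged configuration a second time.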

Once you have reviewed the execution plan and are ready to begin provisioning, run terraform apply to execute the changes. If at any point you need to remove the resources, simply use the command terraform destroy. If there are multiple resources in the module, you can name the specific resource(s) to destroy. For example:

terraform destroy -target=aws_instance.my-first-instance

In general, once you have defined the infrastructure in the .tf files, working with Terraform is pretty much just running terraform plan and terraform apply repeatedly (unless you use CI).

Summary

In this post, I described Terraform syntax, architecture, and common operations. Throughout the article I used the example of creating an AWS EC2 instance; however, these principles apply to all resource types across providers. I hope this helps you get started on your infrastructure as code journey.

Happy Terraforming!

Problem Solving with the Cynefin Framework

Effective leaders know that problem solving is not “one-size-fits-all”. The action taken depends on the situation and, because the circumstances are changing, better decisions can be made by using an adaptive approach. I have previously written about the 75% method that I learned in the military, but there’s another framework that I have consistently used with success.

Cynefin, pronounced “kih-neh-vihn” (don’t worry, I mispronounced it for longer than I’d like to admit), is a Welsh word that means “place”. The Cynefin framework was coined in 1999 by Dave Snowden. Put simply, the framework helps you recognize that not all situations are equal, and that navigating different situations successfully requires different responses.

The 5 Domains

Problems are categorized into five domains using the Cynefin framework (yes, five, don’t forget disorder!).

Ordered Systems

The domains on the right (obvious and complicated) are “ordered” because cause-and-effect are known or can be discovered.

Obvious (fka “Simple”)

This is the domain of best practice.

In this context, the problems are apparent cause-and-effect relationships that are well understood.

The methodology is to “sense – categorize – respond” to obvious problems. This means that the situation should be assessed, categorized by type, and then handled with an existing process or procedure. These tend to be repeating patterns and/or consistent events…or “known knowns”.

For example, these are problems faced at a helpdesk or call center – often predictable and there are established processes in place to handle the vast majority.

Be careful – some obvious contexts may be oversimplified. This happens when leaders (or organizations, for that matter) experience success and become complacent as a result. Ensure that there are feedback loops in place so that any situations that don’t exactly fit with an established category can be reported.

Another risk with complacency is that leaders may not be receptive to new ideas. Endeavor to stay willing to pursue a new or innovative suggestion.

Complicated

This is the domain of good practice, sometimes referred to as the “domain of experts.”

Complicated problems may have multiple correct solutions. There is a relationship between cause and effect, but it may not be obvious to everyone because the problem is…well…complicated. There may be several symptoms but you are not sure how to fix them.

The methodology here is to “sense – analyze – respond”. Effectively you should assess the situation, analyze what is known (using the help of experts), and decide what the best response is using good practices. This is generally where we experience “known unknowns” where we know the questions that need to be answered, but may not know the actual answer. It is at this point that we consult the expert. With enough time, you could reasonably identify the known risk and develop a plan. Think evolutionary, not revolutionary.

The danger here is that a leader may lean too heavily on experts while ignoring good solutions from others. In tech, we tend to experience this where we rely on the experts and ignore the generalists – even though the generalist may have the winning answer. Additionally, the leader may experience analysis paralysis. This is where I recommend using the 75% method.

Unordered Systems

The domains on the left (complex and chaotic) are “unordered” because cause and effect can be deduced only with hindsight or potentially not at all.

Complex

This is the domain of emergent practice.

Sometimes it is impossible to identify a single correct solution or to spot the cause-and-effect relationship. You are likely in a complex context.

This context is typically unpredictable, making the best approach “probe – sense – respond”. Think “unknown unknowns”. You may not know the correct questions to be asking. Regardless of how much time is spent in analysis, it may not be possible to accurately identify the risks, predict the solution, or estimate the effort needed to solve the problem.

In this situation, it is best to patiently wait, look for patterns, develop, and experiment to gain more knowledge. As more knowledge is gained, then determine the next steps. Repeat as needed. The goal is to move into the “complicated” domain.

A potential risk is that leaders may fall back into habitual command-and-control modes which are futile in this context. Leaders lacking patience may try to force facts instead of waiting for patterns. It is imperative to have a feedback loop so that open discussion can occur to develop experiments for observing patterns. Think “what if we tried…” Use creativity to solve the problem.

Complicated and complex situations are similar in some ways, and are sometimes confused. If a decision based on incomplete data is being made, you are likely to be in a complex situation.

Chaotic

This is the domain of novel practice.

There is no relationship between cause-and-effect. This means that the primary goal here is to establish order and stability. This is likely a crisis or emergency situation.

The methodology is to “act – sense – respond”. It is necessary to be decisive in order to address the burning issues, determine where there is and isn’t stability, and then work to move the situation from chaos to complexity. Basically, shit has hit the fan – triage time: stop the bleeding and start the breathing… then determine what the real solution should be.

It may feel like in tech we live in this domain (hopefully not!). As an example, there may be an issue in production, say a bad patch that has been installed data center wide. Initially the focus will be on containing the issue and correcting it quickly. The initial solution may not be great, but it gets the job done. Once the bleeding has stopped then you can determine the better long-term solution.

In this situation, the leader must provide clear and direct communication while taking immediate action to re-establish order. A risk is an indecisive leader. This is the time to find “good enough” instead of the perfect answer.

Disorder

Disorder is the space in the middle.

There is no clarity here – decompose and move to another context. Basically, if you have no idea where you are, then you’re likely in “disorder”. The immediate goal is to gather information in order to move to a known domain.

In this situation, I tend to try to break the massive disorder into smaller problems and then tackle each one individually. Apply each problem to a domain and work on a solution.

Chaotic problems are dangerous, especially when left unaddressed, because there is no process to fix them. This is why it is important to move into a known category.

Final Thoughts

The Cynefin Framework is an excellent model to assist in approaching different situations. Once the situation is defined, then work to solve the problem.

The goal is to adequately lead your team through any of these five domains. Many leaders can only lead effectively in one or two domains (not in all of them) and few, if any, prepare their organizations for diverse contexts. The only way to successfully get through all five domains is to keep an open mind to new and creative solutions, build a feedback loop, and not get stuck in analysis paralysis.

[Figure: Cynefin framework, by Edwin Stoop]

Additional Resources:

Cognitive Edge: The Cynefin Framework (explained by Snowden himself!)

Everyday Kanban: Understanding the Cynefin framework – a basic intro

Sherrieg: The Cynefin Framework

Harvard Business Review: A Leader’s Framework for Decision Making

Ch-ch-change Management

Change management has never been easy for the dev or the ops side of the house. Let’s face it; it’s usually a checklist item and a tool to CYA. However, we are moving to a world where change is a part of the culture and a frequent process. There is no excuse to not improve.

The ultimate goal of change management is to drive organizational results and outcomes by engaging the staff to encourage the adoption of a new way to work. Whether it is a process, system, job role, or organizational structure change (potentially…all of the above), a project can only be successful if individuals change their daily behaviors and begin doing the job in a new way. This is the nature of change management.

Therefore, staffing a change management board with a crew of change-averse individuals will get you nowhere.

Change Management

Often we look at change management as a way to spot problems after they happen. Thus it becomes a tool for responding to change, instead of leveraging change. In this world of DevOps that embraces change as a mechanism to iteratively improve on processes, change management is usually viewed as a blocker to avoid. But in most enterprises and verticals, it cannot be avoided.

Tooling and implementation can be detached from governance. This decoupling can result in lost communication and a reactive philosophy. Instead consider funneling all changes through the same channel so that nothing gets lost and the change advisory board (CAB) considers all changes. Begin by consolidating change, problem, and incident management into a modular platform that is a part of your DevOps tooling that can streamline everything into one pipeline.

[Figure: feedback loop]

This may seem outlandish at first, but integrating change into pipelines automates the capture of change records along with a set of artifacts. The goal is to ultimately improve collaboration and to build an auditable history.

Companies often establish different modes of change to balance speed, quality, and risk. Consider automating the approval gate for some modes of change. This speeds change processing and increases adoption. It also shifts responsibility for making change happen effectively back onto the individuals who conduct the implementations.
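As a rough illustration of an automated approval gate, the Python sketch below routes a change by its mode and records every decision so there is an auditable history. Everything in it is an assumption for the sake of the example: the mode names ("standard", "normal", "emergency"), the ChangeRequest fields, and the rule that only standard changes with passing tests skip the CAB.

```python
from dataclasses import dataclass

# Hypothetical policy: which modes of change may bypass manual review.
AUTO_APPROVED_MODES = {"standard"}


@dataclass
class ChangeRequest:
    summary: str
    mode: str           # assumed mode names: "standard", "normal", "emergency"
    tests_passed: bool  # result reported by the pipeline


def approval_gate(change: ChangeRequest) -> str:
    """Return 'rejected', 'auto-approved', or 'needs-cab-review'."""
    if not change.tests_passed:
        return "rejected"
    if change.mode in AUTO_APPROVED_MODES:
        return "auto-approved"
    return "needs-cab-review"


# Every decision is appended to a log, so each change leaves a record.
audit_log = []
for change in [
    ChangeRequest("rotate TLS certificate", "standard", True),
    ChangeRequest("migrate database schema", "normal", True),
    ChangeRequest("patch load balancer", "standard", False),
]:
    audit_log.append((change.summary, change.mode, approval_gate(change)))

for entry in audit_log:
    print(entry)
```

The point of the design is that the gate and the record keeping live in the same place as the deployment itself, so nothing depends on someone remembering to file a ticket afterwards.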

Change management should be a priority and used as a single source of truth of all changes. Doing so will increase visibility for risk and compliance management.

We can distill this down to three key ideas to help implement efficient change management:

  • Do not decide a new direction and then dump it on your team. Involve them in the decision-making process.
  • Make work visible to all.
  • Embrace value stream mapping to find new ways to increase efficiency.

The bottom line is to be proactive about how change is managed.

Considering the Methods for Release Engineering

The entire goal of release engineering is to accelerate rollout of new software or new releases as much as possible. Release engineering focuses on building a pipeline that transforms source code into an integrated, compiled, packaged, tested, and signed product that is ready for release.

Release management coordinates release workflows between various dev and ops personnel. Release engineers are more technically focused: working with the code, build systems, configuration management tools and container platforms, among other pipeline components, directly.

The goal is for the process to be as simple as possible. Complexity is the enemy of most things. Is my architecture good if it is so complex that no one can figure out how to implement and manage it? Same principles apply to DevOps frameworks. The architecture of the product that flows through the pipeline is a key factor that determines the structure of the continuous delivery pipeline.

For our processes to be simple, we need to automate as much as possible, including any approval gates that aren’t critical. There should be clear expectations of the release workflow and proper feedback loops. Not communicating results back will kill any process. It is imperative for the dev personnel to be communicating with ops to coordinate the release.

DevOpsElephant

And then of course…a method of releasing the new version.

Canary

The concept of canarying first emerged in the early 1900s, when coal miners would take caged canaries into the mines. Canaries are more susceptible to carbon monoxide than humans, so the bird would quickly die, signaling to the miners to get out.

Canary release is a release engineering technique used to reduce the risk of introducing a new software version in production. It accomplishes this by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.

Once the release environment and new version are ready, redirect a few selected users to it. Maybe 5-10%. But, how do you choose which users will see the new version? There are a few different options:

  • Try out the release on internal users first
  • Randomize the user selection
  • Use specific characteristic-based criteria to determine the user subset

The idea is that the faster you can get feedback, the faster the deployment can fail or proceed.
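The selection options above can be sketched in a few lines of Python. This is a minimal example under my own assumptions: the in_canary function and the user IDs are hypothetical, and hashing the user ID is just one common way to get a selection that looks random but stays stable, so the same user always lands on the same version.

```python
import hashlib


def in_canary(user_id: str, percent: int, internal_users=frozenset()) -> bool:
    """Decide whether a user should be routed to the canary release."""
    # Option 1: internal users always see the new version first.
    if user_id in internal_users:
        return True
    # Option 2: pseudo-random but stable selection. Hash the ID into one of
    # 100 buckets and send the first `percent` buckets to the canary.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


# Route roughly 10% of a hypothetical user base to the canary.
users = [f"user-{i}" for i in range(10000)]
canary_users = [u for u in users if in_canary(u, percent=10)]
print(f"{len(canary_users)} of {len(users)} users routed to the canary")
```

Widening the release is then just a matter of raising percent. Users who were already in the canary stay in it, because their bucket does not change.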

[Figure: canary release]
Image from: https://www.gocd.org/

As your comfort level with the new version increases, begin a wider release across the infrastructure and redirect more and more users to it. Canary releases let you dip your toes in before pulling the trigger on a full release.

Google Cloud Platform blog has a cool post about release canaries, and so does Instagram.

Blue-Green Deployment

The concept with a blue-green deployment is fairly simple – there are two identical infrastructures: “green” with the current production load, say v1; “blue” is deployed with the newest version of the app.

[Figure: blue-green deployment]
Image from: https://www.gocd.org/

Smoke tests or other kinds of tests have been run, and the “blue” environment is ready to go. Once ready, just change the router / load-balancer / reverse proxy to that “blue” environment. In any automated release, the cutover itself is the most challenging part. This must be done quickly in order to minimize downtime as much as possible. Blue-green deployments approach this by ensuring the two production environments are as identical as possible, minus the application version.

This option also provides a quick way to roll back. If something goes wrong, just switch the router / load-balancer / reverse proxy back to the “green” environment. The goal is to regularly cycle from “blue” to “green” and then “green” back to “blue”. Or, from live to staging for the next release.
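The cutover and rollback logic is small enough to model directly. The Python sketch below is a toy stand-in for the router / load-balancer / reverse proxy, not a real one: the Router class and the smoke test callback are hypothetical names I am using to show the flow.

```python
class Router:
    """Toy model of the switch that sits in front of two environments."""

    def __init__(self, live: str, idle: str):
        self.live = live  # environment currently taking production traffic
        self.idle = idle  # environment holding the newly deployed version

    def cut_over(self, smoke_test) -> bool:
        """Send traffic to the idle environment only if its smoke test passes."""
        if not smoke_test(self.idle):
            return False
        self.live, self.idle = self.idle, self.live
        return True

    def roll_back(self) -> None:
        """Point traffic back at the previous environment."""
        self.live, self.idle = self.idle, self.live


router = Router(live="green", idle="blue")

# The smoke test here always passes; a real one would probe the environment.
if router.cut_over(smoke_test=lambda env: True):
    print(f"live traffic now on {router.live}")

# Something went wrong after the switch, so flip back.
router.roll_back()
print(f"rolled back, live traffic on {router.live}")
```

Because both operations are a single swap, the cutover and the rollback take the same amount of time, which is what makes this approach attractive when downtime matters.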

Feature Toggles

Feature Toggles (also referred to as Feature Flags) are a powerful technique that allows you to modify system behavior without changing code. The general idea is that you have a configuration file that defines a few toggles for a handful of pending features. The application will use the toggles to determine whether or not to show the new feature.

[Figure: feature toggle]
Image from: https://medium.com/@thicaso/1-minute-feature-toggle-e0b52a554ffd

Most of these decisions occur in the user interface of the application. There may be a set of toggles that surround any UI part of a pending feature. It will pass the new feature through if the toggle is enabled, if not, it will simply skip it.
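Here is a minimal sketch of that idea in Python. The toggle names, the JSON configuration, and the render_page function are all made up for illustration; a real application would load the configuration from a file or a toggle service rather than an inline string.

```python
import json

# Hypothetical toggle configuration, inlined to keep the example self-contained.
TOGGLE_CONFIG = '{"new_checkout": true, "dark_mode": false}'

toggles = json.loads(TOGGLE_CONFIG)


def is_enabled(name: str) -> bool:
    """Unknown toggles default to off, so a missing entry hides the feature."""
    return bool(toggles.get(name, False))


def render_page() -> list:
    """Build the list of UI sections, wrapping pending features in toggles."""
    sections = ["header", "cart"]
    if is_enabled("new_checkout"):
        sections.append("new checkout")
    else:
        sections.append("classic checkout")
    if is_enabled("dark_mode"):
        sections.append("dark mode switch")  # skipped while the toggle is off
    return sections


print(render_page())
```

Defaulting unknown toggles to off is a deliberate choice in this sketch: removing a toggle from the configuration can never accidentally expose an unfinished feature.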

Toggles introduce complexity. This complexity can be somewhat controlled by maintaining a clear process while using appropriate tools to manage the toggle configuration. It should be a goal to restrict the number of toggles in the system to the absolute minimum required.

This option seems to be a better fit for organizations with more mature CI/CD processes. Etsy and Flickr provide great examples of using this method to manage deployments.

Digging into Test Automation

Part of making processes more efficient is relying on the crucial component, automation. In DevOps, automation is a near-must for successful performance, because it reduces the number of repetitive tasks thus decreasing the time required for quality results. It is the biggest quality maintainer and speed promoter.

As it’s impossible to automate everything, it’s important to have an automation strategy to get maximum ROI from time and money spent. A properly planned strategy can increase the speed of development and free up teams to concentrate on more essential tasks.

Select the correct cases to automate

What cases do you choose for automation?

Repetitive tests, high-risk cases, large data sets or checks for different browsers and environments

Well…it depends…on the service you are developing and on your team’s capabilities. The goal is to automate the cases providing the most benefits for the development process and across the entire organization.

Implement automation throughout a sprint

Short release cycles can be achieved only if development and testing finish together at the end of the sprint. That’s why quality assurance should begin as early as possible. For example, consider unit testing with each build.

Continue to apply automated cases

It is important to build flexible tests because cases inevitably evolve over time. It may be better to write small cases rather than creating cases with dozens of steps at initial implementation. Consider separating tests into smaller steps and checking individual components rather than an entire app stack.

Types of Tests

Unit testing is the practice of testing small pieces of code, typically individual functions in an isolated manner. If the test uses some external resource, such as a database, it’s not a unit test.

Functional testing is the testing of complete functionality of an application.

As the name suggests, integration testing is testing how parts of the system work together – the integration of the parts. For example, a unit test for database access code would not talk to a real database, but an integration test would.

Unit and integration tests’ results are validated in code, whereas functional test results should be validated the same way as a user would validate it.
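To show the difference between a unit test and an integration test in code, here is a short Python sketch. The format_greeting function and the UserStore class are hypothetical, and I am using an in-memory SQLite database as the stand-in for a real database so the example stays self-contained.

```python
import sqlite3


def format_greeting(name: str) -> str:
    """Pure function with no external resources: a good unit test target."""
    return f"Hello, {name.strip().title()}!"


class UserStore:
    """Thin database access layer: exercised by the integration test."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT)")

    def add(self, name: str) -> None:
        self.conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

    def names(self) -> list:
        rows = self.conn.execute("SELECT name FROM users ORDER BY name")
        return [row[0] for row in rows]


def test_format_greeting():
    # Unit test: isolated, no database, validated in code.
    assert format_greeting("  ada lovelace ") == "Hello, Ada Lovelace!"


def test_user_store_round_trip():
    # Integration test: the code and a database engine working together.
    store = UserStore(sqlite3.connect(":memory:"))
    store.add("grace")
    store.add("ada")
    assert store.names() == ["ada", "grace"]


test_format_greeting()
test_user_store_round_trip()
print("all tests passed")
```

The functions follow the test_ naming convention, so a runner such as pytest could discover them as well; calling them directly just keeps the sketch dependency-free.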

Whenever code is modified, even a small tweak can have unexpected consequences. Regression testing ensures that a change or addition hasn’t broken any existing functionality. The goal is to catch bugs and to ensure bugs that were eradicated stay that way. As an example, re-running a test scenario that was originally created when a problem was initially fixed can help to validate new changes don’t cause components to fail.

Test Framework

After deciding what types of tests to run, the next step is determining success criteria and then automating the tests. A test framework establishes a set of rules for designing and creating the test cases. Typically, a framework combines practices and tools to increase efficiency.
Consider making the test environment closely resemble customer environments, while accommodating the differences between them. When testing, be sure to test all options for a particular variable. For instance, when conducting web-based GUI tests, make sure to test all major browsers. Don’t only test with Firefox and call it a day. Don’t forget to test scalability and security.

What is the point of running tests if there are no results? Don’t forget to account for how the metrics are reported.
I am fond of the test pyramid approach popularized by Mike Cohn. The pyramid says that tests on the lower levels are cheaper to write and maintain, and quicker to run. Tests on the upper levels are more expensive to write and maintain, and slower to run. Therefore you should have lots of unit tests, some service tests, and very few UI tests.
[Figure: test pyramid]
Most testing can and should take place during dev by running unit tests after every build. It is easy, cheap, and fast to conduct these tests and it allows for checking work as you go.
After all unit tests pass, move into the component, integration, and API testing phases. These tests validate most logical and business processes without going through the UI. Therefore, it’s recommended to automate these as much as possible.
UI tests run last and least; because these tests are costlier and more difficult, it is ideal to run as few as possible. Consider automating critical tests to remove the human element. From there, complete any manual tests. During this phase, it is critical to design based on user workflow. Start with user login and move forward from there.

If you are still interested in test automation, feel free to check out this corporate blog post I authored.

Get Mapped: Value Stream Mapping

twain

Value stream mapping (VSM) does exactly that: it is a DevOps framework (“borrowed” from manufacturing) that provides a structured way for cross-functional teams to collectively see where we are today (long release cycles, silos, damage control afterwards, etc.) and where we want to be in the future (short release cycles, infrastructure as code, iterative development, continuous delivery, etc.).

A VSM is a way of getting people to collaborate and see what is really happening. These exercises are often amazing “aha!” moment workshops that make three objectives (flow, feedback, and continuous integration) turn into a sustainable engine of improvement.

Who should participate in a VSM?

  •    Service Stakeholders and Customers
  •    Executors of Process Tasks
  •    Management

…but not all at the same time.

In principle, the VSM process assembles everyone involved with a workflow in the same room to clarify their roles in the product delivery process and identify bottlenecks, friction points, and handoff concerns. Realistically, if we include everyone at the same time, the likelihood of honesty decreases. Let’s be real: if upper management were in the room with you, would you be 100% honest about where the bodies are buried or exactly what each step of the process entails? VSM reveals steps in development, test, release, and operations support that waste time or are needlessly complicated, and this requires complete transparency.

Lead Time versus Time on Task

If you can’t measure it, you can’t improve it. Why do companies go for Continuous Delivery (CD)? Why do people care about DevOps? The main reason I hear is cycle time: the time it takes to get from an idea to a product or feature that customers can use. Measurement is one of the core foundations of DevOps, and the VSM is the measurement phase. If you do it right, it’s the sharing phase as well: share the measurements and proposed changes with the entire group. Doing that well allows you to start changing the culture at the same time.

Lead time vs time on task

With a solid foundation in place, it becomes easier to capture more sophisticated metrics around feature usage, customer journeys, and whether service level agreements (SLAs) are met. That information comes in handy when it’s time for road mapping and spec’ing out the next big project.

“Lead time” is a term borrowed from manufacturing, but in the software domain, lead time can be described more abstractly as the time elapsed between the identification of a requirement and its fulfillment.

The goal of developing a VSM is to measure how time is spent on each task and to identify the processes each task requires. It then becomes easier to see which processes are inefficient and create bottlenecks. Addressing those, in turn, reduces the lead time to deliver the finished release.
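The arithmetic behind lead time versus time on task is simple enough to sketch in a few lines of Python. Everything below is an assumption made for illustration: the step names, the hour values, and the Step and summarize helpers are hypothetical, and the “activity ratio” (time on task divided by lead time) is a term commonly used in VSM practice rather than one taken from this post.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    time_on_task_hours: float  # hands-on work for the step
    wait_hours: float          # queue or hand-off time before work starts

    @property
    def lead_time_hours(self):
        return self.time_on_task_hours + self.wait_hours


def summarize(steps):
    """Total the stream and flag the step with the longest wait."""
    total_lead = sum(s.lead_time_hours for s in steps)
    total_task = sum(s.time_on_task_hours for s in steps)
    bottleneck = max(steps, key=lambda s: s.wait_hours)
    return {
        "lead_time_hours": total_lead,
        "time_on_task_hours": total_task,
        "activity_ratio": total_task / total_lead,
        "bottleneck": bottleneck.name,
    }


# Hypothetical release process; the numbers are made up for the example.
steps = [
    Step("code review", time_on_task_hours=2, wait_hours=22),
    Step("build", time_on_task_hours=1, wait_hours=3),
    Step("QA testing", time_on_task_hours=8, wait_hours=40),
    Step("release approval", time_on_task_hours=1, wait_hours=43),
]

summary = summarize(steps)
print(summary)
```

With these made-up numbers, only 12 of the 120 hours of lead time are spent on actual work, which is the kind of gap a VSM exercise is meant to surface.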

Current State

The following VSM shows a current state analysis of a software release process. The main thing to note in this example is how linear it is: there are only two feedback loops, one at the very beginning and one towards the end, at new feature testing.

current state

The apparent lack of feedback loops presents a potential problem area – there are 8 steps between the two feedback loops. Imagine getting all the way to the end before realizing there’s an issue and providing feedback. How far will the software release be set back if the problem is not detected and communicated until the new release testing phase?

Future State

Once you have the current state VSM mapped, the next step is to figure out how to make the mapped process more efficient. This is typically driven by the following questions:

  • How can we significantly increase the percent complete and accurate work for each step in our current state VSM?
  • How can we dramatically reduce, or even eliminate, the non-productive time in the lead time of each current state step?
  • How can we improve the performance of the value-added time in each current state step?
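The first question is easier to reason about with a number attached. A common way to combine per-step percent complete and accurate (%C&A) values is to multiply them into a rolled %C&A for the whole stream. The sketch below assumes that convention; the step names and rates are hypothetical.

```python
from math import prod


def rolled_percent_complete_accurate(step_rates):
    """Multiply per-step %C&A rates (each between 0 and 1) into one figure."""
    rates = list(step_rates)
    for rate in rates:
        if not 0 <= rate <= 1:
            raise ValueError("each rate must be between 0 and 1")
    return prod(rates)


# Hypothetical current state: share of work each step passes on
# without needing rework downstream.
current_state = {"requirements": 0.90, "development": 0.80, "testing": 0.50}

rolled = rolled_percent_complete_accurate(current_state.values())
print(f"rolled %C&A: {rolled:.0%}")
```

Because the rates multiply, one weak step drags the whole stream down, which is why improving the worst step usually pays off first.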

future

Realistically, no VSM is perfect. However, the future state that we see above demonstrates a set of processes that create a mostly ongoing feedback loop. This allows for continuous communication about the processes and release as it moves forward towards a qualified build.

Demonstrating Business Value

A manufacturing plant would run one pipeline, one production line, at a time. As we know, the modern software development world is not like that.

A VSM is about more than just dissecting the software delivery lifecycle to find bottlenecks and pain points, although it is certainly helpful in that area. Analyzing value streams gives management confidence that the business is focusing on the right projects and initiatives. By taking a clearer look at the KPIs and metrics across the tooling and across the entire organization, these leaders can make informed decisions the way most business leaders prefer to: with data to back them up.

Delving into Immutable Infrastructure

Before getting too far into the topic, let’s first take a quick look at the difference between mutability and immutability.

  • Mutable: Continually updated, patched, and tuned to meet the ongoing needs of the purpose it serves.
  • Immutable: State does not change or deviate once constructed. Any changes result in the deployment of a new version rather than modification of the existing one.
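The same distinction can be sketched in Python as an analogy. The MutableServer and ImmutableServer classes, the image names, and the package version below are hypothetical and model the idea only; they are not how any provisioning tool works internally.

```python
from dataclasses import dataclass, field, replace, FrozenInstanceError


@dataclass
class MutableServer:
    image: str
    packages: dict = field(default_factory=dict)


@dataclass(frozen=True)
class ImmutableServer:
    image: str
    version: int


# Mutable: the running server is patched in place.
pet = MutableServer(image="app-image-v1")
pet.packages["openssl"] = "3.0.13"

# Immutable: a change produces a new version; the original is untouched.
v1 = ImmutableServer(image="app-image-v1", version=1)
v2 = replace(v1, image="app-image-v2", version=2)

try:
    v1.image = "patched-in-place"  # not allowed on a frozen dataclass
except FrozenInstanceError:
    print("v1 cannot be modified; deploy v2 instead")
```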
