Developing and maintaining infrastructure can be a frightening experience, one that Ops and SRE teams may shy away from - "Don't touch a running system". However, it doesn't have to be like that!

There are three core practices that will impact our software delivery and operational performance (reading tip: ‘Infrastructure as Code’ by Kief Morris):

  • Define everything as code
  • Continuously test and deliver all work in progress
  • Build small, simple pieces that you can change independently

If you are a modern builder of software, these are not a surprise to you. But did you know that these also apply to how we develop modern infrastructure?

In this blog post you will learn how to continuously test infrastructure, aka continuous integration.

Continuous integration of Infrastructure as Code

Software developers have gotten used to fast cycles without sacrificing quality and stability. This has been achieved through automated testing of all code changes as they are committed to Git. Changes that do not pass the automated testing are not allowed to be merged into our code-base. Automated testing provides the guard-rails that allow fast software development.

The same continuous integration practices can be applied to infrastructure. We can define policies around properties of our infrastructure, and these policies can be validated through tests. Thereby we achieve the mix of speed, stability and quality we have come to expect from software development.
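As a sketch of what this could look like in practice, the following hypothetical pipeline step (here in GitHub Actions syntax; the file name, directory names and job names are examples only) runs Conftest against Kubernetes manifests on every pull request:

```yaml
# .github/workflows/policy-check.yml - hypothetical example pipeline
name: policy-check
on: [pull_request]
jobs:
  policies:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Assumes Conftest is installed on the runner and that Rego
      # policies live in a 'policy/' directory in the repository
      - name: Validate manifests against policies
        run: conftest test manifests/ --policy policy/
```

A failing policy check fails the pipeline, so the change cannot be merged - exactly the guard-rail behaviour we know from automated software tests.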

Infrastructure is not just about servers

In this context, by ‘infrastructure’ we mean anything that we can programmatically operate on through an ‘infrastructure as data’ approach. It includes for example:  

  • servers 
  • networking 
  • user authentication and authorization 
  • application deployments 
  • pipeline configuration 

A few typical examples would be anything we can manage with Terraform and all Kubernetes resource types.

Infrastructure as code vs. infrastructure as data

The terms ‘infrastructure as code’ and ‘infrastructure as data’ are often used interchangeably, but they mean different things. Also, ‘declarative infrastructure’ is very often used to describe what we here refer to as ‘infrastructure as data’. But the key point is:

We can make assertions against infrastructure defined as data. This is much more difficult with infrastructure as code.

The following command creates a VM instance on the AWS cloud using the AWS CLI. A specific machine image (an AMI, ‘Amazon Machine Image’) and machine configuration (‘instance type’) are specified.

$ aws ec2 run-instances --image-id ami-xxxxxxxx \
                 --count 1 --instance-type t2.micro

This command could be placed in a script and we would have ‘infrastructure as code’. However, imagine that we wanted the script to be idempotent, i.e. running it twice should still only create one VM. This could be achieved with some scripting logic, but our script might eventually end up quite complicated (and error-prone).
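To illustrate why, an idempotent version of the script would need logic roughly like the following pseudocode sketch (identifying ‘our’ VM via a ‘Name’ tag is a hypothetical convention):

```
# pseudocode - only create the VM if it does not already exist
existing = aws ec2 describe-instances --filters "Name=tag:Name,Values=myserver"
if existing is empty:
    aws ec2 run-instances --image-id ami-xxxxxxxx \
        --count 1 --instance-type t2.micro      # plus tagging it 'myserver'
# ...plus error handling, waiting for state transitions, handling terminated
# instances that still appear in the listing, concurrent runs, etc.
```

Every additional property we want to manage (instance type changes, deletion, networking) adds more of this reconciliation logic to the script.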

An alternative approach would be to use e.g. Terraform, with which we could define the following ‘data structure’ describing the desired state of our infrastructure. This ‘infrastructure as data’ has the advantage that Terraform will manage all the logic needed to reconcile our desired state with the actual infrastructure state. Kubernetes works in a similar way, albeit with different ‘data structures’ (aka ‘Kubernetes resource YAML’).

resource "aws_instance" "server" {
  ami           = "ami-xxxxxxxx"
  instance_type = "t2.micro"
  ...
}

Besides e.g. Terraform or Kubernetes handling some of the complexity of the infrastructure management, another benefit comes from the way the infrastructure is defined. With infrastructure as code, it is generally very difficult to reason about the effect and correctness of the code. Any tool doing such verification would also need to support multiple languages and external tools.

For example, imagine shell scripting using a mix of the AWS CLI together with jq, sed, awk, grep etc. With the data-structure approach it is much simpler since the data structure is typically a relatively simple format and only the schema of the individual objects in the data structure are domain-specific (e.g. AWS Terraform resources or Kubernetes resources).

Because of this, tools like Open Policy Agent work on data, and their use cases include infrastructure as data.

Open Policy Agent and related tools

Now let us look at how to use these tools to implement policies on our infrastructure as data:

Open Policy Agent (OPA) is the foundation tool on which the following tools are built. OPA is a policy engine that allows us to define policies as code that our infrastructure can be tested against. Policies are defined in a language called Rego.

Conftest is a tool that integrates OPA and accepts structured data in a variety of formats such as YAML and JSON. This data could be e.g. Kubernetes resources in YAML format or a Terraform plan in JSON, and Conftest allows us to evaluate Rego policies against this data. Conftest is ideal for validating infrastructure as part of a CI pipeline.

Regula is a Rego library that can be used together with Conftest and which is specially designed for Terraform infrastructure data.

GateKeeper is an integration of OPA with Kubernetes. GateKeeper is a Kubernetes admission controller that governs Kubernetes resource operations through the Kubernetes API. For example, it can allow or deny Pod creation based on Rego policies evaluated by OPA against the Pod definition. GateKeeper is thus not used during the CI process (there we use Conftest), but it should be used with the same policies as used for Conftest to govern what actually gets deployed on Kubernetes.

The Rego language

Rego is a data query language. At the core of the language is the concept of rules, which consist of queries that assert a certain state of the data being queried.

Let’s look at a simple example using the following dataset:

creatures := [
  {
    "name": "bob",
    "human": true,
    "likes": [ "apples", "pizza"]
  },
  {
    "name": "alice",
    "human": true,
    "likes": [ "fish" ]
  },
  {
    "name": "joe",
    "human": true,
    "likes": [ "pancakes", "fish" ]
  },
  {
    "name": "felix",
    "animal": "cat",
    "likes": [ "fish" ]
  }
]

favorite_food := { "fish", "bananas" }

We can query this dataset using Rego rules, creating new datasets based on a set of conditions. In that sense, Rego rules are similar to database queries. The following rule creates a dataset with all human creatures from the dataset:

humans[name] = creature {
  creature := creatures[_]
  creature.human
  name = creature["name"]
}

The code lines of this rule should be read as:

  1. Rule name is ‘humans’, taking an input ‘name’ and returning a dataset based on ‘creature’ assignments in the body of the rule.
  2. Assign ‘creature’ any value from the ‘creatures’ dataset - the underscore variable means ‘loop variable which we do not care about the value of’. Inside this rule, ‘creature’ will take on all four values from the ‘creatures’ list, i.e. this is somewhat similar to a for-loop.
  3. Assert that ‘creature’ is a human. If any condition evaluates to false, the resulting dataset will not contain the specific value of ‘creature’.
  4. Assign name from the specific ‘creature’.

Evaluating the rule gives:

> humans
{
  "alice": { ... },
  "bob": { ... },
  "joe": { ... }
}

Notice that we did not specify a value for the input ‘name’, which means that this variable is not bound to a specific value and the rules are thus resolved without constraint on the value of ‘name’. If we had specified a value, our generated dataset would be accordingly limited:

> humans["joe"]
{
  "name": "joe",
  ...
}

In Rego, the equal sign ‘=’ means both comparison and assignment, and it can assign both left-to-right and right-to-left, so it should be seen as a mathematical equation. The following line from the rule above was an assignment in the first rule evaluation, because we did not specify a value for ‘name’, and a comparison in the latter evaluation.

name = creature["name"]
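For completeness: besides the unifying ‘=’, Rego also has dedicated operators with an unambiguous meaning, which can make rules easier to read:

```rego
x := 1    # ':=' always assigns (and declares a local variable)
x == 1    # '==' always compares, never assigns
y = 1     # '=' unifies: assigns if 'y' is unbound, compares if it is bound
```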

 

The order of statements and even the order of elements on the left and right side of the equal sign ‘=’ does not matter. So the following rule definition is identical to the one above:

humans[name] = creature {
  creature["name"] = name
  creature.human
  creature = creatures[_]
}

We can join our two datasets with a rule like the following:

likes_favorites[creature] {
  some food
  creature := creatures[_]
  creature["likes"][_] == favorite_food[food]
}

In this rule, the join of the favorite_food dataset and the list of what each creature likes is in the last line. Again, note the underscore used to index into the ‘likes’ list - we do not care about the actual index in this case.

A typical way to work with Rego is to build basic rules which are then combined into more advanced rules. The following rule combines the two previous rules to return a list of humans who like the food in the favorite list:

favorites[name] {
  some creature
  likes_favorites[creature]
  name := creature["name"]
  humans[name]
}

If we execute this rule we get:

> favorites
[
  "alice", "joe"
]

Example - AWS tagging policy

Tagging is a very useful technique when deploying resources on the AWS cloud platform. Imagine we have a company policy that states that all resources should be tagged with an ‘Owner’ tag that defines who is responsible for a given resource. How can we do that with Rego?

The following policy (adapted from Regula project examples) implements this tagging requirement using Conftest and Regula.

First we define a set of AWS resource types that support tagging (abbreviated for clarity) and a set of tags that we require on our resources. The aws_instance type is the VM resource type we saw above, and the following two set definitions do not look very different from other languages.

taggable_resource_types = {
  "aws_vpc",
  "aws_subnet",
  "aws_instance",
  ...
}

required_tags = {
  "Owner",
  "Environment"
}

Next, we define a Rego rule which joins our AWS resource list with the list of AWS resource types that can be tagged. This is very similar to our ‘humans’ rule example above:

taggable_resources[id] = resource {
  some resource_type
  taggable_resource_types[resource_type]
  resources = fugue.resources(resource_type)
  resource = resources[id]
}

Next we define a Rego function that, given a resource, compares the tags on that resource with our list of required tags.

is_improperly_tagged(resource) = msg {
  keys := { key | resource.tags[key] }
  missing := required_tags - keys
  missing != set()
  msg = sprintf("Missing required tag %v", [missing])
}

Finally, we define the policy. Given the result of the Rego rule/function above we decide if we will allow or deny a given resource:

policy[r] {
  resource = taggable_resources[_]
  msg = is_improperly_tagged(resource)
  r = fugue.deny_resource_with_message(resource, msg)
}

Example - Kubernetes Service Type Policy

If your team is deploying multiple services to Kubernetes, you may want to manage load balancers and TLS certificates globally (e.g. using a multi-tiered traffic routing architecture). In such a situation, we want to ensure that teams do not deploy Kubernetes services of type ‘LoadBalancer’. What would a Rego policy for that look like?

The following policy implements this Kubernetes service type requirement, using ‘input’ as the source of the Kubernetes resource being evaluated.

name = input.metadata.name

kind = input.kind

is_service {
  kind = "Service"
}

warn[msg] {
  is_service
  input.spec.type = "LoadBalancer"
  msg = sprintf("Found service %s of type %s", [name, input.spec.type])
}

You might now ask: "Why not just use Kubernetes RBAC to limit what can be deployed?"

The difference is that with Rego policies, we validate our infrastructure specification before deployment, which is different from enforcing a policy once the infrastructure has been deployed. This is similar to how we run tests as part of CI pipelines to catch errors before deployment.

Enforcing policies with GateKeeper

The previous two policy examples have shown how to validate infrastructure resources as e.g. part of a CI pipeline, before changes are accepted for deployment. Rego policies can similarly be enforced on Kubernetes through the GateKeeper Kubernetes admission controller. GateKeeper policies are implemented in Rego and configured in Kubernetes through custom resource definitions. See e.g. Kubernetes Gatekeeper Service type template for a policy very similar to the one above for Kubernetes LoadBalancer service types.
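As a sketch of what this looks like on the Kubernetes side, a GateKeeper ConstraintTemplate wraps a Rego policy in a custom resource (the names below are hypothetical; note that GateKeeper presents the resource under ‘input.review.object’ rather than directly as ‘input’):

```yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sdisallowloadbalancer        # hypothetical template name
spec:
  crd:
    spec:
      names:
        kind: K8sDisallowLoadBalancer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdisallowloadbalancer
        violation[{"msg": msg}] {
          input.review.object.kind == "Service"
          input.review.object.spec.type == "LoadBalancer"
          msg := sprintf("Service %s must not be of type LoadBalancer",
                         [input.review.object.metadata.name])
        }
```

Because the admission-time object differs slightly from the raw manifest, the Rego body usually needs this small adaptation compared to the Conftest version, even though the policy logic is the same.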

Typical things your policies could verify:

  • Disallow hostPath volumes
  • Container resource limits and requests being set (and being set to sane values)
  • Container images being pulled from specific registries, possibly also that container tags are digests and not mutable tags (like ‘1.0’)
  • Disallow certain node port ranges
  • Required tags and/or unique tags
  • Ingress paths being valid and/or unique
  • Pods having readiness and liveness probes
  • Containers not running as root

GateKeeper policy templates can be configured using parameters, so different policies can be configured for different namespaces. See for example the GateKeeper ContainerLimits policy.

Limitations of declarative tests

When infrastructure is defined using a declarative ‘as data’ approach, there is a limit to how much testing makes sense. Testing quickly ends up with us testing our own declarative statements, like the following:

foo = 42
if foo != 42:
    error()

Declarative testing is most useful for, for example, policies and interfaces between teams, i.e. where the assignment and the test above are the responsibility of different parties.

The ‘data assertions’ shown in this blog post can also only make assertions on the data available. Complex systems may have data in many different places and formats, often with data access segregated for security reasons. This can make it difficult, or even impossible, to obtain the data necessary to implement tests.

Final words

We have seen how to write policies for Terraform and Kubernetes resources. But Rego and the OPA-based tools can be used for a wide variety of policy validation. For example, if your pipelines are defined in YAML, why not validate changes with a Rego-based policy before accepting any changes?

The learning curve for Rego might be a bit steep, but since Rego can be used across so many types of infrastructure, it might be time well spent to avoid using a multitude of different policy-related tools. The important thing is that your infrastructure is defined ‘as data’, not ‘as code’ - the ‘data structure query’ approach of Rego will not work with ‘infrastructure as code’.

Rego can also include data sourced from external systems in policies, so the scope of possible validations is wide.

But there are things Rego policies cannot do. Rego policies can be considered to be in the ‘component test’ category for your infrastructure. When it comes to dynamic issues like application component interactions, network latencies and failures, and characteristics tests, Rego policies are less well suited. But don’t let that detract from the improved speed and robustness Rego policies can add to your infrastructure continuous delivery!

Published: December 17, 2021

DevOps · Cloud · CI/CD