Deploying all your DevOps-related tools to AWS can be a real headache.
How do you network, operate and maintain these completely different tools so they work together? And how do you do it without making your cloud costs explode?
Your DevOps tooling can easily become your most critical system, because everything goes through it. If that system goes down, you can’t deploy anything anywhere.
Maintenance doesn’t stop, so there are plenty of opportunities to go completely wrong both today and tomorrow. Trust me, I have been helping companies manage their DevOps tooling for many years, and I know where companies go wrong when it comes to DevOps tooling on AWS.
If you follow my five steps below, you will save yourself and your company a lot of money and heartache when maintaining your tools on AWS.
Let’s jump right into the first one:
Step 1: Engineer your cloud network and features
Like other systems, AWS is not perfect. You have your load balancers, your subnets, and different networks that need to talk to each other.
You want to limit your network’s exposure. Not every service should be visible everywhere to everyone, so you add different sections.
In AWS there are different ways to network two accounts together so that two teams can work on their own accounts. Some of these ways are good, and some are bad.
You could use VPC peering, which pairs the Virtual Private Clouds of two accounts together. The problem with this approach is that the networks talk too freely with each other. You can’t control the traffic easily, so when you start doing this at scale, things break down completely: every network ends up talking to every other network.
There is now a better technology that solves this at scale, and it is the one we use: the transit gateway, a central hub that your networks attach to. New technologies like this pop up all the time to solve the problems of the cloud.
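To see why peering breaks down at scale, compare connection counts: full-mesh VPC peering needs a separate connection for every pair of networks, while a transit gateway needs only one attachment per network. A rough sketch of the math (illustrative only, not an AWS API call):

```python
def peering_links(n: int) -> int:
    """Full-mesh VPC peering: every pair of VPCs needs its own connection."""
    return n * (n - 1) // 2


def transit_gateway_attachments(n: int) -> int:
    """Transit gateway: each VPC attaches once to the central hub."""
    return n


for n in (3, 10, 50):
    print(n, peering_links(n), transit_gateway_attachments(n))
# 3 VPCs:  3 peerings vs 3 attachments
# 10 VPCs: 45 peerings vs 10 attachments
# 50 VPCs: 1225 peerings vs 50 attachments
```

The quadratic growth on the peering side is what makes "all networks are talking to each other" unmanageable past a handful of accounts.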
Key areas within cloud networks and features
When we talk about engineering your cloud network and features, the most important areas to look at are:
- User access: how users access your system.
- Internal system: how the system runs itself and what’s inside it.
- Machine-to-machine networks: for example, systems that Jenkins and GitLab need to reach in order to deploy.
- On-premise connections: although this becomes less important with time, perhaps you have something that needs to connect with on-prem.
Networking is the foundation of all security. Do it correctly, and everything is secure by design. Do it wrong, and nothing is secure anymore, because everything is open to everyone. It doesn’t matter what else you do: if your networks are open, you can’t patch things up afterward. You have to fix those connections too, which means rewriting things again and again.
Doing it wrong increases your costs in AWS, and it makes debugging hard. You will never know what broke the system.
- Secure your endpoints! Protect them behind load balancers, and never expose them directly.
- Create minimal access groups.
- Create frameworks, templates, and ready-made permission schemes that people can use.
- Constantly scan and test to discover what is still open.
- Make Infrastructure as Code (IaC) your baseline.
- Don’t enforce everything at once; roll it out bit by bit once you know something works.
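In practice, a "minimal access group" is an IAM policy that names specific actions on specific resources instead of wildcards. A hedged sketch of what such a policy document and a simple audit check might look like (the bucket name and actions here are made-up examples):

```python
# Illustrative least-privilege policy: specific actions on a specific
# bucket, instead of "Action": "*" on "Resource": "*".
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::example-artifact-bucket/*",
    }],
}


def has_wildcard_actions(doc: dict) -> bool:
    """Flag statements that allow every action -- a common audit check
    when scanning for what is still open."""
    for stmt in doc["Statement"]:
        actions = stmt["Action"]
        if not isinstance(actions, list):
            actions = [actions]
        if "*" in actions:
            return True
    return False


print(has_wildcard_actions(policy))  # False: no blanket permissions
```

A scan like this, run constantly against your real policies, is one concrete form of the "scan and test to discover what is still open" advice above.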
Step 2: Get your Infrastructure as Code right
Your application code already lives in your Git (or non-Git) repository, and your infrastructure definitions should live in the same place. You want to know what has changed, who made the change, and what it affects. This way you can plan and have an audit trail, and it becomes easier to reproduce things.
If you get this wrong, you will be doing ClickOps: clicking around in the UI until something breaks, with no idea what was changed and no easy way back to your IaC.
In this scenario, your infrastructure controls you; you don’t control the infrastructure. Infrastructure that is not version controlled would have to be immutable to stay under control, and it is not immutable.
Key areas within Infrastructure as Code
You should look at your entire infrastructure, but in large part it all comes down to your:
- Modules: the templates that you reuse, since you don’t want to rewrite everything (which causes errors). You can write your own modules, but I recommend using the official ones, so that others can build similar templates more easily.
- Clear files: standardize file naming and the way you deploy everything. For example, if you decide to use Terraform, use only Terraform rather than doing half of it in another tool. Don’t try to manage two languages in one team.
- Code reviews: automate them and make sure the automation stays in place. For example, Terraform keeps a “state” of your infrastructure, and running a plan shows you exactly what will change before you apply it. This prevents you from accidentally deleting something.
- Upgrades: your IaC can get outdated very quickly. Update it accordingly so you don’t end up on an old version that is no longer supported or simply doesn’t work anymore.
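One way to enforce the “clear files” point is a small check in CI that every module directory follows your standard layout. The file names below follow a common Terraform community convention (main.tf, variables.tf, outputs.tf); they are an assumption you should adjust to your team’s own standard:

```python
# Conventional Terraform module layout; adjust to your team's standard.
EXPECTED = {"main.tf", "variables.tf", "outputs.tf"}


def missing_files(module_files: set) -> set:
    """Return which conventional files a module directory is missing."""
    return EXPECTED - module_files


print(missing_files({"main.tf", "variables.tf"}))  # {'outputs.tf'}
```

Run against every module directory in CI, a check like this keeps naming drift from creeping in as the team grows.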
Failing to do this will cause all kinds of havoc. Your IaC is a living, breathing thing that you constantly need to update and develop.
This is not easy. What you are doing is documenting your whole infrastructure upfront. Instead of relying on automations, you have to define and specify what your infrastructure should look like. From the beginning, you need to decide “this is what we are going to deploy” and “this is what it is going to be”.
Follow proper coding standards rather than thinking of this as scripting. With good coding standards, you will do well with your IaC.
Step 3: Plan traffic and how to migrate
As I mentioned in Step 1 above: getting the networking wrong will quickly become expensive. Traffic is also about networking: what goes in and out at the top level.
If you have many artifacts — let’s say 6 TB of data — and you constantly pull 2 TB in and out of the cloud, your costs will skyrocket. Putting data into the cloud is cheap (ingress is typically free), but pulling it out of the cloud is expensive.
So you need to think about “what am I trying to do?” and “can I plan the traffic better?”.
This also relates to migration, since you also need to consider “how am I going to migrate data from on-premise into the cloud”. Eventually, everything will end up in the cloud, so you need to plan for how you can make that happen as cheaply and efficiently as possible.
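The 2 TB example above is easy to put rough numbers on. Using AWS’s public internet egress list price of roughly $0.09 per GB for the first pricing tiers (an approximation; the real rate is tiered and varies by region), a back-of-the-envelope estimate:

```python
def monthly_egress_cost(tb_per_day: float, price_per_gb: float = 0.09) -> float:
    """Rough monthly cost of pulling data out of the cloud.

    price_per_gb approximates AWS's internet egress list price;
    real pricing is tiered and region-dependent, and ingress is free.
    """
    gb_per_month = tb_per_day * 1024 * 30  # 30-day month
    return gb_per_month * price_per_gb


print(round(monthly_egress_cost(2)))  # about $5,530 a month at list price
```

Numbers like this are why “can I plan the traffic better?” is worth asking before the bill arrives, not after.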
Key areas within traffic and migration
There are many minor moving parts in this, but let us focus on the main areas you will benefit from the most. They are:
- Connections and integrations: which ones do you have? In DevOps today, you are connected to multiple points. What are you deploying, and what is going through the CI/CD system? What is, and what should be, in the cloud? And if something shouldn’t be in the cloud, you need a smart plan for that.
- Access management: you need a solid methodology for deciding who should have access to what, and where.
If you fail to plan for your traffic and migration, you will rack up huge, unnecessary costs.
You also invite unresponsiveness, and the user experience will suffer, which is likely the opposite of what you were looking for when you decided to move to the cloud in the first place. Rather than simplifying things, you add extra layers of complexity in the cloud and slow things down.
The most difficult thing about planning for traffic and migration is that… Well, it’s not something you have likely thought a great deal about before. So, while it’s not rocket science, it’s new. And new things are harder.
Actionable advice — a simple 1-2-3 process:
1. Start by analyzing what the integration points are, and where they are.
2. Look at how much data is involved — this you can measure.
3. Think about what you are trying to do, what is important at each integration, and what the easiest way to get there is. “Can we easily move that over there, and will it affect the user experience?”
Step 4: Manage monitoring, access, and logs
This one is quite self-evident, but in essence, you need control and traceability in the cloud. You need to know what is going on in the system. Who accessed what and when, and what did they do? For that, you need logs and access management.
And to know if something broke, you also need monitoring. If you move to the cloud and use, for example, an auto-scaling system, you need that monitoring in place. Because in the cloud, you are not “just monitoring” something: your monitoring metrics drive what your infrastructure actually does.
If you don’t manage your monitoring, you may for example crash the service during a major event because you didn’t consider that you had to scale up before the event.
Key areas of monitoring, access, and logs
We can easily break these key activities down into two parts:
- Observability: you observe how things are working.
- Traceability: you figure out who did what. For example, if the system is getting slow and there is nothing in the traceability logs, something else is probably going on in the system that needs your attention.
If you don’t do this, you will simply not know if your system is down, or who broke it. You don’t know which security threats you face.
One of the points of the cloud is that you have a system that can be redeployed multiple times. Some of the work can’t be managed unless you have the information stored somewhere else. You can’t observe the system if it is down, or know what happened just before it went down, as you already removed the corrupted part. And if the system slows down, you can’t react in time to fix it, as you already removed it via auto-scaling. This means you need to have metrics and logs stored outside of the system to efficiently debug possible issues in the system.
But monitoring is not easy. Because essentially you have two alternatives:
- You have far too much information.
- You have too little information.
If you have too much information, you are not going to do anything with it — like a 10,000-page book you are scared to even open. Too little, and you have a 50-pager that doesn’t give you enough information to be worth your while.
The challenge is to get the correct amount of information that is usable and doesn’t confuse you.
There are tons of “best practices” out there on the internet, but let me break it down simply for you:
- The trick is to start with “too much” information, then crop it down until you have too little, and finally scale it back up until you discover how much is “enough”. This is a trial-and-error process. There are no golden rules, so you have to find what works in each unique case.
- Make sure you monitor your endpoints. If they are down, your services are not available for your users, even if the service/code works. (The users don’t like that at all.)
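Endpoint monitoring can start as simply as classifying the HTTP status your users actually see: anything that is not a 2xx means the service is unavailable from their point of view, even if the code behind it works. A minimal sketch of the classification logic only (wiring it to a real probe and alerting system is up to your stack):

```python
def endpoint_healthy(status_code: int) -> bool:
    """Treat any 2xx as healthy; redirects, client errors, and server
    errors all mean the user is not getting the service."""
    return 200 <= status_code < 300


# A load balancer returning 503 means the service is down for users,
# even if the application code itself is fine.
print(endpoint_healthy(200))  # True
print(endpoint_healthy(503))  # False
```

Even a crude check like this, run from outside your network, catches the “code works but users can’t reach it” failures the advice above warns about.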
Step 5: Solve the ongoing maintenance and upkeep
You don’t just deploy your services once and let them run. There’s always an ongoing cost of running them.
This cost includes things like service support, end-of-life libraries and systems, integrations, and IP whitelisting just to mention a few — all kinds of ongoing maintenance activities that are required for your infrastructure.
If you simply leave your IaC as it is, it will quickly get old and eventually stop working.
You need upkeep, which takes resources. Most of these needs can be met with AWS technology, but you still need to follow and manage them. And like all other technology, sometimes these AWS solutions will reach their individual ends of life, too.
Key areas within maintenance and upkeep
Your maintenance and upkeep responsibilities can be broken down into the following parts:
- Security: simply how you upkeep your security.
- Maintainability/future-proofing: as with all IaC, you need to make sure you can keep making changes to your system. Once it gets too old, you can’t even upgrade it anymore.
- Knowledge of your software and architecture: you need to know what to do to make sure it stays alive.
- Backups and general data management: making sure your system gets backed up and is running, and that the data is available even if a region goes down.
- Monitoring: this is also part of the upkeep. Somebody needs to follow what the monitoring data says, if it is correct, or if you need to fix anything.
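A concrete way to act on the future-proofing point is to flag any module, provider, or tool whose version has lagged too far behind the latest release to upgrade safely. A hedged sketch using simple semantic-version comparison (real tooling such as Dependabot or your package manager’s outdated-check does this for you; the one-major-version threshold is an assumption):

```python
def parse(version: str) -> tuple:
    """Turn a version string like '1.4.2' into (1, 4, 2) for comparison."""
    return tuple(int(part) for part in version.split("."))


def is_stale(current: str, latest: str, max_major_lag: int = 1) -> bool:
    """Flag anything more than max_major_lag major releases behind --
    the point past which upgrades tend to become reverse engineering."""
    return parse(latest)[0] - parse(current)[0] > max_major_lag


print(is_stale("0.12.31", "1.4.2"))  # False: one major version behind
print(is_stale("0.11.0", "3.0.0"))   # True: several majors behind
```

Running a check like this on a schedule turns “somebody needs to follow this” into an automated nudge before the system becomes the one nobody dares to touch.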
If you fail at this maintenance, you will have a system everybody knows about, but people are afraid to touch. If you try to run an automation, it won’t work anymore. You may have to go back and investigate what you deployed six years ago, and launch into some time-consuming, expensive reverse engineering. And while doing that, you may discover that somebody has been hacking you for a year.
We live in a quarterly economy. If you are the person responsible for this maintenance work, you are not making new features to your service, or finding new ways to make money. Even if you do your maintenance perfectly, it won’t show directly on your company’s EBITDA even if it will affect it in the long term. What you do is not glamorous and doesn’t create value in the next quarter.
When it comes to this ongoing maintenance need, you essentially have three options:
- Outsourcing: More and more organizations avoid allocating their engineers to this type of work (so they can focus on developing the services), and instead outsource it. They sign a long-term maintenance contract with a company like Eficode, which will keep the tools optimized and maintained for as long as the contract runs.
- Having an in-house maintenance team: This team is responsible for, and has a budget for, all maintenance.
- Establishing common practices that you follow constantly: When you create something new, by default it uses established, well-working practices from other areas. This is the most difficult of the three options because you need to ensure everything new is backward compatible — and everything old is forward compatible. If not, you will constantly generate new technical debt for yourself.
It’s a real jungle out there. So many DevOps tools, plus cloud is still kind of new and constantly evolves.
The maintenance aspect is usually not the sexiest — we would rather focus on what’s new and what’s next — but if you lose focus, bad things happen. Things break, and money is thrown down a dark well.
Following and paying attention to my five steps will protect you from a lot of pain: pain that a lot of people in your shoes experience every day.
Published: November 16, 2022