Have you noticed that the cloud is not as straightforward as we are often led to believe?

A lot of businesses get carried away and start running towards the cloud, eager to realize all the obvious benefits you are supposed to get. It’s a gold rush, but many companies miss out on the gold. 

As an experienced “gold miner”, I want to share with you what you need to know about the Well-Architected Framework and how to use it. When used correctly, it is a fantastic way to make sure you get the most out of using AWS. 

First, here are three harsh truths:

  • It is hard to maintain good security while still offering developers and administrators easy access to cloud resources. 
  • Things fail. Without a proper data backup strategy and disaster recovery plans, your entire business will suffer when failure inevitably happens.
  • Even with a sound technical implementation, where availability, performance, and monitoring are all on point, the monthly bill can still be needlessly (jaw-droppingly) high.

Businesses of all types and sizes eventually come to a point where they ask themselves: “Are we really doing this the right way? Is there an easier, better, or cheaper way to run things in the cloud?”

To address these questions, AWS has published and continuously refined its “Well-Architected Framework” since 2015. It is an up-to-date, neat compilation of best practices that is freely available to everyone.

Read on below and learn about:

  1. What this best practice framework is

  2. Why it is relevant and how it can help you

  3. How to grade and implement the most relevant changes to align with best practices

So let’s begin with a simple introduction to the AWS Well-Architected Framework.

The Well-Architected Framework

In 2015 AWS published a whitepaper known as “AWS Well-Architected Framework”. Their aim was to crystallize all the best practices and architectural challenges and to address the main questions that arose with the upswing in cloud computing and services. 

The accumulated experience was presented in five pillars (now six) that were intended to cover all the architectural decisions to be made and to ensure the best possible outcome for workloads running in the cloud.

The six pillars define an extensive set of questions and considerations that focus equally on the technical and business sides.

These six pillars are:

  • Operational excellence 

  • Security

  • Reliability

  • Performance efficiency 

  • Cost optimization 

  • Sustainability

Each pillar contains a set of definitions, design principles, and best practices that help you:

  1. Shine light on areas that would be the most beneficial to improve
  2. Operate and consider different aspects of your business strategy in relation to the technical design and implementation 

Why you should perform Well-Architected Reviews

When companies first migrate to, or deploy new workloads in, the cloud, what they want is typically agility, and also to just “get going”. 

Gone are the days when a business needed to put money upfront for the hardware and infrastructure, which could be a massive hurdle, especially for startups that have uncertain futures. 

With the cloud, you easily deploy and scale infrastructure as needed, paying as you go. And if you need to halt operations, then simply shut down all your consumption in the cloud and pay no more. 

But this agility comes at a price for many companies:

The problem that typically occurs once you are in the cloud

Armed with the agility of the cloud, companies rush to shorten their time-to-market, and in doing so they cut corners on security, operations, and other important areas. These issues get worse with time, because as demand rises, the technical solutions in place grow in scale and complexity.

How that problem is best solved with a Well-Architected Review

Because your technical solutions keep getting more complex, it is important to build or improve a solid foundation of best practices as soon as possible, both by design and by policy. This applies especially to critical workloads and to operational routines such as playbooks and runbooks.

But keep in mind that this is not a one-time activity. 

With time you will experience configuration and operational drift. So once your Well-Architected Review is completed, it is important to revisit it as time passes. 

AWS also recommends doing a review in conjunction with major milestones in the development cycle, and then following good hygiene practices to prevent degradation of the workload design. As a rule of thumb, that means every 12 to 18 months.

Top 3 reasons to perform a Well-Architected Review

  1. Keep your workload secure, reliable, and tolerant of faults and outages

  2. Cut down costs (sometimes significantly) by applying smart cost savings plans or shutting down resources during inactive hours

  3. Highlight areas of improvement in your operations and resource lifecycles for higher performance efficiency
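
To make the second reason a bit more tangible, here is a rough back-of-the-envelope sketch of what shutting down a non-production instance outside office hours could save. The hourly rate and the schedule are made-up assumptions for illustration, not actual AWS prices.

```python
# Hypothetical figures for illustration only; real prices vary by instance type and region.
HOURLY_RATE_USD = 0.10   # assumed on-demand price per hour
HOURS_PER_WEEK = 24 * 7  # an always-on instance runs 168 hours a week


def weekly_cost(hours_running: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of one instance for a week, given how many hours it runs."""
    return hours_running * hourly_rate


def savings_fraction(hours_running: float) -> float:
    """Share of the always-on cost saved by running only `hours_running` per week."""
    return 1 - hours_running / HOURS_PER_WEEK


# Assumed schedule: 12 hours a day, Monday to Friday.
office_hours = 12 * 5

print(f"Always on:    ${weekly_cost(HOURS_PER_WEEK):.2f} per week")
print(f"Office hours: ${weekly_cost(office_hours):.2f} per week")
print(f"Saved:        {savings_fraction(office_hours):.0%}")
```

With this assumed schedule the instance runs 60 of 168 hours, so roughly two thirds of the always-on cost disappears. Your own numbers will differ, but the calculation stays the same.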

To get the full benefit of AWS, you simply have to take the Well-Architected Framework into account when you design your architecture. In the following section, I will show you how to easily implement the best practices with the help of a simple tool.

The Well-Architected tool

AWS’s “Well-Architected tool” is freely available through the AWS console, and is there to aid the application of best practices in the Well-Architected Framework. 

It does not collect any data from the actual workloads. Instead, it provides all involved parties (architects and customers/stakeholders) with a unified checklist of best practices, along with notes that can be shared between the AWS accounts of the customer and the service provider (for example Eficode).

Through the tool, the architect and workload owner view the workload through the lenses of the six pillars. 

Each pillar highlights different aspects of the workload, asks questions from a technical perspective, and looks at how the technical solution aligns with the owner's unique business objectives and key results. The questions vary in complexity, and not all are applicable. If a question is deemed irrelevant, it will simply be omitted from the aggregated results and overall score at the end of the review.

The final results highlight issues graded from medium to high risk. High-risk issues should be remediated as soon as possible, while medium-risk issues may not need to be tackled at all.
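
To illustrate how such a result could be tallied, here is a minimal sketch that counts high- and medium-risk issues per pillar and leaves out the questions marked as not applicable. The `Answer` class, the risk labels, and the sample answers are my own assumptions for the example. This is not the tool's actual export format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    pillar: str
    question: str
    risk: str  # assumed labels: "HIGH", "MEDIUM", "NONE", "NOT_APPLICABLE"


def summarize(answers):
    """Count issues per pillar and risk level, leaving out non-applicable questions."""
    summary = {}
    for answer in answers:
        if answer.risk == "NOT_APPLICABLE":
            continue  # omitted from the aggregated results
        summary.setdefault(answer.pillar, Counter())[answer.risk] += 1
    return summary


# Made-up answers for illustration.
answers = [
    Answer("Security", "How do you manage identities?", "HIGH"),
    Answer("Security", "How do you detect security events?", "MEDIUM"),
    Answer("Reliability", "How do you back up data?", "HIGH"),
    Answer("Reliability", "How do you plan your network topology?", "NOT_APPLICABLE"),
    Answer("Cost optimization", "How do you monitor usage and cost?", "NONE"),
]

for pillar, counts in summarize(answers).items():
    print(f"{pillar}: {counts['HIGH']} high, {counts['MEDIUM']} medium")
```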

What you look at in a Well-Architected Review: The six pillars

The six pillars are what the entire review is based on. You closely analyze and discuss each of them in turn. Below is a quick introduction and summary of the pillars and the key concepts that define them.

1. Operational excellence

Some operation is always necessary, no matter how small the workload is. Someone has to be there when things go wrong, need to be changed, or have to be removed at the end of a resource lifecycle. The key word here is excellence, meaning running workloads like a well-oiled machine to save time, pain, and money in the long run.

In essence, this pillar deals with how you can construct your operational work to achieve greatness in the most effective way possible. It consists of four best practice areas:

  • Organization: Understand your organizational priorities and structure, while also visualizing how the organization supports its team members.

  • Prepare: Understand your current workloads and their expected behavior. You will get insight into their status and the procedures that support them.

  • Operate: Visualize the health of your workload and operations. Also discover where a specific workload might be at risk, and how to respond appropriately.

  • Evolve: See where you can improve, and define previous lessons learned from operational activities and their success rate, for future incremental changes. 

A sample of what your Operational Excellence pillar summary could look like

 

  • Use automation where possible. 
  • Make frequent, small, and reversible changes. 
  • Refine operations procedures frequently. 
  • Learn from all operational failures. 
  • Anticipate failure.

2. Security

This pillar focuses on how to take advantage of the latest cloud technology to make your workloads safer. Security was historically seen as a bit bland, but it has become all the more important in recent times. In a way, the pillar builds on top of operational excellence and addresses how to achieve good operational hygiene and robust technical architecture in a safe and secure manner.

Security presents six best practice areas:

  • Foundation: Learn how a foundation should be built to work with your resources in the most secure way possible.

  • Identity and access management: Discover how to create a robust and secure way for users to access resources inside your workloads. It consists of:

    • Identity management: How you manage personnel and workload identities, such as applications, operational tools, and components that need to make requests to your AWS resources.

    • Permissions management: How you manage security with policies, boundaries, attribute-based access control (ABAC), and service control policies (SCPs).

  • Detection: Visualize where potential threats, misconfigurations and/or unexpected behaviors might disrupt your workloads.

  • Infrastructure protection: Use different best practice methodologies to keep your infrastructure as safe as possible. Protect your resources from security threats (both unintended ones and unauthorized access), and find potential vulnerabilities.

  • Data protection: Learn how to classify your data and how to encrypt it, for example both at rest and in transit.

  • Incident response: Implement different mechanisms to respond to and mitigate future security incidents.

A sample of what your Security pillar summary could look like

 

  • Segregate different workloads by account based on their function and compliance, or data sensitivity requirements. 
  • User access should be granted using a least-privilege approach with best practices, including password requirements and MFA enforced. 
  • It is critical to analyze logs and respond to them so that you can identify potential security incidents. Enforcing boundary protection, monitoring points of ingress and egress, and comprehensive logging, monitoring, and alerting are all essential to an effective information security plan. 
  • Ensure that you have a way to quickly grant access for your security team, and automate the isolation of instances as well as the capturing of data and state for forensics.     
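
As an illustration of the MFA point above, here is a minimal sketch of an IAM policy document, built as a Python dictionary, that denies most actions when a request is made without MFA. It follows a widely used pattern based on the `aws:MultiFactorAuthPresent` condition key. The list of actions that stay allowed without MFA is an assumption you would need to adapt and test for your own setup, and the function name is hypothetical.

```python
import json

# Assumed minimal set of actions a user still needs in order to set up MFA.
ALLOWED_WITHOUT_MFA = [
    "iam:ChangePassword",
    "iam:CreateVirtualMFADevice",
    "iam:EnableMFADevice",
    "iam:ListMFADevices",
    "iam:ResyncMFADevice",
    "sts:GetSessionToken",
]


def deny_without_mfa_policy(allowed_actions=ALLOWED_WITHOUT_MFA):
    """Build a policy document that denies everything except `allowed_actions`
    whenever the request was not authenticated with MFA."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptListedIfNoMFA",
                "Effect": "Deny",
                "NotAction": list(allowed_actions),
                "Resource": "*",
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


print(json.dumps(deny_without_mfa_policy(), indent=2))
```

Keep in mind that a deny statement like this grants nothing by itself. Least privilege still depends on the allow policies you attach next to it.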

3. Reliability

This is where you dig deep into workloads and assess if their intended functions perform how you want them to. Is the workload self-healing? Do you continuously run tests during its life cycle? Are the workload and its data compliant with the uptime and redundancy requirements?

Reliability focuses on four best practice areas:

  • Foundations: See how you can build a solid foundation that stretches beyond a single workload. 

  • Workload architecture: A reliable workload starts with upfront design decisions for both software and infrastructure. Your architecture choices will impact your workload behavior across all Well-Architected pillars. For reliability, there are specific patterns you must follow.

  • Change management: Describe how to monitor your resources, implement changes, and design your workloads to be as open to changes as possible.

  • Failure management: Find out, for example, what you need to do to make your workload resilient, how to manage your backups and testing, and how to plan for disaster recovery.

A sample of what your Reliability pillar summary could look like

 

  • Architect a workload to automatically add and remove resources in response to demand. This not only increases reliability but also ensures that business success doesn't become a burden.

  • With monitoring in place, your team will be automatically alerted when KPIs deviate from expected norms. 

  • Automatic logging of changes to your environment allows you to audit and quickly identify actions that might have impacted reliability. 

  • Controls on change management ensure that you can enforce the rules that deliver the reliability you need. 

  • Regularly back up your data and test your backup files to ensure that you can recover from both logical and physical errors. 

  • A key to managing failure is the frequent and automated testing of workloads to cause failure, and then observing how they recover.

4. Performance efficiency

In the performance efficiency pillar, you look at how to use AWS resources and services efficiently, but also how to maintain efficiency over time as the workload demand increases and newer technical solutions evolve.

Performance efficiency has four best practice areas:

  • Selection: Make sure you have selected the right solution for your business needs. You do this by evaluating the existing resources you have selected for your workload in terms of:

    • Performance architecture 

    • Compute architecture 

    • Storage architecture 

    • Database architecture 

    • Network architecture

  • Review: Ask a number of serious questions about how your performance review process works, such as: Are you still using outdated resources and services? Do you have a process for improving workload performance?

  • Monitoring: Determine how to best use and set up monitoring and alarms.

  • Trade-offs: Reflect on what kind of trade-offs you’re willing to make for a better-performing workload, such as trading consistency for time and latency.

A sample of what your Performance efficiency pillar summary could look like

 

  • When architecting for performance, take advantage of the elasticity mechanisms available to ensure you have sufficient capacity to sustain performance as demand changes. 

  • When selecting a storage solution, ensuring that it aligns with your access patterns will be critical to achieving the performance you want.

  • Databases are often chosen according to organizational defaults rather than through a data-driven approach. As with storage, it is critical to consider the access patterns of your workload and also to consider whether non-relational database solutions could solve the problem more efficiently (such as a graph, time series, or in-memory database).

  • By taking advantage of Regions, placement groups, and edge services, you can significantly improve network performance. Networks in the cloud can easily be improved over time as they can be quickly rebuilt or modified.

  • Ensuring that you do not see false positives is key to an effective monitoring solution. Automated triggers avoid human error and can reduce the time it takes to fix problems. 

  • Use a systematic approach, such as load testing, to explore whether your trade-offs improve performance.

5. Cost optimization

Cost is often a driver for migration to the cloud. Being able to access compute resources on demand and pay only for what you use can be a huge factor in business success. As time passes and workloads increase with demand, there will be an inevitable increase in costs as well. Therefore it is important to thoroughly understand current expenditure and find areas of improvement, which are often low-hanging fruit.

The cost pillar touches on five different best practice areas:

  • Practice cloud financial management (CFM): Realize where your business value lies, and how to improve your finances by optimizing costs.

  • Expenditure and usage awareness: Understand how to manage your costs and usage as effectively as possible. 

  • Cost-effective resources: Evaluate which resources, services, and configurations to use, to lower costs.

  • Manage demand and supply resources: Analyze your workload demands, for example whether you can supply resources dynamically instead of statically.

  • Optimize over time: Develop a workload review process to evaluate new, cost-saving AWS services and adopt them in your existing workloads.

A sample of what your Cost optimization pillar summary could look like

 

  • As with the other pillars, there are tradeoffs to consider. For example, whether to optimize for speed-to-market or cost. In some cases, it’s best to optimize for speed—going to market quickly, shipping new features, or simply meeting a deadline—rather than investing in up-front cost optimization.

  • Design decisions are sometimes directed by haste rather than data, and the temptation always exists to overcompensate “just in case”, rather than spend time benchmarking for the most cost-optimal deployment. This might lead to over-provisioned and under-optimized deployments. 

  • Investing the right amount of effort in a cost optimization strategy upfront allows you to realize the economic benefits of the cloud more readily by ensuring consistent adherence to best practices and avoiding unnecessary over-provisioning.

6. Sustainability

This recent addition to the pillars (2021) shines a light on how your business activities impact our environment, economy, and society. It also describes what kind of cloud processes and best practices can minimize your environmental footprint.

The sustainability pillar boils down to three major areas.

  • Cloud sustainability: Understand the shared responsibility model where AWS is responsible for optimizing the sustainability of the cloud, while you, the customer, are responsible for optimizing your workloads and resource utilization inside of it.

  • Improvement processes: Evaluate how to minimize your environmental footprint by re-architecting solutions. Eliminate waste, manage resources with low utilization, and squeeze the most value out of your current cloud resources.

  • Best practices for sustainability in the cloud: Understand the best practices for increasing energy efficiency and maximizing utilization of resources.

A sample of what your Sustainability pillar summary could look like

 

Sustainability in the cloud is a continuous effort focused primarily on energy reduction and efficiency across all components of a workload, by achieving the maximum benefit from the resources provisioned and minimizing the total resources required. This effort can range from the initial selection of an efficient programming language, adoption of modern algorithms, and use of efficient data storage techniques to deploying to correctly sized and efficient compute infrastructure and minimizing requirements for high-powered end-user hardware.

How to perform a Well-Architected Review

Now that you are familiar with the six pillars, we can move on to look at how to practically perform the review itself. 

Every organization is different and whatever works, works. But we here at Eficode do this all the time for our clients, and our process works really well, so I will share it with you right now for inspiration.

The process is quite straightforward. 

Step 1: Preparation

We need the following from the owners of the workload:

  • The workload that is to be reviewed. It has to be a production workload to be eligible for AWS Credits at the end (more on this later).
  • An architectural diagram of current architecture (if there is one)
  • The AWS account ID where the workload resides
  • The ID of the AWS account that the review will be shared with (this could be any account, but if there is an Audit account within a Control Tower structure, this can be a good target)
  • The region the workload operates in

Step 2: Kickoff 

We schedule a kickoff meeting where we go through the process in its entirety, along with scheduling timeslots for all workshop sessions. 

Step 3: Workshops 

Each pillar will have a dedicated two-hour review. This might seem a bit long-winded, but often this proves to be just the right amount of time. Some pillars might take a little longer and others a little less.

After the pillars, two more workshops are needed to conclude the review. Once all pillars are reviewed and a report is generated from the W-A tool, we ask you to study the review for the upcoming workshops. 

In a prioritization workshop, we grade each HRI (high-risk issue) on a diagram that plots how easy an issue is to mitigate against the impact it would have. At the end, we place each issue with its proposed action in an agreed-upon timeline.

So, this is a great structure for the workshops:

  • Kickoff meeting: Book all workshops and find mutually available time slots.

  • Pillars 1-6. (12h total)

  • Prioritization workshop: Prioritize the HRIs based on their impact and ease of implementation. (2h)

  • Actions and Roadmap workshop: Place all HRIs and their proposed actions in a scoped timeline. (2h)

Total time for workshops: 16h
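
To give you an idea of what the prioritization workshop boils down to, here is a small sketch that sorts issues by impact and ease of implementation and assigns each one to a quadrant. The 1 to 5 scales, the quadrant names, and the sample issues are assumptions I made up for the example. In a real workshop, the grading comes out of the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    title: str
    impact: int  # 1 (low) to 5 (high), as agreed in the workshop
    effort: int  # 1 (easy) to 5 (hard)


def quadrant(issue: Issue, threshold: int = 3) -> str:
    """Place an issue in one of four quadrants of an impact/effort diagram."""
    high_impact = issue.impact >= threshold
    easy = issue.effort < threshold
    if high_impact and easy:
        return "quick win"
    if high_impact:
        return "major project"
    if easy:
        return "fill-in"
    return "reconsider"


def prioritize(issues):
    """Order issues by highest impact first, then by lowest effort."""
    return sorted(issues, key=lambda issue: (-issue.impact, issue.effort))


# Made-up high-risk issues for illustration.
issues = [
    Issue("Test backup restores", impact=4, effort=2),
    Issue("Introduce a multi-account structure", impact=5, effort=4),
    Issue("Enforce MFA for all users", impact=5, effort=1),
]

for issue in prioritize(issues):
    print(f"{issue.title}: {quadrant(issue)}")
```

The resulting order is a natural starting point for the Actions and Roadmap workshop, where quick wins usually land early in the timeline.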

Step 4: Remediation and funding 

After the workshops, the workload owner has a concrete list of issues that need to be remediated, a list of proposed actions to fix the issues, and a timeline with the best order of implementation for these actions. It is basically a blueprint of everything that needs to be done, with the order and timetable.

If you work with a partner (such as Eficode), they are now well prepared to implement these remediation actions.

If together we manage to fix 45% of all HRIs (not counting medium-risk issues), the workload is eligible for a $5,000 AWS credit. This would most likely cover the entire cost of the review, and maybe then some.
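
If you want to keep an eye on that threshold during remediation, the check is simple arithmetic. The sketch below uses the 45% figure from above and assumes that reaching exactly 45% counts. The function name is hypothetical, and the actual eligibility rules are decided by AWS.

```python
def meets_remediation_threshold(total_hris: int, fixed_hris: int,
                                threshold_percent: int = 45) -> bool:
    """Return True if the share of fixed high-risk issues reaches the threshold.

    Medium-risk issues are not part of either count.
    """
    if total_hris <= 0:
        raise ValueError("total_hris must be positive")
    if not 0 <= fixed_hris <= total_hris:
        raise ValueError("fixed_hris must be between 0 and total_hris")
    # Integer comparison avoids floating-point rounding at the boundary.
    return fixed_hris * 100 >= total_hris * threshold_percent


# Example: a review that found 20 high-risk issues.
for fixed in (8, 9, 12):
    print(f"{fixed} of 20 fixed: {meets_remediation_threshold(20, fixed)}")
```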

In summary 

Now you have some more context on the complexities around cloud, but you also know how to meet the main challenges. AWS has been proactive in creating tools to spot weaknesses and good practices to fix them.

But just like your business in general, the cloud and your architecture in it constantly change. This means you need to use your knowledge and the tools at hand to make sure you remain reliable, secure, cost-effective, and efficient. Whether you choose to go it alone or work with an experienced partner, you now know the fundamentals of the Well-Architected Framework and have an action plan in the form of the Well-Architected Review.

Over to you.  

Published: Dec 9, 2022
