The Direct Shopper Technology team at the LEGO Group has taken LEGO.com on a transformation journey from on-premise monolith to serverless on AWS, and is now exploring the new opportunities and agility available with this platform.
Enjoy a replay of The DEVOPS Conference sessions conveniently through the DevOps Sauna podcast. This talk is hosted by Nicole Yip - Engineering Manager, Direct Shopper Technology, the LEGO Group, and Gunnar Grosch - Senior Developer Advocate, AWS.
Next stop: we have the fascinating story from Lego and their journey from an on-premise monolith to a serverless architecture on AWS.
Yeah, let's get started. So we have Nicole Yip, the Engineering Manager for Direct Shopper Technology at Lego. We'll be joined by Gunnar Grosch, a Senior Developer Advocate at Amazon Web Services. Welcome Nicole and welcome Gunnar.
Hi everyone, and welcome to the evolution of lego.com to serverless and beyond. So my name is Gunnar Grosch and I am a developer advocate at AWS. You can see the minifigure me right below my name, and I'm delighted to be joined on stage by Nicole. So hi, Nicole.
Hey, Gunnar. So I'm Nicole Yip. I'm an engineering manager at the Lego group and I'm in the direct shopper technology team, and I chose a little rocket man for my mini figure.
So we have a lot to cover in this session, and Nicole will talk us through the journey of how lego.com went serverless and the learnings that they've had. And along the way, I'll chip in with some context as to what a serverless-first strategy entails. So Nicole, introduce us to lego.com.
Thank you. So I'm going to focus on the journey of the team behind lego.com, specifically the e-commerce pages, where you browse for products, add to bag, maybe redeem VIP rewards, and then complete your order and checkout. This is the direct shopper technology team, and we have engineers in the UK, Denmark and the United States. Our main challenge is very typical to the world of e-commerce. The traffic patterns are extremely spiky, regular product launches, and sales events drive large numbers of people to our site at very specific points in time. This diagram shows our typical traffic pattern on one of our busiest sales days, and the problem is every year, those spikes keep getting taller. The number of visitors to our site keeps growing. And now imagine trying to tackle all of that spikiness, all of that yearly increasing demand with an on-premise monolith that's tightly coupled with systems that don't scale.
And that leads me on to our drivers for change. Back in 2017, we had a highly anticipated sales event for the Millennium Falcon set, the biggest Lego set at the time and in an extremely popular product line. On the launch day, we experienced a huge spike in traffic that resulted in our backend services being overwhelmed. All our customers ended up seeing was our maintenance page. The service that failed the hardest was a small service that calculated sales tax. It made a call back to our on-premise hosted sales tax calculation product, and that very quickly reached its limits. At that point in time, we knew that we were on a trajectory for growth that could no longer be supported by an on-premise system.
So there were three key drivers as to why we went to the Cloud. Instead of maintaining infrastructure that wasn't a point of difference for the Lego group, we could focus that same energy on building awesome shopper experiences. And you saw the profiles that were on the screen before. Having that flexibility to scale and meet a very spiky demand profile, and also having the exact capacity we need when we needed it was critical. And finally, having a composable architecture down to the most granular levels. That gives us speed to market and ultimately flexibility to keep innovating and pushing boundaries. So Gunnar, what would you recommend to a team in our position?
Oh, so here's the answer in one simple but perhaps boring slide. So let me give you some background as to why serverless-first would be my recommendation. So what do our most successful customers want from us? What lets them innovate quickly, compete in a global market and deliver value for their customers? What they want is their teams to deliver products to customers as fast as possible, and they want their applications and infrastructure to be reliable, highly available and scalable. And they want the highest security and isolation. And they want all of this with a lower cost of ownership. And what is the biggest barrier to them? Well, lack of time spent on what matters. CIOs want development teams to focus on innovation and move with speed. But today, most time is still spent on operations and maintenance.
And this is why serverless adoption is growing so fast. A serverless strategy enables customers to focus on things that benefit their customers and not infrastructure management. We launched AWS Lambda six, seven years ago, and today, hundreds of thousands of customers have built applications that already drive trillions of invocations per month, and Lambda continues to grow at a phenomenal rate. And the reasons customers are adopting a serverless strategy are aligned to what customers are looking for, and these are also the areas where we put the most investment. AWS is fortunate to have, I don't know, hundreds of thousands of years of combined engineering experience building applications in the cloud, and we've learned even more from our customers. And we use all of this experience to bake into our services the lessons we've learned about building in the Cloud and we take the things that are common across all cloud applications and we build them into our services in a way that they completely disappear, so you don't have to think about them, or at least we make them as simple as possible to use and control.
And there are four key areas where we are always innovating to try to provide those simple yet powerful capabilities. Agility to help developers and operators move fast, and performance to support as many different workloads as possible. And low costs of course that follows your usage, and built-in continuously improving security. So we at AWS invest in those areas because we know we are helping our customers.
And the one thing that is common across serverless applications is that they follow the design ideas of the internet and the web. Small pieces loosely joined, and all of these applications are modular, composed from multiple different AWS services and customer-developed components. And it's why we are so focused on delivering lots of integrations in AWS Lambda, in Amazon EventBridge, in API Gateway and in Step Functions. We want to make it easy to build new applications and functionality through loose coupling with other components.
And it often starts with APIs managed by API Gateway as the application front door. Often, events or messages are then the communication system in the backend, all coordinated with different workflows. And while many customers associate serverless with Lambda, there are actually serverless services at all layers of the stack. And while Lambda offers many key advantages for our customers, it's really when all of these other components surround Lambda that we start to see much bigger benefits. Messaging, orchestration, storage and compute together, that is the secret sauce. And as I've talked about, to innovate, businesses need capabilities that help them move fast, build simply, and deliver rapidly to their customers. And at AWS, we often talk about builders, and we see both developers and operators as the critical builders in a serverless-first strategy.
Many of our new capabilities, such as Lambda extensions and VPC access controls, were built specifically for Cloud operators. And many times, operators are the first to adopt Lambda because of how Lambda simplifies IT automation. And features like larger Lambda functions and synchronous express workflows were built to help developers build faster. And these roles are changing as customers move to the cloud, and our goal is to build these capabilities so they help both operators and developers be successful as their responsibilities change. And I think what is particularly interesting is the collaboration between operators and developers, how they work together as builders to build and operate modern cloud native applications.
And one of the most important ways that we ensure safety and security is through what we call guard rails. So let's look at the philosophical approaches to modern operations. One is the free for all, and the other is the central control. No one actually considers the free for all approach to operations or security, but it is a good illustration of the two extremes available at a philosophical level, fast chaos or slow order. And you can let developers do whatever they want and they will indeed release very fast, but you'll also probably release bad code, reduce application reliability, or might even get into legal trouble. And on the opposite end and in a more realistic scenario, we see customers managing operations centrally. They take control of the release pipeline, the process of provisioning resources, securing applications and troubleshooting. And this is much lower risk, but it's also a lot slower due to dependencies and time lags.
At AWS, we don't really want to give in to the tyranny of the "or". We want it both ways, fast and safe, and this is why we use the concept of guard rails. For most customers, the process of adopting a serverless strategy is incremental, and in fact, we've observed a pattern, and we refer to this as the organic adoption pattern because it is grassroots and it tends to occur without any outside influence. And basically, it starts with developers who discover that Lambda can help them automate basic IT processes. They use it for small projects like cron jobs, projects that can start and complete without permissions. And then they start wondering, what else can they apply this principle to? And data transformation is one of the next common projects.
This is higher profile, and so we start seeing the involvement of directors and leadership, and eventually, executives start to notice that projects are being delivered faster and under budget. And then they start wondering if they can apply this to other areas. Should we use serverless microservices everywhere? And at this point, usually, the CTO has all of a sudden taken notice and we're talking about how to apply a serverless model across the organization. So our goal is to help you innovate. We think serverless is the way you can get to the market the fastest and with the lowest TCO, and we think about where and how to innovate. And we do it based on what will most help you innovate.
We think serverless provides you with the most agility, enhanced now with the Lambda extensions, as I mentioned, and OCI support, and also performance to meet a broad set of workloads. Expanded with now 10-gigabyte large Lambda functions, and you can have provisioned concurrency auto scaling that is faster than any container orchestration system. And with the lowest TCO, 1-millisecond billing granularity for functions, and with flexible savings plans. And we talk a lot about integrations. We have over 140 different integrations, as well as integration with SaaS providers and your existing tech stacks. So, Nicole, what did your journey at the Lego group actually look like?
Thanks, Gunnar. So indeed, we went with a serverless-first strategy. Here is a high-level view of where we are today. We made that conscious decision to extract and focus on our business logic, and we decomposed that across several layers of serverless services, backed by carefully selected third-party providers for things like payments providers and content management systems. And each of these layers is designed to scale independently and automatically in order to support our ever-changing traffic profiles. This design allows us to support many different squads working on all different parts of the site concurrently, and you'll see why that's important in a moment.
So our journey to the cloud started with migrating a single user-facing service, the one to calculate sales tax, and three backend processing services back in 2018, just to show that serverless would work for us. 10 months later, we then matched our existing capabilities with a completely serverless platform. It immediately started handling the same level of traffic as our existing one. And then we immediately started exceeding those rates of transactions, the traffic, and setting new records every few months. And last year, we started off with an ambitious roadmap, a growing team, and a platform that was only a couple of months old. And then the question became could we deliver that ambitious roadmap with twice the number of engineers, all onboarded remotely, and keep this brand new platform stable while handling high season levels of traffic throughout the year? The answer was yes. And not only have we doubled the number of services in our platform, we've done so whilst handling increasingly busy sales launches, each with higher traffic and transaction processing rates than the last.
So there are two clear phases here. First, we ran quickly to decompose our monolith into serverless services. This is when we leveled up in the world of serverless. We built up architecture patterns and started evolving serverless development practices within our team. And then we started to mature and take advantage of this re-platforming. We focused on the supporting systems for deployments, monitoring, alerting, all while still building out features at the same pace.
Now to give you some numbers around the growth that we experienced in the past year and a half, we now have three times the number of engineers in the team organized into nine cross-functional product-based squads. They focus on building out different parts of the site and have greatly accelerated the number of features that have been added to the sites since we went serverless. As you can see, we've launched another 36 serverless services pushing the total number of Lambda functions in production over 260, and that's just one environment. We still have other development environments where we have all of those Lambdas too. So this means that we're getting into that territory of, how do you keep legacy serverless services up to date, as well as supporting the rapid creation of new serverless services being launched onto the site every few weeks? To handle all of that growth and keep the platform stable, we've learned a lot along the way, and I want to share those with you now.
The growing team meant that we had to distribute many tasks previously conducted centrally by an infrastructure squad. As Gunnar mentioned, we are making that transition from central control to guardrails operations. Automation has been key to supporting the ever-growing number of squads and application engineers to get their new services into production at their own pace, but also safely. We've moved to a self-service model wherever possible, like making it as simple as running a script to create a set of standard integration and deployment pipelines for any new service. This is one of the benefits of having all of our services in a mono repo. We could write a custom script which ensures all of our services are integrated and deployed by pipelines that conform to our standard flow. These are customizable, so you add a flag to a config file located within the service's working directory. These are things like adding integration tests or maybe deploying S3-specific resources as part of that deployment pipeline.
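To illustrate the self-service idea, here is a minimal sketch of a generator that builds a standard pipeline for a service and lets per-service flags add optional stages. This is not the team's actual script: the stage names, the flag names (`integration_tests`, `s3_resources`) and the `services/<name>` directory layout are all assumptions made for the example.

```python
# Hypothetical standard flow; the real stage names are not given in the talk.
STANDARD_STAGES = ["install", "lint", "unit-test", "package", "deploy"]

# Hypothetical optional flags a service could set in its config file.
# Each flag maps to (stage to add, existing stage it is inserted after).
OPTIONAL_STAGES = {
    "integration_tests": ("integration-test", "unit-test"),
    "s3_resources": ("deploy-s3-resources", "package"),
}


def build_pipeline(service: str, flags: dict) -> dict:
    """Return a pipeline definition for one service in the mono repo."""
    stages = list(STANDARD_STAGES)  # every service starts from the same flow
    for flag, (stage, after) in OPTIONAL_STAGES.items():
        if flags.get(flag, False):
            stages.insert(stages.index(after) + 1, stage)
    return {
        "service": service,
        "working_directory": f"services/{service}",  # assumed layout
        "stages": stages,
    }
```

Because every pipeline starts from the same list of stages, a service can only extend the standard flow and cannot skip parts of it, which is one way to read "guardrails" in this context.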
We also redesigned our deployment workflow so that they're easier and more intuitive to use. So we had tag-driven deployment workflows for each of our environments. You would check out the commit that you wanted to deploy, tag it with the service name and the environment, and then based on this, you would trigger the appropriate workflow to deploy that service to that environment built from that commit. This was actually a workaround for one of the drawbacks of a mono repo, where each commit is actually a snapshot of all of the services in that mono repo. Unless you look at the changes, you don't know what service was actually changed and what stayed the same.
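The tag-driven flow described here can be sketched as a small parser that turns a Git tag into a deployment request. The talk does not give the tag format, so the `deploy/<service>/<environment>` shape, the environment names and the workflow naming are assumptions for illustration only.

```python
import re
from typing import NamedTuple

# Hypothetical environment names; the talk only mentions staging and production.
ENVIRONMENTS = ("dev", "staging", "production")

# Assumed tag shape: deploy/<service>/<environment>
TAG_PATTERN = re.compile(r"^deploy/(?P<service>[a-z0-9-]+)/(?P<environment>[a-z]+)$")


class DeployRequest(NamedTuple):
    service: str
    environment: str
    workflow: str


def parse_deploy_tag(tag: str) -> DeployRequest:
    """Work out which service and environment a pushed tag refers to."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a deployment tag: {tag!r}")
    service = match["service"]
    environment = match["environment"]
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    # One workflow per environment, as described in the talk.
    return DeployRequest(service, environment, f"deploy-{environment}")
```

The tag carries the information a mono repo commit lacks on its own: which of the many services in the snapshot is meant to be built and where it should go.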
This was fine when the infrastructure team were handling the deployments. When we pushed the Git tag, we knew exactly what would happen in the background. As the team grew, we needed to make it easier for our engineers to confidently deploy the release candidates for their services through each of our environments, so we linked them together. Each of our deployment workflows now triggers the deployment for the next environment. That sits at a hold step until the engineer is ready to commence. Our engineers now interact with deployments in the deployment tool itself and have the context that that entails. They know the commit on the screen is going to be used to build the service on the screen, and deploy it to the environment shown on the screen. Where before, it was a Git operation and the context was closer to pushing a new version of your service to source control, you're now in the context of a deployment tool and it's very clear that when you release a hold step, you're deploying to an environment.
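The chained workflows with hold steps can be modelled as a release candidate that moves through an ordered list of environments, one approval at a time. This is a simplified sketch: the environment names are assumed, and for brevity it puts a hold in front of every environment, including the first.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical chain of environments, in promotion order.
ENVIRONMENT_CHAIN = ["dev", "staging", "production"]


@dataclass
class Release:
    service: str
    commit: str
    deployed_to: List[str] = field(default_factory=list)

    @property
    def on_hold_for(self) -> Optional[str]:
        """Next environment waiting at its hold step, or None when done."""
        remaining = ENVIRONMENT_CHAIN[len(self.deployed_to):]
        return remaining[0] if remaining else None

    def release_hold(self) -> str:
        """Approve the hold step: deploy this same commit to the next environment."""
        target = self.on_hold_for
        if target is None:
            raise RuntimeError("release already deployed to every environment")
        self.deployed_to.append(target)
        return target
```

The point of the model is that the service and commit are fixed when the release is created, so releasing a hold can only ever promote the exact build the engineer is looking at.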
Our ultimate goal is to develop our application engineers into DevOps engineers, or builders in the AWS terminology. They own and operate the services they are building. One of the drawbacks of a mono repo is it's not easy to see what version of your service has been deployed to which of our environments, and you have to wade through the commits for every other service in the mono repo to find the ones for yours. And so we created a deployment dashboard so our team can quickly see the commit history for each of the services, which environment each commit was deployed to and when, as well as links to the relevant pipeline so they can look through the deployment logs, trigger a deployment to the next environment in the chain. Giving them this visibility into their services and where they are in the deployment process, it's making it easier for them to confidently progress their release candidates through our deployment pipelines, and that's the first step towards our goal of creating DevOps engineers.
The next step was to focus on safer deployments because we're taking infrastructure engineers out of this deployment process. So we set a standard that all serverless services were to implement canary deployments using AWS CodeDeploy. This provides automatic rollbacks when necessary. They act as a safety net. It takes out the need for an infrastructure engineer with the knowledge of how to monitor a service, how to roll back a deployment, and it just automates that entire process. The engineers implement and tune alerts for their services to listen out for things like changes in error rates, increased 500 errors, and if one of those alarms triggers while traffic is being shifted to the new version of their service, it will automatically roll it back.
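The rollback behaviour described above can be sketched as a small simulation. This only models the decision logic; in the setup described in the talk, AWS CodeDeploy performs the traffic shifting and the alarms come from CloudWatch. The traffic steps and the `alarm_is_firing` callback are assumptions for the example.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class CanaryResult(NamedTuple):
    status: str                    # "completed" or "rolled_back"
    new_version_share: int         # percent of traffic left on the new version
    failed_at: Optional[int]       # traffic step at which an alarm fired, if any


def canary_deploy(
    traffic_steps: Sequence[int],
    alarm_is_firing: Callable[[int], bool],
) -> CanaryResult:
    """Shift traffic to the new version step by step, rolling back on an alarm."""
    share = 0
    for percent in traffic_steps:
        share = percent  # shift this much traffic to the new version
        if alarm_is_firing(percent):
            # An alarm fired during the shift: send all traffic back to the old version.
            return CanaryResult("rolled_back", 0, percent)
    return CanaryResult("completed", share, None)
```

For example, `canary_deploy([10, 50, 100], lambda p: p >= 50)` rolls back at the 50 percent step, so at most half of the traffic ever reached the faulty version.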
And that leads me onto serverless operations. In order to implement canary releases, you need to implement CloudWatch alerts, and that's what triggers the rollback if your service starts operating outside of its normal profile. We created a serverless plugin with a default set of alerts to be implemented on each service and tuned based on the profile of that service by the squad that owns it. This is giving our engineering team a starting point into how to monitor their services in production, and both detect and react to issues in their space quickly. Crucially, it gives them the flexibility to take their specific use cases into consideration. We have all sorts of use cases and one set of default alerts is never going to cover all of them. It's only ever going to cover the most critical use case, and in our case, that is high traffic services behind an API Gateway.
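A minimal sketch of "defaults plus per-service tuning" follows. It is not the team's plugin: the alert names, thresholds and periods are invented for illustration. The metric names `Errors`, `Duration` and `Throttles` are standard AWS Lambda metrics in CloudWatch.

```python
from copy import deepcopy
from typing import Optional

# Hypothetical defaults aimed at a high-traffic service behind API Gateway.
DEFAULT_ALERTS = {
    "errors": {"metric": "Errors", "statistic": "Sum", "threshold": 5, "period_seconds": 60},
    "latency": {"metric": "Duration", "statistic": "p99", "threshold": 1000, "period_seconds": 60},
    "throttles": {"metric": "Throttles", "statistic": "Sum", "threshold": 1, "period_seconds": 60},
}


def alerts_for_service(overrides: Optional[dict] = None) -> dict:
    """Start from the shared defaults and apply a squad's tuning on top."""
    alerts = deepcopy(DEFAULT_ALERTS)  # never mutate the shared defaults
    for name, settings in (overrides or {}).items():
        if settings is None:
            alerts.pop(name, None)  # squad explicitly opts out of a default alert
        else:
            alerts.setdefault(name, {}).update(settings)
    return alerts
```

A squad that owns a slower service might call `alerts_for_service({"latency": {"statistic": "Average", "threshold": 5000}})`, keeping the default metric while changing the statistic and threshold.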
But we have others. We've got use cases related to Step Functions or services that are called less frequently because they run asynchronously in the background, and it's up to our product squads to analyze and implement the appropriate alerts. Likely, there'll be variations on those default metrics, but with different statistics and thresholds. We've actually set a standard around observability for the platform as well. So we utilize a Lambda layer and that forwards distributed tracing and application logs to our monitoring platform. And actually, over the course of the year, our monitoring platform has added features and it's just grown rapidly to include views of each of the services, and they contain metrics, logs, traces, all in an intuitive and searchable interface. So that's been a massive benefit to us.
The growing team means that we can't have tacit standards anymore. Our team is made up of engineers at all stages of their career and with different levels of experience in the technology that they're using. We've scaled beyond the point where we can have a single architect who can be consulted for every service we build and ensure that it's adhering to the latest best practices.
And so the way we're tackling that is with documentation. I've mentioned already a couple of standards that we've set, and we've actually set quite a few more, and they define what a good service means to us. They define the hallmarks of a good service that uses the latest coding practices and patterns that we've developed over time, as well as those guidelines for safer deployments and monitoring of services so they're easier to own and operate. Now, we've actually compared these to the AWS Well-Architected Framework, specifically using the AWS Serverless Lens. So this has shown that our guidelines mainly focus around the operational excellence pillar with a bit in security and a bit in cost optimization. So now, I'll hand over to Gunnar to explain a bit more about the AWS Well-Architected Framework.
Thank you, Nicole. When you look at the systems you're building, can you answer the question, are you well-architected? And how confident are you that those systems are built and operated following best practices for the Cloud? And hopefully, most of you would say yes. After all, we usually have great people in our teams. So why would you want to apply the AWS Well-Architected Framework? Well, because you want to build and deploy faster: by reducing firefighting and capacity management, and by using automation, you can experiment and release value more often. You can use it to lower or mitigate risks, understand where you have risks in your architecture and address them before they impact your business and distract your teams. And you want to make informed decisions, ensure you have made active architectural decisions that highlight how they might impact your business outcomes. And you want to learn AWS best practices, make sure that your teams are aware of best practices that we've learned through reviewing thousands of customers' architectures on AWS. And we've seen customers use the AWS Well-Architected Framework to successfully achieve all of these points.
So Well-Architected is more than a tool. Well-Architected is a mechanism for our customers' Cloud journey, and it allows customers to learn the strategies and best practices for architecting in the Cloud. And it allows them to measure their architecture against best practices using the Well-Architected tool. And it allows them to improve architectures by addressing any high-risk issues identified using improvement plans, Well-Architected labs, partners, solutions architects and more. And at AWS, we understand the value of educating customers on architecture best practices, and of ensuring that we're actively thinking about foundational areas that are often neglected. And the Well-Architected Framework provides a consistent approach to evaluating your architecture.
The Well-Architected Framework provides us with a set of questions and different design principles across five different pillars. Creating technology solutions is a lot like constructing a physical building. If the foundation isn't solid, it may cause structural problems that undermine the integrity and the function of the entire building, so if you neglect any of the five pillars, security, reliability, performance efficiency, cost optimization, or operational excellence, when you architect a technology solution, it can become a challenge to build a system that delivers functional requirements and meets your expectations. So when you incorporate these pillars, it will help you produce stable and efficient systems, allowing you to focus on functional requirements instead.
And to complete a Well-Architected review against the framework, we use the tool that's available in the AWS console. All details are then stored securely within your account and you can share workloads with a solutions architect or an AWS partner for collaboration on the review or remediation steps using workload sharing. And Nicole mentioned the serverless lens. That is one of the lenses available in the Well-Architected tool that allows you to look at the framework from a serverless perspective or a serverless lens, hence the name. So Nicole, what is then next on the journey for the Lego group?
Well, we will be looking to explicitly define our standards in the remaining pillars of performance and reliability. And then we want to start working on making the standard more visible, making it easier for our engineers to see all of the services they own, what state they are in, all in one place. In order to drive the engagement in the ownership of our services, we're thinking we can start assigning a score based on our serverless standards and then bringing those together in a leaderboard, just to add in that competitive element, and highlight service ownership as a point of pride within the team. I think gamification of our serverless ownership will be the key to getting our engineers to fully engage in becoming DevOps engineers and get all of our services to a place where they're easy to own and operate. And then we want to start playing with the next levels of serverless operations.
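The scoring and leaderboard idea could look something like the sketch below. The list of standards is hypothetical, loosely based on the ones mentioned earlier in the talk (canary deployments, tuned alerts, the tracing layer), and the scoring rule, percentage of standards met, is an assumption.

```python
# Hypothetical checklist; the team's real standards are not enumerated in the talk.
STANDARDS = (
    "canary_deployments",
    "default_alerts_tuned",
    "tracing_layer",
    "documented_owner",
)


def score(service_checks: dict) -> int:
    """Percentage of standards a service meets, rounded down."""
    met = sum(1 for standard in STANDARDS if service_checks.get(standard, False))
    return met * 100 // len(STANDARDS)


def leaderboard(services: dict) -> list:
    """Rank services by score, highest first; ties are broken by name."""
    ranked = [(name, score(checks)) for name, checks in services.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

A missing check counts as not met, so a newly created service starts at the bottom of the board until its squad records each standard as done.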
One thing on our roadmap is chaos engineering as part of the reliability pillar, and we want to enable our team to really break apart the services and test out those failure cases for maybe an entire third-party vendor. This will enable us to craft our shopper experiences to still be awesome, even if part of the platform disappears for a bit. I mean, it's really hard to believe that our journey to the cloud only started back in 2017, but we're excited to explore the different ways we can now play with our platform and keep improving and providing awesome shopper experiences. And so that's where we'll end the presentation today, but the journey for lego.com continues. I think we'll move to a Q&A now, and if you have any further questions or comments, feel free to reach out to us on Twitter or LinkedIn. Thank you very much.
Thank you so much, Nicole and Gunnar. I bet there are a lot of members in the audience who had this dream profession of being a Lego engineer when they grow up, and now there's probably even more people who will have the dream of being a Lego DevOps engineer when they grow up. I'm going to get started with the questions. There's a ton of them over there, but coming from a physical product organization, you don't deploy new Legos five times a day. Did the sort of mentality and culture around software engineering specifically change a lot, and was there any kind of clash with the existing product engineering side of the organization?
Well, I think one of the benefits of the way that the Lego group is structured is we're given a lot of freedom in terms of choosing our ways of working. There's a big push towards Agile management strategies and all of that, and so being able to start up in our own product area and really build out the team the way that we needed to in order to move quickly and really make that transition into becoming a digital organization, we're able to really run alongside the existing product organization. So, yeah.
Beautiful. There was also a suggestion that for the next talk you two do, Lego should have an AWS data center set made of Lego. So just a thought for the future.
Lapa you had a question.
I do quite a bit, but before I go to a question, I'd like to highlight the unique opportunity for the audience to do some live testing on lego.com on April the 1st, because yesterday, I learned that Lego announced this one described as the most complex flying machine ever built, which is NASA's Discovery shuttle and the Hubble telescope, which will go on sale on April 1st. I believe it's not an April Fool's joke because there was a follow-up later about the pictures taken by the Hubble telescope also being featured in the Lego set. So just take note of that. The first question, can you explain a bit more about your decision to go with mono repo? Because I don't think there was a specific mention about motivation behind that.
Yeah. So when you start up a new project, you've got a couple of decisions to make right at the start. Do you have micro repos where all of your services are in their own repository or do you go for a mono repo where you have everything located in one area? And we went for a mono repo, and some of the benefits that we've found is that our engineers are now not... We don't have those barriers between working on different services, different parts of the site. So we want our engineers to have that freedom to be able to raise a PR and change another part of the site if they see something is wrong. So we're trying to build up that engineering culture of, you see something wrong, you fix it. You see something that's interesting, you get involved. You pick up a ticket and you make a change. So our engineers can work in our pipelines, they can work in the services that they work on, and yeah, I think we've seen a lot of benefits from going with a mono repo.
There's a question for Gunnar on how does AWS support chaos engineering? And can you talk a bit about your future together with that?
Yes. I love that question, as a former chaos engineer. We do it in several different ways, and the most recent way I'd say is that at the last re:Invent, we announced a managed service for chaos engineering, AWS Fault Injection Simulator. And that became generally available last week, so it's available for everyone to start using. So you're able to then create your experiments and do chaos engineering experiments on your AWS environment straight from the AWS console.
And there's a question about a big fire in a server building in France last week, and there was a lot of customers who lost data. And how are you handling these kinds of public and possible disasters at AWS? Any words on that incident or more broadly?
I can't really comment on that specific incident of course, but every time we build a new region and new availability zones, the possibility for disaster is always part of the planning, to be able to avoid those types of disasters.
So let me repeat the previous question about testing dependencies, so end-to-end testing versus contract testing for APIs and messaging dependencies. How would you do that, and what kind of tooling would you be using for it?
Yeah, we use testing frameworks to test both at the integration level for our services, so testing that API contract level, and then as we deploy through the pipeline into our non-production environment, staging and into production, we do have end to end testing where we test end to end user flows and expected journeys as we deploy. You might be able to see some of our other talks to get more information on the specific frameworks we use.
[crosstalk 00:34:40] for you to mention frameworks, Nicole.
Oh, I can? Okay.
Oh, yeah. For sure.
Yeah. Well, we use the Cypress framework and we also use Jest, so Cypress is for our end to end testing, where we are testing those user journeys through our site, and then we use Jest for our integration tests, and that's really at the services level.
And maybe a question for you both, there's a bunch of questions on the financials. Have you seen any kind of financial benefit in moving into this kind of architecture?
I can start with this. So as you saw, every single year, every few months, we've been having record sales events. So the benefits that we're seeing is that we can actually handle those sales events. And it's always hard to predict the future, but staying on the current track that we were, we're not sure if we would have achieved those goals as fast as we have been seeing them.
This is an interesting point, and I wanted to refer back to our earlier conversation about the total cost of ownership, which is basically how can I deliver the service on a global cost level? But then what you are saying is how do we as an organization enable every one of us to do better on the whole? So it goes back to a much, much higher business metric than managing the cost. It's enabling the company to grow and get to its strategic objectives, which is a fantastic comment there.
We are coming to the end of the time and we would still have lots and lots of questions, but it doesn't prevent people from adding them into the chat, and I hope that both of you have a little chance to stay a little later and take that conversation in the chat. But thank you for your presentation and thank you for answering the questions, and good to have you back, Nicole, after you had to step out for a moment.
Thank you very much, and I loved your creation by the way.
And one last comment, in the resources section on The DEVOPS Conference website, there is a DevOps Cloud guide. If you want to get into these topics at a general level or go deeper, you can go and check it out. It's available for download for everybody. You can find it if you go to the website and go to the resources section.
Beautiful. So thank you, Nicole. Thank you, Gunnar.
Engineering Manager, Direct Shopper Technology
The LEGO Group
Senior Developer Advocate
Amazon Web Services (AWS)
Watch all recordings from The DEVOPS Conference 2021 on the event's page: