

Why we skipped SRE and switched to Platform Engineering | Electrolux

Discover how Electrolux Group transformed operations by skipping traditional SRE approaches in favor of Platform Engineering. Gang Luo and Ramil Galimov share their journey building an Internal Developer Platform (IDP) that empowered 300 developers with only five Ops engineers. Learn how their cloud-connected platform improved productivity, enabled rapid region launches, and embedded SRE principles through automation, SaaS, and open-source collaboration. Get insights into architecture, automation strategies, and lessons learned in scalability, self-service, and developer experience.

Transcript

[intro jingle]

[Ramil:] Hello, everyone. Thank you for coming. My name is Ramil, and I'm a Senior Platform Engineer at Electrolux, a home appliance manufacturer. I have 15 years of experience in the industry as an automation engineer, as a software engineer, and now as a platform engineer. I joined Electrolux six years ago, and I was there at the origins of the IoT platform development.

[Gang:] I joined Electrolux at about the same time as Ramil, but as an SRE and not as a developer. I have roughly 20 years of experience in the industry. I started with development and later switched to SRE when I joined Electrolux. Then, probably two years ago, I started to move more towards platforms.

[Ramil:] Okay, so what is the IoT platform? The IoT platform connects home appliances and allows customers to control their appliances from the mobile application. But our platform provides not only simple appliance control, but also other customer-oriented features, like, for example, assisted cooking. It's IoT with an assistant that takes a recipe link from the internet, analyses it, and selects the optimal cooking settings for you.

[Gang:] And to really support a good consumer experience, which is what they did, we actually have a somewhat complicated setup in our backend. Some of you are probably familiar with these logos, right? You can see several cloud providers here. That's because we also have a legacy system that we have to decommission, but we cannot decommission it completely.

[Ramil:] Okay, so when we started, we had quite a small team: only one connectivity platform, one product, eight developers, and only two SRE engineers. Gang, how did you feel at that time?

[Gang:] It was very nice when I joined. It was more like a start-up. We called it our own start-up inside Electrolux. It's not a tech company, but we worked as one team building the cloud backend. We only focused on the cloud backend.
Everything went very well, I would say, like a family. But it didn't last long. Unfortunately, after two years, we saw different regions start to grow their business, and more and more consumers started to connect their appliances to our cloud. The backend team also started to ship more features. We were running into firefighting mode. As you know, we have different regions and different BAs, business areas, and some of those stakeholders are not easy to communicate with, because they're not very technical. So, every time they noticed something didn't work in their mobile app, they would come by our table and ask us, "Is our cloud down today?" Of course, we understand that not everything is about the cloud. You also have your mobile apps running somewhere, which might have issues.

[Ramil:] Yes, so we had plenty of issues at that time. The problem might have been related to the appliance, the cloud infrastructure, or even some service logic. And yes, from the development side, we communicated a lot with the SRE engineers. We also had, as Gang mentioned, several business areas, and every business area developed its own mobile application. As a result, we ended up with more than 20 mobile applications. And of course, this significantly increased the complexity of the backend and the IoT platform. So, we have statistics on how the number of developers and services changed during our journey. We started delivering more and more features, and the number of developers increased significantly, of course. But what about SRE engineers, Gang?

[Gang:] Yeah, as you can see from the graph, we didn't grow at the same rate as the developers. One reason is that it's actually a big challenge for us to hire good SREs. I mean, we tried to hire from the EU and APAC, but we couldn't, because we are still more like a traditional appliance manufacturing company.
So, I think for those top SRE engineers, we are not so attractive. But on the other hand, it's also a good lesson for us, because we have to see how, as a small SRE team, we can keep supporting our growth without expanding the SRE team. And here, you can see, this is the SRE team. As you know, we are responsible for production operations. And here are just two example incidents we had before. One is an email sent to us at 4 AM from the Latin America region. They were asking, "Is the cloud not working?" The other one happened after we switched to a major cloud provider. Everything went smoothly during EU hours. I mean, developers were happy, we were happy, it was like a celebration. We had finally completed the migration. But at midnight, as you can see, we got a notice: alerts that the CPU usage of our caching system had reached 100%. And the SRE team, I mean, we are responsible for production, and we do this kind of on-call duty. So, we tried to resolve this on our own, of course, and tried different ways. I remember I stayed until 4 AM, couldn't resolve it, and handed it over to the APAC region. But in the end, the real resolution came when the developers arrived in the morning, EU time, and it finally got resolved. Maybe, Ramil, you still remember the incident.

[Ramil:] I remember the incident, of course, and the overnight troubleshooting. This particular issue was related to business logic and how Redis was used from the service side. We had to wait until the team responsible for this service implemented the fix and released it in the morning. So, we very often ask SRE engineers for help with troubleshooting, and this is only one part of the communication between SRE engineers and developers. Another part is provisioning new cloud infrastructure, and we also ask the SRE team about that.

[Gang:] Yes, so that's actually a major piece of work for the SRE team as well: providing the infrastructure that our product teams need.
We used to work in this kind of ticketing way, which probably some of you still have today, right? We asked the teams: please file a Jira ticket with us, we will put it into the backlog, plan it, and finally deliver it, and then your ticket will be closed. But this didn't work, to be honest, because we are quite a small team, and we are mostly based in the EU region. So we cannot scale. To resolve this, we started to use Slack. I mean, we created a Slack channel to provide a faster response time to our developers, but it quickly became one of the most popular channels in Slack, and then it became more like an IT help desk. It's not only about production; people also ask questions about different things, like "How can I get access to the VPN?" That's probably not something we should manage. But it means our workload is still quite high. Actually, still today, we have to do this kind of rotation: we dedicate an SRE team member each week to handle these Slack messages.

[Ramil:] Yes, so the SRE channel is now the main way of communication between SRE engineers and developers. And for me as a developer, since there are so many requests, the request resolution time is always unknown.

[Gang:] So, how can we improve this? Of course, we understand that automation can improve this. That's why we started to improve our automation a couple of years ago. A lot of scripts and infrastructure code have been written, which, of course, provides certain value for our team. But it still goes through the ticketing way. Ramil, you file a ticket with our team, then we trigger certain automation, and eventually we deliver and close the ticket.
It's just that the resolution time becomes shorter, but this is still not working so well, because we are still doing it in a semi-automated way, and some member of our SRE team needs to trigger something. So we started brainstorming what we could do better. Meanwhile, we didn't feel like everything was working in the right way, because we understand developers will keep pushing changes to production. On the other side, as the SRE team, we don't really understand what they are pushing every day. But we have to do this kind of production operation, right? So, sometimes when they push something during the day, and then during the night we get a lot of alerts, we can't figure out what's going on. And then the question is: who owns production? Is it the SRE team, who should run everything to make sure production is up and running 24/7? Or is it the product team that is shipping the features? Or is it more like a shared responsibility? So, we started to do some brainstorming to figure out how we could improve this. We also invited our stakeholders, that is, our internal developers, to our conversations. And we invited Ramil to our interviews as well. Ramil, what do you think we should provide for you?

[Ramil:] Yes. So, from my side as a developer, I would like to have something more flexible, not just ticketing, but some kind of self-service. I don't want to be dependent on the SRE team. I would also like to see how my resource is actually provisioned, what the current status is, and whether there are any issues during the provisioning flow.

[Gang:] Yeah, so here's the result. We need to provide this kind of self-service portal for our developers. It was also the year, I remember, when Backstage was open sourced by Spotify. We started to try it out to see what it offers, and eventually we adopted Backstage and started to build our internal platform.
We actually have a product manager who is responsible for the SRE products. One day, I remember, she asked me, "Are we going to build our own IDP?" At that time, I was an SRE, not a platform engineer. So, I was curious: okay, what is this IDP you are talking about? I had no idea. But later on, we did some learning, and we understood that this is actually the thing we should provide to our product teams. And these are the major products offered by our team today.

[Ramil:] Okay, so let's look at an example. All our services are running in Kubernetes. For developers, it might be quite complicated to create an EKS cluster from scratch. I would need to read the documentation, go to the AWS console, create resources manually, and then spend hours troubleshooting. Instead, I have a simple tool where I can fill in a basic configuration, like the EKS cluster name, select a region, and then click the Create button. The rest is provisioned for me automatically, and it is very safe. I don't have to worry that something might go wrong with the infrastructure, and I need basically no infra knowledge for that.

[Gang:] So, yes, what's actually under the hood? It's actually fairly simple if you look at this diagram. We talked to our product teams to figure out what kind of infrastructure they need for their applications. Then we started to learn each piece of infrastructure from AWS. We read the documentation and eventually converted what we learned into code, as templates stored in Git. This code is just what we learned, and I would say it is a bit opinionated, but we try to follow best practices. We can also release new versions, and developers can contribute as well if they think something should be improved.
Then these templates get loaded into our platform, and developers can go to our platform to browse what is available and pick the infrastructure they need, and then it will be provisioned. These are the features we offer to developers.

[Ramil:] Yes, and as a developer, I had a very positive impression of the IDP. The main point was that now I can see the infrastructure, and it's not a black box to me anymore. There is a clear picture of what was provisioned and the related resources. On top of that, I can see metrics from the observability platform, some metadata, version management, and also an audit trail where I can check who made the changes. And of course, there is an approval flow that allows managers to control infrastructure provisioning, as well as cost observability. I strongly believe that cost observability is a very useful feature for developers too, to see how much their infrastructure costs. Developers could at least take it into account before the next release, which we don't often do, to be honest.

[Gang:] We launched our IDP in March, and then in December, we did a rough estimation of what we had achieved, and here is the result. We went through the number of infrastructure resources that got provisioned through the IDP and did a rough estimation. If we had done this in the old ticketing way, it might have taken something like 200 days, because every time we have to do planning, we have to do testing, and then finally deliver it. But with the templates, we can ensure that the provisioned infrastructure is more or less identical. So we don't need to spend a lot of time checking every single piece of infrastructure. And developers don't need to wait for anything from our team. They can do it anytime. So, Ramil, do you think infrastructure is everything you need?
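
[Editor's sketch: the self-service flow described above (versioned templates that developers browse and pick, an approval step for managers, and an audit trail) could look roughly like the following. This is a minimal in-memory illustration, not the Electrolux IDP or the Backstage API. Every name here (`Template`, `Request`, `Catalog`, the template names, and the parameters) is hypothetical, and `_provision` is a placeholder where a real platform would run its infrastructure code.]

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Template:
    """A versioned infrastructure template (hypothetical shape)."""
    name: str
    version: str
    required_params: tuple[str, ...]
    needs_approval: bool = False


@dataclass
class Request:
    """One self-service provisioning request and its visible status."""
    template: Template
    params: dict[str, str]
    requested_by: str
    status: str = "pending_approval"


@dataclass
class Catalog:
    """In-memory stand-in for the platform's template catalog."""
    templates: dict[str, Template] = field(default_factory=dict)
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def register(self, template: Template) -> None:
        self.templates[template.name] = template

    def browse(self) -> list[str]:
        """List the template names a developer can pick from."""
        return sorted(self.templates)

    def request(self, name: str, params: dict[str, str], user: str) -> Request:
        template = self.templates[name]
        # Reject incomplete input before anything is provisioned.
        missing = [p for p in template.required_params if p not in params]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        req = Request(template, dict(params), user)
        self.audit_log.append((user, f"requested {name}@{template.version}"))
        if not template.needs_approval:
            self._provision(req)
        return req

    def approve(self, req: Request, manager: str) -> None:
        if req.status != "pending_approval":
            raise ValueError(f"cannot approve a request in state {req.status!r}")
        self.audit_log.append((manager, f"approved {req.template.name}"))
        self._provision(req)

    def _provision(self, req: Request) -> None:
        # Placeholder: a real platform would run its infrastructure code here.
        req.status = "provisioned"
        self.audit_log.append(("platform", f"provisioned {req.template.name}"))
```

[In this sketch, a developer would call `catalog.request("eks-cluster", {"name": "demo", "region": "eu-west-1"}, "ramil")` and then read `status` on the returned request, while `audit_log` records who requested and who approved each change.]
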
[Ramil:] Okay, so now I have a tool that can manage my infrastructure, but what about service creation? I would also like to have one button that I can click so that the service and all its infrastructure are provisioned for me as well.

[Gang:] Yeah, so we spoke with our developers afterwards. After infrastructure, we figured out what they need in their daily job. For example, we started to introduce more and more, not only infrastructure but also service management: how to deploy a microservice, how to get the logs into our monitoring system, where to find the metrics, how to connect the microservice with a database or third parties, and what about all the other parts? That is also more or less automated in this platform now. And what about the cost? You shipped a feature yesterday. Can you see today how the cost has changed? This is something that we also introduced. Actually, one reason we started to look into cloud cost is that we saw a significant increase in our cloud cost after we launched the IDP, because developers are more or less free to pick what infrastructure they would like to provision.

[Ramil:] Okay, so the IDP supports the full SDLC, but that doesn't mean the whole production responsibility is on my shoulders now as a developer. No, instead, we share this responsibility between developers and SRE engineers. On top of that, I have ownership of the product, from coding to observability. We can also consider this a kind of cultural shift. I have a tool that I can explore and use, and I can be more involved in infrastructure provisioning. It got me so involved that I became a platform engineer. So, be careful with IDPs.

[Gang:] Yeah, welcome to our team.

[Ramil:] Thank you. And yes, with this tool, we are trying to break a silo, but from the other side, so developers have an opportunity to dive deeper into infrastructure.
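
[Editor's sketch: the question "You shipped a feature yesterday. Can you see the cost today?" comes down to comparing per-service cost between two days. The following is a minimal illustration under assumed inputs: the `(day, service, amount)` record shape and the function names are hypothetical and are not taken from the speakers' tooling or from any cloud billing API.]

```python
from __future__ import annotations

from collections import defaultdict


def daily_totals(
    records: list[tuple[str, str, float]],
) -> dict[tuple[str, str], float]:
    """Sum cost per (service, day) from (day, service, amount) records."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for day, service, amount in records:
        totals[(service, day)] += amount
    return dict(totals)


def cost_change(
    records: list[tuple[str, str, float]],
    service: str,
    before: str,
    after: str,
) -> float:
    """Return the cost on `after` minus the cost on `before` for one service.

    A day with no records for the service counts as zero cost.
    """
    totals = daily_totals(records)
    return totals.get((service, after), 0.0) - totals.get((service, before), 0.0)
```

[A positive result from `cost_change(records, "my-service", "2024-03-01", "2024-03-02")` would mean the service cost more on the second day, which is the signal a developer would check after a release.]
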
[Gang:] Yeah, this is exactly what we are aiming for, because we want to enable the product teams to be autonomous without depending on our team. Let's take a quick look back at what we have done over these past years. We started with a small team, and later on, we were building platforms. But those are still internal platforms, right? So, we started to think more about what we can do beyond these internal platforms. Of course, one thing that probably everyone is talking about today is building platforms as products. But what does that mean for our team? We did a rough comparison. The main difference is probably the mindset, because an internal platform is driven by internal requirements. If a developer team needs something, we have to ship fast. So, it's more reactive. You need it today? Okay, we can ship it tomorrow. And if you need any customisation, we can change, we can adapt. But when building a platform as a product, we probably won't go with that approach. Instead, when we get requirements from our product teams, we will think: how would that requirement fit into our platform? Can we generalise the requirements, so we can provide a more general solution that can also be used by other teams? So, instead of reactive, it's more proactive. We also have to plan much better, because we treat this as a product, as our own product. So, we have to talk to our internal developers and also to external developers. And we have to work with our PMs on planning. I would say this is our first product. It's actually the cloud cost plugin that we put into Backstage, which we evolved into an open-source version called InfraWallet. It's actually very simple. What we want is just an aggregated view of how much we spend across our different cloud providers. We also looked into buying versus building and decided to build it because it's not so hard.
Also, among the providers or vendors that we checked, we couldn't find one that could meet all our requirements. So, this is open sourced, and after we open sourced it, we got issues from our developers and also a few contributions. We can also check the downloads to see whether we are doing well or not. We are actually very happy with what we have achieved, because we are just a small team. We have our own job, and we cannot dedicate ourselves to the open-source project. But we started to think more, because infrastructure is actually the major product that we provide to our internal developers. Can we also turn that into another product? That leads to the second product that we are building today, called InfraKitchen.

[Ramil:] Yes, so we have learned a lot while developing and using the IDP. The IDP was more like UI operations on top of internal infrastructure scripts. But InfraKitchen is much more powerful. It's not only a UI but also a code-driven tool built with real developers' needs in mind. For example, with the IDP, we were locked into AWS only, but in InfraKitchen, we support multiple cloud providers. In the IDP, we also had quite a limited number of predefined templates, but InfraKitchen allows developers to create their own templates, which is great. InfraKitchen is also code-driven because developers like working that way, so we provided that opportunity, and all changes are currently managed in code.

[Gang:] Yeah, and this is the UI.

[Ramil:] Yeah, so here's a small demo of InfraKitchen. You can easily create any cloud resource from a predefined template, or you can import or create your own template. From our side, we try to simplify infrastructure provisioning and make it really fun. At the moment, we are actively testing this tool internally, and we are planning to open source it soon.
So, I hope that we can get some feedback from the community or even some contributions.

[Gang:] Yeah, to go back over our history: we started with one team building one simple product. Everything went well, and then we went into firefighting mode because of the growth. Meanwhile, we, the SRE team, became the bottleneck for all the teams, because they were waiting for us to deliver things. Automation helps a lot, but it's still managed by the SRE team, so we are still the bottleneck. By providing internal platforms, we see a big change. That is because with self-service, we don't need to provide direct support anymore. People can still file issues with us, but we convert them into feature requests for our platforms. We also see a culture shift, because once developers start to manage things on their own, they feel more like the owners of everything. Now, we are still learning how to build platforms as products. We also love open source, so we would like to open source what we have. We started with SRE, as you can see, and we are still the SRE team. Our section is still called SRE. But today, our team name has changed. We switched to platform engineering, and the team is now called platform services, because we offer platform services to our organisation. So, what exactly does SRE mean then? You may be curious, right? SRE, of course, is Site Reliability Engineer. I mean, you can say it's a job title or a professional career. In our organisation, because we couldn't scale the number of engineers, we had to find a way to support the business growth using some kind of engineering approach. And I think platform engineering fits this very well. What happens today is that we provide things to our developers through platforms and through code, not through trainings or one-on-one sessions, because that didn't scale.
And once we see more issues coming, we try to improve our platform. So, we put more code, more things there. And developers, as long as they are using our platforms, basically get everything for free, which removes the overhead from their side.

[Ramil:] So, that's what we learned during our journey. Thank you very much. These are our LinkedIn profiles, so let's connect.

[Gang:] And this is our open-source repo. If you are interested, feel free to check it out.

[outro music] [music stops]