From DevOps to FinOps - perks and pitfalls | Jessica Lundberg & Daniel Bjork

Name: From DevOps to FinOps - perks and pitfalls | Jessica Lundberg & Daniel Bjork
Uploaded: 2024-11-07T02:00:00+02:00

At The DEVOPS Conference 2024 in Stockholm, Daniel Bjork and Jessica Lundberg gave a presentation about Kambi’s DevOps transformation and how it eventually took a new turn onto a path that was not anticipated at the beginning: FinOps. The talk highlights the challenges they encountered on this journey and the benefits they saw once they were on the other side of it.
About the speakers: Daniel Bjork has experience working with regulatory and compliance implementations in the Finance and iGaming sectors. Daniel currently acts as Head of Product for the regulatory area at Kambi, while also being the FinOps driver for the entire organization. When he is not working, sports, baking, cooking and family life take up most of his time. With over 25 years of experience in software engineering, Jessica has held a wide variety of roles across different businesses. Her technical background as a developer and build and release engineer, combined with her product, management and leadership roles, gives her a broad and holistic perspective on the industry. Jessica is passionate about both people and technology, often finding herself bridging the gap between the two. Currently, she serves as the Head of Product Development in the Engineering Operations Stream, drives strategic DevOps initiatives, and is a member of the FinOps team at Kambi.

Transcript

My name is Daniel. - I'm Jessica. And as mentioned in the introduction, we will talk about our DevOps journey and focus on what the perks and pitfalls of this journey were, the evolution of that journey, and how it ended up with the FinOps related work that we do in Kambi. Before we go into this, just a short introduction to what we are doing at Kambi.
So, at Kambi, we love sports. We love sports so much that we decided to make a living out of sports. So, we are the leading supplier of online and land-based Sportsbooks. So, you could use our product using a mobile phone, you could use it on a laptop, or you can go into a betting kiosk. For instance, in Sweden, we have ATG and Svenska Spel, which both are hosted and powered by Kambi when it comes to sports betting. There are many different devices you could access our products from. And what we actually do is that we take real-time data, and with that, we create games. I mean, sports. I mean, Manchester United, Everton, or whatever tennis sport, or whatever. And we also create odds and bet offers that you could wager on. That's our core product.
But of course, there's a lot of other different things within our platform. We have a huge data warehouse because we need to use a lot of data to have our material. We also have regulatory products because we're in a highly regulated market. We have specific tools for that. Risk management might be one of the most important factors to sports betting, so we can balance our risks. And we also have a user interface. So, for many of our customers, everything you see is provided by Kambi. Our customers are just using a frame. So, all of these different technologies and services we provide within our products.
We are in all of the dark blue areas of the world, which means that every single second of 24-7, there is a bet being placed by one of our customers. Every time there is a game going on that we have odds for, that is live right now.
But apart from that, low latency, everything is up all the time. We have extreme spikes when it comes to certain specific events, such as the Super Bowl, the Champions League final, the Euro final, when we have very, very high volumes of traffic. So, we need to be up all the time. And on top of that, we're also a global company when it comes to staffing. And we have around a thousand people employed within the company, where the majority of our engineering department are sitting here in Stockholm.
Our DevOps journey, it began in late 2020 when we realised we needed to do something. We had previously reorganised into value streams, so we did it maybe the opposite way around compared to what people have spoken about today. We had also started to break down our monolith into microservices. But we really didn't get that flow we wanted. We had also grown a lot the last years, about 10 to 15% more people every year, but we were still organised in the same way when it came to the division between operations and development. We also had too many processes around day-to-day activities. We were using Agile methodologies, but we still had these handovers and the manual steps when it came to the release and deployment part. We had this traditional division between operations and engineering where the developers threw their code over the fence to operations for release and deploy. And the Dev teams, they had ownership, but they didn't really have the mandate, and they didn't have the access to production to be able to improve their situation. So, they were frustrated. Operations teams were swamped. Neither of the teams in the different departments saw any way to improve their situation. So, the Time to Market was slow, and at the same time, we had a high demand on throughput.
And at this time, our operations department, they catered for both external customers and the internal ones, the development teams, with the primary focus on the external customers, our partners. And at the same time, development were building more and more components, more applications and wanted to get things out quick. And we simply didn't scale. We had around, I think it was five application technicians releasing for more than 45 development teams. We also had a tradition that we were building over buying. So, we didn't really take advantage of buying services outside the company. And this is where it also ties back to FinOps later. We actually lacked the financial insights on the team level. We didn't have any insights into the cost of manpower and the cost of infrastructure to keep our services running.
As I said, this was during 2020, and this was when COVID still dictated everyone's day-to-day life. So, we had problems when it came to hardware and severe hardware capacity issues. It could mean that we could wait for up to one year to get hardware in place and to get a new component out in production in the worst-case scenario. We had knowledge about cloud technology. It was used in several parts of the company and with some teams, but we weren't using it to the full extent. We really didn't have the strategy in place. We were hybrid with on-prem and some cloud things, but we weren't really using technologies and tools to the full extent that could speed us up. And the teams, they did not have the full ownership. They had a lack of operational knowledge. They had a lack of understanding of how their components were working in production. And no access to production logs and stuff like that. So, it was hard to improve things from an architecture perspective, but also a cost perspective.
So, we decided, let's go DevOps. And at the same time, let's do a cloud lift and shift where suitable. So, how did we go about it then?
We did have a clear direction and vision from the beginning, and we got a really good buy-in from the beginning, from the C-level management and all the way to our partners. They were really on board with this, that we needed to focus a bit on this transformation instead of product features. We had a clear sponsor and priority. We drove this as a strategic initiative with me as a driver, a dedicated driver. We had our SVP of engineering, Maria Naveira Sund, as a sponsor. And we managed to get this initiative in our global priority list, where, normally, only product features are dominant. We also had a change management group. We were doing maturity level assessments when it came to DevOps, where we wanted to be, where we were at the moment, looking into things like culture, communication, observability, and things like that. And we also appointed change ambassadors within the organisation. Yeah, we were working a lot, we had this Agile approach with continuous improvements, working incrementally. We had early adopter teams looking into things, building a lot of proof of concepts for a lot of technical solutions with container platforms and things that could speed everything up.
But one thing that we also did was to create a completely new value stream. And that was the platform teams, which came with the sole purpose to focus on the Dev teams and to support the transformation. These engineers, they worked with different things. They created the paved road solutions. They were working as embedded resources within the Dev teams to speed things up, to broaden the knowledge within the company. They had a task force that was removing blockers for the teams and the mantra was always like unblock yourselves. What we also did was to bring in operations and product management under the engineering umbrella. They weren't there before, they were outside.
So, in with them, and we made operations a value stream as well and made an Agile transformation. So, they could come closer to the rest of the engineering organisation. And then, we worked a lot when it came to things such as transparency and communication, and we coached around courage and things like dare to fail. We had sessions with fail cakes if you share, and teams shared things that others could learn from. We introduced the DevOps demo every second week to get even more knowledge sharing and a lot of things. We were more transparent about metrics, both the transformation metrics, but also the Dev experience metrics like team health checks to look into on a regular basis.
Did we succeed then? Well, we have learned a lot. We have come pretty far, I would say. There are some things that maybe we should have been aware of and would have done differently today. And one thing, as you see on this picture to the right, it's a known thing when it comes to change management, that change phases can be out of sync. And we were in different phases when it came to the transformations, where the ones that were leading the transformation in the beginning were ahead. We didn't really have the awareness to that extent, the impact of people coming a bit later in the phases.
[Daniel:] I think I was like one year in, then it sunk into me. Oh, this is what we're doing!
[Jessica:] Exactly. So, this is, as Daniel points out, one thing that we kind of missed is that we underestimated the effort to get the middle layer managers on board. And the speed of transformation is highly dependent on the ability to scale as well and to get people on board, to be ambassadors, to drive the change. We had a lot of people that were working with something else at the same time. So, I was a driver, but I had other roles at the same time, the other ambassadors as well. It worked, but things could have gone faster if we had a different approach to that.
One other thing is the ratio between employees and consultants. We had a lot of consultants involved driving the change, advocating for it. Sometimes that was perceived as a threat, and maybe they don't really know our business. So, yeah, make sure to get employees on board, have them help you in the transformation. And one interesting learning that we had, we were appointing change managers and approaching informal leaders, but there were others than we expected that stepped forward and took the lead, actually. It was interesting.
Yeah, leading change during change is one thing as well. We realised quite late that we really needed to repeat the message. The drivers were maybe kind of tired of the message already, [chuckles] but others weren't on board. So, we also grew, as I said, a lot. And we needed to have a simple thing, introduction courses for new starters. Why are we doing this, actually? And what's in it for me, the value of things. And then, in the beginning, we had a lot of gamification to inspire and spark interest and such. We had qualitative metrics, too. But after a while, we tended to communicate more about the quantitative ones, and those weren't as inspiring and motivating as the feedback was to us.
I don't know if you're familiar with the term bucketeering, but here, since this was an initiative that had a lot of traction, and it went well, people wanted to put in a lot of things here under this umbrella because they realised things were happening. So, architects came with the NFR requirements. We had Dev teams really happy and said, "Now we can focus on tech debt." So, this is one thing we realised, yeah, that it grew, everything grew.
And what about the stakeholders outside engineering? We were talking a lot within engineering and it was going well. We were talking with the other departments as well. We had legal, and we had the SecOps, and we had the compliance departments. But still, we kind of missed them.
They weren't really on board. So, with all this, if we would have done this today, maybe we would have done a risk analysis with all this knowledge to also identify more of the things that we had not anticipated.
So, the perks and the value we see so far. Yeah, we see significant cultural and collaborative improvements. We see this between the teams. We see this within the teams. And people are happier at the moment, actually. We also see that these DORA metrics that people have been talking about today, we see real improvements. We have increased number of releases. We have more acknowledged benefits in scalability and flexibility. We are technically more able to scale according to load. We have faster incident resolution and stuff. And release time has gone down from sometimes one year, due to the hardware situation, to sometimes minutes, too. We see the synergy effects like the bucketeering, we have much more NFR compliance, [chuckles] tech debt has been addressed, so it's pros and cons, and less personal dependencies, and innovation increased as well. And finally, increased transparency, and definitely, we are making more informed choices now. And on the cost side, so the better cost allocation, it became a clear perk. But at the same time here, we also saw that our costs started to go crazy. So, it increased dramatically, and in came FinOps.
[Daniel:] In came FinOps, which was then the natural next step of all the DevOps related work that we had done in Kambi. Just taking a quick pause there, FinOps, for those of you who are not aware of that word or that expression, is the combination of finance and DevOps. Part of our DevOps journey, we also made the journey into the cloud, in our case, AWS. And [exhales] well, it was good. We could see things. We got all of these different perks. But costs then. Okay. Moving things to the cloud costs money.
I mean, you could have things in a data centre, you could have a huge database and just add more and more stuff. But you just pay for the initial payment when you're setting that up. Adding things to the cloud, on the other hand, every single load you add to the cloud will increase your bill. You will pay more to your cloud provider. And this was not really a thing we had discussed before, how to address the costing effect of moving that much of our load to the cloud. And the cloud costs just kept increasing. So, everyone agreed that we are not on a sustainable path, and we need to do something about it. And we decided to go with the FinOps principles. And these are open principles, which are directed by the FinOps Foundation group, which are on the website, well, on the internet. So, just Google FinOps Foundation, and you will find a lot of useful information that anyone can use when setting this up in your organisations. Really inspiring. And we decided to go with that framework.
And the first thing we did, FinOps action number one, form a group. We need a task force. This is, as I said, going out of control. Okay, so who should be part of that group? We went with finance, because finance, they pay our bill, finance, for sure, is starting to ask questions to engineering, "Hi, guys and girls, what are you doing over there because this is going out of control?" So, we realised we can't just have them as some sort of stakeholder here, we need to have them in the group and discussing this with us. Procurement used to buy hardware and racks and setting up servers. They might be a good factor here as well, because now we'll start to have conversations with our cloud provider, also in the same way, someone looking ahead, knowing about the processes. Procurement were needed. Okay, we need some sort of cultural driver, because this will mean communication. And for some reason, that became me.
I mean, I don't know the reason, there might be a reason why I'm standing on the stage today, and I have the same role, but in my company as well, about communicating what we want to do and what we expect from the different teams and departments. Then we had operations, Jessica, because now we, yes, we did a big move to the cloud, but we still had certain things still on-prem. We are also regulated, which means that we need to have certain loads running in certain geographical territories. Without having the full picture, we will be somewhat blind here. So, you joined the group as well. Of course, an architect is good to have, someone having the high-level technological picture of what are we doing, what is our platform doing here. And then, we also had the reliability engineers in these platform groups, because if we can find ways to not reinvent the wheel in 40 different places, we should. And we had learnings from our DevOps journey that this was really successful in moving along, and we thought it could be here as well. Okay, great. We have a group that finally can tackle this.
FinOps action number two, get tagging guidelines and cost tracking started. Okay, we all can see we are bleeding, but where? With Cost Explorer in AWS, you could get certain amounts of data there, but it's tied to accounts, it might not be that granular, it's really hard to find out the details, we thought at least. It could be different depending how you're using it, but we thought that it was not a complete tool for us. Another thing was that the development teams were used to just putting things up on some on-prem server. And now the cloud, yes, but they did not have the tradition of tagging things. So, they just threw things up there. They know their components. They know what they're doing. But if someone externally wants to look at how their components are doing and how the cost tracking is, for instance, we had no idea.
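A minimal sketch of the visibility problem described here, in Python. It assumes a hypothetical tagging guideline with three required keys (`team`, `service`, `environment`) and made-up resource records; none of these names or numbers come from the talk. It lists which resources break the guideline and shows how much spend cannot be attributed to a team.

```python
from dataclasses import dataclass, field

# Hypothetical required tag keys; the talk does not name the actual guideline.
REQUIRED_TAGS = ("team", "service", "environment")


@dataclass
class Resource:
    resource_id: str
    monthly_cost: float
    tags: dict = field(default_factory=dict)


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on a resource."""
    return [key for key in required if not resource.tags.get(key)]


def cost_by_team(resources):
    """Sum monthly cost per 'team' tag; spend without one lands in 'UNTAGGED'."""
    totals = {}
    for res in resources:
        team = res.tags.get("team") or "UNTAGGED"
        totals[team] = totals.get(team, 0.0) + res.monthly_cost
    return totals


if __name__ == "__main__":
    # Invented example data, for illustration only.
    inventory = [
        Resource("db-1", 1200.0,
                 {"team": "risk", "service": "pricing", "environment": "prod"}),
        Resource("vm-7", 300.0, {"team": "risk"}),
        Resource("bucket-3", 450.0),
    ]
    for res in inventory:
        gaps = missing_tags(res)
        if gaps:
            print(f"{res.resource_id}: missing {', '.join(gaps)}")
    print(cost_by_team(inventory))
```

The size of the `UNTAGGED` bucket is a simple measure of how blind a cost report still is: the larger it is, the less of the bill can be traced back to an owning team.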
So, tagging guidelines needed to be in place so we could see and get knowledge of what's actually going on. We then also hired a third-party tool to get a better overview. I know some examples where companies have built their own tools to consume all the cost files that the cloud provider is sending. We decided to go and buy a third-party tool directly. Jessica just mentioned that we were, and we still are, I would say, a generally build before buy company. But in this case, we needed to get traction now. We couldn't wait, and we didn't have the resources to put on this. So, we went on with a tool. And I would say that that has been really, really important in our journey to become better at this because now anyone, at team level, product owner level, management level could build their own customized reports and follow up on what's actually going on in different departments and teams. So, first now, we could actually follow up on what's going on, because until this point, we were blind. We could just see things are going out of control.
So, now we have a group in place. We start to get a vision. Where are we bleeding? What's not going on? So, our FinOps action number three was to make commitments. I would say this is the fastest way for you to be able to take control over your increased costs in the cloud. I mean, same thing as if you go to a bank, and you take a mortgage, you will probably not just take the list rate that the bank offers. You will probably get some sort of discount or negotiate for a discount with your mortgage. Same thing with cloud providers. They're probably willing to give you discounts because, of course, that will lock you in. But if you have made a move into the cloud, you probably are committed anyhow to be within their solutions for the next one to three or five years. So, making commitments is really our first step to start reducing our cost.
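The trade-off behind a commitment can be sketched with a little arithmetic. The Python below is a deliberately simplified model, not any provider's actual pricing: it assumes the committed amount is paid at a discounted rate whether or not it is used, and that usage above the commitment is billed at list price. The function names and all figures are invented for illustration.

```python
def monthly_bill(usage_at_list_price, commitment_at_list_price, discount):
    """Estimate a monthly bill under a simple commitment model.

    The committed amount is paid at the discounted rate whether or not it
    is used; any usage above the commitment is billed at list price.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    committed_charge = commitment_at_list_price * (1.0 - discount)
    overflow = max(0.0, usage_at_list_price - commitment_at_list_price)
    return committed_charge + overflow


def break_even_usage(commitment_at_list_price, discount):
    """Usage (at list price) above which committing beats not committing."""
    return commitment_at_list_price * (1.0 - discount)


if __name__ == "__main__":
    # Invented figures: commit to 80,000 per month at an assumed 30% discount.
    commitment, discount = 80_000.0, 0.30
    for usage in (50_000.0, 80_000.0, 100_000.0):
        with_commit = monthly_bill(usage, commitment, discount)
        print(f"usage {usage:>9.0f}: list price {usage:>9.0f}, "
              f"with commitment {with_commit:>9.0f}")
    print("break-even usage:", break_even_usage(commitment, discount))
```

In this model the commitment saves money whenever real usage stays above the break-even level and costs extra when usage falls below it, which is why the forecasting done before committing matters.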
And we used the FinOps group here because that was a small, agile group that could take quick decisions, not involving too many other people. So, they are looking at forecasts, budgets, what do we think, how much commitment do we want to make, for how long, and without involving that many other people. So, we can move quick and take quick decisions when we have these types of situations.
And our last action that we did was to start then the cultural journey. Okay, we had a problem here. So, teams used the cloud resources as if they were still on-prem. So, I know examples