From chaos to control: Building ML platform | Volvo Cars

Discover how Volvo Cars transformed their machine learning operations by building a unified ML platform. In this session, George Markhulia and Steve Larkin discuss how they overcame data silos, the architecture decisions they made, and best practices for scaling MLOps. Gain practical insights into improving data access, streamlining workflows, and fostering collaboration for faster innovation.

Transcript

Hello, everyone. Thanks for coming. My name is George. I'm the engineering manager for the ML operations team at Volvo Cars. And this is Steve. Steve is a senior engineer on the team. We're not going to talk about cars today. We're going to talk about Abakus, which is the platform we've built internally.

So less than two minutes is what it takes today for any engineer or data scientist to go from idea to validation. When we started building Abakus, we started with a goal in mind. We wanted to reduce friction and improve efficiency and, as a byproduct, also accelerate innovation. So we've eliminated long lead times, formal approvals, and a fragmented ecosystem within the company, among other organizational and technical challenges. Today we're going to share our journey with you. We're going to show you the platform, and we're going to go through the major design decisions that we took along the way. And I'll hand it over to Steve to go through the tech stack.

Yeah, thanks, George. So we chose a modern cloud-native stack specifically because it provided the scalability, reliability, and cost efficiency that we required at enterprise scale. The platform itself here is Abakus. It's the top layer here, and these are the components that we're going to talk about further today. Before we start, we'd like to acknowledge the other components that we build on. In particular, we use the Enterprise Container Platform, which is a common code base used to create many Kubernetes clusters onto which Volvo Cars deploys a myriad of container-based workloads. This platform-of-platforms approach allows us to focus on our primary concern, which is the ML platform. And as you can see, we're fond of cloud-native and open-source technologies. Abakus is built around the Kubeflow ecosystem of software, and we've added several other products to complement it and to integrate it into the enterprise environment.

So let's start by looking at the platform and setting the context with some numbers. This gives a hint of the kind of scale we're working at in our company. The highlights are that we have around 200 monthly active users, and we've been running production workloads here for around three years, including some time at the very start on a development cluster, which was not ideal. However, the figures that we are most proud of are the ones around the community. We use Slack for both announcements and support. We really believe that this transparency allows us to solve support issues out in the open, together with the users. And it's the engineers who are actually building the platform who are solving those issues and interacting with the users in Slack. There's no first- or second-line support, and we see that this creates a bond between the engineers and the users. We also see that other users dip in and help each other out, which is great. We have 48 contributors to our source code repo, and we've promoted the idea that anyone within the company can suggest a modification simply by creating a PR. And we see that that happens quite often.

So this is a little bit about where we are today, but what did we start from? This picture might be familiar to anyone working in the ML space. We won't dwell here too long, but in short, at the beginning, we had almost non-existent support for our data scientists.
So this led to a fairly scattered technology landscape, with individuals either working alone or within small groups, often solving the same problems, such as access to data sources. Without a common development process, we risked a lack of reproducibility and a lack of traceability from a model in production back to the source code, the parameters, and the data used to train it. And given the contemporary regulations, we see that as a big no-no.

So while building a central platform in an enterprise as large as ours isn't always the solution to every problem, in this case we think we've done the right thing: bringing together our diverse group of data scientists around a common environment, building an active community, leading people into common practices without being too rigid about it, and solving the common issues once on behalf of our users, things like the integration into the company's network and following its prescribed security standards. So let's take a look at how we've approached it.

This is the big picture of Abakus that we're going to be talking about today. Before we started, the data scientists basically had a blank canvas, and this meant they had little idea of where to get started. That led to fragmentation and a siloed way of working. There was a lack of common practices, and there were no templates to help them get started. It proved very difficult to roll out an ML model into production, and the teams who did do that struggled to monitor it and to operate it. They never got to the real feedback loop of continually and iteratively developing the model.

So we divided the lifecycle into three stages. These are three separate user journeys. The first mile is when users first onboard or create a new project. Then comes the day-to-day usage, which is about either building an ML model or producing insights. Then on to the last mile, which is when they take their model into production and continually monitor its performance.

First, let's take a step back and reflect on the job of a platform engineer here. In the beginning, we showed the tech stack picture, which looks like a subset of the CNCF cloud-native landscape. In this picture, we're showing how those various software projects are used to accomplish some task. If we take away the arrows, we see those same projects. And what we've learned is that this is the easier part of our job. Just installing all of these open-source projects with kubectl apply or with GitOps is quite straightforward. And at the start, we deployed a lot of different projects that we had no use for. But if we look only at the arrows, we're looking at the integration between all of those software products, and this is what accomplishes a task or builds a development workflow. This is what really helps us with those missing "ilities" we saw earlier, such as the lack of reproducibility, traceability, and visibility. So we're defining a process here for ML model development, and this is really what we see as the difficult part of the platform engineering work. This is where we introduce some control over the software development lifecycle, while also allowing some flexibility for our colleagues who are actually using the platform.

So if we zoom in on the first mile now: the first mile is about the onboarding process.
It's our first touchpoint with the user, so we want it to be as seamless as possible. We direct all users to a common URL, which is the Kubeflow Central Dashboard, and we've configured it so that it calls out to our own onboarding application when a user isn't already registered. This application creates all of the resources they might need: a Kubeflow profile; a namespace in the Kubernetes cluster; a GitHub repo generated from a template that is really good for getting started with data science; a container image registry project; a Vault secret store; and a repository in the data lake, which is a versioned object store that we need in order to link the model training data to the deployed model. We also deploy the CI infrastructure, which is later used to create the Kubeflow pipelines, and we use Tekton for that. And finally, we create an Azure AD group to tie together the access to all of these various services. The result is that the user receives an email with links to all of the services we've created for them. At this point, they can log into their personal namespace, explore the platform, and launch a notebook server to start prototyping.

From the user's perspective, this is a way to manage their profile, or in fact to create additional project profiles so that they can collaborate with other users, and to get access to the services shown here. From our perspective as the engineering team providing the platform, however, this is the tip of the iceberg. If we look below the waterline, what we're actually doing is handling the integration into the enterprise network and the security, identity, and access management systems. We're providing a comprehensive set of documentation, including tutorials and how-tos. We're ensuring multi-tenant isolation, which is really important on a shared platform: users can't see each other's work, namespaces, or data. We deploy most of this infrastructure via GitOps. And we also enable FinOps, financial operations, from the very start. We'll see how that works now.

So this is an image of the Kubeflow Central Dashboard, which comes from the open-source project. We've tailored it with our color scheme and links to our documentation and our Slack channel. We've also added a card in the top-left corner which shows the cost for each namespace, tracked over the previous 30 days. It makes it really easy for users to follow how much they're spending on the platform, because training ML models can be quite expensive, with GPUs and so on. We've also added the ability to create a new project, and we can take a quick look at how that works here. To create a new project, they come here, and that creates a new namespace in which people can collaborate. So we start the video. We set a name for the project. We can then optionally set a package name for the Python software that's generated from the template; we'll skip that here. We select a tier, which is either Insight or ML, and that's what George will talk about in a short while. And then it takes about 40 seconds to actually create the project and all of those resources in the background. But we won't dwell too much on that here.
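To make the onboarding flow above concrete, here is a minimal sketch of how such a resource-creating service step might look. The organization name, helper function, and payloads are illustrative assumptions, not Abakus internals; the real application makes one such call per resource against the respective product APIs (GitHub, Vault, the image registry, and so on).

```python
# Hypothetical sketch of one onboarding step: generating the project's GitHub
# repo from a data-science template. All names and URLs are assumptions.
import requests

def onboard(project: str, github_token: str) -> dict:
    """Create per-project resources and return links for the welcome email."""
    links = {}

    # GitHub's "generate from template" endpoint creates a new repo from a
    # template repository (here a made-up "example-org/ds-template").
    resp = requests.post(
        "https://api.github.com/repos/example-org/ds-template/generate",
        headers={"Authorization": f"Bearer {github_token}"},
        json={"owner": "example-org", "name": project, "private": True},
        timeout=30,
    )
    resp.raise_for_status()
    links["repo"] = resp.json()["html_url"]

    # The remaining resources (Kubeflow profile, namespace, registry project,
    # Vault store, data-lake repository, Tekton CI, Azure AD group) would each
    # be another call like the above, against their respective APIs.
    ...
    return links
```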
So if we look at what we've learned from this process: firstly, and reassuringly, we see that our colleagues act cost-consciously and responsibly when they're provided with the information. Around the beginning of each month, we send them an email that lists, for each namespace, the costs accrued, and you can see that email on the right-hand side here. It's the same information that we show in the Kubeflow dashboard, but what we've observed is that sending an email with the list of costs makes people take action. However, FinOps is a little bit subtle, because what we've seen is that some namespaces only accrue the cost of a cup of coffee, and that triggers people to off-board from the platform. Our hypothesis is that because the onboarding flow is fairly seamless, they know they can off-board and come back later on. And indeed, we do see users doing so.

We've also learned that we should have contributed more to the upstream open-source project, to Kubeflow. We were aware of this at the time. However, we decided that we would like to get into production first and quickly, because we already had development workflows running. Also, our set of patches wasn't very generic, so it would have taken time to get them into a form that could be accepted by the upstream project. Several years later, we think we should pick this up again and start actively contributing to Kubeflow. We have done so for other open-source projects, and we found the process to be both welcoming and very smooth, so we definitely recommend it.

And then finally, on the design of our onboarding application: today it's a simple microservice, and it makes REST API calls out to the various other service APIs to create resources. If we started again, we would use the Kubernetes operator pattern, and we might even use Crossplane to do so, which we've had good experience with for other cloud resources, such as databases and object stores. Again, at design time we were aware of the pattern, but we decided to go for the less risky, known quantity of a microservice making REST API calls.
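For readers unfamiliar with the operator pattern mentioned here, this is a minimal sketch of what a project operator could look like, using the kopf framework as one possible choice. The custom resource group, version, and kind are invented for illustration and are not part of Abakus.

```python
# Hypothetical sketch of the Kubernetes operator pattern for onboarding:
# a custom "Project" resource is reconciled into the per-project resources,
# instead of a microservice making imperative REST calls.
import kopf

@kopf.on.create("example.volvocars.com", "v1alpha1", "projects")
def create_project(spec, name, **_):
    """Called whenever a Project custom resource is created."""
    tier = spec.get("tier", "insight")  # Insight or ML tier
    # Here the operator would create the namespace, GitHub repo, registry
    # project, Vault store, etc.; Kubernetes retries the handler on failure.
    return {"tier": tier, "status": "provisioned"}
```

The appeal over plain REST calls is that the desired state lives in the cluster and reconciliation retries automatically, which is also what Crossplane provides for external cloud resources.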
So with that, I'd like to hand back to George, and he's going to continue with the day-to-day work of our users.

Once we've passed the first mile and onboarded to the platform, it's time for day-to-day operations. And this is where most of the work happens, for our team and for our users as well. If you look at the diagram, we see two boxes annotated with Insight and ML product. So what are those? Early in our journey, what we discovered is that if you throw too much technology, too many new terms, and too much functionality onto users, they will not accept it. Imagine, as a data scientist: one day you work in a notebook server on your laptop, and the next day you're thrown into Kubernetes, with all the new ways of working and the new tech stack that comes with it. So we took a step back. We had to rethink how we do this. And this is how we solved the problem: we introduced a tiered approach, which is Insight and ML product. This approach lets us meet users where they are. If it's simple exploratory analysis, hypothesis testing, or reporting, the Insight tier is built for that.

It offers fewer new terms and less technology stack than the ML product tier, which is more of a next step for when you want to get into production and you want the automation and reliability that the system can provide. So let's take a look at them individually.

The Insight tier, as I already said, is the lighter of the two. And everything begins with code versioning. When users onboard, we create a GitHub repository from the template, which structures the code as a Python package. Every project that onboards to the platform follows the same pattern. So everyone has the same code structure, which makes it much easier for us to help and troubleshoot problems down the line, and also for users to help each other out. Users can clone this repository into the notebook server, which is now running on the Kubernetes cluster. They can install the package and then continue working exactly the way they were working before on their laptops, but now with the additional benefits of being connected to the data sources and being compliant with whatever organizational or network restrictions are in place. We also offer a Spark cluster for distributed compute, but what we've learned is that most of the time it's more than enough for users to use something more lightweight like DuckDB; it fully satisfies their analytical needs. We've invested a significant amount of time to ensure that the platform is well connected, and that really took a lot of effort. We also provide data versioning capabilities built into the platform, for when you have medium-sized data sets that you work with in isolation. So the Insight tier is all about giving users the right balance of simplicity, flexibility, and depth to do their best work without unnecessary friction.

Then, once the idea has been validated, it's time to introduce automation and give the project a bit more of a production look and feel. And that's what we do in the ML product tier. It's worth noting that this tier doesn't replace the Insight tier; it extends it. Everything you had before, you still have, but you get additional functionality and additional reassurance that the system is production ready. First of all, we enhance the GitHub repository to hold the manifests and source code for arbitrary applications, like FastAPI or Streamlit, which are quite popular among data scientists, and also to hold the source code and manifests for their pipelines and their components. The repository is pre-configured with a GitHub webhook, and when they push a commit to their branch, the CI picks it up: we build their source code, we build their components, we tag them with the commit SHA for traceability, and we upload them to the image registry. Then, in the same cycle, we build their pipelines, update the references in those pipelines, and compile them. Again, everything is tagged with the same commit SHA for full traceability, and then it's uploaded to Kubeflow. And Kubeflow is now a familiar environment for our data scientists and engineers, and that's where most of the daily work happens for them: they use hyperparameter tuning with Katib, they schedule pipelines for data processing and pre-processing, and they serve models directly from the pipelines as well.
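As a rough illustration of what this daily pipeline work can look like, here is a minimal Kubeflow Pipelines definition using the kfp v2 SDK. The component and pipeline names are invented; the actual Abakus templates and CI wiring are more elaborate.

```python
# Minimal Kubeflow Pipelines sketch (kfp v2 SDK). Names are illustrative.
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def train(learning_rate: float) -> float:
    """Stand-in training step; returns a dummy metric."""
    return 1.0 - learning_rate  # placeholder for real training logic

@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train(learning_rate=learning_rate)

if __name__ == "__main__":
    # In the workflow described above, this compile step runs in CI on every
    # push, with image references pinned to the commit SHA before the
    # compiled pipeline is uploaded to Kubeflow.
    compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```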
And I want to really emphasize the importance of the CI here, because you can read the documentation for Kubeflow, for Tekton, or any other project, and what you get is a bunch of small, very atomic examples of how you do X, Y, or Z. But there is rarely a recipe, a silver bullet, that tells you: this is how you build the CI for an ML system. This is where a lot of our team's thought and time goes, along with collecting feedback from the users: how do you build a CI system that is useful and that actually helps? Because without the CI system, imagine everything we've talked about up until now, the building, the dependencies, the tagging, the updating of references; all of this would have to be done manually, which is both error-prone and tedious. That's why we focus so heavily on standardization and automation here. By providing a common structure and a rigorous CI project, we ensure that everyone, from first commit to production, follows a consistent, reliable, and tested path. And that's precisely what this tier delivers: a unified, automated framework that lets teams move faster and focus on building value instead of wrestling with infrastructure.

So what have we learned? Well, the obvious one: start simple. Don't do too much too early. Listen to your users, find the sane defaults to start with, and then build on top of that. You need to design for different personas, and also for somewhat different backgrounds and experiences. So introduce functionality gradually. That's why we took the tiered approach: you cast a wide net, you get users at different levels, and then they grow together with you and help you improve the platform.

Balancing freedom and control can be a little bit tricky. On one hand, you want to give users the freedom to work with what they're used to and what they like. But at the same time, you need to have best practices in place, and you need some limitations that users follow, because you want to provide this automation and this reliability. If everyone does whatever they want, then first of all, some users find it difficult to start because they don't know where to start. And even if they know where to start, they might use the platform in ways that you haven't thought about. It was mentioned earlier that contracts really don't matter, and I kind of agree. But if you manage to tailor the workflow to the majority of the use cases, that's already a success.

Last but not least, CI is more difficult than it seems, because the difference from regular CI in the software world is that you don't only need to think about how you build and version your code; you also have a model that you need to train and version, you have data that you need to version, and you have every other artifact in between that you also need to version. And all of these things can have different life cycles and different paces of development. That's why you really need to sit down and understand, within the context of your organization, what is required; roughly speaking, what is the first-class citizen around which you build the rest of the system.
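One way to picture the versioning problem just described: every training run records, in one place, the versions of the code, the data, and the resulting model. This is only a hedged sketch of the idea, not the Abakus implementation; the field names and helper are invented.

```python
# Hypothetical sketch: record code, data, and model versions together per
# training run, so a production model can be traced back to all of them.
import json
import subprocess
from datetime import datetime, timezone

def write_run_manifest(data_version: str, model_uri: str, path: str) -> None:
    """Write a small traceability manifest next to the model artifact."""
    commit_sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    manifest = {
        "commit_sha": commit_sha,       # the code version
        "data_version": data_version,   # version in the data-lake repository
        "model_uri": model_uri,         # where the trained model was stored
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
```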
The hypothesis is tested, the models are trained, we're happy with them, and it's time to take them to production. There are two ways to do this. The wrong way is to throw it over the fence: you have a model, you have a piece of software, you throw it over the fence and let the ops team figure it out. And there is the right way, which is to enable your teams: give them the capability, the tools, and the knowledge they need to own their product end to end. We went with the right way. And since we've built this functionality gradually and introduced it gradually, there is really not much change in production compared to the previous tier, the ML product tier. There are two noticeable differences, two new things, maybe.

The first one is that deployment now happens entirely through GitOps. So whereas earlier they might have deployed a model as the final step of their pipeline, that doesn't happen here anymore. Everything is already in their GitHub repository, and everything is now synced by Argo. The manifests are deployed, the manifests are constantly monitored in a sync loop, and if something happens, if a pod runs out of memory or whatever, it's self-healing, and then you get a notification and you can react.

The second difference is ML monitoring. One of our design principles here was flexibility. We knew that many teams already had their own storage accounts, and we wanted them to be able to use those storage accounts for ML metrics as well. So we've built a monitoring and logging system which is pretty simple if you think about it, but it does one thing and it does it really well. It receives the CloudEvents from the inference services, it gets the inputs and outputs from the metadata of the events, it correlates those inputs and outputs, and then it just stores them, in the team's own storage account. This design turned out to be very efficient and very effective, because it gives teams full control over what data they want to share, how they want to store it, what access policies they implement, and what retention policies they implement to stay compliant with GDPR or with restrictions specific to their part of the organization. Because Volvo Cars is a huge company, and you can have different ways of working and different policies in different parts of the company.

On top of that, we have alerting built directly into the workflow. If the application crashes or the model crashes, Argo sends a notification to their Slack channel, and the same happens with ML monitoring as well. This is a reusable component: the only thing they need to provide is the URL to their channel, and the rest is taken care of. This is one of the benefits of being on a platform team: you've talked to enough users to understand the patterns, and then you can tailor for those patterns. An added benefit is the networking, so TLS certificates, end-to-end encryption, ingress, egress. Honestly, users don't even know it's there. It's just there.
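The monitoring component described above can be pictured roughly like this: a small service receives the CloudEvents emitted by the inference services, pairs each request with its response by event id, and writes the joined record to the team's own storage. This is a hedged sketch under those assumptions; the event fields and the `store` object are illustrative, not the actual Abakus code.

```python
# Hypothetical sketch of the request/response correlation described above.
import json

pending: dict[str, dict] = {}  # request events waiting for their response

def handle_cloud_event(event: dict, store) -> None:
    """Correlate request/response CloudEvents and persist the joined record."""
    event_id = event["id"]               # CloudEvents spec: required 'id' field
    payload = json.loads(event["data"])  # model inputs or outputs
    if event["type"].endswith(".request"):
        pending[event_id] = payload
    else:  # the matching response event carries the same id
        request = pending.pop(event_id, None)
        record = {"id": event_id, "inputs": request, "outputs": payload}
        # The team decides where this lands: their storage account, their
        # retention policy, their access rules.
        store.append(json.dumps(record))
```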
So what have we learned from here? Well, few projects make it into production, and that's completely fine. Many are about experimentation, about insights, about a thesis maybe, but that's still valuable, because you learn something new from every project.

Then there is GitOps. To us it was quite obvious that, as a platform team, you need to follow GitOps practices; you need to deploy infrastructure in a reliable manner. But once you show users that they can, and should, deploy their application and their model via GitOps, it completely changes the way they see deployment in general. They never want to go back to the way they were doing it before.

And production is not a finish line that you cross and call it a day. It's an iterative process. Models need to be monitored. They need to be retrained and redeployed. And if you have a seamless transition between the stages, a standardized workflow, then this becomes second nature. You don't see a model release as this big event where everyone crosses their fingers and prays that it succeeds. It just becomes a day-to-day operation.

And everyone needs a development environment. Sometimes, I don't know why, this is controversial with certain believers in testing in production. And in some cases it is true. If you have a mobile app, with users of the mobile app, and you release multiple times a day, you can A/B test with a subset of users: you release to production, and if it goes wrong you roll back; if it goes right, you release to the rest of the users. But as a platform team, imagine doing that: okay, let's try a new version of Kubeflow, you release it, and then none of the teams can work. That's not great, right? That's why, as a platform team, you need a development environment where you roll out new features, test them, and validate them, and only then roll them out to the prod cluster. The same applies to our users. They really like having the option of a development environment. And that can mean different things; it doesn't need to be a different cluster. In our case, it's a different namespace, where they run their pipelines, schedule their workflows, and train their models. But once a model is ready for release, they can, and should, release it into the production namespace, where there is no automation apart from GitOps syncing the manifests with the new version.

I'll go through this very quickly. I would like to summarize in two parts. The red part is the benefits that we see of building the platform yourself within the company, and the green part is the learnings. The benefit is that you really can tailor it for the business. You're very close to your users, or at least you should be. You really understand the company and what teams are trying to achieve, and then you can tailor your solution, your platform, to the company you're working for. And ultimately, the success of your users is the success of your platform. That's why it's a two-way street of knowledge sharing, learning, and competence building. Because when you introduce, let's say, in our example, a Python package as the default structure for the GitHub repository, even the data scientists who didn't like it and really preferred the notebook server are now exposed to a Python package, and they start to think about how they structure and design their code.

And the learnings: well, the biggest one has already been mentioned by Steve, but integration is a much, much bigger effort than installation. You can pip install, helm install, kubectl apply, or double-click-install a tool, and it works, maybe. But integrating everything together within the platform, and then integrating the platform externally within the company, is where most of the work is.
And the devil is in the details. The documentation might say one thing, but in reality it can be very different. So you need to treat your platform as a product, because that's what it is. You have users that you need to tailor for. You need to listen to them. You definitely should not implement everything they ask for, but if there is a large number of users who need a certain feature, that's probably a good indicator that this is what your platform is missing and what you should focus on. And keep it modular and extensible.

All right, I'm out of time. Thank you very much. I hope you enjoyed it.