
AI FinOps: Why GPUs are expensive (and often wasted)

AI is everywhere, but the real cost often stays hidden.

In this episode of DevOps Sauna, Pinja and Stefan dive into AI FinOps and explore why GPUs are expensive and why so much AI infrastructure ends up underutilized. They unpack hidden costs like idle hardware, power consumption, networking, and data movement, and explain why traditional Cloud FinOps doesn’t fully apply to AI workloads.

They also discuss how Kubernetes and Dynamic Resource Allocation (DRA) can help improve GPU utilization, and why visibility, ownership, and organizational maturity matter just as much as tooling.

[Stefan] (0:03 - 0:07)

If your data is garbage, well, your model is going to be producing garbage.

 

[Pinja] (0:12 - 0:21)

Welcome to the DevOps Sauna, the podcast where we deep dive into the world of DevOps, platform engineering, security, and more as we explore the future of development.

 

[Stefan] (0:22 - 0:31)

Join us as we dive into the heart of DevOps, one story at a time. Whether you're a seasoned practitioner or only starting your DevOps journey, we're happy to welcome you into the DevOps Sauna.

 

[Pinja] (0:37 - 0:44)

Hello and welcome back to the DevOps Sauna. I am joined by my co-host, Stefan. How are you doing, Stefan?

 

[Stefan] (0:44 - 0:50)

All good. The sun is out. No, not really.

 

Everything is gray and boring again.

 

[Pinja] (0:52 - 0:59)

We actually have actual sunshine here in Helsinki today, and it's a light phenomenon that nobody knows what to do with.

 

[Stefan] (1:00 - 1:07)

Did you take a photo of it? I've heard Finnish people do that whenever the sun is out, this yellow globe in the sky that you never see.

 

[Pinja] (1:07 - 1:11)

It's a must to take a photo of that, because the next time might be two months from now.

 

[Stefan] (1:12 - 1:15)

I might listen too much to these Finnish Instagram people.

 

[Pinja] (1:15 - 2:01)

It's really dangerous. We're getting back to our series on the trends for the software development lifecycle for 2026. A couple of episodes ago, we did an episode on all the trends that we at Eficode recommend looking into for 2026.

 

There are six of them, and one of them is AI FinOps and optimizing your GPU usage with Kubernetes and the help of DRA. Today, let's do a deep dive again. We're going to talk about the cost of AI and, more specifically, how we can pay more attention to optimizing GPU usage.

 

Do we want to squeeze the last bit out of the hardware that is running our GPUs and optimize that? Let's talk about the hidden figures in the pricing and what you can do with Kubernetes.

 

[Stefan] (2:01 - 2:12)

This topic is the topic that nobody dares talk about because everybody wants AI, nobody wants to talk about the cost or how you can optimize the cost around it. I think it's a perfect plan for 2026.

 

[Pinja] (2:12 - 2:42)

Yeah, we've been talking about AI for so long. That's what it feels like. The LLMs have been here for a couple of years, but it is not as simple as it may sound when we start talking about the hidden numbers in the pricing.

 

The GPUs, of course: we need them to run the LLM, and we need them to train the LLM. GPUs and AI setups have massive power and cooling requirements behind them. And it's not just the compute and the training, but also the waiting part.

 

[Stefan] (2:42 - 3:01)

Yeah, I saw somebody writing, I don't know if it's true, it's the internet, nothing is true, but somebody actually wrote that just a rack of hardware for running LLMs would probably require megawatt-hours per year instead of the regular kilowatt-hours we calculate with. In the end, those are still small numbers. I've calculated stuff in gigawatt-hours as well.

 

[Pinja] (3:02 - 3:26)

Yeah, so looking into the idle time of things, we want, of course, to optimize how we use the resources we have. To give you an example, if you are utilizing your GPU 20% of the time, that does not mean you pay 20% of the price. You still pay for 100%: the idle time, the waiting time, is part of the bill as well.
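
 

To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch; all prices and hours are invented for illustration, not real cloud rates:

```python
# Toy illustration: a GPU billed by the hour costs the same whether it is
# busy or idle, so the effective price per *useful* GPU-hour scales with
# 1 / utilization. All numbers are invented for illustration.

hourly_rate = 4.00     # $/hour for one GPU instance (made-up figure)
hours_in_month = 730   # average hours in a month
utilization = 0.20     # the GPU is busy 20% of the time

monthly_bill = hourly_rate * hours_in_month
useful_hours = hours_in_month * utilization
cost_per_useful_hour = monthly_bill / useful_hours

print(f"Monthly bill:            ${monthly_bill:,.2f}")
print(f"Useful GPU-hours:        {useful_hours:,.0f}")
print(f"Effective $/useful hour: ${cost_per_useful_hour:.2f} "
      f"({1 / utilization:.0f}x the list price)")
```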

 

[Stefan] (3:26 - 3:56)

Overcapacity is never your friend, no matter which FinOps perspective you're looking from. You might make some deliberate decisions because you need some readiness in your setup, but having 80% just spinning in a corner, that sounds crazy. I think NVIDIA has been quite public around it.

 

Depending on the conference you listen to, it's like 5%, 10% or 15% utilization globally on their GPUs, which is just crazy when you hear people can't really buy GPUs all of a sudden. There's definitely a place for optimizing here.

 

[Pinja] (3:57 - 4:16)

There is. Not to get into the environmental cost of everything, that's a separate discussion. One more hidden cost category, there are plenty of others, of course, but one to highlight here is supporting infrastructure costs.

 

A couple of things to highlight: for example, high-speed networking costs, which are needed for fast storage, and of course the data movement costs.

 

[Stefan] (4:16 - 4:29)

It's actually fun that you mentioned it. I hadn't even thought about the tangent of going into the whole ESG discussion with the CO2 calculations and everything. That's probably going to be a totally different discussion someday.

 

Nobody dares doing that in...

 

[Pinja] (4:29 - 4:34)

Yeah, I think we need to put a pin in that and take it up in a separate episode.

 

[Stefan] (4:34 - 4:34)

Yeah.

 

[Pinja] (4:35 - 4:43)

Because we know that AI is here to stay, and I really hope that with some optimization, the ESG side might improve as well.

 

[Stefan] (4:43 - 4:50)

But if you're doing all of this FinOps stuff, is it just like regular FinOps, or is there a difference between Cloud FinOps and AI FinOps? Is it the same thing?

 

[Pinja] (4:52 - 5:31)

Somebody might say that it's almost the same. If you look at the cost and cost optimization of running a cloud, there are similarities. You might think that, oh, but this is just a web application.

 

It kind of is. And at the same time, there are some differences. Let's go back to why we actually think FinOps is not as easy as we might think: because it's not purely a finance discipline, but it's not purely a technical discipline either.

 

So you cannot push it to one side of your organization or the other, right? It has to be a shared responsibility for the profit and loss.

 

[Stefan] (5:31 - 6:27)

It's quite funny because some of the big cloud providers have small courses for certified FinOps something-something, more like a three-hour course, and you hand it out to technical people, and they have no idea how to run FinOps in a balanced way afterwards. They just see something that is costly and they want to reduce the price. But what if it's okay to burn that money on the solution?

 

You need to figure out who's responsible for the profit and loss. Is the organization at a level where you can take these decisions? Is the profit and loss responsibility distributed to teams, areas, or whatever?

 

FinOps is actually quite tricky; it's more than just looking at it and saying, all right, we'll reserve instances, or we'll pay upfront for some compute to get it cheaper. There's way more to it. But oh well, everybody sees some sort of concept, runs with it, figures things out, gets wiser, keeps learning.

 

That's how it is in everything we do in IT. Nobody's perfect from day one.

 

[Pinja] (6:27 - 7:07)

No, no, no, no, no, obviously not. And we have our own take on it, obviously. It's about the maturity of your organization, which you need to understand.

 

Are you at that level where you know how to distribute this responsibility? It might weight to one side or the other, obviously, but please do take that into consideration, because AI FinOps as a term is not fully established as of this moment; we're talking here at the end of January 2026.

 

I understand there are some companies that are trying to claim the term at the moment to make it a selling point for themselves, but making this cost visible is not a new thing to look at.

 

[Stefan] (7:07 - 8:23)

Yeah. All of a sudden you need to think about cost per inference, training cost, efficiency, new aspects of how you actually measure the efficiency of what you're paying for. Going back to the good old total cost of ownership, who has the ownership here?

 

Are we big organizations where we just need to do AI, or have we actually built up a structure and figured out like, all right, X, Y, and Z can use this. These guys need training. These need inference.

 

These need both. How do we figure out just the variety of personas when we talk about AI? It's just crazy.

 

And usually we only talk about data scientists and developers because of inference and training, but what about regular business users? They use something like Copilot 365 or whatever tool they have, ChatGPT. Is that something we need to run some FinOps on as well?

 

I think it's ChatGPT that bills per seat or something like that, and probably something around the model as well. But if you hand out a license to everyone, do they actually use it? We see the same questions when we go out and run Copilot workshops.

 

People want to have metrics afterwards to figure out, are people actually getting more efficient with this? Is there a good return on investment? That's a good precursor for FinOps.

 

Is it a good return on investment if we do this?

 

[Pinja] (8:24 - 9:08)

And back to the question, why is this different from running Cloud FinOps? Cloud has been around a "little bit" longer than LLMs. Just a decade or so. Just a little bit.

 

So with Cloud FinOps, we're working with a fairly stable environment; it has been around a little bit longer. We're talking about more predictable workloads, and the financial side is more stable as well.

 

But we have such a high level of fuzzy pricing around AI at the moment, and that might actually change, perhaps not exactly tomorrow, but any day, maybe, is the analogy here. So the maturity is slowly, slowly rising with AI.

 

[Stefan] (9:09 - 10:31)

There is even a chance that AI is going to bleed back into the Cloud FinOps. Because if you read on different news sites, you'll see the price of memory is rising these days. Would that bleed back into regular Cloud FinOps?

 

So many things. But yeah, AI by nature is non-deterministic. How can you predict the cost of something that is non-deterministic?

 

We want AI to be non-deterministic because it should be creative, give us good responses, and help us do X, Y, and Z. And as soon as we do that, we have the issue that if I enter the same prompt twice into whatever AI tool I use, the tokens used underneath might not be the same. It can be really cheap one time and insanely expensive the next, because it's non-deterministic.

 

That's how it is. And building on that, we see more and more people going into long dialogues with a big context. The bigger the context, usually the bigger the price, because we need to ship more context, and it sets higher requirements for the models we're running.

 

It's sort of like, I wouldn't say an evil spiral, but an expensive spiral perhaps. We need the bigger context to get these better answers. Because honestly, in the beginning, AI was so horrible.

 

You ask it for something and it just falls head over heels and finds some random response somewhere, like, yeah, sure, this is it. The overly happy AI, as we've talked about on several occasions.

 

[Pinja] (10:32 - 10:56)

Say you're actually using ChatGPT for a really long chain of prompts. You might have given it some context in the form of a PDF, or maybe an Excel sheet or something.

 

Then all the time you're referring back: go back to the previously submitted documents and, with that context in mind, give me this. It has to review it all over again to keep the context.
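
 

A rough sketch of why that re-reading adds up: if the full history, including the attached document, is resent on every turn, total input tokens grow roughly quadratically with the number of turns. The token counts and per-token price below are invented for illustration:

```python
# Toy model: each turn resends the document plus the whole chat history,
# so input tokens accumulate turn by turn. All counts and prices are
# invented for illustration.

doc_tokens = 20_000         # the attached PDF/Excel, tokenized
turn_tokens = 500           # average prompt + answer added per turn
price_per_1k_input = 0.003  # $ per 1,000 input tokens (illustrative)

total_input = 0
history = doc_tokens
for turn in range(20):      # a 20-turn conversation
    total_input += history  # everything so far is re-read on each turn
    history += turn_tokens  # the new exchange joins the context

print(f"Input tokens over 20 turns: {total_input:,}")
print(f"Estimated input cost:       ${total_input / 1000 * price_per_1k_input:.2f}")
```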

 

[Stefan] (10:56 - 11:20)

Just coming up with titles. Let's say you write a blog article and ask it for a title: give me 10 titles for this. The first response you get is like, this doesn't make any sense.

 

You ask for it more and more and more. The more you refine your output, the more expensive it's going to get. I still love when, wasn't it some old man from OpenAI who said, please stop saying hi and would you please to the AI?

 

Because it's actually costly.

 

[Pinja] (11:21 - 11:22)

Yeah, please especially.

 

[Stefan] (11:23 - 11:25)

People still do it because they fear the AI overlords.

 

[Pinja] (11:26 - 11:38)

Yep. I still feel that, in order to make my position more stable when the AI rises and to keep the AI overlords happy, I should still use the please. But we do need to understand the context setting and the price of it.

 

[Stefan] (11:38 - 11:46)

I'm so dead when that happens because I'm like a dictator. I want X, Y, Z. So sorry, you're going to lose me at some point.

 

[Pinja] (11:47 - 11:58)

Let's talk a little bit about how FinOps is actually related to the architecture. If we think about the opportunity to save money in the AI space, we need to consider labels and how important they are for allocation and visibility.

 

[Stefan] (11:59 - 12:56)

Yeah. It sort of bleeds into platform engineering. We want a stable setup with good capabilities.

 

If a team or area wants to run any AI workloads, make sure they're labeled correctly. The labels should be usable for, some would say, allocation, if we're talking from the cost perspective, but even more from the visibility side: who is actually consuming what, and to what degree. We might accept that it's costly, but still we should be able to say that Team X is using 90% of our AI capacity.

 

Of course, that's how it is, they're doing some research or whatever, but make sure we can actually figure out who's using what and to what degree. So we come back to the whole discussion of, what's the training cost?

 

What's the cost per inference as well? And if we're sitting in a big product organization and we have the responsibility for profit and loss, we are actually interested in knowing like, how much are we using? What's the cost of it?

 

Does it actually make sense compared to what the user actually gets out of this feature? We need to go down that road.
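
 

As a sketch of what that label-driven visibility could look like once usage records carry a team label; the records, team names, and hourly rate here are all hypothetical:

```python
# Toy cost-allocation report: sum GPU-hours and cost per team label.
# The usage records, team names, and hourly rate are hypothetical.
from collections import defaultdict

GPU_HOURLY_RATE = 4.00  # illustrative $/GPU-hour

usage_records = [  # imagine these exported from cluster metering
    {"team": "research", "workload": "training",  "gpu_hours": 900},
    {"team": "research", "workload": "inference", "gpu_hours": 120},
    {"team": "search",   "workload": "inference", "gpu_hours": 60},
    {"team": "platform", "workload": "batch",     "gpu_hours": 20},
]

by_team = defaultdict(float)
for rec in usage_records:
    by_team[rec["team"]] += rec["gpu_hours"]

total = sum(by_team.values())
for team, hours in sorted(by_team.items(), key=lambda kv: -kv[1]):
    print(f"{team:10s} {hours:6.0f} GPU-h  ${hours * GPU_HOURLY_RATE:9,.2f}"
          f"  ({hours / total:5.1%} of usage)")
```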

 

[Pinja] (12:57 - 12:58)

And we want the structured metrics.

 

[Stefan] (12:58 - 13:02)

Yeah. Structured metrics so we can actually look at the return on investment.

 

[Pinja] (13:03 - 13:05)

This is a very cautious approach, isn't it?

 

[Stefan] (13:05 - 13:05)

Yeah.

 

[Pinja] (13:05 - 13:15)

So we have the estimate from IDC that budgets for 2027 will underestimate AI infrastructure costs by 30%.

 

[Stefan] (13:16 - 13:18)

That's going to be fun by the end of 2027.

 

[Pinja] (13:18 - 13:20)

We're now talking about a year from now.

 

[Stefan] (13:21 - 13:21)

Yeah.

 

[Pinja] (13:21 - 13:24)

A year from now is basically what we're talking about.

 

[Stefan] (13:24 - 13:30)

Just sitting as the CFO and getting that bill that is 30% higher all of a sudden like, oh, that's going to be fun.

 

[Pinja] (13:30 - 14:06)

And speaking of investment in AI, we're not going to go very deep into the specifics of the financials behind the investment decision right now, but this needs to be taken into consideration. What you want to have in mind is the expected outcome of the AI investment. We talked about ROI, but there also needs to be the understanding that not everything can have a positive ROI, and coming to terms with that.

 

So there is, for example, experimentation, surveying different models, getting accustomed to agent orchestration. There's so much new in this area that you have to come to terms with the fact that there will be some costs here.

 

[Stefan] (14:06 - 14:47)

Yeah. Research and development is a costly thing to do. Whenever you start out something that is called research and development, it's going to be costly.

 

You have to accept there will be a loss somewhere, because when you do research, there are projects that fail. I don't know, there seems to be this bias in IT that all research will actually pay off.

 

If you ask somebody actually doing research in science, most of them will happily say, well, we scrap maybe 60 or 80% of the research we do because it's not good. That's how it is.

 

Don't force it into success. You go back to the old saying, I think from some statistics guy: there are lies and there are statistics, and they're roughly the same. You can squeeze statistics into being a truth somewhere.

 

[Pinja] (14:48 - 14:56)

It's going to take some time to find the right tools and the right way to use them. But also, we talked briefly about the billing models earlier in this episode.

 

[Stefan] (14:57 - 14:57)

Yeah.

 

[Pinja] (14:57 - 15:06)

Is it per seat? Do you need to look at it as per seat plus a model multiplier? There are other ways.

 

Look at per agent, or per token spent, for example.
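
 

A quick sketch of how two of those billing models compare: a flat per-seat license versus pure per-token billing, with invented prices and usage figures:

```python
# Toy break-even check: flat per-seat licence vs pure per-token billing.
# All prices and usage figures are invented for illustration.

seat_price = 30.00          # $/user/month flat licence
price_per_1k_tokens = 0.01  # $ blended per 1,000 tokens

break_even_tokens = seat_price / price_per_1k_tokens * 1000
print(f"Break-even: {break_even_tokens:,.0f} tokens/user/month")

for monthly_tokens in (100_000, 1_000_000, 5_000_000):
    pay_per_use = monthly_tokens / 1000 * price_per_1k_tokens
    better = "per-token" if pay_per_use < seat_price else "per-seat"
    print(f"{monthly_tokens:>9,} tokens -> ${pay_per_use:6.2f} "
          f"vs ${seat_price:.2f} -> {better} wins")
```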

 

[Stefan] (15:07 - 16:05)

I think when we see the coding assistants, we see a lot of them having some sort of multiplier on your prompt, depending on which model you want. Some of them have started introducing auto model selection to find the cheapest model. But then you go down the road of, are all models equal?

 

I know a lot of people who have actually ventured in and tried different models for the same setup, and they get quite different responses. I even saw some that adopted a new model and actually ended up going back to an older model because the new model wasn't as good for their job. And they saved a massive amount of money by going to an older model because it's not so complex, it's not so big, and it was even faster.

 

So the feedback loop for them was even better as well. We need to look at the value stream of all of this. We can run a five-episode thing on investments in AI, depending on which different personas we're talking to.

 

Are we trying to support business users, developers? What are we doing here?

 

[Pinja] (16:05 - 16:46)

Yeah. But instead, today, let's talk about maybe the main thing here: the utilization of GPUs. More specifically, what is usually considered the real financial issue is underutilization.

 

This is where the money most often leaks indirectly, and there can be multiple reasons for it. We already mentioned that we need to take into consideration the power and cooling requirements and idle time. So the underutilization perhaps has multiple reasons behind it, and it's not done on purpose.

 

But what we've seen is this mental model where GPUs are treated as a scarce resource.

 

[Stefan] (16:46 - 17:48)

Yeah. And it depends on the organization: if somebody says, now we should do everything with AI, well, if you don't teach and train your people, you will have underutilized AI.

 

And in the end, that's going to be an underutilized GPU somewhere, whether it's something you buy per seat, or you buy your own models, or you're running on your own. There is a GPU spinning somewhere and not doing anything, and that's costly. But you need to figure out whether the GPUs are dedicated, or what sharing model you dare use.

 

I saw, I think it was an announcement from NVIDIA at KubeCon, was it this year or last year, where they were talking about, what do you call it, time-sharing of GPUs. Well, a lot of people were leaning back in their seats like, oh, we don't dare do that, because what if data bleeds between X, Y, and Z?

 

Well, you might be in a highly regulated industry where you need to make sure that everything runs in isolation. That means you will need dedicated GPUs.

 

And then you can just see the price going up. Tough luck. You're in highly regulated spaces.

 

Everything is more costly, to be honest.

 

[Pinja] (17:49 - 18:19)

Yeah. So static allocation of resources is one of the issues. What we often see is that many organizations don't have a system-wide optimization policy.

 

Instead, they look at things more on the team level. Again, going back to the so-called dedicated GPUs: for example, this part is for this team only.

 

This is for that team. So again, not making full use of it. That might also be because of a missing policy.

 

It might be that you don't have a shared platform to view all of this.

 

[Stefan] (18:20 - 19:05)

And sometimes you get a surprise because you didn't set up any anomaly detection on your AI usage. So all of a sudden you just see GPU usage rising and nobody thought about that. Nobody's looking at the metrics or the dashboards.

 

Then you get a humongous bill afterwards. What if you had anomaly detection that would actually catch these? It comes in ebbs and flows.
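
 

A minimal sketch of that kind of anomaly detection: flag a day when spend jumps well above the trailing average. The window, threshold, and spend figures are invented for illustration:

```python
# Toy spend-anomaly check: flag a day if it exceeds the trailing average
# by more than 50%. The window, threshold, and spend figures are invented.

daily_gpu_spend = [210, 195, 205, 220, 198, 202, 640, 655, 230]  # $/day
WINDOW, THRESHOLD = 7, 1.5

for day, spend in enumerate(daily_gpu_spend):
    history = daily_gpu_spend[max(0, day - WINDOW):day]
    if not history:
        continue  # no baseline yet on the first day
    baseline = sum(history) / len(history)
    if spend > THRESHOLD * baseline:
        print(f"day {day}: ${spend} vs baseline ${baseline:.0f} -> ALERT")
```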

 

But that's just some of the raw things. One of the other causes of underutilized GPUs is having humans in the loop. We always talk about, well, do we really trust AI?

 

Are we ready for that? Well, if you're not ready for that and you always put the human in the loop, your AI will have a pause every now and then and just spin in the corner waiting for you. If it's spinning in the corner waiting for you, it's wasting money.

 

It's like goods sitting in storage somewhere: that's money lost.

 

[Pinja] (19:05 - 19:21)

I guess the question is not so much how we make sure we don't need a human in the loop, but how we find the most crucial parts where we absolutely need one, and then prioritize those over having a human in the loop for every single thing.

 

[Stefan] (19:22 - 20:22)

Yeah. I saw some interesting experiments with agent orchestration and when it should reach out to a human: they had a small queue where questions would land when the agent just gave up. Well, I've seen people build agents that never give up.

 

They just give horrible answers in the end. How do you figure that out? Well, then you need to go back to: how do you actually write a test that it can fulfill?

 

How do you give it specifications? You need to do some pre-work. We'll let AI write the test.

 

We'll let AI write the specifications. At some point, we need to be in control to some degree, because we ask it for something. If we're not specific enough, it's going to generate whatever.

 

Then we're just going to start all over when we figure out our specification was horrible, which might be okay if we have enough money and we just want AI to be looping all of the time, but that requires money. The alternative is that we don't have enough work for our GPUs: we committed to this humongous AI program, and what should we do with it?

 

That's going to be the same story again. It's sitting in a corner waiting for us.

 

[Pinja] (20:23 - 20:36)

Another thing that might cause GPUs to be underutilized is supporting infrastructure that is not capable enough, creating bottlenecks: slow networks and disks.

 

[Stefan] (20:36 - 22:01)

Yeah, I've heard some really fun talks like, yeah, we want to train our own models. The talk about training some sort of AI model quickly turns into: all of the supporting stuff we had wasn't good enough. The network was too slow.

 

The disk was too slow. We're loading this humongous amount of data into it, and it didn't give us good results. Well, did you really do any pre-processing of your data?

 

Did you make sure the data was structured enough to give efficient, or good enough, learning out of that data? If your data is garbage, well, your model is going to be producing garbage. And that's the discussion people don't really dare going into, because it takes time to structure your data and tag it well. Sometimes we also see a lack of understanding of what's actually required: we spin up this training environment, and nobody actually knew that it would require faster networks, faster disks, pre-processing of data. It sounds fancy, and we want to do this AI thing, but in the end we need all of the good practices from software development and good practices from networking. We need old-school knowledge of networking: how we can minimize the hops in our network, how we can minimize the ingress and egress of traffic between different zones, and what it costs when we start shipping data across different zones in the system. So many things go into the supporting infrastructure, and if you just ramp it up, well, then it's going to be super expensive. So you need to figure out how to actually approach this: when is it good enough, and when is it way too much.
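
 

To give a feel for that data-movement part, a toy estimate of cross-zone transfer cost for a training dataset; the per-GB rate and sizes are invented, and real egress rates vary by provider and path:

```python
# Toy estimate of transfer cost when training data crosses zones.
# Dataset size, number of shipments, and the per-GB rate are invented;
# real egress rates vary by provider and by path.

dataset_gb = 5_000    # training data shipped to the GPU cluster
shipments = 3         # times the data crosses a zone boundary
egress_per_gb = 0.02  # $/GB cross-zone rate (illustrative)

moved_gb = dataset_gb * shipments
print(f"Cross-zone transfer: {moved_gb:,} GB -> ${moved_gb * egress_per_gb:,.2f}")
print("Kept in one zone:    0 GB -> $0.00")
```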

 

[Pinja] (22:01 - 22:10)

So: slow processing, lack of pre-processing altogether, and, I guess to sum it up, a lack of knowledge about what is required.

 

[Stefan] (22:11 - 22:43)

Yeah, lack of knowledge is never a fun topic to go into, because some people feel uncomfortable saying, I don't know enough about this, or, I've seen this video where they talk about slow networks, and it sounds a bit like our network, to be honest. You need an open culture where you can actually say these things and be sure that people actually listen to you when you say them. It might be how you deliver the message, but it can also be the receiver saying, yeah, yeah, yeah, you know nothing about this, this is for data science people, you're just a network guy.

 

If your culture is like that, oh dear God.

 

[Pinja] (22:43 - 22:46)

We can tie it to psychological safety once again.

 

[Stefan] (22:46 - 22:47)

Yes.

 

[Pinja] (22:47 - 22:48)

I love it when we get to that.

 

[Stefan] (22:48 - 22:55)

There's always, always, always psychological safety. If you don't have that, you are in for a treat. Maybe not a positive treat.

 

[Pinja] (22:56 - 23:11)

Oh, exactly. So, okay, underutilized GPUs. So one thing an organization can do is to bring Kubernetes to the rescue.

 

Well, not alone, but with DRA, Dynamic Resource Allocation, specifically.

 

[Stefan] (23:11 - 24:12)

Yeah, it had quite a long beta period, but it actually went golden last summer. I think it was the end of August. It gives you the option of claiming and provisioning devices like GPUs and other things.

 

It's a pretty open spec, so you can put whatever you want into it. But especially for AI, it makes sense when we talk about things like, how can you distribute GPUs? Can you get like half a GPU all of a sudden?

 

Well, you can. Surprise, surprise. Which is good, because then we can start utilizing GPUs more broadly in our Kubernetes clusters.

 

And of course, we need something to run it. We might add Kueue to run these, well, people run batch jobs. But if you add something like Kueue, it's going to take care of more details for you, so you don't have to build everything on your own.

 

And of course, you only run the jobs when capacity is available. Surprise, surprise. It would take ages to run some of these workloads if you don't have the right resources available.

 

But again, it came out last summer. It did go through a beta period, and then it got golden. But as soon as you announce something as golden, there's still going to be a long period before it's adopted.
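
 

The queueing idea itself is easy to illustrate. This is not DRA or Kueue code, just a toy simulation of the principle they enable: admitting jobs only when GPU capacity is actually free, instead of holding static reservations. Job sizes and durations are invented:

```python
# Toy simulation of queued GPU scheduling: jobs wait in FIFO order until
# enough GPUs are free, instead of each team holding a static reservation.
# This is not DRA or Kueue code; job sizes and durations are invented.
import heapq

TOTAL_GPUS = 8
jobs = [  # (arrival_hour, gpus_needed, duration_hours), sorted by arrival
    (0, 4, 3), (0, 2, 5), (1, 4, 2), (2, 6, 1), (3, 2, 4),
]

free = TOTAL_GPUS
running = []  # min-heap of (finish_time, gpus_held)
busy_gpu_hours = 0.0
clock = 0

for arrival, need, duration in jobs:
    clock = max(clock, arrival)
    while True:
        # Release any jobs that have finished by now.
        while running and running[0][0] <= clock:
            free += heapq.heappop(running)[1]
        if need <= free:
            break
        clock = running[0][0]  # wait until the next job completes
    free -= need
    heapq.heappush(running, (clock + duration, need))
    busy_gpu_hours += need * duration

makespan = max(finish for finish, _ in running)
print(f"Makespan: {makespan} h, "
      f"cluster utilization: {busy_gpu_hours / (TOTAL_GPUS * makespan):.0%}")
```

The point is simply that queueing work against a shared pool keeps the same eight GPUs busier than five separate static team reservations would.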

 

[Pinja] (24:13 - 24:39)

Yeah. As in every technology adoption curve, we have the early adopters first, and we're still waiting for greater maturity here.

 

But since it's been available since Kubernetes 1.34, as I said, the end of last summer, we know that adopting it means a switch from a working setup to a future-proof setup. That's something that has been seen in so many cases in the Kubernetes sphere, like, for example, switching to Cluster API.

 

[Stefan] (24:39 - 25:35)

I saw a great talk by Bloomberg in the US. They had built their own thing because Cluster API wasn't ready, which meant when Cluster API came out, there were some things they just couldn't port one-to-one to it. So they had to spend time slowly adopting the Cluster API.

 

At some point, they can flick the switch and run with Cluster API. But it's not sexy. Saying, oh, by the way, we need to future-proof our platform, doesn't give you anything in itself.

 

It is costly to do that. And if you have a lot of people asking for features and you're a feature shop, well, then you build features. Then you aren't necessarily future-proof.

 

But if you do good architecture, you will know, like, you have this beautiful grid in the beginning. Then you start modeling your features. Then it turns into crumpled pieces of paper in one big pile.

 

And sometimes you need to stretch the paper and come back to this beautiful grid again. It's like: you add features, you refine. You add features, you refine.

 

That's just the circle of life of software and hardware and infrastructure.

 

[Pinja] (25:36 - 25:52)

Of course, there are limitations to this, right? Yes. So this doesn't work very well with auto-provisioning, if at all, which is not exactly optimal.

 

And it cannot leverage time-shared GPUs, which would otherwise be cheaper.

 

[Stefan] (25:52 - 26:08)

Yeah. It's a bit sad because most of the cloud providers give some sort of auto-scaling clusters and so on. But it doesn't work well with those.

 

And like you said, time-shared GPUs, I would love to use those because I'm not paying for a full GPU all of a sudden. Well, tough luck. DRA is not a good friend of that.

 

[Pinja] (26:09 - 26:09)

No.

 

[Stefan] (26:09 - 26:18)

But that's how it is. It might be in one, two years. Who knows?

 

Things change over time as we see it in production and bigger corporations come in and say, we can fix this.

 

[Pinja] (26:18 - 26:32)

But there is always an alternative, obviously. One of them would be to go serverless. It is possible to do that with the big cloud providers, but this option is not very good at the moment either.

 

[Stefan] (26:32 - 27:36)

Yeah. It hurts me every now and then when I see a cloud provider announcing serverless X, Y, Z, where you need to buy into their model and their way of doing serverless. I'm leaning heavily into the cloud native space.

 

So I want everything to be in a Kumbaya setting where we all sit down in a circle and can swap workloads X, Y, and Z. Well, if you want to be efficient with GPU serverless at the big providers, you need to go into their model. And then you have the big discussion of lock-in.

 

Because if you want to be efficient with serverless in the different clouds, you will lean into their model. You can do some serverless things in Kubernetes, but most people don't do that if you want the full capacity or the full efficiency or full utilization of what you're paying for. So it's a good old discussion of how far do we dare lean into these cloud providers?

 

If you have a cloud exit strategy, you need to lean into that and say, all right, do we have the budget left in our exit program to actually lean into serverless here, or do we need to do something else? And do we even value an exit program from cloud providers at all? It is a costly decision to take.

 

[Pinja] (27:36 - 27:52)

It is. And earlier in our discussion today, we were discussing whether running AI workloads is any different from running a web application. And to be fair, it's not that different.

 

It's just a very big and complicated web application and it needs somewhere to run.

 

[Stefan] (27:52 - 27:56)

Yeah. It's just yet another workload. But in reality, it's a bit different.

 

[Pinja] (27:56 - 27:57)

It is.

 

[Stefan] (27:57 - 30:32)

It just comes with a bigger package of data and bigger requirements on the hardware we supply. But take something like agents: not everyone is leveraging their agents 24/7. If they did, we would probably see more of a balanced load on them.

 

But right now we see agents spinning up to like full whack. They run it like 90, 95%. Then they spin down to 5%.

 

It's hard to get a good pricing on something that fluctuates like this. Just talking to the cloud providers, figuring out how you can reserve instances or buy compute or whatever model they're running. That's pretty damn hard to do if you have something that fluctuates all the time.

 

Then you go back to more of a spend-based model. You're not getting the full efficiency out of your money, but it might be better. Over time, when we see all of these AI features, or application features, whatever we call them, there are going to be some commonalities, and we'll see more balanced pricing on it.
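
 

A toy comparison of those two models for a spiky agent workload; reserving for the peak versus paying per use, with an invented load pattern and prices:

```python
# Toy comparison: reserving GPUs for the peak vs paying per use, for a
# spiky agent workload. Load pattern and prices are invented.

# Hourly GPU demand over 24 hours: bursts near full, then almost idle.
demand = [8, 8, 7, 1, 1, 0, 0, 1, 8, 8, 8, 2] * 2

ON_DEMAND = 4.00  # $/GPU-hour, pay only for what you use
RESERVED = 2.50   # $/GPU-hour, but billed for the whole reservation

peak = max(demand)
on_demand_cost = sum(demand) * ON_DEMAND
reserved_cost = peak * len(demand) * RESERVED  # must reserve for the peak

print(f"GPU-hours actually used: {sum(demand)}")
print(f"On-demand:               ${on_demand_cost:,.2f}")
print(f"Reserved @ peak ({peak}):    ${reserved_cost:,.2f}")
```

With this made-up pattern, pay-per-use wins despite the higher unit price, because a peak-sized reservation would sit idle most of the day.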

 

We can lean a bit back and say, all right, now we're at a level where it sort of resembles the good old cloud FinOps. The sad thing is, in software development, we don't like tight coupling. Right now we have a very tight coupling between software and hardware.

 

That's how it is for now. I know there are several frameworks where you put sort of an abstraction layer in between, but the abstraction layers today are still quite tied to their underlying hardware. You cannot just swap out an NVIDIA GPU for something else.

 

It would be lovely if we could. In the end, it might not even be the hardware or the cloud pricing that is the issue here. It might be more like how we change the culture internally.

 

If I sit with Copilot on a daily basis, I need to be aware of how efficient the different models I'm using are. Like I said, should I use an older model? Should I cherry-pick the model?

 

Should I use auto? That's a good question. We don't know yet.

 

We need to build this culture of accepting AI, but also being, I wouldn't say cautious, but more deliberate about how we're using AI. Because we see a lot of providers coming out with tools that are AI-enabled; in reality, it means you're paying for a model running somewhere.

 

You might have to have an AWS subscription where you tie that into Bedrock. Well, now you're paying for Bedrock on top of this license you just paid for. How big do you dare to go?

 

Do you want to host them on premise? Well, now you're in for a treat, because it's not easy to run an AI engine as efficiently on premise, because you don't have the same hardware. You don't have the same amount of memory.

 

So we need to balance these cultures and figure out how we actually want to do this in the end. So there's even more work for Pinja. This comes back to product management and people.

 

You're going to be busy for many years.

 

[Pinja] (30:32 - 30:44)

Yeah, a platform is a product. And now we're talking about your AI workloads as a product. And on that note, that's something for us to discuss in the future.

 

I think that's all the time we have for this topic today. Stefan, thank you so much for joining me today.

 

[Stefan] (30:45 - 30:45)

Thank you.

 

[Pinja] (30:46 - 30:56)

And thank you, everybody else, for listening in. And we'll see you next time. We'll now tell you a little bit about who we are.

 

[Stefan] (30:56 - 31:02)

I'm Stefan Poulsen. I work as a solution architect with a focus on DevOps, platform engineering, and AI.

 

[Pinja] (31:02 - 31:07)

I'm Pinja Kujala. I specialize in agile and portfolio management topics at Eficode.

 

[Stefan] (31:07 - 31:09)

Thanks for tuning in. We'll catch you next time.

 

[Pinja] (31:09 - 31:17)

And remember, if you like what you hear, please like, rate, and subscribe on your favorite podcast platform. It means the world to us.

 

 
