
Platform Engineering on the Edge

Darren and Pinja break down the basics of platform engineering on the edge, from the “boring” to the truly remote and unreachable. They explore how GitOps, Kubernetes, and infrastructure as code enable reliable software updates in spotty or disconnected environments—whether in cars, factories, ships, or even Mars rovers—and why getting edge engineering right matters now more than ever.

[Pinja] (0:02 - 0:11)

Git is supposed to be the one source of truth, but it comes into an even more crucial role when we're talking about platform engineering on the edge.

 

[Darren] (0:14 - 0:22)

Welcome to the DevOps Sauna, the podcast where we deep dive into the world of DevOps, platform engineering, security, and more as we explore the future of development.

 

[Pinja] (0:22 - 0:32)

Join us as we dive into the heart of DevOps, one story at a time. Whether you're a seasoned practitioner or only starting your DevOps journey, we're happy to welcome you into the DevOps Sauna.

 

[Darren] (0:38 - 0:44)

Welcome back to the DevOps Sauna. I am once again joined by my co-host Pinja.

 

[Pinja] (0:44 - 0:46)

Hello, how are you doing?

 

[Darren] (0:46 - 1:09)

I'm doing pretty good, though I think we kind of did something a bit out of order. Not that long ago, we talked to our former colleague Sofus on this podcast about Kubernetes on the edge, and then I realized that there are very few people who are actually doing things with edge systems. Maybe we should take a step back and look at what they actually are.

 

[Pinja] (1:09 - 1:36)

Yeah, we jumped the gun a little bit there. One of the topics that we covered with Sofus was edge computing, but if we take a step back and actually look at platform engineering on the edge, what does it mean? When we were preparing this episode with Darren, he coined a pair of terms: the boring edge and the non-boring, that is, the interesting edge.

 

So Darren, could you enlighten us about what we're talking about here?

 

[Darren] (1:36 - 3:05)

Yeah, so we have to take it a little bit further back, to this idea of the cloud transition: between roughly 2010 and 2015, the cloud transition basically meant that everyone was moving everything to the cloud. And what this actually meant was that they went from infrastructure that was nearby, in data centers or in server rooms, to cloud systems. And if we look at AWS, these are centralized in places like Stockholm and Dublin.

 

I think there are some German ones too, but basically what happened is that people's data got moved quite far away from them, which created the need for this idea of edge computing: computing closer to where you are. There were other things implemented as well, like CDNs, content distribution networks, but then this idea of having computers next to you as an addition to your cloud became a thing. And as Pinja said, I was talking about the interesting edge and the boring edge. These are my terms, and they're just to describe that, to me, the boring edge is anything that's easily reachable.

 

So the interesting edge is not co-location, it's not local data centers, it's not a machine that's sitting under your desk, it's not a testing device, although testing devices can be kind of interesting in other ways. I'm thinking about extreme cases where you can't reach the system often, or perhaps at all.

 

[Pinja] (3:05 - 3:50)

And so we're talking about cases where the connection is really spotty. We will get more into the details of what these cases are like, where this can be used, and where it is actually used nowadays. But just to be clear, we're talking about really hard-to-reach places where, as you say, Darren, there might be no connection at all.

 

But I think we need to understand a couple of principles about GitOps before we go more into how Kubernetes plays into this. So if we think of a system's capability needs, what does every system need to be able to do? We have the pull.

 

If we start from that very beginning: pulling, storing, applying, and rolling back. So how do we do that when we have a hard-to-reach place? How do we reach it to make that happen?

 

[Darren] (3:50 - 5:28)

The thing about these edge cases is that they kind of turn normal distribution on its head, because distribution used to be push-based: you had a centralized location, you pushed updates to it from your pipelines, and then you pushed those out to everywhere that needed the update. But that's not actually how things have worked on the client side. The client side usually pulls updates, and all GitOps does is apply that logic to server capacity or industrial applications.

 

So it's built around the idea that a person of authority in the location needs to be able to make a decision on when to update. So as you say, you need to pull, you need to store the software until it's applied, you then need to be able to apply the software, and if something goes wrong, you need to be able to roll back. And this is what we want from GitOps.

 

So how we do this is with the concept of infrastructure as code, but we do it locally. You have an infrastructure as code file, some kind of software manifest that says, I need these components installed at these versions, and then you run through that manifest, pull all of those images, and store them locally in a local image repository. Normally you'd use something like Docker Hub or one of the public repositories, but here you'd use a locally installed repository, and then you can just spin up software as images.
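
As a rough illustration of that idea, here is a minimal sketch of such a manifest, assuming a Kubernetes-style deployment; the registry host registry.local:5000 and the sensor-agent image are hypothetical placeholders.

```yaml
# Sketch of a locally applied software manifest (hypothetical names).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sensor-agent
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sensor-agent
  template:
    metadata:
      labels:
        app: sensor-agent
    spec:
      containers:
        - name: sensor-agent
          # Pulled from the local image repository rather than a public
          # one, so the workload can be recreated without connectivity.
          image: registry.local:5000/sensor-agent:1.4.2
```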

 

So you have a kind of backbone system which allows you to create these instances and use the self-healing mechanisms built into a lot of container management systems.

 

[Pinja] (5:29 - 5:55)

Yeah, and we mentioned Kubernetes already; previously we talked about Kubernetes with Sofus. So now we have these edge systems, which we might also call multi-node edge systems.

 

So Kubernetes might get some very interesting or surprising use cases from this. And if we think of MicroK8s and K3s and their limitations at the moment, what are these surprising cases that we're facing now with Kubernetes?

 

[Darren] (5:56 - 7:42)

I mean, we talked a lot more about this topic with Sofus, so if this is of interest to you, you should listen to that episode. But when I was actually working in this industry, the first thing I did was look into whether Docker could be used, because the idea of images is great, but Docker really struggles to scale.

 

And that's actually why we have an interesting use case for Kubernetes. Kubernetes was described by Google as planet scale. And Kubernetes is actually based on Google's internal container management system, I think it's called Borg, which is used to run Google.

 

So as you can imagine, the scales required for these systems are enormous. So the idea of putting them onto these tiny systems in these edge cases was kind of surprising, but the concepts actually work really well because in Kubernetes, you have these nodes that spin up and they are self-healing. They have a use case.

 

And if they are failing at their use case or having some kind of problem, then they can be destroyed automatically and reprovisioned automatically. So it actually turned out to be a no-brainer. And we talk about MicroK8s and K3s, which are minimal Kubernetes distributions, and they take up almost no space at all.

 

MicroK8s was 300 megabytes when I checked; K3s is 40 megabytes. One of the big constraints you run into in IoT is space, because storage space has always been at a premium.

 

So having distributions of Kubernetes like this, which allow for a minimal installation, was kind of game-changing.
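
To make the self-healing behavior described above concrete, here is a minimal sketch, assuming a standard Kubernetes liveness probe; the image name and health endpoint are hypothetical.

```yaml
# If /healthz stops answering, the kubelet kills and recreates the
# container on its own; no connection to an operator is needed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: edge-service
  template:
    metadata:
      labels:
        app: edge-service
    spec:
      containers:
        - name: edge-service
          image: registry.local:5000/edge-service:2.0.1
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
```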

 

[Pinja] (7:42 - 7:53)

Well, we know that space has always been a challenge when we're talking about edge-based systems, but are we looking at a new plateau here? Because we're now looking at smaller, more reliable drives as well.

 

[Darren] (7:53 - 8:45)

Yeah. When we look at edge-based systems now versus 10 years ago, you can get some reasonably good models with decently sized hard drives in them, and also with things like graphics cards.

 

And we're not talking about commercial graphics cards here; you'll be able to put things like GPUs in there for our favorite topic, AI applications. And by this I don't actually mean large language models.

 

I mean actual AI applications, things like image processing, which is a really powerful use case. So it's, you know, what was it?

 

Moore's law, which is the capacity of everything doubling every however long. I don't remember the exact wording, but basically as computing becomes more available and storage becomes more available, we can do cooler stuff in smaller environments.

 

[Pinja] (8:46 - 9:13)

And when we're talking about this, we've already mentioned that we're not just talking about the box under your table, but about some really unreachable places with spotty connections; you might not have a connection at all. But if we think of one good example, what can be farther away from us than space? Wasn't the first thing the Curiosity rover did when it landed on Mars in 2012 a software update?

 

[Darren] (9:13 - 9:29)

Yep. I've heard that. I don't know whether it's true.

 

It's something I've never investigated because it's one of those things that right now, I believe it to be true. And if it turns out to be true, I don't gain anything from that knowledge, but if it turns out to be false, I lose a cool anecdote.

 

[Pinja] (9:30 - 9:58)

So that's what I was thinking. This is an anecdote we can use as an icebreaker when we're talking about this, but it is in fact an example of a hard-to-reach place. And we can only imagine that the Curiosity rover does need to do some kind of updating anyway, but we are talking about distribution through satellites.

 

They have a very limited connection. And Darren, you had heard that the transmission time was something like 17 minutes.

 

[Darren] (9:58 - 11:00)

Yeah. Because it's done through radio communication, the travel time is 17 minutes in one direction, I believe. And obviously, because the orbits of Earth and Mars don't line up, it varies, I think between about 12 and 24 minutes, but I'm not a rocket scientist.
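
As a quick, hedged sanity check on those numbers: one-way light time is just the Earth-Mars distance divided by the speed of light, and that distance varies between roughly 55 million and 400 million kilometers, so:

```latex
t_{\min} \approx \frac{5.5 \times 10^{10}\,\mathrm{m}}{3.0 \times 10^{8}\,\mathrm{m/s}} \approx 183\,\mathrm{s} \approx 3\ \text{minutes}, \qquad
t_{\max} \approx \frac{4.0 \times 10^{11}\,\mathrm{m}}{3.0 \times 10^{8}\,\mathrm{m/s}} \approx 1330\,\mathrm{s} \approx 22\ \text{minutes}
```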

 

So I'm not the best source of information; most of my information about Mars comes from the book The Martian, so we'll see how reliable that is. But it is a great example of this need for platform engineering on the edge, because presumably that's a multi-node system too.

 

You have radio communication on it as a critical system, because without that, it loses communication. You'll have imaging systems so it can take photos. It will have distance measurement, probably ultrasound as well as laser-based.

 

And all these discrete modules will be in different places controlled by a central control unit. I don't know whether it's running Kubernetes, but I could certainly make a use case for it to be.

 

[Pinja] (11:00 - 11:31)

That's for sure. But even if we weren't talking about space, there are hard-to-reach places on Earth as well. With Sofus we already talked about vessels far out at sea, but we can also think about, let's say, a research station. One use case that came to my mind, for example: we know that Antarctica is filled with research stations, and in my mind it also falls into this category of hard-to-reach places.

 

But also, if we think of cars, isn't that also a thing here?

 

[Darren] (11:31 - 12:46)

Yep, cars. We don't need to go as far as Antarctica, though, even if I think it would be an entertaining trip.

 

If we think about Finland, Finland has thousands of lakes and a huge coastline, along which you can find these navigational beacons; they're just running software, using software-defined radios to send radio signals. So having distribution to these systems, where they are not easily reached, is again an important factor.

 

But as you say, cars. Cars are about as close to us as we can get without crossing over into the boring edge, where it's co-location. Think of autonomous vehicles, or even modern vehicles with their measurement systems.

 

So basically, the edge is everywhere around us, and these connection questions need to be considered. If you think about it, probably the biggest use case is industrial applications. If you have a factory that's churning out basically anything, then having measurement as standard is required, and having autonomous actions is preferred.

 

And to distribute your software over these autonomous actions easily requires some kind of management system, and Kubernetes can handle it.

 

[Pinja] (12:46 - 13:38)

Yeah. So as you say, this is a very good thing for me personally to understand, and I'm being very deliberate when I use the term edge cases here: the edge cases also apply to, for example, my car. I've been thinking about how it gets its software updates, because now and again it tells me: please sit down and don't move your vehicle until we're ready to update your system.

 

But if we think a little bit about how this works: you mentioned infrastructure as code and local image repositories before, and whenever we work on software, we work with code, and we know that Git is the one source of truth, the state where everything should be. But what about when things are actually pulled remotely? How do we go about that?

 

[Darren] (13:38 - 15:09)

The idea of software on the edge, or platform engineering on the edge, is that it elevates Git. We know Git to be the source of truth for code, because that's where your master branch is, and when you have built that branch, you push that code to a repository, or, if you're using containers, which I think a lot of people are, you build the code and push it to some kind of image repository. But the infrastructure as code requirement means that this also gets fed back into Git.

 

So Git is elevated from being just your development tool to being your fleet control method, because you want the edge systems to pull a specific manifest. You end up with infrastructure as code and a manifest being pulled, and then you chain this through the image repository you have remotely, so you end up pulling the images from the remote system to the local system. But it all revolves around Git, and this is why GitOps, and having a really good Git branching structure, is so important for any kind of edge-based installation: Git is no longer just a place where developers can dump anything.
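
As one possible concrete sketch of Git as the fleet control point, here is what an Argo CD-style Application could look like; Flux would work similarly, and the repository URL, path, and names are hypothetical.

```yaml
# A hypothetical GitOps agent definition: the edge cluster pulls the
# manifests pinned by this Git revision and converges on them.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: edge-fleet
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/fleet/manifests.git
    targetRevision: v1.4.2   # the fleet runs exactly what this tag says
    path: sites/factory-01
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # undo manual drift back toward the Git state
```

A setup like this only works, of course, if the repository itself is kept disciplined.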

 

And we've all seen these messy, messy structures that don't really have a solid branching flow or a solid workflow: you have commits that are just someone's name and a date, and then these get pulled together.

 

[Pinja] (15:10 - 15:33)

And in the worst case, how many times have I seen branches split off from main that have been open for God knows how long, with so much stuff in them? So as mentioned before, Git is supposed to be the one source of truth, but it takes on an even more crucial role when we're talking about platform engineering on the edge.

 

[Darren] (15:34 - 16:19)

Because this is the thing: if you have a spotty-connection case, and there's an issue on that case because of something that happened in Git, like having the wrong version, the edge system may not have access to newer versions. And that's also why rolling back is important. The idea is to have fully autonomous decision making, so that a person like a factory manager, or a ship captain, or whoever at NASA decides to push software updates, can say: yes, now we're going for a software update.

 

And this all pulls from local repositories, because we can't rely on the connection. So you can actually minimize a lot of the downtime caused by poorly timed updates by making sure that the decisions are made at sensible times.
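
In the sketch above, rolling back would, in the simplest hedged reading, just mean moving the pinned revision in Git back to the last known-good tag; every edge agent that next syncs converges on it, at whatever time the local operator approves.

```yaml
# Hypothetical rollback: only the pinned revision changes in Git.
source:
  repoURL: https://git.example.com/fleet/manifests.git
  targetRevision: v1.4.1   # was v1.4.2; reverting restores the old state
  path: sites/factory-01
```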

 

[Pinja] (16:20 - 16:24)

Can there be a case of somebody actually being local and liking it anyway?

 

[Darren] (16:24 - 16:29)

Yeah, if someone likes having downtime, I guess that's their choice.

 

[Pinja] (16:29 - 16:39)

We tend to like to avoid downtime, at least if we're talking about the standards of modern software development. Yes, we do have that.

 

[Darren] (16:39 - 17:10)

Because, yeah, if we think about the actual reason to use edge systems: the movement to cloud added a little bit of latency and a little bit of reliance on external systems. So things that are coming to the edge are coming to a position where they need a rapid connection and where they need the tool to be operational independently of anything. It's all about maintaining autonomy and giving the power to update to the people who need to make that decision.

 

[Pinja] (17:10 - 17:36)

It is not all fine and dandy and easy when we're talking about platform engineering on the edge, I guess. So there are some dangers and pitfalls. We could talk about a whole episode's worth of dangers and pitfalls when it comes to platform engineering on the edge, what to avoid and how to do it better.

 

But if we summarize it in a couple of points, Darren, what are the main things to look for, and look out for?

 

[Darren] (17:36 - 18:58)

I think the first thing is that Kubernetes is complicated. And a lot of Kubernetes installations are insecure by default, which is something we obviously want to avoid. You can actually use something like Talos to avoid that.

 

And that's something we talked about with Sofus, so we won't go into it here. But the complication of Kubernetes is there. Also, threat modeling from a security standpoint.

 

Threat modeling is rarely performed, mostly because when we're talking about edge cases, we're also talking about industrial IoT. And I think everyone knows that IoT security is lagging behind. There's this great event you can attend, online actually, every year, called the Night Watch.

 

They host it in Amsterdam, and it's a security event for this kind of industrial IoT, or operational technology, OT. The general undercurrent from those events is that there are massive security flaws in huge parts of critical infrastructure and manufacturing that go unfixed, because fixing them would be complicated, costly, and problematic. So everything we learn from that event, we can also apply to industrial IoT.

 

I'm sure there are some people out there who are great at it and doing it really well. I know there's a lot of people out there who are ignoring it.

 

[Pinja] (18:58 - 19:12)

And I guess if we just raise the question: why does this matter? We already talked about struggling with the very basics of GitOps, but what are the other options for doing this? Why are we talking about platform engineering on the edge?

 

[Darren] (19:13 - 20:10)

I think we're talking about it because the other options are bad. You can treat everything like an individual system, at which point you lose any orchestration capacity. Or you can use some kind of orchestration tool locally, like having Ansible update the systems, which is fine, but it doesn't have the self-healing that you get with Kubernetes.

 

That's just installation and configuration. You could build some cool scripts with Ansible to pull everything and make sure it's running properly, but Kubernetes can do that out of the box and heal the nodes that have gone down or are not functioning. And if you're just working on one system, then yeah, you can use whatever you want.
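
For contrast, here is a minimal hedged sketch of that Ansible-style approach; the host group, package, and service names are hypothetical. It installs and restarts things when you run it, but nothing keeps watching the service in between runs.

```yaml
# Hypothetical push-style playbook: apply once, then walk away.
# If the service crashes between runs, it stays down until the next run.
- name: Update edge service on all factory nodes
  hosts: factory_nodes
  become: true
  tasks:
    - name: Pin the service package to the desired version
      ansible.builtin.apt:
        name: edge-service=1.4.2
        state: present
    - name: Make sure the service is enabled and restarted
      ansible.builtin.systemd:
        name: edge-service
        state: restarted
        enabled: true
```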

 

But as soon as you switch to a multi-node model, then I think Kubernetes just starts to pull ahead of all the competitors.

 

[Pinja] (20:11 - 20:22)

I personally know that Kubernetes is not simple and it does also require investment, right? But I guess this goes for everything when we're talking about platform engineering anyway.

 

[Darren] (20:23 - 20:49)

Yeah, it really does. The investment might be a bit higher than with some of the more basic solutions; Ansible doesn't have the getting-started overhead that Kubernetes does.

 

Having a single-node system and using Docker Compose instead of Kubernetes will simplify things. But personally, as someone who tried that approach before and has now switched over to Kubernetes, I think the investment is worth it.

 

[Pinja] (20:49 - 21:02)

So we know why this matters, but is there something very topical about getting your act together, everybody's act together, and doing platform engineering well on the edge? Why should this matter now?

 

[Darren] (21:02 - 21:28)

Oh, that's a simple question to answer. It's that the EU is coming for you. I mean, that sounds a lot more threatening than it actually is.

 

And this is actually something we're going to talk about in one of the upcoming episodes, where we'll cover platform engineering on the edge and its security. So NIS2 happened, the Cyber Resilience Act is upcoming, and industrial IoT is lagging behind. That's why it matters to start getting this in place now.

 

[Pinja] (21:28 - 21:42)

Okay. I think this is a very good spot for us to end our episode. We're ending with a threat: the EU is coming for you.

 

But this was our, let's call it, platform engineering on the edge 101: the basics to understand at the moment.

 

[Darren] (21:43 - 21:52)

Indeed it was. And we hope you join us for our next episode, where we're going to be looking into some more of the real-world cases of platform engineering on the edge. Thank you, Pinja, for joining me.

 

[Pinja] (21:52 - 21:53)

Thank you, Darren.

 

[Darren] (21:53 - 22:00)

We hope you join us next time. We'll now tell you a little bit about who we are.

 

[Pinja] (22:00 - 22:05)

I'm Pinja Kujala. I specialize in agile and portfolio management topics at Eficode.

 

[Darren] (22:05 - 22:08)

I'm Darren Richardson, security consultant at Eficode.

 

[Pinja] (22:08 - 22:10)

Thanks for tuning in. We'll catch you next time.

 

[Darren] (22:10 - 22:16)

And remember, if you like what you hear, please like, rate and subscribe on your favorite podcast platform. It means the world to us.
