Kubernetes has become the de facto standard for container orchestration. All of the cloud providers have their own solutions for Kubernetes. But what do you do when you're on-prem? We've invited Andrew Rynhard, CTO and founder at Sidero Labs, to talk about Talos Linux.

Andrew (00:05): What prevents these sorts of efforts from people and projects is that humans end up abusing the system. And so we've really made it so that you can't, because we're going to blow that away by design. And so you really have to force yourself to ask yourself, "How can I run this differently?" And it ends up being more production ready and less hacky, in my opinion.

Marc (00:27): Hello, and welcome to DevOps Sauna. Kubernetes has become the de facto standard for container orchestration. All of the cloud providers have their own solutions for Kubernetes, but what do you do when you're on-prem? Today, we've invited Andrew Rynhard from Sidero Labs to talk about Talos Linux. Welcome. Hello, and welcome to DevOps Sauna. My name is Marc, I'll be your host today. Today, we have a really fascinating program. We have Andrew Rynhard from Sidero Labs to talk about Talos Linux. Here I have my usual cohort, Andy Allred.

Andy (01:05): Hello.

Marc (01:05): And nice to see you, Andrew.

Andrew (01:08): Hello, thank you for having me.

Marc (01:09): It's really fantastic to have you on the program. So we have been excited about Talos Linux and have been looking at it with some of our customers. And it's great that we could have you on the podcast today. Let's start Andrew, would you like to tell us what is Talos and Talos Linux? What is the problem that you're trying to solve?

Andrew (01:30): Sure. It's really hard to summarize it because to me, it's so many things. As you could tell from the name of it, it is Linux. It is a Linux distribution, but it does so much more than that. What it is, is a Linux distribution that has been built for the purposes of running Kubernetes and there's a lot behind that. We've done a lot to make the operating system completely aware of Kubernetes, behave like Kubernetes, to deploy Kubernetes. So in some way, you can almost call Talos a Kubernetes distribution as well, some people do tend to call it that, but it is just upstream vanilla CNCF certified Kubernetes. So it's really hard to put it in a single basket because it does a lot.

Andy (02:18): When I was introduced to Talos, I was looking around for a Kubernetes distribution to run on prem. And of course, security and whatnot was important. And I heard about this Kubernetes distribution called Talos. And I started looking at it and thought, "Well, that's not really a Kubernetes distribution at all. It's an operating system, but it's not just an operating system, so what is this?" <laughs>

Andrew (02:46): Exactly, it's a little bit of all the above. It is certainly a paradigm shift of how you think about Linux. Another way that I like to put it is: you can almost imagine Kubernetes as a distributed kernel, if you will, a way to run user space across multiple machines. And Talos is the underlying CPU and RAM that supports this distributed kernel. Our whole vision with Talos is to allow you to not have to think about individual nodes, but rather see the cluster as a whole, as a giant machine. And so one of the ways that we put it to people is your machines are simply CPU and RAM, and Talos Linux, it's there, but it's also not there. And that's the whole thing about it. So yeah, it just really allows you to focus on Kubernetes. And there's a lot of reasons why we've done what we've done in the design of Talos to get to that goal of allowing you to focus on Kubernetes. There's a reason why we have a strong security footprint. It's because to really, truly forget about the node, it has to be secure, it has to do almost nothing. It has to go away and not be a problem on many different fronts. And so we go to great lengths to make that happen.

Andy (04:09): When I was bringing up this topic with Marc, he was asking, "So what is it?" and I came up with the analogy that it's a way to treat your bare metal servers and virtual machines like containers.

Andrew (04:23): That's actually a really, really great description because, I mean, we've even modeled how upgrades work in Talos after how upgrading a pod in Kubernetes works, in that you get a whole entirely new file system, a whole new container, and that is an upgrade. So even with our upgrades, since Talos runs completely in RAM, the way that works is we actually completely blow away everything underneath the running operating system, install a new version of Talos and then reboot into that version. So it's like you've got a brand new pod or a brand new container. So with all of these paradigms, you're very much right, Andy: we want you to look at the machine and think of all the things that we love about running our applications in Kubernetes. We want to bring that down to your actual infrastructure.

Andy (05:11): So you talked about the CPU and the RAM and running everything from memory. Do you use a disk for anything? How much do you persist on disks?

Andrew (05:19): Yeah, maybe I'll step back a little bit. Let me just expand maybe a little bit on Talos and that might set the stage a little bit for things. So Talos at the end of the day is a Linux kernel that we maintain, and in it a RAMFS that has a SquashFS embedded into it. And we chose SquashFS because of the read-only nature of it. Talos, for the most part, is read-only, but in practice, you can't have an entirely read-only operating system; things need to write somewhere. As the most basic example, /tmp is assumed to always be writable. The Filesystem Hierarchy Standard is actually an official thing. And there's a whole directory tree that you need to have to be compliant with it. But we've made decisions to have some of those paths be completely backed by tmpfs. Let's just take /etc, for example. /etc/resolv.conf, I think, is a really, really great example because this file needs to change. It's the traditional place that applications and Linux itself go to look for networking configuration. If your whole entire operating system is read-only, how do you actually make this thing dynamic based off of DHCP? So what we do is we have one directory which is persistent, which is /var. This directory is backed by a disk.

(06:47): Anywhere else that is writable within Talos is completely ephemeral. It's a tmpfs. So what about places, like I'm saying for /etc/resolv.conf, that need to change based on DHCP-supplied options? What we do is we actually write files into /var, and then we bind mount them over the file in the read-only file system. So it's a little bit different than how you've seen CoreOS do it, where they've decided to just make 70% of the file system read-only, but then /etc is completely writable. We've decided to go even 10 steps further and make as much as we can read-only, so that only what is required can be changed, using these bind mounts. And even then, we don't even allow people to change those files because Talos is purely API driven and configuration driven. You have to submit networking settings in our required format to the API, and we'll manage /etc/resolv.conf for you. Very long-winded answer to what is writable, but I think that goes to show that we really do try to make this thing as read-only as possible, and /var is the only thing that is writable. And not just for the purposes that I've already outlined: Kubernetes needs a place to put things, etcd needs a place to put things, containerd too. If you're ever using Rook, /var/lib/rook is a directory that gets stashed with all kinds of things.

(08:14): But the unique thing about this as well is we don't want people to be dependent on this persistent storage. So what we do is on upgrades, like I was saying, since Talos runs completely in memory, we actually blow away everything on the disk and reinstall as if it was a fresh machine. So the label on the partition that's mounted at /var is called ephemeral. That's what we label it as. And it's for a good reason. It's because we don't want you to become dependent on that. If you need something that persists even across upgrades, use Rook: add some disks to your machines and use Rook. So it's really pushing people to, like I said, become less dependent on the node. We want that thing to almost be like it's not there. And typically, what prevents these sorts of efforts from people and projects is that humans end up abusing the system. And so we've really made it so that you can't, because we're going to blow that away by design. And so you really have to force yourself to ask yourself, "How can I run this differently?" And it ends up being more production ready and less hacky, in my opinion.
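
The storage layout Andrew describes here (a read-only root, tmpfs for scratch space, a disk-backed /var that is wiped on upgrade, and bind mounts from /var over read-only paths) can be sketched as a toy model. This is only an illustration of the idea, not Talos code: the `Node` class, its method names, and the `/var/run-config/resolv.conf` path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Toy model of the layout described above (hypothetical, not Talos code)."""

    # Read-only image contents: never modified at runtime.
    rootfs: Dict[str, str] = field(
        default_factory=lambda: {"/etc/resolv.conf": "# image default\n"}
    )
    # tmpfs-backed paths such as /tmp: lost on every reboot.
    tmpfs: Dict[str, str] = field(default_factory=dict)
    # Disk-backed /var: survives reboots, wiped on upgrade.
    var: Dict[str, str] = field(default_factory=dict)
    # Bind mounts: read-only target path -> source file under /var.
    bind_mounts: Dict[str, str] = field(default_factory=dict)

    def write(self, path: str, data: str) -> None:
        if path.startswith("/var/"):
            self.var[path] = data
        elif path.startswith("/tmp/"):
            self.tmpfs[path] = data
        else:
            raise PermissionError(f"{path} is on the read-only root file system")

    def bind_mount(self, source: str, target: str) -> None:
        if source not in self.var:
            raise FileNotFoundError(source)
        self.bind_mounts[target] = source

    def read(self, path: str) -> str:
        # A bind mount shadows whatever the read-only image has at that path.
        if path in self.bind_mounts:
            return self.var[self.bind_mounts[path]]
        for layer in (self.var, self.tmpfs, self.rootfs):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def reboot(self) -> None:
        self.tmpfs.clear()

    def upgrade(self) -> None:
        # Upgrades reinstall as if the machine were fresh: /var does not survive.
        self.var.clear()
        self.bind_mounts.clear()
        self.reboot()


if __name__ == "__main__":
    node = Node()
    node.write("/var/run-config/resolv.conf", "nameserver 10.0.0.2\n")
    node.bind_mount("/var/run-config/resolv.conf", "/etc/resolv.conf")
    print(node.read("/etc/resolv.conf"), end="")
    node.upgrade()
    print(node.read("/etc/resolv.conf"), end="")
```

In this model a direct write to /etc fails, the bind-mounted file is visible until the next upgrade, and after `upgrade()` the node is back to the image default, which is the "don't depend on the node" behaviour being described.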

Andy (09:21): So you're dragging people kicking and screaming into cloud native way of thinking and that's pets versus cattle?

Andrew (09:30): Yeah, I hope that they're not kicking and screaming. I think at first they definitely kick and scream because Talos, it does have a learning curve. Give them a few days and give them a little bit of time with it. And then they walk away like, "Wow, this is actually how it should be done." So we're slowly converting people to our way of thinking and it's catching on.

Andy (09:53): Yeah, I say dragging people kicking and screaming because that's exactly how it was with me at first, as you'd said. At first, I was like, "No way this can't work. This is the dumbest thing ever." But based on the people who recommended it to me, I decided I need to look into this a little bit further. And the more I looked, I was like, "Well, actually, okay, this is actually a good idea, this actually works pretty well." And those limitations aren't really limitations, they're more like benefits.

Andrew (10:20): Yeah, they just force you to think about things a little bit differently. And I think we as an industry have gotten into a groove, which is good. Good things certainly come of it. And as they say, "You don't ever want to reinvent the wheel," but I do think the wheel could be refreshed every now and then with new designs of tread and whatnot. And that's what we're doing. It's a new paradigm and way of thinking about things. And there's a common misconception, and people always come to us asking, "Well, how would I do this? Or how would I do that?" And if you just look at Kubernetes as your package manager, you can get 98% of that done. Just run a DaemonSet if you need something to run on every node, or run it as a Deployment. There are so many things that you could do in Kubernetes. I'm not recommending you do everything in Kubernetes. I've done things in Kubernetes that you probably shouldn't, and I never will do again. But you could. You just think about things a little bit differently in Talos. And I think ultimately, you walk away from it learning something, and really appreciating everything that we've done to make you think that way.
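
As a rough illustration of the "Kubernetes as your package manager" idea mentioned here, the following Python sketch builds a minimal DaemonSet manifest, the resource that runs one copy of a pod on every node. The helper name `daemonset_manifest`, the `node-agent` name, and the `registry.example.com` image are placeholders, not anything from Talos or from the conversation.

```python
import json


def daemonset_manifest(name: str, image: str) -> dict:
    """Build a minimal apps/v1 DaemonSet that runs `image` on every node."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = daemonset_manifest(
        "node-agent", "registry.example.com/node-agent:1.0"
    )
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -` instead of installing the agent on each machine by hand.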

Andy (11:25): So how did this come about? And how did you end up developing this? Probably didn't just spring from your head fully formed one day in the shower. <laughs>

Andrew (11:36): No, it certainly didn't. So I was working at a place and I was tasked with moving our applications into Kubernetes. And so it was an opportunity for me to really deep dive into Kubernetes. And I'm the type of person that when I deep dive, I really deep dive. I got into being a contributor to kubeadm and joining what was called the SIGs back then, and really doing as much as I could to learn about Kubernetes, help develop Kubernetes, and just jumping headfirst in and really enjoying it. But I started very quickly seeing that there were a lot of things in Kubernetes that needed to be done that also had to be done at the operating system level. So user management, securing it, access controls, all of these things. It doubled the work: as an operations engineer, I had to do all these things in Kubernetes, but then I also had to do them in the operating system. And then I found that if one of my co-workers were to hop onto our box and change something, that actually trickles up into affecting how Kubernetes behaves. And so what I thought was, I needed a way to get humans off of these machines so that we can have a consistent substrate for Kubernetes to run on. I'm a very, very strong believer that in distributed systems, if you don't have consistency across your machines, you're just asking for trouble. There's just so many little caveats here and there that you're going to have to be aware of. Eventually, your clusters become these clusters that no one wants to touch.

(13:14): Because in the operations world, it's working, don't touch it! Don't even say it or the 'Operations Gods' are going to hear you and you're going to get paged at 2:00 a.m. And I wanted to do away with that. And I just found that, surprise, surprise, humans are typically the problem. <laughs> They're hopping on the machines, they're changing things outside of windows, they're changing things and not documenting it. And so everything that I really did was to stop humans from screwing things up, so: read-only file system. And not just from a resilience point of view as well, but security. The read-only nature and the ephemeral nature of Talos has a dual purpose. It stops this human problem that I talked about. And it also stops, I guess this is also a human problem, outside attackers: maybe not self-inflicted, but inflicted by someone else who is trying to be malicious and cause damage. And so what I started with was, let's just make this thing so minimal that there's really nothing a human can do. <laughs> And so I ripped out bash, I ripped out SSH to stop people from getting on the box. And I just had a kernel running and the kubelet. But then from an operations point of view, it became: okay, there's actually a problem right now, how do I fix it? And back then it was looking at console logs and trying to figure out what was going on. And there's just not enough information there. And so I was left with the decision.

(14:40): I could easily add SSH and bash and all the Unix utilities that I knew, but it took away from my very first goal of making it as minimal and out of the way as possible and not useful to humans. So I decided to put in an API instead. And so I decided to use gRPC because of the bi-directional communication that you can do. There's all kinds of fascinating things that you can do with gRPC. And so that's how the API was born. And then I started to see, "Wow, this thing, first of all, solved my goal of just running Kubernetes with as bare a minimum as possible and keeping humans off of the machine, but something really interesting came out of that. And that is the really strong security footprint that it has." And so there started to be more of an emphasis on how we actually run Kubernetes more securely. So when you run Talos, you don't get just vanilla Kubernetes. It is vanilla, but we also configure it to the CIS benchmarks and STIG guidelines. We go out of our way to secure Kubernetes for you, and to secure Linux, so much so that, following a project called the Kernel Self Protection Project, you can't even load kernel modules within Talos. You just can't do it. It's a completely static kernel, so everything that is needed with it needs to be built with it at the time of build.

(16:00): So it just throws away a whole layer of potential security issues that you face. And so I just found a bunch of places where we could do this. And ultimately, it just becomes a really strong security tool as well. It's really hard to say what Talos is because it's Linux, it's also a very strong security tool, it's Kubernetes and we've got Kubernetes knowledge built into it. So yeah, that's the origins of it. I just saw a problem that I was facing, and I was tired of worrying about it and decided to go on a very long journey in fixing it.

Andy (16:37): In our previous call, when we were chatting about this, you mentioned at one point that the OS has become a limitation, as you were describing the problem space. And I think I know what you meant, but can you explain a bit what limitation the OS has become? And why do you see it that way?

Andrew (16:56): Yeah, we have this project called the Common Operating System Interface (COSI). And its whole goal is to standardize how we work with Linux. And one of the things that I've always been frustrated with is just how fragmented the whole Linux ecosystem is. Between one distribution and another, there's enough changes that it's not a simple cutover from one distribution to another. So your package manager is completely different, decisions on the file system layout are completely different, your network manager is completely different. The only place that we've standardized is with systemd. So I just grew really frustrated with the fact that there was very, very little consistency in the Linux world. And in some ways that was kind of by design. When they first started thinking about what Unix is, they just wanted independent programs that can, you know, solve very specific problems. And you could pipe the output and input and all these things. But I think the number one problem with that has been that all of that communication between them has been unstructured. And so we're left with finding all kinds of crazy sed and awk and grep and trim commands to make it work. And you can have whole cultural wars about whether or not you should use sed or awk. And people find their favorite Unix utilities. So why not give these things a common way to communicate with each other? And so this limitation that I have found is that it's really hard for Linux to evolve, because it's sort of still running on a way of thinking that's been around since the 70s. And there's nothing wrong with that, necessarily. But I think with a lot of the innovation that has happened since then, why can't we bring some of these new ideas into the operating system?

(18:46): And so it starts with that: we need a structured way for processes to sort of communicate with each other instead of just depending on things like redirection and pipes and these utilities to search for unstructured strings between different programs. But then furthermore, a lot of the maintenance work that's required within the Linux world is not declarative. To put it simply, you as a human have to go in and make these things happen. So one of the things that I really appreciate about Kubernetes, and I think is really what makes Kubernetes Kubernetes, is the whole operator or controller pattern that has come about with it, and the whole event stream that you can get, and how you can hook into those things and watch them, and you can make decisions in a centralized controller on what needs to be done based on events in real time. And this is all within an application. It's not you, a human, doing it. And so, if you take this idea of having structured output between processes, a way for communication to happen between processes, and you take this controller pattern and you bring those into the operating system, suddenly your operating system can almost think for itself. Talos Linux actually doesn't have systemd. We've written our own PID 1, we call it machined, it's a Go binary. And it serves the same purpose as systemd, to be the init system. But one thing that's different about it is that it's completely based off this idea of the controller/operator pattern. So the networking controller, for example, can watch configuration changes. So when you hit the API of Talos and say, "Here's my new networking change," we can make a decision based on those things. What needs to change?

(20:27): Well, /etc/resolv.conf needs to change, maybe we need to change some certificates because some subject alternative names need to be added for the new IP addresses. There's so many things that you could do with this API layer and with this controller pattern to make the operating system roll things out on its own instead of you as a human saying, "Okay, we need to change this IP address, what else do we need to do to make sure that we don't go down? We need to update the certificates, we need to check this, we need to check that and then furthermore, there's probably a bunch of files that need to change. And they all have different formats." /etc/resolv.conf has a certain format, and based on which libc you have (is it glibc or is it musl?), you can't use certain features. There's so many things to worry about. In the Talos world, we overcome these limitations by saying, "Okay, with this API, we can do validation before we even change anything." So we can just say right out, "Look, what you're requesting, we can't do this because of this." So validation on changes is just a huge thing that comes with it, and you can't get that in any other Linux distribution today.

(21:29): And with this controller pattern as well, you just simply say, "Here's what I want," and instead of going and doing it yourself, these controllers within Linux go out and make that a reality for you. And so these are the limitations that I think we need to overcome with Linux. And that's why we've broken out this idea into a project of its own called the Common Operating System Interface. It's because we don't want these ideas to necessarily live just within Talos. In fact, our goal is to have Talos just be COSI, as we call it, C-O-S-I, just a cozy operating system, with Talos just being a plugin that adds Kubernetes functionality into it. That's the long-term vision with Talos, and then having COSI be given to the world. And let's as an industry standardize on this idea, and let's figure out how we can make Linux just be boring <laughs> and act more like Kubernetes. I know that doesn't sound fun. Obviously, I'm using the word boring. But I think that's where we need to go, if we're going to start looking at Kubernetes-like things for distributed systems.
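
The two ideas in this answer, validating a requested change before anything is touched and then letting a controller reconcile the system toward the declared state, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how machined or COSI is implemented: `NetworkConfig`, `validate`, `render_resolv_conf`, and `ResolvConfController` are hypothetical names, and the "file" is just a string held in memory.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NetworkConfig:
    """Declarative desired state: which nameservers the node should use."""

    nameservers: Tuple[str, ...]


def validate(cfg: NetworkConfig) -> List[str]:
    """Return a list of problems; an empty list means the config is acceptable."""
    errors = []
    if not cfg.nameservers:
        errors.append("at least one nameserver is required")
    for ns in cfg.nameservers:
        try:
            ipaddress.ip_address(ns)
        except ValueError:
            errors.append(f"invalid nameserver address: {ns!r}")
    return errors


def render_resolv_conf(cfg: NetworkConfig) -> str:
    """Translate structured config into the resolv.conf text format."""
    return "".join(f"nameserver {ns}\n" for ns in cfg.nameservers)


class ResolvConfController:
    """Hypothetical controller that keeps a managed file in sync with desired state."""

    def __init__(self, initial: NetworkConfig) -> None:
        problems = validate(initial)
        if problems:
            raise ValueError("; ".join(problems))
        self.desired = initial
        self.applied: Optional[str] = None  # stands in for the managed file

    def submit(self, cfg: NetworkConfig) -> List[str]:
        """API entry point: reject invalid requests before changing anything."""
        problems = validate(cfg)
        if not problems:
            self.desired = cfg
        return problems

    def reconcile(self) -> bool:
        """Make actual state match desired state; report whether anything changed."""
        wanted = render_resolv_conf(self.desired)
        if wanted == self.applied:
            return False
        self.applied = wanted
        return True


if __name__ == "__main__":
    ctrl = ResolvConfController(NetworkConfig(("10.0.0.2",)))
    ctrl.reconcile()
    print(ctrl.submit(NetworkConfig(("not-an-ip",))))  # rejected with a reason
    print(ctrl.submit(NetworkConfig(("10.0.0.3", "10.0.0.4"))))  # accepted
    ctrl.reconcile()
    print(ctrl.applied, end="")
```

The point of the split is that the human (or a higher-level operator) only ever calls `submit` with structured data. A bad request comes back with a reason and leaves the system untouched, and everything about file formats lives in the controller.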

Marc (22:34): I like this boring aspect because it means that you're putting your focus elsewhere.

Andrew (22:37): Exactly. I think Kubernetes was really, really fun in the beginning. I had a blast with it and I'm sure a lot of people did, and probably still do today. But what I'm finding myself wanting to do more is focus on my business logic. I don't want to worry about what kernel version I'm running anymore, or what container runtime. There's CRI-O, there's containerd, you can even use Firecracker, there's so many options. They all do the same thing at the end of the day. Really, I don't want to worry about those little decisions anymore. And maybe clouds or people who really have a need will think about it, but I don't think that's the common use case. You just need to run your application in the most secure, production-ready way possible, with as little thinking as possible. And the ability to have your business logic layered on top of that, this goes back into the whole limitations with Linux. Your business requirements may be running STIG hardening guidelines against your machines. Well, that just comes with Talos out of the box. And with this controller pattern, you can even write custom operators that hook into the Talos APIs and Kubernetes APIs and that contain all of your business logic. And you don't worry about Linux, you don't worry about your application, you just tell Kubernetes what you want for your app. You tell Talos what you want from Linux. And your business logic simply is a layer that translates what you want into configuration that you push into a system that knows how to roll that out.

Marc (24:05): I think it's awesome. It's just really, really, really cool. 

(24:13): Hi, it's Marc again. We shouldn't be asking "Why Kubernetes?" in 2022, but it's great to ask, "How Kubernetes?" Eficode is a Kubernetes certified service provider and can help you achieve your business goals. We've also recently recorded a fantastic podcast on Platform Teams. If you would like to answer the "Who Kubernetes?" question, please have a listen. I'll leave some links for you in the show notes. Now, let's get back to our show.

(24:45): Can you help a little bit? You've already said a lot of these words, but why should a company or a software organization choose Talos? Now, you made a lot of different points. But can you succinctly make this case for us?

Andrew (25:03): Well, I am just a CTO, so this will be very, very hard for me because I'm a tech nerd, and I'm not good at sales, but I'll do my best. <laughs> Really, if you are a team that doesn't want to have to focus on Kubernetes, doesn't want to have to focus on Linux or maybe you don't have the expertise on your team, Talos has that expertise, it is sort of an extension of your team. Our goal is to have Talos be a living, breathing operating system that knows how to react to certain things. You don't need to worry about security, you don't need to worry about hardening Kubernetes, you don't need to worry about deploying Kubernetes the hard way, you just tell the system what you want. And quite literally, you can spin up Talos from an ISO on three machines within your bare metal infrastructure, and have an HA (highly available) Kubernetes cluster up and running within 30 minutes. It's just simply boot Talos, supply a configuration, and you're off to the races. You're not worrying about how to install the kubelet, you're not worried about how to harden Linux, you're not having to worry about how to harden Kubernetes, you get that all out of the box. If you want to save time, you want to have something more secure, I think, Talos is definitely the way to go.

Andy (26:19): What advantages would we have when running this, for example, in the cloud? Because in the cloud, you can already have things a bit more locked up with VPCs and internal networks only, you get a control plane which is managed by the cloud providers, and they throw whatever resources at it. So some of the benefits, of course, will translate to having the same thing on premise and in the cloud. Of course, it's an operational benefit, but are there any other goodies you get from running this also in the cloud?

Andrew (26:51): Yeah, I think the cloud definitely solves this, but just in different ways. In the cloud world, for example, let's just take a look at the managed Kubernetes offerings out there. You don't have to worry about the control plane, but that has some caveats with it. You're now limited by how fast that cloud moves as far as Kubernetes versions go. You can't change the control plane in any way that you might see fit or need for your business requirements: OIDC providers, or maybe you do want to use some alpha features, who knows? It's your decision. So what we like to say is that we give you a managed-like experience with Talos, but without limiting you and stopping you from tweaking the knobs and changing things if you really, really need to. I think there's a lot of benefit in just having a Linux distribution that humans can't hop on and change things. Because even though from the outside world, the cloud may harden things and whatnot, you can look at CVEs that have been put out recently by clouds, where they're installing agents onto the Linux distribution, which opens up your VM to other tenants on the same hardware. In the Talos world, you can't do that. You just simply can't install this agent. And even if you did get access to the system, what could you actually do? There's no shell, there's no Unix utilities. You can run iptables commands maybe, or something like that. There's just 35 binaries on Talos and most of them are hard links to the iptables command. So if security is something that you want, which I think everyone should want, even if you are running the simplest of things, because who wants their stuff broken into? I think Talos is very appealing because these cloud providers are still running what I call legacy Linux. And there's still ways for attacks to happen behind the scenes internally. With Talos, that just can't happen.

Andy: OK, I'll buy it! <laughs>

Marc (28:53): Andrew, you reminded me of a term we used to talk about: infinitely malconfigurable. So this is something where you prevent humans from going in and saying, "Well, if I just do it this way, or if I just do it that way, it'll solve a problem for today," without knowing what kind of problem we create tomorrow.

Andrew (29:12): Yes, exactly. That's exactly right. And even worse, you end up with a production system that no one wants to touch. Because no one knows what was changed, and no one wants to touch it. So now you're running three versions of Kubernetes behind and you're on the cusp of being on an unsupported version of Kubernetes, because you have inconsistency.

Andy (29:34): So what kinds of companies and what kinds of industries are running this, and what kind of real-world production experience has it been exposed to?

Andrew (29:44): Yeah, so the people that are running purely in the cloud, I wouldn't even say that. Maybe I would rather say the people that are running purely in a single cloud. So if you're all in on AWS, or you're all in on Azure, those folks, they don't see the benefits of Talos just yet and that's fine. Use AWS's EKS and Azure's AKS, whatever, that's perfectly fine. But the people who are running in multiple clouds are finding a lot of value in Talos because of, again, consistency. You don't have this version of Kubernetes offered by EKS and this one by AKS. You can be more sure about the security practices and consistency there. So even your toolset: maybe you're using CloudFormation in AWS, or Terraform in Azure. These are two different teams altogether that are maintaining Kubernetes in two completely different ways. The companies that are starting to use Talos in these situations are finding a lot of value in collapsing multiple tools down to just one set of tools.

(30:48): And when you know that there's something wrong in one cluster, you know it's wrong in another, so you just fix it everywhere. So this consistency story just allows people to sleep a lot better at night. So if you're running in multiple clouds, there's definitely people out there doing that. But I would also say that the bulk of our users are doing some form of purely on-premise, or cloud and on-premise because of, again, the consistency story. I know I sound like a broken record at this point, but really the same image that you run in AWS is the same image that you run on-premise. Talos has different capabilities of what it can do in the cloud, or what it can do on-premise, as far as where it downloads its configuration from and how you can give configuration to it. So there are industries that want to run on-premise, for whatever reason. Typically, that's because they have databases that have high performance requirements. And so running that in the cloud is very, very costly. So they run those applications in house on-premise, but then maybe their stateless apps are in the cloud. So let's run Talos there as well. So we have this consistency. And we can save some money and not have to buy all this hardware on-premise, so they use both. In the EU in particular, they're a little more averse to running in the clouds. And I think for good reason, they tend to want to run their own clouds. And so we have a very large portion of users that are running Talos purely on-premise, whether that's VMs in VMware, Proxmox, whatever your hypervisor of choice is, or the biggest thing out there, which seems to be clicking: Talos on bare metal. People are starting to see these benefits of not having to worry about Linux and not having to worry about the underlying infrastructure. That's why we have our hypervisors.

(32:49): Because I don't want to have to worry about these things, let's just have an operating system that runs directly on the metal and I don't have to worry about it, it's just there. And I just get Kubernetes. And so you start to save money because you don't have to pay for hypervisor licenses anymore. There's just all kinds of benefits there. So that's one style. I call these more like traditional data center type users. We also have a whole other class of users out there that are running Talos on the edge. So whether that's Raspberry Pis, or even beefier edge machines from HP, or Supermicro or whatever. But again, you could run on a Raspberry Pi the same version, the same image, bit for bit, of Talos that you run on a Dell R630 in your data center, or on a t2.micro in AWS. It's the same image. And so there's some patterns coming out right now where people are running Talos at the edge and then they're using a feature that we have called KubeSpan, which is actually built on top of WireGuard, which would allow you to have this edge machine join a control plane that lives in the cloud. So at the edge, you don't need three machines anymore, just for the control plane.

(34:06): And to waste that hardware. You can just have your control plane live somewhere within that region. And then using WireGuard, they can reach back and join Kubernetes that way. And so people are starting to deploy Talos at the edge in that fashion. They're deploying single node clusters at the edge as well. Again, the security benefits: we have encryption of disks and all these things. And even if you do get access to the machine, what can you really change? So the edge use case is really starting to explode for us, I think, where people are wanting to run Kubernetes at just really remote locations. We have people running Kubernetes in shipping yards, down in mines. We have people running Talos just in places that personally and to be frank, I would be scared to run Kubernetes, but Talos is handling it quite well and has been doing it for about two years now.

Andy (34:56): Is there anything else in Sidero Labs which would be useful for getting these metal or edge installations going?

Andrew (35:05): Yes. So we do have a project, it's called Sidero Metal, and it's built on top of the Cluster API technology. And what it is, really, if you want to use Cluster API's terminology, is an infrastructure provider. So in the Cluster API world, you have your core thing that brings this notion of machines, it brings this notion of clusters. And you can define machines and clusters using YAML, in the same way that you define your application stack, where you're using Deployments or DaemonSets, or however you do that. The idea with Cluster API is let's use Kubernetes to manage other Kubernetes clusters. So you deploy a set of controllers and custom resource definitions to a cluster. And suddenly you can create other Kubernetes clusters. And within this system, they have the notion of an infrastructure provider, and an infrastructure provider just simply says: I know how to spin up a VM in AWS or a VM in Azure or VMware, or a bare metal machine in Equinix Metal.

(36:11): Sidero Metal is an infrastructure provider that knows how to manage bare metal machines. So it'll use the BMC to power cycle the machines, turn the machines on, force them to PXE boot. It comes with a TFTP server, iPXE, that whole infrastructure so that you can PXE boot Talos. It has an agent that knows how to scrape information off of a server, so that when you first boot a server off of this PXE system, it will actually register it with this Kubernetes cluster as a custom resource. It's just kubectl get servers and you can get a whole list of all your servers and all the hardware, disks, memory, what have you. And then from there, you just say, "Okay, I want a Kubernetes cluster. And I want these types of machines." Because within Sidero Metal, you can have what we call server classes. So you can qualify specific hardware where you can say, "Okay, I want machines with a Supermicro motherboard with this product number and this much RAM and this type of CPU at this clock speed. These are my t2.micros in my data center." And so then you can just go and kubectl apply some YAML files that say give me a Kubernetes cluster that is made up of five of these t2.micros that are in my data center. And so decisions are made, we'll just say okay, let's pick any of them because we have 20 of them currently available, ready to go. And we'll power cycle those machines, we'll turn them on, PXE boot them, install Talos, we'll supply configuration to those machines, and once they're up, we'll bootstrap etcd. All of these things that you typically would do manually are just all within this system in a very declarative way. So Sidero Metal is our Cluster API infrastructure provider for your on-premise bare metal needs.
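The server-class idea described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `Server` and `ServerClass` types, their fields, and the `allocate` function are hypothetical names invented here, not Sidero Metal's actual resources or selection logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hardware facts an agent might report for one machine (hypothetical fields)."""
    name: str
    manufacturer: str
    memory_gib: int
    cpu_ghz: float
    in_use: bool = False


@dataclass(frozen=True)
class ServerClass:
    """A named hardware qualifier, loosely modeled on the 'server class' idea."""
    name: str
    manufacturer: str
    min_memory_gib: int
    min_cpu_ghz: float

    def matches(self, server: Server) -> bool:
        # A server qualifies when every hardware requirement is satisfied.
        return (
            server.manufacturer == self.manufacturer
            and server.memory_gib >= self.min_memory_gib
            and server.cpu_ghz >= self.min_cpu_ghz
        )


def allocate(servers: list[Server], server_class: ServerClass, count: int) -> list[Server]:
    """Pick `count` free servers that satisfy the class, or raise if too few exist."""
    candidates = [s for s in servers if not s.in_use and server_class.matches(s)]
    if len(candidates) < count:
        raise ValueError(
            f"need {count} servers of class {server_class.name!r}, "
            f"only {len(candidates)} available"
        )
    return candidates[:count]


if __name__ == "__main__":
    inventory = [
        Server("rack1-a", "Supermicro", 64, 2.4),
        Server("rack1-b", "Supermicro", 64, 2.4, in_use=True),
        Server("rack1-c", "Supermicro", 16, 2.4),
        Server("rack2-a", "Dell", 128, 3.0),
        Server("rack2-b", "Supermicro", 128, 3.0),
    ]
    small = ServerClass("dc-small", "Supermicro", min_memory_gib=32, min_cpu_ghz=2.0)
    for server in allocate(inventory, small, count=2):
        print(server.name)
```

The point of the sketch is the declarative shape: the request names a class and a count, and the system decides which concrete machines to use.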

Marc (38:01): Fantastic. Can you give us any idea, Andrew, please: what are the kinds of companies that are running Talos today? I guess you probably have NDAs and can't talk about specific customers, but could you give us classes of customers? Or who can we reference here?

Andrew (38:21): Yeah. The telco space seems to be really popular with us. These folks tend to run very complicated on-premise setups. And they're doing all kinds of complex networking, and Talos is working out very well for those types of folks. There's also people that are running Kubernetes on manufacturing floors and stuff like that. So you're having Kubernetes manage the application, which manages hardware, building things on your manufacturing floor. And Talos gives you a really strong way to do that as well. I'd say Fintech is also very much interested in what we're doing. The government space is really blowing up for us right now because of all the security benefits, and the fact that you could run it anywhere is very appealing. So places, I would say in general, that have a need for security are really seeing the value of Talos, and places that don't want to run in the cloud. Because again, you get this cloud-like experience, this managed-like experience, but without having to depend on the cloud. Kubernetes is your cloud and Talos is the thing that delivers that without having to really think about it.

Marc (39:40): Cool. Is there anything new that you'd like to share or anything that you'd like to add?

Andrew (39:46): Yeah. So I mentioned that we're building out. We have this product called Sidero Metal, which is based on Cluster API. And for the most part, we've been pretty happy with that, but we've also found that Cluster API was built with the limitations of traditional Linux in mind, the fact that SSH is needed, or you can't run a Kubernetes cluster across multiple infrastructure providers. So what if I want to run Talos on-premise and burst out to the cloud in maybe peak hours or something like that? Using KubeSpan, that's totally possible. Using more traditional ways, or more traditional Linux distributions, this is very, very difficult to do because you've got to handle all kinds of key exchanges and stuff for WireGuard. It's just not a very fun process. And so for good reason, people have avoided this way of thinking of hybrid Kubernetes clusters. But with Talos, this is a simple boolean flag, you just say kubespan: enabled: true within the configuration, and you could spin up. I did a live stream on Twitch once where I had a cluster spun up in my closet, and I had people joining from Amsterdam, Spain, and Portugal. I had someone in the Netherlands joining a node, or trying to, but I don't think they had good enough internet, joining a node on a high speed train. <laughs>
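As a rough illustration of the "simple boolean flag", the Python sketch below merges a small patch into a stand-in machine configuration. The field path `machine.network.kubespan.enabled` is an assumption about where the flag lives, the `base_config` contents are placeholders, and `merge_patch` is a hypothetical helper; check the Talos configuration reference for the version you run before relying on any of it.

```python
import copy
import json


def merge_patch(base: dict, patch: dict) -> dict:
    """Return a new dict with `patch` recursively merged over `base`."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: descend instead of overwriting.
            merged[key] = merge_patch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-in for a generated machine configuration; real configs have many more fields.
base_config = {
    "machine": {"type": "worker", "network": {"hostname": "closet-node-1"}},
    "cluster": {"clusterName": "demo"},
}

# Assumed field path for the flag (machine.network.kubespan.enabled).
kubespan_patch = {"machine": {"network": {"kubespan": {"enabled": True}}}}

patched = merge_patch(base_config, kubespan_patch)
print(json.dumps(patched, indent=2))
```

The existing hostname survives the merge and the original dictionary is left untouched, which is the behavior you would want from any patch mechanism layered over generated configuration.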

(41:07): We formed this cluster that was made up of machines from all over the world. And it's just a simple boolean flag. So my point here is that Talos really opens up new ways of thinking and new architecture patterns that you can really start to do. And Cluster API was not built with these things in mind. And so we hit a wall, if you will, or hit the ceiling, if you will, with Cluster API. And so we're starting to work on a new product called Omni, which is our way of managing Kubernetes. And we want to bring all of the principles and all of the philosophy of Talos into managing Talos itself. So it's pretty cool. The way that this works is you just simply boot Talos with three extra kernel parameters. And these kernel parameters tell Talos three things. The first tells Talos where to register itself, and also the token to do so. The second one tells Talos where to send console logs to. You can tell the kernel to ship off all console logs to some remote location. So if you don't have Talos API access, you still see this console output, like with ipmitool sol activate. The third thing is it tells Talos where to send all events to. And these events are used by this product to make decisions on the state of the machine. And so you boot Talos with these three kernel parameters, and it uses WireGuard to join the system. So you can have these machines out in the middle of nowhere and they don't need a whole lot of complex networking, they just simply need two ports, port 443, which is typical, and a WireGuard port, so that they can have egress out to the service. And once they join, we can establish this tunnel so that we can send commands to Talos.
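The three kernel parameters can be sketched as a small Python helper that assembles a kernel command-line fragment. The transcript does not name the parameters, so `siderolink.api`, `talos.logging.kernel`, and `talos.events.sink`, the `jointoken` query key, and every address below are assumptions used for illustration; `omni_kernel_args` itself is a hypothetical function.

```python
from urllib.parse import urlencode


def omni_kernel_args(api_url: str, join_token: str, log_target: str, events_target: str) -> str:
    """Build the three extra kernel arguments as one space-separated string."""
    # The registration endpoint carries the join token as a query parameter (assumed shape).
    registration = f"{api_url}?{urlencode({'jointoken': join_token})}"
    args = [
        f"siderolink.api={registration}",      # 1. where to register, plus the token
        f"talos.logging.kernel={log_target}",  # 2. where to ship console logs
        f"talos.events.sink={events_target}",  # 3. where to send machine events
    ]
    return " ".join(args)


if __name__ == "__main__":
    # Placeholder endpoint, token, and addresses; substitute real values for an actual deployment.
    print(
        omni_kernel_args(
            api_url="https://omni.example.com",
            join_token="abc123",
            log_target="tcp://[fd00::1]:8092",
            events_target="[fd00::1]:8090",
        )
    )
```

The resulting string would be appended to whatever boot arguments the machine already uses, for example in an iPXE script or a boot loader entry.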

(42:52): This would be nearly impossible with traditional Linux, having to SSH over this. I say nearly impossible; it's definitely doable. But at the end of the day, it would be held together with sticks and tape and gum, it would just not be fun. With the Talos world, we just establish a simple WireGuard tunnel. And we can use the APIs to manage these machines remotely from anywhere in the world. And so once they register with the system, you just simply say, I want a Kubernetes cluster, click, click, click, make it with these machines, and we generate configuration for you. You can supply patches, so if you need to do something specific, going back to the limitations of the cloud, if you need to add something to the control plane, that's just a simple patch. We'll merge your business logic, your business needs, in with the underlying configuration that we generate. And I'm really excited about this product because it's just a platform for us to do all kinds of other things. We could do managed upgrades of Kubernetes, managed upgrades of Talos, you don't have to worry about that anymore. And we can use APIs and health checks to really make sure it's done robustly, not depending on SSH. We can have bidirectional communication between the node itself and Omni, as we call this product, to make decisions on what needs to be done next. We're looking at building in high levels of security with this. So one of the features built into Talos, or it's not a feature just yet, but one of the things built into the kernel, is called the Integrity Measurement Architecture.

(44:24): It's a Linux technology that allows you to sign files using their extended attributes. You basically write the SHA of the file's contents into the extended attributes, and then you cryptographically sign those extended attributes. And the kernel won't even open that, it just won't even allow any operation on a file unless it's signed by a key that it knows and the SHA matches the expected value. This is really, really powerful because this is not just immutability anymore by just being read-only. This is cryptographically enforced immutability. So within this product, you could supply a key, we could build a custom version of Talos with you and sign every file within there with your key. And now your operating system is running completely immutable, but not just dependent on read-only parameters. It's just cryptographically you cannot do this, it's just not possible to do. So we could sign your workload potentially in the future. So when you run containers, all of your containers need to be signed with this key. It's just a platform for us to do so much and use Talos to the fullest. And really just ask ourselves, what does an API driven operating system allow us to do? On these types of tooling, this management system, the new animals of the world, if you will, and this is what Omni is for us at the moment.
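The hash-then-sign idea can be shown in a few lines of Python. This is a conceptual sketch only, not how the kernel implements the Integrity Measurement Architecture: a plain dictionary stands in for a file's extended attributes, HMAC with a shared key stands in for the asymmetric signatures a real deployment would use, and `sign_contents` and `may_open` are hypothetical names.

```python
import hashlib
import hmac


def sign_contents(contents: bytes, key: bytes) -> dict:
    """Produce stand-in 'extended attributes': the SHA-256 of the contents and a signature over it."""
    digest = hashlib.sha256(contents).hexdigest()
    signature = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return {"sha256": digest, "signature": signature}


def may_open(contents: bytes, xattrs: dict, trusted_key: bytes) -> bool:
    """Allow access only if the stored hash matches the contents and the signature verifies."""
    digest = hashlib.sha256(contents).hexdigest()
    if not hmac.compare_digest(digest, xattrs.get("sha256", "")):
        return False  # contents changed after signing
    expected = hmac.new(trusted_key, digest.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, xattrs.get("signature", ""))


if __name__ == "__main__":
    key = b"example-signing-key"
    binary = b"#!/bin/true\n"
    attrs = sign_contents(binary, key)
    print(may_open(binary, attrs, key))                # True: untouched and signed by a known key
    print(may_open(binary + b"tampered", attrs, key))  # False: hash no longer matches
    print(may_open(binary, attrs, b"some-other-key"))  # False: signed by an unknown key
```

The two rejection cases correspond to the two conditions in the description: the SHA must match the expected value, and the signature must come from a key the system knows.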

Marc (45:49): What do you think of that, Andy?

Andy (45:50): I was lucky enough to have a demo of that last week and it's rather impressive. Of course, the demo was a subset of all the ideas coming to fruition, but it was really, really impressive. I'm really excited about this.

Marc (46:06): It just makes so much sense, Andrew, everything that you're talking about. It's like the evolution of things is moving so fast right now. I'm going to try to summarize: maybe not if you're in a single cloud, but if you're doing edge computing, hybrid on-prem, or multi-cloud, on any of those, Talos Linux can give you an orchestrated, declarative, maintainable, secure environment, so you can focus on business logic and creating value.

Andrew (46:34): Yeah, and with your permission, I'm going to steal that and put that on our website. I think you nailed it.

Marc (46:40): My pleasure. This is really fantastic, Andrew, thank you so much. I've got two final questions that we've started asking everybody that comes on the podcast. The first one is putting you on the spot, now, think back. The first thing that you can remember, maybe when you were a child, as far back as you can go, what's the first thing you wanted to be when you grow up?

Andrew (47:06): Oh, that is a good question. As far back as I can remember, baseball was very, very big in my family. I wanted to be a baseball player and I'd always promised my mom I'm going to become the next... Oh man. I forget this guy's name, but he was leadoff batter for the Dodgers. I think his name was Brett Butler. He was fast, he was leadoff batter. He batted left-handed and it was very much like me, and I would always tell my mom, I'm going to be like Brett Butler and buy you a house one day.

Marc (47:35): Are you a southpaw? Are you left-handed?

Andrew (47:38): I am not a southpaw. No. It was strange, because everything else I do right handed, but I only batted left.

Marc (47:46): All right. Cool. Second question. So is there a point in your life where you either realized that you needed to change the path you're on? Or maybe something crystallized that you're on the right path?

Andrew (48:03): Yeah, I think I've had plenty of those. My path I would say is a little more untraditional. I had a kid when I was 18. Well, I dropped out of high school, I got my GED later on. I had a kid when I was 18, I got into construction. And I was just doing what I could to survive. I even got into mixed martial arts and fighting. That was the next big athletic thing I thought I was going to be as a professional fighter. I did pretty well at that, but I remember I was living in my car, and there was a family, they're basically my family now. I was training their son how to box. And they found out that I was living in my car, and they brought me into their house. And they told me, "Well, if you're gonna live here, you got to go to school." So here I was, I was just basically homeless living out of a car for about a year, year and a half. And they're telling me I got to go to school. I was 25, maybe. And so I got into community college, and I did a report because I was still very much into fighting on the effects of boxing and getting hit in the head. 

(49:17): At the time, I was still very much convinced I was going to become a great fighter and all of these things. But I started to realize school is very, very important. And getting hit in the head probably takes away from my ability to do that important thing. So the data showed that getting hit in the head is not healthy. I still struggled with it because I loved the sport and I could have done really well in it, but I had to do what I had to do to support my family. I now have three kids. And so eventually I decided to just commit to school, and I got into UCSB for physics, and that's why today I'm in Santa Barbara and we haven't left. It's not really a single moment, I'd say it was about a year's worth of a lot of digging and changes that really set me to where I am today.

Marc (50:11): Wow. Absolutely fantastic story. You've been a really fantastic guest, Andrew. I'm so happy that you're here. I think that Talos is going to have a wonderful impact not only for our customers, but this is the way things should be done. I think so.  Thank you so much for joining us.

Andrew (50:32): Yes, thank you. It was a pleasure. I appreciate you having me.

Marc (50:35): And thank you once again, Andy for a very good interview. 

Andy (50:49): Yep. Thanks. 

Marc (50:41): This is Marc with the DevOps Sauna podcast signing off. Thank you for listening. Both Andrew and Andy can be found on social media, and I'm sure they're eager to continue the conversation around these subjects. We'll leave their profiles in the show notes for you. If you haven't already, please subscribe to our podcast and give us a rating on your platform. It means the world to us. Also check out our other episodes for interesting and exciting talks. Finally, before we sign off, let's give Andrew and Andy an opportunity to introduce themselves. I say now take care of yourselves. And remember, if you're asking why Kubernetes, you're way behind.

Andrew (51:23): I'm Andrew Rynhard. I'm the founder and CTO of SideroLabs and creator of Talos Linux. I've been working with Kubernetes since 1.8, and have been working with containers since they were in beta, back when Docker was founded somewhere around 2013. Ever since then, I've really fallen in love with containers and Linux, and today I'm happy to be the founder of SideroLabs.

Andy (51:49): Hi, my name is Andy Allred. I've been in Finland for over 20 years already. I started my career in the US Navy on nuclear-powered fast attack submarines, doing all kinds of cool tech stuff and learning that tech is there to serve a mission which people have. And then I've spent my career in IT and mostly telecoms, figuring out how tech can serve the mission of people.