Diagram driven development | Booking.com

In this talk, Mourjo Sen, Senior Software Engineer at Booking.com, explores how Diagram Driven Development can revolutionize software architecture documentation. By leveraging Business Process Model and Notation (BPMN), teams can transform static diagrams into living, executable blueprints. This approach helps keep architectural documentation up to date, improves understanding across teams, accelerates onboarding, and fosters collaboration among both technical and non-technical stakeholders in complex distributed systems.

Transcript

[intro jingle] Hey, everyone. First of all, thank you for making the time to attend and also for giving me a place on the stage. It's a very prestigious organisation that organises this, and so many people I look up to in the industry are here. So, it's a really humbling experience. So, thank you for being here.

My topic is diagram-driven development. But before we get into that, first of all, namaste. I'm from India. I lived there for around 28 years before moving to the Netherlands in 2023. And that's when I started working for Booking.com, where I'm working right now. And overall, I have around 10 years of experience, which is not a lot, but it's something, enough to see what I like and to come and speak at conferences like this. So, over the last five years, about halfway through my career, I saw that there are some things that I feel more interested in, some things that I feel are foundational enough that all of us together could talk about them. So, I write a little bit, and I speak a little bit about these topics. And that's what brings me here today as well. Because these foundational topics are kind of what the future of software might hold for us.

So, before talking about the future, a bit about the past. As a community, we decided that these big monolithic structures, these big repositories that did everything, were hard to manage. They started to show their age. And we decided that there's a better way of doing this. Following the Unix philosophy, building bigger things out of smaller parts is better. So, we decided to go the microservices route. And it's essentially similar to a jigsaw puzzle, where you have a big set of puzzle pieces, and each one of them is very easy to handle and manage, but together they form something bigger. And that worked really well for us over the last decade. Microservices gave us a lot of benefits.
And that's why everybody is doing it in some form or the other. I certainly am doing this almost every day, and this is really good. But the fact is that anything that is widely popular in the industry starts to show us that it is good, really good at solving some issues, but there are some things we might still have to do about it. And that's what I want to talk about today.

The problem that I see in microservices is that it fragments our systems. And just like a jigsaw puzzle, if you look at the set of pieces in the box it comes in, it's very hard to see what it makes overall. So, because we have so many pieces, it's hard to see the big picture. But the big picture is really important, because when we build products, we build a product that should work as a whole. The customer, the end user, is not going to be happy with just the puzzle piece, just the microservice. They want something bigger, something that they can use.

And this fragmentation poses two problems. First, it is hard to change because there are so many moving parts. Any change that we want to make to the product might require changes in five or six microservices. That means meetings, that means discussions, that means agreements, disagreements. So, adaptability to change is one problem of microservices fragmentation. And the second is that it's hard to understand. Because there may be thousands of microservices, and thousands of microservices is not a very crazy number, it happens, it's very hard to understand what does what. Because you might have one corner piece of the puzzle that is not aware of the other corner. But again, the problem we're trying to solve is one whole thing.

So, let's try to look at these two problems in a little more depth, starting with: how do we adapt to change in a world of microservices? So, the problem starts with fragmentation.
And the reason why fragmentation is a problem here is that once we break up a bigger piece into smaller fragments, these fragments are designed to compose in very fixed ways. But this fixed composition is a problem because we have to change. We constantly change our products to stay competitive in the market. We have to see what the users really want, what we do not have, and we want to build new features. And building new features, adapting to change, is harder if we have to deal with a very fragmented system.

So, take an example. Suppose we are building a key-value store. A key-value store could be a simple HashMap-like interface that has a GET API and a SET API. The GET API returns a value for a key, SET sets it. This is simple enough, but say we want to do something more. And we say that we don't just want to look up a key's value at a point in time, we want to look up all the historical values of the key. So, we might build a service out of this, a microservice that tracks the history of a key. And we say we provide this as an API. It has its own concerns, it has its own storage requirements. So, it's a different microservice. So, whenever we have, let's say, a new key or an update to an existing key, we say, call the history service, insert this history into the service, and we are good.

While we are at it, we want to make some more changes. Why just look up? Why not search? Regex searches, prefix searches, phrase searches. We want to provide those to our users as well. So, we say, we build a search service, a microservice in its own right, and this provides the search functionality. And this is important because the problems that the search service deals with are not the same as those of the key-value service. So, this is why microservices are so good. Different concerns, each solving a part of the problem.
But before we finish the search component, let's say we want to search in multiple languages, like Danish, or Dutch, or English, or Bengali, which is my native language. So, we say, okay, before we insert it into the search component of the search service, let's call the translation service to give us all the translations of any new key that is getting inserted. And we can do that. Suppose we have 200 languages that we are supporting. And we say, okay, the key-value store can call the translation service, get all the translations, and then insert them into the search service. Turns out this may not be ideal because translating into 20 or 200 languages is not an instantaneous operation, especially not as instantaneous as saving a key and the value. So, we say, okay, this is not ideal, because I don't want my key-value store, the primary objective of our product, to become slow because of another dependency.

So, let's do something else. A common pattern here would be to have a data bus, a message broker like Kafka, where you would say, okay, let's stream the updates, the operations that are happening in the primary store, and whoever wants to subscribe to it, consume from it, can deal with it without being tightly coupled with the key-value store. So, this is what microservices bring to the table: different concerns without being coupled. And then, whenever the translations are completed, they can be inserted into the search index. And therefore, the key-value store, whose goal was to be extremely fast, is not getting slowed down by the history service or the translation service.

So, this is why microservices took such a centre stage. A product with so many different varieties of challenges is solvable because there are these individual foundational pieces that do their job really, really well. But what is the problem now? The problem is we have fragmented our product.
This is not one system, these are four systems that we can see. So, every so often, you might have to change this product to build something that doesn't fit directly into any one of them. So, suppose now we want to build an expiring key, a key that lives with a TTL. So, a key that you want to save, do everything with it for 10 minutes, after which it should auto-delete itself. So, where do we put it? We have distributed, or we have de-structured, our system into four pieces. So, we will definitely need to put some changes in the key-value store. You'll have to implement, let's say, a timer. And once the timer goes off, we delete it from the key-value store. But is that enough? It wouldn't be, because we have the history service and the translation service that also have a copy of the data.

What do we do now? We have to delete it from there as well, because the user wants this data to automatically disappear from the product. So, one option would be to say, send some control messages. When it is time to delete, you send some messages with special meaning that say it's time for the history and the translation services to start deleting data. This could work. This might work. The problem now is that the history and the translation services are kind of data coupled with the key-value store. The key-value store, which was just sending data, now has to send some control messages, which the history service has to understand. Now think of the case where the history service was serving not just the key-value store, but something else in the organisation. These control messages would have to be understood by both sides of the organisation, or we'll have to make sure the language is up to date to support all kinds of changes for the history service. This is doable. Let's look at another option.
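As a minimal sketch of the data bus and the control-message option described above, the following Python uses an in-memory, synchronous stand-in for a broker such as Kafka. All names (`Bus`, the `kv-updates` topic, the `SET` and `EXPIRE` message types) are hypothetical and chosen only for illustration; a real broker would deliver messages asynchronously.

```python
from collections import defaultdict


class Bus:
    """In-memory, synchronous stand-in for a message broker (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # A real broker would deliver asynchronously; here we call directly.
        for handler in self._subscribers[topic]:
            handler(message)


class KeyValueStore:
    def __init__(self, bus):
        self._data = {}
        self._bus = bus

    def set(self, key, value):
        self._data[key] = value
        # Plain data message: the store does not know who consumes it.
        self._bus.publish("kv-updates", {"type": "SET", "key": key, "value": value})

    def get(self, key):
        return self._data.get(key)

    def expire(self, key):
        self._data.pop(key, None)
        # Control message: every consumer must now understand "EXPIRE".
        self._bus.publish("kv-updates", {"type": "EXPIRE", "key": key})


class HistoryService:
    def __init__(self, bus):
        self.history = defaultdict(list)
        bus.subscribe("kv-updates", self.on_message)

    def on_message(self, message):
        if message["type"] == "SET":
            self.history[message["key"]].append(message["value"])
        elif message["type"] == "EXPIRE":
            # This branch is the data coupling the talk points out.
            self.history.pop(message["key"], None)


bus = Bus()
store = KeyValueStore(bus)
history = HistoryService(bus)
store.set("greeting", "hello")
store.set("greeting", "hi")
```

The `EXPIRE` branch in `HistoryService.on_message` is where the coupling shows up: the history service has to know a message vocabulary that belongs to the key-value store.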
One more way to solve this would be to say that we don't want our systems to be coupled on the data level, just like we didn't want them coupled on the temporal level. So, we say we all implement timers. Everybody implements timers because it's a shared responsibility to delete it after a while. So, now we have three implementations, one in each microservice. Now, this is also not ideal because not only are we spending time doing the same thing, but also these timers might not be in sync with each other. What if they're a second off here or a minute off there? The user would get an inconsistent view.

What is the fundamental problem here? There could be other solutions, but the fundamental problem is that this one feature, the expiring key, doesn't fit well into any one of these microservices. But our system is now deconstructed enough that we cannot say that we should be able to put it in one of them. And what is lacking is the composition. So, we don't have a way to compose these microservices to give us more emergent behaviour. We just have this Kafka bus at the side and some API endpoints. But what we would rather want is for this to come and work together in a slightly better way. And because of this lacking composition, or lacking ability to compose with each other, that adaptability to change is reduced. Every time we have to serve a feature like the expiring keys, we now have to go back to the drawing board. Who wants to do this? No, I don't want to do this, this is not a concern of the history service, and this goes on and on.

We could do one thing. We could introduce another microservice. Why not? We can't put it anywhere else. Just put it somewhere and build another one, and then another one and another one. This can work, and this does work. This reminder service here is going to work perfectly without any data coupling, without any re-implementations. But the question is, how many such services would you build like this?
Because today it is the reminder service. What if tomorrow it is a forget-me service or something else? And then, it becomes more and more complex. Do you really want to build a service that just keeps timer information? Because it is not a core fundamental product capability. It is something that we just don't have a way to solve right now.

So, let's look at the right side of the image. We, as a community, have decided that having a central data bus, like Kafka or something, is a good thing to have because it decouples our system. When we saw that the translations were slow, we said, send it across to the data bus. Whoever wants to consume it can consume it. And this sits somewhere in the centre of all of it, which ties the microservices together. And that worked really well. But what if it was not just passing around data? What if it was doing something more? What if it was helping us compose our microservices in a way that we are not able to do today?

So, what I'm thinking about is something like a control bus. So, in CPU architectures, you have this control bus which carries instructions to activate, say, the ALU, the arithmetic logic unit, or something. The actual work is still done by these circuits, but the control bus just says, it's time for you to wake up and do something. If we had something like this, a black box that could tell our microservices when to compose with each other, that could be really powerful. Especially in this case, because of the timer implementation: the reminder service is something we were building and maintaining, and if AWS went down, that reminder service would also go down. So, to avoid that kind of thing, it would be good to have a central piece that allows us to compose microservices. Because here, all we wanted was a notification to each of our microservices to delete something when the time comes.
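The control bus idea, a central piece that only tells each service when to act, can be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation: `ControlBus`, its `schedule` and `advance` methods, and the event shape are all hypothetical, and time is simulated with an explicit `advance` call instead of a real clock.

```python
import heapq


class ControlBus:
    """Hypothetical central timer: it does no domain work, it only notifies."""

    def __init__(self):
        self._now = 0.0
        self._timers = []  # heap of (due_time, sequence_number, event)
        self._seq = 0      # tie-breaker so events themselves are never compared
        self._listeners = []

    def register(self, listener):
        self._listeners.append(listener)

    def schedule(self, delay_seconds, event):
        heapq.heappush(self._timers, (self._now + delay_seconds, self._seq, event))
        self._seq += 1

    def advance(self, seconds):
        """Move simulated time forward and fire every timer that is now due."""
        self._now += seconds
        while self._timers and self._timers[0][0] <= self._now:
            _, _, event = heapq.heappop(self._timers)
            for listener in self._listeners:
                listener(event)


# Stand-ins for the data held by three domain services.
kv = {"k": "v"}
history = {"k": ["v"]}
translations = {"k": {"nl": "w"}}

control_bus = ControlBus()
for service_data in (kv, history, translations):
    # Each service does its own deletion; the bus only says when.
    control_bus.register(lambda event, data=service_data: data.pop(event["key"], None))

control_bus.schedule(600, {"action": "delete", "key": "k"})  # 10-minute TTL
control_bus.advance(599)  # one second before expiry: nothing is deleted yet
```

Because there is a single timer, all three services are notified at the same simulated instant, which avoids the drift between per-service timers mentioned earlier.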
Similarly, not just timers. What if this central thing, which I've not named yet, the central bus, the control bus, could provide us things like retries, or parallel execution and then joining, or conditional execution, when one microservice says something and then, based on the response, you want to do something, or callbacks, or modelling unhealthy paths, basically many other constructs that help compose systems? Because we build the product microservices, the domain microservices. But all we need is a little bit of glue that ties it all together. That could really work well for us. Because if we do that, then we would not have puzzle pieces, because puzzle pieces have these grooves that only let them fit in one or two ways. What we really want to do is to build Lego blocks, so that today we want to serve one use case, tomorrow we want to build something else with the same domain services. And a central orchestration layer, or a control layer, or a composability layer could help us do that.

So, that is problem number one. Fragments compose in fixed ways. That makes it a challenge for microservices. The second problem was that it is hard to understand. Fragmented systems fundamentally are hard to understand. Why? Because we have so many moving pieces, the intention that the users come to our systems with keeps getting lost. Because the deeper and deeper you go into the microservice system, you kind of forget what the user wanted to do originally. And that is not anybody's fault. That is how the system is designed to operate.

So, let's take an example. Say we are all a part of a team, and we own a microservice. And this microservice calls some other microservices because they do a part, let's say they provide some data to us. And we know that someone calls us to do something: our endpoints, our requests, our Kafka topics. And we are very familiar with this every day. We know who we call. We know who calls us.
This is our local neighbourhood. This is what we deal with every day. This is what is at the top of our minds all the time. But this is not everything. Because there could be other microservices in the organisation which we are not fully familiar with. We know that they exist, but we may not know what exactly they do. And what we see every day is a tiny slice of a much larger problem. And this larger problem is important because when the user comes to us, they want to solve a problem that they have. They want to use our product, no matter who serves that use case or what our view of it is. Because to the user, it's still one product, one whole product that has to work in unison.

So, a local context may not be enough, especially if you take into account that we'll have to provide new features, new capabilities. So, what do we do? We discover the scope. We go through the system and say, oh, if you want to build this thing, this service needs to be updated. But how do we do that? We go through all the microservices that are there, that are serving the user's requests, for example. And then, finally, we have a mental map, sort of, to say that, okay, if you want to provide an expiring key, we would have to change service one, two, and three. But doing this multiple times is not only costly, but it's also not accurate, because we are trying to figure our way out in the dark. We don't have something like a map or a mental model that allows us to immediately see where we fit, where our microservice fits. And that is costly and not accurate.

What we would rather have is, when a user comes to us, the first service would definitely understand why they're coming to us. And then, the deeper services have to have a way to understand what the user intent was. Because normally, when you make request after request after request, this user intent gets lost.
And that is why we have to constantly rediscover our own systems in order to deliver new features. What we would rather want is a mental model which doesn't look at the fragmentation, doesn't look at the microservices, though they're there, but allows us to see where we fit in the larger picture. And that would help us to not lose sight of what we are building. Because we are ultimately building one product. We are building one thing. Even though our systems are fragmented, we are building one thing. So, this unified mental model would give us a view of what we want to build as a whole.

And this is important because we cannot just build systems. We have to work together with the rest of the organisation. And although we, the tech teams, understand the local context, the microservices, the data, the responses, the requests, the protocols, we alone cannot solve the problem. Why? Because the non-tech teams, the business-focused teams, understand the global context, or rather the global intent of the user. So, what the user wants to do, why we are building something, what is lacking in the market, what can we do to stay up to date? That is something that we as microservice owners do not necessarily excel at. And therefore, we need both of them to come together. The tech teams, the microservices, did this really well at the local level, but we now need to take that and extend it to the global view of what we are trying to build.

So, what we would rather have is a language that everybody understands. And whenever something gets complex, what do we do? We often search for a whiteboard. And this is because we, as humans, understand visual representation really well. Anything visual really, so flow charts, or user screen diagrams like Figma, or simple Excalidraw diagrams, or something like that. This is also evident from the fact that there are so many diagramming tools out there.
Last time I counted, there were around 15 tools that I could name right now. They're there because they solve a problem, and the problem is that complex ideas are easier to explain visually. Unfortunately, most diagrams, most whiteboard diagrams, are just historical reflections. Whatever we can think of at the moment, we'll jot something down, explain it to a colleague, and forget all about it, because tomorrow it will not be up to date again, because something changes, and this diagram becomes out of date. The problem here is that the running system in production and the diagram do not stay together. The system moves away from the diagram, and therefore the diagram goes out of date. But what if it did not? What if we could have systems that were driven by diagrams? Diagrams that we understand as humans, if they could drive our systems, then we would have a mental model that really, really sits well with us, where everybody understands what we do.

So, to recap, two problems in fragmented systems. First, it was not adaptable to change because the fragments, or the microservices that we created, compose in very predefined ways. We needed a composability layer, which was missing, and if we had it, it would be really good. Second problem, it was hard to understand because the global intent of what we are trying to do gets lost in the local context.

If we had to explain the key-value store that we were talking about at the beginning to a new joinee, we might use a diagram like this, where we do not talk about the services, we do not talk about the details of the request, the APIs, but we just explain what we do first. So, let's simulate the situation where we are explaining to a new joinee what we do. What do we do? We first set a key. When we get a request to set a key, we say, "Set the key."
Then, we do a few other things, that is, to update its history so that later we can look up previous values, and then to add translations. Once that is done, we make a decision. If it is not an expiring key, that is, it doesn't have a TTL, a time to live, then that's it, this is the whole picture. However, if there is an expiry to it, then we wait until the expiry happens. Then, we do a few other things: clean-up of the key, clean-up of the history, and clean-up of the translations. That's really what we wanted to do.

And even if you don't understand what these diamonds and circles and arrows mean, it would not be very hard for you to figure out what I'm trying to explain here. And I did not explain what these diamonds mean to you. You're still going to understand fairly well what is happening, the arrows and everything. It's natural to get a general sense, just looking at this diagram, of what we are doing. And the reason is that we did not talk about any of the details, or the microservices, we did not talk about APIs, we did not specify any YAML. This is a purely declarative expression of what we want to do without going into any of the implementation. And this is the mental model that could help us to onboard someone, someone who is new to the organisation or even someone who is not technical. There is nothing technical here. There is no HTTP here. It's just what we want to do.

And this mental model hides the underlying fragmentation. The fragmentation is still there. The fragmentation is important. Microservices are important. They will be there. But this mental picture gives us a very good understanding of where we are in the larger scope of things. But going back to the diagram with the microservices, they're still there, that diagram doesn't replace this one. This is what we have.
We have the key-value store that stores the keys and the values and is really fast, the history service that has historical lookups, there are translations, and then, there is the search service. We said that having a Kafka or a message broker is a good idea to keep things decoupled. But we also said it might be just a little bit better to have something more powerful there that allows us to compose our systems together. Because there are some features, like expiring keys, that don't fit inside the scope of one microservice. It would be good to combine the capabilities of these microservices in a way that we are not able to do today. So, we need some help there. And so far, I have not said what this black box could do, except that it supports timers.

What if we put the diagram here itself? So, the diagram that I explained to the new joinee is the same diagram that orchestrates these microservices, that tells this microservice, "Hey, after setting the key, call the history service and the translation service in parallel." This black box, which I was calling the control bus, also becomes the universally understandable diagram. So, the diagram stays in sync with the system. This diagram composes the system, comes out of the box with composability patterns, like parallel execution, conditional execution and everything. And that really brings the whole thing together.

So, the diagram is no longer a historical reflection that I drew on the whiteboard and wiped off the moment the meeting ended. This becomes the specification, the definition of the system, because it sits in the system. It is not throwaway work, it doesn't live on Confluence. It is inside the system. So, this is a diagram that is the declarative specification of the system.
That ultimately becomes the orchestration composing the different domain microservices that we have, ultimately to solve the problem of fragmentation by gluing the real microservices together, because that glue was missing. So, we have this orchestration piece, and we have the domain services. The domain services do the heavy lifting, build the product feature, implement the translations, implement the history.

So, when an API request comes, the orchestration layer literally looks at the diagram to say what to do next. So, in this case, the first thing is a set key operation. And somewhere in the diagram metadata, I'm not showing it here, it is specified that it has to go to the key-value store. So, the orchestration layer looks at the diagram in the same way that even I would and says, "Go to the key-value store to do this." Then, it sees a diamond with a plus sign, which is a standard symbol in this diagramming notation that means do things in parallel, which means it will call the history and the translation service in parallel.

Now, this is the composability that I was talking about. This is something that microservices individually cannot do. Because this parallel execution needs to be done by something central. So, in this case, updating history and translations means composing the microservices' capabilities together through the orchestration layer. And then, there is another diamond with an X, which means conditional execution. So, this condition is checking, is it an expiring key? If it is, and in this case it is, it waits. This waiting logic, or the parking until the next time to execute, happens at the control bus layer or the orchestration layer. And when the time comes, it does a few things in parallel, that is, to clean up all the data from the downstream microservices.
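The walk through the diagram described above (task, parallel fork and join, conditional branch, wait, parallel clean-up) can be sketched as a small interpreter. This is only an illustration under stated assumptions: the `FLOW` list is a hand-written stand-in for the diagram rather than real BPMN, the handler functions stand in for calls to the domain services, and the wait is injected as a `sleep` function so that the example does not actually block.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the storage of the three domain services (hypothetical).
kv, history, translations = {}, {}, {}

def set_key(ctx):
    kv[ctx["key"]] = ctx["value"]

def update_history(ctx):
    history.setdefault(ctx["key"], []).append(ctx["value"])

def add_translations(ctx):
    translations[ctx["key"]] = {"en": ctx["value"]}  # placeholder translation

HANDLERS = {
    "set_key": set_key,
    "update_history": update_history,
    "add_translations": add_translations,
    "cleanup_key": lambda ctx: kv.pop(ctx["key"], None),
    "cleanup_history": lambda ctx: history.pop(ctx["key"], None),
    "cleanup_translations": lambda ctx: translations.pop(ctx["key"], None),
}

# Hand-written stand-in for the diagram: one entry per diagram element.
FLOW = [
    ("task", "set_key"),
    ("parallel", ["update_history", "add_translations"]),   # diamond with a plus
    ("exclusive", lambda ctx: ctx.get("ttl") is not None),   # diamond with an X
    ("wait", "ttl"),                                         # timer
    ("parallel", ["cleanup_key", "cleanup_history", "cleanup_translations"]),
]

def run(flow, handlers, ctx, sleep):
    """Execute the flow step by step, as the orchestration layer would."""
    with ThreadPoolExecutor() as pool:
        for kind, arg in flow:
            if kind == "task":
                handlers[arg](ctx)
            elif kind == "parallel":
                # Fork: start every branch. Join: wait for all of them.
                futures = [pool.submit(handlers[name], ctx) for name in arg]
                for future in futures:
                    future.result()
            elif kind == "exclusive":
                if not arg(ctx):
                    return "ended-without-expiry"
            elif kind == "wait":
                sleep(ctx[arg])
    return "ended-after-cleanup"

waited = []  # records requested waits instead of really sleeping
result = run(FLOW, HANDLERS, {"key": "k", "value": "v", "ttl": 600}, waited.append)
```

Passing `time.sleep` as `sleep` would make the wait real. A production orchestrator would instead persist the parked process and resume it later, which this sketch does not attempt.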
So, the diagram that I was talking about, which was easy to explain, was also something that was actually driving the communication between the microservices. And this is really powerful because all of the features, all of the important bits, are in the domain-level microservices. But the diagram that humans understand, universally almost, is what is orchestrating the system. And there's a lot of power in that, bringing everything together.

And this is called Business Process Modelling, where we take a business flow or a user journey, and we model it as a pictorial representation where each of these symbols has a specific meaning. It's called Business Process Modelling, and this notation is called BPMN. It's an open standard. It's available at bpmn.org. Ultimately, this diagram, which we see as an image, compiles down to XML. So, anybody willing to implement it could build this orchestration layer, because it's an open standard. We can do that or use off-the-shelf products, like Flowable and Camunda. But the important thing here is that there are composability constructs that allow microservices to come together in a way that individual microservices are not able to, like the parallel execution, like the conditional execution, like the wait-for semantics, and many others that I did not talk about. And because it is a diagram that goes into the system, it is understandable not just by humans, but also by systems. And that is really powerful because we live in a socio-technical world. We don't just exist in a technical world. And this is the power of BPMN.

So, what happens inside one of these boxes in the code? If we take one of the implementations (it doesn't have to be like this, but this is a Spring Boot application that uses one off-the-shelf product), we can annotate a part of the code to say that this part of the code represents this part of the diagram by using a key. Here it's SET KEY.
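To illustrate both points above, that the diagram is XML underneath and that code is bound to a diagram element by a key, here is a hedged Python analogue. It is not the Spring Boot code shown in the talk: the `task` decorator and `HANDLERS` registry are hypothetical, the XML is a simplified fragment using BPMN 2.0 element names (sequence flows and the timer are omitted for brevity), and treating the element `id` as the binding key is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

BPMN_NS = {"bpmn": "http://www.omg.org/spec/BPMN/20100524/MODEL"}

# Simplified fragment of what a BPMN diagram looks like once saved as XML.
DIAGRAM_XML = """<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <bpmn:process id="expiringKey" isExecutable="true">
    <bpmn:startEvent id="start"/>
    <bpmn:serviceTask id="SET_KEY" name="Set key"/>
    <bpmn:parallelGateway id="fork"/>
    <bpmn:serviceTask id="UPDATE_HISTORY" name="Update history"/>
    <bpmn:serviceTask id="ADD_TRANSLATIONS" name="Add translations"/>
    <bpmn:exclusiveGateway id="isExpiring"/>
    <bpmn:endEvent id="end"/>
  </bpmn:process>
</bpmn:definitions>
"""

HANDLERS = {}  # key from the diagram -> function that implements it

def task(key):
    """Hypothetical decorator: marks a function as the code behind one task."""
    def decorator(fn):
        HANDLERS[key] = fn
        return fn
    return decorator

@task("SET_KEY")
def set_key(variables):
    # Domain logic would go here; the orchestrator decides when it runs.
    return f"stored {variables['key']}"

root = ET.fromstring(DIAGRAM_XML)
task_ids = [el.get("id") for el in root.findall(".//bpmn:serviceTask", BPMN_NS)]

# Tasks drawn in the diagram that no code has claimed yet.
unbound = [task_id for task_id in task_ids if task_id not in HANDLERS]
```

Comparing `task_ids` with `HANDLERS` is one way a system could notice drift between the diagram and the code, for example a task that is drawn but has no implementation.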
And whenever this is executed by the orchestration layer, this method would get executed. And if you're familiar with Spring Boot, this looks no different from any other component that is there in Spring. And this is the real power of it. We have the blueprint or the mental model right next to us when we are building something. That grounds us in reality. We always have a clear picture of where we are going with the implementation. Similar to building a house, when we have a blueprint right next to it, we're not always looking into it, but we know what we are building while we are building, say, a part of the building.

And this allows us to focus on the essential complexity of our domain. Essential complexity, the term comes from No Silver Bullet, the paper that Camille mentioned in the keynote this morning as well. The essential complexity is what we really set out to solve, that is, a key-value store with historical key values, searching in different languages, expiring keys. But we spent time just doing repeated implementations, like timers in every service, or prolonged discovery, where we just spent time discovering our own systems, or building custom orchestrators, or just ending up with a very coupled system. These are not challenges that we set out to solve. I didn't wake up in the morning thinking, "Oh, today, I'll do some prolonged discovery." That is never the case. What we would like to do is to spend more time building features, building domain microservices. And a little bit of help from another system could bring it all together to help us build more product features.

So, is this the future of socio-technical systems? Well, I can't predict the future. Maybe you can. But one thing I can say for sure is that we cannot influence the future if we do not question the present.
And although microservices gave us a lot of benefits, and I think that microservices are here to stay, they did leave us with one problem that is not solved. That is fragmentation. And fragmented systems are accidentally complex. They're difficult to change, they're hard to understand. BPMN diagrams are kind of like the glue that allows systems to be more composable. So, we have these microservices that take on a lot of the product responsibilities. But they are not composable just on their own, right? So, having a layer, a control layer or an orchestration layer, that composes microservices together would be a real benefit. But also, because it is a diagram, this is a unified mental model that everybody would understand. And we would not have to rediscover our systems multiple times.

And that really is the power of diagram-driven development, because it brings systems and humans closer together to build organisational cohesion, so that we could really solve the problems that we want to solve and not get stuck with problems like coupled systems or re-implementations. And that's why I would argue that the future of socio-technical systems, if not BPMN, has to do with solving fragmentation. Thank you. [applause] [outro music] [music ends]