Your AI is only as smart as the data that you give it - Unlocking the World’s most valuable data for the next generation of AI with Federated Computing | Eero Jyske
By leveraging the power of Federated Computing, we ensure that AI systems can access and learn from vast, diverse datasets without compromising privacy or security. This approach not only accelerates the development of smarter AI but also empowers businesses to harness the full potential of their data. As we move forward, the integration of high-quality, distributed data will be the key to creating AI that is more robust, reliable, and capable of solving the world’s most complex challenges.
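The idea of learning from distributed data without moving it can be sketched in a few lines. What follows is a minimal illustration of federated averaging on a toy linear model, not the speaker's product or its API. The function names, the three sites, and their data are all hypothetical. Each site trains on its own private data, and only the resulting model parameters are sent back and averaged.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately at one site


def local_update(w: float, b: float, data: Dataset,
                 lr: float = 0.05, epochs: int = 20) -> Tuple[float, float]:
    """Run a few epochs of gradient descent on one site's private data."""
    n = len(data)
    for _ in range(epochs):
        # Gradients of mean squared error for the model y = w * x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(updates: List[Tuple[float, float, int]]) -> Tuple[float, float]:
    """Combine site models, weighting each by its number of samples."""
    total = sum(n for _, _, n in updates)
    w = sum(w_i * n for w_i, _, n in updates) / total
    b = sum(b_i * n for _, b_i, n in updates) / total
    return w, b


def train_federated(sites: List[Dataset], rounds: int = 50) -> Tuple[float, float]:
    """Coordinator loop: only parameters travel, raw data stays at each site."""
    w, b = 0.0, 0.0
    for _ in range(rounds):
        updates = []
        for data in sites:
            w_i, b_i = local_update(w, b, data)
            updates.append((w_i, b_i, len(data)))
        w, b = federated_average(updates)
    return w, b


# Three hypothetical sites, each holding a private slice of y = 2x + 1.
SITES = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(-1.0, -1.0), (0.5, 2.0)],
]

if __name__ == "__main__":
    w, b = train_federated(SITES)
    print(f"w={w:.2f}, b={b:.2f}")
```

Note that this sketch covers only the training and aggregation step. It does nothing to stop a trained model from leaking the data it saw, and it has no access control or audit trail, so the privacy and governance concerns discussed in the talk would need separate mechanisms.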
Transcript
All right, greetings on my behalf as well. I'm Eero, representing an organization called Apheris, a pan-European startup, originally from Germany, but we have employees all over Europe. I'll be talking to you about how your AI, your models, are only as smart as the data you give them. Thank you for sticking around with the best track, track one, where we also use AI to generate the images that we present. I've shamelessly used DALL-E for most of my image generation, and you can see a bunch of typos here and there. That's on purpose: I didn't go about trying to fix the typos that the AI is making.

As An Le was saying in her presentation, she took us from a bit of the past and present to the future. There's no future unless you understand the past and present: where did we come from, where are we today, and where are we actually heading? So I'm going to take a few moments to talk first about the past, then a bit, maybe, about Tommy's world today, since you showed what you are doing, and then about how we make AI better going forward.

Le also mentioned Alan Turing. I always tell my wife that the dream job I want to have one day is teaching the Turing machine and Turing's theories in some institution, so if anybody can help me with that, hook me up during lunch. I'd be very happy to take that opportunity.

When I went to university myself, I had already done years of software development and thought I was a really clever and fantastic software engineer. The first class they make you take, or you get to take, was theory of computation. I took it and thought it was complete rubbish. Then I spent four years in university, and the last two courses I took were compilers and artificial intelligence. That's when I realized, oh, Turing actually had some clever things to say. So then I actually redid
the class as the last thing I did in university, and I've been a big fan ever since. This is DALL-E's illustration of "Turing is my superhero, please make an image", and this is what it came up with. I highly recommend a movie called The Imitation Game if you're not familiar with Alan Turing and the work that he's done. He's basically the creator of the computer as we know it today. Maybe quantum computing will change that a bit, but I'm not going to dive into those details.

The other thing I want to emphasize is Moore's law. Gordon Moore from Intel, back in the '70s already, made a projection that the density of transistors will basically double every two years with the cost staying the same. In practice that means we're going to double the computing every two years and the cost will stay the same. A lot of the companies in Silicon Valley have successfully used this knowledge and built the empires they have today by knowing and projecting what compute power will be available in the future, and planning their products and strategies accordingly.

There is a Huang's law now. I don't think it's very official yet, but it's much more than doubling: Nvidia is currently racing at something like a thousandfold compute increase with the latest chips they're doing, Blackwell. So it's certainly increasing, but similarly we can still predict what it's going to be, and companies should use this knowledge to understand what is going to be possible in the future. Don't be stuck with what you can do today, but ask what the cost will be of doing what you dream of doing in the future.

And in software engineering, what has happened during this time, from the '60s and '70s, since Alan Turing? Systems have become bigger and more complicated, we can do more things, and a lot of the work that has gone into development in software engineering is to manage this complexity, going
from low-level programming languages, from C and C++, to Java and the many web frameworks that exist today, to increase our productivity but also to allow us to build more complicated systems that we can use. What we're seeing today is really just an evolution of this, I feel. When we talk about AI and models, we're talking about data-driven algorithms. Anybody here who is a software engineer has spent time developing algorithms, which I still think is the most fascinating thing in this craft: actually creating algorithms to solve problems. I think there's still going to be space for these special-purpose, non-data-driven algorithms in addition to specialized data-driven algorithms.

But this is the evolution now: you no longer need to create the algorithm to solve every problem that you can envision. You can create a generic model, feed it data, let it learn from that data, and then have that model solve all sorts of problems you can't even imagine yourself. So the value continuously shifts more towards data. Compute is going to be valuable, so if you're looking for investment advice, I would still maybe bet on Nvidia and AMD. As they say, during the gold rush the people who get rich are the ones who make the shovels and pickaxes and whatever, and I think that's going to apply here as well. Lots of other people will be successful, but it's more hit-and-miss, whereas the folks who do the fundamentals will be successful. The other value shift will be in data: companies that currently are not utilizing their data will be able to do so more in the future.

And then maybe a bit of a controversial AI spiel here. As Tommy was saying, and I don't want to put words in your mouth, these models are in essence statistics. Those are my
words, not Tommy's words. There's nothing like magic happening: the model is given data, and it can make decisions on the input you give it based on statistics. I think it's not only in our company where this happened, but back in an old organization, AlphaSense, we started looking into building models in 2013, and folks who didn't have a background in computer science were studying how these are done. They pretty quickly said that it's just glorified statistics, that the whole of machine learning is glorified statistics. I wouldn't say glorified. It's clever, and lots of good things are being done, but ultimately it is statistics, and that's another important thing to remember.

So: compute power is going up, and it's all about data and statistics. Building algorithms is fun, and it's also going to be important, so not everything is going to be done with these massive models. They're also massively inefficient. This is an example that I see popping up every now and then nowadays: the moon lander used a single processor of 2.54 MHz, and I probably used an amount of compute to generate this image that would take me to the moon and back 10,000 times. So you can still use very low-level processors to solve very specific problems when you can bound them. The GenAI models are not going to be intended for everything, and frankly I think we should be thinking a bit about the sustainability aspect here, about how much energy these models are actually using as we move forward.

Then I'll talk a bit about some of the problems with the data. First of all, this is, I think, the maximum number I could find about how much of the internet is open: 10% of the data is something that you can actually access. The rest can be dark web, but it's mostly behind paywalls and in corporate data
centers. So only 10% maximum of the internet is open. The other thing is that most of that 10%, if you go and browse the internet, sometimes makes me lose my faith in humanity. Most of what people are uploading and generating is pretty useless. This is a DALL-E illustration of trolling, and then you have all these memes. So if you spend your time on the internet, how useful is that data really? How much of this 10% is actually valuable data that one could really access today?

And then, of the data that gets uploaded, most is actually unused. These are a couple of numbers that I think have been staying pretty static for a while: 80% of the data that gets uploaded to the internet is stored but never used, so most of those videos that you saw there, for example, nobody ever watches. And 90% becomes unusable after the first three months. So the internet is full of data which is just rotting. Maybe there is something really valuable there, but we don't know, because nobody's using it.

Dirty data was mentioned by Tommy as well. This is still a problem, and it will increasingly be a problem as we move forward. If you want to build the data collaborations that I'll talk about in a moment, how do you make sure that those data sets are compatible, and how do you make sure that the data is not poisonous in any way? So a lot of business and value will be in data harmonization and cleanup, for sure, if you're looking for business opportunities. We hit this all the time, so it's worth mentioning.

I didn't want to sound like a negative Nelly and just talk about how bad things are, that this is just statistics and it's a scam and whatever. Obviously we're still doing amazing things, like RELX is doing, and most organizations are already doing amazing things with AI today, but given all these
limitations that you have. But in the future (and this is the only image which was not generated by AI, from one of my favorite movies), what will the future be about then? Maybe to recap my talk until now: I talked about how we're on a journey from Alan Turing's creation, the computer, through creating more and more compute, to data-driven algorithms. Right now, given all those limitations with the availability of data, the next step is how you make data more accessible. Where is the data that we could actually use for the common good?

This was DALL-E's illustration of siloed data on the internet. As I mentioned, a lot of this data, the part which is not accessible, is in corporations, and corporations are using it for their own applications and for improving their own operations today. And yes, this is something you can do. You can take an off-the-shelf LLM, feed it your data, and already create a very clever customer service agent, or things to solve your internal company problems. But it's not going to help you collaborate, and it's not going to help you take the next step. Your customer service agent will only be as good as your customer service is today. If you wanted to create the best customer service ever on the planet, the way to do that would be to combine the best possible customer service data from all the customer service organizations on the planet.

So one example, and now I'll come to what Apheris is doing. We're not bound to healthcare, although right now we are very focused on healthcare as a field. The product that we built for distributed machine learning (and I'll talk more in a moment about what that means) can be used for any industry and any application, but healthcare is really where this
problem is already known and has been visible for years. There is also a staggering number about where most of the data in the world is actually being generated. Maybe not all of this goes to the internet, but of all the data that gets generated on a daily basis in the world, 30%, and growing, is healthcare. It's devices like this (most people have rings and watches), it's hospitals producing data, it's pharmaceutical and biomedical companies and government organizations generating this data.

What is also specific here, and why this is a beautiful problem for Apheris to solve as a first example case, is that this data really doesn't want to move anywhere, first and foremost for regulatory reasons. It is typically highly sensitive personal data, so it needs to be protected. The data is also valuable. Maybe it's this talk, or other talks that we'll do in the future, but it's going to be a wake-up call for many organizations to see that your data is valuable. So don't give it up for free. Make sure that you find a way to monetize it: either you monetize it by using it to your own benefit, or you find a way to sell your data to somebody else. There is massive value in potential collaboration: with better data we have better models and better results that we can collectively get.

All right, federated machine learning in a nutshell, and this is the Apheris product in the next two slides. Normally when you do machine learning or data analytics, you have your data in your own data center, whatever it is: on-premise, Azure, AWS, GCP. You're able to pull all the data together in a single data store, basically, and train your algorithms there. Everything's good. Companies have also been collaborating on data since forever,
and you can bring data together in your own joint data clean room, with certain regulatory protection and contractual protection and whatnot, and then jointly use that data for training an algorithm. But if you truly have data that does not want to leave a company's infrastructure, which I believe is going to be increasingly prominent going forward, then you have to use a federated machine learning model.

That effectively means you have multiple data centers, which I illustrated here as the green circles with the shield in the middle. There are what we call the data custodians, who own and manage the data that resides in these data centers. The data custodians always have the control to say who can access this data and with what kind of parameters. For example: organization A can run their machine learning model in my data center with three rounds of training, or whatever, and with these kinds of parameters. All of that can be governed and controlled by, for example, our product.

You can do this in a one-to-one relationship: a machine learning engineer who wants to train their model can submit it to the data custodian, have it trained on the data, and get the model back. Or you can have multiple data centers collaborating all at once, hundreds of data centers if you will. Then the model gets trained in each of these data centers separately, the results are aggregated, and the system handles the multiple training runs across all the data.

An important aspect here is really the data custodian. As we've learned, whoever owns the data needs to have, and needs to feel like they have, control over what happens, and always have the opportunity to pull the plug if needed and really see what's going on.

Three main pillars
here, because I'm shamelessly overlooking some of the complexities that clearly are still to be solved (we have some solutions for these, but lots of innovation is still to be done), are governance, privacy, and security.

Maybe starting with security, since this is the simplest one. It's the classic case: your data is in your data center, and the system will guarantee that the data will not leak out of it. It cannot be stolen and it cannot be retrieved, so the system is secure.

Privacy. Anybody here who has any experience training models will understand that if you train a model with given data, it will know that data to some extent, even to the extent that it knows the whole data. So you could basically send your model to somebody's data center, have the model trained, get your model back, and then just ask the model, "Hey, tell me all the data that was in this data center." Privacy mechanisms do exist, and more will be developed. Privacy-enhancing technologies are an area of rapid innovation at the moment: how do we guarantee, and how do we analyze a model to make sure, that it does not leak sensitive data?

The third item is governance, which is again more about access control: only the people who have the rights to access the data can do so, and a paper trail is left for every action that has been taken. So there's still a strong legal and governance aspect to these kinds of data collaborations going forward.

Great, so that creates the computational governance. I always make our head of security very happy when I describe our product as a remote code execution platform, but that's ultimately what we do: we enable executing any code in your data center, if you give the permission, given the analysis and the privacy protections that are in place.

Cool. And the future that we would like to see evolving is this: right
now, if you look at the collaborations, the way we set up a machine learning engineer to interact with multiple data centers, there's still a need for a third party like us to take part in facilitating and providing technology for the collaboration. In the future, technology is obviously still needed, but there really shouldn't be a need for a middleman in between. The goal is to make data collaboration more day-to-day, so that whoever has valuable data can install this Compute Gateway type of device in their data center, publish and announce what kind of data they have, create examples, give metadata to describe their data sets, and then monetize them to the interested parties. That creates more of a data ecosystem where data custodians and data owners can advertise what they have, and, similarly, the organizations that want to use this data can find the meaningful data sets, and this creates collaboration opportunities.

So again, going back to what I said: the direction, with value being in data, will create opportunities for organizations that don't necessarily use the data themselves but can use the data they have to enable somebody else to do research on whatever they're doing.

One example of things that are already happening on this front. I don't know how many of you follow the Nobel Prize ceremonies. I usually didn't, well, maybe the only one was the Peace Prize, but this year the chemistry prize was remarkably interesting, because one of the use cases that we are helping empower is protein structure prediction, which is fundamental to a lot of medicine, drug development, and other advances in healthcare. If we can find ways to predict various protein structures, that helps us accelerate finding cures for lots of diseases and whatnot. The two gentlemen here, Demis Hassabis and
John Jumper they a