Failure Detection in the Era of Gray Failures – Lifeguard in HashiCorp Tooling

[Applause] Hey everybody, thank you very much for being here. I realize I stand between you and lunch, so, great responsibility: I will try to both entertain and educate you as far as I can.

Hopefully this is a familiar kind of picture to many of you. It doesn't really matter whether it's a cluster or just some large group of instances that you need to manage; one of the things you have to do with these, of course, is have a way for your services to discover one another and to make those services available in a highly available way, so you need to do some failure detection. I guess from the demographic there's a chance that you're using HashiCorp technology to do this. Maybe not; we like to play nicely, and you don't have to use all of our tools, but Consul of course is one possible solution for doing this.

When you decide who you're going to put your trust in for your failure detection solution, there are a number of criteria you use to evaluate the technology. We appreciate that you have many choices in this area, and of course we at HashiCorp, when we were looking to understand the best algorithm to use inside of Consul, had the same criteria for evaluating the right technology. These are general ways you would evaluate any technology. How does it perform? Performance in this context includes several dimensions: what is the latency, meaning if there's a failure, how quickly are we going to detect it and how quickly will we propagate that information to everybody in the system? What's the error rate like: are there going to be false positives, are there going to be false negatives, and how many? Also in this area there's the consistency model. We actually offer both: Consul has a strong consistency model, so if you need everybody to be on the same page it's possible to go to the Consul servers and get one definitive view of the health of the nodes, or, if you don't need that, you can go with a weaker model, use Serf, and get something that's more fresh. Then for efficiency, what is the overhead of all of this: how many messages are you going to send on the network, and how much CPU and memory will you use on the instances where you're running this agent? And then reliability. In this particular case that's a sensitive subject, because what's the point of having a failure detection mechanism if it itself is not reliable? We need to know when there's a failure, and we can't have an infinite regression of turtles all the way down; the buck stops somewhere. You want to know that this thing is a reliable system.

But there's also a special criterion when you're thinking about failure detection: you need to think about the failure model, which you can compare to the threat model in security. This model defines the domain, the things that are going to be visible to you: what are the semantics, what are the ways that things can fail? So, whether you realize it or not, you're always shopping based on the failure model of your failure detection system. Luckily there are many failure models to choose from, and this is a nice diagram that Stefan Hollander from the Technical University of Vienna put together; he teaches a great course, by the way.
All of the things that I reference: at the end I'll give you my Twitter handle, and I've just tweeted the links to all of these things, so don't scramble to write down URLs; you can go find all of this stuff afterwards.

Many people have referenced Stefan's writing about this. There's a hierarchy of failure modes here, and we're starting at the outside with the most general, arbitrary failures. The definition of a Byzantine failure is that it is absolutely arbitrary: it could fail in any kind of way, it doesn't have to stop, and it covers all sorts of horrible things, including malicious and collusion situations where people are actually trying to bring your system down by injecting confusing, malformed, or conflicting messages. So it's the most general thing we could try to recover from. Then with each ring, going inward, we make some simplifying assumptions, reducing the scope and making it a more tractable problem: how can we actually implement a failure detector that can handle this particular class of failure? Right in the middle you have fail-stop: not only does the node crash, it crashes nicely, and I'll come back to that in another couple of slides.

So, Byzantine fault tolerance, BFT. Byzantine failures were on the outside of that diagram, the most general case. They come from a related problem, the Byzantine generals problem: generals who are trying to coordinate an attack by sending messages to each other, except one of them might be a bad actor, a spy for the other side. Is there a way to make a protocol that enforces consistency here? It's really a consensus problem, and not surprisingly it was Leslie Lamport who posed the Byzantine generals problem, the same person who gave us the Paxos consensus protocol. So great: this is very general, we can handle arbitrary faults, so you would think this is something people would aspire to and we should be seeing it out there in the marketplace, right? Well, it turns out it's not widely deployed, and I think you can see this for yourself: there's no open source project out there that everyone is jumping onto to use this stuff. OK, well, maybe you need to be Google or Microsoft. Leslie Lamport works at Microsoft, and I'm going to give you the citation later for a paper from Microsoft where they say that BFT is not being used in a production system inside of Microsoft (in fact I think they mean generally in the world, but they would certainly know about Microsoft). And then Google, who took Paxos and made it famous, are also asserting that BFT is not ready for prime time. There are sensible reasons for this: when you look at the BFT class of systems, they're complicated protocols, and even where they've been simplified and made more tractable, they remain complicated; to tolerate f failures you typically need 3f+1 instances, so there's a lot of overhead. As a consequence of all that, the way they scale, the number of messages, the number of instances, it's a tough thing to do, which pretty much explains why this stuff is not widely adopted.
You could also say it's probably overly pessimistic for the data center. We don't typically have Byzantine generals in the data center; you want to do something more lightweight, and you want to defend against accidental failures, not somebody being malicious, unless it's something super sensitive. Bitcoin, it turns out, actually is a great candidate for BFT, because everybody is going to try to game the system and it has to be absolutely rock solid in a fully distributed manner. If you go to any of the conferences working on Bitcoin, you'll see Byzantine fault tolerance front and center.

What you do see, when you look at the literature and at real systems, is that everyone has really come down to either the crash failure model, sometimes called crash-stop, or the fail-stop failure model. Crash failure just says the thing stops: it no longer sends messages, and the only promise is that the process we were monitoring disappears. Fail-stop is stronger: it also requires that the node stops, but it's a model where people try to make things fail on purpose, in a consistent manner, so that the remaining healthy nodes can recover the global state of the system. Clean up after yourself; be a thoughtful colleague as you crash out in flames. These are all simplifying assumptions, but the crash failure one seems pretty reasonable, right? You'll see many papers and many concrete implementations asserting that crash failure is their model; they can't go further and promise fail-stop, but hopefully crash failure is good enough.

One of the many systems that uses this model is SWIM, and SWIM is the protocol we use underneath Consul. We evaluated a lot of other technologies before picking SWIM. It was first published at the IEEE DSN conference in 2002, and the paper says very clearly that crash failure is the expected model in this domain, and that is therefore the model they support. So SWIM is on board with doing things the realistic way. When we evaluated SWIM for our own use, and indirectly for yours, there were a bunch of criteria: we knew we needed something that scaled; we wanted it to be robust both to network problems and to local node problems (again, there's no point having an unreliable failure detection system); and we wanted it to be easy to deploy and manage. It turns out SWIM has some really nice properties, and the way it's architected really helps with this.

It's a peer-to-peer system. Consul has servers, but not for SWIM: the SWIM part is down in the agent, and a Consul server also happens to run the agent, so from that perspective it's just another member of the group. It's completely symmetric: there are no special nodes to administer and no special nodes to worry about recovering when they fail. It uses randomized communication, which is really nice because it makes correlated failures much less likely: in each round of communication, each node picks some random other member of the group to talk to, and that gives real strength to the patterns of communication. We'll come back to that a little later. And within that communication it uses gossip: it's not the case that one node has to talk to all the other nodes, or even ten percent of them, within a certain amount of time. Information is passed from node to node in an epidemic, viral, contagion model; it spreads through the network by hops, and there's been a lot of theoretical work done on this. To be clear, this is probabilistic, so you can have that perfect storm where, once in a while, somebody doesn't get a particular message for quite a long time. We have extended SWIM with a periodic direct sync between nodes, which encourages it to converge a lot faster, and we have a hard upper limit. But in the general case (there's actually a simulator on our website where you can see how quickly it converges), even on the order of thousands to ten thousand nodes, convergence is very fast.
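To make that epidemic picture concrete, here is a minimal, self-contained simulation sketch (this is not Consul or memberlist code, and the group sizes and fanout are just illustrative assumptions) showing that a rumor started at a single node reaches the whole group in a number of rounds that grows roughly with the logarithm of the group size:

```go
package main

import (
	"fmt"
	"math/rand"
)

// roundsToConverge simulates push-style gossip: each round, every node that
// already knows the rumor forwards it to `fanout` randomly chosen peers.
// It returns the number of rounds until every node has heard the rumor.
func roundsToConverge(n, fanout int, rng *rand.Rand) int {
	informed := make([]bool, n)
	informed[0] = true // the rumor starts at a single node
	count, rounds := 1, 0
	for count < n {
		next := append([]bool(nil), informed...)
		for i := 0; i < n; i++ {
			if !informed[i] {
				continue
			}
			for k := 0; k < fanout; k++ {
				peer := rng.Intn(n) // a random peer each time, as in SWIM-style gossip
				if !next[peer] {
					next[peer] = true
					count++
				}
			}
		}
		informed = next
		rounds++
	}
	return rounds
}

func main() {
	rng := rand.New(rand.NewSource(1))
	for _, n := range []int{100, 1000, 10000} {
		fmt.Printf("n=%5d rounds=%d\n", n, roundsToConverge(n, 3, rng))
	}
}
```

Going from a hundred nodes to ten thousand only adds a few rounds, which is the intuition behind that fast convergence.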
So SWIM is mostly excellent. We've been using SWIM for, I think, five years now, and we've grown as the community has grown in the scale people run this at. Early on it was on the order of tens or hundreds of machines; then we started to hear about people doing thousands of machines; then we ran a little survey and found plenty of big users routinely deploying a single Consul cluster of more than six thousand machines; and just in the last few months I'm hearing ten thousand machines. So this thing scales really well. We also use it directly: you can of course use Consul, and people do use Consul, to build availability for your services, but we also put the SWIM implementation directly into Serf and Nomad, so all three share the same library. It's called memberlist, and it's a HashiCorp open-source library.

However, while it is generally excellent, there were occasions where users would come to us, sometimes a little bit panicked, with a kind of escalation: hey, we really need your help, we've got this weird situation, help us out. Over a period of time there were a number of debug sessions, and a picture started to emerge of what the problem was. Here I'm going to use a distributed denial-of-service attack to motivate the problem, but generally speaking there's a whole class of problems, and I'll point that out in a second. So here we are: we've got that same cluster, and you have some edge nodes responsible for ingress and egress; maybe they're running firewalls, load balancers, web servers, whatever the right thing is in this particular environment, and they're getting hammered by a distributed denial-of-service attack. OK, fair enough, we expect some of those nodes to fail; they're overloaded. But the peculiar behavior is that, in addition, we see some flapping nodes, and these nodes are not directly under attack; they're in the interior of the cluster and shouldn't be affected. What do I mean by flapping? Here's a healthy node (these little paler-colored ones are healthy): it's being marked as failed by Consul, then a little while later it comes back, then it's immediately marked as failed again, then healthy again. If you went and looked at the logs for that instance you'd see no problem: there's plenty of CPU, there's plenty of network connectivity, it's not under attack. Even more disturbingly, other healthy nodes think this healthy node is sick. So what is going on here?
This is not cool. As I say, the DDoS is a concrete example, but we saw this arrive in a few different ways: web servers overloaded; a video transcode service, which doesn't have to be on the edge, it can be in the interior of your network; somebody using burstable instances, the AWS t2.micro, where on average you're not allowed to exceed 10% of the CPU budget, so you can do a burst of work, but the budget gets depleted, and if you hit zero budget you're throttled. The common thread is that there's some resource depletion, typically CPU, sometimes network, at some nodes, and it's making other, healthy nodes appear unhealthy. So this was the mystery. We dug in, and we have a nice solution, and on Thursday I will be in Luxembourg presenting it at DSN, which is very nice: this is the conference where SWIM was originally published, so it's nice to come back there, and we've been in communication with the lead author of SWIM, who likes this work and is very happy that we've done it.

I'm not going to go over the solution in detail; you can watch the talk I gave at HashiConf last year if you want the details, and I also recommend you read the paper, which is freely available on arXiv. But to set up what I am going to talk about today, I have to give you a little high-level summary of what was wrong, starting with a very quick description of SWIM. All you have to know is that SWIM is a distributed systems protocol with a number of steps. In step one, node A directly probes node B: it's saying, hey, are you alive? It sends a ping, a probe, and if it gets back a response within a timeout, all right, you're OK, I move on. If not (denoted by the X), we move on to step two: OK, I couldn't talk to you directly, so let me check whether it's just a connectivity problem between you and me; let's go around, and I'll ask these other guys to talk to you. That's the ping-req: they make the probe, and if it gets through and the ack comes back to me, I'm still happy. If it doesn't come through, then we move on and start gossiping rumors: hey, I think that guy's dead. At that point we're going downhill.

Interestingly, in the SWIM paper (2002, Cornell University, an Ivy League school) they had 55 computers. They thank Werner Vogels, CTO of Amazon/AWS, who at the time was a professor there and helped them get those 55 computers together. That was large-scale distributed systems at a university in 2002, and it's amazing how well this thing works given that that was the only scale they had; we had people take this way past a thousand machines before they really started to hurt. But even at fifty-five nodes they saw something: sometimes they would get lots of false positives. They tracked it down, and it was because sometimes a message isn't processed fast enough; there's slow message processing. So they added another mechanism, the suspicion mechanism. I won't give all the details, but it's a way to give nodes longer to come back and refute, to work around slow processing. What we discovered is that, unfortunately, even that remediation has the same vulnerability: it still requires some of the messages to be processed in a timely manner.
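As a rough sketch of that probe sequence in code form (hypothetical types and stubbed helpers, not the actual memberlist API), one protocol period from node A's point of view looks something like this:

```go
package swimsketch

import (
	"math/rand"
	"time"
)

// Node is a hypothetical stand-in for a cluster member; the three methods
// below are stubs for the real message exchanges.
type Node struct{ Name string }

// Ping sends a direct probe and reports whether an ack arrived in time (stub).
func (n *Node) Ping(target *Node, timeout time.Duration) bool { return false }

// PingReq asks proxy to probe target on our behalf and reports success (stub).
func (n *Node) PingReq(proxy, target *Node, timeout time.Duration) bool { return false }

// GossipSuspect enqueues the rumor "I suspect target is dead" for gossip (stub).
func (n *Node) GossipSuspect(target *Node) {}

// protocolPeriod is one SWIM round from n's point of view: a direct probe,
// then indirect probes via k random proxies, and only then a suspicion.
func (n *Node) protocolPeriod(members []*Node, k int, timeout time.Duration) {
	target := members[rand.Intn(len(members))] // randomized peer selection

	// Step 1: probe the target directly, with a timeout.
	if n.Ping(target, timeout) {
		return // acked in time: considered alive, move on
	}

	// Step 2: maybe it's just the network path between us, so ask k
	// randomly chosen peers to probe the target on our behalf.
	for i := 0; i < k && i < len(members); i++ {
		proxy := members[rand.Intn(len(members))]
		if n.PingReq(proxy, target, timeout) {
			return // an ack came back via a proxy: still alive
		}
	}

	// Step 3: no ack either way, so start gossiping the rumor that the
	// target is suspected dead, and let the epidemic spread or refute it.
	n.GossipSuspect(target)
}
```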
So the problem we have is this: node A was suspecting that node B was dead, and it gossiped that rumor: hey, I think that guy's dead. What the suspicion mechanism is good at is the case where some node in the middle, node D say, is actually dead and can't help: the epidemic nature of the suspicion mechanism floods around any slow or dead nodes, so if there's a fast path to get the information from A to B it will get there, and if there's a fast path to get the refutation back, it will come back. Unfortunately, if the node that was asking in the first place is itself slow, the reply can already be sitting on the machine; it can have come into the kernel, it can be somewhere in the device driver, or in the protocol buffer, or even in a Go message queue. It just hasn't been processed yet, and the timeout fires. So there's a fundamental flaw here, the exhaust port of their Death Star, the flaw that unfortunately they were not able to patch, and we had plenty of users who ran into it at scale.

Our general approach to fixing this: we realized that there are all these messages, and we know the protocol, we implement the protocol, so when we send a message there's an expectation; we know whether we should receive a reply. It's just some basic accounting. You keep track: I've sent three messages out in the last 200 milliseconds, so I really should get three replies back in the next 500 milliseconds. Because of that fully randomized communication you're very well connected to the whole network, so while it's possible you got unlucky and talked to three dead guys, that's much less likely than that you yourself are becoming disconnected from the network, unless your whole data center is going down, in which case you have other problems to worry about. Actually, it's robust even to more than 50% of the nodes going out. Basically what we're saying is: I have an expectation of how many replies I should receive, so if I'm isolated, even the absence of messages and the absence of replies is a signal to me. That's nice: it means you can be completely isolated, disconnected from the network, and realize, hello, I should slow down here, I should give these other people more time to reply to me. If it's an intermittent problem, if you're getting some packets or bouncing along with the CPU throttling, you're going to give other people the benefit of the doubt. This works really nicely.

So the false positives were the problem, all these nodes being accused of being dead when they weren't, and there are three components to the solution. On the left you have the results without any of the components; on the right you have the full Lifeguard solution with all three; and in the middle, just for interest, the impact of each individually. Each has an effect, but when you combine them there's a synergy: at each stage of that pipeline we're knocking down the false positives, and it's really powerful. Once this was deployed, those kinds of support requests stopped, so that's a good thing.
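That accounting idea is simple enough to sketch. Here is a minimal illustrative version (hypothetical names, not the actual memberlist or Lifeguard code) of a local health score that rises when expected replies fail to arrive, falls when they do arrive, and stretches the timeouts this node imposes on everyone else:

```go
package lifeguardsketch

import (
	"sync"
	"time"
)

// localHealth is an illustrative "am I the problem?" counter: every missed
// reply nudges the score up, every timely reply nudges it back down, and the
// score stretches the timeouts this node grants its peers.
type localHealth struct {
	mu    sync.Mutex
	score int // 0 = healthy; higher = give peers more benefit of the doubt
	max   int // cap so a node never waits unboundedly long
}

// observe records whether an expected reply arrived within its deadline.
func (lh *localHealth) observe(gotReplyInTime bool) {
	lh.mu.Lock()
	defer lh.mu.Unlock()
	switch {
	case gotReplyInTime && lh.score > 0:
		lh.score-- // evidence that we are processing messages promptly
	case !gotReplyInTime && lh.score < lh.max:
		lh.score++ // an expected ack never arrived: maybe *we* are slow
	}
}

// probeTimeout scales a base timeout by the local health score, so an
// overloaded or isolated node accuses its peers less aggressively.
func (lh *localHealth) probeTimeout(base time.Duration) time.Duration {
	lh.mu.Lock()
	defer lh.mu.Unlock()
	return base * time.Duration(lh.score+1)
}
```

With something like this in place, a node that is starved of CPU sees its own expected acks going missing, its score climbing, and its timeouts stretching, so instead of accusing healthy peers it gives them the benefit of the doubt.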
Also, something really cool: we found that you can now tune the parameters of SWIM. With those bad cases out of the way we can be more aggressive. There's a knob where you basically choose the median detection latency: on average, how long before a healthy node sees that an unhealthy or dead node has died. You can leave that where it was and be really aggressive on false positives, with more than a 50x reduction, down to 1.9 percent of the original false positives; or you can balance it and say, you know what, I'll take a 20x to 25x reduction in the false positives, but I'll also take a 15x reduction in the detection latency in the healthy case. This was an unexpected bonus, a nice side effect of the work; we weren't looking for it, but because we had eliminated a class of fault we were able to tune the system more aggressively.

All right, so mission accomplished, right? We go home and move on to the next project. Well, maybe not, because why did this happen? How did we get into this situation? You want to understand this fundamentally. It turns out SWIM sits in a line of work: if you go back to 1996, people at the time were worried a lot about multicast and virtual synchrony, delivering messages to everybody simultaneously. It was a different era, not the cloud era, but a very important paper came out talking about unreliable failure detectors. The notion is that there are these processes monitoring one another, hopefully with some redundancy in that monitoring, so that when somebody fails, someone else is still doing the monitoring. There's a whole raft of ways this can be implemented; it's not tied to whether you use heartbeats, or tokens in a ring, or the probes that SWIM does. Those are all alternatives. It's an abstract model of a failure detector, and again there are different topologies: rings, hierarchies, or the randomization that SWIM uses. The key point is that each individual failure detector is allowed to be unreliable, to have false positives, and then you overlay a protocol on top that filters out the false positives. SWIM fits into this; SWIM is an instance of an unreliable failure detector.

Within this abstract model there's the concept of a local failure detector module. Practically speaking, this is the memberlist part of the Consul agent. We said we have a group of processes (that's the generic term) monitoring one another, and each of those processes has embedded within it, or collocated with it (it doesn't have to be the same process, but it's on the same box, exposed to the same conditions, so it acts as a proxy), a failure detector monitoring some of its peers. When this work was first published it was all about thinking about the far end: is that guy alive, is that guy alive, is that guy alive, sending probes or heartbeats or however you want to do it. Over time people realized that the characteristics of the network are a form of interference, so they started to build fancier models where you learn, over time, a model of the network's behavior and then subtract it out: you de-noise, you apply a filter to remove some of the bias you're seeing. You might falsely accuse process B, by not waiting long enough, because of network problems that you could actually have figured out were occurring, since you saw how the timing of the heartbeats had been degrading over the last twenty seconds or so. Many, many works followed up on this; many publications.
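In code terms you can picture that abstraction roughly like this (an illustrative sketch, not an API from those papers or from memberlist): each process hosts a local detector module whose verdicts are allowed to be wrong, and the protocol layered on top subscribes to those verdicts and filters or refutes them.

```go
package fdsketch

// Verdict is one local, possibly wrong, opinion about a remote process.
type Verdict struct {
	Node      string
	Suspected bool   // true = "I currently think this process has crashed"
	Evidence  string // e.g. "3 probes timed out", purely illustrative
}

// UnreliableFailureDetector is the abstract shape of the local module in
// this line of work: it may emit false suspicions and later retract them;
// the protocol layered on top (SWIM's suspicion and refutation gossip, for
// example) is what recovers accuracy from these unreliable local verdicts.
type UnreliableFailureDetector interface {
	Verdicts() <-chan Verdict // stream of suspicions and retractions
	MonitoredPeers() []string // which remote processes this module watches
}
```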
We read them all, and not a single one talks about the health of the actual failure detector itself, this local module. They're all focused on the far end; some of them get into that nice modeling of the network; and in the case of SWIM, with the suspicion mechanism, we worry about the peers in between who are forwarding the messages. It seems to be a blind spot, and it's kind of interesting that we've gone on so long without people paying attention to it. You have to think it's probably a side effect of adopting the crash failure model: in that world you're either alive or you're dead, so why would I care about your health? If the failure detector is working, there's no problem; if it's not working, it's gone. Fundamentally, it's interesting how a whole community can stay on the same path for 20 years.

So we were wondering about the scope of this local health idea, worrying for the first time about the health of the local detector, and where it fits in. We were reading all these papers because surely somebody must have done this before, and then we would know whom to cite, but we didn't find it. Some good news arrived in parallel with us working on this and publishing it. There's an awesome workshop called HotOS, run by the ACM, the Association for Computing Machinery. It's a really cool venue: people write short workshop papers about the big, painful problems, trying to find the next big research topic the community should move towards, and in a lot of great work you can see the seed of the idea show up at this workshop. Last year was no exception: a fantastic paper, and I would urge you to go and read it, the gray failure paper. It's a report from Microsoft, and crucially it comes from both Microsoft Research and Microsoft Azure, so it's not just the boffins over in the ivory tower; it's research and engineering working together. I know these people; the ones in research have helped build Bing, they're not ivory-tower researchers in any way, they're people who have led development teams. So this is the considered opinion of people who have been intimately involved in building one of the largest private and public cloud infrastructures. Their point about gray failure is that the idea has been out there for a long time, but they're now shining a spotlight on it: things don't always fail cleanly. There are these gray cases; it can be down to some degraded hardware, and the thing limps along. It's not dead, but it's not happy, and these are exactly the kinds of conditions that induce the problem that led us to build Lifeguard. They even call out that the simple model is overly simple. Interestingly, they call it fail-stop when the definition they give is actually crash failure; they misname it, so even these people use the terms a little loosely, and there might be a clue in that to what's going on. But they definitely nail the cause, and the cause really is scale, scale and complexity. There's a quote I love, which is pretty much the definition of the data center: the place where the unlikely becomes commonplace. We're talking about a PCI bus controller malfunction, or a cosmic ray, or whatever it is: maybe one-in-ten-million or one-in-a-hundred-million kinds of events.
But when you have a room with several hundred thousand machines in it, multiple of these issues are always going to be in progress somewhere in that room. So the good news is: congratulations, we've made it to Google-scale problems. You, and through you, us, are now wrestling with the same problems as Microsoft and Google; we're using their infrastructure, and they can't hide or isolate us from all of these problems, so we've got their scale of problems. I guess that's good news.

That phrase actually comes from another really good paper, Visigoth fault tolerance, from EuroSys. I'm not going to have time to go into it, but I'd also encourage you to read it. These folks are trying to take Byzantine fault tolerance and make it really usable. They have a particular programming model you have to adopt, which does limit applicability, and they lean on how reliable, in terms of performance, modern data center networks are. I think this is something that could work for Microsoft, provided they're running it on their latest-generation switches; not so much if you're in a random public cloud, and that's why we're not dwelling on it. But it's a great paper, we honestly are still unpacking the thoughts in it, and we may end up implementing some of its ideas eventually.

So now we have the concept of a gray failure, and as I mentioned, it strikes us that local health really is something that could help here. We've applied it to SWIM, and SWIM is kind of the easy case: I told you about that fully randomized communication, which means you're very well connected to everybody in the data center, so you get a lot of good signal about whether you are well connected or not. Heartbeat failure detectors are more point-to-point; with the rings and the hierarchies there's a bit more redundancy. We think local health could be applied there too, so this is a future research area for us and hopefully for other people. The criticism could be: hey, this is only ever going to be robust with SWIM, you're not going to make it work with heartbeat failure detectors. The interesting thing is that you will usually have multiple heartbeat detectors co-located, because you want that redundancy; everyone is checking a few other people. In the literature they never think about that colocation, and they never cross the signals between the co-located instances, because they had no need to; the detector was always assumed healthy. But once you start thinking about the detector itself having poor health: hmm, I have five failure detectors co-located, maybe I can look at the correlation, and if they all stopped getting their heartbeats at once, maybe it's me. You could also do SWIM-style randomized probing: even one message from each node per second gives you some background level of confidence in your connectivity. And if you want network coordinates (in Consul we have Vivaldi, which lets you say, route me to the closest instance), that's a further value-add of having that background radiation of once-a-second messages between peers.
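Purely as a sketch of that speculation (not something from the literature or from our tooling), crossing the signals of co-located heartbeat detectors could look something like this:

```go
package hbsketch

// correlateSuspicions looks at one monitoring window across several
// co-located heartbeat detectors: if every detector lost its heartbeats at
// once, treat that as evidence the local node is degraded rather than
// accusing all of the remote peers simultaneously.
func correlateSuspicions(missedHeartbeat []bool) (suspectRemotes, suspectSelf bool) {
	missed := 0
	for _, m := range missedHeartbeat {
		if m {
			missed++
		}
	}
	switch {
	case missed > 1 && missed == len(missedHeartbeat):
		return false, true // everyone went quiet at once: probably me
	case missed > 0:
		return true, false // only some went quiet: suspect those peers
	default:
		return false, false
	}
}
```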
There was another really cool paper at HotOS, this time from Google. I don't know if they talked to each other first, but Google and Microsoft both showed up at the same conference and started talking about the pain points they're having at scale. The most important thing Google brought to the table is common sense, but sometimes we need to be reminded of the obvious: there is an inexorable link between availability and performance. Is it dead, or is it just running really, really slowly? We seem to be having trouble with things that are running really, really slowly, so maybe we should model this as a performance problem.

OK, but still: how did we get here? Why is it that in 2017 Google and Microsoft are showing up at the same conference saying, you know what, we've been doing this fail-stop thing for a while now and it's just not working for us (while confusing fail-stop with crash-stop in their terminology)? I told you about Stefan's course; the link is on Twitter. He's been teaching that course for twenty years. It's based on his PhD thesis, which you can buy as a book. And what community is he in? He does real-time systems for automotive, hardware and software co-design; he's trying to build dependable systems. It turns out that at IEEE DSN (DSN stands for Dependable Systems and Networks) this problem has long been treated as an electrical engineering, hard-hat kind of concern: this is for trains and missiles and chemical plants, there are lives at stake, and these people take it incredibly seriously. This is an awesome paper, and I encourage everybody to read it: a fantastic, very detailed, very precise enumeration of all the aspects of building a dependable system, from the point of view of engineers with a capital E, not software cowboys, and there are some amazing diagrams in it. Here it is: not every failure is a stopping failure; there are these erratic services. Yes, 1980: they were already classifying and tackling this, the taxonomy is all laid out there, and Google got the message last year. But stopping is a strategy. This fail-stop thing is really hard, but you actually commit to doing it: you try to make your systems fail, you fail safe. It's the dead man's switch: the driver has a heart attack, he lets go of the handle, and the train stops. They had to build that; it doesn't just happen. So this is some pretty important stuff that doesn't seem to be present in our community.

So what can we do? I'll go straight to this one. I have a general prescription, which you can guess from the last couple of slides: I really think we need to start treating dependable systems as a design center. We've taken the terminology, fail-stop and crash-stop; we went over to the IEEE folks and raided their vocabulary, said, oh yeah, that's a great model, thanks, and then did absolutely nothing about adopting their practices. Of course we're in a different setting; we're not doing hardware and software co-design right now, although if you look at SDN and disaggregation in the data center, people are starting to say, hmm, if we had a switch with these special hardware qualities, maybe. So we could actually move in the direction of hardware-software co-design with this stuff. But even without that, concretely, today, we have to go beyond the binary and the generic.
Is it alive, is it dead, or is it just running slowly? Google pointed out the same thing, and people have been doing this for a while: you have to set the SLAs, you have to set the quality of service, you have to think about the application-level view of the health of your system. Is it delivering at the rate it needs to deliver to be considered a happy, healthy, active member of this population? If not, we should flip the script. Instead of letting it run until I'm sure it's dead, I'm going to kill it, it's time to go; or its own local health leads it to take itself out. So those are my prescriptions. I hope some of this has sparked some thoughts for you. All of the references are on Twitter, and I'd love to talk to people about this. Thank you. [Applause]
