Delivered 2017-09-16 at RESTFest 2017, Greenville, SC, USA



Prepared Talk






Raw Transcript

Below is an unedited transcript of the talk and some follow-up questions and answers.


My name is Mike. I work with The API Academy, and I thought I would talk about some microservice stuff today. We had a couple of microservice things already today, and I’m just gonna say in the beginning, this is all half-baked. This is me just working on a puzzle. So I may be off on this, and you can maybe help me with some parts of this, or maybe you can point me to some other places. So there are two things that have been going on in my head recently, and they’ve been affected by two…one is the project I worked on, the book, with some of my colleagues at The API Academy. This is "The Microservice Architecture." We started to think about how do you break this into, like, units that make sense, that, you know, are sort of grokkable, like, "How many steps? How many things do I have to consider?" Sort of like you’re working through with Atom, right? Like how many moving parts are there, and how do we start to get a hold of that?

And also a book that I’ve read a couple of times now and I really enjoy a lot…it’s called "Release It!" I’m not sure if anyone has seen this book before by Michael Nygard. It’s really about, "What does it take to create software that’s safe to release?" And he spends a lot of time pointing out that it’s not the function that’s really the challenge. It’s all the things around it…the network, and the errors, and the bugs, and the other components and the disk drives and the capacity for all these other things. So you really need to write a lot of other code. Usually, you end up writing more code to protect your system than to actually have the system run. You may have one line that does one thing, but the rest of it’s all protection. So he talks about these patterns, and it made me think about the patterns and how they applied to microservices. So for this discussion, I’m gonna call a microservice something that’s a loosely-coupled component running in an engineered system. That’s a relatively vague thing, and we could actually do a whole talk on what a microservice is, but just for the sake of this conversation, we’ll stick with that. And when I say "loosely-coupled", for this discussion, I mean it’s across the network. So it’s not running in the same memory space, it’s probably using HTTP or CoAP or Sockets or something like that.

Mike: So I mentioned Michael Nygard, and he’s got this series of what he calls "Stability Patterns." He actually…there’s two parts of the book, "Stability Patterns" and "Capacity Patterns." And these stability patterns are the ones I wanna talk about. Some of these you’ve probably seen before. Circuit Breaker and Fail Fast are two that come up quite often when people talk about creating distributed systems. I like some of the other ones…the Timeout, the Bulkhead. Steady State’s a great one. So, Timeout is, you just stop waiting for somebody. Like, you know, I don’t have time for this. If I wait for you to answer me, then I’m gonna be way too late to answer back. So I’m just going to cut it off. So Timeouts are really handy patterns. Circuit Breaker is, something failed, so I’m gonna route this to something else. I’m not gonna blow the whole system up, but I am gonna, maybe, if the database server’s not working, I’m gonna go to a backup server automatically. The Bulkhead pattern is a little bit like that. It’s basically "contain the blast damage," right? You think of bulkheads for oil tankers, right? They’re supposed to protect. When one part of the system breaks, it protects the rest of the system. So bulkheads often try to just contain the damage. One of the best kinds of bulkheads is to take the dangerous part of the system and make that a standalone component. So if that fails, then the whole system doesn’t fail. Just that part fails.
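As a rough illustration of the Circuit Breaker idea described above (stop hitting the failing database server and go to the backup automatically), here is a minimal Python sketch. The class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for this example; they do not come from the talk or from the book.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: after `max_failures` consecutive errors, skip the
    primary call and use the fallback until `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Breaker is open: do not touch the failing dependency at all.
                return fallback()
            # Cool-down elapsed: let one trial call through to the primary.
            self.opened_at = None
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller would wrap each database read as `breaker.call(read_from_primary, read_from_backup)`, where both function names are placeholders. The point of the design is that once the breaker opens, the failing server stops receiving traffic until the cool-down has elapsed.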

Steady State is things where…think about log writing. We sort of forget that you have to maintain this steady state…the health of this disk or this machine. And so we turn on debugging and it’s working fine and so on and so forth, and about four or five days later, suddenly we run out of disk space. We forgot to clean up. We forgot it’s no longer in a steady state. It’s actually in this growing state that’s gonna be troublesome. So a good pattern is to use steady state checking. Is there enough disk space? Do I need to purge some records? Do I need to move some things? Do I need to turn things off?
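The two questions here ("Is there enough disk space? Do I need to purge some records?") can be sketched in a few lines of Python. This is only an illustration: the function names, the 10 percent free-space threshold, and the seven-day age limit are assumptions, and the purge step only selects candidates rather than deleting anything.

```python
import shutil


def free_fraction(path):
    """Fraction of the disk holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def select_expired(files, now, max_age_seconds):
    """Return names of files older than `max_age_seconds`.

    `files` is an iterable of (name, modified_time) pairs, times in seconds.
    """
    return [name for name, mtime in files if now - mtime > max_age_seconds]


def steady_state_check(path, files, now, min_free=0.10,
                       max_age_seconds=7 * 24 * 3600):
    """If free space is below `min_free`, list the log files worth purging."""
    if free_fraction(path) >= min_free:
        return []  # still in a steady state, nothing to do
    return select_expired(files, now, max_age_seconds)
```

Running a check like this on a schedule is one way to notice the "growing state" before the disk actually fills up.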

Fail Fast is sort of the opposite of timeout. Fail Fast is, "You know what? I just did the math. This is gonna take me too long to process, so I’m just gonna tell you no." And Netflix actually talks a lot about this. It’s one of the ways they do testing in their system. They actually have a sort of a budget header kind of deal. "I need a response in 100 milliseconds." And then the next…if they have to call another service after that, they sort of take what they think their time is and they say, "Well, he needs this in 50 milliseconds." Or then they, you know, "She needs it in 20 milliseconds." Finally they get down to the last person, and that last person says, "10 milliseconds? That isn’t enough time." And you just say, "Fail, fail, fail." So that way you don’t get stuck.
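The budget arithmetic in that example (100 milliseconds, then 50, then 20, then 10) can be written out as a small Python sketch. The service names and the per-service time estimates are invented for illustration, and this models only the arithmetic, not any real header or any particular company's implementation.

```python
class BudgetExceeded(Exception):
    """Raised by a service that can tell up front it cannot finish in time."""


def run_chain(budget_ms, services):
    """Walk a call chain, handing each service whatever budget is left.

    `services` is a list of (name, estimated_ms) pairs. Each service fails
    fast if its own estimate is larger than the budget it was handed.
    Returns the list of (name, budget_handed_ms) pairs on success.
    """
    handed = []
    remaining = budget_ms
    for name, estimated_ms in services:
        handed.append((name, remaining))
        if estimated_ms > remaining:
            raise BudgetExceeded(
                f"{name} needs {estimated_ms} ms, only {remaining} ms left")
        remaining -= estimated_ms
    return handed
```

With a 100 ms budget and estimates of 50, 30, and 10 ms, the three services are handed 100, 50, and 20 ms in turn. A fourth service that needs 15 ms would be handed only 10 ms and would refuse immediately instead of starting work it cannot finish.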

And then finally, Handshaking. There are two kinds that he talks about, which is a health check, like, "Are you good? Are you okay? We working? Are you all right?" And the other one is negotiation. "Do you have enough space? Is this the right format? Is this the right protocol?" And I threw in one of his capacity patterns for caching in this as well. So thinking about these…these are all really handy, but when do I use them and why do I use them? And that’s when I thought of the three microservice components.

So I just, you know, I make these large scale breaks in these things. This may not be the best. There may be more of these. Maybe there’s a better way to break these up, but these are the three I’ve been thinking about. Stateless, persistent, and aggregation. So stateless microservices are like converters or translators, or maybe image processors or text readers. They count words or something like that. They’re stateless. You give them something and they give an answer back. They don’t store anything, and they don’t talk to anybody else. These are sort of the classic microservices. Everybody says, "Microservices are great, because they’re stateless." Well, that’s fine if they just take inputs, manipulate them, and give them back. There’s no dependence on any other service, there’s no local…not even local storage that they have to deal with. So what kind of stability patterns would come in handy, along with the caching pattern that I stole from his capacity…I think Fail Fast is a really good example. Big matrix math problem, I can tell you right now this is gonna take too long, I quit. Right? So I think when you’re doing those stateless ones, there’s not a lot of those other patterns that come in handy. But when you go to persistence, when you go to storage, whether it’s mostly reads, or reads and writes, things get complicated really quick. Now I’m dependent on local disk I/O. Possibly, I’m all on one VM or I’m all in one 1U, but maybe I’m in another rack. Maybe I’m in the same rack network, even though I’m not on the same, like, distant, like, HTTP kind of experience. So now there’s other things that are gonna be dangerous for me.

So now, timeouts are important. If it takes too long to run the query…maybe if I don’t get a response back from the database server, I should just quit and say, "Sorry, you took too long" and cancel. Maybe I should be using the Circuit Breaker. What if the data storage isn’t available right now? Do I need to go to the cached version, or is there a backup version that I can use? And then the idea of Steady State…logs, data storage. Are we storing so much data that I’m gonna have to purge this or move this in some kind of way? So I think those are great examples of patterns. If I’m building a persistence component, now I want to tick off those patterns. I would really like to have those patterns assured. And then finally, in my sort of aggregation one, this is the most dangerous one. Right? So I depend on other services that may be far away…maybe, you know, in another building, maybe across the country, maybe across the world. There’s lots of network dependence. There’s a lot of failure possible, as well as all of this stuff from storage. Like, maybe I’ve also got disk I/O dependence and lots of other things.
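For the persistence case, the "you took too long" timeout combined with "go to the cached version" might look like the following Python sketch. The function name and parameters are assumptions for illustration, and `query` stands in for whatever call reaches the database. Note that the timed-out worker thread is not cancelled here; the caller simply stops waiting for it.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def query_with_timeout(query, timeout_s, cached=None):
    """Run `query()` but stop waiting after `timeout_s` seconds.

    On timeout, return `cached` if one was supplied; otherwise re-raise.
    (A cached value of None is treated as "no cache" in this sketch.)
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(query)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        if cached is not None:
            return cached   # stale answer beats no answer
        raise
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        pool.shutdown(wait=False)
```

Whether a stale answer is acceptable depends on the data, so a real persistence component would need a rule for how old the cached copy is allowed to be.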

So now I add things like the handshaking. Are you healthy? Are you there? Is that service really there? Maybe a DNS check before I even make the call? If I have a really expensive call, maybe I should see if that server’s up and running first? Am I sending the right format? Do we need to negotiate something else? And then, the idea of Bulkheads…if one of those services dies, does this completely blow the whole arrangement, or can I continue processing? Can I limit the damage and either treat one of those as a case for a circuit breaker, or can I use a cached bit of information that I got from that before? Can I give an answer that’s only a partial answer that might be acceptable? All these kinds of things. So when I think about creating these kinds of services, in my head I’m saying, "Ah, this is an aggregation service. I need to have the following answers to this question." Now what would be really cool to me, is if I had a tool that did this…you know, in Visual Studio or Eclipse, and I’m like, they say, "What service do you want?" "I’m building an aggregation service." It’s like a Clippy, right? "Oh, you’re, like, building an aggregation service, I think you need Fail Fast, and I think you need Handshake." And I would say, "Yes, yes, yes," and it would work all of that out for me.
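The aggregation questions above (can I continue processing, can I use a cached bit of information, can I give a partial answer) can be sketched as a small Python function. Everything here is hypothetical: the `aggregate` name, the shape of the result, and the idea of reporting which sources were degraded are choices made for this example only.

```python
def aggregate(sources, cache):
    """Call each downstream source in isolation and build a combined answer.

    `sources` maps a name to a zero-argument callable; `cache` maps a name to
    the last good value. A failing source never sinks the whole response: its
    cached value is used if there is one, otherwise it is left out.
    """
    data = {}
    degraded = []
    for name, fetch in sources.items():
        try:
            data[name] = fetch()
            cache[name] = data[name]       # remember the last good answer
        except Exception:
            degraded.append(name)          # contain the damage to this source
            if name in cache:
                data[name] = cache[name]
    return {"data": data, "degraded": degraded}
```

Returning the `degraded` list alongside the data lets the caller decide whether a partial answer is acceptable for that particular request.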

I kinda get the feeling that AWS makes us think that that’s kind of what they knew they would do for us, right? They have all those things. But they don’t do it automatically. You have to think of all of that. And I would love it if it would help us, sort of, not have to pay the heavy price, and they would actually just, sort of, say, "Yeah, we’ll do that for you. Here’s the suggested alternative. Here’s a suggested Circuit Breaker and thing." So this applying the stability patterns to improve the health, I think is gonna be pretty interesting and I want to kind of experiment more with it.

Q and A

Mike: Has anybody done anything like this, or heard anybody else talk about these things like this? Yeah.


My team was doing a creation microservice. And we’ve been solving monolith microservice internet delays. So, how much time do we actually have, which resulted all the functional requirements.


Right, that’s right. You brought up those up before. Like, there’s a whole other set of requirements, right, that you have to kinda prepare for?


Yeah, yeah basically those are things you don’t put in the description button. You need to answer the 55 seconds because we have another [inaudible 00:10:02] elsewhere in 60 seconds. And those are things that have been found before by smart people in time to live [inaudible 00:10:11]


And you dealt with some of this, right? This is kinda…


Yeah, we don’t… This is great. So we make this into playing cards, you know…


Oh, actually, that’s an interesting idea, too.


Or a cheat sheet, right? That’s what this is.


Playing cards is a really good idea.


Yeah, you could play a game where you have, like, ratios, like, "If you wanna accomplish this kind of…"


Do you wanna [inaudible 00:10:33]?


You gotta have, you know, these kinds of instances and these…you know, so you need 12 ADI servers to every database and every two searchers. Stuff like that. And those are, like, components of these ship ratios. But this is… One thing I found was crazy about it is…because I haven’t read that book. I meant to read it…


I like it.


…for years now, but is there, like, retry thing, like?


I think the Circuit Breaker, he may talk a little bit about that one. But it’s not called out as a pattern. I think there are two other patterns. One is something about middleware, which is like splitting it up, and something else. But I’d have to go back and read it some more. It may be in there somewhere.