

Chris Engelbert

How I designed PostgreSQL High Availability with Shaun Thomas from Tembo (video + interview)

This interview is part of simplyblock's Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.


In this installment, we're talking to Shaun Thomas (Twitter/X, personal blog), affectionately known as "Mr. High Availability" in the Postgres community, to discuss his journey from a standard DBA to a leading expert in high availability solutions for Postgres databases. Shaun shares his experiences working in financial services, where he redefined high availability using tools like Pacemaker and DRBD, and the path that led him to authoring a comprehensive book on the subject. Shaun also talks about his current work at Tembo, an organization dedicated to advancing open-source Postgres, and their innovative approaches to high availability, including the use of Kubernetes and containerized deployments.




Chris Engelbert: Hello, welcome back to this week's episode of Simplyblock's Cloud Commute podcast. This week I have - no, I'm not saying that. I'm not saying I have another incredible guest, even though I have. He's already shaking his head. Nah, I'm not incredible. He's just known as Mr. High Availability in the Postgres space for a very specific reason. I bet he'll talk about that in a second.


So hello, Shaun. Shaun Thomas, thank you for being here. And maybe just introduce yourself real quick. Who are you? Well, where are you from? How did you become Mr. High Availability?


Shaun Thomas: Yeah, so glad to be here. Kind of hang out with you. We talked a little bit. It's kind of fun. My background is I was just a standard DBA, kind of working on programming stuff at a company I was at and our DBA quit, so I kind of had to pick it up to make sure we kept going. And that was back in the Oracle days. So I just kind of read a bunch of Oracle books to kind of get ready for it. And then they had some layoffs, so our whole division got cut. And then my next job was as a DBA. And I just kind of latched onto it from there.


And as far as how I got into high availability and where I kind of made that my calling card was around 2010, I started working for a company that was in financial services. And they had to keep their systems online at all times because every second they were down, they were losing millions of dollars.


So they actually already had a high availability stack, but it was using a bunch of proprietary tools. So when I started working there, I basically reworked everything. We ended up using the standard stack at the time, which was Pacemaker with Corosync and DRBD, the Distributed Replicated Block Device, because we didn't really trust replication back then; it was still too new.


We were also running EnterpriseDB at the time, so there were a bunch of beta features they had kind of pushed into 9.2, I think. Because of that whole process and not really having any kind of guide to follow, since there were not a lot of high availability tools back in 2010, 2011, I basically wrote up our stack and the process I used. I presented it at the second Postgres Open in Chicago. I did a live demo of the entire stack, and that video is probably online somewhere. My slides, I think, are also on the Postgres Wiki. After that, I was approached by Packt, the publisher. They wanted me to write a book on it. So I did. I did it mainly because I didn't have a book to follow, and somebody else in this position really needs some kind of series or book or step-by-step guide, because high availability in Postgres is really important. You don't want your database to go down in a lot of situations, and until there were a lot more tools out there to cover your bases, being able to do it yourself was important. Now there are tons of tools for it, so it's not a big problem. But back then, man, oof.


Chris Engelbert: Yeah, yeah. I mean, you just mentioned Pacemaker. I'm not sure when I heard that thing the last time. Is that even still a thing?


Shaun Thomas: There's still a couple of companies using it. Yeah, you would be surprised. I think DFW does in a couple of spots.


Chris Engelbert: All right. I haven't heard about that in at least a decade, I think. Everything I've worked with had different-- or let's say other tools, not different tools. Wow. Yeah, cool. So you wrote that book. And you said you came from an Oracle world, right? So how did the transition to Postgres happen? Was that a choice?


Shaun Thomas: For me, it wasn't really much of a transition because, like I said, our DBA quit at the company I was at. And it was right before a bunch of layoffs that took out that entire division. But at the time, I was like, ooh, Oracle. I should learn all this stuff. So the company just had a bunch of old training materials lying around. And there were like three or four of the huge Oracle books lying around. So I spent the next three or four weeks just reading all of them back to back.


I was testing in a cluster that we had available, and I set the local version up on my computer just to see if it worked and to learn all the stuff I was trying to understand at the time. But then the layoffs hit, so I was like, what do I do now?


I got another job at a company that needed a DBA. And that was MySQL and Postgres. But that was back when Postgres was still 6.5, back when it crashed if you looked at it funny. So I got kind of mad at it, and I basically stopped using it from like 2001 to 2005. Then in 2005, I switched to a company that was all Postgres.


So I got the purple Postgres book, the one everyone used back then; I think it covered 8.1 or 8.2. And then I revised their entire stack too, because they were having problems with vacuum. Back then, the default settings were all wrong, so you would end up bloating yourself out of your disk space. I ended up vacuuming their systems down from, I think, 20 gigs to like 5. And back then, that was a lot of disk space.


Chris Engelbert: I was just about to say that in 2005, 20 gigabytes of disk space was a lot.


Shaun Thomas: But back then, the problem with vacuum was you actually had to set the size of the free space map, and the default was way too small. So what would happen is vacuum would only keep track of the first 200,000 or so reusable rows by default; anything beyond that was simply forgotten.


So if you had more than that, even if you were vacuuming constantly, it would still bloat a little bit every day until your whole disk was used. So I actually had to clean all that up or their system was going to crash. They were days away from going down when I joined. They had already added all the disks they could. And back then, you couldn't just add virtual disk space.
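Editor's note: on modern Postgres the free space map has been managed automatically since 8.4, but the failure mode Shaun describes still shows up as dead-tuple bloat when autovacuum can't keep up. Here's a minimal sketch of how you might spot it today, assuming a psycopg2 connection to a hypothetical database (the DSN is illustrative):

```python
# Rough bloat check: tables with many dead tuples relative to live ones
# suggest vacuum is falling behind -- the modern analogue of an
# undersized free space map.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT relname,
               n_dead_tup,
               n_live_tup,
               pg_size_pretty(pg_total_relation_size(relid)) AS total_size
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
    """)
    for relname, dead, live, size in cur.fetchall():
        print(f"{relname}: {dead} dead vs. {live} live tuples ({size})")
conn.close()
```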


Chris Engelbert: I know those situations, not in the Postgres or database space, but in the software development space where-- same thing, I literally joined days before it all would fall apart. Let's say those are not the best days to join.


Shaun Thomas: Hey, that's why they hired you, right?


Chris Engelbert: Exactly. All right. So let's talk a little bit about these days. Right now, you're with Tembo. And you just have this very nice blog post that blew up on Hacker News for all the wrong reasons.


Shaun Thomas: Well, I mean, we created it for all the right reasons. But let me just start on Tembo a little bit. Tembo is all in on Postgres. We are ridiculously all in. Basically, everything we do is open source. You can go to Tembo.io on GitHub, and basically our entire stack is there. We even just released our on-prem, so you can use our stack on your local system and basically have a Kubernetes cloud management thing for all the clusters you want to manage, running our stack of tools. And the main calling card of Tembo is probably our-- if you go to trunk, I think it's called PGT.dev. We keep track of a bunch of extensions there, and it's got a command line tool to install them, kind of like PGXN.


And we're so into this that we actually hired the guy who basically maintained PGXN, David Wheeler, because we were like, we need to hit the extension drum. And we're very glad he's re-standardizing PGXN as version 2. He's starting a whole initiative, and he's got a lot of buy-in from tons of different committers and devs and people who are really pushing it. Maybe we'll create the gold standard of extension networks. Because the idea is to get it all packaged, right? Kind of like a Debian or an RPM or whatever package system you want to use. It'll just install the package on your Postgres wherever it is, whether it's a source install, a package install, or something on your Mac, whatever.


So he's really working on that, and he's done some demos that are very impressive. It looks like it'll actually be a great advancement. But Tembo is all about open source Postgres, and our tools show that. You may have heard of Adam Hendel; he goes by Chuck. If you've heard of PGMQ or pg_vectorize, which makes pgvector a little easier to use, those tools are all coming from us, basically. So we're putting our money where our mouth is, right?


That's why I joined them, because I kept seeing them pop up on Twitter, and I'm like, man, these guys are really dedicated to this whole thing.


Chris Engelbert: Yeah, cool. So back to PG and high availability. Why would I need that? I mean, I know. But maybe just give the audience a little bit of a clue.


Shaun Thomas: So high availability-- and I kind of implied this when I was talking about the financial company, right? The whole idea is to make sure Postgres never goes down. But there's so much more to it. I've done conferences, webinars, trainings, and the book. That topic alone is essentially an infinite fount of all the different ways you can do it, all the different prerequisites you need to fulfill, all the different things you need to set up to make it work properly. The whole point is: keep your Postgres up. But you also have to define what that means. Where do you put your Postgres instances? Where do you put your replicas? How do you get to them? Do you need an intermediate abstraction layer to connect through, one that decides where to send you, so you don't have any outages as far as routing is concerned?


It's a very deep topic. And it's easy to get wrong. And a lot of the tools out there, they don't necessarily get it wrong. But they expect the user to get it right. One of the reasons my book did so well in certain circles is because if you want to set up EFM or repmgr or Patroni or some other tool, you have to follow very closely and know how the tool works extremely well. You have to be very familiar with the documentation. You can't just follow step by step and then expect it to work in a lot of cases.


Now, there are a lot of edge cases you have to account for. You have to know the why, the theory behind high availability and how it works a certain way, to really deploy it properly.


So even as a consultant, when I was working at EDB and 2ndQuadrant, it's easy to give a stack to a customer, and they can implement it with your recommendations; you can even set it up for them. But there's always some kind of edge case that you didn't think of.


So the issue with Postgres, in my opinion, is that it gives you a lot of tools to build it yourself, but it expects you to build it yourself. And even the other stack tools I mentioned earlier, like repmgr or EFM or Patroni, or pg_auto_failover, another one that came out recently—they work, but you've got to install them. And you really do need access to an expert who can come in if something goes wrong. Because if something goes wrong, you're kind of on your own in a lot of ways.


Postgres doesn't really have an inherent integral way of managing itself as a cluster. It's more of like a database that just happens to be able to talk to other nodes to keep them up to date with sync and whatnot. So it's important, but it's also hard to do right.
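Editor's note: whichever of those tools you run, it pays to watch the replication state Postgres itself reports. A minimal sketch, assuming psycopg2 and a hypothetical primary host (host and credentials are illustrative), of checking standby state and replay lag:

```python
# On the primary, pg_stat_replication has one row per connected standby.
# Growing replay lag means a failover to that standby would lose more data.
import psycopg2

conn = psycopg2.connect(host="pg-primary", dbname="postgres", user="postgres")
with conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               state,
               sync_state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication;
    """)
    for name, state, sync_state, lag_bytes in cur.fetchall():
        print(f"{name}: state={state}, sync={sync_state}, lag={lag_bytes} bytes")
conn.close()
```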


Chris Engelbert: I think you mentioned one important thing: it is important to define your goals upfront. How much uptime do you really need? Because one thing that—not only with Postgres, but in general, whenever we talk about fault-tolerant systems, high availability, all those kinds of things—a lot of people seem to forget is that high availability or fault tolerance is a trade-off between how much time and money you invest and how much money you lose if, well, the s***t really hits the fan, right?


Shaun Thomas: Exactly. And that's the thing. Companies like the financial company I worked at, they took high availability to a fault. They had two systems in their main data center and two more in their disaster recovery data center, all fully synced and up to date. They maintained daily backups on local systems, with copies sent to another system locally holding seven days' worth. Additionally, backups were sent to tape, which was then sent to Glacier for seven years as per SEC rules.


So, someone could come into our systems and maliciously erase everything, and we'd be back up in an hour. It was very resilient, a result of our design and the amount of money we dedicated toward it, because that was a very expensive deployment. That's at least 10 servers right there.


Chris Engelbert: But then, when you say you could be back up in an hour, the question is, how much money do you lose in that hour?


Shaun Thomas: Well, like I said, that scenario is like someone walking in and literally smashing all the servers. We'd have to rebuild everything from scratch. In most cases, we'd be up – and this is where your RTO and RPO come in, the recovery time objective and your recovery point objective. Basically, how much do you want to spend to say I want to be down for one minute or less? Or if I am down for that one minute, how much data will I lose? Because the amount of money you spend or the amount of resources you dedicate toward that thing will determine the end result of how much data you might lose or how much money you'll need to spend to ensure you're down for less than a minute.
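Editor's note: the RTO/RPO trade-off Shaun describes is ultimately arithmetic. A back-of-the-envelope sketch, with every number invented for illustration:

```python
# Expected annual outage loss at different recovery time objectives (RTO).
# Spending on HA is justified roughly while it undercuts the loss it
# prevents; each tightening of the RTO tends to cost disproportionately more.
outage_cost_per_minute = 50_000   # illustrative: revenue lost per minute down
expected_outages_per_year = 2     # illustrative failure rate

for rto_minutes in (60, 5, 1):
    annual_loss = rto_minutes * outage_cost_per_minute * expected_outages_per_year
    print(f"RTO {rto_minutes:>2} min -> expected annual outage loss: ${annual_loss:,}")
```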


Chris Engelbert: Exactly, that kind of thing. I think that becomes even more important in the cloud age. So, a perfect bridge to the cloud—Postgres and cloud. You said setting up HA is complicated because you have to install the tools and configure them. These days, when you deploy Postgres on something like Kubernetes, you would have an operator claiming, at least, to do all the magic for you. What is your opinion on the magic?


Shaun Thomas: Yeah, so my opinion on that has evolved a lot. Back when I first started seeing containerized systems like Docker and that kind of thing, my opinion was, I don't know if I'd run a production system in a container, right? Because it just seemed a little shady. But that was 10 years ago or more. Now that Kubernetes tools and that kind of thing have matured a lot, what you get out of this is a level of automation that just is not possible using pretty much anything else. And I think what really sold it to me was—so you may have heard of Gabriele Bartolini. He basically heads up the team that writes and maintains CloudNativePG, the cloud native Postgres operator. We'll talk about operators probably a bit later. But the point was, back when—and this was 2ndQuadrant, before they were bought by EDB—we were selling our BDR tool for bi-directional replication for Postgres, right? So multi-master. And we needed a way to put that in a cloud service for obvious purposes, so we could sell it to customers.

And that meant we needed an operator. Well, before Cloud Native Postgres existed, there was the BDR operator that we were cycling internally for customers.


And one day while we were in Italy—every employee who worked at 2ndQuadrant got sent to Italy for a couple of weeks to get oriented with the team, that kind of thing; I was there in 2020, I think, for the first two weeks of February—he demoed that, and it kind of blew me away. At the time, we were using other tools to deploy containers. It was basically Ansible to automate the deployment, with Terraform. You set everything up and then deploy everything, and it takes minutes to install all the packages, get everything deployed, and reconfigure everything. Then you have to wait for syncs and whatnot to make sure everything's proper.


Then on his laptop, he set up a Kubernetes-in-Docker deployment—kind, I think, is what we were using at that point. And in less than a minute, he had a full Kubernetes cluster of three bidirectionally replicating, multi-master Postgres nodes running on his laptop. My mind was blown. And the thing is, it's basically a new concept: the data is what matters; the nodes themselves are completely unimportant. And that's why, to bring this back around, CloudNativePG was later released by EnterpriseDB as an open-source operator for plain Postgres—just Postgres, not the bidirectional replication stuff.


The reason that was important is that it's an ethos. The point is, your compute nodes—throw them away. They don't matter. If one goes down, you provision a new one. If you need to upgrade your tooling or the packages, you throw away the old container image and bring up a new one. The important part is your data. As long as your data is on your persistent volume claim, or however you provisioned it, the container itself and the version of Postgres you're running aren't nearly as important.

So it complicates debugging to a certain extent. And we can kind of talk about that maybe later. But the important part is it brings high availability to a level that can't really be described using the old methods. Because the old method was you create two or three replicas. And if one goes down, you've got a monitoring system that switches over to one of the alternates. And then the other one might come back or might not. And then you rebuild it if it does, that kind of thing.


With the Kubernetes approach, or the container approach, as long as your storage wasn't corrupted, you can just bring up a new container to represent that storage. And you can actually have a situation where the primary goes down—maybe it got OOM-killed for some reason—gets a new container provisioned, and comes back up before the monitors even notice there was an outage and switch over to promote a replica. There's a whole mechanism of systems in there to reduce the number of timeline switches and other complications behind the scenes. So you have a cohesive, stable timeline, and you maximize your uptime. They've got layers to redirect connections from the outside world, through either Traefik or some other kind of proxy, to get into your actual cluster, so you always get an endpoint somehow. And that's something that can go horribly wrong, but that's true for anything.

But the ethos that your machines aren't important—it spoke to me a little bit, because of the level it brings you to. Sure, bare hardware is great, and I actually prefer it; I've got servers in my basement specifically for testing clusters and Postgres and whatnot. But if you have the luxury of provisioning what you need at the time—if I want more compute, like I said, I throw away my image and bring up a new one with more resources allocated to it—suddenly I've grown vertically. And that's something you can't really do with bare hardware, at least not very easily.


So then I was like, well, maybe this whole container thing isn't really a problem, right? So yeah, it's all because of my time at 2ndQuadrant and Gabriele's team that I came around: high availability does belong in the cloud. You can run production in the cloud on Kubernetes and containers. And in fact, I encourage it.
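Editor's note: to make the "nodes are disposable, data is what matters" ethos concrete, here's a minimal sketch of submitting a three-instance CloudNativePG cluster with the official Kubernetes Python client. It assumes the CloudNativePG operator is already installed; the cluster name, namespace, and storage size are illustrative:

```python
# Declares a Postgres cluster; the operator provisions the pods, handles
# failover, and keeps the data on persistent volume claims.
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig, e.g. a local kind cluster

cluster = {
    "apiVersion": "postgresql.cnpg.io/v1",
    "kind": "Cluster",
    "metadata": {"name": "pg-demo"},
    "spec": {
        "instances": 3,               # one primary, two replicas
        "storage": {"size": "10Gi"},  # the data outlives any individual pod
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="postgresql.cnpg.io",
    version="v1",
    namespace="default",
    plural="clusters",
    body=cluster,
)
print("Cluster submitted; pods are disposable, the PVCs are not.")
```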


Chris Engelbert: I love that. I love that. I also think high availability in the cloud, and especially cloud native, are concepts that are perfectly in line and perfectly in sync. Unfortunately, we're out of time. I didn't want to stop you, but I think we'll have to invite you again and keep talking about that. But one last question. By the way, I love that you said containers were a new thing like 10 years ago—unless you came from the Solaris or BSD world, where those things were –


Shaun Thomas: Jails!


Chris Engelbert: But it's still different, right? You didn't have this orchestration layer on top. The whole ecosystem evolved very differently in the Linux space. Anyway, last question. What do you think is the next big thing? What is upcoming in the Postgres, the Linux, the container world, what do you think is amazing on the horizon?


Shaun Thomas: I mean, I hate to be cliché here, but it's got to be AI. If you look at pgvector, it basically allows you to do vectorized similarity searches right in Postgres. And I think Timescale even released pgvectorscale, an extension that makes pgvector even better—apparently faster than dedicated vector databases like Pinecone. And it's an area where, if you're going to do any kind of retrieval-augmented generation, like RAG searches, or if you're doing any LLM work at all—if you're building chatbots, or if you're just doing, like I said, augmented searches—you're going to want the data that's already in Postgres, right? You're going to want to make that available to your AI. And the easiest way to do that is with pgvector.


Tembo even wrote an extension we call pg_vectorize, which automatically maintains your embeddings, which is how you interface your searches with the text. And then you can feed that back into an LLM; it also has the ability to do that for you. It can send messages directly to OpenAI, and it can also interface with arbitrary endpoints, so you can set up Ollama or something, on a server or locally, and set that as the end target. So you can even keep your messages from hitting external resources like Microsoft or OpenAI, and just do it all locally. And that's all very important. So I think that's where things are going—not everyone, but a lot of people are focusing on it. And a lot of people find it annoying—it's another AI thing, right? But I wrote two blog posts on this: in one I wrote a RAG app using some Python and pgvector, and in the second I used pg_vectorize and cut my Python code by like 90%. It just basically talks to Postgres; Postgres is doing it all. And that's because of the extension ecosystem, right? That's one of the reasons Postgres is on top of everyone's mind right now: it's leading the charge, and it's bringing in a lot of people who may not have been interested before.
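Editor's note: to give a flavor of what pgvector enables (and what pg_vectorize automates), here's a minimal similarity-search sketch. It assumes a hypothetical docs table with a vector column and a query embedding already computed; table name, dimensions, and DSN are all illustrative:

```python
# Nearest-neighbor search with pgvector's "<->" (L2 distance) operator,
# e.g. against: CREATE TABLE docs (id bigserial, body text, embedding vector(1536));
import psycopg2

query_embedding = [0.01] * 1536  # in practice, produced by your embedding model
vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

conn = psycopg2.connect("dbname=mydb user=postgres")  # hypothetical DSN
with conn.cursor() as cur:
    # Rows closest to the query embedding come first.
    cur.execute(
        "SELECT id, body FROM docs ORDER BY embedding <-> %s::vector LIMIT 5;",
        (vector_literal,),
    )
    for doc_id, body in cur.fetchall():
        print(doc_id, body[:80])
conn.close()
```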


Chris Engelbert: I love that. And I think that's a perfect sentence to end the show. The Postgres ecosystem or extension system is just incredible. And there's so much stuff that we've seen so far and so much more stuff to come. I couldn't agree more.


Shaun Thomas: Yeah, it's just the beginning, man.


Chris Engelbert: Yeah, let's hope that AI is not going to try to build our HA systems. And I'm happy.


Shaun Thomas: Maybe not yet, yeah.


Chris Engelbert: Yeah, not yet at least. Exactly. All right, thank you for being here. It was a pleasure. As I said, I think I have to invite you again somewhere in the future.


Shaun Thomas: More than willing.


Chris Engelbert: And to the audience, thank you for listening in again. I hope you come back next week. And thank you very much. Take care.





