

Chris Engelbert

Distributed SQL Databases with Franck Pachot from Yugabyte (interview)

Updated: Jul 12

This interview is part of the simplyblock Cloud Commute Podcast, available on YouTube, Spotify, iTunes/Apple Podcasts, Pandora, Samsung Podcasts, and our show site.

In this installment, we're talking to Franck Pachot from Yugabyte, a PostgreSQL-compatible distributed database, about the need for distributed SQL to scale out with massive transactional data sets, as well as some trends in the PostgreSQL ecosystem, such as PgVector.


Chris Engelbert: Hello, welcome back to our podcast. I'm really happy to have a good friend on the call or on the podcast, Franck from Yugabyte. Franck, thank you for being here. Maybe you can quickly introduce yourself.


Franck Pachot: Hey! Thank you for inviting me. Yeah, absolutely. I've always been working with databases, a lot with PostgreSQL and monolithic databases. And I joined Yugabyte as a developer advocate 3 years ago. It's a distributed SQL database. Basically, the idea is to provide all the PostgreSQL features, but on top of a distributed storage and transaction engine with an architecture based on the Spanner architecture. So the active data is distributed across all nodes. Connections are distributed. SQL processing is distributed, and all nodes provide a logical view of the global database.


Chris Engelbert: So maybe just for the people that are not 100% sure what distributed databases are, can you elaborate a little bit more? You said there are a couple of different nodes and the data is distributed.


Franck Pachot: Obviously, the main reason is that most of the current SQL databases run on a single node that takes the reads and writes, which makes it possible to check for consistency on that single node. And when there is a need to scale out, people go to NoSQL, where they can have multiple nodes active. But now they're missing the SQL features. So the idea of distributed SQL is that we can provide both: the SQL features, the ACID properties, the consistency, and the possibility to scale horizontally. There are two main reasons to scale horizontally. The first is resilience: if you run on multiple nodes, one can be down, the network can be down, and everything continues on the other nodes. The second is to scale. If you go to the cloud, you want more elasticity. You want to run with a small amount of resources and be able to add more resources. With a single database you can scale up, but then you have to stop it and start a larger instance. When you have multiple servers, you just add new nodes to be able to handle the workload.


Chris Engelbert: Right! So what is the target audience? What is like the main customer profile? Are those data companies with a lot of data? With high velocity data? What would you say is the typical amount of data those companies have?


Franck Pachot: There are also some users with small databases. I mean, 100 GB is a small database. In this case they need to be always up, always available. When you can scale horizontally, you can also do rolling upgrades or rolling patches. So you don't stop the database. You don't stop the application when you upgrade, patch the server or do a key rotation. So it's about high availability, even for small databases. And of course, the more data you have, the harder it is to do the backups with simple tools such as pg_dump. If you have a lot of data, you have many constraints to operate on, and [Yugabyte] makes it easier with the horizontal scalability. But basically it targets any use case, because a SQL database must handle any use case. It's mostly optimized for OLTP, though, for two reasons. First, because with data warehouses you don't need all the transactions, so it's easier to shard and lose a lot of the ACID properties. And second, there are already some engines with column storage. So Yugabyte is really optimized for OLTP, and for the analytics queries that run in OLTP applications.


Chris Engelbert: Okay, if I get that right, the two main use cases are like this: you have to be always up and running. Meaning, with a single instance built on PostgreSQL or using the PostgreSQL protocol, there could be a short downtime, even with a failover or a secondary, and you cannot justify that outage or downtime at all. And on the other side, you have these massive data sets, but you don't necessarily need full transaction support around all of that. Right?


Franck Pachot: Yes, but just to mention it: in the case of distributed SQL, you have all transactional properties! So even if you run multi-node transactions, there's no difference. I gave the example of data warehouses, where you may not need transactions, and where you can add specific optimizations when you don't have all the ACID properties. But here the idea is to have all ACID properties, so that you can take an application that you run on PostgreSQL and just run it distributed. Maybe there's a bit more latency.


Chris Engelbert: Right! So the idea is that with Yugabyte, you have the transactional capabilities. That's what you lose when you're going for data warehouses. Oh, that's interesting. I wasn't 100% sure about that. So let me see. What does that mean for users? You can use Yugabyte as a drop-in replacement for a PostgreSQL database, right? Is that it?


Franck Pachot: Yes, that's the goal. And we use the PostgreSQL code for the SQL layer. So in theory, yes. There are a few features that are not yet working like in Postgres when you want to scale out the DDL. For example, PostgreSQL can do transactional DDL changes. We are implementing that, but when you want it to be scalable, that's different, because PostgreSQL allows it but takes an exclusive lock. And when you build a database that must always be up, you have to do something different than taking an exclusive lock for the whole duration of the DDL. So there are a few features that are not there for the moment.


There are also some considerations about the data model, because data is sharded. You can make some decisions to shard it on a range of values or by applying a hash value. So there are little things you may think about. But basically the idea is that you don't have to change the code of the application. You may think about data modeling. If you have a bad design that just works on PostgreSQL, it's always worse when you add some network latency. So you may think a bit more about good design. Same recommendations, same best practices, but the consequences may be a bit more important when you distribute.
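To make the range-versus-hash decision mentioned above concrete, here is a minimal Python sketch. It is an illustration only, not Yugabyte's implementation: the shard count, the split points, and the function names `hash_shard` and `range_shard` are all hypothetical.

```python
import hashlib
from bisect import bisect_right

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration


def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash sharding: a stable hash of the key decides the shard."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") % num_shards


# Range sharding: each shard owns a contiguous slice of the sorted key space.
SPLIT_POINTS = ["g", "n", "t"]  # hypothetical split points -> 4 shards


def range_shard(key: str, split_points=SPLIT_POINTS) -> int:
    """Keys below "g" go to shard 0, keys from "g" up to "n" to shard 1, etc."""
    return bisect_right(split_points, key)


if __name__ == "__main__":
    for name in ["alice", "bob", "mallory", "zoe"]:
        print(name, "hash ->", hash_shard(name), "range ->", range_shard(name))
```

With range sharding, neighboring keys such as "alice" and "bob" land on the same shard, which suits range scans. With hash sharding, keys are spread without regard to their order, which helps distribute load evenly but scatters a range scan across shards.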


Chris Engelbert: Yeah, that makes sense. I hear you. We had the same thing with a different company I've worked for in the past, where it was kind of the same thing. It used the same API, but it worked differently since it had a network layer underneath. And now, suddenly everything had a network operation in between or a network transaction in between. And yeah, it looks the same, but you still have to think about it a little bit. So when you install Yugabyte, how would you recommend deploying that today? Would you recommend buying some traditional servers and co-hosting them in a data center, or going into the cloud?


Franck Pachot: You can buy a bare metal server and then start it. But the real value is the elasticity, and so the real value is in going to the cloud. Because the point is that when you go to the cloud and do the same kind of provisioning, it will cost a lot. You have an advantage going to the cloud: it can be cost efficient if you can start with small instances and then add more. So any Linux VM or container can be okay for the nodes. There is no strict requirement. The idea also is that it can run on commodity hardware. You just need a network between the nodes. No special hardware. And you can deploy it.


There are some users running it on Kubernetes, which is a great platform when you want to scale, because all pods are equal, so you can just scale the StatefulSets. Of course, I will not recommend deploying a database on Kubernetes if you don't know Kubernetes at all. If you have all the applications in Kubernetes, it makes sense to put the database there, but if it's the first time you touch Kubernetes, a database isn't the best place to start, because it's stateful and has some specific considerations. But yeah, Kubernetes is a good platform. So, it can be Kubernetes or VMs, but you can also go hybrid, on premises and in the cloud. That's also the idea. And it can be multi-cloud.


That's also an advantage when you distribute. You can, for example, move from one cloud provider to the other just by adding new nodes, letting the cluster rebalance, and then removing the old nodes. So yeah, lots of possibilities. The goal is to keep it simple, to have everything done by the database. When you scale on Kubernetes, the only command that you run is to scale the database pods, and then the database will detect it and rebalance the data. The goal is that you don't have to change anything yourself.
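The "add new nodes, let the cluster rebalance, remove the old nodes" flow can be pictured with a toy Python sketch. This is an assumption-laden illustration, not the algorithm Yugabyte uses: the `rebalance` function, the tablet names, and the node names are hypothetical, and real systems also account for replicas, zones, and data size.

```python
from __future__ import annotations


def rebalance(assignment: dict[str, str], nodes: list[str]) -> dict[str, str]:
    """Return a tablet -> node mapping whose per-node counts differ by at most one.

    Tablets that sit on a node no longer in `nodes` are re-homed first.
    """
    load: dict[str, list[str]] = {node: [] for node in nodes}
    orphans: list[str] = []
    for tablet, node in sorted(assignment.items()):
        if node in load:
            load[node].append(tablet)
        else:
            orphans.append(tablet)  # its node was removed from the cluster

    # Place orphaned tablets on the currently least-loaded node.
    for tablet in orphans:
        target = min(nodes, key=lambda n: len(load[n]))
        load[target].append(tablet)

    # Move one tablet at a time from the busiest to the idlest node.
    while True:
        busiest = max(nodes, key=lambda n: len(load[n]))
        idlest = min(nodes, key=lambda n: len(load[n]))
        if len(load[busiest]) - len(load[idlest]) <= 1:
            break
        load[idlest].append(load[busiest].pop())

    return {tablet: node for node, tablets in load.items() for tablet in tablets}


if __name__ == "__main__":
    # Six tablets on two nodes of cloud A (hypothetical names).
    before = {"t1": "a-1", "t2": "a-1", "t3": "a-1",
              "t4": "a-2", "t5": "a-2", "t6": "a-2"}
    # Step 1: add two nodes in cloud B and rebalance across all four.
    mixed = rebalance(before, ["a-1", "a-2", "b-1", "b-2"])
    # Step 2: remove the cloud A nodes; their tablets move to cloud B.
    after = rebalance(mixed, ["b-1", "b-2"])
    print(mixed)
    print(after)
```

The point of the sketch is that the operator only changes the list of nodes; deciding which tablets move where is left to the rebalancing logic.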


Chris Engelbert: I found the migration strategy you just pointed out really interesting. Basically, you create a big cluster over multiple cloud providers. Then you just scale down bit by bit on the one and scale up on the other [cloud provider], right? That's an interesting thing. Off the top of my head, one question that probably comes up a lot, meaning you probably had to answer it a few times already: how does Yugabyte handle different sized VMs? Because in this scenario [different cloud providers] there's like no chance to get the same setup in terms of VMs. Many systems have issues with differently sized VMs.


Franck Pachot: Thanks. Yeah. Really good question. The goal for predictable performance is to have nodes that are kind of equal. It also makes observability much easier. When you have to start thinking about the CPU usage on different instance sizes, that's more difficult. So it's possible to run on different sizes, but you should consider that to be just temporary, for migration. And you have to expect some kind of impact. There's also an impact on the cost. From my point of view, that can be very expensive. If you run a distributed database, there is a lot of data that is exchanged. The cost can make a lot of sense when moving from one cloud to the other, but constantly running a database on two clouds is expensive. There are some customers doing that just because they want to be sure that, on a Black Friday, they can have enough instances on two cloud providers, but of course there is a cost behind that. It's more about the agility of changing without stopping the application.


Chris Engelbert: Yeah, I think the biggest cost in that situation would be the traffic between the nodes, right? Because you have to pay for egress or ingress, depending on the cloud provider.


Franck Pachot: Yes, normally you pay when you move data out of a cloud. And you don't pay a lot when you move data into their cloud, because you come with more data. So that's fine with them.


Chris Engelbert: Right! So let me see. What do you think is like the biggest trend right now in terms of databases overall? Not just like relational ones, but in general.


Franck Pachot: I would say the long term trend is simply SQL, because SQL was popular, and then, during the NoSQL times, it was not so popular. However, now the popularity of SQL is growing again. So I think this is also a trend, considering SQL for many solutions rather than thinking about different databases for different use cases. So that's the general trend.


The short term trend, of course, is that everybody is talking about PgVector: vectors, storing embeddings, indexing embeddings. We'll see what happens. It's kind of a shift in mind for a SQL database, because we are more used to precise results. And this is more about this kind of search, with non-deterministic results. But it comes with the trend that SQL databases aren't used only for pure relational data, but for other use cases now. I think vectors, like PgVector in PostgreSQL and PostgreSQL-compatible databases, will be a thing.


I don't really know. Looking back at a trend from a few years ago, it was all about blockchain and blockchain in all databases. Now it's not.



Chris Engelbert: Right, I hear you. I'm also really careful when it comes to hypes and trends. They come and go. Two things in that answer were really interesting. First of all, you said that it's all going back to SQL. That kind of reminds me of this little plate with the evolution of NoSQL. Like, “no SQL, we don't want SQL.” Then it was like “all NoSQL, nothing else.” And now, it's like, “no, it's SQL, right?”


The second one with vector databases, it was interesting that you mentioned that you can expect, like 100 different results, right?


Franck Pachot: I think that is something we have to learn to work with in the future. The bigger the data sets that we need to analyze, the more important it is to learn to work with, well, let's call them estimates, and with how good they are. In SQL, usually, if you run the same query twice on the same data, you expect the same results, which also makes it easier to build tests and to validate a query. And with vectors you may have a different result.


Chris Engelbert: Yeah, one thing you can do in PostgreSQL, and I'm not sure about some other databases: you can have a table and you can utilize TABLESAMPLE, which will give you a sample set of a massive data set. You can ask it to give you a subset, sampled at a rate of like 10%. That kind of thing was already possible in the past, and it gave you some interesting results when you reloaded a web page and the graph was just so slightly different. However, most of the time when you used that, you wanted to have a bare overview. You wanted to have the form of the graph, not the precise thing. And vector databases go in the same direction, plus adding some more things on top of that.
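The effect described here, slightly different results on every run, can be reproduced with a small Python sketch of row-level sampling. It mimics the idea behind PostgreSQL's `TABLESAMPLE BERNOULLI (10)`, where each row is kept independently with a 10% probability, but it is only an illustration: the helper names `bernoulli_sample` and `estimate_sum` and the generated data are hypothetical.

```python
import random


def bernoulli_sample(rows, percent, rng):
    """Keep each row independently with probability percent / 100."""
    probability = percent / 100.0
    return [row for row in rows if rng.random() < probability]


def estimate_sum(rows, percent, rng):
    """Estimate the sum of all rows by scaling up the sum of a sample."""
    sample = bernoulli_sample(rows, percent, rng)
    return sum(sample) * (100.0 / percent)


if __name__ == "__main__":
    rows = list(range(100_000))  # stand-in for a large table column
    exact = sum(rows)
    rng = random.Random()  # unseeded: every run samples different rows
    for attempt in range(3):
        approx = estimate_sum(rows, 10, rng)
        print(f"run {attempt}: estimate={approx:.0f} exact={exact}")
```

Each run reads roughly a tenth of the rows and returns a close but different estimate, which is fine for the shape of a graph and unsuitable when an exact figure is required. Passing a seeded `random.Random(seed)` makes a run reproducible, similar in spirit to the `REPEATABLE` option of `TABLESAMPLE`.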


Franck Pachot: Yugabyte is a PostgreSQL-compatible database. You can perfectly use PgVector in it. Yugabyte doesn't work with vector indexes as of today though. There is work ongoing to make everything work. Extensions at the SQL level work on Yugabyte, because it's the same code, but when an extension touches the storage, it must be a bit aware of the distributed storage. We don't store in heap tables and B-trees. We store in other trees, and then those operations may be different. So today, you can use PgVector, but not the same indexing.


You were talking about the different trends. The goal is also not to build a different index for each new trend, but to build an index that can adapt. PgVector already had, I think, two or three kinds of indexes, and those have changed in less than one year. So it's better to have something that is flexible enough to be adapted to the different kinds of indexes, and that will come.


Chris Engelbert: Now, I think there is one big difference, because you also mentioned blockchain. I think there is one big difference here. We have a solution to an actual problem, something we want to resolve, whereas, no, I'm not going to bash on blockchain.


You said you can deploy it into Kubernetes. And since this podcast is cloud and Kubernetes focused, what do you think is the worst thing people can overlook when going to the cloud? I think storage is probably a complicated thing.


Franck Pachot: Yeah, for sure. Storage is complicated. The big advantage of a distributed database is that you don't have to share the storage, because the database does that. So you can have local storage. But of course you need to think about durability and performance. You can run with local NVMe disks on each instance, which will provide the best performance. It may be okay for availability, because if you run in multiple zones you can lose one and everything just continues to work, but if you lose two zones and it is local storage, then you may lose some data. Usually customers run on block storage like EBS (Elastic Block Store) in AWS, which has the advantage that the storage is persistent, in addition to the high availability of multiple zones. Of course, there are some performance considerations. Sometimes the performance reminds me of when we were running on spinning disks, just a few years ago, because you have the performance, the latency, and the throughput limitations of the storage itself, but also each instance has a limit. And you can also reach those limits. So yeah, storage is important, thinking about performance, durability and agility, too.
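The "lose one zone and continue, lose two and risk data" reasoning can be sketched as simple majority arithmetic. The interview does not state a replication factor, so the Python sketch below assumes three replicas, one per zone, with majority-based replication; the function names `has_quorum` and `zones_tolerated` are hypothetical.

```python
def has_quorum(replication_factor: int, replicas_lost: int) -> bool:
    """True if the surviving replicas still form a strict majority."""
    surviving = replication_factor - replicas_lost
    return surviving > replication_factor // 2


def zones_tolerated(replication_factor: int) -> int:
    """Replicas (one per zone, by assumption) that can fail while keeping a majority."""
    return (replication_factor - 1) // 2


if __name__ == "__main__":
    rf = 3  # assumed: one replica in each of three zones
    for lost in range(rf + 1):
        print(f"zones lost={lost} quorum={has_quorum(rf, lost)}")
    print("zones tolerated:", zones_tolerated(rf))
```

Under these assumptions, one lost zone leaves two of three replicas and the cluster keeps working, while two lost zones leave a single replica without a majority. With local disks the lost replicas are gone for good, whereas persistent block storage can be reattached once the zone is back.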


Chris Engelbert: It's good you said EBS and that it reminds you of spinning disks. At least not a floppy disk. That's good. Seriously though, Amazon's EBS is a good solution. It's just kind of expensive when you need high performance storage. I think that is like the trade-off you have to understand.


Anyway, we're at the 20-minute mark, unfortunately, already. So many more questions. Thank you very much for being here. Happy to have you back on the show at some point. There's so much more to talk about. For people having questions directed to Franck, you can reach him on LinkedIn, Twitter/X, Mastodon, and Bluesky.


Again, thank you very much for being here. And hope to see you again. Appreciated!


Franck Pachot: Yeah, thank you for having me.
