My Journey into Performance Benchmarking with Jelte Fennema-Nio & Marco Slot

My Journey Into Performance Benchmarking - Episode 11
===

CLAIRE: Welcome to Path to Citus Con, the podcast for developers who love Postgres, where we discuss the human side of open source databases, Postgres of course, and the many PG extensions. Thank you to the team at Microsoft for sponsoring this community conversation about Postgres. I'm one of your hosts, Claire Giordano.

PINO: And I'm Pino di Candia. Today's topic is My Journey into Postgres Benchmarking. I have the pleasure of introducing Jelte Fennema-Nio. Jelte is a Principal Engineer at Microsoft who works on the Citus extension that distributes Postgres and on the managed service Azure Cosmos DB for PostgreSQL. Jelte is also a PgBouncer maintainer and a sporadic PostgreSQL contributor.

He's based in the Netherlands and earned a Master's in Systems and Network Engineering at the University of Amsterdam. Jelte, welcome.

JELTE: Hi, happy to be here.

CLAIRE: And I have the pleasure of introducing Marco Slot. Marco used to work for Citus Data and then Microsoft for many years. So he spent years working on and thinking about, dreaming about Citus and distributed Postgres.

Marco earned his PhD in Distributed Systems at Trinity College, Dublin, and he once worked at AWS on CloudFront and now works at Crunchy Data, also based in the Netherlands. Welcome, Marco.

MARCO: Yeah. Thank you very much. Good to be back on this podcast. My second appearance.

CLAIRE: That's right. You were on the very first episode along with Simon Willison; you were both keynoters for last year's Citus Con, the virtual event for Postgres. And yeah, it was awesome. You helped us pave the way to create this thing.

MARCO: Yeah, it's a great platform. I really enjoyed the conversation back then with Simon about working in public and open source.

So if anyone's interested, I recommend it. Or really, you should listen to all the episodes, they're all pretty good. The lineup is amazing.

CLAIRE: One of the things we do is we put show notes when we publish the podcast. So we'll be sure to drop a link to the episode with you and Simon into those notes.

Okay. The topic today though, is my journey into Postgres benchmarking or database benchmarking. And we have so many questions about why, when, and how both of you skilled up on performance benchmarking and how it's been useful to you in your work on Postgres and your work on Citus. So I want to start with your personal stories and we'd like to ask each of you the same question.

Can you tell us a story about your very first performance benchmarking experience? Even if you were 18 years old, how did you first get your feet wet?

JELTE: I can start. In university, at some point I did some benchmarking of, I don't even remember, I think it was MySQL, because that was somehow the easiest to run the benchmark on.

And it was comparing performance between containers and unikernels. Unikernels are a bit like containers, but they run directly on the hypervisor, and supposedly they're much faster. So I wanted to find out if that was actually true, at least with the benchmark I found on the internet.

When I ran it, it was quite a bit faster for the unikernels than in Docker. So I guess that's my story.

CLAIRE: Was this an assignment that everybody in class had to do, where the syllabus clearly spelled out where to find the tools and what to use and what to do?

Or was this just you being curious?

JELTE: No, the assignment was to do some sort of tiny research project, but we could choose the topic ourselves. So I came up with this together with another student.

PINO: Does anyone run databases in unikernels?

JELTE: I don't know if anyone does, but maybe it's a good idea. I think it was even faster than running it on a VM, because the idea is that you remove the whole system call overhead, since you only have one process.

Basically MySQL or Postgres becomes the kernel, which is very strange, and hard to debug. Logs, where did they go? All your normal debugging stuff goes away.

PINO: Maybe we should do another episode on all the places you can run Postgres, and we'll see if unikernels come up at that point.

MARCO: Sounds like a good plan.

CLAIRE: Could be a long episode. Okay, Marco, what about you?

MARCO: I have fond memories of hacking on microkernels at the VU, where they were building Minix 3 at the time. But my benchmarking story starts at Amazon in 2008, when I joined the CloudFront team, which was still pre-launch at the time.

I joined as the intern, and being the intern, of course I had to do the benchmarks. Because we hadn't launched yet, we had bought a bunch of servers around the world, but we weren't taking significant traffic. So the question was: what's our actual capacity once we start getting some customers?

At the time we were using Squid as the web server, which is this ancient piece of software, and the benchmarking tool is called Polygraph. I don't know if it still exists, but basically it simulates web traffic, so you can point it at a web cache and see how well it does and how hard you can push it.

But as I was doing this, I realized the benchmark has many different parameters in terms of request size and the distribution. Is it a uniform distribution where every URL gets requested the same number of times, or is it more Zipfian? And I realized I had no idea what the traffic was going to look like once we actually started getting customers.

So then I started diving much deeper into the system and using the benchmarks more to understand the low-level details of how much effort we need to spend for a particular type of request. So basically creating a model: what are the variables of the requests that matter?

And what is the expected amount of work that needs to be done? That way we could actually get a pretty good understanding of our utilization and our capacity once we went live. And I guess that work ended up buying me a sort of remote part-time position at Amazon while I was doing a PhD, which was quite unusual at the time.

And then the fact that I was a remote part-time employee, when I was talking to the Citus Data founders, they were like, whoa, no one has that, we should hire this guy. So that benchmarking work was actually quite an important moment in my early career. And then when I came to Citus Data, of course, the first thing I had to do was benchmarking.

So we ran these TPC-H benchmarks on Citus at the time. Those were earlier days, a different time when distributed, large-scale analytical SQL systems were in their infancy, and Citus was actually pretty early in being able to do large-scale distributed joins. So we wanted to evaluate that.

And we got a good understanding of the system, but we also found a lot of limitations and architectural performance issues that we knew would take a very long time to address. So I think from that point onwards we started veering away a little bit from targeting the type of data warehousing workloads that TPC-H kind of simulates.

And more into other types of workloads: Software as a Service and multi-tenancy and real-time analytics. So that was also quite interesting work. And then later in Citus we started getting more into transactional benchmarks. And yeah, Jelte I guess has also worked a lot on those.

CLAIRE: You both have touched on the different reasons people run performance benchmarks, right? Jelte's example was the canonical case of comparative performance benchmarking: wanting to do a bake-off, if you will, and understand which one of these performs better.

But what you're talking about is understanding the utilization and the capacity and the costs, what's performing and what's not, just within a particular system, like in your case. And that literally caused a shift in the go-to-market strategy from data warehouses to SaaS applications and other types of workloads, which I think is fascinating.

Like it just shows you how important understanding performance is.

MARCO: Yeah, it's an incredibly tedious thing to do. I don't like benchmarking, but it's also often one of the highest-impact things you can do, if you manage to answer the question that you're trying to answer. Though very often the question only starts appearing once you run a bunch of benchmarks. It wasn't obvious that running TPC-H benchmarks would influence our go-to-market strategy.

It was more like, oh, how do we compare to a system called Impala at the time? And Spark SQL was at a very early stage, super unreliable, but we were comparing against that. And actually we did pretty well, but we also realized that, compared to say Redshift, which was appearing then, there was a huge gap.

And that was a little bit of an unexpected finding, in a way. But yeah, benchmarking is always a journey. You have a sort of sense of where you want to go, but then you learn about the thing you're measuring, and based on that, you start asking other questions, and you often end up in an entirely different place.

I often think of it visually as a sort of landscape you're walking through, and you cannot cover the whole landscape. You're just walking some kind of path through it: I can look at this and this and this, but not everything else.

But still, as you walk, you learn more and see the next hill and the next hill. So that's what makes it also fascinating, tedious, but fascinating.

PINO: You make it sound like a dreamy and pleasant walk.

MARCO: You have to, because it's super annoying to do. You have to motivate yourself, right?

The challenge is often the problem space: benchmarks have this sort of exponential space, right? There are maybe thousands of dimensions, variables which influence the performance of a system. And you're often just changing one of those, but pretty quickly you might find that you're bottlenecked somewhere else.

Like you're measuring something stupid, like you're measuring the client, or something silly happens. And every iteration takes several hours to go through, so you only learn this very slowly. And that's part of the frustration. But if you imagine it being a nice landscape, then it helps going through with it.

JELTE: I also don't like running benchmarks. I think that's one thing we definitely have in common, Marco. But one thing I do like is finding out things that you didn't expect. Doing benchmarks just for the sake of getting the numbers is not so much fun, but doing benchmarks and finding out that something you didn't expect is very slow, or actually that something is very fast even though you didn't know, those results are nice. But to get there is usually a long and tedious journey.

PINO: Maybe I'll chime in and ask Adam's question from the chat. Adam asks: should benchmarks become declarative, fully automated, and reproducible? BenchOps, like DevOps, but for benchmarking methodology?

MARCO: The obvious answer would be yes, but I think one of the things you find once you start running benchmarks is which dimensions matter, and those might not be the dimensions that you expected and built automation around.

And so it's pretty hard, and this can differ for every system, every database system, even for different workloads. So it quite often happens that you build all this nice automation to run different configurations of Postgres, use different disks, use different machine types, but then it turns out that actually you needed multiple driver nodes.

Otherwise you're just measuring the capacity of the driver node. And now I need to go and rebuild my whole benchmarking system to support multiple driver nodes, and that's a lot of work. And the type of work is often very context dependent. Are we mostly doing this on one cloud provider?

Is it about a single-node database system, or a distributed database system, or something with read replicas? They all tend to have different requirements. And actually building something good that covers a really broad range of systems is just extremely hard, and perhaps not very economically viable, because you cannot easily sell it, right? It's easier to build something that you can sell, because then you have some money and you can invest it in a product. But with benchmarks that's often a little bit trickier, because the value is not directly coming out of the benchmark itself.

JELTE: Yeah. I think almost everyone that's run benchmarks for a significant amount of time has built their own sort of automation to run the specific benchmarks they needed to run. And then they probably shared it with people, and no one used it, or a few people used it.

But then it slowly faded into the past, because the next person needed to run different benchmarks or use different types of machines. So these kinds of things also get outdated very quickly.

MARCO: Yeah. There are attempts. Especially for database benchmarking there's BenchBase, which is a somewhat generic, extensible database benchmarking tool.

It's built by CMU, the university, but with many maintainers, and hopefully that project keeps going and keeps improving. But it's a lot of work, and someone needs to be willing to put in all that effort to write a good tool for benchmarks and then keep it up, so we can still use it 10 years from now.

And economically that's often challenging. But technically, everyone wants better tools for benchmarking. And if you're building a cloud platform, ideally at an early stage you already start building some automation around this. But again, you probably have customers asking for things.

So are you going to do the thing that the customer asked for, or are you going to build this benchmarking infrastructure that you might only use a few times? The economic choice is difficult. So it usually ends up with a lot of manual labor and very narrow tools that do automation, but for a specific system and set of scenarios.

CLAIRE: Okay. So you mentioned BenchBase and I dropped a link into the chat. One of the things I really want to explore with both of you, Jelte and Marco, is links to your favorite resources, blogs, websites, tools, things that helped you either initially learn how to do effective performance benchmarking, or that just gave you the tip that solved that one little problem, even after you were an expert at it.

What are your pointers? What are your recommendations? Favorite tools?

Marco, go first, please.

MARCO: Yeah, so apart from BenchBase, another tool I've used a lot in the past, and Jelte has used too, is HammerDB. It's a pretty nice benchmarking tool for different relational databases. I think they have Oracle, SQL Server, MySQL, Postgres, probably Db2, and they implement some variant of TPC-C and TPC-H.

And it's a pretty well-maintained tool. The code could be better.

CLAIRE: And it's not a database, just to be clear, to anyone who's listening.

MARCO: Yeah, that's a bit confusing about HammerDB. If DB is in the name, you think it's a database, but it's purely a benchmarking tool.

It doesn't store your data.

JELTE: It hammers the databases.

MARCO: Yeah. And it's also published by the TPC, the Transaction Processing Performance Council, which defines the most well-known, most standard benchmarks. I think I have a love-hate relationship with the TPC benchmarks, TPC-H and TPC-C.

In a way they have this problem that each is only one specific workload, and it's often a pretty strange workload. I find TPC-H especially strange, because it just doesn't link very well to the way people usually use their analytics databases. But most of these TPC benchmarks are pretty good ways to stress a system.

It's really hard for a database to answer a TPC-H query, and it's pretty hard to do all the different TPC-C transactions concurrently. So I do like it a lot as a benchmarking tool. And then obviously Postgres comes with pgbench.

And this is a very simplistic benchmarking tool. You can use it to run your own script files, which is neat. But the built-in workloads are extremely basic: they're just always looking up a single row, for example, or updating a single row, which is not what tends to happen in practice. But again, for studying specific aspects of the system, it can actually be very nice to have this very clean benchmark that's only doing one particular thing.

So it also depends on what kind of thing you're interested in, what kind of question you're interested in answering, and which benchmarking tool makes the most sense for that.
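
To make the pgbench point concrete, here is a minimal sketch of driving pgbench with a custom script file from a small automation script. It assumes pgbench is on the PATH and that `pgbench -i` has already initialized a database, here named "bench" as a placeholder; everything else is illustrative rather than a prescribed setup.

```python
# A minimal sketch of running a custom pgbench script, assuming pgbench is
# installed and `pgbench -i` has already populated a database called "bench"
# (a placeholder name).
import subprocess
import tempfile

# One transaction: a single-row indexed lookup against pgbench_accounts.
CUSTOM_SCRIPT = r"""
\set aid random(1, 100000)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
"""

def run_pgbench(db="bench", clients=16, seconds=60):
    with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
        f.write(CUSTOM_SCRIPT)
        script = f.name
    cmd = [
        "pgbench",
        "-n",                 # skip vacuum; we only run our own script
        "-f", script,         # the custom transaction above
        "-c", str(clients),   # concurrent client connections
        "-j", "4",            # worker threads on the driver side
        "-T", str(seconds),   # run for a fixed duration
        db,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    print(next(line for line in out.splitlines() if line.startswith("tps")))

if __name__ == "__main__":
    run_pgbench()
```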

JELTE: I think another one that we used a lot is YCSB. It's for really simple queries.

Basically, it's much closer to many websites that do a simple update, a simple delete, a simple read, and not much else. That's its main thing. It pretends to be Yahoo; it's the Yahoo Cloud Serving Benchmark, made by Yahoo at some point.

So it just does very simple queries, but it does a lot of them, and that's what makes it difficult to handle as a database. And apart from that, all of this is about the tool that runs the benchmark, but for me that's only about half of it.

The rest is creating machines and setting them up in the way that you want to measure. So it's much more about managing the database itself, being able to spawn it simply and easily, and setting up a machine with this database benchmark on it. There are many tools to do this, but it really depends on where you want to benchmark as well.

If, for instance, you want to benchmark VMs against Kubernetes or something, you can't use the Kubernetes tooling for the VMs. You have to use something like Terraform or ARM templates or CloudFormation to create the infrastructure to be able to run your benchmarks. And if you're using actual bare metal, then it's even harder.

Or you set it up once, and then you need some of your own automation on top, maybe with even more tools. So setting up the system itself is usually already a big task when doing benchmarks.

PINO: Jelte, for the setup part, it sounds like you're saying to always use a declarative tool in any case.

JELTE: Yes. That's one of the main things I've learned, at least: your first benchmark, probably your first five runs, are going to be wrong. You messed something up, you configured something incorrectly. That's okay, but you have to redo it.

So you need to have an easy way to recreate the system that you had, because you're going to do it five times anyway. That's my rule of thumb.

CLAIRE: I remember reviewing a blog post that you wrote sometime last year, I'll go look up the link in a second, about how to benchmark the performance of Postgres using HammerDB. You were running it on Azure, and I remember the first sentence of the draft had multiple exclamation marks, with your main advice to anybody running benchmarks: automate it, right?

JELTE: Yeah, that's still my main advice. I think that's like the only way to not go crazy when doing benchmarks.

Because otherwise, every time something goes wrong, you're like, oh no. If you automated it, at least it's okay: I just press the button and wait another day or another few hours, instead of having to set up all the machines again.
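
A sketch of that press-the-button idea: one entry point that recreates the environment from scratch, runs the benchmark, and tears everything down again. The Terraform and Ansible invocations and the run-hammerdb.sh script are hypothetical placeholders for whatever declarative tooling you actually use.

```python
# Hypothetical one-button driver: every run recreates the whole setup, so a
# botched configuration just means pressing the button again. The template,
# playbook, and run-script names are placeholders.
import subprocess

def provision():
    # Declaratively create VMs, disks, and networking.
    subprocess.run(["terraform", "apply", "-auto-approve"], check=True)

def configure():
    # Install Postgres, apply settings, load the dataset.
    subprocess.run(["ansible-playbook", "setup-benchmark.yml"], check=True)

def run_benchmark():
    # Kick off the actual benchmark from the driver node.
    subprocess.run(["./run-hammerdb.sh"], check=True)

def teardown():
    subprocess.run(["terraform", "destroy", "-auto-approve"], check=True)

if __name__ == "__main__":
    try:
        provision()
        configure()
        run_benchmark()
    finally:
        teardown()
```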

MARCO: Yeah, I think the earlier you automate, the less regret you'll have later on.

Even if you want to know the results sooner, it's probably better to spend a little bit more time on automating. I think that's maybe one thing that has improved over the years. I remember doing the TPC-H stuff a long time ago at Citus Data; it was all CloudFormation-based.

At the time CloudFormation was especially tedious in terms of syntax, and actually testing the template usually took an hour as well. So it's hard to debug these automation tools, but you'll still regret it later if you don't spend a good bit of time on that first.

And the other thing that's important in terms of tooling is any kind of introspection. In a way, the most important thing in benchmarking is to understand what you measured. You get a result, but there's no way of knowing what a certain queries-per-second number means if you don't have additional data and clues to build some kind of mental model of why you're getting this result.

So profiling tools can be useful, but you also want to collect a lot of metrics: how much I/O is the system doing, using things like iostat? How much CPU? Because it pretty much always happens, I would say 99% of the time, that there's a point where you realize you were just measuring the performance of the client, not the performance of the system under test.

That happens a lot. So it's good to gather a lot of metrics through various tools, to have at least a mental model of what's going on when you run the benchmark.

CLAIRE: And when you say profiling tools what are you specifically talking about?

MARCO: We worked a lot with perf. It's a pretty nice tool on Linux; one way to use it is to generate a flame graph of where the CPU spends its time in the different functions of your program.

That gives you a good sense of what's actually happening. And it can also be very helpful for performance optimization. That can be a reason to run a benchmark: I want my system to be faster, so I run the benchmark, and at the same time I run tools like perf to see what the system is actually doing, where it's spending most of its time.

And then you might see that there's one function that's just copying a lot of memory around, for example. So that's quite an important tool, but I think it depends a bit on the language. If you're benchmarking software in Go or Java, there'd be other profiling tools.

JELTE: But in the end, most of them are able to generate flame graphs.

I wouldn't say it's so much the tool itself that matters; it's more the output of the tool. A flame graph is really useful, and a general perf report of how much time is spent in which function is also fine.

Those kinds of things are very useful.
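
For readers who want the copy-paste version of the perf-to-flame-graph workflow mentioned here, a rough sketch follows. It assumes Linux with perf installed and Brendan Gregg's FlameGraph scripts cloned into ./FlameGraph; the PID is whatever process you are profiling.

```python
# Sketch of the perf -> flame graph pipeline: sample stacks, fold them, and
# render an SVG. Assumes perf is installed and ./FlameGraph contains Brendan
# Gregg's stackcollapse-perf.pl and flamegraph.pl scripts.
import subprocess

def flamegraph(pid: int, seconds: int = 30, out_svg: str = "flame.svg"):
    # Sample call stacks of one process at 99 Hz for the given duration.
    subprocess.run(
        ["perf", "record", "-F", "99", "-g", "-p", str(pid), "--", "sleep", str(seconds)],
        check=True,
    )
    # Convert the samples to text, fold the stacks, and render the SVG.
    script = subprocess.run(["perf", "script"], capture_output=True, text=True, check=True)
    folded = subprocess.run(["./FlameGraph/stackcollapse-perf.pl"],
                            input=script.stdout, capture_output=True, text=True, check=True)
    svg = subprocess.run(["./FlameGraph/flamegraph.pl"],
                         input=folded.stdout, capture_output=True, text=True, check=True)
    with open(out_svg, "w") as f:
        f.write(svg.stdout)

if __name__ == "__main__":
    flamegraph(pid=12345)  # hypothetical PID of the backend under test
```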

CLAIRE: Any other favorite profiling visualizations or tools, or, I don't know, even conceptual resources, like the book you read when you were 25 years old that really helped you understand that whole space of introspection?

I'm just looking for tips for someone who is not yet an expert the way you both are with benchmarking, but wants to be.

JELTE: I would say that one of the most important things, even before you reach for perf, to reiterate a bit what Marco mentioned, is to get metrics of everything. Because usually there's one thing that's the bottleneck.

Either you're using all your disk, or you're using all of your CPU, and if you're profiling the wrong thing, you're just wasting your time. Because if CPU is the bottleneck and you're looking at how many disk updates this thing does, it's not going to matter if you improve that at all.

So it's first important to find out what the bottleneck is, and then get a performance report of that: a flame graph, or some I/O statistics, something like that, to find out how you can improve the final performance. Because that's what you want to do when you're doing these kinds of things.

You have some performance number at the start, and you want a higher number at the end.

PINO: What are you using for plotting and graphing? Is that built into some of the tools you mentioned, or do you use something else?

JELTE: I use htop a lot. It's like a task manager; it basically just shows whether the CPUs are busy, whether I/O is busy, or which processes are doing a lot.

So it's just a general look at my computer to see what it's doing. That's one thing. And usually I do benchmarks in Azure, and it tracks metrics from the machine itself: how many I/O operations (IOPS) it is doing and is allowed to do, how much bandwidth it's allowed for network and disk, and how much CPU it's using. If one of those is at 100%, you've almost certainly found your bottleneck. That's the thing you should improve.
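
A small sketch of that "watch the machine first" habit: sample CPU and disk activity while the benchmark runs, so you know which resource is saturated before profiling anything. It assumes the third-party psutil package is installed; the numbers are only as good as what the OS exposes.

```python
# Sample overall CPU and disk activity every few seconds during a run,
# assuming `pip install psutil` has been done on the machine.
import psutil

def sample(duration=60, interval=5):
    prev = psutil.disk_io_counters()
    for _ in range(duration // interval):
        cpu = psutil.cpu_percent(interval=interval)   # % CPU over this interval
        now = psutil.disk_io_counters()
        read_mb = (now.read_bytes - prev.read_bytes) / 1e6
        write_mb = (now.write_bytes - prev.write_bytes) / 1e6
        prev = now
        print(f"cpu={cpu:5.1f}%  disk read={read_mb:8.1f} MB  write={write_mb:8.1f} MB")

if __name__ == "__main__":
    sample()
```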

PINO: And are there tools to save data from successive runs, or is that something you just have to self-organize?

What do you capture from every run, from every iteration?

JELTE: In general, that's something you have to self-organize, at least in my experience. Some tools can take certain metrics from the database itself and save those automatically, but like I said before, the thing you're running it on is also a big part of your benchmark, and that changes.

So it's the benchmark you're running and the thing you're running it on, and that combination of things is so large that there isn't really good tooling around it, at least not for everything.
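
A sketch of that self-organized bookkeeping: append every run, together with the configuration it ran against, to a JSON Lines file so results from different days and machines stay comparable. The field names and values here are just an example.

```python
# Append one record per benchmark run; the configuration fields are examples.
import json
import datetime

RESULTS_FILE = "benchmark-runs.jsonl"

def record_run(tps: float, config: dict):
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tps": tps,
        # The thing you run on is part of the benchmark, so record it too.
        "config": config,
    }
    with open(RESULTS_FILE, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_run(
    tps=12345.6,
    config={"vm_size": "Standard_D16s_v5", "postgres": "16.1",
            "shared_buffers": "8GB", "benchmark": "HammerDB TPC-C"},
)
```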

CLAIRE: So I just want to point out, relating to what you were just saying, Jelte, that while you were talking, Marco dropped a link to a webpage that Brendan Gregg built out that has links to documentation, articles, and presentations that he's given on performance. For anyone who's never heard of Brendan Gregg: he originally worked at Sun Microsystems, which is probably back when I first met him ages ago, and he is just well known in Linux for having done a lot of work on performance. So anyway, I'm looking at that web page right now, and it does look super useful for anyone who wants to bone up and learn a lot more in this space.

Thanks, and there are a lot of YouTube videos here as well, as Melanie points out.

MARCO: Definitely. BrendanGregg.com, when you get really deep into performance benchmarking, this is the resource. He's probably the world's leading expert on this topic. He has some books, I see now.

So maybe I should get those, because pretty much all the content on there is deeply technical, but sometimes it can be a lifesaver. We had this locking issue in the past where we ended up generating what he calls off-CPU flame graphs, and there's a description on his website of how to get those.

And that made us realize that it was in fact a locking issue. It also describes how to make flame graphs, how to use advanced tools like eBPF and other things like perf. So yeah, that's definitely the resource, especially for more advanced performance studies.

I've never found a better website than that.

JELTE: Even for simple performance studies, I think it's very good, because it has all kinds of perf commands that you can just copy-paste to do the thing that you want to do, because the flags to perf are not something I can remember.

MARCO: Yeah, true. And there's also an aspect of benchmarking where you need to build up a sort of mental model of the system, how it behaves and what it does, again to understand why you're getting a particular result.

For example, in database workloads, one important, very hidden variable, there's nowhere you can see it, you just have to imagine it being there, is the size of the working set. When we say the size of the working set, it's: what is the part of your data that's very actively queried and therefore typically cached in memory?

If that's larger than your memory, then you're suddenly going to do a lot more I/O, which is going to make your results a lot worse, because I/O is slower than memory. But there isn't a specific way of measuring a working set; it's not really a number, it's more of a pattern, a sort of distribution.

And I've not seen super good tools for those kinds of things. It helps to have a little bit of an intuition of, hey, there is this thing called the working set. And if my benchmark workload involves querying every key in the table, and the table is super huge, bigger than my memory, then probably this working set thing is going to be an issue.

Whereas if the workload is more concentrated on a small number of keys, a small amount of data, then it's going to be more of an in-memory workload. So that's also one of the hard things about benchmarking: there are these hidden parameters, hidden metrics, that aren't actually very easy to measure but can be extremely important.
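
A back-of-envelope sketch of the working set idea described here: if the hot fraction of the data is bigger than the memory available for caching, expect the run to shift from memory speeds to I/O speeds. All the numbers below are made-up example inputs.

```python
# Rough working-set estimate with made-up example numbers.
row_count = 500_000_000        # rows in the table
avg_row_bytes = 200            # average row width including per-row overhead
hot_fraction = 0.05            # share of rows the workload actually touches
cache_bytes = 64 * 2**30       # RAM usable for caching (shared_buffers + OS page cache)

working_set = row_count * avg_row_bytes * hot_fraction
print(f"working set ~ {working_set / 2**30:.1f} GiB, cache ~ {cache_bytes / 2**30:.0f} GiB")
if working_set > cache_bytes:
    print("expect an I/O-bound run: watch iostat, not just CPU")
else:
    print("mostly in-memory: CPU and lock contention matter more")
```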

CLAIRE: I realized that in this conversation, I've been asking all my questions of you, Marco, and you, Jelte, but Pino, you've done engineering for years and years and years. I'm curious, did you ever work on performance benchmarking?

PINO: Oh gosh, I've got to admit, I probably got out of it by pushing it to people who liked it more than I did, or disliked it less than I did.

Very early on I might have done some benchmarks, but they were very rudimentary and it's lost to memory now. But I was going to ask, in that regard, that many people coming to benchmarks are probably coming at it as developers who have other components in their system beyond the database.

And I was wondering if Jelte or Marco have advice about what we should be benchmarking. What should a developer be benchmarking? The whole system, or breaking it into components right away? Any advice about that?

JELTE: Like an application developer that uses a database, right?

PINO: Yes.

JELTE: I think in general the recommended approach is to find out what your most-used queries are, or the queries that you know are currently very slow, and then benchmark those.

So you have a few important queries that you care about. Any website has some admin interface with some query that you don't really care about; it's a bit slow or a bit fast, or it's never run, so it doesn't matter. But there are a few queries that are either so slow that they break the whole system, or that you run so often that any small improvement there is going to make everything else measurably faster. Generally those are the queries you want to test. Benchmarking your whole system is possible, but then you have to make your own benchmark suite out of your system.

And it's much easier, if you have a few queries, to use those queries in, for instance, pgbench, to simulate your system a little bit, or at least the important parts of it.

PINO: That makes sense, because you'd be throwing away all the standard tools if you were just testing your API or your website.

JELTE: Yeah, you can also do the full thing; it depends whether you want to benchmark your database or your actual website. If you know the bottleneck is in the database, then you can just run the queries. Otherwise you might actually need to send many HTTP requests to a staging environment and see what your own system is doing.

CLAIRE: So Jelte, is there a tool that can look at the application and help create or simulate workloads for your performance benchmarking that are close to the actual application's workload? I'm literally quoting a question from Bilal in the chat that seems relevant to this thread. Do you know of such a thing?

JELTE: I think there might be some proxy layers that detect which web pages or which patterns get hit often. But in general, you don't really need a tool to find the things that happen often; most of the time you already know which those are, or at least it's fairly simple to look up.

If you have any kind of metrics or logs on your website or your database, you can use those to create a list of things to put into a generic load-testing tool. For Postgres, that's pgbench. For your website, it's been a while since I did it, but back then wrk was the thing to use, though those things change. Give it a few URLs and make it send requests to those as many times and as quickly as possible, and see what breaks and where, or then profile your system to see where it's spending time.
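
A sketch of that "give it a few URLs and hammer them" approach, assuming the wrk load generator is installed; the staging URLs below are placeholders.

```python
# Hammer a handful of important URLs with wrk and report requests/sec.
# Assumes wrk is installed; the staging URLs below are placeholders.
import subprocess

HOT_URLS = [
    "https://staging.example.com/",
    "https://staging.example.com/search?q=postgres",
    "https://staging.example.com/api/items/42",
]

for url in HOT_URLS:
    out = subprocess.run(
        ["wrk", "-t4", "-c64", "-d30s", url],   # 4 threads, 64 connections, 30s
        capture_output=True, text=True, check=True,
    ).stdout
    rps = next(line for line in out.splitlines() if line.startswith("Requests/sec"))
    print(url, rps.strip())
```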

PINO: And I guess I also wanted to address Marco, go ahead and chime in.

MARCO: Yeah, I was going to say I've not seen great tools for Postgres specifically in this area. Some of the commercial vendors, SQL Server, Oracle, tend to have decent tools for workload replay, where you capture a workload and then try to amplify it until your system is fully loaded, so you have a good sense of how far you can scale on your current hardware. There are definitely some open source tools out there as well. But I guess we've often been in the situation of the database engineer, right?

So you don't care about one specific workload, you care about all workloads, and you're trying to find somewhat generic, representative things. As an application developer, it's useful to have such a replay tool, but in lieu of that, as you said, if you can find the core piece of your workload, the other 70 things that your application does probably don't matter that much.

Like it's this hundred thousand selects per second, plus this large update that you're doing, that's probably going to dominate the work.
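
One way to approximate that kind of dominant mix with standard tooling is pgbench's script weights, as in the sketch below. The two script files and the database name are placeholders you would write for your own workload.

```python
# Approximate an application as a weighted mix of its two dominant queries.
# hot_select.sql and big_update.sql are placeholder pgbench script files.
import subprocess

cmd = [
    "pgbench", "-n",
    "-f", "hot_select.sql@95",   # the select that runs all the time
    "-f", "big_update.sql@5",    # the occasional heavy update
    "-c", "32", "-j", "8", "-T", "300",
    "appdb",                     # placeholder database name
]
print(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)
```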

PINO: What about... any advice about when is the right time to benchmark? Often it's, you know, we're about to launch, so we'd better benchmark and make sure it won't fall over, or we need to scale up and maybe it's going to start costing too much. But any advice about benchmarking that's not quite as reactive?

MARCO: It depends. There's not one thing to achieve with benchmarking; there's a variety of things, as we've already gone over: I want to know how much capacity I have left, or I want to define my product strategy. Of course, one conventional, recurring occasion is releasing new software, for example.

That's a good place where, say with Citus as a project, but also with Postgres, before finalizing the release there's a ton of benchmarks that are run in an automated way to compare different versions. And maybe it's a good segue into something that was brought up in the chat: Mark Callaghan, who's on Twitter, has been an enormously helpful resource in terms of database performance benchmarking.

He spends a lot of his time running benchmarks on different versions of Postgres and MySQL, seeing whether there are regressions or improvements, sometimes testing specific patches, and testing things in different ways, from different angles.

And that's an enormously helpful thing for the whole community, that we can see when things are getting slower. And yeah, the bigger database software projects have to have some kind of automated benchmarking infrastructure.

That's usually especially around the release, sometimes even much more frequent than that. In that case, the question is: did we regress on performance? Because it's easy to make a small code change where all your unit tests and all your regression tests pass, but actually the system is now slower. But yeah, it depends a lot on what kind of question you want answered and when you want that question answered.

CLAIRE: I've learned a lot in my career by following some key people on Twitter and I know plenty of engineers who follow Mark Callaghan.

So if you're listening and you don't know him, @MarkCallaghanDB is his username on Twitter. There's a G in Callaghan and two L's, so you can find it with a search. Super useful, and he shares a lot, which goes back to that first podcast topic that you and Simon Willison talked about, Marco, which is working in public.

I really appreciate people who do share their epiphanies and their learnings and their observations. Because the rest of us can benefit.

MARCO: I definitely recommend checking that out, or just following him, because you'll get a stream of very interesting, very useful information about different kinds of database software over time. Even if you're not a database engineer, it's actually very interesting.

CLAIRE: So when you talked about wanting to release new software and make sure you don't regress, that kind of brings up the whole question of how benchmarking fits into your software release workflows and CI workflows. What are your thoughts, beyond what you just said? Is there more to it? Is this a whole rabbit hole in itself?

MARCO: Yeah, I think benchmarking is by definition a rabbit hole because, again, the number of dimensions is so large that you can only study a very small space. And it's very easy to say, okay, we run all these transactional and analytical benchmarks and we feel like we've captured a spectrum of sorts, but actually, I don't know, maybe large batch updates suddenly became very slow, and there are very few benchmarking tools that try to do large batch updates. So it's pretty easy to have these blind spots.

So it's definitely important that you have this kind of suite of tools, and also not look at one configuration of the system, but at least try several. And again, it's just a super tiny amount of data out of this enormously multidimensional space that you're trying to understand.

And having that automated is pretty essential. I think typically with Citus, if we did a release, there was this whole suite of different tools that ran, but it's definitely not as comprehensive as I would have liked. Also, Citus is used a lot for real-time analytics, and there are no good real-time analytics benchmarks. So it's a little bit of a blind spot, and you end up doing a lot of manual testing to make sure that things haven't slowed down. You can always improve it more, and what you can actually measure is in a way deeply unsatisfying: okay, I've done a lot of work and I've measured five workloads out of billions of possible workloads.

But it's nonetheless important.

CLAIRE: I have this vague memory from when I worked in the kernel group at Sun, and I should have corroborated this with someone like Brian before today's podcast, but basically it went like this: someone would be integrating a feature into the kernel, and it introduced additional work that was going to negatively impact performance in some way, but it was a trade-off we had to make.

We had to introduce that capability, so we were going to take the hit and accept that performance degradation. The deal was that you had to go make a performance improvement somewhere else, to try to make it a zero-sum game, if you will. Okay, we're going to give this up, but we're going to give you this instead.

Is that something you've ever seen before? Is it possible that my memory is valid?

JELTE: It's definitely a thing. I think more recently in Python, someone was trying to do a similar thing. They were trying to make Python better for multi-threaded workloads, and to make up for the loss in performance on single-threaded workloads, they added a ton of performance improvements that would compensate for the 10 to 20% loss caused by the extra synchronization: atomic operations, stuff like that, and locks that need to be taken.

And I'm not sure how that actually ended up. I think for Python, the thing that happened might have been that they just took the performance improvements and then didn't take the multi-threading, or at least not yet. Because that's also possible, even if everyone has to agree that the feature is worth the performance loss.

CLAIRE: What's that quote that you gave the other day, Marco, that I like so much, from your PGConf.EU talk in Prague on distributed systems?

MARCO: I think it's: if you get something nice, you'll probably have to give up something nice. That's often the case in distributed systems, although it does apply to computer systems in general, that very few things actually come for free.

For example, in databases there's often a balance between reads and writes: you can make your reads a bit faster by doing a little bit more work on the write side, but now your writes are slower. So what is the right trade-off there? And then you could add some kind of dial so users can control it, but now you have a dial, which is annoying.

Now you have to configure the system; it's more complex. So yeah, that often comes with the issue of performance: you have to make particular trade-offs in your system design.

CLAIRE: All right. So what are the other big challenges of performance benchmarking? You've talked about a few, so maybe enumerate the ones we've talked about already.

And then have we missed any? Are there any we haven't gone down the rabbit hole on?

JELTE: I think the list we have so far is: figure out what to benchmark from the whole space of benchmarking. Figure out what you want to get from this benchmark, because sometimes at the start you don't even know what the question is; you start with one idea, and it's not actually the reason you end up benchmarking in the end. And then there's automating it: setting up the system, choosing everything and configuring it, and making sure you didn't configure it wrong, because that happens all the time.

Did I miss anything, Marco?

PINO: Jelte, is that repeatability, or is that separate from...

CLAIRE: Pino, let's get through the list. I want to get it all together. Figure out what to benchmark; figure out what you want to get, what's the question; automating and configuring. What else?

MARCO: Yeah. So I think we've talked about the thing you're running on, the hardware, where there's often a certain amount of variability, especially in cloud environments.

You need to accept that in some ways and try to measure it, so that's a challenge. And then there's the sheer number of variables, of dimensions, that matter, and the fact that you don't know those dimensions upfront and cannot always easily put a number to them, like the working set issue. Partly, in benchmarking you're figuring out which dimensions matter as you go along.

That process makes it quite challenging, or at least very time-consuming, because you don't know what you're going to find at the start.

JELTE: I think one final one is when to stop, because you can continue forever. Every time you find something new, you're like, oh, now I can do this thing, now we can see if this is faster or slower.

You have to timebox it. Or, one way to timebox is just until you really don't want to anymore. That's usually when I stop with benchmarking. It's like a hard stop: oh no, I don't want to do this for the next two months at least.

PINO: So in the chat, some folks have contributed the idea that if you're benchmarking on cloud infrastructure, the hardware can change underneath you. Any advice about that? And is that one of the challenges?

JELTE: It's an issue. Cloud infrastructure especially is very variable in performance in general.

So the only way to work around that a bit is just running a lot of benchmarks, running a whole handful, and hoping the difference is large enough that it's obvious that one is better than the other. I guess finding sort of the bell curve, and then seeing, oh, this peak is clearly much more to the right than the other.

That's the only real way I've been able to do it, but it is time-consuming and it costs more money, because you have to run the same benchmark over and over again.
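
A tiny sketch of looking at the spread rather than a single number: collect the headline result of each repeated run and summarize it. The seven results below are made-up examples.

```python
# Summarize repeated runs of the same configuration instead of trusting one.
import statistics

def summarize(results):
    results = sorted(results)
    print(f"runs={len(results)}  median={statistics.median(results):.0f} tps  "
          f"min={results[0]:.0f}  max={results[-1]:.0f}  "
          f"stdev={statistics.stdev(results):.0f}")

# Made-up tps numbers from seven runs of the same setup:
summarize([10450, 10980, 11020, 11100, 11150, 11230, 12890])
```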

CLAIRE: I want to add a challenge, which may be bigger for some people than others; it all depends on your communication skills. I feel like once you've run these performance benchmarks, you then have to turn around and explain the conclusions to your team, some of whom may not be technologists, especially if you're trying to influence the product strategy with it, right?

Or the go-to-market strategy or whatever. Someone's got to interpret all those charts. Someone's got to decide which are the right charts to share out of these bazillions of iterations that I've run. Is that a challenge, or does that just come naturally to both of you?

JELTE: It is a bit of a challenge, because by the time you're done, you want to show everything. You're like, I spent so much time on this, I did so many benchmarks, I want to show all the benchmarks I did. But then you miss the point of the benchmark at the end.

So it doesn't come super naturally, but once you see, oh, I have a graph with 20 bars, it's obvious that's not really tenable. You have to filter out the ones that don't really add much. And yeah, that does feel painful, because you spent so much time getting all this data.

But in the end, that time is really spent finding out which bars are important, which data points are the ones you care about now but didn't know you would care about before the benchmarks.

CLAIRE: All right. Yeah.

MARCO: And I think part of the challenge with the sharing is the storytelling, and that you don't know the story up front, though sometimes a clear picture emerges. The bigger stories are things like how two competing systems compare, for example. But you need to be very careful with those stories, because you need to be very confident.

There's also a confidence interval, right? Maybe you think the system slowed down between versions, but actually it was your cloud hardware; you were just a bit unlucky and kept getting slower machines at a particular time of day. That happens, but there you can afford to be wrong.

For some of the bigger stories, like determining your go-to-market strategy, you cannot really afford to be wrong. So there's often a lot of confidence building you try to do when you run benchmarks, even running the same thing at different times of day; we've seen that matters in cloud environments.

Strangely enough, especially for distributed systems, because the network gets more loaded at certain times of day. But then again, it's also about when you stop. You can always try to become more confident, but at some point you need to stop and tell a story, and then hope you're right.

CLAIRE: Melanie Plageman in the chat is wondering if either of you want to start a benchmarking support group.

MARCO: That sounds like a good idea.

CLAIRE: "Performance Testers Anonymous" is Jeremy's proposed name.

JELTE: Yeah, I would definitely join that.

CLAIRE: Okay. So

MARCO: A few other people who will instantly join that.

CLAIRE: All right. It looks like you've already got a group of 10 to get started.

Marco, I know last year there was a third-party comparative performance benchmark published that compared several different databases against each other. I'm just curious, do you want to talk a little bit about why companies will hire third parties to run those kinds of performance benchmarks when they want to publish them, like when they want to make the results available to customers and prospects and such?

MARCO: One of the reasons is that you cannot trust the vendors, right? The vendor will continue tweaking their system until they're the fastest.

So every vendor-published benchmark ever will show the same kind of result: our bar is bigger than the other bar. Involving a third party can bring a particular kind of fairness in the choice of setup, and also in how much time gets spent on tweaking each system, those kinds of aspects.

So yeah, it's a little bit more objective, but of course these comparisons are also often more simplistic in a way: often they have a single workload on a single setup. And then one bar will be higher than the other bar, but it could be that if they had used a different workload or a different setup, it might have been different.

But you also have to interpret it. In this case, it was the comparison between Azure Cosmos DB for PostgreSQL and YugabyteDB on the HammerDB TPC-C benchmark, and the difference on that workload is quite huge. So it's not just a setup thing.

You can tweak the systems in different ways: enable HA, disable HA, have more memory for one system or less, change some settings. But the difference is actually a little bit too big for any of those things to matter. Of course, it says nothing about, say, analytics performance.

So you also have to interpret: is this relevant to my workload? And TPC-C, again, is simplistic, not representative of a specific workload, but it does stress transactional systems in a pretty good way; it gives them a lot of complex work to do.

So that's nice. But yeah, you always have to read it with your own interpretation.

CLAIRE: So as we've meandered around the performance benchmarking topic in the last hour, did any other stories come to mind that Pino and I haven't asked you about, in terms of your experiences benchmarking? Your big epiphanies and lessons learned, your big screw-ups, your failures, like the mistake you want to make sure anybody listening to this podcast doesn't have to go through themselves?

JELTE: I think one super simple thing is: make sure your driver, the thing that's running the benchmark, is running a modern operating system. Because in one benchmark I did, at some point we figured out that the benchmark nodes, not the one with the database, but the ones with the benchmarking suite, were using some old version of OpenSSL.

And that was so slow that it slowed down the whole benchmark, without actually showing up in the CPU statistics or anything like that; there were some locking issues there. But yeah, make sure that at least the systems you're running on have modern software on them, because otherwise you're just going to run into the same performance issues that other people have already solved.
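
A tiny sanity check in that spirit: print what the driver machine is actually running before trusting any numbers from it, including the OpenSSL the client libraries will use.

```python
# Quick look at what the benchmark driver is actually running.
import platform
import ssl

print("OS:      ", platform.platform())
print("Python:  ", platform.python_version())
print("OpenSSL: ", ssl.OPENSSL_VERSION)
```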

CLAIRE: Can you define what a modern operating system means to you?

JELTE: If you run Red Hat, only the newest Red Hat, definitely not the one that was released five years ago and already had outdated software back then.

CLAIRE: Okay.

JELTE: But yeah, or Ubuntu, but also not the oldest Ubuntu. Just the newest release of whatever OS you care about.

I think that's the short version.

CLAIRE: Okay, got it. Marco, any failures you've experienced that you want to warn someone else about?

MARCO: There's something that comes to mind, but it's very specific, so I don't know if it's a good general warning. It's just a frustration I have that sometimes you just cannot figure it out.

So Citus has this feature where you can query from any node. The client can connect to some random node and then it does an insert or something. Maybe the shard for that insert is on the same node, maybe it's on a different node; if it's on a different node, it'll make a connection and send the insert over there.

In principle this scales to very high throughputs. And we've seen this weird thing happening when we benchmark it over and over again: we put a lot of load on the system, but the utilization actually goes to like 70, 80% instead of moving to a hundred.

My theory is there's this thing going on where, if a node gets a little bit too hot, the rate at which it offers work to other nodes slows down as well, and therefore the system internally slows itself down again. So you get into this equilibrium at around 80% utilization.

So we're not bottlenecked on anything; it just stops there. I've never truly been able to figure out and understand why this happens. It's not necessarily bad, because it's self-regulating: you don't hit super long queues and super spiky response times, so it's actually a sort of nice property. But I have no clue why it happens, and I've run a lot of benchmarks on this.

I still hope to figure out one day what the mechanics behind that are. But yeah, sometimes it's just extremely hard to understand why you get a particular result.

CLAIRE: You were going to talk about how important it is to sometimes step away from your desk, get outside, go hiking by the canals and take a break, and then you have the epiphany and you figure it out.

But you haven't figured it out yet in this case.

MARCO: Yeah, we're still waiting for this epiphany, but that's definitely good advice. And in general, I feel like the important thing in benchmarking is really doing a lot of thinking and understanding: why would I get this result?

Like why, if I change this variable, I don't know, some silly thing like work_mem, why do I actually get better results if I have, let's say, a lower work_mem? That can be curious. One thing I enjoy, though it's also really hard, is just trying to understand those things and figure them out.

And then if you can figure it out, you also know why you got a particular result and you have confidence in your result. You can share it.

That's the hidden goal of benchmarking: the first goal is you run the thing and you get the numbers, but there's a second goal of actually understanding the system very well. And that's what's enjoyable about it, but also what's sometimes frustrating about it, because you cannot always understand it.

CLAIRE: So you get something good, which is the understanding of the system, but you have to give up something good, which, I don't know, somehow relates to frustration. I'm trying to make a connection to your great quote about distributed systems.

MARCO: You have to give up an awful lot of time. To me this is one of the most time-consuming and hardest things in database development; sometimes you can be doing this for weeks or even months.

So that's definitely a big nice thing you have to give up.

CLAIRE: But it totally changed your life, Marco, when you think about it...

MARCO: Yeah, I got a lot of nice things back. I gave up a lot of nice things and I got a lot of nice things back.

CLAIRE: It changed your career path. It changed the path of who you spend your time with in your life.

Going back to your AWS part time CloudFront performance benchmarking job that you did while you were getting your PhD.

MARCO: Yeah, maybe I should do more benchmarking. It's definitely an important thing, and it's hard to measure the impact, but it can be, and often is, more impactful than actually writing code and adding features to your product.

Although it's hard to know that up front, you'll only know in hindsight years later.

CLAIRE: All right. Pino, any final questions before we wrap up?

PINO: I don't have any final questions, but I've really enjoyed this conversation. It's been great.

CLAIRE: I'm seeing something in the chat that I just want to add to the voice conversation, which is Melanie's tip of always checking the logs.

I don't know if you guys can see that comment, but she said: one time I was benchmarking a patch and couldn't figure out why it was slower, until I realized it was emitting extra warnings because of a mistake in my patch, and emitting the warnings was affecting the performance. I think we would all agree that checking the logs is a thing that people do and should do, and it's good to remind oneself.
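
A small sketch of that tip: after a run, count the warnings and errors in the Postgres log so an unexpectedly chatty log doesn't silently skew the numbers. The log path is only an example.

```python
# Count log levels in the Postgres log after a benchmark run.
from collections import Counter

LOG_FILE = "/var/log/postgresql/postgresql.log"  # example path

counts = Counter()
with open(LOG_FILE) as f:
    for line in f:
        for level in ("WARNING", "ERROR", "FATAL"):
            if level in line:
                counts[level] += 1

print(counts)
```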

Okay. Jelte, Marco, thank you both very much for joining us. This has been an awesome conversation. I've really enjoyed it. And I'm sure a lot of people out there will too.

MARCO: Yeah, it's been great. Until my next appearance, I don't know when that will be, but this was a lot of fun.

JELTE: Yeah. I had a great time.

CLAIRE: The next episode of this podcast is going to be recorded live on Wednesday, February 7th at 10 AM PST also on Discord. The guests and topics are still TBD. One of the guests has already confirmed, but we're still working on lining up a second guest. That'll get announced soon.

If you want to reserve the time right now, you can mark your calendar with this short URL: aka.ms/PathToCitusCon-Ep12-cal.

You can get to past episodes and links to all the platforms where you can listen to this podcast at aka.ms/PathToCitusCon, all one word.

And you can find transcripts on the episode pages on Transistor too.

PINO: We'd also like to ask you a favor, especially if you've enjoyed this podcast: please rate and review us on your favorite podcast platform. That helps other folks find this new show.

And a big thank you to everyone who joined the recording live today and participated in the live text chat on Discord. It was fun!

CLAIRE: Thank you. Ciao ciao.

Creators and Guests

Claire Giordano
Host
Claire Giordano
Claire Giordano is head of the Postgres open source community initiatives at Microsoft. Claire has served in leadership roles in engineering, product management, and product marketing at Sun Microsystems, Amazon/A9, and Citus Data. At Sun, Claire managed the engineering team that created Solaris Zones, and led the effort to open source Solaris.
Pino de Candia
Host
Pino de Candia
Pino de Candia is a software dev manager at Microsoft since 2020 and is currently working on the Citus open source project. Pino previously worked on the managed PostgreSQL database service in Azure Cosmos DB for PostgreSQL, which includes Citus on Azure support for distributed PostgreSQL. Pino has lived in New Orleans since 2017.
Aaron Wislang
Producer
Aaron Wislang
Open Source Engineering + Developer Relations at @Microsoft + @Azure ☁️ | @golang k8s 🐧 🐍 🦀 ☕ 🍷📷 🎹🇨🇦 | 😷 💉++ (inc. bivalent) | @aaronw.dev (on 🟦sky)
Ariana Padilla
Producer
Ariana Padilla
Program Manager at Microsoft in the Azure Database for PostgreSQL team | Avid Traveler 🛫 & Foodie 🍽️🍹