My Journey into Postgres Monitoring with Lukas Fittl & Rob Treat

My Journey into Postgres Monitoring - Episode 10
===

CLAIRE: Welcome to Path to Citus Con, the podcast for developers who love Postgres, where we discuss the human side of open source databases, of course, Postgres and the many PG extensions. I want to say thank you to the team at Microsoft for sponsoring this community conversation, and I'm Claire Giordano.

PINO: And I'm Pino de Candia.

Today's topic is My Journey into Postgres Monitoring. I have the pleasure of introducing Lukas Fittl, founder of Postgres monitoring company pganalyze. Lukas says that pganalyze is the best way for developers to find out what's slow with Postgres, find missing indexes, and improve its performance. Lukas bootstrapped his own company, building it to solve his own use case. He formerly worked at Microsoft, Citus Data, and Product Hunt. He's been working on Postgres since 2006 and gave a talk at PGCon in 2020 called What's Missing for Postgres Monitoring, and more recently one on how to automate index recommendations for Postgres.

Welcome, Lukas.

LUKAS: Thank you. Welcome.

CLAIRE: And I would like to introduce our second guest, Rob Treat. Rob is a longtime Postgres practitioner who was involved in the early development of Circonus, an infrastructure monitoring solution built for scalability with deep metric analysis. He has worked with consulting clients on Postgres things, including Etsy, DoorDash, National Geographic, and 3/5 of the FAANG companies, as well as other monitoring companies. And he's given a bunch of talks on the Postgres conference circuit about monitoring, like, way back in the day at PGDay San Jose, a talk called Check Please: What Your Postgres Wishes You Would Monitor.

And he's also been a speaker at Citus Con: An Event for Postgres in its inaugural year, where he talked about advanced int to bigint conversions. Welcome, Rob.

ROB: Howdy. Thanks for having me.

CLAIRE: All right. So let's dive in. Today's topic is my journey into Postgres monitoring. And Pino and I have so many questions for both of you about why you both got into the area of monitoring, why you care about it, and why other people care about it.

So with that as our goal Lukas, let's start with you. We really want to, we want to know what the origin story is for pganalyze.

LUKAS: Sure. Let's see, let's dial back to 2012, I think. Whoa! That's a long time ago in this galaxy. So the background story you mentioned, in your intro that I had worked with Postgres since 2006, right?

And so that was really as an application engineer, I would say. Back in 2006, we started a company. We were using Postgres as our main kind of data store. And we got into challenges scaling the Postgres database back then. That's many Postgres releases ago. And so of course, replication was still in its infancy in terms of how it worked in Postgres, all these kinds of fun things.

And so a couple of years later, after I had seen the struggle with a Postgres database that's slow, that's not scaling as you expect, I was essentially at the point where I was like, I want to build something that does this better. And at the time, around 2012, pg_stat_statements, which in Postgres is a way to track query statistics, got essentially a major upgrade thanks to Peter Geoghegan, as part of work that was back then funded by Heroku, improving how Postgres tracks query statistics, so that now you could actually get a lot more detail on which queries have run on the database.

And so that kind of sparked that initial idea: what if we just made a dashboard that showed all the queries that have run on your database in, like, the last hour, the last 24 hours, just giving you a really clear overview of the database's view of the world? Because as an application engineer, you're oftentimes not really seeing that, right?

You're just stuck in your ORM and you maybe have an APM tool that shows what's going on in individual requests, but it's really hard to say, if your database server is at 90 percent CPU, what's really the root cause. And so that's where the idea for pganalyze came in: solving that problem essentially for myself and for others.

And then, fast forward many years, now we've been able to build a successful business around that and bootstrap that business. And so we're now a team of a little bit under 10 people, scaling out the company, and are actually able to invest in things that are not maybe just query statistics, but other things like index recommendations as well.

PINO: Could you tell us more about how long it took to get that first version of the dashboard out?

LUKAS: Yeah, good question. I'm fuzzy on the details given it's been, 11 years now. But I think it wasn't that hard to get. So the initial version I think was a Python script that was collecting the data.

And then it was sending that data into a Ruby on Rails application. And pganalyze is still Rails today, so it's kept some of those origins. And I think it must've been probably like two to three months worth of effort, right? I think at the time there were a lot of things not clear yet about how that query statistics side of Postgres really worked.

And so I think there was a lot of uncertainty around the initial version. And then I would say, between the first version that I myself used and the first version that somebody else used, probably a couple more months passed. So in the end, I would say probably sometime in 2013 was when we had a couple of different companies using it.

CLAIRE: Well, it's really exciting for me to see you grow and scale pganalyze. I totally miss working with you. Full disclosure for the audience: Lukas and I worked together at Citus Data and then at Microsoft after the Citus acquisition. So it was a big disappointment when you left, but you're still in the Postgres world and pganalyze is clearly flourishing and growing, which is cool.

LUKAS: That's right. And I think the additional context there for those of you who don't know my own bio by heart is that, obviously pganalyze when it started out was too small to sustain itself as a company. And so I was essentially running it as a side project. For example, when I was at Citus Data luckily the founders of Citus Data were, very generous in allowing that.

And so I was able to essentially build the company in parallel to, being part of Citus and then also had, the experience of kind of joining Microsoft and seeing Citus turn into what's now Azure Database for Cosmos, sorry, Azure Cosmos DB for Postgres.

CLAIRE: Nice. Okay.

Well said. Okay. So Rob, you were at Circonus in the early days. What was that like and what did it have to do with monitoring?

ROB: Sure, yeah. Circonus was a project that kind of came out of the consulting work we did. So I started my career doing development and systems stuff and ended up in this accidental DBA type category.

And eventually I started working at a company called OmniTI, which was a scalability consulting shop, and they had some large projects. A lot of it was Oracle to Postgres migration back in the day, trying to do these multi-terabyte databases on Postgres. And back in the day here is, yeah, the early aughts. And so as we were doing consulting work, there's two kinds of consulting that people get into. One is that you go work on one client for two or three years or whatever, and it's more like contracting. And then the other version of consulting is you take the skills that you have toward the technology, and then you try to apply them to as many different problem sets as possible.

And I was fortunate enough to be in a group where we were really doing the latter kind of work, where we were just working on all these different projects for people who were trying to scale Postgres, having a lot of the challenges that Lukas is talking about, in the very same time frame, and trying to get that to work.

So one of the problems that we had doing that work, for really high scale sites, was that either they had a monitoring system that really didn't cover the things that we needed in Postgres, or they had a system that just couldn't scale high enough. We were always fond of building our own tools for the work that we were doing.

Like one of the things we built years ago was this tool called OmniPITR, which basically was point-in-time recovery and WAL replay. And one of the things we built out of that was the ability to do backups off of replica machines. And we had that in 8.3. And so there were just things like that where we were always like, if we need a tool, we'll go build it.

That's why we do open source, right? That was a big part of it, to give us that freedom. And so we ended up building this thing. I think the project originally was called Reconnoiter, the monitoring project that we built. And one of the two key pieces was, because we were doing a lot of Postgres work, we knew we had to be able to support Postgres and have good monitoring and analytics capabilities within that.

And then the other part of that was we built the initial versions on Postgres because it was our bread and butter. So we thought, let's see how far we can take this. And that first version that really was released, I think, was on Postgres 8.4. So it even predated any kind of built-in replication, which was actually one of the pieces that ended up being really tricky for us, given the amount of data ingestion that we had and whatnot.

But that code is actually still out there. There's some interesting schema stuff in there. I went to look to see if it's still floating around on GitHub, and it is actually still out there. So it's definitely old, but there were some really interesting ideas in there. And again, as part of that, having to run Postgres in order to monitor Postgres means your Postgres has to run better than everybody else's Postgres.

Cause you can't afford downtime, right? It's a monitoring system. Like it's gotta be up all the time. So I think that's one of the unique challenges when you go down this rabbit hole of we're going to build and run monitoring services and keep those up and running. It's part of the interesting thing.

I think both, I'm sure Lukas has seen that at pganalyze and we definitely saw it with the Circonus stuff.

PINO: Rob, was that your major case of eating your own dog food, or did you have other Postgres servers that you ran and operated?

ROB: No, we had others that were definitely, I would say, bigger and just had more varied queries against them, analytics stuff and OLTP.

So I wouldn't say that. The thing that was difficult about that for us as a solution was the size of the data, which ultimately we knew was going to grow, because we had a plan. Our thing was, we want to keep 10 years of data and we want to keep granular data. So one of the problems that you had a lot of times back in those early days was that most of the monitoring and trending solutions were built on RRD.

And I have a special place in my heart for RRD. It did really good things for the industry to get people to be able to trend and monitor stuff. But one of the things it did was it always did these rollups of data. So when you would go back in time, two or three years, you would lose fidelity of the information that you had.

And it was really much more difficult to see, 12 months back or more if I'm trying to do year to date metrics versus last year and that kind of thing. It's much more difficult to see that in RRD. So one of our things was we're going to keep all this data and we want to do it for 10 years.

And that was where it really became, like, you have all this metrics data and you could just do the projections as you added more clients. We were building it as a multi-tenant system. So if we had built it maybe in a way where each individual client would have gotten their own thing, that would have definitely been easier.

But when you make it multi-tenant, then even if somebody has some multi-terabyte thing that's doing e-commerce, that's a lot of data, and it starts out way bigger than the monitoring system, but eventually you could project out that this monitoring solution is going to need way more data to be held, and to be able to pull that data back and forth and whatnot.

So we could see where the problems were going to come, especially on a pre-replication Postgres.

LUKAS: So if you don't mind me asking, this is pre-partitioning as well, right?

ROB: No, we had partitioning, but it was partitioning based on table inheritance. So it depends on what you count as partitioning; it predated the built-in table partitioning that Postgres has now.

And it was all inheritance stuff. But I had been using inheritance, for better or worse, for, I don't know, maybe since the late nineties at that point. So I was very familiar with that and we used that pretty extensively, and we even did sharding, so to speak, where we built a namespace with different schemas depending on which ingestion metric piece was coming in.

And then we had the ability to use point in time recovery to split servers off and make them second servers. And we just had to put logic in the client to know. Like, it would come in and it would ask, "Hey, where does my data go?" And then it would return either you should be in this schema or you should be on this server.

And so the little apps that were doing the metric ingestion and sending the data in, every time they would start, they would call and figure that stuff out. And then if we had to split something onto its own server and move it to a different machine, we just changed the logic of that so they could easily figure out where their data needed to go.

So it was a lot of kind of manual sharding, but in that case, we controlled the whole product line, right? We controlled the piece that was doing ingestion and we controlled the backend, so we could have those two pieces work together. It didn't have to be completely transparent between one and the other.

We could have them work together and know what they were doing. So that definitely helped to scale it.
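For readers who want to picture what Rob is describing, here is a minimal, hypothetical sketch of pre-declarative partitioning with table inheritance and CHECK constraints. The table and column names are made up for illustration; this is not Circonus's actual schema.

```sql
-- Parent table: holds no rows itself, the children carry the data.
CREATE TABLE metrics (
    metric_id    bigint           NOT NULL,
    collected_at timestamptz      NOT NULL,
    value        double precision
);

-- One child per month, with a CHECK constraint so the planner can skip
-- out-of-range children when constraint_exclusion is enabled.
CREATE TABLE metrics_2009_01 (
    CHECK (collected_at >= '2009-01-01' AND collected_at < '2009-02-01')
) INHERITS (metrics);

-- Inserts are routed to the right child, typically by a trigger or, as
-- Rob describes, by logic in the ingestion client itself.
INSERT INTO metrics_2009_01 VALUES (1, '2009-01-15 12:00+00', 42.0);

-- Queries against the parent scan all children, minus the excluded ones.
SET constraint_exclusion = on;
SELECT count(*) FROM metrics WHERE collected_at >= '2009-01-01';
```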

CLAIRE: Okay. So I should have interrupted you like three minutes ago, but just a point of clarification: does RRD stand for Round Robin Database tool, or something else?

ROB: I believe it stands for Round Robin Database.

CLAIRE: Okay. Just for people who are listening, I want to make sure they understand.

ROB: And I think where most people would know RRD from is probably a tool called Cacti, which I believe is probably still out there. And I think there was another one, MRTG, that maybe also used RRD. So there were a number of these sort of open source tools, mostly for monitoring and trending.

And they were built on top of RRD. So RRD was the thing that stored the data, and then they would build these graphical front ends on top of it in order to see the data and be able to browse around and stuff. And I think Nagios even put that in at some point.

LUKAS: At a later point, right? Yeah, I definitely recall from back when I was doing some more infrastructure stuff in the late two thousands, that was definitely the way to go: having your RRD tool generate the graphs.

ROB: That's what everybody was doing in those very early 2000s times.

CLAIRE: So as I was working with you both to invite you to join the show and come up with the topic, and that decision to focus on your journey into monitoring Postgres, I started to wonder: monitoring versus observability?

And I'm curious, like, how do you see those two things? Which one do you care about more? Which one do you spend more time thinking about these days?

ROB: I'll take a stab at that, but my stab will be maybe from a different direction than Lukas, because I don't know his background well enough. One of the things when I went to OmniTI that was a big sort of part of theirs, they used a lot of Solaris and OpenSolaris and Illumos.

So we had access to, yeah, we had access to DTrace back in the day. And so to me, when I think of observability, it goes back to DTrace as the ultimate observability tool. And that saved our bacon on a number of projects that we had to deal with over the years. And so observability now is, I don't know, more of a marketing term, and it means different things to different people, so it's a little bit tricky.

But I think that your bread and butter day to day, to me, is monitoring. You're going to be doing that; that's 90-plus percent of the work. And observability is more about the debugging side, when something has really broken in a way that you don't understand. So that's how I split that, but other people may split it differently.

CLAIRE: It's really interesting you mentioned DTrace, because the creators of it, like Bryan Cantrill, Adam Leventhal, and others, host another podcast. I think it's called Oxide and Friends. And that was actually the inspiration for this podcast. They host it on Discord and they have the live text chat and all of that.

So yeah, shout out to the DTrace team.

LUKAS: It is a great podcast.

CLAIRE: It's a great podcast. So if any of you are interested in systems, computer systems, software, hardware, the whole kit and caboodle, they have some great guests and some good conversations and, okay. So

LUKAS: If you don't mind me jumping in, yeah

CLAIRE: Not only do I not mind, I want you to jump in. Go.

LUKAS: I do concur with Rob that observability is a little bit of a marketing term. Not always, but I feel like the reason we stopped talking about monitoring and started talking about observability is because companies wanted to market around that term.

I do think that there is some merit to projects like OpenTelemetry, for example. So I think, if you look at the evolution of application monitoring in particular, it has always been very vendor driven, right? So you'd always have different companies making different ways to monitor, and you would add, I don't know, the vendor SDK into your application and that would then make your application...

There goes my voice, that's not good. So the idea is that OpenTelemetry actually made that vendor agnostic. And so I think the one big benefit that I see around the observability story these days is that it is a vendor agnostic story of how you're tracking application side metrics. I don't think that necessarily relates to what we do in the Postgres world.

So to me, when I think of Postgres monitoring or Postgres observability, they mean the same thing. I don't really see a big difference there. But I think on the application side, to me, observability means the more modern way of doing this with OpenTelemetry, with having tracing, right? So you would have traces collected by the application.

Look at tools like Honeycomb, which I'm a great fan of, that give you that level of insight on the application side, using a standard that means you're not really tied to a single vendor.

ROB: And one thing just to toss in there. I don't think Postgres, or the sort of core Postgres hackers or whoever you want to call that group of people, believe in observability in the way that it is generally used, right? If you're doing Postgres now, they believe you're going to use SQL to query your database and you're going to look at log files and that kind of stuff. They don't expect you to collect traces on Postgres, right? Or even really look at core dumps or whatever. That's just not the normal way that I think Postgres hackers think about people interacting with the database. So I think that's part of also why, like you say, observability is not really how you approach Postgres problems in my mind.

LUKAS: The thing that still surprises me is most people don't look at their logs for Postgres, or at the error logs. And there's so much useful information in there. And if you talk to the typical hacker, they would say, yeah, sure, I'll put a log message right when this happens, but most people just don't look there.

ROB: Yeah, so the Postgres hackers expect you to look at your log files. And that's in my checklist if I'm evaluating a new Postgres monitoring solution, and it was one of the first things I looked at with pganalyze.

I was like, okay, so how do I get access to log files? Does it have any mechanism for parsing those log files and getting that information? That was one of the things where, early on, I think in pganalyze the story was like, oh yeah, we need to figure out how to surface log file information, because there's just stuff you can't really get any other way.

And I think if you come from a SQL Server background or Oracle background, the belief there from those DBAs is that everything should be accessible via SQL statements. And I don't have a problem with that philosophy. I just know that's not how Postgres approaches it, for better or worse; they expect you to have access to your log files.
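As an aside for listeners who want to act on this: the Postgres log only gets interesting if you tell Postgres what to log. A minimal, illustrative starting point might look like the following; the thresholds here are examples, not recommendations from Lukas or Rob.

```sql
-- Log statements slower than one second, plus autovacuum runs,
-- lock waits, checkpoints, and temp file usage.
ALTER SYSTEM SET log_min_duration_statement = '1s';
ALTER SYSTEM SET log_autovacuum_min_duration = 0;  -- 0 = log every autovacuum run
ALTER SYSTEM SET log_lock_waits = on;              -- waits longer than deadlock_timeout
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_temp_files = 0;               -- 0 = log every temp file created
SELECT pg_reload_conf();                           -- these take effect on reload, no restart needed
```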

CLAIRE: All right. So we're going to dive into how Postgres users, as well as the Postgres developers and hackers, think about monitoring in a sec, but I want to go back to the Honeycomb reference that you threw out a second ago, Lukas. In preparing for this I know that Charity Majors, who's the founder of Honeycomb, I think.

LUKAS: I think that's right. And CTO, I think.

ROB: Yeah, one of, one of the co founders. Yeah,

CLAIRE: Exactly. So she has a quote out there about this monitoring versus observability question. And the way she described it is that "monitoring is for known unknowns. And setting thresholds and running checks against the system and observability is about the unknown." Does that ring true for you both?

LUKAS: I think there's some merit to that. But I think her perspective is different than the Postgres perspective would be, right? Like that statement, I think, should be read in the context of application side monitoring or observability.

But again, in my personal experience, it doesn't really help if I give special meaning to the word observability. I think it's really more about, where's my slow query, right? That's the language that people use. They don't really talk about observability or monitoring.

CLAIRE: Okay, so the focus in Postgres is where's my slow query?

LUKAS: For example, right?

CLAIRE: Or why is my query slow and how do I fix it?

LUKAS: Exactly. Or why is my database server at 90 percent CPU? I think we're talking about a different kind of system than the application world, where you might have requests, like bad experiences that customers are having, that you don't know of, right?

So it's a different situation, at least in my mind.

ROB: I think so here's how I look at that, that I don't know that it's a different kind of system in the grand scheme. It's just that the way that hackers expect DBAs and sysadmins to interact with a database is different than how you would probably interact with code that you wrote for a web service, right?

Again, I don't think anyone expects a DBA to go look at a core dump. We did that as consultants, cause we were into that, and like I said, we used things like DTrace and tools like that. The Postgres hackers would never expect that a DBA or sysadmin is going to run DTrace on a Postgres backend to figure out what's going on with the system. And whether that's right or wrong, I don't really want to advocate for a world where everybody needs to learn how to use GDB if you're not doing C development. So I understand the merit, I think, of their approach in saying we don't want everybody to have to do that. But I think that's where the line is drawn for some of these folks. I'm not super familiar with Honeycomb, but from how I understand it, they say no, you should actually have tooling at every piece of the stack, and you should be able to get trace messages from all your systems, whether it's Postgres or anything else.

And I think that's just where the gap is between the tool that Postgres hackers are trying to build for end users versus, where some of these observability platforms are out there trying to put something together.

LUKAS: Yeah. And I'll add one more thing to that. The thing we actually added recently with pganalyze was a way for us to send tracing data into OpenTelemetry systems, right?

So one of the things that Honeycomb, for example, is great at is it's really a full OpenTelemetry system. And so what we can do now is, when we see a slow execution, like a slow EXPLAIN plan from auto_explain, we can actually send that as a span into an existing trace in an OpenTelemetry-type stack like Honeycomb.

And I think that's really the experience that an application engineer should ideally have, right? They look at a trace, and then they look at the SQL statement in the trace, and then there's an EXPLAIN plan or information from the database attached right to that span.

So you don't have to switch tools, right? That's something that I think we, we already see being quite beneficial in combining pganalyze and Honeycomb, for example.
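For context, the auto_explain piece Lukas mentions is a standard Postgres contrib module; the pganalyze-to-Honeycomb integration is their product's own feature, but the Postgres side of it is roughly this kind of configuration. The values shown are illustrative, not pganalyze's defaults.

```sql
-- auto_explain must be loaded first, typically in postgresql.conf:
--   shared_preload_libraries = 'auto_explain'   (requires a restart)
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';  -- log plans for statements slower than this
ALTER SYSTEM SET auto_explain.log_format = 'json';         -- structured output that tools can parse
ALTER SYSTEM SET auto_explain.log_analyze = on;            -- include actual rows and timings (adds overhead)
SELECT pg_reload_conf();
```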

CLAIRE: So a second ago, you started to delineate some pain points that people are trying to solve. Both of you have touched on these pain points.

So I wrote down, why is my database server at 90 percent CPU or where is my slow query or why is this query slow? What are the other pain points that you both have focused on or is that it? Is it really all just about slow queries?

LUKAS: Rob, do you want to take it?

ROB: I think the vast majority is definitely slow queries, right?

Because that's when you get yelled at as a DBA, when the queries are slow. And then I think there's that next level, which is CPU and memory and disk IO, the next layer, and those are things that you probably see as a systems person who is running these services. It may not be the problem yet, but you're like, oh, if I'm always pegging CPU, this is going to be a problem at some point. But the thing that causes somebody to come to your desk, as if we still did that, and say, "Hey, there's a problem," right? That's slow queries. So that is why I think it takes up so much mind share for people.

LUKAS: I think the other thing, go ahead,

PINO: You had a thought. I'll ask my question in a second.

LUKAS: The other thing I found interesting with slow queries is since you mentioned, the people running the systems, like I've also been on the side of, like running a database as a service, right?

Citus Data, for example. And so one thing I found interesting is, as a service provider, how do you delineate what's the customer's problem and what's your problem? And I think slow queries are actually a perfect example of what's not the provider's problem, right? So if you, as the end user, use a database as a service like Azure Postgres, for example, and have a slow query, it's not the expectation that the service provider solves it for you, right?

Versus if there is a really badly performing disk or something, right? Like a hardware issue. Then of course, that would be more on the service provider side. So I think slow queries are also just very commonly the kind of thing that definitely ends up on your plate as the end user of a database.

ROB: And that's definitely one of the ways people differentiate that kind of white glove service versus just a provider of database as a service, right? Is that the more you get into the customer's queries and data, it's a different level of service. It's more expensive to provide and it is a differentiator for some of the companies that do that kind of work.

CLAIRE: Yeah. And I think that's why there are Postgres users, right, running applications on Postgres, who are both running their application in the cloud on a fully managed database service, but also have a consulting company that they're working with that's helping them optimize the performance of their database, or scale it as they scale, or helping them evolve new queries that support new features in their application. That's for that white glove service.

PINO: Is the intent still that? Oh, sorry. Sorry. Go ahead.

CLAIRE: I just want to know, are we saying the same thing, Rob? Is that what you mean by white glove service?

ROB: Yeah, I think so. And the only question, I think, is how much the provider takes on, versus maybe working with consultants, or whatever that mix is.

And I think for most people, because that doesn't scale as well as making tooling that can let people solve their own problems, there's different ways to approach that. So if you're a provider today and you're like, I'm going to do database as a service, and you realize, I have to build out all this tooling.

The other thing you can do is just tell people like go install pganalyze and that gives you a lot of stuff that will help people self service and not have to require individualized care and feeding. So there's different ways to go at that, but essentially, yeah, that's the idea.

So the more you want to take that on and give that individualized service, like the more you will approach like a white glove service kind of thing.

CLAIRE: Okay. Yeah

PINO: I wanted to ask about more pain points. Other pain points we haven't mentioned yet.

LUKAS: I mean, I think the thing that I'll maybe add is one thing that I've personally spent a lot of time thinking over the last couple of years is not just individual queries, for example, but really whole workloads, right?

So I find that it's challenging, at least from my experience, to really think about, holistically, think about all the queries that are running against your database and, like, all the activity that's happening, right? And have you done the right settings changes? Have you created the right indexes?

So I think, to me, when we talk about slow queries, that's one very immediate pain point. But I think there's the bigger pain point of reasoning about your database, essentially making sense of all the workload that's happening. Because databases are so important, right?

Like they are the core of, everything that runs in terms of data often. But then it's really hard to really say what's actually happening. So I think there's a pain point about just understanding what's going on. That's not necessarily an individual slow query.

ROB: Yeah, that ties into there's just this natural evolution of systems, right?

Like you start usually with something that looks very OLTP oriented. But in every system that starts that way, eventually people are going to try to do analysis and aggregates and grouping queries and whatever sort of versions of those queries. And when they get into that, the way you tune for those types of queries is just different than how you tune for the very short insert, update, delete type stuff.

And so now you're like, am I doing that on one system and making trade offs there, or do I replicate out to something else and try to do it, that way? Or, am I doing logical replication versus physical replication? And, then you get into those architecture things that are pain points around figuring out what the right trade offs are for that.

CLAIRE: So how should people think about monitoring? If you have someone that comes to you, and let's say they're a SaaS vendor and they've been trying to monitor things on their own, but have been running up against performance issues and scalability issues, and aren't sure they have things architected right, and aren't sure they've even got the right monitoring solution.

Where do you start that conversation, especially when, Lukas, you're saying they need to think holistically about it?

LUKAS: Right, I think in my mind there's two ways to look at this, simplistically speaking. One is you look at it just from the high level, right?

So you're essentially saying, what's my CPU utilization? What's my, I don't know, aggregate number of sequential scans? Some high-level metrics, right? So that's the kind of stuff that you can send to your Grafana dashboard, where you can use existing open source tools, like pgwatch, for example, to collect these high-level metrics, right?

So I think there's ways to solve it in a high-level sense. And then what I think people should do, in my perspective, is find a way that they can actually drill down into individual problems, right? And I think, to be completely fair, I actually don't think that this is an exact science.

So if I have a spike in CPU utilization, there is unfortunately not enough data in many cases that I can tell you exactly why that spike happened. But oftentimes, for example if you use pganalyze, I could show you which query most likely correlated with that, or which autovacuum maybe was running at the same time, essentially the database activity that correlates, and then being able to say why that ran, right?

If it's an autovacuum, what triggered it. And so to me, how people should think about monitoring is not just at the high level, but really being able to click on a number and understand why that number is there, is what I'm saying.
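To make that concrete, here is a rough sketch of the two levels Lukas describes, a high-level number plus the drill-down into current activity, using only Postgres's built-in statistics views. This is generic SQL, not how pganalyze does it internally.

```sql
-- High level: overall buffer cache hit ratio across all databases.
SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0), 2) AS cache_hit_pct
FROM pg_stat_database;

-- Drill-down: what is running right now, including autovacuum workers?
SELECT pid, backend_type, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;

-- If an autovacuum is involved, how far along is it?
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
```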

PINO: Who is the monitoring for? And is the monitoring available for the right group of people?

You mentioned earlier that some of these things, like CPU consumption on the database, are more of interest to the DBA, whereas the slow query is something the user is concerned about, sorry, the app developer is concerned about, and they go to the DBA asking for help fixing it. Is there a trend towards enabling the developer to solve that problem on their own?

LUKAS: So I'd be curious in Rob's perspective, but my quick take on this is that what we've seen is at scale, right? When companies have many application teams, usually you would have a central like data team, right? Some sort of data team that kind of standardizes how the company provisions Postgres, how the company scales Postgres, all that stuff.

And so I think these problems, like the CPU utilization problem, for example, usually land on that data team's doorstep, essentially. But then the slow query, that might actually be something that only the application engineers can solve. And so I think one of the challenges here is that by default, you'll only have a very limited number of Postgres experts in your company.

And by default, those are probably on that central data team, and so oftentimes I think they're bogged down with slow queries. And so one of the things that I actually care about, and that I think pganalyze can also help solve, is that we can give application engineers better education, better knowledge about Postgres as a database. So that these slow query problems are essentially solved by the teams that actually use the database directly, that know the application, that can change the queries. And then only the really hard problems, where it's not straightforward to solve, or where it's actual CPU utilization, like hitting a bottleneck on some internal locks or something.

Then you actually need that central team's attention, or that data engineer's attention. But I don't know, Rob, if you share that perspective.

ROB: Yeah, I would pretty much agree with all of that. To me, the answer to these questions is not really technical, right? It's just, it's however your company is organized.

And the two axes are: there's a skill level that's required to be able to solve these problems, so if your developers have that skill level, then to my mind they should be able to get access to these kinds of metrics and these graphs and that kind of stuff.

And then the other one is just responsibility. So it's who you've designated as the responsible parties. And as you scale companies, I think, like you're saying, eventually you cannot hire enough DBAs to take that responsibility on for every app developer. At most companies it's 10, 20, 40, 50 to one, I don't know, the number of developers versus folks that would be "DBAs". So from that standpoint, you really need to be able to push that responsibility, and with it the skills, back to the developers to have the ability to actually troubleshoot those problems.

One of our core tenets with Circonus was we wanted to be able to share data with anyone in the company, essentially, because we thought a lot of times, just when you're having a conversation about what a potential problem is... we got this from consulting, where you would go into a place and you could figure out, here's what the problem is, looking at the graphs. And then you wanted to show the graphs to some other team, and it was like, oh, they don't have the right VPN to access the tool that actually shows the graph. So then you're trying to do screenshots and send those in an email and whatever.

And if you need a little bit of a dynamic story there, that just seemed like such a pain. So something that's very easily shareable, web based tooling, to me is very natural, and you want to share it with anyone and everyone. The only caveat is sometimes, if you show someone a graph and they don't understand it, you have to be prepared to explain it to them so that they don't draw the wrong conclusions about what they're seeing. But otherwise, to me, you should be able to share with everyone, and it just goes back to how you are going to map skill set and responsibility for the problems that you know are going to come.

PINO: Rob, are you saying just shareable or also centralized monitoring? Because I'm wondering whether the DBAs and the app developers each need their own monitoring of what's going on inside the database, but perhaps at different levels of granularity.

ROB: So there might be a different sort of question in there, which to me is more about, do you want like a kind of single monitoring interface versus like you want specific tooling?

So if you look at something like pgMustard or pganalyze, two very Postgres focused tools: if you're on a development team and you're making use of Postgres, but maybe you also have Redis as a front end cache, and I say to you, we'll give you pganalyze.

And you're like, how do I troubleshoot Redis with that? And the answer is you don't. So you're going to need some other tools. So then it's just organizationally, do we want to try to take a generic tool, which Circonus would be one, but there's others, Honeycomb, Datadog, whatever, any of those sort of big name brand monitoring systems, and say, all of those can monitor Postgres. It's probably more work to get all the right pieces in there, which is the thing the Postgres specific services give you very easily out of the box, but you can get those to give you the metrics and data that you need.

And so then you can have something that everybody understands how to use and share and whatnot. But that's a different trade off, right? You're just trying to decide, do I need the very specific thing? And maybe the only team that needs the very specific thing is a special data platform team that has to have that.

But ultimately, if they get a special tool, they're going to want it on every database that they're going to get asked questions about. And so then, once they have that connected to those databases, you're going to say, why don't I just share it with those teams?

A lot of companies, I've noticed, end up doing both. So they'll buy the specialized tools and they'll have a generic platform for everybody.

PINO: So do you have specific recommendations for app devs that might have a DBA team? They might be able to influence the DBA team, maybe not. What should they be doing?

ROB: So I don't have a better answer than go install pganalyze, and I feel bad because I'm sure there's other people doing other monitors. Most of these tools are pretty good; the specific ones I think are pretty good. Even if you look at something like Performance Insights in Amazon, I think that's maybe not as Postgres specific as it is oriented toward Amazon's database as a service, but that's a pretty good tool. I know developers who use that and can use that to troubleshoot their problems. I think, again, it's just about responsibility.

If you're thinking sort of service oriented architecture, where you have service teams underneath and the developers are responsible for the whole stack, and maybe they have an SRE team or a DBRE team that they can escalate to when they run into problems, but ultimately they're responsible for that, then I think they need access to the specific tools and probably generalized tooling as well. And you want to do both, but that's a really large company that's going to have that type of setup.

If you're a 40 person company and you've got 20 engineers and a couple of systems and a DBA, that's probably not how you're going to build your tooling, cause it's just more than I think you can probably handle. Cause all of this stuff, that's the other part: even with the software as a service pieces, it takes work to make sure those things are working as well. Even as much as I love something like pganalyze, at places where I have used it there was a team that was responsible for keeping that up and running, meaning making sure it still had connections into the systems and the log files were still getting fed to it correctly, because there are security pieces that have to get put in place to make those pieces work. Even though those tools are very good at making it easy to see the Postgres bits, you still have to have somebody responsible for keeping them running.

LUKAS: But I think, Claire, if you don't mind me touching on one thing really quick, which is, and Jeremy actually also posted this in the chat, these days it's not really an issue of VPNs, it's more of an issue of logins. And so the one thing I do want to call out here is I think it's really powerful when a tool that is only sometimes useful, right, when you run into a problem and you really want to dive into it, is something you then have single sign on for.

And so I think one of the conscious choices that we made when we built pganalyze is that we have single sign on and we don't charge for users, right? So if you add your application engineer to the system, all it takes is integrating single sign on. And then if you want to send them a link, you can. And so I think that's really an important property when we think about these tools that we provide to the teams: that we don't build these artificial gates where people suddenly have to either remember a login, be added as a user, or worst case, you charge extra for a seat.

I think it's really important that people can just access it. But go ahead, Claire.

CLAIRE: Oh, I'm thrilled you jumped in with that. And I want to build on this whole part of the conversation: at the end of today's call I wanted to ask, what are the top five to six monitoring tools that everybody should have either in their toolbox or at least consider putting in their toolbox?

I don't feel like today's conversation should be about, there's one way to do things and the beautiful thing in the Postgres world is that it's an open source ecosystem and so you have choice, which is great, but choice is also hard when you're new because you're not sure which to choose. So I'm just curious if either of you have recommendations.

Top five to six monitoring tools to consider. Obviously, pganalyze is on that list. I will put it on the list. What else?

LUKAS: So my perspective is you should not ask me, because I'm going to be biased. Although I'm happy to share what I know in the open source world. There's a monitoring page on the Postgres wiki that's community maintained.

pganalyze is on there, but many other tools are too. So if you want a non-biased perspective, I would recommend the wiki, but that's my perspective.

CLAIRE: Awesome and Jeremy just dropped that link in the chat. Thank you, Jeremy. Okay, Rob?

ROB: Yeah, I don't know that I want to pick favorites. The two that, for me, I always end up coming back to were pgMustard and pganalyze, when I was trying to solve other problems and then, like, you Google on a certain term.

They have good enough documentation that you end up reading their documentation about some internal piece of Postgres and how it relates to the monitoring world. So I would definitely shout out both of those two as things that you should look at, because they're doing a lot of things correctly.

When I think back to that Check Please talk that I did, now a decade plus ago, the areas that I remember going through, if you look at these solutions, they built them and they do actually check the boxes on you need access to this and that information.

I think I would say, look at those two. The question I maybe would have for Lukas, and I'll throw him in a different direction: if you don't want to use a SaaS solution, then I would be curious if he has a recommendation specifically for that, because I think that's a more sort of difficult area.

CLAIRE: You mean a SaaS solution for monitoring?

ROB: For monitoring specifically, yeah.

CLAIRE: Okay.

LUKAS: And do you mean on premise compatible or open source?

ROB: If you have different answers for those.

LUKAS: I do. I think everyone would agree.

ROB: Go ahead. So then both.

LUKAS: Yeah. So the context there is, because people ask that question, we do offer pganalyze as a kind of self-contained software package as well, right?

So for those people that want to have no data leave their environment, we actually package it as a container and you can just run it inside your systems. So that would fulfill the don't-send-the-data-into-the-cloud checkbox. And if we don't think of the commercial side but of the open source side, I think one project you could look at is PgHero, for example.

That's maintained by Andrew Kane and uses some of our open source projects from pganalyze also. And I think it's a great starting point for monitoring that can just be easily run on your own systems.

PINO: And I'll just chime in to say Andrew Atkinson on the chat pointed out pgDash.io, which I think is another proprietary solution, but they have a lot of community contributions and helpful blog posts.

LUKAS: Yep, that's right. I think pgDash in spirit is very similar to what pganalyze does.

ROB: So maybe the one other thing I would add to this, and this was the mantra that we took from a consulting standpoint: a lot of times, when you go in, if you're starting a new job as a DBA at a company, the first battle you want to fight probably is not the "I need this particular monitoring solution" fight.

Again, most of the existing solutions that are out there have the ability to do things like ad hoc SQL against the database. So you can go and figure out what are the queries I need to run to get pieces of information out of my system. And I probably would say, as a first step, try to put that data into the existing systems that the company is already familiar with, so that you're not really fighting a battle, right?

You're just saying, hey, I want to add monitoring. I don't need a whole solution. I don't need to have different deployments or whatever. How do I just add these queries in? And if for some reason the system can't do that, then you have a leg to stand on in saying, can I go get this other thing?

Because I know it does it all, and it's whiz-bang and awesome. That might be the career advice I would give. That's not the first fight you want to have; go solve some problems, make graphs available to people. And then maybe go and say, hey, can we get some better tooling in here?

Again, depending on how your shop is about bringing in new tools and how difficult that is.

CLAIRE: I think that advice about picking your battles carefully and not trying to change things overnight is good advice when starting at a new company, in all sorts of different functional roles.

Because there'll be so much inertia when you're there. Point well taken. Back to my top five or six monitoring tools that you should have in your toolbox: neither of you mentioned pg_stat_statements. Is that because you were just thinking, okay, add-ons on top of Postgres, and pg_stat_statements is already in Postgres?

Is that because it's too low level? Is that because it goes without saying?

LUKAS: I would argue that's what Rob was alluding to in terms of being able to just run the query. Because I think, ultimately, with pg_stat_statements you can just do select star from pg_stat_statements and get the queries that have run on the database.

And so that is a starting point where you don't need to get any external tools. And so I would say it's a very important tool in everybody's tool belt. pganalyze uses pg_stat_statements; PgHero, pgDash, basically everybody who gets query statistics is going to use pg_stat_statements underneath the hood.

So I do think it should be listed.
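For anyone following along at home, the "select star" Lukas mentions looks roughly like this. The column names are for Postgres 13 and newer; older releases use total_time and mean_time instead.

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 queries by total time spent.
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```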

CLAIRE: Okay. So if I'm a college hire starting off on a data team and I'm focused on monitoring, this is something I should also be learning about, pg_stat_statements. Is that a yes?

ROB: Yeah. I would definitely say you need to be aware of it. You can get by without it if you have to, but I would say you don't want to.

And the only reason I mention that is, I have been in environments where, if you say you want to install an extension, that might take some kind of political capital, even for something as built in as pg_stat_statements, just because those systems are so locked down, due to the nature of whatever security, compliance, or regulation the company has.

So you might have to use something like pgBadger and try to get the information from log files; you can get a lot of that same information with pgBadger. So that's another one that I'd say is probably worth looking at. And in fact, just thinking about it, there's a very old tool that we have. I'm pretty sure it still works, but it's written in Perl, like so many old tools.

It was a little Perl script that we had called system monitoring, and basically it was just a Perl script that was designed to run queries against the Postgres database and dump them to log files that you could grep and awk and whatever. And the other reason I mention it is, again, it's a tool that I would hope you wouldn't necessarily have to use, but if you're in an environment that's so locked down where you can't install stuff, or people are worried about having things change on the machine, on the database server, or whatever, that one can run completely externally, and it just dumps data to a text file, so it doesn't have to send it anywhere or anything like that.

And so it's a very easy one to then be able to go and study data after the fact, cause everything is pure text, so it doesn't take up much space. It was very lightweight, that kind of stuff. So I think there are some other things out there; it's just that ideally you don't have to really think about it at that low a level, right?

If you go into an environment where it's a problem to install pg_stat_statements, you're going to have a lot of other problems there as well, just because it's almost expected to be there, I think, for most people.
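The original Perl script isn't reproduced here, but to give a flavor of the approach Rob describes, a scheduled job could run a query like the following via psql and append the output to a plain text file. The exact columns to sample are a judgment call, not anything prescribed by that tool.

```sql
-- Table-level counters worth snapshotting and diffing over time:
-- scans, live/dead tuples, and when autovacuum last ran.
SELECT now() AS sampled_at,
       schemaname,
       relname,
       seq_scan,
       idx_scan,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```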

LUKAS: And I would say it is there by default most of the time, right?

I think most database-as-a-service providers install it by default. And so you just need to do CREATE EXTENSION and it works, right? So it's not like you need to add things to shared_preload_libraries and all that stuff. I think luckily that has changed, right? Versus five years ago, you might have had a lot of extra work to get it enabled.

ROB: I think most packagers, as I say, most packagers now include that. It used to be that the packaging was split out a little more. So if it was a contrib module, you needed a different RPM to be installed. And that could be a problem: oh, that's not on our approved list of RPMs, you can't have the contributed modules.

I think, by and large, that problem doesn't exist anymore, but I wouldn't be surprised if some people still run into it.

PINO: I think as part of this conversation, we should just mention for listeners that enterprise monitoring tools, Foglight, Percona, SolarWinds, Datadog, have Postgres-specific support.

So if you have one of these at your company and it's not enabled, then you can also look into that.

LUKAS: But I would say, Rob, go ahead.

ROB: I was going to trash those tools, so you should let me do that. Just to say: be aware that what they're going to give you is not as good as what the more dedicated tools like pgMustard and pganalyze and pgDash, any of those sort of PG-oriented tools, will give you.

They will just give you a level of depth, and I think more insight into why you're monitoring what you're monitoring, than you will get from these more generic tools. But if the more generic tool is the thing the company has, right, if you're at an Oracle or SQL Server shop and they have SolarWinds and they say, we're going to launch a Postgres thing,

I think it's maybe worth asking, hey, can we get this Postgres tool first? And when they tell you no, then say, okay, now I'm going to go into SolarWinds and I'm going to try to configure that to get the information. And again, it's just going to be more work to do it, and you'll have to get through that.

And so you can probably do a good enough job with that stuff to start with. But it's just not going to give you the full amount of depth that you can get from some of the Postgres-specific stuff.

LUKAS: Yeah. And I would add one more thing to that, which is, one thing that we often see is when people come from Oracle, for example, really what they're comparing us with, a little bit at least, is more like Oracle Enterprise Manager or other tools that are very Oracle-specific, right?

So it's for them, it's really that in depth tool that they're missing on the Postgres side, which is why they then come to pganalyze or other like Postgres specific tools.

CLAIRE: Okay. Now we haven't talked about pg_stat_io. So let's say I'm the new hire on the data team who's got to spin up on monitoring and learn to understand this space.

Is pg_stat_io something that I need to learn about, put in my toolbox?

LUKAS: If you're on Postgres 16, yes.

CLAIRE: Obviously, it's a new feature. Obviously, I'm a bit biased, by the way. Full disclosure, it's a new feature in Postgres 16, and the lead author is an engineer named Melanie Plageman, who works on the team at Microsoft, so I know about it.

It's on my radar, but I'm curious how you think about it.

ROB: Yeah, I would ask Lukas, have you seen a lot of it in the wild? Cause what I've told folks who have asked me about it is, I'm skeptical about how useful the data you'll get from that on a cloud provider will be, because generally cloud providers don't want to give you all the details on the IO that's going on in your system.

And so this exposes some of that, I'm not sure. If you're a person who's like, I need psql and I want to be on the command line and I'm going to run iostat and all those really low-level, systems-oriented tools, I don't know that this will give you that piece, right? So you might still be missing something.

And at the moment, it's just too early for me to know the answer. I don't know if Lukas has more data on that.

LUKAS: I have thoughts on it, and I have some early anecdotal data, I would say. So for context, I did review that patch during the Postgres 16 development cycle, and so I have some context on the different thoughts around it.

Although Melanie, of course, having written it, would be the expert to ask about it. But I do think that really what pg_stat_io does is it provides you the Postgres view of the world in terms of IO activity. And the gap that it's filling: if you look back to previous releases, you mainly had things like pg_stat_bgwriter or pg_stat_database that provided you some sense of which IO activity Postgres was creating.

Or, for example, pg_stat_statements was tracking shared buffer hits and such. But what pg_stat_io really does, and which I think is extremely helpful, is give you an all-up, top-level view of all the IO activity issued by Postgres, and which subsystem issued it, right? So it doesn't tell you which query did it, right?

That's pg_stat_statements' role. But what pg_stat_io will do is tell you: was autovacuum busy? Was the checkpointer busy? Was the query backend busy? That level of information wasn't really accessible before, and so I do think it's going to be very helpful just to help you understand, if you're seeing a spike in IO activity, let's say a spike in IOPS on your system metrics, pg_stat_io will be able to tell you which part of Postgres caused that. One thing I found interesting: Aurora recently released Postgres 16 support, or it's in preview, and I know Citus on Azure Cosmos DB for PostgreSQL also has Postgres 16 support already in production. So you should definitely use that.

What I found interesting on the Aurora side is that, because they modified Postgres heavily, they actually changed the information that pg_stat_io returns to reflect some of the differences in their IO subsystem. And so I don't have enough data yet to say whether it's actually useful on modified Postgreses like Aurora.

But I do think it's going to be interesting to observe how pg_stat_io gets used by the different providers that modify Postgres and actually have very different IO subsystems.
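
[Editor's note: a minimal sketch of the kind of query Lukas is describing, assuming Postgres 16 or later where pg_stat_io exists; it summarizes IO counters by the subsystem that issued them.]

    -- Which part of Postgres issued the IO? (requires Postgres 16+)
    SELECT backend_type, object, context,
           reads, writes, extends, hits, evictions
    FROM pg_stat_io
    ORDER BY coalesce(reads, 0) + coalesce(writes, 0) DESC
    LIMIT 10;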

CLAIRE: That's actually a super helpful explanation of what it is. So thank you for that.

PINO: Maybe we should mention, sorry, Claire, in the interest of folks that would not know, that Aurora has its own custom storage system, so it's not vanilla Postgres, and that's maybe why they need to make modifications in pg_stat_io to make the IO meaningful.

LUKAS: That's right. Yeah. Certain behaviors, like the checkpointer, right? It's a very Postgres-specific concept, and Aurora has obviously modified that heavily. I imagine another provider that probably has similar challenges would be Neon, right?

Neon has also modified Postgres in a similar way to Aurora, and so I would expect pg_stat_io to look different on those providers.

CLAIRE: I was going to segue for a split second and try to surprise you, Lukas, or maybe even delight you. You mentioned that Citus already supports PG 16, which happened probably within a couple of days of when PG 16 came out.

But Azure Database for PostgreSQL, the Flexible Server option, started supporting Postgres 16 in production a couple of weeks ago.

LUKAS: So that is wonderful news.

CLAIRE: Yeah, in past releases, it has taken a lot longer to get there.

LUKAS: Yeah. And I think, I'm no longer affiliated, so I can say that I think Azure, for example, has really done a great job at being faster at releasing new Postgres versions, and for completeness' sake, AWS, I think, has actually gotten better at this as well. So across the board, what we're seeing is that with these new Postgres releases, you used to have to wait a full year to get them on your managed providers.

And now, usually by the first patch release at the latest, you'll have support for it in production, which I think is a great change and a great improvement.

CLAIRE: Yes, I agree that it makes me happy, especially since I work for a Postgres provider. And I know we've got some AWS folks here on the chat too, and they probably feel the same way.

Monitoring. I'm curious: with all the improvements that both of you have seen happen in Postgres over the last 10-plus years, are there still things on your wishlist? Things that you think Postgres could do better, and that you hope it will do better in the future? Do you have a wishlist?

LUKAS: I have one, but I'm curious about Rob's thoughts.

CLAIRE: Oh, okay. Rob gets to go first.

ROB: I was going to put this to Lukas. I think his wish list might be more interesting than mine.

LUKAS: Let me echo something that Rob said earlier, which is around DTrace, right? So these days you have eBPF and bpftrace kind of replacing a lot of that functionality.

And one thing I really wish we had, which I don't know how we'll get there, but I think there are certain types of analysis slash monitoring, maybe too deep to be called monitoring, more like profiling. One thing I'm really missing today is the ability to get the level of detail that bpftrace or perf on Linux can get you, on managed providers, right?

The, in a sense, sad truth from a getting-access-to-data perspective is that most people these days run on managed service providers. And so you don't have the ability to just run a tool like bpftrace or perf on the command line. And I really wish that we would find a way to get people access to that more easily, because there are certain kinds of Postgres problems that you can't really understand otherwise.

For example, people say, hey, I'm having this wait event that's kind of a bottleneck, I see this wait event spike, and if you had access to perf or to bpftrace, you could actually get much better low-level information on demand when you need it. Versus these days, oftentimes your analysis has to stop short because you don't have that level of access.

And so I think either giving a way to do that, I'm imagining something like a Postgres eBPF trace extension that providers find acceptable, that would be great. Or some other way of getting more wait-event-related information would also be helpful, right? So why is my system bottlenecked on this LWLock lock manager wait or whatever it is?

And what else can I know about this besides the fact that it's bottlenecked?
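
[Editor's note: bpftrace and perf aren't available on most managed services, but the wait-event information Lukas mentions can be sampled from SQL alone. A rough sketch, using only pg_stat_activity columns that exist in current Postgres; run it repeatedly, for example with \watch in psql, and aggregate the samples.]

    -- Snapshot of what active client backends are currently waiting on.
    -- Repeated sampling over time approximates a wait-event profile.
    SELECT wait_event_type, wait_event, count(*) AS backends
    FROM pg_stat_activity
    WHERE state = 'active'
      AND backend_type = 'client backend'
    GROUP BY wait_event_type, wait_event
    ORDER BY backends DESC;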

PINO: Lukas, does that require bpftrace? Is that running in debug mode all the time, or is it just the ability to tap into Linux kernel calls, system calls?

LUKAS: It's more the latter. It's my understanding, and I'm definitely not an expert on this, but my understanding is that it's very low overhead when you don't use it.

And when you are using it, you're essentially hooking into existing trace points in the binary, so I don't think there's a reason not to have those. It's more of a root access problem, right? Today, on a managed service provider, you don't have enough access to run these tools, but I don't think there would be performance overhead if you were not actively using it.

CLAIRE: So you touched on something, which is monitoring in the scenario where your application is running in the cloud and you're running on a managed service like RDS Postgres or Azure Database for PostgreSQL. When I think about those users, those cloud database customers, do they need to be skilled at monitoring in the same way that an on-prem user with self-managed Postgres needs to be skilled at monitoring? How is that different?

ROB: I think the answer of how much of that need you have really depends on how hard you're pushing your systems. And even in that case it may not matter: when I've been on systems that are pushing very difficult, very hard workloads and you just don't have access to the system stuff, then there's a question of whether you need to learn those skills at all, right?

Because if you wanted to get traces from the system now, and you're on a managed service provider, the only way to get that, and probably in most cases you still couldn't even do this, is to open a ticket with the service provider and say, hey, could you go run this tracing script and see what data it gives you back? And I think the person on the other end of that ticket would probably flip out and be like, who is this person? Why are they asking me to do this? This would seem crazy. Whereas if you have access to those tools, right?

Because you're running on prem, or, it doesn't have to be on prem, it could be EC2 or something where you just have access to the underlying system and those system-level tools. I would say you want to learn that, but is it a top priority? It may not be where you're working. For the work that we did as consultants, where we were trying to push the envelope with Postgres, we needed to learn that tool set, at least somewhat; there needed to be a few people who had some idea of, again, systems.

It didn't have to be tracing tools necessarily, but again, things like iostat and perf, and other stuff at the command line level you can get access to, those I think are good to have. But if you can't ever use them because all the companies you're going to work at are using managed providers where you never get systems access, it's hard for me to argue that's a skill set you're going to invest a lot of time into.

CLAIRE: I think I was asking a slightly different question, which is: if I'm a customer running Postgres as a managed service, do I need to be as expert in monitoring as if I'm running it on prem? Or is my knowledge and understanding of monitoring the same in both cases: I need to be skilled at it, I need to care about it?

LUKAS: I would argue the emphasis is different, right? We touched on this earlier, but the emphasis on system-level problems becomes a bit less in the cloud. You don't get a free pass, right? If your utilization spikes, the provider is not going to fix it for you.

But, for example, if you have a bad disk that needs to be replaced, that type of problem goes away, because presumably your cloud provider has a way of monitoring for that bad disk. And so I really think it shifts a little bit: on-prem monitoring means a lot more infrastructure monitoring, versus in the cloud,

I think it means more like workload monitoring.
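
[Editor's note: as one common illustration of the workload-level emphasis Lukas describes, a query like the following against pg_stat_statements is a typical starting point; it assumes the extension is installed and preloaded, and the column names assume Postgres 13 or later.]

    -- Top queries by total execution time: a workload-level view
    -- rather than an infrastructure-level one.
    SELECT queryid,
           calls,
           round(total_exec_time::numeric, 1) AS total_ms,
           round(mean_exec_time::numeric, 2)  AS mean_ms,
           left(query, 60)                    AS query_preview
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;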

CLAIRE: Okay, got it. That actually makes a lot of sense. So some of the hardware problems, hopefully the cloud vendor will just take care of those; they'll detect and fix them before I even notice, or soon after I notice. But any of the software-level potential issues, with slow queries, for example, I, as a customer, am still going to need to be paying attention to, and potentially solving for, probably solving for.

ROB: I would agree with that. I think the tricky part is that there are occasions where, if you're in the cloud and trying to use cloud-based database services, your performance problem could originate at a level that's low enough that you don't have the ability to go see it, whereas if you had access to something like the bpftrace tooling, you would be able to see it. Because it can be a change, like maybe there's a change in how Postgres operates.

Speaking of pain points, one of the pain points is these major version upgrades, and being able to monitor back and forth across those upgrades and that kind of thing. And when you do that, 99 percent of all the performance changes are positive.

And then there's the 1 percent that you hit, that's, I didn't change my query, but now maybe I get a different plan, and why is this plan different and why is it so much slower, or whatever? And the answer could be that Postgres has changed something internally in how it actually approaches trying to solve that problem.

And I think those cases are rare, but they do happen. And in the cloud, you end up trying different indexing strategies, and maybe it ends up getting reported as a bug or something like that, and then somebody figures it out somewhere, but you don't have access to the tools that you would need to figure it out yourself.

Because you cannot get in there and get the traces and that information; you just don't have that tooling available. Is it rare? Yes, it's definitely a rare scenario. And again, that comes back to: maybe you've made the choice to be in the cloud, so you're saying, in those rare scenarios, we put that on the cloud provider, and we don't expect our people to know things like GDB and system tracing and that kind of stuff, because they wouldn't be able to use it.

CLAIRE: All right. So we're going to run out of time soon, but I have one parting question I really want to ask you about, and how can you have any conversation these days without AI coming into the mix? I'm curious what your vision is for how Postgres monitoring will change, or may change, in the future.

If you were to fast forward, say, 5 years. Do you get asked that question? Have you thought about it?

ROB: I think the version of the question that I hear people asking is are we still going to need DBAs, right? Or is the AI going to be able to monitor everything and make all the adjustments and whatnot?

And I think my short answer to that is that a lot of times, when you're doing performance troubleshooting, and just running systems in general, it's not actually even a technical problem, right? It's really more about some kind of political or economic trade-off that's being made.

And that seems very difficult to model with AI, because it's the kind of thing that could change depending on the time of the year or who's in the meeting or whatever. So that seems very difficult. As for those tools, there's a really good talk that I saw, I think it was at PGConf New York City, maybe from the OtterTune folks, about why automatic AI-based monitoring doesn't work. It's a tongue-in-cheek talk about here's all the ways that it does actually make things better, but labeled the opposite way to get people to go, because DBAs are always going to be like, nah, it'll never work. And so I think there are a lot of advantages that will come out of that. And if you go see that talk, you're like, okay, yeah, I can see where a lot of this stuff will get better. I don't know, five years seems like a short timeframe to me, because one of the things he pointed out, I'm trying to think of the guy's name, Andy Pavlo.

CLAIRE: He gave the keynote at PGConf New York City as well, this year.

ROB: Okay. Yeah. So I'm trying to remember if that was his keynote or if it was a different talk that I saw him do. But he points out that where he sees a lot of value is when people take that software that they have, which is really more ML-based than AI-based, and they run it against their production workloads, which is what they tell you not to do.

But again, you have to do it against production because you need real data in order to make any kind of analytical decision that will be useful. So people run it against production and they do see improvements in performance. Okay, I think that's good, right? So clearly the tools, I think, can benefit us somewhere down the line.

And then it's a question of: is it five years? Is it ten years? I don't know, I think that's a little bit harder to guess. But I'll still go back to the original things we were talking about, like trade-offs of OLTP versus OLAP, and are you going to split that off into a second machine in order to do the tuning for the queries, because they need to be tuned differently, and the servers need to be tuned differently.

I don't see the AI making that decision. It can try to do that balance, but at some point somebody is going to have to say, you know what, we need a second server, we need to do replication, we need to be able to build specific queries on that. That seems to me like something that even in ten years' time would be very difficult to just outsource to an AI, to make those kinds of decisions.

CLAIRE: So Pino and I were talking beforehand, and he was like, Claire, we're going to end up talking about tuning, someone's going to take us in that direction, but that's a whole nother podcast, right? And I do think that's a whole nother interesting podcast topic. But back to PGConf New York City: Andy Pavlo was there.

He gave the keynote talk. There was another OtterTune talk, and there was another tuning company who I think was a sponsor at that event, or somehow I met them; those folks are called DBtune. Tuning is definitely on people's radar, and we probably should have that as a topic. All right, Lukas, what's your answer?

Don't worry, I'll never forget about you.

LUKAS: I think, let me see, how do I put this best? First of all, I think we're all doing ourselves a disservice by calling it AI, as if it were thinking or doing something more than what it's actually doing, right? We're not at the point where AI is actually reasoning or actually thinking.

We're at a point where it seems plausible in terms of the responses it gives us. And so I'm really conscious of not just saying, hey, because we have ChatGPT, now DBAs are going to be replaced, because that's stupid, right? That's not going to happen. I do think that there are systems that are interesting.

I think what OtterTune and DBtune are doing, ML-based systems or Bayesian-statistics-based systems, is quite interesting, right? What worries me a little bit with some of those systems is that they become a little bit of a magic box that does things without somebody being able to follow what's going on.

So, my personal perspective, and I'm definitely biased on this because in a sense we are competing on some of these things with, for example, OtterTune: what we do in pganalyze, we're currently in the process of rolling out a new version of our index advisor, and one of the things that we found was really important with making index recommendations is that for any recommendation we give you, you can follow what happened.

It might not be a good recommendation, but you will know why it's not a good recommendation. And I really wish that, whichever systems emerge over the next couple of years, be it something that pganalyze, for example, does or something that other companies do, we really pay attention to giving transparency in terms of: here is why this decision was reached, here is why it's recommending to change this setting, so that a human can then think about, should I apply this to all my systems? When should we use this, right? I think that whole function of a person thinking about it doesn't go away, because AI can't really solve for that, at least not today. I would say in the short term, better tools to help people make decisions; in the long term, maybe the decisions are made by the AI, who knows?

That's a 10-plus-year timeframe, I'd argue.

PINO: I've heard that principle, right? That until you can fully trust the AI to take over, you want the AI to be explainable, so that people can follow what it did and maybe second-guess it.

ROB: I just want to jump in here and use this opportunity.

I know there are some people out there with bingo cards waiting for me to say it, and Lukas has given me the opening. Going back to the question about wishlists, I'm going to mention pg_hint_plan, so everyone take your drink, as a tool for when you get the index advisor telling you, hey, maybe you need this index, or get rid of this one, or whatever.

It'd be great if we had some mechanism in Postgres to actually be able to test that in a production setting where the data is real, and pg_hint_plan can actually do that. That would be on my wishlist; even though it's not strictly a monitoring tool, it can be very useful for work related to that.

CLAIRE: Super quick, define pg_hint_plan for people unfamiliar.

ROB: So it is an extension to Postgres. It basically allows you to do query hinting, and one of the main things would be telling Postgres to use specific indexes, or to not use specific indexes, when you run a query. So you can get the explain plan and compare explain plans without actually having to drop indexes, or drop them in a transaction and roll back, or whatever crazy tricks people come up with.

You just say, run it with this index, run it with the other index, let me study the explain plans, and now I can see which one I would want to choose.
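
[Editor's note: a small sketch of what Rob describes, with hypothetical table and index names; it assumes the pg_hint_plan extension is installed and loaded. The hint comment goes at the start of the statement.]

    -- Steer the planner toward a specific index, then toward a sequential scan,
    -- and compare the two EXPLAIN plans without dropping anything.
    /*+ IndexScan(orders orders_customer_id_idx) */
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

    /*+ SeqScan(orders) */
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;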

CLAIRE: Okay, so now that we know what it is, can you tell us what your idea to make pg_hint_plan even better is, again?

ROB: So I think the first step would just be making it more accessible to people, because it's not pg_stat_statements, where everybody gets it.

Only certain providers actually provide it, because it's an externally developed extension. So step one would be getting it more accessible, but to me, step two would be actually incorporating it into core. And I think there are some things we could do if hinting were in core itself that you cannot do as an extension.

I'm a hint plan person; I'm like, we should have hints in core, and I'm all in on that. So that's where I would go with it. But even just getting more people access to the extension is, to me, a win for most end users.

PINO: Would that mean that an external system would be permanently making recommendations to Postgres about the planning, or just temporarily?

ROB: Yeah, so usually it's temporary, and that's how I would recommend you use it, as more of a temporary thing, rather than trying to force particular plans that may need to change over time. So there is some danger, and this is the argument against having it, that people might use it wrong. But I feel like you can use SQL wrong, and we still let people use that.

Let's educate as we provide the tools that people need.

LUKAS: And I would add that where I've found it really useful is in the situation where you have a plan that's unexpected. You're like, why is Postgres first loading data from this table and then joining against the other one and doing a nested loop?

pg_hint_plan helps you ask that what-if question, right? What was the cost of the alternate plan? Because you can just tell the planner, you must use this index, or you must use this order of joining tables, and then it gives you that cost, and that's a super useful debugging tool, even if you're not persisting it.
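
[Editor's note: a sketch of the what-if question Lukas mentions, again with hypothetical table names and assuming pg_hint_plan is loaded; forcing a join order and join method lets you see the estimated cost of the alternate plan.]

    -- Compare the plan Postgres chose on its own with a forced join order and method.
    /*+ Leading((o c)) NestLoop(o c) */
    EXPLAIN
    SELECT *
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at > now() - interval '1 day';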

CLAIRE: I know when I was preparing for my talk about Postgres extensions at PGConf New York City, I talked to Rob and got some of his inputs beforehand, and pg_hint_plan was in that conversation too. He was like, oh, you've got to talk about it, it's so important, it's so useful. So yeah, everybody take a drink. All right, before we say our goodbyes and say thank you: is there anything either of you really wanted to say in the podcast today about your journey into Postgres monitoring that we didn't talk about? Last chance.

LUKAS: I would say, this is a small thing, but we need more people sharing upstream on the Postgres mailing lists what your experiences are with monitoring. So if you are using Postgres and you're using pg_hint_plan or pg_stat_statements, I think we need more people working on this stuff on Postgres itself, upstream. So definitely, plus one to anybody who has feedback: I think it's good to engage with the community on this, because oftentimes that perspective is lost, I think, on the hackers mailing list.

CLAIRE: For those of you who don't know what we mean by mailing lists: that's actually how the Postgres development community communicates with each other, on these mailing lists that are public.

You can go read them. There's a lot of conversation, and specifically, I've had other Postgres committers tell me that they love it when users give feedback. It's the pgsql-hackers mailing list in particular that you're talking about, Lukas, right?

LUKAS: That's right. Yeah. And I'll give you an example.

For pg_stat_statements, which we talked about earlier, there's a long-standing patch discussing how IN lists should be handled in pg_stat_statements. It's very specific, so I'm not going to go into detail. But if you have opinions on this, because you are an end user of pg_stat_statements, it's very helpful if you chime in on these email threads, so that we get a better Postgres.
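
[Editor's note: for readers wondering what the IN-list discussion is about, a hedged illustration on a hypothetical users table. To the best of the editor's knowledge, as of recent releases pg_stat_statements normalizes the constants but not the length of an IN list, so queries that differ only in how many values they pass show up as separate entries.]

    -- These two normalize to different pg_stat_statements entries today,
    -- even though they are "the same" query to the application:
    --   SELECT * FROM users WHERE id IN ($1, $2)
    --   SELECT * FROM users WHERE id IN ($1, $2, $3)
    SELECT queryid, calls, left(query, 60) AS query_preview
    FROM pg_stat_statements
    WHERE query ILIKE '%IN (%'
    ORDER BY calls DESC;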

CLAIRE: Can you drop that link, do you mind, in the Discord chat, Lukas?

LUKAS: I'll have a hard time finding it in the short term, but I'll try it.

CLAIRE: Okay. Okay. No, then we can include it in the show notes for the show, so you could drop it in later today or something. It doesn't have to be this moment. Okay. All right.

So that's your one thing. Rob, what about you?

ROB: I think the thing I would maybe just mention, to cover for folks that are trying to get into monitoring and learn more about it: remember that there are two pieces of information that you're trying to gather with a lot of this stuff. One is what is going on now, and the other is what has happened over time. For those two things, a lot of the data overlaps, but they're actually different problems that you have to solve.

So that's another thing to think about in whatever tooling you're going to pick. Knowing all the queries to run in psql, right against the system tables, to figure out what's going on right now, that's a skill set that is worth developing. But then the idea of, I need to know what autovacuum has been doing over the last 30 days? That's not a thing that Postgres is very good at exposing over time, right? So you're generally going to need external systems in order to trend that data over time.

So be aware that whatever your journey is going to look like, it's probably going to involve external tools, whatever those might be, because you're going to need some way to trend that data over time in a way that makes it easy to understand and analyze. One of the reasons that we talk so much, I think, about external tooling is because of that, right?

Postgres can give you what's going on now. It isn't really designed for the other part; nobody seems to want to tackle how we give you granular, long-term data about what's happened in Postgres, right? We sort of export that to external tooling. So just be aware that you need to carve some time out, in your day-to-day or over your career, to understand what some of these external tools can do, because of that need.
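
[Editor's note: a minimal illustration of the now-versus-over-time distinction Rob draws. Postgres exposes cumulative counters and timestamps, for example for autovacuum, but trending them over 30 days means snapshotting output like this into an external system on a schedule.]

    -- Current autovacuum state per table: a point-in-time view only.
    SELECT relname,
           last_autovacuum,
           autovacuum_count,
           n_dead_tup
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;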

CLAIRE: Wonderful. Shout out. All right, thank you so much to Lukas Fittl of pganalyze and Rob (Robert) Treat. Loved having you on the show today. And as always, couldn't do this without Pino de Candia and our co-producers in the background. Our next episode will be recorded live on Wednesday, January 10th at 10:00 AM PST. The guests and topics are TBD, but you can mark your calendar now at aka.ms/PathToCitusCon-Ep11-cal, and someone will drop that link in the chat right now if you're here live. You can get to past episodes and links to all the platforms at aka.ms/PathToCitusCon as well, and there are transcripts included on the episode pages on Transistor too.

PINO: Before we leave, we just want to ask you a favor: especially if you've enjoyed the podcast, please rate and review us on your favorite podcast platform. It helps other folks find this new show. And a big thank you to everyone who joined the recording live and participated in the live text chat on Discord.

And that's a wrap.

CLAIRE: Thank you.

PINO: Thanks, everyone.

Creators and Guests

Claire Giordano, Host
Claire Giordano is head of the Postgres open source community initiatives at Microsoft. Claire has served in leadership roles in engineering, product management, and product marketing at Sun Microsystems, Amazon/A9, and Citus Data. At Sun, Claire managed the engineering team that created Solaris Zones, and led the effort to open source Solaris.

Pino de Candia, Host
Pino de Candia is a software dev manager at Microsoft since 2020 and is currently working on the Citus open source project. Pino previously worked on the managed PostgreSQL database service in Azure Cosmos DB for PostgreSQL, which includes Citus on Azure support for distributed PostgreSQL. Pino has lived in New Orleans since 2017.

Aaron Wislang, Producer
Open Source Engineering + Developer Relations at @Microsoft + @Azure ☁️ | @golang k8s 🐧 🐍 🦀 ☕ 🍷📷 🎹🇨🇦 | 😷 💉++ (inc. bivalent) | @aaronw.dev (on 🟦sky)

Ariana Padilla, Producer
Program Manager at Microsoft in the Azure Database for PostgreSQL team | Avid Traveler 🛫 & Foodie 🍽️🍹