Spinning up on Postgres & AI with Arda Aytekin

Spinning up on Postgres & AI
===

CLAIRE: Welcome to Path to Citus Con, the podcast for developers who love Postgres, where we discuss the human side of open source databases, Postgres, and the many PG extensions. I want to say thank you to the team at Microsoft for sponsoring this community conversation about Postgres. I'm Claire Giordano.

PINO: And I'm Pino de Candia.

Today's topic is Spinning up on Postgres & AI.

CLAIRE: And I am really excited to introduce you to our guest today, whose name I hope I pronounce correctly. It's Arda Aytekin. Arda works as a Senior Software Engineer on the Postgres Extensions team at Microsoft, and he specifically focuses on AI and ML capabilities.

Arda earned his PhD in Electrical Engineering. His thesis was on Asynchronous First-Order Algorithms for Large-Scale Optimization Problems. Arda is a self-proclaimed open source enthusiast and one of the many fans of Postgres. Welcome, Arda.

ARDA: Hey, thank you for inviting me.

CLAIRE: Well, I'm so glad you're here. All right, we want to dive right in. Pino and I talked before this podcast, and we have so many questions for you about how you became an expert in AI capabilities for Postgres. And as we talk today, I specifically want to make sure we give shout-outs and links to pointers that you have for developers, both application developers and Postgres developers.

And of course Postgres users too: what pointers do you have that will help other people spin up on AI and Postgres? So that's the preamble. I want to start with your origin story if we can, and I want to go back in time before we dive into all those tips and pointers and ask you, how did you get started as a developer?

ARDA: Thank you. Actually, I don't come from a computer science background. At university I studied engineering, mechanical engineering specifically, and computer science, or software, has always been a passion for me, so I'm self-taught in software. And no matter what I did throughout my university studies and postgraduate studies, I always tried to, you know, stay as close as possible to software engineering.

And this is maybe how I started in software, towards the end of my university studies and then onwards, I would say.

CLAIRE: Okay. And what does it mean to stay as close as possible to software? What did you do? Did you have side projects or passion projects? Were you doing things on GitHub or what? What exactly do you mean?

ARDA: Yeah, I had hobby or pet projects during my undergraduate studies, and whenever I needed to do certain projects, I did my best to take on as much of the software or coding parts of the topics or projects as possible. And eventually, during the Master's and PhD, I took courses around software design, and towards the end of my PhD, I also wrote a library in C++ to basically realize the algorithms that I was analyzing during my PhD.

So, as a best effort, I tried to always be on the computer science side of things during my education.

CLAIRE: Okay. And then did you go into engineering after you got your PhD, or did you go into software?

ARDA: Directly to software, I would say, or data science and software. So right after my postgraduate studies I entered the industry.

It was a telecommunications company, before Microsoft. And my work back then was mostly around data science and data engineering, heavily involved in software.

CLAIRE: Okay, but no, no AI yet. You were not involved in LLMs or vector databases or anything like that at that point?

ARDA: Yes, I mean, no, I was not involved in that, because generative AI has only been a very hot topic for the last couple of years, and we are talking about four years before that. It was mostly traditional machine learning back then, you know, regression analysis and other machine learning aspects.

PINO: Maybe Arda, could I ask you to take a little detour and explain what is the difference between data engineering, data science, AI, and Gen AI?

ARDA: It's a great question. Actually, data science is perhaps looking at the data and trying to model a business problem based on the data. So when you look at the data, you first do certain exploratory data analysis to understand what the data looks like.

And together with the domain experts or subject matter experts, you try to model the business problem at hand, taking as much information as possible from the data. And data engineering is mostly around pipeline building, I would say, to bring the data into the data science problem-solving setting.

So usually data comes from raw sources; it's not workable yet. Then we do certain cleansing operations and then transformations to make it workable by the data scientists. And then we build models in the data science stack, and then we hand it over, maybe at an even higher level, to data analysts to basically create dashboards that explain visually what the data tells us, either for the present or for the near future, to the business stakeholders.

And usually these are traditional machine learning. When we look at the AI part, I think it's a superset of machine learning. So it's a more generic setting, and generative AI is, I think, in the intersection right now, because these Large Language Models, or LLMs for short, are neural network models, very complex models, which have certain attention mechanisms to remember certain aspects of the text and to generate text or numbers based on the context.

So generative AI, or LLMs, sit in the intersection of AI and machine learning, I would say. And AI, or Artificial General Intelligence, so to say, is more of a superset, if you think of them as sets or Venn diagrams.

PINO: Got it. Thank you, Arda. And so now, going back to your origin story, you were about to tell us how you got into machine learning and AI.

ARDA: Yeah, my expertise is in optimization problems, and optimization problems and algorithms are the foundations or building blocks of machine learning models. Because when we are talking about machine learning models, we have data and we have a problem at hand, and we need to try and train certain models based on the data, and the model training part is an optimization problem. So due to my work in my PhD around optimization problems, it was a natural path towards data science and machine learning for me.

CLAIRE: Is this something you started to do when you were at the telecom, or something you started before you came to Microsoft?

ARDA: And even at Microsoft, in my first year, I worked as a data scientist, and in my second year I made the switch to pure software engineering, to Postgres actually.

CLAIRE: Okay. Now, had you worked with Postgres before? Were you already a fan from your educational days, or even from your work at the telecom?

ARDA: Unfortunately not, and I'm saying unfortunately because had I known it before, I think I would have used it even earlier. During my PhD and during my data science time, I mostly used columnar databases, like Apache Spark and others, and Postgres is not one of those, at least not originally. So I wasn't using this technology before.

CLAIRE: Okay. Got it. All right. So then you join the Postgres team at Microsoft. And were you focused on AI from the very beginning or is that something that you evolved into after you joined the Postgres extensions team?

ARDA: It was after I joined the extensions team. So for the last maybe eight or nine months, I would say, because when I first joined the team, I was with the services part, or services team, which is responsible for deploying Postgres as a service on the Azure cloud. There we were dealing with the issues and problems of deploying it as resiliently and as well as possible in our backend on Azure.

But later I moved to the extensions team, and I think it was right after this AI hackathon we did internally at the company. Then I was part of the extensions team, and now I'm with the extensions team working mostly with AI or generative AI applications on and around Postgres.

CLAIRE: Okay. And the extensions team at Microsoft, those are the folks who work on things like Citus. I think there's somebody on that team who has been a guest on this podcast before, who is a maintainer for PgBouncer as well. There's just a whole bunch of people who are working in the Postgres ecosystem, but not directly on the Postgres core.

Is that right?

ARDA: Yes. And it's a fantastic team. I mean, especially for a person coming from outside of Postgres, it's, it's great to have such colleagues. And this is also true for the Postgres community, by the way.

CLAIRE: Well, I think that if we're going to talk about AI in the context of Postgres, extensions are actually a good place to start.

A lot of people have been watching what's happened with the pgvector extension to Postgres. I mean, it has just gone parabolic. I was looking at the star history for the pgvector GitHub repo, and you know, it was kind of flat for a number of years, and then it just skyrocketed.

So let's start with pgvector. What is it? And when did you first get introduced to it, and what do people use it for?

ARDA: Pgvector is an extension which makes Postgres a vector database. A vector database is just a database that stores vectors, which are lists of numbers of a fixed dimension.

So for example, you have a 10-dimensional vector of numbers together with the data, and it, you know, stores the pair, the data and its associated vector, inside the database. And pgvector actually stores these vectors and enables certain mathematical operations on them in Postgres.

And frankly, it's a rather old extension. I think, if I'm not wrong, it's three years old or even older, but like you said, thanks to the generative AI boom, it eventually skyrocketed. It's a solid piece of extension in my opinion. And I got introduced to it during our AI hackathon, when we were playing with OpenAI large language models within the Postgres team.

CLAIRE: Got it. So what kinds of things do people do with pgvector?

ARDA: Well, first, you store vectors, and generally people store embedding vectors using pgvector in the database. And embedding vectors are, again, sets of real numbers, but they have special meaning.

They encode the semantic meaning of texts or documents in numbers, and pgvector, you know, is used to store them in the columns of a table in Postgres. And there are certain mathematical operations defined on them. So you can add two vectors, you can subtract one from the other, and you can do a dot product or different distance operations.

And these are super important, because taking the distance between two vectors, if they are embedding vectors, is basically how you understand the semantic similarity or dissimilarity between two documents. So people use pgvector also to do semantic-search type applications on the Postgres database.
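To make that concrete, here is a minimal SQL sketch of what Arda describes, using pgvector's documented operators (`<->` for Euclidean distance, `<=>` for cosine distance). The table, column names, and the tiny 3-dimensional vectors are hypothetical; real embedding models produce hundreds or thousands of dimensions.

```sql
-- Enable the extension (once per database).
CREATE EXTENSION IF NOT EXISTS vector;

-- A hypothetical documents table with a 3-dimensional embedding column.
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(3)
);

INSERT INTO documents (content, embedding)
VALUES ('avocado toast recipe',  '[0.11, 0.92, 0.33]'),
       ('guacamole ingredients', '[0.12, 0.90, 0.35]');

-- Similarity search: order by distance to a query vector.
-- <=> is cosine distance, <-> is Euclidean; smaller means more similar.
SELECT content,
       embedding <=> '[0.10, 0.91, 0.34]' AS cosine_distance
FROM documents
ORDER BY cosine_distance
LIMIT 5;
```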

CLAIRE: So for someone who's listening, I think that's the key, right? People are using pgvector and vector databases to do semantic search, to do similarity search, to do... oh my gosh, there's a whole list of use cases that you were telling me about right before we joined the call.

PINO: And that's not new, right?

Similarity search has been used to augment text-based search for a while now. And then I guess you can also have recommendation systems. Are there some other use cases?

CLAIRE: Analysis, text classification, information extraction, PII redaction. Like, we can think about these real-world use cases that people now have a much better tool to work with.

Is that, would you agree, Arda?

ARDA: Yes, especially the on-database recommendation system is key, because there you mix semantic search with other types of filtering, such as, you know, ranking by similarity and by some score, if such a score exists. Think of it as a data set that contains certain products in the retail industry, with their scores or relative scores from the comments.

And then you use semantic search and filtering on top to get the top-scored products that are semantically close to what you're searching for. So if I were to give an example: if you're searching for avocado in a lexical search, you just get avocado, but in a semantic search sense, you maybe get tacos, because of guacamole, and guacamole itself, and anything that might be semantically related to avocado.

And then you list them by similarity, which becomes your relevance, and then you can further add another filter on top to basically get the ones with the highest score, for instance.
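A sketch of that pattern in SQL, continuing with pgvector: ordinary relational filters combine with vector similarity in a single query. The `products` table, its columns, and the literal query embedding are hypothetical.

```sql
-- Hypothetical retail products table: an embedding of the product
-- description plus an average review score from customer comments.
CREATE TABLE products (
    id               bigserial PRIMARY KEY,
    name             text,
    avg_review_score numeric,
    embedding        vector(3)  -- tiny dimension, just for the sketch
);

-- Recommendation-style query: the semantically closest products to
-- the shopper's query embedding, restricted to well-reviewed items.
SELECT name, avg_review_score
FROM products
WHERE avg_review_score >= 4.0                  -- relational filter
ORDER BY embedding <=> '[0.12, 0.91, 0.34]'    -- semantic relevance
LIMIT 10;
```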

CLAIRE: So before we go any further, I feel like we have to give a shout-out to Andrew Kane, because he is, I believe, the creator and the original author of the pgvector extension.

Is that right?

ARDA: Yes, and he's also super responsive in the issues in the GitHub repo. So yeah, shout out and kudos to him as well.

CLAIRE: It's definitely a capability that was in the right place at the right time. So, I first met you last November in Seattle. I got to give a demo on one of the keynote stages at the PASS Data Community Summit, which was at the big Seattle Convention Center.

And you had flown into town to give a talk at that same event. And you were only there for like, what, 18 hours or something crazy like that.

ARDA: Exactly. From touchdown to takeoff, it was 18 hours in total. It was an overseas flight as well, but I had to do it.

CLAIRE: You had a really important reason to get back home.

Is that right?

ARDA: And that is also true. Hopefully she's listening to us now. I had an engagement ceremony right afterwards.

CLAIRE: Okay. So you had to get back to your fiancée.

ARDA: Yeah, exactly. And by the way, you did great work in the demo. I think your demo was in front of an audience of around a thousand people, maybe, and it was great.

CLAIRE: Thank you. Thank you. Yeah. When there's a thousand people in the audience, it's so many people you don't even know how many it is. It's kind of like just this mass of faces out there, and you've got the bright light shining on you, so you can't even really see the audience.

But yeah, it was a short six-minute demo. Which leads us into, you know, we want to get to the point where you give people tips about how to learn, how to spin up, but I want to talk about this thing that I demoed briefly. It was called azure_ai. It's another Postgres extension, complementary to, and works well with, the pgvector extension.

Oh, I'll go drop a link into the chat, because I just re-recorded it and published it on YouTube a couple of weeks ago. But you were one of the people who worked on and created this azure_ai extension. So why don't you tell us what it does, and let's start at the user-facing level.

So not so much about how it's implemented under the covers, but what does it give to users? How does it help?

ARDA: azure_ai is a set of UDFs, user-defined functions, for Postgres, which Postgres users who know the SQL world and the SQL language very well can use to do certain generative AI related tasks easily.

And as you mentioned, it works well with pgvector. I can maybe categorize the capabilities into three major task sets. One of them is obviously generating embeddings through the user's Azure OpenAI endpoint. So it will just go query the endpoint and get the generated embeddings from the text or set of texts.

And then...

CLAIRE: Let me paraphrase that. What you're saying is this azure_ai extension integrates the Postgres managed service on Azure, which has a mouthful of a name, Azure Database for PostgreSQL, with Azure OpenAI, so that you can use Azure OpenAI directly from within the database. Is that what you just said?

ARDA: Traditionally, you need an application layer. So application developers usually create a service, and then it will query the database for the data and retrieve it. It will go to the endpoint, which is the OpenAI endpoint, get the embeddings, and then go back to the database to store the embeddings somehow.

So instead of doing this through this application layer, now, thanks to azure_ai, we can do that directly from the database. The database will just connect to the service, fetch the embedding vectors, and then store them in the column using pgvector. And this is the first set of capabilities.
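Here is a minimal sketch of that first capability, based on the azure_ai function and setting names documented around the time of this episode; treat those names, the placeholder endpoint, key, and deployment, and the cast to `vector`, as assumptions to verify against the current docs (the embedding column's dimension must also match the model's output).

```sql
-- One-time setup: install the extension and point it at your
-- Azure OpenAI resource (endpoint and key are placeholders).
CREATE EXTENSION IF NOT EXISTS azure_ai;
SELECT azure_ai.set_setting('azure_openai.endpoint',
                            'https://<your-resource>.openai.azure.com/');
SELECT azure_ai.set_setting('azure_openai.subscription_key', '<your-key>');

-- Generate and store embeddings entirely inside the database,
-- with no application layer in between.
UPDATE documents
SET embedding = azure_openai.create_embeddings(
                    'my-embedding-deployment',  -- hypothetical deployment name
                    content
                )::vector
WHERE embedding IS NULL;
```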

The second set of capabilities is usually around document classification, I would say. And the most well-known document classification task is obviously language detection, which is the core of the other capabilities in the extension, because, for instance, if you'd like to do sentiment analysis or information extraction, you need to know the context, the context being the language that the document is written in.

So this is the core of document classification, and sentiment analysis is another task supported within this document classification. So users can actually analyze the sentiment of a certain set of texts or documents within their database by querying Azure AI services from within their databases, again without needing any application layer.

CLAIRE: Okay, so to paraphrase, it's kind of the same thing. The azure_ai extension basically gives you an integration, not just between Azure Database for PostgreSQL and Azure OpenAI, but also with the Azure AI Language services, which include all of these things about text extraction and let you do sentiment analysis and language detection and things like that.

Did I get it right?

ARDA: Yes, even document summarization as well. You know, if you'd like to get an abstractive summary or an extractive summary, you can use these endpoints and their corresponding UDFs inside SQL to summarize a set of documents in your database. So if you have a long text that you'd like to summarize somehow, you just use these functions, get the abstracts, and either store them in a different table or give them to your users.
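A companion sketch for this second set of capabilities, again using the azure_cognitive names from the documentation of that period; the exact signatures and return types are assumptions worth double-checking.

```sql
-- Point azure_ai at an Azure AI Language resource (placeholders).
SELECT azure_ai.set_setting('azure_cognitive.endpoint',
                            'https://<your-resource>.cognitiveservices.azure.com/');
SELECT azure_ai.set_setting('azure_cognitive.subscription_key', '<your-key>');

-- Language detection and sentiment analysis straight from SQL.
-- Both functions return composite results (language code, sentiment
-- label, confidence scores, and so on).
SELECT content,
       azure_cognitive.detect_language(content)         AS language,
       azure_cognitive.analyze_sentiment(content, 'en') AS sentiment
FROM documents
LIMIT 10;
```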

CLAIRE: Now, obviously azure_ai is all tied into the Azure ecosystem. So it's not something that is generally useful for, you know, other Postgres databases on other clouds at this point, right? It's all about integrating Azure Database for PostgreSQL, that specific managed service, with these Azure OpenAI capabilities.

ARDA: That is correct. As of today, we only support Azure endpoints for these tasks.

CLAIRE: Okay. What was interesting for me is I got pulled into giving that demo. I'm not part of the team that created the extension at all. I got pulled in a couple of weeks before, since people knew I was going to be there and they wanted to know if, you know, I would get up on stage.

So I had to do a lot of learning in crunch mode, kind of like college days, right? You know, cramming for an exam. And I thought it was cool, I thought it was interesting, but I was really surprised by the feedback I got from people when I walked off the stage, or went to lunch later that day and had random strangers approach me.

I had a bright red shirt on, so it was very noticeable, right? Short female, bright red shirt. There weren't a lot of people who looked like me. But people would walk up to me and say, oh my gosh, that blew my mind. I had no idea. And so it was really kind of fun and exciting to be part of this wave of new capabilities that are rolling out. That's my story.

ARDA: Yeah. It's exciting. It's exciting times. And like I said at the beginning of our discussion, I do come from this application development side, specifically from data science and data engineering. And it's also important for such a persona, in my opinion. You know, any data scientist or data engineer would love to have such a tool in a database.

We are in a unique position here, thanks to not only Azure AI but also Postgres, because, like you said, it is the extensibility of Postgres that makes it possible, or at least this easily possible.

CLAIRE: There's been a lot of speculation in the Postgres world about why it's grown in popularity so much in the last five, six, seven years, because it was already popular, but it feels like it has skyrocketed.

And some people say that the fact that Postgres supported JSON as effectively and as early as it did in the JSON lifecycle was instrumental. It wasn't the only factor. It wasn't the only reason, but it's definitely a key reason. And so I think it was Jonathan Katz who published a blog post recently, who basically said something like "pgvector is the new JSON" or "vector capabilities in Postgres are the new JSON" for Postgres.

Did you see that blog post?

ARDA: Yes, and I was also there at the conference, in that talk, and I do agree with it. Because in my opinion, again from the application development perspective, JSON enabled NoSQL-like database applications on Postgres, and now pgvector is enabling vector data store capabilities on Postgres. And it is just great to have, you know, these capabilities on a traditionally SQL database, because we don't need to change the technology stack, and we can stay with what we are most comfortable with and then solve our problems, at least to some extent.

CLAIRE: Okay, so as you've worked on azure_ai, as you've worked at the intersection of Postgres and AI, have you observed (and I don't want you to judge, I'm just asking about your observations) people in the developer communities, whether inside Microsoft, outside Microsoft, or both, who have been either hesitant or unsure, or just on the fence about the role of AI with Postgres?

ARDA: Yes, actually, I've met some people who were a bit skeptical, maybe not super hesitant, but skeptical about the use of AI in general. And I also agree with them to a point: it is a black box. We mostly don't know what's going on under the covers.

So it creates certain friction in that sense, in my opinion. And this is mostly what I understood when I was discussing with those people.

CLAIRE: So what do you tell them? Have you had anybody come to you to get your advice? Maybe they're considering taking a job on a new team, or trying to decide, you know, whether to work on a project at the intersection of AI and Postgres.

ARDA: Yes, by the way, there have been people, some from the data science side and my previous organization, moving more towards software development and Postgres, and also people within the Postgres community wanting to learn more about AI. Actually, my current colleague is doing more AI work right now with me on Postgres.

I always try to recommend, you know, certain fundamentals, and for those who are more skeptical, I also try to give pointers to these interpretable or responsible AI types of movements, to at least help them understand more of what's going on under the covers. But of course, to the best of my knowledge, these interpretable AI techniques are still taking their baby steps when it comes to reasoning about generative AI scenarios.

CLAIRE: Okay, so I think you just mentioned responsible AI. I know Pino and I were talking beforehand, and that was definitely something we wanted to ask you about. And I know that responsible AI is important at a number of companies, probably all companies, but especially including Microsoft. So talk to me: what does responsible AI mean?

ARDA: Responsible AI, at least at Microsoft, is a means of achieving secure, trusted, or trustworthy AI applications at the company, for the broader use of our services and applications. And I believe all of this originated from GDPR, because there were at least two clauses in the GDPR that made this a requirement.

One of them is the right to be forgotten. You know, whenever you demand it, all your data should be wiped from the databases and from the systems. And the other one, the most important one, is the right to challenge an AI system. So if a company decides something about you based on AI, or machine learning in general, you have the right to challenge this decision system, and the companies who are using AI and ML should justify the fair usage of their machine learning models to the customer, to the person who's challenging the system.

So responsible AI is a movement, or a systematic approach, at least at Microsoft, to basically create guardrails around any AI or ML related application we build at the company for our users.

CLAIRE: It's really interesting, because imagine if you were graduating from high school right now, about to go to college, contemplating becoming a software engineer, a computer scientist.

If you were doing all that, you know, when you were planning your courses, what you were going to take, let's say you were going to a university like I did that didn't have a core curriculum, so you really got to craft your own education. It feels more than ever like ethics and philosophy are an important part of a computer scientist's education.

Right? It's not just about code and mathematics and technology, but more and more, you have to be thinking about how this tech gets used and the implications for society. I'm sorry, that's a bit of a rant.

ARDA: No, exactly. Especially around fairness and all the ethical issues. But still, I would say mathematics is still an important subject as well.

CLAIRE: Oh yes. Yes. Of course. Yeah. I agree. I 100% agree.

PINO: And Arda, you mentioned fairness. Are there other aspects of responsible AI that you'd like to call out? I'm thinking of the facets of what it means to check your development and your product for responsible AI.

ARDA: Maybe personally identifiable information, again due to GDPR and similar regulations in different countries. I am from the European region, so I know mostly about GDPR, but I know similar regulations exist in the US and all around the world as well. So the use of personal data, for instance, is a really big issue. It's a red flag that signals certain aspects in our systems.

So the first thing that we are faced with when we design AI and ML at Microsoft is whether we use any personal data. And if so, how, or why? And would it create a bias towards a certain set of people, be it race, be it gender, be it age? Because as long as you're using these types of data, you have a chance, a possibility, of introducing a bias into your system.

And this should be avoided by all means. So mostly, you know, it's about personal information and what type of data sample you selected to train your models, to eventually be as fair as possible, or as bias-free as possible, to your end users.

PINO: So sorry, I just wanted to ask: I've heard the term toxicity.

Can you tell me a little bit about that?

ARDA: Oh, that one, unfortunately, I don't know much about. Could you elaborate a bit more?

PINO: I guess I've heard the term toxicity in the context of, I guess, what you call fairness: when a system, like a similarity system, can associate certain concepts with certain groups, or even, in the context of an LLM, generate harsh language.

Yeah, go ahead. Sorry.

ARDA: Yeah. I think I attended one talk by Sébastien Bubeck, a researcher at Microsoft, and he was mentioning a similar issue, maybe not emphasizing the toxicity, but showing that, you know, certain large language models would say, or would try to, you know, wipe humanity out.

Because, you know, in our sci-fi movies, AI is mostly the unwanted thing, and usually it ends the world. Our sci-fi movies are all around this theme. And because we're also using these sci-fi movie plots to train the LLMs, when you ask an LLM about its future and how it interacts with humanity, if humanity is doing wrong, the first thing that comes to its mind is basically to wipe out the society.

So maybe that's also part of this toxicity that you mentioned, which we should also be careful about. And that's why Sébastien Bubeck and his team trained the Phi series of models at Microsoft Research, and they only used selected textbooks to train them. Their research paper on the topic is called "Textbooks Are All You Need".

So we should be careful about what we are putting into these models, because they are super powerful in terms of expressiveness, and what you give to them will be amplified, be it bad or be it good. So we should be really careful about it. Does this answer your question, Pino?

PINO: Yes, it does. Thank you, Arda.

CLAIRE: Well, we could probably spend a deep three-hour podcast just talking about responsible AI. I think the ethics implications, and the natural concerns that a lot of people have when faced with uncertainty, could make for a great topic, but we don't have three hours today, and we're not here just to talk about responsible AI.

So I'm going to pivot us if I can. Pamela Fox is in the chat, and she asked a question about pg_vectorize, which I believe is another Postgres extension. I believe it's from the team at Tembo, or at least I saw a blog post that Adam Hendel had written. He works with Samay Sharma over at Tembo, which is a Postgres startup.

And so I'm curious, are you familiar with pg_vectorize? And if you are, what Pamela wanted to understand is: how does pg_vectorize compare and contrast to azure_ai?

ARDA: I've actually recently seen pg_vectorize on their GitHub and I checked the landing page. It seems to be a similar extension to azure_ai, definitely.

The API looks a bit different, but if you set the API part aside for a while, I think underneath it's using different sources. One of them is the vanilla OpenAI backend, for basically getting responses or generating embeddings. And the other source is Hugging Face, you know, the famous open source repository for different machine learning models and generative AI models.

It's using Hugging Face behind the scenes to basically call into many different large language models to get responses based on your data in Postgres. So they are similar, but we are not using Hugging Face as our backend or source for the time being, especially for responsible AI reasons.

CLAIRE: So I, I just published a blog post on a completely different topic. It was about a Postgres event that we organize here at Microsoft. It's free and virtual. It's called "POSETTE: An Event for Postgres", formerly called Citus Con. And yes, I am getting somewhere. I am driving to a point.

And because of that rename from Citus Con to POSETTE, I got a lot of pressure: Claire, you've got to explain the rename, you've got to explain the backstory. What does the new name mean, and why did you rename it, et cetera, et cetera. So I've kind of got naming on the brain, because I just published that last Friday.

And anyway, I've been thinking about the topic very intensely for all of a week, and I'm just Googling Hugging Face. Like, where in the world did this name come from? What does Hugging Face mean with respect to AI? Help me out here. Do you know the backstory?

ARDA: That I don't know, but I know that their logo is the hugging face smiley.

So maybe it has something to do with that. Yeah, the Hugging Face emoji.

CLAIRE: Okay. So explain it as if I'm two: what is Hugging Face again?

ARDA: Hugging Face is GitHub for data scientists. So on GitHub or GitLab we use Git and we basically version our source code. But when it comes to data science, versioning the source code is not enough.

Versioning the data itself is also important, because data science models are actually pieces of software. These are software programs, and not only their source code but also their data can change, and they need to be versioned and tracked accordingly. And Hugging Face provides a platform for this, as far as I know, as far as I could understand, because I've been using it to some extent at least.

It provides a set of cookie-cutter templates and a set of nice libraries for developers to train their machine learning models and share their models together with their benchmarks and the data sets that they used. So it's a nice place for comparing your models or ideas, and then pinning your data sets and doing benchmarks and comparisons.

And for the end users, they also provide certain APIs to basically do machine learning inferencing. So it's a nice marketplace, I would say, for these models and developers.

CLAIRE: It sounds actually quite useful. And Aaron dropped something in the chat. Apparently, according to ChatGPT, I don't know which version, the name Hugging Face originated from its beginnings as a chatbot company aiming to create a digital companion for young users, inspired by the hugging face emoji. So there.

ARDA: So I only remember the last part or last sentence. So thank you, Aaron, by the way.

CLAIRE: Yeah. Well, and I mean, obviously if you go to their page, you can see the emoji right there. So it's as if that's the mascot, or part of the logo or something. Okay.

So one of the things we agreed we were going to do is just cover the basics of the terminology. I think one thing that happens in tech is there are a lot of acronyms, and, you know, people talk as if everybody's expected to know these terms. And yet, I don't think everybody is on the AI bandwagon yet.

I don't think that all of our listeners necessarily know all of the relevant terms. So what were the other terms that we said we were going to talk about today, and can you explain them? I think we've talked about pgvector. We've talked about vector databases already. Did you already define what a text embedding is?

ARDA: I guess I did. A text embedding is a set of numbers, a vector of numbers, which encodes the semantic meaning of a text.

CLAIRE: So it basically enables the software to do all of those things, like semantic search, for example, that used to kind of need the human mind. Right? By turning language into math and numbers, it can now be calculated so much faster, and there are connections.

ARDA: So we can maybe put them all in context, if you want. Think of a library: the library will have books, and usually the books are sorted with respect to their context or topics. Obviously, the library here is the vector database.

It stores books or documents, and they need to be indexed somehow, because when I go to that library, I ask the librarian and they help me by looking at some sort of index, and then point me to the correct set of documents. And embedding vectors are the indices, or indexes, that the librarian searches through when I ask for a certain topic. In this case, you know, going back to our original discussion around avocado, because who doesn't love avocados.

If I go ask for a book on avocados, the librarian will just search through these embedding vectors, or indices, and then give me the closest books in semantic similarity to the word avocado. So text embeddings help with sorting these documents, or clustering these documents, around certain contexts.

And embedding models are the tools that we use to generate these indexes based on the content of the document. So it just contextualizes the document by checking the content and assigning it an index, the index now being a vector of numbers instead of the scalar number that we usually use in a Postgres database.

CLAIRE: Okay, so I really like your analogy, but I want to know who is the librarian in this analogy. And what I mean by that is, when you go to a librarian and say, Hey, I've got this project I'm working on, and I'm looking for information about A, B, C, the librarian uses their brain to know what to go look for in the card catalog, right?

They have a very good mental map and understanding of not just the layout of the library, but more like the, the richness of the resources and the various topics and places and authors, et cetera. Who's the librarian in this new model?

ARDA: Actually, in the analogy I gave, the librarian used to be this human assistant, or human personnel, that helps with the sorting part.

But now, in this new analogy and in this new world, at least in the world of generative AI, the librarians are large language models. They have a lot of knowledge around a specific context. All these models, if you remember the beginning of the discussion, are trained on a certain set of data, and we can think of large language models as a Wikipedia specialized around certain topics. In this case, the librarian would be a certain large language model, like ChatGPT 4 or 5, or Llama, anything, you know, that you can think of.

And this would probably bring us to our yet another acronym, RAG, Retrieval Augmented Generation. Usually these librarians or large language models know a lot of stuff, but they have a cutoff date because they are trained up to a certain point in time. So certain versions of ChatGPT were trained up to maybe November 2022 or 2021 and anything that happened since then obviously is not present in this librarian's mind.

We need to somehow augment this knowledge with new data, and this retrieval augmented generation mechanism is basically feeding indexed context, in this case the library's books, into the context of the librarian. The librarian knows how to reach out to certain stuff, but in this case doesn't know the new information.

And we bring this information to the librarian through RAG, Retrieval Augmented Generation, and vector databases are currently the state of the art for supplying this information to large language models.
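The retrieval half of RAG is, at its core, the same similarity query from earlier in the conversation. Here is a minimal sketch reusing the hypothetical documents table and the azure_openai.create_embeddings UDF from the earlier sketches (both remain assumptions): embed the user's question, fetch the closest chunks, and paste them into the LLM's prompt.

```sql
-- RAG retrieval step: embed the question, then pull the most
-- semantically similar document chunks to feed the LLM as context.
WITH question AS (
    SELECT azure_openai.create_embeddings(
               'my-embedding-deployment',                  -- hypothetical
               'What changed in the return policy this year?'
           )::vector AS q
)
SELECT d.content
FROM documents AS d, question
ORDER BY d.embedding <=> question.q
LIMIT 5;  -- these rows become the "grounding" text in the prompt
```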

CLAIRE: So let's talk about that. If I am working with Postgres and I'm spinning up on AI, and I want to learn more about how to do what you just said with RAG, right? Because obviously you want the most current data you can get, in a lot of cases; not always, but in many cases. How do I, where do I go to learn? What suggestions do you have?

ARDA: First, let me ask you another question. Who are you? Are you an application developer or a Postgres developer?

CLAIRE: Well, we'll do both. But right now, I'm going to channel that I am an application developer.

ARDA: Okay. So the application developer, in this case, needs to know pgvector, because that is the extension. Well, there are many extensions that do vector storage and computation, so we should not be unfair to the rest of the extensions, but most people are mostly knowledgeable about pgvector.

So you need to know pgvector to store your embedding vectors. And of course, you need to know what embedding vectors are, and you need to find yourself a proper model. So if you cannot afford these closed embedding generation models from OpenAI or Azure OpenAI or elsewhere, you can go to Hugging Face for open source models, or, if your team is capable of creating such models, you should go to your team and ask for an appropriate embedding generation model. And appropriate here basically means: certain small models can be specialized on certain jargon or certain domains. So if you are working as an application developer in healthcare, maybe there are specialized versions of these open source models for the healthcare context.

Then you need to query this model, and so you might need an application layer if this is some custom model that is not hosted somewhere. Otherwise, you can use azure_ai or similar extensions that we discussed today to do this within the database. And when you query the embedding generation model and store the embedding vectors in your database, the rest is writing a set of SQL queries that do distance ranking, and basically you feed this into the LLM.

So ChatGPT had a plugin system; I think it still has one right now. The plugin is just a Python application that does what I'm describing in words right now: basically searching through the stored embedding vectors with respect to certain distance metrics. So this is the application developer perspective.
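One practical footnote to the distance-ranking step Arda describes: as the table grows, you would normally add an approximate nearest-neighbor index so those queries stay fast. A sketch using pgvector's documented index types; the operator class must match the distance operator you query with.

```sql
-- HNSW index for cosine distance queries (embedding <=> ...).
-- Available in recent pgvector versions.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Alternative, available in older pgvector versions as well:
-- CREATE INDEX ON documents
--     USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```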

CLAIRE: Okay. And let's pivot. Let's say I'm a Postgres developer. Is the answer different?

ARDA: Well, now you have more power. Obviously, if you are a Postgres developer or an extension developer, you can use C, or, my favorite, Rust and the pgrx framework, to write yourself an extension to enable all these things within the database, which is mostly what we and other teams working on Postgres are doing.

So you need to integrate somehow with pgvector, or some vector extension, in the database. And then you need to do all these API calls and similar stuff within the extension, so that it is handled at the database layer, like we've seen in these extensions. And one thing I forgot regarding the application developer part, so maybe I should go back there again:

There are famous frameworks, like Microsoft's AutoGen and the LangChain framework, for creating chain-of-thought systems using different large language models. So this is also super helpful. You create a chain of different large language models specialized in different tasks, and then you feed this chain with data, you get their output out, and then hopefully solve your business problem. So information or knowledge around the LangChain framework or Microsoft's AutoGen framework would also be super helpful.

PINO: I guess that means you don't have to go and code at the lower level.

The things that you described, some of these frameworks can do out of the box. You know, you give them the data. You point them to the data set in your database and they can apply the Retrieval Augmented Generation, RAG, pattern to it. They can do things like help you tie in other API calls. I'm not that familiar with it, but basically we're taking it up a level where you don't have to think about all the different steps and glue it all together.

The framework provides that. Is that correct?

ARDA: Correct. These frameworks, especially LangChain, which is the one I know best, provide support for different sources of data, Postgres through pgvector being one of them. And you just need to point to your data, and then you need to know which large language model to use in your chain.

So ChatGPT is just one large language model, with the latest version, but there are many large language models, and you just make a chain out of them. And then you let LangChain, the framework, do all the steps for you. You're right, Pino.

PINO: And I guess maybe out of that, since you talked about the RAG model, did you talk about "prompt engineering" or "grounding" or "chunking"?

Do you want to talk about those a little?

I wanted to make sure we covered, like, when we talk about Retrieval Augmented Generation, I think it's important to remind people that we're trying to provide more context for the LLM.

I think this is called "grounding".

ARDA: Yeah, there are, I think, different words, like zero-shot learning and few-shot learning, usually when you're interacting with prompt-based large language models, or chat-like large language models. These models are specialized in understanding the pattern in the context and then generating text that would be highly likely to follow the pattern.

So, in this case, "prompt engineering" is a technique that you use to actually steer the large language model towards the solution that you'd like to obtain. Zero-shot learning and few-shot learning are some other names for different tasks around prompt engineering. You basically ask the question and then give pointers to what you'd expect, for instance by giving examples: you ask it to complete a conversation, and you start the conversation with, you know, person A and person B, and then you end with person B and a question mark, and then it will try to complete the sentence or the conversation.

So this is called prompt engineering. And usually, within this prompt engineering technique, we would like to benefit from real-time data, and this is where RAG, Retrieval Augmented Generation, comes into play. And there is a great talk by Andrej Karpathy. He gave a keynote at Microsoft Build last year, 2023, and I think it's available on YouTube.

That's an interesting 53 or 55 minute talk, or an hour talk, for those who are interested in the types of LLMs and the types of techniques that are used. There's also a great course, a very short course, by Andrew Ng on DeepLearning.AI regarding prompt engineering for developers.

Maybe I should just point our audience to these two great resources, because they do a much better job than I would be doing to explain these things. They're great resources.

CLAIRE: So, when I think about prompt engineering, and tell me if this is right or not, because I certainly don't have your expertise.

The analogy I have in my head is that historically some people were really good at querying a search engine, like just what they typed into the search bar on Google or Bing. And other people were not as effective at it. You just had to build the understanding and the techniques of how do I ask that question.

And so to my mind, prompt engineering is like the engineering around what a good prompt is to something like ChatGPT, for example. And I know that to do some of the work that I do, like when I'm promoting a blog or whatever, I always want to choose a good social graphic, a good OG image, and so I use things like Bing Image Creator. If you type something very nuanced and precise, you can get a completely different result than if you're a little bit sloppier with what you type in.

So, really thinking about what that prompt is has a different impact. So that's what I think of as prompt engineering. Is that a good analogy or not?

ARDA: It's a fantastic analogy. Like you said, in Google, I think, or in Bing, we used a plus sign to include terms in the first two or three pages.

And we used double quotes to make the word a requirement to be on the page itself, so that the search result would pick the correct website, for instance. So instead of using plus and quote signs in our search, now we're using natural language and hinting, and we're explaining specifically what we would like to get out of this interaction with the LLM.

That's correct.

CLAIRE: Okay. Awesome. Are there other analogies that you want to use to help people understand more? Specifically, I feel like some of the concepts and terms we've discussed today assume a certain level of knowledge already. So imagine that you had a favorite nephew who was just graduating from college right now.

They've been working with Postgres through college, but they really haven't paid attention to what's going on with AI at all. They maybe haven't even heard of pgvector, azure_ai, pg_vectorize, any of these things. What would you tell them to go read? Where would you point them to start?

Do you have a favorite blog or a favorite, like you just mentioned, the Andrej Karpathy keynote from Microsoft Build last year? What are the five starting places?

ARDA: Okay. I think I would recommend first going to check the OpenAI documentation. I think they have a great set of information or resources around text generation models, embeddings, and fine-tuning, even prompt engineering and safety practices.

Microsoft Learn, which is also open to everyone, has AI Fundamentals, the Generative AI learning path, I believe, which consists of three different modules, starting from the basics of generative AI, going through Azure OpenAI services, and ending with this responsible generative AI requirement that pretty much everyone needs to know. We've already mentioned the Prompt Engineering for Developers course, and there's also Generative AI with LLMs.

And of course, Towards Data Science is a nice publication on Medium where many data science related topics are discussed. And then, if your nephew is into the deeper aspects...

CLAIRE: I should be clear, nephew or niece, either way.

ARDA: Okay. Okay, fine. If your family member, so to say, is more into the theory of it, then I would go take some linear algebra courses. Or there's this great deep learning class at Stanford. It's mostly around image processing, but it covers the neural network perspective and the training perspective well enough. And it would be a nice starting point to further build upon, I would say.

PINO: And Arda, this is a question I've had for a while. You know, when we talked about AI in the past, before ChatGPT, folks often thought about AI as being able to build the machine learning models. With the new wave of generative AI, a lot of people are getting into AI with maybe no expectation of understanding what's happening under the covers, except at some high level. Is that okay? Or do you recommend that people go beneath the surface to understand how the models work?

ARDA: Maybe not too deep, like these courses, because these courses are also master's level, or sometimes maybe PhD level, courses. But still, you know, having a rough understanding, even from a Wikipedia search on, you know, the machine learning process or optimization, having a rough idea of how these models are trained, is important. Because when we talk about fine-tuning an already available large language model, it's basically retraining the same model with a new set of data.

So having this rough understanding of the training process, of the optimization and the structure of these models, will help you. It will also help you with these few-shot learning scenarios, because eventually you are feeding information to this neural network with some context, and you're guiding the large language model towards what you're seeking.

PINO: And I guess, would it also help you to select among the large and growing number of models you can choose from, if you're doing Gen AI today?

ARDA: Hey, could you repeat that one? Sorry.

PINO: I noticed that there are more and more large language models being released, and there are even efforts on small language models that perform quite well within a specific domain. And so I think one of the challenges for app developers is going to be just selecting among the many choices. Part of it is a resource and performance problem, from the point of view of latency and, you know, CPU cycles used and memory and so on.

But then there's also the performance of the system: How accurate are the answers? How much information do I need to convey? What can it summarize? How much context can it hold? I'm asking because I don't really know. I'm just imagining the kinds of things one would ask when deciding whether to take Llama versus Phi.

ARDA: Yeah. Some level of knowledge would of course prove handy or important there, but Hugging Face and similar websites also create leaderboards for certain tasks.

So at least you can go check these leaderboards for those tasks and get a nice start on model selection among a plethora of models. But then, eventually, whether you're going to use small language models or large language models, you might need to retrain them or fine-tune them to some extent.

And like you said, I forgot to mention that the same researcher at Microsoft Research, Sébastien Bubeck, is also more for smaller language models instead of these large language models. And there is now this line of research as well, into assessing the performance and whether we need models as big as ChatGPT, for instance, for our tasks.

And this is just starting, so we should, I think, be seeing more models and more attempts to come. And it's going to be, I believe, even harder to pick one or the other. So it's also getting harder to select.

PINO: Alright, so we've talked about how to get started, we've covered concepts. Any... well, are there... Oh, sorry. Go ahead, Claire.

CLAIRE: Well, I wanna get back to the tips and resources on how to get started, because I think Arda gave us a few and then we segued. And I just wanna give you the opportunity, Arda, if there's anything else that you wanna recommend to my hypothetical family member who wants to learn more about that intersection between Postgres and AI.

Is there anything else you wanna suggest? I have a tip, so while you think about it, I'm just going to say that our first episode of this podcast, which happened, was it last March or early April of last year (because this is now episode 13?), featured Simon Willison and Marco Slot. And for those of you who don't know Simon Willison, I mean, he's brilliant.

Marco's brilliant as well, but I want to talk about Simon here. Simon's the co-creator of Django, and he's been very interested in what's going on with large language models from the get-go. Last year, I think, it completely pivoted the focus of his research and how he spent his time. And anyway, he has a fantabulous blog.

And I think he does a really good job explaining things in a way that both experts and non-experts can understand.

ARDA: I think I've read one of Simon's blog posts, if that's the same Simon. It was about embedding vectors, or embeddings. Is that the one?

CLAIRE: Well, I'm looking at his blog right now.

And I think Aaron just dropped the link into the chat, and there are 460 blog posts that are tagged with AI.

ARDA: So I guess one of them was that one. Yeah.

CLAIRE: Yeah. The episode that he came on was to talk about working in public, and that's why he has so many blog posts, because he publishes all these "Today I Learned" short posts, and he does all of his research, all of his thinking, in public, which is just, I think, a huge gift to the world and to people like me who want to learn, especially because he's such a good, well, to make a little Postgres pun, he's such a good EXPLAINer.

Anyway, so that would be my suggestion. I think it's a great place to learn if you know nothing and you just need things explained to you.

ARDA: Maybe the last one, then, or at least one more, can come from me. Well, your niece or nephew is lucky enough to already know Postgres from their, you know, undergraduate studies.

I was not that lucky, and I learned the application development aspects of Postgres through a book. I'm more of a reading person than a watching person. So I benefited a lot from "The Art of PostgreSQL" book by Dimitri Fontaine, and of course the Postgres documentation. I mean, the documentation has tons of information, and the book is very practical as well.

CLAIRE: Well, there are a lot of people who work on the Postgres documentation, so a huge shout-out to all of those contributors. And they take that work very, very seriously. I think when you are integrating changes into Postgres, the documentation is not a second-class citizen. It's very important. But going back to Dimitri, his book, "The Art of PostgreSQL", I literally have it sitting in front of me, but I have the first edition.

It used to have a different title. It used to be called "Mastering PostgreSQL in Application Development". And then for the modern version, yeah, he renamed it, they redid the cover art, and it is super popular. You are not the only one to recommend it as a way to spin up and learn more about Postgres.

So yeah, I can see someone dropped the link in. Big recommend. And I actually think I have a discount coupon, 20% or 30%, that has my name in the coupon code, that I can probably drop into the chat after we're done with the recording today.

ARDA: So actually, I should have asked you before I bought the book.

CLAIRE: Well, and I... is my book signed? Yes, it is. It's signed with Dimitri's beautiful handwriting.

ARDA: Mine is the digital edition.

CLAIRE: Got it. Okay, so before we wrap up: that was your additional tip, and I shared my Simon Willison blog suggestion. Pino, did you have anything else that you wanted to add to help people spin up on Postgres and AI that we didn't talk about?

PINO: Nothing comes to mind.

CLAIRE: Okay. Then the last thing I wanted to do: I saw in the chat somebody piped in a link to pgrx, and obviously Rust is a super popular language. I think that goes without saying; people love it. I remember at last year's Citus Con, Jelte Fennema-Nio gave a talk.

And he waxed philosophical about how he had built this extension in Rust and how he loved programming in Rust. So, just going back to azure_ai for a second: is it implemented in Rust? And did you use pgrx as a framework for building it?

ARDA: Yes, to both Rust and pgrx.

CLAIRE: And are you... well, I know the answer to this question.

It's a super softball. Are you giving a talk at the upcoming PGDay Chicago event about building extensions in Rust and with pgrx?

ARDA: Yes. In late April, my colleague and I are going to give a talk about our learnings from building extensions with Rust and pgrx, at PGDay Chicago.

CLAIRE: Okay. I think that's awesome. I'm going to be there as well. And even though it's one of those one-day PGDays, and you might think, oh, it's just one day, how many talks can there be? Is there really going to be enough to interest me? It's actually a three-track conference.

So it's going to be difficult for most attendees to pick which session to attend. And I'm actually a little nervous, because I'm competing against two really fabulous presenters and talk topics with the talk I'm giving. So anyway...

ARDA: I'm sure they're also nervous.

CLAIRE: Well, yeah. So anyway, hopefully we won't split the audience too unfairly.

And I will drop a link to that talk. I don't know if it's going to be recorded, so maybe you should submit it somewhere else too. Like, I know that the POSETTE CFP is still open.

ARDA: I know. I will be doing it.

CLAIRE: The POSETTE CFP is open until April 7th, for anybody interested in giving a virtual talk that is well produced, with good production values and things.

So, okay. I think that's a wrap. We're at 17 after.

ARDA: To reemphasize, I think any person who's interested in Postgres from the developer perspective should give Rust and pgrx a try.

CLAIRE: I think I agree. And I think that the creator of pgrx... is it fair to call it a toolkit? How would you describe it?

ARDA: I think it's a framework, like an SDK plus some tooling, that handles all the boilerplate code that we need when developing extensions for Postgres. And it includes a testing framework as well.

So it's a framework. A very nice, very well-thought-out framework.

CLAIRE: Well, I think Eric Ridge, who's the creator, I knew him for years as ZomboDB on Twitter, because that's the Twitter account he uses, I guess, for both ZomboDB and himself. I think he'd be very happy to hear you say that.

So shout-out to Eric as well. Thank you, Arda, for joining us. Thank you, Pino.

PINO: Thank you, Claire and Arda.

CLAIRE: Okay.

ARDA: Thank you.

CLAIRE: Before we wrap up, we just have a couple of quick announcements. The next episode will be recorded live on Wednesday, April 3rd at 10:00am PDT. That's right, PDT, because we'll have changed to daylight saving time by then. The guests and topics are TBD, but we'll be announcing soon. You can mark your calendar now with aka.ms/PathToCitusCon-Ep14-cal, and that calendar invite has all the instructions for joining and things like that.

You can also get to past episodes of this podcast and links to all the podcasting platforms where you can subscribe at aka.ms/PathToCitusCon, all one word. Transcripts are also included on these episode pages too, which can be super useful.

PINO: Before we leave, we'd like to ask you a favor, especially if you've enjoyed the podcast, please rate and review us on your favorite podcast platform.

It helps others find us.

CLAIRE: And a big thank you to everybody who joined the recording live and participated in the live text chat on Discord. That's a wrap. Thank you.

ARDA: Thank you.

Creators and Guests

Claire Giordano
Host
Claire Giordano
Claire Giordano is head of the Postgres open source community initiatives at Microsoft. Claire has served in leadership roles in engineering, product management, and product marketing at Sun Microsystems, Amazon/A9, and Citus Data. At Sun, Claire managed the engineering team that created Solaris Zones, and led the effort to open source Solaris.
Pino de Candia
Host
Pino de Candia
Pino de Candia is a software dev manager at Microsoft since 2020 and is currently working on the Citus open source project. Pino previously worked on the managed PostgreSQL database service in Azure Cosmos DB for PostgreSQL, which includes Citus on Azure support for distributed PostgreSQL. Pino has lived in New Orleans since 2017.
Aaron Wislang
Producer
Aaron Wislang
Open Source Engineering + Developer Relations at Microsoft + Azure ☁️ | Go (golang), Cloud Native, Linux 🐧 🐍 🦀 ☕ 🍷📷 🎹 | Toronto 🇨🇦🌎 | 💨😷💉 | https://aaronw.dev/hello/
Ariana Padilla
Producer
Ariana Padilla
Program Manager at Microsoft in the Azure Database for PostgreSQL team | Avid Traveler 🛫 & Foodie 🍽️🍹