Introducing Lakebase - Databricks Co-founder & Chief Architect Reynold Xin
2025-06-11
Introducing Lakebase, the first fully-managed, multicloud, Postgres-compatible transactional database engine designed for developers and AI agents. Many organizations find that their data layer doesn't scale with their AI applications, leading to an explosion of pipelines that must be managed at all integration points. Lakebase combines the familiarity of Postgres with the scalability of the lakehouse and the agility of Neon's database branching technology. The result is a fundamentally bette...
Subtitles

to talk more about this and what we've been doing at Databricks for the last two years to work on Lakebase, I want to welcome my co-founder, Reynold Xin, to the stage. [Music]

[Applause] Thank you, Ali. We're actually supposed to shake hands, but given that you walked all the way over there and I'm supposed to come over here, I think it would be a little bit difficult. So, virtual handshake.

As Ali said, Databricks in the last decade has mostly been focusing on the analytics side of data infrastructure. If you look at analytic systems today, whether on Databricks or from other vendors, they look remarkably different from what they were in the '90s. A lot of foundational technology has been invented: columnar storage and vectorized processing have dramatically sped up analytical workloads; streaming, invented maybe over a decade ago, made data a lot fresher; and of course in 2020, five years ago, we published the lakehouse blog and pioneered a new architecture that decoupled storage and compute and, more importantly, was based on open formats. That enabled a lot of new workloads and dramatically lowered the TCO of analytical systems.

Now, OLTP databases are kind of

stuck in the past. What do we mean by that? If you look at the OLTP databases running today, whether commercial proprietary systems like Oracle or open-source databases like MySQL and Postgres, they look more or less the same as they did in the '90s. Yes, we've added a lot of features, and they have gotten faster. But if you look at the components, the techniques, the foundational ideas, they're more or less the same; you can trace them all the way back to the database systems papers of the '70s. Databases are viewed as this heavyweight infrastructure that requires a lot of manual intervention and maintenance, and it's quite clunky. First, databases are very slow to provision and difficult to scale; if your workloads are fairly dynamic, it's often a nightmare to deal with that on the database side. Because of that, databases are fairly disconnected from modern-day developer workflows, which we'll zoom into in a bit. They're also very siloed from analytics and AI: it's actually not unusual these days to want to combine analytics and AI with your transactional database workloads, but it's very difficult to do. So what do I mean by databases being disconnected from

modern-day developer workflows? Well, imagine you're a software engineer trying to add a new feature to a codebase. The very first thing you'll likely do is run the following command: git checkout -b (or maybe click in the UI). What it does is create a new branch of your codebase. You'll make changes to this new branch, adding a feature, maybe fixing a bug, and you'll test against it, but all the changes you make are isolated to that specific branch. And creating a new branch is an instant operation. It's very, very fast. You don't have to think twice about it; you just do it.

What's the equivalent for databases? If you want to clone your production database, it might take days. You put one up, and you almost never shut it down. Something like this simply doesn't exist. Wouldn't it be nice if you could branch off a database just like you do with code?
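The reason git checkout -b is instant is that a branch is just a named pointer to a commit; nothing gets copied. Here is a toy Python model of that idea (illustrative only, not git's real internals):

```python
# Toy model of why branch creation is instant: a branch is only a named
# pointer to a commit, so "git checkout -b" copies no files at all.

class Repo:
    def __init__(self):
        self.commits = {"c0": "initial snapshot"}   # commit id -> data
        self.branches = {"main": "c0"}              # branch name -> commit id
        self.head = "main"

    def checkout_b(self, name):
        # Equivalent of `git checkout -b name`: O(1), just one new pointer.
        self.branches[name] = self.branches[self.head]
        self.head = name

    def commit(self, cid, snapshot):
        self.commits[cid] = snapshot
        self.branches[self.head] = cid   # only the current branch advances

repo = Repo()
repo.checkout_b("feature")               # instant: one dict entry
repo.commit("c1", "snapshot with new feature")
assert repo.branches == {"main": "c0", "feature": "c1"}
assert repo.commits["c0"] == "initial snapshot"   # main's history untouched
```

The cheapness of the pointer is exactly what the talk argues databases have been missing.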

Now, let's say you get past all that development hassle and you manage to build a pretty successful app, which many of you have. The app has taken off, and you want to introduce some analytics or AI capabilities to it. So your app development team starts talking to your data infrastructure team, the one managing the lakehouse, and asks, "Hey, how do I actually get the data from one side to the other?" Now you have to figure out how to manage two disparate systems. You have to understand the IAM role differences. How do you set up secure networking? How do you create ETL pipelines and load data from one to the other? You learn fancy terms like change data capture, SCD type 1, and SCD type 2, which, up to this day, I still don't understand.
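For anyone else who hasn't met those terms: here is a minimal, hypothetical Python sketch of the two common slowly-changing-dimension flavors, where type 1 overwrites history and type 2 preserves it (invented schema, nothing Lakebase-specific):

```python
# Slowly Changing Dimensions: two common ways an ETL pipeline can record
# a change to a dimension row (say, a customer moving cities).

def scd_type1(table, key, new_row):
    """Type 1: overwrite in place; history is lost."""
    table[key] = new_row
    return table

def scd_type2(rows, key, new_row, ts):
    """Type 2: close out the old version and append a new one; history kept."""
    for r in rows:
        if r["key"] == key and r["current"]:
            r["current"] = False          # expire the old version
            r["valid_to"] = ts
    rows.append({"key": key, **new_row, "current": True,
                 "valid_from": ts, "valid_to": None})
    return rows

# Type 1: the old city simply disappears.
t1 = scd_type1({"c1": {"city": "Oslo"}}, "c1", {"city": "Bergen"})
assert t1["c1"]["city"] == "Bergen"

# Type 2: both versions survive, flagged by validity.
t2 = scd_type2([{"key": "c1", "city": "Oslo", "current": True,
                 "valid_from": 0, "valid_to": None}],
               "c1", {"city": "Bergen"}, ts=1)
assert [r["city"] for r in t2] == ["Oslo", "Bergen"]
```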

It all just seemed awfully complicated.

So in the past couple of years at Databricks, we've been working on how to tackle this problem and actually eliminate all of these challenges, and the result is Lakebase. Lakebase has the following attributes. First and foremost, it's based on open-source Postgres. Second, it's built on a novel architecture that decouples storage from compute, which enables the modern-day developer workflow. And by building on top of Databricks infrastructure, it comes with what you would expect: amazing lakehouse integration as well as all the enterprise-readiness features. Now let's talk about each of them, one by one.

First and foremost, Lakebase is built on open standards, namely open-source Postgres. In the last few years, open-source Postgres has been steadily on the rise; in the latest Stack Overflow survey of the most popular databases, Postgres leads by a wide margin. This is because of its robust ecosystem of tools, libraries, and extensions, and all of this just works out of the box on Lakebase. And Lakebase can guarantee you single-digit-millisecond latency at scale.

The second most important part of

Lakebase is that it's built on a novel architecture that separates storage from compute. There are three layers to this architecture. At the very bottom, we use data lakes, that is, object stores, to store the actual physical data. Object stores are the cheapest storage medium you can find, and they're extremely reliable at scale. Now, one of the challenges Ali referred to is that object stores were not exactly designed for the type of workloads OLTP databases need. A 100-millisecond query is plenty fast in a lakehouse for analytics, but 100 milliseconds is unacceptable for OLTP workloads; we need single-digit-millisecond latency. The way we've solved that is by introducing a middle storage layer that holds only soft state and acts as a write-through cache in front of the object stores. And for those of you who are database nerds, it also provides a new way to very quickly persist the write-ahead log, or what we typically call the WAL. On top of the storage layer, we have the ephemeral compute nodes, which are Postgres instances that read and write to the underlying storage layer.
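A rough sketch of what a soft-state, write-through middle layer means (purely illustrative; the real Lakebase cache is of course far more involved): every write also lands in the durable object store, so the cache can be lost and rebuilt at any time without losing data.

```python
# Sketch of a write-through cache over a slow but durable object store.
# The cache holds only soft state: every write also lands in the store,
# so the cache layer can be discarded and rebuilt at any time.

class ObjectStore:
    def __init__(self):
        self.blobs = {}
    def put(self, key, value):   # durable but high-latency
        self.blobs[key] = value
    def get(self, key):
        return self.blobs[key]

class WriteThroughCache:
    def __init__(self, store):
        self.store = store
        self.cache = {}          # soft state only
    def write(self, key, value):
        self.cache[key] = value
        self.store.put(key, value)   # write-through: the store is always current
    def read(self, key):
        if key not in self.cache:    # miss: fall back to the object store
            self.cache[key] = self.store.get(key)
        return self.cache[key]

store = ObjectStore()
cache = WriteThroughCache(store)
cache.write("page:42", b"vanilla postgres page bytes")

# Simulate losing the soft-state layer entirely: no data is lost.
cold = WriteThroughCache(store)
assert cold.read("page:42") == b"vanilla postgres page bytes"
```

Serving repeated reads from memory is how the design can hit single-digit-millisecond latency despite sitting on an object store.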

One thing that's very important for me to point out: very similar to lakehouses, the actual data stored in the object stores, the data lakes, is in an open format. It is not some proprietary format we invented to improve performance or to lock you in; it's just vanilla Postgres pages. And this opens up a whole new paradigm of opportunities.

There have actually been a lot of past attempts at this problem. This is not the first time the industry has tried to build a separation of storage from compute for OLTP databases; some commercial systems, especially from the hyperscalers, have done it. But they were typically built on yet another proprietary storage system, they are way more expensive, and they don't use open formats, which means they can manage to lock you in even more.

As for how we cracked the code here: some of you might recognize the architecture diagram. Databricks acquired a company called Neon just last month, and you've probably already heard about it. Now, we did acquire Neon; we announced the acquisition last month and only closed it last week. But one of the interesting things that's not widely known is that we actually invested in Neon many years ago and have been working with the Neon team as a technology partner on this separation of storage from compute. Everything in Lakebase is built on this collaboration. And building on top of this

novel architecture, we managed to accomplish a few fairly interesting things.

The first is serverless. What do we mean by serverless? Well, earlier we said databases are this heavyweight infrastructure that requires a lot of manual maintenance and intervention. Serverless here means databases become lightweight. What does lightweight mean? Lakebase comes in two flavors. The first is a provisioned-throughput flavor, where you specify exactly how big you want it to be; if you know how to size your workload, that's the perfect solution for you. But for most people, the autoscaling flavor will be far more interesting. In the autoscaling flavor, you don't have to worry about how big a database you should be picking. Because the databases are just ephemeral instances, you can launch one only when you need it, and it takes less than a second to launch a brand-new database. If your load scales up, you can either scale vertically, which the system will do for you, or choose to create replicas, which also come up in less than a second. And if your load goes down and you no longer have any, say you're a very America-centric company and past 5:00 PM you have no load, it can automatically shut down very quickly. All of this just happens in less than a second. And the best part: you only pay for the duration you actually need the compute.
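The scale-up, scale-down, and scale-to-zero behavior described above can be caricatured as a simple control loop; the thresholds and policy below are invented for illustration and are not Lakebase's actual autoscaling logic:

```python
# Toy autoscaler: scale out under load, scale to zero when idle.
# Thresholds are made up for illustration, not real Lakebase policy.

def decide(qps, idle_seconds, replicas, max_replicas=8):
    """Return the new replica count for one control-loop tick."""
    if qps == 0 and idle_seconds > 300:
        return 0                        # scale to zero: pay nothing while idle
    per_replica = qps / max(replicas, 1)
    if per_replica > 1000 and replicas < max_replicas:
        return replicas + 1             # add a replica (comes up in ~a second)
    if replicas > 1 and per_replica < 200:
        return replicas - 1             # shed an unneeded replica
    return max(replicas, 1)             # resume from zero on any traffic

assert decide(qps=5000, idle_seconds=0, replicas=2) == 3   # bursty load
assert decide(qps=0, idle_seconds=3600, replicas=1) == 0   # 5 PM, nobody home
assert decide(qps=50, idle_seconds=0, replicas=0) == 1     # wake back up
```

Because instances are ephemeral and state lives in the storage layer, each of these transitions is cheap; that is what makes per-second billing practical.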

The second thing we built was branching. We talked earlier about how difficult it is to branch off a database and to apply modern software development practices to databases: it's very easy to do with code, but very difficult with databases. The separation of storage from compute also has a copy-on-write capability built in, so we can instantly branch off a database. It takes less than a second to create a whole clone of the database, and that includes both the data and the schema. And because of the copy-on-write capability, you don't have to pay for extra storage unless you start making changes, and only the changes themselves incur extra charges, because under the hood the branches all share the same storage.

Something pretty magical happens when you combine the branching capability and the serverless capability into one: it completely changes the way you think about database development. Every time you do git checkout -b, you should automatically branch off a database with that new branch of code, and keep the two perfectly in sync as you make schema changes. If you don't like your new code branch and whatever changes it made to the database, just kill both the code branch and the database; you pay next to nothing, just like how you pay for your code repository.

And this is extra important in the age of agentic coding and AI. One way to think about AI agents is that you're getting, at very low cost, armies of thousands, or in the extreme case millions, of AI agents, each acting as its own individual engineer, running experiments on your codebase and adding new features. You might even have multiple AI agents implementing the same feature, with judges to determine which implementation is best. Now every AI agent can get not only its own code branch but also its own database, at virtually no cost, for experimentation.
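Conceptually, the copy-on-write branching described above can be sketched like this (an illustrative toy, not the actual storage engine): a branch starts as a pointer to its parent's pages, and only pages the branch writes get private copies, which is why cloning is instant and unchanged data costs nothing extra.

```python
# Copy-on-write branching sketch: a branch starts as a cheap reference to
# its parent's pages; only pages the branch modifies get private copies,
# so unmodified data is shared and incurs no extra storage.

class DbBranch:
    def __init__(self, pages=None, parent=None):
        self.own = pages if pages is not None else {}  # pages this branch wrote
        self.parent = parent

    def branch(self):
        return DbBranch(parent=self)     # instant: just a pointer, no copying

    def read(self, page_id):
        b = self
        while b is not None:             # walk up until some ancestor has it
            if page_id in b.own:
                return b.own[page_id]
            b = b.parent
        raise KeyError(page_id)

    def write(self, page_id, data):
        self.own[page_id] = data         # copy-on-write: parent stays untouched

    def extra_storage(self):
        return len(self.own)             # only changed pages are billed

prod = DbBranch(pages={1: "orders", 2: "customers"})
dev = prod.branch()                      # full "clone" in O(1)
assert dev.read(2) == "customers"        # shared with prod, zero extra storage
assert dev.extra_storage() == 0
dev.write(1, "orders-experiment")        # diverge on one page only
assert dev.extra_storage() == 1
assert prod.read(1) == "orders"          # production never sees the change
```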

The separation of storage and compute, especially by leveraging open formats in the underlying storage layer, also makes it super easy to synchronize data at very high throughput from one object store to another: from one data lake to another, from the lakehouse to Lakebase. And many of you, if you're existing Databricks customers, probably expect this out of the box given what we're doing. You can publish any table in the lakehouse into Lakebase for real-time serving to get single-digit-millisecond latency, and you can also do the reverse: you can very easily get data from Lakebase directly into the lakehouse, managed by Unity Catalog, with just a few clicks.
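Conceptually, this kind of table sync boils down to a change-data-capture loop; the helper names below are invented for illustration and are not the managed synced-tables API:

```python
# Conceptual sketch of one-way table sync (lakehouse -> Lakebase or the
# reverse): repeatedly read rows changed since the last synced version
# and upsert them into the target by primary key.

def changes_since(source_log, last_version):
    """Changes committed after last_version: (version, pk, row_or_None)."""
    return [c for c in source_log if c[0] > last_version]

def apply_sync(target, source_log, last_version):
    for version, pk, row in changes_since(source_log, last_version):
        if row is None:
            target.pop(pk, None)     # a delete propagated from the source
        else:
            target[pk] = row         # insert or update by primary key
        last_version = version
    return last_version              # checkpoint for the next run

log = [(1, "o1", {"qty": 10}), (2, "o2", {"qty": 90}), (3, "o1", None)]
serving = {}
v = apply_sync(serving, log, last_version=0)
assert serving == {"o2": {"qty": 90}} and v == 3
```

The point of the platform feature is that this loop, plus the IAM, networking, and pipeline plumbing around it, is managed for you rather than hand-built.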

And of course, by building on top of Databricks infrastructure, Lakebase is enterprise-ready: it comes with all the bells and whistles you'd expect, from security to compliance to governance.

So, given Lakebase, what can you do differently?

Well, first, if you're trying to build a new app and you need a relational database, give Lakebase a try. If you want to serve data you have today, whether it's machine learning feature stores or the output of a simple data pipeline you built, give Lakebase a try. And if you have complex ETL pipelines to ingest data from a relational database into Databricks, which I know almost every customer does, give Lakebase a try; it will dramatically simplify your architecture.

So with that, I would like to invite Holly Smith onto the stage to give you a demo and show you what Lakebase is all about. I think Holly is going to come up on that side, so we'll slowly swap sides of the stage.

[Applause] Hello. I've been given the job of managing inventory levels for a drinks company: making any last-minute adjustments to stock, but also sharing data with analytics teams. This job is tricky at crunch time, but fortunately I have some new tools from Databricks to help me. Today I'll be sharing how Lakebase powers Databricks Apps, works at scale, and can use data from Delta tables, all in real time, all in one platform.

So I'm going to switch to my demo. In front of me I have a Databricks app that's bringing together both operational and analytical data. Whenever I change these filters, these are live queries; nothing is cached in the app. I can see orders here that I need to review, but I'm not going to do that. Instead, I'm getting some intense requests to place an order of 90 units of Cherry Burst. I'm going to select my store, click through the app, and go. The app will update Postgres in the back end, where the orders table sits.

Now I'm going to double-check that this has gone through correctly, so I'm switching to the brand-new Postgres SQL editor, natively integrated into the Databricks platform. I can tell it's Postgres because I've got this icon in the top right here, and when I select my compute, it's got this Postgres tag next to it. I'm also not going to do this on production, so I'm going to use dev, which is an isolated test branch. If I look at the left, I can see some Neon details. I'm also going to check the version in the settings, and I can see that yes, this is Postgres version 16.6. To query the data, I just query it like I would any other table, and when I run this I can see that yes, my order has gone through.

But that's just one record, and that's not very impressive. So earlier today I set up a little program to simulate lots of users, running in the background. I'm going to switch to the native performance monitoring in Databricks and head to this metrics tab here, and I can see how this has ramped up to thousands of transactions a second and tens of thousands of rows per second, with all sorts of interesting operation types happening. One thing this doesn't show is our app latency as it scales, but fortunately I built my little program so it can measure latency. So I'm just going to run a simulation and see how well we're doing. And oh, I can see my performance is almost 19,000 queries a second, with the median at 4.56 milliseconds and the 95th percentile at 5.6 milliseconds. I'm pretty pleased with that.
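For reference, the median and p95 figures quoted here are standard percentile summaries that can be computed from raw per-query latencies with Python's standard library (the sample numbers below are made up, not the demo's data):

```python
# Computing the latency summary the demo reports: the median and the
# 95th percentile over per-query latencies, in milliseconds.
import statistics

def latency_summary(latencies_ms):
    qs = statistics.quantiles(latencies_ms, n=100)  # qs[k-1] is the k-th percentile
    return {"p50": statistics.median(latencies_ms), "p95": qs[94]}

# Fabricated sample: most queries fast, with a slow tail.
sample = [4.2, 4.4, 4.5, 4.6, 4.7, 4.8, 5.0, 5.2, 5.9, 8.1]
s = latency_summary(sample)
assert abs(s["p50"] - 4.75) < 1e-9
assert s["p95"] > s["p50"]
```

p95 is the usual way to characterize tail latency: 95 percent of queries finish at or below that number.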

Thank you. Lakebase can also access data in Delta, and I can combine the two for snappier responses in the UI. So I head to the table in Unity Catalog, where I can create a synced table. I go ahead and name the table, select which instance I want to use (I want production), choose the database, and then the primary key. Now I can either run this as a one-off snapshot, have it triggered at a certain interval, or have it continuously updating. And I can either group it with other updating items in a pipeline or have it as a standalone.

Now that that's set up, when I go to it in Unity Catalog from the database view, I can see that I've got this little synced icon next to my table. And so now when I head back to my app, here it is: I head to my insights tab, and it's including data from a Delta table for my forecast. And of course, I can sync all of my orders data back to the lakehouse for historical analysis, and the other way around too.

So, in this demo, we've shown how Databricks is bringing operational and analytical data closer together, and I hope I've shown you how powerful this is when combined with other Databricks features like Apps, Delta, and Unity Catalog, and that it's inspired you to give it a go on your next project. As for me, I don't think I like being an inventory manager; I probably should have made this an agent. Back to you, Reynold.

Thank you, Holly. And again, only a virtual handshake.

So, at Data + AI Summit and other conferences, we announce products at various stages of maturity. Where is Lakebase? We just spent so much time talking to you about it. In the last year, we've actually been privately previewing Lakebase with hundreds of customers, and many of their logos are showing on the stage here, across a wide variety of industries; many of them are already running Lakebase in production. We're also very happy to have the following launch partners joining us to announce Lakebase, including catalog vendors, BI vendors, agentic coding platforms, and consulting services. But the best part is that Lakebase is available today. Not something coming later; starting today, in all of your Databricks workspaces, depending on which region you are in, you can either explicitly opt in or it's already on out of the box. It includes the full-blown, fully managed Postgres instance, all the lakehouse integrations, multicloud support, and HA/DR. And there's a lot more coming in the next few months.

So, just to summarize: Lakebase offers a fully managed Postgres instance. It comes with a novel architecture separating storage from compute, which enables the modern-day developer experience, both for humans and for AI agents. And more important than the Lakebase product announcement itself, we actually believe this is how databases should be built in the future, and our prediction is that every other transactional OLTP database will evolve toward this architecture in the coming years.