to talk more about this and what we've been doing at Databricks for the last two years to build Lakebase, I want to welcome my co-founder, Reynold Xin, to the stage. [Music]
[Applause] Thank you, Ali. We were actually supposed to shake hands, but given that you walked all the way over there and I'm supposed to come over here, I think it would be a little bit difficult. So, a virtual handshake.
As Ali said, Databricks in the last decade has mostly been focusing on the analytics side of data infrastructure. And if you look at analytics systems today, whether you use Databricks or other vendors, they look remarkably different from what they were in the '90s. A lot of foundational technology has been invented: columnar storage and vectorized processing have dramatically sped up analytical workloads; streaming, invented over a decade ago, has made data a lot fresher; and of course, in 2020, five years ago, we published the lakehouse blog and pioneered a new architecture that decoupled storage and compute and, more importantly, was based on open formats. That has enabled a lot of new workloads and dramatically lowered the TCO of analytical systems.

Now, OLTP databases, however, are kind of stuck in the past. What do we mean by that? If you look at the OLTP databases running today, whether commercial proprietary systems like Oracle or open-source databases like MySQL and Postgres, they look more or less the same as they did in the '90s. Yes, we've added a lot of features, and they have gotten faster. But if you look at the components, the techniques, and the foundational ideas, they're more or less the same; you can even trace them all the way back to the systems papers of the '70s. Databases are viewed as this heavyweight infrastructure that requires a lot of manual intervention and maintenance, and it's quite clunky. First, databases are very slow to provision and difficult to scale; if your workloads are fairly dynamic, dealing with that on the database side is often a nightmare. Because of that, databases are fairly disconnected from modern-day developer workflows, which we'll zoom into in a little bit. They're also very siloed from analytics and AI: it's actually not unusual these days to want to combine analytics and AI with your transactional database workloads, but it's very difficult to do.

So what do I mean by databases being disconnected from modern-day developer workflows?
Well, imagine you're a software engineer trying to add a new feature to a codebase. The very first thing you'll likely do is run the following command: git checkout -b (or maybe click the equivalent in the UI). What it does is create a new branch of your codebase. You'll make changes to this new code branch, adding a feature, maybe fixing a bug, and testing against it, but all the changes you make are isolated to this specific branch. And creating a new branch is an instant operation. It's very, very fast. You don't have to think twice about it. You just do it.

What's the equivalent for databases? If you want to clone your production database, it might take days. You put one up, and you almost never shut it down. Something like this simply doesn't exist. Wouldn't it be nice if you could branch off a database just like you do with code?
Now, let's say you get past all that development hassle and you manage to build a pretty successful app, as many of you have. The app has taken off, and you want to add some analytics or AI capabilities to it. So your app development team starts talking to your data infrastructure team, the one managing the lakehouse, and you say, "Hey, how do I actually get the data from one side to the other?" Now you have to figure out how to manage two disparate systems. You have to understand the IAM role differences. How do you set up secure networking? How do you create ETL pipelines and load data from one to the other? You learn fancy terms like change data capture, SCD type 1, and SCD type 2 (slowly changing dimensions), which to this day I still don't fully understand.
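For the curious, here is a minimal sketch of the kind of pipeline step being referred to: an SCD type 2 upsert that closes out the current row and inserts a new version whenever a tracked value changes. The table, columns, and connection string are hypothetical, and psycopg 3 is assumed as the driver.

# Hypothetical SCD type 2 step: close the current dimension row if the
# address changed, then insert a new current version. Names are invented.
import psycopg

CLOSE_OLD = """
UPDATE dim_customer
   SET valid_to = now(), is_current = false
 WHERE customer_id = %(id)s AND is_current AND address <> %(address)s
"""
INSERT_NEW = """
INSERT INTO dim_customer (customer_id, address, valid_from, is_current)
SELECT %(id)s, %(address)s, now(), true
 WHERE NOT EXISTS (SELECT 1 FROM dim_customer
                    WHERE customer_id = %(id)s AND is_current)
"""

params = {"id": 42, "address": "1 Main St"}
with psycopg.connect("postgresql://localhost/warehouse") as conn:  # placeholder DSN
    conn.execute(CLOSE_OLD, params)   # retire the old version, if it changed
    conn.execute(INSERT_NEW, params)  # write the new current version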
It all just seems awfully complicated. So, in the past couple of years at Databricks, we've been working on how to tackle this problem and eliminate all of these challenges, and the result is Lakebase. Lakebase has the following attributes. First and foremost, it's based on open-source Postgres. Second, it's built on a novel architecture that decouples storage from compute, which enables the modern-day developer workflow. And by building on top of Databricks infrastructure, it comes with what you would expect: amazing lakehouse integration as well as all the enterprise-readiness features. Now let's talk about these one at a time.
First and foremost, Lakebase is built on open-source standards, namely open-source Postgres. In the last few years, open-source Postgres has been steadily on the rise: if you look at the latest Stack Overflow survey of the most popular databases, Postgres leads by a wide margin. This is because of its robust ecosystem of tools, libraries, and extensions, and all of this just works out of the box on Lakebase. And Lakebase can guarantee you single-digit millisecond latency at scale.
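Because Lakebase speaks standard Postgres, any stock driver should connect unchanged; here is a minimal sketch using the psycopg 3 driver, with a placeholder connection string.

# A stock Postgres driver connecting as it would to any Postgres server.
# The connection string is a placeholder, not a real Lakebase endpoint.
import psycopg

with psycopg.connect("postgresql://user:pass@lakebase.example.com/appdb") as conn:
    version = conn.execute("SELECT version()").fetchone()[0]
    print(version)  # plain Postgres underneath, e.g. "PostgreSQL 16.x ..."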
The second most important part of Lakebase is that it's built on a novel architecture that separates storage from compute, and there are actually three layers to this architecture. At the very bottom, we use data lakes, or object stores, to store the actual physical data. Object stores are the cheapest storage medium you can find, and they're extremely reliable at scale. Now, one of the challenges that Ali referred to is that object stores were not exactly designed for the type of workloads OLTP databases need: a 100-millisecond query is plenty fast in a lakehouse for analytics, but 100 milliseconds is unacceptable for OLTP workloads, where we need single-digit millisecond latency. The way we've solved that is by introducing a middle storage layer that holds only soft state and acts as a write-through cache in front of the object stores. For those of you who are database nerds, it also provides a new way to very quickly persist the write-ahead log, or WAL, of the database. And on top of the storage layer, we have the ephemeral compute nodes, which are Postgres instances that read from and write to the underlying storage layer.
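To make the middle layer's role concrete, here is a minimal sketch of the write-through-cache idea it implements; this is an illustration of the concept only, not Lakebase's actual code.

# A write-through cache over a slow, durable backing store: every write
# goes to both, so the cache holds only soft state that is safe to lose.
from typing import Dict

class SlowObjectStore:
    """Stands in for S3/GCS/ADLS: durable but high-latency."""
    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
    def put(self, key: str, value: bytes) -> None:
        self._blobs[key] = value          # imagine ~100 ms per call
    def get(self, key: str) -> bytes:
        return self._blobs[key]

class WriteThroughCache:
    def __init__(self, store: SlowObjectStore) -> None:
        self._store = store
        self._cache: Dict[str, bytes] = {}
    def put(self, key: str, value: bytes) -> None:
        self._cache[key] = value          # fast path for subsequent reads
        self._store.put(key, value)       # durability comes from the store
    def get(self, key: str) -> bytes:
        if key not in self._cache:        # miss: fall back to the object store
            self._cache[key] = self._store.get(key)
        return self._cache[key]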
One thing that's very important for me to point out: very similar to lakehouses, the actual data stored in the object stores, the data lakes, is in open formats. It's not some proprietary format we invented to improve performance or to lock you in; it's just vanilla Postgres pages. And this opens up a whole new paradigm of opportunities.
There have actually been a lot of past attempts at this problem. This is not the first time the industry has tried to build a separation-of-storage-from-compute architecture for OLTP databases; some commercial systems, especially from the hyperscalers, have done it. But they were typically built on yet another proprietary storage system, they are way more expensive, and they don't use open formats, which means they can lock you in even more. So how did we crack the code here? Some of you might recognize this architectural diagram: Databricks acquired the company behind it, Neon, just last month, and many of you have already been talking to us about it. Now, we did acquire Neon; we announced the acquisition last month and only closed it last week. But one of the interesting things that's not widely known is that we actually invested in Neon many years ago and have been working with the Neon team as a technology partner on this separation of storage from compute, and everything in Lakebase builds on that collaboration.
Building on top of this novel architecture, we managed to accomplish a few fairly interesting things. The first is serverless. What do we mean by serverless? Well, earlier we said databases are this heavyweight infrastructure that requires a lot of manual maintenance and intervention; serverless here means databases become lightweight. What does lightweight mean? Lakebase comes in two flavors. The first is a provisioned-throughput flavor, where you specify exactly how big you want the database to be; if you know how to size your workload, that's the perfect solution for you. But for most people, the autoscaling flavor will be far more interesting. In the autoscaling flavor, you don't have to worry about how big a database to pick. Because the databases are just ephemeral instances, you launch one only when you need it, and it takes less than a second to launch a brand-new database. If your load scales up, you can either scale vertically, which the system will do for you, or choose to create read replicas, which also come up in less than a second. And if your load goes down and you no longer have any, say you're a very America-centric company with no load past 5:00 PM, the database can automatically shut down very quickly. All of this happens in less than a second. And the best part is you only pay for the duration you actually need the compute.
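As a back-of-the-envelope illustration of paying only for the duration you need, here is a toy comparison; the hourly rate and usage pattern are invented numbers, not actual Lakebase pricing.

# Toy scale-to-zero economics. The rate and hours below are made up for
# illustration; they are not Databricks pricing.
RATE_PER_HOUR = 0.50            # hypothetical $/hour while the instance is up

always_on_hours = 24 * 30       # an instance left running for a 30-day month
business_hours = 9 * 22         # ~9 h/day for 22 weekdays, scaled to zero otherwise

print(f"always-on:  ${always_on_hours * RATE_PER_HOUR:,.2f}/month")   # $360.00
print(f"autoscaled: ${business_hours * RATE_PER_HOUR:,.2f}/month")    # $99.00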
The second thing we built is branching. We talked earlier about how difficult it is to branch off a database and apply modern software development practices to actual databases: it's very easy with code but very difficult with databases. The separation-of-storage-from-compute architecture has a copy-on-write capability built in, so we can instantly branch off a database. It takes less than a second to create a complete clone of a database, including both its data and its schema. And because of copy-on-write, you don't have to pay for extra storage unless you start making changes, and only the changes themselves incur extra charges, because under the hood the branches all share the same storage.

Something pretty magical happens when you combine the branching capability and the serverless capability: it completely changes the way you think about database development. Every time you do git checkout -b, you can automatically branch off a database along with that new branch of code and keep the two perfectly in sync while making schema changes. If you don't like your new code branch and whatever changes it made to the database, you just kill both the code branch and the database branch. You pay next to nothing, just like how you pay for your code repository.
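As a sketch of what that combined workflow could look like, here is a hypothetical helper; create_database_branch and its endpoint are invented for illustration and are not an actual Lakebase or Neon API, though the git command is standard.

# Branch-per-feature workflow: one code branch, one database branch.
# create_database_branch() is hypothetical; only the git call is real.
import subprocess

def create_database_branch(parent: str, name: str) -> str:
    """Stand-in for a copy-on-write branching API. In a real system this is
    a fast metadata operation: the branch shares all pages with its parent
    until they diverge, so it costs almost nothing until you write."""
    print(f"[hypothetical] branching database {parent!r} -> {name!r}")
    return f"postgresql://lakebase.example.com/{name}"

feature = "add-loyalty-points"
subprocess.run(["git", "checkout", "-b", feature], check=True)  # code branch
db_url = create_database_branch(parent="prod", name=feature)    # matching data branch
print(f"Run migrations and tests against {db_url}; drop both branches if it doesn't pan out.")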
And this is extra important in the age of agentic coding and AI. One way to think about AI agents is that you're getting, at very low cost, armies of thousands, or in the extreme case maybe even millions, of AI agents, each acting as its own individual engineer, running experiments on your codebase and adding new features. You might even have multiple AI agents implementing the same feature, with judges determining which implementation is best. Now every AI agent can get not just its own code branch but also its own database, at virtually no cost, for experimentation.

The separation of storage and compute, especially with open formats in the underlying storage layer, also makes it super easy to synchronize data at very high throughput from one object store to another, from one data lake to another, from the lakehouse to Lakebase. Many of you who are existing Databricks customers probably expect this out of the box, given what we do. You can publish any table in the lakehouse into Lakebase for real-time serving, to get single-digit millisecond latency, and you can also do the reverse: you can very easily get data from Lakebase directly into the lakehouse, managed by Unity Catalog, with just a few clicks.
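Purely as an illustration of that publish step, here is a hypothetical sketch; the publish_table helper and its arguments are invented, and the demo later does the same thing through the Unity Catalog UI.

# Hypothetical helper for publishing a lakehouse table into Lakebase for
# low-latency serving. Invented for illustration; not a real Databricks API.
def publish_table(uc_table: str, instance: str, primary_key: str,
                  mode: str = "continuous") -> None:
    """Pretend to register a synced table replicating a Unity Catalog Delta
    table into a Lakebase Postgres instance."""
    print(f"[hypothetical] syncing {uc_table} -> {instance} "
          f"(pk={primary_key}, mode={mode})")

# e.g. keep a forecast table fresh for an app's insights view:
publish_table("main.sales.demand_forecast", instance="production",
              primary_key="sku", mode="continuous")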
And of course, by building on top of Databricks infrastructure, Lakebase is enterprise-ready. It comes with all the bells and whistles you expect, from security to compliance to governance.
So, given Lakebase, what can you do differently? Well, first, if you're building a new app and you need a relational database, give Lakebase a try. If you have data today that you want to serve, whether from machine learning feature stores or from simple data pipelines you've built, give Lakebase a try. And if you have complex ETL pipelines ingesting data from a relational database into Databricks, which I know almost every customer does, give Lakebase a try; it'll dramatically simplify your architecture.
So with that, I'd like to invite Holly Smith onto the stage to give you a demo and show you what Lakebase looks like. I think Holly is going to come up on that side, so we'll slowly swap sides of the stage.
[Applause] Hello. I've been given the job of managing inventory levels for a drinks company: making any last-minute adjustments to stock, but also sharing data with analytics teams. This job is tricky at crunch time, but fortunately I have some new tools from Databricks to help me. Today I'll be sharing how Lakebase powers Databricks Apps, works at scale, and can use data from Delta tables, all in real time, all in one platform.

So I'm going to switch to my demo. In front of me I have a Databricks app that brings together both operational and analytical data. Whenever I change these filters, these are live queries; nothing is cached in the app. I can see orders here that I need to review, but I'm not going to do that. Instead, I'm getting some urgent requests to place an order of 90 units of Cherry Burst. I'm going to select my store, click through the app, and go. The app will update Postgres on the back end, where the orders table sits.
Now I'm going to double-check that this has gone through correctly, and I'm switching to the brand-new Postgres SQL editor, natively integrated into the Databricks platform. I can tell it's Postgres because I've got this icon at the top right here, and when I select my compute, it has this Postgres tag next to it. I'm also not going to do this on production, so I'm going to use dev, which is an isolated test branch. If I look at the left, I can see some Neon details. I'm also going to check the version in the settings, and I can see that, yes, this is Postgres version 16.6. To query the data, I just query it like I would any other table, and when I run this I can see that, yes, my order has gone through.

But that's just one record, and that's not very impressive. So earlier today I set up a little program, running in the background, to simulate lots of users. I'm going to switch to the native performance monitoring in Databricks and head to this metrics tab here, and I can see how this has ramped up to thousands of transactions a second and tens of thousands of rows per second, with all sorts of interesting operation types happening. One thing this doesn't show is our app latency as it scales, but fortunately I built my little program so it can measure latency. So I'm just going to run a simulation and see how well we're doing. And oh, I can see my performance is almost 19,000 queries a second, with the median at 4.56 milliseconds and the 95th percentile at 5.6 milliseconds. So I'm pretty pleased with that.
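A load-measuring loop like the one Holly describes could look roughly like this minimal sketch; the connection string, table, and psycopg 3 as the driver are assumptions for the example.

# Measure query latency percentiles against Postgres, in the spirit of the
# demo's simulation. Connection string and table name are placeholders.
import statistics
import time
import psycopg

N_QUERIES = 1_000
latencies_ms = []

with psycopg.connect("postgresql://user:pass@lakebase.example.com/shop") as conn:
    for _ in range(N_QUERIES):
        start = time.perf_counter()
        conn.execute("SELECT count(*) FROM orders WHERE store_id = %s", (7,))
        latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"median: {statistics.median(latencies_ms):.2f} ms")
print(f"p95:    {statistics.quantiles(latencies_ms, n=20)[18]:.2f} ms")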
Thank you. Lakebase can also access data in Delta, and I can combine the two for snappier responses in the UI. I head to the table in Unity Catalog and create a synced table: I name the table, select which instance I want to use (production), choose the database, and set the primary key. Now I can run this as a one-off snapshot, have it triggered at a certain interval, or have it continuously updating, and I can either group it with other updating items in a pipeline or keep it standalone.
Now that that's set up, when I go to it in Unity Catalog from the database view, I can see that I've got this little synced icon next to my table. So now when I head back to my app, here it is: in my insights tab, I've got data from a Delta table feeding my forecast. And of course, I can sync all of my orders data back to the lakehouse for historical analysis, and the other way around too. So in this demo, we've shown how Databricks is bringing operational and analytical data closer together, and I hope I've shown you how powerful this is when combined with other Databricks features like Apps, Delta, and Unity Catalog, and that it's inspired you to give it a go on your next project. As for me, I don't think I like being an inventory manager; I probably should have made this an agent. Back to you, Reynold.

Thank you, Holly. And again, only a virtual handshake.
So, at Data + AI Summit and other conferences, we announce products at various stages of maturity. Where is Lakebase, which we just spent so much time talking about? Over the last year, we've been running a private preview of Lakebase with hundreds of customers, many of whose logos are showing on the stage here, across a wide variety of industries, and many of them are running Lakebase in production. We're also very happy to have the following launch partners joining us to announce Lakebase, including catalog vendors, BI vendors, agentic coding platforms, and consulting services. But the best part is that Lakebase is available today. Not something coming later: starting today, in all of your Databricks workspaces, depending on which region you're in, you can either explicitly opt in or it's already on out of the box for you. It includes the full-blown, fully managed Postgres instance, all the lakehouse integrations, multi-cloud support, and high availability and disaster recovery. And there are a lot more new features coming in the coming months.
So, just to summarize: Lakebase offers a fully managed Postgres instance. It comes with a novel separation-of-storage-from-compute architecture that enables the modern-day developer experience, both for humans and for AI agents. And more important than the Lakebase product announcement itself, we believe this is how databases should be built in the future. Our prediction is that every other transactional OLTP database will evolve toward this architecture in the coming years.