The Data Availability Landscape: from Trusted to Trust-Minimized | Mustafa Al-Bassam

20 minutes 44 seconds

Speaker 1

00:00:02 - 00:00:13

Hello everyone, my name is Mustafa. I'm a co-founder at Celestia, which is a modular data availability layer 1 blockchain. Today I'm going

Speaker 2

00:00:13 - 00:00:20

to present an overview of the landscape for data availability solutions in

Speaker 1

00:00:20 - 00:00:37

2023. But this talk isn't specifically about Celestia, because I'm going to try to give a neutral overview of the data availability landscape. But that being said, obviously, I'm biased. So you can take that how you will. But I will try to be neutral.

Speaker 2

00:00:39 - 00:00:44

So first I'm going to discuss what data availability is, as a brief overview.

Speaker 1

00:00:45 - 00:01:05

And then I'm going to discuss different approaches to data availability and what the trade-offs between these different solutions are. Because there isn't necessarily a one-size-fits-all solution; different applications might want different security or performance trade-offs.

Speaker 2

00:01:08 - 00:01:10

So what is data availability?

Speaker 1

00:01:10 - 00:01:37

Recall that all blockchains, including rollup chains, consist of two main components. The first component is the block header, and the second component is the actual transaction data in that chain; the block header commits to the transaction data, usually in the form of a transaction Merkle root. And the question of data availability asks,

Speaker 2

00:01:39 - 00:01:45

if you're a node or a program that only downloads the headers, how can

Speaker 1

00:01:45 - 00:02:11

you actually check that the transaction data the header points to was actually published by the producers of that block? Because the problem is that if they only publish the header but not the actual data, then no one knows what the transactions behind that block header actually are. And that can cause various problems.
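
To make that header-to-data commitment concrete, here is a minimal Python sketch of how a single Merkle root in a block header can commit to an entire transaction list. This is a toy illustration rather than any specific chain's format; real chains differ in hash function and tree rules (Bitcoin, for instance, double-hashes with SHA-256).

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transactions: list[bytes]) -> bytes:
    """Toy Merkle root: hash each transaction, then pair-and-hash upward."""
    assert transactions, "need at least one transaction"
    level = [sha256(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# The header only needs this 32-byte root to commit to all transactions;
# withholding the transactions themselves is exactly the data availability problem.
txs = [b"alice->bob:5", b"bob->carol:2", b"carol->dan:1"]
print(merkle_root(txs).hex())
```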

Speaker 2

00:02:12 - 00:02:23

For rollup chains specifically, for ZK rollups or validity rollups, data availability is important because if the sequencer doesn't publish

Speaker 1

00:02:25 - 00:02:51

the state changes or the transactions, then users won't actually know what their balances are. They won't know what the state of the chain is. And that effectively boils down to an attack where a sequencer can freeze people's funds and hold them hostage, which enables things like bribery attacks: it can freeze people's funds unless they pay, for example.

Speaker 2

00:02:52 - 00:02:56

For optimistic roll-ups, it's even more important because optimistic

Speaker 1

00:02:56 - 00:03:23

rollups rely on fraud proofs, and if the transaction data behind that rollup isn't available, then block producers and full nodes cannot generate a fraud proof when there's a malicious transaction inside the rollup block, because they can't see which transactions are bad.

Speaker 2

00:03:25 - 00:03:51

It's worth noting that there's a lot of confusion about what data availability really is. Data availability is not the same thing as data storage. Data storage is about making sure that the data is stored long term, but data availability is only about making sure that the data was even published in the first place

Speaker 1

00:03:51 - 00:03:56

so that storage providers actually have the opportunity to download that data

Speaker 2

00:03:56 - 00:04:04

and store it. So data availability is not concerned with long-term data storage. It's only concerned specifically with

Speaker 1

00:04:04 - 00:04:36

the question: how can we make sure the data was published, even just once, on the internet? So a better name for it might actually be data publication. For example, in Ethereum's EIP-4844, data is only guaranteed to be stored by nodes for 30 days and is then deleted, because the goal is only to make sure the data is published, so that people can generate fraud proofs, or users can download the data and know what their balances are. It's not about long-term storage.

Speaker 2

00:04:38 - 00:04:52

Now the interesting thing is that data availability wasn't really a question before rollups, because before rollups all blockchains worked the same way: you have full nodes, and full nodes download

Speaker 1

00:04:52 - 00:04:57

all the transaction data and that's how they know the data is available. Obviously if you have

Speaker 2

00:04:57 - 00:05:00

to download all the data anyway, then obviously

Speaker 1

00:05:00 - 00:05:52

you know the data was available, because you could download it. But that's obviously not scalable; it's not scalable to have a system where every node has to download every transaction. And that's why rollups exist: they effectively shard execution into different rollup chains, and the main chain or settlement layer does not have to execute all the transactions in the rollup chain. The question data availability asks is: how can light clients or smart contracts check that all the data for a rollup chain, or any specific application, was actually published and made available, without having to download and check all that data themselves?

Speaker 2

00:05:54 - 00:06:00

So I will give a brief overview of different data availability mechanisms

Speaker 1

00:06:01 - 00:06:04

and their trade-offs. There are, I guess,

Speaker 2

00:06:04 - 00:06:15

two buckets here: on-chain data availability and off-chain data availability. On-chain data availability is when data availability is guaranteed

Speaker 1

00:06:16 - 00:06:27

by the layer 1 network. So if data is unavailable, then the fork choice rule of that layer 1 network just rejects the block outright. So

Speaker 2

00:06:27 - 00:06:34

data availability is coupled with the actual chain. Off-chain data availability is the opposite: data availability

Speaker 1

00:06:35 - 00:06:45

is guaranteed off-chain, by some third-party chain or third-party service that is not directly connected to or part of the layer 1's consensus mechanism.

Speaker 2

00:06:46 - 00:06:53

So we just talked about the most obvious data availability solution, which is full nodes downloading all the data.

Speaker 1

00:06:54 - 00:07:07

As we mentioned, that's not scalable. The other on-chain data availability mechanism, which makes on-chain data availability more scalable, is called data availability sampling.

Speaker 2

00:07:07 - 00:07:10

I'm not going to go into the weeds of how it works.

Speaker 1

00:07:12 - 00:07:46

There's lots of material about it online, but the general principle is that block producers can commit to something called an erasure-coded version of the block, and that basically allows light nodes to download random chunks of the block. And when they download random chunks of the block, that effectively gives them an extremely high probability guarantee that 100% of the data is available, by only downloading a very small percentage of the data. And this uses Reed-Solomon encoding, which is similar technology to what's also used in STARKs.
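
To see where that "extremely high probability" comes from, here is a back-of-the-envelope sketch. It assumes a one-dimensional erasure code with rate 1/2, where an adversary must withhold more than half of the extended shares to make the block unrecoverable, so each uniformly random sample hits a withheld share with probability at least 1/2:

```python
# Chance that an unavailable block goes undetected after k random samples,
# assuming (rate-1/2 code) each sample independently misses a withheld
# share with probability at most 1/2.
def miss_probability(k: int) -> float:
    return 0.5 ** k

for k in (10, 20, 30):
    print(f"{k} samples -> undetected with probability <= {miss_probability(k):.2e}")
# 10 samples -> undetected with probability <= 9.77e-04
# 20 samples -> undetected with probability <= 9.54e-07
# 30 samples -> undetected with probability <= 9.31e-10
```

Two-dimensional codes, like the one Celestia uses, change the per-sample constant but keep the same exponential decay in the number of samples.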

Speaker 2

00:07:49 - 00:07:55

This is a new technology that isn't widely implemented yet.

Speaker 1

00:07:55 - 00:08:52

There are various protocols implementing this, including Celestia, Ethereum, and Polygon Avail. But there are several current challenges that people are researching. The first one is that in order for data availability sampling to be fully secure, you need some peer-to-peer mechanism for full nodes to reconstruct the block by downloading the samples from light nodes. So if light nodes collectively, let's say you have a thousand light nodes, downloaded enough samples, enough random chunks of the block, that they can collectively reconstruct the block if the block producer was trying to be malicious and hide the data, you need a peer-to-peer protocol for that. And that's quite challenging

Speaker 2

00:08:55 - 00:09:07

because it's a unique challenge compared to other peer-to-peer distribution protocols like BitTorrent or IPFS: lots of different nodes each have small pieces of the data, and

Speaker 1

00:09:07 - 00:09:09

they have to discover each other.
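
As a toy illustration of that collective-reconstruction requirement (made-up parameters, not any protocol's real numbers), here is a quick Monte Carlo estimate of how often a population of light nodes jointly samples enough distinct shares to rebuild a rate-1/2 erasure-coded block:

```python
import random

def collective_coverage(num_shares=512, needed=256, nodes=1000,
                        samples_per_node=2, trials=200):
    """Fraction of trials in which the light nodes together hold at least
    `needed` distinct shares, enough to reconstruct a rate-1/2 coded block."""
    successes = 0
    for _ in range(trials):
        seen = set()
        for _ in range(nodes):
            seen.update(random.sample(range(num_shares), samples_per_node))
        successes += len(seen) >= needed
    return successes / trials

print(collective_coverage())  # ~1.0: a thousand nodes sampling 2 shares each
                              # cover a 512-share block with ease
```

This is also why, as comes up in the Q&A later, a few hundred to a thousand light nodes can be enough to secure the scheme; the hard part is the peer-to-peer protocol for actually collecting those shares back from the nodes.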

Speaker 2

00:09:10 - 00:09:12

The second challenge is proving

Speaker 1

00:09:12 - 00:10:05

that the erasure code was constructed correctly. And there are two different ways of doing that right now. The first one is using what's called a KZG commitment scheme, which is a validity proof scheme: you can prove that the erasure code was constructed correctly. But the trade-off there is that the time to prove that something is inside the commitment, what's called the KZG opening, is quite slow. And the second way of doing it is fraud proofs, which make computing the erasure code and generating proofs faster, but the trade-off there is that light nodes have to wait a challenge period, usually a few minutes, before they can accept that a block is valid, because they have to wait to see if there are going to be any fraud proofs.
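
A minimal sketch of the light node's decision rule under the fraud-proof approach. The interface is hypothetical (`targets` and `verify` are stand-ins), and the two-minute window is just an example of "a few minutes":

```python
import time

CHALLENGE_PERIOD_SECS = 120  # protocol-specific; "a few minutes" in the talk

def block_status(header, received_at, fraud_proofs, now=None):
    """Reject on any valid fraud proof; otherwise accept only after the
    challenge period has elapsed with no proof observed."""
    now = time.time() if now is None else now
    if any(p.targets(header) and p.verify() for p in fraud_proofs):
        return "reject"
    if now - received_at < CHALLENGE_PERIOD_SECS:
        return "pending"  # still inside the challenge window
    return "accept"
```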

Speaker 2

00:10:05 - 00:10:40

And the third challenge, in my opinion the biggest challenge to data availability sampling, is light node adoption. Because data availability sampling is only completely secure if there are enough light nodes in the network that they can collectively download enough pieces of the block to reconstruct it. But right now, light node support for any chain other than Bitcoin is extremely poor. This is something that Bitcoin did extremely well. Like, you can

Speaker 1

00:10:40 - 00:10:50

download a Bitcoin wallet on your mobile phone that has about 5 million installs, and it's a Bitcoin light client that connects directly to the Bitcoin network.

Speaker 2

00:10:50 - 00:10:53

But we don't really have anything like that

Speaker 1

00:10:54 - 00:10:57

adopted for other chains like Ethereum or any chain that came after.

Speaker 2

00:10:58 - 00:11:12

So a big challenge there is: how do we move away from the MetaMask model, where the wallet is just connected to Infura? How can we embed light clients directly into users' wallets so that we have widespread

Speaker 1

00:11:14 - 00:11:20

light node adoption? And a part of the

Speaker 2

00:11:20 - 00:11:21

challenge there is how do

Speaker 1

00:11:21 - 00:11:23

you actually bootstrap the network?

Speaker 2

00:11:24 - 00:11:29

Because if there aren't enough light clients, then the scheme, as

Speaker 1

00:11:29 - 00:11:59

I said, doesn't have full security. It reduces to what's called a proof of data retrievability scheme, which still gives you some security, but not as high as a proof of data availability scheme. I have 3 minutes left, so I'm going to go to the off-chain section already. In terms of off-chain solutions, there are four different mechanisms. The first one is one where there's no guarantee at all.

Speaker 1

00:12:00 - 00:12:03

And that's actually kind of not bad for some use cases.

Speaker 2

00:12:03 - 00:12:20

Like, for example, if you have an NFT and the NFT data just points to some IPFS URL or URI, there is no availability guarantee, because there's no guarantee that the data behind the IPFS URI was ever actually published. But that's actually perfectly fine for NFTs, for example,

Speaker 1

00:12:20 - 00:12:23

because obviously before you buy the NFT, you can check

Speaker 2

00:12:23 - 00:12:27

whether the data of that NFT was actually published; otherwise the NFT would be worthless.
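
A toy version of that pre-purchase check, assuming for simplicity that the NFT commits to a plain SHA-256 hash of content served through an HTTP gateway. Real IPFS CIDs are multihashes, so a proper check would use a CID library rather than raw SHA-256:

```python
import hashlib
import urllib.request

def content_matches(gateway_url: str, expected_sha256_hex: str) -> bool:
    """Fetch the content behind the NFT's URI and compare its SHA-256
    digest against the expected commitment before buying."""
    with urllib.request.urlopen(gateway_url, timeout=10) as resp:
        data = resp.read()
    return hashlib.sha256(data).hexdigest() == expected_sha256_hex
```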

Speaker 1

00:12:30 - 00:12:35

Then there are solutions that have a single trusted third party. So you might have a single signature

Speaker 2

00:12:35 - 00:12:40

or a single party that attests to the availability of that data.

Speaker 1

00:12:43 - 00:12:52

And this is okay for some schemes like Plasma Cash or StarkWare's Adamantium, because in those schemes, if the data is

Speaker 2

00:12:52 - 00:13:07

not available and the single trusted third party lies, the users actually have the power to exit to the L1. But the trade-off there is that this doesn't really work for all use cases. It mainly works for payments.

Speaker 1

00:13:08 - 00:13:26

It's harder to make it work for generalized smart contracts or DeFi. Plasma Cash, for example, only works for payments. And that's why Plasma was abandoned in favor of rollups: in general, it's hard to design mass exit schemes

Speaker 2

00:13:26 - 00:13:30

for data unavailability for use cases other than just payments.

Speaker 1

00:13:32 - 00:13:56

There are also data availability committees. That's basically a data availability multisig scheme, where a set of people guarantee the availability of some data. An example of that is StarkEx Validiums. There's an honest majority assumption: there's a 7-of-10

Speaker 2

00:13:56 - 00:13:58

member DAC,

Speaker 1

00:13:58 - 00:14:36

and obviously most of them have to actually sign the data. Another example is Arbitrum AnyTrust, which makes a slightly different trade-off: 19 of 20 of the committee members have to sign the data. So you get a higher safety threshold, but the trade-off is that you have lower liveness guarantees, because you lose liveness if just a couple of committee members go down (see the threshold sketch below). And users can actually force transaction inclusion by submitting transactions on the L1, but if there's a mass exodus,

Speaker 2

00:14:36 - 00:14:37

it's not clear

Speaker 1

00:14:37 - 00:14:47

if the L1 can handle that kind of capacity. Finally, there's slashable committees, which are like data availability committees, but with some crypto-economic guarantees.
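
Here is the threshold sketch referenced above, contrasting the two committee configurations mentioned in the talk. The m-of-n framing (offline members tolerated versus colluders needed to lie) is the standard way to read these schemes:

```python
def dac_properties(members: int, threshold: int) -> dict:
    """For a committee of `members` requiring `threshold` signatures:
    liveness fails once more than members - threshold are offline, and a
    false attestation of availability needs `threshold` colluding signers."""
    return {
        "members": members,
        "signatures_required": threshold,
        "offline_members_tolerated": members - threshold,
        "colluders_needed_to_lie": threshold,
    }

print(dac_properties(10, 7))   # StarkEx-style 7-of-10: tolerates 3 offline
print(dac_properties(20, 19))  # AnyTrust-style 19-of-20: higher safety bar,
                               # but only one offline member is tolerated
```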

Speaker 2

00:14:51 - 00:14:58

So if the data availability committee lies about the data availability, then they can be slashed,

Speaker 1

00:14:58 - 00:15:00

and that gives you some extra crypto-economic guarantees.

Speaker 2

00:15:01 - 00:15:05

But you can't do this if the data availability committee is an L2.

Speaker 1

00:15:05 - 00:15:11

It has to be an independent L1, because you can't slash for data unavailability on-chain;

Speaker 2

00:15:14 - 00:15:21

you can only slash them if the data availability committee is an independent chain.

Speaker 1

00:15:23 - 00:15:27

So I'll quickly go through these and then I'll take questions. And an example of that is Celestium.

Speaker 2

00:15:29 - 00:15:41

So that's what we call an Ethereum L2 that uses Celestia for off-chain data availability. And Celestia can be slashed if data is unavailable, as detected by

Speaker 1

00:15:41 - 00:15:44

the light nodes, thanks to data availability sampling.

Speaker 2

00:15:46 - 00:15:52

And we can see that the off-chain data availability landscape is generally a wide trade-off space.

Speaker 1

00:15:52 - 00:15:55

Obviously, roll-ups are the most secure because they

Speaker 2

00:15:55 - 00:16:01

use on-chain data availability, but there's a trade-off between gas

Speaker 1

00:16:01 - 00:16:09

costs and security level, depending on what kind of trade-offs your application needs. Thanks.

Speaker 2

00:16:09 - 00:16:43

Any questions? Yeah, that's a good question. So the question is: do light nodes need incentives? And that's actually related to the challenge

Speaker 1

00:16:43 - 00:16:48

I brought up: how do we get wide adoption of light nodes?

Speaker 2

00:16:49 - 00:16:52

So in my opinion, light nodes don't need incentives because

Speaker 1

00:16:54 - 00:17:06

if you look at Bitcoin, for example, as I mentioned, there's a Bitcoin wallet you can download on Android that has 5 million installs, and it's a Bitcoin light node. Light nodes have extremely low resource requirements,

Speaker 2

00:17:06 - 00:17:12

so you can embed one into a wallet without

Speaker 1

00:17:12 - 00:17:17

the user even noticing, because the resource requirements are so low that

Speaker 2

00:17:19 - 00:17:25

to the user it's just a wallet. So to me the problem isn't really how you incentivize users to run

Speaker 1

00:17:25 - 00:17:29

it, because users will download wallets anyway; they download MetaMask, they download Bitcoin light wallets, and so on

Speaker 2

00:17:29 - 00:17:38

and so forth. The question is: how do you embed a light node into MetaMask, given that it's a browser extension? Or how do you encourage people

Speaker 1

00:17:38 - 00:17:53

to download desktop wallets, for example? Yeah, but then you're still relying on whoever's designing this wallet to embed this light node in there voluntarily, right? Which is fine, but it's not guaranteed in any way.

Speaker 2

00:17:53 - 00:18:14

Yeah, well, the first thing to note is that light nodes are something that benefits the user from a security perspective. But secondly, to guarantee the security of the data availability scheme, you only need a small number of light nodes for all the light nodes to have security.

Speaker 1

00:18:14 - 00:18:50

So for example, we're talking about a few hundred light nodes. As long as you have a minimum threshold of light nodes, that guarantees the full security of the scheme. So that only needs a few hundred, or maybe a thousand, light nodes; we don't necessarily need millions. So in my opinion, that's not really difficult to get, as long as there's the option to run one. Sorry, can you speak louder?

Speaker 1

00:18:50 - 00:19:03

Yeah. So the question is about EigenLayer, which separates data availability into its own layer. Is the question what I think of protocols like EigenLayer which separate data availability into

Speaker 2

00:19:03 - 00:19:05

a separate module? Yeah, so I guess

Speaker 1

00:19:05 - 00:19:18

the question is about separating consensus from execution. Sorry, separating consensus from data availability. Well, actually, what EigenLayer does is they separate

Speaker 2

00:19:18 - 00:19:37

consensus from peer-to-peer networking. But I don't necessarily think you need to separate that in order to get the benefits that EigenLayer claims. The main reason why EigenLayer

Speaker 1

00:19:37 - 00:19:40

claims higher throughput is basically because

Speaker 2

00:19:43 - 00:19:55

Yeah, it's not because they don't do consensus, but because they use KZGs, which means they claim that the node operators don't have to download;

Speaker 1

00:19:55 - 00:20:07

they can do sampling instead of downloading the data. So it's more about using KZGs than about separating data availability from consensus. But the main kind

Speaker 2

00:20:07 - 00:20:14

of trade-off with EigenLayer is that they use ETH restaking, but you can't slash the ETH, because

Speaker 1

00:20:14 - 00:20:16

you can't slash for data unavailability on

Speaker 2

00:20:16 - 00:20:25

the chain. So the ETH doesn't actually contribute to the security. What they have instead is a dual staking mechanism where the EigenLayer token is slashed

Speaker 1

00:20:27 - 00:20:44

if the data is unavailable. Alright, I'm out of time, so thank you, everyone.