What is a Data Lake?
Introduction
In short, a Data Lake allows you to define an on-chain dataset and select a function to aggregate over it.
A Data Lake is a structured format used to describe a set of data points stored on-chain, somewhere in the blockchain's history. Essentially, it is one of the task types for the data we want to run computations on. Because there are different ways to describe on-chain data, different data lakes are available.
BlockSampled Data Lake
The BlockSampled Data Lake is used to extract a specific data point over a range of blocks. The range of headers is defined, along with the data point that should be extracted for each block in the range. Data points can be extracted from the block header, an account, or a smart contract storage variable.
Structure:
- blockRangeStart: Start block
- blockRangeEnd: End block (inclusive)
- sampledProperty: Specifies the exact field to sample. The following are available:
  - header: Samples a field from a block header
    Available fields: ParentHash, OmmerHash, Beneficiary, StateRoot, TransactionsRoot, ReceiptsRoot, LogsBloom, Difficulty, Number, GasLimit, GasUsed, Timestamp, ExtraData, MixHash, Nonce, BaseFeePerGas, WithdrawalsRoot, BlobGasUsed, ExcessBlobGas, ParentBeaconBlockRoot
  - account: Samples a field from a specific account
    Available fields: Nonce, Balance, StorageRoot, CodeHash
  - storage: Samples a storage variable of a smart contract via its address and storage slot
- increment: Incremental step over the range from blockRangeStart to blockRangeEnd
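The fields above can be modeled as a plain data structure. The sketch below is illustrative only: the class and field names are assumptions for this example, not the actual Data Processor API, but it shows how blockRangeStart, blockRangeEnd, and increment together determine which blocks are sampled.

```python
from dataclasses import dataclass

@dataclass
class BlockSampledDataLake:
    """Illustrative model of a BlockSampled Data Lake definition (not the real API)."""
    block_range_start: int
    block_range_end: int   # inclusive
    increment: int
    sampled_property: str  # e.g. "header.BaseFeePerGas"

    def sampled_blocks(self) -> list[int]:
        # Blocks whose property will be extracted: start to end (inclusive),
        # stepping by `increment`.
        return list(range(self.block_range_start, self.block_range_end + 1, self.increment))

# Example: sample BaseFeePerGas from every 10th header in a 101-block range.
lake = BlockSampledDataLake(
    block_range_start=1_000_000,
    block_range_end=1_000_100,
    increment=10,
    sampled_property="header.BaseFeePerGas",
)
print(lake.sampled_blocks()[:3])  # [1000000, 1000010, 1000020]
```

Note that because blockRangeEnd is inclusive, the range above yields 11 sampled blocks, not 10.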
TransactionsInBlock Data Lake
The TransactionsInBlock Data Lake is used to query a specific field from all transactions or transaction receipts of a specific block.
Structure:
- targetBlock: The block number from which transactions are sampled
- startIndex: The starting index of transactions within the block
- endIndex: The ending index of transactions within the block
- increment: Incremental step over the range from startIndex to endIndex
- includedTypes: The transaction types to include in the query
  Available types: Legacy, EIP2930, EIP1559, EIP4844
- sampledProperty: The field to sample. The available fields depend on the transaction type, so use the includedTypes filter to prevent unavailable-field errors.
  For transaction receipts: Success, CumulativeGasUsed, Logs, Bloom
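To make the interaction between the index range, the increment, and the type filter concrete, here is a hedged sketch. All names are hypothetical, chosen for this example rather than taken from the Data Processor itself; the point is the selection logic: a transaction is sampled only if its index lies in [startIndex, endIndex], falls on the increment stride, and its type is in includedTypes.

```python
from dataclasses import dataclass

@dataclass
class TransactionsInBlockDataLake:
    """Illustrative model of a TransactionsInBlock Data Lake (not the real API)."""
    target_block: int
    start_index: int
    end_index: int  # inclusive
    increment: int
    included_types: set[str]  # subset of {"Legacy", "EIP2930", "EIP1559", "EIP4844"}
    sampled_property: str

    def selects(self, tx_index: int, tx_type: str) -> bool:
        # Sampled iff the index is in range, on the stride, and the type passes.
        in_range = self.start_index <= tx_index <= self.end_index
        on_stride = (tx_index - self.start_index) % self.increment == 0
        return in_range and on_stride and tx_type in self.included_types

# Example: every 3rd transaction among indices 0..10, EIP-1559 and legacy only.
lake = TransactionsInBlockDataLake(
    target_block=19_000_000,
    start_index=0,
    end_index=10,
    increment=3,
    included_types={"Legacy", "EIP1559"},
    sampled_property="receipt.CumulativeGasUsed",
)
print(lake.selects(3, "EIP1559"))  # True
print(lake.selects(4, "EIP1559"))  # False: index 4 is off the stride
print(lake.selects(3, "EIP4844"))  # False: type is filtered out
```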
Aggregation Functions
To define a computation we want to run on the specified data, we need to select an aggregation function. This function will then be run over all of the extracted data points defined in the data lake.
Available Aggregation Functions:
- AVG: Calculates the average of a list of values.
- MAX: Finds the largest value in a list.
- MIN: Finds the smallest value in a list.
- SUM: Computes the sum of a list of values.
- COUNT_IF: Takes an additional parameter, context, that encodes the counting condition. The conditions are:
  - “00” → Equality (==)
  - “01” → Inequality (≠)
  - “02” → Greater than (>)
  - “03” → Greater than or equal to (≥)
  - “04” → Less than (<)
  - “05” → Less than or equal to (≤)
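The COUNT_IF condition codes map directly onto comparison operators. The following sketch shows that mapping and what the function computes over a list of extracted values; it is a local illustration of the semantics, not the on-chain implementation.

```python
import operator

# COUNT_IF condition codes, as listed above, mapped to comparison operators.
CONDITIONS = {
    "00": operator.eq,  # equality
    "01": operator.ne,  # inequality
    "02": operator.gt,  # greater than
    "03": operator.ge,  # greater than or equal to
    "04": operator.lt,  # less than
    "05": operator.le,  # less than or equal to
}

def count_if(values: list[int], condition_code: str, threshold: int) -> int:
    """Count the values that satisfy the encoded condition against `threshold`."""
    cmp = CONDITIONS[condition_code]
    return sum(1 for v in values if cmp(v, threshold))

# Example: count values strictly greater than 10 (condition "02").
print(count_if([5, 12, 10, 30], "02", 10))  # 2
```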