What is a Data Lake?


Introduction

In short, a Data Lake allows you to define an on-chain dataset and select a function to aggregate over it.

A Data Lake is a structured format for describing a set of data points that are stored on-chain, somewhere in the blockchain's history. Essentially, it is one of the task types for the data we want to run computations on. As there are different ways to describe on-chain data, there are different data lakes available.

BlockSampled Data Lake

The BlockSampled Data Lake is used to extract a specific data point over a range of blocks. A block range is defined, along with the data point that should be extracted for each block in the range. Data points can be extracted from the block header, an account, or a smart contract storage variable. A sketch of the structure follows the field list below.

Structure:

  • blockRangeStart: Start block

  • blockRangeEnd: End block (inclusive)

  • sampledProperty: Specifies the exact field to sample. The following are available:

    • header: Samples a field from the block header

      • Available fields: ParentHash, OmmerHash, Beneficiary, StateRoot, TransactionsRoot, ReceiptsRoot, LogsBloom, Difficulty, Number, GasLimit, GasUsed, Timestamp, ExtraData, MixHash, Nonce, BaseFeePerGas, WithdrawalsRoot, BlobGasUsed, ExcessBlobGas, ParentBeaconBlockRoot

    • account: Samples a field from a specific account

      • Available fields: Nonce, Balance, StorageRoot, CodeHash

    • storage: Samples a smart contract storage variable, identified by contract address and storage slot

  • increment: Incremental step over the range from blockRangeStart to blockRangeEnd
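
To make the layout concrete, here is a minimal sketch of a BlockSampled query as a TypeScript object. The interface and the "account.<address>.balance" property encoding are illustrative assumptions, not the Data Processor's actual API:

```typescript
// Hypothetical shape of a BlockSampled Data Lake. Field names mirror the
// list above; the type itself is illustrative, not an official SDK type.
interface BlockSampledDataLake {
  blockRangeStart: number; // first block of the range
  blockRangeEnd: number;   // last block of the range (inclusive)
  increment: number;       // step between sampled blocks
  sampledProperty: string; // which field to extract at each sampled block
}

// Example: sample an account's balance at every 10th block in the range.
// The property encoding below is an assumption for illustration only.
const balanceSamples: BlockSampledDataLake = {
  blockRangeStart: 5_000_000,
  blockRangeEnd: 5_000_100,
  increment: 10,
  sampledProperty: "account.0x0000000000000000000000000000000000000001.balance",
};
```

With increment set to 10, this query yields one balance per 10 blocks, i.e. 11 data points across the 100-block range.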

TransactionsInBlock Data Lake

The TransactionsInBlock Data Lake is used to query a chosen field from all transactions, or all transaction receipts, of a single block. A sketch of the structure follows the field list below.

Structure:

  • targetBlock: The specific block number from which transactions are being sampled

  • startIndex: The starting index of transactions within the block

  • endIndex: The ending index of transactions within the block

  • increment: Incremental step over the range from startIndex to endIndex

  • includedTypes: The transaction types to include in the query

    • Available types: Legacy, EIP2930, EIP1559, EIP4844

  • sampledProperty: The available fields depend on the transaction type

    • Use the includedTypes filter to prevent unavailable-field errors; the table below shows which fields each transaction type supports:

    Transaction Field          Legacy   EIP2930   EIP1559   EIP4844
    NONCE                        ✓         ✓         ✓         ✓
    GAS_PRICE                    ✓         ✓         ✗         ✗
    GAS_LIMIT                    ✓         ✓         ✓         ✓
    RECEIVER                     ✓         ✓         ✓         ✓
    VALUE                        ✓         ✓         ✓         ✓
    INPUT                        ✓         ✓         ✓         ✓
    V                            ✓         ✓         ✓         ✓
    R                            ✓         ✓         ✓         ✓
    S                            ✓         ✓         ✓         ✓
    CHAIN_ID                     ✗         ✓         ✓         ✓
    ACCESS_LIST                  ✗         ✓         ✓         ✓
    MAX_FEE_PER_GAS              ✗         ✗         ✓         ✓
    MAX_PRIORITY_FEE_PER_GAS     ✗         ✗         ✓         ✓
    BLOB_VERSIONED_HASHES        ✗         ✗         ✗         ✓
    MAX_FEE_PER_BLOB_GAS         ✗         ✗         ✗         ✓

    • For transaction receipts: Success, CumulativeGasUsed, Logs, Bloom
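
As with BlockSampled, here is a minimal sketch of the structure with hypothetical type names. It shows the includedTypes filter guarding a field that only exists on fee-market transactions:

```typescript
// Hypothetical shape of a TransactionsInBlock Data Lake (illustrative names).
type TransactionType = "Legacy" | "EIP2930" | "EIP1559" | "EIP4844";

interface TransactionsInBlockDataLake {
  targetBlock: number;              // block whose transactions are sampled
  startIndex: number;               // first transaction index to sample
  endIndex: number;                 // last transaction index to sample
  increment: number;                // step between sampled indices
  includedTypes: TransactionType[]; // transaction types kept by the filter
  sampledProperty: string;          // field to read from each transaction
}

// Example: read MAX_FEE_PER_GAS from the first 100 transactions of a block.
// The field exists only on EIP-1559 and EIP-4844 transactions, so the type
// filter excludes Legacy and EIP-2930 to avoid unavailable-field errors.
const maxFeeQuery: TransactionsInBlockDataLake = {
  targetBlock: 6_000_000,
  startIndex: 0,
  endIndex: 99,
  increment: 1,
  includedTypes: ["EIP1559", "EIP4844"],
  sampledProperty: "MAX_FEE_PER_GAS",
};
```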

Aggregation Functions

To define the computation we want to run on the specified data, we need to select an aggregation function. This function is then run over all of the data points extracted by the data lake.
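
Concretely, each function reduces the list of extracted values to a single result. A minimal local sketch of the semantics of the functions listed below (the Data Processor itself executes and proves these computations; this is purely illustrative):

```typescript
type AggregateFn = "AVG" | "MAX" | "MIN" | "SUM";

// Reduce the extracted values to one result. Assumes a non-empty list.
function aggregate(values: bigint[], fn: AggregateFn): bigint {
  const sum = values.reduce((a, b) => a + b, 0n);
  switch (fn) {
    case "SUM": return sum;
    case "AVG": return sum / BigInt(values.length); // integer division
    case "MAX": return values.reduce((a, b) => (a > b ? a : b));
    case "MIN": return values.reduce((a, b) => (a < b ? a : b));
  }
}
```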

Available Aggregation Functions:

  • AVG: Calculates the average of a list of values.

  • MAX: Finds the largest value in a list.

  • MIN: Finds the smallest value in a list.

  • SUM: Computes the sum of a list of values.

  • COUNT_IF: Takes an additional parameter, context, that encodes the counting condition (see the encoding sketch after this list).

    • The conditions are:

      • “00” → Equality ==

      • “01” → Inequality !=

      • “02” → Greater than >

      • “03” → Greater than or equal to >=

      • “04” → Less than <

      • “05” → Less than or equal to <=
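
As a sketch, the context could be assembled as a condition code followed by the comparison value. Only the two-digit codes come from the list above; the helper name and exact byte layout expected by the Data Processor are assumptions:

```typescript
// Condition codes for COUNT_IF, as listed above.
enum CountIfCondition {
  Eq = "00",  // ==
  Neq = "01", // !=
  Gt = "02",  // >
  Gte = "03", // >=
  Lt = "04",  // <
  Lte = "05", // <=
}

// Hypothetical helper: pack the condition code and a comparison value into
// one hex context string. Assumes a non-negative value; the byte layout is
// an assumption for illustration.
function encodeCountIfContext(cond: CountIfCondition, value: bigint): string {
  const valueHex = value.toString(16).padStart(64, "0"); // 32-byte value
  return "0x" + cond + valueHex;
}

// Example: count sampled values strictly greater than 100.
const context = encodeCountIfContext(CountIfCondition.Gt, 100n);
```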


