What is a Data Lake?

Introduction

In short, a Data Lake allows you to define an on-chain dataset and select a function to aggregate over it.

A Data Lake is a structured format used to describe a set of data points that are stored on-chain, somewhere in the blockchain's history. Essentially, it is one of the task types that define the data we want to run computations on. Since there are different ways to describe on-chain data, several Data Lakes are available.

BlockSampled Data Lake

The BlockSampled Data Lake is used to extract a specific data point over a range of blocks. It defines a range of block headers, along with the data point that should be extracted for each block in the range. Data points can be extracted from the block header, an account, or a smart contract storage variable. A sketch of such a definition follows the structure list below.

Structure:

  • blockRangeStart: Start block

  • blockRangeEnd: End block (inclusive)

  • sampledProperty: Specifies the exact field to sample. The following are available:

    • header: Sample a field from a header

      • Available fields: ParentHash, OmmerHash, Beneficiary, StateRoot, TransactionsRoot, ReceiptsRoot, LogsBloom, Difficulty, Number, GasLimit, GasUsed, Timestamp, ExtraData, MixHash, Nonce, BaseFeePerGas, WithdrawalsRoot, BlobGasUsed, ExcessBlobGas, ParentBeaconBlockRoot

    • account: Sample a field from a specific account

      • Available fields: Nonce, Balance, StorageRoot, CodeHash

    • storage: Sample a variable of a smart contract via address and storage slot

  • increment: Incremental step over the range from blockRangeStart to blockRangeEnd
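
To make the structure concrete, here is a minimal sketch of a BlockSampled Data Lake definition in TypeScript. The interface and the sampledProperty string format are illustrative assumptions, not the actual SDK API.

```typescript
// Illustrative shape of a BlockSampled data lake; field names follow the
// structure described above, but the interface itself is hypothetical.
interface BlockSampledDataLake {
  blockRangeStart: number; // first block in the range
  blockRangeEnd: number;   // last block in the range (inclusive)
  increment: number;       // step size between sampled blocks
  sampledProperty: string; // e.g. "header.BaseFeePerGas",
                           // "account.<address>.Balance",
                           // "storage.<address>.<slot>"
}

// Sample an account's balance at every 10th block in the range.
// The address and property-string encoding are placeholders.
const balanceOverRange: BlockSampledDataLake = {
  blockRangeStart: 5_000_000,
  blockRangeEnd: 5_001_000,
  increment: 10,
  sampledProperty: "account.0x0000000000000000000000000000000000000001.Balance",
};
```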

TransactionsInBlock Data Lake

The TransactionsInBlock Data Lake is used to query a specific field from all transactions or transaction receipts of a specific block. A sketch of such a definition follows the structure list below.

Structure:

  • targetBlock: The specific block number from which transactions are being sampled

  • startIndex: The starting index of transactions within the block

  • endIndex: The ending index of transactions within the block

  • increment: Incremental step over the range from startIndex to endIndex

  • includedTypes: The transaction types to include in the query

    • Available types: Legacy, EIP2930, EIP1559, EIP4844

  • sampledProperty: Specifies the exact field to sample. The available fields depend on the transaction type; use the includedTypes filter to avoid sampling a field that is unavailable for a given type.

    • Available fields for transaction receipts: Success, CumulativeGasUsed, Logs, Bloom
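
As referenced above, here is a minimal sketch of a TransactionsInBlock Data Lake definition. The interface, field names, and the sampledProperty string format are illustrative assumptions rather than the actual SDK API.

```typescript
// Transaction types available for the includedTypes filter.
type TxType = "Legacy" | "EIP2930" | "EIP1559" | "EIP4844";

// Illustrative shape of a TransactionsInBlock data lake.
interface TransactionsInBlockDataLake {
  targetBlock: number;     // block whose transactions are sampled
  startIndex: number;      // first transaction index in the block
  endIndex: number;        // last transaction index in the block
  increment: number;       // step size between indices
  includedTypes: TxType[]; // filter out types lacking the sampled field
  sampledProperty: string; // e.g. a receipt field such as CumulativeGasUsed
}

// Query CumulativeGasUsed from the receipts of the first 100 transactions
// in a block, restricted to EIP-1559 transactions.
const receiptGas: TransactionsInBlockDataLake = {
  targetBlock: 19_000_000,
  startIndex: 0,
  endIndex: 99,
  increment: 1,
  includedTypes: ["EIP1559"],
  sampledProperty: "tx_receipt.CumulativeGasUsed",
};
```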

Aggregation Functions

To define the computation we want to run on the specified data, we need to select an aggregation function. This function is then run over all of the data points extracted by the data lake. A sketch of how COUNT_IF evaluates its condition follows the list below.

Available Aggregation Functions:

  • AVG: Calculates the average of a list of values.

  • MAX: Finds the largest value in a list.

  • MIN: Finds the smallest value in a list.

  • SUM: Computes the sum of a list of values.

  • COUNT_IF: Counts the values that satisfy a condition. Takes an additional parameter, context, that encodes the counting condition.

    • The condition codes are:

      • “00” → Equality (==)

      • “01” → Inequality (!=)

      • “02” → Greater than (>)

      • “03” → Greater than or equal to (>=)

      • “04” → Less than (<)

      • “05” → Less than or equal to (<=)
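
To illustrate how the context codes map to comparisons, here is a minimal sketch of a COUNT_IF evaluation as it might run off-chain. The function and type names are hypothetical; only the two-character codes come from the table above.

```typescript
// Condition codes as defined in the table above.
type ConditionCode = "00" | "01" | "02" | "03" | "04" | "05";

// Map each code to its comparison against a reference value.
const comparators: Record<ConditionCode, (x: bigint, ref: bigint) => boolean> = {
  "00": (x, ref) => x === ref, // equality
  "01": (x, ref) => x !== ref, // inequality
  "02": (x, ref) => x > ref,   // greater than
  "03": (x, ref) => x >= ref,  // greater than or equal to
  "04": (x, ref) => x < ref,   // less than
  "05": (x, ref) => x <= ref,  // less than or equal to
};

// Count the data points extracted by the data lake that satisfy the condition.
function countIf(values: bigint[], code: ConditionCode, ref: bigint): number {
  return values.filter((v) => comparators[code](v, ref)).length;
}

// Example: count sampled base fees strictly greater than 30 gwei.
const baseFees = [25n, 32n, 40n, 28n].map((g) => g * 10n ** 9n);
console.log(countIf(baseFees, "02", 30n * 10n ** 9n)); // → 2
```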

