I'm excited to formally announce LakeBench, now at version 0.3: the first Python-based, multi-modal benchmarking library that supports multiple data processing engines across multiple benchmarks. You can find it on GitHub and PyPI.
Traditional benchmarks like TPC-DS and TPC-H focus heavily on analytical queries, but they miss the reality of modern data engineering: building complex ELT pipelines. LakeBench bridges this gap by introducing novel benchmarks that measure not just query performance but also data loading, transformation, incremental processing, and maintenance operations. The first such benchmark is called ELTBench and is initially available in light mode.
While the beta release focuses on code-first data processing engines available in Microsoft Fabric, the stable release milestone is planned to include additional benchmarks (e.g., ELTBench in full mode, AtomicELT) and other data processing engines available in Azure.
While there are other benchmarking projects out there, I designed LakeBench with a few key things in mind which, taken together, make it unique:
- Python: While most data engineering benchmarking projects are Scala- or Java-based, I created LakeBench as a Python project to make it the most easily accessible benchmarking library available. No need to build and package binaries; just `%pip install lakebench` directly from PyPI.
- Multiple modalities: Most projects (with the exception of Lake Loader by the OneHouse team, which is Scala-based) are a one-trick pony. They either focus on supporting many engines (e.g., ClickBench), focus on multiple benchmarks, or they just do one thing well: one engine running one benchmark. I designed LakeBench to solve the challenges that come at the intersection of many benchmarks and many engines. As you combine the two, you multiply the possible scenarios that the code needs to account for. However, by doing a few key things listed below, it becomes possible, and dare I boldly say on the day of its formal release: maintainable.
- Separation of engine configuration from the benchmark protocol: When benchmarking different systems, you want to ensure they all follow the same standards. This is why there are distinct Benchmark classes that are abstracted away from the actual engine implementation. A benchmark can be defined in an abstract way, with the actual operations handled by the engine instance that must be passed in as an argument (see the first sketch after this list).
- Support for both benchmark-specific code paths and shared generic engine methods: Each benchmark subclass maintains a benchmark implementation registry (`self.BENCHMARK_IMPL_REGISTRY`), which defines which engines are supported and optionally maps benchmark-specific code to the respective engine. Some benchmarks have very custom code (e.g., `ELTBench`), while others (`TPCDS` and `TPCH`) use entirely generic methods contained in the engine class (e.g., `load_parquet_to_delta()`, `execute_sql_query()`, `optimize_table()`). This provides flexibility: generic operations only need to be defined once and can be reused across many benchmarks, while novel benchmarks can ship code as custom as they need (illustrated in the same sketch).
- Self-contained data generation: Data required by the various benchmarks can be generated via LakeBench DataGenerator classes. DuckDB is used today for generating all datasets except ClickBench. The LakeBench wrapper around DuckDB adds the ability to target a specific row group size in MB, whereas DuckDB only supports specifying a target row count. Targeting row group sizes in MB is extremely important for benchmarking, as it avoids row groups that are too small (see the second sketch after this list). Both TPC-DS and TPC-H parquet datasets can be created in minutes.
- Robust telemetry: LakeBench captures key information, including the size of the compute leveraged, the total number of cores, duration, estimated job cost (in USD), and other data points. LakeBench will also soon support extended engine-specific telemetry (e.g., leveraging SparkMeasure for Spark), logged into a single flexible map column so that each engine can log what it needs without creating a schema maintenance nightmare (see the third sketch after this list).
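To make the engine/benchmark separation and the implementation registry concrete, here is a minimal sketch of the pattern, not LakeBench's actual class layout. The names `DuckDBEngine` and `SimpleTPCH` are hypothetical; only the registry attribute and the generic method names come from the description above:

```python
from abc import ABC

class Engine(ABC):
    """Owns engine configuration plus generic operations shared across benchmarks.
    Method bodies are stubbed for illustration."""
    def load_parquet_to_delta(self, parquet_path: str, table: str) -> None: ...
    def execute_sql_query(self, query: str) -> None: ...
    def optimize_table(self, table: str) -> None: ...

class DuckDBEngine(Engine):
    def __init__(self, schema_path: str):
        # Engine-specific configuration lives here, not in the benchmark.
        self.schema_path = schema_path

class Benchmark(ABC):
    # Maps each supported engine class to optional benchmark-specific code;
    # None means "use the engine's generic methods".
    BENCHMARK_IMPL_REGISTRY = {}

    def __init__(self, engine: Engine):
        if type(engine) not in self.BENCHMARK_IMPL_REGISTRY:
            raise ValueError(f"{type(engine).__name__} is not supported by this benchmark")
        self.engine = engine

class SimpleTPCH(Benchmark):
    BENCHMARK_IMPL_REGISTRY = {DuckDBEngine: None}  # generic code path only

    def run(self, queries: list) -> None:
        # The benchmark protocol is defined once here, abstractly;
        # the engine instance supplies the actual operations.
        for q in queries:
            self.engine.execute_sql_query(q)
```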
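And here is a rough sketch of the row-group-sizing idea using plain DuckDB: estimate on-disk bytes per row from a small sample, then convert a MB target into the row count that DuckDB's `COPY ... (FORMAT PARQUET, ROW_GROUP_SIZE ...)` option accepts. This is not LakeBench's implementation; the target size, sample size, and file paths are illustrative, and it assumes DuckDB's `tpcds` extension is available:

```python
import os
import duckdb

TARGET_ROW_GROUP_MB = 128   # hypothetical target; LakeBench exposes a similar knob
SAMPLE_ROWS = 100_000

con = duckdb.connect()
con.execute("INSTALL tpcds")
con.execute("LOAD tpcds")
con.execute("CALL dsdgen(sf=1)")  # generate the TPC-DS SF1 tables in DuckDB

# Estimate compressed bytes per row by writing a small sample to parquet.
con.execute(
    f"COPY (SELECT * FROM store_sales LIMIT {SAMPLE_ROWS}) "
    f"TO 'sample.parquet' (FORMAT PARQUET)"
)
bytes_per_row = os.path.getsize("sample.parquet") / SAMPLE_ROWS

# DuckDB's COPY only takes a row count, so convert the MB target into rows.
rows_per_group = int(TARGET_ROW_GROUP_MB * 1024 * 1024 / bytes_per_row)

con.execute(
    f"COPY store_sales TO 'store_sales.parquet' "
    f"(FORMAT PARQUET, ROW_GROUP_SIZE {rows_per_group})"
)
```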
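Finally, a sketch of what a result row with a flexible engine-telemetry map could look like. The column names and values are illustrative, not LakeBench's actual schema:

```python
# Hypothetical result-row shape; all names and numbers are made up.
result_row = {
    "scenario_name": "SF1 - Power Test",
    "engine": "DuckDB",
    "compute_size": "8 vCores",     # size of the compute leveraged
    "total_cores": 8,
    "duration_ms": 41_523,
    "estimated_cost_usd": 0.07,
    # One flexible map column: each engine logs whatever it needs here,
    # so adding engine-specific metrics never requires a schema change.
    "engine_telemetry": {
        "spark.executor_run_time_ms": "38210",   # e.g., from SparkMeasure
        "spark.shuffle_write_bytes": "1048576",
    },
}
```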
Running a benchmark is now as simple as:
Install LakeBench from PyPI

```
%pip install lakebench[duckdb]
```
One-Time Data Generation

```python
from lakebench.datagen.tpcds import TPCDSDataGenerator

datagen = TPCDSDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpcds_sf1'
)
datagen.run()
```
Run Benchmark: TPC-DS Power Test

```python
from lakebench.engines.duckdb import DuckDB
from lakebench.benchmarks.tpcds import TPCDS

engine = DuckDB(
    delta_abfss_schema_path='abfss://.........../Tables/duckdb_tpcds_sf1'
)

benchmark = TPCDS(
    engine=engine,
    scenario_name="SF1 - Power Test",
    parquet_abfss_path='abfss://........./Files/tpcds_sf1',
    save_results=True,
    result_abfss_path='abfss://......../Tables/dbo/results'
)
benchmark.run(mode="power_test")
```
Run Benchmark: ELTBench in light Mode

```python
from lakebench.engines.fabric_spark import FabricSpark
from lakebench.benchmarks.elt_bench import ELTBench

engine = FabricSpark(
    lakehouse_name='lakebench',
    lakehouse_schema_name='spark_eltbench_sf1'
)

benchmark = ELTBench(
    engine=engine,
    scenario_name="SF1",
    tpcds_parquet_abfss_path='abfss://........./Files/tpcds_sf1',
    save_results=True,
    result_abfss_path='abfss://......../Tables/lakebench/results'
)
benchmark.run(mode="light")
```
Q&A
- Why didn't you use Ibis to write engine-abstracted generic DataFrame transformations?: In concept, part of what I'm doing scratches the surface of the Ibis project. However, I didn't use Ibis for a few reasons:
- I wanted to maintain full control and provide transparency over the engine-specific code leveraged in all benchmarking scenarios (without users having to drill into another project and understand a much larger code base).
- Ibis doesn't support all of the engines that I wanted LakeBench to support in the beta release (e.g., Daft) or in the planned stable milestone.
- I don't intend the scope of what LakeBench supports to be anywhere near that of Ibis.
- Ibis can add latency, or possibly even inefficiencies, as its DataFrame API is translated to the backend engine.
- I don't like the way __ was implemented for engine __; what can I do about it?: Please submit a PR if you are comfortable doing so, or at minimum log an Issue.
Cheers!