Miles Cole

Mastering Spark: DataFrameWriterV2 vs. DataFrameWriterV1

2026-06-19T00:00:00+00:00

Most Spark developers learn to write data with df.write long before they ever encounter df.writeTo. It is simple, familiar, and everywhere: choose a format, pick a mode, add a few options, and save the result to a table or path. For years, that mental model worked well enough. Spark was often writing files first and tables second.

But modern lakehouse systems have changed the contract. A Delta table is not just a folder of Parquet files. It has transaction metadata, protocol features, table properties, constraints, generated columns, clustering metadata, schema evolution rules, and catalog-level behavior. In that world, the older DataFrameWriter API starts to show its age. A call like mode("overwrite").saveAsTable(...) can hide several different intentions: create the table, replace the table, overwrite the data, change the schema, or update existing metadata. The code is compact, but the semantics are overloaded.

DataFrameWriterV2 was introduced to make those intentions more explicit. Instead of saying “write this DataFrame somewhere using this mode,” the V2 API says “perform this specific table operation.” Create, append, replace, create-or-replace, overwrite-by-expression, and overwrite-partitions become distinct actions rather than behaviors inferred from a combination of mode, format, options, and table existence.

That distinction matters more as Delta and Spark add richer table capabilities. Features like explicit table properties, dedicated schema-evolution semantics, and catalog-managed tables fit more naturally into a table-oriented API than a file-oriented one. Some features Spark exposes (like clusterBy on the writer) aren’t fully wired into Delta yet, but the direction of travel is clear: V2 is where new table-level capabilities land.

In this post, we will compare the two writer APIs, look at the concrete differences in behavior, and highlight what is new in V2 as of Spark 4.2 and delta-spark 4.2.

The old mental model: `df.write`

Most Spark developers start with the original DataFrameWriter API:

df.write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("dbo.orders")

The core ingredients are:

format + mode + options + path/table

That design makes sense when the output is primarily a set of files. But Delta tables are more than a directory of files. They have transaction logs, table metadata, features, schema rules, constraints, and catalog behavior. When the write target is a table, the question is no longer just “where should these rows go?” It is also “what table operation am I performing?”

That is where the older writer API becomes less clear. The biggest source of ambiguity is mode("overwrite"). Depending on table existence, catalog behavior, provider implementation, options like overwriteSchema or replaceWhere, and Spark configuration, the same line can mean: create the table, replace the table definition, keep the definition but overwrite the contents, replace only matching partitions or a replaceWhere predicate, or change the schema. The code is short, but the intent is overloaded.

The newer mental model: `df.writeTo`

The V2 writer starts from a different place:

df.writeTo("dbo.orders")

Instead of saying “save this DataFrame somewhere,” V2 says “write this DataFrame to this table.” From there, the operation is explicit:

df.writeTo("dbo.orders").create()
df.writeTo("dbo.orders").append()
df.writeTo("dbo.orders").replace()
df.writeTo("dbo.orders").createOrReplace()
df.writeTo("dbo.orders").overwrite(col("order_date") == "2026-01-01")
df.writeTo("dbo.orders").overwritePartitions()

With V1, intent is inferred from mode, format, options, and target. With V2, intent is the method you call.
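One way to make that concrete is a small helper that takes the intent as an explicit value and maps it to exactly one V2 method. This is only a sketch: `WriteIntent` and `write_table` are hypothetical names, not part of Spark, and `df` is assumed to be a PySpark DataFrame (anything exposing `writeTo`).

```python
from enum import Enum


class WriteIntent(Enum):
    """Hypothetical enum: each value is the name of a no-argument V2 writer method."""

    CREATE = "create"
    APPEND = "append"
    REPLACE = "replace"
    CREATE_OR_REPLACE = "createOrReplace"
    OVERWRITE_PARTITIONS = "overwritePartitions"


def write_table(df, table: str, intent: WriteIntent) -> None:
    """Run exactly one V2 table operation, chosen explicitly by the caller.

    `df` is assumed to expose `writeTo(table)`, as a PySpark DataFrame does.
    """
    writer = df.writeTo(table)
    # Look up the writer method named by the intent and call it.
    getattr(writer, intent.value)()
```

A call such as `write_table(df, "dbo.orders", WriteIntent.APPEND)` then reads the same way the V2 API does. `overwrite(condition)` is left out of the sketch because it takes an argument.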

Note: format (V1) and using (V2) are both optional. If you don’t specify the provider, the default catalog format is used. In Microsoft Fabric, this is delta. The rest of the examples in this post omit format("delta") and using("delta") to avoid being unnecessarily verbose.

A simple comparison

Operation	V1	V2
Create	`df.write.saveAsTable("t")` (errors if exists, depending on mode)	`df.writeTo("t").create()`
Append	`df.write.mode("append").saveAsTable("t")`	`df.writeTo("t").append()`
Replace table	`df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("t")`	`df.writeTo("t").replace()`
Create or replace	`df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("t")`	`df.writeTo("t").createOrReplace()`
Overwrite by predicate	`df.write.mode("overwrite").option("replaceWhere", "order_date = '2026-01-01'").saveAsTable("t")`	`df.writeTo("t").overwrite(col("order_date") == "2026-01-01")`
Overwrite matching partitions	`df.write.mode("overwrite").insertInto("t")` (with `partitionOverwriteMode=dynamic`)	`df.writeTo("t").overwritePartitions()`

The V2 versions separate ideas that V1 conflates: replace requires the table to exist, createOrReplace does not, and overwrite(condition) and overwritePartitions() are no longer encoded as side-channel options on top of mode("overwrite").

Table properties vs. options: V2 gives them separate seats

This is the single biggest semantic improvement, and it is often misunderstood. In V2, tableProperty(...) and option(...) are not interchangeable. They are stored in two distinct internal maps and are routed to two different places (DataFrameWriterV2.scala in Spark 4.2):

private val options    = new mutable.HashMap[String, String]()
private val properties = new mutable.HashMap[String, String]()

tableProperty(k, v) populates the table metadata that the catalog persists when creating or replacing the table. For Delta, that means it lands in the Metadata action in the transaction log and shows up under SHOW TBLPROPERTIES and in DESCRIBE DETAIL. Examples: delta.enableChangeDataFeed, delta.appendOnly, delta.deletedFileRetentionDuration, delta.feature.timestampNtz, delta.checkpointPolicy.
option(k, v) populates write options that are passed to the data source for this particular write. They do not become table metadata. Examples: mergeSchema, replaceWhere, txnAppId, txnVersion, userMetadata.

In V1, both of these had to be funneled through .option(...), which blurred a real distinction:

# V1: everything is just an "option"
# (parentheses instead of backslashes, so the inline comments are valid Python)
(
    df.write
    .option("delta.enableChangeDataFeed", "true")  # actually a table property
    .option("mergeSchema", "true")                 # actually a per-write option
    .mode("append")
    .saveAsTable("dbo.orders")
)

In V2, the two roles are visible at a glance:

df.writeTo("dbo.orders") \
  .tableProperty("delta.enableChangeDataFeed", "true") \
  .tableProperty("delta.feature.timestampNtz", "supported") \
  .option("mergeSchema", "true") \
  .createOrReplace()

This separation is also what allows V2 to round-trip a real table definition. The properties map is what the catalog stores; the options map is what the writer hands to the data source for this specific operation.

Practical note: V2 still accepts option(...). The improvement is not that options went away — it is that table-level metadata is no longer pretending to be a per-write option.
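If table definitions live in configuration, the same separation can be kept in code. The helper below is a minimal sketch under assumptions: `configure_writer` is a hypothetical name, and `writer` is assumed to be the object returned by `df.writeTo(...)`, which exposes `tableProperty` and `option`.

```python
def configure_writer(writer, table_properties: dict, write_options: dict):
    """Apply table properties and per-write options through their own methods.

    `writer` is assumed to be a DataFrameWriterV2 (the result of df.writeTo(...)).
    """
    for key, value in table_properties.items():
        # Stored by the catalog as table metadata on create/replace.
        writer = writer.tableProperty(key, value)
    for key, value in write_options.items():
        # Handed to the data source for this write only.
        writer = writer.option(key, value)
    return writer
```

Usage would look like `configure_writer(df.writeTo("dbo.orders"), {"delta.enableChangeDataFeed": "true"}, {"mergeSchema": "true"}).createOrReplace()`, with the two dictionaries kept apart all the way from config to call site.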

Paths still work — they just aren’t the headline

V2 is table-first, but it has not dropped path support. option("path", "...") is still honored and is used as the table location at create time:

df.writeTo("dbo.orders") \
  .option("path", "/lakehouse/silver/orders") \
  .create()

That is useful for external tables. The shift is one of emphasis: in V1, paths and tables were two equally prominent ways to call save(...) / saveAsTable(...); in V2, the identifier is the table and the path is just one more option that influences where the table lives.

Liquid clustering on the API surface (Spark 4.0+)

CreateTableWriter.clusterBy(...) was added in Spark 4.0.0 and Spark enforces that partitionedBy and clusterBy aren’t both set on the same writer (it throws clusterByWithPartitionedBy). That matches Delta’s rule that a table is partitioned or clustered, not both.

The caveat: on the Delta side, clusterBy from the DataFrame writers (V1 or V2) is not wired in yet. There is an open PR, delta-io/delta#7060 (“support accepting clusterBy from both v1 and v2 dataframe writers”), that adds this support. Until it lands, the only first-class way to create a liquid-clustered Delta table is via SQL:

CREATE OR REPLACE TABLE dbo.orders
CLUSTER BY (customer_id, order_date)
AS SELECT ...

Or, write and then alter the table:

df.writeTo("dbo.orders") \
    .create()

spark.sql("ALTER TABLE dbo.orders CLUSTER BY (customer_id, order_date)")

This is a good example of the gap noted earlier: Spark’s V2 API can express the intent, but the table provider still has to implement it.

Explicit schema evolution (Spark 4.2 + delta-spark 4.2)

The withSchemaEvolution() method on DataFrameWriterV2 is new in Spark 4.2.0. It only applies to write operations against an existing table — append, overwrite(condition), and overwritePartitions — and throws on create/replace (where schema evolution is implicit in the new definition):

df.writeTo("silver.orders") \
  .withSchemaEvolution() \
  .append()

On the Delta side, this is gated by a TableCapability.AUTOMATIC_SCHEMA_EVOLUTION flag. Delta’s Spark version shims only enable this capability on the spark-4.2 build:

spark-4.0 shim: capability not available at all.
spark-4.1 shim: capability exists in Spark but is intentionally not advertised by Delta because MERGE/INSERT schema evolution wasn’t yet properly wired.
spark-4.2 shim: capability is advertised, and df.writeTo(...).withSchemaEvolution().append() works end-to-end on Delta.

In other words: if you are on delta-spark built against Spark 4.2, withSchemaEvolution() is the new, explicit replacement for .option("mergeSchema", "true") on V2 appends and overwrites.
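For code that has to run on more than one Spark version, one option is to prefer the explicit method when the API exposes it and fall back to the older option otherwise. This is a sketch, not a recommendation from the Delta project: `append_with_schema_evolution` is a hypothetical helper, and the `hasattr` check only detects whether the method exists on the writer, not whether the provider advertises the capability.

```python
def append_with_schema_evolution(df, table: str) -> None:
    """Append to an existing table, allowing the schema to evolve.

    `df` is assumed to be a PySpark DataFrame. The fallback mirrors the
    mergeSchema write option that withSchemaEvolution() replaces.
    """
    writer = df.writeTo(table)
    if hasattr(writer, "withSchemaEvolution"):
        # Newer API surface: schema evolution is an explicit builder call.
        writer = writer.withSchemaEvolution()
    else:
        # Older API surface: fall back to the per-write option.
        writer = writer.option("mergeSchema", "true")
    writer.append()
```

On a build where the method exists but the provider rejects it, the call would still fail at write time, so the version notes above remain the real gate.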

MERGE finally has a DataFrame API (Spark 4.0+)

For years, the only way to do MERGE INTO from Python/Scala was either raw Spark SQL or Delta’s DeltaTable.merge(...) builder. Spark 4.0 added a Spark-native DataFrame entry point and, like the rest of the V2-era APIs, it’s table-oriented and explicit.

The shape is df.mergeInto(target, condition), not df.writeTo(target).merge(...). It’s presumably kept separate because merge needs a join condition and a chain of whenMatched / whenNotMatched / whenNotMatchedBySource clauses that don’t fit the create/append/overwrite builder shape:

source.alias("s") \
    .mergeInto("dbo.orders", expr("dbo.orders.id = s.id")) \
    .whenMatched().updateAll() \
    .whenNotMatched().insertAll() \
    .whenNotMatchedBySource().delete() \
    .merge()

df.mergeInto(...) does not return a DataFrameWriterV2 — it returns a separate MergeIntoWriter. But it sits on the same V2 foundations. From MergeIntoWriter.scala the builder produces a MergeIntoTable logical plan against an UnresolvedRelation with V2 multi-part identifier resolution and the V2 requireWritePrivileges model — the same plan SQL MERGE INTO produces. Providers implement it through V2 row-level operations (Iceberg via SupportsRowLevelOperations; Delta via its own analyzer rules that route to the existing Delta MERGE execution).

MergeIntoWriter also has its own withSchemaEvolution() builder method, separate from the one on DataFrameWriterV2 but conceptually identical: explicit, builder-set, no magic option("mergeSchema", "true") required.

What this means in practice:

For new Delta merge code in Python/Scala, df.mergeInto(...) is now the V2-native equivalent of DeltaTable.forName(...).merge(...). It’s not faster, but it doesn’t require importing delta.tables and it plays naturally with the rest of the V2 DataFrame surface.
DeltaTable.merge(...) is not going away — it still exposes Delta-specific knobs — but df.mergeInto(...) is the cross-provider, Spark way to express the same operation.
If you merge based on paths instead of catalog references, you will need to keep using the DeltaTable.merge(...) builder, because the new Spark API requires a catalog reference for the table being merged into.

Replace semantics are clearer (and Delta knows the difference)

Delta has special-cased V2’s create/replace behavior for a long time. From CreateDeltaTableLike.scala:

In DataFrameWriterV1, mode("overwrite").saveAsTable behaves as a CreateOrReplace table, but we have asked for overwriteSchema as an explicit option to overwrite partitioning or schema information. With DataFrameWriterV2, the behavior asked for by the user is clearer: .createOrReplace(), which means that we should overwrite schema and/or partitioning.

So df.writeTo("t").replace() and .createOrReplace() are not just nicer-looking — Delta uses the API choice itself as the signal that schema and partitioning should be replaced, without needing overwriteSchema=true as a hint. Domain metadata (used by features like clustering) is also only updated on these explicit replace paths.

Partitioning is part of the table definition

With V1, partitionBy is a write-time layout hint. With V2, partitionedBy is part of the table definition you are creating or replacing:

df.writeTo("dbo.orders") \
  .partitionedBy("order_date") \
  .create()

V2 also supports partition transforms (years, months, days, hours, bucket) for providers that implement them, such as Apache Iceberg. Delta doesn’t implement partition transforms, so each partition expression has to be a static column reference.

When V1 is still the right tool

V1 is not going away, and it is still the right choice for file-oriented writes and very simple appends:

df.write.mode("overwrite").parquet("/exports/orders")
df.write.format("json").mode("append").save("/exports/events")
df.write.format("delta").mode("append").save("/lakehouse/bronze/events")
df.write.mode("append").saveAsTable("bronze.raw_events")

The point is not that V1 is obsolete. The point is that V1 carries ambiguity when you are managing modern tables, and V2 now has the features (clustering, explicit schema evolution, table properties) to fully replace it for table lifecycle work.

Watch out for compatibility differences

V2 is cleaner, but it is not magic. Capabilities depend on the Spark version, the catalog, and the provider:

clusterBy requires Spark 4.0+ on the API side, and a provider that implements it. Delta does not yet honor clusterBy from the DataFrame writers — track delta#7060. For now, use SQL CLUSTER BY to create liquid-clustered Delta tables.
withSchemaEvolution() requires Spark 4.2+ and a provider that advertises AUTOMATIC_SCHEMA_EVOLUTION. On Delta, that means a build against the spark-4.2 shim.
Some V2-looking code can still fail if the provider hasn’t fully implemented the requested transform (for example, older Delta versions and partition transforms).

The rule of thumb:

V2 gives Spark a clearer way to express intent.
The table provider still has to implement that intent correctly.

Recommended style

For modern Delta work, a reasonable default style guide:

Use SQL or V2 for table lifecycle operations:

CREATE OR REPLACE TABLE silver.orders
CLUSTER BY (customer_id, order_date)
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
AS SELECT ...

or, until delta#7060 lands, the DataFrame equivalent without clustering:

df.writeTo("silver.orders") \
  .tableProperty("delta.enableChangeDataFeed", "true") \
  .createOrReplace()

Use V2 for writes against existing managed tables:

df.writeTo("silver.orders").append()
df.writeTo("silver.orders").withSchemaEvolution().append()         # Spark/Delta 4.2+
df.writeTo("silver.orders").overwrite(col("order_date") == d)      # replaceWhere, explicit
df.writeTo("silver.orders").overwritePartitions()                  # dynamic partition overwrite for partitioned tables

Use V1 for path-based exports and simple file outputs:

df.write.mode("overwrite").parquet("/exports/orders")

Be cautious with V1 mode("overwrite").saveAsTable(...). That code may be correct, but it deserves a second look. Make sure the intended behavior — create, replace, replaceWhere, overwriteSchema — is obvious to the next person who reads it. If it isn’t, V2 will say it for you.

Final thought

The difference between V1 and V2 writers is not just syntax. It reflects a broader shift in Spark itself. The older API comes from a world where Spark jobs mostly wrote files. The newer API fits a world where Spark manages tables — with first-class properties, clustering, and (as of Spark/Delta 4.2) explicit schema evolution.

df.write is still useful. But when the code is creating, replacing, or managing Delta tables, df.writeTo now tells the truth more clearly, and it has the features to back it up.

Creating your first Spark Job Definition

2026-02-04T00:00:00+00:00

Coming from a notebook-first Spark background, I wanted to write the introduction to Spark Job Definitions (SJDs) that I wish I had when I first encountered them. If you are first interested in why you might want to use a Spark Job Definition over a Notebook, see my blog here.

My first job was in finance, and I learned Spark much later while consulting in environments where everything ran in notebooks. That wasn’t unique to any one company — it’s simply how most consulting teams work. So when I first opened a Spark Job Definition while exploring additional things I could do in Synapse, my reaction was:

“Wow… what the heck is this thing?”

This post is meant for anyone who learned Spark through notebooks and is now staring at SJDs wondering what role they play and how to use them. Think of this as a bridge from interactive development to job-based execution.

What Is a Spark Job Definition?

A Spark Job Definition is effectively a way to run a packaged Spark application, Fabric’s version of executing a spark-submit job. You define:

what code should run (the entry point),
which code files or resources should be shipped with it,
and which command-line arguments should control its behavior.

Unlike a notebook, there is no interactive editor or cell output. That is arguably not a missing feature; it’s the whole point. An SJD is not meant for exploration; it is meant to deterministically run a Spark application.

You can think of it as:

Notebook = interactive development environment (IDE)
SJD = execution mechanism

Core Concepts

At a high level, creating an SJD revolves around five things which you will commonly configure:

Entry Point – the .py, .scala, or .r file that Spark executes
Reference Files [OPTIONAL] – additional .py, .scala, or .r files that can be referenced from your entry point via import module_name.
Command-Line Arguments [OPTIONAL] – runtime parameters
Lakehouse Reference – the default metastore context for tables
Environment Reference – the Environment context that includes public and custom libraries, Spark pool (a.k.a. cluster) configuration, spark configs, and reference files

If you understand the purpose of each of these, you will be well on your way to running your first successful SJD.

So Where Do I Start?

Start by developing your Spark logic either in a notebook or, ideally, in a local IDE like VS Code. Write modular code that can be packaged as a Python Wheel or JAR.

Once your logic works locally or in a notebook, create a small standalone file whose job is to:

import your package,
initialize Spark and logging,
and run the main executable logic.

At its simplest, this could look like:

from pyspark.sql import SparkSession

spark = (
    SparkSession
        .builder
        .appName("myApp")
        .getOrCreate()
)

spark.range(1).write.saveAsTable("dbo.test")

But for production use, it’s better to structure this code more explicitly. In particular, it helps to:

configure logging,
contain executable code in a main() function,
and use a main guard.

That separates code meant to run when the file is executed from code meant to be imported and reused (for example, in unit tests).

from pyspark.sql import SparkSession
import sys
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)]
)

logger = logging.getLogger(__name__)

def main() -> None:
    spark = (
        SparkSession
            .builder
            .appName("myApp")
            .getOrCreate()
    )

    spark.sparkContext.setLogLevel("ERROR")

    logger.info("=" * 80)
    logger.info("Starting...")
    logger.info("=" * 80)

    # Executable code goes here

    logger.info("=" * 80)
    logger.info("Completed...")
    logger.info("=" * 80)

if __name__ == "__main__":
    main()

What About Parameterization?

There are two methods available, both of which are frequently used as they serve different but potentially overlapping use cases.

1. Configuration Data

For configuration-driven pipelines (for example, a list of objects or tables to process), YAML files are highly recommended. They are readable, easy to edit, and trivial to parse using the pyyaml library. For you Rust lovers out there, there’s even a Rust based pyyaml-rs library in case your config data is massive.

tables:
  - name: table_1
    config1: ....
  - name: table_2
    config1: ....
    dependencies:
      - table_1

These files can either be built into your Python Wheel or JAR (for tight coupling of framework and configuration), or staged in OneLake and imported via full ABFSS path or default Lakehouse reference.

import yaml

with open('File/...', "r") as f:
    table_registry = yaml.safe_load(f)
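Once the registry is loaded, the dependencies lists can drive processing order. The sketch below is one possible approach, not something the post prescribes: `processing_order` is a hypothetical helper, and the inline `registry` dict stands in for what yaml.safe_load would return for the example file (with the elided config keys left out).

```python
from graphlib import TopologicalSorter


def processing_order(table_registry: dict) -> list[str]:
    """Return table names so that each table comes after its dependencies."""
    graph = {
        table["name"]: set(table.get("dependencies", []))
        for table in table_registry["tables"]
    }
    # static_order() yields predecessors first and raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())


# Assumed shape of the parsed YAML; table_2 is listed first on purpose.
registry = {
    "tables": [
        {"name": "table_2", "dependencies": ["table_1"]},
        {"name": "table_1"},
    ]
}
print(processing_order(registry))  # ['table_1', 'table_2']
```

Keeping the ordering logic separate from the YAML parsing makes it easy to unit test with plain dictionaries.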

2. Runtime Control Flow

For higher-level control flow, the kind of things you normally override in a notebook cell via Pipeline parameters, you should use command-line arguments.

This was the biggest learning gap for me. Instead of overwriting variables in a chosen parameter cell, your application must expect arguments and validate them.

import argparse

def parse_args(argv):
    p = argparse.ArgumentParser()
    p.add_argument("--zone", type=lambda s: s.lower(), required=True)
    p.add_argument("--load-group", type=int, default=0)
    p.add_argument("--config-file-uri", required=True)
    p.add_argument("--compression", choices=["snappy", "zstd"], default="snappy")
    p.add_argument("--debug", action="store_true")

    return p.parse_args(argv)

The argparse library that comes included in Python gives you validation, help text, and type enforcement without boilerplate. See the docs for all of the creative ways you can control and constrain inputs.
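As a quick illustration of that validation, the sketch below parses one valid argument list and one with a value outside the allowed choices. It uses a trimmed-down parser with only three of the arguments, and the description and help text are illustrative additions.

```python
import argparse


def parse_args(argv):
    p = argparse.ArgumentParser(description="Example SJD entry point")
    p.add_argument("--zone", type=lambda s: s.lower(), required=True,
                   help="Target zone; lower-cased on input")
    p.add_argument("--load-group", type=int, default=0)
    p.add_argument("--compression", choices=["snappy", "zstd"], default="snappy")
    return p.parse_args(argv)


# Valid input: the type callables run and defaults are filled in.
args = parse_args(["--zone", "BRONZE", "--load-group", "1"])
print(args.zone, args.load_group, args.compression)  # bronze 1 snappy

# Invalid input: argparse prints a usage message to stderr and exits.
try:
    parse_args(["--zone", "bronze", "--compression", "gzip"])
except SystemExit as exc:
    print("rejected with exit code", exc.code)  # rejected with exit code 2
```

Because a bad argument ends the process with a non-zero exit code, a misconfigured job fails before any Spark work starts.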

Your arguments are then provided to the SJD like this:

--zone bronze --load-group 1 --config-file-uri Files/.../table_registry.yml --compression zstd --debug

And parsed inside your executable:

import sys

def main() -> None:
    args = parse_args(sys.argv[1:])

Which exposes them as attributes of a named Python object (i.e. args):

args.zone
args.load_group
args.config_file_uri
args.compression
args.debug

The neat thing about this seemingly more complex parameterization process is that there’s a clear delineation of which variables are inputs, since they are self-contained in a single Python object (i.e. args). In notebook development, the delineation between input parameters and regular Python variables is 100% up to developer hygiene and consistently applied naming conventions.

Additional Gotchas

There are a few things that we notebook developers take for granted because the notebook UX is all about convenience and agility:

spark is not automatically defined

A Spark session exists, but you must assign it:

from pyspark.sql import SparkSession

spark = (
    SparkSession
        .builder
        .appName("myApp")
        .getOrCreate()
)

Common imports are not pre-imported for the user

Anything automatically injected into notebooks must be explicitly imported, such as:
- from pyspark.sql import SparkSession
- import notebookutils

SJDs make implicit behavior explicit — which is both the challenge and the benefit.

Putting It All Together

A typical SJD entry point ends up looking something like this (my_elt_package contains the locally built and tested business logic, transformations, etc.):

from pyspark.sql import SparkSession
import sys
import logging
import argparse

# import your python package
from my_elt_package import Controller

# if using yaml for configs
import yaml
def load_table_registry(path: str) -> dict:
    with open(path, "r") as f:
        table_registry = yaml.safe_load(f)
    return table_registry

def parse_args(argv):
    p = argparse.ArgumentParser()
    p.add_argument("--zone", type=lambda s: s.lower(), required=True)
    p.add_argument("--load-group", type=int, default=0)
    p.add_argument("--config-file-uri", required=True)
    p.add_argument("--compression", choices=["snappy", "zstd"], default="snappy")
    p.add_argument("--debug", action="store_true", help="Enable DEBUG logging")
    return p.parse_args(argv)

def configure_logging(debug: bool) -> logging.Logger:
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    return logging.getLogger(__name__)

def create_spark(app_name: str, debug: bool) -> SparkSession:
    spark = (
        SparkSession
            .builder
            .appName(app_name)
            .getOrCreate()
    )
    spark.sparkContext.setLogLevel("INFO" if debug else "ERROR")
    return spark

def main(argv: list[str]) -> None:
    # parse input arguments
    args = parse_args(argv)

    # configure logging
    logger = configure_logging(args.debug)

    # assign SparkSession as variable
    spark = create_spark("myApp", args.debug)

    logger.info("=" * 80)
    logger.info(f"Starting load group {args.load_group} for zone {args.zone}...")
    logger.info("=" * 80)

    # main executable code
    table_registry = load_table_registry(args.config_file_uri)

    controller = Controller(
        spark=spark,
        config={
            "load_group": args.load_group,
            "compression": args.compression,
        },
        table_registry=table_registry
    )

    controller.run_pipeline(zone=args.zone)

    logger.info("=" * 80)
    logger.info(f"Completed load group {args.load_group} for zone {args.zone}...")
    logger.info("=" * 80)

if __name__ == "__main__":
    main(sys.argv[1:])

Because the executable logic lives inside main(), it can be imported and called from test suites or other programs:

# some_other_file.py
import sjd_main as job

def test_bronze_is_created(spark):
    job.main(["--zone", "bronze", "--config-file-uri", "C:/user/dev/table_registry.yml", "--load-group", "1"])
    assert spark.catalog.tableExists("bronze.test_sjd")

Now you can make changes locally, run unit tests, and have high confidence that your job will behave the same way in the cloud. No need to blindly submit a job and cross your fingers :)

How Do I Monitor a Spark Job?

With notebooks, you get cell output and visual cues. With SJDs, monitoring shifts to:

the Spark UI for Spark execution details,
and stdout / stderr logs for application behavior.

Your logging configuration determines what you see. Prints become logs. Cell outputs become structured messages.

It’s less visual — but more precise.

Typical Development Flow

I plan to expand on this in a future post, but the high-level flow usually looks like:

Iterate on code locally or remotely in a Fabric Notebook to develop a working PoC.
Formalize your PoC into a locally packaged library with unit tests.
Create a small entry-point script for execution.
Test the entry-point.
Attach the package to a Fabric Environment.
Create an SJD referencing the entry point, any reference files, command line arguments, Lakehouse and Environment reference.
Run 🚀

This development workflow will feel heavier than a notebook at first, but the requirement to develop with strong intentionality will provide you with a more reliable production solution. It buys you testability, repeatability, and modularity that are all critical for well designed Spark applications.

Lastly, this development workflow is not for everyone or all projects. However, if you have already begun to explore packaging your code and want to take things to the next level, I highly encourage considering whether the rigor of a Spark Job Definition would force you to adopt more mature development habits that result in more reliable production jobs.

Notebooks, Spark Jobs, and the Hidden Cost of Convenience

2026-02-04T00:00:00+00:00

I’m guilty. I’ve peddled the #NotebookEverything tagline more than a few times.

To be fair, notebooks are an amazing entry point to coding, documentation, and exploration. But this post is dedicated to convincing you that notebooks are not, in fact, everything, and that many production Spark workloads would be better executed as a non-interactive Spark Job.

I’m certainly not the first to say such a controversial thing. Daniel Beach’s infamously entertaining The Rise of the Notebook Engineer blog post made waves (and enemies) for a reason. Ironically, I’ve spent my entire Spark career being exactly that: a notebook engineer. Sure, I’ve done plenty of software engineering work that doesn’t happen in a notebook, like creating APIs, CI/CD automation, and building web apps (both front-end and back-end) before vibe coding would do nearly everything for you. But for my entire Spark development career, I’ve only deployed via notebooks. I came from the business side and later learned Spark in consulting, where everyone used notebooks for Spark jobs, production included.

So if you only use notebooks today, no judgement, you’re in good company. In this post I focus on some very real considerations and lessons learned while arguing three core points:

Reliability must come before convenience
Notebooks make testing and modularity harder
Spark Job Definitions encourage better engineering habits

While I’ll use Microsoft Fabric’s Spark Job Definitions as a concrete example, the argument here is not Fabric specific. The same tradeoffs exist in Databricks Jobs, spark-submit on EMR or HDInsight, AWS Glue, or any platform where notebooks and scheduled Spark jobs coexist. This is really about choosing between an interactive editor and a packaged execution model.

1. Reliability Must Come Before Convenience

Beyond performance, cost, and clever optimizations, a good data engineer should optimize for reliability as a first principle.

Why? I’ll propose it algebraically:

\[\text{stakeholderSatisfaction} = \text{dataTimeliness} \times \text{TCO} \times \text{securityExpectations} \times (\text{reliability})^{10}\]

You can build the fastest pipeline with the lowest TCO and perfect security posture, and none of it matters if the data only arrives correctly 95% of the time.
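Taking the formula at face value, and assuming purely for illustration that the other three factors are normalized to 1, a pipeline that delivers correctly 95% of the time scores:

\[\text{stakeholderSatisfaction} = 1 \times 1 \times 1 \times (0.95)^{10} \approx 0.60\]

Under that assumption, a five-point drop in reliability costs roughly 40% of stakeholder satisfaction, which is exactly what the exponent is there to express.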

What is good performance if data doesn’t reliably get from A to Z? Will your CFO care about your cost savings if a regression adds extra zeros to sales figures?

One bad incident can undo months of tuning, cost optimization, and feature work. That’s why I consider reliability a first principle. Everything else is downstream from it.

If reliability is the goal, then the levers we control as data engineers start to matter a lot. In practice, three things show up again and again as predictors of whether a pipeline stays healthy over time:

Testing → determines how often we prevent incidents in the first place
Modularity → determines how fast we recover when a portion of a complex code base breaks, and how testable the code is
Governance → determines who can introduce a change into production

Surely there are others, but few would disagree that these are strong predictors of high reliability.

2. Notebooks Make Testing and Modularity Harder

Notebooks and Testing

Notebooks can be tested. But if this were a conference talk and I asked, “Who runs unit tests against their notebook code before every release?”, I’d expect a lot of uncomfortable silence.

In my years of consulting before Microsoft, I never once saw a real test suite for notebook-based pipelines — not from customers, and not from teams I worked on. There might be CI validating that a SQL project builds or that a Python wheel compiles, but never a meaningful assertion that a pipeline produces the expected result or a utility does what it is supposed to.

assert my_elt_func(df) == expected_result

Why is this? In the data engineering space, there’s a handful of core reasons:

Economic realities: Very few organizations want to pay for work that doesn’t immediately translate into more data, more dashboards, or tighter SLAs. Testing is preventative, and preventative work with intangible benefits is notoriously hard to justify in budgets.
Technical constraints: Writing unit tests in a data context is genuinely harder than in typical application code. You’re often asserting over distributed behavior, schemas, and transformations rather than simple return values.
Skillset gaps: Notebooks are highly encouraged in consulting scenarios because the inputs, progress, and outputs are all much more transparent to those who did not build the solution but will own it going forward.
Development mechanics: Notebooks don’t naturally fit into a testable development workflow. They blur together setup, logic, and execution. They can mix languages. They encourage inline code rather than reusable functions. And while they are technically just files in source control, they are awkward to import and test like normal code.

The only scalable pattern I’ve seen work is to treat the notebook as nothing more than an entry point. All of the actual ELT logic lives in a Python wheel or JAR with proper unit tests, and the notebook simply imports classes and executes methods or functions defined outside of the notebook. At that point, the notebook is no longer the system. It’s just a user interface for calling run with a specific configuration context.
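As a minimal sketch of that pattern, assume a hypothetical package module `my_elt/transforms.py` with a pure function `add_net_sales` (neither name comes from a real library). The logic is plain Python here so it can be unit tested without a Spark session; in a real wheel the same idea applies to functions that accept and return DataFrames.

```python
# my_elt/transforms.py (hypothetical module packaged into the wheel)
from typing import Iterable


def add_net_sales(rows: Iterable[dict]) -> list[dict]:
    """Return new rows with net_sales = gross_sales - discount.

    Rows with a negative discount are rejected so bad input fails loudly.
    """
    result = []
    for row in rows:
        if row["discount"] < 0:
            raise ValueError(f"negative discount in row: {row}")
        result.append({**row, "net_sales": row["gross_sales"] - row["discount"]})
    return result


# tests/test_transforms.py (hypothetical unit test, runnable with pytest)
def test_add_net_sales():
    rows = [{"order_id": 1, "gross_sales": 100, "discount": 15}]
    expected = [{"order_id": 1, "gross_sales": 100, "discount": 15, "net_sales": 85}]
    assert add_net_sales(rows) == expected


# The notebook (or job) then shrinks to an entry point, for example:
#   from my_elt.transforms import add_net_sales
#   add_net_sales(input_rows)
```

Because the function takes its input as an argument and returns a value, the test needs no workspace, cluster, or notebook session to run.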

But what about modularity?

Notebooks and Modularity

Yes, you can modularize notebook code. You can reference .py files. You can attach modules through Environments. You can even inline-install packages at runtime. But all of those techniques tend to bind your logic to a specific notebook or execution context.

Code that lives in a notebook (including Fabric’s Notebook and Environment resources) is harder or even impossible to efficiently reuse outside that scope without copy-paste distributing your source code. It is also harder to version cleanly, harder to promote across environments, and harder to reason about as a product rather than as an artifact of an editor.

Packaging your logic as a wheel or JAR forces separation between what the code does and how it is executed. That separation is what enables testing, reuse, and controlled deployment. It is the same pattern application engineers have relied on for decades, and it works just as well for data engineering when we choose to use it.

If your transformation logic, shared utilities, or dataframe operators are worth reusing outside of a single data pipeline context, they probably shouldn’t live inside a notebook. Minimally, aim to package your code as a Python wheel or JAR, and then use the Notebook as an entry point for calling your ELT package.

3. Spark Job Definitions Encourage Better Engineering Habits

This section hits closest to home for me.

I run a low-risk internal Spark workload at Microsoft where the use case requires frequently adjusting input parameters. For a long time, I ran it via notebooks, even after I had already refactored all logic into Python packages. The notebook was just the entry point.

But notebooks made it too easy to be lazy:

I’m not going to schedule this job because I’ll just open the Notebook before results are needed, modify the one or two lines of code to adjust the execution context and run. So easy!

Because it was so easy to modify, I avoided formalizing various behaviors. There was no stable interface. No clear contract. No forced decision about what should be configurable and what should not.

When I moved those jobs to Spark Job Definitions with proper command-line arguments, something surprising happened: the friction forced me to think.

I had to decide:

what was input and what was the expected behavior
what could change safely and what should not
how parameterization and control flow should work
where validation should live and what is tested

In other words, I had to think about things that directly shape data pipeline reliability.
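To make those decisions concrete, here is a sketch of what a command-line interface for such a job might look like using Python's standard `argparse` module. The argument names (`--run-date`, `--mode`, `--lookback-days`) are hypothetical and only illustrate the idea of a stable, validated contract; they are not taken from my actual job.

```python
import argparse
from datetime import date


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Define the job's public interface: what callers may change, and how."""
    parser = argparse.ArgumentParser(description="Hypothetical ELT job entry point")
    parser.add_argument("--run-date", type=date.fromisoformat, required=True,
                        help="Logical date to process, e.g. 2025-07-01")
    parser.add_argument("--mode", choices=["full", "incremental"], default="incremental",
                        help="Only these two behaviors are supported")
    parser.add_argument("--lookback-days", type=int, default=1,
                        help="How many days of source data to reprocess")
    args = parser.parse_args(argv)
    # Validation lives next to the interface, so bad input fails before any compute runs.
    if args.lookback_days < 1:
        parser.error("--lookback-days must be >= 1")
    return args


def main(argv: list[str]) -> None:
    args = parse_args(argv)
    # Placeholder: a real job would call into the packaged ELT logic here.
    print(f"Running {args.mode} load for {args.run_date} (lookback={args.lookback_days})")


# In the job's main file, the entry point would be: main(sys.argv[1:])
```

Anything not exposed as an argument cannot be changed at run time without a new release, which is exactly the kind of forced decision described above. Since `parse_args` takes `argv` explicitly, the interface itself can be unit tested.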

There’s an uncomfortable truth hiding here:

If the barrier to running production code is near zero, then the barrier to breaking production is near zero too. Notebooks are easy to create, and they are just as easy to mutate. There is no inherent guardrail beyond human discipline.

Spark Job Definitions, by contrast, require packaging, interfaces, and intent. They are less convenient, and that inconvenience is arguably not a flaw; it reflects the nature of complex data engineering, which requires better habits. Going back to our premise around what drives reliability, a job without a built-in IDE adds a layer of healthy friction that governs how easy it is to make a change, a change that could be untested and regretted.

What About Interactivity?

Spark Job Definitions are not interactive, and that is usually framed as a downside, but I’ll push back by asking: does it really make sense for a production job to ship with a built-in IDE? IDEs are meant to make developing code easier, and a Notebook is functionally an executable script with a built-in IDE. Sure, we could lock the production notebook to be read-only in our production workspace, but that doesn’t change the fact that it’s still a notebook that comes with the overhead IDEs require to do things like nicely visualize cell outputs, snapshots, and such. While an SJD wouldn’t be meaningfully faster than the same code run as a Notebook with 20 cells, the UI cost is certainly not zero.

Consider a website built via Squarespace vs. one deployed via conventional methods (building the web app locally, and then publishing the compiled package to a hosting service): which website would you trust to run a billion-dollar business? I would certainly not trust the Squarespace implementation because the barrier to making a breaking change is too low: it ships with an IDE. You are never more than 2-3 clicks away from making a change that could disrupt operations (sorry, I accidentally deleted the order form).

But interactivity does not disappear; it simply moves earlier in the process. You still explore and debug locally. You still test in notebooks if that helps. You still validate behavior before release.

By the time you execute an SJD, you are supposed to already know what it will do and have executed tests that prove it works as expected. An SJD is nothing more than a Spark job API contract: it expects certain inputs, and in return it will run your code. Bad code == bad result, good code == good result.

⚠️ WARNING - controversial claim: notebooks shine when you need to explore, explain, visualize, or teach. They are phenomenal for data science and experimentation, but they are arguably not ideal for most production use cases. Production data engineering and data science workloads are typically extremely binary:

Did I get the data from A to Z?
Did it arrive on time?
Did the dataset get scored?
Did it arrive in the right shape?
Did it break anything downstream?

Nothing about most production workloads requires the use of notebooks. They are a convenience: I can ship the thing I used to interactively develop my solution while benefiting from the ease of making further code changes, and it comes with the ability to interweave documentation with code.

While notebooks optimize for convenience, Spark Job Definitions optimize for intent. If reliability is your first principle, intent should always come before convenience.

So the real question isn’t whether you can run production jobs from notebooks. It’s whether doing so makes you a more disciplined engineer and produces more reliable outcomes for your stakeholders.

Notebooks make it easy to ship any code. Spark Job Definitions make it hard to ship the wrong code. That’s why I’m reconsidering how I deploy most production pipelines.

See my blog for how to create your first Spark Job Definition. The internet is strangely thin on this topic, probably because too many of us still #NotebookEverything 😄, but it’s really not that hard once you understand the core concepts.

Announcing: 🌊 LakeBench

2025-07-11T00:00:00+00:00

I’m excited to formally announce LakeBench, now at v0.3, the first Python-based multi-modal benchmarking library that supports multiple data processing engines on multiple benchmarks. You can find it on GitHub and PyPI.

Traditional benchmarks like TPC-DS and TPC-H focus heavily on analytical queries, but they miss the reality of modern data engineering: building complex ELT pipelines. LakeBench bridges this gap by introducing novel benchmarks that measure not just query performance, but also data loading, transformation, incremental processing, and maintenance operations. The first of such benchmarks is called ELTBench and is initially available in light mode.

While the beta release focuses on code-first data processing engines available in Microsoft Fabric, the stable release milestone is planned to include additional benchmarks (i.e., ELTBench in full mode, AtomicELT) and other data processing engines available in Azure.

While there are other benchmarking projects out there, I designed LakeBench with a few key things in mind which, taken together, make it unique:

Python: While most data engineering benchmarking projects are Scala or Java-based, I created LakeBench as a Python project to make it the most easily accessible benchmarking library available. No need to build and package the binaries, just %pip install lakebench directly from PyPI.
Multiple modalities: Most projects (with the exception of Lake Loader by the OneHouse team, which is Scala-based) are one-trick ponies. They either focus on supporting many engines (e.g., ClickBench), focus on multiple benchmarks, or just do one thing well: one engine that runs one benchmark. I designed LakeBench to solve the challenges that come with combining many benchmarks with many engines. As you combine the two, you multiply the possible scenarios that code needs to account for. However, by doing a few key things listed below, it becomes possible, and dare I boldly say on the day of its formal release: maintainable.
- Separation of engine configuration from the benchmark protocol: When benchmarking different systems, you want to ensure they all follow the same standards. This is why there are distinct Benchmarking classes that are abstracted away from the actual code implementation. This way, a benchmark can be defined in an abstract way, with the actual operation being handled by the required engine instance that must be passed in as a variable.
- Support for both benchmark-specific code paths and shared generic engine methods: Each benchmark subclass maintains a benchmark implementation registry (self.BENCHMARK_IMPL_REGISTRY), which defines which engines are supported and optionally maps benchmark-specific code to be used by the respective engine. Some benchmarks will have very custom code (e.g., ELTBench), while others (TPCDS and TPCH) use entirely generic methods contained in the engine class (e.g., load_parquet_to_delta(), execute_sql_query(), optimize_table()). This means generic code only needs to be defined once and can be used across many benchmarks, while novel benchmarks can still be as custom as needed.
Self-contained data generation: Data required by the various benchmarks can be generated via LakeBench DataGenerator classes. DuckDB is used today to generate all datasets except ClickBench. The LakeBench wrapper around DuckDB provides additional functionality to target specific row group sizes in MB, whereas DuckDB only supports specifying the target count of rows. Targeting row group sizes in MB is extremely important for benchmarking to avoid having row groups that are too small. Both TPC-DS and TPC-H parquet datasets can be created in minutes.
Robust telemetry: LakeBench captures key information, including the size of the compute leveraged, total number of cores, duration, estimated job cost (in USD), and other data points. LakeBench will also soon support extended engine-specific telemetry (e.g., leveraging SparkMeasure for Spark) logged into a single flexible map column so that each engine can log what is needed without having a schema maintenance nightmare.
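To illustrate the registry idea described above, here is a simplified sketch. It is not LakeBench's actual code: the class names (`Engine`, `SparkEngine`, `DuckDBEngine`, `UnregisteredEngine`, `ExampleBenchmark`) and the `spark_custom_load` function are hypothetical, and only the `BENCHMARK_IMPL_REGISTRY` and `load_parquet_to_delta` names are borrowed from the description.

```python
from typing import Callable, Optional


class Engine:
    """Generic engine interface; shared methods are defined once here."""

    def load_parquet_to_delta(self, table: str) -> str:
        return f"{type(self).__name__}: generic load of {table}"


class SparkEngine(Engine):
    pass


class DuckDBEngine(Engine):
    pass


class UnregisteredEngine(Engine):
    pass


def spark_custom_load(engine: Engine, table: str) -> str:
    """Benchmark-specific code path used only for Spark in this sketch."""
    return f"{type(engine).__name__}: custom load of {table}"


class ExampleBenchmark:
    # Maps each supported engine class to optional benchmark-specific code.
    # None means "fall back to the engine's generic methods".
    BENCHMARK_IMPL_REGISTRY: dict[type, Optional[Callable[[Engine, str], str]]] = {
        SparkEngine: spark_custom_load,
        DuckDBEngine: None,
    }

    def __init__(self, engine: Engine):
        # Engines missing from the registry are rejected up front.
        if type(engine) not in self.BENCHMARK_IMPL_REGISTRY:
            raise ValueError(f"{type(engine).__name__} is not supported by this benchmark")
        self.engine = engine
        self.impl = self.BENCHMARK_IMPL_REGISTRY[type(engine)]

    def load(self, table: str) -> str:
        if self.impl is not None:
            return self.impl(self.engine, table)
        return self.engine.load_parquet_to_delta(table)
```

The benchmark protocol (`load`) stays identical for every engine; only the registry decides whether a custom or generic code path runs.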

Running a benchmark is now as simple as:

Install LakeBench from PyPi

%pip install lakebench[duckdb]

One-Time Data Generation

from lakebench.datagen.tpcds import TPCDSDataGenerator

datagen = TPCDSDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpcds_sf1'
)
datagen.run()

Run Benchmark: TPC-DS Power Test

from lakebench.engines.duckdb import DuckDB
from lakebench.benchmarks.tpcds import TPCDS

engine = DuckDB(
    delta_abfss_schema_path='abfss://.........../Tables/duckdb_tpcds_sf1'
)

benchmark = TPCDS(
    engine=engine,
    scenario_name="SF1 - Power Test",
    parquet_abfss_path='abfss://........./Files/tpcds_sf1',
    save_results=True,
    result_abfss_path='abfss://......../Tables/dbo/results'
)
benchmark.run(mode="power_test")

Run Benchmark: ELTBench in `light` Mode

from lakebench.engines.fabric_spark import FabricSpark
from lakebench.benchmarks.elt_bench import ELTBench

engine = FabricSpark(
    lakehouse_name='lakebench',
    lakehouse_schema_name='spark_eltbench_sf1',
)

benchmark = ELTBench(
    engine=engine,
    scenario_name="SF1",
    tpcds_parquet_abfss_path='abfss://........./Files/tpcds_sf1',
    save_results=True,
    result_abfss_path='abfss://......../Tables/lakebench/results'
)
benchmark.run(mode="light")

Q&A

Why didn’t you use Ibis to write engine-abstracted generic DataFrame transformations?: In concept, part of what I’m doing is scratching the surface of the Ibis project. However, I didn’t use Ibis for a few reasons:
- I wanted to maintain full control and provide transparency over the engine-specific code leveraged in all benchmarking scenarios (without users having to drill into another project and understand a much larger code base).
- Ibis doesn’t support all of the engines that I wanted LakeBench to support in the beta release (Daft) or in the planned stable milestone.
- I don’t intend for the scope of what LakeBench supports to be anywhere near Ibis.
- Ibis can add additional latency or possibly even inefficiencies as Ibis DataFrame APIs are translated to the backend engine leveraged.
I don’t like the way __ was implemented for engine __, what can I do about it?: Please submit a PR if you are comfortable, or minimally log an Issue.

Cheers!

The Small Data Showdown ‘25: Is it Time to Ditch Spark Yet??

2025-06-30T00:00:00+00:00

Last December (2024) I published a blog seeking to explore the question of whether data engineers in Microsoft Fabric should ditch Spark for DuckDB or Polars. Six months have passed and all engines have gotten more mature. Where do things stand? Is it finally time to ditch Spark? Let The Small Data Showdown ‘25 begin!

Goals of This Post

First, let’s revisit the purpose of the benchmark: The objective is to explore data engineering engines available in Fabric to understand whether Spark with vectorized execution (the Native Execution Engine) should be considered in small data architectures.

Beyond refreshing the benchmark to see if any core findings have changed, I do want to expand in a few areas where I got great feedback from the community:

Framework Transparency: While I didn’t publish the benchmark code last time, it is now available as part of the beta version of my LakeBench Python library. You can find it on GitHub and PyPI. This blog leverages the ELTBench benchmark run in light mode. Hopefully, this will help provide additional trust, enable reproducing benchmarks, or at least allow folks to give me tips for how to improve the methodology. If there’s anything you’d do differently for one of the engines, just raise an Issue, or better yet, submit a PR!
Additional Engines: While I by no means plan to benchmark the gamut of OSS engines, I did get common asks to include Daft and Databricks Photon in the benchmark. I’ve elected to include Daft this time. I am not including Photon as it doesn’t fit the intent of this study: to explore engines available in Fabric for small data workloads.

Benchmark Methodology

If you haven’t already read my initial blog comparing these engines, I’d recommend reading it first. I’ve made a few minor adjustments to the benchmarking methodology this time:

To provide better clarity in terms of the scale of data where small engines become definitively faster than Spark, I’m now referencing the size of compressed data rather than the TPC-DS scale factor used. This is particularly important as my benchmark only uses a subset of the TPC-DS tables. The scale factor-to-size mapping (for my lightweight benchmark) is below:

TPC-DS Scale Factor	Compressed Size (store_sales, customer, dim_date, item, store)	Largest Table Row Count (store_sales)
1GB	140MB	2,879,789
10GB	1.2GB	28,800,501
100GB	12.7GB	288,006,388

As seen above, this distinction is critical: the compressed data processed is roughly 7-8x smaller than the nominal scale factor size.

I switched the order of the VACUUM and OPTIMIZE phases. Given the intent of running VACUUM was to measure the efficiency of vacuuming files, it made more sense to do so after OPTIMIZE generates yet additional files that could be cleaned.
Maintenance jobs, VACUUM and OPTIMIZE, are included in the detailed phase analysis but excluded from the cumulative execution time for each benchmark scale. There are two reasons for this change:
- Spark is the only engine that implements its own native VACUUM and OPTIMIZE commands. None of the single-node engines do, so the Delta-rs Python library is used for all of them, which means the difference in execution time between single-machine engines is largely noise. Delta-rs is significantly more efficient at running VACUUM. If you aren’t using Deletion Vectors in Spark, you can also use Delta-rs for VACUUM and benefit from the same performance.
- Maintenance jobs are typically not executed as frequently, relative to other operations, as they are in this six-phase benchmark. In Spark, I recommend using Auto Compaction to programmatically have compaction run only when needed, synchronously as part of write operations. VACUUM doesn’t have a direct impact on performance, so engineers are able to choose a suitable cadence that aligns with their storage cost and data recovery expectations.
I added a third benchmark scale to represent ultra-small workloads, this being the 1GB scale factor that translates to 140MB of compressed data.
In my prior benchmark, I included a modified version of the Polars benchmark that would use DuckDB for the pre-merge sample operation. While Polars still doesn’t support a lazily evaluated sample, I rewrote the code to replicate the output of sampling while still keeping things lazy.
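As a quick sanity check on the scale-factor-to-size table earlier in this list, the ratio between nominal scale factor and compressed size can be computed directly. This is only arithmetic on the published numbers, assuming decimal units (140MB = 0.14GB).

```python
# Nominal TPC-DS scale factor (GB) -> compressed size (GB) of the five tables used,
# taken from the scale-factor table. 140MB is written as 0.14GB (decimal units assumed).
compressed_gb = {1: 0.14, 10: 1.2, 100: 12.7}

# How many times larger the nominal scale factor is than the data actually processed.
ratios = {sf: sf / size for sf, size in compressed_gb.items()}

for sf, ratio in ratios.items():
    print(f"SF {sf}GB: nominal size is {ratio:.1f}x the compressed size")
```

The ratios land between roughly 7x and 8x at every scale, which is why quoting the compressed size gives a more honest picture than quoting the scale factor.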

Why This Benchmark Is Relevant

Most benchmarks that are published are too query-heavy and miss the reality that data engineers build complex ELT pipelines to load, clean, and transform data into a shape that is consumable for analytics. TPC-DS and TPC-H particularly fall short in this regard. Yes, they are relevant for bulk data loading and complex queries, but they miss the broader data lifecycle.

My lightweight benchmark proposes that the entire end-to-end data lifecycle which data engineers manage or encounter is relevant: data loading, bulk transformations, incrementally applying transformations, maintenance jobs, and ad-hoc aggregative queries.

Engine Versions Used

Engine	Version
Daft	0.5.7
Delta-rs	1.0.2 (0.25.5 for Daft)
DuckDB	1.3.1
Polars	1.31.0
Spark	Fabric Runtime 1.3 (Spark 3.5, Delta 3.2)

Spark Core -> Cluster Map

For the single-node engines, there’s nothing to be confused about: 16-vCores means a 16-vCore machine. For Spark, it gets nuanced. The table below shows the mapping of cluster config to how many cores were used (including the driver node):

Core Count	Cluster Config	Executor Cores
4	4-vCore Single Node	2
8	8-vCore Single Node	4
16	3 x 4-vCore Worker Nodes	12
32	3 x 8-vCore Worker Nodes	24
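The mapping in the table can be expressed as a small helper. This is a sketch under two assumptions read off the table itself: single-node mode gives half of the machine's cores to executors, and multi-node clusters add one driver node of the same size as the workers. The function name is hypothetical.

```python
def spark_core_split(node_vcores: int, worker_nodes: int = 0) -> tuple[int, int]:
    """Return (total_cores, executor_cores) for the cluster shapes in the table.

    worker_nodes=0 models single-node mode, where (per the table) half of the
    machine's cores go to executors. Otherwise every worker core is an executor
    core and one same-sized driver node is added to the total (an assumption
    that matches the 16- and 32-core rows).
    """
    if worker_nodes == 0:
        return node_vcores, node_vcores // 2
    executor_cores = worker_nodes * node_vcores
    return executor_cores + node_vcores, executor_cores


# Reproduce the four rows of the table.
for config in [(4, 0), (8, 0), (4, 3), (8, 3)]:
    print(config, spark_core_split(*config))
```

The takeaway is that the "core count" used on the charts is always larger than the number of cores doing executor work, and the gap is largest (50%) in single-node mode.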

What Has Changed Over the Last 6 Months?

Before we dig into the results, it’s worth noting that all engines have shipped various changes since December ‘24. I’ll focus on a few key performance-related features or notable updates for each:

Fabric Spark:
- The Native Execution Engine was GA’d at Build ‘25. This included a number of optimizations and provides greater coverage of native operators (e.g., Deletion Vectors).
- Snapshot Acceleration: Phase 1 of efforts to reduce the cold query overhead of interacting with Delta tables has shipped. This can be enabled via spark.conf.set("spark.microsoft.delta.snapshot.driverMode.enabled", True). This cuts the overhead of Delta table snapshot generation (the process of identifying and caching the list of files that are active in the version of the table being queried) by ~50%. Note: this feature is currently disabled by default. I recommend enabling this config for all workloads.
- Automated Table Statistics: These table-level statistics are collected synchronously as part of write operations to better inform the Catalyst cost-based optimizer in Spark about optimal join strategies. I’ve elected to disable auto stats collection for this benchmark since this is not a “write less, query often” workload that would have clear benefit from table statistics (if running a battery of SELECT statements or complex DML, I would certainly enable it).
DuckDB:
- External File Cache: Shipped as part of 1.3.0, this allows files to be cached on disk to avoid needing to make the more expensive hop to read data from cloud object stores for repeat queries to the same files. This is fundamentally the same feature as the Intelligent Cache in Fabric Spark.
- The DuckDB extension for Delta shipped a number of perf improvements around file skipping and pushdown.
- Still no native ability to write to Delta tables, but we can continue to use the Delta-rs Python library.
Polars:
- Polars shipped a new streaming engine: https://github.com/pola-rs/polars/issues/20947
- Since v1.14, the Polars Delta reader now leverages the Polars Parquet reader and is thus no longer dependent on Delta-rs for reading Delta tables.
- Polars still doesn’t support reading and writing to tables with Deletion Vectors.
Daft:
- Daft’s new streaming engine, codename “Swordfish,” is the default in v0.4: https://blog.getdaft.io/p/swordfish-for-local-tracing-daft-distributed
Delta-rs:
- Still no Deletion Vector support :(. Make noise here: https://github.com/delta-io/delta-rs/issues/1094

Where Do Things Stand?

On 7/2/25 I reran the benchmark with a few changes:

Delta-rs 1.0.2 was used instead of 0.18.2.

ELTBench was updated so that every engine uses the exact same pseudo-sampling logic for the input to the merge statement. Since Polars doesn’t support a lazy sample function, it previously used its own custom sampling logic. All of the engines now use the same exact DIY sampling logic.

Polars was upgraded to 1.31.0

With the above changes, particularly the upgrade to Delta-rs v1, the non-distributed engines generally improved the most (the Delta-rs Rust engine in v1 is now mature enough to avoid performance regressions, whereas in 0.18.2 the PyArrow engine was typically faster or at least prevented OOM).
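The exact DIY sampling logic isn't spelled out here, so the following is only an illustration of one common way to approximate a sample without a sample() function: a deterministic filter on a key column, which any engine can express as an ordinary lazy predicate. The function name and the choice of `ss_ticket_number` as the key are assumptions for the example, not necessarily what ELTBench does.

```python
def pseudo_sample(rows, key, fraction, buckets=1000):
    """Keep roughly `fraction` of rows by bucketing an integer key.

    The predicate is a pure function of the row, so the same rows are selected
    on every run and in every engine, unlike a random sample.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    threshold = round(fraction * buckets)
    return [row for row in rows if row[key] % buckets < threshold]


rows = [{"ss_ticket_number": n} for n in range(10_000)]
sample = pseudo_sample(rows, key="ss_ticket_number", fraction=0.01)
```

Because the filter depends only on the key value, each engine selects the same rows, which keeps the merge input identical across engines and makes the comparison fairer than per-engine random sampling.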

140MB Scale

At the 140MB scale (not tested in my benchmark from December ‘24), all single-machine engines are quite close in performance and handily beat Spark.

Polars is ~2x faster than DuckDB and Daft at 2 and 4-vCores. At 8-vCores, all non-distributed engines are decently close.

140MB Scale @ 4-vCores - Phase Detail

Spark is significantly (2-5x) slower at all write operations.
Polars somehow ran the ad-hoc query in 146 ms. It barely shows up on the chart, which is absolutely mind-blowing!
Spark took the bronze at completing the ad-hoc query, beating DuckDB. That is somewhat surprising given how much faster the single-machine engines were at the write operations.

1.2GB Scale

We are beginning to see that Spark is starting to catch up in aggregate but still has a ways to go.

Fabric Spark beats Daft, the “Spark killer”, at 8-vCores, but DuckDB and particularly Polars still have a massive advantage.
While Fabric Spark doesn’t give the option to run Spark on 2-vCores, at 4-vCores Spark is the slowest, but it’s worth noting that only 1/2 of the node’s cores are allocated as executor cores in single-node mode, meaning that Spark is operating at 1/2 the compute power.

1.2GB Scale @ 8-vCores - Phase Detail

Looking at the detail by phase, a couple of observations:

Again we see that Spark is not the fastest at any of the phases, however it’s also not the slowest. Fabric Spark beat DuckDB at the ad-hoc query, and beat Daft at 2 of 3 write phases.
I’m again stunned by Polars…

12.7GB Scale

Now at 12.7GB scale, we see Fabric Spark with the Native Execution Engine start to flex its muscles as the data scale grows to what I’d consider the peak of the “small data” range:

Spark was the fastest engine to complete every compute scale without running out of memory (OOM), with DuckDB close behind.
Polars leaves me perplexed. It somehow beat Spark at the 16 and 32-vCore compute scales, yet it also ran into OOM below 16-vCores.
DuckDB was the only non-distributed engine to complete the benchmark at 2-vCores.
I will again highlight that Spark at 4 and 8-vCores is running in single-node mode and only 1/2 of the machine’s cores and RAM are allocated to executors. The reason I point this out again is that this is a platform configuration (which conceptually could change), and with only 50% of the available compute being used, Spark is on par with or beating the non-distributed engines. If all cores were allocated to executors, I’d expect Spark to decisively win at this scale and compute size.
Lastly, a note on the importance of upgrading your composable data stack (here, the reality that Delta-rs is used to write DuckDB in-memory data to Delta format): before upgrading to Delta-rs v1, DuckDB ran into OOM at the 2 and 4-vCore scale. After upgrading, with DuckDB able to leverage the more efficient Rust-based engine in Delta-rs, it had no problem running the tests at 2 and 4-vCore compute scales.
Daft trails the competition by a wide margin. I absolutely love Daft’s vision, but I’m just not seeing it in the perf department.

Note: the 'PyArrow' Delta-rs engine was used instead of the newer 'Rust' engine for engines that don't directly support writing to Delta (in version 0.18.2). The Rust engine had nearly the same performance but resulted in OOM at 8-vCores, whereas PyArrow didn't have any issues at this compute size.
In Delta-rs V1 the Rust engine is the only engine option.

12.7GB Scale @ 16-vCores - Phase Detail

Looking at the detail from the 16-vCore tests:

Polars and Daft tie at completing the ad-hoc query.
Fabric Spark comes in 2nd place at 2 of 3 write phases.
Polars was either the fastest or tied at every phase.
Daft took significantly longer to load the 5 Delta tables.

General Observations

As I noted the last time I ran this benchmark, VACUUM is significantly slower in Spark. On the off chance that you aren’t using Deletion Vectors in Fabric, you could use the Delta-rs library to vacuum your tables.
OPTIMIZE is generally faster via Delta-rs. The reason for this is primarily that the Native Execution Engine doesn’t support the entire compaction code path and results in two fallbacks to execution on the JVM. I anticipate this will get much faster once we ship support for this code path.
In all benchmarks where Polars didn’t run into OOM, it was consistently the fastest engine.
Spark and DuckDB were the only engines to complete the entire battery of benchmark scenarios without a single out-of-memory exception. Maybe unsurprising for DuckDB, which isn’t JVM based, but for Spark this is the result of the Native Execution Engine’s highly efficient use of columnar memory, outside the JVM. Where JVM memory is needed for any fallbacks (e.g., when running OPTIMIZE), memory is dynamically allocated between on-heap and off-heap as needed.
Spark consistently sees greater relative improvement in execution time via adding more compute as compared to the other engines.

Which Engine Gained the Most Ground Since December ‘24?

While all engines got much faster, Polars followed by Fabric Spark with the Native Execution Engine saw the greatest performance gains relative to December ‘24. Polars got so much faster that I honestly questioned whether or not there was a bug in my code resulting in less data being written or LazyFrames that were never triggered.

So Is It Time to Ditch Spark?

While the non-distributed engines, particularly Polars and DuckDB, are very competitive or even faster than Spark at most small data benchmarks, there are a few reasons why I would still use Spark with the Native Execution Engine in most small data scenarios:

Maturity: What the perf numbers don’t highlight is the amount of work involved to get the benchmark to run successfully. Daft, DuckDB, and Polars all required significantly more time than Fabric Spark to get the same code from December ’24 running on the latest engine versions. I didn’t have to change a single thing in Spark — it just ran. And with zero effort (thanks to the engineering investment from Microsoft), my code ran ~2x faster.
- Daft had all sorts of issues with authenticating to storage (GitHub Issue: 4692). After a few hours I gave up and reverted to using ADLS Gen2. Daft also broke after upgrading to Delta-rs v1, as it references a method that no longer exists in v1 (GitHub Issue: 4677). On the code front, the only feature support issue I had with this benchmark was that it doesn’t have a random value function. On adding support for TPC-DS and TPC-H benchmarks in LakeBench, I’ve found that Daft SQL is very immature — it gets tripped up easily (no support for CROSS JOINs and frequent data type casting issues that other engines don’t have).
- Polars code required some light refactoring to use the new streaming engine. Polars also required me to refactor the existing benchmark as it doesn’t support LazyFrame.sample and doesn’t have a random value function. My only other issue was navigating the OOM errors.
- DuckDB also had periodic issues authenticating to storage. At the larger data scale, tasks seemed to get stuck — almost like the auto-generated token was no longer valid — but would just keep running until I manually canceled the job. Upgrading to Delta-rs v1 required removing the engine parameter and possibly introduced this error: InvalidInputException: Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result. Refactoring the code to explicitly establish a DuckDB connection and create my own storage secret fixed this, but it’s extremely hard to tell what the exact root cause was — DuckDB, Delta-rs, or ultimately a Fabric token issue.
Triaging Support: Imagine that you have a query that has been running for a while and you just want to know what’s going on or what’s actually running at that moment. In Spark, you can simply look at the in-cell task metrics to see that things are happening or open the Spark UI to get full details on what’s currently running and what has run. For the non-distributed engines, I had multiple cases of wanting to know what it was actively doing — and there’s zero visibility. Fine for any operation that runs in <1 minute, but for anything longer, the lack of visibility is just like rolling dice, hoping you wrote the code well and that your compute size will work out. Want to look at logs to see what’s already happened or the details of a prior session? Good luck.
DIY Composable Data Systems == More Management Overhead: First of all, I love the idea of the composable data stack — if you aren’t familiar with it, give Wes McKinney’s blog a read. Having pluggable components in your stack makes it more flexible and allows you to leverage the best of open source. Fabric takes advantage of this by using Velox and Apache Gluten as foundational components of the Native Execution Engine to accelerate Spark. But this is all managed for users — no need to test and choose versions, perform upgrades, roll out changes, etc. I’m beginning to love DuckDB (and Polars — I’m blown away by its recent perf gains), but what I don’t love is the necessity to stitch together different technologies just to get something simple to work. DuckDB is the most robust non-distributed engine at reading Delta format, but it doesn’t natively write to Delta. You can cast DuckDB relations to Arrow format so that Delta-rs can take over and do the write, but there are at least four different ways to do it (arrow, fetch_record_batch, fetch_arrow_reader, record_batch) and the documentation is poor at explaining the differences and best practices. What DuckDB natively supports is fantastic, but when you need to complete the whole E2E data lifecycle, things start to get fragmented. As your stack gets fragmented with different technologies, you then need to manage compatibility — e.g., LakeBench installs Delta-rs v1.0.N for Polars and DuckDB but v0.25.5 for Daft.
Delta Feature Support: I look forward to the day when all these engines fully support features like Deletion Vectors for both reads and writes. Currently, DuckDB supports reading Deletion Vectors, but Delta-rs lacks support for writing them. Polars and Daft, as far as I know, do not support either read or write paths. In LakeBench, the telemetry logging table is configured with Deletion Vectors disabled to ensure compatibility across all engines for writing logs. Relying on the lowest common denominator of features can be quite limiting and frustrating.
Future Data Growth: In most cases, small data will grow into big data — or at least into data of a scale where distributed engines are necessary for decent perf. If you have small data today, consider the rate of possible growth and whether it makes sense to start with distributed-capable compute like Spark. You can start on single-node configs to keep costs low and seamlessly scale out to multiple nodes as your data volumes grow.

Just to add some data growth sanity to this benchmark, let’s consider if our largest scale tested grew 10x from 12.7GB to ~ 127GB (2.8B row transaction table).

Which engine wins at the 127GB scale?

All engines were tested on 16, 32, and 64 total cores (Spark w/ 7x8-vCore Workers + 1 8vCore driver).

DuckDB was the only non-distributed engine to complete the benchmark, but it did result in OOM at 16 vCores. Polars ran into OOM just minutes into the job. Daft ran for over an hour and then failed.
Spark was the only engine to complete the 127GB scale on all compute sizes.
- Spark was ~ 3.5x faster than DuckDB at 32-vCores
- Spark was ~ 6x faster than DuckDB at 64-vCores

There we go, now we have our dose of “medium data” reality: Spark is still king. I was starting to sweat a bit there as the small data tests completed 😅.

So what’s my guidance here?

If you have uber-small data (i.e. up to 1GB compressed), you can be quite successful reducing costs and improving performance by using a non-distributed engine like Polars, DuckDB, or Daft. If your data is between 1GB and 10GB compressed, Spark with vectorization via the Native Execution Engine is super competitive perf-wise, much more fault- and constrained-memory-tolerant, and thus entirely worth leveraging. While DuckDB, Polars, and Daft all leverage columnar memory and vectorized execution via either C++ or Rust implementations, Fabric Spark with the Native Execution Engine (via Velox and Apache Gluten) does as well. And guess what? There are plenty of additional optimizations still planned for Fabric Spark and the Native Execution Engine that will continue to improve performance in the coming year. I look forward to seeing where things stand in 2026 😁.

Regardless of your current data scale, consider potential data growth, maturity, and feature support so you aren’t setting yourself up for a required engine replatform as your data grows beyond the bounds of being small or you require a more mature set of capabilities.

Elevate Your Code: Creating Python Libraries Using Microsoft Fabric (Part 2 of 2: Packaging, Distribution, and Consumption)

2025-03-26T00:00:00+00:00

This is part 2 of my prior post that continues where I left off. I previously showed how you can use Resource folders in either the Notebook or Environment in Microsoft Fabric to do some pretty agile development of Python modules/libraries.

Now, how exactly can you package up your code to distribute and leverage it across multiple Workspaces or Environment items? How could we accomplish something like the below?

Building / Packaging

While you can certainly run all of this code locally on your machine, everything I’ll show in this section will be 100% from the Fabric Notebook UI. Sure, doing some of this stuff locally can be more productive, but there’s something convenient—and a little magical—about being able to do everything in your browser.

Packaging a Python library results in a single compressed file, a “wheel” file with the .whl extension. For anyone new to Python, this is really just a ZIP archive (you can rename it to .zip and peek inside) that contains all of your Python modules, metadata, and references to any dependencies your library needs.
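To make the "it's just a ZIP archive" point concrete, here is a minimal sketch that opens a wheel-shaped archive with Python's standard zipfile module. The archive is built in memory so the example runs anywhere; the file names inside it are illustrative assumptions, not the output of a real build.

```python
import io
import zipfile

# Build a tiny in-memory archive shaped like a wheel.
# The entries below are illustrative, not produced by a real build.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as whl:
    whl.writestr("lakehouse_utils/__init__.py", "")
    whl.writestr("lakehouse_utils/utils.py", "def hello():\n    return 'hi'\n")
    whl.writestr("lakehouse_utils-0.1.0.dist-info/METADATA", "Name: lakehouse_utils\nVersion: 0.1.0\n")


def list_wheel_contents(wheel_file):
    """Return the sorted file names inside a wheel (a path or a file-like object)."""
    with zipfile.ZipFile(wheel_file) as whl:
        return sorted(whl.namelist())


wheel_files = list_wheel_contents(buffer)
print(wheel_files)
```

Pointing list_wheel_contents at the path of a real .whl file should work the same way, since a wheel is read exactly like any other ZIP file.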

Since all I had in the prior blog was a single utils.py module, I’ll need to add a couple of other files to support making this a packageable library.

__init__.py: Since the module is no longer in the root of the library folder, I need an __init__.py file. This is required for any folders within the root directory where you have modules that need to be included in the build process. This is an empty file.

setup.py – This Python file contains metadata about your library and instructions for packaging. Create it in the root of your library directory.

 from setuptools import setup, find_packages

 # Read the contents of your README file
 with open("README.md", "r", encoding="utf-8") as fh:
     long_description = fh.read()

 # Read the contents of the requirements.txt file
 with open("requirements.txt", "r", encoding="utf-8") as f:
     requirements = f.read().splitlines()

 setup(
     name="lakehouse_utils",
     version="0.1.0",
     author="Miles Cole",
     description="Example Python Library",
     long_description=long_description,
     long_description_content_type="text/markdown",
     url="",
     project_urls={},
     # Include every folder that contains an __init__.py
     packages=find_packages(),
     classifiers=[
         "Development Status :: 4 - Beta",
         "Programming Language :: Python :: 3",
         "Operating System :: OS Independent",
         "Topic :: System :: Benchmark",
         "License :: OSI Approved :: MIT License",
     ],
     python_requires=">=3.10",
     install_requires=requirements
 )

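In the above setup code, the name and version fields feed directly into the name of the resulting WHL file: lakehouse_utils-0.1.0-py3-none-any.whl. The remaining three segments are compatibility tags assigned by the build: the Python tag (py3), the ABI tag (none), and the platform tag (any). For a pure-Python library like this one, none and any mean the wheel isn’t tied to a specific ABI, operating system, or CPU architecture.
  f"{name}-{version}-{python_tag}-{abi_tag}-{platform_tag}.whl"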
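As a quick illustration of that naming pattern, the sketch below splits a wheel file name back into its pieces. It follows the standard wheel convention (name, version, Python tag, ABI tag, platform tag) and deliberately ignores the optional build tag, so treat it as a simplified helper rather than a full parser; parse_wheel_name is a hypothetical name used only for this example.

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its five standard components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        # Wheels with an optional build tag have six parts; not handled in this sketch
        raise ValueError("expected name-version-python-abi-platform")
    keys = ("name", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


wheel_info = parse_wheel_name("lakehouse_utils-0.1.0-py3-none-any.whl")
print(wheel_info)
```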

Anytime you are making code changes you should evaluate if it is a major (0.1.0 → 1.0.0), minor (0.1.0 → 0.2.0), or revision (0.1.0 → 0.1.1) to your existing code and then update the version metadata in setup.py accordingly.
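If it helps to see the bump rules spelled out, here is a small sketch of them. bump_version is a hypothetical helper written for this post, not part of setuptools, and it assumes a plain three-part version string.

```python
def bump_version(version, change):
    """Return the next version string for a major, minor, or revision change."""
    major, minor, revision = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "revision":
        return f"{major}.{minor}.{revision + 1}"
    raise ValueError(f"unknown change type: {change}")


for change in ("major", "minor", "revision"):
    print(change, bump_version("0.1.0", change))
```

The result would then be copied into the version field of setup.py before building.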

requirements.txt – This simple text file lists any dependencies your library requires. My module is pretty simple, but here’s an example of what this file might look like:
```
 sqlglot==25.23.0
 JayDeBeApi==1.2.3
```
Even if you don’t have dependencies yet, I still recommend including an empty requirements.txt file. This way, you won’t need to refactor anything later when you eventually do.
README.md: Technically optional, but required from a human decency perspective. Be kind to the future developer (or your future self!) who might inherit your work—add a README!

After creating the basic structure, it could look something like the below:

lakehouse_utils/
├── lakehouse_utils/
│   ├── __init__.py # tells the build process that this directory contains a module in scope for packaging
│   └── utils.py # source code
├── README.md # documentation
├── requirements.txt # dependencies
└── setup.py # build instructions

If I had not put utils.py in a folder in the root called lakehouse_utils, the eventual import statement would’ve been import utils. To make the import more descriptive and avoid ambiguity I moved utils into a subfolder called lakehouse_utils so that the import statement becomes import lakehouse_utils.utils.

Now that the structure is in place, let’s build the library. I like to add the following code into the same Notebook used for developing and testing the module. That way, I can make a quick change, generate a new build, and finish by publishing the new version to an artifact repo—all in one Notebook.

install_packaging_libs = !pip install setuptools wheel

import os
# Change directory to the library's path
os.chdir('/synfs/nb_resource/builtin/lakehouse_utils') 

# Clean the build directory
!python setup.py clean --all
# Build the wheel file
!python setup.py bdist_wheel

Just update the path to your library’s root directory based on where it lives:

If using Notebook Resources: /synfs/nb_resource/builtin/

os.chdir('/synfs/nb_resource/builtin/lakehouse_utils') 

If using Environment Resources: /synfs/env/
```
os.chdir('/synfs/env/lakehouse_utils') 
```
This results in a .whl file being generated in a new ./dist/ (distribution) folder. From here, we can install it directly before publishing to an artifact repository.

 %pip install '/synfs/nb_resource/builtin/lakehouse_utils/dist/lakehouse_utils-0.1.0-py3-none-any.whl'

Distributing

Are we done yet?? Not unless you enjoy manually uploading your newly minted library to various Environment items and worrying about keeping things in sync as you have new versions to publish.

Rather than manually distribute your library, the best practice is to publish it to a central artifact repository. When apps or Notebooks need it, they simply fetch the trusted version automatically.

This has major benefits:

Trust – Manually sharing .whl files is risky. Someone could overwrite, corrupt, or even maliciously tamper with the package. Centralized repositories like PyPI or Azure DevOps Artifact Feeds offer access control, provenance, usage stats, and a tag classification system.
Versioning – Since versions are immutable by default, you can rely on consistent behavior over time. Once published, the code won’t change unless you explicitly choose to upgrade to a newer version.
Single source of truth – One place to publish. One place to consume. One less governance headache.

Could we publish this to PyPI for public distribution? Sure, but most organizations do not open-source their code given that it is often organizationally specific in nature. Therefore, I’ll be showing how you can publish libraries to a private repository. In this case I’ll be using Azure DevOps Artifacts as the hosting service, but this same process generally applies to any other service: you need to provide authentication and use a specific API to publish your library.

For those who are GitHub fans, GitHub sadly doesn’t support Python libraries in its artifact repository service.

Setting up an Azure DevOps Artifact Feed

There are two very basic steps to follow that the ADO docs effectively illustrate:

Publishing the library

I’m referencing my Azure DevOps PAT token stored in Azure Key Vault to avoid storing any credentials in plain text. Run the code below to publish:

import subprocess
import sys

# Input Params
ado_org_name = 'milescole'
ado_project_name = 'library_dev_demo'
ado_artifact_feed_name = 'DataForge'
key_vault_name = 'mcoleakvwcus01'
key_vault_pat_secret_name = 'milescole-ado-pat'
whl_path = "/synfs/nb_resource/builtin/lakehouse_utils/dist/lakehouse_utils-0.1.0-py3-none-any.whl"

repo_url = f"https://pkgs.dev.azure.com/{ado_org_name}/{ado_project_name}/_packaging/{ado_artifact_feed_name}/pypi/upload/"
artifact_pat = notebookutils.credentials.getSecret(f"https://{key_vault_name}.vault.azure.net/", key_vault_pat_secret_name)

# Install twine and wheel
install_publishing_libs = !pip install twine wheel

# Publish Library
result = subprocess.run([
    sys.executable, "-m", "twine", "upload", "--verbose",
    "--repository-url", repo_url,
    "-u", "__pat__", "-p", artifact_pat,
    whl_path
], capture_output=True, text=True)

stdout = result.stdout or ""
stderr = result.stderr or ""
combined_output = stdout + stderr
print(combined_output)

The result confirms the library upload was successful:

If we check Azure DevOps, we’ll find that the latest version now appears in the Artifact feed:

We then assign a minimum of Feed Reader permissions to consumers so they can access and install the package:

Using a Private Artifact Repository in Fabric

Alright, so we’ve got our library safely tucked into our fancy Artifact feed—how do we actually use it inside Microsoft Fabric?

While Environment items don’t currently support private feeds, you can install the library from a Notebook using a pip command.

Normally %pip can’t be parameterized, but we can work around that using get_ipython().run_line_magic()—a neat trick that lets you run magics inline with Python code.

# Input params
ado_org_name = 'milescole'
ado_project_name = 'library_dev_demo'
ado_artifact_feed_name = 'DataForge'
key_vault_name = "mcoleakvwcus01"
key_vault_pat_secret_name = "milescole-ado-pat"
library_name = "lakehouse-utils"
library_version = "0.1.0"
# Get PAT
artifact_pat = notebookutils.credentials.getSecret(f"https://{key_vault_name}.vault.azure.net/", key_vault_pat_secret_name)
# Execute PIP
install = get_ipython().run_line_magic("pip", f"install {library_name}=={library_version} --index-url=https://{ado_artifact_feed_name}:{artifact_pat}@pkgs.dev.azure.com/{ado_org_name}/{ado_project_name}/_packaging/{ado_artifact_feed_name}/pypi/simple/")

Easy, right? If you don’t need parameters, you can reduce it to two lines:

artifact_pat = notebookutils.credentials.getSecret("https://mcoleakvwcus01.vault.azure.net/", "milescole-ado-pat")
install = get_ipython().run_line_magic("pip", f"install lakehouse-utils==0.1.0 --index-url=https://DataForge:{artifact_pat}@pkgs.dev.azure.com/milescole/library_dev_demo/_packaging/DataForge/pypi/simple/")
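Since the index URL embeds the PAT, it can be worth assembling it in one place and never printing it unmasked. The sketch below is one way to do that with only the standard library; build_index_url and mask_secret are hypothetical helpers, and URL-encoding the token is a defensive assumption on my part rather than something the feed is known to require.

```python
from urllib.parse import quote


def build_index_url(org, project, feed, pat):
    """Assemble the pip index URL for an Azure DevOps Artifacts feed."""
    # URL-encode the token in case it ever contains characters that are not URL-safe
    return (
        f"https://{feed}:{quote(pat, safe='')}@pkgs.dev.azure.com/"
        f"{org}/{project}/_packaging/{feed}/pypi/simple/"
    )


def mask_secret(text, secret):
    """Replace the secret with asterisks so the text is safe to print or log."""
    return text.replace(secret, "***")


# "not-a-real-token" is a placeholder value for illustration only
index_url = build_index_url("milescole", "library_dev_demo", "DataForge", "not-a-real-token")
print(mask_secret(index_url, "not-a-real-token"))
```

The returned URL could be passed to the same run_line_magic call shown above via --index-url.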

Now all that is left is to import the library and you’re off and running with being able to take advantage of modular, governed, and easily downloadable code assets.

import lakehouse_utils.utils

Note: If your private package includes dependencies from PyPI, they’ll be automatically mirrored into your artifact feed—effectively giving you a private backup.

Library Versions

Now, if the value of this whole effort still isn’t totally clicking, let’s explore one more thing that’s truly the bee’s knees: library versioning.

So far, I’ve published version 0.1.0 of my lakehouse-utils library. Now imagine this: my company decides to start using this beta version in production 😬. Sure enough, feedback starts pouring in from other devs—feature requests, bug reports, naming complaints, the usual. I go back to the drawing board, roll up my sleeves, and after a few minor and patch updates, I finally ship the first stable, non-beta version, 1.0.0.

Life is good. Everywhere I go, people give me that subtle nod—you know the one that says “yeah, we know… the library is out of beta now.” I start walking a little taller. I’m basically a celebrity.

But then, back to reality: how do we actually start using this shiny new version, especially since it includes some breaking changes as part of its rise to glory in the annals of artifact repos?

Well, first consider what our library version history looks like in Azure DevOps. We’ve got every published version sitting there nicely. It’s beautiful.

And here’s where it gets powerful: maintaining older versions means we can continue building and testing new functionality in dev using 1.0.0, without breaking everyone else. Once testing wraps up, we promote the changes to UAT with a reference to the newer version. No need to deploy the library itself, we only deploy the reference to the version number. Meanwhile, the other data teams—deep in the throes of their quarterly ping-pong tournament—don’t even need to worry. Their code can keep humming along with the older version until they’re ready to upgrade on their own schedule.
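Deploying "only the reference to the version number" can be as simple as a per-environment pin. The sketch below is a hypothetical illustration: the environment names and the PINNED_VERSIONS mapping are assumptions, and in practice the pins might live in a deployment parameter or config file instead.

```python
# Hypothetical pins: dev is testing 1.0.0 while UAT and prod stay on the older release
PINNED_VERSIONS = {"dev": "1.0.0", "uat": "0.1.0", "prod": "0.1.0"}


def requirement_for(environment, package="lakehouse-utils", pins=PINNED_VERSIONS):
    """Return the pip requirement string for the given environment."""
    return f"{package}=={pins[environment]}"


for env in PINNED_VERSIONS:
    print(env, requirement_for(env))
```

Promoting to UAT then means changing one value in the mapping; the library itself is never redeployed.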

In short: versioning gives you the power to move fast without breaking things, even when Jim from Procurement Analytics is too busy celebrating his huge win to adopt what might be the most glorious package release to grace the halls of our archaic IT org.

Was it worth it?

Okay, maybe you’re thinking: “This seems unnecessarily complex. Why not just use the %run magic command to inject some code from another Notebook and call it a day?”

That’s a fair question—and really, it boils down to this:

Do you want to be a good data engineer, or a great one?

Do you want to build something that works for a few months or maybe a year, only to require a complete rewrite when the data model changes, the team grows, or business needs evolve? Or do you want to build something that scales with your organization, stands the test of time (at least until AI takes all of our jobs and we get plugged into the Matrix), and—dare I say—brings joy (or minimally appreciation) to the next engineer who inherits it?

The fundamental process that I used—Develop → Package → Distribute → Install—isn’t something I just made up. It’s how every piece of mature software on the planet Earth is shipped and consumed.

Spark source code doesn’t get manually copy-pasted to each VM when your cluster spins up by some guy named George. Pandas didn’t become the most widely used DataFrame library because someone shared a .py file on a Google Drive. And if you browse today’s open-source ecosystem, nearly everything worth using started with a dev like you or me, who had an amazing idea, followed standard SDLC practices, and decided it was worth sharing with the world.

Now, let me climb down off my soapbox for a second 😅

Yes, there are great uses for %run. No, not everyone is aspiring—or needs—to be a great data engineer. And maybe you don’t care about publishing packages, governance, or modular design—and that’s okay.

All I’m saying is this: evaluate what you’re trying to build.
If your goals include things like:

“mature software development”
“data mesh architecture”
“modular, reusable code”
“cross-workspace distribution”
“organizational data operations”
“unit testing”

…then maybe, just maybe, you should consider doing what every successful tech org has done for at least the last decade:

Treat data engineering a bit more like software engineering.

And if that still came across a little too strong, here’s a friendly list to wrap it up:

Cross-workspace, cross-tenant, or even 100% public distribution of code assets
The more seasoned a data engineer becomes, the more they think in terms of scalability, flexibility, and modularity. Why rewrite the same logic ten times with slight variations when you could write it once, publish it, and reuse it safely across your org?
Minimized latency for code reuse
%run gets slower the more cells it has to inject. For complex ELT logic or large utility libraries, it quickly becomes a performance bottleneck—especially in interactive workflows.
ALM capabilities
Once Fabric adds Git support for Resource folders, you’ll be able to integrate automated unit tests, packaging, and artifact publishing right into your CI/CD pipelines. Until then, manual builds from a Notebook are a huge step in the right direction.

Mastering Spark: The Art and Science of Table Compaction

2025-02-26T00:00:00+00:00

If there’s anything that data engineers agree about, it’s that table compaction is important. Often one of the first big lessons that folks will learn early on is that not compacting tables can present serious performance issues: you’ve gotten your lakehouse pilot approved and it’s been running for a couple months in production and you find that both reads and writes are getting slower and slower while your data volumes have not increased drastically. Guess what, you almost surely have a “small file problem”.

What engineers won’t always sing the same tune on is how and when to perform table compaction. There are really 6 approaches I see when looking generally at any platform using log-structured tables like Delta, Hudi, or Iceberg:

No Compaction: We’ve all been there at some point in our career, no shame. You came from using SQL Server or Oracle with nice clustered indexes where any infrequent table rebuild operations were handled by a company DBA. Life was easy. While not a good option, it’s important to understand the impact of not having any compaction strategy. Yes, it’s a slow burn that takes you deeper and deeper down the poor performance rabbit hole.
Pre-Write Compaction: Rather than needing to compact files, introduce a pre-write shuffle of data that ensures optimal sized files are written. In Delta this feature is called Optimized Write.
Post-Write Manual Compaction: As part of your jobs you’ve coded an OPTIMIZE (and possibly a VACUUM) operation to run after every table that is written to.
Scheduled Compaction (Manual): Just as it sounds, you schedule a job, maybe on a weekly basis, that will loop through all tables and run OPTIMIZE.
Automatic Compaction: A feature of the log-structured table format that automatically evaluates whether compaction is needed and runs it synchronously (or async in the case of Hudi) following write operations.
- Delta Lake: Auto Compaction is disabled by default but can be enabled to run synchronously, as needed, after writes. Here are all the basics on Auto Compaction in Delta Lake:
- Hudi: Compaction runs automatically (async) by default, as needed, after writes.
- Iceberg: Compaction in Iceberg is only supported as a user-executed operation; there’s no support for automatic maintenance here. Ironically, the Iceberg docs even list compaction under Optional Maintenance. This seems a bit shortsighted, as there’s no technical reason why Iceberg users wouldn’t suffer from small file issues just like Delta and Hudi users.
Background Platform Managed Compaction: The first thing that comes to mind is S3 Tables (AWS proprietary fork of Iceberg) with its heavily marketed managed compaction feature. You write and query your tables and we will charge you an exorbitant amount to perform background compaction jobs so you don’t need to worry about table maintenance! While AWS may have gotten some flak for their pricing ($0.05 per GB + $0.004 per 1,000 files processed) and for overmarketing a feature that Hudi and Delta already solve for, not needing to manage or even configure compaction is a wonderful thing since it reduces the complexity and experience needed to implement a performant solution.

So, there are plenty of options for ensuring tables are appropriately sized. But is there a best practice option when using Fabric Spark and Delta Lake? Let’s find out.

The Case Study

To study the efficiency and performance implications of various compaction methods, I formed a benchmark to study the effects of the following 4 scenarios:

No Compaction
Pre-Write Compaction (a.k.a Optimized Write)
Scheduled Compaction
Automatic Compaction

I ran all tests using target batch sizes of 1K, 100K, and 1M rows. Each test consisted of running 200 back-to-back iterations of the below phases to imitate a table that has been updated long enough to start seeing small file issues:

Merge Statement: data is generated with a target row count with +/- 10% random variance in batch size and is merged into the target table with 10% of the input records being updates and the rest being inserts.

 from delta.tables import DeltaTable
 from pyspark.sql import functions as sf

 # start_range, end_range, iteration_id, and auto_compaction_enabled are set by the surrounding benchmark loop
 data = spark.range(start_range, end_range + 1) \
     .withColumn("category", sf.concat(sf.lit("category_"), (sf.col("id") % 10))) \
     .withColumn("value1", sf.round(sf.rand() * (sf.rand() * 1000), 2)) \
     .withColumn("value2", sf.round(sf.rand() * (sf.rand() * 10000), 2)) \
     .withColumn("value3", sf.round(sf.rand() * (sf.rand() * 100000), 2)) \
     .withColumn("date1", sf.date_add(sf.lit("2022-01-01"), sf.round(sf.rand() * 1000, 0).cast("int"))) \
     .withColumn("date2", sf.date_add(sf.lit("2020-01-01"), sf.round(sf.rand() * 2000, 0).cast("int"))) \
     .withColumn("is_cancelled", (sf.col("id") % 3 != 0))

 delta_table_path = f"abfss://@onelake.dfs.fabric.microsoft.com/.Lakehouse/Tables/auto_compaction/{iteration_id}"

 if not DeltaTable.isDeltaTable(spark, delta_table_path):
     # First iteration: create the table, optionally with Auto Compaction enabled
     data.createOrReplaceTempView("input_data")
     if auto_compaction_enabled:
         ac_str = "TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')"
     else:
         ac_str = ""

     spark.sql(f"""
         CREATE TABLE mcole_studies.auto_compaction.`{iteration_id}`
         {ac_str}
         AS SELECT * FROM input_data
     """)

     delta_table = DeltaTable.forPath(spark, delta_table_path)
 else:
     # Later iterations: merge the new batch into the existing table
     delta_table = DeltaTable.forPath(spark, delta_table_path)

     delta_table.alias("target").merge(
         source=data.alias("source"),
         condition="target.id = source.id"
     ).whenMatchedUpdateAll() \
      .whenNotMatchedInsertAll() \
      .execute()
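The snippet above takes start_range and end_range as given. As a rough sketch of how they could be derived to get a +/- 10% batch size variance with 10% of each batch overlapping existing ids, here is one hypothetical approach in plain Python. next_batch_range and its parameters are my assumptions for illustration and not necessarily how the benchmark computes them.

```python
import random


def next_batch_range(max_existing_id, target_rows, rng, variance=0.10, update_share=0.10):
    """Return (start_range, end_range) for the next batch, inclusive on both ends."""
    # Batch size varies randomly within +/- variance of the target
    batch_rows = round(target_rows * rng.uniform(1 - variance, 1 + variance))
    # Rows that overlap ids already in the table become updates (none on the first batch)
    update_rows = min(round(batch_rows * update_share), max_existing_id + 1)
    start_range = max_existing_id - update_rows + 1
    end_range = start_range + batch_rows - 1
    return start_range, end_range


rng = random.Random(42)  # seeded so the example is repeatable
max_id = -1  # no rows exist before the first batch
for iteration in range(1, 4):
    start_range, end_range = next_batch_range(max_id, 1000, rng)
    print(iteration, start_range, end_range)
    max_id = end_range
```

Because spark.range excludes its upper bound, the snippet above passes end_range + 1, which matches the inclusive end_range returned here.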

Aggregation Query: The query touches every column in the table and does not have any filter predicates to ensure that all files in the current Delta version are included in scope.

 select 
     sum(value1), 
     avg(value2), 
     sum(value3), 
     max(date1), 
     max(date2), 
     category 
 from mcole_studies.auto_compaction.`{iteration_id}`
 group by all

Compaction: only applicable for the Scheduled Compaction test, every 20 iterations the OPTIMIZE command is executed.
```
 spark.sql(f"OPTIMIZE delta.`{delta_table_path}`")
```

For each phase of the iteration I logged the duration and count of files in the active Delta version.
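For the duration half of that logging, a small context manager is enough. The sketch below is a hypothetical stand-in for the benchmark's own telemetry: timed_phase and phase_log are names invented for this example, and the sum call simply simulates work where the merge or query would run.

```python
import time
from contextlib import contextmanager

phase_log = []  # one dict per completed phase


@contextmanager
def timed_phase(iteration, phase, log=phase_log):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log.append({
            "iteration": iteration,
            "phase": phase,
            "duration_s": time.perf_counter() - start,
        })


with timed_phase(1, "merge"):
    sum(range(100_000))  # stand-in for the merge call

print(phase_log)
```

The active file count could be captured alongside each entry, for example from the numFiles column returned by DESCRIBE DETAIL on the table.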

Active File Count - 1K Row Batch Size

Before getting into the performance comparison of running these tests, let’s baseline how each scenario impacts the number of files written:

The following charts intentionally use the same Y axis max value for evaluating the magnitude of impact.

No Compaction

As expected, since we aren’t performing any maintenance, the count of parquet files in the active Delta version increases linearly. After 200 iterations, we have 3,001 files.

Scheduled Compaction

With compaction scheduled to run every 20th iteration, the final file count is 1 due to it ending on a compaction interval. The file count peaks at > 300 right before each compaction operation is run.

Automatic Compaction

With Auto Compaction, based on this workload, we see that every 4 iterations results in a synchronously run mini-compaction job. After 200 iterations we have 47 files. This makes sense, as by default auto compaction triggers whenever there are 50 or more files below 128MB.
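The every-4-iterations cadence falls out of simple arithmetic, which the toy simulation below walks through. It assumes exactly 16 small files per write, a threshold of 50, and that each compaction leaves a single file that is still "small" at this data scale; real writes vary a little, so this reproduces the cadence rather than the exact file counts observed.

```python
def simulate_auto_compaction(iterations, files_per_write=16, min_num_files=50):
    """Return (iterations where compaction ran, small-file count after each iteration)."""
    small_files = 0
    compaction_iterations = []
    file_counts = []
    for i in range(1, iterations + 1):
        small_files += files_per_write
        if small_files >= min_num_files:
            # Compaction rewrites all small files into one (still small) file
            small_files = 1
            compaction_iterations.append(i)
        file_counts.append(small_files)
    return compaction_iterations, file_counts


compactions, counts = simulate_auto_compaction(12)
print(compactions)
print(counts)
```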

Automatic compaction certainly produces the most optimal file layout after 200 iterations: it has by far the lowest standard deviation of file count, which will result in more consistency in both write and read performance.

Performance Comparison - 1K Row Batch Size

No Compaction

Without any compaction, by iteration 44 the write duration has doubled and by iteration 200 the merge operation now takes nearly 5x longer to complete. Reads were impacted less, but by the last iteration had surpassed being 1.5x slower.

Scheduled Compaction

With compaction every 20th iteration, we see that the performance of both writes and reads gets slower until the compaction operation runs.

Automatic Compaction

With automatic compaction, just like how there’s the lowest standard deviation in the active file count, we also see that performance is extremely stable. Both the write and query duration from start to end have no discernible upward trend. What is noticeable though is that every 4th write operation after the first takes over 2x longer, since the merge step is also performing the mini-compaction.

With the frequent mini-compactions taking place, this begs the question: can we avoid writing small files to begin with?

Optimized Write

If we refresh our knowledge on Optimized Write, the idea is that there’s a pre-write step where data is shuffled and grouped across executors to bin data together so that fewer files are written. This feature is critical for partitioned tables, however for non-partitioned tables there are even a few write scenarios where more files are typically written due to the nature of the operation, and optimized write can help prevent this:

MERGE statements
DELETE and UPDATE statements w/ subqueries

For this small batch size, optimized write results in one file being written each iteration rather than ~16. The small amount of data being shuffled pre-write has an immaterial impact on write performance and, more importantly, we can see that the performance from start to finish was extremely consistent.

Auto Compaction + Optimized Write

Is Optimized Write a replacement for Auto Compaction or Scheduled Compaction here? No. Consider if this process of merging 1K rows into a table were in production for 1 year running once every hour; after 1 year we would have 8,760 files in our table. Over the course of the year the performance of both reading and writing would become significantly slower. Given that we still need some sort of process to compact files post-write, what if we combined this feature with Auto Compaction?

With both features combined, we have fewer files written per iteration, which translates to auto compaction running less frequently. Once the number of small files exceeds 50, auto compaction runs. Now we get the best of both worlds :).
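As a sketch of what enabling both features on an existing table might look like, the helper below renders an ALTER TABLE statement that sets the two Delta table properties. auto_optimize_ddl and the table name my_table are hypothetical; delta.autoOptimize.autoCompact is the property used in the benchmark code, and I am assuming delta.autoOptimize.optimizeWrite is the matching property for Optimized Write in your runtime.

```python
def auto_optimize_ddl(table_name, auto_compact=True, optimized_write=True):
    """Render an ALTER TABLE statement that toggles both Delta table properties."""
    props = {
        "delta.autoOptimize.autoCompact": auto_compact,
        "delta.autoOptimize.optimizeWrite": optimized_write,
    }
    rendered = ", ".join(f"'{key}' = '{str(value).lower()}'" for key, value in props.items())
    return f"ALTER TABLE {table_name} SET TBLPROPERTIES ({rendered})"


# my_table is a placeholder table name
ddl = auto_optimize_ddl("mcole_studies.auto_compaction.my_table")
print(ddl)
```

In a Spark session the resulting string could be executed with spark.sql(ddl).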

File Count Impact

See below for a comparison of only enabling Optimized Write vs enabling the feature with Auto Compaction:

So What Method Won?

Auto Compaction + Optimized Write had the lowest total runtime, lowest standard deviation of file count, nearly the lowest standard deviation for queries, and the 3rd lowest standard deviation of write duration. By nearly all measures, the combination of avoiding writing small files (where possible) and automatically compacting small files was the winning formula.

Scenario	Duration (minutes)	Std. Deviation of File Count	Std. Dev. of Merge + Optimize Duration (seconds)	Std. Dev. of Query Duration (seconds)
No Compaction	33.27	864	2.90	0.70
Scheduled Compaction	14.63	89	0.61	0.35
Auto Compaction	14.51	17	1.40	0.21
Optimized Write	13.76	58	0.62	0.27
Auto Compaction + Optimized Write	12.77	14	0.74	0.24

While Scheduled Compaction was almost as fast as Auto Compaction, it’s important to consider the additional cost of coding, scheduling, optimizing the frequency of runs, and maintaining the maintenance job. With Auto Compaction on the other hand, just turn it on and you get the same benefit as a perfectly scheduled compaction job, but without any of the overhead and complexity.
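To put the durations from the table in relative terms, the short sketch below computes each scenario's speedup over No Compaction. The numbers are copied from the table above; the rounding to two decimals is my choice.

```python
# Total durations in minutes, copied from the results table
durations_min = {
    "No Compaction": 33.27,
    "Scheduled Compaction": 14.63,
    "Auto Compaction": 14.51,
    "Optimized Write": 13.76,
    "Auto Compaction + Optimized Write": 12.77,
}

baseline = durations_min["No Compaction"]
speedups = {name: round(baseline / minutes, 2) for name, minutes in durations_min.items()}

# Print the scenarios from fastest to slowest
for name, factor in sorted(speedups.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {factor}x")
```

Every compaction strategy finishes the 200 iterations in well under half the time of doing nothing, with the combined approach at roughly 2.6x.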

What about larger batch sizes?

I performed testing at both 100K and 1M row batch sizes. At 100K row batches the results are nearly identical to the 1K row batches. At 1M rows, Auto Compaction appeared to be running too frequently which resulted in much less of a performance benefit.

With auto compaction we now see that as our data volume increases we start to accumulate files that are right-sized (> 128MB). The active file count no longer returns to 1 file every 4 batches; instead it increases linearly and ends with 42 total files. The frequency of mini-compactions adapts as the data volume changes, based on the count of small files measured against a file count threshold (explained later).

Note: the below chart is on a zoomed-in Y-axis scale to better illustrate the bug.

As the iterations and number of compacted files increase, the frequency of compaction increases even given the same number of additive small files each iteration (~16). This is technically not per the documented functionality of the feature, and after interrogating the OSS Delta-Spark source code, I found that there’s a bug where compacted files are also counted towards the minNumFiles threshold. This means that anytime the total number of active files exceeds 50 (or whatever you set minNumFiles to), compaction will be triggered, even if you have fewer than 50 files that meet the “small file” criteria.
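
To make the difference concrete, here is a minimal pure-Python sketch of the trigger check as described above. It is not the Delta source code: the function name `should_auto_compact`, the `count_compacted_files` flag, and the example file sizes are all hypothetical, the 64 MB small-file cutoff assumes `minFileSize` is half of a 128 MB `maxFileSize`, and whether the real comparison is `>` or `>=` is not something this sketch claims to know.

```python
MB = 1024 * 1024

def should_auto_compact(file_sizes, min_num_files=50, min_file_size=64 * MB,
                        count_compacted_files=False):
    """Return True when the modeled auto compaction check would fire.

    count_compacted_files=True models the bug: every active file counts
    toward min_num_files. False models the documented behavior: only
    files smaller than min_file_size count.
    """
    if count_compacted_files:
        candidates = len(file_sizes)
    else:
        candidates = sum(1 for size in file_sizes if size < min_file_size)
    return candidates >= min_num_files

# Hypothetical table state: 40 already-compacted 128 MB files plus 16 new small files.
active_files = [128 * MB] * 40 + [4 * MB] * 16

buggy = should_auto_compact(active_files, count_compacted_files=True)   # 56 files counted
fixed = should_auto_compact(active_files, count_compacted_files=False)  # 16 files counted
```

With the same 16 small files, the buggy count crosses the threshold because the 40 compacted files are included, while the corrected count does not.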

⚠️ Due to [this bug](https://github.com/delta-io/delta/issues/4045) in OSS Delta (and therefore Fabric), for now I would recommend only using auto compaction for tables that are 1GB in size or smaller. Anything larger than this and auto compaction will run too frequently and therefore result in unnecessary write overhead. Until the fix ships, I recommend continuing to schedule compaction jobs for tables > 1GB in size. BUT **good news**, I submitted a PR to fix the issue in [OSS Delta](https://github.com/delta-io/delta/pull/4178) and the fix is also soon to be shipping in Fabric Spark.
This bug is FIXED in the Fabric Spark Runtime, the OSS Delta fix is still pending.

Below is the behavior that we see with the bugfix in place: as the number of compacted files increases, the frequency of compaction doesn’t increase; instead, the maximum active file count slowly increases over time. Once a write operation puts the number of uncompacted files over the minNumFiles threshold (50 files by default), auto compaction is triggered.

Below are the results with the bugfix in Fabric. Again we see that Auto Compaction does wonders to maintain the performance of both writes and reads, even as the amount of data we process scales. Three observations:

As we scale to merge more data, the benefit of not having to compact small files later is evident: Optimized Write provided the best results, with the combination of Auto Compaction + Optimized Write coming close behind.
At this scale, since each write operation gets us relatively close to our ideal file size (with Optimized Write enabled), Auto Compaction doesn’t yet provide much performance benefit in comparison to Optimized Write alone. However, it does act as insurance against the accumulation of too many small files, which would surely occur and start to impact performance if this process ran for another few hundred or even a thousand iterations.
Scheduled Compaction slightly outperformed Automatic Compaction. This is purely a result of Automatic Compaction running at a more frequent interval than Scheduled Compaction under the default configs, which yields more consistent and better read performance at the cost of slower writes due to more compaction operations being triggered.

How to Enable Auto Compaction

At the session level:

spark.conf.set('spark.databricks.delta.autoCompact.enabled', 'true')

At the table level:

CREATE TABLE dbo.ac_enabled_table
TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')

It can also be enabled on existing tables with:

ALTER TABLE dbo.ac_enabled_table
SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')

Tuning Auto Compaction

The behavior of auto compaction can be adjusted by changing the following three properties:

Property	Description	Default Value	Session Config	Table Property
maxFileSize	The target maximum file size in bytes for compacted files.	134217728 (128 MB)	spark.databricks.delta.autoCompact.maxFileSize	Not available
minFileSize	The minimum file size in bytes for a file to be considered compacted. Anything below this threshold will be considered for compaction and counted towards the `minNumFiles` threshold.	Unset by default; it is calculated as 1/2 of `maxFileSize` unless you explicitly set a value.	spark.databricks.delta.autoCompact.minFileSize	Not available
minNumFiles	The minimum number of files that must exist below the `minFileSize` threshold for a mini-compaction operation to be triggered.	50	spark.databricks.delta.autoCompact.minNumFiles	Not available

Here are the use cases for when I would tweak these properties:

minNumFiles: assuming you can tolerate a higher standard deviation in query execution times, make this value larger if you want auto compaction to be triggered less frequently.
maxFileSize: adjust this value to align with the ideal file size for your tables. In the below chart you can see the relationship between the size of a table and the ideal size of each file. This helps to minimize I/O cycles to read data into memory as well as optimizes file skipping opportunities (too few files means suboptimal file skipping).
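
As a rough illustration of tuning these values, the sketch below builds the session config entries from friendlier inputs. The config keys come from the table above; the helper name `auto_compact_session_confs`, the MB-based parameters, and the example values (256 MB, 100 files) are hypothetical, and passing sizes as plain byte counts is an assumption based on the "in bytes" descriptions.

```python
MB = 1024 * 1024

def auto_compact_session_confs(max_file_size_mb=128, min_num_files=50,
                               min_file_size_mb=None):
    """Build the auto compaction session configs as a key -> string dict."""
    prefix = "spark.databricks.delta.autoCompact."
    confs = {
        prefix + "maxFileSize": str(max_file_size_mb * MB),
        prefix + "minNumFiles": str(min_num_files),
    }
    # Leave minFileSize unset unless requested, so it keeps being derived
    # as half of maxFileSize (per the table above).
    if min_file_size_mb is not None:
        confs[prefix + "minFileSize"] = str(min_file_size_mb * MB)
    return confs

confs = auto_compact_session_confs(max_file_size_mb=256, min_num_files=100)

# In a notebook with an active session (assumed to be named `spark`):
# for key, value in confs.items():
#     spark.conf.set(key, value)
```

Keeping the values in a dictionary makes it easy to review them before applying them to a session.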

Key Takeaways

Auto compaction removes complexity: the “how often should I run OPTIMIZE” question was completely eliminated. In my benchmark, after having analyzed the results, I realized that I ran the scheduled compaction too often. While running OPTIMIZE every 20 iterations was beneficial for the 1K row batch size, as my data volumes increased, fewer small files were written and running a full compaction that often was somewhat inefficient. Also, I could’ve better designed the process to only compact files added since the last compaction operation was run.
Scheduled or Ad-Hoc Compaction Might Still Be Necessary: While auto compaction seems to win at all data volumes that I tested, would this continue after 1,000 or even 10,000 iterations? While a 128 MB file size target for auto compaction seems to work well, at some point you may need to compact these into 500 MB or even up to 1 GB files. While I would typically rely on auto compaction for short-term maintenance, in the long term you may need to selectively run an ad-hoc OPTIMIZE operation since the two different methods have different maxFileSize thresholds.

Closing Thoughts

Given the results of the three options that I tested, I would enable auto compaction in almost all use cases. It’s just too easy to enable and produces consistent results at various workload sizes. Sure, you might be able to schedule an incremental compaction job based on workload metadata that might match auto compaction results, but why overcomplicate things? It’s one (or more) less job to support, tune, and execute. With additional settings to control thresholds which impact the frequency of run and file size considered, for many workloads, it’s a no-brainer.

I was just recently in the scenario where I had a scheduled process that would frequently insert a smallish number of rows into a table (similar to my 1K row test) and noticed considerable slowness when querying the log table where queries would take 30+ seconds to return. Rather than scheduling a maintenance job or ad-hoc running OPTIMIZE for agile dev/test work I was doing, I just enabled auto compaction on the table. The next run of the process cleaned up the small files and I was back to 1-2 second latency when querying the table to analyze results.

Bonus Bits!

I’ve presented on this topic a few times and received some interesting questions that I’ll share answers to below:

How can I tell what files are part of the active Delta version being queried?: you can use the inputFiles() DataFrame method to evaluate the parquet files that would be read to return the query result.
```
  spark.sql("SELECT * FROM dbo.table").inputFiles()
```

How can I tell when Auto Compaction is actually run?: use the below PySpark. Auto Compaction operations show up as regular OPTIMIZE jobs in the transaction log but have an additional auto flag which is logged in operationParameters.

  history_df = spark.sql("DESCRIBE HISTORY dbo.table_with_ac_enabled")
  filtered_history = history_df \
      .filter(history_df.operation == "OPTIMIZE") \
      .filter(history_df.operationParameters.auto == "true")
  display(filtered_history)

How can I estimate the appropriate target file size for my Delta tables?: You can use DESCRIBE DETAIL to get the size of the latest version of your Delta table in bytes and then use this number to estimate the ideal target file size based on my prior referenced sizing chart.
```
  spark.sql("DESCRIBE DETAIL dbo.table_with_ac_enabled")
```
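
As a starting point for that estimate, the sketch below turns the `sizeInBytes` and `numFiles` values returned by DESCRIBE DETAIL into a current average file size, which can then be compared against whatever target you choose. The helper name `average_file_size_mb` and the example numbers are hypothetical; it does not reproduce the sizing chart.

```python
def average_file_size_mb(size_in_bytes: int, num_files: int) -> float:
    """Average active file size in MB; returns 0.0 for an empty table."""
    if num_files <= 0:
        return 0.0
    return size_in_bytes / num_files / (1024 * 1024)

# In a notebook with an active session (assumed to be named `spark`):
# detail = spark.sql("DESCRIBE DETAIL dbo.table_with_ac_enabled").first()
# average_file_size_mb(detail["sizeInBytes"], detail["numFiles"])

# Hypothetical table: 10 GB of data spread across 80 files.
example = average_file_size_mb(size_in_bytes=10 * 1024**3, num_files=80)
```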

Automating V-Order: A Targeted Approach for Direct Lake Models

2025-01-31T00:00:00+00:00

I’ve previously blogged in detail about V-Order optimization. In this post, I want to revisit the topic and demonstrate how V-Order can be strategically enabled in a programmatic fashion.

Since V-Order provides the most benefit and consistent improvement for Direct Lake Semantic Models, why not leverage platform metadata to enable it automatically—but only for Delta tables used by these models?

This will be a short blog—let’s get straight to the concept, the source code, and then move on to more strategic use of this feature.

How to Implement

Unset the V-Order Session Config
By default, the Spark config spark.sql.parquet.vorder.default is set to true, meaning V-Order is enabled automatically for the DataFrameWriter class. This takes precedence if the spark.sql.parquet.vorder.enabled session config is unset (default), causing write operations to enable V-Order. Additionally, the spark.microsoft.delta.parquet.vorder.property.autoset.enabled session config ensures the Delta table V-Order property is automatically applied.

To prevent V-Order from being applied universally, we either need to unset or disable spark.sql.parquet.vorder.default, ensuring that no write operation automatically writes V-Ordered data. As a result, the table property won’t be automatically applied.
```
 spark.conf.unset('spark.sql.parquet.vorder.default')
```
You should ensure that all your data engineering jobs either unset this session config or explicitly set it to false in your environment configurations.
Remove the V-Order Table Property from Existing Tables
This step is optional but useful if you have multiple tables with V-Order enabled that are not used in Direct Lake Semantic Models. While I may provide a bulk removal script later, for now, you can manually list your tables and run an ALTER TABLE command to remove the property.
```
 ALTER TABLE dbo.vordered_table UNSET TBLPROPERTIES ('delta.parquet.vorder.enabled')
```
This doesn’t rewrite existing data to remove V-Order—it simply removes the feature from the table properties. Future writes will not use V-Order as long as the session config from the previous step remains unset or disabled.
Schedule an Automatic V-Order Maintenance Script
The script provided below should be scheduled (e.g., weekly) to automatically update tables used in Direct Lake Semantic Models, selectively enabling the V-Order Delta table property only for relevant tables.

While this functionality may eventually be packaged into a formal Python library, for now, I’m sharing it as a GitHub Gist. Just copy and paste the code into your preferred notebook (Python, not Spark), update the workspace scope filtering, schedule it, and you’re all set!

Why Python instead of Spark?
This workload is a great example of where plain Python shines. Since we’re just calling APIs and performing lightweight metadata updates, there’s no need for the overhead of Spark. Running this job in Spark would be much slower.

Why do I need to provide a list of workspaces?
Your Semantic Models could be hosted in a different workspace than your Lakehouses. Scoping to multiple workspaces allows you to bridge this separation. As long as you have write access to the source Lakehouse, you’ll be able to automatically set the table property. If no workspace list is provided, it defaults to the current workspace.

With this approach, there’s no need to enable V-Order by default across the board or manually analyze which tables need it. Just run this script, and any table used in a Direct Lake Semantic Model will have V-Order automatically enabled. The next time data is written to these tables, new data will be V-Ordered.

For older data, you may want to run a full OPTIMIZE operation to ensure all data benefits from the optimization.

Cheers!

Mastering Spark: Session vs. DataFrameWriter vs. Table Configs

2024-12-20T00:00:00+00:00

With Spark and Delta Lake, just like with Hudi and Iceberg, there are several ways to enable or disable settings that impact how tables are created. These settings may affect data layout or table format features, but it can be confusing to understand why different methods exist, when each should be used, and how property inheritance works.

While platform defaults should account for most use cases, Spark provides flexibility to optimize various workloads, whether adjusting for read or write performance, or for hot or cold path data processing. Inevitably, the need to adjust configurations from the default will arise. So, how do we do this effectively?

Spark Session vs. Delta Table Configurations

Configuration Scopes Explained

I decided to blog about this topic after encountering a job writing to partitioned tables that ran 10x slower than expected and queries that were over 6x slower. I obviously had a “small-file” problem at hand. Initially, I thought the issue could be resolved by enabling Optimized Write at the table level, assuming it would always be leveraged. However, I soon realized that the session-level config, which takes precedence, was disabled, meaning the Delta table property I added had no functional effect.

Hierarchy of Precedence and Scopes

The following order determines which configuration is applied when there’s a conflict:

Spark Session-Level Configurations (Highest Priority): (e.g., spark.databricks.delta.optimizeWrite.enabled) are global for the duration of the Spark session.
- Scope: These configurations apply globally across all operations within the active Spark session but can be overridden by some DataFrameWriter options.
- Use Cases: Ideal for cluster-wide defaults or platform-level behavior, ensuring consistency across multiple jobs.
```
 spark.conf.set('spark.databricks.delta.autoCompact.enabled', 'true')
```
or
```
 SET spark.sql.parquet.vorder.enabled = TRUE
```
DataFrameWriter Options: Settings applied directly in the DataFrameWriter (e.g., .option('optimizeWrite', 'true')). Some writer options override both session-level and table-level configurations.
- Scope: Apply only during the execution of a specific write operation.
- Use Cases: Best for ad-hoc or one-off scenarios where temporary overrides are needed without altering global or table-level settings.
Example:
```
 df.write.option('optimizeWrite', 'true').saveAsTable('dbo.t1')
```
Table-level properties (e.g., delta.autoOptimize.optimizeWrite) are settings tied to the specific table. Tables have three functional types of properties:
1. Persistent: Applied permanently, will be enforced across any writer (or reader) until the feature is dropped. Session and DataFrameWriter configs do not override the function of the feature.
  
  Examples:
  - delta.enableChangeDataFeed
  - delta.enableDeletionVectors
  - delta.logRetentionDuration
  - delta.checkpointInterval
2. Transient: Features that apply by default if a session or DataFrameWriter setting does not override it.
  
  Examples:
  - delta.parquet.vorder.enabled
  - delta.autoOptimize.optimizeWrite
  - delta.autoOptimize.autoCompact
  - delta.schema.autoMerge.enabled
3. Symbolic: Any arbitrary key-value pair, these don’t determine the function of the table but enrich the table with supporting metadata.
```
 CREATE TABLE dbo.table_with_properties
 TBLPROPERTIES (
     'delta.enableChangeDataFeed' = 'true', --persistent
     'delta.autoOptimize.autoCompact' = 'true', --transient
     'foo' = 'bar' --symbolic
 )
```
Any table property can be retrieved via running:
```
 SHOW TBLPROPERTIES dbo.table_with_properties
```
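
To see how the properties on a given table fall into these three groups, a small lookup is enough. The sketch below is illustrative only: the two sets contain just the example properties listed above (not an exhaustive catalog), `classify_table_property` is a hypothetical helper, and labeling any other `delta.`-prefixed key as "unclassified" is my own assumption.

```python
# Only the examples named in this post; not an exhaustive list.
PERSISTENT = {
    "delta.enableChangeDataFeed",
    "delta.enableDeletionVectors",
    "delta.logRetentionDuration",
    "delta.checkpointInterval",
}
TRANSIENT = {
    "delta.parquet.vorder.enabled",
    "delta.autoOptimize.optimizeWrite",
    "delta.autoOptimize.autoCompact",
    "delta.schema.autoMerge.enabled",
}

def classify_table_property(key: str) -> str:
    """Label a table property key with one of the functional types above."""
    if key in PERSISTENT:
        return "persistent"
    if key in TRANSIENT:
        return "transient"
    if key.startswith("delta."):
        return "unclassified"  # a Delta property this sketch does not know about
    return "symbolic"

# Properties from the CREATE TABLE example above.
props = {
    "delta.enableChangeDataFeed": "true",
    "delta.autoOptimize.autoCompact": "true",
    "foo": "bar",
}
labels = {key: classify_table_property(key) for key in props}
```

The same function could be applied to the `key` column returned by SHOW TBLPROPERTIES.
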
Why the delineation between persistent and transient?:
- Persistent Table Properties: Designed for features that are core to table behavior and must persist across sessions and jobs.
- Transient Table Properties: Offer runtime flexibility based on workload types, allowing configurations to be customized for specific Spark jobs.

Why Do Multiple Scopes Exist?

Flexibility: Different workloads require different optimization strategies, and multiple scopes allow fine-tuning.
Isolation: As long as a global (session-level) setting doesn’t take precedence, table-specific requirements are respected and isolated.
Compatibility: Supports the evolving needs of distributed systems where various users and tools interact with the same datasets.

Key Configurations

Feature	Session-Level Config	DataFrameWriter Option	Table-Level Config
Optimize Write	spark.databricks.delta.optimizeWrite.enabled	option('optimizeWrite', 'true')	delta.autoOptimize.optimizeWrite
Auto Compaction	spark.databricks.delta.autoCompact.enabled	option('autoCompact', 'true')	delta.autoOptimize.autoCompact
Change Data Feed (CDC)	spark.databricks.delta.properties.defaults.enableChangeDataFeed		delta.enableChangeDataFeed
Schema Auto-Merge	spark.databricks.delta.schema.autoMerge.enabled	option('mergeSchema', 'true')	delta.schema.autoMerge.enabled
Log Retention Duration	spark.databricks.delta.logRetentionDuration		delta.logRetentionDuration
Checkpoint Interval	spark.databricks.delta.checkpointInterval		delta.checkpointInterval
Deletion Vectors	spark.databricks.delta.properties.defaults.enableDeletionVectors		delta.enableDeletionVectors
V-Order	spark.sql.parquet.vorder.[enabled/default]	option('parquet.vorder.enabled', 'true')	delta.parquet.vorder.enabled

You’ll notice that DataFrameWriter options only exist for transient writer settings.

Precedence Rules: What Happens When They Conflict

Optimized Write Example

What happens when the session-level config for Optimize Write is disabled, but the Delta table property delta.autoOptimize.optimizeWrite is enabled?

spark.conf.set('spark.databricks.delta.optimizeWrite.enabled', 'false')

spark.sql("""
    CREATE TABLE dbo.ow_is_not_enabled PARTITIONED BY (country_sk)
    TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
    AS SELECT 1 as country_sk
""")

As hinted earlier, the session-level config takes precedence. Although the table has the Optimized Write property enabled, writes to the table will not use the Optimized Write feature. To control this setting on a table-by-table basis, we should unset the session-level config so that we can selectively enable the setting only for partitioned tables.

spark.conf.unset('spark.databricks.delta.optimizeWrite.enabled')

spark.sql("""
    CREATE TABLE dbo.ow_is_now_enabled PARTITIONED BY (country_sk)
    TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
    AS SELECT 1 as country_sk
""")
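
The two cases above can be modeled in a few lines of plain Python. This is only a sketch of the ordering as this post describes it (writer option, then session config, then table property), not Delta's implementation. `resolve_optimize_write` is a hypothetical helper, and treating "nothing set anywhere" as disabled is an assumption, since the platform default may differ.

```python
def resolve_optimize_write(writer_option=None, session_conf=None, table_property=None):
    """Return the effective Optimized Write setting; None means 'unset'."""
    # The first explicitly set value wins, checked in precedence order.
    for value in (writer_option, session_conf, table_property):
        if value is not None:
            return value
    return False  # assumption: disabled when nothing is set anywhere

# dbo.ow_is_not_enabled: session config disabled, table property enabled.
not_enabled = resolve_optimize_write(session_conf=False, table_property=True)

# dbo.ow_is_now_enabled: session config unset, table property enabled.
now_enabled = resolve_optimize_write(session_conf=None, table_property=True)
```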

V-Order Example

There are exceptions to the standard precedence rule for transient writer configs. In the example below, we have V-Order enabled at the session level, but when writing to a table using the DataFrameWriter, we attempt to disable V-Order. The result is that the table is still written with the V-Order optimization. This is an exception where the session-level config always takes precedence when set.

spark.conf.set('spark.sql.parquet.vorder.enabled', 'true')

df.write.option('parquet.vorder.enabled', 'false').saveAsTable('dbo.vorder_is_enabled')

To allow for defining V-Order for individual tables on an opt-in basis, Runtime 1.2 required unsetting the spark.sql.parquet.vorder.enabled session-level config. Runtime 1.3 uses spark.sql.parquet.vorder.default instead, which no longer requires unsetting a property just to have table-level control. The spark.sql.parquet.vorder.default session-level config enables V-Order as a DataFrameWriter option if the option is not already set.

spark.conf.get('spark.sql.parquet.vorder.enabled') # NONE | session-level config which overrides DataFrameWriter and Table Properties | priority #1
spark.conf.get('spark.sql.parquet.vorder.default') # TRUE | session-level config which sets V-Order as default for the DataFrameWriter option | priority #2, takes precedence if the prior config is unset and the DataFrameWriter option is not defined

# SCENARIO 1
df.write.saveAsTable('dbo.vorder_is_enabled') # ENABLED since the DataFrameWriter will default to enabling V-Order

# SCENARIO 2
spark.conf.unset('spark.sql.parquet.vorder.default')
df.write.saveAsTable('dbo.vorder_is_not_enabled') # NOT ENABLED since we didn't define the DataFrameWriter option and the session-level default was unset

# SCENARIO 3
df.write.option('parquet.vorder.enabled', 'true').saveAsTable('dbo.vorder_is_enabled2') # ENABLED since we specified the DataFrameWriter option as enabled

# SCENARIO 4
spark.sql("""
    CREATE TABLE dbo.vorder_is_enabled3
    TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')
    AS SELECT 1 as c1
""") # ENABLED since we specified the table property and the session-level config `spark.sql.parquet.vorder.enabled` defaults to being unset
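
The four scenarios, plus the earlier exception, can be summarized as a small decision function. This is a simplified model of the behavior described in this post, not Fabric's actual logic: `resolve_vorder` and its parameter names are hypothetical, `None` stands for "unset", and letting an explicit writer option win over the table property is an assumption the post does not spell out.

```python
def resolve_vorder(session_enabled=None, session_default=None,
                   writer_option=None, table_property=None):
    """Return True when a write would be V-Ordered under this simplified model."""
    # Priority 1: spark.sql.parquet.vorder.enabled overrides everything when set.
    if session_enabled is not None:
        return session_enabled
    # Priority 2: spark.sql.parquet.vorder.default only fills in the
    # DataFrameWriter option when that option was not defined.
    if writer_option is None and session_default:
        writer_option = True
    if writer_option is not None:
        return writer_option
    # Otherwise fall back to the delta.parquet.vorder.enabled table property.
    return bool(table_property)

scenario_1 = resolve_vorder(session_default=True)                # writer defaults to V-Order
scenario_2 = resolve_vorder(session_default=None)                # nothing set anywhere
scenario_3 = resolve_vorder(writer_option=True)                  # explicit writer option
scenario_4 = resolve_vorder(table_property=True)                 # table property only
exception = resolve_vorder(session_enabled=True, writer_option=False)  # session config wins
```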

Best Practices for Config Management

Given the precedence hierarchy, evaluate which configurations should be applied table-by-table or as a default behavior for writers and sessions.

For writer features that do not automatically enable the feature as a table property, these configs should always be defined as table properties. V-Order is an example of a feature that automatically enables the table property if set at the session or DataFrameWriter level:

spark.conf.get('spark.microsoft.delta.parquet.vorder.property.autoset.enabled') # if a table is written to with V-Order optimizations and the table property is not already set, it will enable it

Why This Matters

Some properties do not automatically apply as table properties, risking inconsistent writes from other sessions or writers. Optimized Write and Auto Compaction are examples where enabling them via session or DataFrameWriter options does not persist the setting as a table property. This can cause serious issues.

Example: Risk of Inconsistent Writes

Session 1:

  df.write.option("optimizeWrite", "true").partitionBy("country_sk").saveAsTable("dbo.partitioned_table")

Session 2:

  spark.conf.unset('spark.databricks.delta.optimizeWrite.enabled') # OR spark.conf.set('spark.databricks.delta.optimizeWrite.enabled', 'false')

  df.writeTo("dbo.partitioned_table").append()

  spark.sql('OPTIMIZE dbo.partitioned_table')

What Happens?

Session 1 successfully creates a partitioned table using Optimized Write.
Session 2, with different session-level defaults, appends without Optimized Write.
The OPTIMIZE command rewrites the entire table, worsening the small file problem.

The Solution: Use Table Properties

Rely on table properties where possible and avoid session-level defaults for settings that won’t be used consistently across your environment.

Corrected Example Using Table Properties:

Session 1:

  spark.sql("""
      CREATE TABLE dbo.partitioned_table PARTITIONED BY (country_sk)
      TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
      AS SELECT * from df_tempview
  """)

Session 2:
```
  spark.conf.unset('spark.databricks.delta.optimizeWrite.enabled') # leave the session config unset so the table property applies

  df.writeTo("dbo.partitioned_table").append()

  spark.sql('OPTIMIZE dbo.partitioned_table')
```
In this scenario, the Delta table itself has the transient delta.autoOptimize.optimizeWrite property enabled. Session 2 doesn’t define Optimized Write at either the session or DataFrameWriter level, so the optimization is still applied via the table property.

When properties like Optimized Write and Auto Compaction are enabled at the table level, Spark automatically applies them when the DataFrameWriter or session configs are unset. This ensures consistent writes and simplifies troubleshooting by making table metadata a source of truth for data layout properties.

General Best Practices

Use Table Properties for Long-Term Consistency

Why: Table properties persist across sessions, ensuring consistent behavior across all jobs and writers.
Best Practice: Always set critical features like delta.autoOptimize.autoCompact or delta.autoOptimize.optimizeWrite as table properties to avoid reliance on consistent session configurations across various writers.

Minimize Session-Level Configs

Why: Session-level configs only apply to the current Spark session and can cause unexpected results if forgotten or if other writers use different session configs in combination with transient table properties.
Best Practice: Use session-level configs only for temporary testing or persistent configurations that should be applied platform-wide.

Use DataFrameWriter Options Selectively

Why: DataFrameWriter options only apply to the current write operation and do not persist across sessions.
Best Practice: Only use DataFrameWriter options if the feature supports automatically enabling the corresponding table property (e.g., delta.parquet.vorder.enabled for V-Order). Otherwise, restrict their use to testing or ad-hoc writes, where applying the same feature for future writes does not matter.

Retrieving Active Configs

Given that it is important to understand what session-level configurations are set and what the active values are, the below function can be extremely handy as it will return a dictionary of key-value pairs which can easily be viewed in whole or queried. Kudos to this Stack Overflow Post for the source code.

def get_spark_session_configs() -> dict:
    scala_map = spark.conf._jconf.getAll()
    spark_conf_dict = {}

    iterator = scala_map.iterator()
    while iterator.hasNext():
        entry = iterator.next()
        key = entry._1()
        value = entry._2()
        spark_conf_dict[key] = value
    return spark_conf_dict

With this function we can now create a dictionary variable that encompasses all session configs and easily query the dictionary to check for how configs are set:

spark_configs = get_spark_session_configs()

print(spark_configs['spark.databricks.delta.optimizeWrite.enabled']) # if we want to throw an error if the config is not set

print(spark_configs.get('spark.databricks.delta.optimizeWrite.enabled', 'unset')) # if we want to gracefully handle configs not being set
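
Since the result is a plain dictionary, narrowing it down to one family of settings is straightforward. A small sketch, where `filter_configs` is a hypothetical helper and the sample dictionary stands in for the output of `get_spark_session_configs()`:

```python
def filter_configs(configs: dict, prefix: str) -> dict:
    """Return only the configs whose key starts with `prefix`, sorted by key."""
    return {key: value for key, value in sorted(configs.items())
            if key.startswith(prefix)}

# Stand-in for the dictionary returned by get_spark_session_configs().
sample_configs = {
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.sql.parquet.vorder.default": "true",
    "spark.app.name": "demo",
}

delta_configs = filter_configs(sample_configs, "spark.databricks.delta.")
```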

Should You Ditch Spark for DuckDB or Polars?

2024-12-12T00:00:00+00:00

There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, the excitement about these lightweight native engines has risen to a new high. Out with Spark and in with the new and cool animal-themed engines— is it time to finally migrate your small and medium workloads off of Spark?

Before writing this blog post, honestly, I couldn’t have answered with anything besides a gut feeling largely based on having a confirmation bias towards Spark. With folks in the community recently posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to roll up my sleeves and explore whether or not I should abandon everything I know and become a DuckDB and/or Polars convert.

The Methodology

While performance can be the most important driver in selecting an engine, the reality is that performance alone does not make a technology worthy of a spot in your architecture landscape. In this analysis, I’ve chosen to build a benchmark suite that aims to evaluate the following based on real-world-type test cases:

Performance
Execution Cost
Development Cost
Engine Maturity and Compatibility

The Test Cases

If I can find any complaint with benchmarks that people post, it’s that they don’t always reflect real-world use cases. The recent blog by my colleague Sandeep Pawar is fantastic, as it highlights how optimizing row group sizes can allow single-machine engines to approach V-Order-like performance. In terms of the Spark comparison, as I shared with Sandeep, the use of the LIMIT operator in his benchmark resulted in Spark running a CollectLimit operation, which forces all data on worker nodes to be collected and then filtered at the driver level. This resulted in unnecessary data movement from workers to the driver as well as a single-threaded write operation, which constrained the possible parallelism and performance. While using LIMIT to interactively return a small result set to the console is a real-world use case, returning 50M rows to the console OR using the LIMIT operation in typical ELT processes (i.e., building a fact table) is not. Therefore, it doesn’t make sense to draw serious conclusions about Spark based on this test.

For my test cases, I aimed to comprehensively cover the basic ELT use cases in a Lakehouse architecture, evaluated at both the 10GB and 100GB levels based on a sampling of TPC-DS tables generated via the Databricks DS-DGEN-based library (the largest was the store_sales table):

Read Parquet, Write Delta (5x): I’ve selected five tables from the TPC-DS schema. This test simply measures the time to read the source Parquet data and write a Delta table for each of the five tables.
Create Fact Table: This test measures the time to create a fact table based on the aggregation of data from the five source TPC-DS tables. A simple CREATE TABLE AS SELECT operation is run.
Merge 0.1% into Fact Table (3x): This test measures the time to take a 0.1% sampling of records from the core transaction source table, join them with dimension tables, randomize values, and then merge them into the target fact table created in the prior step. This is run three times to simulate having multiple incremental loads.
VACUUM (0 Hours): This measures the time to clean up old Parquet files that are no longer in the latest Delta commit. I ran with 0 hours of history retained (not recommended for production workloads) so that it would clean up the maximum number of files.
OPTIMIZE: Nothing fancy about this, just the time to perform compaction.
Ad-hoc Query (Small Result Aggregation): The time to perform a simple aggregated SELECT statement that returns a small result set. This imitates the type of ad-hoc query that would be run interactively and displayed for analysis.

Based on my experience consulting where I built many Lakehouse architectures, these are the types of operations that would be generally representative of end-to-end data engineering work. No APIs or semi-structured data to make things too complex—just the typical operations that would result if you had Parquet files being delivered as a starting place and the goal was to build a dimensional model to support reporting and ad-hoc queries.

Compute Configurations

I elected to use the smallest possible compute size for each respective engine for both the 10GB and 100GB benchmarks. For DuckDB and Polars, using Python Notebooks, this was the default 2-vCore VM size. For Spark, the smallest possible compute size is a Single-Node 4-vCore Spark cluster (one single Small node VM). While the starting node size for Spark is 2x bigger, Fabric Single-Node clusters allocate 50% of cores to the driver, meaning the Spark job effectively only has 2 vCores available for typical Spark tasks.

The 10GB benchmark was run on 2, 4, and 8-vCore machines (all single-node configurations for Spark and single-VMs running Python for DuckDB and Polars).
The 100GB benchmark was run on 2, 4, 8, 16, and 32-vCore compute configurations:
- For Spark, I used single-node configurations for 4 and 8-vCores.
- For 16-vCores, I used a cluster with three 4-vCore worker nodes (4 driver vCores + 12 worker vCores).
- For 32-vCores, I used a cluster with three 8-vCore worker nodes (8 driver vCores + 24 worker vCores).
- For DuckDB and Polars, single-VMs running Python were used.
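
The core counts above reduce to simple arithmetic. The sketch below assumes, as described above, that a single-node cluster gives half of its cores to the driver, and that in a multi-node cluster the driver has its own node while every worker-node core is available for tasks. `spark_task_vcores` is a hypothetical helper, not a Fabric API.

```python
def spark_task_vcores(node_vcores: int, worker_nodes: int = 0) -> int:
    """vCores available for Spark tasks under the assumptions stated above."""
    if worker_nodes == 0:
        # Single-node cluster: 50% of the cores go to the driver.
        return node_vcores // 2
    # Multi-node cluster: the driver node is separate from the workers.
    return node_vcores * worker_nodes

single_node_4 = spark_task_vcores(4)                   # the smallest Spark configuration
cluster_16 = spark_task_vcores(4, worker_nodes=3)      # 4 driver vCores + 12 worker vCores
cluster_32 = spark_task_vcores(8, worker_nodes=3)      # 8 driver vCores + 24 worker vCores
```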

For Spark, I used the Native Execution Engine (NEE), as this is a native C++ vectorized engine that makes vanilla Spark faster. There’s no additional CU rate multiplier, so there’s no reason not to use it, particularly when trying to optimize for both cost and performance.

Engine Versions

Engine	Version
Spark	Fabric Runtime 1.3 (Spark 3.5, Delta 3.2)
DuckDB	1.1.3
Polars	1.6.0

Delta Lake Writer Configs

I used the best practice Delta Lake writer configs available in each engine.

For the Spark tests, I enabled deletion vectors. See my blog on this topic to understand the value proposition.
For both DuckDB and Polars, since they depend on the Rust-based DeltaLake Python library for writes, which does not support deletion vectors, this setting could not be enabled. However, at this small scale, deletion vectors would only have a marginal impact on performance, so this does not skew the results in any meaningful way.
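
As a rough sketch of what enabling deletion vectors looks like on the Spark side, the snippet below builds the Delta `ALTER TABLE` statement that sets the `delta.enableDeletionVectors` table property. The helper name and the table name `fact_sales` are hypothetical, and this is not necessarily how the benchmark code set the property.

```python
def enable_deletion_vectors_sql(table_name: str) -> str:
    # Build the statement that turns on deletion vectors for one Delta table.
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES ('delta.enableDeletionVectors' = true)"
    )


# In a Spark session this would be executed as:
#   spark.sql(enable_deletion_vectors_sql("fact_sales"))  # hypothetical table name
print(enable_deletion_vectors_sql("fact_sales"))
```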

The Native Execution Engine (NEE) doesn’t yet natively support deletion vectors. When DVs are included, it results in mixed execution query plans with fallback to Spark row-based execution. Depending on the workload, DVs can still improve performance where merge-on-read results in less data being written. In this benchmark, DVs resulted in NEE completing ~3% faster.

Polars Benchmark Sampling Mod

After running the benchmark with Polars and getting OOM errors below 16-vCores, I identified that Polars does not support lazy evaluation for data sampling. This meant that to run the Merge 0.1% into Fact Table (3x) test, Polars needed to read the entire source Delta table into memory and then take an in-memory sampling of data. Spark and DuckDB, on the other hand, are able to sample directly on top of the source data, eliminating the need to load the entire table into memory.

Since sampling a large table as the source for an incremental load is not something you’d typically see in production and was only used for data generation purposes, I decided to run a second version of the benchmark for Polars. This version, labeled as Polars (Mod), uses DuckDB to perform the more efficient sampling operation (sampled_table = duckdb.sql("SELECT * FROM delta_scan('abfss://...') USING SAMPLE 0.1%").record_batch()) before processing the data further with Polars.
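
The Polars (Mod) sampling step can be sketched roughly as follows. The query text and the `record_batch()` call come from the snippet above; the function names are hypothetical, and materializing the small sample with `read_all()` before handing it to Polars is my assumption about a reasonable hand-off, not necessarily what the benchmark code did.

```python
def build_sample_query(delta_uri: str, sample_pct: float) -> str:
    # DuckDB samples directly on top of the Delta scan, so the full table
    # never has to be loaded into memory.
    return f"SELECT * FROM delta_scan('{delta_uri}') USING SAMPLE {sample_pct}%"


def sample_delta_for_polars(delta_uri: str, sample_pct: float = 0.1):
    # Imported lazily so the query builder can be used without either engine installed.
    import duckdb
    import polars as pl

    # record_batch() returns a streaming Arrow reader over the sampled rows.
    reader = duckdb.sql(build_sample_query(delta_uri, sample_pct)).record_batch()
    # The sample is small, so materializing it as an Arrow table is cheap.
    return pl.from_arrow(reader.read_all())


print(build_sample_query("abfss://...", 0.1))
```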

Benchmark Analysis

ℹ️ After reading this blog, see my refresh of this benchmark updated on 6/30/2025 which covers new insights, includes Daft in the mix, and an intro to LakeBench, the benchmark behind this blog post.

Performance

10GB Scale

At 2-vCores, Polars (Mod) was the fastest engine, followed by DuckDB, and then Polars without the benchmark modification.
At 4-vCores, DuckDB takes the win followed by Polars and lastly Spark. DuckDB was ~1.6x faster than Spark w/ NEE.
At 8-vCores, DuckDB finishes only slightly faster than Spark w/ NEE. Both Polars scenarios come last.

100GB Scale

No engine completed the benchmark with only 2-vCores (Fabric doesn’t offer a 2-vCore node size for Spark so this wasn’t tested).
DuckDB was the fastest engine when using 4-vCores, taking a slight edge over Spark w/ NEE.
Spark w/ NEE was fastest at 8, 16, and 32-vCores.
Polars ran into out-of-memory (OOM) and wasn’t able to finish tests at 4 or 8 vCores. Polars was much slower than DuckDB and Spark at 16 and 32-vCores.

Note: In all of these tests, Spark has access to fewer total vCores for data processing work yet was able to keep up and even exceed the others.

Which Phases Did Different Engines Excel At?

Read Parquet, Write Delta (5x)
- 10GB: While Polars took the win at 2-vCores, DuckDB had an edge at 4-vCores.
- 100GB: Spark was over 2x faster than both DuckDB and Polars.
Create Fact Table
- 10GB: DuckDB was ~2x faster than every other engine, with the other engines performing very similarly.
- 100GB: DuckDB and Spark w/ NEE tied, with both Polars variants running almost 6x longer.
Merge 0.1% into Fact Table (3x)
- 10GB: Polars (Mod) was the fastest at 4-vCores, with the other engines closely clustered.
- 100GB: Spark w/ NEE was ~2x faster than DuckDB and significantly faster than both Polars variants.
VACUUM (0 Hours)
- Neither DuckDB nor Polars has a native VACUUM command; however, the DeltaLake Python library based on Delta-rs was significantly faster than the native VACUUM command in Spark.
OPTIMIZE
- Same as VACUUM: neither DuckDB nor Polars has a native OPTIMIZE command, but the Delta-rs-based library was again significantly faster than the native OPTIMIZE command in Spark.
Ad-hoc Query (Small Result Aggregation)
- As expected, this is where engines like DuckDB and Polars provide mind-blowing, super-low-latency performance. Depending on the scale, DuckDB and Polars were between 2-6x faster than Spark w/ NEE.

10GB Results @ 4-vCores

100GB Results @ 16-vCores

Since the performance difference for VACUUM, OPTIMIZE, and Ad-hoc/Interactive Queries tends to be overshadowed by longer-running ELT processes, here’s an isolated view of the 10GB 4-vCore benchmark highlighting how much faster DuckDB and Polars (with Delta-rs) are for these workloads.

Execution Cost

Since I logged the vCores used for each run, translating to CU seconds and then the approximate dollar cost for the job was straightforward. Now that I’ve established that vanilla Spark can compete, going forward I will highlight results for Spark w/ NEE and deletion vectors enabled against DuckDB and Polars.
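
The vCores-to-dollars translation can be sketched like this. Both defaults are assumptions to replace with your own values: `vcores_per_cu=2.0` reflects the commonly documented Fabric ratio of two Spark vCores per capacity unit, and `usd_per_cu_hour=0.18` is only a placeholder price, since the actual rate varies by region and billing model. The function names are hypothetical.

```python
def cu_seconds(vcores: float, duration_seconds: float, vcores_per_cu: float = 2.0) -> float:
    # Capacity units consumed = allocated vCores converted to CUs, times runtime.
    return vcores / vcores_per_cu * duration_seconds


def job_cost_usd(
    vcores: float,
    duration_seconds: float,
    vcores_per_cu: float = 2.0,     # assumption: 2 vCores per CU
    usd_per_cu_hour: float = 0.18,  # assumption: placeholder price per CU-hour
) -> float:
    # Convert CU seconds to CU hours, then apply the hourly price.
    return cu_seconds(vcores, duration_seconds, vcores_per_cu) / 3600 * usd_per_cu_hour


# Example: an 8-vCore job that runs for 10 minutes.
print(f"${job_cost_usd(8, 600):.3f}")
```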

10GB Cost

Both DuckDB and Polars (Mod) were about 50% cheaper compared to Spark.
With 8-vCores, Spark w/ NEE and DuckDB have very close job costs ($0.019 vs $0.017).

100GB Cost

With 4-vCores, the DuckDB and Spark jobs cost the same at ~ $0.08.
With 8-vCores, the cost of the Spark job is unchanged ($0.08) but we were able to cut ~10 minutes off the processing time. Spark was the cheapest.
As the allocated cores increase, the relative performance gain for Spark is much higher compared to DuckDB and Polars:
- Spark: Compared to the 4-vCore run, Spark w/ 32-vCores was 4.5x faster while the job only costs 2x more.
- DuckDB: Compared to the 4-vCore run, DuckDB w/ 32-vCores was only 2.4x faster while the job costs 3.5x more.
- Polars: Compared to the 16-vCore run, Polars w/ 32-vCores was only ~1.1x faster while costing ~1.9x more.
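
The relationship between speedup and cost in the list above follows from cost being proportional to vCores multiplied by runtime. A small sketch (the function name is hypothetical) shows that the reported cost multiples are roughly what the core counts and speedups imply:

```python
def expected_cost_multiple(base_vcores: int, new_vcores: int, speedup: float) -> float:
    # Cost scales with vCores x runtime, so more cores cost more
    # unless the runtime shrinks by the same factor.
    return (new_vcores / base_vcores) / speedup


# (engine, base vCores, new vCores, reported speedup, reported cost multiple)
runs = [
    ("Spark", 4, 32, 4.5, 2.0),
    ("DuckDB", 4, 32, 2.4, 3.5),
    ("Polars", 16, 32, 1.1, 1.9),
]

for engine, base, new, speedup, reported in runs:
    implied = expected_cost_multiple(base, new, speedup)
    print(f"{engine}: implied {implied:.2f}x vs. reported ~{reported}x")
```

The implied multiples (about 1.78x, 3.33x, and 1.82x) line up with the rounded figures above: an engine only gets cheaper per job at larger sizes when its speedup keeps pace with the added cores.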

Development Cost

Selecting a compute engine isn’t just about raw performance—it’s also about how easily and quickly developers can implement solutions. In this evaluation, I focused on two key aspects of development agility: features that impact implementation time and the real-world experience of implementing this benchmark. While the feature evaluation is relatively objective, the implementation evaluation is based on my experience and prior background, making it subjective.

Key Features Impacting Development Cost

Engine	SQL Interface	DataFrame API	Native Delta Reader	Native Delta Writer	Local Development	Live Monitoring Capabilities	OneLake Auth Setup
Spark	Yes	Yes	Yes	Yes	Great	Good but w/ a steep learning curve	Excellent
DuckDB	Yes	Yes††	Yes (via Delta Kernel)	No	Great	Poor	Ok
Polars	Yes†	Yes	Yes	Yes (via Delta-rs)	Great	Very Poor	Partial

† Corrected 12/16/24, Polars does support a SQL interface. This has been decently mature since 0.17.0 (June 2023).

†† Corrected 12/16/24: DuckDB supports a DataFrame-like API through its Relational API and Expression API, introduced in version 0.7.0 (August 2022). Additionally, DuckDB is developing an experimental Spark API, enabling Spark users to run workloads using the DuckDB engine while leveraging the familiar Spark DataFrame API. This feature facilitates seamless migration of lightweight Spark jobs to DuckDB with near-zero code changes, while also allowing users to start with the DuckDB Spark API and transition to the Spark engine as data scales beyond DuckDB’s optimal range.

† Updated 7/17/25, I’d rate local development with Spark as being ‘great’ with caveats. After recently working on a contribution to OSS Delta-Spark, I really didn’t know how powerful IntelliJ made developing in Spark. Rich debugging, fantastic linting, dependency tracking, code navigation, etc., it’s quite amazing. The only caveat is that it’s not relevant for PySpark development. I love VS Code, but it’s not quite as rich and out-of-the-box for developing in Spark.

My Analysis

SQL and DataFrame API: While you can use a DataFrame abstraction library like Ibis or SQLFrame, Spark is the only engine I benchmarked that natively supports both SQL and a DataFrame API. Having both presents tremendous flexibility in building data engineering pipelines. Most Spark developers I know heavily use both the SparkSQL and the DataFrame API. Corrected 12/16/24: All engines support both a SQL interface and a DataFrame API, enabling programmatic chaining of transformations that can be executed via lazy evaluation. Spark offers the most robust capabilities through SparkSQL and its DataFrame API. However, Polars (DataFrame-first) and DuckDB (SQL-first) are both making significant progress in enhancing their secondary query construction models. Notably, DuckDB is actively developing a Spark API, allowing Spark users to leverage DuckDB with familiar syntax while providing a seamless path (_fingers crossed, this is still experimental_) to switch to Spark’s distributed compute engine as data volumes scale.
Native Delta Writer:
- DuckDB only supports writing to Delta tables by converting DuckDB DataFrames to another memory format and then using the DeltaLake Python library to perform the write operation. This should be natively supported in time, but today this experience of needing to convert DataFrames and use another writer was quite surprising and took some time to figure out the most optimal way to do it. I first started by converting DuckDB DataFrames to Arrow Tables via arrow() and ran into OOM issues below 16-vCore. Mim then jumped in and helped me understand that I should be using record_batch() to make this a streaming Arrow DataFrame so that the data gets processed in batches and doesn’t require the full dataset to fit into memory.
- Polars supports a native Delta Lake writer via Delta-rs bindings.
- Since both DuckDB and Polars are dependent on the Delta-rs-based DeltaLake Python library for full-featured writes, both are limited by features that have yet to be implemented in Delta-rs, namely deletion vectors. This feature request was reported almost two years ago and is still open. Since deletion vectors are not supported, this means that while DuckDB can read from DV-enabled tables, since both DuckDB and Polars are dependent on Delta-rs, neither can write to such tables. See my post on deletion vectors to understand the importance of merge-on-read.
Local Development: DuckDB and Polars both win in the ‘local development’ category as the engines are super lightweight and can be run on a local computer with a simple PIP command. Spark is more complex, as it’s not possible to run the Fabric Spark Runtime locally. Therefore, you must connect remotely to a Fabric Spark cluster in VS Code (local or web) to get Fabric Spark-specific features. This experience is getting better every day but is not nearly as simple as running the actual engine locally.
Live Monitoring Capabilities: When doing development and you run something, you often might need to check to see what is actually happening. With Spark, you can look in the Spark UI or Fabric UI surfaced telemetry. It’s not perfect by any means, and the learning curve is steep, but once you have the basics figured out, it’s easy enough to check what is running, triage where something might be stuck, or evaluate live running query plans. With DuckDB, there’s a nice tqdm-style progress bar, while with Polars, you’re left to guess what might be going on and when your job might be done.
OneLake Auth Setup: Note, this is not a critique of the engine itself; this is an evaluation of how natively the engine is integrated to authenticate to OneLake (or ADLS) in Fabric.
- Spark: Easy—you don’t do anything; it just works.
- DuckDB: In hopes of avoiding more complex auth methods, I tried to get token authentication to work. I was blocked on this for a few hours until my colleague Mim Djouallah (he has some great blogs on DuckDB) saved the day and noted that I needed to upgrade to DuckDB version 1.1.3 to use this newer auth method. Once I had this one line of code, everything worked seamlessly.
- Polars: At first, I couldn’t get any Polars authentication to work, then Sandeep Pawar showed me that scan_delta() works with ABFSS paths without needing to specify auth (since it gets a token from env vars). ABFSS does not currently work with scan_parquet(), read_parquet(), and other similar methods. David Browne, however, pointed out that while ABFSS does not work for all methods, relative file paths do work: /lakehouse/default/Files since it interacts with the OneLake directory via a mount point instead of directly making ABFSS endpoint calls. I got everything working eventually, but this was frustrating to say the least.
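
The DuckDB write path described under Native Delta Writer can be sketched as follows. It assumes a DuckDB relation (for example the result of `duckdb.sql(...)`) and the `write_deltalake` function from the DeltaLake Python library. The function name `stream_relation_to_delta` and the optional `write_fn` parameter are hypothetical; `write_fn` exists only so the writer can be swapped out, and OneLake authentication is left out entirely.

```python
def stream_relation_to_delta(relation, table_uri: str, mode: str = "append", write_fn=None) -> None:
    if write_fn is None:
        # Imported lazily; assumes the deltalake package is installed.
        from deltalake import write_deltalake
        write_fn = write_deltalake

    # record_batch() yields a streaming Arrow reader, so rows are handed to the
    # writer in batches. arrow() would instead materialize the whole result as
    # one in-memory Arrow table, which is where the OOM issues came from.
    reader = relation.record_batch()
    write_fn(table_uri, reader, mode=mode)


# Usage sketch (table name and path are placeholders):
#   import duckdb
#   rel = duckdb.sql("SELECT * FROM some_view")
#   stream_relation_to_delta(rel, "/lakehouse/default/Tables/some_table", mode="overwrite")
```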

Implementation Cost Comparison

Engine	Learning Curve	Implementation Speed / Workflow Integration
Spark	Medium	Excellent
DuckDB	Medium	Ok
Polars	High	Ok

My Analysis

Learning Curve
- Spark: For myself, and I think for most people as well, learning distributed computing concepts that are critical to being successful with Spark is not a simple task. But once you get the basics, Spark is so mature that it can be hard to get too stuck. Plus, Spark supports SparkSQL, which is one of the best SQL dialects there is.
- DuckDB: I was quite surprised how long it took me to get going with DuckDB. I couldn’t figure out how to authenticate to OneLake until Mim told me I had to update DuckDB to the latest version (1.1.3). Once I was authenticated, I was challenged by how far from straightforward it was to take my PySpark code and refactor it as DuckDB. Beyond the below challenges I stumbled through, DuckDB is almost all SQL, and thus very easy to navigate once you get going:
  - No support for natively writing to Delta tables. This includes inserts, running optimize or vacuum. You can only write to Delta tables by converting your DuckDB DataFrame to an Arrow DataFrame and then using the Delta-rs Python library to do the actual write to Delta.
  - No support for natively reading from Hive Meta Store. You can use delta_scan() or register Delta tables as views. Not hard once you understand this.
  - I originally used the arrow() method to convert DuckDB DataFrames to Arrow Tables prior to writing to Delta and experienced OOM issues. Mim thankfully showed me that the record_batch() method should be used instead so that the data is streamed into Arrow format in batches. Quite a cool feature as this allows you to run on very constrained compute and prevent OOM. That said, this was not intuitive and I have yet to find the documentation on this specific method. Is there a reason why you’d use arrow() over record_batch()? I have no idea at this point, but it seems like record_batch() makes more sense to prevent OOM.
- Polars: Polars is a DataFrame API-centric engine, which is good news for those already comfortable with the Spark DataFrame API. That said, Polars adds additional (and possibly unnecessary?) complexity through the nuance of being able to control the evaluation model based on what methods you use. For example, read_parquet() is an eager evaluation method, while scan_parquet() is lazily evaluated. Calling the native write_delta() method to save data to a Delta table will throw an error if you chain it on top of a lazy-evaluated step, so you need to run collect() first before running write_delta() (but why can’t it just automatically do that???). Oh, and if you want to have the data be streamed for batch processing so that you can process data that is larger than your VM memory, you need to specify collect(streaming=True). I can see this level of control being fantastic if you live and breathe Polars, but this makes the learning curve pretty steep.
Workflow Integration / Implementation Speed: I’d define this category as how well the engine works to fit into a typical data engineering workflow. How well is it integrated into the platform? How do features of the engine impact how fast you can get work done, and do the features work with typical data engineering patterns? How complete is the engine itself, or does it feel more like a bolt-on capability?
- Spark: I live and breathe Spark, so the actual implementation was fast for me. For the average user, I’d still suggest it can be pretty fast since things like auth, evaluation, and both reader and writer capabilities are extremely robust. Spark is a standalone, full-featured data processing engine. AI/ML, Graph, structured, semi-structured—Spark can do it all at any data size.
- DuckDB: Ok. Could I swap some DuckDB into normal workflows? Certainly. Would I take additional time to refactor things since DuckDB doesn’t natively support Hive Meta Store and in-memory database concepts are fundamentally different? Yes. The necessity to pass DataFrames from DuckDB to the DeltaLake Writer and so forth is not hard when you get used to it, but the user experience of having to do this isn’t great and does impact the time to implement solutions.
- Polars: Ok. The positive here is that Polars offers a native Delta Lake writer method built on Delta-rs, which provides full-featured writes (including a merge operator), and authentication for OneLake was out-of-the-box—for Delta tables. The downside is that users need to learn the nuances of having tasks evaluated with potentially both eager and lazy evaluation in the same DataFrame. This adds additional work to figure out the most optimal way to code things. That said, like DuckDB, Polars is blazing fast for querying Delta tables, and this is a big positive. I was about to give Polars an OK+ rating but will leave off the plus since I could never get Polars to complete the tests below 16-vCores, even after successfully swapping in DuckDB for the data sampling and unsuccessfully trying to improve write performance for the large table by messing with write batch sizes.

I’d easily give Spark the win in this category.
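
To illustrate the Polars eager vs. lazy nuance from the Learning Curve notes, here is a minimal sketch of the collect-then-write step. It assumes the Polars version used in this benchmark (1.6.0), where `collect()` accepts `streaming=True`; newer releases may expose streaming differently. The function name is hypothetical.

```python
def write_lazyframe_to_delta(lazy_frame, target: str, mode: str = "append", streaming: bool = True):
    # A LazyFrame has to be materialized before write_delta() can be called.
    # streaming=True asks Polars to process the plan in batches so the data
    # does not need to fit in memory all at once.
    df = lazy_frame.collect(streaming=streaming)
    df.write_delta(target, mode=mode)
    return df


# Usage sketch (paths are placeholders):
#   import polars as pl
#   lf = pl.scan_parquet("/lakehouse/default/Files/some_folder/*.parquet")  # lazy
#   write_lazyframe_to_delta(lf, "/lakehouse/default/Tables/some_table", mode="overwrite")
```

Starting from `read_parquet()` instead would return an eager DataFrame, in which case `write_delta()` can be called directly with no `collect()` step.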

Engine Maturity and OSS Table Format Compatibility

With Polars, there’s no support for deletion vectors as its native Delta reader doesn’t yet support it and its writer uses Delta-rs bindings which don’t yet support it either. While DuckDB does support reading from tables with deletion vectors enabled, via Delta Kernel bindings, its dependency on Delta-rs for writing (after converting the DuckDB DataFrame to Arrow format) also blocks the ability to write to tables with deletion vectors enabled. Deletion vectors are a general best practice setting for Delta tables. If you want to use Polars or DuckDB to read or write to Delta tables, you need to weigh the impact of potential Delta compatibility issues which may block the ability to use newer/optimal Delta features. If your data is super small, not being able to use deletion vectors will have very minimal impact, but as your data volume increases, the potential impact can be significant.

In terms of engine maturity, Polars and DuckDB are both relatively new. In contrast, Spark has been around for over a decade, and we are now approaching GA of the 4th major release. Spark performance continues to improve, Spark capabilities are continuing to expand, and Spark is going nowhere. Just consider some of the upcoming Spark 4.0 features:

Stored Procedures
SQL scripting constructs
Data Source APIs (create your own spark.read class extension)
Improved error logging
Variant data types
Collation support
Structured logging

…and so much more. All I’m trying to point out is that the Spark community is taking real action on pretty much everything that Spark doesn’t excel at or doesn’t support. In terms of performance, both Fabric and Databricks provide native C++ engines within Spark that allow Spark jobs to run much faster than natively possible with vanilla OSS Spark. Spark is here to stay and continues to improve, so get used to it. :)

New doesn’t mean bad, just that you should be cautious about APIs or syntax changes and that the engine is not going to be as full-featured as an engine like Spark that has been around for over a decade.

Considerations when choosing data processing engines

Future data growth: Avoid needing to refactor all code because your data went from small to medium and now you need to rewrite your code as PySpark. If you have small data today and a non-Spark engine only runs 2x faster, I would still use Spark simply so that I don’t have to migrate once my data gets large, as well as to take advantage of the more robust engine capabilities.
Skillset of team: Spark is synonymous with data processing. Tons of people know Python, more know basic SQL, but Spark supports both and since it’s been around longer, more people will have this experience. That said, I highly encourage people to learn additional languages, frameworks, and engines, so don’t rule out using DuckDB or Polars because of a potential skillset gap—just be aware there might be some time needed for cross-skilling.
Performance: To summarize my performance analysis, Spark can be just as fast, and even faster, for typical data engineering tasks. DuckDB and Polars can be much faster than Spark for lightweight exploration tasks and maintenance operations.
Cost: In my benchmark, Spark was as cheap as DuckDB and cheaper than all engines as the allocated vCores scaled. The only two tests where Spark was not the cheapest were the 10GB 2 and 4-vCore benchmarks. Remember that the cost of an engine goes beyond the direct invoice you get from your cloud provider—you should consider the cost of time to learn, the cost for your team to upskill and refactor code, and the cost of longer development cycles through the engine not being as tightly integrated as you’d like.

Where would I use each engine?

Ok, I’ve done the benchmark, but where would I actually use each engine now that I’ve done some basic testing and can confidently say that I’m less ignorant when it comes to single-machine engines?

If I were to optimize for performance, cost, and engine maturity/compatibility, I would do the following (with exceptions):

Primary Spark Use Cases

Any and all “data processing.” Think E.L.T., the steps to extract, load, and transform your data in the Lakehouse architecture.

Primary DuckDB Use Cases

Interactive and ad-hoc queries
Data exploration
Data processing microservices

Primary Polars Use Cases

Honestly, with DuckDB generally outperforming Polars with zero tuning effort and fewer OneLake authentication issues, I’d probably start with DuckDB but certainly wouldn’t rule Polars out, particularly if the use case doesn’t require robust SQL capabilities (one area where DuckDB excels). Since Polars did win the 10GB 2-vCore test, I’d still give it a fair shot at the same use cases as DuckDB:

Interactive and ad-hoc queries
Data exploration
Data processing microservices

Primary DeltaLake Python Library Use Cases

I added this category since all of the VACUUM and OPTIMIZE operations in my benchmark for Polars and DuckDB technically were just using the DeltaLake Python library. Using a pure Python Notebook, I would use the DeltaLake library for:

Maintenance operations: Maintenance operations with this library were significantly faster compared to Spark. While you could use this library on a Spark cluster, there’s no need to have your worker nodes sit idle while you run lightweight jobs that only run on the driver node. Rather than running VACUUM and OPTIMIZE on a Spark cluster (where the table can fit into VM memory), I would split these maintenance jobs into a Python notebook (2-vCore for VACUUM) and have these jobs complete much faster, all while consuming much less compute.
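
A maintenance job of this kind might look roughly like the sketch below, using the `DeltaTable` class from the DeltaLake Python library. The function name `run_maintenance` and the `delta_table_cls` parameter are hypothetical (the latter only allows substituting the table class), storage options for OneLake are omitted, and a zero-hour retention mirrors the benchmark's VACUUM (0 Hours) phase rather than a setting I'd suggest for production.

```python
def run_maintenance(table_uri: str, retention_hours: int = 0, delta_table_cls=None):
    if delta_table_cls is None:
        # Imported lazily; assumes the deltalake package is installed.
        from deltalake import DeltaTable
        delta_table_cls = DeltaTable

    dt = delta_table_cls(table_uri)

    # VACUUM: delete files no longer referenced by the table. Retention below
    # the table's configured minimum requires disabling the safety check.
    removed_files = dt.vacuum(
        retention_hours=retention_hours,
        dry_run=False,
        enforce_retention_duration=False,
    )

    # OPTIMIZE: compact small files into larger ones.
    compaction_metrics = dt.optimize.compact()
    return removed_files, compaction_metrics


# Usage sketch (path is a placeholder):
#   removed, metrics = run_maintenance("/lakehouse/default/Tables/some_table")
```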

Here’s a quick visual to summarize where I think each engine makes sense for most Lakehouse architecture use cases.

Updated 12/16/24, I added Polars to the image above since it does support a basic SQL interface, thus making it a good candidate for ad-hoc analysis.

My Key Takeaways

Migrating off of Spark is all hype: I think the whole narrative that you should consider replacing your Spark workloads with DuckDB or Polars if your data is small is all hype. Yes, the engines have certainly earned their place at the table; however, Spark still reigns as king for data processing any way you look at it. Sure, DuckDB and Polars can marginally outperform Spark at data processing at the 10GB scale on a 4-vCore (or smaller) machine. I think the real story here is this:
- Each engine does something really well, so why not strategically mix and match them to take advantage of where each truly shines. Use Spark for ELT work, use the Rust-based DeltaLake Library on Python for maintenance operations, and use DuckDB or Polars for interactive queries on your small datasets.
I now have tremendous respect for Polars and DuckDB: While I prefer developing with Spark because I can seamlessly move between the extremely robust SparkSQL and the DataFrame API as needed, all while being able to scale to process massive amounts of data, DuckDB’s implementation of an in-memory SQL engine is remarkably powerful and supports many use cases—especially when access to a Spark cluster is not readily available. Polars, the newest kid on the block, is rapidly maturing. If its current capabilities are any indication, Polars will undoubtedly make the “which engine should I use” question even more challenging. DuckDB’s investment in developing a Spark API shows that they take Spark seriously and suggests they believe they can capture some of Spark’s market share by simplifying migration to DuckDB and making Spark devs feel at home. While this is likely to happen, I believe native vectorized engines that integrate with Spark and eliminate JVM inefficiencies—such as the Native Execution Engine (Velox & Gluten) in Microsoft Fabric and Photon in Databricks—will continue to make staying within the Spark ecosystem compelling, even for small-data use cases.
Performance with Spark more consistently scales as compute scales: I was extremely surprised to find that the performance of DuckDB and Polars was barely impacted by throwing more cores and memory at the benchmark. I’m sure there’s some magic that could be worked to tune things and get more efficient compute utilization as cores are increased, but this just isn’t something you often need to consider with Spark.
Memory spill matters!: While you want to avoid it, by default, Spark can spill memory to disk if needed, making it resilient to out-of-memory (OOM) issues. With DuckDB and Polars, I ran into OOM issues (100GB @ 2-vCore for DuckDB and 2, 4, and 8-vCore for Polars) ~~, and neither engine supports memory spilling to disk to prevent the memory exhaustion causing the VM to crash.~~ Corrected 12/16/24: Both Polars and DuckDB support memory spill to disk. That said, with both having OOM issues, I’m guessing that something here is not as efficient (or out-of-the-box) as Spark. I need to do some more triaging here. While memory spill causes Spark to run slower when it happens, it at least greatly reduces the risk of job failures and allows flexibility in compute sizing.
Distributed computing has compute overhead for task orchestration, but this adds fault tolerance: When DuckDB and Polars VMs crashed due to OOM, that was it—no automatic restart or ability to resume from where it left off. The same would happen with single-node Spark clusters. However, with multi-node Spark clusters (which most production workloads use), fault tolerance is built in. If a worker node crashes for any reason, the driver node maintains the task lineage and processing state so another VM can replace the worker and resume from where the crashed VM left off, without data loss. This may lead to some in-process transformations being reprocessed, but the engine guarantees that data writes are only performed once. See my blog on RDDs vs. DataFrames for more details.
Consider your specific workload: I designed my benchmark to reflect the typical lakehouse architecture that I see. Given that Spark has the biggest advantage for ELT-type data processing, if your use case involves infrequent small data loads (e.g., monthly), primarily interactive querying, or the necessity for an embedded in-memory database engine, DuckDB could be a great fit—especially for small data volumes.

Lastly, this is just another benchmark—do your own testing.
