Schema inference is convenient. In production or benchmarking, it is often a silent performance killer.
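The cost comes from the extra scan: to infer types, the reader has to inspect the data before it can load it (in Spark, `inferSchema=True` triggers a full additional pass over the file). A minimal pandas sketch of the alternative, supplying the schema up front so no guessing pass is needed; the column names and dtypes here are illustrative:

```python
import io
import pandas as pd

csv_data = "id,amount\n1,10.5\n2,20.0\n"

# With inference: pandas scans the values to guess each column's type.
inferred = pd.read_csv(io.StringIO(csv_data))

# With an explicit schema: the types are declared, so no guessing is needed,
# and a stray bad row fails loudly instead of silently widening the column.
explicit = pd.read_csv(
    io.StringIO(csv_data),
    dtype={"id": "int64", "amount": "float64"},
)

print(explicit.dtypes.to_dict())
```

On a two-row string the difference is invisible; on a multi-gigabyte file read repeatedly in a benchmark loop, the inference pass is pure overhead.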
I’m guilty. I’ve peddled the #NotebookEverything tagline more than a few times.
Coming from a notebook-first Spark background, I wanted to write the introduction to Spark Job Definitions (SJDs) that I wish I had when I first encountered them. If you're wondering why you might want to use a Spark Job Definition instead of a Notebook in the first place, see my blog here.
I’m excited to formally announce LakeBench, now at v0.3: the first Python-based multi-modal benchmarking library that supports multiple data processing engines across multiple benchmarks. You can find it on GitHub and PyPI.
Last December (2024) I published a blog exploring whether data engineers in Microsoft Fabric should ditch Spark for DuckDB or Polars. Six months have passed and all of the engines have matured. Where do things stand? Is it finally time to ditch Spark? Let The Small Data Showdown ’25 begin!