a microsoft data platform blog

Announcing: 🌊 LakeBench

July 11, 2025

I’m excited to formally announce LakeBench, now in version v0.3, the first Python-based multi-modal benchmarking library that supports multiple data processing engines on multiple benchmarks. You can find it on GitHub and PyPi.

The Small Data Showdown '25: Is it Time to Ditch Spark Yet??

June 30, 2025

Last December (2024) I published a blog seeking to explore the question of whether data engineers in Microsoft Fabric should ditch Spark for DuckDb or Polars. Six months have passed and all engines have gotten more mature. Where do things stand? Is it finally time to ditch Spark? Let The Small Data Showdown ‘25 begin!

Elevate Your Code: Creating Python Libraries Using Microsoft Fabric (Part 2 of 2: Packaging, Distribution, and Consumption)

March 26, 2025

This is part 2 of my prior post that continues where I left off. I previously showed how you can use Resource folders in either the Notebook or Environment in Microsoft Fabric to do some pretty agile development of Python modules/libraries.

Mastering Spark: The Art and Science of Table Compaction

February 26, 2025

If there anything that data engineers agree about, it’s that table compaction is important. Often one of the first big lessons that folks will learn early on is that not compacting tables can present serious performance issues: you’ve gotten your lakehouse pilot approved and it’s been running for a couple months in production and you find that both reads and writes are increasingly getting slower and slower while your data volumes have not increased drastically. Guess what, you almost surely have a “small file problem”.

Automating V-Order: A Targeted Approach for Direct Lake Models

January 31, 2025

I’ve previously blogged in detail about V-Order optimization. In this post, I want to revisit the topic and demonstrate how V-Order can be strategically enabled in a programmatic fashion.

1 / 8