a microsoft data platform blog

Generative AI and Spark: Leveraging LLMs for Accelerated Migrations

May 19, 2024

LLMs like ChatGPT and CoPilot are transforming every industry, so why not use them as a data engineer to free up time for more complex tasks? One thing every data engineer—and most humans—are revolted by is repetitive tasks. Thankfully, we don’t live in the world of iRobot and all we need are tokens to pay the LLM masters to get our work done.

The TCO of Photon in Databricks: Is it a No Brainer?

April 30, 2024

Photon is a native vectorized execution engine within Databricks, entirely written in C++, designed to massively boost performance on top of Spark by circumventing some of the JVM inefficiencies and better leveraging modern hardware.

The Fabric Concurrency Showdown: RunMultiple vs. ThreadPools

April 26, 2024

I recently blogged about cluster configuration options in Spark and how you can maximize compute utilization and processing time. Of the many options that I listed and data provided, I never gave any benchmarks comparing RunMultiple and Multithreading. The goal of this post is exactly that, drilling into real data that pushes the concurrency limits of both. Going forward I’ll reference Multithreading simply as ThreadPools since that is the specific Multithreading implementation that I’ll be testing.

Confessions of a Spark Convert

April 24, 2024

I’ve had a draft blog post labeled Are Azure Synapse Dedicated Pools Dead that I’ve periodically added thoughts to for the last year but haven’t pulled the trigger on publishing.

The SQL Decoder Ring for Replatforming to Fabric and Databricks

April 17, 2024

So, pretty much everyone seems to be on board with Lakehouse architecture these days, and for good reason. Decoupled compute and storage with all of the best data warehousing type features and more via Delta Lake is a pretty easy sell. It’s a complete no-brainer for greenfield development, but the conversation gets quite a bit more nuanced as you start talking about migrating, or more accurately, replatforming from a traditional data warehousing platform (coupled compute and storage) to Lakehouse.

5 / 8