One of the most critical challenges in large-scale data processing with Apache Spark is tracking what each job is doing. As Spark applications grow in complexity, understanding what's running and when becomes difficult, especially in the Spark UI, where jobs are often labelled only by an auto-generated call site.
Every software platform has its own terminology, and when terms overlap but don’t mean the same thing, it can be quite confusing. For example, coming from my years as a developer in Databricks land, I initially assumed that Fabric Spark Pools were just like Pools in Databricks. However, as I discovered, this assumption was completely wrong—and understanding this distinction is key to designing the right architecture.
At the time of writing this post, Fabric Spark Runtimes enable Optimized Write by default as a Spark configuration setting. This Delta Lake feature aims to improve read performance by compacting data into fewer, larger files at write time. That said, what is the performance impact of this default, and are there scenarios where it should be disabled?
How do you develop a Python library in Microsoft Fabric while keeping the ability to fully test your code before packaging it?
With Microsoft Build 2024 underway, the wave of announcements is hot off the press! This is a recap of some of the data-engineering-specific updates that I'm particularly excited about.