Options, options, options. There are now plenty of documented ways to connect from a Spark (or, soon, Python) notebook and run Data Query Language (DQL) or Data Manipulation Language (DML) commands on top of the SQL Endpoint or Fabric Warehouse. Sandeep has already recapped these options on his blog, and Bob Duffy explores another method, using PyOdbc + SqlAlchemy, in his post. Each method has its pros and cons, but I wanted to jump in with yet another way to connect, one I believe is the simplest and most streamlined method available in Python.
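For context, here is a minimal sketch of the PyOdbc + SqlAlchemy route mentioned above, assuming the ODBC Driver 18 for SQL Server is installed; the server and database names are placeholders you would copy from your own SQL endpoint's connection details.

```python
import pandas as pd
import sqlalchemy as sa

# Placeholder values: copy these from the SQL endpoint's connection string in Fabric.
server = "yourworkspace.datawarehouse.fabric.microsoft.com"
database = "YourWarehouse"

# Build a SQLAlchemy URL for the mssql+pyodbc dialect with interactive Entra ID sign-in.
connection_url = sa.engine.URL.create(
    "mssql+pyodbc",
    host=server,
    database=database,
    query={
        "driver": "ODBC Driver 18 for SQL Server",
        "Authentication": "ActiveDirectoryInteractive",
        "Encrypt": "yes",
    },
)

engine = sa.create_engine(connection_url)

# Run a simple DQL statement and pull the result into pandas (table name is hypothetical).
with engine.connect() as conn:
    df = pd.read_sql(sa.text("SELECT TOP 10 * FROM dbo.my_table"), conn)

print(df)
```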
Fabric Spark Runtimes currently enable V-Order optimization by default via a Spark configuration setting. V-Order is a Parquet write-time optimization that seeks to logically organize data based on the same storage algorithm used in Power BI's VertiPaq engine.
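If you want to check or override that default, a session-level toggle is the simplest starting point. The sketch below assumes the commonly documented spark.sql.parquet.vorder.enabled key and a hypothetical table name; the exact key and table property can differ between Fabric runtime versions, so verify them against the docs for your runtime.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the current session-level V-Order setting (key assumed; may vary by runtime).
print(spark.conf.get("spark.sql.parquet.vorder.enabled", "not set"))

# Disable V-Order for this session, e.g. for write-heavy staging tables.
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")

# Or control it per table with a table property (hypothetical table name).
spark.sql("""
    ALTER TABLE my_lakehouse_table
    SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'false')
""")
```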
One of the most critical challenges in large-scale data processing with Apache Spark is tracking what each job is doing. As Spark applications grow in complexity, understanding what’s running and when can become difficult, especially when looking at the Spark UI.
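One lightweight way to make the Spark UI easier to read is to label work before triggering it, using the standard setJobGroup and setJobDescription APIs. The sketch below is only illustrative; the group name, description, and toy aggregation are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Everything triggered from here on appears under this group and description in the Spark UI.
sc.setJobGroup("daily-load", "Load and deduplicate orders", interruptOnCancel=False)
sc.setJobDescription("orders: aggregate by key and count")

# A toy action so the labeled job actually shows up in the UI.
result = (
    spark.range(1_000_000)
    .selectExpr("id % 10 AS key")
    .groupBy("key")
    .count()
    .collect()
)

# Reset the labels so later, unrelated jobs are not mis-attributed.
sc.setJobDescription(None)
sc.clearJobGroup()
```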
Every software platform has its own terminology, and when terms overlap but don’t mean the same thing, it can be quite confusing. For example, coming from my years as a developer in Databricks land, I initially assumed that Fabric Spark Pools were just like Pools in Databricks. However, as I discovered, this assumption was completely wrong—and understanding this distinction is key to designing the right architecture.
At the time of writing this post, Fabric Spark Runtimes enable Optimized Write by default as a Spark configuration setting. This Delta feature aims to improve read performance by coalescing small writes into fewer, larger files closer to an optimal size. That said, what is the performance impact of this default setting, and are there scenarios where it should be disabled?
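For anyone who wants to experiment before accepting the default, the sketch below shows one way to check and disable Optimized Write at the session or table level. The session key shown is the one Microsoft documents for Synapse-style runtimes, so verify it against your Fabric runtime version; the table name is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the current session-level setting (key assumed from Microsoft docs; verify for your runtime).
print(spark.conf.get("spark.microsoft.delta.optimizeWrite.enabled", "not set"))

# Disable Optimized Write for the whole session, e.g. when benchmarking small, frequent appends.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "false")

# Or opt a single table out via the standard Delta table property (hypothetical table name).
spark.sql("""
    ALTER TABLE my_lakehouse_table
    SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'false')
""")
```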