One of the most critical challenges in large-scale data processing with Apache Spark is tracking what each job is doing. As Spark applications grow in complexity, understanding what's running and when becomes difficult, especially in the Spark UI, where jobs are often labelled only by an auto-generated call site.
Every software platform has its own terminology, and when terms overlap but don’t mean the same thing, it can be quite confusing. For example, coming from my years as a developer in Databricks land, I initially assumed that Fabric Spark Pools were just like Pools in Databricks. However, as I discovered, this assumption was completely wrong—and understanding this distinction is key to designing the right architecture.
At the time of writing this post, Fabric Spark Runtimes enable Optimized Write by default as a Spark configuration setting. This Delta Lake feature aims to improve read performance by compacting data into fewer, larger files at write time. That said, what is the performance impact of this default, and are there scenarios where it should be disabled?
How do you develop a Python library in Microsoft Fabric while keeping the ability to fully test your code before packaging it?
With Microsoft Build 2024 underway, the wave of announcements is hot off the press! This is a recap of some of the data-engineering-specific updates that I'm particularly excited about.