I recently blogged about cluster configuration options in Spark and how you can maximize compute utilization and minimize processing time. Despite the many options I listed and the data I provided, I never gave any benchmarks comparing RunMultiple and Multithreading. The goal of this post is exactly that: drilling into real data that pushes the concurrency limits of both. Going forward I’ll refer to Multithreading simply as ThreadPools, since that is the specific Multithreading implementation that I’ll be testing.
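For readers who haven't used the ThreadPool pattern, here is a minimal sketch in plain Python using the standard library's `concurrent.futures.ThreadPoolExecutor`. The `run_job` function and the job names are hypothetical stand-ins for whatever unit of work you would submit (a notebook run, for example), and this is not the benchmark code from the post.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_job(name: str) -> str:
    """Hypothetical stand-in for one unit of work (e.g. one notebook run)."""
    time.sleep(0.05)  # simulate I/O-bound work such as waiting on a remote call
    return f"{name}: done"


def run_concurrently(job_names, max_workers=4):
    """Run all jobs on a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs
        return list(executor.map(run_job, job_names))


jobs = [f"job_{i}" for i in range(8)]
results = run_concurrently(jobs)
print(results)
```

The `max_workers` argument is the knob that caps how many jobs run at once, which is the concurrency limit a benchmark like this would vary.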
I’ve had a draft blog post labeled “Are Azure Synapse Dedicated Pools Dead” that I’ve periodically added thoughts to for the last year, but I haven’t pulled the trigger on publishing it.
So, pretty much everyone seems to be on board with Lakehouse architecture these days, and for good reason. Decoupled compute and storage with all of the best data warehousing type features and more via Delta Lake is a pretty easy sell. It’s a complete no-brainer for greenfield development, but the conversation gets quite a bit more nuanced as you start talking about migrating, or more accurately, replatforming from a traditional data warehousing platform (coupled compute and storage) to Lakehouse.
With Microsoft going all in on Delta Lake, the landscape for data architects deeply integrated with the Microsoft stack is undergoing a significant transformation. The introduction of Fabric and OneLake with a Delta Lake driven architecture means that the decision on which data platform to use no longer hinges on the time and complexity of moving data into the platform’s data store. Instead, we can now virtualize our data where it sits, enabling various Delta-centric compute workloads, including Power BI.
Have you ever needed to delve into the Information Schema within a notebook environment? There are myriad reasons for wanting to do so, such as programmatically recreating view definitions in another lakehouse, identifying table dependencies via view definitions, and locating tables that include a soon-to-be-dropped column.