Configuring Apache Spark on FlashBlade, Part 3: Tuning for True Parallelism
Part one of this series covered the fundamentals of configuring Apache Spark with FlashBlade® NFS and S3. Part two demonstrated how to automate a data science environment. In this installment, we'll tackle a crucial performance question: how do you optimize your Spark jobs even when the storage underneath them is incredibly fast?
This post will explore how to diagnose and resolve performance bottlenecks that are not related to storage I/O, ensuring you can take full advantage of the high-performance, disaggregated architecture of FlashBlade. We'll use a real-world scenario to illustrate how specific tuning can unlock massive parallelism.
Code examples and configuration snippets are included.
The Myth of "Plug-and-Play" Performance
A common assumption when moving from legacy HDFS to a modern, disaggregated storage solution like FlashBlade is that performance will increase automatically. While fast storage is a critical foundation, it often exposes another bottleneck in the stack: the Spark application configuration itself.
If your storage can deliver data at 10 GB/s but your application can only process it at 1 GB/s, the job will only run at 1 GB/s. The storage sits idle, waiting for the application to catch up. This is a classic sign of underutilization. The most common culprits are no longer slow disks but issues within the Spark application itself that prevent parallel execution, such as:
- Data Skew: One or a few tasks receive a disproportionately large amount of data to process. The entire job must wait for these "straggler" tasks to finish, nullifying the benefit of having hundreds of other cores and fast storage waiting.
- Driver Bottlenecks: The driver node is responsible for coordinating tasks across all executors. If the driver is under-resourced or too busy, it can't assign new work fast enough, leaving executor cores idle.
- Insufficient Parallelism: If your dataset is split into too few partitions, you can't use all the available executor cores. Having 1,000 cores is useless if you only give them 50 partitions of data to work on.
The key is to ensure your Spark job can parallelize its work effectively to match the parallelism of the underlying storage.
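As a concrete illustration, the PySpark sketch below shows how to check how many partitions a DataFrame has and raise the parallelism so every executor core has work to do. The dataset path and the partition count of 800 are hypothetical placeholders, not values from the original engagement.

```python
# Minimal PySpark sketch: verify and raise partition counts so all cores stay busy.
# The dataset path and the value 800 are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallelism-check")
    # Rule of thumb: roughly 2-4 shuffle partitions per available executor core.
    .config("spark.sql.shuffle.partitions", "800")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/events/")
print(df.rdd.getNumPartitions())   # far fewer partitions than cores means idle executors

# Repartition before expensive wide operations so the work spreads across the cluster.
df = df.repartition(800)
```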
Diagnosing Underutilization in the Spark Web UI
Before we can fix the problem, we need to find it. The Spark Web UI is the best place to start. If you don’t have access to the live UI but can still pull application history from the Spark History Server, you can use tools like spark-sight. What I really like about spark-sight is that it focuses on CPU and memory usage during execution, which is exactly the lens you need to optimize these workloads.
If you suspect your high-performance storage is being underutilized, look for these patterns.
The Event Timeline: A Visual Red Flag
The most powerful view is the Event Timeline within the "Stages" tab of the Spark UI.
- What You Want to See: A dense, solid "wall" of green bars, indicating that all executors are tightly packed with tasks and are constantly computing. This is a sign of excellent parallelism.
- What You Don't Want to See: Significant white space between tasks or extremely uneven task bars. Gaps mean an executor was idle, waiting for the driver to assign work. A single task bar that is 10x longer than the others is a classic sign of data skew.
Task-Level Metrics: The Smoking Gun
To confirm your suspicions, drill into a stage and view its directed acyclic graph (DAG), which represents the DataFrames and the operations applied to them, along with the stage's task-level metrics. Inspecting these can pinpoint the exact problem:
- Duration: Look at the summary statistics (min, median, max). A max duration that is orders of magnitude larger than the median is the clearest indicator of data skew. Sort the table to find the problem task.
- Input Size / Records: Sort by this column. If one or a few tasks have a vastly larger input size than the others, you have confirmed that your input data is skewed.
- Scheduler Delay: If this value is high across many tasks, it indicates a driver bottleneck. Your executors and storage are ready, but the driver can't coordinate them fast enough.
- GC Time: If executors spend significant time on Java Garbage Collection, they are under memory pressure and are pausing work, leaving the storage idle.
By correlating gaps in the timeline with a huge max task duration and a corresponding outlier in input size, you can definitively diagnose that your job is failing to parallelize its work optimally.
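If you want to double-check the skew outside the UI, a quick aggregation over the suspected key will surface it. Here is a minimal sketch; the dataset path and the key column customer_id are placeholders for whatever key your job shuffles or joins on.

```python
# Hedged sketch: confirm data skew by counting records per key.
# The path and the key column "customer_id" are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()
df = spark.read.parquet("s3a://example-bucket/events/")

key_counts = (
    df.groupBy("customer_id")
      .count()
      .orderBy(F.desc("count"))
)
key_counts.show(10)   # a handful of keys holding most of the rows confirms skew
```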
Configuration Tuning for S3A Performance
Once you've diagnosed the bottleneck, you can apply specific configuration changes. A customer recently experienced this exact issue: moving a large-scale job from HDFS to S3A on FlashBlade did not yield the expected performance gains at first. The job was bottlenecked and experienced some severe errors.
Key S3A Configuration Parameters
As we worked with the customer, we observed that some settings within the application were still preventing it from taking full advantage of the storage appliance they had purchased.
Using spark-submit, the command-line utility for launching Spark applications and passing execution parameters, we identified a few settings that were critical in resolving the performance issues when writing data to the S3A object store.
- Use the Version 2 Output Committer: The default (version 1) committer in Hadoop 2.x is inefficient for object stores because it relies on slow rename operations at job commit. Version 2 commits task output directly to the final destination, which performs much better on S3.
```bash
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
```
- Enable Fast Uploads: This setting enables multipart uploads, which allows a single output stream to be written in parallel, significantly improving write throughput for large files.
```bash
--conf spark.hadoop.fs.s3a.fast.upload=true
```
- Configure a Buffer Directory: By default, the S3A connector may buffer data to /tmp before uploading. If this local directory is slow or too small, it becomes a bottleneck. Pointing the buffer to a larger, dedicated filesystem can dramatically help. This is especially useful for environments where workers have access to high-speed local or shared storage for temporary data. The same FlashBlade can also provide high-performance shared NFS storage for this purpose. As the Hadoop documentation states, this can "absorb bursts of data."
```bash
# Point to a larger, faster local volume for buffering, such as FlashArray or FlashBlade
--conf spark.hadoop.fs.s3a.buffer.dir=/path/to/larger/volume
```
Together, these three settings are simple steps you can take to resolve application-level bottlenecks and let your jobs better leverage the full parallelism of the FlashBlade storage backend.
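For reference, the same three settings can also be applied programmatically when the SparkSession is built, which can be convenient in notebook environments. This is only a sketch: the application name, buffer path, and S3 paths are placeholders, and the settings must be in place before the session is created.

```python
# Sketch: applying the same three S3A-related settings via SparkSession config
# instead of spark-submit --conf flags. Paths and names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-write-tuning")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .config("spark.hadoop.fs.s3a.buffer.dir", "/mnt/fast-scratch/s3a")
    .getOrCreate()
)

# Any DataFrame write to s3a:// now benefits from the tuned committer and upload path.
df = spark.read.parquet("s3a://example-bucket/events/")
df.write.mode("overwrite").parquet("s3a://example-bucket/curated/events/")
```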
Conclusion
Moving to a high-performance, disaggregated storage platform like FlashBlade is a critical first step in modernizing analytics, but it often reveals other performance bottlenecks that lie within the application itself. By analyzing Spark task-level metrics for outliers in duration and input size, engineers can pinpoint the exact cause of the inefficiency and confirm that the storage is sitting idle while the application struggles to distribute work. Combining that diagnosis with modern features like Adaptive Query Execution (AQE) and properly sized executor and driver memory ensures the application can match the parallelism of a storage platform like FlashBlade.
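As a final, hedged sketch of that last point, the snippet below shows one way to enable AQE in Spark 3.x; the feature names are standard Spark configuration keys, but whether they help depends on your workload.

```python
# Illustrative sketch only: enabling Adaptive Query Execution (AQE) in Spark 3.x.
# Executor and driver memory are usually sized at submit time, e.g. with
# spark-submit --executor-memory / --driver-memory, rather than in code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-tuned-job")
    .config("spark.sql.adaptive.enabled", "true")                    # re-plan stages at runtime
    .config("spark.sql.adaptive.skewJoin.enabled", "true")           # split skewed join partitions
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") # merge tiny shuffle partitions
    .getOrCreate()
)
```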