I recently switched from a Scala Spark shop to a Java Spark shop, and I was surprised by how differently we approach Spark development. At my previous company, we cared deeply about optimization: we avoided User-Defined Functions (UDFs) whenever possible and leaned on the DataFrame API to keep our ETL jobs fast. We could process hundreds of GB of data in just 10-15 minutes.
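To make it concrete, here is roughly the style we aimed for, written in Java for comparison; the paths, table, and column names below are made up, not our real pipeline:

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameStyleEtl {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("etl").getOrCreate();

        // Hypothetical input: everything is expressed as built-in column
        // expressions rather than per-row code or UDFs, so Catalyst can see
        // and optimize the whole plan.
        Dataset<Row> orders = spark.read().parquet("s3://bucket/orders");

        Dataset<Row> dailyRevenue = orders
                .filter(col("status").equalTo("COMPLETED"))
                .withColumn("revenue", col("quantity").multiply(col("unit_price")))
                .groupBy(col("order_date"))
                .agg(sum("revenue").alias("daily_revenue"));

        dailyRevenue.write().mode("overwrite").parquet("s3://bucket/daily_revenue");
        spark.stop();
    }
}
```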
In my new role, however, we use the Spark Streaming API and process each row individually, which seems to be the opposite of what I learned in a Java Spark Udemy course. Jobs are much slower as a result: it takes hours to process around 20 GB of data.
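Here is a simplified sketch of the pattern I mean (the source, batch interval, and per-record logic are placeholders, not our actual code):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class PerRowStreamingJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("per-row-job");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Placeholder source; ours is different, but the shape is the same:
        // every record is handled one at a time inside foreach, so none of
        // the work is expressed as DataFrame transformations that Catalyst
        // could optimize.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(rdd ->
                rdd.foreach(line -> {
                    String[] fields = line.split(",");
                    // per-row parsing, validation, and writes happen here
                    process(fields);
                })
        );

        ssc.start();
        ssc.awaitTermination();
    }

    private static void process(String[] fields) {
        // placeholder for the per-record business logic
    }
}
```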
I have a few questions about Java Spark development:
Is it common in Java Spark to use foreach and handle each row individually? Does Spark recognize common transformations written inside foreach and optimize the plan accordingly? And does the Scala best practice of working through DataFrame operations still apply in Java?
Is Java Spark, when written well, less performant than Scala Spark?
Could the streaming part of our pipeline be contributing to slower performance when dealing with smaller datasets like 20 GB?
I’d love to hear from others who have experience with Java Spark and can offer insights on how to optimize our pipeline.