As I worked on a recent take-home project, I couldn’t help but think that data engineering is just backend distributed systems in disguise. The project involved ETL from PubSub, which felt eerily similar to building a backend distributed system with Postgres and pub/sub. I had to handle deduplication and exactly-once processing, think about horizontal scaling, and ensure idempotent behavior – all familiar territory for a backend engineer.
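To make the deduplication and idempotence point concrete, here is a minimal sketch of a consumer that tolerates redelivered messages. It is an illustration, not the project's actual code: the `events` table, the `handle_message` helper, and the message IDs are hypothetical, and SQLite stands in for Postgres so the sketch runs without a server. In Postgres the equivalent statement would be `INSERT ... ON CONFLICT (message_id) DO NOTHING`.

```python
import sqlite3


def make_store():
    # Hypothetical sink table: the message ID is the primary key,
    # so the database itself rejects a second copy of the same message.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (message_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def handle_message(conn, message_id, payload):
    """Write one message; return True if stored, False if it was a duplicate."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (message_id, payload) VALUES (?, ?)",
            (message_id, payload),
        )
    # rowcount is 1 when the row was inserted and 0 when it was ignored.
    return cur.rowcount == 1


# Simulate at-least-once delivery: message "m1" arrives twice.
conn = make_store()
deliveries = [("m1", "a"), ("m2", "b"), ("m1", "a")]
results = [handle_message(conn, mid, body) for mid, body in deliveries]
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The design choice here is to let a unique key absorb redeliveries rather than tracking seen IDs in application memory, so running several consumers in parallel does not change the outcome. This gives effectively-once results on top of at-least-once delivery; it is not a guarantee from the message broker itself.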
But what struck me was that the role title was ‘distributed systems engineer’, not ‘data engineer’ or ‘backend engineer’. It made me wonder – are data engineers just backend engineers with a different name tag?
In this project, I felt like I needed to use Apache Arrow for the transformation, but the time constraint of 4 hours made it challenging. I’ve spent about 20 hours on it so far, mostly because my Postgres/SQL skills aren’t sharp, and I had to learn GCP PubSub from scratch.
So, is data engineering just backend distributed systems with a different name? Or is there more to it than meets the eye? I’d love to hear your thoughts.