As a product manager venturing into the realm of platform management, I’ve been diving into the fascinating world of data ingestion systems for compliance and archival purposes. Specifically, I’m exploring the design of a system that ingests massive volumes of data, such as emails and Teams messages, for regulatory archiving. In this post, I’ll outline my current understanding of the data flow and pose a few questions I’d like help clarifying.
The system I’m looking at ingests millions of messages daily, with a focus on scalability, resiliency, and availability. The tech stack includes Java, Spring Boot, Event-Driven Microservices, Kubernetes, Apache Pulsar, Zookeeper, Ceph, Prometheus, and Grafana. Here’s my current understanding of the data flow:
TEAMS (or similar sources) → REST API → PULSAR (as message broker) → CEPH (object storage for archiving)

PULSAR → CONSUMERS (downstream services)
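To make the REST API → PULSAR hop concrete, here is a minimal sketch of what the ingestion endpoint might look like with Spring Boot and the Pulsar Java client. The service URL, topic name, and raw-bytes payload are illustrative assumptions on my part, not details from the actual system.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class IngestController {

    private final Producer<byte[]> producer;

    public IngestController() throws Exception {
        // In a real service the client and producer would be Spring-managed beans;
        // the service URL and topic name here are placeholders.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        this.producer = client.newProducer()
                .topic("persistent://compliance/ingest/messages")
                .build();
    }

    // Accept a raw message from an upstream connector (e.g. a Teams export)
    // and hand it to Pulsar untouched.
    @PostMapping("/ingest")
    public ResponseEntity<String> ingest(@RequestBody byte[] rawMessage) throws Exception {
        producer.send(rawMessage);
        return ResponseEntity.accepted().body("queued");
    }
}
```

The assumption baked into this sketch is that the endpoint does nothing beyond enqueueing, so Pulsar absorbs ingestion spikes and everything downstream can scale independently.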
Now, I have a few key questions regarding the design:
1. Should we persist data immediately upon ingestion, before any transformation, for compliance purposes? (A rough sketch of what this could look like follows this list.)
2. Do we own the data transformation/normalization step, and where does that happen in the flow?
3. Given the use of Pulsar and focus on real-time ingestion, can we assume this is a streaming-only system with no batch processing involved?
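On question 1, here is a rough sketch of what "persist raw before transforming" could look like: a dedicated Pulsar subscription whose only job is to copy the untouched payload into Ceph. It assumes Ceph is exposed through its S3-compatible RADOS Gateway; the endpoint, credentials, bucket, topic, and subscription names are all placeholders rather than details of the real system.

```java
import java.net.URI;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.S3Configuration;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class RawArchiver {

    public static void main(String[] args) throws Exception {
        PulsarClient pulsar = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // A dedicated subscription keeps raw archiving independent of any
        // transformation/normalization consumers reading the same topic.
        Consumer<byte[]> consumer = pulsar.newConsumer()
                .topic("persistent://compliance/ingest/messages")
                .subscriptionName("raw-archiver")
                .subscribe();

        // Ceph's RADOS Gateway speaks the S3 API, so a standard S3 client works;
        // endpoint and credentials are placeholders.
        S3Client ceph = S3Client.builder()
                .endpointOverride(URI.create("http://ceph-rgw:7480"))
                .region(Region.US_EAST_1) // required by the SDK, ignored by Ceph
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("placeholder-key", "placeholder-secret")))
                .serviceConfiguration(S3Configuration.builder()
                        .pathStyleAccessEnabled(true)
                        .build())
                .build();

        while (true) {
            Message<byte[]> msg = consumer.receive();

            // Write the untouched payload first...
            ceph.putObject(PutObjectRequest.builder()
                            .bucket("raw-archive")
                            .key(msg.getMessageId().toString())
                            .build(),
                    RequestBody.fromBytes(msg.getData()));

            // ...and only acknowledge once the object is durably stored, so a
            // failed write is redelivered by Pulsar instead of being lost.
            consumer.acknowledge(msg);
        }
    }
}
```

If the answer to question 1 is yes, I’d expect the transformation/normalization step (question 2) to run as a separate consumer on the same topic, so a bad transform could always be re-run against the untouched copy.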
I’d appreciate feedback on whether this architecture makes sense for a compliance-oriented ingestion system, and on any critical considerations I may have missed. Oh, and I’m also curious about the role of the Aerospike cache in this system.