I recently made the jump from Data Analyst to Data Engineer at a startup, and I’m facing a significant architectural challenge. I’d love to hear any advice or guidance from fellow data engineers out there.
The issue is this: we have an ETL pipeline that ingests data from Firestore; that data itself is generated by the OpenAI API from dynamic, client-specific prompts. My boss and the CTO decided this data should be stored in structured tables in a PostgreSQL database. Sounds reasonable, right? But here's the catch: this architecture has created two major problems.
Firstly, the JSON structure is client-dependent, which means that every time a client wants to add or remove a field, I receive a request to update the OpenAI prompt. This requires me to manually modify our ETL pipeline and run ALTER TABLE commands on the SQL database to accommodate the new schema. It’s a never-ending cycle of manual work and schema changes.
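To make that first problem concrete, here's a heavily simplified sketch of what one ingestion step roughly looks like today (the table name, field names, and connection string are all made up for illustration, and the real pipeline is messier than this):

```python
import psycopg2  # assuming psycopg2 as the Postgres driver; ours may differ

# Illustrative only -- each client's OpenAI output has a different shape,
# so this flattening logic gets edited by hand whenever a prompt changes.
def flatten_client_a(doc: dict) -> dict:
    return {
        "order_id": doc["order_id"],
        "sentiment": doc["analysis"]["sentiment"],
        "summary": doc["analysis"]["summary"],
    }

def load_client_a(records: list[dict]) -> None:
    conn = psycopg2.connect("dbname=reports")  # hypothetical connection string
    with conn, conn.cursor() as cur:
        # When a client asks for a new field, I also have to run something like
        #   ALTER TABLE client_a_results ADD COLUMN urgency TEXT;
        # and then keep this INSERT (and the Metabase cards) in sync with it.
        for doc in records:
            cur.execute(
                """
                INSERT INTO client_a_results (order_id, sentiment, summary)
                VALUES (%(order_id)s, %(sentiment)s, %(summary)s)
                """,
                flatten_client_a(doc),
            )
    conn.close()
```

Multiply that by every client and every prompt tweak, and you get the schema-change treadmill I'm describing.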
Secondly, these PostgreSQL tables directly feed client-facing reports in Metabase. The tight coupling between the rigid SQL schema and the reports makes every small change a multi-step, fragile, and time-consuming process.
So, I’m asking for your help: how can I handle this problem more effectively? Are there alternative architectures or key concepts I should learn to build a more flexible system that doesn’t break every time a client’s requirements change?
Some additional context: our entire pipeline is written in Python, and the data volume isn’t the issue (around 10,000 records daily). The main pain point is the constant manual effort required to adapt to schema changes.
Thanks in advance for any suggestions!