Hey there, data engineering folks! I’m stuck on a problem and hoping someone out there has faced this too. I’m trying to set up incremental syncs from our production database, but I’m running into a weird schema behavior. The source DB has both created_at and updated_at columns, but here’s the catch: updated_at is NULL until a row gets updated for the first time, and many rows are never updated after insert, so they only have created_at, no updated_at. Using updated_at as a cursor means I completely miss these rows.
The obvious workaround would be to coalesce created_at and updated_at, or maybe maintain a derived last_modified column… but here’s the real problem: I have read-only access to the DB, CDC isn’t enabled, and enabling it would require a DB restart, which isn’t feasible. That means I can’t modify the schema, add computed fields, or enable CDC.
So, how do you handle cases where the cursor field is unreliable and you’re locked out of changing the source? Have you hit this issue before? How did you overcome it? I’d appreciate any tips or advice on how to do incremental syncs into a lake or warehouse without full table scans.
Let’s dive into the details and explore possible solutions together.