Breaking Free from Spark: An Open-Source Data Validation Tool

As data engineers, we all know the importance of data validation. But let's be real: the standard tools assume you have Spark infrastructure, and that's a huge hurdle. I'm sure you've been there: spinning up EMR clusters just to check whether a column has nulls. The cost and complexity are enough to make you want to skip validation altogether. What if I told you there's a better way?

I built Term, an open-source data validation library that runs anywhere, without any JVM or cluster setup. It’s built on top of Apache DataFusion, which gives you Spark-like performance on a single machine. With Term, you can validate your data on your laptop, GitHub Actions, or EC2, without breaking the bank.
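
To make that concrete, here is a minimal sketch of the kind of single-process query the library builds on, written directly against the Apache DataFusion crate rather than Term's own API. The `events.csv` file and `user_id` column are placeholders I made up for the example: it's the null check from above, with no cluster in sight.

```rust
// Sketch using plain Apache DataFusion (requires the `datafusion` and `tokio` crates).
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // One in-process query engine: no JVM, no cluster, no EMR.
    let ctx = SessionContext::new();

    // Register a local CSV file as a queryable table.
    ctx.register_csv("events", "events.csv", CsvReadOptions::new())
        .await?;

    // "Does this column have nulls?" as a single SQL aggregate.
    let df = ctx
        .sql("SELECT COUNT(*) AS null_user_ids FROM events WHERE user_id IS NULL")
        .await?;
    df.show().await?;

    Ok(())
}
```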

Term comes with all the Deequ validation patterns, 100 MB/s single-core throughput, and built-in OpenTelemetry instrumentation for monitoring. Setup is a one-liner: run `cargo add term-guard` and you're good to go.
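
If you haven't used Deequ: its core patterns (completeness, uniqueness, range constraints) reduce to aggregate queries plus threshold assertions. Here's a rough, illustrative sketch of the completeness pattern written against DataFusion directly; this is not Term's actual API, and the table, column, and threshold are assumptions for the example.

```rust
use datafusion::arrow::array::Float64Array;
use datafusion::prelude::*;

/// Deequ-style completeness: the fraction of non-null values in a column.
async fn completeness(
    ctx: &SessionContext,
    table: &str,
    column: &str,
) -> datafusion::error::Result<f64> {
    // COUNT(column) counts only non-null values; COUNT(*) counts all rows.
    let sql = format!(
        "SELECT CAST(COUNT({column}) AS DOUBLE) / COUNT(*) AS ratio FROM {table}"
    );
    let batches = ctx.sql(&sql).await?.collect().await?;

    // The aggregate returns a single row with a single Float64 value.
    let ratio = batches[0]
        .column(0)
        .as_any()
        .downcast_ref::<Float64Array>()
        .expect("ratio should be Float64")
        .value(0);
    Ok(ratio)
}

// Usage (hypothetical threshold): fail the pipeline if user_id completeness drops below 99%.
// let ratio = completeness(&ctx, "events", "user_id").await?;
// assert!(ratio >= 0.99, "user_id completeness {ratio} is below 0.99");
```

The point is that each check of this kind is just one cheap aggregate over columnar data, which is why a single core gets you surprisingly far.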

Of course, there are some limitations. For now, it's only available in Rust, though Python and Node.js bindings are on the way. Single-node processing won't fit every workload, and streaming support is still missing. Still, I'm confident this tool can make a real difference in your data engineering workflow.

So, I want to hear from you. What data validation do you do today? Are you using Deequ, Great Expectations, or just hoping for the best? What validation rules do you need that current tools don’t handle well? Let’s discuss!
