Uncover the Hidden Bottlenecks in Your ML Inference

Ever wondered why your machine learning inference is slow? You’re not alone. We’ve all been there, staring at a mess of data from torch.profiler, trying to make sense of it all. But what if I told you there’s a better way?
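
For context, here's the stock torch.profiler workflow that produces that wall of data. This is a minimal sketch: the Linear layer is a placeholder for your real model, and it assumes a CUDA-capable GPU.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input -- substitute your own inference workload
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

# Prints a long table of ops sorted by GPU time -- the "mess of data" in question
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```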

I’ve been working on a new tool that helps you identify the root cause of slow ML inference. It’s a profiler that shows you exactly where the compute time goes, from Python to CUDA kernels to PTX assembly. You can even drill down to see memory movements and kernel bottlenecks.
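
To approximate that kind of drill-down with stock tooling, torch.profiler can also record the launching Python stack, memory traffic, and an exportable timeline. Again a minimal sketch with a placeholder model, assuming a CUDA GPU; the tool described above goes further, down to PTX.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input -- substitute your own inference workload
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,       # map each CUDA kernel back to the Python code that launched it
    profile_memory=True,   # record allocations and host<->device copies
) as prof:
    with torch.no_grad():
        model(x)

# Open trace.json in chrome://tracing or Perfetto to walk the Python -> kernel timeline
prof.export_chrome_trace("trace.json")
```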

I’ve used this tool on Llama models and seen speedups of more than 50%. The best part? You can try it for free: the beta at keysandcaches.com comes with 10 hours of profiling, and the GitHub repo is at github.com/Herdora/kandc.

This tool is perfect for anyone running models locally and wondering why inference is slow. Give it a try and see the difference for yourself.
