As a newcomer to a project focused on document forgery detection in medical claims, I'm facing a daunting task: evaluating the effectiveness of a forgery detection agent built around GPT-4.1. The agent takes base64-encoded images of documents like discharge summaries, hospital bills, and prescriptions, and classifies each one as authentic or forged.
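For context, the agent's core call looks roughly like the sketch below (the prompt wording, helper name, and response parsing are my own placeholders, not the project's actual code):

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_document(image_path: str) -> str:
    """Ask the model whether a scanned claim document looks authentic or forged."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this medical claim document AUTHENTIC or FORGED? "
                         "Answer with a single word."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().upper()
```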
The biggest hurdle is collecting the right data. I need a dataset that covers medical/health-claim documents, but so far I've come up empty-handed. Public forgery datasets like DocTamper (CVPR 2023) are great, but they don't cover medical claims.
The Data Conundrum
I need a dataset with paired authentic vs. forged health-claim reports, but I haven't found one yet. My evaluation metrics are accuracy and recall, so I need a good mix of authentic and tampered samples; recall in particular is only meaningful if the evaluation set contains enough genuinely tampered documents.
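Once I have labels, scoring them is the easy part. Something like this sklearn sketch is what I have in mind, treating forged as the positive class (the toy labels below are made up):

```python
from sklearn.metrics import accuracy_score, recall_score

# y_true: ground-truth labels, y_pred: the agent's verdicts (1 = forged, 0 = authentic)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))             # fraction of correct verdicts
print("recall:  ", recall_score(y_true, y_pred, pos_label=1))  # forged docs actually caught
```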
Possible Solutions
I’ve considered two approaches:
- Synthetic generation: designing templates in Canva/Word/ReportLab and then programmatically tampering them with OpenCV/Pillow. This would let me build a dataset of discharge summaries, bills, and other medical documents with tampered elements like changed totals, dates, and signatures (a rough sketch of this pipeline follows the list).
- Leveraging existing datasets: Pretraining with a dataset like DocTamper or a receipt forgery dataset, and then fine-tuning/evaluating on synthetic health docs.
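To make the synthetic route concrete, here is the kind of pipeline I'm picturing. For brevity I've rendered the "template" directly with Pillow instead of Canva/Word/ReportLab, and the layout, filenames, and tampered field are all invented for illustration:

```python
from PIL import Image, ImageDraw, ImageFont

def render_bill(total: str) -> Image.Image:
    """Render a toy hospital bill as an image (stand-in for a real template)."""
    img = Image.new("RGB", (600, 300), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    draw.text((40, 40), "City Hospital - Inpatient Bill", fill="black", font=font)
    draw.text((40, 100), "Patient: Jane Doe     Admitted: 2024-03-02", fill="black", font=font)
    draw.text((40, 160), f"Total payable: {total}", fill="black", font=font)
    return img

def tamper_total(img: Image.Image, new_total: str) -> Image.Image:
    """Simulate tampering: white out the total line and write a new amount."""
    forged = img.copy()
    draw = ImageDraw.Draw(forged)
    draw.rectangle([40, 155, 400, 180], fill="white")  # erase the original line
    draw.text((40, 160), f"Total payable: {new_total}", fill="black",
              font=ImageFont.load_default())
    return forged

authentic = render_bill("$1,240.00")
forged = tamper_total(authentic, "$12,400.00")   # inflated claim amount
authentic.save("bill_authentic.png")             # label: 0 (authentic)
forged.save("bill_forged.png")                   # label: 1 (forged)
```

A real version would obviously need richer templates, varied fonts, scan noise, and subtler edits so the forgeries aren't trivially detectable.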
A Call to the Community
Has anyone come across an open dataset of forged medical/insurance claim documents? If not, what’s the most efficient way to generate a realistic synthetic dataset of health-claim docs with tampering? Any advice on annotation pipelines/tools for labeling forged regions or just binary forged/original would be greatly appreciated.
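In case it helps frame the annotation question, this is roughly the per-document record I'm imagining; it supports both a binary label and optional tampered-region boxes (the field names are just my working assumption):

```python
import json

# One record per document; "regions" stays empty for authentic docs or
# for a purely binary (forged/original) labeling pass.
record = {
    "file": "bill_forged.png",
    "label": "forged",                 # or "authentic"
    "regions": [                       # optional tampered-region boxes (pixel coordinates)
        {"x": 40, "y": 155, "w": 360, "h": 25, "kind": "amount_changed"},
    ],
    "source": "synthetic",             # synthetic vs. real-world, if I mix sources later
}

with open("annotations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```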
If you have any guidance, papers, or tools to share, please do! I’m still new to this and could use all the help I can get.