Hey there, fellow data enthusiasts! I recently stumbled upon a project that caught my attention – building a structured database of math exam questions from Portuguese national final exams. The project involves extracting data from 45 PDFs, each covering a specific topic from the curriculum, with around 2,600 exercises in total.
The goal is to extract the following information from each exercise: topic, year, exam phase/type, question text in LaTeX format, any images, question type (multiple choice or open-ended), and, for multiple-choice questions, options A-D in LaTeX format or as images where needed.
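To make the target concrete, here is a minimal sketch of what one extracted record could look like in Python. The class and field names (`Exercise`, `Option`, `exam_phase`, `image_paths`, and so on) are my own placeholders, not something the project defines, and the `validate` rules are just one reasonable reading of the field list above.

```python
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import List, Optional
import json


class QuestionType(str, Enum):
    MULTIPLE_CHOICE = "multiple_choice"
    OPEN_ENDED = "open_ended"


@dataclass
class Option:
    """One multiple-choice option; holds LaTeX, an image path, or both."""
    label: str                        # "A", "B", "C" or "D"
    latex: Optional[str] = None       # option text as LaTeX, if extractable
    image_path: Optional[str] = None  # fallback when the option is an image


@dataclass
class Exercise:
    """Hypothetical record for a single exam exercise."""
    topic: str
    year: int
    exam_phase: str                   # exam phase/type, kept as free text
    question_latex: str
    question_type: QuestionType
    image_paths: List[str] = field(default_factory=list)
    options: List[Option] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record looks sane."""
        problems = []
        if self.question_type is QuestionType.MULTIPLE_CHOICE:
            labels = [o.label for o in self.options]
            if labels != ["A", "B", "C", "D"]:
                problems.append(f"expected options A-D, got {labels}")
            for o in self.options:
                if o.latex is None and o.image_path is None:
                    problems.append(f"option {o.label} has no content")
        elif self.options:
            problems.append("open-ended question should not have options")
        return problems


# Usage with made-up values (not taken from any real exam)
example = Exercise(
    topic="Example topic",
    year=2020,
    exam_phase="Phase 1",
    question_latex=r"Solve $x^2 - 1 = 0$.",
    question_type=QuestionType.MULTIPLE_CHOICE,
    options=[Option(label, latex=f"${i}$") for i, label in enumerate("ABCD")],
)
record_json = json.dumps(asdict(example), ensure_ascii=False)
```

Whatever extraction route you pick, running every record through a check like `validate` makes it much easier to spot where the pipeline went wrong across roughly 2,600 exercises.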
The question is: what’s the most reliable way to extract this kind of structured data from PDFs at scale? Would you use OCR tools, Python libraries like PyPDF2 or pdfquery, or perhaps machine learning models? Share your thoughts and experiences in the comments below!
Check out the sample PDF to get an idea of the data we’re working with. Let’s discuss the best approach to tackle this challenge.