I'm working on my thesis project on metadata extraction, and navigating the vast landscape of Large Language Models (LLMs) feels overwhelming. My project aims to extract metadata such as module names, credit points, language of instruction, semester, duration, and responsible lecturers from academic module handbooks in PDF format.
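For concreteness, here's roughly the record shape I'm targeting per module (the field names and types are my own working choices, not fixed yet):

```python
from dataclasses import dataclass
from typing import Optional

# Working sketch of one extracted record; field names are placeholders
# and may change as the schema evolves.
@dataclass
class ModuleMetadata:
    module_name: str
    credit_points: Optional[float]          # e.g. ECTS credits
    language_of_instruction: Optional[str]  # e.g. "English", "German"
    semester: Optional[str]                 # e.g. "winter" / "summer"
    duration: Optional[str]                 # e.g. "1 semester"
    responsible_lecturers: list[str]
```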
So far, I've managed to extract the text from around 50 PDFs using Python libraries like pymupdf and pdfplumber. I've also generated augmented training data from manually labeled input/output pairs (sample text plus the expected metadata), reflecting each PDF's layout and formatting.
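My extraction step is essentially along these lines (simplified; pymupdf is imported as `fitz`), with an illustrative training pair to show the format I'm using:

```python
import fitz  # PyMuPDF
import pdfplumber

def extract_text_pymupdf(path: str) -> str:
    # Fast plain-text extraction; a good default for running text.
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def extract_text_pdfplumber(path: str) -> str:
    # Layout-aware extraction; handles tables better but is slower.
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

# Example augmented training pair (all values are illustrative):
train_example = {
    "input": "Modul: Advanced Databases | 6 ECTS | Sprache: Englisch | Wintersemester",
    "output": {
        "module_name": "Advanced Databases",
        "credit_points": 6,
        "language_of_instruction": "English",
        "semester": "winter",
        "duration": "1 semester",
        "responsible_lecturers": ["Prof. Example"],
    },
}
```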
However, I'm stuck on several fronts. I need help understanding which LLMs to consider for this task, how to ensure the model ignores irrelevant text, and whether parameter-efficient finetuning with LoRA or full supervised finetuning is more suitable. I'm also unsure how to handle PDFs whose text exceeds the model's context window, and what essential elements a solution prototype should include.
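In case it helps clarify what I mean: the LoRA setup I've been considering uses Hugging Face's peft library, roughly like this (the base model name is a placeholder, and `target_modules` depends on the architecture; these are typical for LLaMA-style models):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# "your-base-model" is a placeholder, not a model I've settled on.
model = AutoModelForCausalLM.from_pretrained("your-base-model")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # low-rank dimension of the adapter
    lora_alpha=16,      # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # architecture-dependent
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

For the long-input problem, the simplest idea I've had is naive sliding-window chunking, something like the sketch below, but I don't know if that's good enough or whether I should be filtering chunks first:

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    # Fixed-size windows with overlap, so a metadata field split across a
    # boundary still appears whole in at least one chunk.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```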
If you’re familiar with LLMs and metadata extraction, I’d appreciate any guidance or nudges in the right direction. I’m in the final stretch of my thesis, and every bit of insight counts!