Unlocking the Secrets of the Past: Training an LLM on 1800s London Literature | Ranjan Kumar

Imagine being able to converse with a language model that’s been trained exclusively on literature from 1800s London. Sounds like a fascinating concept, right? That’s exactly what I’m attempting to do: train a Large Language Model (LLM) from scratch using only texts from a specific region and time period. My current dataset consists of almost 7,000 London texts published between 1800 and 1875, and I’m using a Phi 1.5 model with 700M parameters on an A100 GPU.
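To make the "specific region and time period" constraint concrete, here is a minimal sketch of how a corpus might be narrowed down by publication year and place. The `TextRecord` fields, the helper names, and the sample records are all hypothetical placeholders for illustration; this is not the project's actual data pipeline.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Hypothetical metadata for one text; field names are assumptions."""
    title: str
    year: int
    place: str


def in_scope(record, start_year=1800, end_year=1875, place="London"):
    """Return True if the record falls inside the chosen period and region."""
    year_ok = start_year <= record.year <= end_year
    place_ok = place.lower() in record.place.lower()
    return year_ok and place_ok


def select_corpus(records, **kwargs):
    """Keep only the records that match the period and place filter."""
    return [r for r in records if in_scope(r, **kwargs)]


# Placeholder records, not real dataset entries.
sample = [
    TextRecord("Example text A", 1843, "London"),
    TextRecord("Example text B", 1890, "London"),     # too late
    TextRecord("Example text C", 1850, "Edinburgh"),  # wrong place
]

selected = select_corpus(sample)
print([r.title for r in selected])
```

Real archive metadata is usually messier than this (missing years, inconsistent place names), so a production filter would likely need extra normalization before a simple check like this one.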

My long-term goal is to see whether a model trained this way can actually reason and provide insightful responses. The latest update is promising: the model is starting to reference real historical events instead of just hallucinating everything. Many people have told me that fine-tuning an existing model would be more efficient, but I’m curious to see how far training from scratch can take me. And with the Internet Archive holding around 175,000 London texts from my chosen time period, there is plenty of room to scale the dataset.

The potential applications of this project are broad. Imagine being able to generate text that reads like literature written during that time period. A model like this could change the way we approach historical research, creative writing, and even education.

If you’re interested in following my progress, I’ve made the project open-source and available on GitHub (https://github.com/haykgrigo3/TimeCapsuleLLM). Let’s see how far we can take this experiment!
