Extracting Stock Picks from YouTube Videos with Multimodal and Text-Only LLMs

Have you ever wondered how well AI models can extract stock recommendations from noisy YouTube videos created by financial influencers? I recently came across a fascinating study that benchmarked the performance of multimodal large language models (LLMs) and text-only LLMs in doing just that.

The study’s goal was to isolate specific, directional recommendations like “buy TSLA” or “sell NVDA” from long-form YouTube videos, despite the noise, unstructured data, and visual distractions. The researchers evaluated various models, including GPT-4o, DeepSeek-V3, and Gemini 2.0 Pro, on their ability to extract stock tickers, investment actions, and conviction levels.
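To make the extraction target concrete, here is a minimal sketch of what a structured output for a single recommendation might look like in Python. The field names and allowed values (ticker, action, conviction) are illustrative assumptions, not the study's exact schema.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative schema for one extracted recommendation; the fields and
# allowed values are assumptions for this sketch, not the study's format.
@dataclass
class StockRecommendation:
    ticker: str                                    # e.g. "TSLA"
    action: Literal["buy", "sell", "hold"]         # directional call
    conviction: Literal["low", "medium", "high"]   # how strongly it is stated

# What a model would be asked to produce from a video or transcript segment:
example = StockRecommendation(ticker="NVDA", action="sell", conviction="high")
print(example)
```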

Interestingly, text-only models outperformed multimodal models in extracting full recommendations, while multimodal models were better at identifying surface signals like visually shown tickers. The study also found that segmented transcripts led to better performance than using entire transcripts or full videos.
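As a rough illustration of the segmented-transcript setup, the sketch below splits a long transcript into overlapping word windows that could each be sent to a model separately; the window size and overlap are arbitrary assumptions, not values taken from the study.

```python
def segment_transcript(transcript: str, max_words: int = 300, overlap: int = 50) -> list[str]:
    """Split a long transcript into overlapping word-window segments.

    The window size and overlap here are illustrative choices only.
    """
    words = transcript.split()
    segments = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + max_words]
        if chunk:
            segments.append(" ".join(chunk))
        if start + max_words >= len(words):
            break
    return segments

# Each segment would be passed to the LLM separately, and the per-segment
# extractions merged and de-duplicated afterwards.
```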

To assess the value of the extracted recommendations, the researchers simulated basic investment strategies and found that a simple inverse strategy, betting against each recommendation, produced stronger cumulative returns than following them.
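Here is a toy sketch of how such a comparison might be computed, assuming each recommendation comes with the stock's subsequent return over some holding period. The numbers are made up and the backtest ignores costs, position sizing, and overlapping holdings, so it only illustrates the idea of flipping each call.

```python
# Made-up sample data: each recommendation's action and the stock's
# forward return over an assumed holding period.
recommendations = [
    {"action": "buy",  "forward_return": -0.04},
    {"action": "sell", "forward_return":  0.02},
    {"action": "buy",  "forward_return":  0.01},
]

def cumulative_return(recs, invert=False):
    """Compound per-recommendation returns, optionally taking the opposite side."""
    total = 1.0
    for rec in recs:
        direction = 1 if rec["action"] == "buy" else -1
        if invert:
            direction = -direction
        total *= 1 + direction * rec["forward_return"]
    return total - 1

print("follow :", round(cumulative_return(recommendations), 4))
print("inverse:", round(cumulative_return(recommendations, invert=True), 4))
```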

The study’s findings have implications for how we evaluate the performance of AI models in extracting valuable insights from noisy data. It’s a fascinating area of research that could have significant applications in finance and beyond.
