As a machine learning enthusiast, I’ve been exploring the world of multimodal learning, where image data is combined with structured or genomic data for prediction tasks. One of the biggest challenges in this field is finding publicly available datasets that pair MRI or other image data with genomic or structured data for the same subject. In this post, I’ll describe my search for these datasets and share some of what I’ve discovered so far.
My ideal scenario is to find human medical data, particularly for cancer research, where MRI scans are paired with genomic data like gene expression, mutations, and methylation. However, with recent changes in data access policies, it’s become increasingly difficult to find suitable datasets.
I’ve looked into The Cancer Genome Atlas (TCGA), which offers a wealth of genomic data, but accessing the corresponding imaging data for the same patients has become a challenge. I’ve also explored animal and plant datasets in search of paired image and genomic or structured data.
In the animal domain, I’m looking for datasets that combine MRI or image data with genomic markers, physiological sensor data, and behavioral data. For plants, I’m interested in datasets that pair image data with environmental sensor data, plant species genetics, and agronomic metadata.
Throughout my search, I’ve realized that finding publicly available paired datasets is a daunting task. Many datasets either lack one of the modalities entirely or require complex, error-prone manual matching of records across separate archives. I’m hoping that by sharing my experience, I can connect with others who have faced similar challenges and uncover new resources that can help advance multimodal machine learning research.
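To make the matching problem concrete, here is a minimal sketch of how the manual pairing step often looks in practice: you have one metadata table from an imaging archive and another from a genomic portal, and you join them on a shared subject identifier. All the IDs and file names below are hypothetical, and the TCGA-style barcodes are just illustrative; real archives each have their own ID conventions, which is exactly what makes this step unreliable.

```python
import pandas as pd

# Hypothetical metadata tables. In practice, these would be exported from
# the imaging archive and the genomic data portal, each keyed by some
# subject identifier (here, made-up TCGA-style patient barcodes).
imaging = pd.DataFrame({
    "subject_id": ["TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"],
    "mri_series": ["series_17", "series_42", "series_08"],
})
genomics = pd.DataFrame({
    "subject_id": ["TCGA-AA-0002", "TCGA-AA-0003", "TCGA-AA-0004"],
    "expression_file": ["expr_b.tsv", "expr_c.tsv", "expr_d.tsv"],
})

# An inner join keeps only subjects present in BOTH modalities --
# the "paired" subset that is actually usable for multimodal training.
paired = imaging.merge(genomics, on="subject_id", how="inner")
print(paired["subject_id"].tolist())  # only the overlapping subjects
```

Note how the usable paired cohort is only the intersection of the two tables: each modality loses subjects, which is one reason so few large paired datasets exist even when both modalities are public.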
If you have any pointers to publicly available paired datasets or advice on navigating the complex world of data access, I’d love to hear from you!