As a deep learning enthusiast, I’ve been experimenting with replacing the traditional mel-spectrogram features in LatentSync’s SyncNet model with features from the more modern Wav2Vec2. The goal is to carry Wav2Vec2’s strengths in speech recognition over to the audio-visual synchronization task. But, as I soon discovered, this swap isn’t as straightforward as I thought.
The main issue lies in the dimension mismatch between the mel-spectrogram and Wav2Vec2 outputs. The mel-spectrogram branch produces a tensor of shape (batch, channel=1, 80, 52), i.e. 80 mel bins over 52 time frames, while Wav2Vec2 yields a tensor of shape (batch, 1, 768, 32), i.e. its 768-dimensional hidden features over 32 frames. To bridge this gap, I kept the DownEncoder2D module as the audio encoder and adjusted its configuration, which I’ll outline below.
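For concreteness, here’s a minimal sketch of the kind of Wav2Vec2 feature-extraction path I have in mind, assuming the HuggingFace `facebook/wav2vec2-base-960h` checkpoint (the exact checkpoint and audio windowing in my setup may differ); the reshape at the end just mimics the (batch, channel, feature, time) layout the 2D encoder expects.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative only: the checkpoint and the 1 s clip length are assumptions,
# not necessarily what LatentSync or my training script uses verbatim.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

waveform = torch.randn(16000)  # stand-in for 1 s of 16 kHz audio

with torch.no_grad():
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    hidden = wav2vec2(inputs.input_values).last_hidden_state  # (batch, frames, 768)

# Rearrange to (batch, channel=1, feature, time) so it slots into the same
# 2D audio-encoder interface the mel-spectrogram used.
audio_feat = hidden.transpose(1, 2).unsqueeze(1)
print(audio_feat.shape)  # torch.Size([1, 1, 768, 49]) for this 1 s clip
```

Wav2Vec2 emits a frame roughly every 20 ms, so I assume the 32-frame width in my case simply comes from the length of the audio window paired with each video clip.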
## The DownEncoder2D Configuration
```yaml
audio_encoder: # input (1, 80, 52)
  in_channels: 1
  block_out_channels: [32, 64, 128, 256, 512, 1024, 2048]
  downsample_factors: [[2, 1], 2, 2, 1, 2, 2, [2, 3]]
  attn_blocks: [0, 0, 0, 1, 1, 0, 0]
  dropout: 0.0
```
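To reason about how those factors collapse the input, it helps to trace the spatial shape stage by stage. The helper below is a back-of-the-envelope sketch: it assumes each downsampling stage behaves like a stride-s convolution with kernel 3 and padding 1 (so n maps to ceil(n/s)), which may not match DownEncoder2D’s exact padding, and the `w2v_factors` schedule is a hypothetical example for the (768, 32) input, not necessarily the one I trained with.

```python
import math

def trace_shapes(hw, factors):
    """Trace (height, width) through each downsampling stage,
    assuming stride-s convs with kernel 3 / padding 1, i.e. n -> ceil(n / s)."""
    h, w = hw
    shapes = [(h, w)]
    for f in factors:
        fh, fw = (f, f) if isinstance(f, int) else f
        h, w = math.ceil(h / fh), math.ceil(w / fw)
        shapes.append((h, w))
    return shapes

# Original mel-spectrogram schedule from the config above.
mel_factors = [(2, 1), 2, 2, 1, 2, 2, (2, 3)]
print(trace_shapes((80, 52), mel_factors))
# [(80, 52), (40, 52), (20, 26), (10, 13), (10, 13), (5, 7), (3, 4), (2, 2)]

# Hypothetical schedule for the (768, 32) Wav2Vec2 input.
w2v_factors = [(2, 1), 2, 2, 2, 2, 2, (2, 2)]
print(trace_shapes((768, 32), w2v_factors))
# [(768, 32), (384, 32), (192, 16), (96, 8), (48, 4), (24, 2), (12, 1), (6, 1)]
```

Whatever schedule you pick, the check that matters is that the final (height, width) of the audio branch matches what the rest of SyncNet expects before the embeddings are compared.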
However, when I modified the `downsample_factors` to accommodate Wav2Vec2’s output, the model refused to converge, even after 150 global steps. The loss stayed pinned at around 0.693, which is ln 2, the chance-level value for a binary sync/out-of-sync objective, meaning the model was effectively guessing. I was stumped.
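For context on why 0.693 is the telltale number: SyncNet-style training scores audio-visual pairs with a binary cross-entropy objective, and with balanced labels a model that has learned nothing can do no better than predicting 0.5 everywhere, whose loss is exactly ln 2. A quick numerical check (assuming balanced labels; this isn’t LatentSync code):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
labels = torch.randint(0, 2, (4096,)).float()   # balanced in-sync / out-of-sync labels
constant_pred = torch.full_like(labels, 0.5)    # an encoder that has learned nothing

loss = F.binary_cross_entropy(constant_pred, labels)
print(loss.item(), math.log(2))  # both ≈ 0.6931 -- the plateau I'm seeing
```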
## The Conundrum
I’m still unsure what’s causing the model to fail to converge. Is it the changed `downsample_factors`? Is it the much higher feature dimensionality of the Wav2Vec2 output? I’d love to hear from anyone who has attempted a similar conversion or has insight into what might be going wrong.
Have you faced similar challenges when working with Wav2Vec2 or LatentSync’s SyncNet? Share your thoughts and suggestions in the comments below!