Handling Variable-Length Sensor Sequences in Gesture Recognition: To Pad or Not to Pad?

When working with gesture recognition datasets, we often run into the same problem: variable-length sensor sequences. That’s exactly what I’m struggling with right now, and I’m hoping to get some insights from fellow machine learning enthusiasts.

My dataset consists of recordings from three different sensors, and I plan to feed each sensor’s data through its own neural network (maybe an RNN, LSTM, or 1D CNN). Then, I’ll concatenate the outputs and pass them through a fully connected layer to predict gestures. Sounds straightforward, right? Well, not quite.
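For concreteness, here’s a rough PyTorch sketch of what I have in mind. Everything in it is a placeholder — `SensorEncoder`, the hidden size of 64, and `n_gestures` are just illustrative, and I’ve assumed all three sensors share the same feature dimension:

```python
import torch
import torch.nn as nn

class SensorEncoder(nn.Module):
    """Per-sensor encoder (placeholder): an LSTM whose final hidden state is the sensor feature."""
    def __init__(self, in_features, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(in_features, hidden_size, batch_first=True)

    def forward(self, x):               # x: (batch, time, in_features)
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden_size)
        return h_n.squeeze(0)           # (batch, hidden_size)

class GestureNet(nn.Module):
    """Three per-sensor encoders; their outputs are concatenated and classified."""
    def __init__(self, in_features, n_gestures, hidden_size=64):
        super().__init__()
        self.encoders = nn.ModuleList(
            [SensorEncoder(in_features, hidden_size) for _ in range(3)]
        )
        self.classifier = nn.Linear(3 * hidden_size, n_gestures)

    def forward(self, sensor_inputs):   # list of 3 tensors, one per sensor
        feats = [enc(x) for enc, x in zip(self.encoders, sensor_inputs)]
        return self.classifier(torch.cat(feats, dim=-1))
```

The open question is what each encoder’s forward pass should receive when the sequences in a batch have different lengths.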

The issue is that these sequences have varying lengths, ranging from around 35 to 700 timesteps. This makes the input sizes inconsistent, and I’m unsure about the best way to handle this efficiently.

Padding

One approach is to pad every sequence to the same length. It’s simple and intuitive, but I’m worried it wastes memory (padding everything out to 700 timesteps when some sequences are only 35) and that long runs of padding make it harder for the network to learn.
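One thing I’m considering to limit the waste is padding only to the longest sequence in each batch rather than to the global maximum of 700. A minimal sketch of a collate function for a PyTorch DataLoader (the `(sequence, label)` sample format is an assumption about my own dataset class):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(samples):
    """samples: list of (sequence, label) pairs, each sequence shaped (time, features).
    Pads only to the longest sequence in this batch and keeps the true lengths
    so later stages can ignore the padded timesteps."""
    seqs, labels = zip(*samples)
    lengths = torch.tensor([s.shape[0] for s in seqs])
    padded = pad_sequence(seqs, batch_first=True)   # (batch, max_len_in_batch, features)
    return padded, lengths, torch.tensor(labels)

# loader = DataLoader(dataset, batch_size=32, collate_fn=collate_batch)
```

Grouping sequences of similar length into the same batch (bucketing) would presumably cut the padding overhead even further, but I haven’t tried that yet.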

Truncating or Discarding Sequences

Another option is to truncate long sequences or discard outliers so the inputs become uniform. However, this risks throwing away information that could be crucial for accurate gesture recognition.
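If I did go this way, the simplest version would be something like the sketch below (both thresholds are completely arbitrary, just to illustrate the idea):

```python
MAX_LEN = 256   # arbitrary cap: keep at most this many timesteps
MIN_LEN = 50    # arbitrary floor: drop recordings shorter than this

def truncate_or_discard(sequences):
    """Truncate sequences to MAX_LEN timesteps and discard ones shorter than MIN_LEN."""
    kept = []
    for seq in sequences:
        if len(seq) < MIN_LEN:
            continue                   # discard: too short to contain a full gesture
        kept.append(seq[:MAX_LEN])     # truncate: keep only the first MAX_LEN timesteps
    return kept
```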

I know that RNNs, LSTMs, and Transformers can technically handle variable-length sequences, but I’m still unsure about the most efficient way to implement this, especially with three separate sensors to deal with.
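From what I’ve read, the standard trick in PyTorch is to pad within a batch and then use `pack_padded_sequence` so the padded timesteps never enter the recurrence. A minimal sketch (the feature dimension of 6 and the dummy batch are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

lstm = torch.nn.LSTM(input_size=6, hidden_size=64, batch_first=True)

# `padded` and `lengths` would come from a collate step like the one above
padded = torch.randn(4, 700, 6)                  # batch of 4, padded to 700 timesteps
lengths = torch.tensor([700, 350, 120, 35])      # true lengths before padding

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
_, (h_n, _) = lstm(packed)    # h_n is the state at each sequence's true last timestep
features = h_n.squeeze(0)     # (4, 64): padding never influenced the recurrence
```

I’d presumably do this separately inside each of the three sensor encoders; for a 1D CNN or a Transformer, the equivalent would be a padding mask instead.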

So, how do you usually handle datasets like this? Are there any best practices for preserving information without blowing up memory usage?

I’d love to hear your thoughts and experiences on this matter. 🙏
