The goal of this project is to create a multi-modal implementation of the Transformer architecture in Swift. It's a learning exercise for me, so I've taken it slowly, starting from a simple image classifier and building it up.
It's also an attempt to answer the question of whether Swift for TensorFlow is ready for non-trivial work.
The use case is based on the paper "Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks" by Matthias Plappert. He created a nice dataset of a few thousand motions, "The KIT Motion-Language Dataset" (paper, website).
The Motion2Language Transformer already works. The Lang2motion one started to work recently, and I'm now implementing a more sophisticated motion generation strategy.
I'm using a modified version of the Swift Transformer implementation by Andre Carrera.
annotations and labels: