The goal of this project is to create a multi-modal implementation of the Transformer architecture in Swift. It's a learning exercise for me, so I've taken it slowly, starting from a simple image classifier and building up from there.
It's also an attempt to answer the question of whether Swift for TensorFlow is ready for non-trivial work.
The use case is based on the paper "Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks" by Matthias Plappert. He created a nice dataset of a few thousand motions, "The KIT Motion-Language Dataset" (paper, website).
The motion2language Transformer already works. The lang2motion one started working recently, and I'm now implementing a more sophisticated motion generation strategy.
I'm using a modified Swift Transformer implementation by Andre Carrera.
- motion 2 language
- ☑ Transformer from motion to annotation
- language 2 motion
- ☑ Transformer from annotation to motion
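Both directions share the same Transformer core. As a rough illustration of the central operation (a sketch in Swift for TensorFlow, not this repo's actual code), single-head scaled dot-product attention over 2-D tensors looks like this:

```swift
import TensorFlow

// Scaled dot-product attention for one head over 2-D tensors:
// queries/keys/values have shape [sequenceLength, modelSize].
// Illustrative sketch only; the repo's implementation differs in detail.
func scaledDotProductAttention(
    queries: Tensor<Float>,
    keys: Tensor<Float>,
    values: Tensor<Float>
) -> Tensor<Float> {
    let modelSize = Float(keys.shape[1])
    // Similarity scores between every query and every key,
    // scaled by sqrt of the model dimension for stable gradients.
    let scores = matmul(queries, keys.transposed()) / modelSize.squareRoot()
    // Softmax over the key dimension turns scores into attention weights.
    let weights = softmax(scores)
    // Each output row is a weighted average of the value rows.
    return matmul(weights, values)
}
```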
annotations and labels:
You may find interesting:
third set of datasets - 2020-07-30T15:03:04
Scaled motions, using simplified MotionSample, and actual motion recording frequency when downsampling.
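The downsampling step mentioned above can be sketched in plain Swift by keeping every n-th frame; the `downsample` name and the 100 Hz / 10 Hz frequencies are illustrative assumptions, not the dataset's actual values:

```swift
// Downsample a motion from its recording frequency to a target frequency
// by keeping every n-th frame. Frequencies here are illustrative; the
// actual recording frequency comes from the dataset.
func downsample(frames: [[Float]], recordedHz: Int, targetHz: Int) -> [[Float]] {
    precondition(targetHz > 0 && recordedHz >= targetHz)
    let step = recordedHz / targetHz
    return stride(from: 0, to: frames.count, by: step).map { frames[$0] }
}

let motion: [[Float]] = (0..<1000).map { [Float($0)] } // 10 s recorded at 100 Hz
let reduced = downsample(frames: motion, recordedHz: 100, targetHz: 10)
// reduced now holds 100 frames, i.e. 10 s at 10 Hz
```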
second set of datasets - 2020-06-18T12:57:00
first set of datasets - 2020-05-02T14:17:15
first set of img2label, language2label and motion2label datasets