The Zamia Brain project provides infrastructure for building natural language processing systems based on transformer networks (see https://arxiv.org/abs/1706.03762).
This project is still highly experimental; everything is subject to change without prior notice. The current approach is to generate training corpora both for pre-training and for (multi-)domain refinement. The goal is to train networks whose natural language processing capabilities are very robust (i.e. that avoid the brittleness of traditional rule-based systems) through pre-training, while allowing a certain amount of control over their behavior through refinement.
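To make the refinement idea concrete, here is a minimal sketch of how pattern-based ("skill") corpus generation could work. The pattern syntax shown (alternatives written as `(a|b)`) is a hypothetical illustration, not the actual zbrain format:

```python
import itertools

def expand(pattern):
    """Expand a pattern like "(hi|hello) there" into all concrete sentences.

    Illustrative sketch only; the real zbrain skill syntax may differ.
    """
    parts = []
    for token in pattern.split():
        if token.startswith("(") and token.endswith(")"):
            parts.append(token[1:-1].split("|"))  # alternative group
        else:
            parts.append([token])                 # literal token
    return [" ".join(combo) for combo in itertools.product(*parts)]

corpus = expand("(hi|hello) computer , (what|how) is the weather ?")
# expands to 4 concrete training sentences
```

Expanding a handful of such patterns yields a refinement corpus that nudges the pre-trained network toward the desired in-domain behavior.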
To this end, the project provides the following components:
- Scripts to generate pre-training corpora, typically using web-scraping techniques, as well as scripts that adapt scientific corpora for training: https://github.com/gooofy/zbrain
- Scripts that generate refinement corpora from patterns ("skills"): https://github.com/gooofy/zbrain
- A GPT-2 implementation along with tokenization, training and inference tools: https://github.com/gooofy/transformer-lm
- A TransformerXL implementation along with tokenization, training and inference tools: https://github.com/gooofy/transformer-xl
- Pre-trained models: https://goofy.zamia.org/zamia-speech/brain/
| Model | Parameters | Language | Training | Vocabulary |
|-------|------------|----------|----------|------------|
| gpt2-german-345M-r20190906 | 345M | German | 4.5 epochs on 27 GB twitter + wikipedia + heise + parole | 50k sentencepiece |
| gpt2-german | 117M | German | 3 epochs on 27 GB twitter + wikipedia + heise + parole | 50k sentencepiece |
| transformerXL-german-163M-r20190928 | 163M | German | 1 epoch on 27 GB twitter + wikipedia + heise + parole | 50k sentencepiece |
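The inference tools in the repositories above generate text by sampling from the model's next-token distribution. A minimal sketch of the top-k sampling step commonly used for GPT-2-style generation (illustrative only, not code from transformer-lm):

```python
import numpy as np

def sample_top_k(logits, k=40, temperature=1.0, rng=None):
    """Sample a token id from logits, restricted to the k most likely tokens.

    Sketch of the usual top-k sampling scheme; the actual inference
    code in transformer-lm / transformer-xl may differ in detail.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top = np.argsort(logits)[-k:]                   # ids of the k best tokens
    probs = np.exp(logits[top] - logits[top].max()) # softmax over the top k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

token_id = sample_top_k([0.1, 5.0, 0.2, 1.3], k=2)
```

Lower temperatures and smaller k make the output more conservative; larger values make it more diverse.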
Massive thanks to Konstantin Lopuhin (https://github.com/lopuhin) for great code and support!