With the addition of the TED-LIUM 3 corpus and positive results from the auto-review process, the r20190609 release of the English Zamia-Speech models for Kaldi has been trained on the largest amount of audio material yet (over 1100 hours):

     zamia_en            0:05:38
     voxforge_en       102:07:05
     cv_corpus_v1      252:31:11
     librispeech       450:49:09
     ljspeech           23:13:54
     m_ailabs_en       106:28:20
     tedlium3          210:13:30

Additionally, about 400 hours of noise-augmented audio derived from the above corpora were used (background noise and phone codecs):

    voxforge_en_noisy   22:01:40
    librispeech_noisy  119:03:26
    cv_corpus_v1_noisy  78:57:16
    cv_corpus_v1_phone  61:38:33
    zamia_en_noisy       0:02:08
    voxforge_en_phone   18:02:35
    librispeech_phone  106:35:33
    zamia_en_phone       0:01:11

So, in total, this release has been trained on over 1500 hours of audio material (training took over 6 weeks on a GeForce GTX 1080 Ti GPU).
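As a quick sanity check on the totals, the per-corpus durations from the two tables above can be summed with a few lines of Python (the corpus names and times are copied from the tables; the script itself is just an illustration, not part of the training pipeline):

```python
# Sum "H:MM:SS" durations from the training corpus tables above.

CLEAN = {
    "zamia_en": "0:05:38",
    "voxforge_en": "102:07:05",
    "cv_corpus_v1": "252:31:11",
    "librispeech": "450:49:09",
    "ljspeech": "23:13:54",
    "m_ailabs_en": "106:28:20",
    "tedlium3": "210:13:30",
}

NOISY = {
    "voxforge_en_noisy": "22:01:40",
    "librispeech_noisy": "119:03:26",
    "cv_corpus_v1_noisy": "78:57:16",
    "cv_corpus_v1_phone": "61:38:33",
    "zamia_en_noisy": "0:02:08",
    "voxforge_en_phone": "18:02:35",
    "librispeech_phone": "106:35:33",
    "zamia_en_phone": "0:01:11",
}

def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' duration string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def total_hours(table: dict) -> float:
    """Total duration of a corpus table, in hours."""
    return sum(to_seconds(t) for t in table.values()) / 3600.0

clean_h = total_hours(CLEAN)   # ~1145 h of clean audio
noisy_h = total_hours(NOISY)   # ~406 h of augmented audio
print(f"clean: {clean_h:.1f} h, augmented: {noisy_h:.1f} h, "
      f"total: {clean_h + noisy_h:.1f} h")
```

The clean corpora come out at roughly 1145 hours and the augmented ones at roughly 406, matching the "over 1100", "400" and "over 1500" figures in the text.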


Word error rates on our test set:

%WER 10.64 exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
%WER  8.84 exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0
%WER  5.80 exp/nnet3_chain/tdnn_fl/decode_test/wer_9_0.0

The tdnn_250 model is the smallest one, meant for use in embedded applications (i.e. RPi-3 class hardware); tdnn_f is our regular model; tdnn_fl is the tdnn_f model adapted to a larger language model (the results illustrate the importance of language model domain adaptation, by the way).
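To put that last point in numbers, a quick relative-WER calculation (plain arithmetic on the figures above, nothing model-specific) shows the larger language model removes roughly a third of the remaining errors:

```python
def relative_wer_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction in percent: the fraction of the
    baseline's errors that the improved model eliminates."""
    return 100.0 * (baseline - improved) / baseline

# tdnn_f -> tdnn_fl: effect of language model domain adaptation
print(f"{relative_wer_reduction(8.84, 5.80):.1f}%")   # ~34% relative
# tdnn_250 -> tdnn_f: smallest vs. regular model
print(f"{relative_wer_reduction(10.64, 8.84):.1f}%")  # ~17% relative
```

So the language model adaptation (tdnn_f to tdnn_fl) is worth about twice as much, in relative terms, as the step up from the embedded model to the regular one.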

Downloads: https://github.com/gooofy/zamia-speech#asr-models