Further analysis of the app is necessary to find the right parameters to the models, but the initial blog post also provides some useful info:

Representation of an RNN-T, with the input audio samples, x, and the predicted symbols y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as y_{u-1}, ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer. The Encoder Network comprises 8 such layers.

The audio input is probably 80 log-Mel channels, as described in this paper. Gauging from the number of inputs to the first encoder (enc0), 3 frames should be stacked and provided to enc0. Then three more frames should be captured to run enc0 again to obtain a second output. Both outputs should be fed to the second encoder (enc1) to provide it with a tensor of length 1280. The output of the second encoder is fed to the joint.

The decoder is fed with a tensor of zeros at t=0. In the next iteration the decoder is fed with the output of the softmax layer, which has length 128 and represents the probabilities of the symbol heard in the audio. This way the current symbol depends on all the previous symbols in the sequence.

The joint and softmax have the fewest tweakable parameters. The two inputs of the joint are just the outputs of the decoder and encoder, and the softmax only turns this output into probabilities between 0 and 1.

I recently found a nice overview presentation of (almost) current research, with an interesting description starting on slide 81 explaining when to advance the encoder and retain the prediction network state.

So SODA finally landed, sort of, and for a couple of weeks already apparently. I've been on the lookout for the Linux library, since that is my preferred environment and I was under the impression that the development was taking place on that platform. But I was wrong, and the Windows and macOS libraries have been available since late November.
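To make the data flow described in this post more concrete (frame stacking into enc0 and enc1, then the decoder/joint/softmax feedback loop), here is a rough sketch in plain Python. It is not the recovered model: `enc0`, `enc1`, `decoder` and `joint` are hypothetical placeholders with toy behavior, and several values are my assumptions rather than facts from the app: the 640-long enc0 output (inferred from two outputs making 1280), the enc1 output size, the blank symbol index, and the cap on symbols per encoder step. The blank handling follows the usual greedy RNN-T scheme (advance the encoder on blank and keep the decoder state), which may differ in detail from what the real model does.

```python
import math

N_MEL, STACK = 80, 3      # from the text: 80 log-Mel channels, 3 stacked frames
ENC0_OUT = 640            # inferred: two enc0 outputs make the 1280-long enc1 input
N_SYMBOLS = 128           # from the text: length of the softmax output
BLANK = 0                 # assumption: index of the blank symbol
MAX_SYMBOLS_PER_STEP = 4  # assumption: safety cap on emissions per encoder step


def enc0(stacked):
    """Placeholder for the first encoder: 240 values in, 640 out."""
    assert len(stacked) == N_MEL * STACK
    return [sum(stacked) / len(stacked)] * ENC0_OUT


def enc1(pair):
    """Placeholder for the second encoder; the output size is a guess."""
    assert len(pair) == 2 * ENC0_OUT
    return [sum(pair) / len(pair)] * ENC0_OUT


def decoder(prev_probs, state):
    """Placeholder prediction network: returns (output, new_state)."""
    assert len(prev_probs) == N_SYMBOLS
    return list(prev_probs), state + 1


def joint(enc_out, dec_out):
    """Toy joint: emit the symbol encoded in enc_out unless the decoder just saw it."""
    target = int(round(enc_out[0]))
    seen = dec_out.index(max(dec_out))
    logits = [0.0] * N_SYMBOLS
    logits[BLANK if seen == target else target] = 5.0
    return logits


def softmax(logits):
    """Turn logits into probabilities between 0 and 1 that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def encode(frames):
    """Stack 3 frames per enc0 call and pair two enc0 outputs per enc1 call."""
    step = 2 * STACK  # six frames feed one enc1 call
    outputs = []
    for i in range(0, len(frames) - step + 1, step):
        first = enc0([v for f in frames[i:i + STACK] for v in f])
        second = enc0([v for f in frames[i + STACK:i + step] for v in f])
        outputs.append(enc1(first + second))
    return outputs


def greedy_decode(frames):
    """Greedy loop: emit symbols until blank, then move to the next encoder output."""
    symbols = []
    dec_out, state = decoder([0.0] * N_SYMBOLS, 0)  # tensor of zeros at t=0
    for enc_out in encode(frames):
        for _ in range(MAX_SYMBOLS_PER_STEP):
            probs = softmax(joint(enc_out, dec_out))
            best = probs.index(max(probs))
            if best == BLANK:
                break  # advance the encoder, retain the decoder state
            symbols.append(best)
            dec_out, state = decoder(probs, state)  # feed the softmax output back
    return symbols


if __name__ == "__main__":
    # Twelve fake frames: six filled with 1.0, then six filled with 2.0.
    frames = [[1.0] * N_MEL] * 6 + [[2.0] * N_MEL] * 6
    print(greedy_decode(frames))
```

With the real models, each placeholder would be replaced by a call into the corresponding recovered network, and the tensor sizes asserted here would need to be checked against the actual input and output shapes.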