What is speech recognition?

The task of mapping acoustic waveforms to sequences of graphemes.

What is the input to a speech recognizer?

An acoustic waveform that is sampled, quantized, and converted to a spectral representation such as the log mel spectrum.
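
The first two steps, plus the frequency warping behind the mel spectrum, can be sketched in a few lines. This is a minimal illustration, not a full feature extractor: the 16 kHz sample rate, the 16-bit depth, and all function names are assumptions chosen for the example, and the mel formula shown is one common variant among several in use.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate in Hz (illustrative choice)


def sample_sine(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Sample a pure tone at discrete time steps (stand-in for a microphone signal)."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def quantize_16bit(samples):
    """Map real-valued samples in [-1, 1] to 16-bit integers, clipping out-of-range values."""
    return [max(-32768, min(32767, round(x * 32767))) for x in samples]


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using one common formula."""
    return 1127.0 * math.log(1.0 + freq_hz / 700.0)


wave = quantize_16bit(sample_sine(440.0, 0.01))
print(len(wave), hz_to_mel(1000.0))
```

A real front end would additionally window the signal, take a Fourier transform per window, pool energies with a mel filter bank, and take the log.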

What are the two common paradigms for speech recognition?

The two common paradigms are the encoder-decoder model with attention and models based on the CTC loss function. Attention-based models have higher accuracy, but CTC-based models adapt more easily to streaming: outputting graphemes online instead of waiting until the acoustic input is complete.
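
A CTC model emits one label per input frame, including a special blank label, and the output string is obtained by merging consecutive repeated labels and then removing blanks. The sketch below shows only that collapsing step; the "_" blank symbol and the function name are assumptions made for the example.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, chosen for readability


def ctc_collapse(alignment, blank=BLANK):
    """Collapse a frame-level CTC alignment into an output string."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [symbol for symbol, _ in groupby(alignment)]
    # Step 2: drop the blank label.
    return "".join(symbol for symbol in merged if symbol != blank)


print(ctc_collapse("dd_ii_n_n_eerr"))
```

The blank between the two n labels is what lets the output keep a doubled letter; without it, the repeated frames would merge into a single n.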

How is ASR evaluated?

ASR is evaluated using the word error rate (WER): the word-level edit distance between the hypothesis and the gold transcription, normalized by the number of words in the gold transcription.
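
As a worked illustration, word error rate can be computed as the minimum number of word substitutions, insertions, and deletions needed to turn the gold transcription into the hypothesis, divided by the number of gold words. The sketch below assumes whitespace tokenization and unit cost for every edit; the function name is hypothetical, and standard scoring tools apply additional text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.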

What is the architecture for TTS?

TTS systems are also based on the encoder-decoder architecture. The encoder maps letters to an encoding, and the decoder consumes that encoding to generate a mel spectrogram. A neural vocoder then reads the spectrogram and generates the waveform.

What is the role of text normalization in TTS?

TTS systems require a first pass of text normalization to handle numbers, abbreviations, and other non-standard words.
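
A toy version of such a pass might look like the following. It is only a sketch under strong assumptions: the abbreviation table is hypothetical, only whole-token integers from 0 to 99 are verbalized, and every other token passes through unchanged.

```python
import re

# Hypothetical abbreviation table; a real system needs context to decide,
# for example, whether "St." means "Street" or "Saint".
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 are supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand known abbreviations and one- or two-digit numbers, token by token."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]{1,2}", token):
            out.append(number_to_words(int(token)))
        else:
            out.append(token)
    return " ".join(out)


print(normalize("Dr. Smith lives at 42 Main St."))
```

Dates, currency amounts, and larger numbers each need their own verbalization rules, which this sketch omits.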

How is TTS evaluated?

TTS is evaluated by playing synthesized sentences to human listeners, who either rate each one, with the ratings averaged into a mean opinion score (MOS), or compare two systems in AB tests.
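
Both measures reduce to simple arithmetic over listener responses. The sketch below assumes the usual 1 to 5 rating scale for MOS and records each AB vote as the string "A" or "B"; the function names and the sample data are made up for illustration.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings, assumed to lie on a 1-to-5 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return float(mean(ratings))


def ab_preference(votes):
    """Fraction of listeners who preferred system A over system B."""
    if not votes:
        raise ValueError("need at least one vote")
    return votes.count("A") / len(votes)


print(mean_opinion_score([4, 5, 3, 4]))
print(ab_preference(["A", "B", "A", "A"]))
```

In practice, a difference between two systems is usually checked with a significance test before it is reported.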