This introduction is mainly based on the appendix of the paper "WaveNet: A Generative Model for Raw Audio". The link is as follows: blogs.com/BaroC/p/4283380.html.
On the neural-network side, the output is typically a 256-way softmax classifier, whose classes correspond to 256 quantized values of the audio sample. Both WaveNet and WaveRNN generate audio this way.
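As a rough sketch of where those 256 classes come from: WaveNet uses μ-law companding to map each raw sample to one of 256 integer codes, which then serve as the softmax targets. The snippet below is a minimal illustration of that standard transform (not code taken from the paper's appendix); function names and the 440 Hz test tone are my own choices.

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Map raw audio in [-1, 1] to integer codes in [0, 255] via mu-law companding."""
    mu = quantization_channels - 1
    # Compress the dynamic range, then quantize uniformly.
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude                      # still in [-1, 1]
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)    # integers 0..255

def mu_law_decode(codes, quantization_channels=256):
    """Invert the companding: integer codes back to waveform samples in [-1, 1]."""
    mu = quantization_channels - 1
    signal = 2 * (codes.astype(np.float32) / mu) - 1
    return np.sign(signal) * np.expm1(np.abs(signal) * np.log1p(mu)) / mu

# Example: one second of a 440 Hz tone sampled at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
wave = 0.5 * np.sin(2 * np.pi * 440 * t)
codes = mu_law_encode(wave)        # 256-class targets for the softmax
recovered = mu_law_decode(codes)   # approximate reconstruction of the waveform
```

At generation time the model predicts a distribution over these 256 codes for the next sample, a code is drawn from it, and decoding it back yields the waveform value.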
The following are some materials I used to study speech synthesis. Stanford CS224S is highly recommended, although the logic of its handouts is not very clear, so they take repeated reading to understand.
The UCSB digital speech processing course covers the basics of audio signal processing; I suggest taking a look. The link is as follows: /view/68fbf1a4f61fb7360b4c658b.html