Dataset processing: the audio files of the DSD100 dataset are converted into time-frequency spectrograms.
DSD100 contains two folders: one holds the mixed audio ("mix"), the other holds the split-track audio ("source") for vocals, drums, bass, and other instruments. Each folder contains two subfolders: "Dev" is the training set and "Test" is the test set.
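For orientation, the on-disk layout looks roughly like this (the official DSD100 release spells the two folders "Mixtures" and "Sources"; the per-song file names below follow the dataset's convention):

    DSD100/
      Mixtures/                  # the mixed-audio folder
        Dev/<song>/mixture.wav
        Test/<song>/mixture.wav
      Sources/                   # the split-track folder
        Dev/<song>/bass.wav, drums.wav, other.wav, vocals.wav
        Test/<song>/bass.wav, drums.wav, other.wav, vocals.wav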
util.SaveSpectrogram(y_mix, y_vocal, y_inst, fname) does most of this work.
Given an audio path, you either train a model or use an existing one to obtain the separated vocal/instrumental audio from the original mixture.
Five main interfaces are called (an end-to-end sketch follows this list):
util.LoadDataset(target) loads the dataset; target is the source to be separated, here the vocals.
network.TrainUNet(Xlist, Ylist, savefile="unet.model", epoch=30) trains the network.
util.LoadAudio(fname) loads an audio file.
util.ComputeMask(input_mag, unet_model="unet.model", hard=True) computes the mask.
util.SaveAudio(fname, mag, phase) saves the audio.
(If a trained model already exists, only the last three functions are needed.)
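Putting the five interfaces together, an end-to-end run might look like this sketch (exact return values and defaults are assumptions based on the signatures above):

    import util
    import network

    # preprocessing: convert the DSD100 audio to spectrograms first (SaveSpectrogram)

    # training: the target source is the vocal track
    Xlist, Ylist = util.LoadDataset(target="vocal")
    network.TrainUNet(Xlist, Ylist, savefile="unet.model", epoch=30)

    # separation with the trained model; assumes LoadAudio returns (magnitude, phase)
    mag, phase = util.LoadAudio("input_mix.wav")
    mask = util.ComputeMask(mag, unet_model="unet.model", hard=True)
    util.SaveAudio("vocal_est.wav", mag * mask, phase)
    util.SaveAudio("inst_est.wav", mag * (1 - mask), phase)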
The Xlist returned by LoadDataset holds the spectrograms of the mixed audio, and Ylist holds the spectrograms of the target source.
find_files is librosa.util.find_files, which collects the audio file paths under a directory.
load is librosa.core.load, which can load audio in various formats.
stft is librosa.core.stft, the short-time Fourier transform.
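A minimal sketch of those three calls in sequence (the sample rate, n_fft, and hop_length values are illustrative, not the project's confirmed settings):

    import librosa

    # collect every audio file under the training folder
    files = librosa.util.find_files("DSD100/Mixtures/Dev", ext="wav")

    # load one file as a mono float waveform
    y, sr = librosa.core.load(files[0], sr=16000)

    # complex spectrogram of shape (1 + n_fft/2, n_frames)
    spec = librosa.core.stft(y, n_fft=1024, hop_length=768)
    print(spec.shape)                     # (513, n_frames)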
(Why is the number of frequency bins 1 + n_fft/2? For a real-valued signal the Fourier transform is Hermitian-symmetric: bin k and bin n_fft - k are complex conjugates, so the negative-frequency half of the spectrum merely repeats the positive half. librosa therefore keeps only the non-redundant bins 0 through n_fft/2, which is 1 + n_fft/2 bins.)
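This is easy to check numerically. In the sketch below, for a real signal of length n_fft, bin k and bin n_fft - k come out as complex conjugates, and np.fft.rfft returns only the 1 + n_fft/2 unique bins:

    import numpy as np

    n_fft = 8
    x = np.random.randn(n_fft)                       # a real-valued signal
    X = np.fft.fft(x)

    # Hermitian symmetry: X[k] == conj(X[n_fft - k]) for k = 1 .. n_fft - 1
    print(np.allclose(X[1:], np.conj(X[:0:-1])))     # True

    # so only the first 1 + n_fft/2 bins carry information
    print(np.fft.rfft(x).shape)                      # (5,) == 1 + 8/2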
mag is the magnitude spectrogram, obtained by taking the absolute value: mag = np.abs(spec).
phase = np.exp(1.j * np.angle(spec)) keeps the phase of spec as a unit-modulus complex factor: np.angle extracts the angle, and the complex exponential puts it back on the unit circle.
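Since |phase| == 1 everywhere, the complex spectrogram factors exactly into mag * phase; that is what lets the network operate on magnitudes only and reuse the mixture's phase at reconstruction time. A quick check (the noise input is just a stand-in for real audio):

    import numpy as np
    import librosa

    # any complex spectrogram will do; here, the STFT of one second of noise
    spec = librosa.core.stft(np.random.randn(16000).astype(np.float32))
    mag = np.abs(spec)                      # magnitude
    phase = np.exp(1.j * np.angle(spec))    # unit-modulus phase factor
    print(np.allclose(spec, mag * phase))   # True: magnitude times phase rebuilds spec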
istft is librosa.core.istft, the inverse short-time Fourier transform.
write_wav is librosa.output.write_wav, which writes a waveform to a .wav file.
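From those two calls, SaveAudio plausibly looks like the sketch below; the sr and hop_length values are illustrative, and librosa.output.write_wav exists only in librosa before 0.8 (newer code would write with the soundfile package instead):

    import librosa

    def SaveAudio(fname, mag, phase, sr=16000, hop_length=768):
        # recombine magnitude and phase, invert the STFT, write a .wav file
        y = librosa.core.istft(mag * phase, hop_length=hop_length)
        librosa.output.write_wav(fname, y, sr, norm=True)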
The SaveSpectrogram function preprocesses the dataset, converting its audio files into spectrograms (again via librosa.core.stft).
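A sketch of what SaveSpectrogram plausibly does; storing the three magnitude spectrograms together in one .npz file, and the n_fft/hop_length values, are assumptions on my part:

    import numpy as np
    import librosa

    def SaveSpectrogram(y_mix, y_vocal, y_inst, fname, n_fft=1024, hop_length=768):
        # trim to a common length, then keep magnitude spectrograms only;
        # phase is taken from the mixture again at separation time
        n = min(len(y_mix), len(y_vocal), len(y_inst))
        def mag(y):
            return np.abs(librosa.core.stft(y[:n], n_fft=n_fft, hop_length=hop_length))
        np.savez(fname + ".npz", mix=mag(y_mix), vocal=mag(y_vocal), inst=mag(y_inst))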
network.UNet() builds the U-Net neural network; ComputeMask then runs the trained model to compute a hard mask or a soft mask.
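The masking step itself is simple. A minimal sketch of ComputeMask, assuming the model is a callable mapping a magnitude spectrogram to a soft mask in [0, 1] (the real function presumably loads the weights from the unet_model file first, and the 0.5 threshold is my assumption):

    import numpy as np

    def ComputeMask(input_mag, unet_model, hard=True, threshold=0.5):
        # unet_model: callable mapping a magnitude spectrogram to a soft mask in [0, 1]
        soft_mask = unet_model(input_mag)
        if hard:
            # hard mask: 1 where the network attributes the bin to the target, else 0
            return (soft_mask > threshold).astype(np.float32)
        return soft_mask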
UNet class
Training the network:
U-Net here is a mapping whose input and output spaces are the same size: a spectrogram goes in, a same-shaped mask comes out. Training pairs are (MixSpec, TargetSpec) = (mixture spectrogram, target spectrogram). The parameters θ are optimized with Adam so that, for mask = UNet_θ(MixSpec), the masked mixture MixSpec * mask is as close as possible to TargetSpec.
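The notes don't say which framework the project uses, so here is the training objective written as a PyTorch-style step purely for illustration; an L1 loss between the masked mixture and the target is a common choice for this setup, but the project's exact loss may differ:

    import torch
    import torch.nn.functional as F

    # assumed: unet is a torch.nn.Module mapping a batch of magnitude
    # spectrograms to a mask of the same shape with values in [0, 1],
    # and loader yields (MixSpec, TargetSpec) batches
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)

    for mix, target in loader:
        mask = unet(mix)                       # mask = UNet_theta(MixSpec)
        loss = F.l1_loss(mix * mask, target)   # push MixSpec * mask toward TargetSpec
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()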