Building Block TFC-TDF: Densely connected 2-d Conv (TFC) with TDFs
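The TDF half of the block is a bottlenecked fully-connected layer applied to the frequency axis of every time frame. A minimal numpy sketch of that computation (sizes, weight names, and the single-channel simplification are all illustrative, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def tdf(x, W1, W2):
    """Time-Distributed Fully-connected layer (TDF): a bottlenecked
    dense layer applied to the frequency axis of each time frame.
    x: (time, freq) feature map (single channel for simplicity)."""
    h = np.maximum(x @ W1, 0.0)     # project freq -> bottleneck, ReLU
    return np.maximum(h @ W2, 0.0)  # project back to the full freq axis

n_freq, bottleneck = 128, 16  # hypothetical sizes (bottleneck << n_freq)
W1 = rng.standard_normal((n_freq, bottleneck)) * 0.1
W2 = rng.standard_normal((bottleneck, n_freq)) * 0.1

x = rng.standard_normal((32, n_freq))  # 32 time frames of features
out = x + tdf(x, W1, W2)               # skip connection, as in TFC-TDF
print(out.shape)  # (32, 128)
```

In the full block, this TDF output is combined (via the skip connection) with the densely connected 2-D convolutions of the TFC.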
U-Net with TFC-TDFs
+
Ablation (n_fft = 2048)
Large Model (n_fft = 4096)
Conditioned-U-Net extends the U-Net by exploiting Feature-wise Linear Modulation (FiLM)
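FiLM modulates an intermediate feature map with a condition-dependent scale and shift per channel. A minimal sketch, assuming a (channels, time, freq) layout and externally supplied `gamma`/`beta` (in the C-U-Net these come from a condition generator fed with the target-instrument indicator):

```python
import numpy as np

def film(x, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel
    of x with condition-dependent parameters gamma and beta."""
    return gamma[:, None, None] * x + beta[:, None, None]

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16, 32))  # (channels, time, freq) feature map
gamma = rng.standard_normal(4)        # one scale per channel (from condition)
beta = rng.standard_normal(4)         # one shift per channel (from condition)
y = film(x, gamma, beta)
print(y.shape)  # (4, 16, 32)
```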
Baseline C-U-Net + TFC-TDFs
TFC vs TFC-TDF
TFC vs TFC-TDF
Although the TDF does improve SDR performance by capturing common frequency patterns observed across all instruments, a single shared transformation cannot capture instrument-dependent frequency patterns.
We propose the Latent Source-Attentive Frequency Transformation (LaSAFT), a novel frequency transformation block that captures instrument-dependent frequency patterns by exploiting scaled dot-product attention.
Naive Extension: MUX-like approach
However, there are in fact many more 'instruments' we have to consider.
We assume that there are latent instruments.
We assume each instrument can be represented as a weighted average of them
LaSAFT
We can now compute the output of LaSAFT as follows:
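The computation above can be sketched in numpy: the condition embedding acts as the query, learned keys represent the latent instruments, and the attention weights mix one frequency transformation per latent instrument. All sizes and weight tensors below are illustrative assumptions, and the per-latent transformations are reduced to plain frequency-axis matrices for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

n_freq, d_k, n_latent = 128, 8, 6  # illustrative sizes

# One frequency transformation per latent instrument (simplified to a
# single (freq, freq) matrix each).
V = rng.standard_normal((n_latent, n_freq, n_freq)) * 0.05

# Learned keys for the latent instruments, and a query derived from the
# embedding of the target (actual) instrument.
K = rng.standard_normal((n_latent, d_k))
q = rng.standard_normal(d_k)

# Scaled dot-product attention over the latent instruments.
w = softmax(K @ q / np.sqrt(d_k))  # (n_latent,) weights, sum to 1

x = rng.standard_normal((32, n_freq))  # (time, freq) features
out = x + sum(w[i] * (x @ V[i]) for i in range(n_latent))  # skip connection
print(out.shape)  # (32, 128)
```

The weighted average over latent-instrument transformations is exactly how "each instrument is a weighted average of latent instruments" enters the architecture.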
We apply a LaSAFT after each TFC in the encoder and after each FiLM/GPoCM layer in the decoder. We employ a skip connection for LaSAFT and TDF, as in TFC-TDF.
PoCM is an extension of FiLM
Since this channel-wise linear combination can also be viewed as a point-wise convolution, we name it PoCM. With inter-channel operations, PoCM can modulate features more flexibly and expressively than FiLM.
Instead of PoCM, we use the Gated PoCM (GPoCM), since GPoCM is more robust for the source separation task. It is natural to use a gated approach for source separation because a sparse latent vector (one that contains many near-zero elements), obtained by applying GPoCMs, naturally generates a separated result (i.e., one more silent than the original).
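Both modulations can be sketched in a few lines of numpy: PoCM is a condition-dependent point-wise (1x1) convolution across channels, and GPoCM squashes its output through a sigmoid to gate the input feature map. The (channels, time, freq) layout and all tensor sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def pocm(x, M, b):
    """PoCM: condition-dependent point-wise (1x1) convolution, i.e. a
    linear combination across channels at every (time, freq) position."""
    return np.einsum('oc,cij->oij', M, x) + b[:, None, None]

def gpocm(x, M, b):
    """Gated PoCM: sigmoid-gate the PoCM output and multiply it into x;
    near-zero gates silence parts of the feature map."""
    return x * (1.0 / (1.0 + np.exp(-pocm(x, M, b))))

C = 4
x = rng.standard_normal((C, 16, 32))  # (channels, time, freq)
M = rng.standard_normal((C, C)) * 0.5 # mixing matrix (from the condition)
b = rng.standard_normal(C) * 0.1      # bias (from the condition)
y = gpocm(x, M, b)
print(y.shape)  # (4, 16, 32)
```

Because the gate lies in (0, 1), every element of the output is no larger in magnitude than the input, which is the "more silent than the original" behavior described above.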
The authors of the Conditioned-U-Net tried to manipulate the latent space in the encoder.
However, we found that this approach is not practical, since it makes the latent space (i.e., the decoder's input feature space) more discontinuous.
Via preliminary experiments, we observed that applying FiLMs in the decoder was consistently better than applying FiLMs in the encoder.
Choi, Woosung, et al. "Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation." 21st International Society for Music Information Retrieval Conference, ISMIR, 2020.
Choi, Woosung, et al. "LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation." arXiv preprint arXiv:2010.11631 (2020).