LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation

Woosung Choi, Minseok Kim, Jaehwa Chung, and Soonyoung Jung

Our code and models are available online.

You can also check out separated samples online.


Contents

1. Task Definition: Conditioned Source Separation

2. Part 1: Frequency Transformation Blocks (FTBs)

  • review: U-Net for Spectrogram-based Singing Voice Separation
  • motivation: Spectrogram \neq Image
  • solution: Frequency Transformation Blocks

3. Part 2: LaSAFT for Conditioned Source Separation

  • review: Conditioned-U-Net (C-U-Net) for Conditioned Source Separation
  • motivation: Extending FTB to Conditioned Source Separation
  • solution: Latent Source Attentive Frequency Transformation Block (LaSAFT)
  • how to modulate latent features: a more complex manipulation method than FiLM

1. Task Definition: Source Separation

  • Separates the signal of a specific source from a given mixed signal

    • Music Source Separation, Speech Enhancement...
  • Categories of Source separation models

    • A Dedicated model: dedicated to a single instrument
    • A Multi-head model: generates several outputs at once with multiple output heads
    • A Conditioned model: separates different instruments with the aid of a control mechanism
      • Input: an input audio track $A$ and a one-hot encoding vector $C$ that specifies which instrument we want to separate
      • Output: separated track of the target instrument

1. Task Definition

  • Digital Audio Signal Processing (sr:44100Hz)
    • Linear Audio Mixing System

      • $M[t]=\sum_{i} S^{(i)}[t] + N[t]$
    • Audio Source Separation

      • $ASS(M[t])=\{S^{(0)}, S^{(1)}, ..., S^{(n)}\}$
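
A minimal NumPy sketch of this linear mixing model; the sinusoidal "sources" and all values below are purely illustrative:

import numpy as np

sr = 44100                                      # sampling rate (Hz)
t = np.arange(sr * 2) / sr                      # two seconds of audio

# illustrative "sources": two sinusoids plus a small noise term N[t]
s_vocals = 0.5 * np.sin(2 * np.pi * 440.0 * t)
s_bass = 0.3 * np.sin(2 * np.pi * 110.0 * t)
noise = 0.01 * np.random.randn(len(t))

# linear mixing: M[t] = sum_i S^(i)[t] + N[t]
mixture = s_vocals + s_bass + noise

# source separation aims to recover {s_vocals, s_bass, ...} given only `mixture`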

2. Part 1: Frequency Transformation Blocks (FTBs)

  1. review: a (dedicated) U-Net for Spectrogram-based Singing Voice Separation
  2. motivation: Spectrogram \neq Image
    • What's wrong with CNNs and spectrograms for audio processing?
    • Alternatives: 1-D CNNs, Dilated CNNs, FTBs, ...
  3. solution: Frequency Transformation Blocks
    • Employing Fully-Connected (FC) Layers to capture Freq-to-Freq Dependencies
    • (empirical results) Injecting FCs, called FTBs, into a Fully 2-D Conv U-Net significantly improves SDR performance

2.1. Review: U-Net for Spectrogram-based Source Separation

  • U-Net: an encoder-decoder structure with symmetric skip connections

    • These symmetric skip connections allow models to effectively recover fine-grained details of the target object during decoding.
  • Originally proposed for Medical Image Segmentation

    • can also be viewed as an Image-to-Image Translation
    • The original U-Net is fully 2-d convolutional

2.1. Review: U-Net for Spectrogram-based Source Separation

  • Audio Equalizer - Eliminate signals with unwanted frequencies

  • Spectrogram-based Source Separation

    1. Apply Short-Time Fourier Transform (STFT) on a mixture waveform to obtain the input spectrograms.
    2. Estimate the vocal spectrograms based on these inputs
    3. Restore the vocal waveform with inverse STFT (iSTFT).
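
A minimal PyTorch sketch of this three-step pipeline, assuming a hypothetical spec2spec `model` that predicts a magnitude mask (masking-based estimation is introduced on the next slide):

import torch

def separate(model, mixture, n_fft=2048, hop=512):
    ''' Spectrogram-based separation pipeline (sketch).
        `model` is any spec2spec network (e.g., a U-Net) that maps a magnitude
        spectrogram to a mask of the same shape -- a hypothetical placeholder here. '''
    window = torch.hann_window(n_fft)

    # 1. STFT of the mixture waveform => complex spectrogram [F, T]
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)

    # 2. estimate the target (e.g., vocal) spectrogram from the mixture magnitude
    mag = spec.abs()
    mask = model(mag.unsqueeze(0)).squeeze(0)    # masking-based estimation
    est_spec = mask * spec                       # reuse the mixture phase

    # 3. restore the target waveform with the inverse STFT
    return torch.istft(est_spec, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])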

2.1. Review: U-Net for Spectrogram-based Source Separation

  • Spec2Spec (Masking-based or Direct Estimation)

2.1. Review: U-Net For Spectrogram-based Source Separation

  • Naive Assumption
    • Assuming a spectrogram is a two-channel (left and right) image
    • Spectrogram-based Source Separation can be viewed as an Image-to-Image Translation

2.1. Review: U-Net For Spectrogram-based Source Separation (2)

  • ..., and it works...!

    • Jansson, A., et al. "Singing voice separation with deep U-Net convolutional networks." 18th International Society for Music Information Retrieval Conference. 2017.
    • Takahashi, Naoya, and Yuki Mitsufuji. "Multi-scale multi-band densenets for audio source separation." 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017.
  • Recall the assumption of this approach:

    • Assuming a spectrogram is a two-channel (left and right) image
    • Spectrogram-based Source Separation \approx Image-to-Image Translation
    • (empirical results) Fully 2-D Convs can provide promising results

2.2. Spectrogram \neq Image

  • What's wrong with CNNs and spectrograms for audio processing?
    • The axes of spectrograms do not carry the same meaning
      • the spatial invariance that 2-D CNNs provide may therefore not work as well as it does for images
    • The spectral properties of sounds are non-local
      • Periodic sounds are typically comprised of a fundamental frequency and a number of harmonics which are spaced apart by relationships dictated by the source of the sound. It is the mixture of these harmonics that determines the timbre of the sound.

2.2. What's wrong with CNNs and spectrograms for audio processing?

  • Yin, Dacheng, et al. "PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network." AAAI. 2020.

    Non-local correlations exist in a T-F spectrogram along the frequency axis. A typical example is the correlations among harmonics ... However, simply stacking several 2D convolution layers with small kernels cannot capture such global correlation.

  • Park, Soochul, and Ben Sangbae Chon. "GSEP: A robust vocal and accompaniment separation system using gated CBHG module and loudness normalization." arXiv preprint arXiv:2010.12139 (2020).

    The two-dimensional convolution network used in the Spleeter for a frequency component misses the useful information at lower or higher frequency components which is out of the kernel range.

2.2. Source Separation

  • Harmonics

  • Timbre of 'Singing Voice' - determined by resonance patterns

2.2. Alternatives

  • 1-D CNNs
    • Liu, Jen-Yu, and Yi-Hsuan Yang. "Dilated convolution with dilated GRU for music source separation." arXiv preprint arXiv:1906.01203 (2019).
  • Dilated Convolutions
    • Takahashi, Naoya, and Yuki Mitsufuji. "D3Net: Densely connected multidilated DenseNet for music source separation." arXiv preprint arXiv:2010.01733 (2020).
  • FTBs: Frequency Transformation Blocks
    • PHASEN, ours
  • RNNs: \sim FTBs
    • Takahashi, Naoya, Nabarun Goswami, and Yuki Mitsufuji. "Mmdenselstm: An efficient combination of convolutional and recurrent neural networks for audio source separation." 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2018.

2.2. Receptive Field: 2-D Conv Layer vs. FC Layer

  • A single Fully-Connected Layer
    • can capture every freq-to-freq correlation!
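
A small illustration of this contrast (sizes are arbitrary): a 3x3 conv kernel relates a frequency bin only to its immediate neighbours, while a fully-connected layer over the frequency axis has an F x F weight matrix connecting every pair of bins.

import torch.nn as nn

F = 1024                                   # number of frequency bins (arbitrary)

# a 3x3 conv: each output bin depends on only 3 neighbouring input bins per layer
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
print(conv.weight.shape)                   # torch.Size([1, 1, 3, 3])

# an FC layer along the frequency axis: every output bin can depend on every input bin
fc = nn.Linear(F, F)
print(fc.weight.shape)                     # torch.Size([1024, 1024])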

2.3. Our Approach: Injecting FTBs into U-Nets

  • FTBs: Frequency Transformation Blocks

    • An FTB called Time-Distributed Fully-connected Layer (TDF):

    • Choi, Woosung, et al. "Investigating u-nets with various intermediate blocks for spectrogram-based singing voice separation." 21st International Society for Music Information Retrieval Conference, ISMIR, 2020.

2.3. Time-Distributed Fully-connected Layer

import torch
import torch.nn as nn

class TDF(nn.Module):
    ''' [B, channels, T, F] => [B, channels, T, F] '''
    def __init__(self, channels, f, bf=16, bias=False, min_bn_units=16):

        '''
        channels: # channels
        f: num of frequency bins
        bf: bottleneck factor. if None: a single fully-connected layer.
            else: a bottleneck MLP that maps f => max(f//bf, min_bn_units) => f
        bias: bias setting of the linear layers
        '''

        super(TDF, self).__init__()

        if bf is None:
            # a single fully-connected layer applied along the frequency axis
            self.tdf = nn.Sequential(
                nn.Linear(f, f, bias),
                nn.BatchNorm2d(channels),
                nn.ReLU()
            )
        else:
            # a two-layer bottleneck: f => bn_units => f
            bn_units = max(f // bf, min_bn_units)
            self.tdf = nn.Sequential(
                nn.Linear(f, bn_units, bias),
                nn.BatchNorm2d(channels),
                nn.ReLU(),
                nn.Linear(bn_units, f, bias),
                nn.BatchNorm2d(channels),
                nn.ReLU()
            )

    def forward(self, x):
        # the linear layers act on the last (frequency) axis, so they are applied
        # independently to every time frame (time-distributed)
        return self.tdf(x)
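
For example (hypothetical shapes), a TDF preserves the input shape while letting every frequency bin interact with every other bin within each time frame:

x = torch.rand(4, 2, 128, 1024)   # [batch, channels, frames, freq bins]
tdf = TDF(channels=2, f=1024)
print(tdf(x).shape)               # torch.Size([4, 2, 128, 1024])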

2.3. Injecting TDFs into a U-Net framework

  • Building Block TFC-TDF: Densely connected 2-d Conv (TFC) with TDFs


  • U-Net with TFC-TDFs

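
A simplified sketch of such a building block, reusing the TDF class above; the layer count and dense-connectivity details here are our own assumptions and may differ from the paper's exact TFC-TDF:

class TFC_TDF(nn.Module):
    ''' Simplified sketch of a TFC-TDF block (not the authors' exact implementation).
        TFC: a small stack of densely connected 2-D convolutions over (T, F).
        TDF: the frequency transformation defined above, added via a skip connection. '''
    def __init__(self, channels, f, num_convs=3):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels * (i + 1), channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU())
            for i in range(num_convs)])
        self.tdf = TDF(channels, f)

    def forward(self, x):
        # dense connectivity: each conv sees the concatenation of the block input
        # and all previous conv outputs
        features = [x]
        for conv in self.convs:
            features.append(conv(torch.cat(features, dim=1)))
        h = features[-1]
        return h + self.tdf(h)    # skip connection around the TDF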

2.3. Results?

  • Ablation (n_fft = 2048)

    • U-Net with 17 TFC blocks: SDR 6.89dB
    • U-Net with 17 TFC-TDF blocks: SDR 7.12dB (+0.23 dB)
  • Large Model (n_fft = 4096)

2.3. Why does it work?: Weight visualization

  • frequency patterns of different sources captured by TDFs (our FTBs)


2.3. ISMIR 2020


3. Part 2: LaSAFT for Conditioned Source Separation

  • review: Conditioned-U-Net (C-U-Net) for Conditioned Source Separation
  • motivation: Extending FTB to Conditioned Source Separation
    • Naive Extension: Injecting FTBs into C-U-Net?
    • (empirical results) It works, but ...
  • solution: Latent Source Attentive Frequency Transformation Block (LaSAFT)
  • how to modulate latent features: a more complex manipulation method than FiLM

3.1. Conditioned Source Separation

  • Task Definition
    • Input: an input audio track $A$ and a one-hot encoding vector $C$ that specifies which instrument we want to separate
    • Output: separated track of the target instrument
  • Method: Conditioning Learning
    • can separate different instruments with the aid of the control mechanism.
    • Conditioned-U-Net (C-U-Net)
      • Meseguer-Brocal, Gabriel, and Geoffroy Peeters. "Conditioned-U-Net: Introducing a Control Mechanism in the U-Net for Multiple Source Separations." Proceedings of the 20th International Society for Music Information Retrieval Conference. 2019.

3.1. C-U-Net

  • Conditioned-U-Net extends the U-Net by exploiting Feature-wise Linear Modulation (FiLM)


3.1. C-U-Net: Feature-wise Linear Modulation

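A minimal sketch of FiLM-style conditioning as used in C-U-Net; the single-layer condition generator below is our own simplification (C-U-Net uses a deeper generator that emits one (gamma, beta) pair per modulated layer):

import torch
import torch.nn as nn

class FiLM(nn.Module):
    ''' Feature-wise Linear Modulation (sketch): per-channel gamma * x + beta,
        with gamma and beta produced from the condition vector. '''
    def __init__(self, num_instruments, channels):
        super().__init__()
        # toy condition generator: a single linear layer (C-U-Net's is deeper)
        self.generator = nn.Linear(num_instruments, 2 * channels)

    def forward(self, x, condition):
        # x: [B, C, T, F], condition: one-hot [B, num_instruments]
        gamma, beta = self.generator(condition).chunk(2, dim=-1)
        return gamma[:, :, None, None] * x + beta[:, :, None, None]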

3.2. Naive Extension: Injecting FTBs into C-U-Net?

  • Baseline C-U-Net + TFC-TDFs


  • TFC vs TFC-TDF

3.2. Naive Extension: Above our expectation

  • TFC vs TFC-TDF

  • Although it does improve SDR performance by capturing common frequency patterns observed across all instruments,

    • merely injecting an FTB into a C-U-Net does not inherit the spirit of FTBs (capturing the frequency patterns of the target instrument)
  • We propose the Latent Source-Attentive Frequency Transformation (LaSAFT), a novel frequency transformation block that can capture instrument-dependent frequency patterns by exploiting scaled dot-product attention

3.3. LaSAFT: Motivation

  • Extending TDF to the Multi-Source Task
    • Naive Extension: MUX-like approach

      • A TDF for each instrument: $\mathcal{I}$ instruments => $\mathcal{I}$ TDFs
    • However, there are in fact many more 'instruments' we have to consider

      • female-classic-soprano, male-jazz-baritone ... \in 'vocals'
      • kick, snare, rimshot, hat(closed), tom-tom ... \in 'drums'
      • contrabass, electronic, walking bass piano (boogie woogie) ... \in 'bass'

3.3. Latent Source-attentive Frequency Transformation

  • We assume that there are $\mathcal{I}_L$ latent instruments

    • string-finger-low_freq
    • string-bow-low_freq
    • brass-high-solo
    • percussive-high
    • ...
  • We assume each instrument can be represented as a weighted average of them

    • bass: 0.7 string-finger-low_freq + 0.2 string-bow-low_freq + 0.1 percussive-low
  • LaSAFT

    • $\mathcal{I}_L$ TDFs for $\mathcal{I}_L$ latent instruments
    • attention-based weighted average

3.3. LaSAFT: Extending TDF to the Multi-Source Task (1)


  • duplicate $\mathcal{I}_L$ copies of the second layer of the TDF, where $\mathcal{I}_L$ refers to the number of latent instruments.
    • $\mathcal{I}_L$ is not necessarily the same as $\mathcal{I}$ for the sake of flexibility
  • For the given frame $V \in \mathbb{R}^F$, we obtain the $\mathcal{I}_L$ latent instrument-dependent frequency-to-frequency correlations, denoted by $V' \in \mathbb{R}^{F \times \mathcal{I}_L}$.

3.3. LaSAFT: Extending TDF to the Multi-Source Task (2)


  • The left side determines how much attention each latent source should receive
  • The LaSAFT takes as input the instrument embedding $z_e \in \mathbb{R}^{1 \times E}$.
  • It has a learnable weight matrix $K \in \mathbb{R}^{\mathcal{I}_L \times d_k}$, where we denote the dimension of each instrument's hidden representation by $d_k$.
  • By applying a linear layer of size $d_k$ to $z_e$, we obtain $Q \in \mathbb{R}^{d_k}$.

3.3. LaSAFT: Extending TDF to the Multi-Source Task (3)


  • We now can compute the output of the LaSAFT as follows:

    • $Attention(Q,K,V') = softmax\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V'$
  • We apply a LaSAFT after each TFC in the encoder and after each FiLM/GPoCM layer in the decoder. We employ a skip connection for LaSAFT and TDF, as in TFC-TDF.
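
Putting the three steps together, a simplified sketch of a LaSAFT layer; shapes, defaults, and normalization placement are our assumptions, so consult the official repository for the exact implementation:

import torch
import torch.nn as nn

class LaSAFT(nn.Module):
    ''' Simplified sketch of a LaSAFT layer (shapes and defaults are assumptions). '''
    def __init__(self, channels, f, num_latent=6, embedding_dim=32, dk=32,
                 bf=16, min_bn_units=16):
        super().__init__()
        bn_units = max(f // bf, min_bn_units)
        # shared first TDF layer (frequency bottleneck)
        self.first = nn.Sequential(
            nn.Linear(f, bn_units), nn.BatchNorm2d(channels), nn.ReLU())
        # I_L copies of the second TDF layer, one per latent instrument
        self.seconds = nn.ModuleList([
            nn.Sequential(nn.Linear(bn_units, f), nn.BatchNorm2d(channels), nn.ReLU())
            for _ in range(num_latent)])
        self.key = nn.Parameter(torch.randn(num_latent, dk))    # K: [I_L, d_k]
        self.query = nn.Linear(embedding_dim, dk)                # z_e => Q
        self.dk = dk

    def forward(self, x, z_e):
        # x: [B, C, T, F] spectrogram-like features, z_e: [B, E] instrument embedding
        h = self.first(x)
        # V': one frequency transformation per latent instrument => [B, C, T, F, I_L]
        v = torch.stack([second(h) for second in self.seconds], dim=-1)
        q = self.query(z_e)                                              # [B, d_k]
        attn = torch.softmax(q @ self.key.T / self.dk ** 0.5, dim=-1)    # [B, I_L]
        # attention-weighted average over the latent instruments
        return torch.einsum('bctfi,bi->bctf', v, attn)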

3.3. Effects of employing LaSAFTs instead of TFC-TDFs


3.4. GPoCM: more complex manipulation method than FiLM

  • FiLM (left) vs PoCM (right)


  • PoCM is an extension of FiLM.
    • FiLM does not have inter-channel operations,
    • while PoCM does have inter-channel operations

3.4. GPoCM: more complex manipulation method than FiLM (2)

  • PoCM is an extension of FiLM

    • $FiLM(X^{i}_{c}|\gamma_{c}^{i},\beta_{c}^{i}) = \gamma_{c}^{i} \cdot X^{i}_{c} + \beta_{c}^{i}$

    • $PoCM(X^{i}_{c}|\omega_{c}^{i},\beta_{c}^{i}) = \beta_{c}^{i} + \sum_{j}{\omega_{cj}^{i} \cdot X^{i}_{j}}$

      • where $\gamma_{c}^{i}$, $\omega_{c}^{i}$, and $\beta_{c}^{i}$ are parameters generated by the condition generator, and $X^{i}$ is the output of the $i^{th}$ decoder's intermediate block, whose subscript $c$ refers to the $c^{th}$ channel of $X^{i}$


3.4. GPoCM: more complex manipulation method than FiLM (3)

  • Since this channel-wise linear combination can also be viewed as a point-wise convolution, we name it PoCM. With inter-channel operations, PoCM can modulate features more flexibly and expressively than FiLM.

  • Instead of PoCM, we use Gated PoCM (GPoCM), since GPoCM is robust for the source separation task. It is natural to use a gated approach for source separation tasks because a sparse latent vector (one that contains many near-zero elements) obtained by applying GPoCMs naturally yields a separated result (i.e., more silent than the original).

  • $GPoCM(X^{i}_{c}|\omega_{c}^{i},\beta_{c}^{i}) = \sigma(PoCM(X^{i}_{c}|\omega_{c}^{i},\beta_{c}^{i})) \odot X^{i}_{c}$

    • where $\sigma$ is the sigmoid function and $\odot$ denotes the Hadamard product.
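
A short sketch of both modulations; here `omega` and `beta` are assumed to be produced per decoder block by the condition generator:

import torch

def pocm(x, omega, beta):
    ''' PoCM (sketch): a condition-generated point-wise (1x1) convolution.
        x: [B, C, T, F], omega: [B, C, C], beta: [B, C] '''
    # out[b, c] = beta[b, c] + sum_j omega[b, c, j] * x[b, j]
    return torch.einsum('bcj,bjtf->bctf', omega, x) + beta[:, :, None, None]

def gpocm(x, omega, beta):
    ''' Gated PoCM: the sigmoid of the PoCM output gates x element-wise. '''
    return torch.sigmoid(pocm(x, omega, beta)) * x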

Experimental Results


LaSAFT + GPoCM

  • achieved state-of-the-art SDR performance on the vocals and other tasks of MUSDB18.


  • news: this state-of-the-art result is now outdated :(

Discussion

  • The authors of C-U-Net tried to manipulate the latent space in the encoder,

    • assuming the decoder can perform as a general spectrogram generator that is 'shared' by different sources.
  • However, we found that this approach is not practical since it makes the latent space (i.e., the decoder's input feature space) more discontinuous.

  • Via preliminary experiments, we observed that applying FiLMs in the decoder was consistently better than applying FiLMs in the encoder.

Links

  • Choi, Woosung, et al. "Investigating u-nets with various intermediate blocks for spectrogram-based singing voice separation." 21st International Society for Music Information Retrieval Conference, ISMIR, 2020.

  • Choi, Woosung, et al. "LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation." arXiv preprint arXiv:2010.11631 (2020).