Extracting Instrument Qualities From Audio Signal

I'm looking to write a function that takes an audio signal (assumed to contain a single instrument playing) and maps its instrument-like features into a vector space. In theory, two signals with similar-sounding instruments (such as two pianos) should then yield fairly similar vectors (by Euclidean distance, cosine similarity, etc.). How would one go about doing this?
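
A minimal, dependency-free sketch of the comparison described above, assuming each signal has already been reduced to an equal-length 1-D feature vector. The function names and the three example vectors are hypothetical placeholders, not output from a real extractor:

```python
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors standing in for real extractor output.
piano_a = [0.9, 0.1, 0.4]
piano_b = [0.8, 0.2, 0.5]
flute = [0.1, 0.9, 0.0]

# The two "piano" vectors should score as more similar than piano vs. flute.
print(cosine_similarity(piano_a, piano_b) > cosine_similarity(piano_a, flute))
print(euclidean_distance(piano_a, piano_b) < euclidean_distance(piano_a, flute))
```

Euclidean distance is sensitive to the overall scale of the vectors, whereas cosine similarity only compares their direction, so the two measures can rank the same pairs differently.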

What I've tried: I'm currently extracting (and temporally averaging) the chroma energy, spectral contrast, MFCCs (with their first and second derivatives), and the Mel spectrogram, then concatenating them into a single representation vector:

import librosa
import numpy as np
import torch
import torchaudio

# expects a float tensor (dimensions: [1, num_samples]),
# similar to torchaudio.load() output.

# assume all signals contain a constant number of samples and are sampled at 44.1 kHz
def extract_instrument_features(signal, sr):
  # define hyperparameters:
  FRAME_LENGTH = 1024
  HOP_LENGTH = 512

  # librosa expects a 1-D numpy array, so drop the channel dimension:
  signal_np = signal.squeeze(0).numpy()

  # compute and perform temporal averaging of the chroma energy:
  ce = librosa.feature.chroma_cens(y=signal_np, sr=sr, hop_length=HOP_LENGTH)
  ce = ce.mean(axis=1)

  # compute and perform temporal averaging of the spectral contrast:
  spc = librosa.feature.spectral_contrast(y=signal_np, sr=sr, hop_length=HOP_LENGTH)
  spc = spc.mean(axis=1)

  # extract MFCC and its first & second derivatives
  # (librosa.feature.delta operates on numpy arrays, so stay in numpy here):
  mfcc = librosa.feature.mfcc(y=signal_np, sr=sr, n_mfcc=13, hop_length=HOP_LENGTH)
  mfcc_1st = librosa.feature.delta(mfcc)
  mfcc_2nd = librosa.feature.delta(mfcc, order=2)

  # temporal averaging of MFCCs:
  mfcc = mfcc.mean(axis=1)
  mfcc_1st = mfcc_1st.mean(axis=1)
  mfcc_2nd = mfcc_2nd.mean(axis=1)

  # define the mel spectrogram transform:
  mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,
    n_fft=FRAME_LENGTH,
    hop_length=HOP_LENGTH,
    n_mels=64
  )

  # extract the mel spectrogram (shape: [1, n_mels, num_frames])
  # and average over the time axis, leaving one value per mel band:
  ms = mel_spectrogram(signal)
  ms = torch.mean(ms, dim=2)[0].numpy()

  # concatenate and return the feature vector (as a 1-D numpy array):
  features = [ce, spc, mfcc, mfcc_1st, mfcc_2nd, ms]
  return np.concatenate(features)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
