Python Extract Audio Fbank Feature for Training – Python Tutorial

June 10, 2022

If you plan to train a deep learning model using wav files, you may have to extract audio features from them. In this tutorial, we will show you how to extract the fbank feature.


We can use the librosa and python_speech_features libraries. Install them with pip:

pip install librosa
pip install python_speech_features

Read audio file data

We usually extract audio data from wav files in PCM format. This kind of file contains uncompressed sound data.

To check the format of your wav files, you can read:

View Audio Sample Rate, Data Format PCM or ALAW Using ffprobe – Python Tutorial

Python Read WAV Data Format, PCM or ALAW – Python Tutorial
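As a quick sanity check with Python's built-in wave module (a minimal sketch; the file name tone.wav and the synthetic tone are just examples, not from this tutorial), you can write and inspect a 16-bit PCM wav yourself:

```python
import math
import struct
import wave

# write a 0.1 s, 16 kHz, 16-bit mono PCM wav (synthetic 440 Hz tone, for illustration)
path = "tone.wav"  # example file name
sr = 16000
samples = [int(10000 * math.sin(2 * math.pi * 440 * t / sr)) for t in range(sr // 10)]
with wave.open(path, "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 2 bytes per sample -> 16-bit PCM
    w.setframerate(sr)
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# read the header back: channels, bits per sample, sample rate
with wave.open(path, "rb") as w:
    print(w.getnchannels(), w.getsampwidth() * 8, w.getframerate())  # 1 16 16000
```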

Then, we can read wav data using python librosa. Here is the example:

import librosa

audio_file = "test.wav"  # path to your wav file
sample_rate = 16000
audio, sr = librosa.load(audio_file, sr=sample_rate, mono=True)

Here audio_file is the path of the wav file, audio is the wav data as a numpy ndarray, and sr is the sample rate of this file.

You can also read wav data using scipy.io.wavfile.read(). The difference between them is explained here:

The Difference Between scipy.io.wavfile.read() and librosa.load() in Python – Python Tutorial

Extract audio fbank feature

After reading the wav data, we can extract its fbank feature. We can use python_speech_features to implement it.

Here is an example:

import python_speech_features

frame_len = 0.025   # window length in seconds (25 ms)
frame_shift = 0.01  # window step in seconds (10 ms)
wav_feature, energy = python_speech_features.fbank(audio, sr, nfilt=256, winlen=frame_len, winstep=frame_shift)

The wav_feature is the fbank feature of this wav file.

Notice: as explained in: Understand the Difference of MelSpec, FBank and MFCC in Audio Feature Extraction – Python Audio Processing

We can find that wav_feature is actually a MelSpec. In order to get the FBank, we should use the logfbank() method or:

wav_feature = numpy.log(wav_feature)

As to python_speech_features.fbank(), it is defined as:

def fbank(signal,samplerate=16000,winlen=0.025,winstep=0.01,
          nfilt=26,nfft=512,lowfreq=0,highfreq=None,preemph=0.97,
          winfunc=lambda x:numpy.ones((x,))):

We should notice that nfilt determines the feature dimension of the output.

For example, you may find the shape of wav_feature is 499 x 256. Here 499 is the number of frames, which is determined by the wav file's total duration, winlen and winstep; 256 is nfilt.
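The frame count can be sketched with a little arithmetic. The helper below is hypothetical (not part of python_speech_features), but it mirrors the usual framing rule: one frame for the first window, then one more per step:

```python
import math

def num_frames(duration_s, sample_rate=16000, winlen=0.025, winstep=0.01):
    # hypothetical helper: estimate the number of frames for a signal
    n = int(duration_s * sample_rate)          # total samples
    flen = int(round(winlen * sample_rate))    # samples per window, e.g. 400
    fstep = int(round(winstep * sample_rate))  # samples per step, e.g. 160
    if n <= flen:
        return 1
    return 1 + int(math.ceil((n - flen) / fstep))

# a 5 s wav at 16 kHz with the defaults gives 499 frames -> shape (499, nfilt)
print(num_frames(5.0))  # 499
```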

Normalize the fbank feature

After we have got wav file fbank feature, we can normalize it. Here is an example:

import numpy as np

def normalize_frames(m, epsilon=1e-12):
    # normalize each frame to zero mean and unit variance
    return np.array([(v - np.mean(v)) / max(np.std(v), epsilon) for v in m])

wav_feature = normalize_frames(wav_feature)
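A quick check on random data (a sketch using numpy only; the matrix is a stand-in for a real fbank feature) confirms each frame ends up with zero mean and unit variance:

```python
import numpy as np

def normalize_frames(m, epsilon=1e-12):
    # per-frame mean/variance normalization, as above
    return np.array([(v - np.mean(v)) / max(np.std(v), epsilon) for v in m])

feat = np.random.rand(499, 256)  # stand-in for a (frames, nfilt) fbank matrix
norm = normalize_frames(feat)

print(np.allclose(norm.mean(axis=1), 0.0))  # each frame has ~zero mean
print(np.allclose(norm.std(axis=1), 1.0))   # and ~unit variance
```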

Then you can use this wav_feature as the input to train your model.

You may see this fbank feature as follows: