Contributors: Jean-Louis Durrieu
Web: https://git.epfl.ch/repo/pyfasst/ , https://github.com/wslihgt/pyfasst
A Python implementation of the Flexible Audio Source Separation Toolbox (FASST)
This toolbox makes it possible to use the FASST framework and extend it within a Python program. It is primarily a rewrite in Python of the original Matlab (C) version. The object-oriented framework allows one to extend it and create new models in an easy way, by subclassing pyfasst.audioModel.FASST and re-implementing some methods (in particular methods like pyfasst.audioModel.MultiChanNMFInst_FASST._initialize_structures()).
Most of the code is written in Python, but occasionally there may be some C source code, requiring either Cython or SWIG to compile. In general, to run this code, the required components are:
- Matplotlib http://matplotlib.sourceforge.net
- Numpy http://numpy.scipy.org
- Scipy http://www.scipy.org
- setuptools https://pypi.python.org/pypi/setuptools
In addition to the aforementioned packages, installing this package requires compiling the tracking part, pyfasst.SeparateLeadStereo.tracking._tracking. In the corresponding folder, type:
python setup.py build_ext --inplace
We have implemented several classes that can be used directly, without the need to re-implement or sub-class pyfasst.audioModel.FASST. In particular, we have:
pyfasst.audioModel.MultiChanNMFInst_FASST, pyfasst.audioModel.MultiChanNMFConv, pyfasst.audioModel.MultiChanHMM: these classes originate from the distributed Matlab version of FASST. For example, separating the voice and the guitar on the tamy <> example, with a simple 2-source model, instantaneous mixing parameters and an NMF model on the spectral parameters, goes as follows (to be run from the directory containing the tamy.wav file) - don't expect very good results!:
>>> import pyfasst.audioModel as am
>>> filename = 'data/tamy.wav'
>>> # initialize the model
>>> model = am.MultiChanNMFInst_FASST(
...     audio=filename,
...     nbComps=2,
...     nbNMFComps=32,
...     spatial_rank=1,
...     verbose=1,
...     iter_num=50)
>>> # estimate the parameters
>>> model.estim_param_a_post_model()
>>> # separate the sources using these parameters
>>> model.separate_spat_comps(dir_results='data/')

The results can be somewhat improved by using convolutive mixing parameters:
>>> import pyfasst.audioModel as am
>>> filename = 'data/tamy.wav'
>>> # initialize the model
>>> model = am.MultiChanNMFConv(
...     audio=filename,
...     nbComps=2,
...     nbNMFComps=32,
...     spatial_rank=1,
...     verbose=1,
...     iter_num=50)
>>> # to be more flexible, the user _has to_ make the parameters
>>> # convolutive by hand. This way, she can also start to estimate
>>> # parameters in an instantaneous setting, as an initialization,
>>> # and only afterwards "upgrade" to a convolutive setting:
>>> model.makeItConvolutive()
>>> # estimate the parameters
>>> model.estim_param_a_post_model()
>>> # separate the sources using these parameters
>>> model.separate_spat_comps(dir_results='data/')

The following example shows the results for a more synthetic case (a synthesized anechoic mixture of the voice and the guitar, with a delay of 0 samples for the voice and of 10 samples between the left and the right channel for the guitar):
>>> import pyfasst.audioModel as am
>>> filename = 'data/dev1__tamy-que_pena_tanto_faz___thetas-0.79,0.79_delays-10.00,0.00.wav'
>>> # initialize the model
>>> model = am.MultiChanNMFConv(
...     audio=filename,
...     nbComps=2,
...     nbNMFComps=32,
...     spatial_rank=1,
...     verbose=1,
...     iter_num=200)
>>> # to be more flexible, the user _has to_ make the parameters
>>> # convolutive by hand. This way, she can also start to estimate
>>> # parameters in an instantaneous setting, as an initialization,
>>> # and only afterwards "upgrade" to a convolutive setting:
>>> model.makeItConvolutive()
>>> # we can initialize these parameters with the DEMIX algorithm
>>> # [Arberet2010]:
>>> model.initializeConvParams(initMethod='demix')
>>> # and estimate the parameters:
>>> model.estim_param_a_post_model()
>>> # separate the sources using these parameters
>>> model.separate_spat_comps(dir_results='data/')

pyfasst.audioModel.multiChanSourceF0Filter: this class assumes that all the sources share the same spectral shape dictionaries and spectral structure, _i.e._ a source/filter model (2 _factors_, in FASST terminology), where the filter spectral shape dictionary is generated as a collection of smooth windows (overlapping Hann windows), and the source dictionary is computed as a collection of spectral combs following a simple vocal glottal model (see [Durrieu2010]). The advantage of this class is that, in terms of memory, all the sources share the same dictionaries. However, this means that it makes no sense to modify these dictionaries (at least not individually, which is what this algorithm would otherwise do), and they are therefore fixed by default. This class also provides methods that help initialize the various parameters, assuming the specific structure presented above.
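To make the filter dictionary concrete, here is a small sketch of a collection of overlapping Hann windows covering the frequency axis, as described above. This is purely illustrative: the function name, the 50% overlap and the sizes are assumptions, not pyfasst's actual API.

```python
import numpy as np

def hann_filter_dict(n_freqs=513, n_windows=20):
    """Illustrative smooth filter dictionary: overlapping Hann windows
    spanning the frequency axis, one window per column."""
    # 50%-overlapping windows: each window spans 2 * hop frequency bins
    hop = n_freqs // (n_windows + 1)
    win_len = 2 * hop
    window = np.hanning(win_len)
    dictionary = np.zeros((n_freqs, n_windows))
    for k in range(n_windows):
        start = k * hop
        end = min(start + win_len, n_freqs)
        dictionary[start:end, k] = window[:end - start]
    return dictionary

W = hann_filter_dict()
print(W.shape)  # (513, 20)
```

Each column is a smooth bump on the frequency axis, so non-negative combinations of the columns yield smooth filter envelopes.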
Additionally, we provide a (not very exhaustive) plotting module which helps display some interesting features of the model, for instance:
>>> import pyfasst.tools.plotTools as pt
>>> # display the estimated spectral components
>>> # (one per row of subplots)
>>> pt.subplotsAudioModelSpecComps(model)
>>> # display a graph showing where the sources have been "spatially"
>>> # estimated: in an anechoic case, ideally, the graph for the
>>> # corresponding source is null everywhere, except at the delay
>>> # between the two channels:
>>> delays, delayDetectionFunc = pt.plotTimeCorrelationMixingParams(model)
TODO: add typical SDR/SIR results for these examples.
FASST.spat_comps:

spat_comps[n]
    n-th spatial component: a dictionary with the fields detailed below.

spat_comps[n]['time_dep']
    Defines the time dependencies. Possible value: 'indep'.

spat_comps[n]['mix_type']
    Which type of mixing should be considered:
    - 'inst': instantaneous
    - 'conv': convolutive

spat_comps[n]['frdm_prior']
    - 'free' to update the mixing parameters
    - 'fixed' to keep the parameters unchanged

spat_comps[n]['params']
    The actual mixing parameters:
    - mix_type == 'inst': (n_channels x rank) numpy.ndarray
    - mix_type == 'conv': (rank x n_channels x n_freq) numpy.ndarray
Note: the way the parameters are stored is somewhat convoluted; a more consistent ordering of the parameters between the instantaneous and the convolutive cases would be an improvement.
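As a rough illustration of the layout described above, a single spatial component for a 2-channel, rank-1 convolutive mixture could look like the following. The field names come from the table; the concrete sizes and the all-ones initialization are arbitrary choices for the sketch, not pyfasst defaults.

```python
import numpy as np

# arbitrary sizes for the sketch
n_channels, rank, n_freq = 2, 1, 513

spat_comp = {
    'time_dep': 'indep',    # no time dependency
    'mix_type': 'conv',     # convolutive mixing
    'frdm_prior': 'free',   # the mixing parameters will be updated
    # for 'conv' mixing: (rank x n_channels x n_freq) array
    'params': np.ones((rank, n_channels, n_freq)),
}

# spat_comps maps the component index n to such a dictionary
spat_comps = {0: spat_comp}
print(spat_comps[0]['params'].shape)  # (1, 2, 513)
```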
FASST.spec_comps:

spec_comps[n]
    n-th spectral component: a dictionary with the following fields.

spec_comps[n]['spat_comp_ind']
    The index (integer) of the associated spatial component in spat_comps.

spec_comps[n]['factor'][f]
    f-th factor of the n-th spectral component: a dictionary with the following parameters.

spec_comps[n]['factor'][f]['FB']
    Frequency Basis: (nbFreqsSigRepr x n_FB_elts) ndarray, where n_FB_elts is the number of elements in the basis (or dictionary).

spec_comps[n]['factor'][f]['FW']
    Frequency Weights: (n_FB_elts x n_FW_elts) ndarray, where n_FW_elts is the number of desired combinations of FB elements.

spec_comps[n]['factor'][f]['TW']
    Time Weights: (n_FW_elts x n_TB_elts) ndarray, where n_TB_elts is the number of elements in the time basis.

spec_comps[n]['factor'][f]['TB']
    Temporal Basis: empty list [] or (n_TB_elts x nbFramesSigRepr) ndarray. If [], then n_TB_elts in TW should be nbFramesSigRepr.

spec_comps[n]['factor'][f]['TW_constr']
    Constraint on the time weights:
    - 'NMF': Non-negative Matrix Factorization
    - 'GMM', 'GSMM': Gaussian (Scale) Mixture Model
    - 'HMM', 'SHMM': (Scaled) Hidden Markov Model

spec_comps[n]['factor'][f]['TW_all']
    For discrete state models (TW_constr != 'NMF'), keeps track of the scales for all the possible states. Same shape as spec_comps[n]['factor'][f]['TW'].

spec_comps[n]['factor'][f]['TW_DP_params']
    Dynamic Programming (?) parameters: prior or transition probabilities.
    - TW_constr in ('GMM', 'GSMM'): (number_states) ndarray of prior probabilities for each state, where number_states is the number of states (typically spec_comps[n]['factor'][f]['TW'].shape[0]).
    - TW_constr in ('HMM', 'SHMM'): (number_states x number_states) ndarray of transition probabilities.

spec_comps[n]['factor'][f]['XX_frdm_prior']
    Whether to update a parameter set or not, where XX is one of FB, FW, TW, TB, TW_DP:
    - 'free': update the parameters
    - 'fixed': do not update
The key names are reproduced from the Matlab toolbox.
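Following the shapes in the table, the power spectrogram of one factor is obtained by chaining the basis and weight matrices, FB x FW x TW (x TB when a temporal basis is present). The sketch below only illustrates these shapes; the sizes and random contents are arbitrary, not pyfasst defaults.

```python
import numpy as np

# arbitrary sizes for the sketch
nbFreqsSigRepr, n_FB_elts, n_FW_elts, nbFramesSigRepr = 513, 32, 32, 100

factor = {
    'FB': np.abs(np.random.randn(nbFreqsSigRepr, n_FB_elts)),
    'FW': np.eye(n_FB_elts),  # (n_FB_elts x n_FW_elts)
    'TW': np.abs(np.random.randn(n_FW_elts, nbFramesSigRepr)),
    'TB': [],                 # empty: TW directly spans the frames
    'TW_constr': 'NMF',
    'FB_frdm_prior': 'fixed',
    'FW_frdm_prior': 'free',
    'TW_frdm_prior': 'free',
}

# chaining the matrices gives a (nbFreqsSigRepr x nbFramesSigRepr) power
power = factor['FB'] @ factor['FW'] @ factor['TW']
if len(factor['TB']):
    power = power @ factor['TB']
print(power.shape)  # (513, 100)
```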
The FASST framework and the audio signal model are described in [Ozerov2012]. This Python version was implemented mostly thanks to the Matlab (C) code available at http://bass-db.gforge.inria.fr/fasst/.
For initialization purposes, several side algorithms and systems have also been implemented:
- SIMM model (Smooth Instantaneous Mixture Model) from [Durrieu2010] and [Durrieu2011]: allows one to analyze, detect and separate the lead instrument from a polyphonic audio (musical) mixture. Note: the original purpose of this implementation was to provide a sensible way of feeding information from the SIMM model into the more general multi-channel audio source separation model provided, for instance, by FASST. It is implemented in the pyfasst.SeparateLeadStereo.SeparateLeadStereoTF module.
[Arberet2010] S. Arberet, R. Gribonval and F. Bimbot, "A Robust Method to Count and Locate Audio Sources in a Multichannel Underdetermined Mixture", IEEE Transactions on Signal Processing, 2010, Vol. 58, pp. 121-133.
[Durrieu2010] J.-L. Durrieu, G. Richard, B. David and C. Févotte, "Source/Filter Model for Main Melody Extraction From Polyphonic Audio Signals", IEEE Transactions on Audio, Speech and Language Processing, special issue on Signal Models and Representations of Musical and Environmental Sounds, March 2010, Vol. 18 (3), pp. 564-575.
[Durrieu2011] J.-L. Durrieu, G. Richard and B. David, "A Musically Motivated Representation For Pitch Estimation And Musical Source Separation", IEEE Journal of Selected Topics in Signal Processing, October 2011, Vol. 5 (6), pp. 1180-1191.
[Ozerov2012] A. Ozerov, E. Vincent and F. Bimbot, "A General Flexible Framework for the Handling of Prior Information in Audio Source Separation", IEEE Transactions on Audio, Speech, and Language Processing, 2012, Vol. 20 (4), pp. 1118-1133.