A Guide to Aligning Mandarin with Montreal Forced Aligner (MFA) Using Chinese Pinyin Input

May 15, 2024

Montreal Forced Aligner (MFA) is a powerful tool for aligning speech with its text transcript. In this tutorial, we will show how to align Mandarin speech, specifically when the transcript is given as Chinese pinyin.

Step 1: Install MFA

The simplest way to install MFA is with conda. For example:

conda install -c conda-forge montreal-forced-aligner

If MFA has been installed successfully, running the mfa command in a terminal should print its usage information.
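The installation check can also be scripted. The following is a minimal Python sketch, not part of MFA itself: the helper name find_mfa is made up for illustration, and the sketch assumes that the mfa executable accepts a version subcommand.

```python
import shutil
import subprocess


def find_mfa():
    """Return the full path of the mfa executable, or None if it is not on PATH."""
    return shutil.which("mfa")


if __name__ == "__main__":
    mfa_path = find_mfa()
    if mfa_path is None:
        print("mfa not found; activate the conda environment where it was installed")
    else:
        # Assumption: "mfa version" prints the installed version.
        subprocess.run([mfa_path, "version"], check=False)
```

If the script reports that mfa is not found, the most likely cause is that the conda environment used for the installation is not active.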

Step 2: Install a pretrained acoustic model

There are several pretrained acoustic models for Mandarin. You can find them here:

https://mfa-models.readthedocs.io/en/latest/acoustic/index.html#acoustic

For Mandarin, there are three models to choose from.

[Figure: Mandarin MFA acoustic models]

However, these models use the MFA phone set, not Chinese pinyin.

For example, the phone set of the Mandarin MFA dictionary v2.0.0a is listed here:

https://mfa-models.readthedocs.io/en/latest/dictionary/Mandarin/Mandarin%20MFA%20dictionary%20v2_0_0a.html#mandarin-mfa-dictionary-v2-0-0a

[Figure: Mandarin MFA phone set]

This means we cannot align Mandarin by feeding pinyin to these models directly.

In the end, we select v2.0.0a because it has a lower word error rate (WER):

  • WER: 5.0%
  • CER: 5.8%

Step 3: Download the MFA pinyin dictionary

To use Chinese pinyin, we need a dictionary that maps each pinyin syllable to the MFA phone set. You can download one here:

https://github.com/GalaxieT/MFA-mandarin-pinyin-dict-for-pretrained-mfa-model-v2.0

[Figure: mapping from Mandarin pinyin to the MFA phone set]

Then we can start aligning Mandarin.
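To make the role of the dictionary concrete, here is a minimal Python sketch. It assumes the dictionary is a plain-text file with one entry per line: the pinyin syllable first, followed by its phones, separated by whitespace. The function names and the sample entries (with placeholder phones PH_A, PH_B, PH_C) are made up for illustration and are not taken from the real dictionary.

```python
def load_dictionary_words(lines):
    """Collect the first column (the pinyin syllable) of every non-empty line."""
    words = set()
    for line in lines:
        parts = line.split()
        if parts:
            words.add(parts[0])
    return words


def missing_syllables(syllables, words):
    """Return the syllables that have no entry in the dictionary, in sorted order."""
    return sorted(set(syllables) - words)


# Hypothetical entries: the phones are placeholders, not real MFA phones.
sample_lines = [
    "ta1\tPH_A PH_B",
    "da4\tPH_C PH_B",
    "",
]
words = load_dictionary_words(sample_lines)
print(missing_syllables(["ta1", "da4", "gai4"], words))  # prints ['gai4']
```

For the real file, the lines can come from open(dictionary_path, encoding="utf-8"). Any syllable reported as missing would have no pronunciation available during alignment, so it is worth checking before running MFA.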

Step 4: Prepare the speech audio and text

When preparing the speech corpus, keep the following in mind:

  • Sample rate: MFA uses a consistent frequency range (20-7800 Hz). Files with a sampling rate higher or lower than 16 kHz are resampled to 16 kHz by default during feature generation.
  • Duration: each file should be shorter than 30 seconds for best performance.
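Both points can be checked before alignment. The sketch below uses only Python's standard wave module, which reads uncompressed PCM WAV files; the function names and the wording of the notes are made up for illustration, and the two thresholds simply restate the notes above.

```python
import wave

TARGET_RATE = 16000   # Hz, the rate MFA resamples to by default (per the notes above)
MAX_SECONDS = 30.0    # recommended upper bound on duration (per the notes above)


def inspect_wav(path):
    """Return (sample_rate_hz, duration_seconds) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        rate = wav_file.getframerate()
        duration = wav_file.getnframes() / rate
    return rate, duration


def corpus_notes(path):
    """List notes about a file; an empty list means there is nothing to report."""
    rate, duration = inspect_wav(path)
    notes = []
    if rate != TARGET_RATE:
        notes.append(f"{rate} Hz: will be resampled to {TARGET_RATE} Hz")
    if duration >= MAX_SECONDS:
        notes.append(f"{duration:.1f} s: consider splitting below {MAX_SECONDS:.0f} s")
    return notes
```

A file at a different sample rate is not an error, since MFA resamples it. The note is only a reminder that audio recorded well below 16 kHz carries less information than the models expect.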

You can find more details here:

https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/corpus_structure.html

Each utterance in the speech corpus should be a pair of files with the same base name: speech_filename.wav and speech_filename.lab.

For example:

[Figure: MFA speech corpus]

The transcript of 20005.wav is: 行行行,大概他还要多久呀?

We need to convert the Chinese text to pinyin. See:

Python Convert Chinese String to Pinyin: A Step Guide – Python Tutorial

The content of 20005.lab is: xing2 xing2 xing2 , da4 gai4 ta1 hai2 yao4 duo1 jiu3 ya5 ?

Note that punctuation marks such as “,。!?” do not affect alignment.
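Before alignment, it can help to verify that every token in a .lab file looks like a pinyin syllable with a tone number. The following Python sketch assumes syllables are written with the lowercase letters a-z followed by a tone digit from 1 to 5, as in the 20005.lab example; the function name is made up, and the real dictionary may use other conventions (for example for ü). Tokens that consist only of punctuation are skipped.

```python
import re

# Assumption: a syllable is lowercase ASCII letters plus a tone digit 1-5.
SYLLABLE = re.compile(r"[a-z]+[1-5]")


def malformed_tokens(lab_text):
    """Return the tokens that are neither pure punctuation nor syllable+tone."""
    bad = []
    for token in lab_text.split():
        if not any(ch.isalnum() for ch in token):
            continue  # pure punctuation
        if not SYLLABLE.fullmatch(token):
            bad.append(token)
    return bad


line = "xing2 xing2 xing2 , da4 gai4 ta1 hai2 yao4 duo1 jiu3 ya5 ?"
print(malformed_tokens(line))  # prints []
```

A Chinese character that was left unconverted, or a syllable without its tone digit, shows up in the returned list.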

Step 5: Align Mandarin with MFA

We can use the command below:

mfa align "D:\03-TTS\10-文本对齐\v2.0\wave" "D:\03-TTS\10-文本对齐\v2.0\MFA-mandarin-pinyin-dict-for-pretrained-mfa-model-v2.0-main\mandarin_pinyin_to_mfa_lty.dict" "D:\03-TTS\10-文本对齐\v2.0\acoustic-mandarin_mfa-v2.0.0a.zip" "D:\03-TTS\10-文本对齐\v2.0\result"

The command format is:

mfa align speech_corpus_path mfa_dict_path mfa_acoustic_path result_path
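When the paths contain spaces or non-ASCII characters, it can be convenient to launch the same command from Python. The sketch below only wraps the command format shown above; the function names are made up for illustration.

```python
import subprocess


def build_align_command(corpus_dir, dictionary_path, acoustic_model_path, output_dir):
    """Build the argument list for: mfa align corpus dict acoustic output."""
    return [
        "mfa",
        "align",
        str(corpus_dir),
        str(dictionary_path),
        str(acoustic_model_path),
        str(output_dir),
    ]


def run_align(corpus_dir, dictionary_path, acoustic_model_path, output_dir):
    """Run the alignment and raise CalledProcessError if mfa exits with an error."""
    command = build_align_command(
        corpus_dir, dictionary_path, acoustic_model_path, output_dir
    )
    return subprocess.run(command, check=True)
```

Passing the arguments as a list, rather than as one shell string, means each path is handed to mfa unchanged, so no manual quoting is needed.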

Running this command, we will get:

[Figure: output of mfa align on Mandarin]

Note that, with speech_corpus_path=D:\03-TTS\10-文本对齐\v2.0\wave, MFA will create a wave.log file. The name wave is taken from the last component of speech_corpus_path.

[Figure: MFA log file]

Finally, we will get some TextGrid files.

[Figure: MFA Mandarin TextGrid files]
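A quick way to confirm that every utterance was aligned is to compare file names. This Python sketch assumes a flat corpus folder and that each TextGrid has the same base name as its audio file, as with 20005.wav and 20005.TextGrid; the function name is made up for illustration.

```python
from pathlib import Path


def missing_textgrids(corpus_dir, result_dir):
    """Return the names of .wav files that have no matching .TextGrid in result_dir."""
    corpus_dir, result_dir = Path(corpus_dir), Path(result_dir)
    missing = []
    for wav_path in sorted(corpus_dir.glob("*.wav")):
        if not (result_dir / (wav_path.stem + ".TextGrid")).exists():
            missing.append(wav_path.name)
    return missing
```

An empty list means every .wav file has a TextGrid. Any file that is listed is worth looking up in the log file.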

Step 6: Validate the TextGrid files with Praat

Open 20005.wav and 20005.TextGrid together in Praat, and we will see:

[Figure: validating an MFA TextGrid in Praat]

We can also use wavelet_prosody_toolkit to validate the alignment.

For example:

python wavelet_prosody_toolkit/wavelet_gui.py

We should select a folder that contains only .wav and .TextGrid files, with no .lab files.
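Such a folder can be assembled by copying the audio and the alignments into a fresh directory. The following is a minimal Python sketch; the function name and the idea of a separate review folder are illustrative assumptions, not something wavelet_prosody_toolkit requires.

```python
import shutil
from pathlib import Path

KEEP = {".wav", ".textgrid"}  # compared in lowercase, so .TextGrid matches


def collect_for_review(source_dirs, target_dir):
    """Copy only .wav and .TextGrid files from source_dirs into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_dirs:
        for path in sorted(Path(source).iterdir()):
            if path.is_file() and path.suffix.lower() in KEEP:
                shutil.copy2(path, target / path.name)
                copied.append(path.name)
    return copied
```

Calling collect_for_review with the corpus folder and the result folder leaves the .lab files behind, so the target folder can be opened directly in the GUI.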

[Figure: evaluating a TextGrid file with wavelet_prosody_toolkit]

Then, we will see:

[Figure: example of a TextGrid file evaluated with wavelet_prosody_toolkit]