Code Release

We release our code at yerfor/SyntaSpeech.

Abstract

Recent progress in non-autoregressive text-to-speech (NAR-TTS) has made fast and high-quality speech synthesis possible. However, current NAR-TTS models usually take a phoneme sequence as input and thus cannot capture the tree-structured syntactic information of the input sentence, which hurts prosody modeling. To this end, we propose SyntaSpeech, a syntax-aware and light-weight NAR-TTS model that integrates tree-structured syntactic information into the prosody modeling modules of PortaSpeech. Specifically, 1) we build a syntactic graph based on the dependency tree of the input sentence, then process the text encoding with a syntactic graph encoder to extract the syntactic information; 2) we incorporate the extracted syntactic encoding into PortaSpeech to improve prosody prediction; 3) we introduce a multi-length discriminator to replace the flow-based post-net in PortaSpeech, which simplifies the training pipeline and improves inference speed while keeping the naturalness of the generated audio. Experiments on three datasets show that the tree-structured syntactic information enables SyntaSpeech to synthesize better audio with expressive prosody, and that SyntaSpeech generalizes to multiple languages and to multi-speaker text-to-speech. Ablation studies demonstrate the necessity of each component in SyntaSpeech.
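
As a rough illustration of step 1, the sketch below turns a dependency parse into a directed graph over words. It is not the released implementation: the function name `build_syntactic_graph`, the head-index input format, and the choice to store both a forward (head to dependent) and a backward (dependent to head) edge per dependency are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def build_syntactic_graph(heads: List[int]) -> Dict[str, List[Edge]]:
    """Build edge lists from a dependency parse (hypothetical helper).

    heads[i] is the index of the head word of word i, or -1 for the root.
    Returns forward edges (head -> dependent) and backward edges
    (dependent -> head) as lists of (source, target) pairs.
    """
    forward: List[Edge] = []
    backward: List[Edge] = []
    for dependent, head in enumerate(heads):
        if head == -1:
            continue  # the root word has no head, so it adds no edge
        if not 0 <= head < len(heads):
            raise ValueError(f"head index {head} out of range for word {dependent}")
        forward.append((head, dependent))
        backward.append((dependent, head))
    return {"forward": forward, "backward": backward}


# Toy parse of "the cat sat": "the" depends on "cat", "cat" on "sat" (root).
words = ["the", "cat", "sat"]
heads = [1, 2, -1]
graph = build_syntactic_graph(heads)
```

In a full system, the head indices would come from a dependency parser, and the edge lists would be handed to a graph encoder that operates on word-level text encodings.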

Audio Samples

We provide audio samples generated by the TTS systems on three datasets: LJSpeech (a single-speaker English dataset), BiaoBei (a single-speaker Chinese dataset), and LibriTTS (a multi-speaker English dataset).

LJSpeech (English, single-speaker)

  1. "The essential point to be remembered is that the ornament, whatever it is, whether picture or pattern work, should form part of the page"
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

  2. "John of Spires and his brother Vindelin, followed by Nicolas Jenson, began to print in that city"
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

  3. "And things got worse and worse through the whole of the seventeenth century, so that in the eighteenth printing was very miserably performed."
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

BiaoBei (Mandarin, single-speaker)

  1. lǎo yòu zǎi chǒng quǎn wán shuǎ
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

  2. sǎo sǎo gěi zuò yóu miàn yáng ròu dùn shān yào
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

  3. wèn niú ròu xìn yuán jiā luó zǎi chǎng
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

LibriTTS (English, multi-speaker)

  1. "With it all, however, they went a second source of dissatisfaction."
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

  2. "And they went on and saw a cow with a calf; and they thought that they would milk the cow and drink the milk, but when they went to catch it it ran away from them and would not let itself be caught; and they sang:"
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

  3. "It is much better to cause people to think more than we say, and not outrage language, and run the risk of going beyond what we ought to say."
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

  4. "I was about to throw it away, but I remember that it was inflammable and burned with a good bright flame was, in fact, an excellent candle and I put it in my pocket."
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

  5. "Citizens, let us offer the protest of the corpses. Let us show that, if the people abandon the republicans, the republicans do not abandon the people."
    Audio (wav): GT, GT (voc.), FastSpeech 2, PortaSpeech, PortaSpeech (adv.), SyntaSpeech

Audio Samples Provided in Rebuttal

We additionally ran Glow-TTS (NeurIPS 2020) and DiffSpeech (AAAI 2022) on LJSpeech to compare against flow-based and diffusion-based models.

  1. "The essential point to be remembered is that the ornament, whatever it is, whether picture or pattern work, should form part of the page"
    Audio (wav): Glow-TTS, DiffSpeech

  2. "John of Spires and his brother Vindelin, followed by Nicolas Jenson, began to print in that city"
    Audio (wav): Glow-TTS, DiffSpeech

  3. "And things got worse and worse through the whole of the seventeenth century, so that in the eighteenth printing was very miserably performed."
    Audio (wav): Glow-TTS, DiffSpeech