The ability to generate sounds, including speech, is valuable in many industries, such as entertainment. With the advent of deep learning, deep generative models have grown in popularity, especially for text-to-audio (TTA) generation, the task of creating audio from text input.

Taking inspiration from Stable Diffusion, a paper by Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley proposes AudioLDM: a TTA generation model built on latent diffusion models. A notable aspect of this model is its use of CLAP (Contrastive Language-Audio Pretraining) for latent extraction, which allows AudioLDM to learn audio representations without modeling cross-modal relationships and so achieve both efficiency and quality.

We have summarized the paper below, but here are some key points that are important to note:

  1. AudioLDM uses CLAP for latent extraction. This yields audio embeddings that carry text information without having to compute cross-modal dependencies, which improves computational efficiency.
  2. In tandem with the above advantage, the use of CLAP improves quality, because the model's generative prior now comes from audio only (refer to the forward process of the diffusion model in the paper).
  3. The model and the tractability of its latent variable allow for competitive results in zero-shot text-guided audio style transfer, inpainting, and super-resolution, showing its applicability beyond TTA.

Audio Generation

The figure below shows the model architecture. The model is composed of several components, each of which is discussed in turn.

Figure 1. AudioLDM architecture for text-to-audio generation

1. Contrastive Language-Audio Pretraining

Inspired by the recent success of CLIP in text-to-image generation, AudioLDM takes advantage of CLAP to aid its TTA synthesis (marked as Contrastive objective in the figure above). CLAP uses a text encoder and an audio encoder to extract embeddings from the two sources and aligns them in a shared space through its contrastive objective. This allows for the extraction of an audio embedding that contains text information.
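To make the contrastive objective more concrete, here is a minimal NumPy sketch of a symmetric CLIP-style loss over a batch of paired audio and text embeddings. It is an illustration under stated assumptions, not the CLAP implementation: the function names, the temperature value, and the random vectors standing in for encoder outputs are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the audio-text similarity matrix.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    The temperature is an assumed value, not taken from the paper.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature           # shape (batch, batch)
    idx = np.arange(len(a))                  # matching pairs sit on the diagonal
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

# Random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
print(contrastive_loss(emb, emb))        # well-aligned pairs: low loss
print(contrastive_loss(emb, emb[::-1]))  # mismatched pairs: higher loss
```

Minimizing this kind of loss pulls each audio embedding toward the embedding of its own caption and away from the other captions in the batch, which is why the resulting audio embedding can later be swapped for a text embedding.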

2. Conditional Latent Diffusion Models

With CLAP, AudioLDM extracts an audio embedding Ex from audio input x and a text embedding Ey from text input y. The latent diffusion model, with parameters theta, is then trained so that its distribution p(z0|Ey) matches the true conditional data distribution q(z0|Ey), where z0 is the latent of the audio sample x.
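In practice, diffusion models usually approach this kind of distribution matching through noise prediction. The block below writes out the standard closed-form forward process and noise-prediction loss in the notation of this summary; the noise-schedule symbol \bar{\alpha}_n and the network symbol \epsilon_\theta are assumed notation, and the paper should be consulted for the exact formulation.

```latex
% Forward process: noise the clean latent z_0 to diffusion step n
z_n = \sqrt{\bar{\alpha}_n}\, z_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Training loss: predict the added noise, conditioned on the audio embedding E^x
\mathcal{L}(\theta) = \mathbb{E}_{z_0,\, \epsilon,\, n}
  \left\| \epsilon - \epsilon_\theta(z_n, n, E^x) \right\|_2^2
```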

The diffusion model essentially achieves the following during its training/sampling phase:

  • During training, the audio embedding Ex is used as the condition for noise estimation when learning to synthesize z0.
  • During sampling, the audio input is absent, so z0 is synthesized using noise predicted from the text embedding Ey.

In other words, the diffusion model is conditioned on audio embeddings during training and on text embeddings during sampling. Classifier-free guidance (CFG) is applied to the diffusion model: its noise estimate is modified with a guidance scale that determines how strongly the conditioning information affects the result.
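The effect of the guidance scale can be sketched in a few lines. The snippet below uses a commonly seen form of classifier-free guidance, which extrapolates from the unconditional noise estimate toward the conditional one. The function name and the toy arrays are hypothetical, and the exact parameterization in the paper may differ.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise estimates.

    guidance_scale = 0 ignores the condition, 1 reproduces the purely
    conditional estimate, and values above 1 push further toward the condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy arrays stand in for the outputs of a noise-prediction network.
eps_cond = np.array([1.0, 2.0])    # estimate given the text embedding
eps_uncond = np.array([0.0, 0.0])  # estimate with the condition dropped
print(cfg_noise(eps_cond, eps_uncond, 2.0))
```

At sampling time this requires two noise estimates per step, one with the text embedding and one without it.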

3. Encoder/Decoder

A VAE encoder compresses the audio's mel-spectrogram into the latent space, and the VAE decoder reconstructs the mel-spectrogram from z0. The VAE is trained with reconstruction, adversarial, and Gaussian constraint losses. The reconstructed mel-spectrogram is finally fed into a vocoder to generate the audio waveform (AudioLDM uses HiFi-GAN).

Text-Guided Audio Manipulation

The authors show that the model not only excels at TTA, but also performs well on text-guided audio manipulation as well as audio inpainting and super-resolution. This involves a slight change to the model’s inference flow, as shown below.

Figure 2. AudioLDM architecture for text-guided audio manipulation

First, text-guided audio manipulation is attained by exploiting the nature of the diffusion process. Via a shallow reverse process, the latent variable z0 can be computed from an “earlier” latent at diffusion step n0, where n0 < N. The latent at n0 is obtained by applying the forward diffusion process to the source audio that is to be modified with text guidance. The starting point of the reverse process, n0, controls the manipulation result: as n0 approaches N (pure Gaussian noise), information from the original source audio is gradually lost and the model behaves like conventional TTA.

Additionally, by exposing the model to the latent representation of partial audio (denoted z with superscript ob in the diagram above), the model can also be used for inpainting and audio super-resolution, generating the missing audio from the given audio latent. This is attained by applying observation masks during the reverse diffusion process. The masks mark the generation and observation parts, indicating which sections of the audio must be generated from the text embedding conditioning while the original audio is retained elsewhere.
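The two mechanisms described above can be sketched with NumPy. The first helper noises a source latent to step n0 with the closed-form forward process, which gives the starting point for a shallow reverse process. The second blends the observed region back into the generated latent at each reverse step. All function names, the linear noise schedule, and its values are assumptions for illustration, and the denoising network itself is left out.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(z0, n, alpha_bar, rng):
    """Sample the latent at step n (1-indexed) directly from the clean latent z0."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[n - 1]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def blend_observed(z_generated, z_observed_noised, mask):
    """Keep the observed region (mask == 1) and generate the rest (mask == 0)."""
    return mask * z_observed_noised + (1.0 - mask) * z_generated

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
z0_source = np.ones((8, 16))  # stand-in for the latent of the source audio

# Style transfer: start the reverse process at n0 < N instead of pure noise.
z_start = forward_noise(z0_source, n=300, alpha_bar=alpha_bar, rng=rng)

# Inpainting: the first half of the time axis is observed, the rest is generated.
mask = np.zeros_like(z0_source)
mask[:, :8] = 1.0
z_step = blend_observed(rng.standard_normal(z0_source.shape), z_start, mask)
```

A smaller n0 keeps more of the source audio, while a larger n0 gives the text prompt more influence, which matches the trade-off described above.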


While the idea of generative AI with latent diffusion is not new, the model’s capability in zero-shot text-guided audio style transfer is interesting. With appropriate training data, the shallow reverse diffusion process could be used to embed emotion and add effects to already synthesized audio. As an example, given a sample of a narrating voice that has already been generated by a text-to-speech (TTS) system, the user could input a guiding text, such as “An angry man shouting with sirens buzzing in the background.”, to modify the original so that it matches the guidance.

Full paper: