SemanticAudio

Audio Generation and Editing in Semantic Space

Zheqi Dai¹, Guangyan Zhang², Haolin He¹, Xiquan Li³, Jingyu Li², Chunyat Wu¹, Yiwen Guo⁴,⋆, Qiuqiang Kong¹,⋆
¹The Chinese University of Hong Kong    ²LIGHTSPEED    ³Shanghai Jiao Tong University    ⁴Independent Researcher

Abstract

In recent years, text-to-audio generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), which often leads to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation that captures the global identity and temporal sequence of sound events, as distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer then produces high-fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training-free text-guided editing mechanism that enables precise attribute-level modifications of general audio without retraining. Specifically, the mechanism steers the semantic generation trajectory using the difference between the velocity fields derived from the source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in both semantic alignment and audio fidelity.
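The two-stage sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the velocity fields below are hypothetical closed-form stand-ins for trained networks, the dimensions and the names `euler_sample`, `make_toy_velocity`, `W_plan`, and `W_synth` are assumptions made for illustration, and a plain Euler solver is assumed for the flow-matching ODE.

```python
import numpy as np

def euler_sample(velocity_fn, x0, cond, num_steps=16):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

def make_toy_velocity(proj):
    """Hypothetical stand-in for a trained velocity network.

    Returns the velocity of the straight-line path from the current
    point toward the (assumed) target proj @ cond.
    """
    def velocity(x, t, cond):
        target = proj @ cond
        return (target - x) / (1.0 - t)  # t < 1 inside euler_sample
    return velocity

# Assumed toy dimensions: the semantic space is the most compact one.
rng = np.random.default_rng(0)
TEXT_DIM, SEM_DIM, AC_DIM = 8, 4, 16
W_plan = rng.standard_normal((SEM_DIM, TEXT_DIM))   # text -> semantic
W_synth = rng.standard_normal((AC_DIM, SEM_DIM))    # semantic -> acoustic

planner_velocity = make_toy_velocity(W_plan)   # stage 1: Semantic Planner
synth_velocity = make_toy_velocity(W_synth)    # stage 2: Acoustic Synthesizer

def generate(text_emb, num_steps=16):
    """Stage 1 sketches a semantic plan; stage 2 conditions on that plan."""
    sem_noise = rng.standard_normal(SEM_DIM)
    semantic_plan = euler_sample(planner_velocity, sem_noise, text_emb, num_steps)
    ac_noise = rng.standard_normal(AC_DIM)
    acoustic_latent = euler_sample(synth_velocity, ac_noise, semantic_plan, num_steps)
    return semantic_plan, acoustic_latent
```

The point of the sketch is the data flow: the text embedding only conditions the first stage, and the second stage sees the text only through the semantic plan. In a real system the acoustic latent would then be decoded to a waveform, a step omitted here.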

Audio Generation Demos

Below are audio samples comparing our SemanticAudio with the Base Model, AudioLDM, and Ground Truth recordings. Samples are ordered by caption length from shortest to longest.

Audio Generation Framework

Caption Base Model SemanticAudio AudioLDM Ground Truth
1 Typing on a keyboard
2 Echoing male speech, laughter and applause
3 A speech and gunfire followed by a gun being loaded
4 A bird tweets far away and someone flushes the toilet
5 A train horn blows as a train approaches with warning bells ringing
6 Metal scrapping on wood followed by wood sanding then more metal scrapping against wood
7 A dog whimpering followed by a dog growling and barking as metal jingles and footsteps squeak on hard surface

Audio Edit Demos

Below are audio editing samples demonstrating our training-free text-guided editing mechanism. We show the source audio with its original caption, the target caption for editing, and results from two editing modes: with source caption guidance and without source caption guidance.
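As a rough illustration of the two editing modes, the sketch below steers source semantic features along the difference between target-conditioned and source-conditioned velocities. It is a simplified sketch under stated assumptions, not the authors' method: `toy_velocity` is a hypothetical stand-in for the trained semantic velocity network, the Euler integration over the full interval is assumed, and representing "without source caption" by a zero (null) embedding is an assumption.

```python
import numpy as np

def toy_velocity(z, t, cond, proj):
    """Hypothetical stand-in for the trained semantic velocity network."""
    return proj @ cond - z

def edit_semantic(z_src, tgt_emb, src_emb, proj, num_steps=16):
    """Steer source semantic features along v(target) - v(source).

    src_emb=None models editing without a source caption; a zero
    embedding is assumed as the null condition in that case.
    """
    if src_emb is None:
        src_emb = np.zeros_like(tgt_emb)
    z = z_src.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v_tgt = toy_velocity(z, t, tgt_emb, proj)
        v_src = toy_velocity(z, t, src_emb, proj)
        # Only the part of the velocity that differs between the two
        # prompts moves the features; shared content cancels out.
        z = z + dt * (v_tgt - v_src)
    return z
```

Because the update uses only the velocity difference, identical source and target prompts leave the features unchanged, which matches the intuition that attributes shared by both captions should be preserved.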

Audio Editing Framework

Source Caption Source Audio Target Caption Edit with Source Caption Edit without Source Caption
1 Source: Dishes clanking followed by metal clanking on glass several times as a man is talking
  Target: Dishes clanking followed by metal clanking on glass several times
2 Source: A dog growling and barking repeatedly
  Target: A bird chirping repeatedly
3 Source: A woman speaking clearly
  Target: A woman speaking clearly with light rain falling
4 Source: A woman speaking clearly
  Target: A bird speaking clearly
5 Source: A woman speaking clearly
  Target: A woman speaking clearly with traffic noise
6 Source: A woman speaking clearly
  Target: A man speaking clearly
7 Source: A woman speaking clearly
  Target: A baby speaking clearly
8 Source: Fast and loud typing on computer keyboard
  Target: Fast and quiet typing on computer keyboard