Efficient and high-quality text-to-audio generation with a Latent Consistency Model.
Year: 2023
Website: https://audiolcm.github.io/
Input types: Text
Output types: Audio
Output length: Variable
AI Technique: Latent Diffusion
Dataset: Not disclosed for the "teacher" model; AudioCaps dataset (Kim et al., 2019) for the AudioLCM model
License type: MIT
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
The first in a suite of generative audio tools for producers and musicians to be released by Harmonai. The provided Jupyter notebooks allow users to perform unconditional random audio sample generation, audio sample regeneration/style transfer using a single audio file or recording, and audio interpolation between two audio files.
Year: 2022
Website: https://github.com/Harmonai-org/sample-generator
Input types: Audio Text
Output types: Audio
Output length: Variable
AI Technique: Latent Diffusion
Dataset: Online sources - glitch.cool, songaday.world, MAESTRO dataset, Unlocked Recordings, xeno-canto.org
License type: MIT
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
Language model for conditional music generation, developed by Meta. The output can be prompted by a text description and additionally conditioned on a melody.
Year: 2023
Website: https://ai.honu.io/papers/musicgen/
Input types: Text
Output types: Audio
Output length: 30s
AI Technique: Transformer
Dataset: NSynth Dataset; Others not disclosed
License type: MIT/CC-BY-NC
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
Open-source platform capable of generating music from text prompts.
Year: 2023
Website: https://okio.ai/
Input types: Audio Text
Output types: Audio
Output length: Variable
AI Technique: Suite of AI tools
Dataset: Not disclosed
License type: MIT for core tools
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
Music generation from text descriptions, based on Stable Diffusion. Can be conditioned on an image.
Year: 2022
Website: https://www.riffusion.com/
Input types: Text Image
Output types: Audio
Output length: around 3min
AI Technique: Diffusion
Dataset: Not disclosed
License type: MIT
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
Open-source text-to-audio model for generating samples and sound effects from text descriptions. The model enables audio variations and style transfer of audio samples. The creators claim it is ideal for creating drum beats, instrument riffs, ambient sounds, foley recordings and other audio samples for music production and sound design. Generates stereo audio at 44.1 kHz.
Year: 2023
Website: https://stability.ai/news/introducing-stable-audio-open
Input types: Audio Text
Output types: Audio
Output length: 47s
AI Technique: Diffusion
Dataset: freesound.org, freemusicarchive.org
License type:
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
Mustango is an open-source text-to-music model focused on fine-grained controllability, allowing users to specify musical attributes such as key or chord sequences.
Year: 2023
Website: https://amaai-lab.github.io/mustango/
Input types: Text
Output types: Audio
Output length: 10 sec
AI Technique: Latent Diffusion
Dataset: MusicBench
License type: MIT/CC-BY-SA
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
Yue AI is an open-source music generation model. The user can input lyrics and genre information as text, along with optional audio clips for context; input audio clips should be around 30 seconds long. Supported input languages include English, Mandarin Chinese, Cantonese, Japanese, and Korean. Yue AI outputs its generated music as an audio file up to 5 minutes long.
Year: 2025
Website: https://map-yue.github.io/
Input types: Audio Text
Output types: Audio
Output length: Up to 5 minutes
AI Technique: Transformer
Dataset: WeNetSpeech, LibriHeavy, GigaSpeech, 650K hours of internet mined data
License type: Apache-2.0
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
DiffRhythm is an open-source music generation model. The user can input lyrics and style information as text, along with an optional audio prompt for context; the audio prompt must be less than 10 seconds long. Supported languages include English and Chinese. DiffRhythm outputs its generated music as an audio file (MP3, WAV, or OGG) up to 285 seconds long.
Year: 2025
Website: https://aslp-lab.github.io/DiffRhythm.github.io/
Input types: Audio Text
Output types: Audio
Output length: 95-285 seconds
AI Technique: Latent Diffusion
Dataset: 300,000 hours of internet scraped music, cleaned to 25,000 hours of top-quality music
License type: Apache-2.0
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch:
Muzic ROC is an open-source music generation model. The user can input lyrics and a chord progression as text, and the model will output a melody. ROC is language-insensitive, so the language it works with can be changed by slightly editing the code.
Year: 2022
Website: https://github.com/microsoft/muzic/tree/main/roc
Input types: Text
Output types: MIDI
Output length: dependent on lyrics provided
AI Technique:
Dataset: LMD-matched MIDI dataset
License type: MIT
Real time:
Free:
Open source:
Checkpoints:
Fine-tune:
Train from scratch: