We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples while being conditioned on textual descriptions or melodic features, allowing better control over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing that the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples can be found in the supplementary materials. Code and models are available in our repository: github.com/facebookresearch/audiocraft.
Check out our paper, Simple and Controllable Music Generation, for more information.
In the following, we compare MusicGen 3.3B (including stereo generation) to several prior works detailed in the paper: MusicLM, using the public AI Test Kitchen demo; Riffusion, using the provided pre-trained models; and Mousai, which we retrained on the same dataset as our proposed MusicGen model.
Description | MusicGen | MusicGen Stereo | MusicLM | Riffusion | Mousai |
Pop dance track with catchy melodies, tropical percussion, and upbeat rhythms, perfect for the beach | |||||
A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle. | |||||
classic reggae track with an electronic guitar solo | |||||
earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves | |||||
lofi slow bpm electro chill with organic samples | |||||
drum and bass beat with intense percussions | |||||
A dynamic blend of hip-hop and orchestral elements, with sweeping strings and brass, evoking the vibrant energy of the city. | |||||
violins and synths that inspire awe at the finiteness of life and the universe | |||||
80s electronic track with melodic synthesizers, catchy beat and groovy bass | |||||
reggaeton track, with a booming 808 kick, synth melodies layered with Latin percussion elements, uplifting and energizing | |||||
a piano and cello duet playing a sad chambers music | |||||
smooth jazz, with a saxophone solo, piano chords, and snare full drums | |||||
a light and cheerly EDM track, with syncopated drums, aery pads, and strong emotions | |||||
a punchy double-bass and a distorted guitar riff | |||||
acoustic folk song to play during roadtrips: guitar flute choirs | |||||
rock with saturated guitars, a heavy bass line and crazy drum break and fills. |
We now experiment with our novel chroma-based melody conditioning. We condition on famous melodies from classical music, along with new text descriptions, to provide interpretations in any genre or style. We use our MusicGen 1.5B model with melody and text conditioning.
Source of Melody | Description | MusicGen |
90s rock song with electric guitar and heavy drums | ||
- | An 80s driving pop song with heavy drums and synth pads in the background | |
- | An energetic hip-hop music piece, with synth sounds and strong bass. There is a rhythmic hi-hat patten in the drums. | |
90s rock song with electric guitar and heavy drums | ||
- | An 80s driving pop song with heavy drums and synth pads in the background | |
- | An energetic hip-hop music piece, with synth sounds and strong bass. There is a rhythmic hi-hat patten in the drums. | |
90s rock song with electric guitar and heavy drums | ||
- | An 80s driving pop song with heavy drums and synth pads in the background | |
- | An energetic hip-hop music piece, with synth sounds and strong bass. There is a rhythmic hi-hat patten in the drums. |
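To illustrate what a chroma feature captures, the sketch below folds spectral peaks into the 12 pitch classes, discarding octave information. It is a minimal illustration only, assuming equal temperament with A4 = 440 Hz. The helper names `pitch_class` and `chroma_frame` are hypothetical, and this is not MusicGen's actual feature extractor.

```python
import math
from typing import List, Sequence

A4_HZ = 440.0  # assumed tuning reference


def pitch_class(freq_hz: float) -> int:
    """Map a frequency to a pitch class 0..11 (C=0, ..., A=9, B=11)."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    # MIDI note number of the nearest equal-tempered pitch (A4 = 69).
    midi = 69 + round(12 * math.log2(freq_hz / A4_HZ))
    return midi % 12


def chroma_frame(freqs_hz: Sequence[float], mags: Sequence[float]) -> List[float]:
    """Accumulate peak magnitudes into 12 pitch-class bins for one time frame."""
    if len(freqs_hz) != len(mags):
        raise ValueError("freqs_hz and mags must have the same length")
    bins = [0.0] * 12
    for f, m in zip(freqs_hz, mags):
        bins[pitch_class(f)] += m
    total = sum(bins)
    # Normalize so the frame sums to 1; leave an all-zero frame untouched.
    return [b / total for b in bins] if total > 0 else bins


if __name__ == "__main__":
    # Two A's an octave apart plus a C: octaves collapse into the same bin.
    print(chroma_frame([440.0, 880.0, 261.63], [1.0, 1.0, 2.0]))
```

Because octaves collapse into a single bin, a sequence of such frames describes the melodic contour while leaving timbre and instrumentation free to be set by the text description.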
It is possible to generate longer sequences using a fixed 30-second window. We slide the window forward in chunks of 10 seconds, keeping the last 20 seconds of generated audio as context. We use MusicGen 3.3B.
Description | MusicGen |
lofi slow bpm electro chill with organic samples | |
a light and cheerly EDM track, with syncopated drums, aery pads, and strong emotions | |
A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle. |
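The sliding-window schedule described above can be sketched as follows. This is a minimal illustration of the timing only, assuming the 30-second window and 10-second stride stated in the text. The `Step` type and `plan_windows` function are hypothetical names; no model is called.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    context_start: float  # start (s) of previously generated audio reused as context
    gen_start: float      # start (s) of the newly generated chunk
    gen_end: float        # end (s) of the newly generated chunk


def plan_windows(total: float, window: float = 30.0, stride: float = 10.0) -> List[Step]:
    """Plan generation steps for `total` seconds of audio.

    The first step fills one whole window. Each later step keeps the last
    (window - stride) seconds as context and generates `stride` new seconds.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    steps = [Step(0.0, 0.0, min(window, total))]
    while steps[-1].gen_end < total:
        gen_start = steps[-1].gen_end
        gen_end = min(gen_start + stride, total)
        context_start = gen_start - (window - stride)
        steps.append(Step(context_start, gen_start, gen_end))
    return steps


if __name__ == "__main__":
    for step in plan_windows(60.0):
        print(step)
```

With the default values, every step after the first conditions on 20 seconds of existing audio and adds 10 new seconds, so a 60-second track takes four steps.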