NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms

Yashan Wang1*, Shangda Wu1*, Jianhuai Hu1, Xingjian Du2, Yueqi Peng3,
Yongxin Huang4, Shuai Fan5, Xiaobing Li1, Feng Yu1, Maosong Sun1,6

1Central Conservatory of Music, China, 2University of Rochester, USA,
3Beijing Flowingtech Ltd., China, 4Independent Researcher,
5Beihang University, China, 6Tsinghua University, China

*Indicates Equal Contribution



(Videos and audio on this page are rendered and exported using Sibelius + NotePerformer.)

Abstract

We introduce NotaGen, a symbolic music generation model that aims to explore the potential of producing high-quality classical sheet music. Inspired by the success of Large Language Models (LLMs), NotaGen adopts pre-training, fine-tuning, and reinforcement learning paradigms (henceforth referred to as the LLM training paradigms). It is pre-trained on 1.6M pieces of music, and then fine-tuned on approximately 9K high-quality classical compositions conditioned on "period-composer-instrumentation" prompts. For reinforcement learning, we propose the CLaMP-DPO method, which further enhances generation quality and controllability without requiring human annotations or predefined rewards. Our experiments demonstrate the efficacy of CLaMP-DPO in symbolic music generation models with different architectures and encoding schemes. Furthermore, subjective A/B tests against human compositions show that NotaGen outperforms baseline models, greatly advancing musical aesthetics in symbolic music generation.

Data Representation


[Figure: Data representation with interleaved ABC notation]

ABC notation is a musical notation system that uses a combination of letters, numbers, and symbols to represent musical notes. We adopt a modified version, interleaved ABC notation, as our data representation. In this format, the different voices of the same bar are rearranged onto a single line and differentiated by the voice indicator "[V:]". Furthermore, we remove bars that consist entirely of rests.
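As a concrete illustration, here is a hypothetical two-voice fragment in this interleaved format: each line carries one bar, and the "[V:]" indicators separate the voices within that bar (the header values and notes are invented for illustration, not taken from our corpus):

    X:1
    %%score { 1 | 2 }
    L:1/8
    M:4/4
    K:C
    V:1 treble
    V:2 bass
    [V:1]CEGc ecGE |[V:2]C,2 E,2 G,2 E,2 |
    [V:1]DFAd fdAF |[V:2]D,2 F,2 A,2 F,2 |]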

There have long been misconceptions about the capabilities of ABC notation, for instance, that it "cannot handle multitrack music" or that it is "unable to represent complex scores". In fact, ABC is remarkably versatile. Not only can it accommodate multitrack music for any instrumentation, but it can also capture nearly all elements found in the Western staff notation system. In addition to notes and rhythms, ABC notation can represent various playing techniques, expression marks, and tempo indications, as well as flexibly incorporate text annotations, as the fragment below illustrates.
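For example, this hypothetical single-voice fragment attaches a tempo marking (Q:), a dynamic (!p!, !f!), a hairpin crescendo, a slur, and a free-text annotation ("^dolce") to the melody (all values are illustrative):

    X:1
    L:1/8
    Q:1/4=72
    M:3/4
    K:G
    V:1 treble
    "^dolce"!p!(GABc d2)|!crescendo(!B2 A2 G2!crescendo)!|!f!G6|]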

Model Architecture


[Figure: Model architecture]

NotaGen utilizes the TunesFormer architecture with bar-stream patching. It consists of two hierarchical GPT-2 decoders: a patch-level decoder and a character-level decoder. Each patch is flattened by concatenating the one-hot vectors of its characters and then passed through a linear layer to obtain the patch embedding. The patch-level decoder captures the temporal relationships among patches, and its final hidden states are passed to the character-level decoder, which auto-regressively predicts the characters of the next patch.
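The sketch below illustrates this two-level design in PyTorch. It is a minimal reconstruction for illustration, not the released implementation: the layer counts, hidden size, patch size, and vocabulary size are placeholder values, and details such as patch padding and special tokens are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from transformers import GPT2Config, GPT2LMHeadModel, GPT2Model

    class HierarchicalPatchDecoder(nn.Module):
        # Two-level decoder: a patch-level GPT-2 models the sequence of
        # bar-stream patches; a character-level GPT-2 spells out each
        # next patch character by character.
        def __init__(self, vocab_size=128, patch_size=64, hidden=768):
            super().__init__()
            self.vocab_size = vocab_size
            # Flattened one-hot characters of a patch -> one patch embedding.
            self.patch_embed = nn.Linear(patch_size * vocab_size, hidden)
            self.patch_decoder = GPT2Model(GPT2Config(
                n_embd=hidden, n_layer=6, n_head=12, n_positions=1024))
            self.char_decoder = GPT2LMHeadModel(GPT2Config(
                vocab_size=vocab_size, n_embd=hidden, n_layer=3, n_head=12,
                n_positions=patch_size + 1))

        def forward(self, patches):
            # patches: (batch, n_patches, patch_size) character ids.
            b, n, p = patches.shape
            onehot = F.one_hot(patches, self.vocab_size).float().view(b, n, -1)
            ctx = self.patch_decoder(
                inputs_embeds=self.patch_embed(onehot)).last_hidden_state
            # The hidden state of patch i serves as the prefix that conditions
            # the character-level decoder to predict the characters of patch i+1.
            target = patches[:, 1:, :].reshape(b * (n - 1), p)
            prefix = ctx[:, :-1, :].reshape(b * (n - 1), 1, -1)
            char_emb = self.char_decoder.transformer.wte(target)
            inputs = torch.cat([prefix, char_emb[:, :-1, :]], dim=1)  # teacher forcing
            logits = self.char_decoder(inputs_embeds=inputs).logits
            return F.cross_entropy(logits.reshape(-1, self.vocab_size),
                                   target.reshape(-1))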


Training Paradigms


[Figure: Overview of the training paradigms]

Pre-training

NotaGen is pre-trained on 1.6M pieces of music. This corpus covers a wide range of genres and periods, enabling NotaGen to capture fundamental musical structures and patterns through next-token prediction.

Fine-tuning

NotaGen is fine-tuned on high-quality classical sheet music to further enhance the musicality of its generations. We curated a fine-tuning dataset comprising 8,948 classical music sheets covering 152 composers, drawn from the DCML corpora, the OpenScore String Quartet Corpus, the OpenScore Lieder Corpus, ATEPP, KernScores, and internal resources. We label every piece with one of three periods (Baroque, Classical, and Romantic) and one of six instrumentations (Keyboard, Chamber, Orchestral, Art Song, Choral, and Vocal-Orchestral). Each piece is prepended with a "period-composer-instrumentation" prompt for conditional generation, as sketched below.
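As a hypothetical illustration of such a prompt (the exact token format used by NotaGen may differ), a conditioned piece could begin with three prompt lines followed by the interleaved ABC tune:

    %Baroque
    %Bach, Johann Sebastian
    %Keyboard
    X:1
    L:1/16
    M:4/4
    K:Cm
    (interleaved ABC tune body follows)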

Reinforcement Learning

To refine both the musicality and the prompt controllability of the fine-tuned NotaGen, we present CLaMP-DPO. This method builds on the principles of Reinforcement Learning from AI Feedback (RLAIF) and implements Direct Preference Optimization (DPO). In CLaMP-DPO, CLaMP 2, a multimodal symbolic music information retrieval model, serves as the evaluator within the DPO framework, distinguishing chosen from rejected musical outputs to optimize NotaGen. Our experiments demonstrate that CLaMP-DPO enhances both controllability and musicality across different symbolic music generation models, irrespective of their data modalities, encoding schemes, or model architectures. This underscores CLaMP-DPO's broad applicability to auto-regressively trained symbolic music generation models.
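For reference, the core of DPO is a simple pairwise objective over sequence log-probabilities under the policy and a frozen reference model. The sketch below shows the standard DPO loss, not NotaGen-specific code; how CLaMP 2 scores could be used to form preference pairs is summarized in the trailing comment and is a simplified reading of the method:

    import torch
    import torch.nn.functional as F

    def dpo_loss(pi_logp_chosen, pi_logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Standard DPO objective: push the policy to prefer "chosen" over
        # "rejected" more strongly than the frozen reference model does.
        # All inputs are per-sequence log-probabilities of shape (batch,).
        policy_margin = pi_logp_chosen - pi_logp_rejected
        reference_margin = ref_logp_chosen - ref_logp_rejected
        return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()

    # In CLaMP-DPO, preference pairs come from the AI evaluator rather than
    # human labels: for each prompt, sample several pieces from the current
    # policy, score each with CLaMP 2 against the prompt's target style, and
    # treat high-scoring pieces as "chosen" and low-scoring ones as "rejected".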


Generated Samples

Thank you for reading this far! In addition to classical music, we have also adapted NotaGen to a pop music style. We used around 100 popular songs from the last century to fine-tune the pre-trained model and applied reinforcement learning. Please enjoy a pop song composed by NotaGen! (This video is a screen recording from the MuseScore software.)

BibTeX

@misc{wang2025notagenadvancingmusicalitysymbolic,
  title={NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms},
  author={Yashan Wang and Shangda Wu and Jianhuai Hu and Xingjian Du and Yueqi Peng and Yongxin Huang and Shuai Fan and Xiaobing Li and Feng Yu and Maosong Sun},
  year={2025},
  eprint={2502.18008},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2502.18008}
}