PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

In this work, we introduce PerceiverS, a novel model that builds on the Perceiver AR architecture by adding Effective Segmentation and a Multi-Scale attention mechanism. Effective Segmentation progressively expands the context segment during training so that training conditions align more closely with autoregressive generation, enabling smooth, coherent generation across ultra-long symbolic music sequences. The Multi-Scale attention mechanism further strengthens the model's ability to capture both long-term structural dependencies and short-term expressive details.
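To make the Effective Segmentation idea concrete, below is a minimal sketch of a progressive segmentation schedule, assuming a Perceiver AR-style split of each training example into a long context prefix (read via cross-attention) and a short latent segment that is predicted autoregressively. The function names, the linear growth schedule, and values such as `latent_len=1024` are illustrative assumptions, not the paper's exact settings.

    import numpy as np

    def progressive_context_length(step: int, total_steps: int,
                                   max_context: int = 32_768,
                                   min_context: int = 1_024) -> int:
        """Linearly grow the context segment over training.

        Early on, the model trains on short contexts (as at the start of
        autoregressive generation); by the end it sees the full 32,768-token
        window, so training conditions match generation-time conditions.
        """
        frac = min(step / total_steps, 1.0)
        return int(min_context + frac * (max_context - min_context))

    def make_training_segment(tokens: np.ndarray, step: int, total_steps: int,
                              latent_len: int = 1_024):
        """Crop one training example into (context, target) segments."""
        ctx_len = progressive_context_length(step, total_steps)
        # Pick a random start that leaves room for context + latent segment.
        hi = max(len(tokens) - (ctx_len + latent_len), 1)
        start = np.random.randint(0, hi)
        context = tokens[start : start + ctx_len]
        target = tokens[start + ctx_len : start + ctx_len + latent_len]
        return context, target

Under such a schedule, the distribution of context lengths seen during training gradually approaches the long contexts encountered when generating an entire piece, which is what allows generation to remain coherent over ultra-long sequences.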

All music pieces are presented in the order in which they were generated, with no selection, filtering, or reordering applied.


Section 1: Original Model

Dataset: Maestro / Context Length: 32,768 / Segmentation: Traditional / Cross Attention Mask: None




Section 2: Improvement 1 (Effective Segmentation)

Dataset: Maestro / Context Length: 32,768 / Segmentation: Progressive / Cross Attention Mask: None




Section 3: Improvement 1 + Improvement 2 (Multi-Scale)

Dataset: Maestro / Context Length: 32,768 / Segmentation: Progressive / Cross Attention Mask: Multi-Scale
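The multi-scale cross-attention mask used in this configuration can be sketched as follows. This is a minimal illustration, assuming the latent queries see the most recent context densely and the more distant context at progressively coarser strides; the band widths and strides below are assumptions for illustration, not the paper's exact configuration.

    import numpy as np

    def multi_scale_mask(context_len: int,
                         bands=((4_096, 1), (12_288, 4), (16_384, 16))) -> np.ndarray:
        """Boolean visibility mask over context positions, shared by the latent queries.

        Each (width, stride) band is read backwards from the newest context
        token and keeps every `stride`-th position: stride 1 preserves
        short-term expressive detail, while larger strides keep a sparse
        summary of the far past for long-term structure.
        """
        visible = np.zeros(context_len, dtype=bool)
        edge = context_len  # walk backwards from the newest context token
        for width, stride in bands:
            lo = max(edge - width, 0)
            visible[lo:edge:stride] = True
            edge = lo
            if edge == 0:
                break
        return visible  # broadcast over the query axis in cross-attention

    mask = multi_scale_mask(32_768)
    print(mask.sum(), "of", mask.size, "context positions visible")  # 8192 of 32768

With these example bands, only 8,192 of the 32,768 context positions are visible: the nearest 4,096 at full resolution for short-term expressive detail, and the rest at strides of 4 and 16 as a sparse summary of long-range structure.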




Citing This Paper

To cite this paper, please use the following BibTeX entry:

@misc{yi2024perceiversmultiscaleperceivereffective,
      title={PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation},
      author={Yungang Yi and Weihua Li and Matthew Kuo and Quan Bai},
      year={2024},
      eprint={2411.08307},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2411.08307},
}