This page accompanies the last section of Chapter 5, which covers the expressive performance rendering model and its cross-modal extension.

Overview

Model architecture

Training pipeline for the proposed cross-modal performance rendering model.

As before, the performance rendering model R maps the note properties of a given score (timing, duration and velocity) to performance-like note features.

In this setting, both MIDI performances and audio performances serve as real data in the adversarial training. To account for both the symbolic and audio modalities, the rendered (fake) and the real MIDI performances are all synthesized into audio with a smaller DDSP-Piano model, after conversion from the note-wise to the frame-wise encoding of MIDI performances.
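The note-wise to frame-wise conversion mentioned above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the thesis: the function name `notes_to_frames`, the `(onset, duration, pitch, velocity)` tuple layout and the default frame rate are all assumptions made for this example, and the actual conditioning expected by DDSP-Piano may differ.

```python
import numpy as np

def notes_to_frames(notes, frame_rate=250.0, n_pitches=128):
    """Convert note-wise events into a frame-wise velocity roll.

    notes: iterable of (onset_s, duration_s, pitch, velocity) tuples
           (hypothetical layout, durations assumed positive).
    Returns an array of shape (n_frames, n_pitches) that holds the note
    velocity on every frame where the pitch is active, and 0 elsewhere.
    """
    notes = list(notes)
    if not notes:
        return np.zeros((0, n_pitches), dtype=np.int32)

    # The roll ends with the last note release.
    end_time = max(onset + duration for onset, duration, _, _ in notes)
    n_frames = int(np.ceil(end_time * frame_rate))
    roll = np.zeros((n_frames, n_pitches), dtype=np.int32)

    for onset, duration, pitch, velocity in notes:
        start = int(np.floor(onset * frame_rate))
        # Keep at least one active frame per note.
        stop = max(start + 1, int(np.ceil((onset + duration) * frame_rate)))
        roll[start:stop, pitch] = velocity
    return roll


# Two overlapping notes, sampled at 100 frames per second.
example_notes = [(0.0, 0.5, 60, 80), (0.25, 0.5, 64, 100)]
roll = notes_to_frames(example_notes, frame_rate=100.0)
```

In this toy encoding, a later note on the same pitch overwrites an earlier one where they overlap; a real pipeline would also need to carry onset and pedal information, which is left out here.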

Real audio performances, real MIDI performances synthesized into audio, and fake audio performances rendered by the model are all fed into a multi-scale audio discriminator.
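The routing of the three sources to the discriminator can be summarized in a short sketch. It only illustrates the data flow described above: `build_discriminator_inputs`, `toy_render` and `toy_synthesize` are hypothetical stand-ins for R and the DDSP-Piano synthesizer, and the 1/0 labels are an assumed convention for real and rendered data, not the loss actually used for training.

```python
import numpy as np

def build_discriminator_inputs(score, midi_perf, audio_perf, render, synthesize):
    """Return (audio, label) pairs: label 1 for real data, 0 for rendered data."""
    return [
        (audio_perf, 1),                  # real audio performance
        (synthesize(midi_perf), 1),       # real MIDI performance, synthesized
        (synthesize(render(score)), 0),   # rendered performance, synthesized
    ]


# Toy stand-ins, for illustration only.
def toy_render(score):
    # Pretend R rescales the note features of the score.
    return score * 0.5

def toy_synthesize(features):
    # Pretend the synthesizer maps each note feature to 4 audio samples.
    return np.repeat(features, 4)


pairs = build_discriminator_inputs(
    score=np.ones(3),
    midi_perf=np.full(3, 2.0),
    audio_perf=np.zeros(12),
    render=toy_render,
    synthesize=toy_synthesize,
)
```

Because the datasets are unpaired, the score, the MIDI performance and the audio performance in one batch do not need to come from the same piece.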

Training relies on three unpaired datasets:

“Can I Play It?” provides the compositions.
ASAP provides real MIDI performances recorded with Disklaviers.
ATEPP provides real audio performances, gathered from YouTube.

Rendering examples

This section presents some early examples produced by the full rendering pipeline, at different stages:

Deadpan, S(X): the plain MIDI score X, synthesized into audio with DDSP-Piano (denoted S).
Proposed, S(R(X)): the performance output by the rendering model R, synthesized with DDSP-Piano.
Human, S(Y): a real human performance Y recorded in MIDI, synthesized with DDSP-Piano.

J.S. Bach - Fugue BWV 873

Output	Audio sample
Deadpan S(X)
Proposed S(R(X))
Human S(Y)

L. van Beethoven - Sonata N°18, 1st Movement

Output	Audio sample
Deadpan S(X)
Proposed S(R(X))
Human S(Y)

L. van Beethoven - Sonata N°8, 3rd Movement

Output	Audio sample
Deadpan S(X)
Proposed S(R(X))
Human S(Y)

F. Chopin - Etude Op.10 N°2

Output	Audio sample
Deadpan S(X)
Proposed S(R(X))
Human S(Y)

F. Liszt - Paganini Etude N°6

Output	Audio sample
Deadpan S(X)
Proposed S(R(X))
Human S(Y)

F. Schubert - Sonata N°13 - 1st Movement

Output	Audio sample
Deadpan S(X)
Proposed S(R(X))
Human S(Y)