$ cat abstract.md

INSTRUCTFX2FX

A Multi-turn Text-to-Preset Demo for Iterative Audio Effect Refinement

AUTHORS
Song-Ze Yu · Milan Liessens Dujardin · Yuxuan Cai · Wantong Zhang
LAB
UC Berkeley · CNMAT
VENUE
Proc. 29th Int. Conf. Digital Audio Effects (DAFx) · Demo Track · 2026
TAGS
text-guided audio editing · multi-turn FX refinement · LLM · CLAP · audio production

[01] Abstract · Sequential FX Refinement

$ problem

Given an existing FX parameter set P, how can a sequence of natural-language instructions {I₁, I₂, …} make the sound fit the user's need? We name this problem sequential FX refinement.

$ answer

We propose InstructFX2FX as an answer. An LLM plans and orders the FX chain; CLAP-guided optimization then refines the parameters perceptually, turn after turn, while keeping what earlier turns achieved.


[02] Architecture · Three-Layer Harness

SESSION STATE · P · history · persistent · read / write

INPUT · audio · text
L1 · SELECT · LLM picks & orders FX
L2 · ROUTE · parse the update
L3 · REFINE · optimize in CLAP space
OUTPUT · rendered audio

L3 splits by effect type:
3a · GRADIENT DESCENT · differentiable · EQ, reverb
3b · BAYESIAN OPT · comp · dist · delay · pitch · crush
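
The Layer 3 split by effect type can be pictured as a simple lookup. The sketch below is illustrative only: the names `Optimizer`, `pick_optimizer`, and `split_chain`, as well as the lowercase effect keys, are assumptions made for this example and are not taken from the authors' code. Only the grouping itself (EQ and reverb on the gradient-descent path, the remaining effects on the Bayesian-optimization path) comes from the diagram.

```python
from enum import Enum


class Optimizer(Enum):
    GRADIENT_DESCENT = "3a"
    BAYESIAN_OPT = "3b"


# Grouping as listed in the architecture diagram; the key spellings are hypothetical.
GRADIENT_DESCENT_EFFECTS = {"eq", "reverb"}
BAYESIAN_OPT_EFFECTS = {"comp", "dist", "delay", "pitch", "crush"}


def pick_optimizer(effect: str) -> Optimizer:
    """Return the Layer 3 branch assumed to handle the given effect type."""
    name = effect.lower()
    if name in GRADIENT_DESCENT_EFFECTS:
        return Optimizer.GRADIENT_DESCENT
    if name in BAYESIAN_OPT_EFFECTS:
        return Optimizer.BAYESIAN_OPT
    raise ValueError(f"unknown effect type: {effect!r}")


def split_chain(chain):
    """Group an ordered FX chain by optimizer, keeping chain order inside each group."""
    groups = {opt: [] for opt in Optimizer}
    for fx in chain:
        groups[pick_optimizer(fx)].append(fx)
    return groups
```

For example, `split_chain(["eq", "comp", "reverb"])` would place `eq` and `reverb` in the gradient-descent group and `comp` in the Bayesian-optimization group.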

$ routing — every instruction takes one of three modes

  1. Initialize. Build a fresh FX chain when the instruction starts a new direction.

  2. Extend. Reuse the current chain and add new effects on top of it.

  3. Reuse & optimize. Keep the existing effects and re-tune their parameters in place.
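
A minimal sketch of how the three routing modes might update the persistent session state. Everything here is hypothetical: the `Session` fields, the `apply_turn` function, and the mode strings are assumptions for illustration. How the system chooses a mode for an instruction is not modeled, so the mode is passed in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    chain: list = field(default_factory=list)    # ordered effect names
    params: dict = field(default_factory=dict)   # effect name -> parameter dict
    history: list = field(default_factory=list)  # (instruction, mode) per turn


def apply_turn(session, instruction, mode, planned_fx=()):
    """Update the session for one instruction under the given routing mode."""
    if mode == "initialize":
        # New direction: discard the old chain and start from the planned one.
        session.chain = list(planned_fx)
        session.params = {fx: {} for fx in session.chain}
    elif mode == "extend":
        # Keep the current chain and append effects that are not already present.
        for fx in planned_fx:
            if fx not in session.chain:
                session.chain.append(fx)
                session.params[fx] = {}
    elif mode == "reuse":
        # Chain is unchanged; a refinement step would re-tune session.params in place.
        pass
    else:
        raise ValueError(f"unknown routing mode: {mode!r}")
    session.history.append((instruction, mode))
    return session
```

A three-turn session could then read: initialize with `["eq"]`, extend with `["reverb"]`, and reuse on the final turn, which leaves the chain as `["eq", "reverb"]` and records three history entries.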


[03] Listen · Gradient-Descent Sessions · 3 SESSIONS

Real sessions on the gradient-descent path (EQ & reverb). Each turn shows what Layer 1 planned, then plays the result. Click a waveform to seek, toggle dry / result to A/B, or drag the gradient-descent track to hear the optimization converge.

[04] Evaluation · By the Numbers
// TARGET MMD 0.45 → 0.34
// REDUCTION 24%
// PAIRS WON 9/10
// GD ITERS 120

EXP.01 · SEQUENTIAL MMD

[Figure: sequential MMD across directed prompt pairs]
InstructFX2FX reaches a lower target-directed MMD than an LLM-only baseline on 9 of 10 directed descriptor pairs. CLAP-guided refinement cuts MMD from 0.45 to 0.34.
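
For readers unfamiliar with the metric, here is a minimal sketch of a maximum mean discrepancy (MMD) estimate between two sets of embedding vectors. It assumes a Gaussian (RBF) kernel and the biased estimator; this page does not state which kernel or estimator the authors use, so treat `rbf`, `mmd2`, and `sigma` as illustrative assumptions. The helper `relative_reduction` only reproduces the arithmetic behind the reported 24% figure from the two reported values.

```python
import math


def rbf(x, y, sigma=1.0):
    """Gaussian kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Reported values from this page: MMD falls from 0.45 to 0.34.
reduction_percent = round(100 * relative_reduction(0.45, 0.34))
```

Identical sets give an estimate of zero, and sets that sit far apart in embedding space give a larger value, which is why a lower number indicates output closer to the target descriptor.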

EXP.02 · MMD CONVERGENCE TRAJECTORY

[Figure: MMD trajectories, one panel per directed descriptor pair: warm → bright, bright → warm, heavy → calm, heavy → harsh, harsh → soft, harsh → calm, warm → heavy, soft → loud, calm → loud, loud → heavy]
All 10 directed descriptor pairs (the paper has room for only two). MMD vs. gradient-descent iteration for piano and violin, each against its LLM-only baseline. Some pairs drop and stay below the baseline; others overshoot or drift, which is why the demo exposes intermediate checkpoints.

EXP.03 · LLM INITIALIZATION ABLATION

[Figure: CLAP similarity and MMD across optimization iterations]
CLAP similarity rises and MMD falls across iterations. This is the measurement behind the gradient-descent tracks above.

[05] Cite
@misc{yu2026instructfx2fxmultiturntexttopresetdemo,
  title         = {InstructFX2FX: A Multi-turn Text-to-Preset Demo for
                   Iterative Audio Effect Refinement},
  author        = {Song-Ze Yu and Milan Liessens Dujardin and
                   Yuxuan Cai and Wantong Zhang},
  year          = {2026},
  eprint        = {2606.22005},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2606.22005}
}