Diffusion Filtered Exploration via Ensembles

Diffusion Policies in the Age of Exploration 🔭 ⋆˚࿔

Diffusion policies can learn powerful multimodal decision-making models from offline experience, and strategies have been devised for finetuning them on online experience. But because real-world interactions can be expensive, we aim to trade collection quantity for experience quality: by exploring deliberately, the policy can improve its behavior in a sample-efficient way. We focus on three central questions in equipping diffusion policies for principled online exploration:

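Identify: which actions, out of a continuous action space, are worth considering as candidates?

Quantify: how should each candidate be scored for both execution quality and exploration interest?

Collaborate: how can multiple agents communicate to explore as a group?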

We present Diffusion Filtered Exploration via Ensembles (DF-ExpEnse), an exploration technique that improves the quality of online experience collection and thereby the sample efficiency of finetuning. DF-ExpEnse leverages the multimodal modeling capabilities of the diffusion policy to identify a candidate set that is expressive yet tractable to evaluate. It uses an ensemble of critics to score each candidate action and selects the one that best balances execution quality with exploration interest. DF-ExpEnse further enables cross-agent communication so that a group of agents can explore collaboratively.
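As a rough illustration of the scoring idea, the sketch below ranks a handful of candidate actions using the mean and spread of an ensemble's value estimates. The additive form `mean + beta * std`, the `beta` weight, the toy critics, and all function names are assumptions made for illustration; the text above does not specify how quality and uncertainty are combined.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

Action = Sequence[float]
Critic = Callable[[Action], float]  # hypothetical: maps an action to a value estimate


def exploration_interest(action: Action, critics: List[Critic], beta: float = 1.0) -> float:
    """Score an action as ensemble mean (quality) plus beta * ensemble std (uncertainty).

    The additive combination is an assumption, not a detail taken from the source.
    """
    values = [critic(action) for critic in critics]
    return mean(values) + beta * pstdev(values)


def select_action(candidates: List[Action], critics: List[Critic], beta: float = 1.0) -> Action:
    """Return the candidate with the highest exploration interest."""
    return max(candidates, key=lambda a: exploration_interest(a, critics, beta))


# Toy stand-ins for learned critics: they agree at 0 and disagree more as |a| grows.
critics = [lambda a, k=k: -abs(a[0]) + k * a[0] for k in (-0.5, 0.0, 0.5)]
# In practice the candidates would be samples drawn from the diffusion policy.
candidates = [(0.0,), (1.0,), (2.0,)]

greedy = select_action(candidates, critics, beta=0.0)   # quality only
curious = select_action(candidates, critics, beta=5.0)  # uncertainty-heavy
```

With `beta=0.0` the selection reduces to picking the candidate with the highest mean value, while a large `beta` favors the candidate on which the critics disagree most.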

Exploration can come for free! DF-ExpEnse requires no components beyond those commonly found in reinforcement learning finetuning; it simply reuses them during online inference for principled exploration.

DF-ExpEnse Framework

At each timestep, DF-ExpEnse selects an exploratory action to execute in three steps. First, it (a) filters the continuous action space by generating multiple samples from the diffusion policy. Then, it (b) estimates the exploration interest of each sampled action, in terms of quality and uncertainty, using an ensemble. Lastly, it (c) normalizes exploration interest across the fleet and selects the action with the maximum interest to execute.
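Step (c) can be illustrated in isolation with the sketch below. Everything specific in it is an assumption: each candidate is summarized by a hypothetical (quality, uncertainty) pair such as an ensemble mean and standard deviation, and "normalizing across the fleet" is interpreted as z-scoring each quantity over the pooled candidates of all agents before summing them. The description above does not spell out these details, so this is one plausible reading rather than the actual procedure.

```python
from statistics import mean, pstdev
from typing import List, Tuple

# One (quality, uncertainty) pair per candidate action, e.g. ensemble mean and std.
Stats = List[Tuple[float, float]]


def zscore(values: List[float]) -> List[float]:
    """Standardize values; return zeros when they are all identical."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0.0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def fleet_select(fleet_stats: List[Stats]) -> List[int]:
    """Pick one candidate index per agent after fleet-wide normalization.

    Assumption: quality and uncertainty are each z-scored over the candidates
    of all agents pooled together, then summed into a single interest score.
    """
    pooled = [pair for agent in fleet_stats for pair in agent]
    quality = zscore([q for q, _ in pooled])
    uncertainty = zscore([u for _, u in pooled])
    interest = [q + u for q, u in zip(quality, uncertainty)]

    choices, start = [], 0
    for agent in fleet_stats:
        scores = interest[start:start + len(agent)]
        # Each agent executes its own candidate with the maximum interest.
        choices.append(max(range(len(scores)), key=scores.__getitem__))
        start += len(agent)
    return choices


# Two agents, three candidates each (made-up numbers).
fleet_stats = [
    [(1.0, 0.1), (0.8, 0.2), (0.2, 0.3)],
    [(0.5, 0.2), (0.6, 0.1), (0.4, 0.9)],
]
choices = fleet_select(fleet_stats)
```

With these made-up numbers, the first agent keeps its highest-quality candidate, while the second agent picks the candidate whose uncertainty stands out relative to the whole fleet. Pooling before standardizing puts quality and uncertainty on a shared scale across agents, which is one way the fleet could influence each agent's choice.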

Experiments

DF-ExpEnse is a general exploration technique, and can be seamlessly integrated with existing strategies that finetune pretrained diffusion policies via reinforcement learning to provide sample-efficiency benefits. We integrate DF-ExpEnse with input noise and residual finetuning, and evaluate on a variety of manipulation and locomotion tasks across Robomimic, Gym, and DexMimicGen.

Robomimic Tasks

Vanilla DSRL: ~70% success rate at timestep 512,000
Max-Q: ~50% success rate at timestep 512,000
DF-ExpEnse: ~90% success rate at timestep 512,000

Vanilla DSRL: ~60% success rate at timestep 480,000
Max-Q: ~40% success rate at timestep 480,000
DF-ExpEnse: ~90% success rate at timestep 480,000

Vanilla DSRL: ~100% success rate at timestep 480,000
Max-Q: ~100% success rate at timestep 480,000
DF-ExpEnse: ~100% success rate at timestep 480,000

Robomimic Tool Hang Performance Comparison

Vanilla DSRL: ~45% success rate at timestep 2,048,000
Max-Q: ~45% success rate at timestep 2,048,000
DF-ExpEnse: ~60% success rate at timestep 2,048,000

DexMimicGen Tasks

Vanilla ResFiT: ~40% success rate at timestep 50,000
Max-Q: ~50% success rate at timestep 50,000
DF-ExpEnse: ~80% success rate at timestep 50,000

DexMimicGen Coffee Performance Comparison

Vanilla ResFiT: ~10% success rate at timestep 50,000
Max-Q: ~40% success rate at timestep 50,000
DF-ExpEnse: ~60% success rate at timestep 50,000

Vanilla ResFiT: ~60% success rate at timestep 50,000
Max-Q: ~80% success rate at timestep 50,000
DF-ExpEnse: ~90% success rate at timestep 50,000

Fleet Size Ablations

Intuitively, larger fleets may offer more opportunities for normalization and collaboration. We find that performance does decrease for fleet sizes below 4, confirming that DF-ExpEnse can leverage larger fleets to improve sample efficiency. Nevertheless, DF-ExpEnse still reliably outperforms vanilla DSRL and Max-Q across all fleet sizes, large and small.

These findings further reinforce DF-ExpEnse as a robust method that can be integrated with standard reinforcement learning finetuning techniques to provide consistent sample efficiency benefits across a variety of available resource settings.

BibTeX

@inproceedings{
      luo2026dfexpense,
      title={{DF}-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning},
      author={Calvin Luo and Chen Sun and Shuran Song},
      booktitle={Forty-third International Conference on Machine Learning},
      year={2026}
    }
