Multi-Dataset Training
This tutorial shows how to train a model on multiple datasets simultaneously.
Note: Make sure you understand how to launch training runs (covered in the earlier tutorials) before trying to train on multiple datasets.
1. Overview
Robomimic supports training on multiple datasets simultaneously. This is useful when you want to:
- Train a single model on multiple tasks
- Combine datasets with different qualities (e.g., expert and suboptimal demonstrations)
- Balance data from different sources
Each dataset can be given its own sampling weight, and you can control whether these weights are normalized by dataset size.
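Conceptually, multi-dataset training draws batches by weighted sampling over the concatenation of all datasets. The snippet below is a minimal sketch of that idea using PyTorch's WeightedRandomSampler with illustrative stand-in tensors; robomimic handles all of this internally once you set config.train.data as described in the next section.

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# stand-in datasets: a large one (1000 samples) and a small one (100 samples)
ds_a = TensorDataset(torch.zeros(1000, 3))
ds_b = TensorDataset(torch.ones(100, 3))
combined = ConcatDataset([ds_a, ds_b])

# one weight per sample: all samples in a dataset share that dataset's weight
weights = torch.cat([
    torch.full((len(ds_a),), 1.0),
    torch.full((len(ds_b),), 2.0),  # samples from ds_b are drawn twice as often
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=100, sampler=sampler)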
2. Configuring Multi-Dataset Training
To train on multiple datasets, you need to specify a list of dataset configurations in your config file. Each dataset configuration is a dictionary with the following keys:
config.train.data = [
    {
        "path": "/path/to/dataset1.hdf5",  # (required) path to the hdf5 file
        "demo_limit": 100,                 # (optional) limit number of demos to use
        "weight": 1.0,                     # (optional) weight for sampling, defaults to 1.0
        "eval": True,                      # (optional) whether to evaluate on this dataset's env
        "lang": "make coffee",             # (optional) language instruction for the dataset
        "key": "coffee",                   # (optional) key for naming eval videos
    },
    {
        "path": "/path/to/dataset2.hdf5",
        "weight": 2.0,  # this dataset will be sampled twice as often
    },
]
Additionally, you can control how the weights are used with the normalize_weights_by_ds_size setting:
config.train.normalize_weights_by_ds_size = False # default
3. Understanding Weight Normalization
The normalize_weights_by_ds_size setting controls how dataset weights affect sampling:

When False (default):
- Raw weights are used directly
- Larger datasets will naturally be sampled more often when assigned the same weight
- Example: if dataset A has 1000 samples and dataset B has 100 samples, with equal weights (1.0), you’ll see roughly 10 samples from A for every 1 sample from B

When True:
- Weights are normalized by dataset size
- Equal weights result in balanced sampling regardless of dataset size
- Example: if dataset A has 1000 samples and dataset B has 100 samples, with equal weights (1.0), you’ll see roughly equal numbers of samples from both datasets
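To make the arithmetic concrete, here is a toy computation (not robomimic's internal code) of each dataset's total sampling weight under both settings, using the example sizes above:

sizes = {"A": 1000, "B": 100}
raw_weights = {"A": 1.0, "B": 1.0}

# normalize_weights_by_ds_size = False: every sample keeps its raw weight, so
# dataset A's total weight is 1000 * 1.0 = 1000 vs. B's 100 * 1.0 = 100,
# i.e. roughly 10 samples from A for every sample from B
per_sample_raw = dict(raw_weights)

# normalize_weights_by_ds_size = True: each weight is divided by dataset size,
# so both datasets end up with equal total weight and balanced sampling
per_sample_norm = {name: w / sizes[name] for name, w in raw_weights.items()}

for name in sizes:
    print(name, per_sample_raw[name] * sizes[name], per_sample_norm[name] * sizes[name])
# prints: A 1000.0 1.0, then B 100.0 1.0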
4. Example Configuration
Here’s a complete example showing how to train a BC model on two datasets with different weights:
from robomimic.config import config_factory

# create BC config
config = config_factory(algo_name="bc")

# configure datasets
config.train.data = [
    {
        "path": "expert_demos.hdf5",
        "weight": 2.0,  # sample expert demos more frequently
    },
    {
        "path": "suboptimal_demos.hdf5",
        "weight": 1.0,
    },
]

# normalize weights by dataset size for balanced sampling
config.train.normalize_weights_by_ds_size = True

# other training settings...
config.train.batch_size = 100
config.train.num_epochs = 1000
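With the config assembled, one way to launch training is through robomimic's train script. This is a sketch assuming a standard robomimic install, where robomimic/scripts/train.py exposes a train(config, device) entry point; alternatively, dump the config to a json file and run the script from the command line with its --config flag.

import torch
from robomimic.scripts.train import train

# pick a device and hand the config to the training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train(config, device=device)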
5. Best Practices
Weight Selection:
- Use higher weights for higher-quality data
- Consider using normalize_weights_by_ds_size=True when datasets have very different sizes
- Start with equal weights and adjust based on performance

Dataset Compatibility:
- Ensure all datasets have compatible observation and action spaces (a rough check is sketched after this list)
- Use consistent preprocessing across datasets

Evaluation:
- Use the eval flag to control which environments to evaluate on
- Set descriptive key values for clear video naming
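As a rough version of the compatibility check mentioned above (a sketch assuming the standard robomimic hdf5 layout, with an obs group and an actions dataset under each data/<demo>), you could compare observation keys and action dimensions across files before training:

import h5py

def check_dataset_compatibility(paths):
    # compare observation keys and action dimensionality across hdf5 files,
    # using the first demo in each file as a representative sample
    ref_obs_keys, ref_action_dim = None, None
    for path in paths:
        with h5py.File(path, "r") as f:
            demo = sorted(f["data"].keys())[0]
            obs_keys = set(f["data"][demo]["obs"].keys())
            action_dim = f["data"][demo]["actions"].shape[-1]
        if ref_obs_keys is None:
            ref_obs_keys, ref_action_dim = obs_keys, action_dim
        elif obs_keys != ref_obs_keys or action_dim != ref_action_dim:
            raise ValueError(f"{path} has mismatched obs keys or action dim")

check_dataset_compatibility(["expert_demos.hdf5", "suboptimal_demos.hdf5"])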