Multi-Dataset Training#

This tutorial shows how to train a model on multiple datasets simultaneously.

Note: Understand how to launch training runs first!

Before training on multiple datasets, make sure you are comfortable with configuring and launching single-dataset training runs.

1. Overview#

Robomimic supports training on multiple datasets simultaneously. This is useful when you want to:

  • Train a single model on multiple tasks

  • Combine datasets with different qualities (e.g., expert and suboptimal demonstrations)

  • Balance data from different sources

Each dataset can be assigned its own sampling weight, and you can control whether those weights are normalized by dataset size.

2. Configuring Multi-Dataset Training#

To train on multiple datasets, you need to specify a list of dataset configurations in your config file. Each dataset configuration is a dictionary with the following keys:

config.train.data = [
    {
        "path": "/path/to/dataset1.hdf5",      # (required) path to the hdf5 file
        "demo_limit": 100,                     # (optional) limit number of demos to use
        "weight": 1.0,                         # (optional) weight for sampling, defaults to 1.0
        "eval": True,                          # (optional) whether to evaluate on this dataset's env
        "lang": "make coffee",                 # (optional) language instruction for the dataset
        "key": "coffee"                        # (optional) key for naming eval videos
    },
    {
        "path": "/path/to/dataset2.hdf5",
        "weight": 2.0,                         # this dataset will be sampled twice as often
    }
]

Additionally, you can control how the weights are used with the normalize_weights_by_ds_size setting:

config.train.normalize_weights_by_ds_size = False  # default

3. Understanding Weight Normalization#

The normalize_weights_by_ds_size setting controls how dataset weights affect sampling (a short sketch after this list makes the arithmetic concrete):

  • When False (default):

    • Raw weights are used directly

    • Larger datasets will naturally be sampled more often when assigned the same weight

    • Example: If dataset A has 1000 samples and dataset B has 100 samples, with equal weights (1.0), you’ll see roughly 10 samples from A for every 1 sample from B

  • When True:

    • Weights are normalized by dataset size

    • Equal weights result in balanced sampling regardless of dataset size

    • Example: If dataset A has 1000 samples and dataset B has 100 samples, with equal weights (1.0), you’ll see roughly equal numbers of samples from both datasets
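To make the arithmetic concrete, here is a minimal sketch of how the two modes turn dataset weights into batch proportions. It is illustrative only, not robomimic's internal sampler code:

def dataset_sampling_probs(sizes, weights, normalize_by_size):
    # returns the fraction of each batch drawn from each dataset
    if normalize_by_size:
        # per-sample weight is weight / size, so each dataset's total
        # sampling mass equals its weight regardless of its size
        totals = list(weights)
    else:
        # per-sample weight is the raw weight, so larger datasets
        # accumulate proportionally more sampling mass
        totals = [w * n for w, n in zip(weights, sizes)]
    z = sum(totals)
    return [t / z for t in totals]

# dataset A: 1000 samples, dataset B: 100 samples, equal weights
print(dataset_sampling_probs([1000, 100], [1.0, 1.0], normalize_by_size=False))
# -> [0.909..., 0.0909...]   (~10 samples from A per sample from B)
print(dataset_sampling_probs([1000, 100], [1.0, 1.0], normalize_by_size=True))
# -> [0.5, 0.5]               (balanced sampling)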

4. Example Configuration#

Here’s a complete example showing how to train a BC model on two datasets with different weights:

from robomimic.config import config_factory

# create BC config
config = config_factory(algo_name="bc")

# configure datasets
config.train.data = [
    {
        "path": "expert_demos.hdf5",
        "weight": 2.0,                    # sample expert demos more frequently
    },
    {
        "path": "suboptimal_demos.hdf5",
        "weight": 1.0,
    }
]

# normalize weights by dataset size so the 2:1 sampling ratio
# holds regardless of how many demos each file contains
config.train.normalize_weights_by_ds_size = True

# other training settings...
config.train.batch_size = 100
config.train.num_epochs = 1000
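
With the config assembled, you can launch training programmatically, following the pattern used in robomimic's example scripts (this assumes the robomimic.scripts.train.train entry point and TorchUtils.get_torch_device helper; check against your installed version):

import robomimic.utils.torch_utils as TorchUtils
from robomimic.scripts.train import train

# pick a GPU if one is available, otherwise fall back to CPU
device = TorchUtils.get_torch_device(try_to_use_cuda=True)

# kick off the training run with the multi-dataset config built above
train(config, device=device)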

5. Best Practices#

  1. Weight Selection:

    • Use higher weights for higher-quality data

    • Consider using normalize_weights_by_ds_size=True when datasets have very different sizes

    • Start with equal weights and adjust based on performance

  2. Dataset Compatibility:

    • Ensure all datasets have compatible observation and action spaces

    • Use consistent preprocessing across datasets

  3. Evaluation:

    • Use the eval flag to control which environments to evaluate on

    • Set descriptive key values for clear video naming (see the snippet below)
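
For example, the configuration below (with hypothetical dataset paths) rolls out only in the first dataset's environment and tags its evaluation videos with a descriptive key:

config.train.data = [
    {
        "path": "coffee_demos.hdf5",        # hypothetical dataset path
        "eval": True,                       # roll out in this dataset's environment
        "key": "coffee",                    # eval videos are named with this key
    },
    {
        "path": "pretraining_demos.hdf5",   # hypothetical dataset path
        "eval": False,                      # contributes training data only
    },
]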