# Multi-Dataset Training
This tutorial shows how to train a model on multiple datasets simultaneously.
Note: understand how to launch training runs first!
Before training on multiple datasets, it is useful to read the following tutorials:
- [how to launch training runs](./configs.html)
- [how to view training results](./viewing_results.html)
#### 1. Overview
Robomimic supports training on multiple datasets simultaneously. This is useful when you want to:
- Train a single model on multiple tasks
- Combine datasets with different qualities (e.g., expert and suboptimal demonstrations)
- Balance data from different sources
Each dataset can have its own weight for sampling, and you can control whether these weights are normalized by dataset size or not.
#### 2. Configuring Multi-Dataset Training
To train on multiple datasets, you need to specify a list of dataset configurations in your config file. Each dataset configuration is a dictionary with the following keys:
```python
config.train.data = [
    {
        "path": "/path/to/dataset1.hdf5",  # (required) path to the hdf5 file
        "demo_limit": 100,                 # (optional) limit number of demos to use
        "weight": 1.0,                     # (optional) weight for sampling, defaults to 1.0
        "eval": True,                      # (optional) whether to evaluate on this dataset's env
        "lang": "make coffee",             # (optional) language instruction for the dataset
        "key": "coffee",                   # (optional) key for naming eval videos
    },
    {
        "path": "/path/to/dataset2.hdf5",
        "weight": 2.0,                     # this dataset will be sampled twice as often
    },
]
```
Additionally, you can control how the weights are used with the `normalize_weights_by_ds_size` setting:
```python
config.train.normalize_weights_by_ds_size = False # default
```
#### 3. Understanding Weight Normalization
The `normalize_weights_by_ds_size` setting controls how dataset weights affect sampling (see the sketch after this list for a concrete illustration):
- When `False` (default):
- Raw weights are used directly
- Larger datasets will naturally be sampled more often when assigned the same weight
- Example: If dataset A has 1000 samples and dataset B has 100 samples, with equal weights (1.0), you'll see roughly 10 samples from A for every 1 sample from B
- When `True`:
- Weights are normalized by dataset size
- Equal weights result in balanced sampling regardless of dataset size
- Example: If dataset A has 1000 samples and dataset B has 100 samples, with equal weights (1.0), you'll see roughly equal numbers of samples from both datasets
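To make the mechanics concrete, here is a minimal sketch (not robomimic's internal implementation) that reproduces both behaviors with PyTorch's `WeightedRandomSampler`, using the hypothetical dataset sizes from the examples above:
```python
from torch.utils.data import WeightedRandomSampler

# hypothetical sizes/weights: dataset A (1000 samples), dataset B (100 samples)
ds_sizes = [1000, 100]
ds_weights = [1.0, 1.0]
normalize_weights_by_ds_size = True

# build one sampling weight per individual sample
per_sample_weights = []
for size, weight in zip(ds_sizes, ds_weights):
    # normalized: each dataset's *total* weight is `weight`, so equal weights
    # give balanced sampling; unnormalized: every sample carries the raw
    # `weight`, so the 1000-sample dataset is drawn ~10x as often
    w = weight / size if normalize_weights_by_ds_size else weight
    per_sample_weights.extend([w] * size)

sampler = WeightedRandomSampler(
    weights=per_sample_weights,
    num_samples=len(per_sample_weights),
    replacement=True,
)
```
With `normalize_weights_by_ds_size = True`, each dataset's total weight equals its configured `weight`, so the sampler draws roughly equally from A and B; with `False`, A is drawn about 10 times as often as B.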
#### 4. Example Configuration
Here's a complete example showing how to train a BC model on two datasets with different weights:
```python
import robomimic
from robomimic.config import config_factory

# create BC config
config = config_factory(algo_name="bc")

# configure datasets
config.train.data = [
    {
        "path": "expert_demos.hdf5",
        "weight": 2.0,  # sample expert demos more frequently
    },
    {
        "path": "suboptimal_demos.hdf5",
        "weight": 1.0,
    },
]

# normalize weights by dataset size for balanced sampling
config.train.normalize_weights_by_ds_size = True

# other training settings...
config.train.batch_size = 100
config.train.num_epochs = 1000
```
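From here the run can be launched like any other robomimic training run (see the [launch training runs](./configs.html) tutorial). One common pattern, sketched below under the assumption that the standard `train` entry point is used, is to hand the config object to `robomimic.scripts.train.train`:
```python
import robomimic.utils.torch_utils as TorchUtils
from robomimic.scripts.train import train

# pick a device (GPU if available) and launch the training run
device = TorchUtils.get_torch_device(try_to_use_cuda=True)
train(config, device=device)
```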
#### 5. Best Practices
1. **Weight Selection**:
- Use higher weights for higher-quality data
- Consider using `normalize_weights_by_ds_size=True` when datasets have very different sizes
- Start with equal weights and adjust based on performance
2. **Dataset Compatibility**:
- Ensure all datasets have compatible observation and action spaces
- Use consistent preprocessing across datasets
3. **Evaluation**:
- Use the `eval` flag to control which environments to evaluate on
- Set descriptive `key` values for clear video naming, as shown in the example below
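As an illustration of the last two points, the snippet below (with hypothetical dataset paths) evaluates only in the expert dataset's environment and names the rollout videos accordingly:
```python
config.train.data = [
    {
        "path": "expert_demos.hdf5",
        "eval": True,     # roll out in this dataset's environment
        "key": "expert",  # eval videos for this env are named with this key
    },
    {
        "path": "suboptimal_demos.hdf5",
        "eval": False,    # skip rollouts for this dataset's environment
    },
]
```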