Pre-trained Visual Representations#

Robomimic supports multiple pre-trained visual representations and makes it straightforward to adapt observation encoders to use a desired pre-trained visual encoder.

Terminology#

First, let’s clarify the terminology used when working with different pre-trained visual representations:

  • Backbone Classes refer to the various pre-trained visual encoders. For instance, R3MConv and MVPConv are the backbone classes for using R3M and MVP pre-trained representations, respectively.

  • Model Classes pertain to the different sizes of the pre-trained models within each backbone class. For example, R3MConv has three model classes (resnet18, resnet34, and resnet50), while MVPConv features five model classes (vits-mae-hoi, vits-mae-in, vits-sup-in, vitb-mae-egosoup, and vitl-256-mae-egosoup).

Examples#

Using pre-trained visual representations is simple. Each pre-trained encoder is defined by its backbone_class, its model_class, and whether the representations are frozen or finetuned during training. Note that you may need to install the original library of each pre-trained representation; refer to that library’s documentation for installation instructions.

If you are specifying your config with code (as in examples/train_bc_rnn.py), the following are example code blocks for using pre-trained representations:

# R3M
config.observation.encoder.rgb.core_kwargs.backbone_class = 'R3MConv'                         # R3M backbone for image observations (unused if no image observations)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.r3m_model_class = 'resnet18'       # R3M model class (resnet18, resnet34, resnet50)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.freeze = True                      # whether to freeze network during training or allow finetuning
config.observation.encoder.rgb.core_kwargs.pool_class = None                                  # no pooling class for pre-trained model

# MVP (alternative to the R3M block above; configure one backbone class per encoder)
config.observation.encoder.rgb.core_kwargs.backbone_class = 'MVPConv'                                   # MVP backbone for image observations (unused if no image observations)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.mvp_model_class = 'vitb-mae-egosoup'         # MVP model class (vits-mae-hoi, vits-mae-in, vits-sup-in, vitb-mae-egosoup, vitl-256-mae-egosoup)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.freeze = True                      # whether to freeze network during training or allow finetuning
config.observation.encoder.rgb.core_kwargs.pool_class = None                                  # no pooling class for pre-trained model

# Set data loader attributes for image observations
config.train.num_data_workers = 2                           # 2 data workers for image datasets
config.train.hdf5_cache_mode = "low_dim"                    # only cache non-image data

# Ensure that you are using image observation modalities (names may depend on your dataset's naming convention)
config.observation.modalities.obs.rgb = [
    "agentview_image",
    "robot0_eye_in_hand_image",
]
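For context, here is a minimal sketch of how these overrides fit into a complete training script, modeled on examples/train_bc_rnn.py. The experiment name and dataset path below are placeholders, and the sketch assumes robomimic's config_factory and train entry points; adjust both to your setup.

# Minimal sketch modeled on examples/train_bc_rnn.py; experiment name and dataset path are placeholders
import robomimic.utils.torch_utils as TorchUtils
from robomimic.config import config_factory
from robomimic.scripts.train import train

config = config_factory(algo_name="bc")

config.experiment.name = "bc_rnn_r3m_example"   # placeholder experiment name
config.train.data = "/path/to/demo.hdf5"        # placeholder dataset path

# pre-trained R3M encoder (same keys as above)
config.observation.encoder.rgb.core_kwargs.backbone_class = 'R3MConv'
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.r3m_model_class = 'resnet18'
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.freeze = True
config.observation.encoder.rgb.core_kwargs.pool_class = None

# data loader and modality settings (same keys as above)
config.train.num_data_workers = 2
config.train.hdf5_cache_mode = "low_dim"
config.observation.modalities.obs.rgb = [
    "agentview_image",
    "robot0_eye_in_hand_image",
]

device = TorchUtils.get_torch_device(try_to_use_cuda=True)
train(config, device=device)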

Alternatively, if you are using a config json, you can set the appropriate keys in your json.
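For example, assuming the json mirrors the nested structure of the Python config above (the keys below are the same ones set in the R3M example), the relevant portion of the json might look like:

{
    "observation": {
        "encoder": {
            "rgb": {
                "core_kwargs": {
                    "backbone_class": "R3MConv",
                    "backbone_kwargs": {
                        "r3m_model_class": "resnet18",
                        "freeze": true
                    },
                    "pool_class": null
                }
            }
        }
    }
}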