# Pre-trained Visual Representations
Robomimic supports multiple pre-trained visual representations and offers integration for adapting observation encoders to a desired pre-trained visual encoder.
## Terminology
First, let’s clarify the semantic distinctions when using different pre-trained visual representations:
- **Backbone Classes** refer to the various pre-trained visual encoders. For instance, `R3MConv` and `MVPConv` are the backbone classes for using R3M and MVP pre-trained representations, respectively.
- **Model Classes** pertain to the different sizes of the pre-trained models within each selected backbone class. For example, `R3MConv` has three model classes (`resnet18`, `resnet34`, and `resnet50`), while `MVPConv` features five model classes (`vits-mae-hoi`, `vits-mae-in`, `vits-sup-in`, `vitb-mae-egosoup`, and `vitl-256-mae-egosoup`).
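To make these terms concrete, the sketch below instantiates a backbone class directly. The import path and constructor argument names are assumptions inferred from the config keys in the Examples section, so verify them against your robomimic version, and note that the R3M library itself must be installed first:

```python
from robomimic.models.base_nets import R3MConv  # import path assumed

# A backbone class, a model class, and a freeze flag together specify the
# pre-trained encoder (argument names assumed from the config keys below).
r3m_encoder = R3MConv(
    input_channel=3,             # RGB input
    r3m_model_class="resnet18",  # model class: resnet18, resnet34, or resnet50
    freeze=True,                 # keep the pre-trained weights fixed
)
```

In normal usage you will not construct the backbone yourself; robomimic builds it from the config keys shown in the Examples section below.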
## Examples
Using pre-trained visual representations is simple. Each pre-trained encoder is defined by its `backbone_class`, its `model_class`, and whether to `freeze` representations or finetune them. Please note that you may need to refer to the original library of the pre-trained representation for installation instructions.
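To make the `freeze` option concrete, here is a minimal, robomimic-independent sketch of what freezing a backbone typically means in PyTorch. The `torchvision` ResNet is purely a stand-in for the actual R3M/MVP encoder:

```python
import torchvision

# Stand-in encoder used only for illustration (not the R3M/MVP backbone).
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Freezing disables gradient updates on the backbone's parameters, so the
# pre-trained representation acts as a fixed feature extractor; leaving
# requires_grad enabled instead corresponds to finetuning.
for param in backbone.parameters():
    param.requires_grad = False
```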
If you are specifying your config in code (as in `examples/train_bc_rnn.py`), the following example code blocks show how to use pre-trained representations:
```python
# R3M
config.observation.encoder.rgb.core_kwargs.backbone_class = 'R3MConv'  # R3M backbone for image observations (unused if no image observations)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.r3m_model_class = 'resnet18'  # R3M model class (resnet18, resnet34, resnet50)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.freeze = True  # whether to freeze network during training or allow finetuning
config.observation.encoder.rgb.core_kwargs.pool_class = None  # no pooling class for pre-trained model
```

```python
# MVP
config.observation.encoder.rgb.core_kwargs.backbone_class = 'MVPConv'  # MVP backbone for image observations (unused if no image observations)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.mvp_model_class = 'vitb-mae-egosoup'  # MVP model class (vits-mae-hoi, vits-mae-in, vits-sup-in, vitb-mae-egosoup, vitl-256-mae-egosoup)
config.observation.encoder.rgb.core_kwargs.backbone_kwargs.freeze = True  # whether to freeze network during training or allow finetuning
config.observation.encoder.rgb.core_kwargs.pool_class = None  # no pooling class for pre-trained model
```
```python
# Set data loader attributes for image observations
config.train.num_data_workers = 2  # 2 data workers for image datasets
config.train.hdf5_cache_mode = "low_dim"  # only cache non-image data

# Ensure that you are using image observation modalities
# (names may depend on your dataset's naming convention)
config.observation.modalities.obs.rgb = [
    "agentview_image",
    "robot0_eye_in_hand_image",
]
```
Alternatively, if you are using a config json, you can set the appropriate keys in your json.
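As a point of reference, a hypothetical fragment of such a json is sketched below. The nesting mirrors the attribute paths from the Python examples above (e.g., `observation.encoder.rgb.core_kwargs`), but check a config generated by robomimic for the exact structure:

```json
{
    "observation": {
        "encoder": {
            "rgb": {
                "core_kwargs": {
                    "backbone_class": "R3MConv",
                    "backbone_kwargs": {
                        "r3m_model_class": "resnet18",
                        "freeze": true
                    },
                    "pool_class": null
                }
            }
        }
    }
}
```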