# Models
`Randomizer` modules are intended to be used alongside an `ObservationEncoder` --- see the next section for more details. Additional randomizer classes can be implemented by subclassing the `Randomizer` class and implementing the necessary abstract functions.
## Observation Encoder and Decoder
`ObservationEncoder` and `ObservationDecoder` are basic building blocks for dealing with observation dictionary inputs and outputs. They are designed to take in multiple streams of observation modalities as input (e.g. a dictionary containing images and robot proprioception signals), and output a dictionary of predictions like actions and subgoals. Below is an example of how to manually create an `ObservationEncoder` instance by registering observation modalities with the `register_obs_key` function.
```python
import torch
from collections import OrderedDict

from robomimic.models.base_nets import MLP, CropRandomizer
from robomimic.models.obs_nets import ObservationEncoder, ObservationDecoder

obs_encoder = ObservationEncoder(feature_activation=torch.nn.ReLU)

# There are two ways to construct the network for processing an input modality.

# Assume we are processing image input of shape (3, 224, 224).
camera1_shape = [3, 224, 224]

# Use a CropRandomizer to augment the image input with random crops.
image_randomizer = CropRandomizer(input_shape=camera1_shape, crop_height=200, crop_width=200)

# We will use a reconfigurable image processing backbone VisualCore to process the input image modality
net_class = "VisualCore"  # this is defined in models/base_nets.py

# kwargs for VisualCore network
net_kwargs = {
    "input_shape": camera1_shape,
    "core_class": "ResNet18Conv",  # use ResNet18 as the visualcore backbone
    "core_kwargs": {"pretrained": False, "input_coord_conv": False},
    "pool_class": "SpatialSoftmax",  # use spatial softmax to regularize the model output
    "pool_kwargs": {"num_kp": 32}
}

# register the network for processing the modality
obs_encoder.register_obs_key(
    name="camera1",
    shape=camera1_shape,
    net_class=net_class,
    net_kwargs=net_kwargs,
    randomizer=image_randomizer
)

# We can also mix in low-dimensional observations, e.g., proprioception signals.
proprio_shape = [12]
net = MLP(input_dim=12, output_dim=32, layer_dims=(128,), output_activation=None)
obs_encoder.register_obs_key(
    name="proprio",
    shape=proprio_shape,
    net=net
)

# finalize the encoder networks after all keys are registered
obs_encoder.make()
```
By default, each modality network reduces its input observation stream to a fixed-size vector, and the `forward` function of the `ObservationEncoder` simply concatenates these vectors. The order of concatenation is deterministic and matches the order in which the modalities were registered. `ObservationGroupEncoder` further supports encoding nested groups of observations, e.g., `obs`, `goal`, and `subgoal`. This allows constructing goal-conditioned and/or subgoal-conditioned policy models.
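For illustration, a forward pass through the encoder constructed above might look like the following minimal sketch (the random tensors stand in for a batch of real observations):
```python
# A minimal sketch of a forward pass through the encoder constructed above.
inputs = {
    "camera1": torch.randn(1, 3, 224, 224),  # batch of 1 image observation
    "proprio": torch.randn(1, 12),           # batch of 1 proprioception vector
}
feature = obs_encoder(inputs)
# a single flat vector: the image and proprioception features concatenated
# in registration order; its length equals obs_encoder.output_shape()[0]
print(feature.shape)
```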
However, it can be tedious to enumerate all modalities when creating a policy model. The standard entry point for creating an `ObservationEncoder` is the `obs_encoder_factory` function in `robomimic.models.obs_nets`, which enumerates all observation modalities and initializes each modality network according to the `config.observation` section of the config.
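As a rough sketch of using the factory (this assumes robomimic's observation utilities have already been initialized so that each key is mapped to a modality, and relies on default encoder settings):
```python
from collections import OrderedDict
from robomimic.models.obs_nets import obs_encoder_factory

# A hedged sketch: obs_shapes maps observation keys to their shapes. An
# encoder_kwargs argument (omitted here, so defaults apply) would mirror the
# config.observation.encoder section of a robomimic config.
obs_shapes = OrderedDict(camera1=[3, 224, 224], proprio=[12])
encoder = obs_encoder_factory(obs_shapes=obs_shapes)
```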
The `ObservationDecoder` class is relatively straightforward. It's simply a single-input, multi-output-head MLP. For example, the following snippet creates an `ObservationDecoder` that takes the output of the observation encoder as input and outputs two action predictions.
```python
obs_decoder = ObservationDecoder(
    input_feat_dim=obs_encoder.output_shape()[0],
    decode_shapes=OrderedDict({"action_pos": (3,), "action_orn": (4,)})
)
```
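Continuing the sketch above, the decoder maps the encoder's flat feature vector to a dictionary of named outputs:
```python
# Continuing the sketch above: decode the encoder features into the two heads.
feature = obs_encoder(inputs)       # flat feature vector from the encoder
outputs = obs_decoder(feature)
print(outputs["action_pos"].shape)  # torch.Size([1, 3])
print(outputs["action_orn"].shape)  # torch.Size([1, 4])
```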
See `examples/simple_obs_nets.py` for the complete example that shows additional functionalities such as weight sharing among modality networks.
## Multi-Input, Multi-Output (MIMO) Modules
`MIMO_MLP` and `RNN_MIMO_MLP` are the highest-level wrappers that use `ObservationGroupEncoder` and `ObservationDecoder` to create multi-input, multi-output network architectures. `MIMO_MLP` optionally adds MLP layers between the encoded feature vector and the decoder.
`RNN_MIMO_MLP` encodes each observation in an observation sequence using `ObservationGroupEncoder` and digests the feature sequence using RNN variants such as LSTM and GRU networks.
`MIMO_MLP` and `RNN_MIMO_MLP` serve as **the backbone for all policy and value network models** --- these models simply subclass `MIMO_MLP` or `RNN_MIMO_MLP` and add model-specific input and output shapes.
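As a hedged construction sketch (the argument names follow `robomimic.models.obs_nets.MIMO_MLP`, but consult the source for the exact signature):
```python
from collections import OrderedDict
from robomimic.models.obs_nets import MIMO_MLP

# A hedged sketch: one observation group ("obs") and one goal group ("goal"),
# each a mapping from observation keys to shapes, decoded into a single
# "action" output head. Layer sizes here are illustrative.
net = MIMO_MLP(
    input_obs_group_shapes=OrderedDict(
        obs=OrderedDict(proprio=(12,)),
        goal=OrderedDict(proprio=(12,)),
    ),
    output_shapes=OrderedDict(action=(7,)),
    layer_dims=(256, 256),
)
```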
## Implemented Policy Networks
These networks take an observation dictionary as input (and possibly additional conditioning, such as subgoal or goal dictionaries) and produce action predictions, samples, or distributions as outputs. Note that actions are assumed to lie in `[-1, 1]`, and most networks have a final `tanh` activation to help ensure that outputs lie within this range. See `robomimic/models/policy_nets.py` for complete implementations, and the construction sketch after the list below.
### ActorNetwork
- A basic policy network that predicts actions from observations. Can optionally be goal conditioned on future observations.
### PerturbationActorNetwork
- An action perturbation network - primarily used in BCQ. It takes states and actions and returns action perturbations.
### GaussianActorNetwork
- Variant of actor network that outputs a diagonal unimodal Gaussian distribution as action predictions.
### GMMActorNetwork
- Variant of actor network that outputs a multimodal Gaussian mixture distribution as action predictions.
### RNNActorNetwork
- An RNN policy network that predicts actions from a sequence of observations.
### RNNGMMActorNetwork
- An RNN policy network that outputs a multimodal Gaussian mixture distribution over actions from a sequence of observations.
### VAEActor
- A VAE that models a distribution of actions conditioned on observations. The VAE prior and decoder are used at test-time as the policy.
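As a hedged sketch of constructing the basic `ActorNetwork` (argument names are based on `robomimic/models/policy_nets.py`; consult the source for the exact signature):
```python
from collections import OrderedDict
from robomimic.models.policy_nets import ActorNetwork

# A hedged sketch: a simple actor mapping proprioception to a 7-dim action.
policy = ActorNetwork(
    obs_shapes=OrderedDict(proprio=(12,)),  # observation key -> shape
    ac_dim=7,                               # action dimension
    mlp_layer_dims=(256, 256),              # illustrative hidden layer sizes
)
```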
## Implemented Value Networks
These networks take an observation dictionary as input (and possibly additional conditioning, such as subgoal or goal dictionaries) and produce value or action-value estimates or distributions. See `robomimic/models/value_nets.py` for complete implementations, and the construction sketch after the list below.
### ValueNetwork
- A basic value network that predicts values from observations. Can optionally be goal conditioned on future observations.
### DistributionalActionValueNetwork
- Distributional Q (action-value) network that outputs a categorical distribution over a discrete grid of value atoms. See the [paper](https://arxiv.org/pdf/1707.06887.pdf) for more details.
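As a hedged sketch of constructing the basic `ValueNetwork` (argument names are based on `robomimic/models/value_nets.py`; consult the source for the exact signature):
```python
from collections import OrderedDict
from robomimic.models.value_nets import ValueNetwork

# A hedged sketch: a value function over proprioception observations.
value_net = ValueNetwork(
    obs_shapes=OrderedDict(proprio=(12,)),  # observation key -> shape
    mlp_layer_dims=(256, 256),              # illustrative hidden layer sizes
)
```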
## Implemented VAEs
The library implements a general VAE architecture and a number of prior distributions. See `robomimic/models/vae_nets.py` for complete implementations.
### VAE
A general Variational Autoencoder (VAE) implementation, as described in [this paper](https://arxiv.org/abs/1312.6114).
Models a distribution p(X) or a conditional distribution p(X | Y), where each variable can consist of multiple modalities. The target variable X, whose distribution is modeled, is specified through the `input_shapes` argument, which is a map between modalities (strings) and expected shapes. In this way, a variable that consists of multiple kinds of data (e.g. image and flat-dimensional) can be modeled as well. A separate `output_shapes` argument is used to specify the expected reconstructions - this allows for asymmetric reconstruction (for example, reconstructing low-resolution images).
This implementation supports learning conditional distributions as well (cVAE). The conditioning variable Y is specified through the `condition_shapes` argument, which is also a map between modalities (strings) and expected shapes. In this way, variables with multiple kinds of data (e.g. image and flat-dimensional) can jointly be conditioned on. By default, the decoder takes the conditioning variable Y as input. To force the decoder to reconstruct from just the latent, set `decoder_is_conditioned` to False (in this case, the prior must be conditioned).
The implementation also supports learning expressive priors instead of using the usual N(0, 1) prior. There are three kinds of priors supported - Gaussian, Gaussian Mixture Model (GMM), and Categorical. For each prior, the parameters can be learned as independent parameters, or be learned as functions of the conditioning variable Y (by setting `prior_is_conditioned`).
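As a heavily hedged construction sketch (only `input_shapes`, `output_shapes`, `condition_shapes`, and `decoder_is_conditioned` appear in the description above; the remaining argument names are assumptions, so consult `robomimic/models/vae_nets.py` for the exact signature):
```python
import torch
from collections import OrderedDict
from robomimic.models.vae_nets import VAE

# A hedged sketch of a conditional VAE over actions, conditioned on proprioception.
vae = VAE(
    input_shapes=OrderedDict(action=(7,)),        # target variable X
    output_shapes=OrderedDict(action=(7,)),       # symmetric reconstruction
    condition_shapes=OrderedDict(proprio=(12,)),  # conditioning variable Y (cVAE)
    decoder_is_conditioned=True,                  # decoder also sees Y
    encoder_layer_dims=(256, 256),                # assumed argument name
    decoder_layer_dims=(256, 256),                # assumed argument name
    latent_dim=16,                                # assumed argument name
    device=torch.device("cpu"),                   # assumed argument name
)
```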
### GaussianPrior
- A class that holds functionality for learning both unimodal Gaussian priors and multimodal Gaussian Mixture Model (GMM) priors for use in VAEs. Supports learnable priors, learnable / fixed mixture weights for GMMs, and observation-conditioned priors.
### CategoricalPrior
- A class that holds functionality for learning categorical priors for use in VAEs.