What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Please see the paper and blog post for more information (links below).

Paper: link

Blog Post: link

Video Summary

Overview

In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality.
Our study analyzes the most critical challenges when learning from offline human data for manipulation.
Based on the study, we derive a series of lessons to guide future work, and also highlight opportunities in learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks and easily scale to natural, real-world manipulation scenarios.
We have open-sourced our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data.

In this study, we investigate several challenges of offline learning from human datasets and extract lessons to guide future work.

Why is learning from human-labeled datasets difficult?

We explore five challenges in learning from human-labeled datasets.

(C1) Unobserved Factors in Human Decision Making. Humans are not perfect Markovian agents. In addition to what they currently see, their actions may be influenced by other external factors - such as the device they are using to control the robot and the history of the actions that they have provided.
(C2) Mixed Demonstration Quality. Collecting data from multiple humans can result in mixed-quality data, since some people may be better supervisors than others.
(C3) Dependence on Dataset Size. Policy learning is also sensitive to the state and action space coverage in the dataset.
(C4) Train Objective ≠ Eval Objective. Unlike traditional supervised learning, where validation loss is a strong indicator of how good a model is, policies are usually trained with surrogate losses. This makes it hard to know which trained policy checkpoints are good without trying out each model directly on the robot, which is a time-consuming process.
(C5) Sensitivity to Agent Design Decisions. Performance can be very sensitive to important agent design decisions, like the observation space and hyperparameters used for learning.

Next, we summarize the tasks (5 simulation and 3 real), datasets (3 different variants), algorithms (6 offline methods, including 3 imitation learning and 3 batch reinforcement learning), and observation spaces (2 main variants) that we explored in our study.

Study Design: Tasks

Lift

Can

Tool Hang

Square

Lift (Real)

Can (Real)

Tool Hang (Real)

Transport

We collect datasets across 6 operators of varying proficiency and evaluate offline policy learning methods on 8 challenging manipulation tasks that test a wide range of manipulation capabilities including pick-and-place, multi-arm coordination, and high-precision insertion and assembly.

Study Design: Task Reset Distributions

Study Design: Datasets

We collected 3 kinds of datasets in this study.

Machine-Generated

These datasets consist of rollouts from a series of SAC agent checkpoints trained on Lift and Can. As a result, they contain random, suboptimal, and expert data. This kind of mixed-quality data is common in offline RL works (e.g. D4RL, RL Unplugged).

Lift (MG)

Can (MG)

Lift and Can Machine-Generated datasets.

Proficient-Human

These datasets consist of 200 demonstrations collected from a single proficient human operator using RoboTurk.

Lift (PH)

Can (PH)

Square (PH)

Transport (PH)

Tool Hang (PH)

Proficient-Human datasets generated by 1 proficient operator (with the exception of Transport, which had 2 proficient operators working together).

Multi-Human

These datasets consist of 300 demonstrations collected from six human operators of varied proficiency using RoboTurk. Each operator falls into one of 3 groups (“Worse”, “Okay”, and “Better”), and each group contains two operators. Each operator collected 50 demonstrations per task. As a result, these datasets contain mixed-quality human demonstration data. We show videos for a single operator from each group.

Lift (MH) - Worse

Lift (MH) - Okay

Lift (MH) - Better

Multi-Human Lift dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).

Can (MH) - Worse

Can (MH) - Okay

Can (MH) - Better

Multi-Human Can dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).

Square (MH) - Worse

Square (MH) - Okay

Square (MH) - Better

Multi-Human Square dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).

Transport (MH) - Worse-Worse

Transport (MH) - Okay-Okay

Transport (MH) - Better-Better

Transport (MH) - Worse-Okay

Transport (MH) - Worse-Better

Transport (MH) - Okay-Better

Multi-Human Transport dataset. These were collected using pairs of operators with Multi-Arm RoboTurk (each one controlled 1 robot arm). We collected 50 demonstrations per combination of the operator subgroups.
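The dataset sizes described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant names are ours, and the Transport total is derived from the six pairings listed above rather than stated directly in the text.

```python
from itertools import combinations_with_replacement

GROUPS = ["Worse", "Okay", "Better"]
OPERATORS_PER_GROUP = 2
DEMOS_PER_OPERATOR = 50   # single-arm tasks (Lift, Can, Square)
DEMOS_PER_PAIRING = 50    # Transport, one pairing = one combination of groups

# Single-arm tasks: every operator contributes the same number of demonstrations.
single_arm_total = len(GROUPS) * OPERATORS_PER_GROUP * DEMOS_PER_OPERATOR

# Transport: unordered pairings of proficiency groups, same-group pairs allowed.
pairings = list(combinations_with_replacement(GROUPS, 2))
transport_total = len(pairings) * DEMOS_PER_PAIRING

print(single_arm_total)
print(["-".join(p) for p in pairings])
print(transport_total)
```

Under these assumptions, both the single-arm and the Transport Multi-Human datasets come out at 300 demonstrations.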

Study Design: Algorithms

We evaluated 6 different offline learning algorithms in this study, including 3 imitation learning and 3 batch (offline) reinforcement learning algorithms.

BC: standard Behavioral Cloning, which is direct regression from observations to actions.
BC-RNN: Behavioral Cloning with a policy network that’s a Recurrent Neural Network (RNN), which allows modeling temporal correlations in decision-making.
HBC: Hierarchical Behavioral Cloning, where a high-level subgoal planner is trained to predict future observations, and a low-level recurrent policy is conditioned on a future observation (subgoal) to predict action sequences (see this paper and this paper for more details).
BCQ: Batch-Constrained Q-Learning, a batch reinforcement learning method proposed in this paper.
CQL: Conservative Q-Learning, a batch reinforcement learning method proposed in this paper.
IRIS: Implicit Reinforcement without Interaction, a batch reinforcement learning method proposed in this paper.
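To make "direct regression from observations to actions" concrete, here is a minimal sketch of Behavioral Cloning on a toy problem. It is not the implementation used in the study: the scalar observations, the linear policy, the synthetic expert rule `a = 2*o - 1`, and the function names are all assumptions chosen to keep the example self-contained.

```python
import random

def make_demos(n, seed=0):
    """Synthetic 'demonstrations': a hypothetical expert acts as a = 2*o - 1."""
    rng = random.Random(seed)
    obs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(o, 2.0 * o - 1.0) for o in obs]

def train_bc(demos, lr=0.1, epochs=500):
    """Fit a linear policy a = w*o + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(demos)
    for _ in range(epochs):
        # Gradients of the mean squared error between predicted and demonstrated actions.
        grad_w = sum(2.0 * (w * o + b - a) * o for o, a in demos) / n
        grad_b = sum(2.0 * (w * o + b - a) for o, a in demos) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

demos = make_demos(200)
w, b = train_bc(demos)
print(round(w, 2), round(b, 2))
```

The learned parameters should approach the expert's rule. Real policies replace the linear map with a neural network and the scalar observation with low-dimensional or image observations, but the training signal is the same supervised regression loss.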

Study Design: Observation Spaces

We study two different observation spaces in this work – low-dimensional observations and image observations.

Image Observations

We provide examples of the image observations used in each task below.

Most tasks have a front view and wrist view camera. The front view matches the view provided to the operator during data collection.

Tool Hang has a side view and wrist view camera. The side view matches the view provided to the operator during data collection.

Transport has a shoulder view and wrist view camera per arm. The shoulder view cameras match the views provided to each operator during data collection.

Summary of Lessons Learned

In this section, we briefly highlight the lessons we learned from our study. See the paper for more thorough results and discussion.

Lesson 1: History-Dependent Models are extremely effective.

Methods that make decisions based on history, such as BC-RNN and HBC, outperform other methods on human datasets.
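One way to picture the difference between a history-dependent model and plain BC is in how training samples are formed: a recurrent policy is trained on short sequences of consecutive steps rather than on isolated observation-action pairs. The sketch below illustrates that idea only. The function name `history_windows` and the `horizon` parameter are hypothetical, and the toy trajectory is made up.

```python
def history_windows(trajectory, horizon):
    """Split one demonstration into overlapping windows of `horizon` consecutive steps."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return [trajectory[t:t + horizon] for t in range(len(trajectory) - horizon + 1)]

# Hypothetical 5-step demonstration of (observation, action) pairs.
demo = [("o0", "a0"), ("o1", "a1"), ("o2", "a2"), ("o3", "a3"), ("o4", "a4")]

windows = history_windows(demo, horizon=3)
for window in windows:
    print(window)
```

With `horizon=1` every window holds a single step, which corresponds to the samples a history-free BC policy would see.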

Lesson 2: Batch Offline RL struggles with suboptimal human data.

While Batch Offline RL methods are proficient at dealing with mixed-quality machine-generated data, they struggle with mixed-quality human data.

To further evaluate methods in a simpler setting, we collected the Can Paired dataset, where every task instance has two demonstrations, one success and one failure. Even this simple setting, where each start state has exactly one positive and one negative demonstration, poses a problem.

Lesson 3: Improving Offline Policy Selection is important.

The mismatch between the training and evaluation objectives causes problems for policy selection. Unlike in supervised learning, the checkpoint with the best validation loss is not necessarily the best-performing policy. We found that the policy selected by best validation loss is 50 to 100% worse than the best-performing policy. Thus, each policy checkpoint needs to be tried directly on the robot, which can be costly.
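The selection problem can be sketched as follows. The checkpoint names, validation losses, and success rates below are made-up numbers for illustration, not results from the study, and reading "worse" as a relative drop in success rate is our assumption.

```python
def relative_drop(best_success, selected_success):
    """Fraction of the best achievable success rate lost by the selected checkpoint."""
    if best_success <= 0:
        raise ValueError("best_success must be positive")
    return (best_success - selected_success) / best_success

# Made-up numbers for illustration only (not results from the study):
# checkpoint name -> (validation loss, rollout success rate)
checkpoints = {
    "epoch_100": (0.020, 0.30),
    "epoch_200": (0.015, 0.40),
    "epoch_300": (0.018, 0.80),
}

# Selecting by lowest validation loss versus by highest rollout success rate.
by_val = min(checkpoints, key=lambda k: checkpoints[k][0])
by_rollout = max(checkpoints, key=lambda k: checkpoints[k][1])

drop = relative_drop(checkpoints[by_rollout][1], checkpoints[by_val][1])
print(by_val, by_rollout, f"{drop:.0%}")
```

In this toy case the checkpoint with the lowest validation loss is not the one with the highest success rate, so selecting by validation loss gives up half of the achievable performance.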

Lesson 4: Observation Space and Hyperparameters play a large role in policy performance.

We found that observation space choice and hyperparameter selection are crucial for good performance. As an example, not including wrist camera observations can reduce performance by 10 to 45 percent.

Lesson 5: Using Human Data for manipulation is promising.

Studying how dataset size impacts performance made us realize that using human data holds much promise. For each task, the bar chart shows how performance changes going from 20% to 50% to 100% of the data. Simpler tasks like Lift and Can require just a fraction of our collected datasets to learn, while more complex tasks like Square and Transport benefit substantially from adding more human data, suggesting that more complex tasks could be addressed by using large human datasets.
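A dataset-size study of this kind needs 20%, 50%, and 100% subsets of the demonstrations. The sketch below shows one way to build them. It assumes nested subsets (each smaller subset is contained in the larger ones) and a dataset of 200 demonstrations. Both choices, and the function name `nested_subsets`, are ours and are not taken from the study.

```python
import random

def nested_subsets(demo_ids, fractions, seed=0):
    """Shuffle once, then take prefixes so smaller subsets sit inside larger ones."""
    order = list(demo_ids)
    random.Random(seed).shuffle(order)
    return {f: sorted(order[:max(1, round(len(order) * f))]) for f in fractions}

demo_ids = list(range(200))  # hypothetical IDs for a 200-demonstration dataset
subsets = nested_subsets(demo_ids, fractions=(0.2, 0.5, 1.0))

for fraction, ids in subsets.items():
    print(fraction, len(ids))
```

Nesting the subsets means that any change in performance between sizes comes from the added demonstrations rather than from a different random draw.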

Lesson 6: Study Results transfer to the Real World.

For each real-world task, we collected 200 demonstrations and trained a BC-RNN policy using the same hyperparameters as in simulation, with no additional tuning. We see that in most cases, performance and insights on what works in simulation transfer well to the real world.

Lift (Real). 96.7% success rate. Nearly matches performance in simulation (100%).

Can (Real). 73.3% success rate. Below performance in simulation (100%).

Tool Hang (Real). 3.3% success rate. Far from simulation (67.3%) - the real task is harder.

Below, we present examples of policy failures on the Tool Hang task, which illustrate its difficulty and the large room for improvement.

Insertion Miss

Failed Insertion

Failed Tool Grasp

Tool Drop

Failures which illustrate the difficulty of the Tool Hang task.

We also show that results from our observation space study hold true in the real world – visuomotor policies benefit strongly from wrist observations and pixel shift randomization.

Can (no Wrist). 43.3% success rate (compared to 73.3% with wrist).

Can (no Rand). 26.7% success rate (compared to 73.3% with randomization).

Without wrist observations (left) the success rate decreases from 73.3% to 43.3%. Without pixel shift randomization (right), the success rate decreases from 73.3% to 26.7%.
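The text above does not spell out how pixel shift randomization is implemented. A common variant translates each training image by a small random offset and fills the exposed border by repeating edge pixels, which is equivalent to edge-padding the image and taking a random crop of the original size. The sketch below shows that variant on a tiny grayscale image stored as nested lists. The function name and the `max_shift` parameter are assumptions.

```python
import random

def random_pixel_shift(image, max_shift, rng):
    """Translate a 2D image by a random offset, repeating edge pixels to fill gaps."""
    height, width = len(image), len(image[0])
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)

    def clamp(value, upper):
        # Out-of-range indices fall back to the nearest edge pixel.
        return min(max(value, 0), upper)

    return [
        [image[clamp(r + dy, height - 1)][clamp(c + dx, width - 1)] for c in range(width)]
        for r in range(height)
    ]

# Toy 4x4 image whose pixel values encode their position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
shifted = random_pixel_shift(image, max_shift=1, rng=random.Random(0))

for row in shifted:
    print(row)
```

The output keeps the input resolution, so the augmentation can be applied to each image in a training batch without changing the policy network's input shape.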

Takeaways

Learning from large multi-human datasets can be challenging.
Large multi-human datasets hold promise for endowing robots with dexterous manipulation capabilities.
Studying this setting in simulation can enable reproducible evaluation, and insights can transfer to the real world.