What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Please see the paper and blog post for more information (links below).

Paper: link

Blog Post: link

Video Summary

Overview

In this study, we investigate several challenges of offline learning from human datasets and extract lessons to guide future work.

Why is learning from human-labeled datasets difficult?

We explore five challenges in learning from human-labeled datasets.

Next, we summarize the tasks (5 simulation and 3 real), datasets (3 variants), algorithms (6 offline methods: 3 imitation learning and 3 batch reinforcement learning), and observation spaces (2 main variants) that we explored in our study.

Study Design: Tasks

Lift
Can
Tool Hang
Square
Lift (Real)
Can (Real)
Tool Hang (Real)
Transport
We collect datasets across 6 operators of varying proficiency and evaluate offline policy learning methods on 8 challenging manipulation tasks that test a wide range of manipulation capabilities including pick-and-place, multi-arm coordination, and high-precision insertion and assembly.

Study Design: Task Reset Distributions

Lift
Can
Tool Hang
Square
Lift (Real)
Can (Real)
Tool Hang (Real)
Transport
We show the task reset distribution for each task. Initial states are sampled from this distribution at both training and evaluation time.

Study Design: Datasets

We collected 3 kinds of datasets in this study.

Machine-Generated

These datasets consist of rollouts from a series of SAC agent checkpoints trained on Lift and Can. As a result, they contain a mixture of random, suboptimal, and expert data. This kind of mixed-quality data is common in offline RL benchmarks (e.g. D4RL, RL Unplugged).

Lift (MG)
Can (MG)
Lift and Can Machine-Generated datasets.
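As a rough illustration of how such a dataset can be assembled, the sketch below pools rollouts from several policy snapshots of varying quality (e.g. SAC checkpoints saved at different points of training) into one dataset. The environment and policy objects are stand-ins for a gym-style interface, not the exact pipeline used in the study.

from typing import Callable, List, Sequence

def collect_mixed_quality_data(env, policies: Sequence[Callable], episodes_per_policy: int = 50) -> List[list]:
    """Pool rollouts from several policies of varying quality into one dataset.

    Passing snapshots of an RL agent saved at different points in training
    (early = random/suboptimal, late = near-expert) yields mixed-quality data
    like the Machine-Generated datasets described above. `env` is assumed to
    expose a gym-style reset()/step() interface.
    """
    dataset = []
    for policy in policies:
        for _ in range(episodes_per_policy):
            obs, done, traj = env.reset(), False, []
            while not done:
                action = policy(obs)
                next_obs, reward, done, info = env.step(action)
                traj.append((obs, action, reward, next_obs, done))
                obs = next_obs
            dataset.append(traj)
    return dataset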

Proficient-Human

These datasets consist of 200 demonstrations collected from a single proficient human operator using RoboTurk.

Lift (PH)
Can (PH)
Square (PH)
Transport (PH)
Tool Hang (PH)
Proficient-Human datasets generated by 1 proficient operator (with the exception of Transport, which had 2 proficient operators working together).

Multi-Human

These datasets consist of 300 demonstrations collected from six human operators of varied proficiency using RoboTurk. Each operator falls into one of three groups ("Worse", "Okay", or "Better"), with two operators per group, and each operator collected 50 demonstrations per task. As a result, these datasets contain mixed-quality human demonstration data. We show videos for a single operator from each group.

Lift (MH) - Worse
Lift (MH) - Okay
Lift (MH) - Better
Multi-Human Lift dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).
Can (MH) - Worse
Can (MH) - Okay
Can (MH) - Better
Multi-Human Can dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).
Square (MH) - Worse
Square (MH) - Okay
Square (MH) - Better
Multi-Human Square dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).
Transport (MH) - Worse-Worse
Transport (MH) - Okay-Okay
Transport (MH) - Better-Better
Transport (MH) - Worse-Okay
Transport (MH) - Worse-Better
Transport (MH) - Okay-Better
Multi-Human Transport dataset. These were collected using pairs of operators with Multi-Arm RoboTurk (each one controlled 1 robot arm). We collected 50 demonstrations per combination of the operator subgroups.

Study Design: Algorithms

We evaluated 6 different offline learning algorithms in this study, including 3 imitation learning methods (BC, BC-RNN, and HBC) and 3 batch (offline) reinforcement learning methods (BCQ, CQL, and IRIS).

Study Design: Observation Spaces

We study two different observation spaces in this work – low-dimensional observations and image observations.

Image Observations

We provide examples of the image observations used in each task below.

Most tasks have a front view and wrist view camera. The front view matches the view provided to the operator during data collection.
Tool Hang has a side view and wrist view camera. The side view matches the view provided to the operator during data collection.
Transport has a shoulder view and wrist view camera per arm. The shoulder view cameras match the views provided to each operator during data collection.
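For concreteness, the snippet below sketches what a single timestep of each observation variant might contain. The key names and shapes are illustrative (loosely following robosuite-style naming), not necessarily the exact ones used in the released datasets.

import numpy as np

# Low-dimensional variant: proprioception plus ground-truth object features.
low_dim_obs = {
    "robot0_eef_pos": np.zeros(3),        # end-effector position
    "robot0_eef_quat": np.zeros(4),       # end-effector orientation
    "robot0_gripper_qpos": np.zeros(2),   # gripper finger positions
    "object": np.zeros(10),               # object pose features (size varies by task)
}

# Image variant: camera frames plus proprioception, with no ground-truth object state.
image_obs = {
    "agentview_image": np.zeros((84, 84, 3), dtype=np.uint8),           # front/side view camera
    "robot0_eye_in_hand_image": np.zeros((84, 84, 3), dtype=np.uint8),  # wrist view camera
    "robot0_eef_pos": np.zeros(3),
    "robot0_eef_quat": np.zeros(4),
    "robot0_gripper_qpos": np.zeros(2),
}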

Summary of Lessons Learned

In this section, we briefly highlight the lessons we learned from our study. See the paper for more thorough results and discussion.

Lesson 1: History-Dependent Models are extremely effective.

Methods that make decisions based on history, such as BC-RNN and HBC, outperform other methods on human datasets.
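As a minimal sketch of what a history-dependent policy looks like, the PyTorch snippet below runs an LSTM over a window of past observations and predicts per-step actions, trained with a simple behavior cloning loss. This is a simplification of BC-RNN; the full method involves additional design choices (e.g. a mixture-density action head).

import torch
import torch.nn as nn

class BCRNNPolicy(nn.Module):
    """History-dependent behavior cloning policy: an LSTM over an observation window."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 400):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden_dim, num_layers=2, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, horizon, obs_dim) -> actions: (batch, horizon, action_dim)
        hidden, _ = self.rnn(obs_seq)
        return self.action_head(hidden)

def bc_loss(policy: BCRNNPolicy, obs_seq: torch.Tensor, action_seq: torch.Tensor) -> torch.Tensor:
    """Behavior cloning loss on a batch of demonstration sub-sequences."""
    return nn.functional.mse_loss(policy(obs_seq), action_seq)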

Lesson 2: Batch Offline RL struggles with suboptimal human data.

While Batch Offline RL methods can handle mixed-quality machine-generated data, they struggle with mixed-quality human data.
To evaluate methods in an even simpler setting, we collected the Can Paired dataset, where every task instance has two demonstrations: one success and one failure. Even this setting, where each start state has exactly one positive and one negative demonstration, poses a problem for these methods.

Lesson 3: Improving Offline Policy Selection is important.

The mismatch between the training and evaluation objectives causes problems for policy selection: unlike supervised learning, the checkpoint with the best validation loss is not the best-performing policy. We found that the policy with the best validation loss performs 50 to 100% worse than the best-performing policy. Thus, each policy checkpoint needs to be tried directly on the robot, which can be costly.
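In practice, this means selecting checkpoints by rolling them out and measuring success rate rather than by validation loss. The sketch below illustrates such a selection loop; the environment and policy objects are stand-ins for the actual evaluation stack, and the info["success"] flag is an assumed convention.

from typing import Callable, Dict

def evaluate_checkpoints(checkpoints: Dict[str, Callable], env,
                         num_rollouts: int = 50, max_steps: int = 400) -> Dict[str, float]:
    """Estimate the success rate of each policy checkpoint by direct rollouts.

    Because the checkpoint with the lowest validation loss is often not the
    best-performing policy, each checkpoint is evaluated in the environment.
    Assumes a gym-style env whose `info` dict flags task success.
    """
    success_rates = {}
    for name, policy in checkpoints.items():
        successes = 0
        for _ in range(num_rollouts):
            obs, done, steps, info = env.reset(), False, 0, {}
            while not done and steps < max_steps:
                obs, reward, done, info = env.step(policy(obs))
                steps += 1
            successes += int(info.get("success", False))
        success_rates[name] = successes / num_rollouts
    return success_rates

# The checkpoint with the highest measured success rate is the one to deploy,
# e.g. best = max(success_rates, key=success_rates.get).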

Lesson 4: Observation Space and Hyperparameters play a large role in policy performance.

We found that the choice of observation space and hyperparameters plays a large role in policy performance. As an example, not including wrist camera observations can reduce performance by 10 to 45 percent.
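The kinds of choices that mattered can be summarized as a small configuration; the values below are hypothetical placeholders for illustration, not the exact settings used in the study.

# Illustrative (hypothetical) settings -- see the paper for the values actually used per task.
config = {
    "observations": {
        "cameras": ["agentview", "robot0_eye_in_hand"],  # dropping the wrist camera hurt performance
        "image_size": 84,
        "pixel_shift_randomization": True,
    },
    "policy": {
        "rnn_horizon": 10,        # length of the observation history window
        "hidden_dim": 400,
        "gmm_action_head": True,  # multimodal action distribution
    },
    "optimization": {
        "learning_rate": 1e-4,
        "batch_size": 16,
        "num_epochs": 600,
    },
}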

Lesson 5: Using Human Data for manipulation is promising.

Studying how dataset size impacts performance made us realize that using human data holds much promise. For each task, we compare performance when training on 20%, 50%, and 100% of the data. Simpler tasks like Lift and Can can be learned from just a fraction of our collected datasets, while more complex tasks like Square and Transport benefit substantially from additional human data, suggesting that even more complex tasks could be addressed with large human datasets.
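One simple way to run this kind of ablation is to subsample the demonstrations before training. The sketch below draws a random subset of trajectories; it assumes the dataset is available as an in-memory list of demonstrations rather than any particular file format.

import random
from typing import List, Sequence

def subsample_demos(demos: Sequence, fraction: float, seed: int = 0) -> List:
    """Return a random subset of demonstrations (e.g. 20% or 50% of the data)."""
    rng = random.Random(seed)
    num_keep = max(1, int(round(fraction * len(demos))))
    return rng.sample(list(demos), num_keep)

# Example: retrain and evaluate at 20%, 50%, and 100% of the collected data.
# for frac in (0.2, 0.5, 1.0):
#     train_demos = subsample_demos(all_demos, frac)
#     ... train a policy on train_demos and record its success rate ...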

Lesson 6: Study Results transfer to Real World.

We collected 200 demonstrations per task and trained a BC-RNN policy with the same hyperparameters used in simulation, with no additional tuning. In most cases, performance and insights on what works in simulation transfer well to the real world.

Lift (Real). 96.7% success rate. Nearly matches performance in simulation (100%).
Can (Real). 73.3% success rate. Nearly matches performance in simulation (100%).
Tool Hang (Real). 3.3% success rate. Far from simulation (67.3%) - the real task is harder.

Below, we present examples of policy failures on the Tool Hang task, which illustrate its difficulty and the large room for improvement.

Insertion Miss
Failed Insertion
Failed Tool Grasp
Tool Drop
Failures which illustrate the difficulty of the Tool Hang task.

We also show that results from our observation space study hold true in the real world – visuomotor policies benefit strongly from wrist observations and pixel shift randomization.

Can (no Wrist). 43.3% success rate (compared to 73.3% with wrist).
Can (no Rand). 26.7% success rate (compared to 73.3% with randomization).
Without wrist observations (left) the success rate decreases from 73.3% to 43.3%. Without pixel shift randomization (right), the success rate decreases from 73.3% to 26.7%.
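Pixel shift randomization can be implemented as a pad-and-random-crop augmentation on image observations. The snippet below is a minimal sketch of that idea, not the exact implementation used in the study.

import torch
import torch.nn.functional as F

def random_pixel_shift(images: torch.Tensor, max_shift: int = 4) -> torch.Tensor:
    """Randomly shift each image by up to `max_shift` pixels via pad-and-crop.

    images: (batch, channels, height, width) float tensor. Each image gets its
    own random shift, which makes the policy less sensitive to small viewpoint changes.
    """
    batch, _, height, width = images.shape
    padded = F.pad(images, (max_shift, max_shift, max_shift, max_shift), mode="replicate")
    shifted = torch.empty_like(images)
    for i in range(batch):
        top = int(torch.randint(0, 2 * max_shift + 1, (1,)))
        left = int(torch.randint(0, 2 * max_shift + 1, (1,)))
        shifted[i] = padded[i, :, top:top + height, left:left + width]
    return shifted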

Takeaways

  1. Learning from large multi-human datasets can be challenging.

  2. Large multi-human datasets hold promise for endowing robots with dexterous manipulation capabilities.

  3. Studying this setting in simulation enables reproducible evaluation, and the insights can transfer to the real world.

Additional Policy Evaluation Videos

In this section, we provide some additional policy rollout videos.

BC-RNN image agent on Proficient-Human Transport dataset. 72% success rate.
BC-RNN image agent on Multi-Human Transport dataset. 42% success rate.
BC-RNN image agent on Multi-Human Can dataset. 96% success rate.
BC-RNN image agent on Multi-Human Square dataset. 76.7% success rate.
BC image agent on MH-Worse Can dataset. 54.7% success rate.
BC-RNN image agent on MH-Worse Can dataset. 70% success rate.
BC image agent on MH-Worse Square dataset. 17.3% success rate.
BC-RNN image agent on MH-Worse Square dataset. 36.7% success rate.
BC image agent on MH-Worse-Better Square dataset. 38.7% success rate.
BC-RNN image agent on MH-Worse-Better Square dataset. 57.3% success rate.