Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

Oral Presentation at International Conference on Computer Vision (ICCV) 2023

Jianren Wang*         Sudeep Dasari*         Mohan Kumar         Shubham Tulsiani         Abhinav Gupta
Carnegie Mellon University
*denotes equal contribution


The field of visual representation learning has seen explosive growth in recent years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis to learn (task-specific) robot action policies (e.g. via behavior cloning). While the visual representations do accelerate learning, they are primarily used to encode visual observations. Thus, action information has to be derived purely from robot data, which is expensive to collect! In this work, we present a scalable alternative where the visual representations can help directly infer robot actions. We observe that vision encoders express relationships between image observations as distances (e.g. via embedding dot product) that could be used to efficiently plan robot behavior. We operationalize this insight and develop a simple algorithm for acquiring a distance function and dynamics predictor by fine-tuning a pre-trained representation on human-collected video sequences. The final method substantially outperforms traditional robot learning baselines (e.g. 70% success vs. 50% for behavior cloning on pick-place) on a suite of diverse real-world manipulation tasks. It can also generalize to novel objects, without using any robot demonstrations at training time.


Our system: In our problem setting, we use a low-cost reacher-grabber tool (left) to collect training demonstrations. These demonstrations are used to acquire a robot controller purely through distance/representation learning. The final system is deployed on a robot (right) to solve various tasks at test time.


Our method leverages a pre-trained representation network, R, to encode observations, i_t = R(I_t), and enables control via distance learning. Specifically, we use contrastive representation learning methods to learn a distance metric, d(i_j, i_k), on the pre-trained embedding space. The key idea is to use this distance metric to select which of the possible future states is closest to the goal state. But how do we predict possible future states? We explicitly learn a dynamics function, F(i_t, a_t), that predicts the future state for a possible action a_t. At test time, we predict multiple future states using different possible actions and select the one that is closest to the goal state.
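To make the idea of contrastive distance learning concrete, here is a minimal sketch using an InfoNCE-style objective in NumPy. This is an illustration only: the text above does not say which contrastive loss is used, and the function name info_nce_loss, the temperature value, and the way pairs are formed (each anchor embedding paired with one matching "positive" embedding, e.g. a nearby frame from the same video) are all assumptions made for this example.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (assumed objective, not confirmed by the source).

    Row k of `anchors` should be more similar to row k of `positives`
    than to any other row. Similarity is the dot product of
    L2-normalized embeddings.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    # Subtract the row-wise max for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy data standing in for embeddings i_t = R(I_t); purely synthetic.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
positives = anchors + 0.01 * rng.normal(size=(8, 16))

loss_aligned = info_nce_loss(anchors, positives)
# Reversing the rows mismatches every pair (8 rows, so no row stays in place).
loss_shuffled = info_nce_loss(anchors, positives[::-1])
```

Minimizing a loss of this kind pulls matching embeddings together and pushes mismatched ones apart, which is what allows a similarity score on the embedding space to be read as a distance.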


Given the distance function and dynamics model, our inference procedure is as simple as choosing the action that minimizes the distance to the goal state. More concretely, we start from an initial state and are given a goal image. At each step, we consider a set of candidate actions. The learned distance predictor then infers the future distance-to-goal for each of these candidates, and we execute the action with the lowest predicted distance. We repeat this procedure until the robot is sufficiently close to the desired goal.
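The inference procedure described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names select_action and greedy_rollout, the stopping tolerance, and the step limit are hypothetical, and the toy dynamics and distance functions below stand in for the learned F and d. In a real system the chosen action would be executed on the robot and the new observation re-encoded, rather than reusing the model's own prediction as the next state.

```python
import numpy as np

def select_action(state, goal, candidates, dynamics, distance):
    """Pick the candidate whose predicted next state is closest to the goal."""
    dists = np.array([distance(dynamics(state, a), goal) for a in candidates])
    best = int(np.argmin(dists))
    return candidates[best], float(dists[best])

def greedy_rollout(state, goal, candidates, dynamics, distance,
                   tol=1e-2, max_steps=50):
    """Repeat greedy action selection until close enough to the goal."""
    trajectory = [state]
    for _ in range(max_steps):
        if distance(state, goal) <= tol:
            break
        action, _ = select_action(state, goal, candidates, dynamics, distance)
        # Stand-in for executing the action and re-encoding the observation.
        state = dynamics(state, action)
        trajectory.append(state)
    return trajectory

# Toy stand-ins (assumptions): additive dynamics and Euclidean distance
# on a 2-D "embedding" with four unit-step candidate actions.
def toy_dynamics(state, action):
    return state + action

def toy_distance(a, b):
    return float(np.linalg.norm(a - b))

candidates = [np.array(v, dtype=float)
              for v in ([1, 0], [-1, 0], [0, 1], [0, -1])]
start = np.zeros(2)
goal = np.array([3.0, -2.0])

trajectory = greedy_rollout(start, goal, candidates, toy_dynamics, toy_distance)
final_distance = toy_distance(trajectory[-1], goal)
```

Because the loop only looks one step ahead, its behavior depends entirely on how well the distance function reflects progress toward the goal, which is why the distance is learned rather than hand-designed.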

Experiment Videos