
The vision network is trained beforehand on many different configurations in simulation and then used to infer the block locations from the camera image, so it is not learning continuously. The imitation network takes the block locations predicted by the vision network, together with the demonstration trajectory recorded in VR, and imitates the task shown in the demonstration. That is, it learns to look through the demonstration to decide what action to take next given the current state (i.e. the locations of the blocks and the gripper). To keep the setup simple, we only trained the imitation network on stacking tasks (so no unstacking, throwing, etc.). In future work, we want to make the setup and the tasks much more general.
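To make the two-stage structure concrete, here is a minimal sketch of the inference-time interface being described: a vision stage that maps an image to block locations, and an imitation stage that conditions on a demonstration to choose the next action. All function names, shapes, and the stand-in logic are illustrative assumptions, not the authors' actual networks (the real system uses trained neural networks; the "attention" over the demonstration here is faked with a nearest-state lookup).

```python
import numpy as np

def vision_network(image: np.ndarray) -> np.ndarray:
    """Stand-in for the vision net (trained offline in simulation):
    predicts an (x, y) location per block. Here each block has its own
    channel and we simply return the centroid of its nonzero pixels."""
    n_blocks = image.shape[0]
    locs = np.zeros((n_blocks, 2))
    for b in range(n_blocks):
        ys, xs = np.nonzero(image[b])
        locs[b] = [xs.mean(), ys.mean()]
    return locs

def imitation_policy(block_locs: np.ndarray,
                     gripper_pos: np.ndarray,
                     demo: dict) -> np.ndarray:
    """Stand-in for the imitation net: looks through the demonstration
    and picks an action given the current state. Real systems attend
    over demo steps; here we just copy the action of the nearest
    recorded demo state."""
    state = np.concatenate([block_locs.ravel(), gripper_pos])
    dists = np.linalg.norm(demo["states"] - state, axis=1)
    return demo["actions"][np.argmin(dists)]

# Toy usage: one demo with two recorded (state, action) pairs.
demo = {
    "states": np.array([[0., 0., 5., 5., 1., 1.],
                        [0., 0., 5., 5., 4., 4.]]),
    "actions": np.array([[1., 0.],
                         [0., 1.]]),
}
image = np.zeros((2, 8, 8))
image[0, 0, 0] = 1.0   # block 0 at (x=0, y=0)
image[1, 5, 5] = 1.0   # block 1 at (x=5, y=5)

locs = vision_network(image)                       # [[0, 0], [5, 5]]
action = imitation_policy(locs, np.array([1., 1.]), demo)
```

The key design point the comment makes is visible in the interfaces: the vision stage is fixed at deployment time (no continuous learning), and only the second stage consumes the demonstration, so a new task can in principle be specified by swapping in a new demo without retraining the vision network.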


Thanks for the explanation. Can you also explain the significance of "one-shot imitation learning" generally (beyond the context of this experiment)?



