
Well, my understanding is that they are not training object detectors or segmentation models. Those wouldn't be very useful anyway: you still need a 3D understanding of the scene, and a 2D->3D mapping wouldn't cut it. What they do instead is use stereo cameras to estimate a depth map of the field: their model takes two RGB images and produces a depth map. They combine this depth information with the drone's onboard sensors (acceleration, etc.) and try to predict what an expert agent trained on perfect information in simulation would do. That is, they train in simulation, restrict the information available to the student agent so it relies only on stereo cameras and sensors, as it would in the real world, and have it mimic the 'privileged' drone agent. Computation-wise, if you can run the depth estimation network on the hardware, the remaining step (given the depth map and sensor information, predict the privileged agent's path/vector) should be trivial and is most likely a shallow network.
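To make the teacher-student setup above concrete, here's a toy NumPy sketch. All the names and dimensions are illustrative, not from the actual system: the "privileged" teacher is faked as an unknown linear policy over perfect state, and the shallow student is fit by plain least-squares behavior cloning on depth + sensor features.

```python
# Hypothetical sketch of distilling a "privileged" teacher policy into a
# shallow student that only sees depth and sensor data. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Pretend inputs: a flattened low-res depth map plus IMU-style sensor readings.
N, DEPTH_DIM, SENSOR_DIM, ACTION_DIM = 2000, 64, 6, 4
depth = rng.normal(size=(N, DEPTH_DIM))
sensors = rng.normal(size=(N, SENSOR_DIM))
x = np.hstack([depth, sensors])        # student input: depth map + sensors

# Fake teacher: in the real setup it acts on perfect simulator state; here we
# model its policy as an unknown linear map the student must recover.
W_teacher = rng.normal(size=(DEPTH_DIM + SENSOR_DIM, ACTION_DIM))
teacher_actions = x @ W_teacher

# Shallow student: a single linear layer fit by least squares (imitation).
W_student, *_ = np.linalg.lstsq(x, teacher_actions, rcond=None)

# The student now mimics the teacher without access to privileged state.
err = np.abs(x @ W_student - teacher_actions).max()
print(f"max imitation error: {err:.2e}")
```

In the real system the student would be a (small) neural network and the teacher's actions would come from an RL policy rolled out in simulation, but the structure is the same: supervised regression from restricted observations onto the privileged expert's outputs.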

