Yes, since the background image is static, you don't need a green screen. From watching just once, I think the guy never overlaps his clone(s). So you record two separate versions, planned so that his bounding boxes in the two versions never overlap, and then just show the left two-thirds of the background image with guy 1 and the right one-third with guy 2, divided by a simple vertical cut.
And the overall zooming/panning is done after the merging. That's how I'd approach it.
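To make the vertical cut concrete, here is a minimal sketch of how I'd imagine it per frame. It assumes both versions were shot with a locked-off camera at the same resolution, and it represents a frame as a plain list of pixel rows so it runs without any libraries. The names `split_composite` and `crop_window` and the default two-thirds split are my own placeholders, not anything taken from the video.

```python
def split_composite(take_a, take_b, split_frac=2 / 3):
    """Merge two same-sized frames along one vertical cut.

    Columns left of the cut come from take_a (guy 1),
    columns right of it come from take_b (guy 2).
    Frames are lists of rows; each row is a list of pixels.
    """
    if len(take_a) != len(take_b) or len(take_a[0]) != len(take_b[0]):
        raise ValueError("both takes must have the same dimensions")
    width = len(take_a[0])
    cut = round(width * split_frac)  # x position of the vertical cut
    return [row_a[:cut] + row_b[cut:] for row_a, row_b in zip(take_a, take_b)]


def crop_window(frame, x, y, w, h):
    """Cut a w-by-h window out of the merged frame.

    Moving or resizing this window over time is the pan/zoom step,
    applied only after the two takes have been merged.
    """
    return [row[x:x + w] for row in frame[y:y + h]]


if __name__ == "__main__":
    # Tiny stand-in frames: 4 rows by 6 columns, "A" marks take 1, "B" take 2.
    take_a = [["A"] * 6 for _ in range(4)]
    take_b = [["B"] * 6 for _ in range(4)]
    merged = split_composite(take_a, take_b)
    for row in merged:
        print("".join(row))
    window = crop_window(merged, 3, 1, 2, 2)
    print(window)
```

With real footage you would do the same thing for every frame pair (for example as an array slice in NumPy or a crop-and-overlay in a video editor), and then animate the crop window for the zooming/panning.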