There's some work in the SAR literature on that sort of thing. One sketch of an idea would be to iteratively predict the measurement, multipath included, from what you think the surfaces look like, and then remove the "surfaces" that are better explained as echoes off other surfaces. The goal would be to get images that explain the measurements you take using the fewest surfaces.
This is trickier than normal imaging, because if you ignore the multiple reflections you're basically just inverting a matrix (a Fourier transform in the case of single-line imaging). With multipath, the measurement becomes nonlinear in your scene and inversion isn't as trivial.
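To make the linear vs. nonlinear point concrete, here's a toy sketch in Python/NumPy. It assumes a 1-D scene whose single-bounce measurement is just its DFT, and it models double bounces with a made-up term proportional to the square of the spectrum (so ghosts land at the sum of two reflector positions). The `coupling` value, the reflector positions, and the function names are all invented for illustration, not taken from any real radar model.

```python
import numpy as np

n = 64
scene = np.zeros(n)
scene[[10, 25, 40]] = [1.0, 0.5, 0.8]  # three made-up point reflectors

def single_bounce(s):
    # Linear forward model: the measurement is the DFT of the scene.
    return np.fft.fft(s)

def with_double_bounce(s, coupling=0.3):
    # Assumed toy multipath: squaring the spectrum is a circular
    # self-convolution of the scene, i.e. products of pairs of reflectors.
    direct = np.fft.fft(s)
    return direct + coupling * direct**2

# Single bounce only: inversion is just the inverse transform.
clean = np.fft.ifft(single_bounce(scene)).real

# Same linear inversion applied to multipath data leaves ghost reflectors.
naive = np.fft.ifft(with_double_bounce(scene)).real
ghost_bins = np.flatnonzero(np.abs(naive - scene) > 1e-9)

# Superposition fails once the double-bounce term is present.
a, b = np.zeros(n), np.zeros(n)
a[10], b[25] = 1.0, 0.5
superposition_gap = np.abs(
    with_double_bounce(a + b) - (with_double_bounce(a) + with_double_bounce(b))
).max()

print("linear case recovered:", np.allclose(clean, scene))
print("ghost bins from naive inversion:", ghost_bins)
print("superposition gap:", superposition_gap)
```

The ghosts show up at bins that are sums of pairs of true reflector bins (mod n), which is exactly the kind of apparent "surface" you'd want an iterative scheme to explain away.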
Another alternative is to sweep the radar so you get multiple revisits from different times and angles. Stacking the revisits at a location will suppress the multipath since, in theory, the multipath components depend more strongly on grazing and aspect angle than the "real" components.
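A quick sketch of why stacking helps, in Python/NumPy. The assumption here (mine, purely for illustration) is that the real reflector falls in the same range bin on every look while the multipath ghost drifts by a few bins per look; the bin numbers, amplitudes, and drift rate are all made up.

```python
import numpy as np

n_bins, n_looks = 100, 8
true_bin = 30

looks = np.zeros((n_looks, n_bins))
for k in range(n_looks):
    looks[k, true_bin] = 1.0      # real reflector: same bin in every look
    looks[k, 50 + 3 * k] = 0.6    # assumed ghost: bin drifts with look angle

# Incoherent stack: average the looks bin by bin.
stack = looks.mean(axis=0)

real_level = stack[true_bin]
ghost_level = np.delete(stack, true_bin).max()
print("real:", real_level, "strongest ghost:", ghost_level)
```

The real return keeps its full amplitude, while each ghost only shows up in one look and gets divided by the number of looks. In real data the ghost won't move this cleanly, so the suppression is only as good as the angle dependence is strong.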
Absolutely! I only included our lowest-level data products in this post. We don't sweep the radar side-to-side, but since it's on a moving platform, we see the same thing from multiple angles. In the data I showed, there has been some very limited stacking applied, mostly just to improve the signal-to-noise ratio. The next level of processing would be focusing/migration (depending on whether you're in the radar or seismic world) - it reasons about the possible incidence angles and collapses each hyperbola to a single point.
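For anyone who hasn't seen the hyperbolas: here's a minimal sketch in Python/NumPy of a point scatterer smearing into a hyperbola and a brute-force diffraction-summation migration collapsing it again. This is a textbook toy, not our actual processing chain. It assumes a straight track, unit grid spacing, range measured directly in samples, and a hypothetical scatterer at `(x0, z0)`.

```python
import numpy as np

n_traces, n_samples = 41, 60
x = np.arange(n_traces, dtype=float)  # along-track positions (grid units)
x0, z0 = 20, 25                       # hypothetical point scatterer

# Unfocused data: the echo arrives at range sqrt(z0^2 + (x - x0)^2),
# which traces out a hyperbola across the radargram.
data = np.zeros((n_traces, n_samples))
r = np.rint(np.hypot(z0, x - x0)).astype(int)
data[np.arange(n_traces), r] = 1.0

def migrate(data, x):
    """Diffraction summation: for each candidate point, sum the data
    along the hyperbola that a scatterer at that point would produce."""
    n_traces, n_samples = data.shape
    trace_idx = np.arange(n_traces)
    image = np.zeros_like(data)
    for ix in range(n_traces):
        for iz in range(n_samples):
            rr = np.rint(np.hypot(iz, x - x[ix])).astype(int)
            ok = rr < n_samples  # ignore ranges that fall off the record
            image[ix, iz] = data[trace_idx[ok], rr[ok]].sum()
    return image

image = migrate(data, x)
peak = np.unravel_index(image.argmax(), image.shape)
print("focused peak at (trace, sample):", peak)
```

The sum only lines up with the data at every trace when the candidate point sits on the true scatterer, so the energy collapses there. Note that it only uses along-track geometry, so it does nothing to separate returns from off to the side of the track.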
Unfortunately, this only improves the resolution along the flight track; features parallel to the flight path and offset to the side are the hardest to filter out.
With our sensor geometry, there's an ambiguity about where off to the side the energy was returned from, so the solution would wind up being non-unique whatever you do.
There has been some fun work along the lines you describe that uses maps of the surface shape (generated from camera imagery or scanning laser data) to discriminate which apparently-subsurface echoes are most likely due to the surface topography. So far as I know, this hasn't been automated - it's more an aid to human interpretation (we have an army of undergrads who "pick" the most likely bed location).