It is very rare for programs to be memory bandwidth bound. It usually takes a lot of optimization just to get to that point, plus an access pattern that leans hard on bandwidth (such as looping through large arrays, only doing one simple calculation to each index, then doing that on many cores).
The vast majority of what people run is memory latency bound and in those cases using extra threads makes sense so that the explicit parallelism can compensate for memory latency.
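A minimal sketch of the latency-bound pattern being described: pointer chasing through a randomly permuted index array, where every load depends on the result of the previous one, so the core mostly sits stalled on memory and a second hardware thread could do useful work in the gaps. (Illustrative only; `make_chain`/`chase` are made-up names and no timings are claimed.)

```python
import random

def make_chain(n, seed=0):
    # Build a single random cycle over 0..n-1: next_idx[i] is the
    # successor of i. Because each access depends on the previous
    # load's value, there is no memory-level parallelism within one
    # chain -- the classic latency-bound access pattern.
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    next_idx = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        next_idx[a] = b
    return next_idx

def chase(next_idx, steps):
    # Serialized dependent loads: each iteration must wait for the
    # previous one to return from memory before it can issue.
    i = 0
    for _ in range(steps):
        i = next_idx[i]
    return i

next_idx = make_chain(1 << 16)
print(chase(next_idx, 1000))
```

On a chain much larger than cache, nearly every step is a cache miss the CPU cannot hide, which is exactly the situation where SMT's "run another thread while this one waits" pays off.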
> (such as looping through large arrays, only doing one simple calculation to each index, then doing that on many cores).
...which perfectly describes a parallelized mat-vec-mult. Yes, that's not common in most applications, but I'd have a hard time naming a more basic operation in scientific (and related) computations.
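A rough roofline-style estimate makes it concrete why mat-vec-mult is bandwidth bound: each matrix element is loaded once and used in exactly one multiply-add, so arithmetic intensity is only ~0.25 flop/byte for doubles. (The machine numbers below are illustrative assumptions, not measurements.)

```python
# y = A @ x with an n x n matrix of 8-byte doubles. Traffic is
# dominated by reading A once; x and y are negligible by comparison.
n = 10_000
bytes_moved = 8 * n * n           # one 8-byte load per matrix element
flops = 2 * n * n                 # one multiply + one add per element
intensity = flops / bytes_moved   # flops per byte of DRAM traffic

# Hypothetical machine: 500 GFLOP/s peak compute, 50 GB/s DRAM bandwidth.
peak_flops = 500e9
bandwidth = 50e9
compute_time = flops / peak_flops
memory_time = bytes_moved / bandwidth

print(f"arithmetic intensity: {intensity} flop/byte")
print(f"compute-limited time:   {compute_time * 1e3:.2f} ms")
print(f"bandwidth-limited time: {memory_time * 1e3:.2f} ms")
```

With these (assumed) numbers the memory side is ~40x slower than the compute side, so adding cores or SMT threads to the same socket does nothing once the memory bus is saturated.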
We are saying the same thing here, though I think you are missing the point: this is all a response to someone asking whether SMT is still useful now that almost every CPU has many cores.
The answer is that it absolutely is: your example is niche, and most software/systems can still benefit from using more threads to work around memory latency.