The most disillusioning thing for me as an undergrad was trying to replicate the results from a paper on a relatively complicated problem and not getting anywhere near the paper's level of results.
I got access to the source code, and the super complicated algorithm added almost nothing to the results. The glossed-over, hand-waved-past data normalization worked so well that there wasn't a need for any further classification. This paper was pretty well received and cited, despite basically not working.
I am afraid that what you describe is a pretty fundamental limitation of peer review. Currently it is not expected of reviewers to try to replicate experiments, but only to check if the methodology of the paper seems sound and the relevant previous work is cited. Should this be expected of reviewers? Should the authors provide all code and data so that the results can be replicated at the push of a button? That seems a pretty tall order. Seems like we'll be stuck with a modus operandi of lots of papers being published, peer review only catching obvious errors, and replication being very rare.
I get that, and I don't even think that that should be the reviewer's job. But for CS in particular, I do wish more people published source code and build procedures.
I think a central hub for public comments, with source and an available PDF, would be incredibly useful for CS as a field. For most papers I had access to, I'd have to go through at least one person to get the source. So I tried for quite a while to replicate the result before even trying to get access. If it had been available on, say, GitHub, I'd have grabbed the source code as a resource to understand the paper, and with something like publicly available commenting, I think it'd have been a non-issue.
The problem is, I can see how such a system would be incredibly constructive to the field and the community, but it could be a liability to the authors, and would make publishers meaningless, so I doubt it happens anytime soon.