I wasted a week trying to replace the scorer component with a NN-based language model. Every time I made a change, the whole codebase, including TensorFlow, recompiled, so the turnaround time was about an hour per change. It was awful. I mean, I get reproducible builds etc., and if you're running stuff at Google scale it probably has all kinds of useful features. But for development on a personal laptop it was torture. Eventually I gave up.
Fwiw, that sounds like a bug or a misconfiguration; it's absolutely supposed to have better caching behavior than that (and does in the few projects I've used it on, even on a personal laptop). If you're interested in pursuing it further (I'd understand if you aren't; that sounds frustrating), I bet the Bazel team would be interested in your report.
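For what it's worth, one thing worth checking in a case like this is whether a persistent cache is configured at all; without one, anything that invalidates the analysis cache can cascade into rebuilding large dependencies like TensorFlow. A minimal `.bazelrc` sketch (the cache paths here are placeholders, not defaults):

```shell
# ~/.bazelrc -- minimal sketch; cache paths are placeholders, pick your own.

# Persist built action outputs across workspaces and clean checkouts,
# so unchanged targets (e.g. TensorFlow) are fetched from cache, not rebuilt.
build --disk_cache=~/.cache/bazel/disk

# Cache downloaded external repository archives so they aren't re-fetched.
build --repository_cache=~/.cache/bazel/repo
```

Flag churn is another common culprit: passing different `build` flags between invocations can discard the in-memory analysis cache, so keeping flags stable in `.bazelrc` helps too.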
Mind if I ask why?