Adding another level of convolution would still produce outputs that are conditionally independent (given the input) whenever they are farther apart than the receptive field of the convolution kernels. In a conditional random field, the dependence of the outputs on each other can be modeled as well.
For example, a conditional random field could express "either these patches both contain a tumor or neither does" (which is helpful when there's something suspicious on the patch boundary), and the consequences of committing to either possibility can propagate over the whole field. In contrast, a convolutional layer has to make the decision independently for each local area.
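To make the "both or neither" idea concrete, here's a minimal toy sketch (not anyone's actual model): a pairwise CRF over two neighboring patch labels, where an agreement potential couples the labels. All the numbers are made up for illustration. Note how the pairwise term lets the confident patch pull the ambiguous one along, which independent per-patch decisions cannot do.

```python
import itertools
import math

# Toy pairwise CRF over two neighboring patches (label 0 = normal, 1 = tumor).
# Unary scores: what a local classifier believes for each patch on its own.
# The "agreement" potential rewards matching labels, encoding
# "either both patches contain tumor or neither does".
unary = [
    {0: 0.2, 1: 0.1},  # patch A: locally ambiguous, slightly prefers "normal"
    {0: 0.0, 1: 1.5},  # patch B: locally confident it's tumor
]
agreement = 2.0  # coupling strength between the neighboring labels

def score(labels):
    s = sum(unary[i][l] for i, l in enumerate(labels))
    if labels[0] == labels[1]:
        s += agreement
    return s

# Exact inference by enumeration (fine for two binary variables).
configs = list(itertools.product([0, 1], repeat=2))
z = sum(math.exp(score(c)) for c in configs)
probs = {c: math.exp(score(c)) / z for c in configs}
best = max(configs, key=score)  # MAP configuration: (1, 1)
```

On its own, patch A would be labeled "normal" (0.2 > 0.1), but the joint MAP is (1, 1): patch B's confidence propagates through the agreement potential and flips patch A to "tumor".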
My personal approach is to read papers that seem interesting. Of course I usually don't have the necessary background in everything that's mentioned, but I treat those cases as black boxes. E.g. if the paper says they use X to do Y, I'll simply assume that you can do Y using X. If I think the details of X are important, I dig deeper: sometimes just by reading the corresponding Wikipedia article, sometimes by following the references in the paper. Then repeat recursively.
That approach has the advantage that you'll learn about techniques roughly in proportion to their current popularity, but it has the disadvantage that explanations in papers tend to be brief and you have to piece them together into a coherent whole yourself.
If you prefer textbooks, I've heard good things about http://www.deeplearningbook.org/ but haven't gotten around to reading it. In addition to neural networks, you'll probably also want to read about classical statistics and probability theory, since that's the origin of concepts like conditional random fields, which can be combined with neural networks but are unlikely to be covered by the deep learning literature.
You could use more levels of convolution to get a larger receptive field. But that corresponds to larger patches, e.g. 512x512 pixels, and a larger patch may no longer contain purely tumor cells or purely normal cells. If you then predict just one label for the whole patch, the mixed content can confuse learning. What we propose with the CRF is a large receptive field with dense predictions, i.e. predicting more than one label, and using the CRF to model the correlations between those labels.
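As a hedged illustration of how a CRF can clean up dense predictions, here's a toy 1-D sketch: a chain of per-patch scores from a hypothetical network, with a pairwise smoothness term and Viterbi decoding. (A real setup would typically use a 2-D grid CRF with approximate inference; the chain just shows the principle, and all scores here are invented.)

```python
# Hypothetical sketch: smoothing dense per-patch predictions with a chain CRF.
# unary[t][k] is the network's score for label k at patch t; `pairwise`
# rewards neighboring patches that keep the same label. Viterbi finds the
# jointly best label sequence instead of thresholding each patch alone.

def viterbi(unary, pairwise):
    n, k = len(unary), len(unary[0])
    best = list(unary[0])       # best score of any path ending in each label
    back = []                   # backpointers for recovering the path
    for t in range(1, n):
        ptr, new = [], []
        for j in range(k):
            cand = [best[i] + (pairwise if i == j else 0.0) for i in range(k)]
            i = max(range(k), key=lambda x: cand[x])
            ptr.append(i)
            new.append(cand[i] + unary[t][j])
        best = new
        back.append(ptr)
    # Backtrack from the best final label.
    j = max(range(k), key=lambda x: best[x])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]

# Invented per-patch scores (label 0 = normal, 1 = tumor).
# Thresholded independently, patch 2 would be called "normal".
unary = [
    [0.0, 1.0],
    [0.0, 1.2],
    [0.6, 0.5],  # slightly prefers "normal" in isolation
    [0.0, 1.1],
]
independent = [max(range(2), key=lambda j: u[j]) for u in unary]  # [1, 1, 0, 1]
smoothed = viterbi(unary, pairwise=0.8)                           # [1, 1, 1, 1]
```

With the coupling turned off (`pairwise=0.0`), Viterbi reduces to the independent per-patch argmax; with it on, the isolated "normal" patch is flipped to agree with its neighbors.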