I had the exact same experience recently while writing a dynamic programming simulation in Python. The naive (and elegant) solution to the problem consists of a 20-line recursive function which took me about an hour to write. Unfortunately function calls are very expensive in Python, and I my sim required a couple billion of them, so I started searching for ways to eliminate the recursion. 3 weeks and a deep foray into Numpy later and I had it.
The second implementation is about the same length and replaces all of the simplicity of the recursive solution with a bunch of five-dimensional arrays. Never before did I have to think as deeply about vectorized functions, array broadcasting, dimension matching, etc. The non-recursive version is about an order of magnitude faster, and what's better, using the numpy operators means nearly all of the processing is offloaded into low-level C routines, so the Python overhead is effectively nil. On the downside the code is extremely difficult to follow now... trying to visualize what's going on in 5 dimensions is a real pain.
Might be able to save yourself some trouble if you can generalize it to an n-dimensional set of operations. Then again, that might not work for your situation.
The second implementation is about the same length and replaces all of the simplicity of the recursive solution with a bunch of five-dimensional arrays. Never before did I have to think as deeply about vectorized functions, array broadcasting, dimension matching, etc. The non-recursive version is about an order of magnitude faster, and what's better, using the numpy operators means nearly all of the processing is offloaded into low-level C routines, so the Python overhead is effectively nil. On the downside the code is extremely difficult to follow now... trying to visualize what's going on in 5 dimensions is a real pain.