I suspect a polynomial approximation takes less time than either of these, since that's what most C standard library implementations use on the PC (I just checked Visual Studio 2017).
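For concreteness, here's a rough sketch of what I mean by a polynomial approximation (this is *not* VS2017's actual implementation): a degree-7 odd polynomial for sin(x) on a reduced range, evaluated with Horner's method. A real libm would do careful argument reduction first and use fitted minimax coefficients; I'm just using plain Taylor coefficients to show the shape of the computation: a handful of multiply-adds and no memory accesses.

```c
/* Sketch only: degree-7 Taylor polynomial for sin(x), valid on a
 * reduced range like [-pi/4, pi/4]. A real implementation would use
 * minimax coefficients and proper argument reduction. */
static double sin_poly(double x)
{
    double x2 = x * x;
    /* Horner form of x - x^3/6 + x^5/120 - x^7/5040 */
    return x * (1.0 + x2 * (-1.0/6.0 + x2 * (1.0/120.0 + x2 * (-1.0/5040.0))));
}
```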
It isn't just "the first time" that is slow for the LUT solution.
Each time the table access hits a new cache line, there will be a stall. Say your LUT has 2048 64-bit entries; with 64-byte cache lines, that's 256 cache lines. So the total stall time waiting for cache lines to be filled could be as bad as 210 cycles * 256 = 53760 cycles. In that amount of time, the std library function could have returned about 1000 results (assuming the std library takes 50 cycles per call).
And, as well as waiting for the table reads, the LUT-based function has other work to do. If we are comparing performance against the std library function, that function handles input from the entire range of IEEE 64-bit floats. It isn't practical to cover that range directly with a lookup table; you would need to clamp the input down to something manageable. Once that's done, if you want anything like decent accuracy you need to read two adjacent values from the table and do a linear interpolation between them. Or you could have a HUGE table instead of the lerp, but it would be far too big to fit in L1 cache, and thus also quite slow.
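Here's roughly what that clamp-and-lerp lookup looks like, as a sketch. The range, table size, and names are made up for illustration; the table would be filled once at startup from the accurate function being approximated.

```c
/* Hypothetical clamp-and-lerp LUT. TABLE_MIN, TABLE_MAX and
 * TABLE_SIZE are made-up values for illustration. */
#define TABLE_SIZE 2048
#define TABLE_MIN  0.0
#define TABLE_MAX  1.0

static double table[TABLE_SIZE + 1]; /* one extra entry so i+1 is valid */

static double lut_lookup(double x)
{
    /* Clamp the full IEEE double range down to the table's domain.
     * (A real version would also have to decide what to do with NaN.) */
    if (x < TABLE_MIN) x = TABLE_MIN;
    if (x > TABLE_MAX) x = TABLE_MAX;

    /* Scale into table coordinates; split into integer index + fraction. */
    double pos = (x - TABLE_MIN) * (TABLE_SIZE / (TABLE_MAX - TABLE_MIN));
    int i = (int)pos;
    if (i > TABLE_SIZE - 1) i = TABLE_SIZE - 1; /* the x == TABLE_MAX case */
    double frac = pos - i;

    /* Two table reads plus a linear interpolation between them. */
    return table[i] + frac * (table[i + 1] - table[i]);
}
```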
My guess is a clamp-and-lerp lookup table will take ~10 cycles once the table is entirely in cache. So, I guess I'm claiming that the LUT would then be 5x faster than the std library's 50 cycles.
I don't have a lot of confidence in that prediction though. There are lots of other things to worry about if we were to do this analysis properly, including: a) how the input pattern affects the branch predictor, b) how you feed the function input data without the measurement itself polluting the cache, and c) whether we are measuring latency or throughput.
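To illustrate (c): a minimal timing sketch, assuming POSIX clock_gettime (on Windows you'd use QueryPerformanceCounter instead) and a placeholder f() standing in for either implementation. The throughput loop feeds independent inputs, so calls can overlap in the out-of-order core; the latency loop chains each output into the next input, so you pay the full dependency chain on every call. The two numbers can differ by several times for both the LUT and the polynomial versions.

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for whichever implementation is under test. */
static double f(double x) { return x * 0.5 + 1.0; }

static long long ns_between(struct timespec a, struct timespec b)
{
    return (long long)(b.tv_sec - a.tv_sec) * 1000000000LL
         + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    enum { N = 1000000 };
    struct timespec t0, t1;
    volatile double sink = 0.0; /* volatile so the loop isn't elided */

    /* Throughput: independent calls that can overlap in the pipeline. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        sink += f((double)i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("throughput: %lld ns total\n", ns_between(t0, t1));

    /* Latency: each input depends on the previous result, so calls
     * serialize and the dependency chain dominates. */
    double x = 0.5;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        x = f(x);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("latency: %lld ns total (result %f)\n", ns_between(t0, t1), x);

    (void)sink;
    return 0;
}
```

Even this toy harness glosses over warm-up, frequency scaling, and making sure the compiler hasn't constant-folded f(); a serious measurement needs all of that handled.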