I think there was some misunderstanding; you're arguing different points than the ones I made.
> Of course this approach produces a worse code than a full compiler by definition---stencils would be too rigid to be further optimized.
Yeah, but that's not what I meant by "worse code". I just meant that, even knowing this is a naive copy-and-patch JIT, my first impression was that the code was slightly worse than I expected. I don't expect the compiler to do any magic on a small code slice; I only claimed that there's "room to improve" in the currently generated code, though I may be totally wrong about whether that's actually achievable by "just convincing clang", without manually messing with the asm.
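(For readers unfamiliar with the technique: a "stencil" is a pre-compiled machine-code template with holes, and JIT compilation is just copying the template and patching the holes. Here's a toy model in Python of that mechanism; the names and byte values are made up for illustration, and a real JIT patches object code and relocations, not a bytes literal.)

```python
# Toy model of copy-and-patch. The "stencil compiler" (clang, ahead of
# time, in the real thing) leaves a recognizable placeholder where a
# runtime value will go; the JIT copies the stencil and fills the hole.
# No optimizer ever looks across the patched pieces again, which is
# exactly why the resulting code is rigid.

HOLE = b"\xde\xad\xbe\xef"  # placeholder left in the template

def make_stencil(body: bytes) -> bytes:
    # In reality: object code emitted ahead of time by the C compiler.
    return body

def copy_and_patch(stencil: bytes, value: bytes) -> bytes:
    assert len(value) == len(HOLE)
    return stencil.replace(HOLE, value)  # copy, then patch the hole

# x86 "mov eax, <imm32>" is opcode 0xB8 followed by a 4-byte immediate.
stencil = make_stencil(b"\xb8" + HOLE)
code = copy_and_patch(stencil, b"\x2a\x00\x00\x00")  # patch in 42

assert HOLE not in code
assert code == b"\xb8\x2a\x00\x00\x00"
```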
> But I think you are thinking too much about a possibility to make Python as fast as, say, C for at least some cases.
I never said this about CPython, quite the opposite.
> I believe that it won't happen at all
(FWIW, if we're talking long-term and about Python in general, it already did happen: PyPy (and modern JS runtimes) are good examples of this being possible in principle. But being able to make a language orders of magnitude faster (with some major asterisks) doesn't mean I expect the same from the CPython implementation.)
As for your example with integer adding, I totally agree with all you said, and that's exactly what I meant by "there’s only so much one can do without touching the data model".
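(For anyone following along, the data-model cost behind the integer-adding example is easy to see from Python itself: every CPython int is a heap-allocated object, so even `a + b` involves dereferences, refcounting, and allocating an object for the result. A quick illustration; the exact byte count varies by build:)

```python
import sys

# In CPython every int is a full heap object (a PyLongObject), not a
# bare machine word: it carries a refcount, a type pointer, and a size
# field on top of the digits themselves.
small = 1
print(sys.getsizeof(small))  # typically ~28 bytes on a 64-bit build

# Adding two ints must produce a *new* object (or fetch one from the
# small-int cache), so a JIT can't simply emit a single ADD instruction
# without changing this object model.
result = small + 1
assert sys.getsizeof(result) > 8  # far more than one machine word
```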
> In the best scenario, it will eventually hit the limit of what's possible with copy-and-patch and a full compiler will be required at that point. But until that point (which may never come as well), this approach allows for a long time of incremental improvements without disruption.
That's why in my initial message I said I wonder about the expected peak improvement. I won't be surprised if it (together with the theorized uop optimizations) barely exceeds single-digit percent perf gains, which would of course still be totally worth it. And if it's more, well, even better :) And in the worst case - which I hope won't happen - the point you mentioned is today, and copy-and-patch would never be worth enabling by itself.
> I just meant that, even knowing this is a naive copy-and-patch JIT, my first impression was that the code was slightly worse than I expected.
> "there’s only so much one can do without touching the data model"
You probably want to look at the other link in that PR, which demonstrated how well copy-and-patch can do for another dynamic language (Lua): [1]
Of course, whether or not CPython could eventually make it to that point (or even further) is a different story: they are under way tighter constraints than just developing something for academia. But copy-and-patch can do a lot even for dynamic languages :)
> That's why in my initial message I said I wonder about the expected peak improvement. I won't be surprised if it (together with the theorized uop optimizations) barely exceeds single-digit percent perf gains, which would of course still be totally worth it. And if it's more, well, even better :) And in the worst case - which I hope won't happen - the point you mentioned is today, and copy-and-patch would never be worth enabling by itself.
Ah, so you meant that even all of them, including the specializing interpreter and the copy-and-patch JIT, may not give a reasonable speedup. But I think you have missed the fact that the specializing interpreter already landed in 3.11 and provided a 10--60% speedup. So specialization really works, and the copy-and-patch JIT should allow finer-grained uops, which can have an enormous impact on performance.
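(As a toy model of what the specializing interpreter does: a generic operation caches the operand types it last saw and takes a guarded fast path while that guess keeps holding, deoptimizing and respecializing when it doesn't. The class and names below are invented for illustration; the real mechanism rewrites bytecode and inline caches in place, per PEP 659, not Python objects.)

```python
# Toy sketch of type specialization with an inline cache.
class BinaryAdd:
    def __init__(self):
        self.cached_type = None  # the "inline cache"

    def __call__(self, a, b):
        # Fast path: guard that the cached specialization still applies.
        if self.cached_type is int and type(a) is int and type(b) is int:
            return a + b  # stands in for a direct machine add
        # Slow path: do the generic operation, then (re)specialize so
        # the next call with the same types can take the fast path.
        result = a + b
        if type(a) is type(b):
            self.cached_type = type(a)
        return result

add = BinaryAdd()
add(1, 2)                # slow path; specializes for int
assert add.cached_type is int
assert add(3, 4) == 7    # now takes the guarded fast path
```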
On the other hand, it is possible that the copy-and-patch JIT itself turns out to be useless even after all the work. In that case there is no other known viable way to enable a JIT without disruption, so a JIT shouldn't be added to CPython. I should have stressed this point more, but "incremental" improvements are really important---they were a primary reason that CPython didn't even try to implement JIT compilation for decades. CPython could give them up, but then there would be one less reason to use (C)Python, so CPython never did. (The GIL is the same story, by the way: the current nogil effort is not possible without other performance improvements that outweigh the potential overhead in the single-threaded setting.)
> As for your example with integer adding, I totally agree with all you said, and that's exactly what I meant by "there’s only so much one can do without touching the data model".
If the data model refers to the publicly visible portion of the interface, I don't think so. Even JS runtimes didn't require any changes to the public interface, and CPython itself already caches a lot of the data model for the sake of performance. I'm not aware of attempts at shape optimizations, but it might be possible to extend the current `__slots__` implementation to allow an adaptive memory layout.
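(A minimal sketch of the `__slots__` point: slotted classes already give attributes a fixed, offset-based layout instead of a per-instance dict, which is the same kind of layout invariant that shape/hidden-class optimizations in JS engines rely on. The class names here are just for illustration:)

```python
# With __slots__, attribute storage becomes a fixed set of descriptors
# at known offsets on the instance; there is no per-instance __dict__.
class Point:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
assert not hasattr(p, "__dict__")  # no dict, fixed layout

# A plain class pays for a dict and allows dynamic attribute creation,
# which is what makes the layout unpredictable for a compiler:
class DynPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

q = DynPoint(1, 2)
assert hasattr(q, "__dict__")
q.z = 3  # possible here; raises AttributeError on the slotted Point
```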
> Ah, so you meant that even all of them, including the specializing interpreter and the copy-and-patch JIT, may not give a reasonable speedup. But I think you have missed the fact that the specializing interpreter already landed in 3.11 and provided a 10--60% speedup
No, I'm talking about a comparison against the current default production build. Exactly what Brandt said in his talk at around 23:30, and what I observed when building his branch.
Then I'm not sure why that would refute the intermediate goal to "enable JIT codegen without sacrificing too much performance" stated in my initial comment, since the proposed copy-and-patch JIT compiler won't make a big impact by itself.