Others have good explanations, but let me put it super simply.
Suppose we have two very large vectors and we want to take the dot product of them. If we store these vectors in standard Python lists, write pure Python, and try to run it in CPython, it is very, very slow, at least relative to how fast almost anything that isn't a dynamic scripting language could do it. (I have to add all those caveats because this is easy mode for a JIT, and PyPy might blast through it. But CPython will be very slow.)
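For concreteness, the slow version is just a loop like this (a minimal sketch; any pure-Python formulation hits the same wall):

```python
# Pure Python: every iteration pays for dynamic dispatch, boxed floats,
# and several bytecode operations just to do one multiply and one add.
def dot_product(a, b):
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total
```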
Suppose instead we have a data type that is a C array under the hood (call it FastVector, a name I'm making up for the sketch), and all we do is this:
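```python
# "FastVector" is a hypothetical stand-in for the C-array-backed type;
# load_from_file and dot_product are the operations described below.
v1 = FastVector.load_from_file("vector1.dat")
v2 = FastVector.load_from_file("vector2.dat")
result = v1.dot_product(v2)
```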
Suppose "load_from_file" loads into an optimized data structure that provides Python access, but under the hood is stored efficiently, and the "dot_product" method leaves Python and directly performs the dot product in some more efficient language.
In this situation, you're running a bare handful of slow Python opcodes, a few dozen at the most, and doing vast quantities of numerical computation at whatever the speed of the underlying language is, which in this case can easily use SIMD or whatever other acceleration technologies are available, even though Python has no idea what those are.
That is the core technique used to make Python go fast. There are several ways that things get from Python to that fast code and underlying structure. The simplest and oldest is to do what I laid out above directly, with external, non-Python implementations of all the really fast stuff. I believe this is the core of NumPy (though I'm not sure), and it's certainly been the core of a lot of other things in Python for decades.
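In NumPy terms, the sketch above looks roughly like this (the file names and dtype are placeholder assumptions; `np.dot` is the real call):

```python
import numpy as np

# The arrays live in contiguous C buffers; np.dot drops into compiled,
# often SIMD-accelerated BLAS code instead of looping in Python.
v1 = np.fromfile("vector1.dat", dtype=np.float64)
v2 = np.fromfile("vector2.dat", dtype=np.float64)
result = np.dot(v1, v2)
```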
There are also approaches involving writing in a Python-like language that can be compiled down (Cython), approaches that compile Python with a JIT into something that, under the hood, no longer uses CPython's interpreter loop (PyPy), and other interesting hybrid approaches.
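For a taste of the Cython flavor, a rough sketch (this goes in a .pyx file and gets compiled to C; typed memoryviews are standard Cython, but the function itself is my own toy example):

```cython
# Typed memoryviews and cdef locals let Cython compile this loop down
# to plain C arithmetic instead of CPython bytecode.
def dot_product(double[:] a, double[:] b):
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total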
CPython is generally slow, but in practice that isn't always a big deal, because if a reasonable fraction of the lines you write hand off to faster code, the slowness of Python itself hardly matters. For example, you can write decent, performant image-manipulation code in Python as long as you're stringing together image-manipulation commands to some underlying library, because then, for each "expensive" Python opcode, you're doing a lot of "real work" at max speed. I scare-quote "expensive" because in that case the Python opcodes may be a vanishing percentage of the cost of the work. The problem with numerical manipulation in pure CPython is that it swings the other way: each Python opcode takes many assembly-level instructions to implement, and it's wildly inefficient to run all of those just to add two integers together, which is a single assembly instruction.
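If you want to see the gap concretely, here's a quick, unscientific timing sketch (exact numbers will vary wildly by machine):

```python
import timeit
import numpy as np

n = 1_000_000
a = [float(i) for i in range(n)]
b = list(a)
a_np = np.array(a)
b_np = np.array(b)

# Same dot product, ten runs each: one pays Python opcode costs per
# element, the other pays them once per call.
pure = timeit.timeit(lambda: sum(x * y for x, y in zip(a, b)), number=10)
fast = timeit.timeit(lambda: a_np.dot(b_np), number=10)
print(f"pure Python: {pure:.3f}s  NumPy: {fast:.3f}s")
```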
Arguably the dominant factor in sitting down with any programming language and writing reasonably performant code on your first pass is understanding the relative costs of things, and being careful not to write code that unnecessarily amplifies the cost of doing the "real work" with too much bookkeeping. The same 1,000 cycles' worth of "bookkeeping" can be anything from a total irrelevancy, if the "payload" for that bookkeeping is many millions of cycles' worth of work, to a crippling amplified slowdown, if the payload is only one cycle's worth of work. Even a "fast" language like C++ can be brought to a crawl if you use enough abstractions, bouncing through enough non-inlinable method calls and the requisite creation and teardown of function frames, just to add two numbers together, and then do it all again a few million times.
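The same amplification effect is easy to demonstrate without leaving Python (a toy sketch; the function call wraps pure bookkeeping around a one-instruction payload):

```python
import timeit

def add(x, y):
    # A full Python function frame is built and torn down for one addition.
    return x + y

inline = timeit.timeit("t = 0\nfor _ in range(n): t = t + 1",
                       globals={"n": 1_000_000}, number=1)
framed = timeit.timeit("t = 0\nfor _ in range(n): t = add(t, 1)",
                       globals={"n": 1_000_000, "add": add}, number=1)
print(f"inline: {inline:.3f}s  through a call: {framed:.3f}s")
```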