1. I don't use Python for this task, but libraries like numpy give you ready access to unboxed arrays. It has become common in scientific codes to glue together "dumb" numeric components (written in C or Fortran) using Python. Threading granularity is limited in this setting by the GIL, to the point where the "smarter" code must also be pushed into the compiled language. To keep the "smart" code in Python, many projects end up using only MPI for parallelism. This was fine until recently, but with modern memory hierarchies and the proliferation of cores within a node, it gives up enough performance to be a problem.
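A minimal sketch of the granularity issue, with illustrative names and sizes not from the original: numpy holds data in unboxed arrays and its compiled inner loops typically release the GIL, so threads calling into them can overlap, while any "smart" logic written as a Python-level loop serializes on the GIL.

```python
import threading

import numpy as np

# Assumed example data; sizes are arbitrary.
n = 1_000_000
a = np.random.rand(n)
out = [0.0, 0.0]

def compiled_sum(idx):
    # The reduction runs in numpy's C loop, which typically releases
    # the GIL, so two such threads can actually run in parallel.
    out[idx] = a.sum()

threads = [threading.Thread(target=compiled_sum, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# By contrast, a hand-written `for x in a: s += x` holds the GIL for
# its whole duration, so adding threads buys nothing -- which is why
# such code gets pushed into C/Fortran or parallelized with MPI.
```

This is only a sketch of why thread-level parallelism stalls at the Python layer, not a recommendation to sum arrays this way.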
2. As I said above, you need to use multiple cores per memory bus to saturate the hardware bandwidth because there is a limited number of outstanding memory requests per core (or hardware thread). Remember that the max bandwidth realized by your application is bounded above by

    (number of outstanding requests) * (bytes per request) / latency
independent of the theoretical bandwidth of the link. Additionally, when you use more hardware threads, you get access to more level 1 caches. On machines with non-inclusive L2/L3, this also means you can fit more in cache.
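A back-of-the-envelope instance of that bound, using assumed (not measured) numbers: roughly 10 misses in flight per core, 64-byte cache lines, and ~80 ns memory latency.

```python
# Assumed, illustrative hardware parameters.
outstanding_requests = 10   # max cache misses in flight per core
bytes_per_request = 64      # one cache line
latency_s = 80e-9           # memory access latency in seconds

# The bound: achievable per-core bandwidth regardless of link speed.
per_core_bw = outstanding_requests * bytes_per_request / latency_s
print(per_core_bw / 1e9, "GB/s per core")  # 8.0 GB/s per core
```

So if the memory bus can deliver, say, 50 GB/s, a single core with these (hypothetical) parameters tops out at 8 GB/s of it, and you need several cores or hardware threads per bus to use the rest.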