Your second point is spot on. I realized this when I tried, to no avail, to optimize numpy code that does millions of dot products between 3x1 vectors and 3x3 matrices. Then I found an awesome little library called tinyarray, which doesn't have numpy's overhead and is compatible with basic numpy syntax. I swapped all my numpy arrays for tinyarrays and got the easiest 10x speedup ever.
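For anyone curious what the swap looks like: a minimal sketch, assuming tinyarray is installed (`pip install tinyarray`) -- it mirrors the basic numpy API (`ta.array`, `ta.dot`), so the change is mostly mechanical. The fallback to numpy is just so the snippet runs either way.

```python
import numpy as np
try:
    import tinyarray as ta  # assumption: tinyarray is installed
except ImportError:
    ta = np  # fall back so the sketch still runs without tinyarray

# same construction syntax as numpy, just a different module prefix
m = ta.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
v = ta.array([1.0, 1.0, 1.0])

# ta.dot mirrors np.dot for these small fixed-size arrays
result = ta.dot(m, v)
print(list(result))
```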
If you look in the numpy source you can begin to understand why the overheads are so large: every operation first has to apply broadcasting rules across different shapes and dtypes, and many entry points (including __getitem__) accept a number of different call signatures. In many cases all of that is implemented in Python.
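You can see that dispatch cost directly: for 3-element arrays a hand-rolled dot over plain tuples does only the arithmetic, with no broadcasting or dtype resolution per call. A rough sketch (timings will vary by machine):

```python
import timeit
import numpy as np

def py_dot3(a, b):
    # no broadcasting, no dtype resolution -- just the arithmetic
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
at, bt = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)

# on typical machines the plain-tuple version is several times faster
# at this size, because np.dot's per-call machinery dominates
t_np = timeit.timeit(lambda: np.dot(a, b), number=100_000)
t_py = timeit.timeit(lambda: py_dot3(at, bt), number=100_000)
print(f"np.dot: {t_np:.3f}s  plain tuples: {t_py:.3f}s")
```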
fwiw - the best you can do here is probably to implement the dots yourself as cdef functions. If your arrays get larger, you can get a C pointer to the BLAS routine from scipy and call it from Cython, which can be faster than the dot you write once the arrays are big enough.
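A rough sketch of both halves of that suggestion (this is Cython, not plain Python, and needs to be compiled; names like `dot3` and `blas_dot` are just illustrative):

```cython
cimport cython
# scipy ships C-level BLAS bindings for exactly this use case
from scipy.linalg.cython_blas cimport ddot

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double dot3(double[::1] a, double[::1] b) nogil:
    # fixed-size dot: no dispatch, no allocation, no Python calls
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

cdef double blas_dot(double[::1] a, double[::1] b) nogil:
    # for larger n, calling BLAS through scipy's C pointer
    # usually beats the hand-written loop
    cdef int n = a.shape[0]
    cdef int inc = 1
    return ddot(&n, &a[0], &inc, &b[0], &inc)
```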