Maybe it's because I'm not an AI expert, but I never thought that specialization for features (whether semantically meaningful or not) was localized to individual neurons rather than spread across the entire net. Why would we think otherwise?
I suppose people assumed the network works in a "divide and conquer" manner, since that usually reduces the complexity of an algorithm? (And it was then assumed that each division corresponds to a clearly delimited region of the previous layer.)
Of course there's no real need for the network to work that way, though perhaps the "divide and conquer" interpretation still holds if we allow the divisions to be "fuzzy"/arbitrary.
I assumed it was because the number of output neurons is small compared to the number of inputs. For example, a digit OCR network takes maybe 10,000 pixels as input but has only 10 possible outputs. I'm no expert either, so any confirmation or refutation would be very welcome :)
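To put rough numbers on that example, here is a minimal sketch in plain Python. It assumes a 100x100 image and a single dense layer with random, untrained weights; the names `make_layer` and `forward` are made up for illustration, and a real OCR net would have hidden layers in between. It only shows the size mismatch: each of the 10 outputs is a weighted sum over all 10,000 pixels.

```python
import random

# Assumed sizes from the example above: 100x100 image, 10 digit classes.
N_INPUTS = 100 * 100   # 10,000 pixels
N_OUTPUTS = 10         # one score per digit

def make_layer(n_in, n_out, seed=0):
    """Return a dense weight matrix: n_out rows, each with n_in weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]

def forward(weights, pixels):
    """Each output is a weighted sum over every input pixel."""
    return [sum(w * x for w, x in zip(row, pixels)) for row in weights]

weights = make_layer(N_INPUTS, N_OUTPUTS)
pixels = [0.5] * N_INPUTS          # a dummy uniform-gray image
scores = forward(weights, pixels)

fan_in = len(weights[0])              # inputs feeding each output neuron
compression = N_INPUTS // N_OUTPUTS   # input values per output value
print(len(scores), fan_in, compression)
```

Whether that ratio actually forces individual neurons to specialize is a separate question; the sketch only illustrates how many inputs each output has to summarize.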