> I can't see anything obvious that would make a certain kernel immune to high load induced issues, but then I'm largely ignorant about L4 et al.
The typical system call in a modern microkernel costs on the order of hundreds of cycles. Fine-grained locking would by itself probably double this cost for almost no gain. Just waiting the hundreds of cycles is likely to be just as good as a fine-grained lock in such kernels, and the complexity savings are significant.
I was mildly in shock when I looked into L4 message-passing times [1]: 250-300 cycles on x86_64 processors, and ~150-300 on ARM.
As somebody interested in concurrency, I wonder what is being done internally to pull off this speed while maintaining cross-thread consistency on an SMT-type system. As far as I know, even an x86_64 memory fence can require upwards of 2,000 cycles.
I understand you create a notification object to pass a message, but don't you have to collect these notification objects at some point?
Edit 1:
I'm reading the manual [2], but I'm just getting more confused. If you are passing register files, swapping stacks normally takes ~3 ns even with small stacks. Or is the 300 cycles just the send time?
Notification objects are only for specific receive patterns, IIRC. Simple message sends go via endpoints, which in seL4 you can think of somewhat like file descriptors: each thread has a kernel-controlled capability space of endpoints it can index to send messages.
The cycle count is for a one-way IPC, which on the fast path roughly consists of:

- entering kernel mode and switching to the kernel stack (~50-100 cycles)
- finding the endpoint named by the system call (roughly thread_control_block.endpoints[registerX], say ~10 cycles)
- switching to the target thread's address space (~100 cycles, maybe more for TLB flushes)
- switching to the new thread's stack and returning to user mode (~50-100 cycles)

On the fast path the whole message is passed in registers. The only memory touched is the original and target thread control blocks, so it's quite fast at ~300 cycles on x86_64.
These are all rough estimates of course, so don't quote me. Longer message sends require copying buffers, which may also incur page-mapping overheads and page faults. A page fault aborts the message send and notifies the disk process to load the missing pages; the send is retried once the page is mapped back into memory.
Edit: as noted in the L4 review paper you linked, the majority of message sends are very small, and most simply pass all data in registers, so the fast-path cost dominates the average cost.