
I mean, the core issue here is that proper engineering just isn't valued.

Social capital just isn't given out to people that fix things in a lot of these companies, but instead those who ship a 1.0a.

On the management/product side, the inevitable issues are a problem for another quarter. On the engineering side, it's a problem for the poor schmucks who didn't get to jump to the next big thing.

Neither of those groups institutionally cares about the mess they leave in their wake, and they'd perceive guardrails as antithetical to releasing the next broken but new, fancy feature.


Except pretty much the entire millennial generation knows about computer folders and files, as that was necessary knowledge for getting through school.

I feel like using spinlocks in user space at all without kernel support like rseq is just asking for weird performance degradations.

I really dislike the use of spinlocks in postgres (and have been replacing a lot of uses over time), but it's not always easy to replace them from a performance angle.

On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake). Turns out that that increase in memory barriers causes regressions that are nontrivial to avoid.

Another difficulty is that most of the remaining spinlocks are just a single bit in a larger 8-byte atomic. Futexes still don't support anything but 4 bytes (we could probably get away with using futexes on part of the 8-byte atomic with some reordering), and unfortunately postgres still supports platforms with no 8-byte atomics (which I think is supremely silly), and the support for a fallback implementation makes it harder to use futexes.

The spinlock triggering the contention in the report was just stupid and we only recently got around to removing it, because it isn't used during normal operation.

Edit: forgot to add that the spinlock contention is not measurable on much more extreme workloads when using huge pages. A 100GB buffer pool with 4KB pages doesn't make much sense.


Addendum big enough to warrant a separate post: The fact that the contention is on a spinlock, rather than a futex, is unrelated to the "regression".

A quick hack shows the contended performance to be nearly indistinguishable with a futex based lock. Which makes sense: non-PI futexes don't transfer the scheduler slice to the lock owner, because they don't know who the lock owner is. Postgres' spinlocks use randomized exponential backoff, so they don't prevent the lock owner from getting scheduled.

Thus the contention is worse with PREEMPT_LAZY, even with non-PI futexes (which is what typical lock implementations are based on), because the lock holder gets scheduled out more often.

Probably worth repeating: This contention is due to an absurd configuration that should never be used in practice.
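For reference, a test-and-set spinlock with the kind of randomized exponential backoff described above looks roughly like this (a minimal sketch with invented names; Postgres' actual s_lock.c differs in detail, e.g. it eventually sleeps rather than just yielding):

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <sched.h>

/* Test-and-set spinlock with randomized exponential backoff.  Because
 * waiters back off instead of hammering the line, the lock holder still
 * gets scheduled and can make progress. */
typedef struct { atomic_flag locked; } backoff_spinlock;

static void spin_lock(backoff_spinlock *l) {
    unsigned cap = 2;
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        /* Wait a random number of iterations, doubling the cap on each
         * failed attempt (capped so the backoff stays bounded). */
        unsigned spins = (unsigned)rand() % cap + 1;
        for (unsigned i = 0; i < spins; i++)
            sched_yield();          /* stand-in for a pause/cpu_relax loop */
        if (cap < 1024)
            cap *= 2;
    }
}

static void spin_unlock(backoff_spinlock *l) {
    /* On x86 this release compiles to a plain store: no lock prefix
     * needed, matching the earlier point about spinlock release. */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```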


Contention doesn't exist in older kernel versions even with huge-pages disabled, no?

The contention does exist in older kernels and is quite substantial.

You said

> Maybe we should, but requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great.

... so that leaves me confused. My understanding is that the regression is triggered with the 7.0+ kernel and can be mitigated with huge pages turned on.

My question therefore was how come this regression hasn't been visible with huge pages turned off with older kernel versions? You say that it was but I can't find this data point.


> ... so that leaves me confused. My understanding is that the regression is triggered with the 7.0+ kernel and can be mitigated with huge pages turned on.

It gets a bit worse with preempt_lazy - for me just 15% or so - because the lock holder is scheduled out a bit more often. But it was bad before.

> My question therefore was how come this regression hasn't been visible with huge pages turned off with older kernel versions? You say that it was but I can't find this data point.

I mean it wasn't a regression before, because this is how it has behaved for a long time.

This workload is not a realistic thing that anybody would encounter in this form in the real world. Even without the contention - which only happens the first time the buffer pool is filled - you lose so much by not using huge pages with a 100GB buffer pool that you will have many other issues.

We (postgres and me personally) were concerned enough about potential contention in this path that we did get rid of that lock half a year ago (buffer replacement selection has been lock free for close to a decade, just unused buffers were found via a list protected by this lock).

But the performance gains we saw were relatively small; we didn't measure large buffer pools without huge pages, though.

And at least I didn't test with this many connections doing small random reads into a cold buffer pool, just because it doesn't seem that interesting.


> On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).

Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) a release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient), along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.

I think that, in common cache coherence protocols, this is kind of straightforward -- the unlock is a store-release, and as long as the cache line ends up being written locally, the hardware or ucode or whatever simply [1] needs to check whether a needs-notification flag is set in the same cacheline. Or the futex-wait operation needs to do a super-heavyweight barrier to synchronize with the releasing thread even though the releasing thread does not otherwise have any barrier that would do the job.

One nasty approach that might work is to use something like membarrier, but I'm guessing that membarrier is so outrageously expensive that this would be a huge performance loss.

But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.
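For concreteness, the lock-word protocol being discussed (bit 0 = locked, bit 1 = waiter present) can be sketched with C11 atomics. The futex syscalls are stubbed out here so the state machine is visible, and the unlock below uses a conventional full RMW (LOCK XCHG on x86); the open question above is whether a cheaper non-LOCKed CMPXCHG could replace it:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Lock word: bit 0 = locked, bit 1 = somebody is (or may be) sleeping in
 * futex_wait().  futex(2) calls are stubbed so the transitions can be
 * followed single-threaded; a real implementation would wait/wake on the
 * word itself. */
enum { UNLOCKED = 0, LOCKED = 1, LOCKED_WAITER = 3 };

static int wake_calls;                       /* stand-in for futex_wake() */
static void stub_futex_wake(void) { wake_calls++; }

static bool mutex_trylock(atomic_uint *w) {
    unsigned exp = UNLOCKED;
    return atomic_compare_exchange_strong(w, &exp, LOCKED);
}

/* A contender flags itself as a waiter before sleeping, so the unlocker
 * knows a futex_wake is needed. */
static void mutex_mark_waiter(atomic_uint *w) {
    unsigned exp = LOCKED;
    atomic_compare_exchange_strong(w, &exp, LOCKED_WAITER);
    /* ...would futex_wait(w, LOCKED_WAITER) here... */
}

static void mutex_unlock(atomic_uint *w) {
    /* Full RMW: clears the lock AND observes whether the waiter bit was
     * set, in one atomically ordered step. */
    if (atomic_exchange(w, UNLOCKED) == LOCKED_WAITER)
        stub_futex_wake();
}
```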

I do expect that it would be nearly impossible to convince an x86 CPU vendor to commit to an answer either way.

(Do other architectures, e.g. the most recent ARM variants, have an RMW release operation that naturally does this? I've tried, and entirely failed AFAICT, to convince x86 HW designers to add lighter weight atomics.)

[0] Visible to the remote thread, but the kernel can easily mediate this, effectively for free.

[1] Famous last words. At least in ossified microarchitectures, nothing is simple.


Using LOCK CMPXCHG or even plain CMPXCHG does not make sense unless it is done in a loop, which tests the success of the operation.

Implementing locks does not need these kinds of loops, which may greatly increase the overhead, but only loops that do simple loads, for detecting changes, or the invocation of a FUTEX_WAIT, which is equivalent to that.

Besides loops that wait for changes, any kind of lock may be implemented with atomic read-modify-write instructions (e.g. on x86 XCHG, LOCK XADD, LOCK BTS and so on, and equivalent instructions on Armv8.1-A or later ISAs) that are not used in loops, so they have predictable overhead. For example, a futex may be used by a thread that waits for multiple events, if the other threads use a locked bit-test-and-set on the futex variable to signal the occurrence of an event, where each event is assigned to a distinct bit.
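The multiple-events scheme described above can be sketched as follows (the helper names are mine, and the futex wait/wake calls are elided and counted instead):

```c
#include <stdatomic.h>

/* One 32-bit futex word, one event per bit.  Signalers use a single
 * atomic OR (LOCK OR / LOCK BTS on x86: one RMW, no retry loop); the
 * waiting thread consumes all pending bits at once. */

static int wakes;                            /* stand-in for futex_wake() */

static void signal_event(atomic_uint *w, unsigned bit) {
    if (atomic_fetch_or(w, 1u << bit) == 0)
        wakes++;       /* word was empty: the waiter may be asleep, wake it */
}

/* The waiter grabs and clears every pending event in one RMW; if the
 * result is 0 it would futex_wait() on the word instead. */
static unsigned consume_events(atomic_uint *w) {
    return atomic_exchange(w, 0);
}
```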

CMPXCHG and the equivalent load-and-lock/store conditional are really needed far less often than some people use them. The culprit is a widely-quoted research paper that has shown that these instructions are more universal than simple atomic fetch-and-operation instructions, allowing the implementation of lock-free algorithms, but the fact that they can do more does not mean that they should also be used when their extra power is not necessary, because that is paid dearly by introducing non-deterministic overhead.

A simple atomic instruction has an overhead much greater than an access to the L1 cache or the L2 cache, but typically the overhead is similar to that of a simple access to the L3 cache and significantly lower than the overhead of a simple access to the main memory, which remains the most expensive operation in modern CPUs.

Moreover, while mutual exclusion can be implemented reasonably efficiently with locks, it is also used far more often than necessary. It is possible to implement shared buffers or message queues that use neither mutual exclusion nor optimistic access that may need to be retried (a.k.a. lock-free access), but instead of those they use dynamic partitioning of the shared resource, allowing concurrent accesses without interference.

Organizing the cooperation between threads around shared buffers/message queues is frequently much better than using mutual exclusion, which stalls all contending threads, serializing their execution, and also much better than lock-free access, which may need an unpredictable number of retries when contention is high.
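A minimal instance of that partitioning idea is a single-producer/single-consumer ring, where each side owns its own index, so neither mutual exclusion nor a CAS retry loop is needed (layout and names here are mine):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Single-producer/single-consumer ring buffer: the producer owns `head`,
 * the consumer owns `tail`, so the two sides never contend on the same
 * index and no operation ever has to retry. */
#define QCAP 8                      /* power of two */

typedef struct {
    atomic_size_t head, tail;       /* producer writes head, consumer tail */
    int slots[QCAP];
} spsc_t;

static bool spsc_push(spsc_t *q, int v) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QCAP)
        return false;               /* full */
    q->slots[h % QCAP] = v;
    /* Release ordering publishes the slot write before the new head. */
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

static bool spsc_pop(spsc_t *q, int *out) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h)
        return false;               /* empty */
    *out = q->slots[t % QCAP];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```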


You are misunderstanding me, which is perhaps understandable, since I’m talking about the minutiae of x86, not locking in general.

When unlocking a futex-backed mutex, one needs to do two things. First, one needs to actually unlock it: this is a store-release in modern lingo, and on x86 almost any store instruction has the correct ordering semantics. Second, one needs to determine whether to call futex_wake, which is conceptually just reading a flag “is someone waiting” and then branching on the result. The problem is that the load needs to be ordered after (or at least not before) the store.

x86 provides two main ways to do this, MFENCE and LOCK. For whatever reason, at least Intel has tried pretty hard to optimize LOCK, and it’s often the case that LOCKed operations on a hot cache line are faster than MFENCE. (I have benchmarked this, and Linux uses this trick.)

My point is that the specific algorithm of unlocking a futex-backed mutex does not require the full ordering semantics of MFENCE or LOCK. And my secondary observation is that x86 has some non-LOCKed RMW instructions, one of which is plain CMPXCHG. Unlocked CMPXCHG is much faster than LOCK anything or MFENCE — I’ve benchmarked it. There are also the flag outputs from operations like ADD. And I’m speculating that maybe some of these instructions are secretly actually ordered strongly enough for futex unlock.


> > On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).

> Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) an release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.

Hah.

> ... > But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.

I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release just is a store, whereas a futex release has to be a RMW operation.

I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line. Due to being a somewhat contended simple spinlock, often embedded on the same line as the to-be-protected data, it's common for the line not to be in modified ownership anymore at release.


> I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release just is a store, whereas a futex release has to be a RMW operation.

> I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line.

I don’t think so. The CPU is pretty good about hiding that kind of latency — reading a contended cache line and doing a correctly predicted branch shouldn’t stall anything after it.

But LOCK and MFENCE are quite expensive.


That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty. And there are like ten open coded spin waits around the uses... you certainly have my empathy :)

This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?

Somebody tried a long time ago, it got dropped but I didn't actually see any major objection: https://lore.kernel.org/lkml/20070327110757.GY355@devserv.de...


> That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty.

Turns out to be pretty crucial for performance though... Not manipulating them with a single atomic leads to way way worse performance.

For quite a while it was a 32bit atomic, but I recently made it a 64bit one, to allow the content lock (i.e. protecting the buffer contents, rather than the buffer header) to be in the same atomic var. That's, for one, nice for performance; it's e.g. very common to release a pin and a lock at the same time, and there are more fun perf things we can do in the future. But the real motivation was work on adding support for async writes - an exclusive locker might need to consume an IO completion for an in-flight write that is preventing it from acquiring the lock. And that was hard to do with a separate content lock and buffer state...
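To illustrate why the packing helps (with an invented bit layout; Postgres' real buffer header differs), releasing a pin and the lock together can be a single RMW instead of two:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Invented layout: bits 0..31 = refcount, bit 32 = header lock,
 * bits 33+ = flags.  Keeping them all in one 64-bit atomic means
 * related updates are one RMW on one cache line instead of several. */
#define REFCOUNT_ONE ((uint64_t)1)
#define LOCK_BIT     ((uint64_t)1 << 32)
#define DIRTY_FLAG   ((uint64_t)1 << 33)

typedef _Atomic uint64_t buf_state;

static void pin_buffer(buf_state *s)  { atomic_fetch_add(s, REFCOUNT_ONE); }

/* Simplified: assumes the lock is uncontended. */
static void lock_header(buf_state *s) { atomic_fetch_or(s, LOCK_BIT); }

static void unpin_and_unlock(buf_state *s) {
    /* Drop one reference and clear the lock bit together.  Plain
     * subtraction is safe because the caller holds the lock (bit set)
     * and a pin (refcount > 0), so neither field can underflow. */
    atomic_fetch_sub(s, LOCK_BIT | REFCOUNT_ONE);
}
```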

> And there are like ten open coded spin waits around the uses... you certainly have my empathy :)

Well, nearly all of those are there to avoid needing to hold a spinlock, which, as lamented a lot around this issue, doesn't perform that well when really contended :)

We're on our way to barely ever need the spinlock for the buffer header, which then should allow us to get rid of many of those loops.

> This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?

It'd be pretty nice to have. There are a lot of cases where one needs more lock state than one can really encode into 32 bits.

I'm quite keen to experiment with the rseq time slice extension stuff. Think it'll help with some important locks (which are not spinlocks...).


> Turns out to be pretty crucial for performance though...

I don't doubt it. I just meant nasty with respect to using futex() to sleep instead of spin, I was having some "fun" trying.

I can certainly see how pushing that state into one atomic would simplify things, I didn't really mean to question that.

> We're on our way to barely ever need the spinlock for the buffer header, which then should allow us to get rid of many of those loops.

I'm cheering you on; I hadn't looked at this code before and it's been fun looking through some of the recent work on it.

> It'd be pretty nice to have. There are lot of cases where one needs more lock state than one can really encode into a 32bit lock state.

I've seen too much open coded spinning around 64-bit CAS in proprietary code, where it was a real demonstrable problem, and similar to here it was often not straightforward to avoid. I confess to some bias because of this experience ("not all spinlocks...") :)

I remember a lot of cases where FUTEX_WAIT64/FUTEX_WAKE64 would have been a drop-in solution, that seems compelling to me.


> I feel like using spinlocks in user space at all without kernel support like rseq is just asking for weird performance degradations.

Yeah, exactly. "Doctor, help, somebody replaced my wooden hammer with a metal one, and now I can't hit myself in the face with it as many times."

If you use spinlocks in userspace, you're gonna have a bad time.


Most people looking for performance will reach for the spinlock.

The expectation is that the kernel should somehow detect applications that are spinning, and avoid preempting them early.


Well that seems like an unreasonable expectation no? Also isn't the point of spinlocks that they get released before the kernel does anything? Otherwise you could just use a futex... Which maybe you should do anyway...

https://matklad.github.io/2020/01/04/mutexes-are-faster-than...


The scheduling is based on how much the LWP made use of its previous time slices. A spinning program clearly is using every cycle it's given without yielding, and so you can clearly tell preemption should be minimized.

If you are spinning so long that it requires preemption, you're doing something wrong, no?

It doesn't matter, it's a long tail thing: on average user spinlocks can work, and even appear to be beneficial on benchmarks (for many reasons, Andy alludes to some above). But if you have enough users, some of them will experience the apocalyptic long tail, no matter what you do: that's why user spinlocks are unacceptable. RSEQ is the first real answer for this, but it's still not a guarantee: it is not possible to disable SCHED_OTHER preemption in userspace.

If I make something 1% faster on average, but now a random 0.000001% of its users see a ten-second stall every day, I lose.

It is tempting to think about it as a latency/throughput tradeoff. But it isn't that simple, the unbounded thrashing can be more like a crash in terms of impact to the system.


Well, you can always pin to a core and move other threads out of that core.

That's what you'd do if manually scheduling. Ideally the dynamic scheduler would do that on its own.


Sure. But if you squint even that isn't good enough, you'll still take interrupts on that core in the critical section sometimes when somebody else wants the lock.

The other problem with spin-wait is that it overshoots, especially with an increasing backoff. Part of the overhead of sleeping is paid back by being woken up immediately.

When it's made to work, the backoff is often "overfit" in that very slight random differences in kernel scheduler behavior can cause huge apparent regressions.


PostgreSQL is old and had to support kernels which did not support futexes. But, yes, maybe PostgreSQL should stop doing so now that kernels do.

No joke, I worked at a place where, in our copy of the system headers, we had to #define near and far to nothing. That was because (despite not having supported any systems where this was applicable for more than a decade) there was a set of files that were considered too risky to make changes in that still had DOS-style near and far pointers that we had to compile for a more sane linear address space. https://www.geeksforgeeks.org/c/what-are-near-far-and-huge-p...
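The hack in question looks roughly like this (reconstructed from memory of the pattern, not their actual header):

```c
/* Neutralize DOS-era pointer keywords so legacy declarations compile
 * unchanged for a flat (linear) address space.  Modern compilers don't
 * know `near`/`far`, so the empty macros simply erase them. */
#define near
#define far

char far *video_mem;                      /* now just `char *video_mem;` */

int near add_checksum(int a, int b) {     /* now just `int add_checksum(...)` */
    return a + b;
}
```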

Now, I'm just a simple country engineer, but a sane take on risk management probably doesn't prefer de facto editing files by hijacking keywords with preprocessor magic over, you know, just making the actual change, reviewing it, and checking it in.


Meanwhile Europe has access to $30k EVs because they didn't slap a 100% tariff on Chinese EVs. Hell, a Leapmotor T03 is almost down to $20k.


Not 100%, but Europe has a 45% tariff rate on SAIC, 27-40% tariff rate on other Chinese EVs.

I guarantee you that these will increase as European manufacturers feel the pain.


They have done a bit of this. SMIC is basically operating off of a cloned TSMC N7 node that they have since iterated on to get to a 5nm class node.


But it's still such a complex sort of beast.

Even if you had 'AI tools' guessing at component blocks, you would have to have some evaluation of the result.

And that's assuming NVDA hasn't pulled a Masatoshi Shima type play on their designs (i.e. complex traps that could require lots of analysis to determine if they are real or fake).

I'm not sure how much of a speedup even modern tooling/workflows could provide reliably.

Even then,

The elephant in the room is that China is working on their own AI accelerators/etc, so while there can be benefit from -studying- the existing designs, I don't think they want to clone them regardless.


Oh, absolutely. Straight up soviet style cloning of masks makes no sense for a multitude of reasons. In addition to what you've said, China isn't banned from N7 class Nvidia architectures, so they could just buy those on the open market.


They definitely are using Nvidia. Part of DeepSeek's special sauce was using an "undocumented" PTX instruction to get a cute microoptimization with the memory hierarchy.

https://youtube.com/watch?v=iEda8_Mvvo4


Did they originally say it was a grain-of-rice-sized Ethernet module?

I thought it was supposed to be an incredibly tiny micro sitting on the BMC's boot flash to inject vulnerabilities.


I recall, at the time, Bloomberg and their source were talking about a tiny chip on the BMC that was masquerading as a resistor.

However, they did not produce any concrete evidence, citing an NDA between that security company and their client.


Even that makes little sense.

A malicious modification to the flash content would leave no physical evidence…


I thought the point was an extra chip in the place of a pull up resistor or something that would edit the firmware image as it made its way across the bus, so you wouldn't see the modifications even if you pulled the flash chip and read it out manually, and would also be persistent across flash updates.


Well, also had other pen testers come forward saying that they had found implants on supermicro servers and had talked to federal authorities who had said it was a known relatively large issue they were trying to get a handle on while keeping it under wraps.

And if it were posted to move the market, that would have been about the most cut and dry SEC violation possible, posted at a time when the federal government still enforced such things.


> Well, also had other pen testers come forward saying that they had found implants on supermicro servers and had talked to federal authorities who had said it was a known relatively large issue they were trying to get a handle on while keeping it under wraps.

Jeez, if only they had said that all these unnamed pen testers had said they had found implants and had talked to all these unnamed federal authorities, I'd have approached the question with an entirely different set of Bayesian priors! Thanks for filling in the blanks on that.

> And if it were posted to move the market, that would have been about the most cut and dry SEC violation possible, posted at a time when the federal government still enforced such things.

It's (allegedly) company policy: https://www.politico.com/blogs/media/2013/12/the-bloomberg-m...


There've been plenty of computer security issues wrt state actors that we in the industry assume are true, without any actual evidence beyond a whistleblower saying that it happened, combined with plausibility that it could have happened.

Room 641A is a good example.

And the bonus is for true information. My point was that the only reason to move the market with false information was for market manipulation.

