coekie's comments

coekie · 2025-08-19T13:33:12 1755610392

Yes indeed. A nice surprise to stumble upon my old blog post again here.

I don't think much has changed since then, most of the problems are still there in newer Java versions. And that's fine, in practice they don't hurt. Still imo interesting curiosities of the type system.

coekie · on Sept 13, 2024

No, that must have been something different.

It would be possible to extend what FlowTracker does to also find SQL (or other) injection vulnerabilities. So it's possible the tool you're thinking of used a similar approach.

coekie · on Sept 13, 2024

Thanks.

As I was developing FlowTracker, a lot of the work was driven by making tracking of specific example programs work. I knew what result I was aiming for, but it was hard to predict what lower level mechanisms needed to be supported to make a specific example work. That often depended on internal implementation details of the JDK or libraries being used where the data was passing through.

But the HTML element linking back to the SQL script that added that data into the database wasn't like that. I didn't expect or work towards it, that just happened, so it blew me away a little too and got me excited about what else this approach could accomplish.

immibis · on Sept 13, 2024

A great example of how design of good products should be guided by the end goal instead of by the technical mechanism, when possible. You went out of your way to make sure the functionality was not limited by a certain single mechanism.

Noumenon72 · on Sept 13, 2024

I didn't make it to that element of the demo because I don't need a tool to help me find which file HTML text strings are from or that HTTP headers come from my web server. So I would recommend putting that "wow" element earlier in the demo.

noisy_boy · on Sept 14, 2024

Or maybe split the demo into shorter demos/gifs where each highlights a specific part. Very cool project, should get more attention.

coekie · on July 12, 2023

I just added solutions to the empty String challenge in the blog post. This includes a very interesting find from Xavier Cooney, that causes the same problem without involving any concurrency. It instead makes StringBuilder misbehave by throwing an exception at an unexpected place: https://gist.github.com/XavierCooney/e9f6235f05479ac6bf962ca...

coekie · on July 12, 2023

Author of the blog here. Yes, this is a bug. While the javadoc doesn't state it explicitly, immutable classes in the core library are expected to be thread safe. Java tries to be and mostly succeeds at being a safe language, where (by default) the guarantees of its internals cannot be broken no matter what user code does. The JVM preserves its own integrity.

There are some other deliberate holes in that safety, such as using reflection to access private members, and instrumentation, where it is clear you are stepping outside the safety zone, but even that is getting locked down now with the "Integrity and Strong Encapsulation" effort https://openjdk.org/jeps/8305968 .

In general, most code we write would not (and should not) try to protect against such abuse, but classes in the core platform play by different rules.

Btw, this has been accepted into the bug database now: https://bugs.java.com/bugdatabase/view_bug?bug_id=JDK-831190... . I expect this to be fixed in a future release.

marginalia_nu · on July 12, 2023

Dunno, this seems like you're violating the java memory model and then then fully expected weird shit happens. If you're mutating a shared state between threads without proper synchronization, and there is no way for String to do this on its own, the CPU is permitted to do stuff like out-of-order execution as there is no happens-before relationship between the write and the read. The JVM is further permitted to optimize the code with the assumption that the data stays the same in the absence of volatile/synchronized/other synchronziation primitives, as long as the observable outcome is the same "as if serial" execution.

The understanding 'It first tries to encode the characters as latin-1 using StringUTF16.compress. If that fails, it returns null and the constructor falls back to using UTF-16. ' is incorrect in unsynchronized code. The reality is that it's permitted to do all these things at the same time in any order (or speculatively do both at the same time) as long as the observable consequences are the same. This relies on the assumption that the data stays the same. If you violate that assumption, you get bizarre and often unpredictable behavior.

I don't think this is a bug in String. It may be a security vulnerability in the JVM, in a log4j-esque "it's working as we intended but holy crap did we ever not intend on this interaction" manner.

whack · on July 12, 2023

I didn't think it was a bug either till I got to the "Spooky action at a distance" section. The fact that the following code:

"hello world".startsWith("hello")

can return false in any circumstance, is a bug in the language. The fact that some other code in an entirely unrelated part of the codebase can intern a broken string and thus break string-comparisons for the entire codebase, is mind boggling.

Fortunately, I don't think the fix needs to be too expensive. After a defensive copy of the byte-array is made, just before returning the newly constructed string object, verify once again that the chosen coder matches the string contents. If it doesn't, that would imply a race condition, and the broken string instance should not be returned

cempaka · on July 12, 2023

> After a defensive copy of the byte-array is made, just before returning the newly constructed string object, verify once again that the chosen coder matches the string contents.

Allocating new String objects might very well be the most frequent memory allocation operation in the JVM. IMO it would be a mistake to do anything to slow this down to protect against this weird instantiation behavior, unless it implies a security vulnerability which can't be fixed by guards at the interning step.

whack · on July 12, 2023

The language maintainers have already decided to slow down string-construction by calling StringUTF16.compress(value) which checks if the input can be converted into LATIN1 and if so, creating a LATIN1 byte-array representation from the ground up.

Compared to this, I wager that the incremental cost is small to verify that the coder matches the content.

Also, depending on the implementation of string-interning and StringUTF16.compress, it's possible that the verification step is only needed for non-LATIN1 strings. Which according to the JEP is only a small minority of strings seen.

If there is a cheaper solution, I'm certainly all for it. I'm just spitballing ideas here. But I don't think it is acceptable to allow "hello world".startsWith("hello") to return false.

masklinn · on July 12, 2023

I don’t think you need any defensive copy, just to have compress switch to utf16 mode when it encounters a non-latin1 char and go on from there.

It would likely be faster since currently the implementation restarts the entire conversion from 0 in utf16 mode.

Genbox · on July 12, 2023

Making the 99% case faster at the cost of making the 1% case slower, is a speedup.

marginalia_nu · on July 12, 2023

A defensive copy probably doesn't help here. The JVM is permitted to optimize it away, since it has no externally observable effects and doesn't have a constructor.

orra · on July 12, 2023

Wait, really? It's very externally observable in the presence of multithreading. Surely there must be a way to say "no, really, please make a copy".

dtech · on July 12, 2023

afaik the person you replied to is wrong, using System.arrayCopy would definitely make a copy and not be optimized away.

Maybe poster meant only copying the reference with `char[] copy = original` which indeed does nothing, but that's not a defensive copy.

marginalia_nu · on July 12, 2023

This copy has no external observable effects (it's only temporarily used in the constructor and never stored anywhere), and since the JVM is (under the JMM) permitted work under the assumption the original array doesn't change unless the thread itself changes it or there is some happens-before relationship with something that does, and is permitted to re-order same-thread execution as-if-serial, it's fully permitted to assume the original array is identical to the copied array, and it may elide the copy, since the original array isn't stored or mutated in any way.

Whether the JVM actually does this for arraycopy is an implementation detail. The JVM is permitted to perform this type of escape analysis and elision, and definitely does for some allocations.

In general arrays and concurrency doesn't mix very well in Java, and is something I'd avoid mixing on principle. You avoid most of these types of problems by having threads talk by passing around immutable objects via concurrency primitives, volatile references, synchronized-blocks, etc.

guipsp · on July 14, 2023

You are incorrect - it is allowed to behave as if the original didn't or did change, that is true and in the memory model. But once the copy is done, then the state is settled, and is no longed allowed to change at any point. There are no more conflicting accesses.

It is "fully permitted to assume the original array is identical to the copied array" only to the extent that affects reading from the original array, not the copy (ie, it can pretend that the original didn't change, even it did).

Anything else would break causality requirements (JLS§17.4.8.).

orra · on July 12, 2023

> In general arrays and concurrency doesn't mix very well in Java, and is something I'd avoid mixing on principle.

Sound sensible. I think what you've said is any semi-malicious code can give you a primitive array, then modify it from under you in another thread. That's not completely surprising but the clincher is, you can't even take a defensive copy.

marginalia_nu · on July 12, 2023

Well I mean there are things you can do prevent the JVM from this particular optimization. But you need to use the right tools to tell Java to expect spooky action at a distance, and arraycopy just isn't sufficient.

This would probably do

  byte[] b;

  // A
  synchronized (a) { 
    b = System.arraycopy(a);
  }
  // B

Here A happens-before B, and you can assume a read fence at the start of synchronized block and a store fence at end of it, so consumers of b should get a consistent view of the array, and the synchronized block tells Java that you expect a to be modified by another thread so it can't assume it stays the same throughout the execution; and elision of the arraycopy isn't possible; so while the array may be in a weird state, it will at least stay consistent throughout the execution.

coekie · on July 12, 2023

> The reality is that it's permitted to do all these things

Yes, given the code in the String class, the JVM is allowed to execute it in this way. I'm not claiming that is wrong, I'm claiming the code in the String class should not have been written in a way that allowed the JVM to do that.

> If you violate that assumption, you get bizarre and often unpredictable behavior.

Java is (supposed to be) a safe language, that maintains its integrity. It does not have the effect that doing something that was not documented as being safe leads to _undefined behaviour_ and therefor all bets are off (like e.g. C).

I'm pretty sure the JDK maintainers intended for String to be a thread safe class with guarantees that can not be broken with race conditions. It's true that the thread safety of String is not explicitly documented, but I believe that's just an oversight in the docs, that String being thread safe just goes without saying. I realize that's hand-wavy, and I don't have any hard proof for that right now, so I understand not everyone might be convinced of that. I'll wait for the bugfix to make that more clear. I don't expect that fix to have a significant performance impact. Fixing it doesn't necessarily require more barriers or more defensive copies. See e.g. this draft patch (incomplete, but good enough to give a rough idea of how it could be done): https://gist.github.com/amaembo/83b4aeab1cd190f2e1dbd75eacb3... .

But the combination with interning, the "hello world".startsWith("hello") example, is clear evidence that it is a bug. Some code running in the same JVM having triggered some race condition earlier should not break unrelated code, it should not put the JVM in a broken state where such code behaves in nonsensical ways.

haspok · on July 12, 2023

Something like a ConcurrentModificationException would be nice to have in the "weird shit happens" case, because you may not even be aware of it, and then it might cause headache much further down the line. If you could point upstream that would save a lot of headache and grey hair...

marginalia_nu · on July 12, 2023

There's no way to reliably detect this type of concurrent modification that doesn't involve introducing expensive barrier operations that evict the CPU cache on multiple points along the code flow and choke out-of-order execution and limit what the JIT can do. If you do not have such a barrier, the CPU is permitted to assume the data hasn't been changed, and may cache it, making detection of concurrent modification impossible. The JIT will also optimize your code based on this assumption. Even if you go out of your way to check that the data is the same you may see stale data when you check!

It may run your code out of order, and even JIT-recompile it out of order, and your code that's like

  int oldValue = val;
  doSomething(val); 
  if (val != oldValue) throw new ConcurrentModificationException();

may be re-recompiled like this, since it looks like doSomething can't modify val or oldValue

  int oldValue = val;
  if (val != oldValue) throw new ConcurrentModificationException();

  doSomething(val);

and then since this if check is trivially always unnecessary, and nothing of consequence ever reads oldValue it may delete them entirely

  doSomething(val);

Introducing a barrier would likely make most Java programs 10x slower 'cause Java does a lot of throwaway string allocation on the assumption it's cheap and short-lived GC is virtually free.

The Java Memory Model is a great accomplishment and if you heed it you will write performant multi-threaded applications with an ease that is rare in other languages. If on the other hand you do not heed the JMM, you will at best get a CME alerting you of this case, but in most cases just get holes in your feet.

haspok · on July 12, 2023

> it looks like doSomething can't modify val or oldValue

...sure, if you are using primitives. But here we have an array (reference), the contents of which can be modified by `doSomething(val)` in your example. It is only the reference value that cannot be modified.

marginalia_nu · on July 12, 2023

I was trying to reduce the amount of concepts I had to invoke to make the example make sense.

java.lang.String is a final class so in this case so doSomething can't be overridden, and strong assumptions can be made about doSomething's behavior and in practice it's likely inlined given that this is an extremely hot method.

Since this is a constructor, the reference can't be aliased by e.g. fields in the class, so static analysis of the code is surprisingly easy. The compiler just needs to answer whether this particular reference is mutated by the invoked function(s).

haspok · on July 12, 2023

OK, this makes sense, thanks.

shultays · on July 12, 2023

char[] is not an immutable class. Thread safety issue happens on that type, not the fault of String type.

It is weird Java considered it a bug, I expect it to be resolved as "wont do"

moring · on July 12, 2023

Thread safety of char[] is the cause of the bug, but the effect is that String instances can violate their invariants, which they shouldn't even in case of thread-unsafe code.

pshirshov · on July 12, 2023

In my opinion the abstraction is leaky but everything works according to the spec.

I might tell you about another "bug" and "invariant violation" which is possible but is not a bug. Try to use Sets or Maps with keys having broken hashcode.

orra · on July 12, 2023

I can't agree with this analogy. This is not a defective user implementation of hashcode; it is a defective platform implementation of equals (and starts with, etc.).

pshirshov · on July 12, 2023

This is a broken user code written without any JMM understanding.

thfuran · on July 12, 2023

A proper implementation of hashcode can return nonsense if you mutate the object during hashing.

orra · on July 12, 2023

Sure, but here we're not mutating the object during the String.startsWith() call. Hence I'd say it's a TOCTOU bug in String.valueOf().

pshirshov · on July 12, 2023

Again, this is not a bug. There are many, way too many places in standard library which are thread unsafe. Some of these are actual bugs, but not this one. If Java originally moved in Odersky's direction that won't be the case but now it's way too late to point fingers at these "quirks". In order to properly fix these issues the whole standard library has to be redesigned around immutability and, probably, immutable arrays must be supported at VM level. This is not going to happen.

But it's specified that standard Java classes are thread unsafe by default. So this is a specified behaviour.

thfuran · on July 12, 2023

This is pretty much the same as breaking a Map by mutating a key during insertion.

pshirshov · on July 12, 2023

Precisely.

Is it good that this can happen? No, Java must be more like Scala and push immutability as sane default.

Is it a bug? No.

Is it possible to "fix" the language and the VM at this stage? Yes, but that would be a new language and new VM. So, that's not going to happen.

Genbox · on July 12, 2023

Immutability and thread-safety are two different things. There are no thread-safety guarantees made by the JVM or the runtime for the String type, and I suspect your bug report will get the same answer.

You are right that there is an expecation or assumption of such (from the users), which makes it even more interesting when it breaks.

Locking a thread for every allocation of a string is very likely cost-prohibitive and not a great trade-off for the common case. It would make more sense to add something like StringFactory.CreateString(char[]) and instruct developers in using it if thread safety is nessecary.

haspok · on July 12, 2023

How would you fix this? Possibly by copying the input array to a local array as the first step? But then you might copy a lot of data, which is in 99% of the cases totally unnecessary. Have a separate threadsafe and a non-threadsafe constructor for char[]?

EDIT: another idea is to do a hash of the input array, and compare between start and end. But that also requires an O(n) walk through the input...

tpxl · on July 12, 2023

> But that also requires an O(n) walk through the input...

Compressing requires an O(n) walk through the input either way.

orra · on July 12, 2023

Yes. In fact, both String.valueOf(char[]) and constructor String(char[]) are already documented as copying the array contents.

silon42 · on July 12, 2023

/s redesign a language to prevent parallel mutable use of the array

pshirshov · on July 12, 2023

This won't happen (unfortunately) for obvious reasons.

the_mitsuhiko · on July 12, 2023

I'm not sure how much of the underlying implementation details of arrays are leaking into JNI, but one way to approach it would be with first class copy-on-write for arrays or opt-in immutability (freezing).

masklinn · on July 12, 2023

> How would you fix this? Possibly by copying the input array to a local array as the first step?

Upon encountering a non-Latin1 char, convert the existing latin1 array to an utf16 one, append the char (converted), then resume iteration in utf16 copy mode.

This way you know the char which made you bail out on Latin1 is part of the output.

barrkel · on July 12, 2023

Copy the array, then use alias analysis to remove the copy most of the time, by having two versions of the String constructor, with a version chosen based on aliasing.

ris58h · on July 12, 2023

Accepting an issue doesn't mean that there is a bug.

> While the javadoc doesn't state it explicitly, immutable classes in the core library are expected to be thread safe.

If it's not documented it's just an assumption.

PedroBatista · on July 12, 2023

Yes, but at some point one has come out of the theory domain and face the real world. So in practical terms it’s closer to a bug and even the “official” guys from Java accepted it as a bug.

ris58h · on July 12, 2023

Could you provide a link to the documentation to proof it's a bug?

The author could call it 'probably bug' or 'misleading String constructor documentation' but instead they state it's a bug without any proofs.

PedroBatista · on July 12, 2023

Here: https://bugs.java.com/bugdatabase/view_bug?bug_id=JDK-831190...

It was accepted as a bug. I'm not saying I would condemn anyone to the death penalty based on that but being accepted as a bug there has to have a significant weight attached to it.

ris58h · on July 12, 2023

> It was accepted as a bug.

I know. It was in this thread above. So there is no link to the specification to proof that it's a bug.

pshirshov · on July 12, 2023

In my opinion this is not a bug and there is no general fix for that without significant overhaul of the whole runtime and standard library.

In my opinion the author might need to read the vm and memory model spec and understand that such things are expected in Java.

There are real bugs in standard library singletons (e.g. concurrent Filesystem calls might fail during singleton initialization, while plugins are loaded). These bugs are being ignored for years.

The author must not expect any "fix" for this in foreseeable future, but they might expect being laughed at.

marginalia_nu · on July 12, 2023

Eh, I don't think the author is to be ridiculed. We're all wrong from time to time, I'm wrong in most of what I say, I'm sure you've been wrong on occasions, this is fine.

The author's views on how concurrency works is very common and taught in schools and textbooks, but also quite inadequate.

Instead let's take this opportunity to talk and teach about the JMM and deepen the collective understanding of the many unintuitive behaviors of multi-threaded applications.

haspok · on July 12, 2023

Maybe there could be a JMM 2.0 where at least some of the unexpected behaviour is removed. Maybe even at the cost of performance. If something is misunderstood by >99% of the average programmer population then clearly it is not the best solution to the problem.

marginalia_nu · on July 12, 2023

If that many programmers don't understand the JMM, it's because that many programmers haven't looked at it.

It is the key to reasoning about Java concurrency. Any other model that is not the JMM is flat out wrong. It is virtually impossible to write correct concurrent Java without understanding the memory model.

barrkel · on July 12, 2023

A defensive copy is the cheap way to fix this.

The bug is a potential security vulnerability since it can affect the value of interned strings, which can break other libraries.

mmis1000 · on July 12, 2023

Even the example in article alone. Imagine someone intern a broken "script" string, and people trying to sanitize HTML script tags no longer find it.

pshirshov · on July 12, 2023

That still has to be done ad-hoc and might kill performance for many usecases.

This is not a bug, but a specified behavior.

barrkel · on July 12, 2023

Changing the evaluation of the expression x.equals(y) to false when x and y are strings with the same value is not a specified behavior.

pshirshov · on July 12, 2023

It is. The "buggy" code isn't marked as thread safe.

coekie · on Feb 3, 2020

You can see what IP it uses to send requests to google using https://captcha.nsa.gov/search?q=what+is+my+ip

AnssiH · on Feb 3, 2020

The link didn't work for me (i.e. just got regular results) until I added &hl=en to get the English version: https://captcha.nsa.gov/search?q=what+is+my+ip&hl=en

Apofis · on Feb 3, 2020

Another write up at the NSA.