Yes indeed. A nice surprise to stumble upon my old blog post again here.
I don't think much has changed since then, most of the problems are still there in newer Java versions. And that's fine, in practice they don't hurt. Still imo interesting curiosities of the type system.
It would be possible to extend what FlowTracker does to also find SQL (or other) injection vulnerabilities. So it's possible the tool you're thinking of used a similar approach.
As I was developing FlowTracker, a lot of the work was driven by making tracking of specific example programs work.
I knew what result I was aiming for, but it was hard to predict what lower level mechanisms needed to be supported to make a specific example work. That often depended on internal implementation details of the JDK or libraries being used where the data was passing through.
But the HTML element linking back to the SQL script that added that data into the database wasn't like that.
I didn't expect or work towards it, that just happened, so it blew me away a little too and got me excited about what else this approach could accomplish.
A great example of how design of good products should be guided by the end goal instead of by the technical mechanism, when possible. You went out of your way to make sure the functionality was not limited by a certain single mechanism.
I didn't make it to that element of the demo because I don't need a tool to help me find which file HTML text strings are from or that HTTP headers come from my web server. So I would recommend putting that "wow" element earlier in the demo.
I just added solutions to the empty String challenge in the blog post.
This includes a very interesting find from Xavier Cooney, that causes the same problem without involving any concurrency. It instead makes StringBuilder misbehave by throwing an exception at an unexpected place: https://gist.github.com/XavierCooney/e9f6235f05479ac6bf962ca...
Author of the blog here.
Yes, this is a bug. While the javadoc doesn't state it explicitly, immutable classes in the core library are expected to be thread safe. Java tries to be and mostly succeeds at being a safe language, where (by default) the guarantees of its internals cannot be broken no matter what user code does. The JVM preserves its own integrity.
There are some other deliberate holes in that safety, such as using reflection to access private members, and instrumentation, where it is clear you are stepping outside the safety zone, but even that is getting locked down now with the "Integrity and Strong Encapsulation" effort https://openjdk.org/jeps/8305968 .
In general, most code we write would not (and should not) try to protect against such abuse, but classes in the core platform play by different rules.
Dunno, this seems like you're violating the java memory model and then then fully expected weird shit happens. If you're mutating a shared state between threads without proper synchronization, and there is no way for String to do this on its own, the CPU is permitted to do stuff like out-of-order execution as there is no happens-before relationship between the write and the read. The JVM is further permitted to optimize the code with the assumption that the data stays the same in the absence of volatile/synchronized/other synchronziation primitives, as long as the observable outcome is the same "as if serial" execution.
The understanding 'It first tries to encode the characters as latin-1 using StringUTF16.compress. If that fails, it returns null and the constructor falls back to using UTF-16. ' is incorrect in unsynchronized code. The reality is that it's permitted to do all these things at the same time in any order (or speculatively do both at the same time) as long as the observable consequences are the same. This relies on the assumption that the data stays the same. If you violate that assumption, you get bizarre and often unpredictable behavior.
I don't think this is a bug in String. It may be a security vulnerability in the JVM, in a log4j-esque "it's working as we intended but holy crap did we ever not intend on this interaction" manner.
I didn't think it was a bug either till I got to the "Spooky action at a distance" section. The fact that the following code:
"hello world".startsWith("hello")
can return false in any circumstance, is a bug in the language. The fact that some other code in an entirely unrelated part of the codebase can intern a broken string and thus break string-comparisons for the entire codebase, is mind boggling.
Fortunately, I don't think the fix needs to be too expensive. After a defensive copy of the byte-array is made, just before returning the newly constructed string object, verify once again that the chosen coder matches the string contents. If it doesn't, that would imply a race condition, and the broken string instance should not be returned
> After a defensive copy of the byte-array is made, just before returning the newly constructed string object, verify once again that the chosen coder matches the string contents.
Allocating new String objects might very well be the most frequent memory allocation operation in the JVM. IMO it would be a mistake to do anything to slow this down to protect against this weird instantiation behavior, unless it implies a security vulnerability which can't be fixed by guards at the interning step.
The language maintainers have already decided to slow down string-construction by calling StringUTF16.compress(value) which checks if the input can be converted into LATIN1 and if so, creating a LATIN1 byte-array representation from the ground up.
Compared to this, I wager that the incremental cost is small to verify that the coder matches the content.
Also, depending on the implementation of string-interning and StringUTF16.compress, it's possible that the verification step is only needed for non-LATIN1 strings. Which according to the JEP is only a small minority of strings seen.
If there is a cheaper solution, I'm certainly all for it. I'm just spitballing ideas here. But I don't think it is acceptable to allow "hello world".startsWith("hello") to return false.
A defensive copy probably doesn't help here. The JVM is permitted to optimize it away, since it has no externally observable effects and doesn't have a constructor.
This copy has no external observable effects (it's only temporarily used in the constructor and never stored anywhere), and since the JVM is (under the JMM) permitted work under the assumption the original array doesn't change unless the thread itself changes it or there is some happens-before relationship with something that does, and is permitted to re-order same-thread execution as-if-serial, it's fully permitted to assume the original array is identical to the copied array, and it may elide the copy, since the original array isn't stored or mutated in any way.
Whether the JVM actually does this for arraycopy is an implementation detail. The JVM is permitted to perform this type of escape analysis and elision, and definitely does for some allocations.
In general arrays and concurrency doesn't mix very well in Java, and is something I'd avoid mixing on principle. You avoid most of these types of problems by having threads talk by passing around immutable objects via concurrency primitives, volatile references, synchronized-blocks, etc.
You are incorrect - it is allowed to behave as if the original didn't or did change, that is true and in the memory model. But once the copy is done, then the state is settled, and is no longed allowed to change at any point. There are no more conflicting accesses.
It is "fully permitted to assume the original array is identical to the copied array" only to the extent that affects reading from the original array, not the copy (ie, it can pretend that the original didn't change, even it did).
Anything else would break causality requirements (JLS§17.4.8.).
> In general arrays and concurrency doesn't mix very well in Java, and is something I'd avoid mixing on principle.
Sound sensible. I think what you've said is any semi-malicious code can give you a primitive array, then modify it from under you in another thread. That's not completely surprising but the clincher is, you can't even take a defensive copy.
Well I mean there are things you can do prevent the JVM from this particular optimization. But you need to use the right tools to tell Java to expect spooky action at a distance, and arraycopy just isn't sufficient.
This would probably do
byte[] b;
// A
synchronized (a) {
b = System.arraycopy(a);
}
// B
Here A happens-before B, and you can assume a read fence at the start of synchronized block and a store fence at end of it, so consumers of b should get a consistent view of the array, and the synchronized block tells Java that you expect a to be modified by another thread so it can't assume it stays the same throughout the execution; and elision of the arraycopy isn't possible; so while the array may be in a weird state, it will at least stay consistent throughout the execution.
> The reality is that it's permitted to do all these things
Yes, given the code in the String class, the JVM is allowed to execute it in this way. I'm not claiming that is wrong, I'm claiming the code in the String class should not have been written in a way that allowed the JVM to do that.
> If you violate that assumption, you get bizarre and often unpredictable behavior.
Java is (supposed to be) a safe language, that maintains its integrity. It does not have the effect that doing something that was not documented as being safe leads to _undefined behaviour_ and therefor all bets are off (like e.g. C).
I'm pretty sure the JDK maintainers intended for String to be a thread safe class with guarantees that can not be broken with race conditions. It's true that the thread safety of String is not explicitly documented, but I believe that's just an oversight in the docs, that String being thread safe just goes without saying. I realize that's hand-wavy, and I don't have any hard proof for that right now, so I understand not everyone might be convinced of that. I'll wait for the bugfix to make that more clear.
I don't expect that fix to have a significant performance impact. Fixing it doesn't necessarily require more barriers or more defensive copies. See e.g. this draft patch (incomplete, but good enough to give a rough idea of how it could be done): https://gist.github.com/amaembo/83b4aeab1cd190f2e1dbd75eacb3... .
But the combination with interning, the "hello world".startsWith("hello") example, is clear evidence that it is a bug. Some code running in the same JVM having triggered some race condition earlier should not break unrelated code, it should not put the JVM in a broken state where such code behaves in nonsensical ways.
Something like a ConcurrentModificationException would be nice to have in the "weird shit happens" case, because you may not even be aware of it, and then it might cause headache much further down the line. If you could point upstream that would save a lot of headache and grey hair...
There's no way to reliably detect this type of concurrent modification that doesn't involve introducing expensive barrier operations that evict the CPU cache on multiple points along the code flow and choke out-of-order execution and limit what the JIT can do. If you do not have such a barrier, the CPU is permitted to assume the data hasn't been changed, and may cache it, making detection of concurrent modification impossible. The JIT will also optimize your code based on this assumption. Even if you go out of your way to check that the data is the same you may see stale data when you check!
It may run your code out of order, and even JIT-recompile it out of order, and your code that's like
int oldValue = val;
doSomething(val);
if (val != oldValue) throw new ConcurrentModificationException();
may be re-recompiled like this, since it looks like doSomething can't modify val or oldValue
int oldValue = val;
if (val != oldValue) throw new ConcurrentModificationException();
doSomething(val);
and then since this if check is trivially always unnecessary, and nothing of consequence ever reads oldValue it may delete them entirely
doSomething(val);
Introducing a barrier would likely make most Java programs 10x slower 'cause Java does a lot of throwaway string allocation on the assumption it's cheap and short-lived GC is virtually free.
The Java Memory Model is a great accomplishment and if you heed it you will write performant multi-threaded applications with an ease that is rare in other languages. If on the other hand you do not heed the JMM, you will at best get a CME alerting you of this case, but in most cases just get holes in your feet.
> it looks like doSomething can't modify val or oldValue
...sure, if you are using primitives. But here we have an array (reference), the contents of which can be modified by `doSomething(val)` in your example. It is only the reference value that cannot be modified.
I was trying to reduce the amount of concepts I had to invoke to make the example make sense.
java.lang.String is a final class so in this case so doSomething can't be overridden, and strong assumptions can be made about doSomething's behavior and in practice it's likely inlined given that this is an extremely hot method.
Since this is a constructor, the reference can't be aliased by e.g. fields in the class, so static analysis of the code is surprisingly easy. The compiler just needs to answer whether this particular reference is mutated by the invoked function(s).
Thread safety of char[] is the cause of the bug, but the effect is that String instances can violate their invariants, which they shouldn't even in case of thread-unsafe code.
In my opinion the abstraction is leaky but everything works according to the spec.
I might tell you about another "bug" and "invariant violation" which is possible but is not a bug. Try to use Sets or Maps with keys having broken hashcode.
I can't agree with this analogy. This is not a defective user implementation of hashcode; it is a defective platform implementation of equals (and starts with, etc.).
Again, this is not a bug. There are many, way too many places in standard library which are thread unsafe. Some of these are actual bugs, but not this one. If Java originally moved in Odersky's direction that won't be the case but now it's way too late to point fingers at these "quirks".
In order to properly fix these issues the whole standard library has to be redesigned around immutability and, probably, immutable arrays must be supported at VM level.
This is not going to happen.
But it's specified that standard Java classes are thread unsafe by default. So this is a specified behaviour.
Immutability and thread-safety are two different things. There are no thread-safety guarantees made by the JVM or the runtime for the String type, and I suspect your bug report will get the same answer.
You are right that there is an expecation or assumption of such (from the users), which makes it even more interesting when it breaks.
Locking a thread for every allocation of a string is very likely cost-prohibitive and not a great trade-off for the common case. It would make more sense to add something like StringFactory.CreateString(char[]) and instruct developers in using it if thread safety is nessecary.
How would you fix this? Possibly by copying the input array to a local array as the first step?
But then you might copy a lot of data, which is in 99% of the cases totally unnecessary.
Have a separate threadsafe and a non-threadsafe constructor for char[]?
EDIT: another idea is to do a hash of the input array, and compare between start and end. But that also requires an O(n) walk through the input...
I'm not sure how much of the underlying implementation details of arrays are leaking into JNI, but one way to approach it would be with first class copy-on-write for arrays or opt-in immutability (freezing).
> How would you fix this? Possibly by copying the input array to a local array as the first step?
Upon encountering a non-Latin1 char, convert the existing latin1 array to an utf16 one, append the char (converted), then resume iteration in utf16 copy mode.
This way you know the char which made you bail out on Latin1 is part of the output.
Copy the array, then use alias analysis to remove the copy most of the time, by having two versions of the String constructor, with a version chosen based on aliasing.
Yes, but at some point one has come out of the theory domain and face the real world. So in practical terms it’s closer to a bug and even the “official” guys from Java accepted it as a bug.
It was accepted as a bug. I'm not saying I would condemn anyone to the death penalty based on that but being accepted as a bug there has to have a significant weight attached to it.
In my opinion this is not a bug and there is no general fix for that without significant overhaul of the whole runtime and standard library.
In my opinion the author might need to read the vm and memory model spec and understand that such things are expected in Java.
There are real bugs in standard library singletons (e.g. concurrent Filesystem calls might fail during singleton initialization, while plugins are loaded). These bugs are being ignored for years.
The author must not expect any "fix" for this in foreseeable future, but they might expect being laughed at.
Eh, I don't think the author is to be ridiculed. We're all wrong from time to time, I'm wrong in most of what I say, I'm sure you've been wrong on occasions, this is fine.
The author's views on how concurrency works is very common and taught in schools and textbooks, but also quite inadequate.
Instead let's take this opportunity to talk and teach about the JMM and deepen the collective understanding of the many unintuitive behaviors of multi-threaded applications.
Maybe there could be a JMM 2.0 where at least some of the unexpected behaviour is removed. Maybe even at the cost of performance. If something is misunderstood by >99% of the average programmer population then clearly it is not the best solution to the problem.
If that many programmers don't understand the JMM, it's because that many programmers haven't looked at it.
It is the key to reasoning about Java concurrency. Any other model that is not the JMM is flat out wrong. It is virtually impossible to write correct concurrent Java without understanding the memory model.
I don't think much has changed since then, most of the problems are still there in newer Java versions. And that's fine, in practice they don't hurt. Still imo interesting curiosities of the type system.