Nobody's asking for perfection. But the AI is offering inexplicable and obvious ...

duskwuff · on Feb 16, 2024

Seconding this.

Something like Magika is potentially useful as a second pass if conventional methods of detecting a file type fail or yield a low-confidence result. But, for the majority of binary files, those conventional methods are perfectly adequate. If the first few bytes of a file are "GIF89a", you don't need an AI to tell you that it's probably a GIF image.
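
To make the "conventional first pass" concrete, here's a minimal sketch of prefix sniffing. The table and the `sniff` name are made up for illustration; real tools such as file(1) use much larger signature databases with offsets and masks.

```python
# Hypothetical, tiny signature table (assumption: only a few well-known prefixes).
MAGIC_PREFIXES = {
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
}


def sniff(data: bytes):
    """Return a type name if data starts with a known signature, else None."""
    for prefix, name in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return name
    # No match: this is where a second pass (e.g. a model) could take over.
    return None


print(sniff(b"GIF89a\x01\x00\x01\x00"))
print(sniff(b"just some text"))
```

Only when `sniff` returns None would you need to reach for something heavier.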

EnigmaFlare · on Feb 17, 2024

Doesn't seem all that non-deterministic. I tested the vba.html example multiple times and it always said it was VBA. I added a space between </HEAD> and <BODY> and it correctly picked HTML as most likely but with a low confidence.

So I think we can say it's sensitive to mysterious features, not that it's non-deterministic. That still leads to your same conclusion that you can't anticipate the failures. But I don't think you can with traditional tools either. Some magic numbers are just plain text (like MZ), which could quite innocently appear at the beginning of a plain text file, for example.
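
A quick sketch of that MZ false positive, assuming a naive check that looks only at the first two bytes (`looks_like_dos_exe` is a made-up helper, not any real tool's API):

```python
def looks_like_dos_exe(data: bytes) -> bool:
    # "MZ" is the two-byte signature at the start of DOS/Windows executables.
    return data[:2] == b"MZ"


# An ordinary text file that happens to begin with the same two letters.
note = b"MZ is the signature found at the start of .exe files.\n"

print(looks_like_dos_exe(note))  # True, although this is plain text
```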

jsnell · on Feb 16, 2024

Where are you getting the non-determinism part from? It would seem surprising for there to be anything non-deterministic about an ML model like this, and nothing in the original reports seems to suggest that either.

TeMPOraL · on Feb 16, 2024

Large ML models tend to be uncorrectably non-deterministic simply from doing lots of floating point math in parallel. Addition and multiplication of floats is neither commutative nor associative - you may get different results depending on the order in which you add/multiply numbers.

eapriv · on Feb 17, 2024

Addition and multiplication of floats are commutative.
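
A small check with ordinary Python floats (IEEE 754 doubles), as a sketch: swapping operands gives the same result, but regrouping a sum does not.

```python
a, b, c = 0.1, 0.2, 0.3

# Commutativity holds: swapping the operands does not change the result.
assert a + b == b + a
assert a * b == b * a

# Associativity does not: the grouping changes where rounding happens.
left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False
```

So the order-dependence in parallel reductions comes from grouping, not from swapping operands.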

Gormo · on Feb 16, 2024

> It would seem surprising for there to be anything non-deterministic about an ML model like this

I think there may be some confusion of ideas going on here. Machine learning is fundamentally stochastic, so it is non-deterministic almost by definition.