I never understand why people seem to think 'test flakiness' can be fixed by declaring "it's OK if the test only passes 1 in 10 times" -- if the outcome depends that heavily on the RNG, then something is bound to fail for users as well, so either the test is not representative of the game or something is rotten.
It depends entirely on the reason for that 1/10 times.
E.g. I had a test that hit a 3rd party sandbox that would go down occasionally. I could have mocked it to avoid this but I had way bigger fish to fry. The flakiness triggered a very specific error that we felt safe ignoring (other kinds of flakiness we would not ignore). The corresponding prod API never had issues.
Sometimes it's a SELECT statement without an ORDER BY. That's not strictly a bug, but quickly adding an ORDER BY solves the flakiness forever, so you might as well.
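A minimal sketch of that failure mode, using sqlite3 as a stand-in database (table and column names are made up for illustration). SQL makes no guarantee about row order without an ORDER BY, so a test asserting on the unordered result is at the mercy of the query planner; pinning the order makes the assertion deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(3, "carol"), (1, "alice"), (2, "bob")],
)

# Flaky: without ORDER BY, the engine may return rows in any order
# depending on the plan, indexes, or storage layout.
unordered = conn.execute("SELECT id, name FROM users").fetchall()

# Deterministic: pin the order the test asserts against.
ordered = conn.execute(
    "SELECT id, name FROM users ORDER BY id"
).fetchall()
assert ordered == [(1, "alice"), (2, "bob"), (3, "carol")]
```

The unordered query may happen to return insertion order on one engine and still break on another, which is exactly why the flakiness shows up only "sometimes".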
Sometimes it's a pretty terrifying race condition in the code that will absolutely crop up in prod.
Agreed - at a previous job, a lot of time was invested in setting up test retries and in splitting test runs so that you could retrigger only the subsets that failed instead of the whole test suite.
I guess the time to actually fix the tests is expensive...