Their hypothesis is a good one:

- One form of reasoning is connecting cause and effect via the probability of necessity (PN) and the probability of sufficiency (PS).

- Based on LLM modeling, you can identify when natural language inputs can support PN and PS inference.

That would mean you can engineer in more causal reasoning through the choice of data inputs and model architecture.
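For anyone who hasn't run into PN and PS before, here is a toy sketch using Pearl's standard definitions: PN is the probability that the effect would not have happened without the cause, given that both happened; PS is the probability that the cause would have produced the effect, given that neither happened. This is not the paper's setup. The one-variable boolean model `Y = X or U`, the helper `pn_ps`, and the probabilities are all made up for illustration, and X is assumed independent of the hidden factor U.

```python
from itertools import product

def pn_ps(f, p_x, p_u):
    """Toy PN/PS for a boolean model Y = f(X, U).

    Assumptions (illustrative only): X and U are independent,
    p_x = P(X=1), p_u = P(U=1), U is the unobserved background factor.
    """
    def weight(x, u):
        return (p_x if x else 1 - p_x) * (p_u if u else 1 - p_u)

    def cond_prob(event, evidence):
        # Enumerate every (x, u) world consistent with the evidence.
        worlds = [(x, u) for x, u in product([False, True], repeat=2)
                  if evidence(x, u)]
        total = sum(weight(x, u) for x, u in worlds)
        if total == 0:
            return None  # the evidence never occurs in this model
        return sum(weight(x, u) for x, u in worlds if event(x, u)) / total

    # PN: we saw X=1 and Y=1; would Y have been 0 had X been 0?
    pn = cond_prob(lambda x, u: not f(False, u),
                   lambda x, u: x and f(x, u))
    # PS: we saw X=0 and Y=0; would Y have been 1 had X been 1?
    ps = cond_prob(lambda x, u: f(True, u),
                   lambda x, u: (not x) and (not f(x, u)))
    return pn, ps

# Hypothetical model: Y happens if X happens or some other cause U does.
pn, ps = pn_ps(lambda x, u: x or u, p_x=0.5, p_u=0.3)
print(f"PN = {pn:.2f}, PS = {ps:.2f}")
```

In this toy model X is always sufficient (PS is 1) but not always necessary, because U can produce Y on its own (PN is 0.7, the chance that U was absent).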

They define causal functions, project accuracy measures (false positives/negatives) onto factual and counterfactual assertion tests, and measure LLM performance with respect to this accuracy. They find a surprisingly low tolerance for counterfactual error, and suggest it might indicate an upper limit on reasoning for current LLM architectures.
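To make the measurement idea concrete, here is a rough sketch of how you might tally false positive and false negative rates separately for factual and counterfactual assertions. It is only my reading of the general idea, not the paper's actual metric. The function `error_rates`, the record layout `(kind, truth, predicted)`, and the toy records are all hypothetical.

```python
from collections import Counter

def error_rates(results):
    """False positive / false negative rates per assertion kind.

    results: iterable of (kind, truth, predicted), where kind is
    "factual" or "counterfactual" and the other two are booleans.
    This layout is an assumption made for illustration.
    """
    counts = Counter()
    for kind, truth, predicted in results:
        counts[(kind, truth, predicted)] += 1

    rates = {}
    for kind in ("factual", "counterfactual"):
        fp = counts[(kind, False, True)]
        tn = counts[(kind, False, False)]
        fn = counts[(kind, True, False)]
        tp = counts[(kind, True, True)]
        rates[kind] = {
            # None when there are no negatives / positives to judge against.
            "fpr": fp / (fp + tn) if fp + tn else None,
            "fnr": fn / (fn + tp) if fn + tp else None,
        }
    return rates

# Made-up records, only to show the shape of the input.
toy = [
    ("factual", True, True), ("factual", True, True),
    ("factual", False, False), ("factual", False, True),
    ("counterfactual", True, False), ("counterfactual", True, True),
    ("counterfactual", False, False), ("counterfactual", False, False),
]
rates = error_rates(toy)
print(rates)
```

Splitting the tallies by kind is the point: a model can look fine on factual assertions while its counterfactual error rate is what actually limits the causal inference.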

Their findings are limited by how constrained their approach is (short, simple boolean chains). It's hard to see how this approach could be extended to more complex reasoning. Conversely, if/since LLMs can't get this right, it's hard to see them progressing at the rates hoped for, unless this approach somehow misses a dynamic of larger models.

It seems like this would be a very useful starting point for LLM quality engineering, at least for simple inference.