This is a valid concern and we are still experimenting with how to do this right. A combination of preserving the reasoning history, having the generated code, and using tests to enforce the public interface (and fix it if anything breaks) looks promising.
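
To make the "tests enforce the public interface" part concrete, here is a minimal sketch of what I mean, not what we actually run. `Cart`, `public_interface`, and `EXPECTED` are hypothetical names made up for illustration; the idea is just that the expected signatures are pinned in a test, so regenerated code that drifts from them fails loudly.

```python
import inspect


def public_interface(cls):
    """Map each public method name of `cls` to its signature as a string."""
    return {
        name: str(inspect.signature(member))
        for name, member in inspect.getmembers(cls, inspect.isfunction)
        if not name.startswith("_")
    }


# Hypothetical stand-in for a class that gets regenerated from the spec.
class Cart:
    def add(self, sku: str, qty: int = 1) -> None: ...

    def total(self) -> float: ...

    def _recalc(self): ...  # private helper, free to change between runs


# The pinned contract (an assumption for this sketch): regeneration may
# rewrite method bodies, but not these names or signatures.
EXPECTED = {
    "add": "(self, sku: str, qty: int = 1) -> None",
    "total": "(self) -> float",
}


def test_public_interface_unchanged():
    assert public_interface(Cart) == EXPECTED
```

A failing test here is the signal to either repair the generated code or update the contract deliberately, rather than letting the interface change silently.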

I think the crucial part is indeed not to go deterministically from natural language (NL) to code, but to be able to take an existing state of the codebase and spec and "continue the work".