Yeah, if we get an open model that one could apply a LoRA (or similarly cheap fine-tuning) to, then even problems like reproducing identity would most likely be solved, as they were for diffusion models. The coherence, not just to the prompt but to any input image(s), is way beyond what I've seen in diffusion models.

I do think they run a "traditional" upscaler on the transformer output, since it sometimes shows errors similar to the ones upscalers make (misinterpreted pixels). So the current decoded resolution is probably quite low, and hopefully future models like GPT-5 will improve on this.