You're right. I had also noted that the comparison isn't direct, so I wasn't justified in calling the gap claim wrong; sorry about that. I still think it would be nice to have it undergo an external or more neutral test of performance. I say this without doubting the quality of the results at all.
Re Winograd: WNLI is different; see https://arxiv.org/pdf/1804.07461.pdf