Apologies, I'm an outsider to the field, but what exactly are you referring to here? The whole vector-space semantic embedding that was popularized by works like word2vec?
I have to wonder if English is really the best language for NLP research. Things like Winograd schemas (sentences where resolving an ambiguous pronoun requires world knowledge rather than grammatical cues), which have attracted a lot of attention, simply aren't possible in other languages.
Why not start working with more structured, agglutinative* languages like Japanese/Korean and the Indic family (Sanskrit especially)?
How about other European languages? Are they better structured empirically? I hear German is very grammatical, and that Hungarian is ... erm, odd?
(* Note: I know the occidental tradition likes to split off the Indic tongues, and that Indo-European languages are not usually classed as agglutinative. I don't subscribe to this view. I use "agglutinative" in the sense of Panini: "particles" sticking to stems/roots/words; phonetic modifications are irrelevant to the grammar.)
> I hear German is very grammatical, and that Hungarian is ... erm, odd?
Just want to point out that "grammatical" probably isn't the word you want here. Every language is grammatical by definition in the sense that there are rules that govern its sound system, word formation system, syntax, etc.
The concept you're getting at, though--that some languages are easier for computer programs and/or speakers of Indo-European languages to understand--is sound.
"Regular" would be the classic linguistics term, would it not? Although computer science limits the term to the use of regular languages in the Chomsky hierarchy sense (that is, more specifically to regular expressions and the languages they describe), I am under the impression linguistics as a whole treats regularity as a multivariate spectrum. Some languages have more regularity in terms of grammar productions or morphology than English.
That points to isolating languages [1], and I think "highly isolating" may be the more useful distinction for this specific example. (Modern English is rather analytic, having dropped most, but not all, inflections in the Middle English era. Mandarin Chinese is much more isolating than Modern English.)
One reason not to is that the amount of available training data in those languages is many orders of magnitude smaller.
FWIW, it seems the structure you're talking about exploiting is at the morphological and syntactic level, which modern language models tend to handle effectively. Semantics are a much harder problem.
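As a toy illustration of why that structure comes cheaply: subword tokenization already splits an agglutinative word into stem and suffix pieces, so the model sees the morphology directly. The greedy longest-match below is a simplified stand-in for BPE/WordPiece with a hand-picked vocabulary, not any real tokenizer; nothing analogous exists for semantics.

    # Hypothetical subword vocabulary mixing Turkish-style and English-style pieces.
    VOCAB = ["ev", "kitap", "ler", "lar", "im", "de", "da", "un", "help", "ful", "ness"]

    def segment(word, vocab=VOCAB):
        """Greedy longest-match segmentation into known subword units."""
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):   # try the longest piece first
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                pieces.append(word[i])          # unknown character falls through as-is
                i += 1
        return pieces

    print(segment("evlerimde"))      # ['ev', 'ler', 'im', 'de']
    print(segment("unhelpfulness"))  # ['un', 'help', 'ful', 'ness']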
> Things like Winograd schemas, which have attracted a lot of attention, simply aren't possible in other languages.
I do not think that is correct. Anaphora exists in many languages. Check out the Anaphora article on Wikipedia and click through the different language versions; there are example sentences for many languages.
There are translations of the Winograd schemas into a couple of languages. Granted, I found some of the translations a little unnatural, but they are still understandable and expose the problem.