The result is bad, that is just an image recognition model we know those can be trained with relatively little words and just a bunch of images, the model didn't learn grammar or sentences or relationships like a baby would. The only models we have that can do grammar and sentences and relationships are LLMs with massive amounts of data.