The hand-rolled strcmp almost certainly outperformed the native one simply because it can be inlined. Consider using rep cmpsb, or directly including another assembly version of strcmp (though since the strings are short, the latter's overhead might not be worth it). Ditto for memcpy.
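To illustrate the inlining point, here is a minimal sketch of a byte-wise comparison defined in the same translation unit as its caller, so the compiler is free to inline it. The name short_strcmp is hypothetical and this is not the code under discussion; whether it actually beats the library strcmp depends on the compiler, the flags, and the string lengths, so measure before relying on it.

```c
/* Hypothetical helper (not from the original code): a plain byte-wise
 * comparison with the same contract as strcmp. Marking it static inline
 * and keeping it in the calling translation unit lets the compiler
 * inline it, which avoids call overhead on very short strings. */
static inline int short_strcmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    /* Advance while both bytes match and the end has not been reached. */
    while (*p != '\0' && *p == *q) {
        ++p;
        ++q;
    }

    /* Negative, zero, or positive, compared as unsigned char like strcmp. */
    return (int)*p - (int)*q;
}
```

The same idea applies to a small copy loop standing in for memcpy on short, known-size buffers.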
Perhaps also consider using a radix sort?
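As a rough sketch of what that could look like, assuming the data being sorted is an array of pointers to short NUL-terminated strings (the original post does not say, so this is a guess), an MSD radix sort buckets the strings by one byte at a time instead of calling a comparison function. The names msd_radix_pass and radix_sort_strings are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* One MSD pass: stable counting sort of a[0..n) on the byte at `depth`,
 * then recurse into each bucket. Strings that have already ended fall
 * into bucket 0 and are not examined further. */
static void msd_radix_pass(const char **a, const char **tmp,
                           size_t n, size_t depth)
{
    size_t count[256] = {0};
    size_t start[256];
    size_t next[256];
    size_t i, sum = 0;
    int c;

    if (n < 2)
        return;

    /* Histogram of the byte at this depth. */
    for (i = 0; i < n; ++i)
        ++count[(unsigned char)a[i][depth]];

    /* Prefix sums give the first slot of each bucket. */
    for (c = 0; c < 256; ++c) {
        start[c] = sum;
        next[c] = sum;
        sum += count[c];
    }

    /* Distribute into tmp in bucket order, then copy back. */
    for (i = 0; i < n; ++i)
        tmp[next[(unsigned char)a[i][depth]]++] = a[i];
    memcpy(a, tmp, n * sizeof *a);

    /* Bucket 0 holds finished strings; sort the rest on the next byte. */
    for (c = 1; c < 256; ++c)
        msd_radix_pass(a + start[c], tmp + start[c], count[c], depth + 1);
}

/* Sorts n string pointers in strcmp order.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
static int radix_sort_strings(const char **a, size_t n)
{
    const char **tmp;

    if (n < 2)
        return 0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    msd_radix_pass(a, tmp, n, 0);
    free(tmp);
    return 0;
}
```

The recursion depth is bounded by the longest string, which is why this only makes sense for short keys. For small arrays a comparison sort may still win, so this is worth benchmarking rather than assuming.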