Howdy. Author here. Really cool to see so much good discussion on this. I want to turn several of these into blog posts of their own, with explanations/stories/what-have-you. Taking votes for what you'd like to see first. For the record, my fave is "Languages don't change".
Thank you for the thorough and practical write-up.
About the only thing I would add to it is i18n concerns.
A few quick ones off the top of my head:
- Words are separated by whitespace or dashes.
- Customers only ever enter ASCII.
- Customers enter accented characters consistently (always with, or always without, their accents).
- A "Unicode-capable" system will happily take in any valid Unicode.
- A "Unicode-capable" system will pass through any valid Unicode undisturbed.
- Software systems perform Unicode normalization.
- The WinNT API is UTF-16.
- There is a 1-to-1 mapping between uppercase and lowercase.
- The Unicode collation algorithm is optimal for every single language.
- The Unicode collation algorithm is optimal for multi-language document sets.
- Distinguishing/coalescing plural and singular forms of words is easy.
- There are separate plural/singular forms of words.
- Words have stem and optional suffixes, but not prefixes.
- Soundex and the like work for every language.
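A quick Python sketch of two of these (the normalization and case-mapping bullets), using only the stdlib `unicodedata` module. The strings here are just illustrative examples:

```python
import unicodedata

# "café" has two valid encodings: precomposed U+00E9 vs. "e" + combining acute.
nfc = "caf\u00e9"
nfd = "cafe\u0301"
assert nfc != nfd  # naive byte/codepoint comparison says they differ

# Systems do NOT normalize for you; compare only after normalizing yourself.
assert unicodedata.normalize("NFC", nfd) == nfc

# Case mapping is not 1-to-1: German sharp s uppercases to two characters,
# and the round trip does not restore the original string.
assert "straße".upper() == "STRASSE"
assert "STRASSE".lower() == "strasse"
```

Both "café" strings render identically on screen, which is exactly why unnormalized comparisons are such a quiet source of search misses.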
Having spent a fair chunk of my career dealing with search, I went down that list nodding in agreement at nearly every single bullet point, save for about 10...
...and those I had to classify as either "problems I probably had and didn't recognize" or "will surely encounter soon".
I found the list of falsehoods about phone numbers (https://github.com/google/libphonenumber/blob/master/FALSEHO...) really enlightening because it gives a short rationale or example for each point. I think that's far more helpful than the traditional snarky format.
You can increase recall without adding noise. A customer wanted to match substrings of words, then wondered why all these irrelevant documents were coming back.
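A toy sketch of that trade-off, using a made-up three-document corpus: substring matching pulls in every document containing "cat" as a fragment, while whole-word matching keeps only the relevant one.

```python
docs = ["the cat sat", "concatenate strings", "scatter plot"]
query = "cat"

# Substring matching: maximum recall, but "concatenate" and
# "scatter" come along as noise.
substring_hits = [d for d in docs if query in d]

# Whole-word matching: trades raw recall for precision.
word_hits = [d for d in docs if query in d.split()]

assert len(substring_hits) == 3          # every document matches
assert word_hits == ["the cat sat"]      # only the relevant one
```

Real tokenizers are more involved than `str.split()`, of course, but the shape of the complaint is the same at any scale.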