Howdy. Author here. Really cool to see so much good discussion on this. I want to turn several of these into blog posts of their own, with explanations/stories/what-have-you. Taking votes for what you'd like to see first. For the record, my fave is "Languages don't change".
Thank you for the thorough and practical write-up.
About the only thing I would add to it is i18n concerns.
A few quick ones off the top of my head:
- Words are separated by whitespace or dashes.
- Customers only ever enter ASCII.
- Customers enter accented characters consistently (always with, or always without, their accents).
- A "Unicode-capable" system will happily take in any valid Unicode.
- A "Unicode-capable" system will pass through any valid Unicode undisturbed.
- Software systems perform Unicode normalization.
- The WinNT API is UTF-16.
- There is a 1-to-1 mapping between uppercase and lowercase.
- The Unicode collation algorithm is optimal for every single language.
- The Unicode collation algorithm is optimal for multi-language document sets.
- Distinguishing/coalescing plural and singular forms of words is easy.
- There are separate plural/singular forms of words.
- Words have stem and optional suffixes, but not prefixes.
- Soundex and the like work for every language.
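A quick Python sketch of two of these (the normalization and case-mapping bullets), using only the stdlib `unicodedata` module. The strings here are just illustrative examples:

```python
import unicodedata

# "café" has two valid encodings: precomposed U+00E9 vs. "e" + combining acute.
nfc = "caf\u00e9"
nfd = "cafe\u0301"
assert nfc != nfd  # naive byte/codepoint comparison says they differ

# Systems do NOT normalize for you; compare only after normalizing yourself.
assert unicodedata.normalize("NFC", nfd) == nfc

# Case mapping is not 1-to-1: German sharp s uppercases to two characters,
# and the round trip does not restore the original string.
assert "straße".upper() == "STRASSE"
assert "STRASSE".lower() == "strasse"
```

Both "café" strings render identically on screen, which is exactly why unnormalized comparisons are such a quiet source of search misses.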
Having spent a fair chunk of my career dealing with search, I went down that list nodding in agreement at nearly every single bullet point, save for about 10...
...and those I had to classify as either "problems I probably had and didn't recognize" or "will surely encounter soon".
I found the list of falsehoods about phone numbers (https://github.com/google/libphonenumber/blob/master/FALSEHO...) really enlightening because it gives a short rationale or example for each point. I think that's far more helpful than the traditional snarky format.
You can increase recall without adding noise. A customer wanted to match substrings of words, then wondered why all these irrelevant documents were coming back.
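A toy sketch of that trade-off, using a made-up three-document corpus: substring matching pulls in every document containing "cat" as a fragment, while whole-word matching keeps only the relevant one.

```python
docs = ["the cat sat", "concatenate strings", "scatter plot"]
query = "cat"

# Substring matching: maximum recall, but "concatenate" and
# "scatter" come along as noise.
substring_hits = [d for d in docs if query in d]

# Whole-word matching: trades raw recall for precision.
word_hits = [d for d in docs if query in d.split()]

assert len(substring_hits) == 3          # every document matches
assert word_hits == ["the cat sat"]      # only the relevant one
```

Real tokenizers are more involved than `str.split()`, of course, but the shape of the complaint is the same at any scale.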