
Howdy. Author here. Really cool to see so much good discussion on this. I want to turn several of these falsehoods into blog posts on their own with explanations/stories/what-have-you. Taking votes for what you'd like to see first. For the record, my fave is "Languages don’t change".


Thank you for the thorough and practical write-up.

About the only thing I would add to it is i18n concerns.

A few quick ones off of the top of my head:

  - Words are separated by whitespace or dashes.
  - Customers only ever enter ASCII.
  - Customers either always type accented characters with their accents or always without.
  - A "Unicode-capable" system will happily take in any valid Unicode.
  - A "Unicode-capable" system will pass through any valid Unicode undisturbed.
  - Software systems perform Unicode normalization.
  - The WinNT API is UTF-16.
  - There is a 1-to-1 mapping between uppercase and lowercase.
  - The Unicode collation algorithm is optimal for every single language.
  - The Unicode collation algorithm is optimal for multi-language document sets.
  - Distinguishing/coalescing plural and singular forms of words is easy.
  - There are separate plural/singular forms of words.
  - Words have a stem and optional suffixes, but not prefixes.
  - Soundex etc. works for every language.
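A few of these are easy to demonstrate. The following is a minimal sketch using only Python's standard `unicodedata` module; `fold_accents` is a hypothetical helper name, and stripping combining marks is just one possible accent-folding approach (it is wrong for languages where the accented letter is a distinct letter).

```python
import unicodedata

# Two spellings of "café": a precomposed é versus "e" + combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# They render identically but do not compare equal until normalized.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# Case mapping is not 1-to-1: one character can become two.
print("ß".upper())        # German sharp s uppercases to two letters
print(len("İ".lower()))   # dotted capital I lowercases to i + combining dot


def fold_accents(text):
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


print(fold_accents("Crème brûlée"))
```

If the system does not normalize on both the indexing and the query side, the two spellings of "café" above will not match each other.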


> There are separate plural/singular forms of words.

Or that there are just two forms (one for 1 and one for many) when translating strings, or that it is clear which form to pick.

While English has one form for 1, and one form for 0/many:

- French pluralises 0 the same way as 1,

- Czech has a form for exactly 2-4 items,

- Irish has forms for exactly 3-6 and 7-10 items,

- Polish has a form for all numbers that end in 2-4 (except 12-14),

- Russian has a form for all numbers that end in 1 (except 11),

- Arabic has forms for exactly 0 and 2 items, ending in 03-10, and many more.

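A strings table will need many more than two variants per string if you want to translate strings referring to a number of items; CLDR, for example, defines six plural categories (zero, one, two, few, many, other).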
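To make the rules above concrete, here is a rough sketch of plural-category selection for non-negative integers. The function names are hypothetical, the category labels follow CLDR's naming, and the rules are simplified (no fractions, and French's special handling of millions is ignored), so treat it as an illustration rather than a replacement for a real i18n library.

```python
def plural_english(n):
    return "one" if n == 1 else "other"


def plural_french(n):
    # 0 is pluralised the same way as 1.
    return "one" if n in (0, 1) else "other"


def plural_czech(n):
    if n == 1:
        return "one"
    return "few" if 2 <= n <= 4 else "other"


def plural_polish(n):
    if n == 1:
        return "one"
    # Numbers ending in 2-4, except the teens 12-14.
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


def plural_russian(n):
    # Numbers ending in 1, except those ending in 11.
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


def plural_irish(n):
    if n in (1, 2):
        return ("one", "two")[n - 1]
    if 3 <= n <= 6:
        return "few"
    return "many" if 7 <= n <= 10 else "other"


def plural_arabic(n):
    if n in (0, 1, 2):
        return ("zero", "one", "two")[n]
    if 3 <= n % 100 <= 10:
        return "few"
    return "many" if 11 <= n % 100 <= 99 else "other"


for n in (0, 1, 2, 5, 11, 21, 22, 103):
    print(n, plural_english(n), plural_french(n), plural_czech(n),
          plural_polish(n), plural_russian(n), plural_irish(n), plural_arabic(n))
```

The translated string is then looked up by the returned category, which is why a two-slot "singular/plural" strings table is not enough.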


yes! tokenization and problems with word boundaries alone would be great to dive into!
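As a small illustration of the word-boundary problem, here is a sketch of the "words are separated by whitespace or dashes" assumption in code. `naive_tokenize` is a hypothetical name, and the sample strings are made up for the example.

```python
import re


def naive_tokenize(text):
    """Split on runs of whitespace or hyphens, as the falsehood assumes."""
    return [tok for tok in re.split(r"[\s\-]+", text) if tok]


# Hyphenated compounds are torn apart...
print(naive_tokenize("state-of-the-art search"))

# ...apostrophes are left inside tokens...
print(naive_tokenize("don't stop"))

# ...and a Japanese sentence, which has no spaces, comes back as one "word".
print(naive_tokenize("東京都に住んでいます"))
```

Languages written without spaces generally need dictionary-based or statistical segmentation instead of a delimiter split.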


Thanks! Nice additions!


Having spent a fair chunk of my career dealing with search, I went down that list nodding in agreement to nearly every single bullet point save for about 10...

...and those I had to classify as "problems I probably had and didn't recognize" or "will surely encounter soon"

So often we underestimate this thing...


I found the list of falsehoods about phone numbers (https://github.com/google/libphonenumber/blob/master/FALSEHO...) really enlightening because it gave a short rationale or example for each point. I think that's way more helpful and useful than the more traditional snarky format.


Another falsehood: "You can increase recall without adding noise." A customer wanted to match substrings of words, and then asked why all these irrelevant documents were being returned.
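The trade-off is easy to reproduce on a toy corpus. This is only a sketch: the documents and the function names `whole_word_search` and `substring_search` are invented for illustration, and a real engine would use an inverted index rather than a linear scan.

```python
# Hypothetical four-document corpus.
docs = {
    1: "how to pickle cucumbers",
    2: "the cat sat on the mat",
    3: "category theory for programmers",
    4: "concatenate two lists",
}


def whole_word_search(term, docs):
    """Match only documents that contain the term as a whole word."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())


def substring_search(term, docs):
    """Match documents that contain the term anywhere, even inside a word."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text)


print(whole_word_search("cat", docs))  # only the document about the cat
print(substring_search("cat", docs))   # also "category" and "concatenate"
```

Substring matching does raise recall, but every extra hit here is a document the user did not want.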


May I propose an additional falsehood?

"Users won't want to turn search highlighting off."

Maybe it's just me, but this[1] seems distracting.

[1]: https://docs.python.org/3/library/pickle.html?highlight=pick...



