Hacker News

> And then everybody proceeds to write their own String library anyway.

Is this true? It was (is!) certainly true for C, but C has an especially emaciated set of string-processing primitives. Any runtime developed after, say, 1995 that I can think of has fixed this by providing a sane string implementation that people generally agree upon.



If you care about Unicode for real, yes.

Neither Rust nor Go has a built-in package for grapheme iteration, and many people naively (and incorrectly) assume that a Go unicode rune == "character". I assume the same happens with Rust's `char` type.
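The rune-vs-character confusion isn't specific to Go; any language that iterates code points has it. A minimal sketch in Python (chosen only because it runs anywhere; `len` counts code points, just like counting Go runes or Rust `char`s would):

```python
# One user-perceived character ("é" as a grapheme cluster), built from
# 'e' + U+0301 COMBINING ACUTE ACCENT: two Unicode code points.
s = "e\u0301"
print(len(s))         # 2 code points, not 1 "character"
print(s == "\u00e9")  # False: different code points than precomposed "é"
```

The same visual glyph can therefore have length 1 or 2 depending on how it was typed, which is exactly why grapheme iteration needs more than a codepoint loop.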

If you care about Unicode-aware string sorting (you should), rather than the naive string sorting the Go and Rust standard libraries provide out of the box, then you probably want a proper Unicode library.
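Out of the box, most standard-library sorts compare code points (effectively bytes, for UTF-8), so every uppercase ASCII letter sorts before any lowercase one and accented letters land after 'z'. A quick Python sketch of the naive behavior (a real locale-aware collation would come from a dedicated Unicode library, which is the point being made):

```python
# Codepoint-order sort: 'B' (U+0042) < 'a' (U+0061) < 'c' (U+0063) < 'Ä' (U+00C4)
words = ["apple", "Banana", "cherry", "Äpfel"]
print(sorted(words))  # ['Banana', 'apple', 'cherry', 'Äpfel']
```

No human-facing dictionary would order "Äpfel" after "cherry"; fixing that requires Unicode collation (UTS #10), not a smarter comparison of code points.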

I think the only language that gets Unicode 'right' out of the box is Swift, as it actually provides grapheme iterators, Locale awareness, etc. - but it comes at the cost of the language being tied to the (ever-moving) Unicode standard.


> I think the only language that gets Unicode 'right' out of the box is Swift, as it actually provides grapheme iterators, Locale awareness, etc. - but it comes at the cost of the language being tied to the (ever-moving) Unicode standard.

I think this might (at least partially) be why Rust's stdlib doesn't have this. If it did, support for it would be tied to Rust's release schedule and to whichever version of Rust you're using. Granted, releases come every six weeks and updating is usually trivial, but that's still a coupling that could be an issue.

By having this be in a separate library it means that it can update as and when it needs to, and it's not inherently connected to a specific release of Rust.


I was pessimistic about grapheme-oriented text handling, deleted my comment to research more, and I've come to the conclusion that this is simply not a consensus opinion. Can you give me an example where grapheme-based sorting differs critically from codepoint-oriented sorting on normalized text? Full Unicode composition certainly seems to provide a reasonable solution for Western languages, CJK characters, and romanizations of CJK, but that leaves a hell of a lot of scripts that I don't know about.

I mean, Unicode is incredibly complex, but there doesn't even seem to be a consensus outside of Swift's string implementation on what a grapheme even is.

(Granted, this might support the above point that people can't even agree on what a string is, but Unicode code points seem like a reasonable baseline to expect from a modern language. That said, Rust doesn't even include Unicode normalization in the standard library, although the common crate for it seems like a reasonable solution.)


The issue I am aware of is with Thai, which has zero-width Unicode codepoints that get superimposed on the preceding non-zero-width codepoint (or, if none is present, on an 'empty' non-zero-width placeholder). A non-zero-width codepoint can be followed by multiple zero-width codepoints (in Thai, no more than two for correctly written words). For codepoint sorting to be correct, the order of these zero-width codepoints needs to be normalized. The standard practice in Thai is to put vowel signs before tone markers.
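For combining marks that Unicode assigns a nonzero canonical combining class, normalization handles exactly this reordering, so two storage orders compare equal afterwards. A minimal sketch with Latin diacritics using Python's stdlib `unicodedata` (whether normalization alone suffices for Thai depends on the combining classes of the specific marks involved, so treat this as an illustration of the mechanism, not a Thai fix):

```python
import unicodedata

# The same visual result stored with the marks in two different orders:
# 'a' + COMBINING DOT BELOW (ccc 220) + COMBINING ACUTE (ccc 230), and vice versa.
s1 = "a\u0323\u0301"
s2 = "a\u0301\u0323"
print(s1 == s2)  # False: raw codepoint sequences differ
print(unicodedata.normalize("NFC", s1) ==
      unicodedata.normalize("NFC", s2))  # True: canonical ordering fixes it
```

This is why "normalize first, then sort" is the usual advice: without it, canonically equivalent strings compare unequal.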

In recent years, application support for this has greatly improved.


> it doesn't even seem like there's a consensus outside of swift's string implementation of what a grapheme even is.

Linguistically it's easy: graphemes are the squiggles people actually draw, as distinct from how a machine encodes them. Of course, since "people" aren't a single individual with one consistent opinion, there's room for nuance; maybe some people think a given mark is two separate squiggles.


Even PL/I has better string handling than C.


Nope. It's not at all fixed because nobody can ever agree on what a "String" is and what performance guarantees the underlying data structure should provide.

Let's just assume a String is UTF-8 to make things "simple".

Is a String mutable or not? Should mutable and immutable Strings have the same underlying structure? If mutable, is a String extensible or not? Can a String be sliced into another String? Can those slices be shared? Should you walk across codepoints or characters (which could be multiple codepoints due to combining)? If you want to insert a codepoint in the middle of a String, what are the performance guarantees?
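One concrete instance of the mutability question above, sketched in Python (picked only for brevity): the immutable string rejects in-place edits outright, and the obvious mutable alternative trades that problem for a different one, byte indexing versus code-point indexing.

```python
s = "héllo"
try:
    s[1] = "e"  # immutable string: in-place mutation is a TypeError
except TypeError:
    print("in-place edit rejected")

buf = bytearray(s.encode("utf-8"))  # mutable, but now indexed by byte
buf[1:3] = b"e"                     # 'é' occupies two UTF-8 bytes, so the splice spans both
print(buf.decode("utf-8"))          # hello
```

Every answer to these design questions forecloses some use case, which is the parent's point: there's no single "String" that satisfies everyone.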

I can go on and on ...

"String" really has to be a library as there are simply far too many permutations once you step away from "Shove ASCII to tty".


Well sure, people may colloquially refer to a lot of things as "strings" (hell, you could call any sequence a string if you just wanted to argue with people), but trying to encapsulate all of that in a single standard-library implementation seems semantically confusing and of questionable value. It seems a lot easier to work with one reasonable interpretation of a string and its associated trade-offs, which again is what most standard libraries imply.

That said, in 2024 I would personally balk at willingly adopting any runtime that didn't let me iterate over a sequence of Unicode code points (whether stored as UTF-8 or some 16-bit form) from a string of bytes, unless I were guaranteed never to deal with text processing of free-form human input.
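That baseline amounts to: given bytes, hand me code points. In Python terms (any modern runtime has an equivalent decode step):

```python
data = "naïve".encode("utf-8")  # 6 bytes on the wire ('ï' takes two)
text = data.decode("utf-8")     # 5 code points
print(len(data), len(text))     # 6 5
for c in text:                  # iterating code points, not bytes
    print(hex(ord(c)))
```

Anything less than this, like handing back raw bytes and a shrug, pushes UTF-8 decoding onto every caller.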



