
Short answer: no.

Yes, the concept of String is problematic. It's an overloaded one that people have variously mapped to:

    a. An array of bytes. (C char is a byte.)
    b. "Words", from a (possibly fuzzy) set of 2-100k specific strings from natural language. 
    c. Arbitrary arrays of characters. 
    d. Arbitrary arrays of *printable* characters.
    e. Compact representations of abstractions, e.g. regexes which represent functions on strings. 
These have conflicting needs. For (a), most seasoned programmers have learned the hard way that byte[] and String need to be separate concepts, due to Unicode and encoding and the various nasty errors you get if you confuse UTF-8 and UTF-16; but also because random access into a byte[] of known structure is often a fast way of getting information, while random access into a String is generally inferior to regex matching.

Regarding (b), what you sometimes end up wanting is a symbol type (or, in Clojure, keywords) that gives you fast comparison. You might also want something that lives at a language level (rather than runtime strings) like an enum or tagged union (see: Scala, Ocaml) to get various validation properties.
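To make the symbol idea concrete, here's a hedged Java sketch (the HttpMethod enum is made up for illustration): an enum gives you interned symbols whose comparison is a reference check rather than a character-by-character walk, plus validation at the parse boundary.

```java
// Hypothetical example: an enum as a "symbol" type.
// Comparison is an O(1) reference check, and invalid names
// fail loudly at the boundary instead of flowing on as bad strings.
enum HttpMethod { GET, POST, PUT, DELETE }

public class SymbolDemo {
    public static void main(String[] args) {
        HttpMethod m = HttpMethod.valueOf("GET"); // parse once at the boundary
        System.out.println(m == HttpMethod.GET);  // reference comparison from here on
        try {
            HttpMethod.valueOf("GETT");           // typo: rejected immediately
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```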

Regarding (e), I think everyone agrees that regexes belong in their own type (or class).

Where there's some controversy is (c)-(d). There are over a million valid code points in non-extended Unicode, but only about 150,000 of them are assigned, and some have special meanings (e.g. byte-order marks). UTF-8/16 issues get nasty quickly if you don't know what you're doing. What all this means is that you can make very few assumptions about an arbitrary "string". You might not even have random access (see: UTF-8/16)! (Although a strong argument can be made that if you need random access into something, you don't want a string but a byte[]. Access into strings is usually done with regexes, not positional indices, for obvious reasons.)

As messy as Strings are across all use cases, the thing about them is that they work, and they're a fundamental concept of modern computing in practice. We can't get rid of them. We shouldn't. I don't like making them an abstract class, for the same reasons most people would agree that making Java's String final was the right decision. (Short version: inheritance mucks up .equals and .hashCode and breaks the world in hard-to-detect ways.)

What we do however need to keep in mind is that when we have a String, we're stuck with something that's meaningless without context. That's always true in computing, but easy to forget. What do I mean by "meaningless without context"? There's almost nothing that you know about something if it's a String.

On the other hand, if you have a wrapper called SanitizedString (some static-typing fu here) that immutably holds a String and the only way to get a value of that type is to pass a String through a SQLSanitize function, you know that it's been sanitized (or, at least, that the sanitizing function was run; whether it's correct is another matter). But this isn't a case of inheritance; it's a wrapper. You can use this to strengthen your knowledge about these objects (a function String -> Option[SanitizedString] returns Some(ss) only if the input string makes sense for your SQL work).
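A minimal Java sketch of that wrapper pattern (SanitizedString and the placeholder check are hypothetical; a real sanitizer would be dialect-specific): the private constructor means the only way to obtain a value of this type is through the sanitizing factory, so holding one is evidence the function ran.

```java
import java.util.Optional;

// Hypothetical wrapper: possession of a SanitizedString proves
// the sanitize() function was run on the underlying String.
final class SanitizedString {
    private final String value;
    private SanitizedString(String value) { this.value = value; }

    // String -> Optional<SanitizedString>: present only if the input passes.
    static Optional<SanitizedString> sanitize(String raw) {
        // Placeholder check for illustration only; a real sanitizer
        // would escape according to the target SQL dialect.
        if (raw.contains("'") || raw.contains(";")) {
            return Optional.empty();
        }
        return Optional.of(new SanitizedString(raw));
    }

    @Override public String toString() { return value; }
}
```

Note this is composition, not inheritance: SanitizedString is not a String, so it can't be passed where a raw String is expected without going back through the boundary.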

Inheritance I dislike because it tends to weaken knowledge. I think it's the wrong model, except for a certain small class of problem. What good there is in inheritance is being taken over by more principled programming paradigms (see: type classes in Haskell, protocols in Clojure).



We know some things for (a-d). For instance, a string is (in every case I've seen) a monoid: you can append to either end, and there exists an "empty" string which acts as the identity for appending.
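A quick check of the two monoid laws with Java string concatenation, where "" plays the identity:

```java
public class StringMonoid {
    public static void main(String[] args) {
        String a = "foo", b = "bar", c = "baz", e = "";
        // Associativity: (a <> b) <> c == a <> (b <> c)
        System.out.println(((a + b) + c).equals(a + (b + c)));
        // Identity: appending "" on either side changes nothing
        System.out.println((e + a).equals(a) && (a + e).equals(a));
    }
}
```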

For (e) this may not be true if we want to treat the string as semantically equivalent to whatever it's representing---you cannot always arbitrarily append represented objects. That said, a string is not that represented object unless there's an isomorphism. In Haskell types, you'd want a `String -> Maybe a` function to capture failure-to-translate. This case includes things like `SQLSanitize`, and also HTML header parsing whenever the headers have an interpretation at the current level of the system.

Note also that this value of (e) does not depend upon a representation of strings as any kind of character-based thing. You can make a monoidal representation of a regular expression from a star-semiring combinator set (and see http://r6.ca/blog/20110808T035622Z.html for a great example).

The remaining semantic troubles seem to be around ideas of mapping or folding over a string. Length depends on this representation---do I want character length or byte length?---as do functions like `toUppercase`. And then there's the entire encoding/byte-representation business.

So they should perhaps be instances of an abstract class of "Monoid" plus a mixin specializing the kind of "character" representation. Or, the outside-in method of Haskell's typeclasses where String, ByteString(.Char8), and Text each specialize to a different use case but all instantiate `Monoid` and each have some kind of `mapping` and `folding` function which specialize to the kind of "character" intended. Finally, there are partial morphisms between each of them which fail when character encoding does.


> the thing about them is that they work

The extent to which you think strings need replacement is the extent to which you disagree with this particular statement. The charitable view is that they let you get up and running without needing to work out every little detail yourself. The uncharitable view is that they appear to work far more often than they actually work.

Because any invariants you may want them to have are ad-hoc, any error with those invariants can very easily slip by the programmer, the compiler, and the unit tests. Depending on what type of programmer you are, this can be The Father Of All Sins.

Strings are low-level. They're a half-step above binary streams. And because they're ubiquitous and all libraries work in terms of strings, there's active pressure against creating new types that represent the invariants and structure that your application may actually need, which often results in those invariants never even being thought through in the first place.

Again, it's subjective, and depends strongly on your opinion of Worse is Better. I agree with your thoughts on inheritance. I think this problem is difficult to solve in a way that's not strictly worse than the original problem. I would be interested to see a language that only defines strings as a Protocol (or whatever) with a handful of implementations in the stdlib. I honestly have no idea if it would be the Garden of Eden or a steaming pile of ass, but I think it's important that we find out.


The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information. -- Allen J. Perlis


His name is "Alan." I'm not sure how you managed to copy-paste that improperly.


BTW: there is no such thing as a generic sanitized string. There may be a SQL-escaped string, HTML-escaped, JS-escaped, JS-in-HTML-in-SQL-escaped, etc. It always depends on context (I'm going to invent a format that uses ASCII 'a' in its escape sequences — sanitize that! ;)


Base64 encode a string and it's generically sanitized. They're a bit difficult to read with the naked eye though.


Unless it's in a URL. (This is why URL-safe Base64 versions exist... Which can then in turn be inappropriate for other places.)
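For what it's worth, Java's standard library ships both alphabets. A small sketch showing exactly where they differ (the input bytes are chosen so the standard output hits '+' and '/'):

```java
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] data = {(byte) 0xfb, (byte) 0xef, (byte) 0xff};
        // Standard alphabet uses '+' (62) and '/' (63)...
        String std = Base64.getEncoder().encodeToString(data);
        // ...while the URL-and-filename-safe variant swaps in '-' and '_'.
        String url = Base64.getUrlEncoder().encodeToString(data);
        System.out.println(std); // ++//
        System.out.println(url); // --__
    }
}
```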


And base64 can use the / character, so it's unsafe for POSIX filenames.


Until the next element in the pipeline chain decodes it and you can then have injection.


Sure. I was just using one example of sanitization: defense against Bobby Tables.



