
Short answer: no.

Yes, the concept of String is problematic. It's an overloaded one that people have variously mapped to:

    a. An array of bytes. (C char is a byte.)
    b. "Words", from a (possibly fuzzy) set of 2-100k specific strings from natural language. 
    c. Arbitrary arrays of characters. 
    d. Arbitrary arrays of *printable* characters.
    e. Compact representations of abstractions, e.g. regexes which represent functions on strings. 
These have conflicting needs. For (a), most seasoned programmers have learned the hard way that byte[] and String need to be separate concepts, due to Unicode and encoding and the various nasty errors you get if you confuse UTF-8 and UTF-16; but also because random access into a byte[] of known structure is often a fast way of getting information, while random access into a String is generally inferior to regex matching.

Regarding (b), what you sometimes end up wanting is a symbol type (or, in Clojure, keywords) that gives you fast comparison. You might also want something that lives at a language level (rather than runtime strings) like an enum or tagged union (see: Scala, Ocaml) to get various validation properties.
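To make the symbol idea concrete, here's a hedged Java sketch (the HttpMethod enum is made up for illustration): an enum gives you interned symbols whose comparison is a reference check rather than a character-by-character walk, plus validation at the parse boundary.

```java
// Hypothetical example: an enum as a "symbol" type.
// Comparison is an O(1) reference check, and invalid names
// fail loudly at the boundary instead of flowing on as bad strings.
enum HttpMethod { GET, POST, PUT, DELETE }

public class SymbolDemo {
    public static void main(String[] args) {
        HttpMethod m = HttpMethod.valueOf("GET"); // parse once at the boundary
        System.out.println(m == HttpMethod.GET);  // reference comparison from here on
        try {
            HttpMethod.valueOf("GETT");           // typo: rejected immediately
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```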

Regarding (e), I think everyone agrees that regexes belong in their own type (or class).

Where there's some controversy is (c)-(d). There are over a million valid code points in non-extended Unicode, but only about 150,000 of them are assigned, and some have special meanings (e.g. byte-order marks). UTF-8/16 issues get nasty quickly if you don't know what you're doing. What all this means is that you can make very few assumptions about an arbitrary "string". You might not even have random access (see: UTF-8/16)! (Although a strong argument can be made that if you need random access into something, you don't want a string but a byte[]. Access into strings is usually done with regexes, not positional indices, for obvious reasons.)

As messy as Strings are across all use cases, the thing about them is that they work, and they're a fundamental concept of modern computing in practice. We can't get rid of them. We shouldn't. I don't like making them an abstract class, for the same reasons most people would agree that making Java's String final was the right decision. (Short version: inheritance mucks up .equals and .hashCode and breaks the world in hard-to-detect ways.)

What we do however need to keep in mind is that when we have a String, we're stuck with something that's meaningless without context. That's always true in computing, but easy to forget. What do I mean by "meaningless without context"? There's almost nothing that you know about something if it's a String.

On the other hand, if you have a wrapper called SanitizedString (some static-typing fu here) that immutably holds a String and the only way to get a value of that type is to pass a String through a SQLSanitize function, you know that it's been sanitized (or, at least, that the sanitizing function was run; whether it's correct is another matter). But this isn't a case of inheritance; it's a wrapper. You can use this to strengthen your knowledge about these objects (a function String -> Option[SanitizedString] returns Some(ss) only if the input string makes sense for your SQL work).
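A minimal Java sketch of that wrapper pattern (SanitizedString and the placeholder check are hypothetical; a real sanitizer would be dialect-specific): the private constructor means the only way to obtain a value of this type is through the sanitizing factory, so holding one is evidence the function ran.

```java
import java.util.Optional;

// Hypothetical wrapper: possession of a SanitizedString proves
// the sanitize() function was run on the underlying String.
final class SanitizedString {
    private final String value;
    private SanitizedString(String value) { this.value = value; }

    // String -> Optional<SanitizedString>: present only if the input passes.
    static Optional<SanitizedString> sanitize(String raw) {
        // Placeholder check for illustration only; a real sanitizer
        // would escape according to the target SQL dialect.
        if (raw.contains("'") || raw.contains(";")) {
            return Optional.empty();
        }
        return Optional.of(new SanitizedString(raw));
    }

    @Override public String toString() { return value; }
}
```

Note this is composition, not inheritance: SanitizedString is not a String, so it can't be passed where a raw String is expected without going back through the boundary.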

Inheritance I dislike because it tends to weaken knowledge. I think it's the wrong model, except for a certain small class of problem. What good there is in inheritance is being taken over by more principled programming paradigms (see: type classes in Haskell, protocols in Clojure).



We know some things for (a-d). For instance, a string is (in every case I've seen) a monoid: you can append to either end, and there exists an "empty" string which acts as the identity for appending.
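A quick check of the two monoid laws with Java string concatenation, where "" plays the identity:

```java
public class StringMonoid {
    public static void main(String[] args) {
        String a = "foo", b = "bar", c = "baz", e = "";
        // Associativity: (a <> b) <> c == a <> (b <> c)
        System.out.println(((a + b) + c).equals(a + (b + c)));
        // Identity: appending "" on either side changes nothing
        System.out.println((e + a).equals(a) && (a + e).equals(a));
    }
}
```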

For (e) this may not be true if we want to treat the string as semantically equivalent to whatever it's representing---you cannot always arbitrarily append represented objects. That said, a string is not that represented object unless there's an isomorphism. In Haskell types, you'd want a `String -> Maybe a` function to capture failure-to-translate. This case includes things like `SQLSanitize`, and also HTML header parsing whenever the headers have an interpretation at the current level of the system.

Note also that this value of (e) does not depend upon a representation of strings as any kind of character-based thing. You can make a monoidal representation of a regular expression from a star-semiring combinator set (and see http://r6.ca/blog/20110808T035622Z.html for a great example).

The remaining semantic troubles seem to be around ideas of mapping or folding over a string. Length depends on this representation---do I want character length or byte length?---as do functions like `toUppercase`. And then there's the entire encoding/byte-representation business.

So they should perhaps be instances of an abstract class of "Monoid" plus a mixin specializing the kind of "character" representation. Or, the outside-in method of Haskell's typeclasses where String, ByteString(.Char8), and Text each specialize to a different use case but all instantiate `Monoid` and each have some kind of `mapping` and `folding` function which specialize to the kind of "character" intended. Finally, there are partial morphisms between each of them which fail when character encoding does.


> the thing about them is that they work

The extent to which you think strings need replacement is the extent to which you disagree with this particular statement. The charitable view is that they let you get up and running without needing to work out every little detail yourself. The uncharitable view is that they appear to work far more often than they actually work.

Because any invariants you may want them to have are ad-hoc, any error with those invariants can very easily slip by the programmer, the compiler, and the unit tests. Depending on what type of programmer you are, this can be The Father Of All Sins.

Strings are low-level. They're a half-step above binary streams. And because they're ubiquitous and all libraries work in terms of strings, there's active pressure against creating new types that represent the invariants and structure that your application may actually need, which often results in those invariants never even being thought through in the first place.

Again, it's subjective, and depends strongly on your opinion of Worse is Better. I agree with your thoughts on inheritance. I think this problem is difficult to solve in a way that's not strictly worse than the original problem. I would be interested to see a language that only defines strings as a Protocol (or whatever) with a handful of implementations in the stdlib. I honestly have no idea if it would be the Garden of Eden or a steaming pile of ass, but I think it's important that we find out.


The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information. -- Allen J. Perlis


His name is "Alan." I'm not sure how you managed to copy-paste that improperly.


BTW: there is no such thing as a generic sanitized string. There may be a SQL-escaped string, HTML-escaped, JS-escaped, JS-in-HTML-in-SQL-escaped, etc. It always depends on context (I'm going to invent a format that uses ASCII 'a' in its escape sequences — sanitize that! ;)


Base64 encode a string and it's generically sanitized. They're a bit difficult to read with the naked eye though.


Unless it's in a URL. (This is why URL-safe Base64 versions exist... Which can then in turn be inappropriate for other places.)
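For what it's worth, Java's standard library ships both alphabets. A small sketch showing exactly where they differ (the input bytes are chosen so the standard output hits '+' and '/'):

```java
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] data = {(byte) 0xfb, (byte) 0xef, (byte) 0xff};
        // Standard alphabet uses '+' (62) and '/' (63)...
        String std = Base64.getEncoder().encodeToString(data);
        // ...while the URL-and-filename-safe variant swaps in '-' and '_'.
        String url = Base64.getUrlEncoder().encodeToString(data);
        System.out.println(std); // ++//
        System.out.println(url); // --__
    }
}
```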


And base64 can use the / character, so it's unsafe for POSIX filenames.


Until the next element in the pipeline chain decodes it and you can then have injection.


Sure. I was just using one example of sanitization: defense against Bobby Tables.



