Couldn't help riffing off on a tangent from the title (since the article is about diagramming tools)...
Dylan Beattie has a thought-provoking presentation for anyone who believes that "plain text" is a simple / solid substrate for computing: "There's no such thing as plain text"https://www.slideshare.net/slideshow/theres-no-such-thing-as... (you'll find many videos from different conferences)
Haven't watched the videos yet, but from the slides, it looks like part of the issue he was talking about was encodings (there's a slide illustrating UTF-16LE ve UTF-16BE, for example). Thankfully, with UTF-8 becoming the default everywhere (so that you need a really good reason not to use it for any given document), we're back at "yes, there is such a thing as plain text" again. It has a much larger set of valid characters, but if you receive a text file without knowing its encoding, you can just assume it's UTF-8 and have a 99.7% chance of being right.
The point is, a lot of work went into making that happen. I.e., plain text as it is today is not some inherent property of computing. It is a binary protocol and displaying text through fonts is also not a trivial matter.
So my question is: what are we leaving on the table by over focusing on text? What about graphs and visual elements?
I was not very descriptive, but I was referring to the next layer up of building blocks. Instead of text, we could also express things in hybrid ways with text but also visual nodes that can carry more dense information. The usual response is that those things don't work with text-based tools, but that's my point. Text based tools needed invention and decades of refinement, and they're still not all that great.
> Thankfully, with UTF-8 becoming the default everywhere (so that you need a really good reason not to use it for any given document), we're back at "yes, there is such a thing as plain text" again.
Whenever I hear this, I hear "all text files should be 50% larger for no reason".
UTF-8 is pretty similar to the old code page system.
Hm? UTF-8 encodes all of ASCII with one byte per character, and is pretty efficient for everything else. I think the only advantage UTF-16 has over UTF-8 is that some ranges (such as Han characters I believe?) are often 3 bytes of UTF-8 while they're 2 bytes of UTF-16. Is that your use case? Seems weird to describe that as "all text files" though?
UTF-8 encodes European glyphs in two bytes and oriental glyphs in three bytes. This is due to the assumption that you're not going to be using oriental glyphs. If you are going to use them, UTF-8 is a very poor choice.
UTF-8 does not encode "European glyphs" in two bytes, no. Most European languages use variations of the latin alphabet, meaning most glyphs in European languages use the 1-byte ASCII subset of UTF-8. The occasional non-ASCII glyph becomes two bytes, that's correct, but that's a much smaller bloat than what you imply.
Anyway, what are you comparing it to, what is your preferred alternative? Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information and you can't mix languages in a text file? Or do you prefer using UTF-16, where all of ASCII is 2 bytes per character but you get a marginal benefit for Han texts?
Yikes. That would lose the ability to know the meaning of the current bytes, or misinterpret them badly, if you happen to get one critical byte dropped or mangled in transmission. At least UTF-8 is self-syncing: if you end up starting to read in the middle of a non-rewindable stream whose beginning has already passed, you can identify the start of the next valid codepoint sequence unambiguously, and then end up being able to sync up with the stream, and you're guaranteed not to have to read more than 4 bytes (6 bytes when UTF-8 was originally designed) in order to find a sync point.
But if you have to rely on a byte that may have already gone past? No way to pick up in the middle of a stream and know what went before.
No, we haven't. You can start at any byte in a UTF-8 document and resume reading coherent text. If you start reading from the middle of a multi code point sequence, then the first couple of glyphs may be wrong, for example you may see a lone skin tone modifier rendered as a beige blob where the author intended a smiley face with that skin tone. But these multi code point sequences are short, and the garbled text is bounded to the rest of the multi code point sequence. The entire rest of the document will be perfectly readable.
Compare this to missing a code page indicator. It will make the whole section until the next code page indicator garbled. The fact that you're even comparing these two situations as if they're the same is frankly ridiculous.
A file isn't meaningful unless you know how to interpret it; that will always be true. Assuming that all files must be in a preexisting format defeats the purpose of having file formats.
> Most European languages use variations of the latin alphabet
If you want to interpret "variations of Latin" really, really loosely, that's true.
Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters. This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.
As someone who has been using Cyrillic writing all my life, I've never noticed this bloat you're speaking of, honestly...
Maybe if you're one of those AI behemots who works with exabytes of training data, it would make some sense to compress it down by less than 50% (since we're using lots of Latin terms and acronyms and punctuation marks which all fit in one byte in UTF-8).
On the web and in other kinds of daily text processing, one poorly compressed image or one JavaScript-heavy webshite obliterates all "savings" you would have had in that week by encoding text in something more efficient.
It's the same with databases. I've never seen anyone pick anything other than UTF-8 in the last 10 years at least, even though 99% of what we store there is in Cyrillic. I sometimes run into old databases, which are usually Oracle, that were set up in the 90s and never really upgraded. The data is in some weird encoding that you haven't heard of for decades, and it's always a pain to integrate with them.
I remember the days of codepages. Seeing broken text was the norm. Technically advanced users would quickly learn to guess the correct text encoding by the shapes of glyphs we would see when opening a file. Do not want.
UTF-8 does not require a byte order mark. The byte order mark is a technical necessity born from UTF-16 and a desire to store UTF-16 in a machine's native endianness.
The byte order mark has has no relation to code pages.
I don't think you know what you're talking about and I do not think further engagement with you is fruitful. Bye.
EDIT: okay since you edited your comment to add the part about Greek and cryllic after I responded, I'll respond to that too. Notice how I did not say "all European languages". Norwegian, Swedish, French, Danish, Spanish, German, English, Polish, Italian, and many other European languages have writing systems where typical texts are "mostly ASCII with a few special symbols and diacritics here and there". Yes, Greek and cryllic are exceptions. That does not invalidate my point.
UTF-8 may still be a good choice for Japanese text, though.
For one thing, pure text is often not the only thing in the file. Markup is often present, and most markup syntaxes (such as HTML or XML) use characters from the ASCII range for the markup, so those characters are one byte (but would be two bytes in UTF-16). Back when the UTF-8 Everywhere manifesto (https://utf8everywhere.org/) was being written, they took the Japanese-language Wikipedia article on Japan, and compared the size of its HTML source between UTF-8 and UTF-16. (Scroll down to section 6 to see the results I'm about to cite). UTF-8 was 767 KB, UTF-16 was 1186 KB, a bit more than 50% larger than UTF-8. The space savings from the HTML markup outweighed the extra bytes from having a less-efficient encoding of Japanese text. Then they did a copy-and-paste of just the Japanese text into a text file, to give UTF-16 the biggest win. There, the UTF-8 text was 222 KB while the UTF-16 encoding got it down to 176 KB, a 21% win for UTF-16 — but not the 50% win you would have expected from a naive comparison, because Japanese text still uses many characters from the ASCII set (space, punctuation...) and so there are still some single-byte UTF-8 characters in there. And once the files were compressed, both UTF-8 and UTF-16 were nearly the same size (83 KB vs 76 KB) which means there's little efficiency gain anyway if your content is being served over a gzip'ed connection.
So in theory, UTF-8 could be up to 50% larger than UTF-16 for Japanese, Chinese, or Korean text (or any of the other languages that fit into the higher part of the basic multilingual place). But in practice, even giving the UTF-16 text every possible advantage, they only saw a 20% improvement over UTF-8.
Which is not nearly enough to justify all the extra cost of suddenly not knowing what encoding your text file is in any more, not when we've finally reached the point of being able to open a text file and just know the encoding.
P.S. I didn't even mention the Shift JIS encoding, and there's a reason I didn't. I've never had to use it "for real", but I've read about it. No. No thank you. No. Shudder. I'm not knocking the cleverness of it, it was entirely necessary back when all you had was 8 bits to work with. But let me put it this way: it's not a coincidence that Japan invented a word (mojibake) to represent what happens when you see text interpreted in the wrong encoding. There were multiple variations of Shift JIS (and there was also EUC-JP just to throw extra confusion into the works), so Japanese people saw garbled text all the time as it moved from one computer running Windows, to an email server likely running Unix, to another computer running Windows... it was a big mess. It's also not a coincidence that (according to Wikipedia), 99.1% of Japanese websites (defined as "in the .jp domain") are encoded in UTF-8, while Shift JIS is used by only 1% (probably about 0.95% rounded up) of .jp websites.
So in practice, nearly everyone in Japan would rather have slightly less efficient encoding of text, but know for a fact that their text will be read correctly on the other end.
I can't tell what the argument is just from the slideshow. The main point appears to be that code pages, UTF-16, etc are all "plain text" but not really.
If that really was the argument, then it is, in 2026, obsolete; utf-8 is everywhere.
He has a YouTube channel, there's a talk on there.
He also discusses code pages etc.
I don't think the thesis is wrong. Eg when I think plain text I think ASCII, so we're already disagreeing about what 'plain text' is. His point isn't that we don't have a standard, it's that we've had multiple standards over what we think is the most basic of formats, with lots of hidden complications.
I read that article long time ago, and for me it's a hard disagree. A system as complex and quirky as Unicode can never be considered "plain", and even today it is common for many apps that something Unicode-related breaks. ASCII is still the only text system that will really work well everywhere, which I consider a must for calling something plain text.
And yes, ASCII means mostly limiting things to English but for many environments that's almost expected. I would even defend this not being a native English speaker myself.
I agree with you that Unicode is too complicated and messy, although it also shows that whether or not something is considered "plain" is itself too difficult.
Unicode has caused many problems (although it was common for m17n and i18n to be not working well before Unicode either). One problem is making some programs no longer 8-bit clean.
Unicode might be considered in two ways: (1) Unicode is an approximation of multiple other character sets, (2) All character sets are an encoding of a subset of Unicode. At best, if Unicode is used at all, it should be used as (1) (as a last resort), but it is too common for Unicode to be used as (2) (as a first resort), which is not good in my opinion.
(I mostly avoid Unicode in my software, although it is also often the case (and, in many (but not all) programs, should be the case) that it only cares about ASCII but does not prevent you from using any other character encodings that are compatible with ASCII.)
> ASCII is still the only text system that will really work well everywhere, which I consider a must for calling something plain text.
Yes, it does work well (almost) everywhere.
Supersets of ASCII are also common, including UTF-8, and PC character set, ISO 2022 (if ASCII is the initially selected G0 set, which it is in the ASN.1 Graphic string and General string types, as well in most terminal emulators), EUC-JP, etc. In these cases, ASCII will also usually work well.
However, as another comment mentions, and I agree with them, that if you mean "ASCII" then it is what you should say, rather than "plain text" which does not tell you what the character encoding is. That other comment says:
> Plain text is text intended to be interpreted as bytes that map simply to characters.
However, it is not always so clear and simple what "characters" is, depending on the character sets and what language you are writing. And then, there are also control characters, to be considered, so it is again not quite so "plain".
> And yes, ASCII means mostly limiting things to English but for many environments that's almost expected. I would even defend this not being a native English speaker myself.
In my opinion, it depends on the context and usage. One character set (regardless of which one it is) cannot be suitable for all purposes. However, for many purposes, ASCII is suitable (including C source code; you might put l10n in a separate file).
You should have proper m17n (in the contexts where it is appropriate, which is not necessarily all files), but Unicode is not a good way to do it.
But at the same time the motor is extremely finicky/fragile in the source of energy (negentropy) it will accept, while natural life is extremely hardy and adaptable.
I wonder how much of machine-like "efficiency" is actually "overfitting" at the cost of robustness.
Who are we to say the mechanisms of biology are “overfit”? Maybe it would be nice if our personal machinery was more robust, but that robustness comes at an evolutionary cost. The greater force that is life on earth as a system for regulating planetary energy dissipation does not care about the needs of the individual. It does not care about the fashions of the millennia. Its sights are set much farther and its history much deeper than that
For more complicated organisms, robustness comes in the form of cellular turnover, and regenerative healing in response to injury, at least in youth. I wonder though if single celled organisms have or even need such a function.
> pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users
Of course, but that's the difference between sins of commission and sins of omission. The question is what "pretty diligent" actually translates to in practice. How many people will encourage delays in a model release or post-training improvement waiting "for more thorough evaluation"? How many popularized AI results can you vouch for on this?
The zeitgeist is to celebrate bias for action, avoiding analysis paralysis and shipping things (esp. with conference driven research culture, even before we get into thorny questions of market dynamics), so even if we have a few pockets of meticulous excellence, the incentive structure pushes towards making the whole field rot.
Is the cost of laying fiber (via a public utility) to each household counted as part of the monthly internet bill, or is that funded separately? (eg. as part of property taxes)
Yes, and that becomes more intuitive when you "un-curry" the nested lambdas into a single lamba with twice the number of arguments. The point is that the state of a constant does not depend whatsoever on the state of the (rest of the) world, how much ever of that state piles on.
Yes. Cars have fixed cost ranges for people so the end result is pretty much predetermined - electric cars settling at the same price and quality points as ICE cars are today.
> what's wrong with legitimately not knowing what e.g. the data structure will end up looking?
But that's not what the above comment said.
> Just let it run, check debugger/stdout/localhost page and adjust: "Oh, right, the entries are missing canonical IDs, but at the same time there are already all the comments in them, forgot they would be there
So you did have an expectation that the entries should have some canonical IDs, and anticipated/desired a certain specific behavior of the system.
Which is basically the meaning of "what will the output be?" when simplified for programming novices at university.
I wonder whether the blast radius of the law might interfere with OSs running on cloud machines. That might explain why California based companies in the cloud business might want to ensure that the bits they resell are compliant.
To elaborate on @jeswin's point above (IDK why it got downvoted)... a data structure is basically like a cache for the processing algorithm. The business logic and algorithm needs will dictate what details can be computed on-the-fly -vs- pre-generated and stored (be it RAM or disk). Eg: if you're going to be searching a lot then it makes sense to augment the database with some kind of "index" for fast lookup. Or if you are repeatedly going to be pllotting some derived quantity then maybe it makes sense to derive that once and store with the struct.
It's not enough for a data structure to represent the "fundamental" degrees of freedom needed to model the situation; the algorithmic needs (vis-a-vis the available resources) most definitely matter a lot.
Dylan Beattie has a thought-provoking presentation for anyone who believes that "plain text" is a simple / solid substrate for computing: "There's no such thing as plain text" https://www.slideshare.net/slideshow/theres-no-such-thing-as... (you'll find many videos from different conferences)
reply