Hacker News | jesuscyborg's comments

They just lost me a hundred dollars worth of steak tonight. Their service has become increasingly incompetent in recent months, to the point where >50% of orders arrive cold because they sit at the restaurant so long waiting to be picked up. Now with this outage the service has gone from incompetent to incapable. Support small businesses and don't go through the middleman. Restaurants will love you.


YouTube has a demonetization feature too. Let's say you run a programming tutorials channel that's too small to monetize and you don't want Google layering ads over your content. Just crosspost something like a Tim Pool video and hope the YouTube AI categorizes your channel as too toxic for advertisers. Otherwise it'll be interesting to see how the standard changes when Google is the sole beneficiary. That money is too toxic for thee, not for me.


They’re running ads now on channels that were previously deemed too toxic for advertisers, which makes me doubt that that was ever a real concern and not just an excuse to defund people who made videos they didn’t like.


I'd rather have ads on my videos than ruin my reputation by posting trash like Tim Pool.


Have you read James Damore's essay? If you look at only the abstract theory, there wasn't a whole lot of 'think differently' or rebellious points of view in there. Totally identical mindset to the status quo establishment. He just decided to be rude while presenting his arguments using non-technical language, and pointing out things about people that they're powerless to change. That's what makes it scary. For example, no one would have thought the worst of him if he wrote a treatise talking about behavioral correlations between samples having or not having the SRY gene and called out publishers for suppressing such statistics. Instead he called women neurotic. Not a great way to speak truth to power.


Not defending the rest of Damore, but the neurotic thing is a reference to well-known research: https://en.wikipedia.org/wiki/Neuroticism. The fault in that case lies with whoever named a technical research concept using a word that already had negative popular associations.


Not true. All you have to do is launch Matthew McConaughey into a black hole and he'll solve it.


Unfortunately “love” is difficult to experimentally test.


Building a good product will get you a pat on the back. What you need is a good product + leverage. http://paulgraham.com/wealth.html


The way I'd code a better search engine is I'd design an ML model that's trained to recognize handwritten HTML like this, and only add those pages to the index. It'd be cheap to crawl, probably needing only a single computer to run the whole search engine. It'd resurrect The Old Web, which still exists but got buried beneath the spammy, SEO-optimized grifter web over the years as normies flooded the scene.


I hope to never use your search engine. I love handwritten HTML as much as the next guy, but search engines are made to find things. And useful information exists on web sites that use generated and/or minified HTML.


Thanks for that buzzkill. I guess the lesson is if you can't do everything Google does, don't even try.


> I guess the lesson is if you can't do everything Google does, don't even try

I don't agree with that at all. But if your goal is to make "a better search engine" as you said, it does actually have to be "better" and not just different.


If you want a CRT that's compile-once run-anywhere for x86 then try cosmopolitan with actually portable executable. https://justine.storage.googleapis.com/ape.html It even supports fork() on windows as of a few days ago: https://github.com/jart/cosmopolitan/commit/db33973e0aae7ffc...


That is 9 :)

In effect it is yet another CRT and the idea is sound. But many times what I found was you may have one that works on, say, linux and windows and bsd. All the same 'code', but you dig under the covers a bit and it is a maze of ifdefs, so each platform has its own quirks. For example, threading between fork and CreateThread is on the surface not too different and you can wrap CreateThread with it (several libs did). But you dig into it a bit and you find portions that just do not map at all between the systems (usually with IPC and locks). At best they do not compile, slightly worse they return error codes, at worst they act like they work.

A real good example of what I am talking about is the pthread library. It works up to a point but it is a very linux/bsd orientated library. There are some gaps in there from windows that just do not map and the other way around. What is worse is the docs on some of these do not talk about cross platform issues. Luckily you can see the source code of most of them and can tell what is going on. Annoying but one of the things I learned moving code between platforms is that each one has its own way of doing things. You can try to work against it or sit down and unwind what is going on, which takes time. I have even seen this sort of issue in python and java. Where you get down to some low level thing and it just is different on different platforms.


Use wcspbrk.

UTF-8 continuation bytes are limited to the range \200 through \277 so there's basically zero chance that if you choose something like comma as your delimiter that it's going to tokenize the middle of a multibyte sequence.

Also take into consideration that, under the hood, functions like strpbrk() are typically accelerated by CPU instructions such as PCMPISTRI, which doesn't support UTF-8 natively but does support UCS-2.
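
A minimal Python sketch of the delimiter point above, assuming the input is valid UTF-8. The helper name `split_utf8_on_comma` is made up for illustration. It splits on the raw comma byte before any decoding, and each piece should still decode cleanly because the comma byte never appears inside a multibyte sequence.

```python
def split_utf8_on_comma(data):
    """Split raw UTF-8 bytes on the ASCII comma, then decode each field.

    Hypothetical helper: the split happens at the byte level, so it only
    works because 0x2C never occurs inside a multibyte UTF-8 sequence.
    """
    fields = data.split(b",")
    return [field.decode("utf-8") for field in fields]


raw = "naïve,日本語,😀".encode("utf-8")
print(split_utf8_on_comma(raw))
```

If a byte-level split could land mid-sequence, one of the `decode` calls would raise `UnicodeDecodeError`.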


> so there's basically zero chance that if you choose something like comma as your delimiter that it's going to tokenize the middle of a multibyte sequence.

Not just "basically;" there is no possible collision between ASCII characters and any valid multibyte encoding. This can be seen somewhat visually in this table[1] and is an intentional aspect of the UTF-8 design.

[1]: https://en.wikipedia.org/wiki/UTF-8#Encoding
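
For anyone who would rather check this than read the table, here is a rough Python sketch. The function name `first_multibyte_with_ascii_byte` is an assumption made for this example. It walks a range of code points and looks for any multibyte encoding that contains a byte below 0x80.

```python
def first_multibyte_with_ascii_byte(start=0x80, stop=0x110000):
    """Return the first code point in [start, stop) whose UTF-8 encoding
    contains a byte below 0x80, or None if no such code point exists."""
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if any(byte < 0x80 for byte in chr(cp).encode("utf-8")):
            return cp
    return None


# Scans every non-ASCII code point; expected to find nothing.
print(first_multibyte_with_ascii_byte())
```

Starting the scan inside the ASCII range instead (for example at 0x41) returns immediately, since ASCII characters encode as themselves.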


How about with joiners and combining characters? E.g., if you encode é as U+0065, U+0301 (\x65\xcc\x81), then search for 'e' and act on the result somehow, you fail to consider the whole glyph.


Sure. You're talking about glyphs that are composed of multiple unicode codepoints; my earlier comment is true of single codepoints only. The comment I was responding to is also talking only about single codepoints (wcspbrk cannot represent delimiters longer than a single codepoint).

On joiners / combining characters: I'd encourage using composed normalization (NFC) rather than decomposed normalization (NFD).

Just out of curiosity: are there any glyphs that lack a single codepoint representation, where one of the joined codepoints is an ASCII character? (That only helps after normalization, of course.)
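
A small Python sketch of the NFC suggestion, using the standard `unicodedata` module. The helper `find_after_nfc` is a name invented for this example. It shows a search for a bare 'e' matching inside a decomposed é, and no longer matching once the text is composed.

```python
import unicodedata


def find_after_nfc(text, needle):
    """Normalize text to NFC, then return the index of needle (or -1)."""
    return unicodedata.normalize("NFC", text).find(needle)


decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(decomposed.find("e"))             # 3: matches the base letter of é
print(find_after_nfc(decomposed, "e"))  # -1: é is now the single code point U+00E9
```

This only helps for sequences that have a precomposed form, which is the caveat raised in the question above.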


Yes. ASCII uses \b as the combining character mark which is a convention that's always been widely supported by typesetting programs such as less and nroff. For example, A\b_ is A̲, and you can do the same thing with apostrophe and tilde for accent marks. There's also UNICODE emojis where two codepoints in sequence get joined together as a single glyph. Never underestimate the creative ways text can be used, or that standards just codify a long history of practices.


Er, I was asking about unicode joining, not this roff \b thing. Sorry for the confusion. I'm aware that multiple-codepoint unicode glyphs exist; I'm asking if any of those involve a codepoint in the ASCII (0-127) range and cannot be normalized to a single codepoint (e.g., e + a combining acute accent does normalize to a single codepoint, é).


Of course. Take for example mͫ (m+m): there's no way to represent that as a single codepoint. Combining marks can also be overlaid multiple times, e.g. m͚ͫ (m+m+∞), so the number of glyphs you can create is limitless. Only a tiny fraction of the possible combinations have a shorter normalized form. The new UNICODE combining marks work by almost exactly the same principles as the \b ASCII combining mark. That's why I mentioned it earlier.
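
A quick way to see the difference in Python, assuming the mark in question is U+036B COMBINING LATIN SMALL LETTER M. The helper `nfc_length` is a name made up for this sketch. It counts code points after NFC normalization, so a sequence with a precomposed form shrinks and one without stays the same length.

```python
import unicodedata


def nfc_length(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


print(nfc_length("e\u0301"))  # 1: composes to U+00E9
print(nfc_length("m\u036b"))  # 2: no precomposed form, stays as two code points
```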


Thanks!


In what data format or programming language is 'e' a delimiter? One situation is floating-point constants, where 'e' is a delimiter indicating the exponent. However, if an é occurs in the middle of such a constant, whether as a single code point or a combined character, that is an error. The 'e' must be followed by an optional sign and one or more decimal digits.

The ISO C library string handling stuff is for systems programming, not for scanners and parsers for natural written language.


The historic planes beyond the basic multilingual plane are usually referred to as the "astral planes" which includes things like gothic, runes, alchemy, egyptian, and emoji https://justine.storage.googleapis.com/astralplanes.txt


And the etymology of this being that Dungeons and Dragons has a "Prime Material Plane" and an "Astral Plane", where the Astral Plane connects the PMP to various "Outer Planes" made of ridiculous not-oft-encountered stuff.

But whoever came up with this cute analogy got the analogy wrong — the higher Unicode planes are analogous to the "outer planes" themselves; while the "astral plane" would be some sort of glue allowing you to access these outer planes from within the BMP. Like... surrogate-pair characters! One could nickname the reserved surrogate-pair range in the BMP, the "astral projection" range ;)
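
For reference, the surrogate-pair mechanism mentioned above is plain arithmetic. This is a sketch in Python with a made-up function name, `to_surrogate_pair`, following the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Map a supplementary-plane code point to its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)  # U+1F600, an emoji outside the BMP
print(hex(high), hex(low))              # 0xd83d 0xde00
```

The result can be cross-checked against Python's own encoder: `"\U0001F600".encode("utf-16-be")` yields the same four bytes.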


"Astral plane" predates Dungeons and Dragons by centuries. Looking at old discussions, I couldn't find any evidence that Unicode's usage is connected with D&D.

Early discussion of "astral character" or "astral plane" for the Unicode supplementary planes at: https://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024... Even earlier 1998 use: https://www.unicode.org/L2/L1998/98354.pdf


The term "astral plane" is older than D&D, and I would assume they took it from the more general usage, not the specific usage in D&D. https://en.wikipedia.org/wiki/Astral_plane


I’ve met several members of the Unicode standard committee. They’re nerds. The kind of nerds for whom “Astral Plane” is a multilayered joke. It’s not not about the general usage, but nor is it not about the D&D term.

