There are many comments about potential abuse. I would be curious to know whether your team has ever split up for a challenge in which one half tries to look like a real person accessing a site while the other half tries to detect and block them. If anyone could do this, it would be the creators of Headless.
Why go through the exercise, one may ask? I believe it would be a critical thinking exercise to improve Headless even more while giving website maintainers a way to opt out of receiving traffic from it. If not your team, have you reached out to see if people from Project Zero would take on that challenge in their abundance of spare time? [1]
We regularly get feature requests for Headless to provide a field or property that can be polled by JS frameworks to detect if Headless is active, e.g. window.isBot.
Well, Headless is open source, which means anybody could build a Headless version with such a property set to "I am a human, trust me!" and employ such a modified binary ... ;-)
Oh absolutely, relying on a header would be a placebo at best. I was thinking more along the lines of having two teams, one that develops Headless and another team at Google that tries to defeat it nonstop. An official game of cat and mouse. Project: Tom and Jerry? I guess legal would never buy into that name.
My own personal method for my silly hobby sites is just to put passwords on things with an auth prompt delay.
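For anyone wondering what an auth prompt delay might look like, here is a minimal sketch. The function name, the delay value, and the plain-text comparison are all assumptions for illustration; a real site would compare against a stored password hash.

```python
import hmac
import time


def check_password(supplied: str, expected: str, delay_seconds: float = 2.0) -> bool:
    """Compare two passwords, then pause so every attempt costs the caller time."""
    # compare_digest avoids leaking how many leading characters matched.
    # Assumption: plain-text comparison keeps the sketch short; store hashes in practice.
    ok = hmac.compare_digest(supplied.encode(), expected.encode())
    # Sleep on success and on failure so the delay itself reveals nothing.
    time.sleep(delay_seconds)
    return ok
```

The point of the delay is only to make bulk guessing slow; it does nothing against a patient attacker or one who opens many connections at once.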
Why should Google redteam their headless browser though? As other comments point out there's plenty of ways for bot detectors to id bots even with a browser which mirrors a normal one: https://news.ycombinator.com/item?id=34858056
Almost all of those things are outside of the scope of the browser itself. And anyone doing serious bot attacks already has scripts/forks that modify these signals. I don't see how the Chrome team could do much to help stop that at that level.
In theory their blue team could come up with even more advanced puzzles that bots trip over and then open source and document the bot puzzles. I don't know that they would, incentives or lack thereof and all. If nothing else it might make their work day more fun.
Or if I put my evil corp hat on, the incentive could be that they make puzzles that only Headless can get around and all other bots become trivial to block and obsolete by even the least knowledgeable hobbyist. Perhaps Google release Nginx, Apache HTTPD, Apache Traffic Server, Envoy and HAProxy modules that only Headless can get around and all other bots internet-wide are entirely silenced. Chrome becomes the one and only bot to rule them all.
I suppose that Google going through that exercise would mean that they get market dominance on bot gathering data and anyone not using Chrome Headless would be unable to obtain freebie data. This could enable future features whatever that may be. *readjusts hat* One future feature could be auto-discovery of Google DNS and Google proxies in GCP so they can learn about new data sources through crowd-sourcing thus making their big-data sets more complete and their machine learning more powerful. Developers could block the proxies or compile them out but as we know most people are too lazy to do this and many won't care.
Another advantage would be that eventually the only bots abusing Google would be bots using their code, which they would know how to detect and deal with, as they would implement their own open source anti-bot modules in their web servers, load balancers, etc...
There are more obscure ideas but I am doffing the hat before the hat-wraiths sense it.
You jest, but I could actually see this becoming a thing. I envision a future dystopian internet where people first have to authenticate their network gear, PC's, laptops, cell phones, cars, trucks, e-bikes, toasters, coffee makers to a government contracted service. Once authenticated they utilize something similar to that RFC but probably instead a nonce or jwt token tied to their device that gets embedded in the packet header somehow. Then sanctioning a continent, country, state, ISP, city, company, manufacturer, distributor or person would be simply disabling their evil bits so to speak.
The push for this is starting with adult content [1] but the goal posts could easily be mounted on a train car with a very long and smooth train track that only goes downhill.
There's a huge amount of aggro pissy shitthrowing that Chrome is facilitating automation in these threads. Bollocks.
You know what? The Internet Is For End Users [1]. If we're going to cite an RFC, it should be RFC 8890. Not having a better headless Chrome would be a violation of the most basic principles of the internet.
There are some cases where automation can get out of hand, but blocking these efforts should not come at user expense. So says RFC 8890, and a general collective belief/hum-in-the-room. The availability of a good browser like Chrome to help should not be an issue, given how many other ways bad players have to go too far & cause harm to sites. The people who have to deal with this are not the priority & this doesn't radically change their troubles; this radically helps end users wishing to exercise agency though.
In most cases being able to script & automate a site is a completely primitive user-agency, of no special regard. Headless Chrome being a somewhat tolerable way of doing that scripting is 100% moral, correct. It greatly assists us in fulfilling a primary & clear overarching purpose of the internet: to be for end users.
I wish I could say I cannot believe the complaining & whining & snivelling, the pretentious-nonsense/acting-offended that Chrome would dare help make good automation. I wish I could say I don't think this crowd recognizes nor comprehends the basic purpose of the internet, but again, I think I know better; I suspect they do but their protests are disingenuous, that they have allied their hearts with darker forces, against the user.
>Headless is open source, which means anybody could build a Headless version with such a property set to "I am a human, trust me!"
This is flawed reasoning. Just because we can't eliminate abuse from headless browsers doesn't mean we shouldn't work to reduce it. Finding such a modified binary or making it yourself is additional friction that will cause fewer of these bots to exist. Some people may not care whether a website is able to block them, and some may not decide to do the work to read the robots.txt. By implementing these capabilities into the product by default you are making the web ecosystem a better place with less abuse. You are right that someone could make a version without the anti-abuse parts, but surely that fork will be less popular and less used.
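To give a sense of how little work "reading the robots.txt" is for a well-behaved client, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the URLs are made up for illustration; nothing here is something Headless does today.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real client would fetch it from the site first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit this agent to fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("ExampleBot", "https://example.com/public/page"))
    print(allowed("ExampleBot", "https://example.com/private/data"))
```

A check like this only restrains clients that choose to run it, which is the friction argument above: the default matters even though a fork can remove it.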
If I run a soup kitchen, and Google is sending robots to my establishment which are indistinguishable from humans, I should have the right to ask if the client is a robot.
I would hope that Google's robots would not be programmed to lie to me, but would be honest.
If robots are required to be honest, then I have a choice to serve them or not. If they are not honest, I do not have a choice.
While I appreciate your answer from a technical point of view - indeed it is trivial to modify/spoof - there is an ethical dimension.
Should bots have the legal right to say they are human?
For example - if Google Inc is visiting a web page to collect information about it using a headless browser, and the server asks - are you a bot - should Google be legally or ethically allowed to answer no? (declarations in headers could remove the need for question/answer chatter.)
(I want to pre-empt dismissing this line of questioning via 'what if Google wants to know how the site will be served to a human for better search results', because Google could include a specific header for that, e.g. "I am a bot, but request that you serve the version of this page served to humans". It would be up to the server to honor or reject that request.)
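A rough sketch of how a server might act on such declarations, purely as an illustration. The header names `X-Automated-Client` and `X-Request-Human-Version` are invented here; no such headers are standardized, and the scheme only works for clients that answer honestly.

```python
from typing import Mapping


def choose_variant(headers: Mapping[str, str], honor_human_request: bool = True) -> str:
    """Pick which version of a page to serve: "human" or "bot".

    Both header names are hypothetical and exist only for this sketch.
    """
    # HTTP header names are case-insensitive, so normalize before looking them up.
    normalized = {name.lower(): value.strip().lower() for name, value in headers.items()}
    declared_bot = normalized.get("x-automated-client") == "true"
    wants_human_page = normalized.get("x-request-human-version") == "true"

    if not declared_bot:
        # No declaration: the server has to assume a human (or a dishonest bot).
        return "human"
    if wants_human_page and honor_human_request:
        # The bot asked for the human version and site policy allows it.
        return "human"
    return "bot"
```

The `honor_human_request` flag is where the server keeps its choice: it can accept the bot's request for the human page or refuse it.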
The defaults Google choose have compounding effects in our society. If you make it "normal" for bots to pretend to be human, the industry has minimal pressure to hold any standard above what you do, and better norms may never appear, or be delayed by a decade. The alternative is to be thoughtful today to try to create a better world.
Headed chrome adds a huge amount of overhead, and can also be fingerprinted more easily. This is a lot more declarative and makes it easier to run an abuse farm. Although, per my other comment, I don't see Headless as a tool that will particularly move the needle on abuse cases.
Isn't headed chrome usually fingerprinted by variables inserted by the chromedriver? You can rename these variables and be undetectable (you don't even have to recompile chromedriver, you can use a hex editor or a perl replacement).
There are even Puppeteer plugins that will do it for you. [^1]
The best detection I've come across so far (i.e. before this release) has just required I run headless Chrome in headed mode. Granted, I don't do a ton of scraping -- mostly just pulling data out of websites so that I can play with it in aggregate using more civilized tools.
I am that anyone you mentioned. For example, autoposting on 4chan works very well for me. I spam goods on 4chan to buy or create opinions that I force.
Would you please stop posting in the flamewar style? We've had to ask you this in the past as well. It's not what this site is for, and destroys what it is for.
Because it suggests adding usage controls, possibly enforced via cloud connectivity, to add restrictions that will inevitably make legitimate usage more difficult, frustrating, and most importantly, subject to outside control. Extend this far enough and the world starts to look like Doctorow's "Unauthorized Bread".
This is an awful world, one designed to reinforce class divide and protect the entrenched and the rich by deliberately handicapping easily-accessible tools, because of a few bad actors. It creates a world where the code for literally everything is the most hideously complex version of itself because it is riddled with constant checks, phone-homes, and arbitrary usage limits. It further pushes us towards a disempowering future where our computing is limited exclusively to appliance-like devices whose inner workings are controlled for it. It stands against the very principle of general-purpose computing.
If you are a soy developer who thinks Cloudflare is a god that should solve problems for you, and you use O(n^2) or even worse algorithms in your code so that you can't even optimize it, that is only your problem, correct.
In 2000, sites were running code written precisely so that a DDoS attack was impossible. Now it is a heckin sauce of obfuscated, proprietary JS malware.
If your site is like this, you deserved it. Cloudflare and similar companies just want your money for solving a 5-minute problem like a WAF that is just a regex, and you have limits even on user agent filtering, lol.
Stop making shitcode and learn HTTP and TCP/IP theory, and you will make antispam filter that is 200% better than any cloudflare shit that is simply malware that runs cryptominer as a "IUAM" mode for their own benefit and you even pay for it.
For what it's worth, the large "players" already seem to have this capability. They've forced pretty much everyone to roll out captchas, waf-level throttling, proof of work interstitials, and behavior-based fingerprinting.
While my immediate response was the same as yours, I think this actually won't really change much in the way of bad actors.
It's unfortunate, but basic controls (such as throttling, etc) are pretty much a floor-required feature - one way to avoid this burden is to do things like use 3rd party idp (aka google login). I'm not happy with the state of things but I don't think headless will particularly contribute to a material increase in abuse cases.
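For readers who have not had to build the throttling part, a token bucket is one common way to do it. This is a minimal single-process sketch; the class name, parameters, and injectable clock are my own choices, and a real deployment would keep one bucket per client key in shared storage.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        """Spend one token if available; return False when the caller should be throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=1.0` and `capacity=2.0`, a client gets two requests immediately and then one more per second, which slows a scripted client regardless of whether it runs headless or headed.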
I didn't know this was a restriction before! Interesting. I would have assumed old headless had a profile, and that typical command-line efforts[1] would let one load extensions. Are we sure that previous headless Chrome didn't have profiles or couldn't load extensions? I'm not sure this question is valid; I think maybe the assumptions here are incorrect.
The new Chrome headless certainly purports to be "just Chrome" "without actually rendering." One of the notable differences in the new headless mode is that it at least shows the stock/built-in extensions. From the submission:
> Similarly, when it comes to plugins, the old headless Chrome used to return no plugins with navigator.plugins, which is a technique that used to be exploited for detection when Headless Chrome got released 6 years ago, cf this blog post. The new headless Chrome returns the same plugins as a headful Chrome, and that’s the same for the mimeTypes obtained with navigator.mimeTypes:
Perhaps the new headless is faking it, but my impression is that extensions definitely work as normal in the new headless Chrome. How or whether they worked before is another very very interesting question I'd like answers to.
I do wish the AMA dev had actually replied to this. My hope is that this wasn't an issue before (but default plugins just weren't installed, and now they are, just to alter fingerprinting), and that now the situation is unchanged but default plugins are installed.
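To make the fingerprinting point concrete, here is a sketch of the kind of server-side check the quoted passage describes for old headless. The fingerprint dictionary shape (`userAgent`, `plugins`) is an assumption about what a client-side script might report back. Per the submission, the empty-plugins signal no longer distinguishes new headless from headful Chrome.

```python
from typing import Any, List, Mapping


def old_headless_signals(fingerprint: Mapping[str, Any]) -> List[str]:
    """List the old-headless tells found in a client-reported fingerprint.

    Assumed shape: {"userAgent": str, "plugins": list of plugin names}.
    """
    signals = []
    # Old headless Chrome announced itself as "HeadlessChrome" in its default user agent.
    if "HeadlessChrome" in fingerprint.get("userAgent", ""):
        signals.append("headless user agent")
    # Old headless returned no entries from navigator.plugins.
    if not fingerprint.get("plugins"):
        signals.append("no plugins")
    return signals
```

Both values are self-reported by the client, so a modified browser can change them; the check only catches clients running with defaults.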
Improving test environments is a huge upside. I haven't worked on browser automation in nearly a decade, but finding ways to work around shortcomings in the headless environment used to burn a lot of time on that team. I know of many small teams which made deliberate decisions NOT to do any browser automation tests (e.g. Selenium) because some issues required testing hooks in production code.
Is it too late to change the name from "new headless"? It won't be new forever, and then there will need to be a new new mode, or a differently named one that people think is older because it isn't the new mode.
No, obviously, the next version will be called Newer Headless. Then you get the More Newer or Even Newer release. Or my personal favorite NewV2. /s
Using the word "new" in naming conventions is the most moronic and shortsighted way to name things in something that is quite obviously going to be changing in the somewhat near future.
But then how would you have the pleasure of figuring out the sort order between New $Feature, Advanced $Feature, Revamped $Feature and Enhanced $Feature?
It's real Chromium, not emulating a Chromium browser. "Old" Headless was merely pretending to be a Chromium browser, the "New" Headless is a Chromium browser. "Old" Headless requires a parallel/duplicate implementation of features, which leads to subtle behavior differences or infeasibility to support certain features e.g. extensions proper.
Edit: Please also note that we have not released New Headless yet. We "merely" landed the source code.