Google gets preferential permissions in robots.txt (knuckleheads.club)
20 points by bobsil1 on Dec 15, 2020 | 11 comments


Yeah, probably the most useful action that can be taken in the antitrust case is to force Google to spin out the crawler into a separate entity that sells access to the crawl database from a single, public price list. They can set those prices however they like, but can't customize the pricing for certain customers or refuse new customers.

Everybody (including Google) pays the same price per query/byte/whatever. CrawlCo gets ownership of the crawling IP range and is prohibited from entering any other market. GoogleCo is prohibited from doing its own crawling, but can ask (perhaps require) CrawlCo to add new crawl products which are then available to everybody.

Basically, force Google to do with their crawler what Amazon voluntarily did with their data centers when they created AWS.


Makes sense until you realize you'd have to prohibit everyone else's crawling as well (you can't single out Google and put it at a disadvantage), and then you'd just have turned Google's de facto monopoly into an actual, government-mandated crawling monopoly.

I'd rather make http://commoncrawl.org more current, more accessible, and more commonly used, so that website publishers see a benefit in actively supporting it to lighten the load on their servers.


It's a natural monopoly. It has already de facto prohibited everyone else's own crawling, as the linked article demonstrates.

When it had a monopoly, AT&T was forbidden from selling software.


I disagree on the "natural" part. Robots.txt files that put other search engines at a disadvantage aren't the norm. Like websites in the early years that supported only Netscape and MSIE, they're a direct consequence of Google's current market share, and that might change once there is a good reason to change (like DDG growing into a significant player).

If a collection like Common Crawl, with bulk downloads, were more useful and thus used more often, even Google would have a good reason to use it.


> Robots.txt files that put other search engines at a disadvantage aren't the norm

It's not just robots.txt; it's also Cloudflare and IP-based throttling. And it is very, very commonplace: http://gigablast.com/blog.html
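
To make the IP-throttling point concrete, here is a minimal Python sketch (every name and number in it is hypothetical, not taken from the article) of the pattern being described: requests from a published Google crawler range bypass a rate limit that applies to every other client.

    import ipaddress
    import time
    from collections import defaultdict

    # Illustrative allowlist: one published Googlebot range. Real
    # setups use Google's full list, or Cloudflare's verified-bots data.
    EXEMPT_RANGES = [ipaddress.ip_network("66.249.64.0/19")]
    WINDOW_SECONDS = 60.0
    MAX_REQUESTS = 30            # per IP per window (made-up limit)
    _hits = defaultdict(list)    # client IP -> recent request times

    def allow_request(client_ip: str) -> bool:
        addr = ipaddress.ip_address(client_ip)
        if any(addr in net for net in EXEMPT_RANGES):
            return True          # the preferential treatment at issue
        now = time.time()
        recent = [t for t in _hits[client_ip] if now - t < WINDOW_SECONDS]
        recent.append(now)
        _hits[client_ip] = recent
        return len(recent) <= MAX_REQUESTS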




What if I don't want Google to crawl my website but I'm ok if others do?
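
(Stating that preference in robots.txt is trivial; a minimal illustrative file is below. Whether it's honored is the issue raised in the reply.)

    User-agent: Googlebot
    Disallow: /

    User-agent: *
    Disallow:             # empty value = allow everything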


As it currently is, you can't stop them from using an undeclared IP range to crawl you.

This doesn't change that.

"Everybody except X is allowed" rules have always been "on your honor" type restrictions. "X" can always claim to be Joe Rando.


Anyone got a TL;DR? The crux of their argument isn't even on the first page...


They found a bunch of robots.txt files on sites like census.gov that blocked all robots except a few known ones, especially Googlebot.
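
The pattern looks roughly like this (illustrative, not the literal census.gov file): named crawlers get a blanket allow, everyone else gets a blanket block.

    User-agent: Googlebot
    Disallow:             # empty value = allow everything

    User-agent: *
    Disallow: /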



