Google gets preferential permissions in robots.txt (knuckleheads.club)
20 points by bobsil1 on Dec 15, 2020 | 11 comments


Yeah, probably the most useful action that can be taken in the antitrust case is to force Google to spin out the crawler into a separate entity that sells access to the crawl database from a single, public price list. They can set those prices however they like, but can't customize the pricing for certain customers or refuse new customers.

Everybody (including Google) pays the same price per query/byte/whatever. CrawlCo gets ownership of the crawling IP range and is prohibited from entering any other market. GoogleCo is prohibited from doing its own crawling, but can ask (perhaps require) CrawlCo to add new crawl products which are then available to everybody.

Basically, force Google to do with their crawler what Amazon voluntarily did with their data centers when they created AWS.


Makes sense until you realize you'd have to prohibit everyone else's crawling as well (you can't single out Google and put it at a disadvantage), and then you'd just have turned Google's de facto monopoly into an actual, government-mandated crawling monopoly.

I'd rather make http://commoncrawl.org more current, more accessible, and more commonly used, so that website publishers see a benefit in actively supporting it to lighten the load on their servers.


It's a natural monopoly. It has already de facto prohibited everyone else's own crawling, as the linked article demonstrates.

When it had a monopoly, AT&T was forbidden from selling software.


I disagree on the "natural" part. Robots.txt files that put other search engines at a disadvantage aren't the norm. Like websites in the early years that supported only Netscape and MSIE, they're a direct consequence of Google's current market share, and that might change once there is a good reason to change (like DDG growing into a significant player).

If a collection like Common Crawl, with bulk downloads, were more useful and thus used more often, even Google would have a good reason to use it.


> Robots.txt files that put other search engines at a disadvantage aren't the norm

It's not just robots.txt; it's also Cloudflare and IP-based throttling. And it is very, very commonplace: http://gigablast.com/blog.html
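
To make the IP-throttling point concrete, here is a minimal Python sketch (every name and number in it is hypothetical, not taken from the article) of the pattern being described: requests from a published Google crawler range bypass a rate limit that applies to every other client.

    import ipaddress
    import time
    from collections import defaultdict

    # Illustrative allowlist: one published Googlebot range. Real
    # setups use Google's full list, or Cloudflare's verified-bots data.
    EXEMPT_RANGES = [ipaddress.ip_network("66.249.64.0/19")]
    WINDOW_SECONDS = 60.0
    MAX_REQUESTS = 30            # per IP per window (made-up limit)
    _hits = defaultdict(list)    # client IP -> recent request times

    def allow_request(client_ip: str) -> bool:
        addr = ipaddress.ip_address(client_ip)
        if any(addr in net for net in EXEMPT_RANGES):
            return True          # the preferential treatment at issue
        now = time.time()
        recent = [t for t in _hits[client_ip] if now - t < WINDOW_SECONDS]
        recent.append(now)
        _hits[client_ip] = recent
        return len(recent) <= MAX_REQUESTS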




What if I don't want Google to crawl my website but I'm ok if others do?
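
(Stating that preference in robots.txt is trivial; a minimal illustrative file is below. Whether it's honored is the issue raised in the reply.)

    User-agent: Googlebot
    Disallow: /

    User-agent: *
    Disallow:             # empty value = allow everything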


As it currently is, you can't stop them from using an undeclared IP range to crawl you.

This doesn't change that.

"Everybody except X is allowed" rules have always been "on your honor" type restrictions. "X" can always claim to be Joe Rando.


Anyone got a TL;DR? The crux of their argument isn't even on the first page...


They found a bunch of robots.txt files on sites like census.gov that blocked all robots except a few known ones, especially Googlebot.
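
The pattern looks roughly like this (illustrative, not the literal census.gov file): named crawlers get a blanket allow, everyone else gets a blanket block.

    User-agent: Googlebot
    Disallow:             # empty value = allow everything

    User-agent: *
    Disallow: /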



