Taking action against scraping for hire (fb.com)
220 points by pawelkobojek on July 7, 2022 | 227 comments


Collecting the rhetorical BS:

"scraping attacks"

Scraping is not an attack. Monopolists want to pretend they own your data because they get unlimited access to monetize it whereas competitors should have none.

"self-compromised"

Monopolists want to sell you thus it's imperative they maintain the fiction of "one person, one account". By admitting you own your account, they'd have to allow sharing and they wouldn't be able to provide their customers (advertisers) with reliable data about individuals.

"protect people from scraping"

Monopolists will protect themselves and call it protecting you. They will attempt to make you afraid of some other actor using your data in harmful ways so as to distract from how they monetize you and use your data in harmful ways.

"deter the abuse"

Monopolists don't want to argue about what constitutes abuse. Anything they write in their TOS is entirely for their benefit and only constrained by local law (if that). They will abuse you to the fullest extent they can get away with while arguing that any action to use your rights is "abuse."

"safeguard people against clone sites"

Monopolists want to maintain their monopoly; to them there is no greater threat than the direct challenge to that monopoly that comes from allowing data to move freely.

--

More subtle but even more ironic rhetorical points

"for hire" / "paying for access"

Emphasizing that people making money (gasp) by providing this service is bad.

"industry leader in taking legal action" + "across many platforms and national boundaries, also requires a collective effort from platforms, policymakers and civil society"

Monopolists can pay high priced marketers to rebrand them as patriotic hero figures fighting valiantly for the little guy.


While I agree with your assessment of the BS in the article wrt scraping, and also agree that the behaviour is completely about FB protecting itself and its monopoly control (the word control being important), I think it's important to emphasize it's not about FB caring whether other entities have access to the data; it's about FB caring about its public perception with regard to its having that data at all.

Over the last few years or so it feels like, to reference a @dril tweet[1], Facebook has just been 'turning a big dial taht says "data access" on it and constantly looking back at the audience for approval like a contestant on the price is right' with how much it allows 3rd parties to get at its data.

Keep in mind ~5 years ago the big thing at FB was "Open Graph" and "Graph Search" which gave everyone really in-depth access to their data with the idea that Facebook would be the "data platform" on top of which all of these 3rd parties would build apps and interfaces. This of course eventually resulted in the whole Cambridge Analytica thing and now this gigantic swing in the other direction of being overly protective of the data as a kneejerk PR reaction to all the bad press.

FB loved sharing data and provided a direct API for accessing it when the public narrative was about data freedom and 3rd party developer friendliness, and it hates giving any access at all and goes around suing web scrapers now that the public narrative is all about privacy.

Facebook will happily align itself in whatever way results in the least public outcry arguing they shouldn't be allowed to have the data in the first place regardless of if that means giving access or restricting it.

1: https://twitter.com/dril/status/841892608788041732


The example you stated is a truly fantastic one. Graph Search was pretty much like a direct API into their front facing network.


Great post that summarizes exactly what I feel about globocorps. The euphemisms and propaganda are disgusting.


The users agreed to share their data with Facebook, not some other company. If they didn't prevent this, they'd be asking for another Cambridge Analytica


The users agreed to share their data with everyone that uses Instagram. Because that's how the site works.


There’s an important difference between technically consenting and informed consent.

Given what I know about the bot problem on Instagram, I would imagine many people have been tricked into sharing their private profiles with scraping bots. Many bots are copying real people’s profiles and then spamming their friends with follow requests. It’s highly effective and gives these bots access to private profiles.

Fooling people is fraudulent, period.


The user agreed on Facebook to make his data "public", so he can't complain that a robot scrapes it.

Nothing prevents him from restricting access to his pages and data to "trusted" friends.


The description in the article sounds like it scrapes private profile data.

> Octopus designed the software to scrape data accessible to the user when logged into their accounts


Were they showing the private data to everyone, or just to the person whose account was used for the scraping? If it’s the latter, then this is also not a crime, it is just someone accessing data they have been authorized to access, but in an automated way.


I don't think so; it is more like you scrape what is accessible to this user. So in the end you will scrape your friends' data. This is why I said that you are free to only share with friends that 'you trust'.


That is a very good point, but surely it was taken into consideration when scraping was declared legal?


All that case says is "scraping is not a violation of the CFAA". But of course the scraped data still exists in legal limbo; maybe you can compute derived information from it, but the moment a scraper reproduces it there is all of copyright law waiting for them.


In that case, the user owns the copyright, not the company, as the user is the author. So it would be up to them to take legal action if deemed necessary.



The only argument I have here (sadly in favor of FB) is with "safeguard people against clone sites". While I did give my data to FB, I didn't approve that transfer to another site/system. That is the only place I could possibly see some legal foothold.


What happens when FB builds a shadow instagram profile of you based on your FB account? That already happens. FB clones their own data for other projects no different than what you might fear happening if this data were cloned to a third party. The cat is out of the bag already but FB wants to pretend they are the only ones with the right to abuse.


It's impossible to control information once it's been created. The longer it has existed and the more places it can be seen, the more likely it is to spread.

Whether we make that spread of information legal or not does little to affect whether it happens.

There are two things that might help. First, don't share as much information. Once it's no longer limited to you or your close group of friends, who hopefully won't share it along with your name, it's mostly out of your control. Second, put limits (laws) on what information companies are able to synthesize about you, and how long they can retain it. If there's less information created about you (or it's ephemeral, created and destroyed as needed), and if they need to clean out older data, there's less to be shared or stolen.


“It’s hard to enforce the rule of law” is not a good reason to abandon it entirely. Data privacy laws make data privacy better even without being 100% infallible.

We should be both practicing good data hygiene and using legal tools to combat those who abuse data privacy.


> “It’s hard to enforce the rule of law” is not a good reason to abandon it entirely.

I didn't?

> We should be both practicing good data hygiene and using legal tools to combat those who abuse data privacy.

That's what I said. The first thing is data hygiene, the second is legal requirements. The difference I think is that the legal requirements should be on the actual creation and retention of the data, not just who owns it, who it can be shared with, etc.

As soon as PII over a certain age is radioactive and linked to a fine per person, all of a sudden there'll be a lot fewer giant repositories of PII to worry about.


they also toss in the chinese affiliation in hopes to bring even more ill will from the reader towards the company. china is probably doing some bad things, but scraping facebook ain’t one of them.


Scraping social media is something that China is very notorious for doing. They are 100% positively scraping all major social networks around the world.

They do this to collect information of foreign policy interest to them, to silence political dissidents abroad, etc.

For example: https://www.washingtonpost.com/national-security/china-harve...

And: https://www.propublica.org/article/even-on-us-campuses-china...


Good point, I missed that one.


I don't get the thing about "monopoly".

Let's start with one thing: copyright on databases. Take IMDb: they collect and combine totally open data on movies' cast, crew, soundtracks used and so on. Anyone can go to the cinema, wait until the movie ends, write down the data from the credits roll and put it in a database. There's no prohibition on this activity. A cinema may prohibit filming inside, but not using a pencil on paper. Or you may buy a DVD released later, and do just the same. Or you may even email the movie company asking for that data in electronic form, and chances are they will send it to you or point you to some promo materials website where it is published already.

But the entire database is a product of work, and that makes it valuable. So the company or organization spent time and money collecting, indexing and cross-linking that data, and has a right to bank on that work. Simply copying that database for commercial purposes _is_ stealing. This is why we have database copyright laws.

Now back to Meta. They created this product and made it attractive enough that people add their data voluntarily. Every single piece of data is quite open (maybe not really so for personal bits like face photos, emails and phone numbers). Meta spent a lot of cash making and keeping the product that attractive, and now banks on the collected data by targeting ads.

Nothing in the world prohibits anyone else from creating a service, making it valuable, attracting people, collecting data (according to data collection laws) and banking on that. But just copying data collected by Meta is stealing, and Meta is within its rights to protect it. The fact that Meta did it before others doesn't make it a monopolist. In fact, there are lots of companies doing the same, like Google, Amazon, Apple, eBay etc. So in my opinion it is not a monopoly defending its position, but rather a business defending its assets from theft.


Missed this one:

> a US subsidiary of a "Chinese national" "high-tech" enterprise

Replacing it with "a business" would do just fine.


Indeed. It's the height of hypocrisy for a company to define the borders of its own system and then prosecute those who they consider in violation of them. There is no consideration given to whether the data should have been collected and retained by Facebook in the first place, regardless of whatever arbitrary access policies they defined to fit their own business and data model.

It's not clear what Facebook's position on scraping truly is. Sometimes they downplay it as "normalized and widespread," and other times they castigate it as inexplicably legal and clearly immoral, or even outright "in violation of state and federal law." For example:

- April 2021. Researchers find an exposed database containing the scraped data of 533 million facebook users. Some news reports refer to it as a "breach." Facebook attempts to downplay the issue as the result of third party scraping. Headline in ZDNet: "Internal Facebook email reveals intent to frame data scraping as ‘normalized, broad industry issue’" [0]

- October 2020. Facebook announces lawsuits against companies it claimed created a "malicious extension on Google’s Chrome Web Store designed to scrape Facebook, in violation of Facebook’s Terms and Policies and state and federal law." [1]

So... which is it? Does Facebook believe that scraping is a "broad, normalized industry issue?" Or is it a violation of "state and federal law?" It seems like they measure severity of its impact primarily based on the reactions of political commentators.

And what's the difference between automating a browser and automating an API client? Why did Facebook design an API for accessing the data they collected, if it's illegal to collect? They've even claimed to be the victim of Cambridge Analytica, who purchased a "quiz" application created by a developer who pieced it together using code straight from the "examples" section of Facebook's API documentation.

There is one obvious resolution to this apparent contradiction. If we remove Facebook from the question, then the contradiction resolves itself. All we need to do is stop presuming that Facebook has the right to collect and retain this data in the first place. And as a user, if you publish your data to a website designed for sharing it with other people, then by definition it is no longer private data. Therein lies the central question: what is "semi-private" data, and who controls its boundaries?

[0] https://www.zdnet.com/article/facebook-internal-email-reveal...

[1] https://about.fb.com/news/2020/10/taking-legal-action-agains...

p.s. another thing they never mention is why companies want to scrape lists of facebook users. perhaps it might have something to do with the "lookalike audience" feature, and its more precisely targetable predecessors, which allow advertisers to upload a list of usernames and email addresses for targeted advertising?


[flagged]


I've reread the previous comment and I really don't see where there is any justification stated for acting in an unethical manner. While Facebook may be making an argument against unethical behavior by a few, using the language they do is detrimental to legitimate uses of crawling content available on the Web.

Corporations, by nature, work in a way that individuals at those companies don't. They are literally "non-corporal" entities and work toward increasing profit and stakeholder value, not improving the lives or situations of their users, unless that happens to correspond to making them more money.

We should all be wary of corporate control and claim to rights built from their user base, especially if those services are offered for "free".


> We should all be wary of corporate control and claim to rights built from their user base, especially if those services are offered for "free".

That's fine then. And I agree with you. But leave you with this.

Do. Not. Give. The. Company. Your. Data.

> They are literally "non-corporal" entities and work toward increasing profit and stakeholder value

Again, I agree. But if you think this is a bad thing, then you don't believe in capitalism, and I'm not quite sure what the intention is to argue this point on a platform (HN) that encourages the most basic forms of capitalism - starting up companies with innovative technology and solutions.


What a pretty picture capitalism is. Break out the popcorn for the latest regular installment of “ok for me but not for thee”:

People You May Know employs tons of shady stuff Facebook doesn’t reveal, and it saved their bacon early on from stagnating at around 100M users.

https://mashable.com/article/people-you-may-know-facebook-cr...

Facebook Beacon and others caused a big outcry. They got hauled into Congress multiple times. And of course whenever they get caught, they always throw a “mea culpa” and do it all over again in a year under a different name. Here they are recording faces of their users secretly using camera permissions!!

https://www.independent.co.uk/tech/facebook-app-recording-ca...

Their entire business model is “Give us all your data for free.” Mark Z early on was flabbergasted himself when he realized he no longer needed to scrape Harvard’s house websites and could just ask people to submit the data for each other: “They ‘trust me’, dumb fucks.”

https://www.esquire.com/uk/latest-news/a19490586/mark-zucker...

Proceeds to build entire business on this data…

BUT THEN. Someone else does it to them and they get mad. “You can’t scrape us!” LinkedIn tried this:

https://www.zdnet.com/google-amp/article/court-rules-that-da...

And it’s not like capitalist enterprises even try to be consistent in their legal complaints:

https://9to5mac.com/2022/04/14/apple-calls-out-meta-for-hypo...


The problem isn’t “capitalism”, it’s crony-capitalism enabled by certain elements of state complicity.


Okay, is there a single problem with capitalism, or is it perfect? The problem is never with capitalism?


Yeah, nothing says "commie" like trust busting and keeping markets competitive.


Cough, cough, Google, cough, cough…

I’m not ashamed to admit that I’ve done some jquery shenanigans on my Facebook friends page to “export” my friend list so I can retake control of my friend relationships (disintermediation for the in-crowd).

So easy to push data in to Facebook, so hard to get even basic data out of it.


In my opinion, breaking a click-through license agreement or violating the small print on some dense and difficult to read web page is hardly an issue of morality or ethics.

Let's also remember that a big reason Meta is hating on scraping is because of their own problematic behavior. It wasn't so long ago that they were suing NYU over research on political ads and how Facebook targets their readers.[0] In fact, it wouldn't surprise me if Meta's larger goal is to prevent this sort of research.

[0]: https://news.bloomberglaw.com/privacy-and-data-security/face...


Google search's business model is scraping the web, indexing it, and then pasting ads all over their search results made up of other people's content. If Google can build a business on third-party data then these meta scrapers can do the same thing.

It is like saying a photographer can't photograph a building from the street because she doesn't own it. The building is there, taking a picture takes nothing from the building. That is all that is going on here, repeating publicly available information.


No, it's more like you subletting an apartment to a dodgy photographer who wants to take pictures of the children's playground your back window looks out on, even though your contract explicitly forbids subletting. The suit is against companies that use login credentials that are not theirs. It is not public information that is being scraped. It is information behind a login with a terms of service for what you are allowed to do with that login.


> the vast majority of Web scraping efforts are to build businesses on top of other organizations hard work and innovation.

Not really. Scraping just gets data, not code, so it's hard to support this argument. The anti-scraping view is that the right to use the data rests with the company that collected it, but I don't think that view is held by most people.


If you are arguing that an organization's data is worthless but only their code has worth, then I'm not quite sure where to go from this point in this discussion, other than to say that is crazy.


The data is obviously valuable, but they don't necessarily deserve a monopoly on that data, since that data primarily belongs to the users who created the data; so while it's understandable that organizations want to restrict that data, we have no obligation (moral or otherwise) to respect that desire.


Exactly. Your list of friends does not belong to Facebook, it belongs to you.

I am sure Facebook believes they deserve a monopoly for having obtained it first. They do not. The market forces you to compete for every dollar you earn, so you have every right to expect Facebook to compete for every dollar they earn, and "I touched it first therefore it's mine!" is not competition.


But, but, but..... you agreed that Facebook does own your friends list when you signed up for an account and started giving them all your data.

If I run a restaurant, and I stipulate that when you walk through the doors and place an order I reserve the right to take your picture and post it on the bulletin board, why would you place the order and then get pissed off when I post a picture of you on the bulletin board? And why would you be mad at me if I stipulated that no one else can use a camera in my restaurant? Terms of service, my friend. Unless prohibited by legislation, I can stipulate how things run in my restaurant.


If your bulletin board somehow let you monopolize the restaurant industry (? lol) then we should absolutely vote for some politicians to boot your entitled ass back into competition.

Obviously, the idea of a bulletin board granting a restaurant an effective monopoly is ridiculous so your analogy is trash, but even if your analogy wasn't trash, your conclusion would still be wrong.


I'm not saying that the data isn't valuable, but that possession of the data, valuable though it may be, is not related to the organization's hard work or innovation. For the most part, any control rights to the data likely rest or should rest with the people who provided it to the company.

Meta claiming that all of the photos on Instagram are Meta's property does not comport with current IP law or the views/opinions of most of the users on Instagram who do own the copyrights to those photos.

You really shouldn't be able to sue anyone for use or copying of data to which you do not hold copyright. The stuff on FB is licensed to FB by the people who own it (their users).


I don't sympathize with a monopoly that people are trying to weaken.

I loathe Meta and want to boycott it. Unfortunately this means I'm now locked out of the only repository of most local events and gatherings in my city.

In some countries, life is literally not possible without WhatsApp.

If Meta wants to cry about the mean bullies trying to exfiltrate data, they need to stop wiping out competitors.


> the vast majority (…) Period. End of story.

If you’re going to assert something as definitely true to the point of closing off discussion, I’d expect a modicum of evidence. At a minimum that you’d explain the reasoning behind your conclusion. What’s the source of the “vast majority” claim? There’s little point to advertising when you’re scraping a website for personal consumption, so it seems dubious anyone would have reliable numbers on which kind is more prevalent.


Regardless, it’s very rich that a company like Meta is mad that they’re being beaten at their own game (making money off of data that they obtained through shady means).


Sorry, but there are many legitimate reasons to scrape a website. Detecting price manipulation is one example. Because of scraping we know Amazon does things like price gouging and raising prices right before they go on “sale”. Scraping can be very useful for researchers to monitor trends and find correlations. It’s not just about bad guys stealing personal information. There are so many legitimate uses that banning scraping would be a bad thing.
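
To make the "raising prices right before a sale" check concrete, here is a minimal sketch of how scraped price history could be tested for it. Everything in it is an assumption for illustration: the function name, the 7-day window and the 10% threshold are hypothetical, and real price tracking would need far more care (currency, variants, missing days).

```python
from statistics import median

def inflated_before_sale(prices, sale_index, window=7, threshold=1.10):
    """Return True if the price was raised shortly before a sale.

    prices:     daily prices in chronological order (hypothetical scraped data)
    sale_index: index of the first day of the advertised sale
    window:     how many days before the sale count as "right before" (assumed)
    threshold:  ratio above the earlier baseline that counts as inflated (assumed)
    """
    start = max(0, sale_index - window)
    recent = prices[start:sale_index]   # the days just before the sale
    baseline = prices[:start]           # everything earlier
    if not recent or not baseline:
        return False                    # not enough history to judge
    return median(recent) > threshold * median(baseline)
```

For example, a product that sat at 100 for weeks, jumped to 130 for the last week, and then went "on sale" at 105 would be flagged, while one that sat at 100 and dropped to 80 would not.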


Pretty ironic that Mark Z himself started out exactly like this: scraping Harvard servers and photos to power facemash.

He subsequently realized that he doesn’t need to scrape if he can just make a viral site that lets people share this info with each other while he can eavesdrop on ALL OF IT:

https://www.esquire.com/uk/latest-news/a19490586/mark-zucker...


Nah, you are straight-up wrong. In fact, it’s the opposite - the only companies who are scared of scraping are the ones whose business models rely on artificial lock-in, and we should all be working as hard as we can to demolish them.


It's wild that people are arguing that their friend list should belong exclusively to facebook and not, you know, to them and their friends.


>the only companies who are scared of scraping are the ones whose business models...<snip whatever other nonsense followed>

This is just patently false. There is an expense incurred by scraping. There is no benefit from those scrapers to the host providing the data. My logs are full of various bots that pull data from my webhost that costs me money to serve. I run various sites that do not serve ads. I do not include any 3rd party tracking. They're just simple sites that I pay for out of my own pocket because that's what I've chosen to do. Nothing shady about any of it.

It's just sad that your own personal feelings towards scraping prevent you from being able to accept that there are people with views other than your own.


Hey, I totally accept people have views other than my own. I just disagree with them.

It seems extremely weird that you’d want to publish content, but then get mad that people are using the thing that you published. But you do you.


How is that weird? I publish on my site to have people visit my site. I don't publish for people to take my data and do what they will without attribution for where they got the data. If that makes no sense to you, then please don't "do you", because you are being inconsiderate to others.


> the vast majority of Web scraping efforts are to build businesses on top of other organizations hard work and innovation. Period. End of story.

Yeah and the vast majority of the internet and all these mega corps run on open source while paying pittance back to the ecosystem. Cry me a fuckin river.

Can't wait til someone sues them for "scraping" their site for web previews and thumbnails every time someone shares a link on Facebook.

The double standard of these muppets.


I disagree precisely for the simple reason that these businesses are using Meta's weapon against them. It will be an interesting battle to watch - and if my memory doesn't fail me, LinkedIn lost one already. The more the press writes about it, the better: (ordinary) people will sooner or later see through their doublespeak and realize what is at stake.


I feel the same way. My biggest pet peeve is that scrapers/bots traversing my site generate more traffic than the target audience of users. The scrapers get all of this data for "free" at my expense, since I pay the hosting costs to provide them that "free" data.
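
For anyone who wants to put a number on that, here is a rough sketch of summing bytes served to bots versus everyone else from an access log. It assumes the common "combined" log format, and the list of user-agent markers is a hypothetical heuristic, not a reliable bot detector (scrapers can send any user agent they like).

```python
import re
from collections import Counter

# Assumed "combined" access log format; adjust the pattern to your server.
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) \S+ [^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical substrings that suggest an automated client.
BOT_MARKERS = ("bot", "crawler", "spider", "scrapy", "python-requests", "curl")

def bytes_by_client_type(lines):
    """Sum response bytes per client type ("bot" or "other")."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        size = 0 if match["bytes"] == "-" else int(match["bytes"])
        agent = match["agent"].lower()
        kind = "bot" if any(marker in agent for marker in BOT_MARKERS) else "other"
        totals[kind] += size
    return totals
```

Calling `bytes_by_client_type(open("access.log"))` (the file name is a placeholder) returns a `Counter` whose "bot" and "other" entries can be compared directly.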


>the vast majority of Web scraping efforts are to build businesses on top of other organizations hard work and innovation

The vast majority of Facebook/Google's efforts are to build businesses on top of other organizations hard work and innovation.


And that’s the trick. You use the bad apples to delegitimise the good ones. Works every time.


[flagged]


If simp is supposed to be short for simpleton, you might want to consider how simple your thoughts are.



I can also link to a source that's going to be biased in my favor: https://www.etymonline.com/word/simp


> 1903

> 1640s

Lol, no. I'm using the definition from this century:

> Someone who does way too much for a person they like


What a pretty picture capitalism is.

“Give us all your data for free.” “They ‘trust me’, dumb fucks.”

https://www.esquire.com/uk/latest-news/a19490586/mark-zucker...

Proceeds to build entire business on this data…

“You can’t scrape us!”

LinkedIn tried this:

https://www.zdnet.com/google-amp/article/court-rules-that-da...

And it’s not like capitalist enterprises even try to be consistent in their legal complaints:

https://9to5mac.com/2022/04/14/apple-calls-out-meta-for-hypo...


> I love to hate on Meta, but their actions here are spot on and make my morning very enjoyable as I sip my cup of coffee.

You might want to reassess your intelligence there friend. It seems to be suffering from a common form of cognitive dissonance combined with some form of confirmation bias.

How so?

Well you clearly don't like scraping, otherwise you wouldn't be agreeing with a criminal... So there's the confirmation bias...

Which is also the cognitive dissonance part. You clearly don't like Meta/Zuckerberg by your own admission, but you are agreeing with an empty rhetorical attack against people who are smart enough to make use of Zuckerberg's terrible security practices...

Do you not see the problem in this?


This is a total non-sequitur argument here. You've gone from accusing me of lack of intelligence to suffering from cognitive dissonance and confirmation bias, to Facebook's terrible security practices: simply because I'm pleased that an organization has taken action against Web scrapers for violation of Terms of Service.

Yes, I've gone on record indicating that I believe Web scraping to be generally unethical, and that I'm pleased that some action was taken against those that make it their business to do so. And that is all that I have stated in my OP. You've decided to take me on some circular mental gymnastics journey I'm still trying to wrap my head around.


Let me restate how I read what you've said: your position is that because Facebook has Terms of Service that may prohibit something that is not illegal, one must abide by them? Also... Facebook/Meta/Zuckerberg have lied over and over and over very publicly to get their way or to give themselves an advantage: by giving themselves unfettered and unwarranted access to data that they profit from by their own fast and loose rules.

If Facebook/Meta/Zuckerberg are OK with lying, stealing and cheating - then why should anyone leveraging their online properties need to abide? Until they're held accountable under broader rules I see no reason the consumption side can't bend them as well. And you may argue "this isn't how it works" but we all know this isn't how Facebook/Meta/Zuckerberg operate. They operate under the premise of: do whatever makes us money because breaking the rules is the cost of doing business. So, no - they don't get to spew propaganda to the advantage of their business under the guise of protecting users. That is complete and utter bullshit.


Thank you. This is pretty much what I am getting at as well, though in different words.

But of course, the general populace thinks it knows better than the people who actually know best. That being those of us who have been able to live our lives while learning not just from our own mistakes but from those of others around us.

We are the rare and few; and considered the enemy to the mob. Good luck comrade.


Who is the criminal here? Scraping is not illegal. This is a civil suit, so even if Meta wins, it's still not remotely criminal for anyone involved.

Also please explain to me how someone giving a company their Facebook credentials is an example of "people who are smart enough to make use of Zuckerberg's terrible security practices."


Of course, Facebook wants to make it sound like scraping is illegal, when it generally isn't.

But account hijacking and mass-creation of accounts just to access private pages are clear violations of the Facebook and Instagram ToS, so they surely can sue for that.


Violation of ToS does not mean a violation of the law.


Most lawsuits aren't due to breaches of the law, but breaches of contract. Whether terms of service constitute an enforceable contract is another matter.


ToS have been around for decades, surely this question is settled by now?


Former attorney turned software developer here!

Nope, it's not a settled question in the way that I think you mean. Each ToS is different so each would be subject to individual legal analysis in court on its own terms.

Questions would include whether the ToS is unconscionable, whether the terms violate laws of the locality/nation, and so forth.

It's the same with traditional contracts - the fact that contracts have been around for hundreds (maybe thousands) of years doesn't mean much if you and I create a brand new one between us. Our contract's specific terms (and events/actions between us as a result) would be the issue in court.


Why can't FB simply include a clause like "No kind of automated scraping is allowed, except for search engines in robots.txt"? This would save them so much time in court, arguing over the use of fake accounts which should really be irrelevant.


It's not clear that clause would be enforceable. Scraping has been found to be lawful in many jurisdictions, including the US, even without the consent of the host.


So even the general question of "Whether terms of service constitute an enforceable contract" depends on each individual ToS?


Congress or a state legislature could pass a law that says "No terms of service are ever enforceable" but to my knowledge no one has done that.

So, under the current state of the law whether or not a contract is enforceable depends entirely on what the terms in that specific contract are.

Unfortunately, this is yet another instance where the law has failed to keep up with technology. Contract laws (at least in the USA) date back long before anyone ever dreamed up the idea of a EULA or ToS. Our laws contemplate two or more parties with roughly equal bargaining power sitting down and hashing things out, and go from there.

Laws based on that assumption are a pretty poor fit for a world filled with EULAs and ToS but it's what we are stuck with at the moment.


if a bot creates the account, who breaches the contract?


The person who ran the bot. Programs do not have agency, they are just tools.

That's like saying "If the gun fires the bullet, who is liable for murder?" It's a silly question.


> That's like saying "If the gun fires the bullet, who is liable for murder?" It's a silly question.

I don't know, I've seen several people unironically argue that it should be the gun's manufacturer.


Probably should also add "successfully", there's a reason NYPD had/has guns that require 12 pounds of force to pull the trigger (instead of a normal ~5 lbs).


Software that exclusively has illegitimate uses has been shut down. Whether we agree that it is a good argument or not, it is definitely an argument people have made (that some types of guns are mainly designed to hurt people).

With software of course it is a little complicated because:

* it can be produced really easily in a distributed fashion over the internet by anonymous people in many jurisdictions, so there isn't always an obvious company or entity to sue

* most automation tools can be repurposed for malicious use (nobody would sue John Deere because their tractors can be armored and turned into pseudo-tank things)


That is why they are suing rather than pressing charges. When someone steals your car you don't sue them you press charges. When someone doesn't uphold their end of a contract you don't press charges you sue for breach of contract.


in reality, you as an individual can't press charges. Only the state can. And many times the state chooses not to. You can sue in civil court, but individuals can't bring cases in criminal court.


You are confusing pressing charges and indictment. Pressing charges just means you accuse somebody of a crime and “press” the prosecutor to indict them. So the state does have the ultimate say on who is prosecuted, but that doesn’t mean you can’t press charges.


Many countries do have the concept of private criminal prosecutions.


"pressing charges" isn't a thing.


As far as I am aware it isn't a specific thing, but a general catchall term for going through the process of filing a criminal complaint and seeing it through to completion. Maybe there are better words for it, but "pressing charges" is what they use on TV, so it is top of mind.

In general I meant there is a difference between criminal and civil law, and suing generally refers to civil not criminal law.


It is a thing. In America pressing charges is when you accuse somebody of a crime and ask a prosecutor to bring criminal charges against them.


Prosecutors exclusively decide who is charged. No charges can be "pressed" by a victim.


Yes, in most cases it is the prosecutor's discretion whether to bring a case to a grand jury, but that isn't what pressing charges is. See Merriam Webster's definition[0].

[0] https://www.merriam-webster.com/dictionary/press%20charges


I don't think I know the answer, but I'm curious:

Does violating a website's TOS mean you're accessing it beyond your authority, making it a violation of the US's Computer Fraud and Abuse Act?


Not a violation. Decided by the Supreme Court in 2021: Van Buren v. United States. It was a big deal.


Violating TOS no; Gaining access beyond your authority maybe https://www.eff.org/deeplinks/2010/07/court-violating-terms-...


I was assuming that in this case, a person's authority was specifically granted by the ToS.

I wondered if the interplay of those two concepts muddied the waters.


I don't have a source for this, but my recollection is that this has been successfully argued by a couple of companies—but then an appeals court found very firmly that it was not the case.

Essentially, having that be true would mean that any given website could create whole new classes of criminal behavior.


> having that be true would mean that any given website could create whole new classes of criminal behavior.

While this is true, reading the lawsuit it is clear that Meta is suing in civil court, so maybe they're trying to enforce their contract, especially their automated collection ToS (https://www.facebook.com/apps/site_scraping_tos_terms.php)?


Since when do you get sued for breaching TOS?


Since you start a business on the violation.

"Since when do I get sued for taking too many free samples from Costco?" -> "Since you started taking millions of them to resell"


I'm not sure about American law, but if you give me those samples willingly I can do whatever I want with them.

Actually this is the reason why many products come with the label "not for resale", but I have yet to find somebody who cares about it :D


>give me those samples willingly

Doesn't seem like Facebook is giving them willingly.


Since when do you get sued for breaching a contract? When the offense is worth it.


You can get sued for anything that causes harm.

Relevant life lesson: don't do things to people with money that they might perceive as harm.

Corollary: Being sued is as much punishment as losing a suit for most people.


I don't know but it's at least been that way since Aaron Swartz did it I suppose.


Data harvesting is moral for me, but not for thee.


In general I agree that harvesting public data is moral. I think that in these particular cases it's: 1) extracting data from profiles that opted for not being public (only available to logged-in users) and 2) reposting scraped data (publicly?) as belonging to the guy who scraped it, without users' consent.


Facebook has hidden much of Instagram's content behind logins, so that makes most of it "not public".

At the same time, I don't think all of Instagram's users care if their images are hidden, or not.

It's quite unfortunate Facebook/Meta is using hostile language and the word "scraping" together in this case. Scraping is a legitimate process used by various business models to gather information from the Web, which itself was originally intended to be an open forum for people to share content.

Hostile business models have corrupted that intent and turned it into a competitive environment that is harming users and legitimate models which may not have the funding larger corporations can muster.

I have a "scraper" I've built that will either snapshot a page from a user's browser or crawl it remotely with Selenium/Firefox, on the user's behalf, to save the content in an index for searching later, by that user. It's not automated, nor does it parse and crawl URLs in the pages saved. It doesn't use page content in a wider context, either.

I've spent a significant amount of time trying to "work around" anti-scraping efforts by various companies and it's frustrating to see hostility instead of cooperation in certain types of use.


> Facebook has hidden much of Instagram's content behind logins, so that makes most of it "not public".

1) It was public when the content was posted by its authors. Facebook locked it down retroactively, regardless of the author's intent.

2) A login requirement doesn't make it non-public, if making an account is trivial, and there are already hundreds of millions of accounts. Is the plot of Avengers: Endgame also not public, because it's locked behind a ticket purchase or subscription?


Also, the login requirement is not certain. E.g. Google doesn't need to log in to index those pages, and neither do you for the first few profiles. Only after your identity (IP or fingerprint) is known does Instagram start locking public content behind login gates.


> extracting data from profiles that opted for not being public

The tool lets you download the contact info of your friends, which you should be able to do anyway. In fact Facebook tries to trick its users into thinking they can do this with their data takeout option, but the downloaded files don't actually include any of the contact info for your contacts. Which makes zero sense, considering the entire point of Facebook is that it's a digital rolodex for storing your friends' contact info.


From the article, it seems to be a service for scraping data you have access to anyway. As long as they only hand that data to the requesting customer, whose login they used, I don't see a difference between the general public and this user's personalized "public". If access is still limited to the people who have the access rights, then I don't see a difference between accessing through the official interface and via scraped data.


Users make information available on facebook with the expectation that they are able to later control access to it (other than the obvious threat model of screenshotting, etc). This is violating that expectation and thus their privacy.


> they are able to later control access to it

This has never realistically been the case. An illusion of control is provided by facebook, but they've never really put much effort into it. For a really simple example, look at how long content remained available to the entire internet after "deletion". Sometimes it took years.

Expecting any semblance of privacy from a company who profits from using and selling your data is, if I'm being blunt, lunacy.


This is a false expectation and it’s important people learn this.


They’ll stop posting in the way they currently enjoy and will, therefore, have lost some freedom. Great outcome!

In other news: your partner may also leak your most intimate secrets. I hope they do, to teach you a lesson?

Every trust can be betrayed. Why do you believe a world without trust would be better? Only because you cannot handle the nuance of different levels of trust?


> In other news: your partner may also leak your most intimate secrets

Indeed, and that's why it's important to choose the right partner. Likewise, it's important to choose the right friends on instagram to share your photos with. Because as you noted, they can always screenshot away and there's nothing Facebook can do.

What's dangerous is thinking that Facebook/Meta is the keyholder. That's a false perception, perpetrated by Facebook because they want to monopolize everyone's data. It was and always will be about the people who you share your information with. Don't want your profile scraped and leaked? Don't share it with sketchy people.


The counterparty risk from Facebook has almost nothing to do with trust of individual human beings. It has to do with the nature of systems, failure, vulnerabilities, attack surface area, etc. It's "privacy through obscurity" to act in a way that your data is not on the precipice of being leaked by a bad actor or a mistake.


The freedom to live in a fictional world where Facebook safeguards your data is just as available regardless the reality of the situation.

The reality of the situation is that Facebook is a walled garden built on the labor of its users and it is objecting to those users reclaiming the fruits of their labor by scraping.


So taking shackles off is called “losing freedom” now? Also, people enjoy many things, just look at the junkheads. Still, it's more natural to have trust in a heroin addict than to have trust in businesses like Facebook.


"They’ll stop posting in the way they currently enjoy and will, therefore, have lost some freedom."

That is, quite honestly, one of the oddest definitions of freedom I've come across.


There's no evidence of the accused scraper sharing the scraped data with anyone but the account-holder, so the privacy of their friends is still protected.


The state of "opted for not being public" and 'available to any system authenticated person' seem contradictory.

I appreciate that 'system authenticated person' is a smaller set than those who can access anything publicly accessible, and that the former is a subset of the latter.


I agree with the moral argument against posting the scraped data publicly, but if someone gave my account access to their data, I don't think they have a moral right to say I can't use a script to do something private with it.

Scripts are tools, and like any tool they're extensions of the self. If it's morally okay to do it by hand, it's morally okay to do it with a script, so long as my script is respectful of server resources.


Instagram behind a login screen is public. If, say, you were an OnlyFans model and somebody paid for your videos and then scraped them, there would've been an implicit agreement.

With photos shared on Instagram, there is no such understanding; news outlets have been logging in to view and publish your Instagram photos.


If they are being harvested it makes them public by definition. Unless there was a break-in.


It's their platform. Do you really want some random companies scraping your facebook and instagram posts?


> Do you really want some random companies scraping your facebook and instagram posts?

Thought experiment: if you want to keep control over your data, try something radical: don't hand it to Meta/FB/IG at all

(Full disclosure, I'm neither on FB nor IG)


Yes. I want a free and open web.


Good for you. Normal people do not want posts shared privately amongst friends to become publicly available.


Then why would you ever put it on a website that generates its revenue from using and selling your data?


Because you (not you, but people in general) are dumb and overly trusting.


This is the correct answer.


Because you agreed to do so under the terms and conditions of that website.


Look, I understand your point from a legal standpoint, but do you really truly believe even a small fraction of FB and IG users actually “agreed to do so under the terms and conditions of that website”? They just clicked whatever was necessary to create their accounts. I doubt there was much affirmative agreement going on there.


There's no evidence the scraper companies mentioned there are making the scraped data public or sharing it with anyone beyond the individual customer that is already entitled to access that data through the official clients.


Then you need to trust your friends, because copy/paste and screenshots exist.


I'd rather anyone than "just Facebook".

"Just Facebook" has made the web shittier; entire realms of essentially public, often great content hidden behind a login wall.


It’s not “your Facebook”, it’s Facebook’s Facebook. You already made that data public, otherwise it would be impossible to scrape it.


As others said, there is no “you” in the scheme. It's Facebook's data. When people access that data without paying, they are “bad guys”. When the very same people pay for it, they are “legal partners”. In both cases they can do anything with it, while Facebook can't be held responsible because of all the official agreements. So as long as there is no specifically bad publicity or money loss anything goes either way.

“You” only exist in numerous empty statements about “privacy”, “respect”, etc. If you are feeling artsy, you can make that hyped NFT thing out of those, and see whether those kilobytes of text are really worth anything.


What you are claiming here is not true in Europe. If FB holds data about you, you still have legal rights over that data. You can have it deleted, or changed if it is somehow untrue, and you have various other rights too.

There is a relationship involved because ultimately as a FB user, if I don't like what they are doing, I can ask them to remove my data permanently and they must legally do that. If someone has "scraped" that data (if it is considered PID), without my permission or a legal basis to do so, they are in breach of the GDPR and can have enforcement taken against them.

I think some of these "aggregation" businesses will fall foul of this in Europe but I don't know what will realistically happen if that business does not exist in Europe and breaches the GDPR.


This is how it works in press releases. The problem is that data protection laws were in fact lobbied by corporations either openly or behind the scenes, and focus on things like real names and passport numbers that look impressive but aren't really important for the data market. These are just put into some high security database (e.g. for billing info), and it's fine. However, the real behavioral data that costs money is shared as easily as it ever was in the form of “User ID <long number> was at the location of Wi-Fi AP ID <another long number>”. It doesn't matter that the data owner still trades all the history of activity of a certain individual, or that Wi-Fi station locations can be matched with some external database. Everything is fine as long as you don't slap someone's real name on that. And, contrary to the show social networks make, they couldn't care less about real names. Even if you trick the system by calling yourself John Doe, you still look at the specific content, and have specific contacts, you are you, and the data is the same.

I remember that about a decade ago some IT guys paid for the common Facebook advertiser access, then targeted the ad campaigns using filters in such a way that their intersection only resulted in a single user, or just a couple of them, and were able to match those “anonymized” accounts to real ones. You didn't have to be a genius to do that. Facebook certainly knew it could be used like that. Everyone who made money on that simply agreed to use “anonymization” as a smokescreen. Later, with all the scandals, those routine operations were presented as something exceptional done by a small number of bad actors.
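
To make the filter-intersection point concrete, here is a minimal Python sketch. It assumes a toy in-memory audience table; the `audience` helper, the field names, and the records are all invented for illustration and have nothing to do with Facebook's actual advertiser tooling.

```python
from typing import Dict, List

Profile = Dict[str, str]


def audience(profiles: List[Profile], **filters: str) -> List[Profile]:
    """Return the profiles that match every filter (the intersection of all filters)."""
    return [p for p in profiles if all(p.get(k) == v for k, v in filters.items())]


# Hypothetical "anonymized" records: no names, only attributes an advertiser could target.
profiles = [
    {"id": "u1", "city": "Oslo", "employer": "Acme", "age_band": "25-34"},
    {"id": "u2", "city": "Oslo", "employer": "Acme", "age_band": "35-44"},
    {"id": "u3", "city": "Oslo", "employer": "Initech", "age_band": "25-34"},
    {"id": "u4", "city": "Bergen", "employer": "Acme", "age_band": "25-34"},
]

# One filter alone leaves a crowd; stacking filters shrinks the audience to one record.
broad = audience(profiles, city="Oslo")
narrow = audience(profiles, city="Oslo", employer="Acme", age_band="25-34")
```

Each filter on its own looks harmless, but once the combined audience has a single member, anything delivered to that audience is effectively addressed to one identifiable person.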


> breaches the GDPR.

Facebook breaches the GDPR all the time and manages to stay in business. GDPR enforcement is barely existent, and when it does happen, it's insufficient.


You published them for the world to see... so yes, presumably.


“This industry makes scraping available to individuals and companies that otherwise would not have the capabilities.” - seems like web scraping companies are doing a good job :)


The phone charger makes energy available to individuals and companies that otherwise would not have the capabilities. ;)


Maybe some irony here as IIRC Facebook started as essentially a scraping company, pulling student profiles from college websites and re-publishing it for their own profit.

The scrapers have become the scrapees. The horror.


>Octopus, a US subsidiary of a Chinese national high-tech enterprise, built a cloud-based platform designed to provide paying customers access to on-demand scraping software and services.

It is interesting how they try to position this as a Chinese attack on them.


It must coincide with Christopher Wray's sudden claim that there is an active dragnet of sorts trying to subvert America from within, much like the recent election interference involving a former Tiananmen Square activist who tried to run for Congress, I think.

It makes me think that there are many people on CCP's dole, rich powerful famous people are somehow beholden to the CCP in some unknown way but we can all guess correctly that they are all old white men who have previously been seen with young females.


It looks like Zuck is giving up on the Chinese market.


I guess after Winnie the Pooh refused to name his children for him, he got sour grapes about China.


People that are criticizing this probably were also critical of the Cambridge Analytica scandal, but it would be useful to compare what happened there and here.

With Cambridge Analytica:

- Facebook allowed users (with informed consent) to allow external developers to access their data and limited data about their friends, in order to build social-enabled apps.

- CA exploited this to scrape basic profile data from a large number of users. It broke the ToS by doing so (in particular by using the data for purposes different than stated)

Here the same is happening:

- people are giving a third-party company access to their profile, which includes access to friends' data (in fact a lot more than what the app platform allowed)

- the company is scraping all the data.

At the time of CA, the criticism was that Facebook didn't do enough to enforce its ToS (or maybe that the data sharing should have not been allowed in the first place? But the terms were common knowledge and the attack potential became clear only in hindsight), here people are criticizing that Facebook is in fact enforcing its ToS.

Also note that strong enforcement against scraping is one of the mandates that came from the FTC settlement.

It seems inevitable that any news about Facebook/Meta is read in the worst possible light these days, even when the criticism is self-contradictory. I would expect less superficial commentary from HN.


The real reason most people were upset about Cambridge Analytica was that it revealed to the public how advertising and PR companies manipulate us. The fact they violated Facebook's ToS is more so the excuse for the press covering it when they wanted to write another anti-Trump piece. If you were accusing a specific newspaper of hypocrisy based on two articles I might agree. But you're referring to general public sentiment, and I really don't think most people cared or were surprised about the data collection. The shock and scandal was the realization that targeted advertising campaigns and information bubbles have the potential to sway elections.


I'm referring to the HN crowd, I'm not sure that can be equated to "general public sentiment".

I agree with your first paragraph, and my point is that it is not possible to argue at the same time that Facebook should share data more broadly and allow scraping, and at the same time be critical that Facebook allowed CA to happen in the first place.

If the CA scandal was a wake-up call, it appears it was not internalized enough for people to understand the implications of what they're suggesting in this thread?


In the early days of FB, they convinced people that pages (or some content, sorry I do not know the FB terms) could be public for anyone to view without needing to log in to FB. This was very helpful for small businesses and communities. In many countries this is still the quickest place to make a public page. Though now, every small business or community page I want to visit is locked unless I log in to FB. Even if I do log in, it is impossible to copy and paste the important details of a page or post, plus the UI is as ugly as it has always been.


I am currently in the USA and when I visit a public FB page e.g. [1], there is a small login header, and a very big annoying footer login. I estimate 15% of the content is blocked. I had spent the past year outside USA until one month ago. When I visited the same sites while traveling outside the USA, the annoying login footer moves to the middle of the page blocking almost all content. I do not have proof at the moment, but that was my experience trying to read 95% of government, business, and community pages who are almost all on FB.

  [1] https://www.facebook.com/ParquesNacionalesdeArgentina


This is different from LinkedIn v HiQ because HiQ was only scraping publicly available data that was generally accessible to the broader internet. In these two cases, the data is being scraped from FB/Insta using credentials that the client handed over or the mass creation of accounts solely for scraping purposes.


> the mass creation of accounts solely for scraping purposes.

Those accounts wouldn't be allowed to view private data though unless they friend/follow the person first, so they'll still be limited to data the account holders intend to be public and available to anyone.

There's also no evidence that the scraped data was aggregated at scale or commingled in any way, so even if customers provided their actual credentials which grant them access to private data of their friends, the scraper didn't share it with anyone else but them.


Yeah, I think this is more like the Cambridge Analytica situation.


Did FB ever take any legal action against Cambridge Analytica? I can't remember anything about it and this sounds very similar to that (although back in those days FB's tools made this incredibly easy).


No. FB's ToS at the time [1] allowed CA to do what they did.

Namely, CA didn't resell the data or give it to an ad agency.

[1]: https://web.archive.org/web/20180329131546/https://developer...


I wish the Cambridge Analytica FUD would stop. CA's "attack" was to set up a malicious website that convinced idiots to give it access to their Facebook account using the standard OAuth 2.0 flow.

Did they misuse the collected data? Sure. But people granted access to that data knowingly. This wasn't really an attack in my view.

Facebook wasn’t really complicit and definitely didn’t sell/give away any data.


What would your position be if the data being scraped is data the site selectively provides to Google for indexing but doesn't provide publicly?


> After paying for access to the scraping software, customers self-compromised their Facebook and Instagram accounts by providing their authentication information to Octopus

"self-compromised" lol

clearly these people just wanted an automated way to access their own data


> clearly these people just wanted an automated way to access their own data

GDPR and CCPA (and probably many other national/state privacy laws) force Facebook/Instagram/etc. to let you download and/or delete your data without using third-party websites. Usually people self-compromise their accounts in exchange for money: https://www.buzzfeednews.com/article/craigsilverman/facebook...


They have to keep the walls up on their garden so they can get maximum value from harvesting.


Remember back when facebook grew their little network by scraping your gmail contacts.

Google blocked them.

There was animus between the two companies that resulted in Facebook not making an official android app until 2010.


> scrapping attack


That cracked me up when I read it lol


Ironically, around a year ago I disclosed (using their White Hat bug bounty program) that I'm able to access recruitment data (mostly candidates' details) using a very cheap form of scraping against a 3rd party service provider. They dismissed it and instructed me to report it to the 3rd party that operates that service (which I had already done, but the issue had not been fixed).

Sorry for being vague here, I haven't publicly disclosed it yet, but will probably have to if it doesn't get fixed.


Funny story from the early days of TheFaceBook, probably around 2005ish:

I was a webmaster of a set of servers on a major university's network. I also had access (enough to run arbitrary programs that had pretty much full ingress/egress to the public internet) to a number of machines across the campus's network. Through some of my coursework and ACM chapter activities I met some other similarly minded technical people with similar levels of access.

We decided that it would be fun to use our superpowers (access + programming abilities + curiosity) to sign up for various accounts on FB and essentially scrape and friend as much as possible. At the time they had some rate limiting, some IP banning (which wasn't terrible because the Uni gave public IPv4 addrs to all machines on campus by default) and then added some early CAPTCHA which we ended up breaking pretty trivially with some Python and image recognition code.

Never got sued... :) Never really did much with the scripts or data except test that they worked. Fun times.


I would consider this appropriate if one of the largest offenders of scraping weren't the one pretending to be the offended.


"Scraping attacks" LOL


Why not? weev was put in jail over incrementing a number in a url. Surely writing software to put values into urls is even worse.


Let's be clear and accurate: technically weev was put in jail for conspiring on IRC with JacksonBrown. JacksonBrown was the one who wrote a PHP script that incremented a value in a URL (and appended a valid Luhn check digit following incrementation).

Conspiracy to access a protected computer system - that is, typing on IRC. weev didn't write any of the code or access the API.
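
For anyone unfamiliar with the Luhn check digit mentioned above, here is a minimal Python sketch of the checksum arithmetic. The function names are made up for illustration, and this is only the textbook algorithm, not a reconstruction of the script from the case.

```python
def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk the payload right to left, doubling every second digit,
    # starting with the rightmost one.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the doubled value
        total += d
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """Check that the last digit of `number` is the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def next_valid_id(payload: int) -> str:
    """Increment a numeric payload and append its check digit."""
    body = str(payload + 1)
    return body + str(luhn_check_digit(body))
```

The point is that the check digit is a public, well-known checksum meant to catch typos, so "incremented a number and appended a valid check digit" is still just enumeration.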


It's like they don't know that courts just made it legal: https://techcrunch.com/2022/04/18/web-scraping-legal-court/


From the article: "[T]he Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act."

The key phrase is "publicly accessible." This wasn't that. The scraping was done by automating Facebook accounts, which have terms of service, which forbid scraping.

ToS/EULAs make a big difference. They're the reason Blizzard could shut down bnetd's StarCraft server. They're why no one can legally reverse engineer Oracle to create a drop-in replacement, despite interoperability provisions.

More and more platforms are putting the majority of your user-generated content behind auth walls with ToS because that's how they prevent competitors from swiping it.


> ToS/EULAs make a big difference. They're the reason Blizzard could shut down bnetd's StarCraft server. They're why no one can legally reverse engineer Oracle to create a drop-in replacement, despite interoperability provisions.

Strictly referencing EULAs for user-owned copies of software here, not ToS:

That is not true. The Blizzard court clearly erred in not considering unconscionability when analyzing the EULA. As for Oracle, the interoperability provisions are what overrides that part of the EULA.


Does it go into detail about the actual meaning of "publicly accessible"? Because most content on Facebook/Instagram requires any valid login (as opposed to a specific account), and that is data people intend to be public (especially on Insta).

In this case, the account requirement would be a technicality and the data, for all intents and purposes, would still be considered "publicly accessible" if anyone with an account can access it.


Information behind a login screen that any member of the public can get past isn't private information. Private info would be OnlyFans videos. So far there is no such feature on Instagram.


"Legal" doesn't make it ethical, nor does it shield you from liability if you willfully violate contract law (terms of service)


So much bad faith in this press release but not surprising from such a disgusting company, with of course some China-related fear-mongering despite no evidence of wrongdoing.

> After paying for access to the scraping software, customers self-compromised their Facebook and Instagram accounts by providing their authentication information to Octopus.

They didn't "self-compromise" their account. They trust Octopus to act on their behalf, and unlike Facebook, Octopus' interests are most likely more aligned with their users' since their service is paid. This is no different from handing your Facebook credentials to your social media manager or secretary. There's no evidence that Octopus misused this access in any way.

> Octopus designed the software to scrape data accessible to the user when logged into their accounts, including data about their Facebook Friends such as email address, phone number, gender and date of birth, as well as Instagram followers and engagement information such as name, user profile URL, location and number of likes and comments per post.

This is either information people intend to be public or information they trust their friends to keep private. Now if Octopus was leaking the private information to third-parties it would be one thing, but so far I see no evidence Octopus was disclosing the scraped information to anyone but their customer (who is already authorized to access it).

> Meta is an industry leader in taking legal action to protect people from scraping and exposing these types of services

Translation: Meta is an industry leader in protecting its disgusting business model that hinges on keeping public data behind a walled garden with an unacceptable "privacy" policy. There wouldn't be a market for Octopus (or other scrapers) if Facebook already allowed customers to efficiently access information they're already entitled to, but that would be against their interests as their entire business hinges on information being held hostage.

They've created a problem, are selling the cure (well in this case monetizing it via ads) and are now pissed off that someone else is selling the cure for cheaper.


Anyone else heard of Tim Berners-Lee's idea of hosting your data in pods outside the relevant corps wanting access to it and you controlling what's shared and how? This is such a completely different way of doing it, I'm not sure of all the implications, be that from admin (how much effort) to security (would this be a massive hacking opportunity) etc. https://www.theregister.com/2022/01/20/tim_bernerslee/


Ironically, Octopus reminds me of "Octopus VR" in the Silicon Valley show.

https://www.youtube.com/watch?v=ltFB4WBdDg4


"It's a water animal"


One of Facebook’s earliest acquisitions was a scraping company called Octazen.


Fingers crossed they eventually get around to suing Clearview AI out of existence.

https://www.nytimes.com/2020/01/18/technology/clearview-priv...


Pretty rich idea coming from FB, lol. They do human scraping.


We need to update the law to make sure Meta loses in cases like this.


I'm torn on web scraping because the extremes at each end of the spectrum on this issue both seem unreasonable.

On one side, you have people who say any form of scraping should be disallowed, even prosecutable. This went so far that the Department of Justice on behalf of AT&T prosecuted a case of URL modification [1]. One of the few bright spots for this psychotic Supreme Court was to curtail the government's power under the CFAA by limiting what constituted "unauthorized" access [2].

On the other hand, there are those who think that any level of scraping should be fine and I think that's untenable too. Consider Yahoo indexing of Stack Overflow [3]:

> In the meantime, since Yahoo (via Slurp!) is about 0.3% of our traffic, but insists on rudely consuming a huge chunk of our prime-time bandwidth, they’re getting IP banned and blocked.

Do these "scraping extremists" think such actions should be illegal? It's actually not that far-fetched given the Ninth Circuit decided LinkedIn wrongly blocked HiQ scraping [4]. Like if you change your website with the intent that it'll make scraping more difficult, is that a problem? What if it's an unintended side effect?

Additionally, companies like Meta, Google and Apple are going to be way more accountable for abiding by data retention laws and regulations than any scraper. If it's OK to scrape FB.com completely, that information is out there forever.

I certainly think the government shouldn't prosecute on behalf of companies. At least that should expose to people how the government's #1 priority is in fact to protect the true constituents: corporations and the capital-owning class.

[1]: https://www.techdirt.com/2013/09/30/dojs-insane-argument-aga...

[2]: https://en.wikipedia.org/wiki/Van_Buren_v._United_States

[3]: https://stackoverflow.blog/2009/06/16/the-perfect-web-spider...

[4]: https://blog.ericgoldman.org/archives/2019/09/ninth-circuit-...


> So much about this case is ridiculous, and it’s complicated by the fact that nearly everyone agrees that weev is a world-class jerk. But, you need to separate that out from the details of what he did here, to note that it was nothing particularly special, and it involved the sort of thing that security researchers do all the time, and which all sorts of non-security researchers do quite often.

Yeah... uhm... I used to do exactly this sort of thing...

When I was a teenager, I would look at the URL of whatever site I was on, and would change a number here, or a letter there; and see what I got.

Sometimes you get nothing, sometimes you get something. Sometimes that something is quite interesting.


> Meta is an industry leader in taking legal action to protect people from scraping and exposing these types of services, which provide scraping as a service across multiple websites.

Sure, as long as Meta is not the one selling the data to Cambridge Analytica it's wrong.


HN is hypocritical - most commenters here are against this because "Meta bad," but at the same time, most commenters wouldn't want their posts shared privately amongst friends to be scraped and made available publicly.


> most commenters wouldn't want their posts shared privately amongst friends to be scraped and made available publicly.

Where's the "posts shared privately amongst friends made public" part? There are two cases here:

1. A service that logs in as the customer (who voluntarily provides their credentials) and scrapes information visible to said customer on their behalf. Nothing about "made available publicly" is alleged.

2. An individual using a pool of bot accounts to scrape posts visible to any logged in user. Nothing about "shared privately" is alleged. To be clear I don't like the method, but I'll also have to admit I've used one of the Instagram "clone sites" in the past thanks to their login wall.

Unless I missed something, it sounds like you just made it up.


For that to happen, one of your friends would have had to willingly allow this tool to scrape their social network, which would include your private posts.

Is the scraper to blame here, or the friend?


Like many other people, you are calling something “private” when it is not.

“Privately shared with friends” used to mean that only you and your friends know something. You don't “share” anything with “friends” on a social network. You give the information to a giant corporation. If it finds it suitable, it then delivers it to other users, but only after it records your location, analyzes the content to check if you were, say, affected by some melodramatic event (and therefore should be tricked into spending more time… I mean, get “personal recommendations” for a certain kind of content), and does a billion other things.

If you consider that this is fine, please relay all your conversations with family and friends through me from now on. I offer secure, reliable, fast, yada yada communication service. And it's hip! Ask anyone on the street what they use.


There are two cases they brought up, one being web scraping and the other is making a clone website publicly displaying content from Instagram.

I think Meta might be mixing up these two cases here on purpose to make it look like web scraping is as bad as stealing photos to publish them on a clone website.


Who is scraping their private messages? Themselves or their friends?


lol maybe if you don't want that happening you shouldn't be using Facebook


Wasn’t Meta stealing news articles and not paying news organizations for them?


Octopus sounds really useful; is there an open source equivalent? I'd love to be able to scrape my own data on Facebook. Their data export feature is fairly good but far from complete.


Google has turned Google Search into a walled garden by scraping people's content and serving it up on their own platter. Is anyone going to stand up to them?


Or Facebook could just open up their data. Oh wait, not their data, silly me. Everyone else's data. Keep on scraping, I say.


The fact they're wasting time on that is a sign that Facebook's decay phase has already started.


whoa wasn't there somebody on HN that ran a web scraping shop that were boasting they can scrape instagram a while back? are these the same guys???

I don't know how far Facebook can get with this; I thought the LinkedIn court ruling made scraping de facto legal


So, Facebook doesn't want to share the data it wants us to share with them? Figures...


Hey instagram/facebook/linkedin/etc: It's not your data.


It's like they don't know that courts made it legal: https://techcrunch.com/2022/04/18/web-scraping-legal-court/


Evil Big Co. that literally STEALS people's personal information everywhere they go even after they've indicated they want to be left alone is now offended when someone does the same to them?

Well, color me surprised /s

Fuck Facebook. Meta. Or whatever you want to call it.


Is this much different from LinkedIn vs hiQ?


Logged in vs not logged in data.


> Logged in

Is this actually private data, or is it public stuff that's become annoyingly hard to view anonymously because Meta chose to stick it behind a login box?


>public stuff that's become annoyingly hard to view anonymously because Meta chose to stick it behind a login box

this one


Anything behind a login gate is private data for that registered user only.


> Anything behind a login gate is private data for that registered user only

That's quite the claim, if only the login gate were either always there or indeed always not.

Presumably such "private" data ought not to be being indexed by search engines and returned to users who search?

"site:instagram.com" is of the order of 228 million pages on google.com, and "site:facebook.com" is another 422 million.


pretty sure you get hit with a login gate if you navigate to the results via site:instagram.com no?


> you get hit with a login gate if you navigate to the results via site:instagram.com

Nope, I just tried it (private browser session, no IG activity from my IP recently)

google.com -> "site:instagram.com nojito" -> results -> www.instagram.com/explore/tags/nojito/ with a page of photos.

Quickly scrolling down the page for several dozen photos does eventually trigger the login box, though.


Depends if another user can also access it, or whether the original author/owner of the data in question intends for it to be public. In Facebook's case, there are permission levels you can set on posts, including a "public" option (which isn't actually public though and will require a login anyway, but it can be any login) which would settle that debate quickly - hell I wouldn't be surprised if that option were to be hidden so as not to acknowledge that a particular bit of data was explicitly posted for everyone to see.


> In Facebook's case, there are permission levels you can set on posts, including a "public" option (which isn't actually public though and will require a login anyway, but it can be any login)

Q: Have you tried this?

In a private browser session I started at google.com, searched for "site:facebook.com nextgrid", picked some random post, clicked through, and was reading the post without anything other than seeing FB's cookie banner. No sign of any login (which is good 'cause I don't have one)


I suspect it depends on your region, page/post in question and browser fingerprint. A post marked as public isn't 100% guaranteed to be publicly viewable. Sometimes you can view it but merely scrolling down on the page would trigger a login form for example (I've had this happen for pages that are definitely meant to be public such as businesses who'd have an interest in getting as many eyeballs as possible on their content).

I might be wrong and maybe the behavior is actually fully deterministic and isn't nefarious, but knowing the company behind it I'll assume malice until proven otherwise.


but you make it public for everybody with the publicly accessible login so it wouldn't be considered private data for the same reason news outlets can use your instagram images and share them widely without your permission.

you can't throw up a login screen and then let people post content that ends up effectively public, because the login doesn't distinguish between the general public and users specifically authorized to view your selfie pics.


From a GDPR point of view this kind of 3rd party data collection is not acceptable (assuming it covers personal information, for example names of people and what they have posted). The difference with Meta's own data collection is that the users have a relationship with Meta and users have given their permission for Meta to handle the data. Users also know they can contact Meta and ask them to remove the data.

3rd parties don't have the consent from users. Users don't even have an idea these companies might be holding their data.


From a GDPR point of view the scraper would be acting as a data processor on behalf of their customer, no different from using a cloud storage service for your contacts. It's fine as long as the third party doesn't misuse the scraped data or share it with other third parties, and there's no evidence they did so in this case.


> and there's no evidence they did so in this case.

Indeed; the users probably wanted to make the data public, if scraper accounts could see it. There is a GDPR allowance for data "manifestly made public by the data subject".

https://gdpr-info.eu/art-9-gdpr/

Here, it's just Facebook wanting to keep the data inside a walled garden.

For the same reason, I quit LinkedIn and made my own site. I don't want people to have to sign in to see my profile.


Fuck off Facebook you scumbags


Is it Octopus Data Inc. aka Octoparse they are suing?


They are still using the fb.com domain? I thought Meta is not Facebook?....


I think it's like Google vs Alphabet. Alphabet is the parent company like Meta.

As for why their domain is facebook for their news site, not sure why. It would make more sense for it to be under meta instead.



