more PigiVinci83's comments

PigiVinci83 · on Aug 4, 2022

Oh, i read your substack and find it brilliant!

PigiVinci83 · on May 28, 2022

Thank you, much appreciated your comment

PigiVinci83 · on May 27, 2022

I'm sure no one will add here its secret sauce :)

gdcbe · on May 28, 2022

Indeed. Personally I would want to give away more, but for now mostly getting the red light here, for obvious reasons. What we could however do better as a community is share generic tooling.

PigiVinci83 · on May 27, 2022

Thanks for your reply, mobile data it's a thing i need to add soon. Usually we check using Fiddler if there's an API inside, but only for really problematic website.

PigiVinci83 · on May 27, 2022

Having a large codebase like ours, we find out that XPATH are more readable, but i understand it's a personal feeling. We don't have high frequency scraping, so the performances of CSS vs XPATH were not considered. It's an interesting point i'd like to write more about, thanks for sharing.

PigiVinci83 · on May 27, 2022

Thanks for the question, i can speak for what we've encountered in these years of web scraping and nothing beats API and JSON, but i'm sure there are formats even more friendly to read.

PigiVinci83 · on May 27, 2022

A work in progress guide about web scraping in python, anti bot softwares and techniques and so on. Please feel free to share and contribute with your own experience too.

alexchamberlain · on May 27, 2022

The tab formatting seems like an odd (and rather unPythonic) addition. What's the intention there?

hrbf · on May 27, 2022

Same here. It feels out of place, unnecessary and its rationalization unconvincing. Considering Python, an outright weird suggestion.

wheelerof4te · on May 27, 2022

Python allows indenting using tabs, so I don't understand why it's a weird decision.

In fact, they even stated their reasoning in the document. I don't see why anyone has to blindly follow PEP8 nor do I get why 4 spaces indent has to be considered a standard.

mekoka · on May 27, 2022

> Python allows indenting using tabs, so I don't understand why it's a weird decision.

A standard is not a set of rules already enforced in a language, otherwise it would not be needed. It's rather a set of practical guidelines that a group of people agrees upon, with the purpose of making each other's lives easier. That's why indenting with tabs is weird.

> I don't see why anyone has to blindly follow PEP8

In the <tabs> vs <space> debate there's really no reasons not to follow PEP8. The number of people (and editors, and tooling, etc) that abide by it is quite large and it seems to work well enough for most. The only reason that someone would even mention spaces as an inconvenience, is that they can somehow perceive a difference when editing code, which may point to badly configured or outdated tooling, rather than faulty standard. Most people who code Python have their editor set to feed 4 spaces when the <tab> key is pressed and delete 4 consecutive spaces on <Backspace>. If instead you repeatedly press (4 times) the <space> bar or the <Backspace> to insert or remove code indentations, you're doing it wrong.

_abox · on May 28, 2022

I think tabs are handier exactly because they don't have a fixed width. You can adjust code easily to your readability preference without actually having to change the code. Some people like deep indents, others like them more shallow. Just modify the tab width to your personal preference.

Also, it avoids the issue of only accepting 2 or 4 spaces, leaving the 1,3,5 spaces as incorrect combinations leading to issues. With tabs there are no combinations which are invalid, though you can still have too many or few indents than intended of course. But the hunting for that extra space you copied in is gone.

I just don't subscribe to the 'tabs are evil' narrative. I like that python supports them but I think it's really annoying that YAML doesn't.

The argument of "each editor does things differently" is also not really valid when you're going to need a special editor that can convert tabs to spaces and delete spaces in bunches to really work with it comfortably. It would have been much easier to just use an editor that handles tabs in the way that was needed. Either way you're going to want specific editor features.

datalopers · on May 27, 2022

[flagged]

PigiVinci83 · on May 27, 2022

Because it's not a tutorial on web scraping but a mix of what we suggest internally to do and what we've learnt from our experience in this field in these years. For our codebase we prefer tabs instead of spaces, but i understand it's a subject for debates that last decades :) But thanks for the point, I'll rephrase the topic in the guide

datalopers · on May 27, 2022

It's odd to me that your apparent revenue stream is from scraping difficult-to-scrape sites and you're broadcasting the exact tactics you use to bypass anti-scraping systems. You're making your own life difficult by giving Cloudflare/PerimeterX/etc the information necessary to improve their tooling.

You also seem to advertise many of the sites/datasets you're scraping, which opens you up to litigation. Especially if they're employing anti-scraping tooling and you're brazenly bypassing those. It doesn't matter that it's legal in most jurisdictions of the world, you'll still have to handle cease and desists or potential lawsuits, which is a major cost and distraction.

punnerud · on May 27, 2022

« You also seem to advertise many of the sites/datasets you're scraping, which opens you up to litigation.»

Is that a done deal now after the “LinkedIn vs HiQ” case public information only hold copyright, but you can use the by product as it’s fit you for new business?

datalopers · on May 27, 2022

The only clear outcome from the LinkedIn case, afaik, is that scraping publicly of accessible data is not a federal crime under CFAA [1]. There are still plenty of other civil ways that someone can sue you to stop scraping their site: breach of contract, trespass to chattels, trademark infringement, etc. And they can do so over and over again til you're broke. OP is based in Italy anyway so I have absolutely no clue what does and doesn't apply.

I'd like to point out that, while HiQ Labs "won" the case, that company is basically dead. The CEO and CTO are both working for other companies now. So I think the bigger takeaway is: don't get yourself sued while you're a tiny little startup.

[1] https://www.natlawreview.com/article/hiq-labs-v-linkedin

civilized · on May 27, 2022

It's not a best practice, it's just a random thing your team does. It does make the team sound amateurish if it can't distinguish between meaningful best practices and just conventions the team happens to have.

jamestimmins · on May 27, 2022

I appreciate the inclusion of anti-bot software. As someone who builds plugins for enterprise apps (currently Airtable), I really want to build automated tests for my apps with Selenium, but keep getting foiled by anti-bot measures.

Can anyone recommend other resources for understanding anti-bot tech and their workarounds?

judge2020 · on May 28, 2022

Anyone with a stake in bypassing anti-bot measures isn't going to share their tactics, since sharing it will lead to such workaround being patched or mitigated, requiring them to research for more bot detection workarounds.

Projects like cloudscraper[0] are often linked to point and say "look! they broke Cloudflare!" but CF and the rest of the industry has detections for tools like this, and instead of rolling out blocks for these tools, they give website owners tools like bot score[1] to manage their own risk level on a per-page basis.

0: https://github.com/VeNoMouS/cloudscraper

1: https://developers.cloudflare.com/bots/concepts/bot-score/

jamestimmins · on May 28, 2022

Yeah that's a shame but it makes sense.

Probably need to find an in with bot builders, if that's really a goal I have.

etskinner · on May 27, 2022

On the page about canvas fingerprinting[0], it only mentions Cloudflare. From what I can tell, reCaptcha v3 also uses canvas fingerprinting [1]

[0] https://github.com/reanalytics-databoutique/webscraping-open...

[1] https://brianwjoe.com/2019/02/06/how-does-recaptcha-v3-work/

PigiVinci83 · on May 27, 2022

Thanks for sharing, i'll update soon the page.

PigiVinci83 · on March 16, 2022

Hi, in my startup we do a lot of web scrapinf and rely on upwork to find freelancers for our projects. Why don’t we keep in touch?

_gfwu · on March 21, 2022

whats your email?

PigiVinci83 · on Oct 12, 2021

Great article. I've put online a discord server for sharing knowledge about web scraping, if some of you wants to join https://discord.gg/fwqhqrWhHW