Indeed. Personally I would want to give away more, but for now I'm mostly getting the red light here, for obvious reasons. What we could do better as a community, however, is share generic tooling.
Thanks for your reply,
mobile data is a thing I need to add soon.
Usually we check with Fiddler if there's an API behind the site, but only for really problematic websites.
Having a large codebase like ours, we find that XPath selectors are more readable, but I understand it's a personal preference (a quick sketch of the comparison follows below). We don't do high-frequency scraping, so the performance of CSS vs XPath was not a consideration.
It's an interesting point I'd like to write more about, thanks for sharing.
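To make the readability point more concrete, here's a minimal sketch of the same extraction written both ways. It assumes lxml plus the cssselect package; the HTML snippet and class names are made up for illustration, not taken from our codebase:

    from lxml import html

    # Toy document; the structure and class names are hypothetical.
    page = html.fromstring("""
    <div class="product">
      <span class="price">19.99</span>
    </div>
    """)

    # XPath: every step and attribute test is spelled out explicitly,
    # which we find easier to review in a large pile of selectors.
    price_xpath = page.xpath('//div[@class="product"]/span[@class="price"]/text()')

    # CSS: shorter, but the matching rules are more implicit.
    price_css = [el.text for el in page.cssselect("div.product span.price")]

    print(price_xpath, price_css)  # both give ['19.99']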
Thanks for the question. I can only speak for what we've encountered in these years of web scraping: nothing beats an API returning JSON, but I'm sure there are formats that are even friendlier to read.
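To give a rough idea of why, here's a minimal sketch of consuming a JSON endpoint instead of parsing HTML. The URL, parameters, and field names below are placeholders, not a real API we scrape:

    import requests

    # Hypothetical endpoint spotted while watching traffic in Fiddler.
    resp = requests.get(
        "https://example.com/api/v1/products",
        params={"page": 1},
        timeout=10,
    )
    resp.raise_for_status()

    # No selectors to maintain: the JSON structure is the contract.
    for item in resp.json().get("items", []):
        print(item.get("name"), item.get("price"))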
A work-in-progress guide about web scraping in Python, anti-bot software and techniques, and so on. Please feel free to share and contribute with your own experience too.
Python allows indenting using tabs, so I don't understand why it's a weird decision.
In fact, they even stated their reasoning in the document. I don't see why anyone has to blindly follow PEP8, nor do I get why a 4-space indent has to be considered a standard.
> Python allows indenting using tabs, so I don't understand why it's a weird decision.
A standard is not a set of rules already enforced in a language, otherwise it would not be needed. It's rather a set of practical guidelines that a group of people agrees upon, with the purpose of making each other's lives easier. That's why indenting with tabs is weird.
> I don't see why anyone has to blindly follow PEP8
In the <tabs> vs <space> debate there's really no reason not to follow PEP8. The number of people (and editors, and tooling, etc.) that abide by it is quite large, and it seems to work well enough for most. The only reason someone would even mention spaces as an inconvenience is that they can somehow perceive a difference when editing code, which points to badly configured or outdated tooling rather than a faulty standard. Most people who code Python have their editor set to insert 4 spaces when the <tab> key is pressed and delete 4 consecutive spaces on <Backspace>. If instead you press the <space> bar or <Backspace> 4 times to insert or remove an indentation level, you're doing it wrong.
I think tabs are handier exactly because they don't have a fixed width. You can adjust code easily to your readability preference without actually having to change the code. Some people like deep indents, others like them more shallow. Just modify the tab width to your personal preference.
Also, it avoids the issue of only 2 or 4 spaces being acceptable, leaving 1, 3, or 5 spaces as incorrect combinations that lead to issues. With tabs there are no invalid combinations, though of course you can still have more or fewer indent levels than intended. But hunting for that extra space you copied in is gone.
I just don't subscribe to the 'tabs are evil' narrative. I like that python supports them but I think it's really annoying that YAML doesn't.
The argument of "each editor does things differently" is also not really valid when you're going to need a special editor that can convert tabs to spaces and delete spaces in bunches to really work with it comfortably. It would have been much easier to just use an editor that handles tabs in the way that was needed. Either way you're going to want specific editor features.
Because it's not a tutorial on web scraping but a mix of what we suggest doing internally and what we've learnt from our experience in this field over the years.
For our codebase we prefer tabs instead of spaces, but I understand it's a subject for debates that have lasted decades :)
But thanks for the point, I'll rephrase the topic in the guide.
It's odd to me that your apparent revenue stream is from scraping difficult-to-scrape sites and you're broadcasting the exact tactics you use to bypass anti-scraping systems.
You're making your own life difficult by giving Cloudflare/PerimeterX/etc the information necessary to improve their tooling.
You also seem to advertise many of the sites/datasets you're scraping, which opens you up to litigation. Especially if they're employing anti-scraping tooling and you're brazenly bypassing those. It doesn't matter that it's legal in most jurisdictions of the world, you'll still have to handle cease and desists or potential lawsuits, which is a major cost and distraction.
> You also seem to advertise many of the sites/datasets you're scraping, which opens you up to litigation.
Is that a done deal now after the “LinkedIn vs HiQ” case: that public information only carries copyright, but you can use the by-product however it fits your new business?
The only clear outcome from the LinkedIn case, afaik, is that scraping of publicly accessible data is not a federal crime under the CFAA [1]. There are still plenty of other civil ways that someone can sue you to stop scraping their site: breach of contract, trespass to chattels, trademark infringement, etc. And they can do so over and over again until you're broke. OP is based in Italy anyway, so I have absolutely no clue what does and doesn't apply.
I'd like to point out that, while HiQ Labs "won" the case, that company is basically dead. The CEO and CTO are both working for other companies now. So I think the bigger takeaway is: don't get yourself sued while you're a tiny little startup.
It's not a best practice, it's just a random thing your team does. It does make the team sound amateurish if it can't distinguish between meaningful best practices and just conventions the team happens to have.
I appreciate the inclusion of anti-bot software. As someone who builds plugins for enterprise apps (currently Airtable), I really want to build automated tests for my apps with Selenium, but keep getting foiled by anti-bot measures.
Can anyone recommend other resources for understanding anti-bot tech and their workarounds?
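For context, this is roughly the kind of Selenium setup I've been experimenting with so far, a sketch assuming Selenium 4 with Chrome; these are just the commonly suggested tweaks and clearly aren't enough on their own:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Commonly suggested flags to look less like a stock automation run;
    # modern anti-bot vendors check far more signals than these.
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])

    driver = webdriver.Chrome(options=options)
    # Hide navigator.webdriver before any page script runs.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
    )

    driver.get("https://example.com")  # placeholder URL, not the real app under test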
Anyone with a stake in bypassing anti-bot measures isn't going to share their tactics, since sharing them would lead to the workarounds being patched or mitigated, forcing them to research new bot-detection workarounds.
Projects like cloudscraper[0] are often linked as if to say "look! they broke Cloudflare!", but CF and the rest of the industry have detections for tools like this, and instead of rolling out blocks for these tools, they give website owners features like bot score[1] to manage their own risk level on a per-page basis.
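For reference, a project like that is typically used as a drop-in replacement for a requests session; this follows cloudscraper's documented interface, with a placeholder URL:

    import cloudscraper

    # Behaves like a requests.Session, but attempts to pass the JS challenge.
    scraper = cloudscraper.create_scraper()
    resp = scraper.get("https://example.com/protected-page")  # placeholder URL
    print(resp.status_code)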