Hi! I'm the PM at GitLab who works on Snippets, so thanks for providing this feedback. We do have Recaptcha support which can be configured - are you seeing these kinds of issues with that enabled/configured?
I'm not in the group working on that, but it does appear to be coming soon, and it would prevent newly created accounts from doing anything until they're approved.
I built a privacy-friendly alternative to ReCaptcha called FriendlyCaptcha [1] - is there a possibility of seeing this integrated as a more user-friendly alternative?
On my machine it doesn't take any time to solve it and I see no signs of CPU usage. Even trying a couple of times in incognito mode and watching CPU immediately after loading the page for the first time.
On many sites creating a profile takes a few seconds. Loading one of my CPU cores for another 5 seconds wouldn't really deter me if I wanted to create massive numbers of profiles/posts. I could still do over 100 per minute on a standard desktop PC.
The default difficulty is set to a level that makes sense on websites with a varied audience (which includes some ancient browsers on old devices).
The solver runs in WebAssembly and is really fast (~4M hashes per second) - but not every browser supports WASM yet (around 0.3% lack it, empirically). The JS fallback is around 10 times slower (more so in 5+ year old browsers) - for those users you want at least a decent solve time too.
For Gitlab's audience the difficulty can probably be increased a lot - it all depends on the website and usecase. I'm sure the JS fallback's performance can be improved (it involves a lot of operations on 64bit ints that need to be represented as two numbers in JS), happy to accept PRs [1] :)
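The hashcash-style scheme described above can be sketched in a few lines of Python's stdlib (`hashlib.blake2b`). The function names and the leading-zero-bits target are my assumptions for illustration, not FriendlyCaptcha's actual puzzle format:

```python
import hashlib
import struct

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(puzzle: bytes, difficulty: int) -> int:
    """Brute-force a nonce so blake2b(puzzle || nonce) has
    `difficulty` leading zero bits (~2**difficulty hashes on average)."""
    nonce = 0
    while True:
        digest = hashlib.blake2b(puzzle + struct.pack("<Q", nonce)).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(puzzle: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash checks the solution, at any difficulty."""
    digest = hashlib.blake2b(puzzle + struct.pack("<Q", nonce)).digest()
    return leading_zero_bits(digest) >= difficulty
```

Each extra bit of difficulty doubles the expected solver work, while verification stays a single hash; that asymmetry is what lets the server check solutions cheaply.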
What are your thoughts on performing a quick initial test on each client to measure its performance, then tailoring the puzzle to be difficult enough for each?
Hopefully you are successful, but how does this scale? If it takes 5 seconds on a desktop, then a server can solve 500,000 captchas per month. At $5 per month, a spammer can still send 1,000 messages for a cent.
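For reference, the arithmetic behind those numbers (assuming a single-core server solving one 5-second captcha at a time):

```python
seconds_per_month = 30 * 24 * 60 * 60        # 2,592,000 seconds in a 30-day month
captchas_per_month = seconds_per_month / 5   # one 5-second solve at a time
cost_per_thousand = 1000 * 5 / captchas_per_month  # dollars, at $5/month for the server
print(int(captchas_per_month), round(cost_per_thousand, 4))
# 518400 0.0096  -> roughly a cent per 1,000 messages
```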
It's not enabled yet in production - but the main mechanism is by increasing the difficulty as more requests are made from an IP in a certain timeframe (it's basically rate limiting at that point). Think: every 3rd request in a minute doubles the difficulty with some cooldown period.
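That escalation rule ("every 3rd request in a minute doubles the difficulty, with a cooldown") could look something like this toy in-memory sketch; it is my reading of the mechanism, not the actual implementation:

```python
import time
from collections import defaultdict, deque

BASE_DIFFICULTY = 12   # leading zero bits required of the solution hash
WINDOW = 60.0          # seconds; old requests age out (the "cooldown")

requests = defaultdict(deque)  # ip -> timestamps of recent requests

def difficulty_for(ip, now=None):
    """Difficulty for this request: every 3rd request from the same IP
    within the window adds one bit, i.e. doubles the expected work."""
    now = time.time() if now is None else now
    q = requests[ip]
    # drop requests older than the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    q.append(now)
    return BASE_DIFFICULTY + len(q) // 3
```

One added bit per three requests keeps a single human cheap while making a burst of hundreds of requests exponentially expensive.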
With that, the cost (and complexity) of an attack can hopefully be in the same ballpark as (or higher than) ReCaptcha - without your end user having to label cars or send data to Google.
But in the end a determined spammer will get through any captcha cheaply (for reference: ReCaptcha solves are sold by the thousands for $1) - we just hope we can do better than ReCAPTCHA, especially UX-wise.
I love this concept of proof-of-work captchas, but there's a growing number of tools and ways to bypass IP blocks via IP rotation [1], especially after the explosion of IaaS providers. How do you intend to tackle this?
There are free and paid lists of datacenter IP addresses, like https://udger.com/resources/datacenter-list. They probably exist specifically for preventing this, so maybe that's an option here.
The obvious follow-up question is how IPv6 impacts this, because I think it's supposed to be easy for someone to get their hands on a decent chunk of IPv6 addresses.
Maybe the difficulty could scale as a property of how similar the IP address is to previously seen addresses... so the addresses in the same /64 block would be very closely related, for example. (I think that's how IPv6 works... but definitely something I haven't researched lately, so I could just sound very confused)
I don't have all the answers yet, but indeed rate limiting a larger block (at least /64), or even at multiple prefix sizes with different weighting makes sense.
So the way this is supposed to work is that providers hand out /48s and each site should be allocated a /64. In practice, if you rent a VPS for example, your service provider will hand you a /64 out of their /48.
I would personally treat any /64 as the same. Depending on your local network setup the second half of the address could be anything and could change frequently. You might also get multiple addresses. Whereas getting a new /64, or /48, requires slightly more effort.
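Bucketing by /64 as suggested is straightforward with Python's standard `ipaddress` module; this sketch assumes the rate limiter keys on the returned string:

```python
import ipaddress

def rate_limit_key(addr: str) -> str:
    """Collapse an address to its rate-limiting bucket:
    IPv6 -> its /64 prefix, IPv4 -> the address itself."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        # strict=False allows host bits to be set; they get masked off
        return str(ipaddress.ip_network(f"{addr}/64", strict=False))
    return str(ip)

# two hosts in the same /64 share one bucket
assert rate_limit_key("2001:db8:abcd:12:aaaa::1") == rate_limit_key("2001:db8:abcd:12::beef")
```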
Of course there's a risk you'll block a /64 and that takes out some whole company or whatever, but I've seen that happen to corporate proxies that got flagged as a source of spam as well so this is not an easy problem even without the 2^128 address space.
Your website mentions that FriendlyCaptcha is open source, but looking at the license in the repository, it is a custom license that can't be defined as open source. Can you change it to "source available"?
There doesn't appear to be any discussion on your website or on GitHub about why, to be blunt, this is even a good idea in the first place.
A classic 2004 paper, "Proof-of-Work" Proves Not to Work [0], explained that the fundamental problem with proof-of-work bot filters is that attackers will always be able to solve the cryptographic puzzle faster than legitimate users. A touch of security-through-obscurity can help at the margins, but you chose Blake2b, which is used by cryptocurrencies like Zcash, Siacoin, and Nano [1], and as a result there are optimized GPU algorithms (first Google result [2]) and FPGA designs (one of the top Google results [3]). Have you run the numbers on any of those?
The closest to any discussion of these numbers that I saw was a mention on your website that it may take up to 20s on mobile; for comparison, the much-hated image CAPTCHA takes about 6-12s on average for native English speakers, and 7-14s for non-native speakers [4].
In another comment you bring up the idea of starting with a lower difficulty, and increasing it with repeated requests from the same IP address (IPv4, I assume). Unfortunately, access to unique IPv4 addresses is highly correlated with access to more compute power: laptops and desktops in developed countries are most likely to be in a household with a unique IPv4 address, whereas mobile devices on 4G internet and households in developing countries are more likely to be behind Carrier-Grade NAT [5], where thousands or millions [6] of hosts share a pool of a handful or dozens of IPv4 addresses. (The exact same concern applies to IPv6 /64 prefixes.)
This means that mobile devices will face a "double-jeopardy": your service will present them with higher proof-of-work difficulties because the same IPv4 address is shared by more people, and at the same time, the mobile device solves the proof-of-work slower for the same difficulty than a desktop.
Do you have documented anywhere on your website or GitHub how you address these concerns?
I'm not associated with the project in any way, but your well-researched comment did miss at least one important point.
This comment:
> The closest to any discussion of these numbers that I saw was a mention that it may take up to 20s on mobile; for comparison, the much-hated image CAPTCHA takes about 6-12s on average for native English speakers, and 7-14s for non-native speakers.
Missed this quote from the website:
> As soon as the user starts filling the form it starts getting solved
> By the time the user is ready to submit, the puzzle is probably already solved.
The time spent solving reCAPTCHA is active user involvement. The time being spent on Friendly Captcha is passive and can overlap with time being spent filling out a form.
"up to 20 seconds" was also seemingly presented as a worst-case scenario. Most users' devices would presumably be faster than that, but I don't know how the author arrived at that conclusion about how performance scales. Friendly Captcha does report back some information on how long users are taking to solve the captcha, and it looks like website owners could use that to adjust the difficulty based on the needs of their specific audience and how tolerant they are of untargeted spam.
The stuff you point out about Blake2b seems entirely legitimate, and I wonder if an Argon variant would be more appropriate to avoid specialized hardware being quite so problematic.
Personally, I really like the idea of Friendly Captcha. Certainly, there are problems with any captcha implementation. People can rant for many, many paragraphs about websites that use reCAPTCHA... I'm not surprised to see someone ripping apart a different captcha system. The ideal solution would be for spammers to just stop being so obnoxious... but good luck with that plan.
The time being spent on Friendly Captcha is passive and can overlap with time being spent filling out a form.
Great point!
I wonder if an Argon variant would be more appropriate
The creators of Argon2 actually also created a memory-hard proof-of-work function they call MTP (for "Merkle Tree Proof", which is a terrible name, totally un-Googleable; I always have to search for the title of their paper, "Egalitarian Computing"): https://arxiv.org/pdf/1606.03588.pdf
A bug bounty for it was sponsored by Zcoin, which is nice. Zcoin is actually considering moving away from it, but mainly because the proof size of 200kb is prohibitive, which is less of a concern for a captcha system: https://forum.zcoin.io/t/should-we-change-pow-algorithm/477
I'm not surprised to see someone ripping apart a different captcha system
I really don't mean to rip it apart. I just wanted to see some discussion, any discussion, of the well-known flaws with the idea and what ideas OP has to address them.
It is also important to note that the 6-12 seconds and 7-14 seconds reported in the paper are for the garbled-text CAPTCHAs, not for image labeling tasks (fire hydrants, cars, etc.).
I'll try to provide my thoughts on each of the issues you've mentioned, let me know if there's something I missed.
On using blake2b:
I chose blake2b because I was looking for a hash function that is small in implementation, readily available, and already optimized. With WebAssembly the solver can achieve close-to-native speeds and be at least an order of magnitude or two closer to optimized GPU algorithms.
As for specialized hardware: image tasks (and even more so audio tasks, which must be present for accessibility reasons) have the same issue, in that they can be solved by GPU algorithms (i.e. machine learning, in which even a low success rate would already be enough). If you search on GitHub you will find more ML captcha-cracking repos than captcha implementations - they are probably even easier to get started with than adapting GPU miner code.
Image/audio captchas vs ML is an arms race that can be beaten for split seconds of compute (even on CPU) or cheap human labeling: they're just as broken. FriendlyCaptcha optimizes for the end user (privacy + effort + accessibility) by not engaging in that arms race - I think it makes a better trade-off. Like the sibling comment pointed out, the captcha solving can happen entirely in the background, so hopefully it doesn't even make the user wait.
As for rate limiting/difficulty adjustment: it's not perfect, and it could lead to problems if you share an IP with a spammer (but let's be realistic: even with a million users on one IP, there won't be tens of users per minute signing up to some forum). Normal captchas have problems here too: users from these locales already get presented with much more difficult and frequent ReCaptcha tasks (and I doubt those are localized: American sidewalks are harder to label if you've never seen one in real life). Setting a reasonable upper limit on difficulty may be good enough here.
On not using blake2b:
I have considered mutating the hashing algorithm every day randomly to make writing an optimized solver for it all that more difficult - but that would mean one could no longer self-host the JS+WASM and be done with it. I won't rule it out for FriendlyCaptcha v2 if this ever becomes a real problem.
Swapping out the hash function should be easy (the puzzles are versioned to allow for this). If you have a different function in mind and someone implements it in AssemblyScript (so we also have a JS fallback), then we can definitely consider it.
I've seen all the projects claiming to have broken ReCAPTCHA—often using Google's own ML services, hilariously—but it's unclear to me how broken image/audio CAPTCHAs are in practice (and the number of GitHub repos doesn't seem like a good measure to me). If they really are completely broken, then why are they still so widely used? If they really are completely broken by ML, how do human CAPTCHA-solving services stay in business?
FriendlyCaptcha optimizes for the end user (privacy + effort + accessibility)
Good point. I am concerned though that burning CPU cycles on the proof-of-work uses battery life if the end user is on mobile, without their getting any choice in the matter. What if, given an informed choice, they would have preferred an image CAPTCHA? (On the other hand, that could use more cellular data. Might be good to run the numbers on this too.)
even with a million users on one IP there won't be tens of users signing up to some forum per minute
CAPTCHA: a computer program or system intended to distinguish human from machine input, typically as a way of thwarting spam and automated extraction of data from websites
I would say this Oxford Languages dictionary definition is close enough.
This doesn't use a blockchain, it uses a Hashcash-style proof-of-work function (an idea that predates the Bitcoin by decades): https://en.wikipedia.org/wiki/Hashcash
It's not perfect, but maxing out a single core for 20 seconds on an older smartphone is a necessary evil for this kind of captcha.
The alternative: loading a third-party script and multiple images (~2MB) to label for ReCaptcha, and spending time performing the task, also costs some battery (and mental) power.
Hi! I'm a PM at GitLab. Please see my reply above for more details but TL;DR we shipped the first iteration of the `Optional Admin Approval for local user sign up` feature in 13.5. I'd love your feedback! Please comment on the epic if there are other changes for this feature that would help your use case https://gitlab.com/groups/gitlab-org/-/epics/4491
Thanks for the update. I can certainly manage user sign-up from the admin tab for the time being. Once it's hooked into email, I believe that will make things maintainable again for me.
From a UX standpoint it's still sub-par. Someone who wants to report an issue doesn't want to wait an arbitrary amount of time to be allowed to report an issue. They are ready to report it at that moment.
And as an admin, I don't want to have to approve new users on a schedule to ensure the delay is low enough that they are still willing to submit the issue after I approve them. I'd much prefer they go ahead and submit the content, especially so that I can use it in my review of whether to approve the sign up or not.
I seem to remember some pattern in Gitlab where my login period timed out before I finished making a comment. When I logged back in, Gitlab had somehow saved my comment content so that I could then post it so that others could see it. Is there any way to use that pattern for users who haven't been approved yet? So that they can post content, but with a warning shown to them that other users won't see it until the sign-up is approved.
That's a really interesting idea! Users could have limited interactions with the instance and content queued up until approved by an administrator. I created an issue to capture this. https://gitlab.com/gitlab-org/gitlab/-/issues/273542
I immediately back out whenever I encounter Recaptcha.
The other day I was forced to endure it, because I wanted to delete my ancient Minecraft account, since Microsoft pulled a Facebook and are going to require a Microsoft account to play going forwards. Without exaggeration, it took me 15 minutes of training Google surveillance AI (had to solve it three times), for Recaptcha to let me in. I guess Google really hates me.
Yesterday I spent the longest ever with a recaptcha, about 2-3 minutes, at a frigging checkout page. I decided to endure it just because I really needed that ergonomic kb+mouse combo.
Hopefully they'll allow me to solve captchas for longer without getting an RSI.
I'm human enough, and I've been a licensed driver long enough, to recognize that rumble strips at the side of a road are not crosswalks. But apparently enough bots thought they were that the system is now trained on that 'fact', and I as a human am forced to misidentify rumble strips as crosswalks to pass as human.
Try reCAPTCHA’s audio version (the headphones icon), it’s much easier than guessing what images it wants you to click (if you speak English, have headphones, and are not hearing-impaired).
This sounds like it has the potential to be a modern version of the credit score: avoid it enough, and you become persona non grata. That is, for more than 15 minutes.
You're doing something very wrong if you take 15 minutes to solve these and aren't on Tor. Even on a public VPN with Firefox this usually doesn't happen.
I know people who pick the wrong options to fuck with their models though, and then go on HN to complain about recaptcha being annoying.
I have similar issues. I do not pick the wrong options. It also doesn't take me too long to solve the captchas, leading to "too many queries from your ip address".
This is what internet users deal with when blocking most google services.
Thanks for bringing up this epic in the conversation phkai. I'm a PM at GitLab for our Auth group and am working on the `Optional Admin Approval for local user sign up` feature. I'm happy to tell y'all that we shipped the first iteration of this in our 13.5 release. You can find more information in our release blog https://about.gitlab.com/releases/2020/10/22/gitlab-13-5-rel... . I've also updated the epic with more information about its current status https://gitlab.com/groups/gitlab-org/-/epics/4491#status-upd....
For this specific case, the Wikimedia Foundation has explicitly stated that "It is the Free Software release of GitLab that runs optional non-free software such as Google Recaptcha to block abuse, which we do not plan to use." So, not incredibly helpful at the moment.
Also, is manual approval for new signups a good idea for a large FOSS project? It seems like a pretty big barrier to legitimate discussion.
We (at torproject.org) also adopted GitLab CE recently and we had to close down registrations because of abuse. Tens (hundreds?) of seemingly fake accounts were created in the two weeks we had registrations opened and we had to go through each one of those to make sure they were legitimate. In our case, snippets were not directly the problem: user profiles were used as spam directly.
We can't use ReCAPTCHA or Akismet for obvious privacy reasons. The new "admin approval" process in 13.5 is interesting, but doesn't work so well for us, because it's hard to judge if an account should be allowed or not.
As a workaround, we implemented a "lobby": a simple Django app that sits in front of gitlab to moderate admissions.
The idea is people have to provide a reason (free form text field) to justify their account. We'd also like people to be able to file bugs from there directly, in one shot.
We're also thinking of enabling the service desk to have that lower bar for entry, but we're worried about abuse there as well.
Having alternatives to ReCAPTCHA would be quite useful for us as well.
You have to remove the incentives. Block viewing of these snippets by logged-out users by default, require opt-in, and provide a way to whitelist snippets per snippet or per user. Same for user profiles.
That's the point. Having a way to disable search engine indexing would also work, but it wouldn't be obvious to spammers, so they would still try to spam. Hiding content from all users by default removes the incentive to even try.
Just adding the attribute rel="nofollow ugc" to any links in submitted content may be good enough. This tells search engines not to follow the links and flags them as user-generated content, allowing them to identify SEO spam more easily. [1]
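A minimal sketch of rewriting user-submitted HTML to add that attribute on the server side. The regex approach is naive (it assumes the links have no existing rel attribute); a real implementation would do this inside an HTML sanitizer:

```python
import re

def add_ugc_rel(html: str) -> str:
    """Add rel="nofollow ugc" to every <a> tag in user-submitted HTML.
    Naive sketch: assumes tags carry no existing rel attribute."""
    return re.sub(r"<a\s", '<a rel="nofollow ugc" ', html)

print(add_ugc_rel('<a href="https://spam.example">buy now</a>'))
# <a rel="nofollow ugc" href="https://spam.example">buy now</a>
```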
One item that is on the roadmap that is coming and may be of interest is `Optional Admin Approval for local user sign up` - https://gitlab.com/groups/gitlab-org/-/epics/4491.