Agree that context is often needed! (Which is why it was strange to us that raters weren't presented with any context besides the comment text itself -- not even the subreddit, much less the original Reddit post.)
One interesting question, though: if "LETS FUCKING GOOOOO YOU DINGBAT" were meant to be a combative insult, would someone still add a bunch of O's ("GOOOOO" instead of merely "go")? My intuition is that if combativeness were intended, "let's fucking go, you dingbat" would be more likely than "LETS FUCKING GOOOOO YOU DINGBAT", but of course it's a bit hard to say without that context.
Gaming is a really fun and interesting labeling domain, given the community jargon (I'm actually a big Twitch user, but still couldn't tell you what many common emotes mean... took me years to understand "poggers") and context (is "i'm going to kill Garen" a death threat or in-game action?).
Great question! I'd love to measure that more rigorously too.
Although from what we've seen, how much context matters really depends on the labeling task / application.
For example, context matters even more when you're labeling a tweet that's a reply than when you're labeling a parent tweet: it's often hard to understand what the reply is talking about when you can't see the full thread, and it can be hard to tell whether something is a joke or an insult when you don't know whether the replier and the original tweeter follow each other. This matters in practice because sometimes our customers don't realize it, and send us tweet text by itself instead of a full tweet link.
It also matters because even if your models use text alone (and not a richer set of context/features), there may be patterns in the text itself that an ML model can pick up on, but that a human labeler can't interpret without that extra context.
I'd love to chat. Want to reach out to the email in my profile? I'm the founder of a startup solving this exact problem (https://www.surgehq.ai), and previously built the human computation platforms at a couple FAANGs (precisely because this was a huge issue I always faced internally).
We work with a lot of the top AI/NLP companies and research labs, and do both the "typical" data labeling work (sentiment analysis, text categorization, etc.) and a lot more advanced stuff (e.g., search evaluation, training the new wave of large language models, adversarial labeling -- so not just distinguishing cats from dogs, but making full use of the power of the human mind!).
Good news for you: being your target audience, we actually did have you guys on our radar.
For the scale of our project, however, the price point was prohibitive.
We ended up building a small CLI tool that interactively trained the model and let us focus on the most important messages (e.g., those where positive/negative sentiment was closest, or the labels with the smallest volume).
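For anyone curious, the "closest positive/negative sentiment" part is basically uncertainty sampling. A minimal sketch of that idea (placeholder sklearn models, not our actual tool):

    # Minimal uncertainty-sampling sketch (placeholder models, not the actual CLI tool).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def most_uncertain(labeled, unlabeled, k=20):
        """Return the k unlabeled messages the current model is least sure about."""
        texts, labels = zip(*labeled)  # labeled: list of (text, label) pairs
        vec = TfidfVectorizer()
        clf = LogisticRegression().fit(vec.fit_transform(texts), labels)
        probs = clf.predict_proba(vec.transform(unlabeled))
        # Margin between the top two class probabilities: smaller = more uncertain.
        sorted_probs = np.sort(probs, axis=1)
        margins = sorted_probs[:, -1] - sorted_probs[:, -2]
        return [unlabeled[i] for i in np.argsort(margins)[:k]]

Label whatever it returns, retrain, and repeat.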
EDIT: Looking at your website now, it seems like you also just provide good tooling for doing these types of things yourself? If that's the case, I wouldn't have minded paying $50-$100 for a week of access to such a tool. But $20/hr to hire someone to classify data which we'd still need to audit afterwards was too much for us.
$20/hour to classify data sounds reasonable though?
If you have more time than money it might not make sense, but at that price point I could save myself a lot of time by just working a few extra hours of SE and letting someone else do 3x that amount of labelling.
I fully agree $20/hr is reasonable, it just was too expensive for us at that time.
So in the end the whole problem boils down to "quality is (more) expensive"; but MTurk is a special case, since they position themselves so heavily as "the" solution for this and they're terrible.
Completely agree on the need for serious commitment and attention!
Funnily enough, though, many ML engineers and data scientists I know (even those at Google, etc., who depend on human-annotated datasets) aren't familiar with these kinds of errors. At least in my experience, many people rarely inspect their datasets -- they run their black-box ML pipelines and compute their confusion matrices, but rarely look at their false positives/negatives to understand more viscerally where and why their models might be failing.
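Even something as simple as the toy sketch below -- dumping the actual misclassified examples next to the confusion matrix -- tends to surface a lot (purely illustrative, not any particular pipeline; it assumes binary 0/1 labels):

    # Toy sketch: don't stop at the confusion matrix -- actually read the misclassified examples.
    import numpy as np
    from sklearn.metrics import confusion_matrix

    def inspect_errors(texts, y_true, y_pred, n=25):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        print(confusion_matrix(y_true, y_pred))
        for name, mask in [("FALSE POSITIVES", (y_pred == 1) & (y_true == 0)),
                           ("FALSE NEGATIVES", (y_pred == 0) & (y_true == 1))]:
            print(f"\n--- {name} ---")
            for i in np.flatnonzero(mask)[:n]:
                # Reading these often reveals labeling errors, not just model errors.
                print(repr(texts[i]))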
Or when they do see labeling errors, many people chalk it up to "oh, it's just because emotions are subjective, overall I'm sure the labels are fine" without realizing the extent of the problem, or realizing that it's fixable and their data could actually be so much better.
One of my biggest frustrations actually is when great engineers do notice the errors and care, and try to fix them by improving guidelines -- but often the problem isn't the guidelines themselves (in this case, for example, it's not like people don't know what JOY and ANGER are! creating 30 pages of guidelines isn't going to help), but rather that the labeling infrastructure is broken or nonexistent from the beginning. Hence why Surge AI exists, and we're building what we're building :)
In short, one way to prevent your language models from devolving into violence (with extremely high safety guarantees) is to build "AI red teams" of labelers who try to trick the model into generating something violent. Then you train your models to detect those strategies (just like other kinds of red teams find holes in your security, which you then patch). Then your "red data labeling teams" find new strategies to trick your AI into becoming violent, you train models to counter those strategies, and so on.
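Schematically, the loop looks something like this sketch (the callables are placeholders for the human and ML steps above, not any real API):

    # Schematic sketch of the red-team loop; the three callables are placeholders
    # standing in for the human/ML steps described above, not a real API.
    from typing import Callable

    def red_team_loop(model, collect_attacks: Callable, label_responses: Callable,
                      retrain: Callable, rounds: int = 5):
        for _ in range(rounds):
            attacks = collect_attacks(model)             # 1. red team finds prompts that elicit violent output
            failures = label_responses(model, attacks)   # 2. labelers annotate the failure cases
            model = retrain(model, failures)             # 3. patch the model / safety classifier on them
        return model

The key property is the iteration: each round the red team has to come up with new attack strategies, and each retraining round patches the holes found in the previous one.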
It sounds to me as if it might be a smarter bet to simply train the model on a corpus restricted solely to the kind of material that you want to generate.
Shovelling all kinds of content into an AI and trying to censor what comes out strikes me as employing a team of snipers solely to watch a barn and shoot any horses that try to bolt. It works, but it won't be foolproof.
Happy to answer any questions.