Idiomatic translation that matches a human professional (e.g. free of errors for legal texts, interesting and natural for fiction) is unlikely to be achieved until we have AGI. So no.
I won't comment on the first bit, as I've not personally tested in that area, but GPT-4 can absolutely make short work of the second. I don't think people realize how good bilingual LLMs are at translation. Yes, idioms do transfer between languages. Feel free to test it yourself.
I have tested it :) I've asked it to translate English fictional text into Japanese, it falls over often. It's unnatural and often makes no sense at all. It doesn't compare to a typical professional translation (which are often not that idiomatic either), let alone a really good one.
I'm sure it'll be doing that in five years, but not now.
One interesting thing is that it's nondeterministic, so sometimes 'For chrissakes' turns into ちくしょう ('Damn!') but sometimes into クリスのために ('for Chris's sake'). Sometimes 'the goddamn door' turns into クソドア ('shit door'); sometimes the 'goddamn' changes the phrasing of the whole sentence instead. If you run it five times and take the best sentences out of all five runs, it's probably quite good. Maybe prompting would help too; I said "idiomatic Japanese," but it still usually translated in the very "foreigner Japanese" register typical of US drama/movie translations.
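The "run it five times and take the best sentences" idea is essentially best-of-n sampling with a selection step. A minimal sketch of that loop, where the candidate translations and the scoring heuristic are hypothetical stand-ins for real LLM calls and human (or LLM-judge) evaluation:

```python
def translate(sentence, seed):
    """Stand-in for a nondeterministic LLM translation call (hypothetical).
    Different seeds model different sampled outputs."""
    candidates = {
        "For chrissakes, open the goddamn door.": [
            "ちくしょう、そのクソドアを開けろ。",
            "クリスのために、そのドアを開けてください。",
            "頼むから、さっさとドアを開けてくれ。",
        ],
    }
    options = candidates.get(sentence, [sentence])
    return options[seed % len(options)]

def fluency_score(translation):
    """Hypothetical quality score. In practice this would be a human
    reviewer or a separate model judging naturalness."""
    # Toy heuristic: penalize overly literal renderings.
    penalties = ("クリス", "クソ")
    return -sum(translation.count(p) for p in penalties)

def best_of_n(sentence, n=5):
    """Sample n translations and keep the highest-scoring one."""
    runs = [translate(sentence, seed) for seed in range(n)]
    return max(runs, key=fluency_score)

print(best_of_n("For chrissakes, open the goddamn door."))
```

With a real API you would sample with nonzero temperature instead of seeding a stub; the selection step is the part that matters.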
Are you giving it multiple paragraphs to translate at once so that it has enough context for a good translation? If so, would you mind sharing a sample input and output that you found unsatisfactory?
In "Can GPT-4 translate literature?" (Mar 18, 2023) [https://youtu.be/5KKDCp3OaMo?t=377], Tom Gally, a former professional translator and current professor at the University of Tokyo, said:
> …the point is, to my eye at least, the basic quality of the translation [from Japanese to English] is as good as a human might do, and with some relatively mild editing by a sensitive human editor, it could be turned into a novel that would be readable and enjoyable.
I don't think we disagree. The video says the translation will be "readable" but needs several days of work by an experienced editor. That's an amazing result, but again, it's not as good as a human yet. It's way faster, and it'll make media accessible to tons of people.
Like he says, there's lots of ambiguity in Japanese that needs to be handled (gender not being specified until later, etc.), and an editor would need to spend time going over it, but it saves months of traditional work. There are words and _concepts_ that are hard to translate; there are cultural issues, dialects, slang, registers. So yeah, it'll make the media accessible, but it won't be as good as a skilled translator.
Last night I used GPT-4 to translate the first several pages of Ted Chiang's The Lifecycle of Software Objects (a sci-fi piece) from English to Chinese. I'd say it's about as good as me, save a few minor errors. It's safe to say it performs better than a "tired me," and better than some translators I've seen on the market.
I'm a native speaker of Chinese, but not a professional translator.
It may depend on the language. For Polish, which is considered one of the most difficult languages due to its many inflected word forms, it works almost perfectly: on par with average human translators.
> I don't think people realize how good Bilingual LLMs are at translations
This.
GPT/ChatGPT can even translate between different "accents" or dialects of the same language. For example, it can give you the same sentence in Mexican, Argentinean, or Peruvian Spanish.
Example:
Me: Give me a sentence in Spanish that is different in Mexican, Argentinean, and Peruvian Spanish. Write the sentence in each dialect.
These sentences mean "What's up, dude? How are you?" in English. The primary difference is the slang term used for "dude" or "friend" in each dialect: "güey" in Mexican Spanish, "boludo" in Argentinean Spanish, and "causa" in Peruvian Spanish.
It really depends on the tone and context. If you are a tourist and say it in a joking manner, people are probably going to laugh. If you say it in anger to someone, they might not like it very much.
Similar to how a lot of swear words work in many languages.
It's interesting to see how what matters is not the word but the intention behind it. In the end, we are trying to communicate meaning, and words are just one of our tools for doing it.
I'm multilingual as well, and I've tested it personally. English <-> Portuguese does really well, but Portuguese <-> Japanese, or even Japanese <-> English, is not as good as a human translator by a long shot, because of all the hidden subtext in conversation — even things a university student would probably pick up on in their first year of Japanese as a foreign language. It is still much better than GPT-3.5, so much so that it made a lot of waves here in Japan, but a few friends who work in translation of books and manga find it is not really a go-to tool (yet...).
Oh, for sure, I don't mean to say it's excellent in every language. But I personally think a lot of that comes down to training-data representation. It doesn't need to be anywhere near equal, but for instance, after English (93%), the biggest language in GPT-3's training corpus is French, at... 1.8%. Pretty wild.
I am sure it will improve even further; as you pointed out, languages other than English have fairly low data representation. But I guess you said you speak Chinese, correct? How well does it do with things like older, poetic Chinese hanzi? In Japanese, if there is a long string of kanji, it tends to mess up the context. Another area of Japanese it seems poorest at is keigo, or polite business Japanese; the way you speak to a superior is almost a different language. So I unfortunately still can't use GPT-4 to help me with business emails (yet).
I didn't try old poetic stuff; the passages were sampled from 5 books released in the last 2 decades. You can see exactly what I did here (before GPT-4): basically a comparison of GLM-130B (an English/Chinese model) against DeepL, Google Translate, ChatGPT (3.5), etc. https://github.com/ogkalu2/Human-parity-on-machine-translati...
Mandarin isn't my second language, but I ran the formal comparison with it because I also wanted to test a model with more balanced bilingual training than the very lopsided GPT models, and Chinese/English is the only pair that has a model of note in that regard.
What language pairs are you talking about? I don't think people realize just how much the difficulty level and the state of technology differ depending on that choice.
Which is to say that there are edge cases like legal texts or other fields where a high level of domain expertise is needed to interpret and translate text. Which most human translators would also not have.
For almost everything else, it seems to produce pretty decent and usable translations, even when used against relatively obscure languages.
I used it on a Greenlandic article that was posted on HN yesterday (about Greenland having gotten rid of daylight saving time). I don't speak a word of that language, but the resulting English translation looked like it matched the topic and generally read like correct and sensible English. I can't vouch for its correctness, obviously, but I could not spot any of the weird errors or strange formulations that e.g. Google Translate suffers from. That matches my earlier experience getting ChatGPT to answer in some Dutch dialects, Frisian, Latin, and a few other more obscure outputs. It does all of that. Getting it to use pirate speak is actually quite funny.
The reason I used ChatGPT for this is that Google Translate does not understand Greenlandic. Understandable, because there are only a few tens of thousands of native speakers of that language, and presumably not a very large amount of training material in it.
Therein lies the rub. There's a huge gap between what LLMs can currently do (spit back something in a target language that gives you the basic idea, however awkwardly phrased, of what was said in the source language) and what is actually needed for idiomatic, reasonably error-free translation.
By "reasonably error-free" I mean, say, requiring a human correction for less than 5 percent of all sentences. Current LLMs are nowhere near that level, even for resource-rich language pairs.
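The "less than 5 percent of all sentences" threshold is straightforward to measure once a human editor has produced a corrected version. A toy sketch (the sentence lists are invented, and a sentence-by-sentence alignment between machine output and the edit is assumed):

```python
def correction_rate(machine_sentences, edited_sentences):
    """Fraction of machine-translated sentences the human editor changed.
    Assumes the edited text is aligned sentence-by-sentence with the output."""
    assert len(machine_sentences) == len(edited_sentences)
    changed = sum(m.strip() != e.strip()
                  for m, e in zip(machine_sentences, edited_sentences))
    return changed / len(machine_sentences)

machine = ["The door was open.", "He shouted for Chris' sake.", "She left quietly."]
edited  = ["The door was open.", "He shouted, for chrissakes.", "She left quietly."]
print(correction_rate(machine, edited))  # 1 of 3 sentences changed
```

A real evaluation would also need to handle sentences the editor merged or split, which breaks the one-to-one alignment assumed here.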
I've tried it between English and Dutch (which is my native language). It's pretty fluent, makes fewer grammatical mistakes than Google Translate, and seems to generally get the gist of the meaning across. It's not a purely syntactic translation, which is why it can work even between some really obscure language pairs, or indeed programming languages. Where it goes wrong is when it misunderstands context. It's not an AGI and may not pick up on all the subtleties. But it's generally pretty good.
I ran the abstract of this article through ChatGPT. Flawless translation as far as I can see. To be fair, Google Translate also did a decent job. Here's the ChatGPT translation.
Veel NLP-toepassingen vereisen handmatige gegevensannotaties voor verschillende taken, met name om classificatoren te trainen of de prestaties van ongesuperviseerde modellen te evalueren. Afhankelijk van de omvang en complexiteit van de taken kunnen deze worden uitgevoerd door crowd-werkers op platforms zoals MTurk, evenals getrainde annotatoren, zoals onderzoeksassistenten. Met behulp van een steekproef van 2.382 tweets laten we zien dat ChatGPT beter presteert dan crowd-werkers voor verschillende annotatietaken, waaronder relevantie, standpunt, onderwerpen en frames detectie. Specifiek is de zero-shot nauwkeurigheid van ChatGPT hoger dan die van crowd-werkers voor vier van de vijf taken, terwijl de intercoder overeenkomst van ChatGPT hoger is dan die van zowel crowd-werkers als getrainde annotatoren voor alle taken. Bovendien is de per-annotatiekosten van ChatGPT minder dan $0.003, ongeveer twintig keer goedkoper dan MTurk. Deze resultaten tonen het potentieel van grote taalmodellen om de efficiëntie van tekstclassificatie drastisch te verhogen.
Translating the Dutch back to English using Google Translate (to rule out model bias), you get something that is very close to the original and still correct:
Many NLP applications require manual data annotations for various tasks, especially to train classifiers or evaluate the performance of unsupervised models. Depending on the size and complexity of the tasks, these can be performed by crowd workers on platforms such as MTurk, as well as trained annotators, such as research assistants. Using a sample of 2,382 tweets, we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, point of view, topics, and frames detection. Specifically, ChatGPT's zero-shot accuracy is higher than crowd workers for four of the five tasks, while ChatGPT's intercoder agreement is higher than both crowd workers and trained annotators for all tasks. In addition, ChatGPT's per-annotation cost is less than $0.003, about twenty times cheaper than MTurk. These results show the potential of large language models to dramatically increase the efficiency of text classification.
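This back-translation check can be partly automated with a crude lexical similarity score between the source and the round-trip text; a low score flags meaning drift worth a closer read. A sketch using only the standard library (the example strings are illustrative, not the abstract above, and this is no substitute for actually reading both versions):

```python
import difflib

def round_trip_similarity(original, round_trip):
    """Rough character-level similarity between a source text and its
    back-translation. Near 1.0 means the round trip preserved the wording;
    low values suggest the translation drifted and deserves inspection."""
    return difflib.SequenceMatcher(
        None, original.lower(), round_trip.lower()
    ).ratio()

original = ("Many NLP applications require manual data annotations "
            "for a variety of tasks.")
back = ("Many NLP applications require manual data annotations "
        "for various tasks.")
score = round_trip_similarity(original, back)
print(round(score, 2))
```

Note that a paraphrase can be a perfectly good translation while scoring lower here, so this catches gross drift, not subtle errors.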
I'm sure there are edge cases where you can argue the merits of some of the translations but it's generally pretty good and usable.
Thanks for the counter-example; I'll confess to having spent far too much time lately with edge-case translations (of languages a bit farther apart), rather than with more generic cases like the above.
I will be re-assessing my view on general-case translation performance accordingly.
I wrote accepted corrections to state regulatory law on a particular topic, and I can tell you that the super-dense legalese for big-time industrial topics had loopy and inconsistent language.