
> Those aren't the only options. You (or someone) could also write your own compatibility layers for the APIs that avoid some of the problems I mentioned (e.g., by producing errors on inconvertible characters, by being compatible with former Windows versions, by not affecting other DLLs in your process, etc.)

That's akin to writing a partial C library. If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.

> Or you could e.g. get upstream to start caring about their users on other platforms, and play ball.

The upstream is often not paid for this. Even if they get a PR, if the PR makes their code harder to work on they might reject it.

Microsoft has to make UTF-8 a first-class citizen.

> I don't think you're understanding the problem here. Interaction is not part of the picture at all. You might not be loading the DLL yourself at all. DLLs get loaded by the OS and user for all sorts of reasons (antiviruses, shell extensions, etc.) and they easily run in the background without anything else in the process "knowing" anything about it at all. Your program is declaring that everything in the process is UTF-8 compatible, but those DLLs might not be compatible with that, and so you're just praying that they don't use -A functions in an incompatible manner.

You mean changing the codepage for use with the "A" functions? Any DLL that does that must go on the bonfire. There's a special place in Hell for developers who build such DLLs.

> "Depends greatly on the context" kinda makes my point. It can turn a zero-copy program into single- or double-copy. Generally not a showstopper by any means, but it sure as heck can impact some programs. And if that program is a DLL people use - well now you can't work around. (Yes, there's a reason I listed this last. But there's a reason I listed it at all.)

I'm assuming you're referring to having to re-encode at certain boundaries. But note that nothing in Windows forces or even encourages you to use UTF-16 for bulk data.

> The reality is Windows isn't UTF-16 and nix isn't UTF-8, which was the crux of most of my points.

Windows clearly prefers UTF-16, and its filesystems generally use just-wchar-strings for filenames on disk (they don't have to though). Unix clearly prefers UTF-8, and its filesystems generally use just-char-strings on disk.



>> Those aren't the only options. You (or someone) could also write your own compatibility layers for the APIs that avoid some of the problems I mentioned (e.g., by producing errors on inconvertible characters, by being compatible with former Windows versions, by not affecting other DLLs in your process, etc.)

> That's akin to writing a partial C library. If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.

I found out about activeCodePage thanks to developers of those compatibility layers documenting the option and recommending it over their own solutions.
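
The opt-in lives in the application manifest. As a sketch (the XML element in the comment is the one Microsoft documents; GetACP and CP_UTF8 are real Win32 APIs, and Windows 10 1903 is the documented minimum version), here is how a program can confirm at runtime that the setting took effect:

    // The manifest fragment, placed under <application><windowsSettings>:
    //
    //   <activeCodePage
    //       xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">
    //     UTF-8
    //   </activeCodePage>
    //
    #include <windows.h>
    #include <cstdio>

    int main() {
        // GetACP() reports the process's active ANSI code page; with the
        // manifest setting applied (Windows 10 1903+) it returns 65001.
        UINT acp = GetACP();
        std::printf("ACP = %u (%s)\n", acp,
                    acp == CP_UTF8 ? "UTF-8" : "legacy code page");
        return acp == CP_UTF8 ? 0 : 1;
    }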

> The upstream is often not paid for this. Even if they get a PR, if the PR makes their code harder to work on they might reject it

The project I work on is an MFC application dating from the 9x and early-XP era, abandoned for 15 years. Before I touched it, it had no Unicode support at all. I'm definitely not being paid to work on it, let alone to convert everything to UTF-16 when the tide seems to be going the other direction.

> Your program is declaring that everything in the process is UTF-8 compatible, but those DLLs might not be compatible with that, and so you're just praying that they don't use -A functions in an incompatible manner.

Programs much, much, much more popular than mine, written by the largest companies in the world, and many programs you likely use as a developer on Windows, set activeCodePage to UTF-8. Not to mention the article advises setting it globally for all applications (and implies it is already the default in some locales). Those DLLs will be upgraded, removed, or replaced.


Forget it, you ain't gonna make the Linux-centric open-source community really care about Windows (or other non-POSIX-like OSes, of which almost none exist today). Everyone else has to give in and accommodate that community's ways if they want to use its code.

And since Windows-centric developers, when porting their apps to Linux, are generally willing to accommodate Linux-specific idiosyncrasies (that's what porting is about, after all) if they care about that platform enough, the dynamic will generally stay the same: people porting from Windows to Linux will keep writing compatibility shims, and people porting from Linux to Windows will keep telling you "build it with MinGW or just run it in WSL2, idgaf".


> That's akin to writing a partial C library.

Not really. It's just writing an encoding layer for the APIs. For most APIs it doesn't actually matter what they're doing at all; you don't have to care what their behaviors are. In fact, you could probably write compiler tooling to automatically analyze the APIs and generate code for most functions, so you don't have to do this manually.
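
To make that concrete, a minimal sketch of one function in such a layer, assuming you convert at the boundary and fail loudly on invalid input (CreateFileU8 and Utf8ToWide are hypothetical names; the Win32 calls are real):

    #include <windows.h>
    #include <string>

    // Convert UTF-8 to UTF-16, returning false instead of silently
    // substituting U+FFFD on malformed input (MB_ERR_INVALID_CHARS).
    static bool Utf8ToWide(const char* s, std::wstring& out) {
        int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    s, -1, nullptr, 0);
        if (n <= 0) return false;
        out.assign(n, L'\0');
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                s, -1, &out[0], n) <= 0) return false;
        out.resize(n - 1);  // drop the terminator the API wrote
        return true;
    }

    // Hypothetical wrapper: the kind of function tooling could generate.
    HANDLE CreateFileU8(const char* path, DWORD access, DWORD share,
                        SECURITY_ATTRIBUTES* sa, DWORD disposition,
                        DWORD flags, HANDLE templateFile) {
        std::wstring wpath;
        if (!Utf8ToWide(path, wpath)) {
            SetLastError(ERROR_NO_UNICODE_TRANSLATION);  // error, not loss
            return INVALID_HANDLE_VALUE;
        }
        return CreateFileW(wpath.c_str(), access, share, sa,
                           disposition, flags, templateFile);
    }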

> If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.

"Well enough" as in, with all the warts I'm pointing out? Their current solution is all-or-nothing for the whole process. They haven't provided a module-by-module solution and I don't expect them to. They haven't provided a way to avoid information loss and I don't expect them to.

> You mean changing the codepage for use with the "A" functions? Any DLL that does that must go on the bonfire. There's a special place in Hell for developers who build such DLLs.

"Changing" the code page? No, I'm just saying any DLL that calls FooA() without realizing FooA() can now accept UTF-8 could easily break. You're just praying that they don't.

> I'm assuming you're referring to having to re-encode at certain boundaries. But note that nothing in Windows forces or even encourages you to use UTF-16 for bulk data.

Nothing? How do you say this with such confidence? What about, say, IDWriteFactory::CreateTextLayout(const wchar_t*) (to give just one random example)?

And literally everything that interacts with other apps/libraries/etc. that use Unicode (which at least includes the OS itself) will have to encode/decode. Like the console, clipboard, or WM_GETTEXT, or whatever.

The whole underlying system is based on 16-bit code units. You're going to get a performance hit in some places, it's just unavoidable. And performance isn't just throughput, it's also latency.
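
A sketch of that boundary cost, assuming the program keeps its bulk text in UTF-8: every exchange with a UTF-16 API pays a sizing pass plus an allocate-and-copy pass (error handling elided).

    #include <windows.h>
    #include <string>

    std::wstring ToUtf16(const std::string& utf8) {
        if (utf8.empty()) return std::wstring();
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                    (int)utf8.size(), nullptr, 0);  // size
        std::wstring out(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                            &out[0], n);                    // copy+convert
        return out;  // this buffer is the extra copy you pay before e.g.
                     // IDWriteFactory::CreateTextLayout(out.c_str(), ...)
    }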

> Windows clearly prefers UTF-16, and its filesystems generally use just-wchar-strings for filenames on disk (they don't have to though). Unix clearly prefers UTF-8, and its filesystems generally use just-char-strings on disk.

Yes, and you completely missed the point. I was replying to your claim that "UTF-8 has won" over UTF-16. I was pointing out that what you have here is neither UTF-8 on one side nor UTF-16 on the other. Arguing over which one "won" makes no sense when neither is what you're actually dealing with, and you're hitting information loss during conversions. If you were actually dealing with UTF-16 and UTF-8, that would be a very different story.
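
To illustrate the loss concretely: an unpaired surrogate is a legal Windows filename but has no UTF-8 encoding, so it cannot round-trip. A small self-contained sketch (the flags and APIs are real):

    #include <windows.h>
    #include <cstdio>

    int main() {
        // "a<unpaired surrogate>b" is a valid NTFS name, but not valid UTF-16.
        const wchar_t name[] = { L'a', wchar_t(0xD800), L'b', L'\0' };
        char utf8[16];
        // With WC_ERR_INVALID_CHARS the conversion fails outright; without
        // it, the surrogate silently becomes U+FFFD. Either way the original
        // name is unrecoverable from real UTF-8.
        int n = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, name, -1,
                                    utf8, sizeof(utf8), nullptr, nullptr);
        std::printf("%s\n", n ? "converted" : "conversion failed");
        return 0;
    }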



