> If it is, as you claim, permissible to train the model (and allow users to generate code based on that model) on any code whatsoever and not be bound by any licensing terms, why did you choose to only train Copilot's model on FOSS? For example, why are your Microsoft Windows and Office codebases not in your training set?
GitHub's position doesn't appear to offer any advantage with regard to Copilot's creation.
OpenAI Codex (which Copilot grew out of, IIRC) exists, as do Amazon and Salesforce versions of Copilot. Hugging Face's BLOOM was trained on a sizeable amount of public code. Tabnine, now behind, was one of the earliest to combine public code repositories with deep learning for smarter autocomplete. The data requirements for Transformer scaling mean any and all public-facing repositories will be assimilated, whether GitHub, GitLab, Stack Overflow, or so on.
I wish more energy were spent on how to fund pretrained models that also run efficiently on CPUs and can be fine-tuned to one's language and local environment, removing reliance on cloud services.
Curious about people's opinions on DALL-E 2 or Google's Imagen, which do pretty much the same thing with renders, illustrations, and paintings, or upcoming models doing the same for voice acting and music. Coders seem more excited about the potential of those tools.
Is anyone using Dall-E, Imagen, or any other generative model for art to create commercial products? If so, they're probably also concerned about copyright issues.
Copilot is being offered for widespread commercial use, so it's held to a higher standard. Respecting copyright is much more important when you're building a business and not just sharing fun AI art on social media.
OpenAI currently offers a GPT-3 API and an invite-only DALL-E 2 API. Both of these are commercial products trained on web datasets and can output collisions with the training set. They have effectively zero concerns about copyright due to it being covered under fair use and OpenAI having copyright on all outputs (in the case of DALL-E 2).
The boring answer is probably something along the lines of “copilot was trained by employees of OpenAI who aren’t technically MS employees”. When I worked at MS you had to jump through all sorts of hoops to get access to code from other orgs. I can’t imagine what BS you’d need to do to give access to a vendor.
At least a year ago in Azure that wasn't true; everyone had access to nearly every internal service's code (+the windows kernel). Though there were some exceptions (the Teams team didn't want to share their source at all for whatever reason).
> you had to jump through all sorts of hoops to get access to code from other orgs
This may be the dumbest move from M$ that I have read on this thread! Sure, companies need to protect their private IP, but this really feels like creating unnecessary friction for no good reason...
I want to know what stuff you guys are putting in public GitHub FOSS repos that you don't want replicated in any way...
I also want to know why people think their code is so special that no one else could have ever come up with it independently. Each and every opponent of Copilot is the best developer ever, I guess?
That said, I don't understand the choice to use GPL for any reason, so maybe I'm not equipped to understand the arguments against Copilot. Forcing your code to be open forever isn't freedom, it's the omission of freedom. Someone using your (for example) MIT-licensed code in a closed-source commercial software project doesn't "un-free" the code you released; your code is still exactly as open and as available as it was before, and zero freedoms were lost by anyone.
> I want to know what stuff you guys are putting in public GitHub FOSS repos that you don't want replicated in any way...
Please feel free to use my code in any way that its license permits: attribution for the permissive licenses, share-and-share-alike for the copyleft licenses. Those license terms are the price of the code, no different from a proprietary product's "this costs $x" or "this costs $x/month". I'm happy to give away most of what I work on every day, and I ask that people 1) give credit, and 2) in some cases, share under the same terms, and 3) in many cases, don't sue me or other users of code I've written over software patents (which shouldn't exist).
If the day comes that copyright goes away, and we can freely copy and share the code of any currently proprietary software and other works, I'd celebrate that. Until then, I don't want an asymmetric situation in which proprietary licenses must be adhered to but Open Source licenses are ignored.
If copyright goes away, that won't magically make the source code of all proprietary software public. The only thing that will be liberated is existing shared-source software.
>I want to know what stuff you guys are putting in public GitHub FOSS repos that you don't want replicated in any way...
Nobody has claimed that they want this. People just want derived work to adhere to the license they chose for their project.
>I also want to know why people think their code is so special that no one else could have ever come up with it independently. Each and every opponent of Copilot is the best developer ever, I guess?
Would you feel the same way about ripping off game assets, or music?
I think you just have an axe to grind with free software in general based on your messages and the general tone. Just because you don't understand it doesn't mean that the ideas are invalid.
I am also curious why copyright laws should protect proprietary software, music, games, writing, etc but not apply to my software, even if it isn't the highest quality work?
At what point does AI recreating patterns it has seen from reading source code count as a derived work? What if a human learns to code by reading only GPLed code? Does all the code they write fall under the GPL as a derived work now?
> “If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” [Kate] Downing, [an IP lawyer specializing in FOSS compliance] says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”
This has some interesting implications – for example, it means I can't mirror somebody else's (open source) code on GitHub without their explicit agreement.
> > “If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” [Kate] Downing, [an IP lawyer specializing in FOSS compliance] says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”
So any code uploaded by someone other than the copyright holder renders someone liable to be sued for copyright infringement, AFAICS. The only question is whom it makes liable -- the uploader, GitHub (=Microsoft!), or both?
I can see arguments either way: The uploader is clearly infringing by giving away a right that isn't theirs to give. But so is GitHub / Microsoft, for using a "right" they haven't been properly given. So I'm provisionally leaning towards "both".
> I can't mirror somebody else's (open source) code on GitHub without their explicit agreement.
Who is doing the "mirroring" -- you, in uploading the code, or GitHub / Microsoft in actually hosting it, keeping it available for download from their "mirror"[1] site?
___
[1]: Is that even the correct terminology nowadays, when AIUI for lots of projects GitHub is their primary code repository?
So GitHub should immediately take down (and remove from their Copilot learning model!) all *GPL code uploaded by anyone but the ("primary"?) copyright holder.
There's one thing I'm missing from all these discussions and posts: is the generated code even copyrightable? IANAL, but code snippets often fall under the "scènes à faire" doctrine (everybody would do it in a similar way), in which case it's not. https://en.m.wikipedia.org/wiki/Sc%C3%A8nes_%C3%A0_faire
GitHub seems to think it is copyrightable. Personally, I doubt it is, simply because a human didn't create it and the process that created it was automatic, with no creativity.
Well, if the entire thing was generated, then no (according to the first link I posted above), since it was not produced by a human. However, no useful program is going to be entirely written by an AI, so any real program would have quite a lot of user input (I regularly will take what copilot suggests and then tweak it to what I specifically want). And then, yeah, it's copyrightable.
Also, there's no way for anyone to know what portion of code that I commit was hand written vs. generated, so you kind of have to treat it all as written by the committer anyway.
Though this does bring up interesting questions about what happens with things like automated PRs that fix bugs / update dependencies... are those then non-copyrightable? ¯\_(ツ)_/¯
Here's the kicker: your modified code snippet may still not be copyrightable if it's generic enough that everyone would do it in a similar manner.
Just as much as a hero riding off into the sunset is not copyrightable in a movie script. However, a hero riding off into the sunset with bananas in the pistol holsters would be.
This is what I would want to hear more about when discussing if Copilot violates copyright.
No, it's a good analogy, because it isn't about the similarity between people and code. The cases are similar because in both you restrict freedom to enable freedom.
Making source code available and not requiring the same of those who use it is a temporary fleeting freedom that soon turns into lack of freedom.
Like thinking you're ending slavery by freeing all the current slaves but not making it illegal to own, buy, and sell slaves, or capture previously free people into slavery. Guess if you'd have slavery again very soon?
The analogy is about freedom vs lack thereof, not manual labour vs software. And as you see, it works very well.
> their code is so special that no one else could have ever come up with it independently
I'm worried about exactly the opposite: having Copilot help me write code that seems quite generic to me, but which in fact makes my code subject to a license I don't even know about, and/or simply violates copyright.
For an open-source project this could be embarrassing but probably fixable. It gets more complicated if FAANG is doing due diligence on your company. I can see Copilot being both an accelerant and, later, a liability for startups.
There's a setting on GitHub that blocks any suggestions that exactly match code in the training set. I doubt you'd ever get in trouble for code that was similar in structure but different variables etc from existing licensed code (especially since most small snippets of code are not terribly unique to begin with).
I mean, it's nice that they have a setting for the bare minimum a lazy undergrad would do to avoid getting caught for plagiarism: replace some of the words in the copied paragraph with replacements from a thesaurus. It's not something I'd personally expect to hold up under real scrutiny though.
AFAIK that's not enough, for instance see the long-standing industry practice that people working on the Important Stuff are not allowed to ever look at the source code of the Direct Competitor; or clean-room reverse engineering, etc.
I guess time will tell how much acquiring companies (my worry) care about Copilot. Given the difficulty hiring good devs, and the productivity level of body-shop devs, I see it getting a whole lot of use very soon, acknowledged or not.
There's a big difference between reverse engineering (i.e. intentionally writing software that behaves identically to another piece of software) and writing your own code to solve your own problem that may superficially contain small portions of logic similar to some other project's. Copyrighted code has to be sufficiently creative and unique to qualify; otherwise, after the first person wrote code to parse JSON from a web request, no one else would be able to do the same thing.
Kind of interesting... I would like to point out this seems to be specific to the US.
But also.. In that case, when I commission an artist to paint my portrait, surely I can't claim to be the artist.. But I'm no lawyer.
I'm not sure there is a contractual agreement in GitHub's Copilot that says: "Any code you write here is commissioned work". But honestly I didn't read the T&Cs.
So I think you MAY have debunked my analogy, but not the main reason for the analogy.
Copy and paste doesn't really write code, just copies it from one place to another. Copilot on the other hand does generate new potentially novel code.
I'm sure that's what people said when they went from punch cards to assembly, and from assembly to C, and from C to Java.... and yet, here we are. Tools that let us write higher level code faster, just allow us to create more complicated software in a reasonable amount of time.
That's still 100% true of the examples I mentioned. There's always a higher level to consider. When we moved to C, we could stop worrying about what registers we were using. When we moved to Python/Java we could stop worrying about managing memory. When we moved to web frameworks we stopped writing the guts of our servers. And if anything, programmers have become even better paid, despite so many more people in the industry.
I agree with you. However, programmers have not become even better paid because society values programmers. They have become better paid because software is a relatively new artefact in human society which has taken human life by storm, which has made software companies immensely profitable, which meant more companies wanted to create software and attract the people that could help them do it.
As software takes a back seat (or at least a "normal" seat) in society, would we see a normalization of income? Could this be hastened by the development and introduction of tools such as copilot?
Potentially, unless there are new / better things that humans can claim they can provide compared to AI tools. This is the point where I think you and I agree, and I think it's your primary argument in any case (unless I'm mistaken).
AI can code low level stuff. This one function. This small piece of logic. What it can't do is conceive of how to take a bunch of different functions and put them together to produce an actual product. It can't tell you if you should use postgres or mongo. Programmers will always be needed; we'll just move up the stack, and we'll produce more value per hour of our work, justifying our high salaries.
Compare the visible output of someone writing in assembly vs someone writing on top of a modern web framework. Is assembly harder? Yeah. But the web framework is going to give you a usable product in a fraction of the time with way more features. And that's worth more money to the company you work for.
It's always going to be a knowledge worker's job. It's always going to reward experience and creativity and attention to detail. A lot of programming is looking at the world, seeing a gap in what exists, and figuring out what best fits that gap. An AI can't do that. Programming is making 1000 tiny decisions that can't possibly be specified completely by a product manager and need a human to weigh the tradeoffs.
> AI can code low level stuff. This one function. This small piece of logic. What it can't do is conceive of how to take a bunch of different functions and put them together to produce an actual product.
That's what everybody in the chess world said: "AI can decide low level stuff. This one move. This small attack on a rook. What it can't do is conceive of how to take a bunch of different tactics and put them together to produce a game of chess."
...Until Deep Blue beat Garry Kasparov.
> It can't tell you if you should use postgres or mongo.
Yeah, and then came: "It may be able to play chess, but it can't tell you how to play Go."
The hard part about writing code isn't "how to write a for loop" and similar trivial things. Copilot makes this process faster, but the hard part is still organizing your code so that it doesn't become a steaming pile of cowdung a few iterations down the line. That Copilot does not do for you.
So, unless you are a code monkey punching code into autogenerated scaffolding all day, your job is safe.
Forcing your code to be open forever is guaranteeing freedom of all users of my code, both direct and indirect. Developers don't need to have any more freedoms than other users.
> Forcing all of your code to be GPL is like saying “I am on a diet, so now I will force everyone else be on the same diet. Freedom!”
Nobody is forcing anyone to use the code.
If they chose to use it they have to abide by the licensing terms because that’s how it works. If the people laboring for free to produce this code don’t want it to be used in a proprietary application then tough luck, write the code yourself.
Every time the GPL comes up someone drags out this same old dead horse to beat on a little bit more.
Until the time comes when a tax department gets the funny idea to use it and forces you to use it, or people with guns come to your door and haul you away in the morning.
It's not about whether it's a problem in real life; it's about whether the end user might be forced to use a product, which IS a thing. That is the ONLY point I made.
> Forcing all of your code to be GPL is like saying “I am on a diet, so now I will force everyone else be on the same diet. Freedom!”
This is a terrible analogy. Here’s a better one: I’m holding a potluck. If you decide to come, you can eat all you want. If you take food from my event, you can’t hoard it, you must share it, even if you’ve “made it better” by changing it somehow after you left.
Don’t like my rules? OK, don’t come to my potluck.
By analogy, there is a law against me putting handcuffs on another, and in fact the police would stop me from doing so. Did the police protect freedom? Aren't they restricting me from handcuffing others?
In a similar manner, under the MIT license I can restrict my users from modifying and compiling my source code. Is a license that means I have to let my users modify code restricting freedom? Isn't it ensuring the freedom of others, in the same way that a law saying "you shall not handcuff others for no reason" ensures the freedom of others?
Suppose that there's a law that states that water and access to it is always supposed to remain public, because water is a public good.
Suppose that someone comes tomorrow and starts claiming ownership of all the water springs in your country, he becomes the only entry point to get water, and you have to pay him a fee every time you open a tap.
Is he still free to do so? In other words, is the freedom of someone who restricts the freedoms of everyone else still a form of freedom that is worth even considering, let alone respecting?
Because the foundation of your ideas is exactly the reason why capitalism fucked things up and just let a bunch of jerks get rich without merit.
> I want to know what stuff you guys are putting in public GitHub FOSS repos that you don't want replicated in any way...
What a disingenuous reply. FOSS licenses do not grant the ability to replicate "in any way" that you wish. You still have to comply with the license terms. What the hell is wrong with you?
> I don't understand the choice to use GPL for any reason ...
Also note: Copilot violates the attribution requirements of permissive licenses like MIT as well. Even if you put your code on GitHub with the intent of it being freely used in proprietary software, attribution is still a fair demand.
Just to clarify: you seem to believe that most of our code isn't good enough, so copying it is not a big deal.
Do you feel the same about other creative processes as well? Can I rip off a Justin Bieber song and say that it's mine just because it's a shitty song anyway, so who cares? Or does this only apply to software because software is somehow an "inferior" art? Do licenses even have any legal value to you?
The D language uses the Boost license because it is the least restrictive. Anyone is free to use it in closed-source non-free commercial apps if they like, or Open Source if they like.
I don't know what 0-clause BSD is. The Boost license is:
Boost Software License - Version 1.0 - August 17th, 2003
Permission is hereby granted, free of charge, to any person or organization
obtaining a copy of the software and accompanying documentation covered by
this license (the "Software") to use, reproduce, display, distribute,
execute, and transmit the Software, and to prepare derivative works of the
Software, and to permit third-parties to whom the Software is furnished to
do so, all subject to the following:
The copyright notices in the Software and this entire statement, including
the above license grant, this restriction and the following disclaimer,
must be included in all copies of the Software, in whole or in part, and
all derivative works of the Software, unless such copies or derivative
works are solely in the form of machine-executable object code generated by
a source language processor.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
0-clause BSD goes even further, and completely omits the attribution requirement:
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
This is a very "american" definition of freedom, which is basically, just let me do what I want.
GPL uses a different definition of freedom, which I prefer. They look at consequences of restrictions / permissions, and their implication on freedom (not just for me, but for everyone). So some restrictions can lead to actually more freedom, while some permissions can actually decrease freedom.
This is similar to gun control. While it reduces freedom for gun owners, it leaves everyone freer to hang out anywhere they want without being afraid of being shot. Similar arguments can be made for vaccine mandates.
So GPL restricts usage of software because in the long term it gives back power to users, which will be more free.
> This is a very "american" definition of freedom, which is basically, just let me do what I want.
Eh. I see what you’re saying about gun control, but the idea that “some restrictions can lead to actually more freedom, while some permissions can actually decrease freedom” is actually very American.
The free software movement says that everyone deserves software freedom. The Declaration of Independence similarly says “We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.” While I haven’t found a source confirming it, I think that the founders believed that the freedom of speech was one of these unalienable rights.
The GPL puts restrictions in place to make sure that downstream projects give users software freedom. The Constitution put restrictions in place to ensure that the federal (and nowadays the entire) government doesn’t interfere with our unalienable rights.
Take a look at how the first amendment is worded:
“Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.”
The first amendment does not grant the freedom of speech because it doesn’t need to be granted. From the founders’ perspective, god already grants the freedom of speech to everyone forever. The key phrase here is “Congress shall make no law”. The first amendment is restricting Congress to ensure freedom.
The idea that “some permissions can actually decrease freedom” is also present in the Constitution. For example, take a look at Article I sections 8 and 9. The framers of the Constitution could have given Congress the power to pass any law. Instead, they chose to specifically enumerate what Congress can and cannot do.
Perhaps, though, most Americans don’t know much about our founding and think that freedom=just let me do what I want. I don’t know.
I think the GPL is a good idea (it ensures that everyone keeps their freedoms with a modified version of the code, along with other things that it protects users from), but it has some problems too; for one, I think it can be complicated to deal with.
For this reason, I had the idea to make up a new license (although I will not write most of my ideas here but will do so elsewhere). Its main working would be: mostly you can do whatever you want (including omitting attribution and copyright notices) without worrying about the license, but you cannot use legal processes (such as lawsuits, DMCA, etc.) to deny these freedoms to any downstream recipients (regardless of how many). The license would also ensure patents can be used freely, include a disclaimer of warranty (if the license is included in the copy and the recipient has not paid for the copy), and include some other things to ensure freedom (although there can be some restrictions on the use of trademarks (e.g. to avoid false advertising), and some things to avoid working around the freedoms in certain ways). You can be forgiven any number of times, though; the license will not be terminated. Furthermore, for the practical reason of license compatibility, relicensing under GPL3 and AGPL3 (and possibly also CC-BY-SA 4.0, for works other than computer programs) is also allowed, as long as you have a copy of the source code and can satisfy the terms of those licenses.
"I also want to know why people think their code is so special that no one else could have ever come up with it independently. "
Really? What exactly does this Copilot thing actually spit out? I can't help but think that it spits out near-verbatim code, which in the UK is probably dodgy on copyright.
You then go on to decide that the GPL isn't for you. That's fine. You even explain that you are ill-equipped for something. That too is fine.
You are not a fan of free or "libre" stuff. That comes across loud and clear. Thank you.
Forcing the code to be open is the kind of freedom where restricting something locally enables freedom globally. Granting the freedom to do whatever with the code will make the code end up used in closed ways, empowering those who close the code.
A similar line of thought is the "paradox of tolerance", which posits that if a society tolerates the intolerant, the tolerance of that society will lessen.
Freedom is not, and cannot be, an absolute. If I am 100% free, that by definition restricts the freedom of others (for instance, if I am free to punch you in the face, you are not free to not be punched in the face; if I am free to own you as a slave, you thus lose a lot of freedoms).
Determining what freedom should mean is not, and has never been, a simple matter of "well, if you make any restrictions on it, then it's not real freedom, so everyone just gets to be free!" It's all about finding balance, and dealing with nuance, and all that frustrating hard stuff.
Note that while Copilot is a major motivator for this effort, it isn't the only one; there's a pile of other reasons listed at https://sfconservancy.org/GiveUpGitHub/ . GitHub and its lock-in has been a problem for a long time, and this is just the most recent problem.
I mean the answer to that question is obvious: they're not under any obligation to include their own code in the training data. Why would they?
A better question would be whether they would take legal action against a competitor that creates a copilot equivalent and publicly states that they trained it on leaked, proprietary M$ source code. That would actually be an example of hypocrisy.
> They're not under any obligation to include their own code in the training data. Why would they?
Because these models work better with more data, and presumably this is a lot of high-quality data that they already have lying around anyway? Because there is no downside according to their own reasoning? Because it would shut up a lot of these criticisms right away? Because marketing would be so much easier with that kind of dogfooding?
In short: because according to their own story there would be only upsides, no downsides.
According to their logic, if I train a model using stolen Windows source code, it's fair use.
Just because the code they use is under FLOSS licenses does not allow them to evade things like Affero GPL3. And, to that end, if they are using Affero-licensed code, I want the source to the whole Copilot infrastructure -or- proof they used no AGPL3 code anywhere.
Perhaps because there is a (small) risk of leaking confidential information through its output.
But that's not as damning as it sounds.
First, we know Copilot, if given the right prompt and told to autocomplete repeatedly without any manual input, can regurgitate bits of code seen many times in many different repositories, like the famous Quake fast inverse square root function and the text of licenses. That doesn't mean it does so under normal prompts and normal use. Perhaps it does sometimes, and that would be a real concern. But any regurgitation that isn't under normal use, which only happens if the user is trying to make Copilot regurgitate, is not a problem when it comes to copyright violations of open source code (since anyone trying to violate an open source license can do so much more easily without using Copilot), yet it may still be a problem when it comes to leaking confidential information.
Second, whether something is a copyright violation and whether it risks leaking confidential information are somewhat orthogonal. A copyright violation usually requires at least several lines of code, and more if the copying is not verbatim, or if the code is just a series of function calls which must be written near-verbatim in order to use an API. On the other hand, `const char PRIVATE_KEY[] = ` could hypothetically complete to something dangerous in just one line of code. That said, it almost certainly wouldn't, since even if a private key was stored in source code in the first place (obviously it shouldn't be), it probably wouldn't be repeated enough to be memorized by the model. Yet…
…third, the risk tolerances are different. If, to use completely made-up numbers, 0.1% of Copilot users commit minor copyright violations and 0.001% commit major ones, that's probably not a big deal considering how many copyright violations are committed by hand – sometimes intentionally, mostly unintentionally. (When it comes to unintentional ones, consider: Did you know that if you copy snippets from Stack Overflow, you're supposed to include attribution even in any binary packages you distribute, and also the resulting code is incompatible with several versions of the GPL? Did you know that if you distribute binaries of code written in Rust, you need to include a copy of the standard library's license?) But when it comes to leaking confidential information, even one user getting it would be somewhat bad (though admittedly Microsoft does distribute much of their source code privately to some parties), and taking even a small risk would be a questionable decision when there is a ready alternative.
> Perhaps because there is a (small) risk of leaking confidential information through its output.
If Microsoft/Github ever made that argument, that also means that when Copilot is using GPL software as input, the output can only be released under the GPL.
Copyright licenses don't apply to small snippets, no matter if you think they do, and learning and applying other people's code isn't prohibited by the license, and thank god, can't be prohibited.
FWIW, there are some (admittedly fairly naive) checks to prevent PII and other sensitive info from being suggested to users. Copilot looks for things like ssh keys, social security numbers, email addresses, etc, and removes them from the suggestions that get sent down to the client.
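For anyone wondering what a "fairly naive" check like that might look like, here is a minimal sketch. The patterns and the names `SENSITIVE_PATTERNS`, `is_sensitive` and `filter_suggestions` are my own assumptions for illustration, not Copilot's actual implementation, and this version simply drops any suggestion that matches.

```python
import re

# Hypothetical patterns, chosen for illustration only; the real checks
# are not described in this thread.
SENSITIVE_PATTERNS = [
    # PEM/OpenSSH private key header
    re.compile(r"-----BEGIN (?:RSA |DSA |EC |OPENSSH )?PRIVATE KEY-----"),
    # Something shaped like a US social security number
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Something shaped like an email address
    re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
]


def is_sensitive(text):
    """Return True if any of the naive patterns matches the text."""
    return any(p.search(text) for p in SENSITIVE_PATTERNS)


def filter_suggestions(suggestions):
    """Drop suggestions that look like they contain sensitive data."""
    return [s for s in suggestions if not is_sensitive(s)]
```

Regex checks like these will have both false positives and false negatives, which is presumably why they were described as naive.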
There's also a setting at https://github.com/settings/copilot (link only works if you've signed up for copilot) that will check any suggestion on the server against hashes of the training set, and block anything that exactly duplicates code in the training set (with a minimum length, so very common code doesn't get completely blocked). Users must choose the value for this setting when they sign up for copilot.
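A rough sketch of how such a duplicate check could work, assuming whole-suggestion hashing. The threshold `MIN_CHARS`, the whitespace normalization, and the function names are all hypothetical; the thread does not say how the real check computes or compares its hashes.

```python
import hashlib

# Hypothetical minimum length, so very common short code is never blocked.
MIN_CHARS = 60


def normalize(code):
    """Collapse whitespace runs (an assumption; the real check may differ)."""
    return " ".join(code.split())


def fingerprint(code):
    """SHA-256 hash of the normalized code."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def build_index(training_snippets):
    """Hash every training snippet that is long enough to matter."""
    return {
        fingerprint(s)
        for s in training_snippets
        if len(normalize(s)) >= MIN_CHARS
    }


def should_block(suggestion, index):
    """Block only suggestions that are long enough and exactly match."""
    if len(normalize(suggestion)) < MIN_CHARS:
        return False
    return fingerprint(suggestion) in index
```

Because this only catches exact matches after normalization, a suggestion that differs by a renamed variable would pass, so it limits verbatim regurgitation rather than near-copies.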
I tried using Copilot and it literally attributed the function I was writing to someone else before I could even start writing a line. It's been updated since, and these errors are rare now, but they still exist.
> why are your Microsoft Windows and Office codebases not in your training set?
This is my favorite question about Copilot ever.
While GitHub might have a license to use that code to train the model, it’s debatable what license applies to the output of the model, and what users of the model can do with it.
It’s possible for an AI to reproduce something so close to the original that it would be considered an infringement on the original work.
The reasons that Windows is awful have nothing to do with code quality. Windows is awful because of intentional choices Microsoft made (e.g., bloatware that gets reinstalled with every update, mandatory Microsoft accounts, and mandatory telemetry).
Whenever I have to start Windows 10, I still see the same kind of bugs that were present on XP. One example: they seem to be simply unable to fix the icons "near the clock", which are still shown after some app has been killed, until you hover over them. There are things like that, but of course also lots of stuff that affects people more in the form of annoyances, making every action take at least twice as long as on the GNU/Linux distros I run. It only takes minutes before I am frustrated with the system, because everything takes so long to do.
One similarly ignored bug that springs to mind is the performance of the "Send To" context menu item in File Explorer. I always dreaded dragging my mouse over it by accident.
They could also cache that computed menu and proactively update the cache whenever the relevant keys are changed. Either way, pretty far from "cannot be fixed without breaking the API".
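The caching idea can be sketched like this. It is a toy model, not Windows code: `MenuCache`, `build_menu` and `on_key_changed` are hypothetical names, and the change notification is assumed to be delivered by whatever owns the underlying keys.

```python
class MenuCache:
    """Keep a precomputed menu and rebuild it only when a watched key changes."""

    def __init__(self, build_menu, watched_keys):
        self._build_menu = build_menu      # expensive function that computes the menu
        self._watched = set(watched_keys)  # keys whose changes affect the menu
        self._menu = build_menu()          # compute once, up front
        self.rebuilds = 1

    def on_key_changed(self, key):
        """Change-notification hook: refresh the cache proactively."""
        if key in self._watched:
            self._menu = self._build_menu()
            self.rebuilds += 1

    def get(self):
        """Cheap lookup used when the user hovers over the menu item."""
        return self._menu
```

The point is that the expensive work moves to the rare moment a key changes, so opening the menu is just a lookup.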
Windows 11 is an example of poor code quality. Bugs everywhere, while the same things work on Ubuntu/Pop!_OS.
Past MS engineers have been commenting for a decade on how MS has grown too big, can't manage, and has become a monolith "too big to fail". By nature when engineers are small pieces of a giant machine, they don't do their best work. And those with the experience move on to better things.
My experience has also been that Windows 11 is buggy (haven't been using it for a while because it can't even reliably connect to the internet). But also in my limited experience (just one install on a single machine in ~2020, used for a few months): Ubuntu is just as bad or even worse.
Your experience is quite limited, and you probably need to know how to properly update Ubuntu. Most of the issues I've found with it (since I started using it ~12 years ago) are caused by missing drivers, which get solved in 15 minutes once you know where to click. Once those are solved it is sturdy, and you can keep it running for several months without having to restart it or it becoming unusably slow, as happens with Windows systems after about 4 days of uptime.
This comment is funny to me because it was up to date and the particular issue wasn’t driver related: it was specifically that, after not touching it at all for a couple of months, each subsequent time I logged in it would randomly lock up, and it took about 15 minutes to boot.
It would lock up as in just take an extremely long time to do certain things in the UI. That sounds like a pretty odd way for a driver issue to manifest, but maybe I'm missing something.
The biggest issue with Windows isn't poor code or shitty engineering, it's the support for legacy software. MS engineers are some of the smartest in the world. The devs can fix the code and make a much better OS, but that would break boomer software used by big banks that haven't updated since the 80s. When Microsoft writes code, it has to promise support for decades, which means having to maintain the same old outdated APIs for many years.
Outdated APIs don't have to affect the shell and built-in programs or anything else that is kept up to date. My Linux programs are no more buggy due to having Wine installed for similar compat with legacy Windows executables.
Was it? I recall the kuro5hin analysis of the leaked Windows 2000 source code[0] that said:
>there is nothing really surprising in this leak. Microsoft does not steal open-source code. Their older code is flaky, their modern code excellent. Their programmers are skilled and enthusiastic. Problems are generally due to a trade-off of current quality against vast hardware, software and backward compatibility.
They explicitly listed the reasons they think it's awful. My personal grievances with Windows align more or less with theirs, and while I wouldn't go as far as to say it's awful, I'd certainly use something stronger than "annoyance".
Specifically, clear anti-user choices that exceed by far being "annoying":
* Making it exceedingly difficult or impossible to use the OS without logging in with a Microsoft account.
* Forcing the user in various ways to surrender data to Microsoft. Some of them can be disabled if you really go out of your way, others can't.
* Prompting me again and again to switch to Edge and other MS defaults. I've had the same install for a few years now and NO, I don't want to change to "Microsoft recommended defaults", no matter how many times you ask me.
* Showing the same "OS setup" screen after some updates, requiring me to pay very close attention to what I'm clicking, lest I select something MS is trying to lead me to. The amount of attention required from the user on those screens corresponds quite well with anti-user behavior.
>Making it exceedingly difficult or impossible to use the OS without logging in with a Microsoft account
This is hilarious. I recently got a new laptop that has window$ 11. After setting it up with a non-Microsoft email (which required a good fight), I tried to install some random app from the Microsoft Store, but got a "something went wrong please try again" on the first screen.
It's pathetic. I haven't used Windows since Win 7, which I basically installed for gaming. Seeing the latest version of the OS makes me feel sorry for them. That's why Apple, with all their assholery, is eating their lunch (on the flip side, my wife just got an M1 MBP and I was pleasantly surprised that it has an HDMI port, MagSafe, and several USB-C ports. Apple seems to be going in the right direction.)
You haven't had root admin on Windows since Windows 7.
The telemetry makes this clear. Reboots and updates even more so.
The UI lag and stealing of focus ("oh, you're typing a document... too bad, I want to launch a new Explorer window that will immediately steal focus") make it clear that the computer is in charge and will probably listen to your requests, but on the timeline it chooses.
Windows already does compatibility in some crazy ways by default. The compatibility mode lies to the program about the system version, or even fakes old bugs so that programs relying on those bugs will run. I'd imagine that to make this work, MS would need tons of the most shitty code you can imagine in the source to fake those behaviors.
I call it Microsoft DNA. The way they do stuff is inhuman alien logic without any compassion or remorse (like all of their 10k+ Windows APIs, or dotnet, or the way they add features and handle support).
It is however plausible that the code is only "good" given internal considerations. Microsoft has specific internal coding styles designed to work with internal tools.
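To make the "lies about the system version" part concrete, here is a toy sketch of a per-application version shim. It is not how Windows implements compatibility mode; `VERSION_SHIMS`, `reported_version` and the application names are hypothetical. Only the version numbers are real: 5.1 is what Windows XP reports and 10.0 is what Windows 10 reports.

```python
# The version the OS really is (10.0 corresponds to Windows 10).
REAL_VERSION = (10, 0)

# Hypothetical per-application shim table: apps listed here are told
# they are running on an older system (5.1 corresponds to Windows XP).
VERSION_SHIMS = {"legacy_app.exe": (5, 1)}


def reported_version(app_name, real_version=REAL_VERSION, shims=VERSION_SHIMS):
    """Return the version the app is told, which may differ from the real one."""
    return shims.get(app_name, real_version)
```

Faking old bugs would need far more than a lookup table like this, which is where the pile of ugly special cases would come from.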