Tips For AI-Assisted Coding Amid Open-Source Theft Claims
In an early test of a potential landmark lawsuit involving generative AI coding tools, a claim for breach of open-source licenses partially survived the defendants’ motion to dismiss.
The May 11 decision in Doe 1 v. GitHub Inc. joins a recent trend of litigation surrounding generative AI tools and centers on Copilot, which assists end users with writing blocks of code using AI technology.
What Is Copilot?
Copilot is a programming assistant powered by the Codex machine learning model. To access it, users download and install a plug-in for their code editor, such as Visual Studio or VS Code.
Then, much like the autocomplete feature in certain email applications or web search tools, Copilot offers suggestions in real time as the user writes code. The suggestions initially appear as italicized “ghost text” that a user can either accept or reject.
According to its website, Copilot helps users “spend less time creating boilerplate and repetitive code patterns, and more time on what matters: building great software.” Copilot is a subscription tool available for $10 per month or $100 per year.
Like other generative AI tools, Copilot was trained on an extensive data set, including third-party source code stored in public repositories on GitHub Inc., the largest internet hosting service for software projects.
When GitHub users store their code on GitHub, they choose whether to make their code repositories public or private. Users can also select from a range of preset open-source licenses to apply to the code published in their GitHub repositories, apply their own individual licenses, or select none at all.
Typically, these open-source licenses require some form of attribution, usually by, among other things, including a copy of the license along with the name and copyright notice of the original author.
The Plaintiffs’ Allegations
The plaintiffs in the present case are anonymous coders who claim ownership of software projects stored on GitHub. These plaintiffs alleged, among other things, that the defendants used their source code to train Copilot and that Copilot reproduces portions of the plaintiffs’ code without providing attribution as required under the open-source licenses.
This conduct, they alleged, is a breach of the open-source code licenses, as well as a violation of the Digital Millennium Copyright Act, which prohibits the removal of certain copyright management information from copyrighted works.
The Defendants’ Response
In their motion to dismiss, the defendants pointed out, among other things, that the plaintiffs have not identified any specific instance in which Copilot reproduced their licensed code. Indeed, because the coders filed their claims anonymously, they did not even identify the specific repositories on GitHub in which they claim rights.
As such, the defendants contended that the anonymous coders had failed to allege viable legal claims.
The defendants also pointed out that under GitHub’s terms of service, users who make their code repositories public, such as the plaintiffs, grant other GitHub users a nonexclusive, worldwide license to use, display, and perform their code through GitHub services and to reproduce their code on GitHub as permitted through GitHub’s functionality.
In addition, every user agrees to GitHub’s terms of service, which include a “license grant” to GitHub to “store, archive, parse, and display … and make incidental copies” as well as “parse it into a search index or otherwise analyze it” and “share” the content in public repositories with other users. These terms, according to the defendants, foreclose any license breach claims.
The Court’s Ruling
The court largely tossed the plaintiffs’ demands for monetary damages, concluding that they had failed to establish a cognizable injury to their property rights. Nonetheless, the court concluded that the plaintiffs could continue to pursue injunctive relief on such claims, such as a court order that would prohibit Copilot from reproducing the plaintiffs’ code without proper attribution.
Implications
Computer coding and software development have been some of the most popular use cases for generative AI in recent years, but as this pending lawsuit makes clear, the legal implications of AI-generated code are a matter of dispute.
These concerns stem at least in part from the use of open-source code to train Copilot and similar tools. As noted above, the licenses governing that code may impose obligations on programmers who use it.
In addition to attribution requirements that may be relatively easy to comply with, some open-source licenses impose highly onerous obligations, such as requiring the derivative work to be made available on the same terms as the open-source code.
What’s more, it is often unclear which works make up a particular training set.
Thus, it is generally not possible to say definitively what open-source license terms govern the code used to train Copilot or other similar tools.
Finally, it can be difficult to determine whether generative AI coding tools are directly reproducing licensed open-source code or merely taking inspiration from it.
These uncertainties — variable open-source license terms, uncertainty regarding data sets, and difficulty determining whether AI is directly copying open-source code — have caused some to question whether the risks associated with the use of AI coding assistants outweigh the benefits.
Practice Tips
So what can companies that want to use these tools do?
Proceed With Caution
As with other AI-generated content, users should proceed cautiously, while carefully reviewing and testing AI-contributed code. Generative AI tools should aid human creation, not replace it.
Reducing Likelihood of Claims
Some AI developers now provide tools that allow coders to exclude AI-generated code that matches code in large public repositories. In other words, these tools help make sure the AI assistant is not directly copying other public code, which would reduce the likelihood of an infringement claim or of inadvertently including open-source code.
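For teams that want a rough, in-house sense of what such a matching check involves, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not how Copilot or any vendor's filter works, the function names (`normalize`, `overlap_ratio`, `flag_matches`), the five-token window, and the 0.6 threshold are all hypothetical choices, and the "corpus" is simply whatever reference code a company chooses to compare against.

```python
import re


def normalize(code: str) -> list[str]:
    """Split code into tokens, ignoring whitespace and '#' line comments.

    Assumption: Python-style comments. A '#' inside a string literal would
    also be stripped, which is acceptable for a rough similarity check.
    """
    code = re.sub(r"#.*", "", code)
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(tokens: list[str], k: int) -> set[tuple[str, ...]]:
    """Return the set of all runs of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def overlap_ratio(suggestion: str, reference: str, k: int = 5) -> float:
    """Fraction of the suggestion's k-token runs that also occur in reference."""
    suggested = shingles(normalize(suggestion), k)
    if not suggested:
        return 0.0  # suggestion is shorter than k tokens
    known = shingles(normalize(reference), k)
    return len(suggested & known) / len(suggested)


def flag_matches(suggestion: str, corpus: dict[str, str],
                 threshold: float = 0.6, k: int = 5) -> list[str]:
    """Names of reference files whose overlap with the suggestion meets the threshold."""
    return [name for name, reference in corpus.items()
            if overlap_ratio(suggestion, reference, k) >= threshold]


if __name__ == "__main__":
    # Hypothetical reference corpus: file name mapped to its source text.
    corpus = {"vendor/add.py": "def add(a, b):\n    return a + b\n"}
    suggestion = "def add(a, b):  # sum two values\n        return a+b"
    print(flag_matches(suggestion, corpus))
```

A check like this catches only near-verbatim reuse of code the company already has on hand. It says nothing about whether a match is legally significant, and a clean result does not establish that a suggestion is free of third-party code.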
Minimize AI-Generated Code
Despite the pending litigation against the developers of Copilot, it is not clear that users of Copilot will face a significant risk of breach or infringement claims. That said, users may be at lower risk of such claims if they minimize the amount of AI-generated code they incorporate into their projects.
Short snippets or strings of AI-generated software code may be at a lower risk of incorporating copyrightable or otherwise protectable portions of the open-source libraries. In contrast, using generative AI tools to generate large strings of software code or entire software programs may be more likely to include protectable snippets.
Reconsider Due Diligence
Companies may also want to reconsider how they approach corporate due diligence.
Typically, during diligence, each party evaluates the intellectual property, licenses and material contracts involved in a transaction.
Therefore, companies, particularly those with significant software businesses or needs, should consider whether the inclusion of AI-generated code would undermine the value of the targeted asset or generate unacceptable levels of risk.
For a seller, failure to disclose the use or inclusion of generative AI code may open the door to indemnification obligations, breaches of disclosures, and violation of representations and warranties.
Contractor and Vendor Restrictions
Relatedly, companies may want to consider imposing restrictions on work-for-hire contractors and vendors, such as by either requiring them to disclose their use of generative AI tools or prohibiting or restricting such use.
Review Policy
Finally, all companies should consider reviewing existing policies or adopting new policies governing the use of generative AI by employees and vendors.