COMMUNIA Association - artificial intelligence https://communia-association.org/tag/artificial-intelligence/ Website of the COMMUNIA Association for the Public Domain Wed, 25 Oct 2023 06:53:20 +0000

Statement on Transparency in the AI Act https://communia-association.org/2023/10/23/statement-on-transparency-in-the-ai-act/ Mon, 23 Oct 2023 18:12:49 +0000

The post Statement on Transparency in the AI Act appeared first on COMMUNIA Association.

A fifth round of the trilogue negotiations on the Artificial Intelligence (AI) Act is scheduled for October 24, 2023. Together with Creative Commons and Wikimedia Europe, COMMUNIA has issued a statement calling on the co-legislators to take a holistic approach to AI transparency and agree on proportionate solutions.

As discussed in greater detail in our Policy Paper #15, COMMUNIA deems it essential that the flexibilities for text-and-data mining enshrined in Articles 3 and 4 of the Copyright in the Digital Single Market Directive are upheld. For this approach to work in practice, we welcome practical initiatives for greater transparency around AI training data to understand whether opt-outs are being respected.

The full statement is provided below:

Statement on Transparency in the AI Act

The undersigned are civil society organizations advocating in the public interest, and representing knowledge users and creative communities.

We are encouraged that the Spanish Presidency is considering how to tailor its approach to foundation models more carefully, including an emphasis on transparency. We reiterate that copyright is not the only prism through which reporting and transparency requirements should be seen in the AI Act.

General transparency responsibilities for training data

Greater openness and transparency in the development of AI models can serve the public interest and facilitate better sharing by building trust among creators and users. As such, we generally support more transparency around the training data for regulated AI systems, and not only on training data that is protected by copyright.

Copyright balance

We also believe that the existing copyright flexibilities for the use of copyrighted materials as training data must be upheld. The 2019 Directive on Copyright in the Digital Single Market and specifically its provisions on text-and-data mining exceptions for scientific research purposes and for general purposes provide a suitable framework for AI training. They offer legal certainty and strike the right balance between the rights of rightsholders and the freedoms necessary to stimulate scientific research and further creativity and innovation.

Proportionate approach

We support a proportionate, realistic, and practical approach to meeting the transparency obligation, which would put less onerous burdens on smaller players including non-commercial players and SMEs, as well as models developed using FOSS, in order not to stifle innovation in AI development. Too burdensome an obligation on such players may create significant barriers to innovation and drive market concentration, leading the development of AI to only occur within a small number of large, well-resourced commercial operators.

Lack of clarity on copyright transparency obligation

We welcome the proposal to require AI developers to disclose the copyright compliance policies followed during the training of regulated AI systems. We are still concerned with the lack of clarity on the scope and content of the obligation to provide a detailed summary of the training data. AI developers should not be expected to literally list out every item in the training content. We maintain that such level of detail is not practical, nor is it necessary for implementing opt-outs and assessing compliance with the general purpose text-and-data mining exception. We would welcome further clarification by the co-legislators on this obligation. In addition, an independent and accountable entity, such as the foreseen AI Office, should develop processes to implement it.

Signatories

Defining best practices for opting out of ML training – time to act https://communia-association.org/2023/09/29/defining-best-practices-for-opting-out-of-ml-training-time-to-act/ Fri, 29 Sep 2023 11:50:04 +0000

In April of this year we published our Policy Paper #15 on using copyrighted works for teaching the machine, which deals with the copyright policy implications of using copyrighted works for machine learning (ML) training. The paper highlights that the current European copyright framework offers a well-balanced approach to such uses in the form of the text and data mining exceptions in Articles 3 & 4 of the Copyright Directive.

In their new policy brief on defining best practices for opting out of ML training, published today, our member Open Future takes a closer look at a key element of the text and data mining exception: the rights reservation mechanism foreseen in Article 4(3) of the Directive, which allows authors and other rights holders to opt out of their works being used to train (generative) ML models.

The policy brief highlights that there are still many open questions relating to the implementation of the opt-out mechanism, that is, how the machine-readable reservation of rights under Article 4 will work in practice. One of the key issues in this context is that there are currently no generally accepted standards or protocols for the machine-readable expression of the reservation. The authors of the policy brief provide an overview of existing initiatives to provide standardized opt-outs, which include initiatives by Adobe and a publisher-led W3C community working group, as well as the artist-led project Spawning, which provides an API that aggregates various opt-out systems. In addition, they highlight a number of proprietary initiatives from model developers, including Google and OpenAI.

Lack of a technical standard

According to the policy brief, one of the key problems facing creators and other rights holders who wish to opt out of ML training is that it is unclear whether and how their intention to opt out will be respected by ML model developers. The authors consider this deeply problematic, as it risks undermining the legal framework put in place by the 2019 Copyright Directive:

Continued lack of clarity on how to make use of the opt-out from Article 4 of the CDSM Directive creates the risk that the balanced regulatory approach adopted by the EU in 2019 might fail in practice, which would likely lead to a reopening of substantive copyright legislation during the next mandate. Given the length of EU legislative processes, this would prolong the status quo and, as a result, fail to provide protection for creators and other rightholders in the immediate future.

It seems clear that this scenario should be avoided, both in the interest of creators, who need to have meaningful tools to enforce their rights vis-à-vis commercial ML companies, and in order to preserve the hard-won compromise reflected in the TDM exceptions.

In line with this, the Open Future policy brief calls on the European Commission to “provide guidance on how to express machine-readable rights reservations”. According to the brief, the Commission needs to step in and “publicly identify data sources, protocols and standards that allow authors and rightholders to express a machine-readable rights reservation in accordance with Article 4(3) CDSM”. This guidance would provide important clarity about the availability of freely usable methods of reservation and certainty as to their functionality.

According to the authors, such an intervention would allow the Commission to support creators and other rightholders seeking means to opt out of ML training, while at the same time providing “more certainty to ML developers seeking to understand what constitutes best efforts to comply with their obligations under Article 4(3) of the CDSM Directive”.

Time to act

The Open Future policy brief identifies an important shortcoming in the existing EU approach to the use of copyrighted works for ML training. Without clear guidelines for standardized machine-readable rights reservations, the opt-out mechanism foreseen in Article 4 is unlikely to work in practice. While there are a number of existing standards, the fragmentation of this system causes tremendous uncertainty for creators.

As Open Future points out, it is up to the Commission, which is responsible for ensuring the proper implementation of the Directive’s provisions, to intervene in this area and provide initial clarity to all stakeholders. In the longer term, it would be ideal to see the emergence of an open standard that is maintained independently of any direct stakeholders. Such an effort should not be limited to aggregating opt-outs, but should also be designed to ensure that works in the public domain or made available under licenses that allow and/or encourage reuse are clearly identified as such (in line with our Policy Recommendation #20).

The AI Act and the quest for transparency https://communia-association.org/2023/06/28/the-ai-act-and-the-quest-for-transparency/ Wed, 28 Jun 2023 07:00:33 +0000

Artificial intelligence (AI) has taken the world by storm and people’s feelings towards the technology range from fascination about its capabilities to grave concerns about its implications. Meanwhile, legislators across the globe are trying to wrap their heads around how to regulate AI. The EU has proposed the so-called AI Act, which aims to protect European citizens from potentially harmful applications of AI while still encouraging innovation in the sector. The file, which was originally proposed by the European Commission in April of 2021, has just entered trilogues and will be hotly debated over the coming months by the European Parliament and the Council.

One of the key issues for the discussions will most likely be how to deal with the rather recent phenomenon of generative AI systems (also referred to as foundational models), which are capable of producing various content ranging from complex text to images, sound, computer code and much more with very limited human input.

The rise of generative AI

Within less than a year, generative AI technology went from having a select few, rather niche applications to becoming a global phenomenon. Perhaps no application represents this development like ChatGPT. Originally released in November 2022, ChatGPT broke all records by reaching one million users within just five days of its release, with the closest competitors for this title, namely Instagram, Spotify, Dropbox and Facebook, taking several months to reach the same stage. Fast forward to today, approximately half a year later, and ChatGPT reportedly counts more than 100 million users.

One of the reasons for this “boom” of generative AI systems is that they are more than just a novelty. Some systems have established themselves as considerable competitors for human creators for certain types of creative expressions, being able to write background music or produce stock images that would take humans many more hours to create. In fact, the quality of the output of some systems is already so high while the cost of production is so low that they pose an existential risk to specific categories of creators, as well as the industries behind them.

But how do generative AI systems achieve this and what is the secret behind their ability to produce works that can comfortably compete with works of human creativity? Providing an answer to this question, even at surface level, is extremely difficult since AI systems are notoriously opaque, making it nearly impossible to fully understand their inner workings. Furthermore, developers of these systems have an obvious interest in keeping the code of their algorithm as well as the training data used secret. This being said, one thing is for certain: generative AI systems need data, and lots of it.

The pursuit of data

Creating an AI system is incredibly data intensive. Data is needed to train and test the algorithm throughout its entire lifecycle. Going back to the example of ChatGPT, throughout its iterations the system was trained on numerous datasets containing hundreds of gigabytes of data, equating to hundreds of billions of words.

With so much data needed for training alone, this raises the question of how developers get their hands on this amount of information. As the sheer numbers make fairly obvious, training data for AI systems is usually not collected manually. Instead, developers often rely on two sources for their data: curated databases which contain vast amounts of data, and so-called web crawlers which “harvest” the near boundless information and data resources available on the open internet.

The copyright conundrum

Some of the data available in online databases or collected by web scraping tools will inevitably be copyrighted material, which raises some questions with regard to the application of copyright in the context of training AI systems. COMMUNIA has extensively discussed the interaction between copyright and text and data mining (TDM) in our policy paper #15. As a short refresher, the 2019 Copyright Directive establishes a clear framework:

Under Article 3, research organizations and cultural heritage institutions may scrape anything that they have legal access to, including content that is freely available online for the purposes of scientific research. Under Article 4, this right is extended to anyone for any purposes but rights holders may reserve their rights and opt out of text and data mining, most often through machine-readable means.

While this framework, in principle, provides appropriate and sufficient legal clarity on the use of copyrighted materials in AI training, the execution still suffers from the previously mentioned opacity of AI systems and the secrecy around training data as there is no real way for a rightsholder to check whether their attempt to opt out of commercial TDM has actually worked. In addition, there’s still a lot of uncertainty about the best technical way to effectively opt out.

Bringing light into the dark

Going back to the EU’s AI Act reveals that the European Parliament recognises this issue as well. The Parliament’s position foresees that providers of generative AI models should document and share a “sufficiently detailed” summary of the use of training data protected under copyright law (Article 28b). This is an encouraging sign and a step in the right direction. The proof is in the pudding, however. More clarity is needed with regard to what “sufficiently detailed” means and how this provision would look in practice.

Policy makers should not forget that the copyright ecosystem itself suffers from a lack of transparency. This means that AI developers will not be able – and therefore should not be required – to detail the author, the owner or even the title of the copyrighted materials that they have used as training data in their AI systems. This information simply does not exist out there for the vast majority of protected works and, unless right holders and those who represent them start releasing adequate information and attaching it to their works, it is impossible for AI developers to provide such detailed information.

AI developers also should not be expected to know which of their training materials are copyrightable. Introducing a specific requirement for this category of data adds legal complexity that is neither needed nor advisable. For that and other reasons, we recommend in our policy paper that AI developers be required to be transparent about all of their training data, and not only about the data that is subject to copyright.

The fact that AI developers know so little about each of the materials that are being used to train their models should not, however, be a reason to abandon the transparency requirement.

In our view, those that are using publicly available datasets will probably comply with the transparency requirement simply by referring to the dataset, even if the dataset is lacking detailed information on each work. Those that are willing to deposit their training data with a data trust that would ensure the accessibility of the repository for purposes of assessing compliance with the law would probably also ensure a reasonable level of transparency.

The main problem is with those that are not disclosing any information about their training data, such as OpenAI. These need to be required to provide some form of public documentation and disclosure, and at the very least need to be able to show that they have not used copyrighted works that have an opt-out attached to them. And that raises the question: how can creators and other right holders effectively reserve their training rights and opt out of the commercial TDM exception?

Operationalizing the opt-out mechanism

In our recommendations for the national implementation of the TDM exceptions we suggested that the proper technical way to facilitate web mining was the use of a protocol like robots.txt, which creates a binary “mine”/“don’t mine” rule. However, this technical protocol has some significant limitations when it comes to its application in the context of data mining for AI training data.
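
To make the binary rule concrete, here is a minimal sketch of how a crawler could check such a file before mining a page, using the parser from Python's standard library (the conventional file name is robots.txt). The crawler names `ExampleMLBot` and `ExampleSearchBot`, the example.org URL and the helper `may_mine` are hypothetical and chosen for illustration only; this is not a description of how any particular ML developer's crawler works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a rights holder: one (made-up)
# ML crawler is told to stay away, every other crawler is allowed.
ROBOTS_TXT = """\
User-agent: ExampleMLBot
Disallow: /

User-agent: *
Allow: /
"""


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # Parse the rules from a string instead of fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/gallery/artwork.html"
    for agent in ("ExampleMLBot", "ExampleSearchBot"):
        verdict = "mine" if may_mine(ROBOTS_TXT, agent, url) else "don't mine"
        print(f"{agent}: {verdict}")
```

The sketch also hints at the limitations: the rule is keyed to a crawler's self-declared name and to a URL path, so it cannot say anything about the purpose of the mining, it only works if the rights holder knows each crawler's name and controls the website, and it does not travel with copies of the work hosted elsewhere.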

Therefore, one of the recommendations in our policy paper is for the Commission to lead these technical discussions and provide guidance on how the opt-out is supposed to work in practice to end some of the uncertainty that exists among creators and other rights holders.

In order to encourage a fair and balanced approach to both the opt-out and the transparency issues, the Commission could convene a stakeholder dialogue and include all affected parties, namely AI developers, creators and rights holders as well as representatives of civil society and academia. The outcome of this dialogue should be a way to operationalise the opt-out system itself and the transparency requirements that will uphold such a system without placing a disproportionate burden on AI developers.

Getting this right would provide a middle ground that allows creators and other rights holders to protect their commercial AI training rights over their works while encouraging innovation and the development of generative AI models in the EU.

Using Copyrighted Works for Teaching the Machine – New Policy Paper https://communia-association.org/2023/04/26/using-copyrighted-works-for-teaching-the-machine-new-policy-paper/ Wed, 26 Apr 2023 10:16:56 +0000

The surge of generative artificial intelligence has gone alongside a renewed interest in questions about the relationship between machine learning and copyright law. In our newly published policy paper #15 entitled “Using copyrighted works for teaching the machine” (also available as a PDF file), we are looking at the input side of the equation within the EU copyright framework.

We discuss the considerations surrounding the use of copyright-protected works and other protected subject matter as training data for generative AI models, and provide two recommendations for lawmakers. Here, we leave aside questions relating to the output of AI models (e.g. whether the output of generative AI models is copyrightable and to what extent such output can infringe exclusive rights), which we will address in another, yet to be published paper.

This paper is without prejudice to the position of COMMUNIA or individual COMMUNIA members regarding this discussion in other jurisdictions.
