The post An AI Christmas Miracle appeared first on COMMUNIA Association.
One of the key priorities for COMMUNIA was the issue of transparency of training data. In April, we issued a policy paper calling on the EU to enact a reasonable and proportional transparency requirement for developers of generative AI models. We have followed the work up with several blogposts and a podcast, outlining ways to make the requirement work in practice, without placing a disproportionate burden on ML developers.
From our perspective, the introduction of some form of transparency requirement was essential to uphold the legal framework that the EU has for ML training, while ensuring that creators can make an informed choice about whether to reserve their rights or not. Going by leaked versions of the final agreement, it appears that the co-legislators have come to similar conclusions. The deal introduces two specific obligations on providers of general-purpose AI models, which serve that objective: an obligation to implement a copyright compliance policy and an obligation to release a summary of the AI training content.
In a leaked version, the obligation to adopt and enforce a copyright compliance policy reads as follows:
[Providers of general-purpose AI models shall] put in place a policy to respect Union copyright law in particular to identify and respect, including through state of the art technologies where applicable, the reservations of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790
Back in November, we suggested that instead of focussing on getting a summary of the copyrighted content used to train the AI model, the EU lawmaker should focus on the copyright compliance policies followed during the scraping and training stages, mandating developers of generative AI systems to release a list of the rights reservation protocols complied with during the data gathering process. We were therefore pleased to see the introduction of such an obligation, with a specific focus on the opt-outs from the general purpose text and data mining exception.
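To make the idea of a machine-readable rights reservation more concrete, here is a minimal sketch of how a crawler might check one before collecting a page. It assumes, purely for illustration, that the reservation is expressed through robots.txt; the leaked text does not name any specific protocol, and whether robots.txt amounts to a valid reservation under Article 4(3) is not settled by the wording quoted above. The crawler name `ExampleTrainingBot` and the helper `may_collect` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt content lets user_agent fetch url.

    Assumption: robots.txt stands in for a rights reservation protocol.
    """
    parser = RobotFileParser()
    # Parse the file content directly, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt of a site that opts out of one training crawler
# while leaving the site open to every other user agent.
EXAMPLE_ROBOTS = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.org/gallery/artwork-1"
    print(may_collect(EXAMPLE_ROBOTS, "ExampleTrainingBot", page))
    print(may_collect(EXAMPLE_ROBOTS, "SomeOtherBot", page))
```

A developer publishing a list of the protocols it complies with would, in effect, be telling creators which checks of this kind its crawlers actually run.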
Interestingly, the leaked version contains a recital in which the co-legislators declare their intent to apply this obligation to “any provider placing a general-purpose AI model on the EU market (…) regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of these foundation models take place”. While one can understand why the EU lawmakers would want to ensure that all AI models released in the EU market respect these EU product requirements, the fact that these are also copyright compliance obligations, which apply prior to the release of the model in the EU market, raises some legal concerns. It is not clear how the EU lawmakers intend to apply EU copyright law when the scraping and training take place outside the EU borders without an appropriate international legal instrument.
The text goes on to require that developers of general-purpose AI models make publicly available a sufficiently detailed summary about the AI training content:
[Providers of general-purpose AI models shall] draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office
While we have previously criticized the formulation “sufficiently detailed summary” due to the legal uncertainty it could cause, having an independent and accountable entity draw up a template for the summary (as we have advocated) could alleviate some of the vagueness and potential confusion.
We were also pleased to see that the co-legislators listened to our calls to extend this obligation to all training data. As we have said before, on the one hand introducing a specific requirement only for copyrighted data would add unnecessary legal complexity, since ML developers would first need to know which of their training materials are copyrightable, and on the other hand knowing more about the data that is feeding models that can generate content is essential for a variety of purposes, not all related to copyright.
We should also highlight that the co-legislators appear to have a similar understanding to ours in terms of how compliance with the transparency requirement could be achieved when the AI developers use publicly available datasets. In the leaked version there is a clarifying recital stating that “(t)his summary should be comprehensive in its scope instead of technically detailed, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used.” When the training dataset is not publicly accessible, we maintain that there should be a way to ensure conditional access to the dataset, namely through a data trust, to confirm legal compliance.
Taking these amendments into account, the compromise found by the co-legislators manages to strike a good balance between what is technically feasible and what is legally necessary.
Merry Christmas!
The post The transparency provision in the AI Act: What needs to happen after the 4th trilogue? appeared first on COMMUNIA Association.
As discussed in our Policy Paper #15, transparency is key to ensuring a fair balance between the interests of creators on the one hand and those of commercial AI developers on the other. A transparency obligation would empower creators, allowing them to assess whether the copyrighted materials used as AI training data have been scraped from lawful sources, as well as whether their decision to opt out of AI training has been respected. At the same time, such an obligation needs to be fit-for-purpose, proportionate and workable for different kinds of AI developers, including smaller players.
While the European Parliament’s text has taken an important step towards improving transparency, it has been criticised for falling short in two key aspects. First, the proposed text focuses exclusively on training data protected under copyright law, which arbitrarily limits the scope of the obligation in a way that may not be technically feasible. Second, the Parliament’s text remains very vague, calling only for a “sufficiently detailed summary” of the training data, which could lead to legal uncertainty for all actors involved, given how opaque the copyright ecosystem itself is.
As such, we are encouraged to see the recent work of the Spanish presidency on the topic of transparency, improving upon the Parliament’s proposed text. The presidency recognises that there is a need for targeted provisions that facilitate the enforcement of copyright rules in the context of foundation models and proposes that providers of foundation models should demonstrate that they have taken adequate measures to ensure compliance with the opt-out mechanism under the Copyright Directive. The Spanish presidency has also proposed that providers of foundation models should make information about their policies to manage copyright-related aspects public.
This proposal marks an important step in the right direction by expanding the scope of transparency beyond copyrighted material. Furthermore, requiring providers to share information about their policies to manage copyright-related aspects could provide important clarity as to the methods of opt-out that are being respected, empowering creators to be certain that their choices to protect works from TDM are being respected.
Unfortunately, while the Spanish presidency has addressed one of our key concerns by removing the limitation to copyrighted material, ambiguity remains. Calling for a sufficiently detailed summary about the content of training data leaves a lot of room for interpretation and may lead to significant legal uncertainty going forward. Having said that, strict and rigid transparency requirements which force developers to list every individual entry inside of a training dataset would not be a workable solution either, due to the unfathomable quantity of data used for training. Furthermore, such a level of detail would provide no additional benefits when it comes to assessing compliance with the opt-out mechanism and the lawful access requirement. So what options do we have left?
First and foremost, the reference to “sufficiently detailed summary” must be replaced with a more concrete requirement. Instead of focussing on the content of training data sets, this obligation should focus on the copyright compliance policies followed during the scraping and training stages. Developers of generative AI systems should be required to provide a detailed explanation of their compliance policy including a list of websites and other sources from which the training data has been reproduced and extracted, and a list of the machine-readable rights reservation protocols/techniques that they have complied with during the data gathering process. In addition, the AI Act should allocate the responsibility to further develop transparency requirements to the to-be-established Artificial Intelligence Board (Council) or Artificial Intelligence Office (Parliament). This new agency, which will be set up as part of the AI Act, must serve as an independent and accountable actor, ensuring consistent implementation of the legislation and providing guidance for its application. On the subject of transparency requirements, an independent AI Board/Office would be able to lay down best-practices for AI developers and define the granularity of information that needs to be provided to meet the transparency requirements set out in the Act.
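As an illustration of what such a compliance policy disclosure might look like in machine-readable form, here is a minimal sketch. The class name, field names and example values are all hypothetical: neither the AI Act drafts nor our proposal prescribe a format, and the granularity would be for the AI Board/Office to define.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CompliancePolicyDisclosure:
    """Hypothetical record of the two lists proposed above."""

    model_name: str
    # Websites and other sources the training data was reproduced from.
    sources: list[str] = field(default_factory=list)
    # Machine-readable rights reservation protocols that were complied with.
    protocols_respected: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the disclosure so that it can be published."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    disclosure = CompliancePolicyDisclosure(
        model_name="example-model",  # placeholder, not a real model
        sources=["https://example.org", "https://example.com/archive"],
        protocols_respected=["robots.txt"],  # illustrative assumption only
    )
    print(disclosure.to_json())
```

A disclosure at this level lists sources and protocols rather than every individual training item, which keeps it workable for developers while still letting creators verify whether their chosen opt-out method is among those respected.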
We understand that the deadline to find an agreement on the AI Act ahead of the next parliamentary term is very tight. However, this should not be an excuse for the co-legislators to rush the process by taking shortcuts through ambiguous language purely to find swift compromises, creating significant legal uncertainty in the long run. In order to achieve its goal to protect Europeans from harmful and dangerous applications of AI while still allowing for development and encouraging innovation in the sector, and to potentially serve as model legislation for the rest of the world, the AI Act must be robust and legally sound. Everything else would be a wasted opportunity.
The post We need to talk about AI and transparency! appeared first on COMMUNIA Association.
Over the course of 20 minutes, Teresa walks the listeners through some of the key questions when it comes to the training of generative AI models and how it affects the text and data mining (TDM) exception laid down in Articles 3 and 4 of the EU Copyright Directive (see also our Policy Paper #15).
While the Directive clearly establishes the right to mine online content and use it to train machine learning algorithms, this right hinges on the possibility for rightsholders to opt out their works if the activity takes place in a commercial context. A key issue we are currently seeing is that there is a lack of transparency around the training of generative AI models, which makes it impossible to tell whether such opt-outs are being respected or not.
In the discussion, Teresa highlighted the need for more transparency regarding opt-outs but also across the copyright ecosystem as a whole. COMMUNIA has long advocated for the creation of a database of copyrighted works, which would contribute to managing the system of opt-outs.
The discussion ended with a call upon the European Commission to provide guidance and lead technical discussions towards establishing a clear, reliable and transparent framework for opt-outs for TDM (see also Open Future’s recent policy brief on this issue). Only through dialogue between the concerned stakeholders, led by an independent third party, will we be able to establish best practices that uphold Articles 3 and 4 of the Copyright Directive while providing a fair and balanced framework for the training of machine learning models.