I half-agree.
I do think that companies should clarify how they’re training their models and on what datasets. For one thing, this will allow outside researchers to gauge the risks of particular models. (For example, is this AI trained on “the whole Internet,” including unfiltered hate-group safe-havens? Does the training procedure adequately compensate for the bias that the model might learn from those sources?)
However, knowing that a model was trained on copyrighted sources will not be enough to prevent the model from reproducing copyrighted material.
There’s no good way to sidestep the issue, either. We have a relatively small amount of data that is (verifiably) public-domain. It’s probably not enough to train a large language model on, and even if it were, the resulting model probably wouldn’t be very useful by 2023 standards.
That’s a fair point.
In my eyes, the difference is the sheer volume of content that these models rip through in training. It would take many, many lifetimes for a person to read as much as an LLM “reads,” and it’s difficult to tell how much an LLM is actually synthesizing versus copying.
Now, does it really matter?
I think the answer comes down to how much power is actually put into the hands of artists rather than the mega-corps. As it stands, the leaders of the AI race are OpenAI/Microsoft, Google, and Meta. If an open LLM comes out (a la Stable Diffusion), then artists do stand to benefit here.