Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Penguin Random House underscores copyright protection in AI rebuff (thebookseller.com)

34 points by marban 3 days ago | 52 comments

openrisk 3 days ago [-]

"Digital societies with a poor rule of law and institutions that exploit the population do not generate growth or change for the better"

Excerpt from the announcement of the 2044 Nobel prize in economics.

It echoes the earlier arguments of the 2024 prize to Acemoglu, Johnson, Robinson about the economic fate of colonies and an even earlier thesis by Hernando de Soto about the role of property rights in wealth formation.

Exploiting legal loopholes to transfer wealth with immunity has been the central profit mechanism of many recent tech business models, whether that was the gig economy, adtech or now "AI".

The tragedy is that stock markets as currently organized (besides being famously amoral) have no sense of long term value. This creates a positive feedback loop where tech business models that gnaw at the pre-digital institutional wealth foundations are empowered to do so even more.

Yet those economists are not wrong. In historical timescales the digital transition will see massive redistribution of wealth to nations that can actually create strong digital institutions and legal protections.

This is, incidentally, why the EU and its enthusiastic rule making might be onto something, even if its currently lethargic tech industry does not help validate the alternative vision.

gruez 3 days ago [-]

IANAL, but this seems kind of pointless? Either AI training is copyright infringement or it's not. A tiny disclaimer isn't going to affect that. The AI companies contend that it's fair use, which would probably override any fine print on the inside cover.

dwattttt 3 days ago [-]

> AI companies contend that it's fair use

It's a good thing they don't train on copyrighted material from jurisdictions that have something other than "fair use", such as Australia's "fair dealing". Otherwise, they'd have to argue that their use of such copyrighted material doesn't "usurp either the market of the original work or a derivative market".

https://addisons.com/knowledge/insights/fair-use-or-fair-dea...

AStonesThrow 3 days ago [-]

> The AI companies contend that it's fair use

Do they? "Fair Use" is an affirmative defense, so the only time we're going to get into that is in a court case, where it'll be tested through legal means.

I would say it's even more nuanced: if LLM training involves merely reading a dataset, but it is not strictly necessary to copy, or even store it verbatim to be useful, then does it even fall under copyright protection at all? A lot of computer-based data processing is already immune to copyright issues; you can place a webpage into a server-based cache or CDN, you can stream it across a network, you can cache it in RAM or local storage, you can make backups of things, and all these processing uses don't fall afoul of copyright.

So I would say that we're going to watch the LLM trainers say that the models aren't storing copies at all, and that seems an even stronger defense than "Fair Use". It is a strange copyright protection indeed that explicitly or implicitly prohibits certain types of machine readings, while allowing many others.

profmonocle 3 days ago [-]

> if LLM training involves merely reading a dataset, but it is not strictly necessary to copy, or even store it verbatim to be useful, then does it even fall under copyright protection at all?

For instance, imagine I read a novel, then I decide to write my own, unauthorized sequel to it. It's not a literal "copy" of the original material - it's my own original text, but obviously a derivative work of the original material. Under copyright law, that would be infringement - I would be sued if I tried to sell that. (Yes, that means fanfiction is infringing, but most rights holders have wisely decided to look the other way on that, as long as it's non-commercial.)

This is what people who claim AI is infringing are worried about. Not that the AI has a literal copy of the source material in its training data, but that the training data can be used to produce a derivative work.

I could write a (crappy) fanfic of the Lord of the Rings without directly referencing the books/movies. And that doesn't mean I have a complete copy of the books/movies in my head - that isn't how memory works. Until now, creating a derivative work without directly using the source material was something only humans could do. This is completely uncharted legal territory.

musicale 2 days ago [-]

LLM-generated book clones (as seen on Amazon and elsewhere) could potentially fall afoul of copyright law in many ways, including: rights to reproduction/substantial similarity; derivative works; adaptation (including translation); distribution; performance and public display (including broadcast or transmission); etc.

AStonesThrow 2 days ago [-]

LLMs don't necessarily need to reproduce their source material to make use of it. They could summarize, analyze, condense, paraphrase, extract statistics or factoids. There's also the question of how the models actually store the source material or not. It's physically impossible for the verbatim text to live in the model weights, and so at the very least, it's compressed or abstracted. So any copyright claims will need to get beyond a simplistic allegation of copying, for sure.

lelanthran 1 days ago [-]

> It is a strange copyright protection indeed that explicitly or implicitly prohibits certain types of machine readings, while allowing many others.

As far as I know, licenses can discriminate on whatever constraint they want to[1].

A license that is basically "This work is for $FOO only. If you want a $BAR license please contact us." is perfectly legal right now!

There is no restriction in law that $FOO cannot be "human consumption" and $BAR cannot be ""machine consumption".

If the AI companies are not arguing the "fair use" argument, then they are arguing for AI companies having a special exemption for themselves carved out in law whereby a copyright holder is forced to license their works to the AI companies specifically.

IOW, their only recourse is to argue "fair use", because any other argument boils down to "please make a special exemption for us in copyright law, defying hundreds of years of precedent", which is a particularly hard sell.

[1]Excluding discriminating against protected classes.

A4ET8a8uTh0 3 days ago [-]

<< Do they?

It is a fair point. Companies contend what they always contend, which is their position in an argument; they do so forcefully and regardless of the reality on the ground. Companies are basically modeled after opportunists.

xnorswap 3 days ago [-]

It's hard to understand the position of AI companies, when their AIs can be prompted to reproduce copyrighted works verbatim, e.g.

https://imgur.com/a/XBO2B7V

You can argue that a game that old ought not to be still in copyright, and I'd support that position.

But it is in copyright, and I'd rather a world where we curtail copyright terms but enforce them fairly rather than a world in which we stifle culture by automated means of suppressing "DMCA" violations on some platforms while large platform owners invest in AI which trample over the notion of copyright.

ben_w 3 days ago [-]

It's a complicated mess.

Reason being: I, too, am physically capable of reproducing copyrighted works verbatim when correctly prompted. Or at least, can do so as well as current GenAI can in certain specific cases, and know how to use a photocopier in others.

Is the copyright violation the user asking for this, or is it the model having the capability?

If it's the mere capability, the inventors of copy-paste and the camera and voice recorder apps have a problem, along with all search engines that ever index pirated material.

If it's the users, given that copyright databases are much too large for any human to actually know exhaustively what is and isn't in them, there's a real danger that normal people will accidentally do exactly this.

kranke155 3 days ago [-]

Ridiculous. The copyright violation is in its use for training data. And then its doubled down by the user asking for copyrighted material and getting it verbatim.

It's not complicated. It's only complicated because it might be in the way of some people making billions or trillions.

Let's stop with the analogies. Analogies won't get you anywhere.

- if you recorded a concert with a tape recorder, then distributed it online for money, that was a copyright violation.

- if you recorded a CD with a tape recorder, then sold copies, that was a copyright violation.

- If you used a camera to film a movie being shown in a theatre, then distributed it online, that is a copyright violation.

Stop the BS.

jpalawaga 3 days ago [-]

Well, copyright law has literally everything to do with reproducing content and has nothing to do with consumption of content.

So I have to disagree. Copyright doesn’t stipulate the terms of your consumption of media. Sure, people can write licenses and whatnot, but that’s not copyright, that’s a license (for example, the TOS of NYT website may dictate your rights to scrape it)

Copyright law is woefully under-prepared to deal with the challenges of llms. If someone has a photographic memory, it’s not illegal for them to read the book, it’s illegal for them to reproduce it. That’s essentially what we’re seeing with LLMs.

In all of your examples, the illegal part is really the distribution so far as copyright is concerned.

kranke155 3 days ago [-]

So how do you suggest LLMs could prevent reproduction of copyrighted material, if they're doing so now.

jpalawaga 3 days ago [-]

I think that’s a more interesting question. I’m not sure how to do it, but I think that finding a way to stop the reproduction of copyrighted content is probably the missing piece.

If there was a monetary penalty for reproduction of copyright works (apply human laws to machine), then I bet these companies would quickly figure out how to fingerprint output and match it to source data before sending it to the user.

polotics 3 days ago [-]

So I go into a cinema with my recording equipment, record the movie, step outside and start selling downloads, and when the cops come, I say:

"I’m not sure how to do it, but I think that finding a way to stop the reproduction of copyrighted content is probably the missing piece."

Really?

ben_w 3 days ago [-]

That's not a close reading of the other comment.

If LLMs were specially advertised as a way to get all the stuff you already love for free, it would be.

They're not.

What they are advertised as is a way to solve problems and create novel things.

In this regard, LLMs are less of a copyright infringement issue than, e.g. Google News and Google Images both of which had to change because they were in law copyright infringements.

This does not mean that LLMs or diffusion models must get a free pass or anything like that: these are a novel things that didn't previously exist, so while I was initially surprised by the legal cases against Stability and OpenAI due to the existence of Google search and that nobody seemed to care about GPT-3 or the original DALL•E, the arguments made against them are nevertheless interesting and worth caring about.

I suspect LLMs are so useful the powers that be will just carve out a space for them; conversely I don't see that applying to image generation models, so I kinda expect them to be strictly limited to cases where the source data can be proven correctly licensed.

ben_w 3 days ago [-]

> The copyright violation is in its use for training data.

Every textbook, every educational TV show or YouTube video, every artist whose works have been shown to me in my school lessons in art, music, graphic design, usw. have all been copyrighted.

With one exception: Shakespeare.

Even the bible, being a translation*, had copyright notices on it.

But worse than that: if it were so even for scraping the whole web and training an AI on it… what do you think Google is? Page Rank is a big ol' matrix multiplication, and it spits out quotes verbatim.

> It's not complicated. It's only complicated as it might be in the way of some people making billions.

You know some of these LLMs can be downloaded and run locally, right? For free, even.

"Not making money" is absolutely not sufficient to remove a claim of copyright infringement; and conversely these models are too useful to be simply ignored — they're now in the realm of national strategic thinking.

> Let's stop with the analogies. Analogies won't get you anywhere.

They are the only thing we have, as this didn't exist before, and we need to create a new legal framework for them.

Even copyright itself as a legal idea comes from analogies starting at least as far back as a 6th century Irish dispute where King Diarmait Mac Cerbhaill gave the judgement "To every cow belongs her calf, therefore to every book belongs its copy."**

> Also, if you recorded a concert with a tape recorder, then distributed it online for money, pretty sure that was a copyright violation.

Yes, that's the point.

And yet I am not forbidden from being online due to the fact that I have the capability to do so.

Only the actual performance of this is an offence, not the mere capability.

* All translation is necessarily also interpretation, specifically this one: https://en.wikipedia.org/wiki/Good_News_Bible

** Or at least that's the modern English translation of what he said, presumably either in in pre-medieval Latin or Old Irish.

kranke155 3 days ago [-]

I agree that copyright law is woefully unprepared, and that LLMs have become useful enough it will be difficult if not impossible to stop their development.

> Page Rank is a big ol' matrix multiplication, and it spits out quotes verbatim.

That makes no sense. Google Books was blocked exactly because it was reproducing enough material to be infringing. Google search usually does not copy enough material that it might be considered infringing (even though I think news sites have sued over this in some jurisdictions, right?)

> And yet I am not forbidden from being online due to the fact that I have the capability to do so.

It's fairly obvious that the people running LLMs don't want to prevent them from infringing copyright, particularly in image gen.

ben_w 3 days ago [-]

> That makes no sense. Google Books was blocked exactly because it was reproducing enough material to be infringing. Google search usually does not copy enough material that it might be considered infringing (even though I think news sites have sued over this in some jurisdictions, right?)

Google Books? That's a non-sequitur. When I said "Page Rank" I mean the algorithm used by Google Search to rank web pages in their search engine results.

The search results give you direct quotations from the pages they linked to, and for a long time also had links to cached copies of those pages.

> It's fairly obvious that the people running LLMs don't want to prevent them from infringing copyright, particularly in image gen.

Apart from all the times they pop up messages refusing to reproduce copyrighted material? The only reason I was even willing to create the following query is because I knew it would refuse me on that grounds: https://chatgpt.com/share/67138f5b-a110-8011-adf8-a82f3fc473...

Image gen… well, for at least the third-party fine-tuners, I agree with your impression. The main players? Unclear to me, especially as we see some models from big players which are created from scratch using only correctly licensed materials — I believe Adobe's models would be an example of correctly licensed training content.

polotics 3 days ago [-]

I disagree: I think copyright law is wholly prepared.

If some thing produces a copy of copyrighted material, without explicit approval of the copyright owner, then it is infringing on the copyright.

Where is the unprepared?

pnut 3 days ago [-]

Looking at copyright law through the lens of "land grabbing output token sequences" diminishes its perception of long term viability to me.

Generative computation is becoming reproducible at scale.

kranke155 3 days ago [-]

It's unprepared for the fact that you can now build thinking machines using these new techniques, and that governments will be reluctant to regulate them if they think it will give them issues in an AI race against strategic opponents.

Of course it's doubtful that image and video gen AI are strategically important, but I trust the AI lobby to make sure that governments won't make that distinction.

ben_w 3 days ago [-]

> Of course it's doubtful that image and video gen AI are strategically important, but I trust the AI lobby to make sure that governments won't make that distinction.

I suspect otherwise. The use cases for image generators are much less obvious than for a personal assistant that's just as fluent in Chinese, international relations, hacking, military strategy, and speech writing… even when the current fluency of LLMs relative to government workers is somewhere between "mediocre" to "it'll do".

I think big copyright owners, however, will probably make their own models. Disney etc. could very easily use their own content to (help) make as many more films as they want. Real human artists will be disempowered regardless of how the copyright stuff happens. Probably.

kranke155 3 days ago [-]

I hope so, but big AI is banking on being able to use those models for everything.* Our "luck" is that Disney and other giant corporate copyright holders will indeed sue them to oblivion, or at least try.

* Afaik image gen AI is one the few profitable use cases, with Midjourney apparently being profitable already.

ben_w 3 days ago [-]

Re image gen cost, by my estimate inference was $0.0001/image 2 years ago: https://benwheatley.github.io/blog/2022/10/09-19.33.04.html

I suspect that there are significantly better options than the thing which just happened to be accessible to me at the time, so I would say that quality, not cost, is the thing currently preventing feature length movies from being generated for less than the cost a movie theatre ticket.

polotics 3 days ago [-]

You wrote "thinking machines"? seriously? You do know that attention is all you need, right? (https://en.wikipedia.org/wiki/Attention_Is_All_You_Need) Having a Generative Pretrained Transformer regurgitate pattern-matched input into mostly logical-appearing output does not make a thinking machine! (https://en.wikipedia.org/wiki/Generative_pre-trained_transfo...)

case in point, look at this, and think:

""" It's true that relying solely on attention mechanisms, while groundbreaking, doesn't equate to genuine thinking. These models, impressive as they are, essentially excel at sophisticated mimicry. They learn to identify patterns in vast amounts of data and reproduce them in a way that often seems intelligent. However, they lack true understanding, consciousness, or the ability to form original thoughts and intentions.

The danger lies in anthropomorphizing these systems. We marvel at their ability to generate human-like text, translate languages, and even write different kinds of creative content, and we may be tempted to ascribe human-like qualities to them. But it's crucial to remember that beneath the surface lies a complex algorithm, not a conscious mind.

While the "Attention is All You Need" paper revolutionized the field of natural language processing, it's important to maintain a critical perspective and acknowledge the limitations of current AI technology. The quest for true artificial intelligence, a machine capable of independent thought and understanding, remains an ongoing challenge. """

Maybe you were able to spot a pattern, no?

kranke155 3 days ago [-]

I can spot a pattern that people who think AI should be able to violate copyright often anthropomorphize it as a silly defense.

If you mean what I wrote, it's irrelevant what you call them. If governments think there is an AI race that they will lose if they enforce copyright, they won't enforce copyright.

polotics 2 days ago [-]

ok, good point!

CaptainFever 3 days ago [-]

If it's fair use in the US, it doesn't matter how many "X is prohibited because of copyright" clauses they add, AI training is allowed.

For EU TDM, this opt out probably works to disallow commercial entities from using it for training, but not researchers.

For SG TDM, this opt out can be ignored. All legally-acquired material can be used for AI training purposes, period.

Disclaimer: IANAL.

NeoTar 3 days ago [-]

SG = Singapore, TDM = Text Data Mining?

CaptainFever 40 minutes ago [-]

You're correct. Apologies for the vagueness.

photonthug 3 days ago [-]

> There is no standard ‘All rights reserved’ wording and even the most basic notice covers all uses. Having said that, we’re pleased to see publishers starting to add to the ‘All rights reserved’ notice to explicitly exclude the use of a work for the purpose of training [generative AI], as it provides greater clarity and helps to explain to readers what cannot be done without rights-holder consent.”

So their position is that all rights reserved always meant that unauthorized use of any kind was already forbidden, but that was ignored / unenforceable and so they are adding “looking at you” language to carve out stuff that’s disallowed more specifically. Feels like bargaining, because this just gives the opposition the chance to argue that it wasn’t illegal before.

dwattttt 3 days ago [-]

> unauthorized use of any kind was already forbidden

Is a funny sentence to have to say. Is unauthorized use also unauthorized?

turbonaut 3 days ago [-]

Authorization and forbidding are both explicit actions.

Unauthorized refers to the absence of authorisation.

‘Unauthorized use is forbidden’ means ‘all use must have explicit permission.’

Clearly it is a bit disingenuous as it usually means ‘all use I don’t like / is not inline with norms’.

rahimnathwani 3 days ago [-]

Which country has the most favorable 'fair use' laws, and why wouldn't big companies train their models there?

CaptainFever 36 minutes ago [-]

In the context of AI, it's possible that OpenAI is already doing this. From what I know, Singapore has the most pro-AI law currently, where any legally-acquired material can be used as training data, and you can legally ignore all opt-outs (e.g. in the ToS, noAI, etc). OpenAI has recently opened an office there. I'm not sure if it's related, but it's possible.

Maybe this is why LAION was in the EU, too? They have a similar law too, though from what I know they require commercial entities to respect opt-outs.

(Related to sibling comments; Japan seems to have one too.)

https://store.lawnet.com/blog/post/understanding-the-text-an...

NitpickLawyer 3 days ago [-]

IIRC Japan has had at least one court ruling that training on copyrighted data is fair use (or a version thereof).

anticensor 3 days ago [-]

Japan does not have US-style fair use, their copyright exemptions are closer to UK/EU-style fair dealing (with stricter enforcement and a shorter copyright term on work-for-hire, though).

profmonocle 3 days ago [-]

Would that matter if the company wants to do business in countries with more restrictive laws?

I.E. if I wrote my own spin-off of a popular book series, which was somehow considered fair use in country A, but considered infringing in country B, the publisher could get it removed from stores in country B.

By the same logic, if AI training is ruled as copyright infringement in the US, it won't matter if the company trains their model somewhere else - if they open a US division to sell service using that model, they'd get sued.

Granted I'm not an IP lawyer and AI IP law is in its infancy - maybe I'm missing something?

rahimnathwani 3 days ago [-]

IANAL but the article has a quotation from a lawyer that says that the infringing act is the training.

kranke155 3 days ago [-]

Always happy to see the copyright experts in HN jumping out of the woodwork whenever a thread like this shows up.

Whether AI can use copyrighted material as training data is legally undetermined. There's only an argument that it could be. It will be decided in court.

CaptainFever 34 minutes ago [-]

> Whether AI can use copyrighted material as training data is legally undetermined. There's only an argument that it could be. It will be decided in court.

Note that this is U.S. specific.

https://store.lawnet.com/blog/post/understanding-the-text-an...

artninja1988 3 days ago [-]

That's a funny post, since your other reply in this thread is:

>Ridiculous. The copyright violation is in its use for training data. And then its doubled down by the user asking for copyrighted material and getting it verbatim.

>It's not complicated. It's only complicated because it might be in the way of some people making billions or trillions.

kranke155 3 days ago [-]

I was debating someone, not asserting that I know the answer.

They were speculating whether there could be a copyright violations vs for instance using a tape recorder or a video camera.

I made an example that yes if there is a copyright violation here it is, and that’s it’s not that different.

I could have pillowed my language differently sure but what’s the point. It’s clear in context.

portaouflop 3 days ago [-]

Penguin Random House is a predatory corrupt business that the world would be much better off without. If AI means businesses like this shut down I need more „AI“

2muchcoffeeman 3 days ago [-]

Two wrongs don’t make a right.

portaouflop 3 days ago [-]

Anything that Penguin Random House perceives as a threat is good and right for me.

kranke155 3 days ago [-]

That's an simplification of the world that might help you think, but it's actually harmful in understanding what's really going on.

Everything is nuanced. So if you refuse to understand nuance you don’t understand the world.

portaouflop 3 days ago [-]

Of course everything is nuanced - but the goal here is not to understand Penguin Random House.

It is to make sure that organisations like it that stifle and monopolise culture don’t keep getting more power.

If AI is the way to do that so be it.

paganel 3 days ago [-]

Honest question, what's "predatory corrupt" about their business? I have some of their paperbacks on my bookshelves and they've been a pretty decent value for money thing.

dartharva 3 days ago [-]

All major publishers are known to be fundamentally rent-seeking entities. Their vast history of litigative bullying and anti-consumer behavior has created a self-serving perception of them as only capable of being high-profile scoundrels. Which means even if there somehow exist some good ones in their current iteration, they will always be evil for many people just by being associated with that domain.

portaouflop 3 days ago [-]

Look into Penguin Random house antitrust trial - they’re monopolising culture - which is corrupt and their business practices are predatory

dartharva 3 days ago [-]

Scorched earth is never the way.

portaouflop 2 days ago [-]

Why not? Plenty of examples in history where it was the way - and if it was never a viable strategy it wouldn’t have a name.

I’d rather see no one profiting from producing culture than orgs like Random House.

Rendered at 13:25:24 GMT+0000 (Coordinated Universal Time) with Wasmer Edge.