this post was submitted on 13 Sep 2023
37 points (100.0% liked)

Technology


A nice place to discuss rumors, happenings, innovations, and challenges in the technology sphere. We also welcome discussions on the intersections of technology and society. If it’s technological news or discussion of technology, it probably belongs here.

Remember the overriding ethos on Beehaw: Be(e) Nice. Each user you encounter here is a person, and should be treated with kindness (even if they’re wrong, or use a Linux distro you don’t like). Personal attacks will not be tolerated.


Avram Piltch is the editor in chief of Tom's Hardware, and he's written a thoroughly researched article breaking down the promises and failures of LLM AIs.

top 32 comments
[–] lily33@lemm.ee 20 points 1 year ago* (last edited 1 year ago) (3 children)

They have the right to ingest data not because they're "just learning like a human would", but because I - a human - have a right to grab all data that's available on the public internet and process it however I want, including by training statistical models. The only thing I don't have a right to do is distribute it (or works that resemble it too closely).

If you actually show me people who are extracting books from LLMs and reading them that way, then I'd agree that would be piracy - but that would be such a terrible experience, even if it worked, that I can't see it actually happening.

[–] RickRussell_CA 18 points 1 year ago* (last edited 1 year ago) (7 children)

Two things:

  1. Many of these LLMs -- perhaps all of them -- have been trained on datasets that include books that were absolutely NOT released into the public domain.

  2. Ethically, we would ask any author who parrots the work of others to provide citations to original references. That rarely happens with AI language models, and if they do provide citations, they often do it wrong.

[–] lily33@lemm.ee 12 points 1 year ago (2 children)

I'm sick and tired of this "parrots the works of others" narrative. Here's a challenge for you: go to https://huggingface.co/chat/, input some prompt (for example, "Write a three-paragraph scene about Jason and Carol playing hide and seek with some other kids. Jason gets injured, and Carol has to help him."). And when you get the response, try to find the author that it "parroted". You won't be able to - because it wouldn't just reproduce someone else's already-written scene. It'll mesh maaany things from all over the training data in such a way that none of them will be even remotely recognizable.

[–] RickRussell_CA 8 points 1 year ago (3 children)

And yet, we know that the work is mechanically derivative.

[–] keegomatic@kbin.social 10 points 1 year ago* (last edited 1 year ago)

So is your comment. And mine. What do you think our brains do? Magic?

edit: This may sound inflammatory but I mean no offense

[–] conciselyverbose@kbin.social 4 points 1 year ago

So is literally every human work in the last 1000 years in every context.

Nothing is "original". It's all derivative. Feeding copyrighted work into an algorithm does not in any way violate any copyright law, and anyone telling you otherwise is a liar and a piece of shit. There is no valid interpretation anywhere close.

[–] lily33@lemm.ee 2 points 1 year ago* (last edited 1 year ago)

From Wikipedia, "a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work".

You can probably call the output of an LLM 'derived', in the same way that if I counted the number of 'Q's in Harry Potter, the result would be derived from Rowling's work.

But it's not 'derivative'.

Technically it's possible for an LLM to output a derivative work if you prompt it to do so. But most of its outputs aren't.

[–] state_electrician@discuss.tchncs.de 1 points 1 year ago (1 children)

Well, I think that these models learn in a way similar to humans, in that it's basically impossible to tell where parts of the model came from. As such, the copyright claims are ridiculous. We need less copyright, not more. But, on the other hand, LLMs are not humans; they are tools created by and owned by corporations, and I hate to see them profiting off of other people's work without proper compensation.

I am fine with public domain models being trained on anything and being used for noncommercial purposes without being taken down by copyright claims.

[–] RickRussell_CA 1 points 1 year ago

it’s basically impossible to tell where parts of the model came from

AIs are deterministic.

  1. Train the AI on data without the copyrighted work.

  2. Train the same AI on data with the copyrighted work.

  3. Ask the two instances the same question.

  4. The difference is the contribution of the copyrighted work.

There may be larger questions of precisely how an AI produces one answer when trained with a copyrighted work, and another answer when not trained with the copyrighted work. But we know why the answers are different, and we can show precisely what contribution the copyrighted work makes to the response to any prompt, just by running the AI twice.
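To make that concrete, here's a toy sketch of the idea. The corpora are made up and a trivial bigram model stands in for a real LLM, so take it as an illustration of the determinism argument rather than of any actual training pipeline:

```python
import random
from collections import Counter, defaultdict

def bigram_model(text):
    """Count how often each word follows each other word."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def generate(model, start, length, seed):
    """Deterministically sample a continuation from the bigram counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        options = model.get(out[-1])
        if not options:
            break
        words, weights = zip(*options.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

# Made-up stand-ins for "the public training data" and "the copyrighted work".
public_corpus = "the cat sat on the mat . the dog sat on the rug ."
copyrighted_work = "the wizard sat on the enchanted mat and read the forbidden book ."

model_without = bigram_model(public_corpus)
model_with = bigram_model(public_corpus + " " + copyrighted_work)

# Same prompt, same seed: any difference in the output is attributable
# to the one thing that changed, namely the added work.
print(generate(model_without, "the", 12, seed=42))
print(generate(model_with, "the", 12, seed=42))
```

Same prompt, same seed: whatever differs between the two continuations is, by construction, the contribution of the added text.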

[–] donuts@kbin.social 8 points 1 year ago* (last edited 1 year ago) (1 children)

You're making two, big incorrect assumptions:

  1. Simply seeing something on the internet does not give you any legal or moral rights to use that thing in any way other than things which are, or have previously been, deemed to be "fair use" by a court of law. Individuals have personal rights over their likeness and persona, and copyright holders have rights over their works, whether they are on the internet or not. In other words, there is a big difference between "visible in public" and "public domain".
  2. More importantly, something that might be considered "fair use" for a human being to do is not necessarily "fair use" when a computer or "AI" does it. Judgements of what is and is not fair use are made on a case-by-case basis as a legal defense against copyright infringement claims, and multiple factors (purpose of use, nature of the original work, amount and substantiality of use, market effect, etc.) are often taken into consideration. At the very least, AI use has serious implications for substantiality and market effect, especially compared to examples of human use.

I know these are really tough pills for AI fans to swallow, but you know what they say... "If it seems too good to be true, it probably is."

[–] lily33@lemm.ee 4 points 1 year ago* (last edited 1 year ago)

On the contrary - the reason copyright is called that is because it started as the right to make copies. Since then it's been expanded to cover more than just copies, such as distributing derivative works.

But the act of distribution is key. If I wanted to, I could write whatever derivative works I like in my personal diary.

I also have the right to count the number of occurrences of the letter 'Q' in Harry Potter without Rowling's permission. I can also post my count online for other lovers of 'Q', because it's not derivative (it is 'derived', but 'derivative' is different - according to Wikipedia it means 'includes major copyrightable elements').

Or do more complex statistical analysis.
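In code, that kind of "derived but not derivative" analysis is about as mundane as it sounds (the filename is hypothetical; you'd need your own copy of the book):

```python
from collections import Counter

# Hypothetical filename: point this at your own copy of the book.
with open("harry_potter.txt", encoding="utf-8") as f:
    text = f.read()

# The single derived fact: how many times does 'Q' appear?
print("Number of Q's:", text.upper().count("Q"))

# "More complex statistical analysis": a letter-frequency table.
# Still just numbers derived from the text, not a copy of any part of it.
letters = Counter(ch for ch in text.upper() if ch.isalpha())
for letter, n in letters.most_common(5):
    print(letter, n)
```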

[–] FlapKap@feddit.dk 17 points 1 year ago (4 children)

I like the point about LLMs interpolating data while humans extrapolate. I think that sums up a key difference in "learning". It's also an interesting point that we anthropomorphise ML models by using words such as learning or training, but I wonder if there are other, better words to use. Fitting?
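"Fitting" also makes the interpolation/extrapolation distinction easy to see with an ordinary curve fit (toy numbers below; no claim that this is how an actual LLM behaves):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training data": noisy samples of a smooth function on [0, 5].
x_train = np.linspace(0, 5, 40)
y_train = np.sin(x_train) + rng.normal(0, 0.05, x_train.shape)

# "Training" here is literally fitting a curve to the samples.
model = np.poly1d(np.polyfit(x_train, y_train, deg=7))

# Interpolation: asking about a point inside the range the fit has seen.
print("inside  x=2.5:", model(2.5), "true:", np.sin(2.5))

# Extrapolation: a point well outside that range. Polynomial fits tend
# to go badly wrong out here, which is the intuition behind the distinction.
print("outside x=8.0:", model(8.0), "true:", np.sin(8.0))
```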

[–] RickRussell_CA 9 points 1 year ago

"Plagiarizing" 😜

[–] brie 6 points 1 year ago

What about tuning, to align with "finetuning?"

[–] amju_wolf@pawb.social 5 points 1 year ago (1 children)

Aren't interpolation and extrapolation effectively the same thing, given a complex enough system?

[–] CanadaPlus@lemmy.sdf.org 1 points 1 year ago

Depending on the geometry of the state space, very literally yes. Think about a sphere: there's a straight line passing from Denver to Guadalajara, roughly hitting Delhi on the way. Is Delhi in between them (interpolation), or behind one from the other (extrapolation)? Kind of both, unless you move the goalposts to add distance limits on interpolation, which could themselves be broken by another geometry.
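If you want to sanity-check the geography, the standard great-circle formulas will do it. A rough sketch (approximate coordinates, Earth treated as a perfect sphere):

```python
from math import radians, sin, cos, asin, acos, atan2, sqrt

R = 6371.0  # mean Earth radius in km

def ang_dist(p1, p2):
    """Central angle between two (lat, lon) points, via the haversine formula."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(a))

def bearing(p1, p2):
    """Initial great-circle bearing from p1 towards p2."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    dlon = lon2 - lon1
    return atan2(sin(dlon) * cos(lat2),
                 cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dlon))

denver = (39.74, -104.99)
guadalajara = (20.67, -103.35)
delhi = (28.61, 77.21)

# Cross-track distance: how far Delhi sits off the Denver-Guadalajara great circle.
d13 = ang_dist(denver, delhi)
xt = asin(sin(d13) * sin(bearing(denver, delhi) - bearing(denver, guadalajara)))
# Along-track distance: how far along that circle Delhi's closest point lies from Denver.
at = acos(cos(d13) / cos(xt))

print("Denver-Guadalajara:      %.0f km" % (ang_dist(denver, guadalajara) * R))
print("Delhi off the circle by: %.0f km" % (abs(xt) * R))
print("Delhi along the circle:  %.0f km from Denver" % (at * R))
```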

[–] CanadaPlus@lemmy.sdf.org 6 points 1 year ago

There's a lot of opinion in here written as if it's fact.

[–] nyan@lemmy.cafe 4 points 1 year ago (1 children)

Let's be clear on where the responsibility belongs, here. LLMs are neither alive nor sapient. They themselves have no more "rights" than a toaster. The question is whether the humans training the AIs have the right to feed them such-and-such data.

The real problem is the way these systems are being anthropomorphized. Keep your attention firmly on the man behind the curtain.

[–] CanadaPlus@lemmy.sdf.org 2 points 1 year ago* (last edited 1 year ago) (1 children)

You know, I think ChatGPT is way ahead of a toaster. Maybe it's more like a small animal of some kind.

[–] nyan@lemmy.cafe 1 points 1 year ago

One could equally claim that the toaster was ahead, because it does something useful in the physical world. Hmm. Is a robot dog more alive than a Tamagotchi?

[–] autotldr@lemmings.world 3 points 1 year ago

🤖 I'm a bot that provides automatic summaries for articles:

Unfortunately, many people believe that AI bots should be allowed to grab, ingest and repurpose any data that’s available on the public Internet whether they own it or not, because they are “just learning like a human would.” Once a person reads an article, they can use the ideas they just absorbed in their speech or even their drawings for free.

Iris van Rooij, a professor of computational cognitive science at Radboud University Nijmegen in the Netherlands, posits that it’s impossible to build a machine to reproduce human-style thinking by using even larger and more complex LLMs than we have today.

NY Times Tech Columnist Farhad Manjoo made this point in a recent op-ed, positing that writers should not be compensated when their work is used for machine learning because the bots are merely drawing “inspiration” from the words like a person does.

“When a machine is trained to understand language and culture by poring over a lot of stuff online, it is acting, philosophically at least, just like a human being who draws inspiration from existing works,” Manjoo wrote.

In his testimony before a U.S. Senate subcommittee hearing this past July, Emory Law Professor Matthew Sag used the metaphor of a student learning to explain why he believes training on copyrighted material is usually fair use.

In fact, Microsoft, which is a major investor in OpenAI and uses GPT-4 for its Bing Chat tools, released a paper in March claiming that GPT-4 has “sparks of Artificial General Intelligence” – the endpoint where the machine is able to learn any human task thanks to it having “emergent” abilities that weren’t in the original model.


Saved 93% of original text.

[–] Gamey@feddit.de 3 points 1 year ago

That's a philosophical debate we can't really answer, not a lie; the question is whether we do anything other than copy. The biggest elephant in the room, without any doubt, is the fact that AIs don't yet remember and iterate like we do, but that's probably just a matter of time. Other than that, the very different environment we learn in is another huge issue if you try to make any comparison. It's a tricky question that we might never know the answer to, but it's also fascinating to think about, and I don't think rejecting the idea altogether is an especially good answer.

[–] DarkenLM@artemis.camp 2 points 1 year ago (1 children)

Machines don't learn like humans yet.

Our brains are a giant electrical/chemical system that somehow creates consciousness. We might be able to create that in a computer. And the day it happens, then what will be the difference between a human and a true AI?

[–] CanadaPlus@lemmy.sdf.org 1 points 1 year ago (1 children)

If you read the article, there are "experts" saying that human comprehension is fundamentally computationally intractable, which is basically a religious standpoint. Like, ChatGPT isn't intelligent yet, partly because it doesn't really have long-term memory, but yeah, there's overwhelming evidence the brain is a machine like any other.

[–] barsoap@lemm.ee 1 points 1 year ago (1 children)

fundamentally computationally intractable

...using current AI architectures, and the insight isn't new, it's maths. This is currently the best idea we have about the subject. Trigger warning: Cybernetics, and lots of it.

Meanwhile, yes, of course brains are machines like any other; claiming otherwise is claiming you can compute incomputable functions, which is a physical and logical impossibility. And it's fucking annoying to talk about this topic with people who don't understand computability. It usually turns into a shouting match of "you're claiming the existence of something like a soul, some metaphysical origin of the human mind" vs. "no I'm not" vs. "yes you are, but you don't understand why".

[–] CanadaPlus@lemmy.sdf.org 1 points 1 year ago (1 children)

…using current AI architecture, and the insight isn’t new it’s maths.

That is not what van Rooij et al., who were cited here, said. They published their essay here, which I haven't really read, but which appears to make an argument about any possible computer. They're psychologists and I don't see any LaTeX in there, so they must be missing something.

Unfortunately I can't open your link, although it sounds interesting. A feedforward network can approximate any computable function if it gets to be arbitrarily large, but depending on how you want to feed an agent inputs from its environment and read its actions, a single function might not be enough.
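For what it's worth, that approximation claim is easy to play with: a toy single-hidden-layer network fit to a 1D function by plain gradient descent (pure numpy, nothing remotely like a real LLM; widening the hidden layer should tighten the fit):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function and training points on a bounded interval.
X = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)
y = np.sin(X)

H = 32  # hidden width: the "arbitrarily large" knob
W1 = rng.normal(0.0, 1.0, (1, H))
b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.1, (H, 1))
b2 = np.zeros(1)
lr = 0.1

for step in range(20000):
    h = np.tanh(X @ W1 + b1)     # hidden layer
    pred = h @ W2 + b2           # network output
    err = pred - y               # gradient of 0.5 * mean squared error w.r.t. pred
    gW2 = h.T @ err / len(X)
    gb2 = err.mean(axis=0)
    gh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = X.T @ gh / len(X)
    gb1 = gh.mean(axis=0)
    W1 -= lr * gW1
    b1 -= lr * gb1
    W2 -= lr * gW2
    b2 -= lr * gb2

print("max |error| over the training range:", float(np.abs(pred - y).max()))
```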

[–] barsoap@lemm.ee 2 points 1 year ago (1 children)

They’re psychologists and I don’t see any LaTeX in there,

Oh no that's LaTeX alright. I can tell by everything from the font to the line breaking, some of it is hard to imitate with an office suite, the rest impossible. But I'll totally roll with dunking on psychologists :)

In this paper, we undercut these views and claims by presenting a mathematical proof of inherent intractability (formally, NP-hardness) of the task that these AI engineers set themselves

Yeah, I don't buy it. If human cognition were inherently NP-hard we'd have brains the size of suns. OTOH it might be "close to NP" in the same sense as the travelling salesman problem is NP: it's quite feasible to get answers guaranteed to be no more than X% (with user's choice of X) worse than the actual shortest path, which is good enough in practice. We do, after all, have to operate largely in real time; there's no time to be perfect when a sabre-toothed tiger is trying to eat you.

Or think about SAT solvers: They can solve large classes of problems ridiculously fast even though the problem is, in its full generality, NP. And the class they're fast on is so large that people very much do treat solving SAT as tractable: Because it usually is. Maybe that is why we get headaches from hard problems.
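In that spirit, a toy comparison: a greedy nearest-neighbour tour (fast, no formal guarantee, just "good enough" in the sabre-tooth sense) versus the exact brute-force optimum, which is only computable here because the instance is tiny:

```python
import itertools
import math
import random

random.seed(1)
cities = [(random.random(), random.random()) for _ in range(9)]

def tour_length(order):
    """Total length of a closed tour visiting cities in the given order."""
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def nearest_neighbour(start=0):
    """Greedy heuristic: always walk to the closest unvisited city."""
    unvisited = set(range(len(cities))) - {start}
    order = [start]
    while unvisited:
        here = cities[order[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(here, cities[j]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Exact optimum by brute force: only feasible because the instance is tiny.
best = min(itertools.permutations(range(1, len(cities))),
           key=lambda perm: tour_length((0,) + perm))

print("greedy tour length: ", tour_length(nearest_neighbour()))
print("optimal tour length:", tour_length((0,) + best))
```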

Unfortunately I can’t open your link, although it sounds interesting.

Then let me throw citations at you. The first is for the underlying theory characterising the necessary cybernetic characteristics of human minds; the second one applies it to current approaches to AI. This comes out of German publicly-funded basic research (Max Planck / FIAS):

Nikolić, Danko. "Practopoiesis: Or how life fosters a mind." Journal of Theoretical Biology 373 (2015): 40-61.
Nikolić, Danko. "Why deep neural nets cannot ever match biological intelligence and what to do about it?." International Journal of Automation and Computing 14.5 (2017): 532-541.

[–] CanadaPlus@lemmy.sdf.org 2 points 1 year ago* (last edited 1 year ago)

Arxiv link for the first one: https://arxiv.org/abs/1402.5332

Also, TIL people use LaTeX for normal documents with no formulas.