270 terabytes of books stolen for AI training

Now the cat is apparently out of the bag!

As reported by several media outlets (e.g. https://lnkd.in/e-bvsSX8), Meta has now confirmed that it used the pirated library LibGen to train its AI.

The justification given is that books are of course the best source for AI training, since they are often better in terms of language, content and subject matter than any short snippets from social media (which makes sense). They are “well-written representations of human language”.

That is why a staggering 270 terabytes of books (around 7.5 million books and 80 million scientific papers) were stolen; there is no other way to put it. From a copyright standpoint, this is of course an absolute no-go.

Now you can argue that Meta did not steal the data itself, but “merely” used an illegally compiled collection for training. And you can argue that training AI does not constitute copyright infringement. The courts will decide on all of this.

But once again you can also see: whatever can be done will be done, without regard for rights, laws or authors. And you can be sure that Meta is not the only one working this way. They just happened to be the ones who got caught and exposed.

Brave new world!

P.S.: Currently, the users of LLMs are responsible for their results, i.e. if you use Meta’s Llama model and the text it generates reproduces content from the illegally used training data, you are responsible for it, not Meta!

#informatikersindcool #kiistdaundbleibt
