270 terabytes of books stolen for AI training

ð—Ąð˜‚ð—ŧ ð—ķ𝘀𝘁 𝗞ð—ģð—ģð—ēð—ŧð—Ŋð—Ūð—ŋ ð—ąð—ķð—ē 𝗞ð—Ū𝘁𝘇ð—ē ð—Ū𝘂𝘀 ð—ąð—ē𝗚 ð—Ķð—Ū𝗰ð—ļ!

As reported by several media outlets (e.g. https://lnkd.in/e-bvsSX8), Meta has now confirmed that it used the illegal pirate library LibGen to train its AI.

The justification given is that books are of course the best source for AI training, since their language, content and subject matter are often better than any short snippets from social media (logical). They are “well-written representations of human language”.

That is why a whopping 270 terabytes of books (approx. 7.5 million books and 80 million scientific studies) were stolen; there is no other way to put it. From a copyright perspective, that is of course an absolute no-go.

Now, you can argue that Meta did not steal the data itself but “merely” used an illegally compiled collection for training. And you can argue that training AI does not constitute copyright infringement. The courts will decide all of this.

But you can also see once again: whatever can be done will be done, with no regard for rights, laws or authors. And you can be sure that Meta is not the only one working this way. They just happened to be the ones who got caught and exposed.

ð—Ķ𝗰ð—ĩð—žĖˆð—ŧð—ē ð—ŧð—ē𝘂ð—ē 𝗊ð—ēð—đ𝘁!

P.S.: At present, users of LLMs are responsible for the output they produce. If you use Meta’s Llama model and the generated text reproduces content from the illegally used training data, you are responsible for it, not Meta!

#informatikersindcool #kiistdaundbleibt