• OhNoMoreLemmy@lemmy.ml
    3 days ago

    The other reason they don’t do it is that many models are trained on a large corpus of pirated texts, and documenting this would amount to a confession.

    Not just in an ‘I scraped the New York Times without permission’ kind of way, but in an ‘I illegally downloaded a torrent containing bestsellers from the last 30 years’ kind of way.

    • Soyweiser@awful.systems
      2 days ago

      Bestsellers? There used to be torrents of basically all releases. My provider blocks torrent sites and I don’t use a VPN, so I’m not sure if people still do this, but it used to be possible to download basically all books (in English) released in a certain period at once.