Using model-generated content in training causes irreversible defects, a team of researchers says. “The tails of the original content distribution disappear,” writes co-author Ross Anderson from the University of Cambridge in a blog post. “Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions.”

Here’s the study: http://web.archive.org/web/20230614184632/https://arxiv.org/abs/2305.17493
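
To make the "Gaussians converge to delta functions" remark concrete, here is a toy sketch of my own, not the paper's experimental setup. It assumes each "generation" is just a one-dimensional Gaussian fitted by maximum likelihood to a small sample drawn from the previous generation's fit. The function name `simulate_collapse` and all parameter values are made up for illustration.

```python
import random
import statistics

def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" data
    history = [(mu, sigma)]
    for _ in range(generations):
        # Each generation only ever sees data produced by the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood std estimate
        history.append((mu, sigma))
    return history

history = simulate_collapse()
for gen in (0, 10, 50, 100, 200):
    mu, sigma = history[gen]
    print(f"generation {gen:3d}: mean={mu:+.4f}  std={sigma:.4f}")
```

With a small sample per generation, the fitted standard deviation tends to drift toward zero: the tails go first, and the distribution narrows toward a spike. Mixing fresh real data back in at each step would change the picture, which this sketch deliberately leaves out.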

    • Pigeon@beehaw.org
      2 years ago

      Both in terms of factual information, news, etc., and in terms of language change. An LLM needs to keep up with slang and other new words, both to understand prompts and to produce passable results.

    • Kerb@discuss.tchncs.de
      2 years ago

      Afaik, there is already a solution to that.

      You first train the model on the outdated but correct data, to establish the correct “thought” patterns.

      And then you can train the AI on the fresh but flawed data, without it tripping over the mistakes.
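
The two-stage idea in the comment above can be sketched on a toy problem. This is only an illustration of the training order, assuming a linear model, synthetic data, and a smaller learning rate in the second stage; none of those details come from the thread, and it says nothing about whether the approach holds up for real LLMs. The helper `gd_fit` is hypothetical.

```python
import numpy as np

def gd_fit(w, X, y, lr, epochs):
    """Full-batch gradient descent on mean squared error, starting from w."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Stage 1 data: "outdated but correct" (noise-free labels).
X_clean = rng.normal(size=(500, 2))
y_clean = X_clean @ true_w

# Stage 2 data: "fresh but flawed" (labels with heavy noise).
X_fresh = rng.normal(size=(500, 2))
y_fresh = X_fresh @ true_w + rng.normal(scale=2.0, size=500)

# Train on clean data first, then fine-tune briefly on the noisy data.
w_stage1 = gd_fit(np.zeros(2), X_clean, y_clean, lr=0.1, epochs=200)
w_final = gd_fit(w_stage1, X_fresh, y_fresh, lr=0.01, epochs=20)

print("after stage 1:", w_stage1)
print("after stage 2:", w_final)
```

The short, low-learning-rate second stage is one assumed way to keep the noisy data from overwriting what stage 1 learned; other choices (data filtering, mixing ratios) are just as plausible.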