• wewbull@feddit.uk
    10 months ago

    I don’t know much about LLMs, but latent diffusion models already have “meaning” encoded into the model. The whole concept of the U-Net is that as it reduces the spatial resolution of the image, it increases the semantic resolution by adding extra dimensions (feature channels) of information. It came from medical image analysis, where being able to label a region as a tumor would be really useful.

    This is why you get body dysmorphic results on earlier (and even current) models. The model has identified something as a human limb, but isn’t quite sure where the hand is, so it adds one on to what we know is a leg.
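
    The resolution-for-channels trade-off described above can be sketched in a few lines. This is only an illustration: the function name `unet_encoder_shapes` is hypothetical, and it assumes every downsampling step halves the height and width while doubling the channel count, as in the original U-Net. The U-Nets in real latent diffusion models use other channel multipliers and add attention layers.

```python
def unet_encoder_shapes(height, width, base_channels=64, depth=4):
    """Return (channels, height, width) for each encoder level.

    Assumption: each downsampling step halves height and width and
    doubles the channel count (the original U-Net convention).
    """
    shapes = []
    channels = base_channels
    for _ in range(depth):
        shapes.append((channels, height, width))
        # Downsample: less spatial detail, more feature channels.
        channels *= 2
        height //= 2
        width //= 2
    return shapes


# Print the feature-map shape at each level for a 64x64 input.
for c, h, w in unet_encoder_shapes(64, 64):
    print(f"{c:4d} channels at {h}x{w}")
```

    With these assumed defaults, a 64x64 input ends up as an 8x8 grid with 512 channels at the deepest level: far fewer positions, but much more information stored about each one.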

    • Lvxferre@mander.xyz
      10 months ago

      That’s perhaps why image generators are comparatively better than text generators. But there’s still something off: by your example, it seems that the model cannot reliably use clues like position to understand “this is a «leg»”. And I don’t know much about image generators, but I think that they’re still statistics- and probability-based.

    • FaceDeer@kbin.social
      10 months ago

      There was an interesting paper published just recently titled “Generative Models: What do they know? Do they know things? Let’s find out!” (a lot of fun names and titles in the AI field these days :) ). It does a lot of work in actually analyzing what an AI image generator “knows” about what it’s depicting. These models seem to have an awareness of three-dimensional space, of light and shadow and reflectivity, lots of things you wouldn’t necessarily expect from something trained just on 2-D images tagged with a few short descriptive sentences. This article from a few months ago also delved into this; it showed that when you ask a generative AI to create a picture of a physical object, the first thing the AI does is come up with the three-dimensional shape of the scene before it starts figuring out what it looks like. Quite interesting stuff.