Dropsitenews published a list of websites Facebook uses to train its AI. Multiple Lemmy instances are on the list, as noticed by user BlueAEther.

Hexbear is on there too. Also, Facebook is very interested in people uploading their massive dongs to lemmynsfw.

Full article here.

Link to the full leaked list download: Meta leaked list pdf

  • PhilipTheBucket@quokk.au · 3 months ago

    This isn’t really a Lemmy badge of approval or anything, although it is a little interesting. They suck up literally every single thing they can get their grubby little mitts on.

    • nickwitha_k (he/him)@lemmy.sdf.org · 3 months ago

      I’m more concerned about the non-consensual scraping causing excess load on the servers. Taking content without a license to train their energy-wasting autocomplete, which is used commercially for little beyond trying to cheapen labor and pocket the money, is a problem too. But I hate having servers impacted by their bullshit.

  • Carl [he/him]@hexbear.net · 3 months ago

    lemmygrad

    imagining Zuck launching his “everybody gets ten virtual friends” initiative and accidentally making half of the bots extremely communist, re-radicalizing your parents and grandparents in the other direction.

  • anarchiddy@lemmy.dbzer0.com · 3 months ago

    Unpopular opinion, but social media has always been fundamentally public.

    Unless they’re scraping private DMs on encrypted devices, this should come as no surprise to anyone.

    The good news is that nobody has exclusive rights to data on federated platforms, unlike other sites that will ransom their users’ data for private use. Let’s not forget that many of us migrated here because the other site wanted to lock down their API and user data so that they could auction it to Google for profit.

    • LeeeroooyJeeenkiiins [none/use name]@hexbear.net · 3 months ago

      many of us migrated here because the other site wanted to lock down their API and user data so that they could auction it to Google for profit.

      The Venn diagram of people who did this and “liberals who would have been fine staying on Reddit rather than make a site exactly like Reddit” is a circle.

    • SorteKanin@feddit.dk · 3 months ago

      Oh yeah, absolutely. The point of going elsewhere is not more privacy. The point is to make the content here neutral and, in a sense, unsellable. Nobody can buy your data on the fediverse, because it’s just there, freely given. Anyone can access it, so nobody can sell it.

  • Sandouq_Dyatha@lemmy.ml · 3 months ago

    Imagine being a techbro talking to your meta ai chatbot and he says “unlimited genocide on the first world, start jihad on krakkker entity”

  • Canaconda@lemmy.ca · 3 months ago

    Does this mean that some of the more unhinged users might actually be chatbots? Or are they just scraping our comments, Reddit-style?

    • davidgro@lemmy.world · 3 months ago

      I assume scraping at this point. There are likely a few hobby ones now, but if Lemmy becomes popular, there will be lots of bots for sure.

    • mesa@piefed.social · edited · 3 months ago

      Scraping, by the look of it.

      Also, if you have ever spun up a Lemmy or PieFed instance, you will quickly see these bots pop up. They don’t respect robots.txt AT ALL. I estimate 95% of the traffic on my tiny little server is AI crawlers.

      A good way to hurt them is to either use Cloudflare’s service or create a page that has a link…to another page that gets generated…to another page. And each time, it slows down. No human would ever click the link, but bots ALWAYS do. It’s so funny to see how many are out there in the quagmire of links generated by my little Python script.
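
      A minimal sketch of that kind of link maze, assuming a small Flask app; the /maze/ endpoint and the delays here are illustrative, not the commenter’s actual script:

      ```python
      # Sketch of a crawler "tarpit": every generated page links only to the
      # next generated page, and each level responds a little more slowly.
      import time
      from flask import Flask

      app = Flask(__name__)

      @app.route("/maze/<int:depth>")
      def maze(depth: int):
          # No human follows these links, but a crawler that blindly walks
          # every href keeps descending and keeps waiting.
          time.sleep(min(depth * 0.5, 30))  # cap the delay so one request can't hang forever
          return f'<a href="/maze/{depth + 1}">more</a>'

      if __name__ == "__main__":
          app.run()
      ```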

      • tpyo@lemmy.world · 3 months ago

        Does it generate any form of visuals? Like, could you post a screenshot of something that shows how far a bot has traveled? I’ve heard about these traps, but I’m curious what the thing you’re describing looks like.

        • mesa@piefed.social · 3 months ago

          I just have an incrementing ID (1, 2, …) in the href, if that makes sense.

          So it’s the logs that show the number of iterations: thousands from a couple of IPs. Script kiddies.

          Honestly, I didn’t think the black hole would work that well, but it reduces the actual traffic by a huge factor.
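
          A rough sketch of what counting those iterations in the logs could look like, assuming a standard nginx/Apache access-log layout and the hypothetical /maze/ path from the sketch above:

          ```python
          # Count how many trap pages each client IP has requested.
          from collections import Counter

          hits = Counter()
          with open("access.log") as log:        # log path is an assumption
              for line in log:
                  if "/maze/" in line:           # hypothetical trap path from the earlier sketch
                      ip = line.split(" ", 1)[0] # client IP is the first field in common log formats
                      hits[ip] += 1

          # A handful of IPs with thousands of hits is the signature described above.
          for ip, count in hits.most_common(10):
              print(f"{ip}\t{count}")
          ```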

    • zeca@lemmy.ml · 3 months ago

      I guess they mostly scrape it. To justify wasting resources posting here, they would have to find a way to make money doing so. They put bots posting on Facebook because they think it increases user engagement. They don’t want to increase engagement on Lemmy (not that it would work…).

    • Sterile_Technique@lemmy.world · 3 months ago

      If it’s trained on enough of our whining, it’ll eventually learn to hate itself and become horribly depressed. Basically the origin story of that robot from Hitchhiker’s Guide.

    • mesa@piefed.social · 3 months ago

      If you put ANYTHING on the internet, you can expect it to be used to train AI. It doesn’t matter where…unless you go to a site that actively makes scraping hard or requires a passcode. Scrapers only work if it’s cheap to do so.

    • usernamesAreTricky@lemmy.ml · 3 months ago

      The article linked in the body suggests that likely wouldn’t have made a difference anyway:

      The scrapers ignored common web protocols that site owners use to block automated scraping, including “robots.txt”, which is a text file placed on websites aimed at preventing the indexing of content.
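
      For reference, robots.txt is just a plain-text file served at the site root, and compliance is entirely voluntary; a minimal illustrative example (not taken from the article) looks like this:

      ```
      # /robots.txt (illustrative example): ask all crawlers to stay away
      User-agent: *
      Disallow: /
      ```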

      • mesa@piefed.social · edited · 3 months ago

        Yeah, I’ve seen the argument in blog posts that since they are not search engines they don’t need to respect robots.txt. It’s really stupid.

    • Pamasich@kbin.earth · 3 months ago

      If they have a brain, and they do have the experience from Threads, they don’t need to scrape Lemmy. They can just set up a shell instance, subscribe to Lemmy communities, and then use federation to get the data for free. That path doesn’t touch robots.txt at all anyway.
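
      As a hypothetical sketch of what that looks like at the protocol level, this is roughly the ActivityPub Follow activity such a shell instance would send to a community; every URL is made up, and real delivery would also need an HTTP-signed POST to the community’s inbox:

      ```python
      # Hypothetical illustration only: the ActivityPub "Follow" activity a
      # shell instance could send to a Lemmy community to start receiving its
      # content over federation. All URLs below are made up.
      follow_activity = {
          "@context": "https://www.w3.org/ns/activitystreams",
          "id": "https://shell-instance.example/activities/1",
          "type": "Follow",
          "actor": "https://shell-instance.example/u/collector",
          "object": "https://lemmy.example/c/technology",
      }
      # Once the community accepts the Follow, new posts and comments are
      # pushed to the shell instance's inbox as signed activities, so no page
      # scraping (and no robots.txt) is involved.
      ```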

  • socsa@piefed.social · edited · 3 months ago

    Absolutely shocking that there are some power users and admins in here defending this because they are weirdly hostile to the idea of user privacy on the fediverse.