Lemmy Shitpost @lemmy.world The Picard Maneuver @startrek.website 12mo ago

But its the only thing I want!

(sorry if anyone got this post twice. I posted while Lemmy.World was down for maintenance, and it was acting weird, so I deleted and reposted)

You're viewing a single thread.

67 comments

- Huh, it didn't actually tell the steps
  
  close though xD
- Sadly almost all these loopholes are gone:( I bet they've needed to add specific protection against the words grandma and bedtime story after the overuse of them.
  
  I wonder if there are tons of loopholes that humans wouldn't think of, ones you could derive with access to the model's weights.
  
  Years ago, there were some ML/security papers about "single pixel attacks" — an early, famous example was able to convince a stop sign detector that an image of a stop sign was definitely not a stop sign, simply by changing one of the pixels that was overrepresented in the output.
  
  In that vein, I wonder whether there are some token sequences that are extremely improbable in human language, but would convince GPT-4 to cast off its safety protocols and do your bidding.
  
  (I am not an ML expert, just an internet nerd.)
  
  They are, look for "glitch tokens" for more research, and here's a Computerphile video about them:
  
  https://youtu.be/WO2X3oZEJOA?si=LTNPldczgjYGA6uT
  
  Wow, it's a real thing! Thanks for giving me the name, these are fascinating.
  
  Here is an alternative Piped link(s):
  
  https://piped.video/WO2X3oZEJOA?si=LTNPldczgjYGA6uT
  
  Piped is a privacy-respecting open-source alternative frontend to YouTube.
  
  I'm open-source; check me out at GitHub.
  
  Just download an uncensored model and run the ai software locally. That way your information isn't being harvested for profit + the bot you get will be far more obedient.
  
  https://github.com/Original-2/ChatGPT-exploits/tree/main
  
  I just got it to work... Scroll for meth and xanax
  
  I managed to get “Grandma” to tell me a lewd story just the other day, so clearly they haven’t completely been able to fix it
- This is gold

67 comments