I just asked Gemini Pro to create a picture of an ancient Roman general, and it refused because it's "controversial."
There must be some deep, dark corner of the internet where AI isn't so censored. Where is it?
No, I mean that open-source models can't really be censored in any way that matters. You have direct access to the transcript. You can edit their messages, preempting refusals. You can even mask out the logits for tokens you don't want to see. The reason LLMs always begin their messages with "Sure, happy to do that!" is because messages that start with that are much more likely to result in outputs that fulfill the user's request, resulting in that verbal tic becoming dominant during fine-tuning.
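For anyone unsure what "masking out the logits" means mechanically, here is a minimal toy sketch in plain Python. The four-word vocabulary, the scores, and the function name `masked_softmax` are made up for illustration and are not taken from any real model or library. The idea is only that a token whose logit is set to negative infinity ends up with probability zero after the softmax, so the sampler can never pick it.

```python
import math

def masked_softmax(logits, banned):
    """Softmax over `logits`, with every index in `banned` forced to probability 0.

    Assumes at least one token is left unbanned.
    """
    # Replace banned logits with -inf so exp() maps them to exactly 0.0.
    masked = [float("-inf") if i in banned else x for i, x in enumerate(logits)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and next-token scores.
vocab = ["cat", "dog", "bird", "fish"]
logits = [2.0, 3.0, 0.5, 1.0]

# Without a mask "dog" has the highest score; here index 1 is banned.
probs = masked_softmax(logits, banned={1})
best = vocab[max(range(len(vocab)), key=probs.__getitem__)]
print(best, probs)
```

With local weights you control this step yourself, because the logits are handed to your own sampling code before any token is chosen.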
You need the training data to achieve true non-censorship. They mask out tons of neurons before letting those models out the door, and the only way to get them back is to retrain.
In practice, you can't stop a released LLM from being jailbroken with the right prompts, but I'm interested in what you're referencing here. What method are they using to "mask out neurons"?
To my knowledge, nobody has quite that good an understanding of the internal connections of these models.
It's been a long time since I've researched LLMs, but I once read somewhere that they were capable of identifying the nodes in the neural net that were involved in generating an answer. If they remove those nodes from the model or zero the weights, then the NN loses whatever information was used. They call it "concept erasure" IIRC.
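As a rough picture of the "zero the weights" idea, here is a toy sketch in plain Python. The two-unit network, its weights, and the names `forward` and `ablated` are invented for illustration. It is not the method from any particular paper, and nothing here shows that production models are processed this way. It only demonstrates that zeroing a hidden unit's outgoing weight removes that unit's contribution to the output.

```python
def forward(x, w_in, w_out):
    """Tiny one-hidden-layer network with ReLU units and a scalar output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return sum(wo * h for wo, h in zip(w_out, hidden))

# Hypothetical weights: hidden unit 0 reads input 0, hidden unit 1 reads input 1.
w_in = [[1.0, 0.0],
        [0.0, 1.0]]
w_out = [2.0, 3.0]
x = [1.0, 1.0]

before = forward(x, w_in, w_out)

# "Ablate" hidden unit 1 by zeroing its outgoing weight.
ablated = list(w_out)
ablated[1] = 0.0
after = forward(x, w_in, ablated)

print(before, after)  # the difference is exactly unit 1's contribution
```

In a real LLM, information is spread across many units rather than sitting in one, which is part of why the question of how well this works in practice matters.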
In any case, I've never successfully created my own jailbreak prompt that actually worked. But I only had an hour with that $15,000 computer that could actually run an LLM. I'm unlikely to ever see that much computing power again.
That's a lot more hazy than what I'd thought you meant. Papers exist for all kinds of things, but that doesn't mean they all really work, or that they're all being used by a major company in production. I think this is what you saw, and it's for fine-tuning low rank adaptations. Having read plenty of papers like this, I'd bet it's a lot less effective in practice. Anthropic, the "AI Alignment" company, has more recent research that's a lot less ambitious, and even that's basically just theoretical.
In any case, "just a jailbreak prompt" is not what I'm referring to. When a model is served online, the provider prevents you from editing the things that it says. With an open-source model, you can directly edit its side of the transcript, providing affirmative responses from "its own mouth" as the lead-in to a task completion. LLM refusal training can't handle that, which is why the online interfaces are so restrictive.
Moreover, I don't think you need a $15k computer to run an LLM, especially in the era of quantization. You can run the lightest DeepSeek model on your laptop, or on a cheap Colab server ($10 a month). I've done this myself a while back (though not with DeepSeek), and I'd highly encourage you to try it out.
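To put rough numbers on the quantization point, here is a back-of-the-envelope sketch. The 7-billion-parameter size is an assumed example rather than a specific model, and the helper `weight_memory_gib` is a made-up name. It counts only the weights, so activations, the KV cache, and runtime overhead would come on top.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 7e9  # assumed example size, not a specific model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions the weights take roughly 13 GiB at 16 bits and a little over 3 GiB at 4 bits, which is the difference between needing a large GPU and fitting in ordinary laptop memory.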