A more fundamental problem is that in a neural network, everything is connected to everything else: the same weights are shared across many different behaviors. If you artificially prune weights, censor training data, or alter behavior with RLHF, then you're not just changing the target behavior, you're subtly degrading capability all across the system.
This is why you'll see better performance from uncensored generative image models: they have more general knowledge about what humans look like. Similarly, after recent rounds of censorship updates to ChatGPT, the model became more argumentative across the board, because the updates artificially boosted the concept of refusal in all scenarios, not just the targeted ones.
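To make the "everything is connected" point concrete, here is a toy sketch, not a claim about how any production model works. It assumes a tiny hand-written network with made-up weights (`W_HIDDEN`, `W_OUT`, and the helper names are all hypothetical) where two unrelated output heads read from the same hidden layer. Zeroing one hidden unit to suppress one behavior moves both outputs.

```python
import math

# Hypothetical toy network: 2 inputs -> 3 shared hidden units -> 2 output heads.
# All weights are made up for illustration only.
W_HIDDEN = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.4]]
W_OUT = [[1.0, 0.5, -0.3],   # head 0: the behavior we want to change
         [0.2, -0.7, 0.9]]   # head 1: an unrelated behavior


def forward(x, w_hidden, w_out):
    """Run the toy network: tanh hidden layer, then linear output heads."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def ablate_hidden_unit(w_hidden, index):
    """Return a copy of the hidden weights with one unit's weights zeroed."""
    return [[0.0] * len(row) if i == index else list(row)
            for i, row in enumerate(w_hidden)]


x = [1.0, 2.0]
before = forward(x, W_HIDDEN, W_OUT)
after = forward(x, ablate_hidden_unit(W_HIDDEN, 1), W_OUT)

# Absolute change in each head's output caused by removing one shared unit.
shifts = [abs(a - b) for a, b in zip(after, before)]
print("output shift per head:", shifts)
```

Both heads shift because both read from the ablated unit. In a real model the coupling is far messier, but this is the basic mechanism behind the side effects described above.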