Can you judge whether the model is being truthful or untruthful by looking at something like `|states ⋅ honesty_control_vector|`? Or dynamically chart mood through a conversation? Can you keep a model chill by actively correcting the anger vector coefficient once it exceeds a given threshold?
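
A minimal sketch of both ideas, assuming repeng-style control vectors where each trait is a direction in hidden-state space. The file names, `LAYER`, and the trait directions are all hypothetical stand-ins, not any library's actual API:

```python
# Sketch only: assumes trait directions already trained (e.g. via PCA over
# contrastive prompt pairs). "honesty_vector.pt", "anger_vector.pt", and LAYER
# are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.1"
LAYER = 15  # hypothetical layer the vectors were extracted from

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

honesty_dir = torch.load("honesty_vector.pt")
honesty_dir = honesty_dir / honesty_dir.norm()  # unit direction

def honesty_coefficient(prompt: str) -> float:
    """Readout: project the last token's hidden state onto the honesty direction."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the embedding output, so layer L sits at index L
    return float(out.hidden_states[LAYER][0, -1] @ honesty_dir)

def clamp_trait_hook(direction: torch.Tensor, threshold: float):
    """Correction: subtract whatever part of the trait coefficient exceeds the threshold."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coeff = hidden @ direction                      # (batch, seq)
        excess = (coeff - threshold).clamp(min=0.0)
        hidden = hidden - excess.unsqueeze(-1) * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

anger_dir = torch.load("anger_vector.pt")
anger_dir = anger_dir / anger_dir.norm()
model.model.layers[LAYER].register_forward_hook(clamp_trait_hook(anger_dir, threshold=2.0))
```

Charting mood through a conversation would then just be calling the readout after each turn and plotting the values; treat the magnitudes as relative rather than calibrated.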
Can you chart truthfulness layer by layer to see whether the model is being glibly vs. cleverly dishonest? With glibly = “decides to be dishonest early” and cleverly = “decides to be dishonest late”.
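
Since control vectors are typically trained per layer, a per-layer profile is a natural extension of the sketch above. `honesty_dirs` (one hypothetical unit direction per layer) is an assumption:

```python
# Continues the sketch above. honesty_dirs is a hypothetical
# (num_layers, hidden_size) tensor of unit directions, one per layer.
honesty_dirs = torch.load("honesty_dirs.pt")

def per_layer_honesty(prompt: str) -> list[float]:
    """One honesty projection per layer; a drop early vs. late in the stack
    would roughly map to the 'glib' vs. 'clever' distinction."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[1:] skips the embedding output
    return [
        float(out.hidden_states[l + 1][0, -1] @ d)
        for l, d in enumerate(honesty_dirs)
    ]
```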
There’s been previous work developing a method to do this by reading an LLM’s internal state. That paper trains multiple classifiers per LLM, each one reading the hidden state of a different layer; it found that accuracy varied across layers and across the LLMs tested, but didn’t investigate why.
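
The per-layer setup in that line of work looks roughly like this. A minimal sketch assuming you’ve already collected hidden states for labelled true/false statements; names and shapes are illustrative, not the paper’s code:

```python
# One linear probe per layer, scored on held-out statements.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hidden[l] : (n_statements, hidden_size) activations at layer l
# labels    : (n_statements,) 1 = true statement, 0 = false statement
def probe_accuracy_per_layer(hidden: list[np.ndarray], labels: np.ndarray) -> list[float]:
    scores = []
    for X in hidden:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.2, random_state=0
        )
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return scores
```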